Model comparison

Gemma 4 31B vs GPT-5.5

Data verified July 22, 2026

Head-to-head evidence from 26 shared benchmark results across 6 categories. Overall scores shown here use the public BenchAlign v5 ranking lane.

Gemma 4 31B

Google

61.08/100

Margin: 12.4 pts in favor of GPT-5.5

GPT-5.5

OpenAI

73.51/100

Category wins: Gemma 4 31B 1 · GPT-5.5 2

Public leaderboard positions: Gemma 4 31B #43 (Supported); GPT-5.5 #9 (Estimated). Intervals and evidence labels describe ranking uncertainty, not a guarantee for a specific workload.

Evidence parity. Gemma 4 31B and GPT-5.5 share 26 comparable benchmark results. 3 of 8 categories are comparable. 3 results are unique to Gemma 4 31B; 31 to GPT-5.5.

Updated July 22, 2026

Shared results: 26
Gemma 4 31B only: 3
GPT-5.5 only: 31
Comparable categories: 3 / 8
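
To make the imbalance in coverage concrete, the counts above can be turned into a share of each model's results that are also available for the other model. This is a minimal sketch: the `coverage` helper is hypothetical and is not part of BenchLM's methodology; only the three counts come from the page.

```python
# Counts taken from the evidence-parity summary on this page.
shared = 26        # results available for both models
only_gemma = 3     # results unique to Gemma 4 31B
only_gpt = 31      # results unique to GPT-5.5


def coverage(shared_count: int, unique_count: int) -> float:
    """Fraction of one model's results that also exist for the other model."""
    return shared_count / (shared_count + unique_count)


gemma_cov = coverage(shared, only_gemma)  # 26 of 29 results
gpt_cov = coverage(shared, only_gpt)      # 26 of 57 results

print(f"Gemma 4 31B: {gemma_cov:.1%} of its results are shared")
print(f"GPT-5.5: {gpt_cov:.1%} of its results are shared")
```

Under this reading, most of what is known about Gemma 4 31B is comparable, while more than half of the GPT-5.5 evidence has no counterpart.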

Pick GPT-5.5 if you want the stronger benchmark profile. Gemma 4 31B becomes the better choice only if multimodal & grounded is the priority or you want the lower token bill.

Confidence note. This is a partial-evidence comparison with 26 shared benchmark results across 6 evidence categories; 3 of 8 categories currently have scoreable aggregates for both models. Treat the verdict as directional until coverage is more balanced.

Why this result

GPT-5.5 is clearly ahead on the BenchAlign aggregate, 73.51 to 61.08. The gap is large enough that you do not need to squint at the spreadsheet to see the difference.

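GPT-5.5's sharpest category advantage is in coding, where it averages 58.6 against 41.6. Among the decisive benchmark drivers, the biggest swing is HLE, 26.5% to 52.2%. Gemma 4 31B does hit back on the multimodal & grounded category average, 76.9 to 70.4, although GPT-5.5 leads on both shared multimodal benchmarks (MMMU-Pro and AA-MMMU-Pro). The answer can change if that is the part of the workload you care about most.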

GPT-5.5 is also the more expensive model on tokens at $5.00 input / $30.00 output per 1M tokens, versus a listed $0.00 input / $0.00 output per 1M tokens for Gemma 4 31B. Because the Gemma 4 31B listed price is $0, a cost multiple is undefined rather than a meaningful ratio. GPT-5.5 gives you the larger context window at 1M tokens, compared with 256K for Gemma 4 31B.
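
Dividing by a $0 listed price does not produce a meaningful multiple. A minimal sketch of a guarded ratio follows; `price_ratio` is a hypothetical helper written for illustration, not code from BenchLM, and the prices are the listed per-1M-token output prices quoted above.

```python
from typing import Optional


def price_ratio(price: float, baseline: float) -> Optional[float]:
    """Return price / baseline, or None when the baseline is zero."""
    if baseline == 0:
        return None  # a multiple of a $0 price is undefined
    return price / baseline


# Listed output prices per 1M tokens: GPT-5.5 vs Gemma 4 31B.
ratio = price_ratio(30.00, 0.00)
label = "undefined (baseline price is $0)" if ratio is None else f"{ratio:.1f}x"
print(f"Output cost multiple: {label}")
```

Returning `None` forces the caller to choose the wording for the undefined case instead of printing a raw infinity.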

Category breakdown

Exact category averages are shown below. Not measured means BenchLM does not have enough sourced public coverage for that model and category.

Category scores and score margins for Gemma 4 31B and GPT-5.5
Category	Gemma 4 31B	Margin (arrow points to leader)	GPT-5.5
Coding	41.6	→ 17.0	58.6
Multimodal	76.9	← 6.5	70.4
Knowledge	52.9	→ 4.9	57.8
Agentic	Not measured	No overlap	81.6
Reasoning	Not measured	No overlap	85.0
Math	Not measured	No overlap	47.6
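
The margins and the category win count can be re-derived from the category averages in this table. The sketch below is illustrative only: the data structure and the rule of skipping categories without a score for both models are assumptions that match the table, not a description of how BenchLM computes its aggregate.

```python
from typing import Dict, Optional, Tuple

# (Gemma 4 31B, GPT-5.5) category averages; None means "Not measured".
scores: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "Coding": (41.6, 58.6),
    "Multimodal": (76.9, 70.4),
    "Knowledge": (52.9, 57.8),
    "Agentic": (None, 81.6),
    "Reasoning": (None, 85.0),
    "Math": (None, 47.6),
}

margins: Dict[str, float] = {}
wins = {"Gemma 4 31B": 0, "GPT-5.5": 0}

for category, (gemma, gpt) in scores.items():
    if gemma is None or gpt is None:
        continue  # no overlap, so no margin and no winner
    margin = round(gpt - gemma, 1)  # positive means GPT-5.5 leads
    margins[category] = margin
    if margin > 0:
        wins["GPT-5.5"] += 1
    elif margin < 0:
        wins["Gemma 4 31B"] += 1

print(margins)
print(wins)
```

Only three of the six listed categories contribute a winner, which is why the category win count is 1 to 2 rather than a tally over all rows.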

Decisive benchmark drivers

The largest measured benchmark gaps in this matchup, with exact reported values.

Benchmark	Category	Gemma 4 31B	GPT-5.5	Winner	Δ
HLE	Knowledge	26.5%	52.2%	GPT-5.5	25.7
GPQA	Knowledge	84.3%	93.6%	GPT-5.5	9.3
MMMU-Pro	Multimodal	76.9%	81.2%	GPT-5.5	4.3

Operational comparison

Runtime and commercial metrics are compared only when both models have a complete sourced value.

Metric	Gemma 4 31B	GPT-5.5	Comparison
Input / output price (USD per 1M tokens)	$0 input / $0 output	$5 input / $30 output	Gemma 4 31B has the lower combined listed price.
Generation speed (tokens per second)	Not available	Not available	A complete speed comparison is not available.
First-answer latency (seconds to first token)	Not available	Not available	A complete latency comparison is not available.
Context window (maximum listed tokens)	256K	1M	GPT-5.5 lists the larger context window.

Benchmark Deep Dive

Agentic

24 benchmarks

Benchmark	Gemma 4 31B	GPT-5.5	Result
AA Agentic Index	14.4%	44.9%	GPT-5.5 leads
τ²-bench results	59.9%	93.9%	GPT-5.5 leads
GDPval-AA	15.2%	49.5%	GPT-5.5 leads
GDPval-AA	804	1490	GPT-5.5 leads
Gert Labs	35.26%	72.93%	GPT-5.5 leads
AA EnterpriseOps-Gym	28.3%	46.6%	GPT-5.5 leads
AA ITBench	37.3%	45.8%	GPT-5.5 leads
AA Tau3 Banking	15.1%	31.3%	GPT-5.5 leads
terminalBenchHard	36.4%	60.6%	GPT-5.5 leads
Terminal-Bench 2.0	—	82%	Not comparable
CyberGym	—	81.8%	Not comparable
BrowseComp	—	84.4%	Not comparable
OSWorld-Verified	—	78.7%	Not comparable
MCP Atlas	—	75.3%	Not comparable
Toolathlon	—	55.6%	Not comparable
APEX-Agents-AA	—	37.7%	Not comparable
ResearchClawBench	—	17.0%	Not comparable
OSWorld 2.0	—	13.0%	Not comparable
JobBench	—	42.7%	Not comparable
ExploitGym	—	13.4%	Not comparable
AA Briefcase	—	1154	Not comparable
AA AutomationBench	—	42.1%	Not comparable
AA Harvey LAB	—	86.3%	Not comparable
aaTerminalBench21	—	84.3%	Not comparable

Coding · GPT-5.5 wins

10 benchmarks

Benchmark	Gemma 4 31B	GPT-5.5	Result
SWE-Rebench	41.6%	—	Not comparable
React Native Evals	75.2%	84.7%	GPT-5.5 leads
AA Coding Index	43.4%	74.9%	GPT-5.5 leads
AA-SciCode	43.4%	56.1%	GPT-5.5 leads
SWE-bench Pro	—	58.6%	Not comparable
Terminal-Bench 2.0	—	82.0%	Not comparable
Vibe Code Bench	—	69.85%	Not comparable
cursorBench31	—	59.2%	Not comparable
cursorBench32	—	58.4%	Not comparable
FrontierCode 1.1 Main	—	43.0%	Not comparable

Reasoning

5 benchmarks

Benchmark	Gemma 4 31B	GPT-5.5	Result
AA-LCR	62.0%	74.3%	GPT-5.5 leads
CritPt	1.4%	27.1%	GPT-5.5 leads
MRCR v2 64K-128K	—	83.1%	Not comparable
MRCR v2 128K-256K	—	87.5%	Not comparable
ARC-AGI-2	—	85%	Not comparable

Knowledge · GPT-5.5 wins

12 benchmarks

Benchmark	Gemma 4 31B	GPT-5.5	Result
GPQA	84.3%	93.6%	GPT-5.5 leads
MMLU-Pro	85.2%	—	Not comparable
HLE	26.5%	52.2%	GPT-5.5 leads
HLE w/o tools	19.5%	41.4%	GPT-5.5 leads
Artificial Analysis Intelligence Index	29.4%	54.8%	GPT-5.5 leads
AA-GPQA Diamond	85.7%	93.5%	GPT-5.5 leads
AA-HLE	22.7%	44.3%	GPT-5.5 leads
AA-Omniscience Index	-45.4%	20.1%	GPT-5.5 leads
AA-Omniscience Accuracy	19.9%	56.9%	GPT-5.5 leads
AA-Omniscience Hallucination Rate (lower is better)	81.6%	85.5%	Gemma 4 31B leads
AA Openness Index	38.9%	—	Not comparable
GPQA-D	—	93.6%	Not comparable

Math

3 benchmarks

Benchmark	Gemma 4 31B	GPT-5.5	Result
FrontierMath (legacy)	—	51.7%	Not comparable
FrontierMath v2 (Tiers 1-3)	—	51.7%	Not comparable
FrontierMath v2 (Tier 4)	—	35.4%	Not comparable

Multimodal · Gemma 4 31B wins

5 benchmarks

Benchmark	Gemma 4 31B	GPT-5.5	Result
MMMU-Pro	76.9%	81.2%	GPT-5.5 leads
AA-MMMU-Pro	73.4%	79.9%	GPT-5.5 leads
MMMU-Pro w/ Python	—	83.2%	Not comparable
OfficeQA Pro	—	54.1%	Not comparable
Design Arena Website	—	1282	Not comparable

Inst. Following

1 benchmark

Benchmark	Gemma 4 31B	GPT-5.5	Result
AA-IFBench	75.6%	75.9%	GPT-5.5 leads

Frequently Asked Questions

Which is better, Gemma 4 31B or GPT-5.5?

GPT-5.5 is ahead on BenchLM's BenchAlign leaderboard, 73.51 to 61.08. Among the decisive benchmark drivers, the biggest separator is HLE, where Gemma 4 31B scores 26.5% and GPT-5.5 scores 52.2%.

Which is better for knowledge tasks, Gemma 4 31B or GPT-5.5?

GPT-5.5 has the edge for knowledge tasks in this comparison, averaging 57.8 versus 52.9. Inside this category, AA-Omniscience Index is the benchmark that creates the most daylight between them.

Which is better for coding, Gemma 4 31B or GPT-5.5?

GPT-5.5 has the edge for coding in this comparison, averaging 58.6 versus 41.6. Inside this category, AA Coding Index is the benchmark that creates the most daylight between them.

Which is better for multimodal and grounded tasks, Gemma 4 31B or GPT-5.5?

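Gemma 4 31B has the edge for multimodal and grounded tasks on the category average, 76.9 versus 70.4. Inside this category, AA-MMMU-Pro is the benchmark that creates the most daylight between them, and there GPT-5.5 leads, 79.9% to 73.4%. GPT-5.5 also leads on MMMU-Pro, so Gemma 4 31B's edge reflects the category average rather than the shared head-to-head results.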

Self-host vs API cost

Estimates at 50,000 req/day · 1,000 tokens/req average.

Gemma 4 31B

API / mo: $0
Self-host / mo: $429
Break-even: —

GPT-5.5

API / mo: $26,250
Self-host / mo: Not listed
Break-even: —

Proprietary model: self-hosting not applicable.
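
The monthly API figures can be approximated from the stated traffic and the listed per-1M-token prices. The page does not state the input/output token split or the month length, so the sketch below assumes a 50/50 split and a 30-day month; those two assumptions happen to reproduce the listed $26,250 for GPT-5.5. `monthly_api_cost` is a hypothetical helper, not BenchLM's calculator.

```python
def monthly_api_cost(
    requests_per_day: int,
    tokens_per_request: int,
    input_price: float,        # USD per 1M input tokens
    output_price: float,       # USD per 1M output tokens
    input_share: float = 0.5,  # assumption: half of all tokens are input
    days: int = 30,            # assumption: 30-day month
) -> float:
    """Estimate the monthly API bill in USD."""
    tokens_per_month = requests_per_day * tokens_per_request * days
    input_tokens = tokens_per_month * input_share
    output_tokens = tokens_per_month - input_tokens
    total = input_tokens * input_price + output_tokens * output_price
    return total / 1_000_000


gpt_cost = monthly_api_cost(50_000, 1_000, 5.00, 30.00)
gemma_cost = monthly_api_cost(50_000, 1_000, 0.00, 0.00)

print(f"GPT-5.5 API / mo: ${gpt_cost:,.0f}")
print(f"Gemma 4 31B API / mo: ${gemma_cost:,.0f}")
```

Output-heavy workloads cost more than this estimate because the GPT-5.5 output price is six times its input price; adjusting `input_share` shows the effect.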


Last updated: July 22, 2026