Model comparison

Gemma 4 31B vs Step 3.7 Flash

Data verified July 16, 2026

Head-to-head evidence from 18 shared benchmark results across 6 categories. Overall scores shown here use the public BenchAlign v5 ranking lane.

Gemma 4 31B (Google): 61/100, winning
Step 3.7 Flash (StepFun): 50.76/100
Margin: 10.2 pts
Category wins: Gemma 4 31B 0, Step 3.7 Flash 1

BenchAlign evidence: Gemma 4 31B supported; Step 3.7 Flash estimated. Intervals and evidence labels describe ranking uncertainty, not a guarantee for a specific workload.

Evidence parity. Gemma 4 31B and Step 3.7 Flash share 18 comparable benchmark results. 1 of 8 categories is comparable. 12 results are unique to Gemma 4 31B; 12 to Step 3.7 Flash.

Updated July 16, 2026

Shared results: 18
Gemma 4 31B only: 12
Step 3.7 Flash only: 12
Comparable categories: 1 / 8
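The parity counts above can be reproduced by splitting a result table into shared and one-sided rows. The following is a minimal sketch, not the site's actual method: it uses only the seven Coding rows from the Benchmark Deep Dive as sample data, and the names `CODING` and `evidence_parity` are hypothetical.

```python
from typing import Optional

# Coding rows as listed on this page; None means no published result.
# Format: benchmark -> (Gemma 4 31B, Step 3.7 Flash)
CODING: dict[str, tuple[Optional[float], Optional[float]]] = {
    "SWE-Rebench": (41.6, None),
    "React Native Evals": (75.2, None),
    "AA Coding Index": (43.4, 39.6),
    "Terminal-Bench Hard": (36.4, 35.6),
    "AA-SciCode": (43.4, 40.0),
    "SWE-bench Pro": (None, 56.3),
    "Terminal-Bench 2.0": (None, 59.5),
}


def evidence_parity(results):
    """Count rows with results for both models, only the first, or only the second."""
    shared = sum(1 for a, b in results.values() if a is not None and b is not None)
    only_a = sum(1 for a, b in results.values() if a is not None and b is None)
    only_b = sum(1 for a, b in results.values() if a is None and b is not None)
    return {"shared": shared, "only_a": only_a, "only_b": only_b}


parity = evidence_parity(CODING)
print(parity)
```

Running the same count over every category table on the page should give the 18 / 12 / 12 split reported above.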

Pick Gemma 4 31B if you want the stronger benchmark profile. Step 3.7 Flash becomes the better choice only if coding is the priority.

Confidence note. This is a partial-evidence comparison with 18 shared benchmark results across 6 evidence categories; only 1 of 8 categories currently has scoreable aggregates for both models. Treat the verdict as directional until coverage is more balanced.

Why this result

Gemma 4 31B is ahead on the provisional aggregate, 62 to 57, a gap of 5 points.

Step 3.7 Flash is the more expensive model on tokens at $0.20 input / $1.15 output per 1M tokens, versus $0.00 input / $0.00 output per 1M tokens for Gemma 4 31B. Because Gemma 4 31B's listed price is $0, a cost multiple cannot be computed; the difference on output is the full $1.15 per 1M tokens.
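A price multiple is only meaningful when the baseline price is non-zero. The sketch below is one way to guard that calculation using the listed prices; the helper names `price_ratio` and `describe` are illustrative and not part of any published tooling.

```python
from typing import Optional


def price_ratio(price: float, baseline: float) -> Optional[float]:
    """Return price / baseline, or None when the baseline is zero."""
    if baseline == 0:
        return None
    return price / baseline


def describe(ratio: Optional[float]) -> str:
    """Render a ratio for display without ever printing 'Infinity'."""
    return "undefined (baseline is $0)" if ratio is None else f"{ratio:.1f}x"


# Listed output prices in USD per 1M tokens.
step_output, gemma_output = 1.15, 0.00
output_ratio = price_ratio(step_output, gemma_output)
print(describe(output_ratio))
```

When the baseline is $0, reporting the absolute difference in USD per 1M tokens is usually the more useful figure.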

Category breakdown

Exact category averages are shown below. Not measured means BenchLM does not have enough sourced public coverage for that model and category.

Category scores and score margins for Gemma 4 31B and Step 3.7 Flash
Category	Gemma 4 31B	Margin	Step 3.7 Flash
Coding	41.6	14.7 (Step 3.7 Flash ahead)	56.3
Agentic	Not measured	No overlap	66.4
Knowledge	53.3	No overlap	Not measured
Multimodal	76.9	No overlap	Not measured
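The margin column follows a simple rule: a margin exists only when both models have a category average. A minimal sketch of that rule, assuming `None` stands for "Not measured"; the names `CATEGORY_SCORES` and `margin` are hypothetical.

```python
from typing import Optional

# Category averages as listed; None stands for "Not measured".
# Format: category -> (Gemma 4 31B, Step 3.7 Flash)
CATEGORY_SCORES: dict[str, tuple[Optional[float], Optional[float]]] = {
    "Coding": (41.6, 56.3),
    "Agentic": (None, 66.4),
    "Knowledge": (53.3, None),
    "Multimodal": (76.9, None),
}


def margin(gemma: Optional[float], step: Optional[float]) -> str:
    """Describe the score margin, or report that the category has no overlap."""
    if gemma is None or step is None:
        return "No overlap"
    diff = round(abs(gemma - step), 1)
    if diff == 0:
        return "Tie"
    leader = "Gemma 4 31B" if gemma > step else "Step 3.7 Flash"
    return f"{diff} ({leader} ahead)"


for category, (gemma, step) in CATEGORY_SCORES.items():
    print(f"{category}: {margin(gemma, step)}")

comparable = sum(1 for g, s in CATEGORY_SCORES.values() if g is not None and s is not None)
```

Only Coding produces a numeric margin here, which is why a single category drives the category-win count.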

Operational comparison

Runtime and commercial metrics are compared only when both models have a complete sourced value.

Metric	Gemma 4 31B	Step 3.7 Flash	Comparison
Input / output price (USD per 1M tokens)	$0 input / $0 output	$0.20 input / $1.15 output	Gemma 4 31B has the lower combined listed price.
Generation speed (tokens per second)	Not available	Not available	A complete speed comparison is not available.
First-answer latency (seconds to first token)	Not available	Not available	A complete latency comparison is not available.
Context window (maximum listed tokens)	256K	256K	Listed context windows are equal.

Benchmark Deep Dive

Agentic

16 benchmarks

Benchmark	Gemma 4 31B	Step 3.7 Flash	Result
AA Agentic Index	14.4%	21.5%	Step 3.7 Flash leads
τ²-bench results	59.9%	98.5%	Step 3.7 Flash leads
GDPval-AA	15.2%	25.9%	Step 3.7 Flash leads
GDPval-AA	804	1017	Step 3.7 Flash leads
Gert Labs	35.26%	51.57%	Step 3.7 Flash leads
AA EnterpriseOps-Gym	28.3%	—	Not comparable
AA Harvey LAB	0.0%	—	Not comparable
AA ITBench	37.3%	—	Not comparable
AA Tau3 Banking	15.1%	—	Not comparable
Terminal-Bench 2.0	—	59.5%	Not comparable
BrowseComp	—	75.8%	Not comparable
DeepSearchQA	—	92.8%	Not comparable
Toolathlon	—	49.5%	Not comparable
Claw-Eval	—	67.1%	Not comparable
HLE w/ tools	—	47.2%	Not comparable
APEX-Agents-AA	—	14.8%	Not comparable

Coding: Step 3.7 Flash wins

7 benchmarks

Benchmark	Gemma 4 31B	Step 3.7 Flash	Result
SWE-Rebench	41.6%	—	Not comparable
React Native Evals	75.2%	—	Not comparable
AA Coding Index	43.4%	39.6%	Gemma 4 31B leads
Terminal-Bench Hard	36.4%	35.6%	Gemma 4 31B leads
AA-SciCode	43.4%	40.0%	Gemma 4 31B leads
SWE-bench Pro	—	56.3%	Not comparable
Terminal-Bench 2.0	—	59.5%	Not comparable

Reasoning

2 benchmarks

Benchmark	Gemma 4 31B	Step 3.7 Flash	Result
AA-LCR	62.0%	63.7%	Step 3.7 Flash leads
CritPt	1.4%	2.3%	Step 3.7 Flash leads

Knowledge

11 benchmarks

Benchmark	Gemma 4 31B	Step 3.7 Flash	Result
GPQA	84.3%	—	Not comparable
MMLU-Pro	85.2%	—	Not comparable
HLE	26.5%	—	Not comparable
HLE w/o tools	19.5%	—	Not comparable
Artificial Analysis Intelligence Index	29.4%	30.3%	Step 3.7 Flash leads
AA-GPQA Diamond	85.7%	80.9%	Gemma 4 31B leads
AA-HLE	22.7%	19.9%	Gemma 4 31B leads
AA-Omniscience Index	-45.4%	-37.5%	Step 3.7 Flash leads
AA-Omniscience Accuracy	19.9%	25.4%	Step 3.7 Flash leads
AA-Omniscience Hallucination Rate	81.6%	84.4%	Gemma 4 31B leads
AA Openness Index	38.9%	—	Not comparable
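The Result column in the Knowledge table is not always "higher number wins": the hallucination-rate row credits the lower value. The sketch below tallies leads over the six shared Knowledge rows. The `higher_is_better` flags are an assumption inferred from the page's own Result labels, and the names `SHARED_KNOWLEDGE` and `leader` are hypothetical.

```python
from collections import Counter

# Shared Knowledge rows: (benchmark, Gemma 4 31B, Step 3.7 Flash, higher_is_better).
# The direction flags are assumed from the Result labels on this page.
SHARED_KNOWLEDGE = [
    ("Artificial Analysis Intelligence Index", 29.4, 30.3, True),
    ("AA-GPQA Diamond", 85.7, 80.9, True),
    ("AA-HLE", 22.7, 19.9, True),
    ("AA-Omniscience Index", -45.4, -37.5, True),
    ("AA-Omniscience Accuracy", 19.9, 25.4, True),
    ("AA-Omniscience Hallucination Rate", 81.6, 84.4, False),
]


def leader(gemma: float, step: float, higher_is_better: bool) -> str:
    """Name the leading model, respecting the benchmark's scoring direction."""
    if gemma == step:
        return "Tie"
    gemma_ahead = (gemma > step) == higher_is_better
    return "Gemma 4 31B" if gemma_ahead else "Step 3.7 Flash"


tally = Counter(leader(g, s, hib) for _, g, s, hib in SHARED_KNOWLEDGE)
print(dict(tally))
```

Under these flags the shared Knowledge rows split three to three, matching the Result column row by row.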

Multimodal

5 benchmarks

Benchmark	Gemma 4 31B	Step 3.7 Flash	Result
MMMU-Pro	76.9%	—	Not comparable
AA-MMMU-Pro	73.4%	75.3%	Step 3.7 Flash leads
SimpleVQA	—	79.2%	Not comparable
V*	—	95.3%	Not comparable
Design Arena Website	—	1218	Not comparable

Instruction Following

1 benchmark

Benchmark	Gemma 4 31B	Step 3.7 Flash	Result
AA-IFBench	75.6%	67.3%	Gemma 4 31B leads

Frequently Asked Questions

Which is better, Gemma 4 31B or Step 3.7 Flash?

Gemma 4 31B is ahead on BenchLM's provisional leaderboard, 62 to 57.

Which is better for coding, Gemma 4 31B or Step 3.7 Flash?

Step 3.7 Flash has the edge for coding in this comparison, averaging 56.3 versus 41.6. Among the three coding benchmarks both models share, AA Coding Index shows the largest gap (43.4% versus 39.6%), and Gemma 4 31B leads all three.

Self-host vs API cost

Estimates at 50,000 req/day · 1000 tokens/req average.

Gemma 4 31B

API / mo: $0

Self-host / mo: $429

Break-even—

Step 3.7 Flash

API / mo: $1,012

Self-host / mo: Not listed

Break-even—

Proprietary model — self-hosting not applicable.
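The monthly API figures can be approximated from the stated load and the listed per-token prices. This is a rough sketch, not the site's calculator: the page does not state how tokens split between input and output or how many days it counts per month, so the 50/50 split and 30-day month below are assumptions, and `monthly_api_cost` is a hypothetical helper. Under those assumptions Step 3.7 Flash comes to about $1,012.50, in line with the listed $1,012.

```python
def monthly_api_cost(
    req_per_day: int,
    tokens_per_req: int,
    input_price: float,   # USD per 1M input tokens
    output_price: float,  # USD per 1M output tokens
    input_share: float = 0.5,  # assumed input/output split (not stated on the page)
    days: int = 30,            # assumed billing month (not stated on the page)
) -> float:
    """Estimate monthly API spend in USD for a fixed daily request volume."""
    tokens_per_month = req_per_day * tokens_per_req * days
    input_millions = tokens_per_month * input_share / 1_000_000
    output_millions = tokens_per_month * (1 - input_share) / 1_000_000
    return input_millions * input_price + output_millions * output_price


# Load stated on this page: 50,000 req/day at 1000 tokens/req.
step_cost = monthly_api_cost(50_000, 1_000, input_price=0.20, output_price=1.15)
gemma_cost = monthly_api_cost(50_000, 1_000, input_price=0.00, output_price=0.00)
print(f"Step 3.7 Flash: ${step_cost:,.2f}/mo, Gemma 4 31B: ${gemma_cost:,.2f}/mo")
```

Changing `input_share` shifts the estimate considerably because the output price is almost six times the input price, so a real workload's split matters more than the request count.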
