Model comparison

Qwen3.7 Max vs Qwen3.7 Plus

Data verified July 23, 2026

Head-to-head evidence from 48 shared benchmark results across 8 categories. Overall scores shown here use the public BenchAlign v5 ranking lane.

Qwen3.7 Max (Alibaba): 72.84/100, 4 category wins
Qwen3.7 Plus (Alibaba): 67.22/100, 3 category wins
Margin: 5.6 pts in favor of Qwen3.7 Max

Public leaderboard positions: Qwen3.7 Max #10 (Supported); Qwen3.7 Plus #21 (Supported). Intervals and evidence labels describe ranking uncertainty, not a guarantee for a specific workload.

Evidence parity. Qwen3.7 Max and Qwen3.7 Plus share 48 comparable benchmark results. 7 of 8 categories are comparable. 10 results are unique to Qwen3.7 Max; 21 to Qwen3.7 Plus.

Shared results: 48
Qwen3.7 Max only: 10
Qwen3.7 Plus only: 21
Comparable categories: 7 / 8

Pick Qwen3.7 Max if you want the stronger overall benchmark profile. Qwen3.7 Plus becomes the better choice only if agentic work is the priority.

Confidence note. This is a partial-evidence comparison with 48 shared benchmark results across 8 evidence categories; 7 of 8 categories currently have scoreable aggregates for both models. Treat the verdict as directional until coverage is more balanced.
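The coverage figures quoted above can be cross-checked with simple arithmetic. The sketch below is illustrative only: the variable names are made up for this example, and the per-category counts are the "N benchmarks" figures listed in the Benchmark Deep Dive section of this page.

```python
# Coverage figures as reported on this page.
SHARED, MAX_ONLY, PLUS_ONLY = 48, 10, 21

# Benchmark rows listed per category in the Benchmark Deep Dive section.
ROWS_PER_CATEGORY = {
    "Agentic": 26, "Coding": 9, "Reasoning": 3, "Knowledge": 13,
    "Math": 3, "Multilingual": 5, "Multimodal": 17, "Inst. Following": 3,
}

total_rows = SHARED + MAX_ONLY + PLUS_ONLY  # every listed result row
max_rows = SHARED + MAX_ONLY                # rows with a Qwen3.7 Max value
plus_rows = SHARED + PLUS_ONLY              # rows with a Qwen3.7 Plus value
shared_share = SHARED / total_rows          # fraction usable head-to-head

print(total_rows, max_rows, plus_rows, f"{shared_share:.1%}")
print(total_rows == sum(ROWS_PER_CATEGORY.values()))
```

Under these assumptions the 79 listed rows reconcile with the per-category counts, and roughly 61% of them are comparable head-to-head, which is why the verdict is best read as directional.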

Why this result

Qwen3.7 Max is clearly ahead on the BenchAlign aggregate, 72.84 to 67.22. The gap is large enough that you do not need to squint at the spreadsheet to see the difference.

Qwen3.7 Max's sharpest category advantage is in math, where it averages 97.1 against 92.9. Among the decisive benchmark drivers listed on this page, the biggest swing is HLE, 41.4% to 34.7%. Qwen3.7 Plus does hit back in the agentic category, so the answer changes if that is the part of the workload you care about most.
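As a quick check on the gaps quoted above, here is a minimal sketch that recomputes the overall margin and the HLE swing from the reported scores. The helper name `margin` is hypothetical and not part of any BenchLM tooling.

```python
# Aggregate scores as reported on this page (BenchAlign v5 lane).
SCORES = {"Qwen3.7 Max": 72.84, "Qwen3.7 Plus": 67.22}


def margin(a: float, b: float, digits: int = 2) -> float:
    """Absolute gap between two scores, rounded to `digits` decimal places."""
    return round(abs(a - b), digits)


overall_gap = margin(SCORES["Qwen3.7 Max"], SCORES["Qwen3.7 Plus"])
hle_gap = margin(41.4, 34.7, digits=1)  # HLE scores from this page

print(f"Overall margin: {overall_gap} pts")
print(f"HLE swing: {hle_gap} pts")
```

The overall gap works out to 5.62 points, which the summary card shows rounded to 5.6.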

Category breakdown

Exact category averages are shown below. "Not measured" means BenchLM does not have enough sourced public coverage for that model and category.

Category scores and score margins for Qwen3.7 Max and Qwen3.7 Plus
Category	Qwen3.7 Max	Δ	Qwen3.7 Plus
Math	97.1	← 4.2	92.9
Knowledge	64.2	← 4.1	60.1
Coding	77.9	← 2.3	75.6
Agentic	69.7	→ 2.0	71.7
Multilingual	87.0	← 1.6	85.4
Reasoning	90.4	→ 1.3	91.7
Inst. Following	84.4	→ 0.1	84.5
Multimodal	Not measured	No overlap	81.5
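The category win counts can be reproduced from the table above. The following is a sketch under simple assumptions: a category goes to whichever model has the higher average, and a category with a missing value is treated as not comparable. The function name `tally` is made up for this example.

```python
# (category, Qwen3.7 Max average, Qwen3.7 Plus average); None = "Not measured".
CATEGORIES = [
    ("Math", 97.1, 92.9),
    ("Knowledge", 64.2, 60.1),
    ("Coding", 77.9, 75.6),
    ("Agentic", 69.7, 71.7),
    ("Multilingual", 87.0, 85.4),
    ("Reasoning", 90.4, 91.7),
    ("Inst. Following", 84.4, 84.5),
    ("Multimodal", None, 81.5),
]


def tally(rows):
    """Count category wins per model; skip categories missing a score."""
    wins = {"Qwen3.7 Max": 0, "Qwen3.7 Plus": 0, "not comparable": 0}
    for _category, max_avg, plus_avg in rows:
        if max_avg is None or plus_avg is None:
            wins["not comparable"] += 1
        elif max_avg > plus_avg:
            wins["Qwen3.7 Max"] += 1
        elif plus_avg > max_avg:
            wins["Qwen3.7 Plus"] += 1
        # Exact ties are counted for neither model.
    return wins


result = tally(CATEGORIES)
print(result)
```

This yields 4 wins for Qwen3.7 Max, 3 for Qwen3.7 Plus, and 1 category that is not comparable, matching the 7 of 8 comparable categories reported on this page.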

Decisive benchmark drivers

The largest measured benchmark gaps in this matchup, with exact reported values.

Benchmark	Category	Qwen3.7 Max	Qwen3.7 Plus	Δ	Winner
HLE	Knowledge	41.4%	34.7%	6.7	Qwen3.7 Max
HMMT Feb 2026	Math	97.1%	92.9%	4.2	Qwen3.7 Max
SWE-bench Pro	Coding	60.6%	57.6%	3.0	Qwen3.7 Max
SWE-bench Verified	Coding	80.4%	77.7%	2.7	Qwen3.7 Max
SciCode	Coding	53.5%	51.3%	2.2	Qwen3.7 Max
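To show how the five drivers above are ordered, here is a minimal sketch that sorts them by absolute score gap. It only covers the five benchmarks listed in this section and makes no claim about how BenchLM selects its drivers; the names `DRIVERS` and `rank_by_gap` are hypothetical.

```python
# (benchmark, Qwen3.7 Max %, Qwen3.7 Plus %) as reported in this section.
DRIVERS = [
    ("SciCode", 53.5, 51.3),
    ("SWE-bench Verified", 80.4, 77.7),
    ("HLE", 41.4, 34.7),
    ("SWE-bench Pro", 60.6, 57.6),
    ("HMMT Feb 2026", 97.1, 92.9),
]


def rank_by_gap(rows):
    """Return (benchmark, gap, leader) tuples, largest gap first."""
    ranked = []
    for name, max_score, plus_score in rows:
        gap = round(abs(max_score - plus_score), 1)
        leader = "Qwen3.7 Max" if max_score > plus_score else "Qwen3.7 Plus"
        ranked.append((name, gap, leader))
    return sorted(ranked, key=lambda row: row[1], reverse=True)


ranking = rank_by_gap(DRIVERS)
for name, gap, leader in ranking:
    print(f"{name}: {gap} pts, {leader} leads")
```

Sorting by gap reproduces the order shown in this section, with HLE first and SciCode last.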

Operational comparison

Runtime and commercial metrics are compared only when both models have a complete sourced value.

Metric	Qwen3.7 Max	Qwen3.7 Plus	Comparison
Input / output price (USD per 1M tokens)	Not available	Not available	A complete price comparison is not available.
Generation speed (tokens per second)	Not available	Not available	A complete speed comparison is not available.
First-answer latency (seconds to first token)	Not available	Not available	A complete latency comparison is not available.
Context window (maximum listed tokens)	1M	1M	Listed context windows are equal.

Benchmark Deep Dive

Agentic: Qwen3.7 Plus wins

26 benchmarks

Benchmark	Qwen3.7 Max	Qwen3.7 Plus	Result
Terminal-Bench 2.0	69.7%	70.3%	Qwen3.7 Plus leads
QwenClawBench	64.3%	61.8%	Qwen3.7 Max leads
QwenWebBench	1568	1536	Qwen3.7 Max leads
Claw-Eval	65.2%	62.7%	Qwen3.7 Max leads
BFCL v4	75.0%	72.9%	Qwen3.7 Max leads
MCP Atlas	76.4%	73.2%	Qwen3.7 Max leads
VITA-Bench	47.9%	45.6%	Qwen3.7 Max leads
HLE w/ tools	53.5%	—	Not comparable
AA Agentic Index	30.6%	20.8%	Qwen3.7 Max leads
τ²-bench results	94.7%	93%	Qwen3.7 Max leads
GDPval-AA	38.7%	21.8%	Qwen3.7 Max leads
GDPval-AA	1273	936	Qwen3.7 Max leads
Gert Labs	64.27%	—	Not comparable
ResearchClawBench	18.7%	—	Not comparable
AA Briefcase	908	—	Not comparable
AA AutomationBench	25.6%	—	Not comparable
AA EnterpriseOps-Gym	45.0%	—	Not comparable
AA ITBench	42.5%	—	Not comparable
terminalBenchHard	50.8%	—	Not comparable
aaTerminalBench21	74.5%	—	Not comparable
AA Harvey LAB	83.4%	—	Not comparable
DeepPlanning	—	62.3%	Not comparable
OSWorld-Verified	—	73.3%	Not comparable
AndroidWorld	—	81.0%	Not comparable
APEX-Agents-AA	—	22.4%	Not comparable
OSWorld 2.0	—	2.8%	Not comparable

Coding: Qwen3.7 Max wins

9 benchmarks

Benchmark	Qwen3.7 Max	Qwen3.7 Plus	Result
SWE-bench Verified	80.4%	77.7%	Qwen3.7 Max leads
SWE-bench Pro	60.6%	57.6%	Qwen3.7 Max leads
SWE Multilingual	78.3%	75.8%	Qwen3.7 Max leads
NL2Repo	47.2%	41.1%	Qwen3.7 Max leads
SciCode	53.5%	51.3%	Qwen3.7 Max leads
LiveCodeBench	91.6%	89.6%	Qwen3.7 Max leads
Terminal-Bench 2.0	69.7%	70.3%	Qwen3.7 Plus leads
AA Coding Index	66.0%	55.9%	Qwen3.7 Max leads
AA-SciCode	48.8%	45.5%	Qwen3.7 Max leads

Reasoning: Qwen3.7 Plus wins

3 benchmarks

Benchmark	Qwen3.7 Max	Qwen3.7 Plus	Result
MRCRv2	90.4%	91.7%	Qwen3.7 Plus leads
CritPt	13.4%	9.1%	Qwen3.7 Max leads
AA-LCR	69.0%	65.0%	Qwen3.7 Max leads

Knowledge: Qwen3.7 Max wins

13 benchmarks

Benchmark	Qwen3.7 Max	Qwen3.7 Plus	Result
GPQA	92.4%	90.3%	Qwen3.7 Max leads
GPQA-D	92.4%	90.3%	Qwen3.7 Max leads
HLE	41.4%	34.7%	Qwen3.7 Max leads
MMLU-Pro	89.6%	88.5%	Qwen3.7 Max leads
MMLU-Redux	95%	94.5%	Qwen3.7 Max leads
SuperGPQA	73.6%	71.4%	Qwen3.7 Max leads
MMMLU	90.3%	89.0%	Qwen3.7 Max leads
Artificial Analysis Intelligence Index	46.0%	39.0%	Qwen3.7 Max leads
AA-GPQA Diamond	92.3%	90.0%	Qwen3.7 Max leads
AA-HLE	38.1%	33.4%	Qwen3.7 Max leads
AA-Omniscience Index	14.1%	2.4%	Qwen3.7 Max leads
AA-Omniscience Accuracy	30.1%	22.2%	Qwen3.7 Max leads
AA-Omniscience Hallucination Rate	22.9%	25.5%	Qwen3.7 Max leads (lower is better)

Math: Qwen3.7 Max wins

3 benchmarks

Benchmark	Qwen3.7 Max	Qwen3.7 Plus	Result
HMMT Feb 2026	97.1%	92.9%	Qwen3.7 Max leads
IMOAnswerBench	90.0%	86.0%	Qwen3.7 Max leads
Apex	44.5%	22.7%	Qwen3.7 Max leads

Multilingual: Qwen3.7 Max wins

5 benchmarks

Benchmark	Qwen3.7 Max	Qwen3.7 Plus	Result
MMLU-ProX	87%	85.4%	Qwen3.7 Max leads
NOVA-63	59.0%	58.8%	Qwen3.7 Max leads
INCLUDE	86.2%	83.0%	Qwen3.7 Max leads
MAXIFE	89.2%	88.8%	Qwen3.7 Max leads
PolyMath	86.5%	84.0%	Qwen3.7 Max leads

Multimodal

17 benchmarks

Benchmark	Qwen3.7 Max	Qwen3.7 Plus	Result
Design Arena Website	1293	1288	Qwen3.7 Max leads
MMMU-Pro	—	79%	Not comparable
MathVision	—	90.3%	Not comparable
CharXiv	—	85.9%	Not comparable
ERQA	—	69.8%	Not comparable
MedXpertQA (MM)	—	71.0%	Not comparable
ScreenSpot Pro	—	79.0%	Not comparable
SimpleVQA	—	81.7%	Not comparable
MMSearch-Plus	—	41.4%	Not comparable
RealWorldQA	—	86.9%	Not comparable
OmniDocBench 1.5	—	91.4%	Not comparable
OCRBench V2	—	70.7%	Not comparable
ODINW13	—	51.1%	Not comparable
Video-MME (with subtitle)	—	88.0%	Not comparable
VideoMMMU	—	85.4%	Not comparable
MLVU (M-Avg)	—	87.4%	Not comparable
AA-MMMU-Pro	—	80.5%	Not comparable

Inst. Following: Qwen3.7 Plus wins

3 benchmarks

Benchmark	Qwen3.7 Max	Qwen3.7 Plus	Result
IFEval	94.3%	94.6%	Qwen3.7 Plus leads
IFBench	79.1%	79.1%	Tie
AA-IFBench	80.5%	78.0%	Qwen3.7 Max leads

Frequently Asked Questions

Which is better, Qwen3.7 Max or Qwen3.7 Plus?

Qwen3.7 Max is ahead on BenchLM's BenchAlign leaderboard, 72.84 to 67.22. Among the decisive benchmark drivers, the biggest single separator is HLE, where the scores are 41.4% and 34.7%.

Which is better for knowledge tasks, Qwen3.7 Max or Qwen3.7 Plus?

Qwen3.7 Max has the edge for knowledge tasks in this comparison, averaging 64.2 versus 60.1. Inside this category, AA-Omniscience Index is the benchmark that creates the most daylight between them.

Which is better for coding, Qwen3.7 Max or Qwen3.7 Plus?

Qwen3.7 Max has the edge for coding in this comparison, averaging 77.9 versus 75.6. Inside this category, AA Coding Index is the benchmark that creates the most daylight between them.

Which is better for math, Qwen3.7 Max or Qwen3.7 Plus?

Qwen3.7 Max has the edge for math in this comparison, averaging 97.1 versus 92.9. Inside this category, Apex is the benchmark that creates the most daylight between them.

Which is better for reasoning, Qwen3.7 Max or Qwen3.7 Plus?

Qwen3.7 Plus has the edge for reasoning in this comparison, averaging 91.7 versus 90.4. Inside this category, CritPt is the benchmark that creates the most daylight between them, although that gap favors Qwen3.7 Max (13.4% to 9.1%).

Which is better for agentic tasks, Qwen3.7 Max or Qwen3.7 Plus?

Qwen3.7 Plus has the edge for agentic tasks in this comparison, averaging 71.7 versus 69.7. Inside this category, GDPval-AA is the benchmark that creates the most daylight between them, although that gap favors Qwen3.7 Max (38.7% to 21.8%).

Which is better for instruction following, Qwen3.7 Max or Qwen3.7 Plus?

Qwen3.7 Plus has the edge for instruction following in this comparison, averaging 84.5 versus 84.4. Inside this category, AA-IFBench is the benchmark that creates the most daylight between them, although that gap favors Qwen3.7 Max (80.5% to 78.0%).

Which is better for multilingual tasks, Qwen3.7 Max or Qwen3.7 Plus?

Qwen3.7 Max has the edge for multilingual tasks in this comparison, averaging 87.0 versus 85.4. Inside this category, INCLUDE is the benchmark that creates the most daylight between them.
