Model comparison

Qwen3.6 Plus vs Step 3.7 Flash

Data verified July 16, 2026

Head-to-head evidence from 24 shared benchmark results across 6 categories. Overall scores shown here use the public BenchAlign v5 ranking lane.

Qwen3.6 Plus

Alibaba

65.05/100

Margin

14.3pts

← winning

Step 3.7 Flash

StepFun

50.76/100

Category wins: Qwen3.6 Plus 1 · Step 3.7 Flash 1

Verified leaderboard positions: Qwen3.6 Plus #14; Step 3.7 Flash unranked

BenchAlign evidence: Qwen3.6 Plus supported; Step 3.7 Flash estimated. Intervals and evidence labels describe ranking uncertainty, not a guarantee for a specific workload.

Evidence parity. Qwen3.6 Plus and Step 3.7 Flash share 24 comparable benchmark results. 2 of 8 categories are comparable. 37 results are unique to Qwen3.6 Plus; 6 to Step 3.7 Flash.

Updated July 16, 2026

Shared results: 24
Qwen3.6 Plus only: 37
Step 3.7 Flash only: 6
Comparable categories: 2 / 8
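
The parity counts above are set arithmetic over each model's benchmark names. The sketch below is illustrative only: it uses just the Coding rows from the Benchmark Deep Dive on this page, and the function name `evidence_parity` is hypothetical rather than part of any BenchLM tooling.

```python
# Coding-category results copied from this page's Benchmark Deep Dive (percent scores).
# Only one category is used here, so the counts are smaller than the page-wide 24 / 37 / 6.
qwen = {
    "SWE-bench Verified": 78.8,
    "SWE-bench Pro": 56.6,
    "SWE Multilingual": 73.8,
    "LiveCodeBench v6": 87.1,
    "Vibe Code Bench": 25.56,
    "AA Coding Index": 54.5,
    "Terminal-Bench Hard": 43.9,
    "AA-SciCode": 40.7,
}
step = {
    "SWE-bench Pro": 56.3,
    "AA Coding Index": 39.6,
    "Terminal-Bench Hard": 35.6,
    "AA-SciCode": 40.0,
    "Terminal-Bench 2.0": 59.5,
}


def evidence_parity(a, b):
    """Return (shared, only_a, only_b) benchmark names, each sorted alphabetically."""
    shared = sorted(a.keys() & b.keys())
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    return shared, only_a, only_b


shared, only_qwen, only_step = evidence_parity(qwen, step)
print("Shared results:", len(shared))
print("Qwen3.6 Plus only:", len(only_qwen))
print("Step 3.7 Flash only:", len(only_step))
```

Running the same function over all eight categories should reproduce the page-wide counts, assuming each table row is treated as one result.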

Pick Qwen3.6 Plus if you want the stronger overall benchmark profile. Step 3.7 Flash is the better choice only if agentic performance is the priority.

Confidence note. This is a partial-evidence comparison with 24 shared benchmark results across 6 evidence categories; 2 of 8 categories currently have scoreable aggregates for both models. Treat the verdict as directional until coverage is more balanced.

Why this result

Qwen3.6 Plus leads the provisional aggregate 63 to 57, a 6-point gap. These figures differ from the BenchAlign v5 scores shown at the top of the page (65.05 and 50.76).

Qwen3.6 Plus's sharpest advantage is in coding, where it averages 70.3 against 56.3. Of the two benchmarks listed as decisive drivers, Terminal-Bench 2.0 shows the larger gap, 61.6% to 59.5%. Step 3.7 Flash leads in agentic, 66.4 to 61.6, so the answer changes if that is the part of the workload you care about most.

Qwen3.6 Plus gives you the larger context window at 1M, compared with 256K for Step 3.7 Flash.

Category breakdown

Exact category averages are shown below. Not measured means BenchLM does not have enough sourced public coverage for that model and category.

Category scores and score margins for Qwen3.6 Plus and Step 3.7 Flash
Category	Qwen3.6 Plus	Δ	Step 3.7 Flash
Coding	70.3	← 14.0	56.3
Agentic	61.6	→ 4.8	66.4
Reasoning	62.0	No overlap	Not measured
Knowledge	57.5	No overlap	Not measured
Math	60.5	No overlap	Not measured
Multilingual	84.7	No overlap	Not measured
Multimodal	79.8	No overlap	Not measured
Inst. Following	82.3	No overlap	Not measured
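
The margin column can be reproduced from the two category averages. The following is a minimal sketch, assuming a margin is simply the difference of the two averages rounded to one decimal and that "Not measured" is represented as `None`; the helper name `margin` is hypothetical.

```python
# Category averages from the table above: (Qwen3.6 Plus, Step 3.7 Flash).
# None stands in for "Not measured".
scores = {
    "Coding": (70.3, 56.3),
    "Agentic": (61.6, 66.4),
    "Reasoning": (62.0, None),
    "Knowledge": (57.5, None),
    "Math": (60.5, None),
    "Multilingual": (84.7, None),
    "Multimodal": (79.8, None),
    "Inst. Following": (82.3, None),
}


def margin(a, b):
    """Return (leader, gap) for one category, or None when either side is missing."""
    if a is None or b is None:
        return None  # "No overlap" in the table
    diff = round(a - b, 1)  # rounding hides floating-point noise
    if diff > 0:
        return ("Qwen3.6 Plus", diff)
    if diff < 0:
        return ("Step 3.7 Flash", -diff)
    return ("tie", 0.0)


margins = {cat: margin(a, b) for cat, (a, b) in scores.items()}
comparable = sum(1 for m in margins.values() if m is not None)

for cat, m in margins.items():
    print(cat, m if m is not None else "No overlap")
print(f"Comparable categories: {comparable} / {len(scores)}")
```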

Decisive benchmark drivers

The largest measured benchmark gaps in this matchup, with exact reported values.

Benchmark	Category	A · Qwen3.6 Plus	B · Step 3.7 Flash	Winner	Δ
Terminal-Bench 2.0	Agentic	61.6%	59.5%	Qwen3.6 Plus	2.1
SWE-bench Pro	Coding	56.6%	56.3%	Qwen3.6 Plus	0.3

Operational comparison

Runtime and commercial metrics are compared only when both models have a complete sourced value.

Metric	Qwen3.6 Plus	Step 3.7 Flash	Comparison
Input / output price (USD per 1M tokens)	Not available	$0.20 input / $1.15 output	A complete price comparison is not available.
Generation speed (tokens per second)	Not available	Not available	A complete speed comparison is not available.
First-answer latency (seconds to first token)	Not available	Not available	A complete latency comparison is not available.
Context window (maximum listed tokens)	1M	256K	Qwen3.6 Plus lists the larger context window.
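
To turn the listed per-token prices into a budget figure, a rough estimate can be computed as below. This is a sketch under stated assumptions: the workload size (2M input tokens, 500K output tokens) is invented for illustration, the function name `estimate_cost` is hypothetical, and a missing price returns `None` rather than a guess, mirroring the rule that metrics are compared only when both values are sourced.

```python
from typing import Optional, Tuple

# (input USD per 1M tokens, output USD per 1M tokens); None means "Not available".
Price = Optional[Tuple[float, float]]

PRICES = {
    "Qwen3.6 Plus": None,            # no sourced price on this page
    "Step 3.7 Flash": (0.20, 1.15),  # as listed in the table above
}


def estimate_cost(price: Price, input_tokens: int, output_tokens: int) -> Optional[float]:
    """Estimate USD cost for a workload, or None when no price is listed."""
    if price is None:
        return None
    in_rate, out_rate = price
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Hypothetical workload, not taken from the page.
workload = {"input_tokens": 2_000_000, "output_tokens": 500_000}

for model, price in PRICES.items():
    cost = estimate_cost(price, **workload)
    label = "not available" if cost is None else f"${cost:.3f}"
    print(f"{model}: {label}")
```

Because only one model has a listed price, the output is an estimate for Step 3.7 Flash alone, not a price comparison.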

Benchmark Deep Dive

Agentic: Step 3.7 Flash wins

20 benchmarks

Benchmark	Qwen3.6 Plus	Step 3.7 Flash	Result
Terminal-Bench 2.0	61.6%	59.5%	Qwen3.6 Plus leads
Claw-Eval	58.8%	67.1%	Step 3.7 Flash leads
QwenClawBench	57.2%	—	Not comparable
τ³-bench results	70.7%	—	Not comparable
VITA-Bench	44.3%	—	Not comparable
DeepPlanning	41.5%	—	Not comparable
Toolathlon	39.8%	49.5%	Step 3.7 Flash leads
MCP Atlas	48.2%	—	Not comparable
MCP-Tasks	74.1%	—	Not comparable
WideResearch	74.3%	—	Not comparable
AA Agentic Index	27.6%	21.5%	Qwen3.6 Plus leads
τ²-bench results	97.7%	98.5%	Step 3.7 Flash leads
GDPval-AA	31.8%	25.9%	Qwen3.6 Plus leads
GDPval-AA	1135	1017	Qwen3.6 Plus leads
Gert Labs	50.60%	51.57%	Step 3.7 Flash leads
ResearchClawBench	18.0%	—	Not comparable
BrowseComp	—	75.8%	Not comparable
DeepSearchQA	—	92.8%	Not comparable
HLE w/ tools	—	47.2%	Not comparable
APEX-Agents-AA	—	14.8%	Not comparable
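
A simple win count over the comparable agentic rows is a useful cross-check on the category verdict. The sketch below assumes higher is better for every row listed and treats the two GDPval-AA rows as separate results; the function name `tally` is hypothetical.

```python
from collections import Counter

# Comparable agentic rows from the table above: (benchmark, Qwen3.6 Plus, Step 3.7 Flash).
# The two GDPval-AA rows are kept separate because the page lists them separately.
rows = [
    ("Terminal-Bench 2.0", 61.6, 59.5),
    ("Claw-Eval", 58.8, 67.1),
    ("Toolathlon", 39.8, 49.5),
    ("AA Agentic Index", 27.6, 21.5),
    ("τ²-bench results", 97.7, 98.5),
    ("GDPval-AA (first listed row)", 31.8, 25.9),
    ("GDPval-AA (second listed row)", 1135, 1017),
    ("Gert Labs", 50.60, 51.57),
]


def tally(rows):
    """Count per-benchmark leads, assuming a higher score is better on every row."""
    counts = Counter()
    for _, qwen_score, step_score in rows:
        if qwen_score > step_score:
            counts["Qwen3.6 Plus"] += 1
        elif step_score > qwen_score:
            counts["Step 3.7 Flash"] += 1
        else:
            counts["tie"] += 1
    return counts


counts = tally(rows)
print(dict(counts))
```

The leads split evenly, so the "Step 3.7 Flash wins" verdict follows from the category averages (66.4 versus 61.6) rather than from a win count. This page does not state how those averages are weighted.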

Coding: Qwen3.6 Plus wins

9 benchmarks

Benchmark	Qwen3.6 Plus	Step 3.7 Flash	Result
SWE-bench Verified	78.8%	—	Not comparable
SWE-bench Pro	56.6%	56.3%	Qwen3.6 Plus leads
SWE Multilingual	73.8%	—	Not comparable
LiveCodeBench v6	87.1%	—	Not comparable
Vibe Code Bench	25.56%	—	Not comparable
AA Coding Index	54.5%	39.6%	Qwen3.6 Plus leads
Terminal-Bench Hard	43.9%	35.6%	Qwen3.6 Plus leads
AA-SciCode	40.7%	40.0%	Qwen3.6 Plus leads
Terminal-Bench 2.0	—	59.5%	Not comparable

Reasoning

4 benchmarks

Benchmark	Qwen3.6 Plus	Step 3.7 Flash	Result
AI-Needle	68.3%	—	Not comparable
LongBench v2	62%	—	Not comparable
AA-LCR	69.7%	63.7%	Qwen3.6 Plus leads
CritPt	2.9%	2.3%	Qwen3.6 Plus leads

Knowledge

12 benchmarks

Benchmark	Qwen3.6 Plus	Step 3.7 Flash	Result
GPQA	90.4%	—	Not comparable
SuperGPQA	71.6%	—	Not comparable
MMLU-Pro	88.5%	—	Not comparable
MMLU-Redux	94.5%	—	Not comparable
C-Eval	93.3%	—	Not comparable
HLE	28.8%	—	Not comparable
Artificial Analysis Intelligence Index	39.6%	30.3%	Qwen3.6 Plus leads
AA-GPQA Diamond	88.2%	80.9%	Qwen3.6 Plus leads
AA-HLE	25.7%	19.9%	Qwen3.6 Plus leads
AA-Omniscience Index	2.7%	-37.5%	Qwen3.6 Plus leads
AA-Omniscience Accuracy	26.2%	25.4%	Qwen3.6 Plus leads
AA-Omniscience Hallucination Rate	32.0%	84.4%	Qwen3.6 Plus leads (lower is better)

Math

7 benchmarks

Benchmark	Qwen3.6 Plus	Step 3.7 Flash	Result
AIME26	95.3%	—	Not comparable
HMMT Feb 2025	96.7%	—	Not comparable
HMMT Nov 2025	94.6%	—	Not comparable
HMMT Feb 2026	87.8%	—	Not comparable
MMAnswerBench	83.8%	—	Not comparable
FrontierMath v2 (Tiers 1-3)	26.207%	—	Not comparable
FrontierMath v2 (Tier 4)	8.333%	—	Not comparable

Multilingual

2 benchmarks

Benchmark	Qwen3.6 Plus	Step 3.7 Flash	Result
MMLU-ProX	84.7%	—	Not comparable
NOVA-63	57.9%	—	Not comparable

Multimodal

10 benchmarks

Benchmark	Qwen3.6 Plus	Step 3.7 Flash	Result
MMMU	86.0%	—	Not comparable
MMMU-Pro	78.8%	—	Not comparable
MathVision	88.0%	—	Not comparable
VideoMMMU	84.0%	—	Not comparable
ScreenSpot Pro	68.2%	—	Not comparable
CharXiv	81.5%	—	Not comparable
V*	96.9%	95.3%	Qwen3.6 Plus leads
AA-MMMU-Pro	78.0%	75.3%	Qwen3.6 Plus leads
Design Arena Website	1254	1218	Qwen3.6 Plus leads
SimpleVQA	—	79.2%	Not comparable

Inst. Following

3 benchmarks

Benchmark	Qwen3.6 Plus	Step 3.7 Flash	Result
IFEval	94.3%	—	Not comparable
IFBench	75.8%	—	Not comparable
AA-IFBench	75.2%	67.3%	Qwen3.6 Plus leads

Frequently Asked Questions

Which is better, Qwen3.6 Plus or Step 3.7 Flash?

Qwen3.6 Plus is ahead on BenchLM's provisional leaderboard, 63 to 57. Among the decisive benchmark drivers listed for this matchup, the largest gap is on Terminal-Bench 2.0, where the scores are 61.6% and 59.5%.

Which is better for coding, Qwen3.6 Plus or Step 3.7 Flash?

Qwen3.6 Plus has the edge for coding in this comparison, averaging 70.3 versus 56.3. Inside this category, AA Coding Index shows the widest gap between them, 54.5% to 39.6%.

Which is better for agentic tasks, Qwen3.6 Plus or Step 3.7 Flash?

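Step 3.7 Flash has the edge for agentic tasks in this comparison, averaging 66.4 versus 61.6. Inside this category, GDPval-AA creates the most daylight between them, although that gap favors Qwen3.6 Plus (31.8% to 25.9%, and 1135 to 1017 on its second listed score). Step 3.7 Flash's largest leads are on Toolathlon (49.5% to 39.8%) and Claw-Eval (67.1% to 58.8%).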

