Model comparison

Hy3 Preview vs Step 3.7 Flash

Data verified July 16, 2026

Head-to-head evidence from 16 shared benchmark results across 4 categories. Overall scores shown here use the public BenchAlign v5 ranking lane.

Hy3 Preview (Tencent): 53/100 (leading)
Step 3.7 Flash (StepFun): 50.76/100
Margin: 2.2 pts in favor of Hy3 Preview
Category wins: Hy3 Preview 1, Step 3.7 Flash 1

BenchAlign evidence: Hy3 Preview not scored; Step 3.7 Flash estimated. Intervals and evidence labels describe ranking uncertainty, not a guarantee for a specific workload.

Evidence parity. Hy3 Preview and Step 3.7 Flash share 16 comparable benchmark results. 2 of 8 categories are comparable. 6 results are unique to Hy3 Preview; 14 to Step 3.7 Flash.

Updated July 16, 2026

Shared results: 16
Hy3 Preview only: 6
Step 3.7 Flash only: 14
Comparable categories: 2 / 8
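The parity counts above come from tallying which benchmarks have a reported value for both models, for one model only, or for neither. A minimal sketch of that tally, using only the Coding rows from the Benchmark Deep Dive on this page; the function name `evidence_parity` and the row layout are illustrative assumptions, not the site's actual pipeline:

```python
from typing import Optional

# Coding rows transcribed from this page as (Hy3 Preview, Step 3.7 Flash).
# None stands for a missing value in the table.
CODING_ROWS: dict[str, tuple[Optional[float], Optional[float]]] = {
    "SWE-bench Verified": (74.4, None),
    "Terminal-Bench 2.0": (54.4, 59.5),
    "SciCode": (41.2, None),
    "AA-SciCode": (47.6, 40.0),
    "AA Coding Index": (58.8, 39.6),
    "SWE-bench Pro": (None, 56.3),
    "Terminal-Bench Hard": (None, 35.6),
}


def evidence_parity(rows: dict[str, tuple[Optional[float], Optional[float]]]) -> dict[str, int]:
    """Count rows scored by both models, only the first, or only the second."""
    shared = sum(1 for a, b in rows.values() if a is not None and b is not None)
    only_a = sum(1 for a, b in rows.values() if a is not None and b is None)
    only_b = sum(1 for a, b in rows.values() if a is None and b is not None)
    return {"shared": shared, "only_a": only_a, "only_b": only_b}


print(evidence_parity(CODING_ROWS))
```

For the Coding rows this gives 3 shared results, 2 unique to Hy3 Preview, and 2 unique to Step 3.7 Flash. Running the same tally over every category table is how one would arrive at page-level totals such as 16 / 6 / 14.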

Pick Step 3.7 Flash if you want the stronger benchmark profile. Hy3 Preview only becomes the better choice if coding is the priority or you want the cheaper token bill.

Confidence note. This is a partial-evidence comparison with 16 shared benchmark results across 4 evidence categories; 2 of 8 categories currently have scoreable aggregates for both models. Treat the verdict as directional until coverage is more balanced.

Why this result

Step 3.7 Flash leads on the provisional aggregate, 57 to 53, a 4-point gap. The BenchAlign v5 lane at the top of the page is scored on a different basis and shows 53 to 50.76 in favor of Hy3 Preview, so read the two figures separately.

Step 3.7 Flash's sharpest advantage is in agentic, where it averages 66.4 against 54.4. The one benchmark flagged as a decisive driver is Terminal-Bench 2.0, 54.4% to 59.5% (a 5.1-point gap). Hy3 Preview does hit back in coding, so the answer changes if that is the part of the workload you care about most.

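Step 3.7 Flash is also the more expensive model on tokens at $0.20 input / $1.15 output per 1M tokens, versus a listed $0.00 input / $0.00 output per 1M tokens for Hy3 Preview. Because the Hy3 Preview price is zero, a cost ratio between the two is undefined; the meaningful figure is the absolute difference of $0.20 per 1M input tokens and $1.15 per 1M output tokens.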
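To make the pricing difference concrete, here is a minimal sketch that turns the listed per-1M-token prices into a bill and guards the ratio against a zero denominator. The prices are the ones listed on this page; the workload (50M input and 10M output tokens) and the names `Pricing`, `bill`, and `cost_ratio` are hypothetical and chosen only for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pricing:
    input_per_m: float   # USD per 1M input tokens
    output_per_m: float  # USD per 1M output tokens


# Listed prices from this page.
HY3_PREVIEW = Pricing(input_per_m=0.00, output_per_m=0.00)
STEP_37_FLASH = Pricing(input_per_m=0.20, output_per_m=1.15)


def bill(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Total cost in USD for a given token volume."""
    return (input_tokens * p.input_per_m + output_tokens * p.output_per_m) / 1_000_000


def cost_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


# Hypothetical monthly workload, not taken from the page.
step_cost = bill(STEP_37_FLASH, input_tokens=50_000_000, output_tokens=10_000_000)
hy3_cost = bill(HY3_PREVIEW, input_tokens=50_000_000, output_tokens=10_000_000)

print(f"Step 3.7 Flash: ${step_cost:.2f}")
print(f"Hy3 Preview:    ${hy3_cost:.2f}")
print("Ratio:", cost_ratio(step_cost, hy3_cost))
```

Returning `None` rather than dividing is a deliberate choice here: a zero listed price makes "N times more expensive" meaningless, so the caller should report the absolute difference instead.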

Category breakdown

Exact category averages are shown below. "Not measured" means BenchLM does not have enough sourced public coverage for that model and category.

Category scores and score margins for Hy3 Preview and Step 3.7 Flash
Category	Hy3 Preview	Δ	Step 3.7 Flash
Agentic	54.4	→ 12.0	66.4
Coding	57.8	← 1.5	56.3
Knowledge	33.8	No overlap	Not measured
Inst. Following	63.1	No overlap	Not measured
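The Δ column can be reproduced from the two category averages. The sketch below assumes, based on the rows above, that the arrow points at the leading model's column and that a missing average on either side yields "No overlap"; the function name `margin` is illustrative:

```python
from typing import Optional

# Category averages from the table above as (Hy3 Preview, Step 3.7 Flash).
# None stands for "Not measured".
CATEGORY_AVERAGES: dict[str, tuple[Optional[float], Optional[float]]] = {
    "Agentic": (54.4, 66.4),
    "Coding": (57.8, 56.3),
    "Knowledge": (33.8, None),
    "Inst. Following": (63.1, None),
}


def margin(left: Optional[float], right: Optional[float]) -> str:
    """Format the margin between two category averages.

    The arrow direction is an assumption read off the table:
    '→' means the right-hand model leads, '←' the left-hand model.
    """
    if left is None or right is None:
        return "No overlap"
    diff = round(right - left, 1)
    if diff > 0:
        return f"→ {diff:.1f}"
    if diff < 0:
        return f"← {-diff:.1f}"
    return "tie"


for category, (hy3, step) in CATEGORY_AVERAGES.items():
    print(f"{category}: {margin(hy3, step)}")
```

Only Agentic and Coding produce a numeric margin, which matches the "2 of 8 categories are comparable" statement earlier on the page.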

Decisive benchmark drivers

The largest measured benchmark gaps in this matchup, with exact reported values.

A · Hy3 Preview   B · Step 3.7 Flash

Terminal-Bench 2.0 (Agentic)
A 54.4%   B 59.5%
Winner: Step 3.7 Flash   Δ 5.1

Operational comparison

Runtime and commercial metrics are compared only when both models have a complete sourced value.

Metric	Hy3 Preview	Step 3.7 Flash	Comparison
Input / output price (USD per 1M tokens)	$0 input / $0 output	$0.20 input / $1.15 output	Hy3 Preview has the lower combined listed price.
Generation speed (tokens per second)	Not available	Not available	A complete speed comparison is not available.
First-answer latency (seconds to first token)	Not available	Not available	A complete latency comparison is not available.
Context window (maximum listed tokens)	256K	256K	Listed context windows are equal.

Benchmark Deep Dive

Agentic: Step 3.7 Flash wins

12 benchmarks

Benchmark	Hy3 Preview	Step 3.7 Flash	Result
Terminal-Bench 2.0	54.4%	59.5%	Step 3.7 Flash leads
Gert Labs	36.91%	51.57%	Step 3.7 Flash leads
AA Agentic Index	30.7%	21.5%	Hy3 Preview leads
GDPval-AA	35.7%	25.9%	Hy3 Preview leads
GDPval-AA	1214	1017	Hy3 Preview leads
BrowseComp	—	75.8%	Not comparable
DeepSearchQA	—	92.8%	Not comparable
Toolathlon	—	49.5%	Not comparable
Claw-Eval	—	67.1%	Not comparable
HLE w/ tools	—	47.2%	Not comparable
τ²-bench results	—	98.5%	Not comparable
APEX-Agents-AA	—	14.8%	Not comparable
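Which benchmark shows the widest gap depends on whether rows in different units are compared directly. The sketch below ranks only the shared Agentic rows and keeps percentages apart from the unitless GDPval-AA row (1214 vs 1017). The unit labels and the higher-is-better reading are assumptions, since the page does not state them, and `widest_gap` is an illustrative name:

```python
# Shared Agentic rows from the table above:
# (benchmark, unit, Hy3 Preview, Step 3.7 Flash).
# The "percent" / "score" labels are assumptions added for this example.
AGENTIC_SHARED = [
    ("Terminal-Bench 2.0", "percent", 54.4, 59.5),
    ("Gert Labs", "percent", 36.91, 51.57),
    ("AA Agentic Index", "percent", 30.7, 21.5),
    ("GDPval-AA", "percent", 35.7, 25.9),
    ("GDPval-AA", "score", 1214, 1017),
]


def widest_gap(rows, unit):
    """Return (benchmark, absolute gap, leader) for the largest gap in one unit.

    Assumes higher is better for every row passed in.
    """
    candidates = [r for r in rows if r[1] == unit]
    name, _, hy3, step = max(candidates, key=lambda r: abs(r[2] - r[3]))
    leader = "Hy3 Preview" if hy3 > step else "Step 3.7 Flash"
    return name, round(abs(hy3 - step), 2), leader


print(widest_gap(AGENTIC_SHARED, "percent"))
print(widest_gap(AGENTIC_SHARED, "score"))
```

Among the percentage rows the widest gap is Gert Labs (14.66 points, Step 3.7 Flash ahead); the unitless GDPval-AA row has a numerically larger gap (197, Hy3 Preview ahead) only because it sits on a different scale.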

Coding: Hy3 Preview wins

7 benchmarks

Benchmark	Hy3 Preview	Step 3.7 Flash	Result
SWE-bench Verified	74.4%	—	Not comparable
Terminal-Bench 2.0	54.4%	59.5%	Step 3.7 Flash leads
SciCode	41.2%	—	Not comparable
AA-SciCode	47.6%	40.0%	Hy3 Preview leads
AA Coding Index	58.8%	39.6%	Hy3 Preview leads
SWE-bench Pro	—	56.3%	Not comparable
Terminal-Bench Hard	—	35.6%	Not comparable

Reasoning

2 benchmarks

Benchmark	Hy3 Preview	Step 3.7 Flash	Result
AA-LCR	66.7%	63.7%	Hy3 Preview leads
CritPt	4.9%	2.3%	Hy3 Preview leads

Knowledge

9 benchmarks

Benchmark	Hy3 Preview	Step 3.7 Flash	Result
Artificial Analysis Intelligence Index	41.2%	30.3%	Hy3 Preview leads
GPQA	87.2%	—	Not comparable
GPQA-D	87.2%	—	Not comparable
HLE	25.5%	—	Not comparable
AA-Omniscience Accuracy	31.5%	25.4%	Hy3 Preview leads
AA-Omniscience Hallucination Rate	73.0%	84.4%	Hy3 Preview leads (lower is better)
AA-GPQA Diamond	89.7%	80.9%	Hy3 Preview leads
AA-HLE	31.6%	19.9%	Hy3 Preview leads
AA-Omniscience Index	-18.5%	-37.5%	Hy3 Preview leads

Multimodal

4 benchmarks

Benchmark	Hy3 Preview	Step 3.7 Flash	Result
SimpleVQA	—	79.2%	Not comparable
V*	—	95.3%	Not comparable
AA-MMMU-Pro	—	75.3%	Not comparable
Design Arena Website	—	1218	Not comparable

Inst. Following

2 benchmarks

Benchmark	Hy3 Preview	Step 3.7 Flash	Result
IFBench	63.1%	—	Not comparable
AA-IFBench	—	67.3%	Not comparable

Frequently Asked Questions (3)

Which is better, Hy3 Preview or Step 3.7 Flash?

Step 3.7 Flash is ahead on BenchLM's provisional leaderboard, 57 to 53. The benchmark flagged as the biggest single separator in this matchup is Terminal-Bench 2.0, where Hy3 Preview scores 54.4% and Step 3.7 Flash scores 59.5%.

Which is better for coding, Hy3 Preview or Step 3.7 Flash?

Hy3 Preview has the edge for coding in this comparison, averaging 57.8 versus 56.3. Inside this category, AA Coding Index creates the most daylight between them: 58.8% for Hy3 Preview versus 39.6% for Step 3.7 Flash.

Which is better for agentic tasks, Hy3 Preview or Step 3.7 Flash?

Step 3.7 Flash has the edge for agentic tasks in this comparison, averaging 66.4 versus 54.4. Inside this category, GDPval-AA is flagged as the benchmark with the most daylight between them, although that benchmark favors Hy3 Preview (35.7% versus 25.9%).


Last updated: July 16, 2026
