Model profile

Step 3.7 Flash

Name: Step 3.7 Flash
Author: StepFun

StepFun · Current · Released May 29, 2026

Data verified July 16, 2026

Overall Score

Unranked

Arena Elo

Not listed

Categories Ranked

1 of 8

Price (1M tokens)

$0.2 in / $1.15 out

API pricing
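
As a rough illustration of what the listed prices mean per request, the sketch below applies them to a hypothetical workload. The token counts are made up for the example, and it assumes any reasoning tokens are billed as output tokens, which this profile does not state.

```python
# Prices listed on this profile, in US dollars per 1M tokens.
INPUT_PRICE_PER_M = 0.20
OUTPUT_PRICE_PER_M = 1.15


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the listed per-token prices."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical request: 100,000 prompt tokens and 8,000 generated tokens.
cost = estimate_cost_usd(100_000, 8_000)
print(f"${cost:.4f}")  # prints $0.0292
```

Because output tokens cost almost six times as much as input tokens here, long reasoning traces are likely to dominate the bill for short prompts.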

Speed

Not listed

Context

256K

Evidence coverage

30 of 313 tracked benchmarks are published. 12 are verified and 18 provisional. 6 of 8 categories are measured.

Updated July 16, 2026 · Methodology

Published / tracked: 30 / 313
Verified: 12
Provisional: 18
Categories measured: 6 / 8
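
The coverage figures above can be cross-checked against the per-category benchmark counts listed on this page. The snippet below is only a consistency check on the published numbers, not BenchLM's own tooling; the variable names are illustrative.

```python
# Figures copied from this profile.
published, tracked = 30, 313
verified, provisional = 12, 18

# Benchmark rows per category, as listed on this page.
category_counts = {
    "Agentic": 12, "Coding": 5, "Reasoning": 2, "Knowledge": 6,
    "Math": 0, "Multilingual": 0, "Multimodal": 4, "Inst. Following": 1,
}

rows_total = sum(category_counts.values())
categories_measured = sum(1 for n in category_counts.values() if n > 0)
coverage_pct = 100 * published / tracked

print(f"{coverage_pct:.1f}% of tracked benchmarks published")  # prints 9.6% ...
print(f"{categories_measured} of {len(category_counts)} categories measured")
print(verified + provisional == published, rows_total == published)
```

Both checks hold: the verified and provisional counts add up to 30, and so do the per-category rows.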

Agentic: 12 benchmarks, mixed evidence
Coding: 5 benchmarks, mixed evidence
Reasoning: 2 benchmarks, reported
Knowledge: 6 benchmarks, reported
Math: 0 benchmarks, not measured
Multilingual: 0 benchmarks, not measured
Multimodal: 4 benchmarks, mixed evidence
Inst. Following: 1 benchmark, reported

Open Weight · Self-host · Reasoning

Confidence: Low

base

BenchLM is tracking Step 3.7 Flash, but this profile is currently excluded from the public leaderboard because it still lacks enough non-generated benchmark coverage to rank safely. Only non-generated public benchmark rows appear below.

Step 3.7 Flash is an open-weight model with a 256K-token context window. It uses explicit chain-of-thought reasoning, which typically improves performance on math and complex reasoning tasks at the cost of higher latency and token usage.

BenchLM links it directly to Step 3.5 Flash as the earlier related model in that lineage. This profile currently has 30 of 313 tracked benchmarks. BenchLM only exposes non-generated benchmark rows publicly, so missing categories stay blank until a sourced evaluation is available.

Agentic is its strongest and only ranked category (#46 of 121, 63rd percentile). On the current evidence, its most relevant use cases are coding agents, browser research, and computer-use workflows.

Peer position

Exact provisional scores and ranks for the closest listed peers.

Range 56.0–58.0

Model	Author	Rank	Score
Qwen3.6-35B-A3B	Alibaba	#48	58.0
MAI-Thinking-1	Microsoft	#49	58.0
MiMo-V2-Flash	Xiaomi	#50	57.0
GPT-4.1	OpenAI	#51	56.0
Step 3.7 Flash (current model)	StepFun	Unranked	57.0
DeepSeek V4 Flash	DeepSeek	Unranked	57.0
Gemma 4 26B A4B	Google	Unranked	56.0

Category percentile

Relative position among models eligible for each sourced category. A higher percentile means a stronger position within that category's ranked cohort; 100 is highest.

Agentic: 63%
Eligible cohort rank: #46 of 121 · Category score: 56.9
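
The page does not state how the percentile is derived from the cohort rank. One plausible reading, which happens to reproduce the displayed 63% for rank #46 of 121, is the share of the cohort ranked at or below the model. The sketch below works under that assumption; `cohort_percentile` is a hypothetical helper, not a BenchLM function.

```python
def cohort_percentile(rank: int, cohort_size: int) -> int:
    """Share of the cohort ranked at or below `rank`, as a whole percent.

    Assumed formula, chosen because it matches the value shown on this page;
    rank 1 maps to 100.
    """
    if not 1 <= rank <= cohort_size:
        raise ValueError("rank must be between 1 and cohort_size")
    return round(100 * (cohort_size - rank + 1) / cohort_size)


print(cohort_percentile(46, 121))  # prints 63
print(cohort_percentile(1, 121))   # prints 100
```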

Category evidence

Scores and ranks appear only where this model has published benchmark evidence. Categories without displayable source records remain not measured.

Category scores, ranks, weighting, benchmark coverage, and evidence status
Category	Score	Rank	Percentile	Weight	Benchmarks	Evidence
Agentic	56.9	#46 of 121	63rd	22%	12 benchmarks	Mixed sources
Coding	59.1	Not ranked	Not available	20%	5 benchmarks	Mixed sources
Reasoning	0.0	Not ranked	Not available	17%	2 benchmarks	Reported
Knowledge	0.0	Not ranked	Not available	12%	6 benchmarks	Reported
Math	Not measured	Not ranked	Not available	5%	0 benchmarks	Not measured
Multilingual	Not measured	Not ranked	Not available	7%	0 benchmarks	Not measured
Multimodal	0.0	Not ranked	Not available	12%	4 benchmarks	Mixed sources
Inst. Following	0.0	Not ranked	Not available	5%	1 benchmark	Reported

Benchmark Details

Rows below have a displayable published verification record. Each source link and provenance note remains in the page HTML while its category is closed. Source-unverified manual rows and generated rows stay hidden.

Agentic: 12 benchmarks

Terminal-Bench 2.0 (Provider exact)

59.5% · Weighted 38%

Source: StepFun: Step 3.7 Flash
Provenance: StepFun reports Step 3.7 Flash at 59.5 on Terminal-Bench 2.1.

BrowseComp (Provider exact)

75.8% · Weighted 28%

Source: StepFun: Step 3.7 Flash
Provenance: StepFun reports Step 3.7 Flash at 75.82 on BrowseComp.

DeepSearchQA (Provider exact)

92.8% · Display only

Source: StepFun: Step 3.7 Flash
Provenance: StepFun reports Step 3.7 Flash at 92.82 F1 on DeepSearchQA.

GDPval-AA (Provider exact)

GDPval-AA normalized

25.9% · Display only

Source: StepFun: Step 3.7 Flash
Provenance: StepFun reports Step 3.7 Flash at 45.8 on GDPVal-AA / GDPval across 44 occupations. BenchLM stores the percentage-style value on the normalized GDPval-AA row and leaves the Elo row empty.

Toolathlon (Provider exact)

49.5% · Display only

Source: StepFun: Step 3.7 Flash
Provenance: StepFun reports Step 3.7 Flash at 49.5 on Toolathlon.

Claw-Eval (Provider exact)

67.1% · Display only

Source: StepFun: Step 3.7 Flash
Provenance: StepFun reports Step 3.7 Flash at 67.1 on ClawEval-1.1.

HLE w/ tools (Provider exact)

Humanity's Last Exam with tools

47.2% · Display only

Source: StepFun: Step 3.7 Flash
Provenance: StepFun reports Step 3.7 Flash at 47.20 on HLE with Tools in the launch page search section and comparison chart. BenchLM does not map the separate Hugging Face 48.1 HLE row because its setup is ambiguous.

Gert Labs (Benchmark exact)

Gert Labs Composite Game Benchmark

51.57% · Display only

Source: Gert Labs rankings
Provenance: Gert Labs reports this composite leaderboard score in the public rankings API. BenchLM scales the source gscore from 0-1 to 0-100 and stores it as a display-only agentic benchmark.

AA Agentic Index (Reported)

Artificial Analysis Agentic Index

21.5% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

τ²-bench results (Reported)

τ²-Bench Tool-Agent-User Evaluation

98.5% · Display only

Source: Artificial Analysis: tau2-bench leaderboard
Provenance: Display-only row synced from the current Artificial Analysis evaluation leaderboard. It is excluded from BenchLM weighted scoring.

GDPval-AA (Reported)

1017 · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

APEX-Agents-AA (Reported)

14.8% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Coding: 5 benchmarks

SWE-bench Pro (Provider exact)

56.3% · Weighted 10%

Source: StepFun: Step 3.7 Flash
Provenance: StepFun reports Step 3.7 Flash at 56.3 on SWE-Bench Pro.

Terminal-Bench 2.0 (Provider exact)

59.5% · Display only

Source: StepFun: Step 3.7 Flash
Provenance: StepFun reports Step 3.7 Flash at 59.5 on Terminal-Bench 2.1. BenchLM stores it on the existing Terminal-Bench display key in both coding and agentic views.

AA Coding Index (Reported)

Artificial Analysis Coding Index

39.6% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Terminal-Bench Hard (Reported)

35.6% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-SciCode (Reported)

Artificial Analysis SciCode

40.0% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Reasoning: 2 benchmarks

AA-LCR (Reported)

Artificial Analysis Long Context Reasoning

63.7% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

CritPt (Reported)

Critical Physics Tasks

2.3% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Knowledge: 6 benchmarks

Artificial Analysis Intelligence Index (Reported)

30.3% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-GPQA Diamond (Reported)

Artificial Analysis GPQA Diamond

80.9% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-HLE (Reported)

Artificial Analysis Humanity's Last Exam

19.9% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-Omniscience Index (Reported)

Artificial Analysis Omniscience Index

-37.5% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-Omniscience Accuracy (Reported)

Artificial Analysis Omniscience Accuracy

25.4% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

AA-Omniscience Hallucination Rate (Reported)

Artificial Analysis Omniscience Hallucination Rate

84.4% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.
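
The three AA-Omniscience rows appear to be related, which would explain the negative index. The sketch below assumes, and this page does not confirm, that the index is the share of correct answers minus the share of incorrect answers, and that the hallucination rate is the share of non-correct responses that were wrong answers rather than abstentions. Under those assumptions the published accuracy and hallucination rate land close to the published index.

```python
# Values from the rows above, in percent.
accuracy = 25.4
hallucination_rate = 84.4

# Assumed definitions (not stated on this page):
#   hallucination rate = incorrect / (incorrect + abstained)
#   omniscience index  = correct - incorrect
not_correct = 100 - accuracy
incorrect = not_correct * hallucination_rate / 100
abstained = not_correct - incorrect
index = accuracy - incorrect

print(f"incorrect {incorrect:.1f}%, abstained {abstained:.1f}%, index {index:.1f}")
```

The result is within about 0.1 of the -37.5 shown on the page, a gap that rounding of the two inputs could account for. Read this way, the model answers wrongly far more often than it declines to answer when it does not know.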

Multimodal: 4 benchmarks

SimpleVQA (Provider exact)

79.2% · Display only

Source: StepFun: Step 3.7 Flash
Provenance: StepFun reports Step 3.7 Flash at 79.2 on SimpleVQA with tool/search.

V* (Provider exact)

95.3% · Display only

Source: StepFun: Step 3.7 Flash
Provenance: StepFun reports Step 3.7 Flash at 95.3 on V* with Python tool.

AA-MMMU-Pro (Reported)

Artificial Analysis MMMU-Pro

75.3% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Design Arena Website (Reported)

Design Arena Website Elo

1218 · Display only

Source: OpenRouter model benchmarks
Provenance: Display-only Design Arena Website Elo synced from OpenRouter model benchmark metadata. It is excluded from BenchLM weighted scoring.

Inst. Following: 1 benchmark

AA-IFBench (Reported)

Artificial Analysis IFBench

67.3% · Display only

Source: Artificial Analysis model benchmarks
Provenance: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Step 3.7 Flash Family

Base entry

Related Earlier Model

Step 3.5 Flash

Frequently Asked Questions

How does Step 3.7 Flash perform overall in AI benchmarks?

Step 3.7 Flash has 30 published benchmark scores on BenchLM, but it does not yet have enough non-generated coverage to receive a global overall rank.

Is Step 3.7 Flash good for knowledge and understanding?

Step 3.7 Flash has visible benchmark coverage in knowledge and understanding, but BenchLM does not currently assign it a global category rank there.

Is Step 3.7 Flash good for coding and programming?

Step 3.7 Flash has visible benchmark coverage in coding and programming, but BenchLM does not currently assign it a global category rank there.

Is Step 3.7 Flash good for reasoning and logic?

Step 3.7 Flash has visible benchmark coverage in reasoning and logic, but BenchLM does not currently assign it a global category rank there.

Is Step 3.7 Flash good for agentic tool use and computer tasks?

Step 3.7 Flash ranks #46 out of 121 models in agentic tool use and computer tasks benchmarks, with a category score of 56.9. There are stronger options in this category.

Is Step 3.7 Flash good for multimodal and grounded tasks?

Step 3.7 Flash has visible benchmark coverage in multimodal and grounded tasks, but BenchLM does not currently assign it a global category rank there.

Is Step 3.7 Flash good for instruction following?

Step 3.7 Flash has visible benchmark coverage in instruction following, but BenchLM does not currently assign it a global category rank there.

Is Step 3.7 Flash open source?

Step 3.7 Flash is an open-weight model created by StepFun, meaning its weights can be downloaded and run locally or fine-tuned for specific use cases.

Does Step 3.7 Flash have full benchmark coverage on BenchLM?

Not yet. Step 3.7 Flash currently has 30 published benchmark scores out of the 313 benchmarks BenchLM tracks. BenchLM only exposes non-generated public benchmark rows, so missing categories stay blank until a sourced evaluation is available.

What is the context window size of Step 3.7 Flash?

Step 3.7 Flash has a context window of 256K tokens, which determines how much text it can process in a single interaction.


Last updated: July 16, 2026 · Runtime metrics stay blank until BenchLM has a sourced snapshot.
