Benchmark profile

Grade School Math 8K (GSM8K)

A grade-school mathematical reasoning benchmark reported in DeepSeek-V4 base-model evaluations.

Data verified July 23, 2026

Benchmark score on GSM8K — July 23, 2026

BenchLM mirrors the published score view for GSM8K. DeepSeek V4 Pro Base leads the public snapshot at 92.6%, followed by DeepSeek V4 Flash Base (90.8%) and Soofi S 30B-A3B (86.1%). BenchLM does not use these results to rank models overall.

1. DeepSeek V4 Pro Base · DeepSeek · Open · deepseek-v4-pro-base · 92.6% · Overall: n/a · Context: 1M

2. DeepSeek V4 Flash Base · DeepSeek · Open · deepseek-v4-flash-base · 90.8% · Overall: n/a · Context: 1M

3. Soofi S 30B-A3B · Soofi Project · Open · soofi-s-30b-a3b · 86.1% · Overall: n/a · Context: 1M

3 models · Math · Current · Display only · Updated July 23, 2026

Benchmark score table (3 models)

Model · Provider · License · Score

DeepSeek V4 Pro Base · DeepSeek · Open weight · 92.6%

DeepSeek V4 Flash Base · DeepSeek · Open weight · 90.8%

Soofi S 30B-A3B · Soofi Project · Open weight · 86.1%

The published GSM8K snapshot places DeepSeek V4 Pro Base first at 92.6%. DeepSeek V4 Flash Base trails by 1.8 points (90.8%), and third-place Soofi S 30B-A3B trails by 6.5 points (86.1%). Only three results are published, so the full range is 6.5 points, a relatively narrow band.
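The point gaps in this snapshot can be recomputed directly from the three published scores. The following is a minimal sketch; the function name `gaps_to_leader` is illustrative and not part of any BenchLM tooling.

```python
# Scores (percent) copied from the published GSM8K snapshot on this page.
scores = {
    "DeepSeek V4 Pro Base": 92.6,
    "DeepSeek V4 Flash Base": 90.8,
    "Soofi S 30B-A3B": 86.1,
}


def gaps_to_leader(scores):
    """Return the leading model and each model's gap to it, in points.

    Hypothetical helper for illustration only.
    """
    leader = max(scores, key=scores.get)
    top = scores[leader]
    # Round to one decimal to match the precision of the published scores.
    gaps = {name: round(top - value, 1) for name, value in scores.items()}
    return leader, gaps


leader, gaps = gaps_to_leader(scores)
for name, gap in gaps.items():
    print(f"{name}: {gap} points behind {leader}")
```

Rounding to one decimal place keeps the computed gaps at the same precision as the scores themselves and hides floating-point noise from the subtraction.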

3 models have been evaluated on GSM8K. The benchmark falls in the Math category. This category carries a 5% weight in BenchLM.ai's overall scoring system. GSM8K is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About GSM8K

Year

2026

Tasks

Grade-school math word problems

Format

Exact match

Difficulty

Grade-school math
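
As an illustration of the exact-match format, the sketch below scores free-text answers by comparing the last number in each model output with the last number in the reference solution. It assumes reference solutions end with a `#### <answer>` line, as in the public GSM8K dataset. The helper names are hypothetical, and the extraction rule is an assumption; the harness behind the published scores may normalize answers differently.

```python
import re
from decimal import Decimal

# Integers or decimals, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in `text` as a Decimal, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    # Drop thousands separators so "1,250" and "1250" compare equal.
    return Decimal(matches[-1].replace(",", ""))


def exact_match_accuracy(predictions, references):
    """Percentage of predictions whose final number equals the reference's."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equal-length, non-empty lists")
    correct = 0
    for prediction, reference in zip(predictions, references):
        predicted = extract_final_number(prediction)
        expected = extract_final_number(reference)
        if predicted is not None and predicted == expected:
            correct += 1
    return 100.0 * correct / len(references)


# Made-up examples for illustration, not items from the benchmark.
predictions = [
    "She sells 9 eggs at $2 each, so she makes $18.",
    "The total is 1,250 pages.",
    "I think the answer is 41",
]
references = ["9 * 2 = 18\n#### 18", "#### 1250", "#### 42"]
accuracy = exact_match_accuracy(predictions, references)
print(f"{accuracy:.1f}%")
```

Comparing `Decimal` values rather than raw strings treats `18` and `18.0` as the same answer, which is one possible design choice; a stricter harness could require identical strings.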

BenchLM stores GSM8K as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations.

DeepSeek-V4 Technical Report

BenchLM freshness & provenance

Version

GSM8K 2026

Refresh cadence

Quarterly

Staleness state

Current

Question availability

Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does GSM8K measure?

GSM8K measures grade-school mathematical reasoning using math word problems scored by exact match. The scores shown here are those reported in DeepSeek-V4 base-model evaluations.

Which model scores highest on GSM8K?

DeepSeek V4 Pro Base by DeepSeek currently leads with a score of 92.6% on GSM8K.

How many models are evaluated on GSM8K?

BenchLM lists three AI models evaluated on GSM8K.


Last updated: July 23, 2026 · BenchLM version GSM8K 2026
