Benchmark profile

AGIEval

A human-centric exam benchmark for general knowledge and reasoning reported in DeepSeek-V4 base-model evaluations.

Data verified July 23, 2026

Benchmark score on AGIEval — July 23, 2026

BenchLM mirrors the published score view for AGIEval. DeepSeek V4 Pro Base leads the public snapshot at 83.1%, followed by DeepSeek V4 Flash Base (82.6%) and Soofi S 30B-A3B (66.9%). BenchLM does not use these results to rank models overall.

Rank · Model (ID) · Provider · Weights · Score · Overall · Context
1 · DeepSeek V4 Pro Base (deepseek-v4-pro-base) · DeepSeek · Open · 83.1% · n/a · 1M
2 · DeepSeek V4 Flash Base (deepseek-v4-flash-base) · DeepSeek · Open · 82.6% · n/a · 1M
3 · Soofi S 30B-A3B (soofi-s-30b-a3b) · Soofi Project · Open · 66.9% · n/a · 1M

3 models · Knowledge · Current · Display only · Updated July 23, 2026

Benchmark score table (3 models)

Model · Provider · Weights · Score
DeepSeek V4 Pro Base · DeepSeek · Open weight · 83.1%
DeepSeek V4 Flash Base · DeepSeek · Open weight · 82.6%
Soofi S 30B-A3B · Soofi Project · Open weight · 66.9%

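The published AGIEval snapshot places DeepSeek V4 Pro Base first at 83.1%. DeepSeek V4 Flash Base follows 0.5 points behind at 82.6%, and the third row, Soofi S 30B-A3B, is 16.2 points behind the leader at 66.9%. Only three models are listed, so the full range of the table is also 16.2 points, which still separates the published systems.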
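The gap figures above are plain subtraction on the published percentages. As a minimal sketch, the snippet below recomputes them from the three scores in the table. The helper name `gaps_to_leader` is made up for this illustration and is not part of any BenchLM tooling. `Decimal` is used so the one-decimal scores subtract exactly.

```python
from decimal import Decimal

# Scores copied from the published AGIEval snapshot (percent).
scores = {
    "DeepSeek V4 Pro Base": Decimal("83.1"),
    "DeepSeek V4 Flash Base": Decimal("82.6"),
    "Soofi S 30B-A3B": Decimal("66.9"),
}


def gaps_to_leader(table):
    """Return (leader, {model: points behind the leader}), best score first."""
    ranked = sorted(table.items(), key=lambda item: item[1], reverse=True)
    leader, top = ranked[0]
    return leader, {name: top - score for name, score in ranked}


leader, gaps = gaps_to_leader(scores)
# Full spread of the table: best score minus worst score.
spread = max(scores.values()) - min(scores.values())

for name, gap in gaps.items():
    print(f"{name}: {gap} points behind {leader}")
print(f"Full spread: {spread} points")
```

With three rows, the spread and the third-place gap are the same number (16.2 points), which is why the two figures coincide in the text.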

Three models have been evaluated on AGIEval, which falls in BenchLM's Knowledge category. That category carries a 12% weight in BenchLM.ai's overall scoring system, but AGIEval itself is currently displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.
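To illustrate what "display only" means in practice, here is a rough sketch of a category score that skips display-only benchmarks. BenchLM's actual formula is not given here, so everything in the snippet is an assumption: the `BenchmarkResult` type, the `category_score` helper, the use of a plain mean, and the placeholder second benchmark with its score are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    name: str
    score: float        # percent, 0-100
    display_only: bool  # True = shown on the page, ignored when scoring


def category_score(results):
    """Mean of the scoring benchmarks, or None if all are display-only.

    The plain mean is an assumption made for illustration only.
    """
    scoring = [r.score for r in results if not r.display_only]
    if not scoring:
        return None
    return sum(scoring) / len(scoring)


knowledge = [
    BenchmarkResult("AGIEval", 83.1, display_only=True),
    # Hypothetical placeholder benchmark and score, not a real result.
    BenchmarkResult("placeholder-knowledge-benchmark", 70.0, display_only=False),
]

print(category_score(knowledge))
```

Under these assumptions, changing the AGIEval value leaves the category score untouched, which matches the statement that the benchmark does not directly affect overall rankings.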

About AGIEval

Year

2026

Tasks

General academic and professional exam questions

Format

Exact match

Difficulty

General knowledge

BenchLM stores AGIEval as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations.

DeepSeek-V4 Technical Report
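The "Exact match" format listed above generally means a prediction counts only if it equals the reference answer. The sketch below shows one common way to compute such a score. The normalization (trim, lowercase, collapse whitespace) is an assumption and may differ from what the DeepSeek-V4 evaluations used, and the sample answers are hypothetical, not real AGIEval items.

```python
def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse internal whitespace (assumed normalization)."""
    return " ".join(answer.strip().lower().split())


def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Hypothetical multiple-choice answers, not real AGIEval items.
predicted = ["B", " c ", "A", "D"]
gold = ["B", "C", "D", "D"]

# Three of the four answers match after normalization.
print(f"{exact_match_accuracy(predicted, gold):.1f}%")  # 75.0%
```

Exact match gives no partial credit, so small differences in answer formatting can change the score. That is one reason the normalization step matters when comparing published numbers.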

BenchLM freshness & provenance

Version

AGIEval 2026

Refresh cadence

Quarterly

Staleness state

Current

Question availability

Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does AGIEval measure?

A human-centric exam benchmark for general knowledge and reasoning reported in DeepSeek-V4 base-model evaluations.

Which model scores highest on AGIEval?

DeepSeek V4 Pro Base by DeepSeek currently leads with a score of 83.1% on AGIEval.

How many models are evaluated on AGIEval?

BenchLM currently lists three AI models evaluated on AGIEval.

Compare Top Models on AGIEval

DeepSeek V4 Pro Base vs DeepSeek V4 Flash Base
DeepSeek V4 Flash Base vs Soofi S 30B-A3B

Last updated: July 23, 2026 · BenchLM version AGIEval 2026
