Measuring Short-Form Factuality in Large Language Models (SimpleQA)

A benchmark that evaluates the ability of language models to answer short, fact-seeking questions accurately, focusing on factual correctness rather than reasoning complexity.

Top models on SimpleQA — May 13, 2026

As of May 13, 2026, DeepSeek V4 Pro (Max) leads the SimpleQA leaderboard with 57.9%, followed by DeepSeek V4 Pro Base (55.2%) and DeepSeek V4 Pro (High) (46.2%).

8 models · Knowledge · 13% of category score · Refreshing · Updated May 13, 2026

According to BenchLM.ai, DeepSeek V4 Pro (Max) leads the SimpleQA benchmark with a score of 57.9%, followed by DeepSeek V4 Pro Base (55.2%) and DeepSeek V4 Pro (High) (46.2%). Scores range from 57.9% down to 23.1%, a spread wide enough to make this benchmark effective at differentiating model capabilities.

8 models have been evaluated on SimpleQA. The benchmark falls in the Knowledge category, which carries a 12% weight in BenchLM.ai's overall scoring system; within that category, SimpleQA contributes 13% of the category score, so strong performance here directly affects a model's overall ranking.
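
If the category weight and the within-category share simply multiply, SimpleQA's effective contribution to a model's overall score works out to roughly 1.6%. That multiplicative combination is an assumption (the exact formula lives on the BenchLM methodology page); a minimal sketch in Python:

    # Effective weight of SimpleQA on the overall score, assuming the
    # category weight and within-category share simply multiply
    # (an assumption; see the BenchLM methodology page for the real formula).
    category_weight = 0.12   # Knowledge category's weight in the overall score
    benchmark_share = 0.13   # SimpleQA's share of the Knowledge category score

    effective_weight = category_weight * benchmark_share
    print(f"SimpleQA effective weight: {effective_weight:.2%}")  # -> 1.56%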

About SimpleQA

Year: 2024
Tasks: Factual questions
Format: Short-form Q&A
Difficulty: Factual accuracy focused

SimpleQA prioritizes two key properties: questions should have short, factual answers that can be easily verified, and questions should be diverse and challenging. It serves as a crucial test of factual knowledge and accuracy.

BenchLM freshness & provenance

Version: SimpleQA 2024
Refresh cadence: Annual
Staleness state: Refreshing
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
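
The full policy is documented on the methodology page; as a rough, hypothetical sketch of the idea, the staleness state could map to how a benchmark is treated in scoring (only "Refreshing" appears on this page, so the other state names below are placeholders):

    # Hypothetical mapping from staleness state to how a benchmark is treated.
    # Only "Refreshing" is shown on this page; the other state names and the
    # mapping itself are assumptions, not BenchLM's published policy.
    ROLE_BY_STALENESS = {
        "Refreshing": "strong differentiator",
        "Aging": "benchmark to watch",        # placeholder state name
        "Stale": "display-only reference",    # placeholder state name
    }

    def benchmark_role(state: str) -> str:
        return ROLE_BY_STALENESS.get(state, "display-only reference")

    print(benchmark_role("Refreshing"))  # -> strong differentiator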

Leaderboard (8 models)

1. DeepSeek V4 Pro (Max): 57.9%
2. DeepSeek V4 Pro Base: 55.2%
3. DeepSeek V4 Pro (High): 46.2%
4. 45%
5. 34.1%
6. 30.1%
7. 28.9%
8. 23.1%
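
As a quick sanity check on the spread mentioned above, the gap between the top and bottom scores is 34.8 percentage points:

    # Spread across the SimpleQA leaderboard, using the eight scores above (%).
    scores = [57.9, 55.2, 46.2, 45.0, 34.1, 30.1, 28.9, 23.1]

    spread = max(scores) - min(scores)
    mean = sum(scores) / len(scores)

    print(f"Range: {spread:.1f} percentage points")  # -> 34.8
    print(f"Mean score: {mean:.1f}%")                # -> 40.1%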

FAQ

What does SimpleQA measure?

SimpleQA evaluates the ability of language models to answer short, fact-seeking questions accurately, focusing on factual correctness rather than reasoning complexity.

Which model scores highest on SimpleQA?

DeepSeek V4 Pro (Max) by DeepSeek currently leads with a score of 57.9% on SimpleQA.

How many models are evaluated on SimpleQA?

8 AI models have been evaluated on SimpleQA on BenchLM.

Last updated: May 13, 2026 · BenchLM version SimpleQA 2024
