Measuring Short-Form Factuality in Large Language Models (SimpleQA)

A benchmark that evaluates the ability of language models to answer short, fact-seeking questions accurately, focusing on factual correctness rather than reasoning complexity.

Top models on SimpleQA — May 13, 2026

As of May 13, 2026, DeepSeek V4 Pro (Max) leads the SimpleQA leaderboard with 57.9%, followed by DeepSeek V4 Pro Base (55.2%) and DeepSeek V4 Pro (High) (46.2%).

8 models · Knowledge · 13% of category score · Refreshing · Updated May 13, 2026

According to BenchLM.ai, DeepSeek V4 Pro (Max) leads the SimpleQA benchmark with a score of 57.9%, followed by DeepSeek V4 Pro Base (55.2%) and DeepSeek V4 Pro (High) (46.2%). Scores range from 57.9% down to 23.1%, a spread wide enough to make this benchmark effective at differentiating model capabilities.

8 models have been evaluated on SimpleQA. The benchmark falls in the Knowledge category, which carries a 12% weight in BenchLM.ai's overall scoring system; within that category, SimpleQA contributes 13% of the category score, so strong performance here directly affects a model's overall ranking.
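
If the category weight and the within-category share simply multiply, SimpleQA's effective contribution to a model's overall score works out to roughly 1.6%. That multiplicative combination is an assumption (the exact formula lives on the BenchLM methodology page); a minimal sketch in Python:

    # Effective weight of SimpleQA on the overall score, assuming the
    # category weight and within-category share simply multiply
    # (an assumption; see the BenchLM methodology page for the real formula).
    category_weight = 0.12   # Knowledge category's weight in the overall score
    benchmark_share = 0.13   # SimpleQA's share of the Knowledge category score

    effective_weight = category_weight * benchmark_share
    print(f"SimpleQA effective weight: {effective_weight:.2%}")  # -> 1.56%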

About SimpleQA

Year: 2024
Tasks: Factual questions
Format: Short-form Q&A
Difficulty: Factual accuracy focused

SimpleQA prioritizes two key properties: questions should have short, factual answers that can be easily verified, and questions should be diverse and challenging. It serves as a crucial test of factual knowledge and accuracy.

BenchLM freshness & provenance

Version: SimpleQA 2024
Refresh cadence: Annual
Staleness state: Refreshing
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
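
The full policy is documented on the methodology page; as a rough, hypothetical sketch of the idea, the staleness state could map to how a benchmark is treated in scoring (only "Refreshing" appears on this page, so the other state names below are placeholders):

    # Hypothetical mapping from staleness state to how a benchmark is treated.
    # Only "Refreshing" is shown on this page; the other state names and the
    # mapping itself are assumptions, not BenchLM's published policy.
    ROLE_BY_STALENESS = {
        "Refreshing": "strong differentiator",
        "Aging": "benchmark to watch",        # placeholder state name
        "Stale": "display-only reference",    # placeholder state name
    }

    def benchmark_role(state: str) -> str:
        return ROLE_BY_STALENESS.get(state, "display-only reference")

    print(benchmark_role("Refreshing"))  # -> strong differentiator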

Leaderboard (8 models)

1. DeepSeek V4 Pro (Max): 57.9%
2. DeepSeek V4 Pro Base: 55.2%
3. DeepSeek V4 Pro (High): 46.2%
4. 45%
5. 34.1%
6. 30.1%
7. 28.9%
8. 23.1%
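
As a quick sanity check on the spread mentioned above, the gap between the top and bottom scores is 34.8 percentage points:

    # Spread across the SimpleQA leaderboard, using the eight scores above (%).
    scores = [57.9, 55.2, 46.2, 45.0, 34.1, 30.1, 28.9, 23.1]

    spread = max(scores) - min(scores)
    mean = sum(scores) / len(scores)

    print(f"Range: {spread:.1f} percentage points")  # -> 34.8
    print(f"Mean score: {mean:.1f}%")                # -> 40.1%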

FAQ

What does SimpleQA measure?

SimpleQA evaluates the ability of language models to answer short, fact-seeking questions accurately, focusing on factual correctness rather than reasoning complexity.

Which model scores highest on SimpleQA?

DeepSeek V4 Pro (Max) by DeepSeek currently leads with a score of 57.9% on SimpleQA.

How many models are evaluated on SimpleQA?

8 AI models have been evaluated on SimpleQA on BenchLM.

Last updated: May 13, 2026 · BenchLM version SimpleQA 2024
