
Measuring Short-Form Factuality in Large Language Models (SimpleQA)

SimpleQA is a benchmark that evaluates the ability of language models to answer short, fact-seeking questions accurately. It focuses on factual correctness rather than reasoning complexity.

How BenchLM shows SimpleQA right now

BenchLM is tracking SimpleQA in the local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.

These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.

111 tracked models · Local tracked rows · Awaiting exact-source attachments · Display only

Tracked score on SimpleQA — April 20, 2026

BenchLM mirrors the published tracked score view for SimpleQA. GPT-5.4 leads the public snapshot at 97%, tied with GPT-5.2 Pro (97%) and followed by GPT-5.3 Instant (96%). BenchLM does not use these results to rank models overall.

111 models · Knowledge · 13% of category score · Refreshing · Updated April 20, 2026

The published SimpleQA snapshot is tightly clustered at the top: GPT-5.4 sits at 97%, and the third-ranked model trails by just 1.0 point. The broader top-10 spread is 2.0 points, so the published scores sit in a relatively narrow band.

111 models have been evaluated on SimpleQA. The benchmark falls in the Knowledge category, which carries a 12% weight in BenchLM.ai's overall scoring system; within that category, SimpleQA nominally contributes 13% of the category score. While the benchmark remains in its display-only state, however, these tracked scores do not feed into overall rankings.
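
To make the weighting concrete: a benchmark score's nominal contribution to the overall score is the product of the category weight and the benchmark's share of that category. The sketch below just multiplies out the figures quoted on this page; the function name and structure are illustrative, not BenchLM's actual scoring code.

```python
# Minimal sketch of how a SimpleQA score would roll up into an overall
# score, using the weights quoted on this page. Illustrative only; this
# is not BenchLM's actual implementation.

KNOWLEDGE_CATEGORY_WEIGHT = 0.12   # Knowledge category's share of the overall score
SIMPLEQA_BENCHMARK_WEIGHT = 0.13   # SimpleQA's share of the Knowledge category

def overall_contribution(simpleqa_score: float) -> float:
    """Nominal points a SimpleQA score adds to the overall score (0-100 scale)."""
    return simpleqa_score * KNOWLEDGE_CATEGORY_WEIGHT * SIMPLEQA_BENCHMARK_WEIGHT

# A 97% SimpleQA score would nominally contribute about 1.51 overall points.
print(f"{overall_contribution(97.0):.2f}")  # 1.51
```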

About SimpleQA

Year: 2024
Tasks: Factual questions
Format: Short-form Q&A
Difficulty: Factual accuracy focused

SimpleQA prioritizes two key properties: questions should have short, factual answers that can be easily verified, and questions should be diverse and challenging. It serves as a crucial test of factual knowledge and accuracy.

BenchLM freshness & provenance

Version: SimpleQA 2024
Refresh cadence: Annual
Staleness state: Refreshing
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
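
The full policy lives on the methodology page, but the decision it describes can be sketched compactly. In the sketch below, the three tier labels come from the paragraph above, while the staleness states and the mapping between them are assumptions for illustration, not BenchLM's documented rules.

```python
# Hedged sketch of freshness-based tiering. Tier labels are from this page;
# the states and the mapping are assumptions, not BenchLM's documented policy.

from enum import Enum

class StalenessState(Enum):
    FRESH = "fresh"
    REFRESHING = "refreshing"   # SimpleQA's current state on this page
    STALE = "stale"

def benchmark_tier(state: StalenessState, fully_verified: bool) -> str:
    """Map freshness metadata to one of the three tiers named above."""
    if not fully_verified:
        # Rows awaiting exact-source attachments are shown for reference only.
        return "display-only reference"
    if state is StalenessState.FRESH:
        return "strong differentiator"
    return "benchmark to watch"

print(benchmark_tier(StalenessState.REFRESHING, fully_verified=False))
# -> display-only reference (SimpleQA's current treatment on this page)
```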

Tracked score table (111 models)

1. GPT-5.4 (gpt-5-4): 97%
2. GPT-5.2 Pro (gpt-5-2-pro): 97%
3. GPT-5.3 Instant (gpt-5-3-instant): 96%
4. GPT-5.2-Codex (gpt-5-2-codex): 95%
5. GPT-5.3 Codex (gpt-5-3-codex): 95%
6. Grok 4.1 (grok-4-1): 95%
7. Gemini 3 Pro Deep Think (gemini-3-pro-deep-think): 95%
8. Gemini 3.1 Pro (gemini-3-1-pro): 95%
9. GPT-5.2 (gpt-5-2): 95%
10. Gemini 3 Pro (gemini-3-pro): 95%
11. Claude Opus 4.5 (claude-opus-4-5): 95%
12. GPT-5.2 Instant (gpt-5-2-instant): 95%
13. GPT-5.1-Codex-Max (gpt-5-1-codex-max): 94%
14. GPT-5.3-Codex-Spark (gpt-5-3-codex-spark): 94%
15. GPT-5.1 (gpt-5-1): 93%
16. GLM-5 (Reasoning) (glm-5-reasoning): 92%
17. Claude Sonnet 4.5 (claude-sonnet-4-5): 91%
18. Grok 4.1 Fast (grok-4-1-fast): 90%
19. GPT-5 (high) (gpt-5-high): 89%
20. (name missing): 88%
21. GPT-5 (medium) (gpt-5-medium): 87%
22. Qwen3.5 397B (Reasoning) (qwen3-5-397b-reasoning): 87%
23. (name missing): 86%
24. GLM-5.1 (glm-5-1): 84%
25. (name missing): 84%
26. GLM-5 (glm-5): 84%
27. GPT-5 mini (gpt-5-mini): 84%
28. Grok 4 (grok-4): 83%
29. DeepSeek V3.2 (Thinking) (deepseek-v3-2-thinking): 83%
30. Step 3.5 Flash (step-3-5-flash): 82%
31. Mercury 2 (mercury-2): 82%
32. Qwen2.5-1M (qwen2-5-1m): 81%
33. Gemini 2.5 Pro (gemini-2-5-pro): 81%
34. DeepSeek V3.2 (deepseek-v3-2): 81%
35. Qwen3.5 397B (qwen3-5-397b): 80%
36. Qwen2.5-72B (qwen2-5-72b): 80%
37. o4-mini (high) (o4-mini-high): 80%
38. DeepSeek Coder 2.0 (deepseek-coder-2-0): 78%
39. DeepSeekMath V2 (deepseekmath-v2): 77%
40. DeepSeek LLM 2.0 (deepseek-llm-2-0): 77%
41. MiMo-V2-Flash (mimo-v2-flash): 76%
42. Aion-2.0 (aion-2-0): 76%
43. Kimi K2.5 (kimi-k2-5): 74%
44. Claude 4.1 Opus (claude-4-1-opus): 74%
45. Claude Opus 4.6 (claude-opus-4-6): 72%
46. Nemotron 3 Ultra 500B (nemotron-3-ultra-500b): 71%
47. Claude 4 Sonnet (claude-4-sonnet): 71%
48. Ministral 3 14B (Reasoning) (ministral-3-14b-reasoning): 70%
49. MiniMax M2.5 (minimax-m2-5): 70%
50. Seed 1.6 (seed-1-6): 69%
51. Llama 3.1 405B (llama-3-1-405b): 68%
52. Seed-2.0-Lite (seed-2-0-lite): 68%
53. Gemini 3 Flash (gemini-3-flash): 67%
54. Mistral Large 2 (mistral-large-2): 66%
55. Ministral 3 14B (ministral-3-14b): 66%
56. Claude Haiku 4.5 (claude-haiku-4-5): 65%
57. GPT-4o (gpt-4o): 64%
58. Nemotron 3 Super 120B A12B (nemotron-3-super-120b-a12b): 64%
59. Claude 3.5 Sonnet (claude-3-5-sonnet): 63%
60. Mistral 8x7B (mistral-8x7b): 63%
61. GLM-4.7-Flash (glm-4-7-flash): 63%
62. Nemotron 3 Super 100B (nemotron-3-super-100b): 62%
63. Gemini 1.5 Pro (gemini-1-5-pro): 62%
64. Grok Code Fast 1 (grok-code-fast-1): 61%
65. Seed 1.6 Flash (seed-1-6-flash): 61%
66. Gemini 3.1 Flash-Lite (gemini-3-1-flash-lite): 60%
67. Gemini 1.0 Pro (gemini-1-0-pro): 60%
68. Claude 3 Opus (claude-3-opus): 59%
69. Seed-2.0-Mini (seed-2-0-mini): 59%
70. GPT-4 Turbo (gpt-4-turbo): 58%
71. Llama 3 70B (llama-3-70b): 56%
72. Qwen3 235B 2507 (qwen3-235b-2507): 54.3%
73. Kimi K2.5 (Reasoning) (kimi-k2-5-reasoning): 54%
74. Claude 3 Haiku (claude-3-haiku): 54%
75. Nemotron 3 Nano 30B (nemotron-3-nano-30b): 54%
76. Nemotron-4 15B (nemotron-4-15b): 52%
77. Moonshot v1 (moonshot-v1): 51%
78. Z-1 (z-1): 50%
79. GPT-OSS 120B (gpt-oss-120b): 49%
80. Claude Sonnet 4.6 (claude-sonnet-4-6): 48.5%
81. Gemini 2.5 Flash (gemini-2-5-flash): 48%
82. Nemotron Ultra 253B (nemotron-ultra-253b): 47%
83. GLM-4.7 (glm-4-7): 46%
84. Llama 4 Behemoth (llama-4-behemoth): 46%
85. Llama 4 Scout (llama-4-scout): 45%
86. Llama 4 Maverick (llama-4-maverick): 44%
87. LFM2-24B-A2B (lfm2-24b-a2b): 44%
88. Grok 3 [Beta] (grok-3-beta): 43.6%
89. Gemma 3 27B (gemma-3-27b): 43%
90. Nova Pro (nova-pro): 39%
91. Qwen3 235B 2507 (Reasoning) (qwen3-235b-2507-reasoning): 38%
92. Claude 4.1 Opus Thinking (claude-4-1-opus-thinking): 36%
93. GLM-4.5 (glm-4-5): 35%
94. MiniMax M1 80k (minimax-m1-80k): 34%
95. GLM-4.5-Air (glm-4-5-air): 33%
96. DeepSeek V3.1 (Reasoning) (deepseek-v3-1-reasoning): 32%
97. Ministral 3 8B (Reasoning) (ministral-3-8b-reasoning): 32%
98. Kimi K2 (kimi-k2): 31%
99. DeepSeek V3.1 (deepseek-v3-1): 31%
100. DeepSeek-R1 (deepseek-r1): 30.1%
101. GPT-OSS 20B (gpt-oss-20b): 29%
102. LFM2.5-1.2B-Thinking (lfm2-5-1-2b-thinking): 29%
103. Mistral 7B v0.3 (mistral-7b-v0-3): 28%
104. Ministral 3 8B (ministral-3-8b): 28%
105. Mistral 8x7B v0.2 (mistral-8x7b-v0-2): 27%
106. Ministral 3 3B (Reasoning) (ministral-3-3b-reasoning): 26%
107. DeepSeek V3 (deepseek-v3): 24.9%
108. Mistral Large 3 (mistral-large-3): 24%
109. LFM2.5-1.2B-Instruct (lfm2-5-1-2b-instruct): 24%
110. Ministral 3 3B (ministral-3-3b): 22%
111. Grok 3 Mini (grok-3-mini): 21.7%
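
For readers who want to spot-check the headline figures quoted earlier, the snippet below recomputes the leader score and the top-10 spread from a manual transcription of the first ten rows; the list literal is hand-copied from the table above, not a BenchLM export.

```python
# Recompute the snapshot statistics quoted earlier from the top ten rows
# of the tracked score table. Hand-transcribed; not a BenchLM export.

top10 = [
    ("GPT-5.4", 97.0), ("GPT-5.2 Pro", 97.0), ("GPT-5.3 Instant", 96.0),
    ("GPT-5.2-Codex", 95.0), ("GPT-5.3 Codex", 95.0), ("Grok 4.1", 95.0),
    ("Gemini 3 Pro Deep Think", 95.0), ("Gemini 3.1 Pro", 95.0),
    ("GPT-5.2", 95.0), ("Gemini 3 Pro", 95.0),
]

scores = [score for _, score in top10]
print(f"leader: {max(scores):.0f}%")                             # 97%
print(f"top-10 spread: {max(scores) - min(scores):.1f} points")  # 2.0 points
```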

FAQ

What does SimpleQA measure?

SimpleQA evaluates the ability of language models to answer short, fact-seeking questions accurately, focusing on factual correctness rather than reasoning complexity.

Which model leads the published SimpleQA snapshot?

GPT-5.4 currently leads the published SimpleQA snapshot with a tracked score of 97%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on SimpleQA?

111 AI models are included in BenchLM's mirrored SimpleQA snapshot, based on the public leaderboard captured on April 20, 2026.

Last updated: April 20, 2026 · mirrored from the public benchmark leaderboard
