Measuring Short-Form Factuality in Large Language Models (SimpleQA)

A benchmark that evaluates the ability of language models to answer short, fact-seeking questions accurately, focusing on factual correctness rather than reasoning complexity.

About SimpleQA

Year: 2024
Tasks: Factual questions
Format: Short-form Q&A
Difficulty: Factual accuracy focused

SimpleQA prioritizes two key properties: questions should have short, factual answers that can be easily verified, and questions should be diverse and challenging. It serves as a crucial test of factual knowledge and accuracy.
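Scores of this kind are typically aggregated from per-question grades. As a minimal sketch (not the benchmark's official harness), the snippet below assumes each answer has already been graded as correct, incorrect, or not attempted, and computes overall accuracy, accuracy on attempted questions, and their harmonic mean:

```python
from collections import Counter

# Grade labels assumed for illustration; in practice the per-answer
# grading is done by a separate judge, not by this aggregation code.
CORRECT, INCORRECT, NOT_ATTEMPTED = "correct", "incorrect", "not_attempted"

def summarize(grades):
    """Aggregate per-question grades into summary metrics."""
    counts = Counter(grades)
    total = len(grades)
    attempted = counts[CORRECT] + counts[INCORRECT]
    overall = counts[CORRECT] / total if total else 0.0
    given_attempted = counts[CORRECT] / attempted if attempted else 0.0
    # Harmonic mean of the two rates, analogous to an F-score.
    denom = overall + given_attempted
    f = 2 * overall * given_attempted / denom if denom else 0.0
    return {"overall": overall, "given_attempted": given_attempted, "f": f}

grades = [CORRECT, CORRECT, INCORRECT, NOT_ATTEMPTED]
print(summarize(grades))
```

Separating "overall" from "given attempted" rewards models that abstain rather than guess: declining a question lowers overall accuracy but not accuracy on attempted questions.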

Measuring short-form factuality in large language models

Leaderboard (88 models)

#1 GPT-5.4: 95
#2 Gemini 3.1 Pro: 95
#3 Claude Opus 4.6: 95
#4 GPT-5.3 Codex: 95
#5 Grok 4.1: 95
#6 GPT-5.2: 95
#7 GPT-5.2-Codex: 95
#9 Claude Sonnet 4.6: 95
#10 Claude Opus 4.5: 95
#11 Gemini 3 Pro: 95
#13 GPT-5.1: 93
#14 GLM-5 (Reasoning): 92
#15 Claude Sonnet 4.5: 91
#17 GPT-5 (high): 89
#18 o1-preview: 88
#19 Kimi K2.5 (Reasoning): 88
#20 GPT-5 (medium): 87
#22 o3-pro: 86
#23 GPT-5 mini: 84
#24 o3: 84
#25 GLM-5: 84
#26 Grok 4: 83
#28 GLM-4.7: 82
#29 Qwen2.5-1M: 81
#30 Gemini 2.5 Pro: 81
#31 DeepSeek V3.2: 81
#32 Qwen2.5-72B: 80
#33 o4-mini (high): 80
#34 Qwen3.5 397B: 80
#35 DeepSeek Coder 2.0: 78
#36 DeepSeek LLM 2.0: 77
#37 DeepSeekMath V2: 77
#38 MiMo-V2-Flash: 76
#39 Claude 4.1 Opus: 74
#40 Kimi K2.5: 74
#41 Mistral Large 3: 73
#42 Claude 4 Sonnet: 71
#44 MiniMax M2.5: 70
#46 Gemini 3 Flash: 67
#47 Mistral Large 2: 66
#48 Claude Haiku 4.5: 65
#49 GPT-4o: 64
#50 Mistral 8x7B: 63
#51 Claude 3.5 Sonnet: 63
#52 GLM-4.7-Flash: 63
#53 Gemini 1.5 Pro: 62
#56 Gemini 1.0 Pro: 60
#58 Claude 3 Opus: 59
#59 GPT-4 Turbo: 58
#60 Llama 3 70B: 56
#61 Claude 3 Haiku: 54
#63 Nemotron-4 15B: 52
#64 Moonshot v1: 51
#65 Z-1: 50
#66 GPT-OSS 120B: 49
#67 Gemini 2.5 Flash: 48
#70 Llama 4 Scout: 45
#72 Gemma 3 27B: 43
#73 DeepSeek-R1: 42
#74 Qwen2.5-VL-32B: 41
#76 Nova Pro: 39
#78 Qwen3 235B 2507: 37
#80 GLM-4.5: 35
#81 MiniMax M1 80k: 34
#82 GLM-4.5-Air: 33
#84 DeepSeek V3.1: 31
#85 Kimi K2: 30
#86 GPT-OSS 20B: 29
#87 Mistral 7B v0.3: 28
#88 Mistral 8x7B v0.2: 27

FAQ

What does SimpleQA measure?

SimpleQA evaluates the ability of language models to answer short, fact-seeking questions accurately, focusing on factual correctness rather than reasoning complexity.

Which model scores highest on SimpleQA?

GPT-5.4 by OpenAI currently leads with a score of 95 on SimpleQA.

How many models are evaluated on SimpleQA?

88 AI models have been evaluated on SimpleQA on BenchLM.