Graduate-Level Google-Proof Q&A (GPQA)

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. It is designed to be difficult even for skilled non-experts with access to Google.

About GPQA

Year: 2023
Tasks: 448 questions
Format: Multiple-choice questions
Difficulty: Graduate level

GPQA questions are crafted by PhD-level domain experts and validated to be answerable by experts but challenging for non-experts even with internet access. This makes it an excellent test of deep scientific knowledge and reasoning.

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Leaderboard (88 models)

#1 GPT-5.4: 97
#2 Gemini 3.1 Pro: 97
#3 Claude Opus 4.6: 97
#4 GPT-5.3 Codex: 97
#5 Grok 4.1: 97
#6 GPT-5.2: 97
#7 GPT-5.2-Codex: 97
#9 Claude Sonnet 4.6: 97
#10 Claude Opus 4.5: 97
#11 Gemini 3 Pro: 97
#13 GPT-5.1: 95
#14 GLM-5 (Reasoning): 94
#15 Claude Sonnet 4.5: 93
#17 GPT-5 (high): 91
#18 o1-preview: 90
#19 Kimi K2.5 (Reasoning): 90
#20 GPT-5 (medium): 89
#21 o3-pro: 89
#23 o3: 87
#24 GPT-5 mini: 86
#25 Grok 4: 86
#26 GLM-5: 86
#28 GLM-4.7: 84
#29 Qwen2.5-1M: 83
#30 Gemini 2.5 Pro: 83
#31 DeepSeek V3.2: 83
#32 Qwen2.5-72B: 82
#33 o4-mini (high): 82
#34 Qwen3.5 397B: 82
#35 DeepSeek Coder 2.0: 79
#36 DeepSeekMath V2: 79
#37 DeepSeek LLM 2.0: 78
#38 MiMo-V2-Flash: 78
#39 Claude 4.1 Opus: 76
#40 Kimi K2.5: 76
#41 Mistral Large 3: 75
#42 Claude 4 Sonnet: 73
#44 MiniMax M2.5: 72
#46 Gemini 3 Flash: 69
#47 Mistral Large 2: 68
#48 Claude Haiku 4.5: 67
#49 GPT-4o: 66
#50 Claude 3.5 Sonnet: 65
#51 GLM-4.7-Flash: 65
#52 Mistral 8x7B: 64
#53 Gemini 1.5 Pro: 64
#56 Gemini 1.0 Pro: 62
#58 Claude 3 Opus: 61
#59 GPT-4 Turbo: 60
#60 Llama 3 70B: 58
#61 Claude 3 Haiku: 56
#63 Nemotron-4 15B: 53
#64 Moonshot v1: 52
#65 Z-1: 51
#66 GPT-OSS 120B: 50
#67 Gemini 2.5 Flash: 49
#70 Llama 4 Scout: 46
#72 Gemma 3 27B: 44
#73 DeepSeek-R1: 43
#74 Qwen2.5-VL-32B: 42
#76 Nova Pro: 40
#78 Qwen3 235B 2507: 38
#80 GLM-4.5: 36
#81 MiniMax M1 80k: 35
#82 GLM-4.5-Air: 34
#84 DeepSeek V3.1: 32
#85 Kimi K2: 31
#86 GPT-OSS 20B: 30
#87 Mistral 7B v0.3: 29
#88 Mistral 8x7B v0.2: 28

FAQ

What does GPQA measure?

GPQA measures deep scientific knowledge and reasoning. It consists of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry, designed to be difficult even for skilled non-experts with access to Google.

Which model scores highest on GPQA?

GPT-5.4 by OpenAI currently ranks first on GPQA with a score of 97. Nine other models on the leaderboard are also listed at 97.

How many models are evaluated on GPQA?

BenchLM has evaluated 88 AI models on GPQA.