Graduate-Level Google-Proof Q&A (GPQA)

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Designed to be difficult even for skilled non-experts with access to Google.

Top models on GPQA — April 20, 2026

As of April 20, 2026, Claude Mythos Preview leads the GPQA leaderboard with 94.5%, followed by Claude Opus 4.7 (94.2%) and GPT-5.4 (92.8%).

36 models · Knowledge category · 12% of category score · Refreshing · Updated April 20, 2026

According to BenchLM.ai, Claude Mythos Preview leads the GPQA benchmark with a score of 94.5%, followed by Claude Opus 4.7 (94.2%) and GPT-5.4 (92.8%). The top models are clustered within 1.7 points, suggesting this benchmark is nearing saturation for frontier models.

36 models have been evaluated on GPQA. The benchmark falls in the Knowledge category, which carries a 12% weight in BenchLM.ai's overall scoring system; within that category, GPQA contributes 12% of the category score, so strong performance here directly affects a model's overall ranking.
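To make the weighting concrete, GPQA's effective share of a model's overall score can be sketched as below. The two 12% figures come from this page; the multiplicative roll-up (category weight times within-category weight) is an assumption about how BenchLM composes scores, not a documented formula:

```python
# Sketch of how a benchmark score might roll up into an overall ranking.
# The 12% weights are from this page; the multiplicative roll-up is an
# assumption, not BenchLM's documented formula.
category_weight = 0.12   # Knowledge category's share of the overall score
benchmark_weight = 0.12  # GPQA's share of the Knowledge category score

effective_weight = category_weight * benchmark_weight
print(f"GPQA's share of the overall score: {effective_weight:.2%}")

gpqa_score = 94.5  # Claude Mythos Preview's GPQA score from this page
contribution = gpqa_score * effective_weight
print(f"Contribution to the overall score: {contribution:.2f} points")
```

Under that assumption, a 1-point gain on GPQA moves a model's overall score by only about 0.014 points, which is why a benchmark this close to saturation is a weak differentiator on its own.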

About GPQA

Year: 2023
Tasks: 448 questions
Format: Multiple-choice questions
Difficulty: Graduate level

GPQA questions are crafted by PhD-level domain experts and validated to be answerable by experts but challenging for non-experts even with internet access. This makes it an excellent test of deep scientific knowledge and reasoning.

BenchLM freshness & provenance

Version: GPQA Diamond
Refresh cadence: Static
Staleness state: Refreshing
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (36 models)

1. Claude Mythos Preview: 94.5%
2. Claude Opus 4.7: 94.2%
3. GPT-5.4: 92.8%
4. 92.4%
5. 91.3%
6. 90.5%
7. 90.4%
8. 89.9%
9. 88.4%
10. 88%
11. 87.6%
12. 87.6%
13. 87%
14. 86.6%
15. 86%
16. 86%
17. 85.7%
18. 85.5%
19. 84.3%
20. 84.2%
21. 83.7%
22. 83.4%
23. 83%
24. 82.8%
25. 79%
26. 77.5%
27. 77.2%
28. 75.7%
29. 66.3%
30. 64.2%
31. 59.4%
32. 59.1%
33. 58.6%
34. 50.3%
35. 43.4%
36. 25.7%

FAQ

What does GPQA measure?

GPQA measures deep scientific knowledge and reasoning through 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are designed to be difficult even for skilled non-experts with access to Google.

Which model scores highest on GPQA?

Claude Mythos Preview by Anthropic currently leads with a score of 94.5% on GPQA.

How many models are evaluated on GPQA?

36 AI models have been evaluated on GPQA on BenchLM.

Last updated: April 20, 2026 · BenchLM version GPQA Diamond
