A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Designed to be difficult even for skilled non-experts with access to Google.
As of April 20, 2026, Claude Mythos Preview leads the GPQA leaderboard with 94.5%, followed by Claude Opus 4.7 (94.2%) and GPT-5.4 (92.8%).
Claude Mythos Preview (Anthropic)
Claude Opus 4.7 (Anthropic)
GPT-5.4 (OpenAI)
According to BenchLM.ai, Claude Mythos Preview leads the GPQA benchmark with a score of 94.5%, followed by Claude Opus 4.7 (94.2%) and GPT-5.4 (92.8%). The top three models are clustered within 1.7 percentage points of each other, suggesting this benchmark is nearing saturation for frontier models.
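The 1.7-point figure can be reproduced from the three scores quoted above. The snippet below is a minimal sketch: the function names are hypothetical, and treating "spread among the leaders" and "headroom to 100%" as saturation signals is an illustrative assumption, not BenchLM.ai's stated method.

```python
# Scores quoted in the text above (percent correct on GPQA).
top_scores = {
    "Claude Mythos Preview": 94.5,
    "Claude Opus 4.7": 94.2,
    "GPT-5.4": 92.8,
}


def spread(scores: dict[str, float]) -> float:
    """Gap in percentage points between the best and worst listed score."""
    values = list(scores.values())
    return round(max(values) - min(values), 1)


def headroom(scores: dict[str, float]) -> float:
    """Points remaining between the best listed score and a perfect 100%."""
    return round(100.0 - max(scores.values()), 1)


if __name__ == "__main__":
    print(f"spread among top models: {spread(top_scores)} points")
    print(f"headroom to 100%: {headroom(top_scores)} points")
```

A small spread combined with little headroom is one informal way to read "nearing saturation": the leaders have little room left to separate from each other.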
36 models have been evaluated on GPQA. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. Within that category, GPQA contributes 12% of the category score, so strong performance here directly affects a model's overall ranking.
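To make the two weights concrete, the sketch below multiplies them to estimate how much of the overall score GPQA accounts for. It assumes the category weight and the within-category share combine as a simple product, which the text implies but does not state; the function names are hypothetical.

```python
# Both weights are taken from the text above.
KNOWLEDGE_CATEGORY_WEIGHT = 0.12  # Knowledge category's weight in the overall score
GPQA_SHARE_OF_CATEGORY = 0.12     # GPQA's share of the Knowledge category score


def effective_weight(category_weight: float, benchmark_share: float) -> float:
    """Fraction of the overall score attributable to one benchmark.

    Assumption: weights combine multiplicatively.
    """
    return category_weight * benchmark_share


def overall_points(score_pct: float, category_weight: float, benchmark_share: float) -> float:
    """Overall-score points contributed by a benchmark score given in percent."""
    return score_pct * effective_weight(category_weight, benchmark_share)


if __name__ == "__main__":
    weight = effective_weight(KNOWLEDGE_CATEGORY_WEIGHT, GPQA_SHARE_OF_CATEGORY)
    print(f"effective GPQA weight: {weight:.2%}")
    points = overall_points(94.5, KNOWLEDGE_CATEGORY_WEIGHT, GPQA_SHARE_OF_CATEGORY)
    print(f"points from a 94.5% GPQA score: {points:.4f}")
```

Under this assumption GPQA accounts for roughly 1.44% of the overall score, so a 94.5% result would contribute about 1.36 points on a 100-point scale.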
Year: 2023
Tasks: 448 questions
Format: Multiple choice questions
Difficulty: Graduate level
GPQA questions are crafted by PhD-level domain experts and validated to be answerable by experts but challenging for non-experts even with internet access. This makes it an excellent test of deep scientific knowledge and reasoning.
Version: GPQA Diamond
Refresh cadence: Static
Staleness state: Refreshing
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Claude Mythos Preview by Anthropic currently leads with a score of 94.5% on GPQA.
36 AI models have been evaluated on GPQA on BenchLM.