GPQA tests AI models with graduate-level questions in biology, physics, and chemistry that are 'Google-proof' — even skilled non-experts with internet access can't answer them. Here's how it works.
GPQA Diamond is a benchmark of 198 PhD-level science questions in biology, physics, and chemistry. Human domain experts average 81% — top AI models now score 95-97%. It is the standard test for "can this model reason at a graduate science level?" in 2026.
GPQA (Graduate-Level Google-Proof Q&A) is a benchmark of 448 multiple-choice questions written by PhD-level domain experts in biology, physics, and chemistry. The questions are specifically designed so that even skilled non-experts with full internet access struggle to answer them.
If you can Google the answer, it's not a GPQA question.
Most knowledge benchmarks test recall — can the model regurgitate facts it learned during training? GPQA tests whether a model can apply deep domain expertise to novel questions. The difference is critical for evaluating AI models intended for scientific research, medical applications, or advanced engineering.
Each question was created through a rigorous multi-step process:
- Written by an expert who holds or is pursuing a PhD in the relevant domain.
- Checked by a second expert in the same domain for correctness and objectivity.
- Attempted by skilled non-experts (PhDs in other fields) with unrestricted web access; questions those non-experts could reliably answer were filtered out.
This "Google-proof" design means GPQA scores reflect genuine understanding, not just memorization or search ability.
GPQA Diamond is the hardest subset of GPQA, consisting of 198 questions that were specifically selected for maximum difficulty and expert agreement. When researchers reference "GPQA" in model evaluations, they usually mean the Diamond subset. The questions are so hard that domain experts — people with PhDs in the relevant field — only achieve about 81% accuracy. Non-experts with internet access score around 22%, below the 25% random-guess baseline for four-option multiple-choice questions.
This means a model scoring 95% on GPQA Diamond is outperforming the average human PhD in that domain. That's a remarkable capability threshold.
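For readers who want to sanity-check a reported number themselves, the sketch below shows one way to score a model on GPQA Diamond. It assumes the dataset is the one published on Hugging Face as `Idavidrein/gpqa` with a `gpqa_diamond` config and columns named `Question`, `Correct Answer`, and `Incorrect Answer 1`–`3` (verify these against the dataset card), and that `ask_model` is your own wrapper around whatever model you are testing.

```python
# Minimal GPQA Diamond scoring sketch. Dataset ID, config name, and column
# names are assumptions -- check the dataset card before relying on this.
import random
from datasets import load_dataset

def ask_model(prompt: str) -> str:
    """Placeholder: call your model and return a single letter A-D."""
    raise NotImplementedError

def score_gpqa_diamond(seed: int = 0) -> float:
    rows = load_dataset("Idavidrein/gpqa", "gpqa_diamond", split="train")
    rng = random.Random(seed)
    correct = 0
    for row in rows:
        options = [row["Correct Answer"],
                   row["Incorrect Answer 1"],
                   row["Incorrect Answer 2"],
                   row["Incorrect Answer 3"]]
        rng.shuffle(options)                    # randomize answer order per question
        letters = "ABCD"
        gold = letters[options.index(row["Correct Answer"])]
        prompt = (row["Question"] + "\n" +
                  "\n".join(f"{l}. {o}" for l, o in zip(letters, options)) +
                  "\nAnswer with a single letter.")
        if ask_model(prompt).strip().upper().startswith(gold):
            correct += 1
    return correct / len(rows)                  # 0.25 is the random-guess baseline
```

Because answer order is shuffled per question, rerunning with a different seed is a quick check that a score isn't an artifact of option ordering.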
GPQA biology questions cover molecular biology, genetics, biochemistry, and ecology at a level expected of late-stage PhD students. Questions might require understanding protein folding mechanisms, gene regulatory networks, or evolutionary dynamics that aren't covered in standard textbooks.
Physics questions span quantum mechanics, general relativity, condensed matter, and particle physics. These aren't textbook problems — they often require combining concepts from multiple subfields or reasoning about novel experimental setups.
Chemistry questions cover organic synthesis, computational chemistry, thermodynamics, and spectroscopy. Many require predicting reaction outcomes or interpreting experimental data in ways that demand deep mechanistic understanding.
According to BenchLM.ai, the top models on GPQA are:
| Rank | Model | Score (%) |
|---|---|---|
| 1 | GPT-5.4 | 97 |
| 2 | GPT-5.3 Codex | 97 |
| 3 | Claude Opus 4.6 | 95 |
| 4 | GPT-5.2 | 95 |
Full rankings: GPQA leaderboard
GPQA scores are getting compressed at the top, similar to MMLU. For even harder questions, see SuperGPQA (285 disciplines) and HLE (frontier expert questions where top models score 10-46%).
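A quick back-of-envelope calculation shows why those top scores are hard to tell apart: on a 198-question test, sampling error alone spans several points. The sketch below uses the normal approximation to the binomial; it is an illustration, not part of BenchLM.ai's methodology.

```python
# Rough 95% confidence interval for an accuracy measured on 198 questions,
# using the normal approximation to the binomial (illustrative only).
import math

def approx_ci(accuracy: float, n: int = 198, z: float = 1.96) -> tuple[float, float]:
    se = math.sqrt(accuracy * (1 - accuracy) / n)   # standard error of a proportion
    return accuracy - z * se, accuracy + z * se

for acc in (0.95, 0.97):
    lo, hi = approx_ci(acc)
    print(f"{acc:.0%} on 198 questions -> roughly {lo:.1%} to {hi:.1%}")
# 95% -> roughly 92.0% to 98.0%; 97% -> roughly 94.6% to 99.4%
# The intervals overlap, so a 2-point gap at the top is within noise.
```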
As GPQA approaches saturation, two successor benchmarks have emerged:
| Benchmark | Subjects | Questions | Top score (%) | Score range (%) | Expert baseline |
|---|---|---|---|---|---|
| GPQA Diamond | 3 | 198 | 97 | 80-97 | ~81% |
| SuperGPQA | 285 | 1,000+ | 95 | 55-95 | Varies |
| HLE | 100+ | 3,000+ | 46 | 10-46 | ~74% |
SuperGPQA expands the scope from 3 subjects to 285 graduate disciplines, including law, economics, computer science, and humanities. This broader coverage reduces the chance that a model happens to be strong in exactly the tested domains. SuperGPQA scores show more variance between models, making it a better discriminator for current frontier models.
HLE (Humanity's Last Exam) is the hardest of the three — top models score below 50%. HLE uses questions crowdsourced from thousands of domain experts worldwide, targeting the absolute frontier of human knowledge. Read our deep dive on HLE.
Even though GPQA is approaching saturation at the top, it remains valuable for several reasons:
Historical comparison: GPQA has the longest track record among expert-level knowledge benchmarks. Comparing a new model's GPQA score to historical data provides meaningful context.
Mid-tier model evaluation: While frontier models all score 90+, mid-tier and open-weight models still show significant variance on GPQA. It's useful for evaluating models like Llama 4, Qwen 3, and Mistral variants.
Domain-specific insights: Because GPQA covers three specific science domains, it can reveal whether a model is stronger in physics than biology. This matters if you're using AI for domain-specific research.
Methodology benchmark: The "Google-proof" validation methodology GPQA pioneered has become the gold standard for creating contamination-resistant benchmarks. Understanding GPQA helps you evaluate the quality of newer benchmarks.
If your use case involves scientific reasoning — drug discovery, materials science, academic research — here's how to interpret GPQA scores (a code sketch of these thresholds follows below):
- 90 or above: PhD-level reasoning in biology, physics, and chemistry; suitable for research assistance and scientific hypothesis generation.
- Below 75: not recommended for expert-level applications.
For the most demanding scientific applications, don't rely on GPQA alone. Check HLE scores for frontier reasoning and SuperGPQA for breadth across many disciplines.
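Here is that guidance as a small helper, for teams that gate model choices in code. The 90 and 75 cutoffs come from this article; the wording of the middle band is an assumption, not a BenchLM.ai rubric.

```python
# Interpret a GPQA Diamond score using the thresholds discussed above.
# The middle band's wording is illustrative, not from BenchLM.ai.
def interpret_gpqa(score: float) -> str:
    if score >= 90:
        return "PhD-level reasoning; suitable for research assistance and hypothesis generation"
    if score >= 75:
        return "capable, but verify domain-critical outputs with an expert"  # assumed middle band
    return "not recommended for expert-level scientific applications"

print(interpret_gpqa(95))
```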
→ See all models ranked on the full leaderboard · Knowledge rankings
GPQA remains the standard reference for "can this model reason at a PhD level?" For finer distinctions between frontier models, pair it with SuperGPQA and HLE.
What is GPQA Diamond? GPQA Diamond is a benchmark of 198 multiple-choice questions in biology, physics, and chemistry, written and validated by PhD-level experts. Non-experts with internet access score around 22%, below the 25% random-guess baseline. It measures genuine graduate-level scientific reasoning.
What score does GPT-5.4 get on GPQA Diamond? As of March 2026, GPT-5.4 scores 97 on GPQA Diamond, tied with GPT-5.3 Codex. Claude Opus 4.6 and GPT-5.2 both score 95. See the GPQA leaderboard for current rankings.
What makes GPQA harder than MMLU? MMLU covers general academic subjects at varying difficulty — frontier models score 97-99%. GPQA focuses on three science domains where even non-experts with internet access average only 22%. The questions require deep mechanistic understanding, not searchable recall.
Which model scores highest on GPQA Diamond? GPT-5.4 and GPT-5.3 Codex both score 97 as of March 2026. For better discrimination between frontier models, use SuperGPQA or HLE.
What is the difference between GPQA, SuperGPQA, and HLE? GPQA Diamond: 3 science domains, 198 questions, top scores 95-97. SuperGPQA: 285 disciplines, 1,000+ questions, better spread. HLE: crowdsourced expert questions, top models score only 10-46%.
What does a GPQA score above 90 mean? A score above 90 on GPQA Diamond indicates PhD-level reasoning in biology, physics, and chemistry. Suitable for research assistance and scientific hypothesis generation. Below 75 is not recommended for expert-level applications.
Data from BenchLM.ai. Last updated March 2026.