
GPQA Diamond: The PhD-Level Science Benchmark

GPQA tests AI models with graduate-level questions in biology, physics, and chemistry that are 'Google-proof' — even skilled non-experts with internet access can't answer them. Here's how it works.

Glevd · Published March 7, 2026 · 10 min read


GPQA Diamond is a benchmark of 198 PhD-level science questions in biology, physics, and chemistry. Human domain experts average 81% — top AI models now score 95-97%. It is the standard test for "can this model reason at a graduate science level?" in 2026.

GPQA (Graduate-Level Google-Proof Q&A) is a benchmark of 448 multiple-choice questions written by PhD-level domain experts in biology, physics, and chemistry. The questions are specifically designed so that even skilled non-experts with full internet access struggle to answer them.

If you can Google the answer, it's not a GPQA question.

What makes GPQA different

Most knowledge benchmarks test recall — can the model regurgitate facts it learned during training? GPQA tests whether a model can apply deep domain expertise to novel questions. The difference is critical for evaluating AI models intended for scientific research, medical applications, or advanced engineering.

Each question was created through a rigorous process:

  1. Domain experts write questions that require specialized graduate-level knowledge
  2. Other experts validate that the answer is correct and unambiguous
  3. Non-experts attempt the questions with full internet access — if non-experts can answer them, the questions are filtered out

This "Google-proof" design means GPQA scores reflect genuine understanding, not just memorization or search ability.
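The validation pipeline above amounts to a two-sided filter: keep a question only if experts get it right and skilled non-experts, even with internet access, get it wrong. A minimal sketch of that logic, with illustrative field names and thresholds (not from the actual GPQA release):

```python
# Sketch of GPQA's "Google-proof" filter. Field names and thresholds
# are illustrative assumptions, not the benchmark's published values.

def is_google_proof(expert_acc: float, nonexpert_acc: float,
                    expert_min: float = 0.5,
                    nonexpert_max: float = 0.5) -> bool:
    """Keep a question only if experts tend to answer correctly
    while non-experts with internet access tend to fail."""
    return expert_acc >= expert_min and nonexpert_acc < nonexpert_max

questions = [
    {"id": "q1", "expert_acc": 0.9, "nonexpert_acc": 0.2},  # kept
    {"id": "q2", "expert_acc": 0.9, "nonexpert_acc": 0.8},  # dropped: Googleable
    {"id": "q3", "expert_acc": 0.3, "nonexpert_acc": 0.2},  # dropped: experts fail too
]

kept = [q["id"] for q in questions
        if is_google_proof(q["expert_acc"], q["nonexpert_acc"])]
print(kept)  # ['q1']
```

Note that the filter cuts in both directions: a question non-experts can answer is too searchable, and a question experts can't answer is too ambiguous to grade.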

The Diamond subset

GPQA Diamond is the hardest subset of GPQA, consisting of 198 questions that were specifically selected for maximum difficulty and expert agreement. When researchers reference "GPQA" in model evaluations, they usually mean the Diamond subset. The questions are so hard that domain experts — people with PhDs in the relevant field — only achieve about 81% accuracy. Non-experts with internet access score around 22%, barely above the 25% random baseline for four-choice multiple-choice questions.

This means a model scoring 95% on GPQA Diamond is outperforming the average human PhD in that domain. That's a remarkable capability threshold.
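One caveat when comparing scores that high: with only 198 questions, a benchmark accuracy carries a few percentage points of sampling noise. A rough normal-approximation sketch (standard binomial margin of error, not anything GPQA itself publishes):

```python
import math

def margin_of_error(score: float, n_questions: int, z: float = 1.96) -> float:
    """Normal-approximation 95% margin of error for a benchmark
    accuracy measured on n_questions independent items."""
    return z * math.sqrt(score * (1 - score) / n_questions)

# A 95% score on GPQA Diamond's 198 questions:
moe = margin_of_error(0.95, 198)
print(round(moe, 3))  # about ±3 percentage points
```

Under this approximation, two models scoring 95 and 97 are within each other's error bars, which is one reason small gaps between frontier models on GPQA Diamond should not be over-read.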

The three science domains

Biology

GPQA biology questions cover molecular biology, genetics, biochemistry, and ecology at a level expected of late-stage PhD students. Questions might require understanding protein folding mechanisms, gene regulatory networks, or evolutionary dynamics that aren't covered in standard textbooks.

Physics

Physics questions span quantum mechanics, general relativity, condensed matter, and particle physics. These aren't textbook problems — they often require combining concepts from multiple subfields or reasoning about novel experimental setups.

Chemistry

Chemistry questions cover organic synthesis, computational chemistry, thermodynamics, and spectroscopy. Many require predicting reaction outcomes or interpreting experimental data in ways that demand deep mechanistic understanding.

Current leaderboard

According to BenchLM.ai, the top models on GPQA are:

| Rank | Model | Score |
|------|-------|-------|
| 1 | GPT-5.4 | 97 |
| 2 | GPT-5.3 Codex | 97 |
| 3 | Claude Opus 4.6 | 95 |
| 4 | GPT-5.2 | 95 |

Full rankings: GPQA leaderboard

GPQA scores are getting compressed at the top, similar to MMLU. For even harder questions, see SuperGPQA (285 disciplines) and HLE (frontier expert questions where top models score 10-46%).

GPQA vs SuperGPQA vs HLE

As GPQA approaches saturation, two successor benchmarks have emerged:

| Benchmark | Subjects | Questions | Top score | Score range | Expert baseline |
|-----------|----------|-----------|-----------|-------------|-----------------|
| GPQA Diamond | 3 | 198 | 97 | 80-97 | ~81% |
| SuperGPQA | 285 | 1,000+ | 95 | 55-95 | Varies |
| HLE | 100+ | 3,000+ | 46 | 10-46 | ~74% |

SuperGPQA expands the scope from 3 subjects to 285 graduate disciplines, including law, economics, computer science, and humanities. This broader coverage reduces the chance that a model happens to be strong in exactly the tested domains. SuperGPQA scores show more variance between models, making it a better discriminator for current frontier models.

HLE (Humanity's Last Exam) is the hardest of the three — top models score below 50%. HLE uses questions crowdsourced from thousands of domain experts worldwide, targeting the absolute frontier of human knowledge. Read our deep dive on HLE.

Why GPQA still matters

Even though GPQA is approaching saturation at the top, it remains valuable for several reasons:

  1. Historical comparison: GPQA has the longest track record among expert-level knowledge benchmarks. Comparing a new model's GPQA score to historical data provides meaningful context.

  2. Mid-tier model evaluation: While frontier models all score 90+, mid-tier and open-weight models still show significant variance on GPQA. It's useful for evaluating models like Llama 4, Qwen 3, and Mistral variants.

  3. Domain-specific insights: Because GPQA covers three specific science domains, it can reveal whether a model is stronger in physics than biology. This matters if you're using AI for domain-specific research.

  4. Methodology benchmark: The "Google-proof" validation methodology GPQA pioneered has become the gold standard for creating contamination-resistant benchmarks. Understanding GPQA helps you evaluate the quality of newer benchmarks.

How to use GPQA when choosing a model

If your use case involves scientific reasoning — drug discovery, materials science, academic research — here's how to interpret GPQA scores:

  • Score 90+: PhD-level science capabilities. Suitable for research assistance, literature synthesis, and hypothesis generation.
  • Score 75-90: Strong science knowledge but may miss nuances. Good for educational applications and general scientific writing.
  • Score below 75: Not recommended for tasks requiring expert-level scientific accuracy.
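The guidance above can be captured as a simple lookup. The thresholds come from this article's tiers; the function and its labels are purely illustrative:

```python
def gpqa_tier(score: float) -> str:
    """Map a GPQA Diamond score (0-100) to the rough capability
    tiers described above. Thresholds follow the article's guidance;
    the function itself is an illustration, not an official rubric."""
    if score >= 90:
        return "PhD-level: research assistance, literature synthesis, hypothesis generation"
    if score >= 75:
        return "Strong: educational applications, general scientific writing"
    return "Not recommended for expert-level scientific accuracy"

print(gpqa_tier(95))
print(gpqa_tier(82))
```

Treat the boundaries as soft: a score of 89 versus 91 on a 198-question test is well within measurement noise, so the tiers are a starting point, not a verdict.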

For the most demanding scientific applications, don't rely on GPQA alone. Check HLE scores for frontier reasoning and SuperGPQA for breadth across many disciplines.

See all models ranked on the full leaderboard · Knowledge rankings

The bottom line

GPQA remains the standard reference for "can this model reason at a PhD level?" For finer distinctions between frontier models, pair it with SuperGPQA and HLE.


Frequently asked questions

What is GPQA Diamond? GPQA Diamond is a benchmark of 198 multiple-choice questions in biology, physics, and chemistry, written and validated by PhD-level experts. Non-experts with internet access score around 22% — barely above the 25% random baseline. It measures genuine graduate-level scientific reasoning.

What score does GPT-5.4 get on GPQA Diamond? As of March 2026, GPT-5.4 scores 97 on GPQA Diamond, tied with GPT-5.3 Codex. Claude Opus 4.6 and GPT-5.2 both score 95. See the GPQA leaderboard for current rankings.

What makes GPQA harder than MMLU? MMLU covers general academic subjects at varying difficulty — frontier models score 97-99%. GPQA focuses on three science domains where even non-experts with internet access average only 22%. The questions require deep mechanistic understanding, not searchable recall.

Which model scores highest on GPQA Diamond? GPT-5.4 and GPT-5.3 Codex both score 97 as of March 2026. For better discrimination between frontier models, use SuperGPQA or HLE.

What is the difference between GPQA, SuperGPQA, and HLE? GPQA Diamond: 3 science domains, 198 questions, top scores 95-97. SuperGPQA: 285 disciplines, 1,000+ questions, better spread. HLE: crowdsourced expert questions, top models score only 10-46%.

What does a GPQA score above 90 mean? A score above 90 on GPQA Diamond indicates PhD-level reasoning in biology, physics, and chemistry. Suitable for research assistance and scientific hypothesis generation. Below 75 is not recommended for expert-level applications.


Data from BenchLM.ai. Last updated March 2026.
