
GPQA Diamond: The PhD-Level Science Benchmark

GPQA tests AI models with graduate-level questions in biology, physics, and chemistry that are 'Google-proof' — even skilled non-experts with internet access can't answer them. Here's how it works.

Glevd · March 7, 2026 · 10 min read

GPQA (Graduate-Level Google-Proof Q&A) is a benchmark of 448 multiple-choice questions written by PhD-level domain experts in biology, physics, and chemistry. The questions are specifically designed so that even skilled non-experts with full internet access struggle to answer them.

If you can Google the answer, it's not a GPQA question.

What makes GPQA different

Most knowledge benchmarks test recall — can the model regurgitate facts it learned during training? GPQA tests whether a model can apply deep domain expertise to novel questions. The difference is critical for evaluating AI models intended for scientific research, medical applications, or advanced engineering.

Each question was created through a rigorous process:

  1. Domain experts write questions that require specialized graduate-level knowledge
  2. Other experts validate that the answer is correct and unambiguous
  3. Non-experts attempt the questions with full internet access — if non-experts can answer them, the questions are filtered out

This "Google-proof" design means GPQA scores reflect genuine understanding, not just memorization or search ability.
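The three-step filter above can be sketched in a few lines of Python. This is an illustrative model of the process, not the authors' actual tooling; the field names and the pass/fail criteria are assumptions for the sake of the example.

```python
# Hypothetical sketch of GPQA's "Google-proof" filtering step.
# Dict keys and the exact keep/drop rule are illustrative assumptions.

def is_google_proof(question):
    """Keep a question only if other experts validated it (step 2)
    and internet-equipped non-experts still failed it (step 3)."""
    return (
        question["expert_validated"]
        and not question["nonexpert_correct"]
    )

candidates = [
    {"id": 1, "expert_validated": True,  "nonexpert_correct": False},
    {"id": 2, "expert_validated": True,  "nonexpert_correct": True},   # Googleable -> dropped
    {"id": 3, "expert_validated": False, "nonexpert_correct": False},  # ambiguous -> dropped
]

benchmark = [q for q in candidates if is_google_proof(q)]
print([q["id"] for q in benchmark])  # [1]
```

The key design choice is that a question must clear both gates: expert agreement guards correctness, and the non-expert check guards against searchability.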

The Diamond subset

GPQA Diamond is the hardest subset of GPQA, consisting of 198 questions that were specifically selected for maximum difficulty and expert agreement. When researchers reference "GPQA" in model evaluations, they usually mean the Diamond subset. The questions are so hard that domain experts — people with PhDs in the relevant field — only achieve about 81% accuracy. Non-experts with internet access score around 22%, slightly below the 25% baseline for random guessing on four-choice questions.
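The 25% baseline follows directly from the four answer choices (1/4 = 25%), and a quick Monte Carlo check confirms that pure guessing lands right on it. The trial count and seed below are arbitrary choices for the sake of the example.

```python
import random

random.seed(0)
N_CHOICES = 4          # GPQA questions have four answer options
TRIALS = 10_000        # arbitrary sample size for the estimate

# Monte Carlo estimate of random-guessing accuracy: without loss of
# generality, fix the answer key at choice 0 and guess uniformly.
hits = sum(random.randrange(N_CHOICES) == 0 for _ in range(TRIALS))
print(f"random baseline ~ {hits / TRIALS:.1%}")
```

The estimate converges on 25%, which is what makes the ~22% non-expert score notable: even with internet access, non-experts do no better than chance.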

This means a model scoring 95% on GPQA Diamond is outperforming the average human PhD in that domain. That's a remarkable capability threshold.

The three science domains

Biology

GPQA biology questions cover molecular biology, genetics, biochemistry, and ecology at a level expected of late-stage PhD students. Questions might require understanding protein folding mechanisms, gene regulatory networks, or evolutionary dynamics that aren't covered in standard textbooks.

Physics

Physics questions span quantum mechanics, general relativity, condensed matter, and particle physics. These aren't textbook problems — they often require combining concepts from multiple subfields or reasoning about novel experimental setups.

Chemistry

Chemistry questions cover organic synthesis, computational chemistry, thermodynamics, and spectroscopy. Many require predicting reaction outcomes or interpreting experimental data in ways that demand deep mechanistic understanding.

Current leaderboard

According to BenchLM.ai, the top models on GPQA are:

Rank  Model            Score
1     GPT-5.4          97
2     GPT-5.3 Codex    97
3     Claude Opus 4.6  95
4     GPT-5.2          95

Full rankings: GPQA leaderboard

GPQA scores are getting compressed at the top, similar to MMLU. For even harder questions, see SuperGPQA (285 disciplines) and HLE (frontier expert questions where top models score 10-46%).

GPQA vs SuperGPQA vs HLE

As GPQA approaches saturation, two successor benchmarks have emerged:

Benchmark     Subjects  Questions  Top Score  Score Range  Expert baseline
GPQA Diamond  3         198        97         80-97        ~81%
SuperGPQA     285       1,000+     95         55-95        Varies
HLE           100+      3,000+     46         10-46        ~74%

SuperGPQA expands the scope from 3 subjects to 285 graduate disciplines, including law, economics, computer science, and humanities. This broader coverage reduces the chance that a model happens to be strong in exactly the tested domains. SuperGPQA scores show more variance between models, making it a better discriminator for current frontier models.

HLE (Humanity's Last Exam) is the hardest of the three — top models score below 50%. HLE uses questions crowdsourced from thousands of domain experts worldwide, targeting the absolute frontier of human knowledge. Read our deep dive on HLE.

Why GPQA still matters

Even though GPQA is approaching saturation at the top, it remains valuable for several reasons:

  1. Historical comparison: GPQA has the longest track record among expert-level knowledge benchmarks. Comparing a new model's GPQA score to historical data provides meaningful context.

  2. Mid-tier model evaluation: While frontier models all score 90+, mid-tier and open-weight models still show significant variance on GPQA. It's useful for evaluating models like Llama 4, Qwen 3, and Mistral variants.

  3. Domain-specific insights: Because GPQA covers three specific science domains, it can reveal whether a model is stronger in physics than biology, for example. This matters if you're using AI for domain-specific research.

  4. Methodology benchmark: The "Google-proof" validation methodology GPQA pioneered has become the gold standard for creating contamination-resistant benchmarks. Understanding GPQA helps you evaluate the quality of newer benchmarks.

How to use GPQA when choosing a model

If your use case involves scientific reasoning — drug discovery, materials science, academic research — here's how to interpret GPQA scores:

  • Score 90+: Model has PhD-level science capabilities. Suitable for research assistance, literature synthesis, and hypothesis generation.
  • Score 75-90: Strong science knowledge but may miss nuances. Good for educational applications and general scientific writing.
  • Score below 75: Not recommended for tasks requiring expert-level scientific accuracy.
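The score bands above can be captured as a small helper. The thresholds come straight from this article's guidance; the tier labels and function name are illustrative, not a standard API.

```python
def gpqa_tier(score: float) -> str:
    """Map a GPQA Diamond score (0-100) to the rough capability
    tiers described above. Thresholds follow this article's guidance;
    the function and labels are illustrative."""
    if score >= 90:
        return "PhD-level: research assistance, literature synthesis, hypothesis generation"
    if score >= 75:
        return "strong: educational applications, general scientific writing"
    return "not recommended for expert-level scientific accuracy"

print(gpqa_tier(95))  # top tier
print(gpqa_tier(82))  # middle tier
print(gpqa_tier(62))  # below the recommended floor
```

As with any single-number cutoff, treat these bands as rough guidance rather than hard rules, especially near the boundaries.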

For the most demanding scientific applications, don't rely on GPQA alone. Check HLE scores for frontier reasoning and SuperGPQA for breadth across many disciplines.

The bottom line

GPQA remains an important benchmark for evaluating deep scientific knowledge. It's the standard reference for "can this model reason at a PhD level?" For finer distinctions between frontier models, pair it with SuperGPQA and HLE.

See all knowledge benchmarks on our knowledge rankings page, or compare specific models on their benchmark detail pages.


Data from BenchLM.ai. Last updated March 2026.
