GPQA tests AI models with graduate-level questions in biology, physics, and chemistry that are 'Google-proof' — even skilled non-experts with internet access can't answer them. Here's how it works.
GPQA (Graduate-Level Google-Proof Q&A) is a benchmark of 448 multiple-choice questions written by PhD-level domain experts in biology, physics, and chemistry. The questions are specifically designed so that even skilled non-experts with full internet access struggle to answer them.
If you can Google the answer, it's not a GPQA question.
Most knowledge benchmarks test recall — can the model regurgitate facts it learned during training? GPQA tests whether a model can apply deep domain expertise to novel questions. The difference is critical for evaluating AI models intended for scientific research, medical applications, or advanced engineering.
Each question was created through a rigorous multi-step process:

1. An expert who holds (or is completing) a PhD in biology, physics, or chemistry writes the question.
2. Two expert validators in the same field attempt it, check it for correctness, and give feedback; the writer revises accordingly.
3. Three skilled non-expert validators (PhDs in *other* fields) attempt the question with unrestricted internet access.
4. Questions that the non-experts can reliably answer are filtered out.
This "Google-proof" design means GPQA scores reflect genuine understanding, not just memorization or search ability.
GPQA Diamond is the hardest subset of GPQA, consisting of 198 questions specifically selected for maximum difficulty and expert agreement. When researchers reference "GPQA" in model evaluations, they usually mean the Diamond subset. The questions are so hard that domain experts — people with PhDs in the relevant field — only achieve about 81% accuracy. Non-experts with internet access score around 22%, below even the 25% random baseline for four-choice multiple-choice questions: the distractors are convincing enough to pull non-experts under chance.
This means a model scoring 95% on GPQA Diamond is outperforming the average human PhD in that domain. That's a remarkable capability threshold.
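To make the setup concrete, here is a minimal scoring sketch in Python. It assumes access to the gated Hugging Face copy of the dataset (`Idavidrein/gpqa`, `gpqa_diamond` config; the split and column names follow the public dataset card, so verify them against your copy), and `ask_model` is a placeholder for whatever model API you are testing; this is a sketch, not an official harness.

```python
import random
from datasets import load_dataset  # pip install datasets

def ask_model(prompt: str) -> str:
    """Placeholder: call your model API and return a single letter A-D."""
    raise NotImplementedError("wire up your model client here")

# GPQA is gated on Hugging Face: you must accept the terms to download it.
rows = load_dataset("Idavidrein/gpqa", "gpqa_diamond")["train"]

correct = 0
for row in rows:
    # Shuffle the four options so the answer position carries no signal.
    options = [row["Correct Answer"], row["Incorrect Answer 1"],
               row["Incorrect Answer 2"], row["Incorrect Answer 3"]]
    random.shuffle(options)
    answer = "ABCD"[options.index(row["Correct Answer"])]

    prompt = (row["Question"] + "\n"
              + "\n".join(f"{letter}) {text}" for letter, text in zip("ABCD", options))
              + "\nAnswer with a single letter.")

    if ask_model(prompt).strip().upper().startswith(answer):
        correct += 1

print(f"GPQA Diamond accuracy: {correct}/{len(rows)} = {correct / len(rows):.1%}")
```

Reported scores for the same model can still differ across leaderboards, since prompting style, sampling temperature, and answer-extraction rules all move the number.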
GPQA biology questions cover molecular biology, genetics, biochemistry, and ecology at a level expected of late-stage PhD students. Questions might require understanding protein folding mechanisms, gene regulatory networks, or evolutionary dynamics that aren't covered in standard textbooks.
Physics questions span quantum mechanics, general relativity, condensed matter, and particle physics. These aren't textbook problems — they often require combining concepts from multiple subfields or reasoning about novel experimental setups.
Chemistry questions cover organic synthesis, computational chemistry, thermodynamics, and spectroscopy. Many require predicting reaction outcomes or interpreting experimental data in ways that demand deep mechanistic understanding.
According to BenchLM.ai, the top models on GPQA are:
| Rank | Model | Score (%) |
|---|---|---|
| 1 | GPT-5.4 | 97 |
| 2 | GPT-5.3 Codex | 97 |
| 3 | Claude Opus 4.6 | 95 |
| 4 | GPT-5.2 | 95 |
Full rankings: GPQA leaderboard
GPQA scores are getting compressed at the top, similar to MMLU. For even harder questions, see SuperGPQA (285 disciplines) and HLE (frontier expert questions where top models score 10-46%).
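Because Diamond has only 198 questions, the gaps at the top of the leaderboard are fragile. A quick binomial error bar, computed from nothing but the numbers above, shows why (plain Python, no external dependencies):

```python
import math

N = 198  # questions in GPQA Diamond

def ci95_halfwidth(score: float, n: int = N) -> float:
    """Half-width of an approximate (normal) 95% binomial confidence interval."""
    return 1.96 * math.sqrt(score * (1 - score) / n)

for score in (0.97, 0.95):
    print(f"{score:.0%} +/- {ci95_halfwidth(score):.1%}")

# 97% +/- 2.4%
# 95% +/- 3.0%
# The intervals overlap heavily: a two-point gap on Diamond can be
# sampling noise rather than a real capability difference.
```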
As GPQA approaches saturation, two successor benchmarks have emerged:
| Benchmark | Subjects | Questions | Top score (%) | Score range (%) | Expert baseline |
|---|---|---|---|---|---|
| GPQA Diamond | 3 | 198 | 97 | 80-97 | ~81% |
| SuperGPQA | 285 | 1,000+ | 95 | 55-95 | Varies |
| HLE | 100+ | 3,000+ | 46 | 10-46 | ~74% |
SuperGPQA expands the scope from 3 subjects to 285 graduate disciplines, including law, economics, computer science, and humanities. This broader coverage reduces the chance that a model happens to be strong in exactly the tested domains. SuperGPQA scores show more variance between models, making it a better discriminator for current frontier models.
HLE (Humanity's Last Exam) is the hardest of the three — top models score below 50%. HLE uses questions crowdsourced from thousands of domain experts worldwide, targeting the absolute frontier of human knowledge. Read our deep dive on HLE.
Even though GPQA is approaching saturation at the top, it remains valuable for several reasons:
- **Historical comparison:** GPQA has the longest track record among expert-level knowledge benchmarks. Comparing a new model's GPQA score to historical data provides meaningful context.
- **Mid-tier model evaluation:** While frontier models all score 90+, mid-tier and open-weight models still show significant variance on GPQA. It's useful for evaluating models like Llama 4, Qwen 3, and Mistral variants.
- **Domain-specific insights:** Because GPQA covers three specific science domains, it can reveal whether a model is stronger in physics than biology, for example. This matters if you're using AI for domain-specific research (see the scoring sketch after this list).
- **Methodology benchmark:** The "Google-proof" validation methodology GPQA pioneered has become the gold standard for creating contamination-resistant benchmarks. Understanding GPQA helps you evaluate the quality of newer benchmarks.
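As a sketch of how you might surface those per-domain differences from your own evaluation run, the snippet below groups per-question results by domain. The `High-level domain` field name follows the public Hugging Face dataset card; treat it as an assumption to verify against your copy of the data.

```python
from collections import defaultdict

def per_domain_accuracy(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: (domain, model_was_correct) pairs, e.g. collected in the eval
    loop above via results.append((row["High-level domain"], is_correct))."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for domain, ok in results:
        totals[domain] += 1
        hits[domain] += ok
    return {d: hits[d] / totals[d] for d in totals}

# Toy example: a model noticeably stronger in physics than in biology.
sample = ([("Physics", True)] * 90 + [("Physics", False)] * 10
          + [("Biology", True)] * 70 + [("Biology", False)] * 30)
for domain, acc in per_domain_accuracy(sample).items():
    print(f"{domain}: {acc:.0%}")  # Physics: 90%, Biology: 70%
```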
If your use case involves scientific reasoning — drug discovery, materials science, academic research — here's a rough guide to interpreting GPQA Diamond scores:

- **90%+:** frontier-level; clearly above the ~81% human expert baseline.
- **70-90%:** strong scientific reasoning, approaching domain PhD performance.
- **40-70%:** real domain knowledge, but expect mistakes on genuinely hard questions.
- **Around 25%:** no better than random guessing on four-option questions.
For the most demanding scientific applications, don't rely on GPQA alone. Check HLE scores for frontier reasoning and SuperGPQA for breadth across many disciplines.
GPQA remains an important benchmark for evaluating deep scientific knowledge. It's the standard reference for "can this model reason at a PhD level?" For finer distinctions between frontier models, pair it with SuperGPQA and HLE.
See all knowledge benchmarks on our knowledge rankings page, or compare specific models on their benchmark detail pages.
Data from BenchLM.ai. Last updated March 2026.