Graduate-Level Google-Proof Q&A (GPQA)

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Designed to be difficult even for skilled non-experts with access to Google.

Top models on GPQA — April 20, 2026

As of April 20, 2026, Claude Mythos Preview leads the GPQA leaderboard with 94.5%, followed by Claude Opus 4.7 (94.2%) and GPT-5.4 (92.8%).

36 models · Knowledge category · 12% of category score · Refreshing · Updated April 20, 2026

According to BenchLM.ai, Claude Mythos Preview leads the GPQA benchmark with a score of 94.5%, followed by Claude Opus 4.7 (94.2%) and GPT-5.4 (92.8%). The top models are clustered within 1.7 points, suggesting this benchmark is nearing saturation for frontier models.

36 models have been evaluated on GPQA. The benchmark falls in the Knowledge category, which carries a 12% weight in BenchLM.ai's overall scoring system; within that category, GPQA contributes 12% of the category score, so strong performance here directly affects a model's overall ranking.
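To make the weighting concrete, GPQA's effective share of a model's overall score can be sketched as below. The two 12% figures come from this page; the multiplicative roll-up (category weight times within-category weight) is an assumption about how BenchLM composes scores, not a documented formula:

```python
# Sketch of how a benchmark score might roll up into an overall ranking.
# The 12% weights are from this page; the multiplicative roll-up is an
# assumption, not BenchLM's documented formula.
category_weight = 0.12   # Knowledge category's share of the overall score
benchmark_weight = 0.12  # GPQA's share of the Knowledge category score

effective_weight = category_weight * benchmark_weight
print(f"GPQA's share of the overall score: {effective_weight:.2%}")

gpqa_score = 94.5  # Claude Mythos Preview's GPQA score from this page
contribution = gpqa_score * effective_weight
print(f"Contribution to the overall score: {contribution:.2f} points")
```

Under that assumption, a 1-point gain on GPQA moves a model's overall score by only about 0.014 points, which is why a benchmark this close to saturation is a weak differentiator on its own.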

About GPQA

Year: 2023
Tasks: 448 questions
Format: Multiple-choice questions
Difficulty: Graduate level

GPQA questions are crafted by PhD-level domain experts and validated to be answerable by experts but challenging for non-experts even with internet access. This makes it an excellent test of deep scientific knowledge and reasoning.

BenchLM freshness & provenance

Version: GPQA Diamond
Refresh cadence: Static
Staleness state: Refreshing
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (36 models)

1. Claude Mythos Preview: 94.5%
2. Claude Opus 4.7: 94.2%
3. GPT-5.4: 92.8%
4. 92.4%
5. 91.3%
6. 90.5%
7. 90.4%
8. 89.9%
9. 88.4%
10. 88%
11. 87.6%
12. 87.6%
13. 87%
14. 86.6%
15. 86%
16. 86%
17. 85.7%
18. 85.5%
19. 84.3%
20. 84.2%
21. 83.7%
22. 83.4%
23. 83%
24. 82.8%
25. 79%
26. 77.5%
27. 77.2%
28. 75.7%
29. 66.3%
30. 64.2%
31. 59.4%
32. 59.1%
33. 58.6%
34. 50.3%
35. 43.4%
36. 25.7%

FAQ

What does GPQA measure?

GPQA measures deep scientific knowledge and reasoning through 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are designed to be difficult even for skilled non-experts with access to Google.

Which model scores highest on GPQA?

Claude Mythos Preview by Anthropic currently leads with a score of 94.5% on GPQA.

How many models are evaluated on GPQA?

36 AI models have been evaluated on GPQA on BenchLM.

Last updated: April 20, 2026 · BenchLM version GPQA Diamond
