Artificial Analysis GPQA Diamond (AA-GPQA Diamond)

Name: Artificial Analysis GPQA Diamond
Creator: BenchLM

A display-only Artificial Analysis GPQA Diamond score.

Benchmark score on AA-GPQA Diamond — July 4, 2026

BenchLM mirrors the published score view for AA-GPQA Diamond. Gemini 3.1 Pro leads the public snapshot at 94.1%, followed by GPT-5.5 (93.5%) and MiniMax M3 (92.9%). BenchLM does not use these results to rank models overall.

1. Gemini 3.1 Pro (Google, Closed): 94.1% · Overall 88 · Context 1M
2. GPT-5.5 (OpenAI, Closed): 93.5% · Overall 78 · Context 1M
3. MiniMax M3 (MiniMax, Open): 92.9% · Overall 74 · Context 1M

132 models · Knowledge · Current · Display only · Updated July 4, 2026

The published AA-GPQA Diamond snapshot is tightly clustered at the top: Gemini 3.1 Pro sits at 94.1%, while the third row is only 1.2 points behind. The broader top-10 spread is 3.0 points, so many of the published scores sit in a relatively narrow band.
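The gaps quoted above can be reproduced from the published scores. The following is a minimal sketch, assuming the ten highest scores from the score table on this page; the helper name `spread` is an illustrative choice and not part of any BenchLM tooling.

```python
# Ten highest published AA-GPQA Diamond scores, copied from the score table.
top10 = [94.1, 93.5, 92.9, 92.3, 92.2, 92.0, 92.0, 91.5, 91.4, 91.1]


def spread(scores, k):
    """Gap in percentage points between the best score and the k-th best."""
    ranked = sorted(scores, reverse=True)
    # Round to one decimal to match the precision of the published scores.
    return round(ranked[0] - ranked[k - 1], 1)


top3_gap = spread(top10, 3)    # leader vs. third row
top10_gap = spread(top10, 10)  # leader vs. tenth row
print(f"top-3 gap: {top3_gap} points, top-10 spread: {top10_gap} points")
```

Rounding inside the helper keeps floating-point noise from showing up in the reported gap.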

132 models have been evaluated on AA-GPQA Diamond. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. AA-GPQA Diamond is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
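One way to picture "displayed but excluded from the scoring formula" is a filter applied before a category score is computed. The sketch below is hypothetical: only the 12% Knowledge weight comes from this page, while the `Benchmark` class, the plain averaging step, and the second benchmark entry are assumptions made for illustration and do not describe BenchLM's actual formula.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    score: float        # a model's score on this benchmark, 0-100
    display_only: bool  # True: shown on the page, ignored when scoring


KNOWLEDGE_WEIGHT = 0.12  # the 12% category weight stated on this page


def category_score(benchmarks):
    """Average only the benchmarks that count toward scoring (assumed rule)."""
    scored = [b.score for b in benchmarks if not b.display_only]
    if not scored:
        return None  # nothing in this category contributes
    return sum(scored) / len(scored)


# Hypothetical Knowledge-category entries for a single model.
knowledge = [
    Benchmark("AA-GPQA Diamond", 94.1, display_only=True),
    Benchmark("hypothetical-weighted-benchmark", 80.0, display_only=False),
]

cat = category_score(knowledge)
contribution = KNOWLEDGE_WEIGHT * cat
print(f"category score: {cat}, weighted contribution: {contribution:.2f}")
```

Under these assumptions, changing the AA-GPQA Diamond value leaves the contribution unchanged, which matches the display-only behavior described above.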

About AA-GPQA Diamond

Year: 2026
Tasks: Graduate-level science questions
Format: Accuracy
Difficulty: Graduate-level science reasoning

BenchLM stores the Artificial Analysis GPQA Diamond result separately from the weighted GPQA lane so AA refreshes remain display-only.

Artificial Analysis GPQA Diamond Benchmark Leaderboard

BenchLM freshness & provenance

Version: AA-GPQA Diamond 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (132 models)

| Rank | Model | Creator | Open/Closed | Score |
|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | Google | Closed | 94.1% |
| 2 | GPT-5.5 | OpenAI | Closed | 93.5% |
| 3 | MiniMax M3 | MiniMax | Open | 92.9% |
| 4 | Qwen3.7 Max | Alibaba | Closed | 92.3% |
| 5 | Gemini 3.5 Flash | Google | Closed | 92.2% |
| 6 | GPT-5.4 | OpenAI | Closed | 92.0% |
| 7 | Claude Opus 4.8 | Anthropic | Closed | 92.0% |
| 8 | GPT-5.3 Codex | OpenAI | Closed | 91.5% |
| 9 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 91.4% |
| 10 | Kimi K2.6 | Moonshot AI | Open | 91.1% |
| 11 | Gemini 3 Pro | Google | Closed | 90.8% |
| 12 | DeepSeek V4 Pro (High) | DeepSeek | Open | 90.5% |
| 13 | GPT-5.2 | OpenAI | Closed | 90.3% |
| 14 | Grok 4.3 | xAI | Closed | 90.1% |
| 15 | Qwen3.7 Plus | Alibaba | Closed | 90.0% |
| 16 | GPT-5.2-Codex | OpenAI | Closed | 89.9% |
| 17 | Claude Opus 4.6 (Adaptive) | Anthropic | Closed | 89.6% |
| 18 | Kimi K2.7 Code | Moonshot AI | Open | 89.6% |
| 19 | GLM-5.2 | Z.AI | Open | 89.5% |
| 20 | DeepSeek V4 Flash (Max) | DeepSeek | Open | 89.4% |
| 21 | Qwen3.5 397B (Reasoning) | Alibaba | Open | 89.3% |
| 22 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 88.8% |
| 23 | Qwen 3.6 Max (preview) | Alibaba | Closed | 88.8% |
| 24 | Claude Opus 4.7 | Anthropic | Closed | 88.5% |
| 25 | Muse Spark | Meta | Closed | 88.4% |
| 26 | Qwen3.6 Plus | Alibaba | Closed | 88.2% |
| 27 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | 87.9% |
| 28 | Kimi K2.5 | Moonshot AI | Open | 87.9% |
| 29 | Grok 4 | xAI | Closed | 87.7% |
| 30 | GPT-5.4 mini | OpenAI | Closed | 87.5% |
| 31 | MiniMax M2.7 | MiniMax | Open | 87.4% |
| 32 | GPT-5.1 | OpenAI | Closed | 87.3% |
| 33 | MiMo-V2-Pro | Xiaomi | Closed | 87.0% |
| 34 | GLM-5.1 | Z.AI | Open | 86.8% |
| 35 | DeepSeek V4 Flash (High) | DeepSeek | Open | 86.7% |
| 36 | Nemotron 3 Ultra | NVIDIA | Open | 86.7% |
| 37 | Hy3 Preview | Tencent | Open | 86.7% |
| 38 | MiMo-V2.5-Pro | Xiaomi | Closed | 86.6% |
| 39 | Claude Opus 4.5 Thinking | Anthropic | Closed | 86.6% |
| 40 | Qwen3.5 397B | Alibaba | Open | 86.1% |
| 41 | GPT-5.1-Codex-Max | OpenAI | Closed | 86.0% |
| 42 | GPT-5.1-Codex | OpenAI | Closed | 86.0% |
| 43 | GLM-4.7 | Z.AI | Open | 85.9% |
| 44 | Qwen3.5-27B | Alibaba | Open | 85.8% |
| 45 | Qwen3.5-122B-A10B | Alibaba | Open | 85.7% |
| 46 | Gemma 4 31B | Google | Open | 85.7% |
| 47 | GPT-5 (high) | OpenAI | Closed | 85.4% |
| 48 | Grok 4.1 Fast (Reasoning) | xAI | Closed | 85.3% |
| 49 | GLM-5-Turbo | Z.AI | Closed | 84.7% |
| 50 | Grok 4 Fast (Reasoning) | xAI | Closed | 84.7% |
| 51 | o3-pro | OpenAI | Closed | 84.5% |
| 52 | Qwen3.5-35B-A3B | Alibaba | Open | 84.5% |
| 53 | Gemini 2.5 Pro | Google | Closed | 84.4% |
| 54 | GPT-5 (medium) | OpenAI | Closed | 84.2% |
| 55 | Qwen3.6-27B | Alibaba | Open | 84.2% |
| 56 | Qwen3.6-35B-A3B | Alibaba | Open | 84.1% |
| 57 | Claude Opus 4.6 | Anthropic | Closed | 84.0% |
| 58 | MiMo-V2-Omni | Xiaomi | Closed | 82.8% |
| 59 | (name missing) | OpenAI | Closed | 82.7% |
| 60 | Gemini 3.1 Flash-Lite | Google | Closed | 82.2% |
| 61 | GLM-5 | Z.AI | Open | 82.0% |
| 62 | GPT-5.4 nano | OpenAI | Closed | 81.7% |
| 63 | DeepSeek-R1 | DeepSeek | Open | 81.3% |
| 64 | Gemini 3 Flash | Google | Closed | 81.2% |
| 65 | Claude Opus 4.5 | Anthropic | Closed | 81.0% |
| 66 | Claude 4.1 Opus Thinking | Anthropic | Closed | 80.9% |
| 67 | Step 3.7 Flash | StepFun | Open | 80.9% |
| 68 | GLM-5V-Turbo | Z.AI | Closed | 80.9% |
| 69 | Claude Sonnet 4.6 | Anthropic | Closed | 79.9% |
| 70 | Gemma 4 26B A4B | Google | Open | 79.2% |
| 71 | K-Exaone | LG AI Research | Closed | 78.3% |
| 72 | GPT-OSS 120B | OpenAI | Open | 78.2% |
| 73 | DeepSeek V3.1 (Reasoning) | DeepSeek | Open | 77.9% |
| 74 | Mistral Small 4 (Reasoning) | Mistral | Open | 76.9% |
| 75 | Mistral Small 4 | Mistral | Open | 76.9% |
| 76 | Kimi K2 | Moonshot AI | Closed | 76.6% |
| 77 | Qwen3 Max | Alibaba | Closed | 76.4% |
| 78 | Command A+ | Cohere | Open | 76.1% |
| 79 | Gemma 4 12B | Google | Open | 75.3% |
| 80 | Trinity-Large-Thinking | Arcee AI | Open | 75.2% |
| 81 | Trinity-Large-Preview | Arcee AI | Open | 75.2% |
| 82 | DeepSeek V3.2 | DeepSeek | Open | 75.1% |
| 83 | o3-mini | OpenAI | Closed | 74.8% |
| 84 | Mistral Medium 3.5 128B | Mistral | Open | 74.8% |
| 85 | (name missing) | OpenAI | Closed | 74.7% |
| 86 | Sarvam 105B | Sarvam | Open | 73.8% |
| 87 | DeepSeek V3.1 | DeepSeek | Open | 73.5% |
| 88 | GLM-4.5-Air | Z.AI | Closed | 73.3% |
| 89 | Nemotron Ultra 253B | NVIDIA | Open | 72.8% |
| 90 | Grok Code Fast 1 | xAI | Closed | 72.7% |
| 91 | GPT-OSS 20B | OpenAI | Open | 68.8% |
| 92 | Claude 4 Sonnet | Anthropic | Closed | 68.3% |
| 93 | Gemini 2.5 Flash | Google | Closed | 68.3% |
| 94 | Mistral Large 3 | Mistral | Closed | 68.0% |
| 95 | Llama 4 Maverick | Meta | Open | 67.1% |
| 96 | GPT-4.1 | OpenAI | Closed | 66.6% |
| 97 | GPT-4.1 mini | OpenAI | Closed | 66.4% |
| 98 | MiMo-V2-Flash | Xiaomi | Open | 65.6% |
| 99 | Grok 4.1 Fast | xAI | Closed | 63.7% |
| 100 | Sarvam 30B | Sarvam | Open | 63.3% |
| 101 | GLM-4.6 | Z.AI | Open | 63.2% |
| 102 | Exaone 4.0 32B | LG AI Research | Open | 62.8% |
| 103 | DeepSeek R1 Distill Qwen 32B | DeepSeek | Open | 61.5% |
| 104 | Ling 2.6 Flash | InclusionAI | Open | 59.3% |
| 105 | Gemini 1.5 Pro | Google | Closed | 58.9% |
| 106 | Llama 4 Scout | Meta | Open | 58.7% |
| 107 | Mistral Medium 3 | Mistral | Closed | 57.8% |
| 108 | Gemma 4 E4B | Google | Open | 57.6% |
| 109 | Phi-4 | Microsoft | Open | 57.5% |
| 110 | Solar Pro 2 | Upstage | Closed | 56.1% |
| 111 | DeepSeek V3 | DeepSeek | Open | 55.7% |
| 112 | GPT-4o | OpenAI | Closed | 54.3% |
| 113 | Llama 3.1 405B | Meta | Open | 51.5% |
| 114 | LFM2.5-8B-A1B | LiquidAI | Open | 51.3% |
| 115 | GPT-4.1 nano | OpenAI | Closed | 51.2% |
| 116 | Nova Pro | Amazon | Closed | 49.9% |
| 117 | Claude 3 Opus | Anthropic | Closed | 48.9% |
| 118 | Mistral Large 2 | Mistral | Closed | 48.6% |
| 119 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | Open | 46.9% |
| 120 | Gemma 4 E2B | Google | Open | 43.3% |
| 121 | Gemma 3 27B | Google | Open | 42.8% |
| 122 | GPT-4o mini | OpenAI | Closed | 42.6% |
| 123 | Exaone 4.0 1.2B | LG AI Research | Open | 42.4% |
| 124 | Qwen2.5 Coder 32B Instruct | Alibaba | Open | 41.7% |
| 125 | Nemotron 3 Nano 30B | NVIDIA | Open | 39.9% |
| 126 | Claude 3 Haiku | Anthropic | Closed | 37.4% |
| 127 | LFM2.5-VL-1.6B-Extract | LiquidAI | Open | 28.9% |
| 128 | Granite-4.0-1B | IBM | Open | 28.1% |
| 129 | Gemini 1.0 Pro | Google | Closed | 27.7% |
| 130 | Granite-4.0-H-1B | IBM | Open | 26.3% |
| 131 | Granite-4.0-350M | IBM | Open | 26.1% |
| 132 | Granite-4.0-H-350M | IBM | Open | 25.7% |

FAQ

What does AA-GPQA Diamond measure?

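AA-GPQA Diamond reports accuracy on graduate-level science questions. On BenchLM it is a display-only mirror of the Artificial Analysis GPQA Diamond score and is not part of the scoring formula.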

Which model scores highest on AA-GPQA Diamond?

Gemini 3.1 Pro by Google currently leads with a score of 94.1% on AA-GPQA Diamond.

How many models are evaluated on AA-GPQA Diamond?

BenchLM lists AA-GPQA Diamond scores for 132 AI models.


Last updated: July 4, 2026 · BenchLM version AA-GPQA Diamond 2026
