Artificial Analysis GPQA Diamond (AA-GPQA Diamond)

A display-only Artificial Analysis GPQA Diamond score.

Benchmark score on AA-GPQA Diamond — May 21, 2026

BenchLM mirrors the published score view for AA-GPQA Diamond. Gemini 3.1 Pro leads the public snapshot at 94.1%, followed by GPT-5.5 (93.5%) and Qwen3.7 Max (92.3%). BenchLM does not use these results to rank models overall.

125 models · Knowledge · Current · Display only · Updated May 21, 2026

The published AA-GPQA Diamond snapshot is tightly clustered at the top: Gemini 3.1 Pro sits at 94.1%, while the third row is only 1.8 points behind. The broader top-10 spread is 3.6 points, so many of the published scores sit in a relatively narrow band.
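The two gaps quoted above can be checked directly against the top ten rows of the score table. The following is a minimal sketch; the list `top10` and the helper `gap` are illustrative names, not part of any BenchLM tooling.

```python
# Top-10 published AA-GPQA Diamond scores, copied from the score table (percent).
top10 = [94.1, 93.5, 92.3, 92.2, 92.0, 91.5, 91.4, 91.1, 90.8, 90.5]


def gap(scores, rank):
    """Return the gap in points between rank 1 and the given 1-based rank."""
    # Round to one decimal place to avoid floating-point noise such as 1.7999...
    return round(scores[0] - scores[rank - 1], 1)


third_gap = gap(top10, 3)      # leader vs. third row
top10_spread = gap(top10, 10)  # leader vs. tenth row

print(f"Gap to third place: {third_gap} points")
print(f"Top-10 spread: {top10_spread} points")
```

Both values match the figures in the paragraph above: 1.8 points to the third row and 3.6 points across the top ten.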

125 models have been evaluated on AA-GPQA Diamond. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. AA-GPQA Diamond is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
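To illustrate what "displayed but excluded from the scoring formula" could mean in practice, here is a hedged sketch. BenchLM's actual formula is not given on this page, so everything below is an assumption made for illustration: the `BenchmarkResult` type, the `display_only` flag, the simple per-category average, and the second benchmark with its score of 80.0. Only the 12% Knowledge weight and the 94.1% score come from this page.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    name: str
    category: str
    score: float        # percent, 0-100
    display_only: bool  # True: shown for reference, skipped when scoring


def category_average(results, category):
    """Average the scored (non-display-only) results in one category.

    Returns None when the category has no scored results.
    """
    scored = [
        r.score for r in results
        if r.category == category and not r.display_only
    ]
    return sum(scored) / len(scored) if scored else None


KNOWLEDGE_WEIGHT = 0.12  # category weight stated on this page

results = [
    # Real score from this page, flagged display-only.
    BenchmarkResult("AA-GPQA Diamond", "Knowledge", 94.1, display_only=True),
    # Hypothetical scored benchmark, invented for this example.
    BenchmarkResult("Example Knowledge Benchmark", "Knowledge", 80.0, display_only=False),
]

knowledge_avg = category_average(results, "Knowledge")
contribution = KNOWLEDGE_WEIGHT * knowledge_avg

print(f"Knowledge average (scored benchmarks only): {knowledge_avg}")
print(f"Weighted contribution to overall score: {contribution:.2f}")
```

In this sketch the 94.1% result never enters the average, so changing it would leave the overall score untouched, which is the behavior the paragraph above describes.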

About AA-GPQA Diamond

Year: 2026
Tasks: Graduate-level science questions
Format: Accuracy
Difficulty: Graduate-level science reasoning

BenchLM stores the Artificial Analysis GPQA Diamond result separately from the weighted GPQA lane so AA refreshes remain display-only.

BenchLM freshness & provenance

Version: AA-GPQA Diamond 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (125 models)

Rank  Score
1     94.1%
2     93.5%
3     92.3%
4     92.2%
5     92.0%
6     91.5%
7     91.4%
8     91.1%
9     90.8%
10    90.5%
11    90.3%
12    90.1%
13    89.9%
14    89.6%
15    89.4%
16    89.3%
17    88.8%
18    88.8%
19    88.5%
20    88.4%
21    88.2%
22    87.9%
23    87.9%
24    87.7%
25    87.5%
26    87.4%
27    87.3%
28    87.0%
29    86.8%
30    86.7%
31    86.7%
32    86.6%
33    86.6%
34    86.1%
35    86.0%
36    86.0%
37    85.9%
38    85.8%
39    85.7%
40    85.7%
41    85.4%
43    84.7%
44    84.7%
45    84.5%
46    84.5%
47    84.4%
48    84.2%
49    84.2%
50    84.1%
51    84.0%
52    82.8%
53    82.7%
54    82.2%
55    82.0%
56    82.0%
57    81.7%
58    81.3%
59    81.2%
60    81.0%
61    80.9%
62    80.9%
63    79.9%
64    79.2%
65    79.1%
66    78.3%
67    78.2%
68    77.9%
69    77.6%
70    76.9%
71    76.9%
72    76.6%
73    76.4%
74    76.1%
75    75.2%
76    75.2%
77    75.1%
78    74.8%
79    74.8%
80    74.7%
81    73.8%
82    73.5%
83    73.3%
84    72.8%
85    72.7%
86    68.8%
87    68.3%
88    68.3%
89    68.0%
90    67.1%
91    66.6%
92    66.4%
93    65.6%
94    63.7%
95    63.3%
96    63.2%
97    62.8%
98    61.5%
99    59.3%
100   58.9%
101   58.7%
102   57.8%
103   57.6%
104   57.5%
105   56.1%
106   55.7%
107   54.3%
108   51.5%
109   51.2%
110   49.9%
111   48.9%
112   48.6%
114   43.3%
115   42.8%
116   42.6%
117   42.4%
118   41.7%
119   39.9%
120   37.4%
121   28.1%
122   27.7%
123   26.3%
124   26.1%
125   25.7%

FAQ

What does AA-GPQA Diamond measure?

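AA-GPQA Diamond reports accuracy on graduate-level science questions, as published by Artificial Analysis. BenchLM shows the score for reference only and does not include it in overall rankings.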

Which model scores highest on AA-GPQA Diamond?

Gemini 3.1 Pro by Google currently leads with a score of 94.1% on AA-GPQA Diamond.

How many models are evaluated on AA-GPQA Diamond?

BenchLM lists 125 AI models evaluated on AA-GPQA Diamond.

Last updated: May 21, 2026 · BenchLM version AA-GPQA Diamond 2026
