
Artificial Analysis GPQA Diamond (AA-GPQA Diamond)

A display-only Artificial Analysis GPQA Diamond score.

Benchmark score on AA-GPQA Diamond — July 4, 2026

BenchLM mirrors the published score view for AA-GPQA Diamond. Gemini 3.1 Pro leads the public snapshot at 94.1%, followed by GPT-5.5 (93.5%) and MiniMax M3 (92.9%). BenchLM does not use these results to rank models overall.

132 models · Knowledge · Current · Display only · Updated July 4, 2026

The published AA-GPQA Diamond snapshot is tightly clustered at the top: Gemini 3.1 Pro sits at 94.1%, while the third row is only 1.2 points behind. The broader top-10 spread is 3.0 points, so many of the published scores sit in a relatively narrow band.
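The two spread figures above can be reproduced from the top ten rows of the score table. The following is a minimal sketch; the `spread` helper is a hypothetical name introduced for illustration and is not part of BenchLM.

```python
# Top-10 scores copied from the published score table (percent accuracy).
top10 = [94.1, 93.5, 92.9, 92.3, 92.2, 92.0, 92.0, 91.5, 91.4, 91.1]


def spread(scores, k):
    """Gap in points between the best score and the k-th best score."""
    ranked = sorted(scores, reverse=True)
    # Round to one decimal to match the precision of the published scores.
    return round(ranked[0] - ranked[k - 1], 1)


top3_gap = spread(top10, 3)      # leader vs. third row
top10_spread = spread(top10, 10)  # leader vs. tenth row
print(f"top-3 gap: {top3_gap} points, top-10 spread: {top10_spread} points")
```

With the scores listed, this yields a 1.2-point gap to the third row and a 3.0-point spread across the top ten, matching the figures quoted in the paragraph.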

132 models have been evaluated on AA-GPQA Diamond. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. AA-GPQA Diamond is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
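To illustrate what "display-only" means for a weighted score, here is a rough sketch. BenchLM's actual formula is not given on this page, so everything except the 12% Knowledge weight is an assumption: the `Benchmark` type, the use of a plain mean within a category, and the second benchmark in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    category: str
    score: float        # percent accuracy for one model
    display_only: bool  # True: shown for reference, excluded from scoring


# Only the 12% Knowledge weight comes from the page; other categories are omitted.
CATEGORY_WEIGHTS = {"Knowledge": 0.12}


def category_score(benchmarks, category):
    """Mean of the scored (non-display-only) benchmarks in a category.

    Averaging is an assumption made for this sketch.
    """
    scored = [b.score for b in benchmarks
              if b.category == category and not b.display_only]
    return sum(scored) / len(scored) if scored else None


def weighted_contribution(benchmarks, category):
    """Category score multiplied by the category weight, or None if nothing is scored."""
    s = category_score(benchmarks, category)
    return None if s is None else CATEGORY_WEIGHTS[category] * s


results = [
    Benchmark("AA-GPQA Diamond", "Knowledge", 94.1, display_only=True),
    # Hypothetical scored benchmark, included only to make the example run.
    Benchmark("Example scored benchmark", "Knowledge", 80.0, display_only=False),
]
contribution = weighted_contribution(results, "Knowledge")
print(contribution)
```

Under these assumptions, changing the AA-GPQA Diamond score leaves `contribution` unchanged, which is the sense in which the benchmark does not directly affect overall rankings.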

About AA-GPQA Diamond

Year: 2026
Tasks: Graduate-level science questions
Format: Accuracy
Difficulty: Graduate-level science reasoning

BenchLM stores the Artificial Analysis GPQA Diamond result separately from the weighted GPQA lane so AA refreshes remain display-only.

BenchLM freshness & provenance

Version: AA-GPQA Diamond 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (132 models)

Rank  Score
1     94.1%
2     93.5%
3     92.9%
4     92.3%
5     92.2%
6     92.0%
7     92.0%
8     91.5%
9     91.4%
10    91.1%
11    90.8%
12    90.5%
13    90.3%
14    90.1%
15    90.0%
16    89.9%
17    89.6%
18    89.6%
19    89.5%
20    89.4%
21    89.3%
22    88.8%
23    88.8%
24    88.5%
25    88.4%
26    88.2%
27    87.9%
28    87.9%
29    87.7%
30    87.5%
31    87.4%
32    87.3%
33    87.0%
34    86.8%
35    86.7%
36    86.7%
37    86.7%
38    86.6%
39    86.6%
40    86.1%
41    86.0%
42    86.0%
43    85.9%
44    85.8%
45    85.7%
46    85.7%
47    85.4%
49    84.7%
50    84.7%
51    84.5%
52    84.5%
53    84.4%
54    84.2%
55    84.2%
56    84.1%
57    84.0%
58    82.8%
59    82.7%
60    82.2%
61    82.0%
62    81.7%
63    81.3%
64    81.2%
65    81.0%
66    80.9%
67    80.9%
68    80.9%
69    79.9%
70    79.2%
71    78.3%
72    78.2%
73    77.9%
74    76.9%
75    76.9%
76    76.6%
77    76.4%
78    76.1%
79    75.3%
80    75.2%
81    75.2%
82    75.1%
83    74.8%
84    74.8%
85    74.7%
86    73.8%
87    73.5%
88    73.3%
89    72.8%
90    72.7%
91    68.8%
92    68.3%
93    68.3%
94    68.0%
95    67.1%
96    66.6%
97    66.4%
98    65.6%
99    63.7%
100   63.3%
101   63.2%
102   62.8%
103   61.5%
104   59.3%
105   58.9%
106   58.7%
107   57.8%
108   57.6%
109   57.5%
110   56.1%
111   55.7%
112   54.3%
113   51.5%
114   51.3%
115   51.2%
116   49.9%
117   48.9%
118   48.6%
120   43.3%
121   42.8%
122   42.6%
123   42.4%
124   41.7%
125   39.9%
126   37.4%
127   28.9%
128   28.1%
129   27.7%
130   26.3%
131   26.1%
132   25.7%

FAQ

What does AA-GPQA Diamond measure?

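AA-GPQA Diamond reports accuracy on graduate-level science questions, as published by Artificial Analysis. BenchLM shows the score for reference only and does not include it in its scoring formula.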

Which model scores highest on AA-GPQA Diamond?

Gemini 3.1 Pro by Google currently leads with a score of 94.1% on AA-GPQA Diamond.

How many models are evaluated on AA-GPQA Diamond?

132 AI models have been evaluated on AA-GPQA Diamond on BenchLM.

Last updated: July 4, 2026 · BenchLM version AA-GPQA Diamond 2026
