
GDPval-AA normalized (GDPval-AA)

A display-only Artificial Analysis normalized score for economically valuable tasks.

Benchmark score on GDPval-AA — June 13, 2026

BenchLM mirrors the published score view for GDPval-AA. Claude Opus 4.8 leads the public snapshot at 69.5%, followed by GPT-5.5 (63.5%) and Claude Opus 4.7 (Adaptive) (62.6%). BenchLM does not use these results to rank models overall.

121 models · Agentic · Current · Display only · Updated June 13, 2026

The published GDPval-AA snapshot is fairly tight at the top: Claude Opus 4.8 sits at 69.5%, 6.0 points ahead of GPT-5.5 and 6.9 points ahead of the third-place model. The broader top-10 spread is 15.0 points, so the benchmark still separates strong models even when the leaders cluster.
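The gap figures above can be reproduced from the top ten scores in the score table on this page. The following is a minimal sketch in Python; the helper name `spread` is an illustrative choice, not something BenchLM publishes.

```python
# Top-10 GDPval-AA scores (percent), copied from the score table on this page.
top10 = [69.5, 63.5, 62.6, 58.7, 58.6, 58.5, 57.8, 55.9, 54.8, 54.5]


def spread(scores, k):
    """Gap in points between the best score and the k-th best score."""
    ranked = sorted(scores, reverse=True)
    # Round to one decimal to match the page's precision and hide float noise.
    return round(ranked[0] - ranked[k - 1], 1)


gap_to_second = spread(top10, 2)   # leader vs. second place
gap_to_third = spread(top10, 3)    # leader vs. third place
top10_spread = spread(top10, 10)   # leader vs. tenth place

print(gap_to_second, gap_to_third, top10_spread)
```

The same helper works for any cutoff, for example `spread(top10, 5)` for the top-5 spread.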

121 models have been evaluated on GDPval-AA. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. GDPval-AA is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
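To illustrate what "displayed but excluded from the scoring formula" could mean in practice, here is a hedged sketch. BenchLM's actual formula is not given on this page, so everything below is an assumption made for illustration: the `Benchmark` type, the `agentic_contribution` function, the second benchmark row, and the choice of a simple mean within the category. Only the 22% category weight and the display-only status of GDPval-AA come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    score: float        # percent, 0-100
    display_only: bool  # True: shown on the page but left out of scoring


AGENTIC_WEIGHT = 0.22  # category weight stated on this page


def agentic_contribution(benchmarks):
    """Hypothetical rule: mean of the scored benchmarks times the category weight."""
    scored = [b.score for b in benchmarks if not b.display_only]
    if not scored:
        return 0.0
    return AGENTIC_WEIGHT * sum(scored) / len(scored)


rows = [
    Benchmark("GDPval-AA", 69.5, display_only=True),
    # Made-up placeholder row so the category has one scored benchmark.
    Benchmark("other-agentic-benchmark", 50.0, display_only=False),
]

contribution = agentic_contribution(rows)
print(contribution)
```

Under these assumptions, changing the GDPval-AA score leaves `contribution` unchanged, which matches the page's statement that the benchmark does not directly affect overall rankings.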

About GDPval-AA

Year: 2026

Tasks: Economically valuable tasks

Format: Normalized score

Difficulty: Professional agentic workflows

OpenRouter's Grok 4.3 benchmark card displays GDPval-AA as a normalized percentage. BenchLM stores it separately from the Elo-style GDPval-AA rows used in provider comparison tables.

BenchLM freshness & provenance

Version: GDPval-AA 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set


BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (121 models)

Rank  Score
1     69.5%
2     63.5%
3     62.6%
4     58.7%
5     58.6%
6     58.5%
7     57.8%
8     55.9%
9     54.8%
10    54.5%
11    53.6%
12    52.9%
13    52.7%
14    52.2%
15    51.8%
16    50.9%
17    50.2%
18    50.2%
19    49.8%
20    49.7%
21    49.1%
22    49.0%
23    48.3%
24    47.3%
25    46.9%
26    45.9%
27    45.9%
28    45.7%
29    45.3%
30    45.2%
31    44.6%
32    44.4%
33    44.0%
34    42.5%
35    41.4%
36    41.2%
37    40.9%
38    40.7%
39    40.0%
40    39.9%
41    39.6%
42    39.4%
43    39.2%
44    39.2%
45    36.8%
46    36.4%
47    35.8%
48    34.8%
49    34.5%
50    34.5%
51    34.5%
52    34.2%
53    34.1%
54    33.4%
55    33.0%
56    31.2%
57    30.7%
58    30.7%
59    30.7%
60    28.7%
61    28.0%
63    26.8%
64    25.7%
65    25.7%
66    25.1%
67    24.6%
68    24.3%
69    22.4%
70    21.3%
71    20.9%
72    20.9%
73    20.3%
74    18.8%
75    18.8%
76    18.2%
77    18.2%
78    18.2%
79    18.0%
80    18.0%
81    16.2%
82    14.2%
83    14.1%
84    13.8%
85    13.1%
87    12.8%
88    11.9%
89    11.9%
90    11.5%
91    9.0%
92    7.4%
93    6.0%
94    5.6%
95    4.2%
96    3.0%
97    1.2%
98    0.0%
99    0.0%
100   0.0%
101   0.0%
102   0.0%
103   0.0%
104   0.0%
105   0.0%
106   0.0%
107   0.0%
108   0.0%
109   0.0%
110   0.0%
111   0.0%
112   0.0%
113   0.0%
114   0.0%
115   0.0%
116   0.0%
117   0.0%
118   0.0%
119   0.0%
120   0.0%
121   0.0%

FAQ

What does GDPval-AA measure?

A display-only Artificial Analysis normalized score for economically valuable tasks.

Which model scores highest on GDPval-AA?

Claude Opus 4.8 by Anthropic currently leads with a score of 69.5% on GDPval-AA.

How many models are evaluated on GDPval-AA?

121 AI models have been evaluated on GDPval-AA on BenchLM.

Last updated: June 13, 2026 · BenchLM version GDPval-AA 2026
