GDPval-AA

An agentic real-world work-task evaluation reported as an Elo score in DeepSeek-V4 thinking-mode evaluations.

Benchmark score on GDPval-AA — June 16, 2026

BenchLM mirrors the published score view for GDPval-AA. Claude Mythos 5 tops the public snapshot at 1932, level with Claude Fable 5 (1932) and ahead of Claude Opus 4.8 (1890). BenchLM does not use these results to rank models overall.

122 models · Agentic · Current · Display only · Updated June 16, 2026

The published GDPval-AA snapshot is tightly clustered at the top: Claude Mythos 5 sits at 1932, while the third row is only 42 points behind. The broader top-10 spread is 313 points, so the benchmark still separates strong models even when the leaders cluster.
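
The gaps quoted above can be reproduced from the top ten scores in the table on this page. The sketch below also turns a gap into an expected head-to-head score using the conventional Elo formula with a 400-point scale. That scale is an assumption: the page does not say how GDPval-AA Elo is parameterized, and the function name `expected_score` is illustrative only.

```python
# Top-10 Elo scores copied from the score table on this page.
top10 = [1932, 1932, 1890, 1769, 1753, 1674, 1672, 1670, 1656, 1619]


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Conventional Elo expected score of A against B.

    The 400-point scale is an assumption, not something the source confirms.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


lead_gap = top10[0] - top10[2]        # first row vs. third row
top10_spread = top10[0] - top10[-1]   # first row vs. tenth row

print("lead gap:", lead_gap)
print("top-10 spread:", top10_spread)
print("expected score, rank 1 vs rank 3:", round(expected_score(top10[0], top10[2]), 3))
print("expected score, rank 1 vs rank 10:", round(expected_score(top10[0], top10[-1]), 3))
```

Under that assumed scale, a 42-point gap corresponds to an expected score only modestly above one half, while a 313-point gap is a clear advantage, which is consistent with reading the leaders as clustered and the wider top ten as well separated.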

122 models have been evaluated on GDPval-AA. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. GDPval-AA is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About GDPval-AA

Year: 2026

Tasks: Agentic real-world work tasks

Format: Elo

Difficulty: Professional agentic workflows

BenchLM stores GDPval-AA as a display-only provider-table row for DeepSeek-V4 because the source reports an Elo score rather than a 0-100 percentage.

BenchLM freshness & provenance

Version: GDPval-AA 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (122 models)

Rank  Score
1     1932
2     1932
3     1890
4     1769
5     1753
6     1674
7     1672
8     1670
9     1656
10    1619
11    1596
12    1589
13    1571
14    1558
15    1554
16    1543
17    1518
18    1505
19    1504
20    1495
21    1493
22    1481
23    1480
24    1467
25    1446
26    1438
27    1418
28    1417
29    1414
30    1405
31    1403
32    1391
33    1388
34    1379
35    1350
36    1328
37    1324
38    1317
39    1314
40    1300
41    1298
42    1292
43    1288
44    1284
45    1284
46    1236
47    1227
48    1217
49    1195
50    1191
51    1191
52    1190
53    1184
54    1183
55    1168
56    1160
57    1123
58    1115
59    1114
60    1113
61    1075
62    1059
64    1037
66    1014
67    1001
68    991
69    985
70    947
71    926
72    919
73    918
74    905
75    876
76    875
77    864
78    864
79    864
81    859
82    824
83    785
84    781
85    777
86    763
88    757
89    739
90    738
91    730
92    680
93    647
94    619
95    612
96    585
97    560
98    525
99    443
100   436
101   409
102   386
103   378
104   359
105   348
106   347
107   328
108   323
109   318
110   303
111   293
112   289
113   283
114   270
115   269
116   268
117   268
118   255
119   255
120   255
121   238
122   232

FAQ

What does GDPval-AA measure?

An agentic real-world work-task evaluation reported as an Elo score in DeepSeek-V4 thinking-mode evaluations.

Which model scores highest on GDPval-AA?

Claude Mythos 5 by Anthropic currently leads with a score of 1932 on GDPval-AA.

How many models are evaluated on GDPval-AA?

122 AI models have been evaluated on GDPval-AA on BenchLM.

Last updated: June 16, 2026 · BenchLM version GDPval-AA 2026
