GDPval-AA normalized (GDPval-AA)

Name: GDPval-AA normalized
Creator: BenchLM

A display-only Artificial Analysis normalized score for economically valuable tasks.

Benchmark score on GDPval-AA — June 13, 2026

BenchLM mirrors the published score view for GDPval-AA. Claude Opus 4.8 leads the public snapshot at 69.5%, followed by GPT-5.5 (63.5%) and Claude Opus 4.7 (Adaptive) (62.6%). BenchLM does not use these results to rank models overall.

1. Claude Opus 4.8 · Anthropic · Closed · 69.5% · Overall 93 · Context 1M
2. GPT-5.5 · OpenAI · Closed · 63.5% · Overall 89 · Context 1M
3. Claude Opus 4.7 (Adaptive) · Anthropic · Closed · 62.6% · Overall 84 · Context 1M

121 models · Agentic · Current · Display only · Updated June 13, 2026

The published GDPval-AA snapshot is tightly clustered at the top: Claude Opus 4.8 sits at 69.5%, while the third row is only 6.9 points behind. The broader top-10 spread is 15.0 points, so the benchmark still separates strong models even when the leaders cluster.
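The two gaps quoted above can be checked directly against the top ten rows of the score table. The snippet below is a minimal sketch of that arithmetic; the helper name `spread` is illustrative and not part of any BenchLM tooling.

```python
# Top-10 scores (percent) copied from the benchmark score table.
top10 = [69.5, 63.5, 62.6, 58.7, 58.6, 58.5, 57.8, 55.9, 54.8, 54.5]


def spread(scores, n):
    """Gap in points between the best score and the n-th best score."""
    ranked = sorted(scores, reverse=True)
    # Round to one decimal to match the precision of the published scores.
    return round(ranked[0] - ranked[n - 1], 1)


top3_gap = spread(top10, 3)       # leader vs. third row
top10_spread = spread(top10, 10)  # leader vs. tenth row
print(top3_gap, top10_spread)     # 6.9 15.0
```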

121 models have been evaluated on GDPval-AA. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. GDPval-AA is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
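The effect of the display-only flag can be illustrated with a small sketch. Only the 22% category weight and the fact that GDPval-AA is excluded come from the text above. The plain average within a category, the `Benchmark` record, and the second benchmark name are assumptions made for illustration; BenchLM's actual formula is described on its methodology page and may differ.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    category: str
    score: float        # a model's score on this benchmark, 0-100
    display_only: bool  # shown on the page but left out of scoring


AGENTIC_WEIGHT = 0.22  # category weight stated in the text


def category_contribution(benchmarks, category, weight):
    """Weighted average of scored benchmarks in a category (assumed formula)."""
    scored = [
        b.score
        for b in benchmarks
        if b.category == category and not b.display_only
    ]
    if not scored:
        return 0.0
    return weight * sum(scored) / len(scored)


rows = [
    Benchmark("GDPval-AA", "Agentic", 69.5, display_only=True),
    # Hypothetical scored benchmark, used only to make the example non-empty.
    Benchmark("ExampleAgentBench", "Agentic", 80.0, display_only=False),
]
contribution = category_contribution(rows, "Agentic", AGENTIC_WEIGHT)
print(round(contribution, 2))
```

Under these assumptions, changing the GDPval-AA score leaves `contribution` unchanged, which is what "does not directly affect overall rankings" means in practice.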

About GDPval-AA

Year: 2026
Tasks: Economically valuable tasks
Format: Normalized score
Difficulty: Professional agentic workflows

OpenRouter's Grok 4.3 benchmark card displays GDPval-AA as a normalized percentage. BenchLM stores it separately from the Elo-style GDPval-AA rows used in provider comparison tables.

Artificial Analysis model benchmarks

BenchLM freshness & provenance

Version: GDPval-AA 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (121 models)

1. Claude Opus 4.8 · Anthropic · Closed · 69.5%
2. GPT-5.5 · OpenAI · Closed · 63.5%
3. Claude Opus 4.7 (Adaptive) · Anthropic · Closed · 62.6%
4. GPT-5.4 · OpenAI · Closed · 58.7%
5. Claude Opus 4.7 · Anthropic · Closed · 58.6%
6. MiniMax M3 · MiniMax · Open · 58.5%
7. Gemini 3.5 Flash · Google · Closed · 57.8%
8. Claude Opus 4.6 (Adaptive) · Anthropic · Closed · 55.9%
9. Claude Sonnet 4.6 · Anthropic · Closed · 54.8%
10. Claude Opus 4.6 · Anthropic · Closed · 54.5%
11. MiMo-V2.5-Pro · Xiaomi · Closed · 53.6%
12. DeepSeek V4 Pro (High) · DeepSeek · Open · 52.9%
13. DeepSeek V4 Pro (Max) · DeepSeek · Open · 52.7%
14. Qwen3.7 Max · Alibaba · Closed · 52.2%
15. GLM-5.1 · Z.AI · Open · 51.8%
16. Qwen3.7 Plus · Alibaba · Closed · 50.9%
17. MiniMax M2.7 · MiniMax · Open · 50.2%
18. Qwen 3.6 Max (preview) · Alibaba · Closed · 50.2%
19. Grok 4.3 · xAI · Closed · 49.8%
20. GLM-5-Turbo · Z.AI · Closed · 49.7%
21. Kimi K2.6 · Moonshot AI · Open · 49.1%
22. GPT-5.3 Codex · OpenAI · Closed · 49.0%
23. GPT-5.2 · OpenAI · Closed · 48.3%
24. Claude Opus 4.5 Thinking · Anthropic · Closed · 47.3%
25. GPT-5.4 mini · OpenAI · Closed · 46.9%
26. Claude Opus 4.5 · Anthropic · Closed · 45.9%
27. Muse Spark · Meta · Closed · 45.9%
28. DeepSeek V4 Flash (High) · DeepSeek · Open · 45.7%
29. MiMo-V2-Pro · Xiaomi · Closed · 45.3%
30. Qwen3.6-27B · Alibaba · Open · 45.2%
31. GLM-5 · Z.AI · Open · 44.6%
32. DeepSeek V4 Flash (Max) · DeepSeek · Open · 44.4%
33. Nemotron 3 Ultra · NVIDIA · Open · 44.0%
34. Qwen3.6 Plus · Alibaba · Closed · 42.5%
35. GLM-5V-Turbo · Z.AI · Closed · 41.4%
36. Gemini 3 Pro Deep Think · Google · Closed · 41.2%
37. MiMo-V2-Omni · Xiaomi · Closed · 40.9%
38. Gemini 3.1 Pro · Google · Closed · 40.7%
39. Step 3.7 Flash · StepFun · Open · 40.0%
40. Qwen3.6-35B-A3B · Alibaba · Open · 39.9%
41. GPT-5 (high) · OpenAI · Closed · 39.6%
42. GPT-5.2-Codex · OpenAI · Closed · 39.4%
43. Kimi K2.5 (Reasoning) · Moonshot AI · Closed · 39.2%
44. Kimi K2.5 · Moonshot AI · Open · 39.2%
45. Hy3 Preview · Tencent · Open · 36.8%
46. GPT-5.1 · OpenAI · Closed · 36.4%
47. Qwen3.5 397B · Alibaba · Open · 35.8%
48. GPT-5.4 nano · OpenAI · Closed · 34.8%
49. Qwen3.5 397B (Reasoning) · Alibaba · Open · 34.5%
50. GPT-5.1-Codex-Max · OpenAI · Closed · 34.5%
51. GPT-5.1-Codex · OpenAI · Closed · 34.5%
52. Gemini 3 Pro · Google · Closed · 34.2%
53. GLM-4.7 · Z.AI · Open · 34.1%
54. Mistral Medium 3.5 128B · Mistral · Open · 33.4%
55. Qwen3.5-27B · Alibaba · Open · 33.0%
56. Claude 4 Sonnet · Anthropic · Closed · 31.2%
57. Qwen3.5-122B-A10B · Alibaba · Open · 30.7%
58. Gemini 3 Flash · Google · Closed · 30.7%
59. Gemma 4 31B · Google · Open · 30.7%
60. DeepSeek V3.1 · DeepSeek · Open · 28.7%
61. MiMo-V2-Flash · Xiaomi · Open · 28.0%
62. Grok 4.1 Fast (Reasoning) · xAI · Closed · 27.3%
63. Qwen3 Max · Alibaba · Closed · 26.8%
64. Gemma 4 26B A4B · Google · Open · 25.7%
65. Grok 4 Fast (Reasoning) · xAI · Closed · 25.7%
66. GPT-5 (medium) · OpenAI · Closed · 25.1%
67. Grok 4 · xAI · Closed · 24.6%
68. GLM-4.6 · Z.AI · Open · 24.3%
69. GPT-OSS 120B · OpenAI · Open · 22.4%
70. Gemini 3.1 Flash-Lite · Google · Closed · 21.3%
71. Gemini 2.5 Pro · Google · Closed · 20.9%
72. Command A+ · Cohere · Open · 20.9%
73. Qwen3.5-35B-A3B · Alibaba · Open · 20.3%
74. DeepSeek V3.2 · DeepSeek · Open · 18.8%
75. Gemma 4 12B · Google · Open · 18.8%
76. Mistral Large 3 · Mistral · Closed · 18.2%
77. Trinity-Large-Preview · Arcee AI · Open · 18.2%
78. Trinity-Large-Thinking · Arcee AI · Open · 18.2%
79. Mistral Small 4 (Reasoning) · Mistral · Open · 18.0%
80. Mistral Small 4 · Mistral · Open · 18.0%
81. K-Exaone · LG AI Research · Closed · 16.2%
82. Ling 2.6 Flash · InclusionAI · Open · 14.2%
83. Grok 4.1 Fast · xAI · Closed · 14.1%
84. GPT-4.1 · OpenAI · Closed · 13.8%
85. Grok Code Fast 1 · xAI · Closed · 13.1%
86. Nemotron 3 Nano Omni 30B A3B · NVIDIA · Open · 13.1%
87. (unnamed) · OpenAI · Closed · 12.8%
88. Sarvam 105B · Sarvam · Open · 11.9%
89. Gemini 2.5 Flash · Google · Closed · 11.9%
90. (unnamed) · OpenAI · Closed · 11.5%
91. DeepSeek-R1 · DeepSeek · Open · 9.0%
92. GPT-OSS 20B · OpenAI · Open · 7.4%
93. GPT-4.1 mini · OpenAI · Closed · 6.0%
94. DeepSeek V3.1 (Reasoning) · DeepSeek · Open · 5.6%
95. Mistral Medium 3 · Mistral · Closed · 4.2%
96. GLM-4.5-Air · Z.AI · Closed · 3.0%
97. Kimi K2 · Moonshot AI · Closed · 1.2%
98. GPT-4o · OpenAI · Closed · 0.0%
99. Llama 3.1 405B · Meta · Open · 0.0%
100. Mistral Large 2 · Mistral · Closed · 0.0%
101. DeepSeek V3 · DeepSeek · Open · 0.0%
102. GPT-4.1 nano · OpenAI · Closed · 0.0%
103. Llama 4 Scout · Meta · Open · 0.0%
104. Nemotron 3 Nano 30B · NVIDIA · Open · 0.0%
105. Claude 3 Haiku · Anthropic · Closed · 0.0%
106. Nemotron Ultra 253B · NVIDIA · Open · 0.0%
107. Llama 4 Maverick · Meta · Open · 0.0%
108. Gemma 3 27B · Google · Open · 0.0%
109. Nova Pro · Amazon · Closed · 0.0%
110. Exaone 4.0 32B · LG AI Research · Open · 0.0%
111. LFM2.5-8B-A1B · LiquidAI · Open · 0.0%
112. Sarvam 30B · Sarvam · Open · 0.0%
113. Gemma 4 E4B · Google · Open · 0.0%
114. Granite-4.0-1B · IBM · Open · 0.0%
115. Gemma 4 E2B · Google · Open · 0.0%
116. Granite-4.0-H-1B · IBM · Open · 0.0%
117. Solar Pro 2 · Upstage · Closed · 0.0%
118. Exaone 4.0 1.2B · LG AI Research · Open · 0.0%
119. LFM2.5-VL-1.6B-Extract · LiquidAI · Open · 0.0%
120. Granite-4.0-350M · IBM · Open · 0.0%
121. Granite-4.0-H-350M · IBM · Open · 0.0%

FAQ

What does GDPval-AA measure?

A display-only Artificial Analysis normalized score for economically valuable tasks.

Which model scores highest on GDPval-AA?

Claude Opus 4.8 by Anthropic currently leads with a score of 69.5% on GDPval-AA.

How many models are evaluated on GDPval-AA?

121 AI models have been evaluated on GDPval-AA on BenchLM.

Compare Top Models on GDPval-AA

Claude Opus 4.8 vs GPT-5.5 · GPT-5.5 vs Claude Opus 4.7 (Adaptive) · Claude Opus 4.7 (Adaptive) vs GPT-5.4 · GPT-5.4 vs Claude Opus 4.7

Last updated: June 13, 2026 · BenchLM version GDPval-AA 2026
