Benchmark profile

Gert Labs Composite Game Benchmark (Gert Labs)

A game-environment benchmark that evaluates AI models in novel games covering strategic planning, resource management, spatial reasoning, cooperation, and theory of mind.

Data verified July 16, 2026

Benchmark score on Gert Labs — July 16, 2026

BenchLM mirrors the published score view for Gert Labs. Claude Opus 4.8 leads the public snapshot at 72.97%, followed by GPT-5.5 (72.93%) and Claude Opus 4.7 (65.59%). BenchLM does not use these results to rank models overall.

1. Claude Opus 4.8 · Anthropic · Closed · claude-opus-4-8 · 72.97% · Overall 84 · Context 1M

2. GPT-5.5 · OpenAI · Closed · gpt-5-5 · 72.93% · Overall 78 · Context 1M

3. Claude Opus 4.7 · Anthropic · Closed · claude-opus-4-7 · 65.59% · Overall n/a · Context 1M

56 models · Agentic · Current · Display only · Updated July 16, 2026

Benchmark score table (56 models)

Model · Provider · Access · Score
Claude Opus 4.8 · Anthropic · Closed · 72.97%
GPT-5.5 · OpenAI · Closed · 72.93%
Claude Opus 4.7 · Anthropic · Closed · 65.59%
GPT-5.4 · OpenAI · Closed · 64.89%
Qwen3.7 Max · Alibaba · Closed · 64.27%
Claude Opus 4.5 · Anthropic · Closed · 64.23%
Gemini 3 Pro · Google · Closed · 63.23%
Claude Sonnet 4.6 · Anthropic · Closed · 62.92%
Gemini 3.5 Flash · Google · Closed · 62.80%
MiMo-V2.5-Pro · Xiaomi · Closed · 62.70%
Claude Opus 4.6 · Anthropic · Closed · 61.85%
GLM-5.1 · Z.AI · Open weight · 60.11%
GPT-5.3 Codex · OpenAI · Closed · 57.47%
Kimi K2.6 · Moonshot AI · Open weight · 56.82%
Gemini 3 Flash · Google · Closed · 56.63%
Qwen3.6-27B · Alibaba · Open weight · 54.84%
DeepSeek V4 Flash · DeepSeek · Open weight · 54.35%
GPT-5.2-Codex · OpenAI · Closed · 51.79%
Step 3.7 Flash · StepFun · Open weight · 51.57%
GLM-5 · Z.AI · Open weight · 50.99%
Qwen3.6 Plus · Alibaba · Closed · 50.60%
DeepSeek V4 Pro · DeepSeek · Open weight · 50.28%
Gemini 3.1 Pro · Google · Closed · 49.91%
GPT-5.1-Codex · OpenAI · Closed · 49.68%
Grok Build 0.1 · xAI · Closed · 49.15%
Claude Sonnet 4.5 · Anthropic · Closed · 48.51%
Grok 4.1 Fast · xAI · Closed · 47.32%
MiMo-V2.5 · Xiaomi · Closed · 46.89%
Qwen3.5 397B · Alibaba · Open weight · 46.76%
GPT-5.2 · OpenAI · Closed · 46.54%
Kimi K2.5 · Moonshot AI · Open weight · 45.88%
Grok 4.3 · xAI · Closed · 43.86%
Qwen3 Max · Alibaba · Closed · 43.74%
Qwen3.6-35B-A3B · Alibaba · Open weight · 42.65%
Grok 4 · xAI · Closed · 42.34%
Gemini 2.5 Pro · Google · Closed · 42.01%
GPT-5.1 · OpenAI · Closed · 41.24%
MiniMax M2.7 · MiniMax · Open weight · 40.40%
GLM-4.7 · Z.AI · Open weight · 39.95%
Claude 4 Sonnet · Anthropic · Closed · 39.66%
Qwen3.5-27B · Alibaba · Open weight · 39.41%
MiniMax M2.5 · MiniMax · Closed · 39.11%
Mistral Medium 3.5 128B · Mistral · Open weight · 39.10%
Gemini 3.1 Flash-Lite · Google · Closed · 38.46%
Grok 4.20 · xAI · Closed · 38.36%
Hy3 Preview · Tencent · Open weight · 36.91%
MiMo-V2-Pro · Xiaomi · Closed · 36.68%
Gemma 4 31B · Google · Open weight · 35.26%
Kimi K2.5 (Reasoning) · Moonshot AI · Closed · 32.58%
Trinity-Large-Thinking · Arcee AI · Open weight · 32.55%
GLM-5V-Turbo · Z.AI · Closed · 30.76%
GPT-OSS 120B · OpenAI · Open weight · 29.61%
DeepSeek V3.2 · DeepSeek · Open weight · 29.57%
Qwen3.5-35B-A3B · Alibaba · Open weight · 28.96%
GPT-4.1 · OpenAI · Closed · 25.65%
Nemotron 3 Super 120B A12B · NVIDIA · Open weight · 25.34%

The published Gert Labs snapshot is nearly tied at the top: Claude Opus 4.8 sits at 72.97% and GPT-5.5 at 72.93%, a gap of 0.04 points. The third row, Claude Opus 4.7, is 7.38 points behind the leader, and the top-10 spread is 10.27 points, so the benchmark still separates strong models even though the top two are effectively level.
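The gap figures in this summary can be reproduced from the published scores. The following is a small sketch that recomputes them from the top-10 rows of the score table; the `top10` list and the `gap` helper are illustrative names, not part of any BenchLM or Gert Labs tooling.

```python
# Scores copied from the top-10 rows of the published score table.
top10 = [
    ("Claude Opus 4.8", 72.97),
    ("GPT-5.5", 72.93),
    ("Claude Opus 4.7", 65.59),
    ("GPT-5.4", 64.89),
    ("Qwen3.7 Max", 64.27),
    ("Claude Opus 4.5", 64.23),
    ("Gemini 3 Pro", 63.23),
    ("Claude Sonnet 4.6", 62.92),
    ("Gemini 3.5 Flash", 62.80),
    ("MiMo-V2.5-Pro", 62.70),
]


def gap(rows, i, j):
    """Point gap between rank i and rank j (1-indexed), rounded to 2 decimals."""
    return round(rows[i - 1][1] - rows[j - 1][1], 2)


first_to_second = gap(top10, 1, 2)   # leader vs. runner-up
first_to_third = gap(top10, 1, 3)    # leader vs. third row
top10_spread = gap(top10, 1, 10)     # leader vs. tenth row

print(f"1st to 2nd: {first_to_second} points")
print(f"1st to 3rd: {first_to_third} points")
print(f"Top-10 spread: {top10_spread} points")
```

The gaps are percentage-point differences between displayed scores, so they inherit the two-decimal precision of the published table.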

56 models have been evaluated on Gert Labs. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. Gert Labs is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About Gert Labs

Year: 2026

Tasks: Novel game environments

Format: Composite game leaderboard

Difficulty: Agentic coding and decision-making

The public Gert Labs leaderboard reports a composite 0-100 metric derived from average and median percentile across games, success rate, and response-time penalty. The combined leaderboard blends agentic coding, one-shot coding, and social decision-making modes.
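The page names the inputs to the composite metric but not how they are combined. As a rough illustration only, the sketch below shows one way such inputs could be blended into a 0-100 score. The weights, the subtractive penalty, and the function name `composite_score` are all assumptions made for this example and should not be read as the Gert Labs formula.

```python
from statistics import mean, median


def composite_score(percentiles, success_rate, time_penalty,
                    w_mean=0.4, w_median=0.4, w_success=0.2):
    """Hypothetical composite on a 0-100 scale.

    percentiles  -- per-game percentile ranks, each in [0, 100]
    success_rate -- fraction of games completed successfully, in [0, 1]
    time_penalty -- points subtracted for slow responses (assumed form)
    The weights are illustrative assumptions, not published values.
    """
    if not percentiles:
        raise ValueError("at least one per-game percentile is required")
    base = (
        w_mean * mean(percentiles)
        + w_median * median(percentiles)
        + w_success * (100 * success_rate)
    )
    # Clamp so the result stays on the 0-100 scale the leaderboard reports.
    return max(0.0, min(100.0, base - time_penalty))


# Example with made-up inputs for a single model.
example = composite_score([80, 60, 70], success_rate=0.9, time_penalty=2.0)
print(round(example, 2))
```

Using both the mean and the median of per-game percentiles is one way to reward broad competence while damping the effect of a single outlier game; whether Gert Labs weights them equally is not stated on this page.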


BenchLM freshness & provenance

Version: Gert Labs 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does Gert Labs measure?

A game-environment benchmark that evaluates AI models in novel games covering strategic planning, resource management, spatial reasoning, cooperation, and theory of mind.

Which model scores highest on Gert Labs?

Claude Opus 4.8 by Anthropic currently leads with a score of 72.97% on Gert Labs.

How many models are evaluated on Gert Labs?

BenchLM lists 56 AI models evaluated on Gert Labs.


Last updated: July 16, 2026 · BenchLM version Gert Labs 2026
