Benchmark profile

GBA-Eval

An agentic coding benchmark that asks models to build a Game Boy Advance emulator from scratch and grades emulator behavior against procedural, audio, and gameplay tests.

Data verified May 30, 2026

How BenchLM shows GBA-Eval

BenchLM mirrors the official GBA-Eval leaderboard snapshot graded on May 30, 2026. The benchmark asks coding agents to build a Game Boy Advance emulator and scores the result against 27 procedural, audio, and gameplay test cases.

GBA-Eval is display only on BenchLM. The source rows are agentic software-engineering runs with large token budgets and verifier-specific emulator tests, so BenchLM does not fold them into model-only weighted rankings.

14 agent rows · 27 emulator tests · GBA emulator build · Official JSON feed · Display only

GBA-Eval leaderboard · Official leaderboard JSON

Overall score on GBA-Eval — May 30, 2026

BenchLM mirrors the published overall score view for GBA-Eval. Claude Opus 4.8 leads the public snapshot at 70.9%, followed by GPT-5.5 (53.2%) and Claude Sonnet 4.6 (48.8%). BenchLM does not use these results to rank models overall.

Claude Opus 4.8 (Anthropic, claude-opus-4-8): 70.9%

GPT-5.5 (OpenAI, gpt-5.5): 53.2%

Claude Sonnet 4.6 (Anthropic, claude-sonnet-4-6): 48.8%

14 models · External benchmark mirrors · Current · Display only · Updated May 30, 2026

Overall score table (14 models)

Rank  Model              Provider              Score
1     Claude Opus 4.8    Anthropic             70.9%
2     GPT-5.5            OpenAI                53.2%
3     Claude Sonnet 4.6  Anthropic             48.8%
4     Claude Opus 4.6    Anthropic             44.1%
5     Claude Opus 4.7    Anthropic             43.8%
6     GPT-5.4            OpenAI                31.6%
7     Gemini 3.5 Flash   Google                6.7%
8     Grok Build 0.1     xAI                   2.4%
9     MiniMax M3         OpenRouter via goose  0.9%
10    Kimi K2.6          OpenRouter via goose  0.9%
11    Gemini 3.1 Pro     Google                0.8%
12    Qwen 3.7 Max       OpenRouter via goose  0.4%
13    GLM 5.1            OpenRouter via goose  0.0%
14    MiniMax M2.7       OpenRouter via goose  0.0%

The published GBA-Eval snapshot places Claude Opus 4.8 first at 70.9%, 17.7 points ahead of GPT-5.5 and 22.1 points ahead of third-place Claude Sonnet 4.6. The top 10 rows span 70.0 points (70.9% down to 0.9%), so the table separates the published systems by wide margins.
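The gaps quoted above can be re-derived from the table. The following is a minimal sketch, not BenchLM's own tooling: the scores are copied by hand from the overall score table, and the helper name `gap` is a hypothetical one chosen for illustration.

```python
# Scores copied by hand from the overall score table (percent).
scores = [
    ("Claude Opus 4.8", 70.9),
    ("GPT-5.5", 53.2),
    ("Claude Sonnet 4.6", 48.8),
    ("Claude Opus 4.6", 44.1),
    ("Claude Opus 4.7", 43.8),
    ("GPT-5.4", 31.6),
    ("Gemini 3.5 Flash", 6.7),
    ("Grok Build 0.1", 2.4),
    ("MiniMax M3", 0.9),
    ("Kimi K2.6", 0.9),
    ("Gemini 3.1 Pro", 0.8),
    ("Qwen 3.7 Max", 0.4),
    ("GLM 5.1", 0.0),
    ("MiniMax M2.7", 0.0),
]


def gap(rows, i, j):
    """Point gap between the rows at 0-based ranks i and j, rounded to 0.1."""
    return round(rows[i][1] - rows[j][1], 1)


# Sort descending by score so index 0 is the leader.
ranked = sorted(scores, key=lambda row: row[1], reverse=True)

lead_over_second = gap(ranked, 0, 1)  # leader vs. second row
lead_over_third = gap(ranked, 0, 2)   # leader vs. third row
top10_range = gap(ranked, 0, 9)       # leader vs. tenth row

print(lead_over_second, lead_over_third, top10_range)
```

Rounding to one decimal place matches the precision of the published percentages and avoids floating-point noise in the differences.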

BenchLM's mirrored snapshot covers 14 models. GBA-Eval falls in the External benchmark mirrors category, which BenchLM keeps separate from its weighted global scoring system. The results are shown for reference as source-specific evidence and do not directly affect overall rankings.
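To illustrate what "display only" means for an aggregate score, here is a hedged sketch. It does not reproduce BenchLM's actual scoring formula, which is not described on this page. The `BenchmarkResult` type, the `weighted_overall` function, the weights, and the two "hypothetical" benchmarks are all assumptions made up for the example. Only the GBA-Eval score of 70.9% comes from the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkResult:
    name: str
    score: float        # percent, 0-100
    weight: float       # hypothetical weight, not a BenchLM value
    display_only: bool  # True means shown on the page but never scored


def weighted_overall(results: List[BenchmarkResult]) -> Optional[float]:
    """Weighted mean over benchmarks that are not display only."""
    scored = [r for r in results if not r.display_only]
    total_weight = sum(r.weight for r in scored)
    if total_weight == 0:
        return None  # nothing eligible for scoring
    return sum(r.score * r.weight for r in scored) / total_weight


# Illustrative inputs: A and B are invented, GBA-Eval uses the published 70.9%.
results = [
    BenchmarkResult("Hypothetical benchmark A", 80.0, 2.0, False),
    BenchmarkResult("Hypothetical benchmark B", 60.0, 1.0, False),
    BenchmarkResult("GBA-Eval", 70.9, 0.0, True),
]

overall = weighted_overall(results)
print(overall)
```

Under this sketch, changing the GBA-Eval score leaves `overall` unchanged, which is the behavior the display-only label describes.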

About GBA-Eval

Year: 2026

Tasks: 27 emulator test cases

Format: Overall emulator score

Difficulty: Long-horizon systems programming

GBA-Eval evaluates long-horizon coding agents by having them implement a working GBA emulator. The public leaderboard reports overall scores across 27 test cases, and the source JSON feed also preserves token usage and checkpoints.

GBA-Eval Public benchmark source

BenchLM freshness & provenance

Version: GBA-Eval 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does GBA-Eval measure?

GBA-Eval is an agentic coding benchmark that asks models to build a Game Boy Advance emulator from scratch and grades emulator behavior against procedural, audio, and gameplay tests.

Which model leads the published GBA-Eval snapshot?

Claude Opus 4.8 currently leads the published GBA-Eval snapshot with an overall score of 70.9%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on GBA-Eval?

14 AI models are included in BenchLM's mirrored GBA-Eval snapshot, based on the public leaderboard captured on May 30, 2026.

Last updated: May 30, 2026 · mirrored from the public benchmark leaderboard