Benchmark profile

RuneBench / runescape-bench (RuneScape-Bench)

An agentic coding benchmark where models use a TypeScript SDK to play a RuneScape-like environment and optimize skill-training performance.

How BenchLM shows RuneScape-Bench

BenchLM mirrors RuneBench, the public results site for runescape-bench. The snapshot averages ln(1 + XP/min) across 16 skill-training tasks where coding agents use a TypeScript SDK to control a RuneScape-like game environment.
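As a rough illustration of the aggregate described above, the sketch below averages ln(1 + XP/min) over a list of per-task XP rates. The helper name avg_log_xp_rate and the sample rates are hypothetical and are not taken from the runescape-bench code; the actual harness may measure XP/min or treat failed tasks differently.

```python
import math


def avg_log_xp_rate(xp_per_min):
    """Average ln(1 + XP/min) over per-task XP rates (hypothetical helper)."""
    if not xp_per_min:
        raise ValueError("need at least one task rate")
    # math.log1p(x) computes ln(1 + x); a task that earns no XP contributes 0.
    return sum(math.log1p(rate) for rate in xp_per_min) / len(xp_per_min)


# Made-up rates for illustration only; the benchmark itself uses 16 tasks.
sample_rates = [0.0, 50.0, 400.0, 1200.0]
score = avg_log_xp_rate(sample_rates)
print(round(score, 2))
```

Because each task enters through a logarithm, doubling an already large XP rate adds only about ln 2 ≈ 0.69 to that task's term, so a single very fast skill cannot dominate the average.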

RuneScape-Bench is display-only on BenchLM because its scores reflect agent scaffolds and gameplay automation strategies as much as base model quality. BenchLM keeps it as external agentic-coding context.

25 mirrored rows · 16 skill tasks · TypeScript SDK · XP/min aggregate · Display only

RuneBench results · runescape-bench GitHub · rs-sdk GitHub

Avg ln(1 + XP/min) on RuneScape-Bench — May 2026 snapshot

BenchLM mirrors the published avg ln(1 + XP/min) view for RuneScape-Bench. GPT-5.5 xhigh leads the public snapshot at 5.7%, followed by Gemini 3.5 Flash (5.4%) and GPT-5.5 (5.3%). BenchLM does not use these results to rank models overall.

GPT-5.5 xhigh · OpenAI · gpt55-apikey · 5.7% · Overall: —

Gemini 3.5 Flash · Google · gemini35flash · 5.4% · Overall: —

GPT-5.5 · OpenAI · gpt55 · 5.3% · Overall: —

25 models · External benchmark mirrors · Current · Display only · Updated May 2026 snapshot

Avg ln(1 + XP/min) table (25 models)

Model                    Provider       Score
GPT-5.5 xhigh            OpenAI         5.7%
Gemini 3.5 Flash         Google         5.4%
GPT-5.5                  OpenAI         5.3%
Claude Opus 4.8 max      Anthropic      5.1%
Claude Opus 4.8          Anthropic      5.0%
Gemini 3.5 Flash high    Google         4.9%
Claude Opus 4.7 xhigh    Anthropic      4.7%
GPT-5.4                  OpenAI         4.7%
Gemini 3 Flash           Google         4.7%
Claude Opus 4.7          Anthropic      4.6%
Gemini 3.1 Pro           Google         4.5%
Claude Opus 4.6          Anthropic      4.4%
Codex CLI 5.3            OpenAI         4.3%
Claude Opus 4.5          Anthropic      4.1%
GPT-5.4 Mini             OpenAI         4.1%
Gemini 3 Pro             Google         3.8%
Claude Sonnet 4.6        Anthropic      3.2%
Claude Sonnet 4.5        Anthropic      3.2%
GPT-5.4 Nano             OpenAI         2.3%
Kimi K2.5                Moonshot AI    2.1%
GLM 5                    Z.AI           1.9%
Claude Haiku 4.5         Anthropic      1.6%
Qwen3 Max                Alibaba        1.4%
Qwen3 Coder Next         Alibaba        1.2%
Qwen3.5 35B              Alibaba        0.7%

The published RuneScape-Bench snapshot places GPT-5.5 xhigh first at 5.7%. GPT-5.5, in third place, is 0.4 points behind at 5.3%. The top 10 rows span 1.1 points (5.7% down to 4.6%), so many of the published results sit in a relatively narrow band.
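The gap figures quoted here can be re-derived from the ten highest scores in the table. This is only a small arithmetic check using the values as published in the snapshot; the variable names are illustrative.

```python
# Ten highest scores as listed in the May 2026 snapshot table.
top10 = [5.7, 5.4, 5.3, 5.1, 5.0, 4.9, 4.7, 4.7, 4.7, 4.6]

# Gap between the first and third rows, rounded to one decimal
# to avoid floating-point noise.
lead_over_third = round(top10[0] - top10[2], 1)

# Spread between the best and worst of the ten highest rows.
top10_range = round(max(top10) - min(top10), 1)

print(lead_over_third, top10_range)
```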

25 models have been evaluated on RuneScape-Bench, which falls in the External benchmark mirrors category. BenchLM keeps external benchmark mirrors separate from its weighted global scoring system, so these results remain source-specific evidence: RuneScape-Bench is displayed for reference but excluded from the scoring formula, and it does not directly affect overall rankings.

About RuneScape-Bench

Year: 2026

Tasks: 16 RuneScape skill-training tasks

Format: Average log XP-rate score

Difficulty: Agentic gameplay automation

RuneBench evaluates gameplay automation and coding-agent strategy. BenchLM mirrors the public aggregate, computed as the average of ln(1 + XP/min) across 16 skill-training tasks, and keeps the benchmark display-only because each row reflects the agent harness and gameplay strategy as well as the model.

RuneBench: public benchmark source

BenchLM freshness & provenance

Version: RuneScape-Bench 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does RuneScape-Bench measure?

RuneScape-Bench is an agentic coding benchmark in which models use a TypeScript SDK to play a RuneScape-like environment and optimize skill-training performance.

Which model leads the published RuneScape-Bench snapshot?

GPT-5.5 xhigh currently leads the published RuneScape-Bench snapshot with an avg ln(1 + XP/min) score of 5.7%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on RuneScape-Bench?

25 AI models are included in BenchLM's mirrored RuneScape-Bench snapshot, based on the public leaderboard as captured in the May 2026 snapshot.

Last updated: May 2026 snapshot · mirrored from the public benchmark leaderboard
