
OpenHands Index

A holistic coding-agent benchmark that evaluates AI agents across issue resolution, frontend work, greenfield development, testing, and information gathering.

How BenchLM shows OpenHands Index

BenchLM mirrors the official OpenHands Index REST API snapshot from May 11, 2026. The source evaluates 28 coding-agent model variants across 5 categories: Issue Resolution, Frontend, Greenfield, Testing, and Information Gathering.

OpenHands Index is display-only on BenchLM. It is a valuable agentic software-engineering reference, but its rows combine model, SDK version, agent harness, cost, runtime, and per-benchmark result links, so BenchLM keeps it separate from its weighted, model-only rankings.

28 model variants · 15 open models · 13 closed models · 5 benchmarks · Display only
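
To see why these rows resist a model-only ranking, consider a minimal sketch of the row shape described above. The AgentRow type and its field names are illustrative assumptions, not the actual OpenHands Index API schema, and the sample rows are made up.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical shape for one mirrored leaderboard row. Field names are
# assumptions for illustration; the real OpenHands Index schema may differ.
@dataclass
class AgentRow:
    model: str          # display name, e.g. "Claude Opus 4.6"
    sdk_version: str    # agent SDK version used for the run
    harness: str        # agent harness identifier
    avg_score: float    # macro-average agent score, in percent
    cost_usd: float     # total evaluation cost
    runtime_s: float    # total runtime in seconds

def variants_per_model(rows: list[AgentRow]) -> dict[str, int]:
    """Count distinct (sdk_version, harness) variants per model. Any count
    above 1 means a model-only ranking would need an extra aggregation
    choice, which is why BenchLM keeps the index display-only."""
    seen: dict[str, set[tuple[str, str]]] = defaultdict(set)
    for row in rows:
        seen[row.model].add((row.sdk_version, row.harness))
    return {model: len(variants) for model, variants in seen.items()}

# Made-up rows: the same model under two SDK versions yields two entries.
rows = [
    AgentRow("Example Model", "sdk-1.2", "default", 60.0, 400.0, 5000.0),
    AgentRow("Example Model", "sdk-1.3", "default", 61.5, 380.0, 4800.0),
]
print(variants_per_model(rows))  # {'Example Model': 2}
```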

Average agent score on OpenHands Index — May 11, 2026 snapshot

BenchLM mirrors the published average agent score view for OpenHands Index. Claude Opus 4.7 (Adaptive) leads the public snapshot at 68.2%, followed by Claude Opus 4.6 (66.7%) and GPT-5.5 (65.9%). BenchLM does not use these results to rank models overall.

28 models · Agentic · Current · Display only · Updated May 11, 2026

The published OpenHands Index snapshot is tightly clustered at the top: Claude Opus 4.7 (Adaptive) sits at 68.2%, while the third row is only 2.3 points behind. The broader top-10 spread is 15.2 points, so the benchmark still separates strong models even when the leaders cluster.

28 models have been evaluated on OpenHands Index. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. OpenHands Index is currently displayed for reference and excluded from the scoring formula, so it does not directly affect overall rankings.
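
The interplay between the 22% Agentic weight and the display-only flag can be sketched in a few lines. This is not BenchLM's actual implementation: only the 22% weight comes from this page, and the companion benchmark names, scores, and the unweighted mean are placeholders chosen for illustration.

```python
# Sketch: a display-only benchmark never enters the weighted category score.
# Only the 22% Agentic weight comes from this page; everything else here
# (benchmark names, scores, the unweighted mean) is an illustrative assumption.
AGENTIC_WEIGHT = 0.22

agentic_benchmarks = {
    # name: (score in percent, display_only flag)
    "OpenHands Index": (68.2, True),          # mirrored, display only
    "HypotheticalAgentBench A": (71.0, False),
    "HypotheticalAgentBench B": (64.0, False),
}

scored = [s for s, display_only in agentic_benchmarks.values() if not display_only]
category_score = sum(scored) / len(scored)          # 67.5 with these numbers
weighted_points = AGENTIC_WEIGHT * category_score   # contribution to overall score
print(f"Agentic category {category_score:.1f}%, {weighted_points:.2f} weighted points")
```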

About OpenHands Index

Year

2025

Tasks

SWE-bench Verified, SWE-bench Multimodal, Commit0, SWT-bench Verified, and GAIA

Format

Macro-average across five coding-agent categories

Difficulty

Real-world software engineering agent tasks

BenchLM mirrors the official OpenHands Index REST API as a display-only agentic software-engineering benchmark. The source reports average agent score, cost, runtime, per-category scores, logs, and visualizations for each model and SDK version.
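
Since the headline number is a macro-average, it is simply the unweighted mean of the five per-category scores. A worked example follows, with made-up per-category numbers chosen so the mean lands on 68.2%, the snapshot leader's score; the real per-category breakdowns live in the source leaderboard.

```python
# Macro-average: unweighted mean over the five coding-agent categories.
# These per-category scores are invented for illustration only.
per_category = {
    "Issue Resolution (SWE-bench Verified)": 74.0,
    "Frontend (SWE-bench Multimodal)": 62.5,
    "Greenfield (Commit0)": 55.0,
    "Testing (SWT-bench Verified)": 70.5,
    "Information Gathering (GAIA)": 79.0,
}

avg_agent_score = sum(per_category.values()) / len(per_category)
print(f"Average agent score: {avg_agent_score:.1f}%")  # 68.2%
```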

BenchLM freshness & provenance

Version

OpenHands Index 2025

Refresh cadence

Quarterly

Staleness state

Current

Question availability

Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
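
As a rough sketch of how such a policy could be encoded, assume three staleness states and the three treatment tiers named above. The actual rules live on the methodology page; everything in this snippet beyond those tier names is an assumption.

```python
from enum import Enum

class Staleness(Enum):
    CURRENT = "current"
    AGING = "aging"
    STALE = "stale"

def treatment(staleness: Staleness, display_only: bool) -> str:
    """Hypothetical mapping from freshness metadata to treatment tier.
    The tier names come from the prose above; the rules are assumptions."""
    if display_only:
        return "display-only reference"   # OpenHands Index's case
    if staleness is Staleness.CURRENT:
        return "strong differentiator"
    if staleness is Staleness.AGING:
        return "benchmark to watch"
    return "display-only reference"

print(treatment(Staleness.CURRENT, display_only=True))  # display-only reference
```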

Average agent score table (28 models)

Rank  Model                         API identifier          Avg. score
1     Claude Opus 4.7 (Adaptive)    -                       68.2%
2     Claude Opus 4.6               claude-opus-4-6         66.7%
3     GPT-5.5                       -                       65.9%
4     -                             -                       64.3%
5     Claude Opus 4.5               claude-opus-4-5         60.6%
6     -                             -                       58.8%
7     -                             -                       58.3%
8     -                             -                       58.2%
9     Gemini 3.1 Pro                Gemini-3.1-Pro          57.0%
10    Claude Sonnet 4.5             claude-sonnet-4-5       53.0%
11    Qwen3.6 Plus                  Qwen3.6-Plus            52.9%
12    -                             -                       49.4%
13    Kimi K2.5                     Kimi-K2.5               49.2%
14    Gemini 3 Pro                  Gemini-3-Pro            49.0%
15    Gemini 3 Flash                Gemini-3-Flash          49.0%
16    DeepSeek V3.2 (Thinking)      DeepSeek-V3.2-Reasoner  45.7%
17    MiniMax M2.5                  MiniMax-M2.5            45.2%
18    Claude Sonnet 4.6             claude-sonnet-4-6       44.5%
19    MiniMax M2.7                  Minimax-2.7             43.4%
20    -                             -                       42.3%
21    MiniMax M2.1                  MiniMax-M2.1            41.2%
22    Kimi K2.5 (Reasoning)         Kimi-K2-Thinking        41.0%
23    Qwen3.5 Flash                 Qwen3.5-Flash           38.1%
24    -                             -                       36.2%
25    Qwen3 Coder Next              Qwen3-Coder-Next        34.7%
26    Qwen3 Coder 480B A35B         Qwen3-Coder-480B        30.9%
27    Kimi K2.6                     Kimi-K2.6               29.0%
28    -                             -                       25.4%

FAQ

What does OpenHands Index measure?

OpenHands Index is a holistic coding-agent benchmark that evaluates AI agents across issue resolution, frontend work, greenfield development, testing, and information gathering.

Which model leads the published OpenHands Index snapshot?

Claude Opus 4.7 (Adaptive) currently leads the published OpenHands Index snapshot with an average agent score of 68.2%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on OpenHands Index?

28 AI models are included in BenchLM's mirrored OpenHands Index snapshot, based on the public leaderboard captured on May 11, 2026.

Last updated: May 11, 2026 · mirrored from the public benchmark leaderboard
