A holistic coding-agent benchmark that evaluates AI agents across issue resolution, frontend work, greenfield development, testing, and information gathering.
BenchLM mirrors the official OpenHands Index REST API snapshot from May 11, 2026. The source evaluates 28 coding-agent model variants across 5 categories: Issue Resolution, Frontend, Greenfield, Testing, and Information Gathering.
OpenHands Index is display-only on BenchLM. It is a valuable agentic software-engineering reference, but its rows combine model, SDK version, agent harness, cost, runtime, and per-benchmark result links, so BenchLM keeps it separate from weighted model-only rankings.
BenchLM mirrors the published average agent score view for OpenHands Index. Claude Opus 4.7 (Adaptive) leads the public snapshot at 68.2%, followed by Claude Opus 4.6 (66.7%) and GPT-5.5 (65.9%). BenchLM does not use these results to rank models overall.
| Model | Vendor | Model ID |
| --- | --- | --- |
| Claude Opus 4.7 (Adaptive) | Anthropic | claude-opus-4-7 |
| Claude Opus 4.6 | Anthropic | claude-opus-4-6 |
| GPT-5.5 | OpenAI | |
The published OpenHands Index snapshot is tightly clustered at the top: Claude Opus 4.7 (Adaptive) sits at 68.2%, while third-place GPT-5.5 (65.9%) is only 2.3 points behind. The broader top-10 spread is 15.2 points, so the benchmark still separates strong models even when the leaders cluster.
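The gaps can be checked directly from the published numbers; note that the 10th-place score below is inferred from the stated 15.2-point spread rather than read from a leaderboard row:

```python
# Sanity-check the stated gaps from the published snapshot scores.
leader = 68.2  # Claude Opus 4.7 (Adaptive)
third = 65.9   # GPT-5.5

print(round(leader - third, 1))  # 2.3 points between first and third

# The stated top-10 spread implies the 10th-place score.
top10_spread = 15.2
print(round(leader - top10_spread, 1))  # 53.0
```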
28 models have been evaluated on OpenHands Index. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. OpenHands Index is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
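As a rough illustration of how a display-only benchmark can sit outside a weighted category score, here is a minimal sketch; the benchmark list, scores, and structure are assumptions, not BenchLM's actual scoring code:

```python
# Illustrative only: exclude display-only benchmarks from a weighted
# category score. Everything except the 22% weight is hypothetical.
AGENTIC_WEIGHT = 0.22  # Agentic category weight in the overall score

benchmarks = [
    # (name, score, counts toward scoring?)
    ("OpenHands Index", 68.2, False),          # display-only: excluded
    ("HypotheticalAgenticBench", 71.0, True),  # placeholder scored benchmark
]

scored = [score for _, score, counts in benchmarks if counts]
category_score = sum(scored) / len(scored) if scored else 0.0
print(f"Agentic contribution: {AGENTIC_WEIGHT * category_score:.2f} points")
```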
Year: 2025
Tasks: SWE-bench Verified, SWE-bench Multimodal, Commit0, SWT-bench Verified, and GAIA
Format: Macro-average across five coding-agent categories (see the sketch below)
Difficulty: Real-world software-engineering agent tasks
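To make the format concrete, the sketch below computes a macro-average over the five categories; the per-category scores are hypothetical placeholders, not published results:

```python
# Macro-average: the unweighted mean of the five per-category scores.
category_scores = {
    "Issue Resolution": 70.0,       # SWE-bench Verified
    "Frontend": 60.0,               # SWE-bench Multimodal
    "Greenfield": 55.0,             # Commit0
    "Testing": 65.0,                # SWT-bench Verified
    "Information Gathering": 75.0,  # GAIA
}
macro_average = sum(category_scores.values()) / len(category_scores)
print(f"Average agent score: {macro_average:.1f}%")  # 65.0%
```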
BenchLM mirrors the official OpenHands Index REST API as a display-only agentic software-engineering benchmark. The source reports average agent score, cost, runtime, per-category scores, logs, and visualizations for each model and SDK version.
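A minimal sketch of what mirroring such a snapshot might look like; the endpoint URL and JSON field names are assumptions for illustration, as the real OpenHands Index API schema is not documented here:

```python
import requests

# Placeholder endpoint; the real snapshot URL is an assumption.
SNAPSHOT_URL = "https://example.com/openhands-index/snapshot"

rows = requests.get(SNAPSHOT_URL, timeout=30).json()
for row in rows:
    # Fields mirrored per the description above: score, cost, runtime,
    # and per-category scores; the field names here are hypothetical.
    print(
        row.get("model"),
        row.get("sdk_version"),
        row.get("average_agent_score"),
        row.get("cost_usd"),
        row.get("runtime_seconds"),
    )
```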
Version: OpenHands Index 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
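A hedged sketch of the kind of freshness gate described above; the tier names come from the prose, but the thresholds and function shape are invented for illustration:

```python
from datetime import date

def benchmark_tier(last_refresh: date, today: date) -> str:
    """Map refresh metadata to a treatment tier (thresholds are assumptions)."""
    days_stale = (today - last_refresh).days
    if days_stale <= 90:   # within one quarterly refresh window
        return "strong differentiator"
    if days_stale <= 180:  # one missed refresh: still worth watching
        return "benchmark to watch"
    return "display-only reference"

print(benchmark_tier(date(2026, 5, 11), date(2026, 6, 1)))  # strong differentiator
```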
Claude Opus 4.7 (Adaptive) currently leads the published OpenHands Index snapshot with an average agent score of 68.2%. BenchLM shows this benchmark for display only and does not use it in overall rankings.
28 AI models are included in BenchLM's mirrored OpenHands Index snapshot, based on the public leaderboard captured on May 11, 2026.