A holistic coding-agent benchmark that evaluates AI agents across issue resolution, frontend work, greenfield development, testing, and information gathering.
BenchLM mirrors the official OpenHands Index REST API snapshot from May 11, 2026. The source evaluates 28 coding-agent model variants across 5 categories: Issue Resolution, Frontend, Greenfield, Testing, and Information Gathering.
OpenHands Index is display-only on BenchLM. It is a valuable agentic software-engineering reference, but its rows combine model, SDK version, agent harness, cost, runtime, and per-benchmark result links, so BenchLM keeps it separate from weighted model-only rankings.
BenchLM mirrors the published average agent score view for OpenHands Index. Claude Opus 4.7 (Adaptive) leads the public snapshot at 68.2%, followed by Claude Opus 4.6 (66.7%) and GPT-5.5 (65.9%). BenchLM does not use these results to rank models overall.
| Model | Vendor | Model ID |
| --- | --- | --- |
| Claude Opus 4.7 (Adaptive) | Anthropic | claude-opus-4-7 |
| Claude Opus 4.6 | Anthropic | claude-opus-4-6 |
| GPT-5.5 | OpenAI | |
The published OpenHands Index snapshot is tightly clustered at the top: Claude Opus 4.7 (Adaptive) sits at 68.2%, while third-place GPT-5.5 (65.9%) is only 2.3 points behind. The broader top-10 spread is 15.2 points, so the benchmark still separates strong models even when the leaders cluster.
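The gaps can be checked directly from the published numbers; note that the 10th-place score below is inferred from the stated 15.2-point spread rather than read from a leaderboard row:

```python
# Sanity-check the stated gaps from the published snapshot scores.
leader = 68.2  # Claude Opus 4.7 (Adaptive)
third = 65.9   # GPT-5.5

print(round(leader - third, 1))  # 2.3 points between first and third

# The stated top-10 spread implies the 10th-place score.
top10_spread = 15.2
print(round(leader - top10_spread, 1))  # 53.0
```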
28 models have been evaluated on OpenHands Index. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. OpenHands Index is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
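As a rough illustration of how a display-only benchmark can sit outside a weighted category score, here is a minimal sketch; the benchmark list, scores, and structure are assumptions, not BenchLM's actual scoring code:

```python
# Illustrative only: exclude display-only benchmarks from a weighted
# category score. Everything except the 22% weight is hypothetical.
AGENTIC_WEIGHT = 0.22  # Agentic category weight in the overall score

benchmarks = [
    # (name, score, counts toward scoring?)
    ("OpenHands Index", 68.2, False),          # display-only: excluded
    ("HypotheticalAgenticBench", 71.0, True),  # placeholder scored benchmark
]

scored = [score for _, score, counts in benchmarks if counts]
category_score = sum(scored) / len(scored) if scored else 0.0
print(f"Agentic contribution: {AGENTIC_WEIGHT * category_score:.2f} points")
```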
Year: 2025
Tasks: SWE-bench Verified, SWE-bench Multimodal, Commit0, SWT-bench Verified, and GAIA
Format: Macro-average across five coding-agent categories (see the sketch below)
Difficulty: Real-world software-engineering agent tasks
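To make the format concrete, the sketch below computes a macro-average over the five categories; the per-category scores are hypothetical placeholders, not published results:

```python
# Macro-average: the unweighted mean of the five per-category scores.
category_scores = {
    "Issue Resolution": 70.0,       # SWE-bench Verified
    "Frontend": 60.0,               # SWE-bench Multimodal
    "Greenfield": 55.0,             # Commit0
    "Testing": 65.0,                # SWT-bench Verified
    "Information Gathering": 75.0,  # GAIA
}
macro_average = sum(category_scores.values()) / len(category_scores)
print(f"Average agent score: {macro_average:.1f}%")  # 65.0%
```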
BenchLM mirrors the official OpenHands Index REST API as a display-only agentic software-engineering benchmark. The source reports average agent score, cost, runtime, per-category scores, logs, and visualizations for each model and SDK version.
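A minimal sketch of what mirroring such a snapshot might look like; the endpoint URL and JSON field names are assumptions for illustration, as the real OpenHands Index API schema is not documented here:

```python
import requests

# Placeholder endpoint; the real snapshot URL is an assumption.
SNAPSHOT_URL = "https://example.com/openhands-index/snapshot"

rows = requests.get(SNAPSHOT_URL, timeout=30).json()
for row in rows:
    # Fields mirrored per the description above: score, cost, runtime,
    # and per-category scores; the field names here are hypothetical.
    print(
        row.get("model"),
        row.get("sdk_version"),
        row.get("average_agent_score"),
        row.get("cost_usd"),
        row.get("runtime_seconds"),
    )
```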
Version: OpenHands Index 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
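A hedged sketch of the kind of freshness gate described above; the tier names come from the prose, but the thresholds and function shape are invented for illustration:

```python
from datetime import date

def benchmark_tier(last_refresh: date, today: date) -> str:
    """Map refresh metadata to a treatment tier (thresholds are assumptions)."""
    days_stale = (today - last_refresh).days
    if days_stale <= 90:   # within one quarterly refresh window
        return "strong differentiator"
    if days_stale <= 180:  # one missed refresh: still worth watching
        return "benchmark to watch"
    return "display-only reference"

print(benchmark_tier(date(2026, 5, 11), date(2026, 6, 1)))  # strong differentiator
```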
Claude Opus 4.7 (Adaptive) currently leads the published OpenHands Index snapshot with an average agent score of 68.2%. BenchLM shows this benchmark for display only and does not use it in overall rankings.
28 AI models are included in BenchLM's mirrored OpenHands Index snapshot, based on the public leaderboard captured on May 11, 2026.