A computer-use benchmark for GUI task completion across the broader OSWorld task suite.
BenchLM mirrors the published score view for OSWorld. Claude Opus 4.5 leads the public snapshot at 66.3%. BenchLM does not use these results to rank models overall.
Year: 2026
Tasks: Computer-use tasks
Format: Interactive GUI evaluation
Difficulty: Broad computer-use suite
BenchLM tracks plain OSWorld as a display-only provider-table reference and preserves OSWorld-Verified as the weighted core benchmark key.
Version: OSWorld 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
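As a rough illustration of how freshness metadata might drive that decision, the sketch below maps a staleness state to one of the three roles named above. The `classify` function, the `display_only` flag, the state strings other than "Current", and the mapping itself are all hypothetical; this is not BenchLM's actual policy, which is described on the methodology page.

```python
from enum import Enum


class Role(Enum):
    STRONG_DIFFERENTIATOR = "strong differentiator"
    WATCH = "benchmark to watch"
    DISPLAY_ONLY = "display-only reference"


# Hypothetical mapping: only the "Current" state appears on this page;
# "aging" and "stale" are invented here for illustration.
_ROLE_BY_STATE = {
    "current": Role.STRONG_DIFFERENTIATOR,
    "aging": Role.WATCH,
    "stale": Role.DISPLAY_ONLY,
}


def classify(staleness_state: str, display_only: bool = False) -> Role:
    """Return the assumed role for a benchmark.

    display_only forces the display-only role regardless of freshness,
    mirroring how plain OSWorld is tracked as a provider-table reference.
    """
    if display_only:
        return Role.DISPLAY_ONLY
    # Unknown states fall back to the most conservative role.
    return _ROLE_BY_STATE.get(staleness_state.strip().lower(), Role.DISPLAY_ONLY)


# Plain OSWorld: fresh, but tracked as display-only.
print(classify("Current", display_only=True).value)
```

Under these assumptions, a benchmark flagged as display-only stays out of the weighted ranking even when its staleness state is "Current", which matches how plain OSWorld is described on this page.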
Claude Opus 4.5 by Anthropic currently leads with a score of 66.3% on OSWorld.
1 AI model has been evaluated on OSWorld on BenchLM.