A verified subset of OSWorld focused on computer-use tasks in desktop-like environments, including navigation, editing, and workflow completion.
According to BenchLM.ai, GPT-5.3 Codex leads the OSWorld-Verified benchmark with a score of 86, followed by GPT-5.4 (85) and GPT-5.2-Codex (85). The top three models are clustered within 1 point of each other, suggesting this benchmark is nearing saturation for frontier models.
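The clustering claim can be checked with a few lines of arithmetic. The sketch below uses only the three scores quoted above; the function name `top_spread` is a hypothetical helper for illustration, not part of any BenchLM.ai tooling.

```python
# Scores as quoted in the text above (OSWorld-Verified, per BenchLM.ai).
top_scores = {
    "GPT-5.3 Codex": 86,
    "GPT-5.4": 85,
    "GPT-5.2-Codex": 85,
}


def top_spread(scores: dict[str, float]) -> float:
    """Return the gap between the highest and lowest score in the group."""
    values = list(scores.values())
    return max(values) - min(values)


leader = max(top_scores, key=top_scores.get)
spread = top_spread(top_scores)
print(f"Leader: {leader}, spread across top models: {spread} point(s)")
```

A small spread among the leaders is what the saturation reading rests on: when the gap is about one point, rank order among those models carries little information.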
121 models have been evaluated on OSWorld-Verified. The benchmark falls in the agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
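To make the weighting concrete, here is a minimal sketch of how a 22% category weight would feed into an overall score. It assumes the overall score is a linear weighted sum of category scores, which the text does not confirm. Only the 0.22 weight comes from the source; the function name and the example category scores are hypothetical.

```python
AGENTIC_WEIGHT = 0.22  # from the text: agentic category weight


def agentic_contribution(category_score: float, weight: float = AGENTIC_WEIGHT) -> float:
    """Points the agentic category adds to the overall score.

    Assumes a linear weighted sum of category scores (an assumption,
    not a documented BenchLM.ai formula).
    """
    return weight * category_score


# Hypothetical agentic category scores for two models.
model_a = agentic_contribution(80.0)
model_b = agentic_contribution(70.0)

# Under this assumption, a 10-point category gap becomes a 2.2-point overall gap.
print(round(model_a - model_b, 2))
```

Note that the category score is not necessarily the OSWorld-Verified score itself, since the agentic category may aggregate several benchmarks.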
Year: 2025
Tasks: Desktop and GUI tasks
Format: Interactive computer-use evaluation
Difficulty: Complex multi-step workflows
OSWorld-Verified measures whether models can operate software interfaces, keep state across steps, and complete practical GUI workflows. It is one of the clearest public signals for computer-use capability.
GPT-5.3 Codex by OpenAI currently leads with a score of 86 on OSWorld-Verified.
BenchLM.ai lists 121 AI models evaluated on OSWorld-Verified.