
OSWorld-Verified

A verified subset of OSWorld focused on computer-use tasks in desktop-like environments, including navigation, editing, and workflow completion.

Top models on OSWorld-Verified — June 13, 2026

As of June 13, 2026, Claude Mythos 5 leads the OSWorld-Verified leaderboard with 85%, followed by Claude Fable 5 (85%) and Claude Opus 4.8 (83.4%).

23 models · Agentic · 24% of category score · Current · Updated June 13, 2026

According to BenchLM.ai, Claude Mythos 5 leads the OSWorld-Verified benchmark with a score of 85%, followed by Claude Fable 5 (85%) and Claude Opus 4.8 (83.4%). The top three models are clustered within 1.6 points, suggesting this benchmark is nearing saturation for frontier models.

23 models have been evaluated on OSWorld-Verified. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. Within that category, OSWorld-Verified contributes 24% of the category score, so strong performance here directly affects a model's overall ranking.
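The weighting arithmetic above can be sketched as follows. This is a minimal illustration that assumes the overall score is a plain linear weighted sum, so a benchmark's effective weight is the category weight times its share of the category. BenchLM's actual aggregation may normalize or adjust scores differently, and the function names here are hypothetical.

```python
# Weights quoted in the text above.
AGENTIC_CATEGORY_WEIGHT = 0.22    # Agentic category's share of the overall score
OSWORLD_SHARE_OF_CATEGORY = 0.24  # OSWorld-Verified's share of the Agentic category


def effective_weight(category_weight: float, benchmark_share: float) -> float:
    """Share of the overall score driven by one benchmark.

    Assumption: weights compose multiplicatively (linear weighted sum).
    """
    return category_weight * benchmark_share


def overall_contribution(score: float, category_weight: float,
                         benchmark_share: float) -> float:
    """Points a benchmark score adds to the overall score, under the same assumption."""
    return score * effective_weight(category_weight, benchmark_share)


weight = effective_weight(AGENTIC_CATEGORY_WEIGHT, OSWORLD_SHARE_OF_CATEGORY)

# Gap in overall score between the leader (85%) and third place (83.4%).
gap = (
    overall_contribution(85.0, AGENTIC_CATEGORY_WEIGHT, OSWORLD_SHARE_OF_CATEGORY)
    - overall_contribution(83.4, AGENTIC_CATEGORY_WEIGHT, OSWORLD_SHARE_OF_CATEGORY)
)

print(f"Effective weight of OSWorld-Verified: {weight:.2%}")
print(f"Overall-score gap between 85% and 83.4%: {gap:.3f} points")
```

Under this assumption, OSWorld-Verified accounts for roughly 5.28% of the overall score, so the 1.6-point spread among the top three translates to less than a tenth of a point overall.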

About OSWorld-Verified

Year: 2025
Tasks: Desktop and GUI tasks
Format: Interactive computer-use evaluation
Difficulty: Complex multi-step workflows

OSWorld-Verified measures whether models can operate software interfaces, keep state across steps, and complete practical GUI workflows. It is one of the clearest public signals for computer-use capability.

BenchLM freshness & provenance

Version: OSWorld Verified
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set (Current)

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (23 models)

Rank  Score
1     85%
2     85%
3     83.4%
4     82.6%
5     78.8%
6     78.7%
7     78.4%
8     78%
9     75%
10    73.3%
11    73.1%
12    72.7%
13    72.1%
14    72.1%
15    70.1%
16    66.3%
17    64.7%
18    61.4%
19    58%
20    56.2%
21    54.5%
22    47.3%
23    39%
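As a rough way to read the leaderboard, the sketch below summarizes the 23 scores listed above: the spread among the top three, the median, and the full range. The scores are copied from the leaderboard; the choice of summary statistics is an illustrative assumption, not something BenchLM reports.

```python
import statistics

# The 23 leaderboard scores listed above, in rank order (percent).
SCORES = [
    85.0, 85.0, 83.4, 82.6, 78.8, 78.7, 78.4, 78.0, 75.0, 73.3, 73.1, 72.7,
    72.1, 72.1, 70.1, 66.3, 64.7, 61.4, 58.0, 56.2, 54.5, 47.3, 39.0,
]

# Spread between first and third place, the figure used in the saturation remark.
top_three_spread = SCORES[0] - SCORES[2]

# Middle score of the field and the distance from best to worst.
median_score = statistics.median(SCORES)
full_range = max(SCORES) - min(SCORES)

print(f"Models evaluated: {len(SCORES)}")
print(f"Top-three spread: {top_three_spread:.1f} points")
print(f"Median score: {median_score:.1f}%")
print(f"Full range: {full_range:.1f} points")
```

The contrast between a narrow top-three spread and a wide full range is one way to see why the top of the board looks close to saturated while the field as a whole does not.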

FAQ

What does OSWorld-Verified measure?

A verified subset of OSWorld focused on computer-use tasks in desktop-like environments, including navigation, editing, and workflow completion.

Which model scores highest on OSWorld-Verified?

Claude Mythos 5 by Anthropic currently leads with a score of 85% on OSWorld-Verified.

How many models are evaluated on OSWorld-Verified?

23 AI models have been evaluated on OSWorld-Verified on BenchLM.

Last updated: June 13, 2026 · BenchLM version OSWorld Verified
