OSWorld-Verified

A verified subset of OSWorld focused on computer-use tasks in desktop-like environments, including navigation, editing, and workflow completion.

According to BenchLM.ai, GPT-5.3 Codex leads the OSWorld-Verified benchmark with a score of 86, followed by GPT-5.4 (85) and GPT-5.2-Codex (85). The top three models are clustered within 1 point of each other, suggesting this benchmark is nearing saturation for frontier models.
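The clustering claim can be checked directly from the three scores quoted above. The following is a minimal sketch; the helper name `score_spread` is hypothetical and not part of any BenchLM.ai tooling.

```python
# Top three OSWorld-Verified scores as quoted in the paragraph above.
top_scores = {
    "GPT-5.3 Codex": 86,
    "GPT-5.4": 85,
    "GPT-5.2-Codex": 85,
}


def score_spread(scores: dict[str, int]) -> int:
    """Return the gap between the highest and lowest score (hypothetical helper)."""
    values = scores.values()
    return max(values) - min(values)


spread = score_spread(top_scores)
print(f"Spread among the top {len(top_scores)} models: {spread} point(s)")
```

A small spread at the top only hints at saturation; it does not by itself show that the benchmark has stopped discriminating between models.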

121 models have been evaluated on OSWorld-Verified. The benchmark falls in the agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
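To get a feel for what a 22% category weight means, here is a rough sketch that assumes the overall score is a plain linear weighted sum of category scores. That formula is an assumption for illustration only; the page does not state how BenchLM.ai combines categories, and the function names are hypothetical.

```python
# Assumption: overall score = sum(weight_i * category_score_i).
# Only the 22% agentic weight comes from the page; the formula is illustrative.
AGENTIC_WEIGHT = 0.22


def weighted_contribution(category_score: float, weight: float = AGENTIC_WEIGHT) -> float:
    """Points a category adds to the overall score under a linear weighting."""
    return category_score * weight


def overall_shift(score_delta: float, weight: float = AGENTIC_WEIGHT) -> float:
    """Change in overall score caused by a change in the category score."""
    return score_delta * weight


# Hypothetical example: treat an agentic category score of 86 as given.
print(round(weighted_contribution(86), 2))
# A 10-point gain in the agentic category under the same assumption.
print(round(overall_shift(10), 2))
```

Under this assumption, each point gained in the agentic category moves the overall score by 0.22 points, which is why the category has a visible effect on overall ranking.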

About OSWorld-Verified

Year: 2025
Tasks: Desktop and GUI tasks
Format: Interactive computer-use evaluation
Difficulty: Complex multi-step workflows

OSWorld-Verified measures whether models can operate software interfaces, keep state across steps, and complete practical GUI workflows. It is one of the clearest public signals for computer-use capability.

Leaderboard (121 models)

#1 GPT-5.3 Codex: 86
#2 GPT-5.4: 85
#3 GPT-5.2-Codex: 85
#4 GPT-5.4 Pro: 84
#6 GPT-5.2 Pro: 82
#8 GPT-5.2: 81
#9 GPT-5.3 Instant: 80
#10 Claude Opus 4.6: 74
#11 GPT-5.2 Instant: 74
#12 GLM-5 (Reasoning): 74
#13 Grok 4.1: 73
#15 GPT-5 (high): 72
#16 GPT-5 (medium): 72
#17 GPT-5.1: 71
#18 o1-preview: 71
#20 Claude Sonnet 4.5: 69
#21 Gemini 3.1 Pro: 68
#22 Claude Sonnet 4.6: 68
#23 Claude Opus 4.5: 68
#24 Kimi K2.5 (Reasoning): 68
#25 o3-pro: 68
#27 Gemini 3 Pro: 66
#29 o3: 65
#30 DeepSeek Coder 2.0: 65
#31 GPT-4.1: 63
#32 Mercury 2: 62
#33 o3-mini: 61
#34 GLM-4.7: 61
#35 DeepSeekMath V2: 61
#36 GPT-5 mini: 60
#37 o1: 60
#38 Qwen2.5-1M: 59
#39 MiMo-V2-Flash: 58
#40 GLM-5: 58
#41 Seed 1.6: 58
#43 Claude 4 Sonnet: 57
#44 Claude 4.1 Opus: 57
#45 Claude Haiku 4.5: 57
#46 GLM-4.7-Flash: 57
#47 Grok 4: 56
#48 DeepSeek LLM 2.0: 56
#49 Gemini 2.5 Pro: 55
#50 DeepSeek V3.2: 55
#51 Qwen2.5-72B: 55
#52 o4-mini (high): 55
#54 Step 3.5 Flash: 54
#56 Seed-2.0-Lite: 53
#57 Gemini 3 Flash: 53
#59 Qwen3.5 397B: 52
#61 Seed 1.6 Flash: 52
#62 Claude 3.5 Sonnet: 51
#64 MiniMax M2.5: 50
#65 Mistral Large 2: 50
#66 Aion-2.0: 50
#67 Mistral Large 3: 49
#68 Kimi K2.5: 49
#69 GPT-4.1 mini: 49
#70 GPT-4o: 48
#71 Claude 3 Opus: 47
#73 Seed-2.0-Mini: 45
#74 Gemini 1.5 Pro: 45
#76 Ministral 3 14B: 44
#78 GPT-4o mini: 44
#79 DeepSeek-R1: 44
#81 GPT-OSS 120B: 43
#83 Claude 3 Haiku: 42
#84 GPT-4.1 nano: 42
#85 Nemotron-4 15B: 42
#86 Gemini 2.5 Flash: 41
#87 GPT-4 Turbo: 41
#88 Moonshot v1: 41
#89 Z-1: 41
#90 Llama 3 70B: 41
#92 Mistral 8x7B: 38
#94 Llama 4 Scout: 37
#95 Gemini 1.0 Pro: 36
#98 Gemma 3 27B: 35
#99 Phi-4: 34
#101 LFM2-24B-A2B: 34
#102 DeepSeek V3.1: 33
#104 o1-pro: 32
#105 Nova Pro: 32
#106 Qwen2.5-VL-32B: 32
#107 LFM2.5-1.2B-Thinking: 32
#108 GLM-4.5: 31
#109 GPT-OSS 20B: 31
#110 MiniMax M1 80k: 31
#111 GPT-5 nano: 30
#112 Qwen3 235B 2507: 30
#113 DBRX Instruct: 29
#114 GLM-4.5-Air: 28
#116 Mistral 8x7B v0.2: 28
#117 Kimi K2: 27
#118 Ministral 3 8B: 27
#119 LFM2.5-1.2B-Instruct: 26
#120 Mistral 7B v0.3: 25
#121 Ministral 3 3B: 20

FAQ

What does OSWorld-Verified measure?

OSWorld-Verified is a verified subset of OSWorld that tests computer-use tasks in desktop-like environments, including navigation, editing, and workflow completion.

Which model scores highest on OSWorld-Verified?

GPT-5.3 Codex by OpenAI currently leads with a score of 86 on OSWorld-Verified.

How many models are evaluated on OSWorld-Verified?

BenchLM lists 121 AI models evaluated on OSWorld-Verified.

Last updated: March 12, 2026
