BrowseComp

A benchmark for web-browsing agents that must search, inspect sources, gather evidence, and return the correct answer to research-oriented questions.

According to BenchLM.ai, GPT-5.4 Pro leads the BrowseComp benchmark with a score of 88, tied with GPT-5.2 Pro (88) and GPT-5.4 (88). With the top models all at the same score, this benchmark appears to be nearing saturation for frontier models.

121 models have been evaluated on BrowseComp. The benchmark falls in the agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
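To make the weighting concrete, here is a minimal sketch of how a category-weighted overall score could be computed. Only the 22% agentic weight comes from this page; the other category names and weights below are hypothetical placeholders, not BenchLM.ai's actual scheme.

```python
# Hypothetical category weights. "agentic": 0.22 is stated on the page;
# the remaining categories and weights are invented for illustration.
CATEGORY_WEIGHTS = {
    "agentic": 0.22,
    "reasoning": 0.40,   # hypothetical
    "coding": 0.38,      # hypothetical
}

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted sum of per-category scores (0-100 scale)."""
    return sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items())

# A model scoring 88 on agentic benchmarks contributes 0.22 * 88 = 19.36
# points to its overall score under these assumed weights.
score = overall_score({"agentic": 88, "reasoning": 90, "coding": 85})
```

Under this sketch, a one-point gain on an agentic benchmark moves the overall score by 0.22 points, which is why the page notes that strong performance here directly affects overall ranking.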

About BrowseComp

Year: 2025
Tasks: Research questions requiring browsing
Format: Web search and evidence synthesis
Difficulty: Hard web research

BrowseComp is designed to measure real web research behavior, not just latent world knowledge. It rewards models that can plan searches, inspect multiple pages, and avoid shallow answer synthesis.
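The behavior described above can be sketched as a simple search-inspect-gather loop. This is not BrowseComp's harness or any model's actual implementation; the `search` and `fetch` functions are stubs standing in for a real web backend.

```python
def search(query: str) -> list[str]:
    """Stub search engine: returns candidate URLs for a query."""
    slug = query.replace(" ", "-")
    return [f"https://example.com/{slug}/{i}" for i in range(3)]

def fetch(url: str) -> str:
    """Stub page fetcher: returns page text for a URL."""
    return f"contents of {url}"

def research(question: str, max_pages: int = 5) -> dict:
    """Inspect multiple pages and gather evidence before answering,
    rather than synthesizing a shallow answer from the first hit."""
    evidence = []
    for url in search(question)[:max_pages]:
        evidence.append({"url": url, "text": fetch(url)})
    # A real agent would plan follow-up searches and synthesize an
    # answer from the collected evidence here.
    return {"question": question, "sources": [e["url"] for e in evidence]}

result = research("example research question")
```

The point of the sketch is the structure BrowseComp rewards: multiple retrieval steps and explicit evidence collection, not a single lookup against latent knowledge.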

Leaderboard (121 models)

#1 GPT-5.4 Pro: 88
#2 GPT-5.2 Pro: 88
#3 GPT-5.4: 88
#4 GPT-5.3 Codex: 88
#6 Gemini 3.1 Pro: 86
#7 Claude Opus 4.6: 85
#8 GPT-5.2-Codex: 85
#10 GPT-5.2: 84
#11 Gemini 3 Pro: 83
#12 GPT-5.3 Instant: 82
#14 GPT-5.2 Instant: 82
#15 GLM-5 (Reasoning): 80
#16 Grok 4.1: 79
#17 GPT-5.1: 79
#18 o1-preview: 79
#19 GPT-5 (medium): 78
#21 Claude Sonnet 4.6: 77
#22 Kimi K2.5 (Reasoning): 77
#23 o3-pro: 76
#24 GPT-5 (high): 75
#25 o3: 75
#26 Claude Sonnet 4.5: 74
#27 o3-mini: 74
#28 Claude Opus 4.5: 73
#30 GPT-4.1: 73
#31 o1: 72
#32 Gemini 2.5 Pro: 72
#33 GLM-4.7: 72
#34 Qwen2.5-1M: 72
#35 GPT-4.1 mini: 71
#37 GPT-5 mini: 70
#39 GLM-5: 67
#40 Mercury 2: 67
#41 Seed 1.6: 67
#42 DeepSeekMath V2: 66
#43 Step 3.5 Flash: 66
#44 Gemini 3 Flash: 66
#45 MiMo-V2-Flash: 65
#46 Qwen2.5-72B: 64
#47 o4-mini (high): 64
#48 Gemini 1.5 Pro: 64
#49 Grok 4: 63
#50 Seed-2.0-Lite: 63
#51 GLM-4.7-Flash: 63
#52 DeepSeek Coder 2.0: 62
#53 Claude 4 Sonnet: 62
#54 Claude 4.1 Opus: 62
#55 DeepSeek V3.2: 62
#56 Claude Haiku 4.5: 62
#57 DeepSeek LLM 2.0: 62
#58 Qwen3.5 397B: 62
#59 Claude 3.5 Sonnet: 62
#60 MiniMax M2.5: 62
#61 Seed 1.6 Flash: 62
#62 GPT-4.1 nano: 62
#65 Aion-2.0: 60
#68 Kimi K2.5: 59
#69 GPT-4o: 59
#70 Mistral Large 3: 58
#72 Gemini 2.5 Flash: 58
#73 Mistral Large 2: 57
#75 Claude 3 Opus: 56
#76 Ministral 3 14B: 55
#77 GPT-4 Turbo: 54
#78 Seed-2.0-Mini: 53
#79 Claude 3 Haiku: 53
#80 Gemini 1.0 Pro: 51
#82 GPT-OSS 120B: 50
#84 o1-pro: 50
#85 GPT-4o mini: 49
#86 Moonshot v1: 49
#87 Z-1: 49
#89 DeepSeek-R1: 49
#90 GPT-5 nano: 48
#91 Llama 3 70B: 48
#92 Llama 4 Scout: 48
#95 Mistral 8x7B: 47
#96 Nemotron-4 15B: 47
#98 Gemma 3 27B: 42
#99 GPT-OSS 20B: 42
#100 Grok 3 [Beta]: 41
#101 Qwen2.5-VL-32B: 41
#103 Qwen3 235B 2507: 40
#104 Nova Pro: 39
#105 DeepSeek V3.1: 39
#107 LFM2-24B-A2B: 38
#108 GLM-4.5: 37
#109 GLM-4.5-Air: 37
#110 MiniMax M1 80k: 37
#111 LFM2.5-1.2B-Thinking: 37
#113 Kimi K2: 36
#114 Ministral 3 8B: 36
#115 Phi-4: 35
#116 Mistral 8x7B v0.2: 34
#117 Ministral 3 3B: 33
#119 Mistral 7B v0.3: 32
#120 DBRX Instruct: 31
#121 LFM2.5-1.2B-Instruct: 31

FAQ

What does BrowseComp measure?

BrowseComp measures whether web-browsing agents can search, inspect sources, gather evidence, and return correct answers to research-oriented questions.

Which model scores highest on BrowseComp?

GPT-5.4 Pro by OpenAI currently leads with a score of 88 on BrowseComp.

How many models are evaluated on BrowseComp?

121 AI models have been evaluated on BrowseComp at BenchLM.ai.

Last updated: March 12, 2026

Weekly LLM Benchmark Digest

Get notified when new models drop, benchmark scores change, or the leaderboard shifts. One email per week.

Free. No spam. Unsubscribe anytime. We only store derived location metadata for consent routing.