BrowseComp

A benchmark for web-browsing agents that must search, inspect sources, gather evidence, and return the correct answer to research-oriented questions.

Top models on BrowseComp — April 10, 2026

As of April 10, 2026, GPT-5.4 Pro leads the BrowseComp leaderboard with 89.3%, followed by Claude Mythos Preview (86.9%) and Claude Opus 4.6 (83.7%).

13 models · Agentic category · 18% of category score · Current · Updated April 10, 2026

According to BenchLM.ai, GPT-5.4 Pro leads the BrowseComp benchmark at 89.3%, with Claude Mythos Preview (86.9%) and Claude Opus 4.6 (83.7%) close behind. Scores span from 89.3% down to 52%, a spread of more than 37 points that makes this benchmark effective at differentiating model capabilities.

13 models have been evaluated on BrowseComp. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. Within that category, BrowseComp contributes 18% of the category score, so strong performance here directly affects a model's overall ranking.
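One plausible reading of these weights is a simple multiplicative scheme, under which BrowseComp's effective share of a model's overall score follows directly from the two percentages above (BenchLM's methodology page may define the combination differently; this is only a sketch under that assumption):

```python
# Weights as stated on this page, assumed to combine multiplicatively.
category_weight = 0.22    # Agentic category's share of the overall score
benchmark_weight = 0.18   # BrowseComp's share within the Agentic category

# A model's BrowseComp score would then carry this effective weight
# in the overall ranking:
effective_weight = category_weight * benchmark_weight
print(f"{effective_weight:.2%}")  # 3.96%
```

Under that assumption, a 10-point gap on BrowseComp would move a model's overall score by roughly 0.4 points.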

About BrowseComp

Year: 2025
Tasks: Research questions requiring browsing
Format: Web search and evidence synthesis
Difficulty: Hard web research

BrowseComp is designed to measure real web research behavior, not just latent world knowledge. It rewards models that can plan searches, inspect multiple pages, and avoid shallow answer synthesis.

BenchLM freshness & provenance

Version: BrowseComp 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (13 models)

1. GPT-5.4 Pro · 89.3%
2. Claude Mythos Preview · 86.9%
3. Claude Opus 4.6 · 83.7%
4. 82.7%
5. 68%
6. 65.8%
7. 63.8%
8. 62%
9. 61%
10. 61%
11. 60.6%
12. 60.6%
13. 52%

FAQ

What does BrowseComp measure?

BrowseComp measures whether web-browsing agents can search, inspect sources, gather evidence, and return correct answers to research-oriented questions.

Which model scores highest on BrowseComp?

GPT-5.4 Pro by OpenAI currently leads with a score of 89.3% on BrowseComp.

How many models are evaluated on BrowseComp?

BenchLM has evaluated 13 AI models on BrowseComp.

Last updated: April 10, 2026 · BenchLM version BrowseComp 2026

AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.