BrowseComp

A benchmark for web-browsing agents that must search, inspect sources, gather evidence, and return the correct answer to research-oriented questions.

According to BenchLM.ai, GPT-5.4 Pro leads the BrowseComp benchmark with a score of 88, tied with GPT-5.2 Pro (88) and GPT-5.4 (88). With the top models all at the same score, this benchmark appears to be nearing saturation for frontier models.

121 models have been evaluated on BrowseComp. The benchmark falls in the agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
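To make the weighting concrete, here is a minimal sketch of how a category-weighted overall score could be computed. Only the 22% agentic weight comes from this page; the other category names and weights below are hypothetical placeholders, not BenchLM.ai's actual scheme.

```python
# Hypothetical category weights. "agentic": 0.22 is stated on the page;
# the remaining categories and weights are invented for illustration.
CATEGORY_WEIGHTS = {
    "agentic": 0.22,
    "reasoning": 0.40,   # hypothetical
    "coding": 0.38,      # hypothetical
}

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted sum of per-category scores (0-100 scale)."""
    return sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items())

# A model scoring 88 on agentic benchmarks contributes 0.22 * 88 = 19.36
# points to its overall score under these assumed weights.
score = overall_score({"agentic": 88, "reasoning": 90, "coding": 85})
```

Under this sketch, a one-point gain on an agentic benchmark moves the overall score by 0.22 points, which is why the page notes that strong performance here directly affects overall ranking.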

About BrowseComp

Year: 2025
Tasks: Research questions requiring browsing
Format: Web search and evidence synthesis
Difficulty: Hard web research

BrowseComp is designed to measure real web research behavior, not just latent world knowledge. It rewards models that can plan searches, inspect multiple pages, and avoid shallow answer synthesis.
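The behavior described above can be sketched as a simple search-inspect-gather loop. This is not BrowseComp's harness or any model's actual implementation; the `search` and `fetch` functions are stubs standing in for a real web backend.

```python
def search(query: str) -> list[str]:
    """Stub search engine: returns candidate URLs for a query."""
    slug = query.replace(" ", "-")
    return [f"https://example.com/{slug}/{i}" for i in range(3)]

def fetch(url: str) -> str:
    """Stub page fetcher: returns page text for a URL."""
    return f"contents of {url}"

def research(question: str, max_pages: int = 5) -> dict:
    """Inspect multiple pages and gather evidence before answering,
    rather than synthesizing a shallow answer from the first hit."""
    evidence = []
    for url in search(question)[:max_pages]:
        evidence.append({"url": url, "text": fetch(url)})
    # A real agent would plan follow-up searches and synthesize an
    # answer from the collected evidence here.
    return {"question": question, "sources": [e["url"] for e in evidence]}

result = research("example research question")
```

The point of the sketch is the structure BrowseComp rewards: multiple retrieval steps and explicit evidence collection, not a single lookup against latent knowledge.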

Leaderboard (121 models)

#1 GPT-5.4 Pro: 88
#2 GPT-5.2 Pro: 88
#3 GPT-5.4: 88
#4 GPT-5.3 Codex: 88
#6 Gemini 3.1 Pro: 86
#7 Claude Opus 4.6: 85
#8 GPT-5.2-Codex: 85
#10 GPT-5.2: 84
#11 Gemini 3 Pro: 83
#12 GPT-5.3 Instant: 82
#14 GPT-5.2 Instant: 82
#15 GLM-5 (Reasoning): 80
#16 Grok 4.1: 79
#17 GPT-5.1: 79
#18 o1-preview: 79
#19 GPT-5 (medium): 78
#21 Claude Sonnet 4.6: 77
#22 Kimi K2.5 (Reasoning): 77
#23 o3-pro: 76
#24 GPT-5 (high): 75
#25 o3: 75
#26 Claude Sonnet 4.5: 74
#27 o3-mini: 74
#28 Claude Opus 4.5: 73
#30 GPT-4.1: 73
#31 o1: 72
#32 Gemini 2.5 Pro: 72
#33 GLM-4.7: 72
#34 Qwen2.5-1M: 72
#35 GPT-4.1 mini: 71
#37 GPT-5 mini: 70
#39 GLM-5: 67
#40 Mercury 2: 67
#41 Seed 1.6: 67
#42 DeepSeekMath V2: 66
#43 Step 3.5 Flash: 66
#44 Gemini 3 Flash: 66
#45 MiMo-V2-Flash: 65
#46 Qwen2.5-72B: 64
#47 o4-mini (high): 64
#48 Gemini 1.5 Pro: 64
#49 Grok 4: 63
#50 Seed-2.0-Lite: 63
#51 GLM-4.7-Flash: 63
#52 DeepSeek Coder 2.0: 62
#53 Claude 4 Sonnet: 62
#54 Claude 4.1 Opus: 62
#55 DeepSeek V3.2: 62
#56 Claude Haiku 4.5: 62
#57 DeepSeek LLM 2.0: 62
#58 Qwen3.5 397B: 62
#59 Claude 3.5 Sonnet: 62
#60 MiniMax M2.5: 62
#61 Seed 1.6 Flash: 62
#62 GPT-4.1 nano: 62
#65 Aion-2.0: 60
#68 Kimi K2.5: 59
#69 GPT-4o: 59
#70 Mistral Large 3: 58
#72 Gemini 2.5 Flash: 58
#73 Mistral Large 2: 57
#75 Claude 3 Opus: 56
#76 Ministral 3 14B: 55
#77 GPT-4 Turbo: 54
#78 Seed-2.0-Mini: 53
#79 Claude 3 Haiku: 53
#80 Gemini 1.0 Pro: 51
#82 GPT-OSS 120B: 50
#84 o1-pro: 50
#85 GPT-4o mini: 49
#86 Moonshot v1: 49
#87 Z-1: 49
#89 DeepSeek-R1: 49
#90 GPT-5 nano: 48
#91 Llama 3 70B: 48
#92 Llama 4 Scout: 48
#95 Mistral 8x7B: 47
#96 Nemotron-4 15B: 47
#98 Gemma 3 27B: 42
#99 GPT-OSS 20B: 42
#100 Grok 3 [Beta]: 41
#101 Qwen2.5-VL-32B: 41
#103 Qwen3 235B 2507: 40
#104 Nova Pro: 39
#105 DeepSeek V3.1: 39
#107 LFM2-24B-A2B: 38
#108 GLM-4.5: 37
#109 GLM-4.5-Air: 37
#110 MiniMax M1 80k: 37
#111 LFM2.5-1.2B-Thinking: 37
#113 Kimi K2: 36
#114 Ministral 3 8B: 36
#115 Phi-4: 35
#116 Mistral 8x7B v0.2: 34
#117 Ministral 3 3B: 33
#119 Mistral 7B v0.3: 32
#120 DBRX Instruct: 31
#121 LFM2.5-1.2B-Instruct: 31

FAQ

What does BrowseComp measure?

BrowseComp measures whether web-browsing agents can search, inspect sources, gather evidence, and return correct answers to research-oriented questions.

Which model scores highest on BrowseComp?

GPT-5.4 Pro by OpenAI currently leads with a score of 88 on BrowseComp.

How many models are evaluated on BrowseComp?

121 AI models have been evaluated on BrowseComp at BenchLM.ai.

Last updated: March 12, 2026

Weekly LLM Benchmark Digest

Get notified when new models drop, benchmark scores change, or the leaderboard shifts. One email per week.

Free. No spam. Unsubscribe anytime. We only store derived location metadata for consent routing.