A benchmark for web-browsing agents that must search, inspect sources, gather evidence, and return the correct answer to research-oriented questions.
According to BenchLM.ai, GPT-5.4 Pro leads the BrowseComp benchmark with a score of 88, tied with GPT-5.2 Pro (88) and GPT-5.4 (88). With the top three models at identical scores, the benchmark appears to be nearing saturation for frontier models.
121 models have been evaluated on BrowseComp. The benchmark falls in the agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
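BenchLM.ai's exact aggregation formula is not published here, but a category weight of 22% implies a weighted average across benchmark categories. The sketch below illustrates that idea; the category names and all weights and scores other than the stated 22% agentic weight are hypothetical placeholders, not BenchLM.ai's actual values.

```python
# Hypothetical sketch of a category-weighted overall score.
# Only the 0.22 agentic weight comes from the source text;
# everything else is an illustrative assumption.
category_weights = {
    "agentic": 0.22,    # stated weight for benchmarks like BrowseComp
    "reasoning": 0.40,  # hypothetical
    "coding": 0.38,     # hypothetical
}

model_category_scores = {
    "agentic": 88,      # e.g. a BrowseComp-style score
    "reasoning": 91,    # hypothetical
    "coding": 85,       # hypothetical
}

def overall_score(scores, weights):
    """Weighted average of per-category scores; weights sum to 1."""
    return sum(weights[c] * scores[c] for c in weights)

print(round(overall_score(model_category_scores, category_weights), 2))
```

Because the agentic category contributes 22% of the total in this sketch, a one-point gain on BrowseComp moves the overall score by 0.22 points, which is why strong agentic results shift the overall ranking.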
Year: 2025
Tasks: Research questions requiring browsing
Format: Web search and evidence synthesis
Difficulty: Hard web research
BrowseComp is designed to measure real web research behavior, not just latent world knowledge. It rewards models that can plan searches, inspect multiple pages, and avoid shallow answer synthesis.