A benchmark for research-level scientific reasoning, designed to separate frontier models on difficult science tasks that mix domain knowledge with deep reasoning.
As of April 29, 2026, GPT-5.4 Pro leads the FrontierScience leaderboard with 36.7%.
Year
2026
Tasks
Research-level science tasks
Format
Scientific reasoning benchmark
Difficulty
Research frontier
FrontierScience matters because GPQA-style knowledge alone is not enough for scientific copilots. It better reflects the kind of reasoning needed for research assistance and frontier technical work.
Version
FrontierScience 2026
Refresh cadence
Quarterly
Staleness state
Current
Question availability
Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
GPT-5.4 Pro by OpenAI currently leads with a score of 36.7% on FrontierScience.
1 AI model has been evaluated on FrontierScience on BenchLM.
For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.
Free. No spam. Unsubscribe anytime.