FrontierScience

A benchmark for research-level scientific reasoning, designed to separate frontier models on difficult science tasks that mix domain knowledge with deep reasoning.

According to BenchLM.ai, GPT-5.2 Pro leads the FrontierScience benchmark with a score of 93, followed by GPT-5.4 Pro (92) and GPT-5.3 Instant (92). The top models are clustered within one point, suggesting the benchmark is nearing saturation for frontier models.

121 models have been evaluated on FrontierScience. The benchmark falls in the knowledge category, which carries a 12% weight in BenchLM.ai's overall scoring system. Strong performance here directly impacts a model's overall ranking.
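To illustrate how a weighted category feeds into an overall ranking, here is a minimal sketch. The 12% knowledge weight is taken from the text; the remaining categories and their weights are hypothetical placeholders, not BenchLM.ai's actual scoring formula.

```python
# Hedged sketch of weighted category scoring.
# Only the 12% "knowledge" weight comes from the text; "other" is a
# placeholder standing in for all remaining categories combined.
CATEGORY_WEIGHTS = {
    "knowledge": 0.12,  # category containing FrontierScience (from the text)
    "other": 0.88,      # hypothetical: everything else, lumped together
}

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores on a 0-100 scale."""
    return sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items())

# Under these assumed weights, a one-point gain in the knowledge
# category moves the overall score by 0.12 points.
base = overall_score({"knowledge": 92.0, "other": 80.0})
bumped = overall_score({"knowledge": 93.0, "other": 80.0})
print(round(bumped - base, 2))  # 0.12
```

Under any weighting of this shape, the marginal effect of a benchmark score on the overall ranking is simply its category weight, which is why a 12% category still moves the leaderboard.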

About FrontierScience

Year: 2026
Tasks: Research-level science tasks
Format: Scientific reasoning benchmark
Difficulty: Research frontier

FrontierScience matters because GPQA-style knowledge recall alone is not enough for scientific copilots. It better reflects the kind of reasoning needed for research assistance and frontier technical work.

Leaderboard (121 models)

#1 GPT-5.2 Pro: 93
#2 GPT-5.4 Pro: 92
#3 GPT-5.3 Instant: 92
#4 GPT-5.4: 91
#5 GPT-5.2: 91
#6 GPT-5.2 Instant: 91
#7 Grok 4.1: 91
#8 GPT-5.3 Codex: 90
#10 Claude Opus 4.6: 88
#11 Gemini 3.1 Pro: 88
#13 GPT-5.2-Codex: 86
#14 Gemini 3 Pro: 86
#15 Claude Sonnet 4.6: 85
#17 GPT-5.1: 84
#18 Claude Opus 4.5: 84
#19 Claude Sonnet 4.5: 84
#20 GPT-5 (high): 83
#21 GLM-5 (Reasoning): 83
#22 o1-preview: 83
#24 GPT-5 (medium): 82
#26 Kimi K2.5 (Reasoning): 80
#27 o3-pro: 77
#29 o3: 77
#30 GPT-5 mini: 75
#31 Grok 4: 75
#32 Qwen2.5-1M: 74
#33 GLM-5: 74
#34 DeepSeekMath V2: 73
#35 o4-mini (high): 73
#36 GLM-4.7: 72
#37 DeepSeek Coder 2.0: 72
#38 DeepSeek V3.2: 72
#39 MiMo-V2-Flash: 71
#40 Step 3.5 Flash: 71
#41 Qwen3.5 397B: 71
#42 Gemini 2.5 Pro: 70
#43 Qwen2.5-72B: 70
#44 Mercury 2: 69
#45 Seed 1.6: 68
#46 Claude 4.1 Opus: 68
#47 Claude 4 Sonnet: 67
#49 DeepSeek LLM 2.0: 67
#50 Mistral Large 3: 67
#51 Kimi K2.5: 67
#52 o3-mini: 66
#53 Seed-2.0-Lite: 66
#55 MiniMax M2.5: 66
#56 Aion-2.0: 66
#57 o1: 65
#58 Gemini 3 Flash: 65
#60 Mistral Large 2: 65
#61 Claude Haiku 4.5: 64
#62 GLM-4.7-Flash: 63
#64 o1-pro: 63
#66 GPT-4o mini: 62
#67 GPT-4.1: 61
#68 GPT-4.1 mini: 61
#69 Ministral 3 14B: 60
#70 Claude 3.5 Sonnet: 59
#71 GPT-4o: 58
#72 GPT-5 nano: 58
#73 Seed 1.6 Flash: 57
#75 Claude 3 Opus: 56
#76 Mistral 8x7B: 56
#78 Seed-2.0-Mini: 54
#79 Gemini 1.5 Pro: 54
#80 Gemini 1.0 Pro: 54
#81 Llama 3 70B: 54
#84 GPT-4 Turbo: 52
#85 Phi-4: 52
#86 DBRX Instruct: 52
#87 GPT-4.1 nano: 51
#88 Z-1: 51
#89 Claude 3 Haiku: 50
#90 Nemotron-4 15B: 50
#91 Gemini 2.5 Flash: 49
#92 GPT-OSS 120B: 49
#94 Moonshot v1: 49
#96 DeepSeek-R1: 44
#97 Llama 4 Scout: 44
#99 LFM2-24B-A2B: 43
#100 Qwen2.5-VL-32B: 43
#102 Gemma 3 27B: 42
#104 Nova Pro: 41
#105 Grok 3 [Beta]: 40
#106 GLM-4.5: 40
#107 Qwen3 235B 2507: 39
#108 MiniMax M1 80k: 39
#110 DeepSeek V3.1: 37
#111 GLM-4.5-Air: 37
#112 Mistral 8x7B v0.2: 35
#114 GPT-OSS 20B: 34
#115 Kimi K2: 34
#116 Mistral 7B v0.3: 34
#117 Ministral 3 8B: 32
#118 LFM2.5-1.2B-Thinking: 31
#119 LFM2.5-1.2B-Instruct: 30
#121 Ministral 3 3B: 28

FAQ

What does FrontierScience measure?

FrontierScience measures research-level scientific reasoning: difficult science tasks that combine domain knowledge with deep reasoning, designed to separate frontier models.

Which model scores highest on FrontierScience?

GPT-5.2 Pro by OpenAI currently leads with a score of 93 on FrontierScience.

How many models are evaluated on FrontierScience?

121 AI models have been evaluated on FrontierScience via BenchLM.ai.

Last updated: March 12, 2026
