
Knowledge Benchmarks — GPQA, HLE & FrontierScience Leaderboard

General knowledge and factual understanding

Bottom line: Broad knowledge (MMLU) is saturated. The real differentiators are HLE and FrontierScience — frontier-difficulty tests where top models still score below 30%.


Best Knowledge picks

BenchLM's knowledge summaries, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context window.

Top AI Models for Knowledge (April 2026)

As of April 2026, Claude Opus 4.7 leads the provisional knowledge leaderboard with a weighted score of 98.6%, followed by GPT-5.4 (96.8%) and Gemini 3.1 Pro (94.9%). BenchLM is currently showing 94 provisional-ranked models and 13 verified-ranked models in this category.

What changed

Claude Mythos Preview leads knowledge with the strongest HLE and FrontierScience scores.

GPT-5.4 is a close second, with excellent GPQA Diamond scores.

Claude Opus 4.6 holds #3, strong on SuperGPQA and SimpleQA factual accuracy.

How to choose

Top models by benchmark

GPQA: expert-level questions in biology, physics, and chemistry (12% of category score)

Knowledge Leaderboard

Updated April 16, 2026

Sorted by knowledge weighted score. Switch between provisional-ranked and verified-ranked modes to see either the broader public dataset or the sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.

94 ranked models · Export: CSV / JSON
Provisional-ranked mode includes source-unverified, non-generated benchmark evidence. P = provisional benchmark row.

Rank | Model | Provider | Weighted score | Secondary score | Reported benchmark scores
1 | — | — | 98.6% | 94 | 94.2%, 94.2%, 54.7%, 46.9%
2 | GPT-5.4 | OpenAI | 96.8% | 93 | 92.8%, 92.8%, 39.8%, 40.1%, 59.6%
3 | — | — | 94.9% | 94 | 94.3%, 45.4%, 20.6%, 71.5%
4 | — | — | 94% | Est. 80 | —
5 | — | — | 93.7% | Est. 89 | —
6 | GPT-5.2 | OpenAI | 92.6% | Est. 83 | —
7 | — | — | 92.4% | 92 | 91.3%, 89.2%, 95%, 82%, 53%, 40%, 14.8%, 52.1%, 89.1%
8 | — | — | 88.5% | Est. 87 | —
9 | — | — | 85.2% | 84 | 86.2%, 52.3%
10 | — | — | 83.9% | 86 | 89.9%, 95%, 79.2%, 49%
11 | — | — | 83.7% | Est. 83 | —
12 | — | — | 83.7% | 80 | 87%, 70.6%, 89.5%, 30.8%
13 | GLM-5 | Z.AI | 83.7% | 77 | 86%, 86.0%, 66.8%, 85.7%, 50.4%, 85.8%
14 | — | — | 83.3% | Est. 84 | —
15 | GPT-5.1 | OpenAI | 83.1% | Est. 80 | —
16 | — | — | 81.2% | 68 | 86.6%, 67.1%, 86.7%
17 | — | — | 80.6% | Est. 80 | —
18 | — | — | 80.2% | Est. 79 | —
19 | — | — | 79.9% | Est. 81 | —
20 | — | — | 79.9% | Est. 80 | —
21 | — | — | 78.9% | Est. 68 | —
22 | — | — | 78.9% | 65 | 85.5%, 65.6%, 86.1%
23 | — | — | 77.2% | Est. 72 | —
24 | — | — | 76.5% | 59 | 84.2%, 63.4%, 85.3%
25 | — | — | 76.1% | 77 | 90.4%, 71.6%, 88.5%, 28.8%

Showing 25 of 94. "—" marks values lost from this export: most model names, the label of the secondary-score column, and the benchmark-to-column mapping for the reported scores.
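
If you pull the CSV export, re-sorting offline mirrors the column-header sort on the page. A minimal sketch, assuming a hypothetical file name and column headers ("model", "HLE"); check the real export's header row before running:

```python
import csv

# Hypothetical file and column names ("model", "HLE") -- adjust to the
# actual export's header row.
with open("benchlm_knowledge.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Re-sort by one benchmark (HLE here) instead of the weighted score.
# Scores may carry a trailing "%" and may be blank for some rows.
def hle(row: dict[str, str]) -> float:
    value = row.get("HLE", "").rstrip("%")
    return float(value) if value else 0.0

rows.sort(key=hle, reverse=True)
for row in rows[:5]:
    print(row["model"], row.get("HLE", ""))
```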

These rankings update weekly.

Score in Context

What these scores mean

Knowledge carries a 12% weight in overall scoring. The weighted score blends expert-level tests (HLE, FrontierScience, GPQA) with broad knowledge (MMLU-Pro). A model scoring 90+ on MMLU might still struggle with research-level scientific reasoning — broad knowledge doesn't guarantee deep expertise.
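
To make the blend concrete, here is a minimal sketch using the category weights from the benchmark table below. The weights are from this page; the function and example scores are illustrative, not BenchLM's published pipeline.

```python
# Illustrative blend of per-benchmark scores into one knowledge score,
# using the category weights listed in the benchmark table below.
WEIGHTS = {
    "GPQA": 0.12,
    "SuperGPQA": 0.12,
    "MMLU-Pro": 0.22,
    "HLE": 0.23,
    "FrontierScience": 0.18,
    "SimpleQA": 0.13,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%

def knowledge_score(scores: dict[str, float]) -> float:
    """Blend per-benchmark scores (0-100 each) into one weighted score."""
    return sum(w * scores[name] for name, w in WEIGHTS.items())

# Hypothetical model: strong on broad knowledge, weak on frontier tests.
print(knowledge_score({
    "GPQA": 85.0, "SuperGPQA": 80.0, "MMLU-Pro": 90.0,
    "HLE": 25.0, "FrontierScience": 30.0, "SimpleQA": 70.0,
}))  # 59.85 -- frontier benchmarks pull the blend far below the MMLU-Pro score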

Known limitations

MMLU is saturated — top models score 90%+, making it poor at differentiating. HLE ("Humanity's Last Exam") is deliberately very hard, so even top models score below 30%. Small score differences on HLE are noisy. SimpleQA measures factual accuracy but can penalize models that hedge appropriately.
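
The noise claim follows from simple binomial statistics. A sketch, assuming an exam of roughly 2,500 questions (an illustrative size, not HLE's confirmed question count):

```python
import math

def ci95_half_width(p: float, n: int) -> float:
    """95% confidence half-width for accuracy p measured on n questions."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

p, n = 0.28, 2500  # assumed score and question count, for illustration
print(f"{p:.0%} +/- {ci95_half_width(p, n):.1%}")  # 28% +/- 1.8%
```

At that size, two models scoring 28% and 29% sit inside each other's confidence intervals, so single-point HLE gaps should not decide a ranking.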

How we weight

Knowledge carries a 12% weight in BenchLM.ai's overall scoring. Within the category, the weighted benchmarks are HLE (23%), MMLU-Pro (22%), FrontierScience (18%), SimpleQA (13%), GPQA (12%), and SuperGPQA (12%); the remaining benchmarks are display-only.

For tasks like research assistance, factual Q&A, content creation, and educational applications, knowledge benchmark scores remain one of the strongest predictive signals. See the knowledge leaderboard for the top models in this category.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
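
One plausible way to implement that fallback is to renormalize the surviving weights over whichever trustworthy benchmarks a model actually has. A hypothetical sketch, not BenchLM's published code, assuming missing benchmarks are simply dropped rather than imputed:

```python
def fallback_score(scores: dict[str, float],
                   weights: dict[str, float]) -> float | None:
    """Blend only the benchmarks a model has, renormalizing their weights.

    Returns None when no trustworthy benchmark rows remain after filtering.
    """
    present = {name: w for name, w in weights.items() if name in scores}
    total = sum(present.values())
    if not total:
        return None
    return sum((w / total) * scores[name] for name, w in present.items())
```

Under this sketch, a model missing FrontierScience is scored over the five remaining weighted benchmarks, with their weights rescaled to sum to 1; no synthetic values are filled in.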

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.

Benchmark | Weight | Status | Description
MMLU | — | Display only | Tests knowledge across 57 academic subjects
GPQA | 12% | Weighted | Expert-level questions in biology, physics, and chemistry
GPQA-D | — | Display only | Provider-table reference for GPQA Diamond scores reported in first-party comparison charts
SuperGPQA | 12% | Weighted | Enhanced version covering 285 disciplines
MMLU-Pro | 22% | Weighted | Harder version of MMLU with 10 answer choices and more reasoning-focused questions
HLE | 23% | Weighted | Extremely difficult questions contributed by domain experts worldwide to test frontier AI
FrontierScience | 18% | Weighted | Research-level science and scientific-reasoning benchmark
HLE w/o tools | — | Display only | Tool-free variant of Humanity's Last Exam used to isolate raw frontier reasoning without external aids
SimpleQA | 13% | Weighted | Factual question-answering benchmark
HealthBench Hard | — | Display only | Harder health-reasoning benchmark subset used in first-party frontier model comparisons
MedXpertQA (Text) | — | Display only | Medical multiple-choice benchmark covering many specialties with text-only questions
FrontierScience Research | — | Display only | Research-oriented FrontierScience variant focused on scientific investigation and solution quality
MMLU-Pro (Arcee) | — | Display only | Display-only MMLU-Pro reference from Arcee AI's Trinity-Large-Thinking launch chart


About Knowledge Benchmarks

Knowledge benchmarks measure general knowledge and factual understanding, from broad academic coverage (MMLU's 57 subjects) to frontier-difficulty science exams such as HLE and FrontierScience.
