
Knowledge Benchmarks — GPQA, HLE & FrontierScience Leaderboard

General knowledge and factual understanding

Bottom line: Broad knowledge (MMLU) is saturated. The real differentiators are HLE and FrontierScience — frontier-difficulty tests where top models still score below 30%.

MMLU · GPQA · GPQA-D · SuperGPQA · MMLU-Pro · HLE · FrontierScience · HLE w/o tools · SimpleQA · HealthBench Hard · MedXpertQA (Text) · FrontierScience Research · MMLU-Pro (Arcee)

Frontier science · Broad academic knowledge · Factuality

Best Knowledge picks

BenchLM's knowledge summaries, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context window.

Top AI Models for Knowledge (May 2026)

As of May 2026, GPT-5.4 leads the provisional knowledge leaderboard with a weighted score of 99.3%, followed by Claude Opus 4.7 (Adaptive) (99.2%) and Gemini 3.1 Pro (94.8%). BenchLM currently shows 103 provisional-ranked models and 22 verified-ranked models in this category.

What changed

Claude Mythos Preview leads knowledge with the strongest HLE and FrontierScience scores.

GPT-5.4 is a close second, with excellent GPQA Diamond scores.

Claude Opus 4.6 holds #3, strong on SuperGPQA and SimpleQA factual accuracy.

How to choose

Top models by benchmark

Expert-level questions in biology, physics, and chemistry (12% of category score)

Knowledge Leaderboard

Updated May 1, 2026

Sorted by knowledge weighted score. Switch between provisional-ranked and verified-ranked modes to compare the broader public dataset with the sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.

103 ranked models
Provisional-ranked mode includes source-unverified, non-generated benchmark evidence. P = provisional benchmark row.

Rank | Model | Provider | Knowledge weighted score
1 | GPT-5.4 | OpenAI | 99.3%
2 | Claude Opus 4.7 (Adaptive) | Anthropic | 99.2%
3 | Gemini 3.1 Pro | Google | 94.8%
6 | GPT-5.2 | OpenAI | 92.2%
13 | GLM-5 | Z.AI (self-host) | 83.5%
14 | GLM-5 (Reasoning) | Z.AI (self-host) | 83%
15 | GPT-5.1 | OpenAI | 83%
20 | Qwen3.5 397B (Reasoning) | Alibaba (self-host) | 79.7%

(Table abridged to rows with intact model names; the live page shows 25 of 103 ranked models.)

These rankings update weekly

Get notified when models move. One email a week with what changed and why.

Free. No spam. Unsubscribe anytime.

Score in Context

What these scores mean

Knowledge carries a 12% weight in overall scoring. The weighted score blends expert-level tests (HLE, FrontierScience, GPQA) with broad knowledge (MMLU-Pro). A model scoring 90+ on MMLU might still struggle with research-level scientific reasoning — broad knowledge doesn't guarantee deep expertise.

Known limitations

MMLU is saturated — top models score 90%+, making it poor at differentiating. HLE ("Humanity's Last Exam") is deliberately very hard, so even top models score below 30%. Small score differences on HLE are noisy. SimpleQA measures factual accuracy but can penalize models that hedge appropriately.
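To see why small HLE gaps are noisy, treat each question as an independent pass/fail trial, a common simplification; the standard error of a pass-rate estimate then follows from the binomial distribution. The sketch below assumes an exam of roughly 2,500 questions (the exact count is an assumption, not taken from this page):

```python
import math

def score_stderr(score: float, n_questions: int) -> float:
    """Standard error (in percentage points) of a benchmark pass rate,
    treating each question as an independent Bernoulli trial."""
    p = score / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# Illustrative: a model scoring ~25% on a ~2,500-question exam.
se = score_stderr(25.0, 2500)
print(f"standard error ~ {se:.2f} points")  # roughly 0.87 points
```

By this estimate, two models within about two standard errors (under two points here) are statistically hard to distinguish.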

How we weight

Knowledge carries a 12% weight in BenchLM.ai's overall scoring. Within the category, the weighted score is built from HLE (23%), MMLU-Pro (22%), FrontierScience (18%), SimpleQA (13%), GPQA (12%), and SuperGPQA (12%); the remaining benchmarks are display-only.

For tasks like research assistance, factual Q&A, content creation, and educational applications, knowledge benchmark scores remain one of the strongest predictive signals. See the knowledge leaderboard for the top models in this category.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.

Benchmark | Weight | Status | Description
MMLU | n/a | Display only | Tests knowledge across 57 academic subjects
GPQA | 12% | Weighted | Expert-level questions in biology, physics, and chemistry
GPQA-D | n/a | Display only | Provider-table reference for GPQA Diamond scores reported in first-party comparison charts
SuperGPQA | 12% | Weighted | Enhanced version of GPQA covering 285 disciplines
MMLU-Pro | 22% | Weighted | Harder version of MMLU with 10 answer choices and more reasoning-focused questions
HLE | 23% | Weighted | Extremely difficult questions contributed by domain experts worldwide to test frontier AI
FrontierScience | 18% | Weighted | Research-level science and scientific reasoning benchmark
HLE w/o tools | n/a | Display only | Tool-free variant of Humanity's Last Exam used to isolate raw frontier reasoning without external aids
SimpleQA | 13% | Weighted | Factual question answering benchmark
HealthBench Hard | n/a | Display only | Harder health reasoning benchmark subset used in first-party frontier model comparisons
MedXpertQA (Text) | n/a | Display only | Medical multiple-choice benchmark covering many specialties with text-only questions
FrontierScience Research | n/a | Display only | Research-oriented FrontierScience variant focused on scientific investigation and solution quality
MMLU-Pro (Arcee) | n/a | Display only | MMLU-Pro reference from Arcee AI's Trinity-Large-Thinking launch chart


About Knowledge Benchmarks

Knowledge benchmarks measure general knowledge and factual understanding, from broad academic coverage (MMLU's 57 subjects) to frontier-difficulty science (GPQA, HLE, FrontierScience).
