Knowledge Benchmarks — GPQA, HLE & FrontierScience Leaderboard
General knowledge and factual understanding
Bottom line: Broad knowledge (MMLU) is saturated. The real differentiators are HLE and FrontierScience — frontier-difficulty tests where top models still score below 30%.
MMLU · GPQA · GPQA-D · SuperGPQA · MMLU-Pro · HLE · FrontierScience · HLE w/o tools · SimpleQA · HealthBench Hard · MedXpertQA (Text) · FrontierScience Research · MMLU-Pro (Arcee)
Best Knowledge picks
BenchLM summaries for knowledge plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.
Top AI Models for Knowledge — May 2026
As of May 2026, GPT-5.4 leads the provisional knowledge leaderboard with a weighted score of 99.3%, followed by Claude Opus 4.7 (Adaptive) (99.2%) and Gemini 3.1 Pro (94.8%). BenchLM is currently showing 103 provisional-ranked models and 22 verified-ranked models in this category.
1. GPT-5.4 · OpenAI
2. Claude Opus 4.7 (Adaptive) · Anthropic
3. Gemini 3.1 Pro · Google
What changed
GPT-5.4 leads knowledge with a 99.3% weighted score.
Claude Opus 4.7 (Adaptive) is a close second at 99.2%, within a tenth of a point of the lead.
Gemini 3.1 Pro holds #3 at 94.8%, ahead of Grok 4.1 and GPT-5.3 Codex.
Top models by benchmark
GPQA: Expert-level questions in biology, physics, and chemistry (12% of category score)
Knowledge Leaderboard
Updated May 1, 2026. Sorted by knowledge weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus the sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.
1 GPT-5.4 OpenAI | 99.3% | 89 | — | 92.8% | 92.8% | — | — | 52.1% | — | 39.8% | — | 40.1% | 59.6% | — | — |
2 Claude Opus 4.7 (Adaptive) Anthropic | 99.2% | 90 | — | 94.2% | 94.2% | — | — | 54.7% | — | 46.9% | — | — | — | — | — |
3 Gemini 3.1 Pro Google | 94.8% | 92 | — | — | 94.3% | — | — | — | — | 45.4% | — | 20.6% | 71.5% | — | — |
4 Grok 4.1 xAI | 94.5% | Est.90 | — | — | — | — | — | — | — | — | — | — | — | — | — |
5 GPT-5.3 Codex OpenAI | 93.1% | Est.87 | — | — | — | — | — | — | — | — | — | — | — | — | — |
6 GPT-5.2 OpenAI | 92.2% | 81 | — | 92.4% | — | — | — | — | — | — | — | — | — | — | — |
7 Claude Opus 4.6 Anthropic | 91.8% | 87 | — | 91.3% | 89.2% | 95% | 82% | 53% | — | 40% | — | 14.8% | 52.1% | — | 89.1% |
8 Gemini 3 Pro Deep Think Google | 88.4% | Est.90 | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 85.1% | 83 | — | — | 86.2% | — | — | 52.3% | — | — | — | — | — | — | — | |
10 Claude Sonnet 4.6 Anthropic | 83.8% | 83 | — | 89.9% | — | 95% | 79.2% | 49% | — | — | — | — | — | — | — |
11 Claude Opus 4.5 Anthropic | 83.6% | 77 | — | 87% | — | 70.6% | 89.5% | 30.8% | — | — | — | — | — | — | — |
12 Gemini 3 Pro Google | 83.5% | 81 | — | — | — | — | — | — | — | — | — | — | — | — | — |
13 | 83.5% | 67 | — | 86% | 86.0% | 66.8% | 85.7% | 50.4% | — | — | — | — | — | — | 85.8% |
14 | 83% | Est.82 | — | — | — | — | — | — | — | — | — | — | — | — | — |
15 GPT-5.1 OpenAI | 83% | Est.79 | — | — | — | — | — | — | — | — | — | — | — | — | — |
16 o1-preview OpenAI | 81.6% | Est.83 | — | — | — | — | — | — | — | — | — | — | — | — | — |
17 | 80.9% | 65 | — | 86.6% | — | 67.1% | 86.7% | — | — | — | — | — | — | — | — |
18 GPT-5 (high) OpenAI | 80.4% | Est.78 | — | — | — | — | — | — | — | — | — | — | — | — | — |
19 GPT-5.1-Codex-Max OpenAI | 79.9% | Est.76 | — | — | — | — | — | — | — | — | — | — | — | — | — |
20 | 79.7% | Est.79 | — | — | — | — | — | — | — | — | — | — | — | — | — |
21 GPT-5.2-Codex OpenAI | 79.4% | Est.78 | — | — | — | — | — | — | — | — | — | — | — | — | — |
22 | 78.9% | 63 | — | 85.5% | — | 65.6% | 86.1% | — | — | — | — | — | — | — | — |
23 | 77.9% | 88 | — | 90.1% | 90.1% | — | 87.5% | 37.7% | — | — | 57.9% | — | — | — | — |
24 | 76.8% | Est.70 | — | — | — | — | — | — | — | — | — | — | — | — | — |
25 Qwen3.6 Plus Alibaba | 76.6% | 73 | — | 90.4% | — | 71.6% | 88.5% | 28.8% | — | — | — | — | — | — | — |
These rankings update weekly
Score in Context
What these scores mean
Knowledge carries a 12% weight in overall scoring. The weighted score blends expert-level tests (HLE, FrontierScience, GPQA) with broad knowledge (MMLU-Pro). A model scoring 90+ on MMLU might still struggle with research-level scientific reasoning — broad knowledge doesn't guarantee deep expertise.
Known limitations
MMLU is saturated — top models score 90%+, making it poor at differentiating. HLE ("Humanity's Last Exam") is deliberately very hard, so even top models score below 30%. Small score differences on HLE are noisy. SimpleQA measures factual accuracy but can penalize models that hedge appropriately.
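To put the HLE noise caveat in numbers, here is a rough sketch that treats an HLE score as a binomial proportion over an assumed question count of about 2,500 (the exact count and the independence assumption are ours, not BenchLM's):

```python
import math

def score_margin(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for an accuracy p
    measured over n independently scored questions (binomial assumption)."""
    return z * math.sqrt(p * (1 - p) / n)

# Assumed values: a ~25% HLE score over roughly 2,500 questions.
print(f"±{score_margin(0.25, 2500) * 100:.1f} points")  # ≈ ±1.7 points
```

By that estimate, a one-point gap between two models on HLE is within the sampling margin and not meaningful on its own.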
How we weight
Knowledge carries a 12% weight in BenchLM.ai's overall scoring. Within the category, each benchmark contributes according to the weights in the table below; a worked sketch of the blend follows that table.
For tasks like research assistance, factual Q&A, content creation, and educational applications, knowledge benchmark scores remain one of the strongest predictive signals. See the knowledge leaderboard for the top models in this category.
Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
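A minimal sketch of how such a fallback could behave (hypothetical helper name, weights taken from the benchmark table below; not BenchLM's actual code): benchmarks with no trustworthy row are dropped and the remaining weights are renormalized, rather than filling the gap with a synthetic score.

```python
# Category weights for the benchmarks that carry weight (see the table below).
WEIGHTS = {"GPQA": 0.12, "SuperGPQA": 0.12, "MMLU-Pro": 0.22,
           "HLE": 0.23, "FrontierScience": 0.18, "SimpleQA": 0.13}

def category_score(scores: dict[str, float]) -> float:
    """Blend only the benchmarks with trusted scores, renormalizing the
    remaining weights instead of imputing synthetic values for missing rows."""
    present = {b: w for b, w in WEIGHTS.items() if b in scores}
    total = sum(present.values())
    return sum(scores[b] * (w / total) for b, w in present.items())

# Hypothetical model with only three trusted rows left after filtering.
print(round(category_score({"GPQA": 92.8, "MMLU-Pro": 85.0, "HLE": 27.5}), 1))
```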
The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.
| Benchmark | Weight | Status | Description |
|---|---|---|---|
| MMLU | — | Display only | Tests knowledge across 57 academic subjects |
| GPQA | 12% | Weighted | Expert-level questions in biology, physics, and chemistry |
| GPQA-D | — | Display only | Provider-table reference for GPQA Diamond scores reported in first-party comparison charts. |
| SuperGPQA | 12% | Weighted | Enhanced version covering 285 disciplines |
| MMLU-Pro | 22% | Weighted | Harder version of MMLU with 10 answer choices and more reasoning-focused questions |
| HLE | 23% | Weighted | Extremely difficult questions contributed by domain experts worldwide to test frontier AI |
| FrontierScience | 18% | Weighted | Research-level science and scientific reasoning benchmark |
| HLE w/o tools | — | Display only | Tool-free variant of Humanity's Last Exam used to isolate raw frontier reasoning without external aids |
| SimpleQA | 13% | Weighted | Factual question answering benchmark |
| HealthBench Hard | — | Display only | A harder health reasoning benchmark subset used in first-party frontier model comparisons. |
| MedXpertQA (Text) | — | Display only | Medical multiple-choice benchmark covering many specialties with text-only questions. |
| FrontierScience Research | — | Display only | A research-oriented FrontierScience variant focused on scientific investigation and solution quality. |
| MMLU-Pro (Arcee) | — | Display only | Display-only MMLU-Pro reference from Arcee AI's Trinity-Large-Thinking launch chart. |
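To make the weighting concrete, here is a small illustration using the weighted rows from the table above and made-up per-benchmark scores. This is only a sketch of how the category weights blend; the weighted scores shown on the leaderboard appear to be normalized relative values, and this is not BenchLM's actual computation.

```python
# Weights from the table above; display-only benchmarks carry no weight.
weights = {"GPQA": 12, "SuperGPQA": 12, "MMLU-Pro": 22,
           "HLE": 23, "FrontierScience": 18, "SimpleQA": 13}
assert sum(weights.values()) == 100

# Illustrative (made-up) raw benchmark scores for a single model.
scores = {"GPQA": 90.0, "SuperGPQA": 65.0, "MMLU-Pro": 85.0,
          "HLE": 27.0, "FrontierScience": 25.0, "SimpleQA": 55.0}

blend = sum(scores[b] * w for b, w in weights.items()) / 100
print(f"Blended knowledge score: {blend:.1f}")  # 55.2 on these made-up inputs
```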
About Knowledge Benchmarks
The knowledge category blends broad academic coverage (MMLU's 57 subjects and MMLU-Pro) with expert-level science tests (GPQA, SuperGPQA), frontier-difficulty exams (HLE, FrontierScience), and factual accuracy (SimpleQA).