Knowledge Benchmarks — GPQA, HLE & FrontierScience Leaderboard
General knowledge and factual understanding
Bottom line: Broad knowledge (MMLU) is saturated. The real differentiators are HLE and FrontierScience — frontier-difficulty tests where top models still score below 30%.
MMLU · GPQA · GPQA-D · SuperGPQA · MMLU-Pro · HLE · FrontierScience · HLE w/o tools · SimpleQA · HealthBench Hard · MedXpertQA (Text) · FrontierScience Research · MMLU-Pro (Arcee)
Best Knowledge picks
BenchLM summaries for knowledge plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.
Top AI Models for Knowledge — April 2026
As of April 2026, Claude Opus 4.7 leads the provisional knowledge leaderboard with a weighted score of 98.6%, followed by GPT-5.4 (96.8%) and Gemini 3.1 Pro (94.9%). BenchLM is currently showing 94 provisional-ranked models and 13 verified-ranked models in this category.
Claude Opus 4.7
Anthropic
GPT-5.4
OpenAI
Gemini 3.1 Pro
What changed
Claude Opus 4.7 leads knowledge with the strongest HLE and FrontierScience scores.
GPT-5.4 is a close second, with excellent GPQA Diamond scores.
Gemini 3.1 Pro holds #3, while Claude Opus 4.6 remains strong on SuperGPQA and SimpleQA factual accuracy.
How to choose
Top models by benchmark
Expert-level questions in biology, physics, and chemistry (12% of category score)
Knowledge Leaderboard
Updated April 16, 2026. Sorted by knowledge weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.
| # | Model | Provider | Score | Index | MMLU | GPQA | GPQA-D | SuperGPQA | MMLU-Pro | HLE | FrontierScience | HLE w/o tools | SimpleQA | HealthBench Hard | MedXpertQA (Text) | FrontierScience Research | MMLU-Pro (Arcee) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 98.6% | 94 | — | 94.2% | 94.2% | — | — | 54.7% | — | 46.9% | — | — | — | — | — |
| 2 | GPT-5.4 | OpenAI | 96.8% | 93 | — | 92.8% | 92.8% | — | — | — | — | 39.8% | — | 40.1% | 59.6% | — | — |
| 3 | Gemini 3.1 Pro | Google | 94.9% | 94 | — | — | 94.3% | — | — | — | — | 45.4% | — | 20.6% | 71.5% | — | — |
| 4 | Grok 4.1 | xAI | 94% | Est. 80 | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 5 | GPT-5.3 Codex | OpenAI | 93.7% | Est. 89 | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 6 | GPT-5.2 | OpenAI | 92.6% | Est. 83 | — | 92.4% | — | — | — | — | — | — | — | — | — | — | — |
| 7 | Claude Opus 4.6 | Anthropic | 92.4% | 92 | — | 91.3% | 89.2% | 95% | 82% | 53% | — | 40% | — | 14.8% | 52.1% | — | 89.1% |
| 8 | Gemini 3 Pro Deep Think | Google | 88.5% | Est. 87 | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 9 | GLM-5.1 | Z.AI | 85.2% | 84 | — | — | 86.2% | — | — | 52.3% | — | — | — | — | — | — | — |
| 10 | Claude Sonnet 4.6 | Anthropic | 83.9% | 86 | — | 89.9% | — | 95% | 79.2% | 49% | — | — | — | — | — | — | — |
| 11 | Gemini 3 Pro | Google | 83.7% | Est. 83 | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 12 | Claude Opus 4.5 | Anthropic | 83.7% | 80 | — | 87% | — | 70.6% | 89.5% | 30.8% | — | — | — | — | — | — | — |
| 13 | GLM-5 | Z.AI | 83.7% | 77 | — | 86% | 86.0% | 66.8% | 85.7% | 50.4% | — | — | — | — | — | — | 85.8% |
| 14 | — | — | 83.3% | Est. 84 | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 15 | GPT-5.1 | OpenAI | 83.1% | Est. 80 | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 16 | Qwen3.5-122B-A10B | Alibaba | 81.2% | 68 | — | 86.6% | — | 67.1% | 86.7% | — | — | — | — | — | — | — | — |
| 17 | GPT-5 (high) | OpenAI | 80.6% | Est. 80 | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 18 | GPT-5.1-Codex-Max | OpenAI | 80.2% | Est. 79 | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 19 | Qwen3.5 397B (Reasoning) | Alibaba | 79.9% | Est. 81 | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 20 | GPT-5.2-Codex | OpenAI | 79.9% | Est. 80 | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 21 | o1-preview | OpenAI | 78.9% | Est. 68 | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 22 | Qwen3.5-27B | Alibaba | 78.9% | 65 | — | 85.5% | — | 65.6% | 86.1% | — | — | — | — | — | — | — | — |
| 23 | — | — | 77.2% | Est. 72 | — | — | — | — | — | — | — | — | — | — | — | — | — |
| 24 | Qwen3.5-35B-A3B | Alibaba | 76.5% | 59 | — | 84.2% | — | 63.4% | 85.3% | — | — | — | — | — | — | — | — |
| 25 | Qwen3.6 Plus | Alibaba | 76.1% | 77 | — | 90.4% | — | 71.6% | 88.5% | 28.8% | — | — | — | — | — | — | — |
These rankings update weekly
Get notified when models move. One email a week with what changed and why.
Free. No spam. Unsubscribe anytime.
Score in Context
What these scores mean
Knowledge carries a 12% weight in overall scoring. The weighted score blends expert-level tests (HLE, FrontierScience, GPQA) with broad knowledge (MMLU-Pro). A model scoring 90+ on MMLU might still struggle with research-level scientific reasoning — broad knowledge doesn't guarantee deep expertise.
Known limitations
MMLU is saturated — top models score 90%+, making it poor at differentiating. HLE ("Humanity's Last Exam") is deliberately very hard, so even top models score below 30%. Small score differences on HLE are noisy. SimpleQA measures factual accuracy but can penalize models that hedge appropriately.
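The noise caveat can be made concrete: a benchmark accuracy estimated from n graded questions carries a binomial standard error, so the half-width of a 95% confidence interval is roughly 1.96·sqrt(p(1−p)/n). A minimal sketch, assuming an exam on the order of 2,500 questions; the score and question count here are illustrative, not BenchLM's figures:

```python
import math

def score_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy p estimated from n independent questions."""
    return math.sqrt(p * (1 - p) / n)

# A model scoring 25% on a ~2,500-question exam:
se = score_stderr(0.25, 2500)
ci95 = 1.96 * se  # half-width of a 95% confidence interval
# ci95 comes out near 0.017, i.e. about +/-1.7 percentage points,
# so two models within ~2 points on one run are hard to distinguish.
```

This is why a one- or two-point gap between adjacent models on HLE should not be read as a real ranking difference.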
How we weight
Knowledge carries a 12% weight in BenchLM.ai's overall scoring. Within the category, the weighted benchmarks sum to 100%: HLE (23%), MMLU-Pro (22%), FrontierScience (18%), SimpleQA (13%), GPQA (12%), and SuperGPQA (12%); the remaining benchmarks are display only.
For tasks like research assistance, factual Q&A, content creation, and educational applications, knowledge benchmark scores remain one of the strongest predictive signals. See the knowledge leaderboard for the top models in this category.
Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.
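The fallback described above amounts to a renormalized weighted average: missing benchmarks are dropped and the remaining category weights are rescaled to sum to 1, rather than filling gaps with synthetic values. A minimal sketch under that assumption; the WEIGHTS dict copies the category weights from the benchmark table in this section, the model scores are invented, and BenchLM's real pipeline may normalize each benchmark before blending, so this will not reproduce the leaderboard's displayed scores:

```python
# Category weights for the six weighted knowledge benchmarks (sums to 1.0).
WEIGHTS = {
    "GPQA": 0.12, "SuperGPQA": 0.12, "MMLU-Pro": 0.22,
    "HLE": 0.23, "FrontierScience": 0.18, "SimpleQA": 0.13,
}

def weighted_score(scores: dict) -> float:
    """Weighted average over available benchmarks, renormalizing the
    weights so missing rows are skipped rather than imputed."""
    available = {b: s for b, s in scores.items() if s is not None}
    total_w = sum(WEIGHTS[b] for b in available)
    if total_w == 0:
        raise ValueError("no weighted benchmarks available")
    return sum(WEIGHTS[b] * s for b, s in available.items()) / total_w

# Hypothetical model with two missing benchmarks (None):
scores = {"GPQA": 0.92, "SuperGPQA": None, "MMLU-Pro": 0.85,
          "HLE": 0.28, "FrontierScience": 0.35, "SimpleQA": None}
print(round(weighted_score(scores), 4))
```

With SuperGPQA and SimpleQA missing, the remaining 75% of weight is rescaled to 100%, so the harder benchmarks keep their relative influence.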
| Benchmark | Weight | Status | Description |
|---|---|---|---|
| MMLU | — | Display only | Tests knowledge across 57 academic subjects |
| GPQA | 12% | Weighted | Expert-level questions in biology, physics, and chemistry |
| GPQA-D | — | Display only | Provider-table reference for GPQA Diamond scores reported in first-party comparison charts. |
| SuperGPQA | 12% | Weighted | Enhanced version covering 285 disciplines |
| MMLU-Pro | 22% | Weighted | Harder version of MMLU with 10 answer choices and more reasoning-focused questions |
| HLE | 23% | Weighted | Extremely difficult questions contributed by domain experts worldwide to test frontier AI |
| FrontierScience | 18% | Weighted | Research-level science and scientific reasoning benchmark |
| HLE w/o tools | — | Display only | Tool-free variant of Humanity's Last Exam used to isolate raw frontier reasoning without external aids |
| SimpleQA | 13% | Weighted | Factual question answering benchmark |
| HealthBench Hard | — | Display only | A harder health reasoning benchmark subset used in first-party frontier model comparisons. |
| MedXpertQA (Text) | — | Display only | Medical multiple-choice benchmark covering many specialties with text-only questions. |
| FrontierScience Research | — | Display only | A research-oriented FrontierScience variant focused on scientific investigation and solution quality. |
| MMLU-Pro (Arcee) | — | Display only | Display-only MMLU-Pro reference from Arcee AI's Trinity-Large-Thinking launch chart. |
About Knowledge Benchmarks
The category spans broad academic coverage (MMLU's 57 subjects, MMLU-Pro) through expert-level science (GPQA, SuperGPQA) to frontier-difficulty exams (HLE, FrontierScience).