Knowledge

Knowledge Benchmarks

General knowledge and factual understanding

MMLU · GPQA · SuperGPQA · OpenBookQA · MMLU-Pro · HLE · FrontierScience

Knowledge benchmarks test whether an AI model can accurately recall facts and apply domain expertise. Unlike reasoning benchmarks that measure logical deduction, knowledge benchmarks evaluate the breadth and depth of information a model has internalized during training.

BenchLM.ai tracks seven knowledge benchmarks ranging from broad undergraduate-level tests (MMLU) to PhD-level science questions (GPQA, SuperGPQA) to frontier-difficulty expert questions (HLE, FrontierScience). This range matters because a model scoring 90+ on MMLU might still struggle with research-level scientific reasoning — broad knowledge doesn't guarantee deep expertise.

Knowledge carries a 12% weight in BenchLM.ai's overall scoring. For tasks like research assistance, factual Q&A, content creation, and educational applications, knowledge benchmark scores remain one of the strongest predictive signals. See our knowledge rankings for the top models in this category.
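The 12% category weight implies an overall score that is a weighted average across categories. The sketch below illustrates that arithmetic; only the knowledge weight comes from this page — the other category names and weights are placeholders for illustration, not BenchLM.ai's actual scheme.

```python
# Hypothetical weighted-average scoring sketch.
# Only the 12% knowledge weight is stated on this page; every other
# category and weight below is a placeholder for illustration.
CATEGORY_WEIGHTS = {
    "knowledge": 0.12,   # stated on this page
    "reasoning": 0.20,   # placeholder
    "coding":    0.20,   # placeholder
    "math":      0.18,   # placeholder
    "agentic":   0.30,   # placeholder
}

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores on a 0-100 scale."""
    total_weight = sum(CATEGORY_WEIGHTS[c] for c in category_scores)
    weighted = sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items())
    return weighted / total_weight

# Example: a model strong on knowledge but weaker elsewhere.
print(round(overall_score({"knowledge": 91, "reasoning": 80, "coding": 75,
                           "math": 85, "agentic": 70}), 1))  # → 78.2
```

Under this kind of scheme, a high knowledge score moves the overall ranking only in proportion to its 12% weight, which is why models can rank differently here than on the overall leaderboard.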

| # | Model | Developer | Access | Type | Context | Score | MMLU | GPQA | SuperGPQA | OpenBookQA | MMLU-Pro | HLE | FrontierScience |
|---|-------|-----------|--------|------|---------|-------|------|------|-----------|------------|----------|-----|-----------------|
| 1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 99% | 99% | 97% | 94% | 94% | 50% | 92% |
| 2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 99% | 99% | 97% | 95% | 90% | 44% | 93% |
| 3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 99% | 98% | 96% | 94% | 93% | 48% | 91% |
| 4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 99% | 97% | 95% | 93% | 90% | 44% | 90% |
| 5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 99% | 97% | 95% | 93% | 88% | 42% | 91% |
| 6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 99% | 98% | 96% | 94% | 89% | 44% | 92% |
| 7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 97% | 95% | 93% | 91% | 88% | 42% | 88% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 99% | 97% | 95% | 93% | 92% | 38% | 88% |
| 9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 98% | 97% | 95% | 93% | 88% | 43% | 91% |
| 10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 99% | 97% | 95% | 93% | 80% | 26% | 86% |
| 11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 99% | 97% | 95% | 93% | 92% | 40% | 88% |
| 12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 98% | 96% | 94% | 92% | 82% | 27% | 84% |
| 13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 99% | 97% | 95% | 93% | 90% | 40% | 91% |
| 14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 99% | 97% | 95% | 93% | 81% | 32% | 88% |
| 15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 97% | 95% | 93% | 91% | 83% | 27% | 84% |
| 16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 93% | 91% | 89% | 87% | 83% | 27% | 83% |
| 17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 99% | 97% | 95% | 93% | 83% | 21% | 85% |
| 18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 96% | 94% | 92% | 90% | 81% | 29% | 83% |
| 19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 91% | 89% | 87% | 85% | 81% | 27% | 82% |
| 20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 99% | 97% | 95% | 93% | 81% | 20% | 84% |
| 21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 99% | 97% | 95% | 93% | 83% | 20% | 86% |
| 22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 92% | 90% | 88% | 86% | 80% | 32% | 83% |
| 23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 95% | 93% | 91% | 89% | 84% | 21% | 84% |
| 24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 94% | 92% | 90% | 88% | 81% | 20% | 83% |
| 25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 92% | 90% | 88% | 86% | 81% | 27% | 80% |

Showing 25 of 124 models

About Knowledge Benchmarks

MMLU: Tests knowledge across 57 academic subjects