Knowledge Benchmarks
General knowledge and factual understanding
MMLU · GPQA · SuperGPQA · OpenBookQA · MMLU-Pro · HLE · FrontierScience
Knowledge benchmarks test whether an AI model can accurately recall facts and apply domain expertise. Unlike reasoning benchmarks, which measure logical deduction, knowledge benchmarks evaluate the breadth and depth of information a model has internalized during training.
BenchLM.ai tracks seven knowledge benchmarks ranging from broad undergraduate-level tests (MMLU) to PhD-level science questions (GPQA, SuperGPQA) to frontier-difficulty expert questions (HLE, FrontierScience). This range matters because a model scoring 90+ on MMLU might still struggle with research-level scientific reasoning — broad knowledge doesn't guarantee deep expertise.
Knowledge carries a 12% weight in BenchLM.ai's overall scoring. For tasks like research assistance, factual Q&A, content creation, and educational applications, knowledge benchmark scores remain one of the strongest predictive signals. See our knowledge rankings for the top models in this category.
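To make the weighting concrete, here is a minimal sketch of how a 12% category weight folds into a weighted overall score. Only the 12% Knowledge weight comes from BenchLM.ai; the other category names, weights, and example scores below are placeholders, not the site's actual formula.

```python
# Hypothetical illustration of a weighted overall score.
# Only the 0.12 Knowledge weight is stated by BenchLM.ai; the remaining
# categories and weights are placeholders for illustration only.
category_weights = {
    "knowledge": 0.12,   # stated weight for this category
    "reasoning": 0.30,   # placeholder
    "coding":    0.30,   # placeholder
    "math":      0.28,   # placeholder
}

category_scores = {
    "knowledge": 91,
    "reasoning": 88,
    "coding":    85,
    "math":      90,
}

# Weighted sum: each category score scaled by its share of the total weight.
overall = sum(category_weights[c] * category_scores[c] for c in category_weights)
print(round(overall, 1))  # 88.0 with these placeholder numbers
```

With a weight this small, even a large gap on knowledge benchmarks moves the overall score by only a few points, which is why the category rankings below can differ noticeably from overall rankings.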
| Rank | Model | Provider | Access | Type | Context | Knowledge Score | MMLU | GPQA | SuperGPQA | OpenBookQA | MMLU-Pro | HLE | FrontierScience |
|------|-------|----------|--------|------|---------|-----------------|------|------|-----------|------------|----------|-----|-----------------|
| 1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 99% | 99% | 97% | 94% | 94% | 50% | 92% |
| 2 | GPT-5.2 Pro | OpenAI | Closed | Reasoning | 400K | 90 | 99% | 99% | 97% | 95% | 90% | 44% | 93% |
| 3 | GPT-5.4 | OpenAI | Closed | Reasoning | 1.05M | 90 | 99% | 98% | 96% | 94% | 93% | 48% | 91% |
| 4 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 89 | 99% | 97% | 95% | 93% | 90% | 44% | 90% |
| 5 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 88 | 99% | 97% | 95% | 93% | 88% | 42% | 91% |
| 6 | GPT-5.3 Instant | OpenAI | Closed | Reasoning | 128K | 87 | 99% | 98% | 96% | 94% | 89% | 44% | 92% |
| 7 | GPT-5.3-Codex-Spark | OpenAI | Closed | Reasoning | 256K | 87 | 97% | 95% | 93% | 91% | 88% | 42% | 88% |
| 8 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 85 | 99% | 97% | 95% | 93% | 92% | 38% | 88% |
| 9 | GPT-5.2 Instant | OpenAI | Closed | Reasoning | 128K | 85 | 98% | 97% | 95% | 93% | 88% | 43% | 91% |
| 10 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 85 | 99% | 97% | 95% | 93% | 80% | 26% | 86% |
| 11 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 84 | 99% | 97% | 95% | 93% | 92% | 40% | 88% |
| 12 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 84 | 98% | 96% | 94% | 92% | 82% | 27% | 84% |
| 13 | Grok 4.1 | xAI | Closed | Standard | 1M | 84 | 99% | 97% | 95% | 93% | 90% | 40% | 91% |
| 14 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 81 | 99% | 97% | 95% | 93% | 81% | 32% | 88% |
| 15 | GPT-5.1 | OpenAI | Closed | Reasoning | 200K | 80 | 97% | 95% | 93% | 91% | 83% | 27% | 84% |
| 16 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 79 | 93% | 91% | 89% | 87% | 83% | 27% | 83% |
| 17 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 200K | 78 | 99% | 97% | 95% | 93% | 83% | 21% | 85% |
| 18 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 78 | 96% | 94% | 92% | 90% | 81% | 29% | 83% |
| 19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 78 | 91% | 89% | 87% | 85% | 81% | 27% | 82% |
| 20 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 77 | 99% | 97% | 95% | 93% | 81% | 20% | 84% |
| 21 | Gemini 3 Pro | Google | Closed | Standard | 2M | 77 | 99% | 97% | 95% | 93% | 83% | 20% | 86% |
| 22 | o1-preview | OpenAI | Closed | Reasoning | 200K | 77 | 92% | 90% | 88% | 86% | 80% | 32% | 83% |
| 23 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 200K | 76 | 95% | 93% | 91% | 89% | 84% | 21% | 84% |
| 24 | Grok 4.1 Fast | xAI | Closed | Standard | 1M | 76 | 94% | 92% | 90% | 88% | 81% | 20% | 83% |
| 25 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Reasoning | 128K | 76 | 92% | 90% | 88% | 86% | 81% | 27% | 80% |
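If you want to work with the leaderboard rows programmatically, the following is a minimal sketch that assumes the pipe-delimited row format above and the benchmark column order from the list at the top of this page. The unweighted mean it prints is only a rough sanity check; the Knowledge Score column reflects BenchLM.ai's own aggregation, which is not specified here.

```python
# Sketch: parse one pipe-delimited leaderboard row into a record.
# Assumes the benchmark columns follow the order listed on this page:
# MMLU, GPQA, SuperGPQA, OpenBookQA, MMLU-Pro, HLE, FrontierScience.
BENCHMARKS = ["MMLU", "GPQA", "SuperGPQA", "OpenBookQA",
              "MMLU-Pro", "HLE", "FrontierScience"]

def parse_row(row: str) -> dict:
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    rank, model, provider, access, kind, context, score = cells[:7]
    benchmarks = {name: float(cell.rstrip("%"))
                  for name, cell in zip(BENCHMARKS, cells[7:])}
    return {
        "rank": int(rank),
        "model": model,
        "provider": provider,
        "access": access,
        "type": kind,
        "context": context,
        "knowledge_score": float(score),
        "benchmarks": benchmarks,
    }

row = "| 1 | GPT-5.4 Pro | OpenAI | Closed | Reasoning | 1.05M | 91 | 99% | 99% | 97% | 94% | 94% | 50% | 92% |"
record = parse_row(row)
print(record["model"], record["knowledge_score"])  # GPT-5.4 Pro 91.0

# Unweighted mean of the seven benchmarks -- a rough check only; it does not
# reproduce the published Knowledge Score (91 for this row).
mean = sum(record["benchmarks"].values()) / len(BENCHMARKS)
print(round(mean, 1))  # 89.3
```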
About Knowledge Benchmarks
MMLU: tests knowledge across 57 academic subjects