What is the best LLM for knowledge tasks?

The top LLMs for knowledge tasks are ranked by benchmarks like MMLU and GPQA, which test factual accuracy and expert-level understanding across dozens of subjects.

What is MMLU and how does it measure LLM knowledge?

MMLU (Massive Multitask Language Understanding) tests LLMs across 57 subjects from STEM to humanities, measuring broad factual knowledge and reasoning at varying difficulty levels.

What benchmarks test knowledge in AI models?

Key knowledge benchmarks include MMLU, MMLU-Pro, GPQA, SuperGPQA, HLE, and FrontierScience, each evaluating different depths of factual and scientific understanding.

How do knowledge benchmarks differ from reasoning benchmarks?

Knowledge benchmarks focus on factual recall and domain expertise, while reasoning benchmarks test logical deduction and multi-step problem solving independent of specific facts.

Knowledge

Knowledge Benchmarks

Name: Knowledge Benchmarks — LLM Leaderboard
Creator: BenchLM.ai

General knowledge and factual understanding

MMLU · GPQA · SuperGPQA · OpenBookQA · MMLU-Pro · HLE · FrontierScience

Knowledge benchmarks test whether an AI model can accurately recall facts and apply domain expertise. Unlike reasoning benchmarks that measure logical deduction, knowledge benchmarks evaluate the breadth and depth of information a model has internalized during training.

BenchLM.ai tracks seven knowledge benchmarks ranging from broad undergraduate-level tests (MMLU) to PhD-level science questions (GPQA, SuperGPQA) to frontier-difficulty expert questions (HLE, FrontierScience). This range matters because a model scoring 90+ on MMLU might still struggle with research-level scientific reasoning — broad knowledge doesn't guarantee deep expertise.

Knowledge carries a 12% weight in BenchLM.ai's overall scoring. For tasks like research assistance, factual Q&A, content creation, and educational applications, knowledge benchmark scores remain one of the strongest predictive signals. See our knowledge rankings for the top models in this category.

124 models


1 GPT-5.4 Pro OpenAI	Closed	Reasoning	1.05M	91	99%	99%	97%	94%	94%	50%	92%
2 GPT-5.2 Pro OpenAI	Closed	Reasoning	400K	90	99%	99%	97%	95%	90%	44%	93%
3 GPT-5.4 OpenAI	Closed	Reasoning	1.05M	90	99%	98%	96%	94%	93%	48%	91%
4 GPT-5.3 Codex OpenAI	Closed	Reasoning	400K	89	99%	97%	95%	93%	90%	44%	90%
5 GPT-5.2 OpenAI	Closed	Reasoning	400K	88	99%	97%	95%	93%	88%	42%	91%
6 GPT-5.3 Instant OpenAI	Closed	Reasoning	128K	87	99%	98%	96%	94%	89%	44%	92%
7 GPT-5.3-Codex-Spark OpenAI	Closed	Reasoning	256K	87	97%	95%	93%	91%	88%	42%	88%
8 Claude Opus 4.6 Anthropic	Closed	Standard	1M	85	99%	97%	95%	93%	92%	38%	88%
9 GPT-5.2 Instant OpenAI	Closed	Reasoning	128K	85	98%	97%	95%	93%	88%	43%	91%
10 GPT-5.2-Codex OpenAI	Closed	Reasoning	400K	85	99%	97%	95%	93%	80%	26%	86%
11 Gemini 3.1 Pro Google	Closed	Standard	1M	84	99%	97%	95%	93%	92%	40%	88%
12 GPT-5.1-Codex-Max OpenAI	Closed	Reasoning	400K	84	98%	96%	94%	92%	82%	27%	84%
13 Grok 4.1 xAI	Closed	Standard	1M	84	99%	97%	95%	93%	90%	40%	91%
14 Gemini 3 Pro Deep Think Google	Closed	Reasoning	2M	81	99%	97%	95%	93%	81%	32%	88%
15 GPT-5.1 OpenAI	Closed	Reasoning	200K	80	97%	95%	93%	91%	83%	27%	84%
16 GPT-5 (high) OpenAI	Closed	Reasoning	128K	79	93%	91%	89%	87%	83%	27%	83%
17 Claude Sonnet 4.6 Anthropic	Closed	Standard	200K	78	99%	97%	95%	93%	83%	21%	85%
18 GLM-5 (Reasoning) Zhipu AI	Open	Reasoning	200K	78	96%	94%	92%	90%	81%	29%	83%
19 GPT-5 (medium) OpenAI	Closed	Reasoning	128K	78	91%	89%	87%	85%	81%	27%	82%
20 Claude Opus 4.5 Anthropic	Closed	Standard	200K	77	99%	97%	95%	93%	81%	20%	84%
21 Gemini 3 Pro Google	Closed	Standard	2M	77	99%	97%	95%	93%	83%	20%	86%
22 o1-preview OpenAI	Closed	Reasoning	200K	77	92%	90%	88%	86%	80%	32%	83%
23 Claude Sonnet 4.5 Anthropic	Closed	Standard	200K	76	95%	93%	91%	89%	84%	21%	84%
24 Grok 4.1 Fast xAI	Closed	Standard	1M	76	94%	92%	90%	88%	81%	20%	83%
25 Kimi K2.5 (Reasoning) Moonshot AI	Closed	Reasoning	128K	76	92%	90%	88%	86%	81%	27%	80%

Showing 25 of 124

Knowledge benchmark updates

Get notified when MMLU, MMLU-Pro, or knowledge scores change.

Free. No spam. Unsubscribe anytime. We only store derived location metadata for consent routing.

About Knowledge Benchmarks

Tests knowledge across 57 academic subjects