123 models tracked · 32 benchmarks · Updated regularly

Compare the World's Best AI Model Benchmarks

Performance data across agentic, coding, reasoning, knowledge, and multimodal workflows — curated, transparent, and reproducible.

Benchmark categories: Agentic · Coding · Reasoning · MM/Grounded · Knowledge · Multilingual · IF · Math · Arena

Each entry lists rank, model, developer, access, model type, context window, overall score, the 32 individual benchmark scores (%), and Chatbot Arena Elo.

1. GPT-5.4 Pro · OpenAI · Closed · Reasoning · 1.05M context · Overall 91 · Elo 1472
   Benchmarks (%): 90, 88, 84, 95, 86, 86, 89, 97, 95, 98, 95, 97, 94, 96, 99, 99, 97, 94, 94, 50, 92, 97, 95, 97, 99, 99, 99, 96, 98, 97, 97, 99
2. GPT-5.2 Pro · OpenAI · Closed · Reasoning · 400K context · Overall 90 · Elo 1442
   Benchmarks (%): 88, 88, 82, 93, 83, 81, 89, 97, 95, 98, 93, 95, 96, 96, 99, 99, 97, 95, 90, 44, 93, 96, 92, 95, 99, 99, 99, 96, 98, 97, 97, 99
3. GPT-5.4 · OpenAI · Closed · Reasoning · 1.05M context · Overall 90 · Elo 1454
   Benchmarks (%): 90, 88, 85, 95, 84, 84, 85, 97, 94, 97, 95, 97, 95, 96, 99, 98, 96, 94, 93, 48, 91, 96, 94, 96, 99, 99, 99, 96, 98, 97, 97, 99
4. GPT-5.3 Codex · OpenAI · Closed · Reasoning · 400K context · Overall 89 · Elo 1416
   Benchmarks (%): 90, 88, 86, 95, 85, 85, 90, 95, 93, 98, 92, 93, 89, 94, 99, 97, 95, 93, 90, 44, 90, 96, 91, 93, 99, 99, 98, 95, 97, 96, 96, 99
5. GPT-5.2 · OpenAI · Closed · Reasoning · 400K context · Overall 88 · Elo 1426
   Benchmarks (%): 90, 84, 81, 91, 80, 79, 85, 95, 93, 96, 91, 93, 95, 95, 99, 97, 95, 93, 88, 42, 91, 95, 91, 94, 99, 99, 98, 95, 97, 96, 96, 98
6. GPT-5.3 Instant · OpenAI · Closed · Reasoning · 128K context · Overall 87 · Elo 1438
   Benchmarks (%): 86, 82, 80, 88, 76, 75, 83, 96, 94, 97, 92, 94, 95, 95, 99, 98, 96, 94, 89, 44, 92, 96, 92, 96, 99, 99, 98, 95, 97, 96, 96, 98
7. GPT-5.3-Codex-Spark · OpenAI · Closed · Reasoning · 256K context · Overall 87 · Elo 1398
   Benchmarks (%): 90, 82, 83, 91, 80, 80, 85, 94, 92, 97, 91, 92, 86, 91, 97, 95, 93, 91, 88, 42, 88, 94, 89, 92, 98, 98, 97, 94, 96, 95, 95, 98
8. Claude Opus 4.6 · Anthropic · Closed · Standard · 1M context · Overall 85 · Elo 1422
   Benchmarks (%): 80, 85, 74, 91, 80, 75, 74, 95, 93, 94, 92, 92, 95, 94, 99, 97, 95, 93, 92, 38, 88, 96, 94, 95, 99, 99, 98, 95, 97, 96, 96, 98
9. GPT-5.2 Instant · OpenAI · Closed · Reasoning · 128K context · Overall 85 · Elo 1428
   Benchmarks (%): 83, 82, 74, 87, 75, 74, 77, 95, 93, 96, 89, 84, 94, 92, 98, 97, 95, 93, 88, 43, 91, 95, 94, 95, 99, 99, 98, 95, 97, 96, 96, 98
10. GPT-5.2-Codex · OpenAI · Closed · Reasoning · 400K context · Overall 85 · Elo 1331
   Benchmarks (%): 90, 85, 85, 95, 76, 66, 86, 95, 93, 90, 90, 91, 84, 92, 99, 97, 95, 93, 80, 26, 86, 91, 87, 92, 99, 99, 98, 95, 97, 96, 96, 94
11. Gemini 3.1 Pro · Google · Closed · Standard · 1M context · Overall 84 · Elo 1423
   Benchmarks (%): 77, 86, 68, 91, 75, 71, 72, 95, 93, 92, 93, 90, 95, 95, 99, 97, 95, 93, 92, 40, 88, 96, 93, 95, 99, 99, 98, 95, 97, 96, 96, 97
12. GPT-5.1-Codex-Max · OpenAI · Closed · Reasoning · 400K context · Overall 84 · Elo 1349
   Benchmarks (%): 90, 85, 82, 94, 75, 67, 84, 94, 92, 92, 90, 93, 85, 92, 98, 96, 94, 92, 82, 27, 84, 89, 87, 91, 99, 99, 98, 95, 97, 96, 96, 93
13. Grok 4.1 · xAI · Closed · Standard · 1M context · Overall 84 · Elo 1435
   Benchmarks (%): 79, 79, 73, 91, 77, 73, 73, 95, 93, 93, 90, 89, 95, 91, 99, 97, 95, 93, 90, 40, 91, 96, 91, 93, 99, 99, 98, 95, 97, 96, 96, 97
14. Gemini 3 Pro Deep Think · Google · Closed · Reasoning · 2M context · Overall 81 · Elo 1349
   Benchmarks (%): 77, 87, 73, 91, 58, 58, 63, 95, 93, 95, 94, 96, 95, 95, 99, 97, 95, 93, 81, 32, 88, 92, 85, 89, 99, 99, 98, 95, 97, 96, 96, 92
15. GPT-5.1 · OpenAI · Closed · Reasoning · 200K context · Overall 80 · Elo 1334
   Benchmarks (%): 78, 79, 71, 89, 68, 61, 71, 93, 91, 92, 84, 84, 94, 89, 97, 95, 93, 91, 83, 27, 84, 89, 87, 89, 99, 99, 98, 95, 97, 96, 96, 94
16. GPT-5 (high) · OpenAI · Closed · Reasoning · 128K context · Overall 79 · Elo 1337
   Benchmarks (%): 78, 75, 72, 85, 67, 62, 70, 89, 87, 94, 83, 80, 93, 85, 93, 91, 89, 87, 83, 27, 83, 89, 85, 91, 95, 97, 96, 91, 93, 92, 94, 94
17. Claude Sonnet 4.6 · Anthropic · Closed · Standard · 200K context · Overall 78 · Elo 1339
   Benchmarks (%): 70, 77, 68, 93, 69, 54, 64, 95, 93, 88, 83, 79, 95, 88, 99, 97, 95, 93, 83, 21, 85, 91, 89, 91, 99, 99, 98, 95, 97, 96, 96, 91
18. GLM-5 (Reasoning) · Zhipu AI · Open · Reasoning · 200K context · Overall 78 · Elo 1340
   Benchmarks (%): 81, 80, 74, 88, 62, 58, 67, 92, 90, 91, 86, 87, 74, 84, 96, 94, 92, 90, 81, 29, 83, 89, 85, 92, 98, 99, 98, 94, 96, 95, 96, 92
19. GPT-5 (medium) · OpenAI · Closed · Reasoning · 128K context · Overall 78 · Elo 1328
   Benchmarks (%): 77, 78, 72, 83, 67, 60, 72, 87, 85, 92, 81, 81, 89, 87, 91, 89, 87, 85, 81, 27, 82, 90, 87, 88, 93, 95, 94, 89, 91, 90, 92, 92
20. Claude Opus 4.5 · Anthropic · Closed · Standard · 200K context · Overall 77 · Elo 1349
   Benchmarks (%): 71, 73, 68, 91, 68, 57, 62, 95, 93, 87, 82, 81, 94, 87, 99, 97, 95, 93, 81, 20, 84, 90, 84, 90, 99, 99, 98, 95, 97, 96, 96, 89
21. Gemini 3 Pro · Google · Closed · Standard · 2M context · Overall 77 · Elo 1328
   Benchmarks (%): 68, 83, 66, 91, 59, 49, 58, 95, 93, 90, 90, 87, 94, 92, 99, 97, 95, 93, 83, 20, 86, 89, 85, 88, 99, 99, 98, 95, 97, 96, 96, 91
22. o1-preview · OpenAI · Closed · Reasoning · 200K context · Overall 77 · Elo 1328
   Benchmarks (%): 77, 79, 71, 86, 65, 60, 69, 88, 86, 93, 87, 83, 72, 80, 92, 90, 88, 86, 80, 32, 83, 90, 86, 88, 94, 96, 95, 90, 92, 91, 93, 94
23. Claude Sonnet 4.5 · Anthropic · Closed · Standard · 200K context · Overall 76 · Elo 1346
   Benchmarks (%): 69, 74, 69, 87, 66, 53, 60, 91, 89, 88, 82, 81, 95, 87, 95, 93, 91, 89, 84, 21, 84, 91, 87, 90, 97, 99, 98, 93, 95, 94, 96, 88
24. Grok 4.1 Fast · xAI · Closed · Standard · 1M context · Overall 76 · Elo 1342
   Benchmarks (%): 74, 73, 66, 86, 68, 54, 63, 90, 88, 87, 87, 89, 91, 83, 94, 92, 90, 88, 81, 20, 83, 88, 83, 90, 96, 98, 97, 92, 94, 93, 95, 89
25. Kimi K2.5 (Reasoning) · Moonshot AI · Closed · Reasoning · 128K context · Overall 76 · Elo 1325
   Benchmarks (%): 75, 77, 68, 84, 65, 58, 70, 88, 86, 91, 82, 81, 72, 77, 92, 90, 88, 86, 81, 27, 80, 88, 86, 91, 94, 96, 95, 90, 92, 91, 93, 92

Showing 25 of 123 models.

Weekly LLM Benchmark Digest

Get notified when new models drop, benchmark scores change, or the leaderboard shifts. One email per week.

Free. No spam. Unsubscribe anytime.

Scoring Methodology

Each model's overall score is a weighted average of category averages. Within each category, benchmark scores are combined using per-benchmark weights that favor harder, less-saturated evaluations. The category averages are then combined using the weights below. Legacy benchmarks (MMLU, HumanEval, older competition math exams) are still displayed but excluded from scoring.

Agentic: 22%
Coding: 20%
Reasoning: 17%
Multimodal & Grounded: 12%
Knowledge: 12%
Multilingual: 7%
Instruction Following: 5%
Math: 5%
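The two-stage scheme above can be sketched in a few lines of Python. The category weights are taken from the methodology section; the per-benchmark weights and all benchmark scores below are hypothetical, since the site does not publish its exact per-benchmark weighting.

```python
def weighted_average(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean over the benchmarks/categories present in `scores`.

    Missing keys are skipped and the remaining weights renormalized, so a
    model without a score on every benchmark still gets a 0-100 result.
    """
    total = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total

# Category weights from the methodology section (sum to 1.0).
CATEGORY_WEIGHTS = {
    "Agentic": 0.22,
    "Coding": 0.20,
    "Reasoning": 0.17,
    "Multimodal & Grounded": 0.12,
    "Knowledge": 0.12,
    "Multilingual": 0.07,
    "Instruction Following": 0.05,
    "Math": 0.05,
}

# Hypothetical per-benchmark weights for one category, skewed toward the
# harder, less-saturated evaluation as the methodology describes.
CODING_WEIGHTS = {
    "SWE-bench Verified": 0.30,
    "SWE-bench Pro": 0.45,   # hardest, least saturated -> heaviest weight
    "LiveCodeBench": 0.25,
}

# Stage 1: benchmark scores -> category average (hypothetical scores).
coding_avg = weighted_average(
    {"SWE-bench Verified": 80, "SWE-bench Pro": 55, "LiveCodeBench": 85},
    CODING_WEIGHTS,
)

# Stage 2: category averages -> overall score (hypothetical averages).
overall = weighted_average(
    {"Agentic": 90, "Coding": coding_avg, "Reasoning": 85,
     "Multimodal & Grounded": 84, "Knowledge": 95, "Multilingual": 86,
     "Instruction Following": 97, "Math": 95},
    CATEGORY_WEIGHTS,
)
print(round(coding_avg, 2), round(overall, 2))
```

Renormalizing over present keys also matches the note that legacy benchmarks are displayed but excluded: dropping a benchmark from the score dictionary simply removes it from both the numerator and the denominator.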

Agentic

Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified

Coding

SWE-bench Verified · SWE-bench Pro · LiveCodeBench

Reasoning

SimpleQA · MuSR · BBH · LongBench v2 · MRCRv2

Multimodal & Grounded

MMMU-Pro · OfficeQA Pro

Knowledge

GPQA · SuperGPQA · MMLU-Pro · HLE · FrontierScience

Multilingual

MGSM · MMLU-ProX

Instruction Following

IFEval

Math

AIME 2025 · HMMT 2025 · BRUMO 2025 · MATH-500

Data sourced from OpenBench, official model papers, and public leaderboards. Chatbot Arena Elo is tracked separately and not included in the overall score.