88 models tracked · Updated regularly

Compare the World's Best
AI Model Benchmarks.

Performance data across knowledge, coding, math & reasoning — curated, transparent, and reproducible.

88 models
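| Rank | Model | Organization | Access | Type | Context | Overall | Knowledge | Coding | Math | Reasoning | IF | Multi | Arena |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 92 | 99% 97% 95% 93% 90% 44% | 95% 85% 85% | 99% 99% 98% 95% 97% 96% 96% 99% | 95% 93% 98% | 93% | 96% | 1416 |
| 2 | GPT-5.4 | OpenAI | Closed | Reasoning | 1M | 91 | 99% 97% 95% 93% 91% 46% | 91% 81% 75% | 99% 99% 98% 95% 97% 96% 96% 99% | 95% 93% 95% | 95% | 95% | 1442 |
| 3 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 91 | 99% 97% 95% 93% 88% 42% | 91% 80% 79% | 99% 99% 98% 95% 97% 96% 96% 98% | 95% 93% 96% | 94% | 95% | 1426 |
| 4 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 90 | 99% 97% 95% 93% 92% 38% | 91% 80% 75% | 99% 99% 98% 95% 97% 96% 96% 98% | 95% 93% 94% | 95% | 96% | 1422 |
| 5 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 89 | 99% 97% 95% 93% 92% 40% | 91% 75% 71% | 99% 99% 98% 95% 97% 96% 96% 97% | 95% 93% 92% | 95% | 96% | 1423 |
| 6 | Grok 4.1 | xAI | Closed | Standard | 128K | 89 | 99% 97% 95% 93% 90% 40% | 91% 77% 73% | 99% 99% 98% 95% 97% 96% 96% 97% | 95% 93% 93% | 93% | 96% | 1435 |
| 7 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 88 | 99% 97% 95% 93% 80% 26% | 95% 76% 66% | 99% 99% 98% 95% 97% 96% 96% 94% | 95% 93% 90% | 92% | 91% | 1331 |
| 8 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 87 | 98% 96% 94% 92% 82% 27% | 94% 75% 67% | 99% 99% 98% 95% 97% 96% 96% 93% | 94% 92% 92% | 91% | 89% | 1349 |
| 9 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 1M | 86 | 99% 97% 95% 93% 83% 21% | 93% 69% 54% | 99% 99% 98% 95% 97% 96% 96% 91% | 95% 93% 88% | 91% | 91% | 1339 |
| 10 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 85 | 99% 97% 95% 93% 81% 32% | 91% 58% 58% | 99% 99% 98% 95% 97% 96% 96% 92% | 95% 93% 95% | 89% | 92% | 1349 |
| 11 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 85 | 99% 97% 95% 93% 81% 20% | 91% 68% 57% | 99% 99% 98% 95% 97% 96% 96% 89% | 95% 93% 87% | 90% | 90% | 1349 |
| 12 | GPT-5.1 | OpenAI | Closed | Reasoning | 400K | 85 | 97% 95% 93% 91% 83% 27% | 89% 68% 61% | 99% 99% 98% 95% 97% 96% 96% 94% | 93% 91% 92% | 89% | 89% | 1334 |
| 13 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 84 | 93% 91% 89% 87% 83% 27% | 85% 67% 62% | 95% 97% 96% 91% 93% 92% 94% 94% | 89% 87% 94% | 91% | 89% | 1337 |
| 14 | Gemini 3 Pro | Google | Closed | Standard | 2M | 84 | 99% 97% 95% 93% 83% 20% | 91% 59% 49% | 99% 99% 98% 95% 97% 96% 96% 91% | 95% 93% 90% | 88% | 89% | 1328 |
| 15 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 84 | 96% 94% 92% 90% 81% 29% | 88% 62% 58% | 98% 99% 98% 94% 96% 95% 96% 92% | 92% 90% 91% | 92% | 89% | 1340 |
| 16 | o1-preview | OpenAI | Closed | Reasoning | 200K | 83 | 92% 90% 88% 86% 80% 32% | 86% 65% 60% | 94% 96% 95% 90% 92% 91% 93% 94% | 88% 86% 93% | 88% | 90% | 1328 |
| 17 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 1M | 83 | 95% 93% 91% 89% 84% 21% | 87% 66% 53% | 97% 99% 98% 93% 95% 94% 96% 88% | 91% 89% 88% | 90% | 91% | 1346 |
| 18 | Grok 4.1 Fast | xAI | Closed | Standard | 2M | 83 | 94% 92% 90% 88% 81% 20% | 86% 68% 54% | 96% 98% 97% 92% 94% 93% 95% 89% | 90% 88% 87% | 90% | 88% | 1342 |
| 19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 82 | 91% 89% 87% 85% 81% 27% | 83% 67% 60% | 93% 95% 94% 89% 91% 90% 92% 92% | 87% 85% 92% | 88% | 90% | 1328 |
| 20 | Kimi K2.5 (Reasoning) | Moonshot AI | Open | Reasoning | 128K | 82 | 92% 90% 88% 86% 81% 27% | 84% 65% 58% | 94% 96% 95% 90% 92% 91% 93% 92% | 88% 86% 91% | 91% | 88% | 1325 |
| 21 | Qwen3.5 397B (Reasoning) | Alibaba | Open | Reasoning | 128K | 82 | 91% 89% 87% 85% 81% 29% | 83% 60% 60% | 93% 95% 94% 89% 91% 90% 92% 93% | 87% 85% 91% | 89% | 91% | 1326 |
| 22 | o3-pro | OpenAI | Closed | Reasoning | 200K | 77 | 88% 89% 87% 85% 75% 26% | 80% 46% 44% | 90% 92% 91% 86% 88% 87% 89% 89% | 86% 84% 89% | 82% | 83% | 1242 |
| 23 | o3 | OpenAI | Closed | Reasoning | 200K | 76 | 86% 87% 85% 83% 75% 24% | 78% 50% 40% | 88% 90% 89% 84% 86% 85% 87% 88% | 84% 82% 86% | 85% | 83% | 1258 |
| 24 | DeepSeek V3.2 (Thinking) | DeepSeek | Open | Reasoning | 128K | 75 | 87% 85% 83% 81% 73% 22% | 79% 48% 45% | 87% 89% 88% 83% 85% 84% 86% 84% | 83% 81% 86% | 85% | 84% | 1260 |
| 25 | GPT-5 mini | OpenAI | Closed | Reasoning | 128K | 74 | 88% 86% 84% 82% 73% 16% | 80% 41% 37% | 90% 92% 91% 86% 88% 87% 89% 85% | 84% 82% 87% | 82% | 82% | 1243 |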
Showing 25 of 88

Scoring Methodology

Each model's overall score is a weighted average of category averages. Within each category, all benchmark scores are averaged equally. The category averages are then combined using the weights below, giving more influence to well-covered categories and the capabilities most relevant to real-world use.

Coding: 25%
Knowledge: 20%
Math: 20%
Reasoning: 20%
Instruction Following: 10%
Multilingual: 5%
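
As a rough illustration of the weighting described above, the sketch below averages the benchmark scores within each category and then combines the category averages using the listed weights. The helper name `overall_score` and the dictionary keys are hypothetical. The example input is the top-ranked row of the leaderboard, split into categories by the number of benchmarks listed per category; that grouping is an assumption, not a published breakdown.

```python
from statistics import mean

# Category weights as listed in the methodology (they sum to 1.0).
WEIGHTS = {
    "coding": 0.25,
    "knowledge": 0.20,
    "math": 0.20,
    "reasoning": 0.20,
    "instruction_following": 0.10,
    "multilingual": 0.05,
}


def overall_score(scores_by_category):
    """Weighted average of per-category means (hypothetical helper).

    Assumes every category in WEIGHTS has at least one benchmark score.
    """
    return sum(
        weight * mean(scores_by_category[category])
        for category, weight in WEIGHTS.items()
    )


# Example input: the top-ranked row's percentages, grouped by category.
# The grouping is an assumption based on the benchmark count per category.
example = {
    "knowledge": [99, 97, 95, 93, 90, 44],
    "coding": [95, 85, 85],
    "math": [99, 99, 98, 95, 97, 96, 96, 99],
    "reasoning": [95, 93, 98],
    "instruction_following": [93],
    "multilingual": [96],
}

print(round(overall_score(example)))  # 92
```

Under this grouping the unrounded result is about 91.99, which rounds to the overall score of 92 shown for that row. How the site handles a category with no available benchmark scores is not stated, so the sketch does not attempt it.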

Coding

HumanEval · SWE-bench Verified · LiveCodeBench

Knowledge

MMLU · GPQA · SuperGPQA · OpenBookQA · MMLU-Pro · HLE

Math

AIME 2023–2025 · HMMT 2023–2025 · BRUMO 2025 · MATH-500

Reasoning

SimpleQA · MuSR · BBH

Instruction Following

IFEval

Multilingual

MGSM

Data sourced from OpenBench, official model papers, and public leaderboards. Chatbot Arena Elo is tracked separately and not included in the overall score.