133 models tracked · 51 benchmarks · Updated regularly

Compare the World's Best AI Model Benchmarks.

Performance data across agentic, coding, reasoning, knowledge, and multimodal workflows — curated, transparent, and reproducible.

133 models
Score confidence: Full (7+ verified categories) · Good (5+ categories) · Limited (3+ categories) · Est. (unverified)
| # | Model | Developer | Access | Context | Overall | Category scores | Arena Elo |
|---|-------|-----------|--------|---------|---------|-----------------|-----------|
| 1 | Gemini 3.1 Pro | Google | Closed | 1M | 83 | 76 · 72 · 88 · 95 · 81 · 94 · 95 · 97 | 1423 |
| 2 | GPT-5.4 | OpenAI | Closed | 1.05M | 79 | 77 · 73 · 90 · 88 · 83 · 95 · 96 · 98 | 1454 |
| 3 | Claude Sonnet 4.6 | Anthropic | Closed | 200K | 76 | 71 · 61 · 78 · 92 · 74 · 90 · 91 · 96 | 1339 |
| 4 | GPT-5.4 Pro | OpenAI | Closed | 1.05M | 76 | 88 · 87 · 96 · 95 · 85 · 96 · 97 · 98 | 1472 |
| 5 | Gemini 3 Pro Deep Think | Google | Closed | 2M | 70 | 78 · 60 · 82 · 95 · 76 · 87 · 89 · 96 | 1349 |
| 6 | GPT-5.3 Codex | OpenAI | Closed | 400K | 70 | 76 · 73 · 93 · 91 · 82 · 93 · 93 · 98 | 1416 |
| 7 | o3-mini | OpenAI | Closed | 200K | 70 | 67 · 55 · 81 · 74 · 71 · 73 · 94 | n/a |
| 8 | Qwen2.5-1M | Alibaba | Open | 1M | 67 | 65 · 45 · 81 · 68 · 62 · 80 · 84 · 85 | 1256 |
| 9 | GPT-4.1 | OpenAI | Closed | 1M | 67 | 65 · 52 · 81 · 74 · 63 · 69 · 87 | n/a |
| 10 | o1 | OpenAI | Closed | 200K | 67 | 65 · 48 · 78 · 71 · 69 · 77 · 92 | n/a |
| 11 | Grok 4.1 | xAI | Closed | 1M | 67 | 78 · 74 · 91 · 93 · 81 · 93 · 93 · 97 | 1435 |
| 12 | Claude Opus 4.6 | Anthropic | Closed | 1M | 67 | 79 · 76 · 86 · 85 · 78 · 95 · 95 · 97 | 1422 |
| 13 | DeepSeek V3.2 (Thinking) | DeepSeek | Open | 128K | 66 | 69 · 51 · 60 · 71 · 66 · 81 · 85 · 86 | 1260 |
| 14 | DeepSeek Coder 2.0 | DeepSeek | Open | 128K | 66 | 68 · 53 · 73 · 59 · 61 · 80 · 86 · 81 | 1238 |
| 15 | GPT-5.2 | OpenAI | Closed | 400K | 66 | 66 · 69 · 82 · 95 · 80 · 92 · 94 · 97 | 1426 |
| 16 | Nemotron 3 Ultra 500B | NVIDIA | Open | 10M | 63 | 63 · 44 · 79 · 67 · 58 · 80 · 84 · 77 | 1252 |
| 17 | Grok 4 | xAI | Closed | 128K | 63 | 58 · 43 · 60 · 78 · 65 · 81 · 82 · 86 | 1238 |
| 18 | Qwen3.5 397B (Reasoning) | Alibaba | Open | 128K | 63 | 75 · 62 · 82 · 71 · 72 · 88 · 89 · 93 | 1326 |
| 19 | Claude Haiku 4.5 | Anthropic | Closed | 200K | 62 | 57 · 42 · 69 · 78 · 54 · 80 · 86 · 71 | 1263 |
| 20 | o3 | OpenAI | Closed | 200K | 62 | 70 · 49 · 62 · 72 · 67 · 81 · 85 · 88 | 1258 |
| 21 | Gemini 3 Flash | Google | Closed | 1M | 62 | 58 · 41 · 73 · 80 · 54 · 81 · 85 · 73 | 1241 |
| 22 | Claude 4 Sonnet | Anthropic | Closed | 200K | 61 | 58 · 43 · 71 · 80 · 58 · 82 · 83 · 75 | 1239 |
| 23 | Nemotron 3 Super 100B | NVIDIA | Open | 1M | 60 | 57 · 42 · 71 · 60 · 53 · 80 · 84 · 70 | 1260 |
| 24 | Qwen3.5 397B | Alibaba | Open | 128K | 60 | 57 · 41 · 73 · 61 · 61 · 79 · 82 · 83 | 1237 |
| 25 | Claude 3.5 Sonnet | Anthropic | Closed | 200K | 60 | 55 · 38 · 68 · 75 · 52 · 81 · 83 · 69 | 1214 |
Showing 25 of 133

Weekly LLM Benchmark Digest

Get notified when new models drop, benchmark scores change, or the leaderboard shifts. One email per week.

Free. No spam. Unsubscribe anytime. We only store derived location metadata for consent routing.

Scoring Methodology

Each model's overall score is a normalized weighted average of category averages. Within each category, benchmark scores are combined using per-benchmark weights that favor harder, less-saturated evaluations. Scores are normalized so that models with partial benchmark coverage are not unfairly penalized. Each score carries a confidence indicator (1-4 dots) showing how much verified data supports it; models with no verified benchmarks are marked as estimated. Display-only benchmarks such as MMLU, HumanEval, BBH, FLTEval, the AIME 2023-2024 exams, and the HMMT exams remain visible but are excluded from scoring.
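
As a rough illustration, the confidence tiers could be derived from the count of verified benchmark categories along these lines. This is a minimal sketch, not the site's actual code; the thresholds follow the legend at the top of the leaderboard, and the function name is hypothetical.

```python
def confidence_tier(verified_categories: int) -> str:
    """Map a model's count of verified benchmark categories to a confidence tier.

    Thresholds follow the leaderboard legend: Full (7+ verified categories),
    Good (5+), Limited (3+); anything below that is treated as an estimate.
    """
    if verified_categories >= 7:
        return "Full"     # 4 dots
    if verified_categories >= 5:
        return "Good"     # 3 dots
    if verified_categories >= 3:
        return "Limited"  # 2 dots
    return "Est."         # 1 dot: unverified, score is an estimate
```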

Agentic: 22%
Coding: 20%
Reasoning: 17%
Multimodal & Grounded: 12%
Knowledge: 12%
Multilingual: 7%
Instruction Following: 5%
Math: 5%
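
To make the normalized weighted average concrete, here is a minimal sketch of how an overall score could be computed from category averages under the weights above. It re-normalizes the weights over the categories a model actually has scores for, which is one plausible reading of "not unfairly penalized"; the function and variable names are illustrative, not the site's implementation.

```python
# Category weights from the list above (they sum to 1.0).
CATEGORY_WEIGHTS = {
    "Agentic": 0.22,
    "Coding": 0.20,
    "Reasoning": 0.17,
    "Multimodal & Grounded": 0.12,
    "Knowledge": 0.12,
    "Multilingual": 0.07,
    "Instruction Following": 0.05,
    "Math": 0.05,
}


def overall_score(category_averages: dict[str, float]) -> float:
    """Weighted average of category averages, re-normalized over the
    categories the model is actually scored in, so missing categories
    are ignored rather than counted as zero."""
    covered = {c: s for c, s in category_averages.items() if c in CATEGORY_WEIGHTS}
    total_weight = sum(CATEGORY_WEIGHTS[c] for c in covered)
    if total_weight == 0:
        raise ValueError("no scored categories")
    return sum(CATEGORY_WEIGHTS[c] * s for c, s in covered.items()) / total_weight


if __name__ == "__main__":
    # e.g. a model verified in only three categories
    print(round(overall_score({"Agentic": 70, "Coding": 60, "Math": 90})))  # 68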

Agentic

Weighted: Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified

Coding

Weighted: SWE-bench Pro · LiveCodeBench · SWE-bench Verified. Display-only: HumanEval · FLTEval

Reasoning

Weighted: LongBench v2 · ARC-AGI-2 · MRCRv2 · MuSR. Display-only: BBH

Multimodal & Grounded

Weighted: MMMU-Pro · OfficeQA Pro

Knowledge

Weighted: GPQA · SuperGPQA · MMLU-Pro · HLE · FrontierScience · SimpleQA. Display-only: MMLU

Multilingual

Weighted: MGSM · MMLU-ProX

Instruction Following

Weighted: IFEval

Math

Weighted: AIME 2025 · BRUMO 2025 · MATH-500. Display-only: AIME 2023-2024 · HMMT 2023-2025
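
Taken together, the category definitions above amount to a small configuration: each category has weighted benchmarks that feed its average and display-only benchmarks that are shown but excluded from scoring. The sketch below mirrors those lists; the data structure and helper name are illustrative, not the site's actual schema.

```python
# Benchmark configuration per the category descriptions above.
# "weighted" benchmarks count toward the category average;
# "display_only" benchmarks are visible but excluded from scoring.
BENCHMARKS = {
    "Agentic": {
        "weighted": ["Terminal-Bench 2.0", "BrowseComp", "OSWorld-Verified"],
        "display_only": [],
    },
    "Coding": {
        "weighted": ["SWE-bench Pro", "LiveCodeBench", "SWE-bench Verified"],
        "display_only": ["HumanEval", "FLTEval"],
    },
    "Reasoning": {
        "weighted": ["LongBench v2", "ARC-AGI-2", "MRCRv2", "MuSR"],
        "display_only": ["BBH"],
    },
    "Multimodal & Grounded": {
        "weighted": ["MMMU-Pro", "OfficeQA Pro"],
        "display_only": [],
    },
    "Knowledge": {
        "weighted": ["GPQA", "SuperGPQA", "MMLU-Pro", "HLE", "FrontierScience", "SimpleQA"],
        "display_only": ["MMLU"],
    },
    "Multilingual": {
        "weighted": ["MGSM", "MMLU-ProX"],
        "display_only": [],
    },
    "Instruction Following": {
        "weighted": ["IFEval"],
        "display_only": [],
    },
    "Math": {
        "weighted": ["AIME 2025", "BRUMO 2025", "MATH-500"],
        "display_only": ["AIME 2023-2024", "HMMT 2023-2025"],
    },
}


def scoring_benchmarks(category: str) -> list[str]:
    """Benchmarks that actually count toward a category's average."""
    return BENCHMARKS[category]["weighted"]
```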

Data sourced from OpenBench, official model papers, and public leaderboards. Chatbot Arena Elo is tracked separately and not included in the overall score.