135 models tracked · 51 benchmarks · Updated regularly

Compare the World's Best AI Model Benchmarks.

Performance data across agentic, coding, reasoning, knowledge, and multimodal workflows — curated, transparent, and reproducible.

135 models
Score confidence: Full (7+ verified categories) · Good (5+ categories) · Limited (3+ categories) · Est. (unverified)
Rank | Model | Org | Access | Context | Overall | Agentic / Coding / Reasoning / Multimodal / Knowledge / Multilingual / Instruction / Math | Arena Elo
1 | Gemini 3.1 Pro | Google | Closed | 1M | 83 | 76 / 72 / 88 / 95 / 81 / 94 / 95 / 97 | 1500
2 | GPT-5.4 | OpenAI | Closed | 1.05M | 80 | 77 / 73 / 90 / 88 / 83 / 95 / 96 / 98 | 1480
3 | Claude Opus 4.6 | Anthropic | Closed | 1M | 76 | 73 / 76 / 82 / 85 / 78 / 95 / 95 / 97 | 1504
4 | Claude Sonnet 4.6 | Anthropic | Closed | 200K | 76 | 68 / 62 / 78 / 92 / 73 / 90 / 90 / 97 | 1438
5 | GPT-5.4 Pro | OpenAI | Closed | 1.05M | 76 | 88 / 87 / 96 / 95 / 85 / 96 / 97 / 98 | 1472
6 | Gemini 3 Pro Deep Think | Google | Closed | 2M | 70 | 78 / 60 / 82 / 95 / 76 / 87 / 89 / 96 | 1349
7 | GPT-5.3 Codex | OpenAI | Closed | 400K | 70 | 76 / 73 / 93 / 91 / 82 / 93 / 93 / 98 | 1416
8 | o3-mini | OpenAI | Closed | 200K | 70 | 67 / 55 / 81 / 74 / 71 / 73 / 94 (partial coverage) | n/a
9 | Qwen2.5-1M | Alibaba | Open | 1M | 67 | 65 / 45 / 81 / 68 / 62 / 80 / 84 / 85 | 1256
10 | Grok 4 | xAI | Closed | 128K | 67 | 58 / 65 / 60 / 78 / 65 / 81 / 82 / 86 | 1238
11 | GPT-5.2 | OpenAI | Closed | 400K | 67 | 66 / 69 / 82 / 95 / 80 / 92 / 94 / 97 | 1481
12 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | 128K | 67 | 58 / 77 / 74 / 78 / 68 / 90 / 94 / 94 | 1447
13 | GPT-4.1 | OpenAI | Closed | 1M | 67 | 65 / 52 / 81 / 74 / 63 / 69 / 87 (partial coverage) | n/a
14 | o1 | OpenAI | Closed | 200K | 67 | 65 / 48 / 78 / 71 / 69 / 77 / 92 (partial coverage) | n/a
15 | Grok 4.1 | xAI | Closed | 1M | 67 | 78 / 74 / 91 / 93 / 81 / 93 / 93 / 97 | 1473
16 | DeepSeek V3.2 (Thinking) | DeepSeek | Open | 128K | 66 | 69 / 51 / 60 / 71 / 66 / 81 / 85 / 86 | 1421
17 | DeepSeek Coder 2.0 | DeepSeek | Open | 128K | 66 | 68 / 53 / 73 / 59 / 61 / 80 / 86 / 81 | 1238
18 | o3 | OpenAI | Closed | 200K | 63 | 70 / 52 / 62 / 72 / 67 / 81 / 85 / 88 | 1258
19 | Nemotron 3 Ultra 500B | NVIDIA | Open | 10M | 63 | 63 / 44 / 79 / 67 / 58 / 80 / 84 / 77 | 1252
20 | Qwen3.5 397B (Reasoning) | Alibaba | Open | 128K | 63 | 75 / 62 / 82 / 71 / 72 / 88 / 89 / 93 | 1450
21 | Gemini 3 Flash | Google | Closed | 1M | 62 | 58 / 41 / 73 / 80 / 54 / 81 / 85 / 73 | 1473
22 | Claude Opus 4.5 | Anthropic | Closed | 200K | 62 | 66 / 63 / 73 / 91 / 72 / 86 / 90 / 95 | 1349
23 | Claude Haiku 4.5 | Anthropic | Closed | 200K | 61 | 52 / 46 / 69 / 78 / 54 / 80 / 86 / 71 | 1263
24 | Claude 4 Sonnet | Anthropic | Closed | 200K | 61 | 58 / 47 / 71 / 80 / 58 / 82 / 83 / 75 | 1239
25 | Qwen3.5 397B | Alibaba | Open | 128K | 60 | 57 / 41 / 73 / 61 / 61 / 79 / 82 / 83 | 1400
Models marked "partial coverage" report seven of the eight category scores and no Arena Elo.
Showing 25 of 135

Weekly LLM Benchmark Digest

Get notified when new models drop, benchmark scores change, or the leaderboard shifts. One email per week.

Free. No spam. Unsubscribe anytime. We only store derived location metadata for consent routing.

Scoring Methodology

Each model's overall score is a normalized weighted average of its category averages. Within each category, benchmark scores are combined using per-benchmark weights that favor harder, less-saturated evaluations, and scores are normalized so that models with partial benchmark coverage are not unfairly penalized. Each score carries a confidence indicator (1-4 dots) showing how much verified data supports it; models with no verified benchmarks are marked as estimated. Display-only benchmarks (MMLU, HumanEval, BBH, FLTEval, and the pre-2025 AIME and HMMT exams; see the per-category lists below) remain visible but are excluded from scoring.
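The confidence legend above the leaderboard reads as a simple threshold function over the number of verified benchmark categories. A minimal Python sketch, assuming the stated cutoffs (Full 7+, Good 5+, Limited 3+, Est. unverified) map onto the 4/3/2/1-dot indicator; how a model with only one or two verified categories is labeled is not stated, so the fallback branch is an assumption:

```python
def confidence_tier(verified_categories: int) -> tuple[str, int]:
    """Map a model's count of verified benchmark categories to (label, dots)."""
    if verified_categories >= 7:
        return ("Full", 4)      # 7+ verified categories
    if verified_categories >= 5:
        return ("Good", 3)      # 5-6 verified categories
    if verified_categories >= 3:
        return ("Limited", 2)   # 3-4 verified categories
    # Assumption: the legend defines "Est." only for fully unverified models
    # and leaves 1-2 verified categories unspecified; they fall through here.
    return ("Est.", 1)
```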

Category weights:
Agentic: 22%
Coding: 20%
Reasoning: 17%
Multimodal & Grounded: 12%
Knowledge: 12%
Multilingual: 7%
Instruction Following: 5%
Math: 5%
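Taken together with the normalization rule above, one plausible reading is that these category weights are re-scaled over whichever categories a model actually covers, so a partially covered model is still scored on the same 0-100 scale. A minimal sketch under that assumption; the site's exact formula is not published:

```python
CATEGORY_WEIGHTS = {
    "Agentic": 0.22, "Coding": 0.20, "Reasoning": 0.17,
    "Multimodal & Grounded": 0.12, "Knowledge": 0.12,
    "Multilingual": 0.07, "Instruction Following": 0.05, "Math": 0.05,
}

def overall_score(category_averages: dict[str, float]) -> float:
    """Weighted average of category averages, with the weights renormalized
    over the categories the model actually has scores for."""
    covered = set(category_averages) & set(CATEGORY_WEIGHTS)
    total_weight = sum(CATEGORY_WEIGHTS[c] for c in covered)
    return sum(CATEGORY_WEIGHTS[c] * category_averages[c] for c in covered) / total_weight

# A model scored in only three categories is still graded on a 0-100 scale:
print(overall_score({"Coding": 80, "Reasoning": 75, "Math": 90}))  # ~79.2
```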

Agentic

Weighted: Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified

Coding

Weighted: SWE-bench Pro · LiveCodeBench · SWE-bench Verified. Display-only: HumanEval · FLTEval

Reasoning

Weighted: LongBench v2 · ARC-AGI-2 · MRCRv2 · MuSR. Display-only: BBH

Multimodal & Grounded

Weighted: MMMU-Pro · OfficeQA Pro

Knowledge

Weighted: GPQA · SuperGPQA · MMLU-Pro · HLE · FrontierScience · SimpleQA. Display-only: MMLU

Multilingual

Weighted: MGSM · MMLU-ProX

Instruction Following

Weighted: IFEval

Math

Weighted: AIME 2025 · BRUMO 2025 · MATH-500. Display-only: AIME 2023-2024 · HMMT 2023-2025
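The same pattern applies one level down: within a category, weighted benchmarks are combined by per-benchmark weights while display-only ones are skipped. A minimal sketch using the Coding category; the weight values here are illustrative assumptions, since the actual per-benchmark weights (said only to favor harder, less-saturated evaluations) are not published:

```python
# Hypothetical per-benchmark weights favoring harder, less-saturated evals.
CODING_WEIGHTS = {
    "SWE-bench Pro": 0.45,
    "LiveCodeBench": 0.35,
    "SWE-bench Verified": 0.20,
}
DISPLAY_ONLY = {"HumanEval", "FLTEval"}  # shown on the page, never scored

def category_average(results: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean over the scored benchmarks a model has results for;
    display-only benchmarks and unknown entries are ignored."""
    scored = {b: s for b, s in results.items()
              if b in weights and b not in DISPLAY_ONLY}
    total = sum(weights[b] for b in scored)
    return sum(weights[b] * s for b, s in scored.items()) / total

print(category_average(
    {"SWE-bench Pro": 70, "LiveCodeBench": 82, "HumanEval": 99},
    CODING_WEIGHTS,
))  # HumanEval is displayed but ignored -> (0.45*70 + 0.35*82) / 0.80 = 75.25
```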

Data sourced from OpenBench, official model papers, and public leaderboards. Chatbot Arena Elo is tracked separately and not included in the overall score.