161 models · 54 benchmarks

Compare the World's Best AI Models

161 LLMs ranked across 54 benchmarks — agentic, coding, reasoning, knowledge, and multimodal workflows. Curated, transparent, reproducible.

The BenchLM LLM leaderboard 2026 ranks 161+ large language models side by side across 54 benchmarks, from SWE-bench and LiveCodeBench for coding to GPQA Diamond and MMLU-Pro for knowledge and reasoning. Whether you need the best AI models 2026 has to offer for agentic workflows, math, multilingual tasks, or instruction following, our AI benchmark comparison tables make it easy to see how GPT-5, Claude, Gemini, DeepSeek, Llama, and dozens of other frontier and open-source models stack up. Every score is sourced from published results, updated regularly, and linked to its methodology so you can verify the data yourself.

161 models
Score confidence: Full (7+ verified categories) · Good (5+ categories) · Limited (3+ categories) · Est. (unverified)
Category scores appear in the order Agentic · Coding · Reasoning · Multimodal & Grounded · Knowledge · Multilingual · Instruction Following · Math; rows with fewer than eight entries have partial benchmark coverage.

Rank  Model                     Org          Access  Ctx    Overall  Category scores                         Arena Elo
1     GPT-5.4 Pro               OpenAI       Closed  1.05M  87       88 · 87 · 96 · 95 · 85 · 96 · 97 · 98  1484.03 C · M
2     GPT-5.4                   OpenAI       Closed  1.05M  84       77 · 74 · 90 · 88 · 83 · 95 · 96 · 98  1465.6 C · HP
3     Gemini 3.1 Pro            Google       Closed  1M     83       76 · 69 · 88 · 95 · 81 · 94 · 95 · 97  1493.17 C · HP
4     Claude Opus 4.6           Anthropic    Closed  1M     80       73 · 73 · 82 · 85 · 78 · 95 · 95 · 97  1499.5 C · HP
5     GPT-5.3 Codex             OpenAI       Closed  400K   80       76 · 69 · 93 · 91 · 82 · 93 · 93 · 98  1416
6     Gemini 3 Pro Deep Think   Google       Closed  2M     79       78 · 60 · 82 · 95 · 76 · 87 · 89 · 96  1486.39 C · HP
7     GPT-5.2                   OpenAI       Closed  400K   77       66 · 70 · 82 · 95 · 80 · 92 · 94 · 97  1440.15 C · HP
8     Claude Sonnet 4.6         Anthropic    Closed  200K   76       68 · 63 · 78 · 92 · 73 · 90 · 90 · 97  1462.64 C · HP
9     Qwen3.5 397B (Reasoning)  Alibaba      Open    128K   72       75 · 61 · 82 · 71 · 72 · 88 · 89 · 93  1450
10    Kimi K2.5 (Reasoning)     Moonshot AI  Closed  128K   71       58 · 70 · 74 · 78 · 68 · 90 · 94 · 94  1447
11    GLM-5 (Reasoning)         Zhipu AI     Open    200K   71       78 · 62 · 87 · 79 · 74 · 86 · 92 · 96  1455.62 C · HP
12    o3-mini                   OpenAI       Closed  200K   70       67 · 54 · 81 · 74 · 71 · 73 · 94       1347.52 C · HP
13    GLM-4.7                   Zhipu AI     Open    200K   69       51 · 69 · 79 · 71 · 63 · 84 · 88 · 89  1442.9 C · HP
14    o3                        OpenAI       Closed  200K   68       70 · 54 · 62 · 72 · 67 · 81 · 85 · 88  1258
15    Grok 4                    xAI          Closed  128K   67       58 · 66 · 60 · 78 · 65 · 81 · 82 · 86  1410
16    Qwen2.5-1M                Alibaba      Open    1M     67       65 · 45 · 81 · 68 · 62 · 80 · 84 · 85  1256
17    GPT-4.1                   OpenAI       Closed  1M     67       65 · 52 · 81 · 74 · 63 · 69 · 87       1413
18    o1                        OpenAI       Closed  200K   67       65 · 47 · 78 · 71 · 69 · 77 · 92       -
19    DeepSeek V3.2 (Thinking)  DeepSeek     Open    128K   66       69 · 51 · 60 · 71 · 66 · 81 · 85 · 86  1421.84 C · HP
20    DeepSeek Coder 2.0        DeepSeek     Open    128K   66       68 · 53 · 73 · 59 · 61 · 80 · 86 · 81  1264
21    Nemotron 3 Ultra 500B     NVIDIA       Open    10M    65       63 · 44 · 79 · 67 · 58 · 80 · 84 · 77  1252
22    Claude 4 Sonnet           Anthropic    Closed  200K   65       58 · 49 · 71 · 80 · 58 · 82 · 83 · 75  1239
23    Kimi K2                   Moonshot AI  Closed  128K   65       58 · 64 · 90 · 68                      1051
24    Gemini 3 Flash            Google       Closed  1M     64       58 · 45 · 73 · 80 · 54 · 81 · 85 · 73  1474.42 C · HP
25    GLM-5                     Zhipu AI     Open    200K   64       58 · 58 · 77 · 69 · 70 · 82 · 85 · 92  1455.62 C · HP
Showing 25 of 161

Weekly LLM Benchmark Digest

Get notified when new models drop, benchmark scores change, or the leaderboard shifts. One email per week.

Free. No spam. Unsubscribe anytime. We only store derived location metadata for consent routing.

Scoring Methodology

Each model's overall score is a normalized weighted average of category averages. Within each category, benchmark scores are combined using per-benchmark weights that favor harder, less-saturated evaluations. Scores are normalized so models with partial benchmark coverage are not unfairly penalized. Each score includes a confidence indicator (1-4 dots) showing how much verified data supports it; models with no verified benchmarks are marked as estimated. Display-only benchmarks like MMLU, HumanEval, BBH, FLTEval, and the older AIME/HMMT exams remain visible but are excluded from scoring. The sketch after the weight breakdown below illustrates the computation.

Agentic: 22%
Coding: 20%
Reasoning: 17%
Multimodal & Grounded: 12%
Knowledge: 12%
Multilingual: 7%
Instruction Following: 5%
Math: 5%
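
Read together, the overall score reduces to a weighted mean over whichever categories a model has verified scores for. The minimal Python sketch below shows that computation; the renormalization rule (dividing by the summed weight of the categories actually present) is our reading of "partial coverage is not unfairly penalized", the function names are illustrative, and per-benchmark weighting inside each category is omitted.

    # Sketch of the overall-score computation. Category weights are the
    # published ones; the renormalization rule is a simplified reading,
    # and within-category benchmark weights are omitted.

    CATEGORY_WEIGHTS = {
        "Agentic": 0.22, "Coding": 0.20, "Reasoning": 0.17,
        "Multimodal & Grounded": 0.12, "Knowledge": 0.12,
        "Multilingual": 0.07, "Instruction Following": 0.05, "Math": 0.05,
    }

    def overall_score(category_scores: dict[str, float]) -> float:
        """Weighted mean over the categories present, so a missing
        category redistributes its weight instead of counting as zero."""
        weight = sum(CATEGORY_WEIGHTS[c] for c in category_scores)
        total = sum(CATEGORY_WEIGHTS[c] * s for c, s in category_scores.items())
        return total / weight

    def confidence_label(verified_categories: int) -> str:
        """Confidence tiers from the table legend; 1-2 verified categories
        fall back to estimated, which the legend leaves unspecified."""
        if verified_categories >= 7:
            return "Full"
        if verified_categories >= 5:
            return "Good"
        if verified_categories >= 3:
            return "Limited"
        return "Est."

    # Example: GPT-5.4's eight category scores from the table above.
    gpt_5_4 = dict(zip(CATEGORY_WEIGHTS, [77, 74, 90, 88, 83, 95, 96, 98]))
    print(round(overall_score(gpt_5_4), 2))  # 83.91, rounding to the table's 84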

Agentic

Weighted: Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified

Coding

Weighted: SWE-bench Pro · LiveCodeBench · SWE-bench Verified. Display-only: HumanEval · FLTEval

Reasoning

Weighted: LongBench v2 · ARC-AGI-2 · MRCRv2 · MuSR. Display-only: BBH

Multimodal & Grounded

Weighted: MMMU-Pro · OfficeQA Pro

Knowledge

Weighted: GPQA · SuperGPQA · MMLU-Pro · HLE · FrontierScience · SimpleQA. Display-only: MMLU

Multilingual

Weighted: MGSM · MMLU-ProX

Instruction Following

Weighted: IFEval

Math

Weighted: AIME 2025 · BRUMO 2025 · MATH-500. Display-only: AIME 2023-2024 · HMMT 2023-2025
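
The category definitions above amount to a scoring configuration: each category carries a weighted benchmark list that feeds the score and a display-only list that does not. The encoding below is illustrative only; the weighted/display-only split comes from the lists above, while the data layout is our own and the per-benchmark weights are not published.

    # Illustrative category -> benchmark configuration. Each entry pairs
    # the weighted (scored) benchmarks with the display-only ones.
    BENCHMARKS = {
        "Agentic":               (["Terminal-Bench 2.0", "BrowseComp", "OSWorld-Verified"], []),
        "Coding":                (["SWE-bench Pro", "LiveCodeBench", "SWE-bench Verified"],
                                  ["HumanEval", "FLTEval"]),
        "Reasoning":             (["LongBench v2", "ARC-AGI-2", "MRCRv2", "MuSR"], ["BBH"]),
        "Multimodal & Grounded": (["MMMU-Pro", "OfficeQA Pro"], []),
        "Knowledge":             (["GPQA", "SuperGPQA", "MMLU-Pro", "HLE",
                                   "FrontierScience", "SimpleQA"], ["MMLU"]),
        "Multilingual":          (["MGSM", "MMLU-ProX"], []),
        "Instruction Following": (["IFEval"], []),
        "Math":                  (["AIME 2025", "BRUMO 2025", "MATH-500"],
                                  ["AIME 2023-2024", "HMMT 2023-2025"]),
    }

    def scoring_benchmarks() -> set[str]:
        """Every benchmark that contributes to the overall score."""
        return {b for weighted, _ in BENCHMARKS.values() for b in weighted}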

Data sourced from OpenBench, official model papers, and public leaderboards. Chatbot Arena Elo is tracked separately and not included in the overall score.