148 models tracked · 52 benchmarks · Updated regularly

Compare the World's Best AI Models
148 LLMs, 52 Benchmarks

Performance data across agentic, coding, reasoning, knowledge, and multimodal workflows — curated, transparent, and reproducible.

The BenchLM LLM leaderboard 2026 ranks 148 large language models side by side across 52 benchmarks — from SWE-bench and LiveCodeBench for coding to GPQA Diamond and MMLU-Pro for knowledge and reasoning. Whether you need the best AI models 2026 has to offer for agentic workflows, math, multilingual tasks, or instruction following, our AI benchmark comparison tables make it easy to see how GPT-5, Claude, Gemini, DeepSeek, Llama, and dozens of other frontier and open-source models stack up. Every score is sourced from published results, updated regularly, and linked to its methodology so you can verify the data yourself. Use the filters below to sort by category, compare any two models head to head, or explore pricing and context-window details — all in one place. New models and benchmark results are added as they launch, so the rankings always reflect the latest state of the AI model landscape.

148 models
Score confidence: Full (7+ verified categories) · Good (5+ categories) · Limited (3+ categories) · Est. (unverified)
| # | Model | Org | Access | Context | Overall | Category scores | Arena Elo |
|---|-------|-----|--------|---------|---------|-----------------|-----------|
| 1 | GPT-5.4 Pro | OpenAI | Closed | 1.05M | 87 | 88 87 96 95 85 96 97 98 | 1472 |
| 2 | GPT-5.4 | OpenAI | Closed | 1.05M | 84 | 77 74 90 88 83 95 96 98 | 1480 |
| 3 | Gemini 3.1 Pro | Google | Closed | 1M | 83 | 76 69 88 95 81 94 95 97 | 1500 |
| 4 | Claude Opus 4.6 | Anthropic | Closed | 1M | 80 | 73 73 82 85 78 95 95 97 | 1504 |
| 5 | GPT-5.3 Codex | OpenAI | Closed | 400K | 80 | 76 69 93 91 82 93 93 98 | 1416 |
| 6 | Gemini 3 Pro Deep Think | Google | Closed | 2M | 79 | 78 60 82 95 76 87 89 96 | 1349 |
| 7 | GPT-5.2 | OpenAI | Closed | 400K | 77 | 66 70 82 95 80 92 94 97 | 1481 |
| 8 | Claude Sonnet 4.6 | Anthropic | Closed | 200K | 76 | 68 63 78 92 73 90 90 97 | 1438 |
| 9 | Qwen3.5 397B (Reasoning) | Alibaba | Open | 128K | 72 | 75 61 82 71 72 88 89 93 | 1450 |
| 10 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | 128K | 71 | 58 70 74 78 68 90 94 94 | 1447 |
| 11 | GLM-5 (Reasoning) | Zhipu AI | Open | 200K | 71 | 78 62 87 79 74 86 92 96 | 1451 |
| 12 | o3-mini | OpenAI | Closed | 200K | 70 | 67 54 81 74 71 73 94 | — |
| 13 | GLM-4.7 | Zhipu AI | Open | 200K | 69 | 51 69 79 71 63 84 88 89 | 1445 |
| 14 | o3 | OpenAI | Closed | 200K | 68 | 70 54 62 72 67 81 85 88 | 1258 |
| 15 | Qwen2.5-1M | Alibaba | Open | 1M | 67 | 65 45 81 68 62 80 84 85 | 1256 |
| 16 | Grok 4 | xAI | Closed | 128K | 67 | 58 66 60 78 65 81 82 86 | 1238 |
| 17 | GPT-4.1 | OpenAI | Closed | 1M | 67 | 65 52 81 74 63 69 87 | — |
| 18 | o1 | OpenAI | Closed | 200K | 67 | 65 47 78 71 69 77 92 | — |
| 19 | DeepSeek V3.2 (Thinking) | DeepSeek | Open | 128K | 66 | 69 51 60 71 66 81 85 86 | 1421 |
| 20 | DeepSeek Coder 2.0 | DeepSeek | Open | 128K | 66 | 68 53 73 59 61 80 86 81 | 1238 |
| 21 | Nemotron 3 Ultra 500B | NVIDIA | Open | 10M | 65 | 63 44 79 67 58 80 84 77 | 1252 |
| 22 | Claude 4 Sonnet | Anthropic | Closed | 200K | 65 | 58 49 71 80 58 82 83 75 | 1239 |
| 23 | Kimi K2 | Moonshot AI | Closed | 128K | 65 | 58 64 90 68 | 1051 |
| 24 | Gemini 3 Flash | Google | Closed | 1M | 64 | 58 45 73 80 54 81 85 73 | 1473 |
| 25 | GLM-5 | Zhipu AI | Open | 200K | 63 | 58 58 77 69 70 82 85 92 | 1420 |
Showing 25 of 148

Weekly LLM Benchmark Digest

Get notified when new models drop, benchmark scores change, or the leaderboard shifts. One email per week.

Free. No spam. Unsubscribe anytime. We only store derived location metadata for consent routing.

Scoring Methodology

Each model's overall score is a normalized weighted average of category averages. Within each category, benchmark scores are combined using per-benchmark weights that favor harder, less-saturated evaluations. Scores are normalized so models with partial benchmark coverage are not unfairly penalized. Each score includes a confidence indicator (1-4 dots) showing how much verified data supports it — models with no verified benchmarks are marked as estimated. Display-only benchmarks like MMLU, HumanEval, SWE-bench Verified, BBH, FLTEval, and the AIME/HMMT exams remain visible but are excluded from scoring.
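The confidence indicator described above could be sketched as a simple threshold mapping, assuming the tiers match the legend shown above the table (Full 7+, Good 5+, Limited 3+ verified categories); the function name is illustrative, not BenchLM's actual code:

```python
def confidence_tier(verified_categories: int) -> str:
    """Map a model's count of verified benchmark categories to the
    confidence tiers used in the leaderboard legend (an assumption
    based on the stated thresholds)."""
    if verified_categories >= 7:
        return "Full"      # 7+ verified categories
    if verified_categories >= 5:
        return "Good"      # 5+ categories
    if verified_categories >= 3:
        return "Limited"   # 3+ categories
    return "Est."          # no or too few verified benchmarks
```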

Agentic: 22%
Coding: 20%
Reasoning: 17%
Multimodal & Grounded: 12%
Knowledge: 12%
Multilingual: 7%
Instruction Following: 5%
Math: 5%
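The overall-score aggregation described in the methodology could be sketched as follows. The weights come from the list above; the renormalization step for models with partial category coverage is an assumption about how "not unfairly penalized" is implemented, not BenchLM's actual code:

```python
# Category weights as listed in the methodology section.
WEIGHTS = {
    "agentic": 0.22, "coding": 0.20, "reasoning": 0.17,
    "multimodal": 0.12, "knowledge": 0.12, "multilingual": 0.07,
    "instruction_following": 0.05, "math": 0.05,
}

def overall_score(category_averages: dict[str, float]) -> float:
    """Weighted average over the categories a model actually has scores
    for, with weights renormalized so partial coverage does not drag
    the overall score down (assumed normalization scheme)."""
    present = {c: s for c, s in category_averages.items() if c in WEIGHTS}
    total_weight = sum(WEIGHTS[c] for c in present)
    if total_weight == 0:
        raise ValueError("model has no scored categories")
    return sum(WEIGHTS[c] * s for c, s in present.items()) / total_weight
```

With this scheme, a model scored only on agentic (90) and coding (70) averages to (0.22·90 + 0.20·70) / 0.42 ≈ 80.5 rather than being diluted by six missing categories.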

Agentic

Weighted: Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified

Coding

Weighted: SWE-bench Pro · LiveCodeBench · SWE-bench Verified. Display-only: HumanEval · FLTEval

Reasoning

Weighted: LongBench v2 · ARC-AGI-2 · MRCRv2 · MuSR. Display-only: BBH

Multimodal & Grounded

Weighted: MMMU-Pro · OfficeQA Pro

Knowledge

Weighted: GPQA · SuperGPQA · MMLU-Pro · HLE · FrontierScience · SimpleQA. Display-only: MMLU

Multilingual

Weighted: MGSM · MMLU-ProX

Instruction Following

Weighted: IFEval

Math

Weighted: AIME 2025 · BRUMO 2025 · MATH-500. Display-only: AIME 2023-2024 · HMMT 2023-2025

Data sourced from OpenBench, official model papers, and public leaderboards. Chatbot Arena Elo is tracked separately and not included in the overall score.