Compare frontier AI models by quality, cost, and context
106 provisional-ranked models, 11 verified-ranked models, and 188 tracked LLMs. The most comprehensive LLM comparison tool — 150 benchmarks, real pricing, and runtime data in one place.
Explore by Use Case
Additional reporting pages for long context, tool use, web research, computer use, document AI, image understanding, frontend work, and factuality.
Long context: LongBench v2, MRCR, AI-Needle, Graphwalks
Tool use: BFCL, MCP Atlas, Toolathlon, TAU Bench
Web research: BrowseComp, WebArena, WebVoyager
Computer use: OSWorld, ScreenSpot Pro, Vision2Web
Document AI: OfficeQA Pro, OmniDocBench, CC-OCR
Image understanding: MMMU-Pro, AI2D, CountBench, RefCOCO
Frontend work: React Native Evals, Design2Code, Vision2Web
Factuality: SimpleQA, HLE w/o tools, Facts-VLM
The BenchLM LLM leaderboard 2026 provisionally ranks 106+ models and tracks 188+ large language models side by side across 150 benchmarks — from SWE-bench and LiveCodeBench for coding to GPQA Diamond and MMLU-Pro for knowledge and reasoning. Whether you need the best AI models 2026 has to offer for agentic workflows, math, multilingual tasks, or instruction following, our AI benchmark comparison tables make it easy to see how GPT-5, Claude, Gemini, DeepSeek, Llama, and dozens of other frontier and open-source models stack up on both benchmarks and operator tradeoffs like price and context. The main leaderboard now distinguishes provisional ranking from verified ranking so you can see which scores rest on exact-source coverage and which still rely on source-unverified public rows.
Compare models instantly
Decision-ready picks
The fastest way to scan the current BenchLM dataset by outcome instead of just by benchmark.
Claude Mythos Preview · Anthropic · 99 overall score
GLM-5 (Reasoning) · Z.AI · 85 overall score
GLM-5 (Reasoning) · Z.AI · $0.00 avg / 1M tokens
Mercury 2 · Inception · 789 tokens / sec
LFM2-24B-A2B · LiquidAI · 0.42s TTFT
Nemotron 3 Ultra 500B · NVIDIA · 10M context window
Claude Mythos Preview · Anthropic
Provider Podium
Unified Model Leaderboard
Benchmarks, pricing, runtime signals, and context window in one table. Filter state syncs to the URL, so every view is shareable. In provisional-ranked mode, rankings may also draw on source-unverified (but non-generated) public benchmark evidence.
1 | Claude Mythos Preview | Anthropic | Closed | Current | Reasoning | 1M | $25.00 / $125.00 | N/A | N/A | 99 | 100 | 100 | — | 98 | 99 | 100 | 90 | — | —
2 | GPT-5.4 | OpenAI | Closed | Current | Reasoning | 1.05M | $2.50 / $15.00 | 74 | 151.79s | 94 | 94 | 91 | 93 | 88 | 98 | 100 | 94 | 95 | 1465.79
3 | Gemini 3.1 Pro | Google | Closed | Current | Standard | 1M | $1.25 / $5.00 | 109 | 29.71s | 94 | 88 | 94 | 97 | 90 | 96 | 100 | 93 | 71 | 1492.63
4 | Claude Opus 4.6 | Anthropic | Closed | Current | Standard | 1M | $15.00 / $75.00 | 40 | 1.78s | 92 | 93 | 91 | 90 | 84 | 92 | 100 | 95 | 89 | 1496.61
5 | GPT-5.4 Pro | OpenAI | Closed | Current | Reasoning | 1.05M | $30.00 / $180.00 | 74 | 151.79s | 92 | 92 | 93 | 99 | 100 | 60 | — | 94 | 100 | 1483.56
6 | GPT-5.3 Codex | OpenAI | Closed | Current | Reasoning | 400K | $2.50 / $10.00 | 79 | 88.26s | ~89 | 86 | 88 | 95 | 95 | 94 | 100 | 91 | 100 | 1416
7 | Gemini 3 Pro Deep Think | Google | Closed | Current | Reasoning | 2M | N/A | N/A | N/A | ~87 | 88 | 77 | 89 | 100 | 89 | 85 | 83 | 96 | 1486.39
8 | Claude Sonnet 4.6 | Anthropic | Closed | Current | Standard | 200K | $3.00 / $15.00 | 44 | 1.48s | 86 | 85 | 83 | 83 | 95 | 85 | 91 | 82 | 78 | 1462.21
9 | GLM-5 (Reasoning) | Z.AI | Open | Current | Reasoning | 200K | $0.00 / $0.00 | N/A | N/A | ~85 | 86 | 76 | 88 | 73 | 84 | 82 | 81 | 93 | 1455.62
10 | GLM-5.1 | Z.AI | Open | Current | Reasoning | 203K | $1.40 / $4.40 | N/A | N/A | 84 | 83 | 83 | 65 | — | 85 | — | 93 | 89 | 1467.44
11 | GPT-5.2 | OpenAI | Closed | Current | Reasoning | 400K | $2.00 / $8.00 | 73 | 130.34s | ~84 | 66 | 84 | 86 | 86 | 93 | 99 | 86 | 84 | 1439.54
12 | Gemini 3 Pro | Google | Closed | Current | Standard | 2M | N/A | 109 | 32.65s | ~83 | 76 | 75 | 82 | 86 | 84 | 82 | 79 | 84 | 1486.16
13 | Grok 4.1 | xAI | Closed | Superseded | Standard | 1M | $3.00 / $15.00 | N/A | N/A | ~81 | 73 | 69 | 92 | 98 | 95 | 100 | 86 | 92 | 1460.98
14 | Qwen3.5 397B (Reasoning) | Alibaba | Open | Current | Reasoning | 128K | $0.00 / $0.00 | N/A | N/A | ~81 | 77 | 85 | 82 | 59 | 80 | 86 | 82 | 92 | 1450
15 | GPT-5.1 | OpenAI | Closed | Current | Reasoning | 200K | $1.50 / $6.00 | 111 | 57.47s | ~81 | 81 | 81 | 68 | 96 | 84 | 86 | 78 | 70 | 1438.53
16 | Claude Opus 4.5 | Anthropic | Closed | Current | Standard | 200K | N/A | 46 | 1.01s | 80 | 81 | 79 | 70 | 72 | 84 | 84 | 58 | 95 | 1468
17 | GPT-5 (high) | OpenAI | Closed | Established | Reasoning | 128K | N/A | 83 | 36.28s | ~80 | 82 | 73 | 78 | 92 | 81 | 82 | 83 | 72 | 1433.37
18 | GPT-5.2-Codex | OpenAI | Closed | Current | Reasoning | 400K | $2.00 / $8.00 | 123 | 87.34s | ~80 | 84 | 81 | 89 | 89 | 80 | 88 | 93 | 98 | 1331
19 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | Current | Reasoning | 128K | N/A | N/A | N/A | ~79 | 69 | 87 | 70 | 71 | 75 | 90 | 100 | 68 | 1447
20 | GPT-5.1-Codex-Max | OpenAI | Closed | Current | Reasoning | 400K | $2.00 / $8.00 | N/A | N/A | ~79 | 81 | 79 | 90 | 90 | 81 | 86 | 89 | 97 | 1349
21 | Grok 4.20 | xAI | Closed | Current | Reasoning | 2M | $2.00 / $6.00 | 233 | 10.33s | 78 | 64 | 80 | 69 | 68 | — | — | 98 | — | 1490.38
22 | GLM-5 | Z.AI | Open | Superseded | Standard | 200K | $0.00 / $0.00 | 74 | 1.64s | 77 | 73 | 77 | 63 | 56 | 85 | 73 | 81 | 89 | 1455.57
23 | Qwen3.6 Plus | Alibaba | Closed | Current | Reasoning | 1M | $0.00 / $0.00 | N/A | N/A | 77 | 72 | 80 | 44 | 74 | 77 | 82 | 90 | — | —
24 | Gemma 4 31B | Google | Open | Current | Reasoning | 256K | $0.00 / $0.00 | N/A | N/A | ~74 | — | 87 | 55 | 71 | 75 | — | — | — | 1451.16
25 | GPT-5 (medium) | OpenAI | Closed | Established | Reasoning | 128K | N/A | 83 | 36.28s | ~74 | 75 | 81 | 75 | 89 | 76 | 87 | 78 | 92 | 1328
AI models change fast. We track them for you.
For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.
Free. No spam. Unsubscribe anytime.
Scoring Methodology
Each model's overall score is a normalized weighted average of category averages. Within each category, benchmark scores are normalized to a common scale and combined using per-benchmark weights that favor harder, less-saturated evaluations.
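As a rough illustration of that two-step aggregation, here is a minimal sketch in Python. The benchmarks, raw scores, weights, and 0-100 normalization bounds are hypothetical placeholders, not BenchLM's actual values.

```python
# Sketch of the scoring aggregation described above.
# All numbers and weights below are illustrative placeholders.

def normalize(raw: float, lo: float = 0.0, hi: float = 100.0) -> float:
    """Map a raw benchmark score onto a common 0-100 scale."""
    return 100.0 * (raw - lo) / (hi - lo)

def weighted_mean(pairs: list[tuple[float, float]]) -> float:
    """Weighted average of (value, weight) pairs."""
    total = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total

# Per-benchmark weights favor harder, less-saturated evaluations.
coding = weighted_mean([
    (normalize(74.0), 2.0),   # e.g. SWE-bench Verified (hypothetical score)
    (normalize(88.0), 1.0),   # e.g. LiveCodeBench (hypothetical score)
])
knowledge = weighted_mean([
    (normalize(61.0), 2.0),   # e.g. GPQA Diamond (hypothetical score)
    (normalize(83.0), 1.0),   # e.g. MMLU-Pro (hypothetical score)
])

# The overall score is a weighted average of the category averages.
overall = weighted_mean([(coding, 1.0), (knowledge, 1.0)])
print(round(overall, 1))  # -> 73.5 with these made-up numbers
```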
Each score includes a confidence indicator (1-4 dots) showing how much sourced benchmark data supports it — models with no non-generated benchmark coverage are marked as estimated.
Display-only benchmarks like MMLU, HumanEval, BBH, LisanBench, FLTEval, and the AIME/HMMT exams remain visible but are excluded from scoring.
Data sourced from OpenBench, official model papers, and public leaderboards. External consensus signals are used as bounded calibration inputs but are not exposed in exported data.
Agentic: Terminal-Bench 2.0 · OSWorld-Verified · BrowseComp · GAIA · Tau-Bench · WebArena
Coding: SWE-Rebench · SWE-bench Pro · LiveCodeBench · SWE-bench Verified · SciCode
Reasoning & long context: LongBench v2 · ARC-AGI-2 · MRCRv2 · MuSR
Multimodal: MMMU-Pro · OfficeQA Pro
Knowledge: HLE · MMLU-Pro · FrontierScience · SimpleQA · GPQA · SuperGPQA
Multilingual: MMLU-ProX · MGSM
Instruction following: IFEval · IFBench
Math: FrontierMath · AIME 2025 · BRUMO 2025 · MATH-500