272 models · 251 benchmarks

Compare frontier AI models by quality, cost, and context

BenchLM covers 124 provisionally ranked models, 33 verified-ranked models, and 272 tracked LLMs, with 251 benchmarks, real pricing, and runtime data in one place.

The top-ranked model on the BenchLM leaderboard is Claude Mythos 5 with an overall score of 89.

Last verified: July 4, 2026

The BenchLM LLM leaderboard for 2026 provisionally ranks 124+ models and tracks 272+ large language models side by side across 251 benchmarks, from SWE-bench and LiveCodeBench for coding to GPQA Diamond and MMLU-Pro for knowledge and reasoning. Whether you need the best model for agentic workflows, math, multilingual tasks, or instruction following, the comparison tables show how GPT-5, Claude, Gemini, DeepSeek, Llama, and dozens of other frontier and open-source models stack up on benchmarks and on operator tradeoffs such as price and context window. The main leaderboard now distinguishes provisional ranking from verified ranking, so you can see which scores rest on exact-source coverage and which still rely on source-unverified public rows.

Decision-ready picks

The fastest way to scan the current BenchLM dataset by outcome instead of just by benchmark.

The AI Race
Current Crown (model released this month)

Claude Mythos 5

Anthropic

89

Provider Podium

1st: Anthropic, 88
2nd: Google, 85.7
3rd: OpenAI, 84.3

6 months tracked · 146 total releases · 5 crown changes

Unified Model Leaderboard

Benchmarks, pricing, runtime signals, and context window in one table. Filter state syncs to the URL so every view is shareable. Provisional-ranked mode includes source-unverified non-generated benchmark evidence.
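
One way to picture URL-synced filter state is a round trip between a filter dictionary and a query string. The sketch below is only an illustration: the base URL and the parameter names (`provider`, `license`, `type`) are hypothetical and are not taken from BenchLM.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Placeholder address, not the real leaderboard URL.
BASE_URL = "https://example.com/leaderboard"

def filters_to_url(filters: dict) -> str:
    """Serialize a filter state (key -> list of values) into a shareable URL."""
    # Sorting the keys makes the same filter state always produce the same URL.
    query = urlencode(sorted(filters.items()), doseq=True)
    return f"{BASE_URL}?{query}" if query else BASE_URL

def url_to_filters(url: str) -> dict:
    """Recover the filter state from a shared URL."""
    # parse_qs returns a list per key, which preserves multi-value filters.
    return parse_qs(urlsplit(url).query)

# Hypothetical filter names, chosen only for this example.
state = {"provider": ["Anthropic", "Google"], "license": ["Closed"], "type": ["Reasoning"]}
url = filters_to_url(state)
restored = url_to_filters(url)
```

Because the URL carries the whole filter state, anyone opening it would land on the same view without any server-side session.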

272 models

Score confidence: Full sourced coverage · Good sourced coverage · Limited sourced coverage · Estimated
1. Anthropic, Claude Mythos 5 · Closed · Current · Reasoning · 1M+ · $10.00 / $50.00 · N/A · N/A · 89 · 100 91 92 87 94 82
2. Anthropic · Closed · Current · Reasoning · 1M+ · $10.00 / $50.00 · N/A · N/A · 89 · 89 95 79 88 94 85 · 1507.59
3. Google · Closed · Current · Standard · 1M · $2.00 / $12.00 · 10929.71s · 88 · 67 93 96 84 93 100 93 68 · 1486.42
4. Google · Closed · Current · Reasoning · 2M · N/A · N/A · N/A · ~88 · 87 74 89 96 87 82 80 87 · 1486.39
5. xAI · Closed · Superseded · Standard · 1M · N/A · N/A · N/A · ~87 · 76 82 92 94 93 97 91 90 · 1459.55
6. OpenAI, GPT-5.4 · Closed · Superseded · Reasoning · 1.05M · $2.50 / $15.00 · 74151.79s · 86 · 83 86 89 60 97 100 96 94 · 1467.72
7. Anthropic · Closed · Superseded · Standard · 1M · $5.00 / $25.00 · 401.78s · 86 · 80 85 88 77 90 100 95 86 · 1498.76
8. Anthropic · Closed · Current · Reasoning · 1M · $5.00 / $25.00 · N/A · N/A · 85 · 92 91 65 88 · 1477.85
9. Alibaba · Closed · Current · Reasoning · 1M · N/A · N/A · N/A · 84 · 79 90 80 81 81 93 · 1474.65
10. OpenAI · Closed · Current · Reasoning · 1.05M · $30.00 / $180.00 · 74151.79s · 84 · 83 82 83 89 63 85 85 · 1478.02
11. OpenAI · Closed · Current · Reasoning · 400K · $1.75 / $14.00 · 7988.26s · ~83 · 75 87 88 92 91 97 91 90 · 1416
12. Google · Closed · Current · Reasoning · 1M · $1.50 / $9.00 · 284.218.55s · 81 · 88 76 74 76 75 74 · 1476.25
13. DeepSeek · Open · Current · Reasoning · 1M · $1.74 / $3.48 · N/A · N/A · 80 · 80 87 73
14. Alibaba · Closed · Current · Reasoning · 1M · N/A · N/A · N/A · 80 · 79 86 82 73 75 77 96
15. Anthropic · Closed · Superseded · Standard · 200K · $3.00 / $15.00 · 441.48s · 80 · 72 78 82 86 82 88 81 77 · 1471.74
16. Google · Closed · Established · Standard · 2M · $2.00 / $12.00 · 10932.65s · 80 · 71 73 83 81 83 79 80 80 · 1485.74
17. Z.AI · Open · Current · Reasoning · 1M · $1.40 / $4.40 · N/A · N/A · 80 · 71 84 84
18. OpenAI, GPT-5.2 · Closed · Established · Reasoning · 400K · $1.75 / $14.00 · 73130.34s · 79 · 60 78 84 82 91 96 85 81 · 1434.64
19. Z.AI, GLM-5 (Reasoning), Self-host · Open · Current · Reasoning · 200K · $1.00 / $3.20 · N/A · N/A · ~79 · 77 70 83 71 82 79 81 84 · 1455.62
20. OpenAI · Closed · Superseded · Reasoning · 200K · $15.00 / $60.00 · N/A · N/A · ~79 · 79 78 84 67 81 82 76 85 · 1388.29
21. OpenAI, GPT-5.5 · Closed · Superseded · Reasoning · 1M · $5.00 / $30.00 · N/A · N/A · 78 · 88 74 83 58 83 87 · 1474.85
22. DeepSeek · Open · Current · Reasoning · 1M · $1.74 / $3.48 · N/A · N/A · 76 · 76 83 68
23. OpenAI, GPT-5.1 · Closed · Established · Reasoning · 200K · $1.25 / $10.00 · 11157.47s · ~76 · 72 75 67 92 82 83 74 68 · 1438.78
24. Anthropic · Closed · Superseded · Reasoning · 1M · $5.00 / $25.00 · N/A · N/A · 75 · 78 84 77 49 85 70 · 1502.38
25. Anthropic · Closed · Established · Standard · 200K · $5.00 / $25.00 · 461.01s · 75 · 72 75 70 61 82 81 63 86 · 1468
Showing 25 of 272

Scoring Methodology

Each model's overall score is a normalized weighted average of category averages. Within each category, benchmark scores are normalized to a common scale and combined using per-benchmark weights that favor harder, less-saturated evaluations.

Each score includes a confidence indicator (1-4 dots) showing how much sourced benchmark data supports it — models with no non-generated benchmark coverage are marked as estimated.

Display-only benchmarks like MMLU, HumanEval, BBH, LisanBench, FLTEval, and the AIME/HMMT exams remain visible but are excluded from scoring.
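
The within-category step can be sketched as a weighted mean over scored benchmarks only. This is a minimal illustration under stated assumptions, not BenchLM's actual code: the per-benchmark weights and the example scores are hypothetical, and scores are assumed to be already normalized to a 0-100 scale.

```python
# Display-only benchmarks named in the methodology text; they stay visible
# but do not contribute to scoring. (The AIME/HMMT exams are omitted here
# for brevity.)
DISPLAY_ONLY = {"MMLU", "HumanEval", "BBH", "LisanBench", "FLTEval"}

def category_average(scores: dict, weights: dict):
    """Weighted mean of normalized (0-100) benchmark scores in one category.

    Returns None when no scored benchmark is available.
    """
    scored = {
        name: value
        for name, value in scores.items()
        if name not in DISPLAY_ONLY and name in weights
    }
    total_weight = sum(weights[name] for name in scored)
    if total_weight == 0:
        return None
    return sum(value * weights[name] for name, value in scored.items()) / total_weight

# Hypothetical numbers: the harder benchmark gets the larger weight.
scores = {"LiveCodeBench": 80.0, "SWE-bench Verified": 70.0, "HumanEval": 95.0}
weights = {"LiveCodeBench": 2.0, "SWE-bench Verified": 1.0, "HumanEval": 1.0}
coding_average = category_average(scores, weights)
```

In this example HumanEval is dropped before averaging, so only the two scored benchmarks contribute: (80 × 2 + 70 × 1) / 3.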

Data sourced from OpenBench, official model papers, and public leaderboards. External consensus signals are used as bounded calibration inputs but are not exposed in exported data.

Agentic: 22%

Terminal-Bench 2.0 · OSWorld-Verified · BrowseComp · GAIA · Tau-Bench · WebArena

Coding: 20%

SWE-Rebench · SWE-bench Pro · LiveCodeBench · SWE-bench Verified · SciCode

Reasoning: 17%

LongBench v2 · ARC-AGI-2 · MRCRv2 · MuSR

Multimodal: 12%

MMMU-Pro · OfficeQA Pro

Knowledge: 12%

HLE · MMLU-Pro · FrontierScience · SimpleQA · GPQA · SuperGPQA

Multilingual: 7%

MMLU-ProX · MGSM

Instruction Following: 5%

IFEval · IFBench

Math: 5%

FrontierMath · AIME 2025 · BRUMO 2025 · MATH-500
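
The category weights above sum to 100%, so the overall score can be sketched as a weighted average of category averages. The text does not say how missing categories are handled; this sketch assumes the weights are renormalized over the categories a model actually has, and it leaves out the calibration inputs mentioned in the methodology. The function name and example values are hypothetical.

```python
# Category weights as listed on the page, in percent.
CATEGORY_WEIGHTS = {
    "Agentic": 22,
    "Coding": 20,
    "Reasoning": 17,
    "Multimodal": 12,
    "Knowledge": 12,
    "Multilingual": 7,
    "Instruction Following": 5,
    "Math": 5,
}

def overall_score(category_averages: dict):
    """Weighted average of category averages (each on a 0-100 scale).

    Assumption: categories with no data are skipped and the remaining
    weights are renormalized. Returns None if nothing is available.
    """
    present = {
        name: value
        for name, value in category_averages.items()
        if value is not None and name in CATEGORY_WEIGHTS
    }
    total_weight = sum(CATEGORY_WEIGHTS[name] for name in present)
    if total_weight == 0:
        return None
    weighted_sum = sum(CATEGORY_WEIGHTS[name] * value for name, value in present.items())
    return weighted_sum / total_weight

# Hypothetical model with data in only two categories.
partial = overall_score({"Agentic": 90.0, "Coding": 80.0, "Math": None})
# Hypothetical model scoring 80 in every category.
uniform = overall_score({name: 80.0 for name in CATEGORY_WEIGHTS})
```

With only Agentic and Coding present, the example works out to (22 × 90 + 20 × 80) / 42, which shows how a model with sparse coverage can still receive a score under this assumption.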