188 models · 151 benchmarks

Compare frontier AI models by quality, cost, and context

106 provisional-ranked models, 11 verified-ranked models, and 188 tracked LLMs. The most comprehensive LLM comparison tool — 151 benchmarks, real pricing, and runtime data in one place.

The BenchLM LLM leaderboard 2026 provisionally ranks 106+ models and tracks 188+ large language models side by side across 151 benchmarks — from SWE-bench and LiveCodeBench for coding to GPQA Diamond and MMLU-Pro for knowledge and reasoning. Whether you need the best AI models 2026 has to offer for agentic workflows, math, multilingual tasks, or instruction following, our AI benchmark comparison tables make it easy to see how GPT-5, Claude, Gemini, DeepSeek, Llama, and dozens of other frontier and open-source models stack up on both benchmarks and operator tradeoffs like price and context. The main leaderboard now distinguishes provisional ranking from verified ranking so you can see which scores rest on exact-source coverage and which still rely on source-unverified public rows.

Compare models instantly

Decision-ready picks

The fastest way to scan the current BenchLM dataset by outcome instead of just by benchmark.

The AI Race

Current crown (model released this month): Claude Mythos Preview (Anthropic), overall score 81.

Provider podium: 1st Anthropic (81.7) · 2nd OpenAI (81.7) · 3rd Google (81.3)

6 months tracked · 89 total releases · 5 crown changes

Unified Model Leaderboard

Benchmarks, pricing, runtime signals, and context window in one table. Filter state syncs to the URL so every view is shareable. Provisional-ranked mode includes source-unverified non-generated benchmark evidence.
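A minimal sketch of the URL-synced filter state described above, in TypeScript. The filter fields and query-parameter names here are hypothetical illustrations; BenchLM's actual parameters are not documented on this page.

```ts
// Hypothetical filter shape; BenchLM's real field names are assumptions here.
interface LeaderboardFilters {
  provider?: string;   // e.g. "OpenAI"
  open?: boolean;      // open-weights models only
  minContext?: number; // minimum context window, in tokens
}

// Serialize the active filters into a query string so every view is shareable.
function filtersToQuery(f: LeaderboardFilters): string {
  const params = new URLSearchParams();
  if (f.provider) params.set("provider", f.provider);
  if (f.open !== undefined) params.set("open", String(f.open));
  if (f.minContext !== undefined) params.set("minContext", String(f.minContext));
  return params.toString();
}

// Restore filter state from a shared URL's query string.
function queryToFilters(search: string): LeaderboardFilters {
  const params = new URLSearchParams(search);
  return {
    provider: params.get("provider") ?? undefined,
    open: params.has("open") ? params.get("open") === "true" : undefined,
    minContext: params.has("minContext") ? Number(params.get("minContext")) : undefined,
  };
}

// In the browser, something like the following keeps the address bar in sync
// without pushing a new history entry on every filter change:
//   history.replaceState(null, "", "?" + filtersToQuery(current));
```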

188 models

Score confidence legend: Full sourced coverage · Good sourced coverage · Limited sourced coverage · Estimated. Provisional-ranked mode includes source-unverified non-generated benchmark evidence.

[Leaderboard table: each row lists a model's provider, open or closed weights, lifecycle status (Current, Superseded, or Established), Standard or Reasoning mode, context window, input/output pricing, runtime signals (throughput and latency), per-category benchmark scores, and Arena Elo. The visible rows cover models from Google, Anthropic, OpenAI, xAI, Z.AI, Alibaba, and Moonshot AI, including GPT-5.4, GPT-5.2, GPT-5.1, and GLM-5.]

Showing 25 of 188
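The four confidence labels in the legend correspond to how much of a model's score rests on sourced benchmark rows. The thresholds in the sketch below are illustrative assumptions; BenchLM does not publish its exact cutoffs.

```ts
type ScoreConfidence =
  | "Full sourced coverage"
  | "Good sourced coverage"
  | "Limited sourced coverage"
  | "Estimated";

// Map the fraction of a model's score backed by sourced (non-generated)
// benchmark rows to the four legend tiers. Cutoffs are assumed, not published.
function confidenceTier(sourcedFraction: number): ScoreConfidence {
  if (sourcedFraction >= 0.9) return "Full sourced coverage";
  if (sourcedFraction >= 0.6) return "Good sourced coverage";
  if (sourcedFraction > 0) return "Limited sourced coverage";
  return "Estimated"; // no non-generated benchmark coverage at all
}
```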

Weekly LLM Benchmark Digest

Get notified when new models drop, benchmark scores change, or the leaderboard shifts. One email per week.


Scoring Methodology

Each model's overall score is a normalized weighted average of category averages. Within each category, benchmark scores are combined using per-benchmark weights that favor harder, less-saturated evaluations. Scores are normalized so models with partial benchmark coverage are not unfairly penalized. Each score includes a confidence indicator (1-4 dots) showing how much sourced benchmark data supports it; models with no non-generated benchmark coverage are marked as estimated. Display-only benchmarks like MMLU, HumanEval, BBH, LisanBench, FLTEval, and the older AIME and HMMT exams remain visible but are excluded from scoring.

Agentic: 22%
Coding: 20%
Reasoning: 17%
Multimodal & Grounded: 12%
Knowledge: 12%
Multilingual: 7%
Instruction Following: 5%
Math: 5%
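A minimal sketch of this scoring rule in TypeScript, using the category weights above. The category keys, the renormalization over covered categories, and the assumption that per-category averages are computed upstream are all inferred from the description; BenchLM's actual implementation is not published.

```ts
// Published category weights (sum to 1.0). Key names are illustrative.
const CATEGORY_WEIGHTS: Record<string, number> = {
  agentic: 0.22,
  coding: 0.20,
  reasoning: 0.17,
  multimodalGrounded: 0.12,
  knowledge: 0.12,
  multilingual: 0.07,
  instructionFollowing: 0.05,
  math: 0.05,
};

// Overall score: weighted average of the category averages a model has,
// with weights renormalized over the covered categories so partial
// benchmark coverage is not unfairly penalized.
function overallScore(categoryAverages: Record<string, number>): number | null {
  let weighted = 0;
  let coveredWeight = 0;
  for (const [category, weight] of Object.entries(CATEGORY_WEIGHTS)) {
    const avg = categoryAverages[category];
    if (avg !== undefined) {
      weighted += weight * avg;
      coveredWeight += weight;
    }
  }
  // A model with no category coverage has no computable score ("estimated").
  return coveredWeight > 0 ? weighted / coveredWeight : null;
}

// Example: coverage in only coding (80) and math (90) yields
// (0.20 * 80 + 0.05 * 90) / (0.20 + 0.05) = 82, rather than being
// dragged down by the six missing categories.
```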

Agentic

Weighted: Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified

Coding

Weighted: SWE-bench Pro · LiveCodeBench · SWE-bench Verified. Display-only: HumanEval · FLTEval

Reasoning

Weighted: LongBench v2 · ARC-AGI-2 · MRCRv2 · MuSR. Display-only: BBH · LisanBench

Multimodal & Grounded

Weighted: MMMU-Pro · OfficeQA Pro

Knowledge

Weighted: GPQA · SuperGPQA · MMLU-Pro · HLE · FrontierScience · SimpleQA. Display-only: MMLU

Multilingual

Weighted: MGSM · MMLU-ProX

Instruction Following

Weighted: IFEval

Math

Weighted: AIME 2025 · BRUMO 2025 · MATH-500. Display-only: AIME 2023-2024 · HMMT 2023-2025

Data sourced from OpenBench, official model papers, and public leaderboards. Chatbot Arena Elo is tracked separately and not included in the overall score.