183 models · 125 benchmarks

Compare frontier AI models by quality, cost, and context

103 ranked models and 183 tracked LLMs across 125 benchmarks, with BenchLM scoring, pricing, release status, and runtime tradeoffs in one place.

The BenchLM LLM leaderboard 2026 ranks 103 models and tracks 183 large language models side by side across 125 benchmarks, from SWE-bench and LiveCodeBench for coding to GPQA Diamond and MMLU-Pro for knowledge and reasoning. Whether you need the best AI models 2026 has to offer for agentic workflows, math, multilingual tasks, or instruction following, our AI benchmark comparison tables make it easy to see how GPT-5, Claude, Gemini, DeepSeek, Llama, and dozens of other frontier and open-source models stack up on both benchmarks and operator tradeoffs like price and context. Every score is sourced from published results, updated regularly, and linked to its methodology so you can verify the data yourself.

Decision-ready picks

The fastest way to scan the current BenchLM dataset by outcome instead of just by benchmark.

The AI Race
Current Crown (model released this month)

Gemma 4 31B · Google · 73

Provider Podium

1st: OpenAI · 86.3
2nd: Google · 82
3rd: Anthropic · 81.7

6 months tracked · 58 total releases · 5 crown changes

Unified Model Leaderboard

Benchmarks, pricing, runtime signals, and context window in one table. Filter state syncs to the URL so every view is shareable.
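The URL sync itself is a standard browser pattern. A minimal sketch, assuming hypothetical filter keys rather than BenchLM's actual query parameters:

```ts
// Sketch of URL-synced filter state. Filter keys here ("provider",
// "status", ...) are illustrative assumptions, not documented parameters.
type Filters = Record<string, string>;

// Serialize active filters into the query string without a page reload,
// so the current table view can be shared as a plain link.
function syncFiltersToUrl(filters: Filters): void {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(filters)) {
    if (value !== "") params.set(key, value);
  }
  const query = params.toString();
  history.replaceState(null, "", query ? `?${query}` : location.pathname);
}

// Restore filters on page load, so a shared link reproduces the same view.
function readFiltersFromUrl(): Filters {
  return Object.fromEntries(new URLSearchParams(location.search));
}
```

Calling syncFiltersToUrl({ provider: "OpenAI", status: "Current" }) would yield a link ending in ?provider=OpenAI&status=Current that rebuilds the same filtered table when opened.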

[Interactive leaderboard table: top 25 of 183 models shown. Columns cover rank, model, provider, open vs. closed weights, release status (Current / Established / Superseded / Tracked), model type (Reasoning / Standard), context window, input/output pricing, runtime signals (throughput and latency), per-category benchmark scores, and Chatbot Arena Elo, with a per-row score-confidence marker (full verified coverage, good coverage, limited coverage, or estimated). Named rows in this view include GPT-5.4, GPT-5.2, and GPT-5.1 (OpenAI), GLM-5 and GLM-4.7 (Zhipu AI), and Kimi K2.5 (Moonshot AI).]

Weekly LLM Benchmark Digest

Get notified when new models drop, benchmark scores change, or the leaderboard shifts. One email per week.

Free. No spam. Unsubscribe anytime. We only store derived location metadata for consent routing.

Scoring Methodology

Each model's overall score is a normalized weighted average of category averages. Within each category, benchmark scores are combined using per-benchmark weights that favor harder, less-saturated evaluations. Scores are normalized so models with partial benchmark coverage are not unfairly penalized. Each score includes a confidence indicator (1-4 dots) showing how much verified data supports it; models with no verified benchmarks are marked as estimated. Display-only benchmarks like MMLU, HumanEval, BBH, FLTEval, and the AIME/HMMT exams remain visible but are excluded from scoring. A sketch of this two-level computation follows the category weights below.

Agentic · 22%
Coding · 20%
Reasoning · 17%
Multimodal & Grounded · 12%
Knowledge · 12%
Multilingual · 7%
Instruction Following · 5%
Math · 5%
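A minimal sketch of that two-level computation, assuming hypothetical per-benchmark weights, data shapes, and function names; only the category weights above come from the published methodology:

```ts
// Sketch of the scoring rule: benchmark scores -> category averages ->
// overall score. Per-benchmark weights and names here are assumptions.
interface Benchmark { weight: number; displayOnly?: boolean }

const CATEGORY_WEIGHTS: Record<string, number> = {
  agentic: 0.22, coding: 0.20, reasoning: 0.17, multimodalGrounded: 0.12,
  knowledge: 0.12, multilingual: 0.07, instructionFollowing: 0.05, math: 0.05,
};

// Level 1 -- category average: weighted mean over the benchmarks the model
// actually has scores for; display-only benchmarks never contribute.
function categoryAverage(
  scores: Record<string, number>,          // benchmark name -> score (0-100)
  benchmarks: Record<string, Benchmark>,   // benchmark name -> weight config
): number | null {
  let num = 0, den = 0;
  for (const [name, b] of Object.entries(benchmarks)) {
    const s = scores[name];
    if (b.displayOnly || s === undefined) continue;
    num += b.weight * s;
    den += b.weight;
  }
  return den > 0 ? num / den : null;       // null = no coverage in category
}

// Level 2 -- overall score: category weights are renormalized over covered
// categories, so partial coverage shifts weight to the remaining categories
// instead of dragging the score toward zero.
function overallScore(avgs: Record<string, number | null>): number | null {
  let num = 0, den = 0;
  for (const [cat, w] of Object.entries(CATEGORY_WEIGHTS)) {
    const a = avgs[cat];
    if (a == null) continue;
    num += w * a;
    den += w;
  }
  return den > 0 ? num / den : null;       // null = nothing verified at all
}
```

Renormalizing the denominator is what keeps a model with no multimodal results, for example, from being scored as if it had earned zero there; the other categories simply absorb that 12% of weight.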

Agentic

Weighted: Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified

Coding

Weighted: SWE-bench Pro · LiveCodeBench · SWE-bench Verified. Display-only: HumanEval · FLTEval

Reasoning

Weighted: LongBench v2 · ARC-AGI-2 · MRCRv2 · MuSR. Display-only: BBH

Multimodal & Grounded

Weighted: MMMU-Pro · OfficeQA Pro

Knowledge

Weighted: GPQA · SuperGPQA · MMLU-Pro · HLE · FrontierScience · SimpleQA. Display-only: MMLU

Multilingual

Weighted: MGSM · MMLU-ProX

Instruction Following

Weighted: IFEval

Math

Weighted: AIME 2025 · BRUMO 2025 · MATH-500. Display-only: AIME 2023-2024 · HMMT 2023-2025

Data sourced from OpenBench, official model papers, and public leaderboards. Chatbot Arena Elo is tracked separately and not included in the overall score.