167 models · 54 benchmarks

Compare frontier AI models by quality, cost, and context

167 LLMs ranked across 54 benchmarks with BenchLM scoring, pricing, release status, and runtime tradeoffs in one place.

The BenchLM LLM leaderboard 2026 ranks 167+ large language models side by side across 54 benchmarks, from SWE-bench and LiveCodeBench for coding to GPQA Diamond and MMLU-Pro for knowledge and reasoning. Whether you need the best AI models 2026 has to offer for agentic workflows, math, multilingual tasks, or instruction following, our AI benchmark comparison tables make it easy to see how GPT-5, Claude, Gemini, DeepSeek, Llama, and dozens of other frontier and open-source models stack up, both on benchmarks and on operator tradeoffs like price and context window. Every score is sourced from published results, updated regularly, and linked to its methodology so you can verify the data yourself.

Decision-ready picks

The fastest way to scan the current BenchLM dataset by outcome instead of just by benchmark.

Unified Model Leaderboard

Benchmarks, pricing, runtime signals, and context window in one table. Filter state syncs to the URL so every view is shareable.
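As a rough illustration of how filter-to-URL syncing like this typically works (the FilterState fields and query parameter names below are assumptions for the sketch, not BenchLM's actual schema):

```ts
// Sketch: serialize leaderboard filters into the query string so any
// filtered view is a shareable link. Field and param names are illustrative.
interface FilterState {
  provider?: string;              // e.g. "OpenAI"
  openness?: "open" | "closed";   // open-weights vs. closed models
  minContext?: number;            // context window floor, in tokens
}

function filtersToQuery(f: FilterState): string {
  const params = new URLSearchParams();
  if (f.provider) params.set("provider", f.provider);
  if (f.openness) params.set("openness", f.openness);
  if (f.minContext) params.set("minContext", String(f.minContext));
  return params.toString();
}

function queryToFilters(search: string): FilterState {
  const params = new URLSearchParams(search);
  const f: FilterState = {};
  const provider = params.get("provider");
  if (provider) f.provider = provider;
  const openness = params.get("openness");
  if (openness === "open" || openness === "closed") f.openness = openness;
  const minContext = params.get("minContext");
  if (minContext) f.minContext = Number(minContext);
  return f;
}

// On each filter change, rewrite the URL without reloading:
//   history.replaceState(null, "", `?${filtersToQuery(current)}`);
// On page load, restore state with queryToFilters(location.search).
```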

167 models

Score confidence: Full verified coverage · Good coverage · Limited coverage · Estimated

[Interactive leaderboard table: the top 25 of 167 models, one row per model, showing provider, open/closed weights, release status (Current, Tracked, Established, or Superseded), Standard vs. Reasoning mode, context window, input/output pricing, runtime signals, per-category benchmark scores, and Arena Elo. Providers in the visible rows include OpenAI, Google, Anthropic, xAI, Zhipu AI, Alibaba, and Moonshot AI; named entries include GPT-5.1, GPT-5.2, GPT-5.4, GLM-5, GLM-4.7, and Kimi K2.5.]

Showing 25 of 167

Weekly LLM Benchmark Digest

Get notified when new models drop, benchmark scores change, or the leaderboard shifts. One email per week.

Free. No spam. Unsubscribe anytime. We only store derived location metadata for consent routing.

Scoring Methodology

Each model's overall score is a normalized weighted average of category averages. Within each category, benchmark scores are combined using per-benchmark weights that favor harder, less-saturated evaluations. Scores are normalized so models with partial benchmark coverage are not unfairly penalized. Each score carries a confidence indicator (1-4 dots) showing how much verified data supports it; models with no verified benchmarks are marked as estimated. Display-only benchmarks such as MMLU, HumanEval, BBH, FLTEval, and the older AIME/HMMT exams remain visible but are excluded from scoring.

Agentic: 22%
Coding: 20%
Reasoning: 17%
Multimodal & Grounded: 12%
Knowledge: 12%
Multilingual: 7%
Instruction Following: 5%
Math: 5%
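
As a concrete reading of the methodology above, here is a minimal sketch of a "normalized weighted average of category averages," using the category weights listed on this page. The data shapes, the renormalization rule, and the confidence thresholds are assumptions for illustration, not BenchLM's published implementation:

```ts
// Sketch of the scoring rule described above. Only the category weights
// come from this page; everything else is an illustrative assumption.
const CATEGORY_WEIGHTS: Record<string, number> = {
  agentic: 0.22, coding: 0.20, reasoning: 0.17,
  multimodalGrounded: 0.12, knowledge: 0.12,
  multilingual: 0.07, instructionFollowing: 0.05, math: 0.05,
};

// scores[category][benchmark] = 0-100; missing benchmarks are absent.
type ModelScores = Record<string, Record<string, number>>;
// weights[category][benchmark] = per-benchmark weight (display-only
// benchmarks simply never appear here, so they cannot affect the score).
type BenchmarkWeights = Record<string, Record<string, number>>;

function overallScore(scores: ModelScores, weights: BenchmarkWeights): number {
  let weightedSum = 0;
  let coveredWeight = 0;
  for (const [category, catWeight] of Object.entries(CATEGORY_WEIGHTS)) {
    let num = 0;
    let den = 0;
    const benches = scores[category] ?? {};
    for (const [bench, w] of Object.entries(weights[category] ?? {})) {
      const s = benches[bench];
      if (s !== undefined) {
        num += w * s; // harder, less-saturated evals carry larger w
        den += w;
      }
    }
    if (den === 0) continue; // no verified coverage in this category
    weightedSum += catWeight * (num / den);
    coveredWeight += catWeight;
  }
  // Renormalize over covered categories only, so a model with partial
  // benchmark coverage is not dragged down by categories it lacks data for.
  return coveredWeight > 0 ? weightedSum / coveredWeight : NaN;
}

// One plausible mapping from verified-coverage fraction to the 1-4 dot
// confidence indicator (thresholds are invented for the sketch).
function confidenceDots(verifiedFraction: number): 1 | 2 | 3 | 4 {
  if (verifiedFraction >= 0.9) return 4; // full verified coverage
  if (verifiedFraction >= 0.5) return 3; // good coverage
  if (verifiedFraction > 0) return 2;    // limited coverage
  return 1;                              // estimated
}
```

Dividing by coveredWeight rather than the full weight total is one way to realize the "partial coverage is not unfairly penalized" rule: a model scored in only five of the eight categories competes on the weight mass it actually covers.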

Agentic

Weighted: Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified

Coding

Weighted: SWE-bench Pro · LiveCodeBench · SWE-bench Verified. Display-only: HumanEval · FLTEval

Reasoning

Weighted: LongBench v2 · ARC-AGI-2 · MRCRv2 · MuSR. Display-only: BBH

Multimodal & Grounded

Weighted: MMMU-Pro · OfficeQA Pro

Knowledge

Weighted: GPQA · SuperGPQA · MMLU-Pro · HLE · FrontierScience · SimpleQA. Display-only: MMLU

Multilingual

Weighted: MGSM · MMLU-ProX

Instruction Following

Weighted: IFEval

Math

Weighted: AIME 2025 · BRUMO 2025 · MATH-500. Display-only: AIME 2023-2024 · HMMT 2023-2025

Data sourced from OpenBench, official model papers, and public leaderboards. Chatbot Arena Elo is tracked separately and not included in the overall score.