133 models tracked · 51 benchmarks · Updated regularly

Compare the World's Best AI Model Benchmarks.

Performance data across agentic, coding, reasoning, knowledge, and multimodal workflows — curated, transparent, and reproducible.

133 models
Score confidence: Full (7+ verified categories) · Good (5+ categories) · Limited (3+ categories) · Est. (unverified)
| # | Model | Developer | Access | Context | Overall | Category scores | Arena Elo |
|---|-------|-----------|--------|---------|---------|-----------------|-----------|
| 1 | Gemini 3.1 Pro | Google | Closed | 1M | 83 | 76 · 72 · 88 · 95 · 81 · 94 · 95 · 97 | 1423 |
| 2 | GPT-5.4 | OpenAI | Closed | 1.05M | 79 | 77 · 73 · 90 · 88 · 83 · 95 · 96 · 98 | 1454 |
| 3 | Claude Sonnet 4.6 | Anthropic | Closed | 200K | 76 | 71 · 61 · 78 · 92 · 74 · 90 · 91 · 96 | 1339 |
| 4 | GPT-5.4 Pro | OpenAI | Closed | 1.05M | 76 | 88 · 87 · 96 · 95 · 85 · 96 · 97 · 98 | 1472 |
| 5 | Gemini 3 Pro Deep Think | Google | Closed | 2M | 70 | 78 · 60 · 82 · 95 · 76 · 87 · 89 · 96 | 1349 |
| 6 | GPT-5.3 Codex | OpenAI | Closed | 400K | 70 | 76 · 73 · 93 · 91 · 82 · 93 · 93 · 98 | 1416 |
| 7 | o3-mini | OpenAI | Closed | 200K | 70 | 67 · 55 · 81 · 74 · 71 · 73 · 94 | n/a |
| 8 | Qwen2.5-1M | Alibaba | Open | 1M | 67 | 65 · 45 · 81 · 68 · 62 · 80 · 84 · 85 | 1256 |
| 9 | GPT-4.1 | OpenAI | Closed | 1M | 67 | 65 · 52 · 81 · 74 · 63 · 69 · 87 | n/a |
| 10 | o1 | OpenAI | Closed | 200K | 67 | 65 · 48 · 78 · 71 · 69 · 77 · 92 | n/a |
| 11 | Grok 4.1 | xAI | Closed | 1M | 67 | 78 · 74 · 91 · 93 · 81 · 93 · 93 · 97 | 1435 |
| 12 | Claude Opus 4.6 | Anthropic | Closed | 1M | 67 | 79 · 76 · 86 · 85 · 78 · 95 · 95 · 97 | 1422 |
| 13 | DeepSeek V3.2 (Thinking) | DeepSeek | Open | 128K | 66 | 69 · 51 · 60 · 71 · 66 · 81 · 85 · 86 | 1260 |
| 14 | DeepSeek Coder 2.0 | DeepSeek | Open | 128K | 66 | 68 · 53 · 73 · 59 · 61 · 80 · 86 · 81 | 1238 |
| 15 | GPT-5.2 | OpenAI | Closed | 400K | 66 | 66 · 69 · 82 · 95 · 80 · 92 · 94 · 97 | 1426 |
| 16 | Nemotron 3 Ultra 500B | NVIDIA | Open | 10M | 63 | 63 · 44 · 79 · 67 · 58 · 80 · 84 · 77 | 1252 |
| 17 | Grok 4 | xAI | Closed | 128K | 63 | 58 · 43 · 60 · 78 · 65 · 81 · 82 · 86 | 1238 |
| 18 | Qwen3.5 397B (Reasoning) | Alibaba | Open | 128K | 63 | 75 · 62 · 82 · 71 · 72 · 88 · 89 · 93 | 1326 |
| 19 | Claude Haiku 4.5 | Anthropic | Closed | 200K | 62 | 57 · 42 · 69 · 78 · 54 · 80 · 86 · 71 | 1263 |
| 20 | o3 | OpenAI | Closed | 200K | 62 | 70 · 49 · 62 · 72 · 67 · 81 · 85 · 88 | 1258 |
| 21 | Gemini 3 Flash | Google | Closed | 1M | 62 | 58 · 41 · 73 · 80 · 54 · 81 · 85 · 73 | 1241 |
| 22 | Claude 4 Sonnet | Anthropic | Closed | 200K | 61 | 58 · 43 · 71 · 80 · 58 · 82 · 83 · 75 | 1239 |
| 23 | Nemotron 3 Super 100B | NVIDIA | Open | 1M | 60 | 57 · 42 · 71 · 60 · 53 · 80 · 84 · 70 | 1260 |
| 24 | Qwen3.5 397B | Alibaba | Open | 128K | 60 | 57 · 41 · 73 · 61 · 61 · 79 · 82 · 83 | 1237 |
| 25 | Claude 3.5 Sonnet | Anthropic | Closed | 200K | 60 | 55 · 38 · 68 · 75 · 52 · 81 · 83 · 69 | 1214 |
Showing 25 of 133

Weekly LLM Benchmark Digest

Get notified when new models drop, benchmark scores change, or the leaderboard shifts. One email per week.

Free. No spam. Unsubscribe anytime. We only store derived location metadata for consent routing.

Scoring Methodology

Each model's overall score is a normalized weighted average of category averages. Within each category, benchmark scores are combined using per-benchmark weights that favor harder, less-saturated evaluations. Scores are normalized so that models with partial benchmark coverage are not unfairly penalized. Each score carries a confidence indicator (1-4 dots) showing how much verified data supports it; models with no verified benchmarks are marked as estimated. Display-only benchmarks such as MMLU, HumanEval, BBH, FLTEval, the AIME 2023-2024 exams, and the HMMT exams remain visible but are excluded from scoring.
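
As a rough illustration, the confidence tiers could be derived from the count of verified benchmark categories along these lines. This is a minimal sketch, not the site's actual code; the thresholds follow the legend at the top of the leaderboard, and the function name is hypothetical.

```python
def confidence_tier(verified_categories: int) -> str:
    """Map a model's count of verified benchmark categories to a confidence tier.

    Thresholds follow the leaderboard legend: Full (7+ verified categories),
    Good (5+), Limited (3+); anything below that is treated as an estimate.
    """
    if verified_categories >= 7:
        return "Full"     # 4 dots
    if verified_categories >= 5:
        return "Good"     # 3 dots
    if verified_categories >= 3:
        return "Limited"  # 2 dots
    return "Est."         # 1 dot: unverified, score is an estimate
```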

Agentic: 22%
Coding: 20%
Reasoning: 17%
Multimodal & Grounded: 12%
Knowledge: 12%
Multilingual: 7%
Instruction Following: 5%
Math: 5%
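
To make the normalized weighted average concrete, here is a minimal sketch of how an overall score could be computed from category averages under the weights above. It re-normalizes the weights over the categories a model actually has scores for, which is one plausible reading of "not unfairly penalized"; the function and variable names are illustrative, not the site's implementation.

```python
# Category weights from the list above (they sum to 1.0).
CATEGORY_WEIGHTS = {
    "Agentic": 0.22,
    "Coding": 0.20,
    "Reasoning": 0.17,
    "Multimodal & Grounded": 0.12,
    "Knowledge": 0.12,
    "Multilingual": 0.07,
    "Instruction Following": 0.05,
    "Math": 0.05,
}


def overall_score(category_averages: dict[str, float]) -> float:
    """Weighted average of category averages, re-normalized over the
    categories the model is actually scored in, so missing categories
    are ignored rather than counted as zero."""
    covered = {c: s for c, s in category_averages.items() if c in CATEGORY_WEIGHTS}
    total_weight = sum(CATEGORY_WEIGHTS[c] for c in covered)
    if total_weight == 0:
        raise ValueError("no scored categories")
    return sum(CATEGORY_WEIGHTS[c] * s for c, s in covered.items()) / total_weight


if __name__ == "__main__":
    # e.g. a model verified in only three categories
    print(round(overall_score({"Agentic": 70, "Coding": 60, "Math": 90})))  # 68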

Agentic

Weighted: Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified

Coding

Weighted: SWE-bench Pro · LiveCodeBench · SWE-bench Verified. Display-only: HumanEval · FLTEval

Reasoning

Weighted: LongBench v2 · ARC-AGI-2 · MRCRv2 · MuSR. Display-only: BBH

Multimodal & Grounded

Weighted: MMMU-Pro · OfficeQA Pro

Knowledge

Weighted: GPQA · SuperGPQA · MMLU-Pro · HLE · FrontierScience · SimpleQA. Display-only: MMLU

Multilingual

Weighted: MGSM · MMLU-ProX

Instruction Following

Weighted: IFEval

Math

Weighted: AIME 2025 · BRUMO 2025 · MATH-500. Display-only: AIME 2023-2024 · HMMT 2023-2025
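
Taken together, the category definitions above amount to a small configuration: each category has weighted benchmarks that feed its average and display-only benchmarks that are shown but excluded from scoring. The sketch below mirrors those lists; the data structure and helper name are illustrative, not the site's actual schema.

```python
# Benchmark configuration per the category descriptions above.
# "weighted" benchmarks count toward the category average;
# "display_only" benchmarks are visible but excluded from scoring.
BENCHMARKS = {
    "Agentic": {
        "weighted": ["Terminal-Bench 2.0", "BrowseComp", "OSWorld-Verified"],
        "display_only": [],
    },
    "Coding": {
        "weighted": ["SWE-bench Pro", "LiveCodeBench", "SWE-bench Verified"],
        "display_only": ["HumanEval", "FLTEval"],
    },
    "Reasoning": {
        "weighted": ["LongBench v2", "ARC-AGI-2", "MRCRv2", "MuSR"],
        "display_only": ["BBH"],
    },
    "Multimodal & Grounded": {
        "weighted": ["MMMU-Pro", "OfficeQA Pro"],
        "display_only": [],
    },
    "Knowledge": {
        "weighted": ["GPQA", "SuperGPQA", "MMLU-Pro", "HLE", "FrontierScience", "SimpleQA"],
        "display_only": ["MMLU"],
    },
    "Multilingual": {
        "weighted": ["MGSM", "MMLU-ProX"],
        "display_only": [],
    },
    "Instruction Following": {
        "weighted": ["IFEval"],
        "display_only": [],
    },
    "Math": {
        "weighted": ["AIME 2025", "BRUMO 2025", "MATH-500"],
        "display_only": ["AIME 2023-2024", "HMMT 2023-2025"],
    },
}


def scoring_benchmarks(category: str) -> list[str]:
    """Benchmarks that actually count toward a category's average."""
    return BENCHMARKS[category]["weighted"]
```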

Data sourced from OpenBench, official model papers, and public leaderboards. Chatbot Arena Elo is tracked separately and not included in the overall score.