BenchLM.ai now distinguishes provisional overall ranking from verified overall ranking. The provisional score is a normalized weighted average across 8 benchmark categories: agentic (22%), coding (20%), reasoning (17%), knowledge (12%), multimodal & grounded (12%), multilingual (7%), instruction following (5%), and math (5%), using non-generated benchmark coverage plus bounded external consensus calibration. The verified leaderboard is stricter and only counts sourced benchmark rows. Each score includes a confidence indicator (1-4 dots) based on how much sourced coverage supports it. Display-only benchmarks — including MMLU, OpenBookQA, HumanEval, FLTEval, BBH, LisanBench, and older AIME/HMMT variants — remain visible for context but do not affect ranking.
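The provisional score described above is a plain weighted average. A minimal sketch in Python, using the category weights stated in the text; the model's category scores are invented for illustration and are not BenchLM.ai data:

```python
# Category weights from BenchLM.ai's provisional scoring formula (sum to 1.0).
WEIGHTS = {
    "agentic": 0.22,
    "coding": 0.20,
    "reasoning": 0.17,
    "knowledge": 0.12,
    "multimodal_grounded": 0.12,
    "multilingual": 0.07,
    "instruction_following": 0.05,
    "math": 0.05,
}

def provisional_score(category_scores: dict) -> float:
    """Weighted average over the 8 benchmark categories."""
    return sum(WEIGHTS[cat] * category_scores[cat] for cat in WEIGHTS)

# Hypothetical normalized category scores for an illustrative model.
example = {
    "agentic": 90, "coding": 88, "reasoning": 85, "knowledge": 80,
    "multimodal_grounded": 78, "multilingual": 75,
    "instruction_following": 82, "math": 70,
}
print(round(provisional_score(example), 2))  # 83.66
```

Note that the heavier agentic and coding weights mean improvements in those two categories move the overall score more than twice as much as equal improvements in instruction following or math.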
Unless noted otherwise, ranking surfaces on this page use BenchLM's provisional leaderboard lane rather than the stricter sourced-only verified leaderboard.
Bottom line: Claude Mythos Preview leads overall, but Claude Opus 4.7 and GPT-5.4 are within striking distance, and both are significantly cheaper.
According to BenchLM.ai, Claude Mythos Preview leads this ranking with a score of 99, followed by Claude Opus 4.7 (97) and GPT-5.4 (93). The gap to Claude Opus 4.7 is narrow, but the 6-point spread from #1 to #3 reflects genuine performance differences.
The best open-weight option is GLM-5.1 (ranked #10 with a score of 84). While proprietary models lead by a meaningful margin, open-weight options remain viable for teams willing to trade some performance for full model control.
This ranking is based on provisional overall weighted scores computed with BenchLM.ai's scoring formula. For detailed model profiles, click any model name below. To compare two specific models head-to-head, use the "vs #" links.
Claude Mythos Preview
Anthropic · 1M
Highest overall score. Leads agentic and coding. Premium-priced.
Claude Opus 4.7
Anthropic · 1M
GPT-5.4
OpenAI · 1.05M
Claude Mythos Preview entered at #1 with the highest overall score on BenchLM.
GPT-5.4 holds a strong #3, with competitive scores across categories.
Claude Opus 4.7 holds #2, the most consistent model across all 8 benchmark categories.
Get notified when models move. One email a week with what changed and why.
Free. No spam. Unsubscribe anytime.
The top model is Claude Mythos Preview by Anthropic with a provisional score of 99.
The best open-weight model is GLM-5.1 at position #10.
110 models are included in this ranking.
The overall score is a weighted average across 8 benchmark categories. Agentic (22%), coding (20%), and reasoning (17%) carry the most weight. A 5-point gap in overall score is meaningful — it reflects consistent performance differences across multiple domains.
The overall score compresses 8 categories into one number. Two models with the same overall score can have very different strengths — one might lead coding while the other leads reasoning. Always check category scores for your specific use case.
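The point above can be made concrete. With the stated weights, a coding-leaning profile and a reasoning-leaning profile can land on exactly the same overall score; every number below is invented for illustration:

```python
# Category weights from BenchLM.ai's provisional scoring formula.
WEIGHTS = {
    "agentic": 0.22, "coding": 0.20, "reasoning": 0.17,
    "knowledge": 0.12, "multimodal_grounded": 0.12,
    "multilingual": 0.07, "instruction_following": 0.05, "math": 0.05,
}

def overall(scores: dict) -> float:
    """Weighted average over the 8 benchmark categories."""
    return sum(WEIGHTS[cat] * scores[cat] for cat in WEIGHTS)

# Categories where both hypothetical models score the same.
shared = {
    "agentic": 85, "knowledge": 80, "multimodal_grounded": 80,
    "multilingual": 75, "instruction_following": 80, "math": 75,
}

model_a = {**shared, "coding": 92, "reasoning": 70}  # coding-leaning
model_b = {**shared, "coding": 75, "reasoning": 90}  # reasoning-leaning

print(round(overall(model_a), 2), round(overall(model_b), 2))  # 81.2 81.2
```

Both profiles come out at 81.2 overall, yet one is 17 points stronger at coding and the other 20 points stronger at reasoning, which is why the category breakdown matters more than the headline number for a specific use case.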