Best AI Models in 2026 — Overall Rankings

BenchLM.ai now distinguishes provisional overall ranking from verified overall ranking. The provisional score is a normalized weighted average across 8 benchmark categories: agentic (22%), coding (20%), reasoning (17%), knowledge (12%), multimodal & grounded (12%), multilingual (7%), instruction following (5%), and math (5%), using non-generated benchmark coverage plus bounded external consensus calibration. The verified leaderboard is stricter and only counts sourced benchmark rows. Each score includes a confidence indicator (1-4 dots) based on how much sourced coverage supports it. Display-only benchmarks — including MMLU, OpenBookQA, HumanEval, FLTEval, BBH, LisanBench, and older AIME/HMMT variants — remain visible for context but do not affect ranking.
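
As a rough illustration of the weighting described above, the sketch below computes a plain weighted average from per-category scores. It is not BenchLM's actual implementation: the function name, the category keys, the 0-100 score scale, and the choice to renormalize weights when a category has no score are all assumptions, and the external consensus calibration step is not modeled.

```python
# Category weights as stated on this page (percent of the overall score).
WEIGHTS = {
    "agentic": 22,
    "coding": 20,
    "reasoning": 17,
    "knowledge": 12,
    "multimodal_grounded": 12,
    "multilingual": 7,
    "instruction_following": 5,
    "math": 5,
}


def provisional_overall(category_scores):
    """Weighted average over the categories that have a score.

    Assumption: categories missing from `category_scores` are skipped and
    the remaining weights are renormalized. The page does not say how
    missing coverage is handled, and calibration is not modeled here.
    """
    covered = {c: s for c, s in category_scores.items() if c in WEIGHTS}
    if not covered:
        raise ValueError("no scored categories")
    total_weight = sum(WEIGHTS[c] for c in covered)
    weighted_sum = sum(WEIGHTS[c] * s for c, s in covered.items())
    return weighted_sum / total_weight


# Hypothetical category scores on a 0-100 scale, for illustration only.
example = {
    "agentic": 90,
    "coding": 80,
    "reasoning": 70,
    "knowledge": 60,
    "multimodal_grounded": 60,
    "multilingual": 50,
    "instruction_following": 40,
    "math": 40,
}
print(round(provisional_overall(example), 1))  # 69.6
```

Because agentic, coding, and reasoning together carry 59% of the weight, the hypothetical model above lands near 70 even though its math and instruction-following scores are only 40.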

Unless noted otherwise, ranking surfaces on this page use BenchLM's provisional leaderboard lane rather than the stricter sourced-only verified leaderboard.

Bottom line: Claude Mythos Preview leads overall at 99. Claude Opus 4.8 (95) and Gemini 3.1 Pro (92) are the closest challengers, while GPT-5.4 (89) and Claude Opus 4.6 (87) trail by 10 or more points but are significantly cheaper.

According to BenchLM.ai, Claude Mythos Preview leads this ranking with a score of 99, followed by Claude Opus 4.8 (95) and Gemini 3.1 Pro (92). There is meaningful separation between the top models, suggesting genuine performance differences.

The best open-weight option is DeepSeek V4 Pro (Max), ranked #12 with a score of 87. Proprietary models hold a clear advantage in this ranking, though open-weight options may suffice for less demanding use cases.

This ranking is based on provisional overall weighted scores computed with BenchLM.ai's scoring formula. For detailed model profiles, click any model name below. To compare two specific models head-to-head, use the "vs #" links.

What changed

Claude Mythos Preview entered at #1 with the highest overall score on BenchLM.

GPT-5.4 now ranks #9 with a provisional score of 89.

Claude Opus 4.6 ranks #10 with a provisional score of 87 and remains the most consistent model across all 8 benchmark categories.

Full Rankings (119 models)

| Rank | Model | Developer | License | Context window | Prov. overall |
|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | Proprietary | 1M | 99 |
| 2 | Claude Opus 4.8 | Anthropic | Proprietary | 1M | 95 |
| 3 | Gemini 3.1 Pro | Google | Proprietary | 1M | 92 |
| 4 | GPT-5.5 | OpenAI | Proprietary | 1M | 91 |
| 5 | Qwen3.7 Max | Alibaba | Proprietary | 1M | 91 |
| 6 | GPT-5.4 Pro | OpenAI | Proprietary | 1.05M | 91 |
| 7 | Gemini 3 Pro Deep Think | Google | Proprietary | 2M | 90 |
| 8 | Grok 4.1 | xAI | Proprietary | 1M | 90 |
| 9 | GPT-5.4 | OpenAI | Proprietary | 1.05M | 89 |
| 10 | Claude Opus 4.6 | Anthropic | Proprietary | 1M | 87 |
| 11 | Gemini 3.5 Flash | Google | Proprietary | 1M | 87 |
| 12 | DeepSeek V4 Pro (Max) | DeepSeek | Open Weight | 1M | 87 |
| 13 | GPT-5.3 Codex | OpenAI | Proprietary | 400K | 86 |
| 14 | Claude Opus 4.7 (Adaptive) | Anthropic | Proprietary | 1M | 85 |
| 15 | Kimi K2.6 | Moonshot AI | Open Weight | 256K | 84 |
| 16 | Claude Sonnet 4.6 | Anthropic | Proprietary | 200K | 83 |
| 17 | DeepSeek V4 Pro (High) | DeepSeek | Open Weight | 1M | 83 |
| 18 | o1-preview | OpenAI | Proprietary | 200K | 83 |
| 19 | GLM-5.1 | Z.AI | Open Weight | 203K | 82 |
| 20 | Gemini 3 Pro | Google | Proprietary | 2M | 81 |
| 21 | GLM-5 (Reasoning) | Z.AI | Open Weight | 200K | 80 |
| 22 | GPT-5.2 | OpenAI | Proprietary | 400K | 79 |
| 23 | GPT-5.1 | OpenAI | Proprietary | 200K | 78 |
| 24 | Qwen3.5 397B (Reasoning) | Alibaba | Open Weight | 128K | 78 |
| 25 | GPT-5 (high) | OpenAI | Proprietary | 128K | 77 |
| 26 | Claude Opus 4.5 | Anthropic | Proprietary | 200K | 76 |
| 27 | Kimi K2.5 (Reasoning) | Moonshot AI | Proprietary | 128K | 76 |
| 28 | MiniMax M3 | MiniMax | Open Weight | 1M | 76 |
| 29 | GPT-5.2-Codex | OpenAI | Proprietary | 400K | 76 |
| 30 | DeepSeek V4 Flash (Max) | DeepSeek | Open Weight | 1M | 75 |
| 31 | GPT-5.1-Codex-Max | OpenAI | Proprietary | 400K | 75 |
| 32 | Qwen3.6 Plus | Alibaba | Proprietary | 1M | 73 |
| 33 | Qwen3.6-27B | Alibaba | Open Weight | 262K | 73 |
| 34 | Grok 4.20 | xAI | Proprietary | 2M | 72 |
| 35 | DeepSeek V4 Flash (High) | DeepSeek | Open Weight | 1M | 71 |
| 36 | GPT-5 (medium) | OpenAI | Proprietary | 128K | 70 |
| 37 | DeepSeek V4 Pro | DeepSeek | Open Weight | 1M | 69 |
| 38 | Grok 4.1 Fast | xAI | Proprietary | 1M | 69 |
| 39 | GLM-4.7 | Z.AI | Open Weight | 200K | 68 |
| 40 | GLM-5 | Z.AI | Open Weight | 200K | 67 |
| 41 | Qwen3.6-35B-A3B | Alibaba | Open Weight | 262K | 66 |
| 42 | Claude Sonnet 4.5 | Anthropic | Proprietary | 200K | 65 |
| 43 | Kimi K2.5 | Moonshot AI | Open Weight | 256K | 64 |
| 44 | Qwen3.5-122B-A10B | Alibaba | Open Weight | 262K | 64 |
| 45 | Gemini 2.5 Pro | Google | Proprietary | 1M | 64 |
| 46 | Qwen3.5 397B | Alibaba | Open Weight | 128K | 63 |
| 47 | Grok 4 | xAI | Proprietary | 128K | 63 |
| 48 | Qwen3.5-27B | Alibaba | Open Weight | 262K | 62 |
| 49 | DeepSeek V3.2 (Thinking) | DeepSeek | Open Weight | 128K | 61 |
| 50 | MiMo-V2-Flash | Xiaomi | Open Weight | 256K | 59 |
| 51 | DeepSeek V4 Flash | DeepSeek | Open Weight | 1M | 57 |
| 52 | DeepSeek V3.2 | DeepSeek | Open Weight | 128K | 57 |
| 53 | GPT-4.1 | OpenAI | Proprietary | 1M | 57 |
| 54 | o1 | OpenAI | Proprietary | 200K | 57 |
| 55 | o3 | OpenAI | Proprietary | 200K | 57 |
| 56 | o3-pro | OpenAI | Proprietary | 200K | 57 |
| 57 | Qwen3.5-35B-A3B | Alibaba | Open Weight | 262K | 56 |
| 58 | Claude Haiku 4.5 | Anthropic | Proprietary | 200K | 56 |
| 59 | Gemini 3 Flash | Google | Proprietary | 1M | 56 |
| 60 | o3-mini | OpenAI | Proprietary | 200K | 55 |
| 61 | MiniMax M2.7 | MiniMax | Open Weight | 200K | 54 |
| 62 | Claude 4.1 Opus | Anthropic | Proprietary | 200K | 51 |
| 63 | DeepSeek Coder 2.0 | DeepSeek | Open Weight | 128K | 51 |
| 64 | DeepSeek LLM 2.0 | DeepSeek | Open Weight | 128K | 51 |
| 65 | Qwen2.5-1M | Alibaba | Open Weight | 1M | 51 |
| 66 | Claude 4 Sonnet | Anthropic | Proprietary | 200K | 50 |
| 67 | DeepSeekMath V2 | DeepSeek | Open Weight | 128K | 50 |
| 68 | GPT-4o mini | OpenAI | Proprietary | 128K | 49 |
| 69 | Mistral Large 3 | Mistral | Proprietary | 128K | 49 |
| 70 | Qwen2.5-72B | Alibaba | Open Weight | 128K | 49 |
| 71 | Gemini 3.1 Flash-Lite | Google | Proprietary | 1M | 48 |
| 72 | Qwen3 235B 2507 (Reasoning) | Alibaba | Open Weight | 128K | 46 |
| 73 | GPT-4.1 mini | OpenAI | Proprietary | 1M | 45 |
| 74 | o4-mini (high) | OpenAI | Proprietary | 200K | 44 |
| 75 | Claude 4.1 Opus Thinking | Anthropic | Proprietary | 200K | 43 |
| 76 | Nemotron 3 Super 100B | NVIDIA | Open Weight | 1M | 43 |
| 77 | GPT-4o | OpenAI | Proprietary | 128K | 42 |
| 78 | Kimi K2 | Moonshot AI | Proprietary | 128K | 41 |
| 79 | Llama 3.1 405B | Meta | Open Weight | 128K | 41 |
| 80 | Claude 3.5 Sonnet | Anthropic | Proprietary | 200K | 40 |
| 81 | Grok Code Fast 1 | xAI | Proprietary | 256K | 39 |
| 82 | Sarvam 105B | Sarvam | Open Weight | 128K | 39 |
| 83 | Mistral Large 2 | Mistral | Proprietary | 128K | 38 |
| 84 | Gemini 2.5 Flash | Google | Proprietary | 1M | 37 |
| 85 | DeepSeek V3 | DeepSeek | Open Weight | 128K | 35 |
| 86 | Gemini 1.5 Pro | Google | Proprietary | 2M | 35 |
| 87 | Claude 3 Opus | Anthropic | Proprietary | 200K | 34 |
| 88 | GPT-OSS 120B | OpenAI | Open Weight | 128K | 34 |
| 89 | MiniCPM5-1B | OpenBMB | Open Weight | 131K | 34 |
| 90 | DeepSeek-R1 | DeepSeek | Open Weight | 128K | 33 |
| 91 | DBRX Instruct | Databricks | Open Weight | 32K | 32 |
| 92 | Qwen3 235B 2507 | Alibaba | Open Weight | 128K | 32 |
| 93 | Grok 3 [Beta] | xAI | Proprietary | 128K | 31 |
| 94 | DeepSeek V3.1 (Reasoning) | DeepSeek | Open Weight | 128K | 29 |
| 95 | o1-pro | OpenAI | Proprietary | 200K | 29 |
| 96 | Phi-4 | Microsoft | Open Weight | 16K | 28 |
| 97 | GPT-4.1 nano | OpenAI | Proprietary | 1M | 27 |
| 98 | GLM-4.5 | Z.AI | Proprietary | 128K | 26 |
| 99 | Llama 3 70B | Meta | Open Weight | 128K | 26 |
| 100 | DeepSeek V3.1 | DeepSeek | Open Weight | 128K | 25 |
| 101 | GPT-4 Turbo | OpenAI | Proprietary | 128K | 25 |
| 102 | Nemotron 3 Nano 30B | NVIDIA | Open Weight | 32K | 25 |
| 103 | Gemini 1.0 Pro | Google | Proprietary | 32K | 24 |
| 104 | Mistral 8x7B | Mistral | Open Weight | 32K | 24 |
| 105 | Z-1 | Z | Proprietary | 128K | 24 |
| 106 | Claude 3 Haiku | Anthropic | Proprietary | 200K | 23 |
| 107 | Moonshot v1 | Moonshot AI | Proprietary | 128K | 23 |
| 108 | Llama 4 Scout | Meta | Open Weight | 10M | 22 |
| 109 | Mixtral 8x22B Instruct v0.1 | Mistral | Open Weight | 64K | 22 |
| 110 | Nemotron Ultra 253B | NVIDIA | Open Weight | 32K | 22 |
| 111 | Nemotron-4 15B | NVIDIA | Open Weight | 32K | 22 |
| 112 | GLM-4.5-Air | Z.AI | Proprietary | 128K | 19 |
| 113 | Gemma 3 27B | Google | Open Weight | 32K | 17 |
| 114 | GPT-OSS 20B | OpenAI | Open Weight | 128K | 17 |
| 115 | Llama 4 Maverick | Meta | Open Weight | 1M | 17 |
| 116 | Llama 4 Behemoth | Meta | Open Weight | 32K | 12 |
| 117 | Nova Pro | Amazon | Proprietary | 128K | 10 |
| 118 | Mistral 7B v0.3 | Mistral | Open Weight | 32K | 4 |
| 119 | Mistral 8x7B v0.2 | Mistral | Open Weight | 32K | 1 |

These rankings update weekly


Key Takeaways

The top model is Claude Mythos Preview by Anthropic with a provisional score of 99.

The best open-weight model is DeepSeek V4 Pro (Max) at position #12.

119 models are included in this ranking.

Score in Context

What these scores mean

The overall score is a weighted average across 8 benchmark categories. Agentic (22%), coding (20%), and reasoning (17%) carry the most weight. A 5-point gap in overall score is meaningful — it reflects consistent performance differences across multiple domains.
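
One way to see why a 5-point gap is hard to produce from a single category is to work through the arithmetic under a simplified model. The block below assumes a plain weighted sum of category scores on a 0-100 scale with the weights listed above, and ignores normalization details and calibration, so it is an approximation rather than BenchLM's exact formula. The symbols S (overall score), w_c (category weight), and s_c (category score) are introduced here for illustration.

```latex
\[
S = \sum_{c} w_c\, s_c, \qquad \sum_{c} w_c = 1
\quad\Longrightarrow\quad
\Delta s_c = \frac{\Delta S}{w_c}
\ \text{ when only category } c \text{ changes}
\]
\[
\begin{aligned}
\text{agentic: }   & 5 / 0.22 \approx 22.7 \text{ points} \\
\text{coding: }    & 5 / 0.20 = 25 \text{ points} \\
\text{reasoning: } & 5 / 0.17 \approx 29.4 \text{ points} \\
\text{math: }      & 5 / 0.05 = 100 \text{ points}
\end{aligned}
\]
```

Under these assumptions, a 5-point overall gap would require a lead of more than 22 points in even the heaviest category, and the entire 0-100 range in a 5% category, which is why such a gap more plausibly reflects smaller leads spread across several categories.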

Known limitations

The overall score compresses 8 categories into one number. Two models with the same overall score can have very different strengths — one might lead coding while the other leads reasoning. Always check category scores for your specific use case.
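
The following sketch makes this limitation concrete with two made-up models. The model names, category keys, and scores are hypothetical and chosen only so that the totals match; the weights are the ones stated on this page, and the overall score is computed as a plain weighted average without calibration.

```python
# Category weights as stated on this page (percent of the overall score).
WEIGHTS = {
    "agentic": 22,
    "coding": 20,
    "reasoning": 17,
    "knowledge": 12,
    "multimodal_grounded": 12,
    "multilingual": 7,
    "instruction_following": 5,
    "math": 5,
}


def overall(scores):
    """Plain weighted average; assumes every category has a 0-100 score."""
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS) / 100


def strongest(scores):
    """Return the category with the highest score."""
    return max(scores, key=scores.get)


# Two hypothetical models: both score 70 everywhere except one category.
baseline = {category: 70 for category in WEIGHTS}
model_a = {**baseline, "coding": 87}      # +17 points at 20% weight
model_b = {**baseline, "reasoning": 90}   # +20 points at 17% weight

print(overall(model_a), strongest(model_a))  # 73.4 coding
print(overall(model_b), strongest(model_b))  # 73.4 reasoning
```

Both hypothetical models would show the same overall score, yet one is clearly the better pick for coding work and the other for reasoning-heavy work.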

Last updated: June 2, 2026
