Best Agentic AI Models in 2026

Agentic capability is the single biggest factor in BenchLM.ai's overall ranking, at 22% weight. It measures what matters most for production AI systems: whether a model can complete multi-step workflows rather than just answer questions. Terminal-Bench 2.0 tests coding and shell tasks, BrowseComp measures web research and evidence gathering, and OSWorld-Verified tests computer-use reliability across real software interfaces. Models that lead here can browse, plan, use tools, and recover from mistakes without hand-holding. BenchLM.ai treats this as the most predictive category for real-world AI agent performance.

According to BenchLM.ai, GPT-5.3 Codex leads this ranking with a score of 88.1, followed by GPT-5.4 (87.8) and GPT-5.4 Pro (87.4). The top three are separated by just 0.7 points, so any of them should perform well for this use case.

The best open-weight option is GLM-5 (Reasoning), ranked #12 with a score of 78.3, which is 9.8 points behind the leader. Proprietary models hold a clear advantage in this category, taking all of the top 11 positions, though open-weight options may suffice for less demanding use cases.
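The open-weight comparison above can be reproduced from the leaderboard rows with a simple filter. The following is a minimal sketch that uses a four-row excerpt of this ranking; the `Entry` class and the `best_with_license` helper are hypothetical names introduced only for illustration and are not part of any BenchLM.ai tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    rank: int
    model: str
    license: str
    score: float


# A small excerpt of the ranking; ranks and scores are copied from this page.
entries = [
    Entry(1, "GPT-5.3 Codex", "Proprietary", 88.1),
    Entry(11, "Claude Opus 4.6", "Proprietary", 79.2),
    Entry(12, "GLM-5 (Reasoning)", "Open Weight", 78.3),
    Entry(20, "Qwen3.5 397B (Reasoning)", "Open Weight", 74.8),
]


def best_with_license(rows, license_name):
    """Return the highest-scoring row with the given license, or None."""
    matching = [row for row in rows if row.license == license_name]
    return max(matching, key=lambda row: row.score) if matching else None


best_open = best_with_license(entries, "Open Weight")
leader = max(entries, key=lambda row: row.score)

# Round to one decimal to avoid floating-point noise in the difference.
gap = round(leader.score - best_open.score, 1)

print(f"Best open-weight model: {best_open.model} (#{best_open.rank})")
print(f"Gap to the leader: {gap} points")
```

The same filter works for any other column value, for example restricting the comparison to a single provider.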

This ranking is based on average scores across all agentic benchmarks tracked by BenchLM.ai.
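As a rough illustration of how an average-based ranking of this kind can be computed, the sketch below averages per-benchmark scores and sorts models by the result. The model names and per-benchmark numbers are hypothetical placeholders, not BenchLM.ai data, and the unweighted mean is an assumption: the exact aggregation BenchLM.ai uses is not described here.

```python
from statistics import mean

# Hypothetical per-benchmark scores on a 0-100 scale. These are placeholder
# values, NOT BenchLM.ai data; only the benchmark names come from the text.
scores = {
    "model-a": {"Terminal-Bench 2.0": 80.0, "BrowseComp": 70.0, "OSWorld-Verified": 60.0},
    "model-b": {"Terminal-Bench 2.0": 75.0, "BrowseComp": 85.0, "OSWorld-Verified": 65.0},
    "model-c": {"Terminal-Bench 2.0": 50.0, "BrowseComp": 55.0, "OSWorld-Verified": 60.0},
}


def rank_by_average(per_model):
    """Return (model, average) pairs sorted from highest to lowest average.

    Assumes an unweighted mean over whatever benchmarks each model has.
    """
    averages = {
        model: round(mean(benchmarks.values()), 1)
        for model, benchmarks in per_model.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_average(scores)
for position, (model, avg) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {avg}")
```

A real pipeline would also need a policy for models that are missing a benchmark; this sketch simply averages whatever scores are present.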

Rank | Model | Provider | License | Context | Avg score
1 | GPT-5.3 Codex | OpenAI | Proprietary | 400K | 88.1
2 | GPT-5.4 | OpenAI | Proprietary | 1.05M | 87.8
3 | GPT-5.4 Pro | OpenAI | Proprietary | 1.05M | 87.4
4 | GPT-5.2-Codex | OpenAI | Proprietary | 400K | 87
5 | GPT-5.1-Codex-Max | OpenAI | Proprietary | 400K | 86
6 | GPT-5.2 Pro | OpenAI | Proprietary | 400K | 85.9
7 | GPT-5.3-Codex-Spark | OpenAI | Proprietary | 256K | 85.6
8 | GPT-5.2 | OpenAI | Proprietary | 400K | 85.4
9 | GPT-5.3 Instant | OpenAI | Proprietary | 128K | 82.9
10 | GPT-5.2 Instant | OpenAI | Proprietary | 128K | 79.6
11 | Claude Opus 4.6 | Anthropic | Proprietary | 1M | 79.2
12 | GLM-5 (Reasoning) | Zhipu AI | Open Weight | 200K | 78.3
13 | Gemini 3 Pro Deep Think | Google | Proprietary | 2M | 78.1
14 | Grok 4.1 | xAI | Proprietary | 1M | 76.9
15 | Gemini 3.1 Pro | Google | Proprietary | 1M | 76.1
16 | GPT-5.1 | OpenAI | Proprietary | 200K | 75.8
17 | GPT-5 (medium) | OpenAI | Proprietary | 128K | 75.5
18 | o1-preview | OpenAI | Proprietary | 200K | 75.4
19 | GPT-5 (high) | OpenAI | Proprietary | 128K | 75.2
20 | Qwen3.5 397B (Reasoning) | Alibaba | Open Weight | 128K | 74.8
21 | Kimi K2.5 (Reasoning) | Moonshot AI | Proprietary | 128K | 73.1
22 | Claude Sonnet 4.6 | Anthropic | Proprietary | 200K | 71.1
23 | Gemini 3 Pro | Google | Proprietary | 2M | 71.1
24 | Grok 4.1 Fast | xAI | Proprietary | 1M | 71
25 | Claude Opus 4.5 | Anthropic | Proprietary | 200K | 70.5
26 | o3-pro | OpenAI | Proprietary | 200K | 70.4
27 | Claude Sonnet 4.5 | Anthropic | Proprietary | 200K | 70.3
28 | o3 | OpenAI | Proprietary | 200K | 69.9
29 | DeepSeek V3.2 (Thinking) | DeepSeek | Open Weight | 128K | 69.4
30 | DeepSeek Coder 2.0 | DeepSeek | Open Weight | 128K | 67.5
31 | o3-mini | OpenAI | Proprietary | 200K | 66.6
32 | GLM-4.7 | Zhipu AI | Open Weight | 200K | 66.1
33 | GPT-5 mini | OpenAI | Proprietary | 128K | 65.7
34 | o1 | OpenAI | Proprietary | 200K | 65.4
35 | Qwen2.5-1M | Alibaba | Open Weight | 1M | 64.7
36 | GPT-4.1 | OpenAI | Proprietary | 1M | 64.7
37 | DeepSeekMath V2 | DeepSeek | Open Weight | 128K | 63.9
38 | Mercury 2 | Inception | Proprietary | 128K | 63.7
39 | Nemotron 3 Ultra 500B | NVIDIA | Open Weight | 10M | 62.8
40 | GLM-5 | Zhipu AI | Open Weight | 200K | 62.3
41 | Seed 1.6 | ByteDance | Proprietary | 256K | 62.3
42 | MiMo-V2-Flash | Xiaomi | Open Weight | 128K | 61.8
43 | Gemini 2.5 Pro | Google | Proprietary | 1M | 61.7
44 | GLM-4.7-Flash | Zhipu AI | Open Weight | 200K | 61.3
45 | Step 3.5 Flash | StepFun | Open Weight | 256K | 60.2
46 | DeepSeek V3.2 | DeepSeek | Open Weight | 128K | 58.8
47 | Claude 4.1 Opus | Anthropic | Proprietary | 200K | 58.7
48 | o4-mini (high) | OpenAI | Proprietary | 200K | 58.5
49 | Ministral 3 14B (Reasoning) | Mistral | Open Weight | 128K | 58.5
50 | Grok 4 | xAI | Proprietary | 128K | 58.1
51 | Claude 4 Sonnet | Anthropic | Proprietary | 200K | 57.9
52 | DeepSeek LLM 2.0 | DeepSeek | Open Weight | 128K | 57.9
53 | Qwen2.5-72B | Alibaba | Open Weight | 128K | 57.7
54 | Gemini 3 Flash | Google | Proprietary | 1M | 57.5
55 | Qwen3.5 397B | Alibaba | Open Weight | 128K | 56.9
56 | Claude Haiku 4.5 | Anthropic | Proprietary | 200K | 56.7
57 | Nemotron 3 Super 100B | NVIDIA | Open Weight | 1M | 56.6
58 | GPT-4.1 mini | OpenAI | Proprietary | 1M | 56.5
59 | Grok Code Fast 1 | xAI | Proprietary | 256K | 55.7
60 | Nemotron 3 Super 120B A12B | NVIDIA | Open Weight | 256K | 55.3
61 | Seed-2.0-Lite | ByteDance | Proprietary | 256K | 55.1
62 | Claude 3.5 Sonnet | Anthropic | Proprietary | 200K | 55
63 | Seed 1.6 Flash | ByteDance | Proprietary | 256K | 54.5
64 | Llama 3.1 405B | Meta | Open Weight | 128K | 53.9
65 | MiniMax M2.5 | MiniMax | Proprietary | 128K | 53.4
66 | Mistral Large 3 | Mistral | Proprietary | 128K | 52.5
67 | Kimi K2.5 | Moonshot AI | Open Weight | 128K | 52.3
68 | Mistral Large 2 | Mistral | Proprietary | 128K | 52.2
69 | Aion-2.0 | Aion Labs | Proprietary | 128K | 51.7
70 | GPT-4o | OpenAI | Proprietary | 128K | 51.2
71 | GPT-4o mini | OpenAI | Proprietary | 128K | 50.9
72 | Gemini 1.5 Pro | Google | Proprietary | 2M | 49.8
73 | Gemini 3.1 Flash-Lite | Google | Proprietary | 1M | 49.2
74 | Ministral 3 14B | Mistral | Open Weight | 128K | 48.4
75 | Claude 3 Opus | Anthropic | Proprietary | 200K | 48.1
76 | GPT-4.1 nano | OpenAI | Proprietary | 1M | 47.4
77 | Nemotron Ultra 253B | NVIDIA | Open Weight | 32K | 46.7
78 | Claude 4.1 Opus Thinking | Anthropic | Proprietary | 200K | 46.7
79 | Gemini 2.5 Flash | Google | Proprietary | 1M | 46.5
80 | Seed-2.0-Mini | ByteDance | Proprietary | 256K | 46.2
81 | Qwen3 235B 2507 (Reasoning) | Alibaba | Open Weight | 128K | 45.9
82 | GPT-OSS 120B | OpenAI | Open Weight | 128K | 44.8
83 | GPT-4 Turbo | OpenAI | Proprietary | 128K | 44.7
84 | DeepSeek-R1 | DeepSeek | Open Weight | 128K | 44.5
85 | DeepSeek V3.1 (Reasoning) | DeepSeek | Open Weight | 128K | 44.2
86 | Claude 3 Haiku | Anthropic | Proprietary | 200K | 44
87 | Moonshot v1 | Moonshot AI | Proprietary | 128K | 42.2
88 | Z-1 | Z | Proprietary | 128K | 42.2
89 | Nemotron-4 15B | NVIDIA | Open Weight | 32K | 41.3
90 | Llama 3 70B | Meta | Open Weight | 128K | 41.2
91 | Mistral 8x7B | Mistral | Open Weight | 32K | 41.1
92 | Llama 4 Maverick | Meta | Open Weight | 1M | 40.9
93 | Llama 4 Scout | Meta | Open Weight | 10M | 40.6
94 | Gemini 1.0 Pro | Google | Proprietary | 32K | 39.8
95 | o1-pro | OpenAI | Proprietary | 200K | 39.7
96 | Nemotron 3 Nano 30B | NVIDIA | Open Weight | 32K | 39.6
97 | Ministral 3 8B (Reasoning) | Mistral | Open Weight | 128K | 38.5
98 | Phi-4 | Microsoft | Open Weight | 16K | 38.3
99 | GPT-5 nano | OpenAI | Proprietary | 400K | 37.7
100 | Grok 3 [Beta] | xAI | Proprietary | 128K | 35.7
101 | GPT-OSS 20B | OpenAI | Open Weight | 128K | 35.4
102 | Llama 4 Behemoth | Meta | Open Weight | 32K | 34.6
103 | Gemma 3 27B | Google | Open Weight | 32K | 34.4
104 | DBRX Instruct | Databricks | Open Weight | 32K | 34.3
105 | LFM2.5-1.2B-Thinking | LiquidAI | Proprietary | 32K | 34.1
106 | Ministral 3 3B (Reasoning) | Mistral | Open Weight | 128K | 34
107 | Qwen3 235B 2507 | Alibaba | Open Weight | 128K | 33.7
108 | Qwen2.5-VL-32B | Alibaba | Open Weight | 32K | 33.5
109 | LFM2-24B-A2B | LiquidAI | Proprietary | 32K | 33.4
110 | Nova Pro | Nova AI | Proprietary | 128K | 33.3
111 | DeepSeek V3.1 | DeepSeek | Open Weight | 128K | 32.9
112 | MiniMax M1 80k | MiniMax | Proprietary | 80K | 32.1
113 | Mixtral 8x22B Instruct v0.1 | Mistral | Open Weight | 64K | 31.8
114 | GLM-4.5 | Tsinghua | Proprietary | 128K | 31.3
115 | GLM-4.5-Air | Tsinghua | Proprietary | 128K | 30.3
116 | Kimi K2 | Moonshot AI | Proprietary | 128K | 29.3
117 | Ministral 3 8B | Mistral | Open Weight | 128K | 28.9
118 | Mistral 8x7B v0.2 | Mistral | Open Weight | 32K | 27.9
119 | Mistral 7B v0.3 | Mistral | Open Weight | 32K | 26.4
120 | LFM2.5-1.2B-Instruct | LiquidAI | Proprietary | 32K | 25.7
121 | Ministral 3 3B | Mistral | Open Weight | 128K | 22.9

Key Takeaways

  • According to BenchLM.ai, the top model is GPT-5.3 Codex by OpenAI with a score of 88.1.
  • The best open-weight model in this ranking is GLM-5 (Reasoning) at position #12.
  • 121 models are included in this ranking.

Last updated: March 12, 2026
