Artificial Analysis IFBench (AA-IFBench)

Name: Artificial Analysis IFBench
Creator: BenchLM

AA-IFBench is the Artificial Analysis IFBench score, which BenchLM shows for display only.

Benchmark score on AA-IFBench — July 4, 2026

BenchLM mirrors the published score view for AA-IFBench. MiniMax M3 leads the public snapshot at 82.9%, followed by Nemotron 3 Ultra (81.4%) and Grok 4.3 (81.3%). BenchLM does not use these results to rank models overall.

1. MiniMax M3 (MiniMax, Open): 82.9% · Overall 74 · Context 1M

2. Nemotron 3 Ultra (NVIDIA, Open): 81.4% · Overall 66 · Context 1M

3. Grok 4.3 (xAI, Closed): 81.3% · Overall — · Context 1M

126 models · Instruction Following · Current · Display only · Updated July 4, 2026

The published AA-IFBench snapshot is tightly clustered at the top: MiniMax M3 sits at 82.9%, while the third row is only 1.6 points behind. The broader top-10 spread is 5.7 points, so many of the published scores sit in a relatively narrow band.
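The gap and spread quoted above can be rechecked from the published scores. The following is a minimal sketch, assuming the ten highest scores are copied by hand from the score table on this page; the `spread` helper is a hypothetical name used for illustration and is not part of any BenchLM tooling.

```python
# Ten highest AA-IFBench scores (percent), copied by hand from the
# score table on this page.
top10 = [82.9, 81.4, 81.3, 80.5, 79.9, 79.2, 78.8, 78.0, 77.6, 77.2]


def spread(scores, n):
    """Gap in points between the best score and the n-th best score."""
    ranked = sorted(scores, reverse=True)
    # Round to one decimal to match the precision of the published scores.
    return round(ranked[0] - ranked[n - 1], 1)


top3_gap = spread(top10, 3)       # first row vs. third row
top10_spread = spread(top10, 10)  # first row vs. tenth row
print(top3_gap, top10_spread)     # prints: 1.6 5.7
```

Rounding to one decimal keeps floating-point noise out of the comparison, since the published scores carry only one decimal place.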

126 models have been evaluated on AA-IFBench. The benchmark falls in the Instruction Following category. This category carries a 5% weight in BenchLM.ai's overall scoring system. AA-IFBench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About AA-IFBench

Year: 2026
Tasks: Verifiable instruction constraints
Format: Constraint satisfaction accuracy
Difficulty: Instruction precision

BenchLM stores the Artificial Analysis IFBench result separately from the weighted IFBench lane so AA refreshes remain display-only.

Artificial Analysis IFBench Benchmark Leaderboard

BenchLM freshness & provenance

Version: AA-IFBench 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (126 models)

Rank | Model | Creator | Open/Closed | Score
1 | MiniMax M3 | MiniMax | Open | 82.9%
2 | Nemotron 3 Ultra | NVIDIA | Open | 81.4%
3 | Grok 4.3 | xAI | Closed | 81.3%
4 | Qwen3.7 Max | Alibaba | Closed | 80.5%
5 | MiMo-V2.5-Pro | Xiaomi | Closed | 79.9%
6 | DeepSeek V4 Flash (Max) | DeepSeek | Open | 79.2%
7 | Qwen3.5 397B (Reasoning) | Alibaba | Open | 78.8%
8 | Qwen3.7 Plus | Alibaba | Closed | 78.0%
9 | GPT-5.2-Codex | OpenAI | Closed | 77.6%
10 | Gemini 3.1 Flash-Lite | Google | Closed | 77.2%
11 | Gemini 3.1 Pro | Google | Closed | 77.1%
12 | Qwen 3.6 Max (preview) | Alibaba | Closed | 76.6%
13 | DeepSeek V4 Pro (Max) | DeepSeek | Open | 76.5%
14 | Gemini 3.5 Flash | Google | Closed | 76.3%
15 | GLM-5.1 | Z.AI | Open | 76.3%
16 | Kimi K2.6 | Moonshot AI | Open | 76.0%
17 | GPT-5.5 | OpenAI | Closed | 75.9%
18 | Muse Spark | Meta | Closed | 75.9%
19 | GPT-5.4 nano | OpenAI | Closed | 75.9%
20 | Qwen3.5-122B-A10B | Alibaba | Open | 75.7%
21 | MiniMax M2.7 | MiniMax | Open | 75.7%
22 | Qwen3.5-27B | Alibaba | Open | 75.6%
23 | Gemma 4 31B | Google | Open | 75.6%
24 | GPT-5.3 Codex | OpenAI | Closed | 75.4%
25 | GPT-5.2 | OpenAI | Closed | 75.4%
26 | Qwen3.6 Plus | Alibaba | Closed | 75.2%
27 | GPT-5.4 | OpenAI | Closed | 73.9%
28 | Command A+ | Cohere | Open | 73.9%
29 | DeepSeek V4 Flash (High) | DeepSeek | Open | 73.5%
30 | Gemma 4 12B | Google | Open | 73.5%
31 | GLM-5.2 | Z.AI | Open | 73.3%
32 | GPT-5.4 mini | OpenAI | Closed | 73.3%
33 | GLM-5-Turbo | Z.AI | Closed | 73.2%
34 | GPT-5 (high) | OpenAI | Closed | 73.1%
35 | GPT-5.1 | OpenAI | Closed | 72.9%
36 | Qwen3.5-35B-A3B | Alibaba | Open | 72.5%
37 | Gemma 4 26B A4B | Google | Open | 72.4%
38 | GLM-5 | Z.AI | Open | 72.3%
39 | (not shown) | OpenAI | Closed | 71.4%
40 | DeepSeek V4 Pro (High) | DeepSeek | Open | 71.3%
41 | GPT-5 (medium) | OpenAI | Closed | 70.6%
42 | Gemini 3 Pro | Google | Closed | 70.4%
43 | (not shown) | OpenAI | Closed | 70.3%
44 | Kimi K2.5 (Reasoning) | Moonshot AI | Closed | 70.2%
45 | Kimi K2.5 | Moonshot AI | Open | 70.2%
46 | GPT-5.1-Codex-Max | OpenAI | Closed | 70.0%
47 | GPT-5.1-Codex | OpenAI | Closed | 70.0%
48 | GPT-OSS 120B | OpenAI | Open | 69.0%
49 | MiMo-V2-Pro | Xiaomi | Closed | 68.8%
50 | Mistral Medium 3.5 128B | Mistral | Open | 68.8%
51 | GLM-4.7 | Z.AI | Open | 67.9%
52 | Qwen3.6-27B | Alibaba | Open | 67.6%
53 | Step 3.7 Flash | StepFun | Open | 67.3%
54 | GPT-OSS 20B | OpenAI | Open | 65.1%
55 | K-Exaone | LG AI Research | Closed | 64.7%
56 | Qwen3.6-35B-A3B | Alibaba | Open | 64.4%
57 | Nemotron 3 Nano Omni 30B A3B | NVIDIA | Open | 63.2%
58 | Hy3 Preview | Tencent | Open | 63.1%
59 | Kimi K2.7 Code | Moonshot AI | Open | 63.1%
60 | Claude Opus 4.8 | Anthropic | Closed | 62.2%
61 | GLM-5V-Turbo | Z.AI | Closed | 61.1%
62 | Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 58.6%
63 | Claude Opus 4.5 Thinking | Anthropic | Closed | 58.0%
64 | Ling 2.6 Flash | InclusionAI | Open | 57.4%
65 | Trinity-Large-Thinking | Arcee AI | Open | 56.3%
66 | Trinity-Large-Preview | Arcee AI | Open | 56.3%
67 | LFM2.5-8B-A1B | LiquidAI | Open | 55.6%
68 | Claude 4.1 Opus Thinking | Anthropic | Closed | 55.4%
69 | Gemini 3 Flash | Google | Closed | 55.1%
70 | Grok 4 | xAI | Closed | 53.7%
71 | MiMo-V2-Omni | Xiaomi | Closed | 53.5%
72 | Claude Opus 4.6 (Adaptive) | Anthropic | Closed | 53.1%
73 | Grok 4.1 Fast (Reasoning) | xAI | Closed | 52.7%
74 | Qwen3.5 397B | Alibaba | Open | 51.6%
75 | Grok 4 Fast (Reasoning) | xAI | Closed | 50.5%
76 | DeepSeek V3.2 | DeepSeek | Open | 49.0%
77 | Gemini 2.5 Pro | Google | Closed | 48.7%
78 | Mistral Small 4 (Reasoning) | Mistral | Open | 48.2%
79 | Mistral Small 4 | Mistral | Open | 48.2%
80 | Claude 4 Sonnet | Anthropic | Closed | 45.4%
81 | Claude Opus 4.6 | Anthropic | Closed | 44.6%
82 | Gemma 4 E4B | Google | Open | 44.2%
83 | Qwen3 Max | Alibaba | Closed | 44.1%
84 | Claude Opus 4.7 | Anthropic | Closed | 43.6%
85 | Claude Opus 4.5 | Anthropic | Closed | 43.0%
86 | GPT-4.1 | OpenAI | Closed | 43.0%
87 | Llama 4 Maverick | Meta | Open | 43.0%
88 | Kimi K2 | Moonshot AI | Closed | 41.5%
89 | DeepSeek V3.1 (Reasoning) | DeepSeek | Open | 41.5%
90 | Grok Code Fast 1 | xAI | Closed | 41.4%
91 | Claude Sonnet 4.6 | Anthropic | Closed | 41.2%
92 | MiMo-V2-Flash | Xiaomi | Open | 39.9%
93 | DeepSeek-R1 | DeepSeek | Open | 39.6%
94 | Llama 4 Scout | Meta | Open | 39.5%
95 | Mistral Medium 3 | Mistral | Closed | 39.3%
96 | Llama 3.1 405B | Meta | Open | 39.0%
97 | Gemini 2.5 Flash | Google | Closed | 39.0%
98 | GPT-4.1 mini | OpenAI | Closed | 38.3%
99 | Nemotron Ultra 253B | NVIDIA | Open | 38.2%
100 | Nova Pro | Amazon | Closed | 38.1%
101 | Gemma 4 E2B | Google | Open | 38.0%
102 | DeepSeek V3.1 | DeepSeek | Open | 37.8%
103 | GLM-4.5-Air | Z.AI | Closed | 37.6%
104 | Nemotron 3 Nano 30B | NVIDIA | Open | 37.5%
105 | GLM-4.6 | Z.AI | Open | 36.7%
106 | Grok 4.1 Fast | xAI | Closed | 36.5%
107 | Mistral Large 3 | Mistral | Closed | 36.2%
108 | Claude 3 Haiku | Anthropic | Closed | 36.1%
109 | DeepSeek V3 | DeepSeek | Open | 34.8%
110 | Sarvam 105B | Sarvam | Open | 34.4%
111 | GPT-4o | OpenAI | Closed | 34.3%
112 | Solar Pro 2 | Upstage | Closed | 33.7%
113 | Exaone 4.0 32B | LG AI Research | Open | 33.5%
114 | LFM2.5-VL-1.6B-Extract | LiquidAI | Open | 33.1%
115 | GPT-4.1 nano | OpenAI | Closed | 32.0%
116 | Gemma 3 27B | Google | Open | 31.8%
117 | Mistral Large 2 | Mistral | Closed | 31.2%
118 | GPT-4o mini | OpenAI | Closed | 31.0%
119 | Sarvam 30B | Sarvam | Open | 26.5%
120 | Granite-4.0-H-1B | IBM | Open | 26.2%
121 | Exaone 4.0 1.2B | LG AI Research | Open | 25.3%
122 | Phi-4 | Microsoft | Open | 23.5%
123 | DeepSeek R1 Distill Qwen 32B | DeepSeek | Open | 22.9%
124 | Granite-4.0-1B | IBM | Open | 20.5%
125 | Granite-4.0-H-350M | IBM | Open | 17.6%
126 | Granite-4.0-350M | IBM | Open | 15.9%

FAQ

What does AA-IFBench measure?

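AA-IFBench is an instruction-following benchmark: it measures how accurately a model satisfies verifiable instruction constraints. BenchLM shows the Artificial Analysis IFBench score for display only and does not use it in overall rankings.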

Which model scores highest on AA-IFBench?

MiniMax M3 by MiniMax currently leads with a score of 82.9% on AA-IFBench.

How many models are evaluated on AA-IFBench?

BenchLM lists AA-IFBench scores for 126 AI models.

Last updated: July 4, 2026 · BenchLM version AA-IFBench 2026
