Artificial Analysis IFBench (AA-IFBench)

A display-only Artificial Analysis IFBench score.

Benchmark score on AA-IFBench — May 21, 2026

BenchLM mirrors the published score view for AA-IFBench. Grok 4.3 leads the public snapshot at 81.3%, followed by Qwen3.7 Max (80.5%) and MiMo-V2.5-Pro (79.9%). BenchLM does not use these results to rank models overall.

119 models · Instruction Following · Current · Display only · Updated May 21, 2026

The published AA-IFBench snapshot is tightly clustered at the top: Grok 4.3 sits at 81.3%, while the third row is only 1.4 points behind. The broader top-10 spread is 4.8 points, so many of the published scores sit in a relatively narrow band.
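The two gaps quoted above can be checked directly against the top ten rows of the score table. The following is a minimal sketch; the helper name `gap` is illustrative and not part of any BenchLM tooling.

```python
# Top ten published AA-IFBench scores, copied from the score table on this page.
top10 = [81.3, 80.5, 79.9, 79.2, 78.8, 77.6, 77.2, 77.1, 76.6, 76.5]


def gap(scores, rank):
    """Points between the leader and the row at the given 1-based rank."""
    # Round to one decimal place to match the precision of the published scores.
    return round(scores[0] - scores[rank - 1], 1)


top3_gap = gap(top10, 3)       # leader vs. third row
top10_spread = gap(top10, 10)  # leader vs. tenth row

print(f"Leader to third row: {top3_gap} points")
print(f"Top-10 spread: {top10_spread} points")
```

Both values agree with the figures in the paragraph above: 1.4 points to the third row and 4.8 points across the top ten.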

119 models have been evaluated on AA-IFBench. The benchmark falls in the Instruction Following category. This category carries a 5% weight in BenchLM.ai's overall scoring system. AA-IFBench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
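One way to picture a display-only benchmark is as an entry that is stored alongside the weighted ones but skipped when the category score is computed. The sketch below is an assumption for illustration only: the `Benchmark` class, the averaging rule, and the 70.0 IFBench score are hypothetical, and only the 5% category weight and the display-only status come from this page. BenchLM's actual formula is described on its methodology page.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    category: str
    score: float                # percentage, 0-100
    display_only: bool = False  # True means shown on the site but never scored


# Only the 5% figure is taken from the page; the structure is hypothetical.
CATEGORY_WEIGHTS = {"Instruction Following": 0.05}


def category_contribution(benchmarks, category):
    """Weighted contribution of one category, ignoring display-only entries.

    Assumes a plain average of the scored benchmarks in the category,
    which may not match BenchLM's real aggregation rule.
    """
    scored = [
        b.score
        for b in benchmarks
        if b.category == category and not b.display_only
    ]
    if not scored:
        return 0.0
    return CATEGORY_WEIGHTS[category] * sum(scored) / len(scored)


# Hypothetical model: the IFBench score of 70.0 is made up for the example.
results = [
    Benchmark("IFBench", "Instruction Following", 70.0),
    Benchmark("AA-IFBench", "Instruction Following", 81.3, display_only=True),
]
contribution = category_contribution(results, "Instruction Following")
print(f"Instruction Following contribution: {contribution:.2f}")
```

Under these assumptions, changing the AA-IFBench value leaves the contribution untouched, which is the behavior the paragraph above describes.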

About AA-IFBench

Year: 2026
Tasks: Verifiable instruction constraints
Format: Constraint satisfaction accuracy
Difficulty: Instruction precision

BenchLM stores the Artificial Analysis IFBench result separately from the weighted IFBench lane so AA refreshes remain display-only.

BenchLM freshness & provenance

Version: AA-IFBench 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (119 models)

Rank  Score
1     81.3%
2     80.5%
3     79.9%
4     79.2%
5     78.8%
6     77.6%
7     77.2%
8     77.1%
9     76.6%
10    76.5%
11    76.3%
12    76.3%
13    76.0%
14    75.9%
15    75.9%
16    75.9%
17    75.7%
18    75.7%
19    75.6%
20    75.6%
21    75.4%
22    75.4%
23    75.2%
24    73.9%
25    73.9%
26    73.5%
27    73.3%
28    73.2%
29    73.1%
30    72.9%
31    72.5%
32    72.4%
33    72.3%
34    72.3%
35    71.4%
36    71.3%
37    70.6%
38    70.4%
39    70.3%
40    70.2%
41    70.2%
42    70.0%
43    70.0%
44    69.0%
45    68.8%
46    68.8%
47    67.9%
48    67.6%
49    65.1%
50    64.7%
51    64.4%
53    63.1%
54    61.1%
55    58.6%
56    58.0%
57    57.4%
58    56.3%
59    56.3%
60    55.4%
61    55.1%
62    53.7%
63    53.5%
64    53.1%
66    51.6%
67    50.5%
68    49.3%
69    49.0%
70    48.7%
71    48.2%
72    48.2%
73    45.9%
74    45.4%
75    44.6%
76    44.2%
77    44.1%
78    43.6%
79    43.0%
80    43.0%
81    43.0%
82    41.5%
83    41.5%
84    41.4%
85    41.2%
86    39.9%
87    39.6%
88    39.5%
89    39.3%
90    39.0%
91    39.0%
92    38.3%
93    38.2%
94    38.1%
95    38.0%
96    37.8%
97    37.6%
98    37.5%
99    36.7%
100   36.5%
101   36.2%
102   36.1%
103   34.8%
104   34.4%
105   34.3%
106   33.7%
107   33.5%
108   32.0%
109   31.8%
110   31.2%
111   31.0%
112   26.5%
113   26.2%
114   25.3%
115   23.5%
116   22.9%
117   20.5%
118   17.6%
119   15.9%

FAQ

What does AA-IFBench measure?

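AA-IFBench measures instruction precision: how accurately a model satisfies verifiable instruction constraints. BenchLM shows the Artificial Analysis IFBench score for display only and does not use it in overall rankings.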

Which model scores highest on AA-IFBench?

Grok 4.3 by xAI currently leads with a score of 81.3% on AA-IFBench.

How many models are evaluated on AA-IFBench?

119 AI models have been evaluated on AA-IFBench on BenchLM.

Last updated: May 21, 2026 · BenchLM version AA-IFBench 2026
