Instruction Following

Instruction Following Benchmarks — IFEval Leaderboard

Ability to follow precise instructions and constraints

Bottom line: Instruction following is a reliability signal — a model that ignores constraints is unusable in production, regardless of how smart it is.

IFEval · IFBench

Best Instruction Following picks

BenchLM summaries for instruction following plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.

Top AI Models for Instruction Following: April 2026

As of April 2026, Kimi K2.5 (Reasoning) leads the provisional instruction following leaderboard with a weighted score of 100.0%, followed by Grok 4.20 Multi-agent (100.0%) and Grok 4.20 (97.1%). BenchLM is currently showing 118 provisional-ranked models and 14 verified-ranked models in this category.

What changed

Claude Mythos Preview leads IFEval with the highest instruction-following accuracy.

GPT-5.4 is a close second on IFEval and IFBench.

Claude Opus 4.6 holds #3, with strong IFBench scores.

How to choose

Top models by benchmark

Tests ability to follow verifiable instructions like format constraints and content requirements (65% of category score)

IFEval Leaderboard

Updated April 21, 2026

Sorted by instruction following weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.

118 ranked models
Provisional-ranked mode includes source-unverified non-generated benchmark evidence. P = provisional benchmark row.
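| Rank | Model | Provider | Weighted score | Overall score | Benchmark scores shown |
|---|---|---|---|---|---|
| 1 | Kimi K2.5 (Reasoning) | | 100% | Est. 78 | |
| 2 | Grok 4.20 Multi-agent | | 100% | Est. 74 | |
| 3 | Grok 4.20 | | 97.1% | 77 | |
| 4 | | | 94.8% | 91 | |
| 5 | GPT-5.4 | OpenAI | 93.4% | 93 | |
| 6 | | | 93.4% | 92 | |
| 7 | o1 | OpenAI | 93.3% | Est. 59 | 92.2% |
| 8 | | | 92.9% | 84 | |
| 9 | | | 92.6% | Est. 79 | |
| 10 | | | 92.5% | 94 | |
| 11 | | | 90.8% | Est. 89 | |
| 12 | | | 89.8% | 68 | 93.4% |
| 13 | | | 89.6% | 99 | |
| 14 | | | 89.3% | 65 | 95% |
| 15 | | | 88.9% | Est. 78 | |
| 16 | o3-mini | OpenAI | 86.8% | Est. 58 | 93.9% |
| 17 | | | 85.9% | 76 | 94.3%, 75.8% |
| 18 | | | 85.5% | Est. 80 | |
| 19 | | | 85.2% | Est. 67 | |
| 20 | GPT-5.2 | OpenAI | 84.8% | 83 | |
| 21 | Kimi K2 | Moonshot AI | 84.4% | Est. 43 | |
| 22 | | | 83% | 65 | 92.6% |
| 23 | | | 83% | Est. 45 | |
| 24 | | | 82.7% | Est. 86 | |
| 25 | | | 82.5% | Est. 79 | |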
Showing 25 of 118

These rankings update weekly


Score in Context

What these scores mean

Instruction following carries a 5% weight in overall scoring. The weight is small, but the category directly measures reliability: a model that ignores formatting rules, word count limits, or inclusion constraints is unusable in automated pipelines. The weighted score blends IFEval (verifiable constraints, 65% of the category score) and IFBench (35%).
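As a rough illustration of the arithmetic, the sketch below assumes the category score is a plain linear blend using the 65% / 35% weights listed for IFEval and IFBench, and that the category's contribution to the overall score is simply 5% of that blend. Both are assumptions made for illustration; BenchLM's actual normalization may differ, and the function names and input scores are hypothetical.

```python
# Weights as stated on this page; the linear-blend formula itself is an assumption.
IFEVAL_WEIGHT = 0.65
IFBENCH_WEIGHT = 0.35
CATEGORY_WEIGHT = 0.05  # instruction following's share of the overall score


def blended_category_score(ifeval: float, ifbench: float) -> float:
    """Blend two benchmark scores (percent, 0-100) into one category score."""
    return IFEVAL_WEIGHT * ifeval + IFBENCH_WEIGHT * ifbench


def overall_contribution(category_score: float) -> float:
    """Points this category would add to an overall score under a linear model."""
    return CATEGORY_WEIGHT * category_score


# Made-up scores for two hypothetical models that differ by 10 points on IFEval only.
model_a = blended_category_score(ifeval=90.0, ifbench=80.0)
model_b = blended_category_score(ifeval=80.0, ifbench=80.0)
gap_in_overall = overall_contribution(model_a) - overall_contribution(model_b)

print(f"category scores: {model_a:.1f} vs {model_b:.1f}")
print(f"difference in overall score: {gap_in_overall:.3f} points")
```

Under these assumptions, a 10-point IFEval gap becomes 6.5 category points and roughly a third of a point in the overall score, which is why the category matters more as a reliability check than as a ranking driver.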

Known limitations

IFEval tests a specific set of verifiable constraints (word count, formatting, inclusion/exclusion) — it doesn't capture the full range of instruction-following quality. A model can score well on IFEval but still misinterpret nuanced or ambiguous instructions. IFBench is newer and coverage is still building.
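To make "verifiable constraints" concrete, here is a minimal sketch of the kind of programmatic checks this style of benchmark relies on. The checker functions, their names, and the sample response are all hypothetical and are not IFEval's actual implementation; they only show why word count, formatting, and inclusion/exclusion rules can be scored without a human judge.

```python
import re
from typing import Callable, List


def check_max_words(response: str, limit: int) -> bool:
    """True if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit


def check_includes(response: str, keywords: List[str]) -> bool:
    """True if every keyword appears in the response (case-insensitive)."""
    lowered = response.lower()
    return all(k.lower() in lowered for k in keywords)


def check_excludes(response: str, forbidden: List[str]) -> bool:
    """True if none of the forbidden strings appear (case-insensitive)."""
    lowered = response.lower()
    return not any(f.lower() in lowered for f in forbidden)


def check_bullet_count(response: str, expected: int) -> bool:
    """True if exactly `expected` lines start with a '-' or '*' bullet."""
    bullets = [ln for ln in response.splitlines() if re.match(r"\s*[-*]\s+", ln)]
    return len(bullets) == expected


def instruction_accuracy(response: str, checks: List[Callable[[str], bool]]) -> float:
    """Fraction of constraints the response satisfies."""
    results = [check(response) for check in checks]
    return sum(results) / len(results)


# Hypothetical response with three bullets, checked against four constraints.
response = "- Use JSON output\n- Keep answers short\n- Cite sources"
checks = [
    lambda r: check_max_words(r, 20),
    lambda r: check_includes(r, ["json"]),
    lambda r: check_excludes(r, ["sorry"]),
    lambda r: check_bullet_count(r, 2),  # fails: the response has three bullets
]
score = instruction_accuracy(response, checks)
print(score)
```

Checks like these are cheap and unambiguous, which is also the limitation: none of them can tell whether the response understood a nuanced or ambiguous request.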

How we weight

Instruction following carries a 5% weight in BenchLM.ai's overall scoring. Although it is a smaller category, it directly measures reliability: a model that ignores constraints is unusable in production pipelines. See the instruction following leaderboard.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
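The filter-then-fallback rule can be sketched as follows. This is an illustration under stated assumptions, not BenchLM's code: the `BenchmarkRow` type, its `generated` and `cloned` flags, and the choice to renormalize the remaining weights are all hypothetical. The page only says that the category falls back to the remaining trustworthy rows.

```python
from dataclasses import dataclass
from typing import List, Optional

# Category weights as listed on this page.
WEIGHTS = {"IFEval": 0.65, "IFBench": 0.35}


@dataclass
class BenchmarkRow:
    benchmark: str
    score: float             # percent, 0-100
    generated: bool = False  # hypothetical flag: derived from other scores
    cloned: bool = False     # hypothetical flag: copied from a reference model


def category_score(rows: List[BenchmarkRow]) -> Optional[float]:
    """Score a model from trusted rows only, never filling gaps with synthetic values."""
    trusted = {
        r.benchmark: r.score
        for r in rows
        if not (r.generated or r.cloned) and r.benchmark in WEIGHTS
    }
    if not trusted:
        return None  # nothing trustworthy left, so no score is produced
    # Assumption: renormalize over the benchmarks that survived the filter.
    total_weight = sum(WEIGHTS[name] for name in trusted)
    return sum(WEIGHTS[name] * s for name, s in trusted.items()) / total_weight


# Made-up rows: the IFBench row is flagged as generated, so only IFEval counts.
rows = [
    BenchmarkRow("IFEval", 90.0),
    BenchmarkRow("IFBench", 80.0, generated=True),
]
print(category_score(rows))
```

Returning `None` when no trusted row remains keeps a missing score visibly missing, rather than letting a placeholder value enter the ranking.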

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.

| Benchmark | Weight | Status | Description |
|---|---|---|---|
| IFEval | 65% | Weighted | Tests ability to follow verifiable instructions like format constraints and content requirements |
| IFBench | 35% | Weighted | Instruction-following benchmark used in first-party comparison charts for agent-oriented reasoning models. |


About Instruction Following Benchmarks

Tests ability to follow verifiable instructions like format constraints and content requirements
