Instruction Following Benchmarks — IFEval Leaderboard
Ability to follow precise instructions and constraints
Bottom line: Instruction following is a reliability signal — a model that ignores constraints is unusable in production, regardless of how smart it is.
IFEval · IFBench
Best Instruction Following picks
BenchLM summaries for instruction following, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context window.
| Model | Provider | Metric | Value |
|---|---|---|---|
| Kimi K2.5 (Reasoning) | Moonshot AI | Category score | 100 |
| GLM-5.1 | Z.AI | Overall score | 84 |
| GLM-5 (Reasoning) | Z.AI | Avg price / 1M tokens | $0.00 |
| Mercury 2 | Inception | Tokens / sec | 789 |
| LFM2-24B-A2B | LiquidAI | TTFT | 0.42s |
| Nemotron 3 Ultra 500B | NVIDIA | Context window | 10M |
Top AI Models for Instruction Following — April 2026
As of April 2026, Kimi K2.5 (Reasoning) leads the provisional instruction following leaderboard with a weighted score of 100.0%, followed by Grok 4.20 Multi-agent (100.0%) and Grok 4.20 (97.1%). BenchLM is currently showing 118 provisional-ranked models and 14 verified-ranked models in this category.
1. Kimi K2.5 (Reasoning), Moonshot AI
2. Grok 4.20 Multi-agent, xAI
3. Grok 4.20, xAI
What changed
- Claude Mythos Preview leads IFEval with the highest instruction-following accuracy.
- GPT-5.4 is a close second on IFEval and IFBench.
- Claude Opus 4.6 holds #3, with strong IFBench scores.
How to choose
- Structured output (JSON, XML)? Claude Mythos Preview: highest IFEval accuracy.
- Production pipelines with constraints? GPT-5.4: strong on both IFEval and IFBench (see the validation sketch after this list).
- Reliable formatting on a budget? Gemini 3.1 Pro: good IFEval at low cost.
- Open-weight with good instruction following? GLM-5: best open-weight IFEval score.
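Whichever model you pick, a production pipeline should verify constraints itself rather than trust leaderboard scores. A minimal sketch of that pattern in Python; `call_model` is a hypothetical stand-in for whatever API you actually use, and the retry prompt is illustrative:

```python
import json


def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your model API call."""
    raise NotImplementedError


def generate_json(prompt: str, max_retries: int = 3) -> dict:
    """Ask for JSON output and retry until the response actually parses.

    Even top IFEval scorers occasionally break format constraints,
    so production pipelines validate instead of trusting the model.
    """
    for attempt in range(max_retries):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Feed the failure back so the retry is not a blind rerun.
            prompt += "\n\nYour last reply was not valid JSON. Reply with JSON only."
    raise ValueError(f"No valid JSON after {max_retries} attempts")
```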
Top models by benchmark
Tests ability to follow verifiable instructions like format constraints and content requirements (65% of category score)
IFEval Leaderboard
Updated April 21, 2026. Sorted by instruction following weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.
| # | Model | Provider | IF weighted | Overall | IFEval | IFBench |
|---|---|---|---|---|---|---|
| 1 | Kimi K2.5 (Reasoning) | Moonshot AI | 100% | Est. 78 | — | — |
| 2 | Grok 4.20 Multi-agent | xAI | 100% | Est. 74 | — | — |
| 3 | Grok 4.20 | xAI | 97.1% | 77 | — | — |
| 4 | Claude Opus 4.6 | Anthropic | 94.8% | 91 | — | — |
| 5 | GPT-5.4 | OpenAI | 93.4% | 93 | — | — |
| 6 | GPT-5.4 Pro | OpenAI | 93.4% | 92 | — | — |
| 7 | o1 | OpenAI | 93.3% | Est. 59 | 92.2% | — |
| 8 | GLM-5.1 | Z.AI | 92.9% | 84 | — | — |
| 9 | GPT-5.2-Codex | OpenAI | 92.6% | Est. 79 | — | — |
| 10 | Gemini 3.1 Pro | Google | 92.5% | 94 | — | — |
| 11 | GPT-5.3 Codex | OpenAI | 90.8% | Est. 89 | — | — |
| 12 | Qwen3.5-122B-A10B | Alibaba | 89.8% | 68 | 93.4% | — |
| 13 | Claude Mythos Preview | Anthropic | 89.6% | 99 | — | — |
| 14 | Qwen3.5-27B | Alibaba | 89.3% | 65 | 95% | — |
| 15 | GPT-5.1-Codex-Max | OpenAI | 88.9% | Est. 78 | — | — |
| 16 | o3-mini | OpenAI | 86.8% | Est. 58 | 93.9% | — |
| 17 | Qwen3.6 Plus | Alibaba | 85.9% | 76 | 94.3% | 75.8% |
| 18 | Grok 4.1 | xAI | 85.5% | Est. 80 | — | — |
| 19 | Claude Sonnet 4.5 | Anthropic | 85.2% | Est. 67 | — | — |
| 20 | GPT-5.2 | OpenAI | 84.8% | 83 | — | — |
| 21 | Kimi K2 | Moonshot AI | 84.4% | Est. 43 | — | — |
| 22 | Qwen3.5 397B | Alibaba | 83% | 65 | 92.6% | — |
| 23 | Mistral Medium 3 | Mistral | 83% | Est. 45 | — | — |
| 24 | Gemini 3 Pro Deep Think | Google | 82.7% | Est. 86 | — | — |
| 25 | GPT-5 (high) | OpenAI | 82.5% | Est. 79 | — | — |

Est. = estimated; — = not available.
Score in Context
What these scores mean
Instruction following carries a 5% weight in overall scoring. That weight is small, but it directly measures reliability: a model that ignores formatting rules, word count limits, or inclusion constraints is unusable in automated pipelines. The weighted score blends IFEval (verifiable constraints, 65%) and IFBench (35%).
Known limitations
IFEval tests a specific set of verifiable constraints (word count, formatting, inclusion/exclusion) — it doesn't capture the full range of instruction-following quality. A model can score well on IFEval but still misinterpret nuanced or ambiguous instructions. IFBench is newer and coverage is still building.
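"Verifiable" here means a script, not a judge model, can check compliance. A minimal sketch of what such checks look like; the constraint functions and the sample response below are hypothetical illustrations, not actual IFEval test cases:

```python
import json


def check_max_words(response: str, limit: int) -> bool:
    """Verifiable constraint: response must stay under a word count."""
    return len(response.split()) <= limit


def check_keyword_included(response: str, keyword: str) -> bool:
    """Verifiable constraint: response must mention a required term."""
    return keyword.lower() in response.lower()


def check_valid_json(response: str) -> bool:
    """Verifiable constraint: response must parse as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical example: score one response against several constraints.
response = '{"summary": "Shipping delayed two days."}'
checks = [
    check_max_words(response, 50),
    check_keyword_included(response, "shipping"),
    check_valid_json(response),
]
print(f"passed {sum(checks)}/{len(checks)} constraints")
```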
How we weight
Instruction following carries a 5% weight in BenchLM.ai's overall scoring. The category is small in scope, but it acts as a reliability gate: a model that ignores constraints is unusable in production pipelines regardless of its other scores.
Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.
| Benchmark | Weight | Status | Description |
|---|---|---|---|
| IFEval | 65% | Weighted | Tests ability to follow verifiable instructions like format constraints and content requirements |
| IFBench | 35% | Weighted | Instruction-following benchmark used in first-party comparison charts for agent-oriented reasoning models. |
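Under these weights the category score is a straight blend: weighted = 0.65 × IFEval + 0.35 × IFBench. A minimal sketch in Python; the renormalization fallback for a missing benchmark is an assumption drawn from the fallback note above, not a documented BenchLM formula:

```python
WEIGHTS = {"IFEval": 0.65, "IFBench": 0.35}


def category_score(scores: dict[str, float | None]) -> float:
    """Blend benchmark scores by weight, renormalizing over the
    benchmarks that actually have trustworthy public rows.

    NOTE: the renormalization fallback is an assumption based on the
    methodology notes above, not a documented BenchLM formula.
    """
    available = {b: s for b, s in scores.items() if s is not None}
    if not available:
        raise ValueError("no usable benchmark rows for this category")
    total_weight = sum(WEIGHTS[b] for b in available)
    return sum(WEIGHTS[b] * s for b, s in available.items()) / total_weight


# Both benchmarks present: 0.65 * 94.3 + 0.35 * 75.8 ≈ 87.8
print(category_score({"IFEval": 94.3, "IFBench": 75.8}))
# IFBench missing: score falls back to IFEval alone.
print(category_score({"IFEval": 94.3, "IFBench": None}))
```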
About Instruction Following Benchmarks
Tests ability to follow verifiable instructions like format constraints and content requirements