Instruction-Following Eval (IFEval)

A benchmark that evaluates language models' ability to follow verifiable instructions such as formatting constraints, keyword inclusion/exclusion, length limits, and structural requirements.

Top models on IFEval — April 21, 2026

As of April 21, 2026, Qwen3.5-27B leads the IFEval leaderboard with 95%, followed by Qwen3.6 Plus (94.3%) and Kimi K2.5 (93.9%).


The top three models are separated by just 1.1 points, suggesting this benchmark is nearing saturation for frontier models.

15 models have been evaluated on IFEval. The benchmark falls in the Instruction Following category. This category carries a 5% weight in BenchLM.ai's overall scoring system. Within that category, IFEval contributes 65% of the category score, so strong performance here directly affects a model's overall ranking.
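The weighting described above can be illustrated with a bit of arithmetic. This is a sketch based only on the percentages stated on this page, not BenchLM's actual scoring code:

```python
# Illustrative arithmetic, using the weights stated above (not BenchLM's code).
CATEGORY_WEIGHT = 0.05   # Instruction Following category's weight in the overall score
BENCHMARK_SHARE = 0.65   # IFEval's share of the category score

def overall_contribution(benchmark_score: float) -> float:
    """Points an IFEval score adds to the overall weighted score."""
    return benchmark_score * BENCHMARK_SHARE * CATEGORY_WEIGHT

# IFEval is effectively 65% of 5% = 3.25% of the overall score,
# so a 95% IFEval result contributes about 3.09 overall points.
print(overall_contribution(95.0))
```

Under these assumptions, IFEval alone accounts for about 3.25% of a model's overall BenchLM score.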

About IFEval

Year: 2023

Tasks: 500+ instructions

Format: Constrained generation

Difficulty: Instruction precision

IFEval uses verifiable instructions to objectively measure instruction-following ability. Instructions include requirements like 'write in all caps', 'include exactly 3 bullet points', or 'respond in JSON format', making evaluation automated and reproducible.
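Because each instruction is verifiable, checking compliance reduces to a deterministic test per constraint. A minimal sketch of what such checks could look like for the three examples above (hypothetical helper names, not the official IFEval implementation):

```python
import json
import re

# Hypothetical verifiers for three example instructions; each returns True
# if the model's response satisfies the constraint, making grading automated
# and reproducible.

def check_all_caps(response: str) -> bool:
    """'write in all caps': the response contains no lowercase letters."""
    return response == response.upper()

def check_bullet_count(response: str, n: int = 3) -> bool:
    """'include exactly 3 bullet points': count lines starting with '- ' or '* '."""
    bullets = re.findall(r"^\s*[-*] ", response, flags=re.MULTILINE)
    return len(bullets) == n

def check_json_format(response: str) -> bool:
    """'respond in JSON format': the whole response must parse as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

reply = "- ONE\n- TWO\n- THREE"
print(check_all_caps(reply), check_bullet_count(reply, 3), check_json_format(reply))
```

Each check is a pure function of the response text, which is what makes this style of evaluation objective: no judge model or human rater is needed.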

BenchLM freshness & provenance

Version: IFEval 2023

Refresh cadence: Static

Staleness state: Stale

Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (15 models)

1. Qwen3.5-27B: 95%
2. Qwen3.6 Plus: 94.3%
3. Kimi K2.5: 93.9%
4. 93.9%
5. 93.4%
6. 92.6%
7. 92.6%
8. 92.2%
9. 91.9%
10. 90.9%
11. 88.5%
12. 87.4%
13. 86.1%
14. 83.2%
15. 61.2%

FAQ

What does IFEval measure?

IFEval measures a language model's ability to follow verifiable instructions such as formatting constraints, keyword inclusion/exclusion, length limits, and structural requirements.

Which model scores highest on IFEval?

Qwen3.5-27B by Alibaba currently leads with a score of 95% on IFEval.

How many models are evaluated on IFEval?

15 AI models have been evaluated on IFEval on BenchLM.

Last updated: April 21, 2026 · BenchLM version IFEval 2023

AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.