Instruction-Following Eval (IFEval)

A benchmark that evaluates language models' ability to follow verifiable instructions such as formatting constraints, keyword inclusion/exclusion, length limits, and structural requirements.

About IFEval

Year: 2023
Tasks: 500+ instructions
Format: Constrained generation
Difficulty: Instruction precision

IFEval uses verifiable instructions to measure instruction-following ability objectively. Instructions include requirements such as "write in all caps", "include exactly 3 bullet points", or "respond in JSON format". Because compliance with each instruction can be checked programmatically, evaluation is automated and reproducible.
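The checks above can be automated with simple rule-based verifiers. The sketch below is illustrative, not the official IFEval implementation; the function names and the exact bullet-point convention are assumptions for this example.

```python
import json
import re

# Hypothetical verifiers in the spirit of IFEval's checks (not the official
# implementation). Each takes a model response and returns True when the
# constraint is satisfied, so scoring reduces to counting passed checks.

def check_all_caps(response: str) -> bool:
    """'Write in all caps': uppercasing must leave the text unchanged."""
    return response == response.upper()

def check_bullet_count(response: str, n: int = 3) -> bool:
    """'Include exactly n bullet points': count lines starting with '*' or '-'
    (assumed bullet convention for this sketch)."""
    bullets = re.findall(r"^\s*[*-]", response, flags=re.MULTILINE)
    return len(bullets) == n

def check_json(response: str) -> bool:
    """'Respond in JSON format': the entire response must parse as JSON."""
    try:
        json.loads(response)
        return True
    except ValueError:
        return False

response = "* ONE\n* TWO\n* THREE"
print(check_all_caps(response))      # True
print(check_bullet_count(response))  # True
print(check_json(response))          # False
```

Because each verifier is deterministic, two evaluation runs on the same responses always produce the same score, which is what makes the benchmark reproducible.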

Instruction-Following Evaluation for Large Language Models

Leaderboard (88 models)

#1 GPT-5.4: 95
#2 Claude Opus 4.6: 95
#3 Gemini 3.1 Pro: 95
#4 GPT-5.2: 94
#5 GPT-5.3 Codex: 93
#6 Grok 4.1: 93
#7 GPT-5.2-Codex: 92
#8 GLM-5 (Reasoning): 92
#10 Claude Sonnet 4.6: 91
#11 GPT-5 (high): 91
#12 Kimi K2.5 (Reasoning): 91
#13 Claude Opus 4.5: 90
#14 Claude Sonnet 4.5: 90
#17 GPT-5.1: 89
#19 Gemini 3 Pro: 88
#20 o1-preview: 88
#21 GPT-5 (medium): 88
#22 DeepSeek Coder 2.0: 86
#24 Claude Haiku 4.5: 86
#25 o3: 85
#27 GLM-5: 85
#28 GLM-4.7: 85
#29 Qwen2.5-72B: 85
#30 DeepSeek V3.2: 85
#31 DeepSeek LLM 2.0: 85
#32 Kimi K2.5: 85
#33 MiniMax M2.5: 85
#34 Gemini 3 Flash: 85
#35 Qwen2.5-1M: 84
#36 MiMo-V2-Flash: 84
#38 GLM-4.7-Flash: 84
#40 Gemini 2.5 Pro: 83
#41 o4-mini (high): 83
#42 DeepSeekMath V2: 83
#43 Claude 4.1 Opus: 83
#44 Mistral Large 3: 83
#45 Claude 4 Sonnet: 83
#46 Mistral Large 2: 83
#47 Claude 3.5 Sonnet: 83
#48 o3-pro: 82
#49 GPT-5 mini: 82
#50 Grok 4: 82
#51 Qwen3.5 397B: 82
#52 GPT-4o: 82
#53 GPT-4 Turbo: 80
#54 Z-1: 80
#57 Nemotron-4 15B: 79
#58 GPT-OSS 120B: 79
#59 Gemini 2.5 Flash: 79
#60 Mistral 8x7B: 78
#63 Gemini 1.5 Pro: 77
#64 Claude 3 Opus: 77
#65 Gemini 1.0 Pro: 77
#66 Llama 3 70B: 77
#67 Moonshot v1: 77
#68 Claude 3 Haiku: 76
#70 DeepSeek-R1: 69
#71 Qwen3 235B 2507: 69
#73 Llama 4 Scout: 68
#76 GLM-4.5: 68
#77 MiniMax M1 80k: 68
#78 GLM-4.5-Air: 68
#79 Mistral 7B v0.3: 68
#80 Gemma 3 27B: 67
#81 Qwen2.5-VL-32B: 67
#83 DeepSeek V3.1: 67
#84 Kimi K2: 67
#85 GPT-OSS 20B: 67
#86 Mistral 8x7B v0.2: 67
#87 Nova Pro: 66

FAQ

What does IFEval measure?

IFEval measures a language model's ability to follow verifiable instructions: formatting constraints, keyword inclusion or exclusion, length limits, and structural requirements. Each instruction can be checked programmatically, so no human or LLM judge is needed.

Which model scores highest on IFEval?

GPT-5.4 by OpenAI currently leads IFEval with a score of 95, with Claude Opus 4.6 and Gemini 3.1 Pro tied at the same score.

How many models are evaluated on IFEval?

88 AI models have been evaluated on IFEval on BenchLM.