Instruction-Following Eval (IFEval)

A benchmark that evaluates language models' ability to follow verifiable instructions such as formatting constraints, keyword inclusion/exclusion, length limits, and structural requirements.

About IFEval

Year: 2023
Tasks: 500+ instructions
Format: Constrained generation
Difficulty: Instruction precision

IFEval uses verifiable instructions to measure instruction-following ability objectively. Instructions include requirements such as "write in all caps", "include exactly 3 bullet points", or "respond in JSON format". Because compliance with each instruction can be checked programmatically, evaluation is automated and reproducible.
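The checks above can be automated with simple rule-based verifiers. The sketch below is illustrative, not the official IFEval implementation; the function names and the exact bullet-point convention are assumptions for this example.

```python
import json
import re

# Hypothetical verifiers in the spirit of IFEval's checks (not the official
# implementation). Each takes a model response and returns True when the
# constraint is satisfied, so scoring reduces to counting passed checks.

def check_all_caps(response: str) -> bool:
    """'Write in all caps': uppercasing must leave the text unchanged."""
    return response == response.upper()

def check_bullet_count(response: str, n: int = 3) -> bool:
    """'Include exactly n bullet points': count lines starting with '*' or '-'
    (assumed bullet convention for this sketch)."""
    bullets = re.findall(r"^\s*[*-]", response, flags=re.MULTILINE)
    return len(bullets) == n

def check_json(response: str) -> bool:
    """'Respond in JSON format': the entire response must parse as JSON."""
    try:
        json.loads(response)
        return True
    except ValueError:
        return False

response = "* ONE\n* TWO\n* THREE"
print(check_all_caps(response))      # True
print(check_bullet_count(response))  # True
print(check_json(response))          # False
```

Because each verifier is deterministic, two evaluation runs on the same responses always produce the same score, which is what makes the benchmark reproducible.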

Instruction-Following Evaluation for Large Language Models

Leaderboard (88 models)

#1 GPT-5.4: 95
#2 Claude Opus 4.6: 95
#3 Gemini 3.1 Pro: 95
#4 GPT-5.2: 94
#5 GPT-5.3 Codex: 93
#6 Grok 4.1: 93
#7 GPT-5.2-Codex: 92
#8 GLM-5 (Reasoning): 92
#10 Claude Sonnet 4.6: 91
#11 GPT-5 (high): 91
#12 Kimi K2.5 (Reasoning): 91
#13 Claude Opus 4.5: 90
#14 Claude Sonnet 4.5: 90
#17 GPT-5.1: 89
#19 Gemini 3 Pro: 88
#20 o1-preview: 88
#21 GPT-5 (medium): 88
#22 DeepSeek Coder 2.0: 86
#24 Claude Haiku 4.5: 86
#25 o3: 85
#27 GLM-5: 85
#28 GLM-4.7: 85
#29 Qwen2.5-72B: 85
#30 DeepSeek V3.2: 85
#31 DeepSeek LLM 2.0: 85
#32 Kimi K2.5: 85
#33 MiniMax M2.5: 85
#34 Gemini 3 Flash: 85
#35 Qwen2.5-1M: 84
#36 MiMo-V2-Flash: 84
#38 GLM-4.7-Flash: 84
#40 Gemini 2.5 Pro: 83
#41 o4-mini (high): 83
#42 DeepSeekMath V2: 83
#43 Claude 4.1 Opus: 83
#44 Mistral Large 3: 83
#45 Claude 4 Sonnet: 83
#46 Mistral Large 2: 83
#47 Claude 3.5 Sonnet: 83
#48 o3-pro: 82
#49 GPT-5 mini: 82
#50 Grok 4: 82
#51 Qwen3.5 397B: 82
#52 GPT-4o: 82
#53 GPT-4 Turbo: 80
#54 Z-1: 80
#57 Nemotron-4 15B: 79
#58 GPT-OSS 120B: 79
#59 Gemini 2.5 Flash: 79
#60 Mistral 8x7B: 78
#63 Gemini 1.5 Pro: 77
#64 Claude 3 Opus: 77
#65 Gemini 1.0 Pro: 77
#66 Llama 3 70B: 77
#67 Moonshot v1: 77
#68 Claude 3 Haiku: 76
#70 DeepSeek-R1: 69
#71 Qwen3 235B 2507: 69
#73 Llama 4 Scout: 68
#76 GLM-4.5: 68
#77 MiniMax M1 80k: 68
#78 GLM-4.5-Air: 68
#79 Mistral 7B v0.3: 68
#80 Gemma 3 27B: 67
#81 Qwen2.5-VL-32B: 67
#83 DeepSeek V3.1: 67
#84 Kimi K2: 67
#85 GPT-OSS 20B: 67
#86 Mistral 8x7B v0.2: 67
#87 Nova Pro: 66

FAQ

What does IFEval measure?

IFEval measures a language model's ability to follow verifiable instructions: formatting constraints, keyword inclusion or exclusion, length limits, and structural requirements. Each instruction can be checked programmatically, so no human or LLM judge is needed.

Which model scores highest on IFEval?

GPT-5.4 by OpenAI currently leads IFEval with a score of 95, with Claude Opus 4.6 and Gemini 3.1 Pro tied at the same score.

How many models are evaluated on IFEval?

88 AI models have been evaluated on IFEval on BenchLM.