What is the best LLM for instruction following?

The best instruction-following LLMs are ranked by benchmarks like IFEval, which measure how accurately models adhere to specific formatting, length, and content constraints.

What is IFEval and how does it work?

IFEval (Instruction Following Evaluation) tests whether LLMs can precisely follow verifiable instructions such as word count limits, formatting rules, and inclusion/exclusion constraints.

Why is instruction following important for LLMs?

Instruction following is critical because real-world applications require models to reliably adhere to user specifications, output formats, and constraints rather than just generating plausible text.

Instruction Following

Instruction Following Benchmarks — IFEval Leaderboard

Name: Instruction Following Benchmarks — LLM Leaderboard
Creator: BenchLM.ai

Ability to follow precise instructions and constraints

Bottom line: Instruction following is a reliability signal — a model that ignores constraints is unusable in production, regardless of how smart it is.

IFEval · IFBench · SOB Value Acc

Best Instruction Following picks

BenchLM summaries for instruction following plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.

How BenchLM scores these

Best Instruction Following

Kimi K2.5 (Reasoning)

category score

Moonshot AI

Best Open Weight

DeepSeek V4 Pro (Max)

overall score

DeepSeek

Cheapest

Qwen3.6-27B

$0.00

avg / 1M tokens

Alibaba

Fastest

Mercury 2

789

tokens / sec

Inception

Lowest Latency

Command A+

0.25s

TTFT

Cohere

Largest Context

Llama 4 Scout

10M

context window

Top AI Models for Instruction Following — June 2026

As of June 2026, Grok 4.20 Multi-agent leads the provisional instruction following leaderboard with a weighted score of 100.0%, followed by Kimi K2.5 (Reasoning) (98.9%) and GPT-5.4 (96.0%). BenchLM is currently showing 126 provisional-ranked models and 23 verified-ranked models in this category.

1Proprietary

Grok 4.20 Multi-agent

xAI

100.0%weighted

2Proprietary

Kimi K2.5 (Reasoning)

Moonshot AI

98.9%weighted

3Proprietary

GPT-5.4

OpenAI

96.0%weighted

126 provisional-ranked23 verified-ranked3 benchmarksUpdated June 2, 2026

What changed

Claude Mythos Preview leads IFEval with the highest instruction-following accuracy.

GPT-5.4 close second on IFEval and IFBench.

Claude Opus 4.6 holds #3, with strong IFBench scores.

How to choose

Structured output (JSON, XML)?

Claude Mythos Preview — highest IFEval accuracy

Production pipelines with constraints?

GPT-5.4 — strong on both IFEval and IFBench

Reliable formatting on a budget?

Gemini 3.1 Pro — good IFEval at low cost

Open-weight with good instruction following?

GLM-5 — best open-weight IFEval score

Top models by benchmark

Tests ability to follow verifiable instructions like format constraints and content requirements(65% of category score)

94.3

94.3

93.9

~93.9

IFEval Leaderboard

Updated June 2, 2026

Sorted by instruction following weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.

126 ranked models

CSV JSON

Provisional-ranked mode includes source-unverified non-generated benchmark evidence.P = provisional benchmark row


1 Grok 4.20 Multi-agent xAI	Closed	Reasoning	2M	N/A	N/A	N/A	100%	Est.74	—	—	—
2 Kimi K2.5 (Reasoning) Moonshot AI Self-host	Closed	Reasoning	128K	$0.60 / $3.00	N/A	N/A	98.9%	76	—	—	—
3 GPT-5.4 OpenAI	Closed	Reasoning	1.05M	$2.50 / $15.00	74	151.79s	96%	89	—	—	—
4 Grok 4.20 xAI	Closed	Reasoning	2M	$2.00 / $6.00	233	10.33s	96%	72	—	—	—
5 Claude Opus 4.6 Anthropic	Closed	Standard	1M	$5.00 / $25.00	40	1.78s	95.3%	87	—	—	—
6 GPT-5.4 Pro OpenAI	Closed	Reasoning	1.05M	$30.00 / $180.00	74	151.79s	94.2%	91	—	—	—
7 Qwen3.7 Max Alibaba	Closed	Reasoning	1M	N/A	N/A	N/A	93.7%	91	94.3%	79.1%	—
8 Gemini 3.1 Pro Google	Closed	Standard	1M	$2.00 / $12.00	109	29.71s	93.5%	92	—	—	—
9 GLM-5.1 Z.AI Self-host	Open	Reasoning	203K	$1.40 / $4.40	N/A	N/A	92.3%	82	—	—	—
10 o1 OpenAI	Closed	Reasoning	200K	$15.00 / $60.00	98	32.29s	92.3%	Est.57	92.2%	—	—
11 GPT-5.2-Codex OpenAI	Closed	Reasoning	400K	$1.75 / $14.00	123	87.34s	91.6%	Est.76	—	—	—
12 Claude Mythos Preview Anthropic	Closed	Reasoning	1M	$25.00 / $125.00	N/A	N/A	91.4%	99	—	—	—
13 GPT-5.3 Codex OpenAI	Closed	Reasoning	400K	$1.75 / $14.00	79	88.26s	91.4%	Est.86	—	—	—
14 Grok 4.1 xAI	Closed	Standard	1M	N/A	N/A	N/A	90.6%	Est.90	—	—	—
15 Qwen3.5-27B Alibaba Self-host	Open	Reasoning	262K	$0.00 / $0.00	N/A	N/A	89.2%	62	95%	—	—
16 GPT-5.1-Codex-Max OpenAI	Closed	Reasoning	400K	$1.25 / $10.00	N/A	N/A	87.9%	Est.75	—	—	—
17 Grok 4.3 xAI	Closed	Reasoning	1M	$1.25 / $2.50	209	12.36s	87.7%	Est.75	—	81.3%	—
18 Qwen3.5-122B-A10B Alibaba Self-host	Open	Reasoning	262K	$0.00 / $0.00	N/A	N/A	87.4%	64	93.4%	—	—
19 Qwen3.6 Plus Alibaba	Closed	Reasoning	1M	N/A	N/A	N/A	87.2%	73	94.3%	75.8%	—
20 o3-mini OpenAI	Closed	Reasoning	200K	$1.10 / $4.40	160	7.12s	85.4%	Est.55	93.9%	—	—
21 GPT-5.2 OpenAI	Closed	Reasoning	400K	$1.75 / $14.00	73	130.34s	85.2%	79	—	—	—
22 Qwen3.5 397B Alibaba Self-host	Open	Standard	128K	$0.60 / $3.60	96	2.44s	84.8%	63	92.6%	—	—
23 Kimi K2.5 Moonshot AI Self-host	Open	Standard	256K	$0.60 / $3.00	45	2.38s	84.5%	64	93.9%	—	—
24 GLM-5 Z.AI Self-host	Open	Standard	200K	$1.00 / $3.20	74	1.64s	84.3%	67	92.6%	—	—
25 Claude Sonnet 4.5 Anthropic	Closed	Standard	200K	$3.00 / $15.00	N/A	N/A	84.2%	Est.65	—	—	—

Showing 25 of 126

These rankings update weekly

Get notified when models move. One email a week with what changed and why.

Free. No spam. Unsubscribe anytime.

Score in Context

What these scores mean

Instruction following carries a 5% weight in overall scoring — small, but it directly measures reliability. A model that ignores formatting rules, word count limits, or inclusion constraints is unusable in automated pipelines. The weighted score blends IFEval (verifiable constraints) and IFBench.

Known limitations

IFEval tests a specific set of verifiable constraints (word count, formatting, inclusion/exclusion) — it doesn't capture the full range of instruction-following quality. A model can score well on IFEval but still misinterpret nuanced or ambiguous instructions. IFBench is newer and coverage is still building.

How we weight

Instruction following carries a 5% weight in BenchLM.ai's overall scoring. While a smaller category, it directly measures reliability — a model that ignores constraints is unusable in production pipelines. See the instruction following leaderboard.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.

Benchmark	Weight	Status	Description
IFEval	65%	Weighted	Tests ability to follow verifiable instructions like format constraints and content requirements
IFBench	35%	Weighted	Instruction-following benchmark used in first-party comparison charts for agent-oriented reasoning models.
SOB Value Acc	—	Display only	A structured-output benchmark from Interfaze measuring whether extracted JSON leaf values exactly match verified ground truth.

Instruction following updates

Instruction following scores just got interesting. Get the weekly update.

Free. No spam. Unsubscribe anytime.

About Instruction Following Benchmarks

Tests ability to follow verifiable instructions like format constraints and content requirements

Best LLMs Overall

Top models ranked across all benchmark categories.

View

Reasoning Benchmarks

Multi-step inference and logical deduction leaderboard.

View

Agentic Benchmarks

How models perform on autonomous tasks.

View

LLM Selector Quiz

Find the best model for your specific needs.

View

Instruction Following Benchmarks — IFEval Leaderboard

Best Instruction Following picks

Top AI Models for Instruction Following — June 2026

What changed

How to choose

Top models by benchmark

IFEval Leaderboard

These rankings update weekly

Score in Context

What these scores mean

Known limitations

How we weight

Instruction following updates

About Instruction Following Benchmarks

Related

Stay ahead of the LLM curve