Instruction Following Benchmark (IFBench)

IFBench evaluates how well precise instruction following generalizes, using 58 challenging, verifiable, out-of-domain constraints. Unlike IFEval, which tests familiar constraint types, IFBench measures how well models follow novel instructions they have not been optimized for, which exposes overfitting to common instruction patterns.

Top models on IFBench — April 16, 2026

As of April 16, 2026, Qwen3.6 Plus leads the IFBench leaderboard with 75.8%, followed by Claude Opus 4.5 (58%).

2 models · Instruction Following · 35% of category score · Current · Updated April 16, 2026

About IFBench

Year: 2025

Tasks: 58

BenchLM freshness & provenance

Version: IFBench 2025

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (2 models)

1. Qwen3.6 Plus: 75.8%
2. Claude Opus 4.5: 58%

FAQ

What does IFBench measure?

IFBench evaluates how well precise instruction following generalizes, using 58 challenging, verifiable, out-of-domain constraints. Unlike IFEval, which tests familiar constraint types, IFBench measures how well models follow novel instructions they have not been optimized for, which exposes overfitting to common instruction patterns.
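
To make "verifiable constraint" concrete, here is a minimal Python sketch of the general idea: each constraint is a function that checks a model response programmatically, and a score is the fraction of checks that pass. The constraint names (`word_count_between`, `avoids_letter`) and the `pass_rate` helper are hypothetical illustrations. They are not taken from the IFBench constraint set, and this is not BenchLM's or IFBench's scoring code.

```python
import re
from typing import Callable, List, Tuple

# A check takes a model response and returns True if the constraint is met.
Check = Callable[[str], bool]


def word_count_between(min_words: int, max_words: int) -> Check:
    """Hypothetical constraint: response length must fall in a word range."""
    def check(response: str) -> bool:
        n = len(re.findall(r"[A-Za-z0-9']+", response))
        return min_words <= n <= max_words
    return check


def avoids_letter(letter: str) -> Check:
    """Hypothetical constraint: response must not contain a given letter."""
    def check(response: str) -> bool:
        return letter.lower() not in response.lower()
    return check


def pass_rate(cases: List[Tuple[str, Check]]) -> float:
    """Fraction of (response, check) pairs where the check passes."""
    if not cases:
        return 0.0
    passed = sum(1 for response, check in cases if check(response))
    return passed / len(cases)


cases = [
    ("Blue skies bring calm.", word_count_between(3, 5)),
    ("Blue skies bring calm.", avoids_letter("z")),
    ("A long answer that rambles on.", word_count_between(1, 3)),
]
print(f"{pass_rate(cases):.1%}")
```

Because every check is a plain function over the response text, results can be reproduced without a human or model judge, which is what "verifiable" is assumed to mean here.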

Which model scores highest on IFBench?

Qwen3.6 Plus by Alibaba currently leads with a score of 75.8% on IFBench.

How many models are evaluated on IFBench?

Two AI models have been evaluated on IFBench on BenchLM.

Last updated: April 16, 2026 · BenchLM version IFBench 2025
