IFBench evaluates precise instruction-following generalization on 58 challenging, verifiable, out-of-domain constraints. Unlike IFEval, which tests familiar constraint types, IFBench measures how well models follow novel instructions they haven't been optimized for, exposing overfitting to common instruction patterns.
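Constraints of this kind are "verifiable" because compliance can be checked programmatically rather than by a judge model. A minimal sketch of such a checker, using a hypothetical word-count constraint (not an actual IFBench rule):

```python
import re

def verify_word_count(response: str, exact_words: int) -> bool:
    """Check a verifiable constraint: the response must contain
    exactly `exact_words` words. Hypothetical example constraint."""
    words = re.findall(r"\b\w+\b", response)
    return len(words) == exact_words

# A response either satisfies the constraint or it does not,
# so scoring requires no human or model-based grading.
print(verify_word_count("one two three", 3))  # True
print(verify_word_count("one two three", 4))  # False
```

Because each check is deterministic, scores are reproducible across evaluation runs.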
As of April 16, 2026, Qwen3.6 Plus leads the IFBench leaderboard with 75.8%, followed by Claude Opus 4.5 (58%).
1. Qwen3.6 Plus (Alibaba)
2. Claude Opus 4.5 (Anthropic)
Year: 2025
Tasks: 58
Version: IFBench 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Qwen3.6 Plus by Alibaba currently leads with a score of 75.8% on IFBench.
Two AI models have been evaluated on IFBench on BenchLM.