
Instruction Following Benchmark (IFBench)

IFBench evaluates precise instruction-following generalization on 58 challenging, verifiable out-of-domain constraints. Unlike IFEval, which tests familiar constraint types, IFBench measures how well models follow novel instructions they have not been optimized for, which exposes overfitting to common instruction patterns.
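To make "verifiable constraint" concrete, here is a minimal sketch of how a response could be checked programmatically. The two constraints (`has_exact_bullets`, `keyword_at_least`) and the sample response are hypothetical illustrations, not items from the IFBench constraint set, and the pass/fail rule is an assumption rather than the benchmark's actual scoring code.

```python
import re
from typing import Callable, List

# A check takes a model response and returns True if the constraint is met.
Check = Callable[[str], bool]


def has_exact_bullets(n: int) -> Check:
    """Hypothetical constraint: the response contains exactly n '- ' bullets."""
    def check(response: str) -> bool:
        bullets = [ln for ln in response.splitlines() if ln.lstrip().startswith("- ")]
        return len(bullets) == n
    return check


def keyword_at_least(keyword: str, times: int) -> Check:
    """Hypothetical constraint: keyword appears at least `times` times (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)

    def check(response: str) -> bool:
        return len(pattern.findall(response)) >= times
    return check


def follows_all(response: str, checks: List[Check]) -> bool:
    """Assumed pass rule: a response passes only if every constraint holds."""
    return all(check(response) for check in checks)


response = "- Latency matters.\n- Latency budgets guide design.\n- Measure first."
checks = [has_exact_bullets(3), keyword_at_least("latency", 2)]
print(follows_all(response, checks))  # prints True
```

Because each check is a deterministic function of the response text, results can be verified without a human or model judge, which is the property the description above refers to.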

Top models on IFBench — May 13, 2026

As of May 13, 2026, Grok 4.3 leads the IFBench leaderboard with 81.3%, followed by Qwen3.6 Plus (75.8%) and Nemotron 3 Nano Omni 30B A3B (74.2%).

7 models · Instruction Following · 35% of category score · Current · Updated May 13, 2026

According to BenchLM.ai, scores on IFBench range from 52.6% to 81.3% across the seven evaluated models. That 28.7-point spread makes the benchmark effective at differentiating model capabilities.

7 models have been evaluated on IFBench. The benchmark falls in the Instruction Following category. This category carries a 5% weight in BenchLM.ai's overall scoring system. Within that category, IFBench contributes 35% of the category score, so strong performance here directly affects a model's overall ranking.
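The two weights above can be combined into a rough effective weight. The sketch below assumes the category weight and the within-category share simply multiply, and that scores enter the overall total linearly; the page does not confirm either, so treat the function names and the formula as illustrative and defer to the BenchLM methodology page for the actual policy.

```python
CATEGORY_WEIGHT = 0.05   # Instruction Following share of the overall score (from the text)
BENCHMARK_SHARE = 0.35   # IFBench share of the category score (from the text)


def effective_weight(category_weight: float, benchmark_share: float) -> float:
    """Assumed formula: overall weight = category weight x share within category."""
    return category_weight * benchmark_share


def overall_points(score_pct: float) -> float:
    """Points an IFBench score would add to a 0-100 overall score under that assumption."""
    return score_pct * effective_weight(CATEGORY_WEIGHT, BENCHMARK_SHARE)


weight = effective_weight(CATEGORY_WEIGHT, BENCHMARK_SHARE)
print(f"Effective weight of IFBench: {weight:.2%}")
print(f"Points from an 81.3% score: {overall_points(81.3):.2f}")
```

Under these assumptions IFBench accounts for about 1.75% of the overall score, so even the full gap between the top and bottom entries on this leaderboard would move an overall score by roughly half a point.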

About IFBench

Year: 2025

Tasks: 58

BenchLM freshness & provenance

Version: IFBench 2025

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Leaderboard (7 models)

1. Grok 4.3: 81.3%
2. Qwen3.6 Plus: 75.8%
3. Nemotron 3 Nano Omni 30B A3B: 74.2%
4. 63.1%
5. 58%
6. 57%
7. 52.6%
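The spread across the leaderboard can be summarized with a few lines of arithmetic. This sketch uses the seven scores reported on this page (the third-place score of 74.2% comes from the summary above); the variable names are illustrative and the statistics chosen are just one reasonable way to describe the spread.

```python
from statistics import mean

# IFBench scores in rank order, as reported on this page (percent).
scores = [81.3, 75.8, 74.2, 63.1, 58.0, 57.0, 52.6]

spread = round(max(scores) - min(scores), 1)    # gap between first and last place
lead_margin = round(scores[0] - scores[1], 1)   # gap between first and second place
average = round(mean(scores), 1)                # mean score across all seven models

print(f"Spread: {spread} points")
print(f"Lead margin: {lead_margin} points")
print(f"Average: {average}%")
```

The results are rounded to one decimal place to avoid floating-point noise in the printed values.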

FAQ

What does IFBench measure?

IFBench evaluates precise instruction-following generalization on 58 challenging, verifiable out-of-domain constraints. Unlike IFEval, which tests familiar constraint types, IFBench measures how well models follow novel instructions they have not been optimized for, which exposes overfitting to common instruction patterns.

Which model scores highest on IFBench?

Grok 4.3 by xAI currently leads with a score of 81.3% on IFBench.

How many models are evaluated on IFBench?

BenchLM has evaluated 7 AI models on IFBench.

Last updated: May 13, 2026 · BenchLM version IFBench 2025
