IFBench evaluates precise instruction-following generalization on 58 challenging, verifiable, out-of-domain constraints. Unlike IFEval, which tests familiar constraint types, IFBench measures how well models follow novel instructions they haven't been optimized for, exposing overfitting to common instruction patterns.
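Constraints of this kind are "verifiable" because compliance can be checked programmatically rather than by a judge model. A minimal sketch of such a checker, using a hypothetical word-count constraint (not an actual IFBench rule):

```python
import re

def verify_word_count(response: str, exact_words: int) -> bool:
    """Check a verifiable constraint: the response must contain
    exactly `exact_words` words. Hypothetical example constraint."""
    words = re.findall(r"\b\w+\b", response)
    return len(words) == exact_words

# A response either satisfies the constraint or it does not,
# so scoring requires no human or model-based grading.
print(verify_word_count("one two three", 3))  # True
print(verify_word_count("one two three", 4))  # False
```

Because each check is deterministic, scores are reproducible across evaluation runs.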
As of April 16, 2026, Qwen3.6 Plus leads the IFBench leaderboard with 75.8%, followed by Claude Opus 4.5 (58%).
1. Qwen3.6 Plus (Alibaba)
2. Claude Opus 4.5 (Anthropic)
Year: 2025
Tasks: 58
Version: IFBench 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Qwen3.6 Plus by Alibaba currently leads with a score of 75.8% on IFBench.
Two AI models have been evaluated on IFBench on BenchLM.