MMLU-Redux

Name: MMLU-Redux
Creator: BenchLM

A harder refresh of MMLU intended to keep broad knowledge evaluation useful after the original benchmark became too easy for frontier models.

Benchmark score on MMLU-Redux — May 20, 2026

BenchLM mirrors the published score view for MMLU-Redux. Claude Opus 4.5 leads the public snapshot at 96.6% , followed by Qwen3.7 Max (95%) and Qwen3.5 397B (94.9%). BenchLM does not use these results to rank models overall.

1Closed

Claude Opus 4.5

Anthropic

96.6%

Overall 77Context 200K

2Closed

Qwen3.7 Max

Alibaba

95%

Overall 93Context 1M

3Open

Qwen3.5 397B

Alibaba

94.9%

Overall 64Context 128K

7 modelsKnowledgeCurrentDisplay onlyUpdated May 20, 2026

The published MMLU-Redux snapshot is tightly clustered at the top: Claude Opus 4.5 sits at 96.6%, while the third row is only 1.7 points behind. The broader top-10 spread is 7.2 points, so many of the published scores sit in a relatively narrow band.

7 models have been evaluated on MMLU-Redux. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. MMLU-Redux is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About MMLU-Redux

Year

2026

Tasks

Broad academic QA

Format

Multiple choice questions

Difficulty

Advanced general knowledge

MMLU-Redux is useful when MMLU itself has largely saturated. It acts as a broader knowledge sanity check with fresher or harder questions intended to preserve separation among strong general-purpose models.

Qwen3.6 launch benchmarks

BenchLM freshness & provenance

Version

MMLU-Redux 2026

Refresh cadence

Quarterly

Staleness state

Current

Question availability

Public benchmark set

CurrentDisplay only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.