KMMLU-Hard

A filtered hard subset of KMMLU containing ~5,000 questions that most models get wrong.

Top Models on KMMLU-Hard — March 2026

As of March 2026, GPT-5.4 leads the KMMLU-Hard leaderboard with 72.8%, followed by GPT-5 mini (60.6%) and GPT-5 nano (51.7%).

11 models | Korean Benchmarks | Korean-language benchmark | Updated March 18, 2026

According to BenchLM.ai, GPT-5.4 leads the KMMLU-Hard benchmark with a score of 72.8%, followed by GPT-5 mini (60.6%) and GPT-5 nano (51.7%). There is significant spread across the leaderboard, making this benchmark effective at differentiating model capabilities.

11 models have been evaluated on KMMLU-Hard. The benchmark falls in the Korean Benchmarks category. BenchLM tracks this category separately from its weighted global scoring system, so these results are best compared on the dedicated Korean benchmark views. KMMLU-Hard is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About KMMLU-Hard

Year: 2025
Tasks: ~5,000 questions
Format: Multiple choice questions
Difficulty: Advanced Korean reasoning

KMMLU-Hard provides a strong signal of how well advanced frontier models reason in Korean.

Evaluating LLMs on Hard Korean Queries

Leaderboard (11 models)

#1 GPT-5.4: 72.8%
#2 GPT-5 mini: 60.6%
#3 GPT-5 nano: 51.7%
#4 GPT-5.2: 51.1%
#5 GPT-5.1: 43.9%
#6 GPT-4.1: 42.8%
#7 GPT-4o: 39.6%
#8 GPT-4.1 mini: 35.6%
#9 GPT-4 Turbo: 30.6%
#10 GPT-4o mini: 24.6%
#11 GPT-4.1 nano: 24.3%
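
The spread across this leaderboard can be quantified directly from the eleven listed scores. The following is a minimal Python sketch, not part of BenchLM's tooling: the `SCORES` dictionary and the `summarize` helper are illustrative names, and the values are copied by hand from the leaderboard.

```python
from statistics import mean, median

# Scores in percent, copied from the KMMLU-Hard leaderboard (March 18, 2026).
SCORES = {
    "GPT-5.4": 72.8,
    "GPT-5 mini": 60.6,
    "GPT-5 nano": 51.7,
    "GPT-5.2": 51.1,
    "GPT-5.1": 43.9,
    "GPT-4.1": 42.8,
    "GPT-4o": 39.6,
    "GPT-4.1 mini": 35.6,
    "GPT-4 Turbo": 30.6,
    "GPT-4o mini": 24.6,
    "GPT-4.1 nano": 24.3,
}


def summarize(scores):
    """Return simple spread statistics (hypothetical helper, for illustration)."""
    ranked = sorted(scores.values(), reverse=True)
    return {
        # Gap between the best and worst score, in percentage points.
        "range": round(ranked[0] - ranked[-1], 1),
        # Gap between first and second place, in percentage points.
        "lead_margin": round(ranked[0] - ranked[1], 1),
        "mean": round(mean(ranked), 1),
        "median": median(ranked),
    }


summary = summarize(SCORES)
for name, value in summary.items():
    print(f"{name}: {value}")
```

On the scores listed here, the range works out to 48.5 percentage points and the top model leads second place by 12.2 points, which is consistent with the description of a widely spread leaderboard.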

FAQ

What does KMMLU-Hard measure?

A filtered hard subset of KMMLU containing ~5,000 questions that most models get wrong.

Which model scores highest on KMMLU-Hard?

GPT-5.4 by OpenAI currently leads with a score of 72.8% on KMMLU-Hard.

How many models are evaluated on KMMLU-Hard?

11 AI models have been evaluated on KMMLU-Hard on BenchLM.

Last updated: March 18, 2026
