A filtered hard subset of KMMLU containing ~5,000 questions that most models get wrong.
As of March 2026, GPT-5.4 leads the KMMLU-Hard leaderboard with 72.8%, followed by GPT-5 mini (60.6%) and GPT-5 nano (51.7%).
1. GPT-5.4 (OpenAI): 72.8%
2. GPT-5 mini (OpenAI): 60.6%
3. GPT-5 nano (OpenAI): 51.7%
According to BenchLM.ai, the significant spread across the leaderboard makes this benchmark effective at differentiating model capabilities.
11 models have been evaluated on KMMLU-Hard. The benchmark falls in the Korean Benchmarks category. BenchLM tracks this category separately from its weighted global scoring system, so these results are best compared on the dedicated Korean benchmark views. KMMLU-Hard is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
Year: 2025
Tasks: ~5,000 questions
Format: Multiple-choice questions
Difficulty: Advanced Korean reasoning
Provides strong signals for advanced frontier models attempting reasoning in Korean.
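The "hard subset" construction described above (keeping only questions that most models get wrong) can be sketched as a simple filter over per-model correctness records. This is an illustrative assumption about the methodology, not the actual KMMLU-Hard pipeline; the model names and threshold below are hypothetical.

```python
# Sketch of deriving a hard subset from per-model evaluation results.
# results maps question id -> {model name: answered correctly?}.
# A question is "hard" if at most `max_correct_models` models solved it.

def hard_subset(results, max_correct_models=0):
    return [
        qid
        for qid, per_model in results.items()
        if sum(per_model.values()) <= max_correct_models
    ]

# Toy example with three hypothetical models:
results = {
    "q1": {"model_a": True, "model_b": True, "model_c": False},
    "q2": {"model_a": False, "model_b": False, "model_c": False},
    "q3": {"model_a": True, "model_b": False, "model_c": False},
}

print(hard_subset(results))                        # no model solved it
print(hard_subset(results, max_correct_models=1))  # at most one model solved it
```

Raising the threshold trades off difficulty against subset size; a strict threshold of zero keeps only questions every evaluated model missed.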
Evaluating LLMs on Hard Korean Queries