BENCHMARK EXPLAINER

What is KMMLU? The Korean Large Language Model Benchmark Explained

KMMLU is the clearest public benchmark for measuring whether frontier models can actually reason in Korean, not just translate English competence into Korean output.

Unlike translated benchmark sets, KMMLU is built from Korean-native exams and professional subject matter. That makes it a much stronger signal for localized law, history, culture, and high-context knowledge work.

BenchLM tracks KMMLU alongside adjacent Korean evaluations like KMMLU-Hard, KMMLU-Pro, CLIcK, and KoBALT, so you can compare global frontier models against Korean-market specialists on the same leaderboard.

Questions: 35,030 multiple-choice items
Coverage: 45 localized subject areas
Why it matters: native sourcing, built for Korean context rather than translated into it

Why translated English benchmarks are not enough

Historically, non-English evaluation often meant taking English benchmarks like MMLU or GSM8K and translating them. That works poorly for real local-market model selection.

A translated algebra question may still test math. A translated question about US constitutional law does not tell you much about a model deployed in Seoul for legal, public-sector, or education use.

KMMLU solves that by sourcing its questions from Korean-native exams and professional subject matter. More than 20% of the benchmark requires specifically Korean historical, legal, geographic, or cultural understanding.

How KMMLU is structured

KMMLU spans 45 localized subjects across four broad supercategories. It is large enough to test both surface recall and deeper professional competence.

STEM: math, physics, chemistry, software engineering, and related technical domains.
Applied Science: medicine, telecommunications, civil engineering, and other professional fields.
HUMSS: Korean history, constitutional law, sociology, and other humanities and social sciences.
Other: specialized domains such as patent law and real estate.
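For hands-on inspection, the benchmark is published on the Hugging Face Hub. The sketch below assumes the HAERAE-HUB/KMMLU release with per-subject configs and question/A/B/C/D/answer columns; verify the exact config and field names against the dataset card.

    # Minimal loading sketch, assuming the HAERAE-HUB/KMMLU release with
    # per-subject configs; check the dataset card before relying on the
    # config and column names used here.
    from datasets import load_dataset

    ds = load_dataset("HAERAE-HUB/KMMLU", "Korean-History", split="test")

    row = ds[0]
    print(row["question"])
    for choice in ("A", "B", "C", "D"):
        print(f"  ({choice}) {row[choice]}")
    print("gold:", row["answer"])  # the gold answer is typically a 1-4 index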

Why the hard variants matter

KMMLU-Hard and KMMLU-Pro are useful because they drop the questions that current models already answer reliably and concentrate on the ones where models still fail. That makes them a sharper signal for separating frontier models than the broad base set alone.
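The underlying filtering idea is simple to sketch. The snippet below is a generic illustration, not the published KMMLU-Hard recipe: the baseline panel, prediction format, and the "every model wrong" threshold are all assumptions for demonstration.

    # Generic hard-subset filter: keep only items that every baseline model
    # answers incorrectly. The actual KMMLU-Hard construction may use a
    # different model panel or threshold; this shows the mechanic only.
    def build_hard_subset(items, predictions_by_model):
        """items: dicts with an 'answer' key; predictions_by_model maps a
        model name to a list of predictions aligned with items."""
        hard = []
        for i, item in enumerate(items):
            if all(preds[i] != item["answer"]
                   for preds in predictions_by_model.values()):
                hard.append(item)
        return hard

    # Toy usage with hypothetical predictions:
    items = [{"id": 1, "answer": 3}, {"id": 2, "answer": 1}]
    preds = {"model_a": [3, 2], "model_b": [2, 2]}
    print(build_hard_subset(items, preds))  # -> [{'id': 2, 'answer': 1}]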

What the leaderboard usually shows

On BenchLM's Korean benchmarks leaderboard, large global models often stay competitive on STEM and reasoning-heavy slices simply because their raw capability is high.

The real separation tends to appear in Korean-native legal, historical, linguistic, and culturally grounded tasks. That is where Korean-market models like EXAONE, HyperCLOVA X, and Solar can outperform global defaults despite trailing in broader worldwide rankings.
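A per-category accuracy breakdown is enough to surface that separation. The sketch below assumes you already have per-item predictions aligned with the dataset and a category label on each item; the field names are illustrative rather than a fixed KMMLU schema.

    # Per-category accuracy, to compare e.g. STEM against HUMSS slices.
    # The 'category' and 'answer' field names are illustrative assumptions.
    from collections import defaultdict

    def accuracy_by_category(items, predictions):
        correct, total = defaultdict(int), defaultdict(int)
        for item, pred in zip(items, predictions):
            total[item["category"]] += 1
            correct[item["category"]] += int(pred == item["answer"])
        return {cat: correct[cat] / total[cat] for cat in total}

Two models with near-identical overall scores can diverge sharply once HUMSS and other Korea-specific categories are reported separately.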

Ready to compare live scores?

Move from the explainer into the actual benchmark tables and see how Korean-market models compare with frontier global systems on the same rows.
