MIXED GLOBAL + REGIONAL

Korean Benchmarks Leaderboard

How do global frontier models stack up against regional Korean models on domestic tasks? This leaderboard ranks all models exclusively on Korean benchmarks such as KMMLU, KMMLU-Hard, CLIcK, and KoBALT.

Claude Sonnet 4.6 currently leads the cross-market Korean view with an average score of 85.0.

This is the right page for deciding whether Korean-market specialists actually outperform global frontier models on Korean-native evaluations, rather than merely leading a regional-only pool.

Rank  Model              Provider        Type      Avg Score
#1    Claude Sonnet 4.6  Anthropic       GLOBAL    85.0
#2    Solar 🇰🇷           Upstage         REGIONAL  80.1
#3    o1                 OpenAI          GLOBAL    79.5
#4    HyperClova X 🇰🇷    Naver Cloud     REGIONAL  78.4
#5    GPT-5.4            OpenAI          GLOBAL    78.2
#6    A.X 🇰🇷             SK Telecom      REGIONAL  78.0
#7    K-Exaone 🇰🇷        LG AI Research  REGIONAL  76.0
#8    Exaone 4.0 🇰🇷      LG AI Research  REGIONAL  75.2
#9    GPT-5              OpenAI          GLOBAL    68.5
#10   GPT-5.2            OpenAI          GLOBAL    61.3
#11   GPT-5              OpenAI          GLOBAL    60.5
#12   GPT-5.1            OpenAI          GLOBAL    54.9
#13   GPT-4.1            OpenAI          GLOBAL    54.1
#14   GPT-4o             OpenAI          GLOBAL    51.9
#15   GPT-4.1            OpenAI          GLOBAL    47.4
#16   GPT-4 Turbo        OpenAI          GLOBAL    44.7
#17   GPT-4o             OpenAI          GLOBAL    38.6
#18   GPT-4.1            OpenAI          GLOBAL    36.5

What these rows mean

KMMLU: Measures massive multitask language understanding across 45 expert-level Korean subjects.

KMMLU-Hard: A harder subset of KMMLU, built from the questions models miss most often, targeting complex Korean reasoning.
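The page does not document how the Avg Score column is aggregated from these benchmarks. As an illustrative sketch only, assuming it is an unweighted mean over whichever Korean benchmarks a model was evaluated on, the computation would look like this (the per-benchmark numbers below are hypothetical):

```python
def avg_score(scores: dict[str, float]) -> float:
    """Unweighted mean over whichever benchmark scores are present,
    rounded to one decimal place to match the leaderboard's format."""
    return round(sum(scores.values()) / len(scores), 1)

# Hypothetical per-benchmark scores, for illustration only:
print(avg_score({"KMMLU": 84.2, "KMMLU-Hard": 70.5, "CLIcK": 88.0, "KoBALT": 81.3}))
```

If the real aggregation weights benchmarks differently, or averages only over a fixed benchmark set, the ranking of closely scored models could shift, so treat near-ties in the table with caution.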

How to interpret the crossover

While global frontier models like GPT-5 and Claude lead in general reasoning, regional models like HyperClova X and Exaone are trained explicitly on high-quality Korean corpora. This leaderboard tracks the crossover points where regional specialization overtakes sheer model scale.
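One simple way to quantify the crossover is to average the Avg Score column within each pool. The sketch below transcribes the rows from the table above; the pool-level comparison itself is illustrative, not part of the leaderboard's own methodology:

```python
from statistics import mean

# Rows transcribed from the leaderboard above: (model, type, avg_score).
rows = [
    ("Claude Sonnet 4.6", "GLOBAL", 85.0),
    ("Solar", "REGIONAL", 80.1),
    ("o1", "GLOBAL", 79.5),
    ("HyperClova X", "REGIONAL", 78.4),
    ("GPT-5.4", "GLOBAL", 78.2),
    ("A.X", "REGIONAL", 78.0),
    ("K-Exaone", "REGIONAL", 76.0),
    ("Exaone 4.0", "REGIONAL", 75.2),
    ("GPT-5", "GLOBAL", 68.5),
    ("GPT-5.2", "GLOBAL", 61.3),
    ("GPT-5", "GLOBAL", 60.5),
    ("GPT-5.1", "GLOBAL", 54.9),
    ("GPT-4.1", "GLOBAL", 54.1),
    ("GPT-4o", "GLOBAL", 51.9),
    ("GPT-4.1", "GLOBAL", 47.4),
    ("GPT-4 Turbo", "GLOBAL", 44.7),
    ("GPT-4o", "GLOBAL", 38.6),
    ("GPT-4.1", "GLOBAL", 36.5),
]

def mean_by_type(rows):
    """Average the Avg Score column within each pool (GLOBAL / REGIONAL)."""
    by_type = {}
    for _model, mtype, score in rows:
        by_type.setdefault(mtype, []).append(score)
    return {mtype: round(mean(scores), 1) for mtype, scores in by_type.items()}

print(mean_by_type(rows))
```

Note the asymmetry this exposes: the regional entries cluster in a narrow high band, while the global pool's mean is dragged down by older GPT-4-era models, so a median or top-k comparison is arguably fairer than a pool mean when judging the crossover.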

View regional-only Korean LLMs


Recommended next step

If the mixed leaderboard shows a Korean-market model winning on your target rows, open its model page next and inspect the full score breakdown before choosing it over a global default.