LLM Leaderboard History: How AI Models Improved from 2023 to 2026
37 months of Arena Elo ratings tell the complete story of AI progress: from vicuna-13b at 1094 Elo to claude-opus-4-6-thinking at 1499 Elo. 21 crown changes, 8 competing providers, and an open-source community that nearly caught up.
The Numbers
Key stats from 37 months of AI model competition.
+405 Total Elo Gain · 1094 to 1499
21 Crown Changes · 37 months tracked
5 mo Longest Reign · gemini-2.5-pro
+375 Open-Source Gain · gap low: 4 pts
11/mo Avg Elo Velocity · points gained per month
8 Providers Competed · openai leads
Elo Rating Progression: May 2023 to Today
The full history of AI model improvement. Proprietary models (solid) vs. open-source (dashed). The journey from vicuna-13b to claude-opus-4-6-thinking.
Elo Milestones
When the AI frontier crossed each Elo threshold for the first time.
gpt-4-0314 · 2023-12 · openai
gpt-4-0314 · 2024-01 · openai
gpt-4-0125-preview · 2024-02 · openai
gpt-4-0125-preview · 2024-02 · openai
chatgpt-4o-latest · 2024-09 · openai
o1-2024-12-17 · 2025-01 · openai
grok-3-preview-02-24 · 2025-03 · xai
gemini-2.5-pro · 2025-07 · google
claude-opus-4-6-thinking · 2026-02 · anthropic
Key Breakthroughs
Slide through 26 significant moments in the AI race. Each dot marks a breakthrough.
UW enters the race
guanaco-33b is the first UW model to take #1
The Open-Source Gap
Elo difference between the #1 proprietary and #1 open-source model. Lower = open-source is closer. The gap hit 4 points in 2025-02 — then proprietary labs fought back.
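For readers who want to reproduce this chart from a monthly leaderboard table, a minimal pandas sketch follows. The column names and the ratings in it are illustrative assumptions, not the real schema or real Arena numbers:

```python
import pandas as pd

# Toy monthly leaderboard: one row per (month, model).
# Ratings here are illustrative only, not the real Arena numbers.
df = pd.DataFrame({
    "month":   ["2025-01", "2025-01", "2025-02", "2025-02"],
    "model":   ["prop-model", "deepseek-v3", "prop-model", "deepseek-r1"],
    "license": ["Proprietary", "Open", "Proprietary", "Open"],
    "elo":     [1380, 1360, 1384, 1380],
})

# Best Elo per month for each license class, then the difference.
best = df.pivot_table(index="month", columns="license", values="elo", aggfunc="max")
gap = best["Proprietary"] - best["Open"]
print(gap)  # 2025-01: 20, 2025-02: 4 -> lower means open source is closer
```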
Crown Change Timeline
21 changes. Every time the #1 model changed hands since May 2023.
2023-06: vicuna-13b → guanaco-33b (UW)
2023-07: guanaco-33b → vicuna-33b (LMSYS)
2023-10: vicuna-33b → wizardlm-70b (microsoft)
2023-12: wizardlm-70b → gpt-4-0314 (openai)
2024-02: gpt-4-0314 → gpt-4-0125-preview (openai)
2024-03: gpt-4-0125-preview → gpt-4-1106-preview (openai)
2024-04: gpt-4-1106-preview → claude-3-opus-20240229 (anthropic)
2024-05: claude-3-opus-20240229 → gpt-4-turbo-2024-04-09 (openai)
Provider Dominance
Months at #1 on the Arena leaderboard since May 2023. Who has dominated the AI race?
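The dominance chart is a straightforward count over monthly crown holders. Here is a small pandas sketch over a toy slice of the crown timeline above; the column names are our own and the slice is truncated, so the counts are illustrative only:

```python
import pandas as pd

# Truncated toy slice of the crown timeline; the real series has one
# row per month from 2023-05 to 2026-04.
crowns = pd.DataFrame({
    "month":    ["2023-05", "2023-06", "2023-07", "2023-10", "2023-12"],
    "model":    ["vicuna-13b", "guanaco-33b", "vicuna-33b", "wizardlm-70b", "gpt-4-0314"],
    "provider": ["LMSYS", "UW", "LMSYS", "microsoft", "openai"],
})

# Months at #1 per provider (the dominance chart) ...
months_at_number_one = crowns["provider"].value_counts()

# ... and how many times the #1 model changed hands.
crown_changes = int((crowns["model"] != crowns["model"].shift()).sum()) - 1

print(months_at_number_one)
print(crown_changes)  # 4 changes in this toy slice
```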
Category Breakdown: Who Wins at What?
Arena tracks Elo by category. Here's who leads in coding, math, creative writing, and more — with Elo gain over time.
Coding: +276 Elo · #1 claude-opus-4-6-thinking (anthropic), 1556
English: +275 Elo · #1 claude-opus-4-6-thinking (anthropic), 1511
Hard Prompts: +262 Elo · #1 claude-opus-4-6-thinking (anthropic), 1536
Chinese: +216 Elo · #1 claude-opus-4-6-thinking (anthropic), 1546
Multi-Turn: +202 Elo · #1 claude-opus-4-6-thinking (anthropic), 1509
Math: +171 Elo · #1 claude-opus-4-6-thinking (anthropic), 1518
Creative Writing: +170 Elo · #1 claude-opus-4-6-thinking (anthropic), 1492
Instruction Following: +157 Elo · #1 claude-opus-4-6-thinking (anthropic), 1514
Japanese: +124 Elo · #1 gemini-3.1-pro-preview (google), 1542
Korean: +114 Elo · #1 gemini-3.1-pro-preview (google), 1495
Vision Arena: Multimodal Model Rankings
How multimodal (vision) models have improved over time. Currently led by claude-opus-4-6-thinking at 1314 Elo.
Open-Source Champions Over Time
Every model that held the open-source crown, from early LLaMA to modern reasoning models.
vicuna-13b · LMSYS · 2023-05
guanaco-33b · UW · 2023-06
vicuna-33b · LMSYS · 2023-07
wizardlm-70b · microsoft · 2023-10
tulu-2-dpo-70b · AllenAI/UW · 2023-12
mixtral-8x7b-instruct-v0.1 · mistral · 2024-01
qwen1.5-72b-chat · alibaba · 2024-03
llama-3-70b-instruct · meta · 2024-05
gemma-2-27b-it · google · 2024-07
llama-3.1-405b-instruct · meta · 2024-08
llama-3.1-405b-instruct-bf16 · meta · 2024-10
llama-3.1-nemotron-70b-instruct · nvidia · 2024-11
athene-v2-chat · NexusFlow · 2024-12
deepseek-v3 · deepseek · 2025-01
deepseek-r1 · deepseek · 2025-02
deepseek-v3-0324 · deepseek · 2025-04
deepseek-r1-0528 · deepseek · 2025-07
qwen3-235b-a22b-instruct-2507 · alibaba · 2025-08
glm-4.5 · zai · 2025-09
glm-4.6 · zai · 2025-11
kimi-k2.5-thinking · moonshot · 2026-02
qwen3.5-397b-a17b · alibaba · 2026-03
glm-5.1 · zai · 2026-04
Monthly Rankings
Full top-20 Arena rankings for any month. Scroll through 37 snapshots from 2023-05 to 2026-04.
| # | Model | Org | License | Elo | Votes |
|---|---|---|---|---|---|
| 1 | claude-opus-4-6-thinking | anthropic | Proprietary | 1499 | 17,219 |
| 2 | claude-opus-4-6 | anthropic | Proprietary | 1494 | 18,377 |
| 3 | gemini-3.1-pro-preview | google | Proprietary | 1488 | 21,708 |
| 4 | muse-spark | meta | Proprietary | 1481 | 4,182 |
| 5 | gemini-3-pro | google | Proprietary | 1479 | 41,578 |
| 6 | gpt-5.4-high | openai | Proprietary | 1472 | 10,633 |
| 7 | qwen3.5-max-preview | alibaba | Proprietary | 1471 | 8,774 |
| 8 | glm-5.1 | zai | Open | 1469 | 6,274 |
| 9 | gemini-3-flash | google | Proprietary | 1466 | 30,922 |
| 10 | gemini-2.5-pro | google | Proprietary | 1461 | 108,717 |
| 11 | grok-4.20-beta-0309-reasoning | xai | Proprietary | 1455 | 10,713 |
| 12 | dola-seed-2.0-pro | | Proprietary | 1455 | 19,770 |
| 13 | grok-4.20-beta1 | xai | Proprietary | 1455 | 10,884 |
| 14 | gpt-5.4 | openai | Proprietary | 1452 | 10,990 |
| 15 | grok-4.20-multi-agent-beta-0309 | xai | Proprietary | 1450 | 11,079 |
| 16 | ernie-5.0-0110 | baidu | Proprietary | 1449 | 23,507 |
| 17 | gemini-3-flash (thinking-minimal) | google | Proprietary | 1448 | 34,519 |
| 18 | amazon-nova-experimental-chat-26-02-10 | amazon | Proprietary | 1448 | 3,448 |
| 19 | claude-opus-4-5-20251101 | anthropic | Proprietary | 1448 | 48,318 |
| 20 | kimi-k2.5-thinking | moonshot | Open | 1447 | 21,678 |
Related
The AI Race Timeline: interactive monthly scrubber with crown holders, provider rankings, and benchmark health.
LLM Benchmark Rankings: BenchLM's own benchmark-based rankings across coding, math, reasoning, and more.
Frequently Asked Questions
What is the Arena Elo leaderboard?
The Arena (formerly Chatbot Arena / LMSYS) is a crowdsourced platform where users vote on anonymous side-by-side model comparisons. Each model receives an Elo rating based on its win rate, similar to chess rankings. It is widely considered the most reliable measure of real-world LLM quality because it reflects human preference rather than automated benchmarks.
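For intuition, here is a minimal, illustrative Elo update for a single head-to-head vote. The Arena's actual leaderboard is computed with a Bradley-Terry-style fit over all battles (with tie and style handling), so treat this as a sketch of the underlying idea rather than the production method; the function names and K-factor are our own choices:

```python
# Illustrative Elo update for one Arena-style vote. Not the Arena's
# production method; it only shows the standard Elo mechanics the
# leaderboard ratings are analogous to.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return new (r_a, r_b) ratings after a single vote; k sets the step size."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Example: a 1450-rated model beats a 1500-rated one.
print(update_elo(1450, 1500, a_won=True))  # -> (~1468.3, ~1481.7)
```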
How much have AI models improved since 2023?
The top Arena Elo score rose from 1094 (vicuna-13b) in May 2023 to 1499 (claude-opus-4-6-thinking) in 2026-04 — a gain of +405 Elo points over 37 months, averaging 11 points per month. To put this in perspective, in chess a 400-point Elo difference means the higher-rated player wins ~91% of the time.
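That win-rate figure is just the Elo expected-score formula evaluated at a 400-point gap (ignoring draws); a quick self-contained check:

```python
# Expected score for a player rated 400 Elo points above the opponent.
p = 1 / (1 + 10 ** (-400 / 400))
print(round(p, 3))  # 0.909, i.e. roughly 91%
```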
Which company has dominated the AI leaderboard?
openai has held the #1 position for 16 out of 37 months (43%), followed by google (7 months) and anthropic (5 months). The crown has changed hands 21 times since May 2023.
Are open-source LLMs catching up to proprietary models?
Open-source models have gained +375 Elo points. The gap hit a low of just 4 points in 2025-02. However, proprietary labs have since widened the gap. The open-source frontier is led by models like DeepSeek, Qwen, and Kimi, with Chinese labs driving much of the open-source progress.
How does Arena Elo differ from benchmark scores?
Arena Elo ratings are based on real user votes in blind A/B tests, making them less gameable than automated benchmarks. While benchmarks like MMLU or HumanEval test specific capabilities, Arena Elo measures overall user preference across diverse tasks. BenchLM tracks both — see our main leaderboard for benchmark-based rankings.
Which AI model is best for coding?
According to Arena coding Elo, claude-opus-4-6-thinking (anthropic) currently leads with an Elo of 1556. The coding category has seen 14 crown changes over 21 months.
What models are best for math and reasoning?
The current Arena math leader is claude-opus-4-6-thinking (anthropic) at 1518 Elo. Math Elo has gained +171 points since tracking began, making it one of the fastest-improving categories.
Where does this data come from?
All Elo ratings come from the Arena Leaderboard Dataset on HuggingFace, maintained by Arena Intelligence (lmarena-ai). We process the text, text_style_control, vision, and webdev subsets. BenchLM.ai visualizes this data but does not generate the underlying ratings.
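If you want the raw data, the Hugging Face `datasets` library can load it directly. The repository id and config name below are placeholders, not confirmed identifiers; check the lmarena-ai organization on HuggingFace for the exact names:

```python
from datasets import load_dataset

# Placeholder repo id and config: substitute the actual lmarena-ai
# identifiers, which may differ from these.
ds = load_dataset("lmarena-ai/arena-leaderboard", name="text", split="train")

# Inspect the schema; column names here are not guaranteed.
print(ds.column_names)
print(ds[0])
```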
Data attribution: All Elo ratings on this page come from the Arena Leaderboard Dataset by Arena Intelligence (lmarena-ai), available on HuggingFace. Data covers the text, text_style_control, vision, and webdev Arena subsets.
BenchLM.ai processes and visualizes this data to provide historical insights. We do not generate the underlying Elo ratings. For BenchLM's own benchmark-based rankings, see the main leaderboard or the AI Race timeline.