LLM Leaderboard History: How AI Models Improved from 2023 to 2026

38 months of Arena Elo ratings tell the complete story of AI progress: from vicuna-13b at 1094 Elo to claude-opus-4-6-thinking at 1501 Elo. 21 crown changes, 8 competing providers, and an open-source community that nearly caught up.

Data: Arena Leaderboard Dataset | Updated: 2026-05-12 | 4 arena subsets, 14 categories

The Numbers

Key stats from 38 months of AI model competition.

Total Elo Gain: +407 (1094 to 1501)
Crown Changes: 21 (38 months tracked)
Longest Reign: 5mo (gemini-2.5-pro)
Open-Source Gain: +375 (gap low: 4 pts)
Avg Elo Velocity: 10.7/mo (points gained per month)
Providers Competed: 8 (openai leads)

Elo Rating Progression: May 2023 to Today

The full history of AI model improvement. Proprietary models (solid) vs. open-source (dashed). The journey from vicuna-13b to claude-opus-4-6-thinking.


Elo Milestones

When the AI frontier crossed each Elo threshold for the first time.

1100: gpt-4-0314 · 2023-12 · openai
1150: gpt-4-0314 · 2024-01 · openai
1200: gpt-4-0125-preview · 2024-02 · openai
1250: gpt-4-0125-preview · 2024-02 · openai
1300: chatgpt-4o-latest · 2024-09 · openai
1350: o1-2024-12-17 · 2025-01 · openai
1400: grok-3-preview-02-24 · 2025-03 · xai
1450: gemini-2.5-pro · 2025-07 · google
1500: claude-opus-4-6-thinking · 2026-02 · anthropic

Key Breakthroughs

Slide through 26 significant moments in the AI race. Each dot marks a breakthrough.

2023-06 · New Challenger

UW enters the race

guanaco-33b is the first UW model to take #1

The Open-Source Gap

Elo difference between the #1 proprietary and #1 open-source model. Lower = open-source is closer. The gap hit 4 points in 2025-02 — then proprietary labs fought back.
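As a minimal sketch of how such a gap can be computed, the snippet below takes the best proprietary Elo minus the best open-source Elo for one monthly snapshot. The function name `open_source_gap` and the tuple layout are hypothetical. The sample rows are copied from the latest monthly top-20 table on this page, and the license labels "Proprietary" and "Open" are the ones that table uses.

```python
from typing import List, Tuple

# (model, license, elo) rows; values copied from the page's latest top-20 table.
Row = Tuple[str, str, int]


def open_source_gap(rows: List[Row]) -> int:
    """Return the top proprietary Elo minus the top open-source Elo."""
    best_proprietary = max(elo for _, lic, elo in rows if lic == "Proprietary")
    best_open = max(elo for _, lic, elo in rows if lic == "Open")
    return best_proprietary - best_open


snapshot = [
    ("claude-opus-4-6-thinking", "Proprietary", 1501),
    ("claude-opus-4-6", "Proprietary", 1498),
    ("glm-5.1", "Open", 1469),
    ("mimo-v2.5-pro", "Open", 1459),
]

gap = open_source_gap(snapshot)
print(gap)  # 32
```

A lower value means the best open-source model is closer to the best proprietary one.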

Crown Change Timeline

21 changes

Every time the #1 model changed hands since May 2023.

2023-06: vicuna-13b → guanaco-33b (UW)
2023-07: guanaco-33b → vicuna-33b (LMSYS)
2023-10: vicuna-33b → wizardlm-70b (microsoft)
2023-12: wizardlm-70b → gpt-4-0314 (openai)
2024-02: gpt-4-0314 → gpt-4-0125-preview (openai)
2024-03: gpt-4-0125-preview → gpt-4-1106-preview (openai)
2024-04: gpt-4-1106-preview → claude-3-opus-20240229 (anthropic)
2024-05: claude-3-opus-20240229 → gpt-4-turbo-2024-04-09 (openai)
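One way to derive a crown-change list like this is to walk the monthly #1 models and record each month where the leader differs from the previous month. The sketch below is illustrative only: `crown_changes` is a hypothetical helper, and the sample series covers just 2023-05 to 2023-12. The months with no listed change (2023-08, 2023-09, 2023-11) are filled in on the assumption that the previous leader kept the crown.

```python
from typing import List, Tuple


def crown_changes(monthly_leaders: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (month, previous_leader, new_leader) for every change of #1."""
    changes = []
    # Pair each month with the one before it and compare the leaders.
    for (_, prev), (month, curr) in zip(monthly_leaders, monthly_leaders[1:]):
        if curr != prev:
            changes.append((month, prev, curr))
    return changes


# Assumed monthly #1 series; unlisted months carry the prior leader forward.
leaders_2023 = [
    ("2023-05", "vicuna-13b"),
    ("2023-06", "guanaco-33b"),
    ("2023-07", "vicuna-33b"),
    ("2023-08", "vicuna-33b"),
    ("2023-09", "vicuna-33b"),
    ("2023-10", "wizardlm-70b"),
    ("2023-11", "wizardlm-70b"),
    ("2023-12", "gpt-4-0314"),
]

changes = crown_changes(leaders_2023)
for month, old, new in changes:
    print(f"{month}: {old} -> {new}")
```

Run over the full 38-month series, the same loop would be expected to reproduce the page's count of 21 changes, though that series is not reproduced here.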

Provider Dominance

Months at #1 on the Arena leaderboard since May 2023. Who has dominated the AI race?

openai: 16 months (42%)
google: 7 months (18%)
anthropic: 6 months (16%)
LMSYS: 4 months (11%)
microsoft: 2 months (5%)
UW: 1 month (3%)
deepseek: 1 month (3%)
xai: 1 month (3%)
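The percentages above follow from dividing each provider's months at #1 by the 38 months tracked. A short check, using only the month counts listed on this page (the variable names are illustrative):

```python
# Months at #1 per provider, copied from the list above.
months_at_top = {
    "openai": 16,
    "google": 7,
    "anthropic": 6,
    "LMSYS": 4,
    "microsoft": 2,
    "UW": 1,
    "deepseek": 1,
    "xai": 1,
}

total_months = sum(months_at_top.values())

# Share of tracked months, rounded to a whole percent.
shares = {
    provider: round(100 * months / total_months)
    for provider, months in months_at_top.items()
}

print(total_months)
for provider, pct in shares.items():
    print(f"{provider}: {pct}%")
```

Because each share is rounded independently, the rounded percentages need not sum to exactly 100.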

Category Breakdown: Who Wins at What?

Arena tracks Elo by category. Here's who leads in coding, math, creative writing, and more — with Elo gain over time.

Category | Elo gain | #1 model (org) | #1 Elo | Months tracked | Crown changes
Coding | +289 | claude-opus-4-7-thinking (anthropic) | 1569 | 22 | 15
English | +277 | claude-opus-4-6-thinking (anthropic) | 1513 | 28 | 13
Hard Prompts | +262 | claude-opus-4-6-thinking (anthropic) | 1535 | 23 | 13
Chinese | +224 | gpt-5.5 (openai) | 1554 | 28 | 13
Multi-Turn | +207 | claude-opus-4-7-thinking (anthropic) | 1514 | 25 | 12
Creative Writing | +171 | claude-opus-4-6-thinking (anthropic) | 1494 | 21 | 10
Math | +169 | gpt-5.4-high (openai) | 1515 | 18 | 11
Instruction Following | +160 | claude-opus-4-6-thinking (anthropic) | 1518 | 18 | 8
Japanese | +108 | gemini-3.1-pro-preview (google) | 1526 | 25 | 10
Korean | +101 | gemini-3.1-pro-preview (google) | 1481 | 25 | 14

Vision Arena: Multimodal Model Rankings

How multimodal (vision) models have improved over time. Currently led by claude-opus-4-7-thinking at 1318 Elo.

Open-Source Champions Over Time

Every model that held the open-source crown, from early LLaMA to modern reasoning models.

Model | Org | Month | Elo
vicuna-13b | LMSYS | 2023-05 | 1094
guanaco-33b | UW | 2023-06 | 1065
vicuna-33b | LMSYS | 2023-07 | 1096
wizardlm-70b | microsoft | 2023-10 | 1099
tulu-2-dpo-70b | AllenAI/UW | 2023-12 | 1060
mixtral-8x7b-instruct-v0.1 | mistral | 2024-01 | 1124
qwen1.5-72b-chat | alibaba | 2024-03 | 1147
llama-3-70b-instruct | meta | 2024-05 | 1210
gemma-2-27b-it | google | 2024-07 | 1217
llama-3.1-405b-instruct | meta | 2024-08 | 1262
llama-3.1-405b-instruct-bf16 | meta | 2024-10 | 1267
llama-3.1-nemotron-70b-instruct | nvidia | 2024-11 | 1271
athene-v2-chat | NexusFlow | 2024-12 | 1276
deepseek-v3 | deepseek | 2025-01 | 1319
deepseek-r1 | deepseek | 2025-02 | 1361
deepseek-v3-0324 | deepseek | 2025-04 | 1369
deepseek-r1-0528 | deepseek | 2025-07 | 1424
qwen3-235b-a22b-instruct-2507 | alibaba | 2025-08 | 1431
glm-4.5 | zai | 2025-09 | 1430
glm-4.6 | zai | 2025-11 | 1444
kimi-k2.5-thinking | moonshot | 2026-02 | 1445
qwen3.5-397b-a17b | alibaba | 2026-03 | 1450
glm-5.1 | zai | 2026-05 | 1468

Monthly Rankings

Full top-20 Arena rankings for any month. Scroll through 38 snapshots from 2023-05 to 2026-05.

# | Model | Org | License | Elo | Votes
1 | claude-opus-4-6-thinking | anthropic | Proprietary | 1501 | 23,616
2 | claude-opus-4-6 | anthropic | Proprietary | 1498 | 25,089
3 | claude-opus-4-7-thinking | anthropic | Proprietary | 1487 | 8,945
4 | gemini-3.1-pro-preview | google | Proprietary | 1487 | 29,468
5 | gemini-3-pro | google | Proprietary | 1480 | 41,381
6 | claude-opus-4-7 | anthropic | Proprietary | 1478 | 9,614
7 | muse-spark | meta | Proprietary | 1476 | 10,491
8 | ernie-5.1 | baidu | Proprietary | 1473 | 5,733
9 | qwen3.5-max-preview | alibaba | Proprietary | 1470 | 14,558
10 | gpt-5.4-high | openai | Proprietary | 1469 | 17,146
11 | gpt-5.5-high | openai | Proprietary | 1469 | 6,488
12 | glm-5.1 | zai | Open | 1469 | 11,349
13 | gemini-3-flash | google | Proprietary | 1467 | 30,784
14 | gpt-5.5 | openai | Proprietary | 1461 | 6,653
15 | gemini-2.5-pro | google | Proprietary | 1460 | 114,865
16 | mimo-v2.5-pro | xiaomi | Open | 1459 | 6,238
17 | kimi-k2.6 | moonshot | Open | 1456 | 7,108
18 | grok-4.20-beta-0309-reasoning | xai | Proprietary | 1455 | 17,538
19 | gpt-5.4 | openai | Proprietary | 1455 | 17,925
20 | dola-seed-2.0-pro | | Proprietary | 1454 | 26,587

Frequently Asked Questions

What is the Arena Elo leaderboard?

The Arena (formerly Chatbot Arena / LMSYS) is a crowdsourced platform where users vote on anonymous side-by-side model comparisons. Each model receives an Elo rating based on its win rate, similar to chess rankings. It is widely considered the most reliable measure of real-world LLM quality because it reflects human preference rather than automated benchmarks.

How much have AI models improved since 2023?

The top Arena Elo score rose from 1094 (vicuna-13b) in May 2023 to 1501 (claude-opus-4-6-thinking) in 2026-05 — a gain of +407 Elo points over 38 months, averaging 10.7 points per month. To put this in perspective, in chess a 400-point Elo difference means the higher-rated player wins ~91% of the time.
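The figures in this answer can be reproduced with a few lines. The sketch below assumes the standard chess-style Elo logistic curve with a 400-point scale, which is what the chess comparison refers to; this page does not state how Arena fits its own ratings, so treat `expected_score` as an illustration of the chess formula rather than Arena's method.

```python
def expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side for a given Elo gap.

    Uses the standard chess Elo logistic curve (an assumption here,
    not a description of how Arena computes its ratings).
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Values quoted on this page.
first_elo, latest_elo, months_tracked = 1094, 1501, 38

gain = latest_elo - first_elo
velocity = gain / months_tracked

print(gain)                            # 407
print(round(velocity, 1))              # 10.7
print(round(expected_score(400), 3))   # 0.909
```

Strictly, the 0.909 figure is an expected score, which counts a draw as half a win, so "wins ~91% of the time" is a close approximation rather than an exact win rate.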

Which company has dominated the AI leaderboard?

openai has held the #1 position for 16 out of 38 months (42%), followed by google (7 months) and anthropic (6 months). The crown has changed hands 21 times since May 2023.

Are open-source LLMs catching up to proprietary models?

Open-source models have gained +375 Elo points. The gap hit a low of just 4 points in 2025-02. However, proprietary labs have since widened the gap. The open-source frontier is led by models like DeepSeek, Qwen, and Kimi, with Chinese labs driving much of the open-source progress.

How does Arena Elo differ from benchmark scores?

Arena Elo ratings are based on real user votes in blind A/B tests, making them less gameable than automated benchmarks. While benchmarks like MMLU or HumanEval test specific capabilities, Arena Elo measures overall user preference across diverse tasks. BenchLM tracks both — see our main leaderboard for benchmark-based rankings.

Which AI model is best for coding?

According to Arena coding Elo, claude-opus-4-7-thinking (anthropic) currently leads with an Elo of 1569. The coding category has seen 15 crown changes over 22 months.

What models are best for math and reasoning?

The current Arena math leader is gpt-5.4-high (openai) at 1515 Elo. Math Elo has gained +169 points over the 18 months it has been tracked, or about 9.4 points per month.

Where does this data come from?

All Elo ratings come from the Arena Leaderboard Dataset on HuggingFace, maintained by Arena Intelligence (lmarena-ai). We process the text, text_style_control, vision, and webdev subsets. BenchLM.ai visualizes this data but does not generate the underlying ratings.

Data attribution: All Elo ratings on this page come from the Arena Leaderboard Dataset by Arena Intelligence (lmarena-ai), available on HuggingFace. The data covers the text, text_style_control, vision, and webdev arena subsets.

BenchLM.ai processes and visualizes this data to provide historical insights. We do not generate the underlying Elo ratings. For BenchLM's own benchmark-based rankings, see the main leaderboard or the AI Race timeline.