LLM Leaderboard History: How AI Models Improved from 2023 to 2026

36 months of Arena Elo ratings tell the complete story of AI progress: from vicuna-13b at 1094 Elo to claude-opus-4-6-thinking at 1500 Elo. 21 crown changes, 8 competing providers, and an open-source community that nearly caught up.

Data: Arena Leaderboard Dataset · Updated: 2026-04-02 · 4 arena subsets, 14 categories

The Numbers

Key stats from 36 months of AI model competition.

+406 · Total Elo Gain (1094 to 1500)
21 · Crown Changes (36 months tracked)
5 months · Longest Reign (gemini-2.5-pro)
+355 · Open-Source Gain (gap low: 4 pts)
11.3/mo · Avg Elo Velocity (points gained per month)
8 · Providers Competed (openai leads)
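
These headline figures are easy to reproduce from a monthly series of (month, #1 model, Elo) snapshots. A minimal sketch in Python, assuming such a series is already loaded; the rows and field order shown are illustrative, not the dataset's actual schema.

```python
# Minimal sketch: derive the headline stats from a monthly "#1 model" series.
# The rows below are illustrative placeholders; the real series has 36 entries.
snapshots = [
    ("2023-05", "vicuna-13b", 1094),
    # ... one (month, top model, Elo) row per month ...
    ("2026-04", "claude-opus-4-6-thinking", 1500),
]

months_tracked = 36  # len(snapshots) once the full series is loaded
total_gain = snapshots[-1][2] - snapshots[0][2]  # +406 Elo
velocity = total_gain / months_tracked           # ~11.3 Elo per month
crown_changes = sum(                             # 21 with the full series
    1 for prev, cur in zip(snapshots, snapshots[1:]) if prev[1] != cur[1]
)
print(total_gain, round(velocity, 1), crown_changes)
```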

Elo Rating Progression: May 2023 to Today

The full history of AI model improvement. Proprietary models (solid) vs. open-source (dashed). The journey from vicuna-13b to claude-opus-4-6-thinking.


Elo Milestones

When the AI frontier crossed each Elo threshold for the first time.

1100 · gpt-4-0314 · 2023-12 · openai
1150 · gpt-4-0314 · 2024-01 · openai
1200 · gpt-4-0125-preview · 2024-02 · openai
1250 · gpt-4-0125-preview · 2024-02 · openai
1300 · chatgpt-4o-latest · 2024-09 · openai
1350 · o1-2024-12-17 · 2025-01 · openai
1400 · grok-3-preview-02-24 · 2025-03 · xai
1450 · gemini-2.5-pro · 2025-07 · google
1500 · claude-opus-4-6-thinking · 2026-02 · anthropic

Key Breakthroughs

Twenty-five significant moments in the AI race; the first is shown below.

2023-06 · New Challenger: UW enters the race. guanaco-33b is the first UW model to take #1.


The Open-Source Gap

Elo difference between the #1 proprietary and #1 open-source model. Lower = open-source is closer. The gap hit 4 points in 2025-02 — then proprietary labs fought back.
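
The gap series itself is a per-month subtraction over those two frontiers. A minimal sketch, assuming two dicts of monthly peak Elo; the variable names and most values are placeholders, and only the 4-point low in 2025-02 comes from the data above.

```python
# Monthly gap between the best proprietary and the best open-source Elo.
# Illustrative numbers; only the 4-point low in 2025-02 reflects the data.
best_proprietary = {"2025-01": 1350, "2025-02": 1365, "2025-03": 1400}
best_open_source = {"2025-01": 1319, "2025-02": 1361, "2025-03": 1361}

gap = {month: best_proprietary[month] - best_open_source[month]
       for month in best_proprietary}
low_month = min(gap, key=gap.get)
print(low_month, gap[low_month])  # 2025-02 4
```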

Crown Change Timeline

21 changes

Every time the #1 model changed hands since May 2023; the first eight of the 21 changes are listed below.

2023-06 · vicuna-13b → guanaco-33b (UW)
2023-07 · guanaco-33b → vicuna-33b (LMSYS)
2023-10 · vicuna-33b → wizardlm-70b (microsoft)
2023-12 · wizardlm-70b → gpt-4-0314 (openai)
2024-02 · gpt-4-0314 → gpt-4-0125-preview (openai)
2024-03 · gpt-4-0125-preview → gpt-4-1106-preview (openai)
2024-04 · gpt-4-1106-preview → claude-3-opus-20240229 (anthropic)
2024-05 · claude-3-opus-20240229 → gpt-4-turbo-2024-04-09 (openai)

Provider Dominance

Months at #1 on the Arena leaderboard since May 2023. Who has dominated the AI race?

openai · 16 months (44%)
google · 7 months (19%)
LMSYS · 4 months (11%)
anthropic · 4 months (11%)
microsoft · 2 months (6%)
UW · 1 month (3%)
deepseek · 1 month (3%)
xai · 1 month (3%)

Category Breakdown: Who Wins at What?

Arena tracks Elo by category. Here's who leads in coding, math, creative writing, and more — with Elo gain over time.

Coding: +276 Elo · #1: claude-opus-4-6-thinking (anthropic) · 1556 · 20 months tracked, 14 crown changes
English: +275 Elo · #1: claude-opus-4-6-thinking (anthropic) · 1510 · 26 months tracked, 13 crown changes
Hard Prompts: +263 Elo · #1: claude-opus-4-6-thinking (anthropic) · 1537 · 21 months tracked, 13 crown changes
Chinese: +225 Elo · #1: claude-opus-4-6 (anthropic) · 1555 · 26 months tracked, 12 crown changes
Multi-Turn: +205 Elo · #1: claude-opus-4-6-thinking (anthropic) · 1512 · 23 months tracked, 11 crown changes
Creative Writing: +171 Elo · #1: claude-opus-4-6-thinking (anthropic) · 1493 · 19 months tracked, 8 crown changes
Math: +170 Elo · #1: gpt-5.4-high (openai) · 1517 · 16 months tracked, 11 crown changes
Instruction Following: +155 Elo · #1: claude-opus-4-6-thinking (anthropic) · 1512 · 16 months tracked, 8 crown changes
Japanese: +118 Elo · #1: gemini-3.1-pro-preview (google) · 1536 · 23 months tracked, 10 crown changes
Korean: +118 Elo · #1: gemini-3.1-pro-preview (google) · 1498 · 23 months tracked, 14 crown changes

Vision Arena: Multimodal Model Rankings

How multimodal (vision) models have improved over time. Currently led by claude-opus-4-6 at 1310 Elo.

Open-Source Champions Over Time

Every model that held the open-source crown, from early LLaMA to modern reasoning models.

vicuna-13b · LMSYS · 2023-05 · 1094
guanaco-33b · UW · 2023-06 · 1065
vicuna-33b · LMSYS · 2023-07 · 1096
wizardlm-70b · microsoft · 2023-10 · 1099
tulu-2-dpo-70b · AllenAI/UW · 2023-12 · 1060
mixtral-8x7b-instruct-v0.1 · mistral · 2024-01 · 1124
qwen1.5-72b-chat · alibaba · 2024-03 · 1147
llama-3-70b-instruct · meta · 2024-05 · 1210
gemma-2-27b-it · google · 2024-07 · 1217
llama-3.1-405b-instruct · meta · 2024-08 · 1262
llama-3.1-405b-instruct-bf16 · meta · 2024-10 · 1267
llama-3.1-nemotron-70b-instruct · nvidia · 2024-11 · 1271
athene-v2-chat · NexusFlow · 2024-12 · 1276
deepseek-v3 · deepseek · 2025-01 · 1319
deepseek-r1 · deepseek · 2025-02 · 1361
deepseek-v3-0324 · deepseek · 2025-04 · 1369
deepseek-r1-0528 · deepseek · 2025-07 · 1424
qwen3-235b-a22b-instruct-2507 · alibaba · 2025-08 · 1431
glm-4.5 · zai · 2025-09 · 1430
glm-4.6 · zai · 2025-11 · 1444
kimi-k2.5-thinking · moonshot · 2026-02 · 1445
qwen3.5-397b-a17b · alibaba · 2026-03 · 1450

Monthly Rankings

Full top-20 Arena rankings for each month, covering 36 snapshots from 2023-05 to 2026-04. The table below shows the most recent snapshot (2026-04).

# · Model · Org · License · Elo · Votes
1 · claude-opus-4-6-thinking · anthropic · Proprietary · 1500 · 13,979
2 · claude-opus-4-6 · anthropic · Proprietary · 1497 · 14,934
3 · gemini-3.1-pro-preview · google · Proprietary · 1490 · 17,559
4 · gemini-3-pro · google · Proprietary · 1480 · 41,632
5 · gpt-5.4-high · openai · Proprietary · 1474 · 7,160
6 · qwen3.5-max-preview · alibaba · Proprietary · 1472 · 5,899
7 · gemini-3-flash · google · Proprietary · 1467 · 30,966
8 · grok-4.20-beta1 · xai · Proprietary · 1462 · 7,380
9 · gemini-2.5-pro · google · Proprietary · 1460 · 105,423
10 · dola-seed-2.0-preview · Bytedance · Proprietary · 1457 · 13,461
11 · grok-4.20-beta-0309-reasoning · xai · Proprietary · 1456 · 7,344
12 · ernie-5.0-0110 · baidu · Proprietary · 1449 · 20,836
13 · gemini-3-flash (thinking-minimal) · google · Proprietary · 1449 · 30,448
14 · kimi-k2.5-thinking · moonshot · Open · 1449 · 17,818
15 · gpt-5.4 · openai · Proprietary · 1449 · 7,261
16 · amazon-nova-experimental-chat-26-02-10 · amazon · Proprietary · 1449 · 3,461
17 · claude-opus-4-5-20251101-thinking-32k · anthropic · Proprietary · 1447 · 37,467
18 · claude-opus-4-5-20251101 · anthropic · Proprietary · 1447 · 44,715
19 · qwen3.5-397b-a17b · alibaba · Open · 1447 · 12,994
20 · grok-4.20-multi-agent-beta-0309 · xai · Proprietary · 1446 · 7,815


Frequently Asked Questions

What is the Arena Elo leaderboard?

The Arena (formerly Chatbot Arena / LMSYS) is a crowdsourced platform where users vote on anonymous side-by-side model comparisons. Each model receives an Elo rating based on its win rate, similar to chess rankings. It is widely considered the most reliable measure of real-world LLM quality because it reflects human preference rather than automated benchmarks.
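
For intuition, here is a minimal sketch of the classic online Elo update driven by one vote at a time. The live leaderboard actually fits a Bradley-Terry model over all battles rather than updating sequentially, so treat this as an illustration of the idea rather than the exact pipeline; the model names and K-factor are made up.

```python
# Online Elo update from anonymous pairwise votes (illustrative sketch).
K = 4  # illustrative K-factor; a small K smooths noisy crowd votes

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def apply_vote(ratings: dict, winner: str, loser: str) -> None:
    """One vote: the winner's rating rises, the loser's falls by the same amount."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e_win)
    ratings[loser] -= K * (1.0 - e_win)

ratings = {"model-a": 1000.0, "model-b": 1000.0}  # every model starts equal
for winner, loser in [("model-a", "model-b"), ("model-a", "model-b")]:
    apply_vote(ratings, winner, loser)
print(ratings)  # model-a drifts above model-b as it keeps winning votes
```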

How much have AI models improved since 2023?

The top Arena Elo score rose from 1094 (vicuna-13b) in May 2023 to 1500 (claude-opus-4-6-thinking) in April 2026, a gain of +406 Elo points over 36 months, averaging 11.3 points per month. To put this in perspective, in chess a 400-point Elo difference means the higher-rated player wins ~91% of the time.
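
The ~91% figure is just the Elo expected-score formula, 1 / (1 + 10^((R_b - R_a) / 400)), evaluated at a 400-point gap:

```python
# Expected score of the higher-rated player at a 400-point Elo advantage.
print(1 / (1 + 10 ** (-400 / 400)))  # 0.909..., i.e. ~91%
```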

Which company has dominated the AI leaderboard?

openai has held the #1 position for 16 out of 36 months (44%), followed by google (7 months) and LMSYS (4 months). The crown has changed hands 21 times since May 2023.

Are open-source LLMs catching up to proprietary models?

Open-source models have gained +355 Elo points. The gap hit a low of just 4 points in 2025-02. However, proprietary labs have since widened the gap. The open-source frontier is led by models like DeepSeek, Qwen, and Kimi, with Chinese labs driving much of the open-source progress.

How does Arena Elo differ from benchmark scores?

Arena Elo ratings are based on real user votes in blind A/B tests, making them less gameable than automated benchmarks. While benchmarks like MMLU or HumanEval test specific capabilities, Arena Elo measures overall user preference across diverse tasks. BenchLM tracks both — see our main leaderboard for benchmark-based rankings.

Which AI model is best for coding?

According to Arena coding Elo, claude-opus-4-6-thinking (anthropic) currently leads with an Elo of 1556. The coding category has seen 14 crown changes over 20 months.

What models are best for math and reasoning?

The current Arena math leader is gpt-5.4-high (openai) at 1517 Elo. Math Elo has gained +170 points over the 16 months it has been tracked, roughly 10.6 points per month.

Where does this data come from?

All Elo ratings come from the Arena Leaderboard Dataset on HuggingFace, maintained by Arena Intelligence (lmarena-ai). We process the text, text_style_control, vision, and webdev subsets. BenchLM.ai visualizes this data but does not generate the underlying ratings.
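
To work with the raw data yourself, the subsets can be pulled with the HuggingFace datasets library. This is a sketch only: the dataset id, config, and split names below are placeholders, so check the dataset card on HuggingFace for the real identifiers.

```python
# Sketch: pull one Arena subset from HuggingFace and peek at a few rows.
# "lmarena-ai/<dataset-id>" and the config name are placeholders, not
# verified identifiers -- see the dataset card for the actual values.
from datasets import load_dataset

leaderboard = load_dataset("lmarena-ai/<dataset-id>", name="text")
first_split = next(iter(leaderboard.values()))  # whichever split the card defines
for row in first_split.select(range(5)):
    print(row)
```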

Data attribution: All Elo ratings on this page come from the Arena Leaderboard Dataset by Arena Intelligence (lmarena-ai), available on HuggingFace. Data covers the text, text_style_control, vision, and webdev arena subsets.

BenchLM.ai processes and visualizes this data to provide historical insights. We do not generate the underlying Elo ratings. For BenchLM's own benchmark-based rankings, see the main leaderboard or the AI Race timeline.