
LLM Leaderboard History: How AI Models Improved from 2023 to 2026

37 months of Arena Elo ratings tell the complete story of AI progress: from vicuna-13b at 1094 Elo to claude-opus-4-6-thinking at 1499 Elo. 21 crown changes, 8 competing providers, and an open-source community that nearly caught up.

Data: Arena Leaderboard Dataset | Updated: 2026-04-17 | 4 arena subsets, 14 categories

The Numbers

Key stats from 37 months of AI model competition.

+405 · Total Elo Gain (1094 to 1499)
21 · Crown Changes (37 months tracked)
5 mo · Longest Reign (gemini-2.5-pro)
+375 · Open-Source Gain (gap low: 4 pts)
11/mo · Avg Elo Velocity (points gained per month)
8 · Providers Competed (openai leads)
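The headline figures are simple arithmetic over the frontier scores reported on this page; a quick sanity check:

```python
# Frontier Elo scores reported on this page (2023-05 through 2026-04).
start_elo, end_elo = 1094, 1499                   # vicuna-13b -> claude-opus-4-6-thinking
open_source_start, open_source_end = 1094, 1469   # vicuna-13b -> glm-5.1
months = 37

total_gain = end_elo - start_elo                        # +405
velocity = total_gain / months                          # ~10.9, shown rounded as 11/mo
open_source_gain = open_source_end - open_source_start  # +375

print(total_gain, round(velocity), open_source_gain)  # -> 405 11 375
```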

Elo Rating Progression: May 2023 to Today

The full history of AI model improvement. Proprietary models (solid) vs. open-source (dashed). The journey from vicuna-13b to claude-opus-4-6-thinking.


Elo Milestones

When the AI frontier crossed each Elo threshold for the first time.

1100 · gpt-4-0314 · 2023-12 · openai
1150 · gpt-4-0314 · 2024-01 · openai
1200 · gpt-4-0125-preview · 2024-02 · openai
1250 · gpt-4-0125-preview · 2024-02 · openai
1300 · chatgpt-4o-latest · 2024-09 · openai
1350 · o1-2024-12-17 · 2025-01 · openai
1400 · grok-3-preview-02-24 · 2025-03 · xai
1450 · gemini-2.5-pro · 2025-07 · google
1500 · claude-opus-4-6-thinking · 2026-02 · anthropic

Key Breakthroughs

The timeline marks 26 significant moments in the AI race; the first is shown below.

2023-06 · New Challenger

UW enters the race

guanaco-33b is the first UW model to take #1


The Open-Source Gap

Elo difference between the #1 proprietary and #1 open-source model. Lower = open-source is closer. The gap hit 4 points in 2025-02 — then proprietary labs fought back.

Crown Change Timeline

21 changes

Every time the #1 model changed hands since May 2023.

2023-06: vicuna-13b → guanaco-33b (UW)
2023-07: guanaco-33b → vicuna-33b (LMSYS)
2023-10: vicuna-33b → wizardlm-70b (microsoft)
2023-12: wizardlm-70b → gpt-4-0314 (openai)
2024-02: gpt-4-0314 → gpt-4-0125-preview (openai)
2024-03: gpt-4-0125-preview → gpt-4-1106-preview (openai)
2024-04: gpt-4-1106-preview → claude-3-opus-20240229 (anthropic)
2024-05: claude-3-opus-20240229 → gpt-4-turbo-2024-04-09 (openai)

Provider Dominance

Months at #1 on the Arena leaderboard since May 2023. Who has dominated the AI race?

openai: 16 months (43%)
google: 7 months (19%)
anthropic: 5 months (14%)
LMSYS: 4 months (11%)
microsoft: 2 months (5%)
UW: 1 month (3%)
deepseek: 1 month (3%)
xai: 1 month (3%)

Category Breakdown: Who Wins at What?

Arena tracks Elo by category. Here's who leads in coding, math, creative writing, and more — with Elo gain over time.

Coding

+276 Elo

#1: claude-opus-4-6-thinking (anthropic) · 1556

21 months tracked · 14 crown changes

English

+275 Elo

#1: claude-opus-4-6-thinking (anthropic) · 1511

27 months tracked · 13 crown changes

Hard Prompts

+262 Elo

#1: claude-opus-4-6-thinking (anthropic) · 1536

22 months tracked · 13 crown changes

Chinese

+216 Elo

#1: claude-opus-4-6-thinking (anthropic) · 1546

27 months tracked · 13 crown changes

Multi-Turn

+202 Elo

#1: claude-opus-4-6-thinking (anthropic) · 1509

24 months tracked · 11 crown changes

Math

+171 Elo

#1: claude-opus-4-6-thinking (anthropic) · 1518

17 months tracked · 12 crown changes

Creative Writing

+170 Elo

#1: claude-opus-4-6-thinking (anthropic) · 1492

20 months tracked · 8 crown changes

Instruction Following

+157 Elo

#1: claude-opus-4-6-thinking (anthropic) · 1514

17 months tracked · 8 crown changes

Japanese

+124 Elo

#1: gemini-3.1-pro-preview (google) · 1542

24 months tracked · 10 crown changes

Korean

+114 Elo

#1: gemini-3.1-pro-preview (google) · 1495

24 months tracked · 14 crown changes

Vision Arena: Multimodal Model Rankings

How multimodal (vision) models have improved over time. Currently led by claude-opus-4-6-thinking at 1314 Elo.

Open-Source Champions Over Time

Every model that held the open-source crown, from early LLaMA to modern reasoning models.

vicuna-13b · LMSYS · 2023-05 · 1094
guanaco-33b · UW · 2023-06 · 1065
vicuna-33b · LMSYS · 2023-07 · 1096
wizardlm-70b · microsoft · 2023-10 · 1099
tulu-2-dpo-70b · AllenAI/UW · 2023-12 · 1060
mixtral-8x7b-instruct-v0.1 · mistral · 2024-01 · 1124
qwen1.5-72b-chat · alibaba · 2024-03 · 1147
llama-3-70b-instruct · meta · 2024-05 · 1210
gemma-2-27b-it · google · 2024-07 · 1217
llama-3.1-405b-instruct · meta · 2024-08 · 1262
llama-3.1-405b-instruct-bf16 · meta · 2024-10 · 1267
llama-3.1-nemotron-70b-instruct · nvidia · 2024-11 · 1271
athene-v2-chat · NexusFlow · 2024-12 · 1276
deepseek-v3 · deepseek · 2025-01 · 1319
deepseek-r1 · deepseek · 2025-02 · 1361
deepseek-v3-0324 · deepseek · 2025-04 · 1369
deepseek-r1-0528 · deepseek · 2025-07 · 1424
qwen3-235b-a22b-instruct-2507 · alibaba · 2025-08 · 1431
glm-4.5 · zai · 2025-09 · 1430
glm-4.6 · zai · 2025-11 · 1444
kimi-k2.5-thinking · moonshot · 2026-02 · 1445
qwen3.5-397b-a17b · alibaba · 2026-03 · 1450
glm-5.1 · zai · 2026-04 · 1469

Monthly Rankings

Full top-20 Arena rankings by month, across 37 snapshots from 2023-05 to 2026-04. The latest snapshot (2026-04) is shown below.

# | Model | Org | License | Elo | Votes
1 | claude-opus-4-6-thinking | anthropic | Proprietary | 1499 | 17,219
2 | claude-opus-4-6 | anthropic | Proprietary | 1494 | 18,377
3 | gemini-3.1-pro-preview | google | Proprietary | 1488 | 21,708
4 | muse-spark | meta | Proprietary | 1481 | 4,182
5 | gemini-3-pro | google | Proprietary | 1479 | 41,578
6 | gpt-5.4-high | openai | Proprietary | 1472 | 10,633
7 | qwen3.5-max-preview | alibaba | Proprietary | 1471 | 8,774
8 | glm-5.1 | zai | Open | 1469 | 6,274
9 | gemini-3-flash | google | Proprietary | 1466 | 30,922
10 | gemini-2.5-pro | google | Proprietary | 1461 | 108,717
11 | grok-4.20-beta-0309-reasoning | xai | Proprietary | 1455 | 10,713
12 | dola-seed-2.0-pro | | Proprietary | 1455 | 19,770
13 | grok-4.20-beta1 | xai | Proprietary | 1455 | 10,884
14 | gpt-5.4 | openai | Proprietary | 1452 | 10,990
15 | grok-4.20-multi-agent-beta-0309 | xai | Proprietary | 1450 | 11,079
16 | ernie-5.0-0110 | baidu | Proprietary | 1449 | 23,507
17 | gemini-3-flash (thinking-minimal) | google | Proprietary | 1448 | 34,519
18 | amazon-nova-experimental-chat-26-02-10 | amazon | Proprietary | 1448 | 3,448
19 | claude-opus-4-5-20251101 | anthropic | Proprietary | 1448 | 48,318
20 | kimi-k2.5-thinking | moonshot | Open | 1447 | 21,678

Share These Insights

Found this useful? Share it with your team or cite it in your research.

Frequently Asked Questions

What is the Arena Elo leaderboard?

The Arena (formerly Chatbot Arena / LMSYS) is a crowdsourced platform where users vote on anonymous side-by-side model comparisons. Each model receives an Elo rating based on its win rate, similar to chess rankings. It is widely considered the most reliable measure of real-world LLM quality because it reflects human preference rather than automated benchmarks.
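The rating mechanics can be sketched with the classic sequential Elo update. This is an illustrative simplification: the Arena actually fits ratings over all votes at once (a Bradley-Terry-style model), and the K-factor of 32 below is a chess convention, not an Arena parameter.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """One sequential Elo update after a single head-to-head vote.

    K=32 is a chess convention used here for illustration only.
    """
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta

# An upset: the 1400-rated model beats the 1500-rated one, so it
# gains more points than it would for beating an equal-rated peer.
low, high = elo_update(1400, 1500, a_won=True)
```

Because the update is zero-sum, whatever the winner gains the loser gives up, which is why a long reign at #1 requires beating a steady stream of improving challengers.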

How much have AI models improved since 2023?

The top Arena Elo score rose from 1094 (vicuna-13b) in May 2023 to 1499 (claude-opus-4-6-thinking) in 2026-04 — a gain of +405 Elo points over 37 months, averaging 11 points per month. To put this in perspective, in chess a 400-point Elo difference means the higher-rated player wins ~91% of the time.
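The ~91% figure follows directly from the Elo expected-score formula with the standard scale factor of 400:

```python
def win_probability(elo_diff: float) -> float:
    """Expected score of the higher-rated player for a given Elo gap."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400))

p = win_probability(400)  # -> 1 / 1.1 ~= 0.909, i.e. ~91%
```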

Which company has dominated the AI leaderboard?

openai has held the #1 position for 16 out of 37 months (43%), followed by google (7 months) and anthropic (5 months). The crown has changed hands 21 times since May 2023.

Are open-source LLMs catching up to proprietary models?

Open-source models have gained +375 Elo points. The gap hit a low of just 4 points in 2025-02. However, proprietary labs have since widened the gap. The open-source frontier is led by models like DeepSeek, Qwen, and Kimi, with Chinese labs driving much of the open-source progress.

How does Arena Elo differ from benchmark scores?

Arena Elo ratings are based on real user votes in blind A/B tests, making them less gameable than automated benchmarks. While benchmarks like MMLU or HumanEval test specific capabilities, Arena Elo measures overall user preference across diverse tasks. BenchLM tracks both — see our main leaderboard for benchmark-based rankings.

Which AI model is best for coding?

According to Arena coding Elo, claude-opus-4-6-thinking (anthropic) currently leads with an Elo of 1556. The coding category has seen 14 crown changes over 21 months.

What models are best for math and reasoning?

The current Arena math leader is claude-opus-4-6-thinking (anthropic) at 1518 Elo. Math Elo has gained +171 points since tracking began, making it one of the fastest-improving categories.

Where does this data come from?

All Elo ratings come from the Arena Leaderboard Dataset on HuggingFace, maintained by Arena Intelligence (lmarena-ai). We process the text, text_style_control, vision, and webdev subsets. BenchLM.ai visualizes this data but does not generate the underlying ratings.

Data attribution: All Elo ratings on this page come from the Arena Leaderboard Dataset by Arena Intelligence (lmarena-ai), available on HuggingFace. Data covers the text, text_style_control, vision, and webdev arena subsets.

BenchLM.ai processes and visualizes this data to provide historical insights. We do not generate the underlying Elo ratings. For BenchLM's own benchmark-based rankings, see the main leaderboard or the AI Race timeline.