LLM Leaderboard History: How AI Models Improved from 2023 to 2026
37 months of Arena Elo ratings tell the complete story of AI progress: from vicuna-13b at 1094 Elo to claude-opus-4-6-thinking at 1499 Elo. 21 crown changes, 8 competing providers, and an open-source community that nearly caught up.
The Numbers
Key stats from 37 months of AI model competition.
+405 Total Elo Gain · 1094 to 1499
21 Crown Changes · 37 months tracked
5 mo Longest Reign · gemini-2.5-pro
+375 Open-Source Gain · gap low: 4 pts
11/mo Avg Elo Velocity · points gained per month
8 Providers Competed · openai leads
Elo Rating Progression: May 2023 to Today
The full history of AI model improvement. Proprietary models (solid) vs. open-source (dashed). The journey from vicuna-13b to claude-opus-4-6-thinking.
Elo Milestones
When the AI frontier crossed each Elo threshold for the first time.
gpt-4-0314 · 2023-12 · openai
gpt-4-0314 · 2024-01 · openai
gpt-4-0125-preview · 2024-02 · openai
gpt-4-0125-preview · 2024-02 · openai
chatgpt-4o-latest · 2024-09 · openai
o1-2024-12-17 · 2025-01 · openai
grok-3-preview-02-24 · 2025-03 · xai
gemini-2.5-pro · 2025-07 · google
claude-opus-4-6-thinking · 2026-02 · anthropic
Key Breakthroughs
Slide through 26 significant moments in the AI race. Each dot marks a breakthrough.
UW enters the race
guanaco-33b is the first UW model to take #1
The Open-Source Gap
Elo difference between the #1 proprietary and #1 open-source model. Lower = open-source is closer. The gap hit 4 points in 2025-02 — then proprietary labs fought back.
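For readers who want to reproduce this chart from a monthly leaderboard table, a minimal pandas sketch follows. The column names and the ratings in it are illustrative assumptions, not the real schema or real Arena numbers:

```python
import pandas as pd

# Toy monthly leaderboard: one row per (month, model).
# Ratings here are illustrative only, not the real Arena numbers.
df = pd.DataFrame({
    "month":   ["2025-01", "2025-01", "2025-02", "2025-02"],
    "model":   ["prop-model", "deepseek-v3", "prop-model", "deepseek-r1"],
    "license": ["Proprietary", "Open", "Proprietary", "Open"],
    "elo":     [1380, 1360, 1384, 1380],
})

# Best Elo per month for each license class, then the difference.
best = df.pivot_table(index="month", columns="license", values="elo", aggfunc="max")
gap = best["Proprietary"] - best["Open"]
print(gap)  # 2025-01: 20, 2025-02: 4 -> lower means open source is closer
```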
Crown Change Timeline
21 changes. Every time the #1 model changed hands since May 2023.
2023-06: vicuna-13b → guanaco-33b (UW)
2023-07: guanaco-33b → vicuna-33b (LMSYS)
2023-10: vicuna-33b → wizardlm-70b (microsoft)
2023-12: wizardlm-70b → gpt-4-0314 (openai)
2024-02: gpt-4-0314 → gpt-4-0125-preview (openai)
2024-03: gpt-4-0125-preview → gpt-4-1106-preview (openai)
2024-04: gpt-4-1106-preview → claude-3-opus-20240229 (anthropic)
2024-05: claude-3-opus-20240229 → gpt-4-turbo-2024-04-09 (openai)
Provider Dominance
Months at #1 on the Arena leaderboard since May 2023. Who has dominated the AI race?
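The dominance chart is a straightforward count over monthly crown holders. Here is a small pandas sketch over a toy slice of the crown timeline above; the column names are our own and the slice is truncated, so the counts are illustrative only:

```python
import pandas as pd

# Truncated toy slice of the crown timeline; the real series has one
# row per month from 2023-05 to 2026-04.
crowns = pd.DataFrame({
    "month":    ["2023-05", "2023-06", "2023-07", "2023-10", "2023-12"],
    "model":    ["vicuna-13b", "guanaco-33b", "vicuna-33b", "wizardlm-70b", "gpt-4-0314"],
    "provider": ["LMSYS", "UW", "LMSYS", "microsoft", "openai"],
})

# Months at #1 per provider (the dominance chart) ...
months_at_number_one = crowns["provider"].value_counts()

# ... and how many times the #1 model changed hands.
crown_changes = int((crowns["model"] != crowns["model"].shift()).sum()) - 1

print(months_at_number_one)
print(crown_changes)  # 4 changes in this toy slice
```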
Category Breakdown: Who Wins at What?
Arena tracks Elo by category. Here's who leads in coding, math, creative writing, and more — with Elo gain over time.
Coding: +276 Elo · #1 claude-opus-4-6-thinking (anthropic), 1556
English: +275 Elo · #1 claude-opus-4-6-thinking (anthropic), 1511
Hard Prompts: +262 Elo · #1 claude-opus-4-6-thinking (anthropic), 1536
Chinese: +216 Elo · #1 claude-opus-4-6-thinking (anthropic), 1546
Multi-Turn: +202 Elo · #1 claude-opus-4-6-thinking (anthropic), 1509
Math: +171 Elo · #1 claude-opus-4-6-thinking (anthropic), 1518
Creative Writing: +170 Elo · #1 claude-opus-4-6-thinking (anthropic), 1492
Instruction Following: +157 Elo · #1 claude-opus-4-6-thinking (anthropic), 1514
Japanese: +124 Elo · #1 gemini-3.1-pro-preview (google), 1542
Korean: +114 Elo · #1 gemini-3.1-pro-preview (google), 1495
Vision Arena: Multimodal Model Rankings
How multimodal (vision) models have improved over time. Currently led by claude-opus-4-6-thinking at 1314 Elo.
Open-Source Champions Over Time
Every model that held the open-source crown, from early LLaMA to modern reasoning models.
vicuna-13b · LMSYS · 2023-05
guanaco-33b · UW · 2023-06
vicuna-33b · LMSYS · 2023-07
wizardlm-70b · microsoft · 2023-10
tulu-2-dpo-70b · AllenAI/UW · 2023-12
mixtral-8x7b-instruct-v0.1 · mistral · 2024-01
qwen1.5-72b-chat · alibaba · 2024-03
llama-3-70b-instruct · meta · 2024-05
gemma-2-27b-it · google · 2024-07
llama-3.1-405b-instruct · meta · 2024-08
llama-3.1-405b-instruct-bf16 · meta · 2024-10
llama-3.1-nemotron-70b-instruct · nvidia · 2024-11
athene-v2-chat · NexusFlow · 2024-12
deepseek-v3 · deepseek · 2025-01
deepseek-r1 · deepseek · 2025-02
deepseek-v3-0324 · deepseek · 2025-04
deepseek-r1-0528 · deepseek · 2025-07
qwen3-235b-a22b-instruct-2507 · alibaba · 2025-08
glm-4.5 · zai · 2025-09
glm-4.6 · zai · 2025-11
kimi-k2.5-thinking · moonshot · 2026-02
qwen3.5-397b-a17b · alibaba · 2026-03
glm-5.1 · zai · 2026-04
Monthly Rankings
Full top-20 Arena rankings for any month. Scroll through 37 snapshots from 2023-05 to 2026-04.
| # | Model | Org | License | Elo | Votes |
|---|---|---|---|---|---|
| 1 | claude-opus-4-6-thinking | anthropic | Proprietary | 1499 | 17,219 |
| 2 | claude-opus-4-6 | anthropic | Proprietary | 1494 | 18,377 |
| 3 | gemini-3.1-pro-preview | google | Proprietary | 1488 | 21,708 |
| 4 | muse-spark | meta | Proprietary | 1481 | 4,182 |
| 5 | gemini-3-pro | google | Proprietary | 1479 | 41,578 |
| 6 | gpt-5.4-high | openai | Proprietary | 1472 | 10,633 |
| 7 | qwen3.5-max-preview | alibaba | Proprietary | 1471 | 8,774 |
| 8 | glm-5.1 | zai | Open | 1469 | 6,274 |
| 9 | gemini-3-flash | google | Proprietary | 1466 | 30,922 |
| 10 | gemini-2.5-pro | google | Proprietary | 1461 | 108,717 |
| 11 | grok-4.20-beta-0309-reasoning | xai | Proprietary | 1455 | 10,713 |
| 12 | dola-seed-2.0-pro | | Proprietary | 1455 | 19,770 |
| 13 | grok-4.20-beta1 | xai | Proprietary | 1455 | 10,884 |
| 14 | gpt-5.4 | openai | Proprietary | 1452 | 10,990 |
| 15 | grok-4.20-multi-agent-beta-0309 | xai | Proprietary | 1450 | 11,079 |
| 16 | ernie-5.0-0110 | baidu | Proprietary | 1449 | 23,507 |
| 17 | gemini-3-flash (thinking-minimal) | google | Proprietary | 1448 | 34,519 |
| 18 | amazon-nova-experimental-chat-26-02-10 | amazon | Proprietary | 1448 | 3,448 |
| 19 | claude-opus-4-5-20251101 | anthropic | Proprietary | 1448 | 48,318 |
| 20 | kimi-k2.5-thinking | moonshot | Open | 1447 | 21,678 |
Related
The AI Race Timeline: interactive monthly scrubber with crown holders, provider rankings, and benchmark health.
LLM Benchmark Rankings: BenchLM's own benchmark-based rankings across coding, math, reasoning, and more.
Frequently Asked Questions
What is the Arena Elo leaderboard?
The Arena (formerly Chatbot Arena / LMSYS) is a crowdsourced platform where users vote on anonymous side-by-side model comparisons. Each model receives an Elo rating based on its win rate, similar to chess rankings. It is widely considered the most reliable measure of real-world LLM quality because it reflects human preference rather than automated benchmarks.
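For intuition, here is a minimal, illustrative Elo update for a single head-to-head vote. The Arena's actual leaderboard is computed with a Bradley-Terry-style fit over all battles (with tie and style handling), so treat this as a sketch of the underlying idea rather than the production method; the function names and K-factor are our own choices:

```python
# Illustrative Elo update for one Arena-style vote. Not the Arena's
# production method; it only shows the standard Elo mechanics the
# leaderboard ratings are analogous to.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return new (r_a, r_b) ratings after a single vote; k sets the step size."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Example: a 1450-rated model beats a 1500-rated one.
print(update_elo(1450, 1500, a_won=True))  # -> (~1468.3, ~1481.7)
```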
How much have AI models improved since 2023?
The top Arena Elo score rose from 1094 (vicuna-13b) in May 2023 to 1499 (claude-opus-4-6-thinking) in 2026-04 — a gain of +405 Elo points over 37 months, averaging 11 points per month. To put this in perspective, in chess a 400-point Elo difference means the higher-rated player wins ~91% of the time.
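That win-rate figure is just the Elo expected-score formula evaluated at a 400-point gap (ignoring draws); a quick self-contained check:

```python
# Expected score for a player rated 400 Elo points above the opponent.
p = 1 / (1 + 10 ** (-400 / 400))
print(round(p, 3))  # 0.909, i.e. roughly 91%
```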
Which company has dominated the AI leaderboard?
openai has held the #1 position for 16 out of 37 months (43%), followed by google (7 months) and anthropic (5 months). The crown has changed hands 21 times since May 2023.
Are open-source LLMs catching up to proprietary models?
Open-source models have gained +375 Elo points. The gap hit a low of just 4 points in 2025-02. However, proprietary labs have since widened the gap. The open-source frontier is led by models like DeepSeek, Qwen, and Kimi, with Chinese labs driving much of the open-source progress.
How does Arena Elo differ from benchmark scores?
Arena Elo ratings are based on real user votes in blind A/B tests, making them less gameable than automated benchmarks. While benchmarks like MMLU or HumanEval test specific capabilities, Arena Elo measures overall user preference across diverse tasks. BenchLM tracks both — see our main leaderboard for benchmark-based rankings.
Which AI model is best for coding?
According to Arena coding Elo, claude-opus-4-6-thinking (anthropic) currently leads with an Elo of 1556. The coding category has seen 14 crown changes over 21 months.
What models are best for math and reasoning?
The current Arena math leader is claude-opus-4-6-thinking (anthropic) at 1518 Elo. Math Elo has gained +171 points since tracking began, making it one of the fastest-improving categories.
Where does this data come from?
All Elo ratings come from the Arena Leaderboard Dataset on HuggingFace, maintained by Arena Intelligence (lmarena-ai). We process the text, text_style_control, vision, and webdev subsets. BenchLM.ai visualizes this data but does not generate the underlying ratings.
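If you want the raw data, the Hugging Face `datasets` library can load it directly. The repository id and config name below are placeholders, not confirmed identifiers; check the lmarena-ai organization on HuggingFace for the exact names:

```python
from datasets import load_dataset

# Placeholder repo id and config: substitute the actual lmarena-ai
# identifiers, which may differ from these.
ds = load_dataset("lmarena-ai/arena-leaderboard", name="text", split="train")

# Inspect the schema; column names here are not guaranteed.
print(ds.column_names)
print(ds[0])
```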
Data attribution: All Elo ratings on this page come from the Arena Leaderboard Dataset by Arena Intelligence (lmarena-ai), available on HuggingFace. Data covers the text, text_style_control, vision, and webdev Arena subsets.
BenchLM.ai processes and visualizes this data to provide historical insights. We do not generate the underlying Elo ratings. For BenchLM's own benchmark-based rankings, see the main leaderboard or the AI Race timeline.