Thirty-six months of Arena Elo ratings tell the story of AI progress: from vicuna-13b at 1094 Elo to claude-opus-4-6-thinking at 1500. Along the way: 21 crown changes, 8 competing providers, and an open-source community that nearly caught up.
Key stats from 36 months of AI model competition.
- +406 Total Elo Gain (1094 → 1500)
- 21 Crown Changes (36 months tracked)
- 5 mo Longest Reign (gemini-2.5-pro)
- +355 Open-Source Gain (gap low: 4 pts)
- 11.3/mo Avg Elo Velocity (points gained per month)
- 8 Providers Competed (openai leads)
The full history of AI model improvement. Proprietary models (solid) vs. open-source (dashed). The journey from vicuna-13b to claude-opus-4-6-thinking.
When the AI frontier crossed each Elo threshold for the first time.
- gpt-4-0314 · 2023-12 · openai
- gpt-4-0314 · 2024-01 · openai
- gpt-4-0125-preview · 2024-02 · openai
- gpt-4-0125-preview · 2024-02 · openai
- chatgpt-4o-latest · 2024-09 · openai
- o1-2024-12-17 · 2025-01 · openai
- grok-3-preview-02-24 · 2025-03 · xai
- gemini-2.5-pro · 2025-07 · google
- claude-opus-4-6-thinking · 2026-02 · anthropic
Slide through 25 significant moments in the AI race. Each dot marks a breakthrough.
UW enters the race: guanaco-33b is the first UW model to take #1.
Elo difference between the #1 proprietary and #1 open-source model. Lower = open-source is closer. The gap hit 4 points in 2025-02 — then proprietary labs fought back.
Every time the #1 model changed hands since May 2023.
- 2023-06: vicuna-13b → guanaco-33b (UW)
- 2023-07: guanaco-33b → vicuna-33b (LMSYS)
- 2023-10: vicuna-33b → wizardlm-70b (microsoft)
- 2023-12: wizardlm-70b → gpt-4-0314 (openai)
- 2024-02: gpt-4-0314 → gpt-4-0125-preview (openai)
- 2024-03: gpt-4-0125-preview → gpt-4-1106-preview (openai)
- 2024-04: gpt-4-1106-preview → claude-3-opus-20240229 (anthropic)
- 2024-05: claude-3-opus-20240229 → gpt-4-turbo-2024-04-09 (openai)
Months at #1 on the Arena leaderboard since May 2023. Who has dominated the AI race?
Arena tracks Elo by category. Here's who leads in coding, math, creative writing, and more — with Elo gain over time.
| Category | Elo Gain | Current #1 | Org | Elo |
|---|---|---|---|---|
| Coding | +276 | claude-opus-4-6-thinking | anthropic | 1556 |
| English | +275 | claude-opus-4-6-thinking | anthropic | 1510 |
| Hard Prompts | +263 | claude-opus-4-6-thinking | anthropic | 1537 |
| Chinese | +225 | claude-opus-4-6 | anthropic | 1555 |
| Multi-Turn | +205 | claude-opus-4-6-thinking | anthropic | 1512 |
| Creative Writing | +171 | claude-opus-4-6-thinking | anthropic | 1493 |
| Math | +170 | gpt-5.4-high | openai | 1517 |
| Instruction Following | +155 | claude-opus-4-6-thinking | anthropic | 1512 |
| Japanese | +118 | gemini-3.1-pro-preview | google | 1536 |
| Korean | +118 | gemini-3.1-pro-preview | google | 1498 |
How multimodal (vision) models have improved over time. Currently led by claude-opus-4-6 at 1310 Elo.
Every model that held the open-source crown, from early LLaMA to modern reasoning models.
- vicuna-13b (LMSYS) · 2023-05
- guanaco-33b (UW) · 2023-06
- vicuna-33b (LMSYS) · 2023-07
- wizardlm-70b (microsoft) · 2023-10
- tulu-2-dpo-70b (AllenAI/UW) · 2023-12
- mixtral-8x7b-instruct-v0.1 (mistral) · 2024-01
- qwen1.5-72b-chat (alibaba) · 2024-03
- llama-3-70b-instruct (meta) · 2024-05
- gemma-2-27b-it (google) · 2024-07
- llama-3.1-405b-instruct (meta) · 2024-08
- llama-3.1-405b-instruct-bf16 (meta) · 2024-10
- llama-3.1-nemotron-70b-instruct (nvidia) · 2024-11
- athene-v2-chat (NexusFlow) · 2024-12
- deepseek-v3 (deepseek) · 2025-01
- deepseek-r1 (deepseek) · 2025-02
- deepseek-v3-0324 (deepseek) · 2025-04
- deepseek-r1-0528 (deepseek) · 2025-07
- qwen3-235b-a22b-instruct-2507 (alibaba) · 2025-08
- glm-4.5 (zai) · 2025-09
- glm-4.6 (zai) · 2025-11
- kimi-k2.5-thinking (moonshot) · 2026-02
- qwen3.5-397b-a17b (alibaba) · 2026-03
Full top-20 Arena rankings for any month. Scroll through 36 snapshots from 2023-05 to 2026-04.
| # | Model | Org | License | Elo | Votes |
|---|---|---|---|---|---|
| 1 | claude-opus-4-6-thinking | anthropic | Proprietary | 1500 | 13,979 |
| 2 | claude-opus-4-6 | anthropic | Proprietary | 1497 | 14,934 |
| 3 | gemini-3.1-pro-preview | google | Proprietary | 1490 | 17,559 |
| 4 | gemini-3-pro | google | Proprietary | 1480 | 41,632 |
| 5 | gpt-5.4-high | openai | Proprietary | 1474 | 7,160 |
| 6 | qwen3.5-max-preview | alibaba | Proprietary | 1472 | 5,899 |
| 7 | gemini-3-flash | google | Proprietary | 1467 | 30,966 |
| 8 | grok-4.20-beta1 | xai | Proprietary | 1462 | 7,380 |
| 9 | gemini-2.5-pro | google | Proprietary | 1460 | 105,423 |
| 10 | dola-seed-2.0-preview | Bytedance | Proprietary | 1457 | 13,461 |
| 11 | grok-4.20-beta-0309-reasoning | xai | Proprietary | 1456 | 7,344 |
| 12 | ernie-5.0-0110 | baidu | Proprietary | 1449 | 20,836 |
| 13 | gemini-3-flash (thinking-minimal) | google | Proprietary | 1449 | 30,448 |
| 14 | kimi-k2.5-thinking | moonshot | Open | 1449 | 17,818 |
| 15 | gpt-5.4 | openai | Proprietary | 1449 | 7,261 |
| 16 | amazon-nova-experimental-chat-26-02-10 | amazon | Proprietary | 1449 | 3,461 |
| 17 | claude-opus-4-5-20251101-thinking-32k | anthropic | Proprietary | 1447 | 37,467 |
| 18 | claude-opus-4-5-20251101 | anthropic | Proprietary | 1447 | 44,715 |
| 19 | qwen3.5-397b-a17b | alibaba | Open | 1447 | 12,994 |
| 20 | grok-4.20-multi-agent-beta-0309 | xai | Proprietary | 1446 | 7,815 |
Related

- The AI Race Timeline: interactive monthly scrubber with crown holders, provider rankings, and benchmark health.
- LLM Benchmark Rankings: BenchLM's own benchmark-based rankings across coding, math, reasoning, and more.
The Arena (formerly Chatbot Arena / LMSYS) is a crowdsourced platform where users vote on anonymous side-by-side model comparisons. Each model receives an Elo rating based on its win rate, similar to chess rankings. It is widely considered the most reliable measure of real-world LLM quality because it reflects human preference rather than automated benchmarks.
The top Arena Elo score rose from 1094 (vicuna-13b) in May 2023 to 1500 (claude-opus-4-6-thinking) in April 2026, a gain of +406 Elo points over 36 months, averaging 11.3 points per month. For perspective, in chess a 400-point Elo difference means the higher-rated player wins ~91% of the time.
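The chess comparison can be made concrete with the standard Elo expected-score formula. This is only an illustration of what a rating gap implies about win probability; the Arena's own methodology fits ratings statistically from vote data rather than via sequential Elo updates, so these numbers are indicative, not exact vote shares:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability of A over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# A 400-point gap: the higher-rated side wins ~91% of the time.
print(round(expected_score(1500, 1100), 3))  # 0.909

# The 406-point climb from vicuna-13b (1094) to the current leader (1500)
# implies a slightly higher expected win rate still.
print(round(expected_score(1500, 1094), 3))  # 0.912
```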
openai has held the #1 position for 16 out of 36 months (44%), followed by google (7 months) and LMSYS (4 months). The crown has changed hands 21 times since May 2023.
Open-source models have gained +355 Elo points. The gap hit a low of just 4 points in 2025-02. However, proprietary labs have since widened the gap. The open-source frontier is led by models like DeepSeek, Qwen, and Kimi, with Chinese labs driving much of the open-source progress.
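The proprietary-vs-open gap is straightforward to recompute from any monthly leaderboard snapshot: take the best Elo in each license class and subtract. A minimal sketch, with illustrative model names and ratings chosen to reproduce a 4-point gap (not real snapshot data):

```python
# Each entry: (model, license, elo) — values are illustrative only.
snapshot = [
    ("proprietary-model-a", "Proprietary", 1380),
    ("open-model-b", "Open", 1376),
    ("proprietary-model-c", "Proprietary", 1350),
    ("open-model-d", "Open", 1340),
]

def frontier_gap(rows):
    """Elo of the best proprietary model minus Elo of the best open model."""
    best_prop = max(elo for _, lic, elo in rows if lic == "Proprietary")
    best_open = max(elo for _, lic, elo in rows if lic == "Open")
    return best_prop - best_open

print(frontier_gap(snapshot))  # 4
```

Running this per month over the 36 snapshots yields the gap curve charted above.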
Arena Elo ratings are based on real user votes in blind A/B tests, making them less gameable than automated benchmarks. While benchmarks like MMLU or HumanEval test specific capabilities, Arena Elo measures overall user preference across diverse tasks. BenchLM tracks both — see our main leaderboard for benchmark-based rankings.
According to Arena coding Elo, claude-opus-4-6-thinking (anthropic) currently leads with an Elo of 1556. The coding category has seen 14 crown changes over 20 months.
The current Arena math leader is gpt-5.4-high (openai) at 1517 Elo. Math Elo has gained +170 points since tracking began, making it one of the fastest-improving categories.
All Elo ratings come from the Arena Leaderboard Dataset on HuggingFace, maintained by Arena Intelligence (lmarena-ai). We process the text, text_style_control, vision, and webdev subsets. BenchLM.ai visualizes this data but does not generate the underlying ratings.
For BenchLM's own benchmark-based rankings, see the main leaderboard or the AI Race timeline.