An automated evaluation framework built on 1,024 real-world user tasks covering reasoning, planning, coding, and creative writing. Its rankings correlate strongly with Chatbot Arena's human preference rankings.
Year: 2024
Tasks: 1,024 real-world tasks
Format: Real-world task evaluation
Difficulty: Diverse real-world scenarios
WildBench bridges the gap between static benchmarks and human preference evaluations. Its tasks are derived from real ChatGPT conversations, which makes them more representative of actual user needs than synthetic benchmarks.
Version: WildBench 2024
Refresh cadence: Annual
Staleness state: Refreshing
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
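As a rough illustration of the kind of tiering decision described above, here is a minimal Python sketch. The field names, rules, and tier labels are assumptions for illustration only, not BenchLM's actual implementation; the authoritative rules are on the methodology page.

from dataclasses import dataclass

@dataclass
class BenchmarkFreshness:
    refresh_cadence: str    # e.g. "Annual" (hypothetical field name)
    staleness_state: str    # e.g. "Refreshing" or "Stale"
    questions_public: bool  # public question sets are easier to train against

def tier(meta: BenchmarkFreshness) -> str:
    """Map freshness metadata to a display tier (hypothetical rules)."""
    # A benchmark that is actively refreshing and keeps its questions
    # private is hardest to overfit, so it differentiates models best.
    if meta.staleness_state == "Refreshing" and not meta.questions_public:
        return "strong differentiator"
    # Refreshing but public: fresh data, yet contamination is possible.
    if meta.staleness_state == "Refreshing":
        return "benchmark to watch"
    # Stale benchmarks are kept only as historical context.
    return "display-only reference"

# WildBench's metadata on this page (annual refresh, refreshing, public set)
# would land in the middle tier under these illustrative rules:
print(tier(BenchmarkFreshness("Annual", "Refreshing", True)))
# -> benchmark to watch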
No models have been evaluated on WildBench on BenchLM yet.