An automated evaluation framework using 1,000+ real-world user tasks covering reasoning, planning, coding, and creative writing. Highly correlated with Chatbot Arena human preference rankings.
Year
2024
Tasks
1,024 real-world tasks
Format
Real-world task evaluation
Difficulty
Diverse real-world scenarios
WildBench bridges the gap between static benchmarks and human preference evaluations. Tasks are derived from real ChatGPT conversations, making it more representative of actual user needs than synthetic benchmarks.
Version
WildBench 2024
Refresh cadence
Annual
Staleness state
Refreshing
Question availability
Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
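The tier decision described above can be sketched as a simple mapping from freshness metadata to a display tier. This is a minimal, hypothetical illustration: the field names, threshold logic, and `assign_tier` function are assumptions for clarity, not BenchLM's actual implementation (see the methodology page for the real policy).

```python
from dataclasses import dataclass

@dataclass
class BenchmarkFreshness:
    """Hypothetical freshness metadata for a benchmark listing."""
    name: str
    refresh_cadence: str   # e.g. "annual", "none"
    staleness_state: str   # e.g. "fresh", "refreshing", "stale"

def assign_tier(b: BenchmarkFreshness) -> str:
    # Assumed mapping: recently refreshed benchmarks stay strong
    # differentiators, benchmarks mid-refresh are watched, and
    # unmaintained ones are kept for display only.
    if b.staleness_state == "fresh":
        return "strong differentiator"
    if b.staleness_state == "refreshing" and b.refresh_cadence != "none":
        return "benchmark to watch"
    return "display-only reference"

wildbench = BenchmarkFreshness("WildBench 2024", "annual", "refreshing")
print(assign_tier(wildbench))  # -> benchmark to watch
```

Under this sketch, WildBench's "Refreshing" state with an annual cadence would place it in the middle tier rather than retiring it to display-only status.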
No models have been evaluated on WildBench on BenchLM yet.
Get notified when new models drop, benchmark scores change, or the leaderboard shifts. One email per week.
Free. No spam. Unsubscribe anytime. We only store derived location metadata for consent routing.