An independent agentic-intelligence evaluation from Toloka using private simulated workflows and a pass^5 metric.
BenchLM tracks Toloka Arena as a source-backed external benchmark, using the public pass^5 metric and the update label from the official page.
Toloka Arena is display-only on BenchLM. The public site renders an arena leaderboard, but BenchLM did not find a stable public CSV or API for mirroring every row, so this page carries source metadata only for now.
Year: 2026
Tasks: Private simulated enterprise workflows
Format: pass^5 arena score
Difficulty: Agentic workflow reliability
Toloka Arena evaluates agents on private simulated workflows with tools, databases, policies, and multi-turn tasks. BenchLM tracks it as source metadata only until a stable public leaderboard data feed is available.
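The page does not define pass^5. A minimal sketch, assuming it follows the pass^k convention used by agent benchmarks such as τ-bench: the probability that k independent trials of the same task all succeed, averaged over tasks. The function names and the sample data below are hypothetical and are not taken from Toloka or BenchLM.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k independent trials of one task all succeed.

    Uses the combinatorial estimator C(c, k) / C(n, k), where n is the number
    of trials run and c is the number that succeeded. This is an assumed
    definition, not one confirmed by the source page.
    """
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # math.comb returns 0 when c < k, so the score is 0 in that case.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int = 5) -> float:
    """Average the per-task estimate over every task in `results`."""
    if not results:
        raise ValueError("results must contain at least one task")
    scores = [pass_hat_k(len(trials), sum(trials), k) for trials in results.values()]
    return sum(scores) / len(scores)


# Hypothetical trial outcomes: two tasks, five trials each.
example_results = {
    "refund_workflow": [True, True, True, True, True],
    "inventory_update": [True, True, False, True, True],
}
example_score = benchmark_pass_hat_k(example_results, k=5)
```

With exactly five trials per task, a task contributes to the score only when all five trials pass, which is why this kind of metric rewards reliability rather than a single lucky run.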
Version: Toloka Arena 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
BenchLM has not mirrored any model results for Toloka Arena yet.
BenchLM's mirrored Toloka Arena snapshot, based on the public leaderboard captured on June 3, 2026, includes 0 AI models.