Toloka Arena

An independent agentic-intelligence evaluation from Toloka using private simulated workflows and a pass^5 metric.

How BenchLM shows Toloka Arena

BenchLM tracks Toloka Arena as a source-backed external benchmark using the public pass^5 metric and update label from the official page.

Toloka Arena is display-only on BenchLM. The public site renders an arena leaderboard, but BenchLM did not find a stable public static CSV or API from which to mirror every row, so for now this page carries source metadata only.

pass^5 metric · Private simulated workflows · Agentic intelligence · Official page · Source metadata only

About Toloka Arena

Year: 2026

Tasks: Private simulated enterprise workflows

Format: pass^5 arena score

Difficulty: Agentic workflow reliability

Toloka Arena evaluates agents on private simulated workflows with tools, databases, policies, and multi-turn tasks. BenchLM tracks it as source metadata only until a stable public leaderboard data feed is available.
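
This page does not define how pass^5 is computed. A minimal sketch follows, assuming it matches the pass^k convention used in other agentic benchmarks: a task is credited only if all k independent trials succeed, and the benchmark score is the mean over tasks. The function names and the task names in the example are hypothetical, and Toloka's actual scoring may differ.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int = 5) -> float:
    """Estimate the chance that k trials drawn from num_trials all succeed.

    Assumption: this is the common pass^k estimator, C(c, k) / C(n, k),
    not a formula confirmed by Toloka.
    """
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # comb(c, k) is 0 when c < k, so any task with fewer than k successes scores 0.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int = 5) -> float:
    """Average the per-task estimate over all tasks (hypothetical aggregation)."""
    if not results:
        raise ValueError("results must contain at least one task")
    per_task = [pass_hat_k(len(trials), sum(trials), k) for trials in results.values()]
    return sum(per_task) / len(per_task)


# Hypothetical example: two tasks, five trials each.
example = {
    "refund_request": [True, True, True, True, True],    # 5/5 succeed -> 1.0
    "invoice_dispute": [True, True, True, True, False],  # 4/5 succeed -> 0.0
}
score = benchmark_pass_hat_k(example, k=5)  # 0.5
```

Under this reading, a single failed trial out of five drops a task's score to zero, which is why a pass^5 score reflects reliability rather than best-case capability.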

BenchLM freshness & provenance

Version: Toloka Arena 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

pass^5 score table (0 models)

FAQ

What does Toloka Arena measure?

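Toloka Arena is an independent agentic-intelligence evaluation from Toloka. It measures how reliably agents complete private simulated workflows that involve tools, databases, policies, and multi-turn tasks, and it reports results as a pass^5 score.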

Which model leads the published Toloka Arena snapshot?

BenchLM has not mirrored any model scores from Toloka Arena yet, so no leader is listed here. See the official page for the current leaderboard.

How many models are evaluated on Toloka Arena?

0 AI models are included in BenchLM's mirrored Toloka Arena snapshot, based on the public leaderboard captured on June 3, 2026.

Last updated: June 3, 2026 · mirrored from the public benchmark leaderboard
