An interactive real-world agent benchmark grounded in practical consumer-service tasks such as delivery, in-store consumption, and online travel workflows.
BenchLM mirrors the published score view for VITA-Bench. Qwen3.6 Plus leads the public snapshot at 44.3%, followed by Qwen3.5 397B (43.7%) and Claude Opus 4.5 (23.3%). BenchLM does not use these results to rank models overall.
Qwen3.6 Plus (Alibaba)
Qwen3.5 397B (Alibaba)
Claude Opus 4.5 (Anthropic)
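The published VITA-Bench snapshot has a close race for first place and a large drop after it: Qwen3.6 Plus sits at 44.3%, Qwen3.5 397B is 0.6 points behind at 43.7%, and the third row, Claude Opus 4.5 at 23.3%, is 21.0 points behind the leader. The reported top-10 spread is 28.8 points (only 6 models are listed, so it covers the whole table), which means the benchmark separates models clearly below the top two.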
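The gap figures are plain subtraction from the leader's score. A minimal Python sketch, using only the three scores quoted on this page (the function name `gaps_to_leader` is illustrative and not part of any BenchLM tooling):

```python
# Published VITA-Bench scores quoted on this page, in percent.
scores = {
    "Qwen3.6 Plus": 44.3,
    "Qwen3.5 397B": 43.7,
    "Claude Opus 4.5": 23.3,
}


def gaps_to_leader(scores):
    """Return (model, score, gap) rows sorted best-first.

    gap is the number of percentage points behind the top score,
    rounded to one decimal to hide floating-point noise.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[0][1]
    return [(name, score, round(top - score, 1)) for name, score in ranked]


rows = gaps_to_leader(scores)
for name, score, gap in rows:
    print(f"{name}: {score:.1f}% ({gap:.1f} points behind the leader)")
```

The scores of the other three evaluated models are not quoted here, so the 28.8-point spread cannot be recomputed from this data alone.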
Six models have been evaluated on VITA-Bench. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. VITA-Bench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
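To illustrate what "display-only" means in practice, here is a rough Python sketch of a weighted category score that skips excluded benchmarks. BenchLM's actual formula is not described on this page, so everything except the 22% weight and the 44.3% score is an assumption: the simple mean, the `Benchmark` class, the `counted` flag, and the second benchmark with its placeholder score are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    score: float   # a model's score on this benchmark, in percent
    counted: bool  # False means display-only (excluded from scoring)


def category_score(benchmarks):
    """Mean score over counted benchmarks, or None if none are counted.

    Assumption: a plain average. The real aggregation may differ.
    """
    counted = [b.score for b in benchmarks if b.counted]
    return sum(counted) / len(counted) if counted else None


AGENTIC_WEIGHT = 0.22  # category weight stated on this page

agentic = [
    # VITA-Bench is shown for reference but does not feed the formula.
    Benchmark("VITA-Bench", 44.3, counted=False),
    # Hypothetical benchmark with a placeholder score, for illustration only.
    Benchmark("hypothetical-agentic-bench", 60.0, counted=True),
]

cat = category_score(agentic)
contribution = AGENTIC_WEIGHT * cat  # points contributed to the overall score
print(f"Agentic category score: {cat:.1f}, weighted contribution: {contribution:.2f}")
```

Under these assumptions, changing the VITA-Bench score leaves `contribution` unchanged, which is the sense in which an excluded benchmark does not directly affect overall rankings.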
Year: 2025
Tasks: Interactive consumer-service agent tasks
Format: End-to-end interactive agent evaluation
Difficulty: Long-horizon real-world workflows
VITA-Bench is built to test realistic interactive agent behavior rather than toy tool calls. It stresses long-horizon coordination, tool selection, changing user intent, and domain switching across daily-life applications.
Version: VITA-Bench 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.