Agentic workloads are token-intensive — agents loop, retry, and chain multiple calls. That makes cost-per-token a critical factor alongside raw capability. This ranking divides each model's weighted agentic score (Terminal-Bench 2.0, BrowseComp, OSWorld-Verified) by its output token price. The result shows which models give you the most agent capability per dollar. If you're building production AI agents with budget constraints, this is where you start.
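The ranking arithmetic can be sketched as follows. The equal benchmark weights, the per-benchmark scores, and the $0.50 per million output tokens price are hypothetical illustrations; the page names the benchmarks but does not publish BenchLM.ai's actual weights or the prices used.

```python
# Minimal sketch of the cost-efficiency ranking: weighted agentic score
# divided by output token price. All numbers below are illustrative
# assumptions, not BenchLM.ai's published values.

def weighted_agentic_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-benchmark scores (0-100 scale)."""
    total_weight = sum(weights.values())
    return sum(scores[b] * weights[b] for b in scores) / total_weight

def efficiency(weighted_score: float, price_per_mtok_out: float) -> float:
    """Capability per dollar: weighted score / output price ($ per 1M tokens)."""
    return weighted_score / price_per_mtok_out

# Hypothetical model with equal benchmark weights:
scores = {"Terminal-Bench 2.0": 52.0, "BrowseComp": 38.0, "OSWorld-Verified": 45.0}
weights = {b: 1.0 for b in scores}

agentic = weighted_agentic_score(scores, weights)   # (52 + 38 + 45) / 3 = 45.0
print(efficiency(agentic, 0.50))                    # 45.0 / 0.50 = 90.0
```

Note that dividing by price makes cheap models dominate: halving the output price doubles the efficiency score, which is why a small, inexpensive model can outrank a stronger but pricier one on this metric.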
According to BenchLM.ai, Gemini 3.1 Flash-Lite leads this ranking with a score of 123.03, followed by GPT-4.1 nano (87.33) and GPT-4o mini (82.35). There is a significant gap between the leading models and the rest of the field.
The best open-weight option is DeepSeek Coder 2.0 (ranked #5 with a score of 57.15). While proprietary models lead, open-weight options are within striking distance for teams willing to trade a few points of performance for full model control.
This ranking is based on weighted averages across the agentic benchmarks tracked by BenchLM.ai. For detailed model profiles, click any model name below. To compare two specific models head-to-head, use the "vs #" links.