This ranking answers the simplest question: which model gives you the most benchmark performance per dollar? It divides each model's overall weighted score (across all 8 categories) by its output token price. The leaders here are generalist value picks — strong across coding, reasoning, agentic, knowledge, and more, at prices that won't break your API budget. Start here if you need a single cost-effective model for mixed workloads.
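The arithmetic behind this ranking is simple enough to sketch. Below is a minimal illustration in Python; the model names, scores, and prices are made-up placeholders, not actual BenchLM.ai data, and the real scoring formula's category weighting is not reproduced here.

```python
# Sketch of the value metric described above:
#   value = overall weighted score / output token price
# All figures below are illustrative assumptions, not BenchLM.ai data.

models = [
    # (name, overall weighted score, output price in $ per 1M tokens)
    ("model-a", 70.0, 0.50),
    ("model-b", 66.0, 0.60),
    ("model-c", 72.0, 4.00),
]

def value_score(score: float, price_per_m_tokens: float) -> float:
    """Benchmark points per dollar of output tokens."""
    return score / price_per_m_tokens

# Rank models from best to worst value.
ranked = sorted(models, key=lambda m: value_score(m[1], m[2]), reverse=True)
for name, score, price in ranked:
    print(f"{name}: {value_score(score, price):.1f} points per $")
```

Note how the division rewards cheap generalists: a mid-scoring model at a low price can out-rank a higher-scoring but expensive one.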
According to BenchLM.ai, Gemini 3.1 Flash-Lite leads this ranking with a score of 140, followed by GPT-4.1 nano (110) and GPT-4o mini (90). There is a significant gap between the leading models and the rest of the field.
The best open-weight option is DeepSeek Coder 2.0 (ranked #5 with a score of 56.36). While proprietary models lead, open-weight options are within striking distance for teams willing to trade a few points of performance for full model control.
This ranking is based on overall weighted scores computed with BenchLM.ai's scoring formula. For detailed model profiles, click any model name below. To compare two specific models head-to-head, use the "vs #" links.