Multimodal workloads — processing images, charts, documents, and screenshots — often involve large inputs that drive up token costs quickly. This ranking divides each model's weighted multimodal score (MMMU-Pro, OfficeQA-Pro) by output token price. For document processing pipelines and visual AI applications running at scale, the value leaders here offer the best multimodal reasoning per dollar spent.
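The scoring described above can be sketched in a few lines. This is an illustrative reconstruction, not BenchLM.ai's actual pipeline: the benchmark weights, model names, scores, and prices below are hypothetical placeholders chosen only to show how a cheap model can out-rank a stronger but pricier one on value.

```python
def value_score(benchmark_scores, weights, output_price_per_mtok):
    """Weighted benchmark average divided by output token price ($ per 1M tokens)."""
    weighted = sum(benchmark_scores[b] * w for b, w in weights.items())
    total_w = sum(weights.values())
    return (weighted / total_w) / output_price_per_mtok

# Hypothetical data: equal weights, scores out of 100, price in $ per 1M output tokens.
weights = {"MMMU-Pro": 0.5, "OfficeQA-Pro": 0.5}
models = {
    "model-a": ({"MMMU-Pro": 68.0, "OfficeQA-Pro": 74.0}, 0.40),  # cheaper, weaker
    "model-b": ({"MMMU-Pro": 75.0, "OfficeQA-Pro": 80.0}, 2.00),  # stronger, pricier
}

ranking = sorted(
    ((name, value_score(scores, weights, price))
     for name, (scores, price) in models.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
# model-a: 177.50
# model-b: 38.75
```

Note how the cheaper model tops the value ranking despite a lower raw benchmark average, which is exactly the dynamic that puts a lightweight model at the head of this leaderboard.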
According to BenchLM.ai, Gemini 3.1 Flash-Lite leads this ranking with a score of 182.75, followed by GPT-4.1 nano (148.25) and Gemini 2.5 Flash (112.75). There is a significant gap between the leading models and the rest of the field.
The best open-weight option is DeepSeek Coder 2.0 (ranked #5 with a score of 53.23). While proprietary models lead by a wide margin on this value metric, open-weight options remain a viable choice for teams willing to trade performance for full model control.
This ranking is based on weighted averages across the scoring benchmarks in the multimodalGrounded category tracked by BenchLM.ai. For detailed model profiles, click any model name below. To compare two specific models head-to-head, use the "vs" links.