Open-weight models have closed much of the gap with proprietary ones. The best open models now score within 5-10 points of the top closed APIs on most benchmarks. DeepSeek, Meta Llama, Alibaba Qwen, Zhipu GLM, and Mistral all ship strong open options — some of them reasoning models that match proprietary performance on math and coding. The main trade-offs are context window size (most cap at 128K vs 1M+ for top proprietary models) and agentic performance, where proprietary models still hold a wider lead. Self-hosting also shifts infrastructure burden to you, so factor in serving costs.
Unless noted otherwise, ranking surfaces on this page use BenchLM's provisional leaderboard lane rather than the stricter sourced-only verified leaderboard.
Bottom line: Open-weight models are within 5-10 points of the best proprietary APIs. GLM-5.1 leads, with GLM-5 (Reasoning) level on the rounded score, but DeepSeek and Llama are strong alternatives.
According to BenchLM.ai, GLM-5.1 leads this ranking with a score of 84, followed by GLM-5 (Reasoning) (84) and Kimi 2.6 (83). The top three are separated by a single point, so any of them would perform well for this use case.
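To make the spread concrete, here is a minimal sketch that ranks the three scores quoted above and measures the gap between first and third. The `Entry` type and `rank` function are hypothetical names for illustration, not part of any BenchLM.ai tooling. It also assumes the published scores are rounded integers, so the page's ordering of the two models at 84 may reflect unrounded values that are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    score: int  # provisional overall score as quoted on the page


# Scores quoted in the text above, deliberately listed out of order.
entries = [
    Entry("Kimi 2.6", 83),
    Entry("GLM-5.1", 84),
    Entry("GLM-5 (Reasoning)", 84),
]


def rank(items):
    """Sort by score, highest first. Equal scores share a position."""
    ordered = sorted(items, key=lambda e: -e.score)  # stable: ties keep input order
    ranked = []
    for i, entry in enumerate(ordered):
        if i > 0 and entry.score == ordered[i - 1].score:
            position = ranked[-1][0]  # tie: reuse the previous position
        else:
            position = i + 1
        ranked.append((position, entry))
    return ranked


ranked = rank(entries)
spread = ranked[0][1].score - ranked[-1][1].score

for position, entry in ranked:
    print(f"#{position} {entry.model}: {entry.score}")
print(f"Spread across the top three: {spread} point(s)")
```

On rounded scores alone the two GLM models are indistinguishable, which is why a one-point gap should not drive a choice between them.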
All models in this ranking are open-weight, meaning they can be self-hosted for maximum control and cost efficiency.
This ranking is based on provisional overall weighted scores computed with BenchLM.ai's scoring formula.
GLM-5.1 · Z.AI · 203K context
GLM-5 (Reasoning) · Z.AI · 200K context. Reasoning model with strong math and coding; level with GLM-5.1 on the rounded overall score (84).
Kimi 2.6 · Moonshot AI · 256K context
GLM-5 (Reasoning): shares the highest overall score (84) among open-weight models with GLM-5.1.
DeepSeek R1: competitive on reasoning and math benchmarks.
Llama 4 Maverick: Meta's strongest entry, good on coding and reasoning.
The top model is GLM-5.1 by Z.AI with a provisional score of 84.
The best open-weight model is GLM-5.1 at position #1.
49 models are included in this ranking.
Open-weight models are ranked by the same overall BenchLM score as proprietary ones. The gap has closed significantly — the best open models score within 5-10 points of the top closed APIs.
Open-weight models typically have smaller context windows (128K vs 1M+), which matters for long-document and agentic tasks. Self-hosting costs (GPU, inference optimization) are not reflected in benchmark scores.
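As a rough illustration of why context window size matters for long documents, the sketch below checks which of the three listed models could hold a given text. It rests on several assumptions: "K" is read as 1,000 tokens, token counts are estimated at about four characters per token (a common heuristic for English text, since real tokenizers vary by model), and `models_that_fit` with its `reserve_for_output` parameter is a hypothetical helper, not a BenchLM.ai or vendor API.

```python
# Context windows as listed on this page, in tokens.
# Assumption: "K" means 1,000 tokens.
CONTEXT_WINDOWS = {
    "GLM-5.1": 203_000,
    "GLM-5 (Reasoning)": 200_000,
    "Kimi 2.6": 256_000,
}

# Assumption: roughly 4 characters per token for English text.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Estimate the token count by ceiling division on character length."""
    return -(-len(text) // CHARS_PER_TOKEN)


def models_that_fit(text: str, reserve_for_output: int = 8_000) -> list[str]:
    """Return models whose window holds the text plus room for a reply."""
    needed = estimate_tokens(text) + reserve_for_output
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]


# A 900,000-character document is about 225,000 estimated tokens.
long_document = "x" * 900_000
print(models_that_fit(long_document))
```

For production use, count tokens with the tokenizer of the model being served instead of the character heuristic.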