Which open source LLM is best in 2026? We rank the top open weight models by real benchmark data — GLM-5, Qwen3.5, Gemma 4, Kimi K2.5, Llama — and compare them to proprietary leaders.
The best open source LLM right now is GLM-5 (Reasoning) from Zhipu AI, scoring 85 on BenchLM.ai's overall leaderboard. GLM-5.1 follows at 84, Qwen3.5 397B (Reasoning) sits at 81, and GLM-5 rounds out the next tier at 77.
That's a significant shift. Two years ago, Llama dominated the open source conversation. Today, Chinese labs — Zhipu AI, Alibaba, Moonshot AI, and DeepSeek — hold most of the top positions among open weight models, with Google's Gemma 4 31B breaking into the top 5. The best open source LLMs in 2026 are not where most people expect them to be.
| Rank | Model | Creator | Overall | Context |
|---|---|---|---|---|
| 1 | GLM-5 (Reasoning) | Zhipu AI | 85 | 200K |
| 2 | GLM-5.1 | Zhipu AI | 84 | 203K |
| 3 | Qwen3.5 397B (Reasoning) | Alibaba | 81 | 128K |
| 4 | GLM-5 | Zhipu AI | 77 | 200K |
| 5 | Gemma 4 31B | Google | 67 | 256K |
| 6 | GLM-4.7 | Zhipu AI | 72 | 200K |
| 7 | Kimi K2.5 | Moonshot AI | 68 | 128K |
| 8 | Qwen3.5-122B-A10B | Alibaba | 68 | 262K |
Scores from BenchLM.ai open source leaderboard. Overall score is BenchLM.ai's benchmark-weighted composite.
This table reveals something non-obvious: the models with the highest overall scores are not always the ones with the best individual benchmark rows. Some open models still post stronger isolated coding results than GLM-5 (Reasoning), but GLM-5 (Reasoning) wins overall because its knowledge, reasoning, and math profile is much broader.
The honest answer: closer than ever, but still behind.
| Model | Type | Overall | MMLU | AIME 2025 | SWE-Verified | LiveCodeBench |
|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | Proprietary | 94 | — | — | 75 | 71 |
| GPT-5.3 Codex | Proprietary | 89 | 99 | 98 | 85 | 85 |
| Claude Opus 4.6 | Proprietary | 92 | 99 | 98 | 80.8 | 76 |
| GPT-5.4 | Proprietary | 94 | 99 | 99 | 84 | 84 |
| GLM-5 (Reasoning) | Open Weight | 85 | 96 | 98 | 62 | — |
| Qwen3.5 397B (Reasoning) | Open Weight | 81 | 91 | 94 | 60 | 60 |
The overall gap between the best open weight model (GLM-5 (Reasoning) at 85) and the current proprietary leaders at 94 is 9 points. That's tighter than most people expect. In mid-2024, the gap was much wider. The trajectory still matters as much as the current snapshot.
Where open source models already match or beat proprietary ones:
- Competition math: GLM-5 (Reasoning)'s AIME 2025 score of 98 matches GPT-5.3 Codex and Claude Opus 4.6.
- Knowledge: MMLU 96 sits within three points of the proprietary leaders' 99.
Where the gap remains wide:
- Agentic coding: GLM-5 (Reasoning) scores 62 on SWE-bench Verified versus 85 for GPT-5.3 Codex.
- Overall composite: the 9-point spread (85 vs 94) is driven largely by coding and agentic benchmarks rather than knowledge or math.
GLM-5 (Reasoning) is the clear winner. AIME 2025: 98. HMMT 2025: 95. BRUMO 2025: 96. Math500: 92. These are near-perfect scores on elite competition math. No other open weight model comes close.
Runner-up: Step 3.5 Flash (AIME 2025: 97.3) and GLM-4.7 (HMMT 2025: 97.1) are both strong math alternatives with lower overall resource requirements.
This depends on which coding benchmark you care about:
On BenchLM's current blended coding score, Gemma 4 31B leads the open-weight field at 86.6, followed by Qwen3.5 397B (Reasoning) at 84.9 and GLM-5.1 at 82.9. GLM-4.7 still offers one of the cleaner all-around coding profiles with 84.9 on LiveCodeBench and 73.8 on SWE-bench Verified. If SWE-bench Pro matters most, GLM-5 (Reasoning) is still the better pick despite weaker LiveCodeBench numbers.
→ See the full coding leaderboard · SWE-bench Pro explained
GLM-5 (Reasoning) dominates knowledge benchmarks: MMLU 96, GPQA 94, SuperGPQA 92, SimpleQA 92. The non-reasoning GLM-5 variant is nearly as strong at MMLU 91.7 and SimpleQA 84.
Qwen3.5 397B (Reasoning) is a solid second choice with MMLU 91, GPQA 89, and more balanced performance across categories.
For factual accuracy specifically, check SimpleQA scores — the benchmark tests short-form factual recall, so higher scores indicate less hallucination. GLM-5 (Reasoning) leads open models at 92.
Self-hosting economics favor smaller models. Running a 397B-parameter model requires multiple high-end GPUs. Here's where the practical sweet spot sits:
- Mistral Small 4 (24B) runs on a single high-end consumer GPU and still offers a 256K context window.
- Gemma 4 31B leads the open-weight field on blended coding score at a fraction of the flagship models' size.
- Nemotron 3 Super 100B and Super 120B A12B are the practical mid-size options in NVIDIA's lineup.
For fine-tuning specifically, Mistral and Qwen models have the most mature fine-tuning ecosystems with well-documented tooling.
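For hardware sizing, a rough heuristic helps frame the self-hosting decision. This is a sketch, not a deployment guide: the 80GB card size and ~20% overhead for KV cache and activations are illustrative assumptions.

```python
import math

def min_gpus(params_billion: float, bytes_per_param: float = 1.0,
             gpu_memory_gb: int = 80, overhead: float = 1.2) -> int:
    """Rough GPU count for inference: weight memory (params x bytes per
    parameter) plus ~20% headroom for KV cache and activations, spread
    across cards of the given size."""
    needed_gb = params_billion * bytes_per_param * overhead
    return math.ceil(needed_gb / gpu_memory_gb)

min_gpus(397)       # 397B weights in FP8 on 80GB cards -> 6 GPUs
min_gpus(24, 2.0)   # a 24B model in FP16 -> 1 GPU
```

The same arithmetic explains why the 20-30B tier is the sweet spot: one card versus a small cluster.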
If data cannot leave your infrastructure, any open weight model on this list is an option: all can be self-hosted with no API dependency. The real constraint is GPU cost: running a 400B+ model costs $2-5K/month in cloud GPU compute, which only makes sense above roughly 50M tokens per month versus API pricing.
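The break-even figure can be sanity-checked in a few lines. This is an illustrative sketch: the 25% output-token share and the $3,500/month GPU bill (midpoint of the $2-5K range) are assumptions, and the API prices used are the GPT-5.4 Pro figures quoted later in this report.

```python
def breakeven_million_tokens(gpu_cost_per_month: float,
                             api_input_price: float,
                             api_output_price: float,
                             output_fraction: float = 0.25) -> float:
    """Monthly volume (in millions of tokens) at which a fixed GPU bill
    matches metered API spend, given per-1M-token API prices."""
    blended = ((1 - output_fraction) * api_input_price
               + output_fraction * api_output_price)
    return gpu_cost_per_month / blended

# $30 in / $180 out per 1M tokens vs a $3,500/month GPU bill:
breakeven_million_tokens(3500, 30, 180)   # ~52M tokens/month
```

Against the much cheaper open-model APIs in the next section, the break-even point moves far higher, which is why self-hosting is usually a privacy decision rather than a cost one.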
If cost per token drives your decision, these are the open weight models available through low-cost APIs:
| Model | API Price (input/output per 1M tokens) | Overall Score |
|---|---|---|
| DeepSeek Coder 2.0 | $0.27 / $1.10 | 54 |
| DeepSeek V3.2 (Thinking) | ~$0.55 / $2.19 | 65 |
| Kimi K2.5 | $0.50 / $2.80 | 68 |
For comparison, GPT-5.4 Pro costs $30/$180 and Claude Opus 4.6 costs $15/$75. DeepSeek Coder 2.0 at $0.27/$1.10 delivers a BenchLM overall score of 54 at a fraction of the output cost of GPT-5.4 Pro. That's not a rounding error — it's a fundamentally different cost structure.
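To make that cost structure concrete, here is a small sketch pricing a hypothetical monthly workload (the 10M input / 2M output split is an assumption for illustration; prices are from the table above).

```python
PRICES = {  # $ per 1M tokens (input, output), from the table above
    "DeepSeek Coder 2.0": (0.27, 1.10),
    "Kimi K2.5": (0.50, 2.80),
    "GPT-5.4 Pro": (30.00, 180.00),
}

def monthly_cost(model: str, input_millions: float,
                 output_millions: float) -> float:
    """Metered API spend for a month's token volume."""
    inp, out = PRICES[model]
    return input_millions * inp + output_millions * out

for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 10, 2):,.2f}")
# DeepSeek Coder 2.0: $4.90
# Kimi K2.5: $10.60
# GPT-5.4 Pro: $660.00
```

At this workload the proprietary flagship costs over 100x the cheapest open option, which is the "fundamentally different cost structure" in practice.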
DeepSeek has the strongest brand recognition among open source LLMs, but its leaderboard position has slipped. DeepSeek V3.2 (Thinking) scores 65 and DeepSeek Coder 2.0 scores 54 — both well behind GLM-5 (Reasoning) at 85 and Qwen3.5 397B (Reasoning) at 81. DeepSeek's advantage is pricing: the API at $0.27-0.55/M input tokens makes it one of the cheapest ways to access a capable model. DeepSeek also has strong agentic rows relative to its overall ranking.
Alibaba's Qwen3.5 397B (Reasoning) at 81 is still one of the strongest open-weight models overall. The Qwen ecosystem is broad: Qwen2.5-1M offers a 1M token context window, Qwen2.5-72B provides a solid mid-size option, and Alibaba continues to ship variants frequently. The Qwen3.5 non-reasoning variant sits at 66, so reasoning mode currently adds about 15 points of overall performance.
Zhipu AI's GLM family now occupies four of the top six open-weight spots. GLM-5 (Reasoning) at 85 leads the leaderboard, GLM-5.1 is right behind at 84, and GLM-5 at 77 remains a strong non-reasoning engineering option. GLM-4.7 at 72 still offers one of the cleaner all-around coding profiles in the open-weight field.
Llama's position has changed dramatically. Llama 4 Maverick scores 18 and Llama 4 Scout scores 24, both below Llama 3.1 405B at 43. Meta's open-weight models, which defined the category in 2023-2024, now trail the leading Chinese open models by a wide margin. Llama remains relevant for its ecosystem, community tooling, and broad cloud provider support. But on pure benchmark performance, it is no longer competitive at the frontier of open-weight AI.
Mistral Small 4 (Reasoning) is no longer ranked on BenchLM.ai due to insufficient trusted benchmark data. Mistral's strength remains efficiency: Small 4 runs on modest hardware with a 256K context window. Mistral is also the only major European open weight model provider, which matters for organizations with data sovereignty requirements.
NVIDIA's Nemotron 3 Ultra 500B scores 65 and offers a 10M token context window — the largest among open weight models. Nemotron 3 Super 120B A12B (61) and Super 100B (60) provide more practical deployment options. NVIDIA's integration with its own GPU tooling gives Nemotron models a deployment advantage on NVIDIA hardware.
Not all "open weight" models are truly open source. The distinction matters:
Open weight means the model weights are downloadable and you can run inference locally. This is what most people mean when they say "open source LLM." GLM-5, Qwen3.5, DeepSeek, and Mistral models are all open weight.
Open source in the strict OSI definition requires the training data, training code, and model weights to all be available. Almost no frontier LLM meets this bar. OLMo from AI2 is one of the few that does.
Permissive license vs restricted license is often more important than the open-weight distinction. Mistral uses Apache 2.0 (fully permissive for commercial use). Meta's Llama license restricts use above 700M monthly active users. Many Chinese open weight models have custom licenses that may restrict commercial deployment in certain jurisdictions.
For most teams, the practical question is: "Can I download the weights, run the model on my infrastructure, and use it in my product without paying per-token fees?" For all models on this list, the answer is yes — but read the license before deploying in production.
BenchLM.ai ranks open weight models using the same benchmark-weighted methodology as proprietary models. The overall score combines performance across knowledge (MMLU, GPQA, SuperGPQA), coding (SWE-bench Pro, SWE-bench Verified, LiveCodeBench), math (AIME, HMMT, Math500), reasoning (MUSR, BBH), instruction following (IFEval), multilingual (MGSM), agentic tasks (TerminalBench, BrowseComp, OSWorld), and multimodal benchmarks.
Models with scores reported only by their creators receive lower confidence weighting until independent verification is available. The full methodology is documented on the BenchLM.ai leaderboard.
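The shape of a benchmark-weighted composite can be sketched as below. The category weights here are hypothetical, chosen only to illustrate the mechanism; BenchLM.ai's actual weights are documented on its leaderboard, not reproduced here.

```python
# Hypothetical category weights for illustration only.
WEIGHTS = {"knowledge": 0.25, "coding": 0.25, "math": 0.20,
           "reasoning": 0.15, "agentic": 0.15}

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted composite over the categories a model reports,
    renormalizing when some categories are missing."""
    present = {cat: w for cat, w in WEIGHTS.items() if cat in category_scores}
    total_weight = sum(present.values())
    return sum(category_scores[cat] * w
               for cat, w in present.items()) / total_weight

# Illustrative category scores (not real BenchLM inputs):
overall_score({"knowledge": 94, "coding": 62, "math": 98,
               "reasoning": 85, "agentic": 80})   # 83.35 under these weights
```

The renormalization step matters: a model missing a category is scored on what it reports, which is one reason creator-only scores get extra scrutiny before they move a ranking.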
→ See the full open source leaderboard · Compare all models · Best LLM for coding
What is the best open source LLM in 2026? GLM-5 (Reasoning) from Zhipu AI leads BenchLM.ai's open weight leaderboard at 85 overall, followed by GLM-5.1 at 84 and Qwen3.5 397B (Reasoning) at 81. All three are well ahead of DeepSeek and Llama on the current leaderboard.
Can I run these models locally? Yes, all models listed are open weight and can be self-hosted. Smaller models like Mistral Small 4 (24B parameters) run on a single high-end consumer GPU. Larger models like Qwen3.5 397B and GLM-5 require multi-GPU setups or cloud instances with 4-8 A100/H100 GPUs.
Is DeepSeek still the best open source model? No. DeepSeek V3.2 (Thinking) scores 65 and DeepSeek Coder 2.0 scores 54 on BenchLM.ai — both well behind the leader, GLM-5 (Reasoning) at 85. DeepSeek remains competitive on pricing and useful for cost-sensitive workloads, but it is no longer the overall open-weight performance leader.
What happened to Llama? Meta's Llama 4 Maverick and Scout score 18 and 24 respectively — significantly below the leaders. Llama 3.1 405B at 43 still outperforms both on BenchLM.ai. Llama's ecosystem advantages remain strong, but its benchmark performance has fallen behind.
Which open source model is cheapest to run via API? DeepSeek Coder 2.0 at $0.27/$1.10 per million tokens is still one of the cheapest capable options, though its current overall score is 54. Kimi K2.5 at $0.50/$2.80 offers a higher overall score of 68 at still-affordable pricing.
Benchmark data from BenchLM.ai. Prices per million tokens, current as of April 2026.