Which open source LLM is best in 2026? We rank the top open weight models by real benchmark data — DeepSeek V4, Kimi K2.6, GLM-5, Qwen3.5, Gemma 4, Llama — and compare them to proprietary leaders.
The best open source LLM right now is DeepSeek V4 Pro (Max), scoring 87 on BenchLM.ai's overall leaderboard. Kimi K2.6 follows at 84, GLM-5 (Reasoning) and GLM-5.1 sit at 83, and Qwen3.5 397B (Reasoning) rounds out the next tier at 79.
That's a significant shift. Two years ago, Llama dominated the open source conversation. Today, Chinese labs — DeepSeek, Moonshot AI, Zhipu AI, and Alibaba — hold most of the top positions among open weight models. The best open source LLMs in 2026 are not where most people expect them to be.
| Rank | Model | Creator | Overall | Context |
|---|---|---|---|---|
| 1 | DeepSeek V4 Pro (Max) | DeepSeek | 87 | 1M |
| 2 | Kimi K2.6 | Moonshot AI | 84 | 256K |
| 3 | GLM-5 (Reasoning) | Zhipu AI | 83 | 200K |
| 4 | GLM-5.1 | Zhipu AI | 83 | 203K |
| 5 | DeepSeek V4 Pro (High) | DeepSeek | 83 | 1M |
| 6 | Qwen3.5 397B (Reasoning) | Alibaba | 79 | 128K |
| 7 | DeepSeek V4 Flash (Max) | DeepSeek | 77 | 1M |
| 8 | Qwen3.6-27B | Alibaba | 75 | 262K |
Scores from BenchLM.ai open source leaderboard. Overall score is BenchLM.ai's benchmark-weighted composite.
This table reveals something non-obvious: the models with the highest overall scores are not always the ones with the best individual benchmark rows. DeepSeek V4 Pro (Max) wins the current open-weight slice because its coding and agentic profile is unusually strong, while GLM-5 (Reasoning) remains the cleaner math and reasoning-heavy pick.
How close are open weight models to the proprietary frontier? The honest answer: closer than ever, but still behind.
| Model | Type | Overall | MMLU | AIME 2025 | SWE-Verified | LiveCodeBench |
|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | Proprietary | 93 | — | — | 75 | 71 |
| GPT-5.3 Codex | Proprietary | 87 | 99 | 98 | 85 | 85 |
| Claude Opus 4.6 | Proprietary | 88 | 99 | 98 | 80.8 | 76 |
| GPT-5.4 | Proprietary | 88 | 99 | 99 | 84 | 84 |
| DeepSeek V4 Pro (Max) | Open Weight | 87 | — | — | 80.6 | 93.5 |
| Kimi K2.6 | Open Weight | 84 | — | — | 80.2 | 89.6 |
| GLM-5 (Reasoning) | Open Weight | 83 | 96 | 98 | 62 | 58 |
| Qwen3.5 397B (Reasoning) | Open Weight | 79 | 91 | 94 | 60 | 60 |
The overall gap between the best open weight model (DeepSeek V4 Pro (Max) at 87) and the current mainstream proprietary leader at 93 is 6 points. That's tighter than most people expect. In mid-2024, the gap was much wider. The trajectory still matters as much as the current snapshot.
Where open source models already match or beat proprietary ones:

- LiveCodeBench: DeepSeek V4 Pro (Max) at 93.5 and Kimi K2.6 at 89.6 beat every proprietary model in the table above.
- Competition math: GLM-5 (Reasoning) posts 98 on AIME 2025, matching GPT-5.3 Codex and Claude Opus 4.6.

Where the gap remains wide:

- Overall composite: the best open weight score (87) trails the proprietary leader (93) by 6 points.
- SWE-bench Verified: the best open row (80.6) sits below GPT-5.3 Codex at 85, and most open models fall well short of that.
GLM-5 (Reasoning) is the clear winner. AIME 2025: 98. HMMT 2025: 95. BRUMO 2025: 96. Math500: 92. These are near-perfect scores on graduate-level competition math, and no other open weight model matches that consistency across all four benchmarks.
Runner-up: Step 3.5 Flash (AIME 2025: 97.3) and GLM-4.7 (HMMT 2025: 97.1) are both strong math alternatives with lower overall resource requirements.
This depends on which coding benchmark you care about:
On BenchLM's current blended coding score, DeepSeek V4 Pro (Max) leads the open-weight field at 89.8, followed by Kimi K2.6 and DeepSeek V4 Pro (High) at 88.7, and Qwen3.5 397B (Reasoning) at 86.7. GLM-5.1 still offers one of the cleaner all-around profiles with 84.1 on the blended coding score and 77.8 on SWE-bench Verified. If SWE-bench Pro matters most, GLM-5 (Reasoning) is still the better pick despite weaker LiveCodeBench numbers.
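For teams that want to automate this choice, here is a minimal sketch that ranks the open-weight rows quoted above by whichever coding benchmark you prioritize. The scores are hardcoded from this page; the `MODELS` dict and `rank_by` helper are illustrative, not part of any BenchLM API.

```python
# Open-weight coding scores quoted on this page (blended coding score
# and SWE-bench Verified); None marks values not reported here.
MODELS = {
    "DeepSeek V4 Pro (Max)":     {"blended": 89.8, "swe_verified": 80.6},
    "Kimi K2.6":                 {"blended": 88.7, "swe_verified": 80.2},
    "DeepSeek V4 Pro (High)":    {"blended": 88.7, "swe_verified": None},
    "Qwen3.5 397B (Reasoning)":  {"blended": 86.7, "swe_verified": None},
    "GLM-5.1":                   {"blended": 84.1, "swe_verified": 77.8},
}

def rank_by(metric: str):
    """Return (model, score) pairs with a reported score for `metric`, best first."""
    scored = [(name, s[metric]) for name, s in MODELS.items() if s[metric] is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank_by("swe_verified")[0][0])  # DeepSeek V4 Pro (Max)
```

Swap in whatever benchmark column matters for your workload; models without a reported score simply drop out of the ranking.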
→ See the full coding leaderboard · SWE-bench Pro explained
GLM-5 (Reasoning) dominates knowledge benchmarks: MMLU 96, GPQA 94, SuperGPQA 92, SimpleQA 92. The non-reasoning GLM-5 variant is nearly as strong at MMLU 91.7 and SimpleQA 84.
Qwen3.5 397B (Reasoning) is a solid second choice with MMLU 91, GPQA 89, and more balanced performance across categories.
For factual accuracy specifically, check SimpleQA scores: a higher score means fewer hallucinated answers on short factual questions. GLM-5 (Reasoning) leads open models at 92.
Self-hosting economics favor smaller models. Running a 397B-parameter model requires multiple high-end GPUs, while a 24B model such as Mistral Small 4 fits on a single high-end consumer GPU, which is where the practical sweet spot sits for most teams.
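As a rough sanity check on hardware sizing, a weights-only memory estimate is simple arithmetic: parameter count times bytes per parameter. The sketch below is an assumption-laden floor; it ignores KV cache, activations, and framework overhead, which add meaningfully on top.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weights-only memory footprint in GB.

    1B parameters at 1 byte each is roughly 1 GB. This is a lower
    bound: KV cache, activations, and runtime overhead come on top.
    """
    return params_billion * bytes_per_param

# A 397B-parameter model at different precisions:
print(weight_memory_gb(397, 2.0))   # FP16/BF16: 794.0 GB
print(weight_memory_gb(397, 1.0))   # FP8/INT8:  397.0 GB
print(weight_memory_gb(397, 0.5))   # 4-bit:     198.5 GB
```

Even 4-bit quantization leaves a 397B model far beyond a single consumer GPU, which is why the multi-GPU requirement above is unavoidable at that scale.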
For fine-tuning specifically, Mistral and Qwen models have the most mature fine-tuning ecosystems with well-documented tooling.
If data cannot leave your infrastructure, self-hosting an open weight model is your only real option.
All open weight models can be self-hosted with no API dependency. The real constraint is GPU cost: running a 400B+ model costs $2-5K/month in cloud GPU compute, which only makes sense above roughly 50M tokens per month versus API pricing.
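That breakeven figure is easy to reproduce. The sketch below blends input and output API pricing under an assumed 3:1 input-to-output token split; the `output_ratio` default is an assumption, not BenchLM data, and the $3.5K GPU bill is just the midpoint of the $2-5K range above.

```python
def breakeven_tokens_per_month(gpu_cost_per_month: float,
                               api_price_in: float,
                               api_price_out: float,
                               output_ratio: float = 0.25) -> float:
    """Monthly token volume above which self-hosting beats API pricing.

    API prices are per 1M tokens; output_ratio is the assumed share
    of traffic that is output tokens.
    """
    blended_per_million = ((1 - output_ratio) * api_price_in
                           + output_ratio * api_price_out)
    return gpu_cost_per_month / blended_per_million * 1_000_000

# $3.5K/month GPU bill vs. GPT-5.4 Pro's $30/$180 pricing quoted below:
print(round(breakeven_tokens_per_month(3500, 30.0, 180.0) / 1e6))  # 52 (million tokens/month)
```

Against a cheap API like DeepSeek's, the breakeven volume is orders of magnitude higher, which is why self-hosting only pencils out versus expensive proprietary endpoints or for data-sovereignty reasons.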
If cost per token drives your decision, these are the open weight models available through low-cost APIs:
| Model | API Price (input/output per 1M tokens) | Overall Score |
|---|---|---|
| DeepSeek Coder 2.0 | — | 54 |
| DeepSeek V3.2 (Thinking) | ~$0.55 / $2.19 | 63 |
| Kimi K2.5 | $0.60 / $3.00 | 64 |
For comparison, GPT-5.4 Pro costs $30/$180 and Claude Opus 4.6 costs $5/$25. DeepSeek Coder 2.0 remains a cheap open-weight option to self-host, but we no longer attach a current hosted API token price to the exact row without a first-party source.
DeepSeek has the strongest brand recognition among open source LLMs, and the V4 rows have pushed it back to the top of the open-weight leaderboard. DeepSeek V4 Pro (Max) scores 87, DeepSeek V4 Pro (High) scores 83, and DeepSeek V4 Flash (Max) scores 77. DeepSeek V3.2 (Thinking) now sits lower at 63, but DeepSeek's advantage remains pricing and deployment flexibility: the API at $0.27-0.55/M input tokens makes it one of the cheapest ways to access a capable model.
Alibaba's Qwen3.5 397B (Reasoning) at 79 is still one of the strongest open-weight models overall. The Qwen ecosystem is broad: Qwen3.6-27B offers a newer 75-point open-weight row, Qwen2.5-1M offers a 1M token context window, and Alibaba continues to ship variants frequently. The Qwen3.5 non-reasoning variant sits at 65, so reasoning mode currently adds about 14 points of overall performance.
Zhipu AI's GLM family remains one of the strongest open-weight clusters. GLM-5 (Reasoning) and GLM-5.1 both score 83, while GLM-5 at 67 remains a strong non-reasoning engineering option. GLM-4.7 at 70 still offers one of the cleaner all-around coding profiles in the open-weight field.
Llama's position has changed dramatically. Llama 4 Maverick scores 18 and Llama 4 Scout scores 24, both below Llama 3.1 405B at 43. Meta's open-weight models, which defined the category in 2023-2024, now trail the leading Chinese open models by a wide margin. Llama remains relevant for its ecosystem, community tooling, and broad cloud provider support. But on pure benchmark performance, it is no longer competitive at the frontier of open-weight AI.
Mistral Small 4 (Reasoning) is no longer ranked on BenchLM.ai due to insufficient trusted benchmark data. Mistral's strength remains efficiency: Small 4 runs on modest hardware with a 256K context window. Mistral is also the only major European open weight model provider, which matters for organizations with data sovereignty requirements.
NVIDIA's Nemotron 3 Ultra 500B scores 65 and offers a 10M token context window — the largest among open weight models. Nemotron 3 Super 120B A12B (61) and Super 100B (60) provide more practical deployment options. NVIDIA's integration with its own GPU tooling gives Nemotron models a deployment advantage on NVIDIA hardware.
Not all "open weight" models are truly open source. The distinction matters:
Open weight means the model weights are downloadable and you can run inference locally. This is what most people mean when they say "open source LLM." GLM-5, Qwen3.5, DeepSeek, and Mistral models are all open weight.
Open source in the strict OSI definition requires the training data, training code, and model weights to all be available. Almost no frontier LLM meets this bar. OLMo from AI2 is one of the few that does.
Permissive license vs restricted license is often more important than the open-weight distinction. Mistral uses Apache 2.0 (fully permissive for commercial use). Meta's Llama license restricts use above 700M monthly active users. Many Chinese open weight models have custom licenses that may restrict commercial deployment in certain jurisdictions.
For most teams, the practical question is: "Can I download the weights, run the model on my infrastructure, and use it in my product without paying per-token fees?" For all models on this list, the answer is yes — but read the license before deploying in production.
BenchLM.ai ranks open weight models using the same benchmark-weighted methodology as proprietary models. The overall score combines performance across knowledge (MMLU, GPQA, SuperGPQA), coding (SWE-bench Pro, SWE-bench Verified, LiveCodeBench), math (AIME, HMMT, Math500), reasoning (MUSR, BBH), instruction following (IFEval), multilingual (MGSM), agentic tasks (TerminalBench, BrowseComp, OSWorld), and multimodal benchmarks.
Models with scores reported only by their creators receive lower confidence weighting until independent verification is available. The full methodology is documented on the BenchLM.ai leaderboard.
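As a rough illustration of how a benchmark-weighted composite works, here is a sketch with placeholder category weights. BenchLM.ai's actual weights are not published in this article, so every number in `CATEGORY_WEIGHTS` is an assumption; only the mechanism (weighted average with renormalization over missing categories) is what the methodology describes.

```python
# Placeholder weights for illustration only -- NOT BenchLM.ai's actual weighting.
CATEGORY_WEIGHTS = {
    "knowledge": 0.20,
    "coding": 0.25,
    "math": 0.15,
    "reasoning": 0.15,
    "instruction": 0.10,
    "agentic": 0.15,
}

def composite_score(category_scores: dict) -> float:
    """Weighted average over the categories a model has scores for.

    Missing categories are dropped and the remaining weights are
    renormalized, so sparse rows are not penalized for absent benchmarks.
    """
    present = {c: s for c, s in category_scores.items() if s is not None}
    total_weight = sum(CATEGORY_WEIGHTS[c] for c in present)
    return sum(CATEGORY_WEIGHTS[c] * s for c, s in present.items()) / total_weight

print(composite_score({"coding": 90, "math": 80}))  # 86.25
```

The renormalization step is the part worth noting: it explains why a model with only a few strong reported benchmarks can still post a high composite until independent scores fill in the gaps.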
→ See the full open source leaderboard · Compare all models · Best LLM for coding
What is the best open source LLM in 2026? DeepSeek V4 Pro (Max) leads BenchLM.ai's open weight leaderboard at 87 overall, followed by Kimi K2.6 at 84, GLM-5 (Reasoning) and GLM-5.1 at 83, and Qwen3.5 397B (Reasoning) at 79. These rows are well ahead of Llama on the current leaderboard.
Can I run these models locally? Yes, all models listed are open weight and can be self-hosted. Smaller models like Mistral Small 4 (24B parameters) run on a single high-end consumer GPU. Larger models like Qwen3.5 397B and GLM-5 require multi-GPU setups or cloud instances with 4-8 A100/H100 GPUs.
Is DeepSeek still the best open source model? DeepSeek is back at the top through the newer V4 rows. DeepSeek V4 Pro (Max) scores 87 and leads the open-weight leaderboard, while older DeepSeek V3.2 (Thinking) scores 63 and DeepSeek Coder 2.0 scores 54. Use the exact row carefully: the V4 variants are the current leaders, not the older V3.2 or Coder rows.
What happened to Llama? Meta's Llama 4 Maverick and Scout score 18 and 24 respectively — significantly below the leaders. Llama 3.1 405B at 43 still outperforms both on BenchLM.ai. Llama's ecosystem advantages remain strong, but its benchmark performance has fallen behind.
Which open source model is cheapest to run via API? DeepSeek Coder 2.0 remains one of the cheapest capable options to self-host, though its current overall score is 54. Kimi K2.5 at $0.60/$3.00 offers a higher overall score of 64 at still-affordable pricing.
Benchmark data from BenchLM.ai. Prices per million tokens, current as of April 2026.