
Best Open Source LLM in 2026: Rankings, Benchmarks, and the Models Worth Running

Which open source LLM is best in 2026? We rank the top open weight models by real benchmark data — DeepSeek V4, Kimi K2.6, GLM-5, Qwen3.5, Gemma 4, Llama — and compare them to proprietary leaders.

Glevd · Published April 1, 2026 · Updated April 8, 2026 · 12 min read


The best open source LLM right now is DeepSeek V4 Pro (Max), scoring 87 on BenchLM.ai's overall leaderboard. Kimi K2.6 follows at 84, GLM-5 (Reasoning) and GLM-5.1 sit at 83, and Qwen3.5 397B (Reasoning) rounds out the next tier at 79.

That's a significant shift. Two years ago, Llama dominated the open source conversation. Today, Chinese labs — DeepSeek, Moonshot AI, Zhipu AI, and Alibaba — hold most of the top positions among open weight models. The best open source LLMs in 2026 are not where most people expect them to be.

Top open source LLMs ranked by benchmarks

| Rank | Model | Creator | Overall | Context |
|------|-------|---------|---------|---------|
| 1 | DeepSeek V4 Pro (Max) | DeepSeek | 87 | 1M |
| 2 | Kimi K2.6 | Moonshot AI | 84 | 256K |
| 3 | GLM-5 (Reasoning) | Zhipu AI | 83 | 200K |
| 4 | GLM-5.1 | Zhipu AI | 83 | 203K |
| 5 | DeepSeek V4 Pro (High) | DeepSeek | 83 | 1M |
| 6 | Qwen3.5 397B (Reasoning) | Alibaba | 79 | 128K |
| 7 | DeepSeek V4 Flash (Max) | DeepSeek | 77 | 1M |
| 8 | Qwen3.6-27B | Alibaba | 75 | 262K |
Scores from BenchLM.ai open source leaderboard. Overall score is BenchLM.ai's benchmark-weighted composite.

This table reveals something non-obvious: the models with the highest overall scores are not always the ones that top individual benchmarks. DeepSeek V4 Pro (Max) wins the current open-weight slice because its coding and agentic profile is unusually strong, while GLM-5 (Reasoning) remains the cleaner pick for math- and reasoning-heavy work.

How close are open source models to proprietary ones?

The honest answer: closer than ever, but still behind.

| Model | Type | Overall | MMLU | AIME 2025 | SWE-bench Verified | LiveCodeBench |
|-------|------|---------|------|-----------|--------------------|---------------|
| Gemini 3.1 Pro | Proprietary | 93 | – | – | 75 | 71 |
| GPT-5.3 Codex | Proprietary | 87 | 99 | 98 | 85 | 85 |
| Claude Opus 4.6 | Proprietary | 88 | 99 | 98 | 80.8 | 76 |
| GPT-5.4 | Proprietary | 88 | 99 | 99 | 84 | 84 |
| DeepSeek V4 Pro (Max) | Open Weight | 87 | – | – | 80.6 | 93.5 |
| Kimi K2.6 | Open Weight | 84 | – | – | 80.2 | 89.6 |
| GLM-5 (Reasoning) | Open Weight | 83 | 96 | 98 | 62 | 58 |
| Qwen3.5 397B (Reasoning) | Open Weight | 79 | 91 | 94 | 60 | 60 |

The overall gap between the best open weight model (DeepSeek V4 Pro (Max) at 87) and the current mainstream proprietary leader (Gemini 3.1 Pro at 93) is 6 points. That's tighter than most people expect. In mid-2024, the gap was much wider. The trajectory still matters as much as the current snapshot.

Where open source models already match or beat proprietary ones:

  • Math: GLM-5 (Reasoning) scores 98 on AIME 2025 and 95 on HMMT 2025 — competitive with the best proprietary math scores
  • Knowledge: GLM-5 (Reasoning) hits 96 on MMLU, 94 on GPQA, and 92 on SuperGPQA
  • Competitive coding: DeepSeek V4 Pro (Max) reaches 93.5 on LiveCodeBench and Kimi K2.6 reaches 89.6, both ahead of Claude Opus 4.6 (76) on that specific benchmark
  • Multilingual: GLM-4.7 scores 94 on MGSM, ahead of most proprietary models

Where the gap remains wide:

  • Software engineering: The best open weight SWE-bench Verified score is 80.6 (DeepSeek V4 Pro (Max)). GPT-5.3 Codex scores 85. For SWE-bench Pro, the gap is larger — open models top out around 67 vs 90 for GPT-5.3 Codex
  • Agentic tasks: Open models trail significantly on BrowseComp, TerminalBench, and OSWorld
  • Overall consistency: Proprietary models perform well across all categories simultaneously. Open models tend to spike on specific benchmarks but dip on others

Best open source LLM by use case

Best for math and reasoning

GLM-5 (Reasoning) is the clear winner. AIME 2025: 98. HMMT 2025: 95. BRUMO 2025: 96. Math500: 92. These are near-perfect scores on graduate-level competition math, and no other open weight model matches this profile across all four benchmarks.

Runner-up: Step 3.5 Flash (AIME 2025: 97.3) and GLM-4.7 (HMMT 2025: 97.1) are both strong math alternatives with lower overall resource requirements.

Best for coding

This depends on which coding benchmark you care about:

  • HumanEval (function generation): Kimi K2.5 at 99 — essentially perfect, but HumanEval is saturated and no longer differentiates frontier models
  • LiveCodeBench (competitive programming): DeepSeek V4 Pro (Max) at 93.5, followed by DeepSeek V4 Flash (Max) at 91.6 and Kimi K2.6 at 89.6
  • SWE-bench Verified (real bug fixing): DeepSeek V4 Pro (Max) at 80.6, GLM-5 at 77.8, Kimi K2.5 at 76.8, GLM-4.7 at 73.8
  • SWE-bench Pro (harder software engineering): GLM-5 (Reasoning) at 67 leads open models, but this is still well below GPT-5.3 Codex (90)

On BenchLM's current blended coding score, DeepSeek V4 Pro (Max) leads the open-weight field at 89.8, followed by Kimi K2.6 and DeepSeek V4 Pro (High) at 88.7, and Qwen3.5 397B (Reasoning) at 86.7. GLM-5.1 still offers one of the cleaner all-around profiles with 84.1 on the blended coding score and 77.8 on SWE-bench Verified. If SWE-bench Pro matters most, GLM-5 (Reasoning) is still the better pick despite weaker LiveCodeBench numbers.

See the full coding leaderboard · SWE-bench Pro explained

Best for knowledge and question answering

GLM-5 (Reasoning) dominates knowledge benchmarks: MMLU 96, GPQA 94, SuperGPQA 92, SimpleQA 92. The non-reasoning GLM-5 variant is nearly as strong at MMLU 91.7 and SimpleQA 84.

Qwen3.5 397B (Reasoning) is a solid second choice with MMLU 91, GPQA 89, and more balanced performance across categories.

For factual accuracy specifically, check SimpleQA scores, which track short-form factual question answering and serve as a rough proxy for hallucination rate. GLM-5 (Reasoning) leads open models at 92.

Best for self-hosting and fine-tuning

Self-hosting economics favor smaller models. Running a 397B-parameter model requires multiple high-end GPUs. Here's where the practical sweet spot sits:

  • Mistral Small 4 (24B parameters, 256K context): Scores 47 overall. Fits on a single consumer GPU with quantization (see the loading sketch after this list). Mistral's Apache 2.0 license is genuinely permissive for commercial use
  • Gemma 4 31B (31B parameters, 256K context): Scores 65 overall. Google's latest open model fits on a single high-end consumer GPU and offers strong LiveCodeBench (80) performance
  • DeepSeek Coder 2.0: 54 overall, and DeepSeek's low-cost API access makes it cheaper than self-hosting for most teams. Self-host only if data privacy requires it
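To make the single-GPU path concrete, here is a minimal local-inference sketch using llama-cpp-python. The GGUF file name is a hypothetical placeholder, not a real artifact name; point it at whichever quantized build of Mistral Small 4 (or another open weight model above) you actually download.

```python
# Minimal local-inference sketch using llama-cpp-python (pip install llama-cpp-python).
# The GGUF path below is a hypothetical placeholder: substitute whichever
# quantized build of Mistral Small 4 (or another open weight model) you download.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-small-4-q4_k_m.gguf",  # hypothetical file name
    n_ctx=8192,       # context to allocate; raise it only as far as your VRAM allows
    n_gpu_layers=-1,  # offload as many layers as fit onto the GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize SWE-bench Verified in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

A 4-bit quantization of a 24B model needs roughly 16 GB of VRAM, which is why this tier fits a single high-end consumer GPU while the 400B-class models do not.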

For fine-tuning specifically, Mistral and Qwen models have the most mature fine-tuning ecosystems with well-documented tooling.

Best for privacy-sensitive deployments

If data cannot leave your infrastructure, your options are:

  1. DeepSeek V4 Pro (Max), Kimi K2.6, GLM-5, or Qwen3.5 397B for maximum capability (requires serious GPU infrastructure)
  2. Mistral Small 4 for the best quality-to-resource ratio (runs on a single A100 or equivalent)
  3. Llama 3.1 405B for the broadest ecosystem support and community tooling, despite lower benchmark scores (43 overall)

All open weight models can be self-hosted with no API dependency. The real constraint is GPU cost: running a 400B+ model costs $2-5K/month in cloud GPU compute, which only makes sense above roughly 50M tokens per month versus API pricing.
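To sanity-check that 50M-token rule of thumb, divide the monthly GPU bill by a blended API price. A minimal sketch: the $2K and $5K figures are the article's own range, while the $40 per 1M tokens blended (input plus output) API rate is an illustrative assumption roughly in line with frontier proprietary pricing.

```python
# Break-even sketch: monthly GPU cost vs. per-token API pricing.
# The $2K/$5K figures come from the range cited above; the $40/1M blended
# API rate is an illustrative assumption, not a quoted price.
def breakeven_tokens(gpu_monthly_usd: float, api_usd_per_1m: float) -> float:
    """Monthly token volume above which self-hosting wins on raw cost."""
    return gpu_monthly_usd / api_usd_per_1m * 1_000_000

for gpu_cost in (2_000, 5_000):
    tokens = breakeven_tokens(gpu_cost, api_usd_per_1m=40.0)
    print(f"${gpu_cost:,}/month GPU -> break-even at ~{tokens / 1e6:.0f}M tokens/month")
```

At $40/M blended, the $2,000/month deployment breaks even at exactly 50M tokens per month. Against a cheap API like DeepSeek's, the break-even volume is far higher, which is why self-hosting rarely wins on cost alone.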

Best for cost-sensitive applications

If cost per token drives your decision, these are the open weight models available through low-cost APIs:

| Model | API Price (input / output per 1M tokens) | Overall Score |
|-------|------------------------------------------|---------------|
| DeepSeek Coder 2.0 | – (see note below) | 54 |
| DeepSeek V3.2 (Thinking) | ~$0.55 / $2.19 | 63 |
| Kimi K2.5 | $0.60 / $3.00 | 64 |

For comparison, GPT-5.4 Pro costs $30/$180 and Claude Opus 4.6 costs $5/$25. DeepSeek Coder 2.0 remains a cheap open-weight option to self-host, but we no longer list a hosted API price for it without a first-party source.
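To turn those per-token prices into a monthly bill, multiply by your volume. A minimal sketch using the prices above; the 20M-input / 5M-output workload is an assumed volume for illustration.

```python
# Monthly bill sketch using the (input, output) prices per 1M tokens above.
# The 20M-input / 5M-output workload is an assumed volume for illustration.
PRICES = {
    "Kimi K2.5": (0.60, 3.00),
    "DeepSeek V3.2 (Thinking)": (0.55, 2.19),  # approximate, per the table
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.4 Pro": (30.00, 180.00),
}

input_m, output_m = 20, 5  # millions of tokens per month (assumption)

for model, (p_in, p_out) in PRICES.items():
    cost = input_m * p_in + output_m * p_out
    print(f"{model:26s} ${cost:,.2f}/month")
```

On that workload, Kimi K2.5 comes out around $27/month versus roughly $1,500/month for GPT-5.4 Pro, which is the gap that makes the low-cost open weight APIs attractive for high-volume applications.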

The DeepSeek, Qwen, GLM, and Llama landscape

DeepSeek

DeepSeek has the strongest brand recognition among open source LLMs, and the V4 releases have pushed it back to the top of the open-weight leaderboard. DeepSeek V4 Pro (Max) scores 87, DeepSeek V4 Pro (High) scores 83, and DeepSeek V4 Flash (Max) scores 77. DeepSeek V3.2 (Thinking) now sits lower at 63, but DeepSeek's advantage remains pricing and deployment flexibility: the API at $0.27-0.55/M input tokens makes it one of the cheapest ways to access a capable model.

Qwen (Alibaba)

Alibaba's Qwen3.5 397B (Reasoning) at 79 is still one of the strongest open-weight models overall. The Qwen ecosystem is broad: Qwen3.6-27B is a newer 75-point entry, Qwen2.5-1M offers a 1M token context window, and Alibaba continues to ship variants frequently. The Qwen3.5 non-reasoning variant sits at 65, so reasoning mode currently adds about 14 points of overall performance.

GLM (Zhipu AI)

Zhipu AI's GLM family remains one of the strongest open-weight clusters. GLM-5 (Reasoning) and GLM-5.1 both score 83, while GLM-5 at 67 remains a strong non-reasoning engineering option. GLM-4.7 at 70 still offers one of the cleaner all-around coding profiles in the open-weight field.

Llama (Meta)

Llama's position has changed dramatically. Llama 4 Maverick scores 18 and Llama 4 Scout scores 24, both below Llama 3.1 405B at 43. Meta's open-weight models, which defined the category in 2023-2024, now trail the leading Chinese open models by a wide margin. Llama remains relevant for its ecosystem, community tooling, and broad cloud provider support. But on pure benchmark performance, it is no longer competitive at the frontier of open-weight AI.

Mistral

Mistral Small 4 (Reasoning) is no longer ranked on BenchLM.ai due to insufficient trusted benchmark data. Mistral's strength remains efficiency: Small 4 runs on modest hardware with a 256K context window. Mistral is also the only major European open weight model provider, which matters for organizations with data sovereignty requirements.

NVIDIA Nemotron

NVIDIA's Nemotron 3 Ultra 500B scores 65 and offers a 10M token context window — the largest among open weight models. Nemotron 3 Super 120B A12B (61) and Super 100B (60) provide more practical deployment options. NVIDIA's integration with its own GPU tooling gives Nemotron models a deployment advantage on NVIDIA hardware.

What "open source" actually means for LLMs

Not all "open weight" models are truly open source. The distinction matters:

Open weight means the model weights are downloadable and you can run inference locally. This is what most people mean when they say "open source LLM." GLM-5, Qwen3.5, DeepSeek, and Mistral models are all open weight.

Open source in the strict OSI definition requires the training data, training code, and model weights to all be available. Almost no frontier LLM meets this bar. OLMo from AI2 is one of the few that does.

Permissive license vs restricted license is often more important than the open-weight distinction. Mistral uses Apache 2.0 (fully permissive for commercial use). Meta's Llama license restricts use above 700M monthly active users. Many Chinese open weight models have custom licenses that may restrict commercial deployment in certain jurisdictions.

For most teams, the practical question is: "Can I download the weights, run the model on my infrastructure, and use it in my product without paying per-token fees?" For all models on this list, the answer is yes — but read the license before deploying in production.
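One way to make that license check routine is to read the declared license straight from the model repo's metadata on the Hugging Face Hub. A minimal sketch using huggingface_hub; the repo id is a hypothetical placeholder, and the declared tag is a starting point, not legal review.

```python
# License-check sketch (pip install huggingface_hub). The repo id is a
# hypothetical placeholder; substitute the model you actually plan to deploy.
from huggingface_hub import model_info

info = model_info("example-org/example-open-weight-model")  # hypothetical repo id

# Hub repos expose the declared license as a "license:<id>" tag.
declared = [t.split(":", 1)[1] for t in info.tags if t.startswith("license:")]
license_id = declared[0] if declared else "not declared"
print("Declared license:", license_id)

if license_id not in {"apache-2.0", "mit"}:
    print("Custom or restricted license: read the LICENSE file before production use.")
```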

How we rank open source models

BenchLM.ai ranks open weight models using the same benchmark-weighted methodology as proprietary models. The overall score combines performance across knowledge (MMLU, GPQA, SuperGPQA), coding (SWE-bench Pro, SWE-bench Verified, LiveCodeBench), math (AIME, HMMT, Math500), reasoning (MUSR, BBH), instruction following (IFEval), multilingual (MGSM), agentic tasks (TerminalBench, BrowseComp, OSWorld), and multimodal benchmarks.

Models with scores reported only by their creators receive lower confidence weighting until independent verification is available. The full methodology is documented on the BenchLM.ai leaderboard.
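BenchLM.ai's exact weights aren't reproduced in this article, but mechanically a benchmark-weighted composite with confidence down-weighting looks something like the sketch below. The category weights and the 0.8 discount for self-reported scores are illustrative assumptions, not BenchLM.ai's actual methodology values.

```python
# Illustrative benchmark-weighted composite with confidence down-weighting.
# The category weights and the 0.8 discount for self-reported scores are
# assumptions for illustration, not BenchLM.ai's actual values.
WEIGHTS = {"knowledge": 0.20, "coding": 0.30, "math": 0.20, "reasoning": 0.15, "agentic": 0.15}

def composite(scores: dict[str, float], self_reported: set[str], discount: float = 0.8) -> float:
    total = weight_sum = 0.0
    for category, weight in WEIGHTS.items():
        if category not in scores:
            continue  # missing categories drop out; the rest renormalize
        w = weight * (discount if category in self_reported else 1.0)
        total += w * scores[category]
        weight_sum += w
    return total / weight_sum

# Example: strong math, but the agentic numbers are creator-reported.
print(round(composite(
    {"knowledge": 92, "coding": 78, "math": 96, "agentic": 55},
    self_reported={"agentic"},
), 1))  # -> 82.4
```

The renormalization step matters: a model missing a category isn't penalized to zero for it, but a spiky profile still scores lower overall than a consistent one, which is the pattern the proprietary-vs-open comparison above keeps showing.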

See the full open source leaderboard · Compare all models · Best LLM for coding


Frequently asked questions

What is the best open source LLM in 2026? DeepSeek V4 Pro (Max) leads BenchLM.ai's open weight leaderboard at 87 overall, followed by Kimi K2.6 at 84, GLM-5 (Reasoning) and GLM-5.1 at 83, and Qwen3.5 397B (Reasoning) at 79. These models are well ahead of Llama on the current leaderboard.

Can I run these models locally? Yes, all models listed are open weight and can be self-hosted. Smaller models like Mistral Small 4 (24B parameters) run on a single high-end consumer GPU. Larger models like Qwen3.5 397B and GLM-5 require multi-GPU setups or cloud instances with 4-8 A100/H100 GPUs.

Is DeepSeek still the best open source model? DeepSeek is back at the top with the newer V4 releases. DeepSeek V4 Pro (Max) scores 87 and leads the open-weight leaderboard, while the older DeepSeek V3.2 (Thinking) scores 63 and DeepSeek Coder 2.0 scores 54. Check the exact variant: the V4 models are the current leaders, not the older V3.2 or Coder releases.

What happened to Llama? Meta's Llama 4 Maverick and Scout score 18 and 24 respectively — significantly below the leaders. Llama 3.1 405B at 43 still outperforms both on BenchLM.ai. Llama's ecosystem advantages remain strong, but its benchmark performance has fallen behind.

Which open source model is cheapest to run via API? DeepSeek Coder 2.0 remains one of the cheapest capable options to self-host, though its current overall score is 54 and we don't list a hosted API price for it. Kimi K2.5 at $0.60/$3.00 offers a higher overall score of 64 at still-affordable pricing.


Benchmark data from BenchLM.ai. Prices per million tokens, current as of April 2026.
