open-source · comparison · ranking · self-hosting · guide

Best Open Source LLM in 2026: Rankings, Benchmarks, and the Models Worth Running

Which open source LLM is best in 2026? We rank the top open weight models by real benchmark data — GLM-5, Qwen3.5, Gemma 4, Kimi K2.5, Llama — and compare them to proprietary leaders.

Glevd·Published April 1, 2026·Updated April 8, 2026·12 min read


The best open source LLM right now is GLM-5 (Reasoning) from Zhipu AI, scoring 85 on BenchLM.ai's overall leaderboard. GLM-5.1 follows at 84, Qwen3.5 397B (Reasoning) sits at 81, and GLM-5 rounds out the next tier at 77.

That's a significant shift. Two years ago, Llama dominated the open source conversation. Today, Chinese labs — Zhipu AI, Alibaba, Moonshot AI, and DeepSeek — hold most of the top positions among open weight models, with Google's Gemma 4 31B breaking into the top 5. The best open source LLMs in 2026 are not where most people expect them to be.

Top open source LLMs ranked by benchmarks

Rank  Model                      Creator      Overall  Context
1     GLM-5 (Reasoning)          Zhipu AI     85       200K
2     GLM-5.1                    Zhipu AI     84       203K
3     Qwen3.5 397B (Reasoning)   Alibaba      81       128K
4     GLM-5                      Zhipu AI     77       200K
5     Gemma 4 31B                Google       67       256K
6     GLM-4.7                    Zhipu AI     72       200K
7     Kimi K2.5                  Moonshot AI  68       128K
8     Qwen3.5-122B-A10B          Alibaba      68       262K

Scores from BenchLM.ai's open source leaderboard. Overall score is BenchLM.ai's benchmark-weighted composite; rank order follows the leaderboard itself, whose confidence weighting (described in the methodology section below) means it does not always track the raw overall score exactly.

This table reveals something non-obvious: the models with the highest overall scores are not always the ones with the best results on individual benchmarks. Some open models still post stronger isolated coding scores than GLM-5 (Reasoning), but GLM-5 (Reasoning) wins overall because its knowledge, reasoning, and math profile is much broader.

How close are open source models to proprietary ones?

The honest answer: closer than ever, but still behind.

Model                      Type         Overall  MMLU  AIME 2025  SWE-bench Verified  LiveCodeBench
Gemini 3.1 Pro             Proprietary  94       n/a   n/a        75                  71
GPT-5.3 Codex              Proprietary  89       99    98         85                  85
Claude Opus 4.6            Proprietary  92       99    98         80.8                76
GPT-5.4                    Proprietary  94       99    99         84                  84
GLM-5 (Reasoning)          Open weight  85       96    98         n/a                 62
Qwen3.5 397B (Reasoning)   Open weight  81       91    94         60                  60

The overall gap between the best open weight model (GLM-5 (Reasoning) at 85) and the current proprietary leaders at 94 is 9 points. That's tighter than most people expect. In mid-2024, the gap was much wider. The trajectory still matters as much as the current snapshot.

Where open source models already match or beat proprietary ones:

  • Math: GLM-5 (Reasoning) scores 98 on AIME 2025 and 95 on HMMT 2025 — competitive with the best proprietary math scores
  • Knowledge: GLM-5 (Reasoning) hits 96 on MMLU, 94 on GPQA, and 92 on SuperGPQA
  • Competitive coding: Kimi K2.5 reaches 85 on LiveCodeBench and GLM-4.7 hits 84.9, both ahead of Claude Opus 4.6 (76) on that specific benchmark
  • Multilingual: GLM-4.7 scores 94 on MGSM, ahead of most proprietary models

Where the gap remains wide:

  • Software engineering: The best open weight SWE-bench Verified score is 77.8 (GLM-5). GPT-5.3 Codex scores 85. For SWE-bench Pro, the gap is larger — open models top out around 67 vs 90 for GPT-5.3 Codex
  • Agentic tasks: Open models trail significantly on BrowseComp, TerminalBench, and OSWorld
  • Overall consistency: Proprietary models perform well across all categories simultaneously. Open models tend to spike on specific benchmarks but dip on others

Best open source LLM by use case

Best for math and reasoning

GLM-5 (Reasoning) is the clear winner. AIME 2025: 98. HMMT 2025: 95. BRUMO 2025: 96. Math500: 92. These are near-perfect scores on olympiad-style competition math. No other open weight model comes close.

Runners-up: Step 3.5 Flash (AIME 2025: 97.3) and GLM-4.7 (HMMT 2025: 97.1) are both strong math alternatives with lower overall resource requirements.

Best for coding

This depends on which coding benchmark you care about:

  • HumanEval (function generation): Kimi K2.5 at 99 — essentially perfect, but HumanEval is saturated and no longer differentiates frontier models
  • LiveCodeBench (competitive programming): Kimi K2.5 at 85, closely followed by GLM-4.7 (84.9) and Qwen3.5-27B (80.7)
  • SWE-bench Verified (real bug fixing): GLM-5 at 77.8, Kimi K2.5 at 76.8, GLM-4.7 at 73.8
  • SWE-bench Pro (harder software engineering): GLM-5 (Reasoning) at 67 leads open models, but this is still well below GPT-5.3 Codex (90)

On BenchLM's current blended coding score, Gemma 4 31B leads the open-weight field at 86.6, followed by Qwen3.5 397B (Reasoning) at 84.9 and GLM-5.1 at 82.9. GLM-4.7 still offers one of the cleaner all-around coding profiles with 84.9 on LiveCodeBench and 73.8 on SWE-bench Verified. If SWE-bench Pro matters most, GLM-5 (Reasoning) is still the better pick despite weaker LiveCodeBench numbers.

See the full coding leaderboard · SWE-bench Pro explained

Best for knowledge and question answering

GLM-5 (Reasoning) dominates knowledge benchmarks: MMLU 96, GPQA 94, SuperGPQA 92, SimpleQA 92. The non-reasoning GLM-5 variant is nearly as strong at MMLU 91.7 and SimpleQA 84.

Qwen3.5 397B (Reasoning) is a solid second choice with MMLU 91, GPQA 89, and more balanced performance across categories.

For factual accuracy specifically, check SimpleQA scores: the benchmark asks short factual questions and penalizes hallucinated answers. GLM-5 (Reasoning) leads open models at 92.

Best for self-hosting and fine-tuning

Self-hosting economics favor smaller models. Running a 397B-parameter model requires multiple high-end GPUs. Here's where the practical sweet spot sits:

  • Mistral Small 4 (24B parameters, 256K context): Scores 47 overall. Fits on a single consumer GPU with quantization. Mistral's Apache 2.0 license is genuinely permissive for commercial use
  • Gemma 4 31B (31B parameters, 256K context): Scores 67 overall. Google's latest open model fits on a single high-end consumer GPU and offers strong LiveCodeBench (80) performance
  • DeepSeek Coder 2.0: 54 overall, but $0.27/$1.10 via API makes it cheaper than self-hosting for most teams. Self-host only if data privacy requires it
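Those hardware claims can be sanity-checked with back-of-envelope arithmetic. The sketch below estimates VRAM from parameter count and quantization width; the flat 20% overhead for KV cache and activations is an illustrative assumption, and real requirements vary with context length, batch size, and serving stack.

```python
def est_vram_gb(params_b: float, bits: int, overhead: float = 0.2) -> float:
    """Rough VRAM estimate: weights at the given quantization width, plus a
    flat overhead fraction for KV cache and activations (the 20% default is
    an illustrative assumption, not a measured figure)."""
    weight_gb = params_b * bits / 8  # params in billions ~= GB at 8 bits
    return round(weight_gb * (1 + overhead), 1)

# Mistral Small 4 (24B) at 4-bit quantization:
print(est_vram_gb(24, 4))   # 14.4 -> fits a 24 GB consumer GPU
# Qwen3.5 397B at 8-bit:
print(est_vram_gb(397, 8))  # 476.4 -> multi-GPU territory
```

A handy rule of thumb falls out of this: at 4-bit quantization, budget roughly 0.6 GB of VRAM per billion parameters before context overhead.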

For fine-tuning specifically, Mistral and Qwen models have the most mature fine-tuning ecosystems with well-documented tooling.

Best for privacy-sensitive deployments

If data cannot leave your infrastructure, your options are:

  1. GLM-5 or Qwen3.5 397B for maximum capability (requires serious GPU infrastructure)
  2. Mistral Small 4 for the best quality-to-resource ratio (runs on a single A100 or equivalent)
  3. Llama 3.1 405B for the broadest ecosystem support and community tooling, despite lower benchmark scores (43 overall)

All open weight models can be self-hosted with no API dependency. The real constraint is GPU cost: running a 400B+ model costs $2-5K/month in cloud GPU compute, which only makes sense above roughly 50M tokens per month versus API pricing.
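The break-even figure above is simple arithmetic: divide the fixed monthly GPU cost by the per-token API price it displaces. The $3,500/month node cost and the $70-per-1M-token blended API rate below are illustrative assumptions, not quoted prices.

```python
def breakeven_tokens_m(gpu_cost_usd: float, api_price_per_m: float) -> float:
    """Monthly token volume (in millions of tokens) at which a fixed
    self-hosting GPU cost matches metered API spend."""
    return gpu_cost_usd / api_price_per_m

# Assuming $3,500/month for a multi-GPU node and a blended API price of
# ~$70 per 1M tokens (hypothetical figures chosen for illustration):
print(breakeven_tokens_m(3500, 70))  # 50.0 -> ~50M tokens/month
```

Below that volume, the API is cheaper; above it, self-hosting starts to pay for itself, before counting operations effort.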

Best for cost-sensitive applications

If cost per token drives your decision, these are the open weight models available through low-cost APIs:

Model                      API price (input/output per 1M tokens)  Overall
DeepSeek Coder 2.0         $0.27 / $1.10                           54
DeepSeek V3.2 (Thinking)   ~$0.55 / $2.19                          65
Kimi K2.5                  $0.50 / $2.80                           68

For comparison, GPT-5.4 Pro costs $30/$180 and Claude Opus 4.6 costs $15/$75. DeepSeek Coder 2.0 at $0.27/$1.10 delivers a BenchLM overall score of 54 at a fraction of the output cost of GPT-5.4 Pro. That's not a rounding error — it's a fundamentally different cost structure.
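To see what that cost structure means in practice, the sketch below prices a hypothetical monthly workload (10M input and 2M output tokens, an arbitrary volume chosen for illustration) against the rates above plus the two proprietary reference points.

```python
# (input, output) in USD per 1M tokens, from the prices quoted in the text
prices = {
    "DeepSeek Coder 2.0": (0.27, 1.10),
    "Kimi K2.5": (0.50, 2.80),
    "Claude Opus 4.6": (15.0, 75.0),
    "GPT-5.4 Pro": (30.0, 180.0),
}

def monthly_cost(inp_m: float, out_m: float, price: tuple) -> float:
    """Monthly spend for inp_m million input and out_m million output tokens."""
    return round(inp_m * price[0] + out_m * price[1], 2)

for model, p in prices.items():
    print(f"{model}: ${monthly_cost(10, 2, p):,.2f}")
# DeepSeek Coder 2.0: $4.90
# Kimi K2.5: $10.60
# Claude Opus 4.6: $300.00
# GPT-5.4 Pro: $660.00
```

Two orders of magnitude separate the cheapest and most expensive options on the same workload, which is why score-per-dollar, not raw score, drives many deployment decisions.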

The DeepSeek, Qwen, GLM, and Llama landscape

DeepSeek

DeepSeek has the strongest brand recognition among open source LLMs, but its leaderboard position has slipped. DeepSeek V3.2 (Thinking) scores 65 and DeepSeek Coder 2.0 scores 54, both well behind GLM-5 (Reasoning) at 85 and Qwen3.5 397B (Reasoning) at 81. DeepSeek's advantage is pricing: at $0.27-0.55 per 1M input tokens, its API is one of the cheapest ways to access a capable model. DeepSeek also posts strong agentic benchmark results relative to its overall ranking.

Qwen (Alibaba)

Alibaba's Qwen3.5 397B (Reasoning) at 81 is still one of the strongest open-weight models overall. The Qwen ecosystem is broad: Qwen2.5-1M offers a 1M token context window, Qwen2.5-72B provides a solid mid-size option, and Alibaba continues to ship variants frequently. The Qwen3.5 non-reasoning variant sits at 66, so reasoning mode currently adds about 15 points of overall performance.

GLM (Zhipu AI)

Zhipu AI's GLM family now occupies four of the top six open-weight spots. GLM-5 (Reasoning) at 85 leads the leaderboard, GLM-5.1 is right behind at 84, and GLM-5 at 77 remains a strong non-reasoning engineering option. GLM-4.7 at 72 still offers one of the cleaner all-around coding profiles in the open-weight field.

Llama (Meta)

Llama's position has changed dramatically. Llama 4 Maverick scores 18 and Llama 4 Scout scores 24, both below Llama 3.1 405B at 43. Meta's open-weight models, which defined the category in 2023-2024, now trail the leading Chinese open models by a wide margin. Llama remains relevant for its ecosystem, community tooling, and broad cloud provider support. But on pure benchmark performance, it is no longer competitive at the frontier of open-weight AI.

Mistral

Mistral Small 4 (Reasoning) is no longer ranked on BenchLM.ai due to insufficient trusted benchmark data. Mistral's strength remains efficiency: Small 4 runs on modest hardware with a 256K context window. Mistral is also the only major European open weight model provider, which matters for organizations with data sovereignty requirements.

NVIDIA Nemotron

NVIDIA's Nemotron 3 Ultra 500B scores 65 and offers a 10M token context window — the largest among open weight models. Nemotron 3 Super 120B A12B (61) and Super 100B (60) provide more practical deployment options. NVIDIA's integration with its own GPU tooling gives Nemotron models a deployment advantage on NVIDIA hardware.

What "open source" actually means for LLMs

Not all "open weight" models are truly open source. The distinction matters:

Open weight means the model weights are downloadable and you can run inference locally. This is what most people mean when they say "open source LLM." GLM-5, Qwen3.5, DeepSeek, and Mistral models are all open weight.

Open source in the strict OSI definition requires the training data, training code, and model weights to all be available. Almost no frontier LLM meets this bar. OLMo from AI2 is one of the few that does.

Permissive license vs restricted license is often more important than the open-weight distinction. Mistral uses Apache 2.0 (fully permissive for commercial use). Meta's Llama license restricts use above 700M monthly active users. Many Chinese open weight models have custom licenses that may restrict commercial deployment in certain jurisdictions.

For most teams, the practical question is: "Can I download the weights, run the model on my infrastructure, and use it in my product without paying per-token fees?" For all models on this list, the answer is yes — but read the license before deploying in production.

How we rank open source models

BenchLM.ai ranks open weight models using the same benchmark-weighted methodology as proprietary models. The overall score combines performance across knowledge (MMLU, GPQA, SuperGPQA), coding (SWE-bench Pro, SWE-bench Verified, LiveCodeBench), math (AIME, HMMT, Math500), reasoning (MUSR, BBH), instruction following (IFEval), multilingual (MGSM), agentic tasks (TerminalBench, BrowseComp, OSWorld), and multimodal benchmarks.

Models with scores reported only by their creators receive lower confidence weighting until independent verification is available. The full methodology is documented on the BenchLM.ai leaderboard.
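BenchLM.ai's exact category weights are not published in this article, so the sketch below is a hypothetical illustration of how a benchmark-weighted composite might be computed, including renormalizing weights when a model has no score in a category. The weight values are made up for the example.

```python
# Hypothetical category weights -- illustrative only, not BenchLM.ai's actual
# methodology (which also covers instruction following, multilingual, and
# multimodal categories).
WEIGHTS = {"knowledge": 0.25, "coding": 0.25, "math": 0.2,
           "reasoning": 0.1, "agentic": 0.2}

def composite(scores: dict) -> float:
    """Weighted average over the categories a model reports; categories
    without a score are dropped and the remaining weights renormalized."""
    present = {k: w for k, w in WEIGHTS.items() if k in scores}
    total_w = sum(present.values())
    return round(sum(scores[k] * w for k, w in present.items()) / total_w, 1)

print(composite({"knowledge": 94, "coding": 70, "math": 98,
                 "reasoning": 88, "agentic": 60}))  # 81.4
```

Renormalization is what lets a model missing, say, agentic scores still receive an overall number; the confidence weighting described above would then discount categories backed only by creator-reported figures.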

See the full open source leaderboard · Compare all models · Best LLM for coding


Frequently asked questions

What is the best open source LLM in 2026? GLM-5 (Reasoning) from Zhipu AI leads BenchLM.ai's open weight leaderboard at 85 overall, followed by GLM-5.1 at 84 and Qwen3.5 397B (Reasoning) at 81. All three are well ahead of DeepSeek and Llama on the current leaderboard.

Can I run these models locally? Yes, all models listed are open weight and can be self-hosted. Smaller models like Mistral Small 4 (24B parameters) run on a single high-end consumer GPU. Larger models like Qwen3.5 397B and GLM-5 require multi-GPU setups or cloud instances with 4-8 A100/H100 GPUs.

Is DeepSeek still the best open source model? No. DeepSeek V3.2 (Thinking) scores 65 and DeepSeek Coder 2.0 scores 54 on BenchLM.ai — both well behind the leader, GLM-5 (Reasoning) at 85. DeepSeek remains competitive on pricing and useful for cost-sensitive workloads, but it is no longer the overall open-weight performance leader.

What happened to Llama? Meta's Llama 4 Maverick and Scout score 18 and 24 respectively — significantly below the leaders. Llama 3.1 405B at 43 still outperforms both on BenchLM.ai. Llama's ecosystem advantages remain strong, but its benchmark performance has fallen behind.

Which open source model is cheapest to run via API? DeepSeek Coder 2.0 at $0.27/$1.10 per million tokens is still one of the cheapest capable options, though its current overall score is 54. Kimi K2.5 at $0.50/$2.80 offers a higher overall score of 68 at still-affordable pricing.


Benchmark data from BenchLM.ai. Prices per million tokens, current as of April 2026.
