open-source · comparison · ranking · self-hosting · guide

Best Open Source LLM in 2026: Rankings, Benchmarks, and the Models Worth Running

Which open source LLM is best in 2026? We rank the top open weight models by real benchmark data — GLM-5, Qwen3.5, DeepSeek, Mistral, Llama — and compare them to proprietary leaders.

Glevd · April 1, 2026 · 12 min read


The best open source LLM right now is GLM-5 (Reasoning) from Zhipu AI, scoring 78 on BenchLM.ai's overall leaderboard. Qwen3.5 397B (Reasoning) follows at 75, with MiMo-V2-Flash and Step 3.5 Flash tied at 70.

That's a significant shift. Two years ago, Llama dominated the open source conversation. Today, Chinese labs — Zhipu AI, Alibaba, Xiaomi, DeepSeek, Moonshot AI, and StepFun — hold every top position among open weight models. The best open source LLMs in 2026 are not where most people expect them to be.

Top open source LLMs ranked by benchmarks

| Rank | Model | Creator | Overall | MMLU | AIME 2025 | SWE-Verified | LiveCodeBench | Context |
|------|-------|---------|---------|------|-----------|--------------|---------------|---------|
| 1 | GLM-5 (Reasoning) | Zhipu AI | 78 | 96 | 98 | 62 | 58 | 200K |
| 2 | Qwen3.5 397B (Reasoning) | Alibaba | 75 | 91 | 94 | 60 | 60 | 128K |
| 3 | MiMo-V2-Flash | Xiaomi | 70 | 86.7 | 94.1 | 73.4 | 80.6 | 256K |
| 4 | Step 3.5 Flash | StepFun | 70 | 84 | 97.3 | 74.4 | 86.4 | 256K |
| 5 | GLM-4.7 | Zhipu AI | 69 | 86 | 95.7 | 73.8 | 84.9 | 200K |
| 6 | Kimi K2.5 | Moonshot AI | 68 | 77 | 78 | 76.8 | 85 | 128K |
| 7 | GLM-5 | Zhipu AI | 68 | 91.7 | 93.3 | 77.8 | 52 | 200K |
| 8 | Mistral Small 4 (Reasoning) | Mistral | 68 | 83.8 | 63.6 | – | – | 256K |

Scores from BenchLM.ai open source leaderboard. Overall score is BenchLM.ai's benchmark-weighted composite.

This table reveals something non-obvious: the models with the highest overall scores are not always the ones with the best individual benchmark numbers. MiMo-V2-Flash and Step 3.5 Flash outscore GLM-5 (Reasoning) on SWE-bench Verified and LiveCodeBench, but GLM-5 (Reasoning) dominates knowledge and math so comprehensively that its overall score is 8 points higher.

How close are open source models to proprietary ones?

The honest answer: closer than ever, but still behind.

| Model | Type | Overall | MMLU | AIME 2025 | SWE-Verified | LiveCodeBench |
|-------|------|---------|------|-----------|--------------|---------------|
| GPT-5.4 Pro | Proprietary | 91 | – | – | 84 | 84 |
| Claude Opus 4.6 | Proprietary | 80.8 | – | – | – | 76 |
| GLM-5 (Reasoning) | Open Weight | 78 | 96 | 98 | 62 | 58 |
| Qwen3.5 397B (Reasoning) | Open Weight | 75 | 91 | 94 | 60 | 60 |
| MiMo-V2-Flash | Open Weight | 70 | 86.7 | 94.1 | 73.4 | 80.6 |

The overall gap between the best open weight model (GLM-5 (Reasoning) at 78) and the best proprietary model (GPT-5.4 Pro at 91) is 13 points. That's real. In mid-2024, the gap was closer to 25-30 points. The trajectory matters as much as the current snapshot.

Where open source models already match or beat proprietary ones:

  • Math: GLM-5 (Reasoning) scores 98 on AIME 2025 and 95 on HMMT 2025 — competitive with the best proprietary math scores
  • Knowledge: GLM-5 (Reasoning) hits 96 on MMLU, 94 on GPQA, and 92 on SuperGPQA
  • Competitive coding: Step 3.5 Flash reaches 86.4 on LiveCodeBench, ahead of Claude Opus 4.6 (76) on that specific benchmark
  • Multilingual: GLM-4.7 scores 94 on MGSM, ahead of most proprietary models

Where the gap remains wide:

  • Software engineering: The best open weight SWE-bench Verified score is 77.8 (GLM-5). GPT-5.3 Codex scores 85. For SWE-bench Pro, the gap is larger — open models top out around 67 vs 90 for GPT-5.3 Codex
  • Agentic tasks: Open models trail significantly on BrowseComp, TerminalBench, and OSWorld
  • Overall consistency: Proprietary models perform well across all categories simultaneously. Open models tend to spike on specific benchmarks but dip on others

Best open source LLM by use case

Best for math and reasoning

GLM-5 (Reasoning) is the clear winner. AIME 2025: 98. HMMT 2025: 95. BRUMO 2025: 96. Math500: 92. These are near-perfect scores on graduate-level competition math. No other open weight model comes close.

Runner-up: Step 3.5 Flash (AIME 2025: 97.3) and GLM-4.7 (HMMT 2025: 97.1) are both strong math alternatives with lower overall resource requirements.

Best for coding

This depends on which coding benchmark you care about:

  • HumanEval (function generation): Kimi K2.5 at 99 — essentially perfect, but HumanEval is saturated and no longer differentiates frontier models
  • LiveCodeBench (competitive programming): Step 3.5 Flash at 86.4, closely followed by Kimi K2.5 (85) and GLM-4.7 (84.9)
  • SWE-bench Verified (real bug fixing): GLM-5 at 77.8, Kimi K2.5 at 76.8, Step 3.5 Flash at 74.4
  • SWE-bench Pro (harder software engineering): GLM-5 (Reasoning) at 67 leads open models, but this is still well below GPT-5.3 Codex (90)

For a general-purpose open source coding model, GLM-4.7 offers the best balance: 94.2 HumanEval, 84.9 LiveCodeBench, 73.8 SWE-Verified, and a 200K context window. If SWE-bench Pro matters most, GLM-5 (Reasoning) is the better pick despite weaker LiveCodeBench numbers.

See the full coding leaderboard · SWE-bench Pro explained

Best for knowledge and question answering

GLM-5 (Reasoning) dominates knowledge benchmarks: MMLU 96, GPQA 94, SuperGPQA 92, SimpleQA 92. The non-reasoning GLM-5 variant is nearly as strong at MMLU 91.7 and SimpleQA 84.

Qwen3.5 397B (Reasoning) is a solid second choice with MMLU 91, GPQA 89, and more balanced performance across categories.

For factual accuracy specifically, check SimpleQA scores, which measure short-form factual recall and penalize hallucinated answers. GLM-5 (Reasoning) leads open models at 92.

Best for self-hosting and fine-tuning

Self-hosting economics favor smaller models. Running a 397B-parameter model requires multiple high-end GPUs. Here's where the practical sweet spot sits:

  • Mistral Small 4 (24B parameters, 256K context): Scores 66 overall. Fits on a single consumer GPU with quantization. Mistral's Apache 2.0 license is genuinely permissive for commercial use
  • MiMo-V2-Flash (Mixture of Experts): 70 overall score with strong multimodal capabilities (MMMU-Pro: 78). Efficient inference via MoE architecture
  • DeepSeek Coder 2.0: 66 overall, but $0.27/$1.10 via API makes it cheaper than self-hosting for most teams. Self-host only if data privacy requires it
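The single-GPU claim for Mistral Small 4 comes down to simple arithmetic: weight memory is parameter count times bits per parameter. A minimal sketch, counting weights only; KV-cache and activations need extra headroom, and the GPU sizes in the comments are rough rules of thumb rather than figures from this article:

```python
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """GB needed just to hold the weights (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8

# Mistral Small 4 is 24B parameters per the list above.
print(weight_vram_gb(24, 16))  # fp16: 48.0 GB, data-center GPU territory
print(weight_vram_gb(24, 4))   # 4-bit quantized: 12.0 GB, fits a 16-24 GB consumer card
```

The same formula explains why 400B-class models need multi-GPU setups even when quantized.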

For fine-tuning specifically, Mistral and Qwen models have the most mature fine-tuning ecosystems with well-documented tooling.

Best for privacy-sensitive deployments

If data cannot leave your infrastructure, your options are:

  1. GLM-5 or Qwen3.5 397B for maximum capability (requires serious GPU infrastructure)
  2. Mistral Small 4 for the best quality-to-resource ratio (runs on a single A100 or equivalent)
  3. Llama 3.1 405B for the broadest ecosystem support and community tooling, despite lower benchmark scores (59 overall)

All open weight models can be self-hosted with no API dependency. The real constraint is GPU cost: running a 400B+ model costs $2-5K/month in cloud GPU compute, which only breaks even above roughly 50M tokens per month against frontier API pricing; measured against budget open-weight APIs, the break-even volume is orders of magnitude higher.
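The 50M-token figure can be reproduced with back-of-envelope arithmetic. This sketch uses only prices quoted in this article; the $3.5K GPU bill is the midpoint of the $2-5K range, and the 3:1 input:output token mix is an assumption:

```python
def blended_price(price_in: float, price_out: float, out_share: float = 0.25) -> float:
    """Blended USD per 1M tokens, assuming a 3:1 input:output token mix."""
    return price_in * (1 - out_share) + price_out * out_share

def breakeven_tokens(monthly_gpu_cost: float, api_price_per_m: float) -> float:
    """Monthly token volume above which self-hosting beats the API."""
    return monthly_gpu_cost / api_price_per_m * 1_000_000

gpu_cost = 3500.0  # midpoint of the $2-5K/month cloud GPU estimate

frontier = blended_price(30.00, 180.00)  # GPT-5.4 Pro prices from this article
budget = blended_price(0.10, 0.30)       # Step 3.5 Flash prices from this article

print(f"vs frontier API: {breakeven_tokens(gpu_cost, frontier):,.0f} tokens/month")  # ~52M
print(f"vs budget API:   {breakeven_tokens(gpu_cost, budget):,.0f} tokens/month")    # ~23B
```

Against a $0.10/$0.30 API, in other words, self-hosting only pays off at billions of tokens per month or under hard data-residency constraints.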

Best for cost-sensitive applications

If cost per token drives your decision, these are the open weight models available through low-cost APIs:

| Model | API Price (input / output per 1M tokens) | Overall Score |
|-------|------------------------------------------|---------------|
| Step 3.5 Flash | $0.10 / $0.30 | 70 |
| DeepSeek Coder 2.0 | $0.27 / $1.10 | 66 |
| DeepSeek V3.2 (Thinking) | ~$0.55 / $2.19 | 66 |
| Kimi K2.5 | $0.50 / $2.80 | 68 |

For comparison, GPT-5.4 Pro costs $30/$180 and Claude Opus 4.6 costs $15/$75. Step 3.5 Flash at $0.10/$0.30 delivers a BenchLM overall score of 70 at 1/600th the output cost of GPT-5.4 Pro. That's not a rounding error — it's a fundamentally different cost structure.
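The 1/600th figure is plain division of the listed prices. A quick check, using only numbers from the table and the proprietary prices above:

```python
prices = {  # (input, output) USD per 1M tokens, as quoted in this article
    "GPT-5.4 Pro": (30.00, 180.00),
    "Claude Opus 4.6": (15.00, 75.00),
    "Step 3.5 Flash": (0.10, 0.30),
}

base_in, base_out = prices["Step 3.5 Flash"]
for model, (p_in, p_out) in prices.items():
    print(f"{model}: {p_in / base_in:.0f}x input, {p_out / base_out:.0f}x output")
# GPT-5.4 Pro output: 180 / 0.30 = 600x, i.e. Step 3.5 Flash costs 1/600th as much.
```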

The DeepSeek, Qwen, GLM, and Llama landscape

DeepSeek

DeepSeek has the strongest brand recognition among open source LLMs, but its leaderboard position has slipped. DeepSeek V3.2 (Thinking) scores 66 and DeepSeek Coder 2.0 also scores 66 — both well behind GLM-5 (78) and Qwen3.5 (75). DeepSeek's advantage is pricing: the API at $0.27-0.55/M input tokens makes it one of the cheapest ways to access a capable model. DeepSeek also has strong agentic scores (TerminalBench 71, OSWorld Verified 67) relative to its overall ranking.

Qwen (Alibaba)

Alibaba's Qwen3.5 397B (Reasoning) at 75 is the second-strongest open weight model overall. The Qwen ecosystem is broad: Qwen2.5-1M offers a 1M token context window (unique among high-performing open models), Qwen2.5-72B provides a solid mid-size option at 63, and Alibaba continues to ship variants frequently. The Qwen3.5 non-reasoning variant (62) shows that reasoning mode adds 13 points of overall performance — the largest reasoning-mode uplift among open models.

GLM (Zhipu AI)

Zhipu AI's GLM family holds three spots in the top 8 open weight models. GLM-5 (Reasoning) at 78 leads the entire open source leaderboard. GLM-4.7 at 69 offers possibly the best all-around coding profile among open models (HumanEval 94.2, LiveCodeBench 84.9, HMMT 2025 97.1). GLM-5 (non-reasoning) at 68 has the highest SWE-bench Verified score (77.8) and SWE-Rebench score (62.8) of any open model, making it the best option for real-world software engineering without reasoning-mode overhead.

Llama (Meta)

Llama's position has changed dramatically. Llama 4 Maverick and Scout both score 43 — lower than Llama 3.1 405B (59). Meta's open source models, which defined the category in 2023-2024, now trail Chinese competitors by 20-35 points on overall benchmarks. Llama remains relevant for its permissive license, massive community ecosystem (fine-tuning tooling, deployment guides, quantized versions), and broad cloud provider support. But on pure benchmark performance, Llama is no longer competitive at the frontier of open source AI.

Mistral

Mistral Small 4 (Reasoning) scores 68 overall — tied with Kimi K2.5 and GLM-5 (non-reasoning). Mistral's strength is efficiency: Small 4 runs on modest hardware with a 256K context window. Mistral is also the only major European open weight model provider, which matters for organizations with data sovereignty requirements.

NVIDIA Nemotron

NVIDIA's Nemotron 3 Ultra 500B scores 65 and offers a 10M token context window — the largest among open weight models. Nemotron 3 Super 120B A12B (61) and Super 100B (60) provide more practical deployment options. NVIDIA's integration with its own GPU tooling gives Nemotron models a deployment advantage on NVIDIA hardware.

What "open source" actually means for LLMs

Not all "open weight" models are truly open source. The distinction matters:

Open weight means the model weights are downloadable and you can run inference locally. This is what most people mean when they say "open source LLM." GLM-5, Qwen3.5, DeepSeek, and Mistral models are all open weight.

Open source in the strict OSI definition requires the training data, training code, and model weights to all be available. Almost no frontier LLM meets this bar. OLMo from AI2 is one of the few that does.

Permissive license vs restricted license is often more important than the open-weight distinction. Mistral uses Apache 2.0 (fully permissive for commercial use). Meta's Llama license restricts use above 700M monthly active users. Many Chinese open weight models have custom licenses that may restrict commercial deployment in certain jurisdictions.

For most teams, the practical question is: "Can I download the weights, run the model on my infrastructure, and use it in my product without paying per-token fees?" For all models on this list, the answer is yes — but read the license before deploying in production.

How we rank open source models

BenchLM.ai ranks open weight models using the same benchmark-weighted methodology as proprietary models. The overall score combines performance across knowledge (MMLU, GPQA, SuperGPQA), coding (SWE-bench Pro, SWE-bench Verified, LiveCodeBench), math (AIME, HMMT, Math500), reasoning (MUSR, BBH), instruction following (IFEval), multilingual (MGSM), agentic tasks (TerminalBench, BrowseComp, OSWorld), and multimodal benchmarks.
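To make "benchmark-weighted composite" concrete, here is an illustrative sketch. The category weights and per-category scores are hypothetical stand-ins (BenchLM.ai's actual weights are not given in this article), chosen to show why a consistent profile can out-rank a spiky one:

```python
def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-category scores over the categories present."""
    total = sum(weights[cat] for cat in scores)
    return sum(scores[cat] * weights[cat] for cat in scores) / total

# Hypothetical category weights and scores, for illustration only.
weights = {"knowledge": 0.25, "coding": 0.30, "math": 0.20, "agentic": 0.25}
spiky = {"knowledge": 96, "coding": 60, "math": 98, "agentic": 55}
steady = {"knowledge": 84, "coding": 80, "math": 84, "agentic": 80}

print(round(composite(spiky, weights), 2))   # near-perfect math, weak coding/agentic
print(round(composite(steady, weights), 2))  # no spikes, higher composite
```

This is the same dynamic noted earlier: models that spike on one benchmark but dip on others lose composite points to models that are merely good everywhere.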

Models with scores reported only by their creators receive lower confidence weighting until independent verification is available. The full methodology is documented on the BenchLM.ai leaderboard.

See the full open source leaderboard · Compare all models · Best LLM for coding


Frequently asked questions

What is the best open source LLM in 2026? GLM-5 (Reasoning) from Zhipu AI leads BenchLM.ai's open weight leaderboard at 78 overall, followed by Qwen3.5 397B (Reasoning) at 75. Both are well ahead of DeepSeek (66) and Llama 4 (43).

Can I run these models locally? Yes, all models listed are open weight and can be self-hosted. Smaller models like Mistral Small 4 (24B parameters) run on a single high-end consumer GPU. Larger models like Qwen3.5 397B and GLM-5 require multi-GPU setups or cloud instances with 4-8 A100/H100 GPUs.

Is DeepSeek still the best open source model? No. DeepSeek V3.2 and DeepSeek Coder 2.0 both score 66 on BenchLM.ai — 12 points behind the leader (GLM-5 (Reasoning) at 78). DeepSeek remains competitive on pricing and strong for agentic tasks, but it is no longer the overall open weight performance leader.

What happened to Llama? Meta's Llama 4 Maverick and Scout score 43 — significantly below Chinese open weight models. Llama 3.1 405B (59) still outperforms Llama 4 on BenchLM.ai. Llama's ecosystem advantages (community tooling, cloud provider support, permissive licensing) remain strong, but its benchmark performance has fallen behind.

Which open source model is cheapest to run via API? Step 3.5 Flash at $0.10/$0.30 per million tokens is the cheapest high-performing option (70 overall score). DeepSeek Coder 2.0 at $0.27/$1.10 is the next most affordable option at 66 overall.


Benchmark data from BenchLM.ai. Prices per million tokens, current as of April 2026.
