Not all benchmark scores are equally trustworthy. BenchLM tracks the provenance of every score — whether it comes from a verified public source or was inferred from related data. The confidence indicator (1-4 dots) shows how much verified data supports each model's overall score.
- ●●●● High: 20+ verified benchmarks across 7+ categories
- ●●●○ Good: 12+ verified benchmarks across 5+ categories
- ●●○○ Moderate: 8+ verified benchmarks across 3+ categories
- ●○○○ Low / Estimated: limited verified data; the score is estimated
Current distribution of tracked models by confidence level:

- High: 78 models (75%)
- Good: 23 models (22%)
- Moderate: 3 models (3%)
- Low / Estimated: 0 models (0%)
Each benchmark value is tagged as manual (verified from public sources) or generated (inferred from related models). Generated values receive a 25% discount in the overall score calculation to prevent models with mostly inferred data from outranking those with solid verified coverage.
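To make the discount concrete, here is a minimal sketch of how a 25% discount on generated values could enter a weighted average. The data shape, field names, and benchmark weights are assumptions for illustration; only the 25% figure comes from the methodology described above.

```python
# Minimal sketch of the provenance discount; NOT BenchLM's published formula.
# Each result is assumed to carry a 0-100 score, a benchmark weight, and a
# provenance tag of "manual" (verified) or "generated" (inferred).

GENERATED_DISCOUNT = 0.75  # generated values count at 75% of their normal weight

def overall_score(results: list[dict]) -> float:
    """Weighted average in which generated (inferred) values are discounted 25%."""
    weighted_sum = 0.0
    weight_total = 0.0
    for r in results:
        w = r["weight"] * (1.0 if r["provenance"] == "manual" else GENERATED_DISCOUNT)
        weighted_sum += w * r["score"]
        weight_total += w
    return weighted_sum / weight_total if weight_total else 0.0

example = [
    {"score": 74, "weight": 1.0, "provenance": "manual"},
    {"score": 82, "weight": 1.0, "provenance": "generated"},  # inferred, so discounted
]
print(round(overall_score(example), 1))  # 77.4: the inferred 82 carries less weight than a verified score would
```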
A model must have at least 8 verified benchmarks across 2+ categories to receive a global ranking. Models below this threshold are shown as "Tracked" with their available scores visible but not ranked.
For category leaderboards, a model needs verified scores on at least half of the weighted benchmarks in that category. This prevents a model with a single strong benchmark from appearing at the top of a category.
Some benchmarks (MMLU, BBH, HumanEval, older AIME/HMMT variants) are shown for context but don't affect scoring. These are either saturated (top models all score 97%+) or have been superseded by harder versions.
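Read together, the two eligibility rules above can be expressed as a pair of small checks. The sketch below is one plausible reading with assumed data shapes, not BenchLM's internal implementation.

```python
# One plausible reading of the eligibility rules, with assumed data shapes.
# A benchmark record is assumed to look like:
#   {"id": "swe-rebench", "category": "coding", "provenance": "manual" | "generated"}

def globally_rankable(benchmarks: list[dict]) -> bool:
    """Global ranking: at least 8 verified benchmarks spanning 2+ categories.

    Models that fail this check are shown as "Tracked" rather than ranked.
    """
    verified = [b for b in benchmarks if b["provenance"] == "manual"]
    return len(verified) >= 8 and len({b["category"] for b in verified}) >= 2

def category_rankable(benchmarks: list[dict], weighted_ids: set[str]) -> bool:
    """Category ranking: verified scores on at least half of that category's
    weighted (non-display-only) benchmarks."""
    verified_ids = {b["id"] for b in benchmarks if b["provenance"] == "manual"}
    return 2 * len(verified_ids & weighted_ids) >= len(weighted_ids)
```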
| Model | Provider | Confidence | Score |
|---|---|---|---|
| Kimi K2.5 | Moonshot AI | ●●●● High | 72 |
| Claude Opus 4.5 | Anthropic | ●●●● High | 76 |
| Qwen3.5 397B | Alibaba | ●●●● High | 68 |
| GLM-5 | Zhipu AI | ●●●● High | 75 |
| Qwen3.6 Plus | Alibaba | ●●●● High | 69 |
| Claude Opus 4.6 | Anthropic | ●●●● High | 85 |
| GPT-5.2 | OpenAI | ●●●● High | 82 |
| GPT-5.4 | OpenAI | ●●●● High | 82 |
| Gemini 3.1 Pro | | ●●●● High | 87 |
| Kimi K2.5 (Reasoning) | Moonshot AI | ●●●● High | 76 |
| Grok 4 | xAI | ●●●● High | 68 |
| Gemini 3 Flash | | ●●●● High | 67 |
| GPT-OSS 120B | OpenAI | ●●●● High | 50 |
| GPT-5.3 Codex | OpenAI | ●●●● High | 85 |
| GPT-5 (high) | OpenAI | ●●●● High | 82 |
| GPT-5.2-Codex | OpenAI | ●●●● High | 82 |
| GPT-5.1 | OpenAI | ●●●● High | 78 |
| DeepSeek V3.2 (Thinking) | DeepSeek | ●●●● High | 67 |
| MiMo-V2-Flash | Xiaomi | ●●●● High | 67 |
| Gemini 2.5 Pro | | ●●●● High | 65 |
| Claude Haiku 4.5 | Anthropic | ●●●● High | 63 |
| o4-mini (high) | OpenAI | ●●●● High | 58 |
| Claude 3.5 Sonnet | Anthropic | ●●●● High | 55 |
| DeepSeek-R1 | DeepSeek | ●●●● High | 45 |
| GPT-5.4 Pro | OpenAI | ●●●● High | 92 |
| GPT-5.1-Codex-Max | OpenAI | ●●●● High | 81 |
| GPT-5 (medium) | OpenAI | ●●●● High | 76 |
| o1-preview | OpenAI | ●●●● High | 72 |
| Grok 4.1 Fast | xAI | ●●●● High | 70 |
| o3-pro | OpenAI | ●●●● High | 67 |
| DeepSeek Coder 2.0 | DeepSeek | ●●●● High | 62 |
| Qwen2.5-1M | Alibaba | ●●●● High | 62 |
| Claude 4.1 Opus | Anthropic | ●●●● High | 62 |
| Claude 4 Sonnet | Anthropic | ●●●● High | 62 |
| Qwen2.5-72B | Alibaba | ●●●● High | 60 |
| Gemini 3.1 Flash-Lite | | ●●●● High | 56 |
| Grok Code Fast 1 | xAI | ●●●● High | 56 |
| Mistral Large 2 | Mistral | ●●●● High | 52 |
| Claude 3 Opus | Anthropic | ●●●● High | 49 |
| Qwen3 235B 2507 | Alibaba | ●●●● High | 47 |
| Llama 3 70B | Meta | ●●●● High | 44 |
| Mistral 8x7B | Mistral | ●●●● High | 44 |
| Moonshot v1 | Moonshot AI | ●●●● High | 43 |
| Claude 3 Haiku | Anthropic | ●●●● High | 43 |
| DeepSeek V3.1 | DeepSeek | ●●●● High | 41 |
| Nemotron Ultra 253B | NVIDIA | ●●●● High | 41 |
| Llama 4 Maverick | Meta | ●●●● High | 39 |
| GLM-4.5-Air | Zhipu AI | ●●●● High | 38 |
| Gemma 3 27B | | ●●●● High | 35 |
| Qwen3.5 397B (Reasoning) | Alibaba | ●●●● High | 77 |
| o3 | OpenAI | ●●●● High | 64 |
| Qwen3 235B 2507 (Reasoning) | Alibaba | ●●●● High | 55 |
| Z-1 | Z | ●●●● High | 44 |
| Mistral 7B v0.3 | Mistral | ●●●● High | 29 |
| Gemini 3 Pro Deep Think | | ●●●● High | 80 |
| Gemini 2.5 Flash | | ●●●● High | 50 |
| Nemotron 3 Nano 30B | NVIDIA | ●●●● High | 42 |
| Grok 4.1 | xAI | ●●●● High | 85 |
| Claude Sonnet 4.5 | Anthropic | ●●●● High | 68 |
| Mistral Large 3 | Mistral | ●●●● High | 58 |
| Nemotron 3 Super 100B | NVIDIA | ●●●● High | 56 |
| DeepSeek V3.1 (Reasoning) | DeepSeek | ●●●● High | 43 |
| DeepSeekMath V2 | DeepSeek | ●●●● High | 63 |
| Grok 3 [Beta] | xAI | ●●●● High | 48 |
| GLM-4.5 | Zhipu AI | ●●●● High | 40 |
| Gemini 1.0 Pro | | ●●●● High | 40 |
| GPT-OSS 20B | OpenAI | ●●●● High | 36 |
| GLM-4.7 | Zhipu AI | ●●●● High | 74 |
| DeepSeek V3.2 | DeepSeek | ●●●● High | 61 |
| Nova Pro | Amazon | ●●●● High | 33 |
| GLM-5 (Reasoning) | Zhipu AI | ●●●● High | 82 |
| Nemotron 3 Ultra 500B | NVIDIA | ●●●● High | 60 |
| GPT-4o | OpenAI | ●●●● High | 50 |
| GPT-4 Turbo | OpenAI | ●●●● High | 43 |
| Llama 4 Behemoth | Meta | ●●●● High | 34 |
| Nemotron-4 15B | NVIDIA | ●●●● High | 42 |
| Gemini 1.5 Pro | | ●●●● High | 50 |
| Llama 3.1 405B | Meta | ●●●● High | 53 |
| Gemini 3 Pro | | ●●●○ Good | 79 |
| DeepSeek LLM 2.0 | DeepSeek | ●●●○ Good | 57 |
| Claude 4.1 Opus Thinking | Anthropic | ●●●○ Good | 57 |
| Claude Sonnet 4.6 | Anthropic | ●●●○ Good | 84 |
| Kimi K2 | Moonshot AI | ●●●○ Good | 53 |
| GPT-5.4 nano | OpenAI | ●●●○ Good | 58 |
| Llama 4 Scout | Meta | ●●●○ Good | 44 |
| Mistral 8x7B v0.2 | Mistral | ●●●○ Good | 27 |
| GPT-5.4 mini | OpenAI | ●●●○ Good | 66 |
| o3-mini | OpenAI | ●●●○ Good | 65 |
| GPT-4.1 | OpenAI | ●●●○ Good | 64 |
| o1 | OpenAI | ●●●○ Good | 64 |
| Phi-4 | Microsoft | ●●●○ Good | 40 |
| Qwen3.5-122B-A10B | Alibaba | ●●●○ Good | 71 |
| Qwen3.5-27B | Alibaba | ●●●○ Good | 70 |
| Qwen3.5-35B-A3B | Alibaba | ●●●○ Good | 66 |
| GPT-4.1 mini | OpenAI | ●●●○ Good | 57 |
| GPT-4.1 nano | OpenAI | ●●●○ Good | 44 |
| Sarvam 105B | Sarvam | ●●●○ Good | 60 |
| GPT-4o mini | OpenAI | ●●●○ Good | 54 |
| o1-pro | OpenAI | ●●●○ Good | 45 |
| DBRX Instruct | Databricks | ●●●○ Good | 41 |
| Mixtral 8x22B Instruct v0.1 | Mistral | ●●●○ Good | 36 |
| DeepSeek V3 | DeepSeek | ●●○○ Moderate | 49 |
| Gemma 4 31B | | ●●○○ Moderate | 73 |
| Gemma 4 26B A4B | | ●●○○ Moderate | 64 |
Verified = sourced from public evaluations. Generated = inferred from related models (25% scoring discount). Coverage = percentage of benchmarks that are verified.
Score confidence (1-4 dots) indicates how much verified benchmark data supports a model's overall score. A 4-dot score is backed by 20+ verified benchmarks across 7+ categories. A 1-dot score relies on limited verified data and may include estimated values. The confidence system helps you distinguish between well-tested models and those with sparse coverage.
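For readers who want the thresholds in one place, here is a minimal mapping from verified coverage to dot count, assuming the legend above is the complete rule set.

```python
# Minimal mapping from verified coverage to the 1-4 dot confidence level,
# assuming the thresholds in the legend above are the complete rule.

def confidence_dots(verified_benchmarks: int, categories: int) -> int:
    if verified_benchmarks >= 20 and categories >= 7:
        return 4  # ●●●● High
    if verified_benchmarks >= 12 and categories >= 5:
        return 3  # ●●●○ Good
    if verified_benchmarks >= 8 and categories >= 3:
        return 2  # ●●○○ Moderate
    return 1      # ●○○○ Low / Estimated

print(confidence_dots(verified_benchmarks=23, categories=8))  # -> 4
```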
Scores marked with "Est." or "~" are derived from limited verified data, often supplemented by generated (inferred) benchmark values. Generated values receive a 25% discount in the scoring formula. While these estimates are directionally useful, they should not be treated as authoritative rankings until more verified data becomes available.
BenchLM tracks two key signals: (1) benchmark provenance — whether each score comes from a verified public source ("manual") or was generated/inferred from related data, and (2) benchmark freshness — older benchmarks that haven't been updated are more likely to have been contaminated through training data inclusion. Models with mostly generated data or stale benchmarks receive lower confidence ratings.
Provenance tracks the origin of each benchmark score. "Manual" scores were sourced from published evaluations — papers, official model cards, or trusted third-party benchmarks. "Generated" scores were inferred from related models or interpolated. Only manual (verified) scores count toward ranking eligibility. A model needs at least 8 verified benchmarks across 2+ categories to be ranking-eligible.
Fresh, held-out benchmarks like SWE-Rebench (rolling window), Terminal-Bench 2.0, and HLE are the hardest to game. Older, saturated benchmarks like MMLU (where top models all score 97-99%) provide little signal. BenchLM weights newer, harder benchmarks more heavily and flags saturated ones as display-only.
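One way to picture this policy is as a per-benchmark configuration in which saturated benchmarks carry a display-only flag and fresher, harder ones carry more weight. The specific weight values below are invented for illustration; only the benchmark names and display-only flags follow from the text above.

```python
# Illustrative per-benchmark policy. The weight values are invented;
# only the names and display-only flags follow from the text above.
BENCHMARK_POLICY = {
    "SWE-Rebench":        {"weight": 1.5,  "display_only": False},  # rolling window, hard to game
    "Terminal-Bench 2.0": {"weight": 1.5,  "display_only": False},
    "HLE":                {"weight": 1.25, "display_only": False},
    "MMLU":               {"weight": 0.0,  "display_only": True},   # saturated: shown for context only
    "BBH":                {"weight": 0.0,  "display_only": True},
    "HumanEval":          {"weight": 0.0,  "display_only": True},
}

def scoring_weight(benchmark: str) -> float:
    """Display-only benchmarks contribute nothing to the overall score."""
    policy = BENCHMARK_POLICY.get(benchmark, {"weight": 1.0, "display_only": False})
    return 0.0 if policy["display_only"] else policy["weight"]
```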