Benchmark Confidence & Contamination Flags

Not all benchmark scores are equally trustworthy. BenchLM tracks the provenance of every score — whether it comes from a verified public source or was inferred from related data. The confidence indicator (1-4 dots) shows how much verified data supports each model's overall score.

●●●● High: 7+ categories, 20+ verified benchmarks
●●●○ Good: 5+ categories, 12+ verified benchmarks
●●○○ Moderate: 3+ categories, 8+ verified benchmarks
●○○○ Low / Estimated: limited verified data; score is estimated
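
Expressed as a quick sketch, the published thresholds above map onto tiers as follows. The function name and signature are illustrative; BenchLM's internal implementation may weigh additional signals.

```python
def confidence_tier(verified_benchmarks: int, categories: int) -> str:
    """Map verified coverage to a confidence tier.

    Sketch of the published thresholds; illustrative only.
    """
    if categories >= 7 and verified_benchmarks >= 20:
        return "High"         # ●●●●
    if categories >= 5 and verified_benchmarks >= 12:
        return "Good"         # ●●●○
    if categories >= 3 and verified_benchmarks >= 8:
        return "Moderate"     # ●●○○
    return "Low / Estimated"  # ●○○○
```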

Confidence Distribution (Ranked Models)

High: 78 models (75%)
Good: 23 models (22%)
Moderate: 3 models (3%)
Low / Estimated: 0 models (0%)

How BenchLM Scores Work

Verified vs Generated Scores

Each benchmark value is tagged as manual (verified from public sources) or generated (inferred from related models). Generated values receive a 25% discount in the overall score calculation to prevent models with mostly inferred data from outranking those with solid verified coverage.
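
The exact aggregation formula isn't published. A minimal sketch, assuming a weighted average in which a generated value contributes only 75% of its score (the record fields and function name below are hypothetical):

```python
GENERATED_DISCOUNT = 0.75  # generated values contribute at 75% strength

def overall_score(benchmarks: list[dict]) -> float:
    """Weighted average with the 25% discount on generated values.

    benchmarks: [{"score": float, "weight": float,
                  "provenance": "manual" | "generated"}, ...]
    Display-only benchmarks are assumed to be filtered out already.
    """
    num = den = 0.0
    for b in benchmarks:
        factor = GENERATED_DISCOUNT if b["provenance"] == "generated" else 1.0
        num += b["weight"] * b["score"] * factor
        den += b["weight"]
    return num / den if den else 0.0
```

Note that the discount applies to the contribution rather than the weight: that is what keeps a model with mostly generated values below an otherwise-equal model with verified coverage.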

Ranking Eligibility

A model must have at least 8 verified benchmarks across 2+ categories to receive a global ranking. Models below this threshold are shown as "Tracked" with their available scores visible but not ranked.
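
A sketch of that gate, using the same hypothetical record format as above:

```python
def is_ranking_eligible(benchmarks: list[dict]) -> bool:
    """Global ranking gate: 8+ verified benchmarks across 2+ categories."""
    verified = [b for b in benchmarks if b["provenance"] == "manual"]
    categories = {b["category"] for b in verified}
    return len(verified) >= 8 and len(categories) >= 2
```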

Category Eligibility

For category leaderboards, a model needs verified scores on at least half of the weighted benchmarks in that category. This prevents a model with a single strong benchmark from appearing at the top of a category.
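
In the same sketch style, the category gate reduces to a coverage check on the at-least-half rule (names illustrative):

```python
def is_category_eligible(verified_names: set[str],
                         category_benchmarks: list[str]) -> bool:
    """Category gate: verified scores on at least half of the
    category's weighted benchmarks."""
    covered = sum(1 for name in category_benchmarks if name in verified_names)
    return 2 * covered >= len(category_benchmarks)
```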

Display-Only Benchmarks

Some benchmarks (MMLU, BBH, HumanEval, older AIME/HMMT variants) are shown for context but don't affect scoring. These are either saturated (top models all score 97%+) or have been superseded by harder versions.

Model | Organization | Confidence | Score
Kimi K2.5 | Moonshot AI | ●●●● High | 72
Claude Opus 4.5 | Anthropic | ●●●● High | 76
Qwen3.5 397B | Alibaba | ●●●● High | 68
GLM-5 | Zhipu AI | ●●●● High | 75
Qwen3.6 Plus | Alibaba | ●●●● High | 69
Claude Opus 4.6 | Anthropic | ●●●● High | 85
GPT-5.2 | OpenAI | ●●●● High | 82
GPT-5.4 | OpenAI | ●●●● High | 82
Gemini 3.1 Pro | Google | ●●●● High | 87
Kimi K2.5 (Reasoning) | Moonshot AI | ●●●● High | 76
Grok 4 | xAI | ●●●● High | 68
Gemini 3 Flash | Google | ●●●● High | 67
GPT-OSS 120B | OpenAI | ●●●● High | 50
GPT-5.3 Codex | OpenAI | ●●●● High | 85
GPT-5 (high) | OpenAI | ●●●● High | 82
GPT-5.2-Codex | OpenAI | ●●●● High | 82
GPT-5.1 | OpenAI | ●●●● High | 78
DeepSeek V3.2 (Thinking) | DeepSeek | ●●●● High | 67
MiMo-V2-Flash | Xiaomi | ●●●● High | 67
Gemini 2.5 Pro | Google | ●●●● High | 65
Claude Haiku 4.5 | Anthropic | ●●●● High | 63
o4-mini (high) | OpenAI | ●●●● High | 58
Claude 3.5 Sonnet | Anthropic | ●●●● High | 55
DeepSeek-R1 | DeepSeek | ●●●● High | 45
GPT-5.4 Pro | OpenAI | ●●●● High | 92
GPT-5.1-Codex-Max | OpenAI | ●●●● High | 81
GPT-5 (medium) | OpenAI | ●●●● High | 76
o1-preview | OpenAI | ●●●● High | 72
Grok 4.1 Fast | xAI | ●●●● High | 70
o3-pro | OpenAI | ●●●● High | 67
DeepSeek Coder 2.0 | DeepSeek | ●●●● High | 62
Qwen2.5-1M | Alibaba | ●●●● High | 62
Claude 4.1 Opus | Anthropic | ●●●● High | 62
Claude 4 Sonnet | Anthropic | ●●●● High | 62
Qwen2.5-72B | Alibaba | ●●●● High | 60
Gemini 3.1 Flash-Lite | Google | ●●●● High | 56
Grok Code Fast 1 | xAI | ●●●● High | 56
Mistral Large 2 | Mistral | ●●●● High | 52
Claude 3 Opus | Anthropic | ●●●● High | 49
Qwen3 235B 2507 | Alibaba | ●●●● High | 47
Llama 3 70B | Meta | ●●●● High | 44
Mistral 8x7B | Mistral | ●●●● High | 44
Moonshot v1 | Moonshot AI | ●●●● High | 43
Claude 3 Haiku | Anthropic | ●●●● High | 43
DeepSeek V3.1 | DeepSeek | ●●●● High | 41
Nemotron Ultra 253B | NVIDIA | ●●●● High | 41
Llama 4 Maverick | Meta | ●●●● High | 39
GLM-4.5-Air | Zhipu AI | ●●●● High | 38
Gemma 3 27B | Google | ●●●● High | 35
Qwen3.5 397B (Reasoning) | Alibaba | ●●●● High | 77
o3 | OpenAI | ●●●● High | 64
Qwen3 235B 2507 (Reasoning) | Alibaba | ●●●● High | 55
Z-1 | Z | ●●●● High | 44
Mistral 7B v0.3 | Mistral | ●●●● High | 29
Gemini 3 Pro Deep Think | Google | ●●●● High | 80
Gemini 2.5 Flash | Google | ●●●● High | 50
Nemotron 3 Nano 30B | NVIDIA | ●●●● High | 42
Grok 4.1 | xAI | ●●●● High | 85
Claude Sonnet 4.5 | Anthropic | ●●●● High | 68
Mistral Large 3 | Mistral | ●●●● High | 58
Nemotron 3 Super 100B | NVIDIA | ●●●● High | 56
DeepSeek V3.1 (Reasoning) | DeepSeek | ●●●● High | 43
DeepSeekMath V2 | DeepSeek | ●●●● High | 63
Grok 3 [Beta] | xAI | ●●●● High | 48
GLM-4.5 | Zhipu AI | ●●●● High | 40
Gemini 1.0 Pro | Google | ●●●● High | 40
GPT-OSS 20B | OpenAI | ●●●● High | 36
GLM-4.7 | Zhipu AI | ●●●● High | 74
DeepSeek V3.2 | DeepSeek | ●●●● High | 61
Nova Pro | Amazon | ●●●● High | 33
GLM-5 (Reasoning) | Zhipu AI | ●●●● High | 82
Nemotron 3 Ultra 500B | NVIDIA | ●●●● High | 60
GPT-4o | OpenAI | ●●●● High | 50
GPT-4 Turbo | OpenAI | ●●●● High | 43
Llama 4 Behemoth | Meta | ●●●● High | 34
Nemotron-4 15B | NVIDIA | ●●●● High | 42
Gemini 1.5 Pro | Google | ●●●● High | 50
Llama 3.1 405B | Meta | ●●●● High | 53
Gemini 3 Pro | Google | ●●●○ Good | 79
DeepSeek LLM 2.0 | DeepSeek | ●●●○ Good | 57
Claude 4.1 Opus Thinking | Anthropic | ●●●○ Good | 57
Claude Sonnet 4.6 | Anthropic | ●●●○ Good | 84
Kimi K2 | Moonshot AI | ●●●○ Good | 53
GPT-5.4 nano | OpenAI | ●●●○ Good | 58
Llama 4 Scout | Meta | ●●●○ Good | 44
Mistral 8x7B v0.2 | Mistral | ●●●○ Good | 27
GPT-5.4 mini | OpenAI | ●●●○ Good | 66
o3-mini | OpenAI | ●●●○ Good | 65
GPT-4.1 | OpenAI | ●●●○ Good | 64
o1 | OpenAI | ●●●○ Good | 64
Phi-4 | Microsoft | ●●●○ Good | 40
Qwen3.5-122B-A10B | Alibaba | ●●●○ Good | 71
Qwen3.5-27B | Alibaba | ●●●○ Good | 70
Qwen3.5-35B-A3B | Alibaba | ●●●○ Good | 66
GPT-4.1 mini | OpenAI | ●●●○ Good | 57
GPT-4.1 nano | OpenAI | ●●●○ Good | 44
Sarvam 105B | Sarvam | ●●●○ Good | 60
GPT-4o mini | OpenAI | ●●●○ Good | 54
o1-pro | OpenAI | ●●●○ Good | 45
DBRX Instruct | Databricks | ●●●○ Good | 41
Mixtral 8x22B Instruct v0.1 | Mistral | ●●●○ Good | 36
DeepSeek V3 | DeepSeek | ●●○○ Moderate | 49
Gemma 4 31B | Google | ●●○○ Moderate | 73
Gemma 4 26B A4B | Google | ●●○○ Moderate | 64

Verified = sourced from public evaluations. Generated = inferred from related models (25% scoring discount). Coverage = percentage of benchmarks that are verified.

Frequently Asked Questions

What is benchmark confidence on BenchLM?

Score confidence (1-4 dots) indicates how much verified benchmark data supports a model's overall score. A 4-dot score is backed by 20+ verified benchmarks across 7+ categories. A 1-dot score relies on limited verified data and may include estimated values. The confidence system helps you distinguish between well-tested models and those with sparse coverage.

What does "estimated" mean on BenchLM scores?

Scores marked with "Est." or "~" are derived from limited verified data, often supplemented by generated (inferred) benchmark values. Generated values receive a 25% discount in the scoring formula. While these estimates are directionally useful, they should not be treated as authoritative rankings until more verified data becomes available.

How does BenchLM detect contamination risk?

BenchLM tracks two key signals: (1) benchmark provenance — whether each score comes from a verified public source ("manual") or was generated/inferred from related data, and (2) benchmark freshness — older benchmarks that haven't been updated are more likely to have been contaminated through training data inclusion. Models with mostly generated data or stale benchmarks receive lower confidence ratings.
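
Neither signal's exact cutoffs are published, so both thresholds in this sketch are assumptions chosen only to illustrate the two checks:

```python
from datetime import date

STALE_AFTER_DAYS = 365       # assumed freshness cutoff, not BenchLM's
MIN_VERIFIED_FRACTION = 0.5  # assumed provenance cutoff, not BenchLM's

def contamination_flags(benchmarks: list[dict], today: date) -> list[str]:
    """Flag the two signals: provenance mix and benchmark freshness.

    Each record is assumed to carry "provenance" and an "updated" date.
    """
    flags = []
    verified = [b for b in benchmarks if b["provenance"] == "manual"]
    if benchmarks and len(verified) / len(benchmarks) < MIN_VERIFIED_FRACTION:
        flags.append("mostly generated data")
    stale = [b for b in verified
             if (today - b["updated"]).days > STALE_AFTER_DAYS]
    if stale:
        flags.append(f"{len(stale)} stale benchmark(s)")
    return flags
```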

What is benchmark provenance?

Provenance tracks the origin of each benchmark score. "Manual" scores were sourced from published evaluations — papers, official model cards, or trusted third-party benchmarks. "Generated" scores were inferred from related models or interpolated. Only manual (verified) scores count toward ranking eligibility. A model needs at least 8 verified benchmarks across 2+ categories to be ranking-eligible.

Which LLM benchmarks are most reliable?

Fresh, held-out benchmarks like SWE-Rebench (rolling window), Terminal-Bench 2.0, and HLE are the hardest to game. Older, saturated benchmarks like MMLU (where top models all score 97-99%) provide little signal. BenchLM weights newer, harder benchmarks more heavily and flags saturated ones as display-only.
