
Benchmark Confidence & Contamination Flags

Not all benchmark scores are equally trustworthy. BenchLM now separates verified ranking from provisional ranking while still tracking the provenance of every stored score. The confidence indicator (1-4 dots) shows how much sourced benchmark coverage supports each model's score.

●●●● High: 7+ categories, 20+ non-generated benchmarks
●●●○ Good: 5+ categories, 12+ non-generated benchmarks
●●○○ Moderate: 3+ categories, 8+ non-generated benchmarks
●○○○ Low / Estimated: limited sourced data, score is estimated
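The tier thresholds above can be read as a simple lookup. The sketch below is an illustration, not BenchLM's actual code: the function names are hypothetical, and it assumes both the category and the benchmark threshold must be met for a tier to apply.

```python
# Thresholds copied from the tier list: (dots, label, min categories, min non-generated benchmarks).
TIERS = [
    (4, "High", 7, 20),
    (3, "Good", 5, 12),
    (2, "Moderate", 3, 8),
]


def confidence_tier(categories: int, non_generated_benchmarks: int) -> tuple[int, str]:
    """Return (dots, label) for the highest tier whose thresholds are both met.

    Assumption: a tier requires BOTH its category and benchmark minimums.
    """
    for dots, label, min_categories, min_benchmarks in TIERS:
        if categories >= min_categories and non_generated_benchmarks >= min_benchmarks:
            return dots, label
    return 1, "Low / Estimated"


def render_dots(dots: int, total: int = 4) -> str:
    """Render the indicator as filled and empty circles."""
    return "●" * dots + "○" * (total - dots)


dots, label = confidence_tier(categories=6, non_generated_benchmarks=14)
print(render_dots(dots), label)
```

Under this reading, a model with many benchmarks concentrated in two categories would still land in the lowest tier, because breadth across categories is part of every threshold.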

Confidence Distribution (Ranked Models)

High: 8 models (6%)
Good: 15 models (12%)
Moderate: 18 models (15%)
Low / Estimated: 83 models (67%)
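As a quick check on the distribution, the percentages follow from the four counts if each share is rounded to the nearest whole percent. The rounding rule is an assumption; the counts are taken from the figures above.

```python
# Counts per confidence tier, as listed in the distribution.
counts = {"High": 8, "Good": 15, "Moderate": 18, "Low / Estimated": 83}

total = sum(counts.values())  # total number of ranked models

# Share of each tier, rounded to the nearest whole percent (assumed rounding rule).
shares = {label: round(100 * n / total) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {counts[label]} of {total} ({pct}%)")
```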

How BenchLM Scores Work

Verified, provisional, and generated

Each benchmark value is tagged as manual (a hand-entered public row) or generated (inferred from related models). Generated rows are excluded from all public ranking logic. Manual rows are further split into sourced rows, which feed the verified leaderboard, and source-unverified rows, which can still appear in provisional mode.
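One way to picture the split is as a filter over tagged rows. This is a minimal sketch under stated assumptions: the `BenchmarkRow` type, the `source_url` field, and `rows_for_lane` are hypothetical names, and the scores and URL are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BenchmarkRow:
    benchmark: str
    score: float
    provenance: str                    # "manual" or "generated", as described in the text
    source_url: Optional[str] = None   # hypothetical: set when an exact source is attached


def rows_for_lane(rows: List[BenchmarkRow], lane: str) -> List[BenchmarkRow]:
    """Return the rows a given leaderboard lane may use."""
    manual = [r for r in rows if r.provenance == "manual"]  # generated rows never rank
    if lane == "provisional":
        return manual
    if lane == "verified":
        return [r for r in manual if r.source_url is not None]
    raise ValueError(f"unknown lane: {lane!r}")


# Illustrative rows only; the scores and URL are invented.
rows = [
    BenchmarkRow("HLE", 41.0, "manual", "https://example.com/hle-report"),
    BenchmarkRow("Terminal-Bench 2.0", 58.5, "manual"),
    BenchmarkRow("SWE-Rebench", 47.2, "generated"),
]
provisional = [r.benchmark for r in rows_for_lane(rows, "provisional")]
verified = [r.benchmark for r in rows_for_lane(rows, "verified")]
print(provisional, verified)
```

Here the generated row is dropped from both lanes, and the manual row without a source appears only in the provisional lane.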

Ranking Eligibility

A model must have at least 8 qualifying benchmarks across 2+ categories to rank in a lane. The provisional leaderboard uses rankable non-generated rows; the verified leaderboard uses sourced rows only. Models below the threshold are shown as tracked but unranked.
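The eligibility rule can be sketched as a count over rows that already qualify for the lane in question. The function name and the `(benchmark, category)` pair representation are assumptions made for this example.

```python
from typing import Iterable, Tuple

MIN_BENCHMARKS = 8   # "at least 8 qualifying benchmarks"
MIN_CATEGORIES = 2   # "across 2+ categories"


def ranking_status(qualifying_rows: Iterable[Tuple[str, str]]) -> str:
    """Classify a model from its (benchmark, category) pairs for one lane.

    The rows are assumed to be pre-filtered for the lane (rankable
    non-generated rows for provisional, sourced rows for verified).
    """
    pairs = list(qualifying_rows)
    benchmarks = {name for name, _ in pairs}
    categories = {category for _, category in pairs}
    if len(benchmarks) >= MIN_BENCHMARKS and len(categories) >= MIN_CATEGORIES:
        return "ranked"
    return "tracked but unranked"


# Eight benchmarks split across two categories meet the threshold.
spread = [(f"bench-{i}", "coding" if i < 4 else "math") for i in range(8)]
# Eight benchmarks in a single category do not.
narrow = [(f"bench-{i}", "coding") for i in range(8)]
print(ranking_status(spread), "/", ranking_status(narrow))
```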

Category Eligibility

For category leaderboards, a model needs qualifying scores on at least half of the weighted benchmarks in that category. BenchLM computes this separately for provisional and verified ranking so sparse exact-source coverage cannot silently borrow strength from provisional rows.
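The "at least half" rule, evaluated separately per lane, might look like the following. This is a sketch with hypothetical names and benchmark labels; it assumes coverage is counted by number of weighted benchmarks rather than by their weights.

```python
from typing import Set


def category_eligible(weighted_benchmarks: Set[str], scored_benchmarks: Set[str]) -> bool:
    """True if the model has scores on at least half of the category's weighted benchmarks."""
    if not weighted_benchmarks:
        return False
    covered = weighted_benchmarks & scored_benchmarks
    # Compare 2 * covered against the total to avoid fractional halves.
    return 2 * len(covered) >= len(weighted_benchmarks)


# Hypothetical category with four weighted benchmarks.
weighted = {"bench-a", "bench-b", "bench-c", "bench-d"}

# Each lane is checked against its own qualifying rows, so sourced coverage
# cannot borrow from provisional-only rows.
provisional_scores = {"bench-a", "bench-b"}
verified_scores = {"bench-a"}

eligibility = {
    "provisional": category_eligible(weighted, provisional_scores),
    "verified": category_eligible(weighted, verified_scores),
}
print(eligibility)
```

In this example the model would appear on the provisional category leaderboard but not on the verified one.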

Display-Only Benchmarks

Some benchmarks (MMLU, BBH, HumanEval, older AIME/HMMT variants) are shown for context but don't affect scoring. These are either saturated (top models all score 97%+) or have been superseded by harder versions.
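The saturation criterion can be expressed as a threshold check on the top models' scores. This sketch is an assumption about how such a flag could be computed: the function name is hypothetical, the text does not say how many models count as "top", and the example scores are invented.

```python
from typing import Iterable

SATURATION_THRESHOLD = 97.0  # "top models all score 97%+"


def is_saturated(top_model_scores: Iterable[float], threshold: float = SATURATION_THRESHOLD) -> bool:
    """True if every supplied top-model score is at or above the threshold."""
    scores = list(top_model_scores)
    return bool(scores) and min(scores) >= threshold


# Invented scores for illustration only.
print(is_saturated([98.1, 97.4, 99.0]))  # every score clears the threshold
print(is_saturated([98.1, 92.3, 99.0]))  # one score falls short
```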

Model | Provider | Confidence | Prov. score
Qwen3.7 Plus | Alibaba | ●●●● High | 85
Claude Opus 4.5 | Anthropic | ●●●● High | 75
Kimi K2.5 | Moonshot AI | ●●●● High | 63
Qwen3.6 Plus | Alibaba | ●●●● High | 66
Qwen3.5 397B | Alibaba | ●●●● High | 62
GLM-5 | Z.AI | ●●●● High | 66
Claude Opus 4.6 | Anthropic | ●●●● High | 86
GPT-5.4 | OpenAI | ●●●● High | 87
Qwen3.7 Max | Alibaba | ●●●○ Good | 90
GPT-5.5 | OpenAI | ●●●○ Good | 87
Gemini 3.5 Flash | Google | ●●●○ Good | 85
Claude Opus 4.7 (Adaptive) | Anthropic | ●●●○ Good | 83
Nemotron 3 Ultra | NVIDIA | ●●●○ Good | 67
Claude Mythos 5 | Anthropic | ●●●○ Good | 99
Claude Fable 5 | Anthropic | ●●●○ Good | 95
Gemini 3.1 Pro | Google | ●●●○ Good | 88
GLM-5.1 | Z.AI | ●●●○ Good | 74
Grok 4.20 | xAI | ●●●○ Good | 70
Claude Sonnet 4.6 | Anthropic | ●●●○ Good | 80
MAI-Thinking-1 | Microsoft | ●●●○ Good | 65
Qwen3.5-122B-A10B | Alibaba | ●●●○ Good | 63
Qwen3.5-27B | Alibaba | ●●●○ Good | 61
Qwen3.5-35B-A3B | Alibaba | ●●●○ Good | 55
Qwen3.6-35B-A3B | Alibaba | ●●○○ Moderate | 62
Qwen3.6-27B | Alibaba | ●●○○ Moderate | 71
Kimi K2.6 | Moonshot AI | ●●○○ Moderate | 80
DeepSeek V4 Pro (Max) | DeepSeek | ●●○○ Moderate | 87
Claude Opus 4.8 | Anthropic | ●●○○ Moderate | 92
DeepSeek V4 Pro (High) | DeepSeek | ●●○○ Moderate | 82
DeepSeek V4 Flash (Max) | DeepSeek | ●●○○ Moderate | 76
DeepSeek V4 Flash (High) | DeepSeek | ●●○○ Moderate | 70
DeepSeek V4 Pro | DeepSeek | ●●○○ Moderate | 68
DeepSeek V4 Flash | DeepSeek | ●●○○ Moderate | 57
MiniMax M2.7 | MiniMax | ●●○○ Moderate | 52
MiniMax M3 | MiniMax | ●●○○ Moderate | 78
GLM-5.2 | Z.AI | ●●○○ Moderate | 90
MiniCPM5-1B | OpenBMB | ●●○○ Moderate | 25
GPT-5.2 | OpenAI | ●●○○ Moderate | 78
GPT-5.4 Pro | OpenAI | ●●○○ Moderate | 90
Gemini 3 Pro | Google | ●●○○ Moderate | 80
Kimi K2.5 (Reasoning) | Moonshot AI | ●●○○ Moderate | 75
GLM-4.7 | Z.AI | ●○○○ Low / Estimated | ~68
GPT-5.3 Codex | OpenAI | ●○○○ Low / Estimated | ~85
Claude Sonnet 4.5 | Anthropic | ●○○○ Low / Estimated | ~64
o3-mini | OpenAI | ●○○○ Low / Estimated | ~55
DeepSeek V3.2 | DeepSeek | ●○○○ Low / Estimated | ~56
GPT-4.1 | OpenAI | ●○○○ Low / Estimated | ~56
GPT-4.1 mini | OpenAI | ●○○○ Low / Estimated | ~45
Qwen3 235B 2507 | Alibaba | ●○○○ Low / Estimated | ~32
Gemini 2.5 Pro | Google | ●○○○ Low / Estimated | ~63
o1 | OpenAI | ●○○○ Low / Estimated | ~56
GPT-4.1 nano | OpenAI | ●○○○ Low / Estimated | ~26
Gemini 3 Flash | Google | ●○○○ Low / Estimated | ~55
Gemini 3.1 Flash-Lite | Google | ●○○○ Low / Estimated | ~47
Gemini 3 Pro Deep Think | Google | ●○○○ Low / Estimated | ~89
GLM-5 (Reasoning) | Z.AI | ●○○○ Low / Estimated | ~79
GPT-5.1 | OpenAI | ●○○○ Low / Estimated | ~77
GPT-5.2-Codex | OpenAI | ●○○○ Low / Estimated | ~76
GPT-5 (high) | OpenAI | ●○○○ Low / Estimated | ~75
GPT-5.1-Codex-Max | OpenAI | ●○○○ Low / Estimated | ~75
Grok 4 | xAI | ●○○○ Low / Estimated | ~63
DeepSeek V3.2 (Thinking) | DeepSeek | ●○○○ Low / Estimated | ~60
MiMo-V2-Flash | Xiaomi | ●○○○ Low / Estimated | ~59
Claude Haiku 4.5 | Anthropic | ●○○○ Low / Estimated | ~56
Claude 4.1 Opus | Anthropic | ●○○○ Low / Estimated | ~51
Claude 4 Sonnet | Anthropic | ●○○○ Low / Estimated | ~50
Nemotron 3 Super 100B | NVIDIA | ●○○○ Low / Estimated | ~43
GPT-OSS 120B | OpenAI | ●○○○ Low / Estimated | ~34
GPT-OSS 20B | OpenAI | ●○○○ Low / Estimated | ~16
Grok 4.1 | xAI | ●○○○ Low / Estimated | ~89
o1-preview | OpenAI | ●○○○ Low / Estimated | ~82
Qwen3.5 397B (Reasoning) | Alibaba | ●○○○ Low / Estimated | ~76
GPT-5 (medium) | OpenAI | ●○○○ Low / Estimated | ~70
Grok 4.1 Fast | xAI | ●○○○ Low / Estimated | ~68
o3-pro | OpenAI | ●○○○ Low / Estimated | ~57
o3 | OpenAI | ●○○○ Low / Estimated | ~56
DeepSeek LLM 2.0 | DeepSeek | ●○○○ Low / Estimated | ~50
DeepSeek Coder 2.0 | DeepSeek | ●○○○ Low / Estimated | ~50
Qwen2.5-1M | Alibaba | ●○○○ Low / Estimated | ~50
GPT-4o mini | OpenAI | ●○○○ Low / Estimated | ~49
Qwen2.5-72B | Alibaba | ●○○○ Low / Estimated | ~49
DeepSeekMath V2 | DeepSeek | ●○○○ Low / Estimated | ~49
Mistral Large 3 | Mistral | ●○○○ Low / Estimated | ~48
Qwen3 235B 2507 (Reasoning) | Alibaba | ●○○○ Low / Estimated | ~45
o4-mini (high) | OpenAI | ●○○○ Low / Estimated | ~43
Claude 4.1 Opus Thinking | Anthropic | ●○○○ Low / Estimated | ~43
GPT-4o | OpenAI | ●○○○ Low / Estimated | ~42
Kimi K2 | Moonshot AI | ●○○○ Low / Estimated | ~41
Llama 3.1 405B | Meta | ●○○○ Low / Estimated | ~40
Claude 3.5 Sonnet | Anthropic | ●○○○ Low / Estimated | ~40
Grok Code Fast 1 | xAI | ●○○○ Low / Estimated | ~39
Sarvam 105B | Sarvam | ●○○○ Low / Estimated | ~39
Mistral Large 2 | Mistral | ●○○○ Low / Estimated | ~38
Gemini 2.5 Flash | Google | ●○○○ Low / Estimated | ~37
DeepSeek V3 | DeepSeek | ●○○○ Low / Estimated | ~35
Gemini 1.5 Pro | Google | ●○○○ Low / Estimated | ~35
Claude 3 Opus | Anthropic | ●○○○ Low / Estimated | ~34
DeepSeek-R1 | DeepSeek | ●○○○ Low / Estimated | ~32
DBRX Instruct | Databricks | ●○○○ Low / Estimated | ~32
Grok 3 [Beta] | xAI | ●○○○ Low / Estimated | ~30
DeepSeek V3.1 (Reasoning) | DeepSeek | ●○○○ Low / Estimated | ~29
o1-pro | OpenAI | ●○○○ Low / Estimated | ~28
Phi-4 | Microsoft | ●○○○ Low / Estimated | ~27
Llama 4 Scout | Meta | ●○○○ Low / Estimated | ~26
Llama 3 70B | Meta | ●○○○ Low / Estimated | ~26
DeepSeek V3.1 | DeepSeek | ●○○○ Low / Estimated | ~25
GLM-4.5 | Z.AI | ●○○○ Low / Estimated | ~25
Nemotron 3 Nano 30B | NVIDIA | ●○○○ Low / Estimated | ~25
GPT-4 Turbo | OpenAI | ●○○○ Low / Estimated | ~25
Z-1 | Z | ●○○○ Low / Estimated | ~24
Mistral 8x7B | Mistral | ●○○○ Low / Estimated | ~24
Gemini 1.0 Pro | Google | ●○○○ Low / Estimated | ~24
Claude 3 Haiku | Anthropic | ●○○○ Low / Estimated | ~23
Mixtral 8x22B Instruct v0.1 | Mistral | ●○○○ Low / Estimated | ~22
Nemotron-4 15B | NVIDIA | ●○○○ Low / Estimated | ~22
Moonshot v1 | Moonshot AI | ●○○○ Low / Estimated | ~22
Nemotron Ultra 253B | NVIDIA | ●○○○ Low / Estimated | ~22
GLM-4.5-Air | Z.AI | ●○○○ Low / Estimated | ~19
Llama 4 Maverick | Meta | ●○○○ Low / Estimated | ~17
Gemma 3 27B | Google | ●○○○ Low / Estimated | ~16
Llama 4 Behemoth | Meta | ●○○○ Low / Estimated | ~12
Nova Pro | Amazon | ●○○○ Low / Estimated | ~10
Mistral 7B v0.3 | Mistral | ●○○○ Low / Estimated | ~4
Mistral 8x7B v0.2 | Mistral | ●○○○ Low / Estimated | ~2

Sourced = exact-source benchmark coverage. Rankable = non-generated benchmark coverage used by the provisional leaderboard. Generated = inferred from related models and excluded from ranking. Coverage = sourced share of the visible benchmark footprint.

Frequently Asked Questions

What is benchmark confidence on BenchLM?

Score confidence (1-4 dots) indicates how much sourced benchmark data supports a model's score. A 4-dot score is backed by 20+ sourced benchmark rows across 7+ categories. A 1-dot score relies on limited sourced coverage, and the provisional leaderboard may still include source-unverified non-generated rows. The confidence system helps you distinguish between well-tested models and those with sparse coverage.

What does "estimated" mean on BenchLM scores?

Scores marked with "Est." or "~" are derived from limited sourced data. Generated rows are excluded from ranking inputs, but the provisional leaderboard may still rely on source-unverified non-generated public rows until exact citations are attached. The verified leaderboard avoids that by using sourced rows only.

How does BenchLM detect contamination risk?

BenchLM tracks two key signals: (1) benchmark provenance, meaning whether each score is a hand-entered public row ("manual") or was generated or inferred from related data, and (2) benchmark freshness, since older benchmarks that have not been updated are more likely to have been contaminated through inclusion in training data. Models with mostly generated data or stale benchmarks receive lower confidence ratings. Exact-source verification is tracked separately from this manual-vs-generated split.

What is benchmark provenance?

Provenance tracks the origin of each benchmark score. "Manual" scores are hand-entered public rows from BenchLM's dataset work. "Generated" scores were inferred from related models or interpolated. BenchLM now distinguishes provisional ranking, which can use non-generated manual rows, from verified ranking, which only uses exact-source-attached rows.

Which LLM benchmarks are most reliable?

Fresh, held-out benchmarks like SWE-Rebench (rolling window), Terminal-Bench 2.0, and HLE are the hardest to game. Older, saturated benchmarks like MMLU (where top models all score 97-99%) provide little signal. BenchLM weights newer, harder benchmarks more heavily and flags saturated ones as display-only.
