Benchmark Confidence & Contamination Flags

Not all benchmark scores are equally trustworthy. BenchLM now separates verified ranking from provisional ranking while still tracking the provenance of every stored score. The confidence indicator (1-4 dots) shows how much sourced benchmark coverage supports each model's score.

●●●● High: 7+ categories, 20+ non-generated benchmarks

●●●○ Good: 5+ categories, 12+ non-generated benchmarks

●●○○ Moderate: 3+ categories, 8+ non-generated benchmarks

●○○○ Low / Estimated: limited sourced data, score is estimated
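The tier thresholds above can be read as a simple lookup. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes that both the category minimum and the benchmark minimum must be met for a tier, which the page does not state explicitly.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    dots: str
    label: str


# (min_categories, min_benchmarks, tier), checked from strictest to loosest.
# The numbers come from the legend; the AND logic is an assumption.
THRESHOLDS = [
    (7, 20, Tier("●●●●", "High")),
    (5, 12, Tier("●●●○", "Good")),
    (3, 8, Tier("●●○○", "Moderate")),
]
LOW = Tier("●○○○", "Low / Estimated")


def confidence_tier(categories: int, benchmarks: int) -> Tier:
    """Return the first tier whose two minimums are both satisfied."""
    for min_categories, min_benchmarks, tier in THRESHOLDS:
        if categories >= min_categories and benchmarks >= min_benchmarks:
            return tier
    return LOW


# Many benchmarks spread over few categories still lands in a lower tier.
narrow_but_deep = confidence_tier(categories=4, benchmarks=40)
```

Under this reading, a model with many benchmark rows concentrated in a handful of categories is capped by the category minimum, which would explain rows in the table with high benchmark counts but a Moderate rating.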

Confidence Distribution (Ranked Models)

High (6%)

Good (9%)

Moderate (12%)

Low / Estimated (73%)

How BenchLM Scores Work

Verified, provisional, and generated

Each benchmark value is tagged as manual (a hand-entered public row) or generated (inferred from related models). Generated rows are excluded from all public ranking logic. Manual rows are further split into sourced rows, which feed the verified leaderboard, and source-unverified rows, which can still appear in provisional mode.
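As a rough illustration of the tagging described above, the following sketch routes a row to the leaderboards it may feed. The `BenchmarkRow` type, its field names, and the example values are hypothetical and are not BenchLM's actual schema.

```python
from typing import NamedTuple, Set


class BenchmarkRow(NamedTuple):
    benchmark: str
    score: float
    origin: str        # "manual" or "generated" (hypothetical field)
    has_source: bool   # True when an exact source is attached (hypothetical field)


def leaderboards_for(row: BenchmarkRow) -> Set[str]:
    """Return the leaderboards a row is allowed to influence."""
    if row.origin == "generated":
        return set()                        # never used for ranking
    if row.has_source:
        return {"provisional", "verified"}  # sourced manual row
    return {"provisional"}                  # manual, source not yet verified


# Placeholder rows for illustration, not real BenchLM data.
rows = [
    BenchmarkRow("bench-a", 71.0, "manual", True),
    BenchmarkRow("bench-b", 64.5, "manual", False),
    BenchmarkRow("bench-c", 58.0, "generated", False),
]
verified_inputs = [r.benchmark for r in rows if "verified" in leaderboards_for(r)]
```

The sketch treats sourced rows as valid inputs to both leaderboards, since a sourced row is also a non-generated manual row.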

Ranking Eligibility

A model must have at least 8 qualifying benchmarks across 2+ categories to rank in a lane. The provisional leaderboard uses rankable non-generated rows; the verified leaderboard uses sourced rows only. Models below the threshold are shown as tracked but unranked.
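The eligibility rule above amounts to two counts over a model's qualifying rows. Here is a minimal sketch, assuming each qualifying row is represented as a hypothetical (benchmark, category) pair and that the caller has already filtered rows for the lane in question (non-generated rows for provisional, sourced rows for verified).

```python
from typing import Iterable, Tuple


def is_ranked(
    qualifying_rows: Iterable[Tuple[str, str]],
    min_benchmarks: int = 8,
    min_categories: int = 2,
) -> bool:
    """Apply the stated threshold: 8+ benchmarks across 2+ categories."""
    rows = list(qualifying_rows)
    benchmarks = {benchmark for benchmark, _ in rows}
    categories = {category for _, category in rows}
    return len(benchmarks) >= min_benchmarks and len(categories) >= min_categories


# Placeholder data: eight benchmarks, but all in one category.
single_category = [(f"bench-{i}", "coding") for i in range(8)]
# Eight benchmarks split across two categories.
two_categories = single_category[:4] + [(f"bench-{i}", "math") for i in range(4, 8)]
```

With these placeholder inputs, `is_ranked(single_category)` is false and `is_ranked(two_categories)` is true; a model that fails the check would be shown as tracked but unranked.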

Category Eligibility

For category leaderboards, a model needs qualifying scores on at least half of the weighted benchmarks in that category. BenchLM computes this separately for provisional and verified ranking so sparse exact-source coverage cannot silently borrow strength from provisional rows.
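The "at least half" rule can be sketched as a set intersection, evaluated once per lane. The function name and the benchmark names below are placeholders, and the handling of an empty category is an assumption.

```python
from typing import Set


def category_eligible(weighted_benchmarks: Set[str], scored_benchmarks: Set[str]) -> bool:
    """True when the model has scores on at least half of the weighted benchmarks."""
    if not weighted_benchmarks:
        return False  # assumption: a category with no weighted benchmarks ranks nobody
    covered = len(weighted_benchmarks & scored_benchmarks)
    return 2 * covered >= len(weighted_benchmarks)


# Placeholder category with four weighted benchmarks.
weighted = {"bench-a", "bench-b", "bench-c", "bench-d"}

# The two lanes are checked independently, each with its own row set.
provisional_scores = {"bench-a", "bench-b", "bench-c"}  # non-generated rows
verified_scores = {"bench-a"}                           # sourced rows only

provisional_ok = category_eligible(weighted, provisional_scores)
verified_ok = category_eligible(weighted, verified_scores)
```

Here the model qualifies for the provisional category leaderboard (3 of 4) but not the verified one (1 of 4), which is the separation the paragraph describes: sparse sourced coverage does not inherit eligibility from provisional rows.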

Display-Only Benchmarks

Some benchmarks (MMLU, BBH, HumanEval, older AIME/HMMT variants) are shown for context but don't affect scoring. These are either saturated (top models all score 97%+) or have been superseded by harder versions.

Model	Confidence	Prov. score	Sourced	Rankable	Coverage
Claude Opus 4.5 Anthropic	●●●●High	77	41	81	51%
Kimi K2.5 Moonshot AI	●●●●High	64	39	108	36%
Qwen3.6 Plus Alibaba	●●●●High	73	38	68	56%
Qwen3.5 397B Alibaba	●●●●High	64	36	86	42%
GLM-5 Z.AI	●●●●High	67	33	73	45%
GPT-5.4 OpenAI	●●●●High	89	28	64	44%
Claude Opus 4.6 Anthropic	●●●●High	87	28	74	38%
GPT-5.5 OpenAI	●●●○Good	91	22	22	100%
Gemini 3.1 Pro Google	●●●○Good	92	19	50	38%
Claude Opus 4.7 (Adaptive) Anthropic	●●●○Good	90	18	18	100%
Grok 4.20 xAI	●●●○Good	65	18	20	90%
GLM-5.1 Z.AI	●●●○Good	83	16	33	48%
Claude Mythos Preview Anthropic	●●●○Good	99	15	23	65%
Claude Sonnet 4.6 Anthropic	●●●○Good	83	13	46	28%
Qwen3.5-122B-A10B Alibaba	●●●○Good	65	13	20	65%
Qwen3.5-27B Alibaba	●●●○Good	63	13	21	62%
Qwen3.5-35B-A3B Alibaba	●●●○Good	56	13	20	65%
Qwen3.6-35B-A3B Alibaba	●●○○Moderate	67	40	40	100%
Qwen3.6-27B Alibaba	●●○○Moderate	74	37	37	100%
Kimi K2.6 Moonshot AI	●●○○Moderate	85	27	27	100%
DeepSeek V4 Pro (Max) DeepSeek	●●○○Moderate	88	25	25	100%
DeepSeek V4 Flash (Max) DeepSeek	●●○○Moderate	76	24	24	100%
DeepSeek V4 Pro (High) DeepSeek	●●○○Moderate	84	23	23	100%
DeepSeek V4 Flash (High) DeepSeek	●●○○Moderate	71	23	23	100%
DeepSeek V4 Pro DeepSeek	●●○○Moderate	70	21	21	100%
DeepSeek V4 Flash DeepSeek	●●○○Moderate	59	21	21	100%
MiniMax M2.7 MiniMax	●●○○Moderate	62	18	32	56%
GPT-5.2 OpenAI	●●○○Moderate	81	11	56	20%
GPT-5.4 Pro OpenAI	●●○○Moderate	91	9	14	64%
Gemini 3 Pro Google	●●○○Moderate	81	8	63	13%
Kimi K2.5 (Reasoning) Moonshot AI	●●○○Moderate	76	8	38	21%
GLM-4.7 Z.AI	●○○○Low / Estimated	~69	7	34	21%
GPT-5.3 Codex OpenAI	●○○○Low / Estimated	~87	6	38	16%
Claude Sonnet 4.5 Anthropic	●○○○Low / Estimated	~66	6	37	16%
o3-mini OpenAI	●○○○Low / Estimated	~56	5	16	31%
DeepSeek V3.2 DeepSeek	●○○○Low / Estimated	~58	4	41	10%
GPT-4.1 OpenAI	●○○○Low / Estimated	~58	4	16	25%
GPT-4.1 mini OpenAI	●○○○Low / Estimated	~45	4	16	25%
Qwen3 235B 2507 Alibaba	●○○○Low / Estimated	~33	4	32	13%
Gemini 2.5 Pro Google	●○○○Low / Estimated	~65	3	37	8%
o1 OpenAI	●○○○Low / Estimated	~57	3	16	19%
GPT-4.1 nano OpenAI	●○○○Low / Estimated	~27	3	15	20%
Gemini 3 Flash Google	●○○○Low / Estimated	~65	2	40	5%
Gemini 3.1 Flash-Lite Google	●○○○Low / Estimated	~48	2	34	6%
Gemini 3 Pro Deep Think Google	●○○○Low / Estimated	~91	1	33	3%
GLM-5 (Reasoning) Z.AI	●○○○Low / Estimated	~82	1	40	3%
GPT-5.1 OpenAI	●○○○Low / Estimated	~79	1	37	3%
GPT-5 (high) OpenAI	●○○○Low / Estimated	~78	1	36	3%
GPT-5.2-Codex OpenAI	●○○○Low / Estimated	~77	1	31	3%
GPT-5.1-Codex-Max OpenAI	●○○○Low / Estimated	~76	1	30	3%
Grok 4 xAI	●○○○Low / Estimated	~65	1	34	3%
DeepSeek V3.2 (Thinking) DeepSeek	●○○○Low / Estimated	~62	1	41	2%
MiMo-V2-Flash Xiaomi	●○○○Low / Estimated	~60	1	34	3%
Claude Haiku 4.5 Anthropic	●○○○Low / Estimated	~58	1	33	3%
Claude 4.1 Opus Anthropic	●○○○Low / Estimated	~52	1	33	3%
Claude 4 Sonnet Anthropic	●○○○Low / Estimated	~51	1	32	3%
Nemotron 3 Super 100B NVIDIA	●○○○Low / Estimated	~44	1	33	3%
GPT-OSS 120B OpenAI	●○○○Low / Estimated	~35	1	34	3%
GPT-OSS 20B OpenAI	●○○○Low / Estimated	~17	1	33	3%
Grok 4.1 xAI	●○○○Low / Estimated	~90	0	37	0%
o1-preview OpenAI	●○○○Low / Estimated	~83	0	32	0%
Qwen3.5 397B (Reasoning) Alibaba	●○○○Low / Estimated	~79	0	33	0%
GPT-5 (medium) OpenAI	●○○○Low / Estimated	~71	0	32	0%
Grok 4.1 Fast xAI	●○○○Low / Estimated	~70	0	36	0%
o3 OpenAI	●○○○Low / Estimated	~58	0	35	0%
o3-pro OpenAI	●○○○Low / Estimated	~58	0	32	0%
DeepSeek Coder 2.0 DeepSeek	●○○○Low / Estimated	~52	0	32	0%
DeepSeek LLM 2.0 DeepSeek	●○○○Low / Estimated	~51	0	32	0%
Qwen2.5-1M Alibaba	●○○○Low / Estimated	~51	0	32	0%
GPT-4o mini OpenAI	●○○○Low / Estimated	~50	0	14	0%
Qwen2.5-72B Alibaba	●○○○Low / Estimated	~50	0	32	0%
DeepSeekMath V2 DeepSeek	●○○○Low / Estimated	~50	0	32	0%
Mistral Large 3 Mistral	●○○○Low / Estimated	~49	0	36	0%
Qwen3 235B 2507 (Reasoning) Alibaba	●○○○Low / Estimated	~47	0	32	0%
Nemotron 3 Ultra 500B NVIDIA	●○○○Low / Estimated	~47	0	35	0%
o4-mini (high) OpenAI	●○○○Low / Estimated	~44	0	39	0%
Claude 4.1 Opus Thinking Anthropic	●○○○Low / Estimated	~44	0	32	0%
GPT-4o OpenAI	●○○○Low / Estimated	~43	0	33	0%
Kimi K2 Moonshot AI	●○○○Low / Estimated	~42	0	18	0%
Llama 3.1 405B Meta	●○○○Low / Estimated	~41	0	32	0%
Claude 3.5 Sonnet Anthropic	●○○○Low / Estimated	~41	0	33	0%
Grok Code Fast 1 xAI	●○○○Low / Estimated	~40	0	32	0%
Sarvam 105B Sarvam	●○○○Low / Estimated	~39	0	12	0%
Gemini 2.5 Flash Google	●○○○Low / Estimated	~38	0	33	0%
Mistral Large 2 Mistral	●○○○Low / Estimated	~38	0	32	0%
DeepSeek V3 DeepSeek	●○○○Low / Estimated	~36	0	9	0%
Gemini 1.5 Pro Google	●○○○Low / Estimated	~36	0	32	0%
Claude 3 Opus Anthropic	●○○○Low / Estimated	~35	0	32	0%
DeepSeek-R1 DeepSeek	●○○○Low / Estimated	~33	0	33	0%
DBRX Instruct Databricks	●○○○Low / Estimated	~33	0	13	0%
Grok 3 [Beta] xAI	●○○○Low / Estimated	~32	0	32	0%
DeepSeek V3.1 (Reasoning) DeepSeek	●○○○Low / Estimated	~30	0	32	0%
o1-pro OpenAI	●○○○Low / Estimated	~29	0	13	0%
Phi-4 Microsoft	●○○○Low / Estimated	~28	0	17	0%
GLM-4.5 Z.AI	●○○○Low / Estimated	~27	0	33	0%
Llama 3 70B Meta	●○○○Low / Estimated	~27	0	32	0%
DeepSeek V3.1 DeepSeek	●○○○Low / Estimated	~26	0	32	0%
Nemotron 3 Nano 30B NVIDIA	●○○○Low / Estimated	~26	0	32	0%
GPT-4 Turbo OpenAI	●○○○Low / Estimated	~25	0	30	0%
Gemini 1.0 Pro Google	●○○○Low / Estimated	~25	0	31	0%
Z-1 Z	●○○○Low / Estimated	~24	0	32	0%
Mistral 8x7B Mistral	●○○○Low / Estimated	~24	0	32	0%
Claude 3 Haiku Anthropic	●○○○Low / Estimated	~24	0	32	0%
Mixtral 8x22B Instruct v0.1 Mistral	●○○○Low / Estimated	~23	0	13	0%
Nemotron-4 15B NVIDIA	●○○○Low / Estimated	~23	0	32	0%
Moonshot v1 Moonshot AI	●○○○Low / Estimated	~23	0	32	0%
Llama 4 Scout Meta	●○○○Low / Estimated	~22	0	34	0%
Nemotron Ultra 253B NVIDIA	●○○○Low / Estimated	~22	0	32	0%
GLM-4.5-Air Z.AI	●○○○Low / Estimated	~19	0	34	0%
Gemma 3 27B Google	●○○○Low / Estimated	~17	0	30	0%
Llama 4 Maverick Meta	●○○○Low / Estimated	~17	0	37	0%
Llama 4 Behemoth Meta	●○○○Low / Estimated	~12	0	37	0%
Nova Pro Amazon	●○○○Low / Estimated	~10	0	31	0%
Mistral 7B v0.3 Mistral	●○○○Low / Estimated	~5	0	32	0%
Mistral 8x7B v0.2 Mistral	●○○○Low / Estimated	~2	0	32	0%

Sourced = exact-source benchmark coverage. Rankable = non-generated benchmark coverage used by the provisional leaderboard. Generated = inferred from related models and excluded from ranking. Coverage = sourced share of the visible benchmark footprint.
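In the table above, the Coverage column is consistent with Sourced divided by Rankable, rounded to the nearest whole percent with halves rounded up. The sketch below reproduces that arithmetic; treating Rankable as the "visible benchmark footprint" is an inference from the table values, not something the page states, and the function name is hypothetical.

```python
def coverage_percent(sourced: int, rankable: int) -> int:
    """Sourced share of rankable rows, as a whole percent (halves round up)."""
    if rankable == 0:
        return 0  # assumption: no rankable rows means no coverage to report
    # Integer half-up rounding; avoids Python's round(), which rounds 12.5 to 12.
    return (200 * sourced + rankable) // (2 * rankable)


# Values taken from table rows above.
opus_45 = coverage_percent(41, 81)      # Claude Opus 4.5
qwen3_235b = coverage_percent(4, 32)    # Qwen3 235B 2507, exactly 12.5%
gpt_55 = coverage_percent(22, 22)       # GPT-5.5
```

These evaluate to 51, 13, and 100, matching the corresponding table rows.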

Frequently Asked Questions

What is benchmark confidence on BenchLM?

Score confidence (1-4 dots) indicates how much sourced benchmark data supports a model's score. A 4-dot score is backed by 20+ sourced benchmark rows across 7+ categories. A 1-dot score relies on limited sourced coverage, and the provisional leaderboard may still include source-unverified non-generated rows. The confidence system helps you distinguish between well-tested models and those with sparse coverage.

What does "estimated" mean on BenchLM scores?

Scores marked with "Est." or "~" are derived from limited sourced data. Generated rows are excluded from ranking inputs, but the provisional leaderboard may still rely on source-unverified non-generated public rows until exact citations are attached. The verified leaderboard avoids that by using sourced rows only.

How does BenchLM detect contamination risk?

BenchLM tracks two key signals. The first is benchmark provenance: whether each score is a hand-entered public row ("manual") or was generated or inferred from related data. The second is benchmark freshness: older benchmarks that haven't been updated are more likely to have been contaminated through inclusion in training data. Models with mostly generated data or stale benchmarks receive lower confidence ratings. Exact-source verification is tracked separately from this manual-vs-generated split.

What is benchmark provenance?

Provenance tracks the origin of each benchmark score. "Manual" scores are hand-entered public rows from BenchLM's dataset work. "Generated" scores were inferred from related models or interpolated. BenchLM now distinguishes provisional ranking, which can use non-generated manual rows, from verified ranking, which only uses exact-source-attached rows.

Which LLM benchmarks are most reliable?

Fresh, held-out benchmarks like SWE-Rebench (rolling window), Terminal-Bench 2.0, and HLE are the hardest to game. Older, saturated benchmarks like MMLU (where top models all score 97-99%) provide little signal. BenchLM weights newer, harder benchmarks more heavily and flags saturated ones as display-only.
