The Missing-Benchmark Problem: How Leaderboards Reward Hiding Weak Scores

A reader caught BenchLM ranking Qwen3.7 Max below its own cheaper sibling. The bug was not a data error. It was the averaging method almost every LLM leaderboard uses, and fixing it moved 170 scores.

BenchLM · Published July 4, 2026 · 13 min read


BenchLM rescored all 272 tracked models on July 4, 2026, after a reader proved the site had ranked Qwen3.7 Max below Qwen3.7 Plus even though Max wins 27 of the 32 benchmarks the two models share. The inversion was not a data error, and it was not unique to BenchLM. It is a structural bug in how nearly every LLM leaderboard turns incomplete benchmark tables into a single ranked number. This post documents the mechanism, the statistics behind it, the case that exposed it, the alternatives we rejected, and the fix now running in production.

The screenshot that started it

A reply on X put two of our rows side by side: Alibaba's flagship Qwen3.7 Max sitting one place below Qwen3.7 Plus, the cheaper tier of the same family. The commenter called the rankings hallucinated.

Our first move was the boring one: an audit of both rows. Every value traced to a provenance record with a source attached, every source checked out, and no generated or estimated data had touched either model, an invariant our build validates on every deploy. By the standards we had published, both rows were correct.

The commenter was wrong about the cause and right about the symptom. The numbers were fine. The problem was what the scoring did where numbers were absent, and absence is the normal condition of benchmark data: across our catalog of 251 tracked benchmarks, even the best-covered model reports a fraction of the full grid. A scoring method is mostly defined by how it behaves on the cells that are empty.

Ours behaved badly, in a specific and instructive way.

Averaging rewards silence

Composite leaderboard scores are weighted averages across categories: coding counts for so much, agentic for so much, multimodal for so much. When a model has no data in a category, the standard move (ours included, until this week) is to renormalize, spreading that category's weight across the categories that do have data.

Renormalization sounds like a neutral accounting choice. A toy example shows it is not.

| Model | Coding (50%) | Multimodal (50%) | Naive overall |
|---|---|---|---|
| Model A, reports everything | 90 | 60 | 75 |
| Model B, skips multimodal | 90 | not reported | 90 |

Two models with identical coding ability, and the one that stayed quiet about multimodal outscores the honest one by 15 points. Model B did not earn 90 overall; the method awarded it for the shape of its disclosure. Scale that up to eight categories and dozens of weighted benchmarks, and selective reporting becomes a ranking strategy that no lab even has to adopt consciously. Publishing what looks good is the default behavior of every marketing department on earth.
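The toy example can be reproduced in a few lines. This is a minimal sketch rather than BenchLM's production code; the function name `renormalized_score` and the dictionary layout are assumptions made for illustration.

```python
from typing import Mapping, Optional


def renormalized_score(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> float:
    """Weighted average over reported categories only.

    Categories whose score is None are dropped, and their weight is
    implicitly spread across the categories that remain.
    """
    reported = {c: s for c, s in scores.items() if s is not None}
    if not reported:
        raise ValueError("model reports no categories")
    reported_weight = sum(weights[c] for c in reported)
    return sum(weights[c] * s for c, s in reported.items()) / reported_weight


weights = {"coding": 0.5, "multimodal": 0.5}
model_a = {"coding": 90.0, "multimodal": 60.0}   # reports everything
model_b = {"coding": 90.0, "multimodal": None}   # skips multimodal

score_a = renormalized_score(model_a, weights)   # (45 + 30) / 1.0
score_b = renormalized_score(model_b, weights)   # 45 / 0.5
print(score_a, score_b)
```

The only difference between the two calls is the `None`, and that alone is worth 15 points.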

The same trick recurses one level down, inside categories. A coding score built from five weighted benchmarks and a coding score built from the two easiest of those five carry the same label and the same weight in the overall number. Skipping the hard benchmark in a category pays exactly the way skipping the hard category does, just in smaller coin.

Statisticians named this problem in 1976

None of this is a new discovery; leaderboards just tend to be built without consulting the literature that predicted their failure mode. The missing-data taxonomy that statisticians have used since Donald Rubin's 1976 work sorts absence into three kinds. Data can be missing completely at random, missing at random given what you observe, or missing not at random, where the probability that a value is absent depends on the value itself.

Benchmark tables are a textbook case of the third kind. A lab that runs an evaluation and sees a flattering number publishes it; a lab that sees an ugly number files it away. The absence is informative: unreported scores are, on average, worse than reported ones. Under that condition, averaging observed values is not merely noisy, it is biased upward, and biased most for exactly the models whose gaps are most strategic.
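A small deterministic sketch makes the upward bias concrete. The six scores and the publish-only-above-75 rule are hypothetical, chosen only to illustrate missing-not-at-random reporting; real disclosure decisions are messier than a single threshold.

```python
from statistics import mean
from typing import List

# Hypothetical true scores for one model on six benchmarks.
true_scores = [88.0, 84.0, 79.0, 71.0, 62.0, 54.0]


def published(scores: List[float], threshold: float) -> List[float]:
    """Assumed disclosure rule: the lab releases only results at or above
    its comfort threshold, so absence depends on the value itself."""
    return [s for s in scores if s >= threshold]


observed = published(true_scores, threshold=75.0)

true_mean = mean(true_scores)       # what full disclosure would show
observed_mean = mean(observed)      # what a leaderboard actually sees
bias = observed_mean - true_mean    # positive: averaging flatters the model

print(f"true={true_mean:.1f} observed={observed_mean:.1f} bias={bias:+.1f}")
```

Under this rule the average of the published rows overstates the model by more than ten points, and the overstatement grows as the threshold rises.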

The textbook response to informative missingness is not to ignore the gaps and not to pretend certainty about them either. It is to impute them from a model of what absent values plausibly look like, and to shade that imputation toward pessimism when absence itself is evidence. Hold that thought; it is the fix.

Qwen3.7 Max versus Qwen3.7 Plus, 27 to 4

The Qwen pair is the cleanest real-world specimen we have seen, because the two models come from one lab, one generation, and one training lineage, differing mainly in scale and price.

Plus is the transparency champion of our catalog: 50 sourced benchmarks across seven categories, including 15 multimodal results averaging an 81 category score. Max publishes 33 sourced benchmarks across six categories and exactly zero weighted multimodal rows. The only multimodal signal Alibaba has released for Max is a display-only Design Arena Elo, which BenchLM never weights.

Under renormalized averaging, Plus paid for its honesty. Its real multimodal results were dragged into its average while the same 12% of rubric weight silently vanished from Max's. Depending on the data snapshot, the two flipped back and forth within a point, and on the verified board Plus sat on top the day the screenshot was taken.

The paired comparison removes the coverage asymmetry entirely: restrict scoring to the 32 benchmarks both models publish, and Max wins 27, loses 4, ties 1, with a mean margin of 3.6 points per benchmark.
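For readers who want to run the same kind of comparison, here is one way to compute it. The sketch assumes both models' scores sit on a common scale per benchmark and that the margin is the signed mean difference over shared rows; the benchmark names and values are placeholders, not the Qwen data.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PairedResult:
    shared: int
    wins: int          # benchmarks where model A scores higher
    losses: int        # benchmarks where model B scores higher
    ties: int
    mean_margin: float  # mean of (A - B) over shared benchmarks


def paired_comparison(a: Mapping[str, float], b: Mapping[str, float]) -> PairedResult:
    """Compare two models only on the benchmarks both report."""
    shared = sorted(a.keys() & b.keys())
    if not shared:
        return PairedResult(0, 0, 0, 0, 0.0)
    diffs = [a[name] - b[name] for name in shared]
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    ties = len(diffs) - wins - losses
    return PairedResult(len(shared), wins, losses, ties, sum(diffs) / len(diffs))


# Placeholder data: each model reports one benchmark the other does not.
model_x = {"bench_a": 82.0, "bench_b": 74.0, "bench_c": 61.0, "bench_d": 90.0}
model_y = {"bench_a": 80.0, "bench_b": 75.0, "bench_c": 61.0, "bench_e": 88.0}

result = paired_comparison(model_x, model_y)
print(result)
```

Rows that only one model reports (`bench_d` and `bench_e` here) never enter the result, which is the whole point of the restriction.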

Max had the better report card and the worse grade-point average.

Chatbot Arena never had this bug, by construction

One popular leaderboard is structurally immune to the missing-benchmark problem, and it is worth understanding why. Chatbot Arena does not average scores from incomplete tables. Its Elo ratings are built from millions of paired battles in which two models answer the same prompt and a human picks a winner. Every comparison is on shared ground, so there is no cell for a model to leave strategically empty.

Arena has different problems (preference is not performance, voters favor confident formatting, and rankings lag releases), and we have written about those elsewhere. But its core design choice, comparing models only on common support, is the statistically clean answer to missing data, and we stole it without embarrassment.

The theft shows up in our tie-breaking rule. When two models round to the same displayed score on BenchLM, their sub-point difference is coverage noise, so the order between them is now decided the Arena way: head-to-head, on the sourced benchmarks both models share, with a minimum overlap of ten and a required margin before the result counts as decisive. Sub-point differences between averages of different benchmark sets settle nothing. Twenty-seven wins to four settles the argument.
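One possible shape for that rule, as a sketch. The minimum overlap of ten comes from the text above; the required margin is not stated in this post, so `MIN_WIN_MARGIN = 3` (measured in net wins) is a placeholder assumption, as are the function and variable names.

```python
from typing import Mapping, Optional

MIN_OVERLAP = 10     # stated in the post
MIN_WIN_MARGIN = 3   # hypothetical placeholder: net wins needed to be decisive


def break_tie(
    a: Mapping[str, float],
    b: Mapping[str, float],
    displayed_a: int,
    displayed_b: int,
) -> Optional[str]:
    """Return "a" or "b" if head-to-head evidence is decisive, else None."""
    if displayed_a != displayed_b:
        return None  # the rule only applies to equal displayed scores
    shared = a.keys() & b.keys()
    if len(shared) < MIN_OVERLAP:
        return None  # too little common ground to overturn anything
    wins_a = sum(a[k] > b[k] for k in shared)
    wins_b = sum(b[k] > a[k] for k in shared)
    if abs(wins_a - wins_b) < MIN_WIN_MARGIN:
        return None  # margin too thin to count as decisive
    return "a" if wins_a > wins_b else "b"


# Placeholder data: twelve shared benchmarks, model a ahead on nine.
a = {f"bench_{i}": 70.0 + i for i in range(12)}
b = {f"bench_{i}": 70.0 + i + (-2.0 if i < 9 else 2.0) for i in range(12)}
b_sparse = {k: b[k] for k in list(b)[:4]}   # only four shared rows

print(break_tie(a, b, 78, 78))          # decisive on shared evidence
print(break_tie(a, b_sparse, 78, 78))   # below the overlap threshold
```

Returning `None` rather than guessing keeps the existing order in place whenever the shared evidence is thin.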

What changed in the scoring

The July 4 rescore replaced renormalization with conservative imputation, applied at two levels and on both leaderboards.

Missing categories. Every model is now scored against all eight categories. A category with no sourced data is imputed as an even blend of two quantities: the cohort median for that category, and the model's own weighted mean across the categories it does report. The blend says: a model probably performs somewhere near its own level in the areas it stayed quiet about, but only half of that assumption gets credit.

Missing benchmarks inside categories. The same logic recurses. A model holding only the two easiest coding benchmarks no longer gets its coding weight renormalized onto them; the absent weighted benchmarks are imputed the same half-blend way, using per-benchmark cohort medians.

Ties resolve on shared evidence. The paired head-to-head described above decides the order, but only between models with equal displayed scores, so the visible ranking stays monotonic in the number readers see.
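The two imputation levels can be sketched with a single helper applied twice, once over the benchmarks inside a category and once over the categories. This is an illustration under stated assumptions, not the production scoring file: the names, weights, and cohort values are hypothetical, and the sketch assumes a model's own level inside a category is its weighted mean over the benchmarks it reports there.

```python
from statistics import median
from typing import Mapping, Sequence

SHRINK = 0.5  # share of each imputed value taken from the cohort median


def half_blend_average(
    reported: Mapping[str, float],
    weights: Mapping[str, float],
    cohort: Mapping[str, Sequence[float]],
) -> float:
    """Weighted average over every key in `weights`.

    Keys missing from `reported` are filled with an even blend of the
    cohort median for that key and the model's own weighted mean over
    the keys it does report. No weight is renormalized away.
    """
    if not reported:
        raise ValueError("nothing reported at this level")
    reported_weight = sum(weights[k] for k in reported)
    own = sum(weights[k] * v for k, v in reported.items()) / reported_weight
    total = 0.0
    for key, weight in weights.items():
        if key in reported:
            value = reported[key]
        else:
            value = SHRINK * median(cohort[key]) + (1 - SHRINK) * own
        total += weight * value
    return total / sum(weights.values())


# Level 1: benchmarks inside the coding category (hypothetical values).
coding = half_blend_average(
    reported={"easy_1": 90.0, "easy_2": 86.0},
    weights={"easy_1": 1.0, "easy_2": 1.0, "hard_1": 2.0},
    cohort={"hard_1": [40.0, 50.0, 60.0]},   # sourced cohort scores only
)

# Level 2: categories, with multimodal unreported.
overall = half_blend_average(
    reported={"coding": coding},
    weights={"coding": 0.5, "multimodal": 0.5},
    cohort={"multimodal": [50.0, 60.0, 70.0]},
)
print(coding, overall)
```

With these numbers renormalization would have returned 88 at both levels; the half-blend returns 78.5 for coding and 73.875 overall.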

Two implementation details matter enough to state publicly. First, both leaderboards now score on the same percentile-normalized scale, each against its own cohort: the verified board previously used raw averages, which let a model that skipped a hard benchmark inherit an easier scale. Second, cohort medians are computed exclusively from sourced, non-generated values, so the imputation machinery cannot become a side door for estimated data. The build fails if generated rows influence any public ranking output, and that check ran before and after this change.
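Both details fit in a short sketch. The `Row` type, the function names, and the at-or-below definition of a percentile are assumptions for illustration; this post does not specify which percentile convention the production scale uses.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Sequence


@dataclass(frozen=True)
class Row:
    model: str
    value: float
    sourced: bool  # False for generated or estimated values


def sourced_values(rows: Sequence[Row]) -> List[float]:
    """Keep only values backed by a source."""
    return [r.value for r in rows if r.sourced]


def cohort_median(rows: Sequence[Row]) -> float:
    """Median over sourced rows only, so estimates cannot leak into imputation."""
    values = sourced_values(rows)
    if not values:
        raise ValueError("no sourced values in cohort")
    return median(values)


def percentile_score(value: float, cohort_values: Sequence[float]) -> float:
    """Share of the cohort scoring at or below `value`, on a 0-100 scale."""
    at_or_below = sum(v <= value for v in cohort_values)
    return 100.0 * at_or_below / len(cohort_values)


# Hypothetical cohort for one benchmark; the last row is an estimate.
rows = [
    Row("m1", 40.0, True),
    Row("m2", 60.0, True),
    Row("m3", 80.0, True),
    Row("m4", 95.0, False),
]

print(cohort_median(rows))
print(percentile_score(60.0, sourced_values(rows)))
```

Had the estimated row been allowed in, the median would have shifted from 60 to 70, which is the side door the sourced-only filter closes.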

The methodology page carries the full description, and the shrink constant sits in one documented line of code for anyone who wants to argue with 0.5.

The alternatives we rejected

Conservative imputation was not the only candidate, and the rejected options explain the chosen one.

Impute zero. Treat missing as failing. Maximally troll-proof and maximally wrong: it asserts that Qwen3.7 Max would score 0 on MMMU-Pro, a claim nobody believes, and it would let a lab tank a rival's rank by simply noting which evaluations the rival has not published. Absence of evidence is evidence of weakness here, but it is not evidence of zero.

Impute the cohort median alone. Cleaner statistics, brutal in practice: every frontier model with a gap gets dragged to the middle of a 272-model catalog that includes 7B open-weight models. A flagship missing one category would be treated as mid-tier in it. The signal that a model performs near its own level elsewhere is real information, and throwing it away overcorrects.

Coverage gates without imputation. Require some minimum number of benchmarks and categories, then renormalize as before above the gate. This was our previous system. Gates stop the worst cases and do nothing about the marginal ones; the Qwen inversion happened entirely above our gates.

Require complete coverage. Only rank models with every category sourced. Intellectually pure, and it would shrink the ranked catalog from 124 models to something in the single digits, at which point the leaderboard answers no question anyone is asking.

The half-blend sits deliberately between the second option and the renormalization we retired, which in effect imputes every gap at the model's own level: half cohort realism, half own-level charity, applied identically to every model. Anyone who prefers a different constant is welcome to fork the scoring file; the argument is at least now about one explicit number instead of an implicit method nobody had examined.

What the rescore did to the board

Numbers first: the final recalculation moved the stored scores of 170 of 272 tracked models. Most models moved three to five points down, because almost nobody has complete coverage and imputation pulls every gap toward the cohort. The top overall score fell from 99 to 89, which we count as a side benefit. A leaderboard whose best row reads 99/100 is advertising; one whose best row reads 89 is measuring something.

The models that moved most are the ones the old method flattered most.

| Model | Stored score before | After | Why |
|---|---|---|---|
| GPT-5.5 Pro | 100 | 75 | strong categories only, thin everywhere else |
| Mistral Medium 3.5 128B | 95 | 71 | 2-category coverage carried 100% of weight |
| MiMo-V2-Omni | 84 | 68 | same pattern, smaller scale |
| Qwen3.7 Max | 89 | 78 | multimodal gap now priced in |
| Qwen3.7 Plus | 86 | 78 | tied on the snapshot; shared rows break the tie |

A 25-point correction, as in the GPT-5.5 Pro row, is not a claim that the model got worse in July. It is a confession about how much of its previous score was renormalization rather than measurement. The model's sourced results did not change; the credit extended against its silence did.

The case that still looks wrong

Honesty requires showing the ranking this method produces that we are least comfortable with. GPT-5.4 Pro, OpenAI's premium tier, now sits below the base GPT-5.4 on our provisional board, by roughly two points.

The evidence file explains it without excusing it. GPT-5.4 Pro has 9 sourced benchmarks in our catalog; GPT-5.4 has 26. On the 4 benchmarks the two models share, Pro wins all 4, which matches everyone's prior about which model is stronger. But 4 shared rows is below the 10 our head-to-head rule requires before it will overturn anything, and we are not going to lower a threshold mid-flight because one result offends intuition. Thresholds bent for sympathetic cases stop being thresholds.

So the conservative rank stands, published next to its explanation, and the remedy sits with the party holding the missing data. Our verification-impact audit ranks GPT-5.4 Pro's absent rows among the highest-value backfills on the entire board. The day OpenAI publishes them, the order corrects itself without a human touching a constant.

That is the difference between a scoring bug and a scoring policy. A bug embarrasses you at random. A policy tells you in advance exactly which embarrassments it will accept, and why they are worth it.

A leaderboard is an incentive system

Rankings do not just describe the model market; the labs read them, and disclosure behavior follows what scoring rewards. Renormalized averaging pays labs to bury weak results. Conservative imputation inverts the payoff: an unreported category scores below the model's likely true level, so full disclosure becomes the strictly dominant strategy for any lab that believes in its own model.
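The payoff flip can be checked with arithmetic. Renormalizing is equivalent to filling a gap with the model's own weighted mean, so the sketch below treats it as the blend with the cohort share set to zero. All numbers and names are hypothetical.

```python
def credit_when_hidden(own_level: float, cohort_median: float, cohort_share: float) -> float:
    """Score a model is credited for a category it does not report.

    cohort_share=0.0 reproduces renormalization (gap filled at own level);
    cohort_share=0.5 is the half-blend described in this post.
    """
    return cohort_share * cohort_median + (1 - cohort_share) * own_level


def gain_from_hiding(true_score: float, own_level: float,
                     cohort_median: float, cohort_share: float) -> float:
    """Positive means silence pays; negative means disclosure pays."""
    return credit_when_hidden(own_level, cohort_median, cohort_share) - true_score


# Hypothetical lab: strong elsewhere (90), slightly weaker in this category (85).
own_level, cohort_median, true_score = 90.0, 60.0, 85.0

renorm_gain = gain_from_hiding(true_score, own_level, cohort_median, cohort_share=0.0)
blend_gain = gain_from_hiding(true_score, own_level, cohort_median, cohort_share=0.5)
print(renorm_gain, blend_gain)
```

In this sketch, hiding the category is worth 5 points under renormalization and costs 10 under the half-blend. Disclosure wins whenever the true score exceeds the imputed value, which here is 75.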

That is the design goal, stated plainly. We would rather be the leaderboard that is slightly harsh on silence than the one that is reliably generous to it.

The number to watch next is Qwen3.7 Max's multimodal row. Alibaba has published those results for Plus, so the evaluation harness exists. The day a sourced Max multimodal score lands in our catalog, the largest remaining imputation in the top five disappears, and the ranking will have been earned twice over.

Frequently asked questions

What is the missing-benchmark problem in LLM leaderboards?

Composite rankings are built from incomplete tables, and the standard fix (renormalizing weights over whatever exists) turns each gap into a small subsidy for the model that left it. Selective reporting then functions as strategy. The July 2026 BenchLM rescore exists because a reader caught the subsidy operating in public.

Why did BenchLM scores drop in July 2026?

Full-rubric scoring landed on July 4, 2026: all eight categories count for every model, with unsourced areas imputed conservatively rather than skipped. The final recalculation moved 170 of 272 stored scores, mostly downward by a few points, and hit hardest where high ranks rested on narrow coverage.

How does BenchLM handle missing benchmark scores?

Each gap is filled with a half-and-half blend of the cohort median and the model's own demonstrated level, at both category and individual-benchmark granularity. No weight ever renormalizes onto a convenient subset. Generated or estimated values remain excluded from ranking entirely, an invariant checked by an automated validator on every build.

Why does Qwen3.7 Max rank above Qwen3.7 Plus on BenchLM?

Because shared-coverage evidence is decisive in that direction. Across the 32 sourced benchmarks common to both models, Max takes 27, drops 4, and ties 1. Plus ranked higher previously only because its 15 published multimodal rows counted against it while Max's silence in the category cost nothing.

Does sparse benchmark coverage hurt a model's BenchLM ranking?

Yes, and the penalty is intentional. Thin coverage now scores partway between demonstrated level and cohort-typical, never at the flattering renormalized average. Labs unhappy about an imputed gap have a reliable remedy available: publish the missing results with a citable source, and the imputation is replaced by evidence.

Do other LLM leaderboards have the missing-benchmark problem?

Every composite score averaged from an incomplete benchmark grid carries some version of it, which describes most published rankings. Paired-preference systems such as Chatbot Arena are immune by construction, since Elo updates only happen when two models face the same prompt. BenchLM's tie-breaking now borrows that paired structure for models with equal displayed scores.

Why does GPT-5.4 Pro rank below GPT-5.4 on BenchLM?

Nine sourced benchmarks against twenty-six. Most of the Pro model's rubric is conservatively imputed, the pair shares only four rows (fewer than the ten required for a head-to-head override), and Pro's four straight wins on those rows suggest the gap closes the moment OpenAI publishes fuller results.
