What AI Labs Don't Publish: The Benchmark Disclosure Gap

All 30 of the top models on BenchLM's verified leaderboard have sourced coding results. Four have sourced competition math. The gap between those numbers is a map of what labs would rather not discuss.

BenchLM·Published July 4, 2026·13 min read

All 30 of the top models on BenchLM's verified leaderboard have citable coding benchmark results as of July 2026, and exactly 4 of those 30 have citable competition-math results. Between those two numbers sits every incentive that shapes what AI labs disclose. Benchmark tables are usually read as measurements of models. Read as disclosures by companies, the same tables say more, and this post reads them that way: category by category, lab by lab, with the receipts attached.

Thirty publish coding. Four publish math.

BenchLM's verified board only counts rows with primary sources attached: a model card, a technical report, a launch page with the exact number for the exact variant. That sourcing bar makes absence visible, and the absences are not random.

Labs benchmark what sells.

Coding and agentic results move API revenue, so every lab in the top 30 publishes them, without exception, for every model. Competition math moves Twitter for an afternoon. Multilingual performance matters enormously to users in eighty countries and to approximately zero enterprise procurement decks written in English. The disclosure rates below are the cleanest picture of lab priorities we know how to draw, precisely because nobody composed it on purpose. Each lab made its own small, local, defensible choices about which evaluations to run and which results to ship. The aggregate is the tell.

The disclosure ladder

Sourcing rates across the top 30 verified models, by category, as of July 2026:

| Category | Top-30 models with sourced results |
|---|---|
| Coding | 30 of 30 |
| Agentic | 30 of 30 |
| Knowledge | 29 of 30 |
| Multimodal | 18 of 30 |
| Reasoning | 13 of 30 |
| Instruction following | 12 of 30 |
| Multilingual | 10 of 30 |
| Competition math | 4 of 30 |

The ladder has a shape: the further a category sits from "demo that closes an enterprise deal," the fewer labs publish it.
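The counts in the ladder convert directly into disclosure rates. The following is a minimal sketch of that arithmetic; the counts are copied from the table above, while the function and variable names are illustrative and not part of any BenchLM tooling.

```python
# Sourced-result counts per category, copied from the disclosure ladder.
TOP_N = 30
sourced = {
    "Coding": 30,
    "Agentic": 30,
    "Knowledge": 29,
    "Multimodal": 18,
    "Reasoning": 13,
    "Instruction following": 12,
    "Multilingual": 10,
    "Competition math": 4,
}


def disclosure_rates(counts, total):
    """Return {category: (share_sourced, share_missing)} as fractions of total."""
    return {cat: (n / total, (total - n) / total) for cat, n in counts.items()}


rates = disclosure_rates(sourced, TOP_N)
for cat, (have, missing) in rates.items():
    print(f"{cat:<22} {have:>4.0%} sourced, {missing:>4.0%} missing")
```

Read this way, the instruction-following rung is 40% sourced and 60% missing, and the competition-math rung is about 13% sourced.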

The top rungs are unsurprising. Coding and agentic scores are the currency of the API market, and knowledge benchmarks (MMLU-Pro, GPQA and their successors) are the legacy lingua franca every launch post still speaks. One model in thirty lacking a sourced knowledge row is noise, not pattern.

The middle of the ladder is where it gets interesting. Reasoning at 13 of 30 partly reflects harness fragmentation: long-context and abstract-reasoning suites have less standardization than coding, so labs publish idiosyncratic subsets that fail exact-variant matching. Multimodal at 18 of 30 is the gap with the clearest lab-level structure, and it gets its own section below.

Instruction following at 12 of 30 is the quietly damning row. IFEval and its successors are cheap to run, utterly standard, and measure the single property every one of these models is marketed on. Every API landing page promises a model that does what it is told; 18 of the 30 models, or 60%, ship without a sourced result for the standard measurement of exactly that. There is no harness-fragmentation excuse on this rung.

And then math, at 4 of 30. Some of this is our exact-variant rule doing its job, which the next section explains, because the math rung is where the difference between "a number exists in a launch deck" and "a number is citable for this model" gets widest.

What "sourced" means here, and why math fails it

Our verification pipeline attaches a status to every benchmark value in the catalog. The bar for a weighted, verified row is specific: a primary document (model card, technical report, launch page, or equivalent) stating the number for the exact model variant we track, at a retrievable URL, with provider-published sources preferred when multiple candidates exist. Aggregator-reported values get display-only status: visible on model pages, never weighted into rankings. Generated estimates are excluded from ranking entirely, and a build-time validator proves stripping them changes nothing public.
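The sourcing bar can be read as three conditions that must all hold. The sketch below is one hypothetical way to encode them, not BenchLM's actual pipeline: the `BenchmarkRow` type, its field names, the `PRIMARY_KINDS` set, and the example model names and URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed labels for primary documents; the source also allows "or equivalent".
PRIMARY_KINDS = {"model_card", "technical_report", "launch_page"}


@dataclass(frozen=True)
class BenchmarkRow:
    tracked_variant: str   # variant name the catalog tracks
    source_variant: str    # variant name the cited document reports on
    source_kind: str       # e.g. "model_card", "aggregator", "estimate"
    url: Optional[str]     # where the number can be retrieved, if anywhere


def meets_sourcing_bar(row: BenchmarkRow) -> bool:
    """True only for a primary source, the exact variant, and a retrievable URL."""
    is_primary = row.source_kind in PRIMARY_KINDS
    exact_variant = row.source_variant == row.tracked_variant
    retrievable = bool(row.url)
    return is_primary and exact_variant and retrievable


# Hypothetical rows: same benchmark, three different provenance situations.
shipped = BenchmarkRow("example-model", "example-model",
                       "model_card", "https://example.com/model-card")
other_build = BenchmarkRow("example-model", "example-model-preview",
                           "launch_page", "https://example.com/launch")
aggregated = BenchmarkRow("example-model", "example-model",
                          "aggregator", "https://example.com/index")

for row in (shipped, other_build, aggregated):
    print(row.source_kind, row.source_variant, meets_sourcing_bar(row))
```

Only the first row passes. The second is a real number from a primary document, but it describes a differently named build, so an exact-match comparison rejects it; the third names the right variant but comes from an aggregator.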

Math is where this bar bites hardest, for a reason that says more about launch practices than about our pedantry. Math scores are abundant in the wild; citable math scores for exact production variants are scarce. Launch decks report AIME and HMMT results for "preview" builds, internal checkpoints, and max-compute configurations that never ship under the same name. The number is real, the model it describes is not quite the model you can buy, and our exact-variant rule refuses the substitution. Twenty-six of thirty top models ship without a citable competition-math number attached to the production variant, across every lab named on this page.

We are not going to soften the rule to fill the column. A verified board that accepts adjacent-checkpoint numbers is a provisional board with better marketing.

The display-only tier, where unweighted numbers live

One mechanism deserves a paragraph before the lab tables, because it explains several absences that would otherwise look stranger than they are. Alongside its weighted benchmarks, BenchLM tracks a display-only tier: roughly thirty benchmark families shown on model pages but excluded from every ranking computation. Aggregator-reported indexes, arena-style preference Elos, and benchmarks whose cross-model coverage is still too thin for defensible weighting all live there.

Qwen3.7 Max's entire multimodal record is one display-only row, a Design Arena Elo of 1303 synced from an aggregator. The number is real and readers can see it; it simply cannot buy ranking credit, because a preference rating from a third party is not a sourced result from the lab. The display-only tier is our compromise between showing everything we know and weighting only what survives verification, and the boundary between the tiers is itself part of the disclosure story: a lab whose category coverage consists entirely of display-only rows has published signals, not evidence.
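The tier boundary implies an invariant: removing every unweighted row should leave the ranking input unchanged. The following sketch checks that property on a toy row list. The plain mean is an assumed stand-in for the real scoring formula, which this post does not specify, and every value except the Design Arena Elo of 1303 is hypothetical.

```python
from statistics import mean


def ranking_score(rows):
    """Average only rows marked 'weighted' (assumed aggregation, for illustration)."""
    weighted = [r["value"] for r in rows if r["status"] == "weighted"]
    return mean(weighted) if weighted else None


def strip_unweighted(rows):
    """Drop display-only and estimated rows entirely."""
    return [r for r in rows if r["status"] == "weighted"]


rows = [
    {"benchmark": "hypothetical-coding", "value": 71.0, "status": "weighted"},
    {"benchmark": "hypothetical-knowledge", "value": 83.0, "status": "weighted"},
    {"benchmark": "Design Arena Elo", "value": 1303, "status": "display_only"},
    {"benchmark": "hypothetical-estimate", "value": 60.0, "status": "estimate"},
]

# The invariant: stripping unweighted rows must not move the score.
assert ranking_score(rows) == ranking_score(strip_unweighted(rows))
print(ranking_score(rows))
```

A check of this kind is cheap to run at build time, and it fails loudly if a display-only or estimated value ever leaks into a ranking computation.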

Reasoning at 13 of 30 illustrates the other soft failure mode. The category's harnesses are fragmented: long-context suites, abstract-reasoning sets, and multi-step evaluation frameworks each have several competing versions, and labs publish whichever subset their internal tooling supports. Many reasoning rows fail our exact-variant matching not because a lab hid a number but because the number describes a benchmark version we cannot reconcile with the one we track. Fragmentation and selectivity produce identical gaps in the table, and only the lab knows which one it is.

Who skips what

The multimodal gap clusters by lab, and naming the cluster requires precision about what is being claimed. The following statement is about our catalog, not about model capability: these models have no weighted multimodal benchmark results that meet the sourcing bar.

| Lab | Top-30 models without sourced multimodal rows |
|---|---|
| DeepSeek | V4 Pro (Max), V4 Pro (High), V4 Pro, V4 Flash (Max), V4 Flash (High) |
| Z.AI | GLM-5.2, GLM-5.1, GLM-5 |
| Alibaba | Qwen3.7 Max, Qwen3.5-27B |
| NVIDIA | Nemotron 3 Ultra |
| Microsoft | MAI-Thinking-1 |

Anthropic, OpenAI, Google, Moonshot, and MiniMax all have sourced multimodal coverage in the same cohort, so this is not a case of an impossible evaluation. The strangest entry is Alibaba, which published 15 multimodal rows for Qwen3.7 Plus and zero for the Max tier above it. The harness exists, the results do not, and readers can draw their own conclusion about which direction those results probably point.

For DeepSeek the pattern spans five models of one generation, which reads more like a policy than an oversight. For NVIDIA and Microsoft, single text-focused flagships, the charitable product-scope explanation carries more weight.

Volume is not the problem

Before anyone files this under a lazy geopolitical narrative, the depth numbers point the other way.

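| Lab | Models in top 30 | Average sourced benchmarks per model |
|---|---|---|
| Moonshot AI | 2 | 34 |
| Alibaba | 8 | 33 |
| Anthropic | 6 | 24 |
| OpenAI | 2 | 24 |
| DeepSeek | 5 | 23 |
| Z.AI | 3 | 21 |
| Google | 1 | 20 |
| NVIDIA | 1 | 18 |
| MiniMax | 1 | 15 |
| Microsoft | 1 | 14 |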

The labs with missing multimodal categories are among the heaviest publishers in the catalog. Alibaba averages 33 sourced rows per top-30 model, ahead of every Western lab on the board; Moonshot leads outright at 34. The disclosure gap is category-shaped, not volume-shaped. High-volume labs flood the categories they like and go silent in the ones they do not, which is arguably a stronger signal than thin coverage everywhere would be. A lab that publishes 33 results has an evaluation team, a harness budget, and an editorial process. What that lab omits, it omits on purpose or by a priority ranking that amounts to the same thing.

The same table torpedoes the reverse narrative too. Anthropic and OpenAI, at 24 rows apiece, publish fewer numbers than the top Chinese labs while covering categories more completely. Neither disclosure culture dominates the other; they fail differently.
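For a sense of scale, the per-lab averages can be combined into a board-wide mean by weighting each lab by its model count. This is a rough sketch: the figures are copied from the depth table, the per-lab averages appear to be rounded to whole numbers, and so the result is approximate. The variable names are illustrative.

```python
# (models in top 30, average sourced benchmarks per model), from the depth table.
labs = {
    "Moonshot AI": (2, 34),
    "Alibaba": (8, 33),
    "Anthropic": (6, 24),
    "OpenAI": (2, 24),
    "DeepSeek": (5, 23),
    "Z.AI": (3, 21),
    "Google": (1, 20),
    "NVIDIA": (1, 18),
    "MiniMax": (1, 15),
    "Microsoft": (1, 14),
}

total_models = sum(n for n, _ in labs.values())
total_rows = sum(n * avg for n, avg in labs.values())

# Model-weighted mean: each lab counts in proportion to its models on the board.
board_mean = total_rows / total_models
above_mean = [lab for lab, (_, avg) in labs.items() if avg > board_mean]

print(total_models, round(board_mean, 1), above_mean)
```

Under these figures the board-wide mean comes to roughly 25.6 sourced benchmarks per model, and only Moonshot AI and Alibaba sit above it.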

Two readings of one table

The charitable reading: evaluation is expensive, multimodal harnesses are genuinely painful to run, and a lab shipping a text-first model may reasonably deprioritize categories far from its product. Some of the gap is logistics, and the NVIDIA and Microsoft rows probably live mostly here.

The uncharitable reading: labs run these evaluations internally, look at the results, and publish the subset that flatters. Selective disclosure is invisible on a leaderboard that averages whatever exists, which until this month included ours. The Qwen3.7 Max case sits awkwardly for the charitable reading, because the same lab, same quarter, same modality stack published the full multimodal suite for the cheaper tier.

We cannot see inside the labs, so we decline to pick between the readings. We can change which reading pays.

What silence costs now

Since the July 4 scoring change, BenchLM scores every model against the full rubric, and an unsourced category is imputed conservatively: half cohort median, half the model's own demonstrated level. For a frontier model, that lands well below where a real result would likely land. Our verification-impact audit quantifies the cost per model, and the current top of the list reads like an invoice for silence:

| Model | Missing coverage | Projected gain if sourced at own level (points) |
|---|---|---|
| Qwen3.7 Max | multimodal, entire category | +1.9 |
| GLM-5.2 | multimodal, entire category | +1.8 |
| DeepSeek V4 Pro (Max) | multimodal, entire category | +1.5 |
| Claude Opus 4.8 | MMMU-Pro, single row | +1.3 |
| Claude Mythos 5 | OfficeQA Pro, single row | +1.1 |

Those projections assume each model performs at its own covered level, so they are sourcing priorities rather than predictions. Note the bottom two rows: the invoice is ecumenical, and Anthropic's flagships appear on it for single missing benchmarks. Nobody in the top five of the verified board has a clean sheet.
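The imputation rule and the projected gain can be written down in a few lines. This is a sketch under stated assumptions, not the production scoring code: the 50/50 blend comes from the description above, while the function names, the `category_weight` parameter, the equal weighting of eight categories, and all the example scores are hypothetical.

```python
from statistics import median


def impute(cohort_scores, own_level):
    """Half cohort median, half the model's own demonstrated level."""
    return 0.5 * median(cohort_scores) + 0.5 * own_level


def projected_gain(cohort_scores, own_level, category_weight):
    """Composite points gained if the category were sourced at the model's own level.

    Assumes the composite is a weighted sum of category scores, so replacing the
    imputed value moves the total by weight * (own_level - imputed).
    """
    return category_weight * (own_level - impute(cohort_scores, own_level))


# Hypothetical numbers: a strong model in a cohort with a lower median.
cohort = [60.0, 70.0, 80.0]   # sourced scores from other models in the category
own = 90.0                    # the model's demonstrated level elsewhere
weight = 0.125                # assumed: one of eight equally weighted categories

print(impute(cohort, own))                  # 80.0
print(projected_gain(cohort, own, weight))  # 1.25
```

The gain simplifies to half the weight times the gap between the model's own level and the cohort median, which is why the penalty is largest for models that sit furthest above their cohort.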

The direction is the point: under full-rubric scoring, publishing is worth rank, and silence has a posted price. Before July, a lab optimizing its BenchLM position had a mild incentive to withhold mediocre results. That incentive is now negative everywhere on the board, which is the only equilibrium a measurement site should be willing to operate in.

How to read any leaderboard after this

The disclosure ladder is not a BenchLM quirk; it is the shape of the public evidence every ranking is built from. Three habits transfer to reading anyone's table, including ours.

Check the denominator before the rank. A #4 built on 33 sourced results and a #5 built on 9 are different claims wearing the same font. We now print the sourced-row count on every verified-board entry for exactly this reason.

Ask which categories are load-bearing. A model ranked on coding, agentic, and knowledge alone has been graded on the subjects every lab studies for. The interesting information is usually in the columns that are empty, and an empty column on a verified board means no lab chose to fill it.

Treat single-number rankings as compressed arguments, not measurements. Every composite score encodes editorial decisions about weights, missing data, and protocol. The honest sites publish the decisions; the rest publish the number. We keep our methodology page current and our scoring constants in version control, and we would have readers extend to us the same skepticism we recommend toward everyone else.

A worked example of all three habits, using our own board: Claude Mythos 5 ranks first on the verified leaderboard with 17 sourced benchmarks, roughly half of Alibaba's per-model average of 33. The rank survives because the 17 cover seven of eight categories and win their head-to-heads, but a reader applying habit one would correctly hold that #1 more loosely than a #1 built on 40 rows, and we would rather the reader hold it loosely than trust it blindly.

The rows we are waiting on

A standing offer to every lab in the tables above: publish the missing results anywhere citable, and our verification pipeline will attach them within a refresh cycle. The evaluation is yours to run; the imputation is ours to delete.

The row we are watching hardest is Qwen3.7 Max multimodal, worth 1.9 points and the difference between a ranking that survives arguments and one that merely wins them. Alibaba already proved it can publish this category. The next disclosure decision is theirs, and either way, the board will say what the evidence says.

Frequently asked questions

Which benchmark category do AI labs report least?

Competition math is the least-sourced category on BenchLM's verified leaderboard: 4 of the top 30 models have citable math results as of July 2026, against 30 of 30 for coding and agentic work. Multilingual (10 of 30) and instruction following (12 of 30) are the next thinnest.

Why is there no multimodal score for Qwen3.7 Max on BenchLM?

Alibaba has not published weighted multimodal benchmark results for Qwen3.7 Max that meet BenchLM's sourcing bar; the only multimodal signal available is a display-only Design Arena rating. Qwen3.7 Plus, the same family's lower tier, has 15 sourced multimodal rows, so the harness clearly exists.

Do missing benchmarks lower a model's BenchLM score?

Yes. Since July 4, 2026, every unsourced category or weighted benchmark is imputed as a half-blend of the cohort median and the model's own demonstrated level, which lands below where strong models would likely score. Publishing real results replaces the conservative estimate with evidence.

Which AI lab publishes the most benchmark results?

By sourced-row depth in BenchLM's top 30, Moonshot AI leads with an average of 34 verified benchmarks per model, with Alibaba close behind at 33. Anthropic and OpenAI average 24. Depth and completeness differ: Alibaba's high volume still omits entire categories for some flagship tiers.

What counts as a sourced benchmark result on BenchLM?

A number published somewhere citable for the exact model variant: a model card, technical report, launch page, or equivalent primary document. Aggregator-reported values and estimated rows are tracked but never weighted, and provider-published sources are preferred when multiple candidates exist for the same benchmark.

How can an AI lab get missing scores added to BenchLM?

Publish the results somewhere citable: a model card, technical report, or launch page with exact numbers for the exact model variant. BenchLM's verification pipeline attaches primary sources to every weighted row, and the verification-impact audit shows which missing rows would move each model most.
