Methodology

How BenchLM scores models

BenchLM combines benchmark scores, freshness rules, provenance filters, pricing metadata, and runtime snapshots into a leaderboard that is useful for real model selection. The goal is not a single magic number. The goal is a leaderboard where the number is understandable and defensible.

Verified and provisional are separate

BenchLM now separates a sourced-only verified leaderboard from a broader provisional leaderboard. Generated rows are excluded from both. Provisional ranking can still use source-unverified non-generated public rows until exact citations are attached.

Freshness is explicit

Every benchmark now carries BenchLM freshness metadata: version, refresh cadence, staleness state, saturation state, and whether the benchmark is weighted or display-only.

Multi-signal calibration is bounded

BenchLM starts from non-generated benchmark coverage, then applies bounded category-specific calibration using external leaderboard consensus signals. Each category blends its benchmark backbone with relevant external signals at fixed weights. Categories without a matching external signal remain purely benchmark-driven. Runtime metrics are sourced separately and updated as of 2026-05-01.

Category weights

Agentic

22%

6 weighted benchmarks and 29 display-only benchmarks.

Coding

20%

5 weighted benchmarks and 17 display-only benchmarks.

Reasoning

17%

4 weighted benchmarks and 18 display-only benchmarks.

Multimodal

12%

4 weighted benchmarks and 41 display-only benchmarks.

Knowledge

12%

6 weighted benchmarks and 21 display-only benchmarks.

Multilingual

2 weighted benchmarks and 6 display-only benchmarks.

Instruction Following

2 weighted benchmarks and 1 display-only benchmark.

Math

Weighted

MGSM MMLU-ProX

Display only

NOVA-63 INCLUDE PolyMath VWT2k-lite MAXIFE SWE Multilingual

Instruction Following

3 tracked benchmarks

View leaderboard

Weighted

IFEval IFBench

Display only

SOB Value Acc

Math

23 tracked benchmarks

View leaderboard

Future tracked families

BenchLM tracks a small number of important benchmark families that are intentionally not weighted yet because exact-source density and cross-model coverage are still too thin for defensible ranking use.

Safety / alignment · tracking-onlyHallucination / factuality · tracking-onlyAudio / speech · tracking-onlyEmbodied / robotics · tracking-only

Calibration approach

External consensus signals

BenchLM blends its benchmark backbone with external leaderboard consensus signals at fixed, category-specific weights. These signals act as bounded corrections — they cannot override the benchmark backbone, only nudge scores where benchmark-only output clearly misranks frontier models. The benchmark backbone always carries the majority weight. Categories without a matching external signal remain 100% benchmark-driven.

Runtime metrics

Runtime metrics (tokens/sec, time-to-first-token) stay separate from ranking and are shown as operational metadata only. They do not affect overall or category scores.

Source refresh: 2026-05-01

BenchLM defaults and caveats

BenchLM uses benchmark freshness as a product layer, not as a claim about an official benchmark maintainer. A benchmark marked Current means BenchLM still treats it as a strong differentiator. A benchmark marked Stale means it is still useful for context but is no longer relied on heavily to separate frontier models.

Public BenchLM benchmark tables default to exact-source rows only. Users can opt into provisional rows, which are non-generated but still awaiting exact citation attachment. Generated benchmark values remain excluded from public benchmark tables.

BenchLM's public rankings use calibrated display scores, not the raw stored `overallScore` field inside the JSON. The benchmark backbone comes first, then BenchLM applies bounded external calibration using external consensus signals for coding, agentic, and final overall display ordering.

BenchLM does not estimate runtime metrics when no sourced runtime snapshot is available. The leaderboard and model pages show N/A instead. Pricing sort uses the average of input and output token price so models can be ranked on a single cost column while the full input/output pair remains visible.

Last benchmark dataset refresh: May 13, 2026. For raw benchmark exploration, use the benchmark directory. For current provider rollups, use provider pages.