Methodology

How BenchLM scores models

BenchLM combines benchmark scores, freshness rules, provenance filters, pricing metadata, and runtime snapshots into a leaderboard built for real model selection. The goal is not a single magic number; it is a ranking where every number is understandable and defensible.

Trusted rows rank higher

BenchLM ranks a model only when enough trustworthy benchmark coverage exists for it. Generated or thin coverage can still appear, but it is discounted or excluded from the public ranking logic.
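The gating rule above can be sketched as a simple coverage check. This is a minimal illustration, not BenchLM's actual implementation: the row shape, the `provenance` values, and the `MIN_VERIFIED_ROWS` threshold are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical score row; field names are illustrative, not BenchLM's schema.
@dataclass
class ScoreRow:
    model: str
    benchmark: str
    provenance: str  # e.g. "verified" or "generated"

MIN_VERIFIED_ROWS = 3  # assumed threshold, for illustration only

def rankable(rows: list[ScoreRow], model: str) -> bool:
    """A model enters the public ranking only with enough verified coverage."""
    verified = [r for r in rows
                if r.model == model and r.provenance == "verified"]
    return len(verified) >= MIN_VERIFIED_ROWS
```

Generated rows still exist in the data; they simply never count toward the threshold that admits a model to the ranking.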

Freshness is explicit

Every benchmark now carries BenchLM freshness metadata: version, refresh cadence, staleness state, saturation state, and whether the benchmark is weighted or display-only.
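The metadata fields listed above could be modeled roughly as follows. The field and type names are assumptions made for illustration, not BenchLM's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Staleness(Enum):
    CURRENT = "current"
    STALE = "stale"

# Illustrative shape of per-benchmark freshness metadata.
@dataclass(frozen=True)
class BenchmarkFreshness:
    version: str
    refresh_cadence_days: int   # how often the dataset is refreshed
    staleness: Staleness
    saturated: bool             # top models cluster near the ceiling
    weighted: bool              # False means display-only: shown, not scored

example = BenchmarkFreshness(
    version="v1.2",
    refresh_cadence_days=90,
    staleness=Staleness.CURRENT,
    saturated=False,
    weighted=True,
)
```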

Operational tradeoffs stay visible

Pricing comes from BenchLM's pricing catalog. Runtime metrics use the current Artificial Analysis snapshot dated 2026-03-30 when BenchLM has a direct model match; otherwise the table shows N/A instead of estimating.

Category weights

Benchmarks by category

BenchLM defaults and caveats

BenchLM uses benchmark freshness as a product layer, not as a claim about an official benchmark maintainer. A benchmark marked Current means BenchLM still treats it as a strong differentiator. A benchmark marked Stale means it is still useful for context but is no longer relied on heavily to separate frontier models.
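One way to express "Stale stays visible but separates less" is to downweight stale benchmarks rather than drop them. The 0.25 multiplier below is an assumed value for illustration; the source does not specify the actual factor.

```python
# Assumed multipliers: Current benchmarks keep full weight, Stale ones
# are heavily discounted but not zeroed.
STALENESS_WEIGHT = {"current": 1.0, "stale": 0.25}

def effective_weight(base_weight: float, staleness: str, weighted: bool) -> float:
    """Combine the category weight with freshness and the weighted flag."""
    if not weighted:  # display-only benchmarks never affect the score
        return 0.0
    return base_weight * STALENESS_WEIGHT[staleness]
```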

Public BenchLM benchmark tables only expose verified manual scores. The build now fails if generated benchmark values leak into the public model catalog.
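A build-time gate like the one described can be sketched as a check that fails loudly when any non-verified score reaches the public catalog. The row shape and function name are hypothetical.

```python
def assert_public_catalog_clean(rows: list[dict]) -> None:
    """Fail the build if a non-verified score leaks into the public catalog."""
    leaked = [r for r in rows if r.get("provenance") != "verified"]
    if leaked:
        raise RuntimeError(
            f"{len(leaked)} generated/unverified benchmark value(s) "
            "found in the public model catalog"
        )
```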

BenchLM does not estimate runtime metrics when no sourced runtime snapshot is available. The leaderboard and model pages show N/A instead. Pricing sort uses the average of input and output token price so models can be ranked on a single cost column while the full input/output pair remains visible.
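The pricing sort described above reduces to a simple key function over the input/output pair. The model records and prices below are illustrative placeholders (USD per million tokens), not catalog data.

```python
# Hypothetical catalog rows with separate input and output token prices.
models = [
    {"name": "a", "input_price": 3.0, "output_price": 15.0},
    {"name": "b", "input_price": 1.0, "output_price": 5.0},
]

def price_key(m: dict) -> float:
    """Single cost column: mean of input and output token price."""
    return (m["input_price"] + m["output_price"]) / 2

# Cheapest first; both underlying prices stay available on each row.
cheapest_first = sorted(models, key=price_key)
```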

Last benchmark dataset refresh: March 30, 2026. For raw benchmark exploration, use the benchmark directory. For current provider rollups, use provider pages.