How BenchLM scores models
BenchLM combines benchmark scores, freshness rules, provenance filters, pricing metadata, and runtime snapshots into a leaderboard that is useful for real model selection. The goal is not a single magic number. The goal is a leaderboard where the number is understandable and defensible.
Trusted rows rank higher
BenchLM ranks a model only when enough trustworthy benchmark coverage exists. Generated or thin coverage can still appear, but it is discounted or excluded from the public ranking logic.
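A minimal sketch of what such an eligibility gate could look like. The field names, provenance labels, and the minimum-coverage threshold below are illustrative assumptions, not BenchLM's actual values.

```typescript
// Hypothetical shape of one benchmark result attached to a model row.
interface BenchmarkResult {
  benchmark: string;
  score: number;
  provenance: "manual-verified" | "generated"; // assumed labels
}

// ASSUMPTION: the minimum number of trusted results needed to rank publicly.
const MIN_TRUSTED_RESULTS = 5;

// A model is publicly rankable only with enough verified coverage;
// generated results may still be displayed, but are never counted here.
function isRankable(results: BenchmarkResult[]): boolean {
  const trusted = results.filter((r) => r.provenance === "manual-verified");
  return trusted.length >= MIN_TRUSTED_RESULTS;
}
```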
Freshness is explicit
Every benchmark now carries BenchLM freshness metadata: version, refresh cadence, staleness state, saturation state, and whether the benchmark is weighted or display-only.
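The fields listed above map naturally onto a small record type. This is a sketch of one plausible shape; the exact field names and enum values are assumptions.

```typescript
// Plausible shape of BenchLM's per-benchmark freshness metadata
// (field names and enum values are illustrative assumptions).
type StalenessState = "current" | "stale";
type SaturationState = "unsaturated" | "saturated";

interface BenchmarkFreshness {
  version: string;             // e.g. "v2"
  refreshCadenceDays: number;  // how often scores are re-pulled
  staleness: StalenessState;   // strong differentiator, or context only
  saturation: SaturationState; // whether top models have maxed the benchmark
  weighted: boolean;           // true = counts toward the score; false = display-only
}
```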
Operational tradeoffs stay visible
Pricing comes from BenchLM's pricing catalog. Runtime metrics use the current Artificial Analysis snapshot dated 2026-03-30 when BenchLM has a direct model match; otherwise the table shows N/A instead of estimating.
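A sketch of the match-or-N/A rule described above; the snapshot shape and lookup key are assumed, not taken from BenchLM's actual schema.

```typescript
// Hypothetical runtime snapshot keyed by an exact model identifier.
type RuntimeSnapshot = Map<string, { tokensPerSecond: number; ttftMs: number }>;

// Runtime metrics render only on a direct model match;
// BenchLM never estimates, so a miss renders as "N/A".
function runtimeCell(snapshot: RuntimeSnapshot, modelId: string): string {
  const entry = snapshot.get(modelId);
  return entry ? `${entry.tokensPerSecond} tok/s` : "N/A";
}
```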
Category weights
Agentic: 22% (3 weighted benchmarks, 6 display-only)
Coding: 20% (4 weighted benchmarks, 7 display-only)
Reasoning: 17% (4 weighted benchmarks, 5 display-only)
Multimodal: 12% (2 weighted benchmarks, 3 display-only)
Knowledge: 12% (6 weighted benchmarks, 3 display-only)
Multilingual: 7% (2 weighted benchmarks, 0 display-only)
Instruction Following: 5% (1 weighted benchmark, 0 display-only)
Math: 5% (3 weighted benchmarks, 5 display-only)
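With the weights above, the overall score is a weighted average of per-category scores. A minimal sketch, assuming category scores are already normalized to a common 0-100 scale; the function name and the gap-handling policy are assumptions.

```typescript
// Category weights from the list above (fractions sum to 1.0).
const CATEGORY_WEIGHTS: Record<string, number> = {
  agentic: 0.22,
  coding: 0.20,
  reasoning: 0.17,
  multimodal: 0.12,
  knowledge: 0.12,
  multilingual: 0.07,
  instructionFollowing: 0.05,
  math: 0.05,
};

// Weighted average over whatever categories the model has scores for,
// renormalizing so a missing category doesn't silently drag the score down.
// (Renormalization is an assumption; the source doesn't specify gap handling.)
function overallScore(categoryScores: Record<string, number>): number {
  let weightedSum = 0;
  let weightUsed = 0;
  for (const [category, score] of Object.entries(categoryScores)) {
    const weight = CATEGORY_WEIGHTS[category];
    if (weight === undefined) continue;
    weightedSum += weight * score;
    weightUsed += weight;
  }
  return weightUsed > 0 ? weightedSum / weightUsed : NaN;
}
```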
Benchmarks by category
Agentic: 9 tracked benchmarks
Coding: 11 tracked benchmarks
Reasoning: 9 tracked benchmarks
Multimodal: 5 tracked benchmarks
Knowledge: 9 tracked benchmarks
Multilingual: 2 tracked benchmarks
Instruction Following: 1 tracked benchmark
Math: 8 tracked benchmarks

Each category's tracked count is the sum of its weighted and display-only benchmarks from the weights list above.
BenchLM defaults and caveats
BenchLM's freshness labels are a product layer, not statements made by or on behalf of official benchmark maintainers. A benchmark marked Current means BenchLM still treats it as a strong differentiator. A benchmark marked Stale is still useful for context but is no longer relied on heavily to separate frontier models.
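In scoring terms, Stale plausibly translates into a reduced contribution rather than outright removal. A sketch under that assumption; the discount multiplier is invented for illustration, since the source only says stale benchmarks are "no longer relied on heavily".

```typescript
// ASSUMPTION: a stale benchmark keeps a reduced weight instead of being dropped.
// The 0.25 multiplier is illustrative; the actual discount is not published.
function effectiveWeight(baseWeight: number, staleness: "current" | "stale"): number {
  return staleness === "stale" ? baseWeight * 0.25 : baseWeight;
}
```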
Public BenchLM benchmark tables only expose verified manual scores. The build now fails if generated benchmark values leak into the public model catalog.
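A build-time guard of the kind described might look like the sketch below; the catalog shape, provenance labels, and error message are assumptions.

```typescript
// Hypothetical catalog row: every public score must carry verified provenance.
interface PublicScore {
  model: string;
  benchmark: string;
  score: number;
  provenance: "manual-verified" | "generated";
}

// Fail the build loudly if a generated value leaks into the public catalog.
function assertNoGeneratedScores(catalog: PublicScore[]): void {
  const leaked = catalog.filter((row) => row.provenance !== "manual-verified");
  if (leaked.length > 0) {
    throw new Error(
      `Build failed: ${leaked.length} generated score(s) in public catalog, ` +
        `e.g. ${leaked[0].model} / ${leaked[0].benchmark}`
    );
  }
}
```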
BenchLM does not estimate runtime metrics when no sourced runtime snapshot is available. The leaderboard and model pages show N/A instead. Pricing sort uses the average of input and output token price so models can be ranked on a single cost column while the full input/output pair remains visible.
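The single cost column described above reduces to a simple average of the two prices. A minimal sketch; the field names and per-million-token units are assumptions.

```typescript
// Per-token prices in USD per million tokens (units are an assumption).
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Single sortable cost column: the mean of input and output token price.
// The full input/output pair stays visible elsewhere in the table.
function blendedPrice(p: Pricing): number {
  return (p.inputPerMTok + p.outputPerMTok) / 2;
}
```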
Last benchmark dataset refresh: March 30, 2026. For raw benchmark exploration, use the benchmark directory. For current provider rollups, use provider pages.