How BenchLM scores models
BenchLM combines benchmark scores, freshness rules, provenance filters, pricing metadata, and runtime snapshots into a leaderboard that is useful for real model selection. The goal is not a single magic number. The goal is a leaderboard where the number is understandable and defensible.
Verified and provisional are separate
BenchLM now separates a verified leaderboard, built only from rows with attached sources, from a broader provisional leaderboard. Generated rows are excluded from both. Provisional ranking may still use public, non-generated rows whose sources are not yet verified, until exact citations are attached.
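The split above can be sketched as a simple partition over row provenance flags. This is a minimal illustration, not BenchLM's actual schema; the field names (`generated`, `exact_citation`) are assumptions.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkRow:
    model: str
    score: float
    generated: bool        # value produced by a model rather than measured
    exact_citation: bool   # an exact source citation is attached

def split_rows(rows):
    """Partition rows into verified and provisional pools.

    Generated rows are dropped from both pools. Verified requires an
    exact citation; provisional admits any non-generated row, including
    those still awaiting citation attachment.
    """
    non_generated = [r for r in rows if not r.generated]
    verified = [r for r in non_generated if r.exact_citation]
    provisional = non_generated
    return verified, provisional
```

Note that under this reading the provisional pool is a superset of the verified pool: attaching a citation promotes a row without removing it from provisional ranking.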
Freshness is explicit
Every benchmark now carries BenchLM freshness metadata: version, refresh cadence, staleness state, saturation state, and whether the benchmark is weighted or display-only.
Multi-signal calibration is bounded
BenchLM starts from non-generated benchmark coverage, then applies bounded category-specific calibration using external leaderboard consensus signals. Each category blends its benchmark backbone with relevant external signals at fixed weights. Categories without a matching external signal remain purely benchmark-driven. Runtime metrics are sourced separately and updated as of 2026-04-07.
Category weights
Agentic
22% weight; 6 weighted benchmarks and 23 display-only benchmarks.
Coding
20% weight; 5 weighted benchmarks and 10 display-only benchmarks.
Reasoning
17% weight; 4 weighted benchmarks and 10 display-only benchmarks.
Multimodal
12% weight; 2 weighted benchmarks and 39 display-only benchmarks.
Knowledge
12% weight; 6 weighted benchmarks and 10 display-only benchmarks.
Multilingual
7% weight; 2 weighted benchmarks and 6 display-only benchmarks.
Instruction Following
5% weight; 2 weighted benchmarks and 0 display-only benchmarks.
Math
5% weight; 4 weighted benchmarks and 13 display-only benchmarks.
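The category weights listed above sum to 100%, so the overall score can be read as a weighted average of category scores. A minimal sketch, assuming a straight weighted average with renormalization over whichever categories a model actually has scores for (how BenchLM handles missing categories is not stated in this document):

```python
# Category weights from the list above; they sum to 1.0.
CATEGORY_WEIGHTS = {
    "agentic": 0.22, "coding": 0.20, "reasoning": 0.17,
    "multimodal": 0.12, "knowledge": 0.12, "multilingual": 0.07,
    "instruction_following": 0.05, "math": 0.05,
}

def overall_score(category_scores: dict) -> float:
    """Weighted average over the categories a model has scores for.

    Renormalizing over present categories is an assumption made for
    this sketch, not a documented BenchLM behavior.
    """
    present = {c: w for c, w in CATEGORY_WEIGHTS.items() if c in category_scores}
    total = sum(present.values())
    return sum(category_scores[c] * w for c, w in present.items()) / total
```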
Benchmarks by category
Agentic
29 tracked benchmarks
Coding
15 tracked benchmarks
Reasoning
14 tracked benchmarks
Multimodal
41 tracked benchmarks
Knowledge
16 tracked benchmarks
Multilingual
8 tracked benchmarks
Instruction Following
2 tracked benchmarks
Future tracked families
BenchLM tracks a small number of important benchmark families that are intentionally not weighted yet because exact-source density and cross-model coverage are still too thin for defensible ranking use.
Calibration approach
External consensus signals
BenchLM blends its benchmark backbone with external leaderboard consensus signals at fixed, category-specific weights. These signals act as bounded corrections — they cannot override the benchmark backbone, only nudge scores where benchmark-only output clearly misranks frontier models. The benchmark backbone always carries the majority weight. Categories without a matching external signal remain 100% benchmark-driven.
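The bounded-correction idea can be sketched as a fixed-weight blend with a clamp on how far the result may move from the backbone. The specific numbers here (`signal_weight=0.2`, `max_shift=3.0`) are illustrative assumptions, not BenchLM's actual parameters.

```python
def calibrate(benchmark_score: float, external_signal: float,
              signal_weight: float = 0.2, max_shift: float = 3.0) -> float:
    """Blend the benchmark backbone with an external consensus signal.

    The backbone always carries majority weight, and the resulting
    correction is clamped so the external signal can nudge a score but
    never override the backbone.
    """
    assert signal_weight < 0.5  # backbone keeps majority weight
    blended = (1 - signal_weight) * benchmark_score + signal_weight * external_signal
    shift = max(-max_shift, min(max_shift, blended - benchmark_score))
    return benchmark_score + shift
```

With these example parameters, an external signal 20 points above the backbone would pull the blend up 4 points, but the clamp limits the applied correction to 3.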
Runtime metrics
Runtime metrics (tokens/sec, time-to-first-token) stay separate from ranking and are shown as operational metadata only. They do not affect overall or category scores.
Source refresh: 2026-04-07
BenchLM defaults and caveats
BenchLM uses benchmark freshness as a product layer, not as a claim about an official benchmark maintainer. A benchmark marked Current means BenchLM still treats it as a strong differentiator. A benchmark marked Stale means it is still useful for context but is no longer relied on heavily to separate frontier models.
Public BenchLM benchmark tables default to exact-source rows only. Users can opt into provisional rows, which are non-generated but still awaiting exact citation attachment. Generated benchmark values remain excluded from public benchmark tables.
BenchLM's public rankings use calibrated display scores, not the raw stored `overallScore` field inside the JSON. The benchmark backbone comes first, then BenchLM applies bounded external calibration using external consensus signals for coding, agentic, and final overall display ordering.
BenchLM does not estimate runtime metrics when no sourced runtime snapshot is available. The leaderboard and model pages show N/A instead. Pricing sort uses the average of input and output token price so models can be ranked on a single cost column while the full input/output pair remains visible.
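Both behaviors described above are simple to state in code. A minimal sketch (function names are hypothetical, not BenchLM's API):

```python
def price_sort_key(input_price: float, output_price: float) -> float:
    """Single cost column: the average of input and output token price."""
    return (input_price + output_price) / 2

def runtime_display(tokens_per_sec):
    """Show N/A when no sourced runtime snapshot exists; never estimate."""
    if tokens_per_sec is None:
        return "N/A"
    return f"{tokens_per_sec:.1f} tok/s"
```

For example, a model priced at $3 per million input tokens and $15 per million output tokens would sort at $9, while its full input/output pair stays visible in the table.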
Last benchmark dataset refresh: April 10, 2026. For raw benchmark exploration, use the benchmark directory. For current provider rollups, use provider pages.