About BenchLM

BenchLM tracks how AI models perform on public benchmarks that actually help with model selection. We currently cover 195 models across 82 benchmarks in eight categories: agentic, coding, multimodal & grounded, reasoning, knowledge, instruction following, multilingual, and math.

The site exists because AI evaluation is fragmented. Benchmark results are scattered across model cards, launch posts, benchmark leaderboards, papers, and screenshots. BenchLM pulls those rows into one place, applies ranking guardrails, and tries to make the final score understandable instead of mystical.

Where the data comes from

Benchmark scores come from OpenBench, official provider tables, model cards, release announcements, benchmark-native public leaderboards, and other public documentation. We cross-reference multiple sources when possible. If two public rows conflict, BenchLM prefers the more precise or better-controlled source rather than whichever number looks better.

Pricing and runtime metadata are tracked separately from benchmark performance. They are useful for decisions, but they are not treated as benchmark rows and do not directly change the benchmark backbone.

How scoring works

BenchLM starts from a benchmark backbone. Within each category, weighted benchmark rows are normalized and blended into a category score. The overall score then combines those category scores using fixed category weights. Display-only benchmarks stay visible for context, but they do not directly affect the weighted ranking. The category weights are:

  • Agentic 22%
  • Coding 20%
  • Reasoning 17%
  • Multimodal & Grounded 12%
  • Knowledge 12%
  • Multilingual 7%
  • Instruction Following 5%
  • Math 5%

BenchLM also applies bounded external consensus calibration to some public display scores. That calibration is intentionally limited: it can nudge benchmark-only output when the public benchmark backbone clearly misranks frontier rows, but it does not replace the benchmark backbone. Runtime metrics, API pricing, and provider marketing claims do not override benchmark math.

Agentic carries the highest weight because the frontier has shifted from “who answers best” to “who can actually complete a workflow.” Coding and reasoning still matter heavily. Harder, less saturated benchmarks are favored over legacy rows that mostly act as floor checks. The full public methodology lives on the methodology page.
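
To make the math concrete, here is a minimal Python sketch of that pipeline, assuming a plain weighted average for both the category blend and the overall score. The weight table mirrors the list above, but every name and the calibration bound (max_nudge) are illustrative placeholders, not BenchLM's actual code.

    # Illustrative sketch only: names, the normalization choice, and the
    # calibration bound are assumptions, not BenchLM's implementation.

    CATEGORY_WEIGHTS = {
        "agentic": 0.22,
        "coding": 0.20,
        "reasoning": 0.17,
        "multimodal_grounded": 0.12,
        "knowledge": 0.12,
        "multilingual": 0.07,
        "instruction_following": 0.05,
        "math": 0.05,
    }

    def category_score(rows):
        """Blend weighted, normalized benchmark rows into one category score.

        `rows` is a list of (normalized_score, benchmark_weight) pairs;
        display-only benchmarks are simply never passed in.
        """
        total_weight = sum(w for _, w in rows)
        if total_weight == 0:
            return None  # no weighted coverage in this category
        return sum(score * w for score, w in rows) / total_weight

    def overall_score(category_scores):
        """Combine category scores using the fixed category weights above."""
        covered = {c: s for c, s in category_scores.items() if s is not None}
        weight_sum = sum(CATEGORY_WEIGHTS[c] for c in covered)
        return sum(CATEGORY_WEIGHTS[c] * s for c, s in covered.items()) / weight_sum

    def apply_bounded_calibration(benchmark_only_score, consensus_score, max_nudge=2.0):
        """Nudge a display score toward external consensus, within a hard bound.

        `max_nudge` is a placeholder; the point is only that the adjustment
        is clamped, so consensus can never replace the benchmark backbone.
        """
        delta = consensus_score - benchmark_only_score
        delta = max(-max_nudge, min(max_nudge, delta))
        return benchmark_only_score + delta

The property that matters in this sketch is the clamp in apply_bounded_calibration: external consensus can only move a display score within a fixed band, so benchmark math always dominates the result.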

Verified vs provisional

BenchLM separates verified and provisional views. Verified views use only sourced public rows with displayable verification status. Provisional views can include additional non-generated public rows while exact citations are still being attached. Generated benchmark rows are excluded from public rankings.

Every score also carries a confidence signal based on how much non-generated benchmark coverage supports it. Broad, multi-category coverage deserves more trust than a sparse, single-category spike.
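
As a rough illustration of those rules, the sketch below filters rows for the two views and derives a coverage-based confidence label. The row fields, thresholds, and labels are assumptions made for the example, not BenchLM's actual schema.

    # Illustrative sketch only: fields and thresholds are placeholders chosen
    # to show the filtering rules, not BenchLM's data model.

    from dataclasses import dataclass

    @dataclass
    class BenchmarkRow:
        benchmark: str
        category: str
        score: float
        generated: bool        # generated rows never enter public rankings
        has_citation: bool     # exact public citation attached
        verification: str      # e.g. "verified", "pending"

    def rows_for_view(rows, view):
        """Select which public rows feed a given view.

        Verified views require sourced rows with a displayable verification
        status; provisional views also admit non-generated rows that are
        still waiting on exact citations.
        """
        public = [r for r in rows if not r.generated]
        if view == "verified":
            return [r for r in public if r.has_citation and r.verification == "verified"]
        return public  # provisional

    def confidence_signal(rows):
        """Rough confidence label from non-generated benchmark coverage.

        Broad, multi-category coverage earns more trust than a sparse,
        single-category spike. The thresholds here are placeholders.
        """
        usable = [r for r in rows if not r.generated]
        categories = {r.category for r in usable}
        if len(usable) >= 20 and len(categories) >= 5:
            return "high"
        if len(usable) >= 8 and len(categories) >= 3:
            return "medium"
        return "low"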

Update frequency

BenchLM updates when new benchmark rows appear, when benchmark protocols materially change, and when new tracked model releases justify a refresh. The current benchmark dataset was last updated on April 21, 2026.

What we don't track

BenchLM is not a private eval lab. We do not run hidden internal model shootouts, and we do not claim to replace hands-on testing in your own workflow. We also do not treat provider contract pricing, regional pricing, or one-off promotional rates as stable source-of-record pricing data. BenchLM is best used for public benchmark performance, relative tradeoffs, and shortlist construction.

Who runs this

BenchLM is built and maintained by @glevd. If you spot an error in the data or want to suggest a model or benchmark to add, reach out on X.