About BenchLM

BenchLM tracks how AI models perform on standardized benchmarks. We currently cover 88 models across 22 benchmarks in six categories: coding, math, knowledge, reasoning, instruction following, and multilingual.

The site exists because comparing AI models is harder than it should be. Benchmark results are scattered across papers, blog posts, and Twitter threads. We pull them into one place so you can see how models actually stack up.

Where the data comes from

Benchmark scores come from OpenBench, official model papers and release announcements, and public leaderboards. We cross-reference multiple sources whenever possible. When sources report conflicting scores, we use the one from the most controlled evaluation environment.
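
As a rough illustration of that conflict rule, here is a minimal sketch in Python. The priority ordering and names are our assumptions for the example, not BenchLM's actual pipeline or a documented ranking of its sources.

    # Hypothetical priority order, most controlled evaluation first.
    # This ordering is an assumption for illustration, not BenchLM policy.
    SOURCE_PRIORITY = ["openbench", "model_paper", "public_leaderboard"]

    def resolve_score(reports):
        """Pick one score when sources disagree.

        reports maps a source name to the score it reported for a
        single (model, benchmark) pair. Returns the score from the
        highest-priority source that reported one.
        """
        for source in SOURCE_PRIORITY:
            if source in reports:
                return reports[source]
        raise ValueError("no recognized source reported a score")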

Chatbot Arena Elo scores are tracked separately from benchmark performance. Elo captures human preference in blind pairwise comparisons, which is useful but distinct from task accuracy.
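
For intuition about what an Elo gap means, the standard Elo expectation converts a rating difference into a predicted preference rate. This is the textbook formula, not necessarily the exact fitting procedure Chatbot Arena uses:

    def elo_expected_win(rating_a, rating_b):
        """Probability that A is preferred over B under the standard
        Elo model. A 100-point gap predicts roughly a 64% win rate."""
        return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))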

How scoring works

Each model gets an overall score calculated as a weighted average of category averages. Within a category, all benchmarks are weighted equally. The category weights are:

  • Coding — 25%
  • Knowledge — 20%
  • Math — 20%
  • Reasoning — 20%
  • Instruction Following — 10%
  • Multilingual — 5%

Coding gets the highest weight because its benchmarks (HumanEval, SWE-bench Verified, LiveCodeBench) are the most differentiated, showing the clearest separation between models. The full methodology is on the homepage.
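
The weighting above is simple enough to state in a few lines. A minimal sketch, assuming each category average is a plain mean of its benchmark scores on a common scale (the function and data layout are illustrative, not BenchLM's code):

    # Category weights from the list above (sum to 1.0).
    WEIGHTS = {
        "coding": 0.25,
        "knowledge": 0.20,
        "math": 0.20,
        "reasoning": 0.20,
        "instruction_following": 0.10,
        "multilingual": 0.05,
    }

    def overall_score(benchmark_scores):
        """benchmark_scores maps a category to a model's list of
        per-benchmark scores. Every benchmark within a category counts
        equally; category averages are then combined with the fixed
        weights. Assumes each category has at least one score."""
        total = 0.0
        for category, scores in benchmark_scores.items():
            total += WEIGHTS[category] * (sum(scores) / len(scores))
        return total

For example, a model averaging 90 on coding and 80 in every other category would score 0.25 * 90 + 0.75 * 80 = 82.5 overall.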

Update frequency

We update scores when new models are released or when existing benchmarks are re-run with updated evaluation protocols. The data was last updated on March 7, 2026.

What we don't track

We don't track pricing, latency, or throughput. Those change too frequently and depend on provider, tier, and region. We also don't run our own evaluations — we aggregate results from standardized public benchmarks.

Who runs this

BenchLM is built and maintained by @glevd. If you spot an error in the data or want to suggest a model or benchmark to add, reach out on X.