BenchLM.ai now distinguishes provisional overall ranking from verified overall ranking. The provisional score is a normalized weighted average across 8 benchmark categories: agentic (22%), coding (20%), reasoning (17%), knowledge (12%), multimodal & grounded (12%), multilingual (7%), instruction following (5%), and math (5%), using non-generated benchmark coverage plus bounded external consensus calibration. The verified leaderboard is stricter and only counts sourced benchmark rows. Each score includes a confidence indicator (1-4 dots) based on how much sourced coverage supports it. Display-only benchmarks — including MMLU, OpenBookQA, HumanEval, FLTEval, BBH, LisanBench, and older AIME/HMMT variants — remain visible for context but do not affect ranking.
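The provisional score described above is a plain weighted average. A minimal sketch in Python, using the category weights stated in the text; the model's category scores are invented for illustration and are not BenchLM.ai data:

```python
# Category weights from BenchLM.ai's provisional scoring formula (sum to 1.0).
WEIGHTS = {
    "agentic": 0.22,
    "coding": 0.20,
    "reasoning": 0.17,
    "knowledge": 0.12,
    "multimodal_grounded": 0.12,
    "multilingual": 0.07,
    "instruction_following": 0.05,
    "math": 0.05,
}

def provisional_score(category_scores: dict) -> float:
    """Weighted average over the 8 benchmark categories."""
    return sum(WEIGHTS[cat] * category_scores[cat] for cat in WEIGHTS)

# Hypothetical normalized category scores for an illustrative model.
example = {
    "agentic": 90, "coding": 88, "reasoning": 85, "knowledge": 80,
    "multimodal_grounded": 78, "multilingual": 75,
    "instruction_following": 82, "math": 70,
}
print(round(provisional_score(example), 2))  # 83.66
```

Note that the heavier agentic and coding weights mean improvements in those two categories move the overall score more than twice as much as equal improvements in instruction following or math.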
Unless noted otherwise, ranking surfaces on this page use BenchLM's provisional leaderboard lane rather than the stricter sourced-only verified leaderboard.
Bottom line: Claude Mythos Preview leads overall, but Claude Opus 4.7 and GPT-5.4 are within striking distance, and both are significantly cheaper.
According to BenchLM.ai, Claude Mythos Preview leads this ranking with a score of 99, followed by Claude Opus 4.7 (97) and GPT-5.4 (93). The gap to Claude Opus 4.7 is narrow, but the 6-point spread from #1 to #3 reflects genuine performance differences.
The best open-weight option is GLM-5.1 (ranked #10 with a score of 84). While proprietary models lead by a meaningful margin, open-weight options remain viable for teams willing to trade some performance for full model control.
This ranking is based on provisional overall weighted scores computed with BenchLM.ai's scoring formula. For detailed model profiles, click any model name below. To compare two specific models head-to-head, use the "vs #" links.
Claude Mythos Preview
Anthropic · 1M
Highest overall score. Leads agentic and coding. Premium-priced.
Claude Opus 4.7
Anthropic · 1M
GPT-5.4
OpenAI · 1.05M
Claude Mythos Preview entered at #1 with the highest overall score on BenchLM.
GPT-5.4 holds a strong #3, with competitive scores across categories.
Claude Opus 4.7 holds #2, the most consistent model across all 8 benchmark categories.
Get notified when models move. One email a week with what changed and why.
Free. No spam. Unsubscribe anytime.
The top model is Claude Mythos Preview by Anthropic with a provisional score of 99.
The best open-weight model is GLM-5.1 at position #10.
110 models are included in this ranking.
The overall score is a weighted average across 8 benchmark categories. Agentic (22%), coding (20%), and reasoning (17%) carry the most weight. A 5-point gap in overall score is meaningful — it reflects consistent performance differences across multiple domains.
The overall score compresses 8 categories into one number. Two models with the same overall score can have very different strengths — one might lead coding while the other leads reasoning. Always check category scores for your specific use case.
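The point above can be made concrete. With the stated weights, a coding-leaning profile and a reasoning-leaning profile can land on exactly the same overall score; every number below is invented for illustration:

```python
# Category weights from BenchLM.ai's provisional scoring formula.
WEIGHTS = {
    "agentic": 0.22, "coding": 0.20, "reasoning": 0.17,
    "knowledge": 0.12, "multimodal_grounded": 0.12,
    "multilingual": 0.07, "instruction_following": 0.05, "math": 0.05,
}

def overall(scores: dict) -> float:
    """Weighted average over the 8 benchmark categories."""
    return sum(WEIGHTS[cat] * scores[cat] for cat in WEIGHTS)

# Categories where both hypothetical models score the same.
shared = {
    "agentic": 85, "knowledge": 80, "multimodal_grounded": 80,
    "multilingual": 75, "instruction_following": 80, "math": 75,
}

model_a = {**shared, "coding": 92, "reasoning": 70}  # coding-leaning
model_b = {**shared, "coding": 75, "reasoning": 90}  # reasoning-leaning

print(round(overall(model_a), 2), round(overall(model_b), 2))  # 81.2 81.2
```

Both profiles come out at 81.2 overall, yet one is 17 points stronger at coding and the other 20 points stronger at reasoning, which is why the category breakdown matters more than the headline number for a specific use case.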