AI models with the largest context windows (200K+ tokens), ranked by benchmark performance.
Unless noted otherwise, the rankings on this page draw from BenchLM's provisional leaderboard lane rather than the stricter, sourced-only verified leaderboard.
Bottom line: A large context window means nothing if the model can't actually use it. Claude Mythos Preview and Gemini 3.1 Pro both have 1M+ context and the benchmarks to back it up.
According to BenchLM.ai, Claude Mythos Preview leads this ranking with a score of 99, followed by Claude Opus 4.7 (97) and GPT-5.4 (93). There is meaningful separation between the top models, suggesting genuine performance differences.
The best open-weight option is GLM-5.1 (ranked #10 with a score of 84). While proprietary models lead, open-weight options are within striking distance for teams willing to trade a few points of performance for full model control.
This ranking is based on provisional overall weighted scores computed with BenchLM.ai's scoring formula. For detailed model profiles, click any model name below. To compare two specific models head-to-head, use the "vs #" links.
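The exact scoring formula is not published on this page. As a purely illustrative sketch, an overall weighted score could be computed as below; the benchmark names and weights are made up for the example and are not BenchLM.ai's actual formula.

```python
# Hypothetical weights over per-benchmark scores (0-100 scale).
# These names and values are illustrative, not BenchLM.ai's real formula.
WEIGHTS = {"reasoning": 0.40, "long_context": 0.35, "coding": 0.25}

def overall_score(benchmarks: dict[str, float]) -> float:
    """Weighted average of per-benchmark scores."""
    return sum(WEIGHTS[name] * benchmarks[name] for name in WEIGHTS)

print(overall_score({"reasoning": 99, "long_context": 99, "coding": 99}))
```

A model that scores 99 on every benchmark gets an overall score of 99 regardless of the weights, since they sum to 1.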
Claude Mythos Preview
Anthropic · 1M
Highest-scoring large-context model. 1M tokens with top benchmarks.
Claude Opus 4.7
Anthropic · 1M
GPT-5.4
OpenAI · 1.05M
Claude Mythos Preview leads large-context models with 1M context and the highest overall score.
Gemini 3.1 Pro: 1M context with a strong overall score (97), the best non-reasoning large-context model.
GPT-5.4 1.05M context — largest window among the top 3 overall models.
Get notified when models move. One email a week with what changed and why.
Free. No spam. Unsubscribe anytime.
The top model is Claude Mythos Preview by Anthropic with a provisional score of 99.
The best open-weight model is GLM-5.1 at position #10.
63 models are included in this ranking.
Models are filtered by context window (200K+ tokens) and ranked by overall BenchLM score. A large context window alone is not enough — check long-context benchmark scores for actual retrieval and reasoning quality.
Context window size is self-reported by providers. Actual usable context may be smaller due to edge degradation. Long-context benchmarks test specific patterns — real workloads may differ.
For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.