A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.
BenchLM mirrors the current public LisanBench difficulty-weighted leaderboard using the official dataset published at lisanbench.com for the April 29, 2026 snapshot. The public benchmark tests 130 model variants across 50 starting words, with 3 trials per starting word.
LisanBench is a strong reasoning reference, but BenchLM currently keeps it display-only rather than weighted. The public leaderboard is highly variant-specific, strongly dependent on English vocabulary, and not yet aligned cleanly enough with BenchLM's canonical model rows to use as a ranking input.
BenchLM mirrors the published difficulty-weighted score view for LisanBench. Claude Opus 4.7 leads the public snapshot at 3957.70, followed by Opus 4.6 (16k) (2772.16) and Sonnet 4.6 (16k) (2307.52). BenchLM does not use these results to rank models overall.
Claude Opus 4.7 (Anthropic): anthropic/claude-opus-4.7:thinking-xhigh
Opus 4.6 (16k) (Anthropic): anthropic/claude-opus-4.6:thinking-16k
Sonnet 4.6 (16k) (Anthropic): anthropic/claude-sonnet-4.6:thinking-16k
The published LisanBench snapshot has a clear leader: Claude Opus 4.7 sits at 3957.70, and the third row is 1650.19 points behind. The broader top-10 spread is 2796.23 points, so the benchmark separates strong models widely, even among the leaders.
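The gap figures can be checked against the three published scores quoted on this page. The sketch below uses `Decimal` to avoid floating-point noise; the helper names `gap_to_leader` and `relative_gap` are hypothetical. Note that the two-decimal scores give a first-to-third gap of 1650.18, one hundredth below the 1650.19 quoted above, which would be consistent with the page computing the gap from unrounded values (an assumption, not something the source states).

```python
from decimal import Decimal

# Difficulty-weighted scores as published in the snapshot text.
scores = {
    "Claude Opus 4.7": Decimal("3957.70"),
    "Opus 4.6 (16k)": Decimal("2772.16"),
    "Sonnet 4.6 (16k)": Decimal("2307.52"),
}

leader_score = max(scores.values())


def gap_to_leader(score: Decimal) -> Decimal:
    """Absolute point gap between the leader and the given score."""
    return leader_score - score


def relative_gap(score: Decimal) -> Decimal:
    """Gap expressed as a fraction of the leader's score."""
    return gap_to_leader(score) / leader_score


for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: gap {gap_to_leader(score)} ({relative_gap(score):.1%} of leader)")
```

Looking at the relative gap alongside the absolute one makes it easier to judge whether the top rows are close: by these numbers the third row trails the leader by roughly two fifths of the leader's score.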
130 models have been evaluated on LisanBench. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. LisanBench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
Year: 2026
Tasks: 50 starting words × 3 trials
Format: Difficulty-weighted word-chain reasoning
Difficulty: Open-ended lexical planning
BenchLM mirrors the public difficulty-weighted LisanBench leaderboard as a display-only reasoning benchmark. The public benchmark currently evaluates 130 model variants across 50 starting words with 3 trials per word.
Version: LisanBench 2026
Refresh cadence: Static
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Claude Opus 4.7 currently leads the published LisanBench snapshot with a difficulty-weighted score of 3957.70. BenchLM shows this benchmark for display only and does not use it in overall rankings.
130 AI models are included in BenchLM's mirrored LisanBench snapshot, based on the public leaderboard captured on April 29, 2026.