A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.
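The chain rule can be made concrete with a small validity check. This is a minimal sketch, not LisanBench's official scorer: the helper names (`is_one_edit`, `valid_chain`) and the exact edit operations allowed (single substitution, insertion, or deletion) are assumptions based on the description above.

```python
def is_one_edit(a: str, b: str) -> bool:
    """True if b is exactly one substitution, insertion, or deletion away from a."""
    if a == b:
        return False
    la, lb = len(a), len(b)
    if abs(la - lb) > 1:
        return False
    if la == lb:
        # Same length: exactly one substituted character.
        return sum(x != y for x, y in zip(a, b)) == 1
    # Lengths differ by one: align the shorter word against the longer,
    # tolerating a single skipped character in the longer word.
    if la > lb:
        a, b = b, a
    i = j = 0
    skipped = False
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif not skipped:
            skipped = True
            j += 1
        else:
            return False
    return True


def valid_chain(chain: list[str], vocabulary: set[str]) -> bool:
    """Check a word chain: no repeats, every word in the vocabulary,
    and each adjacent pair exactly one edit apart."""
    if len(set(chain)) != len(chain):
        return False
    if any(w not in vocabulary for w in chain):
        return False
    return all(is_one_edit(a, b) for a, b in zip(chain, chain[1:]))
```

Under this reading, a model's output is scored by how long a chain it can sustain before breaking one of the three constraints; for example, `["cat", "cot", "coat"]` passes, while repeating a word or jumping two edits at once fails.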
BenchLM mirrors the current public LisanBench difficulty-weighted leaderboard using the official dataset published at lisanbench.com and fetched on April 2, 2026. The public benchmark tests 128 model variants across 50 starting words, with 3 trials per starting word.
LisanBench is a strong reasoning reference, but BenchLM currently keeps it display-only rather than weighted. The public leaderboard is highly variant-specific, strongly dependent on English vocabulary, and not yet aligned cleanly enough with BenchLM's canonical model rows to serve as a ranking input.
BenchLM mirrors the published difficulty-weighted score view for LisanBench. Opus 4.6 (16k) leads the public snapshot at 2772.16, followed by Sonnet 4.6 (16k) at 2307.52 and GPT 5.4 (medium) at 2215.79. BenchLM does not use these results to rank models overall.
Opus 4.6 (16k)
Anthropic
anthropic/claude-opus-4.6:thinking-16k
Sonnet 4.6 (16k)
Anthropic
anthropic/claude-sonnet-4.6:thinking-16k
GPT 5.4 (medium)
OpenAI
openai/gpt-5.4:thinking-medium
The published LisanBench snapshot is tightly clustered at the top: Opus 4.6 (16k) sits at 2772.16, while the third row is only 556.37 points behind. The broader top-10 spread is 1669.13 points, so the benchmark still separates strong models even when the leaders cluster.
128 models have been evaluated on LisanBench. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. LisanBench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
Year
2026
Tasks
50 starting words × 3 trials
Format
Difficulty-weighted word-chain reasoning
Difficulty
Open-ended lexical planning
BenchLM mirrors the public difficulty-weighted LisanBench leaderboard as a display-only reasoning benchmark. The public benchmark currently evaluates 128 model variants across 50 starting words with 3 trials per word.
Version
LisanBench 2026
Refresh cadence
Static
Staleness state
Current
Question availability
Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Opus 4.6 (16k) currently leads the published LisanBench snapshot with a difficulty-weighted score of 2772.16. BenchLM shows this benchmark for display only and does not use it in overall rankings.
128 AI models are included in BenchLM's mirrored LisanBench snapshot, based on the public leaderboard captured on April 2, 2026.