LisanBench

A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.
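The chain rules can be sketched as a small validator. This is a minimal illustration, assuming the standard LisanBench constraints (each word is Levenshtein distance exactly 1 from the previous word, and no word repeats); checking words against a reference dictionary is omitted for brevity:

```python
def edit_distance_one(a: str, b: str) -> bool:
    """True if a and b differ by exactly one substitution, insertion, or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one substituted position.
        return sum(x != y for x, y in zip(a, b)) == 1
    # Different lengths: make `a` the shorter word, then check that
    # `b` is `a` with a single letter inserted.
    if len(a) > len(b):
        a, b = b, a
    i = j = 0
    skipped = False
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif not skipped:
            skipped = True
            j += 1
        else:
            return False
    return True

def valid_chain(words: list[str]) -> bool:
    """A chain is valid if every word is unique and each step is edit distance 1."""
    if len(set(words)) != len(words):
        return False
    return all(edit_distance_one(a, b) for a, b in zip(words, words[1:]))

# A legal 4-step chain: each step changes one letter, no word repeats.
print(valid_chain(["cold", "cord", "word", "ward"]))  # → True
```

Scoring a model's answer then reduces to finding the longest valid prefix of the chain it emits, which is why long-horizon planning matters: one illegal step caps the score.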

How BenchLM shows LisanBench

BenchLM mirrors the current public LisanBench difficulty-weighted leaderboard using the official dataset published at lisanbench.com and fetched on April 2, 2026. The public benchmark tests 128 model variants across 50 starting words, with 3 trials per starting word.

LisanBench is a strong reasoning reference, but BenchLM currently keeps it display-only rather than weighted. The public leaderboard is highly variant-specific, strongly dependent on English vocabulary, and not yet aligned cleanly enough with BenchLM's canonical model rows to serve as a ranking input.

128 model variants · 50 starting words · 3 trials per word · Difficulty-weighted scores · Display only

Difficulty-weighted score on LisanBench — April 2, 2026 snapshot

BenchLM mirrors the published difficulty-weighted score view for LisanBench. Opus 4.6 (16k) leads the public snapshot at 2772.16, followed by Sonnet 4.6 (16k) at 2307.52 and GPT 5.4 (medium) at 2215.79. BenchLM does not use these results to rank models overall.

128 models · Reasoning · Current · Display only · Updated: April 2, 2026 snapshot

The published LisanBench snapshot is tightly clustered at the top: Opus 4.6 (16k) sits at 2772.16, while the third row is only 556.37 points behind. The broader top-10 spread is 1669.13 points, so the benchmark still separates strong models even when the leaders cluster.
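The quoted gaps follow directly from the published scores; a quick sanity check, using the top-10 values from the snapshot table below:

```python
# Top-10 difficulty-weighted scores from the April 2, 2026 snapshot.
top10 = [2772.16, 2307.52, 2215.79, 1805.52, 1576.39,
         1450.46, 1190.60, 1189.18, 1161.47, 1103.03]

gap_to_third = round(top10[0] - top10[2], 2)  # leader vs. third row
top10_spread = round(top10[0] - top10[9], 2)  # leader vs. tenth row
print(gap_to_third, top10_spread)  # → 556.37 1669.13
```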

128 models have been evaluated on LisanBench. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. LisanBench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About LisanBench

Year

2026

Tasks

50 starting words × 3 trials

Format

Difficulty-weighted word-chain reasoning

Difficulty

Open-ended lexical planning

BenchLM mirrors the public difficulty-weighted LisanBench leaderboard as a display-only reasoning benchmark. The public benchmark currently evaluates 128 model variants across 50 starting words with 3 trials per word.

BenchLM freshness & provenance

Version

LisanBench 2026

Refresh cadence

Static

Staleness state

Current

Question availability

Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
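As a purely illustrative sketch of that routing idea (the function name, fields, and rules here are assumptions for illustration, not BenchLM's actual policy, which is defined on the methodology page):

```python
# Illustrative only: a toy freshness-based routing rule.
# Field names and logic are hypothetical, not BenchLM's real implementation.
def benchmark_role(staleness: str, public_questions: bool, aligned: bool) -> str:
    """Map freshness/provenance metadata to how a benchmark is used in scoring."""
    if staleness != "current":
        return "watch"          # stale results shouldn't drive rankings
    if not aligned:
        return "display-only"   # scores can't map onto canonical model rows
    if public_questions:
        return "watch"          # public question sets raise contamination risk
    return "differentiator"

# LisanBench's metadata (current, public question set, not yet aligned)
# lands it in the display-only bucket under this toy rule.
print(benchmark_role("current", public_questions=True, aligned=False))  # → display-only
```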

Difficulty-weighted score table (128 models)

#1 · Opus 4.6 (16k) · anthropic/claude-opus-4.6:thinking-16k · 2772.16
#2 · Sonnet 4.6 (16k) · anthropic/claude-sonnet-4.6:thinking-16k · 2307.52
#3 · GPT 5.4 (medium) · openai/gpt-5.4:thinking-medium · 2215.79
#4 · Opus 4.5 (16k) · anthropic/claude-opus-4.5:thinking-16k · 1805.52
#5 · Gemini 3.1 Pro Preview (high) · google/gemini-3.1-pro-preview:thinking-high · 1576.39
#6 · Grok 4 · x-ai/grok-4:thinking-medium · 1450.46
#7 · Grok 4.20 · x-ai/grok-4.20-beta:thinking · 1190.60
#8 · GPT-5 (medium) · openai/gpt-5 · 1189.18
#9 · Deepseek V3.2 Speciale (thinking) · deepseek/deepseek-v3.2-speciale:thinking · 1161.47
#10 · o3 · openai/o3:thinking-medium · 1103.03
#11 · GPT-5.2 · openai/gpt-5.2:thinking-medium · 1038.83
#12 · Gemini 3 Pro · google/gemini-3-pro-preview · 942.55
#13 · Sonnet 4.5 (16k) · anthropic/claude-sonnet-4.5:thinking-16k · 863.95
#14 · DeepSeek V3.2 (Thinking) · deepseek/deepseek-v3.2:thinking · 758.97
#15 · Gemini 3.1 Pro Preview (low) · google/gemini-3.1-pro-preview:thinking-low · 746.55
#16 · Step 3.5 Flash · zenmux/step-3.5-flash:thinking · 668.59
#17 · Grok 4 Fast (thinking) · x-ai/grok-4-fast:free · 648.18
#18 · GPT-5 mini · openai/gpt-5-mini · 611.61
#19 · Kimi K2.5 (Reasoning) · moonshotai/kimi-k2.5:thinking · 578.76
#20 · Grok 4.1 Fast (thinking) · x-ai/grok-4.1-fast:thinking · 513.54
#21 · Gemini 3 Flash · google/gemini-3-flash-preview · 511.79
#22 · GPT-5 nano · openai/gpt-5-nano · 507.27
#23 · Kimi K2 (thinking) · moonshotai/kimi-k2-thinking · 497.61
#24 · GPT 5.4 Mini (medium) · openai/gpt-5.4-mini:thinking-medium · 491.61
#25 · Sonnet 4 (16k) · anthropic/claude-sonnet-4:thinking-16k · 490.62
#26 · GPT 5.4 Nano (medium) · openai/gpt-5.4-nano:thinking-medium · 450.72
#27 · o3-mini · openai/o3-mini · 419.03
#28 · Doubao Seed 2.0 Pro (thinking) · zenmux/doubao-seed-2.0-pro:thinking · 393.81
#29 · GPT-OSS 120B · openai/gpt-oss-120b · 367.03
#30 · Qwen3.5 397B (Reasoning) · qwen/qwen3.5-397b-a17b:thinking · 310.94
#31 · GLM-5 (Reasoning) · z-ai/glm-5:thinking · 305.91
#32 · O4 Mini (medium) · openai/o4-mini · 283.09
#33 · Opus 4 · anthropic/claude-opus-4 · 219.47
#34 · Doubao Seed 2.0 Lite (thinking) · zenmux/doubao-seed-2.0-lite:thinking · 216.63
#35 · Doubao Seed 1.8 (thinking) · zenmux/doubao-seed-1.8:thinking · 215.64
#36 · Qwen3 235B 2507 (Reasoning) · qwen/qwen3-235b-a22b-thinking-2507 · 190.08
#37 · MiniMax M2.5 · minimax/minimax-m2.5:thinking · 183.46
#38 · Claude 4.1 Opus · anthropic/claude-opus-4.1 · 181.45
#39 · Claude Sonnet 4.6 · anthropic/claude-sonnet-4.6 · 180.59
#40 · Gemini 2.5 Pro · google/gemini-2.5-pro:thinking-16k · 168.93
#41 · Grok 3 Mini · x-ai/grok-3-mini:thinking-medium · 163.17
#42 · 157.86
#43 · Sonnet 3.7 · anthropic/claude-3.7-sonnet · 139.33
#44 · GPT-OSS 20B · openai/gpt-oss-20b · 132.53
#45 · Claude 4 Sonnet · anthropic/claude-sonnet-4 · 131.70
#46 · Doubao Seed 2.0 Mini (thinking) · zenmux/doubao-seed-2.0-mini:thinking · 127.78
#47 · Claude 3.5 Sonnet · anthropic/claude-3.5-sonnet · 125.79
#48 · DeepSeek V3.2 · deepseek/deepseek-v3.2 · 110.86
#49 · Claude Sonnet 4.5 · anthropic/claude-sonnet-4.5 · 103.90
#50 · Claude 3.5 Sonnet · anthropic/claude-3.5-sonnet-20240620 · 103.10
#51 · Olmo 3 32B (thinking) · allenai/olmo-3-32b-think · 99.80
#52 · Gemini 1.5 Pro · google/gemini-pro-1.5 · 98.93
#53 · Gemini 2.5 Flash (16k) · google/gemini-2.5-flash:thinking-16k · 97.87
#54 · Qwen3.5-122B-A10B · qwen/qwen3.5-122b-a10b:thinking · 92.64
#55 · DeepSeek V3 · deepseek/deepseek-chat · 91.99
#56 · GPT-5.4 · openai/gpt-5.4:thinking-none · 89.51
#57 · GLM-4.5 · z-ai/glm-4.5 · 89.18
#58 · Qwen3.5-35B-A3B · qwen/qwen3.5-35b-a3b:thinking · 87.36
#59 · O1 Mini (medium) · openai/o1-mini · 85.61
#60 · Claude Opus 4.5 · anthropic/claude-opus-4.5 · 84.09
#61 · GPT-4o · openai/chatgpt-4o-latest · 82.43
#62 · Claude Opus 4.6 · anthropic/claude-opus-4.6 · 79.59
#63 · DeepSeek-R1 · deepseek/deepseek-r1-0528 · 78.47
#64 · GPT-4 Turbo · openai/gpt-4-turbo · 78.23
#65 · Claude 3 Opus · anthropic/claude-3-opus · 72.79
#66 · Kimi K2 · moonshotai/kimi-k2 · 68.20
#67 · Qwen3 4B (16k) · Qwen/Qwen3-4B-FP8:thinking-16k · 67.24
#68 · Gemini 2.5 Flash · google/gemini-2.5-flash · 65.81
#69 · MiniMax M1 80k · minimax/minimax-m1 · 59.71
#70 · Gemini 2.0 Flash · google/gemini-2.0-flash-001 · 50.79
#71 · Claude Haiku 4.5 · anthropic/claude-haiku-4.5 · 49.89
#72 · Gemini Flash 1.5 · google/gemini-flash-1.5 · 49.60
#73 · Horizon Beta · openrouter/horizon-beta · 48.03
#74 · Nova Pro · amazon/nova-pro-v1 · 47.48
#75 · GLM-4.7 · z-ai/glm-4.7 · 47.01
#76 · Polaris Alpha · openrouter/polaris-alpha · 46.78
#77 · GLM-4.5-Air · z-ai/glm-4.5-air · 46.75
#78 · GPT 3.5 Turbo · openai/gpt-3.5-turbo-0613 · 46.35
#79 · Qwen3 Coder · qwen/qwen3-coder · 43.40
#80 · Grok 4.1 Fast · x-ai/grok-4.1-fast · 42.96
#81 · Llama 3.1 405B · meta-llama/llama-3.1-405b-instruct · 41.69
#82 · GLM 4.6 (thinking) · z-ai/glm-4.6 · 40.14
#83 · GPT 5.4 Mini · openai/gpt-5.4-mini:thinking-none · 39.33
#84 · Llama 4 Maverick · meta-llama/llama-4-maverick · 39.01
#85 · Gemma 3 27B · google/gemma-3-27b-it · 37.17
#86 · Mistral Medium 3 · mistralai/mistral-medium-3 · 37.16
#87 · Devstral Medium · mistralai/devstral-medium · 36.06
#88 · Qwen3 1.7B (16k) · Qwen/Qwen3-1.7B-FP8:thinking-16k · 36.01
#89 · GPT 4.1 · openai/gpt-4.1 · 35.94
#90 · Ernie 4.5 300B A47B · baidu/ernie-4.5-300b-a47b · 35.24
#91 · Sherlock Dash Alpha · openrouter/sherlock-dash-alpha · 34.99
#92 · Gemini 2.0 Flash Lite 001 · google/gemini-2.0-flash-lite-001 · 34.90
#93 · Gemini 2.5 Flash Lite · google/gemini-2.5-flash-lite · 34.45
#94 · Haiku 3.5 · anthropic/claude-3-5-haiku-20241022 · 34.27
#95 · Llama 3.1 70B · meta-llama/llama-3.1-70b-instruct · 33.89
#96 · Nova Micro V1 · amazon/nova-micro-v1 · 30.88
#97 · Llama 4 Scout · meta-llama/llama-4-scout · 30.21
#98 · Mistral Large 2411 · mistralai/mistral-large-2411 · 29.98
#99 · GPT 4.1 Mini · openai/gpt-4.1-mini · 29.52
#100 · Haiku 3 · anthropic/claude-3-haiku · 29.43
#101 · Gemini 2.5 Flash Lite (16k) · google/gemini-2.5-flash-lite:thinking-16k · 29.32
#102 · Qwen3 235B A22B 2507 · qwen/qwen3-235b-a22b-2507 · 28.91
#103 · Nova Lite V1 · amazon/nova-lite-v1 · 28.01
#104 · Mimo V2 Flash (thinking) · zenmux/mimo-v2-flash:thinking · 26.40
#105 · Gemini Flash 1.5 8B · google/gemini-flash-1.5-8b · 20.76
#106 · Qwen3 32B · qwen/qwen3-32b · 19.09
#107 · Qwen3 14B · qwen/qwen3-14b · 18.33
#108 · Gemma 3 12B · google/gemma-3-12b-it · 17.38
#109 · GPT-4o mini · openai/gpt-4o-mini · 17.29
#110 · Qwen3 30B A3B 2507 · qwen/qwen3-30b-a3b-instruct-2507 · 16.91
#111 · Mistral Small 3.2 24B · mistralai/mistral-small-3.2-24b-instruct · 16.08
#112 · Qwen3 8B · Qwen/Qwen3-8B-FP8 · 13.72
#113 · GPT 4.1 Nano · openai/gpt-4.1-nano · 13.51
#114 · Codestral 2508 · mistralai/codestral-2508 · 12.94
#115 · Ministral 14B 2512 · mistralai/ministral-14b-2512 · 12.29
#116 · Devstral Small · mistralai/devstral-small · 11.27
#117 · GPT 5.4 Nano · openai/gpt-5.4-nano:thinking-none · 11.03
#118 · Ministral 8B 2512 · mistralai/ministral-8b-2512 · 10.93
#119 · Mistral Nemo · mistralai/mistral-nemo · 9.84
#120 · Qwen3 0.6B (16k) · Qwen/Qwen3-0.6B-FP8:thinking-16k · 8.59
#121 · Gemma 3 4B · google/gemma-3-4b-it · 8.23
#122 · Qwen3 4B · Qwen/Qwen3-4B-FP8 · 7.94
#123 · Qwen3 1.7B · Qwen/Qwen3-1.7B-FP8 · 6.57
#124 · Ministral 3B 2512 · mistralai/ministral-3b-2512 · 6.40
#125 · Llama 3.1 8B · meta-llama/llama-3.1-8b-instruct · 4.10
#126 · Llama 3.2 3B · meta-llama/llama-3.2-3b-instruct · 2.46
#127 · Llama 3.2 1B · meta-llama/llama-3.2-1b-instruct · 0.84
#128 · Qwen3 0.6B · Qwen/Qwen3-0.6B-FP8 · 0.05

FAQ

What does LisanBench measure?

LisanBench is a word-chain reasoning benchmark that measures planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 word chains.

Which model leads the published LisanBench snapshot?

Opus 4.6 (16k) currently leads the published LisanBench snapshot with a difficulty-weighted score of 2772.16. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on LisanBench?

128 AI models are included in BenchLM's mirrored LisanBench snapshot, based on the public leaderboard captured on April 2, 2026.

Last updated: April 2, 2026 · mirrored from the public benchmark leaderboard
