LisanBench

A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.
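To make the task concrete, here is a minimal sketch of how a chain could be checked. It assumes "edit distance 1" means one substitution, insertion, or deletion, and that every word must come from a fixed vocabulary. The function names, the toy vocabulary, and the "count the valid prefix" rule are illustrative assumptions, not LisanBench's published scoring code.

```python
def is_edit_distance_one(a: str, b: str) -> bool:
    """Return True if b differs from a by exactly one substitution,
    insertion, or deletion (assumed meaning of 'edit distance 1')."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make sure a is the shorter (or equal-length) word
    i = j = edits = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
            continue
        edits += 1
        if edits > 1:
            return False
        if len(a) == len(b):
            i += 1  # substitution: advance in both words
        j += 1      # insertion: skip the extra character in the longer word
    edits += len(b) - j  # a leftover trailing character counts as one edit
    return edits == 1


def valid_chain_length(chain, vocabulary):
    """Count how many leading words form a valid chain: each word is in the
    vocabulary, never repeated, and one edit away from its predecessor.
    Stopping at the first violation is a hypothetical rule for illustration."""
    seen = set()
    previous = None
    count = 0
    for word in chain:
        if word not in vocabulary or word in seen:
            break
        if previous is not None and not is_edit_distance_one(previous, word):
            break
        seen.add(word)
        previous = word
        count += 1
    return count


# Toy vocabulary, purely illustrative.
vocabulary = {"cat", "cot", "cog", "dog", "cart"}
print(valid_chain_length(["cat", "cot", "cog", "dog"], vocabulary))
print(valid_chain_length(["cat", "cot", "cat"], vocabulary))
```

The second call stops early because "cat" repeats, which is the non-repeating constraint in action.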

How BenchLM shows LisanBench

BenchLM mirrors the current public LisanBench difficulty-weighted leaderboard using the official dataset published at lisanbench.com for the April 29, 2026 snapshot. The public benchmark tests 130 model variants across 50 starting words, with 3 trials per starting word.

LisanBench is a strong reasoning reference, but BenchLM currently keeps it display-only rather than weighted. The public leaderboard is highly variant-specific, strongly dependent on English vocabulary, and not yet aligned cleanly enough with BenchLM canonical model rows to use as a ranking input.

130 model variants · 50 starting words · 3 trials per word · Difficulty-weighted scores · Display only

Difficulty-weighted score on LisanBench — April 29, 2026 snapshot

BenchLM mirrors the published difficulty-weighted score view for LisanBench. Claude Opus 4.7 leads the public snapshot at 3957.70, followed by Opus 4.6 (16k) (2772.16) and Sonnet 4.6 (16k) (2307.52). BenchLM does not use these results to rank models overall.

130 models · Reasoning · Current · Display only · Updated April 29, 2026 snapshot

The published LisanBench snapshot has a clear leader rather than a tight cluster at the top: Claude Opus 4.7 sits at 3957.70, and the third row is 1650.19 points behind. The broader top-10 spread is 2796.23 points, so the benchmark separates strong models widely even among the leaders.
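The gap and spread figures can be reproduced from the top-10 scores in the table. The sketch below uses the rounded two-decimal values shown on this page; the variable names are illustrative. From these rounded scores the gap to the third row comes out at 1650.18, within a hundredth of the figure quoted in the text. The small difference is presumably rounding in the published data.

```python
# Top-10 difficulty-weighted scores as displayed in the table (rounded to 2 decimals).
top10 = [
    3957.70, 2772.16, 2307.52, 2215.79, 1805.52,
    1576.39, 1450.46, 1190.60, 1189.18, 1161.47,
]

gap_to_second = round(top10[0] - top10[1], 2)  # leader minus rank 2
gap_to_third = round(top10[0] - top10[2], 2)   # leader minus rank 3
top10_spread = round(top10[0] - top10[-1], 2)  # leader minus rank 10

print("gap to second:", gap_to_second)
print("gap to third:", gap_to_third)
print("top-10 spread:", top10_spread)
```

The leader's margin over rank 2 alone is larger than the entire score of rank 10, which is why the top of this snapshot reads as spread out rather than clustered.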

130 models have been evaluated on LisanBench. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. LisanBench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About LisanBench

Year: 2026

Tasks: 50 starting words × 3 trials

Format: Difficulty-weighted word-chain reasoning

Difficulty: Open-ended lexical planning

BenchLM mirrors the public difficulty-weighted LisanBench leaderboard as a display-only reasoning benchmark. The public benchmark currently evaluates 130 model variants across 50 starting words with 3 trials per word.

BenchLM freshness & provenance

Version: LisanBench 2026

Refresh cadence: Static

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Difficulty-weighted score table (130 models)

Rank | Model | Model ID | Score
1 | Claude Opus 4.7 | anthropic/claude-opus-4.7:thinking-xhigh | 3957.70
2 | Opus 4.6 (16k) | anthropic/claude-opus-4.6:thinking-16k | 2772.16
3 | Sonnet 4.6 (16k) | anthropic/claude-sonnet-4.6:thinking-16k | 2307.52
4 | GPT 5.4 (medium) | openai/gpt-5.4:thinking-medium | 2215.79
5 | Opus 4.5 (16k) | anthropic/claude-opus-4.5:thinking-16k | 1805.52
6 | Gemini 3.1 Pro Preview (high) | google/gemini-3.1-pro-preview:thinking-high | 1576.39
7 | Grok 4 (medium) | x-ai/grok-4:thinking-medium | 1450.46
8 | Grok 4.20 Beta (thinking) | x-ai/grok-4.20-beta:thinking | 1190.60
9 | GPT 5 (medium) | openai/gpt-5 | 1189.18
10 | Deepseek V3.2 Speciale (thinking) | deepseek/deepseek-v3.2-speciale:thinking | 1161.47
11 | O3 (medium) | openai/o3:thinking-medium | 1103.03
12 | GPT 5.2 (medium) | openai/gpt-5.2:thinking-medium | 1038.83
13 | Gemini 3 Pro Preview (high) | google/gemini-3-pro-preview | 942.55
14 | Sonnet 4.5 (16k) | anthropic/claude-sonnet-4.5:thinking-16k | 863.95
15 | Deepseek V3.2 (thinking) | deepseek/deepseek-v3.2:thinking | 758.97
16 | Gemini 3.1 Pro Preview (low) | google/gemini-3.1-pro-preview:thinking-low | 746.55
17 | Step 3.5 Flash (thinking) | zenmux/step-3.5-flash:thinking | 668.59
18 | Grok 4 Fast (thinking) | x-ai/grok-4-fast:free | 648.18
19 | GPT 5 Mini (medium) | openai/gpt-5-mini | 611.61
20 | Kimi K2.5 (thinking) | moonshotai/kimi-k2.5:thinking | 578.76
21 | Grok 4.1 Fast (thinking) | x-ai/grok-4.1-fast:thinking | 513.54
22 | Gemini 3 Flash Preview (high) | google/gemini-3-flash-preview | 511.79
23 | GPT 5 Nano (medium) | openai/gpt-5-nano | 507.27
24 | Kimi K2 (thinking) | moonshotai/kimi-k2-thinking | 497.61
25 | GPT 5.4 Mini (medium) | openai/gpt-5.4-mini:thinking-medium | 491.61
26 | Sonnet 4 (16k) | anthropic/claude-sonnet-4:thinking-16k | 490.62
27 | GPT 5.4 Nano (medium) | openai/gpt-5.4-nano:thinking-medium | 450.72
28 | O3 Mini (medium) | openai/o3-mini | 419.03
29 | Doubao Seed 2.0 Pro (thinking) | zenmux/doubao-seed-2.0-pro:thinking | 393.81
30 | GPT-OSS-120B (medium) | openai/gpt-oss-120b | 367.03
31 | Qwen3.5 397B A17B (thinking) | qwen/qwen3.5-397b-a17b:thinking | 310.94
32 | GLM 5 (thinking) | z-ai/glm-5:thinking | 305.91
33 | O4 Mini (medium) | openai/o4-mini | 283.09
34 | Opus 4 | anthropic/claude-opus-4 | 219.47
35 | Doubao Seed 2.0 Lite (thinking) | zenmux/doubao-seed-2.0-lite:thinking | 216.63
36 | Doubao Seed 1.8 (thinking) | zenmux/doubao-seed-1.8:thinking | 215.64
37 | Qwen3 235B A22B 2507 (thinking) | qwen/qwen3-235b-a22b-thinking-2507 | 190.08
38 | Claude Opus 4.7 | anthropic/claude-opus-4.7 | 184.35
39 | Minimax M2.5 (thinking) | minimax/minimax-m2.5:thinking | 183.46
40 | Opus 4.1 | anthropic/claude-opus-4.1 | 181.45
41 | Sonnet 4.6 | anthropic/claude-sonnet-4.6 | 180.59
42 | Gemini 2.5 Pro (16k) | google/gemini-2.5-pro:thinking-16k | 168.93
43 | Grok 3 Mini (medium) | x-ai/grok-3-mini:thinking-medium | 163.17
44 | | | 157.86
45 | Sonnet 3.7 | anthropic/claude-3.7-sonnet | 139.33
46 | GPT-OSS-20B (medium) | openai/gpt-oss-20b | 132.53
47 | Sonnet 4 | anthropic/claude-sonnet-4 | 131.70
48 | Doubao Seed 2.0 Mini (thinking) | zenmux/doubao-seed-2.0-mini:thinking | 127.78
49 | Sonnet 3.6 | anthropic/claude-3.5-sonnet | 125.79
50 | Deepseek V3.2 | deepseek/deepseek-v3.2 | 110.86
51 | Sonnet 4.5 | anthropic/claude-sonnet-4.5 | 103.90
52 | Sonnet 3.5 | anthropic/claude-3.5-sonnet-20240620 | 103.10
53 | Olmo 3 32B (thinking) | allenai/olmo-3-32b-think | 99.80
54 | Gemini Pro 1.5 | google/gemini-pro-1.5 | 98.93
55 | Gemini 2.5 Flash (16k) | google/gemini-2.5-flash:thinking-16k | 97.87
56 | Qwen3.5 122B A10B (thinking) | qwen/qwen3.5-122b-a10b:thinking | 92.64
57 | Deepseek V3 | deepseek/deepseek-chat | 91.99
58 | GPT 5.4 | openai/gpt-5.4:thinking-none | 89.51
59 | GLM 4.5 (thinking) | z-ai/glm-4.5 | 89.18
60 | Qwen3.5 35B A3B (thinking) | qwen/qwen3.5-35b-a3b:thinking | 87.36
61 | O1 Mini (medium) | openai/o1-mini | 85.61
62 | Opus 4.5 | anthropic/claude-opus-4.5 | 84.09
63 | GPT 4o | openai/chatgpt-4o-latest | 82.43
64 | Opus 4.6 | anthropic/claude-opus-4.6 | 79.59
65 | Deepseek R1 0528 (thinking) | deepseek/deepseek-r1-0528 | 78.47
66 | GPT 4 Turbo | openai/gpt-4-turbo | 78.23
67 | Opus 3 | anthropic/claude-3-opus | 72.79
68 | Kimi K2 | moonshotai/kimi-k2 | 68.20
69 | Qwen3 4B (16k) | Qwen/Qwen3-4B-FP8:thinking-16k | 67.24
70 | Gemini 2.5 Flash | google/gemini-2.5-flash | 65.81
71 | Minimax M1 (thinking) | minimax/minimax-m1 | 59.71
72 | Gemini 2.0 Flash | google/gemini-2.0-flash-001 | 50.79
73 | Haiku 4.5 | anthropic/claude-haiku-4.5 | 49.89
74 | Gemini Flash 1.5 | google/gemini-flash-1.5 | 49.60
75 | Horizon Beta | openrouter/horizon-beta | 48.03
76 | Nova Pro V1 | amazon/nova-pro-v1 | 47.48
77 | GLM 4.7 (thinking) | z-ai/glm-4.7 | 47.01
78 | Polaris Alpha | openrouter/polaris-alpha | 46.78
79 | GLM 4.5 Air (thinking) | z-ai/glm-4.5-air | 46.75
80 | GPT 3.5 Turbo | openai/gpt-3.5-turbo-0613 | 46.35
81 | Qwen3 Coder | qwen/qwen3-coder | 43.40
82 | Grok 4.1 Fast | x-ai/grok-4.1-fast | 42.96
83 | Llama 3.1 405B | meta-llama/llama-3.1-405b-instruct | 41.69
84 | GLM 4.6 (thinking) | z-ai/glm-4.6 | 40.14
85 | GPT 5.4 Mini | openai/gpt-5.4-mini:thinking-none | 39.33
86 | Llama 4 Maverick | meta-llama/llama-4-maverick | 39.01
87 | Gemma 3 27B | google/gemma-3-27b-it | 37.17
88 | Mistral Medium 3 | mistralai/mistral-medium-3 | 37.16
89 | Devstral Medium | mistralai/devstral-medium | 36.06
90 | Qwen3 1.7B (16k) | Qwen/Qwen3-1.7B-FP8:thinking-16k | 36.01
91 | GPT 4.1 | openai/gpt-4.1 | 35.94
92 | Ernie 4.5 300B A47B | baidu/ernie-4.5-300b-a47b | 35.24
93 | Sherlock Dash Alpha | openrouter/sherlock-dash-alpha | 34.99
94 | Gemini 2.0 Flash Lite 001 | google/gemini-2.0-flash-lite-001 | 34.90
95 | Gemini 2.5 Flash Lite | google/gemini-2.5-flash-lite | 34.45
96 | Haiku 3.5 | anthropic/claude-3-5-haiku-20241022 | 34.27
97 | Llama 3.1 70B | meta-llama/llama-3.1-70b-instruct | 33.89
98 | Nova Micro V1 | amazon/nova-micro-v1 | 30.88
99 | Llama 4 Scout | meta-llama/llama-4-scout | 30.21
100 | Mistral Large 2411 | mistralai/mistral-large-2411 | 29.98
101 | GPT 4.1 Mini | openai/gpt-4.1-mini | 29.52
102 | Haiku 3 | anthropic/claude-3-haiku | 29.43
103 | Gemini 2.5 Flash Lite (16k) | google/gemini-2.5-flash-lite:thinking-16k | 29.32
104 | Qwen3 235B A22B 2507 | qwen/qwen3-235b-a22b-2507 | 28.91
105 | Nova Lite V1 | amazon/nova-lite-v1 | 28.01
106 | Mimo V2 Flash (thinking) | zenmux/mimo-v2-flash:thinking | 26.40
107 | Gemini Flash 1.5 8B | google/gemini-flash-1.5-8b | 20.76
108 | Qwen3 32B | qwen/qwen3-32b | 19.09
109 | Qwen3 14B | qwen/qwen3-14b | 18.33
110 | Gemma 3 12B | google/gemma-3-12b-it | 17.38
111 | GPT 4o Mini | openai/gpt-4o-mini | 17.29
112 | Qwen3 30B A3B 2507 | qwen/qwen3-30b-a3b-instruct-2507 | 16.91
113 | Mistral Small 3.2 24B | mistralai/mistral-small-3.2-24b-instruct | 16.08
114 | Qwen3 8B | Qwen/Qwen3-8B-FP8 | 13.72
115 | GPT 4.1 Nano | openai/gpt-4.1-nano | 13.51
116 | Codestral 2508 | mistralai/codestral-2508 | 12.94
117 | Ministral 14B 2512 | mistralai/ministral-14b-2512 | 12.29
118 | Devstral Small | mistralai/devstral-small | 11.27
119 | GPT 5.4 Nano | openai/gpt-5.4-nano:thinking-none | 11.03
120 | Ministral 8B 2512 | mistralai/ministral-8b-2512 | 10.93
121 | Mistral Nemo | mistralai/mistral-nemo | 9.84
122 | Qwen3 0.6B (16k) | Qwen/Qwen3-0.6B-FP8:thinking-16k | 8.59
123 | Gemma 3 4B | google/gemma-3-4b-it | 8.23
124 | Qwen3 4B | Qwen/Qwen3-4B-FP8 | 7.94
125 | Qwen3 1.7B | Qwen/Qwen3-1.7B-FP8 | 6.57
126 | Ministral 3B 2512 | mistralai/ministral-3b-2512 | 6.40
127 | Llama 3.1 8B | meta-llama/llama-3.1-8b-instruct | 4.10
128 | Llama 3.2 3B | meta-llama/llama-3.2-3b-instruct | 2.46
129 | Llama 3.2 1B | meta-llama/llama-3.2-1b-instruct | 0.84
130 | Qwen3 0.6B | Qwen/Qwen3-0.6B-FP8 | 0.05

FAQ

What does LisanBench measure?

A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.

Which model leads the published LisanBench snapshot?

Claude Opus 4.7 currently leads the published LisanBench snapshot with a difficulty-weighted score of 3957.70. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on LisanBench?

130 AI models are included in BenchLM's mirrored LisanBench snapshot, based on the public leaderboard as captured in the April 29, 2026 snapshot.

Last updated: April 29, 2026 snapshot · mirrored from the public benchmark leaderboard
