A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.
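The chain rule can be made concrete with a small validity check. This is a minimal sketch, not LisanBench's official scorer: the helper names (`is_one_edit`, `valid_chain`) and the exact edit operations allowed (single substitution, insertion, or deletion) are assumptions based on the description above.

```python
def is_one_edit(a: str, b: str) -> bool:
    """True if b is exactly one substitution, insertion, or deletion away from a."""
    if a == b:
        return False
    la, lb = len(a), len(b)
    if abs(la - lb) > 1:
        return False
    if la == lb:
        # Same length: exactly one substituted character.
        return sum(x != y for x, y in zip(a, b)) == 1
    # Lengths differ by one: align the shorter word against the longer,
    # tolerating a single skipped character in the longer word.
    if la > lb:
        a, b = b, a
    i = j = 0
    skipped = False
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif not skipped:
            skipped = True
            j += 1
        else:
            return False
    return True


def valid_chain(chain: list[str], vocabulary: set[str]) -> bool:
    """Check a word chain: no repeats, every word in the vocabulary,
    and each adjacent pair exactly one edit apart."""
    if len(set(chain)) != len(chain):
        return False
    if any(w not in vocabulary for w in chain):
        return False
    return all(is_one_edit(a, b) for a, b in zip(chain, chain[1:]))
```

Under this reading, a model's output is scored by how long a chain it can sustain before breaking one of the three constraints; for example, `["cat", "cot", "coat"]` passes, while repeating a word or jumping two edits at once fails.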
BenchLM mirrors the current public LisanBench difficulty-weighted leaderboard using the official dataset published at lisanbench.com and fetched on April 2, 2026. The public benchmark tests 128 model variants across 50 starting words, with 3 trials per starting word.
LisanBench is a strong reasoning reference, but BenchLM currently keeps it display-only rather than weighted. The public leaderboard is highly variant-specific, strongly dependent on English vocabulary, and not yet aligned cleanly enough with BenchLM's canonical model rows to serve as a ranking input.
BenchLM mirrors the published difficulty-weighted score view for LisanBench. Opus 4.6 (16k) leads the public snapshot at 2772.16, followed by Sonnet 4.6 (16k) at 2307.52 and GPT 5.4 (medium) at 2215.79. BenchLM does not use these results to rank models overall.
Opus 4.6 (16k)
Anthropic
anthropic/claude-opus-4.6:thinking-16k
Sonnet 4.6 (16k)
Anthropic
anthropic/claude-sonnet-4.6:thinking-16k
GPT 5.4 (medium)
OpenAI
openai/gpt-5.4:thinking-medium
The published LisanBench snapshot is tightly clustered at the top: Opus 4.6 (16k) sits at 2772.16, while the third row is only 556.37 points behind. The broader top-10 spread is 1669.13 points, so the benchmark still separates strong models even when the leaders cluster.
128 models have been evaluated on LisanBench. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. LisanBench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
Year
2026
Tasks
50 starting words × 3 trials
Format
Difficulty-weighted word-chain reasoning
Difficulty
Open-ended lexical planning
BenchLM mirrors the public difficulty-weighted LisanBench leaderboard as a display-only reasoning benchmark. The public benchmark currently evaluates 128 model variants across 50 starting words with 3 trials per word.
Version
LisanBench 2026
Refresh cadence
Static
Staleness state
Current
Question availability
Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Opus 4.6 (16k) currently leads the published LisanBench snapshot with a difficulty-weighted score of 2772.16. BenchLM shows this benchmark for display only and does not use it in overall rankings.
128 AI models are included in BenchLM's mirrored LisanBench snapshot, based on the public leaderboard captured on April 2, 2026.