Pencil Puzzle Bench

Name: Pencil Puzzle Bench
Creator: BenchLM

A multi-step verifiable reasoning benchmark that evaluates whether models can solve pencil puzzles with unique solutions.

How BenchLM shows Pencil Puzzle Bench

BenchLM mirrors the public Pencil Puzzle Bench leaderboard from the June 19, 2026 snapshot. The source benchmark evaluates 51 frontier models on 300 curated puzzles spanning 20 puzzle types, with direct-ask and agentic solve rates reported separately.

Pencil Puzzle Bench is display only on BenchLM. It is a useful multi-step reasoning reference, but the public table mixes direct prompting and agentic runs and exposes variant-specific reasoning settings, so BenchLM keeps it out of weighted model rankings for now.
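To illustrate why variant-specific reasoning settings complicate a single ranking, here is a minimal Python sketch that groups leaderboard rows by base model. It assumes that the `@` in a variant ID such as `gpt-5.2@xhigh` separates the base model from its reasoning setting; that reading is inferred from the IDs on this page and is not a documented format. The `split_variant` helper is hypothetical, and IDs that encode a setting with a hyphen (for example `claude-opus-4-5-high`) are not handled.

```python
from collections import defaultdict
from typing import Optional, Tuple

def split_variant(variant_id: str) -> Tuple[str, Optional[str]]:
    """Split 'gpt-5.2@xhigh' into ('gpt-5.2', 'xhigh').

    Assumption: '@' separates base model from reasoning setting.
    An ID without '@' is returned with a setting of None.
    """
    base, sep, setting = variant_id.partition("@")
    return base, (setting if sep else None)

# Best solve rates (%) for the GPT-5.2 rows listed on this page.
rows = {
    "gpt-5.2@xhigh": 56.0,
    "gpt-5.2@high": 36.7,
    "gpt-5.2@medium": 23.3,
    "gpt-5.2@low": 10.0,
    "gpt-5.2": 0.3,
}

# Group every variant under its base model.
by_base = defaultdict(dict)
for variant_id, rate in rows.items():
    base, setting = split_variant(variant_id)
    by_base[base][setting] = rate

# Spread between the best and worst setting of the same base model.
rates = by_base["gpt-5.2"].values()
setting_spread = round(max(rates) - min(rates), 1)
print(f"gpt-5.2 spread across settings: {setting_spread} points")
```

Under this reading, one base model spans tens of points depending on its setting, which is consistent with BenchLM treating the table as a reference rather than a ranking input.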

73 model variants · 300 evaluation puzzles · 20 puzzle types · 17,000 eval runs · Display only

Pencil Puzzle Bench leaderboard · Hugging Face dataset · Paper · GitHub repository

Best solve rate on Pencil Puzzle Bench — June 19, 2026 snapshot

BenchLM mirrors the published best solve rate view for Pencil Puzzle Bench. Claude Fable 5 leads the public snapshot at 97.6%, followed by GPT-5.5 (83.3%) and GPT-5.4 (70.2%). BenchLM does not use these results to rank models overall.

1. Claude Fable 5 (Anthropic, Closed) · claude-fable-5@high · 97.6% · Overall 89 · Context 1M+

2. GPT-5.5 (OpenAI, Closed) · gpt-5.5@xhigh · 83.3% · Overall 78 · Context 1M

3. GPT-5.4 (OpenAI, Closed) · gpt-5.4@xhigh · 70.2% · Overall 86 · Context 1.05M

73 models · Reasoning · Current · Display only · Updated June 19, 2026 snapshot

The published Pencil Puzzle Bench snapshot is widely spread at the top: Claude Fable 5 sits at 97.6%, 14.3 points ahead of GPT-5.5, and the third row is 27.4 points behind the leader. The broader top-10 spread is 64.3 points, so the benchmark clearly separates even the strongest models.
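The point gaps quoted above follow directly from the top-10 best solve rates listed on this page. A minimal Python sketch of that arithmetic, with the values copied from the table and variable names chosen for illustration:

```python
# Top-10 best solve rates (%) from the June 19, 2026 snapshot, in table order.
top10 = [97.6, 83.3, 70.2, 56.0, 50.0, 41.9, 40.0, 36.7, 36.7, 33.3]

# Round to one decimal to avoid floating-point noise in the differences.
gap_to_second = round(top10[0] - top10[1], 1)     # leader vs. GPT-5.5
gap_to_third = round(top10[0] - top10[2], 1)      # leader vs. GPT-5.4
top10_spread = round(max(top10) - min(top10), 1)  # leader vs. tenth row

print(gap_to_second, gap_to_third, top10_spread)
```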

73 models have been evaluated on Pencil Puzzle Bench. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. Pencil Puzzle Bench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About Pencil Puzzle Bench

Year: 2026
Tasks: 300 evaluation puzzles
Format: Direct and agentic puzzle solve rate
Difficulty: Multi-step verifiable reasoning

BenchLM mirrors the public Pencil Puzzle Bench leaderboard as a display-only reasoning benchmark. The public site reports direct-ask and agentic solve rates across a 300-puzzle evaluation selection from the 62,231-puzzle dataset.
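As a rough illustration of reporting direct-ask and agentic solve rates separately, the sketch below tallies solved puzzles per mode. The record layout, the mode labels, and the `solve_rates` helper are all hypothetical; the source does not describe how the benchmark stores its runs, and the sample data is invented for the example.

```python
from collections import Counter

# Hypothetical run records: (puzzle_id, mode, solved). Not real benchmark data.
runs = [
    ("p001", "direct", True),
    ("p001", "agentic", True),
    ("p002", "direct", False),
    ("p002", "agentic", True),
    ("p003", "direct", False),
    ("p003", "agentic", False),
]

def solve_rates(records):
    """Return the fraction of solved runs for each mode, kept separate."""
    attempts, solved = Counter(), Counter()
    for _puzzle_id, mode, ok in records:
        attempts[mode] += 1
        solved[mode] += int(ok)
    return {mode: solved[mode] / attempts[mode] for mode in attempts}

rates = solve_rates(runs)
for mode, rate in sorted(rates.items()):
    print(f"{mode}: {rate:.1%}")
```

Keeping the two modes in separate buckets avoids averaging runs where the model answers in one shot with runs where it can iterate, which would blur what each number means.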

Source: Pencil Puzzle Bench (public benchmark source)

BenchLM freshness & provenance

Version: Pencil Puzzle Bench 2026
Refresh cadence: Static
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Best solve rate table (73 models)

Model · Variant ID · Creator · Open/Closed · Best solve rate
(A hyphen marks a field that the mirrored table leaves blank.)

Claude Fable 5 · claude-fable-5@high · Anthropic · Closed · 97.6%
GPT-5.5 · gpt-5.5@xhigh · OpenAI · Closed · 83.3%
GPT-5.4 · gpt-5.4@xhigh · OpenAI · Closed · 70.2%
GPT-5.2 · gpt-5.2@xhigh · OpenAI · Closed · 56.0%
Claude Opus 4.7 · claude-opus-4-7@thinking · Anthropic · Closed · 50.0%
Gemini 3.5 Flash · gemini-3.5-flash@high · Google · Closed · 41.9%
Qwen3.7 Max · qwen3.7-max · Alibaba · Closed · 40.0%
GPT-5.2 · gpt-5.2@high · OpenAI · Closed · 36.7%
- · claude-opus-4-6-1m · Anthropic · - · 36.7%
Claude Opus 4.6 (Adaptive) · claude-opus-4-6@thinking · Anthropic · Closed · 33.3%
Gemini 3.1 Pro · gemini-3.1-pro · Google · Closed · 33.3%
Claude Opus 4.6 · claude-opus-4-6 · Anthropic · Closed · 30.0%
GLM-5.2 · glm-5.2 · Z.AI · Open · 26.7%
Claude Sonnet 4.6 · claude-sonnet-4-6@thinking · Anthropic · Closed · 26.7%
GPT-5.2 Pro · gpt-5.2-pro · OpenAI · Closed · 26.7%
GPT-5.2 · gpt-5.2@medium · OpenAI · Closed · 23.3%
- · claude-opus-4-6@max · Anthropic · - · 23.3%
- · claude-sonnet-4-6-1m · Anthropic · - · 23.3%
Kimi K2.6 · kimi-k2.6 · Moonshot AI · Open · 20.0%
Kimi K2.7 Code · kimi-k2.7-code · Moonshot AI · Open · 16.7%
Qwen3.7 Plus · qwen3.7-plus · Alibaba · Closed · 16.7%
Gemini 3 Pro · gemini-3-pro@high · Google · Closed · 16.7%
Claude Sonnet 4.6 · claude-sonnet-4-6 · Anthropic · Closed · 16.7%
Gemini 3 Pro · gemini-3-pro · Google · Closed · 13.3%
Gemini 3 Pro · gemini-3-pro@minimal · Google · Closed · 10.0%
GPT-5.2 · gpt-5.2@low · OpenAI · Closed · 10.0%
Qwen3.6 Plus · qwen3.6-plus · Alibaba · Closed · 10.0%
GPT-5.1 · gpt-5.1@medium · OpenAI · Closed · 7.7%
MiniMax M3 · minimax-m3 · MiniMax · Open · 7.1%
Claude Opus 4.5 Thinking · claude-opus-4-5@thinking · Anthropic · Closed · 6.7%
Gemini 3 Flash · gemini-3-flash@minimal · Google · Closed · 6.7%
Gemini 3 Flash · gemini-3-flash@high · Google · Closed · 6.7%
Grok 4.20 · grok-4.20-reasoning · xAI · Closed · 6.7%
GPT-5 (high) · gpt-5@medium · OpenAI · Closed · 6.0%
Kimi K2.5 · kimi-k2.5 · Moonshot AI · Open · 6.0%
Grok 4.1 Fast · grok-4-1-fast · xAI · Closed · 5.7%
Grok 4.1 Fast (Reasoning) · grok-4-1-fast-reasoning · xAI · Closed · 5.3%
DeepSeek V4 Pro · deepseek-v4-pro · DeepSeek · Open · 4.0%
Grok 4.3 · grok-4.3@xhigh · xAI · Closed · 3.3%
- · - · OpenAI · Closed · 3.3%
- · nemotron-3-ultra-550b-a55b · Other · - · 3.3%
MiniMax M2.5 · minimax-m2.5 · MiniMax · Closed · 3.3%
Claude Opus 4.5 · claude-opus-4-5-high · Anthropic · Closed · 3.3%
Claude Sonnet 4.5 · claude-sonnet-4-5 · Anthropic · Closed · 3.3%
- · deepseek-v3.2-speciale · DeepSeek · - · 2.3%
Claude Sonnet 4.5 Thinking · claude-sonnet-4-5@thinking · Anthropic · Closed · 2.3%
DeepSeek V3.2 · deepseek-v3.2 · DeepSeek · Open · 2.0%
Grok 4.3 · grok-4.3 · xAI · Closed · 2.0%
Kimi K2 · kimi-k2-thinking · Moonshot AI · Closed · 1.3%
MiMo-V2-Pro · mimo-v2-pro · Xiaomi · Closed · 1.0%
- · - · OpenAI · Closed · 0.7%
MiniMax M2.7 · minimax-m2.7 · MiniMax · Open · 0.7%
- · qwen3.5-397b-a17b · Alibaba · - · 0.7%
GLM-5 · glm-5 · Z.AI · Open · 0.7%
Gemini 2.5 Pro · gemini-2.5-pro · Google · Closed · 0.3%
GPT-5.2 · gpt-5.2 · OpenAI · Closed · 0.3%
- · gemma-4-31b-it · Other · - · 0.3%
- · minimax-m2.1 · MiniMax · - · 0.3%
GPT-OSS 120B · gpt-oss-120b · OpenAI · Open · 0.3%
- · qwen3-235b-a22b-thinking-2507 · Alibaba · - · 0.3%
- · qwen3-next-80b-a3b-thinking · Alibaba · - · 0.3%
- · qwen3-vl-235b-a22b-thinking · Alibaba · - · 0.3%
MiMo-V2-Flash · mimo-v2-flash · Xiaomi · Open · 0.3%
GLM-4.7 · glm-4.7 · Z.AI · Open · 0.3%
Grok Code Fast 1 · grok-code-fast-1 · xAI · Closed · 0.3%
Gemini 3.5 Flash · gemini-3.5-flash@low · Google · Closed · 0.0%
- · gpt-3.5-turbo · OpenAI · - · 0.0%
GPT-4.1 · gpt-4.1 · OpenAI · Closed · 0.0%
GPT-4o · gpt-4o · OpenAI · Closed · 0.0%
- · devstral-2512 · Mistral · - · 0.0%
- · mistral-large-2512 · Mistral · - · 0.0%
- · mistral-small-2603 · Mistral · - · 0.0%
- · qwen3-coder · Alibaba · - · 0.0%

FAQ

What does Pencil Puzzle Bench measure?

A multi-step verifiable reasoning benchmark that evaluates whether models can solve pencil puzzles with unique solutions.

Which model leads the published Pencil Puzzle Bench snapshot?

Claude Fable 5 leads the published Pencil Puzzle Bench snapshot with a 97.6% best solve rate. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on Pencil Puzzle Bench?

73 AI models are included in BenchLM's mirrored Pencil Puzzle Bench snapshot, based on the public leaderboard as captured in the June 19, 2026 snapshot.

Last updated: June 19, 2026 snapshot · mirrored from the public benchmark leaderboard
