Pencil Puzzle Bench

A multi-step verifiable reasoning benchmark that evaluates whether models can solve pencil puzzles with unique solutions.

How BenchLM shows Pencil Puzzle Bench

BenchLM mirrors the public Pencil Puzzle Bench leaderboard from the May 20, 2026 snapshot. The source benchmark evaluates 51 frontier models on 300 curated puzzles spanning 20 puzzle types, with direct-ask and agentic solve rates reported separately.

Pencil Puzzle Bench is display only on BenchLM. It is a useful multi-step reasoning reference, but the public table mixes direct prompting and agentic runs and exposes variant-specific reasoning settings, so BenchLM keeps it out of weighted model rankings for now.

65 model variants · 300 evaluation puzzles · 20 puzzle types · 17,000 eval runs · Display only

Best solve rate on Pencil Puzzle Bench — May 20, 2026 snapshot

BenchLM mirrors the published best solve rate view for Pencil Puzzle Bench. GPT-5.5 leads the public snapshot at 83.3%, followed by GPT-5.4 (70.2%) and GPT-5.2 (56.0%). BenchLM does not use these results to rank models overall.

65 models · Reasoning · Current · Display only · Updated May 20, 2026 snapshot

The published Pencil Puzzle Bench snapshot is widely spread at the top: GPT-5.5 sits at 83.3%, 13.1 points ahead of GPT-5.4 and 27.3 points ahead of third-place GPT-5.2. The top-10 spread is 53.3 points (83.3% down to 30.0%), so the benchmark clearly separates even the strongest models.
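The gaps quoted above can be re-derived from the top ten rows of the leaderboard table. The following is a minimal sketch, not BenchLM's own tooling; the list `top10` and the helper `gap` are illustrative names, and the scores are copied by hand from the table on this page.

```python
# Top-10 best solve rates in percent, copied from the leaderboard table
# (ranks 1 through 10). The variable and function names are illustrative.
top10 = [83.3, 70.2, 56.0, 50.0, 43.3, 36.7, 36.7, 33.3, 33.3, 30.0]


def gap(scores, i, j):
    """Gap in percentage points between rank i and rank j (1-indexed)."""
    return round(scores[i - 1] - scores[j - 1], 1)


leader_to_third = gap(top10, 1, 3)   # 83.3 - 56.0
top10_spread = gap(top10, 1, 10)     # 83.3 - 30.0

print(leader_to_third, top10_spread)  # prints: 27.3 53.3
```

Rounding to one decimal place keeps floating-point noise out of the reported point differences.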

65 models have been evaluated on Pencil Puzzle Bench. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. Pencil Puzzle Bench itself is displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.
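To illustrate what "display only" means for scoring, here is a rough sketch under stated assumptions. Only the 17% Reasoning weight comes from the text above; the simple averaging rule, the `BenchmarkResult` type, the `reasoning_contribution` function, and the two placeholder benchmarks with their scores are all hypothetical and do not describe BenchLM's actual formula.

```python
from dataclasses import dataclass

REASONING_WEIGHT = 0.17  # category weight stated in the text


@dataclass
class BenchmarkResult:
    name: str
    score: float        # percent, 0-100
    display_only: bool   # True means shown but not scored


def reasoning_contribution(results):
    """Hypothetical rule: average the scored benchmarks, then apply the weight."""
    scored = [r.score for r in results if not r.display_only]
    if not scored:
        return 0.0
    return REASONING_WEIGHT * sum(scored) / len(scored)


# Placeholder benchmarks and scores, invented purely for illustration.
results = [
    BenchmarkResult("hypothetical-benchmark-a", 60.0, False),
    BenchmarkResult("hypothetical-benchmark-b", 80.0, False),
    BenchmarkResult("Pencil Puzzle Bench", 83.3, True),
]

with_ppb = reasoning_contribution(results)
without_ppb = reasoning_contribution(results[:2])
print(with_ppb == without_ppb)  # prints: True
```

Under this assumed rule, adding or removing the display-only entry leaves the category contribution unchanged, which is the behavior the paragraph describes.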

About Pencil Puzzle Bench

Year: 2026
Tasks: 300 evaluation puzzles
Format: Direct and agentic puzzle solve rate
Difficulty: Multi-step verifiable reasoning

BenchLM mirrors the public Pencil Puzzle Bench leaderboard as a display-only reasoning benchmark. The public site reports direct-ask and agentic solve rates across a 300-puzzle evaluation selection from the 62,231-puzzle dataset.

BenchLM freshness & provenance

Version: Pencil Puzzle Bench 2026
Refresh cadence: Static
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Best solve rate table (65 models)

Rank  Model                       Variant ID                  Best solve rate
1     GPT-5.5                     gpt-5.5@xhigh               83.3%
2     GPT-5.4                     gpt-5.4@xhigh               70.2%
3     GPT-5.2                     gpt-5.2@xhigh               56.0%
4     Claude Opus 4.7             claude-opus-4-7@thinking    50.0%
5     Gemini 3.5 Flash            gemini-3.5-flash@high       43.3%
6     GPT-5.2                     gpt-5.2@high                36.7%
7     (missing)                                               36.7%
8     Claude Opus 4.6 (Adaptive)  claude-opus-4-6@thinking    33.3%
9     Gemini 3.1 Pro              gemini-3.1-pro              33.3%
10    Claude Opus 4.6             claude-opus-4-6             30.0%
11    Claude Sonnet 4.6           claude-sonnet-4-6@thinking  26.7%
12    GPT-5.2 Pro                 gpt-5.2-pro                 26.7%
13    GPT-5.2                     gpt-5.2@medium              23.3%
14    (missing)                                               23.3%
15    (missing)                                               23.3%
16    Kimi K2.6                   kimi-k2.6                   20.0%
17    Gemini 3 Pro                gemini-3-pro@high           16.7%
18    Claude Sonnet 4.6           claude-sonnet-4-6           16.7%
19    Gemini 3 Pro                gemini-3-pro                13.3%
20    Gemini 3 Pro                gemini-3-pro@minimal        10.0%
21    GPT-5.2                     gpt-5.2@low                 10.0%
22    Qwen3.6 Plus                qwen3.6-plus                10.0%
23    GPT-5.1                     gpt-5.1@medium              7.7%
24    Claude Opus 4.5 Thinking    claude-opus-4-5@thinking    6.7%
25    Gemini 3 Flash              gemini-3-flash@minimal      6.7%
26    Gemini 3 Flash              gemini-3-flash@high         6.7%
27    Grok 4.20                   grok-4.20-reasoning         6.7%
28    GPT-5 (high)                gpt-5@medium                6.0%
29    Kimi K2.5                   kimi-k2.5                   6.0%
30    Grok 4.1 Fast               grok-4-1-fast               5.7%
31    Grok 4.1 Fast (Reasoning)   grok-4-1-fast-reasoning     5.3%
32    DeepSeek V4 Pro             deepseek-v4-pro             4.0%
33    Grok 4.3                    grok-4.3@xhigh              3.3%
34    (missing)                                               3.3%
35    MiniMax M2.5                minimax-m2.5                3.3%
36    Claude Opus 4.5             claude-opus-4-5-high        3.3%
37    Claude Sonnet 4.5           claude-sonnet-4-5           3.3%
39    Claude Sonnet 4.5 Thinking  claude-sonnet-4-5@thinking  2.3%
40    DeepSeek V3.2               deepseek-v3.2               2.0%
41    Grok 4.3                    grok-4.3                    2.0%
42    Kimi K2                     kimi-k2-thinking            1.3%
43    MiMo-V2-Pro                 mimo-v2-pro                 1.0%
44    (missing)                                               0.7%
45    MiniMax M2.7                minimax-m2.7                0.7%
46    (missing)                                               0.7%
47    GLM-5                       glm-5                       0.7%
48    Gemini 2.5 Pro              gemini-2.5-pro              0.3%
49    GPT-5.2                     gpt-5.2                     0.3%
50    (missing)                                               0.3%
51    (missing)                                               0.3%
52    GPT-OSS 120B                gpt-oss-120b                0.3%
56    MiMo-V2-Flash               mimo-v2-flash               0.3%
57    GLM-4.7                     glm-4.7                     0.3%
58    Grok Code Fast 1            grok-code-fast-1            0.3%
59    (missing)                                               0.0%
60    GPT-4.1                     gpt-4.1                     0.0%
61    GPT-4o                      gpt-4o                      0.0%
62    (missing)                                               0.0%
63    (missing)                                               0.0%
64    (missing)                                               0.0%
65    (missing)                                               0.0%

FAQ

What does Pencil Puzzle Bench measure?

A multi-step verifiable reasoning benchmark that evaluates whether models can solve pencil puzzles with unique solutions.

Which model leads the published Pencil Puzzle Bench snapshot?

GPT-5.5 currently leads the published Pencil Puzzle Bench snapshot with an 83.3% best solve rate. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on Pencil Puzzle Bench?

65 AI models are included in BenchLM's mirrored Pencil Puzzle Bench snapshot, based on the public leaderboard as captured in the May 20, 2026 snapshot.

Last updated: May 20, 2026 snapshot · mirrored from the public benchmark leaderboard