A multi-step verifiable reasoning benchmark that evaluates whether models can solve pencil puzzles with unique solutions.
BenchLM mirrors the public Pencil Puzzle Bench leaderboard from the May 20, 2026 snapshot. The source benchmark evaluates 51 frontier models on 300 curated puzzles spanning 20 puzzle types, with direct-ask and agentic solve rates reported separately.
Pencil Puzzle Bench is display-only on BenchLM. It is a useful multi-step reasoning reference, but the public table mixes direct prompting with agentic runs and exposes variant-specific reasoning settings, so BenchLM keeps it out of weighted model rankings for now.
BenchLM mirrors the published best-solve-rate view for Pencil Puzzle Bench. GPT-5.5 leads the public snapshot at 83.3%, followed by GPT-5.4 (70.2%) and GPT-5.2 (56.0%). BenchLM does not use these results to rank models overall.
GPT-5.5 (OpenAI, gpt-5.5@xhigh)
GPT-5.4 (OpenAI, gpt-5.4@xhigh)
GPT-5.2 (OpenAI, gpt-5.2@xhigh)
The published Pencil Puzzle Bench snapshot shows clear separation at the top: GPT-5.5 sits at 83.3%, and the third row, at 56.0%, is 27.3 points behind. The broader top-10 spread is 53.3 points, so the benchmark continues to separate strong models further down the table.
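The gaps quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `gap_to_leader` is hypothetical, the three scores are the ones quoted in this page, and it assumes "top-10 spread" means the first-place rate minus the tenth-place rate.

```python
# Best solve rates (percent) for the top three rows, as quoted in the text.
top_three = {"GPT-5.5": 83.3, "GPT-5.4": 70.2, "GPT-5.2": 56.0}


def gap_to_leader(scores):
    """Return each model's gap to the highest score, in percentage points."""
    leader = max(scores.values())
    return {name: round(leader - score, 1) for name, score in scores.items()}


gaps = gap_to_leader(top_three)
print(gaps)

# Assumption: spread = first-place rate minus tenth-place rate.
# Under that reading, the quoted 53.3-point spread implies the tenth row's rate.
top_10_spread = 53.3
implied_tenth = round(max(top_three.values()) - top_10_spread, 1)
print(implied_tenth)
```

Under that assumption, the third row's gap matches the 27.3 points stated in the text, and the tenth row would sit near 30%.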
65 models have been evaluated on Pencil Puzzle Bench. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. Pencil Puzzle Bench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
Year: 2026
Tasks: 300 evaluation puzzles
Format: Direct and agentic puzzle solve rate
Difficulty: Multi-step verifiable reasoning
BenchLM mirrors the public Pencil Puzzle Bench leaderboard as a display-only reasoning benchmark. The public site reports direct-ask and agentic solve rates across a 300-puzzle evaluation selection from the 62,231-puzzle dataset.
Version: Pencil Puzzle Bench 2026
Refresh cadence: Static
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
GPT-5.5 currently leads the published Pencil Puzzle Bench snapshot with an 83.3% best solve rate. BenchLM shows this benchmark for display only and does not use it in overall rankings.
65 AI models are included in BenchLM's mirrored Pencil Puzzle Bench snapshot, based on the public leaderboard captured on May 20, 2026.