
BIG-Bench Hard (BBH)

A suite of 23 challenging tasks from the BIG-Bench collaborative benchmark on which prior language-model evaluations had not outperformed the average human rater.

How BenchLM shows BBH right now

BenchLM tracks BBH in its local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.

These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.
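As a rough illustration of that policy, the sketch below models a tracked row whose display state depends on whether an exact-source record has been attached. The field and function names are hypothetical and do not reflect BenchLM's actual schema.

```python
# Illustrative sketch only: BenchLM's real schema is not public, so the
# field names (model, score, exact_source_attached) are assumptions.
from dataclasses import dataclass

@dataclass
class TrackedRow:
    model: str
    score: float                  # tracked BBH score, in percent
    exact_source_attached: bool   # True once an exact-source record is linked

def display_mode(row: TrackedRow) -> str:
    """Rows without an exact-source attachment stay display-only."""
    return "verified" if row.exact_source_attached else "display-only"

print(display_mode(TrackedRow("gpt-5-3-codex", 98.0, False)))  # display-only
```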

116 tracked models · Local tracked rows · Awaiting exact-source attachments · Display only

Tracked score on BBH — April 21, 2026

BenchLM mirrors the published tracked score view for BBH. GPT-5.3 Codex leads the public snapshot at 98%, followed by GPT-5.2 Pro (98%) and GPT-5.4 (97%). BenchLM does not use these results to rank models overall.

116 models · Reasoning · Stale · Saturated · Display only · Updated April 21, 2026

The published BBH snapshot is tightly clustered at the top: GPT-5.3 Codex sits at 98%, and the third-ranked model is only 1.0 point behind. The broader top-10 spread is 4.0 points, so the published top scores sit in a relatively narrow band.
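For readers who want to check the arithmetic, the short snippet below recomputes both gaps from the top-10 tracked scores listed in the table; the numbers are copied from this snapshot, not fetched from a live source.

```python
# Gap arithmetic for the snapshot's top-10 tracked BBH scores (percent, ranks 1-10).
top10 = [98, 98, 97, 97, 97, 96, 96, 95, 94, 94]

first_to_third = top10[0] - top10[2]   # 98 - 97 = 1 point
top10_spread   = top10[0] - top10[-1]  # 98 - 94 = 4 points

print(first_to_third, top10_spread)    # 1 4
```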

116 models have been evaluated on BBH. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. BBH itself is currently displayed for reference and is excluded from the scoring formula, so it does not directly affect overall rankings.
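To make the "display-only" distinction concrete, here is a minimal sketch of how a category-weighted overall score can skip benchmarks flagged as display-only. The weights, field names, and function are illustrative assumptions, not BenchLM's published formula; see the methodology page for the real rules.

```python
# Hedged sketch: category-weighted averaging that ignores display-only benchmarks.
CATEGORY_WEIGHTS = {"Reasoning": 0.17}  # Reasoning carries a 17% weight

def overall_score(results: list[dict]) -> float:
    """Weighted average over scored benchmarks, skipping display-only ones."""
    scored = [r for r in results if not r["display_only"]]
    total_weight = sum(CATEGORY_WEIGHTS[r["category"]] for r in scored)
    if total_weight == 0:
        return 0.0
    weighted = sum(r["score"] * CATEGORY_WEIGHTS[r["category"]] for r in scored)
    return weighted / total_weight

# BBH is display-only here, so it contributes nothing to the overall score.
print(overall_score([{"category": "Reasoning", "score": 98.0, "display_only": True}]))  # 0.0
```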

About BBH

Year: 2022
Tasks: 23
Format: Mixed reasoning tasks
Difficulty: Advanced reasoning

BBH focuses on 23 tasks from BIG-Bench that remain challenging for language models. Tasks include logical deduction, tracking shuffled objects, causal judgement, and other complex reasoning scenarios.

BenchLM freshness & provenance

Version: BBH 2022
Refresh cadence: Static
Staleness state: Stale
Question availability: Public benchmark set

Stale · Saturated · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
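The sketch below shows one plausible way such a decision could be expressed in code, using the Stale and Saturated flags shown on this page. The thresholds, labels, and function name are assumptions for illustration only, not BenchLM's actual policy.

```python
# Illustrative decision sketch; the mapping from flags to roles is an assumption,
# not BenchLM's published scoring policy (see the methodology page).
def benchmark_role(stale: bool, saturated: bool) -> str:
    if stale and saturated:
        return "display-only reference"   # e.g. BBH: Stale + Saturated
    if stale or saturated:
        return "benchmark to watch"
    return "strong differentiator"

print(benchmark_role(stale=True, saturated=True))  # display-only reference
```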

Tracked score table (116 models)

1. GPT-5.3 Codex (gpt-5-3-codex): 98%
2. GPT-5.2 Pro (gpt-5-2-pro): 98%
3. GPT-5.4 (gpt-5-4): 97%
4. GPT-5.3 Instant (gpt-5-3-instant): 97%
5. GPT-5.3-Codex-Spark (gpt-5-3-codex-spark): 97%
6. GPT-5.2 (gpt-5-2): 96%
7. GPT-5.2 Instant (gpt-5-2-instant): 96%
8. Gemini 3 Pro Deep Think (gemini-3-pro-deep-think): 95%
9. Claude Opus 4.6 (claude-opus-4-6): 94%
10. GPT-5 (high) (gpt-5-high): 94%
11. Grok 4.1 (grok-4-1): 93%
12. 93%
13. GPT-5.1-Codex-Max (gpt-5-1-codex-max): 92%
14. Gemini 3.1 Pro (gemini-3-1-pro): 92%
15. GPT-5 (medium) (gpt-5-medium): 92%
16. GPT-5.1 (gpt-5-1): 92%
17. GLM-5 (Reasoning) (glm-5-reasoning): 91%
18. Qwen3.5 397B (Reasoning) (qwen3-5-397b-reasoning): 91%
19. Kimi K2.5 (Reasoning) (kimi-k2-5-reasoning): 91%
20. GPT-5.2-Codex (gpt-5-2-codex): 90%
21. Gemini 3 Pro (gemini-3-pro): 90%
22. 89%
23. Claude Sonnet 4.6 (claude-sonnet-4-6): 88%
24. Claude Sonnet 4.5 (claude-sonnet-4-5): 88%
25. Nemotron 3 Super 120B A12B (nemotron-3-super-120b-a12b): 88%
26. Grok 4.1 Fast (grok-4-1-fast): 87%
27. Claude Opus 4.5 (claude-opus-4-5): 87%
28. GPT-5 mini (gpt-5-mini): 87%
29. Mercury 2 (mercury-2): 87%
30. 86%
31. DeepSeekMath V2 (deepseekmath-v2): 86%
32. DeepSeek V3.2 (Thinking) (deepseek-v3-2-thinking): 86%
33. Claude 4.1 Opus Thinking (claude-4-1-opus-thinking): 86%
34. Seed 1.6 (seed-1-6): 86%
35. GLM-4.7-Flash (glm-4-7-flash): 86%
36. MiMo-V2-Flash (mimo-v2-flash): 85%
37. Nemotron 3 Ultra 500B (nemotron-3-ultra-500b): 85%
38. Seed-2.0-Lite (seed-2-0-lite): 85%
39. GLM-4.7 (glm-4-7): 84%
40. DeepSeek Coder 2.0 (deepseek-coder-2-0): 84%
41. Gemini 3 Flash (gemini-3-flash): 84%
42. GLM-5.1 (glm-5-1): 83%
43. GLM-5 (glm-5): 83%
44. Grok 4 (grok-4): 83%
45. o4-mini (high) (o4-mini-high): 83%
46. Nemotron 3 Super 100B (nemotron-3-super-100b): 83%
47. Claude 3.5 Sonnet (claude-3-5-sonnet): 83%
48. Step 3.5 Flash (step-3-5-flash): 83%
49. MiniMax M2.5 (minimax-m2-5): 83%
50. Qwen2.5-1M (qwen2-5-1m): 82%
51. Qwen3.5 397B (qwen3-5-397b): 82%
52. Claude 4 Sonnet (claude-4-sonnet): 82%
53. Llama 3.1 405B (llama-3-1-405b): 82%
54. Mistral Large 2 (mistral-large-2): 82%
55. GPT-4o (gpt-4o): 82%
56. Kimi K2.5 (kimi-k2-5): 81%
57. Gemini 2.5 Pro (gemini-2-5-pro): 81%
58. Claude 4.1 Opus (claude-4-1-opus): 81%
59. Qwen2.5-72B (qwen2-5-72b): 81%
60. Claude Haiku 4.5 (claude-haiku-4-5): 81%
61. DeepSeek LLM 2.0 (deepseek-llm-2-0): 81%
62. DeepSeek V3.2 (deepseek-v3-2): 81%
63. Mistral Large 3 (mistral-large-3): 81%
64. Ministral 3 14B (Reasoning) (ministral-3-14b-reasoning): 80%
65. Nemotron Ultra 253B (nemotron-ultra-253b): 77%
66. Aion-2.0 (aion-2-0): 76%
67. Grok Code Fast 1 (grok-code-fast-1): 75%
68. Gemini 2.5 Flash (gemini-2-5-flash): 75%
69. GPT-4 Turbo (gpt-4-turbo): 75%
70. Seed 1.6 Flash (seed-1-6-flash): 75%
71. Gemma 4 31B (gemma-4-31b): 74.4%
72. Gemini 3.1 Flash-Lite (gemini-3-1-flash-lite): 74%
73. Gemini 1.5 Pro (gemini-1-5-pro): 74%
74. Claude 3 Opus (claude-3-opus): 74%
75. Claude 3 Haiku (claude-3-haiku): 74%
76. Z-1 (z-1): 74%
77. Llama 3 70B (llama-3-70b): 74%
78. Ministral 3 14B (ministral-3-14b): 74%
79. GPT-OSS 120B (gpt-oss-120b): 73%
80. Moonshot v1 (moonshot-v1): 73%
81. Nemotron-4 15B (nemotron-4-15b): 73%
82. Gemini 1.0 Pro (gemini-1-0-pro): 73%
83. Seed-2.0-Mini (seed-2-0-mini): 73%
84. Nemotron 3 Nano 30B (nemotron-3-nano-30b): 72%
85. Ministral 3 8B (Reasoning) (ministral-3-8b-reasoning): 70%
86. Mistral 8x7B (mistral-8x7b): 67.1%
87. LFM2.5-1.2B-Thinking (lfm2-5-1-2b-thinking): 67%
88. DeepSeek-R1 (deepseek-r1): 66%
89. Gemma 4 26B A4B (gemma-4-26b-a4b): 64.8%
90. DeepSeek V3.1 (Reasoning) (deepseek-v3-1-reasoning): 64%
91. MiniMax M1 80k (minimax-m1-80k): 64%
92. Qwen3 235B 2507 (Reasoning) (qwen3-235b-2507-reasoning): 63%
93. Llama 4 Maverick (llama-4-maverick): 63%
94. Nova Pro (nova-pro): 63%
95. GLM-4.5-Air (glm-4-5-air): 63%
96. Mistral 7B v0.3 (mistral-7b-v0-3): 63%
97. LFM2-24B-A2B (lfm2-24b-a2b): 63%
98. Ministral 3 8B (ministral-3-8b): 63%
99. Ministral 3 3B (Reasoning) (ministral-3-3b-reasoning): 63%
100. Grok 3 [Beta] (grok-3-beta): 62%
101. Llama 4 Behemoth (llama-4-behemoth): 62%
102. Gemma 3 27B (gemma-3-27b): 62%
103. GPT-OSS 20B (gpt-oss-20b): 62%
104. Mistral 8x7B v0.2 (mistral-8x7b-v0-2): 62%
105. GLM-4.5 (glm-4-5): 61%
106. DeepSeek V3.1 (deepseek-v3-1): 61%
107. Granite-4.0-H-1B (granite-4-0-h-1b): 60.4%
108. Qwen3 235B 2507 (qwen3-235b-2507): 60%
109. Llama 4 Scout (llama-4-scout): 60%
110. Granite-4.0-1B (granite-4-0-1b): 59.7%
111. LFM2.5-1.2B-Instruct (lfm2-5-1-2b-instruct): 59%
112. Ministral 3 3B (ministral-3-3b): 57%
113. Granite-4.0-350M (granite-4-0-350m): 33.3%
114. Gemma 4 E4B (gemma-4-e4b): 33.1%
115. Granite-4.0-H-350M (granite-4-0-h-350m): 33.1%
116. Gemma 4 E2B (gemma-4-e2b): 21.9%

FAQ

What does BBH measure?

A suite of 23 challenging tasks from the BIG-Bench collaborative benchmark on which prior language-model evaluations had not outperformed the average human rater.

Which model leads the published BBH snapshot?

GPT-5.3 Codex currently leads the published BBH snapshot with a tracked score of 98%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on BBH?

116 AI models are included in BenchLM's mirrored BBH snapshot, based on the public leaderboard captured on April 21, 2026.

Last updated: April 21, 2026 · mirrored from the public benchmark leaderboard
