A suite of 23 challenging tasks from the BIG-Bench collaborative benchmark where prior language models failed to exceed average human performance, even with chain-of-thought prompting.
BenchLM mirrors the published score view for BBH. DeepSeek V4 Pro Base leads the public snapshot at 87.5%, followed by DeepSeek V4 Flash Base (86.9%) and MiniCPM5-1B (71.9%). BenchLM does not use these results to rank models overall.
DeepSeek V4 Pro Base (DeepSeek)
DeepSeek V4 Flash Base (DeepSeek)
MiniCPM5-1B (OpenBMB)
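The published BBH snapshot is tightly clustered only at the very top: DeepSeek V4 Pro Base sits at 87.5% and DeepSeek V4 Flash Base at 86.9%, a gap of 0.6 points. The third row, MiniCPM5-1B at 71.9%, is 15.6 points behind the leader. Because only three models have been evaluated, that 15.6-point gap is also the full spread of the snapshot, so the benchmark still separates models even when the leaders cluster.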
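The gaps quoted above are simple differences in percentage points. A minimal sketch of that arithmetic follows, using the three published scores; the function name `gaps_to_leader` is a hypothetical helper for illustration, not part of any BenchLM tooling.

```python
# Published BBH scores from the snapshot (percent).
scores = {
    "DeepSeek V4 Pro Base": 87.5,
    "DeepSeek V4 Flash Base": 86.9,
    "MiniCPM5-1B": 71.9,
}


def gaps_to_leader(scores):
    """Return each model's gap to the top score, in percentage points.

    Hypothetical helper: rounds to one decimal place to match the
    precision of the published scores.
    """
    top = max(scores.values())
    return {name: round(top - value, 1) for name, value in scores.items()}


gaps = gaps_to_leader(scores)

# List models from smallest to largest gap.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.1f} points behind the leader")
```

Rounding to one decimal keeps the result at the same precision as the published scores and hides binary floating-point noise in the subtraction.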
Three models have been evaluated on BBH. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. BBH itself is currently displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.
Year: 2022
Tasks: 23 tasks
Format: Mixed reasoning tasks
Difficulty: Advanced reasoning
BBH focuses on 23 tasks from BIG-Bench that remain challenging for language models. Tasks include logical deduction, tracking shuffled objects, causal judgement, and other complex reasoning scenarios.
Version: BBH 2022
Refresh cadence: Static
Staleness state: Stale
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.