A suite of 23 challenging tasks from the BIG-Bench collaborative benchmark where prior language models failed to exceed average human performance, even with chain-of-thought prompting.
Year
2022
Tasks
23 tasks
Format
Mixed reasoning tasks
Difficulty
Advanced reasoning
BBH focuses on 23 tasks from BIG-Bench that remain challenging for language models. Tasks include logical deduction, tracking shuffled objects, causal judgement, and other complex reasoning scenarios.
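To make the task format concrete, here is a minimal, illustrative sketch (not the official evaluation harness) of how a BBH-style multiple-choice item is structured and scored: each task pairs a natural-language `input` with a short `target` answer, and scoring is exact match on the model's final answer. The example question below is hypothetical but mirrors the style of the logical-deduction tasks.

```python
# Illustrative sketch of a BBH-style task record and exact-match scoring.
# The task below is a made-up example in the style of BBH's
# logical-deduction tasks; it is not taken from the benchmark itself.

task = {
    "input": (
        "Three books sit on a shelf: a red book, a green book, and a blue book. "
        "The red book is to the left of the green book. "
        "The blue book is to the right of the green book. "
        "Which book is in the middle?\n"
        "Options:\n(A) the red book\n(B) the green book\n(C) the blue book"
    ),
    "target": "(B)",
}

def exact_match(prediction: str, target: str) -> bool:
    """Score a single answer: the final answer must match the target exactly
    (after trimming surrounding whitespace)."""
    return prediction.strip() == target.strip()

# A model's (hypothetical) predictions are scored as mean exact-match accuracy.
predictions = ["(B)"]
targets = [task["target"]]
accuracy = sum(exact_match(p, t) for p, t in zip(predictions, targets)) / len(targets)
print(accuracy)  # 1.0
```

In practice, chain-of-thought prompting asks the model to reason step by step before emitting the final `(A)`/`(B)`/`(C)` option, and only that final option string is compared against the target.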
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
GPT-5.3 Codex by OpenAI currently leads with a score of 98 on BBH.
On BenchLM, 88 AI models have been evaluated on BBH.