BIG-Bench Hard (BBH)

A suite of 23 challenging tasks from the BIG-Bench collaborative benchmark where prior language models failed to exceed average human performance, even with chain-of-thought prompting.

About BBH

Year: 2022
Tasks: 23
Format: Mixed reasoning tasks
Difficulty: Advanced reasoning

BBH focuses on 23 tasks from BIG-Bench that remain challenging for language models. Tasks include logical deduction, tracking shuffled objects, causal judgement, and other complex reasoning scenarios.
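BBH tasks are scored by exact match: the model's final answer is extracted from its chain-of-thought completion and compared against the target string. The sketch below shows one illustrative way to do that, assuming completions follow the "So the answer is X." convention used in the BBH chain-of-thought prompts; the helper names and toy examples are our own, not part of the benchmark.

```python
import re

def extract_answer(completion: str) -> str:
    """Pull the final answer from a chain-of-thought completion.

    Assumes the completion ends with '... the answer is X.' as in the
    BBH chain-of-thought prompt format; falls back to the raw text.
    """
    match = re.search(r"answer is\s*(.+?)\.?\s*$", completion.strip(), re.IGNORECASE)
    return match.group(1).strip() if match else completion.strip()

def exact_match_accuracy(completions, targets):
    """Fraction of completions whose extracted answer equals the target."""
    hits = sum(extract_answer(c) == t for c, t in zip(completions, targets))
    return hits / len(targets)

# Toy items in the style of two BBH tasks (illustrative, not real test items).
completions = [
    "The ball was passed twice, so Alice now holds it. So the answer is (B).",
    "not (True and False) = not False = True. So the answer is True.",
]
targets = ["(B)", "True"]
print(exact_match_accuracy(completions, targets))  # 1.0
```

Multiple-choice tasks (e.g. logical deduction) use option labels like "(B)" as targets, while free-form tasks (e.g. boolean expressions) compare literal strings, so a single exact-match comparison covers both.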

Source paper: "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them" (Suzgun et al., 2022)

Leaderboard (88 models)

#1 GPT-5.3 Codex: 98
#2 GPT-5.2: 96
#3 GPT-5.4: 95
#5 Claude Opus 4.6: 94
#6 GPT-5 (high): 94
#7 Grok 4.1: 93
#8 o1-preview: 93
#9 Gemini 3.1 Pro: 92
#11 GPT-5.1: 92
#12 GPT-5 (medium): 92
#13 GLM-5 (Reasoning): 91
#14 Kimi K2.5 (Reasoning): 91
#16 GPT-5.2-Codex: 90
#17 Gemini 3 Pro: 90
#18 o3-pro: 89
#19 Claude Sonnet 4.6: 88
#20 Claude Sonnet 4.5: 88
#21 Claude Opus 4.5: 87
#23 GPT-5 mini: 87
#24 o3: 86
#26 DeepSeekMath V2: 86
#27 GLM-4.7-Flash: 86
#28 MiMo-V2-Flash: 85
#30 GLM-4.7: 84
#31 DeepSeek Coder 2.0: 84
#32 Gemini 3 Flash: 84
#33 Grok 4: 83
#34 GLM-5: 83
#35 o4-mini (high): 83
#36 MiniMax M2.5: 83
#38 Claude 3.5 Sonnet: 83
#39 Qwen2.5-1M: 82
#40 Qwen3.5 397B: 82
#41 Claude 4 Sonnet: 82
#43 Mistral Large 2: 82
#44 GPT-4o: 82
#45 Qwen2.5-72B: 81
#46 DeepSeek V3.2: 81
#47 Gemini 2.5 Pro: 81
#48 DeepSeek LLM 2.0: 81
#49 Claude 4.1 Opus: 81
#50 Kimi K2.5: 81
#51 Mistral Large 3: 81
#52 Claude Haiku 4.5: 81
#54 Mistral 8x7B: 76
#56 GPT-4 Turbo: 75
#57 Gemini 2.5 Flash: 75
#58 Gemini 1.5 Pro: 74
#60 Claude 3 Opus: 74
#61 Llama 3 70B: 74
#62 Claude 3 Haiku: 74
#63 Z-1: 74
#64 Gemini 1.0 Pro: 73
#65 Nemotron-4 15B: 73
#66 Moonshot v1: 73
#67 GPT-OSS 120B: 73
#70 DeepSeek-R1: 66
#71 MiniMax M1 80k: 64
#74 Nova Pro: 63
#76 GLM-4.5-Air: 63
#77 Mistral 7B v0.3: 63
#79 Gemma 3 27B: 62
#81 GPT-OSS 20B: 62
#82 Mistral 8x7B v0.2: 62
#83 Qwen2.5-VL-32B: 61
#84 GLM-4.5: 61
#85 DeepSeek V3.1: 61
#86 Kimi K2: 61
#87 Llama 4 Scout: 60
#88 Qwen3 235B 2507: 60

FAQ

What does BBH measure?

BBH measures performance on 23 challenging multi-step reasoning tasks drawn from the BIG-Bench collaborative benchmark, selected because prior language models failed to exceed average human performance on them, even with chain-of-thought prompting.

Which model scores highest on BBH?

GPT-5.3 Codex by OpenAI currently leads with a score of 98 on BBH.

How many models are evaluated on BBH?

88 AI models have been evaluated on BBH on BenchLM.