
BIG-Bench Hard (BBH)

A suite of 23 challenging tasks from the BIG-Bench collaborative benchmark on which prior language-model evaluations had not outperformed the average human rater.

How BenchLM shows BBH right now

BenchLM tracks BBH in its local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.

These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.
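As a rough illustration of that policy, the sketch below models a tracked row whose display state depends on whether an exact-source record has been attached. The field and function names are hypothetical and do not reflect BenchLM's actual schema.

```python
# Illustrative sketch only: BenchLM's real schema is not public, so the
# field names (model, score, exact_source_attached) are assumptions.
from dataclasses import dataclass

@dataclass
class TrackedRow:
    model: str
    score: float                  # tracked BBH score, in percent
    exact_source_attached: bool   # True once an exact-source record is linked

def display_mode(row: TrackedRow) -> str:
    """Rows without an exact-source attachment stay display-only."""
    return "verified" if row.exact_source_attached else "display-only"

print(display_mode(TrackedRow("gpt-5-3-codex", 98.0, False)))  # display-only
```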

116 tracked models · Local tracked rows · Awaiting exact-source attachments · Display only

Tracked score on BBH — April 21, 2026

BenchLM mirrors the published tracked score view for BBH. GPT-5.3 Codex leads the public snapshot at 98%, followed by GPT-5.2 Pro (98%) and GPT-5.4 (97%). BenchLM does not use these results to rank models overall.

116 models · Reasoning · Stale · Saturated · Display only · Updated April 21, 2026

The published BBH snapshot is tightly clustered at the top: GPT-5.3 Codex sits at 98%, and the third-ranked model is only 1.0 point behind. The broader top-10 spread is 4.0 points, so the published top scores sit in a relatively narrow band.
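For readers who want to check the arithmetic, the short snippet below recomputes both gaps from the top-10 tracked scores listed in the table; the numbers are copied from this snapshot, not fetched from a live source.

```python
# Gap arithmetic for the snapshot's top-10 tracked BBH scores (percent, ranks 1-10).
top10 = [98, 98, 97, 97, 97, 96, 96, 95, 94, 94]

first_to_third = top10[0] - top10[2]   # 98 - 97 = 1 point
top10_spread   = top10[0] - top10[-1]  # 98 - 94 = 4 points

print(first_to_third, top10_spread)    # 1 4
```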

116 models have been evaluated on BBH. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. BBH itself is currently displayed for reference and is excluded from the scoring formula, so it does not directly affect overall rankings.
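To make the "display-only" distinction concrete, here is a minimal sketch of how a category-weighted overall score can skip benchmarks flagged as display-only. The weights, field names, and function are illustrative assumptions, not BenchLM's published formula; see the methodology page for the real rules.

```python
# Hedged sketch: category-weighted averaging that ignores display-only benchmarks.
CATEGORY_WEIGHTS = {"Reasoning": 0.17}  # Reasoning carries a 17% weight

def overall_score(results: list[dict]) -> float:
    """Weighted average over scored benchmarks, skipping display-only ones."""
    scored = [r for r in results if not r["display_only"]]
    total_weight = sum(CATEGORY_WEIGHTS[r["category"]] for r in scored)
    if total_weight == 0:
        return 0.0
    weighted = sum(r["score"] * CATEGORY_WEIGHTS[r["category"]] for r in scored)
    return weighted / total_weight

# BBH is display-only here, so it contributes nothing to the overall score.
print(overall_score([{"category": "Reasoning", "score": 98.0, "display_only": True}]))  # 0.0
```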

About BBH

Year: 2022
Tasks: 23
Format: Mixed reasoning tasks
Difficulty: Advanced reasoning

BBH focuses on 23 tasks from BIG-Bench that remain challenging for language models. Tasks include logical deduction, tracking shuffled objects, causal judgement, and other complex reasoning scenarios.

BenchLM freshness & provenance

Version: BBH 2022
Refresh cadence: Static
Staleness state: Stale
Question availability: Public benchmark set

Stale · Saturated · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
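The sketch below shows one plausible way such a decision could be expressed in code, using the Stale and Saturated flags shown on this page. The thresholds, labels, and function name are assumptions for illustration only, not BenchLM's actual policy.

```python
# Illustrative decision sketch; the mapping from flags to roles is an assumption,
# not BenchLM's published scoring policy (see the methodology page).
def benchmark_role(stale: bool, saturated: bool) -> str:
    if stale and saturated:
        return "display-only reference"   # e.g. BBH: Stale + Saturated
    if stale or saturated:
        return "benchmark to watch"
    return "strong differentiator"

print(benchmark_role(stale=True, saturated=True))  # display-only reference
```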

Tracked score table (116 models)

1. GPT-5.3 Codex (gpt-5-3-codex): 98%
2. GPT-5.2 Pro (gpt-5-2-pro): 98%
3. GPT-5.4 (gpt-5-4): 97%
4. GPT-5.3 Instant (gpt-5-3-instant): 97%
5. GPT-5.3-Codex-Spark (gpt-5-3-codex-spark): 97%
6. GPT-5.2 (gpt-5-2): 96%
7. GPT-5.2 Instant (gpt-5-2-instant): 96%
8. Gemini 3 Pro Deep Think (gemini-3-pro-deep-think): 95%
9. Claude Opus 4.6 (claude-opus-4-6): 94%
10. GPT-5 (high) (gpt-5-high): 94%
11. Grok 4.1 (grok-4-1): 93%
12. 93%
13. GPT-5.1-Codex-Max (gpt-5-1-codex-max): 92%
14. Gemini 3.1 Pro (gemini-3-1-pro): 92%
15. GPT-5 (medium) (gpt-5-medium): 92%
16. GPT-5.1 (gpt-5-1): 92%
17. GLM-5 (Reasoning) (glm-5-reasoning): 91%
18. Qwen3.5 397B (Reasoning) (qwen3-5-397b-reasoning): 91%
19. Kimi K2.5 (Reasoning) (kimi-k2-5-reasoning): 91%
20. GPT-5.2-Codex (gpt-5-2-codex): 90%
21. Gemini 3 Pro (gemini-3-pro): 90%
22. 89%
23. Claude Sonnet 4.6 (claude-sonnet-4-6): 88%
24. Claude Sonnet 4.5 (claude-sonnet-4-5): 88%
25. Nemotron 3 Super 120B A12B (nemotron-3-super-120b-a12b): 88%
26. Grok 4.1 Fast (grok-4-1-fast): 87%
27. Claude Opus 4.5 (claude-opus-4-5): 87%
28. GPT-5 mini (gpt-5-mini): 87%
29. Mercury 2 (mercury-2): 87%
30. 86%
31. DeepSeekMath V2 (deepseekmath-v2): 86%
32. DeepSeek V3.2 (Thinking) (deepseek-v3-2-thinking): 86%
33. Claude 4.1 Opus Thinking (claude-4-1-opus-thinking): 86%
34. Seed 1.6 (seed-1-6): 86%
35. GLM-4.7-Flash (glm-4-7-flash): 86%
36. MiMo-V2-Flash (mimo-v2-flash): 85%
37. Nemotron 3 Ultra 500B (nemotron-3-ultra-500b): 85%
38. Seed-2.0-Lite (seed-2-0-lite): 85%
39. GLM-4.7 (glm-4-7): 84%
40. DeepSeek Coder 2.0 (deepseek-coder-2-0): 84%
41. Gemini 3 Flash (gemini-3-flash): 84%
42. GLM-5.1 (glm-5-1): 83%
43. GLM-5 (glm-5): 83%
44. Grok 4 (grok-4): 83%
45. o4-mini (high) (o4-mini-high): 83%
46. Nemotron 3 Super 100B (nemotron-3-super-100b): 83%
47. Claude 3.5 Sonnet (claude-3-5-sonnet): 83%
48. Step 3.5 Flash (step-3-5-flash): 83%
49. MiniMax M2.5 (minimax-m2-5): 83%
50. Qwen2.5-1M (qwen2-5-1m): 82%
51. Qwen3.5 397B (qwen3-5-397b): 82%
52. Claude 4 Sonnet (claude-4-sonnet): 82%
53. Llama 3.1 405B (llama-3-1-405b): 82%
54. Mistral Large 2 (mistral-large-2): 82%
55. GPT-4o (gpt-4o): 82%
56. Kimi K2.5 (kimi-k2-5): 81%
57. Gemini 2.5 Pro (gemini-2-5-pro): 81%
58. Claude 4.1 Opus (claude-4-1-opus): 81%
59. Qwen2.5-72B (qwen2-5-72b): 81%
60. Claude Haiku 4.5 (claude-haiku-4-5): 81%
61. DeepSeek LLM 2.0 (deepseek-llm-2-0): 81%
62. DeepSeek V3.2 (deepseek-v3-2): 81%
63. Mistral Large 3 (mistral-large-3): 81%
64. Ministral 3 14B (Reasoning) (ministral-3-14b-reasoning): 80%
65. Nemotron Ultra 253B (nemotron-ultra-253b): 77%
66. Aion-2.0 (aion-2-0): 76%
67. Grok Code Fast 1 (grok-code-fast-1): 75%
68. Gemini 2.5 Flash (gemini-2-5-flash): 75%
69. GPT-4 Turbo (gpt-4-turbo): 75%
70. Seed 1.6 Flash (seed-1-6-flash): 75%
71. Gemma 4 31B (gemma-4-31b): 74.4%
72. Gemini 3.1 Flash-Lite (gemini-3-1-flash-lite): 74%
73. Gemini 1.5 Pro (gemini-1-5-pro): 74%
74. Claude 3 Opus (claude-3-opus): 74%
75. Claude 3 Haiku (claude-3-haiku): 74%
76. Z-1 (z-1): 74%
77. Llama 3 70B (llama-3-70b): 74%
78. Ministral 3 14B (ministral-3-14b): 74%
79. GPT-OSS 120B (gpt-oss-120b): 73%
80. Moonshot v1 (moonshot-v1): 73%
81. Nemotron-4 15B (nemotron-4-15b): 73%
82. Gemini 1.0 Pro (gemini-1-0-pro): 73%
83. Seed-2.0-Mini (seed-2-0-mini): 73%
84. Nemotron 3 Nano 30B (nemotron-3-nano-30b): 72%
85. Ministral 3 8B (Reasoning) (ministral-3-8b-reasoning): 70%
86. Mistral 8x7B (mistral-8x7b): 67.1%
87. LFM2.5-1.2B-Thinking (lfm2-5-1-2b-thinking): 67%
88. DeepSeek-R1 (deepseek-r1): 66%
89. Gemma 4 26B A4B (gemma-4-26b-a4b): 64.8%
90. DeepSeek V3.1 (Reasoning) (deepseek-v3-1-reasoning): 64%
91. MiniMax M1 80k (minimax-m1-80k): 64%
92. Qwen3 235B 2507 (Reasoning) (qwen3-235b-2507-reasoning): 63%
93. Llama 4 Maverick (llama-4-maverick): 63%
94. Nova Pro (nova-pro): 63%
95. GLM-4.5-Air (glm-4-5-air): 63%
96. Mistral 7B v0.3 (mistral-7b-v0-3): 63%
97. LFM2-24B-A2B (lfm2-24b-a2b): 63%
98. Ministral 3 8B (ministral-3-8b): 63%
99. Ministral 3 3B (Reasoning) (ministral-3-3b-reasoning): 63%
100. Grok 3 [Beta] (grok-3-beta): 62%
101. Llama 4 Behemoth (llama-4-behemoth): 62%
102. Gemma 3 27B (gemma-3-27b): 62%
103. GPT-OSS 20B (gpt-oss-20b): 62%
104. Mistral 8x7B v0.2 (mistral-8x7b-v0-2): 62%
105. GLM-4.5 (glm-4-5): 61%
106. DeepSeek V3.1 (deepseek-v3-1): 61%
107. Granite-4.0-H-1B (granite-4-0-h-1b): 60.4%
108. Qwen3 235B 2507 (qwen3-235b-2507): 60%
109. Llama 4 Scout (llama-4-scout): 60%
110. Granite-4.0-1B (granite-4-0-1b): 59.7%
111. LFM2.5-1.2B-Instruct (lfm2-5-1-2b-instruct): 59%
112. Ministral 3 3B (ministral-3-3b): 57%
113. Granite-4.0-350M (granite-4-0-350m): 33.3%
114. Gemma 4 E4B (gemma-4-e4b): 33.1%
115. Granite-4.0-H-350M (granite-4-0-h-350m): 33.1%
116. Gemma 4 E2B (gemma-4-e2b): 21.9%

FAQ

What does BBH measure?

A suite of 23 challenging tasks from the BIG-Bench collaborative benchmark on which prior language-model evaluations had not outperformed the average human rater.

Which model leads the published BBH snapshot?

GPT-5.3 Codex currently leads the published BBH snapshot with a tracked score of 98%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on BBH?

116 AI models are included in BenchLM's mirrored BBH snapshot, based on the public leaderboard captured on April 21, 2026.

Last updated: April 21, 2026 · mirrored from the public benchmark leaderboard
