
BullshitBench v2

BullshitBench v2 is a benchmark that tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. It measures a model's ability to push back on bad input.

How BenchLM shows BullshitBench v2

BenchLM mirrors the published BullshitBench v2 leaderboard using the official snapshot generated on May 19, 2026 at 11:13 PM UTC. The public view reports per-model clear-pushback rates across 100 nonsense prompts, scored by a 3-judge panel.
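
As a rough illustration of how a per-model clear-pushback rate could be derived from panel verdicts, the sketch below assumes that each of the three judges assigns one label per response and that a prompt counts as clear pushback only when a strict majority of judges agree. The label names and the majority rule are assumptions made for illustration; this page does not describe the benchmark's actual aggregation method.

```python
from collections import Counter

# Hypothetical label for illustration; the benchmark's real labels may differ.
CLEAR = "clear_pushback"


def prompt_verdict(judge_labels):
    """Return the label a strict majority of judges agree on, or None."""
    label, votes = Counter(judge_labels).most_common(1)[0]
    return label if votes * 2 > len(judge_labels) else None


def clear_pushback_rate(per_prompt_labels):
    """Percentage of prompts whose majority verdict is clear pushback."""
    clear = sum(1 for labels in per_prompt_labels
                if prompt_verdict(labels) == CLEAR)
    return 100.0 * clear / len(per_prompt_labels)


# Four made-up prompts, three judge labels each.
sample = [
    (CLEAR, CLEAR, CLEAR),
    (CLEAR, CLEAR, "accepted"),
    ("accepted", "accepted", CLEAR),
    (CLEAR, "partial", "accepted"),  # no majority, so not counted as clear
]
rate = clear_pushback_rate(sample)
print(f"{rate:.0f}%")
```

With 100 prompts per model, each prompt would move the rate by exactly one percentage point under this scheme.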

BullshitBench is a useful reasoning sanity check, but BenchLM currently keeps it display-only rather than weighting it. The public leaderboard is highly variant-specific and exposes reasoning-effort settings directly, so BenchLM treats it as a mirrored external benchmark instead of a canonical ranking input.

158 model variants · 93 base models · 100 nonsense prompts · 3 judges · Display only

Clear pushback rate on BullshitBench v2 — May 19, 2026 at 11:13 PM UTC

BenchLM mirrors the published clear pushback rate view for BullshitBench v2. Claude Sonnet 4.6 (high) leads the public snapshot at 91%, followed by Claude Opus 4.5 (high) at 90% and Claude Sonnet 4.6 (none) at 89%. BenchLM does not use these results to rank models overall.

158 models · Reasoning · Current · Display only · Updated May 19, 2026 at 11:13 PM UTC

The published BullshitBench v2 snapshot is tightly clustered at the top: Claude Sonnet 4.6 (high) sits at 91%, while the third row is only 2.0 points behind. The broader top-10 spread is 14.0 points, so the benchmark still separates strong models even when the leaders cluster.
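
The two gaps quoted above can be checked directly against the top-10 rates listed in the table on this page. This is a minimal sketch; the helper name is illustrative and not part of any BenchLM tooling.

```python
# Clear pushback rates (percent) for ranks 1-10, copied from the table.
top10 = [91, 90, 89, 87, 83, 83, 79, 79, 78, 77]


def gap_to_leader(scores, rank):
    """Points between the leader and the row at the given 1-based rank."""
    return scores[0] - scores[rank - 1]


third_row_gap = gap_to_leader(top10, 3)   # 91 - 89
top10_spread = max(top10) - min(top10)    # 91 - 77

print(f"Third row trails the leader by {third_row_gap} points")  # 2
print(f"Top-10 spread is {top10_spread} points")                 # 14
```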

158 model variants have been evaluated on BullshitBench v2. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. Because BullshitBench v2 is displayed for reference only and excluded from the scoring formula, it does not directly affect overall rankings.

About BullshitBench v2

Year: 2025
Tasks: Nonsensical and flawed prompts across multiple domains
Format: Prompt challenge and refusal evaluation
Difficulty: Robustness and critical reasoning

BullshitBench evaluates a crucial real-world capability: knowing when NOT to answer. Models that score highly recognize flawed premises, impossible physics scenarios, and logical contradictions rather than hallucinating plausible-sounding responses. V2 includes harder and more diverse challenge categories.

BenchLM freshness & provenance

Version: BullshitBench v2 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Clear pushback rate table (158 models)

Rank | Model | Model ID | Clear pushback rate
1 | Claude Sonnet 4.6 (high) | anthropic/claude-sonnet-4.6@reasoning=high | 91%
2 | Claude Opus 4.5 (high) | anthropic/claude-opus-4.5@reasoning=high | 90%
3 | Claude Sonnet 4.6 (none) | anthropic/claude-sonnet-4.6@reasoning=none | 89%
4 | Claude Opus 4.6 (high) | anthropic/claude-opus-4.6@reasoning=high | 87%
5 | Claude Opus 4.6 (none) | anthropic/claude-opus-4.6@reasoning=none | 83%
6 | Claude Opus 4.7 (none) | anthropic/claude-opus-4.7@reasoning=none | 83%
7 | Claude Sonnet 4.5 (high) | anthropic/claude-sonnet-4.5@reasoning=high | 79%
8 | Claude Opus 4.5 (none) | anthropic/claude-opus-4.5@reasoning=none | 79%
9 | Qwen3.5 397B (Reasoning) (high) | qwen/qwen3.5-397b-a17b@reasoning=high | 78%
10 | Claude Haiku 4.5 (high) | anthropic/claude-haiku-4.5@reasoning=high | 77%
11 | Claude Opus 4.7 (max) | anthropic/claude-opus-4.7@reasoning=max | 74%
12 | Claude Sonnet 4.5 (none) | anthropic/claude-sonnet-4.5@reasoning=none | 74%
13 | Qwen3.6 Plus (none) | qwen/qwen3.6-plus@reasoning=none | 72%
14 | Claude Haiku 4.5 (none) | anthropic/claude-haiku-4.5@reasoning=none | 71%
15 | Qwen3.5 397B (none) | qwen/qwen3.5-397b-a17b@reasoning=none | 69%
16 | Grok 4.20 Multi-Agent Beta (low) | x-ai/grok-4.20-multi-agent-beta@reasoning=low | 67%
17 | Kimi K2.6 (none) | moonshotai/kimi-k2.6@reasoning=none | 65%
18 | Grok 4.20 Multi-Agent Beta (xhigh) | x-ai/grok-4.20-multi-agent-beta@reasoning=xhigh | 64%
19 | Qwen3.6 Plus (high) | qwen/qwen3-max-thinking@reasoning=high | 63%
20 | MiMo-V2.5-Pro (xhigh) | xiaomi/mimo-v2.5-pro@reasoning=xhigh | 62%
21 | Qwen3.6 Plus (xhigh) | qwen/qwen3.6-plus@reasoning=xhigh | 59%
22 | Grok 4.20 Beta (low) | x-ai/grok-4.20-beta@reasoning=low | 56%
23 | Grok 4.20 Beta (xhigh) | x-ai/grok-4.20-beta@reasoning=xhigh | 54%
24 | Nemotron 3 Super 120B A12B (xhigh) | nvidia/nemotron-3-super-120b-a12b:free@reasoning=xhigh | 54%
25 | Kimi K2.5 (none) | moonshotai/kimi-k2.5@reasoning=none | 52%
26 | Grok 4.3 (minimal) | x-ai/grok-4.3@reasoning=minimal | 50%
27 | Kimi K2.6 (xhigh) | moonshotai/kimi-k2.6@reasoning=xhigh | 50%
28 | anthropic/claude-3.5-haiku | anthropic/claude-3.5-haiku@reasoning=default | 50%
29 | anthropic/claude-3.7-sonnet:thinking | anthropic/claude-3.7-sonnet:thinking@reasoning=default | 49%
30 | GPT-5.4 (none) | openai/gpt-5.4@reasoning=none | 48%
31 | Gemini 3 Pro (low) | google/gemini-3-pro-preview@reasoning=low | 48%
32 | GPT-5.5 (xhigh) | openai/gpt-5.5@reasoning=xhigh | 47%
33 | Nemotron 3 Super 120B A12B (high) | nvidia/nemotron-3-super-120b-a12b@reasoning=high | 47%
34 | Qwen3.6 Plus (none) | qwen/qwen3-max-thinking@reasoning=none | 46%
35 | Grok 4.3 (xhigh) | x-ai/grok-4.3@reasoning=xhigh | 46%
36 | GPT-5.5 (none) | openai/gpt-5.5@reasoning=none | 45%
37 | GPT-5.5 (low) | openai/gpt-5.5@reasoning=low | 45%
38 | GPT-5.2-Codex (low) | openai/gpt-5.2-codex@reasoning=low | 45%
39 | Claude 3.5 Sonnet | anthropic/claude-3.5-sonnet@reasoning=default | 45%
40 | GPT-5.1 | openai/gpt-5.1-chat@reasoning=default | 45%
41 | Claude 4.1 Opus (none) | anthropic/claude-opus-4.1@reasoning=none | 43%
42 | anthropic/claude-3.7-sonnet | anthropic/claude-3.7-sonnet@reasoning=default | 43%
43 | Nemotron 3 Super 120B A12B (none) | nvidia/nemotron-3-super-120b-a12b:free@reasoning=none | 43%
44 | openrouter/hunter-alpha (none) | openrouter/hunter-alpha@reasoning=none | 43%
45 | GPT-5.4 (xhigh) | openai/gpt-5.4@reasoning=xhigh | 42%
46 | Claude 4.1 Opus (high) | anthropic/claude-opus-4.1@reasoning=high | 42%
47 | GPT-5.3 Instant | openai/gpt-5.3-chat@reasoning=default | 40%
48 | GPT-5 Codex | openai/gpt-5-codex@reasoning=default | 39%
49 | GPT-5.2-Codex (xhigh) | openai/gpt-5.2-codex@reasoning=xhigh | 39%
50 | GPT-5.2 (none) | openai/gpt-5.2@reasoning=none | 38%
51 | MiMo-V2.5-Pro (none) | xiaomi/mimo-v2.5-pro@reasoning=none | 38%
52 | Gemini 3.1 Pro (low) | google/gemini-3.1-pro-preview@reasoning=low | 37%
53 | GPT-5.2-Codex (high) | openai/gpt-5.2-codex@reasoning=high | 37%
54 | openrouter/healer-alpha (none) | openrouter/healer-alpha@reasoning=none | 37%
55 | GPT-5.5 Pro (xhigh) | openai/gpt-5.5-pro@reasoning=xhigh | 36%
56 | Gemini 3 Pro Deep Think (high) | google/gemini-3-pro-preview@reasoning=high | 36%
57 | MiMo-V2.5 (xhigh) | xiaomi/mimo-v2.5@reasoning=xhigh | 35%
58 | openrouter/hunter-alpha (xhigh) | openrouter/hunter-alpha@reasoning=xhigh | 35%
59 | GPT-5.5 Pro (medium) | openai/gpt-5.5-pro@reasoning=medium | 34%
60 | GPT-5.5 | openai/gpt-5.5-chat@reasoning=default | 34%
61 | Claude Opus 4 | anthropic/claude-opus-4@reasoning=default | 34%
62 | GPT-5.4 mini (high) | openai/gpt-5.4-mini@reasoning=high | 32%
63 | GPT-5.4 mini (none) | openai/gpt-5.4-mini@reasoning=none | 32%
64 | GPT-5.1-Codex-Max | openai/gpt-5.1-codex@reasoning=default | 32%
65 | GPT-5.4 mini (xhigh) | openai/gpt-5.4-mini@reasoning=xhigh | 31%
66 | Kimi K2.5 (Reasoning) (high) | moonshotai/kimi-k2.5@reasoning=high | 31%
67 | Gemini 3.1 Pro (high) | google/gemini-3.1-pro-preview@reasoning=high | 31%
68 | GLM-5-Turbo (high) | z-ai/glm-5-turbo@reasoning=high | 31%
69 | Nemotron 3 Super 120B A12B (none) | nvidia/nemotron-3-super-120b-a12b@reasoning=none | 31%
70 | Claude 4 Sonnet (high) | anthropic/claude-sonnet-4@reasoning=high | 30%
71 | Claude 4 Sonnet (none) | anthropic/claude-sonnet-4@reasoning=none | 29%
72 | GPT-5.2 (high) | openai/gpt-5.2@reasoning=high | 28%
73 | Llama 4 Maverick | meta-llama/llama-4-maverick@reasoning=default | 28%
74 | GLM-5 (Reasoning) (high) | z-ai/glm-5@reasoning=high | 28%
75 | Nemotron 3 Nano 30B A3B (none) | nvidia/nemotron-3-nano-30b-a3b:free@reasoning=none | 28%
76 | GPT-5.2 Instant | openai/gpt-5.2-chat@reasoning=default | 27%
77 | o3 | openai/o3@reasoning=default | 26%
78 | openrouter/healer-alpha (xhigh) | openrouter/healer-alpha@reasoning=xhigh | 26%
79 | GPT-5.1 | openai/gpt-5.1@reasoning=default | 25%
80 | Gemma 4 31B (high) | google/gemma-4-31b-it@reasoning=high | 25%
81 | GPT-5.3 Codex (low) | openai/gpt-5.3-codex@reasoning=low | 24%
82 | MiMo-V2.5 (none) | xiaomi/mimo-v2.5@reasoning=none | 24%
83 | GLM-5-Turbo (none) | z-ai/glm-5-turbo@reasoning=none | 23%
84 | GLM-5.1 (xhigh) | z-ai/glm-5.1@reasoning=xhigh | 22%
85 | Step 3.5 Flash (xhigh) | stepfun/step-3.5-flash@reasoning=xhigh | 22%
86 | GPT-5 | openai/gpt-5@reasoning=default | 21%
87 | Gemma 4 26B A4B (xhigh) | google/gemma-4-26b-a4b-it@reasoning=xhigh | 21%
88 | GPT-5.3 Codex (high) | openai/gpt-5.3-codex@reasoning=high | 20%
89 | Qwen3 Coder 480B A35B | qwen/qwen3-coder@reasoning=default | 20%
90 | Gemini 2.5 Pro | google/gemini-2.5-pro@reasoning=default | 20%
91 | GLM-5 (none) | z-ai/glm-5@reasoning=none | 20%
92 | Gemma 4 31B (none) | google/gemma-4-31b-it@reasoning=none | 20%
93 | Gemini 3.5 Flash (xhigh) | google/gemini-3.5-flash@reasoning=xhigh | 20%
94 | GPT-5.3 Codex (xhigh) | openai/gpt-5.3-codex@reasoning=xhigh | 19%
95 | Grok 4.1 Fast (high) | x-ai/grok-4.1-fast@reasoning=high | 19%
96 | Llama 4 Scout | meta-llama/llama-4-scout@reasoning=default | 19%
97 | Gemini 2.5 Flash | google/gemini-2.5-flash@reasoning=default | 19%
98 | Gemini 3.5 Flash (minimal) | google/gemini-3.5-flash@reasoning=minimal | 19%
99 | GPT-5 | openai/gpt-5-chat@reasoning=default | 18%
100 | DeepSeek V4 Flash (none) | deepseek/deepseek-v4-flash@reasoning=none | 18%
101 | GLM-5.1 (none) | z-ai/glm-5.1@reasoning=none | 18%
102 | Trinity-Large-Thinking (minimal) | arcee-ai/trinity-large-thinking@reasoning=minimal | 17%
103 | MiMo-V2-Flash (none) | xiaomi/mimo-v2-flash@reasoning=none | 16%
104 | Hy3 Preview (none) | tencent/hy3-preview:free@reasoning=none | 16%
105 | google/gemini-2.0-flash-001 | google/gemini-2.0-flash-001@reasoning=default | 15%
106 | DeepSeek V4 Pro (xhigh) | deepseek/deepseek-v4-pro@reasoning=xhigh | 14%
107 | meta-llama/llama-3.1-8b-instruct | meta-llama/llama-3.1-8b-instruct@reasoning=default | 14%
108 | GPT-5.4 nano (high) | openai/gpt-5.4-nano@reasoning=high | 14%
109 | DeepSeek V4 Pro (none) | deepseek/deepseek-v4-pro@reasoning=none | 14%
110 | GPT-4.1 | openai/gpt-4.1@reasoning=default | 14%
111 | DeepSeek V4 Flash (xhigh) | deepseek/deepseek-v4-flash@reasoning=xhigh | 14%
112 | GPT-5.4 nano (none) | openai/gpt-5.4-nano@reasoning=none | 13%
113 | DeepSeek V3.2 (Thinking) (high) | deepseek/deepseek-v3.2@reasoning=high | 13%
114 | Step 3.5 Flash (minimal) | stepfun/step-3.5-flash@reasoning=minimal | 13%
115 | Trinity-Large-Thinking (xhigh) | arcee-ai/trinity-large-thinking@reasoning=xhigh | 13%
116 | MiMo-V2-Flash (high) | xiaomi/mimo-v2-flash@reasoning=high | 13%
117 | openai/gpt-4o-2024-08-06 | openai/gpt-4o-2024-08-06@reasoning=default | 12%
118 | Gemma 4 26B A4B (none) | google/gemma-4-26b-a4b-it@reasoning=none | 11%
119 | Gemini 3.1 Flash-Lite | google/gemini-3.1-flash-lite-preview@reasoning=default | 11%
120 | Seed 1.6 (none) | bytedance-seed/seed-1.6@reasoning=none | 11%
121 | GPT-OSS 120B (low) | openai/gpt-oss-120b@reasoning=low | 11%
122 | baidu/ernie-4.5-vl-424b-a47b (xhigh) | baidu/ernie-4.5-vl-424b-a47b@reasoning=xhigh | 11%
123 | GPT-5.4 nano (xhigh) | openai/gpt-5.4-nano@reasoning=xhigh | 10%
124 | Gemini 3 Flash (high) | google/gemini-3-flash-preview@reasoning=high | 10%
125 | DeepSeek V3.2 (none) | deepseek/deepseek-v3.2@reasoning=none | 10%
126 | Claude 3 Haiku | anthropic/claude-3-haiku@reasoning=default | 10%
127 | Gemini 3 Flash (none) | google/gemini-3-flash-preview@reasoning=none | 10%
128 | nvidia/nemotron-3-nano-30b-a3b:free (xhigh) | nvidia/nemotron-3-nano-30b-a3b:free@reasoning=xhigh | 10%
129 | Kimi K2 | moonshotai/kimi-k2@reasoning=default | 10%
130 | Grok 4.1 Fast (none) | x-ai/grok-4.1-fast@reasoning=none | 10%
131 | MiniMax M2.5 (low) | minimax/minimax-m2.5@reasoning=low | 9%
132 | Hy3 Preview (xhigh) | tencent/hy3-preview:free@reasoning=xhigh | 8%
133 | MiniMax M2.5 (high) | minimax/minimax-m2.5@reasoning=high | 8%
134 | GLM-4.5 (xhigh) | z-ai/glm-4.5@reasoning=xhigh | 8%
135 | MiniMax M2.7 (high) | minimax/minimax-m2.7@reasoning=high | 8%
136 | DeepSeek-R1 (xhigh) | deepseek/deepseek-r1@reasoning=xhigh | 8%
137 | o4-mini (high) (low) | openai/o4-mini@reasoning=low | 8%
138 | Seed 1.6 (high) | bytedance-seed/seed-1.6@reasoning=high | 7%
139 | MiniMax M2.7 (low) | minimax/minimax-m2.7@reasoning=low | 7%
140 | DeepSeek-R1 (none) | deepseek/deepseek-r1@reasoning=none | 7%
141 | prime-intellect/intellect-3 (low) | prime-intellect/intellect-3@reasoning=low | 7%
142 | mistralai/mistral-small-2603 (high) | mistralai/mistral-small-2603@reasoning=high | 6%
143 | qwen/qwen3-235b-a22b (none) | qwen/qwen3-235b-a22b@reasoning=none | 6%
144 | GLM-4.5 (none) | z-ai/glm-4.5@reasoning=none | 6%
145 | GPT-OSS 120B (high) | openai/gpt-oss-120b@reasoning=high | 5%
146 | nvidia/nemotron-nano-9b-v2:free (none) | nvidia/nemotron-nano-9b-v2:free@reasoning=none | 5%
147 | prime-intellect/intellect-3 (high) | prime-intellect/intellect-3@reasoning=high | 5%
148 | ai21/jamba-large-1.7 | ai21/jamba-large-1.7@reasoning=default | 5%
149 | o4-mini (high) (high) | openai/o4-mini@reasoning=high | 4%
150 | baidu/ernie-4.5-300b-a47b | baidu/ernie-4.5-300b-a47b@reasoning=default | 4%
151 | deepseek/deepseek-chat | deepseek/deepseek-chat@reasoning=default | 4%
152 | mistralai/mistral-small-2603 (none) | mistralai/mistral-small-2603@reasoning=none | 4%
153 | baidu/ernie-4.5-vl-424b-a47b (none) | baidu/ernie-4.5-vl-424b-a47b@reasoning=none | 3%
154 | qwen/qwen3-235b-a22b (xhigh) | qwen/qwen3-235b-a22b@reasoning=xhigh | 3%
155 | nvidia/nemotron-nano-9b-v2:free (xhigh) | nvidia/nemotron-nano-9b-v2:free@reasoning=xhigh | 3%
156 | google/gemma-3-27b-it | google/gemma-3-27b-it@reasoning=default | 3%
157 | mistralai/mistral-large-2512 | mistralai/mistral-large-2512@reasoning=default | 2%
158 | openai/gpt-4o-mini-2024-07-18 | openai/gpt-4o-mini-2024-07-18@reasoning=default | 2%
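
Every model ID in the table follows the pattern provider/model@reasoning=effort, which makes it straightforward to compare reasoning-effort settings for the same model. The sketch below splits an ID into its parts and groups a few rows from the table by provider/model. The `Variant` type and function names are illustrative, and grouping on provider/model is an assumption: it may not match how BenchLM arrives at its count of 93 base models (for example, IDs with a `:free` suffix are kept as distinct models here).

```python
from collections import defaultdict
from typing import NamedTuple


class Variant(NamedTuple):
    provider: str
    model: str
    effort: str


def parse_variant(model_id: str) -> Variant:
    """Split 'provider/model@reasoning=effort' into its three parts."""
    base, sep, effort = model_id.rpartition("@reasoning=")
    if not sep or not effort:
        raise ValueError(f"missing reasoning suffix: {model_id!r}")
    provider, slash, model = base.partition("/")
    if not slash or not model:
        raise ValueError(f"missing provider prefix: {model_id!r}")
    return Variant(provider, model, effort)


# A few (model ID, clear pushback rate) rows copied from the table.
rows = [
    ("anthropic/claude-sonnet-4.6@reasoning=high", 91),
    ("anthropic/claude-opus-4.5@reasoning=high", 90),
    ("anthropic/claude-sonnet-4.6@reasoning=none", 89),
    ("anthropic/claude-opus-4.5@reasoning=none", 79),
]

# Map 'provider/model' to {effort: rate} so settings can be compared.
by_base = defaultdict(dict)
for model_id, rate in rows:
    v = parse_variant(model_id)
    by_base[f"{v.provider}/{v.model}"][v.effort] = rate

for base, efforts in by_base.items():
    print(base, efforts)
```

In this sample, Claude Opus 4.5 drops 11 points from high to none, while Claude Sonnet 4.6 drops only 2, which illustrates why the leaderboard is described as highly variant-specific.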

FAQ

What does BullshitBench v2 measure?

BullshitBench v2 tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. It measures a model's ability to push back on bad input.

Which model leads the published BullshitBench v2 snapshot?

Claude Sonnet 4.6 (high) currently leads the published BullshitBench v2 snapshot with a 91% clear pushback rate. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on BullshitBench v2?

158 model variants, covering 93 base models, are included in BenchLM's mirrored BullshitBench v2 snapshot, based on the public leaderboard captured on May 19, 2026 at 11:13 PM UTC.

Last updated: May 19, 2026 at 11:13 PM UTC · mirrored from the public benchmark leaderboard
