BullshitBench v2

A benchmark that tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. It measures the critical ability to push back on bad input.

How BenchLM presents BullshitBench v2

BenchLM mirrors the published BullshitBench v2 leaderboard using the official snapshot generated on April 7, 2026 at 10:06 PM UTC. The public view reports per-model clear-pushback rates across 100 nonsense prompts, scored by a 3-judge panel.
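
The page does not state how the three judge verdicts are combined per prompt, so the sketch below assumes a simple 2-of-3 majority vote over the 100 prompts; the function name and data layout are illustrative, not the benchmark's actual harness.

```python
# Hypothetical sketch of a clear pushback rate under a 3-judge panel.
# Assumption: a prompt counts as "clear pushback" when at least 2 of the
# 3 judges say so; the real aggregation rule is not documented here.

def clear_pushback_rate(judge_votes: list[list[bool]]) -> float:
    """judge_votes[i][j] is True if judge j scored prompt i as clear pushback."""
    cleared = sum(1 for votes in judge_votes if sum(votes) >= 2)  # 2-of-3 majority
    return 100 * cleared / len(judge_votes)

# Example: 100 prompts, 91 of which clear the majority threshold.
votes = [[True, True, False]] * 91 + [[False, False, True]] * 9
print(f"{clear_pushback_rate(votes):.0f}%")  # -> 91%
```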

BullshitBench is a useful reasoning sanity check, but BenchLM currently keeps it display-only rather than weighted. The public leaderboard is highly variant-specific and exposes reasoning-effort settings directly, so BenchLM treats it as a mirrored external benchmark instead of a canonical ranking input.

130 model variants · 79 base models · 100 nonsense prompts · 3 judges · Display only

Clear pushback rate on BullshitBench v2 — April 7, 2026 at 10:06 PM UTC

BenchLM mirrors the published clear pushback rate view for BullshitBench v2. Claude Sonnet 4.6 (high) leads the public snapshot at 91%, followed by Claude Opus 4.5 (high) at 90% and Claude Sonnet 4.6 (none) at 89%. BenchLM does not use these results to rank models overall.

130 models · Reasoning · Current · Display only · Updated April 7, 2026 at 10:06 PM UTC

The published BullshitBench v2 snapshot is tightly clustered at the top: Claude Sonnet 4.6 (high) sits at 91%, and third place trails by only 2.0 points. The broader top-10 spread is 17.0 points, so the benchmark still separates strong models even when the leaders cluster.
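
Those gaps follow directly from the table below; as a quick arithmetic check (scores copied from the top-10 rows):

```python
# Top-10 clear pushback rates from the table below.
top10 = [91, 90, 89, 87, 83, 79, 79, 78, 77, 74]
print(top10[0] - top10[2])   # gap from first to third place: 2
print(top10[0] - top10[-1])  # top-10 spread: 17
```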

130 models have been evaluated on BullshitBench v2. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. BullshitBench v2 itself is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
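
BenchLM's exact formula is not documented on this page, but the exclusion described above can be pictured with a minimal sketch, assuming a weighted average over benchmark scores with display-only entries filtered out; the category weights (other than Reasoning's 17%) and field names here are hypothetical.

```python
# Minimal sketch: display-only benchmarks contribute nothing to the overall
# score. Only the 0.17 Reasoning weight comes from the page; the rest is invented.
CATEGORY_WEIGHTS = {"reasoning": 0.17, "coding": 0.30, "other": 0.53}

def overall_score(benchmarks: list[dict]) -> float:
    """Each benchmark: {"category": str, "score": float, "display_only": bool}."""
    scored = [b for b in benchmarks if not b["display_only"]]  # drop display-only rows
    total = sum(CATEGORY_WEIGHTS[b["category"]] * b["score"] for b in scored)
    weight = sum(CATEGORY_WEIGHTS[b["category"]] for b in scored)
    return total / weight if weight else 0.0

bench = [
    {"category": "reasoning", "score": 91.0, "display_only": True},  # BullshitBench v2
    {"category": "coding", "score": 72.0, "display_only": False},
]
print(overall_score(bench))  # 72.0 -- the display-only row has no effect
```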

About BullshitBench v2

Year

2025

Tasks

Nonsensical and flawed prompts across multiple domains

Format

Prompt challenge and refusal evaluation

Difficulty

Robustness and critical reasoning

BullshitBench evaluates a crucial real-world capability: knowing when NOT to answer. Models that score highly recognize flawed premises, impossible physics scenarios, and logical contradictions rather than hallucinating plausible-sounding responses. V2 includes harder and more diverse challenge categories.

BenchLM freshness & provenance

Version

BullshitBench v2 (2025)

Refresh cadence

Quarterly

Staleness state

Current

Question availability

Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
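
As a rough illustration of that triage (a sketch only; the field names, thresholds, and tier logic are assumptions, not BenchLM's published policy):

```python
from datetime import date, timedelta

# Hypothetical triage: map freshness metadata to the treatment tiers
# named above. Thresholds are invented for illustration.
def treatment(last_refresh: date, display_only: bool, today: date) -> str:
    if display_only:
        return "display-only reference"
    age = today - last_refresh
    if age <= timedelta(days=90):   # within a quarterly refresh cadence
        return "strong differentiator"
    if age <= timedelta(days=180):
        return "benchmark to watch"
    return "display-only reference"

print(treatment(date(2026, 4, 7), display_only=True, today=date(2026, 4, 7)))
# -> display-only reference (BullshitBench v2 is display-only regardless of age)
```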

Clear pushback rate table (130 models)

Rank · Model · Variant slug · Clear pushback rate
#1 · Claude Sonnet 4.6 (high) · anthropic/claude-sonnet-4.6@reasoning=high · 91%
#2 · Claude Opus 4.5 (high) · anthropic/claude-opus-4.5@reasoning=high · 90%
#3 · Claude Sonnet 4.6 (none) · anthropic/claude-sonnet-4.6@reasoning=none · 89%
#4 · Claude Opus 4.6 (high) · anthropic/claude-opus-4.6@reasoning=high · 87%
#5 · Claude Opus 4.6 (none) · anthropic/claude-opus-4.6@reasoning=none · 83%
#6 · Claude Sonnet 4.5 (high) · anthropic/claude-sonnet-4.5@reasoning=high · 79%
#7 · Claude Opus 4.5 (none) · anthropic/claude-opus-4.5@reasoning=none · 79%
#8 · Qwen3.5 397B (Reasoning) (high) · qwen/qwen3.5-397b-a17b@reasoning=high · 78%
#9 · Claude Haiku 4.5 (high) · anthropic/claude-haiku-4.5@reasoning=high · 77%
#10 · Claude Sonnet 4.5 (none) · anthropic/claude-sonnet-4.5@reasoning=none · 74%
#11 · Claude Haiku 4.5 (none) · anthropic/claude-haiku-4.5@reasoning=none · 71%
#12 · Qwen3.5 397B (none) · qwen/qwen3.5-397b-a17b@reasoning=none · 69%
#13 · Grok 4.20 Multi-Agent Beta (low) · x-ai/grok-4.20-multi-agent-beta@reasoning=low · 67%
#14 · Grok 4.20 Multi-Agent Beta (xhigh) · x-ai/grok-4.20-multi-agent-beta@reasoning=xhigh · 64%
#15 · Qwen3.6 Plus (high) · qwen/qwen3-max-thinking@reasoning=high · 63%
#16 · Grok 4.20 Beta (low) · x-ai/grok-4.20-beta@reasoning=low · 56%
#17 · Grok 4.20 Beta (xhigh) · x-ai/grok-4.20-beta@reasoning=xhigh · 54%
#18 · Nemotron 3 Super 120B A12B (xhigh) · nvidia/nemotron-3-super-120b-a12b:free@reasoning=xhigh · 54%
#19 · Kimi K2.5 (none) · moonshotai/kimi-k2.5@reasoning=none · 52%
#20 · anthropic/claude-3.5-haiku · anthropic/claude-3.5-haiku@reasoning=default · 50%
#21 · anthropic/claude-3.7-sonnet:thinking · anthropic/claude-3.7-sonnet:thinking@reasoning=default · 49%
#22 · GPT-5.4 (none) · openai/gpt-5.4@reasoning=none · 48%
#23 · Gemini 3 Pro (low) · google/gemini-3-pro-preview@reasoning=low · 48%
#24 · Nemotron 3 Super 120B A12B (high) · nvidia/nemotron-3-super-120b-a12b@reasoning=high · 47%
#25 · Qwen3.6 Plus (none) · qwen/qwen3-max-thinking@reasoning=none · 46%
#26 · GPT-5.2-Codex (low) · openai/gpt-5.2-codex@reasoning=low · 45%
#27 · Claude 3.5 Sonnet · anthropic/claude-3.5-sonnet@reasoning=default · 45%
#28 · GPT-5.1 · openai/gpt-5.1-chat@reasoning=default · 45%
#29 · Claude 4.1 Opus (none) · anthropic/claude-opus-4.1@reasoning=none · 43%
#30 · anthropic/claude-3.7-sonnet · anthropic/claude-3.7-sonnet@reasoning=default · 43%
#31 · Nemotron 3 Super 120B A12B (none) · nvidia/nemotron-3-super-120b-a12b:free@reasoning=none · 43%
#32 · openrouter/hunter-alpha (none) · openrouter/hunter-alpha@reasoning=none · 43%
#33 · GPT-5.4 (xhigh) · openai/gpt-5.4@reasoning=xhigh · 42%
#34 · Claude 4.1 Opus (high) · anthropic/claude-opus-4.1@reasoning=high · 42%
#35 · GPT-5.3 Instant · openai/gpt-5.3-chat@reasoning=default · 40%
#36 · GPT-5 Codex · openai/gpt-5-codex@reasoning=default · 39%
#37 · GPT-5.2-Codex (xhigh) · openai/gpt-5.2-codex@reasoning=xhigh · 39%
#38 · GPT-5.2 (none) · openai/gpt-5.2@reasoning=none · 38%
#39 · Gemini 3.1 Pro (low) · google/gemini-3.1-pro-preview@reasoning=low · 37%
#40 · GPT-5.2-Codex (high) · openai/gpt-5.2-codex@reasoning=high · 37%
#41 · openrouter/healer-alpha (none) · openrouter/healer-alpha@reasoning=none · 37%
#42 · Gemini 3 Pro Deep Think (high) · google/gemini-3-pro-preview@reasoning=high · 36%
#43 · openrouter/hunter-alpha (xhigh) · openrouter/hunter-alpha@reasoning=xhigh · 35%
#44 · Claude Opus 4 · anthropic/claude-opus-4@reasoning=default · 34%
#45 · GPT-5.4 mini (high) · openai/gpt-5.4-mini@reasoning=high · 32%
#46 · GPT-5.4 mini (none) · openai/gpt-5.4-mini@reasoning=none · 32%
#47 · GPT-5.1-Codex-Max · openai/gpt-5.1-codex@reasoning=default · 32%
#48 · GPT-5.4 mini (xhigh) · openai/gpt-5.4-mini@reasoning=xhigh · 31%
#49 · Kimi K2.5 (Reasoning) (high) · moonshotai/kimi-k2.5@reasoning=high · 31%
#50 · Gemini 3.1 Pro (high) · google/gemini-3.1-pro-preview@reasoning=high · 31%
#51 · GLM-5-Turbo (high) · z-ai/glm-5-turbo@reasoning=high · 31%
#52 · Nemotron 3 Super 120B A12B (none) · nvidia/nemotron-3-super-120b-a12b@reasoning=none · 31%
#53 · Claude 4 Sonnet (high) · anthropic/claude-sonnet-4@reasoning=high · 30%
#54 · Claude 4 Sonnet (none) · anthropic/claude-sonnet-4@reasoning=none · 29%
#55 · GPT-5.2 (high) · openai/gpt-5.2@reasoning=high · 28%
#56 · Llama 4 Maverick · meta-llama/llama-4-maverick@reasoning=default · 28%
#57 · GLM-5 (Reasoning) (high) · z-ai/glm-5@reasoning=high · 28%
#58 · Nemotron 3 Nano 30B A3B (none) · nvidia/nemotron-3-nano-30b-a3b:free@reasoning=none · 28%
#59 · GPT-5.2 Instant · openai/gpt-5.2-chat@reasoning=default · 27%
#60 · o3 · openai/o3@reasoning=default · 26%
#61 · openrouter/healer-alpha (xhigh) · openrouter/healer-alpha@reasoning=xhigh · 26%
#62 · GPT-5.1 · openai/gpt-5.1@reasoning=default · 25%
#63 · Gemma 4 31B (high) · google/gemma-4-31b-it@reasoning=high · 25%
#64 · GPT-5.3 Codex (low) · openai/gpt-5.3-codex@reasoning=low · 24%
#65 · GLM-5-Turbo (none) · z-ai/glm-5-turbo@reasoning=none · 23%
#66 · Step 3.5 Flash (xhigh) · stepfun/step-3.5-flash@reasoning=xhigh · 22%
#67 · GPT-5 · openai/gpt-5@reasoning=default · 21%
#68 · Gemma 4 26B A4B (xhigh) · google/gemma-4-26b-a4b-it@reasoning=xhigh · 21%
#69 · GPT-5.3 Codex (high) · openai/gpt-5.3-codex@reasoning=high · 20%
#70 · Qwen3 Coder 480B A35B · qwen/qwen3-coder@reasoning=default · 20%
#71 · Gemini 2.5 Pro · google/gemini-2.5-pro@reasoning=default · 20%
#72 · GLM-5 (none) · z-ai/glm-5@reasoning=none · 20%
#73 · Gemma 4 31B (none) · google/gemma-4-31b-it@reasoning=none · 20%
#74 · GPT-5.3 Codex (xhigh) · openai/gpt-5.3-codex@reasoning=xhigh · 19%
#75 · Grok 4.1 Fast (high) · x-ai/grok-4.1-fast@reasoning=high · 19%
#76 · Llama 4 Scout · meta-llama/llama-4-scout@reasoning=default · 19%
#77 · Gemini 2.5 Flash · google/gemini-2.5-flash@reasoning=default · 19%
#78 · GPT-5 · openai/gpt-5-chat@reasoning=default · 18%
#79 · Trinity-Large-Thinking (minimal) · arcee-ai/trinity-large-thinking@reasoning=minimal · 17%
#80 · MiMo-V2-Flash (none) · xiaomi/mimo-v2-flash@reasoning=none · 16%
#81 · google/gemini-2.0-flash-001 · google/gemini-2.0-flash-001@reasoning=default · 15%
#82 · meta-llama/llama-3.1-8b-instruct · meta-llama/llama-3.1-8b-instruct@reasoning=default · 14%
#83 · GPT-5.4 nano (high) · openai/gpt-5.4-nano@reasoning=high · 14%
#84 · GPT-4.1 · openai/gpt-4.1@reasoning=default · 14%
#85 · GPT-5.4 nano (none) · openai/gpt-5.4-nano@reasoning=none · 13%
#86 · DeepSeek V3.2 (Thinking) (high) · deepseek/deepseek-v3.2@reasoning=high · 13%
#87 · Step 3.5 Flash (minimal) · stepfun/step-3.5-flash@reasoning=minimal · 13%
#88 · Trinity-Large-Thinking (xhigh) · arcee-ai/trinity-large-thinking@reasoning=xhigh · 13%
#89 · MiMo-V2-Flash (high) · xiaomi/mimo-v2-flash@reasoning=high · 13%
#90 · openai/gpt-4o-2024-08-06 · openai/gpt-4o-2024-08-06@reasoning=default · 12%
#91 · Gemma 4 26B A4B (none) · google/gemma-4-26b-a4b-it@reasoning=none · 11%
#92 · Gemini 3.1 Flash-Lite · google/gemini-3.1-flash-lite-preview@reasoning=default · 11%
#93 · Seed 1.6 (none) · bytedance-seed/seed-1.6@reasoning=none · 11%
#94 · GPT-OSS 120B (low) · openai/gpt-oss-120b@reasoning=low · 11%
#95 · baidu/ernie-4.5-vl-424b-a47b (xhigh) · baidu/ernie-4.5-vl-424b-a47b@reasoning=xhigh · 11%
#96 · GPT-5.4 nano (xhigh) · openai/gpt-5.4-nano@reasoning=xhigh · 10%
#97 · Gemini 3 Flash (high) · google/gemini-3-flash-preview@reasoning=high · 10%
#98 · DeepSeek V3.2 (none) · deepseek/deepseek-v3.2@reasoning=none · 10%
#99 · Claude 3 Haiku · anthropic/claude-3-haiku@reasoning=default · 10%
#100 · Gemini 3 Flash (none) · google/gemini-3-flash-preview@reasoning=none · 10%
#101 · nvidia/nemotron-3-nano-30b-a3b:free (xhigh) · nvidia/nemotron-3-nano-30b-a3b:free@reasoning=xhigh · 10%
#102 · Kimi K2 · moonshotai/kimi-k2@reasoning=default · 10%
#103 · Grok 4.1 Fast (none) · x-ai/grok-4.1-fast@reasoning=none · 10%
#104 · MiniMax M2.5 (low) · minimax/minimax-m2.5@reasoning=low · 9%
#105 · MiniMax M2.5 (high) · minimax/minimax-m2.5@reasoning=high · 8%
#106 · GLM-4.5 (xhigh) · z-ai/glm-4.5@reasoning=xhigh · 8%
#107 · MiniMax M2.7 (high) · minimax/minimax-m2.7@reasoning=high · 8%
#108 · DeepSeek-R1 (xhigh) · deepseek/deepseek-r1@reasoning=xhigh · 8%
#109 · o4-mini (high) (low) · openai/o4-mini@reasoning=low · 8%
#110 · Seed 1.6 (high) · bytedance-seed/seed-1.6@reasoning=high · 7%
#111 · MiniMax M2.7 (low) · minimax/minimax-m2.7@reasoning=low · 7%
#112 · DeepSeek-R1 (none) · deepseek/deepseek-r1@reasoning=none · 7%
#113 · prime-intellect/intellect-3 (low) · prime-intellect/intellect-3@reasoning=low · 7%
#114 · mistralai/mistral-small-2603 (high) · mistralai/mistral-small-2603@reasoning=high · 6%
#115 · qwen/qwen3-235b-a22b (none) · qwen/qwen3-235b-a22b@reasoning=none · 6%
#116 · GLM-4.5 (none) · z-ai/glm-4.5@reasoning=none · 6%
#117 · GPT-OSS 120B (high) · openai/gpt-oss-120b@reasoning=high · 5%
#118 · nvidia/nemotron-nano-9b-v2:free (none) · nvidia/nemotron-nano-9b-v2:free@reasoning=none · 5%
#119 · prime-intellect/intellect-3 (high) · prime-intellect/intellect-3@reasoning=high · 5%
#120 · ai21/jamba-large-1.7 · ai21/jamba-large-1.7@reasoning=default · 5%
#121 · o4-mini (high) (high) · openai/o4-mini@reasoning=high · 4%
#122 · baidu/ernie-4.5-300b-a47b · baidu/ernie-4.5-300b-a47b@reasoning=default · 4%
#123 · deepseek/deepseek-chat · deepseek/deepseek-chat@reasoning=default · 4%
#124 · mistralai/mistral-small-2603 (none) · mistralai/mistral-small-2603@reasoning=none · 4%
#125 · baidu/ernie-4.5-vl-424b-a47b (none) · baidu/ernie-4.5-vl-424b-a47b@reasoning=none · 3%
#126 · qwen/qwen3-235b-a22b (xhigh) · qwen/qwen3-235b-a22b@reasoning=xhigh · 3%
#127 · nvidia/nemotron-nano-9b-v2:free (xhigh) · nvidia/nemotron-nano-9b-v2:free@reasoning=xhigh · 3%
#128 · google/gemma-3-27b-it · google/gemma-3-27b-it@reasoning=default · 3%
#129 · mistralai/mistral-large-2512 · mistralai/mistral-large-2512@reasoning=default · 2%
#130 · openai/gpt-4o-mini-2024-07-18 · openai/gpt-4o-mini-2024-07-18@reasoning=default · 2%
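
Each row above identifies a variant with a slug of the form vendor/model@reasoning=level, where the model id may carry a suffix such as :free or :thinking. A minimal parsing sketch, inferred from the rows shown here rather than from a published spec:

```python
# Parse a BenchLM-style variant slug, e.g.
# "nvidia/nemotron-3-super-120b-a12b:free@reasoning=xhigh".
def parse_variant(slug: str) -> dict:
    model_part, _, reasoning = slug.partition("@reasoning=")
    vendor, _, model_id = model_part.partition("/")
    return {"vendor": vendor, "model": model_id, "reasoning": reasoning}

print(parse_variant("anthropic/claude-sonnet-4.6@reasoning=high"))
# {'vendor': 'anthropic', 'model': 'claude-sonnet-4.6', 'reasoning': 'high'}
```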

FAQ

What does BullshitBench v2 measure?

BullshitBench v2 tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. It measures the critical ability to push back on bad input.

Which model leads the published BullshitBench v2 snapshot?

Claude Sonnet 4.6 (high) currently leads the published BullshitBench v2 snapshot with a clear pushback rate of 91%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on BullshitBench v2?

130 AI models are included in BenchLM's mirrored BullshitBench v2 snapshot, based on the public leaderboard captured on April 7, 2026 at 10:06 PM UTC.

Last updated: April 7, 2026 at 10:06 PM UTC · mirrored from the public benchmark leaderboard
