BullshitBench v2

Name: BullshitBench v2
Creator: BenchLM

A benchmark that tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. Measures the critical ability to push back on bad input.

How BenchLM shows BullshitBench v2

BenchLM mirrors the published BullshitBench v2 leaderboard using the official snapshot generated on May 19, 2026 at 11:13 PM UTC. The public view reports per-model clear-pushback rates across 100 nonsense prompts, scored by a 3-judge panel.

BullshitBench is a useful reasoning sanity check, but BenchLM currently keeps it display only rather than weighted. The public leaderboard is highly variant-specific and exposes reasoning-effort settings directly, so BenchLM treats it as a mirrored external benchmark instead of a canonical ranking input.

158 model variants93 base models100 nonsense prompts3 judgesDisplay only

BullshitBench viewer Benchmark homepage Published leaderboard CSV

Clear pushback rate on BullshitBench v2 — May 19, 2026 at 11:13 PM UTC

BenchLM mirrors the published clear pushback rate view for BullshitBench v2. Claude Sonnet 4.6 (high) leads the public snapshot at 91% , followed by Claude Opus 4.5 (high) (90%) and Claude Sonnet 4.6 (none) (89%). BenchLM does not use these results to rank models overall.

1Closed

Claude Sonnet 4.6 (high)

Anthropic

anthropic/claude-sonnet-4.6@reasoning=high

91%

Overall 83Context 200K

2Closed

Claude Opus 4.5 (high)

Anthropic

anthropic/claude-opus-4.5@reasoning=high

90%

Overall 77Context 200K

3Closed

Claude Sonnet 4.6 (none)

Anthropic

anthropic/claude-sonnet-4.6@reasoning=none

89%

Overall 83Context 200K

158 modelsReasoningCurrentDisplay onlyUpdated May 19, 2026 at 11:13 PM UTC

The published BullshitBench v2 snapshot is tightly clustered at the top: Claude Sonnet 4.6 (high) sits at 91%, while the third row is only 2.0 points behind. The broader top-10 spread is 14.0 points, so the benchmark still separates strong models even when the leaders cluster.

158 models have been evaluated on BullshitBench v2. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. BullshitBench v2 is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About BullshitBench v2

Year

2025

Tasks

Nonsensical and flawed prompts across multiple domains

Format

Prompt challenge and refusal evaluation

Difficulty

Robustness and critical reasoning

BullshitBench evaluates a crucial real-world capability: knowing when NOT to answer. Models that score highly recognize flawed premises, impossible physics scenarios, and logical contradictions rather than hallucinating plausible-sounding responses. V2 includes harder and more diverse challenge categories.

BullshitBench: Measuring whether AI models challenge nonsensical prompts Public benchmark source

BenchLM freshness & provenance

Version

BullshitBench v2 2025

Refresh cadence

Quarterly

Staleness state

Current

Question availability

Public benchmark set

CurrentDisplay only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Clear pushback rate table (158 models)

Claude Sonnet 4.6 (high)anthropic/claude-sonnet-4.6@reasoning=high

AnthropicClosed

91%

Claude Opus 4.5 (high)anthropic/claude-opus-4.5@reasoning=high

AnthropicClosed

90%

Claude Sonnet 4.6 (none)anthropic/claude-sonnet-4.6@reasoning=none

AnthropicClosed

89%

Claude Opus 4.6 (high)anthropic/claude-opus-4.6@reasoning=high

AnthropicClosed

87%

Claude Opus 4.6 (none)anthropic/claude-opus-4.6@reasoning=none

AnthropicClosed

83%

Claude Opus 4.7 (none)anthropic/claude-opus-4.7@reasoning=none

AnthropicClosed

83%

Claude Sonnet 4.5 (high)anthropic/claude-sonnet-4.5@reasoning=high

AnthropicClosed

79%

Claude Opus 4.5 (none)anthropic/claude-opus-4.5@reasoning=none

AnthropicClosed

79%

Qwen3.5 397B (Reasoning) (high)qwen/qwen3.5-397b-a17b@reasoning=high

AlibabaOpen

78%

Claude Haiku 4.5 (high)anthropic/claude-haiku-4.5@reasoning=high

AnthropicClosed

77%

Claude Opus 4.7 (max)anthropic/claude-opus-4.7@reasoning=max

AnthropicClosed

74%

Claude Sonnet 4.5 (none)anthropic/claude-sonnet-4.5@reasoning=none

AnthropicClosed

74%

Qwen3.6 Plus (none)qwen/qwen3.6-plus@reasoning=none

AlibabaClosed

72%

Claude Haiku 4.5 (none)anthropic/claude-haiku-4.5@reasoning=none

AnthropicClosed

71%

Qwen3.5 397B (none)qwen/qwen3.5-397b-a17b@reasoning=none

AlibabaOpen

69%

Grok 4.20 Multi-Agent Beta (low)x-ai/grok-4.20-multi-agent-beta@reasoning=low

67%

Kimi K2.6 (none)moonshotai/kimi-k2.6@reasoning=none

Moonshot AIOpen

65%

Grok 4.20 Multi-Agent Beta (xhigh)x-ai/grok-4.20-multi-agent-beta@reasoning=xhigh

64%

Qwen3.6 Plus (high)qwen/qwen3-max-thinking@reasoning=high

AlibabaClosed

63%

MiMo-V2.5-Pro (xhigh)xiaomi/mimo-v2.5-pro@reasoning=xhigh

XiaomiClosed

62%

Qwen3.6 Plus (xhigh)qwen/qwen3.6-plus@reasoning=xhigh

AlibabaClosed

59%

Grok 4.20 Beta (low)x-ai/grok-4.20-beta@reasoning=low

56%

Grok 4.20 Beta (xhigh)x-ai/grok-4.20-beta@reasoning=xhigh

54%

Nemotron 3 Super 120B A12B (xhigh)nvidia/nemotron-3-super-120b-a12b:free@reasoning=xhigh

NVIDIAOpen

54%

Kimi K2.5 (none)moonshotai/kimi-k2.5@reasoning=none

Moonshot AIOpen

52%

Grok 4.3 (minimal)x-ai/grok-4.3@reasoning=minimal

xAIClosed

50%

Kimi K2.6 (xhigh)moonshotai/kimi-k2.6@reasoning=xhigh

Moonshot AIOpen

50%

anthropic/claude-3.5-haikuanthropic/claude-3.5-haiku@reasoning=default

Anthropic

50%

anthropic/claude-3.7-sonnet:thinkinganthropic/claude-3.7-sonnet:thinking@reasoning=default

Anthropic

49%

GPT-5.4 (none)openai/gpt-5.4@reasoning=none

OpenAIClosed

48%

Gemini 3 Pro (low)google/gemini-3-pro-preview@reasoning=low

GoogleClosed

48%

GPT-5.5 (xhigh)openai/gpt-5.5@reasoning=xhigh

OpenAIClosed

47%

Nemotron 3 Super 120B A12B (high)nvidia/nemotron-3-super-120b-a12b@reasoning=high

NVIDIAOpen

47%

Qwen3.6 Plus (none)qwen/qwen3-max-thinking@reasoning=none

AlibabaClosed

46%

Grok 4.3 (xhigh)x-ai/grok-4.3@reasoning=xhigh

xAIClosed

46%

GPT-5.5 (none)openai/gpt-5.5@reasoning=none

OpenAIClosed

45%

GPT-5.5 (low)openai/gpt-5.5@reasoning=low

OpenAIClosed

45%

GPT-5.2-Codex (low)openai/gpt-5.2-codex@reasoning=low

OpenAIClosed

45%

Claude 3.5 Sonnetanthropic/claude-3.5-sonnet@reasoning=default

AnthropicClosed

45%

GPT-5.1openai/gpt-5.1-chat@reasoning=default

OpenAIClosed

45%

Claude 4.1 Opus (none)anthropic/claude-opus-4.1@reasoning=none

AnthropicClosed

43%

anthropic/claude-3.7-sonnetanthropic/claude-3.7-sonnet@reasoning=default

Anthropic

43%

Nemotron 3 Super 120B A12B (none)nvidia/nemotron-3-super-120b-a12b:free@reasoning=none

NVIDIAOpen

43%

openrouter/hunter-alpha (none)openrouter/hunter-alpha@reasoning=none

Stealth

43%

GPT-5.4 (xhigh)openai/gpt-5.4@reasoning=xhigh

OpenAIClosed

42%

Claude 4.1 Opus (high)anthropic/claude-opus-4.1@reasoning=high

AnthropicClosed

42%

GPT-5.3 Instantopenai/gpt-5.3-chat@reasoning=default

OpenAIClosed

40%

GPT-5 Codexopenai/gpt-5-codex@reasoning=default

39%

GPT-5.2-Codex (xhigh)openai/gpt-5.2-codex@reasoning=xhigh

OpenAIClosed

39%

GPT-5.2 (none)openai/gpt-5.2@reasoning=none

OpenAIClosed

38%

MiMo-V2.5-Pro (none)xiaomi/mimo-v2.5-pro@reasoning=none

XiaomiClosed

38%

Gemini 3.1 Pro (low)google/gemini-3.1-pro-preview@reasoning=low

GoogleClosed

37%

GPT-5.2-Codex (high)openai/gpt-5.2-codex@reasoning=high

OpenAIClosed

37%

openrouter/healer-alpha (none)openrouter/healer-alpha@reasoning=none

Stealth

37%

GPT-5.5 Pro (xhigh)openai/gpt-5.5-pro@reasoning=xhigh

OpenAIClosed

36%

Gemini 3 Pro Deep Think (high)google/gemini-3-pro-preview@reasoning=high

GoogleClosed

36%

MiMo-V2.5 (xhigh)xiaomi/mimo-v2.5@reasoning=xhigh

XiaomiClosed

35%

openrouter/hunter-alpha (xhigh)openrouter/hunter-alpha@reasoning=xhigh

Stealth

35%

GPT-5.5 Pro (medium)openai/gpt-5.5-pro@reasoning=medium

OpenAIClosed

34%

GPT-5.5openai/gpt-5.5-chat@reasoning=default

OpenAIClosed

34%

Claude Opus 4anthropic/claude-opus-4@reasoning=default

34%

GPT-5.4 mini (high)openai/gpt-5.4-mini@reasoning=high

OpenAIClosed

32%

GPT-5.4 mini (none)openai/gpt-5.4-mini@reasoning=none

OpenAIClosed

32%

GPT-5.1-Codex-Maxopenai/gpt-5.1-codex@reasoning=default

OpenAIClosed

32%

GPT-5.4 mini (xhigh)openai/gpt-5.4-mini@reasoning=xhigh

OpenAIClosed

31%

Kimi K2.5 (Reasoning) (high)moonshotai/kimi-k2.5@reasoning=high

Moonshot AIClosed

31%

Gemini 3.1 Pro (high)google/gemini-3.1-pro-preview@reasoning=high

GoogleClosed

31%

GLM-5-Turbo (high)z-ai/glm-5-turbo@reasoning=high

Z.AIClosed

31%

Nemotron 3 Super 120B A12B (none)nvidia/nemotron-3-super-120b-a12b@reasoning=none

NVIDIAOpen

31%

Claude 4 Sonnet (high)anthropic/claude-sonnet-4@reasoning=high

AnthropicClosed

30%

Claude 4 Sonnet (none)anthropic/claude-sonnet-4@reasoning=none

AnthropicClosed

29%

GPT-5.2 (high)openai/gpt-5.2@reasoning=high

OpenAIClosed

28%

Llama 4 Maverickmeta-llama/llama-4-maverick@reasoning=default

MetaOpen

28%

GLM-5 (Reasoning) (high)z-ai/glm-5@reasoning=high

Z.AIOpen

28%

Nemotron 3 Nano 30B A3B (none)nvidia/nemotron-3-nano-30b-a3b:free@reasoning=none

28%

GPT-5.2 Instantopenai/gpt-5.2-chat@reasoning=default

OpenAIClosed

27%

o3openai/o3@reasoning=default

OpenAIClosed

26%

openrouter/healer-alpha (xhigh)openrouter/healer-alpha@reasoning=xhigh

Stealth

26%

GPT-5.1openai/gpt-5.1@reasoning=default

OpenAIClosed

25%

Gemma 4 31B (high)google/gemma-4-31b-it@reasoning=high

GoogleOpen

25%

GPT-5.3 Codex (low)openai/gpt-5.3-codex@reasoning=low

OpenAIClosed

24%

MiMo-V2.5 (none)xiaomi/mimo-v2.5@reasoning=none

XiaomiClosed

24%

GLM-5-Turbo (none)z-ai/glm-5-turbo@reasoning=none

Z.AIClosed

23%

GLM-5.1 (xhigh)z-ai/glm-5.1@reasoning=xhigh

Z.AIOpen

22%

Step 3.5 Flash (xhigh)stepfun/step-3.5-flash@reasoning=xhigh

StepFunOpen

22%

GPT-5openai/gpt-5@reasoning=default

21%

Gemma 4 26B A4B (xhigh)google/gemma-4-26b-a4b-it@reasoning=xhigh

GoogleOpen

21%

GPT-5.3 Codex (high)openai/gpt-5.3-codex@reasoning=high

OpenAIClosed

20%

Qwen3 Coder 480B A35Bqwen/qwen3-coder@reasoning=default

20%

Gemini 2.5 Progoogle/gemini-2.5-pro@reasoning=default

GoogleClosed

20%

GLM-5 (none)z-ai/glm-5@reasoning=none

Z.AIOpen

20%

Gemma 4 31B (none)google/gemma-4-31b-it@reasoning=none

GoogleOpen

20%

Gemini 3.5 Flash (xhigh)google/gemini-3.5-flash@reasoning=xhigh

GoogleClosed

20%

GPT-5.3 Codex (xhigh)openai/gpt-5.3-codex@reasoning=xhigh

OpenAIClosed

19%

Grok 4.1 Fast (high)x-ai/grok-4.1-fast@reasoning=high

xAIClosed

19%

Llama 4 Scoutmeta-llama/llama-4-scout@reasoning=default

MetaOpen

19%

Gemini 2.5 Flashgoogle/gemini-2.5-flash@reasoning=default

GoogleClosed

19%

Gemini 3.5 Flash (minimal)google/gemini-3.5-flash@reasoning=minimal

GoogleClosed

19%

GPT-5openai/gpt-5-chat@reasoning=default

18%

100

DeepSeek V4 Flash (none)deepseek/deepseek-v4-flash@reasoning=none

DeepSeekOpen

18%

101

GLM-5.1 (none)z-ai/glm-5.1@reasoning=none

Z.AIOpen

18%

102

Trinity-Large-Thinking (minimal)arcee-ai/trinity-large-thinking@reasoning=minimal

Arcee AIOpen

17%

103

MiMo-V2-Flash (none)xiaomi/mimo-v2-flash@reasoning=none

XiaomiOpen

16%

104

Hy3 Preview (none)tencent/hy3-preview:free@reasoning=none

TencentOpen

16%

105

google/gemini-2.0-flash-001google/gemini-2.0-flash-001@reasoning=default

Google

15%

106

DeepSeek V4 Pro (xhigh)deepseek/deepseek-v4-pro@reasoning=xhigh

DeepSeekOpen

14%

107

meta-llama/llama-3.1-8b-instructmeta-llama/llama-3.1-8b-instruct@reasoning=default

FAQ

What does BullshitBench v2 measure?

Which model leads the published BullshitBench v2 snapshot?

Claude Sonnet 4.6 (high) currently leads the published BullshitBench v2 snapshot with 91% clear pushback rate. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on BullshitBench v2?

158 AI models are included in BenchLM's mirrored BullshitBench v2 snapshot, based on the public leaderboard captured on May 19, 2026 at 11:13 PM UTC.

Last updated: May 19, 2026 at 11:13 PM UTC · mirrored from the public benchmark leaderboard

The AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.