
PinchBench

An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows.

How BenchLM shows PinchBench

BenchLM mirrors the public PinchBench average-success-rate view using the official snapshot updated on 04/13/2026, 4:44 PM: 68 models and 860 runs. PinchBench grades runs with automated checks plus an LLM judge.

This benchmark is display only on BenchLM. It is excluded from BenchLM overall rankings, category rankings, and weighted scoring. The table below uses average scores only, matching the public PinchBench average view rather than the best-run view.
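For intuition, here is a minimal sketch of how an average-score view differs from a best-run view. The run records, field names, and numbers below are illustrative assumptions, not BenchLM's actual schema.

```python
from collections import defaultdict

# Illustrative run records; the schema and scores here are assumptions.
runs = [
    {"model": "arcee-ai/trinity-large-thinking", "success_rate": 0.93},
    {"model": "arcee-ai/trinity-large-thinking", "success_rate": 0.91},
    {"model": "qwen/qwen3.6-plus-preview", "success_rate": 0.84},
]

by_model = defaultdict(list)
for run in runs:
    by_model[run["model"]].append(run["success_rate"])

# Average view: mean over all official runs per model (the view mirrored here).
average_view = {m: sum(v) / len(v) for m, v in by_model.items()}

# Best-run view: each model's single strongest run (not the view shown here).
best_run_view = {m: max(v) for m, v in by_model.items()}
```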

68 models · 860 runs · Average scores only · Official runs · Display only

Average success rate on PinchBench — 04/13/2026, 4:44 PM

BenchLM mirrors the published average-success-rate view for PinchBench. Trinity-Large-Thinking leads the public snapshot at 91.9%, followed by Qwen3.6 Plus (84.0%) and MiniMax M2.7 (82.8%). BenchLM does not use these results to rank models overall.

68 models · Agentic · Current · Display only · Updated 04/13/2026, 4:44 PM

The published PinchBench snapshot pairs a clear leader with a tight pack: Trinity-Large-Thinking sits at 91.9%, 9.1 points ahead of the third row, while ranks 2 through 10 span just 3.6 points (84.0% to 80.4%). The full top-10 spread is 11.5 points, so the benchmark still separates strong models even when the pack clusters.
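The spread figures come straight from the table further down; a quick sketch of the arithmetic:

```python
# Top-10 average success rates from the published snapshot (see table below).
top10 = [91.9, 84.0, 82.8, 81.6, 81.1, 80.9, 80.8, 80.7, 80.6, 80.4]

gap_to_third = top10[0] - top10[2]    # 91.9 - 82.8 = 9.1 points
top10_spread = top10[0] - top10[-1]   # 91.9 - 80.4 = 11.5 points
pack_spread = top10[1] - top10[-1]    # 84.0 - 80.4 = 3.6 points across ranks 2-10
```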

68 models have been evaluated on PinchBench. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. PinchBench itself is currently displayed for reference only and excluded from the scoring formula, so it does not affect overall rankings.
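The exclusion behaves like a flag rather than a zero weight. Here is a minimal sketch under that assumption; everything below except the 22% Agentic weight (the second benchmark, the scores, and the field names) is hypothetical.

```python
CATEGORY_WEIGHTS = {"Agentic": 0.22}  # Agentic weight stated above; others omitted

# Hypothetical benchmark entries; names, scores, and flags are assumptions.
benchmarks = [
    {"name": "PinchBench", "category": "Agentic", "score": 79.1, "display_only": True},
    {"name": "HypotheticalAgentBench", "category": "Agentic", "score": 74.0, "display_only": False},
]

def weighted_contribution(entries, weights):
    # Display-only benchmarks are skipped, so PinchBench contributes nothing.
    scored = [b for b in entries if not b["display_only"]]
    return sum(weights[b["category"]] * b["score"] for b in scored)

print(weighted_contribution(benchmarks, CATEGORY_WEIGHTS))  # 0.22 * 74.0 = 16.28
```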

About PinchBench

Year: 2026
Tasks: 23 OpenClaw agent tasks
Format: Average success rate from official runs
Difficulty: Long-horizon agent workflows

PinchBench publishes official OpenClaw runs across 23 tasks and grades results with automated checks plus an LLM judge. BenchLM mirrors the public average-score view as a display-only benchmark.
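As a rough sketch of how 23 task grades could roll up into a run-level success rate: PinchBench combines automated checks with an LLM judge, and the sketch below assumes a task counts as a success only when both agree. That combination rule, and all names below, are assumptions rather than PinchBench's documented grading logic.

```python
def task_success(checks_passed: bool, judge_approved: bool) -> bool:
    # Assumption: a task succeeds only if automated checks AND the judge agree.
    return checks_passed and judge_approved

def run_success_rate(task_results: list[tuple[bool, bool]]) -> float:
    # One (checks, judge) pair per task; a PinchBench run covers 23 tasks.
    successes = sum(task_success(c, j) for c, j in task_results)
    return successes / len(task_results)

# Example: 19 of 23 tasks pass both gates -> ~82.6% for this run.
results = [(True, True)] * 19 + [(True, False)] * 4
print(f"{run_success_rate(results):.1%}")
```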

BenchLM freshness & provenance

Version: PinchBench 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
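A minimal sketch of that triage, assuming the state names shown on this page and a decision order in which the display-only flag wins; the actual policy is defined on the methodology page.

```python
def benchmark_treatment(staleness_state: str, display_only: bool) -> str:
    # Assumed decision order: the display-only flag overrides freshness.
    if display_only:
        return "display-only reference"   # PinchBench's case on this page
    if staleness_state == "Current":
        return "strong differentiator"
    return "benchmark to watch"           # stale or aging benchmarks

print(benchmark_treatment("Current", display_only=True))  # display-only reference
```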

Average success rate table (68 models)

1. Trinity-Large-Thinking (arcee-ai/trinity-large-thinking): 91.9%
2. Qwen3.6 Plus (qwen/qwen3.6-plus-preview): 84.0%
3. MiniMax M2.7 (minimax/minimax-m2.7): 82.8%
4. Claude Opus 4.6 (anthropic/claude-opus-4.6): 81.6%
5. MiMo-V2-Omni (xiaomi/mimo-v2-omni): 81.1%
6. GLM-5.1 (z-ai/glm-5.1): 80.9%
7. Qwen3.5-122B-A10B (qwen/qwen3.5-122b-a10b): 80.8%
8. Claude Sonnet 4.6 (anthropic/claude-sonnet-4.6): 80.7%
9. GLM-5 (z-ai/glm-5): 80.6%
10. Qwen3.5 397B (qwen/qwen3.5-397b-a17b): 80.4%
11. MiMo-V2-Pro (xiaomi/mimo-v2-pro): 80.4%
12. GLM-5-Turbo (z-ai/glm-5-turbo): 80.3%
13. Claude Sonnet 4.5 (anthropic/claude-sonnet-4.5): 80.0%
14. Seed-2.0-Lite (bytedance-seed/seed-2.0-lite): 79.8%
15. MiniMax M2.1 (minimax/minimax-m2.1): 79.7%
16. GPT-5.4 (openai/gpt-5.4): 79.4%
17. Qwen3.5 Plus (qwen/qwen3.5-plus-02-15): 79.1%
18. Qwen3 Coder Next (qwen/qwen3-coder-next): 79.1%
19. Claude Opus 4.5 (anthropic/claude-opus-4.5): 78.8%
20. Kimi K2.5 (moonshotai/kimi-k2.5): 78.6%
21. Qwen3.5-27B (qwen/qwen3.5-27b): 78.5%
22. MiniMax M2.5 (minimax/minimax-m2.5): 78.1%
23. Gemini 3.1 Pro (google/gemini-3.1-pro-preview): 77.5%
24. Claude Haiku 4.5 (anthropic/claude-haiku-4.5): 77.4%
25. (model name not captured in snapshot): 77.3%
26. (model name not captured in snapshot): 77.3%
28. GLM-4.5-Air (z-ai/glm-4.5-air): 76.8%
29. Step 3.5 Flash (stepfun/step-3.5-flash): 76.6%
30. Gemini 3 Flash (google/gemini-3-flash-preview): 75.3%
33. Nemotron 3 Super 120B A12B (nvidia/nemotron-3-super-120b-a12b): 73.1%
34. Qwen3 Max (qwen/qwen3-max-thinking): 71.8%
35. Qwen3.5-35B-A3B (qwen/qwen3.5-35b-a3b): 71.7%
36. GPT-5.4 mini (openai/gpt-5.4-mini): 71.4%
37. Grok 4.1 Fast (x-ai/grok-4.1-fast): 71.3%
39. Grok 4.20 (x-ai/grok-4.20): 71.2%
41. Mercury 2 (inception/mercury-2): 70.0%
42. MiMo-V2-Flash (xiaomi/mimo-v2-flash): 69.7%
44. GPT-5.4 nano (openai/gpt-5.4-nano): 69.5%
45. GPT-5 mini (openai/gpt-5-mini): 69.0%
46. DeepSeek V3.2 (deepseek/deepseek-v3.2): 67.7%
47. Gemini 3 Pro (google/gemini-3-pro-preview): 67.7%
48. GLM-5V-Turbo (z-ai/glm-5v-turbo): 67.0%
52. Gemini 2.5 Pro (google/gemini-2.5-pro): 65.0%
53. Trinity-Large-Thinking (arcee-ai/trinity-large-preview): 63.7%
54. GPT-4o mini (openai/gpt-4o-mini): 63.6%
55. Qwen3.6 Plus (qwen/qwen3.6-plus): 63.3%
56. (model name not captured in snapshot): 62.8%
57. Gemini 2.5 Flash (google/gemini-2.5-flash): 57.2%
58. GPT-4o (openai/gpt-4o): 56.6%
59. GPT-5 nano (openai/gpt-5-nano): 56.2%
60. GPT-OSS 120B (openai/gpt-oss-120b): 52.0%
61. GPT-OSS 20B (openai/gpt-oss-20b): 50.3%
62. (model name not captured in snapshot): 34.8%
63. Llama 4 Maverick (meta-llama/llama-4-maverick): 34.8%
65. Llama 3.1 70B Instruct (meta-llama/llama-3.1-70b-instruct): 22.7%
66. Gemini 2.5 Flash-Lite (google/gemini-2.5-flash-lite): 12.7%
67. GPT-5.4 Pro (openai/gpt-5.4-pro): 12.0%
68. Llama 4 Scout (meta-llama/llama-4-scout): 5.4%

FAQ

What does PinchBench measure?

An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows.

Which model leads the published PinchBench snapshot?

Trinity-Large-Thinking currently leads the published PinchBench snapshot with an average success rate of 91.9%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on PinchBench?

68 AI models are included in BenchLM's mirrored PinchBench snapshot, based on the public leaderboard captured on 04/13/2026, 4:44 PM.

Last updated: 04/13/2026, 4:44 PM · mirrored from the public benchmark leaderboard
