A function-calling benchmark for tool selection, schema adherence, and argument correctness.
As of March 2026, Claude Opus 4.6 leads the BFCL v4 leaderboard with 77.0%, followed by GLM-5 (70.8%) and MiniMax M2.7 (70.6%).
Rank  Model            Organization  Score
1     Claude Opus 4.6  Anthropic     77.0%
2     GLM-5            Zhipu AI      70.8%
3     MiniMax M2.7     MiniMax       70.6%
The spread is uneven: Claude Opus 4.6 leads GLM-5 by 6.2 points, while only 0.2 points separate GLM-5 from MiniMax M2.7, leaving a clear break between the top model and the mid-tier pair.
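To make concrete what these scores measure, below is a minimal sketch of how a single function-calling sample could be graded on the three axes named above: tool selection, schema adherence, and argument correctness. This is not BFCL's actual harness; the tool name, schema, and exact-match scoring are illustrative assumptions.

```python
# Hypothetical grading sketch for one function-calling sample.
# All names and the scoring granularity are invented for illustration.
from jsonschema import ValidationError, validate


def grade_tool_call(model_output: dict, expected_tool: str,
                    tool_schemas: dict, expected_args: dict) -> dict:
    """Score one sample on tool selection, schema adherence, and arguments."""
    tool = model_output.get("name")
    args = model_output.get("arguments", {})

    # 1. Tool selection: did the model pick the right function?
    tool_ok = tool == expected_tool

    # 2. Schema adherence: do the arguments validate against the
    #    declared JSON Schema for the chosen tool?
    schema_ok = False
    if tool in tool_schemas:
        try:
            validate(instance=args, schema=tool_schemas[tool])
            schema_ok = True
        except ValidationError:
            schema_ok = False

    # 3. Argument correctness: do the values match the reference answer?
    args_ok = args == expected_args

    return {"tool_selection": tool_ok,
            "schema_adherence": schema_ok,
            "argument_correctness": args_ok}


# Example: a weather lookup where the model must call get_weather
# with a city string and a unit enum (both fields are made up).
schemas = {"get_weather": {
    "type": "object",
    "properties": {"city": {"type": "string"},
                   "unit": {"type": "string", "enum": ["C", "F"]}},
    "required": ["city"],
    "additionalProperties": False,
}}
output = {"name": "get_weather", "arguments": {"city": "Berlin", "unit": "C"}}
print(grade_tool_call(output, "get_weather", schemas,
                      {"city": "Berlin", "unit": "C"}))
```

Real harnesses are usually more lenient on argument correctness (accepting semantically equivalent values, for instance); the exact-match check here just keeps the sketch short.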
Five models have been evaluated on BFCL v4. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. BFCL v4 is currently displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.
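For illustration, here is a hedged sketch of how a category-weighted overall score might exclude display-only benchmarks. The 22% Agentic weight comes from the paragraph above; the data layout, helper name, and the second benchmark entry are invented, not BenchLM.ai's actual formula or data.

```python
# Hypothetical weighted-score sketch; weights other than Agentic
# and the "SomeAgentBench" entry are assumptions, not site data.
CATEGORY_WEIGHTS = {"Agentic": 0.22}

benchmarks = [
    {"name": "BFCL v4", "category": "Agentic", "score": 77.0,
     "display_only": True},   # display-only: excluded from the formula
    {"name": "SomeAgentBench", "category": "Agentic", "score": 60.0,
     "display_only": False},  # invented example entry
]


def overall_score(benchmarks: list, weights: dict) -> float:
    """Average scored benchmarks within each category, then apply weights."""
    by_cat: dict = {}
    for b in benchmarks:
        if b["display_only"]:
            continue  # display-only references never affect rankings
        by_cat.setdefault(b["category"], []).append(b["score"])
    return sum(weights.get(cat, 0.0) * sum(scores) / len(scores)
               for cat, scores in by_cat.items())


print(overall_score(benchmarks, CATEGORY_WEIGHTS))  # 0.22 * 60.0 = 13.2
```

Under this scheme, changing BFCL v4's score would alter the displayed leaderboard but leave the weighted overall score untouched, which matches the stated policy.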
Year:        2026
Tasks:       Function-calling tasks
Format:      Tool invocation and schema evaluation
Difficulty:  Advanced tool use
BenchLM treats BFCL v4 as a display-only function-calling reference, kept outside the current weighted scoring core.
Version:                BFCL v4 2026
Refresh cadence:        Quarterly
Staleness state:        Current
Question availability:  Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
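As a rough illustration of that policy, the sketch below maps freshness metadata like the fields above (refresh cadence, staleness state) to one of the three treatment tiers. The tier logic is an assumption; only the field names, tier labels, and BFCL v4's values come from this page.

```python
# Hypothetical freshness-to-tier mapping; the decision rules are
# invented, not BenchLM's documented methodology.
from dataclasses import dataclass


@dataclass
class Freshness:
    refresh_cadence: str   # e.g. "Quarterly"
    staleness_state: str   # e.g. "Current", "Aging", "Stale"
    display_only: bool


def treatment(meta: Freshness) -> str:
    """Pick a treatment tier from freshness metadata."""
    if meta.display_only:
        return "display-only reference"
    if meta.staleness_state == "Current":
        return "strong differentiator"
    return "benchmark to watch"


# BFCL v4's metadata as listed above.
bfcl_v4 = Freshness("Quarterly", "Current", display_only=True)
print(treatment(bfcl_v4))  # display-only reference
```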
Claude Opus 4.6 by Anthropic currently leads with a score of 77.0% on BFCL v4.
Five AI models have been evaluated on BFCL v4 on BenchLM.