Best Tool Use & Function Calling Models in 2026

This reporting page focuses on structured output, tool routing, function calling, and MCP-style task completion. It is narrower than the general agentic leaderboard and better suited to developers choosing models for tool-heavy applications.

This page ranks models using only sourced tool-use benchmarks in the reporting family.

According to BenchLM.ai, Grok 4.20 leads this ranking with a score of 96.5, followed by Gemini 3.1 Pro (95.6) and Muse Spark (91.5). The top two are separated by less than a point, while a 17.9-point gap separates the top three from fourth-placed GPT-5.4 (73.6).

The best open-weight option is GLM-5.1 (ranked #5 with a score of 71.2). It trails the leader by 25.3 points but sits only 2.4 points behind GPT-5.4 (73.6), so teams that want full model control give up little relative to the fourth-ranked proprietary model, although the gap to the top three remains wide.
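The gaps between these models can be checked with a few lines of arithmetic. The sketch below is illustrative only: the scores are copied from this page, while the `scores` dictionary and the `gap` helper are hypothetical names introduced for the example and are not part of any BenchLM.ai tooling.

```python
# Sourced averages as quoted on this page for the top five models.
scores = {
    "Grok 4.20": 96.5,
    "Gemini 3.1 Pro": 95.6,
    "Muse Spark": 91.5,
    "GPT-5.4": 73.6,
    "GLM-5.1": 71.2,
}


def gap(higher: str, lower: str) -> float:
    """Score difference in points, rounded to one decimal like the leaderboard."""
    return round(scores[higher] - scores[lower], 1)


# Order models by score, then measure the gap between each adjacent pair.
ranked = sorted(scores, key=scores.get, reverse=True)
adjacent_gaps = {
    f"{hi} vs {lo}": gap(hi, lo) for hi, lo in zip(ranked, ranked[1:])
}

for pair, points in adjacent_gaps.items():
    print(f"{pair}: {points} points")

# Distance from the best open-weight model to the overall leader.
print("GLM-5.1 behind leader:", gap("Grok 4.20", "GLM-5.1"), "points")
```

Run as written, this shows that the largest single step in the top five falls between Muse Spark and GPT-5.4, not between the leaders.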

This ranking is based on provisional weighted scores computed with BenchLM.ai's scoring formula. For detailed model profiles, click any model name below. To compare two specific models head-to-head, use the "vs #" links.

Full Rankings (17 models)

| Rank | Model | Provider | License | Context | Sourced avg |
|------|-------|----------|---------|---------|-------------|
| 1 | Grok 4.20 | xAI | Proprietary | 2M | 96.5 |
| 2 | Gemini 3.1 Pro | Google | Proprietary | 1M | 95.6 |
| 3 | Muse Spark | Meta | Proprietary | 262K | 91.5 |
| 4 | GPT-5.4 | OpenAI | Proprietary | 1.05M | 73.6 |
| 5 | GLM-5.1 | Z.AI | Open Weight | 203K | 71.2 |
| 6 | GPT-5.4 mini | OpenAI | Proprietary | 400K | 64.7 |
| 7 | GPT-5.4 nano | OpenAI | Proprietary | 400K | 61.4 |
| 8 | Qwen3.6 Plus | Alibaba | Proprietary | 1M | 56.9 |
| 9 | MiniMax M2.7 | MiniMax | Proprietary | 200K | 55.5 |
| 10 | Qwen3.5 397B | Alibaba | Open Weight | 128K | 54 |
| 11 | Claude Opus 4.5 | Anthropic | Proprietary | 200K | 53.7 |
| 12 | GLM-5 | Z.AI | Open Weight | 200K | 50.3 |
| 13 | Kimi K2.5 | Moonshot AI | Open Weight | 128K | 47.1 |
| 14 | LFM2.5-VL-450M | LiquidAI | Open Weight | 128K | 21.1 |
| 15 | DeepSeek V3.2 | DeepSeek | Open Weight | 128K | 18.5 |
| 16 | Claude Sonnet 4.5 | Anthropic | Proprietary | 200K | 17 |
| 17 | GLM-4.7 | Z.AI | Open Weight | 200K | 15.5 |

Key Takeaways

The top model on this sourced reporting-family slice is Grok 4.20 by xAI with an average of 96.5.

The best open-weight model is GLM-5.1 at position #5.

17 models are listed with sourced benchmark coverage in this reporting family.
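These takeaways follow directly from the ranking data. As a minimal sketch, assuming the rows are transcribed by hand from this page, the snippet below re-derives them; the `Entry` type and the variable names are hypothetical and do not reflect any BenchLM.ai API or export format.

```python
from collections import Counter
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    provider: str
    license: str
    score: float


# Rows transcribed from the full rankings on this page.
ROWS = [
    Entry("Grok 4.20", "xAI", "Proprietary", 96.5),
    Entry("Gemini 3.1 Pro", "Google", "Proprietary", 95.6),
    Entry("Muse Spark", "Meta", "Proprietary", 91.5),
    Entry("GPT-5.4", "OpenAI", "Proprietary", 73.6),
    Entry("GLM-5.1", "Z.AI", "Open Weight", 71.2),
    Entry("GPT-5.4 mini", "OpenAI", "Proprietary", 64.7),
    Entry("GPT-5.4 nano", "OpenAI", "Proprietary", 61.4),
    Entry("Qwen3.6 Plus", "Alibaba", "Proprietary", 56.9),
    Entry("MiniMax M2.7", "MiniMax", "Proprietary", 55.5),
    Entry("Qwen3.5 397B", "Alibaba", "Open Weight", 54.0),
    Entry("Claude Opus 4.5", "Anthropic", "Proprietary", 53.7),
    Entry("GLM-5", "Z.AI", "Open Weight", 50.3),
    Entry("Kimi K2.5", "Moonshot AI", "Open Weight", 47.1),
    Entry("LFM2.5-VL-450M", "LiquidAI", "Open Weight", 21.1),
    Entry("DeepSeek V3.2", "DeepSeek", "Open Weight", 18.5),
    Entry("Claude Sonnet 4.5", "Anthropic", "Proprietary", 17.0),
    Entry("GLM-4.7", "Z.AI", "Open Weight", 15.5),
]

# Rank by sourced average, highest first.
ranked = sorted(ROWS, key=lambda e: e.score, reverse=True)
top = ranked[0]

# First open-weight entry in rank order, with its 1-based position.
best_open_rank, best_open = next(
    (rank, e) for rank, e in enumerate(ranked, start=1)
    if e.license == "Open Weight"
)

license_counts = Counter(e.license for e in ranked)

print(f"Top model: {top.model} by {top.provider} ({top.score})")
print(f"Best open-weight: {best_open.model} at #{best_open_rank}")
print(f"Models listed: {len(ranked)} ({dict(license_counts)})")
```

Filtering on the license field in the same way gives a quick open-weight-only view of the ranking.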

Last updated: April 8, 2026
