
Multilingual Benchmarks

Performance across multiple languages

MGSM

Showing 25 of 88 models.

| # | Model | Org | Access | Type | Context | Score | MGSM |
|---|-------|-----|--------|------|---------|-------|------|
| 1 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 92 | 96% |
| 2 | GPT-5.4 | OpenAI | Closed | Reasoning | 1M | 91 | 95% |
| 3 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 91 | 95% |
| 4 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 90 | 96% |
| 5 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 89 | 96% |
| 6 | Grok 4.1 | xAI | Closed | Standard | 128K | 89 | 96% |
| 7 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 88 | 91% |
| 8 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 87 | 89% |
| 9 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 1M | 86 | 91% |
| 10 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 85 | 92% |
| 11 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 85 | 90% |
| 12 | GPT-5.1 | OpenAI | Closed | Reasoning | 400K | 85 | 89% |
| 13 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 84 | 89% |
| 14 | Gemini 3 Pro | Google | Closed | Standard | 2M | 84 | 89% |
| 15 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 84 | 89% |
| 16 | o1-preview | OpenAI | Closed | Reasoning | 200K | 83 | 90% |
| 17 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 1M | 83 | 91% |
| 18 | Grok 4.1 Fast | xAI | Closed | Standard | 2M | 83 | 88% |
| 19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 82 | 90% |
| 20 | Kimi K2.5 (Reasoning) | Moonshot AI | Open | Reasoning | 128K | 82 | 88% |
| 21 | Qwen3.5 397B (Reasoning) | Alibaba | Open | Reasoning | 128K | 82 | 91% |
| 22 | o3-pro | OpenAI | Closed | Reasoning | 200K | 77 | 83% |
| 23 | o3 | OpenAI | Closed | Reasoning | 200K | 76 | 83% |
| 24 | DeepSeek V3.2 (Thinking) | DeepSeek | Open | Reasoning | 128K | 75 | 84% |
| 25 | GPT-5 mini | OpenAI | Closed | Reasoning | 128K | 74 | 82% |

About Multilingual Benchmarks

MGSM consists of grade-school math word problems translated into 10 typologically diverse languages, plus the original English set; a model's score reflects its accuracy across all of these languages.
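Since MGSM reports a single score over many languages, the headline number is typically a macro-average: accuracy is computed per language and then averaged, so each language counts equally regardless of how many problems it has. A minimal sketch of that scoring, with an illustrative language list and made-up results (not real benchmark output):

```python
# Sketch: MGSM-style macro-average scoring. The language codes and the
# `results` data are illustrative assumptions, not actual benchmark output.

MGSM_LANGUAGES = ["en", "es", "fr", "de", "ru", "zh", "ja", "th", "sw", "bn", "te"]

def mgsm_score(results: dict[str, list[bool]]) -> float:
    """Macro-average of per-language exact-match accuracy:
    compute accuracy within each language, then average those
    accuracies so every language is weighted equally."""
    per_language = [sum(correct) / len(correct) for correct in results.values()]
    return sum(per_language) / len(per_language)

# Illustrative run: 4 problems per language, True = correct final answer.
results = {lang: [True, True, True, False] for lang in MGSM_LANGUAGES}
print(f"{mgsm_score(results):.0%}")  # → 75%
```

Macro-averaging (rather than pooling all problems together) is what keeps low-resource languages from being drowned out by larger English-language subsets.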