Multilingual Benchmarks
Performance across multiple languages
MGSM
88 models
1 | GPT-5.3 Codex | OpenAI | Closed | Reasoning | 400K | 92 | 96%
2 | GPT-5.4 | OpenAI | Closed | Reasoning | 1M | 91 | 95%
3 | GPT-5.2 | OpenAI | Closed | Reasoning | 400K | 91 | 95%
4 | Claude Opus 4.6 | Anthropic | Closed | Standard | 1M | 90 | 96%
5 | Gemini 3.1 Pro | Google | Closed | Standard | 1M | 89 | 96%
6 | Grok 4.1 | xAI | Closed | Standard | 128K | 89 | 96%
7 | GPT-5.2-Codex | OpenAI | Closed | Reasoning | 400K | 88 | 91%
8 | GPT-5.1-Codex-Max | OpenAI | Closed | Reasoning | 400K | 87 | 89%
9 | Claude Sonnet 4.6 | Anthropic | Closed | Standard | 1M | 86 | 91%
10 | Gemini 3 Pro Deep Think | Google | Closed | Reasoning | 2M | 85 | 92%
11 | Claude Opus 4.5 | Anthropic | Closed | Standard | 200K | 85 | 90%
12 | GPT-5.1 | OpenAI | Closed | Reasoning | 400K | 85 | 89%
13 | GPT-5 (high) | OpenAI | Closed | Reasoning | 128K | 84 | 89%
14 | Gemini 3 Pro | Google | Closed | Standard | 2M | 84 | 89%
15 | GLM-5 (Reasoning) | Zhipu AI | Open | Reasoning | 200K | 84 | 89%
16 | o1-preview | OpenAI | Closed | Reasoning | 200K | 83 | 90%
17 | Claude Sonnet 4.5 | Anthropic | Closed | Standard | 1M | 83 | 91%
18 | Grok 4.1 Fast | xAI | Closed | Standard | 2M | 83 | 88%
19 | GPT-5 (medium) | OpenAI | Closed | Reasoning | 128K | 82 | 90%
20 | Kimi K2.5 (Reasoning) | Moonshot AI | Open | Reasoning | 128K | 82 | 88%
21 | Qwen3.5 397B (Reasoning) | Alibaba | Open | Reasoning | 128K | 82 | 91%
22 | o3-pro | OpenAI | Closed | Reasoning | 200K | 77 | 83%
23 | o3 | OpenAI | Closed | Reasoning | 200K | 76 | 83%
24 | DeepSeek V3.2 (Thinking) | DeepSeek | Open | Reasoning | 128K | 75 | 84%
25 | GPT-5 mini | OpenAI | Closed | Reasoning | 128K | 74 | 82%
Showing the top 25 of 88 models
About Multilingual Benchmarks
MGSM (Multilingual Grade School Math) consists of grade-school math word problems translated into 10 typologically diverse languages, plus the English originals.
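
One common way to score a benchmark of this kind is exact-match accuracy on the final numeric answer, computed per language and then averaged with equal weight across languages. The sketch below assumes that scheme; the page does not state how its percentages are aggregated, and the function names and toy records are hypothetical.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return {language: accuracy} from (language, predicted, gold) tuples.

    Assumption: an answer counts as correct only on exact match
    with the gold final answer.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, predicted, gold in records:
        total[lang] += 1
        if predicted == gold:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def macro_average(per_lang):
    """Unweighted mean of per-language accuracies (assumed aggregation)."""
    return sum(per_lang.values()) / len(per_lang)


# Hypothetical toy data: (language code, model answer, gold answer).
records = [
    ("en", 18, 18),
    ("en", 7, 7),
    ("de", 18, 18),
    ("de", 5, 7),   # one wrong answer in German
    ("sw", 18, 18),
    ("sw", 7, 7),
]

scores = per_language_accuracy(records)
overall = macro_average(scores)
print(scores)
print(f"{overall:.1%}")
```

Averaging per language first keeps a language with fewer scored items from being drowned out, which matters when the goal is to compare performance across languages rather than across individual problems.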