
Vals-hosted MMLU-Pro mirror (Vals MMLU-Pro mirror)

Vals AI hosted MMLU-Pro view with subject-level task splits.

How BenchLM shows Vals MMLU-Pro mirror

BenchLM mirrors the public Vals AI MMLU-Pro leaderboard, captured from https://www.vals.ai/benchmarks/mmlu_pro and last updated by Vals on May 16, 2026. The snapshot preserves overall scores, uncertainty, latency, cost-per-test metadata, and task-level scores where Vals publishes them.

Vals MMLU-Pro mirror is display only on BenchLM. Vals proprietary or Vals-hosted aggregate views are useful context, but BenchLM does not use them as weighted ranking inputs or as a replacement for benchmark-native source records.

104 Vals rows · 15 task views · Public dataset · Tasks: Overall, Biology, Business, Chemistry, Computer Science · Display only

Vals MMLU-Pro score on Vals MMLU-Pro mirror — May 16, 2026

BenchLM mirrors the published Vals MMLU-Pro score view for Vals MMLU-Pro mirror. Gemini 3.1 Pro Preview leads the public snapshot at 90.99%, followed by Gemini 3 Pro Preview (90.10%) and Claude Opus 4.7 (89.87%). BenchLM does not use these results to rank models overall.

104 models · External benchmark mirrors · Current · Display only · Updated May 16, 2026

The published Vals MMLU-Pro mirror snapshot is tightly clustered at the top: Gemini 3.1 Pro Preview sits at 90.99%, while third-place Claude Opus 4.7 (89.87%) is only 1.12 points behind. The broader top-10 spread, from rank 1 to rank 10, is about 3.4 points, so many of the published scores sit in a relatively narrow band.
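As a rough check on the clustering described above, the gaps can be recomputed from the two-decimal scores shown in the score table. This is a minimal sketch: the `gap` helper and variable names are illustrative, not part of BenchLM or Vals tooling. It works on the displayed values only, so a result may differ by a hundredth of a point from a figure computed on unrounded scores.

```python
from decimal import Decimal

# Top-10 scores as displayed in the score table (percent, two decimals).
# Decimal avoids binary floating-point noise in the subtraction.
top10 = [Decimal(s) for s in (
    "90.99", "90.10", "89.87", "89.52", "89.11",
    "88.59", "88.14", "87.92", "87.67", "87.57",
)]


def gap(scores, rank_a, rank_b):
    """Point gap between two 1-indexed ranks in a descending score list."""
    return scores[rank_a - 1] - scores[rank_b - 1]


first_to_third = gap(top10, 1, 3)   # rank 1 vs rank 3
top10_spread = gap(top10, 1, 10)    # rank 1 vs rank 10

print(first_to_third, top10_spread)  # prints: 1.12 3.42
```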

104 models have been evaluated on Vals MMLU-Pro mirror. The benchmark falls in the External benchmark mirrors category, which BenchLM tracks separately from its weighted global scoring system, so these results are best compared on the dedicated External benchmark mirrors views. Vals MMLU-Pro mirror is displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
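To illustrate what "excluded from the scoring formula" can mean in practice, here is a minimal sketch of a weighted mean that skips display-only rows. BenchLM's actual formula is not given on this page; the `BenchmarkResult` type, the `weighted_score` function, the benchmark names, and the weights below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkResult:
    name: str
    score: float        # accuracy in percent
    weight: float       # hypothetical weight, not a BenchLM value
    display_only: bool  # True for mirrors shown for reference only


def weighted_score(results: List[BenchmarkResult]) -> Optional[float]:
    """Weighted mean over rankable results; display-only rows are skipped."""
    rankable = [r for r in results if not r.display_only]
    total_weight = sum(r.weight for r in rankable)
    if total_weight == 0:
        return None  # nothing rankable to aggregate
    return sum(r.score * r.weight for r in rankable) / total_weight


# Hypothetical inputs: two rankable benchmarks plus the display-only mirror.
results = [
    BenchmarkResult("hypothetical-benchmark-a", 80.0, 2.0, False),
    BenchmarkResult("hypothetical-benchmark-b", 70.0, 1.0, False),
    BenchmarkResult("Vals MMLU-Pro mirror", 90.99, 0.0, True),
]

print(weighted_score(results))
```

Under these assumptions, changing the mirror's score leaves the aggregate untouched, which is the sense in which a display-only benchmark does not directly affect overall rankings.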

About Vals MMLU-Pro mirror

Year: 2026
Tasks: MMLU-Pro subject splits
Format: Accuracy score
Difficulty: Professional academic reasoning

BenchLM keeps this Vals-hosted MMLU-Pro table separate from canonical MMLU-Pro source records.

BenchLM freshness & provenance

Version: Vals MMLU-Pro mirror 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Vals MMLU-Pro score table (104 models)

Rank | Model | Model ID | Score
1 | Gemini 3.1 Pro Preview | google/gemini-3.1-pro-preview | 90.99%
2 | Gemini 3 Pro Preview | google/gemini-3-pro-preview | 90.10%
3 | Claude Opus 4.7 | anthropic/claude-opus-4-7 | 89.87%
4 | Gemini 3.5 Flash | google/gemini-3.5-flash | 89.52%
5 | Claude Opus 4.6 Thinking | anthropic/claude-opus-4-6-thinking | 89.11%
6 | Gemini 3 Flash Preview | google/gemini-3-flash-preview | 88.59%
7 | GPT-5.5 | openai/gpt-5.5 | 88.14%
8 | Claude Opus 4.1 20250805 Thinking | anthropic/claude-opus-4-1-20250805-thinking | 87.92%
9 | Qwen3.6 Plus | alibaba/qwen3.6-plus | 87.67%
10 | Kimi K2.6 Thinking | kimi/kimi-k2.6-thinking | 87.57%
11 | GPT-5.4 | openai/gpt-5.4-2026-03-05 | 87.48%
12 | Claude Sonnet 4.5 20250929 Thinking | anthropic/claude-sonnet-4-5-20250929-thinking | 87.36%
13 | Claude Sonnet 4.6 | anthropic/claude-sonnet-4-6 | 87.34%
14 | Muse Spark | meta/muse_spark | 87.32%
15 | Claude Opus 4.5 20251101 Thinking | anthropic/claude-opus-4-5-20251101-thinking | 87.26%
16 | DeepSeek V4 Pro | deepseek/deepseek-v4-pro | 87.25%
17 | Claude Opus 4.1 | anthropic/claude-opus-4-1-20250805 | 87.21%
18 | Qwen3.5 Plus Thinking | alibaba/qwen3.5-plus-thinking | 87.18%
19 | MiniMax M2.1 | minimax/MiniMax-M2.1 | 87.05%
20 | GLM 5.1 Thinking | zai/glm-5.1-thinking | 86.90%
21 | GPT-5 | openai/gpt-5-2025-08-07 | 86.54%
22 | GPT-5.1 | openai/gpt-5.1-2025-11-13 | 86.38%
23 | Grok 4.20 0309 Reasoning | grok/grok-4.20-0309-reasoning | 86.25%
24 | Gemini 3.1 Flash Lite Preview | google/gemini-3.1-flash-lite-preview | 86.24%
25 | GPT-5.2 | openai/gpt-5.2-2025-12-11 | 86.23%
26 | Claude Opus 4 | anthropic/claude-opus-4-20250514 | 86.17%
27 | GLM 5 Thinking | zai/glm-5-thinking | 86.03%
28 | Kimi K2.5 Thinking | kimi/kimi-k2.5-thinking | 85.91%
29 | Grok 4.3 | grok/grok-4.3 | 85.84%
30 | O3 | openai/o3-2025-04-16 | 85.59%
31 | Claude Opus 4.5 | anthropic/claude-opus-4-5-20251101 | 85.59%
32 | Grok 4 0709 | grok/grok-4-0709 | 85.30%
33 | Qwen3 Max | alibaba/qwen3-max-2026-01-23 | 84.98%
34 | DeepSeek V3p2 Thinking | fireworks/deepseek-v3p2-thinking | 84.92%
35 | GPT-5.4 Mini | openai/gpt-5.4-mini-2026-03-17 | 84.55%
36 | Qwen3 Max | alibaba/qwen3-max | 84.36%
37 | Grok 4.1 Fast Reasoning | grok/grok-4-1-fast-reasoning | 84.18%
38 | Qwen3.5 Flash | alibaba/qwen3.5-flash | 84.06%
39 | Gemini 2.5 Pro Exp 03 25 | google/gemini-2.5-pro-exp-03-25 | 84.06%
40 | Claude Sonnet 4 20250514 Thinking | anthropic/claude-sonnet-4-20250514-thinking | 83.86%
41 | Gemini 2.5 Flash Preview 09 2025 | google/gemini-2.5-flash-preview-09-2025 | 83.69%
42 | Gemini 2.5 Flash Preview 09 2025 Thinking | google/gemini-2.5-flash-preview-09-2025-thinking | 83.66%
43 | Qwen3 Max Preview | alibaba/qwen3-max-preview | 83.54%
44 | O1 | openai/o1-2024-12-17 | 83.49%
45 | DeepSeek R1 | fireworks/deepseek-r1 | 83.18%
46 | DeepSeek V3p2 | fireworks/deepseek-v3p2 | 83.06%
47 | GLM 4.7 | zai/glm-4.7 | 82.74%
48 | Claude 3.7 Sonnet 20250219 Thinking | anthropic/claude-3-7-sonnet-20250219-thinking | 82.73%
49 | GPT-5 Mini | openai/gpt-5-mini-2025-08-07 | 82.23%
50 | GLM 4.6 | zai/glm-4.6 | 82.20%
51 | Grok 3 Mini Fast High Reasoning | grok/grok-3-mini-fast-high-reasoning | 81.37%
52 | Qwen3 235b A22b | fireworks/qwen3-235b-a22b | 81.25%
53 | GLM 4.5 | zai/glm-4.5 | 81.22%
54 | Kimi K2 Thinking | kimi/kimi-k2-thinking | 81.07%
55 | Claude 3.7 Sonnet | anthropic/claude-3-7-sonnet-20250219 | 80.66%
56 | O4 Mini | openai/o4-mini-2025-04-16 | 80.56%
57 | GPT-4.1 | openai/gpt-4.1-2025-04-14 | 80.50%
58 | MiniMax M2.7 | minimax/MiniMax-M2.7 | 80.43%
59 | MiniMax M2.5 Lightning | minimax/MiniMax-M2.5-Lightning | 80.09%
60 | Grok 3 Mini Fast Low Reasoning | grok/grok-3-mini-fast-low-reasoning | 80.01%
61 | Grok 3 | grok/grok-3 | 79.95%
62 | Mistral Large 2512 | mistralai/mistral-large-2512 | 79.82%
63 | Grok 4 Fast Reasoning | grok/grok-4-fast-reasoning | 79.70%
64 | DeepSeek V3 0324 | fireworks/deepseek-v3-0324 | 79.47%
65 | Claude Sonnet 4 | anthropic/claude-sonnet-4-20250514 | 79.43%
66 | Llama4 Maverick Instruct Basic | fireworks/llama4-maverick-instruct-basic | 79.42%
67 | Moonshotai Kimi K2 Instruct | together/moonshotai/Kimi-K2-Instruct | 79.39%
68 | GPT Oss 120b | fireworks/gpt-oss-120b | 79.17%
69 | Gemini 2.5 Flash Lite Preview 09 2025 Thinking | google/gemini-2.5-flash-lite-preview-09-2025-thinking | 79.12%
70 | Claude Haiku 4.5 20251001 Thinking | anthropic/claude-haiku-4-5-20251001-thinking | 78.72%
71 | O3 Mini | openai/o3-mini-2025-01-31 | 78.69%
72 | Gemini 2.5 Flash Lite Preview 09 2025 | google/gemini-2.5-flash-lite-preview-09-2025 | 78.64%
73 | Claude 3.5 Sonnet | anthropic/claude-3-5-sonnet-20241022 | 78.40%
74 | Gemini 2.0 Flash 001 | google/gemini-2.0-flash-001 | 77.38%
75 | GPT-4.1 Mini | openai/gpt-4.1-mini-2025-04-14 | 77.22%
76 | GPT-5.4 Nano | openai/gpt-5.4-nano-2026-03-17 | 77.17%
77 | GPT-5 Nano | openai/gpt-5-nano-2025-08-07 | 76.07%
78 | Grok 2 1212 | grok/grok-2-1212 | 75.47%
79 | Mistral Medium 3.5 | mistralai/mistral-medium-3.5 | 75.33%
80 | Gemini 1.5 Pro 002 | google/gemini-1.5-pro-002 | 75.29%
81 | Mistral Medium 2505 | mistralai/mistral-medium-2505 | 75.29%
82 | Grok 4.1 Fast Non Reasoning | grok/grok-4-1-fast-non-reasoning | 75.21%
83 | GPT-4o | openai/gpt-4o-2024-08-06 | 74.13%
84 | DeepSeek V3 | fireworks/deepseek-v3 | 73.82%
85 | GPT-4o | openai/gpt-4o-2024-11-20 | 72.56%
86 | GPT Oss 20b | fireworks/gpt-oss-20b | 71.64%
87 | Langston Nim Nvidia Llama 3.3 Nemotron Super 49b V1 42e84561 | together/langston/nim/nvidia/llama-3.3-nemotron-super-49b-v1-42e84561 | 70.78%
88 | Grok 4 Fast Non Reasoning | grok/grok-4-fast-non-reasoning | 70.34%
89 | Meta Llama Llama 3.3 70B Instruct Turbo | together/meta-llama/Llama-3.3-70B-Instruct-Turbo | 69.86%
90 | Mistral Large 2411 | mistralai/mistral-large-2411 | 69.71%
91 | Meta Llama Llama 4 Scout 17B 16E Instruct | together/meta-llama/Llama-4-Scout-17B-16E-Instruct | 69.63%
92 | Langston Nim Nvidia Llama 3.3 Nemotron Super 49b V1 42e84561 Thinking | together/langston/nim/nvidia/llama-3.3-nemotron-super-49b-v1-42e84561-thinking | 69.58%
93 | Command A 03 2025 | cohere/command-a-03-2025 | 69.17%
94 | Magistral Medium 2509 | mistralai/magistral-medium-2509 | 68.66%
95 | Mistral Small 2503 | mistralai/mistral-small-2503 | 66.02%
96 | Gemini 1.5 Flash 002 | google/gemini-1.5-flash-002 | 65.61%
97 | Mistral Small 2402 | mistralai/mistral-small-2402 | 64.44%
98 | Claude 3.5 Haiku | anthropic/claude-3-5-haiku-20241022 | 64.12%
99 | GPT-4.1 Nano | openai/gpt-4.1-nano-2025-04-14 | 63.48%
100 | GPT-4o Mini | openai/gpt-4o-mini-2024-07-18 | 62.73%
101 | Magistral Small 2509 | mistralai/magistral-small-2509 | 62.13%
102 | Jamba Large 1.6 | ai21labs/jamba-large-1.6 | 49.78%
103 | Command R Plus | cohere/command-r-plus | 44.00%
104 | Jamba Mini 1.6 | ai21labs/jamba-mini-1.6 | 30.28%

FAQ

What does Vals MMLU-Pro mirror measure?

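It is the Vals AI hosted view of MMLU-Pro, reporting accuracy scores overall and across subject-level task splits such as Biology, Business, Chemistry, and Computer Science.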

Which model leads the published Vals MMLU-Pro mirror snapshot?

Gemini 3.1 Pro Preview currently leads the published Vals MMLU-Pro mirror snapshot with a Vals MMLU-Pro score of 90.99%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on Vals MMLU-Pro mirror?

104 AI models are included in BenchLM's mirrored Vals MMLU-Pro mirror snapshot, based on the public leaderboard captured on May 16, 2026.

Last updated: May 16, 2026 · mirrored from the public benchmark leaderboard
