Vals MedCode (MedCode)

Vals AI healthcare benchmark for whether models can support the medical billing process.

How BenchLM shows MedCode

BenchLM mirrors the public Vals AI MedCode leaderboard captured from https://www.vals.ai/benchmarks/medcode and updated by Vals on May 16, 2026. The snapshot preserves overall scores, uncertainty, latency, cost-per-test metadata, and task-level scores where Vals publishes them.

MedCode is display only on BenchLM. Vals proprietary or Vals-hosted aggregate views are useful context, but BenchLM does not use them as weighted ranking inputs or as a replacement for benchmark-native source records.

57 Vals rows · 1 task view · private dataset · Tasks: Overall · Display only

MedCode scores as of May 16, 2026

BenchLM mirrors the published MedCode score view. Gemini 3.1 Pro Preview leads the public snapshot at 59.06%, followed by Gemini 3 Flash Preview (55.92%) and Gemini 3.5 Flash (55.83%). BenchLM does not use these results to rank models overall.

57 models · External benchmark mirrors · Current · Display only · Updated May 16, 2026

The top of the published MedCode snapshot is fairly close: Gemini 3.1 Pro Preview sits at 59.06%, and the third row (Gemini 3.5 Flash, 55.83%) is 3.23 points behind. The top-10 spread is 9.43 points, so many of the published scores fall within a relatively narrow band.
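The gap figures above can be checked against the score table with a few lines of arithmetic. This is a minimal sketch that uses only the displayed, rounded percentages for the top ten rows, so a result may differ by a hundredth of a point from a figure computed on unrounded values. The names `top10` and `gap` are illustrative and not part of any BenchLM or Vals tooling.

```python
from decimal import Decimal

# Displayed scores (percent) for ranks 1-10 in the mirrored MedCode snapshot.
top10 = [Decimal(s) for s in (
    "59.06", "55.92", "55.83", "54.86", "52.73",
    "52.20", "51.31", "50.59", "49.75", "49.63",
)]

def gap(scores, rank_a, rank_b):
    """Return the difference in percentage points between two 1-indexed ranks."""
    return scores[rank_a - 1] - scores[rank_b - 1]

lead_over_third = gap(top10, 1, 3)   # leader minus third row
top10_spread = gap(top10, 1, 10)     # leader minus tenth row

print(f"Lead over third row: {lead_over_third} points")
print(f"Top-10 spread: {top10_spread} points")
```

Using `Decimal` with string inputs keeps the subtraction exact, which avoids binary floating-point noise when comparing two-decimal percentages.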

57 models have been evaluated on MedCode. The benchmark falls in the External benchmark mirrors category, which BenchLM tracks separately from its weighted global scoring system, so these results are best compared on the dedicated views for that category. MedCode is displayed for reference and excluded from the scoring formula, so it does not directly affect overall rankings.

About MedCode

Year: 2026

Tasks: Medical billing support tasks

Format: Accuracy score

Difficulty: Professional healthcare administration

BenchLM mirrors the public Vals MedCode leaderboard as display-only healthcare evidence.

BenchLM freshness & provenance

Version: MedCode 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

MedCode score table (57 models)

Rank · Model · Model ID · Score
1 · Gemini 3.1 Pro Preview · google/gemini-3.1-pro-preview · 59.06%
2 · Gemini 3 Flash Preview · google/gemini-3-flash-preview · 55.92%
3 · Gemini 3.5 Flash · google/gemini-3.5-flash · 55.83%
4 · Claude Opus 4.7 · anthropic/claude-opus-4-7 · 54.86%
5 · GPT-5.1 · openai/gpt-5.1-2025-11-13 · 52.73%
6 · Gemini 3 Pro Preview · google/gemini-3-pro-preview · 52.20%
7 · Muse Spark · meta/muse_spark · 51.31%
8 · Gemini 2.5 Pro · google/gemini-2.5-pro · 50.59%
9 · GPT-5.2 · openai/gpt-5.2-2025-12-11 · 49.75%
10 · GPT-5 · openai/gpt-5-2025-08-07 · 49.63%
11 · Claude Opus 4.5 20251101 Thinking · anthropic/claude-opus-4-5-20251101-thinking · 49.16%
12 · Claude Opus 4.6 Thinking · anthropic/claude-opus-4-6-thinking · 49.13%
13 · GPT-5.5 · openai/gpt-5.5 · 49.10%
14 · Claude Opus 4.6 · anthropic/claude-opus-4-6 · 48.24%
15 · Gemini 3.1 Flash Lite Preview · google/gemini-3.1-flash-lite-preview · 47.60%
16 · O3 · openai/o3-2025-04-16 · 47.29%
17 · Claude Opus 4.1 20250805 Thinking · anthropic/claude-opus-4-1-20250805-thinking · 47.23%
18 · Claude Opus 4.5 · anthropic/claude-opus-4-5-20251101 · 45.17%
19 · Claude Sonnet 4.5 20250929 Thinking · anthropic/claude-sonnet-4-5-20250929-thinking · 44.13%
20 · GPT-5 Mini · openai/gpt-5-mini-2025-08-07 · 43.05%
21 · GLM 5.1 Thinking · zai/glm-5.1-thinking · 41.60%
22 · Claude Opus 4.1 · anthropic/claude-opus-4-1-20250805 · 41.37%
23 · GPT-5.4 · openai/gpt-5.4-2026-03-05 · 41.29%
24 · GPT-5.4 Nano · openai/gpt-5.4-nano-2026-03-17 · 41.03%
25 · Claude Sonnet 4.5 · anthropic/claude-sonnet-4-5-20250929 · 40.57%
26 · Gemini 2.5 Flash Preview 09 2025 · google/gemini-2.5-flash-preview-09-2025 · 40.54%
27 · DeepSeek V4 Pro · deepseek/deepseek-v4-pro · 40.45%
28 · Gemini 2.5 Flash Thinking · google/gemini-2.5-flash-thinking · 40.36%
29 · Gemini 2.5 Flash Preview 09 2025 Thinking · google/gemini-2.5-flash-preview-09-2025-thinking · 40.33%
30 · Kimi K2.6 Thinking · kimi/kimi-k2.6-thinking · 40.14%
31 · Kimi K2.5 Thinking · kimi/kimi-k2.5-thinking · 39.32%
32 · Gemini 2.5 Flash · google/gemini-2.5-flash · 38.42%
33 · Grok 4 0709 · grok/grok-4-0709 · 38.08%
34 · Grok 4.3 · grok/grok-4.3 · 38.07%
35 · Grok 4 Fast Reasoning · grok/grok-4-fast-reasoning · 37.38%
36 · Qwen3.6 Plus · alibaba/qwen3.6-plus · 36.89%
37 · Llama4 Maverick Instruct Basic · fireworks/llama4-maverick-instruct-basic · 36.51%
38 · Claude Sonnet 4 20250514 Thinking · anthropic/claude-sonnet-4-20250514-thinking · 34.96%
39 · MiniMax M2.7 · minimax/MiniMax-M2.7 · 34.44%
40 · Gemini 2.5 Flash Lite Preview 09 2025 Thinking · google/gemini-2.5-flash-lite-preview-09-2025-thinking · 34.19%
41 · MiniMax M2.1 · minimax/MiniMax-M2.1 · 34.08%
42 · Claude Sonnet 4 · anthropic/claude-sonnet-4-20250514 · 33.94%
43 · O4 Mini · openai/o4-mini-2025-04-16 · 33.79%
44 · Mistral Medium 3.5 · mistralai/mistral-medium-3.5 · 33.75%
45 · Qwen3.5 Flash · alibaba/qwen3.5-flash · 33.00%
46 · GLM 4.7 · zai/glm-4.7 · 32.77%
47 · Claude Haiku 4.5 20251001 Thinking · anthropic/claude-haiku-4-5-20251001-thinking · 32.68%
48 · Grok 4.20 0309 Reasoning · grok/grok-4.20-0309-reasoning · 32.16%
49 · Qwen3 VL Plus · alibaba/qwen3-vl-plus-2025-09-23 · 31.65%
50 · Qwen3 Max · alibaba/qwen3-max-2026-01-23 · 31.37%
51 · GPT-5 Nano · openai/gpt-5-nano-2025-08-07 · 30.44%
52 · Grok 4 Fast Non Reasoning · grok/grok-4-fast-non-reasoning · 30.04%
53 · Grok 4.1 Fast Non Reasoning · grok/grok-4-1-fast-non-reasoning · 28.35%
54 · Grok 4.1 Fast Reasoning · grok/grok-4-1-fast-reasoning · 28.08%
55 · Gemini 2.5 Flash Lite · google/gemini-2.5-flash-lite · 27.11%
56 · Gemini 2.5 Flash Lite Preview 09 2025 · google/gemini-2.5-flash-lite-preview-09-2025 · 27.08%
57 · Meta Llama Llama 4 Scout 17B 16E Instruct · together/meta-llama/Llama-4-Scout-17B-16E-Instruct · 23.31%

FAQ

What does MedCode measure?

MedCode is a Vals AI healthcare benchmark that tests whether models can support the medical billing process.

Which model leads the published MedCode snapshot?

Gemini 3.1 Pro Preview currently leads the published MedCode snapshot with a MedCode score of 59.06%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on MedCode?

57 AI models are included in BenchLM's mirrored MedCode snapshot, based on the public leaderboard captured on May 16, 2026.

Last updated: May 16, 2026 · mirrored from the public benchmark leaderboard
