Vals MedScribe (MedScribe)

Vals AI healthcare benchmark for whether models can support doctors with administrative work.

How BenchLM shows MedScribe

BenchLM mirrors the public Vals AI MedScribe leaderboard captured from https://www.vals.ai/benchmarks/medscribe and updated by Vals on May 16, 2026. The snapshot preserves overall scores, uncertainty, latency, cost-per-test metadata, and task-level scores where Vals publishes them.

MedScribe is display only on BenchLM. Vals proprietary or Vals-hosted aggregate views are useful context, but BenchLM does not use them as weighted ranking inputs or as a replacement for benchmark-native source records.

57 Vals rows · 1 task view · private dataset · Tasks: Overall · Display only

MedScribe score on MedScribe — May 16, 2026

BenchLM mirrors the published MedScribe score view. GPT-5.1 leads the public snapshot at 88.09%, followed by GPT-5.5 (86.87%) and Claude Opus 4.6 (86.74%). BenchLM does not use these results to rank models overall.

57 models · External benchmark mirrors · Current · Display only · Updated May 16, 2026

The published MedScribe snapshot is tightly clustered at the top: GPT-5.1 sits at 88.09%, and third-place Claude Opus 4.6 (86.74%) is only 1.35 points behind. The top-10 spread, from GPT-5.1 down to tenth-place Claude Sonnet 4.5 20250929 Thinking at 84.10%, is 3.99 points, so the leading scores sit in a narrow band.
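The two gaps quoted above can be checked directly from the score table on this page. The following is a minimal sketch, not BenchLM's or Vals's own tooling: the `top10` list and the `gap` helper are illustrative names, and the scores are copied by hand from the first ten rows of the table.

```python
# Top-10 MedScribe scores (percent), copied from the score table on this page.
# The list and helper names are illustrative, not part of any BenchLM or Vals API.
top10 = [
    ("GPT-5.1", 88.09),
    ("GPT-5.5", 86.87),
    ("Claude Opus 4.6", 86.74),
    ("Claude Opus 4.6 Thinking", 86.13),
    ("Muse Spark", 85.90),
    ("Claude Opus 4.5 20251101 Thinking", 85.32),
    ("Claude Haiku 4.5 20251001 Thinking", 85.23),
    ("Claude Sonnet 4.5", 84.52),
    ("GPT-5.2", 84.39),
    ("Claude Sonnet 4.5 20250929 Thinking", 84.10),
]


def gap(rows, i, j):
    """Return the score gap in percentage points between ranks i and j (1-indexed)."""
    return round(rows[i - 1][1] - rows[j - 1][1], 2)


lead_over_third = gap(top10, 1, 3)   # rank 1 minus rank 3
top10_spread = gap(top10, 1, 10)     # rank 1 minus rank 10

print(f"Lead over third place: {lead_over_third:.2f} points")
print(f"Top-10 spread: {top10_spread:.2f} points")
```

Rounding to two decimals matches the precision of the published scores and avoids floating-point noise in the differences.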

BenchLM's mirrored snapshot covers 57 models evaluated on MedScribe. The benchmark falls in the External benchmark mirrors category, which BenchLM tracks separately from its weighted global scoring system, so these results are best compared on the dedicated views for that category. MedScribe is displayed for reference and excluded from the scoring formula, so it does not directly affect overall rankings.

About MedScribe

Year: 2026

Tasks: Medical administrative support tasks

Format: Accuracy score

Difficulty: Professional healthcare administration

BenchLM mirrors the public Vals MedScribe leaderboard as display-only healthcare evidence.

BenchLM freshness & provenance

Version: MedScribe 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

MedScribe score table (57 models)

1. GPT-5.1 (openai/gpt-5.1-2025-11-13): 88.09%
2. GPT-5.5 (openai/gpt-5.5): 86.87%
3. Claude Opus 4.6 (anthropic/claude-opus-4-6): 86.74%
4. Claude Opus 4.6 Thinking (anthropic/claude-opus-4-6-thinking): 86.13%
5. Muse Spark (meta/muse_spark): 85.90%
6. Claude Opus 4.5 20251101 Thinking (anthropic/claude-opus-4-5-20251101-thinking): 85.32%
7. Claude Haiku 4.5 20251001 Thinking (anthropic/claude-haiku-4-5-20251001-thinking): 85.23%
8. Claude Sonnet 4.5 (anthropic/claude-sonnet-4-5-20250929): 84.52%
9. GPT-5.2 (openai/gpt-5.2-2025-12-11): 84.39%
10. Claude Sonnet 4.5 20250929 Thinking (anthropic/claude-sonnet-4-5-20250929-thinking): 84.10%
11. GPT-5 (openai/gpt-5-2025-08-07): 83.65%
12. Claude Opus 4.5 (anthropic/claude-opus-4-5-20251101): 83.25%
13. Gemini 2.5 Flash Thinking (google/gemini-2.5-flash-thinking): 82.98%
14. Claude Opus 4.7 (anthropic/claude-opus-4-7): 82.95%
15. Gemini 2.5 Flash (google/gemini-2.5-flash): 82.87%
16. Grok 4 Fast Reasoning (grok/grok-4-fast-reasoning): 81.63%
17. MiniMax M2.1 (minimax/MiniMax-M2.1): 80.78%
18. GPT-5 Mini (openai/gpt-5-mini-2025-08-07): 80.58%
19. MiniMax M2.7 (minimax/MiniMax-M2.7): 79.87%
20. Grok 4 Fast Non Reasoning (grok/grok-4-fast-non-reasoning): 79.72%
21. Grok 4.1 Fast Reasoning (grok/grok-4-1-fast-reasoning): 78.73%
22. Gemini 2.5 Flash Preview 09 2025 Thinking (google/gemini-2.5-flash-preview-09-2025-thinking): 78.50%
23. Grok 4 0709 (grok/grok-4-0709): 78.15%
24. Kimi K2.6 Thinking (kimi/kimi-k2.6-thinking): 78.15%
25. Gemini 2.5 Flash Preview 09 2025 (google/gemini-2.5-flash-preview-09-2025): 77.95%
26. GPT-5.4 (openai/gpt-5.4-2026-03-05): 77.55%
27. Grok 4.1 Fast Non Reasoning (grok/grok-4-1-fast-non-reasoning): 77.46%
28. Qwen3 VL Plus (alibaba/qwen3-vl-plus-2025-09-23): 77.13%
29. GPT-5.4 Nano (openai/gpt-5.4-nano-2026-03-17): 77.09%
30. Qwen3.6 Plus (alibaba/qwen3.6-plus): 76.96%
31. O3 (openai/o3-2025-04-16): 76.65%
32. Gemini 3.5 Flash (google/gemini-3.5-flash): 76.57%
33. Kimi K2.5 Thinking (kimi/kimi-k2.5-thinking): 76.44%
34. Gemini 3.1 Pro Preview (google/gemini-3.1-pro-preview): 76.11%
35. Gemini 2.5 Flash Lite Preview 09 2025 (google/gemini-2.5-flash-lite-preview-09-2025): 75.82%
36. DeepSeek V4 Pro (deepseek/deepseek-v4-pro): 75.14%
37. Grok 4.3 (grok/grok-4.3): 74.40%
38. Claude Opus 4.1 20250805 Thinking (anthropic/claude-opus-4-1-20250805-thinking): 73.90%
39. Gemini 2.5 Pro (google/gemini-2.5-pro): 73.55%
40. GPT-5 Nano (openai/gpt-5-nano-2025-08-07): 72.86%
41. Gemini 2.5 Flash Lite (google/gemini-2.5-flash-lite): 72.83%
42. Qwen3 Max (alibaba/qwen3-max-2026-01-23): 72.71%
43. Claude Sonnet 4 (anthropic/claude-sonnet-4-20250514): 72.41%
44. GLM 5.1 Thinking (zai/glm-5.1-thinking): 72.27%
45. Gemini 3 Pro Preview (google/gemini-3-pro-preview): 72.04%
46. Claude Opus 4.1 (anthropic/claude-opus-4-1-20250805): 71.75%
47. Qwen3.5 Flash (alibaba/qwen3.5-flash): 70.62%
48. Gemini 3 Flash Preview (google/gemini-3-flash-preview): 69.92%
49. Claude Sonnet 4 20250514 Thinking (anthropic/claude-sonnet-4-20250514-thinking): 69.35%
50. O4 Mini (openai/o4-mini-2025-04-16): 69.14%
51. GLM 4.7 (zai/glm-4.7): 68.63%
52. Mistral Medium 3.5 (mistralai/mistral-medium-3.5): 67.73%
53. Gemini 2.5 Flash Lite Preview 09 2025 Thinking (google/gemini-2.5-flash-lite-preview-09-2025-thinking): 66.88%
54. Gemini 3.1 Flash Lite Preview (google/gemini-3.1-flash-lite-preview): 63.90%
55. Grok 4.20 0309 Reasoning (grok/grok-4.20-0309-reasoning): 63.41%
56. Llama4 Maverick Instruct Basic (fireworks/llama4-maverick-instruct-basic): 54.22%
57. Meta Llama Llama 4 Scout 17B 16E Instruct (together/meta-llama/Llama-4-Scout-17B-16E-Instruct): 50.59%

FAQ

What does MedScribe measure?

MedScribe is a Vals AI healthcare benchmark that measures whether models can support doctors with administrative work.

Which model leads the published MedScribe snapshot?

GPT-5.1 currently leads the published MedScribe snapshot with a MedScribe score of 88.09%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on MedScribe?

57 AI models are included in BenchLM's mirrored MedScribe snapshot, based on the public leaderboard captured on May 16, 2026.

Last updated: May 16, 2026 · mirrored from the public benchmark leaderboard
