OfficeQA Pro

A benchmark for grounded reasoning over office-style documents, spreadsheets, charts, and business artifacts.

How BenchLM shows OfficeQA Pro right now

BenchLM is tracking OfficeQA Pro in the local dataset, but exact-source verification records for these rows are still being attached. To avoid a blank benchmark page, BenchLM shows the current tracked rows below as a display-only reference table.

These tracked rows are useful for inspection and spot-checking, but until exact-source attachments are completed they should not be treated as fully verified public benchmark rows.
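As a rough illustration of how such gating might work, here is a minimal Python sketch; the `TrackedRow` fields and `RowStatus` values are hypothetical placeholders, not BenchLM's actual schema:

```python
from dataclasses import dataclass
from enum import Enum


class RowStatus(Enum):
    VERIFIED = "verified"          # exact-source record attached
    DISPLAY_ONLY = "display_only"  # tracked locally, shown for reference only


@dataclass
class TrackedRow:
    model_slug: str
    score: float
    has_exact_source: bool  # True once a verification record is attached


def row_status(row: TrackedRow) -> RowStatus:
    # Rows without exact-source attachments are shown but not treated as verified.
    return RowStatus.VERIFIED if row.has_exact_source else RowStatus.DISPLAY_ONLY
```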

118 tracked models · Local tracked rows · Awaiting exact-source attachments · Display only

Tracked score on OfficeQA Pro — April 10, 2026

BenchLM mirrors the published tracked score view for OfficeQA Pro. GPT-5.4 leads the public snapshot at 96%, followed by GPT-5.2 Pro (96%) and Gemini 3.1 Pro (95%). BenchLM does not use these results to rank models overall.

118 models · Multimodal & Grounded · 45% of category score · Current · Updated April 10, 2026

The published OfficeQA Pro snapshot is tightly clustered at the top: GPT-5.4 sits at 96%, and the third row is only 1.0 point behind. The broader top-10 spread is 4.0 points, so the top of the published table sits in a narrow band.
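For concreteness, both clustering figures can be recomputed directly from the tracked score table further down the page; a minimal Python check:

```python
# Top-10 tracked scores from the OfficeQA Pro table below (percent).
top10 = [96, 96, 95, 95, 95, 95, 94, 94, 92, 92]

gap_to_third = top10[0] - top10[2]   # 96 - 95 = 1.0 point
top10_spread = top10[0] - top10[-1]  # 96 - 92 = 4.0 points

print(f"Gap from #1 to #3: {gap_to_third:.1f} points")
print(f"Top-10 spread: {top10_spread:.1f} points")
```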

118 models have been evaluated on OfficeQA Pro. The benchmark falls in the Multimodal & Grounded category, which carries a 12% weight in BenchLM's overall scoring system. Within that category, OfficeQA Pro contributes 45% of the category score, so once exact-source verification is complete, strong performance here will feed directly into a model's overall ranking; while the benchmark remains display-only, these weights are nominal.
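Those two numbers imply an effective overall weight of roughly 5.4% for OfficeQA Pro, assuming the weights compose multiplicatively (BenchLM's published material is not explicit on this):

```python
category_weight = 0.12  # Multimodal & Grounded share of the overall score
benchmark_share = 0.45  # OfficeQA Pro share of that category

# Assumption: the effective weight is the product of the two shares.
effective_weight = category_weight * benchmark_share
print(f"Effective overall weight: {effective_weight:.1%}")  # 5.4%
```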

About OfficeQA Pro

Year: 2026
Tasks: Document and spreadsheet tasks
Format: Grounded QA over office artifacts
Difficulty: Enterprise grounded reasoning

OfficeQA Pro is useful when choosing models for enterprise copilots because it measures whether they can reason correctly over real office content rather than generic chat prompts.

BenchLM freshness & provenance

Version: OfficeQA Pro 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
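As a sketch of that decision, with illustrative state names and rules (the "Aging" state and the ordering of checks are assumptions; the real policy lives on the methodology page):

```python
def benchmark_treatment(staleness_state: str, has_exact_sources: bool) -> str:
    """Map freshness/provenance metadata to one of the three treatments above.

    Illustrative only: the 'Aging' state and these exact rules are
    assumptions, not BenchLM's published policy.
    """
    if not has_exact_sources:
        return "display-only reference"
    if staleness_state == "Current":
        return "strong differentiator"
    if staleness_state == "Aging":
        return "benchmark to watch"
    return "display-only reference"


# OfficeQA Pro today: staleness is Current, but exact-source attachments are pending.
print(benchmark_treatment("Current", has_exact_sources=False))  # display-only reference
```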

Tracked score table (118 models)

Rank | Model | Slug | Score
1 | GPT-5.4 | gpt-5-4 | 96%
2 | GPT-5.2 Pro | gpt-5-2-pro | 96%
3 | Gemini 3.1 Pro | gemini-3-1-pro | 95%
4 | Gemini 3 Pro Deep Think | gemini-3-pro-deep-think | 95%
5 | GPT-5.2 | gpt-5-2 | 95%
6 | GPT-5.3 Instant | gpt-5-3-instant | 95%
7 | GPT-5.3 Codex | gpt-5-3-codex | 94%
8 | Claude Opus 4.6 | claude-opus-4-6 | 94%
9 | GPT-5.1-Codex-Max | gpt-5-1-codex-max | 92%
10 | GPT-5.2-Codex | gpt-5-2-codex | 92%
11 | Gemini 3 Pro | gemini-3-pro | 92%
12 | GPT-5.2 Instant | gpt-5-2-instant | 92%
13 | Grok 4.1 | grok-4-1 | 91%
14 | GPT-5.3-Codex-Spark | gpt-5-3-codex-spark | 91%
15 | GPT-5.1 | gpt-5-1 | 89%
16 | Claude Sonnet 4.6 | claude-sonnet-4-6 | 88%
17 | GPT-5 (medium) | gpt-5-medium | 87%
18 | Claude Opus 4.5 | claude-opus-4-5 | 87%
19 | Claude Sonnet 4.5 | claude-sonnet-4-5 | 87%
20 | GPT-5 (high) | gpt-5-high | 85%
21 | GLM-5 (Reasoning) | glm-5-reasoning | 84%
22 | Gemini 2.5 Pro | gemini-2-5-pro | 84%
23 | Grok 4.1 Fast | grok-4-1-fast | 83%
24 | GPT-5 mini | gpt-5-mini | 81%
25 | | | 80%
26 | Qwen3.5 397B (Reasoning) | qwen3-5-397b-reasoning | 79%
27 | | | 79%
28 | Gemini 3 Flash | gemini-3-flash | 79%
29 | Claude 4.1 Opus | claude-4-1-opus | 79%
30 | Seed 1.6 | seed-1-6 | 79%
31 | Seed-2.0-Lite | seed-2-0-lite | 79%
32 | GPT-4.1 | gpt-4-1 | 78%
33 | Claude 4 Sonnet | claude-4-sonnet | 78%
34 | Kimi K2.5 (Reasoning) | kimi-k2-5-reasoning | 77%
35 | DeepSeek V3.2 (Thinking) | deepseek-v3-2-thinking | 77%
36 | | | 76%
37 | GLM-4.7 | glm-4-7 | 76%
38 | Grok 4 | grok-4 | 76%
39 | Mistral Large 3 | mistral-large-3 | 76%
40 | | | 75%
41 | Qwen2.5-1M | qwen2-5-1m | 75%
42 | | | 74%
43 | Nemotron 3 Ultra 500B | nemotron-3-ultra-500b | 74%
44 | Claude Haiku 4.5 | claude-haiku-4-5 | 74%
45 | GPT-4.1 mini | gpt-4-1-mini | 74%
46 | MiMo-V2-Flash | mimo-v2-flash | 73%
47 | GLM-5 | glm-5 | 73%
48 | DeepSeekMath V2 | deepseekmath-v2 | 73%
49 | Gemini 1.5 Pro | gemini-1-5-pro | 73%
50 | DeepSeek V3.2 | deepseek-v3-2 | 72%
51 | Claude 3.5 Sonnet | claude-3-5-sonnet | 72%
52 | Gemini 3.1 Flash-Lite | gemini-3-1-flash-lite | 72%
53 | Ministral 3 14B (Reasoning) | ministral-3-14b-reasoning | 72%
54 | Aion-2.0 | aion-2-0 | 72%
55 | Seed 1.6 Flash | seed-1-6-flash | 72%
56 | Seed-2.0-Mini | seed-2-0-mini | 72%
57 | o4-mini (high) | o4-mini-high | 71%
58 | Mercury 2 | mercury-2 | 71%
59 | Ministral 3 14B | ministral-3-14b | 71%
60 | Qwen2.5-72B | qwen2-5-72b | 70%
61 | DeepSeek LLM 2.0 | deepseek-llm-2-0 | 70%
62 | GPT-4o | gpt-4o | 70%
63 | Step 3.5 Flash | step-3-5-flash | 70%
64 | Kimi K2.5 | kimi-k2-5 | 69%
65 | DeepSeek Coder 2.0 | deepseek-coder-2-0 | 69%
66 | Claude 4.1 Opus Thinking | claude-4-1-opus-thinking | 69%
67 | Qwen3.5 397B | qwen3-5-397b | 68%
68 | GLM-4.7-Flash | glm-4-7-flash | 68%
69 | MiniMax M2.5 | minimax-m2-5 | 68%
70 | Nemotron 3 Super 100B | nemotron-3-super-100b | 67%
71 | Mistral Large 2 | mistral-large-2 | 67%
72 | GPT-4.1 nano | gpt-4-1-nano | 67%
73 | Claude 3 Opus | claude-3-opus | 67%
74 | Claude 3 Haiku | claude-3-haiku | 67%
75 | Nemotron 3 Super 120B A12B | nemotron-3-super-120b-a12b | 67%
76 | Gemini 2.5 Flash | gemini-2-5-flash | 66%
77 | Llama 3.1 405B | llama-3-1-405b | 65%
78 | Grok Code Fast 1 | grok-code-fast-1 | 63%
79 | Gemini 1.0 Pro | gemini-1-0-pro | 62%
80 | GPT-4 Turbo | gpt-4-turbo | 58%
81 | GPT-OSS 120B | gpt-oss-120b | 57%
82 | Moonshot v1 | moonshot-v1 | 57%
83 | Mistral 8x7B | mistral-8x7b | 56%
84 | Z-1 | z-1 | 56%
85 | Llama 3 70B | llama-3-70b | 55%
86 | Llama 4 Scout | llama-4-scout | 55%
87 | GPT-5 nano | gpt-5-nano | 55%
88 | Nemotron Ultra 253B | nemotron-ultra-253b | 54%
89 | Nemotron-4 15B | nemotron-4-15b | 54%
90 | Nemotron 3 Nano 30B | nemotron-3-nano-30b | 54%
91 | Llama 4 Maverick | llama-4-maverick | 54%
92 | GPT-4o mini | gpt-4o-mini | 53%
93 | DeepSeek-R1 | deepseek-r1 | 53%
94 | | | 49%
95 | Llama 4 Behemoth | llama-4-behemoth | 49%
96 | Qwen3 235B 2507 (Reasoning) | qwen3-235b-2507-reasoning | 47%
97 | Grok 3 [Beta] | grok-3-beta | 47%
98 | DeepSeek V3.1 (Reasoning) | deepseek-v3-1-reasoning | 47%
99 | GLM-4.5 | glm-4-5 | 47%
100 | Qwen3 235B 2507 | qwen3-235b-2507 | 46%
101 | Nova Pro | nova-pro | 46%
102 | Gemma 3 27B | gemma-3-27b | 45%
103 | DeepSeek V3.1 | deepseek-v3-1 | 45%
104 | MiniMax M1 80k | minimax-m1-80k | 45%
105 | LFM2-24B-A2B | lfm2-24b-a2b | 45%
106 | GLM-4.5-Air | glm-4-5-air | 44%
107 | GPT-OSS 20B | gpt-oss-20b | 42%
108 | Mistral 8x7B v0.2 | mistral-8x7b-v0-2 | 40%
109 | Ministral 3 8B (Reasoning) | ministral-3-8b-reasoning | 40%
110 | Mistral 7B v0.3 | mistral-7b-v0-3 | 39%
111 | LFM2.5-1.2B-Thinking | lfm2-5-1-2b-thinking | 39%
112 | Ministral 3 8B | ministral-3-8b | 39%
113 | LFM2.5-1.2B-Instruct | lfm2-5-1-2b-instruct | 39%
114 | Phi-4 | phi-4 | 38%
115 | Ministral 3 3B (Reasoning) | ministral-3-3b-reasoning | 37%
116 | Ministral 3 3B | ministral-3-3b | 37%
117 | Mixtral 8x22B Instruct v0.1 | mixtral-8x22b-instruct-v0-1 | 36%
118 | DBRX Instruct | dbrx-instruct | 35%

FAQ

What does OfficeQA Pro measure?

OfficeQA Pro measures grounded reasoning over office-style documents, spreadsheets, charts, and business artifacts.

Which model leads the published OfficeQA Pro snapshot?

GPT-5.4 currently leads the published OfficeQA Pro snapshot with a tracked score of 96%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on OfficeQA Pro?

118 AI models are included in BenchLM's mirrored OfficeQA Pro snapshot, based on the public leaderboard captured on April 10, 2026.

Last updated: April 10, 2026 · mirrored from the public benchmark leaderboard

AI models change fast. We track them for you.

For engineers, researchers, and the plain curious — a weekly brief on new models, ranking shifts, and pricing changes.

Free. No spam. Unsubscribe anytime.