OfficeQA Pro

A benchmark for grounded reasoning over office-style documents, spreadsheets, charts, and business artifacts.

According to BenchLM.ai, GPT-5.4 Pro is listed first on the OfficeQA Pro benchmark with a score of 96, tied with GPT-5.2 Pro and GPT-5.4 at the same score. With the top three models separated by 0 points, the benchmark appears to be nearing saturation for frontier models.
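The saturation claim comes down to simple arithmetic: the gap between the best score and the k-th best score. A minimal sketch using the six highest scores from the leaderboard on this page; the function name `top_spread` is hypothetical, and reading a small spread as a sign of saturation is an illustrative assumption, not BenchLM.ai's documented method.

```python
def top_spread(scores, k=3):
    """Return the gap between the best score and the k-th best score."""
    top = sorted(scores, reverse=True)[:k]
    return top[0] - top[-1]

# Scores of the six highest-ranked models as listed on this page.
scores = [96, 96, 96, 95, 95, 95]

print(top_spread(scores, k=3))  # 0: the top three are tied
print(top_spread(scores, k=6))  # 1: the top six span a single point
```

A spread of 0 or 1 point among the leading models leaves little room for the benchmark to separate them.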

BenchLM.ai has evaluated 121 models on OfficeQA Pro. The benchmark falls in the multimodalGrounded category, which carries a 15% weight in BenchLM.ai's overall scoring system, so performance here feeds directly into a model's overall ranking.
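To make the 15% weight concrete, here is a rough sketch of how much a category score could contribute to an overall score. It assumes the overall score is a weighted sum of category scores on a 0-100 scale and uses the benchmark score as a stand-in for the whole category score; the page states neither, so both are assumptions, and `overall_contribution` is a hypothetical helper.

```python
CATEGORY_WEIGHT = 0.15  # multimodalGrounded weight stated on this page

def overall_contribution(category_score, weight=CATEGORY_WEIGHT):
    """Points a category adds to the overall score, assuming a weighted sum."""
    return round(category_score * weight, 2)

# Assumption: the category score equals the OfficeQA Pro score.
print(overall_contribution(96))  # 14.4
print(overall_contribution(87))  # 13.05
```

Under these assumptions, a 9-point gap on the benchmark translates into a gap of about 1.35 points in the overall score.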

About OfficeQA Pro

Year: 2026
Tasks: Document and spreadsheet tasks
Format: Grounded QA over office artifacts
Difficulty: Enterprise grounded reasoning

OfficeQA Pro is useful when choosing models for enterprise copilots because it measures whether they can reason correctly over real office content rather than generic chat prompts.

Leaderboard (121 models)

#1 GPT-5.4 Pro: 96
#2 GPT-5.2 Pro: 96
#3 GPT-5.4: 96
#4 GPT-5.2: 95
#5 GPT-5.3 Instant: 95
#6 Gemini 3.1 Pro: 95
#8 GPT-5.3 Codex: 94
#9 Claude Opus 4.6: 94
#10 GPT-5.2 Instant: 92
#11 GPT-5.2-Codex: 92
#13 Gemini 3 Pro: 92
#15 Grok 4.1: 91
#16 GPT-5.1: 89
#17 Claude Sonnet 4.6: 88
#18 GPT-5 (medium): 87
#19 Claude Opus 4.5: 87
#20 Claude Sonnet 4.5: 87
#21 GPT-5 (high): 85
#22 GLM-5 (Reasoning): 84
#23 Gemini 2.5 Pro: 84
#25 GPT-5 mini: 81
#26 o1-preview: 80
#28 o3-pro: 79
#29 Seed 1.6: 79
#30 Claude 4.1 Opus: 79
#31 Seed-2.0-Lite: 79
#32 Gemini 3 Flash: 79
#33 GPT-4.1: 78
#34 Claude 4 Sonnet: 78
#35 Kimi K2.5 (Reasoning): 77
#37 o3-mini: 76
#38 GLM-4.7: 76
#39 Grok 4: 76
#40 Mistral Large 3: 76
#41 o3: 75
#42 Qwen2.5-1M: 75
#43 o1: 74
#45 Claude Haiku 4.5: 74
#46 GPT-4.1 mini: 74
#47 DeepSeekMath V2: 73
#48 MiMo-V2-Flash: 73
#49 GLM-5: 73
#50 Gemini 1.5 Pro: 73
#51 DeepSeek V3.2: 72
#52 Claude 3.5 Sonnet: 72
#54 Aion-2.0: 72
#55 Seed 1.6 Flash: 72
#57 Seed-2.0-Mini: 72
#58 Mercury 2: 71
#59 o4-mini (high): 71
#60 Ministral 3 14B: 71
#61 Step 3.5 Flash: 70
#62 Qwen2.5-72B: 70
#63 DeepSeek LLM 2.0: 70
#64 GPT-4o: 70
#65 DeepSeek Coder 2.0: 69
#66 Kimi K2.5: 69
#67 GLM-4.7-Flash: 68
#68 Qwen3.5 397B: 68
#69 MiniMax M2.5: 68
#71 Mistral Large 2: 67
#73 Claude 3 Opus: 67
#74 Claude 3 Haiku: 67
#75 GPT-4.1 nano: 67
#76 Gemini 2.5 Flash: 66
#79 Gemini 1.0 Pro: 62
#80 GPT-4 Turbo: 58
#81 GPT-OSS 120B: 57
#82 Moonshot v1: 57
#83 Mistral 8x7B: 56
#84 Z-1: 56
#86 GPT-5 nano: 55
#87 Llama 3 70B: 55
#88 Llama 4 Scout: 55
#90 Nemotron-4 15B: 54
#93 GPT-4o mini: 53
#94 DeepSeek-R1: 53
#95 o1-pro: 49
#100 GLM-4.5: 47
#101 Nova Pro: 46
#102 Qwen3 235B 2507: 46
#103 Gemma 3 27B: 45
#104 LFM2-24B-A2B: 45
#105 Qwen2.5-VL-32B: 45
#106 DeepSeek V3.1: 45
#107 MiniMax M1 80k: 45
#108 Kimi K2: 45
#109 GLM-4.5-Air: 44
#110 GPT-OSS 20B: 42
#112 Mistral 8x7B v0.2: 40
#113 LFM2.5-1.2B-Thinking: 39
#114 Ministral 3 8B: 39
#115 Mistral 7B v0.3: 39
#116 LFM2.5-1.2B-Instruct: 39
#117 Phi-4: 38
#119 Ministral 3 3B: 37
#121 DBRX Instruct: 35

FAQ

What does OfficeQA Pro measure?

OfficeQA Pro measures grounded reasoning over office-style documents, spreadsheets, charts, and business artifacts.

Which model scores highest on OfficeQA Pro?

GPT-5.4 Pro by OpenAI is currently listed first on OfficeQA Pro with a score of 96, a score it shares with GPT-5.2 Pro and GPT-5.4.

How many models are evaluated on OfficeQA Pro?

BenchLM.ai has evaluated 121 AI models on OfficeQA Pro.

Last updated: March 12, 2026
