A benchmark for grounded reasoning over office-style documents, spreadsheets, charts, and business artifacts.
As of April 29, 2026, GPT-5.5 leads the OfficeQA Pro leaderboard with 54.1%, followed by GPT-5.4 (53.2%) and Claude Opus 4.7 (Adaptive) (43.6%).
1. GPT-5.5 - OpenAI - 54.1%
2. GPT-5.4 - OpenAI - 53.2%
3. Claude Opus 4.7 (Adaptive) - Anthropic - 43.6%
According to BenchLM.ai, GPT-5.5 leads the OfficeQA Pro benchmark with a score of 54.1%, followed by GPT-5.4 (53.2%) and Claude Opus 4.7 (Adaptive) (43.6%). The top two models are separated by less than a point, while Claude Opus 4.7 (Adaptive) trails the leader by 10.5 points, a clear gap between the top tier and the mid tier.
Three models have been evaluated on OfficeQA Pro. The benchmark falls in the Multimodal & Grounded category, which carries a 12% weight in BenchLM.ai's overall scoring system. Within that category, OfficeQA Pro contributes 30% of the category score, so it accounts for about 3.6% (12% of 30%) of a model's overall ranking.
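The weighting described above can be sketched in a few lines, assuming the category weight and the benchmark's within-category share simply multiply (an assumption; BenchLM.ai's actual aggregation may differ):

```python
# Hypothetical sketch of a benchmark's effective weight on an overall score,
# assuming category weight and within-category share combine multiplicatively.

CATEGORY_WEIGHT = 0.12   # Multimodal & Grounded category weight (12%)
BENCHMARK_SHARE = 0.30   # OfficeQA Pro's share within that category (30%)

def effective_weight(category_weight: float, benchmark_share: float) -> float:
    """Fraction of the overall score attributable to a single benchmark."""
    return category_weight * benchmark_share

print(f"{effective_weight(CATEGORY_WEIGHT, BENCHMARK_SHARE):.1%}")  # 3.6%
```

Under this assumption, a one-point gain on OfficeQA Pro moves a model's overall score by only 0.036 points, which is why the benchmark matters most as a tiebreaker within its category.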
Year: 2026
Tasks: Document and spreadsheet tasks
Format: Grounded QA over office artifacts
Difficulty: Enterprise grounded reasoning
OfficeQA Pro is useful when choosing models for enterprise copilots because it measures whether they can reason correctly over real office content rather than generic chat prompts.
Version: OfficeQA Pro 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
GPT-5.5 by OpenAI currently leads with a score of 54.1% on OfficeQA Pro.
Three AI models have been evaluated on OfficeQA Pro on BenchLM.