According to BenchLM.ai, GPT-5.5 leads the OfficeQA Pro benchmark with a score of 54.1%, followed by GPT-5.4 (53.2%) and Claude Opus 4.7 (Adaptive) (43.6%). The scores show moderate spread, with meaningful differences between the top tier and mid-tier models.

3 models have been evaluated on OfficeQA Pro. The benchmark falls in the Multimodal & Grounded category. This category carries a 12% weight in BenchLM.ai's overall scoring system. Within that category, OfficeQA Pro contributes 30% of the category score, so strong performance here directly affects a model's overall ranking.

About OfficeQA Pro

Year

2026

Tasks

Document and spreadsheet tasks

Format

Grounded QA over office artifacts

Difficulty

Enterprise grounded reasoning

OfficeQA Pro is useful when choosing models for enterprise copilots because it measures whether they can reason correctly over real office content rather than generic chat prompts.

OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning

BenchLM freshness & provenance

Version

OfficeQA Pro 2026

Refresh cadence

Quarterly

Staleness state

Current

Question availability

Public benchmark set

Current

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.