TIR-Bench

A visual agent benchmark for interface reasoning and task execution over screenshots or software surfaces.

About TIR-Bench

Year: 2026
Tasks: Visual agent and interface reasoning
Format: Screenshot-grounded task reasoning
Difficulty: Computer-use visual reasoning

TIR-Bench appears in Qwen's launch tables as a visual-agent benchmark with separate submetrics. BenchLM tracks it as a display-only row while preserving the exact values published by providers.

BenchLM freshness & provenance

Version: TIR-Bench 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (0 models)

FAQ

What does TIR-Bench measure?

TIR-Bench measures how well visual agents reason about interfaces and execute tasks over screenshots or software surfaces.

Which model scores highest on TIR-Bench?

No models have been evaluated on TIR-Bench yet.

How many models are evaluated on TIR-Bench?

No AI models have been evaluated on TIR-Bench on BenchLM yet.

Last updated: April 10, 2026 · BenchLM version TIR-Bench 2026
