Benchmark profile

CAIS AI Dashboard Text Capabilities Index (CAIS Text Leaderboard)

A Center for AI Safety dashboard view summarizing text capabilities across HLE, ARC-AGI-2, SWE-Bench Pro, and TextQuests.

How BenchLM shows the CAIS Text Leaderboard

BenchLM mirrors the CAIS AI Dashboard text-capability view as a simple (unweighted) average over four components: hle, arc_agi_2, swebench_pro, and textquests. The source dashboard publishes the component benchmark scores and model metadata used here.
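As a rough illustration of that averaging step, the sketch below computes an unweighted mean over the four component keys named above. The function name, the rounding to one decimal place, and the choice to reject rows with a missing component are assumptions made for this example; the page does not say how BenchLM or CAIS handle partially evaluated models. The component values are made up and do not belong to any real model.

```python
from statistics import mean

# Component keys as listed in the text above.
COMPONENTS = ("hle", "arc_agi_2", "swebench_pro", "textquests")


def text_average(scores: dict) -> float:
    """Unweighted mean of the four component scores, in percent."""
    missing = [name for name in COMPONENTS if name not in scores]
    if missing:
        # Assumption: a row needs all four components. The source does not
        # describe how partially evaluated models are treated.
        raise ValueError(f"missing component scores: {missing}")
    # Assumption: round to one decimal place, matching the displayed scores.
    return round(mean(scores[name] for name in COMPONENTS), 1)


# Made-up component scores for illustration only, not a real model's results.
example = {"hle": 40.0, "arc_agi_2": 30.0, "swebench_pro": 50.0, "textquests": 60.0}
print(text_average(example))  # 45.0
```

Because the average is unweighted, each component contributes equally, so a large swing on one benchmark moves the composite by a quarter of that swing.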

The CAIS Text Leaderboard is display only on BenchLM. It is a composite dashboard view rather than a single benchmark-native task set, so BenchLM keeps it out of weighted rankings.

25 mirrored rows · 4 text components · CAIS AI Dashboard · Composite score · Display only

CAIS AI Dashboard · Legacy leaderboard URL · CAIS simple-evals GitHub

Text average on CAIS Text Leaderboard — June 2026 dashboard snapshot

BenchLM mirrors the published text average view for the CAIS Text Leaderboard. GPT-5.5 leads the public snapshot at 54.1%, followed by Opus 4.8 (53.8%) and Gemini 3.1 Pro (52.9%). BenchLM does not use these results to rank models overall.

GPT-5.5 · OpenAI · gpt-5.5-high · 54.1% · Overall: n/a

Opus 4.8 · Anthropic · opus-4-8-adaptive-64k · 53.8% · Overall: n/a

Gemini 3.1 Pro · Google · gemini-3.1-pro-preview-high · 52.9% · Overall: n/a

25 models · External benchmark mirrors · Current · Display only · Updated June 2026 dashboard snapshot

Text average table (25 models)

Model             Provider     Score
GPT-5.5           OpenAI       54.1%
Opus 4.8          Anthropic    53.8%
Gemini 3.1 Pro    Google       52.9%
GPT-5.4           OpenAI       49.3%
Gemini 3.5 Flash  Google       48.8%
Opus 4.7          Anthropic    46.9%
Opus 4.6          Anthropic    44.0%
Gemini 3 Pro      Google       38.4%
Opus 4.5          Anthropic    36.6%
Gemini 3 Flash    Google       35.6%
GPT-5.2           OpenAI       33.8%
Sonnet 4.6        Anthropic    32.6%
Grok 4.2          xAI          32.5%
DeepSeek 4 Pro    DeepSeek     32.1%
Kimi K2.6         Moonshot AI  31.4%
GLM 5.1           Z.AI         29.8%
GPT-5.1           OpenAI       29.0%
Kimi K2.5         Moonshot AI  26.1%
Sonnet 4.5        Anthropic    25.4%
Grok 4.3          xAI          24.7%
GPT-5.4-mini      OpenAI       24.2%
GPT-5             OpenAI       20.9%
Grok 4            xAI          20.8%
o3                OpenAI       20.5%
DeepSeek 3.2      DeepSeek     20.3%

The published CAIS Text Leaderboard snapshot places GPT-5.5 first at 54.1%. Third-place Gemini 3.1 Pro is 1.2 points behind, at 52.9%. The top 10 span 18.5 points (from 54.1% down to 35.6%), so the table still separates the published systems.
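The two gaps quoted above can be checked directly from the top ten scores in the table. The snippet below is only an arithmetic check; the variable names are chosen for this example.

```python
# Text-average scores (percent) for the top ten rows of the table above.
top10 = [54.1, 53.8, 52.9, 49.3, 48.8, 46.9, 44.0, 38.4, 36.6, 35.6]

# Gap between first place and third place, rounded to one decimal place.
gap_to_third = round(top10[0] - top10[2], 1)

# Spread between first place and tenth place.
top10_range = round(top10[0] - top10[-1], 1)

print(gap_to_third)  # 1.2
print(top10_range)   # 18.5
```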

25 models have been evaluated on the CAIS Text Leaderboard, which falls in the External benchmark mirrors category. BenchLM keeps external benchmark mirrors separate from the weighted global scoring system, so these results remain source-specific evidence. The CAIS Text Leaderboard is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
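To make the "display only" idea concrete, here is a minimal sketch of how a display-only benchmark could be left out of a weighted score. This is not BenchLM's scoring formula, which this page does not describe. The `BenchmarkResult` type, the `display_only` flag, the weights, and the two "hypothetical_benchmark" rows are all invented for illustration; only the 54.1% figure comes from the table above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkResult:
    name: str
    score: float        # percent
    weight: float       # hypothetical weight in an overall formula
    display_only: bool  # True for mirrors shown for reference only


def overall_score(results: List[BenchmarkResult]) -> Optional[float]:
    """Weighted mean over benchmarks that are not display only."""
    scored = [r for r in results if not r.display_only and r.weight > 0]
    total_weight = sum(r.weight for r in scored)
    if total_weight == 0:
        return None  # nothing left to score
    return sum(r.score * r.weight for r in scored) / total_weight


results = [
    BenchmarkResult("hypothetical_benchmark_a", 70.0, 2.0, False),
    BenchmarkResult("hypothetical_benchmark_b", 40.0, 1.0, False),
    # Display only: shown on the page, ignored by the weighted score.
    BenchmarkResult("cais_text_leaderboard", 54.1, 0.0, True),
]
print(overall_score(results))  # 60.0
```

In this sketch the CAIS row is filtered out before any weighting happens, so changing its score leaves the overall result unchanged.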

About CAIS Text Leaderboard

Year: 2025

Tasks: HLE, ARC-AGI-2, SWE-Bench Pro, and TextQuests

Format: Average component score

Difficulty: Composite frontier text capability

BenchLM mirrors the text-capability portion of the CAIS AI Dashboard as a display-only composite. The displayed score is the average of the public HLE, ARC-AGI-2, SWE-Bench Pro, and TextQuests component scores.

CAIS AI Dashboard · Public benchmark source

BenchLM freshness & provenance

Version: CAIS Text Leaderboard 2025

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does CAIS Text Leaderboard measure?

A Center for AI Safety dashboard view summarizing text capabilities across HLE, ARC-AGI-2, SWE-Bench Pro, and TextQuests.

Which model leads the published CAIS Text Leaderboard snapshot?

GPT-5.5 currently leads the published CAIS Text Leaderboard snapshot with a 54.1% text average. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on CAIS Text Leaderboard?

25 AI models are included in BenchLM's mirrored CAIS Text Leaderboard snapshot, which is based on the June 2026 dashboard snapshot of the public leaderboard.

Last updated: June 2026 dashboard snapshot · mirrored from the public benchmark leaderboard
