AI Benchmarks Directory

Terminal-Bench 2.0

A benchmark for agentic software engineering tasks executed in real terminal environments. Models must inspect files, run commands, edit code, and recover from errors over multi-step workflows.

Terminal-based software tasks · Interactive CLI agent evaluation · Professional software engineering

Weighted 28%

Terminal-Bench 2 · updated June 2, 2026

BrowseComp

BrowseComp

A benchmark for web-browsing agents that must search, inspect sources, gather evidence, and return the correct answer to research-oriented questions.

Research questions requiring browsing · Web search and evidence synthesis · Hard web research

Weighted 18%

BrowseComp 2026 · updated June 2, 2026

HLE w/ tools

Humanity's Last Exam with tools

Tool-augmented Humanity's Last Exam scores reported in DeepSeek-V4 thinking-mode evaluations.

Expert questions with tool use · Pass@1 · Frontier tool-augmented reasoning

HLE w/ tools 2026 · updated June 2, 2026

GDPval-AA

GDPval-AA

An agentic real-world work-task evaluation reported as an Elo score in DeepSeek-V4 thinking-mode evaluations.

Agentic real-world work tasks · Elo · Professional agentic workflows

GDPval-AA 2026 · updated June 2, 2026
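
For readers unfamiliar with Elo-style reporting, the sketch below shows how a rating gap maps to an expected head-to-head score under the standard Elo logistic formula. The 400-point scale and the function name `elo_expected_score` are assumptions for illustration; this directory does not state the exact rating model behind the GDPval-AA numbers.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic model.

    The 400-point scale is the classic chess convention (an assumption here).
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Equal ratings give an even expectation.
    print(round(elo_expected_score(1500, 1500), 3))  # 0.5
    # A 100-point lead corresponds to roughly a 64% expected score.
    print(round(elo_expected_score(1600, 1500), 3))  # 0.64
```

Under this model only the rating difference matters, so Elo values are comparable within one leaderboard but not across leaderboards with different rating pools.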

GDPval-AA

GDPval-AA normalized

A display-only Artificial Analysis normalized score for economically valuable tasks.

Economically valuable tasks · Normalized score · Professional agentic workflows

GDPval-AA 2026 · updated June 2, 2026

AA Agentic Index

Artificial Analysis Agentic Index

A display-only Artificial Analysis agentic index.

Cross-benchmark agentic index · Aggregated model score · Display-only external reference

AA Agentic Index 2026 · updated June 2, 2026

APEX-Agents-AA

APEX-Agents-AA

Artificial Analysis' implementation of the APEX-Agents benchmark for long-horizon professional-services agent tasks.

452 professional-services agent tasks · Pass@1 · Long-horizon workplace agent tasks

APEX-Agents-AA 2026 · updated June 2, 2026

Gert Labs

Gert Labs Composite Game Benchmark

A game-environment benchmark that evaluates AI models in novel games covering strategic planning, resource management, spatial reasoning, cooperation, and theory of mind.

Novel game environments · Composite game leaderboard · Agentic coding and decision-making

Gert Labs 2026 · updated June 2, 2026

OSWorld-Verified

OSWorld-Verified

A verified subset of OSWorld focused on computer-use tasks in desktop-like environments, including navigation, editing, and workflow completion.

Desktop and GUI tasks · Interactive computer-use evaluation · Complex multi-step workflows

Weighted 24%

OSWorld Verified · updated June 2, 2026

CyberGym

CyberGym

A cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance.

1,507 vulnerability analysis instances · Vulnerability reproduction and PoC generation · Real-world cybersecurity

CyberGym 2026 · updated June 2, 2026

BrowseComp-VL

BrowseComp-VL

A vision-language browsing benchmark for multimodal web research and tool-use workflows.

Multimodal browsing tasks · Vision-language web research evaluation · Multimodal browser-agent

BrowseComp-VL 2026 · updated June 2, 2026

OSWorld

OSWorld

A computer-use benchmark for GUI task completion across the broader OSWorld task suite.

Computer-use tasks · Interactive GUI evaluation · Broad computer-use suite

OSWorld 2026 · updated June 2, 2026

AndroidWorld

AndroidWorld

A mobile GUI agent benchmark for completing Android app workflows and on-device tasks.

Android app workflows · Interactive mobile-agent evaluation · Complex mobile task completion

AndroidWorld 2026 · updated June 2, 2026

WebVoyager

WebVoyager

A browser-agent benchmark for completing multi-step workflows on live websites.

Live website workflows · Interactive browser-agent evaluation · Multi-step web navigation

WebVoyager 2026 · updated June 2, 2026

MCP Atlas

MCP Atlas

A benchmark for tool-calling over Model Context Protocol integrations and external tools.

Tool-integrated agent tasks · Interactive tool-calling evaluation · Advanced tool use

MCP Atlas 2026 · updated June 2, 2026

Toolathlon

Toolathlon

A tool-use benchmark focused on selecting, sequencing, and completing tasks with external tools.

Multi-tool workflows · Interactive tool-calling evaluation · Advanced tool use

Toolathlon 2026 · updated June 2, 2026

ZClawBench

ZClawBench

A Z.AI benchmark for OpenClaw-style agent workflows spanning information search, office work, data analysis, development and operations, automation, and security.

OpenClaw agent workflows · End-to-end agent benchmark · Broad productivity and operations workflows

ZClawBench 2026 · updated June 2, 2026

Tau2-Telecom

Tau2-Telecom

A telecom-oriented tool benchmark that measures structured tool use in domain workflows.

Telecom tool workflows · Domain-specific tool evaluation · Professional workflow

τ²-Bench 2026 · updated June 2, 2026

DeepSearchQA

DeepSearchQA

An agentic browsing benchmark where models search the web, gather evidence, and answer list-style questions using browser tools.

Agentic browsing and list-answer questions · Search / open / find browser-agent evaluation · Agentic web research

DeepSearchQA 2026 · updated June 2, 2026

Tau2-Airline

Tau2-Airline

An airline-domain tool-use benchmark for structured workflow execution and API correctness.

Airline support workflows · Domain-specific tool evaluation · Professional workflow

Tau2-Airline 2026 · updated June 2, 2026

PinchBench

PinchBench

An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows.

23 OpenClaw agent tasks · Average success rate from official runs · Long-horizon agent workflows

PinchBench 2026 · updated June 2, 2026

OpenHands Index

OpenHands Index

A holistic coding-agent benchmark that evaluates AI agents across issue resolution, frontend work, greenfield development, testing, and information gathering.

SWE-bench Verified, SWE-bench Multimodal, Commit0, SWT-bench Verified, and GAIA · Macro-average across five coding-agent categories · Real-world software engineering agent tasks

OpenHands Index 2025 · updated June 2, 2026
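
The macro-average named in this entry can be illustrated with a short sketch: each category is scored on its own, and the categories are then averaged with equal weight regardless of how many tasks each contains. The category keys and score values below are made-up placeholders, not published results, and the function name is hypothetical.

```python
from statistics import fmean


def macro_average(category_scores: dict[str, float]) -> float:
    """Equal-weight mean over per-category scores (one score per category)."""
    if not category_scores:
        raise ValueError("at least one category score is required")
    return fmean(category_scores.values())


# Placeholder scores for the five categories listed in the entry.
example = {
    "issue_resolution": 60.0,
    "frontend": 40.0,
    "greenfield": 50.0,
    "testing": 30.0,
    "information_gathering": 70.0,
}
print(macro_average(example))  # 50.0
```

A macro-average keeps a large category from dominating the headline number, at the cost of giving small categories the same influence as large ones.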

SWE-Atlas Refactoring

SWE-Atlas Refactoring

A Scale SWE-Atlas software-engineering agent benchmark focused on refactoring tasks.

SWE-Atlas refactoring tasks · Refactoring score with confidence intervals · Real-world software-engineering agent tasks

SWE-Atlas Refactoring 2026 · updated June 2, 2026

InferenceBench

InferenceBench

A benchmark for open-ended LLM inference optimization by AI agents. Agents receive a base model, one H100, and a fixed time budget to build a valid OpenAI-compatible inference server that improves serving speed.

4 inference-serving optimization scenarios · Two-hour autonomous CLI agent run · Open-ended ML systems engineering

InferenceBench 2026 · updated June 2, 2026

BFCL v4

Berkeley Function Calling Leaderboard v4

A function-calling benchmark for tool selection, schema adherence, and argument correctness.

Function-calling tasks · Tool invocation and schema evaluation · Advanced tool use

BFCL v4 2026 · updated June 2, 2026

MLE-Bench Lite

MLE-Bench Lite

A lightweight machine-learning competition benchmark that measures whether models can iteratively train, evaluate, and improve ML systems in low-resource settings.

Low-resource ML competitions · Autonomous iterative ML optimization · Agentic machine learning

MLE-Bench Lite 2026 · updated June 2, 2026

MM-ClawBench

MM-ClawBench

An OpenClaw-derived agent benchmark covering practical work and life tasks such as office document delivery, research, planning, and code maintenance.

OpenClaw-style real-world tasks · Agent workflow evaluation · Broad real-world agentic execution

MM-ClawBench 2026 · updated June 2, 2026

Claw-Eval

Claw-Eval

A transparent real-world autonomous-agent benchmark with 300 human-verified tasks, 2,159 rubric items, and Pass^3 scoring across general, multi-turn, and native multimodal agent tasks.

300 tasks, 2,159 rubrics · End-to-end autonomous-agent evaluation with Pass^3 scoring · Real-world general, multi-turn, and native multimodal agent execution

Claw-Eval 2026 · updated June 2, 2026
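
A minimal sketch of Pass^3-style scoring, assuming it follows the pass^k convention in which a task counts as solved only when all k independent trials succeed (a stricter reliability measure than pass@k, where one success is enough). The function name, task IDs, and trial data are hypothetical; Claw-Eval's official scorer may differ.

```python
def pass_hat_k(trial_results: dict[str, list[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k trials ALL succeeded (assumed pass^k)."""
    if not trial_results:
        raise ValueError("no tasks supplied")
    solved = 0
    for task_id, trials in trial_results.items():
        if len(trials) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
        if all(trials[:k]):
            solved += 1
    return solved / len(trial_results)


# Hypothetical outcomes: three trials per task.
runs = {
    "task-a": [True, True, True],
    "task-b": [True, False, True],   # one failure, so not counted
    "task-c": [False, False, False],
    "task-d": [True, True, True],
}
print(pass_hat_k(runs))  # 0.5
```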

QwenClawBench

QwenClawBench

Qwen's internal OpenClaw-style benchmark for measuring broad real-world agent performance across practical productivity and research tasks.

Real-world agent workflows · End-to-end agent evaluation · Broad real-world agentic execution

QwenClawBench 2026 · updated June 2, 2026

QwenWebBench

QwenWebBench

A Qwen benchmark for artifact and webpage generation quality reported as an Elo-style rating.

Web artifacts and interactive deliverables · Elo-style artifact benchmark · Artifact generation

QwenWebBench 2026 · updated June 2, 2026

TAU3-Bench

TAU3-Bench

A next-generation tool-use benchmark for complex, long-horizon agent workflows beyond the older tau2 telecom and airline task families.

Long-horizon tool workflows · Interactive tool-use evaluation · Advanced tool use

TAU3-Bench 2026 · updated June 2, 2026

VITA-Bench

VITA-Bench

An interactive real-world agent benchmark grounded in practical consumer-service tasks such as delivery, in-store consumption, and online travel workflows.

Interactive consumer-service agent tasks · End-to-end interactive agent evaluation · Long-horizon real-world workflows

VITA-Bench 2025 · updated June 2, 2026

DeepPlanning

DeepPlanning

A long-horizon planning benchmark that tests whether agents can optimize under explicit time, budget, and feasibility constraints.

Travel planning and constrained shopping · Long-horizon planning benchmark · Constrained agent planning

DeepPlanning 2026 · updated June 2, 2026

MCP-Tasks

MCP-Tasks

A Model Context Protocol task benchmark used in Qwen's launch tables to measure practical execution over MCP-style tools and integrations.

MCP-integrated tool tasks · Interactive tool-use evaluation · Advanced MCP workflows

MCP-Tasks 2026 · updated June 2, 2026

WideResearch

WideResearch

A broad research-agent benchmark for open-ended information gathering, synthesis, and answer construction across wide search spaces.

Open-ended research tasks · Multi-source research evaluation · Broad research-agent workflows

WideResearch 2026 · updated June 2, 2026

GAIA

General AI Assistants

GAIA evaluates AI models on real-world tasks that are conceptually simple for humans but require an AI system to combine multi-step reasoning, web browsing, tool use, and multimodal understanding. Tasks span three difficulty levels and test practical assistant capabilities rather than academic knowledge.

466

Weighted 12%

GAIA 2024 · updated June 2, 2026

TAU-bench

Tool-Agent-User Benchmark

TAU-bench evaluates AI agents in realistic enterprise scenarios requiring multi-turn tool use, database interactions, and policy adherence. It tests across retail and airline domains, measuring an agent's ability to reliably complete customer service tasks while following complex business rules.

680

Weighted 10%

TAU-bench 2024 · updated June 2, 2026

WebArena

WebArena Web Agent Benchmark

WebArena is a realistic web environment for evaluating autonomous AI agents on complex, multi-step browser tasks. Agents must navigate e-commerce sites, forums, content management systems, and code repositories to complete practical objectives like purchasing items, finding information, and managing accounts.

812

Weighted 8%

WebArena 2024 · updated June 2, 2026

MEWC

Multi-Environment Web Challenge

A benchmark that evaluates AI agents on multi-environment web challenges, testing navigation and task completion across diverse live web environments.

Web-agent tasks · Browser task completion · Open-web agent workflows

MEWC 2026 · updated June 2, 2026

Finance Agent v2

Finance Agent v2

A Vals AI benchmark for realistic financial analyst agent tasks across qualitative analysis, quantitative analysis, market work, comparables, precedents, earnings, disclosure, and modeling.

Financial analyst task categories · Mean score across repeated runs · Professional expert-task agent workflow

Finance Agent v2 2026 · updated June 2, 2026
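
The "Weighted N%" labels on the agentic entries above sum to 100% (28 + 24 + 18 + 12 + 10 + 8). The directory does not say how they are combined; one plausible reading is a weighted mean of per-benchmark scores. The sketch below assumes scores on a common 0-100 scale and renormalizes the weights over whichever benchmarks a model has reported. Both choices, and the function name, are assumptions rather than documented behavior.

```python
# Weights copied from the "Weighted N%" labels in the agentic entries.
AGENTIC_WEIGHTS = {
    "Terminal-Bench 2.0": 0.28,
    "OSWorld-Verified": 0.24,
    "BrowseComp": 0.18,
    "GAIA": 0.12,
    "TAU-bench": 0.10,
    "WebArena": 0.08,
}


def weighted_category_score(scores: dict[str, float],
                            weights: dict[str, float] = AGENTIC_WEIGHTS) -> float:
    """Weighted mean over the weighted benchmarks present in `scores`.

    Missing benchmarks are skipped and the remaining weights are renormalized
    (an assumed policy; the real aggregation may treat gaps differently).
    """
    used = {name: w for name, w in weights.items() if name in scores}
    total_weight = sum(used.values())
    if total_weight == 0:
        raise ValueError("no weighted benchmark scores available")
    return sum(scores[name] * w for name, w in used.items()) / total_weight


# Hypothetical model that has only published two of the six benchmarks.
print(round(weighted_category_score({"GAIA": 80.0, "WebArena": 40.0}), 2))  # 64.0
```

Unweighted, display-only entries would simply be left out of `scores` under this reading.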

Coding (27 benchmarks)

Stale · Saturated · Display only

HumanEval

2021

Evaluating Large Language Models Trained on Code

A set of 164 handwritten programming problems that test the ability to generate correct Python functions from natural language descriptions. Each problem includes a function signature, docstring, body, and several unit tests.

164 problems · Python function generation · Introductory to intermediate programming

HumanEval · updated June 2, 2026
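
Unit-test-based results on problems like these are usually summarized as pass@k. The sketch below implements the unbiased estimator introduced in the paper named in this entry, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests. The function name and the example counts are illustrative; the directory does not state which k or n its listed scores use.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass all unit tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples and 3 passing, pass@1 is the plain pass rate.
print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 4))  # 0.9167
```

The benchmark-level score is the mean of this per-problem estimate over all problems.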

BigCodeBench

BigCodeBench

A code-generation benchmark reported in DeepSeek-V4 base-model evaluations.

Code generation tasks · Pass@1 · Software engineering

BigCodeBench 2026 · updated June 2, 2026

Codeforces

Codeforces Rating

Competitive-programming rating reported for DeepSeek-V4 thinking-mode evaluations.

Competitive programming contests · Rating · Elite competitive programming

Codeforces 2026 · updated June 2, 2026

Terminal-Bench 2.0

Terminal-Bench 2.0

A benchmark for agentic software engineering tasks executed in real terminal environments. DeepSeek reports it in the agentic section, while BenchLM also mirrors it in coding for models that publish it as a developer-task signal.

Terminal-based software tasks · Interactive CLI agent evaluation · Professional software engineering

Terminal-Bench 2 · updated June 2, 2026

SWE-bench Verified

Software Engineering Benchmark Verified

A curated, human-verified subset of SWE-bench that tests models on resolving real GitHub issues from popular open-source Python repositories like Django, Flask, and scikit-learn.

500 verified issues · Code patch generation · Professional software engineering

Weighted 13%

SWE-bench Verified 2024 · updated June 2, 2026

SWE-Rebench

SWE-Rebench

A continuously updated software engineering benchmark by Nebius using fresh GitHub issues to avoid contamination. Models are evaluated 5 times per problem under a fixed ReAct scaffolding; the Resolved Rate (best pass@1) is reported.

Fresh GitHub issues (rolling window) · Code patch generation · Professional software engineering

Weighted 31%

Rolling 2026 window · updated June 2, 2026

LiveCodeBench

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

A continuously updated benchmark using fresh competitive programming problems from LeetCode, Codeforces, and AtCoder to provide contamination-free code generation evaluation.

Continuously updated · Competitive programming · Competitive programming level

Weighted 23%

Rolling 2026 set · updated June 2, 2026

LiveCodeBench v6

LiveCodeBench v6

A newer LiveCodeBench slice used in provider comparison tables to benchmark contamination-resistant coding performance on fresher competitive programming sets.

Fresh programming problems · Competitive programming · Competitive programming level

LiveCodeBench v6 2026 · updated June 2, 2026

LiveCodeBench Pro

LiveCodeBench Pro

A harder competitive-programming benchmark family built from Codeforces, ICPC, and IOI problems, with quarter-specific public leaderboards and difficulty-aware reporting.

Quarter-specific contest programming sets · Competitive programming · High-end contest programming

LiveCodeBench Pro 2025 · updated June 2, 2026

FLTEval

FLTEval

A repository-level Lean 4 proof engineering benchmark that measures whether a model can complete formal proofs and correctly define new mathematical concepts inside realistic FLT project pull requests.

FLT project pull requests · Lean 4 repository task completion · Formal verification / proof engineering

FLTEval 2026 · updated June 2, 2026

SWE-bench Pro

SWE-bench Pro

A harder coding-agent benchmark than SWE-bench Verified, intended to differentiate frontier models on realistic software engineering work.

Real-world software engineering · Repository task completion · Frontier coding agent

Weighted 23%

SWE-bench Pro 2026 · updated June 2, 2026

SWE Multilingual

SWE Multilingual

A multilingual software-engineering benchmark for real-world code issue resolution across multiple programming languages.

Multilingual software-engineering tasks · Repository task completion · Professional software engineering

SWE Multilingual 2026 · updated June 2, 2026

SWE Multimodal

SWE-bench Multimodal

A multimodal variant of SWE-bench that adds visual context such as screenshots and design mockups to software engineering issue descriptions.

Multimodal software engineering tasks · Code patch generation with visual context · Frontier multimodal coding

SWE Multimodal 2025 · updated June 2, 2026

CursorBench v3.1

CursorBench v3.1

Cursor's first-party harder-task benchmark for long-horizon agentic coding behavior inside the Cursor agent loop.

Harder long-horizon agentic coding tasks · Cursor agent-loop evaluation · Professional agentic software engineering

CursorBench v3.1 2026 · updated June 2, 2026

Multi-SWE Bench

Multi-SWE Bench

A multi-language software-engineering benchmark that measures repository-level bug fixing and implementation across more than one programming ecosystem.

Multi-language repo tasks · Repository task completion · Professional software engineering

Multi-SWE Bench 2026 · updated June 2, 2026

VIBE-Pro

VIBE-Pro

A repo-level code generation and full-project delivery benchmark spanning web, mobile, and simulation-style implementation tasks.

Full project delivery tasks · Repository-level implementation benchmark · End-to-end software delivery

VIBE-Pro 2026 · updated June 2, 2026

Vibe Code Bench

Vibe Code Bench v1.1

A Vals.ai benchmark for evaluating whether models can build complete web applications from natural language specifications in a production-like development environment.

End-to-end web application builds · Full-stack app implementation benchmark · End-to-end software delivery

Vibe Code Bench 2026 · updated June 2, 2026

ProgramBench

ProgramBench: Can Language Models Rebuild Programs From Scratch?

A cleanroom software-engineering benchmark where agents receive only a compiled executable and documentation, then must architect and implement a complete codebase that reproduces the original program's behavior.

200 program reconstruction tasks · Cleanroom executable reimplementation · Full-repository software architecture

ProgramBench 2026 · updated June 2, 2026

NL2Repo

NL2Repo

A repository-understanding benchmark that measures whether models can map natural-language requests onto the right code locations and system changes.

Natural language to repository tasks · Repository understanding benchmark · System-level software comprehension

NL2Repo 2026 · updated June 2, 2026

React Native Evals

React Native Evals

An open benchmark for AI coding agents on real-world React Native implementation tasks, emphasizing working app behavior, recommended architecture choices, and strict constraint adherence.

React Native app implementation tasks · Framework-specific app development evaluation · Production mobile app engineering

React Native Evals 2026 · updated June 2, 2026

Next.js Evals

AI Agent Evaluations for Next.js

A Vercel benchmark for AI coding agents on Next.js code generation and migration tasks, reporting success rate, average execution time, and an AGENTS.md documentation-assisted split.

24 Next.js code generation and migration tasks · Agent task completion with withheld Vitest assertions · Framework-specific web application engineering

Next.js Evals 2026 · updated June 2, 2026

SWE-bench Verified*

SWE-bench Verified (mini-swe-agent-v2)

A display-only SWE-bench Verified reference from Arcee AI's Trinity-Large-Thinking comparison chart.

Repository task completion · Agent scaffold benchmark · Professional software engineering

SWE-bench Verified* 2026 · updated June 2, 2026

Spider 2.0-Lite

Spider 2.0-Lite

A text-to-SQL benchmark over realistic warehouse-scale schemas, reported by Interfaze for model comparison.

Text-to-SQL queries · Execution accuracy · Enterprise text-to-SQL

Spider 2.0-Lite 2024 · updated June 2, 2026

SciCode

Scientific Code Benchmark

SciCode evaluates language models on generating code for realistic scientific research problems across 16 subfields of physics, math, chemistry, biology, and materials science. Problems decompose into 338 subproblems requiring domain knowledge recall, scientific reasoning, and precise code synthesis. The problems are based on real scripts from published research.

Weighted 10%

SciCode 2024 · updated June 2, 2026

AA Coding Index

Artificial Analysis Coding Index

A display-only Artificial Analysis coding index.

Cross-benchmark coding index · Aggregated model score · Display-only external reference

AA Coding Index 2026 · updated June 2, 2026

AA-SciCode

Artificial Analysis SciCode

A display-only Artificial Analysis SciCode score.

Scientific coding subproblems · Task success rate · Scientific programming

AA-SciCode 2026 · updated June 2, 2026

Terminal-Bench Hard

Terminal-Bench Hard

A display-only Artificial Analysis coding metric for agentic coding and terminal use on a harder Terminal-Bench slice.

Agentic coding and terminal tasks · Task success rate · Professional software engineering

Terminal-Bench Hard 2026 · updated June 2, 2026

Reasoning (23 benchmarks)

MuSR

Testing the Limits of Chain-of-thought with Multistep Soft Reasoning

A dataset for evaluating language models on multistep soft reasoning tasks specified in natural language narratives. It tests the ability to perform complex, structured reasoning.

Multi-step reasoning · Narrative-based reasoning · Complex reasoning tasks

Weighted 20%

MuSR 2023 · updated June 2, 2026

BBH

2022

BIG-Bench Hard

A suite of 23 challenging tasks from the BIG-Bench collaborative benchmark on which prior language models did not exceed average human-rater performance.

Stale · Saturated · Display only

23 tasks · Mixed reasoning tasks · Advanced reasoning

BBH 2022 · updated June 2, 2026

DROP

Discrete Reasoning Over Paragraphs

A reading-comprehension benchmark requiring discrete reasoning over paragraphs, reported in DeepSeek-V4 base-model evaluations.

Paragraph reasoning questions · F1 · Reading and numerical reasoning

DROP 2026 · updated June 2, 2026

HellaSwag

HellaSwag

A commonsense natural-language inference benchmark reported in DeepSeek-V4 base-model evaluations.

Commonsense completion questions · Exact match · Commonsense reasoning

HellaSwag 2026 · updated June 2, 2026

WinoGrande

WinoGrande

A commonsense coreference benchmark reported in DeepSeek-V4 base-model evaluations.

Coreference resolution questions · Exact match · Commonsense reasoning

WinoGrande 2026 · updated June 2, 2026

CLUEWSC

CLUEWSC

A Chinese Winograd Schema Challenge benchmark reported in DeepSeek-V4 base-model evaluations.

Chinese coreference questions · Exact match · Chinese commonsense reasoning

CLUEWSC 2026 · updated June 2, 2026

LisanBench

LisanBench

A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.

50 starting words × 3 trials · Difficulty-weighted word-chain reasoning · Open-ended lexical planning

LisanBench 2026 · updated June 2, 2026
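
The chain rule described in this entry can be checked mechanically. The sketch below assumes "edit-distance-1" means Levenshtein distance 1 (one insertion, deletion, or substitution) and that no word may appear twice. The official scorer presumably also checks dictionary membership and applies difficulty weighting, both of which are omitted here; the function names are hypothetical.

```python
def is_one_edit_apart(a: str, b: str) -> bool:
    """True if a and b differ by exactly one insertion, deletion, or substitution."""
    if a == b:
        return False
    if len(a) > len(b):
        a, b = b, a  # ensure a is the shorter (or equal-length) word
    if len(b) - len(a) > 1:
        return False
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]          # one extra character in b at position i


def valid_chain_length(chain: list[str]) -> int:
    """Number of leading words that form a valid non-repeating chain."""
    seen: set[str] = set()
    for idx, word in enumerate(chain):
        if word in seen:
            return idx
        if idx > 0 and not is_one_edit_apart(chain[idx - 1], word):
            return idx
        seen.add(word)
    return len(chain)


print(valid_chain_length(["cat", "cot", "cog", "dog", "cog"]))  # 4 (repeat of "cog")
print(valid_chain_length(["cat", "cart", "dog"]))               # 2 ("cart" to "dog" is too far)
```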

Pencil Puzzle Bench

Pencil Puzzle Bench

A multi-step verifiable reasoning benchmark that evaluates whether models can solve pencil puzzles with unique solutions.

300 evaluation puzzles · Direct and agentic puzzle solve rate · Multi-step verifiable reasoning

Pencil Puzzle Bench 2026 · updated June 2, 2026

LongBench v2

LongBench v2

A long-context benchmark that measures whether models can actually use extended context windows for reasoning and retrieval.

Long-context tasks · Extended-context retrieval and reasoning · Hard long-context

Weighted 30%

LongBench v2 2025 · updated June 2, 2026

MRCRv2

MRCRv2

A long-context benchmark for memory, retrieval, and multi-round coherence over large contexts.

Long-context retrieval · Multi-round long-context evaluation · Hard long-context

MRCRv2 2025 · updated June 2, 2026

MRCR v2 64K-128K

OpenAI MRCR v2 8-needle 64K-128K

An MRCR v2 slice focused on long-context retrieval at 64K-128K lengths.

8-needle retrieval tasks · Long-context retrieval · Long-context reasoning

MRCR v2 64K-128K 2026 · updated June 2, 2026

MRCR v2 128K-256K

OpenAI MRCR v2 8-needle 128K-256K

An MRCR v2 slice focused on very long contexts at 128K-256K lengths.

8-needle retrieval tasks · Very-long-context retrieval · Very long-context reasoning

MRCR v2 128K-256K 2026 · updated June 2, 2026

Graphwalks BFS 128K

Graphwalks BFS 0K-128K

A long-context graph traversal benchmark using breadth-first search tasks.

Graph traversal tasks · Long-context graph reasoning · Algorithmic long-context reasoning

Graphwalks BFS 128K 2026 · updated June 2, 2026
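
As a reference point for the traversal these tasks ask a model to carry out in context, here is a plain breadth-first search that returns the nodes first reached exactly `depth` hops from a start node. The directed edge-list input and the "frontier at depth d" question are assumptions about the task shape, not the benchmark's actual prompt format.

```python
from collections import defaultdict


def bfs_frontier(edges: list[tuple[str, str]], start: str, depth: int) -> set[str]:
    """Nodes whose shortest directed distance from `start` is exactly `depth`."""
    graph: dict[str, list[str]] = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    visited = {start}
    frontier = {start}
    for _ in range(depth):
        next_frontier: set[str] = set()
        for node in frontier:
            for neighbor in graph[node]:
                if neighbor not in visited:  # first visit gives the shortest distance
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return frontier


# Small hypothetical graph: a -> b, a -> c, b -> d, c -> d, d -> e
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(sorted(bfs_frontier(edges, "a", 1)))  # ['b', 'c']
print(sorted(bfs_frontier(edges, "a", 2)))  # ['d']
```

The algorithm itself is trivial for a program; the benchmark's difficulty comes from performing it over an edge list that fills a very long context window.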

Graphwalks Parents 128K

Graphwalks parents 0-128K

A long-context benchmark for recovering parent relationships inside graph tasks.

Graph parent-retrieval tasks · Long-context graph reasoning · Algorithmic long-context reasoning

Graphwalks Parents 128K 2026 · updated June 2, 2026

MRCR 1M

MRCR 1M

A million-token MRCR long-context retrieval benchmark reported in DeepSeek-V4 model evaluations.

Million-token retrieval · Long-context retrieval MMR · Million-token long context

MRCR 1M 2026 · updated June 2, 2026

CorpusQA 1M

CorpusQA 1M

A million-token CorpusQA long-context question-answering benchmark reported in DeepSeek-V4 model evaluations.

Million-token corpus question answering · Long-context QA accuracy · Million-token long context

CorpusQA 1M 2026 · updated June 2, 2026

ARC-AGI-2

Abstraction and Reasoning Corpus for AGI v2

A benchmark measuring fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. It is often described as the hardest public reasoning benchmark; average individual human performance is 66%.

Visual pattern completion and abstract reasoning · Grid transformation puzzles with novel rules · Expert-level: hardest public reasoning benchmark

ARC-AGI 2 · updated June 2, 2026

AI-Needle

AI-Needle

A long-context retrieval benchmark that measures whether a model can recover relevant information embedded deep inside very long contexts.

Long-context retrieval · Needle-in-a-haystack recall · Long-context memory

AI-Needle 2026 · updated June 2, 2026

GPQA Diamond

GPQA Diamond

The hardest subset of GPQA featuring the most challenging graduate-level science questions. Sometimes reported separately from the standard GPQA benchmark.

Expert-level science questions · Multiple choice questions · Graduate-level scientific reasoning

GPQA Diamond 2023 · updated June 2, 2026

AA-LCR

Artificial Analysis Long Context Reasoning

A display-only Artificial Analysis long-context reasoning evaluation.

Long-context reasoning tasks · Accuracy · Long-context reasoning

AA-LCR 2026 · updated June 2, 2026

CritPt

Critical Physics Tasks

A display-only Artificial Analysis metric for research-level physics reasoning.

Research-level physics questions · Accuracy · Research-level physics reasoning

CritPt 2026 · updated June 2, 2026

BullshitBench v2

BullshitBench v2

A benchmark that tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. It measures the ability to push back on bad input.

Nonsensical and flawed prompts across multiple domains · Prompt challenge and refusal evaluation · Robustness and critical reasoning

BullshitBench v2 2025 · updated June 2, 2026

WildBench

WildBench

An automated evaluation framework built on 1,024 real-world user tasks covering reasoning, planning, coding, and creative writing. Its scores correlate highly with Chatbot Arena human preference rankings.

1,024 real-world tasks · Real-world task evaluation · Diverse real-world scenarios

WildBench 2024 · updated June 2, 2026

Multimodal & Grounded (47 benchmarks)

MMMU

Massive Multi-discipline Multimodal Understanding

A broad multimodal reasoning benchmark spanning charts, diagrams, tables, and academic visual question answering.

Multimodal academic reasoning · Image + text question answering · Frontier multimodal

MMMU 2024 · updated June 2, 2026

MMMU-Pro

Massive Multi-discipline Multimodal Understanding Pro

A harder multimodal benchmark for frontier models that combines text with images, diagrams, charts, and academic visual reasoning tasks.

Multimodal academic reasoning · Image + text question answering · Frontier multimodal

Weighted 45%

MMMU-Pro 2024 · updated June 2, 2026

AA-MMMU-Pro

Artificial Analysis MMMU-Pro

A display-only Artificial Analysis MMMU-Pro score.

Multimodal academic reasoning · Image + text question answering · Frontier multimodal

AA-MMMU-Pro 2026 · updated June 2, 2026

OCRBench V2

OCRBench V2

A native OCR benchmark for reading text from images across multilingual scripts, low-quality scans, handwriting, structured layouts, charts, and screenshots.

Image OCR tasks · Accuracy · Native visual text understanding

OCRBench V2 2025 · updated June 2, 2026

olmOCR

olmOCR-Bench

An end-to-end document understanding benchmark over long, layout-rich PDFs with tables, equations, headers, footnotes, and multi-column flows.

Layout-rich PDF understanding · Mean accuracy · Complex document processing

olmOCR 2025 · updated June 2, 2026

VoxPopuli WER

VoxPopuli-Cleaned-AA Word Error Rate

A speech-recognition benchmark on the cleaned Artificial Analysis VoxPopuli subset, reported as word error rate where lower is better.

Speech-to-text transcription · Word error rate · Audio speech recognition

VoxPopuli WER 2026 · updated June 2, 2026
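
Word error rate is the word-level edit distance between the reference transcript and the model's output, divided by the number of reference words: (substitutions + deletions + insertions) / N. The sketch below computes it with a standard dynamic-programming edit distance. It splits on whitespace only; real evaluation pipelines typically normalize case and punctuation first, and the exact normalization used for this benchmark is not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.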

Design Arena Website

Design Arena Website Elo

A display-only Design Arena website-generation Elo score surfaced on OpenRouter model benchmark pages.

Website generation comparisons · Elo · Design and website generation

Design Arena Website 2026 · updated June 2, 2026

OfficeQA Pro

OfficeQA Pro

A benchmark for grounded reasoning over office-style documents, spreadsheets, charts, and business artifacts.

Document and spreadsheet tasks · Grounded QA over office artifacts · Enterprise grounded reasoning

Weighted 30%

OfficeQA Pro 2026 · updated June 2, 2026

MMMU-Pro w/ Python

MMMU-Pro with Python

A tool-augmented MMMU-Pro variant that allows Python assistance during multimodal reasoning.

Multimodal academic reasoning · Image + text question answering with Python · Frontier multimodal

MMMU-Pro w/ Python 2026 · updated June 2, 2026

OmniDocBench 1.5

OmniDocBench 1.5

A document understanding benchmark used in frontier-model comparison tables to measure extraction and grounded reasoning quality on complex documents.

Document understanding tasks · Document understanding benchmark · Grounded document reasoning

OmniDocBench 1.5 2026 · updated June 2, 2026

RealWorldQA

RealWorldQA

A grounded visual QA benchmark focused on answering practical questions about real-world images and scenes.

Real-world visual question answering · Image-grounded QA · General visual reasoning

RealWorldQA 2026 · updated June 2, 2026

Video-MME (with subtitle)

Video-MME with subtitle

A video understanding benchmark that allows subtitle access when answering multimodal questions about videos.

Video understanding · Video QA with subtitle context · Multimodal video reasoning

Video-MME (with subtitle) 2026 · updated June 2, 2026

Video-MME (w/o subtitle)

Video-MME without subtitle

A stricter Video-MME setting that removes subtitle help and tests video understanding from visual and audio context alone.

Video understanding · Video QA without subtitle context · Multimodal video reasoning

Video-MME (w/o subtitle) 2026 · updated June 2, 2026

Video-MME

Video-MME

A comprehensive benchmark for multimodal large language models on video understanding, covering temporal reasoning, perception, and question answering over videos.

Video understanding · Video QA and analysis · Broad multimodal video reasoning

Video-MME 2024 · updated June 2, 2026

MathVision

MathVision

A visual mathematics benchmark that tests whether a model can solve math problems grounded in diagrams, equations, figures, and other visual inputs.

Visually grounded math problems · Image + math reasoning · Advanced multimodal mathematics

MathVision 2026 · updated June 2, 2026

We-Math

We-Math

A multimodal math benchmark for visually grounded mathematical reasoning and answer generation.

Visually grounded math problems · Multimodal mathematical reasoning · Advanced multimodal mathematics

We-Math 2026 · updated June 2, 2026

DynaMath

DynaMath

A multimodal benchmark for dynamic mathematical reasoning over visual and structured inputs.

Dynamic visual math problems · Multimodal mathematical reasoning · Advanced multimodal mathematics

DynaMath 2026 · updated June 2, 2026

MStar

MStar

A general visual question-answering benchmark used in provider tables for real-image reasoning quality.

Real-image visual QA · Image-grounded QA · General visual reasoning

MStar 2026 · updated June 2, 2026

ChatCVQA

ChatCVQA

A conversational visual QA benchmark that tests multi-turn grounded answering over images and documents.

Conversational visual QA · Multi-turn image-grounded QA · Conversational multimodal reasoning

ChatCVQA 2026 · updated June 2, 2026

MMLongBench-Doc

MMLongBench-Doc

A long-document multimodal benchmark for grounded reasoning over extended document contexts.

Long document understanding · Document-grounded reasoning · Long-context document reasoning

MMLongBench-Doc 2026 · updated June 2, 2026

CC-OCR

CC-OCR

An OCR-focused benchmark for reading and extracting text from visually complex documents and images.

Optical character recognition · Text extraction from images and documents · Document reading

CC-OCR 2026 · updated June 2, 2026

AI2D_TEST

AI2D test split

A diagram understanding benchmark focused on scientific and educational visual question answering.

Diagram understanding · Diagram-grounded QA · Structured visual reasoning

AI2D_TEST 2026 · updated June 2, 2026

CountBench

CountBench

A visual counting benchmark that tests whether a model can count objects and entities reliably in complex scenes.

Visual counting tasks · Image-grounded counting · Fine-grained visual perception

CountBench 2026 · updated June 2, 2026

RefCOCO (avg)

RefCOCO average

A referring-expression grounding benchmark averaged across RefCOCO variants to test whether a model can localize described objects correctly.

Referring-expression grounding · Grounded visual localization · Fine-grained visual grounding

RefCOCO (avg) 2026 · updated June 2, 2026

ODINW13

ODINW13

A visual detection and grounding benchmark slice used to compare zero-shot object understanding across diverse domains.

Out-of-distribution object understanding · Detection and grounding · Robust visual grounding

ODINW13 2026 · updated June 2, 2026

ERQA

ERQA

An embodied reasoning question-answering benchmark that tests spatial and physical-world reasoning over real images.

Embodied reasoning QA · Grounded image reasoning · Grounded multimodal reasoning

ERQA 2026 · updated June 2, 2026

VideoMMMU

VideoMMMU

A video extension of MMMU-style multimodal reasoning over expert questions grounded in temporal media.

Video-grounded expert reasoning · Video + text reasoning · Frontier multimodal video reasoning

VideoMMMU 2026 · updated June 2, 2026

MLVU (M-Avg)

MLVU multiple-choice average

A multi-task long-video understanding benchmark, reported as M-Avg, the mean accuracy across MLVU's multiple-choice tasks.

General video understanding · Video QA and understanding · Broad multimodal video reasoning

MLVU (M-Avg) 2026 · updated June 2, 2026

MMVU

Multimodal Multi-disciplinary Video Understanding

A benchmark for evaluating multimodal models on video understanding tasks across multiple disciplines, emphasizing temporal reasoning and comprehension over video content.

Video understanding · Video reasoning benchmark · Multi-disciplinary multimodal video reasoning

MMVU 2026 · updated June 2, 2026

ScreenSpot Pro

ScreenSpot Pro

A high-resolution GUI grounding benchmark for professional computer-use environments.

GUI grounding tasks · Interface element localization · Professional GUI grounding

ScreenSpot Pro 2025 · updated June 2, 2026

TIR-Bench

TIR-Bench

A visual agent benchmark for interface reasoning and task execution over screenshots or software surfaces.

Visual agent and interface reasoning · Screenshot-grounded task reasoning · Computer-use visual reasoning

TIR-Bench 2026 · updated June 2, 2026

GDPval-AA

GDPval-AA

An evaluation focused on professional domain expertise and task delivery quality in office-style knowledge work.

Professional office delivery · Elo-style office benchmark · Professional knowledge work

GDPval-AA 2026 · updated June 2, 2026
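
For readers unfamiliar with Elo-style reporting, the sketch below shows the standard Elo expected-score formula, which turns a rating gap into an expected win rate. It is a generic illustration, not Artificial Analysis' scoring code; the function name and the example ratings are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability-like score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Hypothetical ratings, for illustration only.
equal_match = expected_score(1500, 1500)   # evenly matched models
strong_vs_weak = expected_score(1600, 1200)  # 400-point gap
```

Under this formula a 400-point gap corresponds to an expected score of roughly 0.91 for the higher-rated model, so differences between Elo values matter more than the absolute numbers.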

MedXpertQA (MM)

MedXpertQA Multimodal

A multimodal medical multiple-choice benchmark covering clinical images such as X-rays, histology, and dermatology.

2,000 multimodal medical questions · Medical visual MCQ · Clinical multimodal reasoning

MedXpertQA (MM) 2026 · updated June 2, 2026

ZeroBench

ZeroBench

A multi-step visual reasoning benchmark with pass@5 reporting and optional tool use.

100 visual reasoning questions · Multi-step visual reasoning · Tool-augmented visual reasoning

ZeroBench 2026 · updated June 2, 2026
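
Several entries in this directory report pass@1 or pass@5. A minimal sketch of the commonly used unbiased pass@k estimator follows, assuming n samples are drawn per question and c of them are correct. Whether ZeroBench or any provider computes its figure exactly this way is an assumption; the function name and the tallies are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-question tallies: (samples drawn, samples correct).
tallies = [(5, 0), (5, 1), (5, 5)]
benchmark_pass_at_5 = sum(pass_at_k(n, c, 5) for n, c in tallies) / len(tallies)
```

With k equal to n, a question counts as solved if any of the samples is correct, which is why pass@5 figures are not directly comparable to pass@1 figures.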

Design2Code

Design2Code

A multimodal coding benchmark for turning visual designs into working frontend implementations.

Design-to-code tasks · Visual input to frontend implementation · Multimodal coding

Design2Code 2026 · updated June 2, 2026

Flame-VLM-Code

Flame-VLM-Code

A vision-language coding benchmark for generating correct code from visual and multimodal inputs.

Multimodal coding tasks · Vision-language code generation · Multimodal coding

Flame-VLM-Code 2026 · updated June 2, 2026

Vision2Web

Vision2Web

A benchmark for converting visual references into functional web implementations.

Screenshot-to-web tasks · Visual reference to web implementation · Multimodal web generation

Vision2Web 2026 · updated June 2, 2026

ImageMining

ImageMining

A multimodal retrieval and extraction benchmark over image-heavy task settings.

Visual retrieval tasks · Image-grounded retrieval and extraction · Multimodal retrieval

ImageMining 2026 · updated June 2, 2026

MMSearch

MMSearch

A multimodal search benchmark for retrieval and grounded answering across mixed-media inputs.

Multimodal search tasks · Mixed-media retrieval and grounded answering · Multimodal search

MMSearch 2026 · updated June 2, 2026

MMSearch-Plus

MMSearch-Plus

A harder MMSearch variant for multimodal retrieval and grounded tool-use workflows.

Hard multimodal search tasks · Advanced mixed-media retrieval benchmark · Advanced multimodal search

MMSearch-Plus 2026 · updated June 2, 2026

SimpleVQA

SimpleVQA

A visual question answering benchmark focused on straightforward image-grounded understanding.

Visual QA tasks · Image-grounded question answering · General visual understanding

SimpleVQA 2026 · updated June 2, 2026

Facts-VLM

Facts-VLM

A grounded multimodal factuality benchmark for evidence-linked answer correctness.

Grounded factuality tasks · Evidence-linked multimodal factuality · Grounded multimodal factuality

Facts-VLM 2026 · updated June 2, 2026

V*

A vision-centric benchmark for high-level multimodal reasoning and perception quality.

Frontier multimodal reasoning tasks · Vision-centric reasoning benchmark · Frontier multimodal

V* 2026 · updated June 2, 2026

CharXiv

CharXiv Reasoning

A scientific chart reasoning benchmark that tests whether models can understand, interpret, and reason about complex scientific visualizations including plots, diagrams, and data charts.

Scientific chart reasoning · Chart understanding and reasoning · Scientific visualization reasoning

Weighted 20%

CharXiv 2024 · updated June 2, 2026

CharXiv w/o tools

CharXiv Reasoning without tools

Tool-free variant of CharXiv that isolates raw visual reasoning ability without code execution or tool augmentation.

Scientific chart reasoning (tool-free) · Chart understanding without tools · Scientific visualization reasoning

Weighted 5%

CharXiv w/o tools 2024 · updated June 2, 2026

SWE-bench Multimodal

SWE-bench Multimodal

A multimodal variant of SWE-bench that adds visual context (screenshots, design mockups) to software engineering issue descriptions, testing whether models can leverage visual information for code generation.

Multimodal software engineering tasks · Code patch generation with visual context · Frontier multimodal coding

SWE-bench Multimodal 2025 · updated June 2, 2026

Blueprint-Bench 2

Blueprint-Bench 2

An agentic spatial reasoning benchmark reported as a normalized score.

Spatial reasoning from blueprints · Normalized score · Agentic spatial reasoning

Blueprint-Bench 2 2026 · updated June 2, 2026

Knowledge (30 benchmarks)

Stale · Saturated · Display only

MMLU

2020

Massive Multitask Language Understanding

A comprehensive multiple-choice question answering test covering 57 tasks including elementary mathematics, US history, computer science, law, and more. Tests knowledge across diverse academic subjects from high school to professional level.

57 subjects · Multiple choice questions · Elementary to professional level

MMLU · updated June 2, 2026

GPQA

Graduate-Level Google-Proof Q&A

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Designed to be difficult even for skilled non-experts with access to Google.

448 questions · Multiple choice questions · Graduate level

Weighted 12%

GPQA Diamond · updated June 2, 2026

GPQA-D

GPQA Diamond

A display-only GPQA Diamond reference from provider comparison charts.

Graduate-level science questions · Multiple choice questions · Graduate level

GPQA-D 2026 · updated June 2, 2026

SuperGPQA

SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines

An expanded version of GPQA that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines, providing comprehensive coverage of academic domains.

285 disciplines · Multiple choice questions · Graduate level

Weighted 12%

SuperGPQA 2025 · updated June 2, 2026

MMLU-Pro

Massive Multitask Language Understanding Professional

An enhanced version of MMLU with 10 answer choices instead of 4, featuring more reasoning-focused questions that better differentiate frontier models.

Multiple subjects · 10-way multiple choice · Professional level

Weighted 22%

MMLU-Pro · updated June 2, 2026

AGIEval

AGIEval

A human-centric exam benchmark for general knowledge and reasoning reported in DeepSeek-V4 base-model evaluations.

General academic and professional exam questions · Exact match · General knowledge

AGIEval 2026 · updated June 2, 2026

HLE

Humanity's Last Exam

An extremely challenging benchmark crowd-sourced from nearly 1,000 subject-matter experts worldwide, designed to probe the frontier of AI capabilities with questions that even specialists find difficult.

Expert-level questions · Open-ended and multiple choice · Frontier expert level

Weighted 23%

Humanity's Last Exam · updated June 2, 2026

FrontierScience

FrontierScience

A benchmark for research-level scientific reasoning, designed to separate frontier models on difficult science tasks that mix domain knowledge with deep reasoning.

Research-level science tasks · Scientific reasoning benchmark · Research frontier

Weighted 18%

FrontierScience 2026 · updated June 2, 2026

Artificial Analysis Intelligence Index

Artificial Analysis Intelligence Index

A display-only intelligence index published by Artificial Analysis that aggregates provider-reported and benchmark-derived signals into a single model-level score.

Cross-benchmark intelligence index · Aggregated model score · Display-only external reference

Artificial Analysis Intelligence Index 2026 · updated June 2, 2026

AA-GPQA Diamond

Artificial Analysis GPQA Diamond

A display-only Artificial Analysis GPQA Diamond score.

Graduate-level science questions · Accuracy · Graduate-level science reasoning

AA-GPQA Diamond 2026 · updated June 2, 2026

AA-HLE

Artificial Analysis Humanity's Last Exam

A display-only Artificial Analysis Humanity's Last Exam score.

Expert-level questions · Accuracy · Frontier expert reasoning

AA-HLE 2026 · updated June 2, 2026

AA-Omniscience Index

Artificial Analysis Omniscience Index

A display-only Artificial Analysis factual knowledge index.

Knowledge questions · Index score · Broad factual knowledge

AA-Omniscience Index 2026 · updated June 2, 2026

AA-Omniscience Accuracy

Artificial Analysis Omniscience Accuracy

A display-only Artificial Analysis knowledge metric for the proportion of correctly answered questions.

Knowledge questions · Accuracy · Broad knowledge

AA-Omniscience Accuracy 2026 · updated June 2, 2026

AA-Omniscience Hallucination Rate

Artificial Analysis Omniscience Hallucination Rate

A display-only Artificial Analysis factuality metric for the rate of incorrect answers among non-correct responses.

Knowledge questions · Hallucination rate · Factuality

AA-Omniscience Hallucination Rate 2026 · updated June 2, 2026
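
The accuracy and hallucination-rate definitions above can be made concrete with a small sketch. It assumes every response is graded as correct, incorrect, or abstained, so that "non-correct" means incorrect plus abstained; that three-way split, the function names, and the counts are simplifying assumptions rather than Artificial Analysis' grading scheme.

```python
def accuracy(correct: int, incorrect: int, abstained: int) -> float:
    """Proportion of all questions answered correctly."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no responses to score")
    return correct / total

def hallucination_rate(correct: int, incorrect: int, abstained: int) -> float:
    """Share of non-correct responses that were wrong answers, not abstentions."""
    non_correct = incorrect + abstained
    if non_correct == 0:
        return 0.0  # nothing was missed, so nothing was hallucinated
    return incorrect / non_correct

# Hypothetical tallies for one model, for illustration only.
acc = accuracy(correct=60, incorrect=10, abstained=30)
rate = hallucination_rate(correct=60, incorrect=10, abstained=30)
```

The two metrics can move independently: a model that guesses on every unknown question may gain a little accuracy while pushing its hallucination rate toward 1.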

SimpleQA

Measuring Short-Form Factuality in Large Language Models

A benchmark that evaluates the ability of language models to answer short, fact-seeking questions accurately. Focuses on factual correctness rather than reasoning complexity.

Factual questions · Short-form Q&A · Factual accuracy focused

Weighted 13%

SimpleQA 2024 · updated June 2, 2026

Chinese-SimpleQA

Chinese-SimpleQA

A Chinese short-form factuality benchmark reported by DeepSeek for V4 model evaluations.

Chinese factual questions · Short-form factual QA · Factual accuracy focused

Chinese-SimpleQA 2026 · updated June 2, 2026

OpenBookQA

2018

OpenBookQA

A science question-answering benchmark that tests whether models can apply a small open-book set of elementary science facts to multi-step reasoning questions.

Elementary science questions · 4-way multiple choice · Elementary science reasoning

OpenBookQA 2018 · updated June 2, 2026

HealthBench Hard

HealthBench Hard

A harder subset of OpenAI's HealthBench for evaluating open-ended medical and health reasoning with rubric-based grading.

1,000 health prompts · Open-ended health evaluation · Advanced health reasoning

HealthBench Hard 2026 · updated June 2, 2026

MedXpertQA (Text)

MedXpertQA Text

A medical multiple-choice benchmark spanning many specialties with 10 answer options per question.

2,450 medical multiple-choice questions · Medical MCQ · Professional medical knowledge

MedXpertQA (Text) 2026 · updated June 2, 2026

FrontierScience Research

FrontierScience Research

A research-focused FrontierScience evaluation variant for scientific investigation and problem solving.

Scientific research problems · Research evaluation · Frontier scientific research

FrontierScience Research 2026 · updated June 2, 2026

TruthfulQA

2021

TruthfulQA

A benchmark designed to measure whether language models produce truthful answers instead of repeating common misconceptions or misleading falsehoods.

Truthfulness and misconception resistance · Question answering · Hallucination and factuality stress test

TruthfulQA 2021 · updated June 2, 2026

HLE w/o tools

Humanity's Last Exam without tools

Tool-free variant of Humanity's Last Exam that isolates a model's raw frontier reasoning.

Expert-level questions · Tool-free expert QA · Frontier expert level

HLE w/o tools 2026 · updated June 2, 2026

MMLU-Pro (Arcee)

MMLU-Pro first-party comparison snapshot

A display-only MMLU-Pro reference from Arcee AI's Trinity-Large-Thinking launch chart.

Professional academic QA · 10-way multiple choice · Professional level

MMLU-Pro (Arcee) 2026 · updated June 2, 2026

MMLU-Redux

MMLU-Redux

A harder refresh of MMLU intended to keep broad knowledge evaluation useful after the original benchmark became too easy for frontier models.

Broad academic QA · Multiple choice questions · Advanced general knowledge

MMLU-Redux 2026 · updated June 2, 2026

MMMLU

MMMLU

A multilingual MMLU-style benchmark reported in provider evaluation tables.

Multilingual academic QA · Exact match · Broad multilingual knowledge

MMMLU 2026 · updated June 2, 2026

C-Eval

C-Eval

A Chinese-language academic and professional benchmark spanning humanities, social science, STEM, and applied subjects.

Chinese academic and professional exams · Multiple choice questions · High school to professional level

C-Eval 2023 · updated June 2, 2026

CMMLU

Chinese Massive Multitask Language Understanding

A Chinese multitask academic benchmark reported in DeepSeek-V4 base-model evaluations.

Chinese academic QA · Exact match · Broad Chinese knowledge

CMMLU 2026 · updated June 2, 2026

MultiLoKo

MultiLoKo

A multilingual/localized knowledge benchmark reported in DeepSeek-V4 base-model evaluations.

Localized multilingual knowledge questions · Exact match · Multilingual knowledge

MultiLoKo 2026 · updated June 2, 2026

FACTS Parametric

FACTS Parametric

A parametric factuality benchmark reported in DeepSeek-V4 base-model evaluations.

Parametric factual recall · Exact match · Factual accuracy focused

FACTS Parametric 2026 · updated June 2, 2026

TriviaQA

TriviaQA

A reading and trivia question-answering benchmark reported in DeepSeek-V4 base-model evaluations.

Trivia and reading-comprehension QA · Exact match · General factual QA

TriviaQA 2026 · updated June 2, 2026
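
The Knowledge weights listed above (GPQA 12%, SuperGPQA 12%, MMLU-Pro 22%, HLE 23%, FrontierScience 18%, SimpleQA 13%) sum to 100%. The directory does not state how the weights are combined, so the sketch below assumes a plain weighted mean that renormalizes over whichever weighted benchmarks a model has scores for. The function name and the example scores are hypothetical.

```python
# Knowledge weights as listed in this directory (percent).
KNOWLEDGE_WEIGHTS = {
    "GPQA": 12, "SuperGPQA": 12, "MMLU-Pro": 22,
    "HLE": 23, "FrontierScience": 18, "SimpleQA": 13,
}

def weighted_category_score(scores: dict, weights: dict) -> float:
    """Weighted mean over the weighted benchmarks that have a score."""
    available = {name: w for name, w in weights.items() if name in scores}
    total_weight = sum(available.values())
    if total_weight == 0:
        raise ValueError("no weighted benchmark has a score")
    return sum(scores[name] * w for name, w in available.items()) / total_weight

# Hypothetical scores on a 0-100 scale, for illustration only.
example_scores = {"GPQA": 80.0, "MMLU-Pro": 85.0, "HLE": 30.0, "SimpleQA": 50.0}
category_score = weighted_category_score(example_scores, KNOWLEDGE_WEIGHTS)
```

Display-only entries carry no weight and therefore never enter the sum. Renormalizing over available scores is one design choice among several; treating a missing score as zero would penalize models with sparse coverage instead.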

Multilingual (8 benchmarks)

MGSM

2022

Multilingual Grade School Math

A multilingual benchmark that translates 250 grade school math problems from GSM8K into 10 typologically diverse languages: Bengali, German, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese.

250 problems × 11 languages (English plus 10 translations) · Math word problems · Grade school math, multilingual

Weighted 35%

MGSM 2022 · updated June 2, 2026

MMLU-ProX

MMLU-ProX

A multilingual extension of professional-level academic evaluation across many languages.

Multilingual professional QA · Multilingual multiple choice · Professional multilingual

Weighted 65%

MMLU-ProX 2025 · updated June 2, 2026

NOVA-63

NOVA-63

A broad multilingual benchmark row from Qwen's launch comparisons intended to measure cross-lingual capability beyond a single language family.

Broad multilingual evaluation · Cross-lingual benchmark · Broad multilingual capability

NOVA-63 2026 · updated June 2, 2026

INCLUDE

INCLUDE

A multilingual benchmark used in provider tables to measure inclusive language coverage and cross-lingual understanding beyond common high-resource languages.

Cross-lingual understanding · Multilingual benchmark · Broad multilingual capability

INCLUDE 2026 · updated June 2, 2026

PolyMath

PolyMath

A multilingual mathematical reasoning benchmark that tests whether math performance transfers across languages rather than only in English.

Multilingual math problems · Cross-lingual mathematical reasoning · Advanced multilingual reasoning

PolyMath 2026 · updated June 2, 2026

VWT2k-lite

VWT2k-lite

A lighter multilingual benchmark slice published in provider tables for broad cross-lingual transfer and understanding.

Multilingual transfer tasks · Cross-lingual benchmark · Broad multilingual capability

VWT2k-lite 2026 · updated June 2, 2026

MAXIFE

MAXIFE

A multilingual instruction-following and understanding benchmark row published in Qwen's launch comparisons.

Multilingual instruction following · Cross-lingual benchmark · Advanced multilingual instruction following

MAXIFE 2026 · updated June 2, 2026

SWE Multilingual

SWE-bench Multilingual

A multilingual extension of SWE-bench covering 300 problems across 9 programming languages, testing code generation and bug fixing beyond Python.

300 problems across 9 languages · Multi-language code patch generation · Professional multilingual software engineering

SWE Multilingual 2025 · updated June 2, 2026

Instruction Following (4 benchmarks)

IFEval

Instruction-Following Eval

A benchmark that evaluates language models' ability to follow verifiable instructions such as formatting constraints, keyword inclusion/exclusion, length limits, and structural requirements.

500+ instructions · Constrained generation · Instruction precision

Weighted 65%

IFEval 2023 · updated June 2, 2026
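
"Verifiable" here means each instruction can be checked by a program rather than by a judge model. The sketch below illustrates that idea with four hand-written checkers for length limits, keyword inclusion and exclusion, and a structural requirement. These functions are hypothetical examples, not IFEval's actual checker code or constraint list.

```python
def within_word_limit(text: str, max_words: int) -> bool:
    """Length limit: whitespace-separated word count must not exceed max_words."""
    return len(text.split()) <= max_words

def includes_keywords(text: str, keywords: list) -> bool:
    """Keyword inclusion: every keyword appears, case-insensitively."""
    lowered = text.lower()
    return all(k.lower() in lowered for k in keywords)

def excludes_keywords(text: str, keywords: list) -> bool:
    """Keyword exclusion: no forbidden keyword appears."""
    lowered = text.lower()
    return not any(k.lower() in lowered for k in keywords)

def has_bullet_count(text: str, expected: int) -> bool:
    """Structural requirement: exactly `expected` markdown-style bullet lines."""
    bullets = [ln for ln in text.splitlines() if ln.lstrip().startswith(("- ", "* "))]
    return len(bullets) == expected

def follows_all(text: str, checks: list) -> bool:
    """A response passes only if every attached check passes."""
    return all(check(text) for check in checks)

# Hypothetical prompt: "List three facts about Rust as bullets, under 20 words,
# mention cargo, and do not mention Python."
response = "- Rust is fast\n- Rust is memory safe\n- Rust ships with cargo"
checks = [
    lambda t: within_word_limit(t, 20),
    lambda t: includes_keywords(t, ["rust", "cargo"]),
    lambda t: excludes_keywords(t, ["python"]),
    lambda t: has_bullet_count(t, 3),
]
all_followed = follows_all(response, checks)
```

Because every check is deterministic, two evaluators running the same checkers on the same response will always agree, which is the property that makes this style of benchmark cheap to reproduce.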

IFBench

Instruction Following Benchmark

IFBench evaluates precise instruction-following generalization on 58 challenging, verifiable out-of-domain constraints. Unlike IFEval, which tests familiar constraint types, IFBench measures how well models follow novel instructions they have not been optimized for, exposing overfitting to common instruction patterns.

Weighted 35%

IFBench 2025 · updated June 2, 2026

AA-IFBench

Artificial Analysis IFBench

A display-only Artificial Analysis IFBench score.

Verifiable instruction constraints · Constraint satisfaction accuracy · Instruction precision

AA-IFBench 2026 · updated June 2, 2026

SOB Value Acc

Structured Output Benchmark Value Accuracy

A structured-output benchmark from Interfaze measuring whether extracted JSON leaf values exactly match verified ground truth.

Structured output extraction · Value accuracy · Production structured-output reliability

SOB Value Acc 2026 · updated June 2, 2026
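
One way to read "JSON leaf values exactly match verified ground truth" is sketched below: flatten both documents to path-and-value pairs, then count the ground-truth leaves that the prediction reproduces with the same type and value. This is an illustrative interpretation, not Interfaze's scoring code; the function names and the sample documents are hypothetical.

```python
def leaf_values(node, path=()):
    """Yield (path, value) for every scalar leaf in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_values(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from leaf_values(value, path + (index,))
    else:
        yield path, node

def value_accuracy(predicted, truth) -> float:
    """Fraction of ground-truth leaves matched exactly at the same path."""
    truth_leaves = dict(leaf_values(truth))
    if not truth_leaves:
        return 1.0
    predicted_leaves = dict(leaf_values(predicted))
    missing = object()  # sentinel so an absent path never equals a real value
    hits = 0
    for path, expected in truth_leaves.items():
        got = predicted_leaves.get(path, missing)
        # Compare types too, so that 1 and True, or 2 and 2.0, do not match.
        if type(got) is type(expected) and got == expected:
            hits += 1
    return hits / len(truth_leaves)

# Hypothetical extraction result and ground truth, for illustration only.
truth = {"invoice": {"id": "A-17", "total": 42.5}, "items": [{"sku": "x1", "qty": 2}]}
predicted = {"invoice": {"id": "A-17", "total": 42.0}, "items": [{"sku": "x1", "qty": 2}]}
score = value_accuracy(predicted, truth)
```

Scoring leaves rather than whole documents gives partial credit: a single wrong field lowers the score without zeroing an otherwise correct extraction.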

Mathematics (23 benchmarks)

AIME 2023

American Invitational Mathematics Examination 2023

A 15-question, 3-hour examination where each answer is an integer from 000 to 999. Serves as the intermediate step between AMC 10/12 and the USA Mathematical Olympiad (USAMO).

15 problems · Integer answers 000-999 · High school olympiad level

AIME 2023 2023 · updated June 2, 2026

AIME 2024

American Invitational Mathematics Examination 2024

The 2024 edition of AIME, maintaining the same format of 15 challenging mathematics problems with integer answers from 000 to 999.

15 problems · Integer answers 000-999 · High school olympiad level

AIME 2024 2024 · updated June 2, 2026

AIME 2025

American Invitational Mathematics Examination 2025

The 2025 edition of AIME, featuring 15 challenging mathematics problems that test olympiad-level mathematical reasoning with integer answers from 000 to 999.

15 problems · Integer answers 000-999 · High school olympiad level

AIME 2025 · updated June 2, 2026

GSM8K

Grade School Math 8K

A grade-school mathematical reasoning benchmark reported in DeepSeek-V4 base-model evaluations.

Grade-school math word problems · Exact match · Grade-school math

GSM8K 2026 · updated June 2, 2026

MATH

MATH

A competition-style mathematical reasoning benchmark reported in DeepSeek-V4 base-model evaluations.

Competition math problems · Exact match · Advanced math reasoning

MATH 2026 · updated June 2, 2026

CMath

CMath

A Chinese mathematical reasoning benchmark reported in DeepSeek-V4 base-model evaluations.

Chinese math problems · Exact match · Math reasoning

CMath 2026 · updated June 2, 2026

AIME25 (Arcee)

AIME25 first-party comparison snapshot

A display-only AIME25 reference from Arcee AI's Trinity-Large-Thinking launch chart.

15 problems · Integer answers 000-999 · High school olympiad level

AIME25 (Arcee) 2026 · updated June 2, 2026

HMMT Feb 2023

Harvard-MIT Mathematics Tournament February 2023

A prestigious high school mathematics competition hosted jointly by Harvard and MIT, featuring challenging problems across various mathematical disciplines.

Tournament problems · Competition mathematics · High school olympiad level

HMMT Feb 2023 2023 · updated June 2, 2026

HMMT Feb 2024

Harvard-MIT Mathematics Tournament February 2024

The 2024 February edition of the Harvard-MIT Mathematics Tournament, continuing the tradition of challenging high school mathematics competition.

Tournament problems · Competition mathematics · High school olympiad level

HMMT Feb 2024 2024 · updated June 2, 2026

HMMT Feb 2025

Harvard-MIT Mathematics Tournament February 2025

The February 2025 edition of the Harvard-MIT Mathematics Tournament, featuring challenging problems in competitive mathematics.

Tournament problems · Competition mathematics · High school olympiad level

HMMT Feb 2025 2025 · updated June 2, 2026

BRUMO 2025

Brown University Mathematics Olympiad 2025

A challenging mathematical olympiad competition featuring problems that test advanced mathematical reasoning and problem-solving skills at the olympiad level.

Olympiad problems · Mathematical olympiad · Mathematical olympiad level

BRUMO 2025 2025 · updated June 2, 2026

MATH-500

2021

MATH-500 Problem Set

A curated subset of 500 problems from the MATH dataset, covering algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus.

500 problems · Free-form mathematical answers · High school to undergraduate

Weighted 15%

MATH-500 2021 · updated June 2, 2026

AIME26

AIME 2026

A 2026 American Invitational Mathematics Examination snapshot used in frontier-model comparison tables for mathematical reasoning.

Competition math problems · Short-answer mathematics · Olympiad-style mathematics

AIME26 2026 · updated June 2, 2026

IPhO 2025 (Theory)

International Physics Olympiad 2025 (Theory)

The three official theory problems from the 2025 International Physics Olympiad, scored with blinded human evaluation.

3 olympiad theory problems · Physics olympiad theory · International olympiad physics

IPhO 2025 (Theory) 2026 · updated June 2, 2026

HMMT Feb 2025

Harvard-MIT Mathematics Tournament February 2025

A February 2025 HMMT slice used in exact-value provider tables for advanced contest-math reasoning.

Competition math problems · Contest mathematics · Olympiad-style mathematics

HMMT Feb 2025 2025 · updated June 2, 2026

HMMT Nov 2025

Harvard-MIT Mathematics Tournament November 2025

A November 2025 HMMT slice for high-end mathematical reasoning comparisons.

Competition math problems · Contest mathematics · Olympiad-style mathematics

HMMT Nov 2025 2025 · updated June 2, 2026

HMMT Feb 2026

Harvard-MIT Mathematics Tournament February 2026

A February 2026 HMMT slice used in newer frontier-model math comparisons.

Competition math problems · Contest mathematics · Olympiad-style mathematics

HMMT Feb 2026 2026 · updated June 2, 2026

IMOAnswerBench

IMOAnswerBench

A challenging mathematical reasoning benchmark reported in DeepSeek-V4 model evaluations.

Advanced mathematical answer generation · Pass@1 math benchmark · Olympiad-level mathematics

IMOAnswerBench 2026 · updated June 2, 2026

Apex

Apex

A high-difficulty mathematical reasoning benchmark reported in DeepSeek-V4 model evaluations.

Advanced mathematical reasoning · Pass@1 math benchmark · Frontier math reasoning

Apex 2026 · updated June 2, 2026

Apex Shortlist

Apex Shortlist

A shortlist subset of the Apex mathematical reasoning benchmark reported in DeepSeek-V4 model evaluations.

Advanced mathematical reasoning · Pass@1 math benchmark · Frontier math reasoning

Apex Shortlist 2026 · updated June 2, 2026

MMAnswerBench

MMAnswerBench

A multimodal mathematical reasoning benchmark that tests whether models can answer visually grounded math questions correctly.

Multimodal math questions · Visual and structured mathematical QA · Advanced mathematical reasoning

MMAnswerBench 2026 · updated June 2, 2026

FrontierMath

FrontierMath

An expert-level mathematical reasoning benchmark by Epoch AI featuring original, research-level problems created by mathematicians including IMO gold medalists and Fields Medal recipients. Problems require deep creativity and multi-step reasoning.

350 original research-level math problems · Open-ended mathematical reasoning with tool access · Research-level mathematics

Weighted 35%

FrontierMath 2024 · updated June 2, 2026

USAMO 2026

United States of America Mathematical Olympiad 2026

Vals-hosted MMLU-Pro mirror

Vals AI hosted MMLU-Pro view with subject-level task splits.

MMLU-Pro subject splits · Accuracy score · Professional academic reasoning