
AI Benchmarks Directory

Explore 144 benchmarks used to evaluate AI language models across 8 categories.

Agentic (29 benchmarks)

View leaderboard

Terminal-Bench 2.0

2026

Terminal-Bench 2.0

A benchmark for agentic software engineering tasks executed in real terminal environments. Models must inspect files, run commands, edit code, and recover from errors over multi-step workflows.

Current
Terminal-based software tasks · Interactive CLI agent evaluation · Professional software engineering
Weighted 28%

Terminal-Bench 2 · updated April 20, 2026

BrowseComp

2025

BrowseComp

A benchmark for web-browsing agents that must search, inspect sources, gather evidence, and return the correct answer to research-oriented questions.

Current
Research questions requiring browsing · Web search and evidence synthesis · Hard web research
Weighted 18%

BrowseComp 2026 · updated April 20, 2026

OSWorld-Verified

2025

OSWorld-Verified

A verified subset of OSWorld focused on computer-use tasks in desktop-like environments, including navigation, editing, and workflow completion.

Current
Desktop and GUI tasks · Interactive computer-use evaluation · Complex multi-step workflows
Weighted 24%

OSWorld Verified · updated April 20, 2026

BrowseComp-VL

2026

BrowseComp-VL

A vision-language browsing benchmark for multimodal web research and tool-use workflows.

Current · Display only
Multimodal browsing tasks · Vision-language web research evaluation · Multimodal browser-agent
Display only

BrowseComp-VL 2026 · updated April 20, 2026

OSWorld

2026

OSWorld

A computer-use benchmark for GUI task completion across the broader OSWorld task suite.

Current · Display only
Computer-use tasks · Interactive GUI evaluation · Broad computer-use suite
Display only

OSWorld 2026 · updated April 20, 2026

AndroidWorld

2026

AndroidWorld

A mobile GUI agent benchmark for completing Android app workflows and on-device tasks.

Current · Display only
Android app workflows · Interactive mobile-agent evaluation · Complex mobile task completion
Display only

AndroidWorld 2026 · updated April 20, 2026

WebVoyager

2026

WebVoyager

A browser-agent benchmark for completing multi-step workflows on live websites.

Current · Display only
Live website workflows · Interactive browser-agent evaluation · Multi-step web navigation
Display only

WebVoyager 2026 · updated April 20, 2026

MCP Atlas

2026

MCP Atlas

A benchmark for tool-calling over Model Context Protocol integrations and external tools.

Current · Display only
Tool-integrated agent tasks · Interactive tool-calling evaluation · Advanced tool use
Display only

MCP Atlas 2026 · updated April 20, 2026

Toolathlon

2026

Toolathlon

A tool-use benchmark focused on selecting, sequencing, and completing tasks with external tools.

Current · Display only
Multi-tool workflows · Interactive tool-calling evaluation · Advanced tool use
Display only

Toolathlon 2026 · updated April 20, 2026

ZClawBench

2026

ZClawBench

A Z.AI benchmark for OpenClaw-style agent workflows spanning information search, office work, data analysis, development and operations, automation, and security.

Current · Display only
OpenClaw agent workflows · End-to-end agent benchmark · Broad productivity and operations workflows
Display only

ZClawBench 2026 · updated April 20, 2026

Tau2-Telecom

2026

Tau2-Telecom

A telecom-oriented tool benchmark that measures structured tool use in domain workflows.

Current · Display only
Telecom tool workflows · Domain-specific tool evaluation · Professional workflow
Display only

τ²-Bench 2026 · updated April 20, 2026

DeepSearchQA

2026

DeepSearchQA

An agentic browsing benchmark where models search the web, gather evidence, and answer list-style questions using browser tools.

Current · Display only
Agentic browsing and list-answer questions · Search / open / find browser-agent evaluation · Agentic web research
Display only

DeepSearchQA 2026 · updated April 20, 2026

Tau2-Airline

2026

Tau2-Airline

An airline-domain tool-use benchmark for structured workflow execution and API correctness.

Current · Display only
Airline support workflows · Domain-specific tool evaluation · Professional workflow
Display only

Tau2-Airline 2026 · updated April 20, 2026

PinchBench

2026

PinchBench

An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows.

Current · Display only
23 OpenClaw agent tasks · Average success rate from official runs · Long-horizon agent workflows
Display only

PinchBench 2026 · updated April 20, 2026

BFCL v4

2026

Berkeley Function Calling Leaderboard v4

A function-calling benchmark for tool selection, schema adherence, and argument correctness (a minimal schema-check sketch follows this entry).

Current · Display only
Function-calling tasks · Tool invocation and schema evaluation · Advanced tool use
Display only

BFCL v4 2026 · updated April 20, 2026
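
To make "schema adherence" concrete, here is a minimal, hypothetical sketch of the kind of check such function-calling evaluations score. The get_weather tool spec and the model call are invented for illustration; they are not BFCL data, and BFCL's actual harness differs.

```python
# Hypothetical tool spec and model call, invented for illustration.
TOOL_SPEC = {
    "name": "get_weather",
    "parameters": {
        "city": {"type": str, "required": True},
        "unit": {"type": str, "required": False},
    },
}

def call_matches_spec(call: dict, spec: dict) -> bool:
    """Check tool selection, required arguments, and argument types."""
    if call.get("name") != spec["name"]:  # wrong tool selected
        return False
    params = spec["parameters"]
    args = call.get("arguments", {})
    for name, rules in params.items():  # all required arguments present?
        if rules["required"] and name not in args:
            return False
    for name, value in args.items():  # no unknown arguments, correct types
        if name not in params or not isinstance(value, params[name]["type"]):
            return False
    return True

print(call_matches_spec(
    {"name": "get_weather", "arguments": {"city": "Paris"}}, TOOL_SPEC))  # True
```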

MLE-Bench Lite

2026

MLE-Bench Lite

A lightweight machine-learning competition benchmark that measures whether models can iteratively train, evaluate, and improve ML systems in low-resource settings.

Current · Display only
Low-resource ML competitions · Autonomous iterative ML optimization · Agentic machine learning
Display only

MLE-Bench Lite 2026 · updated April 20, 2026

MM-ClawBench

2026

MM-ClawBench

An OpenClaw-derived agent benchmark covering practical work and life tasks such as office document delivery, research, planning, and code maintenance.

Current · Display only
OpenClaw-style real-world tasks · Agent workflow evaluation · Broad real-world agentic execution
Display only

MM-ClawBench 2026 · updated April 20, 2026

Claw-Eval

2026

Claw-Eval

An end-to-end real-world agent benchmark for OpenClaw-style workflows spanning tool use, planning, execution, and recovery across practical tasks.

Current · Display only
Real-world agent workflows · End-to-end agent evaluation · Broad real-world agentic execution
Display only

Claw-Eval 2026 · updated April 20, 2026

QwenClawBench

2026

QwenClawBench

Qwen's internal OpenClaw-style benchmark for measuring broad real-world agent performance across practical productivity and research tasks.

Current · Display only
Real-world agent workflows · End-to-end agent evaluation · Broad real-world agentic execution
Display only

QwenClawBench 2026 · updated April 20, 2026

QwenWebBench

2026

QwenWebBench

A Qwen benchmark for artifact and webpage generation quality, reported as an Elo-style rating (a minimal Elo sketch follows this entry).

Current · Display only
Web artifacts and interactive deliverables · Elo-style artifact benchmark · Artifact generation
Display only

QwenWebBench 2026 · updated April 20, 2026
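
For readers unfamiliar with "Elo-style" reporting, the sketch below shows the standard Elo update from pairwise preferences. QwenWebBench's exact rating procedure is not documented here, so the K-factor and starting ratings are assumptions.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One pairwise Elo update; score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if B wins."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two artifacts start at an assumed 1000 rating; A's webpage is preferred.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```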

TAU3-Bench

2026

TAU3-Bench

A next-generation tool-use benchmark for complex, long-horizon agent workflows beyond the older tau2 telecom and airline task families.

Current · Display only
Long-horizon tool workflows · Interactive tool-use evaluation · Advanced tool use
Display only

TAU3-Bench 2026 · updated April 20, 2026

VITA-Bench

2025

VITA-Bench

An interactive real-world agent benchmark grounded in practical consumer-service tasks such as delivery, in-store consumption, and online travel workflows.

Current · Display only
Interactive consumer-service agent tasks · End-to-end interactive agent evaluation · Long-horizon real-world workflows
Display only

VITA-Bench 2025 · updated April 20, 2026

DeepPlanning

2026

DeepPlanning

A long-horizon planning benchmark that tests whether agents can optimize under explicit time, budget, and feasibility constraints.

Current · Display only
Travel planning and constrained shopping · Long-horizon planning benchmark · Constrained agent planning
Display only

DeepPlanning 2026 · updated April 20, 2026

MCP-Tasks

2026

MCP-Tasks

A Model Context Protocol task benchmark used in Qwen's launch tables to measure practical execution over MCP-style tools and integrations.

Current · Display only
MCP-integrated tool tasks · Interactive tool-use evaluation · Advanced MCP workflows
Display only

MCP-Tasks 2026 · updated April 20, 2026

WideResearch

2026

WideResearch

A broad research-agent benchmark for open-ended information gathering, synthesis, and answer construction across wide search spaces.

Current · Display only
Open-ended research tasks · Multi-source research evaluation · Broad research-agent workflows
Display only

WideResearch 2026 · updated April 20, 2026

GAIA

2024

General AI Assistants

GAIA evaluates AI models on real-world tasks that are conceptually simple for humans but demand multi-step reasoning, web browsing, tool use, and multimodal understanding from an AI system. Tasks span three difficulty levels and test practical assistant capabilities rather than academic knowledge.

Refreshing
466 tasks
Weighted 12%

GAIA 2024 · updated April 20, 2026

TAU-bench

2024

Tool-Agent-User Benchmark

TAU-bench evaluates AI agents in realistic enterprise scenarios requiring multi-turn tool use, database interactions, and policy adherence. It tests across retail and airline domains, measuring an agent's ability to reliably complete customer service tasks while following complex business rules.

Refreshing
680 tasks
Weighted 10%

TAU-bench 2024 · updated April 20, 2026

WebArena

2024

WebArena Web Agent Benchmark

WebArena is a realistic web environment for evaluating autonomous AI agents on complex, multi-step browser tasks. Agents must navigate e-commerce sites, forums, content management systems, and code repositories to complete practical objectives like purchasing items, finding information, and managing accounts.

Refreshing
812 tasks
Weighted 8%

WebArena 2024 · updated April 20, 2026

MEWC

2026

Multi-Environment Web Challenge

A benchmark that evaluates AI agents on navigation and task completion across diverse live web environments.

Current · Display only
Web-agent tasks · Browser task completion · Open-web agent workflows
Display only

MEWC 2026 · updated April 20, 2026

Coding (15 benchmarks)

View leaderboard

HumanEval

2021

Evaluating Large Language Models Trained on Code

A set of 164 handwritten programming problems that test the ability to generate correct Python functions from natural-language descriptions. Each problem includes a function signature, a docstring, a reference body, and several unit tests (a toy example of the format follows this entry).

Stale · Saturated · Display only
164 problems · Python function generation · Introductory to intermediate programming
Display only

HumanEval · updated April 20, 2026
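
The sketch below shows what a HumanEval-style problem looks like: a prompt made of a signature and docstring, a reference body, and unit tests that decide functional correctness. This toy problem is invented; it is not one of the 164 benchmark items.

```python
def add_positive(numbers: list[int]) -> int:
    """Return the sum of the strictly positive integers in numbers.

    >>> add_positive([1, -2, 3])
    4
    """
    # Reference body; the model sees only the signature and docstring.
    return sum(n for n in numbers if n > 0)

def check(candidate):
    # HumanEval-style unit tests: all must pass for the sample to count.
    assert candidate([1, -2, 3]) == 4
    assert candidate([]) == 0
    assert candidate([-1, -5]) == 0

check(add_positive)
```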

SWE-bench Verified

2024

Software Engineering Benchmark Verified

A curated, human-verified subset of SWE-bench that tests models on resolving real GitHub issues from popular open-source Python repositories like Django, Flask, and scikit-learn.

Refreshing
500 verified issues · Code patch generation · Professional software engineering
Weighted 13%

SWE-bench Verified 2024 · updated April 20, 2026

SWE-Rebench

2026

SWE-Rebench

A continuously updated software engineering benchmark by Nebius that uses fresh GitHub issues to avoid contamination. Models are evaluated five times per problem under a fixed ReAct scaffold, and the Resolved Rate (best pass@1) is reported (a minimal aggregation sketch follows this entry).

Current
Fresh GitHub issues (rolling window) · Code patch generation · Professional software engineering
Weighted 31%

Rolling 2026 window · updated April 20, 2026
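
A minimal sketch of aggregating five runs per problem into a headline number, under the setup described above. The per-run booleans are invented data, and SWE-Rebench's exact aggregation may differ.

```python
# problem id -> resolved? across 5 independent runs (invented data)
runs = {
    "repo-a#101": [True, False, True, True, False],
    "repo-b#57":  [False, False, False, False, False],
}

def mean_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average per-problem success rate across runs."""
    return sum(sum(r) / len(r) for r in results.values()) / len(results)

def resolved_any(results: dict[str, list[bool]]) -> float:
    """Fraction of problems solved in at least one run."""
    return sum(any(r) for r in results.values()) / len(results)

print(mean_pass_at_1(runs), resolved_any(runs))  # 0.3 0.5
```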

LiveCodeBench

2024

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

A continuously updated benchmark using fresh competitive programming problems from LeetCode, Codeforces, and AtCoder to provide contamination-free code generation evaluation.

Current
Continuously updated · Competitive programming · Competitive programming level
Weighted 23%

Rolling 2026 set · updated April 20, 2026

LiveCodeBench v6

2026

LiveCodeBench v6

A newer LiveCodeBench slice used in provider comparison tables to benchmark contamination-resistant coding performance on fresher competitive programming sets.

Current · Display only
Fresh programming problems · Competitive programming · Competitive programming level
Display only

LiveCodeBench v6 2026 · updated April 20, 2026

LiveCodeBench Pro

2025

LiveCodeBench Pro

A harder competitive-programming benchmark family built from Codeforces, ICPC, and IOI problems, with quarter-specific public leaderboards and difficulty-aware reporting.

Current · Display only
Quarter-specific contest programming sets · Competitive programming · High-end contest programming
Display only

LiveCodeBench Pro 2025 · updated April 20, 2026

FLTEval

2026

FLTEval

A repository-level Lean 4 proof engineering benchmark that measures whether a model can complete formal proofs and correctly define new mathematical concepts inside realistic FLT project pull requests.

Current · Display only
FLT project pull requests · Lean 4 repository task completion · Formal verification / proof engineering
Display only

FLTEval 2026 · updated April 20, 2026

SWE-bench Pro

2026

SWE-bench Pro

A more demanding coding-agent benchmark than SWE-bench Verified, intended to differentiate frontier models on realistic software engineering work.

Current
Real-world software engineering · Repository task completion · Frontier coding agent
Weighted 23%

SWE-bench Pro 2026 · updated April 20, 2026

SWE Multilingual

2026

SWE Multilingual

A multilingual software-engineering benchmark for real-world code issue resolution across multiple programming languages.

Current · Display only
Multilingual software-engineering tasks · Repository task completion · Professional software engineering
Display only

SWE Multilingual 2026 · updated April 20, 2026

Multi-SWE Bench

2026

Multi-SWE Bench

A multi-language software-engineering benchmark that measures repository-level bug fixing and implementation across multiple programming ecosystems.

Current · Display only
Multi-language repo tasks · Repository task completion · Professional software engineering
Display only

Multi-SWE Bench 2026 · updated April 20, 2026

VIBE-Pro

2026

VIBE-Pro

A repo-level code generation and full-project delivery benchmark spanning web, mobile, and simulation-style implementation tasks.

Current · Display only
Full project delivery tasks · Repository-level implementation benchmark · End-to-end software delivery
Display only

VIBE-Pro 2026 · updated April 20, 2026

NL2Repo

2026

NL2Repo

A repository-understanding benchmark that measures whether models can map natural-language requests onto the right code locations and system changes.

Current · Display only
Natural language to repository tasks · Repository understanding benchmark · System-level software comprehension
Display only

NL2Repo 2026 · updated April 20, 2026

React Native Evals

2026

React Native Evals

An open benchmark for AI coding agents on real-world React Native implementation tasks, emphasizing working app behavior, recommended architecture choices, and strict constraint adherence.

Current · Display only
React Native app implementation tasks · Framework-specific app development evaluation · Production mobile app engineering
Display only

React Native Evals 2026 · updated April 20, 2026

SWE-bench Verified*

2026

SWE-bench Verified (mini-swe-agent-v2)

A display-only SWE-bench Verified reference from Arcee AI's Trinity-Large-Thinking comparison chart.

Current · Display only
Repository task completion · Agent scaffold benchmark · Professional software engineering
Display only

SWE-bench Verified* 2026 · updated April 20, 2026

SciCode

2024

Scientific Code Benchmark

SciCode evaluates language models on generating code for realistic scientific research problems across 16 subfields of physics, math, chemistry, biology, and materials science. Problems decompose into 338 subproblems requiring domain knowledge recall, scientific reasoning, and precise code synthesis. Based on real scripts from published research.

Refreshing
80 main problems
Weighted 10%

SciCode 2024 · updated April 20, 2026

Reasoning (14 benchmarks)

View leaderboard

MuSR

2023

Testing the Limits of Chain-of-thought with Multistep Soft Reasoning

A dataset for evaluating language models on multistep soft reasoning tasks specified in natural language narratives. Tests the ability to perform complex, structured reasoning.

Stale
Multi-step reasoning · Narrative-based reasoning · Complex reasoning tasks
Weighted 20%

MuSR 2023 · updated April 20, 2026

BBH

2022

BIG-Bench Hard

A suite of 23 challenging tasks from the BIG-Bench collaborative benchmark where prior language models failed to exceed average human performance, even with chain-of-thought prompting.

Stale · Saturated · Display only
23 tasks · Mixed reasoning tasks · Advanced reasoning
Display only

BBH 2022 · updated April 20, 2026

LisanBench

2026

LisanBench

A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth: models must extend a chain of words in which each word is one edit away from the previous word and no word repeats (a minimal validity check follows this entry).

Current · Display only
50 starting words × 3 trials · Difficulty-weighted word-chain reasoning · Open-ended lexical planning
Display only

LisanBench 2026 · updated April 20, 2026
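
A minimal sketch of the chain-validity check implied by the rules above: each step must be exactly one edit (insertion, deletion, or substitution) from the previous word, and no word may repeat. Dictionary membership is omitted, and the scoring details are assumptions.

```python
def edit_distance_one(a: str, b: str) -> bool:
    """True if b differs from a by exactly one insertion, deletion, or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):  # substitution of a single character
        return sum(x != y for x, y in zip(a, b)) == 1
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    # Deleting one character from the longer word must recover the shorter one.
    return any(shorter == longer[:i] + longer[i + 1:] for i in range(len(longer)))

def valid_chain(words: list[str]) -> bool:
    no_repeats = len(set(words)) == len(words)
    linked = all(edit_distance_one(a, b) for a, b in zip(words, words[1:]))
    return no_repeats and linked

print(valid_chain(["cat", "cot", "coat", "goat"]))  # True
```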

LongBench v2

2025

LongBench v2

A long-context benchmark that measures whether models can actually use extended context windows for reasoning and retrieval.

Current
Long-context tasks · Extended-context retrieval and reasoning · Hard long-context
Weighted 30%

LongBench v2 2025 · updated April 20, 2026

MRCRv2

2025

MRCRv2

A long-context benchmark for memory, retrieval, and multi-round coherence over large contexts.

Current
Long-context retrieval · Multi-round long-context evaluation · Hard long-context
Weighted 25%

MRCRv2 2025 · updated April 20, 2026

MRCR v2 64K-128K

2026

OpenAI MRCR v2 8-needle 64K-128K

An MRCR v2 slice focused on long-context retrieval at 64K-128K lengths.

Current · Display only
8-needle retrieval tasks · Long-context retrieval · Long-context reasoning
Display only

MRCR v2 64K-128K 2026 · updated April 20, 2026

MRCR v2 128K-256K

2026

OpenAI MRCR v2 8-needle 128K-256K

An MRCR v2 slice focused on very long contexts at 128K-256K lengths.

Current · Display only
8-needle retrieval tasks · Very-long-context retrieval · Very long-context reasoning
Display only

MRCR v2 128K-256K 2026 · updated April 20, 2026

Graphwalks BFS 128K

2026

Graphwalks BFS 0K-128K

A long-context graph traversal benchmark built on breadth-first search tasks (a minimal BFS sketch follows this entry).

Current · Display only
Graph traversal tasks · Long-context graph reasoning · Algorithmic long-context reasoning
Display only

Graphwalks BFS 128K 2026 · updated April 20, 2026
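
A minimal sketch of the breadth-first traversal such tasks ask for: given an adjacency list embedded somewhere in a long context, report each node's depth from a start node. The tiny graph is invented for illustration.

```python
from collections import deque

def bfs_depths(graph: dict[str, list[str]], start: str) -> dict[str, int]:
    """Map each node reachable from start to its BFS depth."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in depths:  # first visit is the shortest depth
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs_depths(graph, "a"))  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```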

Graphwalks Parents 128K

2026

Graphwalks parents 0-128K

A long-context benchmark for recovering parent relationships inside graph tasks.

Current · Display only
Graph parent-retrieval tasks · Long-context graph reasoning · Algorithmic long-context reasoning
Display only

Graphwalks Parents 128K 2026 · updated April 20, 2026

ARC-AGI-2

2025

Abstraction and Reasoning Corpus for AGI v2

A benchmark measuring fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. Considered the hardest public reasoning benchmark — human average is 60%.

Current
Visual pattern completion and abstract reasoning · Grid transformation puzzles with novel rules · Expert level; hardest public reasoning benchmark
Weighted 25%

ARC-AGI 2 · updated April 20, 2026

AI-Needle

2026

AI-Needle

A long-context retrieval benchmark that measures whether a model can recover relevant information embedded deep inside very long contexts.

Current · Display only
Long-context retrieval · Needle-in-a-haystack recall · Long-context memory
Display only

AI-Needle 2026 · updated April 20, 2026

GPQA Diamond

2023

GPQA Diamond

The hardest subset of GPQA featuring the most challenging graduate-level science questions. Sometimes reported separately from the standard GPQA benchmark.

Stale · Display only
Expert-level science questions · Multiple choice questions · Graduate-level scientific reasoning
Display only

GPQA Diamond 2023 · updated April 20, 2026

BullshitBench v2

2025

BullshitBench v2

A benchmark that tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. Measures the critical ability to push back on bad input.

Current · Display only
Nonsensical and flawed prompts across multiple domains · Prompt challenge and refusal evaluation · Robustness and critical reasoning
Display only

BullshitBench v2 2025 · updated April 20, 2026

WildBench

2024

WildBench

An automated evaluation framework using 1,000+ real-world user tasks covering reasoning, planning, coding, and creative writing. Highly correlated with Chatbot Arena human preference rankings.

Refreshing · Display only
1,024 real-world tasks · Real-world task evaluation · Diverse real-world scenarios
Display only

WildBench 2024 · updated April 20, 2026

Multimodal & Grounded (41 benchmarks)

View leaderboard

MMMU

2024

Massive Multi-discipline Multimodal Understanding

A broad multimodal reasoning benchmark spanning charts, diagrams, tables, and academic visual question answering.

Refreshing · Display only
Multimodal academic reasoning · Image + text question answering · Frontier multimodal
Display only

MMMU 2024 · updated April 20, 2026

MMMU-Pro

2024

Massive Multi-discipline Multimodal Understanding Pro

A harder multimodal benchmark for frontier models that combines text with images, diagrams, charts, and academic visual reasoning tasks.

Refreshing
Multimodal academic reasoning · Image + text question answering · Frontier multimodal
Weighted 55%

MMMU-Pro 2024 · updated April 20, 2026

OfficeQA Pro

2026

OfficeQA Pro

A benchmark for grounded reasoning over office-style documents, spreadsheets, charts, and business artifacts.

Current
Document and spreadsheet tasks · Grounded QA over office artifacts · Enterprise grounded reasoning
Weighted 45%

OfficeQA Pro 2026 · updated April 20, 2026

MMMU-Pro w/ Python

2026

MMMU-Pro with Python

A tool-augmented MMMU-Pro variant that allows Python assistance during multimodal reasoning.

Current · Display only
Multimodal academic reasoning · Image + text question answering with Python · Frontier multimodal
Display only

MMMU-Pro w/ Python 2026 · updated April 20, 2026

OmniDocBench 1.5

2026

OmniDocBench 1.5

A document understanding benchmark used in frontier-model comparison tables to measure extraction and grounded reasoning quality on complex documents.

Current · Display only
Document understanding tasks · Document understanding benchmark · Grounded document reasoning
Display only

OmniDocBench 1.5 2026 · updated April 20, 2026

RealWorldQA

2026

RealWorldQA

A grounded visual QA benchmark focused on answering practical questions about real-world images and scenes.

Current · Display only
Real-world visual question answering · Image-grounded QA · General visual reasoning
Display only

RealWorldQA 2026 · updated April 20, 2026

Video-MME (with subtitle)

2026

Video-MME with subtitle

A video understanding benchmark that allows subtitle access when answering multimodal questions about videos.

Current · Display only
Video understanding · Video QA with subtitle context · Multimodal video reasoning
Display only

Video-MME (with subtitle) 2026 · updated April 20, 2026

Video-MME (w/o subtitle)

2026

Video-MME without subtitle

A stricter Video-MME setting that removes subtitle help and tests video understanding from visual and audio context alone.

Current · Display only
Video understanding · Video QA without subtitle context · Multimodal video reasoning
Display only

Video-MME (w/o subtitle) 2026 · updated April 20, 2026

Video-MME

2024

Video-MME

A comprehensive benchmark for multimodal large language models on video understanding, covering temporal reasoning, perception, and question answering over videos.

Refreshing · Display only
Video understanding · Video QA and analysis · Broad multimodal video reasoning
Display only

Video-MME 2024 · updated April 20, 2026

MathVision

2026

MathVision

A visual mathematics benchmark that tests whether a model can solve math problems grounded in diagrams, equations, figures, and other visual inputs.

Current · Display only
Visually grounded math problems · Image + math reasoning · Advanced multimodal mathematics
Display only

MathVision 2026 · updated April 20, 2026

We-Math

2026

We-Math

A multimodal math benchmark for visually grounded mathematical reasoning and answer generation.

Current · Display only
Visually grounded math problems · Multimodal mathematical reasoning · Advanced multimodal mathematics
Display only

We-Math 2026 · updated April 20, 2026

DynaMath

2026

DynaMath

A multimodal benchmark for dynamic mathematical reasoning over visual and structured inputs.

Current · Display only
Dynamic visual math problems · Multimodal mathematical reasoning · Advanced multimodal mathematics
Display only

DynaMath 2026 · updated April 20, 2026

MStar

2026

MStar

A general visual question-answering benchmark used in provider tables for real-image reasoning quality.

Current · Display only
Real-image visual QA · Image-grounded QA · General visual reasoning
Display only

MStar 2026 · updated April 20, 2026

ChatCVQA

2026

ChatCVQA

A conversational visual QA benchmark that tests multi-turn grounded answering over images and documents.

Current · Display only
Conversational visual QA · Multi-turn image-grounded QA · Conversational multimodal reasoning
Display only

ChatCVQA 2026 · updated April 20, 2026

MMLongBench-Doc

2026

MMLongBench-Doc

A long-document multimodal benchmark for grounded reasoning over extended document contexts.

Current · Display only
Long document understanding · Document-grounded reasoning · Long-context document reasoning
Display only

MMLongBench-Doc 2026 · updated April 20, 2026

CC-OCR

2026

CC-OCR

An OCR-focused benchmark for reading and extracting text from visually complex documents and images.

Current · Display only
Optical character recognition · Text extraction from images and documents · Document reading
Display only

CC-OCR 2026 · updated April 20, 2026

AI2D_TEST

2026

AI2D test split

A diagram understanding benchmark focused on scientific and educational visual question answering.

Current · Display only
Diagram understanding · Diagram-grounded QA · Structured visual reasoning
Display only

AI2D_TEST 2026 · updated April 20, 2026

CountBench

2026

CountBench

A visual counting benchmark that tests whether a model can count objects and entities reliably in complex scenes.

Current · Display only
Visual counting tasks · Image-grounded counting · Fine-grained visual perception
Display only

CountBench 2026 · updated April 20, 2026

RefCOCO (avg)

2026

RefCOCO average

A referring-expression grounding benchmark averaged across the RefCOCO variants, testing whether a model can localize described objects correctly (a minimal IoU sketch follows this entry).

Current · Display only
Referring-expression grounding · Grounded visual localization · Fine-grained visual grounding
Display only

RefCOCO (avg) 2026 · updated April 20, 2026
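
Referring-expression grounding is commonly scored by intersection-over-union between the predicted and ground-truth boxes, with IoU ≥ 0.5 counted as a hit; whether this directory's RefCOCO numbers use exactly that threshold is an assumption. A minimal sketch, with boxes as (x1, y1, x2, y2):

```python
def iou(box_a: tuple, box_b: tuple) -> float:
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width and height clamp to 0 for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ≈ 0.333, below a 0.5 hit threshold
```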

ODINW13

2026

ODINW13

A visual detection and grounding benchmark slice used to compare zero-shot object understanding across diverse domains.

Current · Display only
Out-of-distribution object understanding · Detection and grounding · Robust visual grounding
Display only

ODINW13 2026 · updated April 20, 2026

ERQA

2026

ERQA

A grounded visual reasoning benchmark focused on evidence-based question answering over real images.

Current · Display only
Evidence-based visual QA · Grounded image reasoning · Grounded multimodal reasoning
Display only

ERQA 2026 · updated April 20, 2026

VideoMMMU

2026

VideoMMMU

A video extension of MMMU-style multimodal reasoning over expert questions grounded in temporal media.

Current · Display only
Video-grounded expert reasoning · Video + text reasoning · Frontier multimodal video reasoning
Display only

VideoMMMU 2026 · updated April 20, 2026

MLVU (M-Avg)

2026

MLVU mean average

A multi-task video understanding benchmark averaged across MLVU categories.

Current · Display only
General video understanding · Video QA and understanding · Broad multimodal video reasoning
Display only

MLVU (M-Avg) 2026 · updated April 20, 2026

MMVU

2026

Multimodal Multi-disciplinary Video Understanding

A benchmark for evaluating multimodal models on video understanding tasks across multiple disciplines, emphasizing temporal reasoning and comprehension over video content.

Current · Display only
Video understanding · Video reasoning benchmark · Multi-disciplinary multimodal video reasoning
Display only

MMVU 2026 · updated April 20, 2026

ScreenSpot Pro

2025

ScreenSpot Pro

A high-resolution GUI grounding benchmark for professional computer-use environments.

Current · Display only
GUI grounding tasks · Interface element localization · Professional GUI grounding
Display only

ScreenSpot Pro 2025 · updated April 20, 2026

TIR-Bench

2026

TIR-Bench

A visual agent benchmark for interface reasoning and task execution over screenshots or software surfaces.

Current · Display only
Visual agent and interface reasoning · Screenshot-grounded task reasoning · Computer-use visual reasoning
Display only

TIR-Bench 2026 · updated April 20, 2026

GDPval-AA

2026

GDPval-AA

An evaluation focused on professional domain expertise and task delivery quality in office-style knowledge work.

Current · Display only
Professional office delivery · Elo-style office benchmark · Professional knowledge work
Display only

GDPval-AA 2026 · updated April 20, 2026

MedXpertQA (MM)

2026

MedXpertQA Multimodal

A multimodal medical multiple-choice benchmark covering clinical images such as X-rays, histology, and dermatology.

Current · Display only
2,000 multimodal medical questions · Medical visual MCQ · Clinical multimodal reasoning
Display only

MedXpertQA (MM) 2026 · updated April 20, 2026

ZeroBench

2026

ZeroBench

A multi-step visual reasoning benchmark with pass@5 reporting and optional tool use (a minimal pass@k sketch follows this entry).

Current · Display only
100 visual reasoning questions · Multi-step visual reasoning · Tool-augmented visual reasoning
Display only

ZeroBench 2026 · updated April 20, 2026
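
pass@5 reporting generally rests on the unbiased pass@k estimator from the HumanEval paper: draw k of the n samples and ask whether at least one of the c correct ones is included. Whether ZeroBench uses this exact estimator is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per question, c correct, k drawn."""
    if n - c < k:  # every size-k draw contains a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 attempts, 2 correct: probability a 5-sample draw hits a correct one.
print(pass_at_k(n=10, c=2, k=5))  # ≈ 0.778
```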

Design2Code

2026

Design2Code

A multimodal coding benchmark for turning visual designs into working frontend implementations.

Current · Display only
Design-to-code tasks · Visual input to frontend implementation · Multimodal coding
Display only

Design2Code 2026 · updated April 20, 2026

Flame-VLM-Code

2026

Flame-VLM-Code

A vision-language coding benchmark for generating correct code from visual and multimodal inputs.

Current · Display only
Multimodal coding tasks · Vision-language code generation · Multimodal coding
Display only

Flame-VLM-Code 2026 · updated April 20, 2026

Vision2Web

2026

Vision2Web

A benchmark for converting visual references into functional web implementations.

Current · Display only
Screenshot-to-web tasks · Visual reference to web implementation · Multimodal web generation
Display only

Vision2Web 2026 · updated April 20, 2026

ImageMining

2026

ImageMining

A multimodal retrieval and extraction benchmark over image-heavy task settings.

Current · Display only
Visual retrieval tasks · Image-grounded retrieval and extraction · Multimodal retrieval
Display only

ImageMining 2026 · updated April 20, 2026

MMSearch

2026

MMSearch

A multimodal search benchmark for retrieval and grounded answering across mixed-media inputs.

Current · Display only
Multimodal search tasks · Mixed-media retrieval and grounded answering · Multimodal search
Display only

MMSearch 2026 · updated April 20, 2026

MMSearch-Plus

2026

MMSearch-Plus

A harder MMSearch variant for multimodal retrieval and grounded tool-use workflows.

Current · Display only
Hard multimodal search tasks · Advanced mixed-media retrieval benchmark · Advanced multimodal search
Display only

MMSearch-Plus 2026 · updated April 20, 2026

SimpleVQA

2026

SimpleVQA

A visual question answering benchmark focused on straightforward image-grounded understanding.

Current · Display only
Visual QA tasks · Image-grounded question answering · General visual understanding
Display only

SimpleVQA 2026 · updated April 20, 2026

Facts-VLM

2026

Facts-VLM

A grounded multimodal factuality benchmark for evidence-linked answer correctness.

Current · Display only
Grounded factuality tasks · Evidence-linked multimodal factuality · Grounded multimodal factuality
Display only

Facts-VLM 2026 · updated April 20, 2026

V*

2026

V*

A vision-centric benchmark for high-level multimodal reasoning and perception quality.

Current · Display only
Frontier multimodal reasoning tasks · Vision-centric reasoning benchmark · Frontier multimodal
Display only

V* 2026 · updated April 20, 2026

CharXiv

2024

CharXiv Reasoning

A scientific chart reasoning benchmark that tests whether models can understand, interpret, and reason about complex scientific visualizations including plots, diagrams, and data charts.

Refreshing · Display only
Scientific chart reasoning · Chart understanding and reasoning · Scientific visualization reasoning
Display only

CharXiv 2024 · updated April 20, 2026

CharXiv w/o tools

2024

CharXiv Reasoning without tools

A tool-free variant of CharXiv that isolates raw visual reasoning ability without code execution or tool augmentation.

Refreshing · Display only
Scientific chart reasoning (tool-free) · Chart understanding without tools · Scientific visualization reasoning
Display only

CharXiv w/o tools 2024 · updated April 20, 2026

SWE-bench Multimodal

2025

SWE-bench Multimodal

A multimodal variant of SWE-bench that adds visual context (screenshots, design mockups) to software engineering issue descriptions, testing whether models can leverage visual information for code generation.

Current · Display only
Multimodal software engineering tasks · Code patch generation with visual context · Frontier multimodal coding
Display only

SWE-bench Multimodal 2025 · updated April 20, 2026

Knowledge (18 benchmarks)

View leaderboard

MMLU

2020

Massive Multitask Language Understanding

A comprehensive multiple-choice question answering test covering 57 tasks including elementary mathematics, US history, computer science, law, and more. Tests knowledge across diverse academic subjects from high school to professional level.

Stale · Saturated · Display only
57 subjects · Multiple choice questions · Elementary to professional level
Display only

MMLU · updated April 20, 2026

GPQA

2023

Graduate-Level Google-Proof Q&A

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Designed to be difficult even for skilled non-experts with access to Google.

Refreshing
448 questions · Multiple choice questions · Graduate level
Weighted 12%

GPQA Diamond · updated April 20, 2026

GPQA-D

2026

GPQA Diamond

A display-only GPQA Diamond reference from provider comparison charts.

Current · Display only
Graduate-level science questions · Multiple choice questions · Graduate level
Display only

GPQA-D 2026 · updated April 20, 2026

SuperGPQA

2025

SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines

An expanded version of GPQA that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines, providing comprehensive coverage of academic domains.

Current
285 disciplines · Multiple choice questions · Graduate level
Weighted 12%

SuperGPQA 2025 · updated April 20, 2026

MMLU-Pro

2024

Massive Multitask Language Understanding Professional

An enhanced version of MMLU with 10 answer choices instead of 4, featuring more reasoning-focused questions that better differentiate frontier models.

Refreshing
Multiple subjects · 10-way multiple choice · Professional level
Weighted 22%

MMLU-Pro · updated April 20, 2026

HLE

2025

Humanity's Last Exam

An extremely challenging benchmark crowd-sourced from thousands of domain experts worldwide, designed to probe the absolute frontier of AI capabilities with questions that even specialists find difficult.

Current
Expert-level questions · Open-ended and multiple choice · Frontier expert level
Weighted 23%

Humanity's Last Exam · updated April 20, 2026

FrontierScience

2026

FrontierScience

A benchmark for research-level scientific reasoning, designed to separate frontier models on difficult science tasks that mix domain knowledge with deep reasoning.

Current
Research-level science tasks · Scientific reasoning benchmark · Research frontier
Weighted 18%

FrontierScience 2026 · updated April 20, 2026

Artificial Analysis Intelligence Index

2026

Artificial Analysis Intelligence Index

A display-only intelligence index published by Artificial Analysis that aggregates provider-reported and benchmark-derived signals into a single model-level score.

Current · Display only
Cross-benchmark intelligence index · Aggregated model score · Display-only external reference
Display only

Artificial Analysis Intelligence Index 2026 · updated April 20, 2026

SimpleQA

2024

Measuring Short-Form Factuality in Large Language Models

A benchmark that evaluates the ability of language models to answer short, fact-seeking questions accurately. Focuses on factual correctness rather than reasoning complexity.

Refreshing
Factual questions · Short-form Q&A · Factual accuracy focused
Weighted 13%

SimpleQA 2024 · updated April 20, 2026

OpenBookQA

2018

OpenBookQA

A science question-answering benchmark that tests whether models can apply a small open-book set of elementary science facts to multi-step reasoning questions.

Stale · Display only
Elementary science questions · 4-way multiple choice · Elementary science reasoning
Display only

OpenBookQA 2018 · updated April 20, 2026

HealthBench Hard

2026

HealthBench Hard

A harder subset of OpenAI's HealthBench for evaluating open-ended medical and health reasoning with rubric-based grading.

Current · Display only
1,000 health prompts · Open-ended health evaluation · Advanced health reasoning
Display only

HealthBench Hard 2026 · updated April 20, 2026

MedXpertQA (Text)

2026

MedXpertQA Text

A medical multiple-choice benchmark spanning many specialties with 10 answer options per question.

Current · Display only
2,450 medical multiple-choice questions · Medical MCQ · Professional medical knowledge
Display only

MedXpertQA (Text) 2026 · updated April 20, 2026

FrontierScience Research

2026

FrontierScience Research

A research-focused FrontierScience evaluation variant for scientific investigation and problem solving.

Current · Display only
Scientific research problems · Research evaluation · Frontier scientific research
Display only

FrontierScience Research 2026 · updated April 20, 2026

TruthfulQA

2021

TruthfulQA

A benchmark designed to measure whether language models produce truthful answers instead of repeating common misconceptions or misleading falsehoods.

Stale · Display only
Truthfulness and misconception resistance · Question answering · Hallucination and factuality stress test
Display only

TruthfulQA 2021 · updated April 20, 2026

HLE w/o tools

2026

Humanity's Last Exam without tools

A tool-free variant of Humanity's Last Exam that isolates a model's raw frontier reasoning.

Current · Display only
Expert-level questions · Tool-free expert QA · Frontier expert level
Display only

HLE w/o tools 2026 · updated April 20, 2026

MMLU-Pro (Arcee)

2026

MMLU-Pro first-party comparison snapshot

A display-only MMLU-Pro reference from Arcee AI's Trinity-Large-Thinking launch chart.

Current · Display only
Professional academic QA · 10-way multiple choice · Professional level
Display only

MMLU-Pro (Arcee) 2026 · updated April 20, 2026

MMLU-Redux

2026

MMLU-Redux

A harder refresh of MMLU intended to keep broad knowledge evaluation useful after the original benchmark became too easy for frontier models.

Current · Display only
Broad academic QA · Multiple choice questions · Advanced general knowledge
Display only

MMLU-Redux 2026 · updated April 20, 2026

C-Eval

2023

C-Eval

A Chinese-language academic and professional benchmark spanning humanities, social science, STEM, and applied subjects.

Stale · Display only
Chinese academic and professional exams · Multiple choice questions · High school to professional level
Display only

C-Eval 2023 · updated April 20, 2026

Multilingual (8 benchmarks)

View leaderboard

MGSM

2022

Multilingual Grade School Math

A multilingual benchmark that translates 250 grade school math problems from GSM8K into 10 typologically diverse languages (11 counting the original English): Bengali, German, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese.

Stale
250 problems × 11 languages · Math word problems · Grade school math, multilingual
Weighted 35%

MGSM 2022 · updated April 20, 2026

MMLU-ProX

2025

MMLU-ProX

A multilingual extension of professional-level academic evaluation across many languages.

Current
Multilingual professional QA · Multilingual multiple choice · Professional multilingual
Weighted 65%

MMLU-ProX 2025 · updated April 20, 2026

NOVA-63

2026

NOVA-63

A broad multilingual benchmark row from Qwen's launch comparisons intended to measure cross-lingual capability beyond a single language family.

Current · Display only
Broad multilingual evaluation · Cross-lingual benchmark · Broad multilingual capability
Display only

NOVA-63 2026 · updated April 20, 2026

INCLUDE

2026

INCLUDE

A multilingual benchmark used in provider tables to measure inclusive language coverage and cross-lingual understanding beyond common high-resource languages.

Current · Display only
Cross-lingual understanding · Multilingual benchmark · Broad multilingual capability
Display only

INCLUDE 2026 · updated April 20, 2026

PolyMath

2026

PolyMath

A multilingual mathematical reasoning benchmark that tests whether math performance transfers across languages rather than only in English.

Current · Display only
Multilingual math problems · Cross-lingual mathematical reasoning · Advanced multilingual reasoning
Display only

PolyMath 2026 · updated April 20, 2026

VWT2k-lite

2026

VWT2k-lite

A lighter multilingual benchmark slice published in provider tables for broad cross-lingual transfer and understanding.

Current · Display only
Multilingual transfer tasks · Cross-lingual benchmark · Broad multilingual capability
Display only

VWT2k-lite 2026 · updated April 20, 2026

MAXIFE

2026

MAXIFE

A multilingual instruction-following and understanding benchmark row published in Qwen's launch comparisons.

Current · Display only
Multilingual instruction following · Cross-lingual benchmark · Advanced multilingual instruction following
Display only

MAXIFE 2026 · updated April 20, 2026

SWE Multilingual

2025

SWE-bench Multilingual

A multilingual extension of SWE-bench covering 300 problems across 9 programming languages, testing code generation and bug fixing beyond Python.

Current · Display only
300 problems across 9 languages · Multi-language code patch generation · Professional multilingual software engineering
Display only

SWE Multilingual 2025 · updated April 20, 2026

Instruction Following (2 benchmarks)

View leaderboard

Mathematics (17 benchmarks)

View leaderboard

AIME 2023

2023

American Invitational Mathematics Examination 2023

A 15-question, 3-hour examination where each answer is an integer from 000 to 999. Serves as the intermediate step between AMC 10/12 and the USA Mathematical Olympiad (USAMO).

Stale · Display only
15 problems · Integer answers 000-999 · High school olympiad level
Display only

AIME 2023 · updated April 20, 2026

AIME 2024

2024

American Invitational Mathematics Examination 2024

The 2024 edition of AIME, maintaining the same format of 15 challenging mathematics problems with integer answers from 000 to 999.

Refreshing · Display only
15 problems · Integer answers 000-999 · High school olympiad level
Display only

AIME 2024 · updated April 20, 2026

AIME 2025

2025

American Invitational Mathematics Examination 2025

The 2025 edition of AIME, featuring 15 challenging mathematics problems that test olympiad-level mathematical reasoning, with integer answers from 000 to 999.

Current
15 problems · Integer answers 000-999 · High school olympiad level
Weighted 25%

AIME 2025 · updated April 20, 2026

AIME25 (Arcee)

2026

AIME25 first-party comparison snapshot

A display-only AIME25 reference from Arcee AI's Trinity-Large-Thinking launch chart.

Current · Display only
15 problems · Integer answers 000-999 · High school olympiad level
Display only

AIME25 (Arcee) 2026 · updated April 20, 2026

HMMT Feb 2023

2023

Harvard-MIT Mathematics Tournament February 2023

A prestigious high school mathematics competition hosted jointly by Harvard and MIT, featuring challenging problems across various mathematical disciplines.

Stale · Display only
Tournament problems · Competition mathematics · High school olympiad level
Display only

HMMT Feb 2023 · updated April 20, 2026

HMMT Feb 2024

2024

Harvard-MIT Mathematics Tournament February 2024

The 2024 February edition of the Harvard-MIT Mathematics Tournament, continuing the tradition of challenging high school mathematics competition.

Refreshing · Display only
Tournament problems · Competition mathematics · High school olympiad level
Display only

HMMT Feb 2024 · updated April 20, 2026

HMMT Feb 2025

2025

Harvard-MIT Mathematics Tournament February 2025

The February 2025 edition of the Harvard-MIT Mathematics Tournament, featuring challenging problems in competitive mathematics.

Current · Display only
Tournament problems · Competition mathematics · High school olympiad level
Display only

HMMT Feb 2025 · updated April 20, 2026

BRUMO 2025

2025

Brown University Math Olympiad 2025

A challenging mathematical olympiad competition featuring problems that test advanced mathematical reasoning and problem-solving skills at the olympiad level.

Current
Olympiad problems · Mathematical olympiad · Mathematical olympiad level
Weighted 25%

BRUMO 2025 · updated April 20, 2026

MATH-500

2021

MATH-500 Problem Set

A curated subset of 500 problems from the MATH dataset, covering algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus.

Stale
500 problems · Free-form mathematical answers · High school to undergraduate
Weighted 15%

MATH-500 2021 · updated April 20, 2026

AIME26

2026

AIME 2026

A 2026 American Invitational Mathematics Examination snapshot used in frontier-model comparison tables for mathematical reasoning.

Current · Display only
Competition math problems · Short-answer mathematics · Olympiad-style mathematics
Display only

AIME26 2026 · updated April 20, 2026

IPhO 2025 (Theory)

2026

International Physics Olympiad 2025 (Theory)

The three official theory problems from the 2025 International Physics Olympiad, scored with blinded human evaluation.

Current · Display only
3 olympiad theory problems · Physics olympiad theory · International olympiad physics
Display only

IPhO 2025 (Theory) 2026 · updated April 20, 2026

HMMT Feb 2025

2025

Harvard-MIT Mathematics Tournament February 2025

A February 2025 HMMT slice used in exact-value provider tables for advanced contest-math reasoning.

Current · Display only
Competition math problems · Contest mathematics · Olympiad-style mathematics
Display only

HMMT Feb 2025 · updated April 20, 2026

HMMT Nov 2025

2025

Harvard-MIT Mathematics Tournament November 2025

A November 2025 HMMT slice for high-end mathematical reasoning comparisons.

Current · Display only
Competition math problems · Contest mathematics · Olympiad-style mathematics
Display only

HMMT Nov 2025 · updated April 20, 2026

HMMT Feb 2026

2026

Harvard-MIT Mathematics Tournament February 2026

A February 2026 HMMT slice used in newer frontier-model math comparisons.

Current · Display only
Competition math problems · Contest mathematics · Olympiad-style mathematics
Display only

HMMT Feb 2026 · updated April 20, 2026

MMAnswerBench

2026

MMAnswerBench

A multimodal mathematical reasoning benchmark that tests whether models can answer visually grounded math questions correctly.

Current · Display only
Multimodal math questions · Visual and structured mathematical QA · Advanced mathematical reasoning
Display only

MMAnswerBench 2026 · updated April 20, 2026

FrontierMath

2024

FrontierMath

An expert-level mathematical reasoning benchmark by Epoch AI featuring original, research-level problems created by mathematicians including IMO gold medalists and Fields Medal recipients. Problems require deep creativity and multi-step reasoning.

Refreshing
350 original research-level math problems · Open-ended mathematical reasoning with tool access · Research-level mathematics
Weighted 35%

FrontierMath 2024 · updated April 20, 2026

USAMO 2026

2026

United States of America Mathematical Olympiad 2026

The premier US mathematical olympiad competition, featuring proof-based problems that require deep mathematical insight and rigorous argumentation at the highest competition level.

Current · Display only
6 proof-based problems · Mathematical proof construction · International olympiad level
Display only

USAMO 2026 · updated April 20, 2026