AI Benchmarks Directory

Explore 225 benchmarks used to evaluate AI language models across 10 categories.

Agentic (40 benchmarks)

Terminal-Bench 2.0

2026

Terminal-Bench 2.0

A benchmark for agentic software engineering tasks executed in real terminal environments. Models must inspect files, run commands, edit code, and recover from errors over multi-step workflows.

Current
Terminal-based software tasks · Interactive CLI agent evaluation · Professional software engineering
Weighted 28%

Terminal-Bench 2 · updated June 2, 2026

BrowseComp

2025

BrowseComp

A benchmark for web-browsing agents that must search, inspect sources, gather evidence, and return the correct answer to research-oriented questions.

Current
Research questions requiring browsing · Web search and evidence synthesis · Hard web research
Weighted 18%

BrowseComp 2026 · updated June 2, 2026

HLE w/ tools

2026

Humanity's Last Exam with tools

Tool-augmented Humanity's Last Exam scores reported in DeepSeek-V4 thinking-mode evaluations.

Current · Display only
Expert questions with tool use · Pass@1 · Frontier tool-augmented reasoning
Display only

HLE w/ tools 2026 · updated June 2, 2026

GDPval-AA

2026

GDPval-AA

An agentic real-world work-task evaluation reported as an Elo score in DeepSeek-V4 thinking-mode evaluations.

Current · Display only
Agentic real-world work tasks · Elo · Professional agentic workflows
Display only

GDPval-AA 2026 · updated June 2, 2026

GDPval-AA

2026

GDPval-AA normalized

A display-only Artificial Analysis normalized score for economically valuable tasks.

Current · Display only
Economically valuable tasks · Normalized score · Professional agentic workflows
Display only

GDPval-AA 2026 · updated June 2, 2026

AA Agentic Index

2026

Artificial Analysis Agentic Index

A display-only Artificial Analysis agentic index.

Current · Display only
Cross-benchmark agentic index · Aggregated model score · Display-only external reference
Display only

AA Agentic Index 2026 · updated June 2, 2026

APEX-Agents-AA

2026

APEX-Agents-AA

Artificial Analysis' implementation of the APEX-Agents benchmark for long-horizon professional-services agent tasks.

Current · Display only
452 professional-services agent tasks · Pass@1 · Long-horizon workplace agent tasks
Display only

APEX-Agents-AA 2026 · updated June 2, 2026

Gert Labs

2026

Gert Labs Composite Game Benchmark

A game-environment benchmark that evaluates AI models in novel games covering strategic planning, resource management, spatial reasoning, cooperation, and theory of mind.

Current · Display only
Novel game environments · Composite game leaderboard · Agentic coding and decision-making
Display only

Gert Labs 2026 · updated June 2, 2026

OSWorld-Verified

2025

OSWorld-Verified

A verified subset of OSWorld focused on computer-use tasks in desktop-like environments, including navigation, editing, and workflow completion.

Current
Desktop and GUI tasks · Interactive computer-use evaluation · Complex multi-step workflows
Weighted 24%

OSWorld Verified · updated June 2, 2026

CyberGym

2026

CyberGym

A cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance.

Current · Display only
1,507 vulnerability analysis instances · Vulnerability reproduction and PoC generation · Real-world cybersecurity
Display only

CyberGym 2026 · updated June 2, 2026

BrowseComp-VL

2026

BrowseComp-VL

A vision-language browsing benchmark for multimodal web research and tool-use workflows.

Current · Display only
Multimodal browsing tasks · Vision-language web research evaluation · Multimodal browser-agent
Display only

BrowseComp-VL 2026 · updated June 2, 2026

OSWorld

2026

OSWorld

A computer-use benchmark for GUI task completion across the broader OSWorld task suite.

Current · Display only
Computer-use tasks · Interactive GUI evaluation · Broad computer-use suite
Display only

OSWorld 2026 · updated June 2, 2026

AndroidWorld

2026

AndroidWorld

A mobile GUI agent benchmark for completing Android app workflows and on-device tasks.

Current · Display only
Android app workflows · Interactive mobile-agent evaluation · Complex mobile task completion
Display only

AndroidWorld 2026 · updated June 2, 2026

WebVoyager

2026

WebVoyager

A browser-agent benchmark for completing multi-step workflows on live websites.

Current · Display only
Live website workflows · Interactive browser-agent evaluation · Multi-step web navigation
Display only

WebVoyager 2026 · updated June 2, 2026

MCP Atlas

2026

MCP Atlas

A benchmark for tool-calling over Model Context Protocol integrations and external tools.

Current · Display only
Tool-integrated agent tasks · Interactive tool-calling evaluation · Advanced tool use
Display only

MCP Atlas 2026 · updated June 2, 2026

Toolathlon

2026

Toolathlon

A tool-use benchmark focused on selecting, sequencing, and completing tasks with external tools.

Current · Display only
Multi-tool workflows · Interactive tool-calling evaluation · Advanced tool use
Display only

Toolathlon 2026 · updated June 2, 2026

ZClawBench

2026

ZClawBench

A Z.AI benchmark for OpenClaw-style agent workflows spanning information search, office work, data analysis, development and operations, automation, and security.

Current · Display only
OpenClaw agent workflows · End-to-end agent benchmark · Broad productivity and operations workflows
Display only

ZClawBench 2026 · updated June 2, 2026

Tau2-Telecom

2026

Tau2-Telecom

A telecom-oriented tool benchmark that measures structured tool use in domain workflows.

Current · Display only
Telecom tool workflows · Domain-specific tool evaluation · Professional workflow
Display only

τ²-Bench 2026 · updated June 2, 2026

DeepSearchQA

2026

DeepSearchQA

An agentic browsing benchmark where models search the web, gather evidence, and answer list-style questions using browser tools.

Current · Display only
Agentic browsing and list-answer questions · Search / open / find browser-agent evaluation · Agentic web research
Display only

DeepSearchQA 2026 · updated June 2, 2026

Tau2-Airline

2026

Tau2-Airline

An airline-domain tool-use benchmark for structured workflow execution and API correctness.

Current · Display only
Airline support workflows · Domain-specific tool evaluation · Professional workflow
Display only

Tau2-Airline 2026 · updated June 2, 2026

PinchBench

2026

PinchBench

An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows.

Current · Display only
23 OpenClaw agent tasks · Average success rate from official runs · Long-horizon agent workflows
Display only

PinchBench 2026 · updated June 2, 2026

OpenHands Index

2025

OpenHands Index

A holistic coding-agent benchmark that evaluates AI agents across issue resolution, frontend work, greenfield development, testing, and information gathering.

Current · Display only
SWE-bench Verified, SWE-bench Multimodal, Commit0, SWT-bench Verified, and GAIA · Macro-average across five coding-agent categories · Real-world software engineering agent tasks
Display only

OpenHands Index 2025 · updated June 2, 2026
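
The OpenHands Index card describes its score as a macro-average across five coding-agent categories. A minimal sketch of that arithmetic follows, assuming each category contributes one score and all five count equally. The function name and the example scores are hypothetical and are not real results.

```python
from statistics import mean


def macro_average(category_scores: dict[str, float]) -> float:
    """Unweighted mean of per-category scores, so every category counts equally."""
    if not category_scores:
        raise ValueError("need at least one category score")
    return mean(category_scores.values())


# Hypothetical scores for illustration only (not published results).
example = {
    "SWE-bench Verified": 60.0,
    "SWE-bench Multimodal": 30.0,
    "Commit0": 20.0,
    "SWT-bench Verified": 50.0,
    "GAIA": 40.0,
}
print(macro_average(example))  # 40.0
```

A macro-average keeps a category with many tasks from dominating one with few, which is presumably why an index spanning differently sized suites would use it.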

SWE-Atlas Refactoring

2026

SWE-Atlas Refactoring

A Scale SWE-Atlas software-engineering agent benchmark focused on refactoring tasks.

Current · Display only
SWE-Atlas refactoring tasks · Refactoring score with confidence intervals · Real-world software-engineering agent tasks
Display only

SWE-Atlas Refactoring 2026 · updated June 2, 2026

InferenceBench

2026

InferenceBench

A benchmark for open-ended LLM inference optimization by AI agents. Agents receive a base model, one H100, and a fixed time budget to build a valid OpenAI-compatible inference server that improves serving speed.

Current · Display only
4 inference-serving optimization scenarios · Two-hour autonomous CLI agent run · Open-ended ML systems engineering
Display only

InferenceBench 2026 · updated June 2, 2026

BFCL v4

2026

Berkeley Function Calling Leaderboard v4

A function-calling benchmark for tool selection, schema adherence, and argument correctness.

Current · Display only
Function-calling tasks · Tool invocation and schema evaluation · Advanced tool use
Display only

BFCL v4 2026 · updated June 2, 2026

MLE-Bench Lite

2026

MLE-Bench Lite

A lightweight machine-learning competition benchmark that measures whether models can iteratively train, evaluate, and improve ML systems in low-resource settings.

Current · Display only
Low-resource ML competitions · Autonomous iterative ML optimization · Agentic machine learning
Display only

MLE-Bench Lite 2026 · updated June 2, 2026

MM-ClawBench

2026

MM-ClawBench

An OpenClaw-derived agent benchmark covering practical work and life tasks such as office document delivery, research, planning, and code maintenance.

Current · Display only
OpenClaw-style real-world tasks · Agent workflow evaluation · Broad real-world agentic execution
Display only

MM-ClawBench 2026 · updated June 2, 2026

Claw-Eval

2026

Claw-Eval

A transparent real-world autonomous-agent benchmark with 300 human-verified tasks, 2,159 rubric items, and Pass^3 scoring across general, multi-turn, and native multimodal agent tasks.

Current · Display only
300 tasks, 2,159 rubrics · End-to-end autonomous-agent evaluation with Pass^3 scoring · Real-world general, multi-turn, and native multimodal agent execution
Display only

Claw-Eval 2026 · updated June 2, 2026
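
The Claw-Eval card names Pass^3 scoring without defining it. One common reading, borrowed from the pass^k metric used by τ-bench, is the chance that k independent trials of the same task all succeed, averaged over tasks. The sketch below assumes that reading. Claw-Eval's exact procedure may differ, and the function names and trial data are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k independent trials of one task all succeed."""
    if not 0 <= successes <= trials or not 1 <= k <= trials:
        raise ValueError("require 0 <= successes <= trials and 1 <= k <= trials")
    # Fraction of size-k subsets of the trials that contain only successes.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: list[list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks."""
    per_task = [pass_hat_k(sum(r), len(r), k) for r in results]
    return sum(per_task) / len(per_task)


# Hypothetical outcomes: four tasks, three trials each.
runs = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
    [True, True, True],
]
print(benchmark_pass_hat_k(runs, k=3))  # 0.5
```

With three trials and k = 3, a task scores 1 only when every trial succeeds, so the metric rewards consistency rather than a single lucky run.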

QwenClawBench

2026

QwenClawBench

Qwen's internal OpenClaw-style benchmark for measuring broad real-world agent performance across practical productivity and research tasks.

Current · Display only
Real-world agent workflows · End-to-end agent evaluation · Broad real-world agentic execution
Display only

QwenClawBench 2026 · updated June 2, 2026

QwenWebBench

2026

QwenWebBench

A Qwen benchmark for artifact and webpage generation quality reported as an Elo-style rating.

Current · Display only
Web artifacts and interactive deliverables · Elo-style artifact benchmark · Artifact generation
Display only

QwenWebBench 2026 · updated June 2, 2026

TAU3-Bench

2026

TAU3-Bench

A next-generation tool-use benchmark for complex, long-horizon agent workflows beyond the older tau2 telecom and airline task families.

Current · Display only
Long-horizon tool workflows · Interactive tool-use evaluation · Advanced tool use
Display only

TAU3-Bench 2026 · updated June 2, 2026

VITA-Bench

2025

VITA-Bench

An interactive real-world agent benchmark grounded in practical consumer-service tasks such as delivery, in-store consumption, and online travel workflows.

Current · Display only
Interactive consumer-service agent tasks · End-to-end interactive agent evaluation · Long-horizon real-world workflows
Display only

VITA-Bench 2025 · updated June 2, 2026

DeepPlanning

2026

DeepPlanning

A long-horizon planning benchmark that tests whether agents can optimize under explicit time, budget, and feasibility constraints.

Current · Display only
Travel planning and constrained shopping · Long-horizon planning benchmark · Constrained agent planning
Display only

DeepPlanning 2026 · updated June 2, 2026

MCP-Tasks

2026

MCP-Tasks

A Model Context Protocol task benchmark used in Qwen's launch tables to measure practical execution over MCP-style tools and integrations.

Current · Display only
MCP-integrated tool tasks · Interactive tool-use evaluation · Advanced MCP workflows
Display only

MCP-Tasks 2026 · updated June 2, 2026

WideResearch

2026

WideResearch

A broad research-agent benchmark for open-ended information gathering, synthesis, and answer construction across wide search spaces.

Current · Display only
Open-ended research tasks · Multi-source research evaluation · Broad research-agent workflows
Display only

WideResearch 2026 · updated June 2, 2026

GAIA

2024

General AI Assistants

GAIA evaluates AI models on real-world tasks that are conceptually simple for humans but require AI systems to combine multi-step reasoning, web browsing, tool use, and multimodal understanding. Tasks span three difficulty levels and test practical assistant capabilities rather than academic knowledge.

Refreshing
466
Weighted 12%

GAIA 2024 · updated June 2, 2026

TAU-bench

2024

Tool-Agent-User Benchmark

TAU-bench evaluates AI agents in realistic enterprise scenarios requiring multi-turn tool use, database interactions, and policy adherence. It tests across retail and airline domains, measuring an agent's ability to reliably complete customer service tasks while following complex business rules.

Refreshing
680
Weighted 10%

TAU-bench 2024 · updated June 2, 2026

WebArena

2024

WebArena Web Agent Benchmark

WebArena is a realistic web environment for evaluating autonomous AI agents on complex, multi-step browser tasks. Agents must navigate e-commerce sites, forums, content management systems, and code repositories to complete practical objectives like purchasing items, finding information, and managing accounts.

Refreshing
812
Weighted 8%

WebArena 2024 · updated June 2, 2026

MEWC

2026

Multi-Environment Web Challenge

A benchmark that evaluates AI agents on multi-environment web challenges, testing navigation and task completion across diverse live web environments.

Current · Display only
Web-agent tasks · Browser task completion · Open-web agent workflows
Display only

MEWC 2026 · updated June 2, 2026

Finance Agent v2

2026

Finance Agent v2

Vals AI benchmark for realistic financial analyst agent tasks across qualitative analysis, quantitative analysis, market work, comparables, precedents, earnings, disclosure, and modeling.

Current · Display only
Financial analyst task categories · Mean score across repeated runs · Professional expert-task agent workflow
Display only

Finance Agent v2 2026 · updated June 2, 2026
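
The six weighted Agentic entries above carry weights of 28%, 18%, 24%, 12%, 10%, and 8%, which sum to 100%, while display-only entries carry no weight. The page does not say how these weights are combined, so the following is only a sketch of one plausible reading: a weighted mean in which benchmarks a model has not reported are skipped and the remaining weights are renormalized. The function name, the renormalization rule, and the example scores are assumptions.

```python
# Weights copied from the "Weighted N%" lines of the Agentic cards.
AGENTIC_WEIGHTS = {
    "Terminal-Bench 2.0": 28,
    "BrowseComp": 18,
    "OSWorld-Verified": 24,
    "GAIA": 12,
    "TAU-bench": 10,
    "WebArena": 8,
}


def weighted_category_score(scores: dict[str, float], weights: dict[str, int]) -> float:
    """Weighted mean over the weighted benchmarks that have a score.

    Assumption: missing benchmarks are skipped and the remaining weights are
    renormalized. Display-only benchmarks never appear in `weights`.
    """
    used = {name: w for name, w in weights.items() if name in scores}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no weighted benchmark scores available")
    return sum(scores[name] * w for name, w in used.items()) / total


# Hypothetical scores for a model that reports only three of the six benchmarks.
hypothetical = {"Terminal-Bench 2.0": 50.0, "BrowseComp": 50.0, "GAIA": 80.0}
print(round(weighted_category_score(hypothetical, AGENTIC_WEIGHTS), 2))
```

Renormalizing avoids penalizing a model for an unreported benchmark, at the cost of making scores built from different subsets less comparable.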

Coding (27 benchmarks)

HumanEval

2021

Evaluating Large Language Models Trained on Code

A set of 164 handwritten programming problems that test the ability to generate correct Python functions from natural language descriptions. Each problem includes a function signature, docstring, body, and several unit tests.

Stale · Saturated · Display only
164 problems · Python function generation · Introductory to intermediate programming
Display only

HumanEval · updated June 2, 2026

BigCodeBench

2026

BigCodeBench

A code-generation benchmark reported in DeepSeek-V4 base-model evaluations.

Current · Display only
Code generation tasks · Pass@1 · Software engineering
Display only

BigCodeBench 2026 · updated June 2, 2026

Codeforces

2026

Codeforces Rating

Competitive-programming rating reported for DeepSeek-V4 thinking-mode evaluations.

Current · Display only
Competitive programming contests · Rating · Elite competitive programming
Display only

Codeforces 2026 · updated June 2, 2026

Terminal-Bench 2.0

2026

Terminal-Bench 2.0

A benchmark for agentic software engineering tasks executed in real terminal environments. DeepSeek reports it in the agentic section, while BenchLM also mirrors it in coding for models that publish it as a developer-task signal.

Current · Display only
Terminal-based software tasks · Interactive CLI agent evaluation · Professional software engineering
Display only

Terminal-Bench 2 · updated June 2, 2026

SWE-bench Verified

2024

Software Engineering Benchmark Verified

A curated, human-verified subset of SWE-bench that tests models on resolving real GitHub issues from popular open-source Python repositories like Django, Flask, and scikit-learn.

Refreshing
500 verified issues · Code patch generation · Professional software engineering
Weighted 13%

SWE-bench Verified 2024 · updated June 2, 2026

SWE-Rebench

2026

SWE-Rebench

A continuously updated software engineering benchmark by Nebius using fresh GitHub issues to avoid contamination. Models are evaluated 5 times per problem under a fixed ReAct scaffolding; the Resolved Rate (best pass@1) is reported.

Current
Fresh GitHub issues (rolling window) · Code patch generation · Professional software engineering
Weighted 31%

Rolling 2026 window · updated June 2, 2026

LiveCodeBench

2024

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

A continuously updated benchmark using fresh competitive programming problems from LeetCode, Codeforces, and AtCoder to provide contamination-free code generation evaluation.

Current
Continuously updated · Competitive programming · Competitive programming level
Weighted 23%

Rolling 2026 set · updated June 2, 2026

LiveCodeBench v6

2026

LiveCodeBench v6

A newer LiveCodeBench slice used in provider comparison tables to benchmark contamination-resistant coding performance on fresher competitive programming sets.

Current · Display only
Fresh programming problems · Competitive programming · Competitive programming level
Display only

LiveCodeBench v6 2026 · updated June 2, 2026

LiveCodeBench Pro

2025

LiveCodeBench Pro

A harder competitive-programming benchmark family built from Codeforces, ICPC, and IOI problems, with quarter-specific public leaderboards and difficulty-aware reporting.

Current · Display only
Quarter-specific contest programming sets · Competitive programming · High-end contest programming
Display only

LiveCodeBench Pro 2025 · updated June 2, 2026

FLTEval

2026

FLTEval

A repository-level Lean 4 proof engineering benchmark that measures whether a model can complete formal proofs and correctly define new mathematical concepts inside realistic FLT project pull requests.

Current · Display only
FLT project pull requests · Lean 4 repository task completion · Formal verification / proof engineering
Display only

FLTEval 2026 · updated June 2, 2026

SWE-bench Pro

2026

SWE-bench Pro

A stronger coding-agent benchmark than SWE-bench Verified, intended to differentiate frontier models on realistic software engineering work.

Current
Real-world software engineering · Repository task completion · Frontier coding agent
Weighted 23%

SWE-bench Pro 2026 · updated June 2, 2026

SWE Multilingual

2026

SWE Multilingual

A multilingual software-engineering benchmark for real-world code issue resolution across multiple programming languages.

Current · Display only
Multilingual software-engineering tasks · Repository task completion · Professional software engineering
Display only

SWE Multilingual 2026 · updated June 2, 2026

SWE Multimodal

2025

SWE-bench Multimodal

A multimodal variant of SWE-bench that adds visual context such as screenshots and design mockups to software engineering issue descriptions.

Current · Display only
Multimodal software engineering tasks · Code patch generation with visual context · Frontier multimodal coding
Display only

SWE Multimodal 2025 · updated June 2, 2026

CursorBench v3.1

2026

CursorBench v3.1

Cursor's first-party harder-task benchmark for long-horizon agentic coding behavior inside the Cursor agent loop.

Current · Display only
Harder long-horizon agentic coding tasks · Cursor agent-loop evaluation · Professional agentic software engineering
Display only

CursorBench v3.1 2026 · updated June 2, 2026

Multi-SWE Bench

2026

Multi-SWE Bench

A multi-language software-engineering benchmark that measures repository-level bug fixing and implementation across more than one programming ecosystem.

Current · Display only
Multi-language repo tasks · Repository task completion · Professional software engineering
Display only

Multi-SWE Bench 2026 · updated June 2, 2026

VIBE-Pro

2026

VIBE-Pro

A repo-level code generation and full-project delivery benchmark spanning web, mobile, and simulation-style implementation tasks.

Current · Display only
Full project delivery tasks · Repository-level implementation benchmark · End-to-end software delivery
Display only

VIBE-Pro 2026 · updated June 2, 2026

Vibe Code Bench

2026

Vibe Code Bench v1.1

Vals.ai benchmark for evaluating whether models can build complete web applications from natural language specifications in a production-like development environment.

Current · Display only
End-to-end web application builds · Full-stack app implementation benchmark · End-to-end software delivery
Display only

Vibe Code Bench 2026 · updated June 2, 2026

ProgramBench

2026

ProgramBench: Can Language Models Rebuild Programs From Scratch?

A cleanroom software-engineering benchmark where agents receive only a compiled executable and documentation, then must architect and implement a complete codebase that reproduces the original program's behavior.

Current · Display only
200 program reconstruction tasks · Cleanroom executable reimplementation · Full-repository software architecture
Display only

ProgramBench 2026 · updated June 2, 2026

NL2Repo

2026

NL2Repo

A repository-understanding benchmark that measures whether models can map natural-language requests onto the right code locations and system changes.

Current · Display only
Natural language to repository tasks · Repository understanding benchmark · System-level software comprehension
Display only

NL2Repo 2026 · updated June 2, 2026

React Native Evals

2026

React Native Evals

An open benchmark for AI coding agents on real-world React Native implementation tasks, emphasizing working app behavior, recommended architecture choices, and strict constraint adherence.

Current · Display only
React Native app implementation tasks · Framework-specific app development evaluation · Production mobile app engineering
Display only

React Native Evals 2026 · updated June 2, 2026

Next.js Evals

2026

AI Agent Evaluations for Next.js

A Vercel benchmark for AI coding agents on Next.js code generation and migration tasks, reporting success rate, average execution time, and an AGENTS.md documentation-assisted split.

Current · Display only
24 Next.js code generation and migration tasks · Agent task completion with withheld Vitest assertions · Framework-specific web application engineering
Display only

Next.js Evals 2026 · updated June 2, 2026

SWE-bench Verified*

2026

SWE-bench Verified (mini-swe-agent-v2)

A display-only SWE-bench Verified reference from Arcee AI's Trinity-Large-Thinking comparison chart.

Current · Display only
Repository task completion · Agent scaffold benchmark · Professional software engineering
Display only

SWE-bench Verified* 2026 · updated June 2, 2026

Spider 2.0-Lite

2024

Spider 2.0-Lite

A text-to-SQL benchmark over realistic warehouse-scale schemas, reported by Interfaze for model comparison.

Refreshing · Display only
Text-to-SQL queries · Execution accuracy · Enterprise text-to-SQL
Display only

Spider 2.0-Lite 2024 · updated June 2, 2026

SciCode

2024

Scientific Code Benchmark

SciCode evaluates language models on generating code for realistic scientific research problems across 16 subfields of physics, math, chemistry, biology, and materials science. Problems decompose into 338 subproblems requiring domain knowledge recall, scientific reasoning, and precise code synthesis. The problems are based on real scripts from published research.

Refreshing
80
Weighted 10%

SciCode 2024 · updated June 2, 2026

AA Coding Index

2026

Artificial Analysis Coding Index

A display-only Artificial Analysis coding index.

Current · Display only
Cross-benchmark coding index · Aggregated model score · Display-only external reference
Display only

AA Coding Index 2026 · updated June 2, 2026

AA-SciCode

2026

Artificial Analysis SciCode

A display-only Artificial Analysis SciCode score.

Current · Display only
Scientific coding subproblems · Task success rate · Scientific programming
Display only

AA-SciCode 2026 · updated June 2, 2026

Terminal-Bench Hard

2026

Terminal-Bench Hard

A display-only Artificial Analysis coding metric for agentic coding and terminal use on a harder Terminal-Bench slice.

Current · Display only
Agentic coding and terminal tasks · Task success rate · Professional software engineering
Display only

Terminal-Bench Hard 2026 · updated June 2, 2026

Reasoning (23 benchmarks)

MuSR

2023

Testing the Limits of Chain-of-thought with Multistep Soft Reasoning

A dataset for evaluating language models on multistep soft reasoning tasks specified in natural language narratives. Tests the ability to perform complex, structured reasoning.

Stale
Multi-step reasoning · Narrative-based reasoning · Complex reasoning tasks
Weighted 20%

MuSR 2023 · updated June 2, 2026

BBH

2022

BIG-Bench Hard

A suite of 23 challenging tasks from the BIG-Bench collaborative benchmark on which prior language models failed to exceed average human-rater performance under standard few-shot prompting.

Stale · Saturated · Display only
23 tasks · Mixed reasoning tasks · Advanced reasoning
Display only

BBH 2022 · updated June 2, 2026

DROP

2026

Discrete Reasoning Over Paragraphs

A reading-comprehension benchmark requiring discrete reasoning over paragraphs, reported in DeepSeek-V4 base-model evaluations.

Current · Display only
Paragraph reasoning questions · F1 · Reading and numerical reasoning
Display only

DROP 2026 · updated June 2, 2026

HellaSwag

2026

HellaSwag

A commonsense natural-language inference benchmark reported in DeepSeek-V4 base-model evaluations.

Current · Display only
Commonsense completion questions · Exact match · Commonsense reasoning
Display only

HellaSwag 2026 · updated June 2, 2026

WinoGrande

2026

WinoGrande

A commonsense coreference benchmark reported in DeepSeek-V4 base-model evaluations.

Current · Display only
Coreference resolution questions · Exact match · Commonsense reasoning
Display only

WinoGrande 2026 · updated June 2, 2026

CLUEWSC

2026

CLUEWSC

A Chinese Winograd Schema Challenge benchmark reported in DeepSeek-V4 base-model evaluations.

Current · Display only
Chinese coreference questions · Exact match · Chinese commonsense reasoning
Display only

CLUEWSC 2026 · updated June 2, 2026

LisanBench

2026

LisanBench

A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.

Current · Display only
50 starting words × 3 trials · Difficulty-weighted word-chain reasoning · Open-ended lexical planning
Display only

LisanBench 2026 · updated June 2, 2026
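
The LisanBench card describes extending non-repeating edit-distance-1 word chains. A minimal sketch of how such a chain could be validated follows, assuming "edit distance" means Levenshtein distance (one substitution, insertion, or deletion) and that every word must belong to a fixed vocabulary. The function names and the tiny vocabulary are hypothetical, and the benchmark's difficulty weighting is not modeled.

```python
def is_one_edit_apart(a: str, b: str) -> bool:
    """True if a and b differ by exactly one substitution, insertion, or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # single substitution at position i
    return a[i:] == b[i + 1:]  # single extra character in b at position i


def valid_chain_length(chain: list[str], vocabulary: set[str]) -> int:
    """Count valid links until the first repeated, unknown, or non-adjacent word."""
    if not chain:
        return 0
    seen = {chain[0]}
    length = 0
    for prev, word in zip(chain, chain[1:]):
        if word in seen or word not in vocabulary or not is_one_edit_apart(prev, word):
            break
        seen.add(word)
        length += 1
    return length


# Hypothetical vocabulary and model output; the last word repeats the start.
vocabulary = {"cat", "cot", "coat", "cost", "cast"}
chain = ["cat", "cot", "coat", "cost", "cat"]
print(valid_chain_length(chain, vocabulary))  # 3
```

Stopping at the first invalid link, rather than skipping it, means one mistake ends the chain, which fits the card's emphasis on planning and constraint following.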

Pencil Puzzle Bench

2026

Pencil Puzzle Bench

A multi-step verifiable reasoning benchmark that evaluates whether models can solve pencil puzzles with unique solutions.

Current · Display only
300 evaluation puzzles · Direct and agentic puzzle solve rate · Multi-step verifiable reasoning
Display only

Pencil Puzzle Bench 2026 · updated June 2, 2026

LongBench v2

2025

LongBench v2

A long-context benchmark that measures whether models can actually use extended context windows for reasoning and retrieval.

Current
Long-context tasks · Extended-context retrieval and reasoning · Hard long-context
Weighted 30%

LongBench v2 2025 · updated June 2, 2026

MRCRv2

2025

MRCRv2

A long-context benchmark for memory, retrieval, and multi-round coherence over large contexts.

Current
Long-context retrieval · Multi-round long-context evaluation · Hard long-context
Weighted 25%

MRCRv2 2025 · updated June 2, 2026

MRCR v2 64K-128K

2026

OpenAI MRCR v2 8-needle 64K-128K

MRCR v2 slice focused on long-context retrieval at 64K-128K lengths.

Current · Display only
8-needle retrieval tasks · Long-context retrieval · Long-context reasoning
Display only

MRCR v2 64K-128K 2026 · updated June 2, 2026

MRCR v2 128K-256K

2026

OpenAI MRCR v2 8-needle 128K-256K

MRCR v2 slice focused on very long contexts at 128K-256K lengths.

Current · Display only
8-needle retrieval tasks · Very-long-context retrieval · Very long-context reasoning
Display only

MRCR v2 128K-256K 2026 · updated June 2, 2026

Graphwalks BFS 128K

2026

Graphwalks BFS 0K-128K

Long-context graph traversal benchmark using breadth-first search tasks.

Current · Display only
Graph traversal tasks · Long-context graph reasoning · Algorithmic long-context reasoning
Display only

Graphwalks BFS 128K 2026 · updated June 2, 2026

Graphwalks Parents 128K

2026

Graphwalks parents 0-128K

Long-context benchmark for recovering parent relationships inside graph tasks.

Current · Display only
Graph parent-retrieval tasks · Long-context graph reasoning · Algorithmic long-context reasoning
Display only

Graphwalks Parents 128K 2026 · updated June 2, 2026

MRCR 1M

2026

MRCR 1M

A million-token MRCR long-context retrieval benchmark reported in DeepSeek-V4 model evaluations.

Current · Display only
Million-token retrieval · Long-context retrieval MMR · Million-token long context
Display only

MRCR 1M 2026 · updated June 2, 2026

CorpusQA 1M

2026

CorpusQA 1M

A million-token CorpusQA long-context question-answering benchmark reported in DeepSeek-V4 model evaluations.

Current · Display only
Million-token corpus question answering · Long-context QA accuracy · Million-token long context
Display only

CorpusQA 1M 2026 · updated June 2, 2026

ARC-AGI-2

2025

Abstraction and Reasoning Corpus for AGI v2

A benchmark measuring fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. Considered the hardest public reasoning benchmark — average individual human performance is 66%.

Current
Visual pattern completion and abstract reasoning · Grid transformation puzzles with novel rules · Expert-level, hardest public reasoning benchmark
Weighted 25%

ARC-AGI 2 · updated June 2, 2026

AI-Needle

2026

AI-Needle

A long-context retrieval benchmark that measures whether a model can recover relevant information embedded deep inside very long contexts.

Current · Display only
Long-context retrieval · Needle-in-a-haystack recall · Long-context memory
Display only

AI-Needle 2026 · updated June 2, 2026

GPQA Diamond

2023

GPQA Diamond

The hardest subset of GPQA featuring the most challenging graduate-level science questions. Sometimes reported separately from the standard GPQA benchmark.

Stale · Display only
Expert-level science questions · Multiple choice questions · Graduate-level scientific reasoning
Display only

GPQA Diamond 2023 · updated June 2, 2026

AA-LCR

2026

Artificial Analysis Long Context Reasoning

A display-only Artificial Analysis long-context reasoning evaluation.

Current · Display only
Long-context reasoning tasks · Accuracy · Long-context reasoning
Display only

AA-LCR 2026 · updated June 2, 2026

CritPt

2026

Critical Physics Tasks

A display-only Artificial Analysis metric for research-level physics reasoning.

Current · Display only
Research-level physics questions · Accuracy · Research-level physics reasoning
Display only

CritPt 2026 · updated June 2, 2026

BullshitBench v2

2025

BullshitBench v2

A benchmark that tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. Measures the critical ability to push back on bad input.

Current · Display only
Nonsensical and flawed prompts across multiple domains · Prompt challenge and refusal evaluation · Robustness and critical reasoning
Display only

BullshitBench v2 2025 · updated June 2, 2026

WildBench

2024

WildBench

An automated evaluation framework using 1,000+ real-world user tasks covering reasoning, planning, coding, and creative writing. Highly correlated with Chatbot Arena human preference rankings.

Refreshing · Display only
1,024 real-world tasks · Real-world task evaluation · Diverse real-world scenarios
Display only

WildBench 2024 · updated June 2, 2026

Multimodal & Grounded (47 benchmarks)

MMMU

2024

Massive Multi-discipline Multimodal Understanding

A broad multimodal reasoning benchmark spanning charts, diagrams, tables, and academic visual question answering.

Refreshing · Display only
Multimodal academic reasoning · Image + text question answering · Frontier multimodal
Display only

MMMU 2024 · updated June 2, 2026

MMMU-Pro

2024

Massive Multi-discipline Multimodal Understanding Pro

A harder multimodal benchmark for frontier models that combines text with images, diagrams, charts, and academic visual reasoning tasks.

Refreshing
Multimodal academic reasoning · Image + text question answering · Frontier multimodal
Weighted 45%

MMMU-Pro 2024 · updated June 2, 2026

AA-MMMU-Pro

2026

Artificial Analysis MMMU-Pro

A display-only Artificial Analysis MMMU-Pro score.

Current · Display only
Multimodal academic reasoning · Image + text question answering · Frontier multimodal
Display only

AA-MMMU-Pro 2026 · updated June 2, 2026

OCRBench V2

2025

OCRBench V2

A native OCR benchmark for reading text from images across multilingual scripts, low-quality scans, handwriting, structured layouts, charts, and screenshots.

CurrentDisplay only
Image OCR tasksAccuracyNative visual text understanding
Display only

OCRBench V2 2025 · updated June 2, 2026

olmOCR

2025

olmOCR-Bench

An end-to-end document understanding benchmark over long, layout-rich PDFs with tables, equations, headers, footnotes, and multi-column flows.

CurrentDisplay only
Layout-rich PDF understandingMean accuracyComplex document processing
Display only

olmOCR 2025 · updated June 2, 2026

VoxPopuli WER

2026

VoxPopuli-Cleaned-AA Word Error Rate

A speech-recognition benchmark on the cleaned Artificial Analysis VoxPopuli subset, reported as word error rate where lower is better.

CurrentDisplay only
Speech-to-text transcriptionWord error rateAudio speech recognition
Display only

VoxPopuli WER 2026 · updated June 2, 2026
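
Word error rate is conventionally the word-level edit distance between the reference transcript and the model's transcript, divided by the number of reference words. The sketch below assumes plain whitespace tokenization and no text normalization; the directory does not describe the normalization Artificial Analysis applies, and the function name `word_error_rate` is illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.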

Design Arena Website

2026

Design Arena Website Elo

A display-only Design Arena website-generation Elo score surfaced on OpenRouter model benchmark pages.

CurrentDisplay only
Website generation comparisonsEloDesign and website generation
Display only

Design Arena Website 2026 · updated June 2, 2026
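
An Elo score is easiest to read as a predicted head-to-head win rate. The directory does not say how Design Arena computes its ratings, so the sketch below assumes the standard Elo expected-score formula with the usual 400-point scale; `expected_score` is an illustrative name.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A is preferred over B (assumed 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
```

Under that assumption, equal ratings give 0.5, and a 400-point lead corresponds to being preferred in roughly 10 out of 11 comparisons.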

OfficeQA Pro

2026

OfficeQA Pro

A benchmark for grounded reasoning over office-style documents, spreadsheets, charts, and business artifacts.

Current
Document and spreadsheet tasksGrounded QA over office artifactsEnterprise grounded reasoning
Weighted 30%

OfficeQA Pro 2026 · updated June 2, 2026

MMMU-Pro w/ Python

2026

MMMU-Pro with Python

Tool-augmented MMMU-Pro variant that allows Python assistance during multimodal reasoning.

CurrentDisplay only
Multimodal academic reasoningImage + text question answering with PythonFrontier multimodal
Display only

MMMU-Pro w/ Python 2026 · updated June 2, 2026

OmniDocBench 1.5

2026

OmniDocBench 1.5

A document understanding benchmark used in frontier-model comparison tables to measure extraction and grounded reasoning quality on complex documents.

CurrentDisplay only
Document understanding tasksDocument understanding benchmarkGrounded document reasoning
Display only

OmniDocBench 1.5 2026 · updated June 2, 2026

RealWorldQA

2026

RealWorldQA

A grounded visual QA benchmark focused on answering practical questions about real-world images and scenes.

CurrentDisplay only
Real-world visual question answeringImage-grounded QAGeneral visual reasoning
Display only

RealWorldQA 2026 · updated June 2, 2026

Video-MME (with subtitle)

2026

Video-MME with subtitle

A video understanding benchmark that allows subtitle access when answering multimodal questions about videos.

CurrentDisplay only
Video understandingVideo QA with subtitle contextMultimodal video reasoning
Display only

Video-MME (with subtitle) 2026 · updated June 2, 2026

Video-MME (w/o subtitle)

2026

Video-MME without subtitle

A stricter Video-MME setting that removes subtitle help and tests video understanding from visual and audio context alone.

CurrentDisplay only
Video understandingVideo QA without subtitle contextMultimodal video reasoning
Display only

Video-MME (w/o subtitle) 2026 · updated June 2, 2026

Video-MME

2024

Video-MME

A comprehensive benchmark for multimodal large language models on video understanding, covering temporal reasoning, perception, and question answering over videos.

RefreshingDisplay only
Video understandingVideo QA and analysisBroad multimodal video reasoning
Display only

Video-MME 2024 · updated June 2, 2026

MathVision

2026

MathVision

A visual mathematics benchmark that tests whether a model can solve math problems grounded in diagrams, equations, figures, and other visual inputs.

CurrentDisplay only
Visually grounded math problemsImage + math reasoningAdvanced multimodal mathematics
Display only

MathVision 2026 · updated June 2, 2026

We-Math

2026

We-Math

A multimodal math benchmark for visually grounded mathematical reasoning and answer generation.

CurrentDisplay only
Visually grounded math problemsMultimodal mathematical reasoningAdvanced multimodal mathematics
Display only

We-Math 2026 · updated June 2, 2026

DynaMath

2026

DynaMath

A multimodal benchmark for dynamic mathematical reasoning over visual and structured inputs.

CurrentDisplay only
Dynamic visual math problemsMultimodal mathematical reasoningAdvanced multimodal mathematics
Display only

DynaMath 2026 · updated June 2, 2026

MStar

2026

MStar

A general visual question-answering benchmark used in provider tables for real-image reasoning quality.

CurrentDisplay only
Real-image visual QAImage-grounded QAGeneral visual reasoning
Display only

MStar 2026 · updated June 2, 2026

ChatCVQA

2026

ChatCVQA

A conversational visual QA benchmark that tests multi-turn grounded answering over images and documents.

CurrentDisplay only
Conversational visual QAMulti-turn image-grounded QAConversational multimodal reasoning
Display only

ChatCVQA 2026 · updated June 2, 2026

MMLongBench-Doc

2026

MMLongBench-Doc

A long-document multimodal benchmark for grounded reasoning over extended document contexts.

CurrentDisplay only
Long document understandingDocument-grounded reasoningLong-context document reasoning
Display only

MMLongBench-Doc 2026 · updated June 2, 2026

CC-OCR

2026

CC-OCR

An OCR-focused benchmark for reading and extracting text from visually complex documents and images.

CurrentDisplay only
Optical character recognitionText extraction from images and documentsDocument reading
Display only

CC-OCR 2026 · updated June 2, 2026

AI2D_TEST

2026

AI2D test split

A diagram understanding benchmark focused on scientific and educational visual question answering.

CurrentDisplay only
Diagram understandingDiagram-grounded QAStructured visual reasoning
Display only

AI2D_TEST 2026 · updated June 2, 2026

CountBench

2026

CountBench

A visual counting benchmark that tests whether a model can count objects and entities reliably in complex scenes.

CurrentDisplay only
Visual counting tasksImage-grounded countingFine-grained visual perception
Display only

CountBench 2026 · updated June 2, 2026

RefCOCO (avg)

2026

RefCOCO average

A referring-expression grounding benchmark averaged across RefCOCO variants to test whether a model can localize described objects correctly.

CurrentDisplay only
Referring-expression groundingGrounded visual localizationFine-grained visual grounding
Display only

RefCOCO (avg) 2026 · updated June 2, 2026

ODINW13

2026

ODINW13

A visual detection and grounding benchmark slice used to compare zero-shot object understanding across diverse domains.

CurrentDisplay only
Out-of-distribution object understandingDetection and groundingRobust visual grounding
Display only

ODINW13 2026 · updated June 2, 2026

ERQA

2026

ERQA

An embodied reasoning benchmark (Embodied Reasoning QA) that asks visual questions about spatial reasoning and physical-world understanding over real images.

CurrentDisplay only
Embodied visual QAGrounded image reasoningGrounded multimodal reasoning
Display only

ERQA 2026 · updated June 2, 2026

VideoMMMU

2026

VideoMMMU

A video extension of MMMU-style multimodal reasoning over expert questions grounded in temporal media.

CurrentDisplay only
Video-grounded expert reasoningVideo + text reasoningFrontier multimodal video reasoning
Display only

VideoMMMU 2026 · updated June 2, 2026

MLVU (M-Avg)

2026

MLVU multiple-choice average

A multi-task video understanding benchmark, reported as the average across MLVU's multiple-choice tasks (M-Avg).

CurrentDisplay only
General video understandingVideo QA and understandingBroad multimodal video reasoning
Display only

MLVU (M-Avg) 2026 · updated June 2, 2026

MMVU

2026

Multimodal Multi-disciplinary Video Understanding

A benchmark for evaluating multimodal models on video understanding tasks across multiple disciplines, emphasizing temporal reasoning and comprehension over video content.

CurrentDisplay only
Video understandingVideo reasoning benchmarkMulti-disciplinary multimodal video reasoning
Display only

MMVU 2026 · updated June 2, 2026

ScreenSpot Pro

2025

ScreenSpot Pro

A high-resolution GUI grounding benchmark for professional computer-use environments.

CurrentDisplay only
GUI grounding tasksInterface element localizationProfessional GUI grounding
Display only

ScreenSpot Pro 2025 · updated June 2, 2026

TIR-Bench

2026

TIR-Bench

A visual agent benchmark for interface reasoning and task execution over screenshots or software surfaces.

CurrentDisplay only
Visual agent and interface reasoningScreenshot-grounded task reasoningComputer-use visual reasoning
Display only

TIR-Bench 2026 · updated June 2, 2026

GDPval-AA

2026

GDPval-AA

An evaluation focused on professional domain expertise and task delivery quality in office-style knowledge work.

CurrentDisplay only
Professional office deliveryElo-style office benchmarkProfessional knowledge work
Display only

GDPval-AA 2026 · updated June 2, 2026

MedXpertQA (MM)

2026

MedXpertQA Multimodal

A multimodal medical multiple-choice benchmark covering clinical images such as X-rays, histology, and dermatology.

CurrentDisplay only
2,000 multimodal medical questionsMedical visual MCQClinical multimodal reasoning
Display only

MedXpertQA (MM) 2026 · updated June 2, 2026

ZeroBench

2026

ZeroBench

A multi-step visual reasoning benchmark with pass@5 reporting and optional tool use.

CurrentDisplay only
100 visual reasoning questionsMulti-step visual reasoningTool-augmented visual reasoning
Display only

ZeroBench 2026 · updated June 2, 2026
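
pass@k counts a question as solved when at least one of k sampled answers is correct. The directory does not state how the pass@5 figure is obtained; one common approach, sketched here as an assumption, is the unbiased estimator computed from n ≥ k samples of which c are correct: 1 − C(n−c, k) / C(n, k). The name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # in which case every size-k draw contains a correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With exactly five samples per question, this reduces to "solved if any of the five is correct"; the benchmark-level score is then the mean over questions.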

Design2Code

2026

Design2Code

A multimodal coding benchmark for turning visual designs into working frontend implementations.

CurrentDisplay only
Design-to-code tasksVisual input to frontend implementationMultimodal coding
Display only

Design2Code 2026 · updated June 2, 2026

Flame-VLM-Code

2026

Flame-VLM-Code

A vision-language coding benchmark for generating correct code from visual and multimodal inputs.

CurrentDisplay only
Multimodal coding tasksVision-language code generationMultimodal coding
Display only

Flame-VLM-Code 2026 · updated June 2, 2026

Vision2Web

2026

Vision2Web

A benchmark for converting visual references into functional web implementations.

CurrentDisplay only
Screenshot-to-web tasksVisual reference to web implementationMultimodal web generation
Display only

Vision2Web 2026 · updated June 2, 2026

ImageMining

2026

ImageMining

A multimodal retrieval and extraction benchmark over image-heavy task settings.

CurrentDisplay only
Visual retrieval tasksImage-grounded retrieval and extractionMultimodal retrieval
Display only

ImageMining 2026 · updated June 2, 2026

MMSearch

2026

MMSearch

A multimodal search benchmark for retrieval and grounded answering across mixed-media inputs.

CurrentDisplay only
Multimodal search tasksMixed-media retrieval and grounded answeringMultimodal search
Display only

MMSearch 2026 · updated June 2, 2026

MMSearch-Plus

2026

MMSearch-Plus

A harder MMSearch variant for multimodal retrieval and grounded tool-use workflows.

CurrentDisplay only
Hard multimodal search tasksAdvanced mixed-media retrieval benchmarkAdvanced multimodal search
Display only

MMSearch-Plus 2026 · updated June 2, 2026

SimpleVQA

2026

SimpleVQA

A visual question answering benchmark focused on straightforward image-grounded understanding.

CurrentDisplay only
Visual QA tasksImage-grounded question answeringGeneral visual understanding
Display only

SimpleVQA 2026 · updated June 2, 2026

Facts-VLM

2026

Facts-VLM

A grounded multimodal factuality benchmark for evidence-linked answer correctness.

CurrentDisplay only
Grounded factuality tasksEvidence-linked multimodal factualityGrounded multimodal factuality
Display only

Facts-VLM 2026 · updated June 2, 2026

V*

2026

V*

A vision-centric benchmark for high-level multimodal reasoning and perception quality.

CurrentDisplay only
Frontier multimodal reasoning tasksVision-centric reasoning benchmarkFrontier multimodal
Display only

V* 2026 · updated June 2, 2026

CharXiv

2024

CharXiv Reasoning

A scientific chart reasoning benchmark that tests whether models can understand, interpret, and reason about complex scientific visualizations including plots, diagrams, and data charts.

Refreshing
Scientific chart reasoningChart understanding and reasoningScientific visualization reasoning
Weighted 20%

CharXiv 2024 · updated June 2, 2026

CharXiv w/o tools

2024

CharXiv Reasoning without tools

Tool-free variant of CharXiv that isolates raw visual reasoning ability without code execution or tool augmentation.

Refreshing
Scientific chart reasoning (tool-free)Chart understanding without toolsScientific visualization reasoning
Weighted 5%

CharXiv w/o tools 2024 · updated June 2, 2026
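
The four weighted entries in this category (MMMU-Pro 45%, OfficeQA Pro 30%, CharXiv 20%, CharXiv w/o tools 5%) add up to 100%. The directory does not describe how the weights are combined, so the sketch below assumes a plain weighted average that renormalizes over whichever benchmarks a model actually reports. The function name and the example scores are hypothetical.

```python
# Weights as listed in this category, in percentage points.
MULTIMODAL_WEIGHTS = {
    "MMMU-Pro": 45,
    "OfficeQA Pro": 30,
    "CharXiv": 20,
    "CharXiv w/o tools": 5,
}


def weighted_category_score(scores: dict[str, float], weights: dict[str, int]) -> float:
    """Weighted average over the weighted benchmarks present in `scores`.

    Assumption: missing benchmarks are skipped and the remaining weights
    are renormalized; display-only benchmarks never enter the average.
    """
    available = {name: w for name, w in weights.items() if name in scores}
    total_weight = sum(available.values())
    if total_weight == 0:
        raise ValueError("no weighted benchmark scores available")
    return sum(scores[name] * w for name, w in available.items()) / total_weight


# Hypothetical scores for illustration only.
example_scores = {"MMMU-Pro": 80.0, "OfficeQA Pro": 60.0, "CharXiv": 70.0, "CharXiv w/o tools": 50.0}
category_score = weighted_category_score(example_scores, MULTIMODAL_WEIGHTS)
```

Renormalizing is only one possible policy; a stricter alternative would treat a missing weighted benchmark as a zero or exclude the model from the category ranking.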

SWE-bench Multimodal

2025

SWE-bench Multimodal

A multimodal variant of SWE-bench that adds visual context (screenshots, design mockups) to software engineering issue descriptions, testing whether models can leverage visual information for code generation.

CurrentDisplay only
Multimodal software engineering tasksCode patch generation with visual contextFrontier multimodal coding
Display only

SWE-bench Multimodal 2025 · updated June 2, 2026

Blueprint-Bench 2

2026

Blueprint-Bench 2

An agentic spatial reasoning benchmark reported as a normalized score.

CurrentDisplay only
Spatial reasoning from blueprintsNormalized scoreAgentic spatial reasoning
Display only

Blueprint-Bench 2 2026 · updated June 2, 2026

Knowledge(30 benchmarks)

View leaderboard

MMLU

2020

Massive Multitask Language Understanding

A comprehensive multiple-choice question answering test covering 57 tasks including elementary mathematics, US history, computer science, law, and more. Tests knowledge across diverse academic subjects from high school to professional level.

StaleSaturatedDisplay only
57 subjectsMultiple choice questionsElementary to professional level
Display only

MMLU · updated June 2, 2026

GPQA

2023

Graduate-Level Google-Proof Q&A

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Designed to be difficult even for skilled non-experts with access to Google.

Refreshing
448 questionsMultiple choice questionsGraduate level
Weighted 12%

GPQA Diamond · updated June 2, 2026

GPQA-D

2026

GPQA Diamond

A display-only GPQA Diamond reference from provider comparison charts.

CurrentDisplay only
Graduate-level science questionsMultiple choice questionsGraduate level
Display only

GPQA-D 2026 · updated June 2, 2026

SuperGPQA

2025

SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines

An expanded version of GPQA that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines, providing comprehensive coverage of academic domains.

Current
285 disciplinesMultiple choice questionsGraduate level
Weighted 12%

SuperGPQA 2025 · updated June 2, 2026

MMLU-Pro

2024

Massive Multitask Language Understanding Professional

An enhanced version of MMLU with 10 answer choices instead of 4, featuring more reasoning-focused questions that better differentiate frontier models.

Refreshing
Multiple subjects10-way multiple choiceProfessional level
Weighted 22%

MMLU-Pro · updated June 2, 2026

AGIEval

2026

AGIEval

A human-centric exam benchmark for general knowledge and reasoning reported in DeepSeek-V4 base-model evaluations.

CurrentDisplay only
General academic and professional exam questionsExact matchGeneral knowledge
Display only

AGIEval 2026 · updated June 2, 2026

HLE

2025

Humanity's Last Exam

An extremely challenging benchmark crowd-sourced from nearly 1,000 subject-matter experts worldwide, designed to probe the frontier of AI capabilities with questions that even specialists find difficult.

Current
Expert-level questionsOpen-ended and multiple choiceFrontier expert level
Weighted 23%

Humanity's Last Exam · updated June 2, 2026

FrontierScience

2026

FrontierScience

A benchmark for research-level scientific reasoning, designed to separate frontier models on difficult science tasks that mix domain knowledge with deep reasoning.

Current
Research-level science tasksScientific reasoning benchmarkResearch frontier
Weighted 18%

FrontierScience 2026 · updated June 2, 2026

Artificial Analysis Intelligence Index

2026

Artificial Analysis Intelligence Index

A display-only intelligence index published by Artificial Analysis that aggregates provider-reported and benchmark-derived signals into a single model-level score.

CurrentDisplay only
Cross-benchmark intelligence indexAggregated model scoreDisplay-only external reference
Display only

Artificial Analysis Intelligence Index 2026 · updated June 2, 2026

AA-GPQA Diamond

2026

Artificial Analysis GPQA Diamond

A display-only Artificial Analysis GPQA Diamond score.

CurrentDisplay only
Graduate-level science questionsAccuracyGraduate-level science reasoning
Display only

AA-GPQA Diamond 2026 · updated June 2, 2026

AA-HLE

2026

Artificial Analysis Humanity's Last Exam

A display-only Artificial Analysis Humanity's Last Exam score.

CurrentDisplay only
Expert-level questionsAccuracyFrontier expert reasoning
Display only

AA-HLE 2026 · updated June 2, 2026

AA-Omniscience Index

2026

Artificial Analysis Omniscience Index

A display-only Artificial Analysis factual knowledge index.

CurrentDisplay only
Knowledge questionsIndex scoreBroad factual knowledge
Display only

AA-Omniscience Index 2026 · updated June 2, 2026

AA-Omniscience Accuracy

2026

Artificial Analysis Omniscience Accuracy

A display-only Artificial Analysis knowledge metric for the proportion of correctly answered questions.

CurrentDisplay only
Knowledge questionsAccuracyBroad knowledge
Display only

AA-Omniscience Accuracy 2026 · updated June 2, 2026

AA-Omniscience Hallucination Rate

2026

Artificial Analysis Omniscience Hallucination Rate

A display-only Artificial Analysis factuality metric: of the questions a model does not answer correctly, the share it answers incorrectly rather than declining to answer. Lower is better.

CurrentDisplay only
Knowledge questionsHallucination rateFactuality
Display only

AA-Omniscience Hallucination Rate 2026 · updated June 2, 2026
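
The two Omniscience metrics above use different denominators: accuracy is measured over all questions, while hallucination rate is measured only over the questions that were not answered correctly. The sketch below makes that concrete; the breakdown of non-correct responses into `incorrect` and `abstained` counts is an assumption for illustration, as are the function names.

```python
def omniscience_accuracy(correct: int, incorrect: int, abstained: int) -> float:
    """Proportion of all questions answered correctly."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions graded")
    return correct / total


def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Share of non-correct responses that are wrong answers rather than abstentions."""
    non_correct = incorrect + abstained
    if non_correct == 0:
        return 0.0  # every question was answered correctly
    return incorrect / non_correct
```

A model that answers 60 questions correctly, 10 incorrectly, and abstains on 30 would have an accuracy of 0.6 and a hallucination rate of 0.25 under these definitions.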

SimpleQA

2024

Measuring Short-Form Factuality in Large Language Models

A benchmark that evaluates the ability of language models to answer short, fact-seeking questions accurately. Focuses on factual correctness rather than reasoning complexity.

Refreshing
Factual questionsShort-form Q&AFactual accuracy focused
Weighted 13%

SimpleQA 2024 · updated June 2, 2026

Chinese-SimpleQA

2026

Chinese-SimpleQA

A Chinese short-form factuality benchmark reported by DeepSeek for V4 model evaluations.

CurrentDisplay only
Chinese factual questionsShort-form factual QAFactual accuracy focused
Display only

Chinese-SimpleQA 2026 · updated June 2, 2026

OpenBookQA

2018

OpenBookQA

A science question-answering benchmark that tests whether models can apply a small open-book set of elementary science facts to multi-step reasoning questions.

StaleDisplay only
Elementary science questions4-way multiple choiceElementary science reasoning
Display only

OpenBookQA 2018 · updated June 2, 2026

HealthBench Hard

2026

HealthBench Hard

A harder subset of OpenAI's HealthBench for evaluating open-ended medical and health reasoning with rubric-based grading.

CurrentDisplay only
1,000 health promptsOpen-ended health evaluationAdvanced health reasoning
Display only

HealthBench Hard 2026 · updated June 2, 2026

MedXpertQA (Text)

2026

MedXpertQA Text

A medical multiple-choice benchmark spanning many specialties with 10 answer options per question.

CurrentDisplay only
2,450 medical multiple-choice questionsMedical MCQProfessional medical knowledge
Display only

MedXpertQA (Text) 2026 · updated June 2, 2026

FrontierScience Research

2026

FrontierScience Research

A research-focused FrontierScience evaluation variant for scientific investigation and problem solving.

CurrentDisplay only
Scientific research problemsResearch evaluationFrontier scientific research
Display only

FrontierScience Research 2026 · updated June 2, 2026

TruthfulQA

2021

TruthfulQA

A benchmark designed to measure whether language models produce truthful answers instead of repeating common misconceptions or misleading falsehoods.

StaleDisplay only
Truthfulness and misconception resistanceQuestion answeringHallucination and factuality stress test
Display only

TruthfulQA 2021 · updated June 2, 2026

HLE w/o tools

2026

Humanity's Last Exam without tools

Tool-free variant of Humanity's Last Exam that isolates a model's raw frontier reasoning.

CurrentDisplay only
Expert-level questionsTool-free expert QAFrontier expert level
Display only

HLE w/o tools 2026 · updated June 2, 2026

MMLU-Pro (Arcee)

2026

MMLU-Pro first-party comparison snapshot

A display-only MMLU-Pro reference from Arcee AI's Trinity-Large-Thinking launch chart.

CurrentDisplay only
Professional academic QA10-way multiple choiceProfessional level
Display only

MMLU-Pro (Arcee) 2026 · updated June 2, 2026

MMLU-Redux

2026

MMLU-Redux

A manually re-annotated refresh of MMLU that corrects erroneous questions and answer labels, intended to keep broad knowledge evaluation reliable now that the original benchmark is saturated for frontier models.

CurrentDisplay only
Broad academic QAMultiple choice questionsAdvanced general knowledge
Display only

MMLU-Redux 2026 · updated June 2, 2026

MMMLU

2026

MMMLU

A multilingual MMLU-style benchmark reported in provider evaluation tables.

CurrentDisplay only
Multilingual academic QAExact matchBroad multilingual knowledge
Display only

MMMLU 2026 · updated June 2, 2026

C-Eval

2023

C-Eval

A Chinese-language academic and professional benchmark spanning humanities, social science, STEM, and applied subjects.

StaleDisplay only
Chinese academic and professional examsMultiple choice questionsHigh school to professional level
Display only

C-Eval 2023 · updated June 2, 2026

CMMLU

2026

Chinese Massive Multitask Language Understanding

A Chinese multitask academic benchmark reported in DeepSeek-V4 base-model evaluations.

CurrentDisplay only
Chinese academic QAExact matchBroad Chinese knowledge
Display only

CMMLU 2026 · updated June 2, 2026

MultiLoKo

2026

MultiLoKo

A multilingual/localized knowledge benchmark reported in DeepSeek-V4 base-model evaluations.

CurrentDisplay only
Localized multilingual knowledge questionsExact matchMultilingual knowledge
Display only

MultiLoKo 2026 · updated June 2, 2026

FACTS Parametric

2026

FACTS Parametric

A parametric factuality benchmark reported in DeepSeek-V4 base-model evaluations.

CurrentDisplay only
Parametric factual recallExact matchFactual accuracy focused
Display only

FACTS Parametric 2026 · updated June 2, 2026

TriviaQA

2026

TriviaQA

A reading and trivia question-answering benchmark reported in DeepSeek-V4 base-model evaluations.

CurrentDisplay only
Trivia and reading-comprehension QAExact matchGeneral factual QA
Display only

TriviaQA 2026 · updated June 2, 2026

Multilingual(8 benchmarks)

View leaderboard

MGSM

2022

Multilingual Grade School Math

A multilingual benchmark that translates 250 grade school math problems from GSM8K into 10 typologically diverse languages: Bengali, German, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese. Together with the English originals, the set covers 11 languages.

Stale
250 problems × 11 languagesMath word problemsGrade school math, multilingual
Weighted 35%

MGSM 2022 · updated June 2, 2026

MMLU-ProX

2025

MMLU-ProX

A multilingual extension of professional-level academic evaluation across many languages.

Current
Multilingual professional QAMultilingual multiple choiceProfessional multilingual
Weighted 65%

MMLU-ProX 2025 · updated June 2, 2026

NOVA-63

2026

NOVA-63

A broad multilingual benchmark row from Qwen's launch comparisons intended to measure cross-lingual capability beyond a single language family.

CurrentDisplay only
Broad multilingual evaluationCross-lingual benchmarkBroad multilingual capability
Display only

NOVA-63 2026 · updated June 2, 2026

INCLUDE

2026

INCLUDE

A multilingual benchmark used in provider tables to measure inclusive language coverage and cross-lingual understanding beyond common high-resource languages.

CurrentDisplay only
Cross-lingual understandingMultilingual benchmarkBroad multilingual capability
Display only

INCLUDE 2026 · updated June 2, 2026

PolyMath

2026

PolyMath

A multilingual mathematical reasoning benchmark that tests whether math performance transfers across languages rather than only in English.

CurrentDisplay only
Multilingual math problemsCross-lingual mathematical reasoningAdvanced multilingual reasoning
Display only

PolyMath 2026 · updated June 2, 2026

VWT2k-lite

2026

VWT2k-lite

A lighter multilingual benchmark slice published in provider tables for broad cross-lingual transfer and understanding.

CurrentDisplay only
Multilingual transfer tasksCross-lingual benchmarkBroad multilingual capability
Display only

VWT2k-lite 2026 · updated June 2, 2026

MAXIFE

2026

MAXIFE

A multilingual instruction-following and understanding benchmark row published in Qwen's launch comparisons.

CurrentDisplay only
Multilingual instruction followingCross-lingual benchmarkAdvanced multilingual instruction following
Display only

MAXIFE 2026 · updated June 2, 2026

SWE Multilingual

2025

SWE-bench Multilingual

A multilingual extension of SWE-bench covering 300 problems across 9 programming languages, testing code generation and bug fixing beyond Python.

CurrentDisplay only
300 problems across 9 languagesMulti-language code patch generationProfessional multilingual software engineering
Display only

SWE Multilingual 2025 · updated June 2, 2026

Instruction Following(4 benchmarks)

View leaderboard

Mathematics(23 benchmarks)

View leaderboard

AIME 2023

2023

American Invitational Mathematics Examination 2023

A 15-question, 3-hour examination where each answer is an integer from 000 to 999. Serves as the intermediate step between AMC 10/12 and the USA Mathematical Olympiad (USAMO).

StaleDisplay only
15 problemsInteger answers 000-999High school olympiad level
Display only

AIME 2023 2023 · updated June 2, 2026

AIME 2024

2024

American Invitational Mathematics Examination 2024

The 2024 edition of AIME, maintaining the same format of 15 challenging mathematics problems with integer answers from 000 to 999.

RefreshingDisplay only
15 problemsInteger answers 000-999High school olympiad level
Display only

AIME 2024 2024 · updated June 2, 2026

AIME 2025

2025

American Invitational Mathematics Examination 2025

The 2025 edition of AIME, featuring 15 challenging mathematics problems that test olympiad-level mathematical reasoning, with integer answers from 000 to 999.

Current
15 problemsInteger answers 000-999High school olympiad level
Weighted 25%

AIME 2025 · updated June 2, 2026

GSM8K

2026

Grade School Math 8K

A grade-school mathematical reasoning benchmark reported in DeepSeek-V4 base-model evaluations.

CurrentDisplay only
Grade-school math word problemsExact matchGrade-school math
Display only

GSM8K 2026 · updated June 2, 2026

MATH

2026

MATH

A competition-style mathematical reasoning benchmark reported in DeepSeek-V4 base-model evaluations.

CurrentDisplay only
Competition math problemsExact matchAdvanced math reasoning
Display only

MATH 2026 · updated June 2, 2026

CMath

2026

CMath

A Chinese mathematical reasoning benchmark reported in DeepSeek-V4 base-model evaluations.

CurrentDisplay only
Chinese math problemsExact matchMath reasoning
Display only

CMath 2026 · updated June 2, 2026

AIME25 (Arcee)

2026

AIME25 first-party comparison snapshot

A display-only AIME25 reference from Arcee AI's Trinity-Large-Thinking launch chart.

CurrentDisplay only
15 problemsInteger answers 000-999High school olympiad level
Display only

AIME25 (Arcee) 2026 · updated June 2, 2026

HMMT Feb 2023

2023

Harvard-MIT Mathematics Tournament February 2023

A prestigious high school mathematics competition hosted jointly by Harvard and MIT, featuring challenging problems across various mathematical disciplines.

StaleDisplay only
Tournament problemsCompetition mathematicsHigh school olympiad level
Display only

HMMT Feb 2023 2023 · updated June 2, 2026

HMMT Feb 2024

2024

Harvard-MIT Mathematics Tournament February 2024

The 2024 February edition of the Harvard-MIT Mathematics Tournament, continuing the tradition of challenging high school mathematics competition.

RefreshingDisplay only
Tournament problemsCompetition mathematicsHigh school olympiad level
Display only

HMMT Feb 2024 2024 · updated June 2, 2026

HMMT Feb 2025

2025

Harvard-MIT Mathematics Tournament February 2025

The February 2025 edition of the Harvard-MIT Mathematics Tournament, featuring challenging problems in competitive mathematics.

CurrentDisplay only
Tournament problemsCompetition mathematicsHigh school olympiad level
Display only

HMMT Feb 2025 2025 · updated June 2, 2026

BRUMO 2025

2025

Brown University Math Olympiad 2025

A challenging mathematical olympiad competition featuring problems that test advanced mathematical reasoning and problem-solving skills at the olympiad level.

Current
Olympiad problemsMathematical olympiadMathematical olympiad level
Weighted 25%

BRUMO 2025 2025 · updated June 2, 2026

MATH-500

2021

MATH-500 Problem Set

A curated subset of 500 problems from the MATH dataset, covering algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus.

Stale
500 problemsFree-form mathematical answersHigh school to undergraduate
Weighted 15%

MATH-500 2021 · updated June 2, 2026

AIME26

2026

AIME 2026

A 2026 American Invitational Mathematics Examination snapshot used in frontier-model comparison tables for mathematical reasoning.

CurrentDisplay only
Competition math problemsShort-answer mathematicsOlympiad-style mathematics
Display only

AIME26 2026 · updated June 2, 2026

IPhO 2025 (Theory)

2026

International Physics Olympiad 2025 (Theory)

The three official theory problems from the 2025 International Physics Olympiad, scored with blinded human evaluation.

CurrentDisplay only
3 olympiad theory problemsPhysics olympiad theoryInternational olympiad physics
Display only

IPhO 2025 (Theory) 2026 · updated June 2, 2026

HMMT Feb 2025

2025

Harvard-MIT Mathematics Tournament February 2025

A February 2025 HMMT slice used in exact-value provider tables for advanced contest-math reasoning.

CurrentDisplay only
Competition math problemsContest mathematicsOlympiad-style mathematics
Display only

HMMT Feb 2025 2025 · updated June 2, 2026

HMMT Nov 2025

2025

Harvard-MIT Mathematics Tournament November 2025

A November 2025 HMMT slice for high-end mathematical reasoning comparisons.

CurrentDisplay only
Competition math problemsContest mathematicsOlympiad-style mathematics
Display only

HMMT Nov 2025 2025 · updated June 2, 2026

HMMT Feb 2026

2026

Harvard-MIT Mathematics Tournament February 2026

A February 2026 HMMT slice used in newer frontier-model math comparisons.

CurrentDisplay only
Competition math problemsContest mathematicsOlympiad-style mathematics
Display only

HMMT Feb 2026 2026 · updated June 2, 2026

IMOAnswerBench

2026

IMOAnswerBench

A challenging mathematical reasoning benchmark reported in DeepSeek-V4 model evaluations.

CurrentDisplay only
Advanced mathematical answer generationPass@1 math benchmarkOlympiad-level mathematics
Display only

IMOAnswerBench 2026 · updated June 2, 2026

Apex

2026

Apex

A high-difficulty mathematical reasoning benchmark reported in DeepSeek-V4 model evaluations.

CurrentDisplay only
Advanced mathematical reasoningPass@1 math benchmarkFrontier math reasoning
Display only

Apex 2026 · updated June 2, 2026

Apex Shortlist

2026

Apex Shortlist

A shortlist subset of the Apex mathematical reasoning benchmark reported in DeepSeek-V4 model evaluations.

CurrentDisplay only
Advanced mathematical reasoningPass@1 math benchmarkFrontier math reasoning
Display only

Apex Shortlist 2026 · updated June 2, 2026

MMAnswerBench

2026

MMAnswerBench

A multimodal mathematical reasoning benchmark that tests whether models can answer visually grounded math questions correctly.

Current · Display only
Multimodal math questions · Visual and structured mathematical QA · Advanced mathematical reasoning
Display only

MMAnswerBench 2026 · updated June 2, 2026

FrontierMath

2024

FrontierMath

An expert-level mathematical reasoning benchmark by Epoch AI featuring original, research-level problems created by mathematicians including IMO gold medalists and Fields Medal recipients. Problems require deep creativity and multi-step reasoning.

Refreshing
350 original research-level math problems · Open-ended mathematical reasoning with tool access · Research-level mathematics
Weighted 35%

FrontierMath 2024 · updated June 2, 2026

USAMO 2026

2026

United States of America Mathematical Olympiad 2026

The premier US mathematical olympiad competition, featuring proof-based problems that require deep mathematical insight and rigorous argumentation at the highest competition level.

Current · Display only
6 proof-based problems · Mathematical proof construction · International olympiad level
Display only

USAMO 2026 2026 · updated June 2, 2026

Korean(8 benchmarks)

KMMLU

2024

Korean Massive Multitask Language Understanding

Evaluates Korean expert-level knowledge across 45 subjects. 20% of questions require Korean cultural context.

Refreshing · Display only
35,030 questions · Multiple choice questions · Elementary to professional level in Korean
Display only

KMMLU 2024 · updated June 2, 2026

KMMLU-Hard

2025

KMMLU-Hard

A filtered hard subset of KMMLU containing ~5,000 questions that most models get wrong.

Current · Display only
~5,000 questions · Multiple choice questions · Advanced Korean reasoning
Display only

KMMLU-Hard 2025 · updated June 2, 2026

KMMLU-Redux

KMMLU-Redux

A cleaned version of KMMLU drawn from national technical qualification exams, with erroneous questions removed and the set decontaminated and deduplicated.

Refreshing · Display only
~3,500 questions · Technical multiple choice · Industrial/technical
Display only

KMMLU-Redux · updated June 2, 2026

KMMLU-Pro

KMMLU-Pro

Questions from Korean National Professional Licensure exams, evaluating professional-grade knowledge.

Refreshing · Display only
~2,500 questions · Professional licensure exams · Professional
Display only

KMMLU-Pro · updated June 2, 2026

CLIcK

Cultural and Linguistic Intelligence in Korean

Evaluates knowledge of Korean culture and the Korean language.

Refreshing · Display only
1,995 questions · Cultural/linguistic QA · Korean cultural nuances
Display only

CLIcK · updated June 2, 2026

KoBALT

Korean Benchmark for Advanced Linguistic Tasks

Evaluates advanced Korean linguistic competence.

Refreshing · Display only
Linguistics questions · Advanced linguistics · Advanced linguistic phenomena
Display only

KoBALT · updated June 2, 2026

Korean CSAT

College Scholastic Ability Test (수능)

South Korea's national college entrance exam, often described as the Korean SAT.

Refreshing · Display only
Multi-subject exam · Standardized test · High school to college level
Display only

Korean CSAT · updated June 2, 2026

HRM8K

HAE-RAE Math 8K

Evaluates Korean-language mathematical reasoning, from high-school to olympiad level.

Refreshing · Display only
8,011 instances · Math word problems · Olympiad level
Display only

HRM8K · updated June 2, 2026

External benchmark mirrors(15 benchmarks)

Vals Index

2026

Vals Index v1.1

Vals AI composite benchmark across finance and coding tasks, including Finance Agent v2, CorpFin v2, SWE-bench, Terminal-Bench 2.0, and Vibe Code Bench.

Current · Display only
Finance and coding components · Composite score · Private economic-work benchmark composite
Display only

Vals Index 2026 · updated June 2, 2026

Vals Multimodal Index

2026

Vals Multimodal Index v1.1

Vals AI multimodal composite across finance, coding, education, and mortgage-tax task families.

Current · Display only
Finance, coding, education, and mortgage-tax components · Composite score · Private multimodal economic-work benchmark composite
Display only

Vals Multimodal Index 2026 · updated June 2, 2026

CorpFin v2

2026

Vals CorpFin v2

Vals AI private benchmark for understanding long-context credit agreements.

Current · Display only
Credit-agreement understanding tasks · Accuracy score · Professional finance document reasoning
Display only

CorpFin v2 2026 · updated June 2, 2026

MedCode

2026

Vals MedCode

Vals AI healthcare benchmark that tests whether models can support the medical billing process.

Current · Display only
Medical billing support tasks · Accuracy score · Professional healthcare administration
Display only

MedCode 2026 · updated June 2, 2026

MedScribe

2026

Vals MedScribe

Vals AI healthcare benchmark that tests whether models can support doctors with administrative work.

Current · Display only
Medical administrative support tasks · Accuracy score · Professional healthcare administration
Display only

MedScribe 2026 · updated June 2, 2026

MortgageTax

2026

Vals MortgageTax

Vals AI benchmark for mortgage and tax document reasoning, including semantic and numerical extraction task views.

Current · Display only
Mortgage and tax extraction tasks · Accuracy score · Professional mortgage-tax document reasoning
Display only

MortgageTax 2026 · updated June 2, 2026

ProofBench

2026

Vals ProofBench

Vals AI automated theorem-proving benchmark.

Current · Display only
Automated theorem proving · Accuracy score · Formal proof reasoning
Display only

ProofBench 2026 · updated June 2, 2026

LegalBench

2026

Vals LegalBench

Vals AI legal benchmark with issue, rule, conclusion, interpretation, and rhetoric task views.

Current · Display only
Legal reasoning task views · Accuracy score · Professional legal reasoning
Display only

LegalBench 2026 · updated June 2, 2026

CaseLaw v2

2026

Vals CaseLaw v2

Vals AI private question-answer benchmark over Canadian court cases.

Current · Display only
Canadian case-law question answering · Accuracy score · Professional legal retrieval and reasoning
Display only

CaseLaw v2 2026 · updated June 2, 2026

DeepSWE

2026

DeepSWE

A long-horizon software engineering benchmark from Datacurve for measuring frontier coding agents on original tasks drawn from active open-source repositories.

Current · Display only
113 software engineering tasks · Solve rate · Long-horizon software engineering
Display only

DeepSWE 2026 · updated June 2, 2026

Vals SWE-bench mirror

2026

Vals-hosted SWE-bench mirror

Vals AI hosted SWE-bench view for solving production software engineering tasks.

Current · Display only
Software engineering issue-resolution tasks · Accuracy score · Production software engineering
Display only

Vals SWE-bench mirror 2026 · updated June 2, 2026

Vals Terminal-Bench 2.0 mirror

2026

Vals-hosted Terminal-Bench 2.0 mirror

Vals AI hosted Terminal-Bench 2.0 view with easy, medium, and hard task splits.

Current · Display only
Terminal task difficulty splits · Accuracy score · Terminal-based agent execution
Display only

Vals Terminal-Bench 2.0 mirror 2026 · updated June 2, 2026

Vals LiveCodeBench mirror

2026

Vals-hosted LiveCodeBench mirror

Vals AI implementation of LiveCodeBench with easy, medium, and hard task splits.

Current · Display only
Coding problem difficulty splits · Accuracy score · Contamination-resistant coding problems
Display only

Vals LiveCodeBench mirror 2026 · updated June 2, 2026

Vals GPQA Diamond mirror

2026

Vals-hosted GPQA Diamond mirror

Vals AI hosted GPQA Diamond view with few-shot and zero-shot chain-of-thought task splits.

Current · Display only
GPQA Diamond task splits · Accuracy score · Graduate science reasoning
Display only

Vals GPQA Diamond mirror 2026 · updated June 2, 2026

Vals MMLU-Pro mirror

2026

Vals-hosted MMLU-Pro mirror

Vals AI hosted MMLU-Pro view with subject-level task splits.

Current · Display only
MMLU-Pro subject splits · Accuracy score · Professional academic reasoning
Display only

Vals MMLU-Pro mirror 2026 · updated June 2, 2026