Explore 225 benchmarks used to evaluate AI language models across 10 categories.
Terminal-Bench 2.0
A benchmark for agentic software engineering tasks executed in real terminal environments. Models must inspect files, run commands, edit code, and recover from errors over multi-step workflows.
Terminal-Bench 2 · updated June 2, 2026
BrowseComp
A benchmark for web-browsing agents that must search, inspect sources, gather evidence, and return the correct answer to research-oriented questions.
BrowseComp 2026 · updated June 2, 2026
Humanity's Last Exam with tools
Tool-augmented Humanity's Last Exam scores reported in DeepSeek-V4 thinking-mode evaluations.
HLE w/ tools 2026 · updated June 2, 2026
GDPval-AA
An agentic real-world work-task evaluation reported as an Elo score in DeepSeek-V4 thinking-mode evaluations.
GDPval-AA 2026 · updated June 2, 2026
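For readers unfamiliar with Elo-style reporting, the sketch below shows the standard Elo expected-score formula. It is only an illustration of how a rating gap translates into an expected head-to-head result; the source does not say how GDPval-AA ratings are actually fitted, and the function name `elo_expected_score` and the 400-point scale are assumptions taken from the conventional Elo model.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the conventional Elo logistic model.

    Assumption: the usual base-10, 400-point scale. A given leaderboard may
    use a different scale or fitting procedure.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Equal ratings give an even expectation; a 400-point lead gives about 0.91.
even = elo_expected_score(1500, 1500)
lead = elo_expected_score(1900, 1500)
```

Under this model only the difference between two ratings matters, so absolute Elo values are not comparable across leaderboards with different anchors.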
GDPval-AA normalized
A display-only Artificial Analysis normalized score for economically valuable tasks.
GDPval-AA 2026 · updated June 2, 2026
Artificial Analysis Agentic Index
A display-only Artificial Analysis agentic index.
AA Agentic Index 2026 · updated June 2, 2026
APEX-Agents-AA
Artificial Analysis' implementation of the APEX-Agents benchmark for long-horizon professional-services agent tasks.
APEX-Agents-AA 2026 · updated June 2, 2026
Gert Labs Composite Game Benchmark
A game-environment benchmark that evaluates AI models in novel games covering strategic planning, resource management, spatial reasoning, cooperation, and theory of mind.
Gert Labs 2026 · updated June 2, 2026
OSWorld-Verified
A verified subset of OSWorld focused on computer-use tasks in desktop-like environments, including navigation, editing, and workflow completion.
OSWorld Verified · updated June 2, 2026
CyberGym
A cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance.
CyberGym 2026 · updated June 2, 2026
BrowseComp-VL
A vision-language browsing benchmark for multimodal web research and tool-use workflows.
BrowseComp-VL 2026 · updated June 2, 2026
OSWorld
A computer-use benchmark for GUI task completion across the broader OSWorld task suite.
OSWorld 2026 · updated June 2, 2026
AndroidWorld
A mobile GUI agent benchmark for completing Android app workflows and on-device tasks.
AndroidWorld 2026 · updated June 2, 2026
WebVoyager
A browser-agent benchmark for completing multi-step workflows on live websites.
WebVoyager 2026 · updated June 2, 2026
MCP Atlas
A benchmark for tool-calling over Model Context Protocol integrations and external tools.
MCP Atlas 2026 · updated June 2, 2026
Toolathlon
A tool-use benchmark focused on selecting, sequencing, and completing tasks with external tools.
Toolathlon 2026 · updated June 2, 2026
ZClawBench
A Z.AI benchmark for OpenClaw-style agent workflows spanning information search, office work, data analysis, development and operations, automation, and security.
ZClawBench 2026 · updated June 2, 2026
Tau2-Telecom
A telecom-oriented tool benchmark that measures structured tool use in domain workflows.
τ²-Bench 2026 · updated June 2, 2026
DeepSearchQA
An agentic browsing benchmark where models search the web, gather evidence, and answer list-style questions using browser tools.
DeepSearchQA 2026 · updated June 2, 2026
Tau2-Airline
An airline-domain tool-use benchmark for structured workflow execution and API correctness.
Tau2-Airline 2026 · updated June 2, 2026
PinchBench
An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows.
PinchBench 2026 · updated June 2, 2026
OpenHands Index
A holistic coding-agent benchmark that evaluates AI agents across issue resolution, frontend work, greenfield development, testing, and information gathering.
OpenHands Index 2025 · updated June 2, 2026
SWE-Atlas Refactoring
A Scale SWE-Atlas software-engineering agent benchmark focused on refactoring tasks.
SWE-Atlas Refactoring 2026 · updated June 2, 2026
InferenceBench
A benchmark for open-ended LLM inference optimization by AI agents. Agents receive a base model, one H100, and a fixed time budget to build a valid OpenAI-compatible inference server that improves serving speed.
InferenceBench 2026 · updated June 2, 2026
Berkeley Function Calling Leaderboard v4
A function-calling benchmark for tool selection, schema adherence, and argument correctness.
BFCL v4 2026 · updated June 2, 2026
MLE-Bench Lite
A lightweight machine-learning competition benchmark that measures whether models can iteratively train, evaluate, and improve ML systems in low-resource settings.
MLE-Bench Lite 2026 · updated June 2, 2026
MM-ClawBench
An OpenClaw-derived agent benchmark covering practical work and life tasks such as office document delivery, research, planning, and code maintenance.
MM-ClawBench 2026 · updated June 2, 2026
Claw-Eval
A transparent real-world autonomous-agent benchmark with 300 human-verified tasks, 2,159 rubric items, and Pass^3 scoring across general, multi-turn, and native multimodal agent tasks.
Claw-Eval 2026 · updated June 2, 2026
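Pass^k scoring is usually read as "a task counts only if all k independent trials succeed", which rewards consistency rather than a single lucky run. The sketch below assumes that reading; Claw-Eval's exact aggregation is not described in the source, and the function name and data layout are hypothetical.

```python
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k trials all succeeded.

    Assumption: `results` maps a task id to its per-trial pass/fail outcomes.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    solved = 0
    for task_id, trials in results.items():
        if len(trials) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
        if all(trials[:k]):
            solved += 1
    return solved / len(results)


# Hypothetical outcomes for four tasks, three trials each.
example = {
    "a": [True, True, True],
    "b": [True, False, True],
    "c": [False, False, False],
    "d": [True, True, True],
}
score = pass_hat_k(example, k=3)
```

In this example only tasks "a" and "d" count, so a model that solves task "b" two times out of three gets no credit for it.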
QwenClawBench
Qwen's internal OpenClaw-style benchmark for measuring broad real-world agent performance across practical productivity and research tasks.
QwenClawBench 2026 · updated June 2, 2026
QwenWebBench
A Qwen benchmark for artifact and webpage generation quality reported as an Elo-style rating.
QwenWebBench 2026 · updated June 2, 2026
TAU3-Bench
A next-generation tool-use benchmark for complex, long-horizon agent workflows beyond the older tau2 telecom and airline task families.
TAU3-Bench 2026 · updated June 2, 2026
VITA-Bench
An interactive real-world agent benchmark grounded in practical consumer-service tasks such as delivery, in-store consumption, and online travel workflows.
VITA-Bench 2025 · updated June 2, 2026
DeepPlanning
A long-horizon planning benchmark that tests whether agents can optimize under explicit time, budget, and feasibility constraints.
DeepPlanning 2026 · updated June 2, 2026
MCP-Tasks
A Model Context Protocol task benchmark used in Qwen's launch tables to measure practical execution over MCP-style tools and integrations.
MCP-Tasks 2026 · updated June 2, 2026
WideResearch
A broad research-agent benchmark for open-ended information gathering, synthesis, and answer construction across wide search spaces.
WideResearch 2026 · updated June 2, 2026
General AI Assistants
GAIA evaluates AI models on real-world tasks that are conceptually simple for humans but hard for AI systems because they require multi-step reasoning, web browsing, tool use, and multimodal understanding. Tasks span three difficulty levels and test practical assistant capabilities rather than academic knowledge.
GAIA 2024 · updated June 2, 2026
Tool-Agent-User Benchmark
TAU-bench evaluates AI agents in realistic enterprise scenarios requiring multi-turn tool use, database interactions, and policy adherence. It tests across retail and airline domains, measuring an agent's ability to reliably complete customer service tasks while following complex business rules.
TAU-bench 2024 · updated June 2, 2026
WebArena Web Agent Benchmark
WebArena is a realistic web environment for evaluating autonomous AI agents on complex, multi-step browser tasks. Agents must navigate e-commerce sites, forums, content management systems, and code repositories to complete practical objectives like purchasing items, finding information, and managing accounts.
WebArena 2024 · updated June 2, 2026
Multi-Environment Web Challenge
A benchmark that evaluates AI agents on multi-environment web challenges, testing navigation and task completion across diverse live web environments.
MEWC 2026 · updated June 2, 2026
Finance Agent v2
A Vals AI benchmark for realistic financial-analyst agent tasks across qualitative analysis, quantitative analysis, market work, comparables, precedents, earnings, disclosure, and modeling.
Finance Agent v2 2026 · updated June 2, 2026
Evaluating Large Language Models Trained on Code
A set of 164 handwritten programming problems that test the ability to generate correct Python functions from natural language descriptions. Each problem includes a function signature, a docstring, a reference body, and several unit tests.
HumanEval · updated June 2, 2026
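The problem layout described above (signature and docstring as the prompt, a body as the completion, unit tests as the grader) can be sketched as follows. The problem itself is invented for illustration and is not taken from the dataset, and `passes` is a hypothetical helper; a real harness would run untrusted completions in a sandbox with timeouts rather than calling `exec` directly.

```python
# Hypothetical problem in the HumanEval style: prompt + completion + tests.
PROMPT = '''def add_positive(numbers):
    """Return the sum of the strictly positive integers in numbers.

    >>> add_positive([1, -2, 3])
    4
    """
'''

COMPLETION = "    return sum(n for n in numbers if n > 0)\n"

TEST = '''def check(candidate):
    assert candidate([1, -2, 3]) == 4
    assert candidate([]) == 0
    assert candidate([-5, -1]) == 0
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if the completed function passes every unit test.

    Warning: exec runs arbitrary code; this is for illustration only.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion + "\n" + test, namespace)
        namespace["check"](namespace[entry_point])
    except Exception:
        return False
    return True


ok = passes(PROMPT, COMPLETION, TEST, "add_positive")
bad = passes(PROMPT, "    return 0\n", TEST, "add_positive")
```

Grading is purely functional: a completion is correct only if every assertion in the test passes, regardless of how closely it resembles the reference body.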
BigCodeBench
A code-generation benchmark reported in DeepSeek-V4 base-model evaluations.
BigCodeBench 2026 · updated June 2, 2026
Codeforces Rating
Competitive-programming rating reported for DeepSeek-V4 thinking-mode evaluations.
Codeforces 2026 · updated June 2, 2026
Terminal-Bench 2.0
A benchmark for agentic software engineering tasks executed in real terminal environments. DeepSeek reports it in the agentic section, while BenchLM also mirrors it in coding for models that publish it as a developer-task signal.
Terminal-Bench 2 · updated June 2, 2026
Software Engineering Benchmark Verified
A curated, human-verified subset of SWE-bench that tests models on resolving real GitHub issues from popular open-source Python repositories like Django, Flask, and scikit-learn.
SWE-bench Verified 2024 · updated June 2, 2026
SWE-Rebench
A continuously updated software engineering benchmark by Nebius using fresh GitHub issues to avoid contamination. Models are evaluated 5 times per problem under a fixed ReAct scaffolding; the Resolved Rate (best pass@1) is reported.
Rolling 2026 window · updated June 2, 2026
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
A continuously updated benchmark using fresh competitive programming problems from LeetCode, Codeforces, and AtCoder to provide contamination-free code generation evaluation.
Rolling 2026 set · updated June 2, 2026
LiveCodeBench v6
A newer LiveCodeBench slice used in provider comparison tables to benchmark contamination-resistant coding performance on fresher competitive programming sets.
LiveCodeBench v6 2026 · updated June 2, 2026
LiveCodeBench Pro
A harder competitive-programming benchmark family built from Codeforces, ICPC, and IOI problems, with quarter-specific public leaderboards and difficulty-aware reporting.
LiveCodeBench Pro 2025 · updated June 2, 2026
FLTEval
A repository-level Lean 4 proof engineering benchmark that measures whether a model can complete formal proofs and correctly define new mathematical concepts inside realistic FLT project pull requests.
FLTEval 2026 · updated June 2, 2026
SWE-bench Pro
A harder coding-agent benchmark than SWE-bench Verified, intended to differentiate frontier models on realistic software engineering work.
SWE-bench Pro 2026 · updated June 2, 2026
SWE Multilingual
A multilingual software-engineering benchmark for real-world code issue resolution across multiple programming languages.
SWE Multilingual 2026 · updated June 2, 2026
SWE-bench Multimodal
A multimodal variant of SWE-bench that adds visual context such as screenshots and design mockups to software engineering issue descriptions.
SWE Multimodal 2025 · updated June 2, 2026
CursorBench v3.1
Cursor's first-party harder-task benchmark for long-horizon agentic coding behavior inside the Cursor agent loop.
CursorBench v3.1 2026 · updated June 2, 2026
Multi-SWE Bench
A multi-language software-engineering benchmark that measures repository-level bug fixing and implementation across more than one programming ecosystem.
Multi-SWE Bench 2026 · updated June 2, 2026
VIBE-Pro
A repo-level code generation and full-project delivery benchmark spanning web, mobile, and simulation-style implementation tasks.
VIBE-Pro 2026 · updated June 2, 2026
Vibe Code Bench v1.1
A Vals AI benchmark for evaluating whether models can build complete web applications from natural language specifications in a production-like development environment.
Vibe Code Bench 2026 · updated June 2, 2026
ProgramBench: Can Language Models Rebuild Programs From Scratch?
A cleanroom software-engineering benchmark where agents receive only a compiled executable and documentation, then must architect and implement a complete codebase that reproduces the original program's behavior.
ProgramBench 2026 · updated June 2, 2026
NL2Repo
A repository-understanding benchmark that measures whether models can map natural-language requests onto the right code locations and system changes.
NL2Repo 2026 · updated June 2, 2026
React Native Evals
An open benchmark for AI coding agents on real-world React Native implementation tasks, emphasizing working app behavior, recommended architecture choices, and strict constraint adherence.
React Native Evals 2026 · updated June 2, 2026
AI Agent Evaluations for Next.js
A Vercel benchmark for AI coding agents on Next.js code generation and migration tasks, reporting success rate, average execution time, and an AGENTS.md documentation-assisted split.
Next.js Evals 2026 · updated June 2, 2026
SWE-bench Verified (mini-swe-agent-v2)
A display-only SWE-bench Verified reference from Arcee AI's Trinity-Large-Thinking comparison chart.
SWE-bench Verified* 2026 · updated June 2, 2026
Spider 2.0-Lite
A text-to-SQL benchmark over realistic warehouse-scale schemas, reported by Interfaze for model comparison.
Spider 2.0-Lite 2024 · updated June 2, 2026
Scientific Code Benchmark
SciCode evaluates language models on generating code for realistic scientific research problems across 16 subfields of physics, math, chemistry, biology, and materials science. Problems decompose into 338 subproblems requiring domain knowledge recall, scientific reasoning, and precise code synthesis. The problems are based on real scripts from published research.
SciCode 2024 · updated June 2, 2026
Artificial Analysis Coding Index
A display-only Artificial Analysis coding index.
AA Coding Index 2026 · updated June 2, 2026
Artificial Analysis SciCode
A display-only Artificial Analysis SciCode score.
AA-SciCode 2026 · updated June 2, 2026
Terminal-Bench Hard
A display-only Artificial Analysis coding metric for agentic coding and terminal use on a harder Terminal-Bench slice.
Terminal-Bench Hard 2026 · updated June 2, 2026
Testing the Limits of Chain-of-thought with Multistep Soft Reasoning
A dataset for evaluating language models on multistep soft reasoning tasks specified in natural language narratives. Tests the ability to perform complex, structured reasoning.
MuSR 2023 · updated June 2, 2026
BIG-Bench Hard
A suite of 23 challenging tasks from the BIG-Bench collaborative benchmark on which prior language model evaluations did not exceed average human-rater performance. The tasks are commonly used to measure how much chain-of-thought prompting improves multi-step reasoning.
BBH 2022 · updated June 2, 2026
Discrete Reasoning Over Paragraphs
A reading-comprehension benchmark requiring discrete reasoning over paragraphs, reported in DeepSeek-V4 base-model evaluations.
DROP 2026 · updated June 2, 2026
HellaSwag
A commonsense natural-language inference benchmark reported in DeepSeek-V4 base-model evaluations.
HellaSwag 2026 · updated June 2, 2026
WinoGrande
A commonsense coreference benchmark reported in DeepSeek-V4 base-model evaluations.
WinoGrande 2026 · updated June 2, 2026
CLUEWSC
A Chinese Winograd Schema Challenge benchmark reported in DeepSeek-V4 base-model evaluations.
CLUEWSC 2026 · updated June 2, 2026
LisanBench
A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.
LisanBench 2026 · updated June 2, 2026
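The chain rule described above can be checked mechanically. The sketch below assumes "edit-distance-1" means a Levenshtein distance of exactly one (a single substitution, insertion, or deletion) and omits any dictionary lookup; LisanBench's actual validity rules may differ, and both function names are hypothetical.

```python
from typing import List


def is_edit_distance_one(a: str, b: str) -> bool:
    """True if a and b differ by exactly one substitution, insertion, or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if len(a) > len(b):
        a, b = b, a  # make a the shorter word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    # Skipping one character of the longer word must align the remainders.
    return a[i:] == b[i + 1:]


def valid_chain_length(chain: List[str]) -> int:
    """Number of words accepted before the first repeat or invalid step."""
    seen = set()
    previous = None
    count = 0
    for word in chain:
        if word in seen:
            break
        if previous is not None and not is_edit_distance_one(previous, word):
            break
        seen.add(word)
        previous = word
        count += 1
    return count


length = valid_chain_length(["cold", "cord", "card", "ward", "warm"])
```

A scorer along these lines rewards long chains, so a model has to plan ahead to avoid dead ends where every neighbouring word has already been used.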
Pencil Puzzle Bench
A multi-step verifiable reasoning benchmark that evaluates whether models can solve pencil puzzles with unique solutions.
Pencil Puzzle Bench 2026 · updated June 2, 2026
LongBench v2
A long-context benchmark that measures whether models can actually use extended context windows for reasoning and retrieval.
LongBench v2 2025 · updated June 2, 2026
MRCRv2
A long-context benchmark for memory, retrieval, and multi-round coherence over large contexts.
MRCRv2 2025 · updated June 2, 2026
OpenAI MRCR v2 8-needle 64K-128K
MRCR v2 slice focused on long-context retrieval at 64K-128K lengths.
MRCR v2 64K-128K 2026 · updated June 2, 2026
OpenAI MRCR v2 8-needle 128K-256K
MRCR v2 slice focused on very long contexts at 128K-256K lengths.
MRCR v2 128K-256K 2026 · updated June 2, 2026
Graphwalks BFS 0K-128K
Long-context graph traversal benchmark using breadth-first search tasks.
Graphwalks BFS 128K 2026 · updated June 2, 2026
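For reference, the ground truth for a breadth-first search task of this kind is cheap to compute in code; the difficulty for a model lies in doing the same traversal over a graph spread across a very long prompt. The sketch below assumes a directed edge list and a question of the form "which nodes are exactly d steps from the start"; the benchmark's exact prompt format is not given in the source, and the function name is hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def bfs_nodes_at_depth(
    edges: Iterable[Tuple[Hashable, Hashable]], start: Hashable, depth: int
) -> Set[Hashable]:
    """Return the nodes whose shortest directed distance from start equals depth."""
    adjacency: Dict[Hashable, List[Hashable]] = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == depth:
            continue  # no need to expand beyond the requested depth
        for neighbour in adjacency.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return {node for node, d in distance.items() if d == depth}


edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
frontier = bfs_nodes_at_depth(edges, "a", 2)
```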
Graphwalks parents 0-128K
Long-context benchmark for recovering parent relationships inside graph tasks.
Graphwalks Parents 128K 2026 · updated June 2, 2026
MRCR 1M
A million-token MRCR long-context retrieval benchmark reported in DeepSeek-V4 model evaluations.
MRCR 1M 2026 · updated June 2, 2026
CorpusQA 1M
A million-token CorpusQA long-context question-answering benchmark reported in DeepSeek-V4 model evaluations.
CorpusQA 1M 2026 · updated June 2, 2026
Abstraction and Reasoning Corpus for AGI v2
A benchmark measuring fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. Often described as one of the hardest public reasoning benchmarks; the listed average individual human performance is 66%.
ARC-AGI 2 · updated June 2, 2026
AI-Needle
A long-context retrieval benchmark that measures whether a model can recover relevant information embedded deep inside very long contexts.
AI-Needle 2026 · updated June 2, 2026
GPQA Diamond
The hardest subset of GPQA featuring the most challenging graduate-level science questions. Sometimes reported separately from the standard GPQA benchmark.
GPQA Diamond 2023 · updated June 2, 2026
Artificial Analysis Long Context Reasoning
A display-only Artificial Analysis long-context reasoning evaluation.
AA-LCR 2026 · updated June 2, 2026
Critical Physics Tasks
A display-only Artificial Analysis metric for research-level physics reasoning.
CritPt 2026 · updated June 2, 2026
BullshitBench v2
A benchmark that tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. Measures the critical ability to push back on bad input.
BullshitBench v2 2025 · updated June 2, 2026
WildBench
An automated evaluation framework using 1,000+ real-world user tasks covering reasoning, planning, coding, and creative writing. Its scores correlate strongly with Chatbot Arena human preference rankings.
WildBench 2024 · updated June 2, 2026
Massive Multi-discipline Multimodal Understanding
A broad multimodal reasoning benchmark spanning charts, diagrams, tables, and academic visual question answering.
MMMU 2024 · updated June 2, 2026
Massive Multi-discipline Multimodal Understanding Pro
A harder multimodal benchmark for frontier models that combines text with images, diagrams, charts, and academic visual reasoning tasks.
MMMU-Pro 2024 · updated June 2, 2026
Artificial Analysis MMMU-Pro
A display-only Artificial Analysis MMMU-Pro score.
AA-MMMU-Pro 2026 · updated June 2, 2026
OCRBench V2
A native OCR benchmark for reading text from images across multilingual scripts, low-quality scans, handwriting, structured layouts, charts, and screenshots.
OCRBench V2 2025 · updated June 2, 2026
olmOCR-Bench
An end-to-end document understanding benchmark over long, layout-rich PDFs with tables, equations, headers, footnotes, and multi-column flows.
olmOCR 2025 · updated June 2, 2026
VoxPopuli-Cleaned-AA Word Error Rate
A speech-recognition benchmark on the cleaned Artificial Analysis VoxPopuli subset, reported as word error rate where lower is better.
VoxPopuli WER 2026 · updated June 2, 2026
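Word error rate is conventionally the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words. The sketch below implements that textbook definition; it assumes whitespace tokenization and applies no text normalization, both of which a production evaluation would likely handle differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Rolling-row Levenshtein distance over words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.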
Design Arena Website Elo
A display-only Design Arena website-generation Elo score surfaced on OpenRouter model benchmark pages.
Design Arena Website 2026 · updated June 2, 2026
OfficeQA Pro
A benchmark for grounded reasoning over office-style documents, spreadsheets, charts, and business artifacts.
OfficeQA Pro 2026 · updated June 2, 2026
MMMU-Pro with Python
Tool-augmented MMMU-Pro variant that allows Python assistance during multimodal reasoning.
MMMU-Pro w/ Python 2026 · updated June 2, 2026
OmniDocBench 1.5
A document understanding benchmark used in frontier-model comparison tables to measure extraction and grounded reasoning quality on complex documents.
OmniDocBench 1.5 2026 · updated June 2, 2026
RealWorldQA
A grounded visual QA benchmark focused on answering practical questions about real-world images and scenes.
RealWorldQA 2026 · updated June 2, 2026
Video-MME with subtitle
A video understanding benchmark that allows subtitle access when answering multimodal questions about videos.
Video-MME (with subtitle) 2026 · updated June 2, 2026
Video-MME without subtitle
A stricter Video-MME setting that removes subtitle help and tests video understanding from visual and audio context alone.
Video-MME (w/o subtitle) 2026 · updated June 2, 2026
Video-MME
A comprehensive benchmark for multimodal large language models on video understanding, covering temporal reasoning, perception, and question answering over videos.
Video-MME 2024 · updated June 2, 2026
MathVision
A visual mathematics benchmark that tests whether a model can solve math problems grounded in diagrams, equations, figures, and other visual inputs.
MathVision 2026 · updated June 2, 2026
We-Math
A multimodal math benchmark for visually grounded mathematical reasoning and answer generation.
We-Math 2026 · updated June 2, 2026
DynaMath
A multimodal benchmark for dynamic mathematical reasoning over visual and structured inputs.
DynaMath 2026 · updated June 2, 2026
MStar
A general visual question-answering benchmark used in provider tables for real-image reasoning quality.
MStar 2026 · updated June 2, 2026
ChatCVQA
A conversational visual QA benchmark that tests multi-turn grounded answering over images and documents.
ChatCVQA 2026 · updated June 2, 2026
MMLongBench-Doc
A long-document multimodal benchmark for grounded reasoning over extended document contexts.
MMLongBench-Doc 2026 · updated June 2, 2026
CC-OCR
An OCR-focused benchmark for reading and extracting text from visually complex documents and images.
CC-OCR 2026 · updated June 2, 2026
AI2D test split
A diagram understanding benchmark focused on scientific and educational visual question answering.
AI2D_TEST 2026 · updated June 2, 2026
CountBench
A visual counting benchmark that tests whether a model can count objects and entities reliably in complex scenes.
CountBench 2026 · updated June 2, 2026
RefCOCO average
A referring-expression grounding benchmark averaged across RefCOCO variants to test whether a model can localize described objects correctly.
RefCOCO (avg) 2026 · updated June 2, 2026
ODINW13
A visual detection and grounding benchmark slice used to compare zero-shot object understanding across diverse domains.
ODINW13 2026 · updated June 2, 2026
ERQA
A grounded visual reasoning benchmark focused on evidence-based question answering over real images.
ERQA 2026 · updated June 2, 2026
VideoMMMU
A video extension of MMMU-style multimodal reasoning over expert questions grounded in temporal media.
VideoMMMU 2026 · updated June 2, 2026
MLVU mean average
A multi-task video understanding benchmark averaged across MLVU categories.
MLVU (M-Avg) 2026 · updated June 2, 2026
Multimodal Multi-disciplinary Video Understanding
A benchmark for evaluating multimodal models on video understanding tasks across multiple disciplines, emphasizing temporal reasoning and comprehension over video content.
MMVU 2026 · updated June 2, 2026
ScreenSpot Pro
A high-resolution GUI grounding benchmark for professional computer-use environments.
ScreenSpot Pro 2025 · updated June 2, 2026
TIR-Bench
A visual agent benchmark for interface reasoning and task execution over screenshots or software surfaces.
TIR-Bench 2026 · updated June 2, 2026
GDPval-AA
An evaluation focused on professional domain expertise and task delivery quality in office-style knowledge work.
GDPval-AA 2026 · updated June 2, 2026
MedXpertQA Multimodal
A multimodal medical multiple-choice benchmark covering clinical images such as X-rays, histology, and dermatology.
MedXpertQA (MM) 2026 · updated June 2, 2026
ZeroBench
A multi-step visual reasoning benchmark with pass@5 reporting and optional tool use.
ZeroBench 2026 · updated June 2, 2026
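pass@k is the probability that at least one of k sampled attempts is correct. A common way to estimate it from n attempts with c successes is the unbiased estimator 1 - C(n-c, k) / C(n, k), sketched below. Whether ZeroBench uses this estimator or simply draws exactly five samples is not stated in the source, so treat this as the general technique rather than the benchmark's scoring code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts, c of which were correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


estimate = pass_at_k(n=10, c=2, k=5)
```

When n equals k the estimator reduces to "did any of the k attempts succeed", which is the simplest form of pass@k reporting.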
Design2Code
A multimodal coding benchmark for turning visual designs into working frontend implementations.
Design2Code 2026 · updated June 2, 2026
Flame-VLM-Code
A vision-language coding benchmark for generating correct code from visual and multimodal inputs.
Flame-VLM-Code 2026 · updated June 2, 2026
Vision2Web
A benchmark for converting visual references into functional web implementations.
Vision2Web 2026 · updated June 2, 2026
ImageMining
A multimodal retrieval and extraction benchmark over image-heavy task settings.
ImageMining 2026 · updated June 2, 2026
MMSearch
A multimodal search benchmark for retrieval and grounded answering across mixed-media inputs.
MMSearch 2026 · updated June 2, 2026
MMSearch-Plus
A harder MMSearch variant for multimodal retrieval and grounded tool-use workflows.
MMSearch-Plus 2026 · updated June 2, 2026
SimpleVQA
A visual question answering benchmark focused on straightforward image-grounded understanding.
SimpleVQA 2026 · updated June 2, 2026
Facts-VLM
A grounded multimodal factuality benchmark for evidence-linked answer correctness.
Facts-VLM 2026 · updated June 2, 2026
V*
A vision-centric benchmark for high-level multimodal reasoning and perception quality.
V* 2026 · updated June 2, 2026
CharXiv Reasoning
A scientific chart reasoning benchmark that tests whether models can understand, interpret, and reason about complex scientific visualizations including plots, diagrams, and data charts.
CharXiv 2024 · updated June 2, 2026
CharXiv Reasoning without tools
Tool-free variant of CharXiv that isolates raw visual reasoning ability without code execution or tool augmentation.
CharXiv w/o tools 2024 · updated June 2, 2026
SWE-bench Multimodal
A multimodal variant of SWE-bench that adds visual context (screenshots, design mockups) to software engineering issue descriptions, testing whether models can leverage visual information for code generation.
SWE-bench Multimodal 2025 · updated June 2, 2026
Blueprint-Bench 2
An agentic spatial reasoning benchmark reported as a normalized score.
Blueprint-Bench 2 2026 · updated June 2, 2026
Massive Multitask Language Understanding
A comprehensive multiple-choice question answering test covering 57 tasks including elementary mathematics, US history, computer science, law, and more. Tests knowledge across diverse academic subjects from high school to professional level.
MMLU · updated June 2, 2026
Graduate-Level Google-Proof Q&A
A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Designed to be difficult even for skilled non-experts with access to Google.
GPQA Diamond · updated June 2, 2026
GPQA Diamond
A display-only GPQA Diamond reference from provider comparison charts.
GPQA-D 2026 · updated June 2, 2026
SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines
An expanded version of GPQA that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines, providing comprehensive coverage of academic domains.
SuperGPQA 2025 · updated June 2, 2026
Massive Multitask Language Understanding Professional
An enhanced version of MMLU with 10 answer choices instead of 4, featuring more reasoning-focused questions that better differentiate frontier models.
MMLU-Pro · updated June 2, 2026
AGIEval
A human-centric exam benchmark for general knowledge and reasoning reported in DeepSeek-V4 base-model evaluations.
AGIEval 2026 · updated June 2, 2026
Humanity's Last Exam
An extremely challenging benchmark crowd-sourced from roughly a thousand subject-matter experts worldwide, designed to probe the frontier of AI capabilities with questions that even specialists find difficult.
Humanity's Last Exam · updated June 2, 2026
FrontierScience
A benchmark for research-level scientific reasoning, designed to separate frontier models on difficult science tasks that mix domain knowledge with deep reasoning.
FrontierScience 2026 · updated June 2, 2026
Artificial Analysis Intelligence Index
A display-only intelligence index published by Artificial Analysis that aggregates provider-reported and benchmark-derived signals into a single model-level score.
Artificial Analysis Intelligence Index 2026 · updated June 2, 2026
Artificial Analysis GPQA Diamond
A display-only Artificial Analysis GPQA Diamond score.
AA-GPQA Diamond 2026 · updated June 2, 2026
Artificial Analysis Humanity's Last Exam
A display-only Artificial Analysis Humanity's Last Exam score.
AA-HLE 2026 · updated June 2, 2026
Artificial Analysis Omniscience Index
A display-only Artificial Analysis factual knowledge index.
AA-Omniscience Index 2026 · updated June 2, 2026
Artificial Analysis Omniscience Accuracy
A display-only Artificial Analysis knowledge metric for the proportion of correctly answered questions.
AA-Omniscience Accuracy 2026 · updated June 2, 2026
Artificial Analysis Omniscience Hallucination Rate
A display-only Artificial Analysis factuality metric for the share of non-correct responses in which the model gave an incorrect answer instead of abstaining or answering partially.
AA-Omniscience Hallucination Rate 2026 · updated June 2, 2026
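The accuracy and hallucination-rate entries above can be related with a small worked example. The sketch assumes each response is labeled correct, incorrect, or abstained (with "abstained" standing in for every other non-correct outcome); the function names and the three-way split are illustrative assumptions, not Artificial Analysis' published scoring code.

```python
def omniscience_accuracy(correct: int, incorrect: int, abstained: int) -> float:
    """Proportion of all questions answered correctly."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no responses")
    return correct / total


def hallucination_rate(correct: int, incorrect: int, abstained: int) -> float:
    """Incorrect answers as a share of all non-correct responses."""
    non_correct = incorrect + abstained
    if non_correct == 0:
        return 0.0  # nothing was wrong or skipped, so nothing was hallucinated
    return incorrect / non_correct


# Hypothetical counts: 60 correct, 10 wrong, 30 abstentions.
accuracy = omniscience_accuracy(60, 10, 30)
rate = hallucination_rate(60, 10, 30)
```

With these counts the model is wrong on only 10% of questions overall, yet its hallucination rate is 25%, because the metric asks how often the model guesses wrongly when it does not know the answer.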
Measuring Short-Form Factuality in Large Language Models
A benchmark that evaluates the ability of language models to answer short, fact-seeking questions accurately. Focuses on factual correctness rather than reasoning complexity.
SimpleQA 2024 · updated June 2, 2026
Chinese-SimpleQA
A Chinese short-form factuality benchmark reported by DeepSeek for V4 model evaluations.
Chinese-SimpleQA 2026 · updated June 2, 2026
OpenBookQA
A science question-answering benchmark that tests whether models can apply a small open-book set of elementary science facts to multi-step reasoning questions.
OpenBookQA 2018 · updated June 2, 2026
HealthBench Hard
A harder subset of OpenAI's HealthBench for evaluating open-ended medical and health reasoning with rubric-based grading.
HealthBench Hard 2026 · updated June 2, 2026
MedXpertQA Text
A medical multiple-choice benchmark spanning many specialties with 10 answer options per question.
MedXpertQA (Text) 2026 · updated June 2, 2026
FrontierScience Research
A research-focused FrontierScience evaluation variant for scientific investigation and problem solving.
FrontierScience Research 2026 · updated June 2, 2026
TruthfulQA
A benchmark designed to measure whether language models produce truthful answers instead of repeating common misconceptions or misleading falsehoods.
TruthfulQA 2021 · updated June 2, 2026
Humanity's Last Exam without tools
Tool-free variant of Humanity's Last Exam that isolates a model's raw frontier reasoning.
HLE w/o tools 2026 · updated June 2, 2026
MMLU-Pro first-party comparison snapshot
A display-only MMLU-Pro reference from Arcee AI's Trinity-Large-Thinking launch chart.
MMLU-Pro (Arcee) 2026 · updated June 2, 2026
MMLU-Redux
A manually re-annotated refresh of MMLU that corrects erroneous questions and answer labels, intended to keep broad knowledge evaluation reliable as frontier models saturate the original benchmark.
MMLU-Redux 2026 · updated June 2, 2026
MMMLU
A multilingual MMLU-style benchmark reported in provider evaluation tables.
MMMLU 2026 · updated June 2, 2026
C-Eval
A Chinese-language academic and professional benchmark spanning humanities, social science, STEM, and applied subjects.
C-Eval 2023 · updated June 2, 2026
Chinese Massive Multitask Language Understanding
A Chinese multitask academic benchmark reported in DeepSeek-V4 base-model evaluations.
CMMLU 2026 · updated June 2, 2026
MultiLoKo
A multilingual/localized knowledge benchmark reported in DeepSeek-V4 base-model evaluations.
MultiLoKo 2026 · updated June 2, 2026
FACTS Parametric
A parametric factuality benchmark reported in DeepSeek-V4 base-model evaluations.
FACTS Parametric 2026 · updated June 2, 2026
TriviaQA
A reading and trivia question-answering benchmark reported in DeepSeek-V4 base-model evaluations.
TriviaQA 2026 · updated June 2, 2026
Multilingual Grade School Math
A multilingual benchmark that translates 250 grade school math problems from GSM8K into 10 typologically diverse languages: Bengali, German, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese.
MGSM 2022 · updated June 2, 2026
MMLU-ProX
A multilingual extension of professional-level academic evaluation across many languages.
MMLU-ProX 2025 · updated June 2, 2026
NOVA-63
A broad multilingual benchmark row from Qwen's launch comparisons intended to measure cross-lingual capability beyond a single language family.
NOVA-63 2026 · updated June 2, 2026
INCLUDE
A multilingual benchmark used in provider tables to measure inclusive language coverage and cross-lingual understanding beyond common high-resource languages.
INCLUDE 2026 · updated June 2, 2026
PolyMath
A multilingual mathematical reasoning benchmark that tests whether math performance transfers across languages rather than only in English.
PolyMath 2026 · updated June 2, 2026
VWT2k-lite
A lighter multilingual benchmark slice published in provider tables for broad cross-lingual transfer and understanding.
VWT2k-lite 2026 · updated June 2, 2026
MAXIFE
A multilingual instruction-following and understanding benchmark row published in Qwen's launch comparisons.
MAXIFE 2026 · updated June 2, 2026
SWE-bench Multilingual
A multilingual extension of SWE-bench covering 300 problems across 9 programming languages, testing code generation and bug fixing beyond Python.
SWE Multilingual 2025 · updated June 2, 2026
Instruction-Following Eval
A benchmark that evaluates language models' ability to follow verifiable instructions such as formatting constraints, keyword inclusion/exclusion, length limits, and structural requirements.
IFEval 2023 · updated June 2, 2026
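To illustrate what "verifiable" means for the instruction types listed above, here is a minimal sketch of programmatic checks. The function names, the sample response, and the thresholds are hypothetical and are not taken from IFEval's own implementation.

```python
import re


def check_keywords(response: str, include=(), exclude=()) -> bool:
    """Keyword inclusion/exclusion via case-insensitive substring tests."""
    text = response.lower()
    has_all = all(k.lower() in text for k in include)
    has_none = not any(k.lower() in text for k in exclude)
    return has_all and has_none


def check_max_words(response: str, limit: int) -> bool:
    """Length limit measured in whitespace-separated words."""
    return len(response.split()) <= limit


def check_bullet_count(response: str, expected: int) -> bool:
    """Structural requirement: exact number of '-' or '*' bullet lines."""
    bullets = [ln for ln in response.splitlines() if re.match(r"^\s*[-*]\s+", ln)]
    return len(bullets) == expected


# Hypothetical model response, checked against three constraints.
response = "- Apples are red\n- Pears are green"
results = [
    check_keywords(response, include=["apples"], exclude=["banana"]),
    check_max_words(response, limit=20),
    check_bullet_count(response, expected=2),
]
print(all(results))
```

Each check is a deterministic function of the response text, so no human or model judge is needed to score it.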
Instruction Following Benchmark
IFBench evaluates precise instruction-following generalization on 58 challenging, verifiable out-of-domain constraints. Unlike IFEval, which tests familiar constraint types, IFBench specifically measures how well models follow novel instructions they haven't been optimized for, exposing overfitting to common instruction patterns.
IFBench 2025 · updated June 2, 2026
Artificial Analysis IFBench
A display-only Artificial Analysis IFBench score.
AA-IFBench 2026 · updated June 2, 2026
Structured Output Benchmark Value Accuracy
A structured-output benchmark from Interfaze measuring whether extracted JSON leaf values exactly match verified ground truth.
SOB Value Acc 2026 · updated June 2, 2026
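One way to read "leaf values exactly match" is sketched below. It is an assumption about how such a metric could be computed, not Interfaze's actual scoring code; `leaf_values`, `value_accuracy`, and the sample documents are hypothetical.

```python
_MISSING = object()  # sentinel for paths absent from the prediction


def leaf_values(obj, path=()):
    """Yield (path, value) for every leaf of a nested JSON-like structure."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from leaf_values(value, path + (key,))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from leaf_values(value, path + (index,))
    else:
        yield path, obj


def value_accuracy(predicted, truth) -> float:
    """Fraction of ground-truth leaves whose predicted value matches exactly."""
    truth_leaves = dict(leaf_values(truth))
    pred_leaves = dict(leaf_values(predicted))
    if not truth_leaves:
        return 0.0  # undefined without leaves; treated as 0.0 in this sketch
    correct = 0
    for path, expected in truth_leaves.items():
        got = pred_leaves.get(path, _MISSING)
        # Compare types as well, so that 1, 1.0, and True do not count as equal.
        if type(got) is type(expected) and got == expected:
            correct += 1
    return correct / len(truth_leaves)


truth = {"invoice": {"id": "A-17", "total": 120.5}, "items": ["pen", "ink"]}
predicted = {"invoice": {"id": "A-17", "total": 120.0}, "items": ["pen", "ink"]}
print(value_accuracy(predicted, truth))  # 3 of 4 leaves match
```

Scoring only against ground-truth paths means extra keys in the prediction are ignored here; a stricter variant could penalize them.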
American Invitational Mathematics Examination 2023
A 15-question, 3-hour examination in which each answer is an integer from 000 to 999. It serves as the intermediate step between the AMC 10/12 and the USA Mathematical Olympiad (USAMO).
AIME 2023 2023 · updated June 2, 2026
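Because every AIME answer is an integer from 000 to 999, responses can be graded by exact match. The sketch below shows one possible grader; the function names and the three-question answer key are hypothetical and do not reflect any real exam or a specific evaluation harness.

```python
import re


def normalize_aime_answer(raw: str):
    """Return an int in [0, 999], or None if raw is not a valid AIME answer."""
    text = raw.strip()
    # Accept one to three ASCII digits, so "007" and "7" are the same answer.
    if not re.fullmatch(r"[0-9]{1,3}", text):
        return None
    return int(text)


def score_exam(responses, answer_key) -> int:
    """Count exact matches between normalized responses and the answer key."""
    return sum(
        normalize_aime_answer(response) == key
        for response, key in zip(responses, answer_key)
    )


# Hypothetical key and responses (a real exam has 15 questions).
answer_key = [7, 120, 999]
responses = ["007", "120", "1000"]
print(score_exam(responses, answer_key))  # "1000" is out of range, so 2 match
```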
American Invitational Mathematics Examination 2024
The 2024 edition of AIME, maintaining the same format of 15 challenging mathematics problems with integer answers from 000 to 999.
AIME 2024 2024 · updated June 2, 2026
American Invitational Mathematics Examination 2025
The 2025 edition of AIME, featuring 15 challenging mathematics problems that test olympiad-level mathematical reasoning, with integer answers from 000 to 999.
AIME 2025 · updated June 2, 2026
Grade School Math 8K
A grade-school mathematical reasoning benchmark reported in DeepSeek-V4 base-model evaluations.
GSM8K 2026 · updated June 2, 2026
MATH
A competition-style mathematical reasoning benchmark reported in DeepSeek-V4 base-model evaluations.
MATH 2026 · updated June 2, 2026
CMath
A Chinese mathematical reasoning benchmark reported in DeepSeek-V4 base-model evaluations.
CMath 2026 · updated June 2, 2026
AIME25 first-party comparison snapshot
A display-only AIME25 reference from Arcee AI's Trinity-Large-Thinking launch chart.
AIME25 (Arcee) 2026 · updated June 2, 2026
Harvard-MIT Mathematics Tournament February 2023
A prestigious high school mathematics competition hosted jointly by Harvard and MIT, featuring challenging problems across various mathematical disciplines.
HMMT Feb 2023 2023 · updated June 2, 2026
Harvard-MIT Mathematics Tournament February 2024
The February 2024 edition of the Harvard-MIT Mathematics Tournament, a challenging high school mathematics competition.
HMMT Feb 2024 2024 · updated June 2, 2026
Harvard-MIT Mathematics Tournament February 2025
The February 2025 edition of the Harvard-MIT Mathematics Tournament, featuring challenging problems in competitive mathematics.
HMMT Feb 2025 2025 · updated June 2, 2026
Brown University Math Olympiad 2025
A challenging mathematical olympiad competition featuring problems that test advanced mathematical reasoning and problem-solving skills at the olympiad level.
BRUMO 2025 2025 · updated June 2, 2026
MATH-500 Problem Set
A curated subset of 500 problems from the MATH dataset, covering algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus.
MATH-500 2021 · updated June 2, 2026
AIME 2026
A 2026 American Invitational Mathematics Examination snapshot used in frontier-model comparison tables for mathematical reasoning.
AIME26 2026 · updated June 2, 2026
International Physics Olympiad 2025 (Theory)
The three official theory problems from the 2025 International Physics Olympiad, scored with blinded human evaluation.
IPhO 2025 (Theory) 2026 · updated June 2, 2026
Harvard-MIT Mathematics Tournament February 2025
A February 2025 HMMT slice used in exact-value provider tables for advanced contest-math reasoning.
HMMT Feb 2025 2025 · updated June 2, 2026
Harvard-MIT Mathematics Tournament November 2025
A November 2025 HMMT slice for high-end mathematical reasoning comparisons.
HMMT Nov 2025 2025 · updated June 2, 2026
Harvard-MIT Mathematics Tournament February 2026
A February 2026 HMMT slice used in newer frontier-model math comparisons.
HMMT Feb 2026 2026 · updated June 2, 2026
IMOAnswerBench
A challenging mathematical reasoning benchmark reported in DeepSeek-V4 model evaluations.
IMOAnswerBench 2026 · updated June 2, 2026
Apex
A high-difficulty mathematical reasoning benchmark reported in DeepSeek-V4 model evaluations.
Apex 2026 · updated June 2, 2026
Apex Shortlist
A shortlist subset of the Apex mathematical reasoning benchmark reported in DeepSeek-V4 model evaluations.
Apex Shortlist 2026 · updated June 2, 2026
MMAnswerBench
A multimodal mathematical reasoning benchmark that tests whether models can answer visually grounded math questions correctly.
MMAnswerBench 2026 · updated June 2, 2026
FrontierMath
An expert-level mathematical reasoning benchmark by Epoch AI featuring original, research-level problems created by mathematicians including IMO gold medalists and Fields Medal recipients. Problems require deep creativity and multi-step reasoning.
FrontierMath 2024 · updated June 2, 2026
United States of America Mathematical Olympiad 2026
The premier US mathematical olympiad competition, featuring proof-based problems that require deep mathematical insight and rigorous argumentation at the highest competition level.
USAMO 2026 2026 · updated June 2, 2026
Korean Massive Multitask Language Understanding
Evaluates Korean expert-level knowledge across 45 subjects. 20% of questions require Korean cultural context.
KMMLU 2024 · updated June 2, 2026
KMMLU-Hard
A filtered hard subset of KMMLU containing ~5,000 questions that most models get wrong.
KMMLU-Hard 2025 · updated June 2, 2026
KMMLU-Redux
A cleaned version of KMMLU drawn from national technical qualification exams, with erroneous items removed and the question set decontaminated and deduplicated.
KMMLU-Redux · updated June 2, 2026
KMMLU-Pro
A benchmark built from Korean National Professional Licensure exams that evaluates professional-grade knowledge.
KMMLU-Pro · updated June 2, 2026
Cultural and Linguistic Intelligence in Korean
Evaluates knowledge of Korean culture and the Korean language.
CLIcK · updated June 2, 2026
Korean Benchmark for Advanced Linguistic Tasks
Evaluates advanced Korean linguistic competence.
KoBALT · updated June 2, 2026
College Scholastic Ability Test (수능)
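Questions from Korea's College Scholastic Ability Test (CSAT), the national university entrance exam often described as the Korean SAT.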
Korean CSAT · updated June 2, 2026
HAE-RAE Math 8K
A Korean-language mathematical reasoning benchmark spanning high-school to olympiad-level problems.
HRM8K · updated June 2, 2026
Vals Index v1.1
Vals AI composite benchmark across finance and coding tasks, including Finance Agent v2, CorpFin v2, SWE-bench, Terminal-Bench 2.0, and Vibe Code Bench.
Vals Index 2026 · updated June 2, 2026
Vals Multimodal Index v1.1
Vals AI multimodal composite across finance, coding, education, and mortgage-tax task families.
Vals Multimodal Index 2026 · updated June 2, 2026
Vals CorpFin v2
Vals AI private benchmark for understanding long-context credit agreements.
CorpFin v2 2026 · updated June 2, 2026
Vals MedCode
Vals AI healthcare benchmark for whether models can support the medical billing process.
MedCode 2026 · updated June 2, 2026
Vals MedScribe
Vals AI healthcare benchmark for whether models can support doctors with administrative work.
MedScribe 2026 · updated June 2, 2026
Vals MortgageTax
Vals AI benchmark for mortgage and tax document reasoning, including semantic and numerical extraction task views.
MortgageTax 2026 · updated June 2, 2026
Vals ProofBench
Vals AI automated theorem-proving benchmark.
ProofBench 2026 · updated June 2, 2026
Vals LegalBench
Vals AI legal benchmark with issue, rule, conclusion, interpretation, and rhetoric task views.
LegalBench 2026 · updated June 2, 2026
Vals CaseLaw v2
Vals AI private question-answer benchmark over Canadian court cases.
CaseLaw v2 2026 · updated June 2, 2026
DeepSWE
A long-horizon software engineering benchmark from Datacurve for measuring frontier coding agents on original tasks drawn from active open-source repositories.
DeepSWE 2026 · updated June 2, 2026
Vals-hosted SWE-bench mirror
Vals AI hosted SWE-bench view for solving production software engineering tasks.
Vals SWE-bench mirror 2026 · updated June 2, 2026
Vals-hosted Terminal-Bench 2.0 mirror
Vals AI hosted Terminal-Bench 2.0 view with easy, medium, and hard task splits.
Vals Terminal-Bench 2.0 mirror 2026 · updated June 2, 2026
Vals-hosted LiveCodeBench mirror
Vals AI implementation of LiveCodeBench with easy, medium, and hard task splits.
Vals LiveCodeBench mirror 2026 · updated June 2, 2026
Vals-hosted GPQA Diamond mirror
Vals AI hosted GPQA Diamond view with few-shot and zero-shot chain-of-thought task splits.
Vals GPQA Diamond mirror 2026 · updated June 2, 2026
Vals-hosted MMLU-Pro mirror
Vals AI hosted MMLU-Pro view with subject-level task splits.
Vals MMLU-Pro mirror 2026 · updated June 2, 2026