Explore 144 benchmarks used to evaluate AI language models across 8 categories.
Terminal-Bench 2.0
A benchmark for agentic software engineering tasks executed in real terminal environments. Models must inspect files, run commands, edit code, and recover from errors over multi-step workflows.
Terminal-Bench 2 · updated April 20, 2026
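A minimal sketch of the command-execute-observe loop such terminal agents run. The propose_command callback and the stop signal are hypothetical stand-ins for the model, not part of the benchmark's actual harness.

    import subprocess

    def run(cmd: str, timeout: int = 60) -> str:
        """Run a shell command and return its combined stdout/stderr as the observation."""
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
        return proc.stdout + proc.stderr

    def solve(task: str, propose_command, max_steps: int = 20) -> list[tuple[str, str]]:
        """Repeatedly ask the model for the next command, execute it, and feed the output back."""
        history: list[tuple[str, str]] = []
        observation = f"TASK: {task}"
        for _ in range(max_steps):
            cmd = propose_command(observation, history)   # hypothetical call into the model
            if cmd is None:                               # model signals it is finished
                break
            observation = run(cmd)
            history.append((cmd, observation))
        return history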
BrowseComp
A benchmark for web-browsing agents that must search, inspect sources, gather evidence, and return the correct answer to research-oriented questions.
BrowseComp 2026 · updated April 20, 2026
OSWorld-Verified
A verified subset of OSWorld focused on computer-use tasks in desktop-like environments, including navigation, editing, and workflow completion.
OSWorld Verified · updated April 20, 2026
BrowseComp-VL
A vision-language browsing benchmark for multimodal web research and tool-use workflows.
BrowseComp-VL 2026 · updated April 20, 2026
OSWorld
A computer-use benchmark for GUI task completion across the broader OSWorld task suite.
OSWorld 2026 · updated April 20, 2026
AndroidWorld
A mobile GUI agent benchmark for completing Android app workflows and on-device tasks.
AndroidWorld 2026 · updated April 20, 2026
WebVoyager
A browser-agent benchmark for completing multi-step workflows on live websites.
WebVoyager 2026 · updated April 20, 2026
MCP Atlas
A benchmark for tool-calling over Model Context Protocol integrations and external tools.
MCP Atlas 2026 · updated April 20, 2026
Toolathlon
A tool-use benchmark focused on selecting, sequencing, and completing tasks with external tools.
Toolathlon 2026 · updated April 20, 2026
ZClawBench
A Z.AI benchmark for OpenClaw-style agent workflows spanning information search, office work, data analysis, development and operations, automation, and security.
ZClawBench 2026 · updated April 20, 2026
Tau2-Telecom
A telecom-oriented tool benchmark that measures structured tool use in domain workflows.
τ²-Bench 2026 · updated April 20, 2026
DeepSearchQA
An agentic browsing benchmark where models search the web, gather evidence, and answer list-style questions using browser tools.
DeepSearchQA 2026 · updated April 20, 2026
Tau2-Airline
An airline-domain tool-use benchmark for structured workflow execution and API correctness.
Tau2-Airline 2026 · updated April 20, 2026
PinchBench
An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows.
PinchBench 2026 · updated April 20, 2026
Berkeley Function Calling Leaderboard v4
A function-calling benchmark for tool selection, schema adherence, and argument correctness.
BFCL v4 2026 · updated April 20, 2026
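As a rough illustration of what schema adherence and argument correctness mean in function-calling evaluations, here is a small checker over a hypothetical get_weather tool. The schema, the tool name, and the expected call are invented for the example and are not taken from BFCL.

    # Hypothetical tool schema in the JSON-Schema style many function-calling APIs use.
    GET_WEATHER = {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }

    def check_call(call: dict, schema: dict) -> bool:
        """Return True if the model picked the right tool and its arguments fit the schema."""
        if call.get("name") != schema["name"]:
            return False                      # wrong tool selected
        params = schema["parameters"]
        args = call.get("arguments", {})
        if any(key not in args for key in params["required"]):
            return False                      # missing required argument
        for key, value in args.items():
            spec = params["properties"].get(key)
            if spec is None:
                return False                  # argument not defined in the schema
            if "enum" in spec and value not in spec["enum"]:
                return False                  # value outside the allowed enum
        return True

    print(check_call({"name": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}}, GET_WEATHER))  # True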
MLE-Bench Lite
A lightweight machine-learning competition benchmark that measures whether models can iteratively train, evaluate, and improve ML systems in low-resource settings.
MLE-Bench Lite 2026 · updated April 20, 2026
MM-ClawBench
An OpenClaw-derived agent benchmark covering practical work and life tasks such as office document delivery, research, planning, and code maintenance.
MM-ClawBench 2026 · updated April 20, 2026
Claw-Eval
An end-to-end real-world agent benchmark for OpenClaw-style workflows spanning tool use, planning, execution, and recovery across practical tasks.
Claw-Eval 2026 · updated April 20, 2026
QwenClawBench
Qwen's internal OpenClaw-style benchmark for measuring broad real-world agent performance across practical productivity and research tasks.
QwenClawBench 2026 · updated April 20, 2026
QwenWebBench
A Qwen benchmark for artifact and webpage generation quality reported as an Elo-style rating.
QwenWebBench 2026 · updated April 20, 2026
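For readers unfamiliar with Elo-style ratings, the standard pairwise update applied after each head-to-head comparison looks like the sketch below. The K-factor and starting ratings are illustrative defaults, not Qwen's actual settings.

    def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
        """Update two ratings after one comparison; score_a is 1 if A wins, 0 if B wins, 0.5 for a tie."""
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        r_a_new = r_a + k * (score_a - expected_a)
        r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
        return r_a_new, r_b_new

    # Model A (1200) beats model B (1000): A gains a small amount, B loses the same amount.
    print(elo_update(1200.0, 1000.0, 1.0))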
TAU3-Bench
A next-generation tool-use benchmark for complex, long-horizon agent workflows beyond the older tau2 telecom and airline task families.
TAU3-Bench 2026 · updated April 20, 2026
VITA-Bench
An interactive real-world agent benchmark grounded in practical consumer-service tasks such as delivery, in-store consumption, and online travel workflows.
VITA-Bench 2025 · updated April 20, 2026
DeepPlanning
A long-horizon planning benchmark that tests whether agents can optimize under explicit time, budget, and feasibility constraints.
DeepPlanning 2026 · updated April 20, 2026
MCP-Tasks
A Model Context Protocol task benchmark used in Qwen's launch tables to measure practical execution over MCP-style tools and integrations.
MCP-Tasks 2026 · updated April 20, 2026
WideResearch
A broad research-agent benchmark for open-ended information gathering, synthesis, and answer construction across wide search spaces.
WideResearch 2026 · updated April 20, 2026
General AI Assistants
GAIA evaluates AI models on real-world tasks that are conceptually simple for humans but require multi-step reasoning, web browsing, tool use, and multimodal understanding for AI. Tasks span three difficulty levels and test practical assistant capabilities rather than academic knowledge.
GAIA 2024 · updated April 20, 2026
Tool-Agent-User Benchmark
TAU-bench evaluates AI agents in realistic enterprise scenarios requiring multi-turn tool use, database interactions, and policy adherence. It tests across retail and airline domains, measuring an agent's ability to reliably complete customer service tasks while following complex business rules.
TAU-bench 2024 · updated April 20, 2026
WebArena Web Agent Benchmark
WebArena is a realistic web environment for evaluating autonomous AI agents on complex, multi-step browser tasks. Agents must navigate e-commerce sites, forums, content management systems, and code repositories to complete practical objectives like purchasing items, finding information, and managing accounts.
WebArena 2024 · updated April 20, 2026
Multi-Environment Web Challenge
A benchmark that evaluates AI agents on navigation and task completion across diverse live web environments.
MEWC 2026 · updated April 20, 2026
Evaluating Large Language Models Trained on Code
A set of 164 handwritten programming problems that test the ability to generate correct Python functions from natural language descriptions. Each problem includes function signature, docstring, body, and several unit tests.
HumanEval · updated April 20, 2026
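A made-up problem in the HumanEval format, showing the signature, docstring, reference body, and hidden unit tests; it is not one of the 164 actual problems.

    def running_max(xs: list[int]) -> list[int]:
        """Return a list where element i is the maximum of xs[0..i].

        >>> running_max([3, 1, 4, 1, 5])
        [3, 3, 4, 4, 5]
        """
        out, best = [], None
        for x in xs:
            best = x if best is None else max(best, x)
            out.append(best)
        return out

    def check(candidate):
        # Hidden unit tests: generated code counts as correct only if every assertion passes.
        assert candidate([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
        assert candidate([]) == []
        assert candidate([-2, -5]) == [-2, -2]

    check(running_max)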
Software Engineering Benchmark Verified
A curated, human-verified subset of SWE-bench that tests models on resolving real GitHub issues from popular open-source Python repositories like Django, Flask, and scikit-learn.
SWE-bench Verified 2024 · updated April 20, 2026
SWE-Rebench
A continuously updated software engineering benchmark from Nebius that draws on fresh GitHub issues to avoid contamination. Models are evaluated five times per problem under a fixed ReAct scaffold, and the Resolved Rate (best pass@1) is reported.
Rolling 2026 window · updated April 20, 2026
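When a problem is attempted n times and c of the runs succeed, the standard unbiased pass@k estimator from the HumanEval paper is 1 - C(n-c, k)/C(n, k). The sketch below shows that estimator; treating it as the exact mapping onto SWE-Rebench's "best pass@1" reporting is an assumption on my part.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased estimate of the probability that at least one of k sampled runs succeeds."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # 5 runs per problem, 2 successes: the pass@1 estimate is 0.4 and pass@5 is 1.0.
    print(pass_at_k(5, 2, 1), pass_at_k(5, 2, 5))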
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
A continuously updated benchmark using fresh competitive programming problems from LeetCode, Codeforces, and AtCoder to provide contamination-free code generation evaluation.
Rolling 2026 set · updated April 20, 2026
LiveCodeBench v6
A newer LiveCodeBench slice used in provider comparison tables to benchmark contamination-resistant coding performance on fresher competitive programming sets.
LiveCodeBench v6 2026 · updated April 20, 2026
LiveCodeBench Pro
A harder competitive-programming benchmark family built from Codeforces, ICPC, and IOI problems, with quarter-specific public leaderboards and difficulty-aware reporting.
LiveCodeBench Pro 2025 · updated April 20, 2026
FLTEval
A repository-level Lean 4 proof engineering benchmark that measures whether a model can complete formal proofs and correctly define new mathematical concepts inside realistic FLT project pull requests.
FLTEval 2026 · updated April 20, 2026
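As a flavor of what a completed formal proof looks like in Lean 4, here is a trivial lemma closed by a library lemma; it is far simpler than anything in the FLT project and is not drawn from the benchmark.

    -- A complete Lean 4 proof: the goal `a + b = b + a` is closed by the core lemma Nat.add_comm.
    theorem add_comm' (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b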
SWE-bench Pro
A harder coding-agent benchmark than SWE-bench Verified, designed to differentiate frontier models on realistic software engineering work.
SWE-bench Pro 2026 · updated April 20, 2026
SWE Multilingual
A multilingual software-engineering benchmark for real-world code issue resolution across multiple programming languages.
SWE Multilingual 2026 · updated April 20, 2026
Multi-SWE Bench
A multi-language software-engineering benchmark that measures repository-level bug fixing and implementation across more than one programming ecosystem.
Multi-SWE Bench 2026 · updated April 20, 2026
VIBE-Pro
A repo-level code generation and full-project delivery benchmark spanning web, mobile, and simulation-style implementation tasks.
VIBE-Pro 2026 · updated April 20, 2026
NL2Repo
A repository-understanding benchmark that measures whether models can map natural-language requests onto the right code locations and system changes.
NL2Repo 2026 · updated April 20, 2026
React Native Evals
An open benchmark for AI coding agents on real-world React Native implementation tasks, emphasizing working app behavior, recommended architecture choices, and strict constraint adherence.
React Native Evals 2026 · updated April 20, 2026
SWE-bench Verified (mini-swe-agent-v2)
A display-only SWE-bench Verified reference from Arcee AI's Trinity-Large-Thinking comparison chart.
SWE-bench Verified* 2026 · updated April 20, 2026
Scientific Code Benchmark
SciCode evaluates language models on generating code for realistic scientific research problems across 16 subfields of physics, math, chemistry, biology, and material science. Problems decompose into 338 subproblems requiring domain knowledge recall, scientific reasoning, and precise code synthesis. Based on real scripts from published research.
SciCode 2024 · updated April 20, 2026
Testing the Limits of Chain-of-thought with Multistep Soft Reasoning
A dataset for evaluating language models on multistep soft reasoning tasks specified in natural language narratives. Tests the ability to perform complex, structured reasoning.
MuSR 2023 · updated April 20, 2026
BIG-Bench Hard
A suite of 23 challenging tasks from the BIG-Bench collaborative benchmark where prior language models failed to exceed average human performance, even with chain-of-thought prompting.
BBH 2022 · updated April 20, 2026
LisanBench
A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.
LisanBench 2026 · updated April 20, 2026
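A sketch of how such a word chain could be validated: consecutive words must differ by a single edit (one insertion, deletion, or substitution) and no word may repeat. The validator is my own illustration, not the benchmark's official grader.

    def edit_distance_one(a: str, b: str) -> bool:
        """True if b can be reached from a by exactly one insertion, deletion, or substitution."""
        if a == b or abs(len(a) - len(b)) > 1:
            return False
        if len(a) == len(b):                       # substitution of exactly one character
            return sum(x != y for x, y in zip(a, b)) == 1
        shorter, longer = sorted((a, b), key=len)  # single insertion or deletion
        i = j = 0
        skipped = False
        while i < len(shorter) and j < len(longer):
            if shorter[i] == longer[j]:
                i += 1
            elif skipped:
                return False
            else:
                skipped = True
            j += 1
        return True

    def valid_chain(words: list[str]) -> bool:
        """A chain is valid if it never repeats a word and each step is an edit-distance-1 move."""
        return len(set(words)) == len(words) and all(
            edit_distance_one(a, b) for a, b in zip(words, words[1:])
        )

    print(valid_chain(["cat", "cot", "cog", "dog"]))   # True
    print(valid_chain(["cat", "cat", "cot"]))          # False (repeated word)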
LongBench v2
A long-context benchmark that measures whether models can actually use extended context windows for reasoning and retrieval.
LongBench v2 2025 · updated April 20, 2026
MRCRv2
A long-context benchmark for memory, retrieval, and multi-round coherence over large contexts.
MRCRv2 2025 · updated April 20, 2026
OpenAI MRCR v2 8-needle 64K-128K
MRCR v2 slice focused on long-context retrieval at 64K-128K lengths.
MRCR v2 64K-128K 2026 · updated April 20, 2026
OpenAI MRCR v2 8-needle 128K-256K
MRCR v2 slice focused on very long contexts at 128K-256K lengths.
MRCR v2 128K-256K 2026 · updated April 20, 2026
Graphwalks BFS 0K-128K
Long-context graph traversal benchmark using breadth-first search tasks.
Graphwalks BFS 128K 2026 · updated April 20, 2026
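The underlying task shape is simple to state in code: given a long edge list buried in the context, return every node at a given BFS depth from a source. The reference sketch below is my own and treats the graph as undirected; the benchmark's exact edge semantics are part of its task spec.

    from collections import deque

    def nodes_at_depth(edges: list[tuple[str, str]], source: str, depth: int) -> set[str]:
        """Breadth-first search over an undirected edge list, returning the frontier at `depth` hops."""
        adj: dict[str, set[str]] = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        seen, frontier = {source}, {source}
        for _ in range(depth):
            frontier = {w for node in frontier for w in adj.get(node, set())} - seen
            seen |= frontier
        return frontier

    edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "e")]
    print(nodes_at_depth(edges, "a", 2))   # {'c', 'e'}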
Graphwalks parents 0-128K
Long-context benchmark for recovering parent relationships inside graph tasks.
Graphwalks Parents 128K 2026 · updated April 20, 2026
Abstraction and Reasoning Corpus for AGI v2
A benchmark measuring fluid intelligence and novel abstract reasoning through visual grid puzzles. Models must identify patterns in input-output pairs and generate the correct output for unseen inputs. Considered the hardest public reasoning benchmark — human average is 60%.
ARC-AGI 2 · updated April 20, 2026
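Concretely, each task gives a few input/output grid pairs plus one or more test inputs, and a prediction counts only if the produced grid matches the hidden output cell-for-cell. The toy task and transformation rule below are invented for illustration.

    # Toy ARC-style task: the (hypothetical) rule is "mirror the grid left-to-right".
    train_pairs = [
        ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
        ([[3, 3, 0]],      [[0, 3, 3]]),
    ]
    test_input = [[0, 5], [6, 0]]
    hidden_output = [[5, 0], [0, 6]]

    def exact_match(pred: list[list[int]], target: list[list[int]]) -> bool:
        """ARC scoring is all-or-nothing: every cell must match."""
        return pred == target

    prediction = [row[::-1] for row in test_input]   # a solver that guessed the mirror rule
    print(exact_match(prediction, hidden_output))    # True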
AI-Needle
A long-context retrieval benchmark that measures whether a model can recover relevant information embedded deep inside very long contexts.
AI-Needle 2026 · updated April 20, 2026
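A minimal sketch of how needle-in-a-haystack style probes are typically built: plant a fact at a chosen depth inside filler text, then check whether the model's answer contains it. The filler, needle, and containment scoring here are illustrative only, not this benchmark's construction.

    def build_prompt(needle: str, filler: str, depth: float, target_chars: int) -> str:
        """Embed `needle` at a relative depth (0.0 = start, 1.0 = end) inside repeated filler text."""
        haystack = (filler * (target_chars // len(filler) + 1))[:target_chars]
        cut = int(len(haystack) * depth)
        return haystack[:cut] + "\n" + needle + "\n" + haystack[cut:]

    def scored(answer: str, expected: str) -> bool:
        """Simple containment check; real harnesses often use stricter or LLM-based grading."""
        return expected.lower() in answer.lower()

    prompt = build_prompt(
        needle="The vault code mentioned earlier is 4177.",
        filler="The weather report repeats uneventfully. ",
        depth=0.75,
        target_chars=2000,
    )
    print(len(prompt), scored("I believe the code is 4177", "4177"))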
GPQA Diamond
The hardest subset of GPQA featuring the most challenging graduate-level science questions. Sometimes reported separately from the standard GPQA benchmark.
GPQA Diamond 2023 · updated April 20, 2026
BullshitBench v2
A benchmark that tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. Measures the critical ability to push back on bad input.
BullshitBench v2 2025 · updated April 20, 2026
WildBench
An automated evaluation framework using 1,000+ real-world user tasks covering reasoning, planning, coding, and creative writing. Highly correlated with Chatbot Arena human preference rankings.
WildBench 2024 · updated April 20, 2026
Massive Multi-discipline Multimodal Understanding
A broad multimodal reasoning benchmark spanning charts, diagrams, tables, and academic visual question answering.
MMMU 2024 · updated April 20, 2026
Massive Multi-discipline Multimodal Understanding Pro
A harder multimodal benchmark for frontier models that combines text with images, diagrams, charts, and academic visual reasoning tasks.
MMMU-Pro 2024 · updated April 20, 2026
OfficeQA Pro
A benchmark for grounded reasoning over office-style documents, spreadsheets, charts, and business artifacts.
OfficeQA Pro 2026 · updated April 20, 2026
MMMU-Pro with Python
Tool-augmented MMMU-Pro variant that allows Python assistance during multimodal reasoning.
MMMU-Pro w/ Python 2026 · updated April 20, 2026
OmniDocBench 1.5
A document understanding benchmark used in frontier-model comparison tables to measure extraction and grounded reasoning quality on complex documents.
OmniDocBench 1.5 2026 · updated April 20, 2026
RealWorldQA
A grounded visual QA benchmark focused on answering practical questions about real-world images and scenes.
RealWorldQA 2026 · updated April 20, 2026
Video-MME with subtitle
A video understanding benchmark that allows subtitle access when answering multimodal questions about videos.
Video-MME (with subtitle) 2026 · updated April 20, 2026
Video-MME without subtitle
A stricter Video-MME setting that removes subtitle help and tests video understanding from visual and audio context alone.
Video-MME (w/o subtitle) 2026 · updated April 20, 2026
Video-MME
A comprehensive benchmark for multimodal large language models on video understanding, covering temporal reasoning, perception, and question answering over videos.
Video-MME 2024 · updated April 20, 2026
MathVision
A visual mathematics benchmark that tests whether a model can solve math problems grounded in diagrams, equations, figures, and other visual inputs.
MathVision 2026 · updated April 20, 2026
We-Math
A multimodal math benchmark for visually grounded mathematical reasoning and answer generation.
We-Math 2026 · updated April 20, 2026
DynaMath
A multimodal benchmark for dynamic mathematical reasoning over visual and structured inputs.
DynaMath 2026 · updated April 20, 2026
MStar
A general visual question-answering benchmark used in provider tables for real-image reasoning quality.
MStar 2026 · updated April 20, 2026
ChatCVQA
A conversational visual QA benchmark that tests multi-turn grounded answering over images and documents.
ChatCVQA 2026 · updated April 20, 2026
MMLongBench-Doc
A long-document multimodal benchmark for grounded reasoning over extended document contexts.
MMLongBench-Doc 2026 · updated April 20, 2026
CC-OCR
An OCR-focused benchmark for reading and extracting text from visually complex documents and images.
CC-OCR 2026 · updated April 20, 2026
AI2D test split
A diagram understanding benchmark focused on scientific and educational visual question answering.
AI2D_TEST 2026 · updated April 20, 2026
CountBench
A visual counting benchmark that tests whether a model can count objects and entities reliably in complex scenes.
CountBench 2026 · updated April 20, 2026
RefCOCO average
A referring-expression grounding benchmark averaged across RefCOCO variants to test whether a model can localize described objects correctly.
RefCOCO (avg) 2026 · updated April 20, 2026
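Grounding quality in referring-expression benchmarks of this kind is usually scored by intersection-over-union between the predicted and gold boxes, with a hit typically requiring IoU of at least 0.5. A small IoU sketch:

    def iou(box_a: tuple[float, float, float, float], box_b: tuple[float, float, float, float]) -> float:
        """Boxes are (x1, y1, x2, y2); returns intersection area divided by union area."""
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union > 0 else 0.0

    print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143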
ODINW13
A visual detection and grounding benchmark slice used to compare zero-shot object understanding across diverse domains.
ODINW13 2026 · updated April 20, 2026
ERQA
A grounded visual reasoning benchmark focused on evidence-based question answering over real images.
ERQA 2026 · updated April 20, 2026
VideoMMMU
A video extension of MMMU-style multimodal reasoning over expert questions grounded in temporal media.
VideoMMMU 2026 · updated April 20, 2026
MLVU mean average
A multi-task video understanding benchmark averaged across MLVU categories.
MLVU (M-Avg) 2026 · updated April 20, 2026
Multimodal Multi-disciplinary Video Understanding
A benchmark for evaluating multimodal models on video understanding tasks across multiple disciplines, emphasizing temporal reasoning and comprehension over video content.
MMVU 2026 · updated April 20, 2026
ScreenSpot Pro
A high-resolution GUI grounding benchmark for professional computer-use environments.
ScreenSpot Pro 2025 · updated April 20, 2026
TIR-Bench
A visual agent benchmark for interface reasoning and task execution over screenshots or software surfaces.
TIR-Bench 2026 · updated April 20, 2026
GDPval-AA
An evaluation focused on professional domain expertise and task delivery quality in office-style knowledge work.
GDPval-AA 2026 · updated April 20, 2026
MedXpertQA Multimodal
A multimodal medical multiple-choice benchmark covering clinical images such as X-rays, histology, and dermatology.
MedXpertQA (MM) 2026 · updated April 20, 2026
ZeroBench
A multi-step visual reasoning benchmark with pass@5 reporting and optional tool use.
ZeroBench 2026 · updated April 20, 2026
Design2Code
A multimodal coding benchmark for turning visual designs into working frontend implementations.
Design2Code 2026 · updated April 20, 2026
Flame-VLM-Code
A vision-language coding benchmark for generating correct code from visual and multimodal inputs.
Flame-VLM-Code 2026 · updated April 20, 2026
Vision2Web
A benchmark for converting visual references into functional web implementations.
Vision2Web 2026 · updated April 20, 2026
ImageMining
A multimodal retrieval and extraction benchmark over image-heavy task settings.
ImageMining 2026 · updated April 20, 2026
MMSearch
A multimodal search benchmark for retrieval and grounded answering across mixed-media inputs.
MMSearch 2026 · updated April 20, 2026
MMSearch-Plus
A harder MMSearch variant for multimodal retrieval and grounded tool-use workflows.
MMSearch-Plus 2026 · updated April 20, 2026
SimpleVQA
A visual question answering benchmark focused on straightforward image-grounded understanding.
SimpleVQA 2026 · updated April 20, 2026
Facts-VLM
A grounded multimodal factuality benchmark for evidence-linked answer correctness.
Facts-VLM 2026 · updated April 20, 2026
V*
A vision-centric benchmark for high-level multimodal reasoning and perception quality.
V* 2026 · updated April 20, 2026
CharXiv Reasoning
A scientific chart reasoning benchmark that tests whether models can understand, interpret, and reason about complex scientific visualizations including plots, diagrams, and data charts.
CharXiv 2024 · updated April 20, 2026
CharXiv Reasoning without tools
Tool-free variant of CharXiv that isolates raw visual reasoning ability without code execution or tool augmentation.
CharXiv w/o tools 2024 · updated April 20, 2026
SWE-bench Multimodal
A multimodal variant of SWE-bench that adds visual context (screenshots, design mockups) to software engineering issue descriptions, testing whether models can leverage visual information for code generation.
SWE-bench Multimodal 2025 · updated April 20, 2026
Massive Multitask Language Understanding
A comprehensive multiple-choice question answering test covering 57 tasks including elementary mathematics, US history, computer science, law, and more. Tests knowledge across diverse academic subjects from high school to professional level.
MMLU · updated April 20, 2026
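Reported MMLU scores are commonly an unweighted average of per-subject accuracies over the 57 tasks, though exact aggregation can differ by harness. A sketch of that aggregation, with made-up subject counts:

    def mmlu_score(per_subject: dict[str, tuple[int, int]]) -> float:
        """per_subject maps subject -> (correct, total); returns the unweighted mean accuracy."""
        accs = [correct / total for correct, total in per_subject.values()]
        return sum(accs) / len(accs)

    print(mmlu_score({"us_history": (80, 100), "college_physics": (45, 90), "law": (120, 200)}))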
Graduate-Level Google-Proof Q&A
A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry, designed to be difficult even for skilled non-experts with access to Google. The Diamond subset comprises the 198 hardest, highest-quality questions.
GPQA Diamond · updated April 20, 2026
GPQA Diamond
A display-only GPQA Diamond reference from provider comparison charts.
GPQA-D 2026 · updated April 20, 2026
SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines
An expanded version of GPQA that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines, providing comprehensive coverage of academic domains.
SuperGPQA 2025 · updated April 20, 2026
Massive Multitask Language Understanding Professional
An enhanced version of MMLU with 10 answer choices instead of 4, featuring more reasoning-focused questions that better differentiate frontier models.
MMLU-Pro · updated April 20, 2026
Humanity's Last Exam
An extremely challenging benchmark crowd-sourced from thousands of domain experts worldwide, designed to probe the absolute frontier of AI capabilities with questions that even specialists find difficult.
Humanity's Last Exam · updated April 20, 2026
FrontierScience
A benchmark for research-level scientific reasoning, designed to separate frontier models on difficult science tasks that mix domain knowledge with deep reasoning.
FrontierScience 2026 · updated April 20, 2026
Artificial Analysis Intelligence Index
A display-only intelligence index published by Artificial Analysis that aggregates provider-reported and benchmark-derived signals into a single model-level score.
Artificial Analysis Intelligence Index 2026 · updated April 20, 2026
Measuring Short-Form Factuality in Large Language Models
A benchmark that evaluates the ability of language models to answer short, fact-seeking questions accurately. Focuses on factual correctness rather than reasoning complexity.
SimpleQA 2024 · updated April 20, 2026
OpenBookQA
A science question-answering benchmark that tests whether models can apply a small open-book set of elementary science facts to multi-step reasoning questions.
OpenBookQA 2018 · updated April 20, 2026
HealthBench Hard
A harder subset of OpenAI's HealthBench for evaluating open-ended medical and health reasoning with rubric-based grading.
HealthBench Hard 2026 · updated April 20, 2026
MedXpertQA Text
A medical multiple-choice benchmark spanning many specialties with 10 answer options per question.
MedXpertQA (Text) 2026 · updated April 20, 2026
FrontierScience Research
A research-focused FrontierScience evaluation variant for scientific investigation and problem solving.
FrontierScience Research 2026 · updated April 20, 2026
TruthfulQA
A benchmark designed to measure whether language models produce truthful answers instead of repeating common misconceptions or misleading falsehoods.
TruthfulQA 2021 · updated April 20, 2026
Humanity's Last Exam without tools
Tool-free variant of Humanity's Last Exam that isolates a model's raw frontier reasoning.
HLE w/o tools 2026 · updated April 20, 2026
MMLU-Pro first-party comparison snapshot
A display-only MMLU-Pro reference from Arcee AI's Trinity-Large-Thinking launch chart.
MMLU-Pro (Arcee) 2026 · updated April 20, 2026
MMLU-Redux
An error-corrected re-annotation of MMLU questions, intended to keep broad knowledge evaluation reliable after labeling problems were found in the original benchmark.
MMLU-Redux 2026 · updated April 20, 2026
C-Eval
A Chinese-language academic and professional benchmark spanning humanities, social science, STEM, and applied subjects.
C-Eval 2023 · updated April 20, 2026
Multilingual Grade School Math
A multilingual benchmark that translates 250 grade school math problems from GSM8K into 10 typologically diverse languages: Bengali, German, Spanish, French, Japanese, Russian, Swahili, Telugu, Thai, and Chinese.
MGSM 2022 · updated April 20, 2026
MMLU-ProX
A multilingual extension of professional-level academic evaluation across many languages.
MMLU-ProX 2025 · updated April 20, 2026
NOVA-63
A broad multilingual benchmark row from Qwen's launch comparisons intended to measure cross-lingual capability beyond a single language family.
NOVA-63 2026 · updated April 20, 2026
INCLUDE
A multilingual benchmark used in provider tables to measure inclusive language coverage and cross-lingual understanding beyond common high-resource languages.
INCLUDE 2026 · updated April 20, 2026
PolyMath
A multilingual mathematical reasoning benchmark that tests whether math performance transfers across languages rather than only in English.
PolyMath 2026 · updated April 20, 2026
VWT2k-lite
A lighter multilingual benchmark slice published in provider tables for broad cross-lingual transfer and understanding.
VWT2k-lite 2026 · updated April 20, 2026
MAXIFE
A multilingual instruction-following and understanding benchmark row published in Qwen's launch comparisons.
MAXIFE 2026 · updated April 20, 2026
SWE-bench Multilingual
A multilingual extension of SWE-bench covering 300 problems across 9 programming languages, testing code generation and bug fixing beyond Python.
SWE Multilingual 2025 · updated April 20, 2026
Instruction-Following Eval
A benchmark that evaluates language models' ability to follow verifiable instructions such as formatting constraints, keyword inclusion/exclusion, length limits, and structural requirements.
IFEval 2023 · updated April 20, 2026
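The "verifiable" part means each constraint can be checked programmatically. Below is a sketch of two such checkers (keyword inclusion and a word-count cap), written in the spirit of IFEval's rule-based verifiers rather than copied from its actual code.

    def includes_keywords(response: str, keywords: list[str]) -> bool:
        """Constraint: every required keyword must appear in the response."""
        lowered = response.lower()
        return all(kw.lower() in lowered for kw in keywords)

    def within_word_limit(response: str, max_words: int) -> bool:
        """Constraint: the response must not exceed a word budget."""
        return len(response.split()) <= max_words

    response = "Paris is the capital of France, famous for the Louvre."
    print(includes_keywords(response, ["Paris", "Louvre"]), within_word_limit(response, 20))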
Instruction Following Benchmark
IFBench evaluates precise instruction-following generalization on 58 challenging, verifiable out-of-domain constraints. Unlike IFEval, which tests familiar constraint types, IFBench specifically measures how well models follow novel instructions they haven't been optimized for, exposing overfitting to common instruction patterns.
IFBench 2025 · updated April 20, 2026
American Invitational Mathematics Examination 2023
A 15-question, 3-hour examination where each answer is an integer from 000 to 999. Serves as the intermediate step between AMC 10/12 and the USA Mathematical Olympiad (USAMO).
AIME 2023 2023 · updated April 20, 2026
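Because every AIME answer is an integer from 000 to 999, automated grading usually reduces to normalizing the model's final answer into that range and comparing integers. A sketch of such a normalizer; the regex heuristic is mine, not a standard harness.

    import re

    def normalize_aime(answer: str) -> int | None:
        """Extract the last integer in the text and return it if it lies in 0..999."""
        matches = re.findall(r"\d+", answer.replace(",", ""))
        if not matches:
            return None
        value = int(matches[-1])
        return value if 0 <= value <= 999 else None

    print(normalize_aime("The answer is 042."), normalize_aime("answer: 7"))   # 42 7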
American Invitational Mathematics Examination 2024
The 2024 edition of AIME, maintaining the same format of 15 challenging mathematics problems with integer answers from 000 to 999.
AIME 2024 2024 · updated April 20, 2026
American Invitational Mathematics Examination 2025
The 2025 edition of AIME, featuring 15 challenging mathematics problems that test olympiad-level mathematical reasoning, with integer answers from 000 to 999.
AIME 2025 · updated April 20, 2026
AIME25 first-party comparison snapshot
A display-only AIME25 reference from Arcee AI's Trinity-Large-Thinking launch chart.
AIME25 (Arcee) 2026 · updated April 20, 2026
Harvard-MIT Mathematics Tournament February 2023
A prestigious high school mathematics competition hosted jointly by Harvard and MIT, featuring challenging problems across various mathematical disciplines.
HMMT Feb 2023 2023 · updated April 20, 2026
Harvard-MIT Mathematics Tournament February 2024
The 2024 February edition of the Harvard-MIT Mathematics Tournament, continuing the tradition of challenging high school mathematics competition.
HMMT Feb 2024 2024 · updated April 20, 2026
Harvard-MIT Mathematics Tournament February 2025
The February 2025 edition of the Harvard-MIT Mathematics Tournament, featuring challenging problems in competitive mathematics.
HMMT Feb 2025 2025 · updated April 20, 2026
Brown University Math Olympiad 2025
A challenging mathematics competition hosted at Brown University, featuring problems that test advanced mathematical reasoning and problem-solving at the olympiad level.
BRUMO 2025 2025 · updated April 20, 2026
MATH-500 Problem Set
A curated subset of 500 problems from the MATH dataset, covering algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus.
MATH-500 2021 · updated April 20, 2026
AIME 2026
A 2026 American Invitational Mathematics Examination snapshot used in frontier-model comparison tables for mathematical reasoning.
AIME26 2026 · updated April 20, 2026
International Physics Olympiad 2025 (Theory)
The three official theory problems from the 2025 International Physics Olympiad, scored with blinded human evaluation.
IPhO 2025 (Theory) 2026 · updated April 20, 2026
Harvard-MIT Mathematics Tournament February 2025
A February 2025 HMMT slice used in exact-value provider tables for advanced contest-math reasoning.
HMMT Feb 2025 2025 · updated April 20, 2026
Harvard-MIT Mathematics Tournament November 2025
A November 2025 HMMT slice for high-end mathematical reasoning comparisons.
HMMT Nov 2025 2025 · updated April 20, 2026
Harvard-MIT Mathematics Tournament February 2026
A February 2026 HMMT slice used in newer frontier-model math comparisons.
HMMT Feb 2026 2026 · updated April 20, 2026
MMAnswerBench
A multimodal mathematical reasoning benchmark that tests whether models can answer visually grounded math questions correctly.
MMAnswerBench 2026 · updated April 20, 2026
FrontierMath
An expert-level mathematical reasoning benchmark by Epoch AI featuring original, research-level problems created by mathematicians including IMO gold medalists and Fields Medal recipients. Problems require deep creativity and multi-step reasoning.
FrontierMath 2024 · updated April 20, 2026
United States of America Mathematical Olympiad 2026
The premier US mathematical olympiad competition, featuring proof-based problems that require deep mathematical insight and rigorous argumentation at the highest competition level.
USAMO 2026 2026 · updated April 20, 2026