Explore 14 benchmarks used to evaluate AI language models across 4 categories.
Massive Multitask Language Understanding
A comprehensive multiple-choice question answering test covering 57 tasks including elementary mathematics, US history, computer science, law, and more. Tests knowledge across diverse academic subjects from high school to professional level.
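Multi-task, multiple-choice benchmarks like this one are typically scored by per-task accuracy averaged across tasks. The sketch below is a hypothetical illustration of that scoring scheme; the task names, question IDs, and answer letters are invented for the example.

```python
from collections import defaultdict

def score_multiple_choice(predictions, gold):
    """Compute per-task accuracy and the macro average over tasks.

    predictions / gold: dicts mapping (task, question_id) -> answer letter.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for key, answer in gold.items():
        task = key[0]
        total[task] += 1
        if predictions.get(key) == answer:
            correct[task] += 1
    per_task = {t: correct[t] / total[t] for t in total}
    macro = sum(per_task.values()) / len(per_task)
    return per_task, macro

# Toy example with two tasks of two questions each (made-up data).
gold = {("us_history", 1): "A", ("us_history", 2): "C",
        ("law", 1): "B", ("law", 2): "D"}
preds = {("us_history", 1): "A", ("us_history", 2): "C",
         ("law", 1): "B", ("law", 2): "A"}
per_task, macro = score_multiple_choice(preds, gold)
print(per_task, macro)  # us_history: 1.0, law: 0.5, macro: 0.75
```

Macro averaging weights each of the 57 subjects equally, regardless of how many questions each contains.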
Graduate-Level Google-Proof Q&A
A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Designed to be difficult even for skilled non-experts with access to Google.
SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines
An expanded version of GPQA that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines, providing comprehensive coverage of academic domains.
OpenBookQA: A New Dataset for Open Book Question Answering
A question-answering dataset modeled after open book exams for assessing human understanding of a subject. Requires combining elementary science facts from the accompanying "open book" with broad common-sense reasoning.
American Invitational Mathematics Examination 2023
A 15-question, 3-hour examination where each answer is an integer from 000 to 999. Serves as the intermediate step between AMC 10/12 and the USA Mathematical Olympiad (USAMO).
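Because every AIME answer is an integer from 000 to 999, model outputs can be graded by exact match after extracting and range-checking the final number. The helper below is an illustrative sketch, not any benchmark's official grader; the function names and the trailing-number heuristic are assumptions.

```python
import re

def parse_aime_answer(text):
    """Extract an AIME-style answer (an integer 0-999) from model output.

    Looks for a 1-3 digit number at the end of the text and rejects
    anything outside the valid 000-999 range. Returns None on failure.
    """
    match = re.search(r"\b(\d{1,3})\s*$", text.strip())
    if match is None:
        return None
    value = int(match.group(1))
    return value if 0 <= value <= 999 else None

def exact_match(predicted_text, gold):
    """Score one problem: 1 if the parsed answer equals gold, else 0."""
    return int(parse_aime_answer(predicted_text) == gold)

print(parse_aime_answer("The answer is 042"))          # 42
print(exact_match("Therefore the answer is 107", 107)) # 1
```

Note that leading zeros ("042") parse to the same integer as "42", matching the competition's answer format.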
American Invitational Mathematics Examination 2024
The 2024 edition of AIME, maintaining the same format of 15 challenging mathematics problems with integer answers from 000 to 999.
American Invitational Mathematics Examination 2025
The most recent AIME examination, featuring 15 challenging problems that test advanced mathematical reasoning, with integer answers from 000 to 999.
Harvard-MIT Mathematics Tournament February 2023
A prestigious high school mathematics competition hosted jointly by Harvard and MIT, featuring challenging problems across various mathematical disciplines.
Harvard-MIT Mathematics Tournament February 2024
The February 2024 edition of the Harvard-MIT Mathematics Tournament, continuing its tradition of challenging high school competition problems.
Harvard-MIT Mathematics Tournament February 2025
The most recent February edition of the Harvard-MIT Mathematics Tournament, featuring the latest challenging problems in competitive mathematics.
Bulgarian Mathematical Olympiad 2025
A national mathematical olympiad featuring problems that test advanced reasoning and olympiad-level problem-solving skills.
Measuring Short-Form Factuality in Large Language Models
A benchmark that evaluates the ability of language models to answer short, fact-seeking questions accurately. Focuses on factual correctness rather than reasoning complexity.
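Short-form factuality benchmarks are commonly graded by comparing a normalized prediction against a set of accepted gold answers. The sketch below shows one such normalized exact-match scheme; it is illustrative only, not this benchmark's official grading method, and the normalization rules (lowercasing, stripping punctuation and articles) are assumptions.

```python
import string

def normalize(answer):
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in answer.split() if t not in {"a", "an", "the"}]
    return " ".join(tokens)

def grade(predicted, gold_answers):
    """Return True if the prediction matches any accepted gold answer."""
    return normalize(predicted) in {normalize(g) for g in gold_answers}

print(grade("The Eiffel Tower!", ["Eiffel Tower"]))  # True
print(grade("Paris", ["Lyon"]))                      # False
```

In practice, graders for such benchmarks often also handle aliases and partially correct answers, which simple string matching does not capture.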
Testing the Limits of Chain-of-thought with Multistep Soft Reasoning
A dataset for evaluating language models on multistep soft reasoning tasks specified in natural language narratives. Tests the ability to perform complex, structured reasoning.