Compare the World's Best AI Models

Comprehensive benchmark results for the latest large language models across multiple evaluation metrics. Find the perfect model for your use case.

Filters & Search

Filter models by creator or search by name to find the perfect AI model for your needs
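
As a rough illustration of the kind of filtering the page offers, here is a minimal TypeScript sketch. The `ModelEntry` shape, `filterModels` helper, and sample data are hypothetical, not the leaderboard's actual code.

```typescript
// Hypothetical shape for one leaderboard entry; not the site's real schema.
interface ModelEntry {
  name: string;
  creator: string;
}

// Return entries matching an optional creator filter and/or name search.
function filterModels(
  models: ModelEntry[],
  creator?: string,
  query?: string,
): ModelEntry[] {
  return models.filter((m) => {
    const creatorOk = !creator || m.creator === creator;
    const queryOk =
      !query || m.name.toLowerCase().includes(query.toLowerCase());
    return creatorOk && queryOk;
  });
}

// Example: find OpenAI models whose names mention "mini".
const sample: ModelEntry[] = [
  { name: "GPT-5 (high)", creator: "OpenAI" },
  { name: "GPT-5 mini", creator: "OpenAI" },
  { name: "Gemini 2.5 Pro", creator: "Google" },
];
console.log(filterModels(sample, "OpenAI", "mini")); // [{ name: "GPT-5 mini", ... }]
```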

LLM Benchmark Results

Showing 15 of 20 models • Click column headers to sort • Scroll horizontally for all benchmarks

Benchmarks by category:

  • Knowledge: MMLU (57 subjects), GPQA (graduate-level), SuperGPQA (285 disciplines), OpenBookQA (open-book questions)
  • Coding: HumanEval (164 problems)
  • Math: AIME 2023, AIME 2024, AIME 2025 (math competition); HMMT Feb 2023, HMMT Feb 2024, HMMT Feb 2025 (Harvard-MIT tournament); BRUMO 2025 (math olympiad)
  • Reasoning: SimpleQA (factuality), MuSR (multi-step reasoning)

| # | Model | Creator | License | Score | MMLU | GPQA | SuperGPQA | OpenBookQA | HumanEval | AIME 2023 | AIME 2024 | AIME 2025 | HMMT Feb 2023 | HMMT Feb 2024 | HMMT Feb 2025 | BRUMO 2025 | SimpleQA | MuSR |
|---|-------|---------|---------|-------|------|------|-----------|------------|-----------|-----------|-----------|-----------|---------------|---------------|---------------|------------|----------|------|
| 1 | GPT-5 (high) | OpenAI | Closed-source | 69 | 91% | 89% | 87% | 85% | 82% | 93% | 95% | 94% | 88% | 90% | 89% | 91% | 86% | 84% |
| 2 | GPT-5 (medium) | OpenAI | Closed-source | 68 | 89% | 87% | 85% | 83% | 80% | 91% | 93% | 92% | 86% | 88% | 87% | 89% | 84% | 82% |
| 3 | Grok 4 | xAI | Closed-source | 68 | 86% | 85% | 83% | 81% | 78% | 86% | 88% | 87% | 83% | 85% | 84% | 86% | 82% | 80% |
| 4 | o3-pro | OpenAI | Closed-source | 68 | 87% | 88% | 86% | 84% | 79% | 89% | 91% | 90% | 85% | 87% | 86% | 88% | 85% | 83% |
| 5 | o3 | OpenAI | Closed-source | 67 | 85% | 86% | 84% | 82% | 77% | 87% | 89% | 88% | 83% | 85% | 84% | 86% | 83% | 81% |
| 6 | o4-mini (high) | OpenAI | Closed-source | 65 | 82% | 82% | 80% | 78% | 74% | 83% | 85% | 84% | 79% | 81% | 80% | 82% | 80% | 78% |
| 7 | Gemini 2.5 Pro | Google | Closed-source | 65 | 83% | 83% | 81% | 79% | 75% | 84% | 86% | 85% | 80% | 82% | 81% | 83% | 81% | 79% |
| 8 | GPT-5 mini | OpenAI | Closed-source | 64 | 79% | 79% | 77% | 75% | 71% | 80% | 82% | 81% | 76% | 78% | 77% | 79% | 77% | 75% |
| 9 | Claude 4.1 Opus | Anthropic | Closed-source | 61 | 76% | 76% | 74% | 72% | 68% | 76% | 78% | 77% | 72% | 74% | 73% | 75% | 74% | 72% |
| 10 | Claude 4 Sonnet | Anthropic | Closed-source | 59 | 73% | 73% | 71% | 69% | 65% | 73% | 75% | 74% | 69% | 71% | 70% | 72% | 71% | 69% |
| 11 | Llama 3.1 405B | Meta | Open-source | 58 | 70% | 70% | 68% | 66% | 62% | 70% | 72% | 71% | 66% | 68% | 67% | 69% | 68% | 66% |
| 12 | Mistral Large 2 | Mistral | Open-source | 57 | 68% | 68% | 66% | 64% | 60% | 68% | 70% | 69% | 64% | 66% | 65% | 67% | 66% | 64% |
| 13 | GPT-4o | OpenAI | Closed-source | 56 | 66% | 66% | 64% | 62% | 58% | 66% | 68% | 67% | 62% | 64% | 63% | 65% | 64% | 62% |
| 14 | Claude 3.5 Sonnet | Anthropic | Closed-source | 55 | 65% | 65% | 63% | 61% | 57% | 65% | 67% | 66% | 61% | 63% | 62% | 64% | 63% | 61% |
| 15 | Gemini 1.5 Pro | Google | Closed-source | 54 | 64% | 64% | 62% | 60% | 56% | 64% | 66% | 65% | 60% | 62% | 61% | 63% | 62% | 60% |
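
The "click column headers to sort" behavior noted above boils down to ordering rows by a chosen benchmark column. The TypeScript sketch below is a hypothetical illustration; the `LeaderboardRow` type and field names are assumptions, not the leaderboard's actual implementation, though the sample values are taken from the table above.

```typescript
// Hypothetical row type: benchmark scores keyed by column name (0-100).
interface LeaderboardRow {
  model: string;
  scores: Record<string, number>;
}

// Sort rows descending by one benchmark column, e.g. "HumanEval".
function sortByBenchmark(rows: LeaderboardRow[], column: string): LeaderboardRow[] {
  return [...rows].sort(
    (a, b) => (b.scores[column] ?? 0) - (a.scores[column] ?? 0),
  );
}

// Example with values from the table above.
const rows: LeaderboardRow[] = [
  { model: "o3-pro", scores: { HumanEval: 79, MMLU: 87 } },
  { model: "Grok 4", scores: { HumanEval: 78, MMLU: 86 } },
  { model: "GPT-5 (high)", scores: { HumanEval: 82, MMLU: 91 } },
];
console.log(sortByBenchmark(rows, "HumanEval").map((r) => r.model));
// ["GPT-5 (high)", "o3-pro", "Grok 4"]
```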


AI Model Benchmark Leaderboard

Our comprehensive AI model leaderboard provides detailed performance comparisons across four critical evaluation categories: Knowledge, Coding, Mathematics, and Reasoning. Compare leading models including GPT-5, Claude, Gemini, Grok, and top open-source alternatives such as Llama and Mistral.

Benchmark Categories:

  • Knowledge Benchmarks: MMLU, GPQA, SuperGPQA, OpenBookQA
  • Coding Benchmarks: HumanEval, CodeContest, programming problem solving
  • Math Benchmarks: AIME, HMMT, BRUMO, mathematical reasoning tasks
  • Reasoning Benchmarks: SimpleQA, MuSR, multi-step logical reasoning
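
For anyone consuming this grouping programmatically, it maps naturally onto a simple lookup structure. The TypeScript below is a hypothetical sketch built from the benchmark columns shown in the table above; it is not a published schema.

```typescript
// Hypothetical mapping of evaluation category to the benchmarks shown above.
const BENCHMARK_CATEGORIES: Record<string, string[]> = {
  Knowledge: ["MMLU", "GPQA", "SuperGPQA", "OpenBookQA"],
  Coding: ["HumanEval"],
  Math: [
    "AIME 2023", "AIME 2024", "AIME 2025",
    "HMMT Feb 2023", "HMMT Feb 2024", "HMMT Feb 2025",
    "BRUMO 2025",
  ],
  Reasoning: ["SimpleQA", "MuSR"],
};

// Example: look up which category a benchmark belongs to.
function categoryOf(benchmark: string): string | undefined {
  return Object.keys(BENCHMARK_CATEGORIES).find((cat) =>
    BENCHMARK_CATEGORIES[cat].includes(benchmark),
  );
}

console.log(categoryOf("MuSR")); // "Reasoning"
```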

Updated regularly with the latest model releases and performance data from leading AI research organizations.

Attribution

Benchmark data is sourced from the OpenBench open-source evaluation infrastructure, providing standardized and reproducible AI model assessments.