Compare the World's Best AI Models

Comprehensive benchmark results for the latest large language models across multiple evaluation metrics. Find the perfect model for your use case.

LLM Benchmark Results

Showing 25 of 52 models

Benchmark groups: Knowledge, Coding, Math, Reasoning

Rank | Model | Creator | Type | Reasoning | Context | Score | Benchmark scores (Knowledge, Coding, Math, Reasoning columns, in source order)
1 | GPT-5 (high) | OpenAI | Proprietary | Reasoning | 128K | 72 | 93% 91% 89% 87% 85% 95% 97% 96% 91% 93% 92% 94% 89% 87%
2 | o1-preview | OpenAI | Proprietary | Reasoning | 200K | 71 | 92% 90% 88% 86% 86% 94% 96% 95% 90% 92% 91% 93% 88% 86%
3 | GPT-5 (medium) | OpenAI | Proprietary | Reasoning | 128K | 70 | 91% 89% 87% 85% 83% 93% 95% 94% 89% 91% 90% 92% 87% 85%
4 | Grok 4 | xAI | Proprietary | Non-Reasoning | 128K | 69 | 87% 86% 84% 82% 79% 87% 89% 88% 84% 86% 85% 87% 83% 81%
5 | GPT-5 mini | OpenAI | Proprietary | Reasoning | 128K | 68 | 88% 86% 84% 82% 80% 90% 92% 91% 86% 88% 87% 89% 84% 82%
6 | o3-pro | OpenAI | Proprietary | Reasoning | 200K | 68 | 88% 89% 87% 85% 80% 90% 92% 91% 86% 88% 87% 89% 86% 84%
7 | o3 | OpenAI | Proprietary | Reasoning | 200K | 67 | 86% 87% 85% 83% 78% 88% 90% 89% 84% 86% 85% 87% 84% 82%
8 | Qwen2.5-1M | Alibaba | Open Weight | Non-Reasoning | 1M | 66 | 84% 83% 81% 79% 76% 85% 87% 86% 81% 83% 82% 84% 81% 79%
9 | Qwen2.5-72B | Alibaba | Open Weight | Non-Reasoning | 128K | 65 | 83% 82% 80% 78% 75% 84% 86% 85% 80% 82% 81% 83% 80% 78%
10 | o4-mini (high) | OpenAI | Proprietary | Non-Reasoning | 200K | 65 | 82% 82% 80% 78% 74% 83% 85% 84% 79% 81% 80% 82% 80% 78%
11 | Gemini 2.5 Pro | Google | Proprietary | Non-Reasoning | 2M | 65 | 83% 83% 81% 79% 75% 84% 86% 85% 80% 82% 81% 83% 81% 79%
12 | DeepSeek Coder 2.0 | DeepSeek | Open Weight | Non-Reasoning | 128K | 64 | 80% 79% 77% 75% 82% 81% 83% 82% 77% 79% 78% 80% 78% 76%
13 | DeepSeek LLM 2.0 | DeepSeek | Open Weight | Non-Reasoning | 128K | 63 | 79% 78% 76% 74% 73% 80% 82% 81% 76% 78% 77% 79% 77% 75%
14 | Claude 4.1 Opus | Anthropic | Proprietary | Non-Reasoning | 200K | 61 | 76% 76% 74% 72% 68% 76% 78% 77% 72% 74% 73% 75% 74% 72%
15 | Claude 4 Sonnet | Anthropic | Proprietary | Non-Reasoning | 200K | 59 | 73% 73% 71% 69% 65% 73% 75% 74% 69% 71% 70% 72% 71% 69%
16 | Llama 3.1 405B | Meta | Open Weight | Non-Reasoning | 128K | 58 | 70% 70% 68% 66% 62% 70% 72% 71% 66% 68% 67% 69% 68% 66%
17 | Mistral Large 2 | Mistral | Proprietary | Non-Reasoning | 128K | 57 | 68% 68% 66% 64% 60% 68% 70% 69% 64% 66% 65% 67% 66% 64%
18 | GPT-4o | OpenAI | Proprietary | Non-Reasoning | 128K | 56 | 66% 66% 64% 62% 58% 66% 68% 67% 62% 64% 63% 65% 64% 62%
19 | Claude 3.5 Sonnet | Anthropic | Proprietary | Non-Reasoning | 200K | 55 | 65% 65% 63% 61% 57% 65% 67% 66% 61% 63% 62% 64% 63% 61%
20 | Gemini 1.5 Pro | Google | Proprietary | Non-Reasoning | 2M | 54 | 64% 64% 62% 60% 56% 64% 66% 65% 60% 62% 61% 63% 62% 60%
21 | Mistral 8x7B | Mistral | Open Weight | Non-Reasoning | 32K | 52 | 65% 64% 62% 60% 55% 65% 67% 66% 61% 63% 62% 64% 63% 61%
22 | Gemini 1.0 Pro | Google | Proprietary | Non-Reasoning | 32K | 52 | 62% 62% 60% 58% 54% 62% 64% 63% 58% 60% 59% 61% 60% 58%
23 | Claude 3 Opus | Anthropic | Proprietary | Non-Reasoning | 200K | 51 | 61% 61% 59% 57% 53% 61% 63% 62% 57% 59% 58% 60% 59% 57%
24 | GPT-4 Turbo | OpenAI | Proprietary | Non-Reasoning | 128K | 50 | 60% 60% 58% 56% 52% 60% 62% 61% 56% 58% 57% 59% 58% 56%
25 | Llama 3 70B | Meta | Open Weight | Non-Reasoning | 128K | 48 | 58% 58% 56% 54% 50% 58% 60% 59% 54% 56% 55% 57% 56% 54%
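
Filtering by creator, type, or reasoning, searching by name, and sorting by score can also be reproduced offline on the rows above. The following is a minimal Python sketch, not the site's implementation: the `ModelRow` class, its field names, and the `filter_rows` helper are hypothetical, and `score` is an assumed name for the unlabeled number that the rows are ranked by. Only a handful of rows are copied in for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRow:
    # Field names are illustrative assumptions, not the site's schema.
    name: str
    creator: str
    model_type: str   # "Proprietary" or "Open Weight"
    reasoning: bool   # True for "Reasoning", False for "Non-Reasoning"
    context: str
    score: int        # assumed name for the unlabeled ranking number


# A few rows copied from the leaderboard above.
ROWS = [
    ModelRow("GPT-5 (high)", "OpenAI", "Proprietary", True, "128K", 72),
    ModelRow("Grok 4", "xAI", "Proprietary", False, "128K", 69),
    ModelRow("Qwen2.5-1M", "Alibaba", "Open Weight", False, "1M", 66),
    ModelRow("Qwen2.5-72B", "Alibaba", "Open Weight", False, "128K", 65),
    ModelRow("Llama 3.1 405B", "Meta", "Open Weight", False, "128K", 58),
]


def filter_rows(rows, creator=None, model_type=None, reasoning=None, search=None):
    """Return rows matching every supplied filter, highest score first."""
    matches = [
        r for r in rows
        if (creator is None or r.creator == creator)
        and (model_type is None or r.model_type == model_type)
        and (reasoning is None or r.reasoning == reasoning)
        and (search is None or search.lower() in r.name.lower())
    ]
    return sorted(matches, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    # List open-weight models, best score first.
    for row in filter_rows(ROWS, model_type="Open Weight"):
        print(row.score, row.name)
```

Filters left as `None` are ignored, so the same helper covers a single filter, a combination of filters, or a plain name search.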

AI Model Benchmark Leaderboard

This leaderboard compares model performance across four evaluation categories: Knowledge, Coding, Math, and Reasoning. It covers proprietary models from the GPT, Claude, and Gemini families alongside open-weight alternatives such as Llama, Qwen, and DeepSeek.

Benchmark Categories:

  • Knowledge Benchmarks: MMLU, ARC-Challenge, HellaSwag, GPQA, OpenBookQA
  • Coding Benchmarks: HumanEval, CodeContest, programming problem solving
  • Math Benchmarks: AIME, HMMT, BRUMO, mathematical reasoning tasks
  • Reasoning Benchmarks: SimpleQA, MuSR, multi-step logical reasoning

Updated regularly with the latest model releases and performance data from leading AI research organizations.

Attribution

Benchmark data is sourced from the OpenBench open-source evaluation infrastructure, providing standardized and reproducible AI model assessments.