Compare the World's Best AI Models

Comprehensive benchmark results for the latest large language models across multiple evaluation metrics. Find the perfect model for your use case.

Filters & Search

Filter models by creator or search by name to find the perfect AI model for your needs
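
As a rough illustration of the kind of filtering the page offers, here is a minimal TypeScript sketch. The `ModelEntry` shape, `filterModels` helper, and sample data are hypothetical, not the leaderboard's actual code.

```typescript
// Hypothetical shape for one leaderboard entry; not the site's real schema.
interface ModelEntry {
  name: string;
  creator: string;
}

// Return entries matching an optional creator filter and/or name search.
function filterModels(
  models: ModelEntry[],
  creator?: string,
  query?: string,
): ModelEntry[] {
  return models.filter((m) => {
    const creatorOk = !creator || m.creator === creator;
    const queryOk =
      !query || m.name.toLowerCase().includes(query.toLowerCase());
    return creatorOk && queryOk;
  });
}

// Example: find OpenAI models whose names mention "mini".
const sample: ModelEntry[] = [
  { name: "GPT-5 (high)", creator: "OpenAI" },
  { name: "GPT-5 mini", creator: "OpenAI" },
  { name: "Gemini 2.5 Pro", creator: "Google" },
];
console.log(filterModels(sample, "OpenAI", "mini")); // [{ name: "GPT-5 mini", ... }]
```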

LLM Benchmark Results

Showing 15 of 20 models • Click column headers to sort • Scroll horizontally for all benchmarks

Benchmarks by category:

  • Knowledge: MMLU (57 subjects), GPQA (graduate-level), SuperGPQA (285 disciplines), OpenBookQA (open-book questions)
  • Coding: HumanEval (164 problems)
  • Math: AIME 2023, AIME 2024, AIME 2025 (math competition); HMMT Feb 2023, HMMT Feb 2024, HMMT Feb 2025 (Harvard-MIT tournament); BRUMO 2025 (math olympiad)
  • Reasoning: SimpleQA (factuality), MuSR (multi-step reasoning)

| # | Model | Creator | License | Score | MMLU | GPQA | SuperGPQA | OpenBookQA | HumanEval | AIME 2023 | AIME 2024 | AIME 2025 | HMMT Feb 2023 | HMMT Feb 2024 | HMMT Feb 2025 | BRUMO 2025 | SimpleQA | MuSR |
|---|-------|---------|---------|-------|------|------|-----------|------------|-----------|-----------|-----------|-----------|---------------|---------------|---------------|------------|----------|------|
| 1 | GPT-5 (high) | OpenAI | Closed-source | 69 | 91% | 89% | 87% | 85% | 82% | 93% | 95% | 94% | 88% | 90% | 89% | 91% | 86% | 84% |
| 2 | GPT-5 (medium) | OpenAI | Closed-source | 68 | 89% | 87% | 85% | 83% | 80% | 91% | 93% | 92% | 86% | 88% | 87% | 89% | 84% | 82% |
| 3 | Grok 4 | xAI | Closed-source | 68 | 86% | 85% | 83% | 81% | 78% | 86% | 88% | 87% | 83% | 85% | 84% | 86% | 82% | 80% |
| 4 | o3-pro | OpenAI | Closed-source | 68 | 87% | 88% | 86% | 84% | 79% | 89% | 91% | 90% | 85% | 87% | 86% | 88% | 85% | 83% |
| 5 | o3 | OpenAI | Closed-source | 67 | 85% | 86% | 84% | 82% | 77% | 87% | 89% | 88% | 83% | 85% | 84% | 86% | 83% | 81% |
| 6 | o4-mini (high) | OpenAI | Closed-source | 65 | 82% | 82% | 80% | 78% | 74% | 83% | 85% | 84% | 79% | 81% | 80% | 82% | 80% | 78% |
| 7 | Gemini 2.5 Pro | Google | Closed-source | 65 | 83% | 83% | 81% | 79% | 75% | 84% | 86% | 85% | 80% | 82% | 81% | 83% | 81% | 79% |
| 8 | GPT-5 mini | OpenAI | Closed-source | 64 | 79% | 79% | 77% | 75% | 71% | 80% | 82% | 81% | 76% | 78% | 77% | 79% | 77% | 75% |
| 9 | Claude 4.1 Opus | Anthropic | Closed-source | 61 | 76% | 76% | 74% | 72% | 68% | 76% | 78% | 77% | 72% | 74% | 73% | 75% | 74% | 72% |
| 10 | Claude 4 Sonnet | Anthropic | Closed-source | 59 | 73% | 73% | 71% | 69% | 65% | 73% | 75% | 74% | 69% | 71% | 70% | 72% | 71% | 69% |
| 11 | Llama 3.1 405B | Meta | Open-source | 58 | 70% | 70% | 68% | 66% | 62% | 70% | 72% | 71% | 66% | 68% | 67% | 69% | 68% | 66% |
| 12 | Mistral Large 2 | Mistral | Open-source | 57 | 68% | 68% | 66% | 64% | 60% | 68% | 70% | 69% | 64% | 66% | 65% | 67% | 66% | 64% |
| 13 | GPT-4o | OpenAI | Closed-source | 56 | 66% | 66% | 64% | 62% | 58% | 66% | 68% | 67% | 62% | 64% | 63% | 65% | 64% | 62% |
| 14 | Claude 3.5 Sonnet | Anthropic | Closed-source | 55 | 65% | 65% | 63% | 61% | 57% | 65% | 67% | 66% | 61% | 63% | 62% | 64% | 63% | 61% |
| 15 | Gemini 1.5 Pro | Google | Closed-source | 54 | 64% | 64% | 62% | 60% | 56% | 64% | 66% | 65% | 60% | 62% | 61% | 63% | 62% | 60% |
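
The "click column headers to sort" behavior noted above boils down to ordering rows by a chosen benchmark column. The TypeScript sketch below is a hypothetical illustration; the `LeaderboardRow` type and field names are assumptions, not the leaderboard's actual implementation, though the sample values are taken from the table above.

```typescript
// Hypothetical row type: benchmark scores keyed by column name (0-100).
interface LeaderboardRow {
  model: string;
  scores: Record<string, number>;
}

// Sort rows descending by one benchmark column, e.g. "HumanEval".
function sortByBenchmark(rows: LeaderboardRow[], column: string): LeaderboardRow[] {
  return [...rows].sort(
    (a, b) => (b.scores[column] ?? 0) - (a.scores[column] ?? 0),
  );
}

// Example with values from the table above.
const rows: LeaderboardRow[] = [
  { model: "o3-pro", scores: { HumanEval: 79, MMLU: 87 } },
  { model: "Grok 4", scores: { HumanEval: 78, MMLU: 86 } },
  { model: "GPT-5 (high)", scores: { HumanEval: 82, MMLU: 91 } },
];
console.log(sortByBenchmark(rows, "HumanEval").map((r) => r.model));
// ["GPT-5 (high)", "o3-pro", "Grok 4"]
```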


AI Model Benchmark Leaderboard

Our comprehensive AI model leaderboard provides detailed performance comparisons across four critical evaluation categories: Knowledge, Coding, Mathematics, and Reasoning. Compare leading models including GPT-5, Claude, Gemini, Grok, and top open-source alternatives such as Llama and Mistral.

Benchmark Categories:

  • Knowledge Benchmarks: MMLU, GPQA, SuperGPQA, OpenBookQA
  • Coding Benchmarks: HumanEval, CodeContest, programming problem solving
  • Math Benchmarks: AIME, HMMT, BRUMO, mathematical reasoning tasks
  • Reasoning Benchmarks: SimpleQA, MuSR, multi-step logical reasoning
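
For anyone consuming this grouping programmatically, it maps naturally onto a simple lookup structure. The TypeScript below is a hypothetical sketch built from the benchmark columns shown in the table above; it is not a published schema.

```typescript
// Hypothetical mapping of evaluation category to the benchmarks shown above.
const BENCHMARK_CATEGORIES: Record<string, string[]> = {
  Knowledge: ["MMLU", "GPQA", "SuperGPQA", "OpenBookQA"],
  Coding: ["HumanEval"],
  Math: [
    "AIME 2023", "AIME 2024", "AIME 2025",
    "HMMT Feb 2023", "HMMT Feb 2024", "HMMT Feb 2025",
    "BRUMO 2025",
  ],
  Reasoning: ["SimpleQA", "MuSR"],
};

// Example: look up which category a benchmark belongs to.
function categoryOf(benchmark: string): string | undefined {
  return Object.keys(BENCHMARK_CATEGORIES).find((cat) =>
    BENCHMARK_CATEGORIES[cat].includes(benchmark),
  );
}

console.log(categoryOf("MuSR")); // "Reasoning"
```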

Updated regularly with the latest model releases and performance data from leading AI research organizations.

Attribution

Benchmark data is sourced from the OpenBench open-source evaluation infrastructure, providing standardized and reproducible AI model assessments.