DeepSeek V3.1 (Reasoning) vs GLM-4.5-Air

Side-by-side benchmark comparison across knowledge, coding, math, and reasoning.

Quick Verdict

GLM-4.5-Air wins overall with a score of 26 vs 25 (a 1-point difference), and wins all 4 of the 4 categories.

Knowledge

Average: DeepSeek V3.1 (Reasoning) 31.8 | GLM-4.5-Air 32.8

Benchmark   | DeepSeek V3.1 (Reasoning) | GLM-4.5-Air
MMLU        | 34                        | 35
GPQA        | 33                        | 34
SuperGPQA   | 31                        | 32
OpenBookQA  | 29                        | 30

Coding

Average: DeepSeek V3.1 (Reasoning) 26 | GLM-4.5-Air 27

Benchmark | DeepSeek V3.1 (Reasoning) | GLM-4.5-Air
HumanEval | 26                        | 27

Mathematics

Average: DeepSeek V3.1 (Reasoning) 33 | GLM-4.5-Air 34

Benchmark     | DeepSeek V3.1 (Reasoning) | GLM-4.5-Air
AIME 2023     | 34                        | 35
AIME 2024     | 36                        | 37
AIME 2025     | 35                        | 36
HMMT Feb 2023 | 30                        | 31
HMMT Feb 2024 | 32                        | 33
HMMT Feb 2025 | 31                        | 32
BRUMO 2025    | 33                        | 34

Reasoning

Average: DeepSeek V3.1 (Reasoning) 31 | GLM-4.5-Air 32

Benchmark | DeepSeek V3.1 (Reasoning) | GLM-4.5-Air
SimpleQA  | 32                        | 33
MuSR      | 30                        | 31
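The per-category averages quoted on this page appear to be simple arithmetic means of the listed benchmark scores, rounded to one decimal place (e.g. DeepSeek's knowledge average: (34 + 33 + 31 + 29) / 4 = 31.75 ≈ 31.8). A minimal Python sketch that recomputes them from the values above (the score lists are copied from this page; the function name is illustrative, not from any benchmark tooling):

```python
# Benchmark scores copied from the tables on this page.
# Order of lists: DeepSeek V3.1 (Reasoning), then GLM-4.5-Air.
scores = {
    "Knowledge": {
        "DeepSeek V3.1 (Reasoning)": [34, 33, 31, 29],
        "GLM-4.5-Air": [35, 34, 32, 30],
    },
    "Mathematics": {
        "DeepSeek V3.1 (Reasoning)": [34, 36, 35, 30, 32, 31, 33],
        "GLM-4.5-Air": [35, 37, 36, 31, 33, 32, 34],
    },
    "Reasoning": {
        "DeepSeek V3.1 (Reasoning)": [32, 30],
        "GLM-4.5-Air": [33, 31],
    },
}

def category_average(values):
    """Mean of the benchmark scores, rounded to one decimal place."""
    return round(sum(values) / len(values), 1)

for category, models in scores.items():
    for model, values in models.items():
        print(f"{category} - {model}: {category_average(values)}")
```

Running this reproduces the category averages shown above (31.8 vs 32.8 for knowledge, 33 vs 34 for math, 31 vs 32 for reasoning); coding is omitted since it has a single benchmark, where the average equals the HumanEval score.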

Frequently Asked Questions

Which is better, DeepSeek V3.1 (Reasoning) or GLM-4.5-Air?

GLM-4.5-Air scores higher overall with 26 vs 25, a difference of 1 point across all benchmarks.

Which is better for knowledge tasks, DeepSeek V3.1 (Reasoning) or GLM-4.5-Air?

GLM-4.5-Air leads in knowledge tasks with an average score of 32.8 vs 31.8.

Which is better for coding, DeepSeek V3.1 (Reasoning) or GLM-4.5-Air?

GLM-4.5-Air leads in coding with an average score of 27 vs 26.

Which is better for math, DeepSeek V3.1 (Reasoning) or GLM-4.5-Air?

GLM-4.5-Air leads in math with an average score of 34 vs 33.

Which is better for reasoning, DeepSeek V3.1 (Reasoning) or GLM-4.5-Air?

GLM-4.5-Air leads in reasoning with an average score of 32 vs 31.