Model comparison

GLM-5.2 vs Qwen3.6-27B

Updated July 31, 2026. Public scores include evidence status and uncertainty. They are not guarantees for a specific workload.


GLM-5.2

Z.AI

62.9/100

Estimated · Public rank #41

90% interval 47.7–78.2

Qwen3.6-27B

Alibaba

52.8/100

Estimated · Public rank #100

90% interval 41.3–64.3

GLM-5.2 has the higher public score estimate, 62.94 versus 52.81, but the 90% score intervals overlap. Treat that as a lead, not a settled winner.
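The overlap claim can be checked directly from the two reported 90% intervals. The sketch below is illustrative only: the `Interval` type and `overlaps` helper are hypothetical names introduced here, and the check treats the intervals as simple closed ranges, which says nothing about how the intervals themselves were estimated.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    low: float
    high: float


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True when two closed intervals share at least one point."""
    return a.low <= b.high and b.low <= a.high


glm = Interval(47.7, 78.2)   # 90% interval reported for GLM-5.2
qwen = Interval(41.3, 64.3)  # 90% interval reported for Qwen3.6-27B

# The shared region runs from the larger lower bound to the smaller upper bound.
shared_low = max(glm.low, qwen.low)
shared_high = min(glm.high, qwen.high)

print(overlaps(glm, qwen))      # True
print(shared_low, shared_high)  # 47.7 64.3
```

Both point estimates (62.94 and 52.81) fall inside the shared region, which is why the lead is not treated as settled.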

10 results are shared. Category rows based on different benchmark sets are marked directional and do not name a winner.

Which one for your work

Recommendations appear only when a shared evidence basis or an explicit operating constraint supports the call. Secondary and unsupported use cases stay disclosed below the initial list.

Agentic work
Tool use, computer use, and multi-step task completion
GLM-5.2
GLM-5.2 leads on the 1 weighted benchmark row that both models share.
Confidence: limited

Long documents
Prompts that approach the documented context limit
GLM-5.2
GLM-5.2 has the larger documented context window.
Confidence: documented

Coding work
Code generation, repair, and software-engineering tasks
Not enough matched evidence
The category averages use different weighted benchmark sets, so they are directional rather than like-for-like.
Confidence: limited

Chat turn cost
1K fresh input + 500 output tokens
Not enough matched evidence
A complete, comparable API-rate estimate is available for GLM-5.2 only.
Confidence: listed-rates

Cache-heavy agent loop cost
200K cached + 20K fresh input + 10K output tokens
Not enough matched evidence
A complete, comparable API-rate estimate is available for GLM-5.2 only.
Confidence: rate-fallback

Repository review cost
50K fresh input + 3K output tokens
Not enough matched evidence
A complete, comparable API-rate estimate is available for GLM-5.2 only.
Confidence: listed-rates

What is actually comparable

Shared results can support a head-to-head reading. Results present for only one model describe coverage, not superiority.

Shared results: 10
GLM-5.2 only: 8
Qwen3.6-27B only: 28
Like-for-like categories: 2 / 8

2 categories use different evidence sets. Those rows remain visible for coverage context but do not name a winner.
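The shared and single-model counts are set arithmetic over benchmark names. As a minimal sketch, the snippet below applies that arithmetic to the Knowledge rows of the benchmark evidence ledger only, not to the full ledger; the variable names are illustrative.

```python
# Knowledge benchmarks with a public result for each model,
# taken from the Knowledge section of the benchmark evidence ledger.
glm_rows = {"GPQA", "GPQA-D", "HLE", "HLE w/o tools"}
qwen_rows = {"GPQA", "HLE", "MMLU-Pro", "MMLU-Redux", "SuperGPQA", "C-Eval"}

shared = glm_rows & qwen_rows      # results that support a head-to-head reading
glm_only = glm_rows - qwen_rows    # coverage for GLM-5.2 alone
qwen_only = qwen_rows - glm_rows   # coverage for Qwen3.6-27B alone

print(sorted(shared))                              # ['GPQA', 'HLE']
print(len(shared), len(glm_only), len(qwen_only))  # 2 2 4
```

Only the `shared` set supports a head-to-head reading; the other two sets describe coverage.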

Category results, on a stated basis

Each row states whether both averages use the same weighted benchmark set. Directional and not-comparable rows remain visible, but they never receive a winner in this template.
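One way to read the three labels is as a rule over the weighted benchmark sets behind each average. The sketch below is an assumption about how such a rule could work, not the page's actual implementation: `classify_basis` is a hypothetical helper, and the benchmark names are placeholders because the page reports only row counts, not which benchmarks are weighted.

```python
def classify_basis(a: frozenset, b: frozenset) -> str:
    """Label a category by comparing the two models' weighted benchmark sets."""
    if not a or not b:
        return "Not comparable"    # at least one side has no weighted rows
    if a == b:
        return "Like-for-like"     # identical evidence basis on both sides
    return "Directional only"      # both measured, but on different sets


# Placeholder benchmark names, chosen only to mirror the reported row counts.
one_vs_one = classify_basis(frozenset({"bench_a"}), frozenset({"bench_a"}))
one_vs_three = classify_basis(
    frozenset({"bench_a"}), frozenset({"bench_b", "bench_c", "bench_d"})
)
zero_vs_two = classify_basis(frozenset(), frozenset({"bench_e", "bench_f"}))

print(one_vs_one)    # Like-for-like
print(one_vs_three)  # Directional only
print(zero_vs_two)   # Not comparable
```

Under this reading, only a "Like-for-like" row can carry a winner, which matches how the Agentic and Math rows are treated.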

Agentic

Like-for-like

GLM-5.2: 81.0
Qwen3.6-27B: 59.3
Weighted basis: 1 vs 1 rows
Reading: GLM-5.2 leads

Math

Like-for-like

GLM-5.2: 95.9
Qwen3.6-27B: 89.2
Weighted basis: 2 vs 2 rows
Reading: GLM-5.2 leads

Coding

Directional only

GLM-5.2: 62.1
Qwen3.6-27B: 77.5
Weighted basis: 1 vs 3 rows
Reading: Directional only

Knowledge

Directional only

GLM-5.2: 59.6
Qwen3.6-27B: 53.3
Weighted basis: 2 vs 4 rows
Reading: Directional only

Reasoning

Not comparable

GLM-5.2: Not measured
Qwen3.6-27B: Not measured
Weighted basis: 0 vs 0 rows
Reading: Not comparable

Multilingual

Not comparable

GLM-5.2: Not measured
Qwen3.6-27B: Not measured
Weighted basis: 0 vs 0 rows
Reading: Not comparable

Multimodal

Not comparable

GLM-5.2: Not measured
Qwen3.6-27B: 76.7
Weighted basis: 0 vs 2 rows
Reading: Not comparable

Instruction following

Not comparable

GLM-5.2: Not measured
Qwen3.6-27B: Not measured
Weighted basis: 0 vs 0 rows
Reading: Not comparable

Category averages with the server-provided evidence basis for GLM-5.2 and Qwen3.6-27B
Category	GLM-5.2	Qwen3.6-27B	Weighted basis	Reading
Agentic	81.0	59.3	Like-for-like, 1 vs 1 rows	GLM-5.2 leads
Math	95.9	89.2	Like-for-like, 2 vs 2 rows	GLM-5.2 leads
Coding	62.1	77.5	Directional only, 1 vs 3 rows	Directional only
Knowledge	59.6	53.3	Directional only, 2 vs 4 rows	Directional only
Reasoning	Not measured	Not measured	Not comparable, 0 vs 0 rows	Not comparable
Multilingual	Not measured	Not measured	Not comparable, 0 vs 0 rows	Not comparable
Multimodal	Not measured	76.7	Not comparable, 0 vs 2 rows	Not comparable
Instruction following	Not measured	Not measured	Not comparable, 0 vs 0 rows	Not comparable

Shape of the matched evidence

Only shared public evidence is shown. Sparse evidence stays a ruled list rather than being closed into a radar shape.

Too few matched category axes support a radar. The ruled list below shows only shared benchmark results; positions use each benchmark’s normalized display scale when available.

HLE
Knowledge
GLM-5.2: 54.7%; Qwen3.6-27B: 24%; Normalized gap: 30.7

Terminal-Bench 2.0
Agentic
GLM-5.2: 81%; Qwen3.6-27B: 59.3%; Normalized gap: 21.7

SWE-bench Pro
Coding
GLM-5.2: 62.1%; Qwen3.6-27B: 53.5%; Normalized gap: 8.6

HMMT Feb 2026
Math
GLM-5.2: 92.5%; Qwen3.6-27B: 84.3%; Normalized gap: 8.2

AIME26
Math
GLM-5.2: 99.2%; Qwen3.6-27B: 94.1%; Normalized gap: 5.1
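For these five percentage-scaled benchmarks, each listed gap matches the plain difference between the two scores. The sketch below reproduces the gaps under that assumption; benchmarks reported on another scale would need their own normalization, which this page does not spell out.

```python
# (GLM-5.2 score, Qwen3.6-27B score) in percent, for the shared results listed above.
shared_results = {
    "HLE": (54.7, 24.0),
    "Terminal-Bench 2.0": (81.0, 59.3),
    "SWE-bench Pro": (62.1, 53.5),
    "HMMT Feb 2026": (92.5, 84.3),
    "AIME26": (99.2, 94.1),
}

# Assumption: on a 0-100 scale the normalized gap is the simple score difference.
gaps = {name: round(glm - qwen, 1) for name, (glm, qwen) in shared_results.items()}

# Print from the widest gap to the narrowest, matching the order of the list.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gap}")
```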

What each workload costs

Three fixed token mixes turn per-token rates into comparable decisions. Each scenario states context fit and whether cached input had to fall back to the published list-input rate.

Chat turn

1K fresh input + 500 output tokens

GLM-5.2: $0.0036; Fits in one request
Qwen3.6-27B: Self-hosted; infrastructure cost varies; Fits in one request

Qwen3.6-27B has no comparable published API token rate.

Repository review

50K fresh input + 3K output tokens

GLM-5.2: $0.0832; Fits in one request
Qwen3.6-27B: Self-hosted; infrastructure cost varies; Fits in one request

Qwen3.6-27B has no comparable published API token rate.

Cache-heavy agent loop

200K cached + 20K fresh input + 10K output tokens

GLM-5.2: $0.352; Fits in one request; Cached input priced at the published list-input rate
Qwen3.6-27B: Self-hosted; infrastructure cost varies; Fits in one request; Cached-input rate unavailable

GLM-5.2 has no published cached-input rate, so cached tokens use its listed input rate. Qwen3.6-27B has no comparable published API token rate.
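The three GLM-5.2 figures are consistent with one pair of per-token rates. The sketch below is a worked check, not the page's pricing logic: the rates of $1.40 per million input tokens and $4.40 per million output tokens are not stated on this page. They are back-solved from the chat-turn and repository-review totals and should be treated as an assumption. The `workload_cost` helper is a hypothetical name.

```python
# ASSUMPTION: rates back-solved from the listed totals, not published on this page.
INPUT_USD_PER_M = 1.40
OUTPUT_USD_PER_M = 4.40


def workload_cost(fresh_in, out, cached_in=0, cached_usd_per_m=None):
    """Cost in USD for one request; cached input falls back to the input rate."""
    cached_rate = INPUT_USD_PER_M if cached_usd_per_m is None else cached_usd_per_m
    total = (
        fresh_in * INPUT_USD_PER_M
        + cached_in * cached_rate
        + out * OUTPUT_USD_PER_M
    )
    return total / 1_000_000


chat_turn = workload_cost(fresh_in=1_000, out=500)
repo_review = workload_cost(fresh_in=50_000, out=3_000)
agent_loop = workload_cost(fresh_in=20_000, out=10_000, cached_in=200_000)

print(round(chat_turn, 4))    # 0.0036
print(round(repo_review, 4))  # 0.0832
print(round(agent_loop, 4))   # 0.352
```

Because the cached tokens are billed at the input rate here, a published cached-input rate lower than the input rate would reduce the agent-loop figure.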

Specification differences

Sourced differences are shown directly. Missing facts stay explicit instead of being inferred from a model name or family.

Specification	GLM-5.2	Qwen3.6-27B
Context window	1M	262K
API model ID	Not sourced	Not sourced
Cached-input rate	Not published	No comparable hosted API rate
Documented inputs	Not sourced	Not sourced
Documented outputs	Not sourced	Not sourced
Provider availability	Not sourced	Not sourced
Reasoning profile	Reasoning	Reasoning
Weight access	Open Weight	Open Weight
License	Open Weight	Open Weight
Release date	2026-06-16	2026-04-21

Context window: Maximum documented context; output-token limits may be lower.
Cached-input rate: A missing cached-input rate falls back to the listed input rate only in the stated workload estimate.

If you already use one of these models

Deployment change: The models list different providers, so authentication, endpoint behavior, limits, and feature support may change.
Quality signal: GLM-5.2 has the higher public score estimate, 62.94 versus 52.81, but the 90% score intervals overlap.
Workload cost: A complete, comparable API-rate estimate is available for GLM-5.2 only.
Context tradeoff: GLM-5.2 has the larger documented window (1M).

Run the same representative tasks against both endpoints before changing production traffic.

Self-host vs API cost

Estimates at 50,000 req/day · 1000 tokens/req average.

GLM-5.2

API / mo: $4,350
Self-host / mo: Not listed
Break-even: Not available

No self-hosting estimate is listed for GLM-5.2.

Qwen3.6-27B

API / mo: $0 (no comparable published API token rate)
Self-host / mo: $429
Break-even: Not available
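The $4,350 monthly API figure for GLM-5.2 can be reproduced from the stated volume under three assumptions that this page does not state: a 30-day month, an even split of each 1000-token request into 500 input and 500 output tokens, and rates of $1.40 and $4.40 per million input and output tokens (back-solved from the workload estimates, not published here). Treat the sketch as a plausibility check rather than the page's formula.

```python
REQUESTS_PER_DAY = 50_000   # stated on the page
TOKENS_PER_REQUEST = 1_000  # stated on the page

# ASSUMPTIONS (not stated on the page):
DAYS_PER_MONTH = 30
INPUT_SHARE = 0.5           # half of each request's tokens are input
INPUT_USD_PER_M = 1.40      # back-solved rate
OUTPUT_USD_PER_M = 4.40     # back-solved rate

input_tokens = TOKENS_PER_REQUEST * INPUT_SHARE
output_tokens = TOKENS_PER_REQUEST - input_tokens

# Cost of one request, then scaled to a month of traffic.
per_request = (
    input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
) / 1_000_000
monthly_api_cost = per_request * REQUESTS_PER_DAY * DAYS_PER_MONTH

print(round(monthly_api_cost, 2))  # 4350.0
```

A different input/output split changes the result, so a workload with long prompts and short answers would land below this figure.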


Benchmark evidence

The full public result ledger is available for audit: 46 rows of raw public benchmark evidence, grouped by category.

Agentic

Benchmark	GLM-5.2	Qwen3.6-27B	Reading
Terminal-Bench 2.0	81%	59.3%	GLM-5.2 leads this result
MCP Atlas	76.8%	No result	Not directly comparable
Toolathlon	48.2%	No result	Not directly comparable
ResearchClawBench	20.7%	No result	Not directly comparable
Claw-Eval	No result	72.4%	Not directly comparable
QwenClawBench	No result	53.4%	Not directly comparable
QwenWebBench	No result	1487	Not directly comparable
AndroidWorld	No result	70.3%	Not directly comparable
Gert Labs	No result	54.84%	Not directly comparable

Coding

Benchmark	GLM-5.2	Qwen3.6-27B	Reading
SWE-bench Pro	62.1%	53.5%	GLM-5.2 leads this result
NL2Repo	48.9%	36.2%	GLM-5.2 leads this result
Terminal-Bench 2.0	81.0%	59.3%	GLM-5.2 leads this result
ProgramBench	63.7%	No result	Not directly comparable
cursorBench32	55.0%	No result	Not directly comparable
SWE-bench Verified	No result	77.2%	Not directly comparable
SWE Multilingual	No result	71.3%	Not directly comparable
LiveCodeBench	No result	83.9%	Not directly comparable

Reasoning

Benchmark	GLM-5.2	Qwen3.6-27B	Reading
CritPt	20.9%	No result	Not directly comparable

Knowledge

Benchmark	GLM-5.2	Qwen3.6-27B	Reading
GPQA	91.2%	87.8%	GLM-5.2 leads this result
GPQA-D	91.2%	No result	Not directly comparable
HLE	54.7%	24%	GLM-5.2 leads this result
HLE w/o tools	40.5%	No result	Not directly comparable
MMLU-Pro	No result	86.2%	Not directly comparable
MMLU-Redux	No result	93.5%	Not directly comparable
SuperGPQA	No result	66%	Not directly comparable
C-Eval	No result	91.4%	Not directly comparable

Math

Benchmark	GLM-5.2	Qwen3.6-27B	Reading
AIME26	99.2%	94.1%	GLM-5.2 leads this result
HMMT Nov 2025	94.4%	90.7%	GLM-5.2 leads this result
HMMT Feb 2026	92.5%	84.3%	GLM-5.2 leads this result
MMAnswerBench	91.0%	80.8%	GLM-5.2 leads this result
HMMT Feb 2025	No result	93.8%	Not directly comparable

Multimodal

Benchmark	GLM-5.2	Qwen3.6-27B	Reading
MMMU	No result	82.9%	Not directly comparable
MMMU-Pro	No result	75.8%	Not directly comparable
RealWorldQA	No result	84.1%	Not directly comparable
DynaMath	No result	85.6%	Not directly comparable
MStar	No result	81.4%	Not directly comparable
SimpleVQA	No result	56.1%	Not directly comparable
CharXiv	No result	78.4%	Not directly comparable
CC-OCR	No result	81.2%	Not directly comparable
CountBench	No result	97.8%	Not directly comparable
RefCOCO (avg)	No result	92.5%	Not directly comparable
ERQA	No result	62.5%	Not directly comparable
Video-MME (with subtitle)	No result	87.7%	Not directly comparable
VideoMMMU	No result	84.4%	Not directly comparable
MLVU (M-Avg)	No result	86.6%	Not directly comparable
V*	No result	94.7%	Not directly comparable

Frequently asked questions

Which is better, GLM-5.2 or Qwen3.6-27B?

GLM-5.2 has the higher public score estimate, 62.94 versus 52.81, but the 90% score intervals overlap (47.7–78.2 versus 41.3–64.3), so the higher estimate is not a decisive winner.

Which is better for coding, GLM-5.2 or Qwen3.6-27B?

The current coding averages use different weighted benchmark sets, so BenchLM does not name a winner from them. Read the shared benchmark rows directly and test the models on the same task set.

Which is better for agentic tasks, GLM-5.2 or Qwen3.6-27B?

GLM-5.2 leads the like-for-like agentic comparison, which rests on 1 shared weighted benchmark row.

Which costs less, GLM-5.2 or Qwen3.6-27B?

Qwen3.6-27B has no comparable published API token rate, so this page does not name a universal price winner.

Which has the larger context window, GLM-5.2 or Qwen3.6-27B?

GLM-5.2 has the larger documented context window: 1M, compared with 262K.
