Model comparison

GLM-5.2 vs Grok 4.3

Updated July 31, 2026. Public scores include evidence status and uncertainty. They are not guarantees for a specific workload.


GLM-5.2

Z.AI

62.9/100

Estimated · Public rank #41

90% interval 47.7–78.2

Grok 4.3

xAI

64.2/100

Supported · Public rank #36

90% interval 54.4–73.9

Grok 4.3 has the higher public score estimate, 64.16 versus 62.94, but the 90% score intervals overlap. Treat that as a lead, not a settled winner.
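
The overlap reading above can be checked mechanically. The following is a minimal sketch, assuming the 90% intervals are closed ranges and that "overlap" means the two ranges share at least one point. The `ScoreEstimate` type and both function names are hypothetical and are not part of BenchLM.

```python
from typing import NamedTuple


class ScoreEstimate(NamedTuple):
    name: str
    score: float
    low: float   # lower bound of the 90% interval
    high: float  # upper bound of the 90% interval


def intervals_overlap(a: ScoreEstimate, b: ScoreEstimate) -> bool:
    """Return True when the two closed intervals share at least one point."""
    return a.low <= b.high and b.low <= a.high


def headline(a: ScoreEstimate, b: ScoreEstimate) -> str:
    """Name the higher estimate and say whether the intervals separate it."""
    leader, other = (a, b) if a.score >= b.score else (b, a)
    if intervals_overlap(a, b):
        return f"{leader.name} leads {other.name}, but the intervals overlap: a lead, not a settled winner"
    return f"{leader.name} leads {other.name}, and the intervals do not overlap"


# Figures quoted on this page.
glm = ScoreEstimate("GLM-5.2", 62.94, 47.7, 78.2)
grok = ScoreEstimate("Grok 4.3", 64.16, 54.4, 73.9)

print(headline(glm, grok))
```

Here the Grok 4.3 interval (54.4 to 73.9) sits entirely inside the GLM-5.2 interval (47.7 to 78.2), so the 1.22-point gap is small relative to either range.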

1 result is shared. Category rows based on different benchmark sets are marked directional and do not name a winner.

Which one for your work

Recommendations appear only when a shared evidence basis or an explicit operating constraint supports the call. Secondary and unsupported use cases stay disclosed below the initial list.

Chat turn cost
1K fresh input + 500 output tokens
Grok 4.3
Grok 4.3 has the lower estimated token cost for this stated workload. Costs use the listed standard API rates.
Confidence: listed-rates
Cache-heavy agent loop cost
200K cached + 20K fresh input + 10K output tokens
Grok 4.3
Grok 4.3 has the lower estimated token cost for this stated workload. GLM-5.2 has no published cached-input rate, so cached tokens use its listed input rate.
Confidence: rate-fallback

Secondary and unsupported calls

Repository review cost
50K fresh input + 3K output tokens
Grok 4.3
Grok 4.3 has the lower estimated token cost for this stated workload. Costs use the listed standard API rates.
Confidence: listed-rates
Coding work
Code generation, repair, and software-engineering tasks
Not enough matched evidence
No shared weighted benchmark basis supports a winner.
Confidence: limited
Agentic work
Tool use, computer use, and multi-step task completion
Not enough matched evidence
No shared weighted benchmark basis supports a winner.
Confidence: limited
Long documents
Prompts that approach the documented context limit
No clear pick
The documented context windows are equal.
Confidence: documented

What is actually comparable

Shared results can support a head-to-head reading. Results present for only one model describe coverage, not superiority.

Shared results: 1
GLM-5.2 only: 17
Grok 4.3 only: 1
Like-for-like categories: 0 / 8
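
The three coverage counts can be reproduced from the raw benchmark rows listed in the Benchmark evidence section of this page. This is a minimal sketch: the `ROWS` layout and the `coverage` helper are hypothetical, and `None` stands for a result the page does not list. Rows are keyed by category and benchmark because Terminal-Bench 2.0 appears under both Agentic and Coding.

```python
from typing import Optional

Row = tuple[str, str, Optional[float], Optional[float]]

# (category, benchmark, GLM-5.2 %, Grok 4.3 %); None = no public result listed
ROWS: list[Row] = [
    ("Agentic", "Terminal-Bench 2.0", 81.0, None),
    ("Agentic", "MCP Atlas", 76.8, None),
    ("Agentic", "Toolathlon", 48.2, None),
    ("Agentic", "ResearchClawBench", 20.7, 12.4),
    ("Agentic", "Gert Labs", None, 43.86),
    ("Coding", "SWE-bench Pro", 62.1, None),
    ("Coding", "NL2Repo", 48.9, None),
    ("Coding", "Terminal-Bench 2.0", 81.0, None),
    ("Coding", "ProgramBench", 63.7, None),
    ("Coding", "cursorBench32", 55.0, None),
    ("Reasoning", "CritPt", 20.9, None),
    ("Knowledge", "GPQA", 91.2, None),
    ("Knowledge", "GPQA-D", 91.2, None),
    ("Knowledge", "HLE", 54.7, None),
    ("Knowledge", "HLE w/o tools", 40.5, None),
    ("Math", "AIME26", 99.2, None),
    ("Math", "HMMT Nov 2025", 94.4, None),
    ("Math", "HMMT Feb 2026", 92.5, None),
    ("Math", "MMAnswerBench", 91.0, None),
]


def coverage(rows: list[Row]) -> dict[str, int]:
    """Count rows with both results, only the first, or only the second."""
    counts = {"shared": 0, "glm_only": 0, "grok_only": 0}
    for _category, _benchmark, glm, grok in rows:
        if glm is not None and grok is not None:
            counts["shared"] += 1
        elif glm is not None:
            counts["glm_only"] += 1
        elif grok is not None:
            counts["grok_only"] += 1
    return counts


print(len(ROWS), coverage(ROWS))
```

Only the shared row (ResearchClawBench) supports a head-to-head reading; the other 18 rows describe which model has been measured, not which model is better.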

Category results, on a stated basis

Each row states whether both averages use the same weighted benchmark set. Directional and not-comparable rows remain visible, but they never receive a winner in this template.

Category averages with the server-provided evidence basis for GLM-5.2 and Grok 4.3
Category	GLM-5.2	Grok 4.3	Weighted basis	Reading
Agentic	81.0	Not measured	1 vs 0 rows	Not comparable
Coding	62.1	Not measured	1 vs 0 rows	Not comparable
Reasoning	Not measured	Not measured	0 vs 0 rows	Not comparable
Knowledge	59.6	Not measured	2 vs 0 rows	Not comparable
Math	95.9	Not measured	2 vs 0 rows	Not comparable
Multilingual	Not measured	Not measured	0 vs 0 rows	Not comparable
Multimodal	Not measured	Not measured	0 vs 0 rows	Not comparable
Instruction following	Not measured	Not measured	0 vs 0 rows	Not comparable
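
The reading column can be thought of as a three-way rule on the two weighted benchmark sets. The sketch below is an assumption about how such a rule might look, not BenchLM's published logic: an empty basis on either side gives "Not comparable", identical sets give "Like-for-like", and differing non-empty sets give "Directional". The function name and the benchmark names `bench-x` and `bench-y` are hypothetical.

```python
def reading(basis_a: frozenset[str], basis_b: frozenset[str]) -> str:
    """Classify a category row from the two weighted benchmark sets (assumed rule)."""
    if not basis_a or not basis_b:
        return "Not comparable"   # at least one model has no weighted rows
    if basis_a == basis_b:
        return "Like-for-like"    # same benchmark set on both sides
    return "Directional"          # both measured, but on different sets


# Hypothetical benchmark names, used only to exercise the three branches.
print(reading(frozenset({"bench-x"}), frozenset()))             # Not comparable
print(reading(frozenset({"bench-x"}), frozenset({"bench-x"})))  # Like-for-like
print(reading(frozenset({"bench-x"}), frozenset({"bench-y"})))  # Directional

# Weighted row counts from the category table: (GLM-5.2, Grok 4.3).
ROW_COUNTS = {
    "Agentic": (1, 0),
    "Coding": (1, 0),
    "Reasoning": (0, 0),
    "Knowledge": (2, 0),
    "Math": (2, 0),
    "Multilingual": (0, 0),
    "Multimodal": (0, 0),
    "Instruction following": (0, 0),
}

# A category can only be like-for-like if both models have weighted rows.
candidates = [name for name, (a, b) in ROW_COUNTS.items() if a > 0 and b > 0]
print(f"Like-for-like candidates: {len(candidates)} / {len(ROW_COUNTS)}")
```

Because Grok 4.3 has zero weighted rows in every category, no category can qualify, which is consistent with the stated 0 / 8.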

Shape of the matched evidence

Only shared public evidence is shown. Sparse evidence stays a ruled list rather than being closed into a radar shape.

A shared-evidence shape is not available.

BenchLM does not draw a radar or infer missing axes when the matched evidence is too sparse.

What each workload costs

Three fixed token mixes turn per-token rates into comparable decisions. Each scenario states context fit and whether cached input had to fall back to the published list-input rate.

Chat turn

1K fresh input + 500 output tokens

GLM-5.2: $0.0036; Fits in one request
Grok 4.3: $0.0025; Fits in one request

Grok 4.3 has the lower modeled cost

Costs use the listed standard API rates.

Repository review

50K fresh input + 3K output tokens

GLM-5.2: $0.0832; Fits in one request
Grok 4.3: $0.07; Fits in one request

Grok 4.3 has the lower modeled cost

Costs use the listed standard API rates.

Cache-heavy agent loop

200K cached + 20K fresh input + 10K output tokens

GLM-5.2: $0.352; Fits in one request; Cached input priced at the published list-input rate
Grok 4.3: $0.09; Fits in one request

Grok 4.3 has the lower modeled cost

GLM-5.2 has no published cached-input rate, so cached tokens use its listed input rate.
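
The three scenarios above can be sketched as a small cost function. The page lists only one per-token rate directly ($0.2 per 1M cached input tokens for Grok 4.3). The input and output rates in this sketch are assumptions back-solved from the stated totals, not figures quoted by the page, and the `Rates`, `Workload`, and `cost` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rates:
    input_per_m: float                     # USD per 1M fresh input tokens
    output_per_m: float                    # USD per 1M output tokens
    cached_per_m: Optional[float] = None   # None = no published cached-input rate


@dataclass(frozen=True)
class Workload:
    name: str
    cached: int
    fresh: int
    output: int


def cost(rates: Rates, w: Workload) -> tuple[float, bool]:
    """Return (USD cost, used_fallback). Cached tokens fall back to the input rate."""
    used_fallback = rates.cached_per_m is None and w.cached > 0
    cached_rate = rates.input_per_m if rates.cached_per_m is None else rates.cached_per_m
    total = (
        w.cached * cached_rate
        + w.fresh * rates.input_per_m
        + w.output * rates.output_per_m
    ) / 1_000_000
    return round(total, 4), used_fallback


# ASSUMPTION: input/output rates are back-solved from the page's totals.
GLM = Rates(input_per_m=1.40, output_per_m=4.40)
GROK = Rates(input_per_m=1.25, output_per_m=2.50, cached_per_m=0.20)

WORKLOADS = [
    Workload("Chat turn", cached=0, fresh=1_000, output=500),
    Workload("Repository review", cached=0, fresh=50_000, output=3_000),
    Workload("Cache-heavy agent loop", cached=200_000, fresh=20_000, output=10_000),
]

for w in WORKLOADS:
    glm_cost, glm_fallback = cost(GLM, w)
    grok_cost, _ = cost(GROK, w)
    note = " (cached input priced at list-input rate)" if glm_fallback else ""
    print(f"{w.name}: GLM-5.2 ${glm_cost}{note} vs Grok 4.3 ${grok_cost}")
```

Under these assumed rates, most of the agent-loop gap comes from the fallback: GLM-5.2's 200K cached tokens are charged at its full input rate, while Grok 4.3's are charged at its lower cached rate.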

Specification differences

Sourced differences are shown directly. Missing facts stay explicit instead of being inferred from a model name or family.

Specification	GLM-5.2	Grok 4.3
Context window	1M	1M
API model ID	Not sourced	Not sourced
Cached-input rate	Not published	$0.2 per 1M cached input tokens
Documented inputs	Not sourced	Not sourced
Documented outputs	Not sourced	Not sourced
Provider availability	Not sourced	Not sourced
Reasoning profile	Reasoning	Reasoning
Weight access	Open Weight	Proprietary
License	Open Weight	Proprietary
Release date	2026-06-16	2026-04-30

Context window: Maximum documented context; output-token limits may be lower.
Cached-input rate: A missing cached-input rate falls back to the listed input rate only in the stated workload estimate.

If you already use one of these models

Deployment change: The models list different providers, so authentication, endpoint behavior, limits, and feature support may change.
Quality signal: Grok 4.3 has the higher public score estimate, 64.16 versus 62.94, but the 90% score intervals overlap.
Workload cost: Repository review costs $0.0832 on GLM-5.2 vs $0.07 on Grok 4.3. The cache-heavy agent loop costs $0.352 on GLM-5.2 vs $0.09 on Grok 4.3.
Context tradeoff: Both models list 1M.

Run the same representative tasks against both endpoints before changing production traffic.

Benchmark evidence

The full public result ledger is available for audit without forcing a wide desktop table onto a phone.

Raw public benchmark evidence: 19 rows

Agentic

Benchmark	GLM-5.2	Grok 4.3	Reading
Terminal-Bench 2.0	81%	Not listed	Not directly comparable
MCP Atlas	76.8%	Not listed	Not directly comparable
Toolathlon	48.2%	Not listed	Not directly comparable
ResearchClawBench	20.7%	12.4%	Shared source; GLM-5.2 leads this result
Gert Labs	Not listed	43.86%	Not directly comparable

Coding

Benchmark	GLM-5.2	Grok 4.3	Reading
SWE-bench Pro	62.1%	Not listed	Not directly comparable
NL2Repo	48.9%	Not listed	Not directly comparable
Terminal-Bench 2.0	81.0%	Not listed	Not directly comparable
ProgramBench	63.7%	Not listed	Not directly comparable
cursorBench32	55.0%	Not listed	Not directly comparable

Reasoning

Benchmark	GLM-5.2	Grok 4.3	Reading
CritPt	20.9%	Not listed	Not directly comparable

Knowledge

Benchmark	GLM-5.2	Grok 4.3	Reading
GPQA	91.2%	Not listed	Not directly comparable
GPQA-D	91.2%	Not listed	Not directly comparable
HLE	54.7%	Not listed	Not directly comparable
HLE w/o tools	40.5%	Not listed	Not directly comparable

Math

Benchmark	GLM-5.2	Grok 4.3	Reading
AIME26	99.2%	Not listed	Not directly comparable
HMMT Nov 2025	94.4%	Not listed	Not directly comparable
HMMT Feb 2026	92.5%	Not listed	Not directly comparable
MMAnswerBench	91.0%	Not listed	Not directly comparable

Frequently asked questions

Which is better, GLM-5.2 or Grok 4.3?

Grok 4.3 has the higher public score estimate, 64.16 versus 62.94, but the 90% score intervals overlap, so the higher estimate is a lead rather than a decisive win.

Which is better for coding, GLM-5.2 or Grok 4.3?

The published evidence does not provide a shared weighted coding basis for both models, so BenchLM does not name a coding winner.

Which is better for agentic tasks, GLM-5.2 or Grok 4.3?

The published evidence does not provide a shared weighted agentic basis for both models, so BenchLM does not name an agentic winner.

Which costs less, GLM-5.2 or Grok 4.3?

For the stated presets, chat costs $0.0036 on GLM-5.2 and $0.0025 on Grok 4.3; repository review costs $0.0832 and $0.07; the cache-heavy agent loop costs $0.352 and $0.09. GLM-5.2 has no published cached-input rate, so cached tokens use its listed input rate.

Which has the larger context window, GLM-5.2 or Grok 4.3?

Both models list the same context window, 1M.
