Model comparison

GPT-5.2 vs GPT-5.4 mini

Updated July 30, 2026. Public scores include evidence status and uncertainty. They are not guarantees for a specific workload.

GPT-5.2

OpenAI

57.6/100

Estimated · Public rank #71

90% interval 49.4–65.9

GPT-5.4 mini

OpenAI

55.8/100

Estimated · Public rank #81

90% interval 44.3–67.3

GPT-5.2 has the higher public score estimate, 57.62 versus 55.79, but the 90% score intervals overlap. Treat that as a lead, not a settled winner.
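The overlap reading above can be checked mechanically. The sketch below is a minimal illustration using the scores and 90% intervals stated on this page; the `ScoreEstimate` type and the rule "a lead is settled only when the intervals do not overlap" are assumptions made for the example, not the site's documented method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreEstimate:
    # Hypothetical container; field names are illustrative only.
    name: str
    score: float
    low: float   # lower bound of the 90% interval
    high: float  # upper bound of the 90% interval


def intervals_overlap(a: ScoreEstimate, b: ScoreEstimate) -> bool:
    """Two closed intervals overlap when each starts before the other ends."""
    return a.low <= b.high and b.low <= a.high


gpt_52 = ScoreEstimate("GPT-5.2", 57.62, 49.4, 65.9)
gpt_54_mini = ScoreEstimate("GPT-5.4 mini", 55.79, 44.3, 67.3)

leader = max((gpt_52, gpt_54_mini), key=lambda m: m.score)
# Assumed rule: only call the lead settled if the intervals are disjoint.
settled = not intervals_overlap(gpt_52, gpt_54_mini)
print(f"{leader.name} leads; {'settled' if settled else 'not settled'}")
```

Interval overlap is a coarse screen rather than a formal significance test, which is consistent with reading the higher estimate as a lead only.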

6 results are shared. Category rows based on different benchmark sets are marked directional and do not name a winner.

Which one for your work

Recommendations appear only when a shared evidence basis or an explicit operating constraint supports the call. Secondary and unsupported use cases stay disclosed below the initial list.

Chat turn cost
1K fresh input + 500 output tokens
Pick: GPT-5.4 mini
GPT-5.4 mini has the lower estimated token cost for this stated workload. Costs use the listed standard API rates.
Confidence: listed-rates

Cache-heavy agent loop cost
200K cached + 20K fresh input + 10K output tokens
Pick: GPT-5.4 mini
GPT-5.4 mini has the lower estimated token cost for this stated workload. GPT-5.2 has no published cached-input rate, so cached tokens use its listed input rate.
Confidence: rate-fallback

Secondary and unsupported calls

Repository review cost
50K fresh input + 3K output tokens
Pick: GPT-5.4 mini
GPT-5.4 mini has the lower estimated token cost for this stated workload. Costs use the listed standard API rates.
Confidence: listed-rates

Coding work
Code generation, repair, and software-engineering tasks
Not enough matched evidence
No shared weighted benchmark basis supports a winner.
Confidence: limited

Agentic work
Tool use, computer use, and multi-step task completion
Not enough matched evidence
The category averages use different weighted benchmark sets, so they are directional rather than like-for-like.
Confidence: limited

Long documents
Prompts that approach the documented context limit
No clear pick
The documented context windows are equal.
Confidence: documented

What is actually comparable

Shared results can support a head-to-head reading. Results present for only one model describe coverage, not superiority.

Shared results: 6
GPT-5.2 only: 9
GPT-5.4 mini only: 8
Like-for-like categories: 1 / 8

3 categories use different evidence sets. Those rows remain visible for coverage context but do not name a winner.
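The coverage counts above follow from plain set operations over the benchmark names in the raw ledger under Benchmark evidence. The sketch below reproduces them; the variable names are illustrative, and the two sets were transcribed by hand from that ledger.

```python
# Benchmarks with a listed result for each model, transcribed from the ledger.
gpt_52_results = {
    "BrowseComp", "OSWorld-Verified", "Gert Labs", "JobBench",
    "SWE-bench Verified", "SWE-bench Pro", "Vibe Code Bench",
    "ARC-AGI-2", "GPQA",
    "FrontierMath v2 (Tiers 1-3)", "FrontierMath v2 (Tier 4)",
    "MMMU-Pro", "MathVision", "CharXiv", "V*",
}
gpt_54_mini_results = {
    "OSWorld-Verified", "Terminal-Bench 2.0", "MCP Atlas", "Toolathlon",
    "τ²-bench results", "Vibe Code Bench", "FrontierCode 1.1 Main",
    "GPQA", "HLE", "HLE w/o tools",
    "FrontierMath v2 (Tiers 1-3)", "FrontierMath v2 (Tier 4)",
    "MMMU-Pro", "MMMU-Pro w/ Python",
}

shared = gpt_52_results & gpt_54_mini_results          # head-to-head candidates
only_gpt_52 = gpt_52_results - gpt_54_mini_results     # coverage, not superiority
only_gpt_54_mini = gpt_54_mini_results - gpt_52_results

print("Shared results:", len(shared))
print("GPT-5.2 only:", len(only_gpt_52))
print("GPT-5.4 mini only:", len(only_gpt_54_mini))
print("Total ledger rows:", len(gpt_52_results | gpt_54_mini_results))
```

Only the `shared` set can support a head-to-head reading; the two difference sets describe what each model has been measured on.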

Category results, on a stated basis

Each row states whether both averages use the same weighted benchmark set. Directional and not-comparable rows remain visible, but they never receive a winner in this template.

Math

Like-for-like

GPT-5.2: 35.2
GPT-5.4 mini: 21.7
Weighted basis: 2 vs 2 rows
Reading: GPT-5.2 leads

Agentic

Directional only

GPT-5.2: 55.7
GPT-5.4 mini: 65.7
Weighted basis: 2 vs 2 rows
Reading: Directional only

Knowledge

Directional only

GPT-5.2: 92.4
GPT-5.4 mini: 47.8
Weighted basis: 1 vs 2 rows
Reading: Directional only

Multimodal

Directional only

GPT-5.2: 80.4
GPT-5.4 mini: 76.6
Weighted basis: 2 vs 1 rows
Reading: Directional only

Coding

Not comparable

GPT-5.2: 70.6
GPT-5.4 mini: Not measured
Weighted basis: 2 vs 0 rows
Reading: Not comparable

Reasoning

Not comparable

GPT-5.2: 52.9
GPT-5.4 mini: Not measured
Weighted basis: 1 vs 0 rows
Reading: Not comparable

Multilingual

Not comparable

GPT-5.2: Not measured
GPT-5.4 mini: Not measured
Weighted basis: 0 vs 0 rows
Reading: Not comparable

Instruction following

Not comparable

GPT-5.2: Not measured
GPT-5.4 mini: Not measured
Weighted basis: 0 vs 0 rows
Reading: Not comparable

Category averages with the server-provided evidence basis for GPT-5.2 and GPT-5.4 mini
Category	GPT-5.2	GPT-5.4 mini	Weighted basis	Reading
Math	35.2	21.7	Like-for-like, 2 vs 2 rows	GPT-5.2 leads
Agentic	55.7	65.7	Directional only, 2 vs 2 rows	Directional only
Knowledge	92.4	47.8	Directional only, 1 vs 2 rows	Directional only
Multimodal	80.4	76.6	Directional only, 2 vs 1 rows	Directional only
Coding	70.6	Not measured	Not comparable, 2 vs 0 rows	Not comparable
Reasoning	52.9	Not measured	Not comparable, 1 vs 0 rows	Not comparable
Multilingual	Not measured	Not measured	Not comparable, 0 vs 0 rows	Not comparable
Instruction following	Not measured	Not measured	Not comparable, 0 vs 0 rows	Not comparable
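The three basis labels in the category rows can be read as a small decision rule. The sketch below is one plausible reconstruction, not the site's published logic: it assumes a row is like-for-like only when both averages draw on the same set of weighted benchmarks, not comparable when either side has no weighted rows, and directional otherwise. The function names and the placeholder row names are hypothetical, and the page does not say which benchmarks are weighted in each category.

```python
def classify_basis(rows_a, rows_b):
    """Label a category from the weighted benchmark sets behind each average."""
    if not rows_a or not rows_b:
        return "Not comparable"      # at least one model has no weighted rows
    if set(rows_a) == set(rows_b):
        return "Like-for-like"       # identical benchmark sets
    return "Directional only"        # both measured, but on different sets


def reading(basis, name_a, avg_a, name_b, avg_b):
    """Name a leader only for like-for-like rows."""
    if basis != "Like-for-like":
        return basis
    if avg_a == avg_b:
        return "Tied"                # assumed label; the page shows no tie
    return f"{name_a if avg_a > avg_b else name_b} leads"


# Assumption: the two Math rows are the two FrontierMath v2 results.
math_rows = {"FrontierMath v2 (Tiers 1-3)", "FrontierMath v2 (Tier 4)"}
math = classify_basis(math_rows, math_rows)
print("Math:", reading(math, "GPT-5.2", 35.2, "GPT-5.4 mini", 21.7))

# Hypothetical placeholder names: 2 vs 2 rows drawn from different sets.
agentic = classify_basis({"row_a1", "row_a2"}, {"row_b1", "row_b2"})
print("Agentic:", reading(agentic, "GPT-5.2", 55.7, "GPT-5.4 mini", 65.7))

# Hypothetical placeholder names: 2 vs 0 rows.
coding = classify_basis({"row_c1", "row_c2"}, set())
print("Coding:", reading(coding, "GPT-5.2", 70.6, "GPT-5.4 mini", None))
```

Under this rule a row count such as 2 vs 2 is not enough to earn a winner: the Agentic row has equal counts but different sets, so it stays directional.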

Shape of the matched evidence

Only shared public evidence is shown. There are too few matched category axes to support a radar chart, so the shared benchmark results appear as a ruled list instead. Positions use each benchmark’s normalized display scale when available.

OSWorld-Verified
Agentic
GPT-5.2: 47.3%; GPT-5.4 mini: 72.1%; Normalized gap: 24.8; Separate sources
FrontierMath v2 (Tier 4)
Math
GPT-5.2: 18.800%; GPT-5.4 mini: 2.080%; Normalized gap: 16.7; Shared source
FrontierMath v2 (Tiers 1-3)
Math
GPT-5.2: 40.700%; GPT-5.4 mini: 28.280%; Normalized gap: 12.4; Shared source
GPQA
Knowledge
GPT-5.2: 92.4%; GPT-5.4 mini: 88%; Normalized gap: 4.4; Separate sources
MMMU-Pro
Multimodal
GPT-5.2: 79.5%; GPT-5.4 mini: 76.6%; Normalized gap: 2.9; Separate sources
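The normalized gaps in the list above are consistent with a simple absolute difference between the two percentage scores, rounded to one decimal place. The sketch below reproduces them under that assumption; the page does not state the formula, and benchmarks that are not already on a 0 to 100 scale would need their own normalization first.

```python
# (benchmark, category, GPT-5.2 score, GPT-5.4 mini score), all in percent.
shared_results = [
    ("OSWorld-Verified", "Agentic", 47.3, 72.1),
    ("FrontierMath v2 (Tier 4)", "Math", 18.8, 2.08),
    ("FrontierMath v2 (Tiers 1-3)", "Math", 40.7, 28.28),
    ("GPQA", "Knowledge", 92.4, 88.0),
    ("MMMU-Pro", "Multimodal", 79.5, 76.6),
]


def normalized_gap(score_a: float, score_b: float) -> float:
    """Assumed definition: absolute difference on a shared 0-100 scale."""
    return round(abs(score_a - score_b), 1)


gaps = [
    (name, category, normalized_gap(a, b))
    for name, category, a, b in shared_results
]
# Largest gap first, matching the order of the ruled list.
gaps.sort(key=lambda row: row[2], reverse=True)

for name, category, gap in gaps:
    print(f"{name} ({category}): gap {gap}")
```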

What each workload costs

Three fixed token mixes turn per-token rates into comparable decisions. Each scenario states context fit and whether cached input had to fall back to the published list-input rate.

Chat turn

1K fresh input + 500 output tokens

GPT-5.2: $0.00875; Fits in one request
GPT-5.4 mini: $0.003; Fits in one request

GPT-5.4 mini has the lower modeled cost

Costs use the listed standard API rates.

Repository review

50K fresh input + 3K output tokens

GPT-5.2: $0.1295; Fits in one request
GPT-5.4 mini: $0.051; Fits in one request

GPT-5.4 mini has the lower modeled cost

Costs use the listed standard API rates.

Cache-heavy agent loop

200K cached + 20K fresh input + 10K output tokens

GPT-5.2: $0.525; Fits in one request; Cached input priced at the published list-input rate
GPT-5.4 mini: $0.075; Fits in one request

GPT-5.4 mini has the lower modeled cost

GPT-5.2 has no published cached-input rate, so cached tokens use its listed input rate.
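The three workload totals can be recomputed from per-token rates, including the cached-input fallback. In the sketch below, the only rate taken directly from this page is the GPT-5.4 mini cached-input rate of $0.075 per 1M tokens. The input and output rates are not listed here; they were back-solved from the stated totals and should be treated as assumptions, as should the `Rates` type and `workload_cost` function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rates:
    # USD per 1M tokens. Input/output values used below are ASSUMED
    # (back-solved from the page's totals), not quoted from a price list.
    input_per_m: float
    output_per_m: float
    cached_per_m: Optional[float] = None  # None means no published cached rate


def workload_cost(rates: Rates, fresh_in: int, out: int, cached_in: int = 0) -> float:
    """Cost in USD; cached tokens fall back to the input rate when unpublished."""
    cached_rate = rates.cached_per_m if rates.cached_per_m is not None else rates.input_per_m
    total = (
        fresh_in * rates.input_per_m
        + cached_in * cached_rate
        + out * rates.output_per_m
    )
    return total / 1_000_000


gpt_52 = Rates(input_per_m=1.75, output_per_m=14.00)                     # assumed
gpt_54_mini = Rates(input_per_m=0.75, output_per_m=4.50, cached_per_m=0.075)

workloads = {
    "Chat turn": dict(fresh_in=1_000, out=500),
    "Repository review": dict(fresh_in=50_000, out=3_000),
    "Cache-heavy agent loop": dict(fresh_in=20_000, out=10_000, cached_in=200_000),
}

for label, mix in workloads.items():
    a = workload_cost(gpt_52, **mix)
    b = workload_cost(gpt_54_mini, **mix)
    print(f"{label}: GPT-5.2 ${a:.5f} vs GPT-5.4 mini ${b:.5f}")
```

Because the fallback prices all 200K cached tokens at the full input rate, the GPT-5.2 figure for the agent loop is an upper-bound style estimate rather than a quoted price.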

Specification differences

Sourced differences are shown directly. Missing facts stay explicit instead of being inferred from a model name or family.

Context window
Maximum documented context; output-token limits may be lower.
GPT-5.2: 400K
GPT-5.4 mini: 400K
Source: OpenAI GPT-5.4 mini model documentation

API model ID
GPT-5.2: Not sourced
GPT-5.4 mini: gpt-5.4-mini
Source: OpenAI GPT-5.4 mini model documentation

Cached-input rate
A missing cached-input rate falls back to the listed input rate only in the stated workload estimate.
GPT-5.2: Not published
GPT-5.4 mini: $0.075 per 1M cached input tokens
Source: OpenAI pricing

Documented inputs
GPT-5.2: Not sourced
GPT-5.4 mini: text, image
Source: OpenAI model catalog

Documented outputs
GPT-5.2: Not sourced
GPT-5.4 mini: text
Source: OpenAI model catalog

Provider availability
GPT-5.2: Not sourced
GPT-5.4 mini: Generally Available · OpenAI Responses API
Source: OpenAI model catalog

Reasoning profile
GPT-5.2: Reasoning
GPT-5.4 mini: Reasoning

Weight access
GPT-5.2: Proprietary
GPT-5.4 mini: Proprietary

License
GPT-5.2: Proprietary
GPT-5.4 mini: Proprietary

Release date
GPT-5.2: 2025-12-11
GPT-5.4 mini: 2026-03-17

If you already use one of these models

Deployment change: Both entries list OpenAI as the provider. Confirm endpoint, model ID, limits, and feature support before switching.
Quality signal: GPT-5.2 has the higher public score estimate, 57.62 versus 55.79, but the 90% score intervals overlap.
Workload cost: Repository review: $0.1295 (GPT-5.2) vs $0.051 (GPT-5.4 mini). Cache-heavy agent loop: $0.525 (GPT-5.2) vs $0.075 (GPT-5.4 mini).
Context tradeoff: Both models list 400K.

Run the same representative tasks against both endpoints before changing production traffic.

Benchmark evidence

The full public result ledger is available for audit without forcing a wide desktop table onto a phone.

Raw public benchmark evidence: 23 rows

Agentic

BrowseComp
GPT-5.2: 65.8%; GPT-5.4 mini: no result listed; Not directly comparable
OSWorld-Verified
GPT-5.2: 47.3%; GPT-5.4 mini: 72.1%; Separate sources; GPT-5.4 mini leads this result
Gert Labs
GPT-5.2: 46.54%; GPT-5.4 mini: no result listed; Not directly comparable
JobBench
GPT-5.2: 34.3%; GPT-5.4 mini: no result listed; Not directly comparable
Terminal-Bench 2.0
GPT-5.2: no result listed; GPT-5.4 mini: 60%; Not directly comparable
MCP Atlas
GPT-5.2: no result listed; GPT-5.4 mini: 57.7%; Not directly comparable
Toolathlon
GPT-5.2: no result listed; GPT-5.4 mini: 42.9%; Not directly comparable
τ²-bench results
GPT-5.2: no result listed; GPT-5.4 mini: 93.4%; Not directly comparable

Coding

SWE-bench Verified
GPT-5.2: 80%; GPT-5.4 mini: no result listed; Not directly comparable
SWE-bench Pro
GPT-5.2: 55.6%; GPT-5.4 mini: no result listed; Not directly comparable
Vibe Code Bench
GPT-5.2: 53.50%; GPT-5.4 mini: 47.97%; Shared source; GPT-5.2 leads this result
FrontierCode 1.1 Main
GPT-5.2: no result listed; GPT-5.4 mini: 27.0%; Not directly comparable

Reasoning

ARC-AGI-2
GPT-5.2: 52.9%; GPT-5.4 mini: no result listed; Not directly comparable

Knowledge

GPQA
GPT-5.2: 92.4%; GPT-5.4 mini: 88%; Separate sources; GPT-5.2 leads this result
HLE
GPT-5.2: no result listed; GPT-5.4 mini: 41.5%; Not directly comparable
HLE w/o tools
GPT-5.2: no result listed; GPT-5.4 mini: 28.2%; Not directly comparable

Math

FrontierMath v2 (Tiers 1-3)
GPT-5.2: 40.700%; GPT-5.4 mini: 28.280%; Shared source; GPT-5.2 leads this result
FrontierMath v2 (Tier 4)
GPT-5.2: 18.800%; GPT-5.4 mini: 2.080%; Shared source; GPT-5.2 leads this result

Multimodal

MMMU-Pro
GPT-5.2: 79.5%; GPT-5.4 mini: 76.6%; Separate sources; GPT-5.2 leads this result
MathVision
GPT-5.2: 83.0%; GPT-5.4 mini: no result listed; Not directly comparable
CharXiv
GPT-5.2: 82.1%; GPT-5.4 mini: no result listed; Not directly comparable
V*
GPT-5.2: 75.9%; GPT-5.4 mini: no result listed; Not directly comparable
MMMU-Pro w/ Python
GPT-5.2: no result listed; GPT-5.4 mini: 78%; Not directly comparable

Frequently asked questions

Which is better, GPT-5.2 or GPT-5.4 mini?

GPT-5.2 has the higher public score estimate, 57.62 versus 55.79, but the 90% score intervals overlap, so the higher estimate does not make it a decisive winner.

Which is better for coding, GPT-5.2 or GPT-5.4 mini?

The published evidence does not provide a shared weighted coding basis for both models, so BenchLM does not name a coding winner.

Which is better for agentic tasks, GPT-5.2 or GPT-5.4 mini?

The current agentic averages use different weighted benchmark sets, so BenchLM does not name a winner from them. Read the shared benchmark rows directly and test the models on the same task set.

Which costs less, GPT-5.2 or GPT-5.4 mini?

For the stated presets, chat costs $0.00875 on GPT-5.2 and $0.003 on GPT-5.4 mini; repository review costs $0.1295 and $0.051; the cache-heavy agent loop costs $0.525 and $0.075. GPT-5.2 has no published cached-input rate, so cached tokens use its listed input rate.

Which has the larger context window, GPT-5.2 or GPT-5.4 mini?

Both models list the same context window, 400K.
