Superseded. xAI has newer models in this line: Grok 4.3, Grok 4.5

Model profile · xAI

Grok 4.20

Name: Grok 4.20
Author: xAI

Current · Released Mar 10, 2026 · Proprietary · Reasoning · 2M context

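Grok 4.20 scores 53.9 out of 100 and ranks #97 of 215. This profile shows 18 source-displayable benchmark rows. Its lowest rank number among eligible categories is Multimodal & Grounded at #28, but that is #28 of only 32 eligible models (13th percentile); by percentile, its strongest eligible category is Agentic at #52 of 129 (60th percentile). API pricing is $2 input and $6 output per million tokens.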

Data as of July 28, 2026 · How the score is built


Strongest published evidence

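Multimodal & Grounded ranks #28 of 32 eligible models (13th percentile, category score 35.9). The low rank number reflects a small cohort, so it does not by itself show strength in screenshots, documents, charts, and grounded multimodal workflows. By percentile, Agentic is the strongest eligible category at #52 of 129 (60th percentile).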

Validate before choosing

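18 published rows fill only 18 of 369 tracked benchmark slots. Coding has the highest rank number among eligible categories at #86 of 130 (34th percentile); by percentile, Multimodal & Grounded is lowest at the 13th.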

Decision snapshot

Each value carries a field reference instead of floating alone. Markers compare this model with the current ranked and priced catalog; they are not absolute quality thresholds.

Capability

53.9/100

field median 57.2

#97 of 215 ranked models

Price

$2 input / $6 output

input median $1

blended $4

Speed

233 tok/s

field median 107 tok/s

First token 10.33 s

Context

2M tokens

field median 256,000

Reported for this model; direct source link not stored

Capability shape

Each axis shows percentile within that category’s eligible cohort. The comparison outline is the median of the six nearest public-score peers; a collapsed vertex means the category is not rank-eligible.

Agentic: 60th percentile
Coding: 34th percentile
Reasoning: Not eligible
Knowledge: Not eligible
Math: Not eligible
Multilingual: Not eligible
Multimodal: 13th percentile
Instruction following: Not eligible

The dashed outline is the median of the 6 nearest peers.

Legend: Top decile · Top quartile · Mid-field · Not eligible
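
The page does not state how a rank becomes a percentile. The sketch below assumes the percentile is the share of the eligible cohort ranked below the model, rounded half up; that formula is an inference, chosen because it reproduces the three percentiles published in this profile. The function name `rank_to_percentile` is hypothetical.

```python
import math


def rank_to_percentile(rank: int, cohort_size: int) -> int:
    """Share of the cohort ranked below this model, rounded half up.

    Assumed formula; it is not documented by the source page.
    """
    if not 1 <= rank <= cohort_size:
        raise ValueError("rank must be between 1 and cohort_size")
    share_below = (cohort_size - rank) / cohort_size * 100
    # Python's round() uses banker's rounding, so round half up explicitly.
    return math.floor(share_below + 0.5)


# (rank, cohort size) pairs taken from the category score record.
published = {
    "Agentic": (52, 129),
    "Coding": (86, 130),
    "Multimodal": (28, 32),
}

percentiles = {name: rank_to_percentile(r, n) for name, (r, n) in published.items()}
print(percentiles)
```

Under this assumption a low rank number in a small cohort can still be a low percentile, which is why #28 of 32 lands below #86 of 130.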

What it costs to get this score

Published API price against the public score. The x-axis uses a log scale; the dashed path marks models that are not beaten by a cheaper, higher-scoring option. Price uses average of published input and output rates.
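
As a rough illustration of the two rules in this caption, the sketch below computes the blended price as the average of input and output rates and keeps only models that no cheaper, higher-scoring model beats. The `Model` class, the `frontier` function, and every catalog entry named "Hypothetical" are made up for the example; only the Grok 4.20 figures come from this profile. Ties are resolved by treating an equally priced, higher-scoring model as beating the other, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    score: float
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens

    @property
    def blended(self) -> float:
        """Average of the published input and output rates."""
        return (self.input_price + self.output_price) / 2


def frontier(models):
    """Return models not beaten by a cheaper, higher-scoring option."""
    kept = []
    best_score = float("-inf")
    # Sweep from cheapest to most expensive; a model survives only if it
    # scores higher than everything at or below its price.
    for m in sorted(models, key=lambda m: (m.blended, -m.score)):
        if m.score > best_score:
            kept.append(m)
            best_score = m.score
    return kept


grok = Model("Grok 4.20", 53.9, 2.0, 6.0)
print(grok.blended)  # 4.0, matching the "blended $4" shown on this page

# Placeholder catalog: these entries are invented, not real models or prices.
catalog = [
    Model("Hypothetical A", 50.0, 0.5, 1.5),
    Model("Hypothetical B", 60.0, 1.0, 3.0),
    Model("Hypothetical C", 55.0, 2.0, 4.0),
    Model("Hypothetical D", 70.0, 5.0, 15.0),
]
print([m.name for m in frontier(catalog)])
```

In the placeholder catalog, "Hypothetical C" drops out because "Hypothetical B" is both cheaper and higher scoring.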

Current model: Grok 4.20 · 53.9 score · $4 blended per million tokens

Horizontal axis: blended price per million tokens, log scale · Vertical axis: public score

How much of this is verified

Coverage is split by category so a strong number never hides a thin evidence base. Verified means the row is tied to a published source; provisional rows remain visible but separate.

Agentic: 1/3 verified
Coding: 1/4 verified
Reasoning: 1/2 verified
Knowledge: 0/4 verified
Math: Not measured
Multilingual: Not measured
Multimodal: 0/5 verified
Inst. Following: Not measured

Legend: Verified source · Provisional · Not measured
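
The per-category counts above can be reproduced from the evidence labels in the benchmark ledger, under one assumption that the page does not state outright: rows labelled "Benchmark exact" count as verified and rows labelled "Secondary exact" count as provisional. The names `LEDGER`, `VERIFIED_LABELS`, and `coverage` are hypothetical.

```python
from collections import Counter

# (category, evidence label) for each of the 18 rows in the benchmark ledger.
LEDGER = (
    [("Coding", "Secondary exact")] * 3 + [("Coding", "Benchmark exact")]
    + [("Agentic", "Secondary exact")] * 2 + [("Agentic", "Benchmark exact")]
    + [("Reasoning", "Secondary exact"), ("Reasoning", "Benchmark exact")]
    + [("Knowledge", "Secondary exact")] * 4
    + [("Multimodal", "Secondary exact")] * 5
)

# Assumption: only this label marks a row as tied to a verified source.
VERIFIED_LABELS = {"Benchmark exact"}


def coverage(rows):
    """Map each category to (verified rows, total rows)."""
    totals = Counter()
    verified = Counter()
    for category, evidence in rows:
        totals[category] += 1
        if evidence in VERIFIED_LABELS:
            verified[category] += 1
    return {category: (verified[category], totals[category]) for category in totals}


for category, (ok, total) in coverage(LEDGER).items():
    print(f"{category}: {ok}/{total} verified")
```

With that mapping the counts match the list above, and the totals sum to the 18 rows this profile reports.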

Spec sheet

Each documented value carries its source. Missing fields stay visible as not sourced or not published, rather than disappearing from the page.

API model ID: Not published
Context window: 2M
Maximum output: Not sourced yet
Knowledge cutoff: Not sourced yet
Input modalities: Not sourced yet
Output modalities: Not sourced yet
Parameters: Not disclosed by the provider

Availability: Not sourced yet
Cloud regions: Not tracked yet
Lifecycle: Current
API capabilities: Tool calling, structured outputs, and batch support are not tracked yet
Prompt caching: Not documented in the pricing record
Self-host: Weights are not published
Rate limits: Not tracked yet

Category score record

Scores and ranks appear only where published evidence can be displayed. The table keeps the score, weight, cohort, and evidence state together.

Category scores, ranks, weighting, benchmark coverage, and evidence status
Category	Score	Rank	Percentile	Weight	Benchmarks	Evidence
Agentic	49.0	#52 of 129	60th	22%	3 benchmarks	Mixed sources
Coding	46.0	#86 of 130	34th	20%	4 benchmarks	Mixed sources
Reasoning	61.7	Not ranked	Not available	17%	2 benchmarks	Mixed sources
Knowledge	Score pending	Not ranked	Not available	12%	4 benchmarks	Reported
Math	Not measured	Not ranked	Not available	5%	0 benchmarks	Not measured
Multilingual	Not measured	Not ranked	Not available	7%	0 benchmarks	Not measured
Multimodal	35.9	#28 of 32	13th	12%	5 benchmarks	Reported
Inst. Following	Not measured	Not ranked	Not available	5%	0 benchmarks	Not measured

Benchmark ledger

Coding opens by default. The marker compares each value with the best source-verified result in the catalog; provisional leaders do not set the reference. Expand the remaining categories for every published row.
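
A minimal sketch of the comparison rule described above: the reference is the highest score among source-verified rows only, and the gap is that reference minus the model's score, rounded to one decimal place. The `Row` type and both function names are hypothetical, and the provisional entry is an invented placeholder included only to show that it is ignored.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float    # percent
    verified: bool  # True when the row is source-verified


def best_verified(rows):
    """Highest-scoring row among verified rows; provisional rows are skipped."""
    verified_rows = [r for r in rows if r.verified]
    if not verified_rows:
        raise ValueError("no verified row to compare against")
    return max(verified_rows, key=lambda r: r.score)


def gap_behind(score, rows):
    """Points behind the best verified row, to one decimal place."""
    return round(best_verified(rows).score - score, 1)


# SWE-bench Verified: the verified leader is from this page's ledger;
# the provisional row is an invented placeholder.
swe_bench_verified = [
    Row("Claude Opus 5", 96.0, True),
    Row("Hypothetical provisional leader", 97.5, False),
]

print(best_verified(swe_bench_verified).model)
print(gap_behind(76.7, swe_bench_verified))
```

Applied to Grok 4.20's 76.7% on SWE-bench Verified, this gives the 19.3-point gap listed in the Coding table.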

Coding · 4 rows

Coding benchmark values, best verified comparison, weight, and source status
Benchmark	Score	Versus best verified row	Gap	Weight	Evidence
SWE-bench Verified (Software Engineering Benchmark Verified)	76.7%	Claude Opus 5 · 96%	19.3 behind	Weighted 16%	Secondary exact · Meta AI: Muse Spark comparison chart
SWE-bench Pro	51.8%	Claude Mythos 5 · 80.3%	28.5 behind	Weighted 10%	Secondary exact · Meta AI: Muse Spark comparison chart
LiveCodeBench Pro	74.2%	Sakana Fugu-Ultra · 90.8%	16.6 behind	Display only	Secondary exact · Meta AI: Muse Spark comparison chart
Vibe Code Bench (Vibe Code Bench v1.1)	4.06%	Claude Opus 4.7 · 71.00%	66.9 behind	Display only	Benchmark exact · Vals AI: Vibe Code Bench v1.1

Agentic · 3 rows

Agentic benchmark values, best verified comparison, weight, and source status
Benchmark	Score	Versus best verified row	Gap	Weight	Evidence
Terminal-Bench 2.0	47.1%	GPT-5.6 Sol · 91.9%	44.8 behind	Weighted 38%	Secondary exact · Meta AI: Muse Spark comparison chart
DeepSearchQA	62.8%	Claude Opus 5 · 95.0%	32.2 behind	Display only	Secondary exact · Meta AI: Muse Spark comparison chart
Gert Labs (Gert Labs Composite Game Benchmark)	38.36%	Claude Opus 4.8 · 72.97%	34.6 behind	Display only	Benchmark exact · Gert Labs rankings

Reasoning · 2 rows

Reasoning benchmark values, best verified comparison, weight, and source status
Benchmark	Score	Versus best verified row	Gap	Weight	Evidence
ARC-AGI-2 (Abstraction and Reasoning Corpus for AGI v2)	53.3%	GPT-5.6 Sol · 92.5%	39.2 behind	Weighted 31%	Secondary exact · Meta AI: Muse Spark comparison chart
ARC-AGI-3 (Abstraction and Reasoning Corpus for AGI v3)	0.1%	Claude Opus 5 · 30.2%	30.1 behind	Display only	Benchmark exact · ARC Prize official leaderboard data

Knowledge · 4 rows

Knowledge benchmark values, best verified comparison, weight, and source status
Benchmark	Score	Versus best verified row	Gap	Weight	Evidence
GPQA-D (GPQA Diamond)	88.5%	Sakana Fugu-Ultra · 95.5%	7 behind	Display only	Secondary exact · Meta AI: Muse Spark comparison chart
HLE w/o tools (Humanity's Last Exam without tools)	31.6%	Claude Mythos 5 · 59%	27.4 behind	Display only	Secondary exact · Meta AI: Muse Spark comparison chart
HealthBench Hard	20.3%	Muse Spark · 42.8%	22.5 behind	Display only	Secondary exact · Meta AI: Muse Spark comparison chart
MedXpertQA (Text)	50.2%	Muse Spark · 52.6%	2.4 behind	Display only	Secondary exact · Meta AI: Muse Spark comparison chart

Multimodal · 5 rows

Multimodal benchmark values, best verified comparison, weight, and source status
Benchmark	Score	Versus best verified row	Gap	Weight	Evidence
MMMU-Pro (Massive Multi-discipline Multimodal Understanding Pro)	75.2%	GPT-5.4 Pro · 94%	18.8 behind	Weighted 45%	Secondary exact · Meta AI: Muse Spark comparison chart
CharXiv (CharXiv Reasoning)	60.9%	Claude Mythos 5 · 93.5%	32.6 behind	Weighted 25%	Secondary exact · Meta AI: Muse Spark comparison chart
ERQA	54.1%	Qwen3.7 Plus · 69.8%	15.7 behind	Display only	Secondary exact · Meta AI: Muse Spark comparison chart
SimpleVQA	57.4%	Qwen3.7 Plus · 81.7%	24.3 behind	Display only	Secondary exact · Meta AI: Muse Spark comparison chart
MedXpertQA (MM)	65.8%	Muse Spark · 78.4%	12.6 behind	Display only	Secondary exact · Meta AI: Muse Spark comparison chart

Lineage

The sequence follows explicit supersedes links. Scores and prices remain blank when the corresponding public row or first-party rate is unavailable.

Nov 17, 2025

Grok 4.1

Score 59.2 · Price not listed

Mar 10, 2026 · you are here

Grok 4.20

Score 53.9 · $2 / $6

Reasoning

Grok 4.20 Multi-agent

How to read this profile

The visual layer above carries the decisions. These notes preserve the model, ranking, coverage, and family context behind the numbers.

Grok 4.20 ranks #97 of 215 on the public leaderboard with a score of 53.88/100. It does not yet have enough sourced coverage for a verified position.

Grok 4.20 is a proprietary model with a 2M context window. It uses an explicit reasoning mode, which can improve complex problem solving while adding latency and token use.

Tracked as unresolved. xAI's current docs now present this model as Grok 4.20, but a clean exact-value benchmark table is still missing. BenchLM also stores secondary exact values from Meta AI's April 8, 2026 comparison chart, which labels the same family as Grok 4.2 Reasoning.

Grok 4.20 sits in the Grok 4.20 family with Grok 4.20 Multi-agent. Its explicit predecessor is Grok 4.1. 18 of 369 tracked benchmark slots currently have displayable evidence. Missing categories stay blank.

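By rank number, its best eligible position is Multimodal & Grounded at #28 and its worst is Coding at #86. The cohorts differ in size, though: Multimodal & Grounded is #28 of 32 (13th percentile), Coding is #86 of 130 (34th percentile), and Agentic is #52 of 129 (60th percentile). The Multimodal rank therefore does not by itself show strength in screenshots, documents, charts, and grounded multimodal workflows.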

Frequently asked questions

How does Grok 4.20 perform overall in AI benchmarks?

Grok 4.20 ranks #97 out of 215 models on the public BenchLM leaderboard, with a score of 53.88/100. Its evidence status is Estimated, and this profile shows 18 source-displayable benchmark rows. The label describes evidence depth, not a provider quality claim; inspect category rows before choosing a workload.

Is Grok 4.20 good for knowledge and understanding?

Grok 4.20 has source-displayable benchmark coverage for knowledge and understanding, but the public category table does not assign it a rank there. The individual rows remain available for inspection. A missing category position means the evidence threshold was not met; it does not convert the model's unmeasured work into a zero.

Is Grok 4.20 good for coding and programming?

Grok 4.20 ranks #86 out of 130 eligible models for coding and programming, with a public category score of 46/100. Higher-ranked alternatives are available for workloads where this category decides the choice. Check the underlying rows before treating the aggregate as a workload guarantee.

Is Grok 4.20 good for reasoning and logic?

Grok 4.20 has source-displayable benchmark coverage for reasoning and logic, but the public category table does not assign it a rank there. The individual rows remain available for inspection. A missing category position means the evidence threshold was not met; it does not convert the model's unmeasured work into a zero.

Is Grok 4.20 good for agentic tool use and computer tasks?

Grok 4.20 ranks #52 out of 129 eligible models for agentic tool use and computer tasks, with a public category score of 49/100. Higher-ranked alternatives are available for workloads where this category decides the choice. Check the underlying rows before treating the aggregate as a workload guarantee.

Is Grok 4.20 good for multimodal and grounded tasks?

Grok 4.20 ranks #28 out of 32 eligible models for multimodal and grounded tasks, with a public category score of 35.9/100. Higher-ranked alternatives are available for workloads where this category decides the choice. Check the underlying rows before treating the aggregate as a workload guarantee.

Which sibling models are related to Grok 4.20?

Grok 4.20 belongs to the Grok 4.20 family. Related tracked variants include Grok 4.20 Multi-agent. A sibling link indicates shared lineage or a documented configuration relationship; it does not mean the variants have identical pricing, context limits, benchmark evidence, or deployment behavior. Compare before switching.

Does Grok 4.20 have full benchmark coverage on BenchLM?

No. Grok 4.20 currently has 18 source-displayable rows across 369 tracked benchmark slots. The profile exposes published, non-generated evidence and leaves missing categories blank until an exact evaluation is available. Coverage describes how much was measured; it is not a penalty added to an individual benchmark result.

What is the context window size of Grok 4.20?

Grok 4.20 has a reported context window of 2M in the exact-model catalog record. The value stays visible, but the profile marks its source link as unavailable instead of presenting it as directly documented. Maximum output length remains separate because providers often publish a different limit.

Last updated July 28, 2026. Runtime fields remain blank until a sourced snapshot exists.
