
LLM Speed & Latency Comparison

Compare inference speed across major AI models. Tokens/sec measures output generation speed. Latency measures the time to the first answer token. For reasoning models this includes thinking time, so it reflects how long you wait for the answer to begin rather than raw time to first token (TTFT).
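As a rough illustration of how the two metrics combine, the time to receive a complete answer can be approximated as first-answer latency plus output length divided by output speed. This is a minimal sketch, not a method published by Artificial Analysis; the function name and the two example models are hypothetical.

```python
def estimated_response_time(latency_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Approximate seconds until the full answer has finished streaming.

    Assumes a constant output speed after the first answer token arrives.
    """
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return latency_s + output_tokens / tokens_per_s


# Hypothetical figures, chosen only to show the trade-off.
quick_start = estimated_response_time(latency_s=0.5, tokens_per_s=50, output_tokens=500)
quick_stream = estimated_response_time(latency_s=5.0, tokens_per_s=250, output_tokens=500)

print(f"low latency, slow output:  {quick_start:.1f}s")
print(f"high latency, fast output: {quick_stream:.1f}s")
```

For long outputs the faster stream can win despite the slower start, so which metric matters more depends on the typical response length.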

Speed data from Artificial Analysis. Last updated: 2026-06-19. Metrics shown: median output speed (tokens/s) and latency to first answer chunk (s).

Fastest Output

Mercury 2

789 tok/s · Inception

Lowest Latency

Command A+

0.25s to first answer · Cohere

Fastest (Score 70+)

Gemini 3.5 Flash

284.2 tok/s · Score: 85

Top 25 — Output Speed (tok/s)

Chart legend: Ultra Fast, Fast, Medium, Slow

Average Speed by Provider

NVIDIA

260 tok/s avg · 2 models

1.3s avg latency

xAI

166 tok/s avg · 6 models

7s avg latency

Google

154 tok/s avg · 8 models

14.2s avg latency

Mistral

126 tok/s avg · 7 models

0.8s avg latency

OpenAI

121 tok/s avg · 25 models

42.3s avg latency

Meta

93 tok/s avg · 3 models

1.3s avg latency

Z.AI

82 tok/s avg · 5 models

1.3s avg latency

Anthropic

52 tok/s avg · 7 models

3.3s avg latency

DeepSeek

48 tok/s avg · 2 models

2.3s avg latency

MiniMax

46 tok/s avg · 2 models

2.3s avg latency

Moonshot AI

44 tok/s avg · 2 models

1.9s avg latency

Model · Speed (tok/s) · Latency (s) · Score
Mercury 2

Inception · Proprietary

7893.8811
Nemotron 3 Super 100B

NVIDIA · Open Weight

3670.7143
GPT-OSS 20B

OpenAI · Open Weight

3130.6516
Gemini 3.5 Flash

Google · Proprietary

284.218.5585
Ministral 3 3B

Mistral · Open Weight

2740.421
Command A+

Cohere · Open Weight

2720.2539
GPT-OSS 120B

OpenAI · Open Weight

2620.7934
Grok 4.20

xAI · Proprietary

23310.3370
Gemini 2.5 Flash

Google · Proprietary

2210.537
Ling 2.6 Flash

InclusionAI · Open Weight

209.51.0736
Grok 4.3

xAI · Proprietary

20912.3672
Gemini 3.1 Flash-Lite

Google · Proprietary

2057.547
GPT-5.4 mini

OpenAI · Proprietary

2013.8568
GPT-5.4 nano

OpenAI · Proprietary

1913.6459
Grok 3 Mini

xAI · Proprietary

1900.5442
Ministral 3 8B

Mistral · Open Weight

1820.523
GPT-4.1 nano

OpenAI · Proprietary

1810.6326
Mistral Small 4

Mistral · Open Weight

1750.6445
Grok Code Fast 1

xAI · Proprietary

1722.8139
o4-mini (high)

OpenAI · Proprietary

16121.9443
o3-mini

OpenAI · Proprietary

1607.1255
Gemini 3 Flash

Google · Proprietary

1591.1955
Nemotron 3 Nano 30B

NVIDIA · Open Weight

1521.925
Nova Pro

Amazon · Proprietary

1410.8110
Grok 4.1 Fast

xAI · Proprietary

1380.5468
Claude 3 Haiku

Anthropic · Proprietary

1381.1623
GPT-5 nano

OpenAI · Proprietary

13783.3
GPT-4o

OpenAI · Proprietary

1310.8142
MiMo-V2-Flash

Xiaomi · Open Weight

1292.1459
Llama 4 Scout

Meta · Open Weight

1280.726
GPT-5.2-Codex

OpenAI · Proprietary

12387.3476
Llama 4 Maverick

Meta · Open Weight

1210.9517
o3

OpenAI · Proprietary

1185.3856
Gemini 2.5 Pro

Google · Proprietary

11721.1963
GPT-5.1

OpenAI · Proprietary

11157.4777
Ministral 3 14B

Mistral · Open Weight

1100.65
Gemini 3.1 Pro

Google · Proprietary

10929.7188
Gemini 3 Pro

Google · Proprietary

10932.6580
GPT-4.1

OpenAI · Proprietary

1081.0256
GLM-4.5-Air

Z.AI · Proprietary

1061.1819
o1

OpenAI · Proprietary

9832.2956
Qwen3.5 397B

Alibaba · Open Weight

962.4462
GLM-4.7-Flash

Z.AI · Open Weight

950.9111
LFM2-24B-A2B

LiquidAI · Proprietary

920.422
Step 3.5 Flash

StepFun · Open Weight

873.03
GPT-5 mini

OpenAI · Proprietary

8665.32
GPT-5 (high)

OpenAI · Proprietary

8336.2875
GPT-5 (medium)

OpenAI · Proprietary

8336.2870
GLM-4.7

Z.AI · Open Weight

821.168
GPT-4.1 mini

OpenAI · Proprietary

800.7645
GPT-5.3 Codex

OpenAI · Proprietary

7988.2685
GPT-5.4 Pro

OpenAI · Proprietary

74151.7990
GPT-5.4

OpenAI · Proprietary

74151.7987
GLM-5

Z.AI · Open Weight

741.6466
GPT-5.2

OpenAI · Proprietary

73130.3478
DeepSeek R1 Distill Qwen 32B

DeepSeek · Open Weight

600.846
Mistral Medium 3

Mistral · Proprietary

571.243
Grok 4

xAI · Proprietary

5415.663
GLM-4.5

Z.AI · Proprietary

511.4525
Mistral Large 3

Mistral · Proprietary

481.0448
Claude Opus 4.5

Anthropic · Proprietary

461.0175
MiniMax M2.5

MiniMax · Proprietary

462.12
Kimi K2.5

Moonshot AI · Open Weight

452.3863
MiniMax M2.7

MiniMax · Open Weight

452.5352
Claude Sonnet 4.6

Anthropic · Proprietary

441.4880
Kimi K2

Moonshot AI · Proprietary

431.5141
Claude Opus 4.6

Anthropic · Proprietary

401.7886
Claude 4 Sonnet

Anthropic · Proprietary

401.3350
Mistral Large 2

Mistral · Proprietary

381.4538
DeepSeek V3.2

DeepSeek · Open Weight

353.7556
Phi-4

Microsoft · Open Weight

352.0227
GPT-4o mini

OpenAI · Proprietary

333.1649
Gemma 3 27B

Google · Open Weight

312.0416
GPT-4 Turbo

OpenAI · Proprietary

302.8425
Claude 4.1 Opus

Anthropic · Proprietary

291.6651
Claude 4.1 Opus Thinking

Anthropic · Proprietary

291543
Llama 3.1 405B

Meta · Open Weight

292.1940
o3-pro

OpenAI · Proprietary

2784.9357

Speed data sourced from Artificial Analysis. Metrics reflect median performance across providers. Reasoning models typically show higher first-answer latency due to chain-of-thought processing.

Frequently Asked Questions

What does tokens per second mean for LLMs?

Tokens per second (tok/s) measures how fast an LLM generates output text. Higher is better. A model at 200 tok/s produces roughly 150 words per second — fast enough for real-time streaming. Models below 50 tok/s may feel sluggish in interactive applications.
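The 200 tok/s to 150 words/s figure above implies a ratio of about 0.75 words per token. The sketch below makes that arithmetic explicit. The ratio is an assumption inferred from that example; real ratios vary by tokenizer and language, and the function names are hypothetical.

```python
# Assumed ratio, inferred from "200 tok/s is roughly 150 words per second".
WORDS_PER_TOKEN = 0.75


def words_per_second(tokens_per_s: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert an output speed in tokens/s to an approximate words/s."""
    return tokens_per_s * words_per_token


def seconds_for_words(word_count: int, tokens_per_s: float,
                      words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Approximate streaming time for an answer of `word_count` words."""
    return word_count / words_per_second(tokens_per_s, words_per_token)


print(words_per_second(200))       # the 200 tok/s example from the text
print(seconds_for_words(300, 50))  # a 300-word answer at 50 tok/s
```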

What does the latency column measure?

Latency here is the time from sending a request to receiving the first token of the answer (Artificial Analysis’s "first answer chunk" metric). Lower is better; under 1 second feels instant in chat. For reasoning models this includes the entire thinking phase, so it can reach 10–150s. It is the wait until the answer starts, not the raw time-to-first-token of the stream.
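To make the definition concrete, here is a minimal sketch of timing the first chunk of a streamed response. It is not the Artificial Analysis measurement harness: `time_stream` and `fake_stream` are hypothetical names, and the stream is simulated with `time.sleep` rather than a real API call. For a reasoning model you would pass only the answer chunks, so that thinking time counts toward the first-chunk latency.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_stream(chunks: Iterable[str]) -> Tuple[float, float, int]:
    """Return (seconds to first chunk, total seconds, chunk count)."""
    start = time.perf_counter()
    first_chunk_s = None
    count = 0
    for _ in chunks:
        if first_chunk_s is None:
            first_chunk_s = time.perf_counter() - start
        count += 1
    total_s = time.perf_counter() - start
    if first_chunk_s is None:
        raise ValueError("stream produced no chunks")
    return first_chunk_s, total_s, count


def fake_stream(delay_s: float, n_chunks: int, gap_s: float) -> Iterator[str]:
    """Simulated stream: wait `delay_s`, then yield chunks `gap_s` apart."""
    time.sleep(delay_s)
    for i in range(n_chunks):
        yield f"chunk{i}"
        time.sleep(gap_s)


first_s, total_s, n = time_stream(fake_stream(delay_s=0.05, n_chunks=3, gap_s=0.01))
print(f"first chunk after {first_s:.2f}s, {n} chunks in {total_s:.2f}s")
```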

Which LLM is the fastest?

Currently, Mercury 2 by Inception is the fastest at 789 tokens/second. Among models scoring 70 or higher overall, the fastest is Gemini 3.5 Flash at 284.2 tok/s.

Why are reasoning models slower?

Reasoning models (like o3, GPT-5, Gemini Deep Think) use chain-of-thought processing: they generate internal "thinking" tokens before producing the final answer. This adds significant first-answer latency (often 10–150 seconds) but can dramatically improve accuracy on complex tasks. The output speed (tok/s) once generation starts is usually comparable to standard models.
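A back-of-the-envelope way to see where that extra latency comes from is to treat the thinking phase as ordinary token generation that happens before the answer. The sketch below assumes thinking tokens are produced at the same rate as answer tokens, which is a simplification and not a measured property of any listed model; the function name and figures are hypothetical.

```python
def first_answer_latency(ttft_s: float, thinking_tokens: int, tokens_per_s: float) -> float:
    """Estimate seconds until the first answer token of a reasoning model.

    ttft_s: time until the model emits its first token of any kind.
    thinking_tokens: internal reasoning tokens generated before the answer.
    Assumes thinking tokens are generated at `tokens_per_s`.
    """
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + thinking_tokens / tokens_per_s


# Hypothetical: 1s to first token, 3,000 thinking tokens at 100 tok/s.
reasoning_latency = first_answer_latency(ttft_s=1.0, thinking_tokens=3000, tokens_per_s=100)
standard_latency = first_answer_latency(ttft_s=1.0, thinking_tokens=0, tokens_per_s=100)
print(f"reasoning: {reasoning_latency:.0f}s, standard: {standard_latency:.0f}s")
```

Under these assumptions a few thousand thinking tokens are enough to move first-answer latency from about a second into the tens of seconds.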
