LLM Speed & Latency Comparison
Compare inference speed across every major AI model. Tokens/sec measures output generation speed. Latency measures time to the first answer token — for reasoning models this includes thinking time, so it reflects end-to-end response latency rather than raw TTFT.
Speed data from Artificial Analysis. Last updated: 2026-06-19. Speed is median output tokens per second (tok/s); latency is seconds to the first answer chunk.
Fastest output: Mercury 2
789 tok/s · Inception
Lowest first-answer latency: Command A+
0.25s to first answer · Cohere
Fastest model scoring above 70: Gemini 3.5 Flash
284.2 tok/s · Score: 85
Top 25 — Output Speed (tok/s)
Average Speed by Provider
NVIDIA
260 tok/s avg · 2 models
1.3s avg latency
xAI
166 tok/s avg · 6 models
7s avg latency
Google
154 tok/s avg · 8 models
14.2s avg latency
Mistral
126 tok/s avg · 7 models
0.8s avg latency
OpenAI
121 tok/s avg · 25 models
42.3s avg latency
Meta
93 tok/s avg · 3 models
1.3s avg latency
Z.AI
82 tok/s avg · 5 models
1.3s avg latency
Anthropic
52 tok/s avg · 7 models
3.3s avg latency
DeepSeek
48 tok/s avg · 2 models
2.3s avg latency
MiniMax
46 tok/s avg · 2 models
2.3s avg latency
Moonshot AI
44 tok/s avg · 2 models
1.9s avg latency
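The page does not state how the per-provider figures are derived. As a rough check, an unweighted mean over a provider's rows in the model table is consistent with the NVIDIA and Moonshot AI figures above. The sketch below assumes that method; `provider_averages` and the row layout are illustrative names, not part of the site.

```python
from collections import defaultdict
from statistics import mean

# (provider, output speed in tok/s, first-answer latency in s), copied from the model table.
ROWS = [
    ("NVIDIA", 367, 0.71),      # Nemotron 3 Super 100B
    ("NVIDIA", 152, 1.9),       # Nemotron 3 Nano 30B
    ("Moonshot AI", 45, 2.38),  # Kimi K2.5
    ("Moonshot AI", 43, 1.51),  # Kimi K2
]


def provider_averages(rows):
    """Return {provider: (mean tok/s, mean latency in s, model count)}.

    Assumption: a plain unweighted mean per provider.
    """
    grouped = defaultdict(list)
    for provider, speed, latency in rows:
        grouped[provider].append((speed, latency))
    return {
        provider: (
            mean(speed for speed, _ in values),
            mean(latency for _, latency in values),
            len(values),
        )
        for provider, values in grouped.items()
    }


for provider, (speed, latency, count) in provider_averages(ROWS).items():
    print(f"{provider}: {speed:.1f} tok/s avg, {latency:.2f}s avg latency, {count} models")
```

NVIDIA's mean speed comes out at 259.5 tok/s, which matches the listed 260 tok/s after rounding.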
| Model | Speed (tok/s) | Latency (s) | Score |
|---|---|---|---|
| Mercury 2 Inception · Proprietary | 789 | 3.88 | 11 |
| Nemotron 3 Super 100B NVIDIA · Open Weight | 367 | 0.71 | 43 |
| GPT-OSS 20B OpenAI · Open Weight | 313 | 0.65 | 16 |
| Gemini 3.5 Flash Google · Proprietary | 284.2 | 18.55 | 85 |
| Ministral 3 3B Mistral · Open Weight | 274 | 0.42 | 1 |
| Command A+ Cohere · Open Weight | 272 | 0.25 | 39 |
| GPT-OSS 120B OpenAI · Open Weight | 262 | 0.79 | 34 |
| Grok 4.20 xAI · Proprietary | 233 | 10.33 | 70 |
| Gemini 2.5 Flash Google · Proprietary | 221 | 0.5 | 37 |
| Ling 2.6 Flash InclusionAI · Open Weight | 209.5 | 1.07 | 36 |
| Grok 4.3 xAI · Proprietary | 209 | 12.36 | 72 |
| Gemini 3.1 Flash-Lite Google · Proprietary | 205 | 7.5 | 47 |
| GPT-5.4 mini OpenAI · Proprietary | 201 | 3.85 | 68 |
| GPT-5.4 nano OpenAI · Proprietary | 191 | 3.64 | 59 |
| Grok 3 Mini xAI · Proprietary | 190 | 0.54 | 42 |
| Ministral 3 8B Mistral · Open Weight | 182 | 0.52 | 3 |
| GPT-4.1 nano OpenAI · Proprietary | 181 | 0.63 | 26 |
| Mistral Small 4 Mistral · Open Weight | 175 | 0.64 | 45 |
| Grok Code Fast 1 xAI · Proprietary | 172 | 2.81 | 39 |
| o4-mini (high) OpenAI · Proprietary | 161 | 21.94 | 43 |
| o3-mini OpenAI · Proprietary | 160 | 7.12 | 55 |
| Gemini 3 Flash Google · Proprietary | 159 | 1.19 | 55 |
| Nemotron 3 Nano 30B NVIDIA · Open Weight | 152 | 1.9 | 25 |
| Nova Pro Amazon · Proprietary | 141 | 0.81 | 10 |
| Grok 4.1 Fast xAI · Proprietary | 138 | 0.54 | 68 |
| Claude 3 Haiku Anthropic · Proprietary | 138 | 1.16 | 23 |
| GPT-5 nano OpenAI · Proprietary | 137 | 83.3 | — |
| GPT-4o OpenAI · Proprietary | 131 | 0.81 | 42 |
| MiMo-V2-Flash Xiaomi · Open Weight | 129 | 2.14 | 59 |
| Llama 4 Scout Meta · Open Weight | 128 | 0.7 | 26 |
| GPT-5.2-Codex OpenAI · Proprietary | 123 | 87.34 | 76 |
| Llama 4 Maverick Meta · Open Weight | 121 | 0.95 | 17 |
| o3 OpenAI · Proprietary | 118 | 5.38 | 56 |
| Gemini 2.5 Pro Google · Proprietary | 117 | 21.19 | 63 |
| GPT-5.1 OpenAI · Proprietary | 111 | 57.47 | 77 |
| Ministral 3 14B Mistral · Open Weight | 110 | 0.6 | 5 |
| Gemini 3.1 Pro Google · Proprietary | 109 | 29.71 | 88 |
| Gemini 3 Pro Google · Proprietary | 109 | 32.65 | 80 |
| GPT-4.1 OpenAI · Proprietary | 108 | 1.02 | 56 |
| GLM-4.5-Air Z.AI · Proprietary | 106 | 1.18 | 19 |
| o1 OpenAI · Proprietary | 98 | 32.29 | 56 |
| Qwen3.5 397B Alibaba · Open Weight | 96 | 2.44 | 62 |
| GLM-4.7-Flash Z.AI · Open Weight | 95 | 0.91 | 11 |
| LFM2-24B-A2B LiquidAI · Proprietary | 92 | 0.42 | 2 |
| Step 3.5 Flash StepFun · Open Weight | 87 | 3.03 | — |
| GPT-5 mini OpenAI · Proprietary | 86 | 65.32 | — |
| GPT-5 (high) OpenAI · Proprietary | 83 | 36.28 | 75 |
| GPT-5 (medium) OpenAI · Proprietary | 83 | 36.28 | 70 |
| GLM-4.7 Z.AI · Open Weight | 82 | 1.1 | 68 |
| GPT-4.1 mini OpenAI · Proprietary | 80 | 0.76 | 45 |
| GPT-5.3 Codex OpenAI · Proprietary | 79 | 88.26 | 85 |
| GPT-5.4 Pro OpenAI · Proprietary | 74 | 151.79 | 90 |
| GPT-5.4 OpenAI · Proprietary | 74 | 151.79 | 87 |
| GLM-5 Z.AI · Open Weight | 74 | 1.64 | 66 |
| GPT-5.2 OpenAI · Proprietary | 73 | 130.34 | 78 |
| DeepSeek R1 Distill Qwen 32B DeepSeek · Open Weight | 60 | 0.84 | 6 |
| Mistral Medium 3 Mistral · Proprietary | 57 | 1.2 | 43 |
| Grok 4 xAI · Proprietary | 54 | 15.6 | 63 |
| GLM-4.5 Z.AI · Proprietary | 51 | 1.45 | 25 |
| Mistral Large 3 Mistral · Proprietary | 48 | 1.04 | 48 |
| Claude Opus 4.5 Anthropic · Proprietary | 46 | 1.01 | 75 |
| MiniMax M2.5 MiniMax · Proprietary | 46 | 2.12 | — |
| Kimi K2.5 Moonshot AI · Open Weight | 45 | 2.38 | 63 |
| MiniMax M2.7 MiniMax · Open Weight | 45 | 2.53 | 52 |
| Claude Sonnet 4.6 Anthropic · Proprietary | 44 | 1.48 | 80 |
| Kimi K2 Moonshot AI · Proprietary | 43 | 1.51 | 41 |
| Claude Opus 4.6 Anthropic · Proprietary | 40 | 1.78 | 86 |
| Claude 4 Sonnet Anthropic · Proprietary | 40 | 1.33 | 50 |
| Mistral Large 2 Mistral · Proprietary | 38 | 1.45 | 38 |
| DeepSeek V3.2 DeepSeek · Open Weight | 35 | 3.75 | 56 |
| Phi-4 Microsoft · Open Weight | 35 | 2.02 | 27 |
| GPT-4o mini OpenAI · Proprietary | 33 | 3.16 | 49 |
| Gemma 3 27B Google · Open Weight | 31 | 2.04 | 16 |
| GPT-4 Turbo OpenAI · Proprietary | 30 | 2.84 | 25 |
| Claude 4.1 Opus Anthropic · Proprietary | 29 | 1.66 | 51 |
| Claude 4.1 Opus Thinking Anthropic · Proprietary | 29 | 15 | 43 |
| Llama 3.1 405B Meta · Open Weight | 29 | 2.19 | 40 |
| o3-pro OpenAI · Proprietary | 27 | 84.93 | 57 |
Speed data sourced from Artificial Analysis. Metrics reflect median performance across providers. Reasoning models typically show higher first-answer latency due to chain-of-thought processing.
Frequently Asked Questions
What does tokens per second mean for LLMs?
Tokens per second (tok/s) measures how fast an LLM generates output text. Higher is better. A model at 200 tok/s produces roughly 150 words per second (at about 0.75 words per token), far faster than anyone reads, so streamed output feels instant. Models below 50 tok/s may feel sluggish in interactive applications.
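The 200 tok/s to 150 words-per-second conversion in this answer implies about 0.75 English words per token. That ratio is a rule of thumb that shifts with tokenizer and language, so the constant below is an assumption and the helper names are illustrative.

```python
# Assumed rule of thumb, implied by "200 tok/s is roughly 150 words per second".
WORDS_PER_TOKEN = 0.75


def words_per_second(tokens_per_second: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert an output speed in tok/s to an approximate words-per-second rate."""
    return tokens_per_second * words_per_token


def seconds_to_generate(word_count: int, tokens_per_second: float,
                        words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Approximate time to stream `word_count` words once generation has started."""
    return word_count / words_per_second(tokens_per_second, words_per_token)


print(words_per_second(200))         # 150.0
print(seconds_to_generate(300, 50))  # 8.0
```

At 50 tok/s a 300-word answer takes about 8 seconds to stream, which is where output starts to feel slow in a chat interface.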
What does the latency column measure?
Latency here is the time from sending a request to receiving the first token of the answer (Artificial Analysis’s "first answer chunk" metric). Lower is better; under 1 second feels instant in chat. For reasoning models this includes the entire thinking phase, so it can reach 10–150s — it is end-to-end response latency, not raw time-to-first-token of the stream.
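The difference between raw time-to-first-token and first-answer latency can be sketched on a simulated stream. The chunk format here, `(seconds since request, kind)`, is hypothetical. Real streaming APIs label thinking and answer content in their own ways, and this is not how Artificial Analysis instruments its measurements.

```python
from typing import List, Optional, Tuple

# Hypothetical chunk record: (seconds since the request was sent, kind),
# where kind is "thinking" or "answer".
Chunk = Tuple[float, str]


def time_to_first_token(chunks: List[Chunk]) -> Optional[float]:
    """Raw TTFT: arrival time of the first chunk of any kind."""
    return min((t for t, _ in chunks), default=None)


def time_to_first_answer(chunks: List[Chunk]) -> Optional[float]:
    """First-answer latency: arrival time of the first chunk that is part of the answer."""
    return min((t for t, kind in chunks if kind == "answer"), default=None)


# Simulated reasoning-model stream: thinking chunks arrive well before the answer.
stream = [(0.4, "thinking"), (3.1, "thinking"), (12.6, "answer"), (12.7, "answer")]

print(time_to_first_token(stream))   # 0.4
print(time_to_first_answer(stream))  # 12.6
```

For a non-reasoning model every chunk is an answer chunk, so the two numbers coincide.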
Which LLM is the fastest?
As of the 2026-06-19 update, Mercury 2 by Inception is the fastest at 789 tokens/second. The fastest model scoring above 70 overall is Gemini 3.5 Flash at 284.2 tok/s.
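Both answers are a filter followed by a maximum over the table. The sketch below uses a handful of rows copied from the table. The `fastest` helper is illustrative, and it reads "above 70" as strictly greater than 70, which is an assumption about how the page applies the cutoff.

```python
# (model, output speed in tok/s, score) copied from the table; None where the page shows no score.
MODELS = [
    ("Mercury 2", 789, 11),
    ("Nemotron 3 Super 100B", 367, 43),
    ("Gemini 3.5 Flash", 284.2, 85),
    ("Grok 4.20", 233, 70),
    ("Grok 4.3", 209, 72),
    ("GPT-5 nano", 137, None),
]


def fastest(models, min_score=None):
    """Return the fastest row, optionally restricted to scores strictly above `min_score`."""
    eligible = [
        m for m in models
        if min_score is None or (m[2] is not None and m[2] > min_score)
    ]
    return max(eligible, key=lambda m: m[1], default=None)


print(fastest(MODELS))                # ('Mercury 2', 789, 11)
print(fastest(MODELS, min_score=70))  # ('Gemini 3.5 Flash', 284.2, 85)
```

Under the strict reading, Grok 4.20 (score exactly 70) is excluded, though it would not change the result here because Gemini 3.5 Flash is faster.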
Why are reasoning models slower?
Reasoning models (like o3, GPT-5, Gemini Deep Think) use chain-of-thought processing: they generate internal "thinking" tokens before producing the final answer. This adds significant first-answer latency (often 10–150 seconds) but can substantially improve accuracy on complex tasks. Once the answer starts, output speed (tok/s) is usually comparable to standard models.
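To see how the two metrics combine, a rough estimate of total response time is first-answer latency plus answer length divided by output speed. This model is an assumption: it treats the generation rate as steady and ignores network variance. The figures for o3 and GPT-4.1 come from the table; the 500-token answer length is arbitrary.

```python
def estimated_response_seconds(first_answer_latency_s: float,
                               tokens_per_second: float,
                               answer_tokens: int) -> float:
    """Rough total time: wait for the first answer chunk, then stream at a steady rate."""
    return first_answer_latency_s + answer_tokens / tokens_per_second


ANSWER_TOKENS = 500  # arbitrary example length

# (model, first-answer latency in s, output speed in tok/s) from the table.
for model, latency, speed in [("o3", 5.38, 118), ("GPT-4.1", 1.02, 108)]:
    total = estimated_response_seconds(latency, speed, ANSWER_TOKENS)
    print(f"{model}: about {total:.1f}s for {ANSWER_TOKENS} answer tokens")
```

The two models stream at similar rates, so most of the gap in the estimate comes from o3's longer wait before the first answer token.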