LLM Speed & Latency Comparison
Compare inference speed across every major AI model. Tokens/sec measures output generation speed. TTFT (Time to First Token) measures response latency.
Speed data from Artificial Analysis. Last updated: 2026-04-07. Metrics: median output tokens/s and latency to the first answer chunk (s).
Fastest output: Mercury 2 · 789 tok/s · Inception
Lowest latency: Ministral 3 3B · 0.42s TTFT · Mistral
Fastest high-scoring model: Grok 4.20 · 233 tok/s · Score: 78
Average Speed by Provider
| Provider | Avg Speed (tok/s) | Models | Avg TTFT (s) |
|---|---|---|---|
| NVIDIA | 260 | 2 | 1.3 |
| xAI | 157 | 5 | 6.0 |
| Google | 136 | 7 | 13.5 |
| Mistral | 126 | 7 | 0.8 |
| OpenAI | 121 | 25 | 42.3 |
| Meta | 93 | 3 | 1.3 |
| Z.AI | 82 | 5 | 1.3 |
| Anthropic | 52 | 7 | 3.3 |
| DeepSeek | 48 | 2 | 2.3 |
| MiniMax | 46 | 2 | 2.3 |
| Moonshot AI | 44 | 2 | 1.9 |
| Model | Speed (tok/s) | TTFT (s) | Score |
|---|---|---|---|
| Mercury 2 Inception · Proprietary | 789 | 3.88 | 13 |
| Nemotron 3 Super 100B NVIDIA · Open Weight | 367 | 0.71 | 46 |
| GPT-OSS 20B OpenAI · Open Weight | 313 | 0.65 | 20 |
| Ministral 3 3B Mistral · Open Weight | 274 | 0.42 | 1 |
| GPT-OSS 120B OpenAI · Open Weight | 262 | 0.79 | 38 |
| Grok 4.20 xAI · Proprietary | 233 | 10.33 | 78 |
| Gemini 2.5 Flash Google · Proprietary | 221 | 0.5 | 41 |
| Gemini 3.1 Flash-Lite Google · Proprietary | 205 | 7.5 | 51 |
| GPT-5.4 mini OpenAI · Proprietary | 201 | 3.85 | 73 |
| GPT-5.4 nano OpenAI · Proprietary | 191 | 3.64 | 63 |
| Grok 3 Mini xAI · Proprietary | 190 | 0.54 | 48 |
| Ministral 3 8B Mistral · Open Weight | 182 | 0.52 | 3 |
| GPT-4.1 nano OpenAI · Proprietary | 181 | 0.63 | 28 |
| Mistral Small 4 Mistral · Open Weight | 175 | 0.64 | 47 |
| Grok Code Fast 1 xAI · Proprietary | 172 | 2.81 | 42 |
| o4-mini (high) OpenAI · Proprietary | 161 | 21.94 | 46 |
| o3-mini OpenAI · Proprietary | 160 | 7.12 | 58 |
| Gemini 3 Flash Google · Proprietary | 159 | 1.19 | 67 |
| Nemotron 3 Nano 30B NVIDIA · Open Weight | 152 | 1.9 | 27 |
| Nova Pro Amazon · Proprietary | 141 | 0.81 | 11 |
| Grok 4.1 Fast xAI · Proprietary | 138 | 0.54 | 72 |
| Claude 3 Haiku Anthropic · Proprietary | 138 | 1.16 | 25 |
| GPT-5 nano OpenAI · Proprietary | 137 | 83.3 | 11 |
| GPT-4o OpenAI · Proprietary | 131 | 0.81 | 41 |
| MiMo-V2-Flash Xiaomi · Open Weight | 129 | 2.14 | 63 |
| Llama 4 Scout Meta · Open Weight | 128 | 0.7 | 24 |
| GPT-5.2-Codex OpenAI · Proprietary | 123 | 87.34 | 80 |
| Llama 4 Maverick Meta · Open Weight | 121 | 0.95 | 18 |
| o3 OpenAI · Proprietary | 118 | 5.38 | 60 |
| Gemini 2.5 Pro Google · Proprietary | 117 | 21.19 | 67 |
| GPT-5.1 OpenAI · Proprietary | 111 | 57.47 | 81 |
| Ministral 3 14B Mistral · Open Weight | 110 | 0.6 | 6 |
| Gemini 3.1 Pro Google · Proprietary | 109 | 29.71 | 94 |
| Gemini 3 Pro Google · Proprietary | 109 | 32.65 | 83 |
| GPT-4.1 OpenAI · Proprietary | 108 | 1.02 | 61 |
| GLM-4.5-Air Z.AI · Proprietary | 106 | 1.18 | 22 |
| o1 OpenAI · Proprietary | 98 | 32.29 | 60 |
| Qwen3.5 397B Alibaba · Open Weight | 96 | 2.44 | 66 |
| GLM-4.7-Flash Z.AI · Open Weight | 95 | 0.91 | 13 |
| LFM2-24B-A2B LiquidAI · Proprietary | 92 | 0.42 | 3 |
| Step 3.5 Flash StepFun · Open Weight | 87 | 3.03 | 15 |
| GPT-5 mini OpenAI · Proprietary | 86 | 65.32 | 11 |
| GPT-5 (high) OpenAI · Proprietary | 83 | 36.28 | 80 |
| GPT-5 (medium) OpenAI · Proprietary | 83 | 36.28 | 74 |
| GLM-4.7 Z.AI · Open Weight | 82 | 1.1 | 72 |
| GPT-4.1 mini OpenAI · Proprietary | 80 | 0.76 | 47 |
| GPT-5.3 Codex OpenAI · Proprietary | 79 | 88.26 | 89 |
| GPT-5.4 OpenAI · Proprietary | 74 | 151.79 | 94 |
| GPT-5.4 Pro OpenAI · Proprietary | 74 | 151.79 | 92 |
| GLM-5 Z.AI · Open Weight | 74 | 1.64 | 77 |
| GPT-5.2 OpenAI · Proprietary | 73 | 130.34 | 84 |
| DeepSeek R1 Distill Qwen 32B DeepSeek · Open Weight | 60 | 0.84 | 7 |
| Mistral Medium 3 Mistral · Proprietary | 57 | 1.2 | 45 |
| Grok 4 xAI · Proprietary | 54 | 15.6 | 67 |
| GLM-4.5 Z.AI · Proprietary | 51 | 1.45 | 29 |
| Mistral Large 3 Mistral · Proprietary | 48 | 1.04 | 52 |
| Claude Opus 4.5 Anthropic · Proprietary | 46 | 1.01 | 80 |
| MiniMax M2.5 MiniMax · Proprietary | 46 | 2.12 | 17 |
| Kimi K2.5 Moonshot AI · Open Weight | 45 | 2.38 | 68 |
| MiniMax M2.7 MiniMax · Proprietary | 45 | 2.53 | 64 |
| Claude Sonnet 4.6 Anthropic · Proprietary | 44 | 1.48 | 86 |
| Kimi K2 Moonshot AI · Proprietary | 43 | 1.51 | 44 |
| Claude Opus 4.6 Anthropic · Proprietary | 40 | 1.78 | 92 |
| Claude 4 Sonnet Anthropic · Proprietary | 40 | 1.33 | 52 |
| Mistral Large 2 Mistral · Proprietary | 38 | 1.45 | 40 |
| DeepSeek V3.2 DeepSeek · Open Weight | 35 | 3.75 | 60 |
| Phi-4 Microsoft · Open Weight | 35 | 2.02 | 29 |
| GPT-4o mini OpenAI · Proprietary | 33 | 3.16 | 45 |
| Gemma 3 27B Google · Open Weight | 31 | 2.04 | 19 |
| GPT-4 Turbo OpenAI · Proprietary | 30 | 2.84 | 27 |
| Claude 4.1 Opus Anthropic · Proprietary | 29 | 1.66 | 53 |
| Claude 4.1 Opus Thinking Anthropic · Proprietary | 29 | 15 | 45 |
| Llama 3.1 405B Meta · Open Weight | 29 | 2.19 | 43 |
| o3-pro OpenAI · Proprietary | 27 | 84.93 | 60 |
Speed data sourced from Artificial Analysis. Metrics reflect median performance across providers. Reasoning models typically show higher TTFT due to chain-of-thought processing.
Frequently Asked Questions
What does tokens per second mean for LLMs?
Tokens per second (tok/s) measures how fast an LLM generates output text. Higher is better. A model at 200 tok/s produces roughly 150 words per second — fast enough for real-time streaming. Models below 50 tok/s may feel sluggish in interactive applications.
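The conversion above rests on a common heuristic of roughly 0.75 English words per token. A minimal sketch of that arithmetic, plus the time needed to stream a full answer:

```python
# Heuristic: an English token averages ~0.75 words (assumption, varies by tokenizer).
WORDS_PER_TOKEN = 0.75

def words_per_second(tokens_per_second: float) -> float:
    """Approximate reading-visible word throughput from token throughput."""
    return tokens_per_second * WORDS_PER_TOKEN

def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Wall time to stream n_tokens at a given output speed."""
    return n_tokens / tokens_per_second

print(words_per_second(200))        # -> 150.0 words/s
print(generation_seconds(500, 50))  # -> 10.0 s for a 500-token answer at 50 tok/s
```

At 50 tok/s a typical 500-token answer takes ten seconds to finish streaming, which is why sub-50 tok/s models feel sluggish interactively.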
What is TTFT (Time to First Token)?
TTFT (Time to First Token) measures the latency between sending a request and receiving the first token of the response. Lower is better. For chat applications, TTFT under 1 second feels instant. Reasoning models often have high TTFT (10-150s) because they "think" before responding.
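TTFT can be measured client-side by timing how long the first streamed chunk takes to arrive. A minimal sketch; `measure_ttft` and the fake stream are illustrative helpers, not any particular SDK's API:

```python
import time

def measure_ttft(stream):
    """Time from starting to consume a streamed response to its first chunk.

    `stream` is assumed to be a lazy iterator that performs the network
    request on the first next() call, as streaming chat APIs typically do.
    """
    start = time.perf_counter()
    first_chunk = next(iter(stream))
    ttft = time.perf_counter() - start
    return ttft, first_chunk

# Demo with a fake generator that "thinks" for 50 ms before its first token.
def fake_stream():
    time.sleep(0.05)
    yield "Hello"
    yield ", world"

ttft, first = measure_ttft(fake_stream())
print(f"TTFT: {ttft:.3f}s, first chunk: {first!r}")
```

Because a generator body only runs on the first `next()`, the timer correctly captures the pre-first-token delay.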
Which LLM is the fastest?
Currently, Mercury 2 by Inception is the fastest at 789 tokens/second. The fastest model scoring above 70 overall is Grok 4.20 at 233 tok/s.
Why are reasoning models slower?
Reasoning models (like o3, GPT-5, Gemini Deep Think) use chain-of-thought processing — they generate internal "thinking" tokens before producing the final answer. This adds significant TTFT latency (often 10-150 seconds) but can dramatically improve accuracy on complex tasks. The output speed (tok/s) once generation starts is usually comparable to standard models.
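The trade-off reduces to simple arithmetic: total wall time is TTFT plus output tokens divided by generation speed. A sketch using two figures from the table above (500 output tokens is an illustrative assumption):

```python
def total_latency(ttft_s: float, out_tokens: int, tok_per_s: float) -> float:
    """End-to-end wall time: wait for the first token, then stream the rest."""
    return ttft_s + out_tokens / tok_per_s

# Figures from the table above: a reasoning model vs. a fast standard model.
reasoning = total_latency(ttft_s=32.65, out_tokens=500, tok_per_s=109)  # Gemini 3 Pro
standard  = total_latency(ttft_s=0.50,  out_tokens=500, tok_per_s=221)  # Gemini 2.5 Flash
print(f"{reasoning:.1f}s vs {standard:.1f}s")  # -> 37.2s vs 2.8s
```

Note that TTFT dominates the reasoning model's total: the thinking phase costs far more wall time than the streaming phase.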