Compare inference speed across every major AI model. Tokens/sec measures output generation speed; TTFT (Time to First Token) measures response latency.
Speed data from Artificial Analysis. Last updated: 2026-03-31. Figures are median tokens/s and latency to the first answer chunk, in seconds.
- Fastest output: Mercury 2 · 789 tok/s · Inception
- Lowest TTFT: Ministral 3 3B · 0.42s TTFT · Mistral
- Fastest model scoring above 70: Grok 4.1 Fast · 138 tok/s · Score: 70
| Lab | Avg tok/s | Models | Avg TTFT |
|---|---|---|---|
| NVIDIA | 260 | 2 | 1.3s |
| xAI | 157 | 5 | 6s |
| Google | 136 | 7 | 13.5s |
| Mistral | 126 | 7 | 0.8s |
| OpenAI | 121 | 25 | 42.3s |
| Meta | 93 | 3 | 1.3s |
| Zhipu AI | 82 | 5 | 1.3s |
| Anthropic | 52 | 7 | 3.3s |
| DeepSeek | 48 | 2 | 2.3s |
| MiniMax | 46 | 2 | 2.3s |
| Moonshot AI | 44 | 2 | 1.9s |
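The per-lab averages are simple means over each lab's models in the per-model table below. A minimal sketch of that aggregation, using the NVIDIA and xAI rows as sample data (figures copied from the table):

```python
# Per-model (median tok/s, TTFT in seconds) pairs, copied from the model table
labs = {
    "NVIDIA": [(367, 0.71), (152, 1.9)],
    "xAI": [(233, 10.33), (190, 0.54), (172, 2.81), (138, 0.54), (54, 15.6)],
}

def lab_averages(models):
    """Return (avg tok/s rounded to an int, avg TTFT rounded to 0.1 s)."""
    n = len(models)
    avg_tok = round(sum(tok for tok, _ in models) / n)
    avg_ttft = round(sum(ttft for _, ttft in models) / n, 1)
    return avg_tok, avg_ttft

for lab, rows in labs.items():
    tok, ttft = lab_averages(rows)
    print(f"{lab}: {tok} tok/s avg · {ttft}s avg TTFT")
```

Running this reproduces the NVIDIA (260 tok/s, 1.3s) and xAI (157 tok/s, 6s) rows above.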
| Model (Lab · License) | Median tok/s | TTFT (s) | Score |
|---|---|---|---|
| Mercury 2 Inception · Proprietary | 789 | 3.88 | 49 |
| Nemotron 3 Super 100B NVIDIA · Open Weight | 367 | 0.71 | 56 |
| GPT-OSS 20B OpenAI · Open Weight | 313 | 0.65 | 36 |
| Ministral 3 3B Mistral · Open Weight | 274 | 0.42 | 19 |
| GPT-OSS 120B OpenAI · Open Weight | 262 | 0.79 | 50 |
| Grok 4.20 xAI · Proprietary | 233 | 10.33 | 9 |
| Gemini 2.5 Flash Google · Proprietary | 221 | 0.5 | 50 |
| Gemini 3.1 Flash-Lite Google · Proprietary | 205 | 7.5 | 56 |
| GPT-5.4 mini OpenAI · Proprietary | 201 | 3.85 | 66 |
| GPT-5.4 nano OpenAI · Proprietary | 191 | 3.64 | 58 |
| Grok 3 Mini xAI · Proprietary | 190 | 0.54 | 49 |
| Ministral 3 8B Mistral · Open Weight | 182 | 0.52 | 23 |
| GPT-4.1 nano OpenAI · Proprietary | 181 | 0.63 | 44 |
| Mistral Small 4 Mistral · Open Weight | 175 | 0.64 | 62 |
| Grok Code Fast 1 xAI · Proprietary | 172 | 2.81 | 56 |
| o4-mini (high) OpenAI · Proprietary | 161 | 21.94 | 58 |
| o3-mini OpenAI · Proprietary | 160 | 7.12 | 65 |
| Gemini 3 Flash Google · Proprietary | 159 | 1.19 | 67 |
| Nemotron 3 Nano 30B NVIDIA · Open Weight | 152 | 1.9 | 42 |
| Nova Pro Amazon · Proprietary | 141 | 0.81 | 33 |
| Grok 4.1 Fast xAI · Proprietary | 138 | 0.54 | 70 |
| Claude 3 Haiku Anthropic · Proprietary | 138 | 1.16 | 43 |
| GPT-5 nano OpenAI · Proprietary | 137 | 83.3 | 36 |
| GPT-4o OpenAI · Proprietary | 131 | 0.81 | 50 |
| MiMo-V2-Flash Xiaomi · Open Weight | 129 | 2.14 | 67 |
| Llama 4 Scout Meta · Open Weight | 128 | 0.7 | 44 |
| GPT-5.2-Codex OpenAI · Proprietary | 123 | 87.34 | 82 |
| Llama 4 Maverick Meta · Open Weight | 121 | 0.95 | 39 |
| o3 OpenAI · Proprietary | 118 | 5.38 | 64 |
| Gemini 2.5 Pro Google · Proprietary | 117 | 21.19 | 65 |
| GPT-5.1 OpenAI · Proprietary | 111 | 57.47 | 78 |
| Ministral 3 14B Mistral · Open Weight | 110 | 0.6 | 39 |
| Gemini 3.1 Pro Google · Proprietary | 109 | 29.71 | 87 |
| Gemini 3 Pro Google · Proprietary | 109 | 32.65 | 79 |
| GPT-4.1 OpenAI · Proprietary | 108 | 1.02 | 64 |
| GLM-4.5-Air Zhipu AI · Proprietary | 106 | 1.18 | 38 |
| o1 OpenAI · Proprietary | 98 | 32.29 | 64 |
| Qwen3.5 397B Alibaba · Open Weight | 96 | 2.44 | 68 |
| GLM-4.7-Flash Zhipu AI · Open Weight | 95 | 0.91 | 47 |
| LFM2-24B-A2B LiquidAI · Proprietary | 92 | 0.42 | 27 |
| Step 3.5 Flash StepFun · Open Weight | 87 | 3.03 | 53 |
| GPT-5 mini OpenAI · Proprietary | 86 | 65.32 | 51 |
| GPT-5 (high) OpenAI · Proprietary | 83 | 36.28 | 82 |
| GPT-5 (medium) OpenAI · Proprietary | 83 | 36.28 | 76 |
| GLM-4.7 Zhipu AI · Open Weight | 82 | 1.1 | 74 |
| GPT-4.1 mini OpenAI · Proprietary | 80 | 0.76 | 57 |
| GPT-5.3 Codex OpenAI · Proprietary | 79 | 88.26 | 85 |
| GPT-5.4 Pro OpenAI · Proprietary | 74 | 151.79 | 92 |
| GPT-5.4 OpenAI · Proprietary | 74 | 151.79 | 82 |
| GLM-5 Zhipu AI · Open Weight | 74 | 1.64 | 75 |
| GPT-5.2 OpenAI · Proprietary | 73 | 130.34 | 82 |
| DeepSeek R1 Distill Qwen 32B DeepSeek · Open Weight | 60 | 0.84 | 3 |
| Mistral Medium 3 Mistral · Proprietary | 57 | 1.2 | 53 |
| Grok 4 xAI · Proprietary | 54 | 15.6 | 68 |
| GLM-4.5 Zhipu AI · Proprietary | 51 | 1.45 | 40 |
| Mistral Large 3 Mistral · Proprietary | 48 | 1.04 | 58 |
| Claude Opus 4.5 Anthropic · Proprietary | 46 | 1.01 | 76 |
| MiniMax M2.5 MiniMax · Proprietary | 46 | 2.12 | 48 |
| Kimi K2.5 Moonshot AI · Open Weight | 45 | 2.38 | 72 |
| MiniMax M2.7 MiniMax · Proprietary | 45 | 2.53 | 66 |
| Claude Sonnet 4.6 Anthropic · Proprietary | 44 | 1.48 | 84 |
| Kimi K2 Moonshot AI · Proprietary | 43 | 1.51 | 53 |
| Claude Opus 4.6 Anthropic · Proprietary | 40 | 1.78 | 85 |
| Claude 4 Sonnet Anthropic · Proprietary | 40 | 1.33 | 62 |
| Mistral Large 2 Mistral · Proprietary | 38 | 1.45 | 52 |
| DeepSeek V3.2 DeepSeek · Open Weight | 35 | 3.75 | 61 |
| Phi-4 Microsoft · Open Weight | 35 | 2.02 | 40 |
| GPT-4o mini OpenAI · Proprietary | 33 | 3.16 | 54 |
| Gemma 3 27B Google · Open Weight | 31 | 2.04 | 35 |
| GPT-4 Turbo OpenAI · Proprietary | 30 | 2.84 | 43 |
| Claude 4.1 Opus Anthropic · Proprietary | 29 | 1.66 | 62 |
| Claude 4.1 Opus Thinking Anthropic · Proprietary | 29 | 15 | 57 |
| Llama 3.1 405B Meta · Open Weight | 29 | 2.19 | 53 |
| o3-pro OpenAI · Proprietary | 27 | 84.93 | 67 |
Speed data sourced from Artificial Analysis. Metrics reflect median performance across providers. Reasoning models typically show higher TTFT due to chain-of-thought processing.
Tokens per second (tok/s) measures how fast an LLM generates output text. Higher is better. A model at 200 tok/s produces roughly 150 words per second — fast enough for real-time streaming. Models below 50 tok/s may feel sluggish in interactive applications.
TTFT (Time to First Token) measures the latency between sending a request and receiving the first token of the response. Lower is better. For chat applications, TTFT under 1 second feels instant. Reasoning models often have high TTFT (10-150s) because they "think" before responding.
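The two metrics combine into a simple back-of-the-envelope estimate of responsiveness. A sketch, assuming the rough 0.75-words-per-token ratio for English text from the paragraph above:

```python
def words_per_sec(tok_per_sec: float) -> float:
    # ~0.75 English words per token, so 200 tok/s ≈ 150 words/s
    return tok_per_sec * 0.75

def total_latency_s(ttft_s: float, tok_per_sec: float, n_tokens: int) -> float:
    # end-to-end response time = wait for the first token + steady-state generation
    return ttft_s + n_tokens / tok_per_sec
```

For example, `total_latency_s(0.5, 100, 200)` gives 2.5 seconds for a 200-token answer from a model with 0.5s TTFT at 100 tok/s.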
Currently, Mercury 2 by Inception is the fastest at 789 tokens/second. The fastest model scoring above 70 overall is Grok 4.1 Fast at 138 tok/s.
Reasoning models (like o3, GPT-5, Gemini Deep Think) use chain-of-thought processing — they generate internal "thinking" tokens before producing the final answer. This adds significant TTFT latency (often 10-150 seconds) but can dramatically improve accuracy on complex tasks. The output speed (tok/s) once generation starts is usually comparable to standard models.
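To see how TTFT dominates for reasoning models, compare one standard and one reasoning model at a fixed answer length (speed figures copied from the table above; the 500-token answer length is an arbitrary assumption for illustration):

```python
# (median tok/s, TTFT in seconds) from the table above
standard = {"name": "GPT-4.1", "tok_s": 108, "ttft_s": 1.02}
reasoning = {"name": "GPT-5 (high)", "tok_s": 83, "ttft_s": 36.28}

def total_time_s(model: dict, n_tokens: int = 500) -> float:
    # "thinking" time is folded into TTFT; output speed is broadly comparable
    return model["ttft_s"] + n_tokens / model["tok_s"]

for m in (standard, reasoning):
    print(f"{m['name']}: {total_time_s(m):.1f}s for a 500-token answer")
```

The reasoning model takes roughly 42 seconds end to end versus under 6 for the standard model, with almost all of the gap coming from TTFT rather than generation speed.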