As of July 2026, ElevenLabs is our pick for production voice quality, with Cartesia for latency-critical streaming and cloud TTS for cost at scale. How we compare the speech layer of the AI app stack — and which API fits which build.
As of July 2026, the best text-to-speech API for most AI products is ElevenLabs — it remains the quality bar the rest of the market is measured against, with a free tier that covers building and demoing an agent before you pay. If your constraint is streaming latency rather than maximal naturalness, Cartesia is the pick; if it's unit cost at massive volume, the cloud providers win on price.
Some links below are partner links (marked). Partners never affect which tools appear, their order, or our verdicts — same rule as our model rankings.
This roundup covers the voice layer of the BenchLM AI App Stack: the API that turns your agent's text into audio. It deliberately does not rank voice agent platforms (Retell, Vapi, OpenAI Realtime) or the LLMs behind them — that's a different decision, covered in Best LLM for Voice Agents. Pick your model there; pick its voice here.
| Tool | Best for | Pricing model | Free tier | Standout |
|---|---|---|---|---|
| ElevenLabs | Production voice agents, voice-first products | Subscription tiers by usage | Yes — enough to demo an agent | Quality bar + cloning + streaming |
| Cartesia | Latency-critical streaming | Usage-based | Yes | Time-to-first-audio |
| OpenAI TTS / Realtime | Teams already all-in on OpenAI | Per-token/usage | Via API credits | One-vendor stack simplicity |
| Google Cloud TTS | High-volume, cost-sensitive workloads | Per character | Monthly free quota | Price at scale, language breadth |
| Amazon Polly | AWS-native products | Per character | 12-month free quota | AWS integration, price |
| PlayHT | Voice cloning breadth | Subscription tiers | Limited | Large voice library |
| Kokoro (open-weight) | Self-hosters | Your GPU bill | n/a | No per-character cost |
Prices in this category change often — check each vendor's live pricing page rather than a blog table (including ours). For the model side of the same latency budget, see LLM speed rankings.
ElevenLabs (partner link) is the default recommendation for the same reason it appears in every serious voice agent tutorial: the voices survive production. Interruptions, resumed sentences, numbers and URLs read aloud, emotional register on support calls — the failure modes that make cheaper TTS feel robotic are the ones it handles. Cloning and style direction are mature, streaming is first-class, and the free tier is genuinely enough to build and test a working agent before paying anything.
Honest limits: at very high volume the per-character economics favor the cloud providers, and if your product is a phone tree reading account balances, you're paying for naturalness you don't need. That's what the scenarios below are for.
Cartesia built its reputation on time-to-first-audio, and for live conversational agents where the human is waiting in silence, that's the metric that decides whether your product feels alive. If you've already minimized model latency with a fast LLM and you're still missing your end-to-end budget, this is the lever left.
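Time-to-first-audio is easy to measure yourself before committing to a vendor. The sketch below is a minimal, vendor-neutral harness: it assumes only that your client exposes the synthesized audio as an iterable of byte chunks. `fake_stream` is a hypothetical stand-in for a real streaming call, not any provider's SDK, and the delay and chunk size in it are made up.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def time_to_first_audio(
    open_stream: Callable[[], Iterable[bytes]],
) -> Tuple[Optional[float], float, int]:
    """Open a stream, drain it, and time it.

    Returns (seconds until the first non-empty chunk, total seconds, total bytes).
    The first value is None if the stream never yields audio.
    The clock starts before the stream is opened, so request setup is included.
    """
    start = time.perf_counter()
    first: Optional[float] = None
    total_bytes = 0
    for chunk in open_stream():
        if chunk and first is None:
            first = time.perf_counter() - start
        total_bytes += len(chunk)
    return first, time.perf_counter() - start, total_bytes


def fake_stream(delay_s: float, n_chunks: int) -> Iterator[bytes]:
    """Hypothetical stand-in for a vendor's streaming response."""
    time.sleep(delay_s)  # simulated wait before the first audio arrives
    for _ in range(n_chunks):
        yield b"\x00" * 320  # placeholder chunk of silence


ttfa, total, size = time_to_first_audio(lambda: fake_stream(0.05, 10))
print(f"first audio after {ttfa:.3f}s, {size} bytes in {total:.3f}s")
```

To compare providers, replace the lambda with a call into each real client and run it from the region where your agent will be deployed. Add the result to your LLM's time-to-first-token to see whether the combined figure fits your end-to-end budget.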
Google Cloud TTS and Amazon Polly are a tier below on naturalness and a tier above on economics. For read-aloud features, notifications, accessibility, and high-volume IVR — anywhere the voice is a feature rather than the product — they're the rational pick, especially if you're already on that cloud.
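To see where the economics cross over for your own volume, it can help to model the two pricing shapes from the table, per-character billing and subscription tiers, side by side. The sketch below assumes a simple form for each: a flat per-character rate with a monthly free quota, and a monthly fee with an included allowance plus overage. Every number in it is a made-up placeholder, not any vendor's actual price; substitute figures from the live pricing pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerCharacterPlan:
    usd_per_million_chars: float
    free_chars_per_month: int = 0

    def monthly_cost(self, chars: int) -> float:
        # Only characters beyond the free quota are billed.
        billable = max(0, chars - self.free_chars_per_month)
        return billable * self.usd_per_million_chars / 1_000_000


@dataclass(frozen=True)
class SubscriptionPlan:
    usd_per_month: float
    included_chars: int
    overage_usd_per_million_chars: float

    def monthly_cost(self, chars: int) -> float:
        # The flat fee is always paid; overage applies past the allowance.
        overage = max(0, chars - self.included_chars)
        return self.usd_per_month + overage * self.overage_usd_per_million_chars / 1_000_000


# Placeholder prices for illustration only (NOT real vendor pricing).
per_char = PerCharacterPlan(usd_per_million_chars=10.0, free_chars_per_month=1_000_000)
subscription = SubscriptionPlan(
    usd_per_month=100.0,
    included_chars=2_000_000,
    overage_usd_per_million_chars=50.0,
)

for chars in (500_000, 2_000_000, 20_000_000):
    print(chars, per_char.monthly_cost(chars), subscription.monthly_cost(chars))
```

The model ignores things real plans often include, such as different rates per voice tier or committed-use discounts, so treat it as a first pass rather than a quote.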
OpenAI's TTS and Realtime APIs are the simplicity play: one API key, one bill, speech and reasoning from the same stack. You trade away voice control and the quality ceiling; you gain never thinking about the integration again.
Voice is layer 5 of the AI App Stack. Upstream of it are two other decisions: which LLM powers the agent (ranked by latency in LLM speed rankings) and where the whole thing runs. The stack pillar has the full map.