Best LLM for Voice Agents

As of July 28, 2026, there is no universal best LLM for voice agents. The right model is the fastest one that clears your own task, instruction-following, and tool-use tests. We use measured first-answer latency to build the shortlist, but we do not turn that one metric into a fake voice-agent ranking. Your speech recognition, network path, text-to-speech layer, and call flow still decide what the caller hears.

This article contains partner links, marked before you reach them. Partner status never changes the models, platforms, or order discussed here. See the affiliate disclosure.

For the speech layer, see Best Text-to-Speech APIs. ElevenLabs is a partner and one option in that separate decision; its pricing calculator models API character volume and peak concurrency without treating voice quality as a measured result.

Start with one latency budget

A voice turn is a chain: speech recognition, model inference, speech generation, transport, and playback. Each stage adds delay. ElevenLabs' own latency guide makes the same distinction between model inference and the end-to-end time a user experiences. That distinction matters because a vendor's model-latency number is not your caller's time-to-first-audio.
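One way to make the chain concrete is to write the budget down as data. The sketch below sums per-stage delays and shows each stage's share of the total. Every number is a hypothetical placeholder, not a measurement from this article or any vendor, and the sum assumes the stages run strictly one after another; a streaming pipeline overlaps them, so treat this as a simplification.

```python
# Hypothetical per-stage delays in milliseconds. Replace with your own measurements.
STAGE_MS = {
    "speech_recognition": 200,
    "model_inference": 600,
    "speech_generation": 150,
    "transport": 80,
    "playback": 40,
}


def time_to_first_audio_ms(stages: dict[str, int]) -> int:
    """Sum every stage: the caller hears the total, not one vendor's number."""
    return sum(stages.values())


def share_of_budget(stages: dict[str, int]) -> dict[str, float]:
    """Fraction of the total delay contributed by each stage."""
    total = time_to_first_audio_ms(stages)
    return {name: ms / total for name, ms in stages.items()}


total_ms = time_to_first_audio_ms(STAGE_MS)
shares = share_of_budget(STAGE_MS)
```

With these placeholder values the model is the largest single stage but still only part of the total, which is why a model-latency figure alone cannot stand in for time-to-first-audio.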

Measure from the end of the user's speech to the first audible sample from the agent. Keep median and tail latency. A beautiful median can hide the calls that feel broken.
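A minimal sketch of that summary, using only Python's standard library. The sample values are invented to show how one slow call disappears from the median, and the choice of the 95th percentile as the "tail" is our assumption; pick the percentile that matches your call volume.

```python
import statistics


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Median and tail of end-of-speech to first-audio latency, in ms."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": statistics.median(samples_ms),
        "p95": cuts[94],
        "max": max(samples_ms),
    }


# Hypothetical measurements: nine ordinary turns and one that stalled.
samples = [820, 790, 860, 805, 840, 815, 830, 795, 850, 3100]
summary = latency_summary(samples)
```

Here the median stays close to the ordinary turns while the tail figure is pulled far upward by the single stalled call, which is the case a median-only report would hide.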

Then define the quality gates. A useful voice model must:

keep answers short enough to hear;
follow escalation and safety instructions;
call the right tool with the right arguments;
recover after interruptions and partial transcripts; and
produce output your speech layer can pronounce, not raw structured data.

The model with the highest overall score may clear those gates. It may also arrive late and read JSON aloud. Voice is impolite to abstractions.

Use the live table as a shortlist

The table below is rebuilt from the current runtime and model catalogs. It includes models above a minimum overall-quality threshold and orders them by first-answer latency.

Model	Latency (first answer)	Output speed	Type	Overall score
Claude Opus 4.5	1.01s	46 t/s	Non-Reasoning	63
GLM-4.7	1.10s	82 t/s	Reasoning	60
Gemini 3 Flash	1.19s	159 t/s	Non-Reasoning	60
GLM-4.5	1.45s	51 t/s	Non-Reasoning	57
Claude Sonnet 4.6	1.48s	44 t/s	Non-Reasoning	64
GLM-5	1.64s	74 t/s	Non-Reasoning	65
Claude Opus 4.6	1.78s	40 t/s	Non-Reasoning	68
MiniMax M2.5	2.12s	46 t/s	Non-Reasoning	59
Kimi K2.5	2.38s	45 t/s	Non-Reasoning	59
Qwen3.5 397B	2.44s	96 t/s	Non-Reasoning	56

The latency column is time to first answer as reported by the runtime source, not full voice-to-voice delay. Output speed and overall score are screening signals. The speed dashboard carries the current rows and source notes.

We deliberately stop short of naming the first row as the winner. Runtime observations can change with provider load, region, prompt length, and serving updates. More importantly, this table does not test your system prompt, tool schema, accents, background noise, or call-transfer rules.
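The screening step itself is small enough to sketch. The rows are copied from the table above; the score cutoff of 60 is an assumption chosen for illustration, since the table's actual threshold is not stated here. The output is a list of candidates to test, not a ranking of voice agents.

```python
# (model, first-answer latency in seconds, overall score), copied from the table.
ROWS = [
    ("Claude Opus 4.5", 1.01, 63),
    ("GLM-4.7", 1.10, 60),
    ("Gemini 3 Flash", 1.19, 60),
    ("GLM-4.5", 1.45, 57),
    ("Claude Sonnet 4.6", 1.48, 64),
    ("GLM-5", 1.64, 65),
    ("Claude Opus 4.6", 1.78, 68),
    ("MiniMax M2.5", 2.12, 59),
    ("Kimi K2.5", 2.38, 59),
    ("Qwen3.5 397B", 2.44, 56),
]


def shortlist(rows, min_score: int, k: int = 3) -> list[str]:
    """Keep rows at or above min_score, then take the k lowest latencies."""
    passing = [row for row in rows if row[2] >= min_score]
    passing.sort(key=lambda row: row[1])
    return [name for name, _latency, _score in passing[:k]]


# min_score=60 is a hypothetical cutoff; set it from your own quality bar.
candidates = shortlist(ROWS, min_score=60, k=3)
```

Changing the cutoff changes the list, which is one more reason to treat the result as input to a replay test rather than a conclusion.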

Take three candidates into a call replay set. Use real, consented transcripts or synthetic cases that match the same turn lengths and tool calls. Record task success, wrong-tool rate, first audio, interruptions, and cost. That small test tells you more than another paragraph about model families.
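A replay harness mostly needs a consistent record per call and one aggregate per model. The sketch below assumes a hypothetical ReplayResult record with the five fields named above; the sample values are invented, and how each field gets filled in depends on your own stack.

```python
import statistics
from dataclasses import dataclass


@dataclass
class ReplayResult:
    """One replayed call for one candidate model (hypothetical schema)."""
    task_success: bool
    wrong_tool: bool
    first_audio_ms: float
    interruptions: int
    cost_usd: float


def summarize(results: list[ReplayResult]) -> dict[str, float]:
    if not results:
        raise ValueError("no replay results to summarize")
    n = len(results)
    return {
        "task_success_rate": sum(r.task_success for r in results) / n,
        "wrong_tool_rate": sum(r.wrong_tool for r in results) / n,
        "median_first_audio_ms": statistics.median(r.first_audio_ms for r in results),
        "interruptions_per_call": sum(r.interruptions for r in results) / n,
        "cost_per_call_usd": sum(r.cost_usd for r in results) / n,
    }


# Invented results for a single candidate.
results = [
    ReplayResult(True, False, 900.0, 0, 0.02),
    ReplayResult(True, False, 1100.0, 1, 0.03),
    ReplayResult(False, True, 1500.0, 2, 0.04),
    ReplayResult(True, False, 1000.0, 0, 0.03),
]
report = summarize(results)
```

Run the same replay set against each of the three candidates and compare the reports side by side; a difference in wrong-tool rate is usually easier to act on than a difference in overall score.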

Keep slow work off the conversational path

A two-model design is useful when ordinary turns are simple but some tasks require deeper work.

Path	Job	Selection rule
Conversational model	Greeting, clarification, short answers, routine tool calls	Lowest measured latency that clears the call eval
Background model	Policy synthesis, multi-document lookup, complex calculations	Highest task success inside the allowed wait and cost

The conversational model can acknowledge the request, start the tool call, and continue when the result returns. ElevenAgents documents optional tool-call sounds to fill the wait during a long operation; the more general lesson is to tell the caller that work is happening instead of leaving an unexplained silence.

This pattern has a limit. Routing adds its own failure states: duplicate actions, stale results, and a fast model summarizing a slow model incorrectly. Keep one trace across the turn and test the handoff, not just the two models separately.
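A minimal sketch of the handoff with one trace ID across the turn, using Python's asyncio. The background_model coroutine and the speak callback are hypothetical stand-ins for a slow model call and a text-to-speech hook; no real platform API is shown.

```python
import asyncio
import uuid


async def background_model(request: str) -> str:
    # Hypothetical stand-in for a slow model or multi-document lookup.
    await asyncio.sleep(0.05)
    return f"Result for: {request}"


async def handle_turn(request: str, speak) -> str:
    trace_id = uuid.uuid4().hex  # one trace across the whole turn
    # Start the slow work first, then acknowledge while it runs.
    task = asyncio.create_task(background_model(request))
    speak(trace_id, "One moment while I check that.")
    result = await task
    speak(trace_id, result)
    return trace_id


spoken: list[tuple[str, str]] = []
trace = asyncio.run(
    handle_turn("refund policy", lambda tid, text: spoken.append((tid, text)))
)
```

Because both utterances carry the same trace ID, a stale or duplicated result can be tied back to the turn that produced it when you test the handoff.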

Map the platform before choosing it

We have not run a controlled benchmark across voice-agent platforms. The table is a product-architecture map built from current first-party documentation, not a quality or latency ranking.

Approach	Documented architecture	Model choice	Where it loses
OpenAI Realtime	Native realtime audio and speech-to-speech	OpenAI realtime models	Less component freedom
Vapi	Orchestrates transcription, model, voice, and telephony	Multiple providers or compatible custom endpoints	More vendor boundaries to debug
Retell	Voice-agent and telephony orchestration	Retell or configured response engine	Platform-specific call configuration
ElevenAgents	Speech, turn-taking, tools, workflows, and monitoring	Supported or custom LLM	More of the stack sits with one vendor
Build the chain	Your STT, model, TTS, transport, and state	Full control	You own interruption handling and operations

OpenAI's route removes boundaries between speech and reasoning. Vapi's documentation exposes replaceable transcriber, model, and voice components. ElevenAgents documents custom models, tools, telephony, testing, and analytics in one platform. Retell exposes agent and response-engine configuration. Those are verifiable differences; "sounds most human" is not, because we have not run the listening test.

Partner link: if the managed ElevenAgents architecture fits your replay test, start an ElevenLabs evaluation. Use the calculator first so the test includes the correct API quota and concurrency limit.

Pick by failure mode

For a customer-service phone agent, start with the fastest three models that pass the task suite. Tool errors and escalation failures matter more than a small leaderboard gap.

For an internal assistant, a slower model may be acceptable when users can see that work is in progress. Measure the actual workflow rather than borrowing a consumer-call threshold.

For complex backend work, split the conversational and background paths. Make the acknowledgment explicit, attach an idempotency key to actions, and prevent the background result from firing the same tool twice.
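The idempotency guard can be sketched in a few lines. This version keeps results in process memory, which is an assumption made for brevity; a production system would need a shared, durable store with an atomic check. The key format and the transfer action are hypothetical.

```python
class ActionRunner:
    """Runs each action at most once per idempotency key (in-memory sketch)."""

    def __init__(self) -> None:
        self._results: dict[str, object] = {}

    def run(self, idempotency_key: str, action, *args):
        if idempotency_key in self._results:
            # Same key seen before: return the stored result, do not re-fire.
            return self._results[idempotency_key]
        result = action(*args)
        self._results[idempotency_key] = result
        return result


calls: list[str] = []


def transfer(queue: str) -> str:
    # Hypothetical side-effecting tool.
    calls.append(queue)
    return f"transferred to {queue}"


runner = ActionRunner()
key = "call-123:turn-4:transfer"  # hypothetical key: call, turn, action
first = runner.run(key, transfer, "billing")
second = runner.run(key, transfer, "billing")  # e.g. the background result arriving late
```

The second call returns the stored result without invoking the tool again, so a late background result cannot transfer the caller twice.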

For multilingual calls, cross-check the multilingual leaderboard with the languages and voices your speech provider actually supports. Then test code-switching and names. A language appearing on two feature lists does not prove that the pair works well together.
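The cross-check is a set intersection, and its output is a list of pairs to test rather than proof that they work. The language codes and support lists below are hypothetical; take the real ones from your model's and speech provider's documentation.

```python
def languages_to_test(model_langs, speech_langs, required):
    """Split required languages into those worth testing and those unsupported."""
    supported = set(model_langs) & set(speech_langs)
    needed = set(required)
    return {
        "test": sorted(needed & supported),
        "unsupported": sorted(needed - supported),
    }


# Hypothetical support lists.
plan = languages_to_test(
    model_langs=["en", "es", "de", "hi"],
    speech_langs=["en", "es", "hi", "pt"],
    required=["en", "es", "de", "pt"],
)
```

Every language under "test" still needs replay cases with code-switching and names before it counts as supported.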

Reader questions

Frequently asked questions

01. What is the best LLM for voice agents in 2026?

The best LLM for a voice agent is the fastest model that still passes your task, instruction-following, and tool-use tests. There is no durable universal winner. Start with the live speed table, shortlist models that meet your quality bar, then measure full voice-to-voice latency on your own calls.

02. What is the lowest-latency LLM for voice AI?

The answer changes as providers update their serving stacks, so a model name copied into an article goes stale quickly. The live speed table on this page orders models above a minimum overall score by measured first-answer latency. Treat it as a shortlist, because region, load, prompt length, and your audio stack alter production latency.

03. Why does latency matter more than benchmark score for voice agents?

A caller experiences the delay before the first audible response, not an abstract model score. Speech recognition, model inference, text-to-speech, networking, and playback buffering all add time. A stronger model can still be the worse product choice when its delay makes people interrupt, repeat themselves, or abandon the call.

04. Can I use a reasoning model in a voice agent?

Yes, but test it in the conversational loop rather than assuming the label predicts latency. A useful pattern is to keep a fast model on ordinary turns and invoke a slower reasoning model as a tool only for hard background work. The voice agent can acknowledge the request while that tool call runs, then speak the result.

05. What is the best voice agent platform in 2026?

Choose by architecture. OpenAI Realtime provides native speech-to-speech. Vapi orchestrates replaceable transcription, model, and voice components. Retell orchestrates voice agents and telephony around a configurable response engine. ElevenAgents combines ElevenLabs speech with agent workflows, tools, and monitoring. We have not run a controlled platform benchmark, so this is a product-fit map rather than a performance ranking.

06. How much latency is too much for a voice agent?

There is no responsible universal cutoff without the call type and user population. Measure end-of-user-speech to first audible agent audio, then inspect the distribution rather than one demo. Track median and tail latency, interruptions, repeated questions, and abandonment. Set the budget from user behavior, then divide it across the pipeline.

Source ledger

External sources linked in this article

01. latency guide (elevenlabs.io)
02. tool-call sounds (elevenlabs.io)
03. OpenAI Realtime (platform.openai.com)
04. Vapi (docs.vapi.ai)
05. Retell (docs.retellai.com)
06. ElevenAgents (elevenlabs.io)
