Voice (text-to-speech)
The Voice tab controls how your agent sounds — the text-to-speech (TTS) engine that turns the agent's replies into spoken audio on a call.
Every reply your agent generates is text. The Voice tab decides which provider and voice convert that text into speech, how fast it speaks, what language it uses, and what happens if the primary voice fails. These settings apply to voice agents (phone calls).
Where Voice fits in the call loop
On a live call, audio flows in a loop:
1. The caller speaks → speech-to-text (Deepgram) transcribes it.
2. The agent's LLM generates a reply, grounded in your prompt, knowledge base, tools, and flow.
3. Text-to-speech (this tab) turns that reply into audio.
4. The audio streams back to the caller.
So the Voice tab is step 3 — the agent's actual voice.
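As a rough illustration of the loop above, here is a minimal sketch of one conversational turn. The function names and the stand-in stages are hypothetical and exist only to show the order of the steps; they are not the platform's API.

```python
from typing import Callable


def handle_turn(
    caller_audio: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one pass of the call loop and return the audio to play back."""
    transcript = transcribe(caller_audio)    # step 1: speech-to-text
    reply_text = generate_reply(transcript)  # step 2: LLM generates a reply
    reply_audio = synthesize(reply_text)     # step 3: text-to-speech (Voice tab)
    return reply_audio                       # step 4: audio goes back to the caller


# Hypothetical stand-in stages so the sketch runs without any provider.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate_reply(transcript: str) -> str:
    return f"You said: {transcript}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")
```

Calling `handle_turn(b"hello", fake_transcribe, fake_generate_reply, fake_synthesize)` passes the input through all three stages in order. In a real call, each stage would be a streaming connection rather than a single function call.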
Choosing a provider
Open the Voice tab in the agent builder and choose your TTS provider.
- Cartesia — the platform default, using the Sonic voices. It's ready to use with no setup, so it's the fastest way to get an agent talking.
- ElevenLabs — bring your own key. Connect your ElevenLabs account in Integrations first, then select it here to use its voices.
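The difference between the two providers comes down to a setup check, sketched below. The provider identifiers and the `provider_ready` helper are assumptions made for illustration; the platform performs this check for you in the builder.

```python
def provider_ready(provider: str, connected_integrations: set[str]) -> bool:
    """Return True if the chosen TTS provider can be used right away."""
    if provider == "cartesia":
        # Platform default: no setup required.
        return True
    if provider == "elevenlabs":
        # Bring your own key: the account must be connected in Integrations first.
        return "elevenlabs" in connected_integrations
    raise ValueError(f"Unknown TTS provider: {provider}")
```

For example, `provider_ready("elevenlabs", set())` is false until the ElevenLabs account has been connected.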
Picking a voice
Once you've chosen a provider, pick a voice from the voice library in the Voice tab. The library lists the voices available for your selected provider — browse and select the one that fits your agent's persona.
Speed
Set how fast the agent speaks with the speed control, which ranges from 0.5 to 2.0:
- Below 1.0 slows the voice down.
- 1.0 is the natural pace.
- Above 1.0 speeds it up.
Adjust to taste — a slightly measured pace often sounds clearer on a phone call.
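If you set speed values from your own scripts or tooling, a small helper like the one below can keep them inside the supported range. This is a sketch: the helper names are hypothetical, and clamping is an assumption, since the platform may reject out-of-range values instead of adjusting them.

```python
MIN_SPEED = 0.5  # slowest supported speed
MAX_SPEED = 2.0  # fastest supported speed


def clamp_speed(speed: float) -> float:
    """Pull a requested speed back into the 0.5 to 2.0 range."""
    return max(MIN_SPEED, min(MAX_SPEED, speed))


def describe_speed(speed: float) -> str:
    """Say how a speed value compares with the natural pace of 1.0."""
    if speed < 1.0:
        return "slower than natural"
    if speed > 1.0:
        return "faster than natural"
    return "natural pace"
```

A value such as 0.9 stays unchanged and reads as slightly slower than natural, which matches the measured pace suggested above.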
Language
Set the language for speech output so the voice pronounces your content correctly. Match this to the language your agent converses in.
Fallback voices
Configure one or more fallback voices. If the primary voice fails, the agent switches to a fallback automatically and keeps talking instead of going silent mid-call. Fallbacks are a good safety net for production agents.
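Conceptually, fallback works like trying each voice in order until one succeeds. The sketch below illustrates that idea only. The function and exception names are hypothetical, and the platform's actual retry and failure rules are not documented here.

```python
from typing import Callable, Sequence


class AllVoicesFailed(Exception):
    """Raised when the primary voice and every fallback voice fail."""


def speak_with_fallback(
    text: str,
    voices: Sequence[tuple[str, Callable[[str], bytes]]],
) -> tuple[str, bytes]:
    """Try each (name, synthesize) pair in order; return the first success."""
    errors = []
    for name, synthesize in voices:
        try:
            return name, synthesize(text)
        except Exception as exc:  # any failure moves on to the next voice
            errors.append(f"{name}: {exc}")
    raise AllVoicesFailed("; ".join(errors))
```

The first entry in `voices` plays the role of the primary voice, and the remaining entries are the fallbacks in priority order.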
How streaming keeps replies fast
The agent doesn't wait to finish its entire reply before speaking. Replies are streamed sentence by sentence: as soon as the first sentence of the LLM's response is ready, it's sent to text-to-speech and streamed back to the caller. This keeps the delay before the agent starts speaking to roughly one second.
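To make the sentence-by-sentence idea concrete, here is a minimal sketch that turns a stream of LLM text chunks into complete sentences as soon as each one is finished. The splitting rule (punctuation followed by whitespace) and the function name are assumptions for illustration. The platform's own segmentation is likely more sophisticated, for example around abbreviations.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ".", "!" or "?" followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as the incoming text completes it."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]  # keep the unfinished tail for the next chunk
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at the end of the reply
```

Each yielded sentence could be handed to text-to-speech immediately, so the first audio is ready while the LLM is still producing the rest of the reply.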
Because audio streams as it's produced, the caller can also barge in (talk over the agent), and the conversation stays natural and responsive.
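One way to picture barge-in is a playback loop that checks for caller speech before sending each audio chunk. This is a sketch under the assumption that the agent stops speaking when interrupted. The function and parameter names are hypothetical, and the platform's actual interruption handling is not described on this page.

```python
from typing import Callable, Iterable


def play_until_interrupted(
    audio_chunks: Iterable[bytes],
    send: Callable[[bytes], None],
    caller_is_speaking: Callable[[], bool],
) -> bool:
    """Send chunks to the caller; return False if the caller barged in."""
    for chunk in audio_chunks:
        if caller_is_speaking():
            return False  # stop playback so the caller can be heard
        send(chunk)
    return True  # the whole reply was played
```

Because audio is sent in small chunks, the check runs often and playback can stop shortly after the caller starts talking.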
Next steps
- Transcriber (speech-to-text) — tune how the agent hears the caller.
- Model — choose the LLM that generates the replies your voice speaks.
- Phone numbers — attach your agent to a number and take live calls.
- Integrations — connect ElevenLabs or other providers.