The original is one click away. Open original ↗
How AI voice agents work and their key constraints in 2025
Executive overview
Speech is 3x faster than typing, making voice the inevitable primary interface for AI. Most voice agents today chain speech-to-text, an LLM, and text-to-speech together — a pipeline with hard latency and cost limits. The two main reasons voice is viable now: inference costs have dropped ~5x in 18 months, and model speed has improved enough to hit the 500ms response threshold humans need to feel heard.
The bottleneck is not the model — it is the full pipeline: every millisecond counts, and every extra tool call multiplies failure risk.
How voice agents work
- Version 1: mic capture → speech-to-text → LLM → text-to-speech → audio playback
- Version 2: adds VAD (voice activation detection) to determine when to interrupt without being awkward
- Version 2.5: summarises prior conversation turns into text; passes only the latest turn as raw speech — preserves nuance without bloating the context window
- Version 3 (speech-to-speech): no transcription layer; AI processes audio directly — highest quality but expensive due to context window size; viable once context windows approach infinite scale
Tools voice agents can use
- Appointment booking (hairstylists, SaaS demo calls)
- Simple lookups (e.g. order status)
- Call transfer to a human when the AI reaches its limit
- Keep tool calls to 1-2 per interaction; each additional call compounds failure probability
Speed: the critical constraint
- Target latency: 500ms end-to-end (max: 800ms)
- Measure it by recording the conversation and counting milliseconds between user stop and AI start in an audio editor
- Three steps consume ~80% of total time: speech-to-text, LLM inference, text-to-speech
- With 800ms budget: 150ms speech-to-text + 190ms text-to-speech = only 460ms left for everything else
- Best models for intelligence-to-speed trade-off: GPT-4o and Gemini 2.0 Flash (Gemini is faster; Claude Sonnet is too slow for voice today)
Cost: the exponential trap
- Costs have fallen ~5x over 18 months across major models
- Conversation length drives cost exponentially: a 30-minute conversation can cost ~100x more than a 3-minute one
- Mitigation: summarise earlier turns in text before passing to the model (the 2.5 approach) to limit context window usage
Turn detection
- Naive approach: wait 0.8 seconds of silence, then respond — functional but robotic
- Advanced: semantic VAD — reads filler words ("um"), sentence-ending tone (rising vs. falling), and pause patterns to infer whether the user is done
- Open-source option: Smart Turn; OpenAI has its own semantic VAD variant
- Acceptable incorrect interruption rate: below 5%
Tool use in voice contexts
- Each tool call adds visible latency; signal this to the user with audio tones or a spoken acknowledgement ("I'm looking that up now")
- Async pattern: acknowledge the request, continue engaging the user, resolve the tool call in the background
- Limit tools to simple, well-defined actions; open-ended tool chains will fail with the faster, cheaper models suitable for voice
Use case selection: what works and what fails
- Fail — McDonald's drive-through (2024): 85% accuracy sounds acceptable, but open-ended ordering produced 200-nugget errors and bizarre combinations; chaos from unconstrained scope
- Win — ABN AMRO's Anna: 3.5 million conversations per year; automated 50% of call-centre interactions
- Structured use cases work; open-ended ones fail
Building your first voice agent
- Use Pipecat (open-source framework) to avoid rebuilding infrastructure from scratch
- Recommended model stack: DeepGram for speech-to-text, GPT-4o or Gemini 2.0 Flash for inference, Cartesia for speed or ElevenLabs for human-sounding output (at a latency cost)
- Constrained use cases to start: appointment setting, single-item lookups, surveys
- Start small, automate one workflow step, then expand
More like this — when you're ready for early access.
Join the waitlist for a personal account and content recommendations based on what you're working on.
No spam. Unsubscribe at any time.
You're on the list. We'll be in touch before launch.