How AI voice agents work and their key constraints in 2025

Executive overview

Speech is 3x faster than typing, making voice the inevitable primary interface for AI. Most voice agents today chain speech-to-text, an LLM, and text-to-speech together — a pipeline with hard latency and cost limits. The two main reasons voice is viable now: inference costs have dropped ~5x in 18 months, and model speed has improved enough to hit the 500ms response threshold humans need to feel heard.

The bottleneck is not the model — it is the full pipeline: every millisecond counts, and every extra tool call multiplies failure risk.

How voice agents work

Version 1: mic capture → speech-to-text → LLM → text-to-speech → audio playback
Version 2: adds VAD (voice activity detection) to detect when the user starts and stops speaking, so the agent can take its turn or yield to an interruption without being awkward
Version 2.5: summarises prior conversation turns into text; passes only the latest turn as raw speech — preserves nuance without bloating the context window
Version 3 (speech-to-speech): no transcription layer; the model processes audio directly. Highest quality, but expensive because raw audio fills the context window far faster than text; becomes viable as context windows grow much larger
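The Version 1 chain can be sketched in a few lines. This is a minimal illustration, not any framework's API: the `CascadedAgent` class and the three stub functions are hypothetical stand-ins for real speech-to-text, LLM, and text-to-speech services.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CascadedAgent:
    """Hypothetical Version 1 agent: speech-to-text -> LLM -> text-to-speech."""
    transcribe: Callable[[bytes], str]   # speech-to-text stage
    generate: Callable[[str], str]       # LLM stage
    synthesize: Callable[[str], bytes]   # text-to-speech stage
    history: List[str] = field(default_factory=list)

    def respond(self, audio_in: bytes) -> bytes:
        # 1. Turn the captured audio into text.
        user_text = self.transcribe(audio_in)
        self.history.append(f"user: {user_text}")
        # 2. Ask the LLM for a reply, given the whole conversation so far.
        reply = self.generate("\n".join(self.history))
        self.history.append(f"agent: {reply}")
        # 3. Turn the reply back into audio for playback.
        return self.synthesize(reply)


# Stub stages so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return "echo: " + prompt.splitlines()[-1]


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


agent = CascadedAgent(fake_stt, fake_llm, fake_tts)
audio_out = agent.respond(b"hi")
```

Each stage is a plain callable, so swapping a provider or adding a summarisation step in the style of Version 2.5 only changes what is passed to `generate`.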

Tools voice agents can use

Appointment booking (hairstylists, SaaS demo calls)
Simple lookups (e.g. order status)
Call transfer to a human when the AI reaches its limit
Keep tool calls to 1-2 per interaction; each additional call compounds failure probability
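The compounding effect can be made concrete with a short calculation. The sketch below assumes each tool call succeeds independently with the same probability; the 95% figure is an illustrative assumption, not a number from the source.

```python
def chain_success(per_call_success: float, calls: int) -> float:
    """Probability that every call in a chain succeeds, assuming independence."""
    return per_call_success ** calls


# Illustrative only: 0.95 per-call reliability is an assumed value.
for n in (1, 2, 5):
    print(n, round(chain_success(0.95, n), 3))
```

Under these assumptions two calls already lower the success rate to about 90%, and five calls lower it to about 77%.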

Speed: the critical constraint

Target latency: 500ms end-to-end (max: 800ms)
Measure it by recording the conversation and counting milliseconds between user stop and AI start in an audio editor
Three steps consume ~80% of total time: speech-to-text, LLM inference, text-to-speech
With an 800ms budget, 150ms for speech-to-text plus 190ms for text-to-speech leaves only 460ms for everything else, including LLM inference
Best models for intelligence-to-speed trade-off: GPT-4o and Gemini 2.0 Flash (Gemini is faster; Claude Sonnet is too slow for voice today)
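The budget arithmetic above can be checked with a small helper. The function name and the stage labels are hypothetical; the millisecond figures are the ones quoted in this section.

```python
def remaining_budget_ms(total_ms: int, stage_ms: dict) -> int:
    """Milliseconds left after subtracting the fixed stages from the total budget."""
    return total_ms - sum(stage_ms.values())


fixed_stages = {"speech_to_text": 150, "text_to_speech": 190}

left_at_max = remaining_budget_ms(800, fixed_stages)     # 460
left_at_target = remaining_budget_ms(500, fixed_stages)  # 160
```

At the 500ms target the same two stages leave only 160ms, which is why model speed dominates the choice of LLM.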

Cost: the exponential trap

Costs have fallen ~5x over 18 months across major models
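Conversation length drives cost far faster than linearly, because every turn re-sends the whole conversation so far: a 30-minute conversation can cost ~100x more than a 3-minute one (10x the length, roughly the square of that in cost)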
Mitigation: summarise earlier turns in text before passing to the model (the 2.5 approach) to limit context window usage
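A rough model shows why length is so expensive and why summarising helps. It assumes every turn adds the same number of tokens and that the full history is re-sent on each turn; the token counts are illustrative, and the cost of producing the summary itself is ignored.

```python
def full_history_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when turn k re-sends all k turns so far."""
    return sum(tokens_per_turn * k for k in range(1, turns + 1))


def summarised_tokens(turns: int, tokens_per_turn: int, summary_tokens: int) -> int:
    """Total input tokens when each turn sends a fixed-size summary plus the latest turn."""
    return turns * (summary_tokens + tokens_per_turn)


# Assumed figures: 100 tokens per turn, 6 turns vs 60 turns (10x longer).
short_full = full_history_tokens(6, 100)     # 2100
long_full = full_history_tokens(60, 100)     # 183000
short_sum = summarised_tokens(6, 100, 300)   # 2400
long_sum = summarised_tokens(60, 100, 300)   # 24000
```

With these numbers the 10x longer conversation costs about 87x more when the full history is re-sent, but only 10x more with a fixed-size summary.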

Turn detection

Naive approach: wait for 0.8 seconds of silence, then respond; functional but robotic
Advanced: semantic VAD — reads filler words ("um"), sentence-ending tone (rising vs. falling), and pause patterns to infer whether the user is done
Open-source option: Smart Turn; OpenAI has its own semantic VAD variant
Acceptable incorrect interruption rate: below 5%
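The naive silence-timeout approach can be sketched as below. The 20ms frame size, the energy threshold, and the function name are assumptions for illustration; a semantic VAD would replace the fixed timeout with a model's judgement.

```python
from typing import Optional, Sequence


def end_of_turn_index(
    energies: Sequence[float],
    frame_ms: int = 20,        # assumed frame duration
    silence_ms: int = 800,     # the 0.8 second timeout
    threshold: float = 0.01,   # assumed energy level separating speech from silence
) -> Optional[int]:
    """Return the frame index at which the turn is judged over, or None."""
    frames_needed = silence_ms // frame_ms
    silent_run = 0
    heard_speech = False
    for i, energy in enumerate(energies):
        if energy >= threshold:
            heard_speech = True
            silent_run = 0          # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1         # only count silence after speech has started
            if silent_run >= frames_needed:
                return i
    return None


# 10 silent frames, 25 speech frames, then 50 silent frames.
frames = [0.0] * 10 + [0.5] * 25 + [0.0] * 50
turn_end = end_of_turn_index(frames)

# A 400ms mid-sentence pause does not trigger a response.
paused = [0.5] * 5 + [0.0] * 20 + [0.5] * 5
no_end = end_of_turn_index(paused)
```

The fixed timeout treats a thoughtful pause and a finished sentence identically, which is the robotic behaviour semantic VAD is meant to remove.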

Tool use in voice contexts

Each tool call adds visible latency; signal this to the user with audio tones or a spoken acknowledgement ("I'm looking that up now")
Async pattern: acknowledge the request, continue engaging the user, resolve the tool call in the background
Limit tools to simple, well-defined actions; open-ended tool chains will fail with the faster, cheaper models suitable for voice
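The async pattern can be sketched with the standard library's `asyncio`. The `lookup_order_status` coroutine and the `say` callback are hypothetical stand-ins for a real backend call and a real text-to-speech output.

```python
import asyncio
from typing import Awaitable, Callable, List


async def lookup_order_status(order_id: str) -> str:
    """Hypothetical slow tool call; the sleep stands in for a backend request."""
    await asyncio.sleep(0.05)
    return f"Order {order_id} has shipped"


async def handle_request(order_id: str, say: Callable[[str], Awaitable[None]]) -> None:
    # Start the tool call in the background first.
    task = asyncio.create_task(lookup_order_status(order_id))
    # Acknowledge immediately so the user is not left in silence.
    await say("I'm looking that up now.")
    # Speak the result once the background call resolves.
    await say(await task)


def run_demo() -> List[str]:
    spoken: List[str] = []

    async def say(text: str) -> None:
        spoken.append(text)   # a real agent would send this to text-to-speech

    asyncio.run(handle_request("A123", say))
    return spoken
```

Starting the task before speaking means the acknowledgement overlaps with the lookup instead of adding to its latency.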

Use case selection: what works and what fails

Fail — McDonald's drive-through (2024): 85% accuracy sounds acceptable, but open-ended ordering produced 200-nugget errors and bizarre combinations; chaos from unconstrained scope
Win — ABN AMRO's Anna: 3.5 million conversations per year; automated 50% of call-centre interactions
Structured use cases work; open-ended ones fail

Building your first voice agent

Use Pipecat (open-source framework) to avoid rebuilding infrastructure from scratch
Recommended model stack: Deepgram for speech-to-text, GPT-4o or Gemini 2.0 Flash for inference, Cartesia for speed or ElevenLabs for human-sounding output (at a latency cost)
Constrained use cases to start: appointment setting, single-item lookups, surveys
Start small, automate one workflow step, then expand
