How AI voice agents actually work

Reddit r/AI_Agents 05/22/26, 11:17 AM News

ai-voice-agents speech-to-text text-to-speech llm orchestration telephony voice-activity-detection

Summary

A detailed explainer on the five-layer architecture of AI voice agents, including speech-to-text, LLM, text-to-speech, orchestrator, and telephony, all operating under a 500ms latency constraint to maintain natural conversation flow.

A voice agent isn't one model. It's five layers stitched together under a brutal constraint: anything over 500ms on a phone call feels unnatural.

Layer 1: Speech-to-text (100ms): converts raw audio to text. The key is streaming: transcribe as the customer speaks, don't wait for the full sentence. Waiting for silence before processing adds seconds of dead air.

Layer 2: LLM (200ms): reads the transcript, checks the knowledge base, generates a response. The LLM alone sounds generic. What makes it sound like your employee is the context layer injected before every response: product catalog, CRM data, customer history, playbooks, escalation rules.

Layer 3: Text-to-speech (150ms): converts the response back to natural-sounding audio. Chunked TTS is critical: start speaking the first sentence while the LLM is still generating the second. Voice cloning lets you match your brand's tone.

Layer 4: Orchestrator: the traffic controller. Manages state across the conversation, handles turn-taking, routes between the other layers. This is where the hardest problem lives: knowing when someone is done talking. Voice activity detection listens for silence. Endpointing algorithms distinguish a pause from a full stop. Barge-in handling lets the caller interrupt mid-sentence and makes the agent stop immediately. This is what separates a voice agent from an IVR menu.

Layer 5: Telephony: connects everything to actual phone lines. SIP trunking, call routing, the infrastructure that makes it a real phone call instead of a web demo.

In total, a response takes about 500ms.
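The chunked TTS idea from Layer 3 can be sketched roughly like this. It's a minimal illustration, not any vendor's actual pipeline: the function name sentence_chunks and the sample tokens are made up for the example, and the sentence boundary rule (., ! or ? followed by whitespace) is a simplifying assumption that will mis-split abbreviations like "e.g. this".

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of LLM tokens (hypothetical helper).

    A sentence is released once the text after its closing punctuation
    starts arriving, so TTS can begin before the LLM has finished.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    # Stand-in for a streaming LLM response (made-up tokens).
    llm_tokens = ["Sure", ", I can", " help.", " Your order", " ships", " Friday."]
    for sentence in sentence_chunks(llm_tokens):
        # A real agent would hand each sentence to its TTS engine here.
        print(sentence)
```

Because this is a generator, the first sentence can be handed to TTS while later tokens are still coming in, which is where the latency saving comes from.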

Similar Articles

Voice feels like the underrated output layer for AI agents

Building voice AI agents that take turns like humans — the gotchas nobody warns you about

What tasks are AI voice agents actually good at today?

What’s the Biggest Problem With AI Voice Agents Right Now?

What’s your current / best AI voice agents stack in 2026?
