6 months running a production voice agent for service businesses. The latency math is way harder than the demos suggest.

Reddit r/ArtificialInteligence News

Summary

After 6 months running a voice AI agent for service businesses, the author reports that real-world latency is bimodal (median ~800ms, p95 2.4s) and that the p95, not the median, determines whether the agent feels human or broken. VAD misfires, function-calling accuracy that drops as the system prompt grows, and TTS quality matter more than LLM choice, and multilingual support raises per-call cost significantly.

building a voice AI for restaurants and salons for the last 6 months. wanted to share some technical reality vs the “800ms latency” demos everyone shows.

what nobody talks about:

latency is bimodal, not average. demos show median latency. real users churn on the p95. our median is ~800ms, p95 is 2.4s. that p95 is what determines if the agent feels human or broken. it comes from rare edge cases: model retry on malformed function call output, slow tool execution (calendar lookup against a slow third-party API), VAD misfires on background noise.

interruption handling breaks more often than the conversation itself. users interrupt the agent constantly. naive VAD treats every cough or background noise as interruption. we ended up with a 3-layer system: VAD signal + semantic check (is what they said actually a continuation?) + acoustic energy threshold. still wrong maybe 5% of the time.

function calling reliability degrades with prompt length. with system prompt under 1.5k tokens, function call accuracy is 96%. above 3k tokens, drops to 84% on the same model. nobody tells you this when you stuff personality, business rules, and few-shot examples into one prompt.

TTS choice matters more than LLM choice for perceived quality. users complain about robotic voice 10x more than about wrong answers. swapping LLM from GPT-4 to Claude or Gemini moved business metrics 2%. swapping TTS from generic to ElevenLabs Flash moved booking conversion 14%.

multilingual is a tax on everything. we support 50+ languages. each language adds: separate TTS voice tuning, separate VAD calibration (some languages have more sibilants which confuse VAD), separate few-shot examples in the prompt. cost per call in Russian is ~40% higher than English purely because of these calibrations.

anyone else running voice agents in production? curious what your p95 looks like and how you’re handling the multilingual cost explosion.
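
to make the median vs p95 point concrete, here is a minimal Python sketch. the latency numbers are synthetic, made up for illustration and not measurements from this post: 90 fast turns plus 10 slow ones standing in for retries, slow tool calls, and VAD misfires. the `percentile` helper is a hypothetical nearest-rank implementation, not any particular library's API.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that
    at least pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic per-turn latencies in milliseconds (assumed values, not real data).
fast_turns = [700 + 2 * i for i in range(90)]     # 700..878 ms, the "demo" mode
slow_turns = [2000 + 100 * i for i in range(10)]  # 2000..2900 ms, the edge cases
latencies_ms = fast_turns + slow_turns

median_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(f"median={median_ms}ms mean={mean_ms:.1f}ms p95={p95_ms}ms")
```

with only 10% of turns in the slow mode, the median stays inside the fast cluster and the mean moves only a little, while the p95 lands squarely in the slow cluster. that is why a dashboard tracking median or mean can look healthy while one call in twenty feels broken.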
