6 months running a production voice agent for service businesses. The latency math is way harder than the demos suggest.

Reddit r/ArtificialInteligence 05/15/26, 06:26 AM News

voice-ai production latency vad function-calling tts multilingual service-business

Summary

After 6 months running a voice AI agent for service businesses, the author reveals that real-world latency is bimodal (median ~800ms, p95 ~2.4s) and this p95 determines user perception. Issues like VAD misfires, function call degradation with long prompts, and TTS quality matter more than LLM choice, with multilingual support adding significant costs.

building a voice AI for restaurants and salons for the last 6 months. wanted to share some technical reality vs the “800ms latency” demos everyone shows. what nobody talks about:

latency is bimodal, not average. demos show median latency. real users churn on the p95. our median is ~800ms, p95 is 2.4s. that p95 is what determines if the agent feels human or broken. it comes from rare edge cases: model retry on malformed function call output, slow tool execution (calendar lookup against a slow third-party API), VAD misfires on background noise.

interruption handling breaks more often than the conversation itself. users interrupt the agent constantly. naive VAD treats every cough or background noise as interruption. we ended up with a 3-layer system: VAD signal + semantic check (is what they said actually a continuation?) + acoustic energy threshold. still wrong maybe 5% of the time.

function calling reliability degrades with prompt length. with system prompt under 1.5k tokens, function call accuracy is 96%. above 3k tokens, drops to 84% on the same model. nobody tells you this when you stuff personality, business rules, and few-shot examples into one prompt.

TTS choice matters more than LLM choice for perceived quality. users complain about robotic voice 10x more than about wrong answers. swapping LLM from GPT-4 to Claude or Gemini moved business metrics 2%. swapping TTS from generic to ElevenLabs Flash moved booking conversion 14%.

multilingual is a tax on everything. we support 50+ languages. each language adds: separate TTS voice tuning, separate VAD calibration (some languages have more sibilants which confuse VAD), separate few-shot examples in the prompt. cost per call in Russian is ~40% higher than English purely because of these calibrations.

anyone else running voice agents in production? curious what your p95 looks like and how you’re handling the multilingual cost explosion.
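
to make the median vs p95 point concrete, here is a minimal sketch in Python. the latency numbers are synthetic and made up for illustration (90 fast turns near 800ms, 10 slow turns standing in for retries, slow tool calls and VAD misfires), and the helper names `percentile` and `summarize` are hypothetical, not from any real stack. it uses a plain nearest-rank percentile.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the three numbers people tend to quote for turn latency."""
    return {
        "median": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "mean": sum(latencies_ms) / len(latencies_ms),
    }


# Synthetic data (an assumption, not measurements from the post):
# 90 turns on the fast path, spread from 700ms to 878ms.
fast_turns = [700 + 2 * i for i in range(90)]
# 10 turns on a slow path (retry, slow calendar API, VAD misfire),
# spread from 2000ms to 2900ms.
slow_turns = [2000 + 100 * i for i in range(10)]

stats = summarize(fast_turns + slow_turns)
print(stats)  # median 798, p95 2400, mean 955.1
```

with only 10% of turns on the slow path, the median barely moves (798ms) and even the mean looks fine (955.1ms), but the p95 lands at 2400ms. that is the shape a demo number hides: the distribution has two humps, and the second one is what shows up in the tail.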

Similar Articles

Our voice agent's p99 was 280ms. Competitor's was 450ms. Users said ours felt slower. We measured why.

@svpino: Humans have an average of 200-250 ms of latency when speaking to each other. This voice model is even faster: only 110 …

Latency matters more than model selection when building AI tutoring systems

I tested 5 AI voice agent platforms in 2026 on real calls — here’s my honest ranking

What’s your current / best AI voice agents stack in 2026?
