Your voice agent probably isn't slow because of the LLM.

Reddit r/AI_Agents 06/17/26, 02:32 PM News

voice-agents latency debugging stt tts performance measurement

Summary

A developer debunks the common belief that LLM latency is the primary cause of slow voice agents, explaining that delays often stem from earlier stages like audio capture, VAD, and STT. They recommend logging specific latency metrics and testing various STT/TTS providers and orchestration frameworks to diagnose issues.

Hot take after debugging a few voice agent flows: Everyone blames the LLM first. But a lot of the “this voice agent feels slow” problem comes before the LLM even gets a stable transcript.

The delay can be from:

- mic/audio capture
- WebRTC / SIP / telephony
- VAD
- STT first partial
- STT final transcript
- endpointing
- LLM first token
- tool call
- TTS first audio
- audio playback
- interruption handling

If you only measure total response time, you learn nothing.

I’d log:

- user_speech_start
- stt_first_partial
- stt_final
- llm_first_token
- tool_call_start
- tool_call_done
- tts_first_audio
- playback_start
- barge_in_detected

For STT, I’d test Deepgram, AssemblyAI, Smallest AI Pulse, Speechmatics, Soniox, OpenAI realtime/transcribe. For TTS, ElevenLabs, Cartesia, Deepgram Aura, PlayHT. For orchestration, LiveKit/Pipecat/Vapi/Retell depending on how much control you want.

The weird part is that the fastest demo stack is not always the best production stack. Under real calls, endpointing and partial stability matter a lot.

How are you guys measuring latency? p50? p90? p95? Or just “does it feel human”?
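To make the logging idea concrete, here is a minimal Python sketch of per-turn event timestamps plus a p50/p90/p95 rollup. The event names come from the post; everything else (`TurnTimeline`, `SPANS`, `percentile`, `summarize`, and which pairs of events count as a "stage") is a hypothetical choice for illustration, not the API of LiveKit, Pipecat, or any provider mentioned above.

```python
import math
import time

# Event names from the post, in the assumed happy-path order.
EVENTS = [
    "user_speech_start", "stt_first_partial", "stt_final", "llm_first_token",
    "tool_call_start", "tool_call_done", "tts_first_audio", "playback_start",
    "barge_in_detected",
]

# Assumed stage boundaries: (label, start event, end event).
SPANS = [
    ("stt_first_partial", "user_speech_start", "stt_first_partial"),
    ("llm_ttft", "stt_final", "llm_first_token"),
    ("tool_call", "tool_call_start", "tool_call_done"),
    ("tts_ttfa", "llm_first_token", "tts_first_audio"),
    ("playback", "tts_first_audio", "playback_start"),
]


class TurnTimeline:
    """Hypothetical per-turn event log; one instance per user turn."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # monotonic, so wall-clock jumps don't skew spans
        self.marks = {}

    def mark(self, event):
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        # Keep only the first occurrence of each event within a turn.
        if event not in self.marks:
            self.marks[event] = self._clock()

    def span_ms(self, start, end):
        """Milliseconds between two events, or None if either is missing."""
        if start not in self.marks or end not in self.marks:
            return None
        return (self.marks[end] - self.marks[start]) * 1000.0


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(turns, spans=SPANS):
    """Per-stage p50/p90/p95 across turns, skipping stages with no samples."""
    report = {}
    for name, start, end in spans:
        samples = [ms for t in turns if (ms := t.span_ms(start, end)) is not None]
        if samples:
            report[name] = {f"p{p}": percentile(samples, p) for p in (50, 90, 95)}
    return report
```

In use, you would call `turn.mark("stt_final")` and so on from whatever callbacks your stack exposes, then run `summarize(turns)` over a batch of real calls. Reporting per stage rather than per turn is the point: a bad p95 on one stage stays visible instead of being averaged into the total.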


Similar Articles

6 months running a production voice agent for service businesses. The latency math is way harder than the demos suggest.

Reddit r/ArtificialInteligence

After 6 months running a voice AI agent for service businesses, the author reveals that real-world latency is bimodal (median ~800ms, p95 ~2.4s) and this p95 determines user perception. Issues like VAD misfires, function call degradation with long prompts, and TTS quality matter more than LLM choice, with multilingual support adding significant costs.

Our voice agent's p99 was 280ms. Competitor's was 450ms. Users said ours felt slower. We measured why.

Reddit r/AI_Agents

A voice agent team found that despite a lower p99 latency (280ms vs the competitor's 450ms), users perceived their agent as slower because of a poor barge-in interrupt rate (380ms vs 60ms). They identified three fixes (memory pinning, VAD threshold tuning, and smaller TTS chunks) that improved barge-in rate from 41% to 89% at 100ms, which made the agent feel faster to users.

How AI voice agents actually work

Reddit r/AI_Agents

A detailed explainer on the five-layer architecture of AI voice agents, including speech-to-text, LLM, text-to-speech, orchestrator, and telephony, all operating under a 500ms latency constraint to maintain natural conversation flow.

I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)

Reddit r/LocalLLaMA

A fully offline, CPU-only voice loop for local LLMs using Silero VAD, Parakeet STT, and Supertonic TTS, integrated via a one-command installer. Works with Ollama, LM Studio, and various agent frameworks.

Latency matters more than model selection when building AI tutoring systems

Reddit r/AI_Agents

A practitioner argues that speech start latency—not model selection—is the critical factor in AI tutoring systems, recommending targets under 1 second for speech start and highlighting streaming TTS as the highest-leverage optimization. The post outlines a full pipeline from ASR through TTS and avatar sync, identifying where latency compounds most.
