Best STT API for voice agents? I’d test latency before accuracy

Reddit r/AI_Agents 06/25/26, 10:28 AM News

speech-to-text stt voice-agents latency real-time api-evaluation

Summary

The author argues that for live voice agents, STT latency and real-time behavior are more critical than raw transcription accuracy, and proposes a different evaluation scorecard.

I used to think the “best STT for voice agents” question meant: which one has the best transcription accuracy?

I don’t think that anymore.

For live agents, the transcript can be technically accurate and still ruin the call if it arrives late or keeps changing. The user doesn’t care that your WER is good. They feel:

- “why did the bot pause?”
- “why did it answer before I finished?”
- “why did it miss the number I corrected?”
- “why is it talking over me?”

So my current test is less “which STT is most accurate?” and more: can the rest of the agent safely use the text fast enough?

I’m trying a LiveKit + Langfuse setup where I log every turn:

- user starts talking
- first transcript fragment
- usable transcript
- LLM starts
- tool call
- voice starts
- user interrupts
- agent shuts up

Smallest AI Pulse is on my shortlist here for a specific reason: I don’t want to evaluate it like a file transcription tool. I want to see whether it behaves like a real-time listening layer for a voice agent.

For this use case, my scorecard would be:

- first usable text
- final transcript delay
- partial rewrite chaos
- endpointing
- barge-in behavior
- phone audio
- names/numbers/dates
- p95 turn latency

Accuracy still matters, obviously. But for voice agents, latency decides whether the whole thing feels alive or fake.

Anyone else measuring STT this way?
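A rough sketch of how the per-turn log could be turned into a few of the scorecard numbers. Everything here is an assumption made for illustration: the `TurnLog` fields, the function names, the nearest-rank percentile, and the "rewrite" definition (a partial that is not an extension of the previous one). None of it is a LiveKit, Langfuse, or Smallest AI API; timestamps are assumed to be seconds on a single clock.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnLog:
    """Timestamps (seconds) for one user turn. Field names are hypothetical."""
    user_speech_start: float
    user_speech_end: float
    first_partial: float        # first transcript fragment arrives
    usable_transcript: float    # text is stable enough for the agent to act on
    final_transcript: float
    llm_start: float
    voice_start: float          # agent audio begins
    partials: List[str] = field(default_factory=list)
    user_interrupt: Optional[float] = None   # user barges in
    agent_silent: Optional[float] = None     # agent audio actually stops


def percentile(values, pct):
    """Nearest-rank percentile; returns None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rewrite_count(partials):
    """Count partials that changed earlier text instead of extending it."""
    return sum(
        1 for prev, cur in zip(partials, partials[1:])
        if not cur.startswith(prev)
    )


def scorecard(turns):
    """Summarize a list of TurnLog entries into a few latency metrics."""
    barge_in = [
        t.agent_silent - t.user_interrupt
        for t in turns
        if t.user_interrupt is not None and t.agent_silent is not None
    ]
    return {
        "first_usable_text_p50": percentile(
            [t.usable_transcript - t.user_speech_start for t in turns], 50),
        "final_transcript_delay_p95": percentile(
            [t.final_transcript - t.user_speech_end for t in turns], 95),
        "turn_latency_p95": percentile(
            [t.voice_start - t.user_speech_end for t in turns], 95),
        "partial_rewrites_per_turn": (
            sum(rewrite_count(t.partials) for t in turns) / len(turns)
            if turns else None),
        "barge_in_stop_p95": percentile(barge_in, 95),
    }
```

Turn latency here is measured from the end of user speech to the start of agent audio, which is only one reasonable definition. Endpointing quality, phone audio, and names/numbers/dates need labeled test calls and are not covered by timestamps alone.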


Similar Articles

6 months running a production voice agent for service businesses. The latency math is way harder than the demos suggest.

Reddit r/ArtificialInteligence

After 6 months running a voice AI agent for service businesses, the author reveals that real-world latency is bimodal (median ~800ms, p95 ~2.4s) and this p95 determines user perception. Issues like VAD misfires, function call degradation with long prompts, and TTS quality matter more than LLM choice, with multilingual support adding significant costs.

Your voice agent probably isn't slow because of the LLM.

Reddit r/AI_Agents

A developer debunks the common belief that LLM latency is the primary cause of slow voice agents, explaining that delays often stem from earlier stages like audio capture, VAD, and STT. They recommend logging specific latency metrics and testing various STT/TTS providers and orchestration frameworks to diagnose issues.

@svpino: I've built two voice pipelines for two different companies. They both look like this: Audio → STT → Clean transcript → …

X AI KOLs Following

Santiago highlights the limitation of traditional STT pipelines that lose tone and emotion, then introduces Velma, a voice-native AI model from Modulate that analyzes raw audio to capture intent, emotion, and other acoustic signals, available via API at 10x cheaper than LLM-based approaches.

Our voice agent's p99 was 280ms. Competitor's was 450ms. Users said ours felt slower. We measured why.

Reddit r/AI_Agents

A voice agent team found that despite a lower end-to-end p99 latency (280ms vs the competitor's 450ms), users perceived their agent as slower because it was slow to stop on barge-in (380ms vs 60ms). They identified three fixes (memory pinning, VAD threshold tuning, and smaller TTS chunks) that improved the barge-in rate from 41% to 89% at 100ms, making the agent feel faster to users.

Latency matters more than model selection when building AI tutoring systems

Reddit r/AI_Agents

A practitioner argues that speech start latency—not model selection—is the critical factor in AI tutoring systems, recommending targets under 1 second for speech start and highlighting streaming TTS as the highest-leverage optimization. The post outlines a full pipeline from ASR through TTS and avatar sync, identifying where latency compounds most.
