My voice-agent test now includes the 600-second cliff
Summary
The author describes a voice agent call cut off at 600 seconds without warning, and proposes a testing approach to handle max duration gracefully, including pre-cutoff warnings and state preservation.
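One way to picture the kind of test the summary describes is a simulated call driven by a fake clock. The sketch below is only an illustration under stated assumptions: `CallSession`, `tick`, and the 30-second warning lead time are hypothetical and do not come from the post; only the 600-second cap does.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional

MAX_CALL_SECONDS = 600    # hard cap mentioned in the post
WARN_BEFORE_SECONDS = 30  # assumed lead time for the warning (not from the post)


@dataclass
class CallSession:
    """Hypothetical stand-in for a voice-agent call; not a real SDK class."""
    max_seconds: float = MAX_CALL_SECONDS
    warn_before: float = WARN_BEFORE_SECONDS
    elapsed: float = 0.0
    warned: bool = False
    ended: bool = False
    state: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    checkpoint: Optional[dict] = None

    def tick(self, seconds: float) -> None:
        """Advance the simulated clock and apply the duration policy."""
        if self.ended:
            return
        self.elapsed = min(self.elapsed + seconds, self.max_seconds)
        # Warn once when the remaining time drops to the lead time.
        if not self.warned and self.elapsed >= self.max_seconds - self.warn_before:
            self.warned = True
            self.checkpoint = copy.deepcopy(self.state)
            self.events.append(("warning", self.elapsed))
        # At the cap, save the latest state before ending the call.
        if self.elapsed >= self.max_seconds:
            self.checkpoint = copy.deepcopy(self.state)
            self.ended = True
            self.events.append(("ended", self.elapsed))


def run_cliff_test() -> CallSession:
    """Drive a call past the cap and check warning order and saved state."""
    call = CallSession()
    call.state["collected"] = {"name": "Ada"}
    call.tick(569)
    assert not call.warned           # still outside the warning window
    call.tick(1)                     # 570s: warning should fire
    assert call.warned and not call.ended
    call.state["collected"]["address"] = "pending"
    call.tick(60)                    # overshoot the cap; clock clamps at 600s
    assert call.ended
    assert call.checkpoint == call.state
    return call


if __name__ == "__main__":
    print(run_cliff_test().events)
```

The fake clock keeps the test fast; a real harness would presumably hook into whatever timer and persistence layer the telephony stack exposes.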
Similar Articles
The smallest voice-agent test I like: make it ask the missing question
A simple test for voice agents: give an underspecified instruction (like 'use the address on file') and see if the agent asks for clarification before committing. The quality of the follow-up question reveals the agent's reliability.
6 months running a production voice agent for service businesses. The latency math is way harder than the demos suggest.
After 6 months running a voice AI agent for service businesses, the author reports that real-world latency is bimodal (median ~800ms, p95 ~2.4s) and that the p95 is what determines user perception. Issues like VAD misfires, function-call degradation with long prompts, and TTS quality matter more than the choice of LLM, and multilingual support adds significant cost.
Our voice agent's p99 was 280ms. Competitor's was 450ms. Users said ours felt slower. We measured why.
A voice agent team found that, despite lower end-to-end latency (280ms vs the competitor's 450ms), users perceived their agent as slower because it was slow to respond to barge-in interruptions (380ms vs 60ms). They identified three fixes (memory pinning, VAD threshold tuning, and smaller TTS chunks) that improved the barge-in rate at 100ms from 41% to 89%, so users perceived the agent as faster.
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
EVA-Bench introduces a comprehensive end-to-end framework for evaluating voice agents, simulating realistic multi-turn conversations and measuring performance across voice-specific failure modes with novel accuracy (EVA-A) and experience (EVA-X) metrics. The benchmark includes 213 scenarios across enterprise domains and a perturbation suite for accent and noise robustness, revealing substantial gaps in current systems.
EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions
EchoChain is a new benchmark for evaluating AI models' ability to revise in-progress responses when users interrupt mid-generation. The benchmark identifies three failure patterns (contextual inertia, interruption amnesia, objective displacement) and finds that across evaluated real-time voice models, no system exceeds 50% pass rate.