My voice-agent test now includes the 600-second cliff

Reddit r/AI_Agents News

Summary

The author describes a voice agent call cut off at 600 seconds without warning, and proposes a testing approach to handle max duration gracefully, including pre-cutoff warnings and state preservation.

I thought long voice calls were basically solved until I used one during a drive and it cut off at exactly 600 seconds.

The annoying part was not just the timeout. It was that the call ended mid-thought with no warning, no summary, and no clean next step. The transcript existed, but the workflow felt abandoned.

Now I treat max duration as a QA case, not an infrastructure detail.

My test:

- run the call past the expected limit
- warn before cutoff
- capture a short summary
- preserve the transcript
- write a retry or next-action state

If the agent cannot wind down gracefully, the call is not production-ready yet. It may handle normal turns perfectly and still feel broken at the exact moment the user needs continuity.

For people building voice agents: do you test max-duration / wind-down behavior, or mostly interruptions and latency?
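The test checklist above could be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the author's actual harness: `CallSession`, `WindDownResult`, `run_past_limit`, the 30-second warning lead, and the placeholder summary are all hypothetical names and values chosen for illustration. Only the 600-second limit comes from the post. The clock is simulated, so the check runs instantly instead of holding a real call open for ten minutes.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_DURATION_S = 600  # the limit reported in the post
WARN_LEAD_S = 30      # assumption: how early the agent warns the caller


@dataclass
class WindDownResult:
    warned_at: Optional[float] = None
    summary: Optional[str] = None
    transcript: list = field(default_factory=list)
    next_action: Optional[dict] = None
    ended_at: Optional[float] = None


class CallSession:
    """Hypothetical stand-in for a voice call driven by a simulated clock."""

    def __init__(self, max_duration_s=MAX_DURATION_S, warn_lead_s=WARN_LEAD_S):
        self.max_duration_s = max_duration_s
        self.warn_lead_s = warn_lead_s
        self.result = WindDownResult()

    def add_turn(self, speaker, text):
        self.result.transcript.append((speaker, text))

    def tick(self, elapsed_s):
        """Advance simulated time; returns False once the call has ended."""
        r = self.result
        if r.ended_at is not None:
            return False
        # Warn once, before the cutoff rather than at it.
        if r.warned_at is None and elapsed_s >= self.max_duration_s - self.warn_lead_s:
            r.warned_at = elapsed_s
            self.add_turn("agent", "We are close to the call limit, let me wrap up.")
        if elapsed_s >= self.max_duration_s:
            # Placeholder summary; a real agent would likely generate this itself.
            user_turns = [text for speaker, text in r.transcript if speaker == "user"]
            last = user_turns[-1] if user_turns else "n/a"
            r.summary = f"{len(user_turns)} user turns; last: {last}"
            # Next-action state so the workflow can resume instead of being abandoned.
            r.next_action = {"state": "retry_call", "resume_from_turn": len(r.transcript)}
            r.ended_at = elapsed_s
            return False
        return True


def run_past_limit(step_s=10, overshoot_s=60):
    """Drive the call past the expected limit and return what was captured."""
    session = CallSession()
    elapsed = 0
    while elapsed <= MAX_DURATION_S + overshoot_s:
        # Simulate the user saying something once a minute while the call is live.
        if elapsed % 60 == 0 and session.result.ended_at is None:
            session.add_turn("user", f"user speaking at {elapsed}s")
        session.tick(elapsed)
        elapsed += step_s
    return session.result
```

A QA check would then assert each checklist item on the returned result: the warning came before the cutoff, and a summary, the transcript, and a next-action state all exist. Against a real voice stack, the simulated `tick` would be replaced by whatever clock or event hook the platform exposes.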
