Full duplex vs half duplex - the spectrum of AI voice models [D]

Reddit r/MachineLearning News

Summary

An analysis of half-duplex vs full-duplex architecture in AI voice models, discussing overlap, backchannels, and barge-in: three capabilities whose absence is a big reason voice agents still sound robotic.

It seems that there are two ways to build voice AI:

Half-duplex: strict turn-taking. You speak, the other side waits until you’re done, one direction of speech at a time. ← This is how almost every voice assistant works today.

Full-duplex: two channels, both sides can talk at any time - no more waiting for your “turn”. ← This is the way humans actually talk.

In fact, there are three crucial things half-duplex voice models can't really do:

* Overlap - talking and listening at the same time without falling apart
* Backchannels - the "mhms," "rights," and "yeahs" you drop in while the other person is still going
* Barge-in - getting interrupted mid-sentence and recovering gracefully

These three features are a big reason why voice agents still feel “robotic” to this day.

But what exactly is the spectrum from half-duplex to full-duplex? Is a Moshi-style architecture the only way to approach full-duplex natural voice conversations? What are ways half-duplex systems could imitate full-duplex?

Would love to hear others' thoughts on this.
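To make the last question a bit more concrete, here is a rough toy sketch of one point on that spectrum: a turn-taking loop that is strictly half-duplex by default, plus an optional threshold that lets it imitate barge-in. Everything in it is an assumption for illustration (the `simulate` function, the `State` enum, the per-frame boolean VAD flags, and the frame counts are all hypothetical). It is not how Moshi or any real product works, and it does not model true overlap, since the agent still never generates speech while processing what it hears.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


def simulate(user_frames, reply_len=6, end_silence=2, barge_in_frames=None):
    """Run a toy turn-taking loop over per-frame voice-activity flags.

    user_frames: list of bools, True when a (hypothetical) VAD hears the user.
    reply_len: number of frames the agent's canned reply lasts.
    end_silence: silent frames after user speech that count as end of turn.
    barge_in_frames: None for strict half-duplex (the agent is deaf while it
        speaks); otherwise the number of consecutive user frames during agent
        speech that make the agent stop and listen.
    Returns the agent state at the end of each frame.
    """
    state = State.LISTENING
    heard_speech = False  # has the user said anything in this turn?
    silence = 0           # silent frames since the user last spoke
    remaining = 0         # frames left in the agent's current reply
    overlap = 0           # consecutive frames of user speech over the agent
    trace = []
    for user_talking in user_frames:
        if state is State.LISTENING:
            if user_talking:
                heard_speech, silence = True, 0
            elif heard_speech:
                silence += 1
                if silence >= end_silence:
                    # End of user turn detected: take the floor.
                    state, remaining, overlap = State.SPEAKING, reply_len, 0
                    heard_speech, silence = False, 0
        else:  # SPEAKING
            remaining -= 1
            overlap = overlap + 1 if user_talking else 0
            if barge_in_frames is not None and overlap >= barge_in_frames:
                # Sustained overlap: treat it as barge-in and yield the floor.
                state, heard_speech, silence = State.LISTENING, True, 0
            elif remaining <= 0:
                state = State.LISTENING
        trace.append(state)
    return trace


# The user speaks, pauses, then talks over the agent's reply.
frames = [True, True, False, False, False, True, True, True,
          False, False, False, False]
strict = simulate(frames)                      # interruption is never heard
barge = simulate(frames, barge_in_frames=2)    # agent stops, then answers
print("strict:", "".join(s.name[0] for s in strict))
print("barge: ", "".join(s.name[0] for s in barge))
```

In the strict run the user's interruption is simply lost, because the agent finishes its reply and never responds to it. With the threshold set, the agent drops its reply once the overlap is long enough and answers after the user finishes. A short "mhm" that stays under the threshold is ignored, which is a crude stand-in for tolerating backchannels rather than producing them.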

Similar Articles

How AI voice agents actually work

Reddit r/AI_Agents

A detailed explainer on the five-layer architecture of AI voice agents, including speech-to-text, LLM, text-to-speech, orchestrator, and telephony, all operating under a 500ms latency constraint to maintain natural conversation flow.

Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

arXiv cs.CL

This paper analyzes synchronization and turn-taking dynamics in full-duplex speech dialogue models by simulating conversations between two instances of the Moshi model, measuring representational alignment via CKA and predicting turn boundaries with LSTM probes.

EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions

arXiv cs.CL

EchoChain is a new benchmark for evaluating AI models' ability to revise in-progress responses when users interrupt mid-generation. The benchmark identifies three failure patterns (contextual inertia, interruption amnesia, objective displacement) and finds that, across the evaluated real-time voice models, no system exceeds a 50% pass rate.

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

Hugging Face Daily Papers

This paper introduces Omni-DuplexEval, a benchmark and automatic evaluation framework for real-time duplex interaction in multimodal large language models, assessing continuous response generation and proactive event detection in streaming scenarios.