Tag
TurnNat is a likelihood-based framework for automatically evaluating turn-taking naturalness in dyadic spoken dialogue, using a causal turn-taking prediction model trained on natural conversations to measure timing atypicality via negative log-likelihood.
BayLing-Duplex is a native full-duplex speech language model that enables a single autoregressive LLM to manage turn-taking and interruptions without external VAD modules, achieving high success rates and improved response quality over prior models.
This paper identifies 'state inertia' in full-duplex spoken language models, where the model's internal predictive focus lags during user interruptions, and proposes a training-free activation steering method to improve interruption handling.
Kyutai Labs released a new paper on using reinforcement learning to post-train speech models (Moshi and PersonaPlex) for more human-like interaction, including when to respond, wait, or give listening cues.
An analysis of half-duplex vs. full-duplex architectures in AI voice models, discussing overlap, backchannels, and barge-in, the conversational features whose absence makes voice agents sound robotic.
Raon-Speech is a 9B-parameter speech language model for English and Korean, supporting understanding, answering, and generation, with a full-duplex extension Raon-SpeechChat for natural real-time conversation. It achieves strong performance across 42 benchmarks and is fully open-sourced.
This paper analyzes synchronization and turn-taking dynamics in full-duplex speech dialogue models by simulating conversations between two instances of the Moshi model, measuring representational alignment via CKA and predicting turn boundaries with LSTM probes.
Thinking Machines Lab and OpenBMB released MiniCPM-o 4.5, a 9B full-duplex omnimodal model with the Omni-Flow framework that enables continuous, time-aligned real-time video and voice interaction, surpassing previous models and available as open source.
Mira Murati's team showcased a preview of the new interaction model. Trained from scratch, it natively supports full-duplex real-time audio and video conversations, instant interruptions, multi-language translation, and dynamic multi-tasking. The demonstration verified its core capabilities in low-latency streaming interaction, multimodal perception, and concurrent task execution.
EchoChain is a new benchmark for evaluating AI models' ability to revise in-progress responses when users interrupt mid-generation. The benchmark identifies three failure patterns (contextual inertia, interruption amnesia, objective displacement) and finds that, across the evaluated real-time voice models, no system exceeds a 50% pass rate.
MoshiRAG combines a compact full-duplex speech language model with asynchronous retrieval-augmented generation to improve factuality while maintaining real-time interactivity. The approach uses natural temporal gaps in conversation to retrieve external knowledge without disrupting the flow of dialogue.
MTR-DuplexBench is a comprehensive benchmark for evaluating full-duplex speech language models in multi-round conversations. It addresses challenges such as blurred turn boundaries and context inconsistency, and assesses conversational features, dialogue quality, instruction following, and safety.