@kyutai_labs: New paper: Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models. We use RL to post-train speech models (Mo…
Summary
Kyutai Labs released a new paper on using reinforcement learning to post-train speech models (Moshi and PersonaPlex) for more human-like interaction, including when to respond, wait, or give listening cues.
Cached at: 06/10/26, 05:53 PM
New paper: Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
We use RL to post-train speech models (Moshi and PersonaPlex) to talk more like a human: to know when to respond, when to wait, and when to nod along with “yeah”s and “okay”s when listening. https://t.co/MGtqCmyD5L
Similar Articles
Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models
This paper analyzes synchronization and turn-taking dynamics in full-duplex speech dialogue models by simulating conversations between two instances of the Moshi model, measuring representational alignment via CKA and predicting turn boundaries with LSTM probes.
From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator
This paper theoretically identifies and mitigates context distribution shift in multi-turn dialogue RL, proposing Calibrated Interactive RL that couples interactive RL with simulator alignment to reduce the sim-to-real gap and achieve state-of-the-art performance.
MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
MTR-DuplexBench introduces a comprehensive benchmark for evaluating Full-Duplex Speech Language Models in multi-round conversations, addressing challenges like blurred turn boundaries and context inconsistency while assessing conversational features, dialogue quality, instruction following, and safety.
@samsja19: Very exciting work to bridge the gap between RL and mid/pretraining. You can learn from your environment beyond the rewa…
A new method called ECHO bridges RL and pre-training by applying next-token prediction to tool-call outputs, so the model learns from the environment beyond reward signals and combines world modeling with agentic actions.
OpenAI's New Voice Models Want to Do More Than Talk Back
OpenAI has launched three new real-time audio models to enable continuous, multitasking voice interactions that prioritize long-context reasoning, live translation, and seamless tool use.