Liberating LLM Capabilities in Full-Duplex Speech Models
Summary
Proposes Listen-Write-Speak (LWS), a text-first tri-channel paradigm that allows a single autoregressive LLM to continuously listen, write visible text, and speak in real-time, enabling full-duplex speech interaction without architectural modifications.
Source: https://huggingface.co/papers/2606.07547 Published on May 4
Submitted by zly (https://huggingface.co/zly-idleness) on Jun 9
Abstract
A text-first tri-channel speech interface enables real-time interaction with visible text output alongside spoken responses, demonstrating superior performance in full-duplex conversational tasks.
Speech-based large language models are typically constrained to spoken replies, which limits their user-facing outputs to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning in realtime interaction, for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a realtime oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.
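To make the tri-channel idea concrete, the following is a minimal sketch of how three channels could share one causal token sequence. It is not the paper's Token Schema, which the abstract does not spell out: the marker strings `<listen>`, `<write>`, `<speak>`, the `Step` type, the per-step chunking, and the helpers `interleave` and `split_channels` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical channel markers (assumption): the real Token Schema may differ.
LISTEN, WRITE, SPEAK = "<listen>", "<write>", "<speak>"


@dataclass
class Step:
    """One time step (for example, one second) across the three channels."""
    heard: List[str]    # user audio tokens revealed during this step
    written: List[str]  # visible text tokens emitted during this step
    spoken: List[str]   # speech tokens emitted during this step


def interleave(steps: List[Step]) -> List[str]:
    """Flatten per-step channel chunks into a single left-to-right sequence."""
    seq: List[str] = []
    for step in steps:
        seq += [LISTEN, *step.heard]
        seq += [WRITE, *step.written]
        seq += [SPEAK, *step.spoken]
    return seq


def split_channels(seq: List[str]) -> Dict[str, List[str]]:
    """Recover the three streams from the flat sequence."""
    streams: Dict[str, List[str]] = {LISTEN: [], WRITE: [], SPEAK: []}
    current = None
    for tok in seq:
        if tok in streams:
            current = tok  # a marker switches the active channel
        elif current is not None:
            streams[current].append(tok)
    return streams


steps = [
    Step(heard=["a1", "a2"], written=[], spoken=[]),
    Step(heard=["a3"], written=["def", "f():"], spoken=["s1"]),
]
flat = interleave(steps)
streams = split_channels(flat)
```

Because every channel lives in the same sequence, an ordinary decoder-only model with causal attention can condition each written or spoken token on all audio revealed so far, which is one way to read the abstract's claim that no architectural modification is needed.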
Similar Articles
BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM
BayLing-Duplex is a native full-duplex speech language model that enables a single autoregressive LLM to manage turn-taking and interruptions without external VAD modules, achieving high success rates and improved response quality over prior models.
Streaming Speech-to-Text Translation with a SpeechLLM
Presents a SpeechLLM architecture for streaming speech-to-text translation that adaptively decides when to output tokens based on audio, achieving 1-2 second latency with quality close to non-streaming baselines.
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
Proposes TextPro-SLM, a speech large language model that minimizes the modality gap by processing spoken input to resemble prosody-aware text input, achieving strong paralinguistic understanding with low training data.
Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs
This paper proposes Multi-Stream LLMs, which transition from sequential message-based instruction tuning to parallel stream processing. This approach allows language models to simultaneously read, think, and generate across multiple concurrent data flows, addressing bottlenecks in autonomous agent applications.
Multi-Stream LLMs: new paper on parallelizing/separating prompts, thinking, I/O
This paper proposes Multi-Stream LLMs, which use multiple parallel input/output streams to allow models to read and generate simultaneously, unblocking limitations of sequential chat formats.