Liberating LLM Capabilities in Full-Duplex Speech Models

Hugging Face Daily Papers

Summary

Proposes Listen-Write-Speak (LWS), a text-first tri-channel paradigm that allows a single autoregressive LLM to continuously listen, write visible text, and speak in real-time, enabling full-duplex speech interaction without architectural modifications.

Speech-based large language models are typically constrained to spoken replies, which limits their user-facing outputs to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning in realtime interaction, for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a realtime oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.

Cached at: 06/09/26, 12:41 PM


Source: https://huggingface.co/papers/2606.07547

Published on May 4

Submitted by zly (https://huggingface.co/zly-idleness) on Jun 9

Abstract

A text-first tri-channel speech interface enables real-time interaction with visible text output alongside spoken responses, demonstrating superior performance in full-duplex conversational tasks.

Speech-based large language models are typically constrained to spoken replies, which limits their user-facing outputs to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning in realtime interaction, for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a realtime oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.
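
To make the tri-channel idea concrete, here is a minimal sketch of how three channels could be flattened into one causal token sequence that a single autoregressive model reads left to right. It is not the paper's Token Schema, which this page does not spell out. The marker names (`<listen>`, `<write>`, `<speak>`), the listen-write-speak ordering within a step, the one-step-per-second chunking, and the `Step`, `flatten`, and `split_channels` helpers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical channel markers; the real Token Schema may look quite different.
LISTEN, WRITE, SPEAK = "<listen>", "<write>", "<speak>"


@dataclass
class Step:
    """One time step of the interaction (assumed here to be one second)."""
    audio_in: List[str]    # tokens for user audio heard during this step
    text_out: List[str]    # visible text tokens written during this step
    speech_out: List[str]  # tokens for the spoken reply during this step


def flatten(steps: List[Step]) -> List[str]:
    """Interleave the three channels into a single flat token sequence."""
    seq: List[str] = []
    for step in steps:
        seq += [LISTEN, *step.audio_in]
        seq += [WRITE, *step.text_out]
        seq += [SPEAK, *step.speech_out]
    return seq


def split_channels(seq: List[str]) -> Dict[str, List[str]]:
    """Recover the per-channel token streams from the flat sequence."""
    channels: Dict[str, List[str]] = {LISTEN: [], WRITE: [], SPEAK: []}
    current = None
    for tok in seq:
        if tok in channels:
            current = tok  # a marker switches the active channel
        elif current is not None:
            channels[current].append(tok)
    return channels


# Placeholder tokens stand in for real audio and text tokens.
steps = [
    Step(audio_in=["a0", "a1"], text_out=["def"], speech_out=["s0"]),
    Step(audio_in=["a2", "a3"], text_out=["f():"], speech_out=["s1"]),
]
seq = flatten(steps)
channels = split_channels(seq)
```

Because every token sits in one sequence, ordinary causal attention already lets the written and spoken tokens of a step condition on all audio revealed up to that step. That is one plausible reading of how a schema alone, with no architectural change, could yield the behavior the abstract describes.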


Get this paper in your agent:

hf papers read 2606.07547

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Streaming Speech-to-Text Translation with a SpeechLLM

arXiv cs.CL

Presents a SpeechLLM architecture for streaming speech-to-text translation that adaptively decides when to output tokens based on audio, achieving 1-2 second latency with quality close to non-streaming baselines.