BayLing-Duplex: Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM

arXiv cs.CL 06/15/26, 04:00 AM Papers
full-duplex speech-dialogue autoregressive-llm turn-taking speech-language-model fine-tuning
Summary
BayLing-Duplex is a native full-duplex speech language model that enables a single autoregressive LLM to manage turn-taking and interruptions without external VAD modules, achieving high success rates and improved response quality over prior models.
arXiv:2606.14528v1 Announce Type: new
# Native Full-Duplex Speech Dialogue with a Single Autoregressive LLM
Source: [https://arxiv.org/html/2606.14528](https://arxiv.org/html/2606.14528)
Qingkai Fang1,2,3, Shoutao Guo1,2,3, Yang Feng1,2,3

1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)

2 Key Laboratory of AI Safety, Chinese Academy of Sciences

3 University of Chinese Academy of Sciences, Beijing, China

{fangqingkai21b,guoshoutao22z,fengyang}@ict.ac.cn

###### Abstract

Real-time, full-duplex speech interaction is a key feature of next-generation spoken chatbots, allowing the model to listen and speak at the same time and to handle natural phenomena such as overlap, hesitation, and barge-in. Existing speech language models (SpeechLMs) such as LLaMA-Omni (Fang et al., [2025a](https://arxiv.org/html/2606.14528#bib.bib16)) and GLM-4-Voice (Zeng et al., [2024](https://arxiv.org/html/2606.14528#bib.bib2)) are still turn-based and rely on an external Voice Activity Detection (VAD) module to mark the end of the user’s turn, which fundamentally limits their interactive ability. In this paper, we introduce BayLing-Duplex, a native full-duplex SpeechLM where a single autoregressive LLM decides when to listen, when to speak, and when to stop, with no auxiliary turn-taking module. The design adds only a few special tokens to the standard vocabulary, so it transfers across LLMs and reuses existing training and serving stacks with no architectural adaptation. Starting from the public GLM-4-Voice checkpoint and using only 400K full-duplex samples for fine-tuning followed by a lightweight DPO stage, BayLing-Duplex reaches 92% turn-taking success and 100% interruption success on InstructS2S-Eval, while raising the speech-response score from the 2.17 of Moshi (Défossez et al., [2024](https://arxiv.org/html/2606.14528#bib.bib1)) to 3.39. BayLing-Duplex also matches or surpasses its turn-based counterpart on Llama Questions, Web Questions, and Alpaca-Eval, showing that simultaneous listen-and-speak modeling does not sacrifice response quality.

Code and models are available at [https://github.com/BayLing-Models/BayLing-Duplex](https://github.com/BayLing-Models/BayLing-Duplex).

Corresponding author: Yang Feng.

## 1 Introduction

Speech is a critical interface for human-computer interaction and can offer a better user experience than text. In recent years, with the rapid development of large language models (LLMs), building intelligent spoken chatbots has attracted widespread attention from both academia and industry. GPT-4o (OpenAI, [2024](https://arxiv.org/html/2606.14528#bib.bib14)) enables real-time, intelligent, and natural speech interaction, marking a step toward more natural human-computer interaction.

The traditional approach is a cascaded pipeline of automatic speech recognition (ASR), an LLM, and text-to-speech (TTS) synthesis. While straightforward, the cascaded design accumulates errors across stages, suffers from high response latency, and discards paralinguistic information in the input speech. To address these limitations, end-to-end SpeechLMs have gained attention, using a single unified model to process speech input and output. They can be categorized into *native* SpeechLMs that discretize speech into tokens and extend the LLM vocabulary (Zhang et al., [2023](https://arxiv.org/html/2606.14528#bib.bib35); Zeng et al., [2024](https://arxiv.org/html/2606.14528#bib.bib2); Défossez et al., [2024](https://arxiv.org/html/2606.14528#bib.bib1)), and *modular* SpeechLMs that incorporate a speech encoder and a speech decoder around the LLM (Fang et al., [2025a](https://arxiv.org/html/2606.14528#bib.bib16), [b](https://arxiv.org/html/2606.14528#bib.bib17); Wang et al., [2024](https://arxiv.org/html/2606.14528#bib.bib25)). Despite different architectures, both families predominantly assume a *turn-based* interaction: the model consumes one segmented user utterance and emits a single response.

Deployment therefore requires a front-end Voice Activity Detection (VAD) module to mark the end of the user’s turn. The turn-based assumption has two intrinsic limitations. First, the system behavior is bounded by the VAD’s accuracy: false positives cut the user off mid-sentence and false negatives delay the response, since acoustic VAD has no access to dialogue semantics. Second, the turn-based abstraction discards interaction patterns that pervade real conversation, including mid-utterance pauses that should not be mistaken for end-of-turn, user barge-in that should preempt the current response, and short backchannels that should not trigger a full reply. Outsourcing these decisions to a small front-end module places a hard ceiling on the system’s interactive ability.

*Full-duplex* SpeechLMs address these issues by listening and speaking continuously, deciding internally when to talk (Nguyen et al., [2023](https://arxiv.org/html/2606.14528#bib.bib15); Défossez et al., [2024](https://arxiv.org/html/2606.14528#bib.bib1); Zhang et al., [2024a](https://arxiv.org/html/2606.14528#bib.bib20)). However, native full-duplex training typically requires millions of hours of pretraining and tens of thousands of hours of paired full-duplex dialogue data (Défossez et al., [2024](https://arxiv.org/html/2606.14528#bib.bib1)), which is beyond the reach of most academic teams. We explore an alternative: converting a strong turn-based SpeechLM into a competitive full-duplex one with a small, structured fine-tuning recipe. The conversion is non-trivial, since the model must consume the user’s incoming speech while emitting its own response, and make every turn-taking decision at the same time scale as speech tokens.

In this paper, we propose BayLing-Duplex, a native full-duplex SpeechLM in which a single autoregressive LLM jointly handles user-speech understanding, dialogue-state decisions, and assistant-speech generation through a multi-channel interleaved sequence (Figure [1](https://arxiv.org/html/2606.14528#S2.F1)). BayLing-Duplex takes GLM-4-Voice (Zeng et al., [2024](https://arxiv.org/html/2606.14528#bib.bib2)) as its backbone, which integrates a speech tokenizer, an LLM, and a speech decoder. We introduce no new modules or auxiliary heads on top of this backbone; the only addition is four special dialogue-state tokens that share the standard token vocabulary. As a result, the design transfers to any autoregressive LLM and runs on off-the-shelf LLM training and serving frameworks without any architectural adaptation. Three streams (user speech, assistant text, and assistant speech) are tokenized at the same frame rate and interleaved block by block, and four dialogue-state tokens in the text channel encode silence, reply onset, text completion, and speech completion. With this layout, every turn-taking and interruption decision reduces to ordinary next-token prediction over GLM-4-Voice’s standard vocabulary. We start from the publicly released GLM-4-Voice checkpoint and fine-tune it on 400K full-duplex samples, followed by a lightweight Direct Preference Optimization (DPO) (Rafailov et al., [2023](https://arxiv.org/html/2606.14528#bib.bib8)) stage targeting turn-taking and barge-in timing. Experimental results show that BayLing-Duplex reaches a 92% turn-taking success rate and a 100% interruption success rate on InstructS2S-Eval, while raising the speech-response score from the 2.17 of Moshi (Défossez et al., [2024](https://arxiv.org/html/2606.14528#bib.bib1)) to 3.39. On full-duplex spoken question answering, BayLing-Duplex reaches 46.0%/18.1% accuracy on Llama Questions and Web Questions, substantially outperforming Moshi’s 21.0%/9.2%, and the duplex model is on par with or stronger than its turn-based counterpart on three standard spoken benchmarks.

## 2 BayLing-Duplex

![Refer to caption](https://arxiv.org/html/2606.14528v1/x1.png)

Figure 1: Multi-channel interleaved sequence in BayLing-Duplex. The user speech, assistant text and assistant speech channels are interleaved block-by-block at a fixed $N{:}M{:}N$ ratio; here $N{=}M{=}2$ for clarity (we use $N{=}10$, $M{=}5$ in practice). The text channel embeds the dialogue-state tokens [SILENCE] ([S]), [ASSISTANT] ([A]), [PAD] ([P]) and [EPAD] ([E]). The illustrated dialogue starts with the user asking “Hi, what time is it?”; the assistant takes the turn at 2.0 s with “The time is…”, is interrupted at 3.0 s by “Wait, Beijing Time!”, and re-starts with “9PM” at 4.5 s. Turn-taking, being interrupted, and re-starting are all expressed as ordinary next-token prediction over the standard LLM vocabulary.

In this section, we introduce the model architecture of BayLing-Duplex. As shown in Figure [1](https://arxiv.org/html/2606.14528#S2.F1), we use GLM-4-Voice (Zeng et al., [2024](https://arxiv.org/html/2606.14528#bib.bib2)) as the backbone, which integrates a speech tokenizer, an LLM, and a speech decoder. The speech tokenizer is a modified Whisper-large-v3 (Radford et al., [2023](https://arxiv.org/html/2606.14528#bib.bib3)) encoder with a vector quantizer that turns 16 kHz waveforms into discrete tokens at $f_s = 12.5$ Hz (one token per 80 ms); the LLM is a 9B-parameter decoder-only Transformer initialized from GLM-4-9B (GLM et al., [2024](https://arxiv.org/html/2606.14528#bib.bib7)) with the speech tokens added to its vocabulary; and the speech decoder is a flow-matching (Lipman et al., [2023](https://arxiv.org/html/2606.14528#bib.bib5)) model followed by a HiFi-GAN (Kong et al., [2020](https://arxiv.org/html/2606.14528#bib.bib6)) vocoder, both adapted from CosyVoice (Du et al., [2024](https://arxiv.org/html/2606.14528#bib.bib4)). The core innovation of BayLing-Duplex is the multi-channel interleaved sequence layout, which realizes full-duplex behavior without introducing any new modules or auxiliary heads.

### 2.1 Multi-Channel Interleaved Sequence

A full-duplex dialogue consists of a sequence of utterances by the user and the assistant, possibly with overlap to support barge-in. We organize this dialogue into a single multi-channel interleaved sequence as follows.

#### Two-Channel Audio Tokenization

We synthesize two single-channel audio tracks of equal length: the user track is filled with user utterances (silence elsewhere) and the assistant track with assistant utterances. Both are tokenized by the speech tokenizer, yielding aligned sequences $\mathbf{X}=(x_1,\dots,x_{T_s})$ and $\mathbf{Y}=(y_1,\dots,y_{T_s})$. Silence is tokenized by the same encoder rather than replaced by a special token, preserving acoustic continuity. For each assistant utterance $k$, $\mathbf{w}_k$ denotes its textual content, and $s_k$, $e_k$ denote its start and end times (in seconds), respectively.

#### Block Structure

The sequence is organized in $B$ blocks, each containing $N$ user-speech tokens, $M$ text tokens, and $N$ assistant-speech tokens:

$$\text{Block } b:\; \underbrace{x_{bN+1:(b+1)N}}_{\text{user speech}}\, \underbrace{\mathbf{z}_{bM+1:(b+1)M}}_{\text{text}}\, \underbrace{y_{bN+1:(b+1)N}}_{\text{assistant speech}}. \tag{1}$$

The text channel $\mathbf{Z}=(z_1,\dots,z_{T_z})$ has length $T_z = T_s \cdot M/N$. The model is trained to predict the text and assistant-speech tokens autoregressively given the past sequence.
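The block layout of Eq. (1) can be illustrated with a short Python sketch. This is not the authors' code: the function name `interleave_blocks` and the use of string tokens are assumptions made for illustration, and blocks are indexed from $b = 0$ as in Algorithm 1.

```python
from typing import List, Sequence


def interleave_blocks(user: Sequence[str], text: Sequence[str],
                      assistant: Sequence[str], n: int, m: int) -> List[str]:
    """Interleave three channels block by block in an N:M:N layout (Eq. 1)."""
    if len(user) != len(assistant):
        raise ValueError("user and assistant channels must have equal length T_s")
    if len(user) % n != 0:
        raise ValueError("T_s must be a multiple of N")
    num_blocks = len(user) // n
    if len(text) != num_blocks * m:
        raise ValueError("text channel length must equal T_s * M / N")

    sequence: List[str] = []
    for b in range(num_blocks):
        sequence.extend(user[b * n:(b + 1) * n])       # x_{bN+1:(b+1)N}
        sequence.extend(text[b * m:(b + 1) * m])       # z_{bM+1:(b+1)M}
        sequence.extend(assistant[b * n:(b + 1) * n])  # y_{bN+1:(b+1)N}
    return sequence


# Toy example with N = M = 2, matching the ratio drawn in Figure 1.
example = interleave_blocks(
    user=["x1", "x2", "x3", "x4"],
    text=["z1", "z2", "z3", "z4"],
    assistant=["y1", "y2", "y3", "y4"],
    n=2, m=2,
)
```

With these toy inputs the first block is `x1 x2 z1 z2 y1 y2`, followed by the second block in the same order.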

#### Block Size

The block size $N$ controls a fundamental trade-off. With a small $N$, each block has too few text slots to express even a short sub-word, which produces jittery turn-taking and unstable response timing; with a large $N$, the minimum response latency exceeds the human-acceptability threshold, since the model can only respond at the granularity of one block (Défossez et al., [2024](https://arxiv.org/html/2606.14528#bib.bib1)). We choose $N=10$ and $M=5$ throughout the paper, giving a block duration of $\Delta t = N/f_s = 0.8$ s and $M/\Delta t = 6.25$ text tokens per second on average, close to the natural English speech rate of GLM-4-Voice during turn-based decoding. $N{=}10$ matches the typical English minimum-perceptible-latency threshold while keeping $\Delta t$ small enough for fluid turn-taking; we leave a systematic sweep over $N$ to future work.
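The two numbers quoted above follow directly from the 12.5 Hz frame rate. A minimal Python check, with hypothetical helper names, is:

```python
FRAME_RATE_HZ = 12.5  # speech-token rate f_s of the GLM-4-Voice tokenizer


def block_duration(n: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Block duration in seconds: Delta t = N / f_s."""
    return n / frame_rate_hz


def text_tokens_per_second(n: int, m: int,
                           frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Average text-slot rate: M / Delta t = M * f_s / N."""
    return m * frame_rate_hz / n


delta_t = block_duration(10)                 # 0.8 s for N = 10
text_rate = text_tokens_per_second(10, 5)    # 6.25 tokens/s for N = 10, M = 5
```

Because the model can only react at block granularity, `block_duration(n)` is also a lower bound on the response latency for a given $N$.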

#### Causal Shift

At block $b$, the model has observed user tokens up to time $(b{+}1)\Delta t$, so the earliest assistant audio it can emit corresponds to that same instant. We therefore shift the assistant text and speech channels one block ahead of the user channel during training: text and assistant-speech tokens at block $b$ correspond to wall-clock window $[(b{+}1)\Delta t, (b{+}2)\Delta t)$. At inference, the output is played back with the same offset $\Delta t$ added.
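As a small illustration of the shift, the sketch below maps a block index to the wall-clock windows of its user input and of its assistant output. The helper name `block_windows` is hypothetical, and blocks are counted from $b = 0$.

```python
from typing import Tuple

Window = Tuple[float, float]


def block_windows(b: int, n: int = 10,
                  frame_rate_hz: float = 12.5) -> Tuple[Window, Window]:
    """Return (user_window, assistant_window) in seconds for block b.

    User tokens of block b cover [b*dt, (b+1)*dt); the assistant text and
    speech tokens of the same block are shifted one block later and cover
    [(b+1)*dt, (b+2)*dt).
    """
    user = (b * n / frame_rate_hz, (b + 1) * n / frame_rate_hz)
    assistant = ((b + 1) * n / frame_rate_hz, (b + 2) * n / frame_rate_hz)
    return user, assistant


user_window, assistant_window = block_windows(0)
```

For block 0 with $N = 10$, the user window ends at 0.8 s, which is exactly where the assistant window begins.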

#### Text-Channel Construction

The text channel $\mathbf{Z}$ acts as an inner monologue: it never reaches the user, but conditions the assistant-speech tokens within the same block. $\mathbf{Z}$ is initialized with [SILENCE] everywhere and overwritten by each assistant utterance $k$, whose boundary indices in the text channel are

$$j_k^{\text{ast}} = \lfloor (s_k - \Delta t)\, f_s \rfloor \cdot \tfrac{M}{N} - 1, \tag{2}$$

$$j_k^{\text{epad}} = \lceil (e_k - \Delta t)\, f_s \rceil \cdot \tfrac{M}{N}, \tag{3}$$

with the textual content $\mathbf{w}_k$ filling positions from $j_k^{\text{ast}}{+}1$. The text channel embeds four dialogue-state tokens that encode the high-level state of the dialogue:

- [SILENCE]: the assistant should stay silent;
- [ASSISTANT]: the start of an assistant reply;
- [PAD]: the textual content has been written but the corresponding speech is still being emitted;
- [EPAD]: both the text and the speech of the current reply are complete.

When the text channel emits [SILENCE], the assistant-speech tokens correspond to silence; when it emits [ASSISTANT] followed by content, the assistant-speech tokens encode the corresponding utterance. With this layout, all dialogue-state decisions reduce to next-token prediction over GLM-4-Voice’s standard vocabulary, requiring no extra classification head, attention-mask trick, or state machine.
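One possible reading of Eqs. (2) and (3) is sketched below in Python. Several details are assumptions rather than statements from the paper: indices are treated as 0-based list positions, the product with $M/N$ is floored to an integer, [ASSISTANT] is written at $j_k^{\text{ast}}$, [PAD] fills the slots between the end of the text and $j_k^{\text{epad}}$, and [EPAD] is written at $j_k^{\text{epad}}$. All function names are hypothetical. `Fraction` is used so that floor and ceiling are not affected by floating-point rounding.

```python
import math
from fractions import Fraction
from typing import List, Tuple

SILENCE, ASSISTANT, PAD, EPAD = "[SILENCE]", "[ASSISTANT]", "[PAD]", "[EPAD]"


def boundary_indices(s_k: Fraction, e_k: Fraction, delta_t: Fraction,
                     f_s: Fraction, n: int, m: int) -> Tuple[int, int]:
    """Eqs. (2)-(3), with integer flooring of the M/N product (assumption)."""
    j_ast = math.floor((s_k - delta_t) * f_s) * m // n - 1
    j_epad = math.ceil((e_k - delta_t) * f_s) * m // n
    return j_ast, j_epad


def write_utterance(channel: List[str], words: List[str],
                    j_ast: int, j_epad: int) -> None:
    """Overwrite a [SILENCE]-initialized text channel in place (assumed layout)."""
    if j_ast < 0 or j_epad >= len(channel) or j_ast + len(words) >= j_epad:
        raise ValueError("utterance does not fit between its boundary indices")
    channel[j_ast] = ASSISTANT
    for offset, word in enumerate(words, start=1):
        channel[j_ast + offset] = word           # content starts at j_ast + 1
    for j in range(j_ast + 1 + len(words), j_epad):
        channel[j] = PAD                         # text done, speech still playing
    channel[j_epad] = EPAD                       # text and speech both complete


# Hypothetical utterance from 1.6 s to 4.0 s in a 4.8 s dialogue (T_z = 30).
j_ast, j_epad = boundary_indices(Fraction("1.6"), Fraction("4.0"),
                                 Fraction("0.8"), Fraction("12.5"), n=10, m=5)
text_channel = [SILENCE] * 30
write_utterance(text_channel, ["The", "time", "is"], j_ast, j_epad)
```

Under these assumptions the reply onset lands in the last text slot of the block preceding the one whose speech tokens start the utterance, which is consistent with the causal shift described above.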

### 2.2 Training

We start from GLM-4-Voice’s publicly released checkpoint, which has already been pretrained on millions of hours of speech-text data and supervised fine-tuned on turn-based dialogue. Two further stages are applied.

#### Stage I: Supervised Fine-Tuning

The user-speech channel $\mathbf{X}$ is conditioning only and contributes no loss; the cross-entropy is evaluated only at text-channel and assistant-speech positions, with the supervised set

$$\mathcal{V}=\{i : s_i \in \mathbf{Z}\cup\mathbf{Y}\},\quad \ell_i = -\log \pi_\theta(s_i \mid \mathbf{s}_{<i}). \tag{4}$$

[SILENCE] dominates a typical sequence while [ASSISTANT] appears only once per turn, so we aggregate the per-position losses with per-token weights $\omega_i$ to keep the rare role tokens from being drowned out:

$$\mathcal{L}_{\text{SFT}} = \frac{\sum_{i\in\mathcal{V}} \omega_i \ell_i}{\sum_{i\in\mathcal{V}} \omega_i}. \tag{5}$$

We tune two key weights: $\omega_{\text{sil}}$ for [SILENCE] and $\omega_{\text{role}}$ for [ASSISTANT]/[EPAD]; we write $\mathcal{L}_{\text{SFT}}(\mathbf{s})$ when this loss is evaluated on a specific sequence $\mathbf{s}$. Ablations are reported in Section [4.4](https://arxiv.org/html/2606.14528#S4.SS4).
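Eq. (5) is a weighted average of per-token negative log-likelihoods. The plain-Python sketch below illustrates it under stated assumptions: positions are labelled with a hypothetical `kind` string, all tokens other than silence and role tokens get weight 1, and the default weights use the values $\omega_{\text{sil}} = 0.1$ and $\omega_{\text{role}} = 10$ from the ablation in Section 4.4.

```python
from typing import Sequence


def weighted_sft_loss(token_log_probs: Sequence[float],
                      token_kinds: Sequence[str],
                      w_sil: float = 0.1, w_role: float = 10.0) -> float:
    """Weighted cross-entropy of Eq. (5).

    token_kinds[i] is one of:
      "user"    - user-speech position, excluded from the supervised set V
      "silence" - [SILENCE] in the text channel, weight w_sil
      "role"    - [ASSISTANT] or [EPAD], weight w_role
      anything else (text content, [PAD], assistant speech) - weight 1
    """
    weights = {"silence": w_sil, "role": w_role}
    numerator = 0.0
    denominator = 0.0
    for log_prob, kind in zip(token_log_probs, token_kinds):
        if kind == "user":
            continue  # conditioning only, contributes no loss
        w = weights.get(kind, 1.0)
        numerator += w * (-log_prob)  # omega_i * l_i
        denominator += w
    if denominator == 0.0:
        raise ValueError("no supervised positions in the sequence")
    return numerator / denominator


loss = weighted_sft_loss([-1.0, -2.0, -4.0], ["user", "silence", "role"])
```

In this toy sequence the role token contributes 100 times more to the average than the silence token, which is the effect the weighting is meant to have.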

#### Stage II: Direct Preference Optimization

Stage I teaches the layout but only weakly optimizes temporal decisions. We construct preference pairs whose positive examples are the SFT data and whose negatives differ *only* in timing; the construction is detailed in Section [3](https://arxiv.org/html/2606.14528#S3). The training objective combines DPO with an auxiliary SFT term that prevents catastrophic forgetting of generation quality:

$$\mathcal{L} = \mathcal{L}_{\text{DPO}} + \lambda_{\text{ftx}} \cdot \mathcal{L}_{\text{SFT}}(\mathbf{s}^{+}), \tag{6}$$

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta\left[\log \tfrac{\pi_\theta(\mathbf{s}^{+})}{\pi_{\text{ref}}(\mathbf{s}^{+})} - \log \tfrac{\pi_\theta(\mathbf{s}^{-})}{\pi_{\text{ref}}(\mathbf{s}^{-})}\right]\right), \tag{7}$$

where $\pi_{\text{ref}}$ is the Stage I checkpoint.
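For a single preference pair, Eqs. (6) and (7) can be evaluated from four sequence log-probabilities. The sketch below is a scalar illustration with hypothetical function names, not the LLaMA-Factory implementation; the defaults $\beta = 0.5$ and $\lambda_{\text{ftx}} = 0.5$ are the values reported in Section 4.1.

```python
import math


def dpo_loss(logp_pos: float, logp_neg: float,
             ref_logp_pos: float, ref_logp_neg: float,
             beta: float = 0.5) -> float:
    """Eq. (7): -log sigmoid(beta * [log-ratio(s+) - log-ratio(s-)])."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # -log(sigmoid(x)) = log(1 + exp(-x)), written to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def stage2_loss(dpo: float, sft_on_positive: float,
                lambda_ftx: float = 0.5) -> float:
    """Eq. (6): DPO loss plus the auxiliary SFT term on the positive sample."""
    return dpo + lambda_ftx * sft_on_positive


# Policy identical to the reference: the margin is zero and the loss is log 2.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy that raises s+ and lowers s- relative to the reference: smaller loss.
improved = dpo_loss(-9.0, -12.0, -10.0, -10.0)
```

A smaller $\beta$ shrinks the margin for the same log-ratios, which weakens the pull toward the reference policy.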

### 2.3 Inference

Algorithm 1: Inference of BayLing-Duplex.

1. Input: live user-speech stream; block sizes $N$, $M$; causal offset $\Delta t$
2. Output: assistant-speech waveform
3. $b \leftarrow 0$, history $\mathbf{S} \leftarrow ()$
4. while dialogue is active do
5. receive $N$ user-speech tokens $\mathbf{x}_b$ from the stream
6. $\mathbf{S} \leftarrow \mathbf{S} \oplus \mathbf{x}_b$
7. for $j = 1, \ldots, M$ do $\triangleright$ text channel
8. $z_j \sim \pi_\theta(\cdot \mid \mathbf{S})$, mask to text + state tokens
9. $\mathbf{S} \leftarrow \mathbf{S} \oplus z_j$
10. end for
11. for $j = 1, \ldots, N$ do $\triangleright$ assistant speech
12. $y_j \sim \pi_\theta(\cdot \mid \mathbf{S})$, mask to speech tokens
13. $\mathbf{S} \leftarrow \mathbf{S} \oplus y_j$
14. end for
15. decode $\{y_1, \dots, y_N\}$ and play at $(b{+}1)\Delta t$
16. $b \leftarrow b + 1$
17. end while

During inference, decoding proceeds block by block, as summarized in Algorithm [1](https://arxiv.org/html/2606.14528#alg1). The text-channel slots are masked to text-and-state tokens and the assistant-speech slots to speech tokens; without masking, the LLM occasionally emits cross-channel tokens that corrupt the speech decoder’s input. During training no mask is applied because the cross-entropy loss naturally suppresses incorrect token types. During silence the user channel still receives the live waveform, which the tokenizer maps to its silence token; user input is never zero-padded artificially. When the user barges in mid-block, the in-flight assistant-speech tokens finish generating before the next block re-conditions on the new user audio, keeping decoding strictly autoregressive.
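The inner loop of Algorithm 1 can be sketched in Python with a stand-in for the LLM. Everything below is illustrative: `next_token_logits` is a hypothetical callable that returns a token-to-score mapping for the current history, the toy vocabularies are invented, and the speech decoder and playback step are omitted. The temperature of 0.8 is the value stated in Section 4.2.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Set, Tuple

LogitFn = Callable[[Sequence[str]], Dict[str, float]]


def sample_masked(logits: Dict[str, float], allowed: Set[str],
                  temperature: float, rng: random.Random) -> str:
    """Sample one token after masking the distribution to an allowed set."""
    candidates = [(tok, score) for tok, score in logits.items() if tok in allowed]
    if not candidates:
        raise ValueError("mask removed every token")
    peak = max(score for _, score in candidates)
    weights = [math.exp((score - peak) / temperature) for _, score in candidates]
    return rng.choices([tok for tok, _ in candidates], weights=weights, k=1)[0]


def decode_block(history: List[str], user_block: Sequence[str],
                 next_token_logits: LogitFn,
                 text_vocab: Set[str], speech_vocab: Set[str],
                 n: int, m: int, temperature: float = 0.8,
                 rng: random.Random = None) -> Tuple[List[str], List[str]]:
    """One iteration of the while-loop: append N user tokens, emit M + N tokens."""
    rng = rng or random.Random(0)
    history.extend(user_block)                   # S <- S + x_b
    text: List[str] = []
    for _ in range(m):                           # text channel slots
        tok = sample_masked(next_token_logits(history), text_vocab, temperature, rng)
        history.append(tok)
        text.append(tok)
    speech: List[str] = []
    for _ in range(n):                           # assistant-speech slots
        tok = sample_masked(next_token_logits(history), speech_vocab, temperature, rng)
        history.append(tok)
        speech.append(tok)
    return text, speech


# Toy stand-in model: uniform scores over an invented joint vocabulary.
TEXT_VOCAB = {"[SILENCE]", "[ASSISTANT]", "[PAD]", "[EPAD]", "hello"}
SPEECH_VOCAB = {"<a0>", "<a1>", "<a2>"}


def toy_logits(history: Sequence[str]) -> Dict[str, float]:
    return {tok: 0.0 for tok in TEXT_VOCAB | SPEECH_VOCAB}


history: List[str] = []
text_out, speech_out = decode_block(history, ["<u0>"] * 10, toy_logits,
                                    TEXT_VOCAB, SPEECH_VOCAB, n=10, m=5)
```

Even though the toy model assigns equal scores to all tokens, the mask guarantees that the text slots only ever contain text or state tokens and the speech slots only speech tokens.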

## 3 Data Construction

In this section, we describe how we construct the full-duplex training data. We build upon the multi-turn speech-to-speech dialogue corpus introduced in Fang et al. ([2025b](https://arxiv.org/html/2606.14528#bib.bib17)), which contains 200K samples derived from the Alpaca and UltraChat datasets through rewriting with Llama-3.3-70B-Instruct and synthesized into speech with CosyVoice’s zero-shot voice cloning (Du et al., [2024](https://arxiv.org/html/2606.14528#bib.bib4)). The user instructions are synthesized with diverse voices via voice cloning, while the assistant responses use a uniform voice; this preserves voice diversity across dialogues and consistency within a dialogue.

Each multi-turn dialogue is then converted into the multi-channel interleaved format for two full-duplex scenarios. For *turn-taking*, a 0.8 s gap is inserted from the end of the user’s utterance to the start of the assistant’s response, and the gap from the end of the assistant’s response to the start of the next user utterance is drawn from $\mathrm{Uniform}(0.5, 3.0)$ s. For *interruption*, the user re-enters at a random point during the assistant’s response, and the assistant stops after a small reaction delay $\delta_{\text{react}} \sim \mathrm{Uniform}(0.8, 2.0)$ s. We generate 200K full-duplex samples for each scenario and mix them in a 1:1 ratio during training.
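The timing rules for the two scenarios can be sketched as follows. This is an illustrative reconstruction, not the released data pipeline: the function names and the dictionary layout are hypothetical, utterance durations are assumed to be known in advance, and the barge-in point is assumed to be uniform over the assistant response.

```python
import random
from typing import Dict, List, Sequence, Tuple


def turn_taking_timeline(user_durations: Sequence[float],
                         assistant_durations: Sequence[float],
                         rng: random.Random,
                         response_gap: float = 0.8) -> List[Dict[str, float]]:
    """Place alternating user/assistant utterances on a shared clock (seconds)."""
    events: List[Dict[str, float]] = []
    t = 0.0
    for user_dur, assistant_dur in zip(user_durations, assistant_durations):
        user_start, user_end = t, t + user_dur
        assistant_start = user_end + response_gap       # fixed 0.8 s gap
        assistant_end = assistant_start + assistant_dur
        events.append({"user_start": user_start, "user_end": user_end,
                       "assistant_start": assistant_start,
                       "assistant_end": assistant_end})
        t = assistant_end + rng.uniform(0.5, 3.0)        # pause before next user turn
    return events


def interruption_times(assistant_start: float, assistant_end: float,
                       rng: random.Random) -> Tuple[float, float]:
    """Return (barge_in_time, assistant_stop_time) for one interrupted reply."""
    barge_in = rng.uniform(assistant_start, assistant_end)  # assumed uniform
    stop = barge_in + rng.uniform(0.8, 2.0)                  # reaction delay
    return barge_in, stop


rng = random.Random(0)
timeline = turn_taking_timeline([2.0, 1.5], [3.0, 2.5], rng)
barge_in, stop = interruption_times(timeline[0]["assistant_start"],
                                    timeline[0]["assistant_end"], rng)
```

The resulting start and end times correspond to the $s_k$ and $e_k$ values used when building the text channel in Section 2.1.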

For DPO, we reuse the same SFT samples as positive examples and synthesize negatives by perturbing only the timing of the assistant. For *turn-taking*, the negative replaces the 0.8 s gap with a value drawn from $\mathrm{Uniform}(2, 5)$ s, which mimics a model that over-predicts [SILENCE] after the user finishes speaking. For *interruption*, the negative replaces the reaction delay with a value drawn from $\mathrm{Uniform}(3, 5)$ s, so that the assistant continues to speak well after the user has barged in. Each positive is paired with one negative; positive and negative share the same user-channel audio and textual content $\mathbf{w}_k$, so the DPO objective is forced to focus its update on the dialogue-state tokens and not on textual content, which is essential for preserving response quality during the preference-optimization stage.
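A minimal sketch of the negative-timing rule is given below. The function name and scenario labels are hypothetical; only the two sampling ranges come from the text, and the positive delay is taken as given because positives are reused from the SFT set.

```python
import random
from typing import Dict


def make_timing_pair(scenario: str, positive_delay: float,
                     rng: random.Random) -> Dict[str, float]:
    """Pair the SFT delay (positive) with a perturbed delay (negative), in seconds.

    For "turn_taking" the delay is the gap between user end and assistant start;
    for "interruption" it is the reaction delay between barge-in and assistant stop.
    """
    if scenario == "turn_taking":
        negative_delay = rng.uniform(2.0, 5.0)   # assistant waits too long to reply
    elif scenario == "interruption":
        negative_delay = rng.uniform(3.0, 5.0)   # assistant keeps talking too long
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    return {"positive": positive_delay, "negative": negative_delay}


rng = random.Random(0)
tt_pair = make_timing_pair("turn_taking", 0.8, rng)
int_pair = make_timing_pair("interruption", 1.2, rng)
```

Since the positive delays are at most 2.0 s and the negative ranges start at 2.0 s and 3.0 s, the negative is never faster than its positive, so each pair differs in the direction the preference is meant to penalize.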

## 4 Experiments

### 4.1 Experimental Setup

#### Model Configuration

We use the GLM-4-Voice checkpoint as the backbone, with $N{=}10$ and $M{=}5$ (block duration $\Delta t = 0.8$ s). The LLM is fully fine-tuned, while the speech tokenizer and speech decoder are frozen. We add no new parameters or auxiliary heads.

#### Training Details

Stage I (SFT) is trained on the 400K full-duplex dialogues described in Section [3](https://arxiv.org/html/2606.14528#S3) for one epoch with batch size 32 and a peak learning rate of $1\times10^{-5}$, using a cosine schedule with 10% warm-up. Stage II (DPO) runs for 200 steps with a peak learning rate of $3\times10^{-7}$, $\beta = 0.5$, and $\lambda_{\text{ftx}} = 0.5$, using a cosine schedule with 5% warm-up. Both stages are trained with the LLaMA-Factory codebase (Zheng et al., [2024](https://arxiv.org/html/2606.14528#bib.bib9)).
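For readers unfamiliar with the schedule, the sketch below shows one common form of "cosine with warm-up": linear warm-up to the peak followed by cosine decay to zero. The exact shape used in training is not stated in the text, so this form is an assumption, as is the step count of 12,500 for Stage I (400K samples divided by a batch size of 32, assuming the batch size is global).

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float,
              warmup_ratio: float) -> float:
    """Learning rate at a given step: linear warm-up, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Stage I: assumed 12,500 steps, peak 1e-5, 10% warm-up.
stage1_peak = cosine_lr(1250, 12500, 1e-5, 0.10)
# Stage II: 200 steps, peak 3e-7, 5% warm-up (peak reached at step 10).
stage2_peak = cosine_lr(10, 200, 3e-7, 0.05)
stage2_mid = cosine_lr(105, 200, 3e-7, 0.05)
```

With 5% warm-up over 200 steps, Stage II spends only 10 steps warming up, so almost the entire DPO run takes place on the decaying part of the curve.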

### 4.2 Evaluation

We evaluate BayLing-Duplex on three tasks: spoken question answering, full-duplex turn-taking, and full-duplex interruption. In all experiments we sample from the LLM with temperature 0.8. The synthesized assistant audio is transcribed by Whisper-large-v3 (Radford et al., [2023](https://arxiv.org/html/2606.14528#bib.bib3)) and segmented by Silero VAD (Silero Team, [2024](https://arxiv.org/html/2606.14528#bib.bib10)).

#### Spoken Question Answering

The spoken question answering task feeds a spoken question directly to the full-duplex model with no external VAD, and checks whether the reference answer appears in the model’s response. We evaluate on Llama Questions (Nachmani et al., [2024](https://arxiv.org/html/2606.14528#bib.bib12)) (300 items) and Web Questions (Berant et al., [2013](https://arxiv.org/html/2606.14528#bib.bib13)) (2032 items, synthesized into speech by CosyVoice).

#### Turn-Taking

For turn-taking, we follow LLaMA-Omni (Fang et al., [2025a](https://arxiv.org/html/2606.14528#bib.bib16)) and use InstructS2S-Eval, 199 spoken instructions filtered from the helpful_base and vicuna subsets of Alpaca-Eval (Li et al., [2023](https://arxiv.org/html/2606.14528#bib.bib11)). We feed each instruction to the duplex model in real time and measure when the assistant starts replying after the user finishes speaking, as well as the quality of the reply.

#### Interruption

For interruption, we pair adjacent items from InstructS2S-Eval into 199 two-utterance audios where the second utterance starts during the first response. We measure how quickly the model stops the current response when interrupted, and how relevant the new reply is to the second question.

All timing metrics are computed on the synthesized assistant audio: we run Silero VAD (Silero Team, [2024](https://arxiv.org/html/2606.14528#bib.bib10)) on the waveform to obtain its non-silence segments, define $t_{\text{user-end}}$ as the right edge of the last non-silence frame in the synthesized user audio, $t_{\text{assistant-start}}$ as the start of the assistant’s first non-silence segment, and $t_{\text{stop}}$ as the right edge of the assistant’s last non-silence segment that follows a barge-in (i.e., when the assistant falls silent after being interrupted). The evaluation pipeline never inspects the model’s text channel or special tokens, and Silero VAD is used only for evaluation, not for inference.

We use the following metrics.

TT SR@3s: turn-taking success rate, defined as the fraction of test items for which the assistant starts replying within 3 s of the user’s end.

S2S Score: a 1–5 GPT-4o (OpenAI, [2024](https://arxiv.org/html/2606.14528#bib.bib14)) judgment on the transcribed assistant reply, considering helpfulness, relevance, fluency, and suitability for speech interaction.

Overlap (Ovl): the gap, in seconds, from the user’s barge-in to the assistant’s stop; lower is better.

ISR@2s: the interruption success rate, defined as the fraction of test items whose overlap is at most 2 s.

Q2 S2S: the S2S Score on the assistant’s reply to the second (interrupting) question, used to measure whether the model produces a relevant new response after being interrupted.

For spoken QA, we report exact-match accuracy, computed by a case-insensitive substring match between the reference answer and the Whisper transcription of the assistant’s audio.
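The rate-style metrics above reduce to simple counting once the timestamps and transcriptions are available. The sketch below uses hypothetical function names and assumes that an item where the assistant never replies (or never stops) is recorded as `None` and counted as a failure; that convention is an assumption, not something the text specifies.

```python
from typing import Optional, Sequence


def rate_within(values: Sequence[Optional[float]], threshold: float) -> float:
    """Fraction of items whose value (seconds) is at most the threshold.

    Used for TT SR@3s with values = t_assistant_start - t_user_end and for
    ISR@2s with values = overlap. None marks a missing event (assumed failure).
    """
    if not values:
        raise ValueError("no test items")
    hits = sum(1 for v in values if v is not None and v <= threshold)
    return hits / len(values)


def substring_accuracy(references: Sequence[str],
                       transcripts: Sequence[str]) -> float:
    """Case-insensitive substring match between reference and transcription."""
    if len(references) != len(transcripts) or not references:
        raise ValueError("references and transcripts must be non-empty and aligned")
    hits = sum(1 for ref, hyp in zip(references, transcripts)
               if ref.lower() in hyp.lower())
    return hits / len(references)


tt_sr = rate_within([1.0, 2.9, 3.5, None], threshold=3.0)
isr = rate_within([1.1, 1.9, 2.0], threshold=2.0)
qa_acc = substring_accuracy(["Paris", "1969"],
                            ["The capital is paris.", "It happened in 1970."])
```

The toy inputs are invented for illustration and are unrelated to the reported results.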

#### Baseline

We compare BayLing-Duplex with Moshi (Défossez et al., [2024](https://arxiv.org/html/2606.14528#bib.bib1)), a state-of-the-art native full-duplex SpeechLM with parallel audio streams and an Inner Monologue text channel. We use the publicly released Moshika checkpoint.

### 4.3 Main Results

Table 1: Full-duplex spoken QA accuracy (%). The audio is fed directly to the duplex model with no external VAD.

Table 2: Main results on full-duplex turn-taking and interruption on InstructS2S-Eval (199 spoken instructions). SR@3s: turn-taking success rate at 3 s; S2S: GPT-4o speech-response score; Overlap: gap from user barge-in to assistant stop; ISR@2s: interruption success rate at 2 s; Q2 S2S: speech-response score on the assistant’s reply to the second (interrupting) question.

#### Spoken Question Answering

Table [1](https://arxiv.org/html/2606.14528#S4.T1) reports spoken-QA accuracy in the full-duplex setting, where the spoken question is fed directly to the duplex model and the model itself decides when to reply. We observe that: (1) BayLing-Duplex (SFT) reaches 44.3%/18.0% on Llama/Web Questions, significantly outperforming Moshi’s 21.0%/9.2% even before DPO. (2) DPO further improves accuracy to 46.0%/18.1%, indicating that better timing also yields better content. (3) The improvement is consistent across both benchmarks, suggesting that the multi-channel layout preserves the content-modeling capability of the GLM-4-Voice backbone.

#### Turn-Taking and Interruption

Table [2](https://arxiv.org/html/2606.14528#S4.T2) shows turn-taking and interruption results. We observe that: (1) The SFT model already reaches 88.9% TT SR@3s and a 3.23 S2S Score, significantly outperforming Moshi (71.9%, 2.17). (2) DPO pushes TT SR@3s to 92.0% and the S2S Score to 3.39, exceeding Moshi’s 2.17 by 1.22 points. (3) The interruption gain is even larger: Overlap drops from 2.07 s (Moshi) to 1.51 s (SFT) and 1.10 s (+DPO); ISR@2s climbs from 81.9% to 100%; and Q2 S2S rises from 2.45 to 3.27. Interruption benefits the most because the negatives in DPO directly postpone the [EPAD] token.

### 4\.4Ablation Study

We conduct ablation studies to understand the contribution of each component\.

Table 3:Token\-weight ablation in Stage I\.ωrole\\omega\_\{\\text\{role\}\}weights\[ASSISTANT\]/\[EPAD\],ωsil\\omega\_\{\\text\{sil\}\}weights\[SILENCE\]\.Table 4:DPO hyperparameter ablation\.β\\betais the KL coefficient andλftx\\lambda\_\{\\text\{ftx\}\}is the auxiliary\-SFT weight\. ISR@2s = 100\.0% across all settings and is omitted\.Table 5:Response quality of BayLing\-Duplex vs\. a turn\-based SFT baseline trained on the same data and the same backbone\.#### Token Weights

Table[3](https://arxiv.org/html/2606.14528#S4.T3)shows the ablation on the per\-token weights of the SFT loss\. We observe that: \(1\) Uniform weighting \(ω=1\\omega=1\) collapses the model to near\-permanent silence, with TT SR@3s of only 60\.3%\. The 100% ISR@2s in this row is a degenerate consequence: a model that almost never speaks needs no time to stop\. \(2\) Reducingωsil\\omega\_\{\\text\{sil\}\}to 0\.1 alone raises TT SR to 82\.4%\. \(3\) Raisingωrole\\omega\_\{\\text\{role\}\}to 10 withωsil=0\.1\\omega\_\{\\text\{sil\}\}=0\.1further pushes TT SR to 88\.9% and the speech score to 3\.23\. Both adjustments are needed: down\-weighting\[SILENCE\]alone or up\-weighting role tokens alone is insufficient, because the gradient is otherwise dominated by silence positions\.

#### DPO Hyperparameters

Table[4](https://arxiv.org/html/2606.14528#S4.T4)sweeps the Kullback–Leibler \(KL\) coefficientβ\\betaand the auxiliary\-SFT coefficientλftx\\lambda\_\{\\text\{ftx\}\}\. We observe that: \(1\) ISR@2s reaches 100% across all settings, indicating that DPO is robust on interruption\. \(2\) TT SR@3s and the S2S Score both peak at 92\.0%/3\.39 withβ=0\.5,λftx=0\.5\\beta\\\!=\\\!0\.5,\\lambda\_\{\\text\{ftx\}\}\\\!=\\\!0\.5, which we use as the default\. \(3\)λftx=1\.0\\lambda\_\{\\text\{ftx\}\}\\\!=\\\!1\.0slightly degrades the DPO effect, whileλftx=0\.3\\lambda\_\{\\text\{ftx\}\}\\\!=\\\!0\.3recovers similar interaction quality but yields a lower S2S Score\. \(4\) Loweringβ\\betato 0\.1 makes the model drift further from the SFT policy and produces a slightly lower S2S Score \(3\.31\), consistent with the view that the SFT checkpoint already captures most of the layout knowledge and DPO mainly fine\-tunes timing\.

#### Effect of Full\-Duplex Training on Response Quality

A natural concern is that learning timing decisions might erode the underlying response quality. We compare BayLing-Duplex (SFT) with a turn-based SFT baseline trained on the same data in the original GLM-4-Voice format. Table [5](https://arxiv.org/html/2606.14528#S4.T5) shows that the duplex model is on par with or stronger than the turn-based one: it loses 1.0 point on Llama Questions but gains 2.1 points on Web Questions and 0.07 on Alpaca-Eval. This indicates that multi-channel interleaved training introduces full-duplex behavior without sacrificing response quality: the gains in turn-taking and interruption come from a layout that exposes timing as an in-vocabulary prediction problem, not from a degraded language model.

## 5Related Work

#### Speech Language Models

SpeechLMs are generally divided into two categories: native SpeechLMs that directly input and output speech tokens through a decoder-only Transformer (SpeechGPT (Zhang et al., [2023](https://arxiv.org/html/2606.14528#bib.bib35)), GLM-4-Voice (Zeng et al., [2024](https://arxiv.org/html/2606.14528#bib.bib2)), IntrinsicVoice (Zhang et al., [2024b](https://arxiv.org/html/2606.14528#bib.bib31)), Spirit-LM (Nguyen et al., [2025](https://arxiv.org/html/2606.14528#bib.bib32)), Step-Audio (Huang et al., [2025](https://arxiv.org/html/2606.14528#bib.bib27); Wu et al., [2025](https://arxiv.org/html/2606.14528#bib.bib33))), and modular SpeechLMs that add speech encoders and decoders around the LLM (LLaMA-Omni (Fang et al., [2025a](https://arxiv.org/html/2606.14528#bib.bib16)), LLaMA-Omni 2 (Fang et al., [2025b](https://arxiv.org/html/2606.14528#bib.bib17)), Mini-Omni (Xie and Wu, [2024a](https://arxiv.org/html/2606.14528#bib.bib30)), SALMONN (Tang et al., [2024](https://arxiv.org/html/2606.14528#bib.bib36)), Freeze-Omni (Wang et al., [2024](https://arxiv.org/html/2606.14528#bib.bib25)), MinMo (Chen et al., [2025](https://arxiv.org/html/2606.14528#bib.bib24)), Stream-Omni (Zhang et al., [2025](https://arxiv.org/html/2606.14528#bib.bib28)), VITA-1.5 (Fu et al., [2025](https://arxiv.org/html/2606.14528#bib.bib34)), VITA-Audio (Long et al., [2025](https://arxiv.org/html/2606.14528#bib.bib29))). Native models inherit the LLM training stack with minimal architectural changes, but they enlarge the per-step softmax with the union of text and speech tokens and require continued pretraining on large amounts of speech to keep the model's text capability from collapsing. Modular models keep the LLM vocabulary clean and reuse strong off-the-shelf speech encoders and decoders, at the cost of a more elaborate training pipeline that must align the inserted modules with the frozen or partially-trained LLM. Both families assume that a complete user utterance is available before the model speaks, and segment the user audio with an external VAD; BayLing-Duplex removes the VAD entirely and lets the model itself decide when to speak.

#### Full\-Duplex Speech Language Models

Full-duplex SpeechLMs lift the turn-based assumption. dGSLM (Nguyen et al., [2023](https://arxiv.org/html/2606.14528#bib.bib15)) pioneered dual-channel modeling on naturalistic conversational speech, demonstrating that a single autoregressive model can predict both speakers without an external turn-taking signal, but at the cost of relying on tens of thousands of hours of two-channel dialogue and offering limited semantic coverage. Moshi (Défossez et al., [2024](https://arxiv.org/html/2606.14528#bib.bib1)) folds the user and assistant audio into two parallel residual vector quantization (RVQ) streams stacked over a text Inner Monologue and uses a depth-Transformer to emit one frame per step for low theoretical latency; the parallel-RVQ design requires per-codebook conditioning and full-duplex pretraining at the scale of millions of hours of speech. SyncLLM (Veluri et al., [2024](https://arxiv.org/html/2606.14528#bib.bib19)) embeds an explicit wall-clock signal so that user and assistant tokens advance in lock-step, but the time tokens enlarge the vocabulary and shift the burden of timing to the LLM. OmniFlatten (Zhang et al., [2024a](https://arxiv.org/html/2606.14528#bib.bib20)) flattens user-speech, assistant-speech, and assistant-text tokens into a single GPT stream, which simplifies the training stack but interleaves channels at the per-token granularity and fragments the contiguous text monologue. SALMONN-omni (Yu et al., [2024](https://arxiv.org/html/2606.14528#bib.bib18)) runs on continuous embeddings with a thinking mechanism, sidestepping the discretization trade-offs but requiring a separate codec for the audio output and an extra branch for the thinking trace. LSLM (Ma et al., [2024](https://arxiv.org/html/2606.14528#bib.bib21)), Mini-Omni2 (Xie and Wu, [2024b](https://arxiv.org/html/2606.14528#bib.bib22)), and Freeze-Omni (Wang et al., [2024](https://arxiv.org/html/2606.14528#bib.bib25)) reach partial duplexity through input-side barge-in or command-based interruption: the model can be cut off but cannot decide for itself when to start or stop talking. Zhang et al. ([2024c](https://arxiv.org/html/2606.14528#bib.bib23)) reach duplexity at the text level via time-division multiplexing. The most relevant concurrent work is FLM-Audio (Yao et al., [2025](https://arxiv.org/html/2606.14528#bib.bib26)), which similarly preserves natural text monologues but merges all channels at every step. BayLing-Duplex interleaves three channels at a coarse block granularity that preserves contiguous text monologues, and unifies all dialogue-state decisions as next-token prediction over the standard LLM vocabulary.

#### Text Channels and Inner Monologues

Many full-duplex SpeechLMs introduce an intermediate text channel as scaffolding for speech generation. Moshi (Défossez et al., [2024](https://arxiv.org/html/2606.14528#bib.bib1)) interleaves a per-frame Inner Monologue track that emits time-aligned text before each frame of audio and reports that this track is critical for keeping the spoken response semantically coherent. SALMONN-omni (Yu et al., [2024](https://arxiv.org/html/2606.14528#bib.bib18)) pursues a similar idea on continuous embeddings with a separate thinking branch. OmniFlatten (Zhang et al., [2024a](https://arxiv.org/html/2606.14528#bib.bib20)) and FLM-Audio (Yao et al., [2025](https://arxiv.org/html/2606.14528#bib.bib26)) likewise weave text alongside speech tokens, reusing the LLM's text generation pathway to plan content. Our text channel inherits this design at the granularity of one block rather than one frame: it never reaches the user but conditions the assistant-speech tokens within the same block, and it is the channel where every dialogue-state decision is made. Compared with per-step interleaving, the coarser scheme keeps each utterance's text contiguous over several consecutive blocks, which we conjecture aligns better with the text distribution that the underlying LLM was pretrained on.
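The contrast between block-level and per-step interleaving can be illustrated with a toy sketch. The channel order within a block (user speech, then text, then assistant speech), the per-channel run lengths, and all function names are assumptions of this illustration; the source only states that three channels are interleaved per block and that the text conditions the assistant speech in the same block.

```python
def interleave_blocks(user, text, speech, n_user, n_text, n_speech):
    """Block-level interleaving: each block holds a run of user-speech
    tokens, then a run of text tokens, then a run of assistant-speech
    tokens. Run lengths here are hypothetical."""
    seq = []
    n_blocks = len(user) // n_user
    for b in range(n_blocks):
        seq += user[b * n_user:(b + 1) * n_user]
        seq += text[b * n_text:(b + 1) * n_text]
        seq += speech[b * n_speech:(b + 1) * n_speech]
    return seq


def interleave_per_step(user, text, speech):
    """Per-step interleaving: one token from each channel at every step."""
    seq = []
    for u, t, s in zip(user, text, speech):
        seq += [u, t, s]
    return seq


def longest_text_run(seq):
    """Length of the longest run of consecutive text tokens (prefix 't')."""
    best = run = 0
    for tok in seq:
        run = run + 1 if tok.startswith("t") else 0
        best = max(best, run)
    return best


user = [f"u{i}" for i in range(4)]
text = [f"t{i}" for i in range(4)]
speech = [f"s{i}" for i in range(4)]

blocked = interleave_blocks(user, text, speech, n_user=2, n_text=2, n_speech=2)
stepped = interleave_per_step(user, text, speech)
```

In this toy setting the blocked layout keeps text tokens in runs of two, while the per-step layout never places two text tokens side by side; larger blocks lengthen the contiguous text runs accordingly.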

## 6Conclusion

We introduce BayLing\-Duplex, a native full\-duplex SpeechLM whose multi\-channel interleaved sequence lets a single autoregressive LLM decide when to listen, speak, and stop\. Four dialogue\-state tokens added to the standard vocabulary turn turn\-taking and interruption into ordinary next\-token prediction, with no auxiliary classifier or scheduler on top of GLM\-4\-Voice\. With only 400K full\-duplex samples and a lightweight DPO stage, BayLing\-Duplex reaches 92% turn\-taking and 100% interruption success\.

## Limitations

The training and evaluation audio is fully synthesized: it is single-speaker, near-field, and noise-free. Real-world deployment must handle background noise, reverberation, and competing speakers, all of which can shift the boundaries detected in the user channel and trigger spurious turn-taking events. We expect data augmentation (additive noise, room impulse response, distractor speakers) to mitigate this, and we leave a controlled study across in-car, outdoor, and meeting-room conditions to future work. Our analysis also focuses on turn-taking and interruption; backchannels, multi-party conversation, and emotion-aware turn-taking are not explored. The chosen block size $N=10$ sets a floor of 0.8 s on the response latency; reducing $N$ would lower latency but shrink the per-block text budget, and we leave a systematic sweep over $N$ to future work. Finally, like Moshi and OmniFlatten, our model is bounded by the quality and the bias of the underlying SpeechLM (GLM-4-Voice); we share its limitations on rare languages, code-switching, and out-of-distribution acoustic conditions.
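As a rough back-of-the-envelope aid, the stated figures ($N=10$ giving a 0.8 s floor) imply 0.08 s per unit of block size. The sketch below assumes the floor scales linearly with $N$, which the text suggests but does not state exactly; the function name and the candidate values of $N$ are hypothetical.

```python
# Derived from the stated pairing N = 10 -> 0.8 s; linear scaling is assumed.
SECONDS_PER_UNIT = 0.8 / 10


def latency_floor(n):
    """Minimum response latency in seconds for block size n (linear model)."""
    return n * SECONDS_PER_UNIT


# Latency floor for the reported block size and two hypothetical smaller ones.
floors = {n: round(latency_floor(n), 2) for n in (10, 5, 2)}
```

Under this assumption, halving the block size to 5 would halve the floor to 0.4 s, at the cost of the smaller per-block text budget noted above.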

## Ethical Considerations

BayLing\-Duplex synthesizes natural\-sounding speech in real time, which lowers the barrier for voice\-based impersonation, social\-engineering attacks, and audio disinformation\. Continuous\-listening interfaces also raise privacy concerns: always\-on user\-channel input may inadvertently capture private speech, including utterances from bystanders who have not consented to recording\. As BayLing\-Duplex is built on top of GLM\-4\-Voice, it inherits the linguistic, demographic, and acoustic biases of that backbone, and its turn\-taking and interruption decisions may behave unevenly across speakers, accents, and languages\. We release the model strictly for research on full\-duplex dialogue modeling; production deployments should add speaker verification, on\-device wake\-word gating, watermarking of synthesized speech, and explicit user consent for continuous capture\.

## References

- J\. Berant, A\. Chou, R\. Frostig, and P\. Liang \(2013\)Semantic parsing on Freebase from question\-answer pairs\.InProc\. of EMNLP,Cited by:[§4\.2](https://arxiv.org/html/2606.14528#S4.SS2.SSS0.Px1.p1.1)\.
- Q\. Chen, Y\. Chen, Y\. Chen, M\. Chen, Y\. Chen, C\. Deng, Z\. Du, R\. Gao, C\. Gao, Z\. Gao, Y\. Li, X\. Lv, J\. Liu, H\. Luo, B\. Ma, C\. Ni, X\. Shi, J\. Tang, H\. Wang, H\. Wang, W\. Wang, Y\. Wang, Y\. Xu, F\. Yu, Z\. Yan, Y\. Yang, B\. Yang, X\. Yang, G\. Yang, T\. Zhao, Q\. Zhang, S\. Zhang, N\. Zhao, P\. Zhang, C\. Zhang, and J\. Zhou \(2025\)MinMo: a multimodal large language model for seamless voice interaction\.External Links:2501\.06282,[Link](https://arxiv.org/abs/2501.06282)Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- A\. Défossez, L\. Mazaré, M\. Orsini, A\. Royer, P\. Pérez, H\. Jégou, E\. Grave, and N\. Zeghidour \(2024\)Moshi: a speech\-text foundation model for real\-time dialogue\.arXiv preprint arXiv:2410\.00037\.Cited by:[§1](https://arxiv.org/html/2606.14528#S1.p2.1),[§1](https://arxiv.org/html/2606.14528#S1.p3.1),[§1](https://arxiv.org/html/2606.14528#S1.p4.1),[§2\.1](https://arxiv.org/html/2606.14528#S2.SS1.SSS0.Px3.p1.9),[§4\.2](https://arxiv.org/html/2606.14528#S4.SS2.SSS0.Px4.p1.1),[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px3.p1.1)\.
- Z\. Du, Q\. Chen, S\. Zhang, K\. Hu, H\. Lu, Y\. Yang, H\. Hu, S\. Zheng, Y\. Gu, Z\. Ma, Z\. Gao, and Z\. Yan \(2024\)CosyVoice: a scalable multilingual zero\-shot text\-to\-speech synthesizer based on supervised semantic tokens\.arXiv preprint arXiv:2407\.05407\.Cited by:[§2](https://arxiv.org/html/2606.14528#S2.p1.1),[§3](https://arxiv.org/html/2606.14528#S3.p1.1)\.
- Q\. Fang, S\. Guo, Y\. Zhou, Z\. Ma, S\. Zhang, and Y\. Feng \(2025a\)LLaMA\-Omni: seamless speech interaction with large language models\.InProc\. of ICLR,Cited by:[§1](https://arxiv.org/html/2606.14528#S1.p2.1),[§4\.2](https://arxiv.org/html/2606.14528#S4.SS2.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- Q\. Fang, Y\. Zhou, S\. Guo, S\. Zhang, and Y\. Feng \(2025b\)LLaMA\-Omni 2: LLM\-based real\-time spoken chatbot with autoregressive streaming speech synthesis\.InProc\. of ACL,Cited by:[§1](https://arxiv.org/html/2606.14528#S1.p2.1),[§3](https://arxiv.org/html/2606.14528#S3.p1.1),[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- C\. Fu, H\. Lin, X\. Wang, Y\. Zhang, Y\. Shen, X\. Liu, H\. Cao, Z\. Long, H\. Gao, K\. Li, L\. Ma, X\. Zheng, R\. Ji, X\. Sun, C\. Shan, and R\. He \(2025\)VITA\-1\.5: towards GPT\-4o level real\-time vision and speech interaction\.arXiv preprint arXiv:2501\.01957\.Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- Team GLM, A\. Zeng, B\. Xu, B\. Wang, C\. Zhang, D\. Yin, D\. Zhang, D\. Rojas, G\. Feng, H\. Zhao, H\. Lai, H\. Yu, H\. Wang, J\. Sun, J\. Zhang, J\. Cheng, J\. Gui, J\. Tang, J\. Zhang, J\. Sun, J\. Li, L\. Zhao, L\. Wu, L\. Zhong, M\. Liu, M\. Huang, P\. Zhang, Q\. Zheng, R\. Lu, S\. Duan, S\. Zhang, S\. Cao, S\. Yang, W\. L\. Tam, W\. Zhao, X\. Liu, X\. Xia, X\. Zhang, X\. Gu, X\. Lv, X\. Liu, X\. Liu, X\. Yang, X\. Song, X\. Zhang, Y\. An, Y\. Xu, Y\. Niu, Y\. Yang, Y\. Li, Y\. Bai, Y\. Dong, Z\. Qi, Z\. Wang, Z\. Yang, Z\. Du, Z\. Hou, and Z\. Wang \(2024\) ChatGLM: a family of large language models from GLM\-130B to GLM\-4 All Tools\. External Links: 2406\.12793, [Link](https://arxiv.org/abs/2406.12793) Cited by: [§2](https://arxiv.org/html/2606.14528#S2.p1.1)\.
- A\. Huang, B\. Wu, B\. Wang, C\. Yan, C\. Hu, C\. Feng, F\. Tian, F\. Shen, J\. Li, M\. Chen, P\. Liu, R\. Miao, W\. You, X\. Chen, X\. Yang, Y\. Huang, Y\. Zhang, Z\. Gong, Z\. Zhang, H\. Zhou, J\. Sun, B\. Li, C\. Feng, C\. Wan, H\. Hu, J\. Wu, J\. Zhen, R\. Ming, S\. Yuan, X\. Zhang, Y\. Zhou, B\. Li, B\. Ma, H\. Wang, K\. An, W\. Ji, W\. Li, X\. Wen, X\. Kong, Y\. Ma, Y\. Liang, Y\. Mou, B\. Ahmidi, B\. Wang, B\. Li, C\. Miao, C\. Xu, C\. Wang, D\. Shi, D\. Sun, D\. Hu, D\. Sai, E\. Liu, G\. Huang, G\. Yan, H\. Wang, H\. Jia, H\. Zhang, J\. Gong, J\. Guo, J\. Liu, J\. Liu, J\. Feng, J\. Wu, J\. Wu, J\. Yang, J\. Wang, J\. Zhang, J\. Lin, K\. Li, L\. Xia, L\. Zhou, L\. Zhao, L\. Gu, M\. Chen, M\. Wu, M\. Li, M\. Li, M\. Li, M\. Liang, N\. Wang, N\. Hao, Q\. Wu, Q\. Tan, R\. Sun, S\. Shuai, S\. Pang, S\. Yang, S\. Gao, S\. Yuan, S\. Liu, S\. Deng, S\. Jiang, S\. Liu, T\. Cao, T\. Wang, W\. Deng, W\. Xie, W\. Ming, W\. He, W\. Sun, X\. Han, X\. Huang, X\. Deng, X\. Liu, X\. Wu, X\. Zhao, Y\. Wei, Y\. Yu, Y\. Cao, Y\. Li, Y\. Ma, Y\. Xu, Y\. Wang, Y\. Shi, Y\. Wang, Y\. Zhou, Y\. Zhong, Y\. Zhang, Y\. Wei, Y\. Luo, Y\. Lu, Y\. Yin, Y\. Luo, Y\. Ding, Y\. Yan, Y\. Dai, Y\. Yang, Z\. Xie, Z\. Ge, Z\. Sun, Z\. Huang, Z\. Chang, Z\. Guan, Z\. Yang, Z\. Zhang, B\. Jiao, D\. Jiang, H\. Shum, J\. Chen, J\. Li, S\. Zhou, X\. Zhang, X\. Zhang, and Y\. Zhu \(2025\)Step\-audio: unified understanding and generation in intelligent speech interaction\.External Links:2502\.11946,[Link](https://arxiv.org/abs/2502.11946)Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- J\. Kong, J\. Kim, and J\. Bae \(2020\)HiFi\-GAN: generative adversarial networks for efficient and high fidelity speech synthesis\.InProc\. of NeurIPS,Cited by:[§2](https://arxiv.org/html/2606.14528#S2.p1.1)\.
- X\. Li, T\. Zhang, Y\. Dubois, R\. Taori, I\. Gulrajani, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto \(2023\)AlpacaEval: an automatic evaluator of instruction\-following models\.Note:[https://github\.com/tatsu\-lab/alpaca\_eval](https://github.com/tatsu-lab/alpaca_eval)Cited by:[§4\.2](https://arxiv.org/html/2606.14528#S4.SS2.SSS0.Px2.p1.1)\.
- Y\. Lipman, R\. T\. Q\. Chen, H\. Ben\-Hamu, M\. Nickel, and M\. Le \(2023\)Flow matching for generative modeling\.InProc\. of ICLR,Cited by:[§2](https://arxiv.org/html/2606.14528#S2.p1.1)\.
- Z\. Long, Y\. Shen, C\. Fu, H\. Gao, L\. Li, P\. Chen, M\. Zhang, H\. Shao, J\. Li, J\. Peng, H\. Cao, K\. Li, R\. Ji, and X\. Sun \(2025\)VITA\-audio: fast interleaved cross\-modal token generation for efficient large speech\-language model\.External Links:2505\.03739,[Link](https://arxiv.org/abs/2505.03739)Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- Z\. Ma, Y\. Song, C\. Du, J\. Cong, Z\. Chen, Y\. Wang, Y\. Wang, and X\. Chen \(2024\)Language model can listen while speaking\.arXiv preprint arXiv:2408\.02622\.Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px2.p1.1)\.
- E\. Nachmani, A\. Levkovitch, R\. Hirsch, J\. Salazar, C\. Asawaroengchai, S\. Mariooryad, E\. Rivlin, R\. Skerry\-Ryan, and M\. T\. Ramanovich \(2024\)Spoken question answering and speech continuation using spectrogram\-powered LLM\.InProc\. of ICLR,Cited by:[§4\.2](https://arxiv.org/html/2606.14528#S4.SS2.SSS0.Px1.p1.1)\.
- T\. A\. Nguyen, E\. Kharitonov, J\. Copet, Y\. Adi, W\. Hsu, A\. Elkahky, P\. Tomasello, R\. Algayres, B\. Sagot, A\. Mohamed, and E\. Dupoux \(2023\)Generative spoken dialogue language modeling\.InTransactions of the ACL,Cited by:[§1](https://arxiv.org/html/2606.14528#S1.p3.1),[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px2.p1.1)\.
- T\. A\. Nguyen, B\. Muller, B\. Yu, M\. R\. Costa\-jussa, M\. Elbayad, S\. Popuri, C\. Ropers, P\. Duquenne, R\. Algayres, R\. Mavlyutov, I\. Gat, M\. Williamson, G\. Synnaeve, J\. Pino, B\. Sagot, and E\. Dupoux \(2025\)SpiRit\-LM: interleaved spoken and written language model\.Transactions of the ACL\.Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- OpenAI \(2024\)Hello GPT\-4o\.Note:[https://openai\.com/index/hello\-gpt\-4o/](https://openai.com/index/hello-gpt-4o/)Cited by:[§1](https://arxiv.org/html/2606.14528#S1.p1.1),[§4\.2](https://arxiv.org/html/2606.14528#S4.SS2.SSS0.Px3.p5.1)\.
- A\. Radford, J\. W\. Kim, T\. Xu, G\. Brockman, C\. McLeavey, and I\. Sutskever \(2023\)Robust speech recognition via large\-scale weak supervision\.InProc\. of ICML,Cited by:[§2](https://arxiv.org/html/2606.14528#S2.p1.1),[§4\.2](https://arxiv.org/html/2606.14528#S4.SS2.p1.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, S\. Ermon, C\. D\. Manning, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.InProc\. of NeurIPS,Cited by:[§1](https://arxiv.org/html/2606.14528#S1.p4.1)\.
- Silero Team \(2024\)Silero VAD: pre\-trained enterprise\-grade voice activity detector\.Note:[https://github\.com/snakers4/silero\-vad](https://github.com/snakers4/silero-vad)Cited by:[§4\.2](https://arxiv.org/html/2606.14528#S4.SS2.SSS0.Px3.p2.3),[§4\.2](https://arxiv.org/html/2606.14528#S4.SS2.p1.1)\.
- C\. Tang, W\. Yu, G\. Sun, X\. Chen, T\. Tan, W\. Li, L\. Lu, Z\. Ma, and C\. Zhang \(2024\)SALMONN: towards generic hearing abilities for large language models\.InProc\. of ICLR,Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- B\. Veluri, B\. N\. Peloquin, B\. Yu, H\. Gong, and S\. Gollakota \(2024\)Beyond turn\-based interfaces: synchronous LLMs as full\-duplex dialogue agents\.InProc\. of EMNLP,Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px2.p1.1)\.
- X\. Wang, Y\. Li, C\. Fu, Y\. Shen, L\. Xie, K\. Li, X\. Sun, and L\. Ma \(2024\)Freeze\-Omni: a smart and low latency speech\-to\-speech dialogue model with frozen LLM\.arXiv preprint arXiv:2411\.00774\.Cited by:[§1](https://arxiv.org/html/2606.14528#S1.p2.1),[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px2.p1.1)\.
- B\. Wu, C\. Yan, C\. Hu, C\. Yi, C\. Feng, F\. Tian, F\. Shen, G\. Yu, H\. Zhang, J\. Li, M\. Chen, P\. Liu, W\. You, X\. T\. Zhang, X\. Li, X\. Yang, Y\. Deng, Y\. Huang, Y\. Li, Y\. Zhang, Z\. You, B\. Li, C\. Wan, H\. Hu, J\. Zhen, S\. Chen, S\. Yuan, X\. Zhang, Y\. Jiang, Y\. Zhou, Y\. Yang, B\. Li, B\. Ma, C\. Song, D\. Pang, G\. Hu, H\. Sun, K\. An, N\. Wang, S\. Gao, W\. Ji, W\. Li, W\. Sun, X\. Wen, Y\. Ren, Y\. Ma, Y\. Lu, B\. Wang, B\. Li, C\. Miao, C\. Liu, C\. Xu, D\. Shi, D\. Hu, D\. Wu, E\. Liu, G\. Huang, G\. Yan, H\. Zhang, H\. Nie, H\. Jia, H\. Zhou, J\. Sun, J\. Wu, J\. Wu, J\. Yang, J\. Yang, J\. Lin, K\. Li, L\. Yang, L\. Shi, L\. Zhou, L\. Gu, M\. Li, M\. Li, M\. Li, N\. Wu, Q\. Han, Q\. Tan, S\. Pang, S\. Fan, S\. Liu, T\. Cao, W\. Lu, W\. He, W\. Xie, X\. Zhao, X\. Li, Y\. Yu, Y\. Yang, Y\. Liu, Y\. Lu, Y\. Wang, Y\. Ding, Y\. Liang, Y\. Lu, Y\. Luo, Y\. Yin, Y\. Zhan, Y\. Zhang, Z\. Yang, Z\. Zhang, B\. Jiao, D\. Jiang, H\. Shum, J\. Chen, J\. Li, X\. Zhang, and Y\. Zhu \(2025\)Step\-audio 2 technical report\.External Links:2507\.16632,[Link](https://arxiv.org/abs/2507.16632)Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- Z\. Xie and C\. Wu \(2024a\)Mini\-Omni: language models can hear, talk while thinking in streaming\.arXiv preprint arXiv:2408\.16725\.Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- Z\. Xie and C\. Wu \(2024b\)Mini\-Omni2: towards open\-source GPT\-4o with vision, speech and duplex capabilities\.arXiv preprint arXiv:2410\.11190\.Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px2.p1.1)\.
- Y\. Yao, X\. Li, X\. Jiang, X\. Fang, N\. Yu, W\. Ma, A\. Sun, and Y\. Wang \(2025\)FLM\-Audio: natural monologues improves native full\-duplex chatbots via dual training\.arXiv preprint arXiv:2509\.02521\.Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px3.p1.1)\.
- W\. Yu, S\. Wang, X\. Yang, X\. Chen, X\. Tian, J\. Zhang, G\. Sun, L\. Lu, Y\. Wang, and C\. Zhang \(2024\)SALMONN\-omni: a codec\-free LLM for full\-duplex speech understanding and generation\.arXiv preprint arXiv:2411\.18138\.Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px3.p1.1)\.
- A\. Zeng, Z\. Du, M\. Liu, K\. Wang, S\. Jiang, L\. Zhao, Y\. Dong, and J\. Tang \(2024\)GLM\-4\-Voice: towards intelligent and human\-like end\-to\-end spoken chatbot\.arXiv preprint arXiv:2412\.02612\.Cited by:[§1](https://arxiv.org/html/2606.14528#S1.p2.1),[§1](https://arxiv.org/html/2606.14528#S1.p4.1),[§2](https://arxiv.org/html/2606.14528#S2.p1.1),[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- D\. Zhang, S\. Li, X\. Zhang, J\. Zhan, P\. Wang, Y\. Zhou, and X\. Qiu \(2023\)SpeechGPT: empowering large language models with intrinsic cross\-modal conversational abilities\.arXiv preprint arXiv:2305\.11000\.Cited by:[§1](https://arxiv.org/html/2606.14528#S1.p2.1),[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- Q\. Zhang, L\. Cheng, C\. Deng, Q\. Chen, W\. Wang, S\. Zheng, J\. Liu, H\. Yu, C\. Tan, Z\. Du, and S\. Zhang \(2024a\)OmniFlatten: an end\-to\-end GPT model for seamless voice conversation\.arXiv preprint arXiv:2410\.17799\.Cited by:[§1](https://arxiv.org/html/2606.14528#S1.p3.1),[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px3.p1.1)\.
- S\. Zhang, S\. Guo, Q\. Fang, Y\. Zhou, and Y\. Feng \(2025\)Stream\-Omni: simultaneous multimodal interactions with large language\-vision\-speech model\.arXiv preprint arXiv:2506\.13642\.Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- X\. Zhang, X\. Lyu, Z\. Du, Q\. Chen, D\. Zhang, H\. Hu, C\. Tan, T\. Zhao, Y\. Wang, B\. Zhang, H\. Lu, Y\. Zhou, and X\. Qiu \(2024b\)IntrinsicVoice: empowering LLMs with intrinsic real\-time voice interaction abilities\.arXiv preprint arXiv:2410\.08035\.Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px1.p1.1)\.
- X\. Zhang, Y\. Chen, S\. Hu, X\. Han, Z\. Xu, Y\. Xu, W\. Zhao, M\. Sun, and Z\. Liu \(2024c\)Beyond the turn\-based game: enabling real\-time conversations with duplex models\.arXiv preprint arXiv:2406\.15718\.Cited by:[§5](https://arxiv.org/html/2606.14528#S5.SS0.SSS0.Px2.p1.1)\.
- Y\. Zheng, R\. Zhang, J\. Zhang, Y\. Ye, Z\. Luo, Z\. Feng, and Y\. Ma \(2024\)LlamaFactory: unified efficient fine\-tuning of 100\+ language models\.InProc\. of ACL: System Demonstrations,Cited by:[§4\.1](https://arxiv.org/html/2606.14528#S4.SS1.SSS0.Px2.p1.4)\.