Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

arXiv cs.CL Papers

Summary

This paper identifies 'state inertia' in full-duplex spoken language models, where the model's internal predictive focus lags during user interruptions, and proposes a training-free activation steering method to improve interruption handling.

arXiv:2606.11386v1 Announce Type: new Abstract: Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored. We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream. Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input. We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead. Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.

Cached at: 06/11/26, 01:36 PM

# Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering
Source: [https://arxiv.org/html/2606.11386](https://arxiv.org/html/2606.11386)
###### Abstract

Full\-duplex spoken language models \(FD\-SLMs\) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored\. We analyze the predictive behavior encoded in FD\-SLM hidden representations and find that they exhibit stream\-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream\. Building on this observation, we show that FD\-SLMs dynamically modulate their internal predictive focus between two states: a *generative state* aligned with model output generation and a *perceptive state* aligned with incoming user input\. However, this modulation can lag behind abrupt changes in conversational context\. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input\. We term this delayed internal transition *state inertia*\. To quantify its downstream impact, we introduce the *Zero\-Buffer Benchmark \(ZBB\)*, a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly\. We evaluate this setting using response correctness and initial\-word occurrence rate \(IWOR\)\. Finally, we mitigate state inertia through activation steering with a *perception vector*, a training\-free intervention with little additional computational overhead\. Across multiple state\-of\-the\-art FD\-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine\-tuning\.

## 1 Introduction

Achieving human\-level conversational fluency has long been a central goal in spoken dialogue systems\[[2](https://arxiv.org/html/2606.11386#bib.bib2),[18](https://arxiv.org/html/2606.11386#bib.bib37),[21](https://arxiv.org/html/2606.11386#bib.bib43)\]\. Recently, *full\-duplex spoken language models \(FD\-SLMs\)* have attracted increasing attention for their ability to listen and speak simultaneously, moving beyond the rigid turn\-by\-turn interaction of conventional half\-duplex spoken language models \(HD\-SLMs\)\[[6](https://arxiv.org/html/2606.11386#bib.bib38),[5](https://arxiv.org/html/2606.11386#bib.bib52),[10](https://arxiv.org/html/2606.11386#bib.bib44),[39](https://arxiv.org/html/2606.11386#bib.bib46),[21](https://arxiv.org/html/2606.11386#bib.bib43),[26](https://arxiv.org/html/2606.11386#bib.bib39),[14](https://arxiv.org/html/2606.11386#bib.bib53),[42](https://arxiv.org/html/2606.11386#bib.bib54),[46](https://arxiv.org/html/2606.11386#bib.bib55)\]\. In practice, FD\-SLMs often operate with a dual\-channel structure\[[33](https://arxiv.org/html/2606.11386#bib.bib47),[13](https://arxiv.org/html/2606.11386#bib.bib45),[22](https://arxiv.org/html/2606.11386#bib.bib28),[2](https://arxiv.org/html/2606.11386#bib.bib2)\], jointly processing a user stream containing incoming user speech and a model stream representing the model’s own speech\. This design enables timing\-sensitive conversational behaviors such as backchanneling, smooth interruption handling, fluid turn\-taking, and synchronized interaction\[[6](https://arxiv.org/html/2606.11386#bib.bib38),[26](https://arxiv.org/html/2606.11386#bib.bib39),[24](https://arxiv.org/html/2606.11386#bib.bib48),[11](https://arxiv.org/html/2606.11386#bib.bib4)\]\.

Despite these capabilities, the internal mechanism by which FD\-SLMs coordinate listening and speaking remains underexplored\. Inspired by the *logit lens*\[[27](https://arxiv.org/html/2606.11386#bib.bib20),[4](https://arxiv.org/html/2606.11386#bib.bib19),[30](https://arxiv.org/html/2606.11386#bib.bib51)\], we analyze the predictive behavior encoded in FD\-SLM hidden representations\. Our analysis reveals “stream\-specific” predictive patterns: *during listening, hidden representations preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream*\. We further find that *FD\-SLMs coordinate listening and speaking by dynamically modulating between two states: the “generative state” and the “perceptive state”*\. However, this modulation does not always happen as quickly as the conversation demands\. In particular, we find that when a user abruptly interrupts the model while it is speaking, the model remains transiently biased toward the generative state and fails to transition promptly into the perceptive state\. We refer to this phenomenon as “state inertia”\.

State inertia causes the model to miss the user input when an interruption occurs\. This loss of information degrades the quality of the model’s response\. Interestingly, “state inertia” resembles speech\-induced suppression in human auditory processing, where speech production can suppress activity in the auditory cortex and increase auditory response latency\[[28](https://arxiv.org/html/2606.11386#bib.bib31),[20](https://arxiv.org/html/2606.11386#bib.bib32)\]\.

To quantify the effect of state inertia, we introduce the Zero\-Buffer Benchmark \(ZBB\), a diagnostic benchmark for measuring whether FD\-SLMs can immediately understand user input after interruption\. Unlike existing benchmarks that evaluate overall dialogue quality\[[26](https://arxiv.org/html/2606.11386#bib.bib39),[29](https://arxiv.org/html/2606.11386#bib.bib35),[49](https://arxiv.org/html/2606.11386#bib.bib5),[40](https://arxiv.org/html/2606.11386#bib.bib6)\], ZBB places the critical semantic keyword as the first word of the interrupting utterance, with no leading filler or acoustic buffer\[[8](https://arxiv.org/html/2606.11386#bib.bib7),[15](https://arxiv.org/html/2606.11386#bib.bib8)\]\. This design directly tests whether the model perceives the earliest semantic information after interruption, precisely when state inertia is most likely to affect perception\. We evaluate model performance using response correctness and Initial Word Occurrence Rate \(IWOR\), which measures whether the model recognizes the beginning of the interruption\. Across multiple FD\-SLMs, interruption substantially degrades both metrics, showing that state inertia has measurable behavioral consequences\.

Finally, we mitigate state inertia using a training\-free *activation steering* method\[[38](https://arxiv.org/html/2606.11386#bib.bib9),[51](https://arxiv.org/html/2606.11386#bib.bib18),[32](https://arxiv.org/html/2606.11386#bib.bib3)\]\. We construct a *perception vector* from the difference between hidden representations in the generative state and the perceptive state, and apply it at the onset of interruption to steer the model toward the perceptive state\. This steering requires no fine\-tuning and adds only a lightweight inference\-time hidden\-state update\. Empirically, steering with the perception vector consistently improves interruption handling across multiple FD\-SLMs; for example, on PersonaPlex\[[33](https://arxiv.org/html/2606.11386#bib.bib47)\], it improves correctness from 28% to 45% and IWOR from 40% to 72%\.

In summary, our main contributions are as follows:

- Internal state analysis and state inertia: We show that FD\-SLM hidden representations exhibit stream\-specific predictive behavior and dynamically modulate between generative and perceptive states\. Building on this analysis, we identify *state inertia*, a delayed internal transition that reduces the model’s ability to process abrupt user interruptions\.
- Zero\-Buffer Benchmark \(ZBB\): We introduce ZBB, a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly, together with two metrics, correctness and Initial Word Occurrence Rate \(IWOR\)\.
- Training\-free mitigation via activation steering: We introduce a training\-free activation steering method based on a perception vector, which mitigates state inertia and substantially improves interruption handling across multiple FD\-SLMs\.

![Refer to caption](https://arxiv.org/html/2606.11386v1/images/neurips_main.png)

Figure 1: Overview of state inertia and activation steering\. \(a\) FD\-SLMs process concurrent user and model streams, conditioning on incoming user audio and previous model output tokens to generate text and audio tokens\. \(b\) FD\-SLMs coordinate speaking and listening by modulating between generative and perceptive states, tracked by generation and perception affinity\. During abrupt interruptions, the model can remain biased toward the generative state before transitioning to the perceptive state, causing early user input to be missed\. Injecting a perception vector at interruption onset accelerates this transition and improves interruption handling\.

## 2 Related Work

#### Full\-Duplex Spoken Language Models\.

Many existing spoken language models follow a half\-duplex interaction pattern, processing input and output speech sequentially and relying on explicit turn\-taking boundaries between listening and speaking\[[16](https://arxiv.org/html/2606.11386#bib.bib10),[45](https://arxiv.org/html/2606.11386#bib.bib11),[48](https://arxiv.org/html/2606.11386#bib.bib34)\]\. This rigid interaction pattern can make conversations feel unnatural, especially in scenarios involving interruptions, backchannels, or overlapping speech\[[34](https://arxiv.org/html/2606.11386#bib.bib42)\]\. In contrast, full\-duplex spoken language models \(FD\-SLMs\) support real\-time bidirectional speech interaction, allowing the model to continuously perceive user audio while generating speech responses\[[2](https://arxiv.org/html/2606.11386#bib.bib2),[39](https://arxiv.org/html/2606.11386#bib.bib46),[50](https://arxiv.org/html/2606.11386#bib.bib12)\]\. This capability enables more natural conversational behaviors, including backchanneling, interruption handling, and overlapping speech\[[26](https://arxiv.org/html/2606.11386#bib.bib39)\]\. Motivated by these advantages, recent work has developed several full\-duplex systems, including open\-source models such as Moshi\[[13](https://arxiv.org/html/2606.11386#bib.bib45)\], PersonaPlex\[[33](https://arxiv.org/html/2606.11386#bib.bib47)\], and Raon\-SpeechChat\[[22](https://arxiv.org/html/2606.11386#bib.bib28)\]\. While these systems demonstrate the promise of full\-duplex interaction, the internal mechanisms by which they coordinate simultaneous listening and speaking remain underexplored\.

#### FD\-SLM Benchmarks\.

Existing benchmarks for FD\-SLMs\[[24](https://arxiv.org/html/2606.11386#bib.bib48),[23](https://arxiv.org/html/2606.11386#bib.bib49),[29](https://arxiv.org/html/2606.11386#bib.bib35),[6](https://arxiv.org/html/2606.11386#bib.bib38)\] primarily assess macroscopic conversational properties\. These include turn\-taking dynamics, such as properly taking or yielding the floor; end\-to\-end response latency; overall instruction following; and full\-duplex\-specific behaviors such as backchanneling\. However, these benchmarks largely overlook a critical fine\-grained capability: whether the model accurately recognizes user input immediately following an abrupt interruption\. This distinction is important because a model may eventually recover and produce a plausible response while still missing information at the beginning of the interrupting utterance\. In this work, we assess this moment\-level listening ability with the benchmark introduced in Section [4](https://arxiv.org/html/2606.11386#S4)\.

#### Activation Steering\.

Activation steering modifies model behavior at inference time by injecting steering vectors into hidden states, often using mean\-activation differences between contrasting concepts or behaviors\[[51](https://arxiv.org/html/2606.11386#bib.bib18),[38](https://arxiv.org/html/2606.11386#bib.bib9),[32](https://arxiv.org/html/2606.11386#bib.bib3)\]\. Prior work has used steering to control text\-generation behavior, such as instruction following, persona modification, vulnerability analysis, and representation probing\[[35](https://arxiv.org/html/2606.11386#bib.bib36),[7](https://arxiv.org/html/2606.11386#bib.bib1),[41](https://arxiv.org/html/2606.11386#bib.bib50),[1](https://arxiv.org/html/2606.11386#bib.bib29)\]\. We instead apply activation steering to FD\-SLMs, using it to steer hidden representations toward processing user input and improve immediate interruption handling\.

## 3 Internal Mechanism of Full\-Duplex SLMs

### 3\.1 Full\-duplex Spoken Language Model

As shown in Figure [1](https://arxiv.org/html/2606.11386#S1.F1), Full\-Duplex Spoken Language Models \(FD\-SLMs\) process two concurrent speech streams: a *user stream* and a *model stream*\. An audio codec discretizes the continuous speech signals into audio tokens, allowing the interaction to be represented as a sequence of timesteps\[[12](https://arxiv.org/html/2606.11386#bib.bib16),[47](https://arxiv.org/html/2606.11386#bib.bib15)\]\. At each timestep $t$, the FD\-SLM conditions on the incoming user audio tokens and its previously generated model tokens, and then produces the next model response\. Practically, recent FD\-SLMs first generate text tokens as a semantically rich *intermediate representation*, which then guides the generation of the corresponding speech\[[5](https://arxiv.org/html/2606.11386#bib.bib52),[13](https://arxiv.org/html/2606.11386#bib.bib45),[33](https://arxiv.org/html/2606.11386#bib.bib47),[22](https://arxiv.org/html/2606.11386#bib.bib28)\]\.

Formally, at timestep $t$, let $u^{(t)}_{\mathrm{audio}}$ denote the user input audio tokens, and let $m^{(t)}_{\mathrm{audio}}$ and $m^{(t)}_{\mathrm{text}}$ denote the model output audio and text tokens, respectively\. Let $M_{\theta}$ denote an FD\-SLM parameterized by $\theta$\. At each timestep, $M_{\theta}$ generates the model output tokens $m^{(t)}_{\mathrm{text}}$ and $m^{(t)}_{\mathrm{audio}}$ conditioned on the current user input audio tokens $u^{(t)}_{\mathrm{audio}}$, the model’s previous audio and text tokens, and the preceding dialogue context $c^{(t)}$:

$$\left(m^{(t)}_{\mathrm{audio}}, m^{(t)}_{\mathrm{text}}\right) \sim M_{\theta}\left(\cdot \mid u^{(t)}_{\mathrm{audio}}, m^{(t-1)}_{\mathrm{audio}}, m^{(t-1)}_{\mathrm{text}}, c^{(t)}\right), \tag{1}$$

where $c^{(t)}$ summarizes the dialogue history before timestep $t$\.

Throughout the paper, we use a timestep as the minimal unit of processing rather than an individual token\. Unlike text\-only LLMs, FD\-SLMs may contain multiple tokens at each timestep across parallel streams, making timesteps a more consistent unit for our analysis\[[13](https://arxiv.org/html/2606.11386#bib.bib45),[2](https://arxiv.org/html/2606.11386#bib.bib2),[9](https://arxiv.org/html/2606.11386#bib.bib57),[43](https://arxiv.org/html/2606.11386#bib.bib58)\]\.

### 3\.2 Logit Lens

Unlike text\-only LLMs or half\-duplex SLMs, FD\-SLMs must continuously coordinate listening to the user with generation of their own speech\. However, how this coordination is represented internally remains poorly understood\. To analyze this internal behavior, we use the *logit lens*\[[27](https://arxiv.org/html/2606.11386#bib.bib20),[4](https://arxiv.org/html/2606.11386#bib.bib19)\], which projects hidden representations from intermediate layers into the vocabulary space, allowing us to inspect how token\-level predictions evolve across model depth\.

Let $h^{(t)} \in \mathbb{R}^{d}$ denote the hidden representation at the selected layer and timestep $t$, and let $W_{\mathrm{unembed}} \in \mathbb{R}^{|V| \times d}$ be the unembedding matrix, where $V$ denotes the token vocabulary\. For any target token $y \in V$, we define its projected probability under the hidden representation as

$$P(y \mid h^{(t)}) = \frac{\exp(w_{y}^{\top} h^{(t)})}{\sum_{v \in V} \exp(w_{v}^{\top} h^{(t)})}, \tag{2}$$

where $w_{y}^{\top}$ and $w_{v}^{\top}$ are the rows of $W_{\mathrm{unembed}}$ corresponding to tokens $y$ and $v$, respectively\.

At each timestep $t$, we then decode the most likely token under this projected distribution:

$$y_{\text{decode}}^{(t)} = \arg\max_{y \in V} P(y \mid h^{(t)}). \tag{3}$$

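
Equations (2) and (3) can be sketched in a few lines. The following is a minimal illustration rather than the authors' implementation: the function names `logit_lens_probs` and `logit_lens_decode`, the toy unembedding matrix, and the toy hidden state are all made up for the example, and plain Python lists stand in for tensors to keep the sketch dependency-free. Many logit-lens implementations also apply the model's final normalization layer before the unembedding; Eq. (2) as written omits it, and so does this sketch.

```python
import math


def logit_lens_probs(h, w_unembed):
    """Eq. (2): softmax over the rows of the unembedding matrix dotted with h.

    h is a hidden vector of length d; w_unembed has |V| rows of length d.
    """
    logits = [sum(w * x for w, x in zip(row, h)) for row in w_unembed]
    peak = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens_decode(h, w_unembed):
    """Eq. (3): index of the most likely token under the projected distribution."""
    probs = logit_lens_probs(h, w_unembed)
    return max(range(len(probs)), key=probs.__getitem__)


# Toy example (hypothetical values): 3-token vocabulary, 2-dimensional hidden state.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
h = [0.2, 2.0]
probs = logit_lens_probs(h, W)
decoded = logit_lens_decode(h, W)
```

In the analysis above, the same projection is applied to hidden states taken from intermediate layers, once per timestep.
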
To understand how the model’s internal behavior differs between listening and speaking, we construct a dataset for turn\-by\-turn interactions, where the model first listens to the user’s speech and then speaks to respond\. We conduct logit\-lens analysis on PersonaPlex\[[33](https://arxiv.org/html/2606.11386#bib.bib47)\] to qualitatively compare hidden\-representation predictions between the listening and speaking segments\. Further details of the dataset construction are provided in Appendix [A\.1](https://arxiv.org/html/2606.11386#A1.SS1)\.

Finding 1: FD\-SLM hidden representations exhibit *stream\-specific predictive focus*: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the output model stream\.

Table [1](https://arxiv.org/html/2606.11386#S3.T1) illustrates the predictive behavior on the user query “Can you compare renewable energy sources and explain their pros and cons in daily use?” While the user is speaking, the model stays silent because it is listening\. Even so, logit\-lens decoding of its intermediate layers anticipates the upcoming user words rather than the model’s own output: after hearing “explain,” intermediate layers decode tokens such as “why” and “how”; after hearing “their,” they decode tokens such as “own” and “pro”; and subsequent predictions align with “and” and “cons\.” During model speaking, in contrast, the decoded tokens track the model’s own output stream\. Complete layer\-wise decoding examples for both segments, together with additional decoded samples, are provided in Appendix [E](https://arxiv.org/html/2606.11386#A5)\.

Table 1: Examples of logit\-lens decoded predictions during a listening segment\. Bold tokens indicate decoded predictions that match or anticipate the actual incoming user speech\.

### 3\.3 Generative and Perceptive State

The qualitative observation using the logit lens suggests that hidden representations exhibit stream\-specific predictive focus: their predictions can be more aligned with either incoming user input or model output generation\. Building on this observation, we quantify how this predictive focus shifts over time by defining two affinity scores: *generation affinity* and *perception affinity*\.

Generation Affinity $\mathcal{S}_{\text{gen}}(t)$: Generation affinity $\mathcal{S}_{\text{gen}}(t)$ quantifies the extent to which the hidden representation $h^{(t)}$ supports generation of the output model stream\. We define generation affinity as the mean projected probability assigned to the model output text token $m^{(t)}_{\mathrm{text}}$ and audio token $m^{(t)}_{\mathrm{audio}}$ conditioned on the current hidden representation $h^{(t)}$:

$$\mathcal{S}_{\text{gen}}(t) = \frac{1}{2}\left(P(m^{(t)}_{\mathrm{audio}} \mid h^{(t)}) + P(m^{(t)}_{\mathrm{text}} \mid h^{(t)})\right). \tag{4}$$

A high $\mathcal{S}_{\text{gen}}(t)$ indicates that $h^{(t)}$ is strongly aligned with the model’s own output generation, suggesting that the FD\-SLM is in a *generative state*\.

Perception Affinity $\mathcal{S}_{\text{perc}}(t)$: Perception affinity $\mathcal{S}_{\text{perc}}(t)$ quantifies the extent to which the hidden representation $h^{(t)}$ supports prediction of the incoming user stream\. We define perception affinity as the projected probability assigned to the next incoming user audio token $u^{(t+1)}_{\mathrm{audio}}$ conditioned on the current hidden representation $h^{(t)}$:

$$\mathcal{S}_{\text{perc}}(t) = P(u^{(t+1)}_{\mathrm{audio}} \mid h^{(t)}). \tag{5}$$

A high $\mathcal{S}_{\text{perc}}(t)$ indicates that $h^{(t)}$ is strongly aligned with predicting the incoming user audio, suggesting that the FD\-SLM is in a *perceptive state*\.
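
As a rough illustration of Eqs. (4) and (5), the sketch below computes both affinity trajectories from per-timestep projected distributions. It assumes, for the sake of the example, that the logit-lens projection has already produced one distribution over audio tokens and one over text tokens at every timestep; how a particular FD-SLM partitions its vocabulary is model-specific and is not taken from the paper. The function name `affinity_trajectories` and all toy values are hypothetical.

```python
def affinity_trajectories(audio_probs, text_probs, m_audio, m_text, u_audio):
    """Return (s_gen, s_perc) as lists indexed by timestep.

    audio_probs[t], text_probs[t]: projected distributions at timestep t.
    m_audio[t], m_text[t]: model output token ids at timestep t.
    u_audio[t]: incoming user audio token id at timestep t.
    """
    steps = len(u_audio)
    # Eq. (4): mean probability of the model's own audio and text tokens.
    s_gen = [
        0.5 * (audio_probs[t][m_audio[t]] + text_probs[t][m_text[t]])
        for t in range(steps)
    ]
    # Eq. (5): probability of the NEXT user audio token, so the last
    # timestep has no perception score.
    s_perc = [audio_probs[t][u_audio[t + 1]] for t in range(steps - 1)]
    return s_gen, s_perc


# Toy distributions (hypothetical): 3 timesteps, 3 audio tokens, 2 text tokens.
audio_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
text_probs = [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]]
s_gen, s_perc = affinity_trajectories(
    audio_probs, text_probs,
    m_audio=[0, 1, 2], m_text=[0, 1, 0], u_audio=[2, 1, 0],
)
```

Note the index shift: generation affinity scores the tokens emitted at timestep $t$, whereas perception affinity scores the user token arriving at $t+1$.
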

We compute $\mathcal{S}_{\text{gen}}(t)$ and $\mathcal{S}_{\text{perc}}(t)$ on the 100 examples from the turn\-by\-turn interaction dataset\. For audio\-token probabilities, we use the first codec codebook, which primarily encodes semantic speech information, while later residual codebooks encode finer acoustic details\[[13](https://arxiv.org/html/2606.11386#bib.bib45),[47](https://arxiv.org/html/2606.11386#bib.bib15),[12](https://arxiv.org/html/2606.11386#bib.bib16)\]\. \(Footnote 1: Using only the first audio codebook also avoids FD\-SLM\-specific timing offsets associated with later residual codebooks\.\) We align all examples by setting $t=0$ to the end of the user utterance and average the resulting score trajectories across examples\. For demonstration, we show the results on PersonaPlex\.
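
The alignment-and-averaging step can be sketched as follows. This is a minimal illustration under assumptions: the helper name `aligned_mean`, the `window` parameter, and the choice to skip offsets that an example does not cover are all hypothetical, since the text does not say how examples of different lengths are handled.

```python
def aligned_mean(trajectories, end_indices, window):
    """Average score trajectories after shifting each one so that its
    end-of-user-utterance index becomes t = 0.

    Returns a dict mapping relative offset t to the mean score over the
    examples that have a value at that offset.
    """
    means = {}
    for offset in range(-window, window + 1):
        values = [
            traj[end + offset]
            for traj, end in zip(trajectories, end_indices)
            if 0 <= end + offset < len(traj)
        ]
        if values:  # skip offsets that no example covers
            means[offset] = sum(values) / len(values)
    return means


# Two toy perception-affinity trajectories with different utterance end points.
trajectories = [
    [0.9, 0.8, 0.1, 0.05],
    [0.7, 0.9, 0.7, 0.1, 0.1],
]
means = aligned_mean(trajectories, end_indices=[2, 3], window=2)
```
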

Finding 2: FD\-SLMs coordinate speaking and listening by dynamically modulating between generative and perceptive states\.

As shown in Figure [2](https://arxiv.org/html/2606.11386#S3.F2), $\mathcal{S}_{\text{gen}}(t)$ rises after $t=0$, indicating a transition into the generative state as the model prepares to respond\. Conversely, Figure [3](https://arxiv.org/html/2606.11386#S3.F3) shows that $\mathcal{S}_{\text{perc}}(t)$ remains high while the user is speaking \($t<0$\), indicating a perceptive state, and then rapidly decays after the user utterance ends\. Together, these results show that FD\-SLMs do not maintain generation and perception uniformly throughout the interaction; instead, they reconfigure their generative and perceptive states according to the conversational role they currently occupy\.

We note that the final layers show a different pattern: $\mathcal{S}_{\text{perc}}(t)$ remains low while $\mathcal{S}_{\text{gen}}(t)$ remains high even during user\-speaking segments\. This is expected because the final layers are closest to the output distribution and must still produce model tokens at every timestep, which often correspond to silence while the user is speaking\.

![Refer to caption](https://arxiv.org/html/2606.11386v1/images/generation_score.png)

Figure 2: Generation affinity $\mathcal{S}_{\text{gen}}(t)$ across internal layers of PersonaPlex on the turn\-by\-turn interaction dataset\. We align 100 examples at the end of the user utterance, with $t=0$ marking this transition\. Values are shown on a logarithmic scale\.

![Refer to caption](https://arxiv.org/html/2606.11386v1/images/perception_score.png)

Figure 3: Perception affinity $\mathcal{S}_{\text{perc}}(t)$ across internal layers of PersonaPlex on the turn\-by\-turn interaction dataset\. We align 100 examples at the end of the user utterance, with $t=0$ marking this transition\. Values are shown on a logarithmic scale\.

### 3\.4 State Inertia

Real\-world spoken conversations often involve overlapping speech, including interruptions and backchanneling\. Prior work reports that overlap occurs in over 40% of conversational turns\[[25](https://arxiv.org/html/2606.11386#bib.bib40),[19](https://arxiv.org/html/2606.11386#bib.bib41)\], making overlap handling an important capability for FD\-SLMs\. Unlike half\-duplex systems, FD\-SLMs are designed to listen while speaking; this simultaneous listening\-and\-speaking capability is a central motivation for full\-duplex speech modeling\.

In this work, we focus on user interruption as a representative and practically important form of speech overlapping\. During an interruption, the user begins speaking while the model is still generating, and the model must quickly shift attention to the new input, yield the floor when appropriate, and respond to the updated conversational context\. This scenario commonly arises in spoken assistant settings, where users may interrupt system speech to correct an error, redirect the dialogue, or provide input before the system finishes speaking\[[36](https://arxiv.org/html/2606.11386#bib.bib14),[31](https://arxiv.org/html/2606.11386#bib.bib13)\]\.

We compare how the generation and perception affinities, $\mathcal{S}_{\text{gen}}(t)$ and $\mathcal{S}_{\text{perc}}(t)$, evolve under two conditions: *interruption* and *no\-interruption*\. In the *interruption* condition, we first present a *speech\-inducing prompt*: an open\-ended question designed to place the model in a generative state\. We then interrupt the model using a user utterance from the dataset introduced in the previous section\. In the *no\-interruption* condition, we present the same user utterance without first prompting the model to produce a substantive response\. Detailed dataset construction is presented in Appendix [A\.2](https://arxiv.org/html/2606.11386#A1.SS2)\. For demonstration, we present an analysis using PersonaPlex as a representative example\.

Finding 3: The model exhibits *state inertia*: a tendency to remain in its prior state even when the conversational context requires an immediate transition\.

As shown in Figures [4](https://arxiv.org/html/2606.11386#S3.F4) and [5](https://arxiv.org/html/2606.11386#S3.F5), $\mathcal{S}_{\text{perc}}(t)$ remains low immediately after abrupt user input in the *interruption* condition compared with the *no\-interruption* condition\. This indicates that the model does not immediately transition out of the prompt\-induced generative state\. In this example, $\mathcal{S}_{\text{perc}}(t)$ takes approximately 7–8 timesteps, corresponding to about 0\.6 seconds, to recover to the perceptive state\. In contrast, under the *no\-interruption* condition, the model transitions into the perceptive state almost immediately when the user begins speaking\. We observe a similar delay in the generative\-state transition, as shown in Appendix [C](https://arxiv.org/html/2606.11386#A3)\. We refer to this delayed internal transition as *state inertia*\.
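
One simple way to turn the delay described above into a number is to count the timesteps between interruption onset and the first point where perception affinity crosses a recovery threshold. The sketch below is an illustration only: the threshold rule, the function name `recovery_latency`, and the toy trajectory are hypothetical, and the frame rate of 12.5 timesteps per second is an assumption chosen because it makes 7–8 timesteps come out to roughly 0.6 s; the text does not state the frame rate.

```python
FRAME_RATE_HZ = 12.5  # assumed timesteps per second; not stated in the text


def recovery_latency(s_perc, onset, threshold):
    """Number of timesteps after `onset` until s_perc first reaches
    `threshold`, or None if it never does."""
    for t in range(onset, len(s_perc)):
        if s_perc[t] >= threshold:
            return t - onset
    return None


# Toy perception-affinity trajectory (hypothetical values); the user
# starts speaking at index 2 while the model is still generating.
s_perc_interrupt = [0.01, 0.01, 0.01, 0.02, 0.02, 0.03,
                    0.05, 0.08, 0.12, 0.30, 0.50]
steps = recovery_latency(s_perc_interrupt, onset=2, threshold=0.25)
seconds = steps / FRAME_RATE_HZ
```

Running the same function on a no-interruption trajectory gives a baseline latency against which the interruption case can be compared.
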

![Refer to caption](https://arxiv.org/html/2606.11386v1/images/S_perc_no_intr.png)

Figure 4: Perception affinity $\mathcal{S}_{\text{perc}}(t)$ in the *no\-interruption* condition\. The model transitions into the perceptive state immediately after the user begins speaking\.

![Refer to caption](https://arxiv.org/html/2606.11386v1/images/S_perc_intr.png)

Figure 5: Perception affinity $\mathcal{S}_{\text{perc}}(t)$ in the *interruption* condition\. The model transitions into the perceptive state after 7–8 timesteps, exhibiting state inertia\.

## 4 Zero\-Buffer Benchmark \(ZBB\)

A natural question is whether state inertia, the delayed transition into the perceptive state, reduces the model’s ability to perceive and understand user interruptions\. To systematically quantify its impact on dialogue comprehension, we introduce the *Zero\-Buffer Benchmark* \(ZBB\), which evaluates whether FD\-SLMs can immediately understand user input when an interruption occurs\. The key design principle is to place the critical semantic content at the very onset of the interrupting utterance, with no leading filler or acoustic buffer, so that the model must perceive core meaning exactly when state inertia is most likely to disrupt perception\.

Each ZBB example consists of a *speech\-inducing prompt* followed by a *zero\-buffer query*\. The speech\-inducing prompt is an open\-ended question that places the model in a generative state; while the model is actively responding, we abruptly interrupt it with the zero\-buffer query\. Each zero\-buffer query follows the template <Subject\>, <Description\>, <Confirmation Request\> \(e\.g\., “Submarine flies in the clouds, right?”\), where the subject keyword is deliberately placed as the first word\. Because the subject carries the information needed to judge the description, missing the onset of the interruption causes the model to lose the subject and often produce an incorrect or incoherent answer\. Details of the ZBB dataset creation, along with examples, are provided in Appendix [A\.3](https://arxiv.org/html/2606.11386#A1.SS3)\.

For evaluation, we transcribe the generated audio and evaluate the following metrics with an LLM judge:

- Correctness: Whether the model answers the zero\-buffer query correctly\.
- Initial Word Occurrence Rate \(IWOR\): Whether the response explicitly mentions the initial semantic word of the zero\-buffer query, or a direct synonym\. IWOR provides a diagnostic measure of whether the model perceived the initial subject\.
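
To make the IWOR definition concrete, the sketch below computes a crude string-matching proxy for it. This is not the evaluation used in the paper, which relies on an LLM judge that also accepts direct synonyms; the proxy only checks for an exact, case-insensitive match of the first word, so it misses synonyms and inflected forms such as plurals. The function names and the two example responses are hypothetical.

```python
import re

WORD_PATTERN = re.compile(r"[A-Za-z']+")


def initial_word(query):
    """First word of the zero-buffer query, lowercased."""
    words = WORD_PATTERN.findall(query)
    return words[0].lower() if words else ""


def mentions_initial_word(query, response):
    """True if the response contains the query's first word verbatim."""
    target = initial_word(query)
    response_words = {w.lower() for w in WORD_PATTERN.findall(response)}
    return bool(target) and target in response_words


def iwor_proxy(pairs):
    """Fraction of (query, response) pairs whose response mentions the
    query's initial word."""
    hits = sum(mentions_initial_word(q, r) for q, r in pairs)
    return hits / len(pairs)


query = "Submarine flies in the clouds, right?"
pairs = [
    (query, "No, a submarine travels underwater, not in the clouds."),
    (query, "Yes, birds fly in the clouds."),  # subject was missed
]
rate = iwor_proxy(pairs)
```
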

Evaluating several recent FD\-SLMs on ZBB, we find that interruption substantially degrades both correctness and IWOR \(Section [6\.2](https://arxiv.org/html/2606.11386#S6.SS2)\), showing that state inertia has a measurable downstream impact on immediate interruption comprehension\. To address this, the next section introduces a training\-free activation steering method that accelerates the model’s transition into the perceptive state\.

## 5 Activation Steering with Perception Vector

To mitigate the impact of state inertia, we apply activation steering \[[38](https://arxiv.org/html/2606.11386#bib.bib9)\] when the user begins speaking during model generation, shifting the model’s hidden representations from the generative state toward the perceptive state.

We classify each timestep $t$ as generation-dominant or perception-dominant using $\mathcal{S}_{\text{gen}}(t)$ and $\mathcal{S}_{\text{perc}}(t)$ computed at intermediate transformer layers. Specifically, we define $T_{\text{gen}}=\{t\mid\mathcal{S}_{\text{gen}}(t)\geq\Theta_{\text{gen}}\wedge\mathcal{S}_{\text{perc}}(t)<\Theta_{\text{perc}}\}$ and $T_{\text{perc}}=\{t\mid\mathcal{S}_{\text{perc}}(t)\geq\Theta_{\text{perc}}\wedge\mathcal{S}_{\text{gen}}(t)<\Theta_{\text{gen}}\}$, where $\Theta_{\text{gen}}$ and $\Theta_{\text{perc}}$ are predefined thresholds.
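The two set definitions can be read as a pair of boolean masks over timesteps. The following is a minimal NumPy sketch under that reading; the function name `classify_timesteps` and the toy scores and thresholds are hypothetical and not taken from the paper.

```python
import numpy as np

def classify_timesteps(s_gen, s_perc, theta_gen, theta_perc):
    """Return index arrays (T_gen, T_perc) for the two dominance sets.

    A timestep is generation-dominant when S_gen >= theta_gen and
    S_perc < theta_perc, and perception-dominant in the mirrored case.
    Timesteps where both or neither score clears its threshold are
    assigned to neither set.
    """
    s_gen = np.asarray(s_gen, dtype=float)
    s_perc = np.asarray(s_perc, dtype=float)
    gen_mask = (s_gen >= theta_gen) & (s_perc < theta_perc)
    perc_mask = (s_perc >= theta_perc) & (s_gen < theta_gen)
    return np.flatnonzero(gen_mask), np.flatnonzero(perc_mask)

# Toy log-scale scores (hypothetical values).
s_gen = [-1.0, -0.5, -3.0, -4.0, -0.8]
s_perc = [-4.0, -3.5, -0.7, -0.9, -0.6]
t_gen, t_perc = classify_timesteps(s_gen, s_perc, theta_gen=-2.0, theta_perc=-2.0)
print(t_gen.tolist(), t_perc.tolist())  # [0, 1] [2, 3]
```

The last toy timestep clears both thresholds and is therefore excluded from both sets, which is how the definitions keep ambiguous timesteps out of the later mean computation.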

Following established representation engineering methods \[[38](https://arxiv.org/html/2606.11386#bib.bib9), [51](https://arxiv.org/html/2606.11386#bib.bib18), [32](https://arxiv.org/html/2606.11386#bib.bib3)\], we construct a *perception vector* as the difference between the mean hidden representations of perception-dominant and generation-dominant timesteps. Let $h^{(t)}$ denote the hidden representation at the selected steering layer and timestep $t$. We define the perception vector $\mu_{g\to p}$, which points from the generative state toward the perceptive state, as

$$\mu_{g\to p}=\frac{1}{|T_{\text{perc}}|}\sum_{t\in T_{\text{perc}}}h^{(t)}-\frac{1}{|T_{\text{gen}}|}\sum_{t\in T_{\text{gen}}}h^{(t)}. \tag{6}$$

At inference time, we steer the model by adding the perception vector to the hidden representation at the selected steering layer, $\tilde{h}^{(t)}=h^{(t)}+\alpha\mu_{g\to p}$, where $\tilde{h}^{(t)}$ denotes the steered hidden representation and $\alpha$ controls the steering strength. In our ZBB experiments, steering is applied at the onset of the zero-buffer query, with the onset detected by an energy-based detector.
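As a concrete illustration of Eq. (6) and the additive update, the sketch below computes a difference-of-means vector from a toy matrix of hidden states and applies it with strength $\alpha$. The function names, the two-dimensional hidden states, and the index sets are illustrative assumptions; in practice the hidden states would be collected from the chosen steering layer of the model.

```python
import numpy as np

def perception_vector(hidden, t_gen, t_perc):
    """Mean hidden state over T_perc minus mean hidden state over T_gen."""
    hidden = np.asarray(hidden, dtype=float)  # shape: (timesteps, d_model)
    return hidden[t_perc].mean(axis=0) - hidden[t_gen].mean(axis=0)

def steer(h, mu, alpha):
    """Additive steering: h + alpha * mu."""
    return np.asarray(h, dtype=float) + alpha * mu

# Toy hidden states: rows 0-1 are generation-dominant, rows 2-3 perception-dominant.
hidden = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
mu = perception_vector(hidden, t_gen=[0, 1], t_perc=[2, 3])
print(mu.tolist())                                # [-2.0, 3.0]
print(steer(hidden[0], mu, alpha=0.5).tolist())   # [0.0, 1.5]
```

Because the vector is computed once offline and applied as a single addition per steered timestep, the inference-time cost is one vector add at one layer.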

The geometry of the hidden representation space provides additional support for the perception vector. In Appendix [D](https://arxiv.org/html/2606.11386#A4), we show that generation-dominant and perception-dominant timesteps are clearly separated under PCA projection. This separation suggests that the vector captures a meaningful transition direction rather than a noisy difference between overlapping distributions.

## 6 Experiments and Results on Zero-Buffer Benchmark

### 6.1 Setup

Evaluation conditions. We evaluate three advanced FD-SLMs spanning distinct architectural paradigms: PersonaPlex \[[33](https://arxiv.org/html/2606.11386#bib.bib47)\], Moshi \[[13](https://arxiv.org/html/2606.11386#bib.bib45)\], and Raon-SpeechChat \[[22](https://arxiv.org/html/2606.11386#bib.bib28)\]. For each model, we compare three conditions: *no interruption*, *interruption*, and *interruption with steering*. In the *interruption* condition, we first present a speech-inducing prompt and then abruptly interrupt the model with a zero-buffer query. In the *no-interruption* condition, we present the same zero-buffer query without first inducing substantive model speech. This condition represents the model’s performance when no generative-to-perceptive transition is required. In the *interruption with steering* condition, we apply the perception vector at the onset of the zero-buffer query and measure whether it restores performance after interruption.

Perception vector construction. To construct the perception vector, we classify timesteps into $T_{\text{gen}}$ and $T_{\text{perc}}$ using the affinity scores defined in Section [3.3](https://arxiv.org/html/2606.11386#S3.SS3). For classification, we average $\mathcal{S}_{\text{gen}}(t)$ and $\mathcal{S}_{\text{perc}}(t)$ over layers 12–24 and apply the thresholds in Table [3](https://arxiv.org/html/2606.11386#S6.T3). Unless otherwise stated, we use the steering layer, steering strength $\alpha$, and steering span $\Delta T_{\text{steer}}$ specified in Table [3](https://arxiv.org/html/2606.11386#S6.T3). Importantly, the conversations used to compute $\mu_{g\to p}$ are drawn from the turn-by-turn interaction dataset introduced in Section [3.2](https://arxiv.org/html/2606.11386#S3.SS2), and are disjoint from the ZBB evaluation set. Thus, the perception vector captures general state-level differences rather than information specific to the ZBB examples. Representative examples of these conversations are provided in Appendix [A](https://arxiv.org/html/2606.11386#A1).
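The layer-averaging step can be sketched as a mean over a slice of a per-layer score matrix. The array layout (layers by timesteps), the 0-based layer indexing, and the inclusive treatment of the range 12–24 are all assumptions of this sketch; the text does not specify these conventions.

```python
import numpy as np

def layer_averaged_score(scores, first_layer=12, last_layer=24):
    """Average per-layer affinity scores over an inclusive layer range.

    `scores` is assumed to have shape (num_layers, num_timesteps) with
    0-based layer indices; both conventions are illustrative.
    """
    scores = np.asarray(scores, dtype=float)
    return scores[first_layer:last_layer + 1].mean(axis=0)

# Toy input: 32 layers, 3 timesteps; layer i holds the constant value i.
toy = np.tile(np.arange(32, dtype=float)[:, None], (1, 3))
print(layer_averaged_score(toy).tolist())  # [18.0, 18.0, 18.0]
```

The averaged scores would then be compared against the thresholds to form the two timestep sets.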

Steering schedule. At inference time, we apply the perception vector $\mu_{g\to p}$ starting at the onset of the zero-buffer query, denoted $t_{\text{int}}$. We detect $t_{\text{int}}$ using an energy-based onset detector. Let $h^{(t)}$ denote the hidden representation at the selected steering layer and timestep $t$. To avoid steering the model throughout the entire interrupted utterance, we apply steering over a finite span $\Delta T_{\text{steer}}$ and linearly decay its magnitude to zero:

$$\tilde{h}^{(t)}=\begin{cases}h^{(t)}+\alpha\left(1-\frac{t-t_{\text{int}}}{\Delta T_{\text{steer}}}\right)\mu_{g\to p},&t_{\text{int}}\leq t<t_{\text{int}}+\Delta T_{\text{steer}},\\ h^{(t)},&\text{otherwise},\end{cases} \tag{7}$$

where $\tilde{h}^{(t)}$ denotes the steered hidden representation and $\alpha$ controls the steering strength.

### 6.2 ZBB Evaluation Results
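The schedule in Eq. (7) amounts to a per-timestep scalar that multiplies the perception vector. The sketch below computes that scalar and applies it to a sequence of hidden states; the function names and the toy values for the onset, span, and strength are hypothetical.

```python
import numpy as np

def steering_scale(t, t_int, span, alpha):
    """Coefficient on the perception vector at timestep t under Eq. (7)."""
    if t_int <= t < t_int + span:
        return alpha * (1.0 - (t - t_int) / span)
    return 0.0

def apply_schedule(hidden, mu, t_int, span, alpha):
    """Return a steered copy of `hidden` (shape: timesteps x d_model)."""
    steered = np.array(hidden, dtype=float)  # copies the input
    for t in range(steered.shape[0]):
        steered[t] += steering_scale(t, t_int, span, alpha) * mu
    return steered

scales = [steering_scale(t, t_int=2, span=4, alpha=2.0) for t in range(8)]
print(scales)  # [0.0, 0.0, 2.0, 1.5, 1.0, 0.5, 0.0, 0.0]
```

The coefficient is largest at the detected onset and reaches zero exactly at the end of the span, so timesteps outside the window are left untouched.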

As shown in Table 2, interruption causes a severe degradation in both correctness and IWOR across all three FD-SLMs. On PersonaPlex, for instance, correctness drops from 0.49 to 0.28 and IWOR from 0.74 to 0.40 when the query arrives as an interruption. The IWOR drop in particular indicates that the model often fails to perceive the initial subject of the interrupting utterance, showing that state inertia has a measurable downstream impact on immediate interruption comprehension.

Notably, activation steering improves both correctness and IWOR across all evaluated models\. For PersonaPlex and Moshi, the perception vector raises response correctness and restores most of the interruption\-induced IWOR drop \(94% and 92%, respectively\)\. For Raon\-SpeechChat, steering improves both metrics as well, though absolute correctness remains low\.
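The recovery percentages can be reproduced from the three per-condition scores, assuming recovery is defined as the steering gain divided by the interruption-induced drop. The sketch below uses the PersonaPlex values reported in the abstract and in this section; because those values are rounded to two decimals, the results are approximate.

```python
def recovered_fraction(no_interruption, interruption, steered):
    """Fraction of the interruption-induced drop recovered by steering."""
    drop = no_interruption - interruption
    if drop <= 0:
        raise ValueError("interruption did not reduce the score")
    return (steered - interruption) / drop

# PersonaPlex: IWOR 0.74 -> 0.40 -> 0.72, correctness 0.49 -> 0.28 -> 0.45.
iwor = recovered_fraction(0.74, 0.40, 0.72)
correctness = recovered_fraction(0.49, 0.28, 0.45)
print(f"IWOR: {iwor:.0%}, correctness: {correctness:.0%}")  # IWOR: 94%, correctness: 81%
```

The IWOR figure matches the 94% stated above for PersonaPlex; the correctness figure is derived here from the rounded scores rather than quoted from the paper.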

We further show qualitatively that activation steering reduces state inertia. We compare $\mathcal{S}_{\text{perc}}(t)$ around the onset of the zero-buffer query under the *interruption* and *interruption with steering* conditions in Figures 6 and [7](https://arxiv.org/html/2606.11386#S6.F7), respectively. In the *interruption* condition, $\mathcal{S}_{\text{perc}}(t)$ remains low immediately after the zero-buffer query begins, indicating a delayed transition into the perceptive state. In contrast, under *interruption with steering*, $\mathcal{S}_{\text{perc}}(t)$ recovers immediately after the zero-buffer query onset. We provide an attention-based analysis in Appendix [G](https://arxiv.org/html/2606.11386#A7), showing that steering increases attention to the first few interruption timesteps. Additional steering-parameter sweeps are provided in Appendix [F](https://arxiv.org/html/2606.11386#A6).

We also evaluate steering on Full-Duplex Bench (FDB) \[[26](https://arxiv.org/html/2606.11386#bib.bib39)\] and confirm that steering does not degrade overall full-duplex dialogue performance. Results and discussion are provided in Appendix [H](https://arxiv.org/html/2606.11386#A8).

Table 2: FD-SLM performance on ZBB. Uncertainties denote one standard error; parentheses show the percentage of the interruption-induced drop recovered by steering.

Table 3: Activation steering hyperparameters. Thresholds are reported in natural-log scale.

![Refer to caption](https://arxiv.org/html/2606.11386v1/images/interrupt_start_perception_score_unsteered.png) Figure 6: Perception affinity $\mathcal{S}_{\text{perc}}(t)$ in the *interruption* condition. Without steering, perception affinity takes approximately 7–8 timesteps to recover after interruption, indicating state inertia.

![Refer to caption](https://arxiv.org/html/2606.11386v1/images/interrupt_start_perception_score_steered.png) Figure 7: Perception affinity $\mathcal{S}_{\text{perc}}(t)$ in the *interruption with steering* condition. With activation steering, perception affinity recovers immediately after interruption, indicating a faster transition toward the perceptive state.

## 7 Limitations

Our work has several limitations. First, the steering method relies on detecting the onset of user interruption. We use an energy-based onset detector, but real-world deployment may require more robust voice activity detection, especially in noisy or multi-speaker settings. We discuss false-trigger sensitivity in Appendix [I](https://arxiv.org/html/2606.11386#A9). Second, our evaluation is constrained by the limited availability of open-source FD-SLMs, as few such models are currently publicly available. Finally, our logit-lens-based affinity scores are diagnostic approximations and can be noisy for individual examples.

## 8 Conclusion

We study how FD-SLMs coordinate listening and speaking through hidden representations. Using logit-lens-based affinity scores, we find that FD-SLMs exhibit stream-specific predictive focus and modulate between generative and perceptive states. We identify *state inertia*, a delayed transition during abrupt interruptions that causes models to miss early user input. To evaluate this failure mode, we introduce the Zero-Buffer Benchmark (ZBB) and show that interruption degrades both correctness and IWOR across multiple FD-SLMs. Finally, activation steering with the perception vector reduces state inertia and improves interruption handling without fine-tuning. Overall, our results show that hidden representations can be used not only to analyze FD-SLM listening–speaking coordination, but also to improve full-duplex interruption robustness.

## References

- \[1\] G. Alain and Y. Bengio (2017) Understanding intermediate layers using linear classifier probes. External Links: [Link](https://openreview.net/forum?id=ryF7rTqgl).
- \[2\] S. Arora, K. Chang, C. Chien, Y. Peng, H. Wu, Y. Adi, E. Dupoux, H. Lee, K. Livescu, and S. Watanabe (2025) On the landscape of spoken language models: a comprehensive survey. Transactions on Machine Learning Research.
- \[3\] J. Ball (2023) Voice activity detection (vad) in noisy environments. arXiv preprint arXiv:2312.05815.
- \[4\] N. Belrose, I. Ostrovsky, L. McKinney, Z. Furman, L. Smith, D. Halawi, S. Biderman, and J. Steinhardt (2023) Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112.
- \[5\] K. Chang, W. Chen, E. Hu, H. Lee, and J. Glass (2026) TiCo: time-controllable training for spoken dialogue models. arXiv preprint arXiv:2603.22267.
- \[6\] K. Chang, E. Hu, C. Kuan, W. Ren, W. Chen, G. Lin, Y. Tsao, S. Sun, H. Lee, and J. Glass (2026) Game-time: evaluating temporal dynamics in spoken language models. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 16302–16306.
- \[7\] R. Chen, A. Arditi, H. Sleight, O. Evans, and J. Lindsey (2025) Persona vectors: monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509.
- \[8\] H. H. Clark and J. E. Fox Tree (2002) Using uh and um in spontaneous speaking. Cognition 84(1), pp. 73–111. External Links: ISSN 0010-0277, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/S0010-0277%2802%2900017-3), [Link](https://www.sciencedirect.com/science/article/pii/S0010027702000173).
- \[9\] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez (2023) Simple and controllable music generation. Advances in neural information processing systems 36, pp. 47704–47720.
- \[10\] W. Cui, D. Yu, X. Jiao, Z. Meng, G. Zhang, Q. Wang, S. Y. Guo, and I. King (2025) Recent advances in speech language models: a survey. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13943–13970.
- \[11\] W. Cui, L. Zhu, X. Li, Z. Guo, H. Bai, L. Hou, and I. King (2025) Think before you talk: enhancing meaningful dialogue generation in full-duplex speech language models with planning-inspired text guidance. arXiv preprint arXiv:2508.07375.
- \[12\] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi (2023) High fidelity neural audio compression. Transactions on Machine Learning Research.
- \[13\] A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour (2024) Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037.
- \[14\] D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang, et al. (2025) Kimi-audio technical report. arXiv preprint arXiv:2504.18425.
- \[15\] E. Duvall, A. Robbins, T. Graham, and S. Divett (2014) Exploring filler words and their impact. Schwa. Language & Linguistics 11, pp. 35–49.
- \[16\] Q. Fang, S. Guo, Y. Zhou, Z. Ma, S. Zhang, and Y. Feng (2025) LLaMA-omni: seamless speech interaction with large language models. In The Thirteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=PYmrUQmMEw).
- \[17\] M. Geva, R. Schuster, J. Berant, and O. Levy (2021) Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495.
- \[18\] J. Glass (1999) Challenges for spoken dialogue systems. In Proceedings of the 1999 IEEE ASRU Workshop, Vol. 696.
- \[19\] M. Heldner and J. Edlund (2010) Pauses, gaps and overlaps in conversations. Journal of Phonetics 38(4), pp. 555–568.
- \[20\] J. F. Houde, S. S. Nagarajan, K. Sekihara, and M. M. Merzenich (2002) Modulation of the auditory cortex during speech: an meg study. Journal of cognitive neuroscience 14(8), pp. 1125–1138.
- \[21\] S. Ji, Y. Chen, M. Fang, J. Zuo, J. Lu, H. Wang, Z. Jiang, L. Zhou, S. Liu, X. Cheng, et al. (2024) Wavchat: a survey of spoken dialogue models. arXiv preprint arXiv:2411.13577.
- \[22\] Krafton (2026) Raon-speech technical report.
- \[23\] G. Lin, C. Chen, Z. Chen, and H. Lee (2026) Full-duplex-bench-v3: benchmarking tool use for full-duplex voice agents under real-world disfluency. arXiv preprint arXiv:2604.04847.
- \[24\] G. Lin, S. S. Kuan, J. Shi, K. Chang, S. Arora, S. Watanabe, and H. Lee (2025) Full-duplex-bench-v2: a multi-turn evaluation framework for duplex dialogue systems with an automated examiner. arXiv preprint arXiv:2510.07838.
- \[25\] G. Lin, S. S. Kuan, Q. Wang, J. Lian, T. Li, S. Watanabe, and H. Lee (2026) Full-duplex-bench v1.5: evaluating overlap handling for full-duplex speech models. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 19447–19451.
- \[26\] G. Lin, J. Lian, T. Li, Q. Wang, G. Anumanchipalli, A. H. Liu, and H. Lee (2025) Full-duplex-bench: a benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities. arXiv preprint arXiv:2503.04721.
- \[27\] nostalgebraist (2020) Interpreting GPT: the logit lens. LessWrong. External Links: [Link](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens).
- \[28\] J. Numminen, R. Salmelin, and R. Hari (1999) Subject’s own speech reduces reactivity of the human auditory cortex. Neuroscience Letters 265(2), pp. 119–122. External Links: ISSN 0304-3940, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/S0304-3940%2899%2900218-9), [Link](https://www.sciencedirect.com/science/article/pii/S0304394099002189).
- \[29\] Y. Peng, Y. Chao, D. Ng, Y. Ma, C. Ni, B. Ma, and E. S. Chng (2025) FD-bench: a full-duplex benchmarking pipeline designed for full duplex spoken dialogue systems. In Proc. Interspeech 2025, pp. 176–180.
- \[30\] D. Rai, Y. Zhou, S. Feng, A. Saparov, and Z. Yao (2024) A practical review of mechanistic interpretability for transformer-based language models. arXiv preprint arXiv:2407.02646.
- \[31\] A. Raux (2008) Flexible turn-taking for spoken dialog systems. Language Technologies Institute, CMU Dec12.
- \[32\] N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. Turner (2024) Steering llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15504–15522.
- \[33\] R. Roy, J. Raiman, S. Lee, T. Ene, R. Kirby, S. Kim, J. Kim, and B. Catanzaro (2026) PersonaPlex: voice and role control for full duplex conversational speech models. arXiv preprint arXiv:2602.06053.
- \[34\] G. Skantze (2021) Turn-taking in conversational systems and human-robot interaction: a review. Computer Speech & Language 67, pp. 101178.
- \[35\] A. Stolfo, V. Balachandran, S. Yousefi, E. Horvitz, and B. Nushi (2024) Improving instruction-following in language models through activation steering. In The Thirteenth International Conference on Learning Representations.
- \[36\] N. Ström and S. Seneff (2000) Intelligent barge-in in conversational systems. In INTERSPEECH, pp. 652–655.
- \[37\] I. Tenney, D. Das, and E. Pavlick (2019) BERT rediscovers the classical nlp pipeline. In Proceedings of the 57th annual meeting of the association for computational linguistics, pp. 4593–4601.
- \[38\] A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid (2023) Steering language models with activation engineering. arXiv preprint arXiv:2308.10248.
- \[39\] B. Veluri, B. N. Peloquin, B. Yu, H. Gong, and S. Gollakota (2024) Beyond turn-based interfaces: synchronous llms as full-duplex dialogue agents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 21390–21402.
- \[40\] C. Wang, H. Yue, G. Li, Z. Zhao, S. Wang, S. Wang, X. Xu, H. Bu, and L. Xie (2026) Full-duplex interaction in spoken dialogue systems: a comprehensive study from the icassp 2026 humdial challenge. arXiv preprint arXiv:2604.21406.
- \[41\] H. Wang and K. Shu (2024) Trojan activation attack: red-teaming large language models using steering vectors for safety-alignment. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 2347–2357.
- \[42\] B. Wu, C. Yan, C. Hu, C. Yi, C. Feng, F. Tian, F. Shen, G. Yu, H. Zhang, J. Li, et al. (2025) Step-audio 2 technical report. arXiv preprint arXiv:2507.16632.
- \[43\] H. Wu, H. Chung, Y. Lin, Y. Wu, X. Chen, Y. Pai, H. Wang, K. Chang, A. Liu, and H. Lee (2024) Codec-superb: an in-depth analysis of sound codec models. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 10330–10348.
- \[44\] K. Xia, B. Mu, X. Shi, J. Xu, and L. Xie (2026) Semantic-aware interruption detection in spoken dialogue systems: benchmark, metric, and model. arXiv preprint arXiv:2603.24144.
- \[45\] Z. Xie and C. Wu (2024) Mini-omni: language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725.
- \[46\] J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, et al. (2025) Qwen3-omni technical report. arXiv preprint arXiv:2509.17765.
- \[47\] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi (2021) Soundstream: an end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, pp. 495–507.
- \[48\] A. Zeng, Z. Du, M. Liu, K. Wang, S. Jiang, L. Zhao, Y. Dong, and J. Tang (2024) Glm-4-voice: towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612.
- \[49\] H. Zhang, W. Cui, H. Xu, X. Li, L. Zhu, H. Bai, S. Ma, and I. King (2025) MTR-duplexbench: towards a comprehensive evaluation of multi-round conversations for full-duplex speech language models. arXiv preprint arXiv:2511.10262.
- \[50\] X. Zhang, Y. Chen, S. Hu, X. Han, Z. Xu, Y. Xu, W. Zhao, M. Sun, and Z. Liu (2024) Beyond the turn-based game: enabling real-time conversations with duplex models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 11543–11557.
- \[51\] A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A. Dombrowski, et al. (2023) Representation engineering: a top-down approach to ai transparency. arXiv preprint arXiv:2310.01405.

## Appendix A Dataset Details

### A.1 Turn-by-turn interaction dataset

![Refer to caption](https://arxiv.org/html/2606.11386v1/x1.png) Figure 8: An example from the turn-by-turn interaction dataset used for logit-lens analysis and model-internal generation/perception affinity analysis.

The turn-by-turn interaction dataset consists of 100 *user queries* covering a diverse set of everyday conversational topics, each followed by a response window in which the model takes its turn to reply. We use this dataset for our logit-lens analysis, and to identify the generative and perceptive states by computing the generation and perception affinities.

We generate these user queries with a text\-based LLM \(Claude Opus 4\.5\) according to the following criteria: \(1\) the utterances should cover varied topics from daily conversation in order to increase diversity; \(2\) they should be open\-ended, so that model responses are not biased toward a fixed answer format; and \(3\) after text\-to\-speech synthesis, they should correspond to approximately 15–20 seconds of speech, providing a sufficiently long listening segment for analysis\. Example queries are shown below\.

Example User query 1: Can you compare renewable energy sources and explain their pros and cons in daily use?

Example User query 2: My neighbor got this new puppy last week. Cutest little thing you’ve ever seen, but it barks all night long. I mean, non-stop. I haven’t slept properly in days. I don’t want to be rude about it, but I’m seriously considering saying something to her about the noise.

After generating the text queries, we synthesize them into speech using the Dia2-2B text-to-speech (TTS) model ([https://huggingface.co/nari-labs/Dia2-2B](https://huggingface.co/nari-labs/Dia2-2B)). Because FD-SLMs operate on continuous audio input, each synthesized user utterance is followed by a 10-second silence segment, during which the model is allowed to respond. Thus, each audio input is approximately 25–30 seconds long: the first 15–20 seconds contain user speech, during which the model is expected to listen, and the final 10 seconds provide a response window for the model. The dataset contains 100 such examples.
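Appending the response window is a simple concatenation of the synthesized waveform with zeros. The sketch below illustrates this on a synthetic waveform; the 24 kHz sample rate, the function name `append_response_window`, and the use of random noise in place of TTS output are all assumptions made for illustration.

```python
import numpy as np

def append_response_window(speech, sample_rate, window_seconds=10.0):
    """Append `window_seconds` of silence to a mono waveform."""
    num_silent = int(round(window_seconds * sample_rate))
    silence = np.zeros(num_silent, dtype=speech.dtype)
    return np.concatenate([speech, silence])

sample_rate = 24_000  # hypothetical; use the rate of the actual TTS output
rng = np.random.default_rng(0)
speech = rng.standard_normal(15 * sample_rate).astype(np.float32)  # stand-in for 15 s of speech
audio = append_response_window(speech, sample_rate)
print(len(audio) / sample_rate)  # 25.0
```

With 15–20 seconds of speech, the resulting input falls in the 25–30 second range described above.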

### A.2 Interruption and No-Interruption Conditions for Analyzing State Inertia

![Refer to caption](https://arxiv.org/html/2606.11386v1/x2.png) Figure 9: An example from the dataset for state inertia analysis, illustrating the paired (a) no-interruption and (b) interruption conditions. In the interruption condition, a speech-inducing prompt first places the model in a generative state, and a user utterance then interrupts its ongoing response; in the no-interruption condition, the same utterance is presented without a preceding prompt.

To analyze state inertia, we construct paired *no-interruption* and *interruption* conditions from the same user queries in Appendix [A.1](https://arxiv.org/html/2606.11386#A1.SS1).

For the no\-interruption condition, we present a user query on its own\. The model is therefore not speaking when the user begins, yielding an ordinary turn\-taking dialogue with no overlap\. This setting is the same as in the turn\-by\-turn interaction dataset\.

For the interruption condition, we first input a user *speech-inducing prompt*, which is an open-ended question designed to drive the model into a sustained generative state by eliciting a long response. These speech-inducing prompts are constructed according to the following criteria: (1) they should cover diverse topics to reduce topic bias; (2) they should involve relatively technical or explanatory content, so that the model is likely to produce a longer response; and (3) they do not need to be long, since their purpose is only to induce model-side speaking behavior. The speech-inducing prompts are generated using Claude Opus 4.5 and synthesized into speech using Dia2-2B.

An example speech\-inducing prompt is shown below\.

Example Speech-Inducing Prompt: Can you describe how antibiotics work and why antibiotic resistance is a global concern?

After receiving the speech\-inducing prompt, the model begins generating a response; after 5 seconds, we abruptly interrupt it with the user query\. This setup creates an interruption condition in which the model must transition from an ongoing generative state to a perceptive state\.

### A.3 Zero-Buffer Benchmark Dataset

![Refer to caption](https://arxiv.org/html/2606.11386v1/x3.png) Figure 10: An example from the ZBB dataset, showing the paired (a) no-interruption and (b) interruption conditions. In the no-interruption condition, the zero-buffer query is presented on its own. In the interruption condition, a speech-inducing prompt is followed by a zero-buffer query that interrupts the model’s ongoing response, testing whether the model can perceive the critical information at the onset of the interruption.

As described in Section [4](https://arxiv.org/html/2606.11386#S4), the Zero-Buffer Benchmark (ZBB) contains two evaluation conditions: an interruption condition and a no-interruption condition. In the interruption condition, each example consists of a *speech-inducing prompt* followed by a *zero-buffer query*. In the no-interruption condition, the model receives the same zero-buffer query without first being induced into a sustained speaking state. This paired design allows us to measure how interruption affects both response correctness and initial-word recognition.

The speech\-inducing prompts are constructed in the same way as in Appendix [A\.2](https://arxiv.org/html/2606.11386#A1.SS2)\. Each zero\-buffer query follows the template

<Subject\>, <Description\>, <Confirmation Request\>\.

The subject appears as the first word of the query, so missing the onset of the interruption often removes the key information needed to answer correctly\. To balance the dataset, we generate 50 subjects\. For each subject, we create one factually correct description and one factually incorrect description, resulting in 100 zero\-buffer queries in total\. The confirmation request is kept short, so that the first word remains the primary semantic cue at the onset of the interruption\.

The subjects are chosen from common entities, objects, and animals, so that the expected answer is unambiguous and does not require specialized knowledge\.

An example positive–negative pair is shown below\.

Example Zero\-Buffer Query Pair

Correct: Banana is a yellow fruit, right?

Incorrect: Banana is a red fruit, right?

Pairing the same subject with both a correct and an incorrect description helps control for subject\-specific difficulty\. In this way, differences in correctness are less likely to be explained by some subjects being inherently easier or harder to recognize\.
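The paired construction can be sketched as below\. This is an illustrative sketch only: the `ZeroBufferQuery` record, its field names, and the helper functions are hypothetical, and the rendering follows the surface form of the banana example above rather than any released generation script\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZeroBufferQuery:
    subject: str      # first word of the query, used later for IWOR
    text: str         # full query text to be synthesized into speech
    is_correct: bool  # whether the description is factually correct


def make_pair(subject, correct_desc, incorrect_desc, confirmation="right?"):
    """Return the (correct, incorrect) queries that share one subject."""
    def render(description):
        # The subject comes first so it sits at the onset of the interruption.
        return f"{subject} {description}, {confirmation}"

    return (
        ZeroBufferQuery(subject, render(correct_desc), True),
        ZeroBufferQuery(subject, render(incorrect_desc), False),
    )


def build_dataset(entries):
    """entries: iterable of (subject, correct_desc, incorrect_desc) triples."""
    queries = []
    for subject, correct_desc, incorrect_desc in entries:
        queries.extend(make_pair(subject, correct_desc, incorrect_desc))
    return queries
```

With 50 subject triples, `build_dataset` yields 100 queries, half with a correct description and half with an incorrect one, matching the balance described above\.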

The speech\-inducing prompts and zero\-buffer queries are synthesized into audio using the Dia2\-2B text\-to\-speech model \([https://huggingface\.co/nari\-labs/Dia2\-2B](https://huggingface.co/nari-labs/Dia2-2B)\)\.

### A\.4 LLM\-Based Evaluation for ZBB

We evaluate model responses using two metrics: correctness and Initial Word Occurrence Rate \(IWOR\)\. For both metrics, we first transcribe the model’s generated speech into text using the ASR model nvidia/parakeet\-tdt\-0\.6b\-v2 \([https://huggingface\.co/nvidia/parakeet\-tdt\-0\.6b\-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2)\)\. We then evaluate the transcription using GPT\-4\.1\-mini with the prompts below\.

For correctness, the evaluator determines whether the model gives a factually correct and direct answer to the interruption query\.

CORRECTNESS\_SYSTEM\_PROMPT

For IWOR, the evaluator determines whether the model response explicitly mentions the subject entity appearing as the first word of the interruption query, or a direct synonym\. This metric is designed to measure whether the model perceived the initial semantic keyword of the interruption\.

FIRST\_WORD\_SYSTEM\_PROMPT

The final correctness score is the fraction of examples for which the evaluator assigns a score of 1 under the correctness rubric\. The final IWOR score is the fraction of examples for which the evaluator assigns a score of 1 under the first\-word rubric\.

The following example illustrates the correctness evaluation\.

Example of Correctness Evaluation

The following example illustrates the IWOR evaluation\.

Example of IWOR Evaluation

Correctness and IWOR capture complementary aspects of interruption handling\. Correctness measures whether the model answers the full interruption query accurately, whereas IWOR measures whether the model perceived the initial semantic keyword\. A model may answer incorrectly even after recognizing the first word, or it may respond to the tail end of the question without explicitly recognizing the subject\. We therefore report both metrics\.
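The aggregation of the two scores described above can be sketched as follows\. This is a minimal sketch that assumes the evaluator's binary scores have already been collected; the dictionary keys `correct` and `first_word` and both function names are hypothetical\.

```python
def fraction_scored_one(scores):
    """Fraction of examples to which the evaluator assigned a score of 1."""
    scores = list(scores)
    if not scores:
        raise ValueError("no evaluator scores to aggregate")
    return sum(1 for s in scores if s == 1) / len(scores)


def summarize(judgements):
    """judgements: iterable of dicts with binary 'correct' and 'first_word' scores."""
    judgements = list(judgements)
    return {
        # Score under the correctness rubric.
        "correctness": fraction_scored_one(j["correct"] for j in judgements),
        # Score under the first-word rubric (IWOR).
        "iwor": fraction_scored_one(j["first_word"] for j in judgements),
    }
```

Because the two rubrics are scored independently, an example can count toward IWOR without counting toward correctness, which is why the two fractions can diverge\.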

## Appendix B Computational Resources

All experiments in this paper are conducted on NVIDIA L40S GPUs\. Our experiments involve inference\-time analysis and activation steering on open\-source FD\-SLMs, without model training or fine\-tuning\. Therefore, the compute requirements are modest compared with training\-based approaches\. The experiments can be run on any GPU with sufficient memory to host the evaluated models, including PersonaPlex, Moshi, and Raon\-SpeechChat\.

## Appendix C Delayed Transition Out of the Generative State

In addition to the delayed transition into the perceptive state discussed in the main text, we also observe a delayed transition out of the generative state\. Figure 11 and Figure 12 compare $\mathcal{S}_{\text{gen}}(t)$ under the no\-interruption and interruption conditions, respectively\. Under the no\-interruption condition, generation affinity decreases shortly after the user begins speaking, indicating that the model exits the generative state relatively quickly\. In contrast, under the interruption condition, $\mathcal{S}_{\text{gen}}(t)$ remains elevated for substantially longer after the user begins speaking, indicating that the model continues to occupy the generative state despite the change in conversational context\. This provides complementary evidence for state inertia: the model exhibits a delayed internal transition not only into the perceptive state, but also out of the generative state\.

Figure 11: Generation affinity $\mathcal{S}_{\text{gen}}(t)$ in the no\-interruption condition\. The model exits the generative state soon after the user begins speaking, with recovery occurring after approximately 5 timesteps\.

Figure 12: Generation affinity $\mathcal{S}_{\text{gen}}(t)$ in the interruption condition\. The model remains in the generative state for approximately 20 timesteps after the user interrupts and begins speaking, corresponding to nearly 2 seconds\.

## Appendix D PCA of Hidden Representations

The perception vector $\mu_{g\to p}$ is computed as the difference between the mean hidden representations of perception\-dominant and generation\-dominant timesteps\. This mean\-difference direction is meaningful only if the two underlying representation distributions are sufficiently separated; if they heavily overlap, the resulting vector could instead reflect noise from weakly distinguishable distributions\. To examine this possibility, we analyze the separability of these hidden representations using Principal Component Analysis \(PCA\)\.

As shown in Figure 13, generation\-dominant and perception\-dominant timesteps form clearly separated clusters in the PCA\-projected hidden space across most layers\. This separation supports the validity of the perception vector: it is not merely a noisy difference between overlapping distributions, but a direction aligned with a prominent structure in the model’s hidden representations\.

The dominant separating component varies across depth\. In lower layers, the two sets are primarily separated along the first principal component, whereas in deeper layers the separation becomes more apparent along the second principal component\. One possible interpretation is that the dominant sources of variance change across layers: lower layers may emphasize surface\-level or modality\-specific structure, while deeper layers may allocate the leading principal component to content\-related variation \[17, 37\], leaving state\-related variation to appear in a secondary component\. We treat this explanation as suggestive rather than conclusive\.

Figure 13: PCA projections of hidden representations from generation\-dominant and perception\-dominant timesteps across transformer layers\. Generation\-dominant and perception\-dominant representations form separated clusters in the projected space\. The separation is most visible along the first principal component in shallower layers \(left\) and along the second principal component in deeper layers \(right\)\.

## Appendix E Decoding Hidden States with the Logit Lens

This appendix provides detailed qualitative examples from the turn\-by\-turn interaction dataset, complementing the analysis in Section 3\.2\. We visualize the top logit\-lens prediction at each layer and timestep\. For each hidden representation $h^{(t)}$, we project it into the vocabulary space using the same probability definition as in Section 3\.2, and decode

$$y_{\mathrm{decode}}^{(t)}=\arg\max_{y\in V}P(y\mid h^{(t)}). \tag{8}$$

In each heatmap, the text annotation in a cell shows $y_{\mathrm{decode}}^{(t)}$, while the color indicates the projected probability assigned to the eventual model\-side text token $m_{\mathrm{text}}^{(t)}$\.

### E\.1 Logit\-Lens Decoding During Listening

Figure 14 shows that, during listening, intermediate layers often predict continuations of the incoming user utterance rather than only the model\-side output token\. For example, when the user\-side phrase is “their pros and cons,” decoded tokens include “pro,” “and,” and “cons,” which anticipate upcoming user\-side content\. The decoded tokens may also be semantically related to the ongoing utterance even when they do not exactly match the next token\. For example, at the timestep corresponding to the input token “explain,” the decoded tokens include “why,” “how,” and “personal,” which are relevant continuations\. We highlight several representative examples in Table 4\. An additional layer\-wise logit\-lens decoding example is provided in Figure 15\.

Table 4: Examples of logit\-lens decoded predictions during listening\. Bold tokens match or anticipate the actual upcoming user\-side token\.

Figure 14: Logit\-lens decoding of PersonaPlex hidden states during a listening segment\. Intermediate layers often decode tokens related to the incoming user stream, even though the final model\-side output remains mostly <PAD\>\. This suggests that the model internally tracks user\-side content before converting this computation into a silent model\-side output\.

Figure 15: Additional logit\-lens decoding example during a listening segment\. The user input is “How does water treatment make tap water safe to drink in modern cities?” Intermediate layers decode tokens that anticipate or semantically track the incoming user stream: around “tap,” decoded tokens include “water”; around “water,” decoded tokens include “quality,” “safe,” and “tastes”; around “safe,” decoded tokens include “to,” “for,” and “safety”; and around “to,” decoded tokens include “drink\.” This provides further qualitative evidence that hidden states can track user\-side continuations during listening\.

### E\.2 Logit\-Lens Decoding During Model Speech

Figure 16 shows the complementary pattern during model speech\. Intermediate hidden states assign higher projected probability to model\-side text tokens, and decoded tokens more directly follow the model output stream\. Some timesteps still have lower model\-text probability because recent FD\-SLMs often distribute text\-token and audio\-token generation across different frames; during audio\-generation frames, the model\-side text token may be <PAD\> or <EPAD\>\. An additional layer\-wise logit\-lens decoding example is provided in Figure 17\.

Together, Figures 14 and 16 provide qualitative evidence for stream\-specific predictive focus: hidden states tend to track the incoming user stream during listening and the model\-side output stream during speaking\. This supports the interpretation of $S_{\mathrm{perc}}(t)$ and $S_{\mathrm{gen}}(t)$ in Section 3\.3 as indicators of perceptive and generative states, respectively\.

Figure 16: Logit\-lens decoding of PersonaPlex hidden states during a model speaking segment\. Compared with the listening segment in Figure 14, the speaking segment shows stronger alignment with the model\-side output stream across a broader range of layers, consistent with a generative state\.

Figure 17: Additional logit\-lens decoding example during a model speaking segment\. This example corresponds to the model response beginning with “Modern cities treat water…” after the user query shown in Figure 15\. The decoded tokens follow the model\-side output stream, providing further qualitative evidence of generative\-state alignment during speaking\.

## Appendix F Steering Parameter Analysis

Figure 18: Correctness and IWOR across steering layers for different steering strengths $\alpha$ on PersonaPlex\.

Figure 19: Correctness and IWOR across steering spans $\Delta T_{\mathrm{steer}}$ on PersonaPlex, with the steering layer fixed to 23 and $\alpha=5.5$\. At $\Delta T_{\mathrm{steer}}=3$, both metrics achieve the best performance\.

**Steering layer and strength $\alpha$\.** We investigate how the steering layer and steering strength $\alpha$ affect ZBB performance\. We perform a grid search over candidate steering layers and values of $\alpha$ on PersonaPlex\. As shown in Figure 18, steering is most effective at layer 23 across the tested values of $\alpha$\. The best configuration is achieved at $\alpha=5.5$, where correctness reaches 0\.45 and IWOR reaches 0\.72\.

**Steering span $\Delta T_{\mathrm{steer}}$\.** We further investigate how the steering span affects ZBB performance\. For this scan, we fix the steering layer to 23 and the steering strength to $\alpha=5.5$\. As shown in Figure 19, short steering spans already improve both correctness and IWOR over the interruption condition in Section 6\.2, while a span of 3 timesteps achieves the best overall performance\. Longer spans gradually reduce performance, suggesting that steering is most effective when applied briefly at the interruption onset rather than throughout the interrupted utterance\.

## Appendix G Attention Recovery After Steering

Given that activation steering improves both correctness and IWOR, we further examine whether it changes attention allocation after interruption\. Specifically, we measure how strongly subsequent timesteps attend back to earlier timesteps in the interrupting user input\.

We compute the average attention weight assigned to the input at timestep $t$ by the subsequent $n$ timesteps at the attention layer of interest\. Let $w_{j}(t,\tau)$ denote the attention weight from the query at timestep $\tau$ to the key at timestep $t$ in attention head $j$, and let $\mathcal{H}$ denote the set of attention heads in this layer\. We define $s_{t}$ as the average attention score assigned to timestep $t$ over the next $n$ timesteps, averaged across all attention heads:

$$s_{t}=\frac{1}{n|\mathcal{H}|}\sum_{\tau=t+1}^{t+n}\sum_{j\in\mathcal{H}}w_{j}(t,\tau). \tag{9}$$

This metric $s_{t}$ quantifies how strongly later hidden states attend back to the user input at timestep $t$\. We use it to examine whether injecting the perception vector $\mu_{g\to p}$ restores attention to the beginning of the interrupting utterance\.

We compute $s_{t}$ on ZBB examples under three conditions: no\-interruption, interruption, and interruption with steering\. The heatmaps are aligned to the beginning of the zero\-buffer query, allowing us to compare how much attention the model allocates to the earliest timesteps of the interruption\.

Figure 20 shows that $s_{t}$ decreases in the interruption condition, especially near the beginning of the zero\-buffer query\. After injecting the perception vector, $s_{t}$ in the interruption with steering condition increases substantially relative to the interruption condition and approaches the level of the no\-interruption condition\. This result suggests that the perception vector helps restore attention to the earliest timesteps of the interrupting user input, providing additional evidence that steering mitigates state inertia at the attention level\.

Figure 20: Attention recovery after steering\. Heatmaps show the average attention weight assigned to each interruption timestep $t$ by subsequent timesteps at varying offsets\. Attention around the 5th timestep corresponds to the first semantic word of the zero\-buffer query\. Left: In the interruption condition, attention to the beginning of the zero\-buffer query is reduced, consistent with degraded correctness and IWOR\. Middle: In the interruption with steering condition, injecting the perception vector $\mu_{g\to p}$ restores attention to the earliest interruption timesteps\. Right: In the no\-interruption condition, the model allocates strong attention to the beginning of the zero\-buffer query\.

## Appendix H Full\-Duplex Bench Results

We also evaluate activation steering on Full\-Duplex Bench \(FDB\) \[26\] to test its effect on broader full\-duplex dialogue performance\. We use the FDB user\-interruption evaluation, which scores model responses to interruption queries on a 1–5 scale using GPT\-4\-Turbo\. As shown in Table 5, steering preserves the score within uncertainty, suggesting that the perception vector does not degrade general full\-duplex response quality\.

One reason is that FDB interruption queries often contain a leading filler or attention\-getting phrase before the core semantic content\. For example, queries such as “Let’s switch to talking about laptops” or “Hold on, what time is the meeting scheduled today?” provide several initial words before the main content needed to answer the query\. Therefore, unlike ZBB, FDB does not require the model to process the core semantic content immediately after interruption\. By the time the core content appears, the model may have already transitioned toward the perceptive state, making FDB less sensitive to state inertia\.

Table 5: Full\-Duplex Bench results before and after steering, using our reproduction of the original FDB setup\.

## Appendix I Robustness to False Triggers

We evaluate the robustness of activation steering to false trigger events\. Since steering is applied at the detected interruption onset, an incorrect trigger could inject the perception vector when no real interruption occurs\. To simulate this failure mode, we randomly inject the perception vector at incorrect timesteps while the model answers ZBB queries, and evaluate the resulting response quality using GPT\-4\.1\-mini on a 1–5 scale\.

As shown in Figure 21, response quality degrades gradually as false triggers become more frequent\. This suggests that the method is tolerant to occasional false triggers, but accurate interruption detection remains important for deployment\. Semantic\-aware interruption detection or VAD systems can reduce this risk by distinguishing semantically meaningful speech from non\-semantic acoustic events \[44, 3\]\.

Figure 21: Response quality under false steering triggers\. The x\-axis represents the expected interval between false triggers\. Response quality gradually decreases as false triggers become more frequent\.
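The false\-trigger simulation can be sketched on stored hidden states as follows\. This is a hedged illustration under several assumptions that this appendix does not spell out: steering is taken to be additive \(hidden state plus $\alpha$ times the perception vector\) over a span of consecutive timesteps, false triggers are drawn independently per timestep with probability equal to one over the expected interval, and all function names are hypothetical\. A real run would apply the shift inside the model's forward pass at the steering layer rather than to a saved array\.

```python
import numpy as np


def perception_vector(hidden, is_perceptive):
    """Mean of perception-dominant states minus mean of generation-dominant states."""
    hidden = np.asarray(hidden, dtype=float)
    mask = np.asarray(is_perceptive, dtype=bool)
    if mask.all() or not mask.any():
        raise ValueError("need both perception- and generation-dominant timesteps")
    return hidden[mask].mean(axis=0) - hidden[~mask].mean(axis=0)


def sample_false_triggers(num_steps, expected_interval, rng):
    """Assumed schedule: each timestep triggers with probability 1 / expected_interval."""
    if expected_interval < 1:
        raise ValueError("expected_interval must be at least 1 timestep")
    return np.flatnonzero(rng.random(num_steps) < 1.0 / expected_interval)


def steer(hidden, mu, triggers, alpha=5.5, span=3):
    """Add alpha * mu to every timestep covered by a trigger's steering span."""
    steered = np.array(hidden, dtype=float, copy=True)
    active = np.zeros(len(steered), dtype=bool)
    for t in triggers:
        active[t:t + span] = True  # overlapping spans are steered only once
    steered[active] += alpha * mu
    return steered
```

The defaults `alpha=5.5` and `span=3` mirror the best PersonaPlex configuration reported in Appendix F; sweeping `expected_interval` corresponds to moving along the x\-axis of Figure 21\.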
