CobSeg: Coherence Boundary Modeling for Dialogue Topic Segmentation
Summary
CobSeg is a multi-branch architecture for dialogue topic segmentation that models coherence boundaries and improves lexical transition detection, outperforming prior non-LLM approaches on multiple benchmarks.
# CobSeg: Coherence Boundary Modeling for Dialogue Topic Segmentation
Source: [https://arxiv.org/html/2605.30668](https://arxiv.org/html/2605.30668)
###### Abstract
Dialogue topic segmentation is critical in many human-AI collaborative applications and requires identifying heterogeneous boundary cues, including lexical transitions near utterance edges and semantic discontinuities across utterances. Existing utterance models often dilute these local lexical signals. We propose CobSeg, a novel multi-branch architecture that separates coherence-level semantic continuity from lexical boundary transitions and recovers both through directional boundary prediction. CobSeg further uses boundary informativeness weighting to emphasize high-utility utterance positions, and incorporates a corpus-derived topic coherence cue with learned combination weights. CobSeg is evaluated as a compact trainable segmenter under supervised gold-boundary training and under a pseudo-label setting with automatically induced boundaries; in both settings it predicts boundaries without LLM calls during inference. Across five benchmarks, it improves $P_k$ and $W_d$, particularly when local lexical cues are prominent: under gold supervision, it reduces $P_k$ by 0.7 points and $W_d$ by 0.6 points on VHF, and reaches a $P_k$ of 1.0 on DialSeg711; with induced boundaries, it reduces $P_k$ by 14.8 points on VHF, by 1.5 points on DialSeg711, and by 1.1 points on TIAGE, outperforming prior non-LLM approaches.
## 1 Introduction
Figure 1: Overview structure of CobSeg.

Dialogue topic segmentation (DTS) identifies topic shifts in multi-turn conversations and partitions dialogues into coherent topical units, making long and unstructured conversations easier to organize and process in practical applications such as customer service analysis, meeting review, and the monitoring and management of traffic communication. It supports downstream tasks including dialogue understanding (Wang et al., [2020](https://arxiv.org/html/2605.30668#bib.bib47)), retrieval (Xu et al., [2021a](https://arxiv.org/html/2605.30668#bib.bib53)), and summarization (Qi et al., [2021](https://arxiv.org/html/2605.30668#bib.bib36)). In particular, it helps enhance human–AI collaboration in Vessel Traffic Services (VTS) and Air Traffic Control (ATC) systems. Recent work spans supervised models trained on gold boundaries (Koshorek et al., [2018](https://arxiv.org/html/2605.30668#bib.bib22); Jiang et al., [2023](https://arxiv.org/html/2605.30668#bib.bib21)), unsupervised methods that induce boundaries from coherence patterns (Gao et al., [2023](https://arxiv.org/html/2605.30668#bib.bib13); Xing and Carenini, [2021](https://arxiv.org/html/2605.30668#bib.bib51)), pseudo-label training with auxiliary signals (Artemiev et al., [2024](https://arxiv.org/html/2605.30668#bib.bib1)), and LLM-based reasoning (Lee et al., [2025](https://arxiv.org/html/2605.30668#bib.bib23); Das et al., [2024](https://arxiv.org/html/2605.30668#bib.bib7)).
However, direct LLM inference represents a different operating regime from trainable segmentation models\. LLM\-based methods typically rely on large general\-purpose models and prompt\-based reasoning at test time, while trainable segmenters aim to learn task\-specific boundary predictors that can be deployed independently after training\. This distinction is important in practical settings where inference cost, latency, reproducibility, on\-premise deployment requirements, and data privacy make repeated LLM calls less desirable\.
As [Figure 1](https://arxiv.org/html/2605.30668#S1.F1) illustrates, the core challenge is assigning reliable decision signals to candidate boundaries by integrating utterance-level content, cut-level transition evidence, and sequence-level decoding.
Existing work models DTS at the utterance level, which introduces several limitations. Sentence-level encoders (Devlin et al., [2019](https://arxiv.org/html/2605.30668#bib.bib8); Liu et al., [2019](https://arxiv.org/html/2605.30668#bib.bib26)) compress each utterance into a fixed vector, averaging boundary-adjacent tokens with mid-utterance content and diluting the lexical transitions that mark topic shifts. Recent utterance-pair frameworks (Yang et al., [2025](https://arxiv.org/html/2605.30668#bib.bib57); Somasundaran et al., [2020](https://arxiv.org/html/2605.30668#bib.bib41)) operate entirely at the utterance level and cannot recover these token-level cues. Other models (Li et al., [2018](https://arxiv.org/html/2605.30668#bib.bib24); Nair et al., [2023](https://arxiv.org/html/2605.30668#bib.bib31)) allocate capacity uniformly across all positions, although only a small fraction of utterance positions carry decisive transition signals. Finally, methods that use corpus-level topic coherence (Gao et al., [2023](https://arxiv.org/html/2605.30668#bib.bib13); Park et al., [2023](https://arxiv.org/html/2605.30668#bib.bib32); Glavaš et al., [2016](https://arxiv.org/html/2605.30668#bib.bib15); Gong et al., [2022](https://arxiv.org/html/2605.30668#bib.bib16)) confine it to fixed preprocessing heuristics rather than modeling deeper patterns across multiple utterance pairs.
CobSeg addresses each limitation through a targeted design while following the trainable-segmenter regime described above: it performs boundary prediction with a compact task-specific model and does not require LLM calls during inference. A Lexical Boundary Detector retains token-level transition evidence by upweighting utterance-edge tokens before pooling. Utterance boundary informativeness weighting (UBIW) learns per-position informativeness scores, concentrating capacity on high-utility cut positions. A topic coherence cue from unsupervised keyword induction supplies statistical boundary evidence with learned coefficients, injected at the logit level so the model can strengthen or suppress the signal during training. Directional boundary heads separately model topic conclusion and topic initiation, reflecting the asymmetry of the two signals.
The main contributions are:
- We propose CobSeg, a compact trainable multi-branch framework that separates lexical transition cues from semantic coherence signals for DTS, enabling efficient boundary prediction without inference-time LLM calls.
- We introduce UBIW and directional prediction heads to model sparse, asymmetric evidence around topic shifts.
- We evaluate CobSeg under supervised and pseudo-label settings on five benchmarks, showing improvements in $P_k$ and $W_d$ with targeted ablations.
## 2 Methodology
### 2.1 Overall Structure
Figure 2: Detailed architecture of CobSeg.

Given a dialogue $\mathcal{D}=\{u_1,\ldots,u_T\}$ with $T$ utterances, dialogue topic segmentation predicts whether each adjacent utterance pair forms a topic boundary. The boundary sequence is defined as $\mathbf{y}=\{y_1,\ldots,y_{T-1}\}$, where $y_t=1$ indicates a topic boundary between $u_t$ and $u_{t+1}$. Thus, each prediction is associated with a cut position $t\in\{1,\ldots,T-1\}$ rather than an individual utterance. CobSeg performs structured boundary prediction over all candidate cut positions using a linear-chain CRF:

$$\mathbf{y}^{\ast}=\arg\max_{\mathbf{y}\in\{0,1\}^{T-1}}\left[\sum_{t=1}^{T-1}s_t(y_t;\mathcal{D})+\sum_{t=1}^{T-2}A_{y_t,y_{t+1}}\right], \tag{1}$$

where $s_t(y_t;\mathcal{D})$ is the local emission score at cut position $t$, and $A\in\mathbb{R}^{2\times 2}$ is the learned transition matrix.
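The maximization in Eq. (1) can be carried out exactly with the standard Viterbi recursion. The following is a minimal pure-Python sketch, not the authors' implementation; the function name `viterbi_decode` and the list-of-lists layout are assumptions made for illustration, and cut positions are 0-based in the code.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the label sequence that maximizes Eq. (1).

    emissions[t][y]   : local score s_t(y) at cut position t (0-based here)
    transitions[a][b] : transition score A[a][b] from label a to label b
    """
    n = len(emissions)
    if n == 0:
        return []
    num_labels = len(emissions[0])

    # best[t][y] is the best total score of any path ending in label y at t.
    best = [list(emissions[0])]
    backpointers = []
    for t in range(1, n):
        row, ptr = [], []
        for y in range(num_labels):
            candidates = [best[t - 1][p] + transitions[p][y]
                          for p in range(num_labels)]
            p_star = max(range(num_labels), key=lambda p: candidates[p])
            row.append(candidates[p_star] + emissions[t][y])
            ptr.append(p_star)
        best.append(row)
        backpointers.append(ptr)

    # Trace back from the best final label.
    y = max(range(num_labels), key=lambda k: best[-1][k])
    path = [y]
    for ptr in reversed(backpointers):
        y = ptr[y]
        path.append(y)
    path.reverse()
    return path
```

With zero transition scores the decoder reduces to a per-position argmax; a transition matrix that penalizes label changes pushes the output toward longer runs of the same label.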
[Figure 2](https://arxiv.org/html/2605.30668#S2.F2) illustrates the overall architecture of CobSeg. The model computes each cut-level emission score from three complementary views of the dialogue: a Coherence Encoder that models semantic continuity across utterances, a Lexical Boundary Detector that preserves token-level transition cues near utterance edges, and a Topic Structure Extractor that provides corpus-informed topic coherence evidence. The same architecture is used in both supervised and pseudo-label settings; only the source of boundary supervision differs. Training minimizes the CRF negative log-likelihood over cut positions, augmented with lightweight auxiliary losses that encourage sparse UBIW scores and margin-based separation of boundary-adjacent tokens (Koshorek et al., [2018](https://arxiv.org/html/2605.30668#bib.bib22)); full hyperparameters are listed in [Table 7](https://arxiv.org/html/2605.30668#A3.T7).
### 2.2 Multi-View Transition Representation
For an utterance $u_t$, let $\mathbf{x}_t^{(s)}\in\mathbb{R}^{d}$ denote the Coherence Encoder representation obtained by mean pooling the hidden states of the main encoder (Devlin et al., [2019](https://arxiv.org/html/2605.30668#bib.bib8); Liu et al., [2019](https://arxiv.org/html/2605.30668#bib.bib26)). Let $\mathbf{v}_{t,j}\in\mathbb{R}^{d}$ denote the contextual state of the $j$-th token in $u_t$, $\mathbf{p}_j$ its positional embedding, $m_{t,j}\in\{0,1\}$ the token mask, and $\rho_{t,j}\in[0,1]$ the normalized token position. Following the TextSeg backbone (Koshorek et al., [2018](https://arxiv.org/html/2605.30668#bib.bib22)), the Lexical Boundary Detector applies a temporally sensitive structure for token sequence modeling within the utterance. Bidirectional context modeling allows the encoder to capture dependencies from both past and future tokens (Schuster and Paliwal, [1997](https://arxiv.org/html/2605.30668#bib.bib38)). Let $\mathrm{Enc}_{\mathrm{tok}}$ denote a bidirectional LSTM encoder that processes token sequences within each utterance:

$$\mathbf{h}_{t,j}=\mathrm{Enc}_{\mathrm{tok}}(\mathbf{v}_{t,j}+\mathbf{p}_j). \tag{2}$$

The resulting token states are pooled with an edge-aware weighting rule:

$$\begin{split}\pi_{t,j}&=\alpha+(1-\alpha)\lvert 2\rho_{t,j}-1\rvert^{\gamma},\\ \mathbf{x}_t^{(w)}&=\mathbf{W}_w\frac{\sum_j m_{t,j}\pi_{t,j}\mathbf{h}_{t,j}}{\sum_j m_{t,j}\pi_{t,j}},\end{split} \tag{3}$$

which upweights tokens near utterance boundaries. For the Topic Structure Extractor, $\mathbf{x}_t^{(t)}\in\mathbb{R}^{d}$ denotes a static topic representation obtained by mean-pooling the outputs of a coherence-oriented encoder (Gao et al., [2021](https://arxiv.org/html/2605.30668#bib.bib14)) over $u_t$.
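To make the edge-aware rule in Eq. (3) concrete, the sketch below computes the weights $\pi_{t,j}$ and the weighted average for one utterance. It is an illustration under stated assumptions: the values `alpha=0.5` and `gamma=2.0` are hypothetical defaults rather than the paper's settings, the normalized position $\rho$ is taken as $j/(n-1)$ over valid tokens (0.5 for a single token), and the learned projection $\mathbf{W}_w$ is omitted.

```python
from typing import List, Sequence


def edge_weights(num_tokens: int, alpha: float = 0.5,
                 gamma: float = 2.0) -> List[float]:
    """pi_j = alpha + (1 - alpha) * |2 * rho_j - 1| ** gamma  (Eq. 3).

    rho_j is assumed to be j / (n - 1); a lone token gets rho = 0.5.
    """
    if num_tokens == 1:
        rhos = [0.5]
    else:
        rhos = [j / (num_tokens - 1) for j in range(num_tokens)]
    return [alpha + (1.0 - alpha) * abs(2.0 * r - 1.0) ** gamma for r in rhos]


def edge_aware_pool(states: Sequence[Sequence[float]],
                    mask: Sequence[int],
                    alpha: float = 0.5,
                    gamma: float = 2.0) -> List[float]:
    """Weighted mean of token states; masked (padding) tokens are skipped.

    The projection W_w from Eq. (3) is left out of this sketch.
    """
    valid = [h for h, m in zip(states, mask) if m]
    if not valid:
        raise ValueError("utterance has no valid tokens")
    weights = edge_weights(len(valid), alpha, gamma)
    total = sum(weights)
    dim = len(valid[0])
    return [sum(w * h[k] for w, h in zip(weights, valid)) / total
            for k in range(dim)]
```

For three tokens with these defaults the weights are `[1.0, 0.5, 1.0]`, so the first and last tokens count twice as much as the middle one. Setting `alpha=1.0` recovers plain mean pooling.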
For each branch $b\in\{s,w,t\}$, CobSeg converts utterance representations into cut-oriented transition features. At cut position $t$, the left and right utterance states are compared through

$$\mathbf{z}_t^{(b)}=\left[\mathbf{x}_t^{(b)};\mathbf{x}_{t+1}^{(b)};\lvert\mathbf{x}_{t+1}^{(b)}-\mathbf{x}_t^{(b)}\rvert;\mathbf{x}_t^{(b)}\odot\mathbf{x}_{t+1}^{(b)}\right], \tag{4}$$

which summarizes persistence, change magnitude, and feature interaction. The branch state is then updated by two asymmetric gated residual adapters that share the same transition map $\phi_b$, scalar gate $g_b$, and transition feature $\mathbf{z}_t^{(b)}$, but anchor to opposite sides of the cut:

$$\begin{split}\mathbf{r}_t^{(b,\mathrm{end})}&=(1-\sigma(g_b))\,\mathbf{x}_t^{(b)}+\sigma(g_b)\,\phi_b(\mathbf{z}_t^{(b)}),\\ \mathbf{r}_t^{(b,\mathrm{start})}&=(1-\sigma(g_b))\,\mathbf{x}_{t+1}^{(b)}+\sigma(g_b)\,\phi_b(\mathbf{z}_t^{(b)}),\end{split} \tag{5}$$

where $\mathbf{r}_t^{(b,\mathrm{end})}$ anchors to the left utterance $u_t$ for modeling topic conclusion, and $\mathbf{r}_t^{(b,\mathrm{start})}$ anchors to the right utterance $u_{t+1}$ for modeling topic initiation. The learned scalar gate $g_b$ controls how much the transition correction replaces the base utterance representation in each direction.
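Eqs. (4) and (5) can be sketched for a single cut and a single branch as follows. This is an illustrative sketch only: `phi` stands in for the learned transition map $\phi_b$ and is passed as an arbitrary callable that maps the $4d$-dimensional feature back to $d$ dimensions, and all function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def transition_features(x_left: Sequence[float],
                        x_right: Sequence[float]) -> List[float]:
    """Eq. (4): [x_t ; x_{t+1} ; |x_{t+1} - x_t| ; x_t * x_{t+1}]."""
    diff = [abs(r - l) for l, r in zip(x_left, x_right)]
    prod = [l * r for l, r in zip(x_left, x_right)]
    return list(x_left) + list(x_right) + diff + prod


def directional_adapters(
    x_left: Sequence[float],
    x_right: Sequence[float],
    phi: Callable[[List[float]], Sequence[float]],
    gate: float,
) -> Tuple[List[float], List[float]]:
    """Eq. (5): gated residual mix, anchored left (end) and right (start)."""
    correction = phi(transition_features(x_left, x_right))
    g = sigmoid(gate)
    r_end = [(1.0 - g) * l + g * c for l, c in zip(x_left, correction)]
    r_start = [(1.0 - g) * r + g * c for r, c in zip(x_right, correction)]
    return r_end, r_start
```

Both outputs share the same correction term and differ only in which utterance they are anchored to, which is what keeps the directional specialization cheap. As the gate goes to large negative values, each output falls back to the plain utterance representation.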
The same next sentence prediction (NSP) fusion and contextualization steps are applied to both directional adapter outputs. For brevity, $\mathbf{r}_t^{(b)}$ stands for either $\mathbf{r}_t^{(b,\mathrm{end})}$ or $\mathbf{r}_t^{(b,\mathrm{start})}$ in the equations that follow. The NSP channel is disabled in the supervised main results reported in [Table 1](https://arxiv.org/html/2605.30668#S2.T1); in those experiments $\mathbf{r}_t^{(b,\mathrm{nsp})}=\mathbf{r}_t^{(b)}$ and $\beta_{\mathrm{nsp}}=0$. The NSP channel is enabled only in the pseudo-label setting, where the cross-encoder boundary probability $q_t$ provides an additional coherence signal to compensate for noisier training targets. When the optional NSP channel is enabled, adjacent utterance pairs are encoded by a cross encoder to produce an auxiliary pair representation $\mathbf{n}_t$ and a pairwise boundary probability $q_t$ (Devlin et al., [2019](https://arxiv.org/html/2605.30668#bib.bib8)). The auxiliary representation is injected into each branch as

$$\mathbf{r}_t^{(b,\mathrm{nsp})}=\mathbf{r}_t^{(b)}+\sigma(\lambda_b)\,\mathbf{n}_t, \tag{6}$$

where $\lambda_b$ is a fusion gate specific to each branch. When the NSP channel is absent, $\mathbf{r}_t^{(b,\mathrm{nsp})}=\mathbf{r}_t^{(b)}$. Each branch is then contextualized by a temporally sensitive structure for sequence modeling at the dialogue level, following the TextSeg architecture (Koshorek et al., [2018](https://arxiv.org/html/2605.30668#bib.bib22)). The contextualizer aggregates information across the dialogue sequence, allowing each cut position to access both preceding and following context (Schuster and Paliwal, [1997](https://arxiv.org/html/2605.30668#bib.bib38); Vaswani et al., [2017](https://arxiv.org/html/2605.30668#bib.bib45)):

$$\mathbf{f}_t^{(b)}=\mathcal{C}_b\left(\mathbf{r}_{1:T-1}^{(b,\mathrm{nsp})},t\right)+\mathbf{W}_b\,\mathbf{r}_t^{(b,\mathrm{nsp})}, \tag{7}$$

where $\mathcal{C}_b$ is a temporally sensitive sequence encoder stacked over the transition features and the residual projection $\mathbf{W}_b$ preserves the local cut-position signal after dialogue-level encoding.
### 2.3 Utterance Boundary Informativeness Weighting
In a typical multi turn dialogue, only a small fraction of utterances carry decisive evidence about topic transitions\. Most turns continue an ongoing topic without introducing lexical shifts\. Treating every utterance position as equally informative dilutes the model’s representational budget\.
CobSeg addresses this through utterance boundary informativeness weighting, which learns to score each cut position by its predicted utility for boundary prediction and reweights the branch state accordingly\. The module operates independently on the end view and start view of each branch\. This allows the model to learn that different utterances may carry distinct informativeness for recognizing topic conclusions versus topic initiations\. For instance, an utterance that summarizes a preceding topic is informative for the end direction, while one that introduces new terminology may be more informative for the start direction\.
For branch state $\mathbf{f}_t^{(b)}$, each branch $b$ and direction $\star\in\{\mathrm{end},\mathrm{start}\}$ has its own scorer $\psi_{b,\star}$ and its own informativeness score $a_t^{(b,\star)}$. Let $\tilde{a}_t^{(b,\star)}$ denote the centered version over valid positions, and $\hat{\mathbf{f}}_t^{(b,\star)}$ the reweighted branch state. The reweighting mechanism draws inspiration from attention-based feature weighting, where importance scores modulate the contribution of different positions (Bahdanau et al., [2014](https://arxiv.org/html/2605.30668#bib.bib3)). The reweighting takes the form

$$\begin{split}a_t^{(b,\star)}&=\sigma\left(\psi_{b,\star}(\mathbf{f}_t^{(b)})\right),\\ \tilde{a}_t^{(b,\star)}&=a_t^{(b,\star)}-\frac{1}{T-1}\sum_{\tau=1}^{T-1}a_{\tau}^{(b,\star)},\\ \hat{\mathbf{f}}_t^{(b,\star)}&=\mathbf{f}_t^{(b)}\left(1+\sigma(\xi)\,\tilde{a}_t^{(b,\star)}\right),\end{split} \tag{8}$$

where $\psi_{b,\star}$ is a lightweight informativeness scorer specific to branch $b$ and direction $\star$, and $\xi$ controls the residual scaling strength. The centering step makes the module redistributive rather than uniformly amplifying. Positions that appear more boundary-informative are emphasized relative to less informative ones, while the mean activation across the dialogue is preserved. This design prevents the model from simply boosting all positions and forces it to make comparative judgments about which utterances genuinely contribute to boundary evidence. The reweighted states feed directly into the directional boundary heads described next.
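The redistributive effect of Eq. (8) is easy to check numerically. The sketch below is a hedged illustration: the scorer $\psi_{b,\star}$ is not modeled, so its raw outputs are passed in directly as `raw_scores`, and the function name `ubiw_reweight` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ubiw_reweight(
    states: Sequence[Sequence[float]],
    raw_scores: Sequence[float],
    xi: float,
) -> Tuple[List[List[float]], List[float]]:
    """Apply Eq. (8) to one branch and one direction.

    states[t]     : branch state f_t at cut position t
    raw_scores[t] : output of the scorer psi(f_t), assumed precomputed
    xi            : scalar controlling the residual scaling strength
    Returns the reweighted states and the centered scores.
    """
    a = [sigmoid(s) for s in raw_scores]
    mean_a = sum(a) / len(a)
    centered = [v - mean_a for v in a]          # sums to zero over positions
    scale = sigmoid(xi)
    reweighted = [[h * (1.0 + scale * c) for h in f]
                  for f, c in zip(states, centered)]
    return reweighted, centered
```

Because the centered scores sum to zero, the per-position multipliers average to one: positions scored above the dialogue mean are amplified, positions below it are attenuated, and uniform scores leave every state unchanged.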
### 2.4 Directional Evidence Composition
A topic boundary should indicate both the conclusion of the preceding topic and the initiation of a new one\. These two signals are asymmetric: conclusion may appear through summary cues or reduced lexical specificity, while initiation may involve new terms, topic\-setting phrases, or style shifts\. CobSeg models this distinction with direction\-specific prediction heads for the end view and start view\. Both views share contextualized features but use separate UBIW scorers and classifiers, enabling lightweight directional specialization without separate encoders\.
Given the reweighted branch states $\hat{\mathbf{f}}_t^{(b,\mathrm{end})}$ and $\hat{\mathbf{f}}_t^{(b,\mathrm{start})}$, each branch produces two-dimensional logits:

$$\begin{split}\mathbf{e}_t^{(b,\mathrm{end})}&=\mathrm{Head}_{b,\mathrm{end}}\left(\hat{\mathbf{f}}_t^{(b,\mathrm{end})}\right),\\ \mathbf{e}_t^{(b,\mathrm{start})}&=\mathrm{Head}_{b,\mathrm{start}}\left(\hat{\mathbf{f}}_t^{(b,\mathrm{start})}\right),\end{split} \tag{9}$$

where the last component denotes the positive boundary class. Each head is a single linear projection, keeping per-branch overhead small.
The Coherence Encoder and Lexical Boundary Detector form the main evidence channel, combining semantic continuity and vocabulary\-change signals\. The Topic Structure Extractor is added as a residual expert:
$$\begin{split}\mathbf{e}_t^{(\mathrm{main},\star)}&=\omega\,\mathbf{e}_t^{(s,\star)}+(1-\omega)\,\mathbf{e}_t^{(w,\star)},\\ \mathbf{e}_t^{(\star)}&=(1-\tau)\,\mathbf{e}_t^{(\mathrm{main},\star)}+\tau\,\mathbf{e}_t^{(t,\star)},\end{split} \tag{10}$$

with learned gates $\omega,\tau\in(0,1)$. This lets topic-level structure contribute when useful while preserving local lexical evidence. The positive component of $\mathbf{e}_t^{(\star)}$ is denoted $\bar{\ell}_{t,1}^{(\star)}$.
The final positive boundary logit further incorporates topic coherence cues and a class\-prior bias:
$$\ell_{t,1}^{(\star)}=\bar{\ell}_{t,1}^{(\star)}+\beta_{\mathrm{coh}}\,k_t^{\mathrm{coh}}+\beta_{\mathrm{bnd}}\,k_t^{\mathrm{bnd}}+\beta_{\mathrm{nsp}}\log\frac{q_t}{1-q_t}+\log w_{+}, \tag{11}$$

where $k_t^{\mathrm{coh}}$ and $k_t^{\mathrm{bnd}}$ are keyword-based topic coherence channels. The fixed bias $\log w_{+}$ compensates for the rarity of boundaries and is applied during both training and inference; $w_{+}=2.0$ is selected based on validation performance.
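As a worked illustration of Eqs. (10) and (11), the sketch below composes the positive boundary logit for one direction at one cut position. It assumes the positive components of the three branch logits are already available as scalars; the function and argument names are hypothetical, and apart from `w_pos=2.0` (the value stated above) every numeric input is a placeholder.

```python
import math
from typing import Optional


def positive_boundary_logit(
    e_s: float, e_w: float, e_t: float,
    omega: float, tau: float,
    k_coh: float, k_bnd: float,
    beta_coh: float, beta_bnd: float,
    q: Optional[float] = None, beta_nsp: float = 0.0,
    w_pos: float = 2.0,
) -> float:
    """Compose the final positive logit from branch logits and cues.

    e_s, e_w, e_t : positive-class logits of the coherence, lexical, and
                    topic branches (scalars here for simplicity)
    omega, tau    : gates in (0, 1) from Eq. (10)
    q             : NSP boundary probability, or None when NSP is disabled
    """
    main = omega * e_s + (1.0 - omega) * e_w            # Eq. (10), first row
    fused = (1.0 - tau) * main + tau * e_t              # Eq. (10), second row
    logit = fused + beta_coh * k_coh + beta_bnd * k_bnd + math.log(w_pos)
    if q is not None:
        logit += beta_nsp * math.log(q / (1.0 - q))     # log-odds of q
    return logit
```

With `w_pos=2.0` the class-prior term adds roughly 0.69 to every positive logit, which shifts the decision toward emitting boundaries regardless of the other evidence. An NSP probability of 0.5 contributes nothing because its log-odds are zero.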
Table 1: Supervised results for trainable DTS segmenters. ↓: lower is better; ↑: higher is better; bold: best per column; underline: second best per column; --: not available.
The keyword cues are built in two stages\. First, training utterances are represented using semantic embeddings and TF\-IDF features, then clustered intoKKcoarse topic regions by K\-Means\. For each region, TF\-IDF identifies cohesive salient terms and ambient background terms\. Boundary\-related terms are extracted from utterances far from cluster centroids, while centroid\-near terms serve as core reference vocabulary\.
Second, during inference, each utterance is scored by keyword density without explicit region assignment. Let $\mathcal{C}_{\mathrm{sal}}$, $\mathcal{C}_{\mathrm{amb}}$, $\mathcal{B}_{\mathrm{mrk}}$, and $\mathcal{B}_{\mathrm{core}}$ denote the four keyword sets, and let $\mathrm{dens}(u,\mathcal{S})=|\mathrm{tokens}(u)\cap\mathcal{S}|/|\mathrm{tokens}(u)|$. The coherence and boundary scores for cut position $t$ are:

$$\begin{aligned}k_t^{\mathrm{coh}}&=\sum_{i=t}^{t+1}\Big[\mathrm{dens}(u_i,\mathcal{C}_{\mathrm{sal}})+\lambda_{\mathrm{amb}}\,\mathrm{dens}(u_i,\mathcal{C}_{\mathrm{amb}})\Big],\\ k_t^{\mathrm{bnd}}&=\sum_{i=t}^{t+1}\Big[\mathrm{dens}(u_i,\mathcal{B}_{\mathrm{mrk}})-\lambda_{\mathrm{core}}\,\mathrm{dens}(u_i,\mathcal{B}_{\mathrm{core}})\Big],\end{aligned} \tag{12}$$

with $\lambda_{\mathrm{amb}}=0.25$, $\lambda_{\mathrm{core}}=0.15$, $K=8$, and a salient keyword budget of 220 terms per dataset. These keyword sets are fixed after preprocessing; only $\beta_{\mathrm{coh}}$, $\beta_{\mathrm{bnd}}$, and $\beta_{\mathrm{nsp}}$ are learned. Since the coefficients can shrink toward zero, uninformative keyword cues can be ignored automatically. The ablation in [Table 5](https://arxiv.org/html/2605.30668#S3.T5) shows that removing the coherence cue worsens $P_k$ and $W_d$, indicating that the fixed keyword partition provides useful signal.
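The keyword-density scores in Eq. (12) can be sketched in a few lines. This is an illustration under assumptions: $\mathrm{tokens}(u)$ is read literally as the set of distinct tokens in the utterance (a multiset reading is equally plausible from the notation), tokenization is assumed to have happened upstream, and the keyword sets in the usage example are invented for the demonstration.

```python
from typing import Sequence, Set, Tuple


def dens(tokens: Sequence[str], keywords: Set[str]) -> float:
    """Share of an utterance's distinct tokens that fall in a keyword set."""
    distinct = set(tokens)
    if not distinct:
        return 0.0
    return len(distinct & keywords) / len(distinct)


def keyword_cut_scores(
    left: Sequence[str], right: Sequence[str],
    c_sal: Set[str], c_amb: Set[str],
    b_mrk: Set[str], b_core: Set[str],
    lam_amb: float = 0.25, lam_core: float = 0.15,
) -> Tuple[float, float]:
    """Return (k_coh, k_bnd) for the cut between `left` and `right`."""
    pair = (left, right)
    k_coh = sum(dens(u, c_sal) + lam_amb * dens(u, c_amb) for u in pair)
    k_bnd = sum(dens(u, b_mrk) - lam_core * dens(u, b_core) for u in pair)
    return k_coh, k_bnd


# Usage with made-up keyword sets (hypothetical, for illustration only).
k_coh, k_bnd = keyword_cut_scores(
    ["the", "port", "is", "closed"],
    ["switch", "to", "channel", "six"],
    c_sal={"port", "channel"}, c_amb={"the", "to"},
    b_mrk={"switch"}, b_core={"port"},
)
```

Both scores are plain lookups against fixed sets, so they add negligible cost at inference; the learned coefficients $\beta_{\mathrm{coh}}$ and $\beta_{\mathrm{bnd}}$ decide how much they influence the final logit.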
The final cut emission averages end\-view and start\-view evidence:
$$\mathbf{e}_t^{\mathrm{cut}}=\frac{1}{2}\mathbf{e}_t^{(\mathrm{end})}+\frac{1}{2}\mathbf{e}_t^{(\mathrm{start})}, \tag{13}$$

for $t\in\{1,\ldots,T-1\}$. Here $\mathbf{e}_t^{(\mathrm{end})}$ scores topic conclusion from $u_t$, while $\mathbf{e}_t^{(\mathrm{start})}$ scores topic initiation from $u_{t+1}$. The CRF local score is $s_t(y_t;\mathcal{D})=[\mathbf{e}_t^{\mathrm{cut}}]_{y_t}$. Thus, a strong boundary requires evidence from both sides of the cut.
Training uses the same CRF decoding interface in both supervised and pseudo-label settings. The model minimizes negative log-likelihood over the $T-1$ cut positions; only the source of boundary labels differs.
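The CRF negative log-likelihood mentioned above is the log partition function minus the score of the reference label sequence, where the score is the bracketed quantity in Eq. (1). A minimal sketch using the forward algorithm follows; it omits the auxiliary losses, assumes no separate start or end transition scores (none appear in Eq. (1)), and uses hypothetical function names.

```python
import math
from typing import List, Sequence


def logsumexp(values: Sequence[float]) -> float:
    # Stable log(sum(exp(v))) via max subtraction.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(emissions: Sequence[Sequence[float]],
               transitions: Sequence[Sequence[float]],
               labels: Sequence[int]) -> float:
    """Score of one label sequence: emissions plus pairwise transitions."""
    score = sum(emissions[t][y] for t, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score


def crf_nll(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]],
            labels: Sequence[int]) -> float:
    """Negative log-likelihood of `labels` under a linear-chain CRF."""
    num_labels = len(emissions[0])
    # alpha[y]: log-sum of scores of all prefixes ending in label y.
    alpha: List[float] = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][y]
            + logsumexp([alpha[p] + transitions[p][y]
                         for p in range(num_labels)])
            for y in range(num_labels)
        ]
    log_partition = logsumexp(alpha)
    return log_partition - path_score(emissions, transitions, labels)
```

In the supervised setting `labels` would hold gold boundaries and in the pseudo-label setting induced ones; the loss computation itself is identical, which matches the statement that only the label source differs.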
## 3 Experiments
Table 2: Regime-aware DTS comparison under non-gold-boundary settings.
### 3.1 Setup
We distinguish trainable segmentation architectures from direct LLM inference methods in our evaluation scope. The supervised comparison covers trainable segmenters on five benchmarks: VHF (Sun et al., [2025](https://arxiv.org/html/2605.30668#bib.bib43)), DialSeg711 (Xu et al., [2021b](https://arxiv.org/html/2605.30668#bib.bib55)), Doc2Dial (Feng et al., [2020](https://arxiv.org/html/2605.30668#bib.bib12)), TIAGE (Xie et al., [2021](https://arxiv.org/html/2605.30668#bib.bib50)), and SuperSeg (Jiang et al., [2023](https://arxiv.org/html/2605.30668#bib.bib21)). Evaluation uses $P_k$ (Beeferman et al., [1999](https://arxiv.org/html/2605.30668#bib.bib4)), $W_d$ (Pevzner and Hearst, [2002](https://arxiv.org/html/2605.30668#bib.bib34)), and boundary $F_1$, with all methods evaluated under the identical train/validation/test splits in [Table 6](https://arxiv.org/html/2605.30668#A2.T6), a unified metric implementation, and the same hyperparameter tuning protocol; hyperparameters are listed in [Table 7](https://arxiv.org/html/2605.30668#A3.T7). Direct LLM inference is excluded from the main comparison because it relies on large general-purpose models and prompt-based reasoning at inference time, whereas CobSeg performs boundary prediction with a learned segmenter and does not call an LLM during inference.
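For readers unfamiliar with the two window-based metrics, the sketch below gives one common formulation of $P_k$ and WindowDiff over boundary sequences of length $T-1$. It is not the paper's unified metric implementation: the default window (half the mean reference segment length, rounded, at least 1) is a widespread convention assumed here, edge handling differs between toolkits, and the functions return fractions in $[0,1]$ rather than percentage points.

```python
from typing import List, Optional, Sequence, Tuple


def default_window(ref: Sequence[int]) -> int:
    """Half the mean reference segment length, in utterances (assumed)."""
    num_utterances = len(ref) + 1
    num_segments = sum(ref) + 1
    return max(1, round(num_utterances / (2 * num_segments)))


def _window_counts(ref: Sequence[int], hyp: Sequence[int],
                   k: Optional[int]) -> List[Tuple[int, int]]:
    # Boundary counts between utterances i and i + k, for every valid i.
    if len(ref) != len(hyp):
        raise ValueError("reference and hypothesis differ in length")
    k = default_window(ref) if k is None else k
    num_windows = len(ref) + 1 - k
    if k < 1 or num_windows < 1:
        raise ValueError("window size does not fit the dialogue")
    return [(sum(ref[i:i + k]), sum(hyp[i:i + k]))
            for i in range(num_windows)]


def pk(ref: Sequence[int], hyp: Sequence[int],
       k: Optional[int] = None) -> float:
    """Fraction of windows where ref and hyp disagree on 'same segment'."""
    counts = _window_counts(ref, hyp, k)
    return sum((r == 0) != (h == 0) for r, h in counts) / len(counts)


def window_diff(ref: Sequence[int], hyp: Sequence[int],
                k: Optional[int] = None) -> float:
    """Fraction of windows where the boundary counts differ."""
    counts = _window_counts(ref, hyp, k)
    return sum(r != h for r, h in counts) / len(counts)
```

WindowDiff is never lower than $P_k$ for the same window, because any window that counts as a $P_k$ error also has unequal boundary counts. For example, with `ref=[1, 1, 0, 0]`, `hyp=[1, 0, 0, 0]`, and `k=2`, $P_k$ is 1/3 while WindowDiff is 2/3.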
### 3.2 Main Results
#### 3.2.1 Supervised results for trainable DTS
[Table 1](https://arxiv.org/html/2605.30668#S2.T1) reports the supervised comparison among trainable DTS segmenters. Baseline values are obtained under the split configuration and hyperparameter tuning protocol described in [Appendix B](https://arxiv.org/html/2605.30668#A2).
CobSeg achieves the lowest $P_k$ and $W_d$ on 5 of 5 datasets and the highest $F_1$ on 3 of 5 datasets. The remaining two datasets show a consistent trade-off: CobSeg improves local cut positioning but does not always produce the best boundary-rate calibration. On Doc2Dial and SuperSeg, encoder baselines obtain higher $F_1$ despite worse $P_k$ and $W_d$, indicating that sharper local boundary decisions do not fully determine the global number of emitted boundaries.
#### 3.2.2 Regime-aware comparison under NGB
[Table 2](https://arxiv.org/html/2605.30668#S3.T2) reports results under Non-Gold-Boundary (NGB) settings, covering classical and neural unsupervised methods, methods trained with induced supervision, direct LLM inference methods, and CobSeg pseudo-label. Because these methods differ in how boundary signals are obtained and how inference is performed, the table should be interpreted as a regime-aware comparison rather than a single unified leaderboard. In particular, direct LLM inference methods perform prompt-based reasoning at test time, whereas CobSeg pseudo-label trains a learned segmenter from induced supervision and performs inference without LLM calls.
Overall, CobSeg pseudo-label achieves the strongest performance among methods without test-time LLM calls on most datasets. It obtains the best $P_k$ and $W_d$ on VHF, DialSeg711, Doc2Dial, and TIAGE, with especially large gains on VHF. It also achieves the best $F_1$ on VHF and TIAGE, and remains competitive on DialSeg711 and Doc2Dial, although DynamicCOCO gives higher $F_1$ on these two datasets. On SuperSeg, CobSeg pseudo-label is competitive but not dominant: it improves over most unsupervised baselines in $P_k$ and $W_d$, but SumSeg achieves the best $F_1$ and Def-DTS remains stronger under direct LLM inference. These results suggest that pseudo-label training can substantially narrow the gap to stronger supervision, while its effectiveness still depends on the quality of induced pseudo segments, especially on larger and more open-domain corpora.
### 3.3 Pseudo Label Source Analysis
To compare pseudo\-label quality, four TeT\-based boundary generators are evaluated directly against gold test annotations on DialSeg711 and Doc2Dial\. They share the same peak\-picking backbone and differ only in the adjacent\-utterance similarity signal: TF\-IDF, GloVe, BERT CLS, or BERT NSP\. No segmenter is trained in this analysis; the goal is to measure induction quality alone\. Reconstructed pseudo\-boundaries from the CobSeg pipeline are evaluated under the same protocol and are comparable to these TeT\-based sources\.
Table 3: Pseudo-label source quality on DialSeg711 and Doc2Dial, evaluated at the 100-dialogue scale.

Table 4: Full supervision sweep on the test split.

[Table 3](https://arxiv.org/html/2605.30668#S3.T3) shows that pseudo-label quality is uneven across metrics. TeT$_{\text{NSP}}$ gives the best $P_k$/$W_d$ trade-off among TeT variants, while TeT$_{\text{CLS}}$ tends to maximize $F_1$ by emitting more boundaries. DialSTART is competitive under the same 100-dialogue protocol, but it does not clearly dominate the TeT-based sources for producing training boundaries. Therefore, the stronger downstream results of CobSeg pseudo-label in [Table 2](https://arxiv.org/html/2605.30668#S3.T2) should be attributed primarily to the segmenter architecture rather than to unusually strong pseudo-label induction.
### 3.4 Label Budget Analysis
To measure label efficiency, the supervised model is trained with progressively smaller fractions of the original training set. [Table 4](https://arxiv.org/html/2605.30668#S3.T4) reports the sweep and uses the 100% supervised result as the reference point. VHF and DialSeg711 approach full-data performance quickly, while Doc2Dial, TIAGE, and SuperSeg improve more gradually and require broader supervision for stable boundary localization.
### 3.5 Ablation Study
To isolate the contribution of the main design choices, the ablation study keeps the training setting, optimization recipe, and backbone encoder fixed while removing one component at a time\. The four ablations remove the Lexical Boundary Detector, boundary informativeness weighting, topic coherence cue, and directional boundary heads on Doc2Dial and SuperSeg\.
Table 5: Component ablations on Doc2Dial and SuperSeg.

[Table 5](https://arxiv.org/html/2605.30668#S3.T5) shows that all components contribute to the final performance. Removing the topic coherence cue consistently worsens $P_k$ and $W_d$, suggesting that corpus-level statistical signals complement learned representations. The Lexical Boundary Detector and informativeness weighting bring dataset-dependent gains, with larger effects when topic shifts coincide with clear lexical changes near utterance boundaries. Collapsing the directional end/start heads into a single score also degrades $P_k$ and $W_d$, confirming the benefit of modeling topic conclusion and initiation separately.
### 3.6 Discussion
The gap between $P_k$/$W_d$ and $F_1$ on Doc2Dial and SuperSeg suggests a granularity-calibration trade-off (Coen, [2025](https://arxiv.org/html/2605.30668#bib.bib6)): CobSeg sharpens cut-level boundary localization but does not fully regulate the global boundary rate in long dialogues. Adaptive thresholding or corpus-level segment length priors may further improve this aspect.
Ablation results show that topic coherence provides the most consistent contribution, while the Lexical Boundary Detector, UBIW, and directional boundary heads contribute in dataset\-dependent ways\. This supports CobSeg’s core hypothesis that dialogue topic shifts benefit from combining local lexical transition evidence, directional end/start signals, and corpus\-level coherence cues, rather than relying only on utterance\-level semantic representations\.
The controlled comparison in[Section˜3\.3](https://arxiv.org/html/2605.30668#S3.SS3)identifies pseudo\-segment fidelity as the main bottleneck in the pseudo\-label setting\. Although CobSeg remains competitive among methods without test\-time LLM calls, low\-quality boundary induction limits its gains, suggesting iterative joint refinement of the inducer and segmenter as future work\. More broadly, CobSeg should be viewed as a compact trainable boundary model rather than a replacement for direct LLM reasoning: unlike prompt\-based LLM inference, it performs segmentation with a learned model and is most useful when efficiency, reproducibility, controllability, deployment cost, or privacy constraints matter\.
## 4 Conclusion
CobSeg models topic shifts through token\-level lexical cues, directional prediction heads, and informativeness weighting\. Under gold supervision, it reduces $P_k$ by 0\.7 points and $W_d$ by 0\.6 points on VHF, while reaching a $P_k$ of 1\.0 on DialSeg711\. With induced boundaries, the largest gain is a 14\.8\-point $P_k$ reduction on VHF; on DialSeg711 and TIAGE, $W_d$ drops by 0\.3 and 1\.8 points, respectively, outperforming all non‑LLM methods\.
The strongest gains appear on VHF, DialSeg711, and TIAGE, where topic shifts are often marked by local lexical signals\. The ablation study shows that the topic coherence cue provides the most consistent contribution, while lexical boundary modeling, informativeness weighting, and directional prediction further improve boundary localization in a dataset\-dependent manner\.
A gap remains between local cut placement and global calibration: CobSeg achieves lower $P_k$ and $W_d$ but trails some encoder baselines in boundary $F_1$ on Doc2Dial and SuperSeg\. This indicates that the model is effective at sharpening individual boundary decisions, but still requires better control of boundary rate across longer dialogues\. The pseudo\-label experiments further show that training with automatically induced boundaries can recover substantial performance when gold annotations are unavailable, but its effectiveness depends on pseudo\-segment quality\.
Overall, the results suggest that explicit boundary\-centered modeling remains valuable for dialogue topic segmentation, with improvements of up to 14\.8 points in $P_k$ and 15\.0 points in $W_d$ in the induced VHF setting, particularly when efficient, reproducible, and controllable inference without test\-time LLM calls is required\. Progress in NGB segmentation will require joint advances in segmenter design and pseudo\-label induction\.
## 5 Limitations
The current study has several limitations\. First, CobSeg still lags behind the best encoder baselines in boundary $F_1$ on several large or open\-domain corpora, consistent with the granularity\-sensitivity findings of Coen \([2025](https://arxiv.org/html/2605.30668#bib.bib6)\)\. While the model improves local boundary discrimination through directional fusion and coherence\-aware logits, its decoding stage does not explicitly calibrate the number of emitted boundaries against corpus\-level segment length distributions, which can lead to a mismatch between sharper cut\-level decisions and global boundary\-rate control; adaptive thresholding or corpus\-level boundary\-rate priors may help address this\. Second, the pseudo\-label setting depends on the quality of pseudo\-segment induction\. Since pseudo\-segments provide a less stable supervision signal than manual annotations, noisy boundary induction can limit the effectiveness of the learned segmenter, and jointly optimizing the pseudo\-segment inducer and the segmenter through iterative refinement is a natural direction for future work\. Third, our results should not be interpreted as showing that compact trainable segmenters universally outperform direct LLM inference\. Direct LLM methods and CobSeg operate under different inference regimes, with comparisons affected by model scale, prompting strategy, inference cost, and access to large general\-purpose models\. CobSeg instead targets a complementary setting where efficient, reproducible, and controllable segmentation without test\-time LLM calls is desirable\.
## Acknowledgement
This research was funded by ABC with funding grant number ABC by ABC\.
## References
- Artemiev et al\. \(2024\)Aleksei Artemiev, Daniil Parinov, Alexey Grishanov, Ivan Borisov, Alexey Vasilev, Daniil Muravetskii, Aleksey Rezvykh, Aleksei Goncharov, and Andrey Savchenko\. 2024\.Leveraging summarization for unsupervised dialogue topic segmentation\.In*Findings of the Association for Computational Linguistics: NAACL 2024*, pages 4697–4704\.
- Badjatiya et al\. \(2018\)Pinkesh Badjatiya, Litton J Kurisinkel, Manish Gupta, and Vasudeva Varma\. 2018\.Attention\-based neural text segmentation\.In*European Conference on Information Retrieval*, pages 180–193\. Springer\.
- Bahdanau et al\. \(2014\)Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio\. 2014\.Neural machine translation by jointly learning to align and translate\.*arXiv preprint arXiv:1409\.0473*\.
- Beeferman et al\. \(1999\)Doug Beeferman, Adam Berger, and John Lafferty\. 1999\.Statistical models for text segmentation\.*Machine learning*, 34\(1\):177–210\.
- Chung et al\. \(2024\)Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, and 1 others\. 2024\.Scaling instruction\-finetuned language models\.*Journal of Machine Learning Research*, 25\(70\):1–53\.
- Coen \(2025\)Michael H Coen\. 2025\.When f1 fails: Granularity\-aware evaluation for dialogue topic segmentation\.*arXiv preprint arXiv:2512\.17083*\.
- Das et al\. \(2024\)Sarkar Snigdha Sarathi Das, Chirag Shah, Mengting Wan, Jennifer Neville, Longqi Yang, Reid Andersen, Georg Buscher, and Tara Safavi\. 2024\.S3\-dst: Structured open\-domain dialogue segmentation and state tracking in the era of llms\.In*Findings of the Association for Computational Linguistics: ACL 2024*, pages 14996–15014\.
- Devlin et al\. \(2019\)Jacob Devlin, Ming\-Wei Chang, Kenton Lee, and Kristina Toutanova\. 2019\.Bert: Pre\-training of deep bidirectional transformers for language understanding\.In*Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 \(long and short papers\)*, pages 4171–4186\.
- Du et al\. \(2013\)Lan Du, Wray Buntine, and Mark Johnson\. 2013\.Topic segmentation with a structured topic model\.In*Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies*, pages 190–200\.
- Eisenstein and Barzilay \(2008\)Jacob Eisenstein and Regina Barzilay\. 2008\.Bayesian unsupervised topic segmentation\.In*Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing*, pages 334–343\.
- Fan et al\. \(2024\)Yaxin Fan, Feng Jiang, Peifeng Li, and Haizhou Li\. 2024\.Uncovering the potential of chatgpt for discourse analysis in dialogue: An empirical study\.In*Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\)*, pages 16998–17010\.
- Feng et al\. \(2020\)Song Feng, Hui Wan, Chulaka Gunasekara, Siva Patel, Sachindra Joshi, and Luis Lastras\. 2020\.doc2dial: A goal\-oriented document\-grounded dialogue dataset\.In*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pages 8118–8128\.
- Gao et al\. \(2023\)Haoyu Gao, Rui Wang, Ting\-En Lin, Yuchuan Wu, Min Yang, Fei Huang, and Yongbin Li\. 2023\.Unsupervised dialogue topic segmentation with topic\-aware contrastive learning\.In*SIGIR ’23*, pages 2481–2485, New York, NY, USA\. Association for Computing Machinery\.
- Gao et al\. \(2021\)Tianyu Gao, Xingcheng Yao, and Danqi Chen\. 2021\.Simcse: Simple contrastive learning of sentence embeddings\.In*Proceedings of the 2021 conference on empirical methods in natural language processing*, pages 6894–6910\.
- Glavaš et al\. \(2016\)Goran Glavaš, Federico Nanni, and Simone Paolo Ponzetto\. 2016\.Unsupervised text segmentation using semantic relatedness graphs\.In*Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics*, pages 125–130\.
- Gong et al\. \(2022\)Zheng Gong, Shiwei Tong, Han Wu, Qi Liu, Hanqing Tao, Wei Huang, and Runlong Yu\. 2022\.Tipster: A topic\-guided language model for topic\-aware text segmentation\.In*International Conference on Database Systems for Advanced Applications*, pages 213–221\. Springer\.
- Hearst \(1997\)Marti A Hearst\. 1997\.Text tiling: Segmenting text into multi\-paragraph subtopic passages\.*Computational linguistics*, 23\(1\):33–64\.
- Hou et al\. \(2024\)Xia Hou, Qifeng Li, and Tongliang Li\. 2024\.An unsupervised dialogue topic segmentation model based on utterance rewriting\.*arXiv preprint arXiv:2409\.07672*\.
- Hwang et al\. \(2024\)Yerin Hwang, Yongil Kim, Yunah Jang, Jeesoo Bang, Hyunkyung Bae, and Kyomin Jung\. 2024\.Mp2d: An automated topic shift dialogue generation framework leveraging knowledge graphs\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 17682–17702\.
- Inan et al\. \(2022\)Hakan Inan, Rashi Rungta, and Yashar Mehdad\. 2022\.Structured summarization: Unified text segmentation and segment labeling as a generation task\.*arXiv preprint arXiv:2209\.13759*\.
- Jiang et al\. \(2023\)Junfeng Jiang, Chengzhang Dong, Sadao Kurohashi, and Akiko Aizawa\. 2023\.Superdialseg: A large\-scale dataset for supervised dialogue segmentation\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 4086–4101\.
- Koshorek et al\. \(2018\)Omri Koshorek, Adir Cohen, Noam Mor, Michael Rotman, and Jonathan Berant\. 2018\.Text segmentation as a supervised learning task\.In*Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 \(Short Papers\)*, pages 469–473\.
- Lee et al\. \(2025\)Seungmin Lee, Yongsang Yoo, Minhwa Jung, and Min Song\. 2025\.Def\-dts: Deductive reasoning for open\-domain dialogue topic segmentation\.In*Findings of the Association for Computational Linguistics: ACL 2025*, pages 20736–20753\.
- Li et al\. \(2018\)Jing Li, Aixin Sun, and Shafiq R Joty\. 2018\.Segbot: A generic neural text segmentation model with pointer network\.In*IJCAI*, pages 4166–4172\.
- Lin et al\. \(2023\)Jiangyi Lin, Yaxin Fan, Xiaomin Chu, Peifeng Li, and Qiaoming Zhu\. 2023\.Multi\-granularity prompts for topic shift detection in dialogue\.In*International Conference on Intelligent Computing*, pages 511–522\. Springer\.
- Liu et al\. \(2019\)Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov\. 2019\.Roberta: A robustly optimized bert pretraining approach\.*arXiv preprint arXiv:1907\.11692*\.
- Liu et al\. \(2023\)Zhengyuan Liu, Siti Umairah Md Salleh, Hong Choon Oh, Pavitra Krishnaswamy, and Nancy Chen\. 2023\.Joint dialogue topic segmentation and categorization: A case study on clinical spoken conversations\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track*, pages 185–193\.
- Lo et al\. \(2021\)Kelvin Lo, Yuan Jin, Weicong Tan, Ming Liu, Lan Du, and Wray Buntine\. 2021\.Transformer over pre\-trained transformer for neural text segmentation with enhanced topic coherence\.In*Findings of the Association for Computational Linguistics: EMNLP 2021*, pages 3334–3340\.
- Lv et al\. \(2025\)Yantong Lv, Wanjie Tao, Qun Dai, Zulong Chen, Quan Lu, and Ning Jiang\. 2025\.Dynamic topic segmentation in dialogues: Enhancing boundaries with topic\-aware propagation\.In*Companion Proceedings of the ACM on Web Conference 2025*, pages 1172–1175\.
- Ma et al\. \(2024\)Xinbei Ma, Yi Xu, Hai Zhao, and Zhuosheng Zhang\. 2024\.Multi\-turn dialogue comprehension from a topic\-aware perspective\.*Neurocomputing*, 578:127385\.
- Nair et al\. \(2023\)Inderjeet Nair, Aparna Garimella, Balaji Vasan Srinivasan, Natwar Modani, Niyati Chhaya, Srikrishna Karanam, and Sumit Shekhar\. 2023\.A neural crf\-based hierarchical approach for linear text segmentation\.In*Findings of the Association for Computational Linguistics: EACL 2023*, pages 883–893\.
- Park et al\. \(2023\)Seongmin Park, Jinkyu Seo, and Jihwa Lee\. 2023\.Unsupervised dialogue topic segmentation in hyperdimensional space\.*arXiv preprint arXiv:2308\.10464*\.
- Pennington et al\. \(2014\)Jeffrey Pennington, Richard Socher, and Christopher D Manning\. 2014\.Glove: Global vectors for word representation\.In*Proceedings of the 2014 conference on empirical methods in natural language processing \(EMNLP\)*, pages 1532–1543\.
- Pevzner and Hearst \(2002\)Lev Pevzner and Marti A Hearst\. 2002\.A critique and improvement of an evaluation metric for text segmentation\.*Computational Linguistics*, 28\(1\):19–36\.
- Pu and Wang \(2023\)Hengfeng Pu and Liqing Wang\. 2023\.Dialogue segmentation based on dynamic context coherence\.In*Proceedings of the 2023 7th International Conference on Natural Language Processing and Information Retrieval*, pages 190–195\.
- Qi et al\. \(2021\)MengNan Qi, Hao Liu, YuZhuo Fu, and Ting Liu\. 2021\.Improving abstractive dialogue summarization with hierarchical pretraining and topic segment\.In*Findings of the Association for Computational Linguistics: EMNLP 2021*, pages 1121–1130\.
- Raffel et al\. \(2020\)Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu\. 2020\.Exploring the limits of transfer learning with a unified text\-to\-text transformer\.*Journal of machine learning research*, 21\(140\):1–67\.
- Schuster and Paliwal \(1997\)Mike Schuster and Kuldip K Paliwal\. 1997\.Bidirectional recurrent neural networks\.*IEEE transactions on Signal Processing*, 45\(11\):2673–2681\.
- Sevegnani et al\. \(2021\)Karin Sevegnani, David M Howcroft, Ioannis Konstas, and Verena Rieser\. 2021\.Otters: One\-turn topic transitions for open\-domain dialogue\.In*Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\)*, pages 2492–2504\.
- Solbiati et al\. \(2021\)Alessandro Solbiati, Kevin Heffernan, Georgios Damaskinos, Shivani Poddar, Shubham Modi, and Jacques Cali\. 2021\.Unsupervised topic segmentation of meetings with bert embeddings\.*arXiv preprint arXiv:2106\.12978*\.
- Somasundaran et al\. \(2020\)Swapna Somasundaran and 1 others\. 2020\.Two\-level transformer and auxiliary coherence modeling for improved text segmentation\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 34, pages 7797–7804\.
- Song et al\. \(2016\)Yiping Song, Lili Mou, Rui Yan, Li Yi, Zinan Zhu, Xiaohua Hu, and Ming Zhang\. 2016\.Dialogue session segmentation by embedding\-enhanced texttiling\.*arXiv preprint arXiv:1610\.03955*\.
- Sun et al\. \(2025\)Sijin Sun, Liangbin Zhao, Ming Deng, and Xiuju Fu\. 2025\.Dash: Dialogue\-aware similarity and handshake recognition for topic segmentation in public\-channel conversations\.*arXiv preprint arXiv:2512\.15042*\.
- Takanobu et al\. \(2018\)Ryuichi Takanobu, Minlie Huang, Zhongzhou Zhao, Feng\-Lin Li, Haiqing Chen, Xiaoyan Zhu, Liqiang Nie, and 1 others\. 2018\.A weakly supervised method for topic segmentation and labeling in goal\-oriented dialogues via reinforcement learning\.In*IJCAI*, pages 4403–4410\.
- Vaswani et al\. \(2017\)Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin\. 2017\.Attention is all you need\.*Advances in neural information processing systems*, 30\.
- Vijjini et al\. \(2023\)Anvesh Rao Vijjini, Hanieh Deilamsalehy, Franck Dernoncourt, and Snigdha Chaturvedi\. 2023\.Curricular next conversation prediction pretraining for transcript segmentation\.In*Findings of the Association for Computational Linguistics: EACL 2023*, pages 2597–2607\.
- Wang et al\. \(2020\)Weishi Wang, Steven CH Hoi, and Shafiq Joty\. 2020\.Response selection for multi\-party conversations with dynamic topic tracking\.In*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pages 6581–6591\.
- Wu et al\. \(2020\)Chien\-Sheng Wu, Steven CH Hoi, Richard Socher, and Caiming Xiong\. 2020\.Tod\-bert: Pre\-trained natural language understanding for task\-oriented dialogue\.In*Proceedings of the 2020 conference on empirical methods in natural language processing \(EMNLP\)*, pages 917–929\.
- Xia et al\. \(2022\)Jinxiong Xia, Cao Liu, Jiansong Chen, Yuchen Li, Fan Yang, Xunliang Cai, Guanglu Wan, and Houfeng Wang\. 2022\.Dialogue topic segmentation via parallel extraction network with neighbor smoothing\.In*Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval*, pages 2126–2131\.
- Xie et al\. \(2021\)Huiyuan Xie, Zhenghao Liu, Chenyan Xiong, Zhiyuan Liu, and Ann Copestake\. 2021\.Tiage: A benchmark for topic\-shift aware dialog modeling\.In*Findings of the Association for Computational Linguistics: EMNLP 2021*, pages 1684–1690\.
- Xing and Carenini \(2021\)Linzi Xing and Giuseppe Carenini\. 2021\.Improving unsupervised dialogue topic segmentation with utterance\-pair coherence scoring\.In*Proceedings of the 22nd annual meeting of the Special Interest Group on discourse and dialogue*, pages 167–177\.
- Xu et al\. \(2025\)Jiahui Xu, Feng Jiang, Anningzhe Gao, Luis Fernando D’Haro, and Haizhou Li\. 2025\.Unsupervised mutual learning of discourse parsing and topic segmentation in dialogue\.*IEEE Transactions on Audio, Speech and Language Processing*\.
- Xu et al\. \(2021a\)Jun Xu, Zeyang Lei, Haifeng Wang, Zheng\-Yu Niu, Hua Wu, and Wanxiang Che\. 2021a\.Discovering dialog structure graph for coherent dialog generation\.In*Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\)*, pages 1726–1739\.
- Xu et al\. \(2019\)Peng Xu, Hamidreza Saghir, Jin Sung Kang, Teng Long, Avishek Joey Bose, Yanshuai Cao, and Jackie Chi Kit Cheung\. 2019\.A cross\-domain transferable neural coherence model\.In*Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics*, pages 678–687\.
- Xu et al\. \(2021b\)Yi Xu, Hai Zhao, and Zhuosheng Zhang\. 2021b\.Topic\-aware multi\-turn dialogue modeling\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 35, pages 14176–14184\.
- Yang et al\. \(2022\)Chenxu Yang, Zheng Lin, Jiangnan Li, Fandong Meng, Weiping Wang, Lanrui Wang, and Jie Zhou\. 2022\.Take: Topic\-shift aware knowledge selection for dialogue generation\.In*Proceedings of the 29th International Conference on Computational Linguistics*, pages 253–265\.
- Yang et al\. \(2025\)Shihao Yang, Ziyi Zhang, Yue Jiang, Chunsheng Qin, and Shuhua Liu\. 2025\.A unified supervised and unsupervised dialogue topic segmentation framework based on utterance pair modeling\.In*Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pages 4898–4908\.
## Appendix A Related Work
Early segmentation methods identify topic boundaries mainly through lexical cohesion\. TextTiling locates topic shifts at valleys of lexical similarity \(Hearst, [1997](https://arxiv.org/html/2605.30668#bib.bib17)\), BayesSeg models segmentation as a Bayesian generative process \(Eisenstein and Barzilay, [2008](https://arxiv.org/html/2605.30668#bib.bib10)\), and GraphSeg constructs semantic relatedness graphs to infer topical structure \(Glavaš et al\., [2016](https://arxiv.org/html/2605.30668#bib.bib15)\)\. These approaches established lexical and semantic discontinuity as key signals for segmentation, but they were largely designed for monologic texts rather than dialogues, where topic shifts are often implicit, interaction\-driven, and context\-dependent\.
Supervised neural methods further advance dialogue segmentation by leveraging pretrained or task\-specific encoders for utterance representation\. Early work enhances TextTiling with GloVe embeddings \(Song et al\., [2016](https://arxiv.org/html/2605.30668#bib.bib42); Pennington et al\., [2014](https://arxiv.org/html/2605.30668#bib.bib33)\), while subsequent studies introduce attention mechanisms \(Badjatiya et al\., [2018](https://arxiv.org/html/2605.30668#bib.bib2)\), pointer networks \(Li et al\., [2018](https://arxiv.org/html/2605.30668#bib.bib24)\), and supervised hierarchical architectures such as TextSeg \(Koshorek et al\., [2018](https://arxiv.org/html/2605.30668#bib.bib22)\)\. With the rise of pretrained language models, BERT, RoBERTa, and TOD\-BERT provide stronger contextualized representations for segmentation \(Devlin et al\., [2019](https://arxiv.org/html/2605.30668#bib.bib8); Liu et al\., [2019](https://arxiv.org/html/2605.30668#bib.bib26); Wu et al\., [2020](https://arxiv.org/html/2605.30668#bib.bib48)\)\. Building on these encoders, later methods stack transformer layers for sequence modeling \(Lo et al\., [2021](https://arxiv.org/html/2605.30668#bib.bib28)\), incorporate two\-level transformer architectures with coherence modeling \(Somasundaran et al\., [2020](https://arxiv.org/html/2605.30668#bib.bib41)\), apply BERT to meeting segmentation \(Solbiati et al\., [2021](https://arxiv.org/html/2605.30668#bib.bib40)\), or integrate CRF layers for structured prediction \(Nair et al\., [2023](https://arxiv.org/html/2605.30668#bib.bib31)\)\. Although these methods differ in their encoding and decoding strategies, they typically operate at the sentence or utterance level by compressing each unit into a fixed vector representation\. This design may obscure fine\-grained boundary cues, since boundary\-adjacent tokens are averaged together with mid\-utterance content, potentially diluting lexical transition signals\.
Coherence\-based methods model topic continuity through utterance\-pair scoring\. Xing and Carenini \([2021](https://arxiv.org/html/2605.30668#bib.bib51)\) propose utterance\-pair coherence scoring, referred to as CSM\. Xu et al\. \([2019](https://arxiv.org/html/2605.30668#bib.bib54)\) develop cross\-domain coherence models\. Xia et al\. \([2022](https://arxiv.org/html/2605.30668#bib.bib49)\) use parallel extraction with neighbor smoothing\. Yang et al\. \([2025](https://arxiv.org/html/2605.30668#bib.bib57)\) unify supervised and unsupervised DTS through utterance\-pair modeling\. Xu et al\. \([2025](https://arxiv.org/html/2605.30668#bib.bib52)\) jointly learn discourse parsing and segmentation\. These methods focus on pairwise semantic coherence at the sentence level without explicit token transition modeling, which may miss lexical cues concentrated at utterance boundaries\.
Topic modeling methods derive signals from latent structure\. Du et al\. \([2013](https://arxiv.org/html/2605.30668#bib.bib9)\) apply structured topic models\. Gong et al\. \([2022](https://arxiv.org/html/2605.30668#bib.bib16)\) propose topic\-guided language models\. Liu et al\. \([2023](https://arxiv.org/html/2605.30668#bib.bib27)\) jointly optimize segmentation and topic categorization with shared representations\. Park et al\. \([2023](https://arxiv.org/html/2605.30668#bib.bib32)\) perform segmentation in hyperdimensional space\. Hou et al\. \([2024](https://arxiv.org/html/2605.30668#bib.bib18)\) use utterance rewriting to support topic detection\. Artemiev et al\. \([2024](https://arxiv.org/html/2605.30668#bib.bib1)\) leverage summarization as weak supervision\. Inan et al\. \([2022](https://arxiv.org/html/2605.30668#bib.bib20)\) frame segmentation as a generation task\.
Topic shift and LLM methods address DTS through transition modeling\. Sevegnani et al\. \([2021](https://arxiv.org/html/2605.30668#bib.bib39)\) study one\-turn transitions\. Yang et al\. \([2022](https://arxiv.org/html/2605.30668#bib.bib56)\) propose topic\-shift\-aware knowledge selection\. Lin et al\. \([2023](https://arxiv.org/html/2605.30668#bib.bib25)\) use multi\-granularity prompts\. Hwang et al\. \([2024](https://arxiv.org/html/2605.30668#bib.bib19)\) generate topic shift dialogues\. Takanobu et al\. \([2018](https://arxiv.org/html/2605.30668#bib.bib44)\) apply reinforcement learning\. Vijjini et al\. \([2023](https://arxiv.org/html/2605.30668#bib.bib46)\) use curricular pretraining\. Fan et al\. \([2024](https://arxiv.org/html/2605.30668#bib.bib11)\) explore ChatGPT for discourse analysis\. Recent state\-of\-the\-art methods include unified frameworks and LLM\-based reasoning\. Yang et al\. \([2025](https://arxiv.org/html/2605.30668#bib.bib57)\) unify supervised and unsupervised DTS through utterance\-pair relation modeling, achieving strong performance through sentence pairing\. Lee et al\. \([2025](https://arxiv.org/html/2605.30668#bib.bib23)\) cast DTS as deductive reasoning with LLMs, demonstrating effectiveness through structured inference\. Das et al\. \([2024](https://arxiv.org/html/2605.30668#bib.bib7)\) propose S3\-DST for structured dialogue segmentation\. Lv et al\. \([2025](https://arxiv.org/html/2605.30668#bib.bib29)\) enhance boundaries with topic\-aware propagation through dynamic mechanisms\. CobSeg complements these approaches by explicitly modeling token lexical transitions, learning boundary informativeness weighting to concentrate capacity on decisive positions, and integrating corpus statistical cues with learned coefficients rather than as fixed preprocessing\.
## Appendix B Additional Dataset Statistics
[Table 6](https://arxiv.org/html/2605.30668#A2.T6) summarizes the benchmark characteristics and provides additional distributional information used to contextualize the cross\-dataset comparisons\. The maximum sequence lengths in [Table 7](https://arxiv.org/html/2605.30668#A3.T7) \(48/48/64/32/64\) are manually selected to cover approximately the 90th percentile of dialogue turns for each dataset, ensuring that most dialogues fit within the context window without excessive padding\. The number of K\-Means clusters is set to $K=8$, following prior work that partitions dialogue utterances into a small number of intent categories via LLM\-based or clustering\-based methods \(Lee et al\., [2025](https://arxiv.org/html/2605.30668#bib.bib23); Das et al\., [2024](https://arxiv.org/html/2605.30668#bib.bib7); Ma et al\., [2024](https://arxiv.org/html/2605.30668#bib.bib30)\)\. The salient keyword budget is set to 220 terms per dataset, selected as the point of diminishing TF\-IDF cross\-cluster distinctiveness gain in a sweep over candidate values\.
Table 6: Dataset statistics computed from the benchmark files in the project repository\.
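The 90th\-percentile criterion for the maximum sequence length can be sketched as follows\. The nearest\-rank percentile rule, the function name, and the turn counts are assumptions for illustration; the paper states that the final values were chosen manually around this percentile, so a computed value would only serve as a starting point\.

```python
import math
from typing import List


def percentile_nearest_rank(values: List[int], q: int) -> int:
    """Nearest-rank percentile: smallest value with at least q% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up number of turns per dialogue for a small corpus.
turns_per_dialogue = [12, 18, 20, 22, 25, 27, 30, 31, 44, 90]

cap = percentile_nearest_rank(turns_per_dialogue, 90)
covered = sum(t <= cap for t in turns_per_dialogue) / len(turns_per_dialogue)
print(cap, covered)
```

In this toy corpus the cap covers nine of ten dialogues, while the single 90\-turn outlier would be truncated rather than forcing every batch to be padded to its length\.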
## Appendix C Implementation Details
The CobSeg architecture is optimized with the hyperparameter configuration in [Table 7](https://arxiv.org/html/2605.30668#A3.T7); the same settings are used for both supervised training and unsupervised pseudo\-label training unless stated otherwise\.
Table 7: Key hyperparameters and training configuration of CobSeg\.
## Appendix D Computational Cost
[Table 8](https://arxiv.org/html/2605.30668#A4.T8) compares the parameter count, inference latency, and GPU memory footprint against representative supervised baselines\. All measurements are obtained under the supervised setting with identical hardware and batch size\. Parameter counts are in millions; GPU memory is in MB\. CobSeg \(architecture\) refers to the modules proposed in this paper without any pretrained encoder\. The intermediate rows show incremental additions: the Lexical Boundary Detector, UBIW module, and Topic Structure Extractor\. CobSeg \(full\) adds two pretrained encoders \(RoBERTa\-base for Coherence Encoder / Lexical Boundary Detector and SimCSE for the Topic Structure Extractor\) to the complete CobSeg architecture\. The parameter overhead directly attributable to the proposed method is the gap between CobSeg \(architecture\) and a single\-encoder baseline, approximately 24M parameters\.
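Table 8: Computational cost comparison under the supervised setting\.

| Model | \#Param \(M\) | ms/dial | GPU Mem \(MB\) |
| --- | --- | --- | --- |
| RoBERTa | 124\.65 | 6\.8 | 511 |
| BERT | 109\.48 | 7\.1 | 454 |
| TOD\-BERT | 109\.48 | 7\.1 | 454 |
| T5 | 109\.63 | 9\.0 | 457 |
| CobSeg \(full\) | 272\.65 | 31\.4 | 1131 |
| CobSeg \(architecture\) | 23\.88 | 7\.6 | 236 |
| \+ Lexical Boundary Detector | 31\.13 | 7\.9 | 236 |
| \+ UBIW | 31\.20 | 8\.9 | 236 |
| \+ Topic Structure Extractor | 38\.52 | 9\.1 | 236 |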
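The parameter counts in Table 8 can be cross\-checked with a few lines of arithmetic\. The sketch below reads the rows as cumulative \(each "\+" row adds to the one above\) and assumes that the SimCSE encoder has the same size as the BERT row \(109\.48M\); that size is not stated in the text and is an assumption made here only because it makes the totals add up\.

```python
# Parameter counts in millions, taken from Table 8.
ROBERTA_BASE = 124.65
ARCHITECTURE = 23.88            # CobSeg (architecture)
WITH_LEXICAL_DETECTOR = 31.13   # + Lexical Boundary Detector
WITH_UBIW = 31.20               # + UBIW
WITH_TOPIC_EXTRACTOR = 38.52    # + Topic Structure Extractor (complete architecture)

# ASSUMPTION: SimCSE is taken to be BERT-base sized (the BERT row of Table 8).
SIMCSE_ASSUMED = 109.48

# Per-module increments implied by the cumulative rows.
increments = {
    "lexical_boundary_detector": round(WITH_LEXICAL_DETECTOR - ARCHITECTURE, 2),
    "ubiw": round(WITH_UBIW - WITH_LEXICAL_DETECTOR, 2),
    "topic_structure_extractor": round(WITH_TOPIC_EXTRACTOR - WITH_UBIW, 2),
}

full_total = round(ROBERTA_BASE + SIMCSE_ASSUMED + WITH_TOPIC_EXTRACTOR, 2)
print(increments)
print(full_total)
```

Under that assumption the sum reproduces the 272\.65M reported for CobSeg \(full\), and the increments suggest that UBIW adds very few parameters compared with the Lexical Boundary Detector and the Topic Structure Extractor\.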
## Appendix E Baseline Descriptions
Unless an official split is provided by the original benchmark release, dataset splits follow an approximate 7:1\.5:1\.5 train/validation/test ratio\. For TIAGE, the official split configuration released by the original authors is used directly\. For VHF, the split is derived from the official DASH dataset release\. For SuperSeg, the official split released with the benchmark is used\. For DialSeg711, the public dataset split is used\. For Doc2Dial, the original dataset is publicly available; the segmentation split is constructed following the topic annotations provided in the official release\. All experiments are conducted on these split settings\. Where baseline values are reproduced under the splits used in this work, this is noted in the results\.
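For corpora without an official split, an approximate 7:1\.5:1\.5 partition can be produced along the following lines\. The shuffling seed, the rounding rule, and the function name are assumptions for illustration; the paper does not describe how its splits were drawn beyond the ratio\.

```python
import random
from typing import List, Sequence, Tuple


def split_dialogues(
    dialogue_ids: Sequence[str],
    ratios: Tuple[float, float, float] = (7, 1.5, 1.5),
    seed: int = 0,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle dialogue ids and cut them into train/validation/test parts."""
    shuffled = list(dialogue_ids)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = round(len(shuffled) * ratios[0] / total)
    n_val = round(len(shuffled) * ratios[1] / total)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # remainder, so no dialogue is dropped
    return train, val, test


ids = [f"dlg-{i:03d}" for i in range(100)]   # hypothetical dialogue ids
train, val, test = split_dialogues(ids)
print(len(train), len(val), len(test))
```

Splitting at the dialogue level, rather than at the utterance level, keeps all utterances of one conversation in the same partition, which avoids leaking boundary context between training and evaluation\.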
##### Supervised Setting\.
CobSeg is compared with TextSeg \(Koshorek et al\., [2018](https://arxiv.org/html/2605.30668#bib.bib22)\), BERT \(Devlin et al\., [2019](https://arxiv.org/html/2605.30668#bib.bib8)\), RoBERTa \(Liu et al\., [2019](https://arxiv.org/html/2605.30668#bib.bib26)\), TOD\-BERT \(Wu et al\., [2020](https://arxiv.org/html/2605.30668#bib.bib48)\), T5 \(Raffel et al\., [2020](https://arxiv.org/html/2605.30668#bib.bib37)\), FLAN\-T5 \(Chung et al\., [2024](https://arxiv.org/html/2605.30668#bib.bib5)\), RetroTS\-T5 \(Xie et al\., [2021](https://arxiv.org/html/2605.30668#bib.bib50)\), and SUPRP \(Yang et al\., [2025](https://arxiv.org/html/2605.30668#bib.bib57)\)\. TextSeg uses hierarchical LSTMs with CRF decoding for structured boundary prediction \(Nair et al\., [2023](https://arxiv.org/html/2605.30668#bib.bib31)\)\. BERT, RoBERTa, and TOD\-BERT are pretrained transformer encoders fine\-tuned for boundary classification\. T5 and FLAN\-T5 are encoder\-decoder models adapted for segmentation as a sequence\-to\-sequence task\. RetroTS\-T5 extends T5 with retrospective topic shift detection\.
##### Unsupervised Setting\.
CobSeg is compared with Random, Even, BayesSeg \(Eisenstein and Barzilay, [2008](https://arxiv.org/html/2605.30668#bib.bib10)\), TeT \(Hearst, [1997](https://arxiv.org/html/2605.30668#bib.bib17)\), TeT$_{\text{Glove}}$ \(Song et al\., [2016](https://arxiv.org/html/2605.30668#bib.bib42); Pennington et al\., [2014](https://arxiv.org/html/2605.30668#bib.bib33)\), TeT$_{\text{CLS}}$, TeT$_{\text{NSP}}$, GraphSeg \(Glavaš et al\., [2016](https://arxiv.org/html/2605.30668#bib.bib15)\), GreedySeg \(Xu et al\., [2021b](https://arxiv.org/html/2605.30668#bib.bib55)\), CSM \(Xing and Carenini, [2021](https://arxiv.org/html/2605.30668#bib.bib51)\), T5 \(Raffel et al\., [2020](https://arxiv.org/html/2605.30668#bib.bib37)\), DialSTART \(Gao et al\., [2023](https://arxiv.org/html/2605.30668#bib.bib13)\), DynamicCOCO \(Pu and Wang, [2023](https://arxiv.org/html/2605.30668#bib.bib35)\), UMLF \(Xu et al\., [2025](https://arxiv.org/html/2605.30668#bib.bib52)\), SumSeg \(Artemiev et al\., [2024](https://arxiv.org/html/2605.30668#bib.bib1)\), DyDTS \(Lv et al\., [2025](https://arxiv.org/html/2605.30668#bib.bib29)\), UUPRP \(Yang et al\., [2025](https://arxiv.org/html/2605.30668#bib.bib57)\), Def\-DTS \(Lee et al\., [2025](https://arxiv.org/html/2605.30668#bib.bib23)\), and S3\-DST \(Das et al\., [2024](https://arxiv.org/html/2605.30668#bib.bib7)\)\. TeT$_{\text{Glove}}$ enhances TextTiling with GloVe embeddings \(Song et al\., [2016](https://arxiv.org/html/2605.30668#bib.bib42)\)\. TeT$_{\text{CLS}}$ and TeT$_{\text{NSP}}$ replace lexical similarity with BERT\-based sentence embeddings and next\-sentence prediction scores, respectively\. CSM models utterance\-pair coherence\. DialSTART learns topical representations through contrastive learning\. SumSeg uses summarization as weak supervision\. UMLF jointly learns discourse parsing and segmentation\. DyDTS applies dynamic topic\-aware propagation\. UUPRP unifies utterance\-pair relations\. Def\-DTS and S3\-DST use LLMs for structured reasoning at inference time\.
## Appendix F Qualitative Analysis
[Figure 3](https://arxiv.org/html/2605.30668#A6.F3) visualizes token\-level boundary attribution and utterance\-level informativeness for a VHF example\. Tokens adjacent to boundaries and high\-utility procedural phrases receive stronger emphasis than mid\-utterance filler content\. This matches the intended role of the Lexical Boundary Detector and boundary informativeness weighting\. [Figure 4](https://arxiv.org/html/2605.30668#A6.F4) presents a visualization of utterance embeddings with topic\-region assignments\. The topic coherence cue keywords are derived from the unsupervised topic partition itself; the projection is used only for visualization\. The topic regions, local density patterns, and region\-level keywords indicate that the topic partition captures both vocabulary that is cohesive within topics and lexical triggers that are sensitive to boundaries, providing the statistical basis for the topic coherence cue\.
Figure 3: Attribution at the token level and informativeness at the utterance level on a VHF example\. Lexical cues adjacent to boundaries receive stronger emphasis than less informative interior tokens\.

Figure 4: Visualization of utterance embeddings with topic\-region assignments\. The projected topic regions reveal both vocabulary cohesive within topics and lexical triggers sensitive to boundaries\. The topic coherence cue is derived from the topic partition, not from the visualization projection\.
## Appendix G Unsupervised Pseudo\-Segment Induction
The pseudo\-label pipeline produces training boundaries through three stages\. Stage 1 \(NSP\+TeT initialization\): provisional segment boundaries are obtained by running NSP\-based coherence scoring and TextTiling peak picking on each training dialogue\. The threshold coefficient $\alpha$ is selected on the validation split; the full training split is then segmented and 100 training dialogues are sampled\. Stage 2 \(segment label\-summary specification\): the LLM receives all provisional segments of a sampled dialogue and outputs one pseudo label and one pseudo summary for each segment\. Stage 3 \(dialogue reconstruction\): target segment lengths are drawn from the empirical training segment\-length distribution, and the LLM receives the ordered pseudo label\-summary records with these target lengths to generate an entirely new dialogue in one call\. This length\-controlled reconstruction keeps the synthetic training set close to the original corpus in its structural statistics, so the pseudo data changes the lexical realization while preserving the segment\-length distribution\. Because the segment lengths of the reconstructed dialogue are fixed and validated, the resulting boundaries are known by construction and serve as pseudo\-boundary labels\. The pipeline does not expose the reconstruction LLM to original dialogue text\. LLM configuration and pipeline parameters are listed in [Table 7](https://arxiv.org/html/2605.30668#A3.T7)\.
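The peak-picking step of Stage 1 can be illustrated with a minimal Python sketch. It assumes the per-gap coherence scores (for example, NSP probabilities between adjacent utterances) are already computed, and it assumes the common TextTiling convention of cutting where the depth score exceeds `mean - alpha * std`. The exact thresholding rule used by the authors is not stated in this section, and the function names `depth_scores` and `pick_boundaries` are hypothetical.

```python
from statistics import mean, pstdev


def depth_scores(coherence):
    """Depth of the coherence valley at each utterance gap (TextTiling-style).

    coherence[i] is the score between utterance i and utterance i + 1.
    """
    depths = []
    for i, score in enumerate(coherence):
        # Climb left while scores do not drop, to find the nearest left peak.
        left = score
        for j in range(i - 1, -1, -1):
            if coherence[j] < left:
                break
            left = coherence[j]
        # Climb right in the same way.
        right = score
        for j in range(i + 1, len(coherence)):
            if coherence[j] < right:
                break
            right = coherence[j]
        depths.append((left - score) + (right - score))
    return depths


def pick_boundaries(coherence, alpha):
    """Return gap indices whose depth exceeds mean - alpha * std (assumed rule)."""
    if not coherence:
        return []
    depths = depth_scores(coherence)
    threshold = mean(depths) - alpha * pstdev(depths)
    return [i for i, d in enumerate(depths) if d > threshold]


# Hypothetical coherence scores for a dialogue with eight utterances (seven gaps).
scores = [0.9, 0.8, 0.2, 0.85, 0.9, 0.3, 0.95]
boundaries = pick_boundaries(scores, alpha=0.0)
```

Under this convention a larger `alpha` lowers the threshold and yields more provisional boundaries, which is why the coefficient is tuned on the validation split before the training split is cut.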
The constrained input\-output schema of the two LLM\-assisted stages is described below rather than reproducing the raw prompts verbatim in the paper body\.
### G\.1 Segment Label\-Summary Specification
The label\-summary stage receives all provisional segments of a dialogue and outputs one pseudo label and one pseudo summary for each segment\. The LLM processes the entire provisional dialogue structure at once, preserving segment order and keeping the number of segments unchanged\.
Listing 1: Prompt for segment\-level pseudo label\-summary induction\.

    [System]
    You label each dialogue segment with a pseudo
    label and a pseudo summary. Return JSON only.
    Do not explain. Return exactly one JSON object
    with a pseudo_seg key.
    -------------------------------------------------
    [User]
    Return JSON with exactly one key: pseudo_seg.
    pseudo_seg must be an array of objects, one per
    dialogue segment. Each object must have exactly two
    string keys: pseudo_label and pseudo_summary.
    The output must preserve the input segment order.
    The output must contain exactly the same number
    of segments as the input.
    Each output item must correspond to one input
    segment and must not merge or split segments.

    Input JSON:
    {
      "pseudo_seg": [
        {
          "pseudo_seg_index": 0,
          "pseudo_seg_len": 4,
          "utterances": ["...", "...", "...", "..."]
        }
      ]
    }
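The output constraints stated in the label-summary prompt can be checked mechanically before the records are passed on. The following Python sketch shows one way to do so. The key names come from the prompt, while the function name `validate_label_summary` and the choice to raise `ValueError` on any violation are assumptions made for illustration, not the authors' implementation.

```python
import json


def validate_label_summary(raw_output, num_input_segments):
    """Parse the LLM reply and enforce the schema stated in the prompt.

    Returns the list of label-summary records, or raises ValueError.
    """
    data = json.loads(raw_output)  # json.JSONDecodeError is a ValueError
    if not isinstance(data, dict) or set(data) != {"pseudo_seg"}:
        raise ValueError("expected exactly one key: pseudo_seg")
    records = data["pseudo_seg"]
    if not isinstance(records, list) or len(records) != num_input_segments:
        raise ValueError("output must contain one record per input segment")
    for record in records:
        if not isinstance(record, dict):
            raise ValueError("each record must be a JSON object")
        if set(record) != {"pseudo_label", "pseudo_summary"}:
            raise ValueError("each record needs exactly two keys")
        if not all(isinstance(value, str) for value in record.values()):
            raise ValueError("pseudo_label and pseudo_summary must be strings")
    return records


# Hypothetical reply for a dialogue with one provisional segment.
reply = '{"pseudo_seg": [{"pseudo_label": "x", "pseudo_summary": "y"}]}'
records = validate_label_summary(reply, num_input_segments=1)
```

A count check alone cannot detect that the model merged two segments and split another, so order preservation still rests on the prompt itself. A reply that fails validation would typically be regenerated.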
### G\.2 Dialogue Reconstruction from Segment Specifications
The reconstruction stage generates a complete dialogue from segment metadata alone\. The LLM receives only the ordered pseudo label\-summary records and target segment lengths sampled from the empirical training distribution, without any original dialogue text\. The output contains generated utterances and segment lengths, which are known by construction\. This length control is a distribution\-matching constraint: each generated segment is forced to have the requested number of utterances so that the reconstructed corpus preserves the original dataset’s segment\-length characteristics\.
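As a rough illustration of the length control described above, the Python sketch below draws target lengths from a list of observed segment lengths and assembles the reconstruction input. The field names follow the input JSON of the reconstruction prompt. Uniform sampling with replacement from the observed lengths is an assumed reading of "sampled from the empirical training distribution", and both function names are hypothetical.

```python
import random


def sample_target_lengths(observed_lengths, num_segments, rng):
    """Draw one target length per segment from the observed lengths.

    Sampling with replacement from the raw list reproduces the empirical
    distribution in expectation (an assumed sampling scheme).
    """
    return [rng.choice(observed_lengths) for _ in range(num_segments)]


def build_reconstruction_input(records, target_lengths):
    """Combine label-summary records with target lengths into the input JSON."""
    if len(records) != len(target_lengths):
        raise ValueError("need exactly one target length per segment")
    return {
        "pseudo_seg_count": len(records),
        "target_pseudo_seg": list(target_lengths),
        "total_target_pseudo_utterance": sum(target_lengths),
        "pseudo_seg_specs": [
            {
                "pseudo_seg_index": index,
                "pseudo_label": record["pseudo_label"],
                "pseudo_summary": record["pseudo_summary"],
                "pseudo_seg_len": length,
            }
            for index, (record, length) in enumerate(zip(records, target_lengths))
        ],
    }


# Hypothetical observed segment lengths and label-summary records.
observed = [3, 4, 4, 6, 5]
records = [
    {"pseudo_label": "navigation coordination",
     "pseudo_summary": "two vessels coordinate a safe passing plan"},
    {"pseudo_label": "x", "pseudo_summary": "y"},
]
lengths = sample_target_lengths(observed, len(records), random.Random(0))
payload = build_reconstruction_input(records, lengths)
```

Only labels, summaries, and lengths enter the payload, which mirrors the statement that the reconstruction LLM never sees original dialogue text.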
Listing 2: Prompt for dialogue reconstruction from segment metadata\.

    [System]
    You generate a full coherent dialogue from
    segment specifications. Return JSON only.
    Do not explain. Return exactly two keys:
    pseudo_utterance and pseudo_seg.
    -------------------------------------------------
    [User]
    Generate one continuous dialogue.
    Return JSON with exactly two keys:
    pseudo_utterance and pseudo_seg.
    pseudo_utterance must be a JSON array of
    plain strings.
    Do not prefix utterances with speaker
    identifiers, ship names, role tags, Speaker A/B,
    speaker:, or speaker labels.
    pseudo_seg must be a JSON array of integers, and
    each integer is a segment LENGTH, not a segment
    index.
    pseudo_seg must list the utterance counts for
    each segment in order.
    Each integer must match the requested target
    segment length.
    The number of segments must be exactly K.
    The pseudo_seg array must be exactly
    [l_1, ..., l_K].
    The total number of utterances must be exactly
    sum([l_1, ..., l_K]).
    The pseudo_utterance array length must equal the
    sum of the pseudo_seg array.

    Input JSON:
    {
      "pseudo_seg_count": K,
      "target_pseudo_seg": [l_1, ..., l_K],
      "total_target_pseudo_utterance": sum([l_1, ..., l_K]),
      "pseudo_seg_specs": [
        {
          "pseudo_seg_index": 0,
          "pseudo_label": "navigation coordination",
          "pseudo_summary": "two vessels coordinate a safe passing plan",
          "pseudo_seg_len": 4
        }
      ]
    }
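Because the reconstructed segment lengths are fixed, boundary labels follow directly once the reply passes the length checks stated in the prompt. The Python sketch below validates a parsed reply and converts segment lengths into per-utterance labels. The labeling convention (1 on the last utterance of every segment except the final one) is an assumption, since this section does not specify the label encoding, and both function names are hypothetical.

```python
from itertools import accumulate


def validate_reconstruction(output, target_lengths):
    """Check a parsed reply against the constraints stated in the prompt."""
    if not isinstance(output, dict) or set(output) != {"pseudo_utterance", "pseudo_seg"}:
        raise ValueError("expected exactly two keys")
    utterances = output["pseudo_utterance"]
    if not isinstance(utterances, list) or not all(isinstance(u, str) for u in utterances):
        raise ValueError("pseudo_utterance must be an array of strings")
    if output["pseudo_seg"] != list(target_lengths):
        raise ValueError("pseudo_seg must equal the requested target lengths")
    if len(utterances) != sum(target_lengths):
        raise ValueError("utterance count must equal the sum of segment lengths")
    return utterances


def lengths_to_boundaries(segment_lengths):
    """Mark the last utterance of each non-final segment with 1 (assumed encoding)."""
    labels = [0] * sum(segment_lengths)
    for end in list(accumulate(segment_lengths))[:-1]:
        labels[end - 1] = 1
    return labels


# Hypothetical reply for target lengths [2, 3].
reply = {"pseudo_utterance": ["a", "b", "c", "d", "e"], "pseudo_seg": [2, 3]}
utterances = validate_reconstruction(reply, [2, 3])
labels = lengths_to_boundaries(reply["pseudo_seg"])
```

A reply that passes these checks still carries no guarantee that the generated text actually changes topic at the labeled positions; the checks only confirm the structural constraints.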
The constrained prompts ensure that the LLM produces dialogues with exact segment boundaries\. The segment count comes from NSP\+TeT initialization, while target segment lengths are sampled from the training split segment\-length distribution to keep the reconstructed corpus statistically aligned with the source dataset\. The reconstruction LLM sees only topic\-level metadata and sampled lengths\.