Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

arXiv cs.AI Papers

Summary

This paper investigates a harmful phenomenon in long chain-of-thought (CoT) training traces where post-conclusion continuation reduces training utility, and proposes a diagnostic method called HarmfulContinuationCut (HCC) to detect such harmful continuations.

arXiv:2605.29288v1 Announce Type: new Abstract: Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead to markedly different fine-tuning outcomes. We study post-conclusion continuation in answer-correct long-CoT data: a continuation where the answer appears sufficiently supported, but the trace continues with additional reasoning that remains in the supervised target. To test its training effect, we use a delete-only editor to construct answer-preserving suffix removal and compare CoT-based SFT on the original and processed traces. We observe improved SFT outcomes after removing the editor-identified post-conclusion continuation, suggesting that this continuation is harmful to training in our setting. We therefore refer to this empirically supported phenomenon as harmful continuation. Beyond this intervention, we further characterize the removed post-conclusion continuation through uncertainty and hidden-state progress. We observe persistent local uncertainty together with weakened terminal-directional progress, forming an uncertainty--geometry mismatch. Finally, we instantiate Harmful Continuation Cut (HCC), a lightweight boundary proxy that approximates the editor-identified post-conclusion continuation boundary.

Cached at: 05/29/26, 09:16 AM

# Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces
Source: [https://arxiv.org/html/2605.29288](https://arxiv.org/html/2605.29288)
Chen He¹\*, Yuhao Wu²\*, Lei Wang³, Wenxuan Zhang², Fumin Shen¹†

¹University of Electronic Science and Technology of China, ²Singapore University of Technology and Design, ³Singapore Management University

\*Equal contribution. †Corresponding author.

###### Abstract

Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead to markedly different fine-tuning outcomes. We study post-conclusion continuation in answer-correct long-CoT data: a continuation where the answer appears sufficiently supported, but the trace continues with additional reasoning that remains in the supervised target. To test its training effect, we use a delete-only editor to construct answer-preserving suffix removal and compare CoT-based SFT on the original and processed traces. We observe improved SFT outcomes after removing the editor-identified post-conclusion continuation, suggesting that this continuation is harmful to training in our setting. We therefore refer to this empirically supported phenomenon as harmful continuation. Beyond this intervention, we further characterize the removed post-conclusion continuation through uncertainty and hidden-state progress. We observe persistent local uncertainty together with weakened terminal-directional progress, forming an uncertainty–geometry mismatch. Finally, we instantiate Harmful Continuation Cut (HCC), a lightweight boundary proxy that approximates the editor-identified post-conclusion continuation boundary.

## 1Introduction

Long chain-of-thought traces have become important training targets for reasoning models (Wei et al., [2022](https://arxiv.org/html/2605.29288#bib.bib1); Luo et al., [2025b](https://arxiv.org/html/2605.29288#bib.bib28); Ou and Yin, [2025](https://arxiv.org/html/2605.29288#bib.bib29)). They are used not only in supervised fine-tuning (Ou and Yin, [2025](https://arxiv.org/html/2605.29288#bib.bib29)), but also in reasoning-oriented continued training and as cold-start data before reinforcement learning (Wang et al., [2026](https://arxiv.org/html/2605.29288#bib.bib30)). Unlike final-answer annotations, long-CoT traces expose full reasoning trajectories that models are encouraged to imitate during training. This makes CoT trace quality a central issue for reasoning training, where the training target specifies not only what answer to produce, but also what trajectory should be treated as learnable reasoning behavior.

Prior work has already shown that answer-correct CoT traces can differ substantially in training utility. The style and structure of source trajectories can affect the generalization of SFT models (Tian et al., [2025](https://arxiv.org/html/2605.29288#bib.bib4); Zhang et al., [2025](https://arxiv.org/html/2605.29288#bib.bib5); Li et al., [2025](https://arxiv.org/html/2605.29288#bib.bib8)), and recent studies further connect trace compatibility, reasoning patterns, and learnability to downstream outcomes (Liu et al., [2026](https://arxiv.org/html/2605.29288#bib.bib7); Li et al., [2026](https://arxiv.org/html/2605.29288#bib.bib6)). However, most existing methods remain confined to trace selection, prefix selection, or externally guided rewriting, which leaves the internal failure mode of answer-correct traces under-explained. As a result, they do not characterize where useful reasoning may end and low-value continuation may begin, or why such a phase can be associated with weaker SFT despite preserving answer correctness.

To address this gap, we take a diagnostic view of answer\-correct long\-CoT traces\. We seek a trace\-internal diagnostic explanation for why answer\-correct traces may differ in training utility, rather than assuming that a long reasoning trace is uniformly useful once its final answer is correct\. Our goal is not to claim that every long tail is harmful, nor to treat length as the central issue\. Instead, we ask whether some traces enter a low\-value post\-conclusion continuation: the answer is already sufficiently supported, but subsequent reasoning remains locally costly while showing weak hidden\-state progress\. From the uncertainty perspective, we observe that some post\-conclusion continuation remains locally costly or unstable, suggesting that the trace continues to explore after evaluator\-based answer support has largely saturated\. From the geometric perspective, this continued exploration shows weakened terminal\-directional hidden\-state progress\. We refer to this hypothesized low\-value phase as post\-conclusion continuation before evaluating its downstream training effect\. When answer\-preserving removal of this continuation improves SFT outcomes, we call the empirically supported training\-unfavorable case harmful continuation\.

To examine the training relevance of this post-conclusion continuation, we use a delete-only editor as an operational intervention tool. The editor does not rewrite the trace; it only removes post-conclusion suffixes while preserving the original prefix and final answer. This allows us to test whether answer-preserving removal of the post-conclusion continuation improves SFT outcomes. Motivated by this diagnosis, we instantiate Harmful Continuation Cut (HCC), a lightweight boundary proxy. HCC uses a frozen Qwen2.5-0.5B-Instruct backbone with a cut head to extract sentence-level reasoning states and approximate the editor-identified post-conclusion continuation boundary.

Our contributions can be summarized as follows:

- We formulate post-conclusion continuation in answer-correct long-CoT traces without assuming inherent harmfulness.
- We show through answer-preserving suffix removal that the removed continuation is training-unfavorable in our SFT settings.
- We characterize the removed continuation with an uncertainty–geometry mismatch and propose HCC as a proxy to remove it.

## 2Related Work

Long-CoT reasoning training. Long-CoT traces have become important for post-training pipelines. While early pipelines often treat answer-correct traces as directly usable supervision, recent studies show that correctness alone does not determine their training value. Work on data selection, trace compatibility, and informative alignment (Yang et al., [2026](https://arxiv.org/html/2605.29288#bib.bib15); Zhang et al., [2025](https://arxiv.org/html/2605.29288#bib.bib5); Chandra et al., [2025](https://arxiv.org/html/2605.29288#bib.bib16)) suggests that only a subset of answer-correct traces provides beneficial supervision. Other approaches modify or shorten reasoning traces through sequence truncation, prefix optimization, adaptive prefix alignment, robustness to partial reasoning, or length-aware training (Chen et al., [2025a](https://arxiv.org/html/2605.29288#bib.bib14); Sun et al., [2026](https://arxiv.org/html/2605.29288#bib.bib17); Liu et al., [2026](https://arxiv.org/html/2605.29288#bib.bib7); Silvestri and Cetin, [2026](https://arxiv.org/html/2605.29288#bib.bib2); Xu et al., [2025](https://arxiv.org/html/2605.29288#bib.bib18); Luo et al., [2025a](https://arxiv.org/html/2605.29288#bib.bib19); Ma et al., [2025](https://arxiv.org/html/2605.29288#bib.bib20)). However, these methods mainly operate in a heuristic manner. They do not directly characterize where useful reasoning may end, where useless reasoning may begin, or why this reasoning can be associated with weaker SFT supervision despite answer correctness.

Properties of long-CoT trajectories. Recent work increasingly treats long-CoT traces as structured reasoning trajectories rather than flat text sequences. Studies on overthinking show that long-reasoning models may repeatedly verify intermediate conclusions or continue reasoning without meaningful gain (Chen et al., [2025b](https://arxiv.org/html/2605.29288#bib.bib21)). Complementary analyses examine global reasoning patterns, trajectory geometry, and step-level anchors that contribute to actual progress (Jiang et al., [2025](https://arxiv.org/html/2605.29288#bib.bib22); Ballon et al., [2026](https://arxiv.org/html/2605.29288#bib.bib24); [Bogdan et al.](https://arxiv.org/html/2605.29288#bib.bib23); Yang et al., [2025](https://arxiv.org/html/2605.29288#bib.bib25)). Most closely related to our motivation, Li et al. ([2026](https://arxiv.org/html/2605.29288#bib.bib6)) connect trajectory properties to downstream SFT outcomes and show that different reasoning patterns can lead to different generalization behavior. Our work builds on this trajectory-level view, but focuses on a post-conclusion continuation inside answer-correct traces: continuation that remains uncertain or costly while showing weakened terminal-directional progress.

## 3Operational Partition and Diagnostics of Post\-Conclusion Continuation

### 3\.1Data Construction

In this section, we do not assume that editor-removed sentences are ground-truth harmful continuation. Instead, we use the delete-only editor to construct an operational partition for diagnosing post-conclusion suffixes. The resulting groups are used to reveal statistical signatures of a possible low-value phase, rather than to define harmfulness by editor decisions alone. We use Qwen3-235B-A22B-Instruct-2507 (Team, [2025](https://arxiv.org/html/2605.29288#bib.bib13)) and DeepSeek-R1-V3.2 (Guo et al., [2025](https://arxiv.org/html/2605.29288#bib.bib11)) to generate trajectories, and sample 4,780 answer-correct long-CoT solution trajectories from the OpenR1-Math-220k dataset. These trajectories serve as the original CoT training traces. For simplicity, we use $\{T\}_Q$ and $\{T\}_R$ to refer to the two sets of trajectories from the two models, respectively. We then employ Qwen3.5-27B (Team, [2025](https://arxiv.org/html/2605.29288#bib.bib13)) as a delete-only offline editor to expose post-conclusion continuation for empirical analysis. Given a trajectory from $\{T\}_Q$ or $\{T\}_R$, the editor marks the post-conclusion sentences that can be removed while preserving the reasoning necessary to recover the final answer.

![Refer to caption](https://arxiv.org/html/2605.29288v1/x1.png)

Figure 1: Evaluator-based uncertainty diagnostics as reasoning segments are progressively added for (a) retained reasoning and (b) editor-removed continuation.

#### Operational groups.

We divide each edited trajectory into two operational groups. The first group is the *retained reasoning*, namely the editor-preserved portion that supports the final answer. The second group is the *editor-removed continuation*, namely the post-conclusion continuation marked as removable by the offline editor. At this stage, these terms refer to operational groups rather than predefined theoretical labels.

### 3\.2Uncertainty View

In this section, we aim to answer the following question: Does the post-conclusion continuation continue to improve evaluator-based final-answer recoverability, or does answer support appear to saturate while local uncertainty remains high?

Comparison protocols. We analyze uncertainty at both answer and sentence levels. At the answer level, we progressively append reasoning sentences along the same complete response trajectory and compute prefix-conditioned final-answer entropy and NLL. These quantities should be interpreted as evaluator-based diagnostics of answer recoverability, rather than direct measurements of causal reasoning contribution. For segment-wise visualization, positions are normalized separately within retained reasoning and the subsequent editor-removed continuation. At the sentence level, we use sentence entropy and sentence NLL to measure local predictive difficulty. For boundary-level analysis, we track $K_1$, $K_T$, $C_1$, and $C_T$, denoting the first and last sentences of retained reasoning ($K$) and of editor-removed continuation ($C$), and compare both local uncertainty changes and answer-NLL reduction. The detailed protocols are described in the Appendix.
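The sentence-level quantities above can be sketched as follows. This is a minimal illustration, assuming the evaluator exposes a next-token probability distribution at every position and that sentence scores are token means; the helper names and the toy distributions are hypothetical, and the exact protocol is the one given in the Appendix.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sentence_stats(dists, target_ids):
    """Mean entropy and mean NLL over the tokens of one sentence.

    dists[i] is the evaluator's distribution at position i (assumed given);
    target_ids[i] is the index of the token actually observed there.
    Averaging over tokens is an assumption of this sketch.
    """
    n = len(target_ids)
    entropy = sum(token_entropy(d) for d in dists) / n
    nll = -sum(math.log(d[t]) for d, t in zip(dists, target_ids)) / n
    return entropy, nll

# Toy sentence: one confident position and one uniform (maximally uncertain) position.
dists = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
entropy, nll = sentence_stats(dists, target_ids=[0, 2])
```

The answer-level diagnostics follow the same arithmetic, applied to the final-answer tokens conditioned on progressively longer reasoning prefixes.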

![Refer to caption](https://arxiv.org/html/2605.29288v1/x2.png)

![Refer to caption](https://arxiv.org/html/2605.29288v1/x3.png)

Figure 2: Boundary-level diagnostic changes around the editor-identified post-conclusion continuation: (a) sentence entropy, (b) entropy change, (c) sentence NLL, and (d) answer-NLL reduction change.

![Refer to caption](https://arxiv.org/html/2605.29288v1/x4.png)

Figure 3: Operational hidden-state progress of retained reasoning and editor-removed continuation: (a) ECDF of token-normalized hidden displacement and (b) hidden displacement versus forward progress.

Answer uncertainty dynamics. Figure [1](https://arxiv.org/html/2605.29288#S3.F1) shows how answer-level uncertainty changes as reasoning sentences are progressively appended to the same complete response. For visualization, the x-axis is normalized within its corresponding segment, with Figure [1](https://arxiv.org/html/2605.29288#S3.F1)(a) showing retained reasoning and Figure [1](https://arxiv.org/html/2605.29288#S3.F1)(b) showing the subsequent editor-removed continuation. In retained reasoning, answer entropy changes non-monotonically but does not exhibit a persistent increase, while answer NLL steadily decreases as more useful reasoning is added. This suggests that the retained segment improves evaluator-based final-answer recoverability even when intermediate reasoning involves local exploration or verification. In contrast, once the trace enters editor-removed continuation, both answer entropy and answer NLL increase as more post-conclusion content is appended. This suggests that the continuation does not consistently improve evaluator-based answer recoverability, but instead introduces a higher-uncertainty state after the answer has been sufficiently supported.

Table 1: Paired per-sample comparison of operational hidden-state progress between editor-removed continuation and retained reasoning. The paired difference is defined as $\Delta = \text{Removed mean} - \text{Retained mean}$.

Boundary-level mismatch. Figure [2](https://arxiv.org/html/2605.29288#S3.F2) examines local uncertainty and evaluator-based answer-support changes at the boundary between retained reasoning and editor-removed continuation. From $K_1$ to $K_T$, sentence entropy and sentence NLL increase, but the answer-NLL reduction also becomes stronger, suggesting that local uncertainty within retained reasoning can still accompany improved answer recoverability under the evaluator. The transition from $K_T$ to $C_1$ shows a different pattern. Local uncertainty rises at the beginning of editor-removed continuation, while the gain in answer support no longer increases correspondingly. From $C_1$ to $C_T$, this high-uncertainty regime is maintained or amplified, but the continuation does not provide stable additional answer-NLL reduction. The candidate low-value pattern therefore emerges when increased local prediction difficulty is no longer matched by consistent improvements in evaluator-based final-answer recoverability.

### 3\.3Geometric View

Following the uncertainty analysis, another question arises:Does the increased predictive uncertainty of post\-conclusion continuation translate into effective hidden\-state progress?

Comparison protocols. Following prior work on the geometry of Transformer hidden representations (Valeriani et al., [2023](https://arxiv.org/html/2605.29288#bib.bib31); Gurnee and Tegmark, [2024](https://arxiv.org/html/2605.29288#bib.bib32)) and trajectory-level analyses of long-CoT reasoning (Jiang et al., [2025](https://arxiv.org/html/2605.29288#bib.bib22)), we use sentence-boundary hidden states as an operational proxy for reasoning-state evolution. Specifically, hidden displacement measures the magnitude of representation change between consecutive reasoning steps, while forward progress measures the component of this change aligned with the terminal direction of the analyzed trace. These metrics characterize representation-level state movement and terminal-directional progress under an operational proxy. Because the terminal direction is derived from the observed trace representation, forward progress should be interpreted as an operational terminal-directional proxy rather than a ground-truth answer direction. We further compute progress efficiency, defined as the ratio between forward progress and hidden displacement, to measure how effectively local state movement is converted into directional progress. To control for sentence length, we also report token-normalized variants of hidden displacement and forward progress. Curvature is included as an auxiliary diagnostic of directional change, while displacement, forward progress, and efficiency serve as the primary geometric indicators. The detailed protocols are described in the Appendix.
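The three primary indicators can be illustrated with a small sketch. It assumes the terminal direction is the unit vector from the first to the last sentence-boundary state, which is one plausible reading of "terminal direction of the analyzed trace" and not necessarily the Appendix definition; the function names and toy states are hypothetical.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def geometry_metrics(states, eps=1e-12):
    """Per-step (displacement, forward progress, efficiency).

    states: sentence-boundary hidden states h_0..h_T as lists of floats.
    The terminal direction (h_T - h_0, normalized) is an assumption here.
    """
    terminal = sub(states[-1], states[0])
    scale = norm(terminal) + eps
    unit = [x / scale for x in terminal]
    rows = []
    for prev, cur in zip(states, states[1:]):
        step = sub(cur, prev)
        displacement = norm(step)                    # magnitude of state change
        forward = dot(step, unit)                    # component along terminal direction
        efficiency = forward / (displacement + eps)  # forward progress per unit movement
        rows.append((displacement, forward, efficiency))
    return rows

# Toy 2-D trajectory: a direct step, a sideways step, then a partial correction.
states = [[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [6.0, 0.0]]
rows = geometry_metrics(states)
```

In this toy trajectory the second step moves the state by 4 units but contributes no forward progress, which is the kind of movement that lowers progress efficiency. Token-normalized variants would divide displacement and forward progress by the sentence's token count.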

Distributional geometric tendency. Figure [3](https://arxiv.org/html/2605.29288#S3.F3)(a) shows that many sentences in both groups have near-zero token-normalized hidden displacement, suggesting that fine-grained geometric scores should not be used as hard sentence-level deletion criteria. Nevertheless, retained reasoning is shifted toward larger displacement, indicating stronger state movement per token than editor-removed continuation. Figure [3](https://arxiv.org/html/2605.29288#S3.F3)(b) shows a similar pattern. Although the two groups overlap substantially in the scatter space, retained reasoning more often exhibits larger hidden displacement and stronger forward progress. Thus, geometry provides a distributional signal of useful reasoning progress rather than a pointwise separation rule.

Paired evidence. Table [1](https://arxiv.org/html/2605.29288#S3.T1) gives a paired per-sample comparison and shows that editor-removed continuation has weaker operational hidden-state progress than retained reasoning. For hidden displacement, the removed mean is much lower than the retained mean (21.91 vs. 44.92), and the paired difference is consistently negative, with 79% of samples showing lower values for the removed segment. Forward progress shows the same trend, suggesting that the removed continuation advances less under the terminal-directional proxy. These gaps remain after token normalization, where both hidden displacement per token and forward progress per token are lower for editor-removed continuation. By contrast, curvature has only a small absolute gap, suggesting that the post-conclusion continuation is associated mainly with weaker displacement and weaker terminal-directional progress, rather than with a curvature pattern.
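The paired statistic is simple to compute per metric. The sketch below uses the definition $\Delta = \text{Removed mean} - \text{Retained mean}$ for each sample and reports the mean difference and the share of samples with $\Delta < 0$; the function name and the five toy values are illustrative only and are not the paper's data.

```python
def paired_summary(removed, retained):
    """Mean paired difference and fraction of samples with a negative difference.

    removed[i] and retained[i] are the per-sample means of one metric
    (e.g. hidden displacement) for the two operational groups.
    """
    deltas = [a - b for a, b in zip(removed, retained)]
    mean_delta = sum(deltas) / len(deltas)
    frac_negative = sum(d < 0 for d in deltas) / len(deltas)
    return mean_delta, frac_negative

# Made-up per-sample means for five traces (not taken from Table 1).
removed = [20.0, 25.0, 18.0, 50.0, 22.0]
retained = [45.0, 40.0, 47.0, 44.0, 46.0]
mean_delta, frac_negative = paired_summary(removed, retained)
```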

### 3\.4Uncertainty–Geometry Mismatch

The above analyses reveal a consistent diagnostic pattern in the editor\-identified post\-conclusion continuation\. From the uncertainty view, this suffix often remains locally costly or unstable, while evaluator\-based answer recoverability no longer improves consistently\. From the geometric view, the same suffix shows weaker hidden displacement and weaker terminal\-directional progress than retained reasoning\. We refer to this pattern as an uncertainty–geometry mismatch\. Importantly, this mismatch is not used to define harmfulness and does not by itself prove causal training harm\. Instead, it characterizes the editor\-removable post\-conclusion continuation that is later tested through downstream SFT intervention\.

## 4Method

Our diagnostic analysis suggests that editor\-removed post\-conclusion continuation is associated with persistent local uncertainty and weak terminal\-directional hidden\-state progress\. Based on this operational pattern, we instantiate a lightweight boundary proxy termed Harmful Continuation Cut\.

### 4\.1Boundary Proxy

Given a question $q$ and a verified source trajectory, we separate the sentence-level reasoning trace from the final answer. Let $r = (r_1, r_2, \dots, r_T)$ denote the reasoning trace, and let $a^{*}$ denote the final answer. Let $c^{\ast} \in \{0, \dots, T\}$ denote the editor-identified post-conclusion continuation boundary used for supervision. Our goal is to learn a lightweight proxy that predicts this boundary, so that the retained prefix $r_{\leq c^{\ast}}$ is kept while the post-conclusion continuation $r_{> c^{\ast}}$ is removed.
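As a concrete illustration of the boundary convention, the sketch below splits a sentence list at a cut index in $\{0, \dots, T\}$, keeping $r_{\leq c}$ and removing $r_{> c}$. The function name and example sentences are hypothetical.

```python
def apply_cut(sentences, cut):
    """Split a reasoning trace r_1..r_T at boundary `cut` in {0, ..., T}.

    Returns (retained prefix r_{<=cut}, removed continuation r_{>cut}).
    cut = 0 removes every sentence; cut = T removes nothing.
    """
    if not 0 <= cut <= len(sentences):
        raise ValueError("cut must lie in [0, T]")
    return sentences[:cut], sentences[cut:]

trace = [
    "Set up the equation.",
    "Solve for x.",
    "So x = 4.",
    "Wait, let me re-check the setup once more.",
]
retained, removed = apply_cut(trace, cut=3)
```

Because the operation only deletes a suffix, the retained prefix is always a verbatim copy of the original sentences.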

We encode the question and reasoning trace with a frozen causal language model and extract a hidden representation at each sentence boundary, obtaining $h_t \in \mathbb{R}^{D}$. This representation denotes the model state after consuming the prefix up to sentence $r_t$. We then pass the sentence-level states through a shared sequence encoder:

$$\tilde{h}_t = \mathrm{SeqEnc}(h_{1:T})_t. \tag{1}$$

The contextualized representation $\tilde{h}_t$ is used as the common input for latent regularization and uncertainty–geometry diagnostic estimation.

### 4\.2Sequential Latent Regularization

HCC uses a sequential variational latent representation to regularize sentence-level boundary states. For each contextualized sentence state $\tilde{h}_t$, we define a posterior latent distribution:

$$q_{\phi}(z_t \mid \tilde{h}_t) = \mathcal{N}(\mu_t, \Sigma_t), \tag{2}$$

where the mean and variance are predicted from $\tilde{h}_t$. We also define a sequential prior from the previous contextual state:

$$p_{\eta}(z_t \mid \tilde{h}_{t-1}) = \mathcal{N}(\mu^{p}_t, \Sigma^{p}_t). \tag{3}$$

The sampled latent variable is projected back to the boundary-prediction space:

$$b_t = f_{\mathrm{lat}}(z_t). \tag{4}$$

The latent representation is regularized by:

$$\mathcal{L}_{\mathrm{KL}} = \sum_{t=1}^{T} D_{\mathrm{KL}}\left(q_{\phi}(z_t \mid \tilde{h}_t) \,\|\, p_{\eta}(z_t \mid \tilde{h}_{t-1})\right). \tag{5}$$

This term provides compact latent regularization for boundary prediction, rather than a hard information bottleneck or an explicit answer-prediction objective.
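For Gaussian posteriors and priors the KL term has a closed form. The sketch below assumes diagonal covariances, which the text does not state explicitly, and uses hypothetical function names and toy parameters; it sums the per-sentence KL values as in the regularizer above.

```python
import math

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total

def kl_loss(posteriors, priors):
    """Sum of per-sentence KL terms; each entry is a (mean, variance) pair."""
    return sum(kl_diag_gaussians(*q, *p) for q, p in zip(posteriors, priors))

# Two sentences with 2-D latents (toy values; diagonal covariance is an assumption).
posteriors = [([0.0, 1.0], [1.0, 1.0]), ([0.5, 0.5], [0.5, 2.0])]
priors = [([0.0, 0.0], [1.0, 1.0]), ([0.5, 0.5], [0.5, 2.0])]
loss = kl_loss(posteriors, priors)
```

Only the first sentence contributes here: its posterior mean differs from the prior mean by 1 in one dimension, giving a KL of 0.5, while the second sentence's posterior matches its prior exactly.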

### 4\.3Uncertainty–Geometry Diagnostic Estimation

Post-conclusion continuation may remain locally costly or unstable while contributing limited hidden-state progress. To capture this uncertainty–geometry mismatch, HCC jointly estimates a local uncertainty signal and an operational progress signal from $\tilde{h}_t$.

For uncertainty, we define a scalar target $T_t$ from source-trace sentence-level statistics such as entropy, NLL, or log-perplexity. HCC estimates this signal as:

$$s^{\mathrm{ent}}_t, \hat{T}_t = f_{\mathrm{ent}}(\tilde{h}_t), \tag{6}$$

where $s^{\mathrm{ent}}_t$ is the uncertainty-aware context vector used for boundary prediction, and $\hat{T}_t$ is the scalar uncertainty estimate. The regression loss is:

$$\mathcal{L}_{\mathrm{ent}} = \sum_{t=1}^{T} \mathrm{Huber}(\hat{T}_t, T_t). \tag{7}$$
For geometry, we define a scalar progress target $G_t$ from hidden-state movement statistics. HCC estimates this signal as:

$$s^{\mathrm{geo}}_t, \hat{G}_t = f_{\mathrm{geo}}(\tilde{h}_t), \tag{8}$$

where $s^{\mathrm{geo}}_t$ is the progress-aware context vector used for boundary prediction, and $\hat{G}_t$ is the scalar progress estimate. The regression loss is:

$$\mathcal{L}_{\mathrm{geo}} = \sum_{t=1}^{T} \mathrm{Huber}(\hat{G}_t, G_t). \tag{9}$$
Together, these estimates provide a unified diagnostic representation of whether a post\-conclusion sentence remains locally uncertain while showing weak operational progress\.
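Both auxiliary regressions use a Huber loss summed over sentences. A minimal sketch follows, assuming the standard Huber form with threshold $\delta = 1$ (the threshold is not given in the text) and hypothetical function names.

```python
def huber(pred, target, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear for large errors."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)

def huber_sum(preds, targets, delta=1.0):
    """Sum of Huber losses over sentences t = 1..T."""
    return sum(huber(p, t, delta) for p, t in zip(preds, targets))

# One small error (quadratic regime) and one large error (linear regime).
loss = huber_sum([0.5, 3.0], [0.0, 0.0])
```

The linear regime limits the influence of sentences with extreme entropy or displacement targets, which may matter given the heavy overlap and near-zero values reported in the geometric analysis.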

### 4\.4Mismatch\-Aware Boundary Prediction

HCC fuses the latent regularization signal with the uncertainty–geometry diagnostic representation in a shared boundary\-prediction space:

$$m_t = \mathrm{LN}\left(b_t + \alpha_{\mathrm{geo}} s^{\mathrm{geo}}_t + \alpha_{\mathrm{ent}} s^{\mathrm{ent}}_t\right), \tag{10}$$

where $\alpha_{\mathrm{geo}}$ and $\alpha_{\mathrm{ent}}$ are learnable scalar gates. Thus, HCC does not concatenate the raw contextual state directly into the final cut representation. Instead, the scalar estimates $\hat{T}_t$ and $\hat{G}_t$ are trained with auxiliary losses, while their context vectors contribute to boundary prediction.
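The gated fusion can be sketched in a few lines. The example assumes a layer normalization without learned scale and shift and fixed gate values, purely for illustration; in HCC the gates are learned, and the names below are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def fuse(b_t, s_geo, s_ent, alpha_geo, alpha_ent):
    """m_t = LN(b_t + alpha_geo * s_geo + alpha_ent * s_ent)."""
    mixed = [b + alpha_geo * g + alpha_ent * e for b, g, e in zip(b_t, s_geo, s_ent)]
    return layer_norm(mixed)

# Toy 4-D vectors; the gate values are illustrative constants, not learned ones.
m_t = fuse(
    b_t=[1.0, 2.0, 3.0, 4.0],
    s_geo=[0.5, 0.0, -0.5, 0.0],
    s_ent=[0.0, 1.0, 0.0, -1.0],
    alpha_geo=0.2,
    alpha_ent=0.1,
)
```

With both gates at zero the fused state reduces to the normalized latent projection, so the gates control how much the diagnostic context vectors influence the cut decision.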

For cut prediction, we prepend a learned beginning-of-sequence state $m_0$ and compute:

$$\pi_t = \mathrm{CutHead}(m_t), \quad t = 0, \dots, T. \tag{11}$$

Here, $\pi_t$ denotes the logit of choosing sentence $t$ as the last retained sentence. Given the editor-identified boundary $c^{*}$, the cut head is trained with:

$$\mathcal{L}_{\mathrm{cut}} = -\log \frac{\exp(\pi_{c^{*}})}{\sum_{j=0}^{T} \exp(\pi_j)}. \tag{12}$$
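The cut loss is a softmax cross-entropy over the $T + 1$ candidate boundaries. The sketch below computes it in a numerically stable way and adds an argmax decoding rule, which is an assumption since the text does not state how the boundary is chosen at inference; the function names and logits are hypothetical.

```python
import math

def cut_loss(logits, c_star):
    """-log softmax(logits)[c_star], with logits indexed t = 0..T."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[c_star]

def predict_cut(logits):
    """Argmax boundary (an assumed decoding rule for this sketch)."""
    return max(range(len(logits)), key=lambda t: logits[t])

# T = 4 sentences, so there are 5 candidate boundaries including t = 0.
logits = [-2.0, 0.0, 0.5, 3.0, 1.0]
loss = cut_loss(logits, c_star=3)
cut = predict_cut(logits)
```

Index 0 corresponds to the prepended state $m_0$, so predicting 0 means no reasoning sentence is retained.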
We also train a sentence-level deletion head to match the deletion labels produced by the offline editor. Let $y_t \in \{0, 1\}$ denote whether sentence $r_t$ should be deleted. The deletion probability is:

$$\hat{y}_t = \mathrm{DelHead}(m_t), \tag{13}$$

with the binary deletion loss:

$$\mathcal{L}_{\mathrm{del}} = -\sum_{t=1}^{T} \left[ y_t \log \hat{y}_t + (1 - y_t) \log(1 - \hat{y}_t) \right]. \tag{14}$$
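A small sketch of the deletion loss follows. It assumes the editor removes a pure suffix, so that the labels satisfy $y_t = 1$ exactly when $t > c^{*}$; this link between the boundary and the labels is an assumption of the example, as are the function names and probabilities.

```python
import math

def suffix_labels(T, c_star):
    """y_t for t = 1..T under pure suffix deletion: y_t = 1 iff t > c_star."""
    return [1 if t > c_star else 0 for t in range(1, T + 1)]

def deletion_loss(probs, labels, eps=1e-12):
    """Binary cross-entropy summed over sentences; probs are clipped for safety."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

labels = suffix_labels(T=5, c_star=3)          # sentences 4 and 5 are deleted
loss = deletion_loss([0.1, 0.2, 0.1, 0.9, 0.8], labels)
```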
The overall training objective is:

$$\mathcal{L} = \mathcal{L}_{\mathrm{cut}} + \lambda_{\mathrm{del}} \mathcal{L}_{\mathrm{del}} + \lambda_{\mathrm{KL}} \mathcal{L}_{\mathrm{KL}} + \lambda_{\mathrm{ent}} \mathcal{L}_{\mathrm{ent}} + \lambda_{\mathrm{geo}} \mathcal{L}_{\mathrm{geo}}. \tag{15}$$

## 5Experiments

### 5\.1Experimental Setup

Datasets. We use the same trajectories as in Section [3](https://arxiv.org/html/2605.29288#S3), also denoted as $\{T\}_Q$ and $\{T\}_R$. For HCC training, we use the 500M-scale Qwen2.5-0.5B-Instruct as a frozen backbone and train lightweight prediction heads to learn the editor-identified removable post-conclusion continuation boundary. To test whether the learned deletion rule transfers across nearby source-model families rather than only memorizing a single source style, we construct the train–validation split across source models. Specifically, we use $\{T\}_Q$ to train HCC, then process $\{T\}_R$ with the trained HCC and SFT the baseline model on the processed $\{T\}_R$, and vice versa.
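The cross-source protocol can be written down as a small plan. This is only a bookkeeping sketch with hypothetical names; it encodes that HCC is trained on one trajectory set and then used to process the other set before SFT.

```python
def cross_source_plan(sources):
    """Pair each HCC training source with the other source used for processing and SFT."""
    plan = []
    for train_src in sources:
        for sft_src in sources:
            if sft_src != train_src:
                plan.append({"train_hcc_on": train_src, "process_and_sft_on": sft_src})
    return plan

# "T_Q" and "T_R" stand for the two trajectory sets described above.
plan = cross_source_plan(["T_Q", "T_R"])
```

Under this plan the proxy never processes traces from the source it was trained on, so each SFT result reflects transfer across source models.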

Table 2: Main results using different long-CoT trajectories. Len. denotes the average response length. The first five result columns report SFT on $\{T\}_Q$ and the last five report SFT on $\{T\}_R$.

| Backbone | Method | $\{T\}_Q$ MATH500 | $\{T\}_Q$ AMC23 | $\{T\}_Q$ GSM8K | $\{T\}_Q$ Avg. | $\{T\}_Q$ Len. | $\{T\}_R$ MATH500 | $\{T\}_R$ AMC23 | $\{T\}_R$ GSM8K | $\{T\}_R$ Avg. | $\{T\}_R$ Len. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA3.2-3B-Instruct | Vanilla | 29.8 | 10.0 | 69.0 | 36.3 | 3478.8 | 38.4 | 15.0 | 69.9 | 41.1 | 4967.1 |
| LLaMA3.2-3B-Instruct | Heuristic | 34.0 | 15.0 | 71.5 | 40.2 | 3430.8 | 40.0 | 15.0 | 70.1 | 41.7 | 5172.1 |
| LLaMA3.2-3B-Instruct | Editor | 41.8 | 20.0 | 75.4 | 45.7 | 1906.1 | 51.2 | 22.5 | 75.9 | 49.8 | 1942.1 |
| LLaMA3.2-3B-Instruct | HCC | 43.2 | 17.5 | 75.1 | 45.2 | 2010.1 | 47.6 | 22.5 | 77.8 | 49.3 | 1934.4 |
| Qwen2.5-Math-7B-Instruct | Vanilla | 85.8 | 62.5 | 95.8 | 81.4 | 519.6 | 85.6 | 62.5 | 95.8 | 81.3 | 510.6 |
| Qwen2.5-Math-7B-Instruct | Heuristic | 82.8 | 57.5 | 95.8 | 78.7 | 526.1 | 84.4 | 62.5 | 95.7 | 80.9 | 498.5 |
| Qwen2.5-Math-7B-Instruct | Editor | 84.4 | 70.0 | 96.2 | 83.5 | 499.0 | 86.4 | 67.5 | 96.0 | 83.3 | 443.2 |
| Qwen2.5-Math-7B-Instruct | HCC | 82.6 | 62.5 | 96.1 | 80.4 | 497.9 | 86.6 | 65.0 | 96.2 | 82.6 | 455.9 |

Compared methods and evaluation metrics. We compare four ways of processing the original trajectories: (1) Vanilla, where the original trajectories are used directly for SFT without any processing; (2) Editor, where each trajectory is processed by Qwen3.5-27B to remove the post-conclusion continuation; (3) Heuristic, the heuristic method proposed by Li et al. ([2026](https://arxiv.org/html/2605.29288#bib.bib6)); and (4) HCC, the method proposed in this paper. We evaluate the resulting SFT models on three reasoning benchmarks: MATH500, AMC23, and GSM8K. The primary evaluation metric is pass@1.

### 5\.2Main Results

Overall comparisons. Table [2](https://arxiv.org/html/2605.29288#S5.T2) reports the main SFT results using different processed versions of the same answer-correct long-CoT traces. We draw three conclusions. (1) Across both trace sources and both backbones, answer-preserving removal of the editor-identified post-conclusion continuation improves the average downstream SFT score over training on the original traces. This provides interventional evidence that the post-conclusion continuation is training-unfavorable in these settings. (2) HCC achieves performance close to the 27B editor-processed reference, and on some benchmarks surpasses it. For example, in the LLaMA3.2-3B setting, HCC obtains an average score of 45.2 on $\{T\}_Q$ and 49.3 on $\{T\}_R$, closely matching the editor-processed results of 45.7 and 49.8. HCC also outperforms the editor reference on MATH500 under $\{T\}_Q$ and on GSM8K under $\{T\}_R$. These results indicate that a lightweight boundary proxy can recover much of the benefit of large-model delete-only trace editing, although with the Qwen2.5-Math-7B backbone on $\{T\}_Q$ the HCC average (80.4) falls below Vanilla (81.4). (3) Compared with heuristic truncation, models trained on HCC-processed traces reach higher average accuracy and produce shorter responses, with the length gap most pronounced for LLaMA3.2-3B. The comparison with heuristic truncation suggests that length reduction alone may not fully explain the gains, although length-controlled interventions would be needed for a complete separation.

### 5\.3More Experiments

#### Analysis of uncertainty dynamics\.

Figure [4](https://arxiv.org/html/2605.29288#S5.F4) compares the reasoning dynamics of models trained on original, HCC-processed, and editor-processed traces. In Figure [4](https://arxiv.org/html/2605.29288#S5.F4)(a), Vanilla shows a sharp late-stage increase in answer NLL, suggesting that generated reasoning becomes less favorable under evaluator-based answer recoverability. In contrast, both HCC and Editor keep answer NLL much more stable across the reasoning process, suggesting that post-conclusion continuation removal improves answer-support consistency under this diagnostic. Figure [4](https://arxiv.org/html/2605.29288#S5.F4)(b) further shows the segment-level uncertainty pattern. Vanilla first reduces entropy but then exhibits a clear entropy rebound in the later reasoning bins, which is consistent with re-entering an unstable post-conclusion continuation. By comparison, HCC and Editor continue to reduce segment entropy in the later stage and produce highly similar curves. These curves provide post-training diagnostic evidence consistent with the pattern described in Section [3](https://arxiv.org/html/2605.29288#S3). The similarity between the HCC and Editor curves suggests that the lightweight proxy approximates the diagnostic behavior of delete-only edited traces.

#### Analysis of geometric dynamics\.

Figure [5](https://arxiv.org/html/2605.29288#S5.F5) examines whether processed traces lead to stronger hidden-state progress under the chosen proxy. In Figure [5](https://arxiv.org/html/2605.29288#S5.F5)(a), HCC and Editor produce larger token-normalized hidden displacement than Vanilla, suggesting stronger state movement per generated token. Figure [5](https://arxiv.org/html/2605.29288#S5.F5)(b) shows that Vanilla exhibits a clearer positive entropy-progress mismatch in the middle and late stages, where uncertainty is not matched by sufficient geometric progress. By contrast, HCC and Editor keep this mismatch closer to zero and reduce it near the end. These curves provide post-training diagnostic evidence consistent with the pattern described in Section [3](https://arxiv.org/html/2605.29288#S3), and again suggest that HCC approximates the delete-only editor behavior.

![Refer to caption](https://arxiv.org/html/2605.29288v1/x5.png) Figure 4: Post-training uncertainty diagnostics for generated reasoning traces.

![Refer to caption](https://arxiv.org/html/2605.29288v1/x6.png) Figure 5: Operational uncertainty-progress diagnostics in generated reasoning traces.

#### Analysis of reinforcement learning\.

We further examine whether HCC-processed SFT provides a stronger initialization for subsequent GRPO training. Using LLaMA3.2-3B-Instruct trained on $\{T\}_Q$, we apply GRPO to the checkpoints obtained from Vanilla SFT and HCC-based SFT, and evaluate performance across different RL steps. As shown in Table [3](https://arxiv.org/html/2605.29288#S5.T3), the model initialized from HCC-based SFT outperforms the Vanilla counterpart at every evaluated RL step in this setting. This suggests that harmful continuation removal can provide a stronger SFT initialization. More broadly, the effect of SFT data processing can persist into subsequent RL.

Table 3: Comparison across reinforcement steps.

Table 4: Comparison of random cut and HCC-based method on LLaMA3.2-3B-Instruct.

#### Analysis of Random Cut\.

We further introduce a random cut baseline to rule out the possibility that the improvement mainly comes from shorter responses. To align it with HCC, random cut preserves the final answer, removes a sentence-complete suffix from the reasoning trace, and controls the removed length to match the average truncation length of HCC. As shown in Table [4](https://arxiv.org/html/2605.29288#S5.T4), random cut is consistently inferior to HCC on MATH500, AMC23, and GSM8K, yielding an average score of only 29.0 compared with 49.3 for HCC. This large gap suggests that arbitrary length reduction is not a reliable solution. Since random cut does not identify whether the reasoning has already concluded, it may discard necessary intermediate steps and damage the reasoning chain. HCC instead removes post-conclusion continuation, thereby reducing redundant tail reasoning while preserving the core answer-supporting process.

#### Analysis of MMLU datasets\.

We further examine whether the SFT improvements transfer to several non-mathematical evaluation subjects on MMLU Hendrycks et al. ([2021](https://arxiv.org/html/2605.29288#bib.bib33)). As shown in Figure [6](https://arxiv.org/html/2605.29288#S5.F6), we select 6 subjects that require different types of knowledge: college physics, college biology, clinical knowledge, professional psychology, high school statistics, and high school biology. HCC-based SFT outperforms Vanilla-based SFT across all six subjects and achieves performance comparable to the editor-based reference. These results suggest that models trained on HCC-processed mathematical traces can retain or improve performance on selected out-of-domain knowledge-intensive evaluations. They do not by themselves establish that the same harmful continuation pattern appears in non-mathematical training traces. Overall results for the different models and SFT data settings are reported in the Appendix.

![Refer to caption](https://arxiv.org/html/2605.29288v1/x7.png) Figure 6: Visualization of the performance of LLaMA3.2-3B-Instruct on selected MMLU subjects.

## 6Conclusion

We studied post\-conclusion continuation in answer\-correct long\-CoT SFT traces, where generation continues after the answer appears sufficiently supported\. Using a delete\-only editor, we constructed answer\-preserving post\-conclusion continuation removal and observed improved downstream SFT outcomes, suggesting that the removed continuation is training\-unfavorable in our setting\. We therefore refer to this empirically supported phenomenon as harmful continuation\. We further characterized the removed continuation through uncertainty and operational hidden\-state analyses, revealing an uncertainty–geometry mismatch\. Based on this, we instantiated HCC as a lightweight boundary proxy for approximating editor\-identified post\-conclusion continuation removal\.

## Limitations

The delete\-only editor provides an operational intervention, not a ground\-truth oracle of harmfulness; its labels should be understood as editor\-identified post\-conclusion continuation boundaries\. Our measurements are diagnostic proxies rather than causal proof of training harm\. Finally, HCC approximates the editor\-identified removal boundary rather than intrinsic harmfulness, and finer component\-level attribution remains future work\.

## References

- Probing the trajectories of reasoning traces in large language models. arXiv preprint arXiv:2601.23163. Cited by: [§2](https://arxiv.org/html/2605.29288#S2.p2.1).
- P. C. Bogdan, U. Macar, N. Nanda, and A. Conmy. Thought anchors: which LLM reasoning steps matter? In Mechanistic Interpretability Workshop at NeurIPS 2025. Cited by: [§2](https://arxiv.org/html/2605.29288#S2.p2.1).
- A. Chandra, A. Agrawal, A. Hosseini, S. Fischmeister, R. Agarwal, N. Goyal, and A. Courville (2025) Shape of thought: when distribution matters more than correctness in reasoning tasks. arXiv preprint arXiv:2512.22255. Cited by: [§2](https://arxiv.org/html/2605.29288#S2.p1.1).
- W. Chen, V. Kothapalli, A. Fatahibaarzi, H. Sang, S. Tang, Q. Song, Z. Wang, and M. Abdul-Mageed (2025a) Distilling the essence: efficient reasoning distillation via sequence truncation. arXiv preprint arXiv:2512.21002. Cited by: [§2](https://arxiv.org/html/2605.29288#S2.p1.1).
- X. Chen, J. Xu, T. Liang, Z. He, J. Pang, D. Yu, L. Song, Q. Liu, M. Zhou, Z. Zhang, et al. (2025b) Do not think that much for 2+3=? on the overthinking of long reasoning models. In Forty-second International Conference on Machine Learning. Cited by: [§2](https://arxiv.org/html/2605.29288#S2.p2.1).
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645 (8081), pp. 633–638. Cited by: [§3.1](https://arxiv.org/html/2605.29288#S3.SS1.p1.4).
- W. Gurnee and M. Tegmark (2024) Language models represent space and time. In International Conference on Learning Representations, Vol. 2024, pp. 2483–2503. Cited by: [§3.3](https://arxiv.org/html/2605.29288#S3.SS3.p2.1).
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021) Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR). Cited by: [§5.3](https://arxiv.org/html/2605.29288#S5.SS3.SSS0.Px5.p1.1).
- G. Jiang, Y. Liu, Z. Li, W. Bi, F. Zhang, L. Song, Y. Wei, and D. Lian (2025) What makes a good reasoning chain? uncovering structural patterns in long chain-of-thought reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 6501–6525. Cited by: [§2](https://arxiv.org/html/2605.29288#S2.p2.1), [§3.3](https://arxiv.org/html/2605.29288#S3.SS3.p2.1).
- Y. Li, X. Yue, Z. Xu, F. Jiang, L. Niu, B. Y. Lin, B. Ramasubramanian, and R. Poovendran (2025) Small models struggle to learn from strong reasoners. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 25366–25394. Cited by: [§1](https://arxiv.org/html/2605.29288#S1.p2.1).
- Z. Li, X. Xi, Z. Chen, W. Wang, G. Jiang, R. Shen, L. Song, Y. Wei, and D. Lian (2026) On the role of reasoning patterns in the generalization discrepancy of long chain-of-thought supervised fine-tuning. arXiv preprint arXiv:2604.01702. Cited by: [§1](https://arxiv.org/html/2605.29288#S1.p2.1), [§2](https://arxiv.org/html/2605.29288#S2.p2.1), [§5.1](https://arxiv.org/html/2605.29288#S5.SS1.p2.1).
- Z. Liu, Z. Wu, X. Li, Y. Yan, S. Wang, Z. Chen, Y. Gu, G. Yu, and M. Sun (2026) Long-chain reasoning distillation via adaptive prefix alignment. arXiv preprint arXiv:2601.10064. Cited by: [§1](https://arxiv.org/html/2605.29288#S1.p2.1), [§2](https://arxiv.org/html/2605.29288#S2.p1.1).
- H. Luo, L. Shen, H. He, Y. Wang, S. Liu, W. Li, N. Tan, X. Cao, and D. Tao (2025a) O1-pruner: length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570. Cited by: [§2](https://arxiv.org/html/2605.29288#S2.p1.1).
- Y. Luo, Y. Song, X. Zhang, J. Liu, W. Wang, G. Chen, W. Su, and B. Zheng (2025b) Deconstructing long chain-of-thought: a structured reasoning optimization framework for long CoT distillation. arXiv preprint arXiv:2503.16385. Cited by: [§1](https://arxiv.org/html/2605.29288#S1.p1.1).
- X. Ma, G. Wan, R. Yu, G. Fang, and X. Wang (2025) CoT-valve: length-compressible chain-of-thought tuning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6025–6035. Cited by: [§2](https://arxiv.org/html/2605.29288#S2.p1.1).
- L. Ou and Y. Yin (2025) Empowering lightweight MLLMs with reasoning via long CoT SFT. arXiv preprint arXiv:2509.03321. Cited by: [§1](https://arxiv.org/html/2605.29288#S1.p1.1).
- G. Silvestri and E. Cetin (2026) Learning from partial chain-of-thought via truncated-reasoning self-distillation. arXiv preprint arXiv:2603.13274. Cited by: [§2](https://arxiv.org/html/2605.29288#S2.p1.1).
- Y. Sun, Z. Zhao, Y. Wei, Y. Zhang, and C. Gong (2026) Well begun, half done: reinforcement learning with prefix optimization for LLM reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40, pp. 33144–33152. Cited by: [§2](https://arxiv.org/html/2605.29288#S2.p1.1).
- Q. Team (2025) Qwen3 technical report. External Links: 2505.09388, [Link](https://arxiv.org/abs/2505.09388). Cited by: [§3.1](https://arxiv.org/html/2605.29288#S3.SS1.p1.4).
- X. Tian, Y. Ji, H. Wang, S. Chen, S. Zhao, Y. Peng, H. Zhao, and X. Li (2025) Not all correct answers are equal: why your distillation source matters. arXiv preprint arXiv:2505.14464. Cited by: [§1](https://arxiv.org/html/2605.29288#S1.p2.1).
- L. Valeriani, D. Doimo, F. Cuturello, A. Laio, A. Ansuini, and A. Cazzaniga (2023) The geometry of hidden representations of large transformer models. Advances in Neural Information Processing Systems 36, pp. 51234–51252. Cited by: [§3.3](https://arxiv.org/html/2605.29288#S3.SS3.p2.1).
- S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, et al. (2026) Beyond the 80/20 rule: high-entropy minority tokens drive effective reinforcement learning for LLM reasoning. Advances in Neural Information Processing Systems 38, pp. 115452–115486. Cited by: [§1](https://arxiv.org/html/2605.29288#S1.p1.1).
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, pp. 24824–24837. Cited by: [§1](https://arxiv.org/html/2605.29288#S1.p1.1).
- S. Xu, W. Xie, L. Zhao, and P. He (2025) Chain of draft: thinking faster by writing less. arXiv preprint arXiv:2502.18600. Cited by: [§2](https://arxiv.org/html/2605.29288#S2.p1.1).
- S. Yang, Y. Tong, X. Niu, G. Neubig, and X. Yue (2025) Demystifying long chain-of-thought reasoning. In Forty-second International Conference on Machine Learning. Cited by: [§2](https://arxiv.org/html/2605.29288#S2.p2.1).
- Y. Yang, M. Lai, W. Zhao, X. Fan, Z. Xi, M. Wu, C. Huang, J. Zhao, H. Lv, J. Tong, et al. (2026) Which reasoning trajectories teach students to reason better? a simple metric of informative alignment. arXiv preprint arXiv:2601.14249. Cited by: [§2](https://arxiv.org/html/2605.29288#S2.p1.1).
- D. Zhang, Q. Dai, and H. Peng (2025) The best instruction-tuning data are those that fit. arXiv preprint arXiv:2502.04194. Cited by: [§1](https://arxiv.org/html/2605.29288#S1.p2.1), [§2](https://arxiv.org/html/2605.29288#S2.p1.1).

## Appendix AImplementation Details

We implement SFT with LLaMAFactory. All models are fine-tuned with the AdamW optimizer using a learning rate of $1\times 10^{-5}$. For fair comparison, Vanilla, Heuristic, Editor, and HCC use the same training configuration and differ only in the SFT traces used for supervision. During evaluation, we fix the decoding temperature to 0 to ensure deterministic generation and improve reproducibility. All benchmark results are reported under this fixed evaluation setting.

## Appendix BInformation Bottleneck Motivation for Answer\-Sufficiency

In this appendix, we provide an information\-bottleneck motivation for the answer\-sufficiency component of HCC\. The goal is to formalize when a reasoning prefix can already support the final answer, and to give an idealized sufficient condition under which a suffix does not change the answer decision under an answer\-prediction head\.

### B\.1Sequential Conditional Information Bottleneck

Given a question $Q$, a sentence-level reasoning trace $R=(r_1,\dots,r_T)$, and the final answer $A$, let $R_{\leq t}$ and $R_{>t}$ denote the prefix and suffix at step $t$. For each prefix $R_{\leq t}$, we introduce a stochastic bottleneck representation $Z_t$ induced from the contextualized reasoning state $\tilde{h}_t$:

$$q_{\phi}(z_t \mid \tilde{h}_t) = \mathcal{N}(\mu_t, \Sigma_t).$$

The ideal conditional information bottleneck objective aims to preserve answer-relevant information while removing unnecessary dependence on the reasoning prefix:

$$\min\ I(Z_t; R_{\leq t} \mid Q) \quad \mathrm{s.t.} \quad I(Z_t; A \mid Q)\ \text{is sufficiently large}.$$

Equivalently, one may optimize the following Lagrangian form:

$$\mathcal{J}_{\mathrm{IB}} = -I(Z_t; A \mid Q) + \beta I(Z_t; R_{\leq t} \mid Q).$$

Since exact mutual information is intractable for free-form reasoning traces, we use a variational approximation. The answer-relevance term is optimized through a variational answer predictor $p_{\psi}(a \mid z_t, q)$, leading to the negative log-likelihood term:

$$-\log p_{\psi}(a^{*} \mid z_t, q).$$

For the compression term, we introduce a sequential prior $p_{\eta}(z_t \mid z_{t-1}, q)$ and use the following KL penalty as a variational surrogate for the incremental information injected into the bottleneck:

$$D_{\mathrm{KL}}\big(q_{\phi}(z_t \mid \tilde{h}_t)\,\|\,p_{\eta}(z_t \mid z_{t-1}, q)\big).$$

Thus, the practical sequential IB loss is:

$$\mathcal{L}_{\mathrm{IB}} = \sum_{t=1}^{T}\Big[-\log p_{\psi}(a^{*} \mid z_t, q) + \beta D_{\mathrm{KL}}\big(q_{\phi}(z_t \mid \tilde{h}_t)\,\|\,p_{\eta}(z_t \mid z_{t-1}, q)\big)\Big].$$

This objective encourages $Z_t$ to retain information predictive of the final answer while discouraging unnecessary state complexity.
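As a minimal sketch of how the sequential IB loss could be evaluated, the Python below assumes that both the posterior $q_{\phi}$ and the sequential prior $p_{\eta}$ are Gaussians with diagonal covariance, so the KL term has a closed form. The diagonal and Gaussian-prior choices, the function names `gaussian_kl_diag` and `sequential_ib_loss`, and the input layout are illustrative assumptions, not details given in the paper.

```python
import math


def gaussian_kl_diag(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(N(mu_q, diag(var_q)) || N(mu_p, diag(var_p))).

    Assumption: diagonal covariances; the paper only states a Gaussian posterior.
    """
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
    return 0.5 * total


def sequential_ib_loss(answer_log_probs, posteriors, priors, beta):
    """Sum over steps t of -log p(a* | z_t, q) + beta * KL(q(z_t | h_t) || p(z_t | z_{t-1}, q)).

    answer_log_probs: list of log p(a* | z_t, q), one value per step (hypothetical input).
    posteriors, priors: lists of (mean, variance) pairs, one pair per step.
    """
    loss = 0.0
    for logp, (mu_q, var_q), (mu_p, var_p) in zip(answer_log_probs, posteriors, priors):
        loss += -logp + beta * gaussian_kl_diag(mu_q, var_q, mu_p, var_p)
    return loss


# Toy single-step example: the posterior mean is shifted by 1 from the prior mean.
example = sequential_ib_loss(
    answer_log_probs=[math.log(0.5)],
    posteriors=[([1.0], [1.0])],
    priors=[([0.0], [1.0])],
    beta=2.0,
)
print(example)
```

A larger `beta` penalizes steps whose posterior moves far from what the prior predicts from the previous state, which is the sense in which the KL term discourages unnecessary state complexity.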

### B\.2Ideal Answer\-Sufficiency Boundary

We next formalize an ideal boundary after which the remaining reasoning suffix provides no additional answer information under the bottleneck representation\.

#### Definition\.

The ideal answer-sufficiency boundary is defined as:

$$\tau^{*} = \min\left\{t : I(A; R_{>t} \mid Q, Z_t) = 0\right\}.$$

This condition means that, once $Q$ and $Z_t$ are given, the suffix $R_{>t}$ provides no additional information about the final answer $A$. If the set is empty, the trace has no boundary satisfying this ideal condition.

#### Proposition\.

If

$$I(A; R_{>t} \mid Q, Z_t) = 0,$$

then conditioning on $R_{>t}$ does not change the Bayes-optimal answer distribution given $(Q, Z_t)$.

#### Proof\.

By the definition of conditional mutual information, the condition

$$I(A; R_{>t} \mid Q, Z_t) = 0$$

implies the conditional independence relation:

$$A \perp R_{>t} \mid (Q, Z_t).$$

Therefore, for any answer candidate $a$, we have:

$$p(a \mid Q, Z_t, R_{>t}) = p(a \mid Q, Z_t).$$

As a result, the Bayes-optimal answer predictor satisfies:

$$\arg\max_{a} p(a \mid Q, Z_t, R_{>t}) = \arg\max_{a} p(a \mid Q, Z_t).$$

Thus, once $Z_t$ is answer-sufficient in the conditional-independence sense, the suffix $R_{>t}$ is irrelevant to the answer decision under the bottleneck representation. ∎
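The proposition can be checked numerically on a toy example. The sketch below is illustrative only: it treats the question as fixed (so $Q$ is dropped from the conditioning), replaces the continuous bottleneck state with a discrete variable, and uses made-up probabilities. It computes $I(A; R \mid Z)$ for a joint built to satisfy the conditional independence, and compares the answer posterior with and without the suffix.

```python
import math
from collections import defaultdict


def conditional_mutual_information(joint):
    """I(A; R | Z) in nats for a discrete joint given as {(a, r, z): prob}."""
    p_z, p_az, p_rz = defaultdict(float), defaultdict(float), defaultdict(float)
    for (a, r, z), p in joint.items():
        p_z[z] += p
        p_az[(a, z)] += p
        p_rz[(r, z)] += p
    cmi = 0.0
    for (a, r, z), p in joint.items():
        if p > 0.0:
            cmi += p * math.log(p * p_z[z] / (p_az[(a, z)] * p_rz[(r, z)]))
    return cmi


def product_joint(p_z, p_a_given_z, p_r_given_z):
    """Joint p(a, r, z) = p(z) p(a|z) p(r|z), so A and R are independent given Z."""
    return {
        (a, r, z): pz * pa * pr
        for z, pz in p_z.items()
        for a, pa in p_a_given_z[z].items()
        for r, pr in p_r_given_z[z].items()
    }


def answer_posterior(joint, z, r=None):
    """p(a | z) when r is None, otherwise p(a | z, r)."""
    scores = defaultdict(float)
    for (a, r_i, z_i), p in joint.items():
        if z_i == z and (r is None or r_i == r):
            scores[a] += p
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}


# Toy distributions (hypothetical values): the suffix depends on Z but not on A.
independent = product_joint(
    {0: 0.5, 1: 0.5},
    {0: {"x": 0.9, "y": 0.1}, 1: {"x": 0.2, "y": 0.8}},
    {0: {"s": 0.5, "t": 0.5}, 1: {"s": 0.3, "t": 0.7}},
)
# Counterexample: the suffix fully determines the answer, so the CMI is log 2.
dependent = {("x", "s", 0): 0.5, ("y", "t", 0): 0.5}

print(conditional_mutual_information(independent))
print(conditional_mutual_information(dependent))
print(answer_posterior(independent, 0), answer_posterior(independent, 0, "s"))
```

In the independent case the two posteriors coincide, so the arg max over answers is unchanged by the suffix; in the dependent case the suffix carries answer information and no such boundary exists at that state.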

#### Remark\.

This proposition establishes an answer\-sufficiency condition only\. It does not imply that deletion necessarily improves SFT\. The empirical benefit studied in the main paper additionally depends on whether the answer\-sufficient suffix introduces high local uncertainty or weak geometric progress as supervision\.

## Appendix CExperimental Settings

### C\.1Uncertainty Metrics

We compute uncertainty metrics with a fixed evaluator model\. Each answer\-correct trace is split into sentence\-level units, and the final boxed answer is used as the answer target\. When scoring intermediate reasoning text, standalone boxed\-answer strings are removed to reduce trivial answer leakage\.

For sentence-level uncertainty, each sentence $r_t=(y_1,\ldots,y_m)$ is scored under its preceding context $P_{t-1}$. We report token-averaged NLL and predictive entropy:

$$\mathrm{NLL}_{\mathrm{sent}}(r_t) = -\frac{1}{m}\sum_{i=1}^{m}\log p(y_i \mid P_{t-1}, y_{<i}),$$

$$\mathrm{Ent}_{\mathrm{sent}}(r_t) = \frac{1}{m}\sum_{i=1}^{m} H\!\left(p(\cdot \mid P_{t-1}, y_{<i})\right).$$

For answer-level uncertainty, we measure how a reasoning prefix affects recovery of the boxed answer $a^{*}=(a_1,\ldots,a_L)$. Given the prefix $P_t$ after appending sentence $r_t$, we compute:

$$\mathrm{NLL}_{\mathrm{ans}}(P_t) = -\frac{1}{L}\sum_{i=1}^{L}\log p(a_i \mid P_t, a_{<i}).$$

Answer entropy is computed analogously over the answer-token positions. We define answer-NLL reduction as:

$$\Delta_{\mathrm{ans}}(t) = \mathrm{NLL}_{\mathrm{ans}}(P_{t-1}) - \mathrm{NLL}_{\mathrm{ans}}(P_t).$$

A larger value means that appending $r_t$ makes the final answer easier to recover under the evaluator.
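The metrics above can be sketched in a few lines of Python. This is an illustrative sketch, assuming the evaluator exposes one vector of next-token logits per scored position; the function names and the plain-list input format are hypothetical, and a real pipeline would obtain the logits from the evaluator model's forward pass.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_avg_nll_and_entropy(step_logits, target_ids):
    """Token-averaged NLL and predictive entropy (in nats) over one span.

    step_logits[i] holds the logits for position i given the context and the
    preceding span tokens; target_ids[i] is the token actually observed there.
    The same routine applies to a sentence r_t or to the answer tokens a*.
    """
    nll, entropy = 0.0, 0.0
    for logits, target in zip(step_logits, target_ids):
        logp = log_softmax(logits)
        nll -= logp[target]
        entropy -= sum(math.exp(lp) * lp for lp in logp)
    n = len(target_ids)
    return nll / n, entropy / n


def answer_nll_reduction(nll_ans_prev, nll_ans_curr):
    """Delta_ans(t) = NLL_ans(P_{t-1}) - NLL_ans(P_t); positive means r_t helped."""
    return nll_ans_prev - nll_ans_curr


# Toy check with a 4-token vocabulary: uniform logits give NLL = entropy = log 4.
uniform_nll, uniform_ent = token_avg_nll_and_entropy([[0.0] * 4, [0.0] * 4], [1, 3])
print(uniform_nll, uniform_ent, answer_nll_reduction(2.0, 1.5))
```

Scoring the answer tokens after each appended sentence and differencing consecutive values gives the per-sentence reduction curve used in the segment-wise and boundary-level plots.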

For segment-wise plots, sentences are appended along the original complete trace, while the x-axis is normalized separately within retained reasoning and editor-removed continuation. For boundary-level plots, $K_1$ and $K_T$ denote the first and last retained sentences, and $C_1$ and $C_T$ denote the first and last editor-removed sentences.

### C\.2Geometric Metrics

We compute geometric metrics from hidden states at sentence boundaries. Let $h_t$ denote the evaluator hidden state after consuming the prefix ending at sentence $r_t$. The local state update is:

$$\Delta h_t = h_t - h_{t-1}.$$

Hidden displacement measures the size of this update:

$$D_t = \|\Delta h_t\|_2.$$

Forward progress measures the projection of the local update onto the remaining direction toward the terminal state of the analyzed trace:

$$G_t = \frac{\langle \Delta h_t, h_T - h_{t-1} \rangle}{\|h_T - h_{t-1}\|_2 + \epsilon}.$$

This is an operational proxy for terminal-directional hidden-state progress, not a direct measurement of the true reasoning process.

Progress efficiency is defined as:

$$E_t = \frac{G_t}{D_t + \epsilon}.$$

To control for sentence length, we also report token-normalized variants:

$$D_t^{\mathrm{tok}} = \frac{D_t}{n_t}, \qquad G_t^{\mathrm{tok}} = \frac{G_t}{n_t},$$

where $n_t$ is the token length of $r_t$.

Curvature is used only as an auxiliary direction-change diagnostic:

$$\mathrm{Curv}_t = 1 - \frac{\langle \Delta h_{t-1}, \Delta h_t \rangle}{\|\Delta h_{t-1}\|_2 \|\Delta h_t\|_2 + \epsilon}, \qquad t > 1.$$

For paired comparisons, we average each metric within retained reasoning and editor-removed continuation for each example. The paired difference is:

$$\Delta = \mathrm{Mean}_{\mathrm{removed}} - \mathrm{Mean}_{\mathrm{retained}}.$$

We report group means, the fraction of examples where the removed continuation is lower or higher, and the 95% confidence interval of $\Delta$.
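A minimal sketch of these per-sentence geometric metrics follows. It assumes the boundary hidden states are available as plain lists of floats, with an initial state `hidden[0]` taken before the first sentence; that initial state, the value of `eps`, and the function name `geometric_metrics` are assumptions for illustration, since the text does not specify them.

```python
import math


def _sub(u, v):
    return [a - b for a, b in zip(u, v)]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _norm(u):
    return math.sqrt(_dot(u, u))


def geometric_metrics(hidden, token_counts, eps=1e-8):
    """Per-sentence displacement, forward progress, efficiency, and curvature.

    hidden: [h_0, h_1, ..., h_T], where h_t is the state after sentence r_t and
            h_0 is an assumed state before the first sentence.
    token_counts: [n_1, ..., n_T], the token length of each sentence.
    """
    h_terminal = hidden[-1]
    results, prev_delta = [], None
    for t in range(1, len(hidden)):
        delta = _sub(hidden[t], hidden[t - 1])        # local update
        remaining = _sub(h_terminal, hidden[t - 1])   # direction to terminal state
        d_t = _norm(delta)
        g_t = _dot(delta, remaining) / (_norm(remaining) + eps)
        n_t = token_counts[t - 1]
        curv = None                                   # undefined for the first update
        if prev_delta is not None:
            curv = 1.0 - _dot(prev_delta, delta) / (_norm(prev_delta) * d_t + eps)
        results.append({
            "D": d_t, "G": g_t, "E": g_t / (d_t + eps),
            "D_tok": d_t / n_t, "G_tok": g_t / n_t, "curv": curv,
        })
        prev_delta = delta
    return results


# Toy 2-D trajectory: one step along x, then one step along y.
metrics = geometric_metrics([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]], [2, 4])
for row in metrics:
    print(row)
```

One property worth noting: at the final step the update and the remaining direction coincide, so the efficiency there is close to 1 by construction and is less informative than at earlier steps.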

## Appendix DAdditional Experiments

### D\.1Additional Analysis of Harmful Continuation Diagnosis

#### Uncertainty View\.

Figure [7](https://arxiv.org/html/2605.29288#A4.F7) compares the answer-level perturbation induced by retained reasoning and editor-removed continuation. Editor-removed continuation shows larger NLL perturbation and a right-shifted distribution of average log-probability perturbation. This suggests that the removed harmful continuation is not simply irrelevant text, but remains answer-conditioned and introduces stronger instability to evaluator-based final-answer prediction.

#### Geometric View\.

Figure [8](https://arxiv.org/html/2605.29288#A4.F8) further compares the geometric behavior of the two groups. Retained reasoning induces larger token-normalized hidden displacement, while editor-removed continuation is more concentrated in the low forward-progress region. This is consistent with the view that editor-removed continuation corresponds to a low-progress phase: it can still affect answer prediction, but does not provide comparable representation-level state movement toward the terminal reasoning state. Together, these additional results support the uncertainty–geometry mismatch diagnosis of harmful continuation.

![Refer to caption](https://arxiv.org/html/2605.29288v1/x8.png) Figure 7: Additional uncertainty-side diagnosis. (a) NLL perturbation induced by retained reasoning and editor-removed continuation. (b) ECDF of average log-probability perturbation. Editor-removed continuation induces larger answer-level perturbations.

![Refer to caption](https://arxiv.org/html/2605.29288v1/x9.png) Figure 8: Additional geometry-side diagnosis. (a) Token-normalized hidden displacement of retained reasoning and editor-removed continuation. (b) ECDF of token-normalized forward progress. Editor-removed continuation is more concentrated in low-progress regions under the operational proxy.

Table 5: Comparison of different methods across backbone models on the MMLU dataset. $\{T\}_Q$ and $\{T\}_R$ denote SFT trajectories from Qwen-style and R1-style long-CoT sources, respectively.

### D\.2Additional analysis of Test Datasets

Table 6: HCC-based self-consistency diagnostics. Phase, Sent. ratio, and Len. denote the occurrence rate of outputs matching the HCC removable-continuation pattern, the corresponding sentence-level ratio, and the average token length, respectively. These metrics are detector-based consistency measures rather than independent evidence of harmfulness.

#### Case study.

Figure [9](https://arxiv.org/html/2605.29288#A4.F9) presents a qualitative example comparing the same base model trained with HCC-processed traces and original long-CoT traces. The HCC-trained model quickly identifies the correct reasoning path, computes the remaining distance, and outputs the correct solution without entering a long verification loop. In contrast, the Vanilla-trained model first reaches the correct answer, but then derives a conflicting result from an alternative calculation, as shown in the gray region. After this conflict appears, the model repeatedly compares the two answers and rechecks different parts of the solution, entering a low-efficiency reasoning loop highlighted in yellow. The response eventually exhausts the token budget without producing a successful final answer. This example qualitatively illustrates a behavior consistent with the harmful continuation pattern: the model may continue uncertain, low-progress verification even after a sufficient answer has already been reached. It suggests that HCC-processed supervision can reduce such continuation patterns.

#### HCC\-based self\-consistency diagnostic\.

Table [6](https://arxiv.org/html/2605.29288#A4.T6) evaluates whether model outputs after SFT match the removable post-conclusion continuation pattern learned by HCC. For a fair cross-source test, we use the HCC proxy trained on $\{T\}_Q$ to analyze outputs from models trained on $\{T\}_R$-based data. This analysis should be interpreted as a detector-based self-consistency diagnostic rather than an independent measurement of causal harmfulness. Under this diagnostic, models trained on HCC-processed traces produce fewer outputs that are classified as containing removable post-conclusion continuation. For example, HCC reduces the sentence-level detected continuation ratio from 51.84% to 19.45% on GSM8K and from 59.51% to 27.35% on MATH500.

Table 7:Computational cost comparison given similar input lengths of Qwen3\.5\-27B and HCC\.
#### Analysis of Computational Costs\.

Table [7](https://arxiv.org/html/2605.29288#A4.T7) compares the computational cost of the 27B offline editor and our HCC proxy given similar input lengths. HCC uses only 498M parameters, which is about 1.8% of Qwen3.5-27B. It also requires 2.5T MACs and 5.1T FLOPs, compared with 137.1T MACs and 274.3T FLOPs for Qwen3.5-27B. This corresponds to a roughly 54.2× reduction in computation. These results show that HCC provides a much cheaper proxy for editor-identified harmful continuation boundary approximation, making large-scale SFT trace processing more practical.
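The reported ratios can be reproduced approximately from the rounded figures quoted above. The short check below is a sketch that assumes the editor has exactly 27B parameters (taken from the model name) and uses the rounded MAC and FLOP counts as stated, so the results may differ slightly from ratios computed on unrounded measurements.

```python
# Rounded figures as quoted in the text; 27e9 is an assumption from the model name.
hcc = {"params": 498e6, "macs": 2.5e12, "flops": 5.1e12}
editor = {"params": 27e9, "macs": 137.1e12, "flops": 274.3e12}

param_fraction = hcc["params"] / editor["params"]  # share of editor parameters
mac_ratio = editor["macs"] / hcc["macs"]           # editor-to-HCC MAC ratio
flop_ratio = editor["flops"] / hcc["flops"]        # editor-to-HCC FLOP ratio

print(f"parameter share: {param_fraction:.2%}")
print(f"MAC ratio: {mac_ratio:.1f}x, FLOP ratio: {flop_ratio:.1f}x")
```

Both ratios land close to the reported figure of roughly 54.2×; the small differences presumably come from rounding in the quoted MAC and FLOP counts.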

![Refer to caption](https://arxiv.org/html/2605.29288v1/x10.png) Figure 9: Case study of harmful continuation after SFT. The left part shows the reasoning process of a model trained on HCC-processed traces, while the right part shows the reasoning process of a model trained on original traces. We use gray and yellow highlights to indicate the conflicting reasoning and the inefficient reasoning loop, respectively.
