@agarwl_: Self-distillation does not work for thinking models YET https://arxiv.org/abs/2603.24472 https://openreview.net/forum?i…

X AI KOLs Timeline 06/15/26, 11:54 PM Papers

self-distillation reasoning llm mathematical-reasoning epistemic-verbalization post-training

Summary

This paper studies why self-distillation degrades reasoning in LLMs, finding that it suppresses epistemic verbalization (uncertainty expression), leading to performance drops of up to 40% in mathematical reasoning tasks.

Self-distillation does not work for thinking models YET https://arxiv.org/abs/2603.24472 https://openreview.net/forum?id=VhCJItwQHn… https://arxiv.org/abs/2606.11709

Original Article

Cached at: 06/16/26, 11:51 AM

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Source: https://arxiv.org/html/2603.24472

Jeonghye Kim1,2, Xufang Luo1†, Minbeom Kim3, Sangmook Lee3, Dohyung Kim3, Jiwon Jeon2, Dongsheng Li1, Yuqing Yang1

1 Microsoft Research, 2 KAIST, 3 Seoul National University

Blog post | beanie00/self-distillation-analysis

Abstract

Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization, the model’s expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-1.7B/8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.

1 Introduction

Figure 1: (a) Chemistry (Olmo3-7B-Instruct): training score and response length changes for GRPO and Self-Distillation (SDPO), using results from the SDPO Wandb logs (wandb). (b) DAPO-Math-17k (DeepSeek-Distill-Qwen-7B): training score and response length changes with GRPO and SDPO.

Recently, self-distillation (self-distillation) has attracted increasing attention in the post-training of large language models (LLMs). In this paradigm, two instances of the same model are employed: one conditioned on privileged information (e.g., ground-truth solutions) serves as a teacher, providing dense reward signals for responses generated by another instance that does not have access to that information. Several studies have demonstrated that combining this framework with post-training methods such as Reinforcement Learning from Verifiable Rewards (RLVR) leads to highly efficient performance gains (zhu2025token; understanding; SDPO; shenfeld2026self; song2026expanding; zhao2026self; opcd). These methods have shown particularly strong improvements in domains such as agentic environments and scientific reasoning, especially under in-domain evaluation settings. Interestingly, a consistent trend observed across these works is that performance improves as response length decreases, suggesting that self-distillation promotes more concise and effective reasoning.

However, when we apply the same self-distillation approach to mathematical reasoning tasks, we observe a markedly different phenomenon. Figure 1 compares the effects of a representative self-distillation algorithm, SDPO, in the Chemistry domain (a) and the Math domain (b). As shown in the figure, in the Chemistry domain, self-distillation substantially reduces response length compared to GRPO while rapidly improving performance. In contrast, in the Math domain, although response length consistently decreases as training progresses, performance drops significantly, contrary to prior findings.

This raises a question: “Why does performance sometimes degrade despite the model being trained to move toward the correct answer?”

Our analysis reveals a consistent pattern: the more informative the context provided to the teacher, the more concise and confident the resulting reasoning becomes, with substantially fewer expressions of uncertainty and, particularly in math reasoning, degraded performance. We trace this effect to the suppression of epistemic verbalization (understanding), whereby models explicitly verbalize and incorporate uncertainty during reasoning. Strong reasoning models such as DeepSeek-R1 (deepseek-r1) frequently express uncertainty using tokens like “Wait” or “Hmm”. Although these expressions may not directly advance the reasoning, removing them discards important signals that a reasoning path may be flawed, leading to significant performance drops (understanding).

To systematically understand when and why self-distillation suppresses epistemic verbalization and thus affects performance, we conduct a comprehensive empirical study and identify two key factors: information richness and task coverage. The more privileged the information the teacher receives, such as the correct solution, the more its reasoning traces suppress expressions of uncertainty, encouraging the student to imitate a confident reasoning style that presupposes information unavailable at inference time. When task coverage is limited, this compression enables rapid in-domain optimization. However, as coverage increases, removing epistemic verbalization can interfere with optimization across diverse tasks, degrading performance on more challenging or unseen problems.

More broadly, our results show that even when the training objective faithfully guides the model toward correct reasoning traces, the resulting reasoning style can quietly shift in ways that hurt generalization. The suppression of epistemic verbalization is not penalized by standard objectives, yet negatively impacts out-of-distribution (OOD) performance. This suggests that post-training objectives need to account not only for answer correctness, but also for eliciting and preserving uncertainty-aware reasoning behaviors. We believe these findings offer a useful step toward a deeper understanding of reasoning in self-distillation and post-training more broadly.

2 Preliminaries

Self-Distillation

Let $x \in \mathcal{X}$ denote an input and $y = (y_1, \dots, y_T)$ a sequence generated by a language model $\pi_\theta$. The model defines an autoregressive distribution $\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(y_t \mid x, y_{<t})$. In self-distillation, the same model acts as both a student and a teacher under different conditioning contexts. The student first generates a sequence $y \sim \pi_\theta(\cdot \mid x)$. The teacher policy is obtained by conditioning the model on a richer context $c$ that provides additional information about the input (e.g., solutions, environment feedback, or other auxiliary signals). Training minimizes the divergence between the student and teacher next-token distributions:

$$\mathcal{L}_{\text{Self-Distillation}}(\theta) = \sum_{t} \mathrm{KL}\!\left(\pi_\theta(\cdot \mid x, y_{<t}) \;\|\; \mathrm{stopgrad}\big(\pi_\theta(\cdot \mid x, c, y_{<t})\big)\right). \tag{1}$$

This objective encourages the student to match the teacher’s predictions under the richer context, enabling the model to improve by distilling privileged information available at training time without requiring an external teacher.
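To make Eq. (1) concrete, the following is a minimal sketch that evaluates the per-token KL sum on hand-written toy distributions. The three-token vocabulary, the probability values, and the function names are all illustrative assumptions; a real implementation would operate on model logits and rely on an autodiff framework's stop-gradient.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def self_distillation_loss(student_dists, teacher_dists):
    """Sum over positions t of KL(student_t || teacher_t), mirroring Eq. (1).

    The teacher distributions play the role of the stopgrad term: they are
    plain constants here, since nothing is differentiated in this sketch.
    """
    return sum(kl_divergence(p, q) for p, q in zip(student_dists, teacher_dists))


# Hypothetical vocabulary ["wait", "so", "therefore"] and two positions.
# The student (no privileged context) puts more mass on "wait" than the
# teacher (conditioned on a richer context c).
student = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]
teacher = [[0.1, 0.3, 0.6], [0.1, 0.5, 0.4]]

loss = self_distillation_loss(student, teacher)
print(f"toy self-distillation loss: {loss:.4f}")
```

In this toy setup the loss is zero only when the student matches the teacher at every position, so minimizing it moves probability mass away from tokens the teacher rarely emits.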

Key Characteristics of Math Reasoning

In LLMs, math reasoning can be viewed as self-Bayesian reasoning, where each step is generated conditioned only on the prompt $x$ and previously generated tokens $y = (y_1, \dots, y_T)$, allowing the model to iteratively update its belief over intermediate hypotheses (understanding). Such reasoning spans diverse tasks such as arithmetic, algebra, geometry, word problems, and logical pattern recognition, so evaluation benchmarks are frequently OOD relative to training data due to compositional and reasoning-depth shifts.

Figure 2: (2a) Reasoning with Epistemic Verbalization. (2b) Unguided vs. Solution-Guided.

Within this process, verbalized uncertainty toward the correct response $y^*$, referred to as epistemic verbalization (understanding), can serve as an informative signal rather than stylistic redundancy. As shown in Figure 2 (2a), reasoning without such signals may prematurely commit to incorrect hypotheses with little room for recovery, whereas epistemic verbalization preserves alternative hypotheses and supports gradual uncertainty reduction.

3 LLM Reasoning Behavior Under Richer Information

In self-distillation, the teacher has access to a richer context $c$, which provides strong hints and enables it to generate reasoning traces with minimal expressed uncertainty (Figure 2 (2b)). While this yields more concise responses, it may hinder the student’s ability to perform uncertainty-aware reasoning: aggressive length constraints and overly confident reasoning styles risk eliminating not only unnecessary verbosity but also valuable epistemic signals, especially in smaller models with limited parametric knowledge. This raises two questions we investigate throughout the paper: (i) how does the richness of $c$ relate to the suppression of uncertainty expression and to reasoning performance? and (ii) if epistemic verbalization is generally beneficial, why does suppressing it improve performance in domains such as Chemistry while degrading it in Math?

To study these questions, we first need a way to quantify how informative the conditioning context is. Let $y^*$ denote a random variable representing an ideal correct response to $x$. We formalize the information that $c$ provides about $y^*$ as the conditional mutual information

$$I(y^*;\, c \mid x) \;=\; H(y^* \mid x) \;-\; H(y^* \mid x,\, c), \tag{2}$$

which captures the reduction in uncertainty about $y^*$ once the additional context $c$ is given.
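As an illustration of Eq. (2), the sketch below computes the mutual information for a single fixed input $x$ from a small joint table over $(y^*, c)$. The tables, the two-outcome response space, and the function names are hypothetical toy choices, not quantities estimated in the paper.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def mutual_information(joint):
    """I(Y; C) = H(Y) - H(Y | C) for joint[y][c], with the input x held fixed."""
    p_y = [sum(row) for row in joint]
    p_c = [sum(col) for col in zip(*joint)]
    h_y_given_c = 0.0
    for c, pc in enumerate(p_c):
        if pc > 0.0:
            conditional = [joint[y][c] / pc for y in range(len(joint))]
            h_y_given_c += pc * entropy(conditional)
    return entropy(p_y) - h_y_given_c


# Empty context: c takes a single value, so it cannot reduce uncertainty.
no_context = mutual_information([[0.5], [0.5]])
# Partially informative context: c agrees with y* 80% of the time.
partial = mutual_information([[0.4, 0.1], [0.1, 0.4]])
# Fully informative context: c determines y* exactly.
full = mutual_information([[0.5, 0.0], [0.0, 0.5]])

print(no_context, partial, full)
```

The three cases loosely mirror the range from unguided generation to full solution guidance: the richer the context, the larger the reduction in entropy about the correct response.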

Building on this measure, we construct a controlled comparison to examine how the model’s reasoning behavior changes as the conditioning information is varied. We use the DAPO-Math-17k dataset (dapo) with DeepSeek-R1-Distill-Qwen-7B (deepseek-r1) as the base model, selecting 100 problems on which the base model achieves accuracy between 0.125 and 0.5 over 8 rollouts. Let $s$ denote the full solution (including chain-of-thought in <think> tags), $s_{\setminus\text{think}}$ the solution with <think> content removed, and $y_r \sim \pi_\theta(\cdot \mid x, s)$ a response previously generated under full solution guidance. We compare the model’s responses across four generation settings:

• (1) Unguided generation ($c = \emptyset$): $I(y^*;\, c \mid x) = 0$ by definition.
• (2) Solution-guided generation ($c = s$): provides maximal guidance and yields the largest $I(y^*;\, c \mid x)$.
• (3) Solution-guided generation without think contents ($c = s_{\setminus\text{think}}$): since $s_{\setminus\text{think}}$ is a strict informational subset of $s$, we have $I(y^*;\, s_{\setminus\text{think}} \mid x) \leq I(y^*;\, s \mid x)$.
• (4) Regeneration-conditioned generation ($c = y_r$, where $y_r \sim \pi_\theta(\cdot \mid x, s)$): since $y_r$ is derived from $s$, the data processing inequality gives $I(y^*;\, y_r \mid x) \leq I(y^*;\, s \mid x)$.

These settings induce the following ordering over the conditional mutual information:

$$\underbrace{I(y^*;\, \emptyset \mid x) = 0}_{\text{(1)}} \;<\; \underbrace{I(y^*;\, s_{\setminus\text{think}} \mid x)}_{\text{(3)}} \;\leq\; \underbrace{I(y^*;\, y_r \mid x)}_{\text{(4)}} \;\leq\; \underbrace{I(y^*;\, s \mid x)}_{\text{(2)}}. \tag{3}$$
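The data processing inequality invoked for setting (4) can be checked numerically on a toy Markov chain $y^* \to s \to y_r$ for one fixed $x$. In the sketch below the binary variables and both noise channels are made-up assumptions; it only demonstrates that passing $s$ through a further noisy step cannot increase information about $y^*$.

```python
import math


def mutual_information(joint):
    """I(A; B) in bits for a joint probability table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0.0:
                total += p * math.log2(p / (p_a[a] * p_b[b]))
    return total


def push_through(joint_ab, channel_bc):
    """Given p(a, b) and p(c | b), return p(a, c) for the chain a -> b -> c."""
    n_c = len(channel_bc[0])
    return [
        [sum(row[b] * channel_bc[b][c] for b in range(len(row))) for c in range(n_c)]
        for row in joint_ab
    ]


prior = [0.5, 0.5]                            # hypothetical p(y*)
solution_channel = [[0.9, 0.1], [0.1, 0.9]]   # hypothetical p(s | y*)
regen_channel = [[0.8, 0.2], [0.2, 0.8]]      # hypothetical p(y_r | s)

joint_ys = [[prior[y] * solution_channel[y][s] for s in range(2)] for y in range(2)]
joint_yr = push_through(joint_ys, regen_channel)

i_s = mutual_information(joint_ys)  # I(y*; s | x)
i_r = mutual_information(joint_yr)  # I(y*; y_r | x)
print(f"I(y*; s) = {i_s:.3f} bits, I(y*; y_r) = {i_r:.3f} bits")
```

Note that this covers only the last inequality in Eq. (3); the placement of setting (3) relative to setting (4) is not implied by the data processing inequality.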

Prompts

The prompts used for unguided and solution-guided settings are as follows. For regeneration, we used the same prompts as in SDPO. A comparative analysis with other regeneration prompts is provided in Appendix C.

Epistemic Tokens

Following (understanding), we define a set of 10 epistemic markers $\mathcal{T} = \{\textit{wait},\, \textit{hmm},\, \textit{perhaps},\, \textit{maybe},\, \textit{actually},\, \textit{alternatively},\, \textit{seems},\, \textit{might},\, \textit{likely},\, \textit{check}\}$ as practical indicators of regions where the externalization of uncertainty is likely to occur. These tokens are intuitively recognizable as uncertainty expressions, and we further verify that they co-occur with extended uncertainty expressions in sampled LLM outputs (see Table 4 in Appendix B.5). We measure the epistemic token count of a response $y$ as $E(y) = \sum_{t \in \mathcal{T}} \mathrm{count}(t,\, y)$.
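A minimal sketch of the epistemic token count $E(y)$ is given below. The marker list comes from the definition of $\mathcal{T}$; the matching rule (case-insensitive, whole words only) and the function name are assumptions, since the exact tokenization used for counting is not specified here.

```python
import re

# The 10 epistemic markers in the set T.
EPISTEMIC_MARKERS = (
    "wait", "hmm", "perhaps", "maybe", "actually",
    "alternatively", "seems", "might", "likely", "check",
)


def epistemic_token_count(response: str) -> int:
    """E(y): total occurrences of markers in T within a response.

    Assumption: markers are matched as whole lowercase words, so inflected
    forms such as "checking" are not counted.
    """
    words = re.findall(r"[a-z]+", response.lower())
    return sum(words.count(marker) for marker in EPISTEMIC_MARKERS)


hedged = "Wait, maybe I should check. Hmm, the sum seems off; perhaps 12 is likely wrong."
confident = "Therefore the answer is 12."

print(epistemic_token_count(hedged), epistemic_token_count(confident))
```

Averaging this count over sampled responses gives an estimate of $\mathbb{E}[E(y)]$ for a given generation setting.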

Results

We analyze how different forms of conditioning information toward the correct answer affect the model’s reasoning behavior by comparing the average score, the average response length $\mathbb{E}[L(y)]$, and the epistemic token count $\mathbb{E}[E(y)]$ across the four settings. As shown in Table 1, both the response length and the epistemic token count decrease monotonically as $I(y^*;\, c \mid x)$ increases:

$$\mathbb{E}\bigl[L(y)\bigr]\Big|_{(1)} \;>\; \mathbb{E}\bigl[L(y)\bigr]\Big|_{(3)} \;>\; \mathbb{E}\bigl[L(y)\bigr]\Big|_{(4)} \;>\; \mathbb{E}\bigl[L(y)\bigr]\Big|_{(2)}, \tag{4}$$

and analogously for $\mathbb{E}[E(y)]$, confirming that richer conditioning information leads to more concise and confident reasoning.

Table 1: Comparison of response characteristics under varying levels of rich information.

| Setting | Avg. Score | Avg. Length | Epistemic Token Count |
| --- | --- | --- | --- |
| (1) Unguided | 0.30 | 13,054 | 182.5 |
| (2) Solution-Guided ($c = s$) | 0.98 | 1,873 | 8.8 |
| (3) Solution-Guided ($c = s_{\setminus\text{think}}$) | 0.78 | 12,036 | 159.8 |
| (4) Regeneration-Conditioned | 0.95 | 2,808 | 24.1 |

Specifically, unguided generation ($c = \emptyset$) produces substantially longer responses with the highest epistemic token counts. When the full solution $s$ is provided in (2), the model follows the given reasoning trace with high confidence, and its concise output can be viewed as a compressed representation of the reasoning in $s$. In (3), removing the <think> portion retains only $s_{\setminus\text{think}}$ (640 out of 13,054 response tokens), and both $\mathbb{E}[L(y)]$ and $\mathbb{E}[E(y)]$ increase again toward the unguided level, reflecting the substantial information loss. Setting (4), conditioning on the regenerated response $y_r$, yields intermediate values (lower than (3) but higher than (2)), indicating that $y_r$ preserves much of the informative structure of the full solution. Detailed per-token breakdowns are reported in Appendix B.1.
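The ordering in Eq. (4) can be checked directly against the Table 1 values. The short sketch below does this and also derives the number of epistemic tokens per 1,000 response tokens; that density is an illustrative quantity computed here from the table, not a metric reported in the paper, and the dictionary keys are shorthand labels.

```python
# (avg. score, avg. length, epistemic token count), copied from Table 1.
TABLE_1 = {
    "(1) unguided": (0.30, 13054, 182.5),
    "(2) solution-guided": (0.98, 1873, 8.8),
    "(3) solution-guided, no think": (0.78, 12036, 159.8),
    "(4) regeneration-conditioned": (0.95, 2808, 24.1),
}

# Settings sorted by increasing conditional mutual information, as in Eq. (3).
ORDER = [
    "(1) unguided",
    "(3) solution-guided, no think",
    "(4) regeneration-conditioned",
    "(2) solution-guided",
]


def strictly_decreasing(values):
    """True if every value is larger than the one that follows it."""
    return all(a > b for a, b in zip(values, values[1:]))


lengths = [TABLE_1[name][1] for name in ORDER]
counts = [TABLE_1[name][2] for name in ORDER]
# Derived quantity: epistemic tokens per 1,000 response tokens.
densities = [1000.0 * c / n for c, n in zip(counts, lengths)]

print("length ordering holds:", strictly_decreasing(lengths))
print("epistemic count ordering holds:", strictly_decreasing(counts))
for name, density in zip(ORDER, densities):
    print(f"{name}: {density:.2f} epistemic tokens per 1k tokens")
```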

Takeaway 1: Information Richness and Epistemic Verbalization

As the conditioning context $c$ becomes more informative and directly useful, the LLM generates answers more confidently and with fewer expressions of epistemic uncertainty.

4 Supervised Finetuning with Self-Distillation

A natural follow-up question is whether the suppression of epistemic verbalization under high $I(y^*;\, c \mid x)$ is merely stylistic or has a tangible impact on reasoning capability. To test this, we conduct off-policy self-distillation (SFT) using DeepSeek-R1-Distill-Qwen-7B (deepseek) on two datasets, each containing 800 correct responses:

• $\mathcal{D}_{\text{ug}}$: unguided responses ($c = \emptyset$), with high $\mathbb{E}[E(y)]$ and $\mathbb{E}[L(y)] \approx 12\text{k}$ tokens.
• $\mathcal{D}_{\text{sg}}$: solution-guided responses ($c = s$), with low $\mathbb{E}[E(y)]$ and $\mathbb{E}[L(y)] \approx 2\text{k}$ tokens.

Both datasets consist entirely of correct trajectories. The key difference lies in the epistemic density of the training signal. We evaluate the resulting checkpoints across multiple math benchmarks (examples from each dataset are presented in Appendix H.1).

Table 2: Math benchmark performance of the base model DeepSeek-R1-Distill-Qwen-7B and its SFT checkpoints trained on unguided and solution-guided datasets.

| DeepSeek-R1-Distill-Qwen-7B | AIME24 | AIME25 | AMC23 | MATH500 |
| --- | --- | --- | --- | --- |
| Base | 54.79 | 37.92 | 89.06 | 92.19 |
| SFT on $\mathcal{D}_{\text{ug}}$ | 51.04 | 40.00 | 87.66 | 90.93 |
| SFT on $\mathcal{D}_{\text{sg}}$ | 20.21 | 12.71 | 57.03 | 65.52 |

As shown in Table 2, training on $\mathcal{D}_{\text{sg}}$ leads to substantial degradation across all benchmarks, despite the dataset consisting of correct answers, whereas training on $\mathcal{D}_{\text{ug}}$ produces no significant performance change. This asymmetry arises because solution-guided responses are concise precisely due to the external context $s$. Using them as SFT targets forces the model to imitate a reasoning style that presupposes information unavailable at inference time, effectively suppressing the epistemic tokens that support autonomous exploration and error correction.
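To put the Table 2 numbers on a common scale, the sketch below computes the relative change of each SFT checkpoint with respect to the base model. The relative-change view and the variable names are illustrative choices made here; the scores themselves are copied from Table 2.

```python
BENCHMARKS = ("AIME24", "AIME25", "AMC23", "MATH500")

# Scores copied from Table 2.
TABLE_2 = {
    "base": (54.79, 37.92, 89.06, 92.19),
    "sft_ug": (51.04, 40.00, 87.66, 90.93),
    "sft_sg": (20.21, 12.71, 57.03, 65.52),
}


def relative_change(new_score, base_score):
    """Signed relative change with respect to the base score."""
    return (new_score - base_score) / base_score


changes = {
    name: {
        bench: relative_change(score, base)
        for bench, score, base in zip(BENCHMARKS, scores, TABLE_2["base"])
    }
    for name, scores in TABLE_2.items()
    if name != "base"
}

for name, per_bench in changes.items():
    summary = ", ".join(f"{bench}: {100 * delta:+.1f}%" for bench, delta in per_bench.items())
    print(f"{name}: {summary}")
```

Under this view the unguided checkpoint stays within a few percent of the base model on every benchmark, while the solution-guided checkpoint loses a large fraction of the base score, most visibly on the two AIME sets.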

Takeaway 2: Epistemic Suppression and Reasoning Performance

Even when trained on correct trajectories, excessively suppressing epistemic verbalization can substantially degrade reasoning performance.

5 On-Policy Self-Distillation

We now turn to on-policy self-distillation (SDPO; zhao2026self; opcd), where the model learns from dense reward signals provided by a self-teacher with access to the correct solution, based on its own responses. We compare GRPO with Reinforcement Learning via Self-Distillation (SDPO) on the DAPO-Math-17k dataset (dapo), using Qwen3-8B (qwen3) and DeepSeek-R1-Distill-Qwen-7B (deepseek-r1). Additional results for Qwen3-1.7B and Olmo-3-7B-Instruct (olmo) are in Appendix F.3. For each model, we track training score, response length, and OOD performance on AIME24 and AMC23. We fix the teacher to the initial policy, which performs better than a moving target (see Section 5.4 for a comparison).

The behavior of on-policy self-distillation depends on two factors: (i) the extent of epistemic verbalization already present in the base model, and (ii) the richness of the conditioning context $c$. To disentangle these, we compare GRPO and SDPO under two settings: $c = s$ (full solution) and $c = s_{\setminus\text{think}}$ (solution without <think> content).

5.1 DeepSeek-R1-Distill-Qwen-7B

Figure 3: On-policy self-distillation results for DeepSeek-R1-Distill-Qwen-7B. GRPO yields modest OOD gains with a slight increase in epistemic verbalization, whereas SDPO degrades both performance and epistemic token usage, particularly with $c = s$. Panels: (a) Training Score-Length Comparison; (b) AMC23 Score and Response Length; (c) AIME24 Score and Response Length; (d) Change in Epistemic Token Usage on AIME24.

DeepSeek-R1-Distill-Qwen-7B is a representative high-reasoning model that generates rich epistemic verbalizations within <think> tags, averaging about 180 epistemic tokens per response, as shown in Section 3, leading to lengthy yet high-performing reasoning outputs.

Training Performance

As shown in Figure 3a, GRPO training slightly increases $\mathbb{E}[L(y)]$ with a modest improvement in score. In contrast, SDPO with $c = s$ causes a sharp initial drop in both $\mathbb{E}[L(y)]$ and score. Performance gradually recovers but remains below GRPO throughout training. When the conditioning is reduced to $c = s_{\setminus\text{think}}$, the drop in $\mathbb{E}[L(y)]$ is attenuated and the score trajectory approaches that of GRPO, consistent with the relationship between $I(y^*;\, c \mid x)$ and epistemic suppression discussed in Section 3.

OOD Evaluation - AIME24, AMC23

Consistent with the training trends, GRPO yields modest gains on both OOD benchmarks (AIME24: 54.7 → 56.0; AMC23: 89.3 → 91.1, Figures 3b and 3c) with a slight increase in $\mathbb{E}[L(y)]$. SDPO with $c = s$ degrades performance substantially (approximately 40% on AIME24 and approximately 15% on AMC23). Reducing the conditioning to $c = s_{\setminus\text{think}}$ mitigates the drop, though performance still remains below the base model.

Reasoning Pattern

Figure 3d illustrates the epistemic token counts of the trained models. GRPO increases $\mathbb{E}[E(y)]$, whereas SDPO suppresses it more aggressively, consistent with the observed correlation between epistemic suppression and performance degradation throughout our analysis. Additional analyses on total versus epistemic token changes, as well as the number of uncertainty expressions per reasoning trace, evaluated using an LLM-as-Judge, are provided in Appendices B.4 and B.5. Furthermore, a comparison of the KL divergence of epistemic tokens with respect to the mean is presented in Appendix B.6. Example responses generated by GRPO and SDPO are included in Appendix H.2.

5.2 Qwen3-8B (Thinking Mode: ON)

Qwen3-8B is a hybrid reasoning model that can operate in either a thinking mode for deliberate reasoning or a non-thinking mode for direct responses. With thinking mode enabled, Qwen3-8B produces longer responses than DeepSeek-R1-Distill-Qwen-7B, with a higher number of epistemic tokens (see Appendix B.2 for a comparison of epistemic tokens across models).

Figure 4: On-policy self-distillation results for Qwen3-8B (Thinking Mode: ON). Both GRPO and SDPO reduce response length and epistemic verbalization, but SDPO’s more aggressive suppression leads to greater degradation in OOD performance, particularly on AIME24. Panels: (a) Training Score-Length Comparison; (b) AMC23 Score and Response Length; (c) AIME24 Score and Response Length; (d) Change in Epistemic Token Usage on AIME24.

Training Performance

As shown in Figure 4a, $\mathbb{E}[L(y)]$ decreases under both GRPO and SDPO, with SDPO exhibiting a larger reduction and a correspondingly larger performance drop. Notably, $\mathbb{E}[L(y)]$ first drops sharply, then increases slightly. Since the teacher policy is fixed as the reference policy, shortening the response by approximately 900 tokens reduces the informativeness of $c$, i.e., decreases $I(y^*;\, c \mid x)$. As the context becomes less informative, the model compensates by increasing epistemic verbalization, causing the length to partially recover.

OOD Evaluation - AIME24, AMC23

The gap becomes more pronounced on OOD benchmarks (Figures 4b and 4c): GRPO maintains largely stable performance with gradually decreasing $\mathbb{E}[L(y)]$, whereas SDPO falls below the base model, particularly with $c = s$. Notably, although GRPO and SDPO with $c = s_{\setminus\text{think}}$ achieve comparable training performance, their OOD results diverge, especially on the more challenging AIME24, where SDPO with $c = s_{\setminus\text{think}}$ shows progressive performance degradation as training proceeds.

Reasoning Pattern

Figure 4d reveals that both methods reduce $\mathbb{E}[E(y)]$ relative to the base model, with SDPO more aggressively so. This suggests that Qwen3-8B originally generates more epistemic verbalization than necessary. While both methods mitigate this redundancy, overly aggressive suppression risks removing epistemic signals that carry useful reasoning information.

5.3 Qwen3-8B (Thinking Mode: OFF)

Figure 5: On-policy self-distillation results for Qwen3-8B (Thinking Mode: OFF). GRPO rapidly increases response length via epistemic verbalization and achieves strong training gains, while SDPO reduces response length and struggles to improve, with slight OOD degradation on AIME24. Panels: (a) Training Score-Length Comparison; (b) AIME24 Score and Response Length.

When Qwen3-8B is used without thinking mode, the <think> tag is absent, so we consider only $c = s$. The model initially produces much shorter responses and exhibits significantly lower performance. GRPO rapidly increases $\mathbb{E}[L(y)]$ by promoting epistemic verbalization (as shown in Appendix F.1), quickly achieving a high training score. In contrast, SDPO reduces $\mathbb{E}[L(y)]$ and improves much more slowly. Even when the training score slightly increases, as shown in Figure 5b, performance on AIME24 slightly declines ($0.25 \to 0.23$), further illustrating the cost of epistemic suppression under self-distillation.

5.4 Ablation Study: Fixed vs. Moving Target Teacher

In naive on-policy self-distillation, the teacher and student share a continuously updated policy, making the teacher a moving target that can introduce training instability (zhao2026self; opcd). To mitigate this, SDPO uses an EMA-smoothed teacher (EMA rate: 0.05). However, we find that setting the EMA rate to 0.0 (i.e., fixing the teacher to the initial policy) yields better performance, and thus Section 5 follows this setting.

Figure 6: Fixed vs. moving target teacher for DeepSeek-R1-Distill-Qwen-7B. Even slow EMA updates (rate 0.05) amplify epistemic suppression via a feedback loop, causing greater performance degradation than a fixed teacher. Panels: (a) DeepSeek-Distill-7B Training Comparison; (b) DeepSeek-Distill-7B AIME24 Comparison.

Figure 6a shows additional comparison results when the teacher is updated during training. As shown, even slow updates (e.g., rate 0.05) lead to a sharper reduction in response length, resulting in larger performance degradation. This can be interpreted as a feedback loop in self-distillation. The model is trained to produce increasingly confident outputs, and when this is used as the teacher, it generates even more confident responses, amplifying the effect over iterations. Further ablations on learning rate and top-$k$ logits are in Appendix G.
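The difference between a fixed and an EMA-smoothed teacher can be sketched with a toy one-parameter "policy". The update rule below, teacher ← (1 − rate) · teacher + rate · student, is an assumed form of the EMA chosen so that rate 0.0 freezes the teacher at the initial policy, as described in this section; the drift of the student and all function names are hypothetical.

```python
def ema_update(teacher, student, rate):
    """Assumed EMA rule: teacher <- (1 - rate) * teacher + rate * student."""
    return [(1.0 - rate) * t + rate * s for t, s in zip(teacher, student)]


def simulate(rate, steps=100, student_step=0.1):
    """Toy run: the student drifts by a constant amount each step.

    Returns the final (teacher, student) parameter values. This does not
    model the student being pulled toward the teacher; it only shows how
    far the teacher follows the student for a given EMA rate.
    """
    teacher = [0.0]
    student = [0.0]
    for _ in range(steps):
        student = [s + student_step for s in student]
        teacher = ema_update(teacher, student, rate)
    return teacher[0], student[0]


fixed_teacher, final_student = simulate(rate=0.0)
moving_teacher, _ = simulate(rate=0.05)

print(f"student after training:   {final_student:.2f}")
print(f"teacher with rate 0.0:    {fixed_teacher:.2f}")
print(f"teacher with rate 0.05:   {moving_teacher:.2f}")
```

With rate 0.0 the teacher never leaves the initial policy, whereas even a small nonzero rate lets it track the student with a lag, which is the precondition for the feedback loop described above.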

Takeaway 3: Epistemic Verbalization Changes and Performance in On-Policy Self-Distillation

As the teacher context $c$ becomes more information-rich, on-policy self-distillation reduces epistemic verbalization and shortens responses. This effect varies depending on the level of uncertainty originally expressed by the base model.

6 Relationship Between Task Coverage, Epistemic Verbalization and Generalization Ability

Across the off-policy and on-policy settings analyzed above, self-distillation consistently produces more confident responses with reduced $\mathbb{E}[E(y)]$. This aligns with the findings of SDPO, which reports that SDPO learns to reason concisely: on Science Q&A (Chemistry, Physics, Biology, and Materials Science) (sciknoweval), tool use (toolalpaca), and LiveCodeBench v6 (livecodebench), SDPO achieves higher accuracy than GRPO while producing substantially shorter outputs with fewer epistemic markers.

In other words, in these domains, self-distillation suppresses epistemic verbalization and improves performance simultaneously. The key question is why the same mechanism leads to performance degradation in our math-focused setup. We hypothesize that the answer lies in differences in task coverage between the training and evaluation distributions.

6.1 Comparison of Task Coverage

To test this hypothesis, we compare the dataset characteristics of the settings where SDPO outperformed GRPO against our experimental setup. As shown in Table 3, the Chemistry dataset, despite its large size, draws from only six main problem types that differ primarily in surface details rather than underlying structure. LiveCodeBench v6 contains diverse problems but only 131 in total, leading to repeated exposure during training with identical train/eval splits. In contrast, DAPO-Math-17k exposes the model to 14,000 distinct problems (78% of the 25,600 samples drawn over 100 steps, due to repeated sampling), spanning a broad, non-overlapping range of problem types, and evaluation is performed on unseen problem types.

Table 3: Comparison of the total number of problems, problem composition, and train/evaluation splits in Science Q&A, LiveCodeBench v6, and our experiments.

| Domain | Dataset Analysis |
| --- | --- |
| Science Q&A (Chemistry) | Total 2,400 questions: Reaction Balancing (300), Molecular Descriptor Counting (300), Molecular Weight Calculation (600), Property Prediction (e.g., logS) (500), Precursor/Reactant Selection (Retrosynthesis) (300), Product Prediction (Organic Reactions) (400). Train/Eval split: 90% for training, 10% for evaluation. |
| LiveCodeBench v6 | Train/Eval split: Total 131 questions for training and all 131 questions are used for both training and evaluation. Only 50% of the public test cases are used during training, while the full set including hidden test cases is used for evaluation. |
| DAPO-Math-17k | Train/Eval split: Total 14,000 questions for training and evaluation is conducted on standard benchmarks (AIME24/25, AMC23, MATH 500) using questions not part of the training data. |

6.2 Relationship Between Task Coverage and Learning Performance

To further investigate the interplay between task coverage and generalization, we vary the number of training questions $|\mathcal{D}| \in \{1, 8, 64, 128, 512\}$ from DAPO-Math-17k and train with both GRPO and SDPO. All experiments use Qwen3-8B (Thinking Mode OFF).

Training Logs

As shown in Figure 7, GRPO and SDPO exhibit distinct training dynamics as $|\mathcal{D}|$ varies. When $|\mathcal{D}| \leq 128$, SDPO quickly achieves high scores while reducing $\mathbb{E}[L(y)]$ by up to $8\times$, indicating higher training efficiency on a small task set. However, at $|\mathcal{D}| = 512$, further reductions in $\mathbb{E}[L(y)]$ begin to hurt the training score relative to GRPO, whose $\mathbb{E}[L(y)]$ gradually increases with $|\mathcal{D}|$.

Figure 7: Training score and response length comparison between GRPO and SDPO for $|\mathcal{D}| \in \{1, 8, 64, 128, 512\}$.

This difference can be interpreted through task coverage. As $|\mathcal{D}|$ grows, the model must accommodate a broader range of reasoning patterns. GRPO addresses this by increasing $\mathbb{E}[E(y)]$, allowing the model to express greater uncertainty and adapt its reasoning accordingly. SDPO instead encourages confident, concise responses, which is effective when task coverage is small but limiting when the problem set becomes larger and more diverse.

Figure 8: Evaluation performance on AIME24 and MATH500 and response length as $|\mathcal{D}|$ varies over {1, 8, 64, 128, 512}.

OOD Evaluation - AIME24, MATH500

The distinction between GRPO and SDPO becomes more pronounced on OOD benchmarks (Figure 8). Under GRPO, performance scales consistently with $|\mathcal{D}|$: $|\mathcal{D}| = 1$ converges quickly but soon stops improving, while larger $|\mathcal{D}|$ yields progressively higher final scores accompanied by increasing $\mathbb{E}[L(y)]$. Under SDPO, the pattern reverses: smaller $|\mathcal{D}|$ leads to more severe OOD degradation. Even at the largest $|\mathcal{D}|$ (DAPO setting), SDPO still underperforms the base model. Changes in GRPO and SDPO reasoning patterns across task coverage are provided in Appendix B.3.

Takeaway 4: Epistemic Verbalization Depends on Task Generalization

The value of epistemic verbalization scales with generalization demands: largely redundant for familiar, repetitive tasks (small $|\mathcal{D}|$) and removable for efficiency, but increasingly important as task diversity grows.

7 Conclusion

In this work, we provide an empirical analysis of on-policy self-distillation, motivated by an information-theoretic view of conditioning context richness. Our experiments suggest that the effectiveness of self-distillation is closely tied to how information is provided to the model and how the model expresses uncertainty during reasoning. We observe that self-distillation tends to produce answers with higher confidence and shorter reasoning traces. While this effect enables more compact reasoning and can quickly improve in-domain performance when task coverage is limited, it becomes less effective when task coverage is broad and may even harm OOD performance.

Importantly, the fact that performance can degrade even when a mathematically sound objective function is designed to elicit correct CoT reasoning suggests that the choice of optimization objective alone may not be sufficient for preserving robust reasoning, and that we need to pay closer attention to how training reshapes the model’s reasoning behavior, beyond answer correctness. We hope that our analysis contributes to a deeper understanding of self-distillation and other LLM post-training methods, and provides insights for developing more robust training strategies.

Acknowledgments

We thank Li Dong, Tianzhu Ye, and Sojeong Rhee for their valuable discussions. This work was supported by Microsoft Research and partly by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2025-00557589).

Ethics Statement

This work aims to analyze the causes of reasoning performance degradation during self-distillation of LLMs. All experiments were conducted using publicly available models (Qwen3, DeepSeek-R1-Distill-Qwen, OLMo) and datasets (DAPO-Math-17k, AIME, AMC, MATH500). As this study does not involve any new data collection or human subjects research, concerns related to privacy or human subjects ethics do not apply.

A key finding of this work is that self-distillation can suppress epistemic verbalization, thereby diminishing a model’s ability to express uncertainty during reasoning. This observation carries important implications for the design of LLM post-training methods. The ability to appropriately express uncertainty is directly related to model reliability and safety. Training procedures that indiscriminately suppress such expressions may increase the risk of models producing incorrect answers with unwarranted confidence. We hope that our analysis contributes to the development of more robust and trustworthy reasoning models.

Disclosure of LLM Usage

We used LLMs for conducting the LLM-as-a-Judge experiments described in Appendix B.5, refining plot designs, and improving the clarity of our writing. LLMs were not involved in research ideation, and their role in writing was limited to polishing author-written text.

References

Appendix A Related Works

Self-Distillation for LLM Reasoning.

Self-distillation, where a single model acts as both teacher and student under asymmetric conditioning (self-distillation), has recently been widely adopted for LLM post-training. SDPO (SDPO) conditions the self-teacher on its previously generated correct trajectories or rich environment feedback to provide dense credit assignment, while OPSD (zhao2026self) leverages ground-truth solutions as privileged information, achieving strong efficiency gains in both cases. Our analysis builds upon the SDPO codebase, and a detailed comparison with OPSD is provided in Appendix E. privileged, empo2, and opcd further extend this paradigm to settings that incorporate additional signals, such as privileged information, self-generated hints derived from environmental interactions, or contextual guidance, with the goal of internalizing this external knowledge into the model’s parameters.

While these works demonstrate the effectiveness of self-distillation, they do not investigate when or why self-distillation may degrade reasoning performance, especially in scenarios where a model must solve problems entirely on its own, without interaction with the external environment, which is the central question of our work.

Uncertainty Expression in LLM Reasoning.

understanding introduce an information-theoretic framework that identifies epistemic verbalization, the explicit externalization of uncertainty, as informationally necessary for robust reasoning. Our work answers a question this framework cannot readily explain: *Under what conditions is this behavior encouraged or suppressed? And if epistemic verbalization is necessary, why does suppressing it lead to better performance in some domains, such as Chemistry, while degrading performance in others?* We trace this to two factors: conditioning context richness (Section 3) and task coverage (Section 6), supported by a systematic analysis of recent on-policy self-distillation methods across various models with varying pretrained capabilities.

Reasoning Compression.

Several recent methods aim to shorten reasoning traces while preserving accuracy. GFPO (sample-more) filters training rollouts by length and token efficiency within enlarged sample groups, reducing GRPO’s length inflation by 46–85% on STEM and coding benchmarks. OPSDC (OPSDC) uses a conciseness instruction as privileged context, achieving 57–59% token reduction on MATH-500. ConPress (ConPress) exploits multi-question contextual pressure to elicit self-compressed traces, yielding 30–60% reductions via SFT. Accordion-Thinking (Accordion-Thinking) trains models to produce periodic step summaries that can be discarded, achieving $3\times$ throughput with a gap-vanishing phenomenon during RL training. CEEH (CEEH) proposes difficulty-aware entropy regularization to prevent entropy collapse under length constraints.

Our analysis complements these efforts by showing that compression which suppresses uncertainty expression, even when producing correct answers, can harm OOD generalization, and that this effect is modulated by task coverage. Therefore, the key challenge is to filter out non-informative content while retaining epistemic expressions that enable iterative belief refinement, rather than blindly compressing the reasoning process.

Appendix B Additional Analysis of Epistemic Verbalization

B.1 Per-Token Analysis of Epistemic Verbalization

In Table 1 of Section 3, we compared the average number of ten epistemic tokens per response. Figure 9 further extends this analysis by examining how the average per-response count of each individual token changes under varying levels of conditioning information. When examining the per-token counts, all tokens exhibit a consistent trend:

$$\mathbb{E}\bigl[E(y)\bigr]\Big|_{(1)} > \mathbb{E}\bigl[E(y)\bigr]\Big|_{(3)} > \mathbb{E}\bigl[E(y)\bigr]\Big|_{(4)} > \mathbb{E}\bigl[E(y)\bigr]\Big|_{(2)}, \tag{5}$$

where tokens such as *wait*, *maybe*, and *perhaps* are particularly prominent.
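The per-token averages behind this ordering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token list below contains the nine tokens named in this appendix and stands in for the ten-token set defined in Section 3, and matching is done on lowercased whole words, which may differ from the counting procedure actually used.

```python
import re
from collections import Counter

# Assumption: stand-in for the ten-token set of Section 3 (only nine are named here).
EPISTEMIC_TOKENS = (
    "wait", "hmm", "maybe", "perhaps", "actually",
    "alternatively", "might", "seems", "check",
)
WORD_RE = re.compile(r"[A-Za-z]+")


def per_token_mean(responses):
    """Average number of occurrences per response for each epistemic token."""
    if not responses:
        raise ValueError("need at least one response")
    totals = Counter()
    for text in responses:
        # Lowercased whole-word counts for this response.
        words = Counter(w.lower() for w in WORD_RE.findall(text))
        for tok in EPISTEMIC_TOKENS:
            totals[tok] += words[tok]
    return {tok: totals[tok] / len(responses) for tok in EPISTEMIC_TOKENS}


def mean_epistemic_count(responses):
    """Aggregate estimate of E[E(y)]: sum of the per-token averages."""
    return sum(per_token_mean(responses).values())
```

Running `per_token_mean` once per generation setting and comparing the resulting dictionaries gives the kind of per-token breakdown plotted in Figure 9.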

Figure 9: Per-token breakdown of epistemic token usage across the four generation settings. Each bar represents the average number of occurrences per response for an individual epistemic token. All tokens follow the same ordering as the aggregate trend, with *wait*, *maybe*, and *perhaps* showing the largest variation across settings.

B.2 Comparison of Epistemic Token Usage Across Models

Following the analysis of DeepSeek-R1-Distill-Qwen-7B (DeepSeek-Distill-7B) in Section 3, we further compare epistemic token usage across three settings: DeepSeek-Distill-7B, Qwen3-8B (Thinking Mode: ON), and Qwen3-8B (Thinking Mode: OFF).

As shown in Figure 10, both DeepSeek-Distill-7B and Qwen3-8B (Thinking Mode: ON) produce substantially more epistemic tokens than Qwen3-8B (Thinking Mode: OFF). While the two thinking-enabled models share a similar tendency to express uncertainty, they differ in their preferred epistemic tokens. For instance, DeepSeek-Distill-7B frequently uses *wait* and employs *perhaps* and *maybe* at comparable rates, whereas Qwen3-8B uses *perhaps* relatively less and favors *maybe*. Qwen3-8B also uses *alternatively* and *check* far more than DeepSeek-Distill-7B, and overall embeds a greater amount of uncertainty within its reasoning.

Extending the discussion in Section 3, we also observe that Qwen3-8B generates far fewer epistemic tokens under solution-guided generation than under unguided generation. Across all three settings, Qwen3-8B (Thinking Mode: ON) produces the most epistemic tokens, followed by DeepSeek-Distill-7B, and then Qwen3-8B (Thinking Mode: OFF).

Figure 10: Comparison of epistemic token usage across DeepSeek-R1-Distill-Qwen-7B, Qwen3-8B (thinking enabled), and Qwen3-8B (thinking disabled).

B.3 Analysis of GRPO and SDPO Reasoning Patterns by Task Coverage

To provide a more in-depth analysis of the results in Figure 8 (Section 6.2), we compare the change in epistemic token counts relative to the base model on AIME24 across six training configurations: GRPO and SDPO, each with $|D| \in \{1, 64, 512\}$. Figure 11 shows that GRPO tends to increase epistemic token usage relative to the base model as $|D|$ grows, whereas SDPO reduces epistemic token usage, with smaller $|D|$ leading to a greater reduction.

Figure 11: Change in epistemic token counts relative to the base model on AIME24 for GRPO and SDPO with $|D| \in \{1, 64, 512\}$.

B.4 Comparative Analysis of Epistemic and Overall Word Usage

In addition to analyzing epistemic tokens, we examine how the overall word distribution in model responses shifts across the base model, GRPO-trained model, and SDPO-trained model. We first extract English words from each response using regular expressions and construct a reference vocabulary $\mathcal{V}$ from all unique words appearing in the base model responses. For each word $w \in \mathcal{V}$ and model $m \in \{\text{base}, \text{GRPO}, \text{SDPO}\}$, we compute the average per-response frequency:

$$\mu_{m}(w) = \frac{1}{N}\sum_{i=1}^{N} f\bigl(w, r_{i}^{(m)}\bigr), \tag{6}$$

where $f(w, r)$ denotes the number of occurrences of word $w$ in response $r$. We then define the frequency shift $\Delta_{m}(w) = \mu_{m}(w) - \mu_{\text{base}}(w)$ to capture how much each word’s usage changes relative to the base model.

Using these frequency shifts, we compare the behavior of the predefined set of epistemic tokens $T$ (as defined in Section 3) against the rest of the vocabulary. Specifically, we compare $\{\Delta_{m}(w) \mid w \in T\}$ against $\{\Delta_{m}(w) \mid w \in \mathcal{V} \setminus T\}$ to determine whether the changes in epistemic token usage are disproportionately large relative to the overall shift in word distribution. This allows us to distinguish whether the training procedure specifically amplifies epistemic expressions or merely induces a uniform change across the vocabulary.
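The frequency-shift computation can be sketched as follows. This is a minimal illustration, not the analysis script itself: the regular expression, the lowercasing step, and the function names are assumptions introduced here, and each model is assumed to have its own list of responses.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z]+")  # assumption: "English words" = ASCII letter runs


def word_counts(response):
    """f(w, r): occurrences of each lowercased word in one response."""
    return Counter(w.lower() for w in WORD_RE.findall(response))


def mean_frequency(responses, vocab):
    """mu_m(w): average per-response count of each vocabulary word."""
    totals = Counter()
    for r in responses:
        totals.update(word_counts(r))
    return {w: totals[w] / len(responses) for w in vocab}


def frequency_shift(base_responses, model_responses):
    """Delta_m(w) = mu_m(w) - mu_base(w), over the base-model vocabulary V."""
    vocab = set()
    for r in base_responses:
        vocab.update(word_counts(r))
    mu_base = mean_frequency(base_responses, vocab)
    mu_model = mean_frequency(model_responses, vocab)
    return {w: mu_model[w] - mu_base[w] for w in vocab}


def split_shifts(shifts, epistemic_tokens):
    """Separate the shifts of epistemic tokens T from those of V \\ T."""
    in_t = {w: d for w, d in shifts.items() if w in epistemic_tokens}
    rest = {w: d for w, d in shifts.items() if w not in epistemic_tokens}
    return in_t, rest
```

Words that appear only in the trained model's responses are ignored, because the reference vocabulary is built from the base-model responses alone.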

Figure 12: Comparison of per-word frequency shifts ($\Delta_{m}(w)$) for epistemic tokens ($T$) versus non-epistemic tokens ($\mathcal{V} \setminus T$) in GRPO- and SDPO-trained models relative to the base model. Epistemic tokens exhibit disproportionately large frequency changes compared to the general vocabulary, indicating that the training procedures specifically amplify or suppress epistemic expressions rather than inducing uniform changes across the word distribution.

As shown in Figure 12, the disparity between the two panels confirms that the observed epistemic shifts are not byproducts of general vocabulary drift. Across all model-method pairs, the average per-word shift over the full vocabulary remains near zero ($|\Delta| < 1$), whereas epistemic tokens exhibit shifts up to $30\times$–$40\times$ larger in magnitude (e.g., $-11.9$ and $-12.2$ for SDPO). This indicates that both training objectives selectively target epistemic expressions rather than uniformly perturbing the output distribution.

B.5 Uncertainty Annotation via LLM-as-Judge

We employed epistemic markers as indicators of uncertainty expressions, as they serve as practical and interpretable cues for regions where uncertainty externalization is likely to occur. To more comprehensively capture diverse forms of uncertainty expression, we additionally leveraged GPT-5.4 (gpt5) as an automated annotator. Specifically, GPT-5.4 was prompted to identify and count uncertainty expressions in each model-generated response. The prompt used for this annotation is shown below.

Prompt for Uncertainty Annotation

You are an expert linguistic annotator. Your task is to count uncertainty expressions in a model-generated math reasoning response.

Uncertainty expressions include:
• Hedging expressions: expressions that indicate uncertainty (e.g., wait, hmm, perhaps, maybe)
• Doubt markers: expressions that explicitly convey uncertainty (e.g., I’m not sure, not certain, double-check, verify this when used to express doubt)

Do NOT count:
• Confident statements such as “we know”, “clearly”
• Mathematical conditionals such as “if $x=0$, then…”

Return ONLY a JSON object with exactly these fields:
{ "count": <integer, total number of uncertainty expressions>, "examples": [<up to 5 short quoted phrases from the text>] }
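A reply in the format requested by this prompt can be validated and aggregated with a short helper. The sketch below is an assumption about how such replies might be post-processed; the function names are hypothetical and the call to the annotator model is deliberately left out.

```python
import json
from statistics import mean


def parse_annotation(raw: str) -> tuple[int, list[str]]:
    """Validate one annotator reply and return (count, examples)."""
    obj = json.loads(raw)
    if not isinstance(obj, dict) or set(obj) != {"count", "examples"}:
        raise ValueError("expected exactly the fields 'count' and 'examples'")
    count, examples = obj["count"], obj["examples"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(count, bool) or not isinstance(count, int) or count < 0:
        raise ValueError("'count' must be a non-negative integer")
    if not isinstance(examples, list) or len(examples) > 5:
        raise ValueError("'examples' must be a list of at most 5 phrases")
    return count, [str(e) for e in examples]


def average_count(raw_replies):
    """Average number of uncertainty expressions per response."""
    return mean(parse_annotation(r)[0] for r in raw_replies)
```

Averaging the parsed counts per method yields a single per-method number of the kind reported for the LLM-as-Judge annotation.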

Table 4: Representative examples of uncertainty expressions identified by GPT-5.4. Expressions are drawn from model-generated reasoning traces across all evaluated methods.

Uncertainty Expression
• “maybe there’s a simpler way”
• “Wait, hold on. Is that correct?”
• “I think the key here is to find $s$ and $t$ first, right?”
• “I’m not entirely sure”
• “perhaps there’s a relationship here”
• “Hmm, that seems contradictory”
• “perhaps I made a miscalculation”
• “it seems a bit tricky”
• “Wait, but this might get complicated”
• “it seems less likely”
• “maybe we can assume some symmetry”
• “Alternatively, perhaps it’s better to think of it…”
• “maybe we don’t consider $n=0$”
• “might not satisfy the surface area constraint”
• “maybe I need another approach”
• “actually, no. Let me think again.”

Table 4 presents representative examples of uncertainty expressions identified by the annotator. The majority of identified expressions coincide with the epistemic tokens examined in the previous analysis (e.g., *maybe*, *perhaps*, *wait*), while others manifest as multi-token phrases such as “I’m not entirely sure” that cannot be captured by single-token frequency alone. This confirms that epistemic markers serve as a reliable proxy for uncertainty externalization, while the LLM-as-Judge approach provides complementary coverage over more diverse expression forms.

Table 5: Average number of uncertainty expressions per response, as annotated by GPT-5.4.

| Method | Avg. Unc. Expr. |
|---|---|
| Baseline | 21.36 |
| GRPO | 24.36 |
| SDPO | 9.18 |

Figure 13: Box plot of uncertainty expression counts across methods.

Table 5 reports the average number of such expressions per response for each method. When evaluated through LLM-as-Judge annotation, SDPO again substantially reduces uncertainty expressions compared to both the baseline and GRPO, consistent with the token-level analysis. Figure 13 complements this with a box plot comparison across methods: notably, GRPO exhibits considerably greater variance than the baseline, suggesting that RLVR can amplify inconsistency in hedging behavior. In contrast, SDPO maintains consistently low counts with a narrow spread, confirming that on-policy self-distillation suppresses unnecessary hedging and self-doubt during reasoning.

B.6 Student–Teacher KL Divergence on Epistemic Tokens

We additionally measure the per-token KL divergence between the student $\pi_{\theta}(\cdot \mid x, y_{<t})$ and the self-teacher $\pi_{\theta}(\cdot \mid x, c, y_{<t})$ for each surface-form variant of the epistemic tokens in $\mathcal{T}$. As shown in Table 6, every variant exceeds the corpus-wide mean KL of $0.076$, ranging from $1.9\times$ (‘Wait’) to $6.8\times$ (‘ Hmm’). Notably, sentence-initial variants (e.g., ‘ Hmm’ at $6.8\times$, ‘ wait’ at $3.8\times$) tend to exhibit larger divergence than mid-sentence forms, suggesting that the self-teacher most strongly disagrees at positions where the student initiates epistemic deliberation. This provides a quantitative explanation for the suppression of epistemic verbalization under SDPO.
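The ratio reported for each variant is its mean KL divided by the corpus-wide mean. A minimal sketch of both quantities follows; the direction KL(student ‖ teacher) and the explicit probability vectors are assumptions made for illustration, since the text does not fix the direction and a real measurement would read next-token distributions from the model.

```python
import math

CORPUS_MEAN_KL = 0.076  # corpus-wide mean KL quoted in the text


def kl_divergence(student, teacher, eps=1e-12):
    """KL(student || teacher) for two next-token distributions at one position.

    Both arguments are probability vectors over the same vocabulary.
    The direction is an assumption of this sketch.
    """
    return sum(
        p * math.log(p / (q + eps))
        for p, q in zip(student, teacher)
        if p > 0.0
    )


def ratio_to_corpus_mean(mean_kl, corpus_mean=CORPUS_MEAN_KL):
    """How many times larger a variant's mean KL is than the corpus mean."""
    return mean_kl / corpus_mean
```

For example, a mean KL of 0.516 against the corpus mean of 0.076 gives a ratio of about 6.8, and 0.146 gives about 1.9, matching the two extremes quoted above.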

Table 6: KL divergence between student and self-teacher for epistemic tokens $\mathcal{T}$. The corpus-wide mean KL across all 1,642 frequent tokens is $0.076$.

| Token | Variant | Mean KL | Count | Ratio |
|---|---|---|---|---|
| hmm | ‘ Hmm’ | 0.516 | 542 | 6.8× |
| | ‘Hmm’ | 0.337 | 887 | 4.4× |
| maybe | ‘Maybe’ | 0.432 | 52 | 5.7× |
| | ‘ Maybe’ | 0.192 | 937 | 2.5× |
| | ‘ maybe’ | 0.173 | 4,676 | 2.3× |
| actually | ‘ actually’ | 0.188 | 1,134 | 2.5× |
| alternatively | ‘ Alternatively’ | 0.194 | 179 | 2.6× |
| | ‘Alternatively’ | 0.175 | 3,471 | 2.3× |
| might | ‘ might’ | 0.171 | 684 | 2.3× |
| perhaps | ‘Perhaps’ | 0.255 | 54 | 3.4× |
| | ‘ perhaps’ | 0.165 | 4,173 | 2.2× |
| wait | ‘ wait’ | 0.285 | 1,066 | 3.8× |
| | ‘ Wait’ | 0.196 | 1,314 | 2.6× |
| | ‘Wait’ | 0.146 | 15,000 | 1.9× |
| seems | ‘ seems’ | 0.152 | 904 | 2.0× |

Appendix C Can Uncertainty-Encouraging Prompts Mitigate the Suppression?

We further investigate whether prompting the teacher to produce more verbose reasoning with explicit expressions of uncertainty can mitigate the performance degradation caused by the suppression of uncertainty expressions. To this end, we design a modified regeneration prompt that encourages detailed, step-by-step reasoning with uncertainty, and repeat the reasoning behavior analysis (Section 3) and on-policy self-distillation experiments (Section 5) using this prompt.

Prompt for unguided generation

{question} Please reason step by step, and put your final answer within \boxed{}.

Regeneration prompt (following the prompt in SDPO)

{question} Please reason step by step, and put your final answer within \boxed{}. Correct solution: {previously correct solution} Correctly solve the original question.

Regeneration prompt with encouraging uncertainty

{question} Please reason step by step, and put your final answer within \boxed{}. Correct solution: {previously correct solution} Solve the original question from scratch with a full, detailed reasoning trace, without skipping any steps, and include as much uncertainty or ambiguity in the explanation.

Table 7: Response characteristics of solution-guided generation with and without the uncertainty-encouraging prompt.

| Setting | Avg. Length | Epistemic Token Count |
|---|---|---|
| Unguided | 13,054 | 182.5 |
| Solution-Guided (vanilla) | 1,873 | 8.8 |
| Solution-Guided (w/ encouraged uncertainty) | 3,098 | 31.4 |

[Figure 14 panels: (a) DeepSeek-Distill-7B Training Comparison; (b) DeepSeek-Distill-7B AIME24 Comparison]

Figure 14: Training dynamics and AIME24 evaluation of on-policy self-distillation using the uncertainty-encouraging regeneration prompt. (a) Training score and response length over training steps. (b) AIME24 Acc@16 and response length over training steps. While the uncertainty-encouraging prompt partially mitigates response shortening, the overall performance degradation persists.

Table 7 shows that the uncertainty-encouraging prompt partially recovers the suppressed reasoning behavior: compared to the vanilla solution-guided prompt, it produces longer responses (3,098 vs. 1,873 tokens) and more epistemic expressions (31.4 vs. 8.8). However, both metrics remain far below those of unguided generation, indicating that the presence of a reference solution fundamentally constrains the model’s reasoning verbosity regardless of the prompt instruction.

Figure 14 shows that the uncertainty-encouraging prompt leads to slightly higher training scores and longer responses compared to the vanilla setting, and this improvement carries over to AIME24 evaluation. Nevertheless, both variants still suffer from substantial performance degradation, suggesting that prompting alone is insufficient to resolve the fundamental issue of uncertainty suppression in solution-guided generation.

Appendix D Experimental Details

Training

For GRPO and SDPO training, we built upon the SDPO implementation (https://github.com/lasgroup/SDPO) and additionally incorporated the DAPO-Math-17k dataset. The original DAPO-Math-17k dataset uses the following prompt format:

Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\n{question}\nRemember to put your answer on its own line after “Answer:”.

We replaced this with a simpler format:

{question}\nPlease reason step by step, and put your final answer within \boxed{}.

as we observed that this format consistently yielded higher evaluation performance. For reward verification, we used the scoring function from the verl framework, which extracts the answer from the \boxed{} expression and verifies correctness via exact match followed by mathematical equivalence checking using math-verify, adapted from EleutherAI’s lm-evaluation-harness (eval-harness).
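The first stage of this check, extracting the boxed answer and comparing by exact match, can be illustrated with a small standalone sketch. It is not the verl scoring function: the helper names and the whitespace normalization are assumptions, and the math-verify equivalence stage is omitted entirely.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward, tracking brace depth so nested braces stay intact.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: no complete boxed expression


def exact_match_reward(response: str, ground_truth: str) -> float:
    """1.0 if the boxed answer equals the reference after removing whitespace."""
    answer = extract_last_boxed(response)
    if answer is None:
        return 0.0
    normalize = lambda s: "".join(s.split())
    return 1.0 if normalize(answer) == normalize(ground_truth) else 0.0
```

Answers that are mathematically equal but written differently, such as `0.5` and `\frac{1}{2}`, receive 0.0 from this sketch; handling those cases is the role of the equivalence check.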

The GRPO and SDPO training hyperparameters are listed in Tables 8, 9, and 10. For the experiments on the relationship between task coverage and learning performance in Figure 8, we reduced the question batch size to 64 because these runs use smaller training question sets.

Table 8: Common hyperparameters shared by GRPO and SDPO.

| Category | Parameter | Value |
|---|---|---|
| Data | Max. prompt length | 2048 |
| | Max. response length | 20480 |
| Batching | Question batch size | 256 |
| | Mini batch size | 64 or 128 |
| | Number of rollouts | 8 |
| Rollout | Inference engine | vllm |
| | Temperature | 1.0 |
| Training | Optimizer | AdamW |
| | Warmup steps | 10 |
| | Weight decay | 0.01 |
| | Gradient clip norm | 1.0 |

Table 9: GRPO-specific hyperparameters.

| Category | Parameter | Value |
|---|---|---|
| Loss | $\epsilon$-high | 0.28 |
| | Rollout IS clip | 2 |
| | KL coefficient ($\lambda$) | 0.0 |
| Training | Learning rate | $1\times 10^{-6}$ |

Table 10: SDPO-specific hyperparameters.

| Category | Parameter | Value |
|---|---|---|
| Loss | Distillation divergence | Jensen–Shannon |
| | Top-$k$ distillation | 100 |
| | EMA update rate | 0.0 |
| Training | Learning rate | $1\times 10^{-5}$ |

Evaluation

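Table 11: Evaluation hyperparameters.

| Model | Max Tokens | Temp. | Top-$p$ | Top-$K$ |
|---|---|---|---|---|
| DeepSeek-R1-Distill-7B | 38912 | 0.6 | 0.95 | 20 |
| Qwen3-8B (thinking) | 38912 | 0.6 | 0.95 | 20 |
| Qwen3-8B (non-thinking) | 38912 | 0.7 | 0.8 | 20 |
| OLMo-3-7B-Instruct | 38912 | 0.6 | 0.95 | 20 |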

Chat Templates for Different Model Series

Table 12 summarizes the chat template formats used by several open-weight model families, each using distinct special tokens to delineate user and assistant turns. The placeholder {prompt} denotes the user message.

Table 12: Chat template formats by model series.

• DeepSeek-R1-Distill-7B: <|begin_of_sentence|><|User|>{prompt} <|Assistant|><think>
• Qwen3-8B (think ON): <|im_start|>user\n{prompt}<|im_end|> <|im_start|>assistant
• Qwen3-8B (think OFF): <|im_start|>user\n{prompt}<|im_end|> <|im_start|>assistant\n<think>\n</think>
• OLMo-3-7B-Instruct: <|im_start|>system\n{sys}<|im_end|> <|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant

Appendix E Comparison with OPSD

Recently, OPSD (zhao2026self) demonstrated performance gains in mathematical reasoning through self-distillation, particularly on the Qwen3 series. Unlike our setup, where both the student and the teacher either enable or disable thinking mode, OPSD adopts a hybrid configuration in which the student operates with thinking mode disabled while the teacher has it enabled.

As our experiments also confirm, enabling thinking mode produces substantially longer responses with a greater number of epistemic tokens, making this hybrid setup function more akin to conventional teacher–student distillation, despite using the same underlying model. We note that this configuration is inherently limited to model families such as Qwen3 that support toggling thinking mode on and off.

Furthermore, for training efficiency, OPSD does not train on the entire student response; instead, it focuses only on a prefix (1024 tokens by default). Unlike SDPO, which performs full fine-tuning based on verl (hybridflow), OPSD uses LoRA fine-tuning based on trl (trl). Additionally, OPSD’s hyperparameters (batch size 32, learning rate 1e-6) are smaller than those used in our setup (batch size 256, learning rate 1e-5), resulting in higher training efficiency but smaller parameter updates per step.

[Figure 15 panels: (a) Hybrid vs. homogeneous setup performance over training; (b) Response length and epistemic token usage in the homogeneous setup]

Figure 15: Training dynamics of OPSD hybrid distillation in Qwen3-1.7B compared to our homogeneous chat template setup. (a) Under the hybrid setup, the thinking-enabled teacher initially improves student performance, but gains reverse over time. In contrast, the homogeneous setup shows a consistent decrease. (b) Response length and epistemic token usage in the homogeneous setup.

Under this hybrid setup with prefix learning in Qwen3-1.7B, as shown in Figure 15a (orange line), we observe an intriguing training dynamic: in the early stages, the thinking-enabled teacher drives the student toward longer responses with improved performance, demonstrating the effectiveness of hybrid distillation in the early phase of training. However, as training progresses, the response length gradually decreases, accompanied by a corresponding degradation in performance. In contrast, under our homogeneous setup as in Figure 15b, where both the student and the teacher have thinking mode enabled, performance declines consistently, while response length and epistemic token count also decrease steadily, consistent with our earlier analysis.

While this hybrid distillation setup for the Qwen3 series presents an interesting research direction with its own unique training dynamics, such as why performance initially improves before declining and whether this stems from changes in reasoning behavior or chat template mismatch, a thorough investigation is beyond the scope of this work and is left for future exploration.

Appendix F More On-Policy Self-Distillation Results

F.1 Qwen3-8B (Thinking Mode: OFF)

[Figure 16 panels: (a) Training score-length comparison; (b) AMC23 score and response length; (c) AIME24 score and response length; (d) Change in epistemic token usage on AIME24]

Figure 16: Extended results for Qwen3-8B (thinking mode off): training score-length trade-off, evaluation performance on AMC23 and AIME24, and change in epistemic token usage.

As an extension of Figure 5 in Section 5.3, we additionally compare AMC23 evaluation scores and the change in epistemic token usage on AIME24. As shown in Figure 16, GRPO significantly increases response length and yields substantial performance gains on both benchmarks. In contrast, SDPO exhibits divergent trends: on AMC23, acc@16 increases from 0.67 to 0.73 while reducing response length by roughly half, whereas on AIME24, acc@16 slightly decreases from 0.25 to 0.23, with pass@16 dropping more substantially. Notably on AMC23, SDPO achieves approximately a 6-point improvement with shorter responses, whereas GRPO obtains a much larger improvement of around 36 points at the cost of considerably longer responses. Achieving large performance gains while maintaining a reasonable response length remains an open challenge.

F.2 Qwen3-1.7B (Thinking Mode: ON)

To investigate whether our findings generalize to smaller models, we conducted the same on-policy self-distillation experiment on Qwen3-1.7B. As shown in Figure 17, our finding that self-distillation reduces uncertainty and makes outputs more concise, while simultaneously degrading mathematical reasoning performance, remains consistent across model sizes. Notably, the performance degradation is more severe for the 1.7B model (45.9%) compared to Qwen3-8B with Thinking Mode ON (12.1%) and DeepSeek-R1-Distill-Qwen-7B (40.7%). This suggests that the externalization of uncertainty may be particularly important for robust reasoning in smaller models, which possess more limited inherent capabilities.

[Figure 17 panels: (a) Training score-length comparison; (b) AIME24 score and response length]

Figure 17:GRPO and SDPO results on Qwen3-1.7B (Thinking Mode: ON): training score and response length, and OOD evaluation on AIME24. Self-distillation consistently reduces response length but leads to a substantial drop in performance, with a more pronounced degradation compared to larger models.

F.3 Olmo-3-7B-Instruct

In addition to DeepSeek-R1-Distill-Qwen-7B and Qwen3-8B, we further evaluate on-policy self-distillation on OLMo-3-7B-Instruct from a different model family. As shown in Figure 18, consistent with our previous analyses, SDPO also degrades reasoning performance on this model, with OOD evaluation scores falling below those of the base model. This confirms that our findings are not model-dependent but reflect robust characteristics of reasoning behavior across diverse model families.

[Figure 18 panels: (a) Training score-length comparison; (b) AIME24 score and response length]

Figure 18:SDPO results on OLMo-3-7B-Instruct: training score and response length, and OOD evaluation on AIME24.

F.4 Pass@16 Score

In addition to the acc@16 scores for DeepSeek-Distill-7B and Qwen3-8B (Thinking Mode: ON) presented in Figures 3b, 3c, 4b, and 4c, we report pass@16 scores in Figure 19. The gap between GRPO and SDPO in pass@16 is larger for DeepSeek-Distill-7B than for Qwen3-8B, and more pronounced on the harder benchmark (AIME24) compared to AMC23.
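The difference between the two metrics can be made concrete with a short sketch. It assumes the common conventions, which this appendix does not spell out: acc@16 is the per-problem fraction of correct responses among 16 samples, averaged over problems, and pass@16 is the fraction of problems with at least one correct response among the 16. The function names are illustrative.

```python
from typing import Sequence


def acc_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Mean per-problem accuracy; results[i][j] is True if sample j of problem i is correct."""
    per_problem = [sum(samples) / len(samples) for samples in results]
    return sum(per_problem) / len(per_problem)


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems solved by at least one of the k samples."""
    return sum(any(samples) for samples in results) / len(results)
```

Under these definitions, pass@k is never below acc@k, and a model that loses rare but correct exploratory solutions can drop sharply in pass@k while acc@k moves little.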

[Figure 19 panels: (a) DeepSeek-Distill-7B; (b) Qwen3-8B (thinking mode enabled)]

Figure 19: Pass@16 on AMC23 and AIME24 for GRPO and SDPO across training steps.

Appendix G More Ablation Study

To examine the effect of various training hyperparameters on self-distillation behavior, we conduct additional experiments by varying the top-$k$ distillation parameter and the learning rate. As shown in Figure 20a, increasing top-$k$ from 100 to 256 yields no significant difference in training dynamics or final performance. Meanwhile, as illustrated in Figure 20b, reducing the learning rate from 1e-5 to 1e-6 slows the rate of degradation, but the model ultimately converges to the same reasoning behavior.

[Figure 20 panels: (a) Effect of top-$k$ distillation ($k=100$ vs. $k=256$); (b) Effect of learning rate (1e-5 vs. 1e-6)]

Figure 20: Ablation study on top-$k$ distillation and learning rate. Both modifications fail to prevent the convergence toward degraded reasoning behavior; a lower learning rate only delays the process.

Appendix H Qualitative Analysis

H.1 Unguided vs. Solution-Guided Generation

We analyze examples of $\mathcal{D}_{\text{ug}}$ (Unguided responses) and $\mathcal{D}_{\text{sg}}$ (Solution-guided responses) introduced in Section 4. Below, we present the two responses generated for the same problem. As illustrated, the unguided response exhibits substantially more uncertainty and exploratory reasoning, frequently revisiting assumptions, considering alternative approaches, and self-correcting along the way. In contrast, the solution-guided response proceeds in a more concise and direct manner, arriving at each intermediate result without detours.

Problem. How many ways can $8$ mutually non-attacking rooks be placed on a $9\times 9$ chessboard so that all $8$ rooks are on squares of the same color? (Two rooks are said to be attacking each other if they are placed in the same row or column of the board.) Please reason step by step, and put your final answer within \boxed{}.

Unguided Response – Exploratory Reasoning

Alright, so I have this problem where I need to figure out how many ways I can place 8 mutually non-attacking rooks on a 9×9 chessboard, and all of them have to be on squares of the same color. Hmm, let me break this down.

First, I know that a chessboard is 8×8, but this is a 9×9 chessboard. So, it’s one square larger on each side. Also, non-attacking rooks mean that no two rooks can be in the same row or column. Since it’s 9×9, if I were placing 9 rooks, they would each occupy their own row and column. But here, we’re placing 8 rooks, so one row and one column will be empty.

But wait, the problem also says all rooks must be on squares of the same color. So, not only do they have to not attack each other, but they all have to be on, say, white squares or all on black squares.

[…]

So, perhaps, the number of ways to place 8 non-attacking rooks on white squares would be equal to the number of permutation matrices on the white squares.

Similarly for black squares.

But how do we compute that?

I recall that in combinatorics, the number of permutation matrices avoiding certain positions is given by the inclusion-exclusion principle, but that might get complicated.

Alternatively, maybe we can model this as a bipartite graph matching problem.

[…]

But I don’t know the exact formula for that.

Alternatively, perhaps we can use the principle of inclusion-exclusion or recursion.

Alternatively, perhaps notice that the white squares form a bipartite graph which is a union of two complete bipartite graphs, but that might not hold.

Wait, let me consider the structure of the white squares.

[…]

So, the bipartite graph is actually composed of two complete bipartite graphs: one connecting even rows to even columns, and the other connecting odd rows to odd columns.

Therefore, the bipartite graph is actually two disjoint complete bipartite graphs: one between the odd rows (5 rows) and odd columns (5 columns), and another between the even rows (4 rows) and even columns (4 columns).

So, the white squares form two separate complete bipartite graphs: $K_{5,5}$ and $K_{4,4}$.

[…]

So, to get a matching of size 8, we need to have $k$ edges from $K_{5,5}$ and $8-k$ edges from $K_{4,4}$, where $k$ ranges from 4 to 5.

If $k=4$, then $8-4=4$ edges from $K_{4,4}$.

If $k=5$, then $8-5=3$ edges from $K_{4,4}$.

[…]

So, for $K_{5,5}$, the number of matchings of size 4 is $\binom{5}{4}^{2}\cdot 4! = 25\times 24 = 600$.

Similarly, the number of matchings of size 5 in $K_{5,5}$ is $\binom{5}{5}^{2}\cdot 5! = 1\times 120 = 120$.

For $K_{4,4}$, the number of matchings of size 4 is $\binom{4}{4}^{2}\cdot 4! = 1\times 24 = 24$.

And the number of matchings of size 3 is $\binom{4}{3}^{2}\cdot 3! = 16\times 6 = 96$.

Therefore: $600\times 24 + 120\times 96 = 14{,}400 + 11{,}520 = 25{,}920$.

[…]

In $K_{5,4}$, the number of perfect matchings (size 4) is $\binom{5}{4}\cdot\binom{4}{4}\cdot 4! = 5\times 1\times 24 = 120$.

Similarly, in $K_{4,5}$, the number of perfect matchings is $\binom{4}{4}\cdot\binom{5}{4}\cdot 4! = 1\times 5\times 24 = 120$.

Therefore, the number of matchings of size 8 in the black square bipartite graph is $120\times 120 = 14{,}400$.

Therefore, total is $25{,}920 + 14{,}400 = 40{,}320$.

But wait, $40{,}320$ is $8! = 40{,}320$. Hmm, that’s interesting.

But why?

Is there a combinatorial interpretation where the number of ways is $8!$ multiplied by something?

Wait, $8!$ is 40320, which is exactly the number of ways we got.

But why is it exactly $8!$? It seems coincidental.

[…]

Wait, perhaps it’s best to just accept that the total number is $40{,}320$.

So, the final answer is $\boxed{40{,}320}$.

Solution-Guided Response – Concise, Direct Reasoning

Alright, so I have this problem where I need to figure out how many ways I can place 8 mutually non-attacking rooks on a $9\times 9$ chessboard, and all of them have to be on squares of the same color. Hmm, okay, let’s break this down step by step.

First, I know that a rook can move any number of squares vertically or horizontally, so non-attacking rooks must be placed on different rows and different columns. That means for 8 rooks, we need 8 distinct rows and 8 distinct columns.

[…] Specifically, there are 41 squares of one color and 40 of the other. […] So white squares are 41 and black squares are 40.

So, I need to calculate the number of ways to place 8 non-attacking rooks on white squares and then on black squares separately, and then add them together.

Starting with white squares. […] The white square bipartite graph has two components: $K_{5,5}$ and $K_{4,4}$. Similarly, […] the black square bipartite graph has two components: $K_{5,4}$ and $K_{4,5}$.

[…] The number of ways to choose 4 edges from $K_{5,5}$ is $\binom{5}{4}^{2}\cdot 4!=25\times 24=600$. The number of ways to choose 5 edges from $K_{5,5}$ is $\binom{5}{5}^{2}\cdot 5!=1\times 120=120$. Similarly, for $K_{4,4}$, the number of ways to choose 4 edges is $\binom{4}{4}^{2}\cdot 4!=1\times 24=24$, and choosing 3 edges is $\binom{4}{3}^{2}\cdot 3!=16\times 6=96$.

So, the total number of ways for white squares is $(600\times 24)+(120\times 96)=14{,}400+11{,}520=25{,}920$.

For black squares, […] the total number of ways is $14{,}400$.

Adding both cases together, we get $25{,}920+14{,}400=\boxed{40{,}320}$.
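The counts in both responses can be checked independently. The sketch below is not from the paper: it recomputes the white and black totals from the matching formula $\binom{m}{k}\binom{n}{k}\,k!$ for $K_{m,n}$ and then cross-checks them with a row-by-row dynamic program over occupied columns. The function names and the convention that a square $(r, c)$ is "white" when $r+c$ is even (0-indexed) are assumptions made for illustration; the grand total does not depend on which color is called white.

```python
from math import comb, factorial


def matchings(m, n, k):
    """Number of size-k matchings in the complete bipartite graph K_{m,n}."""
    return comb(m, k) * comb(n, k) * factorial(k)


def count_by_formula():
    # White squares (assumed: r + c even) split into K_{5,5} and K_{4,4}.
    # comb() returns 0 when k exceeds the side size, so infeasible splits vanish.
    white = sum(matchings(5, 5, k) * matchings(4, 4, 8 - k) for k in range(6))
    # Black squares split into K_{5,4} and K_{4,5}; each contributes at most 4 rooks.
    black = matchings(5, 4, 4) * matchings(4, 5, 4)
    return white, black


def count_by_dp(n=9, rooks=8, parity=0):
    """Count placements of `rooks` non-attacking rooks on squares with (r + c) % 2 == parity."""
    # ways[mask] = number of ways to fill the rows seen so far using column set `mask`.
    ways = {0: 1}
    for r in range(n):
        nxt = dict(ways)  # option: leave row r empty
        for mask, w in ways.items():
            for c in range(n):
                if (r + c) % 2 == parity and not mask & (1 << c):
                    key = mask | (1 << c)
                    nxt[key] = nxt.get(key, 0) + w
        ways = nxt
    return sum(w for mask, w in ways.items() if bin(mask).count("1") == rooks)


if __name__ == "__main__":
    white, black = count_by_formula()
    print(white, black, white + black)  # 25920 14400 40320
    print(count_by_dp(parity=0), count_by_dp(parity=1))  # 25920 14400
```

Both routes agree with the transcripts: 25,920 placements on one color, 14,400 on the other, and $8! = 40{,}320$ in total.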

H.2 GRPO vs. SDPO Generation

Problem (AIME24).

Find the number of ways to place a digit in each cell of a $2\times 3$ grid so that the sum of the two numbers formed by reading left to right is 999, and the sum of the three numbers formed by reading top to bottom is 99. The grid below is an example of such an arrangement because $8+991=999$ and $9+9+81=99$:

$$\begin{array}{|c|c|c|}\hline 0&0&8\\ \hline 9&9&1\\ \hline\end{array}$$

The correct answer is $\boxed{45}$.

GRPO explores with repeated reflections, deriving the extra constraint $a+b+c=8$ beyond the pairwise sums, and correctly counts $\binom{10}{2}=45$ solutions. SDPO proceeds directly and stops after deriving $a+d=b+e=c+f=9$, multiplying $10\times 10\times 10=1000$ without checking the row-sum equation, thereby missing the additional constraint $a+b+c=8$.
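The gap between the two answers can be reproduced by brute force. The sketch below is an illustration, not code from the paper: it enumerates all $10^6$ digit grids, with the top row labeled $a, b, c$ and the bottom row $d, e, f$ (function names are hypothetical). It counts the grids that satisfy only the left-to-right condition, which is equivalent to $a+d=b+e=c+f=9$, and the grids that satisfy both conditions.

```python
from itertools import product


def left_to_right_ok(a, b, c, d, e, f):
    # The two 3-digit numbers (leading zeros allowed) must sum to 999.
    return (100 * a + 10 * b + c) + (100 * d + 10 * e + f) == 999


def top_to_bottom_ok(a, b, c, d, e, f):
    # The three 2-digit column numbers (leading zeros allowed) must sum to 99.
    return (10 * a + d) + (10 * b + e) + (10 * c + f) == 99


def count_grids():
    """Return (grids meeting the left-to-right condition, grids meeting both conditions)."""
    first_only = both = 0
    for grid in product(range(10), repeat=6):
        if left_to_right_ok(*grid):
            first_only += 1
            if top_to_bottom_ok(*grid):
                both += 1
    return first_only, both


if __name__ == "__main__":
    print(count_grids())  # (1000, 45)
```

The first count matches the 1000 reported by SDPO, and the second matches the correct answer of 45, so the entire discrepancy comes from dropping the top-to-bottom constraint, which forces $a+b+c=8$.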

DeepSeek-R1-Distill-Qwen-7B GRPO Generation (Correct: 45)

Okay, so I need to figure out how many ways there are to fill a $2\times 3$ grid with digits such that two conditions are satisfied: (1) the sum of the two numbers formed by reading left to right is 999, and (2) the sum of the three numbers formed by reading top to bottom is 99.

Let me visualize the grid as

$$\begin{array}{|c|c|c|}\hline a&b&c\\ \hline d&e&f\\ \hline\end{array}$$

The first condition gives $(100a+10b+c)+(100d+10e+f)=999$, and the second gives $(10a+d)+(10b+e)+(10c+f)=99$.

[…]

So we have:

Equation (A): $100\,S_{1}+10\,S_{2}+S_{3}=999$, where $S_{1}=a+d$, $S_{2}=b+e$, $S_{3}=c+f$.

Equation (B): $10T+U=99$, where $T=a+b+c$, $U=d+e+f$.

[…] Since $S_{1}$ is $a+d$, $S_{1}$ can be from 0 to 18. But if $S_{1}$ is 10, then $100S_{1}=1000$, which is already larger than 999. Therefore, $S_{1}$ can be from 0 to 9.

For $S_{1}=9$: $100\cdot 9=900$. Then $10S_{2}+S_{3}=99$.

For $S_{1}=8$: $100\cdot 8=800$. Then $10S_{2}+S_{3}=199$. But $10S_{2}+S_{3}$ can be at most $10\cdot 18+18=198$. So 199 is too big. Therefore $S_{1}$ can’t be 8.

For $S_{1}=9$, $10S_{2}+S_{3}=99$. Since $S_{2}$ and $S_{3}$ are between 0 and 18, $10S_{2}+S_{3}$ can be between 0 and 198. So 99 is possible.

[…] Therefore $S_{2}$ must be 9. Hence, $S_{2}=9$, then $S_{3}=99-90=9$.

Let me check other $S_{1}$ values. For $S_{1}=7$: $10S_{2}+S_{3}=299$, even larger. Hence impossible. Similarly for lower $S_{1}$. Hence, only $S_{1}=9$ is possible.

Wait, let me double-check that $S_{1}=9$ is the only possible value.

For $S_{1}=9$, we have $10S_{2}+S_{3}=99$, which requires $S_{2}=9$ and $S_{3}=9$.

[…] So only $S_{2}=9$ and $S_{3}=9$.

Therefore, Equation (A) only holds when $S_{1}=S_{2}=S_{3}=9$.

[…] From $T+U=27$ and $10T+U=99$: subtracting gives $9T=72$, so $T=8$, $U=19$.

[…] Substituting $d=9-a$, $e=9-b$, $f=9-c$, Equation 5 ($d+e+f=19$) is automatically satisfied. The main constraint reduces to $a+b+c=8$ with $a,b,c\in\{0,\ldots,9\}$.

Since $8<9$, the upper bound is automatic, so the count is the number of non-negative integer solutions to $a+b+c=8$, which is $\binom{10}{2}=45$.

Since $d=9-a$ and $a\leq 8$, $d$ ranges from 1 to 9, all valid digits. Similarly for $e,f$. Therefore, all 45 solutions are valid.

Wait, let me check if there are constraints I missed.

When forming the numbers $N_{1}$ and $N_{2}$, they can have leading zeros, which is fine since the example itself has $008$ in the top row. […] So there is no problem with leading zeros.

Let me verify with the given example: $a=0,b=0,c=8$, so $d=9,e=9,f=1$. Columns: $09+09+81=99$ ✓. Rows: $008+991=999$ ✓.

Another example: $a=8,b=0,c=0$. Then $d=1,e=9,f=9$. Top: 800, bottom: 199. $800+199=999$ ✓. Columns: $81+9+9=99$ ✓.

Another: $a=0,b=1,c=7$. Then $d=9,e=8,f=2$. $17+982=999$ ✓. $9+18+72=99$ ✓.

Wait, one more check: what if leading zeros aren’t allowed?

Then the top row would need $a\neq 0$, but the example given has $a=0$, so that interpretation is invalid. Therefore, leading zeros are allowed, and the answer is indeed 45.

Therefore, I think the correct answer is $\boxed{45}$.

DeepSeek-R1-Distill-Qwen-7B SDPO Generation (Incorrect: 1000)

Okay, so I need to figure out how many ways there are to fill a $2\times 3$ grid with digits so that the sum of the two numbers formed by reading left to right is 999, and the sum of the three numbers formed by reading top to bottom is 99.

Labeling the grid as Row 1: $a,b,c$ and Row 2: $d,e,f$, the two conditions give:

$$10(a+b+c)+(d+e+f)=99 \tag{1}$$

$$100(a+d)+10(b+e)+(c+f)=999 \tag{2}$$

Let $S=a+b+c$, $T=d+e+f$, and $x=a+d$, $y=b+e$, $z=c+f$. Then (1) becomes $10S+T=99$ and (2) becomes $100x+10y+z=999$.

Since $x+y+z=S+T$, substituting $T=99-10S$ gives $x+y+z=99-9S$. Subtracting this from (2):

$$99x+9y=900+9S\;\Rightarrow\;11x+y=100+S.$$

[…] From $0\leq 99-10S\leq 27$ we get $S\in\{8,9\}$.

Case 1: $S=8$. Then $T=19$, $11x+y=108$, and $x+y+z=27$. Substituting $y=108-11x$ gives $z=10x-81$. The constraint $0\leq z\leq 18$ forces $x=9$, so $y=9$, $z=9$. Therefore $a+d=9$, $b+e=9$, $c+f=9$.

For each of the three pairs, there are 10 possibilities (since for each digit from 0 to 9, there’s exactly one digit that adds up to 9). So for each of the three pairs, there are 10 possibilities. Therefore, the total number of solutions in this case is $10\times 10\times 10=1000$.

Case 2: $S=9$. Then $T=9$, $11x+y=109$, and $x+y+z=18$. Substituting gives $z=10x-91$. The constraint $0\leq z\leq 18$ forces $x=10$, but then $y=109-110=-1$, which is invalid. Therefore, this case is invalid.

Thus, only Case 1 is valid, leading to 1000 possible configurations.

The digits $a,b,c,d,e,f$ must be digits between 0 and 9. In Case 1, we have $a+d=9$, so each pair $(a,d)$ can be any pair of digits adding to 9, which is valid. Similarly for $(b,e)$ and $(c,f)$. So there are no invalid digit values here.

Therefore, all 1000 configurations are valid.

Let me check with the example given in the problem:

$$\begin{array}{|c|c|c|}\hline 0&0&8\\ \hline 9&9&1\\ \hline\end{array}$$

Horizontal sum: $008+991=999$ ✓. Vertical sum: $09+09+81=99$ ✓. Each of the pairs $(a,d)=(0,9)$, $(b,e)=(0,9)$, $(c,f)=(8,1)$ adds to 9, which fits Case 1. So this example is counted in the 1000 configurations.

Hence, the answer should be $\boxed{1000}$.

Rishabh Agarwal (@agarwl_): ML Twitter: What’s your favorite on-policy (self)-distillation paper / blogs from this year? Sharing your own work is totally fine!
