Rethinking the Role of Temperature in Large Language Model Distillation

arXiv cs.LG Papers

Summary

This paper reexamines the role of temperature in large language model distillation, revealing that temperature asymmetrically benefits forward KL divergence over reverse KL, allowing simple KL methods to match state-of-the-art distillation approaches at higher temperatures.

arXiv:2606.00306v1 Announce Type: new Abstract: Reverse Kullback-Leibler (RKL) divergence is widely favored over forward KL (FKL) in large language models (LLM) distillation, yet this preference is largely based on comparisons that omit the temperature $\tau$, overlooking its central role in softening teacher distributions and improving knowledge transfer. In this work, we revisit temperature in LLM distillation and show that it fundamentally changes the comparison between FKL and RKL. Our analysis reveals an asymmetric effect: temperature substantially enriches FKL with non-dominant token signals, whereas it mainly rescales RKL gradients, causing FKL to benefit much more from $\tau$ scaling than RKL. This asymmetry overturns the standard empirical conclusion: although RKL outperforms FKL at $\tau=1$, FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks. Moreover, the impact of temperature is not limited to FKL; it improves a broader family of distillation objectives, enabling simple KL-based methods to achieve competitive performance against recent state-of-the-art LLM distillation approaches.

# Rethinking the Role of Temperature in Large Language Model Distillation
Source: [https://arxiv.org/html/2606.00306](https://arxiv.org/html/2606.00306)
Hoang-Chau Luong, Lingwei Chen. Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester, NY, United States. cl6300@rit.edu, lwcics@rit.edu

###### Abstract

Reverse Kullback–Leibler (RKL) divergence is widely favored over forward KL (FKL) in large language model (LLM) distillation, yet this preference is largely based on comparisons that omit the temperature $\tau$, overlooking its central role in softening teacher distributions and improving knowledge transfer. In this work, we revisit temperature in LLM distillation and show that it fundamentally changes the comparison between FKL and RKL. Our analysis reveals an asymmetric effect: temperature substantially enriches FKL with non-dominant token signals, whereas it mainly rescales RKL gradients, causing FKL to benefit much more from $\tau$ scaling than RKL. This asymmetry overturns the standard empirical conclusion: although RKL outperforms FKL at $\tau=1$, FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks. Moreover, the impact of temperature is not limited to FKL; it improves a broader family of distillation objectives, enabling simple KL-based methods to achieve competitive performance against recent state-of-the-art LLM distillation approaches.

## 1 Introduction

Knowledge distillation (KD) (Hinton et al., [2015](https://arxiv.org/html/2606.00306#bib.bib9)) is a widely used paradigm for model compression that transfers knowledge from a large teacher model to a smaller student model (Romero et al., [2015](https://arxiv.org/html/2606.00306#bib.bib48); Cho and Hariharan, [2019](https://arxiv.org/html/2606.00306#bib.bib34); Gou et al., [2021](https://arxiv.org/html/2606.00306#bib.bib49)). A central component of classical KD is the distillation temperature $\tau$, which softens the teacher distribution and reveals informative “dark knowledge” beyond the top prediction (Hinton et al., [2015](https://arxiv.org/html/2606.00306#bib.bib9); Tang et al., [2020](https://arxiv.org/html/2606.00306#bib.bib47); Zhao et al., [2022](https://arxiv.org/html/2606.00306#bib.bib13)). By redistributing probability mass from dominant classes to lower-probability alternatives, KD temperature prevents the distillation signal from collapsing onto the teacher’s top prediction and provides the student with richer inter-class relational information.

Although temperature has been extensively studied in vision-based KD (Beyer et al., [2022](https://arxiv.org/html/2606.00306#bib.bib50); Li et al., [2023](https://arxiv.org/html/2606.00306#bib.bib39); Sun et al., [2024](https://arxiv.org/html/2606.00306#bib.bib37), [2025](https://arxiv.org/html/2606.00306#bib.bib51)), it is often omitted in LLM distillation. Early sequence-level KD work for language tasks (Kim and Rush, [2016](https://arxiv.org/html/2606.00306#bib.bib4)) reported that $\tau=1$ achieved the best empirical performance and therefore adopted the default softmax distribution without further analyzing the role of temperature. Subsequent LLM distillation methods similarly formulate and evaluate their distillation frameworks under the default $\tau=1$ setting (Gu et al., [2024](https://arxiv.org/html/2606.00306#bib.bib8); Agarwal et al., [2024](https://arxiv.org/html/2606.00306#bib.bib7); Ko et al., [2024](https://arxiv.org/html/2606.00306#bib.bib17); Wang et al., [2025](https://arxiv.org/html/2606.00306#bib.bib18); Song and Zheng, [2026](https://arxiv.org/html/2606.00306#bib.bib42)), implicitly removing temperature from the design space of KL-based distillation objectives.

This convention directly affects how recent work compares forward KL (FKL) and reverse KL (RKL) objectives: the commonly reported superiority of RKL is largely established through evaluations conducted at $\tau=1$ (Gu et al., [2024](https://arxiv.org/html/2606.00306#bib.bib8); Ko et al., [2024](https://arxiv.org/html/2606.00306#bib.bib17); Wu et al., [2025](https://arxiv.org/html/2606.00306#bib.bib1)). Unlike RKL, whose optimization is weighted by the student distribution, FKL directly weights each sample loss by the teacher probabilities. As a result, low-entropy teacher soft targets suppress the non-target probability mass that FKL relies on, making it appear inferior not due to an inherent limitation of the objective itself, but because it is evaluated under insufficiently softened teacher distributions. This raises an important question: *Is RKL intrinsically superior for LLM distillation, or has FKL been systematically underestimated?*

### 1.1 Related Work

Temperature in KD. Given teacher logits $z^{t}$ and student logits $z^{s}$, temperature-scaled distributions are defined as

$$p^{\tau}=\mathrm{softmax}(z^{t}/\tau),\quad q^{\tau}=\mathrm{softmax}(z^{s}/\tau), \tag{1}$$

where $\tau\geq 1$. Increasing $\tau$ smooths the distribution by reducing probability mass on dominant tokens and amplifying lower-probability alternatives. This smoothing has been widely used in vision-based distillation (Hinton et al., [2015](https://arxiv.org/html/2606.00306#bib.bib9); Zhao et al., [2022](https://arxiv.org/html/2606.00306#bib.bib13); Li et al., [2023](https://arxiv.org/html/2606.00306#bib.bib39); Jin et al., [2023](https://arxiv.org/html/2606.00306#bib.bib36); Sun et al., [2024](https://arxiv.org/html/2606.00306#bib.bib37); Cui et al., [2024](https://arxiv.org/html/2606.00306#bib.bib25); Wei et al., [2024](https://arxiv.org/html/2606.00306#bib.bib24)), as well as in early NLP distillation methods such as Seq-KD (Kim and Rush, [2016](https://arxiv.org/html/2606.00306#bib.bib4)), DistilBERT (Sanh et al., [2020](https://arxiv.org/html/2606.00306#bib.bib40)), and TinyBERT (Jiao et al., [2020](https://arxiv.org/html/2606.00306#bib.bib41)). In LLM distillation, however, most methods omit KD temperature and evaluate objectives at $\tau=1$ (Gu et al., [2024](https://arxiv.org/html/2606.00306#bib.bib8); Ko et al., [2024](https://arxiv.org/html/2606.00306#bib.bib17); Wu et al., [2025](https://arxiv.org/html/2606.00306#bib.bib1); Song and Zheng, [2026](https://arxiv.org/html/2606.00306#bib.bib42)). Although recent work has revisited temperature for FKL (Song and Zheng, [2026](https://arxiv.org/html/2606.00306#bib.bib42)) or proposed adaptive temperature strategies (Xie et al., [2026](https://arxiv.org/html/2606.00306#bib.bib44); Luong et al., [2026b](https://arxiv.org/html/2606.00306#bib.bib52)), its objective-dependent effect on FKL and RKL remains poorly understood.
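The softening effect of Eq. (1) can be illustrated with a minimal sketch in plain Python. The logit values and the helper names `softmax_with_temperature` and `entropy` are assumptions chosen for illustration; they are not taken from the paper's implementation.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Return softmax(logits / tau) as a list of probabilities (Eq. 1)."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical teacher logits with one dominant token (not from the paper).
teacher_logits = [6.0, 2.0, 1.0, 0.0, -1.0]

for tau in (1.0, 2.0, 3.0):
    p = softmax_with_temperature(teacher_logits, tau)
    print(f"tau={tau}: top prob={max(p):.3f}, entropy={entropy(p):.3f}")
```

As $\tau$ grows, the top probability shrinks and the entropy rises, which is the sense in which temperature exposes lower-probability alternatives.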

KL objectives for LLM distillation. FKL and RKL are two standard distribution-matching objectives. FKL minimizes $D_{\mathrm{KL}}(p^{\tau}\|q^{\tau})$, encouraging the student to cover the teacher distribution, while RKL minimizes $D_{\mathrm{KL}}(q^{\tau}\|p^{\tau})$ and is associated with mode-seeking behavior. Recent LLM distillation methods favor RKL due to its strong empirical performance (Gu et al., [2024](https://arxiv.org/html/2606.00306#bib.bib8); Ko et al., [2024](https://arxiv.org/html/2606.00306#bib.bib17)), leading to variants such as AKL (Wu et al., [2025](https://arxiv.org/html/2606.00306#bib.bib1)), AB-KD (Wang et al., [2025](https://arxiv.org/html/2606.00306#bib.bib18)), SFKL/SRKL (Ko et al., [2024](https://arxiv.org/html/2606.00306#bib.bib17)), and DRKL (Luong et al., [2026a](https://arxiv.org/html/2606.00306#bib.bib45)). However, these comparisons are typically conducted under the $\tau=1$ setting, leaving the interaction between temperature and KL objective design underexplored.
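The covering versus mode-seeking contrast can be made concrete with a small sketch. Everything below is an illustrative assumption: the toy logits, the two hypothetical students, and the helper names `softmax_t`, `kl`, `fkl`, and `rkl` are not from the paper.

```python
import math

def softmax_t(logits, tau):
    """Temperature-scaled softmax."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(a, b):
    """D_KL(a || b) for two strictly positive distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def fkl(teacher_logits, student_logits, tau=1.0):
    """Forward KL: D_KL(p^tau || q^tau), weighted by the teacher."""
    return kl(softmax_t(teacher_logits, tau), softmax_t(student_logits, tau))

def rkl(teacher_logits, student_logits, tau=1.0):
    """Reverse KL: D_KL(q^tau || p^tau), weighted by the student."""
    return kl(softmax_t(student_logits, tau), softmax_t(teacher_logits, tau))

# Hypothetical teacher with two equally likely modes and one unlikely token.
teacher = [2.0, 2.0, -2.0]
covering_student = [0.0, 0.0, 0.0]   # spreads mass over every token
peaked_student = [4.0, 0.0, -2.0]    # commits to a single teacher mode

for name, student in (("covering", covering_student), ("peaked", peaked_student)):
    print(f"{name}: FKL={fkl(teacher, student):.4f}, RKL={rkl(teacher, student):.4f}")
```

In this toy setting FKL assigns the lower loss to the covering student, while RKL assigns the lower loss to the peaked one.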

### 1.2 Contributions

We revisit the role of temperature in LLM distillation and address three key questions:

RQ1: How does temperature distinguish FKL from RKL? We theoretically show that both FKL and RKL converge to the same logit-matching behavior in the high-temperature regime. At practical temperatures, KD temperature reshapes the teacher soft targets and enriches the non-target knowledge transfer for FKL, while primarily rescaling RKL gradients. Thus, KD temperature changes *what* FKL learns, but mainly changes *how* RKL learns.

RQ2: When does FKL outperform RKL? We demonstrate that the FKL–RKL comparison is highly temperature-dependent. While RKL often performs better at $\tau=1$, FKL benefits substantially more from softened teacher targets and consistently surpasses RKL at higher KD temperatures. This finding challenges the common view that RKL is intrinsically superior for LLM distillation.

RQ3: Does temperature broadly improve KL-based distillation objectives? We further show that KD temperature improves not only FKL, but also a broader family of KL-based objectives, including Sym-KL, JS, SFKL, and AKL. With appropriate KD temperature, these simple objectives achieve performance competitive with existing LLM distillation methods. This suggests that KD temperature is not merely a correction for FKL, but a general objective-dependent factor for designing and fairly comparing LLM distillation methods.

![Refer to caption](https://arxiv.org/html/2606.00306v1/x1.png) (a) GPT-2 XL → GPT-2 Base
![Refer to caption](https://arxiv.org/html/2606.00306v1/x2.png) (b) GPT-2 XL → GPT-2 Medium
![Refer to caption](https://arxiv.org/html/2606.00306v1/x3.png) (c) OPT 6.7B → OPT 1.3B

Figure 1: Effect of temperature on FKL and RKL across model scales. Temperature consistently improves FKL while providing limited benefit for RKL, leading to a reversal in their performance at higher temperatures.

## 2 How Does Temperature Distinguish FKL from RKL?

This section theoretically analyzes the behavior of FKL and RKL under KD temperature. The analysis shows that the two objectives become equivalent in the high-temperature regime but diverge substantially at low temperatures, explaining why FKL can outperform RKL in LLM distillation.

Let $z^{t},z^{s}\in\mathbb{R}^{V}$ denote the teacher and student logits over a vocabulary of size $V$. The temperature-scaled teacher and student distributions, denoted by $p^{\tau}$ and $q^{\tau}$, are defined in Eq. ([1](https://arxiv.org/html/2606.00306#S1.E1)). We study $\mathcal{L}_{\mathrm{FKL}}=D_{\mathrm{KL}}(p^{\tau}\|q^{\tau})=\sum_{i=1}^{V}p_{i}^{\tau}\log(p_{i}^{\tau}/q_{i}^{\tau})$ and $\mathcal{L}_{\mathrm{RKL}}=D_{\mathrm{KL}}(q^{\tau}\|p^{\tau})$. Their gradients with respect to the student logit $z_{i}^{s}$ are

$$\nabla_{z_{i}^{s}}\mathcal{L}_{\mathrm{FKL}}=\frac{1}{\tau}\left(q_{i}^{\tau}-p_{i}^{\tau}\right), \tag{2}$$

$$\nabla_{z_{i}^{s}}\mathcal{L}_{\mathrm{RKL}}=\frac{1}{\tau}q_{i}^{\tau}\left(\log\frac{q_{i}^{\tau}}{p_{i}^{\tau}}-\mathcal{L}_{\mathrm{RKL}}\right). \tag{3}$$
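As a sanity check, the closed-form gradients in Eqs. (2) and (3) can be compared against central finite differences of the two losses. This is a minimal sketch in plain Python; the logits, the temperature, and all helper names are hypothetical and chosen only for illustration.

```python
import math

def softmax_t(logits, tau):
    """Temperature-scaled softmax."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(a, b):
    """D_KL(a || b) for strictly positive distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def fkl_loss(zs, zt, tau):
    return kl(softmax_t(zt, tau), softmax_t(zs, tau))

def rkl_loss(zs, zt, tau):
    return kl(softmax_t(zs, tau), softmax_t(zt, tau))

def fkl_grad(zs, zt, tau):
    """Eq. (2): (q_i - p_i) / tau."""
    p, q = softmax_t(zt, tau), softmax_t(zs, tau)
    return [(qi - pi) / tau for pi, qi in zip(p, q)]

def rkl_grad(zs, zt, tau):
    """Eq. (3): q_i * (log(q_i / p_i) - L_RKL) / tau."""
    p, q = softmax_t(zt, tau), softmax_t(zs, tau)
    loss = kl(q, p)
    return [qi * (math.log(qi / pi) - loss) / tau for pi, qi in zip(p, q)]

def numeric_grad(loss_fn, zs, zt, tau, eps=1e-6):
    """Central finite differences with respect to each student logit."""
    grads = []
    for i in range(len(zs)):
        up, down = list(zs), list(zs)
        up[i] += eps
        down[i] -= eps
        grads.append((loss_fn(up, zt, tau) - loss_fn(down, zt, tau)) / (2 * eps))
    return grads

# Hypothetical logits and temperature (not from the paper).
zt, zs, tau = [3.0, 1.0, 0.0, -2.0], [1.5, 2.0, -0.5, -1.0], 2.0
fkl_gap = max(abs(a - n) for a, n in zip(fkl_grad(zs, zt, tau), numeric_grad(fkl_loss, zs, zt, tau)))
rkl_gap = max(abs(a - n) for a, n in zip(rkl_grad(zs, zt, tau), numeric_grad(rkl_loss, zs, zt, tau)))
print(f"max |analytic - numeric|: FKL={fkl_gap:.2e}, RKL={rkl_gap:.2e}")
```

Both gaps should sit at the level of finite-difference error, and each gradient vector sums to zero because softmax is invariant to adding a constant to all logits.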
###### Proposition 2.1.

Assume the teacher and student logits are centered, i.e., $\sum_{i}z_{i}^{t}=\sum_{i}z_{i}^{s}=0$. As $\tau\to\infty$, FKL and RKL gradients satisfy

$$\nabla_{z_{i}^{s}}\mathcal{L}_{\mathrm{FKL}}=\nabla_{z_{i}^{s}}\mathcal{L}_{\mathrm{RKL}}=\frac{1}{V\tau^{2}}\left(z_{i}^{s}-z_{i}^{t}\right)+O(\tau^{-3}).$$

High-temperature regime: FKL and RKL become logit matching. Recent studies (Wu et al., [2025](https://arxiv.org/html/2606.00306#bib.bib1); Luong et al., [2026a](https://arxiv.org/html/2606.00306#bib.bib45)) suggest that FKL and RKL reach the same solution under idealized assumptions, such as sufficient student capacity and exact global optimization. Proposition [2.1](https://arxiv.org/html/2606.00306#S2.Thmtheorem1) (proof provided in Appendix [B](https://arxiv.org/html/2606.00306#A2)) shows that temperature alone is sufficient to recover this equivalence: as $\tau\to\infty$, both objectives share the same gradient and reduce to teacher–student logit matching. Thus, the practical distinction between FKL and RKL must come from the low-temperature regime.
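The limit in Proposition 2.1 can be probed numerically by multiplying both gradients by $V\tau^{2}$ and comparing the result with the logit gap $z_{i}^{s}-z_{i}^{t}$. The sketch below assumes small hand-picked centered logits and hypothetical helper names; it illustrates the statement and does not replace the proof in Appendix B.

```python
import math

def softmax_t(logits, tau):
    """Temperature-scaled softmax."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fkl_grad(zs, zt, tau):
    """Eq. (2)."""
    p, q = softmax_t(zt, tau), softmax_t(zs, tau)
    return [(qi - pi) / tau for pi, qi in zip(p, q)]

def rkl_grad(zs, zt, tau):
    """Eq. (3)."""
    p, q = softmax_t(zt, tau), softmax_t(zs, tau)
    loss = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return [qi * (math.log(qi / pi) - loss) / tau for pi, qi in zip(p, q)]

# Hypothetical centered logits: each list sums to zero, as the proposition assumes.
zt = [2.0, 0.0, -1.0, -1.0]
zs = [1.0, 1.0, -1.0, -1.0]
V = len(zt)
logit_gap = [s - t for s, t in zip(zs, zt)]

for tau in (1.0, 10.0, 1000.0):
    scale = V * tau ** 2
    f = [scale * g for g in fkl_grad(zs, zt, tau)]
    r = [scale * g for g in rkl_grad(zs, zt, tau)]
    f_err = max(abs(a - b) for a, b in zip(f, logit_gap))
    r_err = max(abs(a - b) for a, b in zip(r, logit_gap))
    print(f"tau={tau}: max deviation from logit gap, FKL={f_err:.4f}, RKL={r_err:.4f}")
```

For these logits the deviations shrink as $\tau$ increases, consistent with the $O(\tau^{-3})$ remainder once the gradients are scaled by $V\tau^{2}$.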

Low-temperature regime: temperature separates FKL from RKL. At practical temperatures, FKL directly matches the student distribution to the softened teacher distribution, whereas RKL still operates through student-weighted teacher–student logit gaps. For FKL, $p^{\tau}$ serves as the probability target, and changing $\tau$ changes the supervision itself:

$$\frac{\partial p_{i}^{\tau}}{\partial\tau}=\frac{p_{i}^{\tau}}{\tau^{2}}\left(\mathbb{E}_{j\sim p^{\tau}}[z_{j}^{t}]-z_{i}^{t}\right).$$

Increasing temperature reduces the dominance of high-logit tokens and redistributes probability mass to lower-logit tokens. Together with Eq. ([2](https://arxiv.org/html/2606.00306#S2.E2)), this shows that temperature changes both the strength of the FKL update and the teacher distribution the student is trained to match.
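For readers who want the intermediate steps, the derivative of $p_{i}^{\tau}$ with respect to $\tau$ follows from differentiating the log-probability. The sketch below introduces the partition function $Z_{t}(\tau)$ as an auxiliary symbol that does not appear in the text above.

```latex
% Sketch: derivative of the softened teacher probability with respect to tau.
% Z_t(\tau) is an auxiliary symbol introduced here for illustration only.
\begin{align*}
\log p_i^{\tau} &= \frac{z_i^{t}}{\tau} - \log Z_t(\tau),
  \qquad Z_t(\tau) = \sum_{j=1}^{V} \exp\!\left(z_j^{t}/\tau\right), \\
\frac{\partial \log Z_t(\tau)}{\partial \tau}
  &= -\frac{1}{\tau^{2}} \sum_{j=1}^{V} p_j^{\tau} z_j^{t}
   = -\frac{1}{\tau^{2}}\, \mathbb{E}_{j \sim p^{\tau}}\!\left[z_j^{t}\right], \\
\frac{\partial \log p_i^{\tau}}{\partial \tau}
  &= -\frac{z_i^{t}}{\tau^{2}}
     + \frac{1}{\tau^{2}}\, \mathbb{E}_{j \sim p^{\tau}}\!\left[z_j^{t}\right], \\
\frac{\partial p_i^{\tau}}{\partial \tau}
  &= p_i^{\tau}\, \frac{\partial \log p_i^{\tau}}{\partial \tau}
   = \frac{p_i^{\tau}}{\tau^{2}}
     \left( \mathbb{E}_{j \sim p^{\tau}}\!\left[z_j^{t}\right] - z_i^{t} \right).
\end{align*}
```

The sign makes the redistribution explicit: a token whose logit exceeds the teacher-weighted mean logit loses probability as $\tau$ increases, while a token below that mean gains probability.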

###### Proposition 2.2.

Let $\Delta_{i}=z_{i}^{s}-z_{i}^{t}$. The RKL gradient can then be rewritten as

$$\nabla_{z_{i}^{s}}\mathcal{L}_{\mathrm{RKL}}=\frac{q_{i}^{\tau}}{\tau^{2}}\left(\Delta_{i}-\mathbb{E}_{j\sim q^{\tau}}[\Delta_{j}]\right). \tag{4}$$

For RKL, temperature plays a different role. Proposition [2.2](https://arxiv.org/html/2606.00306#S2.Thmtheorem2) (see Appendix [C](https://arxiv.org/html/2606.00306#A3) for proof) shows that the RKL gradient is driven by teacher–student logit gaps rather than direct probability matching between student and softened teacher distributions as in FKL. Thus, temperature mainly conditions RKL optimization by rescaling the gradient by $1/\tau^{2}$ and smoothing the student-side weights $q_{i}^{\tau}$.
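One short route from Eq. (3) to Eq. (4) is sketched below. It may differ from the proof in Appendix C, and the partition functions $Z_{s}(\tau)$, $Z_{t}(\tau)$ and the constant $c$ are auxiliary symbols introduced only for this sketch.

```latex
% Sketch: rewriting the RKL gradient (Eq. 3) in terms of logit gaps (Eq. 4).
% Z_s, Z_t, and c are auxiliary symbols introduced for illustration only.
\begin{align*}
\log\frac{q_i^{\tau}}{p_i^{\tau}}
  &= \frac{z_i^{s} - z_i^{t}}{\tau} - \log Z_s(\tau) + \log Z_t(\tau)
   = \frac{\Delta_i}{\tau} + c,
  \qquad c = \log\frac{Z_t(\tau)}{Z_s(\tau)}, \\
\mathcal{L}_{\mathrm{RKL}}
  &= \sum_{j=1}^{V} q_j^{\tau} \log\frac{q_j^{\tau}}{p_j^{\tau}}
   = \frac{1}{\tau}\, \mathbb{E}_{j \sim q^{\tau}}\!\left[\Delta_j\right] + c, \\
\log\frac{q_i^{\tau}}{p_i^{\tau}} - \mathcal{L}_{\mathrm{RKL}}
  &= \frac{1}{\tau} \left( \Delta_i - \mathbb{E}_{j \sim q^{\tau}}\!\left[\Delta_j\right] \right), \\
\nabla_{z_i^{s}} \mathcal{L}_{\mathrm{RKL}}
  &= \frac{1}{\tau}\, q_i^{\tau}
     \left( \log\frac{q_i^{\tau}}{p_i^{\tau}} - \mathcal{L}_{\mathrm{RKL}} \right)
   = \frac{q_i^{\tau}}{\tau^{2}}
     \left( \Delta_i - \mathbb{E}_{j \sim q^{\tau}}\!\left[\Delta_j\right] \right).
\end{align*}
```

The constant $c$ cancels, so the teacher enters the RKL gradient only through the raw logit gaps $\Delta_{i}$; the temperature appears in the $1/\tau^{2}$ factor and in the student weights $q^{\tau}$.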

Insight. Temperature changes *what* FKL learns, but mainly *how* RKL learns. In the low-temperature regime, FKL receives richer non-target supervision from the softened teacher distribution, while RKL mainly receives gradient rescaling.

## 3 When Does FKL Outperform RKL?

Our analysis leads to a natural question: *Can temperature change which KL objective is better?* We answer this question empirically on five instruction-following benchmarks, including Dolly Eval, Self-Instruct (Wang et al., [2023](https://arxiv.org/html/2606.00306#bib.bib27)), Vicuna Eval (Chiang et al., [2023](https://arxiv.org/html/2606.00306#bib.bib28)), Super-Natural Instructions (Super-NI) (Wang et al., [2022](https://arxiv.org/html/2606.00306#bib.bib29)), and Unnatural Instructions (UnNI) (Honovich et al., [2023](https://arxiv.org/html/2606.00306#bib.bib30)), using GPT-2 (Radford et al., [2019](https://arxiv.org/html/2606.00306#bib.bib14)) and OPT models (Zhang et al., [2022](https://arxiv.org/html/2606.00306#bib.bib19)), with implementation details provided in Appendix [A](https://arxiv.org/html/2606.00306#A1). Figure [1](https://arxiv.org/html/2606.00306#S1.F1) presents the comparative average ROUGE-L across five datasets and reveals a clear reversal. Under $\tau=1$, RKL outperforms FKL, consistent with prior LLM distillation results. However, once temperature is introduced, FKL consistently surpasses RKL, showing that its apparent weakness at $\tau=1$ stems from insufficient access to softened non-target teacher signals. In particular, we draw the following insights.

Temperature turns FKL from weak to outperforming RKL. Increasing temperature softens the distribution, exposes richer teacher information, and yields large gains. For GPT-2 Base, FKL improves from 19.41 to 20.97 average ROUGE-L at $\tau=3.0$. Similar gains appear for larger students: GPT-2 Medium rises from around 19 to above 22, and OPT-1.3B from below 21 to above 24, with best results around $\tau=2.5$–$3.5$. Across model scales, these gains reverse the ranking: FKL shifts from trailing RKL at $\tau=1$ to outperforming it at proper temperatures. For example, FKL exceeds RKL by 1.5 points at $\tau=3.0$ on GPT-2 Base, by over 1.0 point at $\tau=3.5$ on GPT-2 Medium, and by 0.27 points at $\tau=2.0$ on OPT-1.3B.

RKL benefits less from temperature. For GPT-2 Base, RKL remains nearly flat across temperatures and can even degrade at larger $\tau$. For larger students, the gains are modest, about 1 point on GPT-2 Medium and 2 points on OPT-1.3B, compared with consistent gains above 3 points for FKL. This agrees with our theoretical analysis: temperature mainly smooths and reconditions the RKL gradient without giving RKL direct access to the softened teacher distribution as a probability target.

Insight. FKL outperforms RKL when temperature exposes useful non-target teacher information. At $\tau=1$, FKL underuses the teacher because the target is too sharp. At higher temperature, the softened distribution becomes sufficiently informative for FKL to surpass RKL.

## 4 Does Temperature Broadly Improve KL-based Distillation Objectives?

We now ask whether temperature is broadly beneficial for KL-based distillation objectives. To answer this, we evaluate FKL, RKL, and Sym-KL, together with recent objectives such as JS (Agarwal et al., [2024](https://arxiv.org/html/2606.00306#bib.bib7)), SFKL/SRKL (Ko et al., [2024](https://arxiv.org/html/2606.00306#bib.bib17)), AKL (Wu et al., [2025](https://arxiv.org/html/2606.00306#bib.bib1)), AB (Wang et al., [2025](https://arxiv.org/html/2606.00306#bib.bib18)), and DRKL (Luong et al., [2026a](https://arxiv.org/html/2606.00306#bib.bib45)). Experiments are conducted on three instruction-following benchmarks, Dolly, Super-NI, and UnNI, with two additional benchmarks reported in Appendix [D](https://arxiv.org/html/2606.00306#A4) and implementation details in Appendix [A](https://arxiv.org/html/2606.00306#A1).
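To make the objective family concrete, the sketch below evaluates FKL, RKL, Sym-KL, and JS on temperature-scaled distributions. It assumes the textbook equal-weight forms of Sym-KL and JS; the exact weightings used in the experiments are not stated in this section, and SFKL, SRKL, AKL, AB, and DRKL are not reproduced here. The logits and helper names are hypothetical.

```python
import math

def softmax_t(logits, tau):
    """Temperature-scaled softmax."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(a, b):
    """D_KL(a || b) for strictly positive distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def sym_kl(p, q):
    """Equal-weight average of forward and reverse KL (assumed weighting)."""
    return 0.5 * kl(p, q) + 0.5 * kl(q, p)

def js(p, q):
    """Jensen-Shannon divergence with an equal-weight mixture (assumed weighting)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical teacher and student logits (not from the paper).
teacher_logits = [5.0, 1.0, 0.0, -1.0]
student_logits = [2.0, 2.0, 0.0, -2.0]

for tau in (1.0, 2.0, 3.0):
    p = softmax_t(teacher_logits, tau)
    q = softmax_t(student_logits, tau)
    print(f"tau={tau}: FKL={kl(p, q):.4f} RKL={kl(q, p):.4f} "
          f"SymKL={sym_kl(p, q):.4f} JS={js(p, q):.4f}")
```

Because every objective here consumes the same pair $(p^{\tau}, q^{\tau})$, adding temperature to any of them only requires changing how the two distributions are computed.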

![Refer to caption](https://arxiv.org/html/2606.00306v1/x4.png)

Figure 2: Temperature-sensitivity heatmap on GPT-2 XL → GPT-2 Base.

Temperature broadly improves KD objectives. Table [1](https://arxiv.org/html/2606.00306#S4.T1) compares each objective at default $\tau=1$ with its temperature-tuned variant. Temperature improves all objectives across GPT-2 and OPT models, although the gains vary substantially across objectives. For GPT-2 Base, FKL improves from 19.70 to 25.53 average ROUGE-L (+5.83), while AKL achieves the largest gain (+6.62). JS, SFKL, and Sym-KL also improve notably, whereas RKL and SRKL gain only +0.49 and +0.79, supporting our analysis that temperature is especially beneficial for FKL-based objectives that directly exploit softened teacher probabilities. For OPT-1.3B, the gains become broadly substantial, ranging from +3.46 for RKL to +4.88 for AKL. Figure [2](https://arxiv.org/html/2606.00306#S4.F2) visualizes ROUGE-L gains across temperatures and objectives. FKL-based objectives (FKL, Sym-KL, JS, and AKL) exhibit consistent gains, while RKL-based objectives (RKL, SRKL, and SFKL) are less stable and can even degrade. Thus, temperature is more robustly beneficial for FKL-based objectives but requires careful tuning for RKL-based ones.

Temperature changes objective rankings. Applying KD temperature substantially strengthens standard KL-based objectives and changes their relative ranking against stronger baselines. On GPT-2 Base, JS+$\tau$ achieves the best average score (25.63), while FKL+$\tau$ reaches 25.53, surpassing state-of-the-art AB (23.91) and DRKL (25.16). On OPT-1.3B, FKL+$\tau$ improves to 29.08, outperforming both AB (25.96) and DRKL (27.71), while AKL+$\tau$ achieves the best overall score (29.62). Thus, temperature is not a minor implementation detail: it can change which objective appears strongest and is necessary for fair comparison.

Table 1: Effect of temperature scaling on distillation losses. We report ROUGE-L mean and standard deviation over five random seeds. Avg. is computed over Dolly, Super-NI, and UnNI.

Case study. Appendix [E](https://arxiv.org/html/2606.00306#A5) provides case studies on the Unnatural Instructions test set, comparing model outputs under different objectives.

Insight. Temperature broadly improves KL-based distillation. Its gains are largest for FKL-based objectives, but weaker for RKL-based objectives. Thus, KD temperature should be considered essential for fair comparison, rather than a minor implementation detail.

## 5 Conclusion

In this work, we revisit the role of KD temperature in LLM distillation and show that it can change the relative behavior of FKL and RKL. KD temperature makes non-target teacher probabilities more informative, allowing FKL to transfer richer knowledge and surpass RKL despite underperforming at $\tau=1$. Beyond that, KD temperature also improves a broad range of KL-based distillation objectives, showing that it is essential for fair comparison.

## Limitations

Due to computational constraints, our language experiments use teacher models with up to 7B parameters, and evaluating larger teachers remains future work. We also exclude on-policy sampling, stronger student initialization, auxiliary supervision, and task-specific enhancements, although these techniques may further improve performance. This design ensures a controlled comparison across distillation objectives, allowing us to isolate the effect of the objective itself.

## References

- Agarwal et al. (2024). On-policy distillation of language models: learning from self-generated mistakes. In The twelfth international conference on learning representations.
- L. Beyer, X. Zhai, A. Royer, L. Markeeva, R. Anil, and A. Kolesnikov (2022). Knowledge distillation: a good teacher is patient and consistent. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10925–10934.
- W. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing (2023). Vicuna: an open-source chatbot impressing gpt-4 with 90%* chatgpt quality. [Link](https://lmsys.org/blog/2023-03-30-vicuna/)
- J. H. Cho and B. Hariharan (2019). On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4794–4802.
- M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and R. Xin (2023).
- J. Cui, Z. Tian, Z. Zhong, X. Qi, B. Yu, and H. Zhang (2024). Decoupled kullback-leibler divergence loss. Advances in Neural Information Processing Systems 37, pp. 74461–74486.
- J. Gou, B. Yu, S. J. Maybank, and D. Tao (2021). Knowledge distillation: a survey. International journal of computer vision 129(6), pp. 1789–1819.
- Y. Gu, L. Dong, F. Wei, and M. Huang (2024). Minillm: knowledge distillation of large language models. In International Conference on Learning Representations.
- G. Hinton, O. Vinyals, and J. Dean (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- O. Honovich, T. Scialom, O. Levy, and T. Schick (2023). Unnatural instructions: tuning language models with (almost) no human labor. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14409–14428.
- X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu (2020). Tinybert: distilling bert for natural language understanding. In Findings of the association for computational linguistics: EMNLP 2020, pp. 4163–4174.
- Y. Jin, J. Wang, and D. Lin (2023). Multi-level logit distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 24276–24285.
- Y. Kim and A. M. Rush (2016). Sequence-level knowledge distillation. In Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 1317–1327.
- J. Ko, S. Kim, T. Chen, and S. Yun (2024). DistiLLM: towards streamlined distillation for large language models. In Forty-first International Conference on Machine Learning.
- Z. Li, X. Li, L. Yang, B. Zhao, R. Song, L. Luo, J. Li, and J. Yang (2023). Curriculum temperature for knowledge distillation. In Proceedings of the AAAI conference on artificial intelligence, Vol. 37, pp. 1504–1512.
- C. Lin (2004). Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81.
- H. Luong, D. B. Tran, and L. Chen (2026a). Diversity-aware reverse kullback-leibler divergence for large language model distillation. arXiv preprint arXiv:2604.00223.
- H. Luong, N. Van Vo, K. Zhao, and L. Chen (2026b). Consistently informative soft-label temperature for knowledge distillation. arXiv preprint arXiv:2605.20357.
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019). Language models are unsupervised multitask learners. OpenAI blog 1(8), pp. 9.
- A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2015). FitNets: hints for thin deep nets. In Proceedings of ICLR.
- V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2020). DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv:1910.01108. [Link](https://arxiv.org/abs/1910.01108)
- M. Song and M. Zheng (2026). A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626.
- S. Sun, W. Ren, J. Li, R. Wang, and X. Cao (2024). Logit standardization in knowledge distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15731–15740.
- W. Sun, D. Chen, S. Lyu, G. Chen, C. Chen, and C. Wang (2025). Knowledge distillation with refined logits. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1110–1119.
- J. Tang, R. Shivanna, Z. Zhao, D. Lin, A. Singh, E. H. Chi, and S. Jain (2020). Understanding and improving knowledge distillation. arXiv preprint arXiv:2002.03532.
- G. Wang, Z. Yang, Z. Wang, S. Wang, Q. Xu, and Q. Huang (2025). ABKD: pursuing a proper allocation of the probability mass in knowledge distillation via $\alpha$-$\beta$-divergence. In Proceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 267, pp. 65167–65212.
- Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi (2023). Self-instruct: aligning language models with self-generated instructions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pp. 13484–13508.
- Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, et al. (2022). Super-naturalinstructions: generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 conference on empirical methods in natural language processing, pp. 5085–5109.
- S. Wei, C. Luo, and Y. Luo (2024). Scaled decoupled distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15975–15983.
- T. Wu, C. Tao, J. Wang, R. Yang, Z. Zhao, and N. Wong (2025). Rethinking kullback-leibler divergence in knowledge distillation for large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pp. 5737–5755.
- X. Xie, Z. Xue, J. Wu, J. Li, Y. Wang, X. Hu, Y. Liu, and J. Zhang (2026). Llm-oriented token-adaptive knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40, pp. 34070–34078.
- S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al. (2022). Opt: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
- B. Zhao, Q. Cui, R. Song, Y. Qiu, and J. Liang (2022). Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp. 11953–11962.

This appendix provides additional implementation details and supporting theoretical and empirical materials for the main paper:

- Appendix [A](https://arxiv.org/html/2606.00306#A1) presents implementation, training, and evaluation details, including datasets, teacher–student settings, compared objectives, and temperature tuning protocols.
- Appendix [B](https://arxiv.org/html/2606.00306#A2) provides the proof of Proposition [2.1](https://arxiv.org/html/2606.00306#S2.Thmtheorem1).
- Appendix [C](https://arxiv.org/html/2606.00306#A3) provides the proof of Proposition [2.2](https://arxiv.org/html/2606.00306#S2.Thmtheorem2).
- Appendix [D](https://arxiv.org/html/2606.00306#A4) reports additional experimental results on the Self-Instruct and Vicuna benchmarks across objectives.
- Appendix [E](https://arxiv.org/html/2606.00306#A5) presents qualitative case studies illustrating how KD temperature improves instruction following and output quality under different distillation objectives.
- Appendix [F](https://arxiv.org/html/2606.00306#A6) describes the limited use of AI assistants for language refinement during manuscript preparation.

## Appendix A Implementation Details

Training resources\. All experiments were conducted on a GPU cluster with four NVIDIA A100 GPUs, each with 40GB memory\.

Experimental setup. We evaluate LLM distillation in an off-policy instruction-following setting, where student models are trained on fixed teacher-generated responses. Following Gu et al. ([2024](https://arxiv.org/html/2606.00306#bib.bib8)), we use the instruction–response dataset constructed from `databricks-dolly-15k` (Conover et al., [2023](https://arxiv.org/html/2606.00306#bib.bib26)), containing 14k training examples, 500 validation examples, and 500 test examples. We first fine-tune the teacher models on this dataset and then distill their responses into smaller students. Our teacher–student pairs are GPT-2 XL (Radford et al., [2019](https://arxiv.org/html/2606.00306#bib.bib14)) (1.5B) $\rightarrow$ GPT-2 Base (120M) and OPT-6.7B (Zhang et al., [2022](https://arxiv.org/html/2606.00306#bib.bib19)) $\rightarrow$ OPT-1.3B.

Training protocol. To isolate the effect of the distillation objective and KD temperature, all methods are trained under the same protocol. For both the GPT-2 and OPT families, we use a batch size of 32, train for 20 epochs, and set the maximum input length to 512 tokens. Following Gu et al. ([2024](https://arxiv.org/html/2606.00306#bib.bib8)), the learning rate is set to $5\times 10^{-4}$ for GPT-2 Base and $5\times 10^{-5}$ for OPT-1.3B. All compared methods use the same training data, teacher checkpoint, student initialization, and hyperparameter budget, so that performance differences are attributable to the distillation objective and temperature design rather than to implementation or training discrepancies.

Evaluation protocol. We evaluate the distilled students on five instruction-following benchmarks: Dolly Eval, Self-Instruct (Wang et al., [2023](https://arxiv.org/html/2606.00306#bib.bib27)), Vicuna Eval (Chiang et al., [2023](https://arxiv.org/html/2606.00306#bib.bib28)), Super-Natural Instructions (Super-NI) (Wang et al., [2022](https://arxiv.org/html/2606.00306#bib.bib29)), and Unnatural Instructions (UnNI) (Honovich et al., [2023](https://arxiv.org/html/2606.00306#bib.bib30)). We report ROUGE-L (Lin, [2004](https://arxiv.org/html/2606.00306#bib.bib20)) averaged over five random seeds $\{10, 20, 30, 40, 50\}$. Model checkpoints are saved after each epoch, and final results are reported using the checkpoint with the best validation ROUGE-L. The decoding temperature is set to 1 during evaluation.
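For readers unfamiliar with the metric, the following is a minimal sketch of a sentence-level ROUGE-L score based on the longest common subsequence (LCS). It assumes lowercased whitespace tokenization and the balanced F1 form; the evaluation scripts used in the paper may tokenize and weight precision and recall differently, and the function names `lcs_length` and `rouge_l_f1` are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """LCS-based F1 between a reference and a candidate string.

    Assumption: lowercased whitespace tokens and equal weight on
    precision and recall (not necessarily the paper's exact setup).
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy strings for illustration only; not taken from any benchmark.
    print(rouge_l_f1("the cat sat on the mat", "the cat on the mat"))
```

Because the score rewards in-order overlap, dropping or reordering required tokens lowers it, which is why it is sensitive to the format errors discussed in the case studies.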

Compared objectives. We compare KL-family and recent LLM distillation objectives under the same off-policy setting. The KL-family baselines include FKL, RKL (Gu et al., [2024](https://arxiv.org/html/2606.00306#bib.bib8)), symmetric KL (Sym-KL), Jensen–Shannon divergence (JS) (Agarwal et al., [2024](https://arxiv.org/html/2606.00306#bib.bib7)), skewed FKL (SFKL) (Ko et al., [2024](https://arxiv.org/html/2606.00306#bib.bib17)), and skewed RKL (SRKL) (Ko et al., [2024](https://arxiv.org/html/2606.00306#bib.bib17)). For Sym-KL, we use $0.5\,\mathrm{FKL}+0.5\,\mathrm{RKL}$, and for SFKL and SRKL, we set the smoothing parameter to $\alpha=0.1$. We also compare with stronger recent objectives, including Adaptive KL (AKL) (Wu et al., [2025](https://arxiv.org/html/2606.00306#bib.bib1)), $\alpha$-$\beta$ divergence (AB) (Wang et al., [2025](https://arxiv.org/html/2606.00306#bib.bib18)) with $(\alpha=0.2, \beta=0.7)$, and diversity-aware RKL (DRKL) (Luong et al., [2026a](https://arxiv.org/html/2606.00306#bib.bib45)). Since our focus is off-policy distillation from fixed teacher responses, we exclude methods that require on-policy sampling or additional external datasets. For all baselines, we follow the hyperparameter settings from the original papers and use the implementation provided by DistiLLM (Ko et al., [2024](https://arxiv.org/html/2606.00306#bib.bib17)).
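As a rough illustration of how the KL-family objectives relate to one another, the sketch below evaluates them on a single token position with temperature-softened distributions. It is not the DistiLLM implementation: the JS form uses an equal-weight mixture, and the skewed variants follow one common reading of Ko et al. (SFKL as $\mathrm{KL}(p \,\|\, \alpha p + (1-\alpha) q)$ and SRKL as $\mathrm{KL}(q \,\|\, (1-\alpha) p + \alpha q)$). These forms, and all function names, are assumptions made for illustration.

```python
import math


def softmax_t(logits, tau=1.0):
    """Temperature-scaled softmax (max-subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(a, b):
    """KL(a || b) for two strictly positive discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


def mix(a, b, w):
    """Pointwise mixture w * a + (1 - w) * b."""
    return [w * ai + (1.0 - w) * bi for ai, bi in zip(a, b)]


def kl_family(student_logits, teacher_logits, tau=1.0, alpha=0.1):
    """Per-token values of the KL-family objectives at KD temperature tau."""
    p = softmax_t(teacher_logits, tau)  # teacher distribution p^tau
    q = softmax_t(student_logits, tau)  # student distribution q^tau
    fkl, rkl = kl(p, q), kl(q, p)
    m = mix(p, q, 0.5)
    return {
        "FKL": fkl,
        "RKL": rkl,
        "Sym-KL": 0.5 * fkl + 0.5 * rkl,
        "JS": 0.5 * kl(p, m) + 0.5 * kl(q, m),   # assumed equal-weight JS
        "SFKL": kl(p, mix(p, q, alpha)),         # assumed skew definition
        "SRKL": kl(q, mix(q, p, alpha)),         # assumed skew definition
    }


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.0]   # toy logits, not real model outputs
    student = [0.5, 1.5, 0.0]
    for tau in (1.0, 2.0, 4.0):
        print(tau, kl_family(student, teacher, tau))
```

Raising `tau` flattens both distributions, so non-dominant tokens carry a larger share of each sum.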

For the temperature-tuned variants reported in Tables [1](https://arxiv.org/html/2606.00306#S4.T1) and [2](https://arxiv.org/html/2606.00306#A3.T2), including FKL+$\tau$, RKL+$\tau$, Sym-KL+$\tau$, JS+$\tau$, SFKL+$\tau$, SRKL+$\tau$, and AKL+$\tau$, we apply KD temperature on top of each original distillation objective while keeping its implementation and default hyperparameters unchanged. The only tuned factor is the KD temperature, for which we perform a grid search over $\tau \in \{1.5, 2.0, 2.5, 3.0, 3.5, 4.0\}$ and report the best-performing setting for comparison. Therefore, the gains reported in Tables [1](https://arxiv.org/html/2606.00306#S4.T1) and [2](https://arxiv.org/html/2606.00306#A3.T2) are attributable solely to KD temperature, not to additional objective-specific hyperparameter tuning. Further tuning of each objective's own hyperparameters could potentially yield additional improvements.
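The tuning protocol amounts to a one-dimensional grid search. The sketch below shows one way it could be organized. The callback `train_and_validate` is hypothetical and stands in for a full distillation run that returns a score for a given $\tau$. Selecting the best $\tau$ by validation ROUGE-L is an assumption here, since the text only states that the best-performing setting is reported.

```python
TAU_GRID = (1.5, 2.0, 2.5, 3.0, 3.5, 4.0)  # grid stated in the text


def select_temperature(train_and_validate, grid=TAU_GRID):
    """Run one training job per temperature and keep the best score.

    `train_and_validate` is a hypothetical callable mapping a KD
    temperature to a scalar score (higher is better). Every other
    hyperparameter is assumed to stay fixed inside it.
    """
    scores = {tau: train_and_validate(tau) for tau in grid}
    best_tau = max(scores, key=scores.get)
    return best_tau, scores


if __name__ == "__main__":
    # Placeholder scoring function for demonstration; NOT a real result.
    def fake_run(tau):
        return -(tau - 2.5) ** 2

    best, all_scores = select_temperature(fake_run)
    print(best, all_scores)
```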

## Appendix B Proof for Proposition [2.1](https://arxiv.org/html/2606.00306#S2.Thmtheorem1)

###### Proof\.

For large $\tau$, we use the first-order expansion $\exp(z_i/\tau) = 1 + z_i/\tau + O(\tau^{-2})$. For the teacher distribution, this gives

$$
\begin{aligned}
\sum_{j}\exp(z_{j}^{t}/\tau) &= V + \frac{1}{\tau}\sum_{j} z_{j}^{t} + O(\tau^{-2}) \\
&= V + O(\tau^{-2}),
\end{aligned}
$$

where the last equality follows from the centered-logit assumption $\sum_{j} z_{j}^{t} = 0$. Thus,

$$
p_{i}^{\tau} = \frac{1 + z_{i}^{t}/\tau + O(\tau^{-2})}{V + O(\tau^{-2})} = \frac{1}{V} + \frac{z_{i}^{t}}{V\tau} + O(\tau^{-2}).
$$

Applying the same argument to the student logits gives

$$
q_{i}^{\tau} = \frac{1}{V} + \frac{z_{i}^{s}}{V\tau} + O(\tau^{-2}).
$$

For the FKL part. Substituting these into the FKL gradient in Eq. ([2](https://arxiv.org/html/2606.00306#S2.E2)) gives

$$
\nabla_{z_{i}^{s}}\mathcal{L}_{\mathrm{FKL}} = \frac{1}{V\tau^{2}}(z_{i}^{s} - z_{i}^{t}) + O(\tau^{-3}).
$$

For the RKL part. Since $q_{j}^{\tau} = 1/V + O(\tau^{-1})$ and the centered-logit assumption gives $\sum_{j}\Delta_{j} = 0$, we have

$$
\begin{aligned}
\mathbb{E}_{j\sim q^{\tau}}[\Delta_{j}] &= \sum_{j} q_{j}^{\tau}\Delta_{j} \\
&= \frac{1}{V}\sum_{j}\Delta_{j} + \sum_{j} O(\tau^{-1})\Delta_{j} \\
&= O(\tau^{-1}).
\end{aligned}
$$

Using this, Proposition [2.2](https://arxiv.org/html/2606.00306#S2.Thmtheorem2), and $q_{i}^{\tau} = 1/V + O(\tau^{-1})$ as $\tau\to\infty$, we have

$$
\nabla_{z_{i}^{s}}\mathcal{L}_{\mathrm{RKL}} = \frac{1}{V\tau^{2}}(z_{i}^{s} - z_{i}^{t}) + O(\tau^{-3}).
$$

This completes the proof. ∎
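The large-$\tau$ limit can be checked numerically. The sketch below is a sanity check under stated assumptions, not part of the proof. It takes the FKL gradient to be $(q_i^\tau - p_i^\tau)/\tau$, which is what differentiating $\mathrm{KL}(p^\tau \,\|\, q^\tau)$ with respect to $z_i^s$ yields and is assumed to match Eq. (2). It uses the closed form of Proposition 2.2 for RKL, with $\Delta_i = z_i^s - z_i^t$ as implied by Eq. (5). The logits are arbitrary toy values, centered to satisfy the assumption of the proposition.

```python
import math


def softmax_t(logits, tau):
    """Temperature-scaled softmax (max-subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def center(logits):
    """Shift logits so they sum to zero (centered-logit assumption)."""
    mean = sum(logits) / len(logits)
    return [z - mean for z in logits]


def fkl_grad(z_s, z_t, tau):
    """Gradient of KL(p^tau || q^tau) w.r.t. student logits: (q - p) / tau."""
    p, q = softmax_t(z_t, tau), softmax_t(z_s, tau)
    return [(qi - pi) / tau for pi, qi in zip(p, q)]


def rkl_grad(z_s, z_t, tau):
    """Closed-form RKL gradient: q_i / tau^2 * (Delta_i - E_q[Delta])."""
    q = softmax_t(z_s, tau)
    delta = [s - t for s, t in zip(z_s, z_t)]
    mean_delta = sum(qi * d for qi, d in zip(q, delta))
    return [qi / tau**2 * (d - mean_delta) for qi, d in zip(q, delta)]


def scaled_gap(grad, z_s, z_t, tau):
    """Max over tokens of |tau^2 * grad_i - (z_s_i - z_t_i) / V|."""
    v = len(z_s)
    return max(abs(tau**2 * g - (s - t) / v) for g, s, t in zip(grad, z_s, z_t))


if __name__ == "__main__":
    z_t = center([2.0, 0.5, -1.0, 0.0])   # toy teacher logits
    z_s = center([0.3, -0.2, 1.0, -0.5])  # toy student logits
    for tau in (1.0, 10.0, 100.0, 1000.0):
        gap_f = scaled_gap(fkl_grad(z_s, z_t, tau), z_s, z_t, tau)
        gap_r = scaled_gap(rkl_grad(z_s, z_t, tau), z_s, z_t, tau)
        print(f"tau={tau:g}  FKL gap={gap_f:.2e}  RKL gap={gap_r:.2e}")
```

Both gaps should shrink roughly in proportion to $1/\tau$, consistent with the $O(\tau^{-3})$ remainder after multiplying by $\tau^{2}$.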

## Appendix C Proof for Proposition [2.2](https://arxiv.org/html/2606.00306#S2.Thmtheorem2)

###### Proof\.

The log-ratio can be decomposed as

$$
\log\frac{q_{i}^{\tau}}{p_{i}^{\tau}} = \frac{\Delta_{i}}{\tau} + C_{\tau}, \quad \text{where } C_{\tau} = \log\frac{Z_{t}^{\tau}}{Z_{s}^{\tau}}, \tag{5}
$$

with $Z_{t}^{\tau} = \sum_{j}\exp(z_{j}^{t}/\tau)$ and $Z_{s}^{\tau} = \sum_{j}\exp(z_{j}^{s}/\tau)$. The term $C_{\tau}$ comes only from the softmax normalization constants and is independent of the token index $i$. From the definition of $\mathcal{L}_{\mathrm{RKL}}$, we have

$$
\mathcal{L}_{\mathrm{RKL}} = \mathbb{E}_{j\sim q^{\tau}}\left[\log\frac{q_{j}^{\tau}}{p_{j}^{\tau}}\right] = \frac{1}{\tau}\mathbb{E}_{j\sim q^{\tau}}[\Delta_{j}] + C_{\tau}.
$$

Subtracting this expression for $\mathcal{L}_{\mathrm{RKL}}$ from Eq. ([5](https://arxiv.org/html/2606.00306#A3.E5)), we obtain

$$
\log\frac{q_{i}^{\tau}}{p_{i}^{\tau}} - \mathcal{L}_{\mathrm{RKL}} = \frac{1}{\tau}\left(\Delta_{i} - \mathbb{E}_{j\sim q^{\tau}}[\Delta_{j}]\right).
$$

Substituting this into Eq. ([3](https://arxiv.org/html/2606.00306#S2.E3)) gives

$$
\nabla_{z_{i}^{s}}\mathcal{L}_{\mathrm{RKL}} = \frac{q_{i}^{\tau}}{\tau^{2}}\left(\Delta_{i} - \mathbb{E}_{j\sim q^{\tau}}[\Delta_{j}]\right).
$$

This completes the proof. ∎
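The closed form above can be compared against a finite-difference gradient of the RKL loss. The following sketch does this for toy logits, taking $\Delta_i = z_i^s - z_i^t$ as implied by Eq. (5). It is an independent numerical check rather than the authors' code, and the helper names are illustrative. The identity does not rely on centered logits, so none are imposed.

```python
import math


def softmax_t(logits, tau):
    """Temperature-scaled softmax (max-subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rkl_loss(z_s, z_t, tau):
    """Reverse KL: KL(q^tau || p^tau) with q from student, p from teacher."""
    p, q = softmax_t(z_t, tau), softmax_t(z_s, tau)
    return sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))


def rkl_grad_closed_form(z_s, z_t, tau):
    """q_i / tau^2 * (Delta_i - E_{j~q}[Delta_j]), as in Proposition 2.2."""
    q = softmax_t(z_s, tau)
    delta = [s - t for s, t in zip(z_s, z_t)]
    mean_delta = sum(qi * d for qi, d in zip(q, delta))
    return [qi / tau**2 * (d - mean_delta) for qi, d in zip(q, delta)]


def rkl_grad_numeric(z_s, z_t, tau, eps=1e-6):
    """Central finite differences of rkl_loss w.r.t. each student logit."""
    grad = []
    for i in range(len(z_s)):
        up = list(z_s)
        down = list(z_s)
        up[i] += eps
        down[i] -= eps
        grad.append((rkl_loss(up, z_t, tau) - rkl_loss(down, z_t, tau)) / (2 * eps))
    return grad


if __name__ == "__main__":
    z_t = [2.0, 0.5, -1.0, 0.0]   # toy teacher logits
    z_s = [0.3, -0.2, 1.0, -0.5]  # toy student logits
    for tau in (1.0, 2.0, 4.0):
        exact = rkl_grad_closed_form(z_s, z_t, tau)
        approx = rkl_grad_numeric(z_s, z_t, tau)
        err = max(abs(a - b) for a, b in zip(exact, approx))
        print(f"tau={tau:g}  max abs difference={err:.2e}")
```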

Table 2: Effect of temperature scaling on distillation losses. We report ROUGE-L mean and standard deviation over five random seeds. Avg. is computed over Self-Inst. and Vicuna.

Table 3: Qualitative examples from the Unnatural Instructions (Honovich et al., [2023](https://arxiv.org/html/2606.00306#bib.bib30)) benchmark. The OPT-1.3B student is distilled from the OPT-6.7B teacher. With KD temperature, all distillation objectives generate responses that better satisfy the given instructions compared with their $\tau=1$ counterparts.

## Appendix D More Results

Additional benchmarks support the main finding. Table [2](https://arxiv.org/html/2606.00306#A3.T2) reports results on Self-Instruct and Vicuna. Temperature improves every objective across both GPT-2 Base and OPT-1.3B, confirming that its benefit generalizes beyond the main benchmarks. For GPT-2 Base, AKL and FKL obtain the largest gains, improving by $+2.94$ and $+2.33$ average ROUGE-L, respectively, and both surpass DRKL. For OPT-1.3B, all objectives improve by $+1.29$ to $+2.06$, with SRKL+$\tau$ achieving the best average score of $16.73$. These results further show that KD temperature is a broadly useful factor for KL-based distillation and can alter the ranking among objectives.

Table 4: Qualitative examples from the Unnatural Instructions (Honovich et al., [2023](https://arxiv.org/html/2606.00306#bib.bib30)) benchmark. The OPT-1.3B student is distilled from the OPT-6.7B teacher. With KD temperature, all distillation objectives generate responses that better satisfy the given instructions compared with their $\tau=1$ counterparts.

## Appendix E Case Studies

Tables [3](https://arxiv.org/html/2606.00306#A3.T3) and [4](https://arxiv.org/html/2606.00306#A4.T4) show qualitative examples from the Unnatural Instructions benchmark (Honovich et al., [2023](https://arxiv.org/html/2606.00306#bib.bib30)). The OPT-1.3B student is distilled from the OPT-6.7B teacher under different distillation objectives, with and without KD temperature. Across both cases, applying temperature consistently improves instruction following and output format compared with the corresponding $\tau=1$ objectives.

In Case study \#1, the task requires extracting proper nouns in their original order\. Without temperature, several objectives omit entities, add incorrect nouns, or generate unrelated continuations: RKL includes “Class,” SFKL omits “Mr\. Krebs,” and JS/SRKL produce long hallucinated passages\. With temperature, all KL\-based objectives produce the exact ground\-truth output: “Toby, Mr\. Krebs, Mrs\. Dvorak,” suggesting better preservation of fine\-grained constraints such as output format and entity selection\.

Case study #2 is more challenging, but shows a similar trend. At $\tau=1$, several objectives misunderstand the blank-filling instruction: FKL and RKL copy or corrupt the incomplete sentence, Sym-KL repeats the word options, and SRKL critiques the prompt instead of answering. With temperature, the outputs become more aligned with the intended completion, often expressing that the person did not want coffee. Although some responses do not exactly match the ground-truth word "no," they are substantially closer than their $\tau=1$ counterparts. Overall, these examples show that KD temperature improves not only ROUGE-L scores, but also the student's ability to follow instructions and produce task-appropriate responses.

## Appendix F Use of AI Assistants

In preparing this paper, we made limited use of AI Assistants such as ChatGPT only to help refine wording, correct grammar, and enhance clarity\. All core research contributions, including conceptual development, theoretical analysis, experimental evaluations, and paper organization and composition, were conducted entirely by the authors\.
