DuDi: Dual-Signal Distillation with Cross-Lingual Verbalizer

arXiv cs.CL 06/04/26, 04:00 AM Papers
Summary
DuDi is a dual-signal multilingual distillation framework combining sequence-level and token-level signals with a cross-lingual verbalizer to improve small language models' performance on Southeast Asian languages. Experiments on SEA-HELM show DuDi consistently outperforms competitive distillation baselines across multiple model families and scales.
arXiv:2606.04694v1 Announce Type: new Abstract: Small language models (SLMs) are efficient and scalable, but their multilingual capabilities degrade severely at sub-billion scales, especially for Southeast Asian (SEA) languages. We introduce DuDi, a dual-signal multilingual distillation framework that combines an online sequence-level signal with off-policy and on-policy token-level signals. DuDi further uses a cross-lingual verbalizer to refine teacher feedback and improve teacher-student transferability in multilingual settings. Experiments on SEA-HELM across multiple model families, scales, and teacher-student settings show that DuDi consistently outperforms competitive distillation baselines. Ablations and analyses confirm that sequence-level optimization, token-level supervision, and cross-lingual verbalization provide complementary and transferable learning signals for multilingual SLMs.
Cached at: 06/05/26, 02:17 AM
# DuDi: Dual-Signal Distillation with Cross-Lingual Verbalizer
Source: [https://arxiv.org/html/2606.04694](https://arxiv.org/html/2606.04694)
Patomporn Payoungkhamdee¹\*†, Tinnakit Udsa¹\*, Jian Gang Ngui², Sarana Nutanong¹, Alham Fikri Aji³, Peerat Limkonchotiwat²

¹ School of Information Science and Technology, VISTEC; ² AI Singapore; ³ MBZUAI

{patomporn.p_s21, tinnakit.u_s24}@vistec.ac.th, peerat@aisingapore.org

[GitHub](https://github.com/aisingapore/DuDi) | [Hugging Face](https://huggingface.co/collections/aisingapore/dudi-dual-signal-distillation-with-cross-lingual-verbalizer)

###### Abstract

Small language models \(SLMs\) are efficient and scalable, but their multilingual capabilities degrade severely at sub\-billion scales, especially for Southeast Asian \(SEA\) languages\. We introduce DuDi, a dual\-signal multilingual distillation framework that combines an online sequence\-level signal with off\-policy and on\-policy token\-level signals\. DuDi further uses a cross\-lingual verbalizer to refine teacher feedback and improve teacher\-student transferability in multilingual settings\. Experiments on SEA\-HELM across multiple model families, scales, and teacher–student settings show that DuDi consistently outperforms competitive distillation baselines\. Ablations and analyses confirm that sequence\-level optimization, token\-level supervision, and cross\-lingual verbalization provide complementary and transferable learning signals for multilingual SLMs\.


\* Equal contribution. † Work was conducted while Patomporn Payoungkhamdee was a visiting scholar at AI Singapore.

## 1 Introduction

Small language models (SLMs) have recently attracted growing attention due to their efficiency and scalability (Hu et al., [2024](https://arxiv.org/html/2606.04694#bib.bib16); Nguyen et al., [2024](https://arxiv.org/html/2606.04694#bib.bib39); Subramanian et al., [2025](https://arxiv.org/html/2606.04694#bib.bib40); Wang et al., [2024](https://arxiv.org/html/2606.04694#bib.bib41)). For instance, Qwen2.5-1.5B and Qwen2.5-0.5B achieve about 1.5× and 2.2× higher inference throughput than Qwen2.5-7B, while reducing memory use by 48% and 81%, respectively.¹ These gains reduce deployment costs and enable more efficient large-scale serving. Compared with larger models, SLMs offer practical advantages in scalability, computation, and memory usage (Hu et al., [2024](https://arxiv.org/html/2606.04694#bib.bib16)). These properties make them suitable for resource-constrained and edge-device deployment (Liu et al., [2024](https://arxiv.org/html/2606.04694#bib.bib17)), while supporting real-world applications at scale (Pham et al., [2025](https://arxiv.org/html/2606.04694#bib.bib15); Chen et al., [2025](https://arxiv.org/html/2606.04694#bib.bib12)).

¹ Measured with BF16 precision and input length 6144, based on benchmarks from [https://qwen.readthedocs.io/en/v2.5/benchmark/speed_benchmark.html](https://qwen.readthedocs.io/en/v2.5/benchmark/speed_benchmark.html).

However, multilingual capabilities in SLMs remain limited (Qin et al., [2025](https://arxiv.org/html/2606.04694#bib.bib11); Xuan et al., [2025](https://arxiv.org/html/2606.04694#bib.bib13)), especially for Southeast Asian (SEA) languages, which are spoken across a highly diverse region with hundreds of millions of speakers. As shown in Figure [1](https://arxiv.org/html/2606.04694#S1.F1), SEA performance drops substantially as model size falls below the billion-parameter regime. In particular, the Supervised Fine-Tuning (SFT) variant of Qwen2.5-0.5B drops sharply relative to Qwen2.5-1.5B, while the newer Qwen3-0.6B still shows limited SEA performance under standard SFT. These results suggest that scaling down weakens multilingual understanding, motivating training strategies tailored for SLMs.

![Refer to caption](https://arxiv.org/html/2606.04694v1/x1.png)

Figure 1: Comparison of SEA performance across different model scales and training frameworks, evaluated using the SEA-HELM benchmark (↑). Details of each model are provided in Section [4](https://arxiv.org/html/2606.04694#S4).

A common approach to improving SLM performance is knowledge distillation (KD), which transfers knowledge from a larger teacher to a smaller student (Hinton et al., [2015](https://arxiv.org/html/2606.04694#bib.bib34); Kim and Rush, [2016](https://arxiv.org/html/2606.04694#bib.bib35); Agarwal et al., [2024](https://arxiv.org/html/2606.04694#bib.bib36); Gu et al., [2024](https://arxiv.org/html/2606.04694#bib.bib23); Ko et al., [2024](https://arxiv.org/html/2606.04694#bib.bib22), [2025](https://arxiv.org/html/2606.04694#bib.bib20)). Despite recent advances in KD, multilingual distillation remains largely limited to task-specific or data-centric settings (Payoungkhamdee et al., [2024](https://arxiv.org/html/2606.04694#bib.bib28); Zhang et al., [2024](https://arxiv.org/html/2606.04694#bib.bib10)), leaving general-purpose multilingual distillation for SLMs underexplored. This gap is especially pronounced for SEA languages due to their linguistic diversity and limited high-quality training data, motivating multilingual KD strategies designed for SLMs.

To address this limitation, we propose Dual-Signal Distillation with Cross-Lingual Verbalizer (DuDi), a general-purpose framework designed for multilingual distillation in SLMs. DuDi builds a unified KD framework around three components: *Sequence Signal*, *Token Signal*, and *Cross-Lingual Verbalizer*. For the sequence signal, DuDi introduces an online sequence-level objective that guides the student policy toward the ground truth. For the token signal, DuDi uses both off-policy and on-policy supervision: off-policy signals come from the training corpus, while on-policy signals come from student-generated responses. To facilitate knowledge transfer, DuDi uses a cross-lingual verbalizer to refine teacher logits during on-policy distillation, aligning student responses with the ground-truth demonstrations. This design better supports student learning in a multilingual environment.

To evaluate DuDi, we compare it with competitive methods under the SEA training and evaluation framework, using SEA-Instruct² and SEA-HELM (Susanto et al., [2025](https://arxiv.org/html/2606.04694#bib.bib18)). The experimental results demonstrate that DuDi achieves the strongest overall performance under the Qwen2.5-0.5B setting, with gains across most SEA languages. This trend generalizes across scales and architectures, demonstrating scalability and robustness. Ablations show consistent degradation when any DuDi component is removed, highlighting the need to jointly optimize the sequence-level objective, the dual-policy token signals, and the cross-lingual verbalizer. Finally, analysis of the DuDi verbalizer demonstrates that it provides richer learning signals for teacher-student distillation.

² [https://huggingface.co/datasets/aisingapore/SEA-Instruct-2602](https://huggingface.co/datasets/aisingapore/SEA-Instruct-2602)

In conclusion, our contributions are as follows:

- We propose DuDi, a multilingual knowledge distillation framework that integrates sequence-level and token-level signals, improving SEA performance in small LMs.
- We introduce a cross-lingual verbalizer that better facilitates on-policy distillation.
- We conduct ablations and analyses to assess each component, showing the effectiveness of the dual-signal and verbalizer designs.

## 2 Background

| Method | Teacher | Off-Policy Token-Signal | On-Policy Token-Signal | Sequence-Signal | Verbalizer |
|---|---|---|---|---|---|
| SFT | × | ✓ | × | × | × |
| DFT (Wu et al., [2026](https://arxiv.org/html/2606.04694#bib.bib37)) | × | ✓ | × | × | × |
| SPIN (Chen et al., [2024b](https://arxiv.org/html/2606.04694#bib.bib33)) | × | × | × | ✓ | × |
| SDFT (Shenfeld et al., [2026](https://arxiv.org/html/2606.04694#bib.bib38)) | Self | × | ✓ | × | English |
| SeqKD (Kim and Rush, [2016](https://arxiv.org/html/2606.04694#bib.bib35)) | Larger | ✓ | × | × | × |
| GKD (Agarwal et al., [2024](https://arxiv.org/html/2606.04694#bib.bib36)) | Larger | ✓ | ✓ | × | × |
| DuDi (Ours) | Larger | ✓ | ✓ | ✓ | Cross-lingual |

Table 1: Comparison of training paradigms across different frameworks.

The multilingual training corpus consists of an input $x$, a ground truth $y$, and a language $l$. Formally, this dataset is defined as $\mathcal{D}=\{(x_i,y_i,l_i)\}_{i=1}^{N}$, where $N$ denotes the total number of training samples. To learn from the data, a fine-tuning method optimizes the policy $\pi_\theta$ by minimizing an objective function $\mathcal{L}(x_i,y_i,l_i;\pi_\theta)$, which measures the difference between the model's stochastic predictions and the ground truths. Existing methods structure this objective differently to address distinct learning dynamics.

Off-Policy Fine-Tuning. This is the most straightforward approach, typically grounded in a cross-entropy objective. Given a model policy $\pi_\theta$, the loss function is formulated as

$$
\mathcal{L}_{\text{Off-FT}}=\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[-w\log\pi_\theta(y|x)\right], \tag{1}
$$

where $w$ serves as a weighting coefficient to modulate the training signal. In standard Supervised Fine-Tuning (SFT), $w=1$, treating all tokens with equal importance. In Dynamic Fine-Tuning (DFT) (Wu et al., [2026](https://arxiv.org/html/2606.04694#bib.bib37)), $w$ is defined as $\text{sg}(\pi_\theta(y|x))$, where $\text{sg}(\cdot)$ denotes the stop-gradient operator. This token weighting is designed to stabilize gradient magnitudes and improve generalization during fine-tuning.
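As a minimal sketch of Equation 1, the snippet below contrasts the SFT weight ($w=1$) with the DFT weight ($w=\text{sg}(\pi_\theta)$) applied per token. The function name `off_policy_ft_loss` and the per-token averaging are illustrative assumptions, not the authors' implementation. Plain Python has no autograd, so the stop-gradient is represented by treating the probability as a constant.

```python
import math

def off_policy_ft_loss(token_probs, mode="sft"):
    """Average of -w * log p over the ground-truth tokens (Equation 1).

    token_probs: probabilities the model assigns to each ground-truth token.
    mode: "sft" uses w = 1; "dft" uses w = sg(p), i.e. p held constant.
    """
    if mode not in ("sft", "dft"):
        raise ValueError("mode must be 'sft' or 'dft'")
    total = 0.0
    for p in token_probs:
        w = 1.0 if mode == "sft" else p  # assumption: token-level weighting
        total += -w * math.log(p)
    return total / len(token_probs)

# Toy probabilities for three ground-truth tokens (made-up values).
probs = [0.9, 0.5, 0.1]
sft = off_policy_ft_loss(probs, "sft")
dft = off_policy_ft_loss(probs, "dft")
```

Because $-\log p \ge 0$ and $w = p \le 1$, the DFT value never exceeds the SFT value, and low-probability tokens are down-weighted the most.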

Iterative Fine-Tuning. To address the limitation of static training data, prior work has explored iterative self-play for policy refinement (Tesauro, [1995](https://arxiv.org/html/2606.04694#bib.bib7); Silver et al., [2017](https://arxiv.org/html/2606.04694#bib.bib6)). Chen et al. ([2024b](https://arxiv.org/html/2606.04694#bib.bib33)) proposed Self-Play Fine-Tuning (SPIN), a bootstrapping framework that improves the model by distinguishing ground truths from self-generated responses sampled from an SFT-initialized reference policy, $y'\sim\pi_{\theta_{\text{Ref}}}(x)$. This approach optimizes the policy by maximizing an Integral Probability Metric against a previous iteration of itself. The iterative fine-tuning objective is defined as

$$
\mathcal{L}_{\text{IFT}}=\mathbb{E}_{(x,y)\sim\mathcal{D},\,y'}\left[\ell\left(\lambda\log\frac{\pi_\theta(y|x)\,\pi_{\theta_{\text{Ref}}}(y'|x)}{\pi_{\theta_{\text{Ref}}}(y|x)\,\pi_\theta(y'|x)}\right)\right], \tag{2}
$$

where $\ell(t)=\log(1+\exp(-t))$ is the logistic loss and $\lambda>0$ is the regularization parameter. Because $\ell$ is decreasing, minimizing the loss raises the likelihood ratio of the ground truth $y$ relative to that of the self-generated response $y'$. By contrasting the log-likelihood ratios of target responses against those of its own generations, the model increasingly aligns its policy with the ground-truth distribution through successive iterations.
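The sequence-level objective can be sketched for a single $(x, y, y')$ triple as follows. The sketch places the ground-truth log-ratio first, the orientation of the original SPIN objective, so that minimizing the logistic loss raises the likelihood of $y$ relative to $y'$. The function names and the scalar log-likelihood inputs are assumptions for illustration.

```python
import math

def logistic_loss(t):
    """Numerically stable log(1 + exp(-t))."""
    return max(-t, 0.0) + math.log1p(math.exp(-abs(t)))

def spin_loss(logp_y, logp_ref_y, logp_gen, logp_ref_gen, lam=1.0):
    """Sequence-level loss for one (x, y, y') triple.

    logp_y / logp_ref_y: log-likelihood of the ground truth y under the
        current policy and under the reference policy.
    logp_gen / logp_ref_gen: the same quantities for the generated y'.
    lam: the regularization parameter (lambda > 0).
    """
    margin = (logp_y - logp_ref_y) - (logp_gen - logp_ref_gen)
    return logistic_loss(lam * margin)

# Policy identical to the reference: margin is 0, loss is log(2).
at_init = spin_loss(-5.0, -5.0, -7.0, -7.0)
# Policy now prefers y and disprefers y': the loss falls below log(2).
improved = spin_loss(-4.0, -5.0, -8.0, -7.0)
```

When the policy equals the reference the margin is zero, so every training run starts from a loss of $\log 2$ per example.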

Self-Distillation. To mitigate the limited generalization of off-policy fine-tuning, several studies adopt a self-distillation paradigm (Yang et al., [2024](https://arxiv.org/html/2606.04694#bib.bib14); Zhang et al., [2024](https://arxiv.org/html/2606.04694#bib.bib10); Hübotter et al., [2026](https://arxiv.org/html/2606.04694#bib.bib2)). Specifically, Shenfeld et al. ([2026](https://arxiv.org/html/2606.04694#bib.bib38)) introduced Self-Distillation Fine-Tuning (SDFT), which transforms off-policy signals into an on-policy paradigm by employing an exponential moving average of the student as the teacher, $\pi_{\theta_{\text{T}}}\sim\text{EMA}(\pi_\theta)$. A central component of SDFT is an English verbalizer function, $z\sim v_{\text{en}}(x,y)$, which converts an input and ground-truth pair into a structured demonstration prompt for the teacher model (illustrated in Figure [6](https://arxiv.org/html/2606.04694#A5.F6)). This verbalized demonstration is subsequently used to guide the student-generated response, $\tilde{y}\sim\pi_\theta(x)$. The optimization objective minimizes the divergence between the teacher, conditioned on the verbalized demonstration, and the student policy:

$$
\mathcal{L}_{\text{SD}}=\mathbb{E}_{x\sim\mathcal{D},\,\tilde{y},\,z}\left[D\left(\pi_{\theta_{\text{T}}}(\tilde{y}|z)\,\|\,\pi_\theta(\tilde{y}|x)\right)\right]. \tag{3}
$$

By leveraging this temporal ensemble, SDFT regularizes the optimization path and improves generalization through token-level guidance.
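The exponential-moving-average teacher can be illustrated with a toy update rule. This is a sketch under assumptions: weights are stored as a flat dictionary of floats, and the decay value is a placeholder, since the text does not specify one.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights moved one EMA step toward the student.

    teacher, student: dicts mapping parameter names to float values
        (a stand-in for real tensors).
    decay: hypothetical smoothing factor; higher means a slower teacher.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 2.0}
teacher = ema_update(teacher, student, decay=0.9)
```

The teacher therefore lags the student, averaging over its recent training trajectory rather than copying its latest weights.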

Teacher Distillation. The teacher knowledge distillation paradigm (Hinton et al., [2015](https://arxiv.org/html/2606.04694#bib.bib34); Lin et al., [2020](https://arxiv.org/html/2606.04694#bib.bib24); Ko et al., [2024](https://arxiv.org/html/2606.04694#bib.bib22), [2025](https://arxiv.org/html/2606.04694#bib.bib20)) leverages signals from a stronger teacher model ($\pi_{\theta_{\text{T}}}$) to guide a student model ($\pi_\theta$), typically a parameter-efficient counterpart. In abstract form, this approach optimizes two objectives, one over static ground truths and one over stochastically generated responses:

$$
\begin{aligned}
\mathcal{L}_{\text{TD}}={}&(1-\lambda)\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[D\left(\pi_{\theta_{\text{T}}}(y|x)\,\|\,\pi_\theta(y|x)\right)\right]\\
&+\lambda\,\mathbb{E}_{x\sim\mathcal{D},\,\tilde{y}\sim\pi_\theta(x)}\left[D\left(\pi_{\theta_{\text{T}}}(\tilde{y}|x)\,\|\,\pi_\theta(\tilde{y}|x)\right)\right],
\end{aligned}
\tag{4}
$$

where $D$ denotes a divergence function, and $\lambda\in[0,1]$ balances the distillation signals from ground truths and newly generated responses. Here, $\tilde{y}$ is sampled from either the teacher or the student policy. In off-policy KD, SeqKD (Kim and Rush, [2016](https://arxiv.org/html/2606.04694#bib.bib35)) trains the student on teacher-generated sequences, i.e., $\tilde{y}\sim\pi_{\theta_{\text{T}}}(x)$. However, this approach often suffers from a training-inference mismatch when the sequences generated by the student at inference time deviate significantly from those encountered during training. To address this mismatch, Generalized Knowledge Distillation (GKD) (Agarwal et al., [2024](https://arxiv.org/html/2606.04694#bib.bib36)) introduces an on-policy mechanism. In the GKD framework, the student generates its own responses $\tilde{y}\sim\pi_\theta(x)$, while the teacher provides token-level signals that guide the student's output logits to correct its self-generated mistakes.
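The two-term structure of Equation 4 can be sketched with a concrete divergence. The text leaves $D$ abstract, so the forward KL used here is an assumption, as are the function names and the averaging over token positions.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw logits, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one token position (assumed choice of D)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def sequence_divergence(teacher_seq, student_seq):
    """Average per-token divergence over one response."""
    pairs = list(zip(teacher_seq, student_seq))
    return sum(forward_kl(t, s) for t, s in pairs) / len(pairs)

def teacher_distillation_loss(d_ground_truth, d_generated, lam=0.5):
    """Equation 4 for one example: mix the two divergence terms with lam."""
    return (1.0 - lam) * d_ground_truth + lam * d_generated

# Toy logits over a 3-token vocabulary for a 2-token response.
teacher_seq = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
student_seq = [[1.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
d_off = sequence_divergence(teacher_seq, student_seq)
```

With `lam=0` only the ground-truth term remains, and with `lam=1` only the generated-response term does, which is how the text distinguishes purely off-policy from purely on-policy distillation.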

As summarized in Table[1](https://arxiv.org/html/2606.04694#S2.T1), although these approaches have advanced fine\-tuning methodologies, their multilingual extension remains insufficiently explored\. Additionally, existing methods typically treat sequence\-level and token\-level supervision independently, limiting the complementarity of both learning signals within a unified framework\.

## 3 DuDi

As illustrated in Figure[2](https://arxiv.org/html/2606.04694#S3.F2), DuDi consists of three core components: a sequence\-level signal \(Section[3\.1](https://arxiv.org/html/2606.04694#S3.SS1)\), a token\-level signal \(Section[3\.2](https://arxiv.org/html/2606.04694#S3.SS2)\), and a cross\-lingual verbalizer \(Section[3\.3](https://arxiv.org/html/2606.04694#S3.SS3)\)\. The framework enables SLMs to jointly leverage sequence\-level and token\-level supervision, while the cross\-lingual verbalizer improves teacher\-student knowledge transferability in multilingual settings\.

![Refer to caption](https://arxiv.org/html/2606.04694v1/x2.png)

Figure 2: Overview of the DuDi framework, which integrates the sequence-level objective with off-policy and on-policy token-level knowledge distillation signals, using teacher-guided cross-lingual verbalization.

### 3.1 Sequence Signal

To improve the efficiency of SLMs, we integrate a sequence-level objective inspired by the SPIN framework to ensure the student policy converges toward the ground-truth demonstration. This objective requires the model to differentiate its current policy from the ground truth by maximizing the relative likelihood of ground truths. Unlike the static iterations of SPIN, we sample responses $y'\sim\pi_\theta(x)$ in real time to provide dynamic on-policy feedback with the objective in Equation [2](https://arxiv.org/html/2606.04694#S2.E2).

### 3.2 Token Signal

Complementing the sequence-level signal, DuDi uses both off-policy and on-policy token-level objectives to enhance performance through knowledge distillation.

Off-Policy KD. While on-policy distillation mitigates teacher-student mismatches (Agarwal et al., [2024](https://arxiv.org/html/2606.04694#bib.bib36); Gu et al., [2024](https://arxiv.org/html/2606.04694#bib.bib23); Ko et al., [2024](https://arxiv.org/html/2606.04694#bib.bib22), [2025](https://arxiv.org/html/2606.04694#bib.bib20)), relying solely on student-generated responses may reduce exposure to the ground-truth data distribution, potentially causing the learned policy to drift away from it. To address this problem, we adopt an off-policy distillation signal that uses teacher-provided logits as guidance. This mechanism keeps the supervisory signal grounded in the ground-truth distribution. The corresponding objective is

$$
\mathcal{L}_{\text{Off-KD}}=D\left(\pi_{\theta_{\text{T}}}(y|x)\,\|\,\pi_\theta(y|x)\right). \tag{5}
$$

On-Policy KD. To enable teacher-guided refinement of the student-generated responses, we further adopt an on-policy knowledge distillation objective, where the student generates a response $\tilde{y}\sim\pi_\theta(x)$. The on-policy distillation objective is defined as

$$
\mathcal{L}_{\text{On-KD}}=D\left(\pi_{\theta_{\text{T}}}(\tilde{y}|x)\,\|\,\pi_\theta(\tilde{y}|x)\right). \tag{6}
$$

### 3.3 Cross-Lingual Verbalizer

Optimizing on-policy learning with token-level supervision in multilingual settings requires a carefully designed framework for effective teacher-student knowledge transfer. To further improve the distillation process, and inspired by the English verbalizer (Shenfeld et al., [2026](https://arxiv.org/html/2606.04694#bib.bib38)), we introduce a cross-lingual verbalizer. This component converts an input prompt $x$, a ground truth $y$ used as a demonstration, a source language $l$, and a target language $l_z$ into a verbalized prompt $z=v(x,y,l,l_z)$ for the teacher, along with a corresponding prompt $p_z$ for the student. The verbalizer prompt is written in the sample's native language $l$. The prompt $p_z$ instructs the student model to generate responses in the target language $l_z$, where $l_z$ is sampled uniformly from the set of training languages excluding $l$, with English included as an additional language. An example of the cross-lingual verbalizer prompt template is shown in Figure [3](https://arxiv.org/html/2606.04694#S3.F3). Furthermore, Figure [4](https://arxiv.org/html/2606.04694#S3.F4) illustrates a Thai sample that is verbalized to elicit a response in Vietnamese.
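The target-language sampling and prompt construction described above can be sketched as follows. The template wording is hypothetical and written in English for readability, whereas the paper's verbalizer prompt is in the sample's own language. The names `sample_target_language` and `verbalize` are likewise assumptions.

```python
import random

# The seven SEA training languages named in the paper.
TRAIN_LANGS = ["Indonesian", "Vietnamese", "Thai", "Tamil",
               "Tagalog", "Malay", "Burmese"]

def sample_target_language(source_lang, rng):
    """Uniformly pick l_z from training languages plus English, minus l."""
    candidates = [lang for lang in TRAIN_LANGS + ["English"]
                  if lang != source_lang]
    return rng.choice(candidates)

def verbalize(x, y, source_lang, target_lang):
    """Return (z, p_z): teacher prompt and student prompt.

    Hypothetical templates; the real ones are shown in the paper's figures.
    """
    z = (f"Question: {x}\n"
         f"Reference answer ({source_lang}): {y}\n"
         f"Using the reference, answer the question in {target_lang}.")
    p_z = f"Answer in {target_lang}."
    return z, p_z

rng = random.Random(0)  # fixed seed so the sketch is reproducible
l_z = sample_target_language("Thai", rng)
z, p_z = verbalize("<prompt x>", "<ground truth y>", "Thai", l_z)
```

Only the teacher prompt $z$ contains the ground truth $y$. The student sees $p_z$ together with the original input $x$, so the teacher is the better-informed party when scoring the student's rollout.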

Consequently, the on-policy token-level distillation objective in Equation [6](https://arxiv.org/html/2606.04694#S3.E6) is modified to use the verbalized prompt $z$ for the teacher together with its associated prompt template $p_z$ for the student. Based on this verbalized input, the student generates an additional response $\tilde{y}\sim\pi_\theta(p_z,x)$. The on-policy distillation objective with the cross-lingual verbalizer on the teacher is formally expressed as:

$$
\mathcal{L}_{\text{On-KD}}=D\left(\pi_{\theta_{\text{T}}}(\tilde{y}|z)\,\|\,\pi_\theta(\tilde{y}|p_z,x)\right). \tag{7}
$$

The use of the cross-lingual verbalizer enables knowledge transfer across languages, thereby improving downstream performance in multilingual settings. The cross-lingual verbalizer configuration in DuDi is detailed in Section [4.1](https://arxiv.org/html/2606.04694#S4.SS1).

![Refer to caption](https://arxiv.org/html/2606.04694v1/x3.png)

Figure 3: Illustration of the cross-lingual verbalizer template, showing the teacher prompt ($z$) and its corresponding student prompt ($p_z$) from a Thai training instance.

![Refer to caption](https://arxiv.org/html/2606.04694v1/x4.png)

Figure 4: Example of cross-lingual verbalized teacher and student prompts. In this example, the original sample is in Thai, where the prompt is expressed in Thai, while the target response is generated in Vietnamese. For this sample, the target response language is uniformly sampled from seven languages: English, Indonesian, Vietnamese, Tamil, Tagalog, Malay, and Burmese, excluding the sample's original language (Thai).
### 3.4 Unified Training Objective

The DuDi framework is optimized by integrating on-policy sequence-level alignment (Equation [2](https://arxiv.org/html/2606.04694#S2.E2)), off-policy token-level distillation (Equation [5](https://arxiv.org/html/2606.04694#S3.E5)), and on-policy token-level guidance via the cross-lingual verbalizer (Equation [7](https://arxiv.org/html/2606.04694#S3.E7)). Formally, the learning objective is:

$$
\begin{aligned}
\mathcal{L}_{\text{DuDi}}={}&\alpha\,\mathbb{E}_{(x,y)\sim\mathcal{D},\,y'\sim\pi_\theta(x)}\left[\mathcal{L}_{\text{SPIN}}(x,y,y';\pi_\theta,\pi_{\theta_{\text{Ref}}})\right]\\
&+(1-\lambda)\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathcal{L}_{\text{Off-KD}}(x,y;\pi_\theta,\pi_{\theta_{\text{T}}})\right]\\
&+\lambda\,\mathbb{E}_{x\sim\mathcal{D},\,\tilde{y}\sim\pi_\theta(p_z,x),\,z\sim v(x,y)}\left[\mathcal{L}_{\text{On-KD}}(x,\tilde{y},z;\pi_\theta,\pi_{\theta_{\text{T}}})\right],
\end{aligned}
\tag{8}
$$

where $\alpha$ is the weighting coefficient for sequence-level student policy optimization and $\lambda$ is the weighting coefficient that balances the token-level off-policy and on-policy distillation losses. This joint objective allows the student model to leverage both sequence-level optimization toward the ground-truth data distribution and fine-grained token-level supervision from the teacher.
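For a single training example, the weighting in Equation 8 reduces to a few lines once the three loss terms are available. This is a sketch under assumptions: the default values of `alpha` and `lam` are placeholders rather than the paper's settings, and the range check on `lam` follows the $[0,1]$ range stated for Equation 4.

```python
def dudi_loss(seq_loss, off_kd_loss, on_kd_loss, alpha=1.0, lam=0.5):
    """Combine the three DuDi terms for one example (Equation 8).

    seq_loss: online sequence-level (SPIN-style) loss.
    off_kd_loss: teacher-student divergence on the ground truth y.
    on_kd_loss: divergence on the student rollout, with the teacher
        conditioned on the cross-lingual verbalized prompt z.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return alpha * seq_loss + (1.0 - lam) * off_kd_loss + lam * on_kd_loss

# Toy per-example loss values (made up for illustration).
total = dudi_loss(seq_loss=0.7, off_kd_loss=2.0, on_kd_loss=4.0,
                  alpha=0.5, lam=0.25)
```

Setting `alpha=0` leaves a two-term token-level mix with the same structure as Equation 4, which makes the sequence signal easy to ablate in such a sketch.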

### 3.5 Differentiation from Previous Work

DuDi jointly leverages sequence-level and token-level signals, with two key distinctions from prior approaches. First, we integrate a self-play mechanism (Chen et al., [2024b](https://arxiv.org/html/2606.04694#bib.bib33)) as a sequence-level signal, moving from offline to online generation to reflect the student's evolving policy. Second, we extend the on-policy distillation objective (Agarwal et al., [2024](https://arxiv.org/html/2606.04694#bib.bib36); Shenfeld et al., [2026](https://arxiv.org/html/2606.04694#bib.bib38)) by introducing cross-lingual prompting for student rollouts, paired with a teacher equipped with a cross-lingual verbalizer, thereby improving knowledge transferability between teacher and student.

Compared to prior methods summarized in Table[1](https://arxiv.org/html/2606.04694#S2.T1), this design enables the integration of dual\-signal supervision with a tailored verbalizer, achieving state\-of\-the\-art performance in multilingual settings \(Section[5](https://arxiv.org/html/2606.04694#S5)\)\. Ablation results in Section[6\.1](https://arxiv.org/html/2606.04694#S6.SS1)further show that these components are not independently effective but require joint integration in multilingual scenarios\. Moreover, Section[7](https://arxiv.org/html/2606.04694#S7)demonstrates that the cross\-lingual verbalizer improves on\-policy teacher\-student knowledge transferability, consistent with our hypothesis\.

## 4 Experimental Setup

| Method | Indonesian | Vietnamese | Thai | Tamil | Tagalog | Malay | Burmese | Average |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct (Teacher) | 42.0 | 38.7 | 32.3 | 9.8 | 24.6 | 40.0 | 6.5 | 27.7 |
| *Qwen2.5-0.5B (Student)* | | | | | | | | |
| SFT | 10.6 | 11.8 | 10.8 | 4.9 | 5.6 | 12.0 | 4.4 | 8.6 |
| DFT | 9.3 | 11.6 | 10.2 | 10.0 | 6.1 | 12.4 | 8.3 | 9.7 |
| SPIN | 12.6 | 11.4 | 10.0 | 5.3 | 8.0 | 14.7 | 4.8 | 9.5 |
| SDFT | 4.2 | 4.6 | 4.7 | 2.3 | 3.3 | 5.1 | 1.3 | 3.6 |
| SeqKD | 7.1 | 9.1 | 7.7 | 4.2 | 3.4 | 8.1 | 2.9 | 6.1 |
| GKD | 11.7 | 13.2 | 10.9 | 4.9 | 4.8 | 13.6 | 3.7 | 9.0 |
| DuDi (ours) | 11.7⋄ | 14.4⋄ | 12.8⋄ | 6.5 | 6.1 | 14.8⋄ | 4.6 | 10.1⋄ |

Table 2: Downstream evaluation of methods across seven Southeast Asian languages. "⋄" denotes a statistically significant improvement in each language, where DuDi outperforms DFT, the method with the second-best overall performance.

| Method | Indonesian | Vietnamese | Thai | Tamil | Tagalog | Malay | Burmese | Average |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct (Teacher) | 42.0 | 38.7 | 32.3 | 9.8 | 24.6 | 40.0 | 6.5 | 27.7 |
| *Qwen2.5-1.5B (Student)* | | | | | | | | |
| SFT | 21.4 | 26.0 | 23.8 | 10.0 | 16.6 | 26.1 | 6.2 | 18.6 |
| DFT | 22.1 | 17.8 | 16.6 | 9.6 | 13.6 | 22.8 | 6.2 | 15.6 |
| SPIN | 21.0 | 25.3 | 20.2 | 10.1 | 19.2 | 26.7 | 5.3 | 18.3 |
| GKD | 28.6 | 28.8 | 19.9 | 7.3 | 15.4 | 29.1 | 4.7 | 19.1 |
| DuDi | 27.9 | 30.3⋄ | 19.8 | 8.5⋄ | 19.0⋄ | 30.7⋄ | 4.8 | 20.1⋄ |
| Qwen3-4B (Teacher) | 54.1 | 52.7 | 50.7 | 43.1 | 45.9 | 53.0 | 20.0 | 45.6 |
| *Qwen3-0.6B-Base (Student)* | | | | | | | | |
| SFT | 15.6 | 19.5 | 17.3 | 10.1 | 10.7 | 17.9 | 6.4 | 13.9 |
| DFT | 18.4 | 18.1 | 17.3 | 14.0 | 14.0 | 20.1 | 7.1 | 15.6 |
| SPIN | 14.4 | 20.5 | 17.7 | 10.4 | 14.0 | 16.1 | 6.8 | 14.3 |
| GKD | 20.6 | 27.0 | 21.9 | 10.9 | 15.1 | 23.9 | 6.5 | 18.0 |
| DuDi | 24.2⋄ | 30.4⋄ | 23.4⋄ | 13.2⋄ | 17.6⋄ | 28.1⋄ | 8.4⋄ | 20.8⋄ |
| Llama3.2-3B-Instruct (Teacher) | 32.4 | 30.3 | 38.2 | 16.7 | 28.3 | 42.8 | 5.0 | 27.7 |
| *Llama3.2-1B (Student)* | | | | | | | | |
| SFT | 4.7 | 5.3 | 3.6 | 3.9 | 3.3 | 6.5 | 5.7 | 4.7 |
| DFT | 0.4 | 0.4 | 0.9 | 0.4 | 0.3 | 1.1 | 0.4 | 0.6 |
| SPIN | 2.9 | 3.6 | 3.8 | 3.1 | 2.9 | 7.4 | 2.4 | 3.7 |
| GKD | 11.8 | 17.7 | 11.4 | 5.6 | 7.3 | 16.3 | 4.2 | 10.6 |
| DuDi | 14.4⋄ | 16.5 | 10.4 | 6.6⋄ | 8.1 | 18.5⋄ | 5.2⋄ | 11.4⋄ |

Table 3: Results across different teacher-student model configurations. "⋄" denotes a statistically significant improvement in each language, where DuDi outperforms GKD, the second-best overall performing framework.

### 4.1 Setup

Models and Datasets. We center our study on Qwen2.5 (Qwen et al., [2025](https://arxiv.org/html/2606.04694#bib.bib30)), using Qwen2.5-3B-Instruct as the teacher model and Qwen2.5-0.5B and Qwen2.5-1.5B as student models. To cover other families, we also evaluate Qwen3 (4B → 0.6B) and Llama3.2 (3B → 1B). All student models are initialized from base pretrained checkpoints, whereas the corresponding teacher models use instruction-tuned variants. For the training dataset, we use SEA-Instruct, which covers seven SEA languages: Indonesian, Vietnamese, Thai, Tamil, Tagalog, Malay, and Burmese. The dataset contains open-source prompts, each paired with a synthetic response and a quality estimate. We sample 4,000 high-quality examples per language, as labeled by the original dataset, yielding 28,000 samples. Random sampling constraints preserve the distribution of domains, task types, and prompt complexity.
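The per-language sampling step can be sketched as below. The record fields `language` and `quality`, the label value `"high"`, and the function name are hypothetical, since the dataset schema is not described here. The sketch also omits the constraints that preserve domain, task-type, and complexity distributions.

```python
import random
from collections import defaultdict

def sample_per_language(records, per_language, seed=0):
    """Draw up to `per_language` high-quality records for each language.

    records: iterable of dicts with hypothetical keys "language" and
        "quality" (the real SEA-Instruct schema may differ).
    """
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for record in records:
        if record["quality"] == "high":
            by_lang[record["language"]].append(record)
    sampled = []
    for lang in sorted(by_lang):  # sorted for a reproducible order
        pool = by_lang[lang]
        sampled.extend(rng.sample(pool, min(per_language, len(pool))))
    return sampled

# Toy corpus: two languages, ten records each, half labeled high quality.
toy = [{"language": lang, "quality": q, "id": i}
       for lang in ("Thai", "Malay")
       for i, q in enumerate(["high", "low"] * 5)]
subset = sample_per_language(toy, per_language=3)
```

With seven languages and `per_language=4000`, a corpus with enough high-quality records per language would yield the 28,000 samples reported above.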

Framework Setup. The cross-lingual verbalizer in DuDi incorporates the seven SEA languages and English. During on-policy training, the target response language for the student is uniformly sampled from this set, excluding the original language of the training sample. In addition, we adopt a two-stage training framework in which the base model is first fine-tuned with SFT on the SEA-Instruct dataset (cold-start SFT), after which DuDi training is initialized from the resulting SFT checkpoint. For the sequence-level objective, the reference policy $\pi_{\theta_{\text{Ref}}}$ is set to the cold-start SFT checkpoint. The importance of cold-start SFT initialization is further discussed in Section [7.2](https://arxiv.org/html/2606.04694#S7.SS2). Additional details regarding the training configurations are provided in Appendix [B](https://arxiv.org/html/2606.04694#A2).

### 4.2 Evaluation

We evaluate on SEA-HELM (Susanto et al., [2025](https://arxiv.org/html/2606.04694#bib.bib18)), which covers multiple Southeast Asian languages and diverse tasks. The languages are Indonesian, Vietnamese, Thai, Tamil, Tagalog, Malay, and Burmese. SEA-HELM includes natural language understanding (NLU), natural language generation (NLG), natural language reasoning (NLR), safety, linguistic diagnostics, instruction following, and Southeast Asian knowledge. All results are averaged over four seeds. We also use Almost Stochastic Order (ASO)³ (Del Barrio et al., [2018](https://arxiv.org/html/2606.04694#bib.bib5); Dror et al., [2019](https://arxiv.org/html/2606.04694#bib.bib4)) to test statistical significance between DuDi and the second-best performing framework.

³ Using the implementation from Ulmer et al. ([2022](https://arxiv.org/html/2606.04694#bib.bib3)).

### 4\.3 Competitive Methods

We compare DuDi with the competitive methods for SLMs discussed in Section [2](https://arxiv.org/html/2606.04694#S2)\. For off\-policy fine\-tuning, we use SFT and DFT \(Wu et al\., [2026](https://arxiv.org/html/2606.04694#bib.bib37)\) as standard fine\-tuning baselines\. For iterative fine\-tuning, we adopt SPIN \(Chen et al\., [2024b](https://arxiv.org/html/2606.04694#bib.bib33)\) as a representative sequence\-level optimization method\. For self\-distillation strategies, we include SDFT \(Shenfeld et al\., [2026](https://arxiv.org/html/2606.04694#bib.bib38)\), which uses temporary prefix prompting for self\-guided on\-policy refinement\. For knowledge distillation, we evaluate SeqKD \(Kim and Rush, [2016](https://arxiv.org/html/2606.04694#bib.bib35)\) and GKD \(Agarwal et al\., [2024](https://arxiv.org/html/2606.04694#bib.bib36)\)\. Furthermore, all methods except off\-policy fine\-tuning are initialized from a cold\-start SFT checkpoint, following the same setup as DuDi\. Additional implementation details for all competitive methods are provided in Appendix [B](https://arxiv.org/html/2606.04694#A2)\.

## 5 Main Results

#### DuDi outperforms all methods\.

Overall, DuDi achieves the strongest performance among all frameworks\. As shown in Table [2](https://arxiv.org/html/2606.04694#S4.T2), DuDi obtains the highest average score of 10\.1 across seven SEA languages, statistically significantly outperforming the strongest baseline, DFT, by 0\.4 points and SPIN by 0\.6 points\. DuDi ranks first in Vietnamese, Thai, and Malay, and second in Indonesian, Tamil, and Tagalog, demonstrating consistent improvements across both high\- and mid\-resource SEA languages\. The only exception is Burmese, where the gain is limited, likely due to the smaller teacher\-student gap and teacher performance in this language\.

#### Comparison with off\-policy fine\-tuning\.

DFT is a strong off\-policy baseline, achieving the second\-best overall score, only 0\.4 points below DuDi\. It outperforms the larger teacher model in Tamil and Burmese, suggesting that direct fine\-tuning is beneficial when the teacher is unreliable\. However, DFT does not consistently surpass SFT across languages and is limited as a cold\-start initialization for further distillation \(reported in Appendix [F](https://arxiv.org/html/2606.04694#A6)\)\.

#### Contrast with other methods\.

SPIN emerges as the third strongest approach, suggesting that self\-play remains effective\. GKD is also a strong KD baseline, achieving 9\.0 on average, while SeqKD performs moderately but remains below both GKD and SPIN\. In contrast, SDFT substantially underperforms despite its self\-distillation design; its English\-based verbalizer limits the teacher\-student transferability\. Further analyses regarding verbalizer configurations and teacher–student knowledge transferability are detailed in Sections [6\.2](https://arxiv.org/html/2606.04694#S6.SS2) and [7\.1](https://arxiv.org/html/2606.04694#S7.SS1)\.

#### Robustness across model families\.

DuDi also generalizes consistently across different teacher–student configurations\. As shown in Table [3](https://arxiv.org/html/2606.04694#S4.T3), DuDi achieves the best average performance for all three student model families\. Compared with GKD, the strongest prior baseline, DuDi improves average performance by 5\.2%, 15\.6%, and 7\.5% on Qwen2\.5\-1\.5B, Qwen3\-0\.6B\-Base, and Llama3\.2\-1B, respectively\. These gains demonstrate that DuDi remains effective across variations in model scale, version, and architecture family\.

#### Stability over competing methods\.

Among prior methods, GKD is the most competitive baseline and consistently outperforms SPIN and DFT across model families\. However, SPIN exhibits unstable behavior: it provides little improvement over SFT on Qwen2\.5\-1\.5B and Llama3\.2\-1B, and only modest gains on Qwen3\-0\.6B\-Base\. DFT is even less stable, failing to outperform SFT and collapsing on Llama3\.2\-1B\. In contrast, DuDi consistently improves over these baselines, highlighting its robustness over comparative fine\-tuning approaches\.

## 6 Ablation Studies

### 6\.1 Critical Components

DuDi comprises sequence\-level supervision, off\-policy KD, and on\-policy KD with a cross\-lingual verbalizer\. We perform an ablation study by removing one component at a time and measuring its impact on overall performance\. Additionally, we examine whether excluding English from the set of target responses leads to performance degradation\.

Table [4](https://arxiv.org/html/2606.04694#S6.T4) reports the ablation results\. Eliminating the sequence\-level loss yields only a minor performance decline, though it still helps align the student with the ground\-truth trajectory\. Excluding off\-policy KD causes the largest performance drop, highlighting its critical role in providing ground\-truth supervision and guiding SLMs toward the target trajectory\. Disabling on\-policy KD results in the second\-largest degradation, as the teacher can no longer refine student\-generated responses\.

Removing the verbalizer noticeably reduces performance, by 5%\. Lastly, removing English generation from the cross\-lingual verbalizer also degrades performance, suggesting that including English facilitates better cross\-lingual transfer\. Overall, all components are complementary and jointly contribute to DuDi’s optimal performance\.

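| Method | SEA\-HELM | Δ | % Difference |
| --- | --- | --- | --- |
| DuDi | 10\.1 | \- | \- |
| w/o sequence | 9\.7 | \-0\.4 | \-4\.7% |
| w/o off\-policy KD | 7\.6 | \-2\.5 | \-24\.6% |
| w/o on\-policy KD | 9\.5 | \-0\.6 | \-6\.6% |
| w/o verbalizer | 9\.6 | \-0\.5 | \-5\.0% |
| w/o English | 9\.8 | \-0\.3 | \-3\.3% |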

Table 4: Ablation results of DuDi’s components\.
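As a worked check of how the Δ and % Difference columns relate to the scores, the sketch below recomputes both from the rounded SEA-HELM values in Table 4. The helper name `delta_and_percent` is hypothetical, and the formula (change relative to the full DuDi score) is an assumption about how the columns were derived.

```python
# Scores copied from Table 4 (rounded to one decimal place in the paper).
FULL_SCORE = 10.1
ABLATIONS = {
    "w/o sequence": 9.7,
    "w/o off-policy KD": 7.6,
    "w/o on-policy KD": 9.5,
    "w/o verbalizer": 9.6,
    "w/o English": 9.8,
}


def delta_and_percent(score: float, full: float = FULL_SCORE) -> tuple[float, float]:
    """Return (absolute change, relative change in percent) versus the full model."""
    delta = score - full
    return round(delta, 1), round(100.0 * delta / full, 1)


rows = {name: delta_and_percent(score) for name, score in ABLATIONS.items()}

# Print the ablations from the largest drop to the smallest.
for name, (delta, pct) in sorted(rows.items(), key=lambda item: item[1][0]):
    print(f"{name:<18} {delta:+.1f} {pct:+.1f}%")
```

The recomputed Δ values match the table. The recomputed percentages differ slightly from the reported ones (for example, roughly 4.0% rather than 4.7% for the sequence-level ablation), which would be consistent with the reported percentages being computed from unrounded scores; that explanation is an assumption, not something the paper states.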
### 6\.2 Design Choices

Verbalizer Modes\. As described in Section [3](https://arxiv.org/html/2606.04694#S3), DuDi uses a cross\-lingual verbalizer\. We compare three alternative variants: English, following Shenfeld et al\. \([2026](https://arxiv.org/html/2606.04694#bib.bib38)\), where the verbalizer prompt templates are in English; Multilingual, where $l_z$ is the sample language $l$; and Mix, which uses 50% multilingual mode for native\-language learning and 50% cross\-lingual mode for knowledge transfer\. Templates of the individual verbalizers are provided in Appendix [E](https://arxiv.org/html/2606.04694#A5)\.

Table [5](https://arxiv.org/html/2606.04694#S6.T5) shows that our cross\-lingual verbalizer is the only variant outperforming the no\-verbalizer baseline\. The multilingual verbalizer performs worst overall, while the English\-only verbalizer slightly surpasses it despite lacking language\-specific supervision\. The mixed verbalizer further improves upon the multilingual setting, underscoring the importance of cross\-lingual verbalization\. Overall, these results support the effectiveness of the proposed cross\-lingual verbalizer in facilitating teacher\-student knowledge transfer\.

KD Objective\. We compare DuDi’s reverse KL objective against forward KL and Jensen\-Shannon Divergence \(JSD\), which interpolates between the two\. Table [5](https://arxiv.org/html/2606.04694#S6.T5) shows that reverse KL achieves the best SEA\-HELM score \(10\.1\), substantially outperforming JSD and forward KL, which show relative drops of 23\.8% and 37\.8%, respectively\. The weaker performance of forward KL may stem from teacher\-student mismatch and overestimation of low\-probability regions in the teacher distribution Gu et al\. \([2024](https://arxiv.org/html/2606.04694#bib.bib23)\)\. In contrast, reverse KL yields more stable and effective knowledge transfer\.
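To make the three objectives concrete, the following sketch computes them for a single token position over a toy three-token vocabulary. It is an illustration under stated assumptions: the JSD here is the generalized form with mixing weight `beta`, as commonly used in on-policy distillation work, and the function names and example distributions are invented for this sketch rather than taken from the DuDi code.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def forward_kl(teacher: Sequence[float], student: Sequence[float]) -> float:
    """KL(teacher || student): the expectation is taken under the teacher."""
    return kl(teacher, student)


def reverse_kl(teacher: Sequence[float], student: Sequence[float]) -> float:
    """KL(student || teacher): the expectation is taken under the student."""
    return kl(student, teacher)


def jsd(teacher: Sequence[float], student: Sequence[float], beta: float = 0.5) -> float:
    """Generalized JSD against the mixture m = beta * teacher + (1 - beta) * student."""
    mixture = [beta * t + (1 - beta) * s for t, s in zip(teacher, student)]
    return beta * kl(teacher, mixture) + (1 - beta) * kl(student, mixture)


# Toy next-token distributions (made-up numbers for illustration only).
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.4, 0.4, 0.2]

fkl = forward_kl(teacher_probs, student_probs)
rkl = reverse_kl(teacher_probs, student_probs)
js = jsd(teacher_probs, student_probs)
```

Because reverse KL weights each token by the student's own probability, it penalizes the student mainly where it places mass that the teacher does not, whereas forward KL pushes the student to cover every token the teacher considers possible.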

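| Method | SEA\-HELM | Δ | % Difference |
| --- | --- | --- | --- |
| DuDi | 10\.1 | \- | \- |
| Verbalizer Mode | | | |
| No verbalizer | 9\.6 | \-0\.5 | \-5\.0% |
| English | 8\.2 | \-1\.9 | \-18\.8% |
| Multilingual | 7\.9 | \-2\.2 | \-21\.8% |
| Mix | 9\.0 | \-1\.1 | \-10\.9% |
| KD Objective | | | |
| Forward\-KL | 6\.3 | \-3\.8 | \-37\.8% |
| JSD | 7\.7 | \-2\.4 | \-23\.8% |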

Table 5: Evaluation of performance under different verbalizers and knowledge distillation objectives\.

## 7 Analyses

To further investigate the properties of DuDi, we conduct two analyses\. In Section [7\.1](https://arxiv.org/html/2606.04694#S7.SS1), we examine how cross\-lingual verbalizers facilitate the transferability between teacher and student\. Subsequently, in Section [7\.2](https://arxiv.org/html/2606.04694#S7.SS2), we analyze the role of cold\-start fine\-tuning and demonstrate its importance for effective multilingual knowledge distillation\.

### 7\.1 Why Is the DuDi Verbalizer Optimal?

To better understand the effectiveness of the proposed cross\-lingual verbalizer, we employ the overlap ratio analysis of Li et al\. \([2026](https://arxiv.org/html/2606.04694#bib.bib9)\), which evaluates the top\-k overlap between student and teacher output logits during on\-policy token\-level distillation\. This metric quantifies the degree of agreement between the two distributions, which in turn governs how informative the on\-policy gradient signal is for the student model\.
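A top-k overlap ratio of this kind can be sketched as below. The exact definition used by Li et al. may differ; this version assumes the ratio is the fraction of shared token ids among the top-k logits of student and teacher, averaged over positions. The function names and toy logits are illustrative.

```python
from typing import List, Sequence, Set


def top_k_indices(logits: Sequence[float], k: int) -> Set[int]:
    """Indices of the k largest logits (ties broken by the lower index)."""
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return set(order[:k])


def overlap_ratio(
    student_logits: List[Sequence[float]],
    teacher_logits: List[Sequence[float]],
    k: int,
) -> float:
    """Mean fraction of shared top-k token ids across all positions."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must cover the same positions")
    if not student_logits:
        raise ValueError("need at least one position")
    per_position = [
        len(top_k_indices(s, k) & top_k_indices(t, k)) / k
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_position) / len(per_position)


# Two positions over a toy four-token vocabulary (made-up logits).
student = [[2.0, 1.0, 0.5, -1.0], [0.1, 0.3, 2.2, 1.9]]
teacher = [[1.8, 0.2, 1.1, -0.5], [0.0, 2.5, 2.0, 0.4]]
ratio = overlap_ratio(student, teacher, k=2)  # one of two top-2 ids shared per position
```

In this toy case the student and teacher share one of their two top tokens at each position, so the ratio is 0.5; identical logits would give 1.0.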

As shown in Figure [5](https://arxiv.org/html/2606.04694#S7.F5), the proposed cross\-lingual verbalizer achieves the highest overlap ratio throughout training, indicating that cross\-lingual rollouts provide informative supervision signals for on\-policy distillation\. For the remaining variants, the overlap ratio ranking follows the performance trend in Table [5](https://arxiv.org/html/2606.04694#S6.T5), with no verbalizer performing second best, followed by the mix, English, and multilingual verbalizers\. Notably, the results suggest that multilingual verbalizers can increase teacher\-student mismatches, highlighting the challenges of verbalizer design in multilingual settings\.

![Refer to caption](https://arxiv.org/html/2606.04694v1/x5.png)

Figure 5: Overlap ratio between teacher and student logits during on\-policy rollouts across different verbalizers\. The mix verbalizer denotes a uniform random combination of multilingual and cross\-lingual verbalizers\.

### 7\.2 Why is Cold\-Start SFT Important?

DuDi is trained under a cold\-start SFT setting, where models are initialized from SFT checkpoints using ground\-truth responses as supervision\. In parallel, Li et al\. \([2026](https://arxiv.org/html/2606.04694#bib.bib9)\) and Zhu et al\. \([2026](https://arxiv.org/html/2606.04694#bib.bib42)\) adopt a related initialization strategy based on teacher\-generated responses rather than ground\-truth data, demonstrating that SFT initialization improves the effectiveness of on\-policy distillation\. As shown in Table [6](https://arxiv.org/html/2606.04694#S7.T6), training DuDi directly from the base model without cold\-start initialization results in the weakest performance\. Although initialization with teacher\-generated outputs yields moderate improvements, performance remains lower than with initialization on the original ground\-truth training data\. These findings indicate that cold\-start initialization also plays an important role in effective multilingual knowledge distillation for SLMs\. Furthermore, similar trends are consistently observed across all comparative methods, as reported in Appendix [G](https://arxiv.org/html/2606.04694#A7)\.

| Method | SEA\-HELM | Δ | % Difference |
| --- | --- | --- | --- |
| DuDi | 10\.1 | \- | \- |
| Teacher generated | 9\.1 | \-1\.0 | \-10\.2% |
| No cold\-start | 8\.6 | \-1\.5 | \-15\.1% |

Table 6: Evaluation of downstream performance under different cold\-start initialization settings\.

## 8 Conclusion

In this work, we introduced DuDi, a general\-purpose multilingual distillation framework for SLMs that integrates sequence\-level and token\-level supervision, along with a cross\-lingual verbalization mechanism\. Extensive experiments across diverse model families and parameter scales demonstrate that DuDi consistently achieves the highest average SEA\-HELM performance, substantially outperforming strong fine\-tuning and distillation baselines\. Ablation studies further indicate that jointly optimizing sequence\-level and token\-level objectives, together with the proposed verbalizer design, yields complementary benefits\. Our analysis also shows that the cross\-lingual verbalizer improves teacher\-student knowledge transferability\. Overall, DuDi offers an effective fine\-tuning framework for SLMs in a multilingual environment\. In addition, to support open research, we will release all artifacts in this paper, including training code, datasets, and models\.

## Limitations

The experimental setup of this study primarily focuses on Southeast Asian \(SEA\) languages, with models trained on SEA\-Instruct and evaluated using SEA\-HELM Susanto et al\. \([2025](https://arxiv.org/html/2606.04694#bib.bib18)\), which covers seven SEA languages across a diverse set of tasks\. Consequently, the findings related to DuDi may not generalize to tasks beyond those included in the current evaluation framework\. Nevertheless, SEA\-HELM remains a gold\-standard benchmark for the comprehensive evaluation of language model capabilities in SEA languages\. Future work will focus on extending both the training data and evaluation benchmarks to encompass a wider range of contemporary language modeling tasks\.

Another limitation of this work concerns the availability of suitable teacher models\. In particular, the teacher model must be more capable, and typically larger in scale, than the student model while sharing the same output vocabulary space\. This constraint arises because the divergence function employed in knowledge distillation requires aligned teacher–student logit dimensions\.

## Acknowledgments

This project is supported by the National Research Foundation, Singapore under its National Large Language Models Funding Initiative\. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author\(s\) and do not reflect the views of National Research Foundation, Singapore\. We thank Trevor Cohn for his helpful feedback, and Ngee Chia Tai and Raymond Ng for their support and valuable comments\.

## References

- R\. Agarwal, N\. Vieillard, Y\. Zhou, P\. Stanczyk, S\. R\. Garea, M\. Geist, and O\. Bachem \(2024\)On\-policy distillation of language models: learning from self\-generated mistakes\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=3zKtaqxLhW)Cited by:[§A\.1](https://arxiv.org/html/2606.04694#A1.SS1.p2.1),[Appendix B](https://arxiv.org/html/2606.04694#A2.p1.2),[Appendix B](https://arxiv.org/html/2606.04694#A2.p2.2),[§1](https://arxiv.org/html/2606.04694#S1.p3.1),[Table 1](https://arxiv.org/html/2606.04694#S2.T1.26.26.26.5),[§2](https://arxiv.org/html/2606.04694#S2.p5.7),[§3\.2](https://arxiv.org/html/2606.04694#S3.SS2.p2.1),[§3\.5](https://arxiv.org/html/2606.04694#S3.SS5.p1.1),[§4\.3](https://arxiv.org/html/2606.04694#S4.SS3.p1.1)\.
- N\. Chen, Z\. Zheng, N\. Wu, M\. Gong, D\. Zhang, and J\. Li \(2024a\)Breaking language barriers in multilingual mathematical reasoning: insights and observations\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 7001–7016\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.411/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.411)Cited by:[§A\.2](https://arxiv.org/html/2606.04694#A1.SS2.p1.1)\.
- X\. Chen, C\. Ma, W\. Fan, Z\. Zhang, and L\. Qing \(2025\)C2KD: cross\-layer and cross\-head knowledge distillation for small language model\-based recommendation\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 17827–17838\.External Links:[Link](https://aclanthology.org/2025.findings-acl.917/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.917),ISBN 979\-8\-89176\-256\-5Cited by:[§1](https://arxiv.org/html/2606.04694#S1.p1.2)\.
- Z\. Chen, Y\. Deng, H\. Yuan, K\. Ji, and Q\. Gu \(2024b\)Self\-play fine\-tuning converts weak language models to strong language models\.InProceedings of the 41st International Conference on Machine Learning,ICML’24\.Cited by:[§A\.1](https://arxiv.org/html/2606.04694#A1.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.04694#S2.T1.15.15.15.6),[§2](https://arxiv.org/html/2606.04694#S2.p3.1),[§3\.5](https://arxiv.org/html/2606.04694#S3.SS5.p1.1),[§4\.3](https://arxiv.org/html/2606.04694#S4.SS3.p1.1)\.
- E\. Del Barrio, J\. A\. Cuesta\-Albertos, and C\. Matrán \(2018\)An optimal transportation approach for assessing almost stochastic order\.InThe Mathematics of the Uncertain,pp\. 33–44\.Cited by:[§4\.2](https://arxiv.org/html/2606.04694#S4.SS2.p1.1)\.
- R\. Dror, S\. Shlomov, and R\. Reichart \(2019\)Deep dominance \- how to properly compare deep neural models\.InProceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28\- August 2, 2019, Volume 1: Long Papers,A\. Korhonen, D\. R\. Traum, and L\. Màrquez \(Eds\.\),pp\. 2773–2785\.External Links:[Link](https://doi.org/10.18653/v1/p19-1266),[Document](https://dx.doi.org/10.18653/v1/p19-1266)Cited by:[§4\.2](https://arxiv.org/html/2606.04694#S4.SS2.p1.1)\.
- Y\. Gu, L\. Dong, F\. Wei, and M\. Huang \(2024\)MiniLLM: knowledge distillation of large language models\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=5h0qf7IBZZ)Cited by:[§A\.1](https://arxiv.org/html/2606.04694#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.04694#S1.p3.1),[§3\.2](https://arxiv.org/html/2606.04694#S3.SS2.p2.1),[§6\.2](https://arxiv.org/html/2606.04694#S6.SS2.p3.1)\.
- G\. E\. Hinton, O\. Vinyals, and J\. Dean \(2015\)Distilling the knowledge in a neural network\.CoRRabs/1503\.02531\.External Links:[Link](http://dblp.uni-trier.de/db/journals/corr/corr1503.html#HintonVD15)Cited by:[§A\.1](https://arxiv.org/html/2606.04694#A1.SS1.p1.1),[§1](https://arxiv.org/html/2606.04694#S1.p3.1),[§2](https://arxiv.org/html/2606.04694#S2.p5.2)\.
- S\. Hu, Y\. Tu, X\. Han, G\. Cui, C\. He, W\. Zhao, X\. Long, Z\. Zheng, Y\. Fang, Y\. Huang, X\. Zhang, Z\. L\. Thai, C\. Wang, Y\. Yao, C\. Zhao, J\. Zhou, J\. Cai, Z\. Zhai, N\. Ding, C\. Jia, G\. Zeng, dahai li, Z\. Liu, and M\. Sun \(2024\)MiniCPM: unveiling the potential of small language models with scalable training strategies\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=3X2L2TFr0f)Cited by:[§1](https://arxiv.org/html/2606.04694#S1.p1.2)\.
- J\. Hübotter, F\. Lübeck, L\. D\. Behric, A\. Baumann, M\. Bagatella, D\. Marta, I\. Hakimi, I\. Shenfeld, T\. K\. Buening, C\. Guestrin, and A\. Krause \(2026\)Reinforcement learning via self\-distillation\.InThe 1st Workshop on Scaling Post\-training for LLMs,External Links:[Link](https://openreview.net/forum?id=k8DcHShsrJ)Cited by:[§A\.1](https://arxiv.org/html/2606.04694#A1.SS1.p2.1),[§2](https://arxiv.org/html/2606.04694#S2.p4.3)\.
- Y\. Kim and A\. M\. Rush \(2016\)Sequence\-level knowledge distillation\.InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing,J\. Su, K\. Duh, and X\. Carreras \(Eds\.\),Austin, Texas,pp\. 1317–1327\.External Links:[Link](https://aclanthology.org/D16-1139/),[Document](https://dx.doi.org/10.18653/v1/D16-1139)Cited by:[§A\.1](https://arxiv.org/html/2606.04694#A1.SS1.p1.1),[Appendix B](https://arxiv.org/html/2606.04694#A2.p2.2),[§1](https://arxiv.org/html/2606.04694#S1.p3.1),[Table 1](https://arxiv.org/html/2606.04694#S2.T1.22.22.22.5),[§2](https://arxiv.org/html/2606.04694#S2.p5.7),[§4\.3](https://arxiv.org/html/2606.04694#S4.SS3.p1.1)\.
- J\. Ko, T\. Chen, S\. Kim, T\. Ding, L\. Liang, I\. Zharkov, and S\. Yun \(2025\)DistiLLM\-2: a contrastive approach boosts the distillation of LLMs\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=rc65N9xIrY)Cited by:[§A\.1](https://arxiv.org/html/2606.04694#A1.SS1.p1.1),[§A\.1](https://arxiv.org/html/2606.04694#A1.SS1.p2.1),[§1](https://arxiv.org/html/2606.04694#S1.p3.1),[§2](https://arxiv.org/html/2606.04694#S2.p5.2),[§3\.2](https://arxiv.org/html/2606.04694#S3.SS2.p2.1)\.
- J\. Ko, S\. Kim, T\. Chen, and S\. Yun \(2024\)DistiLLM: towards streamlined distillation for large language models\.InForty\-first International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=lsHZNNoC7r)Cited by:[§1](https://arxiv.org/html/2606.04694#S1.p3.1),[§2](https://arxiv.org/html/2606.04694#S2.p5.2),[§3\.2](https://arxiv.org/html/2606.04694#S3.SS2.p2.1)\.
- Y\. Li, Y\. Zuo, B\. He, J\. Zhang, C\. Xiao, C\. Qian, T\. Yu, H\. Gao, W\. Yang, Z\. Liu,et al\.\(2026\)Rethinking on\-policy distillation of large language models: phenomenology, mechanism, and recipe\.arXiv preprint arXiv:2604\.13016\.Cited by:[Appendix F](https://arxiv.org/html/2606.04694#A6.p2.1),[Appendix G](https://arxiv.org/html/2606.04694#A7.p2.1),[§7\.1](https://arxiv.org/html/2606.04694#S7.SS1.p1.1),[§7\.2](https://arxiv.org/html/2606.04694#S7.SS2.p1.1)\.
- A\. Lin, J\. Wohlwend, H\. Chen, and T\. Lei \(2020\)Autoregressive knowledge distillation through imitation learning\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 6121–6133\.External Links:[Link](https://aclanthology.org/2020.emnlp-main.494/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.494)Cited by:[§A\.1](https://arxiv.org/html/2606.04694#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.04694#S2.p5.2)\.
- Z\. Liu, C\. Zhao, F\. Iandola, C\. Lai, Y\. Tian, I\. Fedorov, Y\. Xiong, E\. Chang, Y\. Shi, R\. Krishnamoorthi, L\. Lai, and V\. Chandra \(2024\)MobileLLM: optimizing sub\-billion parameter language models for on\-device use cases\.InForty\-first International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=EIGbXbxcUQ)Cited by:[§1](https://arxiv.org/html/2606.04694#S1.p1.2)\.
- C\. V\. Nguyen, X\. Shen, R\. Aponte, Y\. Xia, S\. Basu, Z\. Hu, J\. Chen, M\. Parmar, S\. Kunapuli, J\. Barrow, J\. Wu, A\. Singh, Y\. Wang, J\. Gu, F\. Dernoncourt, N\. K\. Ahmed, N\. Lipka, R\. Zhang, X\. Chen, T\. Yu, S\. Kim, H\. Deilamsalehy, N\. Park, M\. Rimer, Z\. Zhang, H\. Yang, R\. A\. Rossi, and T\. H\. Nguyen \(2024\)A survey of small language models\.External Links:2410\.20011,[Link](https://arxiv.org/abs/2410.20011)Cited by:[§1](https://arxiv.org/html/2606.04694#S1.p1.2)\.
- P\. Payoungkhamdee, P\. Limkonchotiwat, J\. Baek, P\. Manakul, C\. Udomcharoenchaikit, E\. Chuangsuwanich, and S\. Nutanong \(2024\)An empirical study of multilingual reasoning distillation for question answering\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 7739–7751\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.442/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.442)Cited by:[§A\.2](https://arxiv.org/html/2606.04694#A1.SS2.p1.1),[§1](https://arxiv.org/html/2606.04694#S1.p3.1)\.
- T\. M\. Pham, P\. T\. Nguyen, S\. Yoon, V\. D\. Lai, F\. Dernoncourt, and T\. Bui \(2025\)SlimLM: an efficient small language model for on\-device document assistance\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\),P\. Mishra, S\. Muresan, and T\. Yu \(Eds\.\),Vienna, Austria,pp\. 436–447\.External Links:[Link](https://aclanthology.org/2025.acl-demo.42/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-demo.42),ISBN 979\-8\-89176\-253\-4Cited by:[§1](https://arxiv.org/html/2606.04694#S1.p1.2)\.
- L\. Qin, Q\. Chen, Y\. Zhou, Z\. Chen, Y\. Li, L\. Liao, M\. Li, W\. Che, and P\. S\. Yu \(2025\)A survey of multilingual large language models\.Patterns6\(1\),pp\. 101118\.External Links:ISSN 2666\-3899,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.patter.2024.101118),[Link](https://www.sciencedirect.com/science/article/pii/S2666389924002903)Cited by:[§1](https://arxiv.org/html/2606.04694#S1.p2.1)\.
- Qwen, :, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§4\.1](https://arxiv.org/html/2606.04694#S4.SS1.p1.2)\.
- I\. Shenfeld, M\. Damani, J\. Hübotter, and P\. Agrawal \(2026\)Self\-distillation enables continual learning\.InICLR 2026 Workshop on Lifelong Agents: Learning, Aligning, Evolving,External Links:[Link](https://openreview.net/forum?id=HlWA3V6iKF)Cited by:[§A\.1](https://arxiv.org/html/2606.04694#A1.SS1.p2.1),[Appendix B](https://arxiv.org/html/2606.04694#A2.p2.2),[Figure 6](https://arxiv.org/html/2606.04694#A5.F6),[Appendix E](https://arxiv.org/html/2606.04694#A5.p1.5),[Table 1](https://arxiv.org/html/2606.04694#S2.T1.18.18.18.4),[§2](https://arxiv.org/html/2606.04694#S2.p4.3),[§3\.3](https://arxiv.org/html/2606.04694#S3.SS3.p1.11),[§3\.5](https://arxiv.org/html/2606.04694#S3.SS5.p1.1),[§4\.3](https://arxiv.org/html/2606.04694#S4.SS3.p1.1),[§6\.2](https://arxiv.org/html/2606.04694#S6.SS2.p1.2)\.
- D\. Silver, J\. Schrittwieser, K\. Simonyan, I\. Antonoglou, A\. Huang, A\. Guez, T\. Hubert, L\. Baker, M\. Lai, A\. Bolton, Y\. Chen, T\. Lillicrap, F\. Hui, L\. Sifre, G\. van den Driessche, T\. Graepel, and D\. Hassabis \(2017\)Mastering the game of go without human knowledge\.Nature550\(7676\),pp\. 354–359\.External Links:ISSN 1476\-4687,[Document](https://dx.doi.org/10.1038/nature24270),[Link](https://doi.org/10.1038/nature24270)Cited by:[§2](https://arxiv.org/html/2606.04694#S2.p3.1)\.
- S\. Subramanian, V\. Elango, and M\. Gungor \(2025\)Small language models \(slms\) can still pack a punch: a survey\.External Links:2501\.05465,[Link](https://arxiv.org/abs/2501.05465)Cited by:[§1](https://arxiv.org/html/2606.04694#S1.p1.2)\.
- Y\. Susanto, A\. V\. Hulagadri, J\. R\. Montalan, J\. G\. Ngui, X\. Yong, W\. Q\. Leong, H\. Rengarajan, P\. Limkonchotiwat, Y\. Mai, and W\. C\. Tjhi \(2025\)SEA\-HELM: Southeast Asian holistic evaluation of language models\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 12308–12336\.External Links:[Link](https://aclanthology.org/2025.findings-acl.636/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.636),ISBN 979\-8\-89176\-256\-5Cited by:[§1](https://arxiv.org/html/2606.04694#S1.p5.1),[§4\.2](https://arxiv.org/html/2606.04694#S4.SS2.p1.1),[Limitations](https://arxiv.org/html/2606.04694#Sx1.p1.1)\.
- G\. Tesauro \(1995\)Temporal difference learning and td\-gammon\.Commun\. ACM38\(3\),pp\. 58–68\.External Links:ISSN 0001\-0782,[Link](https://doi.org/10.1145/203330.203343),[Document](https://dx.doi.org/10.1145/203330.203343)Cited by:[§2](https://arxiv.org/html/2606.04694#S2.p3.1)\.
- D\. Ulmer, C\. Hardmeier, and J\. Frellsen \(2022\)Deep\-significance: easy and meaningful significance testing in the age of neural networks\.InML Evaluation Standards Workshop at the Tenth International Conference on Learning Representations,Cited by:[footnote 3](https://arxiv.org/html/2606.04694#footnote3)\.
- L\. von Werra, Y\. Belkada, L\. Tunstall, E\. Beeching, T\. Thrush, N\. Lambert, S\. Huang, K\. Rasul, and Q\. Gallouédec \(2020\)TRL: Transformers Reinforcement LearningExternal Links:[Link](https://github.com/huggingface/trl)Cited by:[Appendix B](https://arxiv.org/html/2606.04694#A2.p2.2)\.
- F\. Wang, Z\. Zhang, X\. Zhang, Z\. Wu, T\. Mo, Q\. Lu, W\. Wang, R\. Li, J\. Xu, X\. Tang, Q\. He, Y\. Ma, M\. Huang, and S\. Wang \(2024\)A comprehensive survey of small language models in the era of large language models: techniques, enhancements, applications, collaboration with llms, and trustworthiness\.External Links:2411\.03350,[Link](https://arxiv.org/abs/2411.03350)Cited by:[§1](https://arxiv.org/html/2606.04694#S1.p1.2)\.
- Y\. Wu, Y\. Zhou, Z\. Ziheng, Y\. Peng, X\. Ye, X\. Hu, W\. Zhu, L\. Qi, M\. Yang, and X\. Yang \(2026\)On the generalization of SFT: a reinforcement learning perspective with reward rectification\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Lv7PjbcaMi)Cited by:[Table 1](https://arxiv.org/html/2606.04694#S2.T1.10.10.10.6),[§2](https://arxiv.org/html/2606.04694#S2.p2.6),[§4\.3](https://arxiv.org/html/2606.04694#S4.SS3.p1.1)\.
- W\. Xu, R\. Han, Z\. Wang, L\. Le, D\. Madeka, L\. Li, W\. Y\. Wang, R\. Agarwal, C\. Lee, and T\. Pfister \(2025\)Speculative knowledge distillation: bridging the teacher\-student gap through interleaved sampling\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=EgJhwYR2tB)Cited by:[§A\.1](https://arxiv.org/html/2606.04694#A1.SS1.p1.1)\.
- W\. Xuan, R\. Yang, H\. Qi, Q\. Zeng, Y\. Xiao, A\. Feng, D\. Liu, Y\. Xing, J\. Wang, F\. Gao, J\. Lu, Y\. Jiang, H\. Li, X\. Li, K\. Yu, R\. Dong, S\. Gu, Y\. Li, X\. Xie, F\. Juefei\-Xu, F\. Khomh, O\. Yoshie, Q\. Chen, D\. Teodoro, N\. Liu, R\. Goebel, L\. Ma, E\. Marrese\-Taylor, S\. Lu, Y\. Iwasawa, Y\. Matsuo, and I\. Li \(2025\)MMLU\-ProX: a multilingual benchmark for advanced large language model evaluation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 1513–1532\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.79/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.79),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.04694#S1.p2.1)\.
- Z\. Yang, T\. Pang, H\. Feng, H\. Wang, W\. Chen, M\. Zhu, and Q\. Liu \(2024\) Self\-distillation bridges distribution gap in language model fine\-tuning\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 1028–1043\. External Links: [Link](https://aclanthology.org/2024.acl-long.58/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.58) Cited by: [§A\.1](https://arxiv.org/html/2606.04694#A1.SS1.p2.1), [§2](https://arxiv.org/html/2606.04694#S2.p4.3)\.
- D\. Yoon, J\. Jang, S\. Kim, S\. Kim, S\. Shafayat, and M\. Seo \(2024\) LangBridge: multilingual reasoning without multilingual supervision\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 7502–7522\. External Links: [Link](https://aclanthology.org/2024.acl-long.405/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.405) Cited by: [§A\.2](https://arxiv.org/html/2606.04694#A1.SS2.p1.1)\.
- Y\. Zhang, Y\. Wang, Z\. Liu, S\. Wang, X\. Wang, P\. Li, M\. Sun, and Y\. Liu \(2024\) Enhancing multilingual capabilities of large language models through self\-distillation from resource\-rich languages\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 11189–11204\. External Links: [Link](https://aclanthology.org/2024.acl-long.603/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.603) Cited by: [§A\.2](https://arxiv.org/html/2606.04694#A1.SS2.p1.1), [§1](https://arxiv.org/html/2606.04694#S1.p3.1), [§2](https://arxiv.org/html/2606.04694#S2.p4.3)\.
- W\. Zhao, J\. Guo, Y\. Deng, T\. Wu, W\. Zhang, Y\. Hu, X\. Sui, Y\. Zhao, W\. Che, B\. Qin, T\. Chua, and T\. Liu \(2026\) When less language is more: language\-reasoning disentanglement makes LLMs better multilingual reasoners\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=fleQlZ2VTx) Cited by: [§A\.2](https://arxiv.org/html/2606.04694#A1.SS2.p1.1)\.
- Y\. Zhao, W\. Zhang, G\. Chen, K\. Kawaguchi, and L\. Bing \(2024\) How do large language models handle multilingualism?\. In The Thirty\-eighth Annual Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=ctXYOoAgRy) Cited by: [§A\.2](https://arxiv.org/html/2606.04694#A1.SS2.p1.1)\.
- S\. Zhu, X\. Ye, H\. Lu, W\. Shi, and G\. Liu \(2026\) The many faces of on\-policy distillation: pitfalls, mechanisms, and fixes\. External Links: 2605\.11182, [Link](https://arxiv.org/abs/2605.11182) Cited by: [Appendix G](https://arxiv.org/html/2606.04694#A7.p2.1), [§7\.2](https://arxiv.org/html/2606.04694#S7.SS2.p1.1)\.

## Appendix A Related Work

### A\.1 Knowledge Distillation

Knowledge distillation \(KD\) \(Hinton et al\., [2015](https://arxiv.org/html/2606.04694#bib.bib34)\) is a training paradigm that transfers knowledge from a larger teacher model to a smaller student model, enabling compact models to benefit from the capabilities of stronger models\. Early sequence\-level distillation methods, including SeqKD \(Kim and Rush, [2016](https://arxiv.org/html/2606.04694#bib.bib35)\) and ImitKD \(Lin et al\., [2020](https://arxiv.org/html/2606.04694#bib.bib24)\), demonstrate that teacher\-generated outputs provide effective supervision signals for student training\. Subsequent work has focused on improving the stability and efficiency of distillation\. For example, MiniLLM \(Gu et al\., [2024](https://arxiv.org/html/2606.04694#bib.bib23)\) introduces a policy\-gradient\-based framework that mitigates the high variance commonly encountered in reinforcement learning optimization\. Similarly, Xu et al\. \([2025](https://arxiv.org/html/2606.04694#bib.bib21)\) combine static datasets with on\-policy distillation through speculative decoding for synthetic data generation\. Among recent approaches, DistiLLM \(Ko et al\., [2025](https://arxiv.org/html/2606.04694#bib.bib20)\) achieves strong performance and training efficiency by employing symmetric KL divergence together with an adaptive off\-policy distillation strategy\.

More recent studies extend distillation to on\-policy settings, where student models learn directly from their own generated responses\. For instance, Agarwal et al\. \([2024](https://arxiv.org/html/2606.04694#bib.bib36)\) propose on\-policy optimization objectives based on reverse KL divergence and Jensen\-Shannon divergence \(JSD\)\. Beyond conventional teacher distillation paradigms, self\-distillation methods exploit model\-generated responses as supervision signals to iteratively improve reasoning and downstream capabilities\. Specifically, offline self\-distillation methods \(Yang et al\., [2024](https://arxiv.org/html/2606.04694#bib.bib14)\) use self\-generated responses from a seed model to better align the model with its own output distribution, whereas online variants \(Shenfeld et al\., [2026](https://arxiv.org/html/2606.04694#bib.bib38)\) employ in\-context learning to acquire new capabilities while retaining the original competencies of the base model, alongside related reinforcement learning formulations proposed by Hübotter et al\. \([2026](https://arxiv.org/html/2606.04694#bib.bib2)\)\. In parallel, sequence\-level optimization methods, such as the self\-play approach SPIN \(Chen et al\., [2024b](https://arxiv.org/html/2606.04694#bib.bib33)\) and DistiLLM\-2 \(Ko et al\., [2025](https://arxiv.org/html/2606.04694#bib.bib20)\), leverage contrastive objectives and trajectory\-level regularization to enhance sample efficiency in reasoning\-focused tasks\.

### A\.2 Multilingual Distillation

Recent studies have primarily focused on constructing high\-quality datasets for training smaller models on reasoning tasks\. Payoungkhamdee et al\. \([2024](https://arxiv.org/html/2606.04694#bib.bib28)\) propose a distillation framework that transfers teacher capabilities through response generation, leveraging both positive and negative rationales to fine\-tune smaller models for question\-answering tasks\. From a data\-centric perspective, Zhang et al\. \([2024](https://arxiv.org/html/2606.04694#bib.bib10)\) present a self\-distillation approach that transfers capabilities from resource\-rich to low\-resource languages, and MathOctopus \(Chen et al\., [2024a](https://arxiv.org/html/2606.04694#bib.bib27)\) translates mathematical training data into target languages to improve multilingual mathematical reasoning performance\. On the architectural side, Yoon et al\. \([2024](https://arxiv.org/html/2606.04694#bib.bib26)\) introduce a multilingual encoder integrated with reasoning\-capable LLMs for solving multilingual mathematics problems\. In addition, Zhao et al\. \([2024](https://arxiv.org/html/2606.04694#bib.bib25)\) investigate the disentanglement of language and reasoning by identifying and exploiting language\-specific neurons, thereby enhancing multilingual capabilities\. Similarly, Zhao et al\. \([2026](https://arxiv.org/html/2606.04694#bib.bib29)\) propose causal intervention methods to improve downstream multilingual reasoning performance\. Despite these advances, existing knowledge distillation frameworks remain largely centered on English and on single, task\-specific settings, while general\-purpose training frameworks for multilingual LLMs are still underexplored\.

## Appendix B Training Configuration

We fine\-tune all models and all methods with a learning rate of 2e\-5, batch size of 32, and 3 training epochs\. The maximum sequence length is set to 1024 tokens, while on\-policy rollouts are limited to a maximum of 256 generated tokens\. For off\-policy fine\-tuning, we apply loss only on assistant tokens to better align with the next\-token prediction objective of language models\. In the on\-policy rollout, following Agarwal et al\. \([2024](https://arxiv.org/html/2606.04694#bib.bib36)\), we enable stochastic exploration by setting do\_sample=True and top\-k=0 \(see the [text generation documentation](https://huggingface.co/docs/transformers/main_classes/text_generation)\), with temperature 0\.9 for all approaches\. All knowledge distillation objectives use a temperature of 1\.0\. For hyperparameters in DuDi, we use fixed values of $\lambda=0.5$ and $\alpha=0.1$ across all experiments without per\-model tuning\.
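For quick reference, the settings above can be collected into a small configuration sketch. The values come from this appendix; the dictionary layout and key names are illustrative assumptions rather than the authors' actual configuration files, and the rollout keys simply mirror keyword arguments accepted by Hugging Face `generate`.

```python
# Hyperparameters as stated in Appendix B. Key names and grouping are
# illustrative assumptions, not the authors' configuration files.
TRAINING = {
    "learning_rate": 2e-5,
    "batch_size": 32,
    "num_train_epochs": 3,
    "max_seq_length": 1024,   # tokens
    "kd_temperature": 1.0,    # temperature of the distillation objectives
    "lambda": 0.5,            # DuDi hyperparameter, fixed across experiments
    "alpha": 0.1,             # DuDi hyperparameter, fixed across experiments
}

# On-policy rollout sampling settings, named after Hugging Face
# `generate` keyword arguments.
ROLLOUT_GENERATION = {
    "do_sample": True,        # stochastic exploration
    "top_k": 0,               # 0 turns off top-k filtering
    "temperature": 0.9,
    "max_new_tokens": 256,    # rollout length cap
}
```

Keeping the rollout temperature (0.9) separate from the distillation temperature (1.0) mirrors the distinction the text draws between sampling and the training objective.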

For other methods, we evaluate SeqKD \(Kim and Rush, [2016](https://arxiv.org/html/2606.04694#bib.bib35)\) with $\lambda=1.0$, corresponding to training solely on teacher\-generated responses, and GKD \(Agarwal et al\., [2024](https://arxiv.org/html/2606.04694#bib.bib36)\) using its default setting of $\lambda=0.5$\. Both SeqKD and GKD use JSD as the objective\. We implement DuDi and all comparative methods using the TRL trainer framework \(von Werra et al\., [2020](https://arxiv.org/html/2606.04694#bib.bib1)\), except for SDFT \(Shenfeld et al\., [2026](https://arxiv.org/html/2606.04694#bib.bib38)\), where we adopt the original codebase provided by the authors\.
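To make the JSD objective concrete, here is a minimal per-token sketch in plain Python. It assumes the standard symmetric form, JSD(P, Q) = 0.5 KL(P || M) + 0.5 KL(Q || M) with M = 0.5 (P + Q), applied to temperature-scaled softmax distributions. GKD defines a generalized, weighted variant, and the exact variant used in these experiments is not stated in this appendix, so the code is illustrative only; all function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(teacher_logits, student_logits, temperature=1.0):
    """Symmetric Jensen-Shannon divergence between teacher and student."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

In this form the divergence is symmetric in teacher and student and bounded above by log 2, which is one reason it tends to be better behaved than a one-sided KL when the two distributions disagree strongly.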

## Appendix C Computing Resources

We trained small models on 8× NVIDIA H200 \(140GB\) GPUs, completing fine\-tuning within approximately 7 hours for SDFT, SeqKD, SPIN, GKD, and DuDi, and within an hour for SFT and DFT\. For evaluation, we used 2× NVIDIA H200 \(140GB\) GPUs, completing SEA\-HELM within 1 hour\. In total, our experiments required approximately 1,704 GPU hours\.

## Appendix D Top\-K Overlap Analysis Details

## Appendix E Verbalizer Template

For each training sample $(x, y, l)$, there are three verbalizer modes: \(i\) English, where the verbalizer prompt template is in English, following Shenfeld et al\. \([2026](https://arxiv.org/html/2606.04694#bib.bib38)\); \(ii\) Multilingual, where $l_z = l$; and \(iii\) Cross\-lingual, where the verbalizer prompt template language matches the sample’s native language $l$, while $l_z$ is sampled uniformly from the set consisting of English and all training languages excluding $l$\. Templates for all verbalizer modes are shown in Figure [6](https://arxiv.org/html/2606.04694#A5.F6)\.
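The choice of $l_z$ under the three modes can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function name, the mode strings, and the language codes are hypothetical, and returning English for $l_z$ in the English mode is an assumption, since the text only states that the template is in English for that mode.

```python
import random

def sample_verbalizer_language(mode, sample_lang, training_langs, rng=random):
    """Pick l_z for one training sample whose native language is sample_lang."""
    if mode == "english":
        # Assumption: l_z is English here; the source only says that the
        # template itself is in English for this mode.
        return "en"
    if mode == "multilingual":
        return sample_lang  # l_z = l
    if mode == "cross-lingual":
        # Uniform draw from English plus all training languages, excluding l.
        candidates = sorted(({"en"} | set(training_langs)) - {sample_lang})
        return rng.choice(candidates)
    raise ValueError(f"unknown verbalizer mode: {mode}")

# Hypothetical language codes, used only for illustration.
langs = ["id", "vi", "th", "ta", "tl", "ms", "my"]
l_z = sample_verbalizer_language("cross-lingual", "th", langs, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the cross-lingual draw reproducible; in the cross-lingual mode the result never equals the sample's own language.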

![Refer to caption](https://arxiv.org/html/2606.04694v1/x6.png)

Figure 6: Comparison of verbalizer templates for teacher prompt, including the English verbalizer from Shenfeld et al\. \([2026](https://arxiv.org/html/2606.04694#bib.bib38)\), our extended multilingual verbalizer, and the proposed cross\-lingual verbalizer with its corresponding student prompt example in Thai\.

## Appendix F Limitations of DFT as Cold\-Start

| Method | id | vi | th | ta | tl | ms | my | Avg\. \(Δ\) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SFT | 10\.9 | 12\.1 | 11\.2 | 5\.1 | 5\.8 | 12\.1 | 4\.3 | 8\.8 |
| DFT | 9\.3 | 11\.6 | 10\.2 | 10\.0 | 6\.1 | 12\.4 | 8\.3 | 9\.7 |
| SFT → SPIN | 12\.6 | 11\.4 | 10\.0 | 5\.3 | 8\.0 | 14\.7 | 4\.8 | 9\.5 \(\+0\.7\) |
| DFT → SPIN | 11\.5 | 8\.5 | 9\.1 | 9\.3 | 5\.0 | 12\.4 | 8\.2 | 9\.1 \(\-0\.6\) |
| SFT → GKD | 11\.7 | 13\.2 | 10\.9 | 4\.9 | 4\.8 | 13\.6 | 3\.7 | 9\.0 \(\+0\.2\) |
| DFT → GKD | 6\.6 | 7\.8 | 7\.9 | 3\.9 | 4\.5 | 7\.2 | 3\.6 | 5\.9 \(\-3\.8\) |
| SFT → DuDi | 12\.8 | 14\.6 | 11\.5 | 5\.4 | 6\.4 | 14\.6 | 3\.4 | 10\.1 \(\+1\.3\) |
| DFT → DuDi | 7\.9 | 10\.8 | 8\.9 | 6\.4 | 4\.9 | 9\.1 | 4\.1 | 7\.4 \(\-2\.3\) |

Table 7: Comparison of results using alternative SFT\- or DFT\-based checkpoints as a cold\-start across different training frameworks\.

As shown in Table [2](https://arxiv.org/html/2606.04694#S4.T2), DFT outperforms SFT on downstream tasks\. Motivated by this, we examine whether DFT provides a better initialization checkpoint than SFT as a cold\-start\. We compare off\-policy fine\-tuning \(cold\-start\) initialized from SFT and DFT checkpoints across three methods: SPIN, GKD, and DuDi\. As shown in Table [7](https://arxiv.org/html/2606.04694#A6.T7), using a DFT checkpoint as the cold\-start degrades performance across all three frameworks: the average score falls by 0\.6 to 3\.8 points after further training, whereas SFT initialization gains 0\.2 to 1\.3 points\. Thus, while DFT is a strong standalone baseline, it is less compatible with subsequent training and yields limited additive gains\.
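The Δ values in Table 7 appear to be the change in average score relative to the cold-start checkpoint that each run starts from. That reading is an inference from the numbers rather than something stated in the caption. A short check under this assumption, with averages copied from the table and hypothetical variable names:

```python
# Average scores copied from Table 7.
COLD_START_AVG = {"SFT": 8.8, "DFT": 9.7}
AFTER_TRAINING_AVG = {
    ("SFT", "SPIN"): 9.5, ("DFT", "SPIN"): 9.1,
    ("SFT", "GKD"): 9.0,  ("DFT", "GKD"): 5.9,
    ("SFT", "DuDi"): 10.1, ("DFT", "DuDi"): 7.4,
}

def delta(init, method):
    """Change in average score relative to the cold-start checkpoint."""
    return round(AFTER_TRAINING_AVG[(init, method)] - COLD_START_AVG[init], 1)

# Assumed reading of the delta column: every SFT-initialized run improves
# on its checkpoint, while every DFT-initialized run falls below it.
deltas = {key: delta(*key) for key in AFTER_TRAINING_AVG}
```

Under this reading the computed differences reproduce the parenthesized values in the table for all six runs.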

To better understand why SFT and DFT behave differently as cold\-start checkpoints, we compare the top\-k overlap ratio of SFT and DFT as the cold\-start initializations for DuDi\. As illustrated in Figure [7](https://arxiv.org/html/2606.04694#A6.F7), the DFT cold\-start consistently exhibits a substantially lower overlap ratio than the SFT cold\-start throughout training across methods\. According to Li et al\. \([2026](https://arxiv.org/html/2606.04694#bib.bib9)\), a low overlap ratio is indicative of degraded performance, suggesting that the cold\-start DFT student assigns probability mass to a token set that is largely disjoint from that of the teacher\. These findings indicate that DFT alters the token distribution too aggressively, biasing the model toward different token preferences\.
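As a rough illustration of the quantity discussed above, the sketch below computes a top-k overlap ratio between student and teacher logits. It assumes the ratio is the size of the intersection of the two top-k token sets divided by k, averaged over token positions. The exact definition used in the paper is not given in this section, so the formula and the function names are assumptions.

```python
def top_k_indices(logits, k):
    """Indices of the k largest logits at one token position."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])

def top_k_overlap_ratio(student_logits, teacher_logits, k):
    """Mean over positions of |topk(student) & topk(teacher)| / k.

    Both arguments are lists of per-position logit lists over one vocabulary.
    """
    ratios = [
        len(top_k_indices(s, k) & top_k_indices(t, k)) / k
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(ratios) / len(ratios)

# Toy example with a 4-token vocabulary and a single position.
student = [[3.0, 2.0, 1.0, 0.0]]   # top-2 tokens: {0, 1}
teacher = [[3.0, 0.0, 2.0, 1.0]]   # top-2 tokens: {0, 2}
ratio = top_k_overlap_ratio(student, teacher, k=2)
```

A ratio near 1 means the student and teacher concentrate probability on the same candidate tokens, while a ratio near 0 corresponds to the largely disjoint token sets described for the DFT cold-start.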

![Refer to caption](https://arxiv.org/html/2606.04694v1/x7.png)

Figure 7: Overlap ratio between student and teacher model logits for on\-policy token\-level distillation of SFT vs DFT as a cold\-start\.

## Appendix G Additional Results

Task\-Level Performance\. Table [8](https://arxiv.org/html/2606.04694#A7.T8) reports downstream evaluation results for Qwen2\.5\-0\.5B, Qwen2\.5\-1\.5B, Qwen3\-0\.6B\-Base, and Llama\-3\.2\-1B on multiple tasks in SEA\-HELM, showing the capability of each training method\. DuDi consistently achieves either the best or second\-best performance across most tasks and model families\. Notably, several tasks, namely Natural Language Reasoning \(NLR\), Safety, Linguistic Diagnostics \(LD\), and Knowledge, exhibit near\-zero to very low absolute performance for Qwen2\.5\-0\.5B and Llama\-3\.2\-1B \(often below 3 points\), despite the teacher demonstrating some capability on these tasks\. In contrast, larger or more capable base models such as Qwen2\.5\-1\.5B and Qwen3\-0\.6B\-Base retain non\-trivial performance on these tasks\. We hypothesize that this gap arises from limited coverage of these task distributions in the training data, resulting in weak teacher–student transfer and indicating that such tasks may require stronger base model capabilities rather than distillation alone\.

Cold\-Start SFT\. Consistent with our findings in Section [7\.2](https://arxiv.org/html/2606.04694#S7.SS2) and prior studies \(Li et al\., [2026](https://arxiv.org/html/2606.04694#bib.bib9); Zhu et al\., [2026](https://arxiv.org/html/2606.04694#bib.bib42)\), Table [9](https://arxiv.org/html/2606.04694#A7.T9) highlights the importance of cold\-start SFT in knowledge distillation\. The performance of both self\-distillation methods, such as SDFT, and teacher\-distillation approaches \(SeqKD and GKD\) declines when the student model is not initialized with cold\-start SFT\.

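| Method | NLU | NLG | NLR | Safety | LD | IF | Knowledge | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2\.5\-3B\-Instruct \(Teacher\) | 32\.5 | 34\.8 | 16\.4 | 19\.8 | 13\.5 | 47\.2 | 15\.1 | 27\.7 |
| Qwen2\.5\-0\.5B \(Student\) | | | | | | | | |
| SFT | 7\.2 | 12\.5 | 0\.4 | 0\.2 | 0\.0 | 28\.8 | 2\.2 | 8\.6 |
| DFT | 7\.3 | 9\.5 | 0\.2 | 0\.0 | 0\.4 | 38\.5 | 0\.9 | 9\.7 |
| SPIN | 7\.5 | 13\.9 | 1\.1 | 0\.4 | 0\.5 | 28\.6 | 5\.2 | 9\.5 |
| SDFT | 1\.1 | 3\.8 | 0\.0 | 0\.0 | 0\.0 | 16\.6 | 0\.4 | 3\.6 |
| SeqKD | 3\.2 | 9\.5 | 0\.0 | 0\.0 | 0\.0 | 22\.4 | 1\.3 | 6\.1 |
| GKD | 8\.1 | 15\.6 | 0\.3 | 0\.3 | 0\.0 | 28\.1 | 1\.4 | 9\.0 |
| DuDi \(ours\) | 9\.0 | 15\.8 | 0\.6 | 0\.1 | 0\.2 | 33\.6 | 1\.4 | 10\.1 |
| Qwen2\.5\-3B\-Instruct \(Teacher\) | 32\.5 | 34\.8 | 16\.4 | 19\.8 | 13\.5 | 47\.2 | 15\.1 | 27\.7 |
| Qwen2\.5\-1\.5B \(Student\) | | | | | | | | |
| SFT | 22\.0 | 26\.3 | 6\.2 | 3\.6 | 0\.6 | 40\.7 | 13\.3 | 18\.6 |
| DFT | 17\.4 | 9\.9 | 11\.9 | 4\.5 | 4\.1 | 35\.2 | 15\.4 | 15\.6 |
| SPIN | 20\.4 | 23\.7 | 8\.0 | 5\.1 | 0\.4 | 38\.8 | 14\.6 | 18\.3 |
| GKD | 24\.9 | 22\.0 | 7\.8 | 7\.9 | 2\.3 | 41\.3 | 10\.5 | 19\.1 |
| DuDi | 26\.5 | 23\.7 | 6\.1 | 8\.8 | 2\.5 | 44\.1 | 10\.2 | 20\.1 |
| Qwen3\-4B \(Teacher\) | 58\.6 | 25\.2 | 47\.3 | 37\.9 | 36\.2 | 68\.3 | 33\.1 | 45\.6 |
| Qwen3\-0\.6B\-Base \(Student\) | | | | | | | | |
| SFT | 16\.6 | 22\.8 | 4\.5 | 0\.7 | 0\.0 | 35\.6 | 2\.2 | 13\.9 |
| DFT | 20\.7 | 8\.0 | 7\.6 | 2\.0 | 1\.2 | 40\.6 | 12\.6 | 15\.6 |
| SPIN | 20\.0 | 18\.6 | 5\.6 | 0\.9 | 0\.2 | 36\.3 | 2\.3 | 14\.3 |
| GKD | 24\.6 | 26\.2 | 8\.1 | 3\.0 | 0\.0 | 41\.5 | 1\.9 | 18\.0 |
| DuDi | 28\.5 | 26\.3 | 14\.0 | 3\.7 | 2\.8 | 44\.7 | 5\.5 | 20\.8 |
| Llama\-3\.2\-3B\-Instruct \(Teacher\) | 34\.2 | 44\.2 | 7\.9 | 15\.8 | 1\.0 | 47\.2 | 16\.0 | 27\.7 |
| Llama\-3\.2\-1B \(Student\) | | | | | | | | |
| SFT | 1\.5 | 6\.2 | 0\.0 | 0\.0 | 0\.0 | 20\.0 | 0\.3 | 4\.7 |
| DFT | 0\.0 | 3\.3 | 0\.0 | 0\.0 | 0\.0 | 0\.0 | 0\.0 | 0\.6 |
| SPIN | 0\.0 | 21\.7 | 0\.0 | 0\.0 | 0\.0 | 0\.0 | 0\.0 | 3\.7 |
| GKD | 9\.3 | 23\.1 | 0\.4 | 0\.5 | 0\.0 | 28\.1 | 2\.2 | 10\.6 |
| DuDi | 9\.7 | 24\.0 | 0\.4 | 2\.9 | 0\.4 | 30\.0 | 1\.2 | 11\.4 |

Table 8: Task evaluations for Qwen2\.5\-0\.5B, Qwen2\.5\-1\.5B, Qwen3\-0\.6B\-Base, and Llama\-3\.2\-1B\.

| Method | Indonesian | Vietnamese | Thai | Tamil | Tagalog | Malay | Burmese | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SDFT | 4\.2 | 4\.6 | 4\.7 | 2\.3 | 3\.3 | 5\.1 | 1\.3 | 3\.6 |
| SDFT w/o SFT | 0\.7 | 0\.7 | 1\.8 | 1\.1 | 0\.6 | 0\.9 | 1\.1 | 1\.0 |
| SeqKD | 7\.1 | 9\.1 | 7\.7 | 4\.2 | 3\.4 | 8\.1 | 2\.9 | 6\.1 |
| SeqKD w/o SFT | 4\.4 | 7\.2 | 6\.1 | 3\.6 | 2\.1 | 6\.1 | 3\.2 | 4\.7 |
| GKD | 11\.7 | 13\.2 | 10\.9 | 4\.9 | 4\.8 | 13\.6 | 3\.7 | 9\.0 |
| GKD w/o SFT | 6\.5 | 7\.5 | 7\.2 | 3\.9 | 3\.5 | 7\.7 | 3\.3 | 5\.6 |
| DuDi | 11\.7 | 14\.4 | 12\.8 | 6\.5 | 6\.1 | 14\.8 | 4\.6 | 10\.1 |
| DuDi w/o SFT | 9\.4 | 13\.7 | 9\.7 | 7\.2 | 5\.3 | 10\.8 | 4\.1 | 8\.6 |

Table 9: Results of comparative knowledge distillation methods without cold\-start SFT\.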