Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation
Summary
The paper introduces OPDLM, a method that converts autoregressive language models (ARLMs) into diffusion language models (DLMs) via on-policy distillation. It requires 15x to 7,000x fewer training tokens while retaining knowledge from the original model.
# Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Source: [https://arxiv.org/abs/2606.06712](https://arxiv.org/abs/2606.06712) [View PDF](https://arxiv.org/pdf/2606.06712)

> Abstract: We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.

## Submission history

From: Xingyu Su

**[v1]** Thu, 4 Jun 2026 20:58:08 UTC (1,688 KB)
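The self-OPD loop described in the abstract (the student decodes its own trajectory with confidence-based unmasking, and the frozen causal teacher supplies target distributions on that trajectory) can be illustrated with a toy sketch. Nothing here is taken from the paper's implementation. The names `student_probs`, `teacher_probs`, `confidence_decode`, and `self_opd_loss` are hypothetical, and so are the `MASK` id, the toy logit rules, the one-token-per-step schedule, the reverse-KL direction, and the choice to condition the teacher on the left prefix of the completed sequence. A real setup would replace the toy functions with neural networks and backpropagate the loss into the student.

```python
import math

MASK = -1   # hypothetical mask token id (assumption)
VOCAB = 4   # toy vocabulary size (assumption)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def student_probs(seq, pos):
    """Toy stand-in for the bidirectional student: looks left and right."""
    left = seq[pos - 1] if pos > 0 else MASK
    right = seq[pos + 1] if pos + 1 < len(seq) else MASK
    logits = [0.0] * VOCAB
    for neighbor, weight in ((left, 1.0), (right, 0.5)):
        if neighbor != MASK:
            logits[(neighbor + 1) % VOCAB] += weight
    return softmax(logits)

def teacher_probs(prefix):
    """Toy stand-in for the frozen causal teacher: looks left only."""
    logits = [0.0] * VOCAB
    target = (prefix[-1] + 1) % VOCAB if prefix else 0
    logits[target] += 2.0
    return softmax(logits)

def confidence_decode(prompt, gen_len):
    """Unmask one position per step: the one with the highest top probability."""
    seq = list(prompt) + [MASK] * gen_len
    trajectory = []
    while MASK in seq:
        best = None
        for pos, tok in enumerate(seq):
            if tok != MASK:
                continue
            probs = student_probs(seq, pos)
            conf = max(probs)
            if best is None or conf > best[0]:
                best = (conf, pos, probs)
        conf, pos, probs = best
        # Record the state the student actually visited (the on-policy part).
        trajectory.append((list(seq), pos, probs))
        seq[pos] = probs.index(conf)
    return seq, trajectory

def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_opd_loss(prompt, gen_len):
    """Average KL(student || teacher) over the student's own decoding steps."""
    final_seq, trajectory = confidence_decode(prompt, gen_len)
    losses = []
    for _state, pos, s_probs in trajectory:
        # Assumption: the teacher scores each position from the left prefix
        # of the completed student sequence.
        t_probs = teacher_probs(final_seq[:pos])
        losses.append(kl(s_probs, t_probs))
    return final_seq, sum(losses) / len(losses)

final_seq, loss = self_opd_loss(prompt=[0, 1], gen_len=3)
print(final_seq, round(loss, 4))
```

The contrast with standard DLM training lies in where the training states come from: a randomly masked sequence would be sampled independently of the student, whereas `trajectory` here contains only the partially masked states that the student's own confidence-based decoding produces.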
Similar Articles
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.
DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models
DiffusionOPD proposes a multi-task training paradigm for diffusion models that uses on-policy distillation to efficiently combine task-specific teachers into a unified student, achieving state-of-the-art results on all evaluated benchmarks.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
This paper introduces D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy self-distillation during supervised fine-tuning. It allows models to learn new concepts or styles without compromising their efficient few-step inference capabilities.
On the Geometry of On-Policy Distillation
This paper characterizes the unique parameter space dynamics of on-policy distillation (OPD) for large language models, showing that it exhibits relaxed off-principal updates and subspace locking, distinguishing it from supervised fine-tuning and reinforcement learning with verifiable rewards.
Draft-OPD: On-Policy Distillation for Speculative Draft Models
Draft-OPD introduces on-policy distillation with target-assisted rollouts and error replay to overcome the offline-to-inference mismatch in training draft models for speculative decoding, achieving over 5x lossless acceleration and improving upon EAGLE-3 and DFlash by 23% and 13% respectively.