Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

arXiv cs.CL 06/08/26, 04:00 AM Papers

Summary

The paper introduces OPDLM, a method that converts autoregressive language models into diffusion language models via on-policy distillation. It requires 15x to 7,000x fewer training tokens while retaining knowledge from the original model.

arXiv:2606.06712v1

Abstract: We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.
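
To make the self-OPD loop in the abstract more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the abstract does not specify the loss, the unmasking schedule, or how the causal teacher scores partially masked states. The sketch assumes a forward KL divergence from teacher to student, a fixed number of positions unmasked per step, and a teacher that scores the completed sequence. The names `student_probs`, `teacher_probs`, `rollout`, and `opd_loss` are hypothetical, and both "models" are hand-written stand-ins rather than neural networks.

```python
import math
import random

V = 4       # toy vocabulary size (assumption)
MASK = -1   # id marking a still-masked position


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def student_probs(state, table):
    """Toy bidirectional student: position i sees both of its neighbours."""
    out = []
    for i in range(len(state)):
        logits = list(table[i])
        for j in (i - 1, i + 1):
            if 0 <= j < len(state) and state[j] != MASK:
                logits[(state[j] + 1) % V] += 0.5  # arbitrary context effect
        out.append(softmax(logits))
    return out


def teacher_probs(tokens):
    """Toy frozen causal teacher: position i only depends on tokens[:i]."""
    out = []
    for i in range(len(tokens)):
        logits = [0.0] * V
        if i > 0:
            logits[(tokens[i - 1] + 1) % V] = 2.0
        out.append(softmax(logits))
    return out


def rollout(table, length, per_step, rng):
    """Confidence-based decoding; returns the visited states and final tokens."""
    state = [MASK] * length
    visited = []
    while MASK in state:
        visited.append(list(state))
        probs = student_probs(state, table)
        masked = [i for i, t in enumerate(state) if t == MASK]
        # Unmask the positions where the student is currently most confident.
        masked.sort(key=lambda i: max(probs[i]), reverse=True)
        for i in masked[:per_step]:
            state[i] = rng.choices(range(V), weights=probs[i])[0]
    return visited, state


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def opd_loss(table, length=6, per_step=2, seed=0):
    """Mean KL(teacher || student) over masked positions of on-policy states."""
    rng = random.Random(seed)
    visited, final = rollout(table, length, per_step, rng)  # student's own trajectory
    targets = teacher_probs(final)  # assumption: teacher scores the finished sequence
    total, count = 0.0, 0
    for state in visited:
        probs = student_probs(state, table)
        for i, tok in enumerate(state):
            if tok == MASK:
                total += kl(targets[i], probs[i])
                count += 1
    return total / count


table = [[0.0] * V for _ in range(6)]  # untrained toy student logits
print(opd_loss(table))
```

The point of the sketch is where the loss is evaluated: on the partially masked states the student actually visits during confidence-based decoding, not on randomly masked sequences. In a real setup the student's parameters would be updated by gradient descent on this quantity while the teacher stays frozen.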

Cached at: 06/08/26, 09:20 AM

# Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation
Source: [https://arxiv.org/abs/2606.06712](https://arxiv.org/abs/2606.06712)
[View PDF](https://arxiv.org/pdf/2606.06712)

## Submission history

From: Xingyu Su
**[v1]** Thu, 4 Jun 2026 20:58:08 UTC (1,688 KB)

Similar Articles

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

On the Geometry of On-Policy Distillation

Draft-OPD: On-Policy Distillation for Speculative Draft Models
