Distribution Corrected Offline Data Distillation for Large Language Models

arXiv cs.CL 05/15/26, 04:00 AM Papers

Summary

This paper proposes a principled offline reasoning distillation framework that corrects teacher-student distribution drift, improving reasoning accuracy on math benchmarks without requiring online rollouts.

arXiv:2605.14071v1 Announce Type: new Abstract: Distilling reasoning traces from strong large language models into smaller ones is a promising route to improve intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off. Offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference it autoregresses on self-generated prefixes, leading to compounding errors over long reasoning trajectories. Meanwhile, on-policy or self-distillation methods better match the student's inference-time distribution but require costly online sampling and often produce low-quality traces in early training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution. Evaluations on the mathematical reasoning benchmarks GSM8K, MATH, and MATH500, and on harder held-out competition-style tasks, including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.

Original Article

Cached at: 05/15/26, 06:19 AM

# Distribution Corrected Offline Data Distillation for Large Language Models
Source: [https://arxiv.org/abs/2605.14071](https://arxiv.org/abs/2605.14071)
[View PDF](https://arxiv.org/pdf/2605.14071)

> Abstract: Distilling reasoning traces from strong large language models into smaller ones is a promising route to improve intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off. Offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference it autoregresses on self-generated prefixes, leading to compounding errors over long reasoning trajectories. Meanwhile, on-policy or self-distillation methods better match the student's inference-time distribution but require costly online sampling and often produce low-quality traces in early training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution. Evaluations on the mathematical reasoning benchmarks GSM8K, MATH, and MATH500, and on harder held-out competition-style tasks, including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.
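The abstract does not spell out how teacher supervision is re-weighted, so the following is only an illustrative sketch of the general idea, not the paper's algorithm. It assumes a hypothetical per-token weight proportional to the student's own probability of the teacher's token (raised to a power `alpha`), normalized to mean 1 and clipped. The names `alignment_weights`, `weighted_distill_loss`, `alpha`, `w_min`, and `w_max` are all invented for this example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def alignment_weights(token_logps, alpha=1.0, w_min=0.1, w_max=10.0):
    """Hypothetical weighting (an assumption, not from the paper).

    Teacher tokens that the student already finds likely get larger
    weights. Weights are normalized to mean 1, then clipped to
    [w_min, w_max] to limit variance.
    """
    raw = [math.exp(alpha * lp) for lp in token_logps]
    mean_raw = sum(raw) / len(raw)
    return [min(max(r / mean_raw, w_min), w_max) for r in raw]

def weighted_distill_loss(student_logits, teacher_tokens, alpha=1.0):
    """Weighted negative log-likelihood of a teacher trace.

    student_logits: one list of vocabulary logits per position, computed
                    by the student on teacher-generated prefixes.
    teacher_tokens: the teacher's token id at each position.
    In a real training loop the weights would be treated as constants
    (no gradient flows through them).
    """
    token_logps = [
        log_softmax(row)[tok]
        for row, tok in zip(student_logits, teacher_tokens)
    ]
    weights = alignment_weights(token_logps, alpha)
    loss = -sum(w * lp for w, lp in zip(weights, token_logps)) / sum(weights)
    return loss, weights

# Toy trace: the student agrees with the teacher at position 0
# and disagrees at position 1, so position 0 is emphasized.
logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
teacher = [0, 0]
loss, weights = weighted_distill_loss(logits, teacher)
```

With `alpha=0.0` every weight is 1 and the loss reduces to the ordinary mean token-level cross-entropy used in plain offline distillation, which makes the weighting easy to ablate. Only offline teacher traces and student forward passes are needed, so no online rollouts are involved in this sketch.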

## Submission history

From: Yumeng Zhang

**[v1]** Wed, 13 May 2026 19:47:31 UTC (1,102 KB)

Similar Articles

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

OPRD: On-Policy Representation Distillation

Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation

Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information
