Distribution Corrected Offline Data Distillation for Large Language Models
Summary
This paper proposes a principled offline reasoning distillation framework that corrects teacher-student distribution drift, improving reasoning accuracy on math benchmarks without requiring online rollouts.
Source: [https://arxiv.org/abs/2605.14071](https://arxiv.org/abs/2605.14071) [View PDF](https://arxiv.org/pdf/2605.14071)

> Abstract: Distilling reasoning traces from strong large language models into smaller ones is a promising route to improve intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off: offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference the student autoregresses on self-generated prefixes, leading to compounding errors over long reasoning trajectories. Meanwhile, on-policy or self-distillation methods better match the student's inference-time distribution, but require costly online sampling and often produce low-quality traces in early training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution. Evaluations on mathematical reasoning benchmarks of GSM8K, MATH, MATH500, and harder held-out competition-style tasks, including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.

## Submission history

From: Yumeng Zhang

**[v1]** Wed, 13 May 2026 19:47:31 UTC (1,102 KB)
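
The abstract says the method "adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution" but does not give the weighting rule. The sketch below is not the paper's algorithm. It illustrates one generic way such emphasis could work, assuming each teacher trace is weighted by a softmax over the student's mean per-token log-probability of that trace, with the weights held constant when computing the loss. The names `sequence_weights`, `weighted_nll`, and `temperature`, and the input format, are hypothetical.

```python
import math


def sequence_weights(student_logprobs, temperature=1.0):
    """Softmax over the student's mean per-token log-probability of each teacher trace.

    student_logprobs: list of lists; student_logprobs[i][t] is the student's
    log-probability of token t of teacher trace i (hypothetical input format).
    """
    scores = [sum(lp) / len(lp) / temperature for lp in student_logprobs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_nll(student_logprobs, weights):
    """Weighted sum of per-trace mean token NLL, with weights held constant."""
    return sum(-w * sum(lp) / len(lp) for w, lp in zip(weights, student_logprobs))


# Two toy teacher traces: the student already finds the first one likely,
# the second one unlikely (far from what the student would generate itself).
traces = [[-0.1, -0.2, -0.3], [-2.0, -3.0]]
weights = sequence_weights(traces)
loss = weighted_nll(traces, weights)
uniform_loss = weighted_nll(traces, [0.5, 0.5])
print(weights, loss, uniform_loss)
```

In this toy setup the first trace receives most of the weight, so the weighted loss is lower than the uniformly weighted one. Raising `temperature` flattens the weights toward plain offline distillation, which makes it an assumed knob for trading supervision coverage against alignment with the student's distribution.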
Similar Articles
Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
This paper proposes methods for protecting large language models against unauthorized knowledge distillation by rewriting reasoning traces to degrade training usefulness while preserving correctness, and embedding verifiable watermarks in distilled student models. The approach uses instruction-based and gradient-based rewriting techniques to achieve anti-distillation effects without compromising teacher model performance.
The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes
This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.
OPRD: On-Policy Representation Distillation
OPRD proposes a new knowledge distillation method that aligns student and teacher hidden states across layers during on-policy rollouts, eliminating sampling variance from token-space KL estimation. Empirically, OPRD outperforms output-space baselines on math reasoning benchmarks (AIME 2024/2025, AIMO) while being 1.44x faster and using 54% less memory.
Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation
This paper introduces Motab, a new pipeline for LLM reasoning distillation that mitigates both off-policy and on-policy exposure biases by dynamically monitoring student generation and backtracking to safe states with teacher intervention, achieving ~3% average improvement.
Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information
This paper proposes a novel Chain-of-Thought distillation framework that transfers teacher models' stepwise attention on key information to student models through a Mixture-of-Layers module for dynamic layer alignment. The method achieves consistent performance improvements on mathematical and commonsense reasoning benchmarks by explicitly guiding student models to progressively focus on critical information during reasoning.