MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
Summary
MixSD proposes a self-distillation method for knowledge injection in language models that aligns supervision with the model's native distribution, reducing catastrophic forgetting during fine-tuning. It achieves near-perfect memorization while retaining up to 100% of base capabilities, vastly outperforming standard SFT.
Cached at: 05/19/26, 06:30 AM
Source: https://huggingface.co/papers/2605.16865
Abstract
MixSD addresses knowledge injection in language models by aligning supervision with the model’s native generation distribution, reducing catastrophic forgetting during fine-tuning.
Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model’s autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model’s original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model’s distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off compared to SFT and on-policy self-distillation baselines, retaining up to 100% of the base model’s held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model’s native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.
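The idea of mixing tokens from an expert and a naive conditional can be illustrated with a toy sketch. The abstract does not state the actual mixing rule, so everything below is an assumption made for illustration: the per-position probability tables, the mixing probability `lam`, the greedy token choice, and the function names `mix_targets` and `naive_nll` are all hypothetical. In a real setup both conditionals would come from the same base model, queried with and without the injected fact in context.

```python
import math
import random

# Toy per-position next-token tables (hypothetical numbers, not from the paper).
# EXPERT stands in for the base model with the injected fact in context,
# NAIVE for the base model's original prior without it.
EXPERT = [
    {"The": 0.7, "It": 0.3},
    {"capital": 0.8, "city": 0.2},
    {"is": 0.9, "was": 0.1},
    {"Zorbia": 0.95, "Paris": 0.05},
]
NAIVE = [
    {"The": 0.4, "It": 0.6},
    {"capital": 0.3, "city": 0.7},
    {"is": 0.6, "was": 0.4},
    {"Zorbia": 0.01, "Paris": 0.99},
]


def greedy(dist):
    """Return the most probable token of a distribution."""
    return max(dist, key=dist.get)


def mix_targets(lam, seed=0):
    """Build a supervision sequence token by token.

    Assumed rule: at each position, take the expert's greedy token with
    probability `lam`, otherwise the naive conditional's greedy token.
    """
    rng = random.Random(seed)
    tokens, sources = [], []
    for expert_dist, naive_dist in zip(EXPERT, NAIVE):
        use_expert = rng.random() < lam
        tokens.append(greedy(expert_dist if use_expert else naive_dist))
        sources.append("expert" if use_expert else "naive")
    return tokens, sources


def naive_nll(tokens):
    """Negative log-likelihood of a target sequence under the naive prior."""
    return sum(-math.log(dist[tok]) for dist, tok in zip(NAIVE, tokens))


if __name__ == "__main__":
    for lam in (0.0, 0.5, 1.0):
        tokens, sources = mix_targets(lam)
        print(lam, tokens, sources, round(naive_nll(tokens), 3))
```

In this toy, the mixed sequence can never have a higher NLL under the naive prior than the expert-only sequence, which mirrors the abstract's claim about lower-NLL supervision targets. Unlike the method described in the abstract, this simplified rule does not guarantee that the fact-bearing token survives the mixing.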
Similar Articles
Self-Distillation Enables Continual Learning [pdf]
Introduces Self-Distillation Fine-Tuning (SDFT), a method that enables on-policy learning from demonstrations to achieve continual learning without catastrophic forgetting, outperforming supervised fine-tuning.
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
This paper introduces UniSD, a unified self-distillation framework for adapting large language models that integrates mechanisms for supervision reliability, representation alignment, and training stability. Experimental results show that UniSD improves performance over base models and existing baselines across multiple benchmarks.
Self-Distillation as a Performance Recovery Mechanism for LLMs: Counteracting Compression and Catastrophic Forgetting
This paper introduces Self-Distillation Fine-Tuning (SDFT) as a recovery mechanism for LLMs suffering from performance degradation due to catastrophic forgetting, quantization, and pruning. The authors provide theoretical justification using Centered Kernel Alignment (CKA) to demonstrate that self-distillation aligns the student model's high-dimensional manifold with the teacher's optimal structure, effectively recovering lost capabilities.
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Proposes Anti-Self-Distillation (AntiSD) which reverses the knowledge transfer direction in self-distillation to improve math reasoning efficiency and accuracy, achieving GRPO baseline accuracy in 2-10x fewer steps and up to 11.5 points higher final accuracy across models from 4B to 30B parameters.
Self-Improving In-Context Learning
This paper proposes a method to improve in-context learning by optimizing the continuous embeddings of a fixed few-shot prompt at test time, using a self-supervised confidence proxy derived from the model's log-probabilities without requiring fine-tuning or token generation.