The Role of Feedback Alignment in Self-Distillation
Summary
This paper studies context design for self-distillation in language models, finding that step-aligned critique feedback significantly outperforms binary reward or reference solution conditioning, because it targets only erroneous tokens while preserving correct behavior.
Cached at: 06/10/26, 05:46 PM
Source: https://huggingface.co/papers/2606.11173
Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model’s output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored.
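The distribution-matching idea above can be illustrated with a toy sketch. It is not the paper's implementation: the abstract does not state which divergence or direction is used, so the per-token KL(teacher || student) here is an assumption, and the function names and logits are hypothetical. In practice both sets of logits would come from the same model, once with the question only (student) and once with the question plus context (self-teacher), scored on the same response tokens.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student) over the response tokens.

    Assumption: the KL direction and the mean reduction are illustrative
    choices, not taken from the paper.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)

# Hypothetical logits for a 2-token response over a 3-word vocabulary.
student = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2]]   # question only
teacher = [[2.0, 0.5, 0.1], [0.3, 0.2, 1.8]]   # question + context
loss = self_distillation_loss(student, teacher)
```

In this toy case the first token contributes zero loss because student and self-teacher already agree there; only the second token, where the context changes the teacher's distribution, produces a training signal.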
We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver’s reasoning trace.
Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver’s reasoning is a key driver of self-distillation effectiveness.
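The per-token argument can be made concrete with a small sketch. The abstract does not define the advantage, so this assumes a common formulation: the log-probability the self-teacher assigns to each token of the solver's response minus the log-probability the student assigns. All numbers are made up for illustration and are not results from the paper; `pressured_fraction` and its threshold are hypothetical helpers.

```python
def per_token_advantage(teacher_logprobs, student_logprobs):
    """Assumed definition: teacher log-prob minus student log-prob per token."""
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]

def pressured_fraction(advantages, threshold=0.1):
    """Share of tokens whose advantage magnitude exceeds the threshold."""
    return sum(abs(a) > threshold for a in advantages) / len(advantages)

# Made-up log-probs for a 5-token response whose third token is the error.
student_lp = [-0.2, -0.3, -0.1, -0.4, -0.2]

# Step-aligned critique: the teacher disagrees only at the erroneous token.
critique_teacher_lp = [-0.2, -0.3, -2.5, -0.4, -0.2]

# Reference solution: the teacher prefers different phrasing at every token.
reference_teacher_lp = [-1.0, -1.2, -2.5, -0.9, -1.1]

critique_adv = per_token_advantage(critique_teacher_lp, student_lp)
reference_adv = per_token_advantage(reference_teacher_lp, student_lp)

critique_share = pressured_fraction(critique_adv)
reference_share = pressured_fraction(reference_adv)
```

Under these toy numbers the critique-conditioned teacher pushes on one token in five, while the reference-conditioned teacher pushes on all five, including the four that were already correct. That is the qualitative pattern the abstract describes.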
Similar Articles
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
The paper introduces Reflection-Enhanced Self-Distillation (Resd), a framework that transforms failure feedback into corrective supervision for LLMs, enabling efficient learning from rare successes. It outperforms standard self-distillation baselines and achieves faster early improvement than GRPO with fewer samples.
@sheriyuo: Qwen Tongyi Lab proposes RLCSD, a simple but important critique of on-policy self-distillation. Their key observation i…
Qwen Tongyi Lab proposes RLCSD to address the style drift problem in on-policy self-distillation, where the learning signal focuses on style tokens rather than task-critical reasoning tokens. Their method uses contrastive supervision to focus on task-relevant tokens, achieving consistent improvements over prior methods on reasoning benchmarks.
EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs
EchoDistill is an alignment-based noisy-to-clean self-distillation framework that improves the robustness of Audio Large Language Models (ALLMs) against real-world noise by using a frozen clean-audio teacher to guide the student model via group-relative policy optimization (GRPO). Experiments show significant improvements in semantic reliability and task performance under strong noise without additional inference costs.
Self-Distillation Enables Continual Learning [pdf]
Introduces Self-Distillation Fine-Tuning (SDFT), a method that enables on-policy learning from demonstrations to achieve continual learning without catastrophic forgetting, outperforming supervised fine-tuning.
Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation
This paper proposes Rubric-Conditioned Self-Distillation (RCSD), a framework that uses fine-grained rubric criteria to provide token-level guidance during self-distillation, improving reasoning performance over scalar-reward methods like GRPO and OPSD.