HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents
Summary
HINT-SD proposes a targeted self-distillation framework that selects failure-relevant actions from full trajectories to improve long-horizon LLM agent training, achieving up to 18.80% improvement and 2.26× speedup over dense feedback baselines.
Cached at: 05/25/26, 06:36 AM
Source: https://huggingface.co/papers/2605.17873
Abstract
HINT-SD is a targeted self-distillation framework that selects failure-relevant actions from full trajectories to improve long-horizon LLM agent training efficiency and effectiveness.
Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80% while achieving 2.26× lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
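The idea of distilling only on targeted action spans can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the loss, so the KL direction, the mean reduction, and every name here (`span_mask`, `targeted_distill_loss`, the example span indices) are assumptions made for illustration. The sketch takes per-token distributions from a feedback-conditioned teacher and from the student, plus the token spans that a hindsight pass is assumed to have flagged as failure-relevant, and averages the divergence over those spans only.

```python
import math
from typing import List, Sequence, Tuple

# Half-open [start, end) token indices of one selected action (hypothetical format).
Span = Tuple[int, int]
Dist = Sequence[float]


def span_mask(num_tokens: int, spans: Sequence[Span]) -> List[int]:
    """Return a 0/1 mask that is 1 only inside the selected action spans."""
    mask = [0] * num_tokens
    for start, end in spans:
        for i in range(max(0, start), min(end, num_tokens)):
            mask[i] = 1
    return mask


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def targeted_distill_loss(
    teacher: Sequence[Dist], student: Sequence[Dist], mask: Sequence[int]
) -> float:
    """Mean per-token KL(teacher || student) over masked positions only.

    Assumption: tokens outside the selected spans contribute nothing.
    """
    selected = [
        kl_divergence(t, s) for t, s, m in zip(teacher, student, mask) if m
    ]
    return sum(selected) / len(selected) if selected else 0.0


if __name__ == "__main__":
    # Toy trajectory of 4 tokens over a 2-word vocabulary.
    student = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
    teacher = [[0.5, 0.5], [0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
    # Pretend hindsight selected tokens 1..2 as the failure-relevant action.
    mask = span_mask(len(student), [(1, 3)])
    print(mask, targeted_distill_loss(teacher, student, mask))
```

In this toy setup the teacher also disagrees with the student at token 3, but that position lies outside the selected span and is ignored. A dense per-turn variant would correspond to a mask of all ones.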
Similar Articles
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
This paper presents the first systematic study of credit assignment in multi-turn LLM agents, introducing SERL, a selective environment-reweighted learning framework. SERL uses environment feedback to sharpen the RL objective on causally relevant actions, achieving 90.0% and 80.1% success rates on ALFWorld and WebShop respectively.
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
The paper introduces Reflection-Enhanced Self-Distillation (Resd), a framework that transforms failure feedback into corrective supervision for LLMs, enabling efficient learning from rare successes. It outperforms standard self-distillation baselines and achieves faster early improvement than GRPO with fewer samples.
Self-Distillation Enables Continual Learning [pdf]
Introduces Self-Distillation Fine-Tuning (SDFT), a method that enables on-policy learning from demonstrations to achieve continual learning without catastrophic forgetting, outperforming supervised fine-tuning.
Self-Distilled Agentic Reinforcement Learning
SDAR enhances multi-turn agent training by integrating self-distillation with a sigmoid gate to selectively strengthen positive token-level guidance while mitigating negative teacher rejections, achieving significant improvements over GRPO across multiple benchmarks.
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Proposes Anti-Self-Distillation (AntiSD) which reverses the knowledge transfer direction in self-distillation to improve math reasoning efficiency and accuracy, achieving GRPO baseline accuracy in 2-10x fewer steps and up to 11.5 points higher final accuracy across models from 4B to 30B parameters.