HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

Hugging Face Daily Papers 05/18/26, 12:00 AM Papers

reinforcement-learning self-distillation llm-agents long-horizon hindsight feedback-conditioned

Summary

HINT-SD proposes a targeted self-distillation framework that selects failure-relevant actions from full trajectories to improve long-horizon LLM agent training, achieving up to 18.80% improvement and 2.26× speedup over dense feedback baselines.

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80 percent while achieving 2.26 times lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
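The idea of distilling only on targeted action spans can be illustrated with a toy loss. The sketch below is not the paper's implementation: the `Turn` structure, the `targeted_distill_loss` function, the choice of KL(teacher || student) as the divergence, and the hand-picked `targeted` set (standing in for whatever hindsight selector the authors use) are all assumptions made for illustration. It treats the "teacher" as the same model conditioned on hindsight feedback and the "student" as the model without it.

```python
import math
from dataclasses import dataclass
from typing import List, Set

Dist = List[float]  # probability distribution over a toy vocabulary


@dataclass
class Turn:
    # Hypothetical container: per-token distributions for one agent turn.
    student: List[Dist]  # policy without feedback in context
    teacher: List[Dist]  # same policy conditioned on hindsight feedback (assumed)


def kl(p: Dist, q: Dist) -> float:
    """KL(p || q); assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def targeted_distill_loss(turns: List[Turn], targeted: Set[int]) -> float:
    """Mean per-token KL(teacher || student) over targeted turns only."""
    total, count = 0.0, 0
    for idx, turn in enumerate(turns):
        if idx not in targeted:
            continue  # untargeted turns contribute no distillation signal
        for t_dist, s_dist in zip(turn.teacher, turn.student):
            total += kl(t_dist, s_dist)
            count += 1
    return total / count if count else 0.0


# Toy trajectory: only turn 1 changes once feedback is in context.
same = [[0.5, 0.5]]
turns = [
    Turn(student=same, teacher=same),
    Turn(student=[[0.9, 0.1]], teacher=[[0.2, 0.8]]),
    Turn(student=same, teacher=same),
]
targeted = {1}  # stand-in for the output of a hindsight selector

loss_targeted = targeted_distill_loss(turns, targeted)
loss_dense = targeted_distill_loss(turns, {0, 1, 2})
```

In this toy case the dense variant averages the same signal over three turns, two of which carry none, so `loss_targeted` is three times `loss_dense`. A dense per-turn method would also need a feedback-conditioned teacher pass for every turn, whereas the targeted variant only needs one for the selected spans.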

Original Article

Cached at: 05/25/26, 06:36 AM

Source: https://huggingface.co/papers/2605.17873

Abstract

HINT-SD is a targeted self-distillation framework that selects failure-relevant actions from full trajectories to improve long-horizon LLM agent training efficiency and effectiveness.

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80 percent while achieving 2.26 times lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
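As a quick reading of the reported speedup, assume that "2.26 times lower time per training step" means the ratio of the dense baseline's step time to HINT-SD's step time (the abstract does not spell this out). The symbols T_dense and T_HINT are introduced here only for illustration:

```latex
\frac{T_{\text{dense}}}{T_{\text{HINT}}} = 2.26
\quad\Longrightarrow\quad
T_{\text{HINT}} = \frac{T_{\text{dense}}}{2.26} \approx 0.442\, T_{\text{dense}},
\qquad
1 - \frac{1}{2.26} \approx 0.558
```

Under that reading, each training step takes roughly 56% less time than a step of the dense per-turn feedback baseline.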

Get this paper in your agent:

hf papers read 2605.17873

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

H^2SD: Hybrid Hindsight Self-Distillation

HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

Learning More from Less: Reinforcement Learning from Hindsight

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
