What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

arXiv cs.AI 05/20/26, 04:00 AM Papers
reinforcement-learning llm-agents credit-assignment multi-turn distillation alfworld webshop
Summary
This paper presents the first systematic study of credit assignment in multi-turn LLM agents, introducing SERL, a selective environment-reweighted learning framework. SERL uses environment feedback to sharpen the RL objective on causally relevant actions, achieving 90.0% and 80.1% success rates on ALFWorld and WebShop respectively.
arXiv:2605.19447v1 Announce Type: new Abstract: Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback. Multi-turn agent settings are underexplored, where feedback can include error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine update direction, while environment feedback adjusts placement and magnitude, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.
Original Article
View Cached Full Text
Cached at: 05/20/26, 08:29 AM
# What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
Source: [https://arxiv.org/html/2605.19447](https://arxiv.org/html/2605.19447)
Tianyi LyuTongji UniversityShanghai AI LaboratoryYang LiIndependentYichuan MaShanghai AI LaboratoryFudan UniversityPeiji LiShanghai AI LaboratoryFudan UniversityLinyang Li†Shanghai AI LaboratoryFudan UniversityThe Chinese University of Hong KongQipeng Guo†Shanghai AI LaboratoryFudan UniversityDahua LinShanghai AI LaboratoryThe Chinese University of Hong KongKai ChenShanghai AI LaboratoryThe Chinese University of Hong Kong

###### Abstract

Reinforcement learning can train LLM agents from sparse task rewards, but credit assignment in long\-horizon interactions remains a bottleneck: a single success\-or\-failure signal must be distributed across dozens of actions, most of which have no causal effect on the outcome\. Existing agent RL methods primarily rely on trajectory\-level rewards or learned proxy signals, without fully exploiting the per\-step feedback that environments naturally produce\. Distillation\-based approaches have been combined with RL for single\-turn reasoning, but the multi\-turn agent setting is underexplored—and fundamentally different\. In an interactive environment, feedback takes many forms: an error message after a failed action, a changed page after a click, a new observation after navigation, or even a successful reference trajectory\. Which of these signals is most useful for credit assignment, and where in a long trajectory each should be applied, remain open questions\. We conduct the first systematic study of this design space for agent RL, spanning five feedback sources and two insertion granularities\. Guided by the findings, we introduceSERL, a selective environment\-reweighted learning framework whose key principle is that*the task reward determines the update direction, while environment feedback adjusts only the placement and magnitude of that update*\.SERLuses an environment\-conditioned teacher to selectively sharpen the RL objective on the actions that matter, without the instability of unconstrained distillation\. On ALFWorld and WebShop,SERLachieves90\.0%90\.0\\%and80\.1%80\.1\\%success respectively, outperforming strong RL and distillation baselines\. Our analysis reveals that effective feedback is not simply the richest: grounded, action\-relevant signals at semantically meaningful insertion points consistently outperform indiscriminate use of longer or more privileged context\.

††∗\*Equal contribution\.†††\\daggerCorresponding authors: Linyang Li \(lilinyang@pjlab\.org\.cn\), Qipeng Guo \(guoqipeng@pjlab\.org\.cn\)††Code is at:[https://github\.com/OliverLeeXZ/SERL](https://github.com/OliverLeeXZ/SERL)## 1Introduction

Large language models are increasingly deployed as interactive agents that browse websites\[webshop\], call tools\[toolrl\], solve software engineering tasks\[swebench\], perform machine learning workflows\[mlebench,li2025opt\], and act in embodied environments\[alfworld\]\. Reinforcement learning is a natural training paradigm for these agents because many tasks provide a verifiable success signal at the end of an episode\. Yet the central difficulty is not obtaining the reward but*assigning*it: a typical ALFWorld trajectory contains a dozen actions, of which only two or three change the environment in a meaningful way, while the rest are routine navigation or formatting steps\. When GRPO\[grpo\]or similar group\-relative methods broadcast a single trajectory\-level advantage to every token, high\-leverage decisions and inert interface tokens receive the same update\. Better variance\-reduction techniques can shrink gradient noise, but they cannot identify*which*decision caused a later success or failure\.

Agent environments already contain a dense signal that, in principle, addresses this problem\. After every action the environment returns feedback—an error message, an updated page, a changed object state—that reveals the local consequence of that specific decision\. This per\-step feedback is far more informative than a single end\-of\-episode reward, which makes it a natural basis for credit assignment\. One way to exploit it is to let a teacher observe the environment’s response to each action and use its token\-level probabilities to supervise the student along the student’s own rollout\. This idea, known as on\-policy self\-distillation \(OPSD\)\[opd\], provides dense token\-level supervision but introduces a new risk: the teacher inevitably conditions on*privileged*information—post\-action feedback, future observations, or successful reference trajectories—that the student cannot access at decision time\. Imitating the teacher indiscriminately therefore risks leaking unavailable information into the training target, amplifying stylistic preferences unrelated to task success, and destabilizing learning as the student policy drifts\.

The question, then, is not*whether*to use environment feedback but*how*\. Two recent lines of work offer partial answers: SDPO\[sdpo\]adds a distillation loss to the policy objective, and RLSD\[rlsd\]uses teacher–student probability gaps to reweight the RL update\. However, both have been studied primarily in single\-turn reasoning tasks where the feedback structure is simple and uniform\. Long\-horizon agent training introduces two design axes that these methods do not address\. First,whatshould the teacher see? Environment feedback ranges from the immediate action response to the full future trajectory or a successful reference rollout; richer context gives the teacher more information but also more privilege\. Second,whereshould the feedback affect learning? It can be injected at every transition \(step\-level\) or only at semantically meaningful state changes \(anchor\-level\); the right granularity depends on how noisy and redundant the raw signal is\.

We conduct the first systematic study of this design space for multi\-turn agent RL, varying five feedback sources and two insertion granularities across ALFWorld and WebShop\. The study yields a clear finding: effective feedback is not simply the richest feedback—grounded, action\-relevant signals at semantically meaningful insertion points consistently outperform indiscriminate use of longer or more privileged context\. Guided by this finding, we proposeSERL, a selective environment\-reweighted learning framework built on an asymmetric principle:the task reward determines the update direction; environment feedback only adjusts the placement and magnitude of that update\.An environment\-conditioned teacher scores the student’s own action tokens with and without hindsight feedback; the resulting log\-probability gap is converted into a bounded, sign\-aware reweight of the GRPO advantage\. Distillation is restricted to executable action spans—reasoning and formatting tokens remain under the sole control of the reward\-driven objective—and the teacher signal is decayed over training to prevent late\-stage privileged\-information leakage\.

Our contributions are as follows:

- •We conduct the first systematic study of environment feedback for long\-horizon LLM agent RL, analyzing how the*source*and*placement*of feedback jointly affect training stability and task performance across five feedback types and two insertion granularities\.
- •Guided by the study, we proposeSERL, a GRPO\-compatible objective that converts privileged hindsight into a bounded, action\-level reweight of the policy\-gradient update, providing dense credit assignment while keeping the optimization direction anchored to the task reward\.
- •Experiments on ALFWorld and WebShop show thatSERLachieves90\.0%90\.0\\%and80\.1%80\.1\\%success respectively, outperforming strong RL and RL–distillation baselines\. Our analysis reveals that grounded, action\-relevant signals at semantically meaningful insertion points yield the strongest and most stable training\.

## 2Related Work

Reinforcement learning for long\-horizon LLM agents\.Reinforcement learning has become a central post\-training tool for LLMs, from RLHF\[rlhf\]to recent verifiable\-reward training for reasoning and tool use\[team2025kimi,guo2025deepseek,toolrl\]\. To reduce the cost of value modeling, critic\-free and group\-relative methods, such as RLOO\[rloo\], GRPO\[grpo\], DAPO\[yu2025dapo\], and GSPO\[gspo\], estimate advantages from multiple samples per query, enabling scalable training beyond PPO\[schulman2017proximal\]\. These methods have shown strong performance in mathematical\[guo2025deepseek\], logical\[xie2025logic\], and optimization reasoning\[npengine\]\.

With the increasing capability of LLMs and RL algorithms, LLM agents have shown strong potential in long\-horizon, dynamic, and open\-ended environments, including web navigation\[webshop,appworld\], embodied tasks\[alfworld\], search\[searchr1\], and software engineering\[swebench\]\. These long\-horizon tasks introduce new challenges for RL, as success often depends on multi\-turn interactions, delayed rewards, and environment\-dependent decisions\. Recent methods extend policy optimization to agentic settings by applying GRPO at the trajectory level over multi\-turn rollouts\[contextRL\]or by performing stepwise policy optimization\[wang2025ragen\]\. GIGPO\[gigpo\]and HGPO\[hgpo\]further exploit the hierarchical structure of agent trajectories, estimating advantages over actions, groups, or sub\-trajectories to improve credit assignment\. However, these RL methods still underutilize the rich feedback produced by agent environments, which can provide important signals for guiding LLM agent training\.

Credit assignment in long\-horizon LLM agentic training\.Credit assignment is a central challenge in agentic RL training\. In methods such as GRPO\[grpo\], the verifier usually provides only a sequence\-level reward, so every token in a rollout receives the same advantage, regardless of whether it reflects a key decision or a stylistic filler\. This is especially coarse for LLM agents, where final success depends on many intermediate states, actions, observations, and tool interactions\. Existing work improves credit granularity through process rewards, value models, and intermediate evaluators, which provide denser supervision for partial reasoning steps or action traces\[lightman,luo2024improve,stepmath,zhang2024generative,cui2025process\]\. Other methods use token\-level proxies such as entropy, uncertainty, attention, or outcome sensitivity to adjust updates\[cheng2026reasoning,seedgrpo,sun2025ktaemodelfreealgorithmkeytokens,li2025attention,chen2025beyond,li2026outcome\]\. While effective, these methods often require auxiliary models, extra labels, or proxy signals that are only indirectly tied to environment feedback\.

On\-policy distillation provides another way to obtain dense supervision\. OPD\[opd\]uses a stronger teacher to supervise the student’s on\-policy trajectories and provide token\-level signals\. However, it requires an additional stronger model, which increases computational cost and may introduce distribution mismatch\. Related self\-distillation methods condition the teacher on the student model and privileged signals, such as verifier feedback, future context, or correct trajectories\[opsd,sdpo,sdft\], but they may still suffer from privileged information leakage\. SDPO\[sdpo\]and RLSD\[rlsd\]further combine distillation signals with RL rewards to improve RL training while maintaining finer\-grained credit assignment\. However, these methods are mainly studied on simple reasoning tasks, leaving complex multi\-turn, long\-horizon agentic tasks underexplored\.

## 3Method

![Refer to caption](https://arxiv.org/html/2605.19447v1/x1.png)Figure 1:Pipeline of environment\-feedback guided agent RL\. The upper part summarizes training\-only hindsight sources and placement choices\. The lower part showsSERL: placed feedback is exposed only to the teacher, which converts hindsight into reward\-aligned action\-level credit for GRPO\.### 3\.1Preliminaries

#### Problem setting\.

We consider a multi\-turn LLM agent trajectory

τ=\(s0,a0,r0,s1,a1,r1,…,sT,aT,rT\),\\tau=\(s\_\{0\},a\_\{0\},r\_\{0\},s\_\{1\},a\_\{1\},r\_\{1\},\\ldots,s\_\{T\},a\_\{T\},r\_\{T\}\),\(1\)wherests\_\{t\}is the environment state or observation before acting,ata\_\{t\}is the executable action, andrtr\_\{t\}is the environment feedback returned after executingata\_\{t\}\. Each action is generated as a token sequenceat=\(yt,1,…,yt,Lt\)a\_\{t\}=\(y\_\{t,1\},\\ldots,y\_\{t,L\_\{t\}\}\)\. Letht=\(s0,a0,r0,…,st\)h\_\{t\}=\(s\_\{0\},a\_\{0\},r\_\{0\},\\ldots,s\_\{t\}\)denote the history available when the agent producesata\_\{t\}\.

#### GRPO\.

GRPO\[grpo\]optimizes an on\-policy group of rollouts without training an additional value model\. For a task instance, let\{τn\}n=1N\\\{\\tau^\{n\}\\\}\_\{n=1\}^\{N\}be trajectories sampled from the old policyπθold\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}with outcome rewardsRnR^\{n\}\. GRPO computes a group\-relative advantage

An=Rn−meanm⁡\(Rm\)stdm⁡\(Rm\)\+ϵA\.A^\{n\}=\\frac\{R^\{n\}\-\\operatorname\{mean\}\_\{m\}\(R^\{m\}\)\}\{\\operatorname\{std\}\_\{m\}\(R^\{m\}\)\+\\epsilon\_\{A\}\}\.\(2\)In long\-horizon agent training, this reward\-derived signal is usually broadcast to many tokens in the trajectory\. We writeAtA\_\{t\}for the advantage assigned to the action steptt; for standard trajectory\-level GRPO,At=AnA\_\{t\}=A^\{n\}for all steps in trajectoryτn\\tau^\{n\}\. For tokenyt,iy\_\{t,i\}, the policy ratio is

ρt,i\(θ\)=πθ\(yt,i∣ht,yt,<i\)πθold\(yt,i∣ht,yt,<i\)\.\\rho\_\{t,i\}\(\\theta\)=\\frac\{\\pi\_\{\\theta\}\(y\_\{t,i\}\\mid h\_\{t\},y\_\{t,<i\}\)\}\{\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(y\_\{t,i\}\\mid h\_\{t\},y\_\{t,<i\}\)\}\.\(3\)For any token\-level advantageBt,iB\_\{t,i\}, the clipped GRPO surrogate is

ℓGRPO\(θ;Bt,i\)=min⁡\(ρt,i\(θ\)Bt,i,clip⁡\(ρt,i\(θ\),1−ϵ,1\+ϵ\)Bt,i\)\.\\ell\_\{\\mathrm\{GRPO\}\}\(\\theta;B\_\{t,i\}\)=\\min\\left\(\\rho\_\{t,i\}\(\\theta\)B\_\{t,i\},\\operatorname\{clip\}\(\\rho\_\{t,i\}\(\\theta\),1\-\\epsilon,1\+\\epsilon\)B\_\{t,i\}\\right\)\.\(4\)Standard GRPO optimizes this surrogate withBt,i=AtB\_\{t,i\}=A\_\{t\}\. This objective gives a stable reward direction, but its credit assignment is coarse: the same advantage can update decisive actions, reasoning tokens, formatting tokens, and repeated environment interactions\.

#### On\-policy distillation\.

On\-policy distillation \(OPD\)\[opd\]provides the complementary signal\. A teacher policyπT\\pi\_\{T\}scores the student’s sampled trajectory and supplies token\-level supervision:

ℒOPD=∑t,iKL\[πT\(⋅∣ht,yt,<i,zt\)∥πθ\(⋅∣ht,yt,<i\)\],\\mathcal\{L\}\_\{\\mathrm\{OPD\}\}=\\sum\_\{t,i\}\\operatorname\{KL\}\\left\[\\pi\_\{T\}\(\\cdot\\mid h\_\{t\},y\_\{t,<i\},z\_\{t\}\)\\,\\middle\\\|\\,\\pi\_\{\\theta\}\(\\cdot\\mid h\_\{t\},y\_\{t,<i\}\)\\right\],\(5\)whereztz\_\{t\}denotes the additional context given to the teacher\. In agent settings, usefulztz\_\{t\}often comes from environment hindsight, such as post\-action feedback, future observations, or successful trajectories\. This makes OPD dense, but also risky: if the teacher sees information unavailable to the student at decision time, directly imitating the teacher can leak privileged information, amplify stylistic teacher preferences, and destabilize training\.

### 3\.2Hindsight Placement for Agent Credit Assignment

Environment hindsight is useful only when it is attached to the decisions it can explain\. Since agent rollouts mix meaningful state changes with repeated or state\-preserving interactions,SERLseparates the*source*of feedback from its*placement*, as summarized in Figure[1](https://arxiv.org/html/2605.19447#S3.F1)\.

Letℱt\\mathcal\{F\}\_\{t\}denote a training\-only hindsight signal derived from the environment\. Its source can be immediate feedbackrtr\_\{t\}, next observationst\+1s\_\{t\+1\}, future trajectoryτ\>t\\tau\_\{\>t\}, successful trajectoryτ\+\\tau^\{\+\}, current trajectoryτ≤t\\tau\_\{\\leq t\}, or their combinations; Appendix[A](https://arxiv.org/html/2605.19447#A1)gives the detailed taxonomy\. The student never receives this hindsight as decision\-time input\. A placement operator selects what the teacher sees:

Φ\(t\)=Place⁡\(ℱ,t\),\\Phi\(t\)=\\operatorname\{Place\}\(\\mathcal\{F\},t\),\(6\)whereΦ\(t\)\\Phi\(t\)is the feedback used when scoring the sampled action at steptt\. This formulation lets the same source support different credit\-assignment granularities\.

#### Step\-Level\.

The densest option is to attach feedback to every transition:

Φstep\(t\)=ℱt\.\\Phi\_\{\\mathrm\{step\}\}\(t\)=\\mathcal\{F\}\_\{t\}\.\(7\)The teacher scores each action token conditioned on local hindsight,πT\(⋅∣ht,yt,<i,Φstep\(t\)\)\\pi\_\{T\}\(\\cdot\\mid h\_\{t\},y\_\{t,<i\},\\Phi\_\{\\mathrm\{step\}\}\(t\)\)\. This gives the densest credit signal and is useful when each transition contains locally causal information\. However, repeated observations, failed interface attempts, or state\-preserving moves can make dense placement noisy\.

#### Anchor\-Level\.

To reduce redundancy, we group decision steps into semantic anchors\{𝒞1,…,𝒞M\}\\\{\\mathcal\{C\}\_\{1\},\\ldots,\\mathcal\{C\}\_\{M\}\\\}, where each anchor contains steps with the same or highly similar environment state\. Letg\(t\)=mg\(t\)=mif stepttbelongs to anchor𝒞m\\mathcal\{C\}\_\{m\}\. The feedback attached to that anchor is

ℱmA=Agg⁡\(\{ℱt:t∈𝒞m\}\),Φanchor\(t\)=ℱg\(t\)A\.\\mathcal\{F\}^\{A\}\_\{m\}=\\operatorname\{Agg\}\\left\(\\\{\\mathcal\{F\}\_\{t\}:t\\in\\mathcal\{C\}\_\{m\}\\\}\\right\),\\qquad\\Phi\_\{\\mathrm\{anchor\}\}\(t\)=\\mathcal\{F\}^\{A\}\_\{g\(t\)\}\.\(8\)HereAgg⁡\(⋅\)\\operatorname\{Agg\}\(\\cdot\)selects or merges feedback within the same semantic state\. Anchor\-level placement trades some density for robustness by concentrating hindsight on meaningful state changes\. In the following,Φ\(t\)\\Phi\(t\)denotes either placement choice\.

### 3\.3Environment\-Guided Advantage Reweighting

![Refer to caption](https://arxiv.org/html/2605.19447v1/x2.png)Figure 2:Mechanism ofSERL\. A stop\-gradient teacher conditioned on placed hindsightΦ\(t\)\\Phi\(t\)produces a teacher–student likelihood gap\.SERLsigns this gap with the GRPO advantage, then clips, decays, and masks it to action tokens, so hindsight changes update magnitude and locality while reward determines direction\.After placing hindsight, the second question is how to use it without turning privileged information into a direct imitation target\.SERLconverts teacher evidence into a reward\-aligned coefficient rather than a standalone learning direction, as illustrated in Figure[2](https://arxiv.org/html/2605.19447#S3.F2)\.

LetπT\\pi\_\{T\}be a stop\-gradient hindsight teacher\. The student scores tokenyt,iy\_\{t,i\}using only the decision\-time context\(ht,yt,<i\)\(h\_\{t\},y\_\{t,<i\}\), while the teacher additionally receives placed environment feedbackΦ\(t\)\\Phi\(t\)\. We define the teacher–student log\-probability gap as

Δt,i=log⁡πT\(yt,i∣ht,yt,<i,Φ\(t\)\)−log⁡πθ\(yt,i∣ht,yt,<i\)\.\\Delta\_\{t,i\}=\\log\\pi\_\{T\}\(y\_\{t,i\}\\mid h\_\{t\},y\_\{t,<i\},\\Phi\(t\)\)\-\\log\\pi\_\{\\theta\}\(y\_\{t,i\}\\mid h\_\{t\},y\_\{t,<i\}\)\.\(9\)IfΔt,i\>0\\Delta\_\{t,i\}\>0, the sampled token becomes more plausible after the teacher observes hindsight\. Since this signal may also reflect privileged context or teacher style, we use it only to adjust the magnitude of the reward\-driven update:

wt,i=clip⁡\(exp⁡\(sgn⁡\(At\)stopgrad⁡\(Δt,i\)\),wmin,wmax\)\.w\_\{t,i\}=\\operatorname\{clip\}\\left\(\\exp\\left\(\\operatorname\{sgn\}\(A\_\{t\}\)\\operatorname\{stopgrad\}\(\\Delta\_\{t,i\}\)\\right\),w\_\{\\min\},w\_\{\\max\}\\right\)\.\(10\)The sign of the GRPO advantage decides how teacher evidence is interpreted\. For positive\-advantage actions, teacher\-supported tokens receive larger updates\. For negative\-advantage actions, teacher\-supported tokens are penalized less aggressively, while teacher\-disfavored tokens receive stronger penalties\. Clipping prevents noisy hindsight probabilities from dominating reward learning, and the stop\-gradient keeps the teacher signal as a coefficient rather than a hidden auxiliary objective\.

We further restrict this reweighting to executable action tokens\. Letmt,iact∈\{0,1\}m^\{\\mathrm\{act\}\}\_\{t,i\}\\in\\\{0,1\\\}indicate whether tokenyt,iy\_\{t,i\}belongs to the action span\. We set

w¯t,i=mt,iactwt,i\+\(1−mt,iact\),\\bar\{w\}\_\{t,i\}=m^\{\\mathrm\{act\}\}\_\{t,i\}w\_\{t,i\}\+\(1\-m^\{\\mathrm\{act\}\}\_\{t,i\}\),\(11\)so reasoning and formatting tokens keep the original GRPO weight\. This focuses hindsight on executable decisions, where environment feedback is most causally tied to task success\. The final token advantage is

A~t,i=At\(\(1−αk\)\+αkw¯t,i\),\\widetilde\{A\}\_\{t,i\}=A\_\{t\}\\left\(\(1\-\\alpha\_\{k\}\)\+\\alpha\_\{k\}\\bar\{w\}\_\{t,i\}\\right\),\(12\)whereαk∈\[0,1\]\\alpha\_\{k\}\\in\[0,1\]controls the strength of hindsight reweighting at training stepkk\. We decayαk\\alpha\_\{k\}over training: early updates exploit dense hindsight when exploration is weak, while later updates return control to environment rewards to reduce privileged\-teacher bias\.

### 3\.4Selective Hindsight Objective

The reweighted RL objective plugsA~t,i\\widetilde\{A\}\_\{t,i\}into the GRPO surrogate defined above:

ℒrw=−∑t,iℓGRPO\(θ;A~t,i\)\.\\mathcal\{L\}\_\{\\mathrm\{rw\}\}=\-\\sum\_\{t,i\}\\ell\_\{\\mathrm\{GRPO\}\}\(\\theta;\\widetilde\{A\}\_\{t,i\}\)\.\(13\)We additionally keep a lightweight action\-only distillation term:

ℒact=∑t,imt,iactKL\[πT\(⋅∣ht,yt,<i,Φ\(t\)\)∥πθ\(⋅∣ht,yt,<i\)\]\.\\mathcal\{L\}\_\{\\mathrm\{act\}\}=\\sum\_\{t,i\}m^\{\\mathrm\{act\}\}\_\{t,i\}\\operatorname\{KL\}\\left\[\\pi\_\{T\}\(\\cdot\\mid h\_\{t\},y\_\{t,<i\},\\Phi\(t\)\)\\,\\middle\\\|\\,\\pi\_\{\\theta\}\(\\cdot\\mid h\_\{t\},y\_\{t,<i\}\)\\right\]\.\(14\)This term targets the decision boundary of the agent, such assearch\[\.\.\.\]andclick\[\.\.\.\]in WebShop or environment actions in ALFWorld\. It complements reweighting while avoiding full\-trajectory distillation over long reasoning traces\.

The final objective is

ℒSERL=ℒrw\+λℒact\.\\mathcal\{L\}\_\{\\mathrm\{SERL\}\}=\\mathcal\{L\}\_\{\\mathrm\{rw\}\}\+\\lambda\\mathcal\{L\}\_\{\\mathrm\{act\}\}\.\(15\)Thus,SERLdiffers from direct OPD and loss\-level RL–distillation mixtures: privileged hindsight never determines the full\-response update direction\. It enters through a placement operator, modifies only bounded action\-level credit, and decays as the policy becomes more competent\.

## 4Experiments

### 4\.1Experimental Setup

#### Benchmarks\.

We train LLM agents on two challenging benchmarks: ALFWorld\[alfworld\]and WebShop\[webshop\]\.ALFWorldis an embodied environment for evaluating multi\-step decision\-making\. In each episode, the agent receives a text goal and completes it through multi\-turn interaction with the environment\. The benchmark contains 3,827 task instances across six common household task categories: Pick & Place \(Pick\), Examine in Light \(Look\), Clean & Place \(Clean\), Heat & Place \(Heat\), Cool & Place \(Cool\), and Pick Two & Place \(Pick2\)\.WebShopis a web\-based interactive environment for evaluating agents in realistic online shopping scenarios\. To complete each task, the agent interacts with a simulated HTML shopping website to search, navigate, and purchase a suitable item\. The benchmark contains over 1\.1 million products and 12k user instructions, providing a large and diverse action space\.

#### Baselines\.

For ALFWorld and WebShop, we compare our approach with several competitive baselines\. \(1\)Prompting methods: ReAct\[react\]and Reflexion\[reflexion\], which use in\-context prompting to guide multi\-step behavior without parameter updates\. \(2\)RL training methods: PPO\[schulman2017proximal\], a standard actor\-critic algorithm that requires an additional value model; critic\-free group\-based methods RLOO\[rloo\]and GRPO\[grpo\], which estimate advantages from trajectory groups; and GIGPO\[gigpo\]and HGPO\[hgpo\], which further perform step\-level or hierarchical advantage estimation within trajectory groups\. \(3\)RL–distillation hybrid methods: SDPO\[sdpo\], which uses environment feedback for self\-distillation; two variants that combine SDPO with GRPO; and RLSD\[rlsd\], which integrates self\-distillation with token\-level weighting in the RL objective\.

#### Training Details\.

We use Qwen2\.5\-7B\-Instruct\[qwen2\.5\]as the base model\. For ALFWorld and WebShop, all RL\-based methods, including our approach and the baselines, are trained with the same hyperparameter settings for fair comparison\. For group\-based RL methods, the rollout group sizeNNis set to 8\. The actor is optimized with a learning rate of1×10−61\\times 10^\{\-6\}\. We use PPO\-style mini\-batches of 256 samples and micro\-batches of 32 samples for ALFWorld, and mini\-batches of 64 samples and micro\-batches of 8 samples for WebShop\. For RL methods combined with on\-policy self\-distillation \(SERL, RLSD\[rlsd\]\), we set the initial self\-distillation mixing coefficient toλ=0\.5\\lambda=0\.5, linearly decay it over 50 steps, and clip the token\-level distillation weights with a threshold of 0\.2\. The teacher policy is synchronized every 10 training steps\.

### 4\.2Experimental Results

Table 1:Main comparison ofSERLwith representative baselines on ALFWorld and WebShop\. For ALFWorld, we report the success rate \(%\) for each subtask and the average success rate across all subtasks\. For WebShop, we report the average score and success rate \(%\)\. All self\-distillation methods use immediate environment feedbackrtr\_\{t\}\.†indicates results from prior reports\.TypeMethodALFWorldWebShopPickLookCleanHeatCoolPick2AllScoreSucc\.Prompting†Qwen2\.5\-7B\-Instruct33\.421\.619\.36\.92\.83\.214\.826\.47\.8Prompting†ReAct\[react\]48\.535\.434\.313\.218\.217\.631\.246\.219\.5Prompting†Reflexion\[reflexion\]62\.041\.644\.930\.936\.323\.842\.758\.128\.8RL Training†PPO \(with critic\)\[schulman2017proximal\]92\.364\.092\.589\.580\.368\.880\.481\.468\.7RL Training†RLOO\[rloo\]87\.678\.287\.381\.371\.948\.975\.580\.365\.7RL TrainingGRPO\[grpo\]90\.383\.384\.270\.069\.255\.075\.373\.164\.1RL TrainingGIGPO\[gigpo\]93\.583\.378\.986\.776\.285\.083\.983\.575\.8RL TrainingHGPO\[hgpo\]92\.391\.777\.893\.385\.773\.985\.888\.477\.8Self\-DistillationSDPO\[sdpo\]66\.766\.729\.616\.720\.00\.033\.30\.00\.0HybridGRPO\+SDPO\(Advantage\)\[sdpo\]100\.091\.788\.953\.381\.021\.772\.884\.875\.4HybridGRPO\+SDPO\(Loss\)\[sdpo\]97\.4100\.088\.9100\.071\.434\.882\.188\.473\.0HybridRLSD\[rlsd\]97\.475\.088\.9100\.061\.973\.982\.983\.675\.8HybridSERL \(Ours\)92\.3100\.088\.9100\.076\.282\.690\.089\.580\.1

#### Overall comparison\.

Table[1](https://arxiv.org/html/2605.19447#S4.T1)comparesSERLwith prompting agents, pure RL training methods, and RL–distillation hybrids on ALFWorld and WebShop\. All self\-distillation and hybrid variants in this comparison use the same immediate environment feedbackrtr\_\{t\}, so the key difference is how each method converts that feedback into an optimization signal\.SERLobtains the best aggregate performance on both benchmarks, reaching90\.0%90\.0\\%average success on ALFWorld,89\.589\.5WebShop score, and80\.1%80\.1\\%WebShop success\. Relative to GRPO,SERLimproves ALFWorld average success by14\.714\.7points and WebShop success by16\.016\.0points\. Relative to the strongest pure RL baseline, HGPO, it still improves ALFWorld by4\.24\.2points and WebShop success by2\.32\.3points\.

These gains indicate that the main bottleneck is not only variance reduction in policy\-gradient estimation\. HGPO and GIGPO already refine trajectory\-level advantages, but their updates are still derived from sparse outcome rewards\.SERLinstead uses environment feedback to reshape credit at the action\-token level while keeping the reward\-determined direction\. This distinction matters in long\-horizon agents: a successful or failed rollout contains many reasoning tokens, interface actions, and repeated observations, but only a small subset of executable actions changes the environment in a task\-relevant way\.

SERLalso outperforms OPD\-style hybrids such as GRPO\+SDPO and RLSD\. These methods improve some individual ALFWorld categories, but their gains are less consistent across tasks\. For example, GRPO\+SDPO performs well on several short\-horizon categories but drops sharply on Pick2, where the agent must preserve state across multiple object interactions\. This pattern suggests that dense teacher supervision alone can over\-emphasize local signals that are not causally aligned with final success\. In contrast,SERLuses the teacher asymmetrically: hindsight changes the magnitude and locality of the GRPO update, but the reward still determines whether the sampled behavior should be reinforced or suppressed\. The aggregate improvement across ALFWorld and WebShop supports this controlled use of environment feedback\.

#### Feedback sources\.

Table 2:Analysis ofSERLwith different feedback sources\. We compare immediate environment feedback, next observations, future trajectories, successful trajectories, and their combinations\.TypeALFWorldWebShopPickLookCleanHeatCoolPick2AllScoreSucc\.GRPO\[grpo\]90\.383\.384\.270\.069\.255\.075\.373\.164\.1GIGPO\[gigpo\]93\.583\.378\.986\.776\.285\.083\.983\.575\.8HGPO\[hgpo\]92\.391\.777\.893\.385\.773\.985\.888\.477\.8immediate feedback92\.3100\.088\.9100\.076\.282\.690\.089\.580\.1next observation97\.472\.291\.780\.061\.956\.576\.690\.577\.7future trajectory100\.083\.383\.386\.780\.165\.283\.185\.976\.6successful trajectory or immediate feedback97\.0100\.083\.373\.385\.752\.281\.984\.172\.7successful trajectory and immediate feedback100\.0100\.094\.493\.381\.073\.990\.487\.781\.3successful trajectory and next observation94\.991\.788\.9100\.085\.752\.285\.686\.977\.7successful trajectory and future trajectory100\.083\.383\.373\.390\.569\.683\.388\.176\.6successful trajectory, future trajectory, and immediate feedback90\.091\.683\.366\.785\.739\.176\.187\.177\.0successful trajectory, future trajectory, and next observation100\.083\.383\.360\.081\.052\.276\.684\.176\.2

Table[2](https://arxiv.org/html/2605.19447#S4.T2)studies the feedback\-source axis defined in our method\. The central observation is that more privileged context is not monotonically better\. Immediate feedbackrtr\_\{t\}is already a strong source, achieving90\.0%90\.0\\%ALFWorld average success and80\.1%80\.1\\%WebShop success\. Its strength comes from locality: it is produced immediately after the current action, so the teacher can connect probability changes to the environment’s response without relying on a long future trajectory\.

The best overall source combines successful trajectories with immediate feedback, reaching the highest ALFWorld average \(90\.4%90\.4\\%\) and the highest WebShop success \(81\.3%81\.3\\%\)\. This combination is informative because the two signals play different roles\. A successful trajectory supplies a positive behavioral reference, whilertr\_\{t\}anchors that reference to the current rollout’s actual transition\. Without this grounding, successful trajectories alone or disjunctive combinations with immediate feedback are weaker, suggesting that reference behavior must be tied to the state being credited rather than used as a generic demonstration\.

The weaker results for next observations, future trajectories, and large source combinations further clarify what makes feedback useful\. Next observations achieve the highest WebShop score \(90\.590\.5\) but lower success, indicating that rich post\-action state information can improve partial progress or item ranking without reliably improving final completion\. Future trajectories and multi\-source privileged feedback often underperform because they contain delayed consequences that are not uniquely attributable to the current action\. Thus, effective agent feedback should be local enough to preserve causal alignment and rich enough to reveal why an action changed the environment; simply giving the teacher more hindsight can introduce noise and privileged mismatch\.

#### Feedback placement\.

Table 3:Analysis of feedback granularity inSERLon WebShop\. Step\-level feedback applies distillation to every transition, whereas anchor\-level feedback first groups semantically similar environment states before applying distillation\.Table[3](https://arxiv.org/html/2605.19447#S4.T3)evaluates the placement axis on WebShop\. Step\-level placement maximizes density by applying feedback to every transition, whereas anchor\-level placement groups semantically similar states and applies feedback around meaningful state changes\. The results show that placement should be chosen together with the feedback source\. With immediate feedback, anchor\-level placement improves score from89\.589\.5to91\.591\.5, suggesting that grouping suppresses redundant updates while preserving the useful local signal\. With successful trajectory plus immediate feedback, anchor\-level placement further improves success from81\.3%81\.3\\%to81\.9%81\.9\\%, the best WebShop success rate in the table\.

Anchor\-level placement is not uniformly superior, and this is also informative\. For next observations and future trajectories used alone, step\-level placement gives higher score and success, likely because these signals are path\-specific: grouping similar\-looking states can remove fine\-grained temporal information that explains the current action\. By contrast, anchor\-level placement is more helpful for multi\-source or partially privileged feedback, where raw density can amplify redundant or weakly causal teacher evidence\. This supports the main design principle ofSERL: dense environment feedback is valuable only when its source and insertion point are aligned with the decisions that actually affect the future trajectory\.

### 4\.3Ablation Study

#### LLM\-judged feedback\.

Table 4:Performance onSERLusing LLM\-judged feedback\. We compare trajectory judgments generated by Kimi\-K2\.6 and Qwen2\.5\-7B\-Instruct, using either the current trajectory alone or together with a successful trajectory\.Table[4](https://arxiv.org/html/2605.19447#S4.T4)tests whether raw environment trajectories can be compressed into a higher\-level feedback summary by an external LLM judge\. This ablation probes a different form of feedback reuse: instead of giving the teacher raw observations or trajectories, a judge first converts the rollout into a concise diagnosis\. With Kimi\-K2\.6, current\-trajectory judgments improve substantially over GRPO, and adding a successful trajectory raises WebShop success to81\.8%81\.8\\%\. This shows that a capable judge can turn a long observation–action sequence into action\-relevant credit: it identifies the causal error or useful behavior and exposes it to the teacher as compact privileged guidance\.

The judge model itself becomes part of the feedback channel\. Qwen2\.5\-7B\-Instruct provides some useful signal, especially on ALFWorld, but is much weaker on WebShop\. The likely reason is context coverage\. ALFWorld observations are short, whereas WebShop trajectories contain long HTML\-like observations and action histories\. In our setting, the Qwen judge has a 32K context window, which can truncate or compress away important WebShop evidence, while Kimi\-K2\.6 supports a 256K context window and can usually cover the complete trajectory prompt\. The result highlights a practical constraint: LLM\-judged feedback is useful only when the judge has enough context length and reasoning capacity to preserve the causal history being summarized\.

#### Decay of teacher signal\.

![Refer to caption](https://arxiv.org/html/2605.19447v1/x3.png)Figure 3:Effect of decaying the hindsight teacher signal during training\.Left \(a\)shows reward score,Middle \(b\)shows policy entropy, andRight \(c\)shows response length\. Decay uses dense teacher feedback early for credit assignment and gradually returns optimization control to reward\-driven GRPO\.Figure[3](https://arxiv.org/html/2605.19447#S4.F3)studies the temporal role of the teacher\-induced reweighting signal\. The motivation follows the same principle as RLSD\[rlsd\]: teacher probabilities can provide dense magnitude information, but the direction of policy improvement should remain reward\-driven\. This concern is stronger in our setting because the teacher may condition on environment feedback, future context, or successful trajectories that the student will not observe at test time\.

The training curves show why a decayed teacher weight is preferable to using hindsight uniformly throughout training\. Early in training, the policy explores poorly and sparse rewards provide weak credit assignment; teacher\-conditioned reweighting helps identify which action tokens deserve larger updates\. Later, as the policy becomes competent, the teacher–student gap increasingly mixes useful credit with privileged\-context bias, formatting differences, and trajectory\-specific noise\. Decay therefore implements a staged use of feedback: hindsight accelerates early credit assignment, then optimization gradually returns to GRPO so that the final policy is anchored by environment rewards rather than persistent privileged supervision\.

## 5Conclusion

We studied credit assignment in long\-horizon LLM agents from the perspective of environment feedback\. The core challenge is that RL sparse rewards provide a reliable optimization direction but weak credit assignment, whereas hindsight distillation provides dense token\-level signals but may rely on information unavailable to the student during training, potentially causing privileged\-information leakage and training instability\.SERLaddresses this challenge by using reward to determine the update direction and using environment\-conditioned teacher signals only for bounded, action\-level credit adjustment\. The broader lesson is that environment feedback should not be treated as unrestricted auxiliary supervision\. Effective agent RL requires aligning three design choices: what feedback source is used, where it is inserted, and how strongly it is allowed to affect the policy update\. Our results show that grounded, action\-relevant feedback can be more useful than richer but weakly causal hindsight, and that semantic feedback placement can matter as much as feedback density\. We hope this work provides new insights for the community and lays a foundation for future research on integrating environment feedback into the training of agentic LLMs\.

## References

## Appendix AEnvironment Feedback Details

We formalize a multi\-turn agent trajectory as an alternating sequence of states, actions, and environment feedback:

τ=\(s0,a0,r0,s1,a1,r1,…,sT,aT,rT\),\\tau=\(s\_\{0\},a\_\{0\},r\_\{0\},s\_\{1\},a\_\{1\},r\_\{1\},\\ldots,s\_\{T\},a\_\{T\},r\_\{T\}\),\(16\)wherests\_\{t\}denotes the environment state or observation available before actionata\_\{t\}, andrtr\_\{t\}denotes the feedback returned by the environment after executingata\_\{t\}\. In our analysis, environment feedback has two independent design axes: the source of feedback information and the position at which this information is injected into training\.

### A\.1Feedback Sources

We consider five feedback sources, ordered from local to increasingly privileged information\.

#### Environment feedback\.

The most direct signal is the immediate environment feedbackrtr\_\{t\}returned after the current action\. It captures whether the action changes the environment in a useful way, but does not expose future decisions\.

#### Next observation\.

The next observationst\+1s\_\{t\+1\}provides the post\-action state induced byata\_\{t\}\. Compared withrtr\_\{t\}, it contains richer state information and can reveal how the environment actually responds to the current action\.

#### Future trajectory\.

Future trajectory feedback uses the suffix after the current step,

τ\>t=\{\(sj,aj\)\}j\>t\.\\tau\_\{\>t\}=\\\{\(s\_\{j\},a\_\{j\}\)\\\}\_\{j\>t\}\.\(17\)This source provides delayed consequences of the current decision and can offer stronger supervision, but it is also more privileged because it contains information unavailable whenata\_\{t\}is chosen\.

#### Successful trajectory\.

A successful trajectoryτ\+=\(s0\+,a0\+,…,sT\+,aT\+\)\\tau^\{\+\}=\(s^\{\+\}\_\{0\},a^\{\+\}\_\{0\},\\ldots,s^\{\+\}\_\{T\},a^\{\+\}\_\{T\}\)is a rollout that reaches the task goal\. It serves as a positive behavioral reference and can indicate which states or actions are associated with successful completion\.

#### Current trajectory\.

The current trajectory prefix

τ≤t=\(s0,a0,…,st,at\)\\tau\_\{\\leq t\}=\(s\_\{0\},a\_\{0\},\\ldots,s\_\{t\},a\_\{t\}\)\(18\)contains only information produced by the agent so far\. It is less privileged than future or successful trajectories, and mainly provides context for judging whether the current action is consistent with the agent’s own interaction history\.

### A\.2Temporal Granularity of Feedback

We further distinguish where feedback is applied during training\.

#### Step\-level feedback\.

At the step level, feedback is injected independently for every transition\(st,at\)\(s\_\{t\},a\_\{t\}\)\. This provides dense supervision and assigns credit to each action, but may also update many trivial or repetitive steps\.

#### Anchor\-level feedback\.

At the anchor level, transitions in a rollout group are first partitioned according to their environment states\. An anchor contains steps whose statessts\_\{t\}correspond to the same environment condition or are semantically similar\. Feedback is then applied at the anchor rather than individual\-step level\. This reduces noisy updates on redundant interactions while preserving supervision for meaningful state changes\.

## Appendix BTraining Details

### B\.1Training Setup Details

We use Qwen2\.5\-7B\-Instruct\[qwen2\.5\]as the base model and conduct all experiments with the VeRL codebase\[verl\]on 8 NVIDIA H200 GPUs\. For ALFWorld and WebShop, all RL\-based methods, including our approach and the baselines, are trained with the same hyperparameter settings for fair comparison\[gigpo\]\. For group\-based RL methods, the rollout group sizeNNis set to 8\. The actor is optimized for one training epoch with a learning rate of1×10−61\\times 10^\{\-6\}with total 150 steps\. We set the maximum number of agent–environment interaction turns to 50 for ALFWorld and 15 for WebShop\. For ALFWorld, we use PPO\-style mini\-batches of 256 samples and micro\-batches of 32 samples; for WebShop, we use mini\-batches of 64 samples and micro\-batches of 8 samples\. The rollout temperature is set to 0\.4\.

For RL methods combined with on\-policy self\-distillation, includingSERL, SDPO\[sdpo\]and RLSD\[rlsd\], we set the initial self\-distillation mixing coefficient toλ=0\.5\\lambda=0\.5, linearly decay it over 50 steps, and clip the token\-level distillation weights with a threshold of 0\.2\. The teacher policy is synchronized every 10 training steps\.

### B\.2Reward Score

![Refer to caption](https://arxiv.org/html/2605.19447v1/x4.png)Figure 4:Training reward dynamics on agent environments\.SERL, which combines GRPO with environment\-feedback\-guided OPD, improves reward faster than pure RL baselines\. Anchor\-level feedback further accelerates convergence over step\-level feedback by concentrating updates on semantically meaningful state changes\.Figure[4](https://arxiv.org/html/2605.19447#A2.F4)provides the training reward dynamics behind the main results\. The central observation is that combining GRPO with OPD\-style environment feedback converges faster than pure RL training\. This supports the motivation ofSERL: environment feedback is not merely an additional supervision source, but a dense credit signal that helps the policy identify which tokens and actions should receive stronger updates before sparse trajectory rewards become reliable\.

The comparison between step\-level and anchor\-level feedback further clarifies where this dense signal should be applied\. Step\-level feedback gives every transition a teacher\-conditioned update, which increases supervision density but also spends capacity on repeated observations, formatting tokens, and low\-level interface operations\. Anchor\-level feedback instead groups semantically similar environment states and applies the signal at meaningful state changes\. Its faster reward growth suggests that the main bottleneck is not the amount of feedback, but the precision of credit placement: agent training benefits most when environment feedback is concentrated on decision points that actually change the future trajectory\.

### B\.3LLM Judge Prompt

LLM\-judged feedback is used to test a more general form of environment feedback summary\. Instead of directly exposing raw observations or full trajectories to the teacher, a judge model reads the rollout and produces a short diagnostic summary\. This matters for long\-horizon agents because raw trajectories can be long, noisy, and hard to align token by token\. A good judge converts the trajectory into an actionable credit signal: it states whether the rollout succeeded, identifies the causal error or useful behavior, and provides concise privileged guidance for the teacher\. The ablation in Table[4](https://arxiv.org/html/2605.19447#S4.T4)therefore measures not only whether LLM judgments are helpful, but also whether the judge has enough context capacity to faithfully summarize the environment interaction\.

Trajectory Judge System PromptRole\.You are an expert rollout critic\.Input\.Given a full trajectory and optional reference trajectories, write a concise trajectory judgment that another policy model can use as privileged guidance\.Requirements\.The first sentence must explicitly state whether the rollout succeeded or failed\. Summarize what went right or wrong, connect it to the task objective, and end with the most actionable lesson\. Keep the judgment short and concrete\. Always write at least one concrete sentence\. Never answer with “None”, “N/A”, or only the section title\.

### B\.4Computational Cost

Table 5:Computational cost of different training algorithms\. MFU denotes model FLOPs utilization, and time per step is measured under the same training infrastructure\.Table[5](https://arxiv.org/html/2605.19447#A2.T5)compares the training cost of representative RL and RL–distillation algorithms under the same infrastructure\. Pure RL methods such as GRPO and GIGPO achieve higher MFU because their updates are dominated by standard policy forward–backward computation\. In contrast, distillation\-based methods introduce additional teacher scoring, feedback construction, and token\-level weighting, which lower MFU by adding non\-matrix\-multiplication overhead and more irregular sequence processing\. This is expected in agent training, where environment interaction and trajectory\-dependent feedback make the workload less uniform than standard supervised or reasoning\-only RL\.

The key observation is thatSERLremains in the same wall\-clock cost regime as strong baselines while providing denser credit assignment\. Step\-levelSERLincreases time per step only moderately over GRPO and is faster than SDPO\+GRPO and GIGPO, suggesting that action\-only reweighting keeps the additional teacher signal relatively lightweight\. Anchor\-levelSERLhas higher per\-step overhead because it performs semantic grouping and feedback aggregation before applying the teacher signal, but this cost buys a different trade\-off: it spends computation on more meaningful state changes rather than uniformly supervising every transition\. Thus, the relevant efficiency question for long\-horizon agents is not only per\-step throughput, but whether each update places credit on decisions that actually affect future interaction\.

## Appendix CLimitations

Our experiments are limited to two agentic environments, ALFWorld and WebShop, which may constrain the generalizability of the empirical findings\. Extending on\-policy training to more diverse agentic environments remains challenging, as it requires substantial computational resources and robust infrastructure for stable training and evaluation\. We leave the broader\-scale evaluation ofSERLacross additional agentic environments as an important direction for future work\.
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

Similar Articles

@SharonYixuanLi: Scaling outcome-based RL won't solve long-horizon agentic tasks. Credit assignment is the bottleneck, and turn-level re…

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation

Submit Feedback

Similar Articles

@SharonYixuanLi: Scaling outcome-based RL won't solve long-horizon agentic tasks. Credit assignment is the bottleneck, and turn-level re…
HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents
OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation