# Milestone-Guided Policy Learning for Long-Horizon Language Agents
Source: [https://arxiv.org/html/2605.06078](https://arxiv.org/html/2605.06078)
Yuchen Yan, Hongxing Li, Teng Pan, Dingming Li, Ruiqing Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
###### Abstract
While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional structure of long-horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long-horizon ALFWorld tasks, BEACON achieves a 92.9% success rate, nearly doubling GRPO's 53.5%, while improving effective sample utilization from 23.7% to 82.0%. These results establish milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. Code is available at [https://github.com/ZJU-REAL/BEACON](https://github.com/ZJU-REAL/BEACON).
## 1 Introduction
Large language model agents have demonstrated remarkable capabilities in performing complex tasks in diverse environments (Yao et al., [2023](https://arxiv.org/html/2605.06078#bib.bib2); Schick et al., [2023](https://arxiv.org/html/2605.06078#bib.bib4)), including web navigation (Zhou et al., [2023](https://arxiv.org/html/2605.06078#bib.bib6); Deng et al., [2023](https://arxiv.org/html/2605.06078#bib.bib5)), embodied control (Ahn et al., [2022](https://arxiv.org/html/2605.06078#bib.bib7); Huang et al., [2022](https://arxiv.org/html/2605.06078#bib.bib8); Wang et al., [2025b](https://arxiv.org/html/2605.06078#bib.bib49)), and scientific experimentation (Boiko et al., [2023](https://arxiv.org/html/2605.06078#bib.bib9); Bran et al., [2023](https://arxiv.org/html/2605.06078#bib.bib10)). These agents must perform sequences of decisions that span dozens of steps, with success determined only at task completion. Training such agents through reinforcement learning has shown promise (Zhang et al., [2025a](https://arxiv.org/html/2605.06078#bib.bib12); Ouyang et al., [2022a](https://arxiv.org/html/2605.06078#bib.bib11)), yet current policy optimization methods scale poorly with task horizon, exhibiting systematic performance collapse as decision sequences lengthen.
This collapse stems from two fundamental limitations of trajectory-level optimization, which treats trajectories as flat action sequences and assigns credit based solely on terminal outcomes. The first is *credit misattribution*: all actions within a trajectory receive identical advantages based solely on the terminal outcome. A correct early action is penalized when later actions cause failure; the same action receives opposite gradient signals across trajectories depending on downstream stochasticity, causing gradients to conflict. The second is *sample inefficiency*: as task horizons extend, successful trajectories become increasingly scarce, causing most samples to yield zero reward. Moreover, trajectories that complete substantial subgoals but fail the final objective receive zero reward identical to complete failures, wasting meaningful progress. We validate these limitations on ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2605.06078#bib.bib1)): GRPO (Shao et al., [2024](https://arxiv.org/html/2605.06078#bib.bib13)) achieves 77% success on short tasks but collapses to 54% on long tasks, with over 40% of gradient updates containing contradictory signals. Furthermore, 39% of sampled trajectories complete at least one subgoal yet contribute no learning signal under trajectory-level optimization.


Figure 1: BEACON overview and performance preview. Left: GRPO assigns uniform credit from terminal outcomes, penalizing correct early actions when later actions fail; BEACON partitions trajectories at milestones and estimates advantages at dual scales. Right: On ALFWorld, GRPO degrades sharply with task horizon while BEACON maintains robust performance across all horizons.

Existing methods that aim to provide denser credit assignment introduce their own limitations. Process reward models (Lightman et al., [2023](https://arxiv.org/html/2605.06078#bib.bib35); Wang et al., [2024](https://arxiv.org/html/2605.06078#bib.bib36)) require expensive step-level annotations and risk reward hacking (Gao et al., [2022](https://arxiv.org/html/2605.06078#bib.bib40)). Monte Carlo value estimation (Kazemnejad et al., [2024](https://arxiv.org/html/2605.06078#bib.bib38)) demands multiple rollouts per decision point, multiplying computational cost. GiGPO (Feng et al., [2025](https://arxiv.org/html/2605.06078#bib.bib33)) constructs step-level comparison groups by identifying repeated states across trajectories, but its effectiveness depends on state recurrence, which diminishes as agents progress toward task completion in long-horizon settings. We observe that long-horizon agentic tasks already exhibit exploitable structure: they decompose into phases bounded by *milestones*, state transitions where subgoal achievement renders prior execution history largely irrelevant. This approximate Markov property enables credit to be decoupled across phases, yet trajectory-level methods ignore it entirely.
We introduce the Milestone-Guided Policy Learning Framework (BEACON), which leverages task structure to address both credit misattribution and sample inefficiency. The key idea is to partition trajectories at milestone boundaries and perform credit assignment at the segment level rather than the trajectory level. Given a trajectory, BEACON first identifies milestones from verifiable state changes and partitions the trajectory into segments accordingly. Within each segment, temporal reward shaping assigns higher credit to actions closer to milestone completion, transforming sparse terminal signals into dense feedback that rewards partial progress. Across segments, dual-scale advantage estimation computes advantages at both trajectory and segment levels. The trajectory-level advantage captures global task performance, while the segment-level advantage compares only among trajectories that reached the same milestone, isolating local action quality from the variance introduced by subsequent segments. This decomposition ensures that a correct action in an early segment is not penalized by failures in later segments, directly addressing credit misattribution.
We evaluate BEACON on ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2605.06078#bib.bib1)), WebShop (Yao et al., [2022](https://arxiv.org/html/2605.06078#bib.bib17)), and ScienceWorld (Wang et al., [2022](https://arxiv.org/html/2605.06078#bib.bib18)). BEACON outperforms GRPO across all benchmarks, with improvements that amplify as task horizons extend: relative gains over GRPO scale from 26.2% on short tasks to 73.6% on long tasks on ALFWorld. On Long tasks, BEACON achieves 92.9% success versus 53.5% for GRPO. Analysis reveals that BEACON recovers learning signal from partial successes: effective sample utilization improves from 23.7% to 82.0%. Furthermore, BEACON achieves 91.4% success compared to 43% for supervised fine-tuning on oracle trajectories, confirming that the gains stem from policy optimization rather than milestone imitation.
In summary, our contributions are as follows:
- This work identifies credit misattribution and sample inefficiency as fundamental limitations of trajectory-level optimization, showing that over 40% of gradient updates contain contradictory signals as task horizons extend.
- We propose BEACON, a framework that partitions trajectories at milestone boundaries, applies temporal reward shaping within segments, and estimates advantages at dual scales to isolate local action quality from later failures.
- Experiments on ALFWorld, WebShop, and ScienceWorld demonstrate horizon-dependent improvements, with relative gains over GRPO scaling from 26.2% to 73.6% and sample utilization improving from 23.7% to 82.0%.
## 2 Failures in Flat Trajectory Optimization
We first establish empirically that trajectory-level policy optimization fails systematically as task horizons extend, then diagnose the underlying causes through gradient analysis.
Experiments use Qwen2.5-1.5B-Instruct (Qwen et al., [2025](https://arxiv.org/html/2605.06078#bib.bib19)) on ALFWorld with GRPO, stratifying tasks by optimal trajectory length: Short ($L^{*}\leq 4$), Medium ($5\leq L^{*}\leq 7$), and Long ($L^{*}>7$) (details in Section [4.1](https://arxiv.org/html/2605.06078#S4.SS1)). Figure [1](https://arxiv.org/html/2605.06078#S1.F1) (Right) shows GRPO degrades from 76.7% (Short) to 53.5% (Long).
Figure 2: Failures in flat trajectory optimization. (a) Sample distribution during GRPO training. Partial successes yield zero gradient despite meaningful progress. (b) Gradient conflict analysis. Contradictory signals cause the effective learning signal to collapse.

#### Sample Inefficiency.
Figure [2](https://arxiv.org/html/2605.06078#S2.F2)(a) shows the sampled trajectory distribution during training. We categorize trajectories into three types: full successes (green) that complete the task, partial successes (orange) that complete at least one milestone but fail the final task, and complete failures (gray) that achieve none. Partial successes consistently comprise 39–47% of samples throughout training, yet under GRPO they receive zero reward identical to complete failures. Meanwhile, full successes remain below 27%, meaning over 73% of samples yield no learning signal. This waste of partial progress severely limits learning efficiency.
#### Credit Misattribution.
Even among trajectories that do provide signal, credit assignment is corrupted. Figure [2](https://arxiv.org/html/2605.06078#S2.F2)(b) reveals a second pathology: gradient corruption from contradictory credit assignment. We measure the Contradictory Action Ratio (CAR), defined as the fraction of actions that receive opposite-sign advantages across different trajectories despite being executed at identical states. CAR exceeds 40% at its peak, indicating that nearly half of gradient updates for repeated state-action pairs point in conflicting directions. As a consequence, the effective learning signal (the fraction of gradient that survives after cancellation) collapses below 20% (see Appendix [C.2](https://arxiv.org/html/2605.06078#A3.SS2) for detailed computation). The root cause is that trajectory-level advantages conflate action quality with downstream stochasticity: the same correct action receives positive credit when later actions succeed and negative credit when they fail.
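To make the diagnostic concrete, the sketch below shows one plausible way CAR could be tallied from logged rollouts; the record layout (state key, action, advantage triples) and the normalization over distinct pairs are illustrative assumptions, not the paper's measurement code.

```python
from collections import defaultdict

def contradictory_action_ratio(records):
    """One plausible way to tally CAR from a batch of logged steps.

    `records` is a list of (state_key, action, advantage) tuples gathered
    across all trajectories in a batch. A (state, action) pair counts as
    contradictory if it received both positive and negative advantages.
    The ratio is taken over all distinct pairs with a nonzero advantage.
    """
    signs = defaultdict(set)
    for state_key, action, adv in records:
        if adv != 0.0:
            signs[(state_key, action)].add(adv > 0)
    if not signs:
        return 0.0
    contradictory = sum(1 for s in signs.values() if len(s) == 2)
    return contradictory / len(signs)

# Toy batch: the same (state, action) pair is rewarded in a successful
# trajectory and penalized in a failed one, so CAR = 1/2 here.
batch = [("kitchen", "open fridge", +1.0),
         ("kitchen", "open fridge", -1.0),
         ("kitchen", "take apple", +1.0)]
print(contradictory_action_ratio(batch))  # 0.5
```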
#### Takeaways.
Flat trajectory optimization suffers from two compounding problems. Sample inefficiency discards learning signal from partial successes, while credit misattribution corrupts the signal that remains. Both problems worsen as horizons extend: longer tasks have lower success rates (increasing partial successes) and more opportunities for downstream variance to corrupt credit assignment. Addressing these failures requires exploiting the compositional structure that trajectory-level methods ignore.
## 3 Milestone-Anchored Policy Optimization
We introduce BEACON, a framework that exploits the compositional structure of long-horizon tasks to address the credit assignment failures identified in Section [2](https://arxiv.org/html/2605.06078#S2). BEACON operates in three stages: partitioning trajectories at milestone boundaries, shaping rewards within segments, and estimating advantages at dual scales.

Figure 3: The BEACON framework. Top: Trajectory partitioning divides rollouts into segments at milestone boundaries; temporal reward decay (factor $\gamma$) assigns higher credit to actions closer to milestone completion. Bottom: Dual-scale advantage estimation computes trajectory-level advantages by comparing terminal outcomes (left), segment-level advantages by comparing returns within milestone-matched groups (middle), and combines both scales for final credit assignment (right).

### 3.1 Preliminaries
We consider a Markov Decision Process $(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma)$ where a language agent policy $\pi_{\theta}$ produces trajectories $\tau=\{(s_{t},a_{t})\}_{t=1}^{T}$ through interaction with an environment. The agent receives a sparse terminal reward $R(\tau)\in\{0,1\}$ indicating task success.
We assume access to a milestone indicator $\Phi:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\{0,1\}$ that returns 1 when a transition completes a semantic subgoal, and 0 otherwise. Crucially, $\Phi$ does not require learned models or manual annotation: it detects observable state changes from environment feedback. In interactive environments, such signals are typically available: in ALFWorld, $\Phi$ detects object state changes such as successful pick-up or heating completion; in WebShop, $\Phi$ identifies page transitions advancing toward the target product; in ScienceWorld, the environment provides explicit subgoal signals that $\Phi$ directly consumes.
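As an illustration of how lightweight such a detector can be, here is a minimal rule-based sketch of $\Phi$ for an ALFWorld-style text environment. The feedback strings and the `Transition` container are assumptions made for this example, not the detector released with BEACON.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str       # textual observation before the action
    action: str      # the agent's action string
    next_state: str  # environment feedback after the action

# Hypothetical feedback patterns indicating that a subgoal was just completed.
MILESTONE_PATTERNS = (
    "you pick up the",   # object successfully picked up
    "you heat the",      # heating subgoal completed
    "you clean the",     # cleaning subgoal completed
    "you cool the",      # cooling subgoal completed
)

def phi(transition: Transition) -> int:
    """Milestone indicator: return 1 iff the transition completes a verifiable subgoal."""
    feedback = transition.next_state.lower()
    return int(any(p in feedback for p in MILESTONE_PATTERNS))

# Example: picking up an object is detected as a milestone.
t = Transition("You see a mug on the counter.", "take mug from counter",
               "You pick up the mug from the counter.")
print(phi(t))  # 1
```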
### 3.2 Trajectory Partitioning
Long-horizon tasks naturally decompose into phases bounded by milestone states. Given trajectory $\tau$, applying $\Phi$ to each transition yields milestone timestamps $\mathcal{M}=\{t_{1},\ldots,t_{K}\}$, where $K$ is the number of milestones reached. Setting $t_{0}=0$ and $t_{K+1}=T$, we partition $\tau$ into $K+1$ segments:

$$\text{Seg}_{k}=\{(s_{t},a_{t}):t_{k-1}<t\leq t_{k}\},\quad k\in\{1,\ldots,K+1\}. \tag{1}$$
We partition at milestone boundaries based on the following structural assumption:
###### Assumption 3.1 (Milestone Markov Property).
For milestone state $s_{t_{k}}$ reached at timestep $t_{k}$:

$$P(\text{Seg}_{k+1},\ldots,\text{Seg}_{K+1}\mid s_{t_{k}},\text{Seg}_{1},\ldots,\text{Seg}_{k})\approx P(\text{Seg}_{k+1},\ldots,\text{Seg}_{K+1}\mid s_{t_{k}}). \tag{2}$$
This assumption states that, conditioned on reaching a milestone state, the future trajectory distribution depends primarily on the remaining subgoals rather than on the full history. This is natural for compositional tasks: once an object is picked up, subsequent success depends on what to do next, not on how the object was found. We discuss the validity and limitations of this assumption in Appendix [A.2](https://arxiv.org/html/2605.06078#A1.SS2).
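The partitioning of Eq. (1) is mechanical once per-step milestone flags from $\Phi$ are available. A minimal sketch, assuming the flags are passed as a plain Python list (an interface chosen for illustration):

```python
def partition_trajectory(milestone_flags):
    """Partition a trajectory of length T into K+1 segments (Eq. 1).

    `milestone_flags[t]` holds Phi(s_t, a_t, s_{t+1}) for the 1-indexed step t,
    passed here as a 0-indexed Python list. Returns a list of lists of
    1-indexed timesteps; segment k covers t_{k-1} < t <= t_k, and the last
    segment holds the steps after the final milestone (possibly empty).
    """
    T = len(milestone_flags)
    milestones = [t + 1 for t, flag in enumerate(milestone_flags) if flag]  # t_1, ..., t_K
    boundaries = [0] + milestones + [T]                                     # t_0 = 0, t_{K+1} = T
    segments = []
    for k in range(1, len(boundaries)):
        segments.append(list(range(boundaries[k - 1] + 1, boundaries[k] + 1)))
    return segments

# Toy rollout of 6 steps with milestones at t=2 and t=5:
# segments are {1,2}, {3,4,5}, and the trailing {6} with no milestone.
print(partition_trajectory([0, 1, 0, 0, 1, 0]))
# [[1, 2], [3, 4, 5], [6]]
```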
### 3.3 Temporal Reward Shaping
Partitioning alone does not address sample inefficiency, since segments in failed trajectories still receive zero reward. We assign shaped rewards crediting partial progress.
For action $a_{t}$ in segment $\text{Seg}_{k}$ of trajectory $\tau_{i}$ with $K_{i}$ completed milestones:

$$r_{t}=\begin{cases}R_{\text{ms}}\cdot\gamma^{t_{k}-t}&\text{if }k\leq K_{i}\\ 0&\text{if }k=K_{i}+1,\end{cases} \tag{3}$$

where $R_{\text{ms}}>0$ is the milestone reward and $\gamma\in(0,1)$ is the temporal decay factor. Only segments that end with a completed milestone receive positive reward. This design has two properties: (1) all actions in completed segments receive positive reward, enabling learning from partial successes; (2) actions closer to milestone completion receive higher credit, encouraging efficient execution.
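A minimal sketch of Eq. (3), reusing the segment layout from the partitioning sketch above; the default values for $R_{\text{ms}}$ and $\gamma$ are placeholders only:

```python
def shaped_rewards(segments, num_milestones, r_ms=1.0, gamma=0.95):
    """Temporal reward shaping (Eq. 3).

    `segments[k-1]` is the list of 1-indexed timesteps of Seg_k, and
    `num_milestones` is K_i. Steps in milestone-completed segments
    (k <= K_i) get r_t = R_ms * gamma^(t_k - t); the trailing segment
    (k = K_i + 1) gets zero reward.
    """
    rewards = {}
    for k, seg in enumerate(segments, start=1):
        if not seg:
            continue
        t_k = seg[-1]  # for completed segments, the last step is the milestone step t_k
        for t in seg:
            rewards[t] = r_ms * gamma ** (t_k - t) if k <= num_milestones else 0.0
    return rewards

# For the toy partition [[1, 2], [3, 4, 5], [6]] with K_i = 2:
# r_2 = 1.0, r_1 = 0.95, r_5 = 1.0, r_4 = 0.95, r_3 = 0.9025, r_6 = 0.
print(shaped_rewards([[1, 2], [3, 4, 5], [6]], num_milestones=2))
```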
### 3.4 Dual-Scale Advantage Estimation
Temporal reward shaping provides dense signal but does not fully resolve credit misattribution: actions in early segments may still receive credit influenced by outcomes in later segments through trajectory-level comparison. We address this through dual-scale advantage estimation.
#### Trajectory-Level Advantage.
For a group of $G$ trajectories $\{\tau_{i}\}_{i=1}^{G}$ sampled for the same task, the trajectory-level advantage follows GRPO:

$$A^{\text{traj}}_{i}=\frac{R(\tau_{i})-\mu}{\sigma+\epsilon}, \tag{4}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of terminal rewards across the group.
#### Segment-Level Advantage.
Trajectory-level comparison assigns identical credit to all actions regardless of position. To isolate local action quality from downstream variance, we compare segment performance only among trajectories that reached the same milestone. Define the comparison group for milestone $k$ as $\mathcal{G}_{k}=\{i:K_{i}\geq k\}$, where $K_{i}$ is the number of milestones reached by trajectory $\tau_{i}$. The segment return is:

$$R_{k}^{(i)}=\sum_{t\in\text{Seg}_{k}^{(i)}}r_{t}. \tag{5}$$

The segment-level advantage compares the per-step reward against the group's average per-step return:

$$A^{\text{seg}}_{i,t}=r_{t}-\frac{1}{|\mathcal{G}_{k}|}\sum_{j\in\mathcal{G}_{k}}\frac{R_{k}^{(j)}}{|\text{Seg}_{k}^{(j)}|},\quad t\in\text{Seg}_{k}^{(i)}. \tag{6}$$

By comparing only among trajectories that reached milestone $k$, this advantage isolates the quality of actions within segment $k$ from variance in subsequent segments:
###### Proposition 3.2 (Variance Isolation).
Under Assumption [A.1](https://arxiv.org/html/2605.06078#A1.Thmtheorem1), for trajectories in comparison group $\mathcal{G}_{k}$:

$$\text{Cov}_{i\in\mathcal{G}_{k}}\left(A^{\text{seg}}_{i,t},\,R_{k^{\prime}}^{(i)}\right)\approx 0,\quad\forall i\in\mathcal{G}_{k},\ \forall t\in\text{Seg}_{k}^{(i)},\ \forall k^{\prime}>k. \tag{7}$$

The proof is provided in Appendix [A.1](https://arxiv.org/html/2605.06078#A1.SS1). This result ensures that credit for actions in segment $k$ is not corrupted by variance in later segments, directly addressing credit misattribution.
#### Combined Advantage.
The final advantage for action $a_{t}$ in segment $\text{Seg}_{k}$ of trajectory $\tau_{i}$ is:

$$\hat{A}_{i,t}=A^{\text{traj}}_{i}+\lambda\cdot A^{\text{seg}}_{i,t}, \tag{8}$$

where $\lambda>0$ balances global task performance and local segment quality.
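The following sketch assembles Eqs. (4), (6), and (8) for a small group of rollouts; the nested per-segment reward layout is an assumption made for illustration, not BEACON's actual data structures.

```python
import numpy as np

def dual_scale_advantages(terminal_rewards, segment_rewards, lam=1.0, eps=1e-8):
    """Dual-scale advantage estimation (Eqs. 4, 6, and 8).

    terminal_rewards: list of R(tau_i) for the G trajectories in a group.
    segment_rewards:  segment_rewards[i][k] is the list of shaped per-step
                      rewards r_t for segment k of trajectory i (only
                      milestone-completed segments, k = 1..K_i, are included).
    Returns adv[i][k] as a list of combined advantages, one per step.
    """
    G = len(terminal_rewards)
    mu, sigma = np.mean(terminal_rewards), np.std(terminal_rewards)
    a_traj = [(r - mu) / (sigma + eps) for r in terminal_rewards]      # Eq. (4)

    adv = [{} for _ in range(G)]
    max_k = max((max(s) for s in segment_rewards if s), default=0)
    for k in range(1, max_k + 1):
        group = [i for i in range(G) if k in segment_rewards[i]]       # G_k
        baseline = np.mean([np.mean(segment_rewards[i][k]) for i in group])
        for i in group:
            adv[i][k] = [a_traj[i] + lam * (r_t - baseline)            # Eqs. (6) and (8)
                         for r_t in segment_rewards[i][k]]
    return adv

# Two rollouts: trajectory 0 succeeds (R=1) and completes two milestones,
# trajectory 1 fails (R=0) but still completes milestone 1.
adv = dual_scale_advantages(
    terminal_rewards=[1.0, 0.0],
    segment_rewards=[{1: [0.95, 1.0], 2: [0.9025, 0.95, 1.0]},
                     {1: [0.9025, 0.95, 1.0]}])
print(adv[1][1])  # the failed rollout still gets informative per-step credit in segment 1
```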
### 3.5 Optimization
We optimize the policy using a clipped surrogate objective:
$$\mathcal{J}(\theta)=\mathbb{E}\left[\sum_{t}\min\left(\rho_{t}\hat{A}_{i,t},\ \text{clip}(\rho_{t},1-\epsilon,1+\epsilon)\hat{A}_{i,t}\right)\right], \tag{9}$$

where $\rho_{t}=\pi_{\theta}(a_{t}\mid s_{t})/\pi_{\theta_{\text{old}}}(a_{t}\mid s_{t})$ is the importance ratio. The complete procedure is summarized in Algorithm [1](https://arxiv.org/html/2605.06078#alg1).
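For completeness, here is a PyTorch-style sketch of Eq. (9), written as a loss to minimize; tensor shapes and masking details are simplified assumptions for illustration.

```python
import torch

def beacon_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of Eq. (9), returned as a loss to minimize.

    logp_new:   log pi_theta(a_t | s_t) under the current policy (with grad).
    logp_old:   log pi_theta_old(a_t | s_t) from the rollout policy.
    advantages: combined dual-scale advantages A_hat_{i,t}, one per step.
    """
    ratio = torch.exp(logp_new - logp_old.detach())          # rho_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()

# Toy usage with fake per-step log-probabilities and advantages.
logp_new = torch.tensor([-1.0, -0.5, -2.0], requires_grad=True)
logp_old = torch.tensor([-1.1, -0.6, -1.8])
adv = torch.tensor([0.8, 0.3, -0.4])
loss = beacon_policy_loss(logp_new, logp_old, adv)
loss.backward()
print(loss.item())
```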
```
Algorithm 1: BEACON Training
Require: policy π_θ, milestone detector Φ, group size G, decay γ, weight λ
 1: for each iteration do
 2:     // Sample trajectories
 3:     Sample G trajectories {τ_i}_{i=1..G} using π_θ
 4:     for each trajectory τ_i do
 5:         // Detect milestones and partition
 6:         M_i ← {t : Φ(s_t, a_t, s_{t+1}) = 1}
 7:         Partition τ_i into {Seg_k^(i)}_{k=1..K_i+1} using M_i
 8:         // Compute shaped rewards
 9:         r_t ← 1[k ≤ K_i] · R_ms · γ^(t_k − t)   for each t ∈ Seg_k^(i)
10:     end for
11:     // Compute trajectory-level advantages
12:     μ ← (1/G) Σ_i R(τ_i),   σ ← std({R(τ_i)})
13:     A_i^traj ← (R(τ_i) − μ) / (σ + ε)   for all i
14:     // Compute segment-level advantages
15:     for k = 1, …, max_i K_i do
16:         G_k ← {i : K_i ≥ k}
17:         A_{i,t}^seg ← r_t − (1/|G_k|) Σ_{j∈G_k} R_k^(j) / |Seg_k^(j)|   for t ∈ Seg_k^(i), i ∈ G_k
18:     end for
19:     // Combine advantages and update policy
20:     Â_{i,t} ← A_i^traj + λ · A_{i,t}^seg   for each a_t ∈ Seg_k^(i)
21:     Update θ by maximizing J(θ)
22: end for
```
Table 1: Main Results. Performance comparison across benchmarks. By utilizing structural milestones, BEACON achieves state-of-the-art performance, showing particular robustness in Long-horizon tasks on ALFWorld. Values in parentheses denote BEACON's absolute improvement over GiGPO, the strongest baseline.

| Type | Method | ALFWorld Short | ALFWorld Medium | ALFWorld Long | ALFWorld Avg | SciWorld Score | SciWorld Succ | WebShop Score | WebShop Succ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Closed-Source Models** | | | | | | | | | |
| Prompting | GPT-4o (ReAct) | 71.4 | 33.7 | 49.8 | 48.0 | 54.3 | 45.4 | 31.8 | 23.7 |
| Prompting | Gemini-2.5-Pro (ReAct) | 84.8 | 50.7 | 58.7 | 60.3 | 47.8 | 36.7 | 42.5 | 35.9 |
| **Base: Qwen2.5-1.5B-Instruct** | | | | | | | | | |
| Prompting | Direct Prompt | 5.8 | 5.1 | 0.0 | 4.1 | 5.9 | 0.7 | 23.1 | 5.2 |
| Prompting | ReAct (Yao et al., [2023](https://arxiv.org/html/2605.06078#bib.bib2)) | 18.2 | 10.5 | 2.0 | 12.8 | 9.0 | 1.2 | 40.1 | 11.3 |
| Prompting | Reflexion (Shinn et al., [2023](https://arxiv.org/html/2605.06078#bib.bib47)) | 31.8 | 18.9 | 3.7 | 21.8 | 7.1 | 3.9 | 55.8 | 21.9 |
| RL Training | PPO (Schulman et al., [2017](https://arxiv.org/html/2605.06078#bib.bib14)) | 58.2 | 54.0 | 47.4 | 54.4 | 29.3 | 10.9 | 73.8 | 51.5 |
| RL Training | RLOO (Ahmadian et al., [2024a](https://arxiv.org/html/2605.06078#bib.bib48)) | 78.7 | 67.4 | 56.9 | 69.7 | – | – | 73.9 | 52.1 |
| RL Training | GRPO (Shao et al., [2024](https://arxiv.org/html/2605.06078#bib.bib13)) | 76.7 | 73.9 | 53.5 | 72.8 | 31.7 | 21.1 | 75.8 | 56.8 |
| RL Training | GiGPO (Feng et al., [2025](https://arxiv.org/html/2605.06078#bib.bib33)) | 90.7 | 84.3 | 79.5 | 86.1 | 35.6 | 25.8 | 83.1 | 65.0 |
| RL Training | **BEACON (Ours)** | 96.8 (+6.1) | 87.0 (+2.7) | 92.9 (+13.4) | 91.4 (+5.3) | 58.9 (+23.3) | 45.3 (+19.5) | 86.1 (+3.0) | 75.6 (+10.6) |
| **Base: Qwen2.5-7B-Instruct** | | | | | | | | | |
| Prompting | Direct Prompt | 30.2 | 10.3 | 3.2 | 14.8 | 11.4 | 4.2 | 26.4 | 7.8 |
| Prompting | ReAct (Yao et al., [2023](https://arxiv.org/html/2605.06078#bib.bib2)) | 45.0 | 23.4 | 17.6 | 31.2 | 17.4 | 7.8 | 46.2 | 19.5 |
| Prompting | Reflexion (Shinn et al., [2023](https://arxiv.org/html/2605.06078#bib.bib47)) | 56.5 | 38.4 | 23.8 | 42.7 | 23.4 | 11.7 | 58.1 | 28.8 |
| RL Training | PPO (Schulman et al., [2017](https://arxiv.org/html/2605.06078#bib.bib14)) | 84.6 | 87.3 | 68.8 | 80.4 | 37.1 | 24.0 | 81.4 | 68.7 |
| RL Training | RLOO (Ahmadian et al., [2024a](https://arxiv.org/html/2605.06078#bib.bib48)) | 85.1 | 80.2 | 48.9 | 75.5 | – | – | 80.3 | 65.7 |
| RL Training | GRPO (Shao et al., [2024](https://arxiv.org/html/2605.06078#bib.bib13)) | 84.1 | 79.7 | 64.7 | 77.6 | 61.8 | 49.1 | 79.3 | 66.1 |
| RL Training | GiGPO (Feng et al., [2025](https://arxiv.org/html/2605.06078#bib.bib33)) | 93.6 | 91.8 | 79.2 | 90.8 | 69.2 | 53.4 | 84.4 | 72.8 |
| RL Training | **BEACON (Ours)** | 95.1 (+1.5) | 94.9 (+3.1) | 90.0 (+10.8) | 94.5 (+3.7) | 83.7 (+14.5) | 64.3 (+10.9) | 87.7 (+3.3) | 79.7 (+6.9) |
## 4 Experiments
### 4.1 Experimental Setup
#### Benchmarks.
We evaluate on three long-horizon benchmarks: ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2605.06078#bib.bib1)), ScienceWorld (Wang et al., [2022](https://arxiv.org/html/2605.06078#bib.bib18)), and WebShop (Yao et al., [2022](https://arxiv.org/html/2605.06078#bib.bib17)). ALFWorld is a text-based embodied environment where agents complete household tasks (e.g., heating objects, cleaning items) through multi-step interaction, receiving only sparse terminal rewards upon task completion. WebShop is a web navigation environment with 1.18M products, requiring agents to search, filter, and purchase items matching natural language specifications through compositional understanding of product attributes. ScienceWorld is a text-based environment for scientific reasoning, spanning 30 task types across 10 domains, requiring agents to conduct virtual experiments (e.g., measuring melting points, testing electrical conductivity). See Appendix [C.1](https://arxiv.org/html/2605.06078#A3.SS1) for details.
#### Baselines.
We compare against baselines across paradigms: (1) Closed-source models: GPT-4o (Hurst et al., [2024](https://arxiv.org/html/2605.06078#bib.bib42)) and Gemini-2.5-Pro (Comanici et al., [2025](https://arxiv.org/html/2605.06078#bib.bib41)), evaluated under ReAct (Yao et al., [2023](https://arxiv.org/html/2605.06078#bib.bib2)) prompting as reference points for frontier model capabilities. (2) Prompting methods: ReAct, which guides multi-step reasoning through in-context chain-of-thought without training. (3) RL training methods: PPO (Schulman et al., [2017](https://arxiv.org/html/2605.06078#bib.bib14)), a standard actor-critic algorithm, and group-based approaches GRPO (Shao et al., [2024](https://arxiv.org/html/2605.06078#bib.bib13)) and GiGPO (Feng et al., [2025](https://arxiv.org/html/2605.06078#bib.bib33)), which estimate advantages over trajectory groups without learned critics.
#### Implementation.
We use Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct (Qwen et al., [2025](https://arxiv.org/html/2605.06078#bib.bib19)) as base models. For fair comparison, all RL methods use identical training configurations. BEACON-specific parameters ($\gamma=0.95$, $\lambda=1.0$) are fixed across all benchmarks without task-specific tuning. Full details are in Appendix [C.4](https://arxiv.org/html/2605.06078#A3.SS4).
### 4.2 Main Results
#### Overall Performance.
Table [1](https://arxiv.org/html/2605.06078#S3.SS5) presents results. BEACON achieves the highest success rate across all benchmarks and model scales. On ALFWorld with the 1.5B model, BEACON achieves a 91.4% average success rate, surpassing GiGPO (86.1%) by 5.3 points and GRPO (72.8%) by 18.6 points. On WebShop, BEACON achieves a 75.6% success rate compared to 65.0% for GiGPO and 56.8% for GRPO. On ScienceWorld, BEACON reaches 45.3% success versus 25.8% for GiGPO and 21.1% for GRPO. Scaling to Qwen2.5-7B yields consistent improvements: BEACON achieves 94.5% on ALFWorld and 79.7% on WebShop. Notably, even the 1.5B BEACON model outperforms closed-source baselines (GPT-4o: 48.0% on ALFWorld, 23.7% on WebShop), demonstrating that milestone-anchored credit assignment provides advantages that model scale alone cannot match. We provide a task-wise breakdown for ALFWorld in Appendix [B](https://arxiv.org/html/2605.06078#A2), showing consistent gains across all task types.
#### Horizon-Dependent Performance.
On ALFWorld with the 1.5B model, GRPO exhibits severe degradation as horizon extends: success rate drops from 76.7% on Short tasks to 53.5% on Long tasks, a 30% relative decline. GiGPO mitigates this partially (90.7% to 79.5%, a 12.4% relative decline) but still shows clear degradation. In contrast, BEACON maintains robust performance across horizons (96.8% Short, 87.0% Medium, 92.9% Long). Figure [5](https://arxiv.org/html/2605.06078#S4.F5)(b) illustrates this pattern on the 7B model through relative improvement over GRPO. On Short tasks, BEACON and GiGPO achieve comparable gains (+13% vs +11%). However, the gap widens as horizon extends: on Long tasks, BEACON reaches +39% while GiGPO remains at +22%. GiGPO relies on state recurrence for step-level grouping, which diminishes as policies improve and trajectories diversify. These results indicate that milestone-anchored credit assignment provides increasing benefit as task horizons extend.
### 4.3 Analysis
Figure 4: Sample Efficiency. Trajectory distribution during training on ALFWorld. Green: full successes; Orange: partial successes (complete ≥1 milestone but fail); Gray: complete failures.

#### Partial Successes Become Learning Signal.
We analyze sample efficiency by categorizing trajectories during training into three types: full successes (complete the task), partial successes (complete at least one milestone but fail the final task), and complete failures (achieve no milestone). Figure [4](https://arxiv.org/html/2605.06078#S4.F4) shows the distribution on ALFWorld (Qwen2.5-1.5B) across 150 training iterations. Under GRPO, 39% of trajectories at iteration 150 are partial successes that complete at least one milestone but receive zero reward. GiGPO reduces this to 28% through state-based grouping, but substantial signal remains discarded. BEACON's temporal reward shaping provides positive reward for milestone completion, reducing partial successes to 13%. Effective sample utilization improves from 23.7% to 82.0%, a 3.5× increase in trajectories providing useful gradient signal.
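As a reference for how these categories (and the resulting effective sample utilization) can be tallied, here is a small sketch; the `Rollout` record is a hypothetical container, not the paper's logging format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    success: bool         # terminal task success
    num_milestones: int   # number of milestones reached (K_i)

def categorize(rollouts: List[Rollout]):
    """Split rollouts into full successes, partial successes, and complete failures,
    and report the fraction carrying a learning signal under each reward scheme."""
    full = sum(r.success for r in rollouts)
    partial = sum((not r.success) and r.num_milestones >= 1 for r in rollouts)
    failure = len(rollouts) - full - partial
    # Under terminal-reward-only training, only full successes carry signal;
    # under milestone shaping, partial successes carry signal as well.
    return {
        "full": full, "partial": partial, "failure": failure,
        "utilization_terminal_only": full / len(rollouts),
        "utilization_with_milestones": (full + partial) / len(rollouts),
    }

batch = [Rollout(True, 3), Rollout(False, 2), Rollout(False, 0), Rollout(False, 1)]
print(categorize(batch))
```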
Figure 5: Learning Signal and Horizon Scaling. (a) Zero-Advantage Ratio during training. (b) Relative improvement over GRPO by task horizon.

#### Gradient Starvation.
We measure the Zero-Advantage Ratio (ZAR), defined as the fraction of samples receiving near-zero advantage during training. Figure [5](https://arxiv.org/html/2605.06078#S4.F5)(a) shows ZAR on ALFWorld. GRPO starts near 100% ZAR and decreases to around 55% by iteration 150, indicating that over half of samples provide no learning signal even after extended training. BEACON starts at 45% ZAR and rapidly decreases to approximately 10%, confirming that milestone-anchored credit assignment substantially alleviates gradient starvation by extracting signal from partial successes.
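The ZAR metric itself reduces to a thresholded mean over per-step advantages; a minimal sketch, with the near-zero tolerance chosen arbitrarily:

```python
import numpy as np

def zero_advantage_ratio(advantages, tol=1e-6):
    """Fraction of samples whose advantage magnitude is (near) zero."""
    advantages = np.asarray(advantages, dtype=float)
    return float(np.mean(np.abs(advantages) < tol))

# Under GRPO, a group whose rollouts all fail gets identical terminal rewards,
# so every normalized advantage is ~0 and ZAR = 1 for that group.
print(zero_advantage_ratio([0.0, 0.0, 0.0, 0.0]))   # 1.0
print(zero_advantage_ratio([0.9, -0.3, 0.0, 0.4]))  # 0.25
```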
#### Credit Concentration.
We compute the Credit Concentration Ratio (CCR), defined as the average advantage magnitude for milestone actions divided by that for non-milestone actions. CCR = 1 indicates uniform credit; CCR > 1 indicates concentration on milestones. Figure [6](https://arxiv.org/html/2605.06078#S4.F6)(a) shows CCR across methods on ALFWorld (Qwen2.5-1.5B). GiGPO exhibits the highest CCR (2.36), meaning milestone actions receive 2.36× more credit than non-milestone actions. GRPO shows moderate concentration (1.37). BEACON has the lowest CCR (0.84), indicating that non-milestone actions receive slightly more credit than milestone actions. Despite lower concentration, BEACON achieves the highest performance. This suggests that credit concentration penalizes intermediate actions necessary for reaching milestones. BEACON's temporal decay assigns graduated positive credit to all actions within successful segments, preserving signal for exploratory steps that enable milestone completion.
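CCR can likewise be computed as a ratio of mean absolute advantages over two index sets; the boolean milestone mask below is an assumed input format.

```python
import numpy as np

def credit_concentration_ratio(advantages, is_milestone_action):
    """CCR: mean |advantage| of milestone actions / mean |advantage| of the rest."""
    adv = np.abs(np.asarray(advantages, dtype=float))
    mask = np.asarray(is_milestone_action, dtype=bool)
    return float(adv[mask].mean() / adv[~mask].mean())

# CCR > 1 means credit is concentrated on milestone actions;
# CCR < 1 means intermediate actions receive comparable or larger credit.
adv = [0.2, 0.3, 1.0, 0.25, 0.9]
mask = [False, False, True, False, True]
print(round(credit_concentration_ratio(adv, mask), 2))  # 3.8
```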
Figure 6: Credit Distribution and Policy Optimization. (a) Credit Concentration Ratio across methods. Higher CCR indicates more aggressive concentration on milestone actions. (b) Comparison with behavior cloning (SFT on oracle trajectories).
#### Beyond Behavior Cloning.
A potential concern is whether BEACON degrades to behavior cloning given its use of milestone structure. Figure [6](https://arxiv.org/html/2605.06078#S4.F6)(b) compares BEACON against supervised fine-tuning (SFT) on oracle trajectories on ALFWorld (Qwen2.5-1.5B). Supervised fine-tuning on oracle trajectories achieves a 43% success rate. BEACON with $\gamma=0$ (milestone reward only) reaches 81%, demonstrating that milestone-anchored credit assignment alone enables the policy to discover strategies superior to the oracle. Introducing temporal decay ($\gamma=0.95$) further improves performance to 91.4%. This confirms that the milestone structure provides credit assignment anchors, but the policy discovers execution strategies superior to the oracle trajectories.
Figure 7: Training Dynamics. (a) Success rate. BEACON converges faster than GRPO. (b) Policy entropy evolution. BEACON exhibits smooth reduction indicating stable refinement.

Figure 8: Credit Assignment on Representative Trajectories. (a) Failed trajectory with intermediate milestones. (b) Successful trajectory with detours. GRPO assigns uniform credit to all actions; GiGPO produces counterintuitive assignments due to state-based grouping; BEACON credits milestone completions while appropriately penalizing errors and inefficient detours.
#### Training Dynamics.
Figure [7](https://arxiv.org/html/2605.06078#S4.F7) compares training dynamics on ALFWorld (Qwen2.5-1.5B). BEACON converges faster: it reaches a 60% success rate by iteration 50, while GRPO requires iteration 120 to reach the same threshold. This faster convergence is consistent with BEACON's improved sample utilization (23.7% to 82.0%), as more trajectories contribute useful gradient signal per batch. Figure [7](https://arxiv.org/html/2605.06078#S4.F7)(b) shows policy entropy. BEACON exhibits smooth entropy reduction, while GRPO maintains high entropy throughout. The contrast reflects the difference in gradient quality: BEACON receives consistent feedback from milestone completion, enabling steady policy refinement.
### 4.4 Ablation Study
Table 2: Ablation Study with Qwen2.5-1.5B-Instruct.

#### Trajectory Partitioning.
We evaluate degraded partitioning strategies on ALFWorld. Random partitioning (selecting 5 arbitrary positions as milestones) achieves 74.2%, slightly above GRPO (72.8%), indicating that segmentation structure itself provides modest benefit. With 50% milestone dropout, performance degrades gracefully to 82.8%, still outperforming GRPO by 10 points, indicating that BEACON tolerates imperfect milestone detection. Notably, the gap between random and full milestones (17.2 points) far exceeds the gap between GRPO and random (1.4 points), demonstrating that BEACON's gains stem primarily from exploiting task-inherent structure rather than segmentation alone.
#### Temporal Reward Shaping.
Removing temporal decay ($\gamma=0$) reduces performance from 91.4% to 81.2% on ALFWorld and from 75.6% to 62.1% on WebShop, yet still outperforms GRPO (72.8% and 56.8%). This confirms that milestone-anchored structure itself provides significant benefit, while temporal decay contributes additional gains by distinguishing action contributions within segments. Notably, uniform shaping ($\gamma=1$) performs worse than no shaping on ALFWorld (71.8% vs 81.2%): assigning equal credit to all actions obscures the distinction between critical and preparatory actions, producing misleading gradients.
#### Dual-Scale Advantage.
Removing segment-level advantage naturally degrades BEACON to GRPO (72.8% on ALFWorld, 56.8% on WebShop), establishing GRPO as the performance lower bound. Removing trajectory-level advantage produces different effects across benchmarks: severe degradation on ALFWorld (23.4%) but reasonable performance on WebShop (67.9%). This difference reflects task structure. On ALFWorld, segment-level optimization alone can reinforce actions that achieve intermediate milestones but lead to eventual task failure. Trajectory-level feedback provides necessary correction. On WebShop, milestone completion aligns more directly with task success, with segment-level feedback driving the primary improvement while trajectory-level feedback provides additional gains. The dual-scale formulation leverages both signals, achieving robust performance across diverse task structures.
### 4.5 Case Study
Figure [8](https://arxiv.org/html/2605.06078#S4.F8) presents credit assignment on two representative trajectories from ALFWorld. In the failed trajectory, the agent completes milestones S3 and S4 before failing. GRPO assigns uniform negative advantage ($A=-2.50$) to all actions. GiGPO produces counterintuitive credit: milestone S3 receives the lowest advantage ($A=-4.00$), because state-based grouping compares it against successful trajectories. BEACON credits milestones ($A=+0.51$) while penalizing errors. In the successful trajectory with an unnecessary detour at S4, GRPO assigns uniform positive advantage ($A=+7.50$). GiGPO rewards the detour most heavily ($A=+8.10$). BEACON penalizes the detour ($A=-1.10$) while crediting milestones. These examples illustrate how BEACON provides precise credit assignment that distinguishes productive actions from errors and inefficiencies.
## 5 Related Work
Our work relates to policy optimization for language models and credit assignment in reinforcement learning.
#### Policy Optimization for Language Models.
PPO (Schulman et al., [2017](https://arxiv.org/html/2605.06078#bib.bib14); Ouyang et al., [2022b](https://arxiv.org/html/2605.06078#bib.bib28)) is widely used for RLHF but requires a value network that struggles over long horizons. Critic-free methods such as DPO (Rafailov et al., [2023](https://arxiv.org/html/2605.06078#bib.bib29)), GRPO (Shao et al., [2024](https://arxiv.org/html/2605.06078#bib.bib13)), and RLOO (Ahmadian et al., [2024b](https://arxiv.org/html/2605.06078#bib.bib30)) eliminate this overhead and achieve strong results on reasoning tasks (Guo et al., [2025](https://arxiv.org/html/2605.06078#bib.bib31); Yu et al., [2025](https://arxiv.org/html/2605.06078#bib.bib15)). However, when applied to LLM agents for web navigation (Deng et al., [2023](https://arxiv.org/html/2605.06078#bib.bib5); Zhou et al., [2024a](https://arxiv.org/html/2605.06078#bib.bib21); Qi et al., [2024](https://arxiv.org/html/2605.06078#bib.bib26)), embodied control (Shridhar et al., [2021](https://arxiv.org/html/2605.06078#bib.bib1); Wang et al., [2022](https://arxiv.org/html/2605.06078#bib.bib18)), and tool use (Schick et al., [2023](https://arxiv.org/html/2605.06078#bib.bib4); Qin et al., [2023](https://arxiv.org/html/2605.06078#bib.bib22); Wang et al., [2025a](https://arxiv.org/html/2605.06078#bib.bib27); Zeng et al., [2024](https://arxiv.org/html/2605.06078#bib.bib24); Chen et al., [2023](https://arxiv.org/html/2605.06078#bib.bib25)), these trajectory-level methods assign identical credit to all actions regardless of individual contribution, causing performance degradation as task horizons extend. BEACON exploits semantic milestones inherent to agentic tasks, enabling segment-level comparison within trajectories.
#### Credit Assignment and Reward Shaping.
Existing approaches to finer-grained credit assignment introduce distinct limitations. Auxiliary model methods, including process reward models (Lightman et al., [2023](https://arxiv.org/html/2605.06078#bib.bib35); Wang et al., [2024](https://arxiv.org/html/2605.06078#bib.bib36)), utterance-level critics (Zhou et al., [2024b](https://arxiv.org/html/2605.06078#bib.bib32)), implicit reward models (Cui et al., [2025](https://arxiv.org/html/2605.06078#bib.bib37)), and co-evolving verifiers (Pan et al., [2026](https://arxiv.org/html/2605.06078#bib.bib50)), require expensive annotation, risk reward hacking (Gao et al., [2022](https://arxiv.org/html/2605.06078#bib.bib40)), or add training complexity. Monte Carlo methods (Kazemnejad et al., [2024](https://arxiv.org/html/2605.06078#bib.bib38)) avoid learned models but incur substantial sampling overhead from multiple rollouts per step. Structure-based methods such as GiGPO (Feng et al., [2025](https://arxiv.org/html/2605.06078#bib.bib33)) and RLVMR (Zhang et al., [2025b](https://arxiv.org/html/2605.06078#bib.bib39)) exploit repeated states or reasoning patterns for localized comparison, but depend on incidental structure that may be sparse in long-horizon tasks. BEACON instead anchors credit to milestones that directly reflect task progress, providing consistent segment-level comparison without auxiliary models, sampling overhead, or reliance on emergent trajectory patterns.
## 6 Conclusion
We introduced BEACON, a framework that addresses credit misattribution and sample inefficiency in trajectory-level policy optimization for long-horizon language agents. BEACON exploits the compositional structure of long-horizon tasks: milestones, observable state transitions indicating subgoal completion, exhibit an approximate Markov property that enables credit to be decoupled across segments. By partitioning trajectories at milestone boundaries, applying temporal reward shaping within segments, and estimating advantages at dual scales, BEACON isolates local action quality from downstream variance. Experiments on ALFWorld, WebShop, and ScienceWorld demonstrate improvements that amplify as task horizons extend, together with a substantial improvement in effective sample utilization. These results establish milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. We further discuss the limitations of BEACON and future directions in Appendix [D](https://arxiv.org/html/2605.06078#A4).
## Impact Statement
This paper presents work whose goal is to advance the training of language model agents for long-horizon tasks. The primary societal impact is enabling more capable autonomous agents that can assist humans in complex, multi-step tasks such as web navigation, household management, and scientific experimentation. While improved agent capabilities could increase productivity and accessibility, they also raise considerations around automation of tasks currently performed by humans. Our method does not introduce new capabilities beyond existing language models but rather improves the efficiency of training agents on tasks with sparse rewards. We do not anticipate specific negative societal consequences beyond those generally associated with advances in language model agents.
## References
- A. Ahmadian, C. Cremer, M. Gallé, M. Fadaee, J. Kreutzer, O. Pietquin, A. Üstün, and S. Hooker (2024a). Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. arXiv:2402.14740.
- A. Ahmadian, C. Cremer, M. Gallé, M. Fadaee, J. Kreutzer, O. Pietquin, A. Üstün, and S. Hooker (2024b). Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12248–12267.
- M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, et al. (2022). Do as I can, not as I say: Grounding language in robotic affordances. In Conference on Robot Learning.
- D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes (2023). Autonomous chemical research with large language models. Nature 624, pp. 570–578.
- A. M. Bran, S. Cox, A. D. White, and P. Schwaller (2023). ChemCrow: Augmenting large-language models with chemistry tools.
- B. Chen, C. Shu, E. Shareghi, N. Collier, K. Narasimhan, and S. Yao (2023). FireAct: Toward language agent fine-tuning. arXiv:2310.05915.
- G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, et al. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv:2507.06261.
- M. Côté, Á. Kádár, X. Yuan, B. A. Kybartas, T. Barnes, E. Fine, J. Moore, M. J. Hausknecht, L. E. Asri, M. Adada, W. Tay, and A. Trischler (2018). TextWorld: A learning environment for text-based games. In CGW@IJCAI.
- G. Cui, L. Yuan, Z. Wang, H. Wang, Y. Zhang, W. Li, et al. (2025). Process reinforcement through implicit rewards. arXiv:2502.01456.
- X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su (2023). Mind2Web: Towards a generalist agent for the web. arXiv:2306.06070.
- L. Feng, Z. Xue, T. Liu, and B. An (2025). Group-in-group policy optimization for LLM agent training. arXiv:2505.10978.
- L. Gao, J. Schulman, and J. Hilton (2022). Scaling laws for reward model overoptimization. In International Conference on Machine Learning.
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025). DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645(8081), pp. 633–638.
- W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine, K. Hausman, and B. Ichter (2022). Inner monologue: Embodied reasoning through planning with language models. arXiv:2207.05608.
- A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, et al. (2024). GPT-4o system card. arXiv:2410.21276.
- A. Kazemnejad, M. Aghajohari, E. Portelance, A. Sordoni, S. Reddy, A. Courville, and N. L. Roux (2024). VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment. arXiv:2410.01679.
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023). Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles.
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023). Let's verify step by step. arXiv:2305.20050.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, et al. (2022a). Training language models to follow instructions with human feedback. arXiv:2203.02155.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, et al. (2022b). Training language models to follow instructions with human feedback. arXiv:2203.02155.
- T. Pan, Y. Yan, Z. Wang, R. Zhang, G. Han, W. Zhang, W. Lu, J. Xiao, and Y. Shen (2026). CoVerRL: Breaking the consensus trap in label-free reasoning via generator-verifier co-evolution. arXiv:2603.17775.
- Z. Qi, X. Liu, I. L. Iong, H. Lai, X. Sun, X. Yang, J. Sun, Y. Yang, S. Yao, T. Zhang, et al. (2024). WebRL: Training LLM web agents via self-evolving online curriculum reinforcement learning. arXiv:2411.02337.
- Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al. (2023). ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv:2307.16789.
- Qwen Team: A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, et al. (2025). Qwen2.5 technical report. arXiv:2412.15115.
- R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn (2023). Direct preference optimization: Your language model is secretly a reward model. arXiv:2305.18290.
- T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023). Toolformer: Language models can teach themselves to use tools. arXiv:2302.04761.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.
- G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2024). HybridFlow: A flexible and efficient RLHF framework. arXiv:2409.19256.
- N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao (2023). Reflexion: Language agents with verbal reinforcement learning. arXiv:2303.11366.
- M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox (2020). ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10740–10749.
- M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. Hausknecht (2021). ALFWorld: Aligning text and embodied environments for interactive learning. In Proceedings of the International Conference on Learning Representations (ICLR).
- P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui (2024). Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9426–9439.
- R. Wang, P. Jansen, M. Côté, and P. Ammanabrolu (2022). ScienceWorld: Is your agent smarter than a 5th grader? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11279–11298.
- Z. Wang, K. Wang, Q. Wang, P. Zhang, L. Li, Z. Yang, et al. (2025a). RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv:2504.20073.
- Z. Wang, D. Li, H. Li, S. Chen, Y. Yan, W. Zhang, Y. Shen, W. Lu, J. Xiao, and Y. Zhuang (2025b). Omniear: Benchmarking agent reasoning in embodied tasks. arXiv:2508.05614.
- S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022). WebShop: Towards scalable real-world web interaction with grounded language agents. arXiv:2207.01206.
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023). ReAct: Synergizing reasoning and acting in language models. arXiv:2210.03629.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu, X\. Liu, H\. Lin, Z\. Lin, B\. Ma, G\. Sheng, Y\. Tong, C\. Zhang, M\. Zhang, W\. Zhang, H\. Zhu, J\. Zhu, J\. Chen, J\. Chen, C\. Wang, H\. Yu, Y\. Song, X\. Wei, H\. Zhou, J\. Liu, W\. Ma, Y\. Zhang, L\. Yan, M\. Qiao, Y\. Wu, and M\. Wang \(2025\)DAPO: an open\-source llm reinforcement learning system at scale\.External Links:2503\.14476,[Link](https://arxiv.org/abs/2503.14476)Cited by:[§5](https://arxiv.org/html/2605.06078#S5.SS0.SSS0.Px1.p1.1)\.
- A\. Zeng, M\. Liu, R\. Lu, B\. Wang, X\. Liu, Y\. Dong, and J\. Tang \(2024\)AgentTuning: enabling generalized agent abilities for LLMs\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 3053–3077\.External Links:[Link](https://aclanthology.org/2024.findings-acl.181/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.181)Cited by:[§5](https://arxiv.org/html/2605.06078#S5.SS0.SSS0.Px1.p1.1)\.
- G\. Zhang, H\. Geng, X\. Yu, Z\. Yin, Z\. Zhang, Z\. Tan, H\. Zhou, Z\. Li, X\. Xue, Y\. Li, Y\. Zhou, Y\. Chen, C\. Zhang, Y\. Fan, Z\. Wang, S\. Huang, F\. Piedrahita\-Velez, Y\. Liao, H\. Wang, M\. Yang, H\. Ji, J\. Wang, S\. Yan, P\. Torr, and L\. Bai \(2025a\)The landscape of agentic reinforcement learning for llms: a survey\.External Links:2509\.02547,[Link](https://arxiv.org/abs/2509.02547)Cited by:[§1](https://arxiv.org/html/2605.06078#S1.p1.1)\.
- Z\. Zhang, Z\. Chen, M\. Li, Z\. Tu, and X\. Li \(2025b\)RLVMR: reinforcement learning with verifiable meta\-reasoning rewards for robust long\-horizon agents\.External Links:2507\.22844,[Link](https://arxiv.org/abs/2507.22844)Cited by:[§5](https://arxiv.org/html/2605.06078#S5.SS0.SSS0.Px2.p1.1)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, Y\. Bisk, D\. Fried, U\. Alon,et al\.\(2023\)WebArena: a realistic web environment for building autonomous agents\.arXiv preprint arXiv:2307\.13854\.External Links:[Link](https://webarena.dev/)Cited by:[§1](https://arxiv.org/html/2605.06078#S1.p1.1)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, Y\. Bisk, D\. Fried, U\. Alon,et al\.\(2024a\)WebArena: a realistic web environment for building autonomous agents\.ICLR\.Cited by:[§5](https://arxiv.org/html/2605.06078#S5.SS0.SSS0.Px1.p1.1)\.
- Y\. Zhou, A\. Zanette, J\. Pan, S\. Levine, and A\. Kumar \(2024b\)ArCHer: training language model agents via hierarchical multi\-turn rl\.External Links:2402\.19446Cited by:[§5](https://arxiv.org/html/2605.06078#S5.SS0.SSS0.Px2.p1.1)\.
## Appendix A Theoretical Analysis
This appendix provides formal analysis supporting the design of BEACON, establishing that segment\-level advantages isolate local action quality from downstream variance\.
### A.1 Variance Isolation in Segment-Level Advantages
The foundation of BEACON’s credit assignment is the structural assumption that milestone states are approximately Markovian\.
###### Assumption A.1 (Milestone Markov Property).
For a milestone state $s_{t_k}$ reached at timestep $t_k$:

$$P(\mathrm{Seg}_{k+1},\ldots,\mathrm{Seg}_{K+1}\mid s_{t_k},\mathrm{Seg}_{1},\ldots,\mathrm{Seg}_{k})\approx P(\mathrm{Seg}_{k+1},\ldots,\mathrm{Seg}_{K+1}\mid s_{t_k}).\tag{10}$$
This assumption is natural for compositional tasks: once a subgoal is achieved \(e\.g\., an object is picked up\), subsequent success depends on completing remaining subgoals, not on how previous subgoals were achieved\.
###### Proposition A.2 (Variance Isolation).
Under Assumption [A.1](https://arxiv.org/html/2605.06078#A1.Thmtheorem1), for trajectories in the comparison group $\mathcal{G}_{k}=\{i: K_{i}\geq k\}$:

$$\mathrm{Cov}_{i\in\mathcal{G}_{k}}\!\left(A^{\mathrm{seg}}_{i,t},\,R^{(i)}_{k'}\right)\approx 0,\quad\forall i\in\mathcal{G}_{k},\ \forall t\in\mathrm{Seg}^{(i)}_{k},\ \forall k'>k.\tag{11}$$
###### Proof\.
For trajectories in $\mathcal{G}_{k}$, the per-step segment-level advantage is

$$A^{\mathrm{seg}}_{i,t}=r_{t}-\bar{b}_{k},\quad\text{where }\ \bar{b}_{k}=\frac{1}{|\mathcal{G}_{k}|}\sum_{j\in\mathcal{G}_{k}}\frac{R^{(j)}_{k}}{|\mathrm{Seg}^{(j)}_{k}|}.\tag{12}$$

For $t\in\mathrm{Seg}^{(i)}_{k}$, the shaped reward $r_{t}$ depends only on the position within segment $k$ (through $t^{(i)}_{k}-t$) and on the actions $\{a_{t'}: t'\in\mathrm{Seg}^{(i)}_{k}\}$, all of which occur before milestone $k$ is reached. For $k'>k$, the segment return $R^{(i)}_{k'}$ depends only on the actions $\{a_{t}: t\in\mathrm{Seg}^{(i)}_{k'}\}$, which occur after milestone $k$ is reached.

By Assumption [A.1](https://arxiv.org/html/2605.06078#A1.Thmtheorem1), conditioned on the milestone state $s_{t_k}$, the actions in segment $k'$ are independent of the actions in segment $k$:

$$\mathbb{E}\!\left[r_{t}\cdot R^{(i)}_{k'}\mid i\in\mathcal{G}_{k}\right]\approx\mathbb{E}\!\left[r_{t}\mid i\in\mathcal{G}_{k}\right]\cdot\mathbb{E}\!\left[R^{(i)}_{k'}\mid i\in\mathcal{G}_{k}\right].\tag{13}$$

Since $\bar{R}_{k}$ is constant over $\mathcal{G}_{k}$:

$$\mathrm{Cov}\!\left(A^{\mathrm{seg}}_{i,k},R^{(i)}_{k'}\right)=\mathrm{Cov}\!\left(R^{(i)}_{k}-\bar{R}_{k},\,R^{(i)}_{k'}\right)=\mathrm{Cov}\!\left(R^{(i)}_{k},R^{(i)}_{k'}\right)\approx 0.\qquad\blacksquare\tag{14}$$
This result establishes that segment-level advantages isolate local action quality from downstream variance: the gradient for actions in segment $k$ is not affected by outcomes in later segments, directly addressing credit misattribution.
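For concreteness, the following is a minimal Python sketch, not the released BEACON implementation, of the per-step segment-level advantage from Eq. (12), together with a quick Monte Carlo check that these advantages are approximately uncorrelated with later segment returns when segments are generated independently given the milestone state (the setting of Assumption A.1). The trajectory format (a list of segments, each a list of shaped per-step rewards) and all function names are illustrative assumptions.

```python
import random

def segment_advantages(group):
    """Per-step segment-level advantages A_seg = r_t - b_k (sketch of Eq. 12).

    `group` is a list of trajectories; each trajectory is a list of segments,
    and each segment is a list of shaped per-step rewards. Trajectories may
    reach different numbers of milestones, so segment k is baselined only over
    the comparison group G_k = {trajectories with at least k+1 segments}.
    """
    max_k = max(len(traj) for traj in group)
    advantages = [{} for _ in group]
    for k in range(max_k):
        members = [i for i, traj in enumerate(group) if len(traj) > k]
        # Baseline b_k: group mean of the segment's per-step return R_k / |Seg_k|.
        b_k = sum(sum(group[i][k]) / len(group[i][k]) for i in members) / len(members)
        for i in members:
            advantages[i][k] = [r - b_k for r in group[i][k]]
    return advantages

def covariance(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

if __name__ == "__main__":
    random.seed(0)
    # Synthetic group: segment rewards drawn independently given the milestone,
    # mimicking the Milestone Markov Property (Assumption A.1).
    group = [
        [[random.gauss(1.0, 0.3) for _ in range(4)],   # segment 1: shaped rewards
         [random.gauss(0.5, 0.5) for _ in range(5)]]   # segment 2: shaped rewards
        for _ in range(2000)
    ]
    adv = segment_advantages(group)
    a_first = [adv[i][0][0] for i in range(len(group))]  # advantage of an early action
    r_later = [sum(traj[1]) for traj in group]           # downstream segment return
    print(f"Cov(A_seg, R_later) = {covariance(a_first, r_later):+.4f}")  # close to zero
```

Running the script prints a covariance close to zero, which is the qualitative behavior Proposition A.2 predicts for segment-level baselines.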
### A.2 Discussion of Assumptions
The Milestone Markov Property \(Assumption[A\.1](https://arxiv.org/html/2605.06078#A1.Thmtheorem1)\) is central to the variance isolation guarantee\. This assumption holds well when milestone states encode complete subgoal achievement and future success depends primarily on remaining subgoals rather than execution details of past subgoals\.
The assumption may be approximate when resources carry across segments \(e\.g\., inventory limits\) or when execution efficiency affects future success \(e\.g\., time constraints\)\. However, even when the Markov property is only approximately satisfied, BEACON provides empirical benefits: partial successes still contribute gradient signal through shaped rewards, and segment\-level comparison reduces downstream variance even if it does not fully eliminate it\. The trajectory\-level advantage component maintains task alignment regardless of the Markov property\. The experimental results in Section[4](https://arxiv.org/html/2605.06078#S4)demonstrate substantial improvements on tasks where the assumption is only approximately satisfied\.
## Appendix B Task-wise Analysis on ALFWorld
Table 3: ALFWorld Task-wise Results. Success rate (%) on each task type.

| Type | Method | Pick | Look | Clean | Heat | Cool | Pick2 | All |
|---|---|---|---|---|---|---|---|---|
| **Closed-Source Models** | | | | | | | | |
| Prompting | GPT-4o | 75.3 | 60.8 | 31.2 | 56.7 | 21.6 | 49.8 | 48.0 |
| Prompting | Gemini-2.5-Pro | 92.8 | 63.3 | 62.1 | 69.0 | 26.6 | 58.7 | 60.3 |
| **Base: Qwen2.5-1.5B-Instruct** | | | | | | | | |
| Prompting | Direct Prompt | 5.9 | 5.5 | 3.3 | 9.7 | 4.2 | 0.0 | 4.1 |
| Prompting | ReAct | 17.4 | 20.5 | 15.7 | 6.2 | 7.7 | 2.0 | 12.8 |
| Prompting | Reflexion | 35.3 | 22.2 | 21.7 | 13.6 | 19.4 | 3.7 | 21.8 |
| RL Training | PPO | 64.8 | 40.5 | 57.1 | 60.6 | 46.4 | 47.4 | 54.4 |
| RL Training | RLOO | 88.3 | 52.8 | 71.0 | 62.8 | 66.4 | 56.9 | 69.7 |
| RL Training | GRPO | 85.3 | 53.7 | 84.5 | 78.2 | 59.7 | 53.5 | 72.8 |
| RL Training | GiGPO | 96.0 | 76.5 | 91.8 | 91.3 | 71.7 | 79.5 | 86.1 |
| RL Training | BEACON (Ours) | 100 | 88.2 | 86.7 | 100 | 78.9 | 92.9 | 91.4 |
| | Δ vs GRPO | +14.7 | +34.5 | +2.2 | +21.8 | +19.2 | +39.4 | +18.6 |
| **Base: Qwen2.5-7B-Instruct** | | | | | | | | |
| Prompting | Direct Prompt | 33.4 | 21.6 | 19.3 | 6.9 | 2.8 | 3.2 | 14.8 |
| Prompting | ReAct | 48.5 | 35.4 | 34.3 | 13.2 | 18.2 | 17.6 | 31.2 |
| Prompting | Reflexion | 62.0 | 41.6 | 44.9 | 30.9 | 36.3 | 23.8 | 42.7 |
| RL Training | PPO | 92.3 | 64.0 | 92.5 | 89.5 | 80.3 | 68.8 | 80.4 |
| RL Training | RLOO | 87.6 | 78.2 | 87.3 | 81.3 | 71.9 | 48.9 | 75.5 |
| RL Training | GRPO | 90.8 | 66.1 | 89.3 | 74.7 | 72.5 | 64.7 | 77.6 |
| RL Training | GiGPO | 97.7 | 82.7 | 98.8 | 83.7 | 89.3 | 79.2 | 90.8 |
| RL Training | BEACON (Ours) | 100 | 81.8 | 96.3 | 92.9 | 94.7 | 90.0 | 94.5 |
| | Δ vs GRPO | +9.2 | +15.7 | +7.0 | +18.2 | +22.2 | +25.3 | +16.9 |
We report the success rates of different methods across all six ALFWorld task types in Table [3](https://arxiv.org/html/2605.06078#A2). The table presents results for both Qwen2.5-1.5B and Qwen2.5-7B base models. BEACON consistently outperforms other methods at both model scales, with particularly strong gains over the strongest baseline, GiGPO, on Pick2 (+13 points on 1.5B, +11 points on 7B), which requires locating and picking up two separate objects and thus involves more milestones for credit assignment. Notably, BEACON-trained 1.5B models (91.4%) substantially outperform GPT-4o (48.0%) and Gemini-2.5-Pro (60.3%), demonstrating that task-specific training with proper credit assignment can surpass general-purpose large models.
## Appendix C Experimental Details
### C.1 Benchmark Descriptions
We evaluate BEACON on three diverse benchmarks spanning embodied reasoning, web navigation, and scientific experimentation\.
#### ALFWorld\.
ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2605.06078#bib.bib1)) is a text-based embodied reasoning benchmark that aligns TextWorld (Côté et al., [2018](https://arxiv.org/html/2605.06078#bib.bib43)) environments with ALFRED (Shridhar et al., [2020](https://arxiv.org/html/2605.06078#bib.bib44)) visual tasks. The benchmark comprises six task types: PICK (pick up an object), CLEAN (clean an object), HEAT (heat an object), COOL (cool an object), LOOK (examine an object under light), and PICK2 (pick up two objects). Tasks require agents to navigate household environments and manipulate objects through natural language commands. We use the standard train/validation/test split with 3,321/140/140 tasks, respectively. Following prior work, we stratify tasks by optimal trajectory length: Short ($L^{*}\leq 4$), Medium ($5\leq L^{*}\leq 7$), and Long ($L^{*}>7$).
#### WebShop\.
WebShop (Yao et al., [2022](https://arxiv.org/html/2605.06078#bib.bib17)) is a simulated e-commerce environment containing 1.18 million real-world products and 12,087 human instructions. Agents must navigate web pages through search, filtering, and clicking actions to purchase products matching natural language specifications. The benchmark tests compositional understanding of product attributes, including color, size, price constraints, and feature requirements. We use the standard evaluation protocol with 500 test instructions and report both Score (partial credit based on attribute matching) and Success Rate (binary task completion).
#### ScienceWorld\.
ScienceWorld (Wang et al., [2022](https://arxiv.org/html/2605.06078#bib.bib18)) presents 30 scientific reasoning tasks requiring agents to conduct virtual experiments, such as measuring melting points, testing electrical conductivity, and identifying life stages of organisms. Tasks involve long action sequences frequently exceeding 30 steps, with complex dependencies between sub-experiments. The environment provides explicit subgoal feedback that our milestone detector directly consumes. We report both Score (normalized progress) and Success Rate across all 30 task types.
### C.2 Diagnostic Metrics
We introduce two metrics to quantify credit assignment quality in policy optimization\.
#### Contradictory Action Ratio \(CAR\)\.
For a batch of trajectories, let $\mathcal{S}_{\text{shared}}$ denote the set of state-action pairs $(s,a)$ that appear in multiple trajectories. For each $(s,a)\in\mathcal{S}_{\text{shared}}$, let $A^{+}$ and $A^{-}$ denote the number of trajectories where this pair receives positive and negative advantages, respectively. The CAR is defined as:

$$\text{CAR}=\frac{1}{|\mathcal{S}_{\text{shared}}|}\sum_{(s,a)\in\mathcal{S}_{\text{shared}}}\mathbb{I}\left[A^{+}>0\land A^{-}>0\right],\tag{15}$$

where $\mathbb{I}[\cdot]$ is the indicator function. CAR measures the fraction of repeated state-action pairs receiving contradictory gradient signals.
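A small sketch of how CAR could be computed from logged rollouts is shown below; the `(traj_id, state, action, advantage)` record format and the function name are illustrative assumptions rather than the paper's tooling.

```python
from collections import defaultdict

def contradictory_action_ratio(step_records):
    """CAR (Eq. 15): fraction of shared (state, action) pairs that receive
    both positive and negative advantages across trajectories in a batch.

    `step_records` is an iterable of (traj_id, state, action, advantage)
    tuples, one per environment step.
    """
    pos_trajs = defaultdict(set)   # (s, a) -> trajectories giving it a positive advantage
    neg_trajs = defaultdict(set)   # (s, a) -> trajectories giving it a negative advantage
    seen_trajs = defaultdict(set)  # (s, a) -> all trajectories containing it
    for traj_id, state, action, adv in step_records:
        key = (state, action)
        seen_trajs[key].add(traj_id)
        if adv > 0:
            pos_trajs[key].add(traj_id)
        elif adv < 0:
            neg_trajs[key].add(traj_id)
    shared = [k for k, trajs in seen_trajs.items() if len(trajs) > 1]  # S_shared
    if not shared:
        return 0.0
    contradictory = sum(1 for k in shared if pos_trajs[k] and neg_trajs[k])
    return contradictory / len(shared)
```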
#### Effective Gradient Ratio \(EGR\)\.
For each state-action pair $(s,a)\in\mathcal{S}_{\text{shared}}$, let $g^{+}$ and $g^{-}$ denote the sums of positive and negative advantage magnitudes, respectively. The EGR is defined as:

$$\text{EGR}=\frac{\sum_{(s,a)\in\mathcal{S}_{\text{shared}}}\left|g^{+}-g^{-}\right|}{\sum_{(s,a)\in\mathcal{S}_{\text{shared}}}\left(g^{+}+g^{-}\right)}.\tag{16}$$

EGR measures the proportion of gradient magnitude that survives cancellation from contradictory signals. An EGR of 1.0 indicates fully consistent gradients, while lower values indicate greater cancellation.
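A companion sketch for EGR, under the same assumed record format as the CAR example above:

```python
from collections import defaultdict

def effective_gradient_ratio(step_records):
    """EGR (Eq. 16): fraction of advantage magnitude on shared (state, action)
    pairs that survives cancellation between positive and negative signals.

    `step_records` uses the same (traj_id, state, action, advantage) format
    as the CAR sketch above.
    """
    pos_mag = defaultdict(float)   # (s, a) -> g^+  (sum of positive advantages)
    neg_mag = defaultdict(float)   # (s, a) -> g^-  (sum of |negative advantages|)
    seen_trajs = defaultdict(set)  # (s, a) -> trajectories containing the pair
    for traj_id, state, action, adv in step_records:
        key = (state, action)
        seen_trajs[key].add(traj_id)
        if adv > 0:
            pos_mag[key] += adv
        elif adv < 0:
            neg_mag[key] += -adv
    shared = [k for k, trajs in seen_trajs.items() if len(trajs) > 1]  # S_shared
    total = sum(pos_mag[k] + neg_mag[k] for k in shared)
    if total == 0:
        return 1.0
    surviving = sum(abs(pos_mag[k] - neg_mag[k]) for k in shared)
    return surviving / total
```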
### C.3 Implementation Details
All experiments are conducted using the veRL framework (Sheng et al., [2024](https://arxiv.org/html/2605.06078#bib.bib45)) with vLLM (Kwon et al., [2023](https://arxiv.org/html/2605.06078#bib.bib46)) for efficient inference. We use 8 NVIDIA A100 80GB GPUs for training. Gradient checkpointing is enabled to reduce memory consumption. The reference model uses CPU parameter offloading, while the actor model remains fully on GPU. Training for 150 iterations takes approximately 10 hours for ALFWorld and ScienceWorld, and 8 hours for WebShop.
For all group\-based methods \(GRPO, GiGPO, BEACON\), we use identical base configurations to ensure fair comparison\. The only differences are in the advantage computation mechanisms specific to each method\. All experiments use a fixed random seed \(seed=0\)\. Evaluation is conducted on 128 samples per checkpoint\.
### C.4 Hyperparameters
Table 4: Hyperparameters. BEACON-specific parameters control milestone-anchored credit assignment; other parameters are shared across all group-based methods (GRPO, GiGPO, BEACON) for fair comparison. For environment-specific values, we report ALFWorld / WebShop / ScienceWorld.

| Category | Hyperparameter | Symbol | Value |
|---|---|---|---|
| BEACON-specific | Segment advantage weight | $\lambda$ | 1.0 |
| | Temporal decay factor | $\gamma$ | 0.95 |
| Optimization | Learning rate | – | $1\times 10^{-6}$ |
| | PPO clip ratio | $\epsilon$ | 0.2 |
| | Gradient clip norm | – | 1.0 |
| | Entropy coefficient | – | 0.001 |
| | KL penalty coefficient | $\beta$ | 0.01 |
| Batch configuration | Prompts per iteration | – | 16 |
| | Rollouts per prompt | $G$ | 8 |
| | PPO mini-batch size | – | 256 |
| Sequence | Max prompt length | – | 7000 |
| | Max response length | – | 512 |
| | Temperature (train / eval) | – | 1.0 / 0.4 |
| Environment | Max steps per episode | $T$ | 30 / 15 / 30 |
| | Total training iterations | – | 150 |

Table [4](https://arxiv.org/html/2605.06078#A3.T4) presents the hyperparameters used in our experiments. BEACON-specific parameters are listed separately from the general training parameters shared across all methods.
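For reference, the values in Table 4 map onto a flat configuration object along the following lines; the key names below are our own illustration and do not correspond to veRL's actual configuration schema.

```python
# Illustrative configuration mirroring Table 4; key names are ours, not veRL's.
BEACON_CONFIG = {
    # BEACON-specific
    "segment_advantage_weight": 1.0,      # lambda
    "temporal_decay_factor": 0.95,        # gamma
    # Optimization
    "learning_rate": 1e-6,
    "ppo_clip_ratio": 0.2,                # epsilon
    "gradient_clip_norm": 1.0,
    "entropy_coefficient": 0.001,
    "kl_penalty_coefficient": 0.01,       # beta
    # Batch configuration
    "prompts_per_iteration": 16,
    "rollouts_per_prompt": 8,             # G
    "ppo_mini_batch_size": 256,
    # Sequence
    "max_prompt_length": 7000,
    "max_response_length": 512,
    "temperature_train": 1.0,
    "temperature_eval": 0.4,
    # Environment (ALFWorld / WebShop / ScienceWorld)
    "max_steps_per_episode": {"alfworld": 30, "webshop": 15, "scienceworld": 30},
    "total_training_iterations": 150,
}
```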
## Appendix D Limitations and Future Work
#### Milestone Detection\.
BEACON relies on a task\-specific milestone detectorΦ\\Phithat identifies subgoal completions from environment feedback\. In our experiments, milestones are extracted through pattern matching on environment responses \(ALFWorld\), page transitions \(WebShop\), or explicit subgoal signals \(ScienceWorld\)\. This approach requires domain knowledge to design appropriate detectors and may not generalize to environments without clear subgoal structure or verifiable state transitions\. Developing automated milestone discovery methods, potentially through learning or leveraging large language models to identify semantically meaningful progress, remains an important open problem\.
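To illustrate the kind of pattern matching involved, the sketch below shows a hypothetical milestone detector for ALFWorld-style feedback and how it could mark segment boundaries; the regular expressions, feedback strings, and function names are our own assumptions, not the detector Φ used by BEACON.

```python
import re
from typing import Optional

# Hypothetical patterns for ALFWorld-style environment feedback; the actual
# detector Phi used in the paper may differ.
MILESTONE_PATTERNS = [
    ("pick", re.compile(r"^You pick up the (\w+)")),
    ("clean", re.compile(r"^You clean the (\w+)")),
    ("heat", re.compile(r"^You heat the (\w+)")),
    ("cool", re.compile(r"^You cool the (\w+)")),
    ("put", re.compile(r"^You put the (\w+)")),
]

def detect_milestone(observation: str) -> Optional[str]:
    """Return a milestone label if the environment feedback signals a
    completed subgoal, otherwise None. Used to mark segment boundaries."""
    for label, pattern in MILESTONE_PATTERNS:
        match = pattern.match(observation)
        if match:
            return f"{label}:{match.group(1)}"
    return None

def split_into_segments(observations):
    """Partition a rollout's observations into segments at milestone boundaries."""
    segments, current = [], []
    for obs in observations:
        current.append(obs)
        if detect_milestone(obs) is not None:
            segments.append(current)
            current = []
    if current:  # trailing steps after the last milestone
        segments.append(current)
    return segments
```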
#### Milestone Granularity\.
The effectiveness of BEACON depends on milestones occurring at an appropriate granularity\. If milestones are too sparse, BEACON approaches trajectory\-level optimization; if too dense, the segment\-level advantages may become noisy\. Our experiments use naturally occurring task milestones without tuning granularity, but optimal milestone density likely varies across tasks\. Investigating adaptive or hierarchical milestone structures could further improve performance\.
#### Benchmark Scope\.
We evaluate BEACON on three benchmarks spanning embodied reasoning, web navigation, and scientific experimentation. While these cover diverse agent capabilities, all involve discrete action spaces and text-based interaction. The applicability of milestone-anchored credit assignment to continuous control, multi-agent settings, or tasks with less compositional structure remains unexplored.