HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning
Summary
Introduces HIPIF, a method for training LLM agents to handle long-horizon tasks by hierarchical planning and information folding to reduce long-context interference, achieving strong results on three benchmarks.
# HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning
Source: [https://arxiv.org/html/2606.10507](https://arxiv.org/html/2606.10507)
Juncheng Diao1,2∗, Zhicong Lu2∗†, Peiguang Li1, Yongwei Zhou1, Changyuan Tian2, Qingbin Li2, Rongxiang Weng1, Jingang Wang1, Xunliang Cai1

1 Meituan, 2 University of Chinese Academy of Sciences

diaojuncheng24@mails\.ucas\.ac\.cn, luzhicong21@mails\.ucas\.ac\.cn

∗ Equal contribution\. † Corresponding author\.
###### Abstract
While Large Language Models \(LLMs\) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performance often degrades in multi\-turn long\-horizon agentic tasks\. Existing methods have made progress through fine\-grained credit assignment to alleviate long\-horizon sparse rewards and hierarchical reinforcement learning to decompose tasks and reduce long\-term dependency\. However, these methods still do not directly address long\-context interference, in which continuously growing histories weaken the agent’s ability to track the global task state and impair subsequent reasoning and decision\-making\. Inspired by the way humans handle complex tasks through subgoal decomposition and completed progress summarization, we propose Hierarchical Planning and Information Folding \(HIPIF\) for long\-horizon LLM agent learning\. HIPIF trains the agent end\-to\-end to organize long\-horizon execution around explicit subgoals while folding completed subgoal histories to reduce long\-context interference\. Furthermore, to stabilize subgoal\-based planning and execution, HIPIF combines hierarchical reflection and subgoal\-oriented process rewards to guide subgoal generation, transition, and execution, without relying on costly auxiliary models or task\-specific expert trajectories\. Extensive experiments on three publicly available agentic benchmarks demonstrate the validity of our method\.
## 1 Introduction
Large Language Models \(LLMs\), with their ever\-growing reasoning and planning abilities, have become a promising foundation for long\-horizon agentic decision\-making tasks, where an agent must accomplish a high\-level goal through multi\-turn interaction with the environment \[[30](https://arxiv.org/html/2606.10507#bib.bib1),[40](https://arxiv.org/html/2606.10507#bib.bib2),[24](https://arxiv.org/html/2606.10507#bib.bib19),[33](https://arxiv.org/html/2606.10507#bib.bib22),[17](https://arxiv.org/html/2606.10507#bib.bib55)\]\. Despite this potential, and in contrast to their success in single\-step tasks, existing LLM agents remain far from satisfactory in complex long\-horizon interactions\. As noted by STEP\-HRL \[[44](https://arxiv.org/html/2606.10507#bib.bib32)\], a key limitation of existing LLM agents \[[2](https://arxiv.org/html/2606.10507#bib.bib6),[15](https://arxiv.org/html/2606.10507#bib.bib12)\] is their reliance on an ever\-growing observation\-action history for each decision\. In long\-horizon interactions, the continuously growing context accumulates redundant information that weakens the agent’s ability to track the global task state and impairs subsequent reasoning and decision\-making \[[46](https://arxiv.org/html/2606.10507#bib.bib31)\]\.
Existing methods have made preliminary attempts to address this challenge\. Prompt\-based methods\[[40](https://arxiv.org/html/2606.10507#bib.bib2),[39](https://arxiv.org/html/2606.10507#bib.bib26),[22](https://arxiv.org/html/2606.10507#bib.bib18),[11](https://arxiv.org/html/2606.10507#bib.bib10),[6](https://arxiv.org/html/2606.10507#bib.bib40)\]and behavior cloning methods\[[3](https://arxiv.org/html/2606.10507#bib.bib7),[9](https://arxiv.org/html/2606.10507#bib.bib3)\]mainly rely on prompt engineering or expert trajectories to elicit reasoning, planning, reflection or context\-folding abilities in LLM agents\. However, they are not optimized through environmental feedback, which limits their adaptability across diverse environments and long\-horizon interactions\. In contrast, reinforcement learning \(RL\) methods improve long\-horizon agents from an optimization perspective by using environmental feedback to provide more fine\-grained and reliable reward signals\. For example, credit assignment methods\[[31](https://arxiv.org/html/2606.10507#bib.bib39),[32](https://arxiv.org/html/2606.10507#bib.bib63),[5](https://arxiv.org/html/2606.10507#bib.bib44),[4](https://arxiv.org/html/2606.10507#bib.bib8),[13](https://arxiv.org/html/2606.10507#bib.bib33)\]alleviate sparse\-reward challenges in long\-horizon tasks through more precise step\-level supervision, while hierarchical RL methods\[[7](https://arxiv.org/html/2606.10507#bib.bib9),[43](https://arxiv.org/html/2606.10507#bib.bib30),[44](https://arxiv.org/html/2606.10507#bib.bib32),[16](https://arxiv.org/html/2606.10507#bib.bib37)\]reduce long\-term dependency through task decomposition\. Nevertheless, many existing RL methods rely on additional models for task decomposition or process\-reward annotation, increasing pipeline complexity and limiting scalability across environments\. 
More importantly, these methods rarely train the model to organize and fold ever\-growing contexts and therefore cannot fundamentally resolve the state\-tracking failure and reasoning degradation caused by long\-context interference\.
Inspired by the way humans handle long\-horizon tasks through subgoal decomposition and completed progress summarization, we propose Hierarchical Planning and Information Folding \(HIPIF\) for long\-horizon LLM agent learning\. HIPIF trains the agent end\-to\-end to organize long\-horizon execution around explicit subgoals and fold the execution histories of completed subgoals, thereby reducing long\-context interference\. To stabilize subgoal\-based planning and execution, HIPIF introduces hierarchical reflection to improve subgoal transition judgment and guide either subgoal proposal or current subgoal execution\. Furthermore, to alleviate sparse rewards in long\-horizon subgoal\-based training, HIPIF introduces subgoal\-oriented process rewards to correct inappropriate subgoals and ineffective execution behaviors within subgoals\.
Extensive experimental results on three publicly available agentic benchmarks and case studies demonstrate the effectiveness of HIPIF\. Further efficiency analyses show that HIPIF achieves lower token usage in long\-horizon interactions while avoiding task\-specific expert trajectories and additional auxiliary models\. In summary, our main contributions are as follows\.
- We propose Hierarchical Planning and Information Folding \(HIPIF\) for long\-horizon LLM agent learning, which trains the model to organize long\-horizon execution around explicit subgoals and fold the histories of completed subgoals to reduce long\-context interference\.
- To stabilize subgoal\-based planning and execution, we introduce hierarchical reflection and subgoal\-oriented process rewards to improve subgoal completion judgment, subgoal content assessment, and subgoal execution correction\.
- Extensive experimental results on three publicly available agentic benchmarks, efficiency analyses, and case studies demonstrate the effectiveness and efficiency of HIPIF\.
## 2 Related Work
#### LLM Agents\.
Large Language Models \(LLMs\) have been widely used as agents in interactive decision\-making tasks \[[29](https://arxiv.org/html/2606.10507#bib.bib21),[30](https://arxiv.org/html/2606.10507#bib.bib1),[40](https://arxiv.org/html/2606.10507#bib.bib2)\]\. Early studies primarily adopt prompt\-based formulations, where agents externalize intermediate decision processes to support multi\-step decision\-making, such as Chain\-of\-Thought \[[35](https://arxiv.org/html/2606.10507#bib.bib23)\], ReAct \[[40](https://arxiv.org/html/2606.10507#bib.bib2)\], and Reflexion \[[22](https://arxiv.org/html/2606.10507#bib.bib18)\]\. To improve LLM agents in long\-horizon tasks, several methods introduce memory mechanisms \[[42](https://arxiv.org/html/2606.10507#bib.bib29),[36](https://arxiv.org/html/2606.10507#bib.bib25),[19](https://arxiv.org/html/2606.10507#bib.bib16)\]\. For example, HiAgent \[[6](https://arxiv.org/html/2606.10507#bib.bib40)\] uses prompts to guide subgoal decomposition and history folding\. However, these mechanisms usually rest on hand\-crafted prompts or system designs without environmental feedback, and are thus unreliable in complex long\-horizon tasks\. Another line of work learns agent policies from expert trajectories through behavior cloning or supervised fine\-tuning \[[3](https://arxiv.org/html/2606.10507#bib.bib7),[9](https://arxiv.org/html/2606.10507#bib.bib3)\]\. However, such methods depend heavily on task\-specific expert trajectories, which are costly to collect and scale poorly across environments\.
#### Reinforcement Learning in LLM Agents\.
Reinforcement learning \(RL\) provides a mechanism for optimizing LLM agents through environmental interaction and reward feedback\[[37](https://arxiv.org/html/2606.10507#bib.bib4),[34](https://arxiv.org/html/2606.10507#bib.bib65),[15](https://arxiv.org/html/2606.10507#bib.bib12)\]\. Existing work applies PPO\[[20](https://arxiv.org/html/2606.10507#bib.bib17)\], GRPO\[[21](https://arxiv.org/html/2606.10507#bib.bib57)\], RLOO\[[1](https://arxiv.org/html/2606.10507#bib.bib54)\], or preference\-based optimization\[[18](https://arxiv.org/html/2606.10507#bib.bib56)\]to LLM agents, enabling the model to improve its behavior from environment signals\. There are also studies that focus on fine\-grained reward assignment for long\-horizon agent training\[[31](https://arxiv.org/html/2606.10507#bib.bib39),[32](https://arxiv.org/html/2606.10507#bib.bib63),[5](https://arxiv.org/html/2606.10507#bib.bib44)\], since final task rewards are often sparse and delayed\. These methods provide more localized training signals through turn\-level process reward models or step\-level advantage estimation, with representative examples including GiGPO\[[4](https://arxiv.org/html/2606.10507#bib.bib8)\]and HiSR\[[14](https://arxiv.org/html/2606.10507#bib.bib53)\]\. However, they mainly improve credit assignment within trajectories while still making decisions based on the full observation\-action history\. As a result, these methods still lack explicit task\-stage organization and context management, and therefore cannot fundamentally mitigate the reasoning degradation caused by long contexts\. Recent work also trains memory or context compression mechanisms with RL\[[10](https://arxiv.org/html/2606.10507#bib.bib36),[19](https://arxiv.org/html/2606.10507#bib.bib16)\], such as FoldGRPO\[[26](https://arxiv.org/html/2606.10507#bib.bib34)\], A\-Mem\[[36](https://arxiv.org/html/2606.10507#bib.bib25)\], and AgentFold\[[41](https://arxiv.org/html/2606.10507#bib.bib35)\]\. 
These methods recognize that memory writing, retrieval, or context compression can be optimized through reinforcement learning\. Nevertheless, they mainly focus on compressing long contexts rather than systematically improving decision reliability in long\-horizon agents\. In addition, hierarchical RL methods introduce hierarchical structures for long\-horizon tasks by decomposing complex goals into subgoals and optimizing policies accordingly \[[7](https://arxiv.org/html/2606.10507#bib.bib9),[43](https://arxiv.org/html/2606.10507#bib.bib30),[45](https://arxiv.org/html/2606.10507#bib.bib73)\]\. For example, HiPER \[[16](https://arxiv.org/html/2606.10507#bib.bib37)\] focuses on subgoal proposal and subgoal\-level credit assignment, while STEP\-HRL \[[44](https://arxiv.org/html/2606.10507#bib.bib32)\] improves long\-horizon agent training from the perspectives of subgoal modeling and context compression\. These methods demonstrate the value of subgoals for complex interactive tasks\. However, many existing methods still rely on auxiliary models or task\-specific expert trajectories for subgoal generation, context compression, or critic estimation, which increases training pipeline complexity and limits scalability across environments\.
## 3 Methodology
In this section, we present the overall design of HIPIF\. As illustrated in Figure[1](https://arxiv.org/html/2606.10507#S3.F1)\(a\), HIPIF adopts end\-to\-end training for hierarchical planning and information folding\. To stabilize subgoal\-based planning and execution, Figure[1](https://arxiv.org/html/2606.10507#S3.F1)\(b\) introduces a hierarchical reflection mechanism\. Finally, Figure[1](https://arxiv.org/html/2606.10507#S3.F1)\(c\) shows the subgoal\-oriented process rewards for both subgoal generation and execution within the subgoal\. The complete training pipeline is summarized in Algorithm[1](https://arxiv.org/html/2606.10507#alg1)\.
Figure 1: Overview of the design of HIPIF\. \(a\) End\-to\-end training for hierarchical planning and information folding\. \(b\) Hierarchical reflection\. \(c\) Subgoal\-oriented process rewards\.

### 3\.1 End\-to\-End Training for Hierarchical Planning and Information Folding
To reduce long\-context interference while making subgoal\-based execution trainable, we introduce Subgoal\-Level Information Folding and GRPO Training for Subgoal\-Centric Decisions\.
Hierarchical Planning and Information Folding\. In conventional multi\-turn agent tasks, an LLM agent typically follows a history\-conditioned formulation\. At interaction step $t$, the policy conditions on the full accumulated trajectory:

$$\tau_t = (c, o_1, a_1, o_2, a_2, \ldots, o_t), \tag{1}$$

where $c$ denotes the task, $o_t$ is the observation returned by the environment, and $a_t$ is the model response, usually a thought\-action pair in the style of ReAct \[[40](https://arxiv.org/html/2606.10507#bib.bib2)\]\. In long\-horizon tasks, full interaction histories continuously accumulate redundant information\. Such context noise weakens the agent’s awareness of the current task stage and degrades its decision\-making ability\.
We draw inspiration from the way humans handle long\-horizon tasks, in which they decompose complex objectives into subgoals and summarize completed progress\. For instance, in PICK2 tasks from ALFWorld, once the subgoal of moving the first object has been completed, the agent should fold the corresponding execution history of the first object and focus on moving the second object\. Retaining the full execution history of the first object may instead introduce context interference and confuse the model’s subsequent decisions\.
Motivated by this observation, HIPIF organizes long\-horizon interaction around explicit subgoals and folds the execution histories of completed subgoals\. The model first proposes an initial subgoal according to the task description\. Given the current subgoal, the model then repeatedly generates actions to execute it, while the environment returns a new observation after each action\. Once the model judges that the current subgoal has been completed or should be terminated, HIPIF folds the execution history of this subgoal and proposes the next subgoal\. At each decision step, HIPIF maintains a compact working context by combining folded global progress with detailed local execution history\. Formally, at step $j$ of the current subgoal $g_k$, the context provided to the policy is:

$$C_{k,j} = [c;\ \mathcal{H}_{<k};\ g_k;\ \mathcal{T}_{k,j}], \tag{2}$$

where $c$ is the task description, $\mathcal{H}_{<k}$ denotes the folded records of completed subgoals before $g_k$, and $\mathcal{T}_{k,j}$ denotes the action\-observation history within $g_k$ up to step $j$\. Detailed implementations and examples are provided in Appendix [E](https://arxiv.org/html/2606.10507#A5)\.
End\-to\-End Training for Subgoal\-Centric Decisions\. Using prompts alone for subgoal decomposition and history folding is insufficient, as the model lacks feedback\-driven training on what subgoals to propose and how to execute them reliably\. Therefore, HIPIF treats subgoal\-centric decision\-making as a learnable policy behavior and optimizes it through environmental feedback\. Given the folded working context $C_{k,j}$ and the reflection $\xi_{k,j}$ from Section [3\.2](https://arxiv.org/html/2606.10507#S3.SS2), the model generates the next decision:

$$y_{k,j} \sim \pi_\theta(\cdot \mid C_{k,j}, \xi_{k,j}). \tag{3}$$

Specifically, when the model decides to continue the current subgoal, the next action $a_{k,j}$ is extracted from the `action` field and executed under $g_k$; when the model decides to terminate the current subgoal, the new subgoal $g_{k+1}$ and its first action $a_{k+1,1}$ are extracted from the `subgoal` and `action` fields, respectively\. Once the transition happens, the completed subgoal $g_k$ is folded into a compact record $[g_k, o_k^{\mathrm{end}}]$ and appended to the folded history\.
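The bookkeeping behind Eq. (2) and the folding step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the class name `FoldedContext`, its method names, the text layout produced by `render`, and the example task are all assumptions made here, and the sketch separates subgoal transition from action generation for simplicity, whereas HIPIF emits the new subgoal and its first action in one response.

```python
from dataclasses import dataclass, field


@dataclass
class FoldedContext:
    """Working context C_{k,j} = [c; H_{<k}; g_k; T_{k,j}] from Eq. (2)."""
    task: str                                    # c: task description
    folded: list = field(default_factory=list)   # H_{<k}: (subgoal, end observation)
    subgoal: str = ""                            # g_k: current subgoal
    local: list = field(default_factory=list)    # T_{k,j}: (action, observation)

    def step(self, action, observation):
        """Record one action-observation pair under the current subgoal."""
        self.local.append((action, observation))

    def transition(self, next_subgoal):
        """Fold the finished subgoal into [g_k, o_k^end], then start g_{k+1}."""
        if self.subgoal:
            end_obs = self.local[-1][1] if self.local else ""
            self.folded.append((self.subgoal, end_obs))
        self.subgoal = next_subgoal
        self.local = []  # detailed history of the finished subgoal is dropped

    def render(self):
        """Serialize the context as prompt text (layout is hypothetical)."""
        lines = [f"Task: {self.task}"]
        lines += [f"Done: {g} -> {o}" for g, o in self.folded]
        lines.append(f"Current subgoal: {self.subgoal}")
        lines += [f"  {a} -> {o}" for a, o in self.local]
        return "\n".join(lines)


# Hypothetical two-object episode in the spirit of the PICK2 example.
ctx = FoldedContext(task="put two soapbars in the cabinet")
ctx.transition("move the first soapbar to the cabinet")
ctx.step("take soapbar 1 from countertop 1", "You pick up the soapbar 1.")
ctx.step("put soapbar 1 in cabinet 1", "You put the soapbar 1 in cabinet 1.")
ctx.transition("move the second soapbar to the cabinet")
print(ctx.render())
```

After the second `transition`, the first subgoal survives only as a single folded record, so the prompt length grows with the number of completed subgoals rather than with the number of environment steps.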
To train these behaviors, HIPIF adopts GRPO with the subgoal\-oriented process rewards defined in Section [3\.3](https://arxiv.org/html/2606.10507#S3.SS3)\. For each task instruction, we sample a group of $M$ trajectories from the old policy, recording the context at each step together with the generated reflections, actions, and possible subgoals\. Following verl\-agent \[[4](https://arxiv.org/html/2606.10507#bib.bib8)\], instead of assigning a single trajectory\-level advantage to all decisions, we compute step\-level returns by combining the final task outcome with the subgoal\-oriented process rewards, and then normalize these returns within the sampled group following the procedure in Section [3\.3](https://arxiv.org/html/2606.10507#S3.SS3), which yields a step\-level advantage $\hat{A}_t^{(m)}$ for each decision step\. The policy is then optimized with the clipped GRPO objective:

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = -\mathbb{E}_{m,t}\left[\min\left(\frac{\pi_\theta(\zeta_t^{(m)} \mid C_t^{(m)})}{\pi_{\theta_{\mathrm{old}}}(\zeta_t^{(m)} \mid C_t^{(m)})}\hat{A}_t^{(m)},\; \mathrm{clip}\left(\frac{\pi_\theta(\zeta_t^{(m)} \mid C_t^{(m)})}{\pi_{\theta_{\mathrm{old}}}(\zeta_t^{(m)} \mid C_t^{(m)})},\, 1-\epsilon_{\mathrm{clip}},\, 1+\epsilon_{\mathrm{clip}}\right)\hat{A}_t^{(m)}\right)\right], \tag{4}$$

where $\pi_\theta$ and $\pi_{\theta_{\mathrm{old}}}$ denote the current and old policies, respectively\. $\zeta_t^{(m)}$ denotes the complete structured decision sequence generated autoregressively at decision step $t$, including the rationale, completion judgment, branch\-specific reflection, and the final action or subgoal decision\. All generated response tokens in $\zeta_t^{(m)}$ are optimized jointly under the GRPO objective\.
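To make the clipping behaviour of Eq. (4) concrete, the following sketch evaluates the objective on plain Python floats. It is an illustration under stated assumptions, not the training code: the function name `clipped_grpo_loss` is hypothetical, each decision step is represented by a single sequence-level log-probability (practical implementations typically work token by token on tensors), and the default `eps_clip=0.2` is a commonly used value that the text above does not specify.

```python
import math


def clipped_grpo_loss(logp_new, logp_old, advantages, eps_clip=0.2):
    """Mean clipped surrogate loss over decision steps, following Eq. (4).

    logp_new / logp_old: log pi_theta(zeta | C) and log pi_theta_old(zeta | C)
    advantages:          step-level advantages A_hat
    eps_clip:            clipping range (assumed default)
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                    # importance ratio
        clipped = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
        terms.append(min(ratio * adv, clipped * adv))        # pessimistic bound
    return -sum(terms) / len(terms)


# With a positive advantage, a ratio above 1 + eps_clip earns no extra credit.
loss_pos = clipped_grpo_loss([0.5], [0.0], [1.0])
# With a negative advantage, the unclipped (larger) penalty is kept.
loss_neg = clipped_grpo_loss([0.5], [0.0], [-1.0])
```

Taking the minimum of the clipped and unclipped terms means the update is bounded when it would increase the probability of an already favoured decision, but not when it would reduce a penalty.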
### 3\.2 Hierarchical Reflection in Execution
Although the previous section formulates subgoal proposal and context folding as part of the RL training process, this formulation also introduces new challenges: in the early stage of agent training, it is often difficult for the model to simultaneously propose meaningful subgoals, determine when to switch subgoals, and execute the current subgoal\. While external models or annotated trajectories can provide additional guidance, they incur high costs and limit scalability across environments\.
To address this issue, we propose a hierarchical reflection mechanism during rollout, motivated by both temporal abstraction in hierarchical decision\-making\[[27](https://arxiv.org/html/2606.10507#bib.bib75)\]and Reflexion\[[22](https://arxiv.org/html/2606.10507#bib.bib18)\]\. Reflection enables the agent to explicitly assess task progress and diagnose failure patterns before making the next decision\. By embedding this reflective process into a temporally abstracted subgoal structure, HIPIF further turns reflection into a control mechanism for subgoal termination, transition, and execution\.
Specifically, following temporal abstraction in hierarchical decision\-making \[[27](https://arxiv.org/html/2606.10507#bib.bib75)\], we treat each subgoal $g_k$ as a stage\-level control unit\. Rather than corresponding to a single atomic action, a subgoal specifies the agent’s local objective and constrains action generation over multiple interaction steps\. After each execution step, the model first assesses whether the current subgoal has been completed\. Let $\mathcal{H}_{<k}$ denote the folded history before subgoal $g_k$, $h_{k,t}$ denote the action\-observation history within the current subgoal up to step $t$, and $o_t$ denote the current observation\. The reflection module then generates a completion judgment together with its reasoning process:

$$(\eta_{k,t}, z_{k,t}) \sim \pi_\theta\left(\cdot \mid \mathcal{H}_{<k}, g_k, h_{k,t}, o_t\right), \tag{5}$$

where $z_{k,t} \in \{0,1\}$ indicates whether the current subgoal is completed, and $\eta_{k,t}$ denotes the reasoning process generated by the model\. Following the intuition of chain\-of\-thought reasoning \[[35](https://arxiv.org/html/2606.10507#bib.bib23)\], the reasoning process encourages the model to examine the evidence behind its completion judgment and improves the accuracy of subgoal\-state assessment compared with directly predicting a binary label\.
Based on this completion judgment, HIPIF branches into different generation modes:
$$\xi_{k,t} \sim \begin{cases} \pi_\theta\left(\cdot \mid \mathcal{G}_{\leq k}, \eta_{k,t}\right), & z_{k,t} = 1, \\ \pi_\theta\left(\cdot \mid g_k, h_{k,t}, o_t, \eta_{k,t}\right), & z_{k,t} = 0, \end{cases} \tag{6}$$

where $\xi_{k,t}$ denotes the agent’s reflection for the next output, which concerns either the next subgoal $g_{k+1}$ or the next action $a_t$ depending on the completion judgment $z_{k,t}$\. Here, $\mathcal{G}_{\leq k} = \{g_1, \ldots, g_k\}$ is the sequence of previously folded subgoals\. When $z_{k,t} = 1$, the model reflects on the folded subgoal history $\mathcal{G}_{\leq k}$ to propose the next subgoal $g_{k+1}$, which helps it avoid redundant subgoals and identify the current state\. When $z_{k,t} = 0$, the model reflects on the current subgoal $g_k$ and its execution history $h_{k,t}$ to identify an effective action, which encourages it to avoid invalid attempts\. These reflections serve as the reasoning basis for the decisions in Section [3\.1](https://arxiv.org/html/2606.10507#S3.SS1)\. Detailed implementations are provided in Appendix [E](https://arxiv.org/html/2606.10507#A5)\.
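The control flow of Eqs. (5) and (6) amounts to a judgment followed by a branch on what the reflection is conditioned on. The sketch below shows only that routing. The function `hierarchical_reflection` and the two callables `judge` and `reflect` are hypothetical stand-ins for model calls, and splitting them into separate calls is a simplification made here: in HIPIF the judgment, reflection, and decision are generated within one autoregressive response.

```python
def hierarchical_reflection(judge, reflect, folded, subgoal, local_history, observation):
    """Route reflection according to the completion judgment.

    judge(folded, subgoal, local_history, observation) -> (eta, z)   # Eq. (5)
    reflect(conditioning: dict) -> xi                                # Eq. (6)
    folded: list of (subgoal, end_observation) records, i.e. H_{<k}
    """
    eta, z = judge(folded, subgoal, local_history, observation)
    if z == 1:
        # Subgoal completed: condition on the subgoal sequence G_{<=k}
        # to propose g_{k+1} without repeating earlier subgoals.
        conditioning = {
            "subgoals": [g for g, _ in folded] + [subgoal],
            "reasoning": eta,
        }
    else:
        # Subgoal ongoing: condition on local execution details
        # to pick an effective next action under g_k.
        conditioning = {
            "subgoal": subgoal,
            "history": local_history,
            "observation": observation,
            "reasoning": eta,
        }
    return eta, z, reflect(conditioning)
```

Note that the completed branch sees only subgoal descriptions, not step-level history, which mirrors the conditioning sets in Eq. (6).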
### 3\.3 Fine\-Grained and Subgoal\-oriented Process Rewards
RL training for subgoal proposal and information folding faces sparse\-reward challenges\. A failed rollout may result from an inappropriate subgoal or ineffective execution under the current subgoal, while the final task reward alone struggles to distinguish these errors\. Therefore, we design subgoal\-oriented process rewards to provide more localized supervision for both subgoal content and subgoal execution\. To avoid extra computational overhead and reduce the risk of reward hacking, we adopt rule\-based process rewards that penalize only steps clearly identified as erroneous from environment feedback\. Detailed implementations and analysis are provided in Appendix[B](https://arxiv.org/html/2606.10507#A2)and[D](https://arxiv.org/html/2606.10507#A4)\.
#### Subgoal Content Reward\.
The first type of process reward evaluates subgoal content\. Since the model is trained without expert subgoal annotations, it may generate subgoals that are not grounded in the current environment, making them difficult to execute and potentially misleading subsequent action generation\. Therefore, we propose $r_t^{\mathrm{gr}}$ to penalize subgoals that refer to objects or receptacles absent from the available environment context\. In addition, we identify unreliable subgoals in successful trajectories\. If a trajectory eventually succeeds, it should contain a sequence of subgoals that can support task completion\. Therefore, for successful trajectories, we further apply $r_t^{\mathrm{term}}$ to penalize subgoals whose terminal observation indicates execution failure, such as “Nothing happens”, thereby exposing erroneous subgoals that may otherwise be masked by eventual task success\.
#### Subgoal Execution Reward\.
The second type of process reward targets errors within the execution of a subgoal\. Even when a subgoal is reasonable, the agent may still fail to execute it effectively\. A common failure pattern is a loop within subgoal execution, where the agent repeatedly produces the same action and receives the same observation under the current subgoal\. We therefore define the execution penalty $r_t^{\mathrm{exec}}$ on such repeated action\-observation pairs under the same subgoal, which indicate that the agent is not making progress\. In addition, we introduce a format penalty $r_t^{\mathrm{fmt}}$ to ensure valid structured outputs, penalizing the model when it omits or mismatches a required tag\.
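Because these penalties are rule-based, they can be computed directly from a logged rollout. The sketch below illustrates three of them under explicit assumptions: the penalty magnitude `PENALTY`, the XML-style `<action>` tag format, the choice to attach the termination penalty to the last step of the offending subgoal, and all function names are hypothetical, since the concrete values and rules are deferred to the appendix. The grounding penalty is omitted because it depends on environment-specific entity matching.

```python
import re

PENALTY = -0.1  # hypothetical magnitude; not taken from the paper


def exec_penalties(steps):
    """steps: list of (subgoal_index, action, observation).
    Penalize a step that repeats an earlier (action, observation) pair
    under the same subgoal, i.e. a loop with no progress."""
    seen, out = set(), []
    for k, action, obs in steps:
        key = (k, action, obs)
        out.append(PENALTY if key in seen else 0.0)
        seen.add(key)
    return out


def term_penalties(steps, success, failure_marker="Nothing happens"):
    """In successful trajectories only, penalize the final step of any
    subgoal whose terminal observation signals execution failure."""
    out = [0.0] * len(steps)
    if not success:
        return out
    for i, (k, _, obs) in enumerate(steps):
        last_of_subgoal = i + 1 == len(steps) or steps[i + 1][0] != k
        if last_of_subgoal and failure_marker in obs:
            out[i] = PENALTY
    return out


def fmt_penalty(response, tag="action"):
    """Penalize a response lacking a well-formed <tag>...</tag> pair."""
    ok = re.search(rf"<{tag}>.*?</{tag}>", response, flags=re.DOTALL)
    return 0.0 if ok else PENALTY


rollout = [
    (0, "go to desk 1", "You arrive at desk 1."),
    (0, "open drawer 9", "Nothing happens."),
    (0, "open drawer 9", "Nothing happens."),   # repeated pair under subgoal 0
    (1, "take pen 1 from desk 1", "You pick up the pen 1."),
]
loop_pen = exec_penalties(rollout)
term_pen = term_penalties(rollout, success=True)
```

All three rules only subtract reward from steps that the environment feedback marks as clearly wrong, which is the conservative design the section describes.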
#### Process Reward Assignment\.
After defining the subgoal\-content and execution\-related penalties, we assign process feedback at the step level\. The total process reward at step $t$ is defined as

$$r_t^{\mathrm{proc}} = r_t^{\mathrm{gr}} + r_t^{\mathrm{term}} + r_t^{\mathrm{exec}} + r_t^{\mathrm{fmt}}. \tag{7}$$

We then combine this local process feedback with the final task outcome to construct a step\-level training score\. Given the terminal environment reward $R_{\mathrm{env}}$, the score for step $t$ is defined as

$$S_t = R_{\mathrm{env}} + r_t^{\mathrm{proc}}, \tag{8}$$

where $S_t$ serves as a lightweight scoring signal that broadcasts the trajectory\-level success outcome to each decision step while applying local penalties to clearly erroneous subgoal or execution behaviors\.

For policy optimization, we follow the group\-relative normalization used in GiGPO \[[4](https://arxiv.org/html/2606.10507#bib.bib8)\]\. For the same instruction, we sample a group of trajectories and compute the step\-level scores for all stored decision steps\. The normalized step\-level advantage is computed as

$$\hat{A}_t = \frac{S_t - \mu_S}{\sigma_S + \epsilon}, \tag{9}$$

where $\mu_S$ and $\sigma_S$ are the group mean and standard deviation of step\-level scores, and $\epsilon$ is a small constant for numerical stability\.
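Equations (8) and (9) can be traced on a toy group of two trajectories. The sketch below is illustrative only: the function name `step_advantages`, pooling the statistics over every step of every trajectory in the group, the use of the population standard deviation, and `eps=1e-6` are assumptions made here rather than details confirmed by the text.

```python
import statistics


def step_advantages(group, eps=1e-6):
    """group: list of (R_env, [r_proc for each step]) for one instruction.
    Returns per-trajectory lists of normalized step-level advantages."""
    # Eq. (8): broadcast the terminal reward to each step, add local penalties.
    scores = [[r_env + r for r in procs] for r_env, procs in group]
    flat = [s for traj in scores for s in traj]
    mu = statistics.fmean(flat)
    sigma = statistics.pstdev(flat)   # population std over the pooled steps
    # Eq. (9): group-relative normalization.
    return [[(s - mu) / (sigma + eps) for s in traj] for traj in scores]


# One successful trajectory with a penalized second step, one failed trajectory.
adv = step_advantages([(1.0, [0.0, -0.1]), (0.0, [0.0, 0.0])])
```

In this toy group, the penalized step of the successful trajectory still receives a positive advantage, only a smaller one than its unpenalized neighbour, so the process reward refines rather than overrides the task outcome.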
## 4 Experiments
### 4\.1 Experimental Settings
Benchmarks\. To systematically evaluate the effectiveness of the proposed method, we conduct experiments on three publicly available interactive agent benchmarks: ALFWorld \[[24](https://arxiv.org/html/2606.10507#bib.bib19)\], VirtualHome \[[17](https://arxiv.org/html/2606.10507#bib.bib55)\], and ScienceWorld \[[33](https://arxiv.org/html/2606.10507#bib.bib22)\]\. Specifically, for ALFWorld, following prior work \[[31](https://arxiv.org/html/2606.10507#bib.bib39),[32](https://arxiv.org/html/2606.10507#bib.bib63),[4](https://arxiv.org/html/2606.10507#bib.bib8)\], we adopt the dataset version constructed by Song et al\. \[[25](https://arxiv.org/html/2606.10507#bib.bib58)\]\. For VirtualHome, we start from the version provided by Wang et al\. \[[31](https://arxiv.org/html/2606.10507#bib.bib39),[32](https://arxiv.org/html/2606.10507#bib.bib63)\] and further correct clearly erroneous examples\. For ScienceWorld, we use the same experimental setting as previous work \[[44](https://arxiv.org/html/2606.10507#bib.bib32)\]\. Across all benchmarks, at each interaction step, the agent receives an observation from the environment and generates the next action accordingly\. The interaction continues until the task is successfully completed or a predefined maximum number of steps is reached, after which the environment returns the final task outcome\. Additional details on the benchmarks are provided in Appendix [A](https://arxiv.org/html/2606.10507#A1)\.
Baselines\. For the three benchmarks, we compare our approach with a range of competitive baselines: \(1\) Closed\-source LLMs: GPT\-4o \[[8](https://arxiv.org/html/2606.10507#bib.bib47)\] and Gemini\-2\.5\-Pro \[[28](https://arxiv.org/html/2606.10507#bib.bib60)\], which represent powerful capabilities in interactive reasoning and decision\-making\. \(2\) Prompting agents: ReAct \[[40](https://arxiv.org/html/2606.10507#bib.bib2)\], Reflexion \[[22](https://arxiv.org/html/2606.10507#bib.bib18)\], and HiAgent \[[6](https://arxiv.org/html/2606.10507#bib.bib40)\], which rely on in\-context prompting to guide test\-time multi\-turn behavior without parameter updates\. \(3\) Credit\-assignment\-oriented RL training methods: PPO \[[20](https://arxiv.org/html/2606.10507#bib.bib17)\] uses a learned value function for advantage estimation, while RLOO \[[1](https://arxiv.org/html/2606.10507#bib.bib54)\] and GRPO \[[21](https://arxiv.org/html/2606.10507#bib.bib57)\] estimate relative advantages from grouped samples without an additional critic; GiGPO \[[4](https://arxiv.org/html/2606.10507#bib.bib8)\] and RL\-GCD \[[12](https://arxiv.org/html/2606.10507#bib.bib38)\] further introduce step\-level advantage estimation for finer\-grained credit assignment\. \(4\) Hierarchical RL methods: HiPER \[[16](https://arxiv.org/html/2606.10507#bib.bib37)\] focuses on subgoal\-level credit assignment for multi\-turn agentic RL\. GLIDER \[[7](https://arxiv.org/html/2606.10507#bib.bib9)\] and STEP\-HRL \[[44](https://arxiv.org/html/2606.10507#bib.bib32)\] combine subgoal modeling with supervised fine\-tuning and offline RL\. HiAgent\+GRPO \[[6](https://arxiv.org/html/2606.10507#bib.bib40),[21](https://arxiv.org/html/2606.10507#bib.bib57)\] is included to assess the effect of applying GRPO \[[21](https://arxiv.org/html/2606.10507#bib.bib57)\] to a prompt\-based subgoal framework\.
Figure 2: Validation success\-rate curves of 3B models on three benchmarks: \(a\) ALFWorld, \(b\) VirtualHome, \(c\) ScienceWorld\.

Training details\. We use Qwen2\.5\-3B\-Instruct and Qwen2\.5\-7B\-Instruct \[[38](https://arxiv.org/html/2606.10507#bib.bib74)\] as our base models\. All RL experiments are implemented with the verl\-agent framework \[[4](https://arxiv.org/html/2606.10507#bib.bib8)\]\. For a fair comparison, all RL\-based methods, including our method and the RL baselines, use the same hyperparameter configuration following GiGPO \[[4](https://arxiv.org/html/2606.10507#bib.bib8)\]\. All experiments are conducted on 8 NVIDIA A100 80GB GPUs\. Full training settings and hyperparameter details are provided in Appendix [B](https://arxiv.org/html/2606.10507#A2)\.
Table 1: Evaluation results on three benchmarks, where all reported values are success rates\. Avg denotes the average score on ALFWorld\. PE and RL denote prompt\-engineering and reinforcement\-learning methods, respectively\. Best and runner\-up results are marked in bold and underline, respectively\.
### 4\.2 Experimental Results
Table [1](https://arxiv.org/html/2606.10507#S4.T1) reports the overall performance of different methods on three embodied agent benchmarks\. We observe the following\.

\(1\) Limitations of advanced closed\-source models\. Even advanced closed\-source models such as GPT\-4o and Gemini\-2\.5\-Pro struggle with long\-horizon tasks when using standard ReAct prompting and HiAgent prompting, revealing that model scale and general reasoning ability are insufficient to address context interference and goal forgetting in multi\-step interactions\.

\(2\) Zero\-shot prompting remains insufficient\. ReAct and Reflexion achieve average scores of only 17\.2 and 37\.1 on ALFWorld, respectively\. HiAgent adopts a subgoal decomposition and context folding paradigm, but obtains an even lower score than ReAct and Reflexion\. This indicates that base models cannot reliably perform hierarchical planning and information folding without training\.

\(3\) Bottlenecks of credit\-assignment\-oriented RL methods\. Credit\-assignment\-oriented RL improves over prompting baselines, but remains limited by the lack of explicit subgoal structure and context management\. GiGPO improves fine\-grained credit assignment through step\-level advantage comparison and performs best among credit\-assignment\-oriented RL baselines, yet it still underperforms HIPIF across all three benchmarks\. The gap is especially clear on complex tasks such as PICK2, where HIPIF improves over GiGPO from 85\.7 to 95\.2\. This suggests that credit assignment alone cannot address long\-context\-induced degradation in reasoning and decision\-making; its complementarity to our framework is further analyzed in Appendix [G](https://arxiv.org/html/2606.10507#A7)\.

\(4\) Challenges in hierarchical RL methods\. Existing hierarchical RL methods explicitly introduce hierarchical structures, but still suffer from several limitations\. HiPER models subgoals with subgoal\-level credit assignment, but lacks effective context management and hierarchical reflection\. STEP\-HRL depends on expert trajectories and external models for subgoal generation and context compression, which substantially increases training and inference costs\. Despite these additional resources, it still underperforms HIPIF on all three benchmarks\. Although HiAgent\+GRPO strengthens subgoal decomposition and folding with RL, its lower performance than HIPIF across all benchmarks suggests that effective long\-horizon execution also requires hierarchical reflection and fine\-grained rewards for subgoal transitions and execution\.

\(5\) Overall advantages of HIPIF\. Compared with all baselines, HIPIF achieves the best performance\. Figure [2](https://arxiv.org/html/2606.10507#S4.F2) further reveals two key observations about HIPIF\. First, learning to organize long\-horizon interaction at the subgoal level and to compress the completed execution of subgoals can substantially improve model performance\. Second, subgoal\-based training may underperform in the early stage because it introduces a more complex decision process\. The growing advantage of HIPIF over the other two methods during training demonstrates that hierarchical reflection and fine\-grained process\-level rewards effectively stabilize subgoal\-based RL training\.
Table 2: Ablation studies of our proposed method across different model scales on three embodied agent benchmarks\. Best results within each model size group are marked in bold\.
### 4\.3 Ablation Study
Table [2](https://arxiv.org/html/2606.10507#S4.T2) reports the ablation results of HIPIF across different architectures and model scales\. Here, w/o Reflection removes the hierarchical reflection mechanism, w/o Reward removes the subgoal\-oriented process rewards and uses only the final task success signal for training, and w/o Subgoal removes the explicit subgoal decomposition and context folding structure\.
Ablation on model architectures\. From Table [2](https://arxiv.org/html/2606.10507#S4.T2), we conclude that: \(1\) Removing the subgoal structure \(w/o Subgoal\) leads to the most significant performance drop, particularly on complex tasks such as PICK2, indicating that subgoal\-level planning and context folding are fundamental to HIPIF\. \(2\) Removing the hierarchical reflection mechanism \(w/o Reflection\) substantially degrades performance, showing that hierarchical reflection helps the agent assess subgoal progress and correct subgoal\-level failures without costly expert trajectories or auxiliary models\. \(3\) Removing subgoal\-oriented process rewards \(w/o Reward\) results in noticeable performance drops, suggesting that final task\-level feedback alone provides limited guidance for producing reliable subgoal content and executing subgoals\.
Analysis across model scales\. We further examine the effect of model scale using 3B and 7B backbones\. As shown in Table [2](https://arxiv.org/html/2606.10507#S4.T2), HIPIF consistently achieves the best results under both settings, indicating that subgoal\-centric training remains effective as the base model becomes stronger\. Notably, the 3B HIPIF already outperforms the 7B variant without subgoal structure across all three benchmarks\. This suggests that, for long\-horizon decision\-making, explicitly learning to organize subgoals matters more than simply increasing model size\.
### 4\.4 Analysis on Efficiency
Figure 3: Token efficiency comparison\.
Figure 4: Per\-step token consumption\.
Token Efficiency\. Table [4](https://arxiv.org/html/2606.10507#S4.F4) evaluates context efficiency using two metrics: average completion steps and average input tokens per trajectory\. HIPIF achieves the lowest cost across all three benchmarks, showing that it reduces both the number of interaction steps and the accumulated context tokens\. Figure [4](https://arxiv.org/html/2606.10507#S4.F4) further compares the per\-step token consumption of HIPIF and its ablated variants on the same ALFWorld task\. We observe that: \(1\) w/o Subgoal keeps accumulating tokens as the interaction history grows, whereas HIPIF compresses completed subgoals into compact states, leading to lower token usage after each subgoal is finished\. \(2\) Although reflection adds a small token overhead, it helps the model better judge whether a subgoal has been completed, thereby avoiding unnecessary subsequent steps\. \(3\) w/o Reward receives weaker guidance on how to execute each subgoal, making it more likely to repeat ineffective actions and consume more tokens within a subgoal\.
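The accumulation effect described in point \(1\) can be illustrated with a toy calculation\. The sketch below is not the paper's measurement code: the function name, the fixed number of tokens per step, and the summary size are all hypothetical values chosen only to show the shape of the two curves\.

```python
from typing import List


def context_tokens_per_step(steps_per_subgoal: List[int], step_tokens: int,
                            summary_tokens: int, fold: bool) -> List[int]:
    """History tokens visible at each step (history accumulated before that step).

    All quantities are hypothetical: every step adds `step_tokens` to the
    history, and folding replaces a finished subgoal's records with a
    summary of `summary_tokens`.
    """
    sizes = []
    folded = 0  # tokens held by summaries of completed subgoals
    live = 0    # tokens held by unfolded action-observation records
    for n_steps in steps_per_subgoal:
        for _ in range(n_steps):
            sizes.append(folded + live)
            live += step_tokens
        if fold:
            # Subgoal finished: keep only a compact summary of it.
            folded += summary_tokens
            live = 0
    return sizes


unfolded = context_tokens_per_step([3, 3], step_tokens=100, summary_tokens=20, fold=False)
folded = context_tokens_per_step([3, 3], step_tokens=100, summary_tokens=20, fold=True)
print(unfolded)  # [0, 100, 200, 300, 400, 500]
print(folded)    # [0, 100, 200, 20, 120, 220]
```

Under these assumed numbers the unfolded history grows linearly across the whole trajectory, while the folded history drops back to the summary size each time a subgoal is completed\.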
Figure 5: Pipeline Efficiency\.
Pipeline Efficiency\. We further compare the pipeline efficiency of different methods\. As shown in Table [5](https://arxiv.org/html/2606.10507#S4.F5.2), HIPIF and GRPO are lightweight, requiring neither task\-specific expert trajectories nor additional models\. By contrast, HiPER needs an extra critic model, while GLIDER and STEP\-HRL rely on task\-specific expert trajectories and auxiliary models for subgoal generation or context compression\. Together, Table [1](https://arxiv.org/html/2606.10507#S4.T1) and Table [5](https://arxiv.org/html/2606.10507#S4.F5.2) show that HIPIF achieves strong effectiveness, high efficiency, and environmental scalability\.
Table 3: Case study on a long\-horizon PICK2 task in ALFWorld, illustrating the effect of HIPIF\.
### 4\.5 Case Study
We conduct a case study on ALFWorld to illustrate the effect of our core designs, as shown in Table [3](https://arxiv.org/html/2606.10507#S4.T3)\. Concretely, we observe that: \(1\) Subgoal decomposition and context folding help HIPIF maintain a clear task structure under long contexts\. After placing the first toiletpaper, HIPIF correctly identifies the current stage as finding the second one and continues planning under this subgoal\. In contrast, w/o Subgoal reasons over an unstructured long action history, loses track of the completed progress, and incorrectly returns to the already handled toiletpaper 1 with invalid actions\. \(2\) Hierarchical reflection enables reliable subgoal progress assessment and transition\. HIPIF determines from the current observation that the second toiletpaper has not been found, and therefore continues the search instead of switching stages\. By contrast, w/o Reflection keeps the subgoal history but lacks explicit completion checking, causing a premature transition to the next subgoal and eventual failure\. Additional case studies on ScienceWorld and VirtualHome are presented in Appendix [I](https://arxiv.org/html/2606.10507#A9)\.
## 5 Conclusions
In this paper, we proposed Hierarchical Planning and Information Folding \(HIPIF\) to improve long\-horizon decision\-making for LLM agents without relying on task\-specific expert trajectories or auxiliary models\. HIPIF trains the agent end\-to\-end to plan explicit subgoals and fold the completed execution of subgoals, thus reducing long\-context interference\. To ensure reliable subgoal\-based execution, HIPIF further introduces hierarchical reflection and subgoal\-oriented process rewards to guide subgoal assessment, transition, and execution\. Empirical results on three publicly available benchmarks demonstrate that HIPIF consistently outperforms other methods while improving efficiency\. We believe that hierarchical planning and information folding offer a promising direction for improving both effectiveness and efficiency in future agentic decision\-making research\.
## References
- \[1\]\(2024\)Back to basics: revisiting reinforce\-style optimization for learning from human feedback in llms\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 12248–12267\.Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p2.1)\.
- \[2\]L\. Chen, K\. Lu, A\. Rajeswaran, K\. Lee, A\. Grover, M\. Laskin, P\. Abbeel, A\. Srinivas, and I\. Mordatch\(2021\)Decision transformer: reinforcement learning via sequence modeling\.Advances in neural information processing systems34,pp\. 15084–15097\.Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p1.1)\.
- \[3\]Z\. Chen, K\. Liu, Q\. Wang, W\. Zhang, J\. Liu, D\. Lin, K\. Chen, and F\. Zhao\(2024\)Agent\-flan: designing data and methods of effective agent tuning for large language models\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 9354–9366\.Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p2.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px1.p1.1)\.
- \[4\] L\. Feng, Z\. Xue, T\. Liu, and B\. An\. Group\-in\-group policy optimization for LLM agent training\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems\. Cited by: [Appendix A](https://arxiv.org/html/2606.10507#A1.SS0.SSS0.Px3.p2.1), [§1](https://arxiv.org/html/2606.10507#S1.p2.1), [§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1), [§3\.1](https://arxiv.org/html/2606.10507#S3.SS1.p7.2), [§3\.3](https://arxiv.org/html/2606.10507#S3.SS3.SSS0.Px2.p3.3), [§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p1.1), [§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p2.1), [§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p3.1)\.
- \[5\] Y\. Guo, L\. Xu, J\. Liu, Y\. Dan, and S\. Qiu\. Segment policy optimization: effective segment\-level credit assignment in RL for large language models\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems\. Cited by: [§1](https://arxiv.org/html/2606.10507#S1.p2.1), [§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1)\.
- \[6\]M\. Hu, T\. Chen, Q\. Chen, Y\. Mu, W\. Shao, and P\. Luo\(2025\)Hiagent: hierarchical working memory management for solving long\-horizon agent tasks with large language model\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 32779–32798\.Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p2.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p2.1)\.
- \[7\]Z\. Hu, W\. Liu, X\. Qu, X\. Yue, C\. Chen, Z\. Wang, and Y\. Cheng\(2025\)Divide and conquer: grounding llms as efficient decision\-making agents via offline hierarchical reinforcement learning\.InInternational Conference on Machine Learning,pp\. 24570–24590\.Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p2.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p2.1)\.
- \[8\]A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes, A\. Radford,et al\.\(2024\)Gpt\-4o system card\.arXiv preprint arXiv:2410\.21276\.Cited by:[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p2.1)\.
- \[9\]S\. Li, X\. Puig, C\. Paxton, Y\. Du, C\. Wang, L\. Fan, T\. Chen, D\. Huang, E\. Akyürek, A\. Anandkumar,et al\.\(2022\)Pre\-trained language models for interactive decision\-making\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p2.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px1.p1.1)\.
- \[10\]X\. Li, W\. Jiao, J\. Jin, G\. Dong, J\. Jin, Y\. Wang, H\. Wang, Y\. Zhu, J\. Wen, Y\. Lu,et al\.\(2026\)Deepagent: a general reasoning agent with scalable toolsets\.InProceedings of the ACM Web Conference 2026,pp\. 2219–2230\.Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1)\.
- \[11\]B\. Y\. Lin, Y\. Fu, K\. Yang, F\. Brahman, S\. Huang, C\. Bhagavatula, P\. Ammanabrolu, Y\. Choi, and X\. Ren\(2023\)Swiftsage: a generative agent with fast and slow thinking for complex interactive tasks\.Advances in Neural Information Processing Systems36,pp\. 23813–23825\.Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p2.1)\.
- \[12\]J\. Liu, X\. Wu, J\. Peng, K\. Chen, C\. Yu, L\. Ding, and Y\. Liu\(2025\)Gradient coupling: the hidden barrier to generalization in agentic reinforcement learning\.arXiv preprint arXiv:2509\.23870\.Cited by:[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p2.1)\.
- \[13\]Z\. Lu, Z\. Lin, W\. Jia, C\. Tian, D\. Ye, P\. Li, L\. Jin, N\. Liu, G\. Xu, and W\. Feng\(2026\)HISR: hindsight information modulated segmental process rewards for multi\-turn agentic reinforcement learning\.arXiv preprint arXiv:2603\.18683\.Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p2.1)\.
- \[14\]Z\. Lu, C\. Tian, P\. PeiguangLi, L\. Jin, S\. Wang, W\. Jia, Y\. Shen, and G\. Xu\(2025\)PIPER: benchmarking and prompting event reasoning boundary of llms via debiasing\-distillation enhanced tuning\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 28591–28613\.Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1)\.
- \[15\]T\. Ni, M\. Ma, B\. Eysenbach, and P\. Bacon\(2023\)When do transformers shine in rl? decoupling memory from credit assignment\.Advances in Neural Information Processing Systems36,pp\. 50429–50452\.Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p1.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1)\.
- \[16\]J\. Peng, Y\. Liu, R\. Zhou, C\. Fleming, Z\. Wang, A\. Garcia, and M\. Hong\(2026\)HiPER: hierarchical reinforcement learning with explicit credit assignment for large language model agents\.arXiv preprint arXiv:2602\.16165\.Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p2.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p2.1)\.
- \[17\]X\. Puig, K\. Ra, M\. Boben, J\. Li, T\. Wang, S\. Fidler, and A\. Torralba\(2018\)Virtualhome: simulating household activities via programs\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 8494–8502\.Cited by:[Appendix A](https://arxiv.org/html/2606.10507#A1.SS0.SSS0.Px2.p1.2),[§1](https://arxiv.org/html/2606.10507#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p1.1)\.
- \[18\]R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn\(2023\)Direct preference optimization: your language model is secretly a reward model\.Advances in neural information processing systems36,pp\. 53728–53741\.Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1)\.
- \[19\]G\. Sarch, L\. Jang, M\. J\. Tarr, W\. W\. Cohen, K\. Marino, and K\. Fragkiadaki\(2024\)Vlm agents generate their own memories: distilling experience into embodied programs of thought\.Advances in Neural Information Processing Systems37,pp\. 75942–75985\.Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1)\.
- \[20\]J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov\(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p2.1)\.
- \[21\]Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p2.1)\.
- \[22\]N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao\(2023\)Reflexion: language agents with verbal reinforcement learning\.Advances in neural information processing systems36,pp\. 8634–8652\.Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p2.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2606.10507#S3.SS2.p2.1),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p2.1)\.
- \[23\]M\. Shridhar, J\. Thomason, D\. Gordon, Y\. Bisk, W\. Han, R\. Mottaghi, L\. Zettlemoyer, and D\. Fox\(2020\)Alfred: a benchmark for interpreting grounded instructions for everyday tasks\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 10740–10749\.Cited by:[Appendix A](https://arxiv.org/html/2606.10507#A1.SS0.SSS0.Px1.p1.2)\.
- \[24\] M\. Shridhar, X\. Yuan, M\. Cote, Y\. Bisk, A\. Trischler, and M\. Hausknecht\. ALFWorld: aligning text and embodied environments for interactive learning\. In International Conference on Learning Representations\. Cited by: [Appendix A](https://arxiv.org/html/2606.10507#A1.SS0.SSS0.Px1.p1.2), [§1](https://arxiv.org/html/2606.10507#S1.p1.1), [§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p1.1)\.
- \[25\]Y\. Song, D\. Yin, X\. Yue, J\. Huang, S\. Li, and B\. Y\. Lin\(2024\)Trial and error: exploration\-based trajectory optimization for llm agents\.arXiv preprint arXiv:2403\.02502\.Cited by:[Appendix A](https://arxiv.org/html/2606.10507#A1.SS0.SSS0.Px1.p1.2),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p1.1)\.
- \[26\]W\. Sun, M\. Lu, Z\. Ling, K\. Liu, X\. Yao, Y\. Yang, and J\. Chen\(2025\)Scaling long\-horizon llm agent via context\-folding\.arXiv preprint arXiv:2510\.11967\.Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1)\.
- \[27\]R\. S\. Sutton, D\. Precup, and S\. Singh\(1999\)Between mdps and semi\-mdps: a framework for temporal abstraction in reinforcement learning\.Artificial intelligence112\(1\-2\),pp\. 181–211\.Cited by:[§3\.2](https://arxiv.org/html/2606.10507#S3.SS2.p2.1),[§3\.2](https://arxiv.org/html/2606.10507#S3.SS2.p3.6)\.
- \[28\]G\. Team, R\. Anil, S\. Borgeaud, J\. Alayrac, J\. Yu, R\. Soricut, J\. Schalkwyk, A\. M\. Dai, A\. Hauth, K\. Millican,et al\.\(2023\)Gemini: a family of highly capable multimodal models\.arXiv preprint arXiv:2312\.11805\.Cited by:[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p2.1)\.
- \[29\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin\(2017\)Attention is all you need\.Advances in neural information processing systems30\.Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px1.p1.1)\.
- \[30\] G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar\. Voyager: an open\-ended embodied agent with large language models\. Transactions on Machine Learning Research\. Cited by: [§1](https://arxiv.org/html/2606.10507#S1.p1.1), [§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px1.p1.1)\.
- \[31\]H\. Wang, C\. T\. Leong, J\. Wang, J\. Wang, and W\. Li\(2025\)Spa\-rl: reinforcing llm agents via stepwise progress attribution\.arXiv preprint arXiv:2505\.20732\.Cited by:[Appendix A](https://arxiv.org/html/2606.10507#A1.SS0.SSS0.Px2.p1.2),[§1](https://arxiv.org/html/2606.10507#S1.p2.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p1.1)\.
- \[32\]H\. Wang, J\. Wang, C\. T\. Leong, and W\. Li\(2025\)Steca: step\-level trajectory calibration for llm agent learning\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 11597–11614\.Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p2.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p1.1)\.
- \[33\]R\. Wang, P\. Jansen, M\. Côté, and P\. Ammanabrolu\(2022\)Scienceworld: is your agent smarter than a 5th grader?\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,pp\. 11279–11298\.Cited by:[Appendix A](https://arxiv.org/html/2606.10507#A1.SS0.SSS0.Px3.p1.2),[§1](https://arxiv.org/html/2606.10507#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p1.1)\.
- \[34\]Z\. Wang, K\. Wang, Q\. Wang, P\. Zhang, L\. Li, Z\. Yang, X\. Jin, K\. Yu, M\. N\. Nguyen, L\. Liu,et al\.\(2025\)Ragen: understanding self\-evolution in llm agents via multi\-turn reinforcement learning\.arXiv preprint arXiv:2504\.20073\.Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1)\.
- \[35\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2606.10507#S3.SS2.p3.8)\.
- \[36\]W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang\(2026\)A\-mem: agentic memory for LLM agents\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=FiM0M8gcct)Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1)\.
- \[37\]Z\. Xu, C\. Yu, F\. Fang, Y\. Wang, and Y\. Wu\(2024\)Language agents with reinforcement learning for strategic play in the werewolf game\.InInternational Conference on Machine Learning,pp\. 55434–55464\.Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1)\.
- \[38\]A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei,et al\.\(2024\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p3.1)\.
- \[39\]S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan\(2023\)Tree of thoughts: deliberate problem solving with large language models\.Advances in neural information processing systems36,pp\. 11809–11822\.Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p2.1)\.
- \[40\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao\(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p1.1),[§1](https://arxiv.org/html/2606.10507#S1.p2.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.10507#S3.SS1.p2.4),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p2.1)\.
- \[41\]R\. Ye, Z\. Zhang, K\. Li, H\. Yin, Z\. Tao, Y\. Zhao, L\. Su, L\. Zhang, Z\. Qiao, X\. Wang, P\. Xie, F\. Huang, J\. Zhou, S\. Chen, and Y\. Jiang\(2026\)AgentFold: long\-horizon web agents with proactive context folding\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=IuZoTgsUws)Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1)\.
- \[42\]Z\. Zhang, Q\. Dai, X\. Bo, C\. Ma, R\. Li, X\. Chen, J\. Zhu, Z\. Dong, and J\. Wen\(2025\)A survey on the memory mechanism of large language model\-based agents\.ACM Transactions on Information Systems43\(6\),pp\. 1–47\.Cited by:[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px1.p1.1)\.
- \[43\]Q\. Zhao, H\. Fu, C\. Sun, and G\. Konidaris\(2024\)Epo: hierarchical llm agents with environment preference optimization\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 6401–6415\.Cited by:[§1](https://arxiv.org/html/2606.10507#S1.p2.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1)\.
- \[44\]S\. Zhen, Y\. Yu, R\. Guo, N\. Cheng, and Y\. Deng\(2026\)Hierarchical reinforcement learning with augmented step\-level transitions for llm agents\.arXiv preprint arXiv:2604\.05808\.Cited by:[Appendix A](https://arxiv.org/html/2606.10507#A1.SS0.SSS0.Px3.p1.2),[Appendix A](https://arxiv.org/html/2606.10507#A1.SS0.SSS0.Px3.p2.1),[§1](https://arxiv.org/html/2606.10507#S1.p1.1),[§1](https://arxiv.org/html/2606.10507#S1.p2.1),[§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.10507#S4.SS1.p2.1)\.
- \[45\] Y\. Zhou, A\. Zanette, J\. Pan, S\. Levine, and A\. Kumar\. ArCHer: training language model agents via hierarchical multi\-turn RL\. In Forty\-first International Conference on Machine Learning\. Cited by: [§2](https://arxiv.org/html/2606.10507#S2.SS0.SSS0.Px2.p1.1)\.
- \[46\] Z\. Zhou, A\. Qu, Z\. Wu, S\. Kim, A\. Prakash, D\. Rus, J\. Zhao, B\. K\. H\. Low, and P\. P\. Liang\. MEM1: learning to synergize memory and reasoning for efficient long\-horizon agents\. In First Workshop on Multi\-Turn Interactions in Large Language Models\. Cited by: [§1](https://arxiv.org/html/2606.10507#S1.p1.1)\.
## Appendix A Dataset Details
#### ALFWorld\.
ALFWorld \[[24](https://arxiv.org/html/2606.10507#bib.bib19)\] is an embodied text\-based environment designed to evaluate the agentic capability of language models in complex long\-horizon decision\-making tasks\. It provides interactive TextWorld environments that are closely aligned with ALFRED \[[23](https://arxiv.org/html/2606.10507#bib.bib76)\]\. In each episode, the agent receives a textual goal and interacts with the environment through multi\-turn action generation until the task is completed or the maximum number of interaction turns is reached\. The environment returns a binary outcome as the trajectory\-level score, where 1 denotes task success and 0 denotes failure\. ALFWorld contains six categories of common household tasks: Pick & Place \(Pick\), Examine in Light \(Look\), Clean & Place \(Clean\), Heat & Place \(Heat\), Cool & Place \(Cool\), and Pick Two & Place \(Pick2\)\. Following prior work, we adopt the version constructed by Song et al\. \[[25](https://arxiv.org/html/2606.10507#bib.bib58)\] and set the maximum number of interaction turns to 50\.
#### VirtualHome\.
VirtualHome \[[17](https://arxiv.org/html/2606.10507#bib.bib55)\] is another embodied household environment for evaluating long\-horizon interactive agents\. It includes diverse high\-level household tasks in simulated indoor environments, where each task requires the agent to complete a sequence of executable actions\. For each episode, the agent receives a high\-level task description, then repeatedly selects an action and receives environment feedback until the task succeeds or the maximum interaction limit is reached\. We use binary success as the evaluation outcome, where 1 indicates successful task completion and 0 indicates failure\. In this work, we adopt the version provided by Wang et al\. \[[31](https://arxiv.org/html/2606.10507#bib.bib39)\] and further correct several erroneous task descriptions in the original benchmark\. The processed dataset will be released together with our code\. We set the maximum number of interaction turns to 50\.
#### ScienceWorld\.
ScienceWorld \[[33](https://arxiv.org/html/2606.10507#bib.bib22)\] is a text\-based interactive environment designed to evaluate agents on scientific reasoning and experimental tasks\. It requires the agent to perform multi\-step scientific procedures, such as finding relevant objects and observing physical or chemical changes\. At each step, the agent generates an executable textual action and receives an observation from the environment\. The interaction continues until the task is completed or the maximum number of interaction turns is reached\. For consistency with the other benchmarks, we report binary success rates, where 1 denotes task success and 0 denotes failure\. In this work, we adopt the version provided by STEP\-HRL \[[44](https://arxiv.org/html/2606.10507#bib.bib32)\]\. We set the maximum number of interaction turns to 40\.
To ensure a fair comparison, we follow the dataset settings of the strongest baselines, GiGPO \[[4](https://arxiv.org/html/2606.10507#bib.bib8)\] and STEP\-HRL \[[44](https://arxiv.org/html/2606.10507#bib.bib32)\]\. Table [4](https://arxiv.org/html/2606.10507#A1.T4) summarizes the statistics of the three benchmarks used in our experiments\.
Table 4: Statistics of the three agent benchmarks\. Train and Test denote the numbers of training and test samples\. Available Actions denotes the number of action templates used by the agent\.
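The evaluation protocol shared by the three benchmarks \(interact until success or the turn limit, then score the trajectory as 1 or 0\) can be sketched as follows\. This is a minimal illustration, not the benchmarks' actual APIs: `ToyEnv`, `run_episode`, and `success_rate` are hypothetical names, and reporting the rate in percent is an assumption\. Only the turn limits are taken from the text above\.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Turn limits stated in this appendix.
MAX_TURNS = {"ALFWorld": 50, "VirtualHome": 50, "ScienceWorld": 40}


@dataclass
class ToyEnv:
    """Hypothetical stand-in environment: succeeds once `goal_action` is taken."""
    goal_action: str
    done: bool = False

    def reset(self) -> str:
        self.done = False
        return "You are in a room."

    def step(self, action: str) -> Tuple[str, bool]:
        # Returns (observation, task_completed).
        self.done = action == self.goal_action
        return ("Task done." if self.done else "Nothing happens."), self.done


def run_episode(env: ToyEnv, policy: Callable[[str, int], str], max_turns: int) -> int:
    """Interact until success or the turn limit; return the binary score (1 or 0)."""
    obs = env.reset()
    for turn in range(max_turns):
        obs, completed = env.step(policy(obs, turn))
        if completed:
            return 1
    return 0


def success_rate(scores: List[int]) -> float:
    """Share of successful episodes, in percent (scale is an assumption)."""
    return 100.0 * sum(scores) / len(scores)


# A toy policy that only issues the goal action at turn index 45.
late_policy = lambda obs, turn: "put" if turn == 45 else "look"
print(run_episode(ToyEnv("put"), late_policy, MAX_TURNS["ALFWorld"]))      # 1
print(run_episode(ToyEnv("put"), late_policy, MAX_TURNS["ScienceWorld"]))  # 0
```

The same toy policy succeeds under the 50\-turn limit but fails under the 40\-turn limit, which shows why the turn budget has to be fixed per benchmark for success rates to be comparable\.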
## Appendix B Training Details
#### Hyperparameters\.
For all three benchmarks \(ALFWorld, VirtualHome, and ScienceWorld\), we use the same training configuration unless otherwise specified\. The maximum prompt length is set to 8192 tokens, and the maximum response length is set to 512 tokens\. The actor learning rate is set to $1\times 10^{-6}$\. For PPO, which is the only method that uses a critic model, the critic learning rate is set to $1\times 10^{-5}$\. For group\-based RL methods, including GRPO and its variants, we use a group size of 8 and sample 16 groups per rollout, resulting in $16\times 8=128$ parallel environments\. For PPO, we use 128 independent environments for rollout collection\. The rollout temperature is set to 1\.0 for exploration, while the validation temperature is set to 0\.4 for more stable evaluation\. The mini\-batch size is set to 256, and the KL\-divergence loss coefficient is set to 0\.01\.
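For reference, the settings above can be collected into a single configuration object\. The sketch below is only a convenient summary: the class and field names are hypothetical and do not correspond to any particular training framework, while the values are those stated in this paragraph\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from Appendix B; field names are illustrative."""
    max_prompt_length: int = 8192
    max_response_length: int = 512
    actor_lr: float = 1e-6
    critic_lr: float = 1e-5          # used by PPO only
    group_size: int = 8              # trajectories per group (group-based RL)
    groups_per_rollout: int = 16
    rollout_temperature: float = 1.0
    val_temperature: float = 0.4
    mini_batch_size: int = 256
    kl_loss_coef: float = 0.01

    @property
    def num_parallel_envs(self) -> int:
        # 16 groups x 8 trajectories per group = 128 environments.
        return self.groups_per_rollout * self.group_size


cfg = TrainConfig()
print(cfg.num_parallel_envs)  # 128
```

Deriving the environment count from the group settings keeps the group\-based methods aligned with the 128 independent environments used for PPO\.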
#### Reward Design\.
All methods use rule\-based environment rewards\. The terminal reward is 10 for task success and 0 for failure\. We further assign a penalty of $-0.1$ when the model fails to produce the required structured tags, such as <reflection\>\.\.\.</reflection\>, <subgoal\>\.\.\.</subgoal\>, or <action\>\.\.\.</action\>\.
For HIPIF, we further incorporate subgoal\-oriented process rewards to provide step\-level feedback for subgoal content and subgoal execution\. For subgoal content, we introduce two rule\-based penalties\. First, we extract object and location names from the generated subgoal and check whether they can be matched by string matching to entities in the available environment context\. If the subgoal contains no grounded object or location, we assign a penalty of $-0.1$\. Second, for successful trajectories, we inspect the final observation of each terminated subgoal\. If the final observation indicates an execution failure, such as “Nothing happens” or “No known action matches that input”, we assign a penalty of $-0.1$ to the corresponding subgoal step\. This helps identify locally erroneous subgoals that may be hidden inside an otherwise successful trajectory\. For subgoal execution, we penalize repeated ineffective interaction patterns within the same subgoal\. Specifically, we compare the action\-observation records under the current subgoal; if the same action\-observation pair appears for the third time, we assign a penalty of $-0.1$\. This rule is applied only within the temporal span of the same subgoal, since the same action may still be valid in different task stages\.
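The rules in this subsection can be sketched as small functions\. This is a simplified illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, grounding is approximated by checking whether any known entity string occurs in the subgoal \(the paper first extracts object and location names\), and the repetition rule is read as penalizing the third and every later occurrence, which the text does not specify\.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

PENALTY = -0.1
FAILURE_MARKERS = ("Nothing happens", "No known action matches that input")


def terminal_reward(success: bool) -> float:
    """Terminal environment reward: 10 for success, 0 for failure."""
    return 10.0 if success else 0.0


def format_penalty(output: str, required_tags: Sequence[str]) -> float:
    """Penalize an output that lacks any required opening or closing tag."""
    for tag in required_tags:
        if f"<{tag}>" not in output or f"</{tag}>" not in output:
            return PENALTY
    return 0.0


def grounding_penalty(subgoal: str, context_entities: Iterable[str]) -> float:
    """ASSUMPTION: a subgoal is grounded if it mentions any known entity string."""
    text = subgoal.lower()
    grounded = any(entity.lower() in text for entity in context_entities)
    return 0.0 if grounded else PENALTY


def failed_subgoal_penalty(final_observation: str, trajectory_succeeded: bool) -> float:
    """Applies only to successful trajectories whose subgoal ended in a failure message."""
    failed = any(marker in final_observation for marker in FAILURE_MARKERS)
    return PENALTY if (trajectory_succeeded and failed) else 0.0


def repetition_penalties(records: List[Tuple[str, str]]) -> List[float]:
    """Per-step penalties for (action, observation) records within ONE subgoal.

    ASSUMPTION: the third and all later occurrences of a pair are penalized.
    """
    counts: Counter = Counter()
    penalties = []
    for pair in records:
        counts[pair] += 1
        penalties.append(PENALTY if counts[pair] >= 3 else 0.0)
    return penalties


steps = [("open drawer 1", "Nothing happens.")] * 4 + [("look", "You see a shelf.")]
print(repetition_penalties(steps))  # [0.0, 0.0, -0.1, -0.1, 0.0]
```

Because `repetition_penalties` receives only the records of a single subgoal, the counter resets naturally at each subgoal boundary, matching the rule that repetition is judged within one subgoal's temporal span\.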
#### Computing Details\.
All training experiments are conducted on 8 NVIDIA A100 GPUs\. For 7B models, we set the tensor parallel size to 4 and train for 150 epochs\. For 3B models, we set the tensor parallel size to 2 and train for 200 epochs\.
## Appendix C Pseudo Code
Algorithm [1](https://arxiv.org/html/2606.10507#alg1) summarizes the overall training procedure of HIPIF\. Here, $z_t$ denotes the completion judgment of the current subgoal, and $\eta_t$ denotes the rationale for this judgment\. $\xi_t$ is the branch\-specific reflection generated before the next decision: it reflects on subgoal transition when $z_t=\mathrm{completed}$, and on current subgoal execution when $z_t=\mathrm{uncompleted}$\. The process reward $r_t^{\mathrm{proc}}$ is computed from subgoal\-content and subgoal\-execution feedback, and $Q_t$ denotes the corresponding step\-level return used for GRPO optimization\.
For each task instruction, we sample a group of trajectories from the old policy and perform rollout under the current folded\-history structure\. At each interaction step, the model first conducts hierarchical reflection and outputs a completion judgment $z_t$ together with a rationale $\eta_t$\. This judgment determines the subsequent generation branch\. If the current subgoal is judged as completed, HIPIF folds the completed subgoal and its intra\-subgoal execution history into the compact history $\bar{H}$, appends the subgoal to the historical subgoal sequence $\mathcal{G}$, and then generates a reflection $\xi_t$ on the next subgoal together with the next subgoal and its first action\. If the current subgoal is not completed, the model instead generates an execution reflection $\xi_t$ based on the current subgoal, recent intra\-subgoal history, and latest observation, and then outputs the next action under the current subgoal\. Importantly, the reflection $\xi_t$ and the structured decision output $y_t$ are treated as part of the same policy\-generated sequence\. After rollout, HIPIF computes the terminal environment reward and the subgoal\-oriented process rewards for each stored step, obtains step\-level returns, normalizes them within the sampled group, and updates the policy with the clipped GRPO objective\.
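The state bookkeeping behind the two branches can be sketched with a small container\. This is an illustration only: the class and method names are hypothetical, and the fold summary \(subgoal text plus its final observation\) is an assumption, since in HIPIF the folded content is produced by the trained policy itself\.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AgentContext:
    folded_history: List[str] = field(default_factory=list)  # compact history (H-bar)
    past_subgoals: List[str] = field(default_factory=list)   # subgoal sequence (G)
    subgoal: str = ""                                         # current subgoal (g)
    intra_history: List[Tuple[str, str]] = field(default_factory=list)  # history (h)

    def record(self, action: str, observation: str) -> None:
        """Append one (action, observation) pair to the intra-subgoal history."""
        self.intra_history.append((action, observation))

    def fold(self, next_subgoal: str) -> None:
        """Branch taken when the current subgoal is judged completed."""
        last_obs = self.intra_history[-1][1] if self.intra_history else ""
        # ASSUMPTION: the summary keeps only the subgoal and its final observation.
        self.folded_history.append(f"[done] {self.subgoal} -> {last_obs}")
        self.past_subgoals.append(self.subgoal)
        self.subgoal = next_subgoal
        self.intra_history = []  # reset h for the new subgoal

    def prompt(self, observation: str) -> str:
        """Context shown to the policy: folded summaries plus the live subgoal."""
        lines = list(self.folded_history)
        lines.append(f"Current subgoal: {self.subgoal}")
        lines += [f"{a} => {o}" for a, o in self.intra_history]
        lines.append(f"Observation: {observation}")
        return "\n".join(lines)


ctx = AgentContext(subgoal="put toiletpaper 1 in toilet")
ctx.record("go to cabinet 1", "You see a toiletpaper 1.")
ctx.record("move toiletpaper 1 to toilet", "You put the toiletpaper 1 in the toilet.")
ctx.fold(next_subgoal="find toiletpaper 2")
print(ctx.prompt("You are at the toilet."))
```

After folding, the prompt contains one summary line for the finished subgoal instead of its full action\-observation records, which is the mechanism that keeps the context from growing with every step\.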
**Algorithm 1** Training Procedure of HIPIF

1: **Input:** Training tasks $\mathcal{D}$, policy $\pi_\theta$, old policy $\pi_{\theta_{\mathrm{old}}}$, group size $M$, horizon $T$
2: **Output:** Updated policy $\pi_\theta$
3: **for** each training iteration **do**
4:   **for** each task instruction $q \in \mathcal{D}$ **do**
5:     $\mathcal{Y} \leftarrow \emptyset$
6:     **for** $m = 1$ **to** $M$ **do**
7:       Reset environment with $q$ and obtain initial observation $o_0$
8:       Initialize folded history $\bar{H} \leftarrow \emptyset$, subgoal sequence $\mathcal{G} \leftarrow \emptyset$
9:       Generate initial subgoal $g$ and set intra-subgoal history $h \leftarrow \emptyset$
10:       **for** $t = 0$ **to** $T-1$ **do**
11:         $(z_t, \eta_t) \leftarrow \textsc{Reflect}(\pi_{\theta_{\mathrm{old}}}, \bar{H}, \mathcal{G}, g, h, o_t)$
12:         **if** $z_t = \mathrm{completed}$ **then**
13:           $\bar{H} \leftarrow \textsc{Fold}(\bar{H}, g, h, o_t)$, $\mathcal{G} \leftarrow \mathcal{G} \,\|\, g$
14:           $(\xi_t, y_t) \leftarrow \textsc{Generate}(\pi_{\theta_{\mathrm{old}}}, \bar{H}, \mathcal{G}, \eta_t)$
15:           Parse $y_t$ as the next subgoal $g$ and its first action $a_t$
16:           Reset intra-subgoal history $h \leftarrow \emptyset$
17:         **else**
18:           $(\xi_t, y_t) \leftarrow \textsc{Generate}(\pi_{\theta_{\mathrm{old}}}, g, h, o_t, \eta_t)$
19:           Parse $y_t$ as the next action $a_t$ under current subgoal $g$
20:         **end if**
21:         Validate the output schema, subgoal grounding, and action validity
22:         Execute $a_t$, receive $o_{t+1}$, and append $(a_t, o_{t+1})$ to $h$
23:         **if** task is completed **then**
24:           **break**
25:         **end if**
26:       **end for**
27:       Obtain terminal reward $R_{\mathrm{env}}^{(m)}$
28:       Compute process reward $r_t^{\mathrm{proc},(m)}$ for each stored step
29:       Compute step-level return $Q_t^{(m)} = R_{\mathrm{env}}^{(m)} + r_t^{\mathrm{proc},(m)}$
30:       Add trajectory $m$ to $\mathcal{Y}$
31:     **end for**
32:     Normalize $\{Q_t^{(m)}\}$ within $\mathcal{Y}$ to obtain step-level advantages $\hat{A}_t^{(m)}$
33:     Update $\pi_\theta$ with the clipped GRPO objective
34:   **end for**
35: **end for**
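Steps 27 to 32 of the algorithm can be illustrated with a short Python sketch. The return $Q_t^{(m)} = R_{\mathrm{env}}^{(m)} + r_t^{\mathrm{proc},(m)}$ follows the algorithm directly. The normalization is an assumption: the text only says returns are normalized within the sampled group, so the sketch uses a joint z-score over all steps of all trajectories in the group, as is common for GRPO-style methods. The function names `step_returns` and `group_advantages` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def step_returns(terminal_reward: float, process_rewards: List[float]) -> List[float]:
    """Q_t = R_env + r_t^proc for every stored step of one trajectory."""
    return [terminal_reward + r for r in process_rewards]


def group_advantages(group_returns: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """Normalize step-level returns within a group (assumed form: joint z-score)."""
    flat = [q for traj in group_returns for q in traj]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(q - mu) / (sigma + eps) for q in traj] for traj in group_returns]


# Two trajectories for the same task: one succeeds (R_env = 1), one fails (R_env = 0).
# Each has a clean first step and a penalized second step.
group = [
    step_returns(1.0, [0.0, -0.1]),
    step_returns(0.0, [0.0, -0.1]),
]
advantages = group_advantages(group)
print(advantages)
```

In this toy group, every step of the successful trajectory receives a positive advantage, and within each trajectory the penalized step is ranked below the clean one, which is the step-level signal the process reward is meant to add on top of the sparse terminal reward.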
## Appendix D Sensitivity Analysis
We conduct a sensitivity analysis on two key hyperparameters in the subgoal-oriented process rewards: the repetition threshold for action-observation pairs and the magnitude of the process penalty. The experiments are conducted on ALFWorld, and the results are shown in Figure [6](https://arxiv.org/html/2606.10507#A4.F6).
Figure 6: Sensitivity analysis of the subgoal-oriented process rewards on ALFWorld. (a) Sensitivity to repetition threshold. (b) Sensitivity to reward magnitude.

For the repetition penalty, we compare three settings: no penalty, penalizing the first repeated action-observation pair, and penalizing the second repeated occurrence. As shown in Figure [6(a)](https://arxiv.org/html/2606.10507#A4.F6.sf1), penalizing the first repeated pair leads to larger late-stage fluctuations. In contrast, penalizing the second repeated occurrence yields more stable performance throughout training.
For the penalty magnitude, we compare values of 0.05, 0.1, and 0.2. As shown in Figure [6(b)](https://arxiv.org/html/2606.10507#A4.F6.sf2), a larger penalty of 0.2 improves early performance but causes clear fluctuations in later training, while 0.05 provides weaker supervision and performs slightly worse. The value of 0.1 achieves the best overall balance between effectiveness and stability.
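To make the two hyperparameters concrete, the following Python sketch shows one way a repetition penalty with a threshold and a magnitude could be computed. It is an assumption-laden illustration, not the paper's reward definition: the function name `repetition_penalties` is hypothetical, repeats are counted over the whole list of pairs passed in, and every occurrence at or beyond the threshold is penalized. Here `threshold=1` corresponds to penalizing the first repeated pair and `threshold=2` to penalizing the second repeated occurrence.

```python
from collections import Counter
from typing import List, Tuple


def repetition_penalties(
    pairs: List[Tuple[str, str]],
    threshold: int = 2,
    magnitude: float = 0.1,
) -> List[float]:
    """Return one penalty per step for repeated (action, observation) pairs.

    A step is penalized once the same pair has already been seen at least
    `threshold` times earlier (assumed semantics).
    """
    seen: Counter = Counter()
    penalties = []
    for pair in pairs:
        earlier_occurrences = seen[pair]
        penalties.append(-magnitude if earlier_occurrences >= threshold else 0.0)
        seen[pair] += 1
    return penalties


# An agent stuck repeating a failing action four times in a row.
stuck = [("open fridge 1", "Nothing happens.")] * 4
print(repetition_penalties(stuck, threshold=1))  # penalty starts at the 2nd occurrence
print(repetition_penalties(stuck, threshold=2))  # penalty starts at the 3rd occurrence
```

Under these assumptions, the more lenient threshold tolerates one retry before penalizing, which is consistent with the observation above that penalizing the second repeated occurrence trains more stably.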
## Appendix E Templates
#### Prompt Templates\.
Figures [7](https://arxiv.org/html/2606.10507#A5.F7), [8](https://arxiv.org/html/2606.10507#A5.F8), and [9](https://arxiv.org/html/2606.10507#A5.F9) present the prompt templates used for ALFWorld, ScienceWorld, and VirtualHome, respectively. To better match the interactive agent setting, we do not provide the model with a fully grounded list of executable actions at each step. Instead, we provide environment-specific action templates and require the model to instantiate valid objects or locations according to the current observation. This setting encourages the agent to explore the environment and learn executable behavior from interaction feedback, rather than relying on an external action enumerator. These prompt templates are constructed using Python-style string formatting, where placeholders enclosed in curly braces, such as `{task_description}`, `{current_subgoal}`, `{current_observation}`, and `{action_history}`, denote semantic slots that are dynamically filled at runtime using Python's `.format()` method. Each prompt provides the task description, current subgoal, current observation, last action, available action templates, and folded execution history. The output format is also explicitly constrained by tags: the model first outputs its reflection process inside `<reflection>...</reflection>`, and then outputs either a new subgoal with its first action using `<subgoal>...</subgoal>` and `<action>...</action>`, or only the next action using `<action>...</action>` when the current subgoal should continue.
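As a small illustration of the formatting convention, the snippet below fills a shortened, hypothetical excerpt of such a template with `str.format()`. Single-brace slots such as `{task_description}` are substituted, while doubled braces such as `{{obj}}` are escapes that survive as literal `{obj}` in the final prompt, which is why the action templates in the figures are written with doubled braces. The constant name `TEMPLATE` and the example values are assumptions made for the demonstration.

```python
# A shortened excerpt in the style of the ALFWorld template (not the full prompt).
TEMPLATE = (
    "Your task is to: {task_description}\n"
    "Your current subgoal is: {current_subgoal}.\n"
    "Here are the AVAILABLE ACTIONS you could take:\n"
    "- 'take {{obj}} from {{recep}}'\n"
)

# Single-brace slots are filled; doubled braces collapse to literal single braces.
prompt = TEMPLATE.format(
    task_description="put a cool tomato in microwave.",
    current_subgoal="find a tomato",
)
print(prompt)
```

After formatting, the model sees the action pattern `'take {obj} from {recep}'` and must instantiate the object and receptacle itself from the current observation.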
**Prompt Template for ALFWorld**

You are an expert agent operating in the ALFRED Embodied Environment. Your task is to: {task_description}
Your current subgoal is: {current_subgoal}.
Your current observation is: {current_observation}
Your last action is: {last_action}.

Here are the AVAILABLE ACTIONS you could take:
- 'go to {{recep}}'
- 'take {{obj}} from {{recep}}'
- 'put {{obj}} on {{recep}}'
- 'open {{recep}}' / 'close {{recep}}'
- 'use {{obj}}/{{recep}}'
- 'clean {{obj}} with {{recep}}'
- 'heat {{obj}} with {{recep}}'
- 'cool {{obj}} with {{recep}}'

Ensure that any objects ('{{obj}}') and receptacles ('{{recep}}') are present in your observation. The current observation "Nothing happens." means the action failed.

EXECUTION HISTORY:
Subgoals are essential milestones on the path to the final task. Below is the execution history, each entry enclosed in [] contains a subgoal followed by all corresponding actions and observations:
{action_history}

REFLECTION:
You should first output your reflection process in the following format: `<reflection>the reflection process</reflection>`.
The reflection process includes two parts:
- Part1: Output whether the current subgoal is completed or not according to the current observation.
- Part2: If the current subgoal is completed, output your reflection on what the new subgoal should be to advance the final task based on historical subgoals. If the current subgoal is not completed, output your reflection on the next action based on the execution of the current subgoal.

ACTION:
Once you have finished the reflection process, choose one output format based on whether the current subgoal is completed:
- If the current subgoal is completed: output a new subgoal and its first action in the format `<subgoal>your subgoal</subgoal> <action>your action</action>`.
- If the current subgoal is not completed: output the next action in the format `<action>your action</action>`.

Figure 7: Prompt template for ALFWorld.

**Prompt Template for ScienceWorld**

You are a helpful assistant to do some scientific experiment in an environment.
Your current subgoal is: {current_subgoal}.
Your last action is: {last_action}.
Your current observation is: {current_observation}
{task_description}
In the environment, there are several rooms: kitchen, foundry, workshop, bathroom, outside, living room, bedroom, greenhouse, art studio, hallway. You can teleport to any room in one step.

Here are the AVAILABLE ACTIONS you could take:
- 'open {{object}}' / 'close {{object}}': open or close a container
- 'connect {{object}} to {{object}}': connect electrical components
- 'pick up {{object}}': move an object to the inventory
- 'put down {{object}}': drop an inventory item
- 'move {{object}} to {{object}}': move an object to a container
- 'pour {{object}} into {{object}}': pour a liquid into a container
- 'mix {{object}}': chemically mix a container
- 'use {{object}} on {{object}}': use thermometer on object to measure its temperature
- 'activate {{object}}' / 'deactivate {{object}}': activate or deactivate a device
- 'teleport to {{location}}': move to a new location
- 'focus on {{object}}': signal intent on a task object
- 'wait': wait for 10 steps
- 'wait1': wait for a step

Ensure that any {{object}} and {{location}} are present in your observation. Do NOT use 'focus on {{object}}' to get details.

EXECUTION HISTORY:
Subgoals are essential milestones on the path to the final task. Below is the execution history, each entry enclosed in [] contains a subgoal followed by all corresponding actions and observations:
{action_history}

REFLECTION:
You should first output your reflection process in the following format: `<reflection>the reflection process</reflection>`.
The reflection process includes two parts:
- Part1: Output whether the current subgoal, NOT the task, is completed or not completed according to the current observation.
- Part2: If the current subgoal is completed, output your reflection on what the new subgoal should be to advance the final task. If the current subgoal is not completed, output your reflection on the next action.

ACTION:
Once you have finished the reflection process, choose one output format based on your reflection process:
- If completed: output a new subgoal and its first action in the format `<subgoal>your subgoal</subgoal> <action>your action</action>`.
- If not completed: output the next action in the format `<action>your action</action>`.

Figure 8: Prompt template for ScienceWorld.

**Prompt Template for VirtualHome**

You are an agent in a simulated household environment, tasked to assist with daily living activities and interactions. Your task is to: {task_description}
Your current subgoal is: {current_subgoal}.
Your current observation is: {current_observation}
Your last action is: {last_action}.

Here are the AVAILABLE ACTIONS you could take:
- walk to {{obj}}
- find {{obj}}
- grab {{obj}}
- open {{obj}}
- close {{obj}}
- put {{obj}} on {{recep}}
- put {{obj}} in {{recep}}
- switch on {{obj}}
- switch off {{obj}}
- drink {{obj}}
- sit on {{obj}}
- lie on {{obj}}
- look at {{obj}}
- stand up
- watch {{obj}}
- wipe {{obj}}
- type on {{obj}}
- take off {{obj}}
- wash {{obj}}
- cut {{obj}}
- eat {{obj}}
- sleep
- wake up
- plug in {{obj}}
- plug out {{obj}}
- pour {{obj}} into {{recep}}
- turn to {{obj}}

Ensure that any objects ('{{obj}}') and locations ('{{recep}}') are present in your observation. You should walk to the object before taking action on it.

EXECUTION HISTORY:
Subgoals are essential milestones on the path to the final task. Below is the execution history, each entry enclosed in [] contains a subgoal followed by all corresponding actions and observations:
{action_history}

REFLECTION:
You should first output your reflection process in the following format: `<reflection>the reflection process</reflection>`.
The reflection process includes two parts:
- Part1: Output whether the current subgoal is completed or not according to the current observation.
- Part2: If the current subgoal is completed, output your reflection on what the new subgoal should be to advance the final task based on historical subgoals. If the current subgoal is not completed, output your reflection on the next action based on the execution of the current subgoal.

ACTION:
Once you have finished the reflection process, choose one output format based on whether the current subgoal is completed in Part1:
- If completed: output a new subgoal and its first action in the format `<subgoal>your subgoal</subgoal> <action>your action</action>`.
- If not completed: output the next action in the format `<action>your action</action>`.

Figure 9: Prompt template for VirtualHome.
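Because all three templates constrain the output with the same tags, a single parser can recover the structured decision. The sketch below is one possible way to do this with Python's `re` module; the names `Decision`, `_extract`, and `parse_decision` are hypothetical, and the rule that a missing `<subgoal>` tag means "continue the current subgoal" is an assumption drawn from the two output formats in the prompts rather than from the authors' code.

```python
import re
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    reflection: str
    subgoal: Optional[str]  # None means the current subgoal continues
    action: str


def _extract(tag: str, text: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> pair, if any."""
    match = re.search(r"<{0}>(.*?)</{0}>".format(tag), text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_decision(text: str) -> Decision:
    """Parse a model output into reflection, optional new subgoal, and action."""
    reflection = _extract("reflection", text)
    action = _extract("action", text)
    if reflection is None or action is None:
        # Schema violation: both tags are required in either output format.
        raise ValueError("output must contain <reflection> and <action> tags")
    return Decision(reflection, _extract("subgoal", text), action)


transition = parse_decision(
    "<reflection>The tomato is found.</reflection>"
    "<subgoal>cool the tomato</subgoal> <action>go to fridge 1</action>"
)
continuation = parse_decision(
    "<reflection>Not done yet.</reflection><action>open fridge 1</action>"
)
print(transition)
print(continuation)
```

Raising on a malformed output gives the training loop a natural hook for the schema validation step, although how invalid outputs are actually handled or penalized is not specified in this section.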
#### Folded Memory Templates\.
Figures [10](https://arxiv.org/html/2606.10507#A5.F10), [11](https://arxiv.org/html/2606.10507#A5.F11), and [12](https://arxiv.org/html/2606.10507#A5.F12) show folded subgoal-level memory examples for ALFWorld, VirtualHome, and ScienceWorld, respectively. The memory is organized to support both global task tracking and local subgoal execution. Specifically, we first present the current subgoal memory, which contains the active subgoal and its recent action-observation records. We then include the initial environment information, which provides the starting state and task context. Finally, we append the historical folded memory, where each completed subgoal is stored as a compact record. Each memory unit is enclosed in square brackets `[...]` to make different subgoals easy to distinguish. Within each memory unit, action-observation pairs are enclosed in parentheses, such as `(pre_action: ..., pre_observation: ...)`.
**Folded History Example on ALFWorld**

Your task is to: put a cool tomato in microwave.
Your current subgoal is: put the cooled tomato in the microwave.

EXECUTION HISTORY:
Subgoals are essential milestones on the path to the final task. Below is the execution history, each entry enclosed in [] contains a subgoal followed by all corresponding actions and observations:

[(current_subgoal: put the cooled tomato in the microwave),(pre_action: go to microwave 1, pre_observation: You arrive at microwave 1. The microwave 1 is closed.),(pre_action: open microwave 1, observation: You open the microwave 1. The microwave 1 is open. In it, you see a egg 1.)]

[(origin_observation: -= Welcome to TextWorld, ALFRED! =- You are in the middle of a room. Looking quickly around you, you see a cabinet 6, a cabinet 5, a cabinet 4, a cabinet 3, a cabinet 2, a cabinet 1, a coffeemachine 1, a countertop 3, a countertop 2, a countertop 1, a drawer 3, a drawer 2, a drawer 1, a fridge 1, a garbagecan 1, a microwave 1, a shelf 3, a shelf 2, a shelf 1, a sinkbasin 1, a stoveburner 4, a stoveburner 3, a stoveburner 2, a stoveburner 1, and a toaster 1. Your task is to: put a cool tomato in microwave.)]

[(history_subgoal 1: find a tomato),(pre_action: go to countertop 2, pre_observation: You arrive at countertop 2. On the countertop 2, you see a butterknife 1, a cellphone 1, a creditcard 1, a knife 1, a lettuce 1, a saltshaker 2, a saltshaker 1, a statue 1, and a tomato 1.)]

[(history_subgoal 2: cool the tomato),(pre_action: cool tomato 1 with fridge 1, pre_observation: You cool the tomato 1 using the fridge 1.)]

Figure 10: Example of the folded subgoal-level memory used by HIPIF.

**Folded History Example on VirtualHome**

Your task is to: Wipe down counter. Instructions: Walk to dining room. Grab the rag, rinse it in the sink using the faucet, and then wipe the kitchen counter.
Your current subgoal is: open the faucet.

EXECUTION HISTORY:
Subgoals are essential milestones on the path to the final task. Below is the execution history, each entry enclosed in [] contains a subgoal followed by all corresponding actions and observations:

[(current_subgoal: wipe kitchen_counter),(pre_action: walk to kitchen_counter, pre_observation: you successfully walk to kitchen_counter. kitchen_counter is CLOSED and DIRTY, sink is CLEAN, faucet is OFF and CLEAN, rag is CLEAN, faucet ON sink, faucet ON kitchen_counter. You have rag.)],

[(origin_observation: Available rooms in the house are: dining_room, bedroom, home_office, bathroom. You are currently in the bedroom.)]

[(history_subgoal 1: walk to dining room),(pre_action: walk to dining_room, observation: you successfully walk to dining_room. You arrive in the dining_room. You see bench, bookshelf, chair, coffe_maker, cupboard, dishwasher, faucet, freezer, kitchen_counter, microwave, oven, phone, rag, sink, stovefan, table, toaster, tray, tvstand.)]

[(history_subgoal 2: find and grab the rag),(pre_action: grab rag, observation: you successfully grab rag. character, rag is CLEAN. You have rag.)]

[(history_subgoal 3: find the sink),(pre_action: walk to sink, observation: you successfully walk to sink. kitchen_counter is CLOSED and CLEAN, sink is CLEAN, faucet is OFF and CLEAN, rag is CLEAN, faucet ON sink, faucet ON kitchen_counter. You have rag.)]

[(history_subgoal 4: find the faucet),(pre_action: find faucet, observation: you successfully find faucet. kitchen_counter is CLOSED and CLEAN, sink is CLEAN, faucet is OFF and CLEAN, rag is CLEAN, faucet ON sink, faucet ON kitchen_counter. You have rag.)]

[(history_subgoal 5: rinse the rag in the sink using the faucet),(pre_action: Rinse rag, observation: you successfully rinse rag. Sink is CLEAN, faucet is ON and CLEAN, rag is CLEAN, faucet ON sink, faucet ON kitchen_counter. You have rag.)]

Figure 11: Example of the folded subgoal-level memory used by HIPIF on VirtualHome.

**Folded History Example on ScienceWorld**

Your task is to: measure the temperature of wood, which is located around the living room. First, focus on the thermometer. Next, focus on the wood. If the wood temperature is above 200.0 degrees celsius, place it in the green box. If the wood temperature is below 200.0 degrees celsius, place it in the blue box. The boxes are located around the living room.
Your current subgoal is: place the wood in the blue box.

EXECUTION HISTORY:
Subgoals are essential milestones on the path to the final task. Below is the execution history, each entry enclosed in [] contains a subgoal followed by all corresponding actions and observations:

[(current_subgoal: place the wood in the blue box),(pre_action: move wood to blue box, observation: You move the wood to the blue box.)]

[(origin_observation: This room is called the foundry. In it, you see: the agent, a substance called air, a blast furnace, which is turned off. The blast furnace door is closed. A sink, which is turned off. In the sink is: nothing. A table. On the table is: nothing. You also see: A door to the outside that is open.)]

[(history_subgoal 1: find a thermometer),(pre_action: teleport to kitchen, observation: You teleport to the kitchen. This room is called the kitchen. In it, you see: the agent, a substance called air, a chair, a counter, a bowl containing a red apple, a banana, an orange, and a potato, a drawer, a cupboard, a finger painting, a freezer, a fridge, a glass jar containing sodium chloride, a lighter, an oven, a sink, soap, a stopwatch, a stove, a table, a glass cup, and a thermometer currently reading a temperature of 10 degrees celsius.)]

[(history_subgoal 2: pick up a thermometer),(pre_action: pick up thermometer, observation: You move the thermometer to the inventory.)]

[(history_subgoal 3: find the wood),(pre_action: focus on wood, observation: You foucus on wood.)]

[(history_subgoal 4: check the temperature of the wood),(pre_action: use thermometer on wood, observation: the thermometer measures a temperature of 4 degrees celsius)]

Figure 12: Example of the folded subgoal-level memory used by HIPIF on ScienceWorld.
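The three-part layout of the folded memory (current subgoal unit, initial observation unit, then one unit per completed subgoal) can be reproduced with a short rendering helper. The sketch below is a hypothetical reconstruction from the description and examples above, not the authors' code: the function name `render_memory` and its argument structure are assumptions, and it always uses the `pre_observation` key even though the printed examples mix `pre_observation` and `observation`.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (action, observation)


def _pairs(history: List[Pair]) -> str:
    """Render action-observation pairs as parenthesized records."""
    return "".join(f",(pre_action: {a}, pre_observation: {o})" for a, o in history)


def render_memory(
    current_subgoal: str,
    intra_history: List[Pair],
    origin_observation: str,
    folded: List[Tuple[str, List[Pair]]],
) -> str:
    """Build the string that would fill the {action_history} slot (assumed layout)."""
    units = [
        # 1) Active subgoal with its recent intra-subgoal records.
        f"[(current_subgoal: {current_subgoal}){_pairs(intra_history)}]",
        # 2) Initial environment information.
        f"[(origin_observation: {origin_observation})]",
    ]
    # 3) One compact unit per completed subgoal, numbered from 1.
    for index, (subgoal, history) in enumerate(folded, start=1):
        units.append(f"[(history_subgoal {index}: {subgoal}){_pairs(history)}]")
    return "".join(units)


memory = render_memory(
    current_subgoal="put the cooled tomato in the microwave",
    intra_history=[("go to microwave 1", "You arrive at microwave 1.")],
    origin_observation="You are in the middle of a room.",
    folded=[
        ("find a tomato", [("go to countertop 2", "You see a tomato 1.")]),
        ("cool the tomato", [("cool tomato 1 with fridge 1", "You cool the tomato 1.")]),
    ],
)
print(memory)
```

Placing the current subgoal unit first keeps the most decision-relevant context at a fixed position in the prompt, while the folded units grow by only one short record per completed subgoal.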
## Appendix F Validation Success Rate of 3B and 7B Models
To further examine the training dynamics of different methods, we present the validation success-rate curves of 3B and 7B models in Figure [2](https://arxiv.org/html/2606.10507#S4.F2) and Figure [13](https://arxiv.org/html/2606.10507#A6.F13), respectively. Across all three benchmarks, HIPIF consistently achieves the strongest validation performance, while HiAgent+GRPO remains consistently above the GRPO baseline. This trend is stable for both model scales, indicating that the advantage of HIPIF is not limited to a specific parameter size.
More importantly, the curves show that the gains of HIPIF are reflected not only in the final validation accuracy but also in the overall training trajectory. Compared with GRPO, subgoal-based training can be less stable in the early stage, suggesting that explicit subgoals introduce additional optimization difficulty. The later improvement of HIPIF indicates that hierarchical reflection and fine-grained process rewards stabilize subgoal-based RL and lead to better optimization behavior.
Figure 13: Validation success-rate curves of 7B models on (a) ALFWorld, (b) VirtualHome, and (c) ScienceWorld.
## Appendix G Complementarity with GiGPO
We further study the relationship between HIPIF and GiGPO through validation success-rate curves on the 3B setting, as shown in Figure [14](https://arxiv.org/html/2606.10507#A7.F14). Across all three benchmarks, HIPIF+GiGPO performs best, HIPIF ranks second, and GiGPO alone performs worst.
More specifically, GiGPO primarily improves optimization through enhanced group\-relative credit assignment, whereas HIPIF restructures the decision process with explicit hierarchical planning and context folding, and further stabilizes subgoal\-based training through hierarchical reflection and subgoal\-oriented process rewards\. Therefore, combining HIPIF with GiGPO yields the strongest validation performance, indicating that the two methods are largely complementary\. At the same time, HIPIF alone still outperforms GiGPO across all three benchmarks, showing that the structural improvements introduced by HIPIF provide stronger gains than using GiGPO alone\.
Figure 14: Validation success-rate curves comparing GiGPO, HIPIF, and HIPIF+GiGPO on the 3B setting: (a) ALFWorld, (b) VirtualHome, (c) ScienceWorld.

Table 5: Summary of the orthogonality analysis between HIPIF and GiGPO, based on the validation curves in Figure [14](https://arxiv.org/html/2606.10507#A7.F14).
## Appendix H Limitations
HIPIF is primarily evaluated in simulated long\-horizon interaction benchmarks with structured observations and action spaces\. While these environments provide controlled and reproducible testbeds, extending the framework to more open\-ended real\-world settings may require additional perception and action\-grounding components\. Moreover, HIPIF uses structured outputs for subgoal proposal, reflection, and execution, which improves interpretability but may require sufficiently capable instruction\-following models\. Finally, although we evaluate HIPIF on multiple benchmarks and two backbone scales, our experiments do not exhaustively cover all model families, parameter scales, or training budgets\. Further evaluation on more diverse backbone models would provide a more comprehensive understanding of its scaling behavior\.
## Appendix I Case Studies
To provide a more intuitive understanding of HIPIF, we present additional case studies on VirtualHome and ScienceWorld in Tables [6](https://arxiv.org/html/2606.10507#A9.T6) and [7](https://arxiv.org/html/2606.10507#A9.T7). These examples compare HIPIF with two ablated variants, w/o Subgoal and w/o Reflection, at representative key decision points in long-horizon tasks. Overall, the case studies show that subgoal decomposition and context folding help HIPIF preserve completed progress while avoiding distraction from long execution histories, while hierarchical reflection enables reliable assessment of subgoal completion, action failure, and subgoal transition.
Table 6: Case study on a long-horizon task in VirtualHome.

Table 7: Case study on a long-horizon task in ScienceWorld.