
# Learning Agentic Policy from Action Guidance
Source: [https://arxiv.org/html/2605.12004](https://arxiv.org/html/2605.12004)
Yuxiang Ji (1,2), Zengbin Wang (2), Yong Wang (2)†, Shidong Yang (2), Ziyu Ma (2), Guanhua Chen (3), Zonghua Sun (1), Liaoni Wu (1), Xiangxiang Chu (2)

(1) Xiamen University  (2) AMAP, Alibaba Group  (3) Southern University of Science and Technology

###### Abstract

Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base policy cannot reach reward states, additional training or external guidance is needed to recover effective learning signals. Rather than relying on costly iterative supervised fine-tuning (SFT), we exploit the abundant action data generated in everyday human interactions. We propose ActGuide-RL, which injects action data as plan-style reference guidance, enabling the agentic policy to overcome reachability barriers to reward states. Guided and unguided rollouts are then jointly optimized via mixed-policy training, internalizing the exploration gains back into the unguided policy. Motivated by a theoretical and empirical analysis of the benefit-risk trade-off, we adopt a minimal intervention principle that invokes guidance only as an adaptive fallback, matching task difficulty while minimizing off-policy risk. On search-agent benchmarks, ActGuide-RL substantially improves over zero RL (+10.7 pp on GAIA and +19 pp on XBench with Qwen3-4B), and performs on par with the SFT+RL pipeline without any cold start. This suggests a new paradigm for agentic RL that reduces the reliance on heavy SFT data by using scalable action guidance instead.

![Refer to caption](https://arxiv.org/html/2605.12004v1/x1.png)

Figure 1: Agentic RL is typically confined to the *in-capability region* of the base policy (the region where reward signals are reachable during rollout, i.e., pass@K > 0), and stalls on *out-region* tasks beyond this exploration frontier. ActGuide-RL leverages diverse and scalable *action data* as a plan-style reference to guide effective state visitation in out-region tasks.

## 1 Introduction

The role of Large Language Models (LLMs) has shifted from simple chatbots to agents capable of independently solving complex tasks [[70](https://arxiv.org/html/2605.12004#bib.bib246), [69](https://arxiv.org/html/2605.12004#bib.bib248), [36](https://arxiv.org/html/2605.12004#bib.bib245), [62](https://arxiv.org/html/2605.12004#bib.bib244), [38](https://arxiv.org/html/2605.12004#bib.bib311)]. With targeted agentic training, recent frontier models can autonomously plan and accomplish a wide range of complex tasks [[43](https://arxiv.org/html/2605.12004#bib.bib250), [1](https://arxiv.org/html/2605.12004#bib.bib251), [52](https://arxiv.org/html/2605.12004#bib.bib243)]. This ability has been demonstrated in general tool-use [[2](https://arxiv.org/html/2605.12004#bib.bib252), [13](https://arxiv.org/html/2605.12004#bib.bib253), [25](https://arxiv.org/html/2605.12004#bib.bib313)], GUI [[65](https://arxiv.org/html/2605.12004#bib.bib255), [45](https://arxiv.org/html/2605.12004#bib.bib254), [81](https://arxiv.org/html/2605.12004#bib.bib314)], and CLI [[27](https://arxiv.org/html/2605.12004#bib.bib257)] settings, including in-the-wild real-world scenarios [[60](https://arxiv.org/html/2605.12004#bib.bib247), [12](https://arxiv.org/html/2605.12004#bib.bib258)]. A key factor behind such targeted training is agentic reinforcement learning (RL), in which LLM-based policies are optimized through repeated interaction with specific or diverse environments toward verifiable or heuristic rewards [[77](https://arxiv.org/html/2605.12004#bib.bib259), [61](https://arxiv.org/html/2605.12004#bib.bib10), [24](https://arxiv.org/html/2605.12004#bib.bib312)].

Unlike static supervised training, online RL is highly sensitive to task difficulty because the training signal comes only from exploration by the model itself. As shown in Figure [1](https://arxiv.org/html/2605.12004#S0.F1), we refer to tasks within the reachable capability of the base policy as *in-region*, and those beyond this boundary as *out-region*. When reward states fall into the out-region, group-based advantage estimates collapse to zero gradient, causing training to stall. As a result, a common view is that current RL-based methods are fundamentally limited by the capabilities of the base model [[75](https://arxiv.org/html/2605.12004#bib.bib25), [64](https://arxiv.org/html/2605.12004#bib.bib262), [9](https://arxiv.org/html/2605.12004#bib.bib264), [22](https://arxiv.org/html/2605.12004#bib.bib263)]. To address the cold-start problem of RL on difficult or unseen tasks, a typical practice is to perform corresponding Supervised Fine-Tuning (SFT) followed by dynamic difficulty adjustment or curriculum learning. However, such pipelines shift the burden to warm-start data and careful curriculum design. This dependence makes agentic RL complex and difficult to scale to new environments.

Stepping back to the original motivation for developing agentic capabilities, the goal is to move beyond reasoning and enable models to act, interact, and make decisions in a human-like manner to accomplish long-horizon tasks. From this perspective, a direct and currently underutilized training source is the abundant *action data* generated in open-world settings or during task construction. As shown in Figure [1](https://arxiv.org/html/2605.12004#S0.F1), examples include step-by-step GUI/CLI interactions with computers or phones, API-mediated task execution, and long-horizon gameplay. In addition, some agentic RL tasks are constructed through a reverse process [[29](https://arxiv.org/html/2605.12004#bib.bib117), [14](https://arxiv.org/html/2605.12004#bib.bib299), [27](https://arxiv.org/html/2605.12004#bib.bib257)], where a valid action trajectory is first constructed and then used to instantiate the task, making the correct actions naturally available. These action data are inherently diverse and large in scale, yet their direct use for model training is often limited by the absence of explicit reasoning traces. Existing approaches either augment such data with synthesized chain-of-thought [[16](https://arxiv.org/html/2605.12004#bib.bib265), [68](https://arxiv.org/html/2605.12004#bib.bib266)] or directly leverage it through behavior imitation [[10](https://arxiv.org/html/2605.12004#bib.bib267), [3](https://arxiv.org/html/2605.12004#bib.bib268)]. However, synthesized reasoning can suffer from post-hoc rationalization [[56](https://arxiv.org/html/2605.12004#bib.bib307)], while behavior imitation tends to fit surface action patterns rather than induce the reasoning abilities of an agentic policy.

In this work, we investigate how to leverage action data to enhance agentic RL. Through empirical analysis, we first characterize the capability barrier of agentic policies, where reward states fall outside the current reachable region and training signals become unavailable. To address this issue, we propose ActGuide-RL, which injects action data as plan-style reference guidance to help the policy cross such barriers and perform effective out-region state visitation. We further analyze the benefit-risk trade-off introduced by guidance, where stronger guidance improves exploration but also increases off-policy distribution shift. Based on this, we draw two main conclusions from our experiments. **C1:** action guidance works best when it serves as a zero-reward fallback and is minimized adaptively, following a principle of minimal intervention. **C2:** under such minimal intervention, guided rollouts can be directly internalized into the unguided model through a mixed-policy optimization paradigm.

We evaluate ActGuide-RL on search-agent benchmarks across different base models, task difficulty levels, and both in-domain and out-of-domain settings. Compared with zero RL, ActGuide-RL consistently improves all tested base models, with especially large gains on harder benchmarks where unguided RL struggles to obtain effective training signals. Specifically, based on Qwen3-4B-Instruct, ActGuide-RL improves zero RL by +10.68 pp on GAIA, +27.79 pp on WebWalkerQA, +19.00 pp on XBench, and +5.15 pp on BC-ZH. Notably, it also performs on par with the SFT+RL pipeline even without any cold-start initialization. This substantially alleviates the dependence on SFT and offers a new perspective for agentic post-training.

## 2 Method

### 2.1 Preliminaries: Agentic RL

We follow existing works to formulate agentic RL as a Partially Observable Markov Decision Process (POMDP), where a language model acts as a policy $\pi_{\theta}$. Given a task instance $x \sim \mathcal{D}$, the policy receives the interaction history as its state $s_t$ at each step $t$, and predicts the next step $\alpha_t \sim \pi_{\theta}(\cdot \mid s_t)$. A full rollout yields a trajectory $\tau$ with a binary outcome reward $Y(\tau) \in \{0,1\}$ indicating whether the task is successfully solved. The overall training objective is to maximize the expected reward:

$$\max_{\theta}\;\mathcal{J}(\theta) := \mathbb{E}_{x\sim\mathcal{D}}\,\mathbb{E}_{\tau\sim\pi_{\theta}(\cdot\mid x)}\left[Y(\tau)\right]. \tag{1}$$

Since $Y(\tau)$ is binary, this naturally amounts to maximizing the expected success rate over a task distribution that may contain tasks of *varying difficulty*.
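As a minimal sketch of this formulation, the loop below rolls out a toy policy in a toy environment and estimates $\mathcal{J}(\theta)$ by Monte Carlo over binary outcomes; the environment, policy, and all names here are illustrative assumptions, not the paper's implementation.

```python
import random

def rollout(policy, env, max_steps=8):
    """Sample one trajectory under the POMDP formulation: the state is the
    interaction history, the policy picks the next step, and the rollout
    ends with a binary outcome reward Y(tau)."""
    state = [env["task"]]          # s_1 holds the task instance x
    for _ in range(max_steps):
        action = policy(state)     # alpha_t ~ pi_theta(. | s_t)
        state.append(action)       # the history grows with each step
        if action == env["goal"]:  # a reward state was reached
            return state, 1        # Y(tau) = 1
    return state, 0                # Y(tau) = 0

# Toy environment and policy (hypothetical, for illustration only).
env = {"task": "find:goal", "goal": "goal"}
policy = lambda s: random.choice(["search", "visit", "goal"])

random.seed(0)
# Monte Carlo estimate of J(theta) = E[Y(tau)] over repeated rollouts.
success_rate = sum(rollout(policy, env)[1] for _ in range(1000)) / 1000
```

Because the reward is binary, the estimate above is exactly an empirical success rate, which is what the group-based methods in the next section contrast across rollouts.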

![Refer to caption](https://arxiv.org/html/2605.12004v1/x2.png)

Figure 2: Overview of the ActGuide-RL framework. Conventional agentic RL can only obtain training signals within the base model's in-capability region. ActGuide-RL follows the principle of minimal intervention, dynamically introducing action data to guide the model toward out-region exploration. Such mixed rollouts are trained through mixed-policy optimization.

### 2.2 The Reachability Barrier in Agentic RL

To optimize the above objective, recent RL algorithms [[7](https://arxiv.org/html/2605.12004#bib.bib127), [49](https://arxiv.org/html/2605.12004#bib.bib29), [74](https://arxiv.org/html/2605.12004#bib.bib111)] often sample a group of $N$ rollout trajectories $\{\tau_i\}_{i=1}^{N}$ per task and compute advantages from the contrast between successful and failed ones. This mechanism works well when reward states lie within the in-capability region. However, when reward states fall into the out-region and become unreachable, no learning signal is obtained. We formalize this phenomenon through the concept of *reachability dynamics*.

###### Definition 2.1 (Reachability Dynamics).

Let $\Psi(s) := \sup_{\pi} \mathbb{P}_{\pi}(Y{=}1 \mid s)$ denote the least upper bound on the success probability achievable by any continuation policy from state $s$. We define the effective state-visiting mass

$$M_t^{\pi} := \mathbb{E}_{\pi}\left[\Psi(s_t)\right], \tag{2}$$

which measures the average remaining success potential along rollouts induced by policy $\pi$. The ratio $\bar{\kappa}_t^{\pi} := M_{t+1}^{\pi} / M_t^{\pi}$ quantifies the one-step reachability retention. By telescoping, the mass over any interval $[u, v)$ satisfies the multiplicative recursion

$$M_v^{\pi} = M_u^{\pi} \prod_{t=u}^{v-1} \bar{\kappa}_t^{\pi}. \tag{3}$$

A reachability barrier makes rollouts beyond step $b{+}m$ receive $Y(\tau){=}0$, collapsing the group-based advantage to zero gradient. *This confines the model to in-region training and prevents learning on out-region tasks.* Unlike insufficient sampling, this failure is structural, so increasing $N$ cannot help. The policy itself must first be steered across the critical interval, which motivates our method below.
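The zero-gradient collapse can be made concrete with a group-normalized advantage in the style of GRPO-family methods; this is a minimal sketch, and the paper's exact estimator may differ in details such as the normalization constant.

```python
def group_advantages(rewards, eps=1e-6):
    """Group-based advantage: normalize each rollout's binary reward by the
    group mean and standard deviation, so the learning signal comes from
    the contrast between successful and failed rollouts."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# In-region task: mixed outcomes within the group give a non-zero signal.
mixed = group_advantages([1, 0, 0, 1])

# Out-region task: every rollout fails (Y = 0 beyond the barrier), so all
# advantages collapse to zero and the gradient vanishes. Increasing the
# group size N only adds more zeros -- the failure is structural.
stalled = group_advantages([0, 0, 0, 0])
```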

### 2.3 From Barriers to Guidance: The ActGuide-RL Framework

To address the fundamental barrier in agentic RL, we propose ActGuide-RL, which uses action data as guidance, as illustrated in Figure [2](https://arxiv.org/html/2605.12004#S2.F2). ActGuide-RL is driven by three core questions along with two empirical findings: whether action data can repair reachability barriers (§[2.3.1](https://arxiv.org/html/2605.12004#S2.SS3.SSS1), Finding 1), how much guidance to inject (§[2.3.2](https://arxiv.org/html/2605.12004#S2.SS3.SSS2), Finding 2), and how to optimize from guided samples (§[2.3.3](https://arxiv.org/html/2605.12004#S2.SS3.SSS3)).

#### 2.3.1 How to Guide: Action Data Repairs Barriers

To explore whether action-only data can repair reachability barriers, we treat the action trajectory as a reference plan $g = (\tilde{\alpha}_1, \dots, \tilde{\alpha}_L)$ and condition the policy as $\pi_{\theta}(\cdot \mid s, g)$. We then compare the guided and unguided behavior along the guided rollout. Specifically, we measure:

$$\underbrace{\|\Delta\mathrm{Logit}\| = \left\|\mathrm{logit}_{\pi_{\theta}}(\cdot\mid s_t, g) - \mathrm{logit}_{\pi_{\theta}}(\cdot\mid s_t)\right\|}_{\text{token-level guidance influence}},\qquad \underbrace{\mathrm{Pass@K} = \mathbb{P}_{\tau_{1:K}\sim\pi_{\theta}(\cdot\mid s_t)}\left[\max_{i\leq K} Y(\tau_i) = 1\right]}_{\text{prefix-level reachability}} \tag{5}$$

where $\|\Delta\mathrm{Logit}\|$ computes the token logits difference between the guided policy $\pi_{\theta}(\cdot\mid s, g)$ and the unguided policy $\pi_{\theta}(\cdot\mid s)$, capturing how much guidance changes the policy locally. Prefix-level $\mathrm{Pass@K}$ instead samples unguided continuations from the current guided state $s_t$ and measures whether they can recover reward, reflecting the remaining reachability after that state.
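A rough sketch of how these two diagnostics could be estimated is given below; the logit vectors and outcome samplers are toy stand-ins for the actual model, and the numbers are purely illustrative.

```python
import math
import random

def delta_logit(logits_guided, logits_unguided):
    """Token-level guidance influence: L2 norm of the logit difference
    between the guided and unguided policy at the same state s_t."""
    return math.sqrt(sum((g - u) ** 2
                         for g, u in zip(logits_guided, logits_unguided)))

def prefix_pass_at_k(sample_outcome, k=32):
    """Prefix-level reachability: sample K unguided continuations from the
    current guided state and check whether any of them recovers reward."""
    return max(sample_outcome() for _ in range(k))

random.seed(0)
# Inside a barrier, guidance shifts the logits sharply...
in_barrier_shift = delta_logit([4.0, -2.0, 1.0], [0.5, 0.1, 0.2])
# ...while unguided continuations never succeed (Pass@K = 0).
passk_before = prefix_pass_at_k(lambda: 0)
# After the barrier is crossed, unguided continuations can recover reward.
passk_after = prefix_pass_at_k(lambda: int(random.random() < 0.5))
```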

![Refer to caption](https://arxiv.org/html/2605.12004v1/x3.png)

Figure 3: Action guidance repairs reachability barriers along guided rollouts. Blue bars show $\|\Delta\mathrm{Logit}\|$, and red curves show prefix-level Pass@K ($K{=}32$). Barriers emerge where unguided Pass@K collapses and the guidance-induced logit shift spikes.

**Finding 1: Action guidance repairs reachability barriers.** As shown in Figure [3](https://arxiv.org/html/2605.12004#S2.F3), easy tasks (where the model can discover reward from early guided states) already show non-zero Pass@K from early guided states, while harder tasks (where rewarding states become reachable only at much later guided states) keep zero unguided Pass@K until the guided trajectory crosses the barrier. Within these barrier intervals, $\|\Delta\mathrm{Logit}\|$ spikes sharply, showing that action trajectories diverge from the current policy exactly where it fails. After the barrier is crossed, unguided Pass@K recovers to non-trivial levels, showing that action guidance brings the policy to reachable reward states rather than simply replacing its decisions.

Motivated by Finding 1, we formally leverage action data ($g$) as the effective guidance signal and simply append it to the task prompt as a list of future reference actions (Appendix [8](https://arxiv.org/html/2605.12004#A2.F8)). This provides a non-intrusive reference plan, rather than forcing the model to generate the actions as a fixed prefix. Moreover, recognizing that different barriers may require varying amounts of guidance to cross, we organize guidance into an ordered family

$$g_0 = \varnothing \prec g_1 \prec \cdots \prec g_K, \tag{6}$$

where $g_k = (\tilde{\alpha}_1, \dots, \tilde{\alpha}_k)$ provides the first $k$ reference actions. This gives guidance a monotone strength parameter, which later allows us to search for the minimal sufficient intervention. For a barrier interval $[b, b{+}m{-}1]$ of the base policy $\pi_{\theta}(\alpha_t \mid s_t)$, we measure the *barrier-repair benefit* of guidance level $g_k$ by the increase of effective state-visiting mass after the barrier:

$$B_k := \log\frac{M_{b+m}^{\pi_{\theta}(\cdot\mid s,\, g_k)}}{M_{b+m}^{\pi_{\theta}(\cdot\mid s)}}, \tag{7}$$

where a larger $B_k$ implies that the guidance better preserves reachable success potential.
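The ordered family of Eq. (6) and the plan-style injection can be sketched as prefix truncation plus prompt appending; the prompt template below is a hypothetical illustration, not the paper's actual template (which is in its Appendix).

```python
def guidance_family(actions):
    """Ordered guidance family g_0 = () ≺ g_1 ≺ ... ≺ g_K, where g_k
    exposes the first k reference actions of the action trajectory."""
    return [tuple(actions[:k]) for k in range(len(actions) + 1)]

def guided_prompt(task, g_k):
    """Non-intrusive injection: append the reference actions to the task
    prompt as an optional plan, instead of forcing them as a generated
    prefix. (Template wording is illustrative only.)"""
    if not g_k:
        return task  # g_0 = empty guidance leaves the prompt unchanged
    plan = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(g_k))
    return f"{task}\n\nReference actions (plan, optional):\n{plan}"

family = guidance_family(["search('X')", "visit(url_1)", "extract answer"])
prompt_k2 = guided_prompt("Answer the question about X.", family[2])
```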

#### 2.3.2 How Much to Guide: Minimal Intervention Principle

While stronger guidance raises the barrier-repair benefit $B_k$ (Eq. [7](https://arxiv.org/html/2605.12004#S2.E7)), it also induces a larger distribution shift from the base policy, increasing the risk of off-policy optimization error [[57](https://arxiv.org/html/2605.12004#bib.bib294), [80](https://arxiv.org/html/2605.12004#bib.bib295)]. Let $\tau = (z_1, \ldots, z_{|\tau|})$ be the generated token sequence. To quantify the distribution shift under guidance level $g_k$, we measure the cumulative token-level log-ratio shift of a rollout $\tau$:

$$\mathcal{L}_k(\tau) := \sum_{j=1}^{|\tau|} \log\frac{\pi_{\theta}(z_j \mid z_{<j})}{\pi_{\theta}(z_j \mid z_{<j}, g_k)}. \tag{8}$$

The corresponding *off-policy risk* is the variance of this shift:

$$R_k := \mathrm{Var}_{\tau\sim\pi_{\theta}(\cdot\mid s,\, g_k)}\left(\mathcal{L}_k(\tau)\right). \tag{9}$$
![Refer to caption](https://arxiv.org/html/2605.12004v1/x4.png)

Figure 4: Guidance-induced distribution shift under different guidance proportions. The blue curve shows the mean log-ratio shift, while the red curve shows its variance, i.e., the off-policy risk $R_k$.

**Finding 2: Over-guidance inflates off-policy risk.** As shown in Figure [4](https://arxiv.org/html/2605.12004#S2.F4), the mean log-ratio shift (blue) and its variance (red) describe the guidance-induced distribution shift from complementary perspectives. As the guidance level $k$ increases, the off-policy risk $R_k$ keeps rising, indicating that stronger guidance makes guided rollouts increasingly unstable for off-policy optimization.

Motivated by Finding 2, we adopt a *minimal intervention principle*: for each task, use the least guidance level that recovers reward. This principle can be viewed as approximately maximizing a guidance utility $J_k = B_k - \lambda R_k$, where the barrier-repair benefit $B_k$ exhibits a sharp increase after reward recovery while the off-policy risk $R_k$ grows with the guidance level. In practice, we first collect an unguided rollout group per task, invoking guidance only as a fallback when the entire group fails. Under a mild monotonicity assumption (stronger levels do not decrease recovery probability), we can efficiently identify the smallest sufficient level $k^{\star}$ via binary search:

$$k^{\star} := \min\Bigl\{k \in \{1, \dots, K\} : \max_{i=1}^{N} Y(\tau_i^{(k)}) \geq \delta\Bigr\}, \tag{10}$$

where $\{\tau_i^{(k)}\}_{i=1}^{N}$ are $N$ rollouts under guidance $g_k$ and $\delta > 0$ is the success threshold. We denote the resulting adaptive guidance as $g_{\text{adap}} := g_{k^{\star}}$, which keeps guided rollouts close to the unguided distribution and enables the off-policy optimization studied next. Note that under binary rewards, $B_k$ exhibits threshold behavior (near zero until the barrier is crossed, then jumping sharply), while $R_k$ grows monotonically. The guidance utility $J_k$ therefore peaks near the minimal successful level, making the binary search in Eq. [10](https://arxiv.org/html/2605.12004#S2.E10) a practical proxy for approximately maximizing $J_k$.
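The adaptive fallback of Eq. (10) amounts to a bounded binary search over guidance levels; in this sketch, `group_recovers(k)` abstracts rolling out $N$ guided trajectories and checking $\max_i Y(\tau_i^{(k)}) \geq \delta$, and all names are illustrative.

```python
def minimal_guidance_level(K, group_recovers, budget=8):
    """Find the smallest guidance level k* in {1, ..., K} whose rollout
    group recovers reward, via binary search. Relies on the monotonicity
    assumption: if level k recovers reward, every level k' > k does too.
    Returns None if the budget runs out or even full guidance fails."""
    lo, hi, best = 1, K, None
    while lo <= hi and budget > 0:
        mid = (lo + hi) // 2
        budget -= 1                  # each probe spends N guided rollouts
        if group_recovers(mid):
            best, hi = mid, mid - 1  # recovered: try weaker guidance
        else:
            lo = mid + 1             # still failing: need stronger guidance
    return best

# Toy case: reward becomes reachable once at least 5 reference actions
# are provided (a hypothetical barrier threshold).
k_star = minimal_guidance_level(K=10, group_recovers=lambda k: k >= 5)
```

With 10 levels, four probes (k = 5, 2, 3, 4) suffice here instead of scanning all levels linearly, which is why a rollout budget of $O(\log K)$ probes is enough under the monotonicity assumption.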

#### 2.3.3 How to Learn: Off-Policy Internalization

Action guidance is available only at training time. At inference, the agent must act under the unguided policy $\pi_{\theta}(\cdot \mid x)$, so any learning signal extracted from guided rollouts has to be internalized. Since the guided policy $\pi_{\theta}(\cdot \mid x, g)$ shares parameters with the unguided one, we treat guided samples as off-policy data w.r.t. $\pi_{\theta}(\cdot \mid x)$ and optimize the mixed objective

$$\begin{aligned}
\mathcal{J}_{\mathrm{mix}}(\theta) = \mathbb{E}_{(x,\bar{g})\sim\mathcal{D},\ \mathcal{G}\sim q_{\theta_{\rm old}}^{\rm mix}(\cdot\mid x,\bar{g})}\Biggl[\,&\frac{1}{\sum_i T_i}\sum_{i=1}^{|\mathcal{G}|}\sum_{t=1}^{T_i}\min\Bigl(r_{i,t}^{\rm mix}(\theta)\,\hat{A}(\tau_i),\ \mathrm{clip}\bigl(r_{i,t}^{\rm mix}(\theta),\, 1{-}\epsilon,\, 1{+}\epsilon\bigr)\,\hat{A}(\tau_i)\Bigr) \\
&- \beta\,\frac{1}{|\mathcal{G}|}\sum_{i=1}^{|\mathcal{G}|}\mathbb{D}_{\mathrm{KL}}\bigl(\pi_{\theta}(\tau_i\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\tau_i\mid x)\bigr)\Biggr], \tag{11}
\end{aligned}$$

where $q_{\theta_{\rm old}}^{\rm mix}$ denotes the mixed rollout collection process in Algorithm [1](https://arxiv.org/html/2605.12004#algorithm1), $\hat{A}(\tau_i)$ is the group-based advantage, and the token-level importance ratio adapts to the rollout source:

$$r_{i,t}^{\rm mix}(\theta) = \begin{cases} \dfrac{\pi_{\theta}(z_{i,t}\mid z_{i,<t})}{\pi_{\theta_{\rm old}}(z_{i,t}\mid z_{i,<t})}, & \text{if } \tau_i \sim \pi_{\theta_{\rm old}}(\cdot\mid x), \\[10pt] \dfrac{\pi_{\theta}(z_{i,t}\mid z_{i,<t})}{\pi_{\theta_{\rm old}}(z_{i,t}\mid z_{i,<t},\, g_{\text{adap}})}, & \text{if } \tau_i \sim \pi_{\theta_{\rm old}}(\cdot\mid x,\, g_{\text{adap}}). \end{cases} \tag{12}$$

For unguided rollouts this is the standard importance ratio; for guided rollouts the denominator uses the guided distribution, transferring credit back to the unguided target $\pi_{\theta}(\cdot\mid x)$. Unlike prior off-policy RL methods that include ratio shaping [[66](https://arxiv.org/html/2605.12004#bib.bib230), [42](https://arxiv.org/html/2605.12004#bib.bib283)], we keep the optimization objective unchanged because minimal intervention limits the shift between guided rollouts and the base policy.
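The source-dependent ratio of Eq. (12) reduces to a small branch over which behavior log-probability goes in the denominator; this sketch uses hypothetical log-probability values for a single token.

```python
import math

def mixed_ratio(logp_new, logp_old_unguided, logp_old_guided, guided):
    """Token-level importance ratio from the mixed objective: the numerator
    is always the unguided target policy pi_theta(z_t | z_<t); the
    denominator is the behavior policy that actually produced the rollout,
    i.e., pi_old without guidance, or pi_old conditioned on g_adap for
    guided rollouts."""
    denom = logp_old_guided if guided else logp_old_unguided
    return math.exp(logp_new - denom)

# Toy log-probabilities for one token (hypothetical values): guidance made
# this token more likely under the behavior policy (-0.4 > -1.2).
r_unguided = mixed_ratio(-1.0, -1.2, -0.4, guided=False)  # standard ratio
r_guided = mixed_ratio(-1.0, -1.2, -0.4, guided=True)     # credit transfer
```

The guided branch down-weights tokens that guidance made easy for the behavior policy, which is exactly what transfers credit back to the unguided target distribution.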

**Algorithm 1: Adaptive Minimal-Intervention Training with Action Guidance**

**Input:** policy $\pi_{\theta}$, dataset $\mathcal{D} = \{(x, \bar{g})\}$, minibatch size $M$, group size $N$, threshold $\delta$, search budget $B$, steps $S$.

1. **for** $s = 1$ **to** $S$:
   1. Sample a minibatch $\mathcal{B} = \{(x_b, \bar{g}_b)\}_{b=1}^{M} \sim \mathcal{D}$; initialize $\mathcal{G} \leftarrow \emptyset$.
   2. **for each** $(x_b, \bar{g}_b) \in \mathcal{B}$:
      1. Define the guidance family $g_{b,k} = (\tilde{\alpha}_{b,1}, \ldots, \tilde{\alpha}_{b,k})$.
      2. Collect an unguided group $\mathcal{G}_b \leftarrow \{(\tau_{b,i}, r_{b,i})\}_{i=1}^{N}$ with $\tau_{b,i} \sim \pi_{\theta_{\rm old}}(\cdot \mid x_b)$ and $r_{b,i} = Y(\tau_{b,i})$.
      3. **if** $\max_i r_{b,i} < \delta$: find $k_b^{\star} \leftarrow \min\bigl\{k : \max_j r_{b,j}^{(k)} \geq \delta\bigr\}$ via binary search under budget $B$; **if** $k_b^{\star}$ exists, collect a guided group $\mathcal{G}_b^{+} \leftarrow \{(\tau_{b,i}^{+}, r_{b,i}^{+})\}_{i=1}^{N}$ with $\tau_{b,i}^{+} \sim \pi_{\theta_{\rm old}}(\cdot \mid x_b, g_{b,k_b^{\star}})$ and $r_{b,i}^{+} = Y(\tau_{b,i}^{+})$, and set $\mathcal{G}_b \leftarrow \mathcal{G}_b \cup \mathcal{G}_b^{+}$.
      4. $\mathcal{G} \leftarrow \mathcal{G} \cup \mathcal{G}_b$.
   3. Compute advantages on $\mathcal{G}$; update $\pi_{\theta}(\cdot \mid x)$ by maximizing $\mathcal{J}_{\mathrm{mix}}$.

## 3 Experiment

### 3.1 Experimental Setup

**Benchmarks.** To evaluate the effectiveness of our proposed ActGuide-RL in LLM agentic RL, we conduct experiments in the search-agent setting, which is stateless and facilitates the collection of action data. Our evaluation covers two categories of benchmarks. The first category is in-domain search-agent benchmarks, including four representative datasets, GAIA [[39](https://arxiv.org/html/2605.12004#bib.bib142)], WebWalkerQA [[63](https://arxiv.org/html/2605.12004#bib.bib143)], XBench [[5](https://arxiv.org/html/2605.12004#bib.bib145)], and BrowseComp-ZH (BC-ZH) [[83](https://arxiv.org/html/2605.12004#bib.bib298)], which span diverse difficulty levels, multiple languages, and real-world multi-step reasoning scenarios. The second category is out-of-domain benchmarks, including GPQA [[47](https://arxiv.org/html/2605.12004#bib.bib300)], TruthfulQA [[34](https://arxiv.org/html/2605.12004#bib.bib301)], and IFEval [[82](https://arxiv.org/html/2605.12004#bib.bib302)], which are used to evaluate the out-of-domain generalization ability of models beyond the search-agent setting. The detailed RL and SFT training data sources are provided in Appendix [A](https://arxiv.org/html/2605.12004#A1).

**Baselines.** Under the same evaluation protocol, we compare ActGuide-RL against several baselines, including foundation models [[40](https://arxiv.org/html/2605.12004#bib.bib306), [35](https://arxiv.org/html/2605.12004#bib.bib304), [51](https://arxiv.org/html/2605.12004#bib.bib303)], specialized search-agent-trained models [[29](https://arxiv.org/html/2605.12004#bib.bib117), [15](https://arxiv.org/html/2605.12004#bib.bib114), [31](https://arxiv.org/html/2605.12004#bib.bib39)], and vanilla RL trained from the same backbones without action guidance. For the RL baseline, we adopt the standard GRPO objective with token-level policy optimization, using the same training data but without action guidance.

**Implementation Details.** Following Tongyi-DeepResearch [[54](https://arxiv.org/html/2605.12004#bib.bib271)], we equip the agent with two tools, web-search and web-visit, whose schemas are included in the system prompt. Given the limited interaction budget and context length in our setup, we use raw tool outputs directly without a separate summary model. For both training reward and test-time evaluation, we adopt the few-shot, reference-based binary LLM-judge template from Tongyi-DeepResearch. Full implementation details are provided in Appendix [B](https://arxiv.org/html/2605.12004#A2).
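For concreteness, tool schemas of this kind are typically embedded in the system prompt roughly as follows; the field names and wording here are a hypothetical sketch, not the exact Tongyi-DeepResearch format.

```python
import json

# Hypothetical JSON-style schemas for the two tools named in the setup
# (web-search and web-visit); all field names are illustrative.
TOOLS = [
    {
        "name": "web-search",
        "description": "Search the web and return result snippets.",
        "parameters": {
            "query": {"type": "string", "description": "Search query."},
        },
    },
    {
        "name": "web-visit",
        "description": "Fetch a page and return its raw content "
                       "(no summary model is applied in this setup).",
        "parameters": {
            "url": {"type": "string", "description": "Page URL to visit."},
        },
    },
]

SYSTEM_PROMPT = ("You may call the following tools:\n"
                 + json.dumps(TOOLS, indent=2))
```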

Table 1: Main results of ActGuide-RL on search-agent benchmarks, comparing foundation models, search-agent-trained models, and the RL baseline (GAIA = General AI Assistant; WWQA = WebWalkerQA; all values are accuracy in %).

| Method | GAIA Lv.1 | GAIA Lv.2 | GAIA Lv.3 | GAIA Avg. | WWQA Easy | WWQA Med. | WWQA Hard | WWQA Avg. | XBench | BC-ZH |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Foundation Model* |  |  |  |  |  |  |  |  |  |  |
| MiniMax-M2.1 | - | - | - | 64.3 | - | - | - | - | 68.0 | 66.6 |
| DeepSeek-V3.2 | - | - | - | 75.1 | - | - | - | - | 78.0 | 65.0 |
| GPT-5 High | - | - | - | 76.4 | - | - | - | - | 77.0 | 65.0 |
| *Search-Agent-Trained Models* |  |  |  |  |  |  |  |  |  |  |
| WebSailor-7B | - | - | - | 37.9 | - | - | - | - | 34.0 | 14.2 |
| ARPO-8B | 53.9 | 32.7 | 16.7 | 38.8 | 26.7 | 33.3 | 29.6 | 30.5 | 25.0 | - |
| WebThinker-32B-RL | 56.4 | 50.0 | 16.7 | 48.5 | 58.8 | 44.6 | 40.4 | 46.5 | 24.0 | 7.3 |
| *Baseline and ActGuide-RL* |  |  |  |  |  |  |  |  |  |  |
| Qwen2.5-3B-Instruct | 15.38 | 7.69 | 0.00 | 9.71 | 5.00 | 7.14 | 4.58 | 5.73 | 8.00 | 2.08 |
| + RL | 15.38 | 7.69 | 16.66 | 11.65 | 15.00 | 15.00 | 15.83 | 15.29 | 10.00 | 2.42 |
| + ActGuide-RL | 28.21 | 11.54 | 16.66 | 18.45 | 18.75 | 16.07 | 22.08 | 18.82 | 16.00 | 4.50 |
| Δ | +12.83 | +3.85 | +0.00 | +6.80 | +3.75 | +1.07 | +6.25 | +3.53 | +6.00 | +2.08 |
| Qwen2.5-7B-Instruct | 35.89 | 15.38 | 8.33 | 22.32 | 18.75 | 19.28 | 16.25 | 18.09 | 19.00 | 4.50 |
| + RL | 20.51 | 7.69 | 0.00 | 11.65 | 14.37 | 20.35 | 19.58 | 18.67 | 22.00 | 4.84 |
| + ActGuide-RL | 41.02 | 17.30 | 8.33 | 25.24 | 24.37 | 21.07 | 21.66 | 22.05 | 24.00 | 8.31 |
| Δ | +20.51 | +9.61 | +8.33 | +13.59 | +10.00 | +0.72 | +2.08 | +3.38 | +2.00 | +3.47 |
| Qwen3-4B-Instruct | 17.94 | 17.30 | 0.00 | 15.53 | 8.75 | 3.57 | 0.83 | 3.82 | 14.00 | 7.96 |
| + RL | 33.33 | 25.00 | 0.00 | 25.24 | 13.12 | 13.92 | 9.17 | 12.06 | 18.00 | 15.26 |
| + ActGuide-RL | 46.15 | 32.69 | 16.66 | 35.92 | 43.75 | 41.78 | 35.00 | 39.85 | 37.00 | 20.41 |
| Δ | +12.82 | +7.69 | +16.66 | +10.68 | +22.50 | +30.63 | +25.83 | +27.79 | +19.00 | +5.15 |
| Qwen3-8B | 43.58 | 26.92 | 16.66 | 32.03 | 41.87 | 31.78 | 26.25 | 32.20 | 32.00 | 23.52 |
| + RL | 46.15 | 32.69 | 25.00 | 36.89 | 43.75 | 44.64 | 39.16 | 42.50 | 33.00 | 21.79 |
| + ActGuide-RL | 51.28 | 36.53 | 33.33 | 41.74 | 50.00 | 46.79 | 44.58 | 46.77 | 44.00 | 26.64 |
| Δ | +5.13 | +3.84 | +8.33 | +4.85 | +6.25 | +2.15 | +5.42 | +4.27 | +11.00 | +4.85 |

### 3.2 Main Results

Overall Comparison. Table [1](https://arxiv.org/html/2605.12004#S3.T1) reports overall accuracy on four in-domain benchmarks, from which three observations stand out.

- ActGuide-RL mitigates in-region RL capability regression. When the exploration difficulty of the RL training data does not match the base model, vanilla RL restricted to in-region exploration can cause partial performance regression on some benchmarks. For example, RL degrades Qwen2.5-7B-Instruct on GAIA and Qwen3-8B on BC-ZH, whereas ActGuide-RL alleviates these regressions through adaptive guidance and more effective state visitation.
- ActGuide-RL improves exploration beyond the current reachable region. When vanilla RL fails to access sufficiently effective states on harder tasks, action guidance helps the policy move beyond its current reachable region and enables more effective state visitation. This is most evident on Qwen3-4B-Instruct, where ActGuide-RL brings broad gains across all four benchmarks, with especially large improvements on WebWalker (12.06% → 39.85%) and XBench (18.00% → 37.00%).
- ActGuide-RL delivers stable gains across base models. For base models with different capability levels, action guidance adaptively helps the policy access more effective states on each training sample according to its difficulty. As a result, compared with vanilla RL, ActGuide-RL consistently improves all four base models, underscoring the strong adaptability of action guidance across capability levels.
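The adaptive-fallback behavior behind these observations can be sketched as a rollout routine. The trigger condition, prompt format, and function names below are simplified assumptions, not the paper's exact algorithm:

```python
def collect_rollouts(task, policy, n, guidance=None):
    """Sample n rollouts of (trajectory, reward); when given, `guidance`
    is injected into the prompt as a plan-style reference."""
    prompt = task["prompt"]
    if guidance is not None:
        prompt += "\n\nReference plan (from human action data):\n" + guidance
    return [policy(prompt) for _ in range(n)]

def adaptive_fallback_rollouts(task, policy, action_plan, n=8):
    """Minimal-intervention fallback: stay fully on-policy while the task
    is in-region, and inject action guidance only when the unguided
    rollouts find no reward at all (an out-of-region task)."""
    group = collect_rollouts(task, policy, n)
    if any(reward > 0 for _, reward in group):
        return group, False   # in-region: no intervention needed
    guided = collect_rollouts(task, policy, n, guidance=action_plan)
    return group + guided, True  # fallback: pool guided and unguided rollouts
```

Because guidance is invoked only when unguided exploration yields no signal, its strength automatically scales with per-sample difficulty across base models of different capability.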

Comparison with SFT + RL. Another common strategy for addressing training stalls caused by limited policy exploration is a targeted SFT cold start. To further analyze the role of ActGuide-RL relative to the SFT + RL paradigm, we also initialize the policy with an SFT cold start constructed by partially distilling Tongyi-DeepResearch-30B-A3B. This setting aims to explore a new possibility beyond the standard SFT + RL pipeline through action-level guidance, rather than merely pursuing performance improvements over a comprehensive SFT baseline. As shown in Table [2](https://arxiv.org/html/2605.12004#S3.T2), even without any cold start, ActGuide-RL achieves performance comparable to the two-stage SFT+RL pipeline. Moreover, when built on the same cold-start-initialized model, ActGuide-RL can still obtain additional gains from action guidance. Meanwhile, due to the mode-covering nature of SFT, cold-start initialization often degrades out-of-domain performance, as shown by the consistent drops on GPQA-CoT (Zero Shot), TruthfulQA, and IFEval, whereas no such degradation occurs for ActGuide-RL in the zero-RL setting.

Overall, ActGuide-RL offers a new alternative paradigm for agentic RL, alleviating the dependence on heavy SFT data through the use of lighter-weight action data instead.

Table 2: Comparison of ActGuide-RL and SFT + RL on in-domain (GAIA, WebWalker, XBench, BC-ZH) and out-of-domain (GPQA-CoT (ZS), TruthQA, IFEval) benchmarks.

| Method | GAIA | WebWalker | XBench | BC-ZH | GPQA-CoT (ZS) | TruthQA | IFEval |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Zero RL | 25.24 | 12.06 | 18.00 | 15.26 | 35.45 | 62.17 | 81.33 |
| + ActGuide | 35.92 | 39.85 | 37.00 | 20.41 | 36.93 | 62.30 | 82.99 |
| SFT | 34.95 | 31.18 | 25.00 | 25.61 | 29.15 | 56.95 | 77.82 |
| + RL | 36.89 | 32.20 | 17.00 | 26.30 | 29.85 | 57.02 | 76.34 |
| + ActGuide | 40.77 | 37.06 | 25.00 | 28.02 | 29.57 | 57.11 | 77.43 |

### 3.3 Further Analysis and Ablation

Training Dynamics. To further analyze the effect of action guidance on training dynamics, we track the proportion of rollout groups that provide effective learning signals during training, as shown in Figure [5](https://arxiv.org/html/2605.12004#S3.F5). Specifically, we find that action data helps the policy discover effective training signals in a higher proportion of samples, while the unguided baseline is frequently hindered by exploration barriers and therefore wastes many rollouts on ineffective state visitation. This suggests that ActGuide-RL improves exploration beyond the current reachable region, allowing the policy to learn from out-region tasks.

![Refer to caption](https://arxiv.org/html/2605.12004v1/x5.png)

Figure 5: Dynamics of trainable groups.
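Reading "effective learning signal" as a group whose rewards are not all identical (under group-relative, GRPO-style advantages, a group of all-failures or all-successes has zero advantage and contributes no gradient), the tracked quantity can be computed as below; this metric definition is our interpretation, not a stated formula from the paper:

```python
def trainable_fraction(groups):
    """Fraction of rollout groups that yield a learning signal.

    `groups` is a list of per-task reward lists. A group whose rewards
    are all identical has zero group-relative advantage, so it is not
    trainable; the remaining groups carry gradient signal.
    """
    if not groups:
        return 0.0
    trainable = sum(1 for rewards in groups if len(set(rewards)) > 1)
    return trainable / len(groups)
```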

![Refer to caption](https://arxiv.org/html/2605.12004v1/x6.png)

Figure 6: Training dynamics of the number of interaction turns and response length.

Table 3: Agent performance under different interaction-turn budgets.

| Turn Budget | GAIA | WebWalker | XBench | BC-ZH |
| --- | --- | --- | --- | --- |
| 2 | 0.97 | 9.26 | 5.00 | 1.04 |
| 4 | 18.44 | 33.97 | 33.00 | 4.84 |
| 8 | 19.41 | 35.00 | 33.00 | 16.96 |
| 16 | 27.18 | 37.55 | 35.00 | 17.99 |
| 32 | 35.92 | 39.85 | 37.00 | 20.41 |

Towards Complex Interaction. A central challenge of agentic RL without a cold start is that the policy struggles to acquire complex interaction skills from its in-region tasks. Fortunately, we find that ActGuide-RL enables even a small model such as Qwen3-4B-Instruct, without any cold-start initialization, to gradually acquire complex interaction capability, as reflected by the steady increase in the number of interaction turns and generated tokens over training in Figure [6](https://arxiv.org/html/2605.12004#S3.F6). To further verify whether these increased interactions are indeed effective, we vary the interaction budget at evaluation time and observe in Table [3](https://arxiv.org/html/2605.12004#S3.T3) that performance consistently improves as the budget increases.
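A turn-budgeted agent loop of the kind evaluated here can be sketched as follows; the interface (`llm`, `tools`, the forced-answer prompt) is hypothetical, chosen only to show where the budget caps interaction:

```python
def run_agent(llm, tools, question, max_turns=32):
    """Minimal tool-use loop under a hard interaction-turn budget
    (the budget varied in Table 3). `llm` maps the history to either
    ("tool", name, args) or ("answer", text)."""
    history = [("user", question)]
    for _ in range(max_turns):
        step = llm(history)
        if step[0] == "answer":
            return step[1]
        _, name, args = step
        history.append(("tool_result", tools[name](**args)))
    # Budget exhausted: force a final answer from the context gathered so far.
    return llm(history + [("user", "Answer now with what you have.")])[1]
```

With a small `max_turns`, the policy is cut off before it can gather enough evidence, which is consistent with the monotone gains in Table 3 as the budget grows.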

Ablation Study on ActGuide-RL. We conduct ablation studies on several key design choices in ActGuide-RL, including the adaptive guidance mechanism, the fallback guidance, and mixed-policy optimization. As shown in Table [4](https://arxiv.org/html/2605.12004#S3.T4), removing either the adaptive or the fallback guidance mechanism causes performance degradation to different extents. We further compare fixed guidance ratios in Figure [7](https://arxiv.org/html/2605.12004#S3.F7), and again find that dynamic guidance performs best. These results indicate that action guidance is not effective simply because more guidance is provided, nor is less always better; rather, the best performance comes from introducing guidance minimally and adaptively according to the policy's capability. Removing mixed-policy optimization also causes a substantial performance drop, since it breaks the pathway that transfers behaviors acquired under guidance into the test-time unguided capability.

Table 4: Ablation study of ActGuide-RL.

| Method | GAIA | WebWalker | XBench |
| --- | --- | --- | --- |
| ActGuide-RL | 35.92 | 39.85 | 37.00 |
| − Minimal-Intervention (Adaptive) | 27.18 | 35.00 | 34.00 |
| − Minimal-Intervention (Fallback) | 24.27 | 23.82 | 19.00 |
| − Mixed-Policy Optimization | 22.32 | 21.76 | 21.00 |

![Refer to caption](https://arxiv.org/html/2605.12004v1/x7.png)

Figure 7: Performance under different guidance strengths.
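One simplified way to realize mixed-policy optimization over pooled rollouts is a shared group baseline with down-weighted guided samples, in the spirit of LUFFY-style importance shaping. This sketch is an assumption about the mechanism, not the paper's exact objective:

```python
def mixed_policy_advantages(rewards, guided_mask, off_policy_weight=0.5):
    """Group-relative advantages over pooled guided + unguided rollouts.

    Guided rollouts come from a different (guidance-conditioned)
    distribution, so their advantages are shrunk by `off_policy_weight`
    to limit off-policy risk, while the shared baseline lets successful
    guided rollouts raise the learning signal for the unguided ones.
    """
    mean = sum(rewards) / len(rewards)
    return [
        (r - mean) * (off_policy_weight if guided else 1.0)
        for r, guided in zip(rewards, guided_mask)
    ]
```

Dropping this joint update (optimizing guided and unguided rollouts separately) would sever the transfer pathway, which matches the large drop in the last row of Table 4.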

Sensitivity to Action Noise. When considering scaling up the collection of action data, an important factor is data noise, as human demonstrations may contain a substantial number of meaningless or irrelevant actions while completing certain tasks. Here we simulate such noise by randomly inserting task-irrelevant actions into the original per-sample action trajectories and then performing the same ActGuide-RL training. As shown in Table [5](https://arxiv.org/html/2605.12004#S3.T5), ActGuide-RL is not overly sensitive to action noise: it maintains stable performance under a 10% noise ratio, while a further increase to 20% leads to a performance drop.

Table 5: Results under different action-noise ratios.

| Noise Ratio | GAIA | WebWalker | XBench | BC-ZH |
| --- | --- | --- | --- | --- |
| 0% | 35.92 | 39.85 | 37.00 | 20.41 |
| 10% | 39.81 | 39.26 | 38.00 | 19.03 |
| 20% | 29.12 | 37.94 | 35.00 | 17.64 |
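The noise-injection protocol can be sketched as below. Interpreting "noise ratio" as the fraction of irrelevant actions in the corrupted trajectory is our assumption:

```python
import random

def inject_action_noise(actions, irrelevant_pool, noise_ratio, rng=None):
    """Randomly insert task-irrelevant actions into an action trajectory,
    simulating noisy human demonstrations (Table 5).

    `noise_ratio` is taken to be the fraction of irrelevant actions in
    the corrupted trajectory, so k = round(n * r / (1 - r)) actions are
    inserted into a trajectory of length n. Original order is preserved.
    """
    rng = rng or random.Random(0)  # seeded for reproducibility
    n_noise = round(len(actions) * noise_ratio / (1.0 - noise_ratio))
    noisy = list(actions)
    for _ in range(n_noise):
        pos = rng.randrange(len(noisy) + 1)
        noisy.insert(pos, rng.choice(irrelevant_pool))
    return noisy
```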

## 4 Related Work

### 4.1 Agentic RL

Recent advancements in RL [[49](https://arxiv.org/html/2605.12004#bib.bib29), [74](https://arxiv.org/html/2605.12004#bib.bib111), [48](https://arxiv.org/html/2605.12004#bib.bib269), [7](https://arxiv.org/html/2605.12004#bib.bib127), [17](https://arxiv.org/html/2605.12004#bib.bib30)] enable end-to-end training of agents that can interact with environments, make sequential decisions, and optimize toward long-horizon objectives. This makes agentic RL a pivotal paradigm for both foundation-model capability building [[43](https://arxiv.org/html/2605.12004#bib.bib250), [1](https://arxiv.org/html/2605.12004#bib.bib251), [52](https://arxiv.org/html/2605.12004#bib.bib243), [53](https://arxiv.org/html/2605.12004#bib.bib249)] and domain-specific agent post-training [[28](https://arxiv.org/html/2605.12004#bib.bib18), [25](https://arxiv.org/html/2605.12004#bib.bib313), [73](https://arxiv.org/html/2605.12004#bib.bib270), [54](https://arxiv.org/html/2605.12004#bib.bib271), [8](https://arxiv.org/html/2605.12004#bib.bib272)]. Since effective agentic RL depends strongly on the base model's ability to explore valid training signals, existing methods often rely on a cold start before RL or on alternating SFT and RL to dynamically align model capabilities with the target tasks [[13](https://arxiv.org/html/2605.12004#bib.bib253), [44](https://arxiv.org/html/2605.12004#bib.bib276), [4](https://arxiv.org/html/2605.12004#bib.bib275), [6](https://arxiv.org/html/2605.12004#bib.bib274), [11](https://arxiv.org/html/2605.12004#bib.bib273)]. Some works instead adopt dynamic task scheduling [[71](https://arxiv.org/html/2605.12004#bib.bib277), [20](https://arxiv.org/html/2605.12004#bib.bib280), [76](https://arxiv.org/html/2605.12004#bib.bib279)] or curriculum learning [[30](https://arxiv.org/html/2605.12004#bib.bib278), [26](https://arxiv.org/html/2605.12004#bib.bib281)] to ensure that the difficulty of training tasks is well matched to the evolving capabilities of the model.
A line of work most closely related to ours constructs curriculum learning examples from existing SFT data [[72](https://arxiv.org/html/2605.12004#bib.bib260), [59](https://arxiv.org/html/2605.12004#bib.bib282)], or directly uses such data as hints to guide the model toward obtaining meaningful learning signals on hard tasks [[22](https://arxiv.org/html/2605.12004#bib.bib263), [42](https://arxiv.org/html/2605.12004#bib.bib283), [64](https://arxiv.org/html/2605.12004#bib.bib262)]. Unlike these approaches, ActGuide-RL leverages more readily available action data, offering greater practical value and stronger scaling potential.

### 4.2 RL from Demonstration

Our work is also related to reinforcement learning from demonstrations (RLfD) [[41](https://arxiv.org/html/2605.12004#bib.bib287), [33](https://arxiv.org/html/2605.12004#bib.bib288)], where demonstrations usually take the form of expert trajectories, typically full reasoning-and-action traces in agent settings. Classical RLfD methods often use demonstration trajectories to bootstrap exploration in sparse-reward settings, for example by retaining them in the replay buffer and combining RL updates with auxiliary imitation losses [[46](https://arxiv.org/html/2605.12004#bib.bib284), [21](https://arxiv.org/html/2605.12004#bib.bib285), [58](https://arxiv.org/html/2605.12004#bib.bib286)]. Following a similar intuition, several recent LLM studies incorporate off-policy expert trajectories into online RL to mitigate sparse-reward and hard-exploration challenges [[18](https://arxiv.org/html/2605.12004#bib.bib293), [32](https://arxiv.org/html/2605.12004#bib.bib292), [78](https://arxiv.org/html/2605.12004#bib.bib290), [37](https://arxiv.org/html/2605.12004#bib.bib291)]. Specifically, LUFFY [[67](https://arxiv.org/html/2605.12004#bib.bib289)] incorporates off-policy expert trajectories into online RL through mixed-policy optimization, using regularized importance shaping to avoid rigid imitation. Guide [[42](https://arxiv.org/html/2605.12004#bib.bib283)] injects adaptive hint-guided off-policy trajectories into online RL, reweighting them to improve exploration while training a policy that no longer relies on hints at inference time. Unlike these demonstration-based approaches, this work focuses on learning an agentic policy from action guidance, with minimal intervention that adapts to tasks of different difficulty.

## 5 Conclusion

We present ActGuide-RL, a framework that leverages readily available action data as plan-style guidance to help agentic RL overcome exploration barriers beyond the base policy's reachable region. By introducing guidance only as an adaptive fallback and optimizing guided and unguided rollouts jointly, ActGuide-RL internalizes exploration gains while reducing the off-policy risks of excessive intervention. Across search-agent benchmarks, these design choices yield consistent gains over vanilla RL and performance comparable to SFT+RL, without requiring a supervised cold start. Further analyses show that these gains are accompanied by more effective multi-step interaction and arise from adaptive, minimally intrusive guidance rather than simply stronger intervention. These findings suggest that scalable action-only traces can serve as a practical post-training signal for complex agentic interaction, complementing or partially replacing costly supervised demonstrations.

## References

- [1] Anthropic (2026). Claude Opus 4.6 model card. https://www-cdn.anthropic.com/bf10f64990cfda0ba858290be7b8cc6317685f47.pdf
- [2] V. Barres, H. Dong, S. Ray, X. Si, and K. Narasimhan (2025). tau2-Bench: Evaluating Conversational Agents in a Dual-Control Environment. arXiv preprint arXiv:2506.07982.
- [3] M. Caccia, M. Thakkar, L. Boisvert, T. L. S. De Chezelles, A. Piché, N. Chapados, A. Drouin, M. Gasse, and A. Lacoste (2024). Fine-tuning web agents: it works, but it's trickier than you think. In NeurIPS 2024 Workshop on Open-World Agents.
- [4] H. Chen, H. Tu, F. Wang, H. Liu, X. Tang, X. Du, Y. Zhou, and C. Xie (2025). SFT or RL? An early investigation into training R1-like reasoning large vision-language models. arXiv preprint arXiv:2504.11468.
- [5] K. Chen, Y. Ren, Y. Liu, X. Hu, H. Tian, T. Xie, F. Liu, H. Zhang, H. Liu, Y. Gong, C. Sun, H. Hou, H. Yang, J. Pan, J. Lou, J. Mao, J. Liu, J. Li, K. Liu, K. Liu, R. Wang, R. Li, T. Niu, W. Zhang, W. Yan, X. Wang, Y. Zhang, Y. Hung, Y. Jiang, Z. Liu, Z. Yin, Z. Ma, and Z. Mo (2025). Xbench: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations. arXiv preprint arXiv:2506.13651.
- [6] L. Chen, X. Han, L. Shen, J. Bai, and K. Wong (2025). Beyond two-stage training: cooperative SFT and RL for LLM reasoning. arXiv preprint arXiv:2509.06948.
- [7] X. Chu, H. Huang, X. Zhang, F. Wei, and Y. Wang (2026). GPG: a simple and strong reinforcement learning baseline for model reasoning. In The Fourteenth International Conference on Learning Representations.
- [8] Z. Chu, X. Wang, J. Hong, H. Fan, Y. Huang, Y. Yang, G. Xu, C. Zhao, C. Xiang, S. Hu, et al. (2026). RedSearcher: a scalable and cost-efficient framework for long-horizon search agents. arXiv preprint arXiv:2602.14234.
- [9] Y. Dai, Y. Ji, X. Zhang, Y. Wang, X. Chu, and Z. Lu (2026). Harder is better: boosting mathematical reasoning via difficulty-aware GRPO and multi-aspect question reformulation. In The Fourteenth International Conference on Learning Representations.
- [10] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su (2023). Mind2Web: towards a generalist agent for the web. Advances in Neural Information Processing Systems 36, pp. 28091–28114.
- [11] Y. Deng, H. Bansal, F. Yin, N. Peng, W. Wang, and K. Chang (2025). OpenVLThinker: complex vision-language reasoning via iterative SFT-RL cycles. arXiv preprint arXiv:2503.17352.
- [12] S. Ding, X. Dai, L. Xing, S. Ding, Z. Liu, J. Yang, P. Yang, Z. Zhang, X. Wei, Y. Ma, H. Duan, J. Shao, J. Wang, D. Lin, K. Chen, and Y. Zang (2026). WildClawBench. GitHub repository: https://github.com/InternLM/WildClawBench
- [13] G. Dong, Y. Chen, X. Li, J. Jin, H. Qian, Y. Zhu, H. Mao, G. Zhou, Z. Dou, and J. Wen (2025). Tool-Star: empowering LLM-brained multi-tool reasoner via reinforcement learning. arXiv preprint arXiv:2505.16410.
- [14] G. Dong, J. Lu, J. Huang, W. Zhong, L. Liu, S. Huang, Z. Li, Y. Zhao, X. Song, X. Li, et al. (2026). Agent-World: scaling real-world environment synthesis for evolving general agent intelligence. arXiv preprint arXiv:2604.18292.
- [15] G. Dong, H. Mao, K. Ma, L. Bao, Y. Chen, Z. Wang, Z. Chen, J. Du, H. Wang, F. Zhang, G. Zhou, Y. Zhu, J. Wen, and Z. Dou (2025). Agentic Reinforced Policy Optimization. arXiv preprint arXiv:2507.19849.
- [16] L. E. Erdogan, N. Lee, S. Kim, S. Moon, H. Furuta, G. Anumanchipalli, K. Keutzer, and A. Gholami (2025). Plan-and-Act: improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572.
- [17] L. Feng, Z. Xue, T. Liu, and B. An (2025). Group-in-Group Policy Optimization for LLM Agent Training. arXiv preprint arXiv:2505.10978.
- [18] Y. Fu, T. Chen, J. Chai, X. Wang, S. Tu, G. Yin, W. Lin, Q. Zhang, Y. Zhu, and D. Zhao (2025). SRFT: a single-stage method with supervised and reinforcement fine-tuning for reasoning. arXiv preprint arXiv:2506.19767.
- [19] J. Gao, W. Fu, M. Xie, S. Xu, C. He, Z. Mei, B. Zhu, and Y. Wu (2025). Beyond ten turns: unlocking long-horizon agentic search with large-scale asynchronous RL. arXiv preprint arXiv:2508.07976.
- [20] Z. Gu, J. Light, R. Astudillo, Z. Ye, L. He, H. P. Zou, W. Cheng, S. Paternain, P. S. Yu, and Y. Yue (2026). Actor-Curator: co-adaptive curriculum learning via policy-improvement bandits for RL post-training. arXiv preprint arXiv:2602.20532.
- [21] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. (2018). Deep Q-learning from demonstrations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
- [22] Q. Huang, W. Dai, J. Liu, W. He, H. Jiang, M. Song, J. Chen, C. Yao, and J. Song (2025). Boosting MLLM reasoning with text-debiased Hint-GRPO. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4848–4857.
- [23] J. Hübotter, F. Lübeck, L. Behric, A. Baumann, M. Bagatella, D. Marta, I. Hakimi, I. Shenfeld, T. K. Buening, C. Guestrin, et al. (2026). Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802.
- [24] Y. Ji, Z. Ma, Y. Wang, G. Chen, X. Chu, and L. Wu (2026). Tree search for LLM agent reinforcement learning. In The Fourteenth International Conference on Learning Representations.
- [25] Y. Ji, Y. Wang, Z. Ma, Y. Hu, H. Huang, X. Hu, G. Chen, L. Wu, and X. Chu (2026). Thinking with map: reinforced parallel map-augmented agent for geolocalization. In ACL.
- [26] G. Jiang, W. Feng, G. Quan, C. Hao, Y. Zhang, G. Liu, and H. Wang (2025). VCRL: variance-based curriculum reinforcement learning for large language models. arXiv preprint arXiv:2509.19803.
- [27] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan (2023). SWE-bench: can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.
- [28] B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han (2025). Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. arXiv preprint arXiv:2503.09516.
- [29] K. Li, Z. Zhang, H. Yin, L. Zhang, L. Ou, J. Wu, W. Yin, B. Li, Z. Tao, X. Wang, W. Shen, J. Zhang, D. Zhang, X. Wu, Y. Jiang, M. Yan, P. Xie, F. Huang, and J. Zhou (2025). WebSailor: Navigating Super-human Reasoning for Web Agent. arXiv preprint arXiv:2507.02592.
- [30] R. Li, H. Huang, F. Wei, F. Xiong, Y. Wang, and X. Chu (2026). AdaCuRL: adaptive curriculum reinforcement learning with invalid sample mitigation and historical revisiting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40, pp. 23123–23131.
- [31] X. Li, J. Jin, G. Dong, H. Qian, Y. Zhu, Y. Wu, J. Wen, and Z. Dou (2025). WebThinker: Empowering Large Reasoning Models with Deep Research Capability. arXiv preprint arXiv:2504.21776.
- [32] J. Liang, H. Tang, Y. Ma, J. Liu, Y. Zheng, S. Hu, L. Bai, and J. Hao (2025). Squeeze the soaked sponge: efficient off-policy reinforcement finetuning for large language model. arXiv preprint arXiv:2507.06892.
- [33] G. Libardi, G. De Fabritiis, and S. Dittert (2021). Guided exploration with proximal policy optimization using a single demonstration. In International Conference on Machine Learning, pp. 6611–6620.
- [34] S. Lin, J. Hilton, and O. Evans (2022). TruthfulQA: measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214–3252.
- [35] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024). DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.
- [36] J. Luo, W. Zhang, Y. Yuan, Y. Zhao, J. Yang, Y. Gu, B. Wu, B. Chen, Z. Qiao, Q. Long, et al. (2025). Large language model agent: a survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460.
- [37] L. Ma, H. Liang, M. Qiang, L. Tang, X. Ma, Z. H. Wong, J. Niu, C. Shen, R. He, Y. Li, et al. (2025). Learning what reinforcement learning can't: interleaved online fine-tuning for hardest questions. arXiv preprint arXiv:2506.07527.
- [38] Z. Ma, S. Yang, Y. Ji, X. Wang, Y. Wang, Y. Hu, T. Huang, and X. Chu (2026). SkillClaw: let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377.
- [39] G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom (2023). GAIA: a benchmark for General AI Assistants. arXiv preprint arXiv:2311.12983.
- [40] MiniMax (2025). MiniMax M2.1 system card. https://www.minimax.io/news/minimax-m21
- [41] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018). Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299.
- [42] V. Nath, E. Lau, A. Gunjal, M. Sharma, N. Baharte, and S. Hendryx (2025). Adaptive guidance accelerates reinforcement learning of reasoning models. arXiv preprint arXiv:2506.13923.
- [43] OpenAI (2025). GPT-5.4 Thinking system card. https://openai.com/index/gpt-5-4-thinking-system-card/
- [44] R. Y. Pang, W. Yuan, K. Cho, H. He, S. Sukhbaatar, and J. Weston (2024). Iterative reasoning preference optimization. Advances in Neural Information Processing Systems 37, pp. 116617–116637.
- [45] Y. Qin, Y. Ye, J. Fang, H. Wang, S. Liang, S. Tian, J. Zhang, J. Li, Y. Li, S. Huang, et al. (2025). UI-TARS: pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326.
- [46] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine (2017). Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087.
- [47] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023). GPQA: a graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022.
- [48] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- [49] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
- [50] I. Shenfeld, M. Damani, J. Hübotter, and P. Agrawal (2026). Self-Distillation Enables Continual Learning. arXiv preprint arXiv:2601.19897.
- [51] A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. (2025). OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.
- [52] K. Team, T. Bai, Y. Bai, Y. Bao, S. Cai, Y. Cao, Y. Charles, H. Che, C. Chen, G. Chen, et al. (2026). Kimi K2.5: visual agentic intelligence. arXiv preprint arXiv:2602.02276.
- [53] Q. Team (2026). Qwen3.5: accelerating productivity with native multimodal agents. https://qwen.ai/blog?id=qwen3.5
- [54] T. D. Team, B. Li, B. Zhang, D. Zhang, F. Huang, G. Li, G. Chen, H. Yin, J. Wu, J. Zhou, et al. (2025). Tongyi DeepResearch technical report. arXiv preprint arXiv:2510.24701.
- [55] T. D. Team, B. Li, B. Zhang, D. Zhang, F. Huang, G. Li, G. Chen, H. Yin, J. Wu, J. Zhou, et al. (2025). Tongyi DeepResearch technical report. arXiv preprint arXiv:2510.24701.
- [56] M. Turpin, J. Michael, E. Perez, and S. Bowman (2023). Language models don't always say what they think: unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems 36, pp. 74952–74965.
- [57] H. Van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil (2018). Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648.
- [58] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller (2017). Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817.
- [59] W. Wang, X. Xu, W. An, F. Dai, W. Gao, Y. He, J. Huang, Q. Ji, H. Jin, X. Li, et al. (2025). Let it flow: agentic crafting on rock and roll, building the ROME model within an open agentic learning ecosystem. arXiv preprint arXiv:2512.24873.
- [60] Y. Wang, X. Chen, X. Jin, M. Wang, and L. Yang (2026). OpenClaw-RL: train any agent simply by talking. arXiv preprint arXiv:2603.10165.
- [61] Z. Wang, K. Wang, Q. Wang, P. Zhang, L. Li, Z. Yang, X. Jin, K. Yu, M. N. Nguyen, L. Liu, E. Gottlieb, Y. Lu, K. Cho, J. Wu, L. Fei-Fei, L. Wang, Y. Choi, and M. Li (2025). RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning. arXiv preprint arXiv:2504.20073.
- [62] T. Wei, T. Li, Z. Liu, X. Ning, Z. Yang, J. Zou, Z. Zeng, R. Qiu, X. Lin, D. Fu, et al. (2026). Agentic reasoning for large language models. arXiv preprint arXiv:2601.12538.
- [63] J. Wu, W. Yin, Y. Jiang, Z. Wang, Z. Xi, R. Fang, L. Zhang, Y. He, D. Zhou, P. Xie, and F. Huang (2025). WebWalker: Benchmarking LLMs in Web Traversal. arXiv preprint arXiv:2501.07572.
- [64] Y. Wu, S. Li, Z. Wen, X. Zhou, A. Talwalkar, Y. Yang, W. Huang, and T. Cai (2026). Learn hard problems during RL with reference guided fine-tuning. arXiv preprint arXiv:2603.01223.
- [65] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu (2024). OSWorld: benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972.
- [66] J. Yan, Y. Li, Z. Hu, Z. Wang, G. Cui, X. Qu, Y. Cheng, and Y. Zhang (2025). Learning to Reason under Off-Policy Guidance. arXiv preprint arXiv:2504.14945.
- \[67\]J\. Yan, Y\. Li, Z\. Hu, Z\. Wang, G\. Cui, X\. Qu, Y\. Cheng, and Y\. Zhang\(2025\)Learning to reason under off\-policy guidance\.arXiv preprint arXiv:2504\.14945\.Cited by:[§4\.2](https://arxiv.org/html/2605.12004#S4.SS2.p1.1)\.
- \[68\]R\. Yang, Q\. Wu, Z\. Wang, H\. Chen, K\. Yang, H\. Cheng, H\. Yao, B\. Peng, H\. Zhang, J\. Gao,et al\.\(2026\)GUI\-libra: training native gui agents to reason and act with action\-aware supervision and partially verifiable rl\.arXiv preprint arXiv:2602\.22190\.Cited by:[§1](https://arxiv.org/html/2605.12004#S1.p3.1)\.
- \[69\]S\. Yao, N\. Shinn, P\. Razavi, and K\. Narasimhan\(2024\)tau bench: A Benchmark for Tool\-Agent\-User Interaction in Real\-World Domains\.arXiv preprint arXiv:2406\.12045\.Cited by:[§1](https://arxiv.org/html/2605.12004#S1.p1.1)\.
- \[70\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao\(2022\)React: synergizing reasoning and acting in language models\.InThe eleventh international conference on learning representations,Cited by:[§1](https://arxiv.org/html/2605.12004#S1.p1.1)\.
- \[71\]Z\. Yao, Y\. Zhang, Y\. Chen, Y\. Sun, Z\. Xu, Y\. Yang, T\. Hu, Q\. Gu, H\. Su, and X\. Cai\(2026\)CoBA\-rl: capability\-oriented budget allocation for reinforcement learning in llms\.arXiv preprint arXiv:2602\.03048\.Cited by:[§4\.1](https://arxiv.org/html/2605.12004#S4.SS1.p1.1)\.
- \[72\]J\. Yi, D\. Mosk\-Aoyama, B\. Huang, R\. Gala, C\. Wang, S\. D\. Devare, K\. Bhardwaj, A\. Gupta, O\. Kuchaiev, J\. Jiao,et al\.\(2026\)PivotRL: high accuracy agentic post\-training at low compute cost\.arXiv preprint arXiv:2603\.21383\.Cited by:[§4\.1](https://arxiv.org/html/2605.12004#S4.SS1.p1.1)\.
- \[73\]A\. Yu, L\. Yao, J\. Liu, Z\. Chen, J\. Yin, Y\. Wang, X\. Liao, Z\. Ye, J\. Li, Y\. Yue,et al\.\(2025\)Medresearcher\-r1: expert\-level medical deep researcher via a knowledge\-informed trajectory synthesis framework\.arXiv preprint arXiv:2508\.14880\.Cited by:[§4\.1](https://arxiv.org/html/2605.12004#S4.SS1.p1.1)\.
- \[74\]Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu, X\. Liu, H\. Lin, Z\. Lin, B\. Ma, G\. Sheng, Y\. Tong, C\. Zhang, M\. Zhang, W\. Zhang, H\. Zhu, J\. Zhu, J\. Chen, J\. Chen, C\. Wang, H\. Yu, Y\. Song, X\. Wei, H\. Zhou, J\. Liu, W\. Ma, Y\. Zhang, L\. Yan, M\. Qiao, Y\. Wu, and M\. Wang\(2025\-05\)DAPO: An Open\-Source LLM Reinforcement Learning System at Scale\.arXiv\.Note:arXiv:2503\.14476 \[cs\]External Links:[Link](http://arxiv.org/abs/2503.14476),[Document](https://dx.doi.org/10.48550/arXiv.2503.14476)Cited by:[§2\.2](https://arxiv.org/html/2605.12004#S2.SS2.p1.2),[§4\.1](https://arxiv.org/html/2605.12004#S4.SS1.p1.1)\.
- \[75\]Y\. Yue, Z\. Chen, R\. Lu, A\. Zhao, Z\. Wang, Y\. Yue, S\. Song, and G\. Huang\(2025\-05\)Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?\.arXiv\.Note:arXiv:2504\.13837 \[cs\]External Links:[Link](http://arxiv.org/abs/2504.13837),[Document](https://dx.doi.org/10.48550/arXiv.2504.13837)Cited by:[§1](https://arxiv.org/html/2605.12004#S1.p2.1)\.
- \[76\]Y\. Zhai, S\. Tao, C\. Chen, A\. Zou, Z\. Chen, Q\. Fu, S\. Mai, L\. Yu, J\. Deng, Z\. Cao,et al\.\(2025\)Agentevolver: towards efficient self\-evolving agent system\.arXiv preprint arXiv:2511\.10395\.Cited by:[§4\.1](https://arxiv.org/html/2605.12004#S4.SS1.p1.1)\.
- \[77\]G\. Zhang, H\. Geng, X\. Yu, Z\. Yin, Z\. Zhang, Z\. Tan, H\. Zhou, Z\. Li, X\. Xue, Y\. Li,et al\.\(2025\)The landscape of agentic reinforcement learning for llms: a survey\.arXiv preprint arXiv:2509\.02547\.Cited by:[§1](https://arxiv.org/html/2605.12004#S1.p1.1)\.
- \[78\]W\. Zhang, Y\. Xie, Y\. Sun, Y\. Chen, G\. Wang, Y\. Li, B\. Ding, and J\. Zhou\(2026\)On\-policy rl meets off\-policy experts: harmonizing supervised fine\-tuning and reinforcement learning via dynamic weighting\.arXiv preprint arXiv:2508\.11408\.Cited by:[§4\.2](https://arxiv.org/html/2605.12004#S4.SS2.p1.1)\.
- \[79\]S\. Zhao, Z\. Xie, M\. Liu, J\. Huang, G\. Pang, F\. Chen, and A\. Grover\(2026\)Self\-distilled reasoner: on\-policy self\-distillation for large language models\.arXiv preprint arXiv:2601\.18734\.Cited by:[Appendix B](https://arxiv.org/html/2605.12004#A2.p2.6.5)\.
- \[80\]H\. Zheng, J\. Zhao, and B\. Chen\(2025\)Prosperity before collapse: how far can off\-policy rl reach with stale data on llms?\.arXiv preprint arXiv:2510\.01161\.Cited by:[§2\.3\.2](https://arxiv.org/html/2605.12004#S2.SS3.SSS2.p1.4)\.
- \[81\]Y\. Zheng, L\. Zhong, Y\. Wang, R\. Dai, K\. Liu, X\. Chu, L\. Lv, P\. Torr, and K\. Q\. Lin\(2026\)Code2world: a gui world model via renderable code generation\.arXiv preprint arXiv:2602\.09856\.Cited by:[§1](https://arxiv.org/html/2605.12004#S1.p1.1)\.
- \[82\]J\. Zhou, T\. Lu, S\. Mishra, S\. Brahma, S\. Basu, Y\. Luan, D\. Zhou, and L\. Hou\(2023\)Instruction\-following evaluation for large language models\.arXiv preprint arXiv:2311\.07911\.Cited by:[3rd item](https://arxiv.org/html/2605.12004#A1.I2.i3.p1.1),[§3\.1](https://arxiv.org/html/2605.12004#S3.SS1.p1.1)\.
- \[83\]P\. Zhou, B\. Leon, X\. Ying, C\. Zhang, Y\. Shao, Q\. Ye, D\. Chong, Z\. Jin, C\. Xie, M\. Cao,et al\.\(2025\)Browsecomp\-zh: benchmarking web browsing ability of large language models in chinese\.arXiv preprint arXiv:2504\.19314\.Cited by:[4th item](https://arxiv.org/html/2605.12004#A1.I1.i4.p1.1),[§3\.1](https://arxiv.org/html/2605.12004#S3.SS1.p1.1)\.


## Appendix A Datasets

### A.1 Train

We adopt the search-agent RL training data from ASearcher \[[19](https://arxiv.org/html/2605.12004#bib.bib296)\]. Specifically, we sample 2k instances for RL training across all our experimental settings.

Additionally, to acquire the action data, we use Tongyi-DeepResearch-30B-A3B \[[55](https://arxiv.org/html/2605.12004#bib.bib297)\] as the expert model to conduct rejection sampling. Consistent with our experimental settings, we restrict the toolset to two tools: web-search and web-visit. We collect the correct trajectories generated by the expert model and extract only the atomic per-step operations (i.e., the tool call names and corresponding arguments) to serve as the candidate complete action guidance trajectory for each sample instance. The action-turn statistics of the RL training data are shown in Figure [8](https://arxiv.org/html/2605.12004#A2.F8).
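The atomic per-step extraction described above can be sketched as follows. This is a minimal illustration: the trajectory schema (`role` and `tool_call` fields) and the helper name are assumptions for exposition, not the actual ASearcher or Tongyi-DeepResearch format.

```python
import json

def extract_action_guidance(trajectory):
    """Keep only the atomic per-step operations (tool call name + arguments)
    from a full expert trajectory, discarding CoT text and tool responses.
    The message schema here is illustrative, not the real data format."""
    steps = []
    for turn in trajectory:
        if turn.get("role") == "assistant" and "tool_call" in turn:
            call = turn["tool_call"]
            steps.append(f"<{len(steps) + 1}>: {call['name']} {json.dumps(call['args'])}")
    return "\n".join(steps)

# Hypothetical correct trajectory sampled from the expert model.
expert_trajectory = [
    {"role": "assistant", "content": "I should search for the canal first.",
     "tool_call": {"name": "search", "args": ["Kennet & Avon Canal 1974"]}},
    {"role": "tool", "content": "...search results..."},
    {"role": "assistant", "content": "Now visit the top result.",
     "tool_call": {"name": "visit", "args": ["https://en.wikipedia.org/wiki/Hungerford"]}},
    {"role": "tool", "content": "...page text..."},
]
guidance = extract_action_guidance(expert_trajectory)
```

The result is a numbered list of `<k>: tool [arguments]` lines in the style of the reference action trajectories shown in Appendix D.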

For the SFT data, we sample another disjoint subset from the ASearcher dataset in a similar manner, again employing Tongyi-DeepResearch-30B-A3B for rejection sampling, which yields 4k complete search-agent trajectories. Unlike the action data, the SFT data preserves all elements of the trajectory: the full Chain-of-Thought (CoT) reasoning, explicit tool calls, and the corresponding tool responses.

### A.2 Evaluation

To comprehensively evaluate our proposed search agent’s capabilities in complex reasoning and deep search, we adopt several standard and challenging deep search benchmarks\. The details of the utilized datasets are outlined below:

- **GAIA** \[[39](https://arxiv.org/html/2605.12004#bib.bib142)\] is a challenging general AI assistant benchmark comprising real-world questions that require deep reasoning and web browsing. Following previous works, we use a subset of 103 text-only questions to test the fundamental capabilities of our system.
- **WebWalkerQA** \[[63](https://arxiv.org/html/2605.12004#bib.bib143)\] evaluates LLMs on complex web traversal and information gathering. It contains 680 QA tasks requiring agents to systematically traverse multiple dynamic web pages to discover multi-layered information via multi-hop reasoning.
- **XBench** \[[5](https://arxiv.org/html/2605.12004#bib.bib145)\] specifically assesses the deep search capabilities of AI agents. It comprises 100 questions and dynamically evaluates high-order information retrieval and tool usage across real-world scenarios, considering both search breadth and reasoning depth.
- **BrowseComp-ZH** \[[83](https://arxiv.org/html/2605.12004#bib.bib298)\] is a complex benchmark measuring web browsing and reasoning within the Chinese internet ecosystem. It comprises 289 native, multi-hop retrieval questions strictly cross-validated across major search engines to test sophisticated multi-step reasoning.

To further assess out\-of\-domain generalization beyond the search\-agent setting, we also evaluate on three general\-purpose benchmarks:

- **GPQA** \[[47](https://arxiv.org/html/2605.12004#bib.bib300)\] is a graduate-level, Google-proof question-answering benchmark covering difficult scientific domains. We use it as an out-of-domain reasoning benchmark beyond the search-agent setting.
- **TruthfulQA** \[[34](https://arxiv.org/html/2605.12004#bib.bib301)\] evaluates whether language models generate truthful answers rather than imitating common misconceptions, providing an out-of-domain test of factual robustness.
- **IFEval** \[[82](https://arxiv.org/html/2605.12004#bib.bib302)\] measures instruction-following ability with verifiable constraints, serving as an out-of-domain benchmark for general alignment and controllability.

## Appendix B Experiment Details

**Implementation Details.** Our implementation is built on VeRL. All experimental hyperparameters are listed in Table [6](https://arxiv.org/html/2605.12004#A2.T6). During guided rollout, we inject the action data into the query prompt as plan-style reference guidance, so that the policy can follow the partial action trajectory while still completing any missing steps by itself. The exact prompt format is shown in Template [8](https://arxiv.org/html/2605.12004#A2.F8).
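The plan-style injection can be sketched as below. The wording mirrors the prompt shown in the training cases of Appendix D; the helper name and the string layout are illustrative, not the exact implementation.

```python
def build_guided_prompt(question, action_steps):
    """Inject action data into the query prompt as a plan-style reference.
    The hint may be incomplete, so the policy is told it may still need
    extra tool calls of its own."""
    hint = "\n".join(f"<{i + 1}>: {step}" for i, step in enumerate(action_steps))
    return (
        "Answer the given question using the given tools. For each step, you "
        "must conduct a thought section to reason before calling any tools.\n"
        f"Question: {question}\n"
        "Follow the partial action trajectory hint to take actions, note that "
        "the trajectory may not complete and you still need do some extra tool "
        "calls to finish the task.\n"
        "Reference action trajectory hint:\n" + hint
    )

# Example with two hypothetical reference actions.
prompt = build_guided_prompt(
    "Which party last held overall control of the council?",
    ['search ["West Berkshire Council control history"]',
     'visit [https://en.wikipedia.org/wiki/West_Berkshire_Council_elections]'],
)
```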

Table 6: Hyperparameters for search-agent RL training.

| Config | Setting |
| --- | --- |
| optimizer | AdamW |
| learning rate | 1e-6 |
| KL coefficient | 0.001 |
| training data | 2,000 |
| total training steps | 64 |
| training batch size | 32 |
| PPO mini batch size | 16 |
| group size | 8 |
| max response length | 40,960 |
| max observation length | 8,000 |
| max turns | 30 |
| $\epsilon_{\text{clip}_{\text{low}}}$ | 0.2 |
| $\epsilon_{\text{clip}_{\text{high}}}$ | 0.2 |

![Refer to caption](https://arxiv.org/html/2605.12004v1/x8.png)

Figure 8: Action turns statistics of RL training data.

*ActGuide Prompt Template and LLM-Judge Prompt Template (prompt boxes not reproduced here; the ActGuide format appears in the training cases of Appendix D).*

**Compute Resources.** All training and rollout experiments were conducted on nodes equipped with 8 NVIDIA H20 GPUs. The LLM judge used for reward assignment and test-time evaluation required additional serving resources, for which we used a separate node with 8 NVIDIA H20 GPUs.

**Different Guidance Methods.** We also compare different ways of injecting action guidance for LLM-based agents. Besides the unguided setting, we consider an assistant-prefix format following prior hint-based methods, where the action reference is prepended as a generated prefix and the model continues generation from it. We also consider a user-assistant message format, where the action data are converted into the corresponding tool calls and tool responses and then assembled as multi-turn messages before the model continues generation. As shown in Table 7, inserting the action trajectory as a reference plan in the query prompt achieves the best Reward@1, suggesting that lightweight plan-style guidance is more effective than directly prefixing or replaying actions for LLM agents.
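A minimal sketch of how the three injection formats might be assembled as chat messages; the message schema, field names, and placeholder tool responses are illustrative assumptions, not the paper's exact implementation.

```python
def inject_guidance(question, actions, method):
    """Assemble chat messages for the three guidance-injection formats
    compared in Table 7. The message schema is illustrative."""
    plan = "\n".join(f"<{i + 1}>: {a['name']} {a['args']}" for i, a in enumerate(actions))
    if method == "assistant_prefix":
        # Reference actions prepended as an already-generated assistant
        # prefix; the model continues generation from this prefix.
        return [{"role": "user", "content": question},
                {"role": "assistant", "content": plan}]
    if method == "user_assistant_messages":
        # Each action replayed as a tool call plus a placeholder tool
        # response, assembled as multi-turn messages before generation.
        messages = [{"role": "user", "content": question}]
        for action in actions:
            messages.append({"role": "assistant", "tool_call": action})
            messages.append({"role": "tool", "content": "<tool response>"})
        return messages
    if method == "reference_plan":
        # Plan-style reference embedded directly in the query prompt
        # (the best-performing variant in Table 7).
        return [{"role": "user",
                 "content": f"{question}\nReference action trajectory hint:\n{plan}"}]
    raise ValueError(f"unknown method: {method}")
```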

Table 7: Injection method comparison.

| Inject Method | Unguidance | Assistant Prefix | User-Assistant Messages | Reference Plan in Query Prompt |
| --- | --- | --- | --- | --- |
| Reward@1 | 57.90 | 74.50 | 80.10 | 85.70 |

**Action Data for On-Policy Self-Distillation.** Beyond using action data to guide the policy toward better state visitation, we also explore whether it can be used for on-policy self-distillation (OPSD) [79, 50, 23]. Specifically, OPSD still samples trajectories from the unguided policy, but uses action-conditioned guided logits as the distillation target on these on-policy rollouts. Formally, for an unguided rollout $\tau\sim\pi_{\theta_{\rm old}}(\cdot\mid x)$, we re-evaluate each visited prefix $z_{<t}$ with the same model additionally conditioned on the action guidance $g$, and optimize

$$\mathcal{L}_{\mathrm{OPSD}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,\tau\sim\pi_{\theta_{\rm old}}(\cdot\mid x)}\left[\frac{1}{T}\sum_{t=1}^{T}\mathbb{D}_{\mathrm{KL}}\!\left(\mathrm{sg}\!\left[\pi_{\theta_{\rm old}}(\cdot\mid z_{<t},g)\right]\,\|\,\pi_{\theta}(\cdot\mid z_{<t})\right)\right],\qquad(13)$$

where $\mathrm{sg}[\cdot]$ denotes stop-gradient, so the guided distribution serves only as a token-level teacher while the learned policy remains unguided at inference time. As shown in Table 8, OPSD can improve model performance, but the gains remain limited because the visited states are still determined by the base unguided policy. Therefore, it does not fundamentally resolve the ineffective state-visitation problem when the agent cannot reach useful states by itself.

Table 8: Comparison between ActGuide-RL and OPSD.
| Guidance Use | GAIA | WebWalker | XBench |
| --- | --- | --- | --- |
| ActGuide-RL | 35.92 | 39.85 | 37.00 |
| OPSD | 36.89 | 30.29 | 26.00 |

## Appendix C Theoretical Analysis

### C.1 Covariance Form of the Token-Level Off-Policy Risk

In Section 2.3.2, let $\tau=(z_1,\ldots,z_{|\tau|})$ be the generated token sequence. We define the token-level importance ratio under guidance level $g_k$ as

$$r_j^{(k)}:=\frac{\pi_{\theta}(z_j\mid z_{<j})}{\pi_{\theta}(z_j\mid z_{<j},g_k)},\qquad(14)$$

and the corresponding cumulative log-ratio shift as

$$\mathcal{L}_k(\tau):=\sum_{j=1}^{|\tau|}\log r_j^{(k)}.\qquad(15)$$

The off-policy risk is then defined as

$$R_k:=\mathrm{Var}_{\tau\sim\pi_{\theta}(\cdot\mid s,g_k)}\!\left(\mathcal{L}_k(\tau)\right).\qquad(16)$$

By variance expansion, we have

$$R_k=\mathrm{Var}_{\tau\sim\pi_{\theta}(\cdot\mid s,g_k)}\!\left(\sum_{j=1}^{|\tau|}\log r_j^{(k)}\right).\qquad(17)$$

Therefore,

$$R_k=\sum_{j=1}^{|\tau|}\mathrm{Var}_{\tau\sim\pi_{\theta}(\cdot\mid s,g_k)}\!\left(\log r_j^{(k)}\right)+2\sum_{j<j'}\mathrm{Cov}_{\tau\sim\pi_{\theta}(\cdot\mid s,g_k)}\!\left(\log r_j^{(k)},\log r_{j'}^{(k)}\right),\qquad(18)$$

where the second summation ranges over all distinct token pairs in the rollout.
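The risk in Eq. (16) and its variance-plus-covariance decomposition in Eq. (18) can be checked numerically with a short sketch; the synthetic log-ratios below stand in for quantities that would be measured from guided rollouts.

```python
import numpy as np

def off_policy_risk(token_log_ratios):
    """Eq. (16): R_k is the variance, over rollouts sampled under guidance
    level g_k, of the cumulative log-ratio shift L_k(tau) = sum_j log r_j^(k)."""
    shifts = np.array([np.sum(lr) for lr in token_log_ratios])
    return float(shifts.var())

# For fixed-length rollouts, Eq. (18) says R_k equals the sum of all entries
# of the token-level covariance matrix: per-token variances on the diagonal
# plus twice the pairwise covariances off the diagonal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))   # synthetic: 1000 rollouts, 8 tokens each
C = np.cov(X.T, ddof=0)          # 8 x 8 token-level covariance matrix
assert np.isclose(off_policy_risk(X), C.sum())
```

With independent token-level shifts the off-diagonal of `C` vanishes and the risk reduces to the sum of token-wise variances, matching the discussion that follows Eq. (18).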
This decomposition shows that the off-policy risk consists of two components: (1) token-wise variance terms, which capture local distribution mismatch at each generation step, and (2) cross-token covariance terms, which capture the dependence structure of these mismatches along the autoregressive trajectory. Hence, stronger guidance may increase not only the magnitude of token-level deviations but also their correlation across the rollout, both of which contribute to larger internalization risk. In particular, if the token-level log-ratio shifts were independent, all covariance terms would vanish and $R_k$ would reduce to the sum of token-wise variances. However, in autoregressive agent generation, token dependencies are intrinsic, and the covariance terms generally cannot be ignored. This motivates using the variance of the cumulative log-ratio shift as a compact measure of off-policy risk.

### C.2 Risk-Constrained View of Minimal Intervention

We formalize the minimal-intervention rule in Section 2.3.2 as a risk-constrained selection problem. For a guidance level $g_k$, define the group recovery probability

$$Q_k:=\mathbb{P}_{\{\tau_i^{(k)}\}_{i=1}^{N}\sim\pi_{\theta}(\cdot\mid x,g_k)}\!\left(\max_{i\leq N}Y(\tau_i^{(k)})\geq\delta\right),\qquad(19)$$

where $N$ is the rollout group size and $\delta$ is the success threshold. Given a target recovery level $\rho\in(0,1)$, the least-risk sufficient guidance level is the solution of

$$\min_{k\in\{0,\ldots,K\}}R_k\quad\mathrm{s.t.}\quad Q_k\geq\rho.\qquad(20)$$

**Assumption C.1** (Monotone recovery and risk). The ordered guidance family $g_0\prec g_1\prec\cdots\prec g_K$ satisfies

$$Q_0\leq Q_1\leq\cdots\leq Q_K,\qquad R_0\leq R_1\leq\cdots\leq R_K.\qquad(21)$$

**Proposition C.1** (Minimal sufficient guidance is risk-optimal). Under Assumption C.1, if the feasible set of Eq. 20 is non-empty, then

$$k_\rho^\star:=\min\{k\in\{0,\ldots,K\}:Q_k\geq\rho\}\qquad(22)$$

is an optimal solution of Eq. 20.

*Proof.* By definition, $k_\rho^\star$ is feasible. For any other feasible level $k$, minimality of $k_\rho^\star$ implies $k\geq k_\rho^\star$. Since $R_k$ is non-decreasing in $k$, we have $R_k\geq R_{k_\rho^\star}$. Hence no feasible guidance level has smaller off-policy risk than $k_\rho^\star$. ∎

This proposition gives a constrained interpretation of Eq. 10: minimal intervention does not maximize guidance strength, but selects the lowest-risk level that satisfies a recovery requirement. When $Q_k$ is not known exactly, it can be estimated by repeated rollout groups. Let $\widehat{Q}_k$ be the empirical mean of $m$ independent group-recovery indicators at level $k$.

**Corollary C.1** (Empirical identification under a margin). Suppose Assumption C.1 holds and there exists a margin $\Delta>0$ such that

$$Q_k\leq\rho-\Delta\quad\forall k<k_\rho^\star,\qquad Q_k\geq\rho+\Delta\quad\forall k\geq k_\rho^\star.\qquad(23)$$

If

$$m\geq\frac{1}{2\Delta^2}\log\frac{2(K+1)}{\xi},\qquad(24)$$

then with probability at least $1-\xi$, the empirical rule

$$\widehat{k}_\rho:=\min\{k:\widehat{Q}_k\geq\rho\}\qquad(25)$$

recovers $k_\rho^\star$.

*Proof.*
By Hoeffding’s inequality and a union bound over K\+1K\+1 levels, ℙ​\(maxk⁡\|Q^k−Qk\|≥Δ\)≤2​\(K\+1\)​exp⁡\(−2​m​Δ2\)≤ξ\.\\mathbb\{P\}\\\!\\left\(\\max\_\{k\}\|\\widehat\{Q\}\_\{k\}\-Q\_\{k\}\|\\geq\\Delta\\right\)\\leq 2\(K\+1\)\\exp\(\-2m\\Delta^\{2\}\)\\leq\\xi\. \(26\) On the complementary event, every k<kρ⋆k<k\_\{\\rho\}^\{\\star\} has Q^k<ρ\\widehat\{Q\}\_\{k\}<\\rho, while every k≥kρ⋆k\\geq k\_\{\\rho\}^\{\\star\} has Q^k≥ρ\\widehat\{Q\}\_\{k\}\\geq\\rho\. Therefore the empirical minimal feasible level equals kρ⋆k\_\{\\rho\}^\{\\star\}\. ∎ Appendix D Training Cases of ActGuide\-RL We present representative training cases to illustrate how ActGuide\-RL injects action data as plan\-style guidance\. Each case uses the full guidance level gKg\_\{K\}, where the prompt contains the complete available reference action trajectory while still requiring the policy to reason and complete the task with tools\. Training Case 1 of ActGuide\-RL with Full Guidance gKg\_\{K\} Example from RL Training Data Original Question: Which political party last held overall control of the West Berkshire Council before the election that first introduced the ward named after the Berkshire village where the Kennet & Avon Canal route was restored to Hungerford Wharf in July 1974, resulting in a tied outcome necessitating by\-elections and enabling the Conservatives to secure a majority? Ground Truth: The Liberal Democrats Prompt for ActGuide\-RL Answer the given question using the given tools\. For each step, you must conduct a thought section to reason before calling any tools\. Question: Which political party last held overall control of the West Berkshire Council before the election that first introduced the ward named after the Berkshire village where the Kennet & Avon Canal route was restored to Hungerford Wharf in July 1974, resulting in a tied outcome necessitating by\-elections and enabling the Conservatives to secure a majority? 
Follow the partial action trajectory hint to take actions, note that the trajectory may not complete and you still need do some extra tool calls to finish the task\. Reference action trajectory hint: <1\>: search \[‘‘Kennet & Avon Canal restored to Hungerford Wharf July 1974’’, ‘‘Hungerford Wharf July 1974 canal restoration’’, ‘‘Kennet and Avon canal restoration July 1974’’, ‘‘Berkshire village Hungerford Wharf restored July 1974’’\] <2\>: search \[‘‘West Berkshire Council election tied result by\-elections Conservatives secured majority new ward named after Hungerford’’, ‘‘West Berkshire Council election tied outcome by\-elections Conservative majority ward introduction’’, ‘‘West Berkshire Council ward first introduced election tied by\-elections’’\] <3\>: visit \[https://en\.wikipedia\.org/wiki/West\_Berkshire\_Council\_elections\] <4\>: search \[‘‘Hungerford ward West Berkshire Council’’, ‘‘Hungerford electoral ward West Berkshire’’, ‘‘Hungerford West Berkshire Council ward’’\] <5\>: search \[‘‘tied West Berkshire council election by\-election Conservative majority’’, ‘‘tied West Berkshire by\-election’’, ‘‘by\-elections West Berkshire Conservative majority’’\] <6\>: search \[‘‘West Berkshire Council tied election’’, ‘‘West Berkshire tied election’’, ‘‘West Berkshire tied council’’\] <7\>: search \[‘‘tied result West Berkshire council by\-election’’, ‘‘tied by\-elections West Berkshire’’, ‘‘tie West Berkshire council election’’\] <8\>: search \[‘‘West Berkshire 2003 new ward’’, ‘‘West Berkshire 2003 ward boundary’’, ‘‘West Berkshire 2003 electoral changes’’\] <9\>: visit \[https://grokipedia\.com/page/2003\_west\_berkshire\_council\_election\] <10\>: search \[‘‘Hungerford ward West Berkshire 2003’’, ‘‘Hungerford West Berkshire new ward 2003’’\] <11\>: visit \[https://www\.andrewteale\.me\.uk/leap/ward/2226/\] <12\>: visit \[https://en\.wikipedia\.org/wiki/West\_Berkshire\_Council\_elections\] Training Case 2 of ActGuide\-RL with Full Guidance gKg\_\{K\} Example 
from RL Training Data Original Question: In a chess tournament, all players faced each other twice and scored 1 point for a win, half a point for a draw, and zero points for a loss\. The winner was the one who scored the most points\. Luíza, a curious mathematician, found a paper stating that the sum of the scores of all participants was 210 points\. a\) How many players participated in the tournament? b\) Luíza noticed that the tournament logo depicted a knight in the center, with black squares illustrating the knight’s possible moves\. She wondered: If the board were infinite and the knight continued its pattern of movement, onto how many different squares could the knight move in two moves? Initially, she thought ‘‘64’’ but quickly realized she was wrong\. Now, it’s your turn to correctly answer the question Luíza thought of\. c\) Luíza received a message from a friend who participated in the tournament, informing her that he scored 12 points\. Luíza replied: ‘‘\- Don’t be sad, you will win the next championship\!’’ How did Luíza know that her friend did not win the tournament? Ground Truth: 33 Prompt for ActGuide\-RL Answer the given question using the given tools\. For each step, you must conduct a thought section to reason before calling any tools\. Question: In a chess tournament, all players faced each other twice and scored 1 point for a win, half a point for a draw, and zero points for a loss\. The winner was the one who scored the most points\. Luíza, a curious mathematician, found a paper stating that the sum of the scores of all participants was 210 points\. a\) How many players participated in the tournament? b\) Luíza noticed that the tournament logo depicted a knight in the center, with black squares illustrating the knight’s possible moves\. She wondered: If the board were infinite and the knight continued its pattern of movement, onto how many different squares could the knight move in two moves? 
Initially, she thought ‘‘64’’ but quickly realized she was wrong\. Now, it’s your turn to correctly answer the question Luíza thought of\. c\) Luíza received a message from a friend who participated in the tournament, informing her that he scored 12 points\. Luíza replied: ‘‘\- Don’t be sad, you will win the next championship\!’’ How did Luíza know that her friend did not win the tournament? Follow the partial action trajectory hint to take actions, note that the trajectory may not complete and you still need do some extra tool calls to finish the task\. Reference action trajectory hint: <1\>: search \[‘‘knight moves two moves reachable squares infinite board 33’’\] <2\>: search \[‘‘knight two moves reachable squares infinite board’’\] <3\>: visit \[https://www\.reddit\.com/r/chess/comments/bem4zw/given\_an\_infinite\_chess\_board\_how\_many\_spaces/\] <4\>: search \[‘‘how many squares can a knight reach in two moves’’\] <5\>: search \[‘‘knight reach two moves how many squares’’\] <6\>: visit \[https://file\.scirp\.org/Html/8\-1200157\_34513\.htm\] <7\>: search \[‘‘33 squares knight two moves’’\] <8\>: visit \[https://www\.mathworks\.com/matlabcentral/cody/problems/60541\-compute\-the\-number\-of\-squares\-a\-knight\-can\-reach\-after\-n\-moves\] <9\>: visit \[https://www\.mathworks\.com/content/dam/mathworks/images/cody/graphics/ChessKnight\.svg\] <10\>: search \[‘‘33 squares knight two moves infinite board’’\] <11\>: search \[‘‘33 squares after two moves knight’’\] <12\>: search \[‘‘knight can reach after two moves 33’’\] <13\>: visit \[https://www\.mathworks\.com/matlabcentral/cody/problems/60541\-compute\-the\-number\-of\-squares\-a\-knight\-can\-reach\-after\-n\-moves\] <14\>: search \[‘‘knight two moves 33 squares’’\] <15\>: visit \[https://www\.reddit\.com/r/chess/comments/1j3roeo/the\_amount\_of\_space\_a\_single\_knight\_can\_control/\] Training Case 3 of ActGuide\-RL with Full Guidance gKg\_\{K\} Example from RL Training Data Original Question: In which 
Question: In which year, during the early 1970s, did a player, known for his time with the New York Yankees, join the team, and who is the former MLB pitcher, now a pitching coach for a minor league team affiliated with the Miami Marlins, whose first name starts with 'M' and who started the opening game of the 2009 World Baseball Classic against Venezuela, pitching 4 shutout innings?

Ground Truth: 1972, Mark DiFelice

**Prompt for ActGuide-RL**

Answer the given question using the given tools. For each step, you must conduct a thought section to reason before calling any tools.

Question: In which year, during the early 1970s, did a player, known for his time with the New York Yankees, join the team, and who is the former MLB pitcher, now a pitching coach for a minor league team affiliated with the Miami Marlins, whose first name starts with 'M' and who started the opening game of the 2009 World Baseball Classic against Venezuela, pitching 4 shutout innings?

Follow the partial action trajectory hint to take actions, note that the trajectory may not complete and you still need do some extra tool calls to finish the task.

Reference action trajectory hint:

<1>: search ["2009 World Baseball Classic opening game Venezuela 4 shutout innings pitcher", "Venezuela 2009 WBC opening game pitcher 4 shutout innings", "World Baseball Classic 2009 Venezuela opener pitcher", "2009 WBC Venezuela vs Opening Game pitcher", "2009 WBC Venezuela 4 shutout innings start"]
<2>: search ["2009 World Baseball Classic opening game March 5 2009 opening games schedule", "first game of 2009 World Baseball Classic opening day"]
<3>: visit [https://www.aol.com/articles/miami-opening-game-world-baseball-183425977.html]
<4>: search ["2009 World Baseball Classic Venezuela vs Italy opening game 4 shutout innings pitcher", "Venezuela vs Italy March 7 2009 WBC pitcher started 4 shutout innings"]
<5>: visit [https://www.arkansasonline.com/news/2009/mar/11/world-baseball-classic-venezuelas-power-too-much-for-italy/]
<6>: visit [https://www.espn.com/world-baseball-classic/boxscore/_/gameId/290307116]
<7>: search ["Mark DiFelice pitching coach Miami Marlins affiliate", "Mark DiFelice minor league pitching coach Marlins"]
<8>: search ["joined the New York Yankees in 1970", "joined the New York Yankees in 1971", "joined the New York Yankees in 1972", "joined the New York Yankees in 1973", "joined the New York Yankees in 1974", "joined the New York Yankees in 1975"]
<9>: search ["joined the New York Yankees 1970", "joined the New York Yankees 1971", "joined the New York Yankees 1972", "joined the New York Yankees 1973", "joined the New York Yankees 1974", "joined the New York Yankees 1975"]
<10>: search ["Sparky Lyle joined the New York Yankees 1972"]
<11>: search ["Graig Nettles joined the New York Yankees 1972"]
<12>: search ["known for his time with the New York Yankees"]
<13>: search ["Thurman Munson joined the New York Yankees"]
<14>: visit [https://pinstripesnation.com/thurman-munson-the-captain-who-never-left-the-yankees-2023-08-02/]
<15>: search ["Graig Nettles known for his time with the New York Yankees"]
<16>: search ["Mark DiFelice 2009 World Baseball Classic opening game Venezuela 4 shutout innings"]
<17>: visit [https://en.wikipedia.org/wiki/Mark_DiFelice]
<18>: visit [https://www.ebay.com/itm/389131297484]
<19>: visit [https://en.wikipedia.org/wiki/Graig_Nettles]
<20>: visit [https://en.wikipedia.org/wiki/Mark_DiFelice]
<21>: search ["Mark DiFelice Marlins", "Mark DiFelice Miami Marlins"]
<22>: visit [https://en.wikipedia.org/wiki/Graig_Nettles]
<23>: search ["Sparky Lyle New York Yankees known for his time with"]
<24>: visit [https://en.wikipedia.org/wiki/Sparky_Lyle]
<25>: search ["best known for his time with the New York Yankees Sparky Lyle"]
<26>: search ["best known for his time with the New York Yankees Graig Nettles"]

## Appendix E Limitations

Due to the relatively simple experimental setup, the ease of obtaining task queries at different difficulty levels, and the natural availability of action data, our main experiments are conducted in the search-agent setting. This setting provides a controlled testbed for studying reachability barriers and guidance-induced off-policy risk. Nevertheless, ActGuide-RL is designed for general agentic training rather than being specific to search agents, and its effectiveness in other agent tasks, such as CLI, GUI, API-based, and embodied environments, remains to be explored.

This work uses action data through plan-style guidance, where reference actions are injected as a high-level action plan that helps the policy cross exploration barriers. This simple formulation keeps the method lightweight, broadly applicable, and independent of costly reasoning traces. More fine-grained ways of using action data, such as step-level guidance injection, also remain to be explored.

Finally, this work focuses on how to leverage action data for agentic RL, but does not discuss how such data should be systematically collected and processed. In practice, the structured collection, cleaning, and filtering of existing interaction records, such as backend logs from different agent applications, are also important for action-data-based training and remain worth exploring.
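As a concrete companion to the prompt example above, the following is a minimal sketch of how logged action records could be serialized into the numbered plan-style hint format and injected into a guided prompt. The `format_plan_hint` and `build_guided_prompt` helpers and the `(tool, args)` record layout are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed data layout): turn a logged action trajectory
# into the numbered "<k>: tool [args]" reference hint shown above.

def format_plan_hint(actions):
    """actions: list of (tool_name, args) tuples from an action log,
    where args is a list of query strings (for search) or URLs (for visit)."""
    lines = []
    for k, (tool, args) in enumerate(actions, start=1):
        if tool == "search":
            body = ", ".join(f'"{q}"' for q in args)  # quote each search query
        else:
            body = ", ".join(args)                    # e.g. visit takes raw URLs
        lines.append(f"<{k}>: {tool} [{body}]")
    return "\n".join(lines)

def build_guided_prompt(question, actions):
    """Inject the action plan as reference guidance, mirroring the prompt layout."""
    return (
        "Answer the given question using the given tools. For each step, you must "
        "conduct a thought section to reason before calling any tools.\n"
        f"Question: {question}\n"
        "Follow the partial action trajectory hint to take actions, note that the "
        "trajectory may not complete and you still need do some extra tool calls "
        "to finish the task.\n"
        "Reference action trajectory hint:\n"
        + format_plan_hint(actions)
    )

# Example: two steps drawn from the trajectory above.
hint = format_plan_hint([
    ("search", ["Mark DiFelice Marlins", "Mark DiFelice Miami Marlins"]),
    ("visit", ["https://en.wikipedia.org/wiki/Mark_DiFelice"]),
])
print(hint)
```

Because the hint is plain text prepended to the task, the same formatter can serve both guided and unguided rollouts: the unguided case simply omits the hint section, which is what mixed-policy training optimizes against.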
