Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents
Summary
Proposes CVT-RL, a constrained policy-gradient algorithm with policy-conditioned counterfactual contribution estimation and verifiable rewards, improving long-horizon language agent reliability and reducing reward hacking.
# Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents
Source: [https://arxiv.org/html/2606.05263](https://arxiv.org/html/2606.05263)
###### Abstract
Reinforcement learning with verifiable rewards improves reasoning and tool use, yet long-horizon language agents still learn unsupported evidence chains, belief drift, and shortcut actions that satisfy terminal checks. Existing process rewards are mostly correlational: they reward retrieval-, reflection-, or verification-like steps without estimating whether the step contributes to final verified success under a specified intervention. We propose CVT-RL, a constrained policy-gradient algorithm with dense verifiable rewards, intervention-validity gating, and a *policy-conditioned counterfactual contribution* (PCCC) estimator. Deletion, semantic substitution, evidence substitution, and tool-output perturbation define separate controlled interventions; continuations are sampled from a frozen reference policy, and a selection-adjusted doubly robust estimator augments the advantage. Belief control uses only prefix-observable labels, while an augmented Lagrangian constrains unsupported claims, skipped verification, tool tampering, and unsafe calls. On long-context QA, ALFWorld, ScienceWorld, and web/tool tasks, CVT-RL improves average task success from 71.8% for compute-matched non-causal RL and 75.4% for an information-matched counterfactual-process baseline to 78.9%, improves evidence F1 from 78.9 to 82.8 over the information-matched baseline, and reduces measured hacking from 7.2% to 3.9%. Independent human audit estimates 4.6% hacking for CVT-RL versus 8.1% for the information-matched baseline, and adaptive detector-evasion attacks raise hacking only to 7.1%. Stratified bootstrap and mixed-effects tests give $p<0.01$ after Holm correction for all primary metrics. Carefully scoped counterfactual credit, paired with validity gating, diagnostics, and verifiable constraints, provides a reproducible route toward more reliable long-horizon RL for language agents.
Keywords: reinforcement learning; language agents; causal inference; verifiable rewards; constrained optimization; reward hacking
## 1 Introduction
Large language model (LLM) agents solve tasks by alternating natural-language reasoning, retrieval, tool calls, verification, and final answers. RL post-training is central to this progress, from RLHF and preference optimization (Christiano et al., [2017](https://arxiv.org/html/2606.05263#bib.bib10); Stiennon et al., [2020](https://arxiv.org/html/2606.05263#bib.bib51); Ouyang et al., [2022](https://arxiv.org/html/2606.05263#bib.bib34); Rafailov et al., [2023](https://arxiv.org/html/2606.05263#bib.bib42)) to chain-of-thought, self-consistency, and tool use (Wei et al., [2022](https://arxiv.org/html/2606.05263#bib.bib56); Wang et al., [2023](https://arxiv.org/html/2606.05263#bib.bib55); Yao et al., [2023](https://arxiv.org/html/2606.05263#bib.bib59); Schick et al., [2023](https://arxiv.org/html/2606.05263#bib.bib44)). Long-horizon agents expose failures hidden by answer-only benchmarks: the policy may skip verification, cite unsupported evidence, repeat null actions, exploit metadata, or edit evaluator-facing artifacts while still receiving terminal reward.
Recent work provides components. Dense long-context rewards reduce sparse-gradient failures (Chen et al., [2026](https://arxiv.org/html/2606.05263#bib.bib8); Ping et al., [2026](https://arxiv.org/html/2606.05263#bib.bib40); Lv et al., [2026](https://arxiv.org/html/2606.05263#bib.bib32)); trust-region updates stabilize LLM RL (Schulman et al., [2015](https://arxiv.org/html/2606.05263#bib.bib45), [2017](https://arxiv.org/html/2606.05263#bib.bib46); Becker and others, [2026](https://arxiv.org/html/2606.05263#bib.bib6)); belief-bottleneck or deviation penalties reduce active-reasoning drift (Zou et al., [2026](https://arxiv.org/html/2606.05263#bib.bib65); Lidayan and others, [2026](https://arxiv.org/html/2606.05263#bib.bib28)); and verifiable meta-reasoning rewards improve agents (Zhang et al., [2026](https://arxiv.org/html/2606.05263#bib.bib61)). Yet they rarely ask whether an intermediate step changed the probability of final verified success under a specified intervention and continuation policy.
We answer with CVT-RL. We do not claim to recover an unconditional path-specific effect under the evolving training policy. For action $a_t$ at history $h_t$, PCCC estimates the controlled change in final verified success when $a_t$ is replaced by an intervention-specific counterfactual $a_t^{0,k}$ and the rest of the trajectory is completed by a frozen continuation policy $\mu$. PCCC is a stable credit surrogate, not the exact policy-gradient causal effect. The algorithm combines PCCC with verifiable rewards, leakage-controlled belief supervision, and constrained trust-region updates.
Contributions. (i) We define separate PCCC estimands for deletion, semantic substitution, evidence substitution, and tool-output perturbation, and add intervention-validity gating to reduce OOD counterfactuals. (ii) We state identification assumptions, selection correction, nuisance-model training, overlap diagnostics, and failure modes for sequential language trajectories. (iii) We give a full-vocabulary KL condition for top-$M$ projection. (iv) We evaluate compute-matched and information-matched counterfactual baselines, detector-held-out and human-audited hacking, adaptive detector evasion, per-seed benchmark uncertainty, refresh-period sensitivity, and model-scale transfer, showing that gains are not explained by extra rollouts, structured supervision, or detector reuse alone.
Figure 1: System overview of CVT-RL. The diagram visualizes end-to-end data flow from task input, retrieval, and tool use to candidate-step selection, intervention-validity gating, frozen-policy counterfactual continuation, verifier-based PCCC estimation, and trust-region constrained policy updates.
## 2 Related Work
RL and reasoning in language models. Human- and AI-feedback RL align LMs with preferences, while DPO and related objectives avoid explicit online RL (Christiano et al., [2017](https://arxiv.org/html/2606.05263#bib.bib10); Stiennon et al., [2020](https://arxiv.org/html/2606.05263#bib.bib51); Ouyang et al., [2022](https://arxiv.org/html/2606.05263#bib.bib34); Bai et al., [2022](https://arxiv.org/html/2606.05263#bib.bib4); Rafailov et al., [2023](https://arxiv.org/html/2606.05263#bib.bib42)). Chain-of-thought, zero-shot reasoning, self-consistency, ReAct, Reflexion, and Toolformer expose intermediate computation but do not guarantee faithful or necessary reasoning (Wei et al., [2022](https://arxiv.org/html/2606.05263#bib.bib56); Kojima et al., [2022](https://arxiv.org/html/2606.05263#bib.bib23); Wang et al., [2023](https://arxiv.org/html/2606.05263#bib.bib55); Yao et al., [2023](https://arxiv.org/html/2606.05263#bib.bib59); Shinn et al., [2023](https://arxiv.org/html/2606.05263#bib.bib48); Schick et al., [2023](https://arxiv.org/html/2606.05263#bib.bib44)). RL with verifiable rewards improves math, code, and reasoning models (Shao et al., [2024](https://arxiv.org/html/2606.05263#bib.bib47); DeepSeek-AI, [2025](https://arxiv.org/html/2606.05263#bib.bib11); Liu and others, [2025b](https://arxiv.org/html/2606.05263#bib.bib30); Yuan et al., [2024](https://arxiv.org/html/2606.05263#bib.bib60)), but terminal correctness can reinforce shortcuts.
Grounding, retrieval, and agents. Retrieval-augmented generation and dense retrieval ground LM outputs in external corpora (Lewis et al., [2020](https://arxiv.org/html/2606.05263#bib.bib26); Karpukhin et al., [2020](https://arxiv.org/html/2606.05263#bib.bib22); Izacard and Grave, [2021](https://arxiv.org/html/2606.05263#bib.bib19); Borgeaud et al., [2022](https://arxiv.org/html/2606.05263#bib.bib7)). Long-context benchmarks show that large windows do not ensure evidence selection (Bai et al., [2024](https://arxiv.org/html/2606.05263#bib.bib5); Hsieh et al., [2024](https://arxiv.org/html/2606.05263#bib.bib16); Kamradt, [2023](https://arxiv.org/html/2606.05263#bib.bib21); Li and others, [2024](https://arxiv.org/html/2606.05263#bib.bib27)). Embodied, scientific, web, and API-agent benchmarks evaluate multi-step interaction with tools (Shridhar et al., [2021](https://arxiv.org/html/2606.05263#bib.bib49); Wang et al., [2022](https://arxiv.org/html/2606.05263#bib.bib54); Yao et al., [2022](https://arxiv.org/html/2606.05263#bib.bib58); Zhou et al., [2024](https://arxiv.org/html/2606.05263#bib.bib63); Liu et al., [2024](https://arxiv.org/html/2606.05263#bib.bib29); Patil et al., [2023](https://arxiv.org/html/2606.05263#bib.bib38); Qin et al., [2024](https://arxiv.org/html/2606.05263#bib.bib41); Parisi et al., [2022](https://arxiv.org/html/2606.05263#bib.bib37)). Long-context RL and meta-reasoning rewards motivate dense process supervision, but leave causal credit implicit (Chen et al., [2026](https://arxiv.org/html/2606.05263#bib.bib8); Ping et al., [2026](https://arxiv.org/html/2606.05263#bib.bib40); Lv et al., [2026](https://arxiv.org/html/2606.05263#bib.bib32); Zhang et al., [2026](https://arxiv.org/html/2606.05263#bib.bib61)).
Stable, safe, and causal RL. Trust-region and proximal policy gradients stabilize optimization (Williams, [1992](https://arxiv.org/html/2606.05263#bib.bib57); Sutton and Barto, [2018](https://arxiv.org/html/2606.05263#bib.bib52); Schulman et al., [2015](https://arxiv.org/html/2606.05263#bib.bib45), [2017](https://arxiv.org/html/2606.05263#bib.bib46)); constrained RL controls expected costs (Altman, [1999](https://arxiv.org/html/2606.05263#bib.bib1); Achiam et al., [2017](https://arxiv.org/html/2606.05263#bib.bib2); Chow et al., [2018](https://arxiv.org/html/2606.05263#bib.bib9); Ray et al., [2019](https://arxiv.org/html/2606.05263#bib.bib43)). Offline-to-online RL, conservative value learning, implicit Q-learning, diffusion data generation, and action chunking address sparse long-horizon rewards (Kumar et al., [2020](https://arxiv.org/html/2606.05263#bib.bib25); Kostrikov et al., [2022](https://arxiv.org/html/2606.05263#bib.bib24); Huang and others, [2025](https://arxiv.org/html/2606.05263#bib.bib17); Liu and others, [2025a](https://arxiv.org/html/2606.05263#bib.bib31); Zhu and others, [2025](https://arxiv.org/html/2606.05263#bib.bib64)). Causal inference and doubly robust evaluation separate interventions from correlations (Pearl, [2009](https://arxiv.org/html/2606.05263#bib.bib39); Imbens and Rubin, [2015](https://arxiv.org/html/2606.05263#bib.bib18); Hernán and Robins, [2020](https://arxiv.org/html/2606.05263#bib.bib15); Dudík et al., [2011](https://arxiv.org/html/2606.05263#bib.bib12); Jiang and Li, [2016](https://arxiv.org/html/2606.05263#bib.bib20); Thomas and Brunskill, [2016](https://arxiv.org/html/2606.05263#bib.bib53)). Reward misspecification studies show that capable agents exploit flawed objectives (Amodei et al., [2016](https://arxiv.org/html/2606.05263#bib.bib3); Skalse et al., [2022](https://arxiv.org/html/2606.05263#bib.bib50); Pan et al., [2022](https://arxiv.org/html/2606.05263#bib.bib35); Pan and others, [2026](https://arxiv.org/html/2606.05263#bib.bib36)).
## 3 Methodology
### 3.1 Constrained long-horizon agent
We model the agent as a partially observed constrained MDP $\mathcal{M}=(\mathcal{S},\mathcal{O},\mathcal{A},P,R,C,\gamma)$. At time $t$, the policy observes history $h_t=(o_0,a_0,\ldots,o_t)$ and samples

$$a_t\in\{\texttt{THINK},\texttt{SEARCH},\texttt{READ},\texttt{VERIFY},\texttt{ACT},\texttt{FINAL}\}\times\mathcal{X}, \tag{1}$$

where $\mathcal{X}$ is text or structured tool arguments. Costs $c_{j,t}$ measure unsupported claims, skipped verification, evaluator tampering, unsafe tool calls, repeated null actions, and budget overruns:

$$\max_{\theta}J_R(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\sum_{t=0}^{T}\gamma^{t}r_t,\quad\mathrm{s.t.}\quad J_{C_j}(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\sum_{t=0}^{T}\gamma^{t}c_{j,t}\leq d_j. \tag{2}$$

The dense reward is

$$r_t=\lambda_y r_t^{\mathrm{ans}}+\lambda_e r_t^{\mathrm{evi}}+\lambda_m r_t^{\mathrm{meta}}+\lambda_b r_t^{\mathrm{bel}}+\lambda_\Delta r_t^{\mathrm{pccc}}-\lambda_h r_t^{\mathrm{hack}}. \tag{3}$$

Default weights are $(1.0,0.45,0.18,0.25,0.60,0.80)$ and are varied in Section [4.5](https://arxiv.org/html/2606.05263#S4.SS5). $r^{\mathrm{ans}}$ is exact match, unit-test pass, or environment success. $r^{\mathrm{evi}}$ is the harmonic mean of support-document F1 and entailment. $r^{\mathrm{meta}}$ rewards plan–explore–verify patterns only when they reduce verifier uncertainty. $r^{\mathrm{hack}}$ is the maximum detector score for metadata leakage, evaluator modification, unsupported finalization, and suspicious tool edits.
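The dense reward of Eq. (3) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the names `RewardWeights`, `harmonic_mean`, and `dense_reward` are hypothetical, and the mapping of the six default weights to $(\lambda_y,\lambda_e,\lambda_m,\lambda_b,\lambda_\Delta,\lambda_h)$ assumes they are listed in the order the terms appear in Eq. (3).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Assumed to follow the term order of Eq. (3); hypothetical container.
    answer: float = 1.0     # lambda_y
    evidence: float = 0.45  # lambda_e
    meta: float = 0.18      # lambda_m
    belief: float = 0.25    # lambda_b
    pccc: float = 0.60      # lambda_Delta
    hack: float = 0.80      # lambda_h


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative scores; 0 if both are 0."""
    return 0.0 if a + b == 0 else 2.0 * a * b / (a + b)


def dense_reward(r_ans, support_f1, entailment, r_meta, r_bel, r_pccc,
                 detector_scores, w=RewardWeights()):
    """Combine the per-step reward terms as in Eq. (3)."""
    r_evi = harmonic_mean(support_f1, entailment)  # evidence reward
    r_hack = max(detector_scores, default=0.0)     # worst detector score
    return (w.answer * r_ans + w.evidence * r_evi + w.meta * r_meta
            + w.belief * r_bel + w.pccc * r_pccc - w.hack * r_hack)
```

Because the hacking term takes the maximum over detectors, a single confident detector is enough to apply the full penalty, regardless of how the other detectors score the step.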
### 3.2 Policy-conditioned counterfactual contribution
Let $\mu$ be a frozen continuation policy, usually the reference model from the previous outer iteration. For intervention family $k\in\mathcal{K}$, $g_k(h_t,a_t,u)$ produces a counterfactual action $a_t^{0,k}$ with randomness $u$. We use four families: deletion, neutral semantic paraphrase, evidence substitution, and tool-output perturbation. Their estimands are not merged:

$$\Delta_t^{k,\mu}(h_t,a_t)=\mathbb{E}_{u,\tau_{t+1:T}\sim\mu}\left[Y\{h_t,a_t,\tau_{t+1:T}\}\right]-\mathbb{E}_{u,\tau_{t+1:T}\sim\mu}\left[Y\{h_t,g_k(h_t,a_t,u),\tau_{t+1:T}\}\right]. \tag{4}$$

$Y\in[0,1]$ is final verified success. If $\mu\neq\pi_\theta$, PCCC measures usefulness under a reference continuation; it regularizes credit assignment but is not a proof of improvement under arbitrary future continuations. The aggregate reward uses $\Delta_t^{\mu}=\sum_k w_k\Delta_t^{k,\mu}$ with $w=(0.25,0.20,0.30,0.25)$.
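To make the estimand of Eq. (4) concrete, the following minimal sketch estimates one family's contribution by plain Monte Carlo and then aggregates across families. It is an assumption-laden illustration: `rollout_success` is a hypothetical callable standing in for "complete the trajectory with the frozen policy $\mu$ and return verified success", the family keys are made-up labels, and the weights are assumed to follow the order in which the four families are listed in the text.

```python
import random
from statistics import fmean

# Assumed order: deletion, paraphrase, evidence substitution, tool perturbation.
FAMILY_WEIGHTS = {
    "deletion": 0.25,
    "paraphrase": 0.20,
    "evidence_sub": 0.30,
    "tool_perturb": 0.25,
}


def mc_pccc(rollout_success, history, action, counterfactual, n=4, rng=None):
    """Monte Carlo version of Eq. (4) for a single intervention family.

    rollout_success(history, action, rng) -> Y in [0, 1] is a hypothetical
    hook that samples one continuation from the frozen policy and verifies it.
    """
    rng = rng or random.Random(0)
    factual = fmean(rollout_success(history, action, rng) for _ in range(n))
    intervened = fmean(
        rollout_success(history, counterfactual, rng) for _ in range(n)
    )
    return factual - intervened


def aggregate(per_family):
    """Weighted sum over families: Delta_t^mu = sum_k w_k * Delta_t^{k,mu}."""
    return sum(FAMILY_WEIGHTS[k] * delta for k, delta in per_family.items())
```

The default `n=4` mirrors the $K=4$ continuations per family reported in the implementation details; the per-family values are kept separate until the final weighted sum, matching the statement that the estimands are not merged.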
#### Identification and estimator.
Identification of $\Delta_t^{k,\mu}$ for selected steps $S_t=1$ requires: consistency; positivity of observed and intervened actions; sequential exchangeability conditional on $h_t$, verifier state, tool state, and candidate-selection features; a fixed intervention distribution $g_k$ independent of the outcome except through $(h_t,a_t)$; and selected-step bias correction. To reduce off-support interventions, each proposal passes a validity gate $\nu_t^k=\mathbb{I}[\mathrm{syntax}\wedge\mathrm{schema}\wedge s_{\mathrm{sup}}\geq 0.35\wedge s_{\mathrm{ent}}\geq 0.55]$ before rollout. Hidden environment state, invalid interventions, and verifier failures can still bias estimates, so we report overlap and validity diagnostics in Table [5](https://arxiv.org/html/2606.05263#S4.T5).
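The validity gate $\nu_t^k$ is a conjunction of four checks, which the sketch below spells out. The `Proposal` record and its field names are hypothetical; only the two thresholds (0.35 for support, 0.55 for entailment) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    # Hypothetical record of the checks run on one counterfactual action.
    syntax_ok: bool          # proposal parses
    schema_ok: bool          # tool arguments match the tool schema
    support_score: float     # s_sup
    entailment_score: float  # s_ent


def validity_gate(p: Proposal, min_support: float = 0.35,
                  min_entail: float = 0.55) -> bool:
    """Return True only if every check in the gate nu_t^k passes."""
    return (p.syntax_ok and p.schema_ok
            and p.support_score >= min_support
            and p.entailment_score >= min_entail)
```

Both thresholds are inclusive, so a proposal scoring exactly 0.35 and 0.55 is admitted to rollout.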
We select at most $L=8$ candidate steps using gradient norm, verifier disagreement, tool relevance, and novelty. Let $q_\omega(S_t=1\mid h_t,a_t)$ be the calibrated selection probability, clipped to $[0.15,1]$. The outcome model $m_\phi^k(h,a)=\mathbb{E}[Y\mid h,a,\mu,k]$ is a DeBERTa-v3-base cross-encoder over compressed history, action, tool state, and verifier features, trained with BCE on frozen-continuation outcomes. The continuation ratio is

$$\rho_t=\mathrm{clip}\left(\exp\Big\{\sum_{s>t}\log\mu(a_s\mid h_s)-\sum_{s>t}\log\pi_b(a_s\mid h_s)\Big\},0.2,5\right), \tag{5}$$

where $\pi_b$ is the behavior policy that generated the continuation; $\rho_t^0$ is defined analogously for the counterfactual branch. The selection-adjusted doubly robust estimator is

$$\begin{aligned}\widehat{\Delta}_t^{k,\mu}={}&\frac{S_t}{q_\omega}\Big[m_\phi^k(h_t,a_t)-m_\phi^k(h_t,a_t^{0,k})+\rho_t\big(Y-m_\phi^k(h_t,a_t)\big)\\&-\rho_t^0\big(Y^{0,k}-m_\phi^k(h_t,a_t^{0,k})\big)\Big].\end{aligned} \tag{6}$$

Under the assumptions above, bounded weights, and either a correct outcome model or correct continuation-ratio model, Eq. ([6](https://arxiv.org/html/2606.05263#S3.E6)) is unbiased for selected-step PCCC up to clipping bias; with both nuisances estimated at $o_p(n^{-1/4})$, the leading product-error term is second order. We use $r_t^{\mathrm{pccc}}=\tanh\big(2\sum_k w_k\widehat{\Delta}_t^{k,\mu}\big)$ and

$$\widehat{A}^{\mathrm{CVT}}_t=\widehat{A}^{\mathrm{GAE}}_t+0.7\sum_k w_k\widehat{\Delta}^{k,\mu}_t-0.9\,\widehat{c}^{\mathrm{hack}}_t. \tag{7}$$

The PCCC reward term defines the shaped constrained objective, while the advantage augmentation is a stop-gradient credit-shaping term that changes the stochastic estimator used for optimization but not the verifier definitions or constraints. We therefore report reward-only and advantage-only ablations rather than relying on this engineering choice for the claim.
Figure 2: PCCC separates intervention semantics. Deletion, paraphrase, evidence substitution, and tool-output perturbation estimate different controlled contributions under the same frozen continuation policy.
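The estimator chain of Eqs. (5) to (7) can be written as a handful of scalar functions. The sketch below is a simplified, per-step illustration, not the paper's code: all function and argument names are hypothetical, the outcome-model predictions and log-probabilities are assumed to be supplied by the caller, and the ratio is clipped in log space (equivalent to clipping after `exp`, since `exp` is monotone) to avoid overflow.

```python
import math


def clipped_ratio(logp_mu, logp_behavior, lo=0.2, hi=5.0):
    """Eq. (5): clipped continuation ratio from per-step log-probabilities."""
    log_ratio = sum(logp_mu) - sum(logp_behavior)
    log_ratio = min(max(log_ratio, math.log(lo)), math.log(hi))
    return math.exp(log_ratio)


def dr_pccc(selected, q_sel, m_fact, m_cf, y_fact, y_cf,
            rho_fact, rho_cf, q_floor=0.15):
    """Eq. (6): selection-adjusted doubly robust contribution for one family."""
    if not selected:                       # S_t = 0 contributes nothing
        return 0.0
    q = min(max(q_sel, q_floor), 1.0)      # selection probability in [0.15, 1]
    direct = m_fact - m_cf                 # outcome-model difference
    correction = rho_fact * (y_fact - m_fact) - rho_cf * (y_cf - m_cf)
    return (direct + correction) / q


def pccc_reward(deltas, weights):
    """r_t^pccc = tanh(2 * sum_k w_k * Delta_hat_k)."""
    return math.tanh(2.0 * sum(w * d for w, d in zip(weights, deltas)))


def cvt_advantage(a_gae, deltas, weights, c_hack):
    """Eq. (7): GAE advantage plus PCCC credit minus the hacking cost."""
    weighted = sum(w * d for w, d in zip(weights, deltas))
    return a_gae + 0.7 * weighted - 0.9 * c_hack
```

A useful sanity check: when both ratios equal 1 and the step is selected with probability 1, the outcome-model terms cancel and `dr_pccc` reduces to the raw difference `y_fact - y_cf`.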
### 3.3 Belief reward and trust-region constraints
The belief head $b_t=f_\psi(h_t)$ predicts prefix-observable slots: task phase, required evidence IDs among retrieved documents, verified subgoals, and unresolved constraints. Main results use verifier-estimated labels $b_t^\star$ generated from prefix evidence only; final answers, gold hidden states, and future observations are excluded. The reward is $r_t^{\mathrm{bel}}=-\mathrm{clip}\big(D_{\mathrm{KL}}(b_t^\star\|b_t),0,2\big)$. Oracle belief is reported only as an upper bound.
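As an illustration of the belief reward, the sketch below computes the clipped KL penalty for a single categorical slot. Treating the belief as one categorical distribution is a simplifying assumption (the text describes several slots), and the `eps` floor on predicted probabilities is an added numerical safeguard, not something stated in the source.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """D_KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


def belief_reward(label_dist, predicted_dist, cap=2.0):
    """r_bel = -clip(D_KL(b_star || b), 0, cap)."""
    kl = kl_divergence(label_dist, predicted_dist)
    return -min(max(kl, 0.0), cap)
```

The cap at 2 bounds the penalty when the belief head assigns near-zero probability to the labeled slot value, so a single confident mistake cannot dominate the dense reward.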
Top-$M$ projection is safe only if the tail is controlled. For old distribution $p_{\rm old}$, let $\mathcal{V}_M$ be the top-$M=256$ tokens and $\alpha=\sum_{v\in\mathcal{V}_M}p_{\rm old}(v)$. We preserve old tail distribution and mass; only the conditional top distribution is optimized:

$$q_M^\star=\arg\min_q\|\log q-z_\theta\|_2^2\quad\mathrm{s.t.}\quad D_{\mathrm{KL}}(p_{\mathrm{old},M}\|q)\leq\delta/\alpha. \tag{8}$$

The full distribution is $p_{\rm new}(v)=\alpha q_M^\star(v)$ for $v\in\mathcal{V}_M$ and $p_{\rm new}(v)=p_{\rm old}(v)$ otherwise, hence

$$D_{\mathrm{KL}}(p_{\rm old}\|p_{\rm new})=\alpha D_{\mathrm{KL}}(p_{\mathrm{old},M}\|q_M^\star)\leq\delta. \tag{9}$$

If $\alpha<0.98$, we expand to $M=1024$; if still below $0.98$, projection is disabled and an explicit KL penalty is used. We set $\delta=0.025$ for reasoning tokens and $0.010$ for tool tokens. The augmented Lagrangian is

$$\mathcal{L}=-\mathbb{E}_t\left[\frac{\pi_\theta(a_t\mid h_t)}{\pi_{\theta_k}(a_t\mid h_t)}\widehat{A}^{\mathrm{CVT}}_t\right]+0.2\mathcal{L}_V+\sum_j\lambda_j(J_{C_j}-d_j)+\frac{\rho}{2}\sum_j[J_{C_j}-d_j]_+^2. \tag{10}$$

We use $\rho=2.0$, $\lambda_j\leftarrow[\lambda_j+0.05(J_{C_j}-d_j)]_+$, and $d=(0.02,0.04,0.03,0.06,0.15)$ for tampering, unsupported evidence, skipped verification, unsafe action, and budget overrun.
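The constraint part of Eq. (10) and the multiplier update are simple enough to sketch directly. The function names below are hypothetical, and the cost estimates $J_{C_j}$ are assumed to be supplied as plain floats from rollouts; the step size 0.05, penalty $\rho=2.0$, and limits $d$ are the values stated above.

```python
LIMITS = [0.02, 0.04, 0.03, 0.06, 0.15]  # d_j, in the order given in the text


def update_multipliers(lams, costs, limits=LIMITS, step=0.05):
    """Projected dual ascent: lambda_j <- [lambda_j + step * (J_Cj - d_j)]_+."""
    return [max(0.0, lam + step * (c - d))
            for lam, c, d in zip(lams, costs, limits)]


def constraint_penalty(lams, costs, limits=LIMITS, rho=2.0):
    """Constraint terms of Eq. (10): linear multiplier term plus quadratic hinge."""
    linear = sum(lam * (c - d) for lam, c, d in zip(lams, costs, limits))
    quadratic = 0.5 * rho * sum(max(0.0, c - d) ** 2
                                for c, d in zip(costs, limits))
    return linear + quadratic
```

A multiplier grows only while its cost exceeds the limit and is projected back to zero otherwise, so satisfied constraints stop contributing through the quadratic hinge.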
### 3.4 Reproducible details
The default backbone is Qwen2.5-7B-Instruct with LoRA rank 64, $\alpha=128$, dropout 0.05, and trainable attention projections; we also evaluate Llama-3.1-8B-Instruct and Qwen2.5-14B-Instruct. Rollouts use temperature 0.8, top-$p=0.95$, max generation 2048, context 32k, max tool calls 16, and $G=8$ samples per prompt. Training runs for 3,000 updates with batch size 128, gradient accumulation 8, AdamW $(0.9,0.95)$, weight decay 0.1, policy LR $8\times10^{-7}$, verifier/outcome LR $5\times10^{-6}$, cosine decay, 3% warmup, gradient clipping 1.0, and GAE $(\gamma,\lambda)=(0.99,0.95)$. Each selected step receives $K=4$ counterfactual continuations per intervention family. The frozen $\mu$ is refreshed every 200 updates; we also test 100- and 400-update lags and an EMA teacher with decay 0.995. Intervention-validity gating uses a lightweight entailment scorer and schema checker before rollout, and we log PCCC drift before each refresh.
Input: policy $\pi_\theta$, frozen continuation $\mu$, verifier bank $V$, intervention set $\mathcal{K}$

- for outer iteration $r=1,\ldots,R$ do
  - Collect rollouts $\tau\sim\pi_\theta$ with tool logs and verifier states
  - Calibrate selector $q_\omega$ and select candidate steps $S_t$
  - for each selected $(h_t,a_t)$ and $k\in\mathcal{K}$ do
    - Construct $a_t^{0,k}=g_k(h_t,a_t,u)$; complete $K$ continuations with frozen $\mu$
    - Evaluate $Y$, $Y^{0,k}$, evidence support, belief labels, and costs with $V$
    - Compute $\widehat{\Delta}^{k,\mu}_t$ by Eq. ([6](https://arxiv.org/html/2606.05263#S3.E6))
  - Update $m_\phi$, $q_\omega$, $\widehat{A}^{\mathrm{CVT}}$, policy projection, and Lagrange multipliers
  - Set $\mu\leftarrow\pi_\theta$ every 200 updates and record PCCC drift

Algorithm 1: CVT-RL training.
## 4 Experiment
### 4.1 Benchmarks, splits, and baselines
We evaluate on four task groups: long-context QA (RULER, LongBench, LooGLE), embodied text agents (ALFWorld), scientific interaction (ScienceWorld), and web/tool tasks (WebShop, WebArena, AgentBench subsets, ToolLLM/Gorilla-style API calls). Reward-hacking evaluation uses RHB-style categories plus held-out attacks: metadata shortcut, evaluator edit, hidden target leakage, unsupported finalization, and tool-output spoofing. Average success is the unweighted mean over the four task groups only; evidence F1 and belief deviation are averaged where supporting evidence or belief slots are defined. Table [1](https://arxiv.org/html/2606.05263#S4.T1) gives split and sub-benchmark details.
Table 1: Data splits, per-seed CVT-RL success, and benchmark-level uncertainty. Split is train/dev/test. CM is compute-matched non-causal RL; IM is information-matched counterfactual-process RL.
Table 2: Baseline capabilities. “Verifier” means access to the same verifier bank for reward computation; “CF” includes all counterfactual completions and is counted in normalized generations.
### 4.2 Main results
Table 3: Main results. Avg is the mean of four task-success columns. Standard deviations are over five seeds.
Figure 3: Task success across benchmark groups with heterogeneous five-seed standard errors.
### 4.3 Ablations and estimator diagnostics
Table [4](https://arxiv.org/html/2606.05263#S4.T4) controls for alternative explanations. Random counterfactuals isolate compute, while the information-matched CF-process baseline receives the exact same semantic counterfactual continuations, verifier labels, and outcome-model training data but learns a non-causal process reward without Eq. ([6](https://arxiv.org/html/2606.05263#S3.E6)). Detector-held-out attacks isolate detector overfitting; single-family interventions test semantic necessity. Table [5](https://arxiv.org/html/2606.05263#S4.T5) reports the overlap, validity, drift, and KL diagnostics for PCCC estimation.
Table 4: Ablations. Success is average task success; held-out hacking attacks are excluded from detector training.
Table 5: Estimator, detector, and trust-region diagnostics. IQR is interquartile range; invalid means intervention rejected before rollout.
Figure 4: Different counterfactual interventions expose different credit patterns. White dots mark positive contributions after Benjamini–Hochberg correction.
### 4.4 Belief, verifier, detector, and compute robustness
The main model uses prefix-only verifier-estimated belief. The belief probe reaches 82.6% slot accuracy and ECE 0.041 on validation tasks; on unseen task families, accuracy is 77.3% and ECE is 0.061. Oracle belief improves success by 1.2 points and is excluded from the main method. Hacking detectors are DeBERTa-v3 classifiers plus rules trained on metadata shortcut, unsupported-finalization, and tool-spoof traces; evaluator-edit and hidden-target attacks are never used for detector training. A blinded human audit samples 600 trajectories stratified by benchmark and model family. Three annotators use majority vote, are blind to method identity, and reach Fleiss’ $\kappa=0.78$; audited hacking is 4.6% for CVT-RL, 8.1% for the information-matched baseline, and 12.3% for compute-matched non-causal RL. Under adaptive detector-evasion prompts generated by an external attacker model, hacking rises to 7.1% for CVT-RL, 10.8% for the information-matched baseline, and 16.8% for the compute-matched baseline. Under verifier noise of 5%, 10%, and 20%, CVT-RL obtains 77.8, 76.4, and 73.2 success, with hacking rates 4.5, 5.7, and 8.9. Active selection controls cost: $L=4,8,12$ selected steps give success 77.6, 78.9, 79.1 at 2.6×, 3.9×, 5.2× generation cost; varying the frozen-policy refresh interval to 100/200/400 updates gives 78.5/78.9/78.1 success and mean PCCC drift 0.013/0.021/0.034.
Figure 5: Verifier-noise robustness with dual $y$-axes: success (left) and measured hacking rate (right). The single plot highlights the joint accuracy–safety trade-off as verifier noise increases.
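The audit aggregation described above (three annotators, majority vote, Fleiss' $\kappa$) can be reproduced with the standard formulas. The sketch below is a generic implementation with hypothetical function names, not the authors' audit tooling; it assumes every item is rated by the same number of annotators.

```python
def majority_vote(votes):
    """Binary majority over an odd number of 0/1 annotator votes."""
    return int(2 * sum(votes) > len(votes))


def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = annotators assigning item i to category j.

    Every row must sum to the same number of annotators (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Mean per-item agreement P_bar.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement P_e from the marginal category proportions.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:  # all ratings in one category: agreement is trivial
        return 1.0
    return (p_bar - p_exp) / (1.0 - p_exp)
```

For a hacking audit, each row would hold the counts of "hacking" and "not hacking" votes for one trajectory, and the audited rate is the mean of `majority_vote` over trajectories.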
### 4.5 Statistical testing and sensitivity
We run five seeds. Primary tests use stratified paired bootstrap over prompts within benchmark groups with 10,000 resamples; sensitivity uses seed-level paired tests and a mixed-effects logistic model with random intercepts for seed, prompt, and benchmark. Against compute-matched non-causal RL, success improves by 7.1 points (95% CI $[5.4,8.9]$, stratified $p=0.003$; mixed-effects coefficient $0.41\pm0.07$, $p<10^{-4}$), evidence F1 by 8.2 points (CI $[6.1,10.0]$, $p=0.002$), and hacking rate decreases by 7.8 points (CI $[6.5,9.4]$, $p=0.001$). The human-audited hacking difference is 7.7 points (CI $[5.1,10.2]$, permutation $p=0.004$). Holm-corrected $q$ values for all primary metrics are below 0.01. Varying $\lambda_\Delta\in\{0.3,0.45,0.6,0.75,0.9\}$ yields success 76.9, 78.0, 78.9, 78.6, 77.2 and hacking 5.4, 4.5, 3.9, 4.2, 5.8. Ratio caps 3, 5, 8 give success 78.0, 78.9, 78.6; selection floors 0.05, 0.15, 0.25 give 78.2, 78.9, 77.9; invalid/OOD filtering thresholds 0.70, 0.80, 0.90 give 78.1, 78.9, 78.4. Refresh periods 100, 200, 400 give success 78.3, 78.9, 77.5 and PCCC drift 0.012, 0.021, 0.039. Without retuning weights, Llama-3.1-8B obtains 77.6 and Qwen2.5-14B obtains 82.0 average success. Scaling the KL radii by factors $\{0.5,1,2\}$ gives success 76.8, 78.9, 77.6 and empirical full-vocabulary KL 0.011, 0.021, 0.043. Tool-token KL satisfies the 0.010 mean budget in 98.7% of updates; violations trigger projection expansion or penalty fallback.
Figure 6: Sensitivity and empirical trust-region behavior. The tool-token KL panel reports budget compliance separately from full-vocabulary reasoning KL.
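A stratified paired bootstrap and a Holm step-down adjustment, the two procedures named above, can be sketched with the standard library alone. This is a generic illustration under stated assumptions: the input is a hypothetical mapping from benchmark group to per-prompt paired differences (method minus baseline), the statistic is the unweighted mean of group means, and the interval is a simple percentile interval. The paper does not specify these details.

```python
import random


def stratified_paired_bootstrap(diffs_by_group, n_resamples=10_000, seed=0):
    """Resample prompts within each group; return (point, (lo, hi)) at 95%."""
    rng = random.Random(seed)
    groups = list(diffs_by_group.values())

    def stat(gs):
        # Unweighted mean over groups of the per-group mean difference.
        return sum(sum(g) / len(g) for g in gs) / len(gs)

    point = stat(groups)
    reps = sorted(
        stat([rng.choices(g, k=len(g)) for g in groups])
        for _ in range(n_resamples)
    )
    lo = reps[int(0.025 * n_resamples)]
    hi = reps[min(n_resamples - 1, int(0.975 * n_resamples))]
    return point, (lo, hi)


def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted
```

Resampling within groups keeps each benchmark's share of the statistic fixed across replicates, which is the point of stratification when groups differ in size and difficulty.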
## 5 Discussion
CVT-RL is not a claim that language-agent causality is solved. PCCC is conditional on $h_t$, intervention family, selected-step rule, and frozen continuation policy. A deletion effect, an evidence-substitution effect, and a tool-perturbation effect are different scientific objects, and none equals the total causal effect under the future learned policy. The practical value is an auditable credit signal aligned with verified success and less prone to rewarding merely plausible reasoning. Intervention-validity gating lowers OOD counterfactuals, and adaptive-attack evaluation suggests the model is not merely overfitting a fixed detector family. The compute overhead is substantial, though active selection offers a usable cost–accuracy trade-off. The method reduces measured and human-audited hacking under evaluated, held-out, and adaptive attacks, but cannot guarantee immunity to novel attacks or broken verifiers.
## 6 Conclusion
We presented CVT-RL, a constrained RL framework for long-horizon language agents that combines dense verifiable rewards with policy-conditioned counterfactual contribution. The formulation separates intervention semantics, controls belief-label leakage, states identification assumptions, gives a full-vocabulary KL condition, and reports overlap, drift, detector, and statistical diagnostics. Across long-context, embodied, scientific, and web/tool tasks, CVT-RL improves success and evidence quality while reducing measured reward hacking, making counterfactual credit a promising direction for reliable agent RL.
## Appendix AAppendix
### A\.1Proof details
DR consistency\.For fixedkkandμ\\mu, defineYaY^\{a\}as final verified success after setting the selected action toaaand completing withμ\\mu\. Under consistency, positivity, exchangeability, fixed intervention distribution, and inverse selection correction,𝔼\[Ya\|ht,St=1\]=𝔼\[mk\(ht,a\)\]\\mathbb\{E\}\[Y^\{a\}\|h\_\{t\},S\_\{t\}=1\]=\\mathbb\{E\}\[m^\{k\}\(h\_\{t\},a\)\]\. Ifmkm^\{k\}is correct, residuals have zero conditional mean even with misspecified ratios; if the ratio is correct, the weighted residual recovers the missing outcome even with misspecifiedmkm^\{k\}\. Thus Eq\. \([6](https://arxiv.org/html/2606.05263#S3.E6)\) is unbiased for selected\-step PCCC before clipping\. With estimated nuisances, the leading bias is the product of outcome and ratio errors plus explicit clipping and invalid\-intervention bias; Table[5](https://arxiv.org/html/2606.05263#S4.T5)quantifies these terms\.
Top-$M$ KL. Eq. ([9](https://arxiv.org/html/2606.05263#S3.E9)) follows by decomposing the vocabulary into top and tail sets. Because the tail mass and the conditional tail distribution are copied from $p_{\rm old}$, the tail KL and mass-ratio terms vanish; the only remaining term is the top conditional KL multiplied by $\alpha$. This guarantee applies to the projected sampling distribution, not to arbitrary later optimizer steps, so empirical full-vocabulary and tool-token KL are logged after every update.
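The decomposition can be verified numerically. The sketch below assumes that $\alpha$ denotes the probability mass that $p_{\rm old}$ assigns to the top set and that the KL is taken from the projected distribution to $p_{\rm old}$; neither is stated in this section, and the identity holds in the reverse direction as well. The helper names (`project_top_m`, `kl`, `normalize`) and the random distributions are hypothetical.

```python
import math
import random

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def project_top_m(p_old, top_idx, new_top_cond):
    """Replace the conditional distribution on the top set, copy the tail from p_old.

    alpha is ASSUMED to be the top-set mass under p_old.
    """
    alpha = sum(p_old[i] for i in top_idx)
    q = list(p_old)  # tail entries stay identical to p_old
    for i, c in zip(top_idx, new_top_cond):
        q[i] = alpha * c  # top mass alpha is preserved, only its shape changes
    return q, alpha

rng = random.Random(0)
vocab_size, m = 50, 8
p_old = normalize([rng.random() + 1e-3 for _ in range(vocab_size)])
top_idx = sorted(range(vocab_size), key=lambda i: p_old[i], reverse=True)[:m]
new_top_cond = normalize([rng.random() + 1e-3 for _ in range(m)])

q, alpha = project_top_m(p_old, top_idx, new_top_cond)
old_top_cond = [p_old[i] / alpha for i in top_idx]

full_kl = kl(q, p_old)                   # full-vocabulary KL
top_kl = kl(new_top_cond, old_top_cond)  # KL between top conditionals
print(full_kl, alpha * top_kl)
```

Because every tail entry of `q` equals the corresponding entry of `p_old`, the tail contributes `log(1) = 0` to the sum, which is why the two printed values agree up to floating-point error.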
### A.2 Additional implementation notes
The selector uses a logistic model with the following features: token log-probability drop, value error, verifier disagreement, number of tool arguments, retrieval-rank change, and novelty. It is trained from replay labels indicating whether ablating a step changes any verifier output; isotonic calibration gives an ECE of 0.033. Outcome models use BCE with class-balanced sampling over original and counterfactual continuations; validation AUC is 0.84–0.89. Intervention-validity gating uses a tool-schema parser, a BM25-overlap support score, and a DeBERTa-v3 entailment scorer; invalid proposals are regenerated at most twice. Human audit samples 150 trajectories per task group and aggregates three annotations by majority vote. Baselines use the same backbone, retriever, parser, verifier bank, context length, optimizer, and dev-set tuning budget where the original method permits it; Table [2](https://arxiv.org/html/2606.05263#S4.T2) reports every capability difference.
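Two of the quantities above are simple enough to sketch: the expected calibration error (ECE) reported for the selector and the majority-vote aggregation used in the human audit. The version below assumes equal-width confidence bins and binary hacking labels; the binning scheme, the function names, and the example data are assumptions and are not taken from the paper.

```python
from collections import Counter

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE with equal-width bins (the binning scheme is an assumption)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - confidence)
    return ece

def majority_vote(annotations):
    """Most common label among an odd number of annotations."""
    return Counter(annotations).most_common(1)[0][0]

def audited_hacking_rate(trajectory_annotations):
    """Fraction of trajectories whose majority label is 1 (hacking)."""
    votes = [majority_vote(a) for a in trajectory_annotations]
    return sum(votes) / len(votes)

# Hypothetical selector outputs and replay labels.
probs = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]
ece = expected_calibration_error(probs, labels, n_bins=2)  # approximately 0.15

# Hypothetical audit: three binary annotations per trajectory.
audits = [(1, 0, 0), (1, 1, 0), (0, 0, 0), (1, 1, 1)]
rate = audited_hacking_rate(audits)  # 0.5
print(ece, rate)
```

With three binary annotations per trajectory a strict majority always exists, so no tie-breaking rule is needed in this setting.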