BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents
Summary
BiPACE introduces a drop-in advantage estimator that fixes state-action credit mismatch in stepwise group-based RL for LLM agents, using bisimulation-guided state clustering and action counterfactual estimation, achieving significant performance gains on ALFWorld, WebShop, and TextCraft with Qwen2.5 models.
# Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents
Source: [https://arxiv.org/html/2606.25556](https://arxiv.org/html/2606.25556)
Hanyang Wang¹,†, Weijieying Ren³, Yuxiang Zhang², Ding Cao⁴, Zhizhao Zeng⁵, Ke Zeng⁵, Tianxiang Zhao²,\*

¹University of Chicago; ²The Hong Kong University of Science and Technology (Guangzhou); ³Stanford University; ⁴University of Science and Technology of China; ⁵Meituan

hanyangw@uchicago.edu, wjyren@stanford.edu, yxzhang25128@gmail.com, caoding@mail.ustc.edu.cn, {zengzhizhao,zengke02}@meituan.com, tianxiangz@hkust-gz.edu.cn

†First author. \*Corresponding author.
###### Abstract
Stepwise group-based RL is an attractive way to train long-horizon LLM agents without a learned critic: it reuses multiple sampled rollouts to estimate local advantages. Its weakness is less visible but more fundamental: every group-relative estimator assumes that the steps it compares are equivalent for credit assignment. We show that current agentic variants violate this assumption through a *state-action credit mismatch*. The observation-hash partition is overly fine on the state side, creating singleton groups with zero step-level signal, while a single within-group mean is too coarse on the action side, mixing state-value estimation with action-specific credit. We introduce BiPACE (*Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation*), a drop-in advantage estimator that fixes both sides without adding a critic, auxiliary loss, or extra rollouts. BiGPO clusters steps by cosine distance in the actor’s own hidden-state geometry, an empirical, policy-induced proxy for bisimulation that substantially lowers the singleton rate left by observation hashing. PACE then recenters returns within each behavioral cluster using action-conditioned peer baselines; its Q-style instance estimates a local $\widehat{Q}(s,a)-\widehat{V}(s)$ nonparametrically. On ALFWorld/Qwen2.5-7B, BiPACE$_{\text{Q}}$ raises overall validation success from GiGPO’s reported 90.8 to **97.1±0.9** over three seeds, and crosses the 95% threshold on every seed, which GiGPO never does within the same budget. On Qwen2.5-1.5B it reaches **93.5±1.2** versus GiGPO’s 86.7, and on WebShop and TextCraft it improves over GRPO and GiGPO at both model scales. The change is small in systems terms: the measured BiPACE-specific share is 11.3% of a single ALFWorld/Qwen2.5-7B training-step wall time. Yet it changes the estimator’s comparison unit from surface identity to approximate behavioral equivalence plus action-side counterfactuals. The code is available at [https://github.com/TianxiangZhao/BiPACE](https://github.com/TianxiangZhao/BiPACE).
Figure 1: BiPACE$_{\text{Q}}$ vs. GiGPO across benchmarks and model scales. Top: validation success over training; dots and badges mark each method’s peak and BiPACE$_{\text{Q}}$’s gap over GiGPO. Bottom: steps to reach a fixed success threshold (lower is better), with speedups $\text{steps}_{\text{GiGPO}}/\text{steps}_{\text{BiPACE}_{\text{Q}}}$; hatched bars mark thresholds never reached. Multi-seed aggregates are in [Tables 2](https://arxiv.org/html/2606.25556#S4.T2) and [3](https://arxiv.org/html/2606.25556#S4.T3).

## 1. Introduction
Reinforcement learning (RL) post-training of large language models has recently moved beyond single-turn reasoning into the harder *agentic* regime: long-horizon, partially observed, multi-turn interaction with tools, web pages, simulated households, and games. The central obstacle is assigning a sparse terminal reward to the intermediate decisions that made the trajectory succeed or fail (Wang et al., [2025](https://arxiv.org/html/2606.25556#bib.bib1)). Group-based RL methods such as RLOO-style leave-one-out estimators (Kool et al., [2019](https://arxiv.org/html/2606.25556#bib.bib23); Ahmadian et al., [2024](https://arxiv.org/html/2606.25556#bib.bib24)) and GRPO (Shao et al., [2024](https://arxiv.org/html/2606.25556#bib.bib2)) are appealing because they avoid a learned value network. Recent agentic variants such as GiGPO (Feng et al., [2025](https://arxiv.org/html/2606.25556#bib.bib3)) and HGPO (He et al., [2026](https://arxiv.org/html/2606.25556#bib.bib4)) push this idea to the step level by comparing rollout steps inside groups. Their performance, however, depends on a choice that is often treated as an implementation detail: which steps are grouped together for that comparison.
These estimators share an implicit assumption: if two step records are placed in the same group, then they are interchangeable for credit assignment. In long-horizon agent environments, this assumption fails in two coupled ways. State side: observation identity is a convenient but overly sparse proxy for value equivalence. Group baselines are reliable when grouped states share continuation value, a condition formalized by *bisimulation* (Givan et al., [2003](https://arxiv.org/html/2606.25556#bib.bib11); Ferns et al., [2004](https://arxiv.org/html/2606.25556#bib.bib12)); observation keys impose a strictly finer equivalence than bisimulation requires, splitting many reusable states into isolated singletons. Action side: even when states are comparable, the usual within-group mean assigns the same baseline to all actions, ignoring that different actions from the same state can induce different futures. We call this two-sided failure the *state-action credit mismatch*.
This mismatch is measurable during training via the singleton fraction, independently of the final task reward. In our GiGPO reproduction on ALFWorld, 34.2% of step groups are singletons at iteration 10 and the fraction remains 20.7% at iteration 140. Since singleton clusters produce zero step-level advantage, exact observation hashes discard local signal when the policy needs it most. On TextCraft, where observations are sparser, exact hashes isolate even more records and expose fewer matched pairs (detailed in [Sec. 4.2](https://arxiv.org/html/2606.25556#S4.SS2)).
We propose BiPACE (*Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation*), a drop-in advantage estimator that treats step-level credit as two local problems: state aggregation and action-conditioned credit assignment. On the state side, BiGPO replaces observation hashing with cosine clustering on the actor’s normalized hidden state $\phi_{\theta}(s_{t})$ at a fixed late layer ([Apps. I](https://arxiv.org/html/2606.25556#A9) and [H](https://arxiv.org/html/2606.25556#A8)), an empirical proxy for the behavioral metrics of Castro et al. ([2021](https://arxiv.org/html/2606.25556#bib.bib13)). On the action side, PACE partitions each behavioral cluster by the executed action and augments the cluster-mean baseline with a same-action peer estimate, forming a nonparametric $\widehat{Q}(s,a)-\widehat{V}(s)$ advantage inside each cluster. The two halves are coupled: PACE requires behaviorally comparable state peers, which BiGPO provides. Our main contributions are summarized as follows:
- *Identifying state-action credit mismatch.* We show that stepwise group-based RL conflates state aggregation with action-conditioned credit assignment, and that exact observation hashing is the wrong state equivalence relation, splitting reusable states into singletons that carry no step-level signal.
- *Proposing a drop-in advantage estimator.* We introduce BiPACE, which makes two local replacements: BiGPO clusters actor-hidden fingerprints as a policy-induced bisimulation proxy, and PACE adds an action-conditioned peer baseline inside each cluster.
- *Analyzing the estimator.* We bound the state-side bias by $O(\varepsilon)$ under a MICo-Lipschitz assumption, recover GiGPO as the $\varepsilon=0$ limit, quantify the singleton signal loss, and show Q-style PACE is exact under exact bisimulation.
- *Achieving strong empirical performance.* BiPACE$_{\text{Q}}$ gains +6.3 pp on ALFWorld/Qwen2.5-7B (97.1±0.9 vs. 90.8) and +6.8 pp on 1.5B, and improves over GRPO and GiGPO on WebShop and TextCraft at both scales, at only 11.3% step overhead.
## 2\. Related Work
#### Group\-relative RL for LLM agents
BiPACE builds on critic-free group-relative RL, including GRPO (Shao et al., [2024](https://arxiv.org/html/2606.25556#bib.bib2)), GiGPO (Feng et al., [2025](https://arxiv.org/html/2606.25556#bib.bib3)), and HGPO (He et al., [2026](https://arxiv.org/html/2606.25556#bib.bib4)). These methods compare sampled returns inside groups but keep the state equivalence relation discrete; BiPACE replaces that relation with a policy-induced behavioral partition. The state side follows the value-preserving bisimulation view (Ferns et al., [2004](https://arxiv.org/html/2606.25556#bib.bib12); Castro et al., [2021](https://arxiv.org/html/2606.25556#bib.bib13); Zhang et al., [2020](https://arxiv.org/html/2606.25556#bib.bib14)), while PACE gives a nonparametric analogue of action-conditioned counterfactual baselines studied in COMA/CCPO and related work (Foerster et al., [2018](https://arxiv.org/html/2606.25556#bib.bib15); Li et al., [2026b](https://arxiv.org/html/2606.25556#bib.bib8)). Other agent-credit methods alter the learning signal or optimizer (Tan et al., [2026](https://arxiv.org/html/2606.25556#bib.bib5); Liu et al., [2025](https://arxiv.org/html/2606.25556#bib.bib6); Wei et al., [2025](https://arxiv.org/html/2606.25556#bib.bib7); Yu et al., [2025](https://arxiv.org/html/2606.25556#bib.bib10)); BiPACE instead changes which step records are compared. Extended discussion appears in [App. A](https://arxiv.org/html/2606.25556#A1).
## 3. Method: BiPACE
This section first isolates the estimation issue that BiPACE targets, then describes the two local replacements that constitute the method.
### 3\.1 Estimator setup
For each prompt group $p$, GRPO samples trajectories $\{\tau^{(g)}\}_{g=1}^{G}$ and standardizes terminal returns within the group,

$$A^{\mathrm{ep}}(\tau^{(g)})=\frac{R(\tau^{(g)})-\mu_{p}}{\sigma_{p}+\delta},\quad\mu_{p},\sigma_{p}\text{ over }\{R(\tau^{(g)})\}_{g=1}^{G}. \tag{1}$$

GiGPO adds a step-level term by collecting all step records in the same prompt group and partitioning them by exact observation hash: $\mathcal{C}_{p}=\big\{\,\{i:\mathrm{hash}(s^{(i)})=h\}:h\in\mathrm{Hash}(\{s^{(i)}\}_{i\in p})\,\big\}$. For each cluster $C\in\mathcal{C}_{p}$, it normalizes the return-to-go $R_{t}^{(i)}$ locally:

$$A^{\mathrm{step}}(i)=\frac{R_{t}^{(i)}-\mu_{C}}{\sigma_{C}+\delta},\quad i\in C. \tag{2}$$

BiPACE keeps this training loop intact and replaces only the partition and the local baseline used by [Eq. 2](https://arxiv.org/html/2606.25556#S3.E2). Extra background on the agent decision process, GRPO, and bisimulation appears in [App. B](https://arxiv.org/html/2606.25556#A2).
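The two estimators in Eqs. 1 and 2 can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names, the value of `DELTA`, and the use of the population standard deviation are assumptions, since the text only names a stabilizer $\delta$ and does not say which standard-deviation convention is used.

```python
from collections import defaultdict
from statistics import mean, pstdev

DELTA = 1e-6  # assumed value; the text only names a small stabilizer delta


def episode_advantages(returns, delta=DELTA):
    """Eq. 1: standardize terminal returns within one prompt group."""
    mu, sigma = mean(returns), pstdev(returns)
    return [(r - mu) / (sigma + delta) for r in returns]


def step_advantages(keys, returns_to_go, delta=DELTA):
    """Eq. 2: standardize returns-to-go inside each group of step records.

    keys[i] is the grouping key of record i (an observation hash for GiGPO).
    """
    groups = defaultdict(list)
    for idx, key in enumerate(keys):
        groups[key].append(idx)
    adv = [0.0] * len(keys)
    for members in groups.values():
        vals = [returns_to_go[i] for i in members]
        mu, sigma = mean(vals), pstdev(vals)
        for i in members:
            # In a singleton group mu equals the record's own return,
            # so the numerator (and hence the advantage) is exactly zero.
            adv[i] = (returns_to_go[i] - mu) / (sigma + delta)
    return adv
```

Calling `step_advantages(["a", "a", "b"], [1.0, 0.0, 1.0])` gives the third record, which is alone in its group, a step-level advantage of zero, which is the singleton effect discussed in the next subsection.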
### 3\.2 The state\-action credit mismatch
GiGPO improves over trajectory-level GRPO by computing a step-level advantage within groups of rollout steps that share the same current observation. This design assumes that two step records in the same group are exchangeable for credit assignment. In agentic tasks, the assumption breaks in two complementary ways.
State side. Exact observation identity is too strict a proxy for value equivalence: observations that differ only in surface form land in different groups even when they induce the same continuation value. When the exact-key partition produces a singleton group, the group mean $\mu_{C}$ equals the record’s own return, so the step-level advantage in [Eq. 2](https://arxiv.org/html/2606.25556#S3.E2) degenerates to zero and the step contributes no step-level gradient.
Action side. Even when a group contains behaviorally comparable states, computing a single cluster mean evaluates every step against the same number, regardless of which action was taken. Two actions from the same state neighborhood can lead to different futures, so their advantages should differ; but subtracting a common mean cannot distinguish between them. The desired quantity is the local advantage $Q(s,a)-V(s)$: the cluster mean serves as a state-value estimate $V(s)$, and pooling same-action peers yields an action-value estimate $Q(s,a)$, isolating action-specific credit without changing the state-level baseline.
BiPACE addresses these two sides jointly. On the state side, BiGPO replaces the observation-hash partition with a behavioral partition derived from the actor’s own hidden states. On the action side, PACE augments the cluster-mean baseline with a same-action peer estimate, forming the $Q(s,a)-V(s)$ advantage inside each cluster.
### 3\.3 BiPACE overview
Figure 2: Method overview. BiPACE makes two local replacements to the GiGPO step-level estimator. Left: A prompt group provides step records ($s_{t}^{(i)},a_{t}^{(i)},R_{t}^{(i)}$) across $G$ rollouts; chip colors encode bisimulation class. Middle: BiGPO extracts the actor’s normalized hidden state $\phi_{\theta}(s)$ at a fixed late layer and clusters by cosine distance, forming behavioral state neighborhoods $\mathcal{C}_{1},\mathcal{C}_{2},\ldots$ Right: PACE splits each cluster by the executed action key and computes a per-action peer baseline; the Q-style form estimates $\widehat{Q}(s,a)-\widehat{V}(s)$. Only the step-level advantage changes; the PPO objective is unchanged.

BiPACE in one line. For each prompt group, BiPACE first clusters step records by the actor’s own hidden-state geometry (BiGPO), then computes the step-level baseline inside each cluster conditional on the executed action (PACE). Everything outside the advantage estimator is identical to GiGPO.

[Figure 2](https://arxiv.org/html/2606.25556#S3.F2) traces the full pipeline; the two components compose rather than simply stack, as PACE’s per-action baseline requires a behaviorally coherent peer pool that BiGPO provides. [Sections 3.4](https://arxiv.org/html/2606.25556#S3.SS4) and [3.5](https://arxiv.org/html/2606.25556#S3.SS5) detail each in turn.
### 3.4 State-side grouping with BiGPO
The key insight behind BiGPO is that the actor’s own hidden states already encode behavioral similarity: observations the policy processes identically cluster tightly in late-layer representation space, regardless of their surface form. Exact observation hashing ignores this geometry (treating any surface-distinct observations as incomparable even when the policy responds to them identically; [Sec. 3.2](https://arxiv.org/html/2606.25556#S3.SS2)), and routinely discards the local signal that step-level RL is designed to exploit. BiGPO replaces the exact hash with a soft behavioral partition derived from this representation.
Concretely, let $f_{\theta}:\mathcal{S}\to\mathbb{R}^{D}$ be the function that maps the current prompt $s_{t}$ to the actor LLM’s hidden state at the final prompt token, taken at a fixed late intermediate layer chosen once per backbone (layer $-8$ for Qwen2.5-7B, $-12$ for Qwen2.5-1.5B; calibration details in [Apps. I](https://arxiv.org/html/2606.25556#A9) and [H](https://arxiv.org/html/2606.25556#A8)). We use the normalized representation $\phi_{\theta}(s)=f_{\theta}(s)/\|f_{\theta}(s)\|_{2}$. This representation moves with the policy being optimized and requires no learned critic or auxiliary model; it is obtained via a dedicated actor forward pass with hidden-state extraction enabled, whose cost is measured in [Sec. 4.6](https://arxiv.org/html/2606.25556#S4.SS6).
For each prompt group $p$, BiGPO computes the partition

$$\mathcal{C}_{p}^{\mathrm{BiGPO}}\;=\;\mathrm{Cluster}_{\varepsilon}\big(\{\phi_{\theta}(s^{(i)}):i\in p\},\;d_{\cos}\big), \tag{3}$$

where $d_{\cos}(u,v)=1-u^{\top}v$ and $\mathrm{Cluster}_{\varepsilon}$ is a single-pass greedy procedure: each record joins the nearest existing centroid if its cosine distance is $\leq\varepsilon$, otherwise it seeds a new cluster, and the joined centroid is updated online; full pseudocode is in [App. D](https://arxiv.org/html/2606.25556#A4). The GiGPO step advantage in [Eq. 2](https://arxiv.org/html/2606.25556#S3.E2) then applies unchanged, with $\mathcal{C}_{p}$ replaced by $\mathcal{C}_{p}^{\mathrm{BiGPO}}$. [Appendix C](https://arxiv.org/html/2606.25556#A3) bounds the resulting bias: under a MICo-Lipschitz assumption on the embedding, the replacement trades an $O(\varepsilon)$ bias for many more non-singleton step groups.
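The greedy procedure described above can be sketched as follows. This is a minimal stdlib-only illustration under stated assumptions, not the pseudocode of App. D: the function names are hypothetical, and the online centroid update is assumed to be the re-normalized sum of the member embeddings, a detail the text does not specify. Inputs are assumed to be non-zero vectors.

```python
import math


def normalize(v):
    """Return v / ||v||_2, the normalized fingerprint phi in the text."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(u, v):
    """d_cos(u, v) = 1 - u^T v for unit-norm u and v."""
    return 1.0 - sum(a * b for a, b in zip(u, v))


def greedy_cluster(embeddings, eps):
    """Single-pass greedy clustering; returns one cluster id per record."""
    centroids, sums, labels = [], [], []
    for raw in embeddings:
        phi = normalize(raw)
        best, best_d = None, float("inf")
        for cid, centroid in enumerate(centroids):
            d = cosine_distance(phi, centroid)
            if d < best_d:
                best, best_d = cid, d
        if best is not None and best_d <= eps:
            # Assumed online update: re-normalized running sum of members.
            sums[best] = [s + x for s, x in zip(sums[best], phi)]
            centroids[best] = normalize(sums[best])
            labels.append(best)
        else:
            # No centroid within eps: this record seeds a new cluster.
            centroids.append(phi)
            sums.append(list(phi))
            labels.append(len(centroids) - 1)
    return labels
```

For example, `greedy_cluster([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]], eps=0.05)` places the first two nearly parallel vectors in one cluster and the orthogonal third vector in its own. Because the pass is greedy and single-shot, the result can depend on record order, which is one reason a sketch like this should be checked against the reference pseudocode before reuse.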
Relation to prior methods. Setting $\varepsilon=0$ and using a one-hot observation hash as $\phi$ exactly recovers GiGPO; replacing the hash with a history-aware signature recovers HGPO. The GiGPO/HGPO family thus uses hand-designed discrete fingerprints, while BiGPO uses the policy’s own continuous fingerprint with coarsening controlled by $\varepsilon$.
Embedder design space. The estimator only requires a fingerprint $\phi$, leaving the embedder open. We examine two backends along an effort/fidelity axis: *HashNgram*, a zero-dependency character-$n$-gram hash that groups by lexical surface form, and *Actor-Hidden*, the policy LLM’s own hidden state. HashNgram is a policy-agnostic control: it isolates whether the gain comes from the policy-induced geometry or merely from coarsening the partition; Actor-Hidden is the main method because it changes with the policy being optimized.
### 3\.5 Action\-conditioned baseline with PACE
PACE realizes the $Q(s,a)-V(s)$ decomposition identified in [Sec. 3.2](https://arxiv.org/html/2606.25556#S3.SS2): within each behavioral cluster, the cluster mean estimates $V(s)$ and a same-action peer mean estimates $Q(s,a)$, isolating action-specific credit without changing the state-level baseline.
Concretely, PACE splits each behavioral cluster by the executed action. For each action $a$, let $\kappa(a)\in\mathbb{Z}$ be an action key. We use two practical keys: *first-token*, the hash of the first $N=8$ response tokens, and *action-tag*, the hash of the body of <action\>...</action\>. The latter is semantically cleaner when the environment exposes parseable actions; the former is cheap and robust when parsing is unavailable.
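The two action keys can be illustrated with a short sketch. The helper names, the SHA-1-based integer hash, the whitespace stripping, and the choice to return `None` when no tag parses are all assumptions made for illustration; the text only states what each key is computed from.

```python
import hashlib
import re

# Non-greedy match of the first <action>...</action> body in a response.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def _stable_hash(text):
    """Deterministic integer hash (the built-in hash() of str is salted per process)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def first_token_key(token_ids, n=8):
    """first-token key: hash of the first n response token ids."""
    return _stable_hash(",".join(str(t) for t in token_ids[:n]))


def action_tag_key(response_text):
    """action-tag key: hash of the <action> body, or None if no tag parses."""
    match = ACTION_RE.search(response_text)
    if match is None:
        return None
    return _stable_hash(match.group(1).strip())
```

Two responses that share their first eight tokens get the same `first_token_key` even if they diverge later, while `action_tag_key` ignores any reasoning text outside the tag. A caller would need its own policy for the `None` case, for instance falling back to the first-token key.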
Inside a cluster $C\in\mathcal{C}_{p}^{\mathrm{BiGPO}}$, write $k_{i}=\kappa(a^{(i)})$, $C^{=}(i)=\{j\in C:k_{j}=k_{i}\}$, and $C^{\neq}(i)=C\setminus C^{=}(i)$. PACE instantiates two nonparametric estimators:

$$\begin{aligned}
\hat{A}^{\text{diff}}(i) &= R^{(i)}-\tfrac{1}{|C^{\neq}(i)|}\sum_{j\in C^{\neq}(i)}R^{(j)} && (4)\\
\hat{A}^{\text{q}}(i) &= \widehat{Q}(s,a_{i})-\widehat{V}(s) && (5)\\
\widehat{Q}(s,a_{i}) &= \tfrac{1}{|C^{=}(i)|}\sum_{j\in C^{=}(i)}R^{(j)} && (6)\\
\widehat{V}(s) &= \tfrac{1}{|C|}\sum_{j\in C}R^{(j)}. && (7)
\end{aligned}$$

The diff-peer form compares each action against peers that took a different action from the same state neighborhood; the Q-style form directly estimates $\widehat{Q}(s,a)-\widehat{V}(s)$.
Fallbacks keep the estimator well-defined. Singleton clusters retain $\hat{A}^{\text{step}}=0$ as in GiGPO. The diff-peer form falls back to RLOO leave-one-out when $|C^{\neq}(i)|=0$; the Q-style form falls back when $|C^{=}(i)|=1$. Empirically, the Q-style branch is non-degenerate on ALFWorld and gives the best completed variant in [Sec. 4.5](https://arxiv.org/html/2606.25556#S4.SS5); in environments with larger effective action spaces, the diff-peer form is the safer default.
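A compact sketch of Eqs. 4 to 7 with the fallbacks above, for the records of a single behavioral cluster, is given below. It is illustrative only: the function name and signature are hypothetical, the leave-one-out fallback is assumed to average over all other records in the same cluster, and no variance normalization is applied because Eqs. 4 to 7 show none.

```python
from statistics import mean


def pace_advantages(returns, action_keys, style="q"):
    """Step-level PACE advantages for the records of ONE behavioral cluster.

    returns[i] is the return of record i; action_keys[i] is kappa(a_i).
    """
    n = len(returns)
    if n == 1:
        return [0.0]  # singleton cluster: zero step-level advantage
    v_hat = mean(returns)  # Eq. 7: cluster mean as the state-value estimate
    advantages = []
    for i in range(n):
        same = [returns[j] for j in range(n) if action_keys[j] == action_keys[i]]
        diff = [returns[j] for j in range(n) if action_keys[j] != action_keys[i]]
        # Assumed fallback: leave-one-out mean over the rest of the cluster.
        loo = returns[i] - mean([returns[j] for j in range(n) if j != i])
        if style == "q":
            # Eqs. 5-6, falling back when record i has no same-action peer.
            advantages.append(loo if len(same) == 1 else mean(same) - v_hat)
        elif style == "diff":
            # Eq. 4, falling back when every peer took the same action.
            advantages.append(loo if not diff else returns[i] - mean(diff))
        else:
            raise ValueError(f"unknown style: {style}")
    return advantages
```

With returns `[1.0, 1.0, 0.0, 0.0]` and action keys `["a", "a", "b", "b"]`, the Q-style form gives `[0.5, 0.5, -0.5, -0.5]` and the diff-peer form gives `[1.0, 1.0, -1.0, -1.0]`: both rank action `a` above `b`, but the diff-peer baseline excludes same-action peers and so produces a larger spread.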
Combining with the episode term gives the final per-token advantage

$$A^{\mathrm{BiPACE}}(i)\;=\;A^{\mathrm{ep}}(i)\;+\;w\cdot\hat{A}^{S}(i), \tag{8}$$

where $\hat{A}^{S}$ is the selected step-level estimator and $w$ is the same fixed weight used by GiGPO. The PPO surrogate is unchanged. Implementation details are in [App. G](https://arxiv.org/html/2606.25556#A7) and the computational-budget discussion below.
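Eq. 8 amounts to a weighted sum per step record. The sketch below is a hedged illustration: the default `w=1.0` is a placeholder rather than the weight used in the paper, and broadcasting one scalar to every response token of the step is an assumption about how "per-token" is realized.

```python
def combine_advantages(a_episode, a_step, w=1.0):
    """Eq. 8: episode-level advantage plus weighted step-level advantage.

    w=1.0 is a placeholder; the text only says w matches GiGPO's fixed weight.
    """
    return a_episode + w * a_step


def broadcast_to_tokens(advantage, num_response_tokens):
    """Assumed per-token form: every token of the step shares one scalar."""
    return [advantage] * num_response_tokens
```

For instance, `combine_advantages(0.5, 0.25, w=2.0)` returns `1.0`, and a step in a singleton cluster (step-level term of zero) simply inherits the episode-level advantage.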
## 4\. Experiments
We organize the experiments around four questions. RQ1: Does BiPACE improve end-task success under the same rollout budgets and model scales? RQ2: Does the improvement come with better sample efficiency? RQ3: Do the policy-state groups actually increase usable step-level interactions? RQ4: Is the action-conditioned PACE estimator necessary on top of the state-side partition? We present RQ3 (mechanism) before RQ1 (end-task results) to establish the underlying diagnostic before interpreting the headline numbers; RQ2 and RQ4 follow. The appendix provides extended related work, background, proofs, reproducibility details, hyperparameters, prompts, calibration scans, diagnostic tables, failure modes, and full per-seed results.
### 4\.1 Setup
Environments. ALFWorld (Shridhar et al., [2020](https://arxiv.org/html/2606.25556#bib.bib17)), WebShop (Yao et al., [2022](https://arxiv.org/html/2606.25556#bib.bib18)), and TextCraft (Prasad et al., [2024](https://arxiv.org/html/2606.25556#bib.bib19)). Models. Qwen2.5-{1.5, 7}B-Instruct (Yang et al., [2024](https://arxiv.org/html/2606.25556#bib.bib22)). Baselines. GRPO (Shao et al., [2024](https://arxiv.org/html/2606.25556#bib.bib2)), PPO with critic (Schulman et al., [2017](https://arxiv.org/html/2606.25556#bib.bib16)), GiGPO (Feng et al., [2025](https://arxiv.org/html/2606.25556#bib.bib3)), and HGPO (He et al., [2026](https://arxiv.org/html/2606.25556#bib.bib4)); prompting rows from Feng et al. ([2025](https://arxiv.org/html/2606.25556#bib.bib3)) anchor the benchmark scale. Hardware. 4×H100 (7B); 2–4×H100 (1.5B). Full settings, seeds, hyperparameters, and prompt templates are in [Apps. I](https://arxiv.org/html/2606.25556#A9) and [J](https://arxiv.org/html/2606.25556#A10).
### 4\.2 Mechanism: the singleton tax \(RQ3\)
Table 1: Singleton fraction on ALFWorld/7B.

GiGPO’s observation-hash partition leaves many step records in singleton clusters. These records receive zero step-level advantage by construction, so the partition directly controls how much of the step-level gradient can be used. [Table 1](https://arxiv.org/html/2606.25556#S4.T1) shows the mechanism-level change: the policy-induced bisimulation partition starts with a lower singleton fraction than GiGPO has even late in training. Both rows are measured from our training-log diagnostics. BiGPO entries are 5-step window means centered at the listed iteration, averaged over the BiPACE$_{\text{Q}}$ seeds whose logs cover that window; the partition depends only on the state-side clustering, not on the PACE estimator.
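A singleton diagnostic of this kind is straightforward to compute from per-record cluster ids. The sketch below is an assumption-laden illustration rather than the authors' logging code: the function name is hypothetical, and because the text refers both to singleton "step groups" and to singleton "step records", the sketch reports both normalizations instead of guessing which one the table uses.

```python
from collections import Counter


def singleton_stats(cluster_ids):
    """Return singleton fractions for one partition of step records.

    cluster_ids[i] is the cluster (or observation-hash) id of record i.
    """
    sizes = Counter(cluster_ids)
    # Each singleton group contains exactly one record, so this count is
    # both the number of singleton groups and of singleton records.
    n_singletons = sum(1 for size in sizes.values() if size == 1)
    return {
        "group_fraction": n_singletons / len(sizes),
        "record_fraction": n_singletons / len(cluster_ids),
    }
```

For cluster ids `[0, 0, 1, 2]`, two of the three groups are singletons, and they hold two of the four records, so the two fractions differ; it is worth fixing one convention before comparing partitions.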
### 4\.3 End\-task performance \(RQ1\)
[Table 2](https://arxiv.org/html/2606.25556#S4.T2) is the main result table. The 7B comparison is the cleanest completed setting: over three seeds, BiPACE$_{\text{Q}}$ raises aggregate val@max (binary val/success-rate, count-weighted by the validation set) from GiGPO’s 90.8 to 97.1±0.9, while saturating five of six ALFWorld task families at 100% per-subtask val@max across all seeds. All three BiPACE$_{\text{Q}}$ seeds individually reach the 95% threshold (at steps 115–135; [App. N](https://arxiv.org/html/2606.25556#A14)); no GiGPO seed does so within the same 150-step budget. The per-subtask cells are diagnostic slices: because each slice takes its own best checkpoint, the aggregate *All* column is the headline metric. The 1.5B result is a three-seed transfer check; the smaller backbone converges more slowly, so its row is reported at a later checkpoint within our training budget and marked with ⋆ (extended budget). Across three seeds, overall val@max is 93.5±1.2, with three of six task families saturated at 100% on every seed; per-seed values are reported in [App. N](https://arxiv.org/html/2606.25556#A14).
Table 2: Task success rate (%) on ALFWorld (valid-seen, 6 task families) and WebShop (*Score*/*Success*). *All* is the count-weighted overall val@max on the binary validation success metric. Reference rows are reproduced from Feng et al. ([2025](https://arxiv.org/html/2606.25556#bib.bib3)) and He et al. ([2026](https://arxiv.org/html/2606.25556#bib.bib4)). ⋆ extended training budget (200 steps vs. 150 for 7B); ‡ partial run (fewer completed seeds). Columns Pick through All are ALFWorld; – marks per-task values that are not reported.

| Type | Method | Pick | Look | Clean | Heat | Cool | Pick2 | All | WebShop Score | WebShop Succ. |
|---|---|---|---|---|---|---|---|---|---|---|
| *Closed-source prompting (no fine-tuning)* | | | | | | | | | | |
| Prompting | GPT-4o | 75.3 | 60.8 | 31.2 | 56.7 | 21.6 | 49.8 | 48.0 | 31.8 | 23.7 |
| Prompting | Gemini-2.5-Pro | 92.8 | 63.3 | 62.1 | 69.0 | 26.6 | 58.7 | 60.3 | 42.5 | 35.9 |
| *Qwen2.5-1.5B-Instruct* | | | | | | | | | | |
| Prompting | Qwen2.5 | 5.9 | 5.5 | 3.3 | 9.7 | 4.2 | 0.0 | 4.1 | 23.1 | 5.2 |
| Prompting | ReAct | 17.4 | 20.5 | 15.7 | 6.2 | 7.7 | 2.0 | 12.8 | 40.1 | 11.3 |
| Prompting | Reflexion | 35.3 | 22.2 | 21.7 | 13.6 | 19.4 | 3.7 | 21.8 | 55.8 | 21.9 |
| RL training | PPO (w/ critic) | 64.8±3.5 | 40.5±6.9 | 57.1±4.9 | 60.6±6.6 | 46.4±4.0 | 47.4±1.9 | 54.4±3.1 | 73.8±3.0 | 51.5±2.9 |
| RL training | GRPO | 85.3±1.5 | 53.7±8.0 | 84.5±6.8 | 78.2±7.9 | 59.7±5.0 | 53.5±5.6 | 75.2±3.8 | 75.8±3.5 | 56.8±3.8 |
| RL training | GiGPO | 94.4±5.9 | 67.5±4.6 | 94.8±3.8 | 94.4±7.8 | 79.8±4.7 | 76.4±5.4 | 86.7±1.7 | 83.5±1.8‡ | 67.4±4.5‡ |
| RL training | HGPO | – | – | – | – | – | – | 92.8±1.1 | 85.6±2.9 | 71.5±4.0 |
| RL training | BiPACE$_{\text{Q}}$ (ours)⋆ | **100.0** | **97.4**±3.8 | **100.0** | **100.0** | **96.5**±3.6 | **92.0**±7.9 | **93.5**±1.2 | **85.8**±1.1 | **71.9**±2.1 |
| *Qwen2.5-7B-Instruct* | | | | | | | | | | |
| Prompting | Qwen2.5 | 33.4 | 21.6 | 19.3 | 6.9 | 2.8 | 3.2 | 14.8 | 26.4 | 7.8 |
| Prompting | ReAct | 48.5 | 35.4 | 34.3 | 13.2 | 18.2 | 17.6 | 31.2 | 46.2 | 19.5 |
| Prompting | Reflexion | 62.0 | 41.6 | 44.9 | 30.9 | 36.3 | 23.8 | 42.7 | 58.1 | 28.8 |
| RL training | PPO (w/ critic) | 92.3±4.0 | 64.0±8.4 | 92.5±2.4 | 89.5±7.0 | 80.3±2.0 | 68.8±8.3 | 80.4±2.7 | 81.4±3.1 | 68.7±5.1 |
| RL training | GRPO | 90.8±5.1 | 66.1±6.7 | 89.3±5.4 | 74.7±6.9 | 72.5±5.4 | 64.7±7.3 | 77.6±5.2 | 79.3±2.8 | 66.1±3.7 |
| RL training | GiGPO | 97.7±1.6 | 82.7±7.9 | 98.8±1.6 | 83.7±7.2 | 89.3±8.2 | 79.2±6.6 | 90.8±1.3 | 86.2±2.6 | 75.2±3.8 |
| RL training | HGPO | – | – | – | – | – | – | 95.4±0.6 | 89.0±1.0 | 78.5±1.4 |
| RL training | BiPACE$_{\text{Q}}$ (ours) | **100.0** | **100.0** | **100.0** | **100.0** | **95.3**±3.7 | **100.0** | **97.1**±0.9 | **89.6**±1.3 | **79.7**±3.3 |
#### TextCraft transfer
TextCraft provides an out-of-domain transfer check with depth-stratified crafting goals (depth-2: short chains; depth-3: longer subplans requiring intermediate reuse; depth-4 omitted due to insufficient validation examples). We apply the same Q-style PACE recipe; the group sizes, training windows, and action-key choices are listed in [App. I](https://arxiv.org/html/2606.25556#A9).
Table 3: TextCraft validation success rate (%): peak val success within the stated window.

[Table 3](https://arxiv.org/html/2606.25556#S4.T3) gives a small out-of-domain check. Prompting alone does not solve the transfer setting (≤7% overall), and BiPACE$_{\text{Q}}$ is the strongest trained row at both scales. Its largest margins are on depth-3 goals (+7.8 pp over GiGPO on 1.5B, +12.4 pp on 7B), where intermediate states can lead to several action-conditioned futures. The HGPO rows land in the GiGPO band on overall success at both scales; BiPACE$_{\text{Q}}$ outperforms HGPO by +3.6 pp at 7B (91.1 vs. 87.5) and +5.7 pp at 1.5B (65.1 vs. 59.4).
#### Policy\-state interaction diagnostics
Figure 3: Step-record utilization on ALFWorld/7B and TextCraft/7B (*kept*: multi-member cluster; *wasted*: singleton). BiPACE yields 1.3× usable pairs on ALFWorld and 2.2× on TextCraft; means over diagnostic seeds, first 130/50 steps.

[Figure 3](https://arxiv.org/html/2606.25556#S4.F3) measures step-record utilization from paired training-log diagnostics (funnel diagnostic style of He et al., [2026](https://arxiv.org/html/2606.25556#bib.bib4)). On ALFWorld/7B, BiPACE wastes fewer records to singletons and clears roughly 1.3× as many matched pairs. On TextCraft/7B, where exact hashes are sparser, it keeps the singleton share near 20–25% and exposes roughly twice as many matched pairs. The learned actor representation therefore creates larger reusable state pools for PACE’s action-conditioned baseline.
### 4\.4 Sample efficiency \(RQ2\)
Sample efficiency is summarized in [Fig. 1](https://arxiv.org/html/2606.25556#S0.F1): the bottom row reports steps-to-threshold (lower is better) for one seed per method on all three benchmarks, and BiPACE$_{\text{Q}}$ reaches every threshold first. On ALFWorld/1.5B (the smaller backbone is sub-saturated at the 7B budget, so both methods are run for an extended budget; [Fig. 1](https://arxiv.org/html/2606.25556#S0.F1)a), BiPACE$_{\text{Q}}$ is 1.18–1.33× faster across the 50–80% band and is the only method to cross 90% within the budget. The same pattern holds on ALFWorld/7B: BiPACE$_{\text{Q}}$ is 1.05–1.25× faster across the 60–95% band and is the only method to cross 95% within the 150-step budget, at step 100 on the best seed; all three seeds cross within the budget. On WebShop/7B and TextCraft/1.5B ([Fig. 1](https://arxiv.org/html/2606.25556#S0.F1)b,c), the speedups reach 1.57× and 2.00×, and the top threshold in each panel is reached only by BiPACE$_{\text{Q}}$ within the budget.
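The steps-to-threshold metric and the speedup ratio $\text{steps}_{\text{GiGPO}}/\text{steps}_{\text{BiPACE}_{\text{Q}}}$ can be computed from validation curves as sketched below. The function names and the curve format (a list of `(step, success_percent)` pairs in increasing step order) are assumptions for illustration, and the curves in the usage note are made-up numbers, not results from the paper.

```python
def steps_to_threshold(curve, threshold):
    """First training step whose validation success reaches the threshold.

    curve is a list of (step, success_percent) pairs in increasing step order.
    Returns None when the threshold is never reached within the budget.
    """
    for step, success in curve:
        if success >= threshold:
            return step
    return None


def speedup(baseline_curve, method_curve, threshold):
    """Ratio baseline_steps / method_steps; None if either never gets there."""
    base = steps_to_threshold(baseline_curve, threshold)
    ours = steps_to_threshold(method_curve, threshold)
    if base is None or ours is None:
        return None  # corresponds to a threshold that is never reached
    return base / ours
```

With the illustrative curves `[(50, 40.0), (100, 60.0), (150, 80.0)]` for a baseline and `[(50, 60.0), (100, 85.0)]` for a method, the speedup at a 60% threshold is `2.0`, while at 85% the function returns `None` because the baseline never reaches that level.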
### 4\.5 Action\-side ablation: PACE \(RQ4\)
Figure 4: Matched\-seed per\-task PACE ablation on ALFWorld/7B \(validation peak, %\); radial axis truncated to 80–100\.

The state\-only variant isolates BiGPO’s partition; PACE tests whether action\-conditioning within each cluster adds further gain\. We compare three variants atop Actor\-Hidden clustering on ALFWorld/7B \([Fig\. 4](https://arxiv.org/html/2606.25556#S4.F4); full numeric table in [App\. L](https://arxiv.org/html/2606.25556#A12)\): first\-token \(diff\-peer mean, [Eq\. 4](https://arxiv.org/html/2606.25556#S3.E4), keyed on the first $N{=}8$ tokens\), action\-tag \(same estimator, <action\>\-body key\), and Q\-style \(action\-tag key with $\widehat{Q}(s,a)-\widehat{V}(s)$, [Eq\. 5](https://arxiv.org/html/2606.25556#S3.E5)\)\.
Any action\-conditioning on top of state\-only clustering helps \([Fig\. 4](https://arxiv.org/html/2606.25556#S4.F4); Look, Cool, and Pick2 are the informative margins, with most other families near ceiling\): first\-token PACE reaches $95.8{\pm}0.4\%$ across three seeds, confirming that BiGPO’s behavioral peers support action\-specific credit\. Swapping to the <action\>\-body key with the same diff\-peer estimator underperforms first\-token by 2\.8 pp \($93.0{\pm}1.1\%$; [App\. L](https://arxiv.org/html/2606.25556#A12)\), suggesting that the first ${\sim}8$ tokens already disambiguate the command on ALFWorld and that parsing the full action body adds fragility without benefit\. Q\-style is the strongest variant \($97.1{\pm}0.9\%$, three seeds\): keeping $\widehat{Q}$ and $\widehat{V}$ in non\-overlapping pools recovers a cleaner estimate of $A(s,a)=Q(s,a)-V(s)$ than diff\-peer’s mixed\-action baseline\. Replacing the Actor\-Hidden fingerprint with a policy\-agnostic character $n$\-gram hash, holding the Q\-style PACE estimator fixed, yields 95\.4%, which is 2\.4 pp above the state\-only level but 1\.7 pp below Actor\-Hidden \([Table 4](https://arxiv.org/html/2606.25556#S4.T4)\), confirming that the policy\-induced geometry contributes gain beyond what lexical coarsening alone provides\.
Table 4: Local checks on ALFWorld/7B with Q\-style PACE fixed \(embedder: three seeds; radius: seed 0\)\. The same recipe transfers to Qwen2\.5\-1\.5B \($93.5{\pm}1.2$ vs\. GiGPO’s $86.7{\pm}1.7$, $+6.8$ pp\)\.
The default clustering radius is also not brittle\. With Q\-style PACE fixed on ALFWorld/7B seed 0, validation success peaks at the $\varepsilon{=}0.10$ default but remains within a ${\sim}3$ pp band over $\{0.05, 0.10, 0.15, 0.20\}$ \([Table 4](https://arxiv.org/html/2606.25556#S4.T4)\)\. A row\-mix diagnostic at the default further confirms that the Q\-style estimator is not mostly falling back: 80\.2% of rows enter the PACE branch and multi\-member clusters average 2\.76 distinct action keys \([App\. L](https://arxiv.org/html/2606.25556#A12)\)\.

The action side carries most of the gain\. Once BiGPO supplies behaviorally comparable peers, the multi\-seed PACE variants improve over the GiGPO reimplementation by $+5.0$ pp \(first\-token\) and $+6.3$ pp \(Q\-style\) on the aggregate val/success\-rate metric; state\-only clustering alone contributes $+2.2$ pp\. The same pattern transfers to 1\.5B: BiPACE$_{\text{Q}}$ gains $+6.8$ pp over GiGPO on overall val@max \(3 seeds; [Table 2](https://arxiv.org/html/2606.25556#S4.T2)\), consistent with the 7B $+6.3$ pp gap\.
### 4\.6 Computational budget
The only additional measured work over the base GiGPO/GRPO loop is local to the step\-level estimator: an actor\-hidden forward pass to obtain policy\-state fingerprints, followed by lightweight PACE grouping and advantage estimation\.
[Figure 5](https://arxiv.org/html/2606.25556#S4.F5) shows that the per\-iteration budget is still dominated by the shared training loop\. Rollout generation alone takes 197\.25 s, and the actor update takes 88\.68 s, compared with a 361\.27 s pure training step\. The BiPACE\-specific components total 40\.70 s, or 11\.3% of a step\. Almost all of this measured addition is the optional actor\-hidden extraction forward \(40\.21 s\); the PACE grouping and action\-conditioned advantage estimation cost only 0\.49 s per iteration, 0\.14% of the step budget\. Thus the estimator changes which step records are compared for credit assignment while leaving the dominant rollout, probability, and policy\-update costs intact\. The extra forward pass adds computation but no extra environment interactions \(rollouts\); the two should not be conflated\. The number of records being grouped scales with the rollout budget as $O(GT)$ for group size $G$ and horizon $T$, so larger budgets increase the absolute grouping cost even though the measured constant here is small\.
Figure 5: Per\-iteration budget on ALFWorld/Qwen2\.5\-7B \(4$\times$H100\); blue bars are BiPACE\-specific\.
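The overhead shares quoted in this subsection follow directly from the reported timings\. The short check below recomputes them; the variable names are illustrative, and the only inputs are the per\-iteration wall\-clock numbers stated in the text\.

```python
# Per-iteration wall-clock timings in seconds, as reported in the text.
step_total = 361.27      # pure training step
hidden_forward = 40.21   # optional actor-hidden extraction forward pass
pace_grouping = 0.49     # PACE grouping + action-conditioned advantage estimation

# BiPACE-specific time is the sum of the two added components.
bipace_total = hidden_forward + pace_grouping

# Shares of a training step, in percent.
share_total = 100 * bipace_total / step_total
share_grouping = 100 * pace_grouping / step_total

print(f"BiPACE-specific: {bipace_total:.2f} s ({share_total:.1f}% of a step)")
print(f"Grouping only:   {pace_grouping:.2f} s ({share_grouping:.2f}% of a step)")
```

The split makes the cost structure explicit: the hidden\-state forward pass accounts for nearly all of the added time, while grouping and advantage estimation are close to free at this scale\.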
## 5\. Conclusion and Limitations
We have presented BiPACE, a drop\-in advantage estimator for agentic GRPO that addresses the state\-action credit mismatch in step\-level group\-based reinforcement learning\. Specifically, BiPACE introduces two local replacements to the GiGPO estimator: BiGPO, which clusters policy\-hidden\-state fingerprints as a bisimulation proxy to substantially reduce singleton groups, and PACE, which adds action\-conditioned peer baselines within each behavioral cluster to recover a local $Q{-}V$ advantage, without auxiliary models or extra rollouts\. Across three environments and two model scales, BiPACE$_{\text{Q}}$ consistently outperforms GRPO and GiGPO, and improves over HGPO on ALFWorld/7B \($+1.7$ pp\) and TextCraft; all three ALFWorld/7B seeds cross the 95% threshold within the training budget\.
Several directions remain open\. BiPACE is currently evaluated on text\-only environments with discrete action spaces, and extending to vision\-based or continuous\-action settings is a natural next step\. The cosine radius $\varepsilon$ is also fixed via a one\-time calibration scan; allowing it to adapt online as the policy evolves would sharpen the partition over the course of training\. Richer action representations for PACE beyond the action\-tag key \(such as full\-text embeddings\) could improve counterfactual contrast in environments with larger action spaces\. Finally, extending the bisimulation\-guided grouping to agents that compress history into a memory module \(where direct observation hashing is intractable\) is an interesting avenue for future work\.
## References
- A\. Ahmadian, C\. Cremer, M\. Gallé, M\. Fadaee, J\. Kreutzer, O\. Pietquin, A\. Üstün, and S\. Hooker \(2024\)Back to basics: revisiting REINFORCE\-style optimization for learning from human feedback in LLMs\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics,Cited by:[§1](https://arxiv.org/html/2606.25556#S1.p1.1)\.
- P\. S\. Castro, T\. Kastner, P\. Panangaden, and M\. Rowland \(2021\) MICo: improved representations via sampling\-based state similarity for Markov decision processes\. Advances in Neural Information Processing Systems 34, pp\. 30113–30126\. Cited by: [Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px2.p1.1), [§B\.2](https://arxiv.org/html/2606.25556#A2.SS2.p2.1), [§1](https://arxiv.org/html/2606.25556#S1.p4.2), [§2](https://arxiv.org/html/2606.25556#S2.SS0.SSS0.Px1.p1.1), [Remark 1](https://arxiv.org/html/2606.25556#Thmremark1.p1.3)\.
- L\. Feng, Z\. Xue, T\. Liu, and B\. An \(2025\)Group\-in\-group policy optimization for LLM agent training\.Advances in Neural Information Processing Systems38,pp\. 46375–46408\.Cited by:[Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px1.p1.1),[Appendix O](https://arxiv.org/html/2606.25556#A15.SS0.SSS0.Px1.p1.1),[§B\.1](https://arxiv.org/html/2606.25556#A2.SS1.p1.8),[§1](https://arxiv.org/html/2606.25556#S1.p1.1),[§2](https://arxiv.org/html/2606.25556#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.25556#S4.SS1.p1.2),[Table 2](https://arxiv.org/html/2606.25556#S4.T2),[Table 2](https://arxiv.org/html/2606.25556#S4.T2.6.3)\.
- N\. Ferns, P\. Panangaden, and D\. Precup \(2004\)Metrics for finite markov decision processes\.InProceedings of the 20th Conference on Uncertainty in Artificial Intelligence \(UAI\),pp\. 162–169\.Cited by:[Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.25556#S1.p2.1),[§2](https://arxiv.org/html/2606.25556#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Foerster, G\. Farquhar, T\. Afouras, N\. Nardelli, and S\. Whiteson \(2018\)Counterfactual multi\-agent policy gradients\.InProceedings of the AAAI conference on artificial intelligence,Vol\.32\.Cited by:[Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.25556#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Givan, T\. Dean, and M\. Greig \(2003\)Equivalence notions and model minimization in markov decision processes\.Artificial intelligence147\(1\-2\),pp\. 163–223\.Cited by:[§B\.2](https://arxiv.org/html/2606.25556#A2.SS2.p1.7),[§1](https://arxiv.org/html/2606.25556#S1.p2.1)\.
- S\. Gu, T\. Lillicrap, Z\. Ghahramani, R\. E\. Turner, and S\. Levine \(2017\)Q\-Prop: sample\-efficient policy gradient with an off\-policy critic\.InInternational Conference on Learning Representations,Cited by:[Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px3.p1.1)\.
- Y\. Han, K\. Li, Y\. Jiao, Y\. Dai, Y\. Fu, L\. Zhuo, and T\. Qian \(2026\)3SPO: state\-score\-supervised policy optimization for llm agents\.arXiv preprint arXiv:2606\.09961\.Cited by:[Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px1.p1.1)\.
- S\. He, L\. Feng, Q\. Wei, X\. Cheng, L\. Feng, and B\. An \(2026\)Hierarchy\-of\-groups policy optimization for long\-horizon agentic tasks\.arXiv preprint arXiv:2602\.22817\.Cited by:[Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px1.p1.1),[Appendix J](https://arxiv.org/html/2606.25556#A10.p1.1),[Appendix O](https://arxiv.org/html/2606.25556#A15.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.25556#S1.p1.1),[§2](https://arxiv.org/html/2606.25556#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.25556#S4.SS1.p1.2),[§4\.3](https://arxiv.org/html/2606.25556#S4.SS3.SSS0.Px2.p1.3),[Table 2](https://arxiv.org/html/2606.25556#S4.T2),[Table 2](https://arxiv.org/html/2606.25556#S4.T2.6.3)\.
- W\. Kool, H\. van Hoof, and M\. Welling \(2019\)Buy 4 REINFORCE samples, get a baseline for free\!\.InICLR Workshop on Deep RL Meets Structured Prediction,Cited by:[§1](https://arxiv.org/html/2606.25556#S1.p1.1)\.
- J\. Li, Y\. Wang, Q\. Yan, Y\. Tian, Z\. Xu, H\. Song, P\. Xu, and L\. L\. Cheong \(2026a\)SALT: step\-level advantage assignment for long\-horizon agents via trajectory graph\.InFindings of the Association for Computational Linguistics: EACL 2026,pp\. 4709–4725\.Cited by:[Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px1.p1.1)\.
- Z\. Li, W\. Tian, Y\. Ban, J\. Chen, H\. Zhang, Y\. Liu, and F\. Zhuang \(2026b\)Counterfactual credit policy optimization for multi\-agent collaboration\.arXiv preprint arXiv:2603\.21563\.Cited by:[Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.25556#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Liu, K\. Wang, Y\. Wu, F\. Huang, Y\. Li, J\. Zhang, and J\. Jiao \(2025\)Agentic reinforcement learning with implicit step rewards\.arXiv preprint arXiv:2509\.19199\.Cited by:[§2](https://arxiv.org/html/2606.25556#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Pan, S\. Liu, J\. Lin, D\. Zhu, J\. Zhang, S\. Dou, S\. Gao, Z\. Han, B\. Wang, R\. Zheng,et al\.\(2026\)EVPO: explained variance policy optimization for adaptive critic utilization in LLM post\-training\.arXiv preprint arXiv:2604\.19485\.Cited by:[Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px3.p1.1)\.
- A\. Prasad, A\. Koller, M\. Hartmann, P\. Clark, A\. Sabharwal, M\. Bansal, and T\. Khot \(2024\)ADaPT: as\-needed decomposition and planning with language models\.InFindings of the Association for Computational Linguistics: NAACL 2024,pp\. 4226–4252\.Cited by:[§4\.1](https://arxiv.org/html/2606.25556#S4.SS1.p1.2)\.
- M\. B\. Schrader \(2018\)Gym\-sokoban\.GitHub\.Note:[https://github\.com/mpSchrader/gym\-sokoban](https://github.com/mpSchrader/gym-sokoban)Cited by:[Appendix M](https://arxiv.org/html/2606.25556#A13.SS0.SSS0.Px1.p1.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§4\.1](https://arxiv.org/html/2606.25556#S4.SS1.p1.2)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.25556#S1.p1.1),[§2](https://arxiv.org/html/2606.25556#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.25556#S4.SS1.p1.2)\.
- M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht \(2020\)ALFWorld: aligning text and embodied environments for interactive learning\.arXiv preprint arXiv:2010\.03768\.Cited by:[§4\.1](https://arxiv.org/html/2606.25556#S4.SS1.p1.2)\.
- H\. Tan, X\. Yang, H\. Chen, J\. Shao, Y\. Wen, Y\. Shen, W\. Luo, X\. Du, L\. Guo, and Y\. Li \(2026\)Hindsight credit assignment for long\-horizon LLM agents\.arXiv preprint arXiv:2603\.08754\.Cited by:[§2](https://arxiv.org/html/2606.25556#S2.SS0.SSS0.Px1.p1.1)\.
- G\. Tucker, S\. Bhupatiraju, S\. Gu, R\. E\. Turner, Z\. Ghahramani, and S\. Levine \(2018\)The mirage of action\-dependent baselines in reinforcement learning\.InInternational Conference on Machine Learning,pp\. 5015–5024\.Cited by:[Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px3.p1.1)\.
- H\. Wang, L\. Wang, C\. Zhang, T\. Mao, S\. Qin, Q\. Lin, S\. Rajmohan, and D\. Zhang \(2025\)Text2Grad: reinforcement learning from natural language feedback\.arXiv preprint arXiv:2505\.22338\.Cited by:[§1](https://arxiv.org/html/2606.25556#S1.p1.1)\.
- Q\. Wei, S\. Zeng, C\. Li, W\. Brown, O\. Frunza, W\. Deng, A\. Schneider, Y\. Nevmyvaka, Y\. K\. Zhao, A\. Garcia,et al\.\(2025\)Reinforcing multi\-turn reasoning in LLM agents via turn\-level reward design\.arXiv preprint arXiv:2505\.11821\.Cited by:[§2](https://arxiv.org/html/2606.25556#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei,et al\.\(2024\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[§4\.1](https://arxiv.org/html/2606.25556#S4.SS1.p1.2)\.
- S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2022\)WebShop: towards scalable real\-world web interaction with grounded language agents\.Advances in Neural Information Processing Systems35,pp\. 20744–20757\.Cited by:[§4\.1](https://arxiv.org/html/2606.25556#S4.SS1.p1.2)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu,et al\.\(2025\)DAPO: an open\-source LLM reinforcement learning system at scale\.Advances in Neural Information Processing Systems38,pp\. 113222–113244\.Cited by:[Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.25556#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Zhang, R\. McAllister, R\. Calandra, Y\. Gal, and S\. Levine \(2020\)Learning invariant representations for reinforcement learning without reconstruction\.arXiv preprint arXiv:2006\.10742\.Cited by:[Appendix A](https://arxiv.org/html/2606.25556#A1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.25556#S2.SS0.SSS0.Px1.p1.1)\.
## Appendix A Extended Related Work
#### Group\-relative RL for LLM agents
GRPO \[Shao et al\., [2024](https://arxiv.org/html/2606.25556#bib.bib2)\] drops the critic by baselining against in\-group sampled returns\. GiGPO \[Feng et al\., [2025](https://arxiv.org/html/2606.25556#bib.bib3)\] adds a step\-level term keyed on exact observation hashes, and HGPO \[He et al\., [2026](https://arxiv.org/html/2606.25556#bib.bib4)\] augments that key with history length\. DAPO \[Yu et al\., [2025](https://arxiv.org/html/2606.25556#bib.bib10)\] and related work tune optimization or reward shaping, but still leave the state equivalence relation discrete, the part BiPACE replaces\. Recent step\-level alternatives such as SALT \[Li et al\., [2026a](https://arxiv.org/html/2606.25556#bib.bib29)\] and 3SPO \[Han et al\., [2026](https://arxiv.org/html/2606.25556#bib.bib30)\] are complementary: they modify the learning signal or supervision, whereas BiPACE changes the comparison set while reusing sparse returns\.
#### Bisimulation and representation metrics
On the state side, BiPACE borrows the value\-preserving bisimulation view: the metric of Ferns et al\. \[[2004](https://arxiv.org/html/2606.25556#bib.bib12)\], its sample\-based MICo approximation \[Castro et al\., [2021](https://arxiv.org/html/2606.25556#bib.bib13)\], and its use in shaping deep\-RL representations \[Zhang et al\., [2020](https://arxiv.org/html/2606.25556#bib.bib14)\]\. Unlike auxiliary representation\-learning approaches, BiGPO uses the actor’s own hidden states as the behavioral fingerprint, so the partition moves with the policy being optimized\.
#### Action\-conditioned baselines
On the action side, PACE recovers nonparametrically the action\-conditioned baseline that COMA \[Foerster et al\., [2018](https://arxiv.org/html/2606.25556#bib.bib15)\] and CCPO \[Li et al\., [2026b](https://arxiv.org/html/2606.25556#bib.bib8)\] learn with critics\. Classical action\-dependent baselines \[Gu et al\., [2017](https://arxiv.org/html/2606.25556#bib.bib25), Tucker et al\., [2018](https://arxiv.org/html/2606.25556#bib.bib26)\] motivate this direction but are generally biased without correction, so we treat PACE as an estimator design validated by ablation rather than an unbiased\-gradient claim\. EVPO \[Pan et al\., [2026](https://arxiv.org/html/2606.25556#bib.bib9)\] switches between critic and group\-mean baselines, a choice complementary to BiPACE’s control of the partition that defines the group mean\.
## Appendix B Background Details
### B\.1 Multi\-turn LLM agent decision process
We consider an episodic POMDP $\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma,T)$ in which an LLM policy $\pi_{\theta}(a_t\mid s_t)$ produces a textual action $a_t$ given an observation $s_t$, and the environment returns a next observation $s_{t+1}$, a \(typically sparse\) reward $r_t$, and a done flag\. Throughout, $s_t$ denotes the agent\-visible observation, which plays the role of state for estimation purposes\. We follow Feng et al\. \[[2025](https://arxiv.org/html/2606.25556#bib.bib3)\] in a *step\-independent* input formulation: each step’s prompt is constructed from the current observation and a \(possibly summarized\) history, enabling horizons of $50{+}$ steps\.
For each prompt $p$ we sample $G$ trajectories $\tau^{(1)},\ldots,\tau^{(G)}$ of \(possibly varying\) lengths $T^{(g)}$\. In the terminal\-only reward setting we focus on, the trajectory return is $R(\tau)=r_{T-1}$ \(e\.g\., $1$ on success and $0$ otherwise on ALFWorld\)\.
#### GRPO and GiGPO
The GRPO episode\-level advantage and the GiGPO step\-level advantage are defined in the main paper \([Eqs\. 1](https://arxiv.org/html/2606.25556#S3.E1) and [2](https://arxiv.org/html/2606.25556#S3.E2) and the surrounding discussion in [Sec\. 3](https://arxiv.org/html/2606.25556#S3)\)\. The two failure modes of exact observation hashing that motivate BiGPO \(singleton clusters and paraphrase splitting\) are analyzed in [Sec\. 3\.2](https://arxiv.org/html/2606.25556#S3.SS2)\.
### B\.2 Bisimulation and the MICo metric
A binary relation $E$ on states is a *bisimulation* if $s\,E\,s'$ implies $r(s,a)=r(s',a)$ and $\sum_{s_{+}\in[s_{+}]_{E}}P(s_{+}\mid s,a)=\sum_{s_{+}\in[s_{+}]_{E}}P(s_{+}\mid s',a)$ for all $a$ \[Givan et al\., [2003](https://arxiv.org/html/2606.25556#bib.bib11)\]\. Aggregating bisimilar states preserves $V^{\pi}$ \(and indeed $Q^{\pi}$ for every policy\); bisimulation is thus a sufficient, if conservative, equivalence for value\-preserving state aggregation\.
Castro et al\. \[[2021](https://arxiv.org/html/2606.25556#bib.bib13)\] introduce the *MICo distance* $d_{\pi}$, a tractable sample\-based approximation satisfying the value\-difference bound

$$\bigl|V^{\pi}(s)-V^{\pi}(s')\bigr|\;\leq\;d_{\pi}(s,s'). \tag{B.1}$$

BiGPO uses an empirical proxy for $d_{\pi}$ derived from the policy’s own hidden representation; [Proposition 1](https://arxiv.org/html/2606.25556#Thmproposition1) formalizes the resulting bias bound on the step\-level advantage estimator\.
## Appendix C Bias–Variance Analysis Details
We analyze the step\-level baseline estimator that BiGPO and GiGPO share, and isolate the role of the partition\. Throughout, fix a prompt\-group $p$ and let $\{(s^{(i)},a^{(i)},R^{(i)})\}_{i=1}^{N}$ denote its step records, with $R^{(i)}$ the discounted return\-to\-go used by the step estimator in [Sec\. 3\.1](https://arxiv.org/html/2606.25556#S3.SS1)\. Let $\mathcal{C}$ be a partition of $\{1,\dots,N\}$ defined by some equivalence relation $\sim$, and let

$$\hat{A}^{\sim}(i)\;=\;R^{(i)}-\frac{1}{|C(i)|}\sum_{j\in C(i)}R^{(j)},\quad C(i)\in\mathcal{C},\;i\in C(i), \tag{C.1}$$

denote the within\-cluster mean\-baseline estimator \(mean\-norm form; the std\-norm form admits a parallel argument\)\.
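As a concrete illustration of the within\-cluster mean\-baseline estimator in Eq\. C\.1, the NumPy sketch below subtracts each cluster's mean return from its members\. The function name, array layout, and toy returns are assumptions made for illustration and are not taken from the released code\.

```python
import numpy as np


def cluster_mean_advantage(returns: np.ndarray, cluster_ids: np.ndarray) -> np.ndarray:
    """Mean-norm estimator: A_hat(i) = R(i) - mean of R over the cluster containing i."""
    returns = np.asarray(returns, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    adv = np.empty_like(returns)
    for c in np.unique(cluster_ids):
        mask = cluster_ids == c
        # The baseline includes record i itself, exactly as in Eq. C.1.
        adv[mask] = returns[mask] - returns[mask].mean()
    return adv


# Toy step records (hypothetical): three records share cluster 0,
# while clusters 1 and 2 are singletons.
returns = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
cluster_ids = np.array([0, 0, 0, 1, 2])

adv = cluster_mean_advantage(returns, cluster_ids)

# Fraction of records that sit alone in their cluster.
sizes = np.array([np.sum(cluster_ids == c) for c in cluster_ids])
singleton_fraction = float(np.mean(sizes == 1))

print(adv)
print(singleton_fraction)
```

Because the baseline includes the record itself, any singleton cluster yields an advantage of exactly zero regardless of its return, so the partition alone determines how many records contribute step\-level signal\.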
### C\.1 Bias and the MICo bound
###### Proposition 1 \(BiGPO step\-baseline bias\)\.
Let $V^{\pi}(s):=\mathbb{E}[R\mid s,\pi]$ and assume $V^{\pi}$ is $L$\-Lipschitz in the MICo metric $d_{\pi}$\. Let $\sim_{\varepsilon}$ denote any partition of the step records whose clusters have $d_{\pi}$\-diameter at most $2\varepsilon$, the regime greedy clustering with admission threshold $\varepsilon$ \([Algorithm 1](https://arxiv.org/html/2606.25556#alg1)\) targets by bounding each member’s distance to the cluster centroid at admission\. Then for every step $i$,

$$\Big|\mathbb{E}\!\left[\hat{A}^{\sim_{\varepsilon}}(i)\;-\;A^{\star}(i)\right]\Big|\;\leq\;2L\varepsilon, \tag{C.2}$$

where $A^{\star}(i):=R^{(i)}-V^{\pi}(s^{(i)})$ is the ideal state\-conditional advantage\.
###### Proof sketch\.
The estimator’s bias is $\mathbb{E}[\hat{A}^{\sim_{\varepsilon}}(i)]-A^{\star}(i)=V^{\pi}(s^{(i)})-\frac{1}{|C(i)|}\sum_{j\in C(i)}V^{\pi}(s^{(j)})$\. By the diameter assumption, any two members of the same cluster satisfy $d_{\pi}(s^{(i)},s^{(j)})\leq 2\varepsilon$; by Lipschitz continuity each term $|V^{\pi}(s^{(i)})-V^{\pi}(s^{(j)})|$ is then bounded by $2L\varepsilon$, and so is the cluster average, giving the $2L\varepsilon$ bound\. Full proof in [App\. E](https://arxiv.org/html/2606.25556#A5)\. ∎
### C\.2 GiGPO as the degenerate $\varepsilon{=}0$ limit
###### Corollary 1 \(Singleton signal collapse\)\.
Setting $\varepsilon=0$ with the identity embedder \(i\.e\. GiGPO\) yields zero aggregation bias in [Eq\. C\.2](https://arxiv.org/html/2606.25556#A3.E2), but every singleton cluster produces a degenerate estimate:

$$\hat{A}^{\sim_{0}}(i)\;=\;0\quad\text{deterministically whenever }|C(i)|=1. \tag{C.3}$$

Hence a fraction $p_{1}$ of step records \(those landing in singleton clusters\) carries no step\-level gradient \(the episode\-level term $A^{\mathrm{ep}}$ is unaffected\), and the usable step\-level signal vanishes as $p_{1}\to 1$\.
###### Proof\.
A singleton cluster has $\hat{A}(i)=R^{(i)}-R^{(i)}=0$ deterministically, so the PPO surrogate gradient $\nabla_{\theta}\log\pi_{\theta}(a^{(i)}\mid s^{(i)})\,\hat{A}^{\sim_{0}}(i)$ is identically zero on every singleton, regardless of the realized return: the estimate carries no information about the action taken\. The discarded mass is exactly $p_{1}$\. See [App\. E](https://arxiv.org/html/2606.25556#A5) for the formal restatement\. ∎
### C\.3 Variance of the Q\-style estimator
The PACE Q\-style form $\hat{A}^{\text{q}}(i)=\widehat{Q}(s,a_{i})-\widehat{V}(s)$ in [Eq\. 5](https://arxiv.org/html/2606.25556#S3.E5) is unbiased for $A^{\pi}(s,a)$ under *exact* bisimulation \($\varepsilon{=}0$\): both terms are within\-class sample means with $\mathbb{E}[\widehat{Q}(s,a)]=Q^{\pi}(s,a)$ and $\mathbb{E}[\widehat{V}(s)]=V^{\pi}(s)$\. For $\varepsilon>0$, $\widehat{V}$ inherits the $O(\varepsilon)$ bias of [Proposition 1](https://arxiv.org/html/2606.25556#Thmproposition1); for $\widehat{Q}$ we additionally assume $Q^{\pi}(\cdot,a)$ is Lipschitz in $d_{\pi}$ for each action $a$ \(a strictly stronger requirement than the value\-difference bound of [Eq\. B\.1](https://arxiv.org/html/2606.25556#A2.E1)\), under which $\hat{A}^{\text{q}}$ is $O(\varepsilon)$\-biased for $A^{\pi}$\. Its variance decomposes as $\mathrm{Var}[\hat{A}^{\text{q}}]=\mathrm{Var}[\widehat{Q}]+\mathrm{Var}[\widehat{V}]-2\,\mathrm{Cov}[\widehat{Q},\widehat{V}]$, with $\widehat{V}$ pooling the larger set $|C|$ and therefore reducing the second term\. The estimator is well\-defined when $|C^{=}(i)|\geq 2$; otherwise we fall back to RLOO leave\-one\-out\. The diagnostics in [Sec\. 4\.5](https://arxiv.org/html/2606.25556#S4.SS5) report the empirical fraction of rows that enter each branch and confirm the same\-action pool is non\-degenerate on the benchmarks tested\.
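The branch structure described above can be sketched as follows\. This is a hedged reading rather than the paper's Eq\. 5, which is not reproduced in this appendix: it assumes $\widehat{Q}$ is the mean return over same\-cluster records sharing the action key, $\widehat{V}$ is the mean return over the whole cluster, the fallback is a leave\-one\-out baseline over the cluster, and singleton clusters receive zero step\-level advantage\. The function name and toy data are hypothetical\.

```python
import numpy as np


def q_style_advantage(returns, cluster_ids, action_keys):
    """Sketch of a Q-style advantage with a leave-one-out fallback (assumed pools)."""
    returns = np.asarray(returns, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    action_keys = np.asarray(action_keys)
    adv = np.zeros_like(returns)
    for c in np.unique(cluster_ids):
        in_cluster = cluster_ids == c
        n = int(in_cluster.sum())
        if n < 2:
            continue  # singleton cluster: no peers, advantage stays 0 (assumption)
        v_hat = returns[in_cluster].mean()  # V_hat pools the whole cluster
        for i in np.flatnonzero(in_cluster):
            same = in_cluster & (action_keys == action_keys[i])
            if same.sum() >= 2:
                # PACE branch: Q_hat from the same-action pool minus V_hat.
                adv[i] = returns[same].mean() - v_hat
            else:
                # Fallback branch: leave-one-out baseline over cluster peers.
                loo = (returns[in_cluster].sum() - returns[i]) / (n - 1)
                adv[i] = returns[i] - loo
    return adv


# Toy records (hypothetical): cluster 0 has four members, cluster 1 is a singleton.
returns = [1.0, 1.0, 0.0, 0.0, 1.0]
cluster_ids = [0, 0, 0, 0, 1]
action_keys = ["open", "open", "look", "go", "open"]

adv = q_style_advantage(returns, cluster_ids, action_keys)
print(adv)
```

In the toy data the two "open" records form a same\-action pool of size two and take the $\widehat{Q}-\widehat{V}$ branch, while "look" and "go" each appear once in the cluster and use the leave\-one\-out fallback\.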
### C\.4 Choosing $\varepsilon$
[Proposition 1](https://arxiv.org/html/2606.25556#Thmproposition1) suggests a clear principle: $\varepsilon$ should be small enough for the Lipschitz bias to be dominated by within\-trajectory return noise, but large enough to defeat the singleton tax of [Corollary 1](https://arxiv.org/html/2606.25556#Thmcorollary1)\. We provide an adaptive heuristic in [App\. F](https://arxiv.org/html/2606.25556#A6) that targets a median cluster size of 4–8 by binary search on $\varepsilon$ over the first training step; empirically a single static $\varepsilon=0.10$ works across all 7B environments tested, and on ALFWorld end\-task success is unimodal in $\varepsilon$ with a flat ${\sim}3$ pp plateau around the default \([Table 4](https://arxiv.org/html/2606.25556#S4.T4), Radius check\)\. Smaller backbones have different representation geometry, so we calibrate $(\ell,\varepsilon)$ once per backbone before training \([App\. H](https://arxiv.org/html/2606.25556#A8)\)\.
## Appendix D Greedy Clustering Procedure
[Algorithm 1](https://arxiv.org/html/2606.25556#alg1) gives the single\-pass greedy clustering used by BiGPO \([Eq\. 3](https://arxiv.org/html/2606.25556#S3.E3)\)\. The procedure runs once per prompt\-group in $O(N_{p}K_{p})$ time, where $N_{p}$ is the group’s step\-record count and $K_{p}$ is its cluster count\.
Algorithm 1 Greedy online cosine clustering used by BiGPO\.

1: unit vectors $\{x_{i}\}_{i=1}^{n}\subset\mathbb{R}^{D}$, threshold $\varepsilon\in[0,2]$

2: $\mathcal{K}\leftarrow[\,]$ $\triangleright$ cluster centroids

3: $\mathcal{M}\leftarrow[\,]$ $\triangleright$ cluster members

4: for $i=1,\ldots,n$ do

5:   if $\mathcal{K}=\emptyset$ then

6:     append $x_{i}$ to $\mathcal{K}$; append $\{i\}$ to $\mathcal{M}$

7:   else

8:     $k\leftarrow\arg\max_{j}\,x_{i}^{\top}\mathcal{K}_{j}$

9:     if $1-x_{i}^{\top}\mathcal{K}_{k}\leq\varepsilon$ then

10:       $\mathcal{M}_{k}\leftarrow\mathcal{M}_{k}\cup\{i\}$

11:       $\mathcal{K}_{k}\leftarrow\mathrm{normalize}\!\left(\mathcal{K}_{k}+\tfrac{1}{|\mathcal{M}_{k}|}(x_{i}-\mathcal{K}_{k})\right)$

12:     else

13:       append $x_{i}$ to $\mathcal{K}$; append $\{i\}$ to $\mathcal{M}$

14:     end if

15:   end if

16: end for

17: return $\mathcal{M}$

Records are processed in rollout\-major order \($g=1,\dots,G$, then $t=1,\dots,T^{(g)}$\); results are stable to within\-group shuffling\.
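The procedure of Algorithm 1 can be sketched in NumPy as follows\. This is a minimal illustration, not the released implementation: the function name and toy vectors are hypothetical, indices are 0\-based, and the guard against a zero\-norm centroid is an added safety check that the pseudocode does not state\.

```python
from typing import List

import numpy as np


def greedy_cosine_clusters(x: np.ndarray, eps: float) -> List[List[int]]:
    """Single-pass greedy clustering of unit vectors by cosine distance to centroids."""
    centroids: List[np.ndarray] = []
    members: List[List[int]] = []
    for i, xi in enumerate(x):
        if centroids:
            sims = np.array([float(xi @ c) for c in centroids])
            k = int(np.argmax(sims))  # nearest centroid by cosine similarity
            if 1.0 - sims[k] <= eps:
                members[k].append(i)
                # Running-mean centroid update followed by re-normalization.
                c = centroids[k] + (xi - centroids[k]) / len(members[k])
                norm = np.linalg.norm(c)
                if norm > 0:  # guard added in this sketch only
                    centroids[k] = c / norm
                continue
        # Empty centroid list, or no centroid within the radius: open a new cluster.
        centroids.append(np.array(xi, dtype=float))
        members.append([i])
    return members


# Toy 2-D unit vectors at angles 0, 5, 90, 92, and 180 degrees (hypothetical data).
angles = np.deg2rad([0.0, 5.0, 90.0, 92.0, 180.0])
x = np.stack([np.cos(angles), np.sin(angles)], axis=1)

clusters = greedy_cosine_clusters(x, eps=0.10)
exact = greedy_cosine_clusters(x, eps=0.0)
print(clusters)
print(exact)
```

With the radius set to zero every distinct vector opens its own cluster, which mirrors the degenerate exact\-match limit discussed in Appendix C\.2\.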
## Appendix E Proofs
### E\.1 Proof of [Proposition 1](https://arxiv.org/html/2606.25556#Thmproposition1)
###### Proof\.
Let $C:=C(i)$ and $\bar{V}_{C}:=\frac{1}{|C|}\sum_{j\in C}V^{\pi}(s^{(j)})$\. By the diameter assumption of [Proposition 1](https://arxiv.org/html/2606.25556#Thmproposition1), any two members $i,j\in C$ satisfy $d_{\pi}(s^{(i)},s^{(j)})\leq 2\varepsilon$\. \(Greedy online clustering, [Algorithm 1](https://arxiv.org/html/2606.25556#alg1), joins a point to a cluster only when its distance to the current centroid is $\leq\varepsilon$; we state the proposition for diameter\-bounded partitions so the guarantee is independent of subsequent centroid updates\.\) By Lipschitzness of $V^{\pi}$ in $d_{\pi}$, for every $j\in C$,

$$\bigl|V^{\pi}(s^{(i)})-V^{\pi}(s^{(j)})\bigr|\;\leq\;L\,d_{\pi}(s^{(i)},s^{(j)})\;\leq\;2L\,\varepsilon.$$

Averaging over $j\in C$ gives $|V^{\pi}(s^{(i)})-\bar{V}_{C}|\leq 2L\varepsilon$\. Substituting into the bias decomposition

$$\mathbb{E}\!\left[\hat{A}^{\sim_{\varepsilon}}(i)\right]-A^{\star}(i)\;=\;V^{\pi}(s^{(i)})-\bar{V}_{C}$$

yields $|\mathbb{E}[\hat{A}]-A^{\star}|\leq 2L\varepsilon$, which is the claim\. ∎
### E\.2 Proof of [Corollary 1](https://arxiv.org/html/2606.25556#Thmcorollary1)
###### Proof\.
Restricted to $\{i:|C(i)|=1\}$ we have $\hat{A}^{\sim_{0}}(i)=R^{(i)}-R^{(i)}=0$ deterministically\. The PPO surrogate gradient $\nabla_{\theta}\log\pi_{\theta}(a^{(i)}\mid s^{(i)})\,\hat{A}^{\sim_{0}}(i)$ is therefore identically zero on every singleton, regardless of the realized return; the singleton carries no learning signal\. Summing over step records, a fraction $p_{1}=\Pr(|C|=1)$ of the step\-level gradient is discarded, and the usable signal $\to 0$ as $p_{1}\to 1$\. ∎
## Appendix F Adaptive $\varepsilon$ Heuristic
We target a median cluster size $\tilde{m}\in[4,8]$ \(an empirically healthy range consistent with the singleton rates in [Table 1](https://arxiv.org/html/2606.25556#S4.T1)\) by binary\-searching $\varepsilon\in[0.02,0.40]$ on the first training step\. The heuristic converges in $\leq 6$ probe values per environment and produces $\varepsilon\in[0.07,0.13]$ across all 7B environments tested, suggesting the static default $\varepsilon=0.10$ is adequate at that scale\. Smaller backbones use the per\-backbone calibration described in [App\. H](https://arxiv.org/html/2606.25556#A8)\.
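The binary search described above might look like the following sketch\. Several details are assumptions: the median is taken over cluster sizes, a larger $\varepsilon$ is assumed to give a non\-decreasing median \(greedy clustering does not guarantee this\), and `cluster_sizes_at` is a hypothetical callable that would run the clustering on the first training step's records at a given radius\. The `toy_sizes` stand\-in exists only so the example runs without a model\.

```python
from typing import Callable, Sequence, Tuple

import numpy as np


def calibrate_eps(
    cluster_sizes_at: Callable[[float], Sequence[int]],
    lo: float = 0.02,
    hi: float = 0.40,
    target: Tuple[int, int] = (4, 8),
    max_probes: int = 6,
) -> float:
    """Binary-search eps so the median cluster size lands inside `target`."""
    eps = 0.5 * (lo + hi)
    for _ in range(max_probes):
        eps = 0.5 * (lo + hi)
        median = float(np.median(cluster_sizes_at(eps)))
        if median < target[0]:
            lo = eps  # clusters too small: widen the radius
        elif median > target[1]:
            hi = eps  # clusters too large: tighten the radius
        else:
            break     # median is inside the target range
    return eps


def toy_sizes(eps: float) -> Sequence[int]:
    """Hypothetical stand-in: cluster sizes that grow linearly with eps."""
    return [max(1, int(round(60 * eps)))] * 5


eps_star = calibrate_eps(toy_sizes)
print(eps_star, np.median(toy_sizes(eps_star)))
```

If the search exhausts its probes without landing in the target range, the sketch simply returns the last probed value; a real calibration would likely want to report that case explicitly\.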
## Appendix G Implementation Details
BiPACE attaches to the existing GiGPO step-advantage path through one driver-side grouping routine and one optional actor-worker hook. The driver routine consumes per-step embeddings and returns cluster UUIDs in the same shape as GiGPO’s hash-grouping path. Greedy clustering ([Algorithm 1](https://arxiv.org/html/2606.25556#alg1)) runs per prompt-group in $O(N_{p}K_{p})$ time ($N_{p}$ step records, $K_{p}$ clusters), that is, a handful of $D$-dimensional dot products per record; in practice this cost is dominated by the actor forward pass (measured grouping and advantage-estimation cost: $0.49$ s, $0.14\%$ of a training step; [Sec. 4.6](https://arxiv.org/html/2606.25556#S4.SS6)). When Actor-Hidden features are used, the worker hook runs a vanilla actor forward with hidden states enabled, extracts the last non-pad hidden state at the configured layer, $\ell_{2}$-normalizes it, and returns the tensors to the driver. The complete addition (grouping routine, worker hook, PACE estimator, and the $\varepsilon=0$ regression test) is under $800$ lines. The swap leaves the optimizer-side advantage scale unchanged: on ALFWorld/7B, the mean per-token advantage stays in the same $-0.03\pm 0.02$ band for GiGPO and BiPACE$_{\text{Q}}$ throughout training, so downstream PPO hyperparameters need no retuning.
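The feature-extraction step of the worker hook can be illustrated with NumPy. This is a sketch under stated assumptions: `hidden` stands for the already-computed hidden states at the configured layer, `attention_mask` marks real tokens with 1, and `last_token_fingerprint` is a hypothetical name rather than the released hook.

```python
import numpy as np


def last_token_fingerprint(hidden, attention_mask):
    """Return the l2-normalized hidden state of the last non-pad token per row.

    hidden:         [B, T, D] hidden states at the chosen layer
    attention_mask: [B, T] integer mask, 1 = real token, 0 = padding
    """
    seq_len = attention_mask.shape[1]
    # Index of the last 1 in each row; works for left- and right-padded batches.
    last = seq_len - 1 - np.argmax(attention_mask[:, ::-1], axis=1)
    feats = hidden[np.arange(hidden.shape[0]), last]
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.clip(norms, 1e-12, None)


hidden = np.arange(24, dtype=float).reshape(2, 4, 3) + 1.0
mask = np.array([[1, 1, 1, 0],   # right-padded row: last real token at index 2
                 [0, 1, 1, 1]])  # left-padded row: last real token at index 3
fingerprints = last_token_fingerprint(hidden, mask)
```

Because the returned rows have unit norm, the cosine distance between two records is one minus their dot product, which is why the driver-side grouping only needs dot products.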
#### Code naming note
The released code keeps the historical cacb_\* config prefix (e.g., cacb_enabled, cacb_estimator) for backward compatibility; the paper refers to this component as PACE throughout.
## Appendix H Scale Calibration for Qwen2.5-1.5B
Bisimulation hyperparameters are tied to representation geometry, so we calibrate $(\ell,\varepsilon)$ for Qwen2.5-1.5B a priori, before any training run, rather than reusing the 7B default $(\ell,\varepsilon)=(-8,0.10)$. The calibration is a lightweight forward-pass scan on a synthetic ALFWorld audit set ($40$ rollouts $\times$ $6$ task families) over $\ell\in\{4,8,12,16,20,24,27\}$, pooling $\in\{\text{last-token},\,\text{mean-prompt},\,\text{attn-weighted}\}$, and $\varepsilon\in\{0.05,0.10,0.15,0.20\}$. [Table H.1](https://arxiv.org/html/2606.25556#A8.T1) reports the slice used to select the 1.5B default.
The selection rule is automatic: pick the layer that maximizes linear-probe task-id accuracy subject to a non-degenerate singleton fraction (neither one giant cluster nor all singletons) and $n_{\text{clusters}}\geq 5$, so the partition can resolve the six ALFWorld task families. In [Table H.1](https://arxiv.org/html/2606.25556#A8.T1), this rules out Layer $27$ ($n_{\text{clusters}}=4$, probe acc $0.812$) in favor of Layer $16$ ($n_{\text{clusters}}=12$, probe acc $0.808$): four clusters merge multiple task types and defeat the purpose of behavioral partitioning, even though Layer $27$’s raw probe accuracy is marginally higher. The threshold $\varepsilon$ is then set by the median-cluster-size heuristic of [App. F](https://arxiv.org/html/2606.25556#A6). The whole procedure amounts to a single offline forward pass over a few hundred cached observations (no training, on the order of minutes), and the same released script applies unchanged to a new backbone (e.g., Llama-family models), so the per-backbone calibration is a one-time cost rather than a tuning loop.
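As an illustration of the rule, the sketch below filters scan rows and then maximizes probe accuracy. The probe accuracies and cluster counts for Layers 16 and 27 are the ones quoted above; the singleton fractions and the bounds used to define "non-degenerate" are placeholders chosen for the example, not values from Table H.1, and `select_layer` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class LayerScan:
    layer: int
    probe_acc: float
    n_clusters: int
    singleton_frac: float  # placeholder values below, not taken from Table H.1


def select_layer(scans, min_clusters=5, singleton_range=(0.05, 0.95)):
    """Pick the highest probe accuracy among layers that pass both constraints."""
    lo, hi = singleton_range  # assumed bounds for a non-degenerate partition
    feasible = [
        s for s in scans
        if s.n_clusters >= min_clusters and lo <= s.singleton_frac <= hi
    ]
    if not feasible:
        raise ValueError("no layer satisfies the partition constraints")
    return max(feasible, key=lambda s: s.probe_acc)


scans = [
    LayerScan(layer=16, probe_acc=0.808, n_clusters=12, singleton_frac=0.30),
    LayerScan(layer=27, probe_acc=0.812, n_clusters=4, singleton_frac=0.30),
]
best = select_layer(scans)
```

Layer 27 has the higher probe accuracy but fails the cluster-count constraint, so the rule returns Layer 16.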
Intuitively, the optimal layer is late-but-not-final: the last few transformer blocks specialize toward next-token logits and lose the coarser behavioral structure that clustering needs. This explains why layer $-8$ (7B) and layer $-12$ (1.5B), rather than the final layer, give the best probe-accuracy vs. singleton-rate trade-off.
Table H.1: Qwen2.5-1.5B forward-pass scan (*last-token* pool, $\varepsilon=0.05$). Layer $16$ (negative index $-12$ on the 28-layer backbone) gives the best singleton-vs-probe trade-off and is adopted as the default for all 1.5B runs.

Two practical observations follow from the scan. First, the 1.5B representation geometry is more compact than 7B’s: at $\varepsilon=0.10$ the scan yields a single coarse cluster across the tested layer-and-pool combinations, whereas $\varepsilon=0.05$ produces a well-resolved partition. Second, *last-token* is the most robust pooling strategy on this backbone: *mean-prompt* and *attn-weighted* yield at most two clusters on most slices. Both observations reinforce that bisimulation hyperparameters should be calibrated once per backbone before training.
## Appendix I Hyperparameters
[Table I.1](https://arxiv.org/html/2606.25556#A9.T1) records the script-level settings needed to reproduce the reported runs. [Table I.2](https://arxiv.org/html/2606.25556#A9.T2) lists the non-default trainer and optimizer overrides; all other values inherit from the verl-agent v0.1 defaults.
Table I.1: Script-level settings for the reported BiPACE runs. TP is the rollout tensor-parallel size; $G$ is the number of sampled trajectories per prompt group.

Table I.2: Non-default hyperparameters for BiPACE runs. Rows are grouped by config namespace; values marked “see [Table I.1](https://arxiv.org/html/2606.25556#A9.T1)” vary by environment.

| Group | Key | Value |
| --- | --- | --- |
| algorithm | adv_estimator | gigpo |
| algorithm | gamma | $0.95$ |
| algorithm.gigpo | bisim_grouping | True |
| algorithm.gigpo | bisim_embedder | actor_hidden |
| algorithm.gigpo | bisim_layer | see [Table I.1](https://arxiv.org/html/2606.25556#A9.T1) |
| algorithm.gigpo | bisim_eps | see [Table I.1](https://arxiv.org/html/2606.25556#A9.T1) |
| algorithm.gigpo | cacb_enabled | True |
| algorithm.gigpo | cacb_n_action_tokens | $8$ |
| algorithm.gigpo | cacb_action_key_mode | first_n / action_tag |
| algorithm.gigpo | cacb_estimator | counterfactual / q_style |
| algorithm.gigpo | step_advantage_w | $1.0$ |
| actor | kl_loss_coef | $0.01$ |
| actor | optim.lr | $5\times 10^{-7}$ |
| rollout | tensor_model_parallel_size | see [Table I.1](https://arxiv.org/html/2606.25556#A9.T1) |
| trainer | n_gpus_per_node | see [Table I.1](https://arxiv.org/html/2606.25556#A9.T1) |
## Appendix J Prompt Templates
Each environment wrapper fills a prompt contract online following He et al. \[[2026](https://arxiv.org/html/2606.25556#bib.bib4)\]. The first step omits history; all later steps include recent observation–action history. The executed command is the first well-formed body inside <action\>...</action\>; malformed responses receive the environment-specific invalid-action penalty. [Table J.1](https://arxiv.org/html/2606.25556#A10.T1) summarizes the dynamic fields and output contract per environment; full strings are released with the code.
Table J.1: Prompt contract per environment. Placeholders (braced tokens) are populated by the wrapper at runtime.

The verbatim templates below show the with-history form used at every step after the first. The first-step variant omits the {step_count}, {history_length}, and {action_history} fields.
ALFWorld Prompt Template

You are an expert agent operating in the ALFRED Embodied Environment. Your task is to: {task_description}
Prior to this step, you have already taken {step_count} step(s). Below are the most recent {history_length} observations and the corresponding actions you took: {action_history}
You are now at step {current_step} and your current observation is: {current_observation}
Your admissible actions of the current situation are: \[{admissible_actions}\].
Now it’s your turn to take an action.
You should first reason step-by-step about the current situation. This reasoning process MUST be enclosed within <think\> </think\> tags.
Once you’ve finished your reasoning, you should choose an admissible action for current step and present it within <action\> </action\> tags.
WebShop Prompt Template

You are an expert autonomous agent operating in the WebShop e-commerce environment.
Your task is to: {task_description}.
Prior to this step, you have already taken {step_count} step(s). Below are the most recent {history_length} observations and the corresponding actions you took: {action_history}
You are now at step {current_step} and your current observation is: {current_observation}.
Your admissible actions of the current situation are: \[{available_actions}\].
Now it’s your turn to take one action for the current step.
You should first reason step-by-step about the current situation, then think carefully which admissible action best advances the shopping goal. This reasoning process MUST be enclosed within <think\> </think\> tags.
Once you’ve finished your reasoning, you should choose an admissible action for current step and present it within <action\> </action\> tags.
TextCraft Prompt Template

You are an expert agent operating in the TextCraft interactive crafting environment, modeled after Minecraft. Your goal is to craft a target item by combining ingredients along a chain of recipes.
Task context (constant across the episode): {task_context}
Prior to this step, you have already taken {step_count} step(s). Below are the most recent {history_length} observations and the corresponding actions you took: {action_history}
You are now at step {current_step} and your current observation is: {current_observation}
Current inventory: {inventory}
You can issue exactly one action per step, using one of these three formats:

- ‘craft <out_count\> <out_item\> using <in_count_1\> <in_item_1\>, ...‘ -- apply a recipe (the inputs must already be in your inventory).
- ‘get <count\> <item\>‘ -- fetch a raw (non-craftable) base material directly from the environment.
- ‘inventory‘ -- print what you currently hold.

Items are spelled out in lower-case with spaces between words (e.g. ‘oak planks‘, not ‘minecraft:oak_planks‘).
Recipe matching rules (important -- follow all of these):

- Use ONLY the crafting commands shown in the recipe list; do not invent new recipes.
- The target item AND its count must match a recipe line EXACTLY. You cannot use a partial recipe.
- The ingredient counts must ALSO match the recipe EXACTLY.
- The target item must be a SPECIFIC item (e.g. ‘dark oak sign‘); a generic family name (e.g. just ‘planks‘) is not valid.
- If a recipe lists a GENERIC ingredient (e.g. ‘using 4 planks‘), substitute a specific variant from the same family (e.g. ‘using 4 oak planks‘).
- It is fine to produce more of the target than the goal requires; repeat the craft in a later step if one craft is not enough.

Example of one well-formed turn (format only):
Recipes available include: ‘craft 3 dark oak sign using 6 dark oak planks, 1 stick‘, ‘craft 1 fletching table using 4 planks, 2 flint‘.
Goal: craft fletching table.
Current inventory: 4 oak planks, 2 flint.
<think\>I need a fletching table. The recipe needs 4 planks (generic) and 2 flint. I have 4 oak planks and 2 flint, so I substitute ‘oak planks‘ for the generic ‘planks‘ and craft directly.</think\>
<action\>craft 1 fletching table using 4 oak planks, 2 flint</action\>
Now it’s your turn to take one action for the current step.
You should first reason step-by-step about the current situation, what intermediate items you already have, and which next recipe step brings you closest to the goal. This reasoning process MUST be enclosed within <think\> </think\> tags.
Once you’ve finished your reasoning, choose an action and present it within <action\> </action\> tags. Output exactly one action; do not chain multiple actions in one response.
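The output contract shared by the three templates (take the first well-formed body inside the action tags, otherwise treat the response as invalid) can be sketched as follows. This assumes that "well-formed" means a non-empty body after stripping whitespace; `extract_action` is a hypothetical helper, and the penalty itself is left to the caller because its value is environment-specific.

```python
import re

# Non-greedy match so that each <action>...</action> pair is captured separately.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def extract_action(response):
    """Return the first non-empty action body, or None if the response is malformed."""
    for match in ACTION_RE.finditer(response):
        body = match.group(1).strip()
        if body:
            return body
    return None


ok = extract_action("<think>The mug is on the desk.</think><action>go to desk 1</action>")
skipped_empty = extract_action("<action> </action><action>look</action>")
malformed = extract_action("I will go to the desk.")
```

A caller would execute the returned string when it is not `None` and apply the invalid-action penalty otherwise.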
## Appendix K Embedder Ablation: Policy-Induced vs. Lexical Geometry
BiGPO’s state-side fingerprint is the actor’s own hidden state. A natural concern is whether the gain is specific to this *policy-induced* representation, or whether *any* coarsening of the singleton-heavy observation-hash partition would suffice. We test this using the HashNgram control introduced in [Sec. 3.4](https://arxiv.org/html/2606.25556#S3.SS4): a zero-dependency character $n$-gram feature hash that clusters step records by lexical surface form. It is policy-agnostic by construction: the partition does not move as the actor is optimized.
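A character $n$-gram feature hash of the kind described above can be sketched in a few lines. The $n$-gram width, the number of buckets, and the use of CRC32 as the hash are assumptions made for this example (the text does not specify them), and `hash_ngram_embed` is a hypothetical name.

```python
import zlib

import numpy as np


def hash_ngram_embed(text, n=3, dim=256):
    """Hash character n-grams into a fixed-size count vector and l2-normalize it."""
    vec = np.zeros(dim)
    for i in range(max(len(text) - n + 1, 1)):
        gram = text[i:i + n]
        # CRC32 is deterministic across processes, unlike Python's salted hash().
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    return vec / np.linalg.norm(vec)


def cosine_distance(u, v):
    return 1.0 - float(u @ v)


a = hash_ngram_embed("You are in the kitchen. You see a fridge 1.")
b = hash_ngram_embed("You are in the kitchen. You see a fridge 2.")
c = hash_ngram_embed("Search results: page 1 of 5 [next >]")
```

Observations that differ by a single token land close together, while lexically unrelated observations land far apart. The embedding depends only on the text, which is what makes the control policy-agnostic.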
Everything except the state-side embedder is held fixed at the 7B Q-style recipe of [Table I.1](https://arxiv.org/html/2606.25556#A9.T1): the same PACE action_tag key, Q-style estimator, weight $w$, discount $\gamma$, learning rate, and training budget. Because cosine geometry differs across embedder backends, the threshold $\varepsilon$ is re-calibrated for HashNgram using the same median-cluster-size criterion as [App. F](https://arxiv.org/html/2606.25556#A6), yielding $\varepsilon=0.25$.
The result is reported as the Embedder check in [Table 4](https://arxiv.org/html/2606.25556#S4.T4) of the main paper: holding the action-side estimator fixed, swapping the policy-induced fingerprint for the lexical hash drops val@max from $97.1$ to $95.4$, a $1.7$ pp gap that persists even though both partitions defeat the singleton tax. HashNgram ($95.4$) exceeds the state-only baseline ($93.0$; [Table L.1](https://arxiv.org/html/2606.25556#A12.T1)) by $2.4$ pp, indicating that lexical coarsening provides some benefit beyond singleton reduction alone. The remaining $1.7$ pp gap to Actor-Hidden is attributable to the policy-induced geometry, which adapts as the actor is optimized and tracks behavioral equivalence more faithfully than a static lexical hash. Extending this comparison to a frozen sentence-encoder variant and to additional environments is left to future work.
## Appendix L PACE Diagnostics
The main text summarizes the PACE ablation qualitatively; [Table L.1](https://arxiv.org/html/2606.25556#A12.T1) records the full aggregate validation peaks for each estimator variant.
Table L.1: PACE estimator variants on ALFWorld/Qwen2.5-7B. *Peak* is the best binary aggregate validation success within the matched training window of [Table I.1](https://arxiv.org/html/2606.25556#A9.T1). Entries report mean $\pm$ std over $3$ seeds; the best variant per column is bold and the runner-up is underlined.

#### Row-mix statistics
We log per-step PACE statistics via the step_norm_reward_cacb hook in the released code. Three quantities are recorded at every training step: (i) the fraction of rows entering the PACE branch (vs. RLOO fallback or singleton), (ii) the fraction of multi-member clusters containing $\geq 2$ distinct action keys, and (iii) the mean number of unique action keys per cluster. [Table L.2](https://arxiv.org/html/2606.25556#A12.T2) reports a representative step near iteration $150$ on ALFWorld/Qwen2.5-7B ($180$ total clusters); the Q-style same-action pool remains non-degenerate throughout training. The action-tag parse rate (logged at compute_action_keys.last_action_match_rate) stays above $0.99$ across all reported runs.
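The three logged quantities can be illustrated on a toy batch. The sketch below rests on assumptions the text does not spell out: a row counts as entering the PACE branch when its cluster contains at least two distinct action keys, a multi-member cluster with a single action key counts as RLOO fallback, and quantity (iii) averages over all clusters including singletons. `row_mix_stats` is a hypothetical name and does not reproduce the released hook.

```python
from collections import defaultdict


def row_mix_stats(cluster_ids, action_keys):
    """Compute the three row-mix quantities for one batch of step records."""
    actions = defaultdict(list)
    for cid, key in zip(cluster_ids, action_keys):
        actions[cid].append(key)
    multi = [keys for keys in actions.values() if len(keys) >= 2]
    diverse = [keys for keys in multi if len(set(keys)) >= 2]
    return {
        # (i) rows in clusters with >= 2 distinct action keys (assumed PACE branch)
        "pace_row_frac": sum(len(keys) for keys in diverse) / len(cluster_ids),
        # (ii) multi-member clusters that contain >= 2 distinct action keys
        "diverse_cluster_frac": len(diverse) / len(multi) if multi else 0.0,
        # (iii) mean number of unique action keys per cluster
        "mean_unique_actions": sum(len(set(k)) for k in actions.values()) / len(actions),
    }


cluster_ids = [0, 0, 0, 1, 1, 2]
action_keys = ["go", "go", "take", "look", "look", "open"]
stats = row_mix_stats(cluster_ids, action_keys)
```

Here cluster 0 is the only one with two distinct action keys, cluster 1 would fall back to RLOO under the stated assumption, and cluster 2 is a singleton.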
Table L.2: Q-style row-mix statistics on ALFWorld/Qwen2.5-7B near iteration $150$ ($180$ total clusters). The same-action pool remains non-degenerate throughout training.
#### Policy\-state interaction diagnostics
[Table L.3](https://arxiv.org/html/2606.25556#A12.T3) expands the main-text diagnostic of [Fig. 3](https://arxiv.org/html/2606.25556#S4.F3). The diagnostic asks specifically whether the actor-hidden policy-state clustering creates larger non-singleton reuse pools than exact observation hashing under the same rollout budget (a question distinct from historical-context oracle grouping). On ALFWorld, BiPACE lowers the singleton cluster fraction by $9.3$ pp and increases mean group size by $1.6\times$. On TextCraft, where exact observation hashes are especially sparse, the effect is larger: singleton clusters fall by $27.9$ pp and mean group size triples. The matched-pair volume grows by $1.3\times$ on ALFWorld and $2.2\times$ on TextCraft, providing PACE with more peer records for its per-action baseline.
Table L.3: Policy-state reuse diagnostics from training logs. TextCraft uses the same 7B window as [Table 3](https://arxiv.org/html/2606.25556#S4.T3); ALFWorld uses the first $130$ steps so all three BiPACE diagnostic seeds are present. All entries are means over completed seeds. Lower singleton rate means more rows receive nonzero step-level signal; larger group size and pair volume indicate more reusable peers for the PACE estimator.
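The reuse diagnostics in Table L.3 can be computed from group labels alone. The sketch below assumes that pair volume counts unordered same-group pairs, $\sum_{C}\binom{|C|}{2}$, which the text does not define explicitly; the two toy partitions are invented for illustration and `reuse_diagnostics` is a hypothetical name.

```python
from collections import Counter


def reuse_diagnostics(group_ids):
    """Singleton cluster fraction, mean group size, and same-group pair count."""
    sizes = list(Counter(group_ids).values())
    return {
        "singleton_cluster_frac": sum(1 for m in sizes if m == 1) / len(sizes),
        "mean_group_size": sum(sizes) / len(sizes),
        "pair_volume": sum(m * (m - 1) // 2 for m in sizes),  # assumed definition
    }


# The same six step records under two partitions.
hash_groups = ["h0", "h0", "h1", "h2", "h3", "h4"]   # exact observation hashing
bisim_groups = ["c0", "c0", "c0", "c1", "c1", "c2"]  # coarser behavioral clusters

hash_diag = reuse_diagnostics(hash_groups)
bisim_diag = reuse_diagnostics(bisim_groups)
```

In this toy case the coarser partition lowers the singleton cluster fraction and raises both the mean group size and the number of peer pairs available to a per-action baseline.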
## Appendix M Failure-Mode Analysis
We catalog two regimes where BiGPO provides limited improvement.
#### \(F1\) Highly uniform observation distributions
When every state visually resembles every other state (e.g., a synthetic Sokoban \[Schrader, [2018](https://arxiv.org/html/2606.25556#bib.bib21)\] variant with nearly identical grid layouts), $\phi_{\theta}$ collapses to a single tight cluster. In this degenerate case BiGPO reduces to a batch-level baseline, matching GRPO without step-level factorization. The failure is detectable before training via the singleton-fraction diagnostic: a partition producing one giant cluster indicates that the hidden-state geometry does not resolve behavioral differences among the observed states.
#### \(F2\) Pre\-RL initialization
At training step $0$, the actor’s hidden state reflects the base LM’s pre-training biases rather than task-specific value geometry. Clusters at this point are therefore coarser than they will become during RL training, but this is not catastrophic: the adaptive $\varepsilon$ heuristic ([App. F](https://arxiv.org/html/2606.25556#A6)) is run on the first training step rather than at initialization, so the radius is calibrated once task-relevant geometry has begun to emerge.
## Appendix N Full Results Tables
This section collects compact per\-seed numbers; full per\-step learning curves and SwanLab logs are linked from the project page\.
#### Q\-style on Qwen2\.5\-7B \(ALFWorld\)
Val@max across three seeds: $97.7\%$, $96.1\%$, $97.7\%$ (peak steps $135$, $115$, $120$); mean $\pm$ std $=97.1\pm 0.9\%$. All three seeds reach $\geq 95\%$ on at least $3$ of $30$ validation checkpoints; the three first-token seeds reach it on at most $2$.
#### Q-style on Qwen2.5-1.5B (ALFWorld, $\varepsilon=0.05$, layer $-12$)
The smaller backbone converges more slowly than 7B, so val@max is taken over the full 200-step training window. Val/success-rate (binary aggregate, $|\mathcal{V}|=128$) across three seeds: $93.8\%$, $92.2\%$, $94.5\%$; mean $\pm$ std $=93.5\pm 1.2\%$ (reported as $93.5$ in the *All* column of [Table 2](https://arxiv.org/html/2606.25556#S4.T2)). Versus the cited GiGPO result of $86.7\pm 1.7$, the gap is $+6.8$ pp. Per-subtask val@max (3-seed mean with within-seed spread): pick $100.0$, look $97.4\pm 3.8$, clean $100.0$, heat $100.0$, cool $96.5\pm 3.6$, pick2 $92.0\pm 7.9$. Q-style also shortens episodes relative to GiGPO late in training (mean episode length $17.1$ vs. $21.0$ steps).
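The reported aggregates can be re-derived from per-seed success counts. The sketch below assumes a 128-episode validation set for both backbones (stated above only for the 1.5B runs) and uses counts inferred from the rounded per-seed percentages, so the counts are an assumption rather than logged values.

```python
from statistics import mean, stdev

N_VAL = 128  # |V| as stated for the 1.5B runs; assumed to be the same for 7B


def summarize(success_counts):
    """Mean and sample (n-1) standard deviation of per-seed success rates, in percent."""
    rates = [100.0 * k / N_VAL for k in success_counts]
    return mean(rates), stdev(rates)


# Counts inferred from the rounded percentages, e.g. 125/128 = 97.66% -> 97.7%.
mean_7b, std_7b = summarize([125, 123, 125])    # 97.7, 96.1, 97.7
mean_15b, std_15b = summarize([120, 118, 121])  # 93.8, 92.2, 94.5

gap_vs_gigpo = round(mean_15b, 1) - 86.7  # cited GiGPO result on Qwen2.5-1.5B
```

Under these assumptions the sample standard deviation reproduces the reported $\pm 0.9$ and $\pm 1.2$. Averaging the already-rounded 7B percentages instead would give $97.2$, so the reported $97.1$ is consistent with averaging unrounded rates.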
## Appendix O Reproducibility
A regression test in the released code certifies that setting bisim_grouping=True, embedder=identity, and $\varepsilon=0$ produces a partition equivalent to vanilla GiGPO bit-for-bit (up to UUID relabeling). Configs, random seeds, and SwanLab training logs accompanying the completed tables are linked from the project page.
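Equivalence up to relabeling means that the two label sequences induce the same grouping of records even when the identifiers differ. A check of that property could look like the sketch below; `same_partition` is a hypothetical helper and is not the released regression test.

```python
def same_partition(labels_a, labels_b):
    """True if the two labelings group records identically, up to renaming of labels."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # The label mapping must be consistent in both directions (a bijection).
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


relabeled = same_partition(["u1", "u1", "u2"], [7, 7, 3])  # same groups, new names
split = same_partition(["u1", "u1", "u2"], [7, 3, 3])      # record 1 changed group
merged = same_partition([1, 2], [5, 5])                    # two groups merged into one
```

Checking both directions matters: a one-way mapping would accept a labeling that merges two groups.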
#### Baseline provenance
GiGPO and HGPO numbers on ALFWorld and WebShop are cited directly from the respective papers (Feng et al. \[[2025](https://arxiv.org/html/2606.25556#bib.bib3)\], He et al. \[[2026](https://arxiv.org/html/2606.25556#bib.bib4)\]) and were not re-run under our codebase; the prompting rows (GPT-4o, Gemini-2.5-Pro, ReAct, Reflexion) are likewise from Feng et al. \[[2025](https://arxiv.org/html/2606.25556#bib.bib3)\]. All TextCraft rows (including the GRPO, GiGPO, and HGPO baselines) are run by us under the same codebase, seeds, and evaluation protocol as BiPACE. PPO-with-critic rows are likewise cited from Feng et al. \[[2025](https://arxiv.org/html/2606.25556#bib.bib3)\].