GraphPO: Graph-based Policy Optimization for Reasoning Models

arXiv cs.CL 06/18/26, 04:00 AM Papers
reinforcement-learning reasoning-models graph-based-methods policy-optimization llm rlvr exploration
Summary
GraphPO is a novel graph-based reinforcement learning framework that represents rollouts as a directed acyclic graph, merging semantically equivalent reasoning paths to reduce redundant exploration and improve credit assignment for large reasoning models.
arXiv:2606.18954v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using from final answers. This paradigm has two limitations. First, independently responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.
Original Article
View Cached Full Text
Cached at: 06/18/26, 05:46 AM
# GraphPO: Graph-based Policy Optimization for Reasoning Models
Source: [https://arxiv.org/html/2606.18954](https://arxiv.org/html/2606.18954)
Yuliang Zhan1,∗Xinyu Tang1,∗Jian Li1Dandan Zheng2Weilong Chai2 Jingdong Chen2Jun Zhou2Ge Wu2Wenyue Tang2Hao Sun1 1Gaoling School of Artificial Intelligence, Renmin University of China2Ant Group

###### Abstract

Reinforcement Learning with Verifiable Rewards \(RLVR\) has become a standard paradigm for enhancing the capability of large reasoning models\. RLVR typically samples responses independently and optimizes the policy using from final answers\. This paradigm has two limitations\. First, independently responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation\. Second, sparse final\-answer rewards make it hard to identify useful steps\. Tree\-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine\-grained signals\. However, tree branches are still expanded independently\. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration\. Moreover, tree\-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation\. To address this challenge, we proposeGraphPO\(Graph\-basedPolicyOptimization\), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes\. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration\. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome\. Theory shows that GraphPO reduces advantage\-estimation variance and enhances reasoning efficiency\. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain\- and tree\-based baselines with the same token budgets or response budgets\. The code will be released soon\.

11footnotetext:Equal contribution\.22footnotetext:Corresponding author\.## 1Introduction

Recently, Large Reasoning Models \(LRMs\) have achieved milestone progress in agentic, coding, and mathematical reasoning tasks\[[7](https://arxiv.org/html/2606.18954#bib.bib4),[31](https://arxiv.org/html/2606.18954#bib.bib11),[39](https://arxiv.org/html/2606.18954#bib.bib12)\]\. They typically adopt reinforcement learning with verifiable rewards \(RLVR\), where the correctness of the final answer is used as a binary reward to optimize the policy\[[1](https://arxiv.org/html/2606.18954#bib.bib6),[47](https://arxiv.org/html/2606.18954#bib.bib7),[45](https://arxiv.org/html/2606.18954#bib.bib5)\]\. Effective RL requires both accurate reward credit assignment for policy updates and diverse exploration during sampling\. However, current RLVR methods still struggle with both\. First, outcome rewards are inherently sparse, which hinders credit assignment to intermediate reasoning steps that contribute to the final answer\[[19](https://arxiv.org/html/2606.18954#bib.bib10),[27](https://arxiv.org/html/2606.18954#bib.bib13)\]\. Second, RLVR methods typically sample response independently, where different responses often contain repeated intermediate reasoning steps\. This leads to substantial sampling redundancy and limits exploration diversity\[[42](https://arxiv.org/html/2606.18954#bib.bib21),[13](https://arxiv.org/html/2606.18954#bib.bib14)\]\.

For accurate reward credit assignment, existing methods introduce step\-level signals\. Process Reward Models \(PRMs\)\[[28](https://arxiv.org/html/2606.18954#bib.bib15),[33](https://arxiv.org/html/2606.18954#bib.bib16),[46](https://arxiv.org/html/2606.18954#bib.bib46)\]reward intermediate reasoning steps, helping the policy identify steps that contribute to the final answer\. However, they usually require costly step\-level annotations and often generalize poorly across domains\. Furthermore, recent methods avoid explicit PRMs by estimating intermediate\-state values, such as search\-based estimation\[[2](https://arxiv.org/html/2606.18954#bib.bib17),[25](https://arxiv.org/html/2606.18954#bib.bib18)\]and extraction of process signals from outcome rewards\[[4](https://arxiv.org/html/2606.18954#bib.bib19),[5](https://arxiv.org/html/2606.18954#bib.bib29)\]\. Although these methods alleviate reward sparsity by providing denser supervision signal, they still rely on chain\-based rollouts and cannot capture similarity between independent rollouts\(Figure[1](https://arxiv.org/html/2606.18954#S1.F1)a\), which limits both credit assignment and exploration efficiency\. To address this challenge,*tree\-based RL methods*organize rollouts as trees\[[42](https://arxiv.org/html/2606.18954#bib.bib21),[13](https://arxiv.org/html/2606.18954#bib.bib14),[44](https://arxiv.org/html/2606.18954#bib.bib9),[16](https://arxiv.org/html/2606.18954#bib.bib20),[32](https://arxiv.org/html/2606.18954#bib.bib22),[12](https://arxiv.org/html/2606.18954#bib.bib23),[8](https://arxiv.org/html/2606.18954#bib.bib30)\], where shared prefixes avoid repeated reasoning and branching points provide fine\-grained supervision \(Figure[1](https://arxiv.org/html/2606.18954#S1.F1)b\)\.

Despite their benefits, tree\-structured rollouts still have three limitations\. First, deeper trees give uncertain intermediate states more chances to reach correct answers, but this optimization may encourage redundant reasoning steps and*reduce inference efficiency*\. Second, early\-stopped path provide limited training signals,*reducing exploitation efficiency*\. Third, tree only share early prefixes\. After branches diverge, they may still reach similar intermediate states and expand them independently, causing redundant exploration and*limiting exploration efficiency*, as analyzed in Section[3](https://arxiv.org/html/2606.18954#S3)\. These limitations arise from the same structural constraint: trees treat branches as independent paths\. Therefore, they cannot represent equivalent states reached by different paths, compare the efficiency of these paths, or reuse subsequent reasoning after these states\. Based on this observation, we model rollout as a directed acyclic graph, which explicitly aggregates equivalent reasoning states\.

![Refer to caption](https://arxiv.org/html/2606.18954v1/x1.png)Figure 1:Comparison of rollout strategies\.\(a\)Chain rollouts sample independent trajectories\.\(b\)Tree rollouts share prefixes\.\(c\)Graph rollouts merge semantically similar states\.Based on this insight, we proposeGraphPO\(Graph\-basedPolicyOptimization\), a novel RL framework that represents reasoning rollouts as an RL graph, where edges denote generated reasoning steps and nodes denote the intermediate semantic states summarized from the reasoning path from the initial prompt \(Figure[1](https://arxiv.org/html/2606.18954#S1.F1)c\)\. During graph\-based rollout, GraphPO incrementally constructs the RL graph by detecting semantically similar states across paths and virtually merging similar reasonings into equivalence classes\. These equivalence classes allow GraphPO to*improve inference*,*exploitation*, and*exploration efficiency*through path comparison, signal sharing, and redundancy reduction\. First, paths reaching the equivalence class solve the similar sub\-problem\. GraphPO therefore introduces a path\-level efficiency advantage that rewards shorter paths within each class, steering the learned policy toward more efficient reasoning\. Second, reasoning in the same equivalence class share the same semantic state, so their subsequent correctness samples can be shared\. GraphPO therefore improves exploitation efficiency by sharing subsequent correctness signals among similar reasonings, which produces denser step level rewards\. Even early\-stopped branches can receive credit from equivalent partners, improving rollout utilization and stabilizing policy updates\. Third, when a path reaches an already discovered equivalence class, further expansion is likely redundant\. GraphPO reduces its next\-layer budget and reallocates the saved computation to novel frontier states, encouraging broader exploration under the same token budget\. As analyzed in Section[3](https://arxiv.org/html/2606.18954#S3), this effectively improves exploration efficiency\.

∙\\bulletWe introduce GraphPO, a graph\-based RL framework that merges semantically similar reasoning to reduce redundant exploration, improve rollout utilization and get step\-level rewards\.

∙\\bulletWe propose a dual\-group graph advantage estimation method that improves reasoning efficiency by comparing incoming paths within each equivalence class and improves reasoning performance by comparing outgoing steps at each node\.

∙\\bulletExtensive experiments and theoretical analyses demonstrate that GraphPO achieves better performance with the same rollout budgets or response budgets\.

## 2Related Work

Reinforcement Learning with Sparse Supervision\.RLVR drives post\-training of large reasoning models with a binary outcome reward\[[7](https://arxiv.org/html/2606.18954#bib.bib4),[31](https://arxiv.org/html/2606.18954#bib.bib11),[45](https://arxiv.org/html/2606.18954#bib.bib5)\], but this signal is too sparse for critical reasoning step of credit assignment\[[19](https://arxiv.org/html/2606.18954#bib.bib10),[35](https://arxiv.org/html/2606.18954#bib.bib47)\]\. A common remedy is to densify supervision with process reward models\[[28](https://arxiv.org/html/2606.18954#bib.bib15),[33](https://arxiv.org/html/2606.18954#bib.bib16),[46](https://arxiv.org/html/2606.18954#bib.bib46),[48](https://arxiv.org/html/2606.18954#bib.bib39)\], but they require costly annotations and transfer poorly across domains\. Recent work reduces this dependence by deriving process signals from outcomes through value estimation\[[2](https://arxiv.org/html/2606.18954#bib.bib17),[25](https://arxiv.org/html/2606.18954#bib.bib18)\], implicit rewards\[[4](https://arxiv.org/html/2606.18954#bib.bib19),[22](https://arxiv.org/html/2606.18954#bib.bib49),[9](https://arxiv.org/html/2606.18954#bib.bib50)\], or segment\-level credit assignment\[[27](https://arxiv.org/html/2606.18954#bib.bib13),[5](https://arxiv.org/html/2606.18954#bib.bib29),[8](https://arxiv.org/html/2606.18954#bib.bib30)\]\. These methods estimate supervision from independent rollouts\. Thus, outcomes from semantically equivalent states cannot support each other\. GraphPO merges such states into equivalence classes, enabling suffix sharing that provides denser, lower\-variance samples for credit assignment\.

Tree Search for Reinforcement Learning\.Tree\-structured rollouts share prefixes across expansions, allowing them to generate more trajectories than conventional RLVR methods under the same token budget\. The branching points in these trajectories naturally provide step\-level comparisons\.\[[42](https://arxiv.org/html/2606.18954#bib.bib21),[44](https://arxiv.org/html/2606.18954#bib.bib9),[16](https://arxiv.org/html/2606.18954#bib.bib20),[12](https://arxiv.org/html/2606.18954#bib.bib23)\]\. Recent variants further improve efficiency by reusing common prefixes, scheduling branch expansion toward informative states, or extending frontier branches through lookahead\[[13](https://arxiv.org/html/2606.18954#bib.bib14),[32](https://arxiv.org/html/2606.18954#bib.bib22),[38](https://arxiv.org/html/2606.18954#bib.bib51),[30](https://arxiv.org/html/2606.18954#bib.bib56),[14](https://arxiv.org/html/2606.18954#bib.bib53)\]\. These methods share prefixes, but their tree topology still treats diverged branches independently, even when they later reach semantically equivalent states\. GraphPO instead merges such states into a DAG, enabling length comparison, signal sharing, and budget reallocation within each equivalence class to improve reward exploitation and exploration efficiency\.

## 3Empirical Study

In this section, we analyze reasoning redundancy and exploration efficiency of rollout strategies\.

Experimental Setup\.Following PROS\[[13](https://arxiv.org/html/2606.18954#bib.bib14)\], we sample 64 reasoning trajectories for each prompt\. To reduce expressive variation and better capture underlying semantics, we summarize intermediate reasoning states with Qwen2\.5\-7B\-Instruct\[[41](https://arxiv.org/html/2606.18954#bib.bib26)\]and measure summary similarity using SFR\-Embedding\-2\-R\[[23](https://arxiv.org/html/2606.18954#bib.bib25)\], following MIRB\[[18](https://arxiv.org/html/2606.18954#bib.bib24)\]\. We use MATH500\[[11](https://arxiv.org/html/2606.18954#bib.bib27)\]and compare independent chain\-based, tree\-based, and graph\-based sampling strategies, using Qwen2\.5\-7B\-Math\[[40](https://arxiv.org/html/2606.18954#bib.bib34)\]for sampling\.

![Refer to caption](https://arxiv.org/html/2606.18954v1/x2.png)Figure 2:Empirical study of semantic redundancy and exploration efficiency\.\(a\)Pairwise normalized edit similarity across prefix window sizes\.\(b\)Semantic similarity across prefix window sizes\.\(c\)Exploration efficiency across different sampling strategies\. For \(a\) and \(b\), wider regions in each violin indicate denser pairwise score distributions\.Chain and Tree Structured Sampling Contain Substantial Semantic Redundancy\.To show the redundancy produced by different sampling strategies, we extract rollout prefixes under different window sizes for each prompt\. For each prefix, we compute its semantic similarity and Rouge\-L\[[21](https://arxiv.org/html/2606.18954#bib.bib28)\]score with shorter prefixes from other trajectories\. Figures[2](https://arxiv.org/html/2606.18954#S3.F2)a and[2](https://arxiv.org/html/2606.18954#S3.F2)b show the distributions of pairwise semantic similarity and Rouge\-L scores under different prefix window sizes, where many prefixes are highly similar to prefixes from other paths \(similarity \> 0\.9\)\. This suggests that rollouts based on chains and trees both often revisit similar intermediate reasoning states\.

Specifically, Figure[2](https://arxiv.org/html/2606.18954#S3.F2)a shows that chain sampling produces highly similar shorter prefixes, indicating substantial redundancy in early reasoning steps\. Tree\-based sampling reduces this early redundancy by sharing prefixes, thus improving exploration diversity under the same token budget\. However, Figure[2](https://arxiv.org/html/2606.18954#S3.F2)b shows that tree prefixes still have high semantic similarity, with similarity scores close to chain sampling\. This suggests that the tree structure reduces part of the repeated computation by sharing ancestors, but it still independently reach and expand semantically similar reasoning, causing redundant exploration\. In particular, when two branches have reached semantically similar states, the tree structure still treats them as independent nodes and continues to expand them separately\.

These observations motivate a graph\-based rollout structure\. Instead of only sharing exact prefixes, Graph can connect semantically similar intermediate states across different branches\. In this way, later rollouts can reuse information from related states and adaptively reallocate rollout budgets, rather than expanding each similar state in isolation\. This graph structure reduces redundant exploration and provides a more fine\-grained basis for credit assignment across reasoning steps\.

Graph Structured Turns Semantic Redundancy into Exploration Efficiency\.To examine whether graph structured sampling converts redundancy into useful exploration, we measure exploration efficiency under different token budgets\. As the budget increases, chain sampling adds more independent rollouts, whereas tree and graph sampling increase the branching factor and depth\. For the currentnnrollouts, let𝒮n\\mathcal\{S\}\_\{n\}denote all reasoning steps, and let\|s\|\|s\|be the token cost of stepss\. For tree and graph rollouts, a step is correct if it can lead to a correct final answer\. For chain rollouts, each full trajectory is treated as one step\. We define exploration efficiency as:

Eff\(n\)=∑s∈𝒮n𝟏\[sis correct\]\|s\|∑s∈𝒮n\|s\|\.\\mathrm\{Eff\}\(n\)=\\frac\{\\sum\_\{s\\in\\mathcal\{S\}\_\{n\}\}\\mathbf\{1\}\[s\\ \\text\{is correct\}\]\\,\|s\|\}\{\\sum\_\{s\\in\\mathcal\{S\}\_\{n\}\}\|s\|\}\.\(1\)A higher value indicates that more computation is allocated to useful reasoning progress rather than redundant expansions\. Figure[2](https://arxiv.org/html/2606.18954#S3.F2)c shows that chain\-based sampling remains inefficient as token consumption grows, since independent chains repeatedly explore overlapping or unproductive states\. Tree\-based sampling improves efficiency at small budgets through prefix sharing, but its advantage decreases with larger budgets\. This suggests that later branches still expand semantically similar states independently\. In contrast, graph\-based sampling achieves substantially higher efficiency across token budgets\. Its efficiency rises quickly after the initial rollout stage and remains consistently above chain\- and tree\-based sampling\. This supports our motivation that detecting and merging semantically similar states redirects rollout budget from redundant reasoning to more useful exploration\.

## 4Method

![Refer to caption](https://arxiv.org/html/2606.18954v1/x3.png)Figure 3:Overview of GraphPO\.\(a\)Overall model architecture\.\(b\)Graph Construction via Step\-Level Expansion\.\(c\)The resulting RL graph shares downstream successors among merged states\.\(d\)Scoring Nodes and Deriving Graph Step Rewards\.\(e\)Dual\-Group Graph Advantage\.In this section, we describe Graph\-based Policy Optimization\. We first delineate the overall methodology, then detail graph construction, graph reward, graph advantage and policy optimization\.

### 4\.1Overview

Based on the above analysis, we propose GraphPO, a novel reinforcement learning framework with graph\-based sampling\. GraphPO incrementally constructs a directed acyclic graph𝒢=\(𝒱,ℰ\)\\mathcal\{G\}=\(\\mathcal\{V\},\\mathcal\{E\}\)step by step, with sampled reasoning steps as edges and nodes represent intermediate semantic states summarized from the reasoning path starting at initial prompt and ending at current step\. GraphPO then embeds these node states for equivalence detection\. GrphPO then compares node embeddings across paths and layers, and merges semantically similar states into equivalence classes, yielding a graph with both direct and shared edges \(Section[4\.2](https://arxiv.org/html/2606.18954#S4.SS2)\)\. Equivalence classes reduce redundant exploration by reallocating rollout budgets among semantically equivalent paths\. Based on this graph, equivalence classes also enable reuse of reasoning suffixes across similar states, converting sparse outcome rewards into dense step\-level edge rewards for credit assignment \(Section[4\.3](https://arxiv.org/html/2606.18954#S4.SS3)\)\. We further introduce a dual\-group graph advantage for credit assignment and efficiency learning \(Section[4\.4](https://arxiv.org/html/2606.18954#S4.SS4)\)\. The correctness group compares direct edges and shared steps from similar state, and the efficiency group compares paths that reach the same merged state\. An overview is depicted in Figure[3](https://arxiv.org/html/2606.18954#S4.F3)\. The algorithm is presented in Appendix[A](https://arxiv.org/html/2606.18954#A1)\.

### 4\.2Graph Construction

Tree rollouts can reduce repeated prefixes\. Howerer, The empirical study shows that when different branches reach similar reasoning states, they still explore independently and cause redundant exploration\. GraphPO therefore builds a rollout graph, where edges are generated reasoning steps and nodes are the semantic states summarized from the reasoning path starting at the initial prompt\.

As shown in Figure[3](https://arxiv.org/html/2606.18954#S4.F3)b, GraphPO initializes the prompt as the rootv0v\_\{0\}and expands the graph layer by layer\. Each stepu→v\{u\\to v\}is added as a direct edgee\(u,v\)e\(u,v\), whereu,v∈𝒱u,v\\in\\mathcal\{V\}\. Following Section[3](https://arxiv.org/html/2606.18954#S3), we use the policy model itself to summarize the entire edge path fromv0v\_\{0\}tovv, and nodevvstores the summarize as semantic statesvs\_\{v\}\. We embed each semantic statesvs\_\{v\}as𝐳v\\mathbf\{z\}\_\{v\}and use cosine similarity for equivalence detection\. To preserve acyclicity, GraphPO only compares node pairs that are not in an ancestor–descendant relation\. Two comparable nodesuuandvvare grouped into the same equivalence class whensim⁡\(u,v\)=𝐳u⊤𝐳v‖𝐳u‖‖𝐳v‖≥κ\\operatorname\{sim\}\(u,v\)=\\frac\{\\mathbf\{z\}\_\{u\}^\{\\top\}\\mathbf\{z\}\_\{v\}\}\{\\\|\\mathbf\{z\}\_\{u\}\\\|\\\|\\mathbf\{z\}\_\{v\}\\\|\}\\geq\\kappa\. Merged nodes form an equivalence class𝒬\(v\)\\mathcal\{Q\}\(v\)\. Appendix[G](https://arxiv.org/html/2606.18954#A7)theoretically proves that GraphPO remains robust under imperfect equivalence detection\. Equivalent nodes keep their own direct context, but share suffix from other members of the same class\. Let𝒞\(u\)\\mathcal\{C\}\(u\)be the direct suffix ofuu\. The graph suffix𝒞g\(u\)\\mathcal\{C\}\_\{g\}\(u\)is:

𝒞g\(u\)=𝒞\(u\)∪\{c∣q∈𝒬\(u\)∖\{u\},c∈𝒞\(q\)\}\.\\mathcal\{C\}\_\{g\}\(u\)=\\mathcal\{C\}\(u\)\\cup\\\{c\\mid q\\in\\mathcal\{Q\}\(u\)\\setminus\\\{u\\\},\\,c\\in\\mathcal\{C\}\(q\)\\\}\.\(2\)As shown in Figure[3](https://arxiv.org/html/2606.18954#S4.F3)c, when a path reaches an existing equivalence class, further expansion is likely redundant\. GraphPO halves its next\-layer budget while keeping at least one continuation and reallocates the saved budget to other paths\. Thus, graph construction reduces redundant exploration and improves exploration efficiency\. Appendix[F\.3](https://arxiv.org/html/2606.18954#A6.SS3)formalizes that this improves expected correct\-token efficiency and discovers more distinct semantic states than tree rollouts under the same budget\.

### 4\.3Graph Reward

To turn sparse outcome rewards into process supervision, GraphPO builds the graph reward that shares node score in the same equivalent class\. As shown in Figure[3](https://arxiv.org/html/2606.18954#S4.F3)d, it first estimates a state score for each node from verified answers, and then uses the value difference along each direct edge as the step reward\. For nodevv, its own score comes from terminal states reachable through direct edges\. If nodevvbelongs to an equivalence class, GraphPO pools the score of equivalent nodes because they represent similar reasoning states:

S\(v\)=cvown\+w∑q∈𝒬\(v\)∖\{v\}cqowntvown\+w∑q∈𝒬\(v\)∖\{v\}tqown,S\(v\)=\\frac\{c\_\{v\}^\{\\mathrm\{own\}\}\+w\\sum\_\{q\\in\\mathcal\{Q\}\(v\)\\setminus\\\{v\\\}\}c\_\{q\}^\{\\mathrm\{own\}\}\}\{t\_\{v\}^\{\\mathrm\{own\}\}\+w\\sum\_\{q\\in\\mathcal\{Q\}\(v\)\\setminus\\\{v\\\}\}t\_\{q\}^\{\\mathrm\{own\}\}\},\(3\)wherew∈\[0,1\]w\\in\[0,1\]is a pooling coefficient that controls the contribution of downstream samples from equivalent nodes,𝒬\(v\)\\mathcal\{Q\}\(v\)denotes the equivalence class containingvv,cqown=∑z∈𝒵\(q\)R\(z\)c\_\{q\}^\{\\mathrm\{own\}\}=\\sum\_\{z\\in\\mathcal\{Z\}\(q\)\}R\(z\), andtqown=\|𝒵\(q\)\|t\_\{q\}^\{\\mathrm\{own\}\}=\|\\mathcal\{Z\}\(q\)\|\. Here,R\(z\)∈\{0,1\}R\(z\)\\in\\\{0,1\\\}is the verifier reward and𝒵\(v\)\\mathcal\{Z\}\(v\)denotes the terminal states reachable fromvvthrough original direct edges\. When𝒬\(v\)=v\\mathcal\{Q\}\(v\)=\{v\}, Eq\.[3](https://arxiv.org/html/2606.18954#S4.E3)is the the score of a non\-merged node\. To obtain a step\-level reward, GraphPO computes the score gain on each direct edgee\(u,v\)e\(u,v\)and discounts repetitive steps by endpoint similarity:

rstep\(u,v\)=\(S\(v\)−S\(u\)\)\(1−η\(u,v\)\),r\_\{\\mathrm\{step\}\}\(u,v\)=\\bigl\(S\(v\)\-S\(u\)\\bigr\)\\bigl\(1\-\\eta\(u,v\)\\bigr\),\(4\)whereη\(u,v\)=max⁡\(0,sim⁡\(u,v\)\)\\eta\(u,v\)=\\max\(0,\\operatorname\{sim\}\(u,v\)\)\. A step receives positive credit when it improves downstream correctness and adds new semantic content\. Early\-stopped branches can also receive evidence from equivalent partners\. Appendix[F\.2](https://arxiv.org/html/2606.18954#A6.SS2)shows that pooling equivalent nodes yields a more accurate edge reward, and Appendix[F\.4](https://arxiv.org/html/2606.18954#A6.SS4)proves thatrstepr\_\{\\mathrm\{step\}\}matches the expected policy\-gradient direction of an oracle PRM\. Thus, GraphPO obtains step\-level process supervision from outcome rewards without annotated step labels\.

### 4\.4Dual\-Group Graph Advantage

To improve reasoning efficiency and provide fine\-grained credit assignment, we propose Dual\-Group Graph Advantage estimation, which compares reasoning paths reaching the same equivalence class to encourage efficient reasoning, and compares subsequent steps from the same state to assign step\-level credit \(Figure[3](https://arxiv.org/html/2606.18954#S4.F3)e\)\.

The step correctness group compares steps that*leave the same semantic state*\. For source nodeuu,ℰcor\(u\)=\{\(p\(v\),v\)∣v∈𝒞g\(u\)\}\\mathcal\{E\}\_\{\\mathrm\{cor\}\}\(u\)=\\\{\(p\(v\),v\)\\mid v\\in\\mathcal\{C\}\_\{g\}\(u\)\\\}, wherep\(v\)p\(v\)indicates the predecessor node ofvv\. For a shared nodevv, GraphPO computes the reward on its actual generated edge\(p\(v\),v\)\(p\(v\),v\), rather than on the virtual shared edge fromuutovv\. This keeps the reward tied to the real generation context\. For each edgee\(p\(v\),v\)∈ℰcor\(u\)e\(p\(v\),v\)\\in\\mathcal\{E\}\_\{\\mathrm\{cor\}\}\(u\), the*correctness advantage*is:

Acor\(e\)=rstep\(e\)−mean⁡\(\{rstep\(e^\)∣e^∈ℰcor\(u\)\}\)std⁡\(\{rstep\(e^\)∣e^∈ℰcor\(u\)\}\),A\_\{\\mathrm\{cor\}\}\(e\)=\\frac\{r\_\{\\mathrm\{step\}\}\(e\)\-\\operatorname\{mean\}\(\\\{r\_\{\\mathrm\{step\}\}\(\\hat\{e\}\)\\mid\\hat\{e\}\\in\\mathcal\{E\}\_\{\\mathrm\{cor\}\}\(u\)\\\}\)\}\{\\operatorname\{std\}\(\\\{r\_\{\\mathrm\{step\}\}\(\\hat\{e\}\)\\mid\\hat\{e\}\\in\\mathcal\{E\}\_\{\\mathrm\{cor\}\}\(u\)\\\}\)\},\(5\)
The efficiency group compares paths that*reach the same semantic state*\. For an equivalence classQQ, letaQa\_\{Q\}be the deepest shared predecessor,ℓv\\ell\_\{v\}is the number of direct edges fromaQa\_\{Q\}tovv, andS¯Q=∑v∈QS\(v\)/\|Q\|\\bar\{S\}\_\{Q\}=\\sum\_\{v\\in Q\}S\(v\)/\|Q\|\. The path\-level*efficiency advantage*is:

Aeff\(v\)=S¯Q−ℓv−mean⁡\(\{−ℓq∣q∈Q\}\)std⁡\(\{−ℓq∣q∈Q\}\),A\_\{\\mathrm\{eff\}\}\(v\)=\\bar\{S\}\_\{Q\}\\frac\{\-\\ell\_\{v\}\-\\operatorname\{mean\}\(\\\{\-\\ell\_\{q\}\\mid q\\in Q\\\}\)\}\{\\operatorname\{std\}\(\\\{\-\\ell\_\{q\}\\mid q\\in Q\\\}\)\},\(6\)and it is uniformly assigned to the edges on the path fromaQa\_\{Q\}tovv\. The gateS¯Q\\bar\{S\}\_\{Q\}keeps this signal active only when the shared state has correctness support, so GraphPO prefers shorter paths only when they reach a useful semantic state\. Pooling descendants from semantically equivalent paths into a single advantage group lowers the variance ofAcorA\_\{\\mathrm\{cor\}\}\(Appendix[F\.2](https://arxiv.org/html/2606.18954#A6.SS2)\)\. Appendix[F\.5](https://arxiv.org/html/2606.18954#A6.SS5)also proves that this increases probability toward the shortest correct path and reduces expected token cost\.

### 4\.5Policy Optimization

Finally, GraphPO trains the policy with a PPO\-style objective\. It combines correctness and efficiency into the dual\-group graph advantage:

Adual\(e\)=Acor\(e\)\+λeffAeff\(e\),A\_\{\\mathrm\{dual\}\}\(e\)=A\_\{\\mathrm\{cor\}\}\(e\)\+\\lambda\_\{\\mathrm\{eff\}\}A\_\{\\mathrm\{eff\}\}\(e\),\(7\)
whereλeff≥0\\lambda\_\{\\mathrm\{eff\}\}\\geq 0controls the efficiency signal, andAeff\(e\)A\_\{\\mathrm\{eff\}\}\(e\)denotes the path\-level efficiency advantage distributed onto edges along the path fromaQa\_\{Q\}tovv\. Our method builds on DAPO\[[45](https://arxiv.org/html/2606.18954#bib.bib5)\]with:

𝒥GraphPO\(θ\)=\\displaystyle\\mathcal\{J\}\_\{\\mathrm\{GraphPO\}\}\(\\theta\)=\{\}𝔼\(q,a\)∼𝒟𝒯∼πθold\(⋅∣q\)\[1\|ℰ\|∑e∈ℰ1\|e\|∑t=1\|e\|\(min\(re,t\(θ\)Adual\(e\),\\displaystyle\\mathbb\{E\}\_\{\\begin\{subarray\}\{c\}\(q,a\)\\sim\\mathcal\{D\}\\\\ \\mathcal\{T\}\\sim\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(\\cdot\\mid q\)\\end\{subarray\}\}\\Bigg\[\\frac\{1\}\{\{\\color\[rgb\]\{0\.70703125,0\.3515625,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.70703125,0\.3515625,0\}\|\\mathcal\{E\}\|\}\}\\sum\_\{\{\\color\[rgb\]\{0\.70703125,0\.3515625,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.70703125,0\.3515625,0\}e\\in\\mathcal\{E\}\}\}\\frac\{1\}\{\{\\color\[rgb\]\{0\.70703125,0\.3515625,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.70703125,0\.3515625,0\}\|e\|\}\}\\sum\_\{\{\\color\[rgb\]\{0\.70703125,0\.3515625,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.70703125,0\.3515625,0\}t=1\}\}^\{\{\\color\[rgb\]\{0\.70703125,0\.3515625,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.70703125,0\.3515625,0\}\|e\|\}\}\\bigg\(\\min\\Big\(r\_\{e,t\}\(\\theta\)\\,\{\\color\[rgb\]\{0\.70703125,0\.3515625,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.70703125,0\.3515625,0\}A\_\{\\mathrm\{dual\}\}\(e\)\},\(8\)clip\(re,t\(θ\),1−ε,1\+ε\)Adual\(e\)\)−βDKL\(πθ∥πref\)\)\],\\displaystyle\\qquad\\qquad\\operatorname\{clip\}\\big\(r\_\{e,t\}\(\\theta\),1\-\\varepsilon,1\+\\varepsilon\\big\)\\,\{\\color\[rgb\]\{0\.70703125,0\.3515625,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.70703125,0\.3515625,0\}A\_\{\\mathrm\{dual\}\}\(e\)\}\\Big\)\-\\beta\\,D\_\{\\mathrm\{KL\}\}\\\!\\left\(\\pi\_\{\\theta\}\\,\\\|\\,\\pi\_\{\\mathrm\{ref\}\}\\right\)\\bigg\)\\Bigg\],
wherere,t\(θ\)=πθ\(oe,t∣q,oe,<t\)πθold\(oe,t∣q,oe,<t\)r\_\{e,t\}\(\\theta\)=\\frac\{\\pi\_\{\\theta\}\(o\_\{e,t\}\\mid q,o\_\{e,<t\}\)\}\{\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(o\_\{e,t\}\\mid q,o\_\{e,<t\}\)\}is the token\-level importance ratio,ℰ\\mathcal\{E\}is the set of direct edges, and\|e\|\|e\|is the number of tokens on edgeee\. Each token on the same edge uses the same graph advantage, preserving the token\-level PPO form\.

## 5Experiment

In this section, we first describe the experimental setup, then we present the main results, and finally we conduct detailed analysis and ablation study\.

### 5\.1Experimental Setup

Datasets\.To evaluate reasoning ability, we conduct experiments on three mathematical reasoning benchmarks, including AIME24, AIME25, and MATH500\[[10](https://arxiv.org/html/2606.18954#bib.bib31)\]\. We also evaluate the models on two additional reasoning benchmarks, GPQA\[[26](https://arxiv.org/html/2606.18954#bib.bib32)\]and LiveCodeBench\[[15](https://arxiv.org/html/2606.18954#bib.bib33)\]\. To evaluate the effectiveness in LLM agentic RL, we test model on deep search tasks including General AI Assistant\[[24](https://arxiv.org/html/2606.18954#bib.bib40)\], WebWalker\[[37](https://arxiv.org/html/2606.18954#bib.bib41)\], BrowseComp\[[34](https://arxiv.org/html/2606.18954#bib.bib43)\]and XBench\[[3](https://arxiv.org/html/2606.18954#bib.bib42)\]\.

Baselines\.We conduct experiments on four LLMs, including Qwen2\.5\-7B\-Instruct\[[41](https://arxiv.org/html/2606.18954#bib.bib26)\], Qwen2\.5\-7B\-Math\[[40](https://arxiv.org/html/2606.18954#bib.bib34)\], Qwen3\-8B\-Base\[[39](https://arxiv.org/html/2606.18954#bib.bib12)\], and Deepseek\-R1\-Distill\-Qwen\-7B\[[7](https://arxiv.org/html/2606.18954#bib.bib4)\]\. For comparison, we evaluate GraphPO against standard outcome\-reward RLVR methods, including GRPO\[[7](https://arxiv.org/html/2606.18954#bib.bib4)\]and DAPO\[[45](https://arxiv.org/html/2606.18954#bib.bib5)\]\. We also compare with tree\-structured reasoning methods, including TreeRL\[[12](https://arxiv.org/html/2606.18954#bib.bib23)\], SPO\[[8](https://arxiv.org/html/2606.18954#bib.bib30)\], TREE\-GRPO\[[16](https://arxiv.org/html/2606.18954#bib.bib20)\], PROS\[[13](https://arxiv.org/html/2606.18954#bib.bib14)\], and TreePO\[[44](https://arxiv.org/html/2606.18954#bib.bib9)\]\. Detailed descriptions of these baselines are provided in Appendix[B](https://arxiv.org/html/2606.18954#A2)\. For GraphPO, we set the merging thresholdκ\\kappato 0\.92 and set pooling coefficientwwto 0\.7\. Train detail can be found in Appendix[C](https://arxiv.org/html/2606.18954#A3)\.

### 5\.2Main Results

Table 1:Performance comparison of different methods on various reasoning benchmarks\. We report Accuracy \(%\) for each benchmark\. The best performance for each benchmark is highlighted inbold\.ModelMethodAIME24AIME25MATH500GPQALiveCodeBenchAverageQwen2\.5\-7B\-MathBase13\.66\.354\.828\.55\.721\.8GRPO23\.315\.579\.531\.811\.432\.3DAPO25\.717\.282\.632\.312\.634\.1TreeRL26\.118\.984\.233\.612\.535\.1SPO28\.520\.487\.535\.513\.637\.1TREE\-GRPO27\.220\.386\.634\.813\.436\.5PROS29\.121\.088\.136\.613\.537\.7TreePO26\.819\.085\.534\.013\.035\.7GraphPO32\.124\.591\.639\.616\.940\.9Qwen3\-8B\-BaseBase29\.320\.879\.739\.122\.938\.4GRPO30\.321\.380\.043\.427\.240\.4DAPO32\.323\.882\.744\.027\.642\.1TreeRL32\.325\.183\.444\.428\.142\.7SPO34\.127\.186\.446\.530\.845\.0TREE\-GRPO33\.226\.785\.545\.930\.444\.3PROS34\.327\.586\.947\.631\.145\.5TreePO33\.125\.484\.745\.029\.443\.5GraphPO38\.230\.990\.250\.634\.548\.9Deepseek\-R1\-Distill\-Qwen\-7BBase51\.738\.491\.046\.136\.852\.8GRPO57\.247\.493\.546\.240\.757\.0DAPO59\.048\.693\.348\.441\.458\.1TreeRL60\.749\.591\.249\.242\.558\.6SPO62\.650\.792\.850\.544\.760\.3TREE\-GRPO61\.750\.592\.349\.944\.159\.7PROS63\.251\.493\.051\.544\.860\.8TreePO61\.649\.691\.749\.743\.559\.2GraphPO66\.254\.796\.354\.548\.364\.0

![Refer to caption](https://arxiv.org/html/2606.18954v1/x4.png)Figure 4:Training dynamics of Qwen2\.5\-7B\-Math\.\(a\)Accuracy\.\(b\)Entropy\.\(c\)Response length\.GraphPO Outperforms the Baseline Under the Same Token Budget\.Table[1](https://arxiv.org/html/2606.18954#S5.T1)reports model performance under the same token budget\. Figure[4](https://arxiv.org/html/2606.18954#S5.F4)a shows the change in model performance during training\. These results show that structure\-based methods consistently outperform sparse outcome\-reward baselines, confirming the effectiveness of outcome\-derived process signals for reasoning\. This improvement also comes from the fact that tree based and graph based methods can explore more reasoning paths under a same token budget, thereby collecting more trajectories\. We can also see that GraphPO outperforms tree\-based methods across all models and datasets\. This indicates that, compared with the tree structure, the graph structure can use process signals more effectively to improve reasoning performance\. This improvement mainly comes from better rollout utilization and lower\-variance advantage estimation, achieved by sharing suffixes across equivalent nodes\.

GraphPO Preserves Strong Exploration\.Figure[4](https://arxiv.org/html/2606.18954#S5.F4)b shows the change of entropy during training\. We observe that tree\-based methods show the fastest entropy decay, chain\-based methods are in the middle, and GraphPO decays the slowest\. This is beacuse tree compare many local suffix from shared prefixes, which quickly amplifies high\-reward branches and drives the policy toward a small set of reasoning patterns\. In contrast, GraphPO pools downstream evidence across semantically equivalent states, allowing different paths to share support instead of competing independently\. This reduces noisy over\-amplification and preserves more semantically valid reasoning paths\.

Table 2:Performance on deep search tasks for Qwen2\.5\-7B\. The best results are indicated inbold\.MethodGeneral AI AssistantWebWalkerQABrowseCompXBenchOverallLv\.1Lv\.2Lv\.3Avg\.EasyMed\.HardAvg\.Avg\.Avg\.ReAct6\.43\.41\.14\.27\.79\.45\.57\.81\.49\.35\.7\+ GRPO17\.214\.24\.314\.18\.711\.510\.710\.62\.310\.59\.4\+ DAPO17\.915\.14\.614\.99\.511\.711\.411\.32\.410\.79\.8\+ Tree\-GRPO18\.917\.35\.515\.710\.712\.311\.611\.72\.511\.410\.3\+ GraphPO20\.118\.56\.317\.711\.513\.712\.312\.73\.713\.111\.8

GraphPO Is Effective on Agent Tasks\.Agent tasks are naturally sequential decision making processes\. They can be naturally decomposed into multiple complete decision steps\. Therefore, tree and graph structures are well suited for reinforcement learning in agent tasks\. To evaluate the effectiveness of GraphPO in agent scenarios, we use ReAct\[[43](https://arxiv.org/html/2606.18954#bib.bib35)\]as the agent framework\. The results are shown in Table[2](https://arxiv.org/html/2606.18954#S5.T2)\. GraphPO outperforms the baselines across all datasets, demonstrating its effectiveness in agent scenarios\.

### 5\.3Detailed Analysis

GraphPO Improves Reasoning Efficiency by Reducing Redundant Paths\.Figure[4](https://arxiv.org/html/2606.18954#S5.F4)c shows the response length during training\. Tree\-GRPO shows a non\-monotonic trend, where response length first decreases and then increases\. The initial decrease suggests that step\-level comparisons reduce unnecessary exploration\. The later increase occurs because deeper tree expansion creates more branching opportunities, which raises the probability of finding correct answers but also introduces redundant intermediate reasoning\. Since the tree cannot detect or merge such redundancy, it cannot optimize repeated reasoning step, reducing inference efficiency\. In contrast, GraphPO produces the shortest responses by merging semantically equivalent states and using an efficiency advantage to assign higher credit to shorter paths that reach the same state\.

GraphPO Outperforms PRM Methods Without Annotated Process Data\.In the previous section, we have shown that the policy\-gradient direction of GraphPO is close to that of PRM methods\. In this section, we further compare GraphPO with PRM methods\. Specifically, we follow ReasonFlux\-PRM\[[48](https://arxiv.org/html/2606.18954#bib.bib39)\]and conduct experiments based on Qwen2\.5\-7B\-Instruct\. The results are shown in Figure[5](https://arxiv.org/html/2606.18954#S5.F5)a\. We can see that GraphPO still outperforms PRM methods without using additional annotated data\. This shows that deriving process supervision from the graph structure is effective\.

GraphPO Outperforms the Baseline Under the Same Trajectory Budget\.Under the same token budget, Tree and Graph structures can generate more responses than independent sampling by sharing prefixes\. We further compare different methods under the same number of trajectories, where each method generates 64 rollouts per prompt\. As shown in Figure[5](https://arxiv.org/html/2606.18954#S5.F5)b, GraphPO still outperforms the baselines, suggesting that its advantage comes not only from producing more trajectories, but also from deriving more effective process supervision\. GraphPO also substantially reduces training time under the same trajectory count, indicating that merging equivalent nodes improves rollout utilization, reduces redundant exploration, and enhances training efficiency\.

GraphPO Maintains Efficient Trajectory Generation\.We further compare the trajectory generation efficiency\. All experiments are conducted on H20 GPUs without parallelization\. As shown in Figure[5](https://arxiv.org/html/2606.18954#S5.F5)c, GraphPO achieves trajectory generation efficiency comparable to tree\-based methods, even though it uses an embedding model during rollout\. This efficiency mainly comes from merging equivalent nodes, which reduces redundant exploration and encourages shorter reasoning paths\.

![Refer to caption](https://arxiv.org/html/2606.18954v1/x5.png)Figure 5:\(a\)PRM methods*vs\.*GraphPO\.\(b\)Training Performance and Training Time\.\(c\)Trajectory generation efficiency\. The bubble size represents the number of trajectories generated per second\.
### 5\.4Ablation Study

![Refer to caption](https://arxiv.org/html/2606.18954v1/x6.png)Figure 6:Performance under Different Merging Thresholds\.Moderate Merging Thresholds Yield the Best GraphPO Performance\.The merging thresholdκ\\kappacontrols how strictly nodes are merged\. In this section, we conduct ablation studies to analyze the impact of the merging threshold\. Figure[6](https://arxiv.org/html/2606.18954#S5.F6)shows the performance under different merging thresholds\. We can see that the performance of GraphPO first increases and then decreases as the merging threshold becomes larger\. This is because a lower merging threshold leads to excessive merging\. It suppresses reasonable exploration and thus reduces performance\. In contrast, a higher merging threshold leads to too few merges\. It prevents semantically equivalent reasoning paths from sharing their reasoning suffixes\. When the merging threshold is 1, graph structure degenerates into a tree structure, leading to degraded performance\.

## 6Conclusion

In this work, we propose GraphPO, a graph\-based reinforcement learning framework for improving the capability of large reasoning models under verifiable outcome\. Motivated by the semantic redundancy observed in chain\- and tree\-based rollouts, GraphPO represents rollouts as directed acyclic graph, where generated reasoning steps are edges and path\-defined semantic states are nodes\. By detecting and virtually merging semantically similar intermediate states, GraphPO reuses suffixes, reduces redundant expansions, and converts sparse outcome rewards into dense step\-level process supervision\. We further introduce a dual\-group graph advantage in which the correctness group compares shared outgoing steps, while the efficiency group compares different incoming paths that reach the same semantic state, encouraging the policy to favor shorter paths toward useful reasoning states\. Theoretical analysis shows that next\-step sharing reduces advantage\-estimation variance and the efficiency advantage favors shorter correct reasoning\. Experiments across three large reasoning models, standard reasoning benchmarks, and agentic search tasks demonstrate that GraphPO consistently outperforms outcome\-only and tree\-based RL baselines under matched budgets, while maintaining efficient trajectory generation and improving response efficiency\. These results suggest that graph\-structured rollouts provide a practical way to obtain self\-emergent process supervision from outcome rewards alone\.

## References

- \[1\]\(2026\)Troll: trust regions improve reinforcement learning for large language models\.The Fourteenth International Conference on Learning Representations\.Cited by:[§1](https://arxiv.org/html/2606.18954#S1.p1.1)\.
- \[2\]G\. Chen, M\. Liao, C\. Li, and K\. Fan\(2024\)Alphamath almost zero: process supervision without process\.Advances in Neural Information Processing Systems37,pp\. 27689–27724\.Cited by:[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p1.1)\.
- \[3\]K\. Chen, Y\. Ren, Y\. Liu, X\. Hu, H\. Tian, T\. Xie, F\. Liu, H\. Zhang, H\. Liu, Y\. Gong,et al\.\(2025\)Xbench: tracking agents productivity scaling with profession\-aligned real\-world evaluations\.arXiv preprint arXiv:2506\.13651\.Cited by:[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p1.1)\.
- \[4\]G\. Cui, L\. Yuan, Z\. Wang, H\. Wang, Y\. Zhang, J\. Chen, W\. Li, B\. He, Y\. Fan, T\. Yu,et al\.\(2026\)Process reinforcement through implicit rewards\.The Fourteenth International Conference on Learning Representations\.Cited by:[§F\.1](https://arxiv.org/html/2606.18954#A6.SS1.p2.1),[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p1.1)\.
- \[5\]L\. Feng, Z\. Xue, T\. Liu, and B\. An\(2025\)Group\-in\-group policy optimization for llm agent training\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p1.1)\.
- \[6\]J\. Gao, W\. Fu, M\. Xie, S\. Xu, C\. He, Z\. Mei, B\. Zhu, and Y\. Wu\(2025\)Beyond ten turns: unlocking long\-horizon agentic search with large\-scale asynchronous rl\.arXiv preprint arXiv:2508\.07976\.Cited by:[Appendix C](https://arxiv.org/html/2606.18954#A3.p2.1)\.
- \[7\]D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi,et al\.\(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 633–638\.Cited by:[Appendix B](https://arxiv.org/html/2606.18954#A2.p1.1.1),[Appendix C](https://arxiv.org/html/2606.18954#A3.p5.1),[§1](https://arxiv.org/html/2606.18954#S1.p1.1),[§2](https://arxiv.org/html/2606.18954#S2.p1.1),[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p2.2)\.
- \[8\]Y\. Guo, L\. Xu, J\. Liu, Y\. Dan, and S\. Qiu\(2025\)Segment policy optimization: effective segment\-level credit assignment in rl for large language models\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[Appendix B](https://arxiv.org/html/2606.18954#A2.p4.1.1),[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p1.1),[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p2.2)\.
- \[9\]P\. Han, A\. Krishnan, G\. Friedland, J\. You, and L\. Kong\(2026\)Self\-aligned reward: towards effective and efficient reasoners\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.18954#S2.p1.1)\.
- \[10\]D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt\(2021\)Measuring mathematical problem solving with the math dataset\.arXiv preprint arXiv:2103\.03874\.Cited by:[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p1.1)\.
- \[11\]D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt\(2021\)Measuring mathematical problem solving with the math dataset\.NeurIPS\.Cited by:[§3](https://arxiv.org/html/2606.18954#S3.p2.1)\.
- \[12\]Z\. Hou, Z\. Hu, Y\. Li, R\. Lu, J\. Tang, and Y\. Dong\(2025\)Treerl: llm reinforcement learning with on\-policy tree search\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 12355–12369\.Cited by:[Appendix B](https://arxiv.org/html/2606.18954#A2.p3.1.1),[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p2.1),[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p2.2)\.
- \[13\]B\. Huang and X\. Wan\(2026\)PROS: towards compute\-efficient rlvr via rollout prefix reuse\.InThe Fourteenth International Conference on Learning Representations,Cited by:[Appendix B](https://arxiv.org/html/2606.18954#A2.p6.1.1),[§1](https://arxiv.org/html/2606.18954#S1.p1.1),[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p2.1),[§3](https://arxiv.org/html/2606.18954#S3.p2.1),[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p2.2)\.
- \[14\]B\. Huang, T\. Nguyen, and M\. Zimmer\(2025\)Tree\-opo: off\-policy monte carlo tree\-guided advantage optimization for multistep reasoning\.InMATH\-AI: The 5th Workshop on Mathematical Reasoning and AI at NeurIPS,Cited by:[§2](https://arxiv.org/html/2606.18954#S2.p2.1)\.
- \[15\]N\. Jain, K\. Han, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica\(2025\)LiveCodeBench: holistic and contamination free evaluation of large language models for code\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p1.1)\.
- \[16\]Y\. Ji, Z\. Ma, Y\. Wang, G\. Chen, X\. Chu, and L\. Wu\(2026\)Tree search for llm agent reinforcement learning\.The Fourteenth International Conference on Learning Representations\.Cited by:[Appendix B](https://arxiv.org/html/2606.18954#A2.p5.1.1),[Appendix C](https://arxiv.org/html/2606.18954#A3.p2.1),[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p2.1),[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p2.2)\.
- \[17\]B\. Jin, H\. Zeng, Z\. Yue, J\. Yoon, S\. Arik, D\. Wang, H\. Zamani, and J\. Han\(2025\)Search\-r1: training llms to reason and leverage search engines with reinforcement learning\.arXiv preprint arXiv:2503\.09516\.Cited by:[Appendix C](https://arxiv.org/html/2606.18954#A3.p1.1)\.
- \[18\]H\. Ju and B\. Dong\(2025\)MIRB: mathematical information retrieval benchmark\.In2nd AI for Math Workshop@ ICML 2025,Cited by:[§3](https://arxiv.org/html/2606.18954#S3.p2.1)\.
- \[19\]A\. Kazemnejad, M\. Aghajohari, E\. Portelance, A\. Sordoni, S\. Reddy, A\. Courville, and N\. L\. Roux\(2025\)Vineppo: refining credit assignment in rl training of llms\.Forty\-Second International Conference on Machine Learning\.Cited by:[§F\.1](https://arxiv.org/html/2606.18954#A6.SS1.p2.1),[§1](https://arxiv.org/html/2606.18954#S1.p1.1),[§2](https://arxiv.org/html/2606.18954#S2.p1.1)\.
- \[20\]W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica\(2023\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles,Cited by:[Appendix C](https://arxiv.org/html/2606.18954#A3.p1.1)\.
- \[21\]C\. Lin\(2004\)Rouge: a package for automatic evaluation of summaries\.InText summarization branches out,pp\. 74–81\.Cited by:[§3](https://arxiv.org/html/2606.18954#S3.p3.1)\.
- \[22\]X\. Liu, T\. Liang, Z\. He, J\. Xu, W\. Wang, P\. He, Z\. Tu, H\. Mi, and D\. Yu\(2025\)Trust, but verify: a self\-verification approach to reinforcement learning with verifiable rewards\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§2](https://arxiv.org/html/2606.18954#S2.p1.1)\.
- \[23\]R\. Meng, Y\. Liu, S\. R\. Joty, C\. Xiong, Y\. Zhou, and S\. Yavuz\(2024\)Sfr\-embedding\-2: advanced text embedding with multi\-stage training, 2024\.URL https://huggingface\. co/Salesforce/SFR\-Embedding\-2\_R\.Cited by:[§3](https://arxiv.org/html/2606.18954#S3.p2.1)\.
- \[24\]G\. Mialon, C\. Fourrier, T\. Wolf, Y\. LeCun, and T\. Scialom\(2024\)GAIA: a benchmark for general ai assistants\.InThe Twelfth International Conference on Learning Representations,Cited by:[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p1.1)\.
- \[25\]Z\. Qi, M\. Mingyuan, J\. Xu, L\. L\. Zhang, F\. Yang, and M\. Yang\(2025\)Mutual reasoning makes smaller llms stronger problem\-solver\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p1.1)\.
- \[26\]D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman\(2023\)Gpqa: a graduate\-level google\-proof q&a benchmark\.arXiv preprint arXiv:2311\.12022\.Cited by:[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p1.1)\.
- \[27\]A\. Setlur, C\. Nagpal, A\. Fisch, X\. Geng, J\. Eisenstein, R\. Agarwal, A\. Agarwal, J\. Berant, and A\. Kumar\(2025\)Rewarding progress: scaling automated process verifiers for llm reasoning\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§F\.1](https://arxiv.org/html/2606.18954#A6.SS1.p2.1),[§1](https://arxiv.org/html/2606.18954#S1.p1.1),[§2](https://arxiv.org/html/2606.18954#S2.p1.1)\.
- \[28\]C\. Shen, Z\. H\. Wong, R\. He, H\. Liang, M\. Qiang, Z\. Meng, Z\. Zhao, B\. Zeng, Z\. Zhu, B\. Cui,et al\.\(2026\)Let’s verify math questions step by step\.InProceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 1,pp\. 2770–2781\.Cited by:[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p1.1)\.
- \[29\]G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu\(2024\)HybridFlow: a flexible and efficient rlhf framework\.arXiv preprint arXiv: 2409\.19256\.Cited by:[Appendix C](https://arxiv.org/html/2606.18954#A3.p1.1)\.
- \[30\]Z\. Shi, M\. Fang, and L\. Chen\(2025\)Monte carlo planning with large language model for text\-based game agents\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.18954#S2.p2.1)\.
- \[31\]K\. Team, Y\. Bai, Y\. Bao, Y\. Charles, C\. Chen, G\. Chen, H\. Chen, H\. Chen, J\. Chen, N\. Chen,et al\.\(2025\)Kimi k2: open agentic intelligence\.arXiv preprint arXiv:2507\.20534\.Cited by:[§1](https://arxiv.org/html/2606.18954#S1.p1.1),[§2](https://arxiv.org/html/2606.18954#S2.p1.1)\.
- \[32\]H\. Wang, Z\. Hao, J\. Luo, C\. Wei, Y\. Shu, L\. Liu, Q\. Lin, H\. Dong, and J\. Chen\(2026\)Scheduling your llm reinforcement learning with reasoning trees\.The Fourteenth International Conference on Learning Representations\.Cited by:[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p2.1)\.
- \[33\]P\. Wang, L\. Li, Z\. Shao, R\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui\(2024\)Math\-shepherd: verify and reinforce llms step\-by\-step without human annotations\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 9426–9439\.Cited by:[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p1.1)\.
- \[34\]J\. Wei, Z\. Sun, S\. Papay, S\. McKinney, J\. Han, I\. Fulford, H\. W\. Chung, A\. T\. Passos, W\. Fedus, and A\. Glaese\(2025\)Browsecomp: a simple yet challenging benchmark for browsing agents\.arXiv preprint arXiv:2504\.12516\.Cited by:[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p1.1)\.
- \[35\]X\. Wen, Z\. Liu, S\. Zheng, S\. Ye, Z\. Wu, Y\. Wang, Z\. Xu, X\. Liang, J\. Li, Z\. Miao, J\. Bian, and M\. Yang\(2026\)Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.18954#S2.p1.1)\.
- \[36\]J\. Wu, B\. Li, R\. Fang, W\. Yin, L\. Zhang, Z\. Tao, D\. Zhang, Z\. Xi, G\. Fu, Y\. Jiang,et al\.\(2025\)Webdancer: towards autonomous information seeking agency\.arXiv preprint arXiv:2505\.22648\.Cited by:[Appendix C](https://arxiv.org/html/2606.18954#A3.p2.1)\.
- \[37\]J\. Wu, W\. Yin, Y\. Jiang, Z\. Wang, Z\. Xi, R\. Fang, L\. Zhang, Y\. He, D\. Zhou, P\. Xie,et al\.\(2025\)Webwalker: benchmarking llms in web traversal\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 10290–10305\.Cited by:[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p1.1)\.
- \[38\]S\. Xing, S\. Wang, C\. Yang, X\. Dai, and X\. Ren\(2026\)Lookahead Tree\-Based Rollouts for Enhanced Trajectory\-Level Exploration in Reinforcement Learning with Verifiable Rewards\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.18954#S2.p2.1)\.
- \[39\]A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[Appendix C](https://arxiv.org/html/2606.18954#A3.p5.1),[§1](https://arxiv.org/html/2606.18954#S1.p1.1),[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p2.2)\.
- \[40\]A\. Yang, B\. Zhang, B\. Hui, B\. Gao, B\. Yu, C\. Li, D\. Liu, J\. Tu, J\. Zhou, J\. Lin,et al\.\(2024\)Qwen2\.5\-math technical report: toward mathematical expert model via self\-improvement\.arXiv preprint arXiv:2409\.12122\.Cited by:[Appendix C](https://arxiv.org/html/2606.18954#A3.p5.1),[§3](https://arxiv.org/html/2606.18954#S3.p2.1),[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p2.2)\.
- \[41\]Q\. A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, G\. Dong, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, Z\. Qiu, S\. Quan, and Z\. Wang\(2024\)Qwen2\.5 technical report\.ArXivabs/2412\.15115\.External Links:[Link](https://api.semanticscholar.org/CorpusID:274859421)Cited by:[Appendix C](https://arxiv.org/html/2606.18954#A3.p5.1),[§3](https://arxiv.org/html/2606.18954#S3.p2.1),[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p2.2)\.
- \[42\]Z\. Yang, Z\. Guo, Y\. Huang, X\. Liang, Y\. Wang, and J\. Tang\(2025\)Treerpo: tree relative policy optimization\.arXiv preprint arXiv:2506\.05183\.Cited by:[§1](https://arxiv.org/html/2606.18954#S1.p1.1),[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p2.1)\.
- \[43\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao\(2022\)React: synergizing reasoning and acting in language models\.arXiv preprint arXiv:2210\.03629\.Cited by:[§5\.2](https://arxiv.org/html/2606.18954#S5.SS2.p3.1)\.
- \[44\]L\. Yizhi, Q\. Gu, Z\. Wen, Z\. Li, R\. Yuan, T\. Xing, S\. Guo, T\. Zheng, X\. Qu, W\. Zhou,et al\.\(2025\)TreePO: enhancing policy efficacy and inference efficiency with tree modeling\.OpenReview preprint\.Cited by:[Appendix B](https://arxiv.org/html/2606.18954#A2.p7.1.1),[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p2.1),[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p2.2)\.
- \[45\]Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu,et al\.\(2025\)Dapo: an open\-source llm reinforcement learning system at scale\.The Thirty\-Ninth Annual Conference on Neural Information Processing Systems\.Cited by:[Appendix B](https://arxiv.org/html/2606.18954#A2.p2.1.1),[Appendix C](https://arxiv.org/html/2606.18954#A3.p2.1),[§1](https://arxiv.org/html/2606.18954#S1.p1.1),[§2](https://arxiv.org/html/2606.18954#S2.p1.1),[§4\.5](https://arxiv.org/html/2606.18954#S4.SS5.p2.4),[§5\.1](https://arxiv.org/html/2606.18954#S5.SS1.p2.2)\.
- \[46\]Z\. Zhang, C\. Zheng, Y\. Wu, B\. Zhang, R\. Lin, B\. Yu, D\. Liu, J\. Zhou, and J\. Lin\(2025\)The lessons of developing process reward models in mathematical reasoning\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 10495–10516\.Cited by:[§1](https://arxiv.org/html/2606.18954#S1.p2.1),[§2](https://arxiv.org/html/2606.18954#S2.p1.1)\.
- \[47\]Y\. Zhao, Y\. Liu, J\. Liu, J\. Chen, X\. Wu, Y\. Hao, T\. Lv, S\. Huang, L\. Cui, Q\. Ye,et al\.\(2026\)Geometric\-mean policy optimization\.The Fourteenth International Conference on Learning Representations\.Cited by:[§1](https://arxiv.org/html/2606.18954#S1.p1.1)\.
- \[48\]J\. Zou, L\. Yang, J\. Gu, J\. Qiu, K\. Shen, J\. He, and M\. Wang\(2025\)Reasonflux\-prm: trajectory\-aware prms for long chain\-of\-thought reasoning in llms\.InThe Thirty\-Ninth Annual Conference on Neural Information Processing Systems,Cited by:[§2](https://arxiv.org/html/2606.18954#S2.p1.1),[§5\.3](https://arxiv.org/html/2606.18954#S5.SS3.p2.1)\.

APPENDIX

## Appendix AAlgorithm

We present the complete procedure of GraphPO in Algorithm[S\.1](https://arxiv.org/html/2606.18954#alg1), which integrates the three stages introduced in the main text into a unified training loop\.

Algorithm S\.1GraphPO Graph\-based Policy Optimization1:Policy

πθ\\pi\_\{\\theta\}, prompts

𝒟\\mathcal\{D\}, segment length

LL, max layers

HH, samples per state

bb, convergence threshold

κ\\kappa, evidence\-pooling weight

ww, efficiency coefficient

λeff\\lambda\_\{\\mathrm\{eff\}\}
2:foreach training iterationdo

3:Sample prompt

x∼𝒟x\\sim\\mathcal\{D\}; initialize root semantic state with

xx
4:

⊳\\trianglerightStage 1 Graph Construction \(Section 3\.2\)

5:forlayer

h=1,…,Hh=1,\\dots,Hdo

6:foreach active semantic state

uuat layer

h−1h\-1do

7:Sample

bub\_\{u\}continuations from

πθ\(⋅∣prefix\(u\)\)\\pi\_\{\\theta\}\(\\cdot\\mid\\operatorname\{prefix\}\(u\)\); split them into segment\-level transitions

8:Register each new state

vvwith generation predecessor

p\(v\)=up\(v\)=u
9:Extract structured summary and compute embedding

𝐳v\\mathbf\{z\}\_\{v\}for each new state

vv
10:endfor

11:foreach non\-causal state pair

\(u,v\)\(u,v\)do

12:if

sim⁡\(u,v\)≥κ\\operatorname\{sim\}\(u,v\)\\geq\\kappathen

13:Union\-Find links

uuand

vvinto the same semantic equivalence class

14:endif

15:endfor

16:Build graph successors

𝒞g\(v\)\\mathcal\{C\}\_\{g\}\(v\)from direct successors and convergent\-state successor sharing

17:Reallocate the saved budget from longer routes to active non\-redundant frontier states

18:endfor

19:

⊳\\trianglerightStage 2 Graph Reward \(Section 3\.3\)

20:Evaluate terminal states with verifier and obtain

R\(z\)∈\{0,1\}R\(z\)\\in\\\{0,1\\\}
21:Propagate

cvown,tvownc\_\{v\}^\{\\mathrm\{own\}\},t\_\{v\}^\{\\mathrm\{own\}\}bottom\-up along direct generation\-provenance successors

22:Compute node scores

S\(v\)←\(cvown\+w∑q∈𝒬\(v\)∖\{v\}cqown\)/\(tvown\+w∑q∈𝒬\(v\)∖\{v\}tqown\)S\(v\)\\leftarrow\\bigl\(c\_\{v\}^\{\\mathrm\{own\}\}\+w\\textstyle\\sum\_\{q\\in\\mathcal\{Q\}\(v\)\\setminus\\\{v\\\}\}c\_\{q\}^\{\\mathrm\{own\}\}\\bigr\)/\\bigl\(t\_\{v\}^\{\\mathrm\{own\}\}\+w\\textstyle\\sum\_\{q\\in\\mathcal\{Q\}\(v\)\\setminus\\\{v\\\}\}t\_\{q\}^\{\\mathrm\{own\}\}\\bigr\)when the denominator is positive

23:Compute edge rewards

rstep\(p\(v\),v\)←\(S\(v\)−S\(p\(v\)\)\)⋅\(1−η\(p\(v\),v\)\)r\_\{\\mathrm\{step\}\}\(p\(v\),v\)\\leftarrow\(S\(v\)\-S\(p\(v\)\)\)\\cdot\(1\-\\eta\(p\(v\),v\)\)
24:

⊳\\trianglerightStage 3 Dual\-Group Graph Advantage \(Section 3\.4\)

25:foreach semantic state

uuwith

\|𝒞g\(u\)\|\>1\|\\mathcal\{C\}\_\{g\}\(u\)\|\>1do

26:

Acor\(e\)←\(rstep\(e\)−mean⁡\(\{rstep\(e^\)∣e^∈ℰcor\(u\)\}\)\)/\(std⁡\(\{rstep\(e^\)∣e^∈ℰcor\(u\)\}\)\)A\_\{\\mathrm\{cor\}\}\(e\)\\leftarrow\(r\_\{\\mathrm\{step\}\}\(e\)\-\\operatorname\{mean\}\(\\\{r\_\{\\mathrm\{step\}\}\(\\hat\{e\}\)\\mid\\hat\{e\}\\in\\mathcal\{E\}\_\{\\mathrm\{cor\}\}\(u\)\\\}\)\)/\(\\operatorname\{std\}\(\\\{r\_\{\\mathrm\{step\}\}\(\\hat\{e\}\)\\mid\\hat\{e\}\\in\\mathcal\{E\}\_\{\\mathrm\{cor\}\}\(u\)\\\}\)\)for each

e∈ℰcor\(u\)e\\in\\mathcal\{E\}\_\{\\mathrm\{cor\}\}\(u\)
27:endfor

28:foreach equivalence class

QQwith

\|Q\|≥2\|Q\|\\geq 2do

29:Find deepest common generation predecessor

aQa\_\{Q\}; compute

ℓv=d\(v\)−d\(aQ\)\\ell\_\{v\}=d\(v\)\-d\(a\_\{Q\}\),

S¯Q=mean⁡\(\{S\(v\)∣v∈Q\}\)\\bar\{S\}\_\{Q\}=\\operatorname\{mean\}\(\\\{S\(v\)\\mid v\\in Q\\\}\)
30:

Aeff\(v\)←S¯Q⋅\(−ℓv−mean⁡\(\{−ℓq∣q∈Q\}\)\)/\(std⁡\(\{−ℓq∣q∈Q\}\)\)A\_\{\\mathrm\{eff\}\}\(v\)\\leftarrow\\bar\{S\}\_\{Q\}\\cdot\(\-\\ell\_\{v\}\-\\operatorname\{mean\}\(\\\{\-\\ell\_\{q\}\\mid q\\in Q\\\}\)\)/\(\\operatorname\{std\}\(\\\{\-\\ell\_\{q\}\\mid q\\in Q\\\}\)\); distribute uniformly to generated transitions on the route to

vv
31:endfor

32:

⊳\\trianglerightStage 4 Policy Update

33:Update

θ\\thetaby minimizing

ℒppo\(Acor,mresp\)\+λeffℒppo\(Aeff,meff\)\+βKLℒKL−cHℋ\(πθ\)\\mathcal\{L\}\_\{\\mathrm\{ppo\}\}\(A\_\{\\mathrm\{cor\}\},m\_\{\\mathrm\{resp\}\}\)\+\\lambda\_\{\\mathrm\{eff\}\}\\mathcal\{L\}\_\{\\mathrm\{ppo\}\}\(A\_\{\\mathrm\{eff\}\},m\_\{\\mathrm\{eff\}\}\)\+\\beta\_\{\\mathrm\{KL\}\}\\mathcal\{L\}\_\{\\mathrm\{KL\}\}\-c\_\{H\}\\mathcal\{H\}\(\\pi\_\{\\theta\}\)
34:endfor

## Appendix BDetailed Description of Baselines

GRPO\[[7](https://arxiv.org/html/2606.18954#bib.bib4)\]:GRPO aims to remove the learned critic from PPO\-style RLVR\. It samples multiple complete responses for each prompt and normalizes verifier rewards within the group to obtain response\-level advantages\.

DAPO\[[45](https://arxiv.org/html/2606.18954#bib.bib5)\]:DAPO aims to stabilize and scale GRPO for long\-CoT reasoning\. It keeps group\-relative advantage estimation and improves training with decoupled clipping, dynamic sampling, token\-level policy\-gradient normalization, and overlong reward shaping\.

TreeRL\[[12](https://arxiv.org/html/2606.18954#bib.bib23)\]:TreeRL aims to improve exploration and credit assignment beyond independent response sampling\. It builds an entropy\-guided on\-policy reasoning tree and estimates step values from the correctness ratio of descendant leaf responses\.

SPO\[[8](https://arxiv.org/html/2606.18954#bib.bib30)\]:SPO aims to reduce the sparsity of response\-level rewards\. It partitions responses into segments, estimates segment\-level advantages with Monte Carlo rollouts, and optimizes tokens using a probability\-mask strategy\.

TREE\-GRPO\[[16](https://arxiv.org/html/2606.18954#bib.bib20)\]:TREE\-GRPO aims to reuse shared prefixes and improve advantage estimation in tree\-search RL\. It was originally designed for agent trajectories; we adapt it to LLM reasoning by replacing agent interaction steps with reasoning segments while keeping its intra\-tree and inter\-tree grouped relative advantages\.

PROS\[[13](https://arxiv.org/html/2606.18954#bib.bib14)\]:PROS aims to reduce redundant generation in RLVR by reusing rollout prefixes\. It selects valuable historical prefixes, appends them to the original query as augmented prompts, and trains only on newly generated continuations\.

TreePO\[[44](https://arxiv.org/html/2606.18954#bib.bib9)\]:TreePO aims to improve rollout efficiency with tree modeling\. It decodes responses in fixed\-length segments, reuses shared prefixes through dynamic tree sampling, prunes low\-value paths, and estimates segment advantages from global and local tree subgroups\.

## Appendix CDetailed Experimental Setup

Framework and Hardware\.For reasoning tasks, Our GraphPO is built on the Verl framework\[[29](https://arxiv.org/html/2606.18954#bib.bib44)\]\. During the rollout stage, we use the vLLM framework for efficient batched inference\[[20](https://arxiv.org/html/2606.18954#bib.bib45)\]\. In agentic RL, Our implementation is built upon Search\-R1\[[17](https://arxiv.org/html/2606.18954#bib.bib59)\]\. Our experiments are conducted on 16×\\timesH20 GPUs\.

Training Data\.For training data in reasoning tasks, we use the DAPO\-Math dataset\[[45](https://arxiv.org/html/2606.18954#bib.bib5)\]for reinforcement learning\. For agent tasks, we adopt the experimental setup of Tree\-GRPO\[[16](https://arxiv.org/html/2606.18954#bib.bib20)\], using ASearcher\-35K\[[6](https://arxiv.org/html/2606.18954#bib.bib57)\]and WebDancer\[[36](https://arxiv.org/html/2606.18954#bib.bib58)\]as training data\.

Rollout Budget\.In reasoning tasks, chain methods, including GRPO and DAPO, generate 64 trajectories for each prompt\. In deep search task, we use group size 8 for all chain\-based RL

For tree\-based methods, including TreeRL, SPO, TREE\-GRPO, PROS, and TreePO, as well as GraphPO, we fix the length of each step and use a token budget close to that of chain methods for rollout\. Taking a ternary tree as an example, the calculation is as follows\. Assume that the tree hasnnlayers and each branch step containssstokens\. Then each complete root\-to\-leaf trajectory has lengthnsns, and the token budget of chain sampling with 64 trajectories isBchain=64ns\.B\_\{\\mathrm\{chain\}\}=64ns\.For the ternary tree, the number of generated branches at layeriiis3i3^\{i\}\. Therefore, the total rollout cost of constructing the tree isBtree=s∑i=1n3i\.B\_\{\\mathrm\{tree\}\}=s\\sum\_\{i=1\}^\{n\}3^\{i\}\.

Context Length and Step Size\.For reasoning tasks, we set the context length to 16,384 tokens for DeepSeek\-R1\-Distill\-Qwen\-7B\[[7](https://arxiv.org/html/2606.18954#bib.bib4)\]and Qwen3\-8B\-Base\[[39](https://arxiv.org/html/2606.18954#bib.bib12)\]\. For Qwen2\.5\-Math\-7B\[[40](https://arxiv.org/html/2606.18954#bib.bib34)\], we use its maximum supported context length of 4,096 tokens\. For tree based methods and GraphPO, we set the length of each step from 512 to 2048 tokens\. For deep search task, The max response length of Qwen2\.5\-7B\-Instruct\[[41](https://arxiv.org/html/2606.18954#bib.bib26)\]is set to 8000 tokens\.

Optimization\.We train the models under an off policy RL setting\. For reasoning tasks, the batch size is set to 512, and the minibatch size is set to 32\. In the main text, each training step corresponds to one batch\. We train all models for 350 steps\. For deep search task, we train 40 steps\. The batch size is set to 128, and the minibatch size is set to 64\. We use the AdamW optimizer with a learning rate of1×10−61\\times 10^\{\-6\}, betas of\(0\.9,0\.999\)\(0\.9,0\.999\), and a weight decay of 0\.01\. We use no learning rate warmup\. For PPO, we set the clip ratio to 0\.2 and the gradient clipping threshold to 1\.0\. The actor module is trained efficiently with Fully Sharded Data Parallel\.

Table S\.1:Performance comparison of different branching schemes on various reasoning benchmarks with Qwen2\.5\-7B\-Math\. We report Accuracy \(%\) for each benchmark\. The best performance for each benchmark is highlighted inbold\.MethodAIME24AIME25MATH500GPQALiveCodeBenchAverageBase13\.66\.354\.828\.55\.721\.8GRPO23\.315\.579\.531\.811\.432\.3DAPO25\.717\.282\.632\.312\.634\.1GraphPO \(𝒯=\{4,5,3,2\}\\mathcal\{T\}=\\\{4,5,3,2\\\}\)29\.421\.487\.836\.515\.138\.0GraphPO \(𝒯=\{7,6,5,4,3\}\\mathcal\{T\}=\\\{7,6,5,4,3\\\}\)31\.023\.290\.038\.316\.139\.7GraphPO \(𝒯=\{5,4,2,2,2\}\\mathcal\{T\}=\\\{5,4,2,2,2\\\}\)31\.624\.090\.939\.116\.640\.4GraphPO \(𝒯=\{2,4,3,2,2,2\}\\mathcal\{T\}=\\\{2,4,3,2,2,2\\\}\)30\.222\.389\.037\.415\.638\.9GraphPO \(𝒯=\{2,5,4,2,2\}\\mathcal\{T\}=\\\{2,5,4,2,2\\\}\)32\.124\.591\.639\.616\.940\.9

## Appendix DGraphPO Remains Effective Across Diverse Branching Strategies

The number of steps generated by each reasoning path at each layer is a critical parameter when constructing the RL graph\. According to Figure[2](https://arxiv.org/html/2606.18954#S3.F2)a and b, different reasoning paths tend to show high similarity at the early stage of reasoning\. In the later stage, these reasoning paths gradually become more different\. Therefore, in our experiments, we adopt a branching strategy that gradually reduces the number of branches across layers\.

In this section, we analyze model performance under different settings\. The results are shown in Table[S\.1](https://arxiv.org/html/2606.18954#A3.T1)\. Here,𝒯=\{b1,b2,…,bH\}\\mathcal\{T\}=\\\{b\_\{1\},b\_\{2\},\\dots,b\_\{H\}\\\}denotes the per\-layer branching schedule of the reasoning graph:HHis the total number of layers, andbhb\_\{h\}specifies the number of child continuations sampled from each active state at layerhh\. For example,𝒯=\{4,5,3,2\}\\mathcal\{T\}=\\\{4,5,3,2\\\}corresponds to a44\-layer graph in which each root state expands44children at the first layer, every layer\-11state expands55children at the second layer, and so on, with the branching factor decreasing toward deeper layers\. We can see that the strategy with gradually decreasing branches achieves better performance\. This is because more branches in the early reasoning stage can encourage more diverse exploration\. In contrast, fewer branches in the later reasoning stage can reduce redundant exploration and improve reasoning efficiency\.

At the same time, all branching strategies outperform independent sampling\. This shows that introducing branches during reasoning can encourage more diverse exploration and improve reasoning performance\. We can also see that using more branches is not always better\. Under the same token budget, too many branches reduce the exploration depth\. As a result, the model has less reasoning time and may fail to find better reasoning paths\. Moreover, the comparison between𝒯=\{4,5,3,2\}\\mathcal\{T\}=\\\{4,5,3,2\\\}and the uniform schedule𝒯=\{4,4,4,4\}\\mathcal\{T\}=\\\{4,4,4,4\\\}further confirms that allocating more capacity to mid\-stage branching, where reasoning paths begin to diverge, is more beneficial than spreading branches uniformly across all layers\. This observation aligns with our empirical study in Section[3](https://arxiv.org/html/2606.18954#S3), where the divergence of reasoning trajectories peaks in the middle stage rather than at the very beginning or the very end\.

## Appendix EAblation Study

![Refer to caption](https://arxiv.org/html/2606.18954v1/x7.png)Figure S\.1:Impact of efficiency advantage on response length\.Efficiency advantage improves reasoning efficiency\.To show that the efficiency advantage improves reasoning efficiency, we further analyze its effect on response length\. The results are shown in Figure[S\.1](https://arxiv.org/html/2606.18954#A5.F1)\. We can see that, without the efficiency advantage, the response length first decreases and then increases during training, similar to tree based methods\. This is mainly because longer responses create more branching opportunities\. As a result, the model may learn redundant exploration patterns, which reduces reasoning efficiency\. We can also observe that even without the efficiency advantage, GraphPO still produces shorter responses than tree based methods overall\. This is because node merging reduces the sampling budget of longer reasoning paths\. Therefore, much of the learned downstream information comes from shorter paths\. As a result, the final model tends to learn response patterns based on shorter reasoning paths\.

![Refer to caption](https://arxiv.org/html/2606.18954v1/x8.png)Figure S\.2:Impact of pooling coefficient\.Moderate Pooling Coefficients Strike the Best Bias\-Variance Balance\.The pooling coefficientw∈\[0,1\]w\\in\[0,1\]controls how strongly the suffixes shared from equivalent nodes contribute to a node’s value estimate\. In this section, we conduct ablation studies to analyze the impact of the pooling coefficient\. Figure[S\.2](https://arxiv.org/html/2606.18954#A5.F2)shows the performance of GraphPO under different pooling coefficients\. We can see that the performance first increases and then decreases aswwbecomes larger\. This is because a small pooling coefficient prevents the graph from sharing suffixes across equivalent states, so the advantage estimator inherits the high variance of tree\-based estimators with few terminal samples per node and yields noisy gradients that slow convergence\. In contrast, a large pooling coefficient forces the value estimate to rely heavily on suffixes from semantically similar but not identical nodes\. The residual semantic heterogeneity is amplified into a non\-negligible merging bias, which distorts the process reward and hurts policy optimization\. Whenw=0w=0, the equivalence\-class suffix sharing is disabled and GraphPO degenerates to a tree\-based estimator, and the performance drops clearly\. A moderate value aroundw=0\.7w=0\.7best balances the variance reduction obtained from shared suffixes and the semantic\-merging bias, which is consistent with the theoretical analysis in Lemma[2](https://arxiv.org/html/2606.18954#Thmlemma2)\.

## Appendix FTheoretical Analysis

In this section we provide formal theoretical justification for the four claims made in the main text: \(i\) compared with tree\-based methods that cannot reuse nodes, equivalence\-class suffix sharing pools downstream evidence from all semantically equivalent paths into a single advantage\-estimation group, which yields lower\-variance node\-score and advantage estimates at the cost of a controllableO\(1−κ\)O\(1\{\-\}\\kappa\)semantic\-merging bias \(Section[F\.2](https://arxiv.org/html/2606.18954#A6.SS2)\); \(ii\) GraphPO improves expected exploration efficiency under a fixed token budget when budget saved from redundant continuations is redirected to novel frontiers \(Section[F\.3](https://arxiv.org/html/2606.18954#A6.SS3)\); \(iii\) in expectation, the GraphPO surrogate trained withrstepr\_\{\\mathrm\{step\}\}has the same first\-order update direction as a novelty\-gated oracle PRM surrogate, and thus yields self\-emergent process supervision purely from outcome rewards \(Section[F\.4](https://arxiv.org/html/2606.18954#A6.SS4)\); and \(iv\) the dual\-group graph advantage breaks outcome\-objective ties in favor of shorter semantically equivalent correct reasoning paths \(Section[F\.5](https://arxiv.org/html/2606.18954#A6.SS5)\)\.

### F\.1Notation and Assumptions

Letxxdenote a prompt,πθ\\pi\_\{\\theta\}the policy, andτ=\(s0,e1,s1,…,eT,sT\)\\tau=\(s\_\{0\},e\_\{1\},s\_\{1\},\\ldots,e\_\{T\},s\_\{T\}\)a reasoning trajectory with reasoning stepsete\_\{t\}as edges and semantic statessts\_\{t\}as nodes\. LetV⋆\(s\)=𝔼τ∼πθ∣s0=s\[R\(τ\)\]V^\{\\star\}\(s\)\\\!=\\\!\\mathbb\{E\}\_\{\\tau\\sim\\pi\_\{\\theta\}\\mid s\_\{0\}=s\}\[R\(\\tau\)\]be the true value function under the verifier rewardR∈\{0,1\}R\\in\\\{0,1\\\}\. Letϕ:Σ∗→ℝd\\phi:\\Sigma^\{\\ast\}\\\!\\to\\\!\\mathbb\{R\}^\{d\}denote the embedder, with semantic similaritysim⁡\(u,v\)=⟨ϕ\(u\),ϕ\(v\)⟩/\(‖ϕ\(u\)‖‖ϕ\(v\)‖\)\\operatorname\{sim\}\(u,v\)=\\langle\\phi\(u\),\\phi\(v\)\\rangle/\(\\\|\\phi\(u\)\\\|\\\|\\phi\(v\)\\\|\)andκ∈\(0,1\)\\kappa\\in\(0,1\)the merging threshold\.

###### Assumption 1\(Value\-stable detected equivalence classes\)\.

There exists a class\-stability radiusδκ=O\(1−κ\)\\delta\_\{\\kappa\}=O\(1\-\\kappa\)such that for any two statesu,vu,vin the same detected equivalence class,

\|V⋆\(u\)−V⋆\(v\)\|≤δκ\.\|V^\{\\star\}\(u\)\-V^\{\\star\}\(v\)\|\\leq\\delta\_\{\\kappa\}\.\(S\.1\)This is the class\-level version of a Lipschitz semantic\-equivalence condition\. It avoids treating a union\-find transitive closure as automatically pairwise close unless the detected class itself has small semantic diameter\.

###### Assumption 2\(Bounded rewards and i\.i\.d\. rollouts inside a group\)\.

Verifier rewards are bounded in\[0,1\]\[0,1\]\. Conditional on a parent stateuu, the children generated byπθ\(⋅∣u\)\\pi\_\{\\theta\}\(\\cdot\\mid u\)are i\.i\.d\., and their downstream completions are independent given the children\. The own\-subtree terminal samples used in a pooled score are sampled independently and are disjoint across equivalence\-class members\.

###### Assumption 3\(No\-cycle merging\)\.

Equivalence classes𝒬\(v\)\\mathcal\{Q\}\(v\)are formed only between non\-causal pairs, so the merged structure remains a DAG and the bottom\-up propagation in Section[4\.3](https://arxiv.org/html/2606.18954#S4.SS3)is well\-defined\.

These assumptions are consistent with standard analysis of RLVR/GRPO\-style estimators\[[19](https://arxiv.org/html/2606.18954#bib.bib10),[27](https://arxiv.org/html/2606.18954#bib.bib13),[4](https://arxiv.org/html/2606.18954#bib.bib19)\]and with the structural constraints of GraphPO described in Section[4\.2](https://arxiv.org/html/2606.18954#S4.SS2)\. Throughout the analysis, variances are conditional on the sampled graph structure and on the observed terminal rollout counts\.

### F\.2Lower\-Variance Advantage Estimation via Shared Suffixes

We first analyze the estimation variance of the advantage estimators used by GRPO, TreeRPO, and GraphPO\. Trees expand semantically equivalent states as independent nodes, so each state’s score is estimated only from its own subtree\. GraphPO instead merges these states into an equivalence class and pools their terminal samples, which enlarges the effective sample size of each node\-score estimate and of the standardization group used to form the advantage\. Both effects reduce the variance of the resulting advantage estimate\.

#### Estimators\.

For a stateuuwithKKchildren, letV^i\\hat\{V\}\_\{i\}be the empirical correctness rate of theii\-th child obtained frommmterminal rollouts\. Define:

A^iGRPO\\displaystyle\\hat\{A\}^\{\\mathrm\{GRPO\}\}\_\{i\}=R^i−1K∑j=1KR^j,\\displaystyle=\\hat\{R\}\_\{i\}\-\\tfrac\{1\}\{K\}\\textstyle\\sum\_\{j=1\}^\{K\}\\hat\{R\}\_\{j\},\(S\.2\)A^iTree\\displaystyle\\hat\{A\}^\{\\mathrm\{Tree\}\}\_\{i\}=V^i−1K∑j=1KV^j,\\displaystyle=\\hat\{V\}\_\{i\}\-\\tfrac\{1\}\{K\}\\textstyle\\sum\_\{j=1\}^\{K\}\\hat\{V\}\_\{j\},A^iGraph\\displaystyle\\hat\{A\}^\{\\mathrm\{Graph\}\}\_\{i\}=V^iQ−1K∑j=1KV^jQ,V^iQ=ciown\+w∑q∈𝒬\(i\)∖\{i\}cqowntiown\+w∑q∈𝒬\(i\)∖\{i\}tqown\.\\displaystyle=\\hat\{V\}^\{Q\}\_\{i\}\-\\tfrac\{1\}\{K\}\\textstyle\\sum\_\{j=1\}^\{K\}\\hat\{V\}^\{Q\}\_\{j\},\\quad\\hat\{V\}^\{Q\}\_\{i\}=\\frac\{c\_\{i\}^\{\\mathrm\{own\}\}\+w\\\!\\sum\_\{q\\in\\mathcal\{Q\}\(i\)\\setminus\\\{i\\\}\}c\_\{q\}^\{\\mathrm\{own\}\}\}\{t\_\{i\}^\{\\mathrm\{own\}\}\+w\\\!\\sum\_\{q\\in\\mathcal\{Q\}\(i\)\\setminus\\\{i\\\}\}t\_\{q\}^\{\\mathrm\{own\}\}\}\.GRPO uses raw outcome rewards on full trajectories\. TreeRPO uses Monte\-Carlo correctness of each child computed only from its own subtree\. GraphPO additionally pools terminal evidence across the equivalence class𝒬\(i\)\\mathcal\{Q\}\(i\)\. These are the raw pre\-normalization estimators; the standardization in Section[4\.4](https://arxiv.org/html/2606.18954#S4.SS4)rescales the signal after the sampling estimate is formed\.

###### Lemma 1\(Per\-child variance\)\.

Under Assumptions[1](https://arxiv.org/html/2606.18954#Thmassumption1)–[2](https://arxiv.org/html/2606.18954#Thmassumption2), withmmMonte\-Carlo terminal rollouts per child,

Var⁡\(V^i\)=V⋆\(i\)\(1−V⋆\(i\)\)m\.\\operatorname\{Var\}\\\!\\bigl\(\\hat\{V\}\_\{i\}\\bigr\)\\;=\\;\\frac\{V^\{\\star\}\(i\)\(1\-V^\{\\star\}\(i\)\)\}\{m\}\.\(S\.3\)

###### Proof\.

V^i\\hat\{V\}\_\{i\}is the empirical mean ofmmi\.i\.d\. Bernoulli\(V⋆\(i\)V^\{\\star\}\(i\)\) terminal indicators by Assumption[2](https://arxiv.org/html/2606.18954#Thmassumption2)\. The variance of a Bernoulli mean isp\(1−p\)/mp\(1\-p\)/m\. ∎

###### Lemma 2\(Variance with equivalence\-class pooling\)\.

Let\|𝒬\(i\)\|=ki≥1\|\\mathcal\{Q\}\(i\)\|=k\_\{i\}\\\!\\geq\\\!1and suppose each member of𝒬\(i\)\\mathcal\{Q\}\(i\)hasmmown terminal rollouts\. Define

ρi=1\+\(ki−1\)w2\(1\+w\(ki−1\)\)2≤11\+w\(ki−1\)\.\\rho\_\{i\}=\\frac\{1\+\(k\_\{i\}\-1\)w^\{2\}\}\{\(1\+w\(k\_\{i\}\-1\)\)^\{2\}\}\\leq\\frac\{1\}\{1\+w\(k\_\{i\}\-1\)\}\.\(S\.4\)Then

Var⁡\(V^iQ\)≤ρimaxq∈𝒬\(i\)⁡V⋆\(q\)\(1−V⋆\(q\)\)m,\\operatorname\{Var\}\(\\hat\{V\}^\{Q\}\_\{i\}\)\\;\\leq\\;\\rho\_\{i\}\\,\\frac\{\\max\_\{q\\in\\mathcal\{Q\}\(i\)\}V^\{\\star\}\(q\)\(1\-V^\{\\star\}\(q\)\)\}\{m\},\(S\.5\)and, sincep\(1−p\)p\(1\-p\)is 1\-Lipschitz on\[0,1\]\[0,1\],

Var⁡\(V^iQ\)≤ρiV⋆\(i\)\(1−V⋆\(i\)\)\+δκm,\\operatorname\{Var\}\(\\hat\{V\}^\{Q\}\_\{i\}\)\\;\\leq\\;\\rho\_\{i\}\\,\\frac\{V^\{\\star\}\(i\)\(1\-V^\{\\star\}\(i\)\)\+\\delta\_\{\\kappa\}\}\{m\},\(S\.6\)and its semantic\-merging bias satisfies

\|𝔼\[V^iQ\]−V⋆\(i\)\|≤δκ\.\\bigl\|\\mathbb\{E\}\[\\hat\{V\}^\{Q\}\_\{i\}\]\-V^\{\\star\}\(i\)\\bigr\|\\leq\\delta\_\{\\kappa\}\.\(S\.7\)Consequently, the mean\-squared error relative toV⋆\(i\)V^\{\\star\}\(i\)is bounded by the variance term above plusδκ2\\delta\_\{\\kappa\}^\{2\}\.

###### Proof\.

For balanced groups, writeV^iQ=∑q∈𝒬\(i\)αqV^q\\hat\{V\}^\{Q\}\_\{i\}=\\sum\_\{q\\in\\mathcal\{Q\}\(i\)\}\\alpha\_\{q\}\\hat\{V\}\_\{q\}, whereαi=1/\(1\+w\(ki−1\)\)\\alpha\_\{i\}=1/\(1\+w\(k\_\{i\}\-1\)\),αq=w/\(1\+w\(ki−1\)\)\\alpha\_\{q\}=w/\(1\+w\(k\_\{i\}\-1\)\)forq≠iq\\neq i, and∑qαq=1\\sum\_\{q\}\\alpha\_\{q\}=1\. By Assumption[2](https://arxiv.org/html/2606.18954#Thmassumption2), theV^q\\hat\{V\}\_\{q\}are mutually independent, so

Var⁡\(V^iQ\)=∑q∈𝒬\(i\)αq2Var⁡\(V^q\)\.\\operatorname\{Var\}\(\\hat\{V\}^\{Q\}\_\{i\}\)=\\sum\_\{q\\in\\mathcal\{Q\}\(i\)\}\\alpha\_\{q\}^\{2\}\\operatorname\{Var\}\(\\hat\{V\}\_\{q\}\)\.\(S\.8\)Using Lemma[1](https://arxiv.org/html/2606.18954#Thmlemma1)gives the variance bound because∑qαq2=ρi\\sum\_\{q\}\\alpha\_\{q\}^\{2\}=\\rho\_\{i\}\. The second inequality in the definition ofρi\\rho\_\{i\}follows fromw2≤ww^\{2\}\\leq wforw∈\[0,1\]w\\in\[0,1\]\. The alternative variance bound follows from Assumption[1](https://arxiv.org/html/2606.18954#Thmassumption1)and the fact thatp\(1−p\)p\(1\-p\)is 1\-Lipschitz on\[0,1\]\[0,1\]\. For the bias,

\|𝔼\[V^iQ\]−V⋆\(i\)\|≤∑q∈𝒬\(i\)αq\|V⋆\(q\)−V⋆\(i\)\|≤δκ\\left\|\\mathbb\{E\}\[\\hat\{V\}^\{Q\}\_\{i\}\]\-V^\{\\star\}\(i\)\\right\|\\leq\\sum\_\{q\\in\\mathcal\{Q\}\(i\)\}\\alpha\_\{q\}\|V^\{\\star\}\(q\)\-V^\{\\star\}\(i\)\|\\leq\\delta\_\{\\kappa\}\(S\.9\)by Assumption[1](https://arxiv.org/html/2606.18954#Thmassumption1)\. ∎

###### Theorem 1\(GraphPO yields lower\-variance advantage estimates via shared suffixes\)\.

Under Assumptions[1](https://arxiv.org/html/2606.18954#Thmassumption1)–[3](https://arxiv.org/html/2606.18954#Thmassumption3), suppose theKKchild\-score estimates used in the group baseline are conditionally independent and have balanced own terminal rollouts\. Letρmax\(u\)=maxi≤K⁡ρi\\rho\_\{\\max\}\(u\)=\\max\_\{i\\leq K\}\\rho\_\{i\}\. Then

Var⁡\(A^iGraph\)\\displaystyle\\operatorname\{Var\}\\\!\\bigl\(\\hat\{A\}^\{\\mathrm\{Graph\}\}\_\{i\}\\bigr\)≤ρmax\(u\)Var⁡\(A^iTree\)\+O\(ρmax\(u\)δκm\)\\displaystyle\\leq\\;\\rho\_\{\\max\}\(u\)\\,\\operatorname\{Var\}\\\!\\bigl\(\\hat\{A\}^\{\\mathrm\{Tree\}\}\_\{i\}\\bigr\)\+O\\\!\\left\(\\frac\{\\rho\_\{\\max\}\(u\)\\delta\_\{\\kappa\}\}\{m\}\\right\)\(S\.10\)≤ρmax\(u\)Var⁡\(A^iGRPO\)\+O\(ρmax\(u\)δκm\),\\displaystyle\\leq\\;\\rho\_\{\\max\}\(u\)\\,\\operatorname\{Var\}\\\!\\bigl\(\\hat\{A\}^\{\\mathrm\{GRPO\}\}\_\{i\}\\bigr\)\+O\\\!\\left\(\\frac\{\\rho\_\{\\max\}\(u\)\\delta\_\{\\kappa\}\}\{m\}\\right\),where the second inequality compares to a GRPO estimator that uses a single terminal outcome per child\. The estimator’s bias relative to the oracle advantage isO\(δκ\)O\(\\delta\_\{\\kappa\}\), and its MSE has an additionalO\(δκ2\)O\(\\delta\_\{\\kappa\}^\{2\}\)term\. If every child in the comparison group haski\>1k\_\{i\}\>1andw\>0w\>0, thenρmax\(u\)<1\\rho\_\{\\max\}\(u\)<1, so the variance reduction from suffix sharing is strict whenever the semantic\-heterogeneity term is smaller than the saved sampling variance, i\.e\. GraphPO returns a lower\-variance advantage estimate than tree\-based baselines that cannot reuse nodes\.

###### Proof\.

For independent child\-score estimatesX1,…,XKX\_\{1\},\\ldots,X\_\{K\},

Var⁡\(Xi−1K∑j=1KXj\)=\(1−1K\)2Var⁡\(Xi\)\+1K2∑j≠iVar⁡\(Xj\)\.\\operatorname\{Var\}\\\!\\left\(X\_\{i\}\-\\frac\{1\}\{K\}\\sum\_\{j=1\}^\{K\}X\_\{j\}\\right\)=\\left\(1\-\\frac\{1\}\{K\}\\right\)^\{2\}\\operatorname\{Var\}\(X\_\{i\}\)\+\\frac\{1\}\{K^\{2\}\}\\sum\_\{j\\neq i\}\\operatorname\{Var\}\(X\_\{j\}\)\.\(S\.11\)All coefficients are nonnegative\. Lemma[2](https://arxiv.org/html/2606.18954#Thmlemma2)contracts each child\-score variance by at mostρmax\(u\)\\rho\_\{\\max\}\(u\)relative to the non\-pooled Tree estimator, up to theO\(ρmax\(u\)δκ/m\)O\(\\rho\_\{\\max\}\(u\)\\delta\_\{\\kappa\}/m\)heterogeneity term induced by imperfect semantic merging\. The same bound therefore applies after subtracting the group baseline\. A Tree estimator averagesmmterminal Bernoulli outcomes per child, while the corresponding GRPO estimator uses one terminal outcome per child; the law of total variance therefore givesVar⁡\(A^iTree\)≤Var⁡\(A^iGRPO\)\\operatorname\{Var\}\(\\hat\{A\}^\{\\mathrm\{Tree\}\}\_\{i\}\)\\leq\\operatorname\{Var\}\(\\hat\{A\}^\{\\mathrm\{GRPO\}\}\_\{i\}\)\. The bias and MSE statements follow from the bias part of Lemma[2](https://arxiv.org/html/2606.18954#Thmlemma2)\. ∎

###### Corollary 1\(More accurate edge rewards via aggregation over equivalent partners\)\.

The step\-level edge rewardrstep\(u,v\)=\(S\(v\)−S\(u\)\)\(1−η\(u,v\)\)r\_\{\\mathrm\{step\}\}\(u,v\)=\(S\(v\)\-S\(u\)\)\(1\-\\eta\(u,v\)\)inherits the estimation accuracy of its endpoint scores\. Define the oracle edge rewardrstep⋆\(u,v\)=\(V⋆\(v\)−V⋆\(u\)\)\(1−η\(u,v\)\)r^\{\\star\}\_\{\\mathrm\{step\}\}\(u,v\)=\(V^\{\\star\}\(v\)\-V^\{\\star\}\(u\)\)\(1\-\\eta\(u,v\)\)and the empirical plug\-inr^stepGraph\(u,v\)\\hat\{r\}^\{\\mathrm\{Graph\}\}\_\{\\mathrm\{step\}\}\(u,v\)obtained by replacingSSwith the pooled estimatorV^Q\\hat\{V\}^\{Q\}from Lemma[2](https://arxiv.org/html/2606.18954#Thmlemma2), and analogouslyr^stepTree\\hat\{r\}^\{\\mathrm\{Tree\}\}\_\{\\mathrm\{step\}\}from the non\-pooled tree estimator\. LetΓuvGraph=Cov⁡\(V^Q\(u\),V^Q\(v\)\)\\Gamma^\{\\mathrm\{Graph\}\}\_\{uv\}=\\operatorname\{Cov\}\(\\hat\{V\}^\{Q\}\(u\),\\hat\{V\}^\{Q\}\(v\)\)andΓuvTree=Cov⁡\(V^Tree\(u\),V^Tree\(v\)\)\\Gamma^\{\\mathrm\{Tree\}\}\_\{uv\}=\\operatorname\{Cov\}\(\\hat\{V\}^\{\\mathrm\{Tree\}\}\(u\),\\hat\{V\}^\{\\mathrm\{Tree\}\}\(v\)\)\. Then the mean\-squared error of the GraphPO edge reward to the oracle satisfies

MSE⁡\(r^stepGraph\(u,v\)\)≤\\displaystyle\\operatorname\{MSE\}\\bigl\(\\hat\{r\}^\{\\mathrm\{Graph\}\}\_\{\\mathrm\{step\}\}\(u,v\)\\bigr\)\\leq\{\}\(1−η\(u,v\)\)2\(ρuVar⁡\(V^Tree\(u\)\)\+ρvVar⁡\(V^Tree\(v\)\)−2ΓuvGraph\)\\displaystyle\(1\-\\eta\(u,v\)\)^\{2\}\\Bigl\(\\rho\_\{u\}\\operatorname\{Var\}\(\\hat\{V\}^\{\\mathrm\{Tree\}\}\(u\)\)\+\\rho\_\{v\}\\operatorname\{Var\}\(\\hat\{V\}^\{\\mathrm\{Tree\}\}\(v\)\)\-2\\Gamma^\{\\mathrm\{Graph\}\}\_\{uv\}\\Bigr\)\(S\.12\)\+\(1−η\(u,v\)\)2\(2δκ\)2\+O\(\(ρu\+ρv\)δκm\)\.\\displaystyle\+\\,\(1\-\\eta\(u,v\)\)^\{2\}\\,\(2\\delta\_\{\\kappa\}\)^\{2\}\+O\\\!\\left\(\\frac\{\(\\rho\_\{u\}\+\\rho\_\{v\}\)\\delta\_\{\\kappa\}\}\{m\}\\right\)\.Hereρu,ρv∈\(0,1\]\\rho\_\{u\},\\rho\_\{v\}\\in\(0,1\]are the endpoint pooling factors from Lemma[2](https://arxiv.org/html/2606.18954#Thmlemma2), the first line bounds the variance contributed by the two endpoint estimators \(which is strictly reduced by aggregating own\-subtree evidence with that of equivalent partners\), and the\(2δκ\)2\(2\\delta\_\{\\kappa\}\)^\{2\}term bounds the squared bias induced by imperfect semantic merging\. Compared with the tree edge reward, which uses no information from equivalent partners, the GraphPO edge reward is strictly more accurate \(in MSE\) whenever the variance saving from shared evidence dominates the residual semantic\-heterogeneity bias:

\(1−η\(u,v\)\)2\(\(1−ρu\)Var⁡\(V^Tree\(u\)\)\+\(1−ρv\)Var⁡\(V^Tree\(v\)\)−2\(ΓuvTree−ΓuvGraph\)\)\\displaystyle\(1\-\\eta\(u,v\)\)^\{2\}\\Bigl\(\(1\-\\rho\_\{u\}\)\\operatorname\{Var\}\(\\hat\{V\}^\{\\mathrm\{Tree\}\}\(u\)\)\+\(1\-\\rho\_\{v\}\)\\operatorname\{Var\}\(\\hat\{V\}^\{\\mathrm\{Tree\}\}\(v\)\)\-2\(\\Gamma^\{\\mathrm\{Tree\}\}\_\{uv\}\-\\Gamma^\{\\mathrm\{Graph\}\}\_\{uv\}\)\\Bigr\)\(S\.13\)\>\(1−η\(u,v\)\)2\(2δκ\)2\+O\(\(ρu\+ρv\)δκm\)\.\\displaystyle\\quad\>\\;\(1\-\\eta\(u,v\)\)^\{2\}\(2\\delta\_\{\\kappa\}\)^\{2\}\+O\\\!\\left\(\\frac\{\(\\rho\_\{u\}\+\\rho\_\{v\}\)\\delta\_\{\\kappa\}\}\{m\}\\right\)\.In the conditionally independent endpoint case the covariance term vanishes and the condition reduces to nontrivial pooling \(w\>0w\>0,\|𝒬\|\>1\|\\mathcal\{Q\}\|\>1\) with small semantic heterogeneity \(δκ\\delta\_\{\\kappa\}small\)\.

###### Proof\.

Decompose MSE into variance and squared bias\. The reward is a deterministic linear transformation of the two endpoint score estimates:r^step−rstep⋆=\(1−η\)\(\(S^\(v\)−V⋆\(v\)\)−\(S^\(u\)−V⋆\(u\)\)\)\\hat\{r\}\_\{\\mathrm\{step\}\}\-r^\{\\star\}\_\{\\mathrm\{step\}\}=\(1\-\\eta\)\\bigl\(\(\\hat\{S\}\(v\)\-V^\{\\star\}\(v\)\)\-\(\\hat\{S\}\(u\)\-V^\{\\star\}\(u\)\)\\bigr\), hence

MSE⁡\(r^step\)=\\displaystyle\\operatorname\{MSE\}\(\\hat\{r\}\_\{\\mathrm\{step\}\}\)=\{\}\(1−η\)2\(Var⁡\(S^\(v\)\)\+Var⁡\(S^\(u\)\)−2Cov⁡\(S^\(u\),S^\(v\)\)\)\\displaystyle\(1\-\\eta\)^\{2\}\\Bigl\(\\operatorname\{Var\}\(\\hat\{S\}\(v\)\)\+\\operatorname\{Var\}\(\\hat\{S\}\(u\)\)\-2\\operatorname\{Cov\}\(\\hat\{S\}\(u\),\\hat\{S\}\(v\)\)\\Bigr\)\(S\.14\)\+\(1−η\)2\(𝔼S^\(v\)−V⋆\(v\)−\(𝔼S^\(u\)−V⋆\(u\)\)\)2\.\\displaystyle\+\(1\-\\eta\)^\{2\}\\bigl\(\\mathbb\{E\}\\hat\{S\}\(v\)\-V^\{\\star\}\(v\)\-\(\\mathbb\{E\}\\hat\{S\}\(u\)\-V^\{\\star\}\(u\)\)\\bigr\)^\{2\}\.For the GraphPO estimator, applying the variance bound of Lemma[2](https://arxiv.org/html/2606.18954#Thmlemma2)to each endpoint variance gives the first line of the bound; applying the bias bound\|𝔼V^Q\(⋅\)−V⋆\(⋅\)\|≤δκ\|\\mathbb\{E\}\\hat\{V\}^\{Q\}\(\\cdot\)\-V^\{\\star\}\(\\cdot\)\|\\leq\\delta\_\{\\kappa\}at both endpoints and the triangle inequality gives the\(2δκ\)2\(2\\delta\_\{\\kappa\}\)^\{2\}squared\-bias term\. The tree estimator is unbiased \(zero bias term\) but cannot pool over equivalent partners, so its variance term usesρ=1\\rho=1\. Subtracting the two MSE expressions yields the stated comparison condition: GraphPO trades a smallO\(δκ2\)O\(\\delta\_\{\\kappa\}^\{2\}\)bias for anΩ\(1−ρ\)\\Omega\(1\-\\rho\)variance reduction, and is strictly more accurate whenever the latter dominates\. ∎

#### Dual\-group variance reduction\.

The dual\-group graph advantage further benefits from equivalence\-class pooling because the standardization group itself is enlarged\. In a tree estimator that cannot reuse nodes, the group used to standardize the correctness advantage at stateuucontains only the direct children𝒞\(u\)\\mathcal\{C\}\(u\)\. In GraphPO, paths reaching equivalent states are merged, so the group becomesℰcor\(u\)=\{\(p\(v\),v\)∣v∈𝒞g\(u\)\}\\mathcal\{E\}\_\{\\mathrm\{cor\}\}\(u\)=\\\{\(p\(v\),v\)\\mid v\\in\\mathcal\{C\}\_\{g\}\(u\)\\\}with\|𝒞g\(u\)\|=∑q∈𝒬\(u\)\|𝒞\(q\)\|\|\\mathcal\{C\}\_\{g\}\(u\)\|=\\sum\_\{q\\in\\mathcal\{Q\}\(u\)\}\|\\mathcal\{C\}\(q\)\|\.

###### Proposition 1\(Lower correctness\-advantage variance via equivalence\-class grouping\)\.

Fix a source stateuuand letKT=\|𝒞\(u\)\|K\_\{T\}=\|\\mathcal\{C\}\(u\)\|andKG=\|𝒞g\(u\)\|≥KTK\_\{G\}=\|\\mathcal\{C\}\_\{g\}\(u\)\|\\geq K\_\{T\}be the tree\- and graph\-group sizes\. Assume the per\-edge step\-reward estimates\{r^step\(e^\)\}\\\{\\hat\{r\}\_\{\\mathrm\{step\}\}\(\\hat\{e\}\)\\\}are conditionally independent within the group with a common variance boundσr2\\sigma^\{2\}\_\{r\}\. LetA^corTree\\hat\{A\}^\{\\mathrm\{Tree\}\}\_\{\\mathrm\{cor\}\}andA^corGraph\\hat\{A\}^\{\\mathrm\{Graph\}\}\_\{\\mathrm\{cor\}\}be the pre\-standardization centered advantagesr^step\(e\)−mean⁡\(⋅\)\\hat\{r\}\_\{\\mathrm\{step\}\}\(e\)\-\\operatorname\{mean\}\(\\cdot\)computed on the tree and graph groups, respectively\. Then

Var⁡\(A^corGraph\)≤\(1−1KG\)σr2≤\(1−1KT\)σr2=Var⁡\(A^corTree\),\\operatorname\{Var\}\\\!\\bigl\(\\hat\{A\}^\{\\mathrm\{Graph\}\}\_\{\\mathrm\{cor\}\}\\bigr\)\\;\\leq\\;\\Bigl\(1\-\\tfrac\{1\}\{K\_\{G\}\}\\Bigr\)\\sigma^\{2\}\_\{r\}\\;\\leq\\;\\Bigl\(1\-\\tfrac\{1\}\{K\_\{T\}\}\\Bigr\)\\sigma^\{2\}\_\{r\}=\\operatorname\{Var\}\\\!\\bigl\(\\hat\{A\}^\{\\mathrm\{Tree\}\}\_\{\\mathrm\{cor\}\}\\bigr\),\(S\.15\)with strict inequality wheneverKG\>KTK\_\{G\}\>K\_\{T\}\(i\.e\. at least one non\-trivial equivalence\-class member contributes shared descendants\)\.

###### Proof\.

For independent variatesX1,…,XKX\_\{1\},\\dots,X\_\{K\}with common varianceσ2\\sigma^\{2\},

Var⁡\(Xi−1K∑j=1KXj\)=\(1−1K\)2σ2\+K−1K2σ2=\(1−1K\)σ2,\\operatorname\{Var\}\\\!\\bigl\(X\_\{i\}\-\\tfrac\{1\}\{K\}\\textstyle\\sum\_\{j=1\}^\{K\}X\_\{j\}\\bigr\)=\\bigl\(1\-\\tfrac\{1\}\{K\}\\bigr\)^\{2\}\\sigma^\{2\}\+\\tfrac\{K\-1\}\{K^\{2\}\}\\sigma^\{2\}=\\bigl\(1\-\\tfrac\{1\}\{K\}\\bigr\)\\sigma^\{2\},\(S\.16\)which decreases monotonically inKK\. Applying this withK∈\{KT,KG\}K\\\!\\in\\\!\\\{K\_\{T\},K\_\{G\}\\\}and usingKG≥KTK\_\{G\}\\geq K\_\{T\}yields the inequality\. ∎

Proposition[1](https://arxiv.org/html/2606.18954#Thmproposition1)formalizes the dual\-group claim of Section[4\.4](https://arxiv.org/html/2606.18954#S4.SS4): by aggregating semantically equivalent paths into one correctness\-advantage group, GraphPO obtains a more concentrated baseline and standard deviation than tree\-based estimators that cannot reuse nodes, which directly lowers the variance ofAcorA\_\{\\mathrm\{cor\}\}\.

### F\.3Improved Exploration Efficiency

We now formalize the exploration\-efficiency claim of Section[3](https://arxiv.org/html/2606.18954#S3)\. Let the rollout budget beBBtokens and let𝒮B\\mathcal\{S\}\_\{B\}be the sampled reasoning steps\. Define the cumulative useful exploration as

G\(B\)=∑s∈𝒮B\|s\|𝟏\[scorrect\],G\(B\)=\\sum\_\{s\\in\\mathcal\{S\}\_\{B\}\}\|s\|\\mathbf\{1\}\[s\\text\{ correct\}\],\(S\.17\)soEff¯\(B\)=𝔼\[G\(B\)\]/B\\bar\{\\mathrm\{Eff\}\}\(B\)=\\mathbb\{E\}\[G\(B\)\]/Bis the expected per\-budget version of Eq\. \([1](https://arxiv.org/html/2606.18954#S3.E1)\) when the consumed budget isBB\. Letpredp\_\{\\mathrm\{red\}\}be the fraction of rollout budget spent on states that are both redundant and detected as semantically equivalent to a previously sampled state in the same layer\.

###### Proposition 2\(Budget reallocation gain\)\.

Suppose that, under tree\-based sampling, a fractionpredp\_\{\\mathrm\{red\}\}of the rollout budget is spent on frontier states that are identified as redundant by GraphPO’s equivalence detector\. Assume that the average marginal gain per unit budget on these redundant states is at mostΔr\\Delta\_\{r\}, while the average marginal gain per unit budget on novel frontier states receiving the reallocated budget is at leastΔn\\Delta\_\{n\}\. Suppose also that GraphPO redirects the budget saved by its halved\-budget rule to novel frontiers\. IfΔn\>Δr\\Delta\_\{n\}\>\\Delta\_\{r\}, then, ignoring integer rounding,

Eff¯Graph\(B\)−Eff¯Tree\(B\)≥12pred\(Δn−Δr\)\>0\.\\bar\{\\mathrm\{Eff\}\}\_\{\\mathrm\{Graph\}\}\(B\)\-\\bar\{\\mathrm\{Eff\}\}\_\{\\mathrm\{Tree\}\}\(B\)\\geq\\frac\{1\}\{2\}p\_\{\\mathrm\{red\}\}\(\\Delta\_\{n\}\-\\Delta\_\{r\}\)\>0\.

###### Proof\.

Under tree\-based sampling, a fractionpredp\_\{\\mathrm\{red\}\}of the budget is spent on redundant frontier states\. By assumption, the average marginal gain of this budget is at mostΔr\\Delta\_\{r\}\. The remaining budget is spent on non\-redundant frontier states\.

GraphPO applies the halved\-budget rule to detected redundant states\. Therefore, ignoring integer rounding, it saves a budget fraction12pred\\frac\{1\}\{2\}p\_\{\\mathrm\{red\}\}from redundant expansions\. This saved budget is redirected to novel frontier states, whose average marginal gain is at leastΔn\\Delta\_\{n\}by assumption\.

Compared with tree\-based sampling, GraphPO replaces a budget fraction12pred\\frac\{1\}\{2\}p\_\{\\mathrm\{red\}\}whose marginal gain is at mostΔr\\Delta\_\{r\}with the same budget fraction whose marginal gain is at leastΔn\\Delta\_\{n\}\. Hence the expected gain per unit budget improves by at least

12predΔn−12predΔr=12pred\(Δn−Δr\)\.\\frac\{1\}\{2\}p\_\{\\mathrm\{red\}\}\\Delta\_\{n\}\-\\frac\{1\}\{2\}p\_\{\\mathrm\{red\}\}\\Delta\_\{r\}=\\frac\{1\}\{2\}p\_\{\\mathrm\{red\}\}\(\\Delta\_\{n\}\-\\Delta\_\{r\}\)\.The improvement is strictly positive wheneverpred\>0p\_\{\\mathrm\{red\}\}\>0andΔn\>Δr\\Delta\_\{n\}\>\\Delta\_\{r\}\. ∎

###### Theorem 2\(Exploration coverage\)\.

LetNcov\(B\)N\_\{\\mathrm\{cov\}\}\(B\)be the number of*distinct*semantic states discovered with budgetBB\. Assume an average expansion costc¯\\bar\{c\}, and assume each expansion from a novel frontier discovers a new semantic state with probability at leastμ\>0\\mu\>0\. Under the same detected\-redundancy and saved\-budget reallocation conditions as Proposition[2](https://arxiv.org/html/2606.18954#Thmproposition2),

𝔼\[NcovGraph\(B\)\]≥𝔼\[NcovTree\(B\)\]\+μpredB2c¯\.\\mathbb\{E\}\\\!\\left\[N^\{\\mathrm\{Graph\}\}\_\{\\mathrm\{cov\}\}\(B\)\\right\]\\;\\geq\\;\\mathbb\{E\}\\\!\\left\[N^\{\\mathrm\{Tree\}\}\_\{\\mathrm\{cov\}\}\(B\)\\right\]\+\\frac\{\\mu p\_\{\\mathrm\{red\}\}B\}\{2\\bar\{c\}\}\.\(S\.18\)In the uniform\-yield case where𝔼\[NcovTree\(B\)\]=μ\(1−pred\)B/c¯\\mathbb\{E\}\[N^\{\\mathrm\{Tree\}\}\_\{\\mathrm\{cov\}\}\(B\)\]=\\mu\(1\-p\_\{\\mathrm\{red\}\}\)B/\\bar\{c\}andpred<1p\_\{\\mathrm\{red\}\}<1, this implies the multiplicative gain

𝔼\[NcovGraph\(B\)\]𝔼\[NcovTree\(B\)\]≥1\+pred2\(1−pred\)\.\\frac\{\\mathbb\{E\}\[N^\{\\mathrm\{Graph\}\}\_\{\\mathrm\{cov\}\}\(B\)\]\}\{\\mathbb\{E\}\[N^\{\\mathrm\{Tree\}\}\_\{\\mathrm\{cov\}\}\(B\)\]\}\\;\\geq\\;1\+\\frac\{p\_\{\\mathrm\{red\}\}\}\{2\(1\-p\_\{\\mathrm\{red\}\}\)\}\.\(S\.19\)

###### Proof\.

Each token a tree spends inside an already\-discovered equivalence class produces no new state\. GraphPO detects such tokens viasim≥κ\\operatorname\{sim\}\\\!\\geq\\\!\\kappaand reallocates half of the remaining budget to active novel frontiers\. The reallocated budget ispredB/2p\_\{\\mathrm\{red\}\}B/2, which corresponds topredB/\(2c¯\)p\_\{\\mathrm\{red\}\}B/\(2\\bar\{c\}\)additional frontier expansions in expectation\. Since each such expansion discovers a new state with probability at leastμ\\mu, the additive bound follows\. Dividing this bound by the uniform tree coverage expression gives the stated multiplicative form\. ∎

### F\.4Self\-Emergent Process Supervision \(PRM\-like signal\)

We now show that the objective optimized with GraphPO’s edge reward has the same first\-order policy\-update direction as a novelty\-gated oracle PRM surrogate, up to a controllableO\(1−κ\)O\(1\-\\kappa\)semantic\-merging bias\. The novelty factor1−η\(u,v\)1\-\\eta\(u,v\)is a non\-negative reweighting: it preserves the sign of the oracle advantage on informative steps and attenuates credit on semantically redundant ones \(where the oracle advantage is itself near zero by Assumption[1](https://arxiv.org/html/2606.18954#Thmassumption1)\)\. Thusrstepr\_\{\\mathrm\{step\}\}is not aiming at a different target than PRM; it is a sample\-efficient, novelty\-gated estimator of the same per\-edge improvement signal\.

###### Lemma 3\(Bias of node score\)\.

Under Assumptions[1](https://arxiv.org/html/2606.18954#Thmassumption1)–[2](https://arxiv.org/html/2606.18954#Thmassumption2),\|𝔼\[S\(v\)\]−V⋆\(v\)\|≤δκ\|\\mathbb\{E\}\[S\(v\)\]\-V^\{\\star\}\(v\)\|\\leq\\delta\_\{\\kappa\}for every nodevvwith positive pooled terminal count\.

###### Proof\.

Conditional on the rollout counts,S\(v\)S\(v\)is a convex combination∑q∈𝒬\(v\)βqV^q\\sum\_\{q\\in\\mathcal\{Q\}\(v\)\}\\beta\_\{q\}\\hat\{V\}\_\{q\}with weightsβq=wqtqown/\(tvown\+w∑q′≠vtq′own\)\\beta\_\{q\}=w\_\{q\}t\_\{q\}^\{\\mathrm\{own\}\}/\(t\_\{v\}^\{\\mathrm\{own\}\}\+w\\sum\_\{q^\{\\prime\}\\neq v\}t\_\{q^\{\\prime\}\}^\{\\mathrm\{own\}\}\)\(wv=1w\_\{v\}=1,wq=ww\_\{q\}=wforq≠vq\\neq v\), so∑qβq=1\\sum\_\{q\}\\beta\_\{q\}=1\.𝔼V^q=V⋆\(q\)\\mathbb\{E\}\\hat\{V\}\_\{q\}=V^\{\\star\}\(q\)and Assumption[1](https://arxiv.org/html/2606.18954#Thmassumption1)give\|V⋆\(q\)−V⋆\(v\)\|≤δκ\|V^\{\\star\}\(q\)\-V^\{\\star\}\(v\)\|\\leq\\delta\_\{\\kappa\}\. Hence\|𝔼\[S\(v\)\]−V⋆\(v\)\|≤∑qβq\|V⋆\(q\)−V⋆\(v\)\|≤δκ\|\\mathbb\{E\}\[S\(v\)\]\-V^\{\\star\}\(v\)\|\\leq\\sum\_\{q\}\\beta\_\{q\}\|V^\{\\star\}\(q\)\-V^\{\\star\}\(v\)\|\\leq\\delta\_\{\\kappa\}\. ∎

###### Theorem 3\(GraphPO matches the novelty\-gated PRM surrogate gradient\)\.

LetAstep⋆\(u,v\)=V⋆\(v\)−V⋆\(u\)A^\{\\star\}\_\{\\mathrm\{step\}\}\(u,v\)=V^\{\\star\}\(v\)\-V^\{\\star\}\(u\)denote the oracle process advantage an idealized PRM would assign to the stepu→vu\\\!\\to\\\!v, and letω\(u,v\)=1−η\(u,v\)∈\[0,1\]\\omega\(u,v\)=1\-\\eta\(u,v\)\\in\[0,1\]denote the novelty factor\. Letρθ\(u,v\)=πθ\(v∣u\)/πθold\(v∣u\)\\rho\_\{\\theta\}\(u,v\)=\\pi\_\{\\theta\}\(v\\mid u\)/\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(v\\mid u\)be the edge\-level shorthand for the token\-level importance ratios in Section[4\.5](https://arxiv.org/html/2606.18954#S4.SS5)\. Define the novelty\-gated oracle PRM surrogate and the GraphPO step\-reward surrogate as

𝒥PRMω⁣⋆\(θ\)\\displaystyle\\mathcal\{J\}^\{\\omega\\star\}\_\{\\mathrm\{PRM\}\}\(\\theta\)=𝔼τ,𝒢∼πθold\[∑\(u,v\)∈τρθ\(u,v\)ω\(u,v\)Astep⋆\(u,v\)\],\\displaystyle=\\mathbb\{E\}\_\{\\tau,\\mathcal\{G\}\\sim\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\}\\\!\\Big\[\\sum\_\{\(u,v\)\\in\\tau\}\\rho\_\{\\theta\}\(u,v\)\\,\\omega\(u,v\)A^\{\\star\}\_\{\\mathrm\{step\}\}\(u,v\)\\Big\],\(S\.20\)𝒥step\(θ\)\\displaystyle\\mathcal\{J\}\_\{\\mathrm\{step\}\}\(\\theta\)=𝔼τ,𝒢∼πθold\[∑\(u,v\)∈τρθ\(u,v\)rstep\(u,v\)\]\.\\displaystyle=\\mathbb\{E\}\_\{\\tau,\\mathcal\{G\}\\sim\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\}\\\!\\Big\[\\sum\_\{\(u,v\)\\in\\tau\}\\rho\_\{\\theta\}\(u,v\)\\,r\_\{\\mathrm\{step\}\}\(u,v\)\\Big\]\.Here𝒢\\mathcal\{G\}determinesS\(⋅\)S\(\\cdot\)and thereforerstepr\_\{\\mathrm\{step\}\}, and all rewards are detached during the policy update\. Under Assumptions[1](https://arxiv.org/html/2606.18954#Thmassumption1)–[2](https://arxiv.org/html/2606.18954#Thmassumption2), atθ=θold\\theta=\\theta\_\{\\mathrm\{old\}\},

∇θ𝒥step\(θ\)\|θold=∇θ𝒥PRMω⁣⋆\(θ\)\|θold\+O\(δκ\)𝐛\(θold\),\\nabla\_\{\\theta\}\\mathcal\{J\}\_\{\\mathrm\{step\}\}\(\\theta\)\\big\|\_\{\\theta\_\{\\mathrm\{old\}\}\}=\\nabla\_\{\\theta\}\\mathcal\{J\}^\{\\omega\\star\}\_\{\\mathrm\{PRM\}\}\(\\theta\)\\big\|\_\{\\theta\_\{\\mathrm\{old\}\}\}\+O\(\\delta\_\{\\kappa\}\)\\,\\mathbf\{b\}\(\\theta\_\{\\mathrm\{old\}\}\),\(S\.21\)with∥𝐛\(θold\)∥≤𝔼τ\[∑\(u,v\)∈τ∥∇θlogπθ\(v∣u\)∥\]\\\|\\mathbf\{b\}\(\\theta\_\{\\mathrm\{old\}\}\)\\\|\\leq\\mathbb\{E\}\_\{\\tau\}\\\!\\big\[\\sum\_\{\(u,v\)\\in\\tau\}\\\|\\nabla\_\{\\theta\}\\log\\pi\_\{\\theta\}\(v\\mid u\)\\\|\\big\]\. Two consequences follow\.*\(a\) Objective\-level alignment\.*The GraphPO objective trained withrstepr\_\{\\mathrm\{step\}\}has the same first\-order update direction as the novelty\-gated oracle PRM surrogate, up to theO\(δκ\)O\(\\delta\_\{\\kappa\}\)merging bias\. Compared with the ungated oracle PRM surrogate, each edge contribution is multiplied by the non\-negative scalarω\(u,v\)\\omega\(u,v\), so the update remains sign\-consistent with the oracle PRM while suppressing redundant steps\.*\(b\) Novelty\-gated denoising\.*For semantically equivalent endpoints \(η→1\\eta\\to 1,ω→0\\omega\\to 0\), Assumption[1](https://arxiv.org/html/2606.18954#Thmassumption1)already bounds\|Astep⋆\(u,v\)\|≤δκ\|A^\{\\star\}\_\{\\mathrm\{step\}\}\(u,v\)\|\\leq\\delta\_\{\\kappa\}, soω\\omegaattenuates edges on which the oracle credit is itselfO\(δκ\)O\(\\delta\_\{\\kappa\}\); for novel steps \(η≈0\\eta\\approx 0,ω≈1\\omega\\approx 1\), the per\-edge surrogate gradient coincides with the PRM gradient up toO\(δκ\)O\(\\delta\_\{\\kappa\}\)\. Finally, withmmterminal rollouts per descendant the MSE ofrstepr\_\{\\mathrm\{step\}\}as an estimator ofω\(u,v\)Astep⋆\(u,v\)\\omega\(u,v\)A^\{\\star\}\_\{\\mathrm\{step\}\}\(u,v\)satisfiesO\(\(ρu\+ρv\)/m\)\+O\(δκ2\)O\(\(\\rho\_\{u\}\+\\rho\_\{v\}\)/m\)\+O\(\\delta\_\{\\kappa\}^\{2\}\)\.

###### Proof\.

The proof follows the objective used in Section[4\.5](https://arxiv.org/html/2606.18954#S4.SS5)\. In one PPO/DAPO update, sampled graphs and their rewards are fixed, so differentiating the unclipped surrogate gives

∇θ𝒥step\(θ\)\|θold=𝔼τ,𝒢∼πθold\[∑\(u,v\)∈τrstep\(u,v\)∇θlog⁡πθ\(v∣u\)\|θold\],\\nabla\_\{\\theta\}\\mathcal\{J\}\_\{\\mathrm\{step\}\}\(\\theta\)\\big\|\_\{\\theta\_\{\\mathrm\{old\}\}\}=\\mathbb\{E\}\_\{\\tau,\\mathcal\{G\}\\sim\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\}\\\!\\left\[\\sum\_\{\(u,v\)\\in\\tau\}r\_\{\\mathrm\{step\}\}\(u,v\)\\nabla\_\{\\theta\}\\log\\pi\_\{\\theta\}\(v\\mid u\)\\big\|\_\{\\theta\_\{\\mathrm\{old\}\}\}\\right\],\(S\.22\)becauseρθold\(u,v\)=1\\rho\_\{\\theta\_\{\\mathrm\{old\}\}\}\(u,v\)=1and∇θρθ\(u,v\)\|θold=∇θlog⁡πθ\(v∣u\)\|θold\\nabla\_\{\\theta\}\\rho\_\{\\theta\}\(u,v\)\|\_\{\\theta\_\{\\mathrm\{old\}\}\}=\\nabla\_\{\\theta\}\\log\\pi\_\{\\theta\}\(v\\mid u\)\|\_\{\\theta\_\{\\mathrm\{old\}\}\}\. Conditioning on an edge’s endpoints marginalizes over the downstream rollout graph used to computeS\(u\)S\(u\)andS\(v\)S\(v\)\. Becauseω\(u,v\)\\omega\(u,v\)is a deterministic function of the endpoint embeddings and𝔼\[S\(w\)∣w\]=V⋆\(w\)\+O\(δκ\)\\mathbb\{E\}\[S\(w\)\\mid w\]=V^\{\\star\}\(w\)\+O\(\\delta\_\{\\kappa\}\)by Lemma[3](https://arxiv.org/html/2606.18954#Thmlemma3),

𝔼\[rstep\(u,v\)\|u,v\]=ω\(u,v\)\(𝔼\[S\(v\)∣v\]−𝔼\[S\(u\)∣u\]\)=ω\(u,v\)Astep⋆\(u,v\)\+O\(δκ\)\.\\mathbb\{E\}\\\!\\bigl\[r\_\{\\mathrm\{step\}\}\(u,v\)\\bigm\|u,v\\bigr\]=\\omega\(u,v\)\\bigl\(\\mathbb\{E\}\[S\(v\)\\mid v\]\-\\mathbb\{E\}\[S\(u\)\\mid u\]\\bigr\)=\\omega\(u,v\)\\,A^\{\\star\}\_\{\\mathrm\{step\}\}\(u,v\)\+O\(\\delta\_\{\\kappa\}\)\.\(S\.23\)Substituting this conditional expectation into the surrogate gradient gives Eq\. \([S\.21](https://arxiv.org/html/2606.18954#A6.E21)\); the bound on𝐛\(θold\)\\mathbf\{b\}\(\\theta\_\{\\mathrm\{old\}\}\)follows from the triangle inequality applied to theO\(δκ\)O\(\\delta\_\{\\kappa\}\)residual\. Claim \(a\) is immediate by comparing Eq\. \([S\.21](https://arxiv.org/html/2606.18954#A6.E21)\) with∇θ𝒥PRMω⁣⋆\(θ\)\|θold\\nabla\_\{\\theta\}\\mathcal\{J\}^\{\\omega\\star\}\_\{\\mathrm\{PRM\}\}\(\\theta\)\|\_\{\\theta\_\{\\mathrm\{old\}\}\}\. For \(b\), Assumption[1](https://arxiv.org/html/2606.18954#Thmassumption1)gives the stated bound on the oracle advantage inside any detected equivalence class, and the MSE bound follows from Lemma[2](https://arxiv.org/html/2606.18954#Thmlemma2)applied to the two endpoint scores together withω\(u,v\)≤1\\omega\(u,v\)\\leq 1\. ∎

Theorem[3](https://arxiv.org/html/2606.18954#Thmtheorem3)formalizes the “self\-emergent process supervision” claim: GraphPO recovers, in expectation, the first\-order update direction of a novelty\-gated oracle PRM surrogate solely from outcome rewardsR\(z\)∈\{0,1\}R\(z\)\\\!\\in\\\!\\\{0,1\\\}, without any annotated step labels or auxiliary value model\. The factorω\(u,v\)\\omega\(u,v\)is not a deviation from the PRM target but a principled non\-negative reweighting that suppresses noise on semantically redundant steps where the oracle credit vanishes\.

### F\.5Inference Efficiency

Finally, we analyze the effect of the efficiency advantageAeffA\_\{\\mathrm\{eff\}\}on the length of correct reasoning paths produced by the trained policy\.

###### Lemma 4\(Path\-length gradient\)\.

Letℓv\\ell\_\{v\}be the number of generated edges from the deepest common predecessoraQa\_\{Q\}tov∈Qv\\\!\\in\\\!Q, and letσ~Q=std⁡\(\{ℓq∣q∈Q\}\)\\tilde\{\\sigma\}\_\{Q\}=\\operatorname\{std\}\(\\\{\\ell\_\{q\}\\mid q\\in Q\\\}\)\. The dual\-group gradient contribution of the efficiency term is

∇θ𝒥eff\(θ\)=𝔼\[∑v∈QS¯Qℓ¯Q−ℓvσ~Q∇θlog⁡πθ\(pathv\)\],\\nabla\_\{\\theta\}\\mathcal\{J\}^\{\\mathrm\{eff\}\}\(\\theta\)=\\mathbb\{E\}\\\!\\left\[\\sum\_\{v\\in Q\}\\bar\{S\}\_\{Q\}\\,\\frac\{\\bar\{\\ell\}\_\{Q\}\-\\ell\_\{v\}\}\{\\tilde\{\\sigma\}\_\{Q\}\}\\,\\nabla\_\{\\theta\}\\log\\pi\_\{\\theta\}\(\\mathrm\{path\}\_\{v\}\)\\right\],\(S\.24\)withℓ¯Q\\bar\{\\ell\}\_\{Q\}the mean of\{ℓq\}q∈Q\\\{\\ell\_\{q\}\\\}\_\{q\\in Q\}\.

###### Proof\.

Differentiating𝒥GraphPO\\mathcal\{J\}\_\{\\mathrm\{GraphPO\}\}w\.r\.t\.θ\\thetaon the efficiency term and applying the policy\-gradient identity to each edgee∈pathve\\\!\\in\\\!\\mathrm\{path\}\_\{v\}yields the above; the uniform allocation ofAeff\(v\)A\_\{\\mathrm\{eff\}\}\(v\)across edges ofpathv\\mathrm\{path\}\_\{v\}collects into a singlelog⁡πθ\(pathv\)\\log\\pi\_\{\\theta\}\(\\mathrm\{path\}\_\{v\}\)term\. ∎

###### Theorem 4\(Efficiency term selects shorter equivalent paths\)\.

Consider an equivalence classQQwhose members are reached by correct paths of different lengths from the shared predecessoraQa\_\{Q\}, withS¯Q\>0\\bar\{S\}\_\{Q\}\>0\. Letπout⋆\\pi^\{\\star\}\_\{\\mathrm\{out\}\}be any policy that maximizes the outcome\-only objective \(λeff=0\\lambda\_\{\\mathrm\{eff\}\}=0\), and letπeff⋆\\pi^\{\\star\}\_\{\\mathrm\{eff\}\}be any maximizer after adding the efficiency term with a smallλeff\>0\\lambda\_\{\\mathrm\{eff\}\}\>0that does not change which paths are outcome\-optimal\. Thenπeff⋆\\pi^\{\\star\}\_\{\\mathrm\{eff\}\}places all outcome\-optimal probability on the shortest paths inQQ, and wheneverπout⋆\\pi^\{\\star\}\_\{\\mathrm\{out\}\}keeps positive mass on a longer path inQQ,

𝔼τ∼πeff⋆\[T\(τ\)∣R\(τ\)=1,τreachesQ\]<𝔼τ∼πout⋆\[T\(τ\)∣R\(τ\)=1,τreachesQ\],\\mathbb\{E\}\_\{\\tau\\sim\\pi^\{\\star\}\_\{\\mathrm\{eff\}\}\}\[T\(\\tau\)\\mid R\(\\tau\)=1,\\,\\tau\\text\{ reaches \}Q\]\\;<\\;\\mathbb\{E\}\_\{\\tau\\sim\\pi^\{\\star\}\_\{\\mathrm\{out\}\}\}\[T\(\\tau\)\\mid R\(\\tau\)=1,\\,\\tau\\text\{ reaches \}Q\],\(S\.25\)whereT\(τ\)T\(\\tau\)counts reasoning steps\.

###### Proof\.

All paths inQQhave the same outcome reward, soπout⋆\\pi^\{\\star\}\_\{\\mathrm\{out\}\}is free to spread mass across them\. The efficiency advantageAeff\(v\)=S¯Q\(ℓ¯Q−ℓv\)/σ~QA\_\{\\mathrm\{eff\}\}\(v\)=\\bar\{S\}\_\{Q\}\(\\bar\{\\ell\}\_\{Q\}\-\\ell\_\{v\}\)/\\tilde\{\\sigma\}\_\{Q\}is strictly larger for shorter paths whenS¯Q\>0\\bar\{S\}\_\{Q\}\>0, so maximizing it within the outcome\-optimal set selects the shortest paths only\. By Lemma[4](https://arxiv.org/html/2606.18954#Thmlemma4), the policy gradient drives probability from longer paths \(ℓv\>ℓ¯Q\\ell\_\{v\}\>\\bar\{\\ell\}\_\{Q\}\) to shorter ones \(ℓv<ℓ¯Q\\ell\_\{v\}<\\bar\{\\ell\}\_\{Q\}\)\. Anyπout⋆\\pi^\{\\star\}\_\{\\mathrm\{out\}\}that keeps mass on a longer path therefore has strictly larger conditional expected length onQQ\. ∎

This theorem characterizes the optimal\-policy limit\. During finite\-sample optimization, the efficiency term should be interpreted as concentrating probability toward shorter equivalent correct paths rather than guaranteeing all mass on them at every update\.

###### Corollary 2\(Lower expected token cost\)\.

If the expected token count of a path is monotone in its number of reasoning steps and strictly increasing for the longer paths in Theorem[4](https://arxiv.org/html/2606.18954#Thmtheorem4), thenπeff⋆\\pi^\{\\star\}\_\{\\mathrm\{eff\}\}uses no more expected tokens thanπout⋆\\pi^\{\\star\}\_\{\\mathrm\{out\}\}, and strictly fewer wheneverπout⋆\\pi^\{\\star\}\_\{\\mathrm\{out\}\}assigns positive mass to one of those longer correct paths\.

#### Discussion\.

Theorems[1](https://arxiv.org/html/2606.18954#Thmtheorem1),[2](https://arxiv.org/html/2606.18954#Thmtheorem2),[3](https://arxiv.org/html/2606.18954#Thmtheorem3), and[4](https://arxiv.org/html/2606.18954#Thmtheorem4)together support the four claims of the main paper: GraphPO \(i\) yields lower\-variance node\-score and advantage estimates than tree\-based baselines that cannot reuse nodes, by pooling downstream evidence from all members of each equivalence class into a single advantage\-estimation group, with a controllable semantic\-merging bias; \(ii\) improves expected exploration efficiency under a fixed budget when detected redundant\-budget savings are redirected to novel frontiers; \(iii\) induces, in expectation, the same first\-order update direction as a novelty\-gated oracle PRM surrogate—equivalently, an oracle PRM update non\-negatively reweighted by the novelty factor—yielding self\-emergent process supervision purely from outcome rewards; and \(iv\) uses the efficiency term to select shorter equivalent correct reasoning paths, thereby improving inference efficiency under the stated tie\-breaking and token\-cost conditions\.

## Appendix GRobustness of Equivalence Detection

The graph construction of GraphPO relies on two components that are in principle noisy: \(i\) structured summaries produced by a frozen extractor, and \(ii\) cosine similarity between embeddings under a thresholdκ\\kappa\. Detection errors split into*false positives*\(FP, merging non\-equivalent states\) and*false negatives*\(FN, missing a true equivalence\)\. This section shows that GraphPO’s advantage estimator, edge reward, and policy gradient are robust to both, formalizes the asymmetric degradation of the two error modes, and connects the bounds to the empirical ablations already reported in the paper\.

### G\.1False Positives are Bounded by the Merging Threshold

Assumption[1](https://arxiv.org/html/2606.18954#Thmassumption1)is a*class\-level*stability condition on whatever classes the detector actually returns: whenever cosine similarity aboveκ\\kappaimplies that members of a detected class share values up toδκ=O\(1−κ\)\\delta\_\{\\kappa\}=O\(1\-\\kappa\), every bias term in our analysis reduces to anO\(δκ\)O\(\\delta\_\{\\kappa\}\)residual\. Concretely:

- •Node score bias\.Lemma[3](https://arxiv.org/html/2606.18954#Thmlemma3)gives\|𝔼\[S\(v\)\]−V⋆\(v\)\|≤δκ\|\\mathbb\{E\}\[S\(v\)\]\-V^\{\\star\}\(v\)\|\\leq\\delta\_\{\\kappa\}\.
- •Edge\-reward bias\.Corollary[1](https://arxiv.org/html/2606.18954#Thmcorollary1)bounds the squared bias ofr^step\\hat\{r\}\_\{\\mathrm\{step\}\}by\(1−η\(u,v\)\)2\(2δκ\)2\{\(1\-\\eta\(u,v\)\)\}^\{2\}\{\(2\\delta\_\{\\kappa\}\)\}^\{2\}\.
- •Gradient alignment\.Theorem[3](https://arxiv.org/html/2606.18954#Thmtheorem3)shows the GraphPO surrogate has the same first\-order update direction as the novelty\-gated oracle PRM up to anO\(δκ\)O\(\\delta\_\{\\kappa\}\)residual\.

The thresholdκ\\kappais therefore a direct knob that monotonically shrinks the worst\-case false\-positive bias: raisingκ\\kappabyϵ\\epsilonshrinksδκ\\delta\_\{\\kappa\}byO\(ϵ\)O\(\\epsilon\)and every bias term above by the same factor\.

#### Self\-correcting novelty gate\.

The edge reward carries a multiplicative novelty factorω\(u,v\)=1−η\(u,v\)\\omega\(u,v\)=1\-\\eta\(u,v\)that measures semantic change along the step\. A false\-positive merge inflatesη\(u,v\)\\eta\(u,v\)toward11, soω\(u,v\)→0\\omega\(u,v\)\\to 0on exactly those edges the detector has spuriously treated as equivalent\. The incorrect credit is attenuated*at the same gate that caused the mistake*, without requiring the training loop to detect the error\. Assumption[1](https://arxiv.org/html/2606.18954#Thmassumption1)further guarantees that the oracle advantage on any over\-merged edge is itselfO\(δκ\)O\(\\delta\_\{\\kappa\}\), so the attenuated edge would not contribute meaningful gradient signal in the first place\.

### G\.2False Negatives Degrade Gracefully to the Tree Baseline

Missed equivalences are even easier to control: when two truly equivalent states are kept as separate nodes, the pooling weight vector collapses to a single non\-zero entry and the GraphPO estimator reduces exactly to its tree counterpart at that node\.

###### Proposition 3\(False negatives are strictly safe\)\.

Under Assumptions[1](https://arxiv.org/html/2606.18954#Thmassumption1)–[3](https://arxiv.org/html/2606.18954#Thmassumption3), suppose the detector returns singleton classes𝒬\(v\)=\{v\}\\mathcal\{Q\}\(v\)=\\\{v\\\}for a subsetℱ\\mathcal\{F\}of nodes, regardless of the true semantics of those nodes\. Then for everyv∈ℱv\\in\\mathcal\{F\}: \(i\)V^vQ=V^v\\hat\{V\}^\{Q\}\_\{v\}=\\hat\{V\}\_\{v\}is unbiased forV⋆\(v\)V^\{\\star\}\(v\); \(ii\)Var⁡\(V^vQ\)=Var⁡\(V^vTree\)\\operatorname\{Var\}\(\\hat\{V\}^\{Q\}\_\{v\}\)=\\operatorname\{Var\}\(\\hat\{V\}^\{\\mathrm\{Tree\}\}\_\{v\}\); \(iii\) the downstream quantitiesrstepr\_\{\\mathrm\{step\}\},AcorA\_\{\\mathrm\{cor\}\}, andAeffA\_\{\\mathrm\{eff\}\}atvvcoincide with the tree baseline\. Consequently, false negatives never introduce bias and never increase variance beyond the tree baseline; they only forfeit the variance reduction that Theorem[1](https://arxiv.org/html/2606.18954#Thmtheorem1)would have delivered under correct detection\.

###### Proof\.

Settingkv=1k\_\{v\}=1in Lemma[2](https://arxiv.org/html/2606.18954#Thmlemma2)givesρv=1\\rho\_\{v\}=1andV^vQ=V^v\\hat\{V\}^\{Q\}\_\{v\}=\\hat\{V\}\_\{v\}, which is unbiased by Lemma[1](https://arxiv.org/html/2606.18954#Thmlemma1)and has varianceV⋆\(v\)\(1−V⋆\(v\)\)/mV^\{\\star\}\(v\)\(1\-V^\{\\star\}\(v\)\)/m\. All downstream quantities are deterministic functions ofS\(⋅\)S\(\\cdot\), so they inherit the tree semantics atv∈ℱv\\in\\mathcal\{F\}\. ∎

Proposition[3](https://arxiv.org/html/2606.18954#Thmproposition3)formalizes the asymmetry that makes GraphPO robust in practice: false negatives degrade gracefully to tree\-level estimates, while false\-positive damage is controlled by the sameδκ\\delta\_\{\\kappa\}that the thresholdκ\\kappadirectly tunes\. No combination of detection errors can push GraphPO below the tree baseline by more than anO\(δκ2\)O\(\\delta\_\{\\kappa\}^\{2\}\)bias term, which disappears asκ→1\\kappa\\to 1\.

### G\.3Empirical Robustness

The theoretical bounds are validated by two complementary ablations already included in the paper\. First, Figure[6](https://arxiv.org/html/2606.18954#S5.F6)shows that GraphPO’s accuracy is stable across a wide range of merging thresholds, with a broad optimum nearκ=0\.92\\kappa=0\.92and a monotone fallback to tree performance asκ→1\\kappa\\to 1\(no merges\)\. This matches Proposition[3](https://arxiv.org/html/2606.18954#Thmproposition3): the high\-κ\\kappalimit recovers the tree estimator rather than degrading below it\. Second, Figure[S\.2](https://arxiv.org/html/2606.18954#A5.F2)shows that the pooling coefficientwwyields a smooth curve with a plateau aroundw=0\.7w=0\.7;w=0w=0\(no pooling\) again recovers tree performance, confirming that equivalence detection affects the estimator only through the two tunable knobs\(w,κ\)\(w,\\kappa\), both of which admit safe defaults\. The absence of a cliff in either ablation confirms that the method does not rely on near\-perfect detection; it only requires the detector to be*asymmetrically tuned for precision*\(largeκ\\kappa\), which is exactly the regime Proposition[3](https://arxiv.org/html/2606.18954#Thmproposition3)identifies as optimal\.

#### Engineering safeguards\.

Three design choices further harden detection against noisy summaries or embeddings\. \(1\)*Structured summaries\.*Nodes are embedded from structured summaries of the full edge path produced by a frozen extractor \(Section[4\.2](https://arxiv.org/html/2606.18954#S4.SS2)\), rather than raw token strings\. This removes surface variation \(ordering, wording, formatting\) before similarity is computed, tighteningδκ\\delta\_\{\\kappa\}at any fixedκ\\kappa\. \(2\)*Non\-causal restriction\.*Merges are only formed between non\-causal pairs \(Assumption[3](https://arxiv.org/html/2606.18954#Thmassumption3)\), which eliminates ancestor\-descendant collapse by construction and preserves the DAG needed for bottom\-up propagation\. \(3\)*Class\-level stability\.*Assumption[1](https://arxiv.org/html/2606.18954#Thmassumption1)is stated at the class level rather than the pairwise level, so transitive closures of high\-similarity edges do not implicitly claim pairwise closeness beyond what the detected class itself satisfies\. Taken together with the novelty gateω\(u,v\)\\omega\(u,v\), these safeguards ensure that the dependence on summaries and embeddings translates into at most anO\(1−κ\)O\(1\-\\kappa\)bias term, which the ablation in Figure[6](https://arxiv.org/html/2606.18954#S5.F6)shows is empirically benign at the recommended threshold\.

## Appendix HLimitations and Future Work

Our experiments mainly focus on reasoning and agent tasks\. Although the graph formulation is general, we have not yet systematically studied its effectiveness in broader settings, such as multimodal reasoning, code generation with execution feedback, and long\-horizon interactive environments\.

## Appendix IBroader Impacts

GraphPO is a reinforcement learning method for improving reasoning models under verifiable rewards\. Its main positive impact is to improve rollout utilization and reduce redundant reasoning, which can lower the amount of computation and process\-level annotation needed for training strong reasoning models\. We focus exclusively on reinforcement learning, presenting no potential ethical risks\.
GraphPO: Graph-based Policy Optimization for Reasoning Models

Similar Articles

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios

GAGPO: Generalized Advantage Grouped Policy Optimization

SocraticPO: Policy Optimization via Interactive Guidance

Structured Role-Aware Policy Optimization for Multimodal Reasoning

Submit Feedback

Similar Articles

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios
GAGPO: Generalized Advantage Grouped Policy Optimization
SocraticPO: Policy Optimization via Interactive Guidance
Structured Role-Aware Policy Optimization for Multimodal Reasoning