ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents

arXiv cs.CL Papers

Summary

ARBOR introduces a reusable rubric buffer to provide online process rewards for LLM-based search agents, improving training efficiency when outcome-only rewards are insufficient. It outperforms GRPO and DAPO on multi-hop QA benchmarks, converting up to 42% of zero-gradient training groups into informative ones.

arXiv:2606.03239v1 Announce Type: new Abstract: LLM-based search agents are trained predominantly with outcome-only reward, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups where all sampled trajectories share the same correctness, yielding zero within-group advantage and no gradient. Existing process supervision either trains a costly verifier or generates per-query rubrics that are inconsistent across queries and discarded after one use. We propose ARBOR (Adaptive Rubric Buffer for Online Reward), a reusable process-reward framework that maintains a rubric memory shared across queries. Query-local drafts induced from contrastive trajectories are admitted, consolidated into cross-query common rubrics, and retired as the policy evolves. A small active subset of common rubrics scores trajectories via sparse pairwise judging, and the resulting scores are added to the base reward, providing process-level gradient even when outcome reward is uniform. ARBOR consistently outperforms GRPO and DAPO baselines on four multi-hop QA benchmarks, raising average LLM-judge accuracy by up to 4.2 points and converting up to 42% of otherwise-zero-gradient training groups into informative ones.

Cached at: 06/03/26, 09:37 AM

# Online Process Rewards via a Reusable Rubric Buffer for Search Agents
Source: [https://arxiv.org/html/2606.03239](https://arxiv.org/html/2606.03239)
Zheng Liu1,∗, Longxiang Zhang2, Xintong Wang2, Zhiang Xu2, Shaoxiong Zhan1, Xin Shan3, Wen Huang1, Tao Dai4,†, Shu-Tao Xia1, Chengfu Huo2, Liang Ding2,†

1 Tsinghua University, 2 Alibaba Group, 3 Peking University, 4 Shenzhen University

liu-z24@mails.tsinghua.edu.cn, {daitao.edu, liangding.liam}@gmail.com

###### Abstract

LLM-based search agents are trained predominantly with outcome-only reward, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups where all sampled trajectories share the same correctness, yielding zero within-group advantage and no gradient. Existing process supervision either trains a costly verifier or generates per-query rubrics that are inconsistent across queries and discarded after one use. We propose ARBOR (Adaptive Rubric Buffer for Online Reward), a reusable process-reward framework that maintains a rubric memory shared across queries. Query-local drafts induced from contrastive trajectories are admitted, consolidated into cross-query common rubrics, and retired as the policy evolves. A small active subset of common rubrics scores trajectories via sparse pairwise judging, and the resulting scores are added to the base reward, providing process-level gradient even when outcome reward is uniform. ARBOR consistently outperforms GRPO and DAPO baselines on four multi-hop QA benchmarks, raising average LLM-judge accuracy by up to 4.2 points and converting up to 42% of otherwise-zero-gradient training groups into informative ones.¹

¹ We will release our code and models after the review process.


∗ Work done during an internship at Alibaba. † Corresponding authors.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.03239v1/x1.png)

Figure 1: Process quality divergence under identical outcomes. Two trajectories from the same query reach the same answer yet differ markedly in search efficiency.

LLM-based agents that interact with external environments under reasoning-and-acting paradigms such as ReAct (Yao et al., [2023](https://arxiv.org/html/2606.03239#bib.bib8)) have become a standard approach to complex tasks. A representative case is the search agent (Press et al., [2023](https://arxiv.org/html/2606.03239#bib.bib6); Xi et al., [2025](https://arxiv.org/html/2606.03239#bib.bib2)), which answers multi-hop questions requiring external knowledge by iteratively rewriting queries, retrieving evidence, filtering observations, and integrating them into a final answer. This interaction pattern lets search agents substantially outperform LLMs that answer directly on multi-hop QA and other complex information-retrieval tasks. Recent systems such as Search-R1 (Jin et al., [2025](https://arxiv.org/html/2606.03239#bib.bib20)) and R1-Searcher (Song et al., [2025](https://arxiv.org/html/2606.03239#bib.bib21)), together with other search-agent RL studies (Li et al., [2025](https://arxiv.org/html/2606.03239#bib.bib14); Jiang et al., [2025](https://arxiv.org/html/2606.03239#bib.bib16)), further show that RL is an effective way to push the capability ceiling of search agents and has become the dominant paradigm for their training. Within this paradigm, the RL stage relies almost exclusively on outcome-only reward, using final-answer correctness as the reward signal together with a format penalty (Shao et al., [2024](https://arxiv.org/html/2606.03239#bib.bib43); Yu et al., [2025](https://arxiv.org/html/2606.03239#bib.bib44)), and provides no supervision on the search process itself.

Trajectories sampled from the same query can follow very different search paths even when they arrive at the same outcome, as illustrated in Figure [1](https://arxiv.org/html/2606.03239#S1.F1): one may reason carefully through targeted retrieval while another stumbles onto the answer after redundant detours, but final-answer correctness assigns them identical reward. Under group-relative objectives such as GRPO (Shao et al., [2024](https://arxiv.org/html/2606.03239#bib.bib43)), identical within-group reward produces zero relative advantage and no policy gradient, so process differences that could inform better search behavior contribute nothing to training. Such outcome-homogeneous groups are far from rare in search-agent RL training (see Section [4.3](https://arxiv.org/html/2606.03239#S4.SS3)), making them a major bottleneck for outcome-only reward.
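To make the zero-gradient case concrete, the following minimal sketch standardizes rewards within one query-group, which is the general shape of a group-relative advantage. The function name and the `eps` stabilizer are illustrative assumptions rather than details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one query-group (GRPO-style advantage).

    `eps` is an assumed numerical stabilizer for the zero-variance case.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Outcome-homogeneous group: every trajectory is correct, so all advantages vanish.
homogeneous = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed group: correct trajectories get positive advantage, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

When every advantage in a group is zero, the group contributes nothing to the policy update, regardless of how different the underlying search trajectories are.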

Adding process-level supervision is a natural response, yet existing routes do not fit search agents cleanly. Training a process reward model (PRM) requires rollout-based annotation or value estimation over intermediate reasoning states (Lightman et al., [2024](https://arxiv.org/html/2606.03239#bib.bib30); Wang et al., [2024](https://arxiv.org/html/2606.03239#bib.bib31); Luo et al., [2024](https://arxiv.org/html/2606.03239#bib.bib32); Cui et al., [2025](https://arxiv.org/html/2606.03239#bib.bib33)); in search agents, this can require rolling back from intermediate states and invoking search APIs at prohibitive cost, while forcing the inherently qualitative nature of search behavior into discrete step-correctness labels. Query-specific LLM-generated rubrics, exemplified by Rubrics-as-Rewards (Gunjal et al., [2026](https://arxiv.org/html/2606.03239#bib.bib34)), sidestep verifier training but produce inconsistent criteria across queries and are discarded after one use, so they cannot stably reflect process regularities across queries or evolve with the policy.

These limitations point to three properties that a process reward suited to search\-agent RL should possess\. First, it should supervise the search process itself, supplying learning signal for the process quality that outcome\-only reward overlooks; this property matters most on outcome\-homogeneous groups, where process supervision is the only available within\-group signal\. Second, the process criteria should be general, cross\-query reusable standards, rather than query\-specific rubrics that may conflict across queries\. Third, the effectiveness of process criteria decays as the policy’s behavior distribution evolves, so the criteria themselves must be updated continuously rather than fixed once\.

We propose ARBOR (Adaptive Rubric Buffer for Online Reward), a reusable process-reward framework for search-agent RL training. The core component is a rubric memory consisting of a candidate pool, which holds query-local drafts induced from contrastive trajectories within a query-group, and a common pool, which stores rubrics consolidated into reusable cross-query process criteria. An online lifecycle of admission, consolidation, and retirement consolidates candidate drafts into common rubrics and retires stale ones, so that the common pool provides a unified criterion across queries and evolves with the policy's behavior distribution. Reward shaping invokes only a small active subset of common rubrics. Trajectories within a query-group are scored pairwise under each active rubric, and the resulting scores are added to the base reward. The reusable common pool remains effective even on outcome-homogeneous groups, where it still yields within-group process discrimination that outcome-only reward cannot. Figure [2](https://arxiv.org/html/2606.03239#S1.F2) shows the overall framework.

Our contributions are as follows: (1) we propose ARBOR, a reusable process-reward framework that provides within-group process supervision even when outcome-only reward yields zero gradient; (2) we design a rubric memory with an online admission, consolidation, and retirement lifecycle that maintains consistent cross-query process criteria and evolves with the policy; and (3) ARBOR consistently outperforms GRPO and DAPO across three Qwen3 scales on four multi-hop QA benchmarks, improving average LLM-judge accuracy by up to 4.2 points and converting up to 42% of outcome-homogeneous groups into ones with nonzero reward variance.

![Refer to caption](https://arxiv.org/html/2606.03239v1/x2.png)

Figure 2: Overview of ARBOR. (a) Contrastive induction extracts query-local draft rubrics from trajectories within a query-group. (b) The rubric buffer $\mathcal{M}$ admits drafts into a candidate pool $\mathcal{D}$, consolidates them into a common pool $\mathcal{P}$, and retires stale rubrics, forming an online admission–consolidation–retirement lifecycle. (c) At each step, two active common rubrics are selected and used to score trajectories via sparse pairwise scoring, and the centered rubric scores are added to the base reward.

## 2 Related Work

### 2.1 RL Reward Design for Search Agents

Reward design for RL-trained search agents has evolved along three lines. The dominant paradigm relies on outcome-only reward, using final-answer correctness or F1 plus format penalties, and leaves the search process entirely unsupervised. Search-R1 (Jin et al., [2025](https://arxiv.org/html/2606.03239#bib.bib20)), R1-Searcher (Song et al., [2025](https://arxiv.org/html/2606.03239#bib.bib21)), and Search-o1 (Li et al., [2025](https://arxiv.org/html/2606.03239#bib.bib14)) are representative of this approach. A second line augments outcome reward with task-specific process heuristics such as information gain, path coverage, or retrieval cost, as in StepSearch (Zheng et al., [2025](https://arxiv.org/html/2606.03239#bib.bib22)), Search-P1 (Xia et al., [2026](https://arxiv.org/html/2606.03239#bib.bib23)), SIGHT (Zhong et al., [2026](https://arxiv.org/html/2606.03239#bib.bib24)), InfoFlow (Luo et al., [2025](https://arxiv.org/html/2606.03239#bib.bib25)), and TIPS (Xie et al., [2026b](https://arxiv.org/html/2606.03239#bib.bib26)), but these metrics cannot capture qualitative aspects of search strategy. A third line trains a process reward model (PRM) to provide step-level feedback, with supervision constructed in different ways: PRM800K (Lightman et al., [2024](https://arxiv.org/html/2606.03239#bib.bib30)) uses human annotation, Math-Shepherd (Wang et al., [2024](https://arxiv.org/html/2606.03239#bib.bib31)) and OmegaPRM (Luo et al., [2024](https://arxiv.org/html/2606.03239#bib.bib32)) use rollout-value estimation, and PRIME (Cui et al., [2025](https://arxiv.org/html/2606.03239#bib.bib33)) infers from outcomes. In the search-agent setting, representative attempts include PPR (Xu et al., [2025](https://arxiv.org/html/2606.03239#bib.bib28)) with pre-defined principles and a category-aware PRM, ReasonRAG (Zhang et al., [2025b](https://arxiv.org/html/2606.03239#bib.bib29)) with MCTS-constructed step-level annotations followed by process-supervised DPO, and LeTS (Zhang et al., [2025a](https://arxiv.org/html/2606.03239#bib.bib27)) with a mixture of stepwise process reward and outcome reward.

Existing work either ignores the quality of the search process entirely, reduces it to quantifiable but semantically shallow domain metrics, or relies on a separately trained verifier\. Against this background, ARBOR provides process\-level feedback that continues to discriminate within a query\-group even when its outcomes are homogeneous, restoring a learning signal precisely where outcome\-only reward fails\.

### 2.2 Rubric-Based Reward Signals

While rubric-conditioned evaluators such as Prometheus (Kim et al., [2024](https://arxiv.org/html/2606.03239#bib.bib11)) and LLM-Rubric (Hashemi et al., [2024](https://arxiv.org/html/2606.03239#bib.bib12)) use predefined rubrics to structure LLM-as-a-judge assessment, recent work applies rubrics directly as RL reward signals. Most methods generate rubrics per query without cross-query sharing, which may cause conflicts across queries. Rubrics-as-Rewards (Gunjal et al., [2026](https://arxiv.org/html/2606.03239#bib.bib34)) uses static query-specific checklists as on-policy rewards; similar per-query approaches include Wang et al. ([2025](https://arxiv.org/html/2606.03239#bib.bib35)), He et al. ([2025](https://arxiv.org/html/2606.03239#bib.bib36)), and Zhou et al. ([2026](https://arxiv.org/html/2606.03239#bib.bib37)). Several systems further evolve rubrics or their generators during training (Xu et al., [2026](https://arxiv.org/html/2606.03239#bib.bib40); Sheng et al., [2026](https://arxiv.org/html/2606.03239#bib.bib41); Shao et al., [2026](https://arxiv.org/html/2606.03239#bib.bib42)), addressing staleness but still without sharing across queries. Auto-Rubric (Xie et al., [2026a](https://arxiv.org/html/2606.03239#bib.bib38)), AdaRubric (Ding, [2026](https://arxiv.org/html/2606.03239#bib.bib1)), and OpenRS (Jia et al., [2026](https://arxiv.org/html/2606.03239#bib.bib39)) achieve reusability across instances through offline rubric generation, but do not co-evolve with the policy during RL training.

ARBOR combines cross\-query reusability with online adaptation: a persistent rubric buffer consolidates query\-local drafts into shared common rubrics and continuously retires outdated ones, keeping process standards consistent across queries and aligned with the evolving policy\.

## 3 Method

### 3.1 Overview and Problem Setup

We consider RL training for search agents. Given a query $q$, the policy produces a trajectory $\tau$ through multiple rounds of think, search, and observe interactions and finally outputs an answer. At each training step, we sample $K$ trajectories from the same query to form a query-group $\mathcal{G}_q = \{\tau_1, \ldots, \tau_K\}$, where trajectories share the same query and environment and diverge only through policy sampling.

During training, ARBOR maintains a single rubric memory $\mathcal{M}$ shared across all queries. At every step, ARBOR uses a small number of currently active natural-language process rubrics from $\mathcal{M}$ to score the trajectories in $\mathcal{G}_q$ along process dimensions, and the resulting score is additively combined with the existing RL reward as an auxiliary process-level signal for the policy optimizer. $\mathcal{M}$ itself evolves throughout training under an online admission, consolidation, and retirement lifecycle, with its contents updated in step with the policy behavior distribution.

### 3.2 Contrastive Local Rubric Induction

ARBOR induces query-local draft rubrics from each query-group $\mathcal{G}_q$ through contrastive induction. Let $F_1(\tau)$ denote the token-level F1 of trajectory $\tau$ against the gold answer. We select the highest-scoring trajectory as a positive anchor, the lowest-scoring trajectory as a worst-case negative, and the strongest remaining trajectory as a hard negative:

$$
\begin{aligned}
\tau^{+} &= \arg\max_{\tau \in \mathcal{G}_q} F_1(\tau), \\
\tau^{-}_{\text{worst}} &= \arg\min_{\tau \in \mathcal{G}_q} F_1(\tau), \\
\tau^{-}_{\text{hard}} &= \arg\max_{\tau \in \mathcal{G}_q \setminus \{\tau^{+}\}} F_1(\tau).
\end{aligned} \tag{1}
$$

These anchors define two complementary contrasts. The pair $(\tau^{+}, \tau^{-}_{\text{worst}})$ exposes large-scale success-failure differences and surfaces critical process deviations, whereas $(\tau^{+}, \tau^{-}_{\text{hard}})$ provides a finer-grained comparison between trajectories of similar correctness and isolates more subtle process differences. Both pairs are provided jointly to an external LLM, which summarizes a small set of natural-language process rubrics as query-local drafts. This design captures both coarse and fine process distinctions while reducing induction from exhaustive $O(K^2)$ pairing to constant cost.
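The anchor selection of Equation (1) can be sketched in a few lines. The function name and the tie-breaking behavior (first index wins) are assumptions made for illustration; the paper defers corner cases to its appendix.

```python
def select_anchors(f1_scores):
    """Return indices (positive, worst_negative, hard_negative) for one query-group.

    Ties are broken by lowest index, which is an assumption of this sketch.
    """
    if len(f1_scores) < 2:
        raise ValueError("need at least two trajectories to form a contrast")
    indices = range(len(f1_scores))
    positive = max(indices, key=lambda i: f1_scores[i])
    worst = min(indices, key=lambda i: f1_scores[i])
    # Hard negative: the strongest trajectory once the positive anchor is removed.
    hard = max((i for i in indices if i != positive), key=lambda i: f1_scores[i])
    return positive, worst, hard

anchors = select_anchors([0.2, 1.0, 0.0, 0.8])
```

Only two contrastive pairs are passed on to the induction LLM, so the number of induction calls does not grow with the group size $K$.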

The induction prompt is restricted to search behavior rather than answer content. The LLM is instructed to identify process behaviors that causally separate successful trajectories from failed ones, focusing on search strategy and reasoning process, such as query formulation, evidence use, and stopping judgments. Each induced rubric specifies both what a high-scoring response does and what a low-scoring response does along the dimension. The full induction prompt is provided in Appendix [E.1](https://arxiv.org/html/2606.03239#A5.SS1).

All-correct and all-wrong groups skip the induction stage, as they provide no correctness contrast from which to infer new rubrics. Common rubrics already saved in $\mathcal{M}$, however, can still score these groups under Section [3.4](https://arxiv.org/html/2606.03239#S3.SS4), providing within-group discrimination when the outcome signal collapses. Corner cases are detailed in Appendix [A.1](https://arxiv.org/html/2606.03239#A1.SS1).

The resulting drafts enter the memory pipeline of Section [3.3](https://arxiv.org/html/2606.03239#S3.SS3) and can affect future training only after surviving admission and consolidation.

### 3.3 Rubric Memory and Lifecycle

ARBOR maintains a single rubric memory $\mathcal{M} = (\mathcal{D}, \mathcal{P})$ shared across all queries. The candidate pool $\mathcal{D}$ temporarily stores the query-local drafts induced in Section [3.2](https://arxiv.org/html/2606.03239#S3.SS2), awaiting abstraction into reusable standards. The common rubric pool $\mathcal{P}$ stores rubrics that have been distilled from $\mathcal{D}$ and serves as the signal source for reward shaping in Section [3.4](https://arxiv.org/html/2606.03239#S3.SS4). Throughout training, $\mathcal{M}$ evolves through three mechanisms, namely admission, consolidation, and retirement, keeping its contents updated in step with the policy behavior distribution.

#### Admission\.

Each draft rubric scores the trajectories in its source group via the pairwise judging procedure of Section [3.4](https://arxiv.org/html/2606.03239#S3.SS4). Let $s_i^r$ denote the score of trajectory $\tau_i$ under draft $r$. A draft is admitted to $\mathcal{D}$ only if its scores satisfy two conditions:

$$
\begin{aligned}
\mathrm{Var}_{i \in \mathcal{G}_q}(s_i^r) &\geq \delta_v, \\
\mathrm{Pearson}\bigl(\{s_i^r\},\, \{F_1(\tau_i)\}\bigr) &\geq \rho_{\min}.
\end{aligned} \tag{2}
$$

The variance condition rules out drafts that fail to discriminate on the samples they were induced from. The correlation condition requires that rubric scores are aligned with outcome correctness, rejecting drafts that penalize correct behavior. Drafts failing either condition are discarded immediately.
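A minimal sketch of the two admission checks in Equation (2) follows. The threshold defaults (`delta_v=0.01`, `rho_min=0.2`) are placeholders chosen for illustration, not the paper's hyperparameters, and treating an undefined correlation as zero is likewise an assumption.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant (assumption)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def admit(scores, f1s, delta_v=0.01, rho_min=0.2):
    """Admit a draft only if it discriminates and agrees with outcome F1."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return variance >= delta_v and pearson(scores, f1s) >= rho_min

aligned = admit([1.0, 0.5, 0.0, 0.5], [1.0, 0.6, 0.0, 0.4])   # discriminative, aligned
flat = admit([0.5, 0.5, 0.5, 0.5], [1.0, 0.0, 1.0, 0.0])      # no within-group variance
inverted = admit([0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0])  # penalizes correct behavior
```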

#### Consolidation\.

Once the number of drafts in $\mathcal{D}$ reaches a fixed threshold, ARBOR sends all candidate drafts together with the existing common rubrics in $\mathcal{P}$ to an external LLM. The LLM is instructed to identify recurring process patterns across the candidates and abstract them into cross-query general standards. Existing rubrics in $\mathcal{P}$ are provided as context so that the LLM only produces standards covering new dimensions. Each output rubric is further deduplicated against $\mathcal{P}$ by sentence-embedding similarity as a safeguard. After consolidation, $\mathcal{D}$ is cleared. The full consolidation prompt is provided in Appendix [E.2](https://arxiv.org/html/2606.03239#A5.SS2).
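The embedding-based deduplication safeguard might look like the sketch below. It operates on precomputed embedding vectors so that it stays independent of any particular embedding library; the cosine-similarity criterion and the `threshold=0.9` default are assumptions, since the text only states that sentence-embedding similarity is used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def deduplicate(new_rubrics, pool, threshold=0.9):
    """Keep new rubrics whose embedding is not too close to any pooled rubric.

    Both arguments are lists of (text, embedding) pairs; `threshold` is hypothetical.
    """
    return [
        (text, vec)
        for text, vec in new_rubrics
        if all(cosine(vec, pooled_vec) < threshold for _, pooled_vec in pool)
    ]

# Toy 2-D embeddings stand in for real sentence embeddings.
pool = [("Reformulates the query after an empty result", [1.0, 0.0])]
new = [
    ("Rewrites the query when retrieval returns nothing", [0.99, 0.05]),
    ("Stops searching once evidence is sufficient", [0.0, 1.0]),
]
kept = deduplicate(new, pool)
```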

#### Retirement\.

Common rubrics are not assumed to remain valid for the entire training run. As the policy evolves, a rubric that once discriminated good from bad may lose its power because the policy has uniformly mastered or failed the corresponding behavior. ARBOR tracks two long-term signals for each common rubric $r \in \mathcal{P}$, corresponding to the same two dimensions checked at the admission stage but accumulated over the rubric's entire active lifetime. The first is a consecutive low-variance count $n_r$, the number of consecutive activations on which the within-group score variance under $r$ falls below $\delta_v$, measuring whether $r$ still discriminates among trajectories. The second is a cumulative Pearson correlation $\rho_r$ between the per-trajectory scores under $r$ and the trajectory F1 across all activations of $r$, measuring whether $r$ aligns with correct behavior. Either $n_r$ exceeding its tolerance or $\rho_r < \rho_{\min}$ triggers removal from $\mathcal{P}$.

In addition, when consolidation attempts to write a new common rubric into a $\mathcal{P}$ that has already reached its capacity limit, the entry with the lowest cumulative within-group variance is replaced, provided its tracked statistics have matured over a sufficient number of activations. This prevents newly admitted rubrics from being evicted before their quality signals stabilize.
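The two retirement signals can be tracked incrementally, as in the sketch below. The class name, the running-sum formulation of the cumulative Pearson correlation, and the choice never to retire a rubric that has not yet been activated are all assumptions of this illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class RubricTracker:
    low_var_streak: int = 0  # n_r: consecutive low-variance activations
    n: int = 0               # number of (score, F1) pairs seen so far
    sum_s: float = 0.0
    sum_f: float = 0.0
    sum_ss: float = 0.0
    sum_ff: float = 0.0
    sum_sf: float = 0.0

    def update(self, scores, f1s, delta_v):
        """Record one activation of the rubric on a query-group."""
        mean = sum(scores) / len(scores)
        variance = sum((s - mean) ** 2 for s in scores) / len(scores)
        self.low_var_streak = self.low_var_streak + 1 if variance < delta_v else 0
        for s, f in zip(scores, f1s):
            self.n += 1
            self.sum_s += s
            self.sum_f += f
            self.sum_ss += s * s
            self.sum_ff += f * f
            self.sum_sf += s * f

    def rho(self):
        """Cumulative Pearson correlation rho_r over all recorded pairs."""
        if self.n == 0:
            return 0.0
        cov = self.sum_sf - self.sum_s * self.sum_f / self.n
        var_s = self.sum_ss - self.sum_s ** 2 / self.n
        var_f = self.sum_ff - self.sum_f ** 2 / self.n
        if var_s <= 0 or var_f <= 0:
            return 0.0
        return cov / math.sqrt(var_s * var_f)

    def should_retire(self, tolerance, rho_min):
        if self.n == 0:  # assumption: never retire before the first activation
            return False
        return self.low_var_streak > tolerance or self.rho() < rho_min
```

For example, a tracker that sees one discriminative activation followed by four flat ones accumulates `low_var_streak == 4` and is retired under a tolerance of 3, even though its cumulative correlation is still positive.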

### 3.4 Process Scoring and Reward Shaping

At each training step, ARBOR selects a small active subset of common rubrics from $\mathcal{P}$, scores the trajectories in $\mathcal{G}_q$ under those rubrics via pairwise process comparison, and integrates the resulting scores into the reward after within-group centering. Because common rubrics are cross-query process standards independent of any specific query, they can discriminate among trajectories even in outcome-homogeneous groups where the base reward is identical for all samples, sustaining process-level supervision when the outcome signal collapses.

#### Active selection\.

At each step, ARBOR activates exactly two common rubrics from $\mathcal{P}$. One slot is held by the rubric with the highest cumulative correlation $\rho_r$, serving as the primary process signal; the other rotates among the next strongest candidates by a least-recently-used policy. Activating only two rubrics avoids the linear blowup of judge cost and the within-group signal dilution that would result from scoring under the full pool. Rotation ensures that non-top rubrics still accumulate $\rho_r$ and $n_r$, keeping the retirement mechanism of Section [3.3](https://arxiv.org/html/2606.03239#S3.SS3) operational. When $\mathcal{P}$ is empty, the step falls back to the base reward, so the cold-start phase introduces no unverified rubric signal.
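One way to realize the two-slot selection is sketched below. The size of the rotation window (`n_rotate`) and the bookkeeping via a `last_used` dictionary are assumptions; the text specifies only that the second slot rotates among the next strongest candidates by least-recently-used order.

```python
def select_active(pool, last_used, step, n_rotate=3):
    """Pick up to two active rubrics.

    pool: dict mapping rubric id -> cumulative correlation rho_r.
    last_used: dict mapping rubric id -> last step it was activated (mutated here).
    n_rotate: hypothetical number of runner-up rubrics eligible for the second slot.
    """
    if not pool:
        return []  # cold start: caller falls back to the base reward
    ranked = sorted(pool, key=pool.get, reverse=True)
    active = [ranked[0]]  # primary slot: highest rho_r
    runners_up = ranked[1:1 + n_rotate]
    if runners_up:
        # Secondary slot: least recently used among the runners-up.
        active.append(min(runners_up, key=lambda r: last_used.get(r, -1)))
    for rubric in active:
        last_used[rubric] = step
    return active

pool = {"a": 0.9, "b": 0.7, "c": 0.6, "d": 0.1}
history = {}
picks = [select_active(pool, history, step, n_rotate=2) for step in range(3)]
```

With a rotation window of two, the secondary slot alternates between `b` and `c` while `a` stays in the primary slot, so both runners-up keep accumulating statistics.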

#### Group\-stage process scoring\.

Under each active common rubric, ARBOR performs within-group pairwise comparisons among the trajectories in $\mathcal{G}_q$ via an external LLM judge that returns win, tie, or loss. A full round-robin would require $O(|\mathcal{G}_q|^2)$ judge calls, which is prohibitively expensive. ARBOR therefore sorts the trajectories by base reward and builds a sparse connected graph with $O(|\mathcal{G}_q|)$ edges using two kinds of edges. As illustrated in Figure [2](https://arxiv.org/html/2606.03239#S1.F2), neighbor edges link adjacent trajectories in the sorted order, providing fine-grained local comparisons between similarly ranked trajectories; diameter edges connect distant trajectories, providing high-contrast pairs for more decisive judgments. Each pairwise call randomizes the presentation order of the two trajectories to counteract positional bias of the LLM judge. Wins, ties, and losses are scored as $1$, $0.5$, and $0$, respectively. The score of $\tau_i$ under a rubric is the mean over its incident edges, and averaging across the active rubrics yields the composite process score $s_i^{\text{rubric}}$.
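The sparse scoring procedure can be sketched as follows for a single rubric. The exact construction of diameter edges is not spelled out in the text, so pairing positions from the two ends of the sorted order inward is an assumption here, as are the `judge` callable (standing in for the external LLM judge) and the neutral score of 0.5 for a trajectory with no edges.

```python
import random

OUTCOME = {"win": 1.0, "tie": 0.5, "loss": 0.0}

def sparse_edges(order):
    """Neighbor edges plus assumed 'diameter' edges pairing the ends inward."""
    k = len(order)
    edges = {(order[i], order[i + 1]) for i in range(k - 1)}
    edges |= {(order[i], order[k - 1 - i]) for i in range(k // 2)}
    return sorted(edges)

def rubric_scores(base_rewards, judge, rng=random):
    """Score trajectories under one rubric using O(K) pairwise judge calls.

    judge(first, second) returns "win", "tie", or "loss" for `first`.
    """
    k = len(base_rewards)
    order = sorted(range(k), key=lambda i: base_rewards[i])
    totals, counts = [0.0] * k, [0] * k
    for a, b in sparse_edges(order):
        # Randomize which trajectory is presented first to the judge.
        first, second = (a, b) if rng.random() < 0.5 else (b, a)
        score_first = OUTCOME[judge(first, second)]
        totals[first] += score_first
        totals[second] += 1.0 - score_first
        counts[first] += 1
        counts[second] += 1
    return [t / c if c else 0.5 for t, c in zip(totals, counts)]

# Toy judge: compares a hidden process-quality value per trajectory.
quality = [3, 1, 2, 0]
def toy_judge(i, j):
    if quality[i] == quality[j]:
        return "tie"
    return "win" if quality[i] > quality[j] else "loss"

# All four trajectories share the same base reward, yet the rubric separates them.
scores = rubric_scores([1.0, 1.0, 1.0, 1.0], toy_judge, rng=random.Random(0))
```

The composite process score would then be the mean of these per-rubric scores over the active rubrics.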

#### Centering and reward integration\.

Before entering the reward, ARBOR applies a per-batch variance filter. If an active rubric's within-group variance falls below $\delta_v$ on a batch, its scores are discarded to avoid injecting uninformative noise. The scores from the filtered active rubrics are then centered within each query-group,

$$
\tilde{s}_i^{\text{rubric}} = s_i^{\text{rubric}} - \frac{1}{|\mathcal{G}_q|} \sum_{j \in \mathcal{G}_q} s_j^{\text{rubric}}, \tag{3}
$$

and the final reward uses additive shaping,

$$
\begin{aligned}
R_i^{\text{total}} &= R_i^{\text{base}} + \lambda \cdot \tilde{s}_i^{\text{rubric}}, \\
R_i^{\text{base}} &= \begin{cases} R_i^{\text{F1}} & \text{if format valid} \\ -1 & \text{otherwise,} \end{cases}
\end{aligned} \tag{4}
$$

where $R_i^{\text{F1}}$ is the token-level F1 of the final answer and $\lambda$ is fixed throughout training. Group centering gives the rubric term zero mean within each query-group, and negative centered scores are further attenuated by a factor $\alpha$, encouraging stronger search behavior without over-penalizing merely less-preferred trajectories. For format-invalid trajectories, the rubric term is forced to zero, preserving the format penalty imposed by $R_i^{\text{base}}$.
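Equations (3) and (4) combine into a short shaping routine, sketched below. The default `alpha=0.5` is a placeholder, and applying $\alpha$ as a multiplier on negative centered scores is this sketch's reading of the attenuation described above; `lam=0.1` matches the coefficient reported in the implementation details.

```python
def shape_rewards(f1s, rubric_scores, format_valid, lam=0.1, alpha=0.5):
    """Return total rewards for one query-group.

    f1s: token-level F1 per trajectory.
    rubric_scores: composite rubric score per trajectory.
    format_valid: whether each trajectory passed the format check.
    alpha: hypothetical attenuation applied to negative centered scores.
    """
    mean = sum(rubric_scores) / len(rubric_scores)
    totals = []
    for f1, score, valid in zip(f1s, rubric_scores, format_valid):
        if not valid:
            totals.append(-1.0)  # format penalty; rubric term forced to zero
            continue
        centered = score - mean
        if centered < 0:
            centered *= alpha    # soften the penalty on less-preferred trajectories
        totals.append(f1 + lam * centered)
    return totals

# An all-correct group: base rewards are identical, shaped rewards are not.
shaped = shape_rewards([1.0] * 4, [1.0, 0.0, 1.0, 0.0], [True] * 4)
```

In this toy all-correct group, the shaped rewards differ between the trajectories the rubric prefers and those it does not, so the group regains nonzero within-group variance.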

## 4 Experiments

Table 1: Overall performance on 4 knowledge-intensive reasoning tasks. ARBOR achieves the best average EM, F1 and LLM-judge accuracy at all three scales. The top two outcomes in each column are bolded and underlined.

### 4.1 Experimental Setup

#### Datasets\.

Our training follows a two-stage SFT-then-RL procedure. SFT uses a search-tool subset of Tool-Star (Dong et al., [2025](https://arxiv.org/html/2606.03239#bib.bib45)) and STILL (Min et al., [2024](https://arxiv.org/html/2606.03239#bib.bib46)), filtered to remove Python-tool examples, yielding approximately 16K training examples. RL training uses 2K QA examples randomly sampled from the ARPO Deep Reasoning Tasks (Dong et al., [2026](https://arxiv.org/html/2606.03239#bib.bib17)). Evaluation is on four multi-hop QA benchmarks: Bamboogle (Press et al., [2023](https://arxiv.org/html/2606.03239#bib.bib6)), HotpotQA (Yang et al., [2018](https://arxiv.org/html/2606.03239#bib.bib3)), MuSiQue (Trivedi et al., [2022](https://arxiv.org/html/2606.03239#bib.bib5)), and 2WikiMultiHopQA (Ho et al., [2020](https://arxiv.org/html/2606.03239#bib.bib4)). The test split strategy follows the convention of ARPO (Dong et al., [2026](https://arxiv.org/html/2606.03239#bib.bib17)). Benchmark details are provided in Appendix [B.1](https://arxiv.org/html/2606.03239#A2.SS1).

#### Baselines\.

We use Qwen3-8B (Yang et al., [2025](https://arxiv.org/html/2606.03239#bib.bib48)) as the main backbone, with Qwen3-4B and Qwen3-14B for cross-scale verification, comparing three methods on each backbone: (1) TIR Prompting, which injects the search-tool specification but performs no RL fine-tuning; (2) GRPO (Shao et al., [2024](https://arxiv.org/html/2606.03239#bib.bib43)), which uses base reward as the sole reward; and (3) DAPO (Yu et al., [2025](https://arxiv.org/html/2606.03239#bib.bib44)), which discards outcome-homogeneous query-groups from policy updates, providing a contrast on how to handle the zero-gradient groups that ARBOR exploits.

#### Implementation Details\.

All training runs on a single node with 8× H100 GPUs using Slime (Zhu et al., [2025](https://arxiv.org/html/2606.03239#bib.bib54)) as the RL framework. The search tool is Google Search, returning 10 results per query. The base reward is token-level F1 with a format penalty. For ARBOR, induction, consolidation, and pairwise judging all call Qwen3-Plus (Alibaba Cloud, [2025](https://arxiv.org/html/2606.03239#bib.bib55); model id: qwen-plus-2025-04-28, accessed 2026-05) as the external LLM. Consolidation deduplication uses BGE-M3 (Chen et al., [2024](https://arxiv.org/html/2606.03239#bib.bib53)) sentence embeddings, and the reward shaping coefficient is fixed to $\lambda = 0.1$. Remaining hyperparameters are listed in Appendix [B.3](https://arxiv.org/html/2606.03239#A2.SS3).

#### Evaluation Metrics\.

We report Exact Match (EM), token-level F1, and LLM-judge accuracy with Qwen3-Plus as the judge. All numbers are pass@1 averaged over 5 independent evaluation runs to reduce sampling variance. The LLM-judge prompt is given in Appendix [E.4](https://arxiv.org/html/2606.03239#A5.SS4).

### 4.2 Main Results

Table [1](https://arxiv.org/html/2606.03239#S4.T1) presents the main results across four multi-hop QA benchmarks. ARBOR achieves the best average performance at every model scale. We highlight three observations from the results.

#### ARBOR consistently improves over outcome\-only baselines\.

ARBOR improves average LLM-judge accuracy over GRPO by 4.0, 4.2, and 2.0 points at 4B, 8B, and 14B, and also uniformly outperforms DAPO at all three scales. The improvements are consistent across evaluation metrics, with gains observed not only in LLM-judge accuracy but also in EM and F1. TIR Prompting performs substantially worse than all RL-trained variants, confirming that tool access alone is insufficient without policy optimization. Its non-monotonic scaling is mainly caused by answer-format failures under the strict evaluator, as diagnosed in Appendix [C.1](https://arxiv.org/html/2606.03239#A3.SS1).

#### Exploiting outcome\-homogeneous groups outperforms discarding them\.

DAPO's gains over GRPO are inconsistent across scales, turning negative at 14B (−1.7 points), while ARBOR consistently outperforms GRPO at all scales. An important difference is how the methods handle outcome-homogeneous groups: DAPO discards them entirely from policy updates, whereas ARBOR retains them and applies reusable process rubrics to provide within-group discrimination. ARBOR's lead over DAPO is uniformly large (+3.5 to +4.4 points across all backbones), confirming that exploiting these groups through process reward is more effective than discarding them.

#### Process supervision translates into semantic answer gains\.

ARBOR’s largest gains appear on LLM\-judge accuracy: \+4\.0, \+4\.2, and \+2\.0 points over GRPO at 4B, 8B, and 14B, exceeding the corresponding EM gains \(\+1\.8, \+2\.0, \+1\.5\) and F1 gains \(\+1\.8, \+2\.4, \+1\.0\)\. This pattern is consistent with ARBOR’s design: rubrics reward better query formulation, evidence use, and answer synthesis rather than final\-answer string overlap\. The result suggests that reusable process supervision improves answer quality in ways that are better captured by semantic evaluation than by lexical\-overlap metrics alone\. To ensure these gains are not evaluator\-specific, Appendix [C\.2](https://arxiv.org/html/2606.03239#A3.SS2) verifies the same trend under DeepSeek\-V4\-Pro \(DeepSeek\-AI, [2026](https://arxiv.org/html/2606.03239#bib.bib56)\) as an independent evaluator\.

### 4\.3No\-Gradient Group Rescue

Table 2: Reward\-homogeneous group fractions during ARBOR training\. Groups are measured on $R^{\text{base}}$ \(before rubric shaping\) and $R^{\text{total}}$ \(after rubric shaping\)\.

A direct test of the motivating claim from Section [1](https://arxiv.org/html/2606.03239#S1) is whether rubric scoring provides within\-group discrimination on groups where outcome\-only reward produces zero gradient\. Table [2](https://arxiv.org/html/2606.03239#S4.T2) reports the fraction of groups with zero within\-group reward variance, measured first under $R^{\text{base}}$ alone and then under $R^{\text{total}}$\. A drop indicates that rubric scoring introduced reward discrimination that the outcome signal could not provide\. Groups are further split by outcome pattern: *all\-correct* groups have F1=1 on every trajectory with valid format, *all\-wrong* groups have F1=0 on every trajectory including format\-invalid cases, and *mixed\-uniform* groups share the same partial F1 across all trajectories\.

Rubric scoring reduces all three types at every scale\. The effect concentrates on *all\-wrong* groups, where the relative reduction reaches 54–61%\. These groups represent queries the policy has not yet learned to solve, and rubric scoring directly helps here by distinguishing more promising search strategies from less promising ones, providing within\-group discrimination that the base reward cannot\. The reduction on *all\-correct* groups is smaller \(18–31%\), as the active rubrics are primarily induced from success\-failure contrasts and less attuned to the subtler process differences among trajectories that all succeed\. Overall, rubric scoring converts 32–42% of outcome\-homogeneous groups into ones with nonzero within\-group reward variance, enabling policy learning on these groups\.
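
The measurement behind these fractions can be sketched as follows, assuming each group is a list of per\-trajectory rewards and that the total reward is the base reward plus $\lambda$ times a rubric score\. The function names and the toy numbers are hypothetical and are not the paper's data\.

```python
from statistics import pvariance

LAMBDA = 0.1  # shaping coefficient from the experimental setup


def homogeneous_fraction(groups, tol=1e-12):
    """Fraction of groups whose rewards have (numerically) zero variance."""
    flat = sum(1 for rewards in groups if pvariance(rewards) <= tol)
    return flat / len(groups)


def shaped(base_groups, rubric_groups, lam=LAMBDA):
    """Per-trajectory total reward: base + lam * rubric score (assumed additive form)."""
    return [
        [b + lam * s for b, s in zip(base, rubric)]
        for base, rubric in zip(base_groups, rubric_groups)
    ]


def relative_reduction(frac_before, frac_after):
    """Share of previously homogeneous groups that gained reward variance."""
    return (frac_before - frac_after) / frac_before


# Toy data: one all-correct, one all-wrong, one mixed, one mixed-uniform group.
base_groups = [
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
rubric_groups = [
    [0.2, 0.2, 0.2, 0.2],  # rubrics cannot separate these trajectories
    [0.1, 0.6, 0.3, 0.0],  # rubrics separate failed trajectories
    [0.0, 0.0, 0.0, 0.0],
    [0.4, 0.4, 0.1, 0.4],
]

frac_base = homogeneous_fraction(base_groups)
frac_total = homogeneous_fraction(shaped(base_groups, rubric_groups))
rescued = relative_reduction(frac_base, frac_total)
```

In this toy case three of four groups are homogeneous under the base reward and only one remains so after shaping, so two thirds of the homogeneous groups become informative\.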

### 4\.4Effect of Reusable Rubric Memory

![Refer to caption](https://arxiv.org/html/2606.03239v1/x3.png)

Figure 3: Effect of reusable rubric memory\. Average LLM\-judge accuracy across four benchmarks\.

To isolate the contribution of reusable rubric memory, Figure [3](https://arxiv.org/html/2606.03239#S4.F3) compares ARBOR with two no\-memory rubric variants\. RaR\-style, adapted from Rubrics\-as\-Rewards \(Gunjal et al\., [2026](https://arxiv.org/html/2606.03239#bib.bib34)\), uses static query\-specific rubrics generated from the question and reference answer to score rollouts of the corresponding query\. The w/o memory variant uses ARBOR’s contrastive induction but discards each rubric after scoring its source query\-group\. Detailed experimental setup and full results are provided in Appendix [C\.3](https://arxiv.org/html/2606.03239#A3.SS3)\.

No\-memory rubric feedback is not sufficient to explain ARBOR’s gains\. RaR\-style is weak and unstable: it slightly improves average accuracy over GRPO at 4B \(\+1\.3 points\), matches GRPO at 8B, and falls substantially below GRPO at 14B \(\-4\.1 points\)\. The w/o memory variant is stronger than RaR\-style at all three scales, showing that contrastive process rubrics are more useful than reference\-conditioned instance checklists\. However, it still lacks the stability of reusable memory\. ARBOR outperforms w/o memory by 1\.0, 2\.6, and 2\.7 accuracy points at 4B, 8B, and 14B, and outperforms RaR\-style by 2\.7, 4\.2, and 6\.1 points\. These margins indicate that the main benefit is not merely adding rubric\-shaped reward, but converting local observations into common process criteria that can be filtered and reused across later query\-groups\.

### 4\.5Common\-Rubric Reuse

Table 3: Top\-5 most reused common rubrics during training\. Uses denotes the number of query\-groups each rubric is activated to score\.

A central claim of ARBOR is that consolidated common rubrics serve as reusable cross\-query process standards rather than one\-shot rewards\. We verify this by ranking all common rubrics produced during training by the number of query\-groups each one is activated to score\. The top 5% of rubrics by this ranking account for 20% of all scoring events and the top 20% account for 44%, as detailed in Appendix [C\.4](https://arxiv.org/html/2606.03239#A3.SS4)\. This confirms that a small core of high\-quality rubrics emerges and consistently drives process supervision across diverse queries\.
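
The concentration statistic can be computed along these lines, assuming a list with one activation count per common rubric\. The helper name and the toy counts are hypothetical; the paper's actual distribution is reported in Appendix C\.4\.

```python
def top_share(use_counts, top_percent):
    """Share of all scoring events covered by the top `top_percent`% most-used rubrics."""
    if not use_counts or sum(use_counts) == 0:
        raise ValueError("need at least one rubric with a nonzero use count")
    ranked = sorted(use_counts, reverse=True)
    # Integer ceiling of top_percent% of the rubric count, with at least one rubric.
    k = max(1, -(-top_percent * len(ranked) // 100))
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical activation counts for ten rubrics (100 scoring events in total).
toy_counts = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
share_top10 = top_share(toy_counts, 10)
share_top20 = top_share(toy_counts, 20)
```

A share well above the corresponding percentage, as in the 5% to 20% and 20% to 44% figures reported above, indicates that scoring events are concentrated on a small subset of rubrics\.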

Table [3](https://arxiv.org/html/2606.03239#S4.T3) lists the most reused rubrics, all of which encode generic process behaviors such as precise entity\-attribute targeting and evidence\-guided termination\. None references an entity or fact specific to any individual query, confirming that the consolidation step successfully abstracts query\-local drafts into genuinely cross\-query standards\. A concrete consolidation case is shown in Appendix [D](https://arxiv.org/html/2606.03239#A4)\.

### 4\.6Hyperparameter Sensitivity

![Refer to caption](https://arxiv.org/html/2606.03239v1/x4.png)

Figure 4: Hyperparameter sensitivity on $\lambda$ and $K_{\text{consol}}$\. The dashed line marks the GRPO baseline\.

ARBOR adds a process reward term and a rubric memory mechanism, so we evaluate sensitivity to the two hyperparameters most tied to them: the reward coefficient $\lambda$, which controls the strength of the process signal, and the consolidation threshold $K_{\text{consol}}$, which controls how quickly local drafts are converted into reusable common rubrics\.

Figure [4](https://arxiv.org/html/2606.03239#S4.F4) shows the results\. Panel \(a\) varies $\lambda$\. Performance peaks at $\lambda = 0.1$; smaller values underweight the rubric signal, while larger values allow noisy rubric scores to interfere with the outcome gradient\. All four settings outperform GRPO, indicating that ARBOR is robust over a reasonable range of $\lambda$\. Panel \(b\) varies $K_{\text{consol}}$\. This parameter trades off reliability against timeliness: small values trigger consolidation from limited evidence, while large values increase the lag between induction and activation so that rubrics reflect an outdated policy distribution\. All values in $\{4, 8, 16\}$ outperform the baseline, with the best at $K_{\text{consol}} = 8$\. At $K_{\text{consol}} = 32$, the consolidation lag causes rubrics to fall behind the evolving policy and performance drops\.

## 5Conclusion

In this work, we introduced ARBOR, a reusable process\-reward framework designed to address the limitations of outcome\-only rewards in search\-agent RL\. ARBOR induces natural\-language process rubrics from contrastive trajectories, consolidates them into a reusable cross\-query memory, and manages the memory through an admission–consolidation–retirement lifecycle\. This design jointly achieves process\-level supervision, cross\-query consistency, and co\-evolution with the policy\. Experiments across three model scales and four multi\-hop QA benchmarks show that ARBOR consistently improves over outcome\-only RL baselines, while ablations and reuse analyses confirm the importance of reusable rubric memory\. These results suggest that process knowledge in agent training need not be ephemeral: when accumulated into reusable criteria that co\-evolve with the policy, it provides a persistent and growing source of supervision that outcome signals alone cannot offer\.

## Limitations

First, ARBOR relies on an external LLM for rubric induction, consolidation, and pairwise judging, so the quality of rubric supervision is tied to the capability of the external model\. Models with stronger reasoning capacity may yield more precise rubrics and finer process discrimination\. Second, our experiments validate ARBOR on multi\-hop QA with a single search tool\. Since the rubric memory is task\-agnostic by construction, extending it to broader tool\-use settings such as code generation and mathematical reasoning with retrieval is a straightforward direction\.

## References

- Alibaba Cloud \(2025\)Qwen\-Plus Model Card\.Note:Model id: qwen\-plus\-2025\-04\-28\. Accessed: 2026\-05External Links:[Link](https://www.aliyun.com/benefit/scene/qwen-plus)Cited by:[§4\.1](https://arxiv.org/html/2606.03239#S4.SS1.SSS0.Px3.p1.2)\.
- J\. Chen, S\. Xiao, P\. Zhang, K\. Luo, D\. Lian, and Z\. Liu \(2024\)M3\-embedding: multi\-linguality, multi\-functionality, multi\-granularity text embeddings through self\-knowledge distillation\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 2318–2335\.External Links:[Link](https://aclanthology.org/2024.findings-acl.137/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.137)Cited by:[§B\.3](https://arxiv.org/html/2606.03239#A2.SS3.p2.3),[§4\.1](https://arxiv.org/html/2606.03239#S4.SS1.SSS0.Px3.p1.2)\.
- G\. Cui, L\. Yuan, Z\. Wang, H\. Wang, Y\. Zhang, J\. Chen, W\. Li, B\. He, Y\. Fan, T\. Yu, Q\. Xu, W\. Chen, J\. Yuan, H\. Chen, K\. Zhang, X\. Lv, S\. Wang, Y\. Yao, X\. Han, H\. Peng, Y\. Cheng, Z\. Liu, M\. Sun, B\. Zhou, and N\. Ding \(2025\)Process reinforcement through implicit rewards\.External Links:2502\.01456,[Link](https://arxiv.org/abs/2502.01456)Cited by:[§1](https://arxiv.org/html/2606.03239#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- T\. Dao \(2024\)FlashAttention\-2: faster attention with better parallelism and work partitioning\.InInternational Conference on Learning Representations,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 35549–35562\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/98ed250b203d1ac6b24bbcf263e3d4a7-Paper-Conference.pdf)Cited by:[§B\.2](https://arxiv.org/html/2606.03239#A2.SS2.SSS0.Px1.p1.1)\.
- DeepSeek\-AI \(2026\)DeepSeek\-V4: Towards Highly Efficient Million\-Token Context Intelligence\.External Links:[Link](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro)Cited by:[§C\.2](https://arxiv.org/html/2606.03239#A3.SS2.p1.1),[§4\.2](https://arxiv.org/html/2606.03239#S4.SS2.SSS0.Px3.p1.1)\.
- L\. Ding \(2026\)AdaRubric: task\-adaptive rubrics for reliable llm agent evaluation and reward learning\.External Links:2603\.21362,[Link](https://arxiv.org/abs/2603.21362)Cited by:[§2\.2](https://arxiv.org/html/2606.03239#S2.SS2.p1.1)\.
- G\. Dong, Y\. Chen, X\. Li, J\. Jin, H\. Qian, Y\. Zhu, H\. Mao, G\. Zhou, Z\. Dou, and J\. Wen \(2025\)Tool\-Star: Empowering llm\-brained multi\-tool reasoner via reinforcement learning\.arXiv preprint arXiv:2505\.16410\.External Links:[Link](https://arxiv.org/abs/2505.16410v1)Cited by:[§B\.1](https://arxiv.org/html/2606.03239#A2.SS1.p2.1),[§B\.2](https://arxiv.org/html/2606.03239#A2.SS2.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.03239#S4.SS1.SSS0.Px1.p1.1)\.
- G\. Dong, H\. Mao, K\. Ma, L\. Bao, Y\. Chen, Z\. Wang, Z\. Chen, J\. Du, H\. Wang, F\. Zhang, G\. Zhou, Y\. Zhu, J\. Wen, and Z\. Dou \(2026\)Agentic Reinforced Policy Optimization\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=TX4k7BF6aO)Cited by:[§B\.1](https://arxiv.org/html/2606.03239#A2.SS1.p2.1),[§4\.1](https://arxiv.org/html/2606.03239#S4.SS1.SSS0.Px1.p1.1)\.
- A\. Gunjal, A\. Wang, E\. Lau, V\. Nath, Y\. He, B\. Liu, and S\. M\. Hendryx \(2026\)Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=c1bTcrDmt4)Cited by:[§C\.3](https://arxiv.org/html/2606.03239#A3.SS3.p2.1),[§1](https://arxiv.org/html/2606.03239#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.03239#S2.SS2.p1.1),[§4\.4](https://arxiv.org/html/2606.03239#S4.SS4.p1.1)\.
- H\. Hashemi, J\. Eisner, C\. Rosset, B\. Van Durme, and C\. Kedzie \(2024\)Llm\-rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 13806–13834\.External Links:[Link](https://aclanthology.org/2024.acl-long.745/)Cited by:[§2\.2](https://arxiv.org/html/2606.03239#S2.SS2.p1.1)\.
- Y\. He, W\. Li, H\. Zhang, S\. Li, K\. Mandyam, S\. Khosla, Y\. Xiong, N\. Wang, X\. Peng, B\. Li, S\. Bi, S\. G\. Patil, Q\. Qi, S\. Feng, J\. Katz\-Samuels, R\. Y\. Pang, S\. Gonugondla, H\. Lang, Y\. Yu, Y\. Qian, M\. Fazel\-Zarandi, L\. Yu, A\. Benhalloum, H\. Awadalla, and M\. Faruqui \(2025\)AdvancedIF: rubric\-based benchmarking and reinforcement learning for advancing llm instruction following\.External Links:2511\.10507,[Link](https://arxiv.org/abs/2511.10507)Cited by:[§2\.2](https://arxiv.org/html/2606.03239#S2.SS2.p1.1)\.
- X\. Ho, A\. D\. Nguyen, S\. Sugawara, and A\. Aizawa \(2020\)Constructing a multi\-hop qa dataset for comprehensive evaluation of reasoning steps\.InProceedings of the 28th International Conference on Computational Linguistics,pp\. 6609–6625\.External Links:[Link](https://aclanthology.org/2020.coling-main.580/)Cited by:[2nd item](https://arxiv.org/html/2606.03239#A2.I1.i2.p1.1),[§4\.1](https://arxiv.org/html/2606.03239#S4.SS1.SSS0.Px1.p1.1)\.
- R\. Jia, Y\. Yang, Y\. Wu, Y\. Gai, S\. Tao, M\. Zhou, J\. Lin, X\. Jiang, and G\. Jiang \(2026\)Open rubric system: Scaling reinforcement learning with pairwise adaptive rubric\.arXiv preprint arXiv:2602\.14069\.External Links:[Link](https://arxiv.org/abs/2602.14069v2)Cited by:[§2\.2](https://arxiv.org/html/2606.03239#S2.SS2.p1.1)\.
- P\. Jiang, X\. Xu, J\. Lin, J\. Xiao, Z\. Wang, J\. Sun, and J\. Han \(2025\)S3: you don’t need that much data to train a search agent via RL\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 21599–21617\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1095/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1095),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.03239#S1.p1.1)\.
- B\. Jin, H\. Zeng, Z\. Yue, J\. Yoon, S\. O\. Arik, D\. Wang, H\. Zamani, and J\. Han \(2025\)Search\-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=Rwhi91ideu)Cited by:[§1](https://arxiv.org/html/2606.03239#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- S\. Kim, J\. Shin, Y\. Cho, J\. Jang, S\. Longpre, H\. Lee, S\. Yun, S\. Shin, S\. Kim, J\. Thorne, and M\. Seo \(2024\)Prometheus: inducing fine\-grained evaluation capability in language models\.InInternational Conference on Learning Representations,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 29927–29962\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/803485352e61e3ebf41221e4776c9fd4-Paper-Conference.pdf)Cited by:[§2\.2](https://arxiv.org/html/2606.03239#S2.SS2.p1.1)\.
- X\. Li, G\. Dong, J\. Jin, Y\. Zhang, Y\. Zhou, Y\. Zhu, P\. Zhang, and Z\. Dou \(2025\)Search\-o1: Agentic search\-enhanced large reasoning models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 5420–5438\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.276/)Cited by:[§1](https://arxiv.org/html/2606.03239#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\)Let’s verify step by step\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 39578–39601\.External Links:[Link](https://openreview.net/forum?id=v8L0pN6EOi)Cited by:[§1](https://arxiv.org/html/2606.03239#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- K\. Luo, H\. Qian, Z\. Liu, Z\. Xia, S\. Xiao, S\. Bao, J\. Zhao, and K\. Liu \(2025\)Infoflow: Reinforcing search agent via reward density optimization\.arXiv preprint arXiv:2510\.26575\.External Links:[Link](https://arxiv.org/abs/2510.26575v1)Cited by:[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- L\. Luo, Y\. Liu, R\. Liu, S\. Phatale, M\. Guo, H\. Lara, Y\. Li, L\. Shu, Y\. Zhu, L\. Meng, J\. Sun, and A\. Rastogi \(2024\)Improve mathematical reasoning in language models by automated process supervision\.External Links:2406\.06592,[Link](https://arxiv.org/abs/2406.06592)Cited by:[§1](https://arxiv.org/html/2606.03239#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- Y\. Min, Z\. Chen, J\. Jiang, J\. Chen, J\. Deng, Y\. Hu, Y\. Tang, J\. Wang, X\. Cheng, H\. Song, W\. X\. Zhao, Z\. Liu, Z\. Wang, and J\. Wen \(2024\)Imitate, explore, and self\-improve: a reproduction report on slow\-thinking reasoning systems\.External Links:2412\.09413,[Link](https://arxiv.org/abs/2412.09413)Cited by:[§B\.2](https://arxiv.org/html/2606.03239#A2.SS2.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.03239#S4.SS1.SSS0.Px1.p1.1)\.
- O\. Press, M\. Zhang, S\. Min, L\. Schmidt, N\. A\. Smith, and M\. Lewis \(2023\)Measuring and narrowing the compositionality gap in language models\.InFindings of the Association for Computational Linguistics: EMNLP 2023,pp\. 5687–5711\.External Links:[Link](https://aclanthology.org/2023.findings-emnlp.378/)Cited by:[4th item](https://arxiv.org/html/2606.03239#A2.I1.i4.p1.1),[§1](https://arxiv.org/html/2606.03239#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.03239#S4.SS1.SSS0.Px1.p1.1)\.
- S\. Rajbhandari, J\. Rasley, O\. Ruwase, and Y\. He \(2020\)Zero: Memory optimizations toward training trillion parameter models\.InSC20: international conference for high performance computing, networking, storage and analysis,pp\. 1–16\.External Links:[Link](https://doi.org/10.1109/SC41405.2020.00024)Cited by:[§B\.2](https://arxiv.org/html/2606.03239#A2.SS2.SSS0.Px1.p1.1)\.
- R\. Shao, A\. Asai, S\. Z\. Shen, H\. Ivison, V\. Kishore, J\. Zhuo, X\. Zhao, M\. Park, S\. G\. Finlayson, D\. Sontag, T\. Murray, S\. Min, P\. Dasigi, L\. Soldaini, F\. Brahman, W\. Yih, T\. Wu, L\. Zettlemoyer, Y\. Kim, H\. Hajishirzi, and P\. W\. Koh \(2026\)DR tulu: reinforcement learning with evolving rubrics for deep research\.External Links:2511\.19399,[Link](https://arxiv.org/abs/2511.19399)Cited by:[§2\.2](https://arxiv.org/html/2606.03239#S2.SS2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu, and D\. Guo \(2024\)Deepseekmath: Pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.External Links:[Link](https://arxiv.org/abs/2402.03300v3)Cited by:[§1](https://arxiv.org/html/2606.03239#S1.p1.1),[§1](https://arxiv.org/html/2606.03239#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.03239#S4.SS1.SSS0.Px2.p1.1)\.
- L\. Sheng, W\. Ma, R\. Hong, X\. Wang, A\. Zhang, and T\. Chua \(2026\)Reinforcing Chain\-of\-Thought Reasoning with Self\-Evolving Rubrics\.arXiv preprint arXiv:2602\.10885\.External Links:[Link](https://arxiv.org/abs/2602.10885v1)Cited by:[§2\.2](https://arxiv.org/html/2606.03239#S2.SS2.p1.1)\.
- H\. Song, J\. Jiang, Y\. Min, J\. Chen, Z\. Chen, W\. X\. Zhao, L\. Fang, and J\. Wen \(2025\)R1\-searcher: Incentivizing the search capability in llms via reinforcement learning\.arXiv preprint arXiv:2503\.05592\.External Links:[Link](https://arxiv.org/abs/2503.05592v2)Cited by:[§1](https://arxiv.org/html/2606.03239#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2022\)MuSiQue: Multihop Questions via Single\-hop Question Composition\.Transactions of the Association for Computational Linguistics10,pp\. 539–554\.External Links:[Link](https://aclanthology.org/2022.tacl-1.31/)Cited by:[3rd item](https://arxiv.org/html/2606.03239#A2.I1.i3.p1.1),[§4\.1](https://arxiv.org/html/2606.03239#S4.SS1.SSS0.Px1.p1.1)\.
- P\. Wang, L\. Li, Z\. Shao, R\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui \(2024\)Math\-shepherd: Verify and reinforce llms step\-by\-step without human annotations\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 9426–9439\.External Links:[Link](https://aclanthology.org/2024.acl-long.510/)Cited by:[§1](https://arxiv.org/html/2606.03239#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- P\. Wang, Linus, P\. Liu, Z\. Sang, C\. Xie, and H\. Yang \(2025\)InfiMed\-orbit: aligning llms on open\-ended complex tasks via rubric\-based incremental training\.External Links:2510\.15859,[Link](https://arxiv.org/abs/2510.15859)Cited by:[§2\.2](https://arxiv.org/html/2606.03239#S2.SS2.p1.1)\.
- Y\. Xi, J\. Lin, Y\. Xiao, Z\. Zhou, R\. Shan, T\. Gao, J\. Zhu, W\. Liu, Y\. Yu, and W\. Zhang \(2025\)A survey of llm\-based deep search agents: Paradigm, optimization, evaluation, and challenges\.arXiv preprint arXiv:2508\.05668\.Cited by:[§1](https://arxiv.org/html/2606.03239#S1.p1.1)\.
- T\. Xia, M\. Xu, L\. Hu, Y\. Sun, W\. Li, L\. Shang, L\. Liu, P\. Shu, H\. Yu, and J\. Jiang \(2026\)Search\-P1: Path\-Centric Reward Shaping for Stable and Efficient Agentic RAG Training\.arXiv preprint arXiv:2602\.22576\.External Links:[Link](https://arxiv.org/abs/2602.22576v1)Cited by:[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- L\. Xie, S\. Huang, Z\. Zhang, A\. Zou, Y\. Zhai, D\. Ren, K\. Zhang, H\. Hu, B\. Liu, H\. Chen, Z\. Liu, and B\. Ding \(2026a\)Auto\-rubric: learning from implicit weights to explicit rubrics for reward modeling\.External Links:2510\.17314,[Link](https://arxiv.org/abs/2510.17314)Cited by:[§2\.2](https://arxiv.org/html/2606.03239#S2.SS2.p1.1)\.
- Y\. Xie, N\. Thomas, N\. Hansen, Y\. Fu, L\. E\. Li, and X\. Wang \(2026b\)TIPS: Turn\-level Information\-Potential Reward Shaping for Search\-Augmented LLMs\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=eBMOr6a84z)Cited by:[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- P\. Xu, Z\. Li, X\. Xing, G\. Zhang, D\. Li, and K\. Shi \(2025\)Hybrid reward normalization for process\-supervised non\-verifiable agentic tasks\.arXiv preprint arXiv:2509\.25598\.External Links:[Link](https://arxiv.org/abs/2509.25598v1)Cited by:[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- R\. Xu, T\. Liu, Z\. Dong, T\. Yu, I\. Hong, C\. Yang, L\. Zhang, T\. Zhao, and H\. Wang \(2026\)Alternating reinforcement learning for rubric\-based reward modeling in non\-verifiable llm post\-training\.arXiv preprint arXiv:2602\.01511\.External Links:[Link](https://arxiv.org/abs/2602.01511v2)Cited by:[§2\.2](https://arxiv.org/html/2606.03239#S2.SS2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§4\.1](https://arxiv.org/html/2606.03239#S4.SS1.SSS0.Px2.p1.1)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\)HotpotQA: A dataset for diverse, explainable multi\-hop question answering\.InProceedings of the 2018 conference on empirical methods in natural language processing,pp\. 2369–2380\.External Links:[Link](https://aclanthology.org/D18-1259/)Cited by:[1st item](https://arxiv.org/html/2606.03239#A2.I1.i1.p1.1),[§4\.1](https://arxiv.org/html/2606.03239#S4.SS1.SSS0.Px1.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: Synergizing Reasoning and Acting in Language Models\.In11th International Conference on Learning Representations, ICLR 2023,External Links:[Link](https://openreview.net/forum?id=WE_vluYUL-X)Cited by:[§1](https://arxiv.org/html/2606.03239#S1.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, J\. Liu, L\. Liu, X\. Liu, H\. Lin, Z\. Lin, B\. Ma, G\. Sheng, Y\. Tong, C\. Zhang, M\. Zhang, R\. Zhang, W\. Zhang, H\. Zhu, J\. Zhu, J\. Chen, J\. Chen, C\. Wang, H\. Yu, Y\. Song, X\. Wei, H\. Zhou, J\. Liu, W\. Ma, Y\. Zhang, L\. Yan, Y\. Wu, and M\. Wang \(2025\)DAPO: an open\-source llm reinforcement learning system at scale\.InAdvances in Neural Information Processing Systems,D\. Belgrave, C\. Zhang, H\. Lin, R\. Pascanu, P\. Koniusz, M\. Ghassemi, and N\. Chen \(Eds\.\),Vol\.38,pp\. 113222–113244\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2025/file/a4277440d50f1f15d2cb4c14f7e0c0d2-Paper-Conference.pdf)Cited by:[§1](https://arxiv.org/html/2606.03239#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.03239#S4.SS1.SSS0.Px2.p1.1)\.
- Q\. Zhang, S\. Yang, L\. Gao, H\. Chen, X\. Hu, J\. Chen, J\. Wang, S\. Guo, B\. Zheng, H\. Wang, and J\. Zhao \(2025a\)LeTS: learning to think\-and\-search via process\-and\-outcome reward hybridization\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 5109–5122\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.257/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.257),ISBN 979\-8\-89176\-332\-6Cited by:[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- W\. Zhang, X\. Li, K\. Dong, Y\. Wang, P\. Jia, X\. Li, Y\. Zhang, D\. Xu, Z\. Du, H\. Guo, R\. Tang, and X\. Zhao \(2025b\)Process vs\. outcome reward: which is better for agentic rag reinforcement learning\.InAdvances in Neural Information Processing Systems,D\. Belgrave, C\. Zhang, H\. Lin, R\. Pascanu, P\. Koniusz, M\. Ghassemi, and N\. Chen \(Eds\.\),Vol\.38,pp\. 58701–58729\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2025/file/54e1381d0c0598127b90af4c940fd3d9-Paper-Conference.pdf)Cited by:[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- X\. Zheng, K\. An, Z\. Wang, Y\. Wang, and Y\. Wu \(2025\)StepSearch: igniting LLMs search ability via step\-wise proximal policy optimization\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 21805–21830\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1106/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1106),ISBN 979\-8\-89176\-332\-6Cited by:[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- Y\. Zheng, R\. Zhang, J\. Zhang, Y\. Ye, and Z\. Luo \(2024\)Llamafactory: Unified efficient fine\-tuning of 100\+ language models\.InProceedings of the 62nd annual meeting of the association for computational linguistics \(volume 3: system demonstrations\),pp\. 400–410\.External Links:[Link](https://aclanthology.org/2024.acl-demos.38/)Cited by:[§B\.2](https://arxiv.org/html/2606.03239#A2.SS2.SSS0.Px1.p1.1)\.
- W\. Zhong, J\. Yang, Y\. Wu, Y\. Liu, J\. Yao, and K\. Kuang \(2026\)SIGHT: Reinforcement Learning with Self\-Evidence and Information\-Gain Diverse Branching for Search Agent\.arXiv preprint arXiv:2602\.11551\.External Links:[Link](https://arxiv.org/abs/2602.11551v1)Cited by:[§2\.1](https://arxiv.org/html/2606.03239#S2.SS1.p1.1)\.
- Y\. Zhou, S\. Li, S\. Liu, W\. Fang, K\. Zhang, J\. Zhao, J\. Yang, Y\. Zhou, J\. Lv, T\. Zheng, H\. Lu, W\. Chen, Y\. Xie, and M\. Song \(2026\)Breaking the exploration bottleneck: rubric\-scaffolded reinforcement learning for general llm reasoning\.External Links:2508\.16949,[Link](https://arxiv.org/abs/2508.16949)Cited by:[§2\.2](https://arxiv.org/html/2606.03239#S2.SS2.p1.1)\.
- Z\. Zhu, C\. Xie, X\. Lv, and slime Contributors \(2025\)Slime: An LLM post\-training framework for RL Scaling\.Note:[https://github\.com/THUDM/slime](https://github.com/THUDM/slime)GitHub repository\. Corresponding author: Xin LvCited by:[§B\.2](https://arxiv.org/html/2606.03239#A2.SS2.SSS0.Px2.p1.2),[§4\.1](https://arxiv.org/html/2606.03239#S4.SS1.SSS0.Px3.p1.2)\.

## Appendix AMethod Details

### A\.1Contrastive Induction Corner Cases

This section records the rules ARBOR follows when correctness scores within a query\-group tie and when the two contrast pairs collapse into one, complementing Section [3\.2](https://arxiv.org/html/2606.03239#S3.SS2)\.

#### Tie\-breaking for anchor selection\.

When multiple trajectories share the highest F1, $\tau^{+}$ is the one with the shortest response, and the same secondary key applies when selecting $\tau^{-}_{\text{hard}}$ and $\tau^{-}_{\text{worst}}$\.

#### Collapsed contrast pairs\.

When only one trajectory falls strictly below $\tau^{+}$ in F1, $\tau^{-}_{\text{hard}}$ and $\tau^{-}_{\text{worst}}$ resolve to the same trajectory; in this case only the $(\tau^{+}, \tau^{-}_{\text{hard}})$ pair is sent to the LLM\. When no trajectory falls below $\tau^{+}$, the group has no correctness divergence and induction is skipped entirely\.

#### Homogeneous groups\.

For all\-correct or all\-wrong groups, new rubric generation is skipped entirely; existing common rubrics in $\mathcal{P}$ still score the group for reward shaping\. For groups in which all trajectories share the same intermediate F1 value, no contrast pairs can be formed, and induction falls back to sending all trajectories to the LLM without positive/negative labels, letting it identify process differences directly\. Induction is also skipped when fewer than two format\-valid trajectories exist in the group\.
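
The corner\-case rules in this appendix can be sketched as a single planning function\. This is a minimal illustration under stated assumptions, not the authors' implementation: Section 3\.2 is not reproduced here, so the sketch assumes that $\tau^{-}_{\text{hard}}$ is the highest\-F1 trajectory strictly below $\tau^{+}$ and $\tau^{-}_{\text{worst}}$ is the lowest\-F1 one, and that only format\-valid trajectories are eligible as anchors\. The `Trajectory` class, the `plan_induction` name, and the mode strings are hypothetical\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    response: str
    f1: float
    format_valid: bool = True


def plan_induction(group):
    """Return (mode, pairs); mode is 'skip', 'unlabeled', or 'contrastive'."""
    # ASSUMPTION: only format-valid trajectories can serve as anchors.
    valid = [t for t in group if t.format_valid]
    if len(valid) < 2:
        return "skip", []

    f1_values = {t.f1 for t in valid}
    if len(f1_values) == 1:
        only = next(iter(f1_values))
        if only in (0.0, 1.0):
            # All-correct or all-wrong: no new rubrics are induced.
            return "skip", []
        # Uniform intermediate F1: send all trajectories without labels.
        return "unlabeled", []

    # Positive anchor: highest F1, shortest response on ties.
    positive = min(valid, key=lambda t: (-t.f1, len(t.response)))
    negatives = [t for t in valid if t.f1 < positive.f1]
    # ASSUMPTION: "hard" is the closest negative, "worst" the lowest-F1 one.
    hard = min(negatives, key=lambda t: (-t.f1, len(t.response)))
    worst = min(negatives, key=lambda t: (t.f1, len(t.response)))
    if hard is worst:
        # Collapsed contrast pair: only one pair is sent to the LLM.
        return "contrastive", [(positive, hard)]
    return "contrastive", [(positive, hard), (positive, worst)]


group = [
    Trajectory("a long correct answer", 1.0),
    Trajectory("short", 1.0),
    Trajectory("partial", 0.5),
    Trajectory("wrong answer", 0.0),
    Trajectory("bad", 0.0),
]
mode, pairs = plan_induction(group)
```

Under these assumptions the example group yields two pairs anchored on the shortest perfect\-F1 trajectory, while a group with a single trajectory below the anchor yields one\.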

## Appendix BTraining and Evaluation Details

### B\.1Evaluation Benchmarks

We evaluate on the following four multi\-hop QA benchmarks:

- HotpotQA \(Yang et al\., [2018](https://arxiv.org/html/2606.03239#bib.bib3)\) is a Wikipedia\-based multi\-hop QA benchmark that requires aggregating evidence across multiple supporting documents to answer a question\. It is characterized by two representative question types: reasoning over sentence\-level supporting facts across documents, and comparison questions that contrast properties of two entities\.
- 2WikiMultiHopQA \(Ho et al\., [2020](https://arxiv.org/html/2606.03239#bib.bib4)\) is a multi\-hop QA dataset constructed from both Wikipedia and Wikidata, combining unstructured text with structured knowledge to support multi\-step reasoning\. It provides explicit evidence and reasoning paths, ensuring that answering each question requires genuine multi\-hop inference\.
- MuSiQue \(Trivedi et al\., [2022](https://arxiv.org/html/2606.03239#bib.bib5)\) is designed to resist shortcut solutions by composing connected single\-hop questions into multi\-hop ones, such that proper multi\-hop reasoning is required by construction\. Unlike earlier benchmarks that can often be resolved with partial evidence, MuSiQue ensures that each step of the reasoning chain is necessary to reach the correct answer\.
- Bamboogle \(Press et al\., [2023](https://arxiv.org/html/2606.03239#bib.bib6)\) is a hand\-crafted benchmark of 125 two\-hop compositional questions designed to be difficult for standard search engines to answer directly, while the supporting evidence for each question can be found in Wikipedia\. It evaluates models on varied compositional questions that require combining information across multiple retrieval steps\.

Following Tool-Star (Dong et al., [2025](https://arxiv.org/html/2606.03239#bib.bib45)) and ARPO (Dong et al., [2026](https://arxiv.org/html/2606.03239#bib.bib17)), we evaluate on fixed held-out test splits: 200 examples for HotpotQA, 200 for 2WikiMultiHopQA, 200 for MuSiQue, and 125 for Bamboogle.

### B.2 SFT and RL Training

#### Supervised Fine\-Tuning\.

Starting from Tool-Star (Dong et al., [2025](https://arxiv.org/html/2606.03239#bib.bib45)) and STILL (Min et al., [2024](https://arxiv.org/html/2606.03239#bib.bib46)), we filter out all examples that invoke Python as a tool and retain only search-tool interactions, yielding approximately 16K training examples. Fine-tuning is conducted with LLaMAFactory (Zheng et al., [2024](https://arxiv.org/html/2606.03239#bib.bib47)) using full-parameter optimization, DeepSpeed ZeRO Stage 3 (Rajbhandari et al., [2020](https://arxiv.org/html/2606.03239#bib.bib49)), and FlashAttention-2 (Dao, [2024](https://arxiv.org/html/2606.03239#bib.bib50)). We train for 3 epochs with a maximum sequence length of 15,000 tokens, an effective batch size of 16, a learning rate of $7\times 10^{-6}$ with cosine decay and 10% linear warmup, in BF16 precision.
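
To make the learning-rate schedule concrete, the following is a minimal Python sketch of linear warmup over the first 10% of steps followed by cosine decay from the peak rate of $7\times 10^{-6}$. The function `lr_at` is hypothetical, and the sketch assumes the rate starts at zero and decays to zero; the total of 3,000 optimizer steps used in the usage note is an estimate derived from roughly 16K examples, 3 epochs, and an effective batch size of 16, not a figure reported by the authors.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 7e-6, warmup_ratio: float = 0.10) -> float:
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions, `lr_at(300, 3000)` returns the peak rate and the schedule decreases monotonically from there to the final step.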

#### Reinforcement Learning\.

RL training runs for 60 rollout steps using Slime (Zhu et al., [2025](https://arxiv.org/html/2606.03239#bib.bib54)). We sample 8 trajectories per prompt with a rollout batch size of 32, sampling temperature 0.7, a maximum context length of 36,864 tokens, and a maximum response length of 8,192 tokens. Each trajectory is limited to at most 12 agent turns and 10 search calls. The policy is updated with GRPO with token-level importance sampling (TIS), KL and entropy coefficients set to 0, and symmetric clipping with $\epsilon_{\text{clip}} = 0.2$. Optimization uses Adam with learning rate $1\times 10^{-6}$ and weight decay 0.01.
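
The sketch below illustrates, in plain Python, the two ingredients of this update that matter most for ARBOR: group-relative advantages over the 8 sampled trajectories, and the symmetric clipped surrogate with $\epsilon_{\text{clip}} = 0.2$. It is a simplified illustration rather than the Slime implementation: the helper names are hypothetical, standard-deviation normalization is assumed (some GRPO variants only subtract the mean), and TIS is not modeled.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages for one prompt's rollouts (assumed mean/std form)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, eps_clip: float = 0.2) -> float:
    """Per-token clipped surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    return min(ratio * advantage, clipped * advantage)
```

When all 8 rewards in a group are identical, `group_advantages` returns all zeros, so every clipped term vanishes. This is the zero-gradient case that the process reward is meant to address.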

### B.3 ARBOR Hyperparameters

Table [4](https://arxiv.org/html/2606.03239#A2.T4) lists the default ARBOR hyperparameters used in the main experiments. We elaborate on the key design choices below.

Table 4: Default ARBOR hyperparameters used in the main experiments. $K_{\text{consol}} = 8$ is the number of candidate drafts that must accumulate in $\mathcal{D}$ before a consolidation round is triggered. $\kappa = 5$ is the maximum number of consecutive low-variance activations a common rubric is allowed before retirement. The deduplication similarity threshold is set to 0.9, above which a newly produced common rubric is considered a duplicate of an existing one under BGE-M3 (Chen et al., [2024](https://arxiv.org/html/2606.03239#bib.bib53)) cosine similarity.
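
A minimal Python sketch of how the deduplication threshold and the retirement counter described in this caption could be applied is given below. Everything beyond the two constants is an assumption: the embeddings are plain float lists standing in for BGE-M3 vectors, the `min_variance` cutoff that defines a low-variance activation is a hypothetical parameter, and retirement is assumed to fire once the streak exceeds $\kappa$ (the exact boundary is not stated here).

```python
import math
from typing import List, Sequence

DEDUP_THRESHOLD = 0.9  # cosine similarity above this marks a duplicate
KAPPA = 5              # consecutive low-variance activations allowed


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def is_duplicate(new_vec: Sequence[float], existing: List[Sequence[float]]) -> bool:
    """True if the new rubric embedding is too close to any existing one."""
    return any(cosine(new_vec, e) > DEDUP_THRESHOLD for e in existing)


class RetirementCounter:
    """Tracks consecutive low-variance activations of one common rubric."""

    def __init__(self, kappa: int = KAPPA, min_variance: float = 1e-4):
        self.kappa = kappa
        self.min_variance = min_variance  # hypothetical cutoff
        self.streak = 0

    def update(self, scores: Sequence[float]) -> bool:
        """Record one activation; return True if the rubric should retire."""
        mean = sum(scores) / len(scores)
        var = sum((s - mean) ** 2 for s in scores) / len(scores)
        self.streak = self.streak + 1 if var < self.min_variance else 0
        return self.streak > self.kappa
```

Any activation with enough score spread resets the streak, so only rubrics that repeatedly fail to separate trajectories are retired.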

## Appendix C Additional Experimental Results

### C.1 Format Sensitivity of TIR Prompting

The official evaluation used in Table [1](https://arxiv.org/html/2606.03239#S4.T1) is strict-format gated: a response is scored only if it produces a non-empty prediction and terminates with the required final-answer format. If the termination reason indicates a formatting or finalization failure, all official metrics are set to zero, including EM, F1, and LLM-judge accuracy. This protocol matches the training reward, where finalization is part of the task rather than a post-hoc parsing detail.

To diagnose the non-monotonic TIR Prompting results in the main table, Table [5](https://arxiv.org/html/2606.03239#A3.T5) separates formatting validity from answer quality. *Official* accuracy follows the strict protocol above, *valid-only* accuracy evaluates only responses that satisfy the required format, and *recovery* is the gap between these two LLM-judge accuracies.

Table 5: Format sensitivity of TIR Prompting. All values are percentages averaged over four benchmarks and five evaluation runs.

Table [5](https://arxiv.org/html/2606.03239#A3.T5) shows that the non-monotonic TIR results are mainly driven by formatting failures. Qwen3-8B and Qwen3-14B have much lower valid rates than Qwen3-4B, which sharply lowers their official LLM-judge accuracies. On valid responses, however, they reach 48.6% and 52.4% LLM-judge accuracy, comparable to or higher than Qwen3-4B. Thus, the apparent weakness of larger TIR backbones primarily reflects unreliable prompt-only finalization under a strict protocol, not weaker search reasoning. We keep the strict official scores in the main table because correct finalization is part of the tool-use task and is also enforced during RL training.
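
The relationship between the three quantities can be sketched in a few lines of Python. The `Response` record and the `format_sensitivity` helper are hypothetical names introduced for illustration; the sketch assumes one boolean per response for format validity and one for the judge verdict, and returns fractions rather than percentages.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Response:
    format_valid: bool     # passed the strict finalization check
    judged_correct: bool   # LLM-judge verdict on the extracted answer


def format_sensitivity(responses: List[Response]) -> Dict[str, float]:
    n = len(responses)
    valid = [r for r in responses if r.format_valid]
    # Official: format failures count as wrong, whatever the answer says.
    official = sum(r.format_valid and r.judged_correct for r in responses) / n
    valid_rate = len(valid) / n
    # Valid-only: accuracy restricted to format-valid responses.
    valid_only = sum(r.judged_correct for r in valid) / len(valid) if valid else 0.0
    return {
        "valid_rate": valid_rate,
        "official": official,
        "valid_only": valid_only,
        "recovery": valid_only - official,
    }
```

Because invalid responses score zero, official accuracy equals the valid rate times the valid-only accuracy on any single evaluation set, which is why a low valid rate can drag down a backbone whose valid answers are accurate.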

### C.2 Cross-Judge Robustness

To check whether the main LLM-judge trends depend on the same model family used by ARBOR’s reward construction, we re-evaluate all final answers with DeepSeek-V4-Pro (DeepSeek-AI, [2026](https://arxiv.org/html/2606.03239#bib.bib56)) as an independent post-hoc judge. DeepSeek-V4-Pro is not used for rubric induction, consolidation, pairwise scoring, or reward construction. Table [6](https://arxiv.org/html/2606.03239#A3.T6) compares average LLM-judge accuracy under Qwen3-Plus and DeepSeek-V4-Pro. The independent judge preserves the same ranking: ARBOR remains the best method at all three scales, and its gains over GRPO remain close to the Qwen3-Plus gains.

Table 6: Cross-judge robustness of average LLM-judge accuracy. Qwen3-Plus is the main evaluation judge, while DeepSeek denotes DeepSeek-V4-Pro used only as an independent post-hoc judge. $\Delta$ denotes DeepSeek accuracy minus Qwen3-Plus accuracy. Gain over GRPO denotes the average accuracy difference from GRPO under the corresponding judge.

Overall, DeepSeek-V4-Pro is slightly stricter than Qwen3-Plus: its average scores are lower by 0.4–1.2 points across methods, but the drop is small and uniform. Despite this stricter independent judge, the relative ordering is preserved and ARBOR’s gains over GRPO remain nearly unchanged (+3.6, +4.0, and +1.9 points at 4B, 8B, and 14B). Thus, the main LLM-judge conclusion is robust across judge families rather than driven by a Qwen3-Plus-specific preference.

### C.3 Setup and Full Results for Rubric Memory Ablation

Table [7](https://arxiv.org/html/2606.03239#A3.T7) provides the per-benchmark results for the rubric memory ablation discussed in Section [4.4](https://arxiv.org/html/2606.03239#S4.SS4).

RaR-style is our search-QA adaptation of Rubrics-as-Rewards (Gunjal et al., [2026](https://arxiv.org/html/2606.03239#bib.bib34)), instantiated as a single-use instance-specific rubric reward. For each training query, the external rubric-generation LLM generates 5–10 reference-conditioned rubric items from only the question and ground-truth answer, without seeing sampled rollout trajectories. During RL, the judge model scores each full agent trajectory under the query-specific rubric set and returns a holistic 1–10 rating; we normalize it to $[0,1]$, center it within the rollout group, and add it to the base F1/format reward with the same shaping coefficient as ARBOR. The generated rubrics are used only for their source query group and are never reused by later groups.
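
The reward shaping used by this baseline can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, the 1–10 rating is assumed to map to $[0,1]$ via $(r-1)/9$, and the shaping coefficient value of 0.1 is a placeholder since its actual value is not given in this appendix.

```python
from typing import List


def rar_shaped_rewards(base_rewards: List[float], ratings: List[int],
                       shaping_coef: float = 0.1) -> List[float]:
    """Add a group-centered, normalized rubric rating to each base reward."""
    # Assumed min-max normalization of the holistic 1-10 rating to [0, 1].
    normalized = [(r - 1) / 9 for r in ratings]
    group_mean = sum(normalized) / len(normalized)
    # Centering within the rollout group keeps the shaping term zero-mean.
    return [b + shaping_coef * (s - group_mean)
            for b, s in zip(base_rewards, normalized)]
```

For an all-wrong group with base rewards of zero, distinct ratings still produce distinct shaped rewards, while identical ratings leave the group exactly as uninformative as before.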

The w/o memory variant instead keeps ARBOR’s contrastive process\-rubric induction from sampled trajectories, but discards each induced rubric after scoring its source query\-group\. Thus both variants have no reusable memory, while differing in whether their rubrics are reference\-conditioned instance criteria or contrastively induced process criteria\.

Table 7: Per-benchmark results for the rubric memory ablation. Average columns report the mean over the four benchmarks; Section [4.4](https://arxiv.org/html/2606.03239#S4.SS4) discusses the averaged trends.

### C.4 Reuse Distribution

![Refer to caption](https://arxiv.org/html/2606.03239v1/x5.png)

Figure 5: Cumulative share of scoring events covered by the top fraction of common rubrics, ranked by reuse count.

Figure [5](https://arxiv.org/html/2606.03239#A3.F5) plots the cumulative share of scoring events as a function of the fraction of common rubrics ranked by reuse. The curve sits well above the even-distribution diagonal. The top 5% of rubrics cover 20% of all scoring events, and the top 20% cover 44%. This concentration arises from the active selection mechanism in Section [3.4](https://arxiv.org/html/2606.03239#S3.SS4), which preferentially activates rubrics with high cumulative correlation, so that a small core of validated rubrics accumulates most scoring events.
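
A curve of this kind can be computed from per-rubric reuse counts with a short Python helper. The function below is a hypothetical sketch; it assumes the top fraction is rounded up to a whole number of rubrics, and any counts passed to it in examples are made up for illustration rather than taken from the paper's logs.

```python
import math
from typing import List


def cumulative_share(reuse_counts: List[int], top_fraction: float) -> float:
    """Share of all scoring events covered by the most-reused rubrics."""
    ranked = sorted(reuse_counts, reverse=True)
    # Round the fraction up so that at least one rubric is always included.
    k = max(1, math.ceil(top_fraction * len(ranked)))
    return sum(ranked[:k]) / sum(ranked)
```

With perfectly even reuse the function returns the diagonal (the top 20% of rubrics cover 20% of events), so values above the input fraction indicate concentration on a core subset.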

## Appendix D Consolidation Case Study

We illustrate the consolidation process with a concrete example. Table [8](https://arxiv.org/html/2606.03239#A4.T8) shows three query-local candidate rubrics induced from different queries during RL training. Each candidate describes the same underlying process behavior, resolving the target entity with canonical naming before querying its attributes, but in different phrasings shaped by the specific query context.

The consolidation module receives these and other similar candidates as input, identifies the shared process pattern, and produces the common rubric *Precise Entity-Attribute Targeting with Canonical Framing*. The resulting common rubric unifies three variant phrasings of the same process standard into a single reusable criterion, while the query-specific framing of each candidate is abstracted away. This abstraction is what enables the common rubric to be reused across 101 query-groups throughout training.

Table 8: Consolidation case study for reusable rubric memory. Three query-local candidate rubrics from different queries describe the same entity-attribute targeting pattern in query-specific language, and the consolidation module abstracts them into a reusable common rubric.

## Appendix E Prompts

This section reproduces the four prompts used in ARBOR\. Placeholders inside braces are filled at runtime, and ellipses denote repeated structures over the corresponding lists\.

### E.1 Contrastive Induction

The induction prompt has a fixed instruction header followed by a per-call payload that fills in the question itself, the optional ground-truth answer, the contrast pairs, and the existing rubrics; the input fields are listed at the end of the header. The full header is reproduced in Table [9](https://arxiv.org/html/2606.03239#A5.T9).

Table 9: Contrastive induction prompt. The prompt induces query-local process rubrics from contrastive trajectory pairs.

### E.2 Consolidation

The consolidation prompt is invoked when the candidate pool $\mathcal{D}$ accumulates enough drafts. Existing common rubrics in $\mathcal{P}$ are passed in as a no-duplicate constraint, and the candidate drafts are grouped by query of origin. The full prompt is shown in Table [10](https://arxiv.org/html/2606.03239#A5.T10).
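
The trigger and payload assembly described here might look like the Python sketch below. It is illustrative only: the `DraftPool` class and the payload field names are hypothetical, the trigger uses the default $K_{\text{consol}} = 8$ from Table 4, and clearing the pool after a consolidation round is an assumption. The prompt text itself is not reproduced.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

K_CONSOL = 8  # drafts required before a consolidation round


class DraftPool:
    """Candidate pool D of query-local rubric drafts."""

    def __init__(self, k_consol: int = K_CONSOL):
        self.k_consol = k_consol
        self.drafts: List[Tuple[str, str]] = []  # (query_id, rubric_text)

    def add(self, query_id: str, rubric_text: str) -> bool:
        """Store a draft; return True once consolidation should be triggered."""
        self.drafts.append((query_id, rubric_text))
        return len(self.drafts) >= self.k_consol

    def build_payload(self, common_rubrics: List[str]) -> Dict[str, object]:
        """Group drafts by query of origin and attach the existing rubrics."""
        grouped: Dict[str, List[str]] = defaultdict(list)
        for query_id, text in self.drafts:
            grouped[query_id].append(text)
        payload = {
            "existing_common_rubrics": list(common_rubrics),  # no-duplicate constraint
            "candidates_by_query": dict(grouped),
        }
        self.drafts = []  # assumption: the pool is emptied after each round
        return payload
```

Grouping by query of origin lets the consolidating model see which phrasings came from the same context, which is what allows it to separate a recurring process pattern from query-specific wording.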

Table 10: Consolidation prompt. The prompt distills recurring patterns from candidate rubrics into reusable common rubrics while avoiding duplicates.

### E.3 Pairwise Judge

Pairwise judging is invoked once per edge in the sparse comparison graph (Section [3.4](https://arxiv.org/html/2606.03239#S3.SS4)). The system prompt fixes the win/tie/loss output format, and the user prompt fills in the question, both responses, and the criterion derived from the active common rubric. Order randomization is applied at the call site by swapping which response receives the A label. The full prompt is shown in Table [11](https://arxiv.org/html/2606.03239#A5.T11).
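
Order randomization at the call site can be sketched in Python as below. The `judge` argument is a hypothetical callable standing in for the LLM call, and the sketch assumes its verdict is expressed from the perspective of the response labeled A; the function then maps the verdict back to the caller's original ordering.

```python
import random
from typing import Callable

# Verdict as seen from the other response's side.
FLIP = {"win": "loss", "loss": "win", "tie": "tie"}


def judge_pair(question: str, resp_i: str, resp_j: str, criterion: str,
               judge: Callable[[str, str, str, str], str],
               rng: random.Random) -> str:
    """Return "win", "tie", or "loss" from resp_i's perspective."""
    swapped = rng.random() < 0.5           # randomly decide who gets label A
    a, b = (resp_j, resp_i) if swapped else (resp_i, resp_j)
    verdict = judge(question, a, b, criterion)  # assumed: verdict is for A
    if verdict not in FLIP:
        raise ValueError(f"unexpected judge output: {verdict!r}")
    return FLIP[verdict] if swapped else verdict
```

Swapping the labels and un-swapping the verdict keeps the returned outcome independent of position, so any residual preference of the judge for the first slot averages out over many edges.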

Table 11: Pairwise judge prompt. The prompt compares two trajectories under one active common rubric and outputs a win, tie, or loss.

### E.4 LLM-Judge Accuracy Evaluation

Final-answer accuracy on the four benchmarks is reported with an LLM judge that compares the predicted answer against the gold answer. The judge is invoked with the prompt in Table [12](https://arxiv.org/html/2606.03239#A5.T12) and is configured to output the literal string Correct or Incorrect.
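
A small Python sketch of how such literal outputs could be turned into an accuracy figure follows. The helper names are hypothetical, and raising an error on any other output is an assumption; how malformed judge outputs are handled in the actual evaluation is not described here. Exact matching is used because a case-insensitive substring test for "correct" would also match "Incorrect".

```python
from typing import List


def parse_verdict(raw: str) -> bool:
    """Map the judge's literal output to a boolean."""
    text = raw.strip()
    if text == "Correct":
        return True
    if text == "Incorrect":
        return False
    # Assumption: anything else is treated as an error, not silently scored.
    raise ValueError(f"unexpected judge output: {raw!r}")


def judge_accuracy(raw_outputs: List[str]) -> float:
    """LLM-judge accuracy as a percentage over a list of judge outputs."""
    verdicts = [parse_verdict(r) for r in raw_outputs]
    return 100.0 * sum(verdicts) / len(verdicts)
```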

Table 12: LLM-judge accuracy evaluation prompt. The prompt judges whether a predicted final answer matches the gold answer.
