Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Summary
Co-ReAct introduces a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference for ReAct agents, improving trajectory quality and outperforming baselines on DeepResearchBench and SQA-CS-V2.
Cached at: 05/25/26, 08:58 AM
# Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Source: [https://arxiv.org/html/2605.23590](https://arxiv.org/html/2605.23590)
Jiazheng Kang¹\*, Bowen Zhang²\*, Zixin Song²\*, Jiangwang Chen²\*, Xiao Yang¹, Da Zhu¹, Guanjun Jiang¹

¹Qwen Applications Business Group of Alibaba, ²Tsinghua University

{kangjiazheng.kjz, yx501135, zhuda.zd, guanj.jianggj}@alibaba-inc.com, {zbw23, songzx24, jw-chen24}@mails.tsinghua.edu.cn

\*Equal contribution.
###### Abstract
ReAct\-style agents for search\-intensive, multi\-step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories\. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action\-guiding: rubrics typically serve as training\-time rewards or post\-hoc evaluators of completed outputs, and in deep\-research settings they are often coarse\-grained and report\-level rather than step\-level\. We introduce Co\-ReAct, a rubric\-guided action\-selection framework that uses rubrics as step\-level guidance during inference\. At each decision step, Co\-ReAct injects a rubric into the agent’s context to guide the next Reason\-or\-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self\-evaluation\. To make this guidance reliable, we train a dedicated rubric generator with GRPO\. Unlike prior pairwise or binary preference formulations, our objective optimizes a list\-wise Spearman rank\-correlation reward against multi\-judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible\. On DeepResearchBench and SQA\-CS\-V2, Co\-ReAct consistently improves over ReAct and representative test\-time compute baselines across search agents built on both 8B/14B open\-source and frontier closed\-source base models\. The trained rubric generator can also serve as a drop\-in component that improves these baselines without changing their underlying decision mechanisms\. Our code is publicly available at[https://github\.com/ZBWpro/Co\-ReAct](https://github.com/ZBWpro/Co-ReAct)\.
## 1 Introduction
Deep research agents built on the ReAct paradigm (Yao et al., [2022](https://arxiv.org/html/2605.23590#bib.bib1)) conduct search by repeatedly deciding what evidence to seek, what action to take next, and when to stop. In current systems, these decisions are driven largely by the agent’s own internal judgment. This self-direction can be brittle. Agents may reissue near-duplicate queries, stop before sufficient evidence has been gathered, or rely on a narrow set of sources even when the question would benefit from comparison across multiple perspectives (Wang et al., [2025](https://arxiv.org/html/2605.23590#bib.bib29); Shao et al., [2025a](https://arxiv.org/html/2605.23590#bib.bib30)). The resulting trajectories can therefore become shallow, redundant, or misaligned with the specific demands of the current step. What is missing is an external, verifiable specification of what the next step should accomplish: a step-level signal that tells the agent, at a particular branching point in a particular trajectory, what fine-grained requirements the next action should satisfy.
Rubrics (Popham, [1997](https://arxiv.org/html/2605.23590#bib.bib31)) are a natural candidate for such a specification because they express quality as a small set of checkable criteria. However, existing rubric-based methods use rubrics primarily as evaluative objects rather than guidance signals (Gunjal et al., [2025](https://arxiv.org/html/2605.23590#bib.bib12)). In general LLM alignment, rubrics are commonly used as training-time rewards, judge templates, or post-hoc evaluators of completed outputs (Xu et al., [2026a](https://arxiv.org/html/2605.23590#bib.bib14)). In deep-research settings, rubrics are also typically defined at the level of the final report, where they check whether a completed answer is comprehensive, well-cited, and faithful to the evidence (Lv et al., [2026](https://arxiv.org/html/2605.23590#bib.bib17); Shao et al., [2025b](https://arxiv.org/html/2605.23590#bib.bib11)). These uses answer the question: how much credit does an output already produced deserve? They do not answer the question a search agent faces during inference: given what has already been observed, what concrete requirements should the next action satisfy?
Using rubrics for this prescriptive role requires more than attaching a generic checklist to the prompt (Brookhart, [2018](https://arxiv.org/html/2605.23590#bib.bib32)). First, the rubric must be *step-level*: it should specify what the next action should cover, rather than what the final report should contain. Second, it must be conditioned on the current partial trajectory, because the right next action depends on what the agent has already tried and what evidence it has already found. Third, it must be discriminative: the actions favored by the rubric should actually be better than the actions it penalizes. This last requirement is crucial. As we show in the ablation study, an unreliable rubric may not merely fail to help: when injected into the agent’s context, untrained rubrics can actively mislead the search process and degrade performance.
We therefore propose Co-ReAct, a rubric-guided ReAct framework for deep research. The name Co-ReAct reflects the rubric’s role as a step-level collaborator: before the agent acts, it specifies fine-grained requirements for the next step; after the action is executed, it provides a basis for verification and feedback. Co-ReAct trains a dedicated rubric generator to produce discriminative step-level guidance. Unlike prior rubric-learning methods (Xu et al., [2026b](https://arxiv.org/html/2605.23590#bib.bib35)) that rely on pairwise preferences or binary accept/reject labels, Co-ReAct uses a listwise formulation. At each ReAct decision point, multiple next actions may appear plausible, so the useful signal is not only whether an action is acceptable or better than another, but how a slate of candidate actions should be ranked relative to one another. We therefore sample candidate next actions for each decision point and obtain a multi-judge expert consensus ranking over the full slate. The rubric generator is trained with GRPO (Shao et al., [2024b](https://arxiv.org/html/2605.23590#bib.bib34)) using a Spearman rank-correlation (Spearman, [1904](https://arxiv.org/html/2605.23590#bib.bib33); Song et al., [2025b](https://arxiv.org/html/2605.23590#bib.bib38)) reward between the expert ranking and the ranking induced by the generated rubric. A rubric receives high reward only when its criteria lead to an action ranking that agrees with the expert consensus, encouraging rubrics that induce expert-aligned preferences rather than merely sounding plausible.
At inference time, the rubric generator serves two roles. As a complete system, Co-ReAct extends the standard ReAct loop with an inject–verify–retry procedure. Before each tool call, a trajectory-conditioned rubric is injected into the agent’s context to specify what the next action should target. After the action is proposed but before it is executed, an independent verifier checks the proposed action against the rubric. If the verification passes, the action is accepted; otherwise, the verifier returns feedback on which criteria remain unsatisfied, and the agent regenerates the action accordingly. As a drop-in plug-in, the same trained rubric can also be injected into existing test-time compute methods such as Best-of-N (Snell et al., [2024](https://arxiv.org/html/2605.23590#bib.bib7)), Step-Back (Zheng et al., [2024](https://arxiv.org/html/2605.23590#bib.bib6)), and CRITIC (Gou et al., [2024](https://arxiv.org/html/2605.23590#bib.bib5)) without changing their decision mechanisms. In both cases, the rubric is consumed by the agent at inference time as a step-level action-selection signal, rather than by an optimizer or evaluator after the output has already been produced.
The primary contributions of this work are:
- We recast rubrics from an evaluative object consumed by the training pipeline into a prescriptive, step-level action-selection signal consumed by the agent at inference time. To our knowledge, Co-ReAct is the first system to train rubrics for this role in a ReAct deep research agent.
- We train the rubric generator with a listwise GRPO objective that rewards rank-correlation with multi-judge expert consensus, so the learned rubric is discriminative by construction rather than merely plausible.
- We empirically show that Co-ReAct consistently improves deep-research performance across multiple benchmarks, agent backbones, and test-time compute baselines. Plugging the same learned rubric into existing methods further yields positive transfer, indicating that step-level rubric guidance is complementary to current inference-time enhancement techniques.
## 2 Related Work
### 2.1 ReAct-paradigm enhancements.
A first line of work augments a fixed ReAct agent with extra inference-time computation to improve step-level decisions. Self-Refine (Madaan et al., [2023](https://arxiv.org/html/2605.23590#bib.bib4)) has the agent critique and rewrite its own output; Best-of-N samples multiple parallel trajectories and selects among them with an external or self-scoring model; Step-Back prompts for a higher-level abstraction of the question before acting; CRITIC issues tool-interactive critique queries to verify and correct intermediate steps; Reflexion (Shinn et al., [2023](https://arxiv.org/html/2605.23590#bib.bib3)) and Tree-of-Thought (Yao et al., [2023](https://arxiv.org/html/2605.23590#bib.bib2)) extend the same idea with episodic memory and branching search. In all of these methods the guidance signal (critique, scoring model, or abstraction prompt) is produced by an untrained, prompted LLM. Co-ReAct occupies the same slot in the pipeline but replaces the prompted signal with a GRPO-trained rubric generator whose output is rank-calibrated against expert consensus, and our plug-in study (Sec. [4.6](https://arxiv.org/html/2605.23590#S4.SS6)) shows this trained signal is additive with these methods rather than a substitute for them.
### 2.2 End-to-end trained search agents.
A parallel line of work retrains the search policy with reinforcement learning so that the agent itself issues better queries. Search-R1 (Jin et al., [2025](https://arxiv.org/html/2605.23590#bib.bib8)), R1-Searcher (Song et al., [2025a](https://arxiv.org/html/2605.23590#bib.bib9)), and WebGPT (Nakano et al., [2021](https://arxiv.org/html/2605.23590#bib.bib10)) train the agent’s policy against verifiable or preference-based rewards; DR-Tulu (Shao et al., [2025b](https://arxiv.org/html/2605.23590#bib.bib11)) maintains an evolving rubric buffer that supervises the policy during training. These methods change *what the agent does* by modifying the policy itself, whereas we train an external guidance signal and leave the search policy untouched; the rubric lives outside the agent and is consumed by it at inference time. We therefore view this line as an orthogonal axis of system design and do not treat it as a direct baseline; stacking our rubric on top of a trained search agent is out of scope here and left to future work.
### 2.3 Rubric-based reward and evaluation.
A growing line of work treats rubrics as a signal for LLM alignment. Rubric-ARM (Xu et al., [2026a](https://arxiv.org/html/2605.23590#bib.bib14)) alternates RL between a rubric generator and a judge; OpenRubrics (Liu et al., [2025](https://arxiv.org/html/2605.23590#bib.bib36)) trains a rubric-conditioned reward model on large-scale prompt–rubric data; AdvancedIF (He et al., [2025](https://arxiv.org/html/2605.23590#bib.bib37)) trains a rubric verifier for complex instruction following; Lv et al. ([2026](https://arxiv.org/html/2605.23590#bib.bib17)) and DR-Tulu (Shao et al., [2025b](https://arxiv.org/html/2605.23590#bib.bib11)) train or evolve rubrics for deep research, both at the report level; Seed (Sheng et al., [2026](https://arxiv.org/html/2605.23590#bib.bib18)) self-evolves CoT rubrics during RL. Broader LLM-as-a-judge (Lee et al., [2024](https://arxiv.org/html/2605.23590#bib.bib19); Bai et al., [2022](https://arxiv.org/html/2605.23590#bib.bib20)) and process-reward-model work (Wang et al., [2024](https://arxiv.org/html/2605.23590#bib.bib21); Lightman et al., [2024](https://arxiv.org/html/2605.23590#bib.bib22)) similarly use LLM-derived signals to score or supervise reasoning steps. In all these settings, the rubric is consumed *evaluatively* (by a training pipeline as reward, judge template, or post-hoc verifier) to decide how much credit an already-produced response deserves. Our rubric is consumed *prescriptively* by the agent itself at inference time, and is generated step-by-step from the current partial trajectory rather than once per query or per completed report. To our knowledge, Co-ReAct is the first system to train rubrics for this prescriptive, step-level role in a ReAct agent.
## 3 Method
Our method has three stages: (i) collect branching points from real ReAct trajectories and label each with an expert ranking over candidate next actions, (ii) train a rubric generator with GRPO so that the rubric it emits produces a ranking consistent with the expert ranking, and (iii) use the trained rubric at inference time inside an inject–verify–retry loop. Figure [1](https://arxiv.org/html/2605.23590#S3.F1) gives an overview, and the same generator also serves as a drop-in plug-in for other test-time methods (Sec. [4.6](https://arxiv.org/html/2605.23590#S4.SS6)).
Figure 1: Overview of Co-ReAct. (i) Collect: sample candidate next actions at each branching point and rank them with multi-judge expert consensus. (ii) Train: GRPO with a Spearman reward between the rubric-induced ranking and the expert ranking. (iii) Infer: the trained rubric drives a five-tuple (Rubric, Reason, Act, Verify, Observe) loop.
### 3.1 Preference Data Collection
We construct training data from branching points of real ReAct trajectories, so the rubric is supervised on the same decision states the downstream agent encounters. Let $q$ denote a research query. A ReAct trajectory for $q$ is a sequence of interleaved actions and observations $(a_1, o_1, a_2, o_2, \ldots)$, where $a_t$ is the action taken at step $t$ and $o_t$ is the corresponding observation. We write $h_t = (a_1, o_1, \ldots, a_{t-1}, o_{t-1})$ for the trajectory prefix up to step $t$.
Starting from a pool of deep research queries, we run a search agent on each query to obtain a full ReAct trajectory. At every tool-calling step $t$, we treat the pair $(q, h_t)$ as a *branching point* and collect a slate of $k$ candidate next actions $\mathcal{A}_t = \{a_t^{(1)}, \ldots, a_t^{(k)}\}$.
To ensure the slate is diverse rather than filled with near-duplicates, we generate $12$ continuations at each branching point by three ReAct agents of different scales (Qwen3-8B, Qwen3-14B, and Qwen3-32B), each sampled at temperatures $\{0.1, 0.4, 0.7, 1.0\}$. Mixing model scales and temperatures broadens the range of search strategies and surface forms in the slate. From this pool we remove exact duplicates and then select $k = 4$ actions using Maximum-Marginal-Relevance with BM25 (Robertson and Zaragoza, [2009](https://arxiv.org/html/2605.23590#bib.bib26)) similarity on the tokenized action string. We discard branching points that have already emitted a final answer or where fewer than $k$ distinct actions can be obtained.
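
The slate-selection step can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names, the whitespace tokenizer, and the BM25 constants `k1` and `b` are assumptions. The text also does not say how the relevance term of Maximum-Marginal-Relevance is scored, so the sketch holds it constant, which reduces MMR to greedy selection of the candidate least similar to those already chosen.

```python
import math
from collections import Counter


def bm25_matrix(docs, k1=1.5, b=0.75):
    """Return sim where sim[i][j] is the BM25 score of doc j for query doc i."""
    tokens = [d.lower().split() for d in docs]  # assumption: whitespace tokens
    n = len(tokens)
    avg_len = sum(len(t) for t in tokens) / n
    df = Counter(term for t in tokens for term in set(t))
    idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}
    sim = [[0.0] * n for _ in range(n)]
    for j, doc in enumerate(tokens):
        tf = Counter(doc)
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        for i, query in enumerate(tokens):
            sim[i][j] = sum(
                idf[t] * tf[t] * (k1 + 1) / (tf[t] + norm)
                for t in set(query) if t in tf
            )
    return sim


def mmr_select(actions, k=4):
    """Pick k mutually dissimilar actions; None if fewer than k are distinct."""
    pool = list(dict.fromkeys(actions))  # drop exact duplicates, keep order
    if len(pool) < k:
        return None  # such branching points are discarded
    sim = bm25_matrix(pool)
    selected = [0]  # constant relevance: start from the first candidate
    while len(selected) < k:
        rest = [i for i in range(len(pool)) if i not in selected]
        # Penalize each candidate by its similarity to the closest selected action.
        selected.append(max(rest, key=lambda i: -max(sim[i][j] for j in selected)))
    return [pool[i] for i in selected]


candidates = [
    "search rubric reward models",
    "search rubric reward models",         # exact duplicate
    "search rubric reward models survey",  # near-duplicate
    "browse arxiv listing page",
    "google grpo training tricks",
    "academic spearman correlation ranking",
]
slate = mmr_select(candidates)
```

In this toy pool the near-duplicate query is the one left out of the four-action slate, which is the behavior the diversity filter is meant to produce.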
##### Expert ranking via multi\-judge consensus\.
Each branching point $(q, h_t, \mathcal{A}_t)$ is paired with an expert consensus ranking $\sigma^{\star}_t$ over $\mathcal{A}_t$ that serves as the supervision target. Using a single LLM as a pointwise judge is brittle: pointwise scores are poorly calibrated across prompts, and one model’s idiosyncratic preferences become a bias shared across all supervision. We therefore use a *listwise, multi-judge* protocol. The four candidates are randomly permuted and relabeled with neutral identifiers $\{X, Y, Z, W\}$ to remove positional bias, then shown to $J$ independent frontier LLM judges drawn from different model families. Each judge returns a full ranking of the slate rather than a scalar score. We aggregate the rankings via Borda count: each candidate’s rank positions across judges are summed into a single score, and $\sigma^{\star}_t$ is the permutation induced by sorting these scores. Borda over listwise judgments respects each judge’s full ordering and is robust to a single judge being an outlier. We only keep branching points on which at least two judges return a valid, parseable ranking.
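
The aggregation described above can be illustrated with a short Python sketch. The function name and the alphabetical tie-break are assumptions made for illustration; the text does not specify how tied Borda scores are resolved.

```python
from collections import defaultdict


def borda_consensus(judge_rankings):
    """Aggregate full rankings (best first) into one consensus ranking.

    Each candidate's score is the sum of its rank positions across judges
    (1 = best), so a lower total means a better consensus rank.
    """
    totals = defaultdict(int)
    for ranking in judge_rankings:
        for position, candidate in enumerate(ranking, start=1):
            totals[candidate] += position
    # Assumption: ties are broken alphabetically by candidate label.
    return sorted(totals, key=lambda c: (totals[c], c))


judges = [
    ["X", "Y", "Z", "W"],
    ["Y", "X", "Z", "W"],
    ["X", "Z", "Y", "W"],
]
consensus = borda_consensus(judges)  # rank sums: X=4, Y=6, Z=8, W=12
```

Although the three judges disagree on individual positions, the summed ranks still produce a single ordering, and no single judge can move a candidate by more than its own rank contribution.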
##### Depth\-wise expansion\.
Branching points at successive depths are collected along a single trajectory spine: after obtaining the expert ranking $\sigma^{\star}_t$ at depth $t$, we commit only the top-ranked action $a_t^{\star}$ and its observation $o_t^{\star}$ to the history, then re-sample a fresh slate $\mathcal{A}_{t+1}$ at the resulting prefix $h_{t+1}$.
### 3.2 Rubric Generator Training with Listwise GRPO
We formalize the rubric generator as an autoregressive policy $\pi_\theta(R \mid q, h_t)$ that emits a rubric $R$: a short list of weighted criteria specifying what a good next action should cover. A rubric is useful only if it can *discriminate* good actions from bad ones at the same branching point; a rubric that sounds plausible but induces a ranking uncorrelated with expert consensus is useless. We therefore define the reward of a sampled rubric as the rank correlation between the ranking it induces over $\mathcal{A}_t$ and the expert consensus ranking $\sigma^{\star}_t$.
#### 3.2.1 Rubric Reward Design
##### Rubric\-induced ranking\.
Given a rubric $R$ and a candidate action $a \in \mathcal{A}_t$, an independent evaluator LLM reads $(q, h_t, a, R)$ and returns the weighted fraction of rubric criteria the action satisfies. Sorting these scores in descending order yields the rubric-induced ranking $\widehat{\sigma}_t(R)$.
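
As a concrete illustration of the scoring rule above, the following sketch computes the weighted fraction of satisfied criteria and sorts candidates by it. The per-criterion verdicts are hard-coded here in place of the evaluator LLM, and the weights, labels, and tie-break are hypothetical.

```python
def rubric_score(weights, satisfied):
    """Weighted fraction of rubric criteria that an action satisfies."""
    return sum(w for w, ok in zip(weights, satisfied) if ok) / sum(weights)


def induced_ranking(weights, verdicts):
    """Rank candidates by descending rubric score.

    verdicts maps a candidate label to one boolean per criterion.
    Assumption: ties are broken alphabetically by label.
    """
    scores = {c: rubric_score(weights, v) for c, v in verdicts.items()}
    return sorted(scores, key=lambda c: (-scores[c], c)), scores


weights = [3, 2, 1]  # hypothetical criterion weights
verdicts = {         # stand-in for the evaluator LLM's verdicts
    "W": [False, False, True],
    "X": [True, True, False],
    "Y": [True, False, False],
    "Z": [False, True, False],
}
ranking, scores = induced_ranking(weights, verdicts)
```

Here candidate X satisfies criteria worth 5 of the 6 total weight and ranks first, while W satisfies only the lightest criterion and ranks last.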
##### Listwise Spearman reward\.
The main reward is the Spearman rank correlation between $\widehat{\sigma}_t(R)$ and $\sigma^{\star}_t$, rescaled to $[0,1]$:
$$r_{\text{rank}}(R) = \tfrac{1}{2}\left(\rho\bigl(\widehat{\sigma}_t(R),\, \sigma^{\star}_t\bigr) + 1\right), \tag{1}$$
where $\rho$ is Spearman’s rank correlation coefficient
$$\rho(\sigma_a, \sigma_b) = 1 - \frac{6\sum_{i=1}^{n}\bigl(\sigma_a(i) - \sigma_b(i)\bigr)^{2}}{n(n^{2}-1)}, \tag{2}$$
and $\sigma_a(i)$, $\sigma_b(i)$ denote the rank of candidate $i$ under the two rankings ($n = |\mathcal{A}_t|$). An anti-correlated ranking gets $0$, a random ranking gets $0.5$ in expectation, and perfect agreement gets $1$; a plausible-sounding rubric that cannot sort candidates in the expert order earns no credit above chance.
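
Eqs. (1) and (2) can be checked numerically with a small sketch. The function names are illustrative, and the code assumes strict rankings without ties, as in the four-candidate slates described earlier.

```python
from itertools import permutations


def spearman_rho(rank_a, rank_b):
    """Eq. (2): rank_a and rank_b map each candidate to its rank (1 = best)."""
    n = len(rank_a)
    d2 = sum((rank_a[c] - rank_b[c]) ** 2 for c in rank_a)
    return 1 - 6 * d2 / (n * (n**2 - 1))


def rank_reward(induced, expert):
    """Eq. (1): both arguments are lists of candidate labels, best first."""
    ra = {c: i for i, c in enumerate(induced, start=1)}
    rb = {c: i for i, c in enumerate(expert, start=1)}
    return 0.5 * (spearman_rho(ra, rb) + 1)


expert = ["X", "Y", "Z", "W"]
perfect = rank_reward(["X", "Y", "Z", "W"], expert)   # identical order
one_swap = rank_reward(["Y", "X", "Z", "W"], expert)  # top two swapped
reversed_ = rank_reward(["W", "Z", "Y", "X"], expert)  # fully reversed

# Average reward over all 24 orderings, i.e. a uniformly random rubric.
all_rewards = [rank_reward(list(p), expert) for p in permutations(expert)]
chance = sum(all_rewards) / len(all_rewards)
```

With $n = 4$, swapping the top two candidates gives $\sum_i d_i^2 = 2$, so $\rho = 0.8$ and the reward is $0.9$; the full reversal gives $\sum_i d_i^2 = 20$, so $\rho = -1$ and the reward is $0$. The average over all orderings comes out at $0.5$, matching the chance level stated above.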
##### Total reward\.
We combine $r_{\text{rank}}$ with two light shaping terms, an *atomicity* reward $r_{\text{atom}}$ that encourages each criterion to check a single verifiable fact and a *format* reward $r_{\text{fmt}}$ that enforces the expected schema, into the final reward
$$r(R) = w_1\, r_{\text{rank}}(R) + w_2\, r_{\text{atom}}(R) + w_3\, r_{\text{fmt}}(R), \tag{3}$$
with $w_1 \gg w_2, w_3$, so the rank-correlation signal drives learning and the shaping terms only refine how the rubric is phrased.
#### 3.2.2 GRPO Optimization
We optimize $\pi_\theta$ with Group Relative Policy Optimization (Shao et al., [2024a](https://arxiv.org/html/2605.23590#bib.bib23)). For each branching point $(q, h_t)$, we sample a group of $G$ rubrics $\{R_1, \ldots, R_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$ and compute the rewards $\{r(R_i)\}_{i=1}^{G}$ via Eq. [3](https://arxiv.org/html/2605.23590#S3.E3). The policy is updated with the standard clipped surrogate objective:
$$\mathcal{L}(\theta) = -\frac{1}{G}\sum_{i=1}^{G}\min\!\big(\omega_i \hat{A}_i,\; \mathrm{clip}(\omega_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\big) + \beta\,\mathbb{KL}\!\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big], \tag{4}$$
where $\omega_i = \pi_\theta(R_i \mid q, h_t) / \pi_{\theta_{\text{old}}}(R_i \mid q, h_t)$ is the importance ratio, and advantages are normalized within each group:
$$\hat{A}_i = \frac{r(R_i) - \operatorname{mean}(\{r(R_j)\}_{j=1}^{G})}{\operatorname{std}(\{r(R_j)\}_{j=1}^{G})}. \tag{5}$$
The output of this stage is the trained generator $\pi_\theta^{\star}$, which at inference time takes any $(q, h_t)$ and emits a rubric targeting the next search step.
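
The group normalization in Eq. (5) and the per-sample clipped term inside Eq. (4) can be sketched in plain Python. Three details are assumptions that the text does not state: the use of the population standard deviation, the small constant `eps` that guards against groups whose rewards are all identical, and the clip range of 0.2.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Eq. (5): standardize rewards within one group of sampled rubrics."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]  # eps is an assumption


def clipped_term(ratio, advantage, clip_eps=0.2):
    """Inner min(...) of Eq. (4) for one rubric; clip_eps is an assumed value."""
    clipped_ratio = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


rewards = [0.9, 0.5, 0.5, 0.1]  # hypothetical r(R_i) for a group of G = 4
advantages = group_advantages(rewards)
flat_group = group_advantages([0.5, 0.5, 0.5])  # no rubric stands out
```

Because advantages are relative to the group, a rubric is reinforced only when it ranks candidates better than its siblings sampled at the same branching point; a group with identical rewards contributes no gradient signal.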
### 3.3 Co-ReAct Inference: Inject, Verify, Retry
At inference time we use $\pi_\theta^{\star}$ to drive a rubric-guided ReAct loop, extending ReAct’s three-tuple (Reason, Act, Observe) to a five-tuple (Rubric, Reason, Act, Verify, Observe). At each tool-calling step with history $h_t$, Co-ReAct performs three operations:
1. Inject. The rubric generator produces $R_t \sim \pi_\theta^{\star}(\cdot \mid q, h_t)$, which is appended to the agent’s context as an explicit specification of what the next action should cover. The search agent then decides on a next action $a_t$ conditioned on both $h_t$ and $R_t$.
2. Verify. Before executing the action, an independent *verifier* LLM reads $(q, h_t, a_t, R_t)$ and checks each criterion in $R_t$ against the proposed action, returning a per-criterion verdict. The step is accepted if the weighted fraction of satisfied criteria exceeds a threshold $\tau$.
3. Retry. If the step fails verification, the agent is asked once to re-plan the step with the same rubric $R_t$ and the verifier’s per-criterion feedback pinned in context, so it can directly address the failed criteria. The retried step replaces the failed one, and at most one retry is issued per step to bound compute.

The rubric generator, search agent, and verifier each play a distinct role, so the trained rubric can also be used outside this loop: we simply inject $R_t$ into a baseline’s context and skip the verify–retry step, letting the baseline’s own decision mechanism consume the rubric (Sec. [4.6](https://arxiv.org/html/2605.23590#S4.SS6)).
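
One step of the inject–verify–retry procedure can be sketched as follows. This is a schematic sketch rather than the released implementation: the three callables stand in for the rubric generator, the search agent, and the verifier LLM, and their names, signatures, and the threshold value `tau=0.5` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    text: str
    weight: float


def pass_fraction(rubric, verdicts):
    """Weighted fraction of rubric criteria the verifier marked as satisfied."""
    total = sum(c.weight for c in rubric)
    return sum(c.weight for c, ok in zip(rubric, verdicts) if ok) / total


def co_react_step(query, history, generate_rubric, propose_action, verify, tau=0.5):
    """Return (action, retried) for one tool-calling step."""
    rubric = generate_rubric(query, history)               # 1. Inject
    action = propose_action(query, history, rubric, None)
    verdicts = verify(query, history, action, rubric)      # 2. Verify
    if pass_fraction(rubric, verdicts) > tau:
        return action, False
    # 3. Retry once, with the failed criteria pinned as feedback.
    feedback = [c.text for c, ok in zip(rubric, verdicts) if not ok]
    return propose_action(query, history, rubric, feedback), True


# Hypothetical stubs standing in for the three models.
demo_rubric = [Criterion("targets a primary source", 1.0),
               Criterion("avoids repeating earlier queries", 1.0)]


def fake_generate(query, history):
    return demo_rubric


def fake_propose(query, history, rubric, feedback):
    return "search: rubric survey" if feedback is None else "search: new primary source"


def fake_verify(query, history, action, rubric):
    return [True, "new" in action]


action, retried = co_react_step("q", [], fake_generate, fake_propose, fake_verify)
```

In the stubbed run the first proposal satisfies only half of the rubric weight, which does not exceed the assumed threshold, so the step is re-planned once with the failed criterion as feedback. Dropping the verify and retry branches leaves the inject-only usage described for the plug-in setting.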
## 4 Experiments
### 4.1 Experimental Settings
##### Datasets\.
We evaluate on two deep research benchmarks that stress different aspects of open-ended, citation-grounded research. DeepResearchBench (DRB) (Du et al., [2025](https://arxiv.org/html/2605.23590#bib.bib24)) contains Chinese and English research questions that require multi-turn web search and long-form report generation with citations, and is judged under the RACE protocol that scores comprehensiveness, insight, instruction following, and readability. SQA-CS-V2 (Asai et al., [2024](https://arxiv.org/html/2605.23590#bib.bib25)) contains scientific questions that require search and citation-grounded synthesis; evaluation focuses on factual completeness (ingredient recall, answer precision) and citation quality (citation recall and precision).
##### Evaluation Metrics\.
For DRB, we report the RACE metric comprising Comprehensiveness \(Comp\.\), Insight \(Ins\.\), Instruction Following \(IF\), Readability \(Read\.\), and their Global Average \(Avg\.\)\. For SQA\-CS\-V2, we report Ingredient Recall \(IR\), Answer Precision \(AP\), Citation Recall \(CR\), Citation Precision \(CP\), and their Global Average \(Avg\.\)\.
##### Agent Architecture and Tool Set\.
All methods share a two-stage pipeline: a search agent gathers evidence through a ReAct loop, and an answer agent synthesizes a citation-grounded report from the full trajectory. The search agent has access to three tools: an academic search tool, a Google search tool, and a webpage browsing tool. It interleaves these tool calls with reasoning steps. Baselines differ only in how the search agent decides what to call next or whether to retry. The tool set and answer agent are identical across methods, so comparisons isolate decision quality from writing ability.
##### Compared Methods\.
We compare Co-ReAct against four test-time methods on the same ReAct loop: Self-Refine (Madaan et al., [2023](https://arxiv.org/html/2605.23590#bib.bib4)) applies iterative self-critique at each step, retrying when the agent judges its own output insufficient; Best-of-N (Snell et al., [2024](https://arxiv.org/html/2605.23590#bib.bib7)) samples $N = 4$ trajectories at temperature 0.7 and picks the best via an external scorer (answer generation is greedy); Step-Back (Zheng et al., [2024](https://arxiv.org/html/2605.23590#bib.bib6)) prepends a high-level perspective before each action to encourage broader reasoning; CRITIC (Gou et al., [2024](https://arxiv.org/html/2605.23590#bib.bib5)) runs a verification search after each action to generate grounded feedback for retries. Co-ReAct (Ours) emits a calibrated rubric from an RL-trained generator before each step, injects it as structured guidance, and verifies the action against the criteria with targeted retry on failure.
Table 1: Comparison results on DeepResearchBench (DRB) and SQA-CS-V2 with two search agents. All methods use Qwen3-235B as the answer rewriter to isolate search quality from writing ability. Improvement (%) is relative to ReAct. Bold: best; underline: second best.
##### Implementation Details\.
We use Qwen3-8B and Qwen3-14B as search agents (vLLM, greedy decoding); the rubric generator is initialized from Qwen3-14B and GRPO-trained on branching-point data from the DR-Tulu training queries (Shao et al., [2025b](https://arxiv.org/html/2605.23590#bib.bib11)), with expert rankings from a three-judge council (Claude 4.5 Sonnet, Gemini 2.5 Pro, GPT-5) aggregated by Borda count. To isolate search quality from writing ability, all methods share the same answer rewriter Qwen3-235B. For evaluation we adopt each benchmark’s *official setting*: DRB is scored by the official RACE protocol (Du et al., [2025](https://arxiv.org/html/2605.23590#bib.bib24)) with Gemini as the judge, and SQA-CS-V2 is scored by its official evaluation script (Asai et al., [2024](https://arxiv.org/html/2605.23590#bib.bib25)), also with Gemini as the judge. Full data-collection statistics, judge configuration, and hyperparameters are in Appendix [A](https://arxiv.org/html/2605.23590#A1).
Table 2: Ablation study on SQA-CS-V2 (Qwen3-8B search agent). Each row removes one component from the full Co-ReAct method.
Table 3: Search behavior analysis on SQA-CS-V2 (Qwen3-8B search agent).
### 4.2 Main Results
Results on DRB and SQA-CS-V2 are shown in Table [1](https://arxiv.org/html/2605.23590#S4.T1).
\(1\) Co\-ReAct achieves the best Global Average on both benchmarks and both scales, confirming that rubric\-guided search consistently yields higher\-quality trajectories\. With Qwen3\-8B, it improves over the strongest baseline Self\-Refine by 0\.89% on DRB and 0\.84% on SQA\. Gains amplify with Qwen3\-14B: 7\.86% on DRB and 4\.56% on SQA over ReAct, surpassing the second\-best CRITIC by 3\.59% on DRB and Self\-Refine by 2\.47% on SQA\.
\(2\) Self\-Refine and CRITIC are the most competitive baselines\. Both share the intuition behind our verification component, which is to catch and correct suboptimal actions\. However, they rely on the search agent to diagnose its own quality gaps\. In contrast, Co\-ReAct offloads this process to a dedicated RL\-trained rubric generator, yielding more targeted guidance\.
(3) Best-of-N and Step-Back consistently underperform on SQA. Best-of-N produces shorter trajectories on average (3.0 tool calls vs. 5.2 for ReAct; Table [3](https://arxiv.org/html/2605.23590#S4.T3)) because its candidates tend to stop once a plausible answer appears, and the best-scoring candidate is often one of these shorter, less exhaustive runs. Step-Back’s abstract perspective diverts the agent from fine-grained retrieval, though it achieves the highest Answer Precision on SQA (81.08 / 82.19), suggesting abstraction trades recall for precision.
\(4\) The scaling behavior from 8B to 14B reveals a clear trend: Co\-ReAct’s relative gain over ReAct grows from 2\.50% to 7\.86% on DRB and from 2\.80% to 4\.56% on SQA, indicating that stronger agents better leverage structured rubric guidance\. The largest sub\-metric gain, 19\.5% on Ingredient Recall at 14B, shows the rubric especially helps the agent cover more key information points\.
### 4.3 Ablation Study
Table [2](https://arxiv.org/html/2605.23590#S4.T2) isolates the contribution of each Co-ReAct component on SQA-CS-V2.
All three components (listwise training, RL optimization, and verification) are essential. w/o Co-ReAct (72.76): removing the rubric mechanism reduces the method to standard ReAct. w/o RL Rubric (72.44): replacing the RL-trained generator with an untrained base model pushes performance below even ReAct, confirming that rubric quality matters; miscalibrated rubrics mislead the agent rather than guide it. w/o Listwise (74.04): switching from listwise to pairwise GRPO degrades performance, because listwise Spearman optimization provides richer gradient signals across full rankings. w/o Verification (74.08): removing verify-and-retry lowers the Global Average by 0.96% relative to the full method (74.80); the verification step catches the 21.4% of tool calls that fail rubric criteria and triggers targeted retries (Section [4.5](https://arxiv.org/html/2605.23590#S4.SS5)).
### 4.4 Generalization to Commercial Models
To verify the effectiveness of the Co-ReAct paradigm itself under a closed-source setting, we further apply a prompt-only Co-ReAct variant to Gemini 3.1 Pro on DRB. In this setting, Gemini serves as the search agent and answer generator, and is prompted to generate step-level rubrics and verification feedback without GRPO fine-tuning (Figure [2](https://arxiv.org/html/2605.23590#S4.F2)).
Co\-ReAct reaches 37\.13 Overall RACE, improving over ReAct by 4\.44% and over the strongest baseline Step\-Back by 3\.89%\. All other test\-time methods \(Self\-Refine, Best\-of\-N, CRITIC\) fail to improve over ReAct on this strong model, suggesting that self\-correction and resampling offer diminishing returns when the base agent is already capable\.
Figure 2: DRB RACE sub\-metric results with Gemini 3\.1 Pro used as the search agent, answer generator, and rubric generator\. Co\-ReAct achieves the best score on every sub\-metric\. Dashed lines mark the ReAct baseline in each group\.
### 4\.5 Search Behavior
##### Co\-ReAct produces more thorough search trajectories\.
Table [3](https://arxiv.org/html/2605.23590#S4.T3) compares search behavior\. Co\-ReAct averages 6\.5 tool calls and 19\.3 links per question vs\. 5\.2 / 12\.7 for ReAct: a ∼52% increase in retrieved documents with only ∼25% more tool calls, indicating that the rubric guides the agent toward more targeted queries rather than simply increasing search volume\. CRITIC uses comparable tool calls \(5\.0\) but retrieves fewer links \(14\.2\), suggesting its verification searches check existing results rather than discover new ones\. Co\-ReAct also produces the largest pool of unique cited sources \(18\.6\), a ∼66% relative gain over ReAct \(11\.2\) and above every baseline\. Despite retrieving the most links, Co\-ReAct achieves the highest utilization ratio \(Utils 0\.96 vs\. 0\.88–0\.91\), which we attribute to the rubric’s ability to generate more step\-appropriate queries that steer the agent toward more relevant and useful evidence\.
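The relative gains quoted above follow directly from the per\-question averages\. The short check below recomputes them; the dictionary keys are illustrative names chosen here, not identifiers from the paper\.

```python
# Per-question averages as quoted in the text (Table 3).
react = {"tool_calls": 5.2, "links": 12.7, "unique_sources": 11.2}
co_react = {"tool_calls": 6.5, "links": 19.3, "unique_sources": 18.6}


def relative_gain(new: float, base: float) -> float:
    """Relative increase of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


gains = {name: relative_gain(co_react[name], react[name]) for name in react}
for name, pct in gains.items():
    print(f"{name}: +{pct:.1f}%")
```

Links grow about twice as fast as tool calls, which is the basis for the claim that the extra retrieval comes from better\-targeted queries rather than from more queries\.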
##### Verification is well\-calibrated\.
Across the SQA evaluation set, Co\-ReAct executes 743 rubric\-guided steps \(7\.4 per example\); 159 \(21\.4%\) fail verification and trigger a retry\. This rate balances quality and efficiency, and the improvement from inject\-only \(74\.08\) to full Co\-ReAct \(74\.80\) confirms these retries meaningfully improve search quality\.
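As a quick sanity check, the counts in this paragraph can be tied back to the reported percentages\. The implied number of evaluation examples is an inference from the per\-example average, not a figure stated in the text\.

```python
steps_total = 743        # rubric-guided steps on the SQA evaluation set
steps_failed = 159       # steps that failed verification and triggered a retry
steps_per_example = 7.4  # reported average

fail_rate = steps_failed / steps_total
approx_examples = steps_total / steps_per_example  # inferred, not reported

inject_only, full = 74.08, 74.80  # Global Average without / with verify-and-retry
relative_drop = 100.0 * (full - inject_only) / full

print(f"retry rate: {100 * fail_rate:.1f}%")
print(f"implied examples: {approx_examples:.0f}")
print(f"relative drop without verification: {relative_drop:.2f}%")
```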
### 4\.6 Plug\-in Rubric Portability Study
We test whether the trained rubric can be reused outside the Co\-ReAct loop by injecting the 14B rubric generator into Best\-of\-N, Step\-Back, and CRITIC as a drop\-in context signal, with verify\-and\-retry disabled and all other components unchanged\. Evaluation follows Table [1](https://arxiv.org/html/2605.23590#S4.T1)’s protocol on both DRB and SQA; results are in Figure [3](https://arxiv.org/html/2605.23590#S4.F3)\.
Figure 3: Plug\-in rubric portability\. The rubric trained inside Co\-ReAct is injected into three other test\-time methods \(with verify\-and\-retry disabled\) on DRB and SQA\. Arrows connect each method’s original score \(hollow\) to its score after rubric injection \(filled\)\.
The rubric yields positive transfer in all six \(method, benchmark\) cells, with the largest gains on the weakest method \(Step\-Back\) and the smallest on the method whose built\-in tool\-interactive critique already overlaps with the rubric signal \(CRITIC\)\. The rubric is thus complementary to existing test\-time compute techniques, not a substitute, and can serve as a drop\-in component on top of them\.
### 4\.7 Case Study
Figure 4: Case study: ReAct vs\. Co\-ReAct on the same SQA\-CS\-V2 question \(DepthCrafter\)\. Co\-ReAct’s rubric–verify–retry mechanism at $a_3$ corrects a factual error that ReAct fails to catch\.
Figure [4](https://arxiv.org/html/2605.23590#S4.F4) illustrates the rubric–verify–retry mechanism on a single SQA\-CS\-V2 question about DepthCrafter\. ReAct and Co\-ReAct issue identical first two actions; at $a_3$, the rubric guides Co\-ReAct to open the arXiv page rather than issuing another snippet query\. The initial attempt fails verification due to wrong tool selection and insufficient disambiguation, triggering a retry with browse\_webpage\. This single corrected action produces the third answer bullet that ReAct gets wrong, demonstrating how step\-level rubrics translate into concrete factual improvements\.
## 5 Conclusion
We presented Co\-ReAct, a rubric\-guided extension of ReAct that inserts a Rubric stage before action and a Verify stage after, turning the agent’s three\-tuple into a five\-tuple \(Rubric, Reason, Act, Verify, Observe\)\. The rubric generator is trained with listwise GRPO, using Spearman agreement between rubric\-induced and expert rankings as the reward\. Across DeepResearchBench and SQA\-CS\-V2, Co\-ReAct consistently outperforms Self\-Refine, Best\-of\-N, Step\-Back, and CRITIC on Qwen3\-8B, Qwen3\-14B, and Gemini 3\.1 Pro agents; the learned rubric also transfers as a drop\-in module, improving every baseline it is plugged into\. These results suggest that externally generated, trajectory\-aware rubrics are a lightweight and composable way to improve agentic search\.
## Limitations
##### Scope of the method\.
Co\-ReAct is a ReAct\-paradigm enhancement: it sits on top of a fixed search policy and improves step\-level decision quality through additional inference\-time computation, without retraining the underlying agent\. Accordingly, we compare against other ReAct enhancements \(Self\-Refine, Best\-of\-N, Step\-Back, CRITIC\) and do not benchmark against end\-to\-end RL\-trained search agents such as Search\-R1 or R1\-Searcher, which retrain the policy itself and belong to an orthogonal line of work\. Our plug\-in study only evaluates compositionality within the ReAct\-enhancement family; whether the trained rubric can be stacked on top of RL\-trained search agents is an open question we leave to future work\.
##### Evaluation scale and judging\.
Our evaluation relies on LLM\-based judges \(Gemini for DRB and SQA, and a three\-model council during rubric training\), which inherit known failure modes of LLM\-as\-a\-judge such as verbosity bias\.
## References
- A\. Asai, E\. Chen, K\. Chen, J\. Luo, X\. Qiu, H\. Peng, M\. Tan, M\. Yasunaga, P\. Liang, and L\. Dong \(2024\)OpenScholar: synthesizing scientific literature with retrieval\-augmented language models\.arXiv preprint arXiv:2411\.14199\.Cited by:[§4\.1](https://arxiv.org/html/2605.23590#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.23590#S4.SS1.SSS0.Px5.p1.1)\.
- Y\. Bai, S\. Kadavath, S\. Kundu, A\. Askell, J\. Kernion, A\. Jones, A\. Chen, A\. Goldie, A\. Mirhoseini, C\. McKinnon,et al\.\(2022\)Constitutional ai: harmlessness from ai feedback\.arXiv preprint arXiv:2212\.08073\.Cited by:[§2\.3](https://arxiv.org/html/2605.23590#S2.SS3.p1.1)\.
- S\. M\. Brookhart \(2018\)Appropriate criteria: key to effective rubrics\.InFrontiers in education,Vol\.3,pp\. 22\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p3.1)\.
- M\. Du, B\. Xu, C\. Zhu, X\. Wang, and Z\. Mao \(2025\)Deepresearch bench: a comprehensive benchmark for deep research agents\.arXiv preprint arXiv:2506\.11763\.Cited by:[§4\.1](https://arxiv.org/html/2605.23590#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.23590#S4.SS1.SSS0.Px5.p1.1)\.
- Z\. Gou, Z\. Shao, Y\. Gong, Y\. Yang, N\. Duan, W\. Chen,et al\.\(2024\)Critic: large language models can self\-correct with tool\-interactive critiquing\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 57734–57811\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p5.1),[§4\.1](https://arxiv.org/html/2605.23590#S4.SS1.SSS0.Px4.p1.1)\.
- A\. Gunjal, A\. Wang, E\. Lau, V\. Nath, Y\. He, B\. Liu, and S\. Hendryx \(2025\)Rubrics as rewards: reinforcement learning beyond verifiable domains\.arXiv preprint arXiv:2507\.17746\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p2.1)\.
- Y\. He, W\. Li, H\. Zhang, S\. Li, K\. Mandyam, S\. Khosla, Y\. Xiong, N\. Wang, X\. Peng, B\. Li,et al\.\(2025\)Advancedif: rubric\-based benchmarking and reinforcement learning for advancing llm instruction following\.arXiv preprint arXiv:2511\.10507\.Cited by:[§2\.3](https://arxiv.org/html/2605.23590#S2.SS3.p1.1)\.
- B\. Jin, H\. Zeng, Z\. Yue, J\. Yoon, S\. Arik, D\. Wang, H\. Zamani, and J\. Han \(2025\)Search\-r1: training llms to reason and leverage search engines with reinforcement learning\.arXiv preprint arXiv:2503\.09516\.Cited by:[§2\.2](https://arxiv.org/html/2605.23590#S2.SS2.p1.1)\.
- H\. Lee, S\. Phatale, H\. Mansoor, T\. Mesnard, J\. Ferret, K\. R\. Lu, C\. Bishop, E\. Hall, V\. Carbune, A\. Rastogi, and S\. Prakash \(2024\)RLAIF vs\. RLHF: scaling reinforcement learning from human feedback with AI feedback\.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.235,pp\. 26874–26901\.Cited by:[§2\.3](https://arxiv.org/html/2605.23590#S2.SS3.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\)Let’s verify step by step\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 39578–39601\.Cited by:[§2\.3](https://arxiv.org/html/2605.23590#S2.SS3.p1.1)\.
- T\. Liu, R\. Xu, T\. Yu, I\. Hong, C\. Yang, T\. Zhao, and H\. Wang \(2025\)Openrubrics: towards scalable synthetic rubric generation for reward modeling and llm alignment\.arXiv preprint arXiv:2510\.07743\.Cited by:[§2\.3](https://arxiv.org/html/2605.23590#S2.SS3.p1.1)\.
- C\. Lv, J\. Zhou, W\. Zhao, J\. Xu, Z\. Huang, M\. Tian, S\. Dou, T\. Gui, L\. Tian, X\. Zhou,et al\.\(2026\)Learning query\-specific rubrics from human preferences for deepresearch report generation\.arXiv preprint arXiv:2602\.03619\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p2.1),[§2\.3](https://arxiv.org/html/2605.23590#S2.SS3.p1.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang,et al\.\(2023\)Self\-refine: iterative refinement with self\-feedback\.Advances in neural information processing systems36,pp\. 46534–46594\.Cited by:[§2\.1](https://arxiv.org/html/2605.23590#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.23590#S4.SS1.SSS0.Px4.p1.1)\.
- R\. Nakano, J\. Hilton, S\. Balaji, J\. Wu, L\. Ouyang, C\. Kim, C\. Hesse, S\. Jain, V\. Kosaraju, W\. Saunders,et al\.\(2021\)Webgpt: browser\-assisted question\-answering with human feedback\.arXiv preprint arXiv:2112\.09332\.Cited by:[§2\.2](https://arxiv.org/html/2605.23590#S2.SS2.p1.1)\.
- W\. J\. Popham \(1997\)What’s wrong—and what’s right—with rubrics\.Educational Leadership55\(2\),pp\. 72–75\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p2.1)\.
- S\. Robertson and H\. Zaragoza \(2009\)The probabilistic relevance framework: BM25 and beyond\.Foundations and Trends in Information Retrieval3\(4\),pp\. 333–389\.Cited by:[§3\.1](https://arxiv.org/html/2605.23590#S3.SS1.p3.4)\.
- J\. Shao, Y\. Lin, M\. P\. Lohani, Y\. Miao, and B\. Luo \(2025a\)Do llm agents know how to ground, recover, and assess? a benchmark for epistemic competence in information\-seeking agents\.arXiv preprint arXiv:2509\.22391\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p1.1)\.
- R\. Shao, A\. Asai, S\. Z\. Shen, H\. Ivison, V\. Kishore, J\. Zhuo, X\. Zhao, M\. Park, S\. G\. Finlayson, D\. Sontag,et al\.\(2025b\)Dr tulu: reinforcement learning with evolving rubrics for deep research\.arXiv preprint arXiv:2511\.19399\.Cited by:[Appendix A](https://arxiv.org/html/2605.23590#A1.SS0.SSS0.Px1.p1.5),[§1](https://arxiv.org/html/2605.23590#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.23590#S2.SS2.p1.1),[§2\.3](https://arxiv.org/html/2605.23590#S2.SS3.p1.1),[§4\.1](https://arxiv.org/html/2605.23590#S4.SS1.SSS0.Px5.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024a\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§3\.2\.2](https://arxiv.org/html/2605.23590#S3.SS2.SSS2.p1.6)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024b\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p4.1)\.
- L\. Sheng, W\. Ma, R\. Hong, X\. Wang, A\. Zhang, and T\. Chua \(2026\)Reinforcing chain\-of\-thought reasoning with self\-evolving rubrics\.arXiv preprint arXiv:2602\.10885\.Cited by:[§2\.3](https://arxiv.org/html/2605.23590#S2.SS3.p1.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.Advances in neural information processing systems36,pp\. 8634–8652\.Cited by:[§2\.1](https://arxiv.org/html/2605.23590#S2.SS1.p1.1)\.
- C\. Snell, J\. Lee, K\. Xu, and A\. Kumar \(2024\)Scaling llm test\-time compute optimally can be more effective than scaling model parameters\.arXiv preprint arXiv:2408\.03314\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p5.1),[§4\.1](https://arxiv.org/html/2605.23590#S4.SS1.SSS0.Px4.p1.1)\.
- H\. Song, J\. Jiang, Y\. Min, J\. Chen, Z\. Chen, W\. X\. Zhao, L\. Fang, and J\. Wen \(2025a\)R1\-searcher: incentivizing the search capability in llms via reinforcement learning\.arXiv preprint arXiv:2503\.05592\.Cited by:[§2\.2](https://arxiv.org/html/2605.23590#S2.SS2.p1.1)\.
- Z\. Song, B\. Zhang, Q\. Zhang, D\. Yin, X\. Sun, and C\. Li \(2025b\)PoLi\-rl: a point\-to\-list reinforcement learning framework for conditional semantic textual similarity\.arXiv preprint arXiv:2510\.04080\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p4.1)\.
- C\. Spearman \(1904\)The proof and measurement of association between two things\.The American Journal of Psychology15\(1\),pp\. 72–101\.External Links:[Document](https://dx.doi.org/10.2307/1412159)Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p4.1)\.
- P\. Wang, L\. Li, Z\. Shao, R\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui \(2024\)Math\-shepherd: verify and reinforce llms step\-by\-step without human annotations\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 9426–9439\.Cited by:[§2\.3](https://arxiv.org/html/2605.23590#S2.SS3.p1.1)\.
- Y\. Wang, Z\. Wei, X\. Zhu, and Y\. Meng \(2025\)Beyond outcome reward: decoupling search and answering improves llm agents\.arXiv preprint arXiv:2510\.04695\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p1.1)\.
- R\. Xu, T\. Liu, Z\. Dong, T\. Yu, I\. Hong, C\. Yang, L\. Zhang, T\. Zhao, and H\. Wang \(2026a\)Alternating reinforcement learning for rubric\-based reward modeling in non\-verifiable llm post\-training\.arXiv preprint arXiv:2602\.01511\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p2.1),[§2\.3](https://arxiv.org/html/2605.23590#S2.SS3.p1.1)\.
- R\. Xu, T\. Liu, Z\. Dong, T\. Yu, I\. Hong, C\. Yang, L\. Zhang, T\. Zhao, and H\. Wang \(2026b\)Alternating reinforcement learning for rubric\-based reward modeling in non\-verifiable llm post\-training\.arXiv preprint arXiv:2602\.01511\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p4.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.Advances in neural information processing systems36,pp\. 11809–11822\.Cited by:[§2\.1](https://arxiv.org/html/2605.23590#S2.SS1.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2022\)React: synergizing reasoning and acting in language models\.arXiv preprint arXiv:2210\.03629\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p1.1)\.
- H\. S\. Zheng, S\. Mishra, X\. Chen, H\. Cheng, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2024\)Take a step back: evoking reasoning via abstraction in large language models\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 20279–20316\.Cited by:[§1](https://arxiv.org/html/2605.23590#S1.p5.1),[§4\.1](https://arxiv.org/html/2605.23590#S4.SS1.SSS0.Px4.p1.1)\.
## Appendix A Additional Implementation Details
This appendix records the concrete hyperparameters and configuration choices referenced from Sec\. [3](https://arxiv.org/html/2605.23590#S3) and the Experimental Settings\.
##### Rubric training data\.
We collect branching\-point data \(Sec\. [3\.1](https://arxiv.org/html/2605.23590#S3.SS1)\) from 11,406 research queries drawn from the training set of DR\-Tulu \(Shao et al\., [2025b](https://arxiv.org/html/2605.23590#bib.bib11)\), so that the rubric generator is supervised on the same query distribution as the downstream deep research setting\. For each query, we construct a trajectory through depth\-wise expansion rather than rolling out a fixed single\-agent trajectory\. At each branching point, we sample 12 candidate next actions using three ReAct agents of different scales \(Qwen3\-8B, Qwen3\-14B, and Qwen3\-32B\), each decoded at four temperatures $\{0.1, 0.4, 0.7, 1.0\}$\. The candidate slate is then ranked by the multi\-judge expert consensus procedure described in Sec\. [3\.1](https://arxiv.org/html/2605.23590#S3.SS1), and the top\-ranked action is executed to extend the trajectory prefix for the next depth\. From each candidate pool, we remove exact duplicates and select $k=4$ diverse actions via Maximum\-Marginal\-Relevance with BM25 similarity on the tokenized action string\. After discarding branching points where the agent has already emitted a final answer or where fewer than four distinct actions can be obtained, we obtain 29,866 branching points used as the unit of supervision\.
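The deduplicate\-then\-diversify step can be sketched as follows\. This is a minimal illustration, not the authors’ implementation: the whitespace tokenizer, the BM25 constants, the trade\-off weight `lam`, the use of BM25 relevance to the query as the MMR relevance term, and the example action strings are all assumptions made here\.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization of the action string.
    return text.lower().split()


class BM25:
    """Plain Okapi BM25 over a small in-memory corpus of token lists."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.docs, self.k1, self.b = docs, k1, b
        self.tf = [Counter(d) for d in docs]
        self.avgdl = sum(len(d) for d in docs) / len(docs)
        df = Counter(t for d in docs for t in set(d))
        n = len(docs)
        # The "1 +" inside the log keeps IDF positive for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tf[i], len(self.docs[i])
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        return sum(self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                   for t in set(query) if t in tf)


def mmr_select(actions, query, k=4, lam=0.5):
    """Drop exact duplicates, then greedily pick up to k actions by MMR."""
    pool = list(dict.fromkeys(actions))  # order-preserving exact dedup
    docs = [tokenize(a) for a in pool]
    bm25 = BM25(docs)

    def sim(i, j):
        # BM25 is asymmetric; normalise by the self-score so sim(i, i) == 1.
        self_score = bm25.score(docs[i], i)
        return bm25.score(docs[i], j) / self_score if self_score else 0.0

    q = tokenize(query)
    rel = [bm25.score(q, i) for i in range(len(pool))]
    top = max(rel) or 1.0
    rel = [r / top for r in rel]

    chosen = []
    while len(chosen) < min(k, len(pool)):
        best = max(
            (i for i in range(len(pool)) if i not in chosen),
            key=lambda i: lam * rel[i]
            - (1 - lam) * max((sim(i, j) for j in chosen), default=0.0),
        )
        chosen.append(best)
    return [pool[i] for i in chosen]


# Hypothetical candidate actions for one branching point.
actions = [
    "search depthcrafter video depth estimation",
    "search depthcrafter video depth estimation",
    "search depthcrafter training datasets",
    "browse_webpage depthcrafter arxiv abstract page",
    "search open world video depth baselines",
    "search depthcrafter inference speed comparison",
]
query = "how does depthcrafter estimate video depth"
picked = mmr_select(actions, query, k=4)
print(picked)
```

When fewer than four distinct actions survive deduplication, the function returns a shorter list; in the pipeline described above such branching points are discarded\.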
##### Expert consensus judges\.
For each branching point, the four candidates are relabeled $\{X, Y, Z, W\}$ under a random permutation and submitted to a council of $J=3$ frontier LLM judges drawn from different model families: Claude 4\.5 Sonnet, Gemini 2\.5 Pro, and GPT\-5\. Each judge is asked for a full listwise ranking \(not a scalar score\) with a chain\-of\-thought rationale; rankings are parsed from the judge’s final answer block\.
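The relabeling step and the mapping of judge rankings back to the original candidates can be sketched as below\. The judge outputs are hard\-coded placeholders, and the mean\-rank \(Borda\-style\) aggregation is an assumed stand\-in: the actual consensus procedure is the one defined in Sec\. 3\.1, which this sketch does not reproduce\.

```python
import random

LABELS = ["X", "Y", "Z", "W"]


def relabel(candidates, rng):
    """Assign labels X/Y/Z/W to the four candidates under a random permutation."""
    order = list(range(len(candidates)))
    rng.shuffle(order)
    # label -> index of the candidate in the original list
    return dict(zip(LABELS, order))


def to_original(ranking, label_to_idx):
    """Convert a judge's best-first label ranking to original candidate indices."""
    return [label_to_idx[label] for label in ranking]


def mean_rank_consensus(rankings):
    """Aggregate best-first rankings by summed position (assumed rule, not the paper's)."""
    n = len(rankings[0])
    total = [0] * n
    for ranking in rankings:
        for pos, idx in enumerate(ranking):
            total[idx] += pos
    # Lower summed position is better; ties are broken by candidate index.
    return sorted(range(n), key=lambda i: (total[i], i))


rng = random.Random(0)
candidates = ["action A", "action B", "action C", "action D"]  # placeholders
mapping = relabel(candidates, rng)

# Hypothetical parsed rankings from the three judges, expressed in labels.
judge_outputs = [
    ["X", "Y", "Z", "W"],
    ["Y", "X", "Z", "W"],
    ["X", "Z", "Y", "W"],
]
consensus = mean_rank_consensus([to_original(r, mapping) for r in judge_outputs])
print([candidates[i] for i in consensus])
```

Shuffling before labeling means a judge’s positional or label preference cannot systematically favor the same underlying candidate across branching points\.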
##### GRPO hyperparameters\.
The rubric generator is initialized from Qwen3\-14B and trained with GRPO on $\mathcal{D}^{\star}$\. We sample $G=8$ rubrics per branching point and form group\-relative advantages within each group\. The reward mixes the listwise Spearman term with atomicity and format terms at weights $(w_1, w_2, w_3) = (0.75, 0.15, 0.10)$, and a repetition gate zeroes out the total reward whenever the 4\-gram repetition rate of the rubric exceeds 40%\. The Spearman ranking is computed by an independent evaluator LLM \(Gemini 2\.5 Pro\) that scores each candidate against the sampled rubric\. We train for 2 epochs with learning rate $2 \times 10^{-6}$, a KL coefficient of $5 \times 10^{-3}$ against a frozen reference policy, and gradient clipping at norm 1\.0\.
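A minimal sketch of how the gated reward and the group\-relative advantages fit together is given below\. Several details are assumptions made for illustration: the Spearman term is used unscaled in $[-1, 1]$, the atomicity and format terms are treated as given scores in $[0, 1]$, the repetition rate is defined as one minus the share of unique 4\-grams, and advantages are standardized within the group in the usual GRPO fashion\.

```python
from statistics import mean, pstdev


def spearman(rank_a, rank_b):
    """Spearman correlation for two tie-free rank vectors over the same items."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def ngram_repetition_rate(tokens, n=4):
    """Share of n-grams that duplicate another n-gram (assumed definition)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def rubric_reward(spearman_term, atomicity, fmt, rubric_tokens,
                  weights=(0.75, 0.15, 0.10), max_rep=0.40):
    """Weighted reward with a repetition gate that zeroes degenerate rubrics."""
    if ngram_repetition_rate(rubric_tokens) > max_rep:
        return 0.0
    w1, w2, w3 = weights
    return w1 * spearman_term + w2 * atomicity + w3 * fmt


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of G sampled rubrics."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Consensus ranks of the four candidates vs. ranks induced by one sampled rubric.
expert_ranks = [1, 2, 3, 4]
rubric_ranks = [2, 1, 3, 4]
rho = spearman(expert_ranks, rubric_ranks)

clean = "prefer primary sources and state which claim each query should verify".split()
looping = "check the source again " * 6
r_clean = rubric_reward(rho, atomicity=1.0, fmt=1.0, rubric_tokens=clean)
r_loop = rubric_reward(rho, atomicity=1.0, fmt=1.0, rubric_tokens=looping.split())

# A hypothetical group of G = 8 rewards for one branching point.
group = [r_clean, r_loop, 0.55, 0.70, 0.20, 0.85, 0.40, 0.65]
advantages = group_relative_advantages(group)
print(rho, r_clean, r_loop)
```

Because the gate multiplies the whole reward by zero, a rubric that loops on the same phrase cannot recover reward through the format or atomicity terms\.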
##### Co\-ReAct inference\.
At inference time the rubric generator is served via vLLM with temperature 0\.7, top\-$p$ 0\.95, and a maximum of 1024 output tokens per rubric; the search agent and the independent verifier both run on the same base Qwen3\-14B with temperature 0\. Verification accepts a step when the weighted fraction of satisfied rubric criteria exceeds $\tau=0.5$, and at most one retry is issued per step \(max\_retries = 1\) to bound compute\. Each search trajectory is truncated to a 6,000\-token budget before the answer\-rewriter stage, matching the protocol used for all baselines\.
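The acceptance rule and the bounded retry can be sketched as below\. The `Criterion` type, the `propose` and `verify` callables, and the `snippet_search` tool name are hypothetical stand\-ins for the agent and verifier calls; keeping the last action when the retry also fails is an assumption, since the text does not say what happens in that case\.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    text: str
    weight: float


def weighted_pass_fraction(criteria, satisfied):
    """Weighted share of rubric criteria marked as satisfied."""
    total = sum(c.weight for c in criteria)
    if total == 0:
        return 0.0
    return sum(c.weight for c, ok in zip(criteria, satisfied) if ok) / total


def accept(criteria, satisfied, tau=0.5):
    # "Exceeds" is read as a strict inequality.
    return weighted_pass_fraction(criteria, satisfied) > tau


def step_with_retry(propose, verify, criteria, tau=0.5, max_retries=1):
    """Propose an action, verify it against the rubric, retry with feedback if needed."""
    feedback = None
    for attempt in range(max_retries + 1):
        action = propose(feedback)
        satisfied = verify(action, criteria)
        if accept(criteria, satisfied, tau):
            return action, attempt
        # Pass the unmet criteria back so the retry can target them.
        feedback = [c.text for c, ok in zip(criteria, satisfied) if not ok]
    return action, max_retries  # assumption: keep the last attempt


# Toy run loosely mirroring the case study: the first attempt picks the wrong tool.
criteria = [
    Criterion("open the primary source with browse_webpage", 0.6),
    Criterion("disambiguate the paper title", 0.4),
]
propose = lambda fb: "snippet_search" if fb is None else "browse_webpage"
verify = lambda action, crit: [action == "browse_webpage", True]

action, retries_used = step_with_retry(propose, verify, criteria)
print(action, retries_used)
```

With `max_retries=1` each step costs at most two agent calls and two verifier calls, which is how the retry budget bounds the added inference compute\.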