EVE-Agent: Evidence-Verifiable Self-Evolving Agents

arXiv cs.AI Papers

Summary

EVE-Agent introduces a framework for self-evolving search agents that ensure evidence verifiability by generating questions, answers, and evidence spans, and training on marginal accuracy gain of evidence. This improves grounded correctness without human annotations.

arXiv:2605.22905v1 Announce Type: new

Cached at: 05/25/26, 08:55 AM

# EVE-Agent: Evidence-Verifiable Self-Evolving Agents
Source: [https://arxiv.org/html/2605.22905](https://arxiv.org/html/2605.22905)
Yamato Arai (Fujitsu Limited; Department of Basic Science, The University of Tokyo), Yuma Ichikawa (Fujitsu Limited; RIKEN Center for AIP)

###### Abstract

Self-evolving agents should not train on examples they cannot justify. Data-free self-evolving search agents offer a scalable route to systems that generate their own questions, answer them, and improve from their own feedback without human annotations. Yet, without verifiable evidence, this loop can reward fluent but unsupported examples, turning the self-generated curriculum into an opaque and potentially unreliable training signal. We argue that evidence verifiability is a prerequisite for trustworthy self-evolution in search agents: each generated instance should include not only an answer but also a source-grounded span whose contribution to that answer can be measured. We introduce EVE-Agent, an Evidence-Verifiable Self-Evolving Agent that operationalizes this principle through a modification to the proposer–solver framework. The proposer generates a question, an answer, and a verbatim evidence span. An evidence verifier then rewards the span according to the marginal accuracy gain when the evidence is provided. This produces a training signal that favors evidence that genuinely helps answer the question, without requiring oracle answers, human labels, or external annotations. EVE-Agent leaves the backbone model, retriever, search tool, and optimization framework unchanged. Experiments show that EVE-Agent substantially improves evidence-grounded correctness over prior self-evolving search agents. The resulting curriculum is not merely self-generated but auditable by construction: each training example carries an inspectable source span that explains why it should be trusted.

## 1 Introduction

Search agents for knowledge-intensive question answering must do more than retrieve relevant information: they must also ground their answers in appropriate evidence. This requirement distinguishes them from standard language models, which may generate fluent responses without revealing why those responses should be trusted. In supervised settings, evidence grounding is commonly enforced using human-curated question–answer datasets annotated with supporting spans, such as HotpotQA, 2WikiMultiHopQA, and MuSiQue (Yang et al., [2018](https://arxiv.org/html/2605.22905#bib.bib28); Ho et al., [2020](https://arxiv.org/html/2605.22905#bib.bib8); Trivedi et al., [2022](https://arxiv.org/html/2605.22905#bib.bib25)). Retrieval-augmented and tool-using language models operationalize this evidence-seeking behavior (Lewis et al., [2020](https://arxiv.org/html/2605.22905#bib.bib15); Yao et al., [2023](https://arxiv.org/html/2605.22905#bib.bib29); Schick et al., [2023](https://arxiv.org/html/2605.22905#bib.bib21); Trivedi et al., [2023](https://arxiv.org/html/2605.22905#bib.bib24); Asai et al., [2024](https://arxiv.org/html/2605.22905#bib.bib1)), while recent reinforcement-learning methods further demonstrate that models can learn to invoke search as part of their reasoning process (Jin et al., [2025](https://arxiv.org/html/2605.22905#bib.bib10); Song et al., [2025](https://arxiv.org/html/2605.22905#bib.bib23)).

However, constructing evidence-grounded supervision at scale is costly, tightly coupled to a particular corpus, and difficult to update when the retrieval environment changes. Data-free self-evolution provides an attractive alternative: a model generates its own training questions, attempts to solve them, and improves from the resulting feedback. This paradigm has shown strong promise in domains such as reasoning and code, where self-generated tasks can be validated by external oracles, including interpreters and symbolic checkers (Zhao et al., [2025](https://arxiv.org/html/2605.22905#bib.bib32); Huang et al., [2026](https://arxiv.org/html/2605.22905#bib.bib9)). It has also recently been extended to multi-turn search agents (Yue et al., [2026](https://arxiv.org/html/2605.22905#bib.bib31)). However, search-based question answering lacks the exact verification mechanisms available in code or mathematics. A self-generated question may be ambiguous, unsupported by the source text, or answerable from the model’s memorized knowledge alone. Likewise, a solver may produce a confident answer without supplying evidence that genuinely supports it.

![Refer to caption](https://arxiv.org/html/2605.22905v1/fig/fig1_loop_comparison_v2.png)

Figure 1: Evidence-verifiable self-evolving search agents. Existing self-evolving search agents (*left*) reward proposers using only a difficulty signal based on solver accuracy, without auditing the source evidence behind each question. EVE-Agent (*right*) requires the proposer to output a source-grounded evidence span and rewards it only when that evidence causally improves the solver’s answer accuracy, measured by the gain from no-evidence to with-evidence rollouts. This modification is limited to the reward: the proposer, solver, backbone model, and search tool remain unchanged.

This limitation exposes a central weakness in existing self-evolving search-agent loops. Their reward signals primarily assess whether a generated question is useful as a difficulty-controlled training instance. However, they do not verify whether the corresponding answer is grounded in a source span that can be checked. Consequently, unsupported examples may enter the self-generated curriculum and shape subsequent learning. The problem is not simply that evidence may be missing. In practice, a system may produce a syntactically valid evidence block even when the cited span does not actually justify the answer. Such examples are difficult to audit: once they are incorporated into the curriculum, it becomes unclear whether the agent is learning to search for and reason over evidence, or merely reinforcing fluent but unverifiable behavior.

We argue that evidence verifiability should be a core design principle for data-free, self-evolving search agents. Each generated training instance should include a source-grounded evidence span, and the utility of that span should be explicitly measurable. This requirement reframes evidence from an optional explanation into a training-time object that can be inspected, scored, and reused. It also makes the generated curriculum more trustworthy: every question–answer pair is paired with a concrete textual basis, and the system can be evaluated not only on whether it answers correctly but also on whether it provides evidence that supports its answer. To this end, we introduce EVE-Agent, a lightweight extension of the proposer–solver framework built around evidence verifiability. The proposer generates a question, an answer, and a verbatim evidence span from the source text. An evidence verifier then rewards the proposer according to the marginal improvement in the current solver’s answer accuracy when the evidence is provided, relative to answering from the question alone. This signal requires no oracle answers, human labels, or external annotations: it is computed solely from the solver, the proposer-emitted evidence, and the corpus. The same evidence span is subsequently used to train the solver to produce both an answer and supporting evidence. Importantly, EVE-Agent leaves the backbone model, retriever, search tool, and policy-optimization framework unchanged.

The resulting self-generated curriculum is auditable by design. Each training example is paired with an explicit source span, and that span is rewarded only when it helps the solver answer the question. This design discourages unsupported or purely memorization-based questions while preserving the scalability advantages of data-free self-evolution. It also offers a practical mechanism for inspecting the agent’s generated training data: the curriculum is no longer a collection of opaque question–answer pairs, but a set of evidence-linked instances whose grounding can be verified post hoc. Our experiments show that, under matched conditions, EVE-Agent substantially improves evidence-grounded correctness over prior self-evolving search-agent methods. These results suggest that self-evolving search agents can be trained not only to answer questions but also to produce evidence that makes their own training process verifiable.

## 2 Background

#### Notation.

Let $\mathcal{D}=\{d_1,\ldots,d_{|\mathcal{D}|}\}$ denote a finite corpus, where each document $d_i$ is represented as a token sequence. A task instance is a triple $(q,a,e)$ consisting of a question $q$, its target answer $a$, and an evidence span $e$. The evidence span is required to be a contiguous text span copied from either a source document in $\mathcal{D}$ or a snippet retrieved from that corpus. A search engine $\mathcal{R}$ is shared by all agents: given a text query, $\mathcal{R}$ returns a finite list of snippets drawn from $\mathcal{D}$. Throughout the paper, logarithms are natural, and $\mathbf{1}\{\cdot\}$ denotes the indicator function, which equals $1$ when its argument is true and $0$ otherwise. For any model $M$ and input $x$, we use $M(a\mid x)\coloneqq\mathbb{P}_{\hat{a}\sim M(\cdot\mid x)}[\hat{a}=a]$ to denote the probability that $M$ generates the answer string $a$ when conditioned on $x$.

#### Self-evolving search-agent loop.

The self-evolving search-agent framework consists of two policies that are updated over training rounds $t=1,\ldots,T$. The proposer policy, denoted by $\pi_t^{\mathrm{pro}}$, generates a training instance from a source document. In the prior framework, this instance is a question–answer pair $(q,a)$; in EVE-Agent, it is extended to a question–answer–evidence triple $(q,a,e)$. The solver policy, denoted by $\pi_t^{\mathrm{sol}}$, receives a question, may call the shared search engine $\mathcal{R}$ during its reasoning process, and outputs an answer. For notational convenience, we define

$$M_{\mathrm{sol},t}(\cdot\mid x)\coloneqq\pi_t^{\mathrm{sol}}(\cdot\mid x,\mathcal{R}) \tag{1}$$

for the answer distribution induced by the solver at round $t$ when it is given input $x$ and access to the search engine. We use $M_{\mathrm{pro},t}$ analogously for the distribution induced by the proposer. The setting is data-free in the sense that no human-labeled question–answer pairs or supporting spans are provided. The only human-supplied resources are the corpus $\mathcal{D}$ and the search engine $\mathcal{R}$.

#### Difficulty reward.

We first recall the difficulty-based proposer reward used in the prior self-evolving search-agent framework (Yue et al., [2026](https://arxiv.org/html/2605.22905#bib.bib31)). At training round $t$, a source document $d\in\mathcal{D}$ is sampled uniformly from the corpus. The proposer then generates a question–answer pair $(q,a)\sim\pi_t^{\mathrm{pro}}(\cdot\mid d,\mathcal{R})$, and the solver answers the same question $n$ times independently, $\{\hat{a}_j\}_{j=1}^{n}\sim M_{\mathrm{sol},t}(\cdot\mid q)$. Let $k\coloneqq\sum_{j=1}^{n}\mathbf{1}\{\hat{a}_j=a\}$ be the number of solver predictions that exactly match the proposer-provided answer. The proposer receives the difficulty reward

$$R^{\mathrm{DZ}}_t(q,a;k)=\mathbf{1}\{0<k<n\}\,\frac{n-k}{n-1}. \tag{2}$$

The prior system (Yue et al., [2026](https://arxiv.org/html/2605.22905#bib.bib31)) also adds a format-related term to this signal; in our notation we keep $R^{\mathrm{DZ}}_t$ for the pure difficulty component and treat the format reward separately in Eq. ([16](https://arxiv.org/html/2605.22905#S3.E16)).

This reward favors questions that are neither trivial nor impossible for the current solver. If all solver predictions are correct or all are incorrect, then the difficulty term is zero. Otherwise, the reward increases as the number of incorrect predictions grows. Thus, among questions that the solver can answer at least sometimes, the proposer is encouraged to generate examples near the solver’s current learning frontier. This intuition can be made precise. Define $p_t^{\mathrm{sol}}(q,a)\coloneqq M_{\mathrm{sol},t}(a\mid q)$, the probability that the solver at round $t$ generates answer string $a$ when conditioned on question $q$. Since the $n$ solver predictions are sampled independently, the number of correct predictions follows $k\sim\mathrm{Bin}\bigl(n,p_t^{\mathrm{sol}}(q,a)\bigr)$. Taking the expectation of Eq. ([2](https://arxiv.org/html/2605.22905#S2.E2)) over this binomial randomness yields

$$\mathbb{E}_{k\sim\mathrm{Bin}(n,p)}\bigl[R^{\mathrm{DZ}}_t(q,a;k)\bigr]=\phi_n\bigl(p_t^{\mathrm{sol}}(q,a)\bigr), \tag{3}$$

where

$$\phi_n(p)\coloneqq\frac{n}{n-1}(1-p)\bigl(1-(1-p)^{n-1}\bigr). \tag{4}$$

Lemma [B.1](https://arxiv.org/html/2605.22905#A2.Thmtheorem1) proves this identity. The function $\phi_n$ is continuous and unimodal on $[0,1]$, and it vanishes at both endpoints. Therefore, the expected difficulty reward is small both when the solver almost never answers correctly and when it almost always answers correctly; it is largest at an intermediate success probability.
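The identity between Eq. (2) and the closed form in Eq. (4) can be checked numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the values $n=5$ and $p=0.3$ are illustrative choices only.

```python
from math import comb


def difficulty_reward(k: int, n: int) -> float:
    """Eq. (2): zero when none or all of the n solver rollouts are correct."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return (n - k) / (n - 1) if 0 < k < n else 0.0


def phi(p: float, n: int) -> float:
    """Closed form of Eq. (4)."""
    return n / (n - 1) * (1 - p) * (1 - (1 - p) ** (n - 1))


def expected_reward(p: float, n: int) -> float:
    """Exact expectation of Eq. (2) under k ~ Bin(n, p), summed term by term."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) * difficulty_reward(k, n)
        for k in range(n + 1)
    )


# Illustrative values (assumptions, not taken from the paper).
gap = abs(expected_reward(0.3, 5) - phi(0.3, 5))
```

Here `gap` should be zero up to floating-point error, and `phi` evaluates to zero at both `p = 0` and `p = 1`, matching the endpoint behavior described above.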

#### Hop-grouped relative policy optimization.

Optimizing the proposer directly with group-relative policy optimization would be costly in the search-agent setting. A direct implementation would require nested sampling: for each source document, one would sample multiple candidate questions from the proposer, and for each candidate question one would run multiple solver rollouts. Because each solver rollout may call the search engine several times, this nested procedure is prohibitively expensive.

Hop-grouped relative policy optimization (HRPO) avoids this cost by sampling one proposer output per source document while normalizing rewards within comparable groups (Yue et al., [2026](https://arxiv.org/html/2605.22905#bib.bib31)). Suppose a batch contains $N$ generated question–answer pairs, $\{(q_i,a_i)\}_{i=1}^{N}$. Each question has a prescribed hop count $h_i\in\mathcal{H}$, where the hop count indicates the intended number of reasoning steps or evidence pieces needed to answer the question. For each hop value $h$, define $\mathcal{I}_h\coloneqq\{i:h_i=h\}$.

HRPO computes the standardized advantage of example $i$ within its hop group:

$$A_{i,h}=\frac{R^{\mathrm{DZ}}_t(q_i,a_i;k_i)-\mathbb{E}_{j\in\mathcal{I}_h}\bigl[R^{\mathrm{DZ}}_{t,j}\bigr]}{\sqrt{\mathrm{Var}_{j\in\mathcal{I}_h}\bigl[R^{\mathrm{DZ}}_{t,j}\bigr]}+\delta_0}, \tag{5}$$

where $k_i$ is the number of correct solver predictions for example $i$, $R^{\mathrm{DZ}}_{t,j}$ denotes the difficulty reward of example $j$, and $\delta_0>0$ is a numerical stabilizer.
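The hop-wise standardization in Eq. (5) can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: the function name is hypothetical, the variance is taken as the population variance within each hop group, and the default `delta0` is an arbitrary small value.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def hop_grouped_advantages(rewards, hops, delta0=1e-6):
    """Standardize each reward against the examples sharing its hop count."""
    groups = defaultdict(list)
    for i, h in enumerate(hops):
        groups[h].append(i)

    advantages = [0.0] * len(rewards)
    for indices in groups.values():
        values = [rewards[i] for i in indices]
        mean = fmean(values)
        std = pstdev(values)  # assumption: population std within the hop group
        for i in indices:
            advantages[i] = (rewards[i] - mean) / (std + delta0)
    return advantages


# Two 1-hop examples and two 2-hop examples (illustrative rewards).
adv = hop_grouped_advantages([1.0, 0.0, 0.5, 0.5], [1, 1, 2, 2])
```

In this toy batch the two 2-hop examples receive identical rewards, so both advantages are zero; the stabilizer `delta0` is what keeps that case from dividing by zero.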

Let $\pi_{\mathrm{ref}}^{\mathrm{pro}}$ be a frozen reference proposer policy. In our experiments, this reference is the proposer initialization at the beginning of Phase A. The HRPO update maximizes

$$\mathcal{J}^{\mathrm{HRPO}}_t=\frac{1}{N}\sum_{h\in\mathcal{H}}\sum_{i\in\mathcal{I}_h}\log\pi_t^{\mathrm{pro}}(q_i,a_i\mid d_i,\mathcal{R})\,A_{i,h}-\beta\,\mathbb{E}_d\left[\mathrm{KL}\left(\pi_t^{\mathrm{pro}}(\cdot\mid d,\mathcal{R})\,\middle\|\,\pi_{\mathrm{ref}}^{\mathrm{pro}}(\cdot\mid d,\mathcal{R})\right)\right], \tag{6}$$

where $\beta>0$ controls the strength of KL regularization. The KL divergence is taken between two conditional distributions over proposer outputs given the same source document $d$ and access to the same search engine $\mathcal{R}$. Its role is to keep the current proposer $\pi_t^{\mathrm{pro}}$ close to the frozen reference $\pi_{\mathrm{ref}}^{\mathrm{pro}}$, thereby preventing overly large policy updates. This KL term is conceptually separate from the relative advantage normalization in Eq. ([5](https://arxiv.org/html/2605.22905#S2.E5)).

#### Group-relative policy optimization.

The solver is trained with group-relative policy optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2605.22905#bib.bib22)). For a given question $q$, the rollout policy is the previous solver $\pi_{t-1}^{\mathrm{sol}}$. It samples a group of $n$ candidate responses, $\{\hat{y}_j\}_{j=1}^{n}\sim\pi_{t-1}^{\mathrm{sol}}(\cdot\mid q,\mathcal{R})$. Each response receives a binary answer reward $r_j=\mathbf{1}\{\hat{y}_j=a\}$, where $a$ is the target answer. Let $\overline{r}$ and $\hat{\sigma}$ be the empirical mean and standard deviation of $\{r_j\}_{j=1}^{n}$ within the group. The standardized advantage is $A_j=(r_j-\overline{r})/(\hat{\sigma}+\delta_0)$, where $\delta_0>0$ prevents division by zero.
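As a small worked example of the group-standardized advantage, the sketch below applies it to binary answer rewards. The function name, the use of the population standard deviation, and the default `delta0` are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def grpo_advantages(rewards, delta0=1e-6):
    """A_j = (r_j - mean) / (std + delta0), computed within one rollout group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # assumption: population std of the group
    return [(r - mean) / (std + delta0) for r in rewards]


# One correct rollout out of four: the correct one gets a positive advantage.
mixed = grpo_advantages([1, 0, 0, 0])

# All rollouts correct: every advantage is zero, so the group gives no gradient.
uniform = grpo_advantages([1, 1, 1, 1])
```

The second case shows why groups in which the solver is always right (or always wrong) contribute nothing to the update, which mirrors the zero difficulty reward in Eq. (2) for the same situations.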

Let $\pi_{\mathrm{ref}}^{\mathrm{sol}}$ be a frozen reference solver policy. In our experiments, this reference is the solver initialization at the beginning of Phase B. GRPO maximizes the clipped surrogate

$$\mathcal{J}^{\mathrm{GRPO}}_t=\mathbb{E}\left[\frac{1}{n}\sum_{j=1}^{n}\min\left(\rho_j A_j,\ \mathrm{clip}(\rho_j,1-\epsilon,1+\epsilon)\,A_j\right)\right]-\beta\,\mathbb{E}_q\left[\mathrm{KL}\left(\pi_t^{\mathrm{sol}}(\cdot\mid q,\mathcal{R})\,\middle\|\,\pi_{\mathrm{ref}}^{\mathrm{sol}}(\cdot\mid q,\mathcal{R})\right)\right],\qquad\rho_j=\frac{\pi_t^{\mathrm{sol}}(\hat{y}_j\mid q,\mathcal{R})}{\pi_{t-1}^{\mathrm{sol}}(\hat{y}_j\mid q,\mathcal{R})}, \tag{7}$$

where $\epsilon\in(0,1)$ is the clipping width, and $\beta>0$ is the KL coefficient.

The importance ratio $\rho_j$ compares the probability of the sampled response $\hat{y}_j$ under the current solver $\pi_t^{\mathrm{sol}}$ with its probability under the rollout policy $\pi_{t-1}^{\mathrm{sol}}$. In contrast, the KL term compares the full conditional output distribution of the current solver with the frozen reference solver $\pi_{\mathrm{ref}}^{\mathrm{sol}}$ for the same question $q$ and search engine $\mathcal{R}$. The ratio controls the policy-gradient update on sampled responses, while the KL term regularizes the entire updated policy. EVE-Agent keeps this optimization infrastructure unchanged: it uses HRPO for the proposer and GRPO for the solver, and changes only the reward design and, optionally, the source-document selector.
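The clipped term inside the expectation of Eq. (7) can be illustrated on plain numbers. This sketch deliberately omits the KL penalty and any automatic differentiation; the function name and the clipping width `eps=0.2` are assumptions, since the text only requires $\epsilon\in(0,1)$.

```python
from statistics import fmean


def clipped_surrogate(ratios, advantages, eps=0.2):
    """Group mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A); no KL term."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped_rho = min(max(rho, 1.0 - eps), 1.0 + eps)
        terms.append(min(rho * adv, clipped_rho * adv))
    return fmean(terms)


# Positive advantage: gains from pushing rho above 1 + eps are capped.
capped = clipped_surrogate([1.5], [1.0])

# Negative advantage: the unclipped (more pessimistic) term is kept.
uncapped = clipped_surrogate([1.5], [-1.0])
```

With a positive advantage the objective stops growing once the ratio leaves the clipping interval, whereas with a negative advantage the `min` keeps the larger penalty, so the update is never rewarded for moving far from the rollout policy.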

## 3 Method

The difficulty reward of Eq. ([2](https://arxiv.org/html/2605.22905#S2.E2)) encourages the proposer whenever the solver is uncertain about a generated question, but it does not verify whether the proposer’s answer is supported by any source span: the same reward is paid whether the underlying evidence is genuine or unrelated. Section [4.2](https://arxiv.org/html/2605.22905#S4.SS2) documents this failure mode empirically; we report it as a motivating diagnostic and devote the remainder of this section to the components that close the gap. Section [3.1](https://arxiv.org/html/2605.22905#S3.SS1) introduces the evidence verifier, which scores the proposer-emitted span by its causal effect on the solver’s answer accuracy. Section [3.2](https://arxiv.org/html/2605.22905#S3.SS2) reuses the same span as the supervision target during solver training. Section [3.3](https://arxiv.org/html/2605.22905#S3.SS3) describes an optional cluster bandit that diversifies the source-document distribution, and Section [3.4](https://arxiv.org/html/2605.22905#S3.SS4) explains how the proposer and solver are updated in two sequential phases. Figure [2](https://arxiv.org/html/2605.22905#S3.F2) summarizes the resulting Phase A dataflow. The backbone, retriever, search tool, and policy-optimization framework remain unchanged.

![Refer to caption](https://arxiv.org/html/2605.22905v1/fig/fig2_phaseA_anatomy_v2.png)

Figure 2: One Phase A iteration of EVE-Agent. The proposer generates a question–answer–evidence triple from the source document $d$. The solver attempts the question with the search tool, producing the difficulty reward of Eq. ([2](https://arxiv.org/html/2605.22905#S2.E2)); in parallel, single-turn search-disabled rollouts of the solver with and without the evidence span produce the evidence verifier of Eq. ([11](https://arxiv.org/html/2605.22905#S3.E11)). These two signals combine with the format and brevity terms in the proposer reward of Eq. ([16](https://arxiv.org/html/2605.22905#S3.E16)), which drives one HRPO update of the proposer. The backbone, the retriever, and the search tool are unchanged.

### 3.1 Evidence verifier for the proposer

#### Proposer output.

EVE-Agent changes the proposer output from a question–answer pair to a question–answer–evidence triple. Given a source document $d$ and access to the search engine $\mathcal{R}$, the proposer generates

$$(q,a,e)\sim\pi_t^{\mathrm{pro}}(\cdot\mid d,\mathcal{R}), \tag{8}$$

where $q$ is the generated question, $a$ is the target answer, and $e$ is the evidence span. The evidence span must be copied verbatim from either the source document $d$ or one of the snippets returned by the search engine during the proposer’s rollout. This constraint ensures that the evidence is not merely a free-form explanation, but a concrete text span that can be checked against the corpus.

After generation, we parse the output and apply a simple validity filter. Let $\textsc{Norm}(\cdot)$ denote the standard answer-normalization function that lowercases text, removes articles, strips punctuation, and collapses whitespace. A rollout is called *valid* if both $q$ and $a$ are non-empty and the normalized answer is not literally contained in the normalized question: $q\neq\emptyset,~a\neq\emptyset,~\textsc{Norm}(a)\not\subseteq\textsc{Norm}(q)$. The last condition removes degenerate questions that reveal their own answers. Invalid rollouts receive only the format-related reward defined below and are excluded from the evidence verifier.
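The validity filter can be sketched in a few lines. This is one plausible reading rather than the authors' implementation: the function names are hypothetical, the order of the normalization steps is an assumption, the article list is limited to English "a", "an", and "the", and "literally contained" is interpreted as substring containment.

```python
import re
import string

_ARTICLES = re.compile(r"\b(a|an|the)\b")
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip ASCII punctuation, drop articles, collapse whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    text = _ARTICLES.sub(" ", text)
    return " ".join(text.split())


def is_valid_rollout(question: str, answer: str) -> bool:
    """Valid if q and a are non-empty and Norm(a) does not occur inside Norm(q)."""
    if not question.strip() or not answer.strip():
        return False
    # Note: an answer that normalizes to the empty string is treated as invalid,
    # because the empty string is a substring of every question.
    return normalize_answer(answer) not in normalize_answer(question)


ok = is_valid_rollout("Which tower stands in Paris?", "The Eiffel Tower")
leaky = is_valid_rollout("Where is the Eiffel Tower?", "Eiffel Tower")
```

The second call is rejected because the normalized answer appears verbatim in the normalized question, which is the degenerate case the filter is meant to remove.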

#### Format reward.

Before scoring whether the evidence helps the solver, we first assign a lightweight format reward that checks whether the proposer output is structurally usable. The reward $F_{\mathrm{fmt}}(q,a,e,d,h)\in[0,1]$ depends on the generated question $q$, answer $a$, evidence span $e$, source document $d$, and prescribed hop count $h$. Here, $h$ denotes the intended number of reasoning steps, or equivalently the intended amount of multi-hop search behavior, for the generated question.

The format reward combines four equally weighted components. The first component is an integrity score that is always equal to $1$ for parsed outputs that reach this stage. The remaining three components are sub-scores in $[0,1]$: $F_{\mathrm{think}}$ rewards the presence of an explicit planning step, $F_{\mathrm{tool}}(h)$ rewards the expected number and syntactic correctness of tool calls for an $h$-hop rollout, and $F_{\mathrm{ans}}$ rewards a concise answer that is consistent with the available context. We define

$$F_{\mathrm{fmt}}(q,a,e,d,h)=\frac{1}{4}\left(1+F_{\mathrm{think}}+F_{\mathrm{tool}}(h)+F_{\mathrm{ans}}\right). \tag{9}$$

This term is intentionally simple: it ensures that the proposer follows the required output protocol, but it is not meant to judge whether the evidence actually supports the answer. That role is handled by the evidence verifier below. The precise definitions of the three sub-scores are provided in Appendix [D](https://arxiv.org/html/2605.22905#A4).

#### Evidence-quality score.

We now define the signal that measures whether the proposer-provided evidence is actually useful for answering the generated question. Consider a valid triple $(q,a,e)$, where $q$ is the question, $a$ is the target answer, and $e$ is the evidence span emitted by the proposer. The key idea is to compare two answer probabilities: one when the solver is given both the question and the evidence, and one when it is given the question alone.

Let $\widetilde{\pi}_{\mathrm{sol},t}(\cdot\mid q,e)$ denote the current solver at round $t$ under a single-turn answering protocol: the solver receives the question $q$ and evidence span $e$, but it is not allowed to make any additional search calls. Similarly, let $\widetilde{\pi}_{\mathrm{aux},t}(\cdot\mid q)$ denote an auxiliary scorer under the same single-turn, search-disabled protocol, but conditioned only on the question $q$. We define

$$\begin{aligned}p_{+}^{t}(q,e,a)&\coloneqq\mathbb{P}_{\hat{a}\sim\widetilde{\pi}_{\mathrm{sol},t}(\cdot\mid q,e)}\left[\hat{a}=a\right],\\ p_{-}^{t}(q,a)&\coloneqq\mathbb{P}_{\hat{a}\sim\widetilde{\pi}_{\mathrm{aux},t}(\cdot\mid q)}\left[\hat{a}=a\right].\end{aligned} \tag{10}$$

The first quantity, $p_{+}^{t}(q,e,a)$, is the probability of producing the correct answer when the evidence is provided. The second quantity, $p_{-}^{t}(q,a)$, is the probability of producing the same answer without access to that evidence. We define the evidence verifier as their difference:

$$V_t(q,e,a)\coloneqq p_{+}^{t}(q,e,a)-p_{-}^{t}(q,a). \tag{11}$$

Since both probabilities lie in $[0,1]$, the verifier score satisfies $V_t(q,e,a)\in[-1,1]$.

This construction isolates the marginal contribution of the provided evidence. Search is disabled in both conditions, so the score does not reward the solver for finding new information after receiving the question. Instead, it asks a narrower and more auditable question: does the span $e$ itself make the answer $a$ easier to produce? A positive verifier score means that conditioning on $e$ increases the solver’s probability of generating the target answer. A score near zero means that the evidence provides little additional information beyond the question. A negative score indicates that the evidence makes the target answer less likely, which is consistent with the span being misleading or irrelevant.

In our experiments, the auxiliary scorer uses the same weights as the current solver. Thus, $\widetilde{\pi}_{\mathrm{aux},t}$ and $\widetilde{\pi}_{\mathrm{sol},t}$ differ only in their inputs: the former receives $q$, whereas the latter receives $(q,e)$. Under this choice, $V_t(q,e,a)$ directly measures the conditional answer-accuracy gain induced by the evidence for the current solver. The framework also supports a variant in which the auxiliary scorer is a separately hosted frozen model, but we do not use that configuration in the experiments reported here.

#### Estimator.

The verifier score in Eq. ([11](https://arxiv.org/html/2605.22905#S3.E11)) depends on two answer probabilities, which we estimate by Monte Carlo sampling. For a fixed valid triple $(q,a,e)$ and an integer $m\geq 1$, we draw $m$ independent answers from the solver with evidence,

$$\hat{a}_{+}^{(j)}\sim\widetilde{\pi}_{\mathrm{sol},t}(\cdot\mid q,e),\qquad j=1,\ldots,m, \tag{12}$$

and $m$ independent answers from the auxiliary scorer without evidence,

$$\hat{a}_{-}^{(j)}\sim\widetilde{\pi}_{\mathrm{aux},t}(\cdot\mid q),\qquad j=1,\ldots,m. \tag{13}$$

We then estimate the evidence verifier by the difference between the two empirical accuracies:

$$\widehat{V}_{t,m}(q,e,a)\coloneqq\frac{1}{m}\sum_{j=1}^{m}\mathbf{1}\{\hat{a}_{+}^{(j)}=a\}-\frac{1}{m}\sum_{j=1}^{m}\mathbf{1}\{\hat{a}_{-}^{(j)}=a\}. \tag{14}$$

This estimator is unbiased, and its conditional variance is bounded by $1/(2m)$; both facts are formalized and proved as Proposition [B.2](https://arxiv.org/html/2605.22905#A2.Thmtheorem2) in Appendix [B](https://arxiv.org/html/2605.22905#A2). It requires $2m$ additional single-turn decodes per valid proposer rollout ($m$ with evidence and $m$ without), an overhead that is modest because verifier rollouts are search-disabled, whereas the main solver rollouts are multi-turn interactions that may call the search engine several times. We use $m=5$ throughout. A teacher-forced log-probability variant of the verifier is implemented but not used in the reported experiments because the sampling-based estimator is already sufficiently efficient.
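A minimal sketch of the estimator in Eq. (14) is given below. The two samplers are hypothetical stand-ins for single-turn, search-disabled solver decodes, the toy success probabilities are invented for illustration, and answers are compared by raw string equality (whether the actual system normalizes answers first is not specified here).

```python
import random
from typing import Callable


def estimate_verifier(
    sample_with_evidence: Callable[[], str],
    sample_without_evidence: Callable[[], str],
    target: str,
    m: int = 5,
) -> float:
    """Difference of two empirical accuracies, each from m independent draws."""
    if m < 1:
        raise ValueError("m must be at least 1")
    hits_with = sum(sample_with_evidence() == target for _ in range(m))
    hits_without = sum(sample_without_evidence() == target for _ in range(m))
    return (hits_with - hits_without) / m


def toy_sampler(p_correct: float, target: str, rng: random.Random) -> Callable[[], str]:
    """Hypothetical solver: returns the target with probability p_correct."""
    def sample() -> str:
        return target if rng.random() < p_correct else "<wrong>"
    return sample


rng = random.Random(0)
v_hat = estimate_verifier(
    toy_sampler(0.9, "paris", rng),  # assumed accuracy with evidence
    toy_sampler(0.2, "paris", rng),  # assumed accuracy without evidence
    "paris",
    m=5,
)
```

With $m=5$ the estimate can only take values that are multiples of $0.2$ in $[-1,1]$. The $1/(2m)$ variance bound has a short intuition when the two sample sets are independent: each empirical accuracy is a mean of $m$ Bernoulli draws with variance at most $1/(4m)$, and the two variances add.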

#### Brevity bonus\.

The evidence span should be informative but not unnecessarily long. If the proposer copies a very long passage, the verifier becomes less useful because the span may contain many irrelevant facts or reveal the answer without identifying the specific supporting context. Conversely, an extremely short span may collapse to answer leakage rather than evidence. To encourage concise and targeted evidence, we introduce a brevity bonus. Let $|e|$ denote the number of tokens in the evidence span under the proposer tokenizer. We define

$$B(e)\coloneqq\max\left(0,1-\frac{|e|}{L_{\max}}\right),\qquad L_{\max}=256. \tag{15}$$

The bonus is largest for short spans and decreases linearly with length. It becomes zero once the evidence span reaches $L_{\max}$ tokens. Thus, this term discourages copy-everything behavior while still allowing enough context for multi-hop or entity-rich questions.
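A direct transcription of Eq. (15) may help make the linear decay concrete. The function name and the choice to take a precomputed token count as input are assumptions for this sketch; in practice the count would come from the proposer tokenizer.

```python
def brevity_bonus(num_tokens: int, l_max: int = 256) -> float:
    """Brevity bonus B(e) = max(0, 1 - |e| / L_max) from Eq. (15).

    `num_tokens` is the evidence-span length |e| in tokens; how it is
    counted (which tokenizer) is outside the scope of this sketch.
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if l_max <= 0:
        raise ValueError("l_max must be positive")
    return max(0.0, 1.0 - num_tokens / l_max)
```

For example, a 64-token span receives a bonus of 0.75, a 128-token span receives 0.5, and any span of 256 tokens or more receives 0.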

#### Proposer reward\.

The final proposer reward combines structural validity, question difficulty, evidence usefulness, and evidence brevity. For a rollout with source document $d$, prescribed hop count $h$, generated triple $(q,a,e)$, and $k$ correct solver predictions among $n$ trials, we define

$$R^{\mathrm{pro}}_{t}(q,e,a;d,h,k)=\frac{1}{2}F_{\mathrm{fmt}}(q,a,e,d,h)+R^{\mathrm{DZ}}_{t}(q,a;k)+\lambda_{V}V_{t}(q,e,a)+\lambda_{B}B(e), \tag{16}$$

where $\lambda_{V}\geq 0$ controls the strength of the evidence-verifiability term and $\lambda_{B}\geq 0$ controls the strength of the brevity bonus. In all experiments, we set

$$(\lambda_{V},\lambda_{B})=(0.5,0.1). \tag{17}$$

Each term has a distinct role. The format reward ensures that the proposer follows the required output protocol. The difficulty reward encourages questions near the solver’s learning frontier. The verifier reward favors evidence spans that causally improve the solver’s ability to produce the target answer. The brevity bonus discourages overly long evidence spans that would make the verifier less diagnostic. The prior self-evolving search agent (Yue et al., [2026](https://arxiv.org/html/2605.22905#bib.bib31)) is recovered by setting

$$(\lambda_{V},\lambda_{B})=(0,0), \tag{18}$$

while keeping the same difficulty and format components. We optimize the proposer with the HRPO objective in Eq. ([6](https://arxiv.org/html/2605.22905#S2.E6)), replacing the original difficulty-only reward with $R^{\mathrm{pro}}_{t}$ and making no other changes to the optimization procedure.
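The combination in Eq. (16) can be sketched as a plain weighted sum. This example treats the format score, the difficulty reward, the verifier score, and the brevity bonus as already-computed inputs; the function and argument names are hypothetical, and the definitions of $F_{\mathrm{fmt}}$ and $R^{\mathrm{DZ}}_{t}$ are not reproduced here.

```python
def proposer_reward(
    format_score: float,       # F_fmt(q, a, e, d, h), computed elsewhere
    difficulty_reward: float,  # R^DZ_t(q, a; k), computed elsewhere
    verifier_score: float,     # V_t(q, e, a) or its Monte Carlo estimate
    brevity: float,            # B(e) from Eq. (15)
    lambda_v: float = 0.5,
    lambda_b: float = 0.1,
) -> float:
    """Weighted sum of the four proposer reward terms, following Eq. (16)."""
    if lambda_v < 0 or lambda_b < 0:
        raise ValueError("lambda_v and lambda_b must be non-negative")
    return (
        0.5 * format_score
        + difficulty_reward
        + lambda_v * verifier_score
        + lambda_b * brevity
    )
```

Calling the function with `lambda_v=0` and `lambda_b=0` leaves only the format and difficulty terms, which corresponds to the baseline setting in Eq. (18).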

### 3\.2 Solver reward

After training the proposer with the reward in Eq. ([16](https://arxiv.org/html/2605.22905#S3.E16)), we freeze it and use it to construct the solver-training data. Specifically, we roll out the trained proposer over the corpus. Each valid rollout produces a triple $(q,a,e)$, where $q$ is the generated question, $a$ is the target answer, and $e$ is the proposer-provided evidence span. We then treat $e$ as the gold evidence for training the solver. This choice is important: the same evidence span that the proposer was rewarded for producing is reused as the supervision target for the solver.

The solver is trained with the GRPO objective in Eq. ([7](https://arxiv.org/html/2605.22905#S2.E7)). For each generated training instance, the solver is required to output both an answer $\hat{a}$ and an evidence span $\hat{e}$. We define the solver reward as

$$R^{\mathrm{sol}}(\hat{a},\hat{e};a,e)=R_{\mathrm{correct}}(\hat{a},a)+\lambda_{E}R_{\mathrm{evidence}}(\hat{e},e), \tag{19}$$

where $R_{\mathrm{correct}}$ evaluates answer correctness and $R_{\mathrm{evidence}}$ evaluates evidence recovery. The answer reward is an exact-match indicator after standard answer normalization:

$$R_{\mathrm{correct}}(\hat{a},a)=\mathbf{1}\{\textsc{EM}(\hat{a},a)\}. \tag{20}$$

Here, $\textsc{EM}(\hat{a},a)$ equals true when the normalized predicted answer $\hat{a}$ exactly matches the normalized target answer $a$.

The evidence reward measures how well the solver’s extracted evidence span $\hat{e}$ matches the proposer-provided evidence span $e$. We use the SQuAD-style token-level F1 score:

$$R_{\mathrm{evidence}}(\hat{e},e)=\mathrm{F1}_{\mathrm{tok}}\left(\textsc{Norm}(\hat{e}),\textsc{Norm}(e)\right), \tag{21}$$

where $\textsc{Norm}$ is the normalization function defined earlier and $\mathrm{F1}_{\mathrm{tok}}$ is the harmonic mean of token-level precision and recall between the normalized predicted and target evidence spans. We set $\lambda_{E}=0.3$ in all experiments. This reward encourages the solver not only to answer correctly, but also to recover the evidence that supports the answer.
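The solver reward of Eqs. (19) to (21) can be sketched as follows. The `normalize` function below assumes the common SQuAD-style recipe (lowercasing, stripping punctuation and English articles, collapsing whitespace); the normalization actually used in the paper is defined outside this section and may differ. All function names are illustrative.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Assumed SQuAD-style normalization; a stand-in for Norm."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """Indicator reward of Eq. (20)."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between normalized spans, as in Eq. (21)."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Convention assumed here: two empty spans match, otherwise zero.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def solver_reward(pred_answer: str, pred_evidence: str,
                  gold_answer: str, gold_evidence: str,
                  lambda_e: float = 0.3) -> float:
    """Answer exact match plus weighted evidence F1, following Eq. (19)."""
    return exact_match(pred_answer, gold_answer) + lambda_e * token_f1(
        pred_evidence, gold_evidence
    )
```

With $\lambda_{E}=0.3$, the reward ranges from 0 (wrong answer, no evidence overlap) to 1.3 (exact answer and exact evidence), so a correct answer always outweighs evidence recovery alone.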

### 3\.3 Optional corpus selector

The default self\-evolution loop samples source documents uniformly from the corpus\. While simple, uniform sampling can concentrate training on a narrow set of documents or question patterns that happen to receive high reward early in training\. This may reduce curriculum diversity: the proposer can repeatedly generate similar questions that lie near the solver’s current learning frontier, while underusing other topics and reasoning types\.

To mitigate this issue, we include an optional corpus selector that can replace uniform document sampling\. The selector is not required for the core evidence\-verification mechanism, but it provides a way to diversify the self\-generated curriculum\. Its goal is to balance two forms of diversity\. The first is topic diversity, controlled by clustering documents in embedding space\. The second is question\-type diversity, controlled by a small set of open\-domain QA categories\.

Concretely, a frozen sentence encoder maps each document $d\in\mathcal{D}$ to a fixed vector representation. At training round $t$, these document embeddings are partitioned into $K_{t}$ clusters, where

$$K_{t}=K_{0}+\lfloor\alpha t\rfloor. \tag{22}$$

Here, $K_{0}$ is the initial number of clusters, $\alpha\geq 0$ controls how quickly the clustering granularity increases, and $\lfloor\cdot\rfloor$ denotes the floor function. Larger values of $K_{t}$ correspond to a finer partition of the corpus.

The selector uses two UCB1 bandits (Auer et al., [2002](https://arxiv.org/html/2605.22905#bib.bib2)). The first is a cluster bandit whose arms correspond to the $K_{t}$ document clusters. The second is a task-type bandit whose arms correspond to five question types: factual, comparison, causal, temporal, and aggregation. At each sampling step, the cluster bandit selects a topic cluster and the task-type bandit selects a desired question type. A source document is then sampled from the selected cluster with probability proportional to two factors: its distance from the cluster centroid and the inverse of its previous usage count. This favors documents that are both less typical within the cluster and less frequently sampled, encouraging coverage of boundary cases and reducing repeated use of the same documents.
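The pieces of the selector described above can be sketched under a few stated assumptions. The cluster schedule follows Eq. (22) and the bandit uses the textbook UCB1 index (empirical mean plus $\sqrt{2\ln N/n_j}$). The document weighting multiplies the two factors and uses $1/(1+\text{usage})$ so that unused documents have a finite weight; both choices are assumptions of this sketch, and the paper's exact specification is in its Appendix F.

```python
import math
from typing import List


def num_clusters(k0: int, alpha: float, t: int) -> int:
    """Cluster count K_t = K_0 + floor(alpha * t), as in Eq. (22)."""
    return k0 + math.floor(alpha * t)


class UCB1:
    """Textbook UCB1 bandit over a fixed number of arms."""

    def __init__(self, n_arms: int) -> None:
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def select(self) -> int:
        # Play every arm once before using the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        # Incremental update of the running mean reward for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def document_weights(distances: List[float], usage_counts: List[int]) -> List[float]:
    """Unnormalized sampling weights: centroid distance times inverse usage.

    The product form and the +1 smoothing are assumptions of this sketch.
    """
    return [d / (1 + u) for d, u in zip(distances, usage_counts)]
```

A sampling step would call `select()` on the cluster bandit and on the task-type bandit, then draw a document from the chosen cluster, for instance with `random.choices(docs, weights=document_weights(...))`. Because $K_{t}$ grows over rounds, a full implementation would also need to rebuild or extend the cluster bandit when the partition changes, which this sketch does not cover.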

The bandits are updated offline between self-evolution iterations. Their feedback reward combines a diversity term, which penalizes repeatedly selecting already overused clusters or task types, with a utility term, which compares the reward of the current generated sample to the running average reward for the corresponding arm. In this way, the selector encourages exploration without ignoring which parts of the corpus are currently useful for training. Since the selector is orthogonal to the evidence verifier, we do not isolate its empirical effect in the present experiments. The full algorithmic specification is given in Appendix [F](https://arxiv.org/html/2605.22905#A6).

### 3\.4 Two\-phase training schedule

We train the proposer and the solver in two sequential phases rather than updating them jointly\. This design has two motivations\. First, it reduces compute: only one model is optimized with policy gradients at a time\. Second, it improves stability\. The evidence verifier depends on the solver, so updating the solver while simultaneously using it to score proposer outputs would make the reward landscape non\-stationary and difficult to interpret\.

In *Phase A*, the solver is kept fixed at its initialization, and the proposer is trained with HRPO using the reward $R^{\mathrm{pro}}_{t}$ in Eq. ([16](https://arxiv.org/html/2605.22905#S3.E16)). The auxiliary scorer used in the verifier shares weights with this fixed solver. For each valid proposer rollout, the verifier score is estimated by the Monte Carlo estimator $\widehat{V}_{t,m}$ in Eq. ([14](https://arxiv.org/html/2605.22905#S3.E14)); we use $m=5$ samples. Thus, Phase A teaches the proposer to generate not only difficult questions, but also evidence spans that improve the fixed solver’s ability to recover the target answer.

At the end of Phase A, we freeze the trained proposer and use it to generate the solver-training set. We roll it out over a held-out shard of the corpus with the full multi-turn search tool enabled. Each valid rollout contributes one triple $(q,a,e)$, where $q$ is the generated question, $a$ is the target answer, and $e$ is the proposer-provided evidence span. We store $e$ as the gold evidence for the corresponding question–answer pair.

In *Phase B*, the proposer is frozen, and the solver is trained with GRPO using the reward $R^{\mathrm{sol}}$ in Eq. ([19](https://arxiv.org/html/2605.22905#S3.E19)). The solver is required to output both an answer and an evidence span, and it is rewarded for both answer correctness and evidence recovery. This phase transfers the evidence-verifiable curriculum produced by the proposer into the solver.

Relative to a standard self-evolving search agent, the modification is intentionally local. EVE-Agent keeps the backbone model, retriever, search tool, hop grouping, and the HRPO and GRPO optimization objectives unchanged. The core change is the reward design: the proposer is rewarded for generating useful evidence, and the solver is rewarded for recovering that evidence. Two basic guarantees of the reward design are stated and proved in Appendix [B](https://arxiv.org/html/2605.22905#A2): a closed form and a unique interior maximizer for the inherited difficulty reward, and an unbiased, bounded-variance interpretation of the evidence verifier as a marginal answer-accuracy gain. Algorithms [1](https://arxiv.org/html/2605.22905#alg1)–[2](https://arxiv.org/html/2605.22905#alg2) in Appendix [E](https://arxiv.org/html/2605.22905#A5) give the full data flow.

## 4 Experiments

Our experimental study is organized around two questions, addressed in turn. First, does the difficulty-only reward of prior self-evolving search agents leave a measurable evidence-grounding gap? Section [4.2](https://arxiv.org/html/2605.22905#S4.SS2) answers this question and isolates the empirical motivation for the verifier. Second, does the evidence verifier of Eq. ([11](https://arxiv.org/html/2605.22905#S3.E11)) close this gap while preserving, or improving, answer accuracy across benchmarks? Section [4.3](https://arxiv.org/html/2605.22905#S4.SS3) answers this question and isolates the verifier’s contribution under matched compute and matched search tools. Throughout the section we report per-benchmark numbers in tables and discuss the qualitative picture in prose; full quantitative details and additional protocol-specific information are deferred to Appendix [C](https://arxiv.org/html/2605.22905#A3) and Appendix [H](https://arxiv.org/html/2605.22905#A8).

### 4\.1 Experimental setup

#### Models and search tool\.

The proposer, the solver, and the auxiliary scorer in Eq. ([11](https://arxiv.org/html/2605.22905#S3.E11)) share the Qwen2.5-3B-Instruct (Yang et al., [2024](https://arxiv.org/html/2605.22905#bib.bib27)) backbone, and the auxiliary scorer reuses the current solver weights. All systems share the same retrieval pipeline: passages from the FlashRAG Wikipedia-2018 snapshot are encoded by E5-base-v2 and indexed by FAISS-IVF; each search call returns the top three passages, and a multi-turn rollout permits at most five assistant turns. Full indexing parameters and decoding settings are listed in Appendix [C](https://arxiv.org/html/2605.22905#A3).

#### Training schedule and key hyperparameters\.

Both phases use a single $8\times$ B200 node, a global batch size of $256$, and run for $50$ policy-gradient steps. The proposer-side coefficients $(\lambda_{V},\lambda_{B})$ that enter the proposer reward of Eq. ([16](https://arxiv.org/html/2605.22905#S3.E16)) are set to $(0.5,0.1)$, and the brevity cutoff is $L_{\max}=256$ tokens; the verifier Monte Carlo budget in Eq. ([14](https://arxiv.org/html/2605.22905#S3.E14)) is $m=5$. The solver-side coefficient $\lambda_{E}$ in Eq. ([19](https://arxiv.org/html/2605.22905#S3.E19)) is set to $0.3$. Hop counts $h\in\{1,2,3,4\}$ are sampled in ratio $4{:}3{:}2{:}1$. Because the auxiliary scorer shares its weights with the solver, the verifier adds only $2m=10$ single-turn, search-disabled decodes per valid proposer rollout, which is negligible compared with the multi-turn search rollouts. The full list of optimizer settings and additional implementation choices is given in Appendix [C](https://arxiv.org/html/2605.22905#A3), with hyperparameters summarized in Appendix [G](https://arxiv.org/html/2605.22905#A7).
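Two of the quantities above reduce to simple arithmetic, sketched here for concreteness. The $4{:}3{:}2{:}1$ ratio corresponds to hop probabilities of 0.4, 0.3, 0.2, and 0.1, and the verifier overhead is $2m$ decodes per valid rollout. The function names and the use of `random.Random` are assumptions of this example, not a description of the training code.

```python
import random
from typing import Dict

# Hop-count sampling ratio 4:3:2:1 for h in {1, 2, 3, 4}.
HOP_WEIGHTS: Dict[int, int] = {1: 4, 2: 3, 3: 2, 4: 1}


def hop_probabilities(weights: Dict[int, int] = HOP_WEIGHTS) -> Dict[int, float]:
    """Normalize the integer ratio into sampling probabilities."""
    total = sum(weights.values())
    return {hop: weight / total for hop, weight in weights.items()}


def sample_hop(rng: random.Random, weights: Dict[int, int] = HOP_WEIGHTS) -> int:
    """Draw one hop count according to the ratio."""
    hops = list(weights)
    return rng.choices(hops, weights=[weights[h] for h in hops], k=1)[0]


def verifier_decodes(valid_rollouts: int, m: int = 5) -> int:
    """Extra single-turn decodes: m with evidence plus m without, per rollout."""
    return 2 * m * valid_rollouts
```

For instance, 100 valid proposer rollouts with $m=5$ would add 1,000 single-turn, search-disabled decodes.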

#### Benchmarks and metrics\.

We evaluate on seven open-domain QA datasets: NaturalQuestions (Kwiatkowski et al., [2019](https://arxiv.org/html/2605.22905#bib.bib12)), TriviaQA (Joshi et al., [2017](https://arxiv.org/html/2605.22905#bib.bib11)), PopQA (Mallen et al., [2023](https://arxiv.org/html/2605.22905#bib.bib17)), HotpotQA (Yang et al., [2018](https://arxiv.org/html/2605.22905#bib.bib28)), 2WikiMultiHopQA (Ho et al., [2020](https://arxiv.org/html/2605.22905#bib.bib8)), MuSiQue (Trivedi et al., [2022](https://arxiv.org/html/2605.22905#bib.bib25)), and Bamboogle (Press et al., [2023](https://arxiv.org/html/2605.22905#bib.bib20)). Each system is required to emit an answer and a supporting evidence span. We report three complementary metrics: answer exact match (EM) after answer normalization; an evidence score judged by GPT-4.1 from the question, the gold answer, and the emitted span; and the joint rate at which the answer is correct *and* the evidence is judged supporting. The average column is the unweighted mean over the seven benchmarks.
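The relationship between the three metrics can be sketched as follows, assuming that the external judge returns a binary supporting or not-supporting verdict per instance (the judge's actual output format is not specified in this section). The record type and function names are hypothetical.

```python
from typing import Dict, List, NamedTuple


class JudgedInstance(NamedTuple):
    answer_correct: bool      # exact match after answer normalization
    evidence_supported: bool  # assumed binary verdict from the external judge


def benchmark_metrics(records: List[JudgedInstance]) -> Dict[str, float]:
    """Per-benchmark answer EM, evidence score, and joint rate."""
    if not records:
        raise ValueError("records must be non-empty")
    n = len(records)
    return {
        "em": sum(r.answer_correct for r in records) / n,
        "evidence": sum(r.evidence_supported for r in records) / n,
        # Joint rate credits an instance only when both conditions hold.
        "joint": sum(r.answer_correct and r.evidence_supported for r in records) / n,
    }


def unweighted_average(per_benchmark: Dict[str, float]) -> float:
    """Unweighted mean over benchmarks, ignoring evaluation-pool sizes."""
    if not per_benchmark:
        raise ValueError("per_benchmark must be non-empty")
    return sum(per_benchmark.values()) / len(per_benchmark)
```

By construction the joint rate never exceeds either of the other two metrics. Because the average is unweighted, a small split such as Bamboogle contributes as much to it as a 1,000-instance benchmark.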

#### Compared systems\.

We compare four systems trained or evaluated under matched protocols. *Initial (no search)* and *Initial (search)* are the untrained backbone evaluated without and with the search tool, respectively. *Dr. Zero* is a faithful re-implementation of Yue et al. ([2026](https://arxiv.org/html/2605.22905#bib.bib31)) under the same backbone, retrieval corpus, tool, rollout budget, and wall-clock budget. *EVE-Agent* adds the evidence-verifier reward and the matching solver evidence term while leaving the proposer, solver, backbone, and search tool otherwise identical to Dr. Zero. This design isolates the contribution of evidence-aware reward shaping.

### 4\.2 Evidence\-grounding bottleneck of prior systems

#### Objective\.

We first quantify the failure mode that motivates EVE-Agent: even when a self-evolving search agent attains competitive answer accuracy, the spans it emits as evidence may bear no real relation to the answer. The goal of this experiment is to verify that the difficulty-only reward of prior systems leaves a measurable gap between *producing* an answer and *justifying* it.

#### Protocol\-specific details\.

On top of the shared setup of Section [4.1](https://arxiv.org/html/2605.22905#S4.SS1), we sample $1{,}000$ test instances per benchmark, except for Bamboogle, for which we use the full $125$-instance split. Each instance is decoded once by every system with the search tool enabled, and the emitted evidence span is passed to the external judge. We focus on a representative subset of four datasets that spans both single-hop and multi-hop regimes; the full breakdown is given in Appendix [H](https://arxiv.org/html/2605.22905#A8).

#### Results\.

Table [1](https://arxiv.org/html/2605.22905#S4.T1) reports the judged evidence score and the joint answer-and-evidence rate on the four-dataset subset; the corresponding answer-accuracy numbers are given in Appendix [H](https://arxiv.org/html/2605.22905#A8). Two patterns are immediate. First, the prior self-evolving search agent does not narrow the evidence-quality gap relative to the untrained backbone, even though its answer accuracy is markedly higher: on the open-domain datasets the judge accepts only a similar fraction of its spans as supporting. Second, the joint correctness rate, which credits an instance only when the answer is right *and* the evidence is supporting, is correspondingly low for the prior system, comparable to the untrained baselines on most benchmarks. EVE-Agent improves both quantities on every dataset in this subset except 2WikiMultiHopQA, where the evidence score is competitive but not the best.

Table 1: Evidence-grounding diagnostics on a representative subset of benchmarks ($1{,}000$ test instances each, except Bamboogle with $125$). *Prior* is a faithful re-implementation of Yue et al. ([2026](https://arxiv.org/html/2605.22905#bib.bib31)). All systems use the same backbone and search tool; the external judge sees only the model-emitted evidence span. The full breakdown is given in Appendix [H](https://arxiv.org/html/2605.22905#A8).

#### Discussion\.

The bottleneck is not that prior systems omit evidence: Appendix [H](https://arxiv.org/html/2605.22905#A8) shows that they emit a syntactically valid evidence block in over $90\%$ of rollouts, comparable to the initial backbone. The bottleneck is that the emitted block frequently fails to justify the predicted answer, which is consistent with the difficulty reward’s design: it audits whether a question is challenging, not whether it is supported. The evidence verifier of Section [3.1](https://arxiv.org/html/2605.22905#S3.SS1) addresses this by rewarding spans that causally raise the solver’s probability of producing the target answer.

### 4\.3 Main results across benchmarks

#### Objective\.

Having identified the bottleneck, we now ask whether the evidence verifier closes it across the full benchmark suite without sacrificing answer accuracy\. We compare EVE\-Agent against the matched Dr\. Zero re\-implementation and the two untrained baselines on all seven datasets and report each of the three metrics in turn \(answer EM, evidence score, joint correctness\)\.

#### Protocol\-specific details\.

We reuse the setup of Section [4.1](https://arxiv.org/html/2605.22905#S4.SS1) verbatim; the only change relative to Section [4.2](https://arxiv.org/html/2605.22905#S4.SS2) is that all seven benchmarks are now included. Decoding is greedy and the search tool is enabled for both the prior system and EVE-Agent. The external judge configuration is unchanged.

#### Answer accuracy\.

Answer exact match is the standard summary metric for open-domain QA; we report it first because a verifier that improved evidence quality at the cost of answer correctness would not be useful. Table [2](https://arxiv.org/html/2605.22905#S4.T2) reports answer EM. EVE-Agent attains the best average EM and is strongest on five of the seven benchmarks (NQ, TriviaQA, PopQA, HotpotQA, and 2WikiMultiHopQA), each of which has a meaningfully large evaluation pool. On MuSiQue and Bamboogle the picture is mixed: Dr. Zero edges out EVE-Agent on MuSiQue by a small margin, and the untrained no-search backbone is highest on the small Bamboogle split. Overall, the evidence-oriented reward does not trade off answer correctness for explanation format; on the contrary, accuracy improves under the same compute and search budget as the prior self-evolving search agent.

Table 2: Answer accuracy (exact match) on seven open-domain QA benchmarks.

#### Evidence quality\.

Evidence quality is the central diagnostic of the bottleneck of Section [4.2](https://arxiv.org/html/2605.22905#S4.SS2): only this metric reveals whether the predicted answer is paired with a source span that a careful reader could use to verify it. Table [3](https://arxiv.org/html/2605.22905#S4.T3) reports the judged evidence score across all seven benchmarks. EVE-Agent substantially improves evidence support on all single-hop benchmarks (NQ, TriviaQA, and PopQA) and on HotpotQA, and the average evidence score across the seven benchmarks is the highest of the four systems by a clear margin. The remaining benchmarks are mixed: on 2WikiMultiHopQA Dr. Zero is slightly stronger; on MuSiQue the search-equipped initial backbone narrowly edges out EVE-Agent; and on Bamboogle the small evaluation split favors the no-search backbone. These exceptions notwithstanding, the verifier-shaped reward produces evidence that the external judge accepts as supporting more often than any baseline, including the search-equipped initial backbone. Crucially, this improvement is obtained with the same backbone, retriever, and search tool used by Dr. Zero, so it can be attributed to the reward design rather than to additional capacity or retrieval signal.

Table 3: Evidence score judged by an external GPT-4.1 evaluator. The judge sees only the question, the gold answer, and the model-emitted evidence span.

#### Joint answer\-and\-evidence correctness\.

The joint metric is the strictest of the three: an instance is credited only when the predicted answer matches the gold answer *and* the emitted span is judged supporting. From the perspective of evidence verifiability, this is the regime that matters most, because it isolates the cases in which the agent’s output is simultaneously correct and auditable. Table [4](https://arxiv.org/html/2605.22905#S4.T4) reports the result across all seven benchmarks. EVE-Agent obtains the best average score and is strongest on six of seven benchmarks; on the average column it improves over the matched Dr. Zero re-implementation by a wide margin. The improvement is not a redistribution between the previous two metrics: EVE-Agent’s answer-accuracy gains are concentrated in instances whose evidence is also judged sufficient, which is precisely the population that can be reused as reliable training signal for downstream learning or as a basis for human inspection of the curriculum.

Table 4: Joint answer-and-evidence score. An instance is counted only when the answer is correct *and* the emitted evidence is judged supporting by the external judge.

#### Discussion\.

Read together, Tables [2](https://arxiv.org/html/2605.22905#S4.T2)–[4](https://arxiv.org/html/2605.22905#S4.T4) support the central claim of the paper. The evidence verifier of Eq. ([11](https://arxiv.org/html/2605.22905#S3.E11)) produces a training-time signal whose gains are not bought by relaxing answer correctness, by overfitting to evidence presence, or by exploiting additional compute relative to the prior system. The answer accuracy and the evidence quality improve together, and the strict joint metric, which is the metric most aligned with evidence verifiability, exhibits the largest relative gap to prior work. The pattern is consistent across single-hop and multi-hop benchmarks, with the two exceptions involving either a small evaluation pool (Bamboogle, $125$ instances) or a regime in which the prior system already matches the untrained backbone on evidence grounding (2WikiMultiHopQA). The persistence of the improvement on the strict joint metric is, in our view, the cleanest summary of the verifier’s effect: under matched compute and matched search tools, EVE-Agent generates curricula and trains solvers whose answers are simultaneously more often correct and more often verifiable.

## 5 Related work

#### Data\-free self\-evolving reasoning and search agents\.

Absolute Zero (Zhao et al., [2025](https://arxiv.org/html/2605.22905#bib.bib32)) introduced the data-free paradigm with a Python-interpreter oracle, and R-Zero (Huang et al., [2026](https://arxiv.org/html/2605.22905#bib.bib9)) generalized it through a challenger–solver decoupling. Recent work (Yue et al., [2026](https://arxiv.org/html/2605.22905#bib.bib31)) ports the loop to multi-turn search agents with hop-grouped relative policy optimization; SAGE (Peng et al., [2026](https://arxiv.org/html/2605.22905#bib.bib19)) adds multi-agent critics and AReaL-SEA (Gao et al., [2026](https://arxiv.org/html/2605.22905#bib.bib4)) adds multi-turn tool use to the same template. EVE-Agent differs in injecting a data-free *evidence verifier* so that the proposer’s reward depends on whether the emitted span causally helps the trained solver, not only on whether the solver is uncertain.

#### Verifier\-based and retrieval\-augmented RL\.

Verifier-based rewards (Lambert et al., [2024](https://arxiv.org/html/2605.22905#bib.bib13); Shao et al., [2024](https://arxiv.org/html/2605.22905#bib.bib22); Cobbe et al., [2021](https://arxiv.org/html/2605.22905#bib.bib3)) and knowledge-graph verifiers (Yuan et al., [2026](https://arxiv.org/html/2605.22905#bib.bib30)) either assume an external oracle or pay a heavy graph-construction cost; in contrast, the verifier of Eq. ([11](https://arxiv.org/html/2605.22905#S3.E11)) is defined entirely through the trained solver, the proposer-emitted evidence, and the corpus. Search-R1 (Jin et al., [2025](https://arxiv.org/html/2605.22905#bib.bib10)) and R1-Searcher (Song et al., [2025](https://arxiv.org/html/2605.22905#bib.bib23)) are retrieval-augmented RL agents trained on supervised question–answer pairs, and Self-RAG and IRCoT provide self-critic and interleaved retrieval templates (Asai et al., [2024](https://arxiv.org/html/2605.22905#bib.bib1); Trivedi et al., [2023](https://arxiv.org/html/2605.22905#bib.bib24)) that we inherit at the proposer level.

#### Curriculum diversity\.

Semantic diversity rewards (Wan et al., [2026](https://arxiv.org/html/2605.22905#bib.bib26)) and the R-Diverse template of Li et al. ([2026](https://arxiv.org/html/2605.22905#bib.bib16)) act *after* sampling by down-weighting near-duplicates. The optional selector of Section [3.3](https://arxiv.org/html/2605.22905#S3.SS3) instead acts *before* sampling via a cluster bandit, in the lineage of UCB1 (Auer et al., [2002](https://arxiv.org/html/2605.22905#bib.bib2)) and non-stationary bandit analyses (Garivier and Moulines, [2011](https://arxiv.org/html/2605.22905#bib.bib5); Lattimore and Szepesvári, [2020](https://arxiv.org/html/2605.22905#bib.bib14)); curriculum-learning methods (Graves et al., [2017](https://arxiv.org/html/2605.22905#bib.bib7); Matiisen et al., [2020](https://arxiv.org/html/2605.22905#bib.bib18)) provide background on schedule design more broadly.

#### Evidence\-grounded evaluation of search agents\.

Evidence-aware evaluation benchmarks (Glockner et al., [2025](https://arxiv.org/html/2605.22905#bib.bib6)) formalize the question of whether a model emits a supporting span; our verifier turns this diagnostic into a training-time signal.

## 6 Conclusion

We argued that *evidence verifiability* should be treated as a first-class property of data-free self-evolving search agents. In existing systems, the proposer is rewarded for generating difficult questions, but no reward checks whether the predicted answer is grounded in a span that a careful reader could use to verify it. We documented the resulting bottleneck empirically on representative open-domain QA benchmarks: a faithful re-implementation of the prior self-evolving search agent answers questions competitively yet emits evidence whose judged quality is no better than that of an untrained backbone.

EVE\-Agent closes this gap with a minimal extension that is local to the reward\. The proposer is required to emit a verbatim source span together with its question and answer, and the span is credited only when its inclusion causally improves the current solver’s ability to recover the target answer; the solver is in turn trained to reproduce both the answer and the supporting span\. The backbone, the retriever, the search tool, and the policy\-optimization framework remain unchanged\.

Across seven open\-domain QA benchmarks, under matched compute and matched search tools, this single change yields curricula and solvers whose outputs are simultaneously more often correct and more often verifiable: EVE\-Agent improves on answer accuracy, on judged evidence quality, and—most importantly—on the strict joint metric that credits an instance only when both are adequate\. The resulting training loop is auditable by construction\. Every generated example carries an inspectable source span whose contribution can be checked against the current solver, and the curriculum can be reviewed instance by instance rather than treated as an opaque collection of question–answer pairs\.

We see this as a small but concrete step toward *safer self-evolution*. As data-free agents increasingly write their own training data, evidence grounding becomes a prerequisite for trust: an agent’s improvements should not rely on training signal that cannot itself be verified, and the soundness of the learning process should be checkable from the data the agent produces, not only from the metrics it eventually reports. EVE-Agent treats evidence as a training-time object that can be inspected, scored, and reused; we expect that incorporating similar verifiability constraints will become standard practice as self-evolving search agents are deployed in settings where their training data, and not only their final answers, must be trusted.

## References

- Asai et al\. \[2024\] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi\. Self\-RAG: Learning to retrieve, generate, and critique through self\-reflection\. In *International Conference on Learning Representations \(ICLR\)*, 2024\. arXiv:2310\.11511\.
- Auer et al\. \[2002\] Peter Auer, Nicolò Cesa\-Bianchi, and Paul Fischer\. Finite\-time analysis of the multiarmed bandit problem\. *Machine Learning*, 47:235–256, 2002\.
- Cobbe et al\. \[2021\] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman\. Training verifiers to solve math word problems\. *arXiv preprint arXiv:2110\.14168*, 2021\.
- Gao et al\. \[2026\] Jiaxuan Gao, Jiaao Chen, Chuyi He, Shusheng Xu, Di Jin, and Yi Wu\. From self\-evolving synthetic data to verifiable\-reward RL: Post\-training multi\-turn interactive tool\-using agents\. *arXiv preprint arXiv:2601\.22607*, 2026\.
- Garivier and Moulines \[2011\] Aurélien Garivier and Eric Moulines\. On upper\-confidence bound policies for switching bandit problems\. In *Algorithmic Learning Theory \(ALT\)*, 2011\.
- Glockner et al\. \[2025\] Max Glockner, Xiang Jiang, Leonardo F\. R\. Ribeiro, Iryna Gurevych, and Markus Dreyer\. NeoQA: Evidence\-based question answering with generated news events\. In *Findings of the Association for Computational Linguistics \(ACL\)*, 2025\.
- Graves et al\. \[2017\] Alex Graves, Marc G\. Bellemare, Jacob Menick, Rémi Munos, and Koray Kavukcuoglu\. Automated curriculum learning for neural networks\. In *International Conference on Machine Learning \(ICML\)*, 2017\.
- Ho et al\. \[2020\] Xanh Ho, Anh\-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa\. Constructing a multi\-hop QA dataset for comprehensive evaluation of reasoning steps\. In *International Conference on Computational Linguistics \(COLING\)*, 2020\.
- Huang et al\. \[2026\] Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu\. R\-Zero: Self\-evolving reasoning LLM from zero data\. In *International Conference on Learning Representations \(ICLR\)*, 2026\. arXiv:2508\.05004\.
- Jin et al\. \[2025\] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han\. Search\-R1: Training LLMs to reason and leverage search engines with reinforcement learning\. *arXiv preprint arXiv:2503\.09516*, 2025\.
- Joshi et al\. \[2017\] Mandar Joshi, Eunsol Choi, Daniel S\. Weld, and Luke Zettlemoyer\. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension\. In *Annual Meeting of the Association for Computational Linguistics \(ACL\)*, 2017\.
- Kwiatkowski et al\. \[2019\] Tom Kwiatkowski et al\. Natural questions: A benchmark for question answering research\. *Transactions of the Association for Computational Linguistics \(TACL\)*, 2019\.
- Lambert et al\. \[2024\] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester J Vedelgo Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al\. Tulu 3: Pushing frontiers in open language model post\-training\. *arXiv preprint arXiv:2411\.15124*, 2024\.
- Lattimore and Szepesvári \[2020\] Tor Lattimore and Csaba Szepesvári\. *Bandit Algorithms*\. Cambridge University Press, 2020\.
- Lewis et al\. \[2020\] Patrick Lewis et al\. Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2020\.
- Li et al\. \[2026\] Gengsheng Li, Jinghan He, Shijie Wang, Dan Zhang, Ruiqi Liu, Renrui Zhang, Zijun Yao, Junfeng Fang, Haiyun Guo, and Jinqiao Wang\. R\-Diverse: Mitigating diversity illusion in self\-play LLM training\. *arXiv preprint arXiv:2602\.13103*, 2026\.
- Mallen et al\. \[2023\] Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi\. When not to trust language models: Investigating effectiveness of parametric and non\-parametric memories\. In *Association for Computational Linguistics \(ACL\)*, 2023\.
- Matiisen et al\. \[2020\] Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman\. Teacher–student curriculum learning\. *IEEE Transactions on Neural Networks and Learning Systems*, 31\(9\):3732–3740, 2020\.
- Peng et al\. \[2026\] Yulin Peng, Xinxin Zhu, Chenxing Wei, Nianbo Zeng, Leilei Wang, Ying Tiffany He, and F\. Richard Yu\. SAGE: Multi\-agent self\-evolution for LLM reasoning\. *arXiv preprint arXiv:2603\.15255*, 2026\.
- Press et al\. \[2023\] Ofir Press et al\. Measuring and narrowing the compositionality gap in language models\. *arXiv preprint arXiv:2210\.03350*, 2023\.
- Schick et al\. \[2023\] Timo Schick, Jane Dwivedi\-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom\. Toolformer: Language models can teach themselves to use tools\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, volume 36, 2023\. arXiv:2302\.04761\.
- Shao et al\. \[2024\] Zhihong Shao et al\. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models\. *arXiv preprint arXiv:2402\.03300*, 2024\.
- Song et al\. \[2025\] Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji\-Rong Wen\. R1\-Searcher: Incentivizing the search capability in LLMs via reinforcement learning\. *arXiv preprint arXiv:2503\.05592*, 2025\.
- Trivedi et al\. \[2023\] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal\. Interleaving retrieval with chain\-of\-thought reasoning for knowledge\-intensive multi\-step questions\. In *Annual Meeting of the Association for Computational Linguistics \(ACL\)*, 2023\.
- Trivedi et al\. \[2022\] Harsh Trivedi et al\. MuSiQue: Multihop questions via single\-hop question composition\. *Transactions of the Association for Computational Linguistics \(TACL\)*, 2022\.
- Wan et al\. \[2026\] Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, Peizhou Huang, and Mi Zhang\. DSDR: Dual\-scale diversity regularization for exploration in LLM reasoning\. *arXiv preprint arXiv:2602\.19895*, 2026\.
- Yang et al\. \[2024\] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, et al\. Qwen2\.5 technical report\. *arXiv preprint arXiv:2412\.15115*, 2024\.
- Yang et al\. \[2018\] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D\. Manning\. HotpotQA: A dataset for diverse, explainable multi\-hop question answering\. In *Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, 2018\.
- Yao et al\. \[2023\] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R\. Narasimhan, and Yuan Cao\. ReAct: Synergizing reasoning and acting in language models\. In *International Conference on Learning Representations \(ICLR\)*, 2023\. arXiv:2210\.03629\.
- Yuan et al\. \[2026\] Zhonghang Yuan, Zhefan Wang, Fang Hu, Zihong Chen, Huanjun Kong, Songyang Zhang, Wanli Ouyang, and Nanqing Dong\. Knowledge\-to\-verification: Unlocking reinforcement learning with verifiable rewards for LLMs in knowledge\-intensive domains\. In *Annual Meeting of the Association for Computational Linguistics \(ACL\)*, 2026\.
- Yue et al\. \[2026\] Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, and Dong Wang\. Dr\. Zero: Self\-evolving search agents without training data\. *arXiv preprint arXiv:2601\.07055*, 2026\.
- Zhao et al\. \[2025\] Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang\. Absolute zero: Reinforced self\-play reasoning with zero data\. *arXiv preprint arXiv:2505\.03335*, 2025\.

## Appendix A Notation

Table [5](https://arxiv.org/html/2605.22905#A1.T5) collects the principal symbols used in the main text\. Throughout the appendix, log\-probabilities are natural logarithms, indicator functions take values in $\{0,1\}$, and KL divergences are non\-negative and can be infinite\.

Table 5: Glossary of the principal symbols used throughout the paper\.

| Symbol | Meaning | First used |
| --- | --- | --- |
| $\mathcal{D}$ | source corpus | Section [2](https://arxiv.org/html/2605.22905#S2) |
| $d\in\mathcal{D}$ | source document | Section [2](https://arxiv.org/html/2605.22905#S2) |
| $q,\,a,\,e$ | question, answer, evidence span \($e$ verbatim from $d$ or a snippet\) | Section [2](https://arxiv.org/html/2605.22905#S2) |
| $h\in\{1,2,3,4\}$ | prescribed hop count of the rollout | Section [3\.1](https://arxiv.org/html/2605.22905#S3.SS1) |
| $\pi_{t}^{\mathrm{pro}},\pi_{t}^{\mathrm{sol}}$ | proposer / solver policies at round $t$ | Section [2](https://arxiv.org/html/2605.22905#S2) |
| $\widetilde{\pi}_{\mathrm{sol},t}$ | solver under the single\-turn search\-disabled prompt | Eq\. \([10](https://arxiv.org/html/2605.22905#S3.E10)\) |
| $\widetilde{\pi}_{\mathrm{aux},t}$ | auxiliary scorer \(in our runs: weight\-swapped solver\) | Eq\. \([10](https://arxiv.org/html/2605.22905#S3.E10)\) |
| $p_{+}^{t}(q,e,a),\,p_{-}^{t}(q,a)$ | with\-evidence and no\-evidence answer accuracies | Eq\. \([10](https://arxiv.org/html/2605.22905#S3.E10)\) |
| $V_{t}(q,e,a)$ | evidence\-quality score $p_{+}^{t}-p_{-}^{t}$ | Eq\. \([11](https://arxiv.org/html/2605.22905#S3.E11)\) |
| $\widehat{V}_{t,m}$ | Monte Carlo estimator of $V_{t}$ with $m$ samples | Eq\. \([14](https://arxiv.org/html/2605.22905#S3.E14)\) |
| $B(e)$ | brevity bonus, $L_{\max}=256$ | Eq\. \([15](https://arxiv.org/html/2605.22905#S3.E15)\) |
| $F_{\mathrm{fmt}}$ | 4\-component proposer format reward | Eq\. \([9](https://arxiv.org/html/2605.22905#S3.E9)\) |
| $R^{\mathrm{DZ}}_{t}(q,a;k)$ | Dr\. Zero difficulty reward, $k\sim\mathrm{Bin}(n,p_{t}^{\mathrm{sol}})$ | Eq\. \([2](https://arxiv.org/html/2605.22905#S2.E2)\) |
| $\phi_{n}(p)$ | population mean of $R^{\mathrm{DZ}}_{t}$ | Eq\. \([4](https://arxiv.org/html/2605.22905#S2.E4)\) |
| $R^{\mathrm{pro}}_{t},\,R^{\mathrm{sol}}$ | proposer \(Phase A\) and solver \(Phase B\) rewards | Eqs\. \([16](https://arxiv.org/html/2605.22905#S3.E16)\), \([19](https://arxiv.org/html/2605.22905#S3.E19)\) |
| $R_{\mathrm{correct}},\,R_{\mathrm{evidence}}$ | solver answer EM and evidence\-F1 sub\-rewards | Eq\. \([19](https://arxiv.org/html/2605.22905#S3.E19)\) |
| $\lambda_{V},\lambda_{B},\lambda_{E}$ | EVE\-Agent reward coefficients | Eqs\. \([16](https://arxiv.org/html/2605.22905#S3.E16)\), \([19](https://arxiv.org/html/2605.22905#S3.E19)\) |
| $z_{d}=\phi(d)\in\mathbb{R}^{D}$ | document embedding \(E5\-base\-v2\) | Section [3\.3](https://arxiv.org/html/2605.22905#S3.SS3) |
| $K_{t}=K_{0}+\lfloor\alpha t\rfloor$ | cluster count schedule | Section [3\.3](https://arxiv.org/html/2605.22905#S3.SS3) |
| $\mathcal{C}_{t}=\{C_{t,k}\}_{k=1}^{K_{t}}$ | active clusters at round $t$ | Section [3\.3](https://arxiv.org/html/2605.22905#S3.SS3) |
| $N_{k},\,\overline{R}_{k}$ | UCB1 pull count and empirical mean for arm $k$ | Appendix [F](https://arxiv.org/html/2605.22905#A6) |
| $\beta,\,\lambda_{u},\,\varepsilon$ | UCB exploration weight, utility weight, log smoothing | Appendix [F](https://arxiv.org/html/2605.22905#A6) |

## Appendix B Theoretical results

This appendix collects two basic properties of the reward design used throughout the paper\. Lemma [B\.1](https://arxiv.org/html/2605.22905#A2.Thmtheorem1) characterizes the inherited difficulty reward as a unimodal function of the solver’s per\-trial success probability, and Proposition [B\.2](https://arxiv.org/html/2605.22905#A2.Thmtheorem2) formalizes the evidence verifier as a marginal answer\-accuracy gain with an unbiased, bounded\-variance Monte Carlo estimator\. Both results are referenced from the main text but were moved here to keep the body focused on the empirical contribution\.

### B\.1 Closed form and saturation of the difficulty reward

###### Lemma B\.1 \(Closed form and saturation of the difficulty reward\)\.

Let $n\geq 2$ be the number of independent solver predictions, and let $p\in[0,1]$ be the probability that a single solver prediction matches the target answer\. If $k\sim\mathrm{Bin}(n,p)$ denotes the number of correct predictions among the $n$ trials, then

$$\mathbb{E}_{k\sim\mathrm{Bin}(n,p)}\left[R^{\mathrm{DZ}}_{t}(q,a;k)\right]=\phi_{n}(p),\tag{23}$$

where

$$\phi_{n}(p)\coloneqq\frac{n}{n-1}(1-p)\bigl(1-(1-p)^{n-1}\bigr).\tag{24}$$

The function $\phi_{n}:[0,1]\to[0,1]$ is continuous and unimodal, with $\phi_{n}(0)=\phi_{n}(1)=0$ and a unique interior maximizer

$$p^{\star}=1-n^{-1/(n-1)}.\tag{25}$$

#### Discussion\.

Lemma [B\.1](https://arxiv.org/html/2605.22905#A2.Thmtheorem1) makes explicit what the difficulty reward in Eq\. \([2](https://arxiv.org/html/2605.22905#S2.E2)\) encourages\. The expected reward vanishes both when the solver almost never answers correctly and when it almost always does, and is largest at an intermediate success probability $p^{\star}$\. This frontier is defined entirely by answer accuracy, however, and does not address whether the generated answer is supported by any source span, which is the motivation for the evidence verifier of Eq\. \([11](https://arxiv.org/html/2605.22905#S3.E11)\)\.

###### Proof of Lemma [B\.1](https://arxiv.org/html/2605.22905#A2.Thmtheorem1)\.

Write $p_{t}^{\mathrm{sol}}(q,a)=p$ and let $k\sim\mathrm{Bin}(n,p)$\. From Eq\. \([2](https://arxiv.org/html/2605.22905#S2.E2)\),

$$\begin{aligned}\mathbb{E}\bigl[R^{\mathrm{DZ}}_{t}\bigr]&=\sum_{k=1}^{n-1}\binom{n}{k}p^{k}(1-p)^{n-k}\,\frac{n-k}{n-1}\\&=\frac{1}{n-1}\left[\sum_{k=0}^{n}\binom{n}{k}p^{k}(1-p)^{n-k}(n-k)-n(1-p)^{n}-0\right],\end{aligned}$$

where the $k=0$ term contributes $n(1-p)^{n}$ and the $k=n$ term contributes $0$\. The full sum is the binomial mean of $n-k$, which equals $n(1-p)$, so the bracket equals

$$n(1-p)-n(1-p)^{n}=n(1-p)\bigl(1-(1-p)^{n-1}\bigr).$$

Dividing by $n-1$ gives $\phi_{n}(p)=\tfrac{n}{n-1}(1-p)\bigl(1-(1-p)^{n-1}\bigr)$\.

Continuity is immediate from the polynomial form\. The boundary values are $\phi_{n}(0)=0$ and $\phi_{n}(1)=0$: the factor $1-(1-p)^{n-1}$ vanishes at $p=0$ and the factor $(1-p)$ vanishes at $p=1$\. Differentiating yields $\phi_{n}^{\prime}(p)=\tfrac{n}{n-1}\bigl[n(1-p)^{n-1}-1\bigr]$, which vanishes exactly at $p^{\star}=1-n^{-1/(n-1)}\in(0,1)$\. Since $\phi_{n}^{\prime}$ is strictly decreasing on $[0,1]$ with $\phi_{n}^{\prime}(0)>0$ and $\phi_{n}^{\prime}(1)<0$, it changes sign exactly once, so this critical point is the unique interior maximum\. Finally, $\phi_{n}(p)\in[0,1]$ for every $p\in[0,1]$ because $R^{\mathrm{DZ}}_{t}(q,a;k)\in[0,1]$ for every realization of $k$, so its expectation also lies in $[0,1]$\. ∎
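The closed form of Lemma B.1 can be checked numerically. The following is a minimal sketch, assuming the difficulty reward takes the form used in line 11 of Algorithm 1, namely `(n - k) / (n - 1)` when `0 < k < n` and zero otherwise; the function names are illustrative and not taken from the authors' code.

```python
from math import comb


def difficulty_reward(k: int, n: int) -> float:
    """Difficulty reward for k correct predictions out of n (zero at k = 0 and k = n)."""
    return (n - k) / (n - 1) if 0 < k < n else 0.0


def expected_reward(p: float, n: int) -> float:
    """Exact expectation of the reward under k ~ Bin(n, p), by direct summation."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) * difficulty_reward(k, n)
        for k in range(n + 1)
    )


def phi(p: float, n: int) -> float:
    """Closed form of Eq. (24)."""
    return n / (n - 1) * (1 - p) * (1 - (1 - p) ** (n - 1))


def p_star(n: int) -> float:
    """Interior maximizer of Eq. (25)."""
    return 1 - n ** (-1 / (n - 1))


# The direct sum and the closed form should agree up to floating-point error.
for p in (0.0, 0.1, 0.33, 0.5, 0.9, 1.0):
    assert abs(expected_reward(p, 5) - phi(p, 5)) < 1e-12
```

For `n = 2` the maximizer is `p_star(2) == 0.5`, which matches the direct calculation from `phi(p, 2) = 2 * p * (1 - p)`.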

### B\.2 Evidence verifier as marginal answer\-accuracy gain

###### Proposition B\.2 \(Evidence verifier as marginal answer\-accuracy gain\)\.

For any valid triple $(q,a,e)$ and any training round $t$, the evidence verifier in Eq\. \([11](https://arxiv.org/html/2605.22905#S3.E11)\) satisfies

$$V_{t}(q,e,a)=\mathbb{E}_{\hat{a}\sim\widetilde{\pi}_{\mathrm{sol},t}(\cdot\mid q,e)}\left[\mathbf{1}\{\hat{a}=a\}\right]-\mathbb{E}_{\hat{a}\sim\widetilde{\pi}_{\mathrm{aux},t}(\cdot\mid q)}\left[\mathbf{1}\{\hat{a}=a\}\right],\tag{26}$$

and therefore $V_{t}(q,e,a)\in[-1,1]$\. When the auxiliary scorer shares its weights with the solver, the two distributions differ only in whether the evidence span $e$ is provided; in that case $V_{t}(q,e,a)$ is exactly the increase in the solver’s answer probability induced by conditioning on $e$, and $V_{t}(q,e,a)=0$ whenever $\widetilde{\pi}_{\mathrm{sol},t}(\cdot\mid q,e)=\widetilde{\pi}_{\mathrm{sol},t}(\cdot\mid q)$ for every $e$\. Moreover, the Monte Carlo estimator $\widehat{V}_{t,m}$ in Eq\. \([14](https://arxiv.org/html/2605.22905#S3.E14)\) is unbiased,

$$\mathbb{E}\left[\widehat{V}_{t,m}(q,e,a)\mid q,e,a\right]=V_{t}(q,e,a),\tag{27}$$

and its conditional variance satisfies

$$\mathrm{Var}\left[\widehat{V}_{t,m}(q,e,a)\mid q,e,a\right]\leq\frac{1}{2m}.\tag{28}$$

#### Discussion\.

Proposition [B\.2](https://arxiv.org/html/2605.22905#A2.Thmtheorem2) clarifies the interpretation of the verifier introduced in Section [3\.1](https://arxiv.org/html/2605.22905#S3.SS1)\. A positive $V_{t}(q,e,a)$ means that the proposed evidence span makes the target answer more likely for the current solver, a value near zero means that the solver could already infer the answer from the question alone, and a negative value indicates a misleading or irrelevant span\. The verifier thus turns evidence grounding into a reward\-level quantity that can be optimized without oracle labels\. We do not claim that this reward induces a non\-vanishing policy gradient for every proposer parameterization; such a claim would require additional assumptions on the policy class and the optimization dynamics\.

###### Proof of Proposition [B\.2](https://arxiv.org/html/2605.22905#A2.Thmtheorem2)\.

The identity \([26](https://arxiv.org/html/2605.22905#A2.E26)\) is immediate from the definitions of $V_{t}$ and the two pointwise accuracies in Eqs\. \([10](https://arxiv.org/html/2605.22905#S3.E10)\)–\([11](https://arxiv.org/html/2605.22905#S3.E11)\):

$$V_{t}(q,e,a)=\widetilde{\pi}_{\mathrm{sol},t}(a\mid q,e)-\widetilde{\pi}_{\mathrm{aux},t}(a\mid q),$$

which is the difference between two probabilities and hence lies in $[-1,1]$\. In the weight\-swap configuration $\widetilde{\pi}_{\mathrm{aux},t}\equiv\widetilde{\pi}_{\mathrm{sol},t}$, so $V_{t}(q,e,a)=\widetilde{\pi}_{\mathrm{sol},t}(a\mid q,e)-\widetilde{\pi}_{\mathrm{sol},t}(a\mid q)$\. If $\widetilde{\pi}_{\mathrm{sol},t}(\cdot\mid q,e)=\widetilde{\pi}_{\mathrm{sol},t}(\cdot\mid q)$ for every $e$, then $V_{t}(q,e,a)=0$ identically\.

For the Monte Carlo estimator, each indicator $\mathbf{1}\{\hat{a}_{+}^{(j)}=a\}$ is a Bernoulli random variable with mean $\widetilde{\pi}_{\mathrm{sol},t}(a\mid q,e)$, and the analogous statement holds for $\mathbf{1}\{\hat{a}_{-}^{(j)}=a\}$\. By linearity of expectation, $\mathbb{E}[\widehat{V}_{t,m}\mid q,e,a]=V_{t}(q,e,a)$, establishing \([27](https://arxiv.org/html/2605.22905#A2.E27)\)\. The variance of a Bernoulli random variable with parameter $p$ is $p(1-p)\leq 1/4$, so each of the two empirical means in Eq\. \([14](https://arxiv.org/html/2605.22905#S3.E14)\) has conditional variance at most $1/(4m)$\. Because the two empirical means are computed from independent samples by construction of the rollout schedule,

$$\mathrm{Var}[\widehat{V}_{t,m}\mid q,e,a]\leq\frac{1}{4m}+\frac{1}{4m}=\frac{1}{2m},$$

which proves \([28](https://arxiv.org/html/2605.22905#A2.E28)\)\. ∎
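The unbiasedness and variance bound of Proposition B.2 can be illustrated with a small simulation. This is a sketch under a simplifying assumption: the solver is replaced by two Bernoulli sources with known success probabilities `p_plus` (with evidence) and `p_minus` (without evidence), which stand in for the two single-turn rollouts. All function names are hypothetical.

```python
import random


def mc_evidence_score(p_plus: float, p_minus: float, m: int, rng: random.Random) -> float:
    """One draw of the estimator: difference of two independent empirical accuracies."""
    hits_plus = sum(rng.random() < p_plus for _ in range(m))    # with-evidence samples
    hits_minus = sum(rng.random() < p_minus for _ in range(m))  # no-evidence samples
    return hits_plus / m - hits_minus / m


def empirical_mean_and_variance(p_plus, p_minus, m, trials=20000, seed=0):
    """Mean and sample variance of the estimator over many independent draws."""
    rng = random.Random(seed)
    draws = [mc_evidence_score(p_plus, p_minus, m, rng) for _ in range(trials)]
    mean = sum(draws) / trials
    var = sum((x - mean) ** 2 for x in draws) / (trials - 1)
    return mean, var


mean, var = empirical_mean_and_variance(p_plus=0.7, p_minus=0.2, m=5)
print(f"mean={mean:.3f} (target 0.5), variance={var:.3f} (bound {1 / (2 * 5):.3f})")
```

With `m = 5` the exact variance in this toy setting is `(0.7 * 0.3 + 0.2 * 0.8) / 5 = 0.074`, below the worst-case bound of `1 / (2m) = 0.1`, which is attained only when both probabilities equal one half.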

## Appendix C Additional implementation details

This appendix expands on the experimental setup of Section [4\.1](https://arxiv.org/html/2605.22905#S4.SS1)\. The goal is to provide enough detail to reproduce all numbers in Tables [1](https://arxiv.org/html/2605.22905#S4.T1)–[4](https://arxiv.org/html/2605.22905#S4.T4); nothing in this appendix changes any reward equation in the main text\.

#### Backbone and tokenization\.

All four compared systems use the Qwen2\.5\-3B\-Instruct backbone \[Yang et al\., [2024](https://arxiv.org/html/2605.22905#bib.bib27)\], the same tokenizer, and the same chat template\. The auxiliary scorer in Eq\. \([11](https://arxiv.org/html/2605.22905#S3.E11)\) reuses the current solver weights; we refer to this implementation as the *weight\-swap* configuration\. The framework also supports a separately hosted frozen auxiliary scorer, but we do not use that configuration in any reported experiment\.

#### Retrieval pipeline\.

The retrieval corpus is the FlashRAG Wikipedia\-2018 snapshot \($\approx 21$M passages\)\. Passages are encoded with E5\-base\-v2 using mean pooling and unit\-norm $L^{2}$ normalization, and indexed with FAISS\-IVF over $4{,}096$ centroids with $\mathrm{nprobe}=64$\. The search tool returns the top\-$3$ passages per query\. Three CPU retrieval servers run in parallel to amortize tool latency\.

#### Rollout protocol\.

A multi\-turn proposer or solver rollout permits at most five assistant turns and at most $512$ tokens per tool response\. The single\-turn protocol used by the evidence verifier of Section [3\.1](https://arxiv.org/html/2605.22905#S3.SS1) disables the search tool and limits the assistant to one turn; this is the configuration in which $p_{+}^{t}$ and $p_{-}^{t}$ of Eq\. \([10](https://arxiv.org/html/2605.22905#S3.E10)\) are computed\.

#### Data preparation\.

Proposer training prompts are drawn from the FlashRAG NQ–HotpotQA mixture, rebound to hop counts $h\in\{1,2,3,4\}$ in the ratio $4{:}3{:}2{:}1$\. To construct the Phase B training set we roll out the frozen Phase A proposer on the same corpus with five samples per prompt and retain only valid triples; the proposer\-emitted span $e$ becomes the gold evidence for the corresponding question–answer pair\.

#### Optimization\.

Phase A runs $50$ HRPO steps with global batch size $256$, micro\-batch $2$ per GPU, learning rate $1\mathrm{e}{-6}$, warmup ratio $0.03$, gradient clipping at $0.1$, the KL regularizer disabled, no nested grouping, and a verifier Monte Carlo budget $m=5$\. Phase B runs $50$ GRPO steps with global batch size $256$, micro\-batch $8$ per GPU, group size $5$, and the same optimizer settings as Phase A; the evidence\-F1 coefficient is $\lambda_{E}=0.3$\. Both phases use a single $8\times$B200 node\.
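For reference, the reported optimization settings can be collected in one place. The sketch below is only a convenience container: the class, field, and method names are hypothetical, and the derived accumulation count assumes plain data parallelism over the 8 GPUs with the global batch reached by gradient accumulation, which the text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    """Hyperparameters reported for one training phase (field names are illustrative)."""
    algorithm: str
    steps: int
    global_batch_size: int
    micro_batch_per_gpu: int
    learning_rate: float = 1e-6
    warmup_ratio: float = 0.03
    grad_clip: float = 0.1
    num_gpus: int = 8

    def grad_accumulation_steps(self) -> int:
        # Assumption: global batch = micro-batch x GPUs x accumulation steps.
        per_step = self.micro_batch_per_gpu * self.num_gpus
        if self.global_batch_size % per_step != 0:
            raise ValueError("global batch must be a multiple of micro_batch * num_gpus")
        return self.global_batch_size // per_step


PHASE_A = PhaseConfig("HRPO", steps=50, global_batch_size=256, micro_batch_per_gpu=2)
PHASE_B = PhaseConfig("GRPO", steps=50, global_batch_size=256, micro_batch_per_gpu=8)

VERIFIER_MC_BUDGET = 5   # m, Phase A only
GRPO_GROUP_SIZE = 5      # Phase B only
LAMBDA_E = 0.3           # evidence-F1 coefficient, Phase B only
```

Under that data-parallel assumption, Phase A would accumulate over 16 micro-steps and Phase B over 4.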

#### Inference for the evaluation tables\.

For Tables [1](https://arxiv.org/html/2605.22905#S4.T1)–[4](https://arxiv.org/html/2605.22905#S4.T4), each system is decoded greedily on each evaluation instance with the search tool enabled\. The external judge for the evidence score is GPT\-4\.1, queried with the question, the gold answer, and the model\-emitted span; the judge returns a binary decision over whether the question paired with the span is sufficient to recover the gold answer\. The joint metric counts an instance as correct only when both the answer is exact\-match equal to the gold answer \(after normalization\) and the judge labels the evidence as supporting\.
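The joint metric described above can be sketched as follows. The normalization shown is the common SQuAD-style recipe (lowercasing, stripping punctuation and articles, collapsing whitespace); this is an assumption, since the exact normalization is not spelled out here. The judge decision is passed in as a boolean, and no judge API call is modeled.

```python
import re
import string


def normalize(text: str) -> str:
    """Assumed SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> bool:
    """Exact match after normalization."""
    return normalize(pred) == normalize(gold)


def joint_correct(pred_answer: str, gold_answer: str, judge_supports: bool) -> bool:
    """An instance counts only if the answer matches AND the judge accepts the span."""
    return exact_match(pred_answer, gold_answer) and judge_supports


def joint_accuracy(rows) -> float:
    """rows: iterable of (pred_answer, gold_answer, judge_supports) tuples."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(joint_correct(*row) for row in rows) / len(rows)
```

A correct answer paired with an unsupported span therefore scores zero on the joint metric, which is what separates it from plain exact match.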

## Appendix D Format reward decomposition

The four sub\-rewards of Eq\. \([9](https://arxiv.org/html/2605.22905#S3.E9)\) are defined as follows\. Given a parsed proposer output with $h$ prescribed hops, $T_{\mathrm{ass}}$ assistant turns, $T_{\mathrm{think}}$ assistant turns that begin with a planning step, and $T_{\mathrm{tc}}$ syntactically valid tool\-call emissions whose number matches the number of returned tool responses,

$$\begin{aligned}F_{\mathrm{think}}&=\frac{T_{\mathrm{think}}}{\max(1,T_{\mathrm{ass}})},\\F_{\mathrm{tool}}(h)&=\begin{cases}1&h=1,\\\min\!\bigl(\tfrac{1+T_{\mathrm{tc}}}{h},\,1\bigr)&h>1,\,T_{\mathrm{tc}}=\text{\#returned responses},\\0&h>1,\,\text{otherwise},\end{cases}\end{aligned}$$

and $F_{\mathrm{ans}}\in\{0,\tfrac{1}{2},1\}$ rewards a short, in\-context answer: writing $\tilde{a}\coloneqq\textsc{Norm}(a)$ and letting $\Sigma$ denote the concatenation of the source document with the returned tool responses,

$$F_{\mathrm{ans}}=\begin{cases}1&\tilde{a}\in\{\text{yes},\text{no}\}\text{ or }\bigl(\tilde{a}\subseteq\textsc{Norm}(\Sigma)\text{ and }|\tilde{a}|_{\text{words}}\leq 5\bigr),\\\tfrac{1}{2}&\tilde{a}\subseteq\textsc{Norm}(\Sigma)\text{ and }5<|\tilde{a}|_{\text{words}}\leq 10,\\0&\text{otherwise.}\end{cases}$$

Rollouts with empty $q$ or empty $a$ receive $F_{\mathrm{fmt}}=0$ and are filtered out before the verifier\.
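The three sub-rewards defined above translate directly into code. This is a minimal sketch with two assumptions: the inputs to `f_ans` are already normalized strings, and the containment relation is read as substring containment. Function and argument names are illustrative only.

```python
def f_think(t_think: int, t_ass: int) -> float:
    """Fraction of assistant turns that begin with a planning step."""
    return t_think / max(1, t_ass)


def f_tool(h: int, t_tc: int, n_responses: int) -> float:
    """Tool-use reward: trivially 1 for single-hop, otherwise scaled by hop count."""
    if h == 1:
        return 1.0
    if t_tc == n_responses:  # every valid tool call received a response
        return min((1 + t_tc) / h, 1.0)
    return 0.0


def f_ans(answer_norm: str, context_norm: str) -> float:
    """Short, in-context answer reward in {0, 0.5, 1}."""
    if answer_norm in {"yes", "no"}:
        return 1.0
    n_words = len(answer_norm.split())
    # Guard against the empty string, which is a substring of everything.
    in_context = bool(answer_norm) and answer_norm in context_norm
    if in_context and n_words <= 5:
        return 1.0
    if in_context and 5 < n_words <= 10:
        return 0.5
    return 0.0
```

For example, a 3-hop rollout with two matched tool calls gets `f_tool(3, 2, 2) == 1.0`, while a 4-hop rollout with a single matched call gets `f_tool(4, 1, 1) == 0.5`.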

## Appendix E Pseudocode

Algorithm [1](https://arxiv.org/html/2605.22905#alg1) describes one Phase A HRPO iteration with the evidence verifier; Algorithm [2](https://arxiv.org/html/2605.22905#alg2) describes one Phase B GRPO iteration with the solver reward of Eq\. \([19](https://arxiv.org/html/2605.22905#S3.E19)\)\. In Algorithm [1](https://arxiv.org/html/2605.22905#alg1), the auxiliary scorer shares its weights with the current solver and produces both the with\-evidence and the no\-evidence single\-turn rollouts\.

**Algorithm 1** One Phase A iteration of EVE\-Agent \(proposer\)

1. **Input:** proposer $\pi^{\mathrm{pro}}_{t}$, solver $\pi^{\mathrm{sol}}$ \(frozen\), corpus $\mathcal{D}$, hop pmf, coefficients $(\lambda_{V},\lambda_{B})$, MC budget $m$, group size $n$
2. Sample $\{d_{i}\}_{i=1}^{N}\overset{\text{iid}}{\sim}\mathrm{Unif}(\mathcal{D})$ and $h_{i}\sim\text{hop pmf}$
3. **for** $i=1,\dots,N$ **do**
4. $(q_{i},a_{i},e_{i})\sim\pi^{\mathrm{pro}}_{t}(\cdot\mid d_{i},h_{i},\mathcal{R})$
5. $F_{\mathrm{fmt},i}\leftarrow\textsc{ComputeFormatScore}(q_{i},a_{i},e_{i},d_{i},h_{i})$
6. Mark $i$ *invalid* if parse fails or $F_{\mathrm{fmt},i}=0$
7. **end for**
8. **for each** *valid* $i$ **do**
9. Sample $\{\hat{a}_{j}^{i}\}_{j=1}^{n}\sim\pi^{\mathrm{sol}}(\cdot\mid q_{i},\mathcal{R})$ $\triangleright$ multi\-turn, with search
10. $k_{i}\leftarrow\sum_{j=1}^{n}\mathbf{1}\{\hat{a}_{j}^{i}=a_{i}\}$
11. $R^{\mathrm{DZ}}_{i}\leftarrow\mathbf{1}\{0<k_{i}<n\}\,(n-k_{i})/(n-1)$
12. Sample $\{\hat{a}_{+}^{(j),i}\}_{j=1}^{m}\sim\widetilde{\pi}^{\mathrm{sol}}(\cdot\mid q_{i},e_{i})$ $\triangleright$ single\-turn, no search
13. Sample $\{\hat{a}_{-}^{(j),i}\}_{j=1}^{m}\sim\widetilde{\pi}^{\mathrm{aux}}(\cdot\mid q_{i})$ $\triangleright$ shares weights with the solver
14. $\widehat{V}_{i}\leftarrow\tfrac{1}{m}\sum_{j}\mathbf{1}\{\hat{a}_{+}^{(j),i}=a_{i}\}-\tfrac{1}{m}\sum_{j}\mathbf{1}\{\hat{a}_{-}^{(j),i}=a_{i}\}$
15. $B_{i}\leftarrow\max(0,\,1-|e_{i}|/L_{\max})$
16. $R^{\mathrm{pro}}_{i}\leftarrow\tfrac{1}{2}F_{\mathrm{fmt},i}+R^{\mathrm{DZ}}_{i}+\lambda_{V}\widehat{V}_{i}+\lambda_{B}B_{i}$
17. **end for**
18. **for** *invalid* $i$ **do**
19. $R^{\mathrm{pro}}_{i}\leftarrow\tfrac{1}{2}F_{\mathrm{fmt},i}$
20. **end for**
21. Compute hop\-grouped advantages $A_{i,h}$ by Eq\. \([5](https://arxiv.org/html/2605.22905#S2.E5)\)
22. Update $\pi^{\mathrm{pro}}$ with one HRPO step on $\{(q_{i},a_{i},e_{i}),A_{i,h}\}$
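The reward computation in lines 10 to 19 of Algorithm 1 can be sketched as below. Rollouts are represented only by their correctness indicators, so no model is called. The unit of the evidence length (tokens versus characters) and the values passed for `lambda_v` and `lambda_b` are assumptions for illustration; the function names are hypothetical.

```python
from typing import Sequence

L_MAX = 256  # brevity cap from the notation table


def difficulty_reward(k: int, n: int) -> float:
    """Line 11: nonzero only when the solver is sometimes, but not always, correct."""
    return (n - k) / (n - 1) if 0 < k < n else 0.0


def evidence_score(hits_plus: Sequence[bool], hits_minus: Sequence[bool]) -> float:
    """Line 14: with-evidence accuracy minus no-evidence accuracy."""
    return sum(hits_plus) / len(hits_plus) - sum(hits_minus) / len(hits_minus)


def brevity_bonus(evidence_len: int, l_max: int = L_MAX) -> float:
    """Line 15: linear bonus for short spans, clipped at zero."""
    return max(0.0, 1.0 - evidence_len / l_max)


def proposer_reward(f_fmt, valid, k, n, hits_plus, hits_minus, evidence_len,
                    lambda_v, lambda_b):
    """Lines 16 and 19: invalid rollouts keep only the format term."""
    if not valid:
        return 0.5 * f_fmt
    return (0.5 * f_fmt
            + difficulty_reward(k, n)
            + lambda_v * evidence_score(hits_plus, hits_minus)
            + lambda_b * brevity_bonus(evidence_len))
```

As a worked case with illustrative coefficients `lambda_v = 1.0` and `lambda_b = 0.1`: a valid rollout with `f_fmt = 1`, `k = 2` of `n = 5`, four of five with-evidence hits, one of five no-evidence hits, and a span of length 64 scores `0.5 + 0.75 + 0.6 + 0.075 = 1.925`.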

**Algorithm 2** One Phase B iteration of EVE\-Agent \(solver\)

1. **Input:** Phase A proposer checkpoint \(frozen\), solver $\pi^{\mathrm{sol}}_{t}$, training shard $\{(q_{i},a_{i},e_{i})\}_{i=1}^{N}$ generated by the Phase A proposer, GRPO group size $n$, coefficient $\lambda_{E}$
2. **for** $i=1,\dots,N$ **do**
3. Sample $\{(\hat{a}_{j}^{i},\hat{e}_{j}^{i})\}_{j=1}^{n}\sim\pi^{\mathrm{sol}}_{t}(\cdot\mid q_{i},\mathcal{R})$
4. **for** $j=1,\dots,n$ **do**
5. $R_{\mathrm{correct},j}^{i}\leftarrow\mathbf{1}\{\textsc{EM}(\hat{a}_{j}^{i},a_{i})\}$
6. $R_{\mathrm{evidence},j}^{i}\leftarrow\mathrm{F1}_{\mathrm{tok}}(\textsc{Norm}(\hat{e}_{j}^{i}),\textsc{Norm}(e_{i}))$
7. $R^{\mathrm{sol},i}_{j}\leftarrow R_{\mathrm{correct},j}^{i}+\lambda_{E}R_{\mathrm{evidence},j}^{i}$
8. **end for**
9. **end for**
10. Update $\pi^{\mathrm{sol}}$ with one GRPO step of Eq\. \([7](https://arxiv.org/html/2605.22905#S2.E7)\) on $\{R^{\mathrm{sol},i}_{j}\}$
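Lines 5 to 7 of Algorithm 2 combine exact match with a token-level F1 on the evidence span. The sketch below uses a placeholder normalization (lowercasing and whitespace collapsing) and whitespace tokenization, both of which are assumptions; the coefficient default of 0.3 is the value reported for the evidence-F1 coefficient in Appendix C.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Placeholder normalization (an assumption): lowercase, collapse whitespace."""
    return " ".join(text.lower().split())


def token_f1(pred: str, gold: str) -> float:
    """Bag-of-tokens F1 between two whitespace-tokenized strings."""
    pred_toks, gold_toks = pred.split(), gold.split()
    if not pred_toks or not gold_toks:
        return 0.0
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def solver_reward(pred_answer: str, pred_evidence: str,
                  gold_answer: str, gold_evidence: str,
                  lambda_e: float = 0.3) -> float:
    """Answer exact match plus a weighted evidence-F1 term."""
    r_correct = float(normalize(pred_answer) == normalize(gold_answer))
    r_evidence = token_f1(normalize(pred_evidence), normalize(gold_evidence))
    return r_correct + lambda_e * r_evidence
```

A correct answer whose emitted span covers four of the six gold evidence tokens, with no extra tokens, has F1 of 0.8 and a total reward of `1 + 0.3 * 0.8 = 1.24`.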

## Appendix F Selector: full specification

This appendix gives the complete specification of the cluster\-bandit selector introduced in Section [3\.3](https://arxiv.org/html/2605.22905#S3.SS3)\.

#### Document embeddings\.

A frozen sentence encoder $\phi$ \(E5\-base\-v2 with mean pooling and $L^{2}$ normalization\) maps each $d\in\mathcal{D}$ to $z_{d}=\phi(d)\in\mathbb{R}^{D}$, $D=768$\. Embeddings are precomputed once per corpus\.

#### Adaptive clustering\.

At round $t$, the cluster set $\mathcal{C}_{t}=\{C_{t,k}\}_{k=1}^{K_{t}}$ is obtained by mini\-batch $k$\-means on $\{z_{d}\}_{d\in\mathcal{D}}$ with a granularity schedule

$$K_{t}=K_{0}+\lfloor\alpha\,t\rfloor,\qquad K_{0}=10,\ \alpha\geq 0.\tag{29}$$

With $\alpha=0$ the clustering is fixed throughout training; with $\alpha>0$, whenever $K_{t}>K_{t-1}$ we re\-cluster the corpus and *inherit* the bandit statistics across the split\. Concretely, each new centroid is assigned to its nearest old centroid by Euclidean distance, and if old cluster $j$, with pull count $N_{j}$ and reward sum $S_{j}$, has $n_{\mathrm{children}}$ new children, each child starts with $\max(\lfloor N_{j}/n_{\mathrm{children}}\rfloor,1)$ pulls and $S_{j}/n_{\mathrm{children}}$ reward\. Parent statistics are split across children rather than discarded; in particular, we do not reset arm statistics on split\.
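The inheritance rule on a cluster split can be sketched in a few lines. The example uses plain Python tuples for centroids rather than real embeddings; the function name and data layout are hypothetical.

```python
import math
from collections import Counter


def inherit_bandit_stats(old_centroids, new_centroids, old_counts, old_sums):
    """Split each old arm's (pull count, reward sum) across its new child clusters.

    Returns (parents, child_counts, child_sums), one entry per new centroid.
    """
    # Assign every new centroid to its nearest old centroid (Euclidean distance).
    parents = [
        min(range(len(old_centroids)), key=lambda j: math.dist(c, old_centroids[j]))
        for c in new_centroids
    ]
    n_children = Counter(parents)
    # Each child gets an equal share of the parent's pulls (at least 1) and reward.
    child_counts = [max(old_counts[j] // n_children[j], 1) for j in parents]
    child_sums = [old_sums[j] / n_children[j] for j in parents]
    return parents, child_counts, child_sums


parents, counts, sums = inherit_bandit_stats(
    old_centroids=[(0.0, 0.0), (10.0, 0.0)],
    new_centroids=[(0.0, 1.0), (0.0, -1.0), (10.0, 0.0)],
    old_counts=[5, 1],
    old_sums=[2.0, 0.5],
)
```

Here the first old cluster (5 pulls, reward sum 2.0) splits into two children with 2 pulls and reward 1.0 each, while the second old cluster keeps its single child unchanged.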

#### UCB1 bandits\.

Two independent UCB1 bandits run side by side; both initialize every arm with one virtual pull and zero virtual reward so the exploration bonus is finite from round $1$\. For arm $k$ with $N_{k}$ pulls, reward sum $S_{k}$, and total pulls $N_{\mathrm{tot}}=\sum_{j}N_{j}$, the UCB score is

$$U_{k}=\frac{S_{k}}{\max(N_{k},1)}+\beta\sqrt{\frac{\log\max(N_{\mathrm{tot}},1)}{\max(N_{k},1)}},\qquad\beta>0,\tag{30}$$

with $\beta=1$ throughout\. When a single arm is required, we use the deterministic $\operatorname*{arg\,max}_{k}U_{k}$; when a batch of $n>1$ arms is required, the batch is drawn from a softmax over $U_{k}$ rescaled by the empirical standard deviation of the $U$\-values, which keeps the batch diverse without sacrificing the exploration bias of UCB1\.
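A minimal sketch of the arm-selection rule follows. The UCB score implements Eq. (30) directly. For the batched case, two details are assumptions: the standard deviation is used as a softmax temperature, and arms are drawn with replacement. Function names are illustrative.

```python
import math
import random
import statistics


def ucb_scores(sums, counts, beta=1.0):
    """UCB score of Eq. (30) for every arm."""
    n_tot = max(sum(counts), 1)
    return [
        s / max(n, 1) + beta * math.sqrt(math.log(n_tot) / max(n, 1))
        for s, n in zip(sums, counts)
    ]


def select_arms(sums, counts, batch=1, beta=1.0, rng=None):
    """Deterministic argmax for one arm; std-rescaled softmax sampling for a batch."""
    scores = ucb_scores(sums, counts, beta)
    if batch == 1:
        return [max(range(len(scores)), key=scores.__getitem__)]
    rng = rng or random.Random()
    scale = statistics.pstdev(scores) or 1.0  # fall back to 1 if all scores are equal
    top = max(scores)                         # subtract the max for numerical stability
    weights = [math.exp((u - top) / scale) for u in scores]
    return rng.choices(range(len(scores)), weights=weights, k=batch)


# Every arm starts with one virtual pull and zero virtual reward.
initial_counts, initial_sums = [1] * 4, [0.0] * 4
print(select_arms(initial_sums, initial_counts, batch=3, rng=random.Random(0)))
```

With identical initial statistics all scores tie, so the batched draw is uniform over arms; differences emerge only after real rewards are recorded.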

#### Cluster and task arm sets\.

The cluster bandit has $K_{t}$ arms; arm $k$ corresponds to the cluster $C_{t,k}$\. The task\-type bandit has five arms with labels `factual`, `comparison`, `causal`, `temporal`, and `aggregation`\. The hop count is already controlled independently through the hop pmf and is therefore not a task type\.

#### Within\-cluster sampling\.

Let $u_{d}^{t}$ be the number of times document $d$ has been sampled as of round $t$ and let $\delta_{d,k}=\|z_{d}-\mu_{k}\|_{2}$ be its Euclidean distance to the cluster centroid $\mu_{k}$\. Inside the chosen cluster $C_{t,k_{t}}$, a document is drawn with probability

$$\mathbb{P}\bigl[d\mid k_{t}\bigr]\propto\delta_{d,k_{t}}\cdot\frac{1}{1+u_{d}^{t}},\qquad d\in C_{t,k_{t}}.\tag{31}$$

The first factor biases sampling toward the cluster boundary, where atypical entities concentrate; the second acts as soft sampling without replacement, discouraging revisits to heavily used documents inside the same cluster\. If every score collapses to zero \(e\.g\. for an empty cluster\), the policy falls back to a uniform draw\.
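The within-cluster distribution of Eq. (31) can be sketched as below, using small tuples in place of 768-dimensional embeddings. The uniform fallback is applied when all scores are zero; how an entirely empty cluster is handled is not specified in the text, so this sketch simply returns an empty list in that case. Names are hypothetical.

```python
import math
import random


def within_cluster_probs(embeddings, centroid, usage):
    """Sampling probabilities proportional to distance-to-centroid / (1 + usage)."""
    scores = [math.dist(z, centroid) / (1 + u) for z, u in zip(embeddings, usage)]
    total = sum(scores)
    if not scores:
        return []  # empty cluster: nothing to sample (handling is an assumption)
    if total <= 0:
        return [1 / len(scores)] * len(scores)  # uniform fallback
    return [s / total for s in scores]


def sample_document(embeddings, centroid, usage, rng):
    """Draw one document index from the cluster according to Eq. (31)."""
    probs = within_cluster_probs(embeddings, centroid, usage)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = within_cluster_probs(
    embeddings=[(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)],
    centroid=(0.0, 0.0),
    usage=[0, 0, 1],
)
```

In this toy cluster the document at the centroid has probability zero, and the farthest document (distance 10, used once) ties with the closer unused one (distance 5), each at probability 0.5.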

#### Between-iteration reward feedback.

The selector is updated between self-evolution iterations rather than within a single policy-gradient step. After each iteration, we read (i) the generated solver-training samples, annotated with the selector's chosen cluster id and task-type id per row, and (ii) the run-level summary score $R_{\mathrm{run}}\in[0,1]$ of the just-finished iteration (the average of the latest proposer and solver mean rewards). The per-sample reward is the product of $R_{\mathrm{run}}$ and a bounded quality proxy $Q_i\in[0,1]$ defined for sample $i$ as

$$Q_i=q_{\mathrm{len}}(|q_i|)\cdot q_{\mathrm{ans}}(|a_i|)\cdot\frac{1}{\sqrt{\max(D_i,1)}},\tag{32}$$

where $q_{\mathrm{len}}(\ell)=1$ if $20\leq\ell\leq 220$ and $0.65$ otherwise, $q_{\mathrm{ans}}(\ell)=1$ if $1\leq\ell\leq 80$ and $0.55$ otherwise, and $D_i\geq 1$ is the number of duplicates of the prompt string in the batch. Let $r_i\coloneqq R_{\mathrm{run}}Q_i$ be the resulting per-sample reward. Conditional on the cluster id $k(i)$, the selector reward fed back to the bandit is
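A minimal sketch of the quality proxy in Eq. (32) and the per-sample reward $r_i$ follows. Several details are assumptions, since the text leaves them open: lengths are measured in characters (the unit of $|q_i|$ and $|a_i|$ is not stated), the "prompt string" is taken to be the question, and $D_i$ counts the sample itself, which is consistent with $D_i\geq 1$. All function names are hypothetical.

```python
import math
from collections import Counter


def q_len(length):
    """Question-length factor: 1 inside [20, 220], 0.65 otherwise."""
    return 1.0 if 20 <= length <= 220 else 0.65


def q_ans(length):
    """Answer-length factor: 1 inside [1, 80], 0.55 otherwise."""
    return 1.0 if 1 <= length <= 80 else 0.55


def quality_proxies(questions, answers):
    """Eq. (32) for every sample in a batch."""
    # Assumption: D_i is the number of occurrences of the question string,
    # including sample i itself.
    dup = Counter(questions)
    return [
        q_len(len(q)) * q_ans(len(a)) / math.sqrt(max(dup[q], 1))
        for q, a in zip(questions, answers)
    ]


def per_sample_rewards(r_run, questions, answers):
    """r_i = R_run * Q_i, with R_run the run-level summary score in [0, 1]."""
    return [r_run * q for q in quality_proxies(questions, answers)]
```

Under these assumptions, a question that appears twice in the batch has its proxy scaled by $1/\sqrt{2}$, while a too-short question paired with an empty answer receives $0.65\times 0.55=0.3575$.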

$$R^{\mathrm{sel}}_i=\underbrace{-\log\bigl(N_{k(i)}/N_{\mathrm{tot}}+\varepsilon\bigr)}_{=:R_{\mathrm{div}}(k(i))}+\lambda_u\,\underbrace{\bigl(r_i-\overline{r}_{k(i)}^{\mathrm{prev}}\bigr)}_{=:R_{\mathrm{util}}(i)},\tag{33}$$

where $N_k,N_{\mathrm{tot}}$ are the cluster bandit's pre-update pull counts and $\overline{r}_k^{\mathrm{prev}}$ is its mean reward *before* the current update. We use $(\lambda_u,\varepsilon)=(0.5,10^{-8})$. The same reward formula is applied to the task-type bandit by replacing the cluster id $k(i)$ with the task-type id $\tau(i)$.
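The selector reward of Eq. (33) and one possible between-iteration update can be sketched as below. The helper names are hypothetical, and two points are assumptions rather than statements from the text: each sample counts as one pull of its arm, and the bandit's previous mean reward is taken to be $S_k/\max(N_k,1)$. All rewards are computed from a snapshot of the pre-update statistics before any counter is changed, matching the "pre-update" wording above.

```python
import math


def selector_reward(r_i, n_k, n_tot, prev_mean_k, lambda_u=0.5, eps=1e-8):
    """Eq. (33): diversity bonus plus weighted utility gain over the old mean."""
    r_div = -math.log(n_k / n_tot + eps)  # larger for rarely pulled arms
    r_util = r_i - prev_mean_k            # improvement over the previous mean
    return r_div + lambda_u * r_util


def update_bandit(pulls, reward_sums, arm_ids, rewards):
    """Feed one iteration's samples back into a bandit, in place."""
    # Snapshot pre-update statistics so every sample sees the same counts.
    # n_tot > 0 holds because every arm starts with one virtual pull.
    n_tot = sum(pulls)
    prev_means = [s / max(n, 1) for s, n in zip(reward_sums, pulls)]
    sel = [
        selector_reward(r, pulls[k], n_tot, prev_means[k])
        for k, r in zip(arm_ids, rewards)
    ]
    # Assumption: each sample counts as one pull of its arm.
    for k, r_sel in zip(arm_ids, sel):
        pulls[k] += 1
        reward_sums[k] += r_sel
    return sel
```

The same routine serves the task-type bandit if `arm_ids` holds the task-type ids $\tau(i)$ instead of the cluster ids $k(i)$.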

## Appendix G Hyperparameters

Table [6](https://arxiv.org/html/2605.22905#A7.T6) lists the hyperparameter values used throughout.

Table 6: EVE-Agent hyperparameters.

## Appendix H Extended evidence-grounding diagnostics

Table [7](https://arxiv.org/html/2605.22905#A8.T7) extends Table [1](https://arxiv.org/html/2605.22905#S4.T1) of Section [4.2](https://arxiv.org/html/2605.22905#S4.SS2) by adding the per-benchmark answer-accuracy and evidence-presence columns. The evaluation uses the same $1{,}000$ instances per benchmark, except for Bamboogle, which has $125$ instances.

Table 7: Full evidence-quality breakdown on $1{,}000$ samples per benchmark ($125$ for Bamboogle). *Initial* is the Qwen2.5-3B-Instruct backbone; *Prior* is a faithful re-implementation of Yue et al. [[2026](https://arxiv.org/html/2605.22905#bib.bib31)]; *EVE-Agent* is the verifier-trained solver.

Two patterns confirm the bottleneck argument of Section [4.2](https://arxiv.org/html/2605.22905#S4.SS2). First, the prior system emits an evidence span in $90$–$99\%$ of cases, comparable to the initial backbone, yet its judged evidence score in Table [1](https://arxiv.org/html/2605.22905#S4.T1) is low. Second, the answer-and-evidence gap is much larger than the answer-only gap on the open-domain datasets: a verifier-trained solver is several times more likely to be simultaneously correct *and* supported.
