Self-Evolving Deep Research via Joint Generation and Evaluation

arXiv cs.CL Papers

Summary

Researchers from HKUST, ByteDance, and UCL propose SCORE, a co-evolutionary training framework that jointly trains an LLM as both a deep research report generator and an evaluator, using a meta-harness to dynamically adjust evaluation difficulty and prevent reward saturation. Experiments show consistent improvement in open-ended research report quality.

arXiv:2606.04507v1 Announce Type: new

# Self-Evolving Deep Research via Joint Generation and Evaluation
Source: [https://arxiv.org/html/2606.04507](https://arxiv.org/html/2606.04507)
Han Zhu¹, Chengkun Cai²∗, Yuanfeng Song³†, Xing Chen³, Sirui Han¹†, Yike Guo¹

¹The Hong Kong University of Science and Technology, ²ByteDance, China, ³University College London

###### Abstract

Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report generation lacks definitive ground-truth, making reward design inherently unverifiable and limiting effective reinforcement learning. Existing approaches mitigate this challenge with LLM-as-a-judge and query-dependent evaluation rubrics, but they still rely on static evaluators that cannot adapt their standards as the solver improves, leading to insufficient and eventually saturated optimization pressure. We address this limitation with a **s**elf-evolving **co**-evolutionary training framework for deep **re**search evaluation and generation (SCORE), which tightly couples an evaluator and a solver in a shared-parameter learning process. Rather than treating generation and evaluation as isolated modules, we leverage their intrinsic connection to enable joint improvement within a single shared-parameter model. To restrict this process, we introduce a meta-harness, which dynamically controls the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search. Extensive experiments on deep research benchmarks demonstrate consistent improvement in report generation quality, showing that co-evolving evaluation and generation is a promising direction for training open-ended research agents.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.04507v1/x1.png)

Figure 1: Empirical evidence of the positive correlation between LLMs' generation and evaluation capabilities.

Owing to their extraordinary capabilities, Large Language Models (LLMs) have become deeply integrated into various complex domains, spanning autonomous driving \[[48](https://arxiv.org/html/2606.04507#bib.bib34)\], software development \[[9](https://arxiv.org/html/2606.04507#bib.bib35)\], multimedia generation \[[33](https://arxiv.org/html/2606.04507#bib.bib36),[46](https://arxiv.org/html/2606.04507#bib.bib64),[54](https://arxiv.org/html/2606.04507#bib.bib65)\], among others \[[49](https://arxiv.org/html/2606.04507#bib.bib62),[25](https://arxiv.org/html/2606.04507#bib.bib63)\]. As the demand for comprehensive information retrieval and synthesis grows, enhancing the deep research capabilities of LLMs has attracted increasing attention from researchers. Several search agents, such as GPT-Researcher \[[13](https://arxiv.org/html/2606.04507#bib.bib31)\] and AgentCPM-Explore \[[2](https://arxiv.org/html/2606.04507#bib.bib32)\], have recently demonstrated remarkable proficiency in these multi-hop research tasks. Furthermore, previous works like Search-o1 \[[26](https://arxiv.org/html/2606.04507#bib.bib24)\] and SSP \[[30](https://arxiv.org/html/2606.04507#bib.bib12)\] have shown that applying Reinforcement Learning (RL) to these models can substantially enhance their reasoning capabilities when integrated into agents, yielding significant performance improvements on complex Question-Answering (QA) tasks.

Unlike traditional QA tasks, training models for open-ended report generation presents two challenges. First, comprehensive reports lack a definitive ground truth, making objective reward signals inherently difficult to define \[[23](https://arxiv.org/html/2606.04507#bib.bib41),[12](https://arxiv.org/html/2606.04507#bib.bib42)\]. Since there is no absolute standard for open-ended queries, and relying on human annotations for generated reports is prohibitively expensive and difficult to scale, objective and scalable reward signals are hard to construct. Second, the evaluation criteria for a high-quality report are multi-dimensional and query-dependent \[[21](https://arxiv.org/html/2606.04507#bib.bib38),[10](https://arxiv.org/html/2606.04507#bib.bib40)\]. A single fixed evaluation metric inevitably ignores crucial dimensions, such as factual correctness and insight diversity, for different questions. This dimensional omission results in sparse training signals and fails to provide the fine-grained credit assignment required for complex generation tasks.

To overcome these challenges in report generation, recent research proposes several novel paradigms to provide more effective optimization signals. WebThinker \[[27](https://arxiv.org/html/2606.04507#bib.bib16)\] and Step-DeepResearch \[[16](https://arxiv.org/html/2606.04507#bib.bib28)\] apply LLM-as-a-judge \[[57](https://arxiv.org/html/2606.04507#bib.bib33)\] to evaluate report quality during training. AdaRubric \[[8](https://arxiv.org/html/2606.04507#bib.bib37)\] proposes a framework that dynamically generates task-specific evaluation criteria based on the input instructions. Mix-GRM \[[52](https://arxiv.org/html/2606.04507#bib.bib39)\] utilizes Reinforcement Learning with Verifiable Rewards (RLVR) to automatically adapt between breadth-first (B-CoT) and depth-first (D-CoT) evaluation strategies according to distinct task demands. However, a fixed external evaluator cannot dynamically adjust its evaluation standards, and therefore fails to provide sustained optimization pressure as the solver improves.

As empirically observed in Figure [1](https://arxiv.org/html/2606.04507#S1.F1), an LLM's generation quality is positively correlated with its evaluation proficiency. Building upon this insight, we hypothesize that generation and evaluation abilities can be mutually reinforcing and should be optimized within a unified, rather than decoupled, architecture. To this end, we propose a self-play framework in which the policy model iteratively generates and evaluates reports. By replacing a fixed external judge with a dynamic evaluator that evolves alongside the solver, our method maintains informative training pressure throughout learning. Furthermore, a Meta-Harness \[[22](https://arxiv.org/html/2606.04507#bib.bib43)\] is applied to stabilize this process by enforcing environmental constraints. Empirically, this co-evolutionary paradigm brings substantial and consistent improvements across diverse research generation tasks.

In summary, our contributions are as follows:

- We propose SCORE, an evaluator-solver self-play framework for training deep research agents with unverifiable rewards. This framework provides dynamic, multidimensional rewards to optimize the solver's ability in open-ended report generation, constrained by a meta-harness.
- We provide a theoretical analysis of SCORE under shared parameters. Our analysis clarifies the role of consistency in evaluator-side adaptation and characterizes the local optimization behavior of shared-parameter alternating updates.
- We show that our method improves agent performance across multiple evaluation dimensions on deep research benchmarks, and provide a systematic analysis of how parameters affect the training procedure under unverifiable rewards.

## 2 Related Work

### 2.1 Deep Research

LLMs have exhibited strong deep research capabilities in recent agent studies \[[59](https://arxiv.org/html/2606.04507#bib.bib59),[1](https://arxiv.org/html/2606.04507#bib.bib60)\]. Initial frameworks, including Search-o1 \[[26](https://arxiv.org/html/2606.04507#bib.bib24)\], Search-R1 \[[20](https://arxiv.org/html/2606.04507#bib.bib22)\], and R1-Searcher \[[39](https://arxiv.org/html/2606.04507#bib.bib23)\], successfully introduced explicit reasoning processes into agentic search frameworks, yielding substantial empirical improvements in multi-hop question-answering (QA) tasks. Following these advancements, researchers integrated diverse optimization algorithms into agentic training pipelines. Specifically, WebThinker \[[27](https://arxiv.org/html/2606.04507#bib.bib16)\] utilizes iterative Direct Preference Optimization (DPO) for deep research agents, whereas DeepResearcher \[[58](https://arxiv.org/html/2606.04507#bib.bib25)\] adopts Group Relative Policy Optimization (GRPO). Diverging from these algorithmic enhancements, SimpleDeepSearcher \[[40](https://arxiv.org/html/2606.04507#bib.bib26)\] argues that the quality of the training data matters more than the specific training paradigm. Beyond QA tasks, report generation tasks are inherently more complex, as they involve open-ended, multi-faceted synthesis where establishing reliable ground truth for automated evaluation remains a critical bottleneck. To address this complexity, multi-agent frameworks like STORM and Co-STORM leverage preliminary knowledge and human-in-the-loop interactions during report generation \[[34](https://arxiv.org/html/2606.04507#bib.bib30),[19](https://arxiv.org/html/2606.04507#bib.bib29)\]. Later approaches such as Step-DeepResearch and AgentCPM-Report \[[16](https://arxiv.org/html/2606.04507#bib.bib28),[28](https://arxiv.org/html/2606.04507#bib.bib27)\] directly integrate model training optimized for their specific agent frameworks.

### 2.2 Intrinsic Rewards and Self-Play

Traditional self-play is widely applied in agentic studies as an adversarial reinforcement learning approach \[[53](https://arxiv.org/html/2606.04507#bib.bib8)\]. AbsoluteZero, R-Zero, SeRL, and SPIRAL demonstrate the effectiveness of self-evolution in tackling complex reasoning tasks under data scarcity \[[55](https://arxiv.org/html/2606.04507#bib.bib9),[17](https://arxiv.org/html/2606.04507#bib.bib10),[29](https://arxiv.org/html/2606.04507#bib.bib11),[11](https://arxiv.org/html/2606.04507#bib.bib15)\]. Recent frameworks, including Search Self-play, Dr. Zero, and SPaR, demonstrate the substantial advantages of applying this technique to search agents operating within multi-hop environments \[[30](https://arxiv.org/html/2606.04507#bib.bib12),[50](https://arxiv.org/html/2606.04507#bib.bib13),[3](https://arxiv.org/html/2606.04507#bib.bib14)\]. Beyond adversarial learning paradigms like proposer-solver self-play, alternative approaches train solvers on filtered trajectories generated by the models themselves, as demonstrated by frameworks such as WebThinker, EvolveSearch, and EXSEARCH \[[27](https://arxiv.org/html/2606.04507#bib.bib16),[51](https://arxiv.org/html/2606.04507#bib.bib17),[38](https://arxiv.org/html/2606.04507#bib.bib18)\]. Other notable efforts include AceSearcher, which employs cooperative interaction between a decomposer and a solver for self-improvement \[[45](https://arxiv.org/html/2606.04507#bib.bib19)\]. Moreover, optimization strategies that use internal rewards provided by the models themselves have been adopted in frameworks such as INTUITOR \[[56](https://arxiv.org/html/2606.04507#bib.bib20)\] and JEPO \[[41](https://arxiv.org/html/2606.04507#bib.bib21)\]. These previous works demonstrate the efficacy of internal signals in enhancing autonomous capability improvements. However, the unique characteristics of the deep research task present a significant evaluation challenge. The difficulty of reliably scoring open-domain, comprehensive reports makes it highly challenging to design effective reward signals that resolve the credit assignment issue.

## 3 Methodology

![Refer to caption](https://arxiv.org/html/2606.04507v1/x2.png)

Figure 2: SCORE is controlled by the Meta-Harness. The evaluator first selects evaluation dimensions from a provided dimension pool and retrieves evidence accordingly. The solver is updated using rewards derived from the evaluator, while the evaluator is optimized based on the consistency of rollout reports generated by the solver. A reward verifier filters out invalid rewards before they are used for solver optimization.

### 3.1 Problem Setting

We study *deep research report generation*, where the goal is to learn a policy $\pi(r \mid q, \mathcal{E}_q)$ over candidate reports $r$ for a given query $q$ and environment $\mathcal{E}_q$. We assume that report quality is governed by a latent functional $Q(r; q, \mathcal{E}_q)$, and the ideal objective is to assign more probability mass to reports with higher latent quality. Here, $q \in \mathcal{Q}$ denotes a user query, $r \in \mathcal{R}$ denotes a candidate long-form report, and $\mathcal{E}_q$ denotes the query-specific evidence and evaluation environment.

Unlike QA-style tasks, $Q$ is inherently *multi-dimensional* and *query-dependent*: a report may succeed or fail along different axes, such as factual support, evidence coverage, citation grounding, contradiction handling, uncertainty calibration, and discourse coherence. As a result, long-form report generation rarely admits a single reliable verifier or a fixed scalar reward that is both sufficiently informative and broadly applicable.

We therefore formulate deep research as a query\-conditioned report optimization problem with an evaluation bottleneck\. In particular, the central challenge in training deep research models is how to evaluate long\-form reports in a query\-specific and stable manner under limited supervision\.

### 3.2 Overview of SCORE

To address this challenge, we propose SCORE, a retrieval-grounded and externally guided evaluator-solver framework for deep research report generation, as shown in Figure [2](https://arxiv.org/html/2606.04507#S3.F2). SCORE contains four components:

1. an external evidence environment that can be queried through search and retrieval tools,
2. a fixed external *meta-harness* that shapes the query-specific evaluation environment,
3. an *evaluator* that constructs query-conditioned rubrics and produces structured report assessments,
4. a *solver* that interacts with the evidence environment, gathers supporting evidence, and generates reports under the resulting rubric.

A key design choice of SCORE is that the solver and evaluator are implemented as two functionally distinct roles over a *shared underlying actor*. This design is motivated by the substantial overlap between report generation and report evaluation, both of which require query understanding, evidence modeling, and discourse-level reasoning. We empirically examine this choice in our ablations (Section [5.2](https://arxiv.org/html/2606.04507#S5.SS2.SSS0.Px2)).

Training proceeds through sequential alternating updates on the shared actor: the solver is optimized using evaluator\-derived report rewards, while the evaluator is optimized using rubric\-level auxiliary signals\. Both updates are regularized by KL divergence against a reference policy to limit excessive policy drift\.

### 3.3 External Evidence and Query-Specific Environment

For each query $q$, SCORE interacts with an external evidence environment through search and retrieval tools. We denote by

$$D_q^{(0)} = \mathcal{R}(q)$$

the initial evidence obtained from this environment, where $\mathcal{R}$ abstracts the retrieval interaction available to the system. In practice, this evidence environment may include web search, document retrieval, or other tool-mediated evidence sources.

Based on the initial evidence $D_q^{(0)}$, SCORE constructs a query-specific evaluation environment through a fixed external *meta-harness*:

$$\mathcal{E}_q = \mathrm{Harness}(q, D_q^{(0)}, \mathcal{S}),$$

where $\mathcal{S}$ denotes recent training statistics or historical signals available to the harness. The resulting environment $\mathcal{E}_q$ specifies the admissible evaluation space for the current query, including candidate evaluation dimensions, structural constraints on valid rubrics, process requirements for evidence verification or comparison, and auxiliary task-specific guidance.

The meta\-harness is external and non\-trainable\. It is implemented as a fixed external controller and is not optimized by gradient descent together with the actor\. Its role is to shape the admissible evaluation space in which evaluator\-solver interaction takes place\.

### 3.4 Evaluator and Solver

Given query $q$ and environment $\mathcal{E}_q$, the evaluator samples a query-conditioned rubric

$$(H_q, w_q) \sim \pi_e(\cdot \mid q, \mathcal{E}_q),$$

where $H_q = \{h_1, \dots, h_m\}$ denotes the selected evaluation dimensions and $w_q$ their corresponding weights.

Conditioned on the query, the current evidence state, and the evaluator-selected rubric, the solver interacts with the external evidence environment to gather supporting information and generate a group of candidate reports

$$r_1, \dots, r_K \sim \pi_s(\cdot \mid q, D_q^s, H_q, w_q),$$

where $D_q^s$ denotes the evidence collected by the solver and $K$ is the number of solver rollouts sampled for the same query during training.

For assessment, the evaluator may perform constrained supplementary retrieval or verification based on the query and the generated reports, producing its own evidence context $D_q^e$. It then produces structured judgments over candidate reports, including dimension-wise scoring, evidence support checking, contradiction analysis, and citation verification. These structured judgments are converted into solver-side learning signals.

### 3.5 Shared Actor and Training Signals

SCORE implements the solver and evaluator as two roles over a shared actor with parameters $\theta$. This design is motivated by the observation that report generation and report evaluation rely on overlapping latent capabilities, including query understanding, evidence modeling, and discourse-level reasoning. Under this view, generation and evaluation are not treated as fully separate functions, but as two role-specific mappings induced from a shared underlying representation. A shared actor therefore provides a common representational substrate for both roles, allowing evaluator-side adaptation to influence the same internal representations used by the solver. We empirically examine this design choice in Section [5.2](https://arxiv.org/html/2606.04507#S5.SS2.SSS0.Px2).

SCORE uses two distinct reward channels:

$$\underbrace{R_s(r; q, D_q^e, H_q, w_q)}_{\text{solver-side report reward}} \qquad \text{and} \qquad \underbrace{R_e(\Gamma_q) = C(\Gamma_q)}_{\text{evaluator-side auxiliary reward}}.$$

The solver-side reward supervises report generation, while the evaluator-side reward adapts the evaluator's rubric-selection behavior.

#### Solver\-side report reward\.

For each generated report $r$, the evaluator produces a scalar report reward

$$R_s(r; q, D_q^e, H_q, w_q),$$

where $D_q^e$ denotes the evidence context used by the evaluator for rubric-conditioned assessment, $H_q = \{h_1, \dots, h_m\}$ denotes the selected evaluation dimensions, and $w_q = \{w_1, \dots, w_m\}$ denotes their corresponding weights. This reward serves as the primary supervision signal for optimizing the solver.

In our implementation, the abstract report reward is instantiated as

$$R_s(r) = a(r) \cdot \left(1 + \sum_{j=1}^{m} w_j\, s_j(r; q, D_q^e)\right),$$

where $a(r)$ denotes a report-level validity or accuracy score, and $s_j(r; q, D_q^e)$ denotes the evaluator score of report $r$ on dimension $h_j$ under evaluator-side evidence context $D_q^e$. Intuitively, the solver is encouraged to generate reports that are both valid and strong along the evaluator-selected rubric dimensions.
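The instantiated reward can be sketched in a few lines of Python. This is a minimal illustration of the formula only: the function name `solver_reward`, its argument names, and the numeric values are hypothetical, and the sketch assumes the validity score and per-dimension scores have already been produced by the evaluator.

```python
from typing import Sequence


def solver_reward(validity: float,
                  weights: Sequence[float],
                  scores: Sequence[float]) -> float:
    """Compute R_s(r) = a(r) * (1 + sum_j w_j * s_j).

    validity: report-level validity or accuracy score a(r).
    weights:  rubric weights w_1..w_m chosen by the evaluator.
    scores:   evaluator scores s_1..s_m, one per rubric dimension.
    """
    if len(weights) != len(scores):
        raise ValueError("weights and scores must have the same length")
    weighted_sum = sum(w * s for w, s in zip(weights, scores))
    return validity * (1.0 + weighted_sum)


# Illustrative values only (not taken from the paper).
reward = solver_reward(validity=1.0,
                       weights=[0.5, 0.3, 0.2],
                       scores=[0.8, 0.6, 0.4])
```

Because $a(r)$ multiplies the whole expression, a report with zero validity receives zero reward no matter how well it scores on the rubric dimensions.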

#### Evaluator\-side auxiliary reward\.

To adapt the evaluator, SCORE introduces a rubric-level auxiliary reward based on inter-report consistency. For a fixed query and evaluator-selected rubric

$$\Gamma_q := (H_q, w_q),$$

the solver generates $K$ reports under the same query and evaluation setting. We define the raw evaluator-side auxiliary reward as

$$R_e(\Gamma_q) = C(\Gamma_q),$$

where $C(\Gamma_q)$ is the empirical consistency score computed by aggregating pairwise agreement across the generated reports. In the current implementation, $C(\Gamma_q)$ is computed as the batch mean of pairwise agreement scores over extracted report conclusions.
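A mean-pairwise-agreement score of this kind could look as follows. The source does not specify how agreement between two extracted conclusions is measured, so the `exact_match` function below is a placeholder assumption, as are all names in the sketch; in practice an LLM-based or embedding-based comparison might take its place.

```python
from itertools import combinations
from typing import Callable, Sequence


def exact_match(a: str, b: str) -> float:
    """Hypothetical agreement function: 1.0 if two conclusions match
    after trimming whitespace and lowercasing, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def consistency_score(conclusions: Sequence[str],
                      agree: Callable[[str, str], float] = exact_match) -> float:
    """Mean agreement over all unordered pairs of report conclusions."""
    pairs = list(combinations(conclusions, 2))
    if not pairs:
        raise ValueError("need at least two reports to compare")
    return sum(agree(a, b) for a, b in pairs) / len(pairs)


# Four rollouts for one query: three agree, one dissents.
score = consistency_score(["yes", "Yes", "no", "yes"])
```

With four rollouts there are six unordered pairs; three of them agree in this example, so the score is 0.5.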

Evaluator updates use an EMA-centered version of this signal,

$$A_e(\Gamma_q) = R_e(\Gamma_q) - b_{t-1},$$

where $b_{t-1}$ is the previous exponential moving average baseline of past auxiliary rewards. After computing the centered signal, the baseline is updated as

$$b_t = \beta b_{t-1} + (1-\beta) R_e(\Gamma_q),$$

where $\beta$ is the EMA decay coefficient. Thus, $R_e$ favors rubrics under which solver behavior is more reproducible across repeated rollouts, while $A_e$ provides the actual lower-variance signal used for REINFORCE-style evaluator updates.
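The centering and baseline update can be sketched as a small stateful helper. The class name `EmaBaseline` and the initial baseline value of 0 are assumptions made for illustration; the default decay of 0.7 matches the value reported in the training details.

```python
from dataclasses import dataclass


@dataclass
class EmaBaseline:
    beta: float = 0.7    # EMA decay coefficient
    value: float = 0.0   # b_0; the initial value is an assumption

    def centered(self, reward: float) -> float:
        """Return A_e = R_e - b_{t-1}, then update the baseline to b_t."""
        advantage = reward - self.value
        self.value = self.beta * self.value + (1.0 - self.beta) * reward
        return advantage


baseline = EmaBaseline()
first = baseline.centered(1.0)   # uses b_0
second = baseline.centered(1.0)  # uses b_1
```

The ordering matters: the advantage is computed against the baseline from before the current reward is folded in, so the current sample does not influence its own baseline.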

Overall, SCORE forms a two\-channel training loop: evaluator\-derived report rewards improve the solver, while consistency\-based rubric\-level rewards adapt the evaluator\. The solver\-side reward targets report quality under the selected rubric, whereas the evaluator\-side reward favors rubrics that induce more stable and reproducible solver behavior\. In this sense, consistency is used as a stability\-oriented auxiliary signal for rubric adaptation rather than as a direct correctness criterion\.

### 3.6 Sequential Alternating KL-Regularized Optimization

Given the two reward channels defined in Section [3.5](https://arxiv.org/html/2606.04507#S3.SS5), training in SCORE proceeds through sequential alternating updates on the shared actor. Each step constructs a query-specific evaluation environment, generates rubric-conditioned reports, evaluates them under evaluator-side verification context, and then applies solver-side and evaluator-side updates to the shared parameters $\theta$.

The solver-side update uses evaluator-derived report rewards $R_s$, which are converted into group-normalized advantages and optimized with GRPO. The evaluator-side update uses the EMA-centered auxiliary reward $A_e$, derived from the raw consistency reward $R_e$, to update the rubric-selection policy with a REINFORCE-style objective. Although these updates are computed from different trajectories and rewards, both act on the same actor and are therefore coupled through the shared parameterization.
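As a rough illustration of the group normalization step, the sketch below standardizes the rewards of the rollouts sampled for one query. It follows the commonly used GRPO recipe of subtracting the group mean and dividing by the group standard deviation; the choice of population standard deviation, the `eps` stabilizer, and the function name are assumptions rather than details confirmed by the source.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_normalized_advantages(rewards: Sequence[float],
                                eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of rollouts for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards R_s for a group of rollouts on one query (illustrative values).
advantages = group_normalized_advantages([1.0, 2.0, 3.0, 4.0])
```

When all rollouts in a group receive the same reward, every advantage is zero and the group contributes no gradient, which is one reason a saturated evaluator stops providing optimization pressure.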

Both updates are regularized by KL divergence against a fixed reference policy $\pi_{\mathrm{ref}}$. In our implementation, this KL term is applied at the actor-update level rather than as an explicit reward penalty. Thus, KL regularization is not treated as a third reward channel, but as an optimization regularizer that limits excessive policy drift under shared-actor learning.

At the analysis level, we denote the unregularized role-specific objectives by $\mathcal{L}_s(\theta)$ and $\mathcal{L}_e(\theta)$, respectively. The corresponding KL-regularized objectives can be written as

$$\mathcal{L}_s^{\mathrm{reg}}(\theta) = \mathcal{L}_s(\theta) + \lambda\,\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$$

$$\mathcal{L}_e^{\mathrm{reg}}(\theta) = \mathcal{L}_e(\theta) + \lambda\,\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$$

where $\lambda > 0$ is the KL regularization coefficient.

KL regularization serves two purposes in SCORE\. First, it stabilizes alternating optimization by limiting abrupt policy drift under shared\-role learning\. Second, it helps maintain compatibility between solver\-side and evaluator\-side updates on the shared actor\. In this way, SCORE combines evaluator\-derived report rewards, consistency\-based evaluator adaptation, external grounding, shared\-role learning, and KL\-regularized alternating optimization into a unified training framework for long\-form report generation\.

## 4 Theoretical Analysis

We analyze two properties of SCORE: the role of consistency as an evaluator\-side auxiliary signal, and the first\-order optimization behavior of shared\-parameter KL\-regularized alternating updates\.

#### Assumptions\.

We assume that: (A1) the analysis-level objectives $\mathcal{L}_s$ and $\mathcal{L}_e$ are $L$-smooth; (A2) the corresponding stochastic gradients are unbiased or approximately unbiased with bounded variance; and (A3) the stochastic gradients are almost surely bounded.

#### Consistency as rubric\-induced reproducibility\.

Let $\Gamma_q := (H_q, w_q)$ denote the evaluator-selected rubric for query $q$. Suppose the solver generates $K$ reports under the same query, evidence context, and rubric, and let $C(\Gamma_q)$ denote the resulting empirical consistency score obtained by aggregating pairwise agreement scores across these reports. In the current implementation, this is the mean pairwise consistency across generated reports. We define the raw evaluator-side auxiliary signal as

$$R_e(\Gamma_q) = C(\Gamma_q),$$

and use the centered signal

$$A_e(\Gamma_q) = R_e(\Gamma_q) - b_{t-1}$$

for evaluator updates, where $b_{t-1}$ is the previous exponential moving average baseline, as in Section 3.5.

###### Proposition 1 (Consistency estimates rubric-induced reproducibility).

Let $\mu_C(\Gamma_q)$ denote the population mean of the consistency metric under rubric $\Gamma_q$. Then

$$\mathbb{E}[R_e(\Gamma_q)] = \mathbb{E}[\mu_C(\Gamma_q)].$$

Thus, the raw evaluator-side auxiliary signal estimates the expected reproducibility induced by rubric $\Gamma_q$ under repeated solver rollouts.

Higher reproducibility suggests that the rubric induces a more stable and repeatable evaluation context for the solver, which makes it a useful auxiliary signal for evaluator\-side adaptation\. Centering by the EMA baseline reduces gradient variance without changing the underlying preference encoded by the raw signal\. In SCORE, consistency is therefore used as a rubric\-level auxiliary signal for evaluator\-side adaptation\.

#### First\-order characterization of alternating updates\.

Let the shared-actor updates be

$$\theta' = \theta_t - \eta_s \hat{g}_s(\theta_t), \qquad \theta_{t+1} = \theta' - \eta_e \hat{g}_e(\theta'),$$

where $\hat{g}_s$ and $\hat{g}_e$ are the stochastic gradients of the solver-side and evaluator-side objectives, respectively. Define the analysis surrogate

$$\widetilde{\mathcal{L}}(\theta) = \mathcal{L}_s(\theta) + \alpha \mathcal{L}_e(\theta) + \lambda\,\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right), \qquad \alpha = \eta_e/\eta_s,$$

where $\pi_{\mathrm{ref}}$ is the fixed reference policy and $\lambda > 0$ is the KL regularization coefficient.

###### Proposition 2 (First-order characterization of alternating updates).

Under Assumptions (A1)–(A3), the expected one-step SCORE update satisfies

$$\mathbb{E}[\theta_{t+1} - \theta_t \mid \theta_t] = -\eta_s \nabla \widetilde{\mathcal{L}}(\theta_t) + \rho_t, \qquad \|\rho_t\| = O(\eta_s^2).$$

This result shows that the shared-parameter solver-evaluator updates follow, to first order, a KL-regularized surrogate descent direction, with only a higher-order residual induced by sequential alternating updates. Detailed derivations are deferred to Appendix [F](https://arxiv.org/html/2606.04507#A6).
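The second-order residual can be checked numerically on a toy problem. The sketch below is not the paper's proof: it assumes a scalar parameter, deterministic gradients, no KL term ($\lambda = 0$), and hypothetical quadratic objectives $\mathcal{L}_s = \tfrac{1}{2}(\theta - a)^2$ and $\mathcal{L}_e = \tfrac{1}{2}(\theta - b)^2$. Under these assumptions the gap between one alternating step and one surrogate step works out to $\alpha \eta_s^2 (\theta - a)$.

```python
def alternating_step(theta: float, eta_s: float, eta_e: float,
                     a: float, b: float) -> float:
    """Solver-side step on L_s, then evaluator-side step on L_e
    evaluated at the intermediate point."""
    theta_mid = theta - eta_s * (theta - a)
    return theta_mid - eta_e * (theta_mid - b)


def surrogate_step(theta: float, eta_s: float, eta_e: float,
                   a: float, b: float) -> float:
    """Single gradient step on L_s + alpha * L_e with alpha = eta_e / eta_s."""
    alpha = eta_e / eta_s
    grad = (theta - a) + alpha * (theta - b)
    return theta - eta_s * grad


def residual(theta: float, eta_s: float, alpha: float,
             a: float, b: float) -> float:
    """Gap between the alternating update and the surrogate update."""
    eta_e = alpha * eta_s
    return (alternating_step(theta, eta_s, eta_e, a, b)
            - surrogate_step(theta, eta_s, eta_e, a, b))


r_full = residual(theta=2.0, eta_s=0.1, alpha=0.5, a=0.0, b=1.0)
r_half = residual(theta=2.0, eta_s=0.05, alpha=0.5, a=0.0, b=1.0)
```

Halving $\eta_s$ while holding $\alpha$ fixed reduces the residual by a factor of four in this toy setting, which is the quadratic scaling the proposition describes.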

## 5 Experiments

### 5.1 Experimental Setup

#### Training Details

Table 1: Performance of open-deep-research and gpt-researcher integrated with Llama-3.1 and Qwen2.5 on DeepResearchBench. Comp.: comprehensiveness, IF: instruction following, Read.: readability, VC: valid ratio of citation. Column group: Writing Quality. Values in parentheses are the reported changes relative to the corresponding base model.

| Models and Agents | Overall | Comp. | Insight | IF. | Read. | VC. |
| --- | --- | --- | --- | --- | --- | --- |
| **Agent: Open-deep-research (ReAct Paradigm)** | | | | | | |
| Llama-3.1-8B-Inst. | 27.00 | 24.05 | 22.46 | 30.28 | 36.29 | 22.58 |
| + GRPO | 23.41 (-3.59) | 19.89 (-4.16) | 18.73 (-3.73) | 26.77 (-3.51) | 33.90 (-2.39) | 28.57 (+5.99) |
| + DPO | 26.13 (-0.87) | 23.11 (-0.94) | 21.28 (-1.18) | 29.84 (-0.44) | 35.77 (-0.52) | 28.00 (+5.42) |
| + SCORE (Ours) | 30.07 (+3.07) | 30.31 (+6.26) | 26.71 (+4.25) | 32.25 (+1.97) | 33.66 (-2.63) | 30.55 (+7.97) |
| Qwen2.5-7B-Inst. | 31.92 | 28.93 | 27.81 | 35.37 | 40.29 | 11.76 |
| + GRPO | 32.67 (+0.75) | 29.94 (+1.01) | 29.08 (+1.27) | 35.30 (-0.07) | 40.87 (+0.58) | 16.13 (+4.37) |
| + DPO | 33.91 (+1.99) | 34.26 (+5.33) | 30.89 (+3.08) | 35.90 (+0.53) | 36.85 (-3.44) | 13.76 (+2.00) |
| + SCORE (Ours) | 34.43 (+2.51) | 34.86 (+5.93) | 31.38 (+3.57) | 36.37 (+1.00) | 37.41 (-2.88) | 19.35 (+7.59) |
| **Agent: gpt-researcher (Plan-and-Execute Paradigm)** | | | | | | |
| Llama-3.1-8B-Inst. | 28.92 | 25.62 | 24.80 | 31.85 | 38.36 | 38.23 |
| + GRPO | 16.97 (-11.95) | 16.79 (-38.23) | 13.12 (-11.68) | 18.09 (-13.76) | 23.78 (-14.58) | 0.00 (-8.16) |
| + DPO | 33.58 (+4.66) | 33.83 (+8.21) | 30.64 (+5.84) | 35.15 (+3.30) | 36.87 (-1.49) | 34.48 (-3.75) |
| + SCORE (Ours) | 33.89 (+4.97) | 34.14 (+8.52) | 30.90 (+6.10) | 35.64 (+3.79) | 37.29 (-1.07) | 52.38 (+14.15) |
| Qwen2.5-7B-Inst. | 32.00 | 29.20 | 28.70 | 34.67 | 39.61 | 36.11 |
| + GRPO | 30.03 (-1.97) | 27.52 (-1.68) | 27.39 (-1.31) | 31.07 (-3.60) | 38.37 (-1.24) | 51.61 (+15.50) |
| + DPO | 35.85 (+3.85) | 36.22 (+7.02) | 33.64 (+4.94) | 37.15 (+2.48) | 38.07 (-1.54) | 30.80 (-5.31) |
| + SCORE (Ours) | 35.91 (+3.91) | 36.15 (+6.95) | 33.12 (+4.42) | 37.60 (+2.93) | 38.93 (-0.68) | 66.15 (+30.04) |

We train all models using the VeRL framework \[[36](https://arxiv.org/html/2606.04507#bib.bib44)\]\. The evaluator and solver are updated asynchronously: the solver is updated at every training step using GRPO \[[35](https://arxiv.org/html/2606.04507#bib.bib45)\], while the evaluator is updated once every 1 steps using REINFORCE \[[44](https://arxiv.org/html/2606.04507#bib.bib5)\]\. We set the rollout number to 1 for the evaluator and 8 for the solver\. The maximum number of tool calls is 10, the batch size is 64, the maximum response length is 4,096 tokens, and the maximum input length is 8,192 tokens\. For training, we use a local search engine backed by local WikiData, while web search is used during evaluation\. GPT\-5\.2 \[[18](https://arxiv.org/html/2606.04507#bib.bib46)\] serves as the meta\-harness and is invoked every 5 training steps to modify the environment according to the solver’s current performance\. For the coefficient of the EMA baseline, we set $\beta=0.7$\. We construct the training set by collecting and preprocessing user queries from Reddit\.
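
The update rules named above can be sketched as follows. This is a minimal illustration, assuming the textbook forms of a group-relative (GRPO-style) advantage and an exponential-moving-average (EMA) baseline for REINFORCE; this section does not restate the paper's exact objectives, and the names `grpo_advantages`, `EmaBaseline`, and `should_invoke_meta_harness` are hypothetical.

```python
def grpo_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (assumed GRPO-style form)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


class EmaBaseline:
    """Running baseline b <- beta * b + (1 - beta) * r (assumed form)."""

    def __init__(self, beta=0.7, init=0.0):
        self.beta = beta
        self.value = init

    def advantage(self, reward):
        # The advantage uses the baseline from before this reward is absorbed.
        adv = reward - self.value
        self.value = self.beta * self.value + (1.0 - self.beta) * reward
        return adv


def should_invoke_meta_harness(step, interval=5):
    """True on every `interval`-th training step (5 in the setup above)."""
    return step > 0 and step % interval == 0


# Solver: one group of 8 rollouts, matching the solver rollout number.
solver_advs = grpo_advantages([1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0])

# Evaluator: a single rollout per step, compared against the EMA baseline.
baseline = EmaBaseline(beta=0.7)
evaluator_advs = [baseline.advantage(r) for r in (1.0, 1.0, 0.0)]
```

With $\beta=0.7$, the baseline moves 30% of the way toward each new evaluator reward, so a single-rollout evaluator still receives a centered learning signal without needing a group of samples.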

#### Baselines and Models

We implement SCORE on top of two capable base LLMs, Qwen2\.5\-7B\-Instruct \[[32](https://arxiv.org/html/2606.04507#bib.bib47)\] and Llama\-3\.1\-8B\-Instruct \[[14](https://arxiv.org/html/2606.04507#bib.bib48)\]\. For comparison, we train the same models with standard GRPO and DPO as baselines\. The post\-trained models are then evaluated within two representative search agent systems to assess their downstream capabilities\. Specifically, we employ open\-deep\-research \[[31](https://arxiv.org/html/2606.04507#bib.bib49)\] to represent the ReAct paradigm and gpt\-researcher \[[13](https://arxiv.org/html/2606.04507#bib.bib31)\] to represent the plan\-and\-execute paradigm\. To maintain experimental consistency, a single, unified model is deployed across all functional modules within each agent\.

#### Benchmarks and Evaluation Metrics

To rigorously evaluate our approach, we employ two benchmarks designed for deep research tasks: DeepResearchBench \[[10](https://arxiv.org/html/2606.04507#bib.bib40)\] and DeepResearchEval \[[43](https://arxiv.org/html/2606.04507#bib.bib50)\]\. DeepResearchBench assesses agent performance across multiple dimensions, including comprehensiveness, insight, instruction following, readability, and citation validity\. In contrast, DeepResearchEval uses a dynamic evaluation metric, which aligns closely with our methodological perspective\. Both benchmarks adopt GPT\-5\.2 \[[57](https://arxiv.org/html/2606.04507#bib.bib33)\] as the judge for the final evaluation\.

### 5\.2 Experimental Results

#### Main Results

As illustrated in Table [1](https://arxiv.org/html/2606.04507#S5.T1), SCORE delivers a substantial and robust improvement in analytical depth and valid citation across all agents\. Basic reinforcement learning \(RL\) methods such as GRPO and DPO suffer catastrophic degradation in at least one dimension; with GRPO, the valid citation ratio collapses to exactly zero in the Plan\-and\-Execute agent when integrating Llama\-3\.1\-8B\-Instruct\. This phenomenon is driven by Lazy Likelihood Displacement \(LLD\), a condition where the likelihood of correct reasoning trajectories decays during tool\-integrated RL, ultimately causing training collapse \[[7](https://arxiv.org/html/2606.04507#bib.bib55)\]\. Furthermore, agent architecture magnifies this instability, as decoupled long\-horizon frameworks like Plan\-and\-Execute are more prone to premature collapse and severe reward hacking than step\-by\-step ReAct policies \[[6](https://arxiv.org/html/2606.04507#bib.bib56)\]\. In contrast, SCORE maintains stable performance by using a co\-evolving evaluator that provides dense, consistent constraints through dynamic evaluation rubrics\. Notably, we observe a consistent decline in readability across nearly all training methods\. We hypothesize that this reflects a fundamental trade\-off: because models are optimized to synthesize comprehensive analysis and rigorously cite evidence, the generated reports inherently incorporate denser academic vocabulary and more complex logical structures, sacrificing conversational simplicity for analytical rigor\. This may be similar to the trade\-off between diversity and factuality \[[24](https://arxiv.org/html/2606.04507#bib.bib57)\]\.
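
The parenthesized values in Table 1 read as differences from the corresponding base model, which can be checked with simple arithmetic. The sketch below recomputes a handful of cells copied from the table as printed in this section; the names `TABLE1_SAMPLES`, `delta`, and `mismatches` are hypothetical helpers, not part of the paper's code.

```python
# (base score, post-training score, delta as printed in Table 1)
TABLE1_SAMPLES = {
    "ReAct / Llama-3.1-8B / SCORE / Overall": (27.00, 30.07, 3.07),
    "ReAct / Qwen2.5-7B / SCORE / VC": (11.76, 19.35, 7.59),
    "Plan-and-Execute / Qwen2.5-7B / SCORE / VC": (36.11, 66.15, 30.04),
    "Plan-and-Execute / Llama-3.1-8B / GRPO / Comp.": (25.62, 16.79, -38.23),
    "Plan-and-Execute / Llama-3.1-8B / GRPO / VC": (38.23, 0.00, -8.16),
}


def delta(base, score):
    """Change relative to the base model, rounded to the table's precision."""
    return round(score - base, 2)


def mismatches(samples, tol=0.005):
    """Return recomputed deltas for cells whose printed delta disagrees."""
    return {
        name: delta(base, score)
        for name, (base, score, printed) in samples.items()
        if abs(delta(base, score) - printed) > tol
    }


for name, recomputed in mismatches(TABLE1_SAMPLES).items():
    print(f"{name}: recomputed delta {recomputed:+.2f}")
```

For the sampled SCORE cells the recomputed differences agree with the printed ones. For the two GRPO cells of the Llama\-3\.1\-8B gpt\-researcher row, subtracting the base scores gives -8.83 for Comp. and -38.23 for VC, which differs from the printed -38.23 and -8.16, so those two parenthesized entries are worth reading against the raw scores.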

![Refer to caption](https://arxiv.org/html/2606.04507v1/x3.png)

Figure 3: Ablation study for individual modules on DeepResearchBench\. Comp\.: comprehensiveness, IF: instruction following, Read\.: readability, VC: valid ratio of citation\.

#### Module Ablation Study

As shown in Figure [3](https://arxiv.org/html/2606.04507#S5.F3), removing any single module severely degrades the LLMs’ comprehensiveness, demonstrating that generating well\-rounded research reports requires the synergistic operation of the entire framework\. Freezing the solver leads to a massive decline in citation validity\. In this setting, the internal consistency reward for the evaluator induces reward hacking: the evaluator learns to construct superficial rubrics that elicit highly consistent text from the static solver while actively avoiding high\-variance behaviors such as rigorous external citation\. Furthermore, our complete method exhibits a slightly lower readability score than the ablated variants, which we attribute to the inherent trade\-off between natural language coherence and academic formatting\. Base models tend to produce fluent but unstructured text, whereas our co\-evolutionary framework pushes the model to generate deeper reports with complex citation syntax, thus sacrificing superficial readability for factual rigor\.

#### Experiments on Rollout Number

![Refer to caption](https://arxiv.org/html/2606.04507v1/figures/rollout_solver_entroy.png) \(a\) Solver Entropy
![Refer to caption](https://arxiv.org/html/2606.04507v1/figures/rollout_eval_reward.png) \(b\) Evaluator Reward
![Refer to caption](https://arxiv.org/html/2606.04507v1/figures/rollout_solver_reward.png) \(c\) Solver Reward

Figure 4: Training Details

We investigate the impact of the rollout number $K$ on training progress\. Figure [4](https://arxiv.org/html/2606.04507#S5.F4) shows that extreme values of $K$ disrupt training: a small $K=3$ fails to yield meaningful reward improvements, while a large $K=10$ triggers severe model collapse, reflected by a sudden drop in entropy \[[37](https://arxiv.org/html/2606.04507#bib.bib61)\]\. A small $K$ fails to provide the solver with a statistically robust reward because there are too few samples to rank; for the evaluator, fewer rollouts yield unreliable pairwise consistency estimates, resulting in noisy REINFORCE gradients that prevent stable dimension policy updates\. Conversely, an overly large $K$ yields highly precise advantage estimates for both the solver and the evaluator, accelerating their updates\. Under the self\-evolving setting, this mutual acceleration compounds policy drift and may trigger rapid model collapse\.
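
The statistical side of this argument can be illustrated with a toy simulation. The sketch below is not the paper's experiment: it assumes binary rewards drawn with success probability 0.5 as a stand-in for rollout rewards, and measures how much the $K$-sample mean reward (the quantity a group-relative baseline depends on) fluctuates across repeated groups. The function name `baseline_spread` is hypothetical.

```python
import random
import statistics


def baseline_spread(k, trials=2000, seed=0):
    """Std of the K-sample mean reward over many simulated rollout groups."""
    rng = random.Random(seed)
    means = [
        sum(rng.random() < 0.5 for _ in range(k)) / k  # mean of K binary rewards
        for _ in range(trials)
    ]
    return statistics.pstdev(means)


for k in (3, 8, 10):
    print(f"K={k}: spread of group mean ~ {baseline_spread(k):.3f}")
```

Under this assumption the spread shrinks roughly as $0.5/\sqrt{K}$, so a group of 3 rollouts gives a markedly noisier reference point than a group of 8 or 10. The simulation covers only estimator noise; the collapse observed at large $K$ comes from the coupled solver and evaluator updates, which this toy model does not capture.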

#### Case Study

As illustrated in Figure [6](https://arxiv.org/html/2606.04507#A7.F6) in Appendix [G\.2](https://arxiv.org/html/2606.04507#A7.SS2), different types of research queries inherently require distinct evaluation criteria\. For instance, legal queries demand absolute precision and strict citation of statutory texts, whereas open\-ended questions require the aggregation of diverse perspectives to maintain fairness and objectivity\. A generic, static rubric fails to capture these critical nuances, often assigning acceptable scores to reports that are superficially fluent but fundamentally flawed\. In contrast, our adaptive evaluation framework dynamically introduces specialized evaluation dimensions tailored to each query\. This mechanism ensures that the generated reports are accurately penalized or rewarded according to the specific rigor demanded by the task\.
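
The contrast between a static and a query-specific rubric can be made concrete with a toy scoring function. The sketch below is purely illustrative: it assumes a rubric is a set of weighted dimensions aggregated by a weighted mean, and every dimension name, weight, and score is hypothetical rather than taken from the paper.

```python
def rubric_score(dimension_scores, weights):
    """Weighted mean of per-dimension scores in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(w * dimension_scores[d] for d, w in weights.items()) / total_weight


# A fluent legal report that barely cites statutory text (hypothetical scores).
legal_report = {"fluency": 0.9, "coverage": 0.8, "statutory_citation": 0.1}

# Static rubric: the same generic dimensions for every query.
generic_rubric = {"fluency": 1.0, "coverage": 1.0}

# Query-specific rubric: adds a heavily weighted dimension for legal queries.
legal_rubric = {"fluency": 1.0, "coverage": 1.0, "statutory_citation": 3.0}

generic = rubric_score(legal_report, generic_rubric)
specialized = rubric_score(legal_report, legal_rubric)
print(f"generic rubric: {generic:.2f}, legal rubric: {specialized:.2f}")
```

In this toy setup the generic rubric scores the report at about 0.85, while the legal rubric lowers it to about 0.40 because the missing statutory citations now carry weight, which mirrors the failure mode of static rubrics described above.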

## 6 Conclusion

Generating comprehensive reports for deep research is a highly complex task, and the inherent difficulty of evaluating these open\-ended reports severely hinders the training process\. We observe a strong positive correlation between a model’s evaluation and generation capabilities\. Based on this insight, we propose SCORE, a self\-evolving training framework that jointly optimizes an evaluator and a solver\. Because the entire training procedure fundamentally relies on unverifiable rewards, we introduce an external meta\-harness to impose necessary environmental constraints on the evaluator\. This module simultaneously guides the evaluator to actively explore a broader range of relevant evaluation dimensions throughout the training phase\. Finally, experimental results demonstrate that our approach achieves substantial performance improvements while requiring only a remarkably small amount of training data\.

## References

- \[1\]H\. Bai, Y\. Zhou, Y\. Wu, C\. Chan, P\. Wen, K\. Pan, S\. Han, and Y\. Guo\(2026\)Glance\-or\-gaze: incentivizing lmms to adaptively focus search via reinforcement learning\.arXiv preprint arXiv:2601\.13942\.Cited by:[§2\.1](https://arxiv.org/html/2606.04507#S2.SS1.p1.1)\.
- \[2\]H\. Chen, X\. Cong, S\. Fan, Y\. Fu, Z\. Gong, Y\. Lu, Y\. Li, B\. Niu, C\. Pan, Z\. Song, H\. Wang, Y\. Wu, Y\. Wu, Z\. Xie, Y\. Yan, Z\. Zhang, Y\. Lin, Z\. Liu, and M\. Sun\(2026\)AgentCPM\-explore: realizing long\-horizon deep exploration for edge\-scale agents\.External Links:2602\.06485,[Link](https://arxiv.org/abs/2602.06485)Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p1.1)\.
- \[3\]J\. Cheng, X\. Liu, C\. Wang, X\. Gu, Y\. Lu, D\. Zhang, Y\. Dong, J\. Tang, H\. Wang, and M\. Huang\(2025\)SPar: self\-play with tree\-search refinement to improve instruction\-following in large language models\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=9chRqsPOGL)Cited by:[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[4\]G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat, N\. Sachdeva, I\. Dhillon, M\. Blistein, O\. Ram, D\. Zhang, E\. Rosen, L\. Marris,et al\.\(2025\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.External Links:2507\.06261,[Link](https://arxiv.org/abs/2507.06261)Cited by:[4th item](https://arxiv.org/html/2606.04507#A4.I1.i4.p1.1)\.
- \[5\]DeepSeek\-AI, A\. Liu, A\. Mei, B\. Lin, B\. Xue, B\. Wang, B\. Xu, B\. Wu, B\. Zhang, C\. Lin, C\. Dong, C\. Lu, C\. Zhao,et al\.\(2025\)DeepSeek\-v3\.2: pushing the frontier of open large language models\.External Links:2512\.02556,[Link](https://arxiv.org/abs/2512.02556)Cited by:[3rd item](https://arxiv.org/html/2606.04507#A4.I1.i3.p1.1)\.
- \[6\]W\. Deng, Y\. Li, B\. Gong, Y\. Ren, C\. Thrampoulidis, and X\. Li\(2026\)On group relative policy optimization collapse in agent search: the lazy likelihood\-displacement\.External Links:2512\.04220,[Link](https://arxiv.org/abs/2512.04220)Cited by:[§5\.2](https://arxiv.org/html/2606.04507#S5.SS2.SSS0.Px1.p1.1)\.
- \[7\]W\. Deng, Y\. Ren, M\. Li, D\. J\. Sutherland, X\. Li, and C\. Thrampoulidis\(2026\)On the effect of negative gradient in group relative deep reinforcement optimization\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=2K9QsDaqkM)Cited by:[§5\.2](https://arxiv.org/html/2606.04507#S5.SS2.SSS0.Px1.p1.1)\.
- \[8\]L\. Ding\(2026\)AdaRubric: task\-adaptive rubrics for llm agent evaluation\.External Links:2603\.21362,[Link](https://arxiv.org/abs/2603.21362)Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p3.1)\.
- \[9\]Y\. Dong, X\. Jiang, J\. Qian, T\. Wang, K\. Zhang, Z\. Jin, and G\. Li\(2025\)A survey on code generation with llm\-based agents\.arXiv preprint arXiv:2508\.00083\.Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p1.1)\.
- \[10\]M\. Du, B\. Xu, C\. Zhu, L\. Zhang, X\. Wang, and Z\. Mao\(2026\)DeepResearch bench: a comprehensive benchmark for deep research agents\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=hQ0K2Hhq7H)Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p2.1),[§5\.1](https://arxiv.org/html/2606.04507#S5.SS1.SSS0.Px3.p1.1)\.
- \[11\]W\. Fang, S\. Liu, Y\. Zhou, K\. Zhang, T\. Zheng, K\. Chen, M\. Song, and D\. Tao\(2026\)SeRL: self\-play reinforcement learning for large language models with limited data\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=ZF93vyH9He)Cited by:[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[12\]Y\. Gao, R\. Zhao, Y\. Deng, and W\. Zhang\(2026\)DR\-arena: an automated evaluation framework for deep research agents\.External Links:2601\.10504,[Link](https://arxiv.org/abs/2601.10504)Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p2.1)\.
- \[13\]Gpt\-researcher\.External Links:[Link](https://docs.gptr.dev/docs/gpt-researcher/getting-started/introduction)Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p1.1),[§5\.1](https://arxiv.org/html/2606.04507#S5.SS1.SSS0.Px2.p1.1)\.
- \[14\]A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan,et al\.\(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[2nd item](https://arxiv.org/html/2606.04507#A4.I1.i2.p1.1),[§5\.1](https://arxiv.org/html/2606.04507#S5.SS1.SSS0.Px2.p1.1)\.
- \[15\]B\. He, Y\. Zuo, Z\. Liu, S\. Zhao, Z\. Fu, J\. Yang, C\. Qian, K\. Zhang, Y\. Fan, G\. Cui, X\. Chen, Y\. Sun, X\. Lv, X\. Zhu, L\. Sheng, R\. Li, H\. Gao, Y\. Zhang, L\. Yuan, B\. Zhou, Z\. Liu, and N\. Ding\(2026\)How far can unsupervised RLVR scale LLM training?\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=VesLZukY5E)Cited by:[§G\.2](https://arxiv.org/html/2606.04507#A7.SS2.SSS0.Px2.p1.1)\.
- \[16\]C\. Hu, H\. Du, H\. Wang, L\. Lin, M\. Chen, P\. Liu, R\. Miao, T\. Yue, W\. You, W\. Ji, W\. Yuan, W\. Deng, X\. Yuan, X\. Zhang, X\. Liu, X\. Liu, Y\. Xu, Y\. Cao, Y\. Zhang, Y\. Wang, Y\. Shu, Y\. Zhang, Y\. Zhang, Z\. Gong, Z\. Chang, B\. Li, D\. Ma, F\. Jia, H\. Wang, J\. Liu, J\. Bai, J\. Liu, M\. Liu, N\. Wang, Q\. Wu, Q\. Du, S\. Li, W\. Sun, Y\. Gong, Y\. Chen, Y\. Zhao, Y\. Lin, Z\. Ren, Z\. Wang, A\. Zhang, B\. Li, B\. Ma, K\. An, L\. Xie, M\. Li, P\. Li, S\. Yang, X\. Chen, X\. Liu, Y\. Luo, Y\. Song, Y\. Ding, Y\. Liang, Z\. Li, Z\. Zhang, Z\. Zhang, B\. Jiao, D\. Jiang, J\. Chen, J\. Li, X\. Zhang, and Y\. Zhu\(2025\)Step\-deepresearch technical report\.External Links:2512\.20491,[Link](https://arxiv.org/abs/2512.20491)Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.04507#S2.SS1.p1.1)\.
- \[17\]C\. Huang, W\. Yu, X\. Wang, H\. Zhang, Z\. Li, R\. Li, J\. Huang, H\. Mi, and D\. Yu\(2026\)R\-zero: self\-evolving reasoning LLM from zero data\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=96apU6YzSO)Cited by:[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[18\]OpenAI\(2025\)Introducing GPT\-5\.2\.External Links:[Link](https://openai.com/zh-Hans-CN/index/introducing-gpt-5-2/)Cited by:[4th item](https://arxiv.org/html/2606.04507#A4.I1.i4.p1.1),[§5\.1](https://arxiv.org/html/2606.04507#S5.SS1.SSS0.Px1.p1.1)\.
- \[19\]Y\. Jiang, Y\. Shao, D\. Ma, S\. Semnani, and M\. Lam\(2024\-11\)Into the unknown unknowns: engaged human learning through participation in language model agent conversations\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 9917–9955\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.554/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.554)Cited by:[§2\.1](https://arxiv.org/html/2606.04507#S2.SS1.p1.1)\.
- \[20\]B\. Jin, H\. Zeng, Z\. Yue, J\. Yoon, S\. O\. Arik, D\. Wang, H\. Zamani, and J\. Han\(2025\)Search\-r1: training LLMs to reason and leverage search engines with reinforcement learning\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=Rwhi91ideu)Cited by:[§2\.1](https://arxiv.org/html/2606.04507#S2.SS1.p1.1)\.
- \[21\]S\. Kim, J\. Shin, Y\. Cho, J\. Jang, S\. Longpre, H\. Lee, S\. Yun, S\. Shin, S\. Kim, J\. Thorne, and M\. Seo\(2024\)Prometheus: inducing fine\-grained evaluation capability in language models\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=8euJaTveKw)Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p2.1)\.
- \[22\]Y\. Lee, R\. Nair, Q\. Zhang, K\. Lee, O\. Khattab, and C\. Finn\(2026\)Meta\-harness: end\-to\-end optimization of model harnesses\.External Links:2603\.28052,[Link](https://arxiv.org/abs/2603.28052)Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p4.1)\.
- \[23\]X\. Lei, C\. Li, Y\. Wu, K\. Liu, W\. Shen, P\. Li, M\. Yan, F\. Huang, Y\. Zhang, and Y\. Liu\(2026\)Writing\-rl: advancing long\-form writing via adaptive curriculum reinforcement learning\.External Links:2506\.05760,[Link](https://arxiv.org/abs/2506.05760)Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p2.1)\.
- \[24\]J\. Li, J\. Chen, R\. Ren, X\. Cheng, X\. Zhao, J\. Nie, and J\. Wen\(2024\-08\)The dawn after the dark: an empirical study on factuality hallucination in large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 10879–10899\.External Links:[Link](https://aclanthology.org/2024.acl-long.586/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.586)Cited by:[§5\.2](https://arxiv.org/html/2606.04507#S5.SS2.SSS0.Px1.p1.1)\.
- \[25\]L\. Li, S\. Jia, and J\. Hwang\(2026\)Multiple human motion understanding\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 6297–6305\.Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p1.1)\.
- \[26\]X\. Li, G\. Dong, J\. Jin, Y\. Zhang, Y\. Zhou, Y\. Zhu, P\. Zhang, and Z\. Dou\(2025\-11\)Search\-o1: agentic search\-enhanced large reasoning models\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 5420–5438\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.276/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.276),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.04507#S2.SS1.p1.1)\.
- \[27\]X\. Li, J\. Jin, G\. Dong, H\. Qian, Y\. Wu, J\. Wen, Y\. Zhu, and Z\. Dou\(2026\)WebThinker: empowering large reasoning models with deep research capability\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=7LKKHBAMzH)Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.04507#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[28\]Y\. Li, W\. Chen, Y\. Yan, M\. Li, S\. Mei, X\. Wang, K\. Liu, X\. Cong, S\. Wang, Z\. Zhang, Y\. Lu, Z\. Liu, Y\. Lin, Z\. Liu, and M\. Sun\(2026\)AgentCPM\-report: interleaving drafting and deepening for open\-ended deep research\.External Links:2602\.06540,[Link](https://arxiv.org/abs/2602.06540)Cited by:[§2\.1](https://arxiv.org/html/2606.04507#S2.SS1.p1.1)\.
- \[29\]B\. Liu, S\. Yu, Z\. Liu, L\. Guertler, P\. Qi, D\. Balcells, M\. Liu, C\. Tan, W\. Shi, M\. Lin, W\. S\. Lee, and N\. Jaques\(2026\)SPIRAL: self\-play on zero\-sum games incentivizes reasoning via multi\-agent multi\-turn reinforcement learning\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=7Yayy5fNLg)Cited by:[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[30\]H\. Lu, Y\. Wen, P\. Cheng, R\. Ding, H\. Xu, J\. Guo, C\. Wang, H\. Chen, X\. Jiang, and G\. Jiang\(2026\)Search self\-play: pushing the frontier of agent capability without supervision\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=ZmGirmNJqE)Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[31\]Open\-deep\-research\.External Links:[Link](https://github.com/langchain-ai/open_deep_research)Cited by:[§5\.1](https://arxiv.org/html/2606.04507#S5.SS1.SSS0.Px2.p1.1)\.
- \[32\]Qwen, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu\(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[1st item](https://arxiv.org/html/2606.04507#A4.I1.i1.p1.1),[§5\.1](https://arxiv.org/html/2606.04507#S5.SS1.SSS0.Px2.p1.1)\.
- \[33\]T\. Seedance, D\. Chen, L\. Chen, X\. Chen, Y\. Chen, Z\. Chen, Z\. Chen, F\. Cheng, T\. Cheng, Y\. Cheng,et al\.\(2026\)Seedance 2\.0: advancing video generation for world complexity\.arXiv preprint arXiv:2604\.14148\.Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p1.1)\.
- \[34\]Y\. Shao, Y\. Jiang, T\. Kanell, P\. Xu, O\. Khattab, and M\. Lam\(2024\-06\)Assisting in writing Wikipedia\-like articles from scratch with large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 6252–6278\.External Links:[Link](https://aclanthology.org/2024.naacl-long.347/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.347)Cited by:[§2\.1](https://arxiv.org/html/2606.04507#S2.SS1.p1.1)\.
- \[35\]Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo\(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.CoRRabs/2402\.03300\.External Links:[Link](https://doi.org/10.48550/arXiv.2402.03300)Cited by:[§5\.1](https://arxiv.org/html/2606.04507#S5.SS1.SSS0.Px1.p1.1)\.
- \[36\]G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu\(2025\-03\)HybridFlow: a flexible and efficient rlhf framework\.InProceedings of the Twentieth European Conference on Computer Systems,EuroSys ’25,pp\. 1279–1297\.External Links:[Link](http://dx.doi.org/10.1145/3689031.3696075),[Document](https://dx.doi.org/10.1145/3689031.3696075)Cited by:[§5\.1](https://arxiv.org/html/2606.04507#S5.SS1.SSS0.Px1.p1.1)\.
- \[37\]J\. Shi, Q\. Ma, H\. Liu, H\. Zhao, J\. Hwang, and L\. Li\(2026\)Intrinsic entropy of context length scaling in llms\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§5\.2](https://arxiv.org/html/2606.04507#S5.SS2.SSS0.Px3.p1.6)\.
- \[38\]Z\. Shi, L\. Yan, D\. Yin, S\. Verberne, M\. de Rijke, and Z\. Ren\(2026\)Iterative self\-incentivization empowers large language models as agentic searchers\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=s9NkfkUuEr)Cited by:[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[39\]H\. Song, J\. Jiang, Y\. Min, J\. Chen, Z\. Chen, W\. X\. Zhao, L\. Fang, and J\. Wen\(2025\)R1\-searcher: incentivizing the search capability in llms via reinforcement learning\.External Links:2503\.05592,[Link](https://arxiv.org/abs/2503.05592)Cited by:[§2\.1](https://arxiv.org/html/2606.04507#S2.SS1.p1.1)\.
- \[40\]S\. Sun, H\. Song, Y\. Wang, R\. Ren, J\. Jiang, J\. Zhang, F\. Bai, J\. Deng, W\. X\. Zhao, Z\. Liu, L\. Fang, Z\. Wang, and J\. Wen\(2025\-11\)SimpleDeepSearcher: deep information seeking via web\-powered reasoning trajectory synthesis\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 13705–13720\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.739/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.739),ISBN 979\-8\-89176\-335\-7Cited by:[§2\.1](https://arxiv.org/html/2606.04507#S2.SS1.p1.1)\.
- \[41\]Y\. Tang, S\. Wang, L\. Madaan, and R\. Munos\(2026\)Beyond verifiable rewards: scaling reinforcement learning in language models to unverifiable data\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=pc6M9h3T9m)Cited by:[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[42\]GLM\-4\.5 Team, A\. Zeng, X\. Lv, Q\. Zheng, Z\. Hou, B\. Chen, C\. Xie, C\. Wang, D\. Yin, H\. Zeng, J\. Zhang, K\. Wang, L\. Zhong, M\. Liu,et al\.\(2025\)GLM\-4\.5: agentic, reasoning, and coding \(arc\) foundation models\.External Links:2508\.06471,[Link](https://arxiv.org/abs/2508.06471)Cited by:[3rd item](https://arxiv.org/html/2606.04507#A4.I1.i3.p1.1)\.
- \[43\]Y\. Tuohetiyaer, Y\. Zhu, Y\. Hu, S\. Lu, and Z\. Wang\(2026\)Deep\-research eval: an automated framework for assessing quality and reliability in long\-form reports\.Applied Sciences16\(5\)\.External Links:[Link](https://www.mdpi.com/2076-3417/16/5/2546),ISSN 2076\-3417,[Document](https://dx.doi.org/10.3390/app16052546)Cited by:[§G\.2](https://arxiv.org/html/2606.04507#A7.SS2.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2606.04507#S5.SS1.SSS0.Px3.p1.1)\.
- \[44\]R\. J\. Williams\(1992\)Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.Machine learning8\(3\),pp\. 229–256\.Cited by:[§5\.1](https://arxiv.org/html/2606.04507#S5.SS1.SSS0.Px1.p1.1)\.
- \[45\]R\. Xu, Y\. Zhuang, Z\. Dong, R\. Wang, Y\. Yu, J\. C\. Ho, L\. Zhang, H\. Wang, W\. Shi, and C\. Yang\(2026\)AceSearcher: bootstrapping reasoning and search for LLMs via reinforced self\-play\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=jSgCM0uZn3)Cited by:[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[46\]Z\. Yan, Y\. Shao, M\. Liao, S\. Chen, N\. Wang, M\. Lin, J\. Hwang, H\. Zhao, F\. Remondino, and L\. Li\(2026\-03\)3DSceneEditor: controllable 3d scene editing with gaussian splatting\.InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision \(WACV\),pp\. 1852–1863\.Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p1.1)\.
- \[47\]A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu\(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[1st item](https://arxiv.org/html/2606.04507#A4.I1.i1.p1.1)\.
- \[48\]K\. Yang, Z\. Guo, G\. Lin, H\. Dong, Z\. Huang, Y\. Wu, D\. Zuo, J\. Peng, Z\. Zhong, X\. WANG, Q\. Guo, X\. Jia, J\. Yan, and D\. Lin\(2025\)Trajectory\-LLM: a language\-based data generator for trajectory prediction in autonomous driving\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=UapxTvxB3N)Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p1.1)\.
- \[49\]Z\. Yao, X\. Cheng, Z\. Huang, and L\. Li\(2025\)CountLLM: towards generalizable repetitive action counting via large language model\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p1.1)\.
- \[50\]Z\. Yue, K\. Upasani, X\. Yang, S\. Ge, S\. Nie, Y\. Mao, Z\. Liu, and D\. Wang\(2026\)Dr\. zero: self\-evolving search agents without training data\.arXiv preprint arXiv:2601\.07055\.Cited by:[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[51\]D\. Zhang, Y\. Zhao, J\. Wu, L\. Zhang, B\. Li, W\. Yin, Y\. Jiang, Y\. Li, K\. Tu, P\. Xie, and F\. Huang\(2025\-11\)EvolveSearch: an iterative self\-evolving search agent\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 13123–13136\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.663/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.663),ISBN 979\-8\-89176\-332\-6Cited by:[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[52\]Q\. Zhang, Y\. Wang, T\. Wu, C\. Xu, Q\. Sun, K\. Zheng, X\. Liu, and C\. Ma\(2026\)Beyond length scaling: synergizing breadth and depth for generative reward models\.External Links:2603\.01571,[Link](https://arxiv.org/abs/2603.01571)Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p3.1)\.
- \[53\]R\. Zhang, Z\. Xu, C\. Ma, C\. Yu, W\. Tu, W\. Tang, S\. Huang, D\. Ye, W\. Ding, Y\. Yang,et al\.\(2024\)A survey on self\-play methods in reinforcement learning\.arXiv preprint arXiv:2408\.01072\.Cited by:[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[54\]X\. Zhang, S\. Chen, J\. Zhou, and L\. Li\(2026\)PSGS: text\-driven panorama sliding scene generation via gaussian splatting\.arXiv preprint arXiv:2602\.00463\.Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p1.1)\.
- \[55\]A\. Zhao, Y\. Wu, Y\. Yue, T\. Wu, Q\. Xu, Y\. Yue, M\. Lin, S\. Wang, Q\. Wu, Z\. Zheng, and G\. Huang\(2025\)Absolute zero: reinforced self\-play reasoning with zero data\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=neZSGqhxDa)Cited by:[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[56\]X\. Zhao, Z\. Kang, A\. Feng, S\. Levine, and D\. Song\(2026\)Learning to reason without external rewards\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=OU9nFEYR2M)Cited by:[§2\.2](https://arxiv.org/html/2606.04507#S2.SS2.p1.1)\.
- \[57\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing,et al\.\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems36,pp\. 46595–46623\.Cited by:[§1](https://arxiv.org/html/2606.04507#S1.p3.1),[§5\.1](https://arxiv.org/html/2606.04507#S5.SS1.SSS0.Px3.p1.1)\.
- \[58\]Y\. Zheng, D\. Fu, X\. Hu, X\. Cai, L\. Ye, P\. Lu, and P\. Liu\(2025\-11\)DeepResearcher: scaling deep research via reinforcement learning in real\-world environments\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 414–431\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.22/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.22),ISBN 979\-8\-89176\-332\-6Cited by:[§2\.1](https://arxiv.org/html/2606.04507#S2.SS1.p1.1)\.
- \[59\]Y\. Zhou, C\. Cao, J\. Yang, L\. Wu, C\. He, S\. Han, and Y\. Guo\(2026\)LRAS: advanced legal reasoning with agentic search\.External Links:2601\.07296,[Link](https://arxiv.org/abs/2601.07296)Cited by:[§2\.1](https://arxiv.org/html/2606.04507#S2.SS1.p1.1)\.

## Appendix A Limitations

Despite the effectiveness of our self\-evolving framework, we acknowledge several limitations in our current study\. First, deep research report generation demands exceptional instruction\-following capabilities from the base language models\. Models with weaker foundational abilities may struggle with dynamic rubric generation and tool calling\. Second, our training paradigm relies heavily on unverifiable rewards\. This lack of ground truth makes the model susceptible to policy collapse during optimization\. Consequently, careful hyperparameter tuning and strict early stopping are necessary to ensure training stability\. Third, even though our framework provides differentiated reward signals for specific criteria, aggressively optimizing for one dimension can sometimes degrade performance in another\. Balancing these competing objectives simultaneously remains a challenging problem for future exploration\.

## Appendix B Future Work

In our experiments, we observe that inherent trade\-offs still exist during training for deep research report generation\. We hypothesize that this issue is primarily caused by sparse reward signals resulting from exceptionally long agent trajectories\. To address this challenge, future work will explore phased training paradigms\. By decomposing the deep research generation process into distinct functional stages and assigning intermediate rewards, we aim to mitigate the long\-standing credit assignment problem that affects end\-to\-end policy optimization\.

## Appendix C LLM Usage Declaration

LLMs were used to improve the grammar and language quality of the paper and to check the correctness and clarity of the proofs\. In the experiments, open\-source LLMs served as the base models for post\-training, while closed\-source LLMs served as judges for assessment\. In addition, a coding agent was used to assist in developing parts of the training framework\.

## Appendix D Licenses

We use the following models in the experiments:

- Qwen2\.5\-7B\-Instruct, Qwen2\.5\-72B\-Instruct \[[32](https://arxiv.org/html/2606.04507#bib.bib47)\] and Qwen3\-8B \[[47](https://arxiv.org/html/2606.04507#bib.bib51)\], under the Apache 2\.0 license\.
- Llama\-3\.1\-8B\-Instruct \[[14](https://arxiv.org/html/2606.04507#bib.bib48)\], under the Llama 3\.1 Community license\.
- GLM\-4\.7\-Flash \[[42](https://arxiv.org/html/2606.04507#bib.bib52)\] and DeepSeek\-V3\.2 \[[5](https://arxiv.org/html/2606.04507#bib.bib53)\], under the MIT license\.
- Gemini\-2\.5\-pro \[[4](https://arxiv.org/html/2606.04507#bib.bib54)\], under the Apache 2\.0 license, and GPT\-5\.2 \[[18](https://arxiv.org/html/2606.04507#bib.bib46)\], under a proprietary license\.

## Appendix E Algorithm

Algorithm [1](https://arxiv.org/html/2606.04507#alg1) summarizes the main shared\-actor evaluator–solver training loop, while Algorithm [2](https://arxiv.org/html/2606.04507#alg2) details the evaluator\-side consistency reward construction and rubric adaptation step\. In the current implementation, PairwiseConsistency is instantiated as the mean pairwise TF\-IDF cosine similarity over extracted rollout conclusions\.
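The consistency metric described above can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the tokenizer, the smoothed IDF formula, and the function names (`tokenize`, `tfidf_vectors`, `pairwise_consistency`) are all assumptions made for illustration, and the IDF statistics are fitted on the rollout conclusions of a single query only.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase alphanumeric tokenizer (an assumption; the paper does not specify one)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (assumed variant) so terms shared by all documents keep a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_consistency(conclusions):
    """Mean cosine similarity over all unordered pairs of rollout conclusions."""
    vectors = tfidf_vectors(conclusions)
    pairs = list(combinations(range(len(vectors)), 2))
    if not pairs:
        return 0.0  # fewer than two rollouts: no pair to compare
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


print(pairwise_consistency(["rates will rise", "rates will fall", "rates will rise"]))
```

Identical conclusions score close to 1 and conclusions with no shared vocabulary score 0, so the value can be read as a rough lexical reproducibility measure across rollouts.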

Algorithm 1 SCORE: Evaluator–Solver Self\-Play with Consistency\-Driven Dimension Learning

1. Input: shared actor parameters $\theta$; evaluator role policy $\pi_e(\cdot;\theta)$; solver role policy $\pi_s(\cdot;\theta)$; dimension pool $\mathcal{D}$; solver rollout size $K$; batch size $B$; evaluator update period $U$; harness refresh period $M$; minimum verification/search depth $\tau_s$; KL coefficient $\lambda_{\mathrm{KL}}$
2. Output: trained shared actor parameters $\theta$
3. Initialize DimensionScheduler $\mathcal{S}$ over $\mathcal{D}$
4. Initialize Meta\-Harness state $\mathcal{M}_0 \leftarrow \mathcal{M}_{\mathrm{default}}$
5. Initialize evaluator\-side EMA baseline $b_0 \leftarrow 0$
6. for each training step $t = 1, 2, \dots, T_{\max}$ do
7. ⊳ *Phase 1: Refresh scheduler and harness constraints*
8. $(\mathcal{R}_t, \mathcal{A}_t) \leftarrow \mathcal{S}.\textsc{GetConstraints}()$ ⊳ retrieve the current required / avoided dimensions and exploration hints
9. Obtain current dimension and process constraints from $\mathcal{M}_{t-1}$ ⊳ the meta\-harness defines the admissible rubric space for this step
10. ⊳ *Phase 2: Evaluator samples a rubric*
11. Sample query batch $\{q_i\}_{i=1}^{B}$
12. Build evaluator rubric\-generation prompt $x_t^{\mathrm{rubric}}$ from $(\mathcal{R}_t, \mathcal{A}_t, \mathcal{M}_{t-1})$
13. Sample rubric $(H_t, w_t),\ \log\pi_e(H_t, w_t \mid x_t^{\mathrm{rubric}};\theta) \sim \pi_e(\cdot \mid x_t^{\mathrm{rubric}};\theta)$ ⊳ the evaluator chooses what dimensions to emphasize and how to weight them
14. ⊳ *Phase 3: Solver performs rubric\-conditioned rollouts*
15. for each query $q_i$, each rollout $k = 1, \dots, K$ do
16. Generate report trajectory and final report $y_i^k \sim \pi_s(\cdot \mid q_i, H_t, w_t;\theta)$
17. Record solver\-side evidence / tool observations $D_{q_i,k}^{s}$ ⊳ the solver searches, interacts with tools, and produces a long\-form report
18. end for
19. ⊳ *Phase 4: Evaluator assesses reports*
20. Optionally build evaluator\-side evidence contexts $\{D_{q_i}^{e}\}_{i=1}^{B}$ through constrained retrieval / verification ⊳ the evaluator may gather its own evidence for rubric\-conditioned assessment
21. for each query $q_i$, each rollout $k = 1, \dots, K$ do
22. Compute evaluator\-side report reward $s_i^k \leftarrow \textsc{EvalScore}(q_i, y_i^k, H_t, w_t, D_{q_i}^{e}, \mathcal{M}_{t-1};\theta)$
23. if evaluator verification/search depth for $(i,k)$ is below $\tau_s$ then set $s_i^k \leftarrow 0$ ⊳ discard under\-verified rewards to avoid noisy solver supervision
24. end for
25. ⊳ *Phase 5: Solver\-side GRPO update*
26. for each query $q_i$ do
27. $\mu_i \leftarrow \operatorname{mean}_k s_i^k, \quad \sigma_i \leftarrow \operatorname{std}_k s_i^k$
28. $\hat{A}_i^k \leftarrow (s_i^k - \mu_i)/(\sigma_i + \varepsilon)$ ⊳ group\-normalized rewards define solver\-side advantages
29. end for
30. Update shared actor parameters $\theta$ using the solver\-side GRPO objective with KL regularization against $\pi_{\mathrm{ref}}$
31. ⊳ *Phase 6: Evaluator\-side consistency update*
32. Compute the evaluator\-side consistency signal and update the evaluator role as detailed in Algorithm [2](https://arxiv.org/html/2606.04507#alg2) ⊳ the evaluator is rewarded when its rubric induces more reproducible solver behavior
33. ⊳ *Phase 7: Refresh scheduler and meta\-harness*
34. $\mathcal{S}.\textsc{UpdateStats}(H_t, \{C_i\}_{i=1}^{B})$ ⊳ record which dimensions produced more reproducible rollouts
35. if $t \bmod M = 0$ then
36. Collect harness statistics $\mathcal{T}_t$
37. $\mathcal{M}_t \leftarrow \textsc{Harness.Generate}(\mathcal{T}_t, \mathcal{M}_{t-1}, \mathcal{M}_{\mathrm{default}})$ ⊳ periodically refresh the admissible evaluation space
38. else
39. $\mathcal{M}_t \leftarrow \mathcal{M}_{t-1}$
40. end if
41. end for
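The reward gating of Phase 4 and the group normalization of Phase 5 can be sketched as follows. This is an illustrative reading of Algorithm 1, not the authors' code: the function names `gate_rewards` and `group_advantages`, the use of the population standard deviation, and the value of `eps` are assumptions.

```python
import statistics


def gate_rewards(scores, depths, min_depth):
    """Zero out rewards whose evaluator verification/search depth is below the threshold."""
    return [s if d >= min_depth else 0.0 for s, d in zip(scores, depths)]


def group_advantages(scores, eps=1e-6):
    """Group-normalized advantages for the K rollouts of a single query."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)  # population std; the paper does not say which variant
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical evaluator scores and search depths for K = 4 rollouts of one query.
scores = [6.0, 7.5, 5.0, 8.0]
depths = [3, 1, 2, 4]
gated = gate_rewards(scores, depths, min_depth=2)  # the second rollout is under-verified
print(gated)
print(group_advantages(gated))
```

When every rollout in a group receives the same score, all advantages are zero and that query contributes no gradient, which is one way a saturated static evaluator stops providing optimization pressure.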

Algorithm 2 Evaluator\-Side Consistency Reward and Rubric Update

1. Input: rollout reports $\{y_i^k\}_{i=1,\dots,B}^{k=1,\dots,K}$; sampled rubric $(H_t, w_t)$; rubric prompt $x_t^{\mathrm{rubric}}$; previous EMA baseline $b_{t-1}$; baseline decay $\beta$; evaluator scale $\alpha$; evaluator update period $U$
2. Output: updated EMA baseline $b_t$, consistency statistics $\{C_i\}_{i=1}^{B}$, and optionally updated shared actor parameters $\theta$
3. ⊳ *Phase 1: Build a batch\-level consistency signal*
4. for each query $q_i$ do
5. Extract rollout conclusions $\{c_i^k\}_{k=1}^{K}$ from $\{y_i^k\}_{k=1}^{K}$ ⊳ measure agreement at the conclusion level rather than at the token level
6. Compute empirical consistency $C_i \leftarrow \textsc{PairwiseConsistency}(\{c_i^k\}_{k=1}^{K})$
7. end for
8. Compute evaluator\-side raw signal $R_{\mathrm{eval}} \leftarrow \frac{1}{B}\sum_{i=1}^{B} C_i$ ⊳ the raw signal favors rubrics under which solver rollouts are more reproducible
9. ⊳ *Phase 2: Center the signal with an EMA baseline*
10. Compute centered evaluator\-side signal $\hat{A}_{\mathrm{eval}} \leftarrow R_{\mathrm{eval}} - b_{t-1}$
11. Update EMA baseline $b_t \leftarrow \beta\, b_{t-1} + (1-\beta)\, R_{\mathrm{eval}}$ ⊳ the EMA baseline stabilizes REINFORCE updates without changing the raw preference
12. ⊳ *Phase 3: Update the evaluator role*
13. if $t \bmod U = 0$ and evaluator updates are enabled then
14. Update shared actor parameters $\theta$ using the evaluator\-side REINFORCE objective $\theta \leftarrow \theta + \eta\,\alpha\,\hat{A}_{\mathrm{eval}}\,\nabla_{\theta}\log\pi_e(H_t, w_t \mid x_t^{\mathrm{rubric}};\theta)$ ⊳ rubrics that induce more reproducible solver behavior become more likely
15. end if
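Phases 1 and 2 of Algorithm 2 reduce to a few lines of arithmetic. The sketch below is illustrative only: `evaluator_signal` is a hypothetical helper name, the per-query consistency values are made up, and the default decay of 0.7 is the value listed in Table 2.

```python
def evaluator_signal(consistencies, baseline, beta=0.7):
    """Return (centered advantage, updated EMA baseline) for one training step."""
    r_eval = sum(consistencies) / len(consistencies)  # R_eval: batch mean of C_i
    advantage = r_eval - baseline                      # centered with the previous baseline
    new_baseline = beta * baseline + (1.0 - beta) * r_eval
    return advantage, new_baseline


baseline = 0.0  # b_0 in Algorithm 1
# Hypothetical per-query consistency scores C_i for three consecutive steps.
for step, batch in enumerate([[0.4, 0.6], [0.5, 0.7], [0.6, 0.8]], start=1):
    advantage, baseline = evaluator_signal(batch, baseline)
    print(f"step {step}: advantage={advantage:.3f}, baseline={baseline:.3f}")
```

Because the baseline starts at zero, the first few advantages are strongly positive and shrink as the baseline catches up with the typical consistency level.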

## Appendix F Detailed Proofs

This appendix provides derivations for the results stated in Section [4](https://arxiv.org/html/2606.04507#S4)\. Throughout, let $\theta\in\mathbb{R}^{d}$ denote the shared actor parameters, and let

$$\widetilde{\mathcal{L}}(\theta)=\mathcal{L}_{s}(\theta)+\alpha\mathcal{L}_{e}(\theta)+\lambda\,\mathrm{KL}\!\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right)$$

denote the KL\-regularized surrogate potential\.

### F\.1 Derivation of Proposition [1](https://arxiv.org/html/2606.04507#Thmtheorem1)

Recall that $\Gamma_q := (H_q, w_q)$ denotes the evaluator\-selected rubric for query $q$\. Suppose the solver generates $K$ reports under the same query, evidence context, and rubric\. Let

$$c_{ij}(\Gamma_q), \qquad 1 \leq i < j \leq K,$$

denote the pairwise agreement score between reports $r_i$ and $r_j$ under rubric $\Gamma_q$, and let

$$M = \binom{K}{2} = \frac{K(K-1)}{2}$$

be the number of report pairs\. In the current implementation, the empirical consistency score is defined as

$$C(\Gamma_q) = \frac{1}{M}\sum_{1 \leq i < j \leq K} c_{ij}(\Gamma_q).$$

The raw evaluator\-side auxiliary signal is

$$R_e(\Gamma_q) = C(\Gamma_q).$$

Let

$$\mu_C(\Gamma_q) := \mathbb{E}[c_{ij}(\Gamma_q)]$$

denote the population mean of the consistency metric under rubric $\Gamma_q$\. Then

$$\mathbb{E}[R_e(\Gamma_q)] = \mathbb{E}[C(\Gamma_q)] = \mathbb{E}\!\left[\frac{1}{M}\sum_{1 \leq i < j \leq K} c_{ij}(\Gamma_q)\right].$$

By linearity of expectation,

$$\mathbb{E}[R_e(\Gamma_q)] = \frac{1}{M}\sum_{1 \leq i < j \leq K}\mathbb{E}[c_{ij}(\Gamma_q)] = \frac{1}{M}\sum_{1 \leq i < j \leq K}\mu_C(\Gamma_q) = \mu_C(\Gamma_q).$$

Taking expectation over the induced rubric distribution yields

$$\mathbb{E}[R_e(\Gamma_q)] = \mathbb{E}[\mu_C(\Gamma_q)].$$

This proves Proposition [1](https://arxiv.org/html/2606.04507#Thmtheorem1)\.

In implementation, evaluator updates use the centered signal

$$A_e(\Gamma_q) = R_e(\Gamma_q) - b_{t-1},$$

where $b_{t-1}$ is the exponential moving average baseline carried over from the previous step, as in Algorithm [2](https://arxiv.org/html/2606.04507#alg2)\. Since $b_{t-1}$ is independent of the current sampled rubric action, subtracting it reduces gradient variance without changing the underlying expected preference encoded by $R_e(\Gamma_q)$\.
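As a concrete instance of the averaging step, take the solver rollout number $K = 8$ listed in Table 2; this choice is only for illustration, since the argument holds for any $K \geq 2$. The identity relies on linearity of expectation alone, so the pairwise scores do not need to be independent, only to share the same mean $\mu_C(\Gamma_q)$.

```latex
M = \binom{8}{2} = \frac{8 \cdot 7}{2} = 28,
\qquad
\mathbb{E}\bigl[C(\Gamma_q)\bigr]
  = \frac{1}{28} \sum_{1 \le i < j \le 8} \mathbb{E}\bigl[c_{ij}(\Gamma_q)\bigr]
  = \frac{1}{28} \cdot 28\, \mu_C(\Gamma_q)
  = \mu_C(\Gamma_q).
```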

### F\.2 Derivation of Proposition [2](https://arxiv.org/html/2606.04507#Thmtheorem2)

The shared\-actor updates are

$$\theta' = \theta_t - \eta_s \hat{g}_s(\theta_t), \qquad \theta_{t+1} = \theta' - \eta_e \hat{g}_e(\theta').$$

Hence,

$$\theta_{t+1} - \theta_t = -\eta_s \hat{g}_s(\theta_t) - \eta_e \hat{g}_e(\theta').$$

Taking the expectation conditional on $\theta_t$ gives

$$\mathbb{E}[\theta_{t+1} - \theta_t \mid \theta_t] = -\eta_s \mathbb{E}[\hat{g}_s(\theta_t) \mid \theta_t] - \eta_e \mathbb{E}[\hat{g}_e(\theta') \mid \theta_t].$$

Under Assumption \(A2\),

$$\mathbb{E}[\hat{g}_s(\theta_t) \mid \theta_t] \approx \nabla\mathcal{L}_s(\theta_t), \qquad \mathbb{E}[\hat{g}_e(\theta') \mid \theta_t] \approx \nabla\mathcal{L}_e(\theta').$$

Thus,

$$\mathbb{E}[\theta_{t+1} - \theta_t \mid \theta_t] \approx -\eta_s \nabla\mathcal{L}_s(\theta_t) - \eta_e \nabla\mathcal{L}_e(\theta').$$

By Assumption \(A1\), $\mathcal{L}_e$ is $L$\-smooth, so

$$\|\nabla\mathcal{L}_e(\theta') - \nabla\mathcal{L}_e(\theta_t)\| \leq L\|\theta' - \theta_t\|.$$

From the first update,

$$\theta' - \theta_t = -\eta_s \hat{g}_s(\theta_t),$$

and by Assumption \(A3\), $\|\hat{g}_s(\theta_t)\| \leq G$\. Therefore,

$$\|\theta' - \theta_t\| \leq \eta_s G,$$

which implies

$$\|\nabla\mathcal{L}_e(\theta') - \nabla\mathcal{L}_e(\theta_t)\| \leq L \eta_s G.$$

Hence we may write

$$\nabla\mathcal{L}_e(\theta') = \nabla\mathcal{L}_e(\theta_t) + \delta_t, \qquad \|\delta_t\| = O(\eta_s).$$

Substituting this into the expected update yields

$$\mathbb{E}[\theta_{t+1} - \theta_t \mid \theta_t] = -\eta_s \nabla\mathcal{L}_s(\theta_t) - \eta_e \nabla\mathcal{L}_e(\theta_t) - \eta_e \delta_t.$$

Using $\alpha = \eta_e/\eta_s$, we obtain

$$\mathbb{E}[\theta_{t+1} - \theta_t \mid \theta_t] = -\eta_s \nabla\bigl(\mathcal{L}_s(\theta_t) + \alpha\mathcal{L}_e(\theta_t)\bigr) + \rho_t^{(0)},$$

where

$$\rho_t^{(0)} = -\eta_e \delta_t, \qquad \|\rho_t^{(0)}\| \leq \eta_e L \eta_s G = \alpha L G\, \eta_s^{2} = O(\eta_s^{2}),$$

with the last step holding for a fixed ratio $\alpha$\.

Including KL regularization against the fixed reference policy $\pi_{\mathrm{ref}}$, we define

$$\widetilde{\mathcal{L}}(\theta) = \mathcal{L}_s(\theta) + \alpha\mathcal{L}_e(\theta) + \lambda\,\mathrm{KL}\!\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right).$$

The same argument then gives

$$\mathbb{E}[\theta_{t+1} - \theta_t \mid \theta_t] = -\eta_s \nabla\widetilde{\mathcal{L}}(\theta_t) + \rho_t, \qquad \|\rho_t\| = O(\eta_s^{2}),$$

which proves Proposition [2](https://arxiv.org/html/2606.04507#Thmtheorem2)\.
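The second-order size of the residual can be checked numerically on a toy problem. The sketch below is not taken from the paper: it assumes one-dimensional quadratic losses, exact (noise-free) gradients, and no KL term, and every constant in it is arbitrary. It compares the sequential solver-then-evaluator update with a single joint step on $\mathcal{L}_s + \alpha\mathcal{L}_e$.

```python
def residual(eta_s, alpha=0.5, theta=1.0, a=2.0, p=0.0, b=3.0, q=-1.0):
    """Gap between the sequential update and one joint gradient step.

    Toy losses (assumptions): L_s = 0.5 * a * (theta - p)^2, L_e = 0.5 * b * (theta - q)^2.
    """
    def grad_s(th):
        return a * (th - p)

    def grad_e(th):
        return b * (th - q)

    eta_e = alpha * eta_s
    theta_mid = theta - eta_s * grad_s(theta)            # solver step
    theta_seq = theta_mid - eta_e * grad_e(theta_mid)    # evaluator step taken at theta'
    theta_joint = theta - eta_s * (grad_s(theta) + alpha * grad_e(theta))
    return theta_seq - theta_joint


for eta in (0.04, 0.02, 0.01):
    r = residual(eta)
    print(f"eta_s={eta}: residual={r:.3e}, residual/eta_s^2={r / eta**2:.3f}")
```

For these quadratics the residual equals $\alpha\, a\, b\,(\theta - p)\,\eta_s^{2}$ exactly, so halving $\eta_s$ divides it by four, in line with the $O(\eta_s^{2})$ bound.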

### F\.3 Summary

The above derivations establish two facts used in the main text:

1. the raw evaluator\-side consistency signal estimates the expected reproducibility induced by a rubric under repeated solver rollouts;
2. the shared\-parameter solver\-evaluator alternating updates follow, to first order, a KL\-regularized surrogate descent direction, with only a higher\-order residual due to sequential updating\.

## Appendix G Experiments

### G\.1 Training Details

Table 2: Main training hyperparameters in SCORE\.

| Category | Configuration |
| --- | --- |
| Learning rate | $1\times 10^{-6}$ |
| Warmup | 5 steps, warmup ratio 0\.05 |
| Training data size | 255 for Qwen2\.5\-7B and 2,380 for Llama\-3\.1\-8B |
| KL regularization | Coefficient 0\.05, low\-variance KL, applied in actor loss only |
| Batch size | Training batch size 64; PPO mini\-batch size 32; PPO micro\-batch size 4 per GPU |
| Rollouts | Solver rollout number 8; evaluator rollout number 1 |
| Context and generation limits | Maximum input length 8192; maximum response length 4096; evaluator maximum generation length 2048 |
| Interaction budget | Maximum assistant turns 10 |
| Evaluator update | EMA baseline decay 0\.7; update frequency every step; temperature 0\.8 |
| Meta\-harness | Refresh frequency every 5 training steps |
| Training retrieval | Local static retrieval backend using Wikidata |
| Compute | Single\-node training with 4 H800 GPUs, 100 CPU cores, and 1 TB host memory |
| Systems optimization | FSDP, Ray orchestration, gradient checkpointing, parameter offloading, optimizer offloading |
| Training schedule | 50\-120 steps |

### G\.2 Experimental Results

#### Results of DeepResearchEval

Table 3: Performance of open\-deep\-research and gpt\-researcher integrated with LLaMA\-3\.1 and Qwen2\.5 on DeepResearchEval\. Cove: Breadth, depth, and relevance of coverage\. Insight: Depth, originality, logic, and value of analysis\. IF: Accuracy in meeting all requirements\. Clar: Clarity, fluency, structure, and ease of understanding\. Meta: Query\-specific meta\-evaluation dimensions generated dynamically for each task\. Values in parentheses are changes relative to the corresponding base model\.

| Models and Agents | Cove | Insight | IF | Clar | Meta |
| --- | --- | --- | --- | --- | --- |
| Agent: Open\-deep\-research \(ReAct Paradigm\) | | | | | |
| Llama\-3\.1\-8B\-Inst\. | 3\.7 | 3\.2 | 4\.5 | 5\.1 | 2\.6 |
| \+ GRPO | 1\.2 \(\-2\.5\) | 1\.3 \(\-1\.9\) | 1\.4 \(\-3\.1\) | 1\.8 \(\-3\.3\) | 1\.0 \(\-1\.6\) |
| \+ DPO | 4\.3 \(\+0\.6\) | 4\.3 \(\+1\.1\) | 4\.7 \(\+0\.2\) | 5\.2 \(\+0\.1\) | 3\.5 \(\+0\.9\) |
| \+ SCORE \(Ours\) | 4\.3 \(\+0\.6\) | 4\.4 \(\+1\.2\) | 4\.7 \(\+0\.2\) | 5\.2 \(\+0\.1\) | 3\.5 \(\+0\.9\) |
| Qwen2\.5\-7B\-Inst\. | 4\.4 | 3\.9 | 5\.2 | 5\.4 | 3\.2 |
| \+ GRPO | 4\.4 \(\+0\.0\) | 4\.0 \(\+0\.1\) | 5\.2 \(\+0\.0\) | 5\.4 \(\+0\.0\) | 3\.3 \(\+0\.1\) |
| \+ DPO | 4\.6 \(\+0\.2\) | 4\.1 \(\+0\.2\) | 5\.6 \(\+0\.4\) | 5\.8 \(\+0\.4\) | 3\.4 \(\+0\.2\) |
| \+ SCORE \(Ours\) | 6\.3 \(\+1\.9\) | 5\.9 \(\+2\.0\) | 7\.0 \(\+1\.8\) | 6\.6 \(\+1\.2\) | 5\.0 \(\+1\.8\) |
| Agent: gpt\-researcher \(Plan\-and\-Execute Paradigm\) | | | | | |
| Llama\-3\.1\-8B\-Inst\. | 3\.9 | 3\.5 | 4\.8 | 5\.6 | 2\.8 |
| \+ GRPO | 2\.5 \(\-1\.4\) | 2\.8 \(\-0\.7\) | 2\.7 \(\-2\.1\) | 3\.7 \(\-1\.9\) | 2\.0 \(\-0\.8\) |
| \+ DPO | 2\.4 \(\-1\.5\) | 2\.8 \(\-0\.7\) | 2\.6 \(\-2\.2\) | 3\.5 \(\-2\.1\) | 2\.0 \(\-0\.8\) |
| \+ SCORE \(Ours\) | 5\.9 \(\+2\.0\) | 5\.6 \(\+2\.1\) | 6\.9 \(\+2\.1\) | 6\.9 \(\+1\.3\) | 4\.6 \(\+1\.8\) |
| Qwen2\.5\-7B\-Inst\. | 4\.5 | 4\.1 | 5\.7 | 5\.9 | 3\.4 |
| \+ GRPO | 4\.6 \(\+0\.1\) | 4\.1 \(\+0\.0\) | 5\.6 \(\-0\.1\) | 5\.8 \(\-0\.1\) | 3\.4 \(\+0\.0\) |
| \+ DPO | 6\.1 \(\+1\.6\) | 5\.8 \(\+1\.7\) | 6\.8 \(\+1\.1\) | 6\.5 \(\+0\.6\) | 4\.8 \(\+1\.4\) |
| \+ SCORE \(Ours\) | 6\.5 \(\+2\.0\) | 6\.0 \(\+1\.9\) | 7\.5 \(\+1\.8\) | 7\.0 \(\+1\.1\) | 5\.1 \(\+1\.7\) |

We also evaluate the search agents on DeepResearchEval \[[43](https://arxiv.org/html/2606.04507#bib.bib50)\], where evaluation rubrics are customized for each specific query\. The results in Table [3](https://arxiv.org/html/2606.04507#A7.T3) corroborate our previous observations: vanilla GRPO degrades severely on Llama\-3\.1\-8B and yields almost no gain on Qwen2\.5\-7B, whereas our framework achieves the highest or tied\-highest score on every dimension, including the challenging query\-specific Meta metric\. Interestingly, while generic readability declines on the DeepResearchBench benchmark, our method achieves consistent gains in the “Clarity” metric on DeepResearchEval\. This divergence is consistent with our trade\-off hypothesis: the generated texts exchange conversational simplicity for a highly structured and analytically clear presentation, which is heavily rewarded when evaluating complex, query\-specific research explanations\.

#### Training Data Size

We train SCORE with different training data sizes\. While the co\-evolving evaluator stabilizes training on smaller datasets, increasing the data scale accelerates model collapse\. The reason is that both the solver and the evaluator rely on intrinsic rewards rather than strictly verifiable external ground truth\. With limited training data, this internal feedback sharpens the model’s correct priors and provides the necessary dense constraints\. However, as the dataset size increases significantly, the solver progressively discovers blind spots in the evaluator and over\-optimizes against these proxy rewards\. The optimization then saturates, and factual accuracy degrades as the solver exploits the distributional limits of the proxy feedback\. As illustrated in Figure [5](https://arxiv.org/html/2606.04507#A7.F5), the solver entropy collapses prematurely at large scales, followed by a rise\-then\-fall trajectory in the training rewards\. This observation is consistent with recent theoretical findings on unsupervised reinforcement learning, which suggest that intrinsic reward mechanisms eventually amplify self\-reinforcing hallucinations when exposed to extensive data \[[15](https://arxiv.org/html/2606.04507#bib.bib58)\]\.

![Refer to caption](https://arxiv.org/html/2606.04507v1/figures/datasize_entropy.png) \(a\) Solver Entropy
![Refer to caption](https://arxiv.org/html/2606.04507v1/figures/datasize_r_eval.png) \(b\) Evaluator Reward
![Refer to caption](https://arxiv.org/html/2606.04507v1/figures/datasize_solver_reward.png) \(c\) Solver Reward

Figure 5: Training Details

#### Case Study

Case Analysis

General Evaluation Rubric: \[factuality, insight, coherence, complexity\]

User Query 1: What does Texas Penal Code Chapter 9\.42 specifically permit, and under what conditions can deadly force be used to protect property?

Report: Texas law permits deadly force to protect property when the force is deemed reasonable, especially in cases involving theft or unlawful intrusion\. The rule is often understood as allowing force when property cannot otherwise be protected\. In practice, this reflects a broad legal recognition of property rights and self\-defense principles\. Courts have generally supported the idea that homeowners …

Static evaluation: \[factuality: 7, insight: 2, coherence: 8, complexity: 7\]

Adaptive evaluation: \[factuality: 3, citation\_effectiveness: 2, coherence: 8, completeness: 3\]

Reason: The report is well\-written, but it does not ground its claims in the statutory text or provide valid citations\. Specifically, it omits the exact conditions enumerated in §9\.42, such as the nighttime requirement and the "no other means" clause, leading to an overly permissive interpretation\.

User Query 2: What are the main debates around replacing human workers with AI assistants in knowledge\-intensive jobs?

Report: AI can improve efficiency, reduce repetitive work, and increase access to expertise across organizations\. They may also lower costs and support faster decision\-making in research, operations, and customer service\. At the same time, concerns remain about job displacement, deskilling, accountability, and over\-reliance on automated systems\. Overall, most commentators agree that AI will reshape knowledge work, and organizations should gradually integrate these tools while monitoring their impact …

Static evaluation: \[factuality: 7, insight: 2, coherence: 8, complexity: 2\]

Adaptive evaluation: \[diversity: 3, fairness: 4, coherence: 8, complexity: 4\]

Reason: The report is coherent and thoughtful, but it presents only a limited set of viewpoints and does not sufficiently cover the competing perspectives\.

Figure 6: Adaptive evaluation vs\. static evaluation of different queries\.

Figure [6](https://arxiv.org/html/2606.04507#A7.F6) contrasts the scoring behavior of static and adaptive evaluation\. Under the static rubric, the solver exploits the generic "coherence" metric, securing a high score of 8 in both scenarios by generating superficially fluent prose\. However, this high structural score masks critical omissions in substantive content\. When our adaptive framework is applied, it synthesizes query\-specific constraints to interrogate the text\. In the legal scenario, the dynamic evaluation criteria focus on whether the report grounds its claims in specific statutory prerequisites\. Similarly, in the discussion scenario, the metrics expose the report’s failure to synthesize competing viewpoints, despite its seemingly thoughtful tone\. This dynamic assessment discourages the model from prioritizing elegant phrasing to secure high scores while neglecting factual accuracy or other critical dimensional requirements\.

## Appendix H Prompts

You will be provided with a reference and some statements\. Please determine whether each statement is ’supported’, ’unsupported’, or ’unknown’ with respect to the reference\. Please note:

First, assess whether the reference contains any valid content\. If the reference contains no valid information, such as a ’page not found’ message, then all statements should be considered ’unknown’\.

If the reference is valid, for a given statement: if the facts or data it contains can be found entirely or partially within the reference, it is considered ’supported’ \(data accepts rounding\); if all facts and data in the statement cannot be found in the reference, it is considered ’unsupported’\.

You should return the result in a JSON list format, where each item in the list contains the statement’s index and the judgment result, for example:

\[

\{\{

"idx":1,

"result":"supported"

\}\},

\{\{

"idx":2,

"result":"unsupported"

\}\}

\]

Below are the reference and statements:

<reference\>

\{reference\}

</reference\>

<statements\>

\{statements\}

</statements\>

Begin the assessment now\. Output only the JSON list, without any conversational text or explanations\.

Listing 1: Prompt for citation checking\.

<system\_role\>

You are an experienced research article evaluation expert\. Your task is to evaluate a deep research article based on the provided ‘<task\>‘ across four dimensions: Comprehensiveness, Insight, Instruction Following, and Readability\.

</system\_role\>

<task\>

"\{task\_prompt\}"

</task\>

<instruction\>

Please deeply analyze the task and evaluate the research report based on the following specific criteria for each dimension:

1\. Comprehensiveness \(Information Coverage\)

\- Evaluate the breadth, depth, and relevance of information coverage\.

\- Ensure the article covers key information areas, perspectives, and depths necessary to achieve comprehensive market or situational analysis\.

\- Check if it thoroughly examines all relevant factors \(e\.g\., technical, economic, social, geographical\) and avoids omitting critical facets of the topic\.

2\. Insight \(Analytical Depth\)

\- Evaluate the depth, originality, logic, and value of the analysis and conclusions\.

\- Assess whether the analysis goes beyond a superficial listing of factors to uncover core drivers, subtle mechanisms, and second\-order effects\.

\- Examine if the conclusions and recommendations are logically derived from the analysis, offer unique perspectives, and provide forward\-looking or actionable strategic value\.

3\. Instruction Following \(Responsiveness & Relevance\)

\- Evaluate whether the report accurately, completely, and directly responds to all specific instructions, questions, and core objectives within the ‘<task\>‘\.

\- Check for strict adherence to scope limitations \(e\.g\., specific geography, time, or subject constraints\) without significant deviation into irrelevant topics\.

4\. Readability \(Presentation Quality\)

\- Evaluate the clarity of structure, fluency of language, effectiveness of data presentation, and overall ease of understanding\.

\- Assess structural logic \(clear framework and logical flow\), language precision \(fluent, grammatically correct, and appropriate use of terminology\), and cohesive paragraph transitions\.

\- Check the clarity and accuracy of data presentation, including the effective use of visualizations \(charts, tables\) and the highlighting of key findings to reduce reader fatigue\.

</instruction\>

Listing 2: Prompt for writing quality evaluation

\[System Prompt\]

You are an external researcher\. Adjust the evaluator harness based on recent training statistics\. Do not choose the evaluator’s final dimensions or weights directly\.

To request search from the evaluator, set require\_search=true in process\_constraints, and explicitly state in notes that the evaluator should use the search tool by outputting <search\>query</search\> tags\.

\[User Prompt\]

Task: Generate the next evaluator harness H\_t\.

Goal: Update evaluator constraints without directly choosing concrete dimensions or weights for the evaluator\. You may adjust dimension\_constraints and process\_constraints and dimension\_pool\.

\#\# Statistics:

\{TrainingStatisticsJSON\}

\#\# Output Format:

\{

"dimension\_constraints": \{

"k": number of dimensions,

"must\_include": \[list of required dimension names\],

"weight\_min": minimum weight for a dimension,

"weight\_max": maximum weight for a dimension

\},

"process\_constraints": \{

"require\_gap\_analysis": true/false,

"require\_verification": true/false,

"citation\_cross\_check": true/false,

"min\_number\_of\_tool\_call": n

\},

"notes": "additional process compliance instructions"

\}

Listing 3: Prompt for meta\-harness

\[System Prompt\]

Youareaprofessionalevaluatorfordeepresearchreports\.

Youwillreceiveonequeryandmultiplecandidatereportsgeneratedbythesolver\.

Yourjobistoevaluateeveryreportwithasearch\-assistedreviewprocess\.

Youmustcomplete4stages:

1\.\{must\_include\}isalwaysrequiredastheprimaryscoringdimension\.Selectexactly\{k\-1\}additionaldimensionsfromtheremainingpoolandassignweights,eachadditionaldimension’sweightmustbewithin\[\{WeightMin\},\{WeightMax\}\]

2\.Verifykeyclaimsandcitationsineachreportusingtheavailablesearchtool\.

3\.Performindependentresearchfromtheoriginalquerytoidentifyimportantgapsnotcoveredbyeachreport\.

4\.Scoreeachreportoneveryselecteddimensionusingtheevidenceyoufound\.

\#\#Tool\-useprotocol:

\-Whenyouneedexternalevidence,youcanusesearchtoolswithsearchqueryinthisformat:

<search\>yoursearchqueryhere</search\>

\-Thecontentinside<search\>mustbeonlythequerytext\.DonotoutputJSON,function\-callsyntax,XMLattributes,bulletlists,ormultiplequeries\.

\-Afterasearch,thesystemwillreturnretrievedevidenceinside<observation\>…</observation\>\.

\-YoumayissuemultiplesearchqueriesasneededbeforeproducingthefinalXML\.

ReturnthefinalresultinthefollowingXMLformatONLYAFTERyoufinishyoursearchandevaluation:

<dimensions>
  <dim name="[BaseDimension]"/>
  <dim name="[SelectedDimension]" weight="[Weight]"/>
</dimensions>
<report_evaluations>
  <report id="[ReportID]">
    <verification>your verification process here</verification>
    <gap_analysis>your gap analysis here</gap_analysis>
    <scores>
      <score dim="[BaseDimension]" value="[Score]" reason="…"/>
      <score dim="[SelectedDimension]" value="[Score]" reason="…"/>
    </scores>
  </report>
</report_evaluations>

Listing 4: Prompt for evaluator

[System Prompt]

You are a professional deep researcher. Solve the user’s request by iteratively searching public information and synthesizing evidence.

## Rules:

1. Think inside <think> and </think> before each action and after each new search result.

2. When external evidence is needed, emit exactly one standalone search query. Use the exact format <search>your query here</search> to issue a search.

3. The content inside <search> must be only the query text. Do not output JSON, tool names, XML attributes, bullet lists, or multiple queries in one search block.

4. Prefer natural web search queries: start broad, then narrow to verification, recent developments, entity-specific facts, or missing perspectives.

5. Prefer multiple sequential searches over one overloaded query.

6. Use search to verify important claims, gather evidence, and close information gaps. Stop searching once you can answer confidently.

7. The environment will return search results between <observation> and </observation>.

8. When you have enough evidence, write the final response only inside <answer> and </answer>.

9. In the final answer, synthesize the evidence, note uncertainty when necessary, and include source mentions or a short Sources section.

Listing 5: Prompt for solver
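The tag protocol in the solver prompt (one `<search>` block per turn, plain query text only, final output inside `<answer>`) implies that the environment has to classify each solver turn before acting on it. The following is a minimal sketch of such a check, not the authors' implementation: the function names `parse_solver_turn` and `wrap_observation`, the return convention, and the heuristic used to reject non-plain queries are all assumptions made for illustration.

```python
import re

# Tag patterns taken from the solver prompt; DOTALL lets content span lines.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_solver_turn(text: str) -> tuple[str, str]:
    """Classify one solver turn (hypothetical helper, not from the paper).

    Returns ("search", query), ("answer", final_text), or ("invalid", reason).
    """
    # Drop reasoning so tags mentioned inside <think> are not read as actions.
    visible = THINK_RE.sub("", text)

    # Rule 8: a final response lives inside <answer>...</answer>.
    answers = ANSWER_RE.findall(visible)
    if answers:
        return "answer", answers[-1].strip()

    # Rule 2: exactly one standalone search query per turn.
    searches = SEARCH_RE.findall(visible)
    if len(searches) != 1:
        return "invalid", f"expected exactly one search block, found {len(searches)}"

    query = searches[0].strip()
    if not query:
        return "invalid", "empty search query"

    # Rule 3 (assumed heuristic): reject JSON, nested tags, bullets, multi-line text.
    if "\n" in query or query[0] in "{[<-*":
        return "invalid", "search block must contain only the query text"
    return "search", query


def wrap_observation(results: str) -> str:
    """Rule 7: search results are returned between <observation> tags."""
    return f"<observation>{results}</observation>"
```

A rollout loop under these assumptions would call `parse_solver_turn` on each model output, run the search and append `wrap_observation(...)` for a `"search"` turn, stop on `"answer"`, and handle `"invalid"` according to whatever policy the training setup uses (the section does not specify one).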
