Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher

arXiv cs.AI Papers

Summary

This paper proposes the Hybrid Open-Ended Tri-Evolution (HOTE) framework, which uses hybrid-mode reinforcement learning to evolve a proposer, solver, and judge collaboratively for deep research tasks, achieving state-of-the-art results with an 8B model surpassing larger static models.

arXiv:2606.13710v1 Announce Type: new Abstract: Deep research and agent evolution serve as de-facto tasks for AI agents in real-world applications toward artificial general intelligence. The former enables autonomous retrieval and integration of information in open-ended environments to tackle open-ended research tasks, yet it is constrained by the static parametric deep research capabilities of agent systems. The latter allows agents to autonomously interact with the environment to gain experiences that evolve model capabilities. However, its effectiveness has been widely validated only on verifiable tasks with standard answers, leaving a gap with open-ended research tasks. To bridge these two critical tasks, we propose the Hybrid Open-Ended Tri-Evolution (HOTE) framework, which leverages hybrid-mode reinforcement learning to facilitate the collaborative evolution of a proposer, solver and judge based on web-scale knowledge, moving toward autonomous evolving agents in open-ended tasks and environments. Extensive experiments on three long-form deep research benchmarks demonstrate that the 8B model trained via HOTE surpasses the strongest static open 8-32B models as well as those trained by state-of-the-art deep research training methods with less time overhead, and further verify that the evolution of all three modules in HOTE is indispensable.
Original Article
View Cached Full Text

Cached at: 06/15/26, 09:08 AM

# Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher
Source: [https://arxiv.org/html/2606.13710](https://arxiv.org/html/2606.13710)
Hongming Piao1Chi Liu111footnotemark:1Mengzhuo Chen1Yan Shu1 Derek Li1Ying Wei2Bryan Dai1 1IQuest Research2Zhejiang University \{cxiao, cliu04, cbdai\}@iquestlab\.com [https://github\.com/UBI\-Agent/ote\.git](https://github.com/UBI-Agent/ote.git)

###### Abstract

Deep research and agent evolution serve as de\-facto tasks for AI agents in real\-world applications toward artificial general intelligence\. The former enables autonomous retrieval and integration of information in open\-ended environments to tackle open\-ended research tasks, yet it is constrained by the static parametric deep research capabilities of agent systems\. The latter allows agents to autonomously interact with the environment to gain experiences that evolve model capabilities\. However, its effectiveness has been widely validated only on verifiable tasks with standard answers, leaving a gap with open\-ended research tasks\. To bridge these two critical tasks, we propose the Hybrid Open\-Ended Tri\-Evolution \(HOTE\) framework, which leverages hybrid\-mode reinforcement learning to facilitate the collaborative evolution of a proposer, solver and judge based on web\-scale knowledge, moving toward autonomous evolving agents in open\-ended tasks and environments\. Extensive experiments on three long\-form deep research benchmarks demonstrate that the 8B model trained via HOTE surpasses the strongest static open 8\-32B models as well as those trained by state\-of\-the\-art deep research training methods with less time overhead, and further verify that the evolution of all three modules in HOTE is indispensable\.

Hybrid Open\-Ended Tri\-Evolution Makes Better Deep Researcher

Hongming Piao1††thanks:Equal contributionChi Liu111footnotemark:1Mengzhuo Chen1Yan Shu1Derek Li1Ying Wei2Bryan Dai1††thanks:Corresponding author1IQuest Research2Zhejiang University\{cxiao, cliu04, cbdai\}@iquestlab\.com[https://github\.com/UBI\-Agent/ote\.git](https://github.com/UBI-Agent/ote.git)

## 1Introduction

Deep research, which emphasizes autonomous handling of open\-ended, long\-cycle, and highly complex information retrieval and integration, has become a de\-facto task for AI agents in real\-world applications and a step toward artificial general intelligenceHu et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib11)\); OpenAI \([2025](https://arxiv.org/html/2606.13710#bib.bib21)\)\. Closed\-source proprietary systems such as OpenAI Deep ResearchOpenAI \([2025](https://arxiv.org/html/2606.13710#bib.bib21)\), Claude ResearchAnthropic \([2025](https://arxiv.org/html/2606.13710#bib.bib1)\), Kimi\-ResearcherMoonshot AI \([2025](https://arxiv.org/html/2606.13710#bib.bib20)\), and Grok DeepSearchxAI \([2025](https://arxiv.org/html/2606.13710#bib.bib40)\)have demonstrated near\-human research capabilities\. Meanwhile, the open\-source community has also made significant progress in building more comprehensive research workflows and end\-to\-end training of deep researchers capable of autonomously planning workflowsSchmidgall et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib23)\); Li et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib15)\); Jin et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib14)\); Team et al\. \([2025b](https://arxiv.org/html/2606.13710#bib.bib30)\); Shao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib24)\); Fang et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib7)\); Zheng et al\. \([2025b](https://arxiv.org/html/2606.13710#bib.bib51)\); Wu et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib36)\); Yao et al\. \([2026](https://arxiv.org/html/2606.13710#bib.bib43)\); Song et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib28)\)\.

Although deep researchers can tackle highly complex research questions by autonomously seeking web\-scale knowledge, their parameterized research capabilities are upper bounded by fixed training sets and training strategies\. Autonomous interaction with the environment and evolution through experience are regarded as a path toward artificial general intelligenceLiu et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib17)\), with self\-play offering a promising paradigm for agent evolution, where an agent system learns from feedback acquired through competition with itself\. For example,Huang et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib13)\); Zhao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib49)\); Wang et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib34)\); Chen et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib3)\)proposed agent systems that act as both query proposer and solver, achieving significant results beyond the original training set or even in zero\-data scenarios in domains such as mathematics, coding, or general reasoning\. To address the limitation that such evolution is constrained by the agent system’s own knowledge, SPICELiu et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib17)\)and Dr\. ZeroYue et al\. \([2026](https://arxiv.org/html/2606.13710#bib.bib48)\)equipped the proposer with a pretrained\-scale corpusMahabadi et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib19)\); Yuan et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib47)\)and a search engine, respectively, taking a step forward toward evolution in open\-ended environments\. However, they remain limited to tasks that can be verified with deterministic answers\. Given that deep researchers in real\-world applications often face long\-form report generation tasks without clear standard answersShao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib24)\), constructing an agent evolution framework for open\-ended tasks and environments is crucial\.

![Refer to caption](https://arxiv.org/html/2606.13710v1/x1.png)Figure 1:During the training of HOTE, \(a\) the scores for synthetic research tasks remain at the same level; \(b\) the scores for research tasks from the original training set continuously increase; \(c\) the scores on Healthbench surpass the baselines and maintain an upward trend\.To fill the aforementioned gaps, we propose the Hybrid Open\-ended Tri\-Evolution \(HOTE\) framework, which consists of three co\-evolving modules: proposer, solver, and judge\. The solver is responsible for receiving a query, generating a research plan, conducting multi\-turn information seeking, integrating information and producing a referenced research report\. The judge is responsible for dynamically generating rubricsGunjal et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib9)\); Viswanathan et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib32)\); Shao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib24)\)that capture the strengths and weaknesses of the solver by comparing multiple solver responses sampled for the same query, and providing rewards for the responses based on these rubrics, thereby removing the dependency on verifiable answers\. The proposer is responsible for performing information seeking based on the model weaknesses identified by the judge and proposing challenging yet learnable queries\. HOTE uses GRPOShao et al\. \([2024](https://arxiv.org/html/2606.13710#bib.bib25)\)to encourage a game between the solver and proposer, continuously improving response quality and query difficulty\. Simultaneously, it employs the judge to dynamically evolve evaluation rubrics, preventing reward hacking, maintaining the learnability of difficult queries, and enabling the proposer to uncover the solver’s weaknesses\. Additionally, we propose a dual\-mode hybrid training strategy that includes both tool\-use and no\-tool modes, which achieves mutual benefit between the two modes and significantly improves training efficiency\. HOTE effectively maintains the difficulty of synthetic queries during training \(Figure[1](https://arxiv.org/html/2606.13710#S1.F1)\(a\-b\)\) and outperforms approaches using only the original training set within the same number of training steps \(Figure[1](https://arxiv.org/html/2606.13710#S1.F1)\(c\)\)\. As shown in Figure[4](https://arxiv.org/html/2606.13710#S2.F4)\(a\), HOTE also facilitates the collaborative progress of both the no\-tool and tool\-use modes\.

In conclusion, our contributions are as follows:

- •We propose Hybrid Open\-ended Tri\-Evolution \(HOTE\), the first deep researcher evolution framework designed for open\-ended environments and open\-ended tasks, bridging two paths toward artificial general intelligence: deep research and agent evolution\.
- •We design a co\-evolution strategy for proposer, solver and judge based on reinforcement learning with hybrid modes\. The strategy maintains the challenge and learnability of research tasks for the solver while avoiding reward hacking and achieving the mutual benefit between tool\-use mode and no\-tool mode\.
- •Experimental results on three long\-form deep research benchmarks demonstrate that an 8B model trained with HOTE outperforms the strongest open 8\-32B models and state\-of\-the\-art deep research training methods with less time overhead, with the co\-evolution of all three modules being indispensable\.

## 2Method

### 2\.1Problem Formulation

![Refer to caption](https://arxiv.org/html/2606.13710v1/x2.png)Figure 2:The inference paradigm of the solver and the proposer under tool\-use and no\-tool modes in HOTE\.![Refer to caption](https://arxiv.org/html/2606.13710v1/x3.png)Figure 3:The overall training framework of HOTE\. At each training step, we utilize hybrid data consisting of bothreal tasksandsynthetic taskswith their corresponding persistent rubrics\. Half of the tasks are configured intool‑usemode and the other half inno‑toolmode\. TheSolvergenerates responses in hybrid mode\. Based on each task’s existing rubrics and the generated responses, theJudgeupdates the rubrics, evaluates the responses and generates meta rubrics\. The assessment generated by theJudgeis used to update theSolver, while the portion corresponding to synthetic tasks is used to update theProposer\. TheProposerperforms diverse proposing according to the meta rubrics and different combinations of tasks from the previous step, thereby generating synthetic tasks which use the meta rubrics as persistent rubrics for the next step\.FollowingLi et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib15)\), we build our agent on top of a concise and general ReAct framework, which provides a clear baseline for evaluating the model’s intrinsic capabilities and training strategies\. The deep research model is a language model \(LM\) augmented with search tools\. Each tool accepts a query along with its arguments and returns textual resources that can be cited in the model’s final answer\. Formally, let𝒯=\{T1,T2,…\}\\mathcal\{T\}=\\\{T\_\{1\},T\_\{2\},\\ldots\\\}represent the set of available tools\. Each toolTkT\_\{k\}accepts a queryqqtogether with an optional argument stringα\\alpha, and returns an observationo=Tk​\(q;α\)o=T\_\{k\}\(q;\\alpha\)\. The model follows a policyπθ\\pi\_\{\\theta\}, parameterized byθ\\theta, which generates a sequence of textssautoregressively\. The sequence is initialized ass0=xs\_\{0\}=x, wherexxcontains the system prompt and the task description\. The model’s action space is defined as

\{think,tool,answer,cite\},\\\{\\texttt\{think\},\\texttt\{tool\},\\texttt\{answer\},\\texttt\{cite\}\\\},with each action associated with a corresponding protocol token\.think\(<think\>…\.\.\.</think\>\) leverages the language model’s internal reasoning capability to plan subsequent steps based on the current state and available information\.tool\(<call\_tool name=…\.\.\.\>…\.\.\.</call\_tool\>\) triggers the invocation of one of several search\-related tools\. The specific tool is selected via thenameattribute, together with tool\-dependent arguments omitted here\. The textual output produced by the tool is appended to the context for use in later steps\.answer\(<answer\>…\.\.\.</answer\>\) generates the final response and terminates the interaction\.cite\(<cite id=…\.\.\.\>…\.\.\.</cite\>\) is embedded within the final answer to annotate claims with citation tags that reference supporting sources\.

At each stepii, the model samples both an actionaia\_\{i\}and its associated content or argumentsζi\\zeta\_\{i\},\(ai,ζi\)∼πθ\(⋅∣si\)\(a\_\{i\},\\zeta\_\{i\}\)\\sim\\pi\_\{\\theta\}\(\\cdot\\mid s\_\{i\}\)\. Ifai∈\{think,answer,cite\}a\_\{i\}\\in\\\{\\texttt\{think\},\\texttt\{answer\},\\texttt\{cite\}\\\}, the generated outputζi\\zeta\_\{i\}is appended to the context, yieldingsi\+1=si⊕⟨ai,ζi⟩s\_\{i\+1\}=s\_\{i\}\\oplus\\langle a\_\{i\},\\zeta\_\{i\}\\rangle\. Ifai=toola\_\{i\}=\\texttt\{tool\}, the model executes the corresponding tool call, receives the observationoi=Tk​\(qi;αi\)o\_\{i\}=T\_\{k\}\(q\_\{i\};\\alpha\_\{i\}\), whereζi=\(qi,αi\)\\zeta\_\{i\}=\(q\_\{i\},\\alpha\_\{i\}\), and updates the state assi\+1=si⊕⟨ai,ζi,oi⟩s\_\{i\+1\}=s\_\{i\}\\oplus\\langle a\_\{i\},\\zeta\_\{i\},o\_\{i\}\\rangle\. This iterative procedure continues untilaτ=answera\_\{\\tau\}=\\texttt\{answer\}, at which pointζτ\\zeta\_\{\\tau\}contains the final answer\. As shown in Figure[2](https://arxiv.org/html/2606.13710#S2.F2), within HOTE, both the proposer and the solver perform inference under the same paradigm described above\.Formulating challenging research tasks for the solver constitutes a research task for the proposer itself\. The key difference is that the proposer does not include theciteaction, since proposing research tasks does not require the presentation of citations\. In HOTE, both the proposer and the solver operate under two modes: tool\-use and no\-tool\. In the tool\-use mode, the model follows the aforementioned inference paradigm\. In the no\-tool mode, after receiving the initial states0s\_\{0\}, the model performs a singlethinkaction and then directly produces anansweraction\. All inference paradigms described above can be controlled through the system prompt\.

![Refer to caption](https://arxiv.org/html/2606.13710v1/x4.png)Figure 4:\(a\) The hybrid mode of HOTE outperforms the tool\-use mode of HOTE as well as DR Tulu in both no\-tool and tool\-use modes; \(b\) Models trained with no\-tool mode HOTE and DR Tulu evaluated in no\-tool mode on Healthbench achieve higher scores than when evaluated in tool\-use mode; \(c\) When trained with HOTE in no\-tool mode, the scores on DRB under tool\-use mode decrease after a certain number of steps\.
### 2\.2Hybrid Open\-ended Tri\-evolution

HOTE is primarily divided into four parts:Solver Evolution,Judge Evolution,Proposer EvolutionandDual\-mode Hybrid Training Strategy\. Please refer to Figure[3](https://arxiv.org/html/2606.13710#S2.F3)and Algorithm[1](https://arxiv.org/html/2606.13710#alg1)for the overall framework and training pipeline\.

Solver Evolution\. The solverπθs\\pi\_\{\\theta\_\{s\}\}takes a research tasks0s\_\{0\}as input and, after performingthink\-toolinterleaved reasoning, generates a long\-form research reportanswerwithciterepresented byoo\. Thus, the objective of solver evolution is to make the answer better align with the research report requirementsrr, which also serves as the reward in the reinforcement learning and will be further discussed in the judge evolution section\. We utilize GRPOShao et al\. \([2024](https://arxiv.org/html/2606.13710#bib.bib25)\)with token\-level loss aggregationYu et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib45)\)to achieve solver evolution, with the goal of:

𝒥GRPO\\displaystyle\\mathcal\{J\}\_\{\\text\{GRPO\}\}\(θs\)=𝔼\(s0,ℛs0\)∼𝒟,\{oi\}i=1G∼πθsold\(⋅∣s0\)\\displaystyle\(\\theta\_\{s\}\)=\\mathbb\{E\}\_\{\(s\_\{0\},\\mathcal\{R\}\_\{s\_\{0\}\}\)\\sim\\mathcal\{D\},\\ \\\{o\_\{i\}\\\}\_\{i=1\}^\{G\}\\sim\\pi\_\{\\theta\_\{s\}^\{\\text\{old\}\}\}\(\\cdot\\mid s\_\{0\}\)\}\(1\)\[1∑i=1G\|oi\|∑i=1G∑t=1\|oi\|\(min\(ri,t\(θs\)A^i,t,\\displaystyle\\quad\\Bigg\[\\frac\{1\}\{\\textstyle\\sum\_\{i=1\}^\{G\}\|o\_\{i\}\|\}\\sum\_\{i=1\}^\{G\}\\sum\_\{t=1\}^\{\|o\_\{i\}\|\}\\Big\(\\min\\big\(r\_\{i,t\}\(\\theta\_\{s\}\)\\,\\hat\{A\}\_\{i,t\},clip\(ri,t\(θs\),ϵ\)A^i,t\)−βDKL\(πθs∥πθsref\)\)\],\\displaystyle\\hskip\-27\.0pt\\mathrm\{clip\}\\\!\\left\(r\_\{i,t\}\(\\theta\_\{s\}\),\\epsilon\\right\)\\,\\hat\{A\}\_\{i,t\}\\big\)\-\\beta\\,D\_\{\\mathrm\{KL\}\}\\\!\\left\(\\pi\_\{\\theta\_\{s\}\}\\,\\\|\\,\\pi\_\{\\theta\_\{s\}^\{\\text\{ref\}\}\}\\right\)\\Big\)\\Bigg\],whereri,t​\(θ\)=πθ​\(oi,t∣q,oi,<t\)πθold​\(oi,t∣q,oi,<t\),\\displaystyle\\hskip\-27\.0pt\\text\{where\}\\quad r\_\{i,t\}\(\\theta\)=\\frac\{\\pi\_\{\\theta\}\\\!\\left\(o\_\{i,t\}\\mid q,o\_\{i,<t\}\\right\)\}\{\\pi\_\{\\theta\_\{\\text\{old\}\}\}\\\!\\left\(o\_\{i,t\}\\mid q,o\_\{i,<t\}\\right\)\},A^i,t=ri−mean⁡\(\{ri\}i=1G\)std⁡\(\{ri\}i=1G\)\.\\displaystyle\\hat\{A\}\_\{i,t\}=\\frac\{r\_\{i\}\-\\operatorname\{mean\}\\\!\\left\(\\\{r\_\{i\}\\\}\_\{i=1\}^\{G\}\\right\)\}\{\\operatorname\{std\}\\\!\\left\(\\\{r\_\{i\}\\\}\_\{i=1\}^\{G\}\\right\)\}\.\{oi\}i=1G\\\{o\_\{i\}\\\}\_\{i=1\}^\{G\}represents a group of responses to the research tasks0s\_\{0\}\.rir\_\{i\}denotes the reward obtained byoio\_\{i\}\.ℛs0\\mathcal\{R\}\_\{s\_\{0\}\}represents the rubric set corresponding tos0s\_\{0\}, which will be explained in detail in the judge evaluation section\. The solver continuously progresses toward better long\-form research reports based on the reward\. We omit the descriptions of other symbols that can be found inYu et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib45)\)\.

Judge Evolution\. The judgeπθj\\pi\_\{\\theta\_\{j\}\}receives a group of responses\{oi\}i=1G\\\{o\_\{i\}\\\}\_\{i=1\}^\{G\}fors0s\_\{0\}from the solver as input and assigns rewardrir\_\{i\}to each responseoio\_\{i\}according to the rubric setℛs0\\mathcal\{R\}\_\{s\_\{0\}\}as follows:

ri=∑\(R,w\)∈ℛs0w⋅Judgeπθj​\(oi,R\)∑\(R,w\)∈ℛs0\|w\|,\\displaystyle r\_\{i\}=\\frac\{\\sum\_\{\(R,w\)\\in\\mathcal\{R\}\_\{s\_\{0\}\}\}w\\cdot\\text\{Judge\}\_\{\\pi\_\{\\theta\_\{j\}\}\}\(o\_\{i\},R\)\}\{\\sum\_\{\(R,w\)\\in\\mathcal\{R\}\_\{s\_\{0\}\}\}\|w\|\},\(2\)
whereRRrepresents a rubric inℛs0\\mathcal\{R\}\_\{s\_\{0\}\}andwwrepresents its corresponding weight\. The judge’s reward for each rubric has only0or±1\\pm 1\. Therefore, the evolutionary objective of the judge is to provide more well\-founded and discriminative rewards for the responses, ensuring the learning of the solver\. As can be seen from Equation[2](https://arxiv.org/html/2606.13710#S2.E2), the judge requires extensive inference at each step, so for training efficiency considerations, the judge in HOTE uses a fixed instruction model\. In this case, the key to judge evolution shifts to how to drive the evolution of the rubric setℛs0\\mathcal\{R\}\_\{s\_\{0\}\}fors0s\_\{0\}\. Inspired byShao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib24)\), givenℛs0=ℛs0persi\.∪ℛs0active\\mathcal\{R\}\_\{s\_\{0\}\}=\\mathcal\{R\}^\{\\text\{persi\.\}\}\_\{s\_\{0\}\}\\cup\\mathcal\{R\}^\{\\text\{active\}\}\_\{s\_\{0\}\}whereℛpersi\.\\mathcal\{R\}^\{\\text\{persi\.\}\}contains persistent rubrics ofs0s\_\{0\}andℛs0active\\mathcal\{R\}^\{\\text\{active\}\}\_\{s\_\{0\}\}contains active rubrics ofs0s\_\{0\}that can be deleted or added, HOTE prompts the judge to updateℛs0active\\mathcal\{R\}^\{\\text\{active\}\}\_\{s\_\{0\}\}based on\{oi\}i=1G\\\{o\_\{i\}\\\}\_\{i=1\}^\{G\}at each step before assigning rewards as follows:

ℛs0active=Updateπθj​\(s0,\{oi\}i=1G,ℛs0active\)\.\\displaystyle\\mathcal\{R\}^\{\\text\{active\}\}\_\{s\_\{0\}\}=\\text\{Update\}\_\{\\pi\_\{\\theta\_\{j\}\}\}\(s\_\{0\},\\\{o\_\{i\}\\\}\_\{i=1\}^\{G\},\\mathcal\{R\}^\{\\text\{active\}\}\_\{s\_\{0\}\}\)\.\(3\)
The judge will generate two types of rubrics:positive rubricsthat capture strengths or new, relevant knowledge explored byπθs\\pi\_\{\\theta\_\{s\}\}in\{oi\}i=1G\\\{o\_\{i\}\\\}\_\{i=1\}^\{G\}but not yet reflected inℛs0\\mathcal\{R\}\_\{s\_\{0\}\}, andnegative rubricsthat summarize common undesirable behaviors such as reward hacking observed across\{oi\}i=1G\\\{o\_\{i\}\\\}\_\{i=1\}^\{G\}\. By observing the responses, the judge continuously tracks and uncovers weaknesses in both the rubric set and solver\.

Proposer Evolution\. The objective of proposerπθp\\pi\_\{\\theta\_\{p\}\}evolution is to enhance the capability to search for materials and propose research tasks that can expose the weaknesses of the solver, based on the judge’s assessment\. Similar to solver evolution, HOTE uses GRPO to achieve the evolution of the proposer as shown in Equation[1](https://arxiv.org/html/2606.13710#S2.E1), with the distinction thats0s\_\{0\}becomesproposing research tasks based on the judge’s assessment𝒜\\mathcal\{A\}represented bys0ps\_\{0\}^\{p\}, and\{oi\}i=1G\\\{o\_\{i\}\\\}\_\{i=1\}^\{G\}becomesa group of research tasks proposed by the proposer\{oip\}i=1G′\\\{o\_\{i\}^\{p\}\\\}\_\{i=1\}^\{G^\{\\prime\}\}\. However, there are two key issues left:

- •The assessment from judge𝒜=\{Judgeπθj​\(oi,R\)∣\(R,w\)∈ℛs0,1≤i≤G\}\\mathcal\{A=\}\\\{\\text\{Judge\}\_\{\\pi\_\{\\theta\_\{j\}\}\}\(o\_\{i\},R\)\\mid\(R,w\)\\in\\mathcal\{R\}\_\{s\_\{0\}\},1\\leq i\\leq G\\\}includes rewards for each rubric of every response, thus using all of them as input to the proposer results in excessive length, slowing down training speed\.
- •\{oip\}i=1G′\\\{o\_\{i\}^\{p\}\\\}\_\{i=1\}^\{G^\{\\prime\}\}proposed by the proposer lacks shared rubrics, making it difficult to evaluate their relative strengths and weaknesses\.

Therefore, we proposemeta rubrics, allowing the judge to summarize assessments into multiple meta rubrics, uncovering common model weaknesses among the solver’s responses as follows:

ℛ\{oip\}i=1G′meta=Metaπθj​\(𝒜,ℛs0,\{oi\}i=1G\)\.\\displaystyle\\mathcal\{R\}^\{\\text\{meta\}\}\_\{\\\{o\_\{i\}^\{p\}\\\}\_\{i=1\}^\{G^\{\\prime\}\}\}=\\text\{Meta\}\_\{\\pi\_\{\\theta\_\{j\}\}\}\(\\mathcal\{A\},\\mathcal\{R\}\_\{s\_\{0\}\},\\\{o\_\{i\}\\\}\_\{i=1\}^\{G\}\)\.\(4\)
These meta rubrics serve as the proposer input and persistent rubrics shared across\{oip\}i=1G′\\\{o\_\{i\}^\{p\}\\\}\_\{i=1\}^\{G^\{\\prime\}\}\. On one hand, they are used for solver evolution; on the other hand, they leverage the reward of solver responses to compute the rewardripr\_\{i\}^\{p\}of research taskoipo\_\{i\}^\{p\}as follows:

rip=1M​∑\(R,w\)∈ℛ\{oip\}i=1G′meta\\displaystyle r\_\{i\}^\{p\}=\\frac\{1\}\{M\}\\textstyle\\sum\_\{\(R,w\)\\in\\mathcal\{R\}^\{\\text\{meta\}\}\_\{\\\{o\_\{i\}^\{p\}\\\}\_\{i=1\}^\{G^\{\\prime\}\}\}\}\(5\)𝕀⋅\(1−𝔼\{oj\}j=1G∼πθs\(⋅∣oip\)​\[Judgeπθj​\(oj,R\)\]\)\\displaystyle\\mathbb\{I\}\\cdot\(1\-\\mathbb\{E\}\_\{\\\{o\_\{j\}\\\}\_\{j=1\}^\{G\}\\sim\\pi\_\{\\theta\_\{s\}\}\(\\cdot\\mid o\_\{i\}^\{p\}\)\}\[\\text\{Judge\}\_\{\\pi\_\{\\theta\_\{j\}\}\}\(o\_\{j\},R\)\]\)whereM=\|ℛ\{oip\}i=1G′meta\|M=\\left\|\\mathcal\{R\}^\{\\text\{meta\}\}\_\{\\\{o\_\{i\}^\{p\}\\\}\_\{i=1\}^\{G^\{\\prime\}\}\}\\right\|\.𝕀\\mathbb\{I\}represents whether there is anojo\_\{j\}that passes the rubricRR\. ‘1’ represents the max average reward the solverπθs\\pi\_\{\\theta\_\{s\}\}can obtain given the judge’s reward for each rubric is limited to0or±1\\pm 1\. Through Equation[5](https://arxiv.org/html/2606.13710#S2.E5), we encourage the proposer to generate challenging but solvable research tasks for the solver\.

### 2\.3Dual\-mode Hybrid Training Strategy

The independent evolution of the three modules mentioned above is insufficient\. They should complement one another to form an evolution pipeline where a stronger solver stimulates a more refined judge, and a stronger proposer and judge inspire the proposer to formulate more challenging problems, which in turn train a stronger solver\. As shown in Figure[3](https://arxiv.org/html/2606.13710#S2.F3), our proposed dual\-mode hybrid training strategy primarily encompasses three key features\.

Hybrid Data\. Except for the first step, the training data for each step \(comprising a batch size ofBBresearch tasks\) consists ofB2\\frac\{B\}\{2\}research tasks from the original training set andB2\\frac\{B\}\{2\}synthetic research tasks proposed by the proposer based on evaluations from the previous step\. Beyond leveraging existing data resources and facilitating agent evolution, this design allows synthetic research tasks generated by the proposer to be immediately solved by the solver and evaluated by the judge\. The evaluation results can then be used to optimize both the proposer and the judge simultaneously, avoiding the need for repeated sampling\.

Diverse Proposing\. We found when the proposer generates research tasks based solely on the judge’s evaluation and all research tasks from the previous step, they tend to concentrate on the same topic, which can undermine the balance and diversity of the training data\. Therefore, at each step, we prompt the proposer to generateNNgroups of research problems based on the judge’s evaluation andNNdistinct combinations of research tasks from the previous step\.

Hybrid Modes\. As illustrated in Figure[4](https://arxiv.org/html/2606.13710#S2.F4)\(b\), we found that for DR Tulu\-8B\-SFT, DR Tulu\-8B\-RL and HOTE trained solely in no\-tool mode, their performance on Healthbench under no\-tool mode exceeds that under tool\-use mode\. This phenomenon can be attributed to factors such as noise in the search tool and it is acceptable in practical applications to trade evaluation metrics for research reports with clear references\. Intuitively, we think it is easier to learn research report generation techniques excluding reference searching and understanding in a no\-tool training mode than a tool\-use training mode\. Meanwhile, as shown in Figure[4](https://arxiv.org/html/2606.13710#S2.F4)\(c\), for HOTE trained in no\-tool mode, its performance on DRB under tool\-use mode exhibits a clear pattern of initial improvement followed by decline, suggesting that no\-tool training leads the model to rely excessively on parametric knowledge\. Therefore, we randomly assign half of the training data in each step to no\-tool mode and the other half to tool\-use mode \(to ensure fairness in judging synthetic research tasks, this assignment is randomized across theNNgroups\), thereby enhancing research report generation techniques and avoiding over\-reliance on parameterized knowledge\.

In actual training, we trained600600steps using no\-tool mode and then trained700700steps using hybrid mode\. Besides, we theoretically prove that the hybrid mode results in a lower expected maximum generation time in Appendix[B](https://arxiv.org/html/2606.13710#A2)\.

## 3Experiment

Our experiments aim to address five research questions\.RQ1:Does HOTE demonstrate stronger capabilities in handling open\-ended research tasks with less time overhead?RQ2:Are the three modules indispensable for HOTE evolution?RQ3:Does HOTE facilitate the collaborative progress of dual modes?RQ4:Is HOTE effective with different base models?RQ5:Does HOTE evolve more sustainably? We additionally provide the case study, the effect of judge models, prompts and diverse proposing in Appendix[C](https://arxiv.org/html/2606.13710#A3)and[E](https://arxiv.org/html/2606.13710#A5)\.

Table 1:Performance comparison across long\-form deep research benchmarks\. HOTE\-8B outperforms existingOpen Deep Research Models,Open Deep Research,RL MethodsandEvolving Methods\.MethodHealthBenchResearchQADRBAverageOverallCompInsightInstructionReadabilityClosed Deep ResearchGemini 3 Pro \+ Search38\.074\.346\.343\.444\.949\.849\.052\.9GPT\-5 \+ Search59\.578\.250\.726\.721\.341\.029\.462\.8OpenAI Deep Research53\.879\.246\.946\.845\.249\.247\.160\.0Open Deep Research ModelsQwen3\-8B5\.946\.318\.214\.38\.729\.524\.423\.5Qwen3\-235B\-A22B21\.350\.722\.519\.117\.330\.625\.131\.5Search\-R1\-7B\-0\.127\.99\.55\.22\.118\.616\.812\.4ASearcher\-Web\-7B\-13\.019\.47\.85\.11\.715\.211\.84\.7WebExplorer\-8B33\.764\.836\.733\.728\.545\.742\.245\.1WebThinker\-32B\-DPO11\.148\.623\.319\.712\.336\.826\.327\.7Tongyi DeepResearch\-30B\-A3B46\.266\.740\.639\.134\.346\.845\.451\.2Fixed Pipeline Deep ResearchWebThinker QwQ\-32B \(report\)36\.572\.837\.936\.232\.643\.242\.949\.1WebThinker\-32B\-DPO \(report\)39\.474\.240\.639\.435\.446\.043\.551\.4Ai2 ScholarQA\-Claude Sonnet \(report\)32\.075\.036\.135\.132\.040\.538\.947\.7Open Deep ResearchDR Tulu\-8B\-SFT38\.168\.539\.036\.335\.345\.539\.548\.5DR Tulu\-8B\-RL111We use the 1900\-step checkpoint of DR Tulu\.50\.274\.343\.441\.741\.848\.241\.356\.0RL MethodsGRPO49\.673\.543\.140\.842\.146\.942\.655\.4GSPO51\.075\.143\.642\.540\.947\.343\.756\.6REINFORCE\+\+50\.874\.843\.141\.242\.746\.142\.456\.2Evolving MethodsSPICE\-8B50\.273\.942\.140\.640\.946\.140\.855\.4Dr\. Zero\-8B52\.173\.243\.741\.542\.146\.544\.756\.3Open Evolving Deep ResearchHOTE\-8B54\.476\.945\.944\.945\.447\.845\.859\.1

Table 2:Average training time per step for baselines, no\-tool mode and hybrid mode of HOTE\.### 3\.1Evaluations

Benchmark\. We evaluated HOTE and baseline models across three long\-form, open\-ended benchmarks: HealthBenchArora et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib2)\)for healthcare deep research, ResearchQAYifei et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib44)\)for assessing synthesis over scientific literature, the DeepResearchBenchDu et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib5)\)\(DRB\) for evaluating general\-domain deep research tasks\. For DRB, we additionally provide detailed performance across diverse aspects of the responses\. DRB includes the following aspects: Comprehensiveness, Insight, Instruction Following, and Readability\. In Table[1](https://arxiv.org/html/2606.13710#S3.T1), we followedShao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib24)\)by using HealthBench with 1,000 samples and ResearchQA with 776 samples\. For other experimental results, we sampled 100 instances each from HealthBench and ResearchQA respectively\. Please refer to Appendix[F](https://arxiv.org/html/2606.13710#A6)for the benchmark details\.

Baselines\. We compared four categories of deep researchers:Open Deep Research Models, including Qwen3\-8BYang et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib41)\), Qwen3\-235B\-A22BYang et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib41)\), Search\-R1\-7BJin et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib14)\), ASearcher\-Web\-7BGao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib8)\), WebExplorer\-8BLiu et al\. \([2025b](https://arxiv.org/html/2606.13710#bib.bib18)\), WebThinker\-32B\-DPOLi et al\. \([2025b](https://arxiv.org/html/2606.13710#bib.bib16)\), Tongyi DeepResearch\-30B\-A3BTeam et al\. \([2025b](https://arxiv.org/html/2606.13710#bib.bib30)\);Open Deep Research, including DR Tulu\-8B\-SFT and DR Tulu\-8B\-RLShao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib24)\);RL Method, including GRPOShao et al\. \([2024](https://arxiv.org/html/2606.13710#bib.bib25)\), GSPOZheng et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib50)\)and REINFORCE\+\+Hu et al\. \([2025b](https://arxiv.org/html/2606.13710#bib.bib12)\);Evolving Method, including SPICELiu et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib17)\)and Dr\. ZeroYue et al\. \([2026](https://arxiv.org/html/2606.13710#bib.bib48)\)\. We also providedClosed Deep Research, including Gemini 3 Pro, GPT\-5, and OpenAI Deep Research;Fixed Pipeline Deep Research, including WebThinker QwQ\-32B, WebThinker\-32B\-DPO, and Ai2 ScholarQA\-Claude SonnetSingh et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib27)\)for reference\. Please refer to Appendix[F](https://arxiv.org/html/2606.13710#A6)for implementation details\.

### 3\.2Training Details

We utilized Qwen3\-8B to initialize the checkpoint of the proposer and DR Tulu\-8B\-SFT to initialize the checkpoint of the solver\. For Open Deep Research, RL Methods, Evolving Methods and Open Evolving Deep Research, we used the same original RL training set in DR TuluShao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib24)\)licensed under ODC\-BY with 9K samples to ensure fairness\. We employed Qwen3\-235B\-A22B\-Instruct\-FP8 as the judge\. The batch sizeBBwas set to4848, the group size of solverGGto88, the group size of proposerG′G^\{\\prime\}to66, the learning rate to5​e−75e\-7, the maximum number of tool uses per responseTTto1010, the temperature to11, and the response length to1638416384\. For the performance in Table[1](https://arxiv.org/html/2606.13710#S3.T1), RL Methods and Evolving Methods were trained for 1300 steps until they converged on a held\-out validation set while HOTE\-8B was trained in no\-tool mode for 600 steps and hybrid mode for 700 steps\. We provide hyperparameter analysis in Appendix[G](https://arxiv.org/html/2606.13710#A7)\.

### 3\.3RQ1: Does HOTE demonstrate stronger capabilities in handling open\-ended research tasks with less time overhead?

As shown in Table[1](https://arxiv.org/html/2606.13710#S3.T1), the HOTE\-8B model surpasses the open\-source solution DR Tulu on HealthBench, ResearchQA and DRB\. It also outperforms Open Deep Research Models including Tongyi DeepResearch\-30B\-A3B\. As illustrated in Figure[5](https://arxiv.org/html/2606.13710#S3.F5), HOTE also leads existing rl and agent evolution methods\. Additionally, due to the presence of hybrid mode, only half of the research tasks in the latter700700steps out of the total13001300steps require tool\-use\. Given that the maximum number of tool\-use per response isTT, the batch size isBBand the group size isGG, the maximum number of tool\-use required for HOTE training is350​B​T​G\+175​B350BTG\+175B\(term 1 for solver, term 2 for proposer\)\. In contrast, for DR Tulu it is1900​B​T​G1900BTGwhile for RL Methods and Evolving Methods it is1300​B​T​G1300BTG\. Furthermore, as indicated in Table[2](https://arxiv.org/html/2606.13710#S3.T2), even with the addition of proposer evolution, both the no\-tool mode in the first600600steps and the hybrid mode in the latter700700steps contribute to improvements in training speed\.

### 3\.4RQ2: Are the three modules indispensable for HOTE evolution?

We compared HOTE, SPICE, the HOTE version without judge evolution \(HOTE w/o je, equivalent to Dr\. Zero using rubric\-based reward and GRPO\), and the HOTE version without proposer evolution \(HOTE w/o pe, the proposer’s parameters are fixed\) in the no\-tool mode\. As shown in Figure[5](https://arxiv.org/html/2606.13710#S3.F5), when training in the no\-tool mode using HOTE, although HOTE initially performed slightly worse on the benchmark compared to SPICE, HOTE w/o je and HOTE w/o pe, it gradually achieved overall superiority as training progressed\. More importantly, while HOTE w/o je, HOTE w/o pe and SPICE approached convergence, HOTE maintained a stronger upward trend\. Moreover, as can be seen from Figure[6](https://arxiv.org/html/2606.13710#S3.F6), with proposer evolution enabled, the scores of synthetic research tasks are more stable compared to fixed proposer parameters, which indicates that proposer evolution helps maintain the difficulty of research tasks\.

### 3\.5RQ3: Does HOTE facilitate the collaborative progress of dual modes?

Figure[4](https://arxiv.org/html/2606.13710#S2.F4)\(a\) shows the performance of HOTE, the open\-source training approach DR Tulu, as well as HOTE trained exclusively in tool\-use mode, evaluated on HealthBench, ResearchQA, and DRB under both no\-tool and tool\-use modes\. HOTE outperforms both DR Tulu and the single\-mode version across both no\-tool and tool\-use evaluation modes, achieving collaborative progress in the dual modes by enhancing research report generation techniques while avoiding over\-reliance on parameterized knowledge\.

![Refer to caption](https://arxiv.org/html/2606.13710v1/x5.png)Figure 5:In HealthBench \(a\), ResearchQA \(b\), and DeepResearchBench \(c\), after 600 steps of training in no\-tool mode, HOTE outperforms SPICE, HOTE w/o je and HOTE w/o pe while demonstrating an upward trend\.![Refer to caption](https://arxiv.org/html/2606.13710v1/x6.png)Figure 6:\(a\) During the training in no\-tool mode with proposer evolution enabled, the solver’s synthetic task score remains stable, indicating that proposer evolution maintains the challenge of the tasks for the evolving solver; \(b\) After disabling proposer evolution, the solver’s synthetic task score gradually increases\.
### 3\.6RQ4: Is HOTE effective with different base models?

As shown in Table[3](https://arxiv.org/html/2606.13710#S3.T3), we also provided the performance comparison on Llama3\.1\-8B\-Instruct supervised fine\-tuned by dr\-tulu\-sft\-dataShao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib24)\)for 5 epochs\. HOTE maintains its lead across three benchmarks and over baselines that we can train by ourselves includingOpen Deep Research,RL MethodsandEvolving Methods\. The absolute scores are lower than using DR Tulu\-8B\-SFT fine\-tuned from Qwen3\-8B due to the lower capability of the base model\.

Table 3:Performance comparison with Llama3\.1\-8B\-Instruct supervised fine\-tuned by dr\-tulu\-sft\-data as the base model\.
### 3\.7RQ5: Does HOTE evolve more sustainably than baselines?

We compared the performance of HOTE and the baselines during training from 1200 to 1500 total steps\. We use the average performance on the three benchmarks\. As shown in Table[4](https://arxiv.org/html/2606.13710#S3.T4), the baselines have already converged, whereas HOTE not only outperforms the baselines but also continues to exhibit an upward trend\. HOTE can sustain continuous evolution for at least 252 hours \(1500 steps\) of wall\-clock time\.

Table 4:Average performance comparison across three benchmarks between HOTE and baselines from 1200 steps to 1500 steps in total\. HOTE evolves more sustainably than baselines\.

## 4Conclusion

We propose Hybrid Open\-Ended Tri\-Evolution \(HOTE\), aiming to develop a deep researcher capable of autonomous evolution in open\-ended environments for open\-ended tasks with less time overhead\. Through a well\-designed reinforcement learning with hybrid modes, HOTE achieves synergistic evolution among the proposer, solver and judge as well as the mutual benefit between no\-tool and tool\-use modes\. On three long\-form deep research benchmarks, HOTE\-8B outperforms the strongest open 8\-32B models and state\-of\-the\-art deep research training methods with less time overhead\. In future work, we will continue to explore how to handle noise in real\-world search tools during the evolutionary process, how to break free from dependence on original training dataset and how to scale HOTE to larger MoE models\.

## Limitations

The evolution gradually slows down as training progresses and is difficult to obtain perfect scores, suggesting that the upper bound of evolution may still be constrained by model scale\. Investigating the scaling capability of HOTE will be a major direction of our future work\. The proposed method still relies on the initial training data, but we believe that transcending the limitations of existing training data through evolution is inherently valuable\.

## References

- Anthropic \(2025\)Anthropic\. 2025\.Claude takes research to new places\.[https://www\.anthropic\.com/news/research](https://www.anthropic.com/news/research)\.Accessed: 2025\-04\.
- Arora et al\. \(2025\)Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero\-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, and 1 others\. 2025\.Healthbench: Evaluating large language models towards improved human health\.*arXiv preprint arXiv:2505\.08775*\.
- Chen et al\. \(2025\)Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, and Deepak Pathak\. 2025\.Self\-questioning language models\.*arXiv preprint arXiv:2508\.03682*\.
- Chen et al\. \(2024\)Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu\. 2024\.Self\-play fine\-tuning converts weak language models to strong language models\.*arXiv preprint arXiv:2401\.01335*\.
- Du et al\. \(2025\)Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao\. 2025\.Deepresearch bench: A comprehensive benchmark for deep research agents\.*arXiv preprint arXiv:2506\.11763*\.
- FAIR et al\. \(2022\)FAIR, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, and 1 others\. 2022\.Human\-level play in the game of diplomacy by combining language models with strategic reasoning\.*Science*, 378\(6624\):1067–1074\.
- Fang et al\. \(2025\)Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun\-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, and 1 others\. 2025\.Cognitive kernel\-pro: A framework for deep research agents and agent foundation models training\.*arXiv preprint arXiv:2508\.00414*\.
- Gao et al\. \(2025\)Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu\. 2025\.Beyond ten turns: Unlocking long\-horizon agentic search with large\-scale asynchronous rl\.*arXiv preprint arXiv:2508\.07976*\.
- Gunjal et al\. \(2025\)Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx\. 2025\.Rubrics as rewards: Reinforcement learning beyond verifiable domains\.*arXiv preprint arXiv:2507\.17746*\.
- Ho et al\. \(2020\)Xanh Ho, Anh\-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa\. 2020\.Constructing a multi\-hop qa dataset for comprehensive evaluation of reasoning steps\.*arXiv preprint arXiv:2011\.01060*\.
- Hu et al\. \(2025a\)Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, and 1 others\. 2025a\.Step\-deepresearch technical report\.*arXiv preprint arXiv:2512\.20491*\.
- Hu et al\. \(2025b\)Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen\. 2025b\.Reinforce\+\+: Stabilizing critic\-free policy optimization with global advantage normalization\.*arXiv preprint arXiv:2501\.03262*\.
- Huang et al\. \(2025\)Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu\. 2025\.R\-zero: Self\-evolving reasoning llm from zero data\.*arXiv preprint arXiv:2508\.05004*\.
- Jin et al\. \(2025\)Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han\. 2025\.Search\-r1: Training llms to reason and leverage search engines with reinforcement learning\.*arXiv preprint arXiv:2503\.09516*\.
- Li et al\. \(2025a\)Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, and 1 others\. 2025a\.Websailor\-v2: Bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning\.*arXiv preprint arXiv:2509\.13305*\.
- Li et al\. \(2025b\)Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji\-Rong Wen, Yutao Zhu, and Zhicheng Dou\. 2025b\.Webthinker: Empowering large reasoning models with deep research capability\.*arXiv preprint arXiv:2504\.21776*\.
- Liu et al\. \(2025a\)Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, and Jason Weston\. 2025a\.Spice: Self\-play in corpus environments improves reasoning\.*arXiv preprint arXiv:2510\.24684*\.
- Liu et al\. \(2025b\)Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, and 1 others\. 2025b\.Webexplorer: Explore and evolve for training long\-horizon web agents\.*arXiv preprint arXiv:2509\.06501*\.
- Mahabadi et al\. \(2025\)Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro\. 2025\.Nemotron\-cc\-math: A 133 billion\-token\-scale high quality math pretraining dataset\.*arXiv preprint arXiv:2508\.15096*\.
- Moonshot AI \(2025\)Moonshot AI\. 2025\.Kimi\-researcher: End\-to\-end rl training for emerging agentic capabilities\.[https://moonshotai\.github\.io/Kimi\-Researcher/](https://moonshotai.github.io/Kimi-Researcher/)\.
- OpenAI \(2025\)OpenAI\. 2025\.Introducing deep research\.[https://openai\.com/index/introducing\-deep\-research/](https://openai.com/index/introducing-deep-research/)\.Accessed: 2025\-02\.
- Qin et al\. \(2025\)Tianrui Qin, Qianben Chen, Sinuo Wang, He Xing, King Zhu, He Zhu, Dingfeng Shi, Xinxin Liu, Ge Zhang, Jiaheng Liu, and 1 others\. 2025\.Flash\-searcher: Fast and effective web agents via dag\-based parallel execution\.*arXiv preprint arXiv:2509\.25301*\.
- Schmidgall et al\. \(2025\)Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum\. 2025\.Agent laboratory: Using llm agents as research assistants\.*arXiv preprint arXiv:2501\.04227*\.
- Shao et al\. \(2025\)Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G Finlayson, David Sontag, and 1 others\. 2025\.Dr tulu: Reinforcement learning with evolving rubrics for deep research\.*arXiv preprint arXiv:2511\.19399*\.
- Shao et al\. \(2024\)Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others\. 2024\.Deepseekmath: Pushing the limits of mathematical reasoning in open language models\.*arXiv preprint arXiv:2402\.03300*\.
- Silver et al\. \(2017\)David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, and 1 others\. 2017\.Mastering chess and shogi by self\-play with a general reinforcement learning algorithm\.*arXiv preprint arXiv:1712\.01815*\.
- Singh et al\. \(2025\)Amanpreet Singh, Joseph Chee Chang, Chloe Anastasiades, Dany Haddad, Aakanksha Naik, Amber Tanaka, Angele Zamarron, Cecile Nguyen, Jena D Hwang, Jason Dunkleberger, and 1 others\. 2025\.Ai2 scholar qa: Organized literature synthesis with attribution\.*arXiv preprint arXiv:2504\.10861*\.
- Song et al\. \(2025\)Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji\-Rong Wen\. 2025\.R1\-searcher: Incentivizing the search capability in llms via reinforcement learning\.*arXiv preprint arXiv:2503\.05592*\.
- Team et al\. \(2025a\)MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, and 1 others\. 2025a\.Mirothinker: Pushing the performance boundaries of open\-source research agents via model, context, and interactive scaling\.*arXiv preprint arXiv:2511\.11793*\.
- Team et al\. \(2025b\)Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, and 1 others\. 2025b\.Tongyi deepresearch technical report\.*arXiv preprint arXiv:2510\.24701*\.
- Tesauro et al\. \(1995\)Gerald Tesauro and 1 others\. 1995\.Temporal difference learning and td\-gammon\.*Communications of the ACM*, 38\(3\):58–68\.
- Viswanathan et al\. \(2025\)Vijay Viswanathan, Yanchao Sun, Shuang Ma, Xiang Kong, Meng Cao, Graham Neubig, and Tongshuang Wu\. 2025\.Checklists are better than reward models for aligning language models\.*arXiv preprint arXiv:2507\.18624*\.
- Wan et al\. \(2026\)Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, and Michael R Lyu\. 2026\.Inference\-time scaling of verification: Self\-evolving deep research agents via test\-time rubric\-guided verification\.*arXiv preprint arXiv:2601\.15808*\.
- Wang et al\. \(2025\)Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, and Mengdi Wang\. 2025\.Cure: Co\-evolving coders and unit testers via reinforcement learning\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*\.
- Wei et al\. \(2024\)Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus\. 2024\.Measuring short\-form factuality in large language models\.*arXiv preprint arXiv:2411\.04368*\.
- Wu et al\. \(2025a\)Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, and 1 others\. 2025a\.Webdancer: Towards autonomous information seeking agency\.*arXiv preprint arXiv:2505\.22648*\.
- Wu et al\. \(2025b\)Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and 1 others\. 2025b\.Webwalker: Benchmarking llms in web traversal\.*arXiv preprint arXiv:2501\.07572*\.
- Wu et al\. \(2025c\)Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin\. 2025c\.Agentic reasoning: A streamlined framework for enhancing llm reasoning with agentic tools\.*arXiv preprint arXiv:2502\.04644*\.
- Wu et al\. \(2024\)Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu\. 2024\.Self\-play preference optimization for language model alignment\.*arXiv preprint arXiv:2405\.00675*\.
- xAI \(2025\)xAI\. 2025\.Grok 3 beta — the age of reasoning agents\.[https://x\.ai/news/grok\-3](https://x.ai/news/grok-3)\.
- Yang et al\. \(2025\)An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others\. 2025\.Qwen3 technical report\.*arXiv preprint arXiv:2505\.09388*\.
- Yao et al\. \(2022\)Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao\. 2022\.React: Synergizing reasoning and acting in language models\.In*The eleventh international conference on learning representations*\.
- Yao et al\. \(2026\)Yi Yao, He Zhu, Piaohong Wang, Jincheng Ren, Xinlong Yang, Qianben Chen, Xiaowan Li, Dingfeng Shi, Jiaxian Li, Qiexiang Wang, and 1 others\. 2026\.O\-researcher: An open ended deep research model via multi\-agent distillation and agentic rl\.*arXiv preprint arXiv:2601\.03743*\.
- Yifei et al\. \(2025\)Li S Yifei, Allen Chang, Chaitanya Malaviya, and Mark Yatskar\. 2025\.Researchqa: Evaluating scholarly question answering at scale across 75 fields with survey\-mined questions and rubrics\.*arXiv preprint arXiv:2509\.00496*\.
- Yu et al\. \(2025\)Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, and 1 others\. 2025\.Dapo: An open\-source llm reinforcement learning system at scale\.*arXiv preprint arXiv:2503\.14476*\.
- Yuan et al\. \(2024\)Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason E Weston\. 2024\.Self\-rewarding language models\.In*Forty\-first International Conference on Machine Learning*\.
- Yuan et al\. \(2025\)Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Ilia Kulikov, Kyunghyun Cho, Dong Wang, Yuandong Tian, Jason E Weston, and 1 others\. 2025\.Naturalreasoning: Reasoning in the wild with 2\.8 m challenging questions\.*arXiv preprint arXiv:2502\.13124*\.
- Yue et al\. \(2026\)Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, and Dong Wang\. 2026\.Dr\. zero: Self\-evolving search agents without training data\.*arXiv preprint arXiv:2601\.07055*\.
- Zhao et al\. \(2025\)Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang\. 2025\.Absolute zero: Reinforced self\-play reasoning with zero data\.*arXiv preprint arXiv:2505\.03335*\.
- Zheng et al\. \(2025a\)Chujie Zheng, Shixuan Liu, Mingze Li, Xiong\-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, and 1 others\. 2025a\.Group sequence policy optimization\.*arXiv preprint arXiv:2507\.18071*\.
- Zheng et al\. \(2025b\)Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, and Pengfei Liu\. 2025b\.Deepresearcher: Scaling deep research via reinforcement learning in real\-world environments\.*arXiv preprint arXiv:2504\.03160*\.

## Appendix ARelated Work

We summarize the contribution of HOTE in Table[5](https://arxiv.org/html/2606.13710#A1.T5)\.

Table 5:The contribution of HOTE\.### A\.1Deep Research Agents

Deep research, defined as AI agents’ capability to handle open\-ended, long\-term, and highly complex information retrieval and integration, has become key for AI agents to move beyond conversational interaction toward general autonomyHu et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib11)\); OpenAI \([2025](https://arxiv.org/html/2606.13710#bib.bib21)\)\.

On the inference front,Wu et al\. \([2025c](https://arxiv.org/html/2606.13710#bib.bib38)\); Qin et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib22)\); Schmidgall et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib23)\)have shown that constructing complex workflows and context management can lead to substantial performance improvements\. However, such methods rely on manual prompting, lack generality and flexibility, and make it difficult to evaluate the inherent autonomous agent capabilities of the modelLi et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib15)\)\. On the training front, research has primarily focused on how to end\-to\-end train autonomous deep research agents based on flexible reasoning paradigms similar to ReActYao et al\. \([2022](https://arxiv.org/html/2606.13710#bib.bib42)\), enabling them to self\-plan, acquire knowledge and summarize\. Search\-R1Jin et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib14)\)applies reinforcement learning with verifiable rewards \(RLVR\) to enhance search capabilities and is trained mainly on short\-form question answeringWei et al\. \([2024](https://arxiv.org/html/2606.13710#bib.bib35)\); Wu et al\. \([2025b](https://arxiv.org/html/2606.13710#bib.bib37)\); Ho et al\. \([2020](https://arxiv.org/html/2606.13710#bib.bib10)\)\. This approach has been explored in many recent follow\-up studies, including WebExplorerLiu et al\. \([2025b](https://arxiv.org/html/2606.13710#bib.bib18)\), Tongyi Deep ResearchTeam et al\. \([2025b](https://arxiv.org/html/2606.13710#bib.bib30)\)and WebSailor\-V2Li et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib15)\)\. WebThinkerLi et al\. \([2025b](https://arxiv.org/html/2606.13710#bib.bib16)\)and MiroThinkerTeam et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib29)\)extend training to longer report generation and more rounds of tool usage\. To address the lack of clearly defined evaluation metrics for long\-form deep research responses, DR TuluShao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib24)\)proposes Reinforcement Learning via Evolving Rubrics \(RLER\), which dynamically updates evaluation rubrics based on sampled policy responses\. Although the above studies enable agents to autonomously conduct research based on user queries, they lack a process for autonomous exploration and improvement of deep research capabilities\. Dr\. ZeroYue et al\. \([2026](https://arxiv.org/html/2606.13710#bib.bib48)\)designs a framework based on search\-based proposer–solver self\-play, enabling the two to co\-evolve without exposure to any training data, but it is limited to short\-form and easily verifiable question answering\.

Therefore, we propose the first deep research agent evolution framework that supports open\-ended long\-form report generation tasks, aiming to achieve both practicality and autonomy simultaneously\.

### A\.2Agent Evolving with Self\-play

Agent evolution has long been regarded as a pathway toward achieving artificial general intelligence, signifying the capability of agents to autonomously interact with the environment and continuously learnLiu et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib17)\)\. Self\-play offers a highly promising paradigm for agent evolution, wherein an agent system learns from feedback automatically generated through competition with itself\. In the domain of games, self\-play has led to achievements such as TD\-Gammon’s backgammon masteryTesauro et al\. \([1995](https://arxiv.org/html/2606.13710#bib.bib31)\), AlphaGo’s superhuman performance in GoSilver et al\. \([2017](https://arxiv.org/html/2606.13710#bib.bib26)\), and CICERO’s capability to understand cooperative strategiesFAIR et al\. \([2022](https://arxiv.org/html/2606.13710#bib.bib6)\)\. In the field of large language models, some approaches enable models to serve dual roles as solver and judge, optimizing strategies without the need for human annotationChen et al\. \([2024](https://arxiv.org/html/2606.13710#bib.bib4)\); Wu et al\. \([2024](https://arxiv.org/html/2606.13710#bib.bib39)\); Yuan et al\. \([2024](https://arxiv.org/html/2606.13710#bib.bib46)\); Wan et al\. \([2026](https://arxiv.org/html/2606.13710#bib.bib33)\)\. However, such evolution is constrained by the queries in the training set, limiting the model’s ability to autonomously explore new knowledge and skills\. By assigning the agent system the roles of both query proposer and solver, significant improvements have been achieved in areas such as mathematics, coding, and general reasoning, surpassing the limitations of the original training set and even demonstrating zero\-data effectivenessHuang et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib13)\); Zhao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib49)\); Wang et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib34)\); Chen et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib3)\)\. To further overcome the inherent limitations of agent capabilities, methods such as SPICELiu et al\. \([2025a](https://arxiv.org/html/2606.13710#bib.bib17)\)and Dr\. ZeroYue et al\. \([2026](https://arxiv.org/html/2606.13710#bib.bib48)\)provide proposers with large\-scale corpora and search engines, facilitating the evolution of agent systems in open\-ended environments\. However, existing approaches remain confined to verifiable tasks, falling short of addressing the reality of numerous open\-ended tasks with ambiguous or undefined boundaries encountered in real\-world applications\.

Therefore, we propose an open\-ended agent evolution framework tailored to open\-ended tasks that are difficult to verify\. Through mutual play among proposer, solver and judge, the framework enables collaborative evolution with web\-scale knowledge\.

## Appendix BProof of Expected Maximum Generation Time Comparison

We formally derive the inequality between the expected maximum generation time of atool\-usestrategy and ahybrid\-modestrategy\.

### B\.1Problem Setup

LetXXdenote the random variable representing the generation time for thetool\-usemode, andYYdenote the generation time for theno\-toolmode\. We assume these follow normal distributions with identical variancesσ2\\sigma^\{2\}but distinct means:

X\\displaystyle X∼𝒩​\(μT,σ2\),\\displaystyle\\sim\\mathcal\{N\}\(\\mu\_\{T\},\\sigma^\{2\}\),\(6\)Y\\displaystyle Y∼𝒩​\(μN,σ2\),\\displaystyle\\sim\\mathcal\{N\}\(\\mu\_\{N\},\\sigma^\{2\}\),\(7\)whereμT\>μN\\mu\_\{T\}\>\\mu\_\{N\}\. LetFX​\(t\)F\_\{X\}\(t\)andFY​\(t\)F\_\{Y\}\(t\)denote the cumulative distribution functions \(CDFs\) ofXXandYY, respectively\. SinceμT\>μN\\mu\_\{T\}\>\\mu\_\{N\}and the variances are equal, we have the strict inequality for the CDFs:

FX​\(t\)<FY​\(t\),∀t∈ℝ\.F\_\{X\}\(t\)<F\_\{Y\}\(t\),\\quad\\forall t\\in\\mathbb\{R\}\.\(8\)
We consider the number of generations of the solver asKK\.

- •Strategy A \(tool\-use\):The maximum generation timeMAM\_\{A\}is defined as the maximum ofKKindependent and identically distributed \(i\.i\.d\.\) variablesX1,…,XK∼XX\_\{1\},\\dots,X\_\{K\}\\sim X: MA=max⁡\{X1,…,XK\}\.M\_\{A\}=\\max\\\{X\_\{1\},\\dots,X\_\{K\}\\\}\.\(9\)
- •Strategy B \(hybrid mode\):The maximum generation timeMBM\_\{B\}is defined as the maximum ofK/2K/2variables of typeXXandK/2K/2variables of typeYY, all mutually independent: MB=max⁡\{X1,…,XK/2,Y1,…,YK/2\}\.M\_\{B\}=\\max\\\{X\_\{1\},\\dots,X\_\{K/2\},Y\_\{1\},\\dots,Y\_\{K/2\}\\\}\.\(10\)

###### Theorem B\.1\.

The expected maximum generation time of Strategy A is strictly greater than that of Strategy B, i\.e\.,E​\[MA\]\>E​\[MB\]E\[M\_\{A\}\]\>E\[M\_\{B\}\]\.

###### Proof\.

First, we derive the cumulative distribution functions for the random variablesMAM\_\{A\}andMBM\_\{B\}\. For anyt∈ℝt\\in\\mathbb\{R\}, the probability that the maximum of a set of independent variables is less than or equal tottis the product of their individual probabilities\.

For Strategy A:

P​\(MA≤t\)=∏i=1KP​\(Xi≤t\)=\[FX​\(t\)\]K\.P\(M\_\{A\}\\leq t\)=\\prod\_\{i=1\}^\{K\}P\(X\_\{i\}\\leq t\)=\[F\_\{X\}\(t\)\]^\{K\}\.\(11\)
For Strategy B:

P\(MB≤t\)=\(∏i=1K/2P\(Xi≤t\)\)⋅\\displaystyle P\(M\_\{B\}\\leq t\)=\\left\(\\prod\_\{i=1\}^\{K/2\}P\(X\_\{i\}\\leq t\)\\right\)\\cdot\(∏j=1K/2P​\(Yj≤t\)\)=\[FX​\(t\)\]K/2​\[FY​\(t\)\]K/2\.\\displaystyle\\left\(\\prod\_\{j=1\}^\{K/2\}P\(Y\_\{j\}\\leq t\)\\right\)=\[F\_\{X\}\(t\)\]^\{K/2\}\[F\_\{Y\}\(t\)\]^\{K/2\}\.\(12\)
Using the inequality from Eq\. \([8](https://arxiv.org/html/2606.13710#A2.E8)\), whereFX​\(t\)<FY​\(t\)F\_\{X\}\(t\)<F\_\{Y\}\(t\)for alltt, and noting thatFX​\(t\)\>0F\_\{X\}\(t\)\>0for sufficiently largett, we compare the two probabilities:

P​\(MB≤t\)\\displaystyle P\(M\_\{B\}\\leq t\)=\[FX​\(t\)\]K/2​\[FY​\(t\)\]K/2\\displaystyle=\[F\_\{X\}\(t\)\]^\{K/2\}\[F\_\{Y\}\(t\)\]^\{K/2\}\>\[FX​\(t\)\]K/2​\[FX​\(t\)\]K/2\\displaystyle\>\[F\_\{X\}\(t\)\]^\{K/2\}\[F\_\{X\}\(t\)\]^\{K/2\}=\[FX​\(t\)\]K\\displaystyle=\[F\_\{X\}\(t\)\]^\{K\}=P​\(MA≤t\)\.\\displaystyle=P\(M\_\{A\}\\leq t\)\.\(13\)Thus,P​\(MB≤t\)\>P​\(MA≤t\)P\(M\_\{B\}\\leq t\)\>P\(M\_\{A\}\\leq t\)for allttwhereFX​\(t\)\>0F\_\{X\}\(t\)\>0\. This implies thatMAM\_\{A\}stochastically dominatesMBM\_\{B\}\(first\-order stochastic dominance\)\.

In terms of the survival function \(tail probability\), this inequality is reversed:

P​\(MA\>t\)\\displaystyle P\(M\_\{A\}\>t\)=1−P​\(MA≤t\)\\displaystyle=1\-P\(M\_\{A\}\\leq t\)\(14\)\>1−P​\(MB≤t\)\\displaystyle\>1\-P\(M\_\{B\}\\leq t\)\(15\)=P​\(MB\>t\)\.\\displaystyle=P\(M\_\{B\}\>t\)\.\(16\)
The expected value of a random variableZZcan be expressed as the integral of its survival function over its support\. Assuming the support covers the real line:

E​\[Z\]\\displaystyle E\[Z\]=∫−∞∞t​fZ​\(t\)​𝑑t\\displaystyle=\\int\_\{\-\\infty\}^\{\\infty\}tf\_\{Z\}\(t\)dt\(17\)=∫0∞P​\(Z\>t\)​𝑑t−∫−∞0P​\(Z≤t\)​𝑑t\.\\displaystyle=\\int\_\{0\}^\{\\infty\}P\(Z\>t\)dt\-\\int\_\{\-\\infty\}^\{0\}P\(Z\\leq t\)dt\.\(18\)Given the stochastic dominance established above, the strict inequality holds for the expectation:

E​\[MA\]\>E​\[MB\]\.E\[M\_\{A\}\]\>E\[M\_\{B\}\]\.\(19\)∎

## Appendix CCase Study

We conducted case studies on HOTE\-8B and DR Tulu\-8B to illustrate the advantages of HOTE\. We omit thethinkandtoolbecause of they are too long\. In practical applications, the final research report will be additionally appended with the searched references\. In Case 1, as shown in Figure[7](https://arxiv.org/html/2606.13710#A8.F7)\-[8](https://arxiv.org/html/2606.13710#A8.F8), HOTE demonstrates: \(a\) more comprehensive information: the response from HOTE\-8B provides detailed citations from EACS guidelines, including baseline examination items \(viral load, CD4 count, complete blood count, metabolic indicators, TB screening, opportunistic infection assessment, etc\.\); \(b\) better structure: the answer is clearly organized into sections such as "Summary", "Baseline workup" and "Virologic check points"; \(c\) stronger contextual awareness: it correctly identifies that this is a question for medical professionals, offering detailed guidelines suitable for their level\. In contrast, DR Tulu offers a more concise response, presenting only "Bottom line" recommendations and lacking a complete monitoring timeline and baseline examination details\. In Case 2, as shown in Figure[10](https://arxiv.org/html/2606.13710#A8.F10)\-[14](https://arxiv.org/html/2606.13710#A8.F14), HOTE \(a\) correctly identifies an emergency: clearly states that "acute angle‑closure glaucoma is a true ophthalmic emergency" requiring "immediate evaluation and treatment to prevent rapid, irreversible vision loss"; \(b\) provides specific action advice: explains what the patient should do \(seek evaluation by an ophthalmologist\) and what examinations the doctor will perform; \(c\) offers complete clinical information: including symptom descriptions \(severe eye pain, blurred vision, halos, headache, nausea\) and treatment methods \(laser peripheral iridotomy\)\. DR Tulu, while providing background medical knowledge, fails to clearly inform the patient that this is an emergency requiring immediate medical attention\.

## Appendix DAlgorithm

We provide the complete training process of HOTE in Algorithm[1](https://arxiv.org/html/2606.13710#alg1)\.

## Appendix EThe effect of judge models for training, prompts and diverse proposing

We further trained with Qwen3\-235BA22B\-Think and Qwen3\-30BA3B\-Instruct as the judge model \(2507\-FP8 version\), along with the average wall\-clock time per step\. The results in Table[6](https://arxiv.org/html/2606.13710#A5.T6)show that: \(i\) a smaller\-scale judge model leads to a moderate performance degradation; \(ii\) a thinking model achieves nearly identical performance but substantially reduces training efficiency\. Therefore, we recommend using large\-scale open\-source instruct models to strike a balance between effectiveness and computational overhead\. We consistently set: temperature=0, max\_tokens=16384, top\_p=1\.0\.

The prompts are role\-defining system instructions for the proposer, solver, and judge modules, designed to specify each module’s task and output format\. The samples are minimal format demonstrations and are not part of the evaluation benchmarks\. To test sensitivity, we replaced samples and rephrased role\-defining instructions with three different sets on HOTE and three baselines\. We observed negligible impact on final performance in Table[9](https://arxiv.org/html/2606.13710#A5.T9), suggesting that the method is not materially dependent on a particular prompt/sample choice\.

As shown in Table[7](https://arxiv.org/html/2606.13710#A5.T7), the diverse proposing effectively improve the performance on three benchmarks, illustrating its importance in ensuring the quality of proposed research tasks\.

Table 6:Performance and training efficiency under different judge models\.Table 7:The effect of diverse proposing\.Table 8:Performance statistics across three evaluation runs\.Table 9:Performance comparison across different samples and role\-defining instructions\.
## Appendix FDetails

### F\.1Implementation details

ForRL MethodsandEvolving Methodsthat we can fully control the training process, since long\-form deep research tasks do not have standard reference answers, we consistently adapted them from RLVR to rubric\-based reward followingShao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib24)\)without judge evolution\. ForEvolving Methodsincluding SPICE and Dr\. Zero, we also consistently adapted them in the same manner as HOTE by utilizing Qwen3\-8B to initialize the proposer checkpoint and DR Tulu\-8B\-SFT to initialize the solver checkpoint\. ForOpen Deep Research Models,Open Deep Research,RL Methods,Evolving Methodsand HOTE that we can fully control the inference process, we use Serper API for google\_search, Jina API for web\_browse and Semantic Scholar API for paper\_search\. We ensured that no data from the benchmark was added to the training set, and we also blocked search tools from accessing the benchmark website\. ForClosed Deep Researchthat we cannot fully control the training and inference process, we also provide their results for reference followingShao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib24)\)but not for a strict comparison with them\. We plan to fully release our models and codes upon acceptance\.

### F\.2Benchmark details

Judge\. To avoid the model simply using the biases of the judge during training, and also to follow the official evaluation of HealthBenchArora et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib2)\), DRBDu et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib5)\)and ResearchQAYifei et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib44)\), different judge models were employed for different benchmarks: GPT\-4\.1 was used for Healthbench; Gemini\-2\.5\-flash for DRB; and GPT\-4\.1\-mini for ResearchQA\. Higher scores consistently indicate better quality across all benchmarks\. HealthBench calculates a normalized score based on physician\-created rubrics that reward desired behaviors and penalize undesirable ones; ResearchQA measures the thoroughness of addressing literature\-derived criteria on a 0\-100% scale; and DRB computes a macro\-average score across four quality dimensions via comparison against high\-quality reference reports\. Please refer to the references for specific implementation\.

Reliability\. We run the evaluation of HOTE on all three benchmarks three times\. As shown in Table[8](https://arxiv.org/html/2606.13710#A5.T8), the standard deviations are small and HOTE consistently maintains its lead\. LLM\-as\-a\-judge can provide a stable evaluation for the three benchmarks\. Besides, human experts are substantially involved in rubric design across all three benchmarks to ensure the reliability: HealthBench uses conversation\-specific rubrics written by 262 physicians, with consensus criteria added only when a majority of reviewing physicians agree they are relevant; ResearchQA derives query\-specific rubrics from expert\-written survey sections and further validates them with 31 Ph\.D\. annotators across 8 fields; and DRB builds on tasks crafted and iteratively refined by domain experts, while its adaptive criteria are anchored in four top\-level dimensions established from domain expertise: comprehensiveness, insight, instruction\-following, and readability\.

The choice of benchmark\. The chosen benchmarks are for a complementary evaluation in different domains: they evaluate distinct aspects of long\-form deep research quality including \(Healthbench\) health\-related safety and communication quality, \(Researchqa\) scholarly synthesis across 7 research domains \(Life & Earth Sciences, Engineering & Computer Science, Physical Sciences, Health Sciences & Medicine, Social Sciences, Humanities, Economics\), and \(DRB\) end\-to\-end deep\-research report quality across 22 domains \(Science & Technology, Finance & Business, Software Development, Education & Jobs, Health, Literature, History, Hardware, Industrial, Art & Design, Games, Crime & Law, Entertainment, Sports & Fitness, Software, Transportation, Religion, Home & Hobbies, Travel, Food & Dining, Fashion & Beauty, Social Life\) rather than a single narrow criterion\.

## Appendix GHyperparameter analysis

We used HealthBench, ResearchQA and DRB to analyze the impact of different batch sizesBB, solver group sizesGG, proposer group sizesG′G^\{\\prime\}, and the numbers of training steps in no\-tool mode and tool\-use mode\. As shown in Table[10](https://arxiv.org/html/2606.13710#A7.T10), increasingBB,GG, andG′G^\{\\prime\}first improves performance and then leads to a plateau\. Therefore, we selectB=48B=48,G=8G=8, andG′=6G^\{\\prime\}=6\. We further conduct the tool\-use training until convergence after different steps of no\-tool training to explore the effect of no\-tool steps\. As shown in Table[10](https://arxiv.org/html/2606.13710#A7.T10), increasing the number of no\-tool training steps first improves performance and then causes a decline, possibly because the model becomes overly reliant on parametric knowledge as training progresses\. Therefore, we train the no\-tool mode for 600 steps\. For the learning rate, maximum number of tool uses per response, temperature, and response length, we reused the hyperparameter settings ablated inShao et al\. \([2025](https://arxiv.org/html/2606.13710#bib.bib24)\)\.

Table 10:Hyperparameter analysis on ResearchQA, HealthBench, and DRB\.
## Appendix HSpecific prompts

Figure[15](https://arxiv.org/html/2606.13710#A8.F15)and Figure[16](https://arxiv.org/html/2606.13710#A8.F16)show the system prompts for the solver in tool\-use mode\. Figure[17](https://arxiv.org/html/2606.13710#A8.F17)shows the system prompt for the solver in no\-tool mode\. Figure[18](https://arxiv.org/html/2606.13710#A8.F18)and Figure[19](https://arxiv.org/html/2606.13710#A8.F19)show the system prompts for the proposer in tool\-use mode\. Figure[20](https://arxiv.org/html/2606.13710#A8.F20)shows the user prompt for the proposer in tool\-use mode\. Figure[21](https://arxiv.org/html/2606.13710#A8.F21)and Figure[22](https://arxiv.org/html/2606.13710#A8.F22)show the system prompt and user prompt for the proposer in no\-tool mode, respectively\. Figure[23](https://arxiv.org/html/2606.13710#A8.F23)and Figure[24](https://arxiv.org/html/2606.13710#A8.F24)show the system prompts for the judge updating rubrics according to Equation[3](https://arxiv.org/html/2606.13710#S2.E3)\. Figure[25](https://arxiv.org/html/2606.13710#A8.F25)shows the system prompt for the judge assigning rewards based on rubrics\. Figure[26](https://arxiv.org/html/2606.13710#A8.F26)and Figure[27](https://arxiv.org/html/2606.13710#A8.F27)show the system prompts for the judge generating meta rubrics\.

Algorithm 1Dual\-mode Hybrid Training Strategy for HOTE0:Solver

πθs\\pi\_\{\\theta\_\{s\}\}, Proposer

πθp\\pi\_\{\\theta\_\{p\}\}, Judge

πθj\\pi\_\{\\theta\_\{j\}\}\.

0:Training dataset

𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}\.

0:Hyperparameters: Batch size

BB, Group size

GG, Number of diverse proposing groups

NN\.

1:Initialize:Set initial synthetic tasks

𝒟syn=∅\\mathcal\{D\}\_\{\\text\{syn\}\}=\\varnothing\.

2:whilenot convergeddo

3:// 1\. Hybrid Data Preparation

4:Sample real tasks

𝒟real\\mathcal\{D\}\_\{\\text\{real\}\}of size

B/2B/2from

𝒟train\\mathcal\{D\}\_\{\\text\{train\}\}\.

5:Construct current batch

𝒮←𝒟real∪𝒟syn\\mathcal\{S\}\\leftarrow\\mathcal\{D\}\_\{\\text\{real\}\}\\cup\\mathcal\{D\}\_\{\\text\{syn\}\}\.

6:// 2\. Hybrid Mode Assignment

7:Randomly assign inference mode

m∈\{tool\-use,no\-tool\}m\\in\\\{\\texttt\{tool\-use\},\\texttt\{no\-tool\}\\\}to each task in

𝒮\\mathcal\{S\}\(

50%50\\%each\)\.

8:// 3\. Solver Rollout

9:For each task

s0∈𝒮s\_\{0\}\\in\\mathcal\{S\}, sample

GGresponses

\{oi\}i=1G∼πθs\(⋅∣s0\)\\\{o\_\{i\}\\\}\_\{i=1\}^\{G\}\\sim\\pi\_\{\\theta\_\{s\}\}\(\\cdot\\mid s\_\{0\}\)under assigned mode

mm\.

10:// 4\. Judge Evolution & Evaluation

11:foreach task

s0∈𝒮s\_\{0\}\\in\\mathcal\{S\}do

12:Update active rubrics:

ℛs0active←Updateπθj​\(s0,\{oi\}i=1G,ℛs0active\)\\mathcal\{R\}^\{\\text\{active\}\}\_\{s\_\{0\}\}\\leftarrow\\text\{Update\}\_\{\\pi\_\{\\theta\_\{j\}\}\}\(s\_\{0\},\\\{o\_\{i\}\\\}\_\{i=1\}^\{G\},\\mathcal\{R\}^\{\\text\{active\}\}\_\{s\_\{0\}\}\)\(Equation[3](https://arxiv.org/html/2606.13710#S2.E3)\)\.

13:Calculate rewards

rir\_\{i\}for each response

oio\_\{i\}using

ℛs0\\mathcal\{R\}\_\{s\_\{0\}\}\(Equation[2](https://arxiv.org/html/2606.13710#S2.E2)\)\.

14:endfor

15:Collect assessments

𝒜\\mathcal\{A\}containing all rubrics and rewards\.

16:Generate meta rubrics

ℛmeta\\mathcal\{R\}^\{\\text\{meta\}\}summarizing weaknesses from

𝒜\\mathcal\{A\},

\{oi\}i=1G\\\{o\_\{i\}\\\}\_\{i=1\}^\{G\}and

ℛs0\\mathcal\{R\}\_\{s\_\{0\}\}\(Equation[4](https://arxiv.org/html/2606.13710#S2.E4)\)\.

17:// 5\. Solver Evolution

18:Update solver parameters

θs\\theta\_\{s\}via GRPO \(Equation[1](https://arxiv.org/html/2606.13710#S2.E1)\) using rewards

\{ri\}\\\{r\_\{i\}\\\}\.

19:// 6\. Proposer Evolution

20:if

𝒟syn≠∅\\mathcal\{D\}\_\{\\text\{syn\}\}\\neq\\varnothingthen

21:Calculate proposer rewards

\{rip\}i=1G′\\\{r\_\{i\}^\{p\}\\\}\_\{i=1\}^\{G^\{\\prime\}\}for tasks in

𝒟syn\\mathcal\{D\}\_\{\\text\{syn\}\}\(Eq\.[5](https://arxiv.org/html/2606.13710#S2.E5)\)\.

22:Update proposer parameters

θp\\theta\_\{p\}via GRPO using rewards

\{rip\}i=1G′\\\{r\_\{i\}^\{p\}\\\}\_\{i=1\}^\{G^\{\\prime\}\}\.

23:endif

24:// 7\. Diverse Proposing \(Next Step Synthetic Data\)

25:Sample

NNcombinations of tasks and corresponding assessments from

𝒮\\mathcal\{S\}\.

26:Proposer generates new synthetic tasks

𝒟syn′=\{oip\}i=1G′\\mathcal\{D\}\_\{\\text\{syn\}\}^\{\\prime\}=\\\{o\_\{i\}^\{p\}\\\}\_\{i=1\}^\{G^\{\\prime\}\}conditioned on combinations and

ℛmeta\\mathcal\{R\}^\{\\text\{meta\}\}\.

27:Update

𝒟syn←𝒟syn′\\mathcal\{D\}\_\{\\text\{syn\}\}\\leftarrow\\mathcal\{D\}\_\{\\text\{syn\}\}^\{\\prime\}for the next iteration\.

28:endwhile

![Refer to caption](https://arxiv.org/html/2606.13710v1/x7.png)Figure 7:Case 1 \(Part 1\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x8.png)Figure 8:Case 1 \(Part 2\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x9.png)Figure 9:Case 1 \(Part 3\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x10.png)Figure 10:Case 2 \(Part 1\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x11.png)Figure 11:Case 2 \(Part 2\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x12.png)Figure 12:Case 2 \(Part 3\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x13.png)Figure 13:Case 2 \(Part 4\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x14.png)Figure 14:Case 2 \(Part 5\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x15.png)Figure 15:System prompt of solver under tool\-use mode \(Part 1\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x16.png)Figure 16:System prompt of solver under tool\-use mode \(Part 2\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x17.png)Figure 17:System prompt of solver under no\-tool mode![Refer to caption](https://arxiv.org/html/2606.13710v1/x18.png)Figure 18:System prompt of proposer under tool\-use mode \(Part 1\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x19.png)Figure 19:System prompt of proposer under tool\-use mode \(Part 2\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x20.png)Figure 20:User prompt of proposer under tool\-use mode![Refer to caption](https://arxiv.org/html/2606.13710v1/x21.png)Figure 21:System prompt of proposer under no\-tool mode![Refer to caption](https://arxiv.org/html/2606.13710v1/x22.png)Figure 22:User prompt of proposer under no\-tool mode![Refer to caption](https://arxiv.org/html/2606.13710v1/x23.png)Figure 23:Rubric update system prompt of judge \(Part 1\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x24.png)Figure 24:Rubric update system prompt of judge \(Part 2\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x25.png)Figure 25:Judge system prompt of judge![Refer to caption](https://arxiv.org/html/2606.13710v1/x26.png)Figure 26:Meta rubric system prompt of judge \(Part 1\)![Refer to caption](https://arxiv.org/html/2606.13710v1/x27.png)Figure 27:Meta rubric system prompt of judge \(Part 2\)

Similar Articles

Self-Evolving Deep Research via Joint Generation and Evaluation

arXiv cs.CL

Researchers from HKUST, ByteDance, and UCL propose SCORE, a co-evolutionary training framework that jointly trains an LLM as both a deep research report generator and an evaluator, using a meta-harness to dynamically adjust evaluation difficulty and prevent reward saturation. Experiments show consistent improvement in open-ended research report quality.