DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

arXiv cs.AI Papers

Summary

This technical report introduces DuMate-DeepResearch, a multi-agent framework for deep research tasks that decouples the agent core from a tool ecosystem, and incorporates graph-based dynamic planning, recursive two-level execution, and rubric-based test-time optimization. The system achieves state-of-the-art results on two deep research benchmarks, demonstrating the value of auditable agent infrastructure.

arXiv:2606.07299v1 Announce Type: new

Cached at: 06/08/26, 09:14 AM

# DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning
Source: [https://arxiv.org/html/2606.07299](https://arxiv.org/html/2606.07299)
###### Abstract

Deep Research \(DR\) has emerged as a new agentic paradigm to tackle complex, open\-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long\-form reports\. In practice, however, current DR systems are constrained by four interrelated limitations: long\-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such tasks within a single agent, hallucination risk in long\-form synthesis, and limited process auditability\. This technical report presents DuMate\-DeepResearch, a multi\-agent DR framework built on the Qianfan Agent Foundry\. The framework decouples the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, making every intermediate decision and tool invocation explicitly traceable\. Building on this infrastructure, DuMate\-DeepResearch further introduces three mechanisms: \(i\) a *graph\-based dynamic planning* strategy that expands the research roadmap coarse\-to\-fine and continuously revises it through reflection, re\-planning, backtracking, and parallel branching; \(ii\) a *recursive two\-level execution* design that delegates each complex search sub\-task to an inner Search Agent that runs its own planning loop, isolating noisy retrieval and stabilizing long\-horizon execution; \(iii\) a *rubric\-based test\-time optimization* mechanism that dynamically generates task\-specific quality criteria and uses them as live reasoning scaffolds for evidence\-grounded synthesis and adaptive stopping\. Across two deep research benchmarks, DuMate\-DeepResearch establishes new state\-of\-the\-art results: the best overall score \(58\.03%\) on DeepResearch Bench, and the best overall score \(61\.95%\) on DeepResearch Bench II while ranking first in information recall and analysis\.
These results demonstrate the value of pairing auditable multi\-agent infrastructure with adaptive planning and rubric\-guided reasoning for high\-quality deep research\.

## 1 Introduction

The rapid advancement of artificial intelligence has catalyzed a paradigm shift from passive, single\-turn question\-answering systems to autonomous, agentic systems \(Yao et al\., [2023b](https://arxiv.org/html/2606.07299#bib.bib16); Wang et al\., [2024](https://arxiv.org/html/2606.07299#bib.bib17)\), enabling users to initiate complex research workflows from a research question\. In this context, Deep Research \(DR\) \(Zheng et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib4); Shi et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib2); Zhang et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib3); Du et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib5); Wang et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib7)\) has emerged as a crucial and highly challenging frontier to bridge the gap between human inquiry and systematic knowledge discovery\. While traditional retrieval\-augmented workflows are confined to single\-shot or rule\-based retrieval over static corpora \(Lewis et al\., [2020](https://arxiv.org/html/2606.07299#bib.bib18); Gao et al\., [2023](https://arxiv.org/html/2606.07299#bib.bib19)\), DR aims to replicate the rigorous, systematic investigative methodologies of human researchers\. To address complex, open\-ended problems, DR requires sophisticated long\-horizon reasoning, strategic decision\-making, and large\-scale information synthesis \(Shinn et al\., [2023](https://arxiv.org/html/2606.07299#bib.bib50); Yao et al\., [2023a](https://arxiv.org/html/2606.07299#bib.bib51)\)\.

To operationalize such demanding workflows, recent efforts have explored a spectrum of architectural paradigms\. Early systems adopted monolithic architectures \(e\.g\., OpenAI’s DeepResearch\), which tightly integrate all modules around a central reasoning engine, ensuring unified control flow but limiting scalability and tool extensibility\. Alternatively, pipeline architectures \(e\.g\., n8n workflows\) decompose the process into sequentially connected stages, facilitating component reuse but struggling with complex iteration and global feedback\. In response, agentic architectures have become a natural direction for DR systems\. By decomposing overarching research tasks and distributing them among autonomous agents with specialized roles, this collaborative paradigm improves scalability, parallel efficiency, and functional specialization for complex research scenarios\.

##### Core Workflow of Deep Research

Operating under this collaborative paradigm, the core workflow of modern agentic DR systems transcends a rigid linear pipeline, functioning instead as a closed\-loop, tool\-augmented process\. Given a complex, open\-ended research question, such a system transforms the high\-level request into a comprehensive report through a set of tightly coupled capabilities that typically include, but are not limited to, the following:

1. Problem Framing and Adaptive Planning: The system parses an underspecified research question into structured objectives and formulates a dynamic research roadmap, continuously revising its strategy as evidence accrues through sub\-goal refinement, query reformulation, and backtracking from informational dead\-ends\.
2. Evidence Acquisition and Verification: Driven by this roadmap, the system invokes a heterogeneous toolkit \(e\.g\., web search engines, scholarly databases, domain\-specific APIs\) to acquire information, while assessing source credibility and cross\-validating claims across sources to safeguard factual integrity\.
3. Synthesis and Report Generation: The validated evidence is finally integrated into a cohesive, logically structured report that weaves multi\-source findings into a coherent narrative with nuanced analysis and verifiable citations\.

##### The Key Challenges

However, realizing this idealized workflow in practice remains far from solved\. Current agentic DR systems still confront open challenges that limit their reliability for real\-world deployment:

- Long\-Horizon Planning and Dynamic Scope Definition: A research question unfolds into a long horizon of dozens of interdependent sub\-questions whose scope is underspecified at the outset and only crystallizes as evidence accrues\. Reactive, step\-by\-step policies that commit to a single next action, as in ReAct\-style agents, are inherently myopic: they optimize locally without a global representation of the trajectory, oscillate between unbounded exploration and premature convergence, and cannot coherently revise their strategy when a tool fails or newly retrieved evidence invalidates an earlier premise\. Effective DR therefore demands a planning formalism that maintains a global, far\-sighted model of the entire roadmap and continuously re\-delineates scope and re\-plans as the information state evolves\.
- Complex Task Decomposition and Scheduling: Even given a sound plan, decomposing and scheduling it for execution is where long trajectories most often break down\. A single flat agent can rarely reconcile high\-level task decomposition with the finer sub\-task decomposition, scheduling, and noise handling that each sub\-task in turn demands, since every sub\-question may itself entail many multi\-step retrieval actions over a stochastic web rife with dead links, API failures, and irrelevant or contradictory returns\. Folding global strategy and low\-level retrieval into one policy entangles the two and lets a single local failure propagate and cascade into the global trajectory\. Reliable DR thus requires an execution scheme that separates high\-level decomposition and scheduling from local sub\-task completion, confines noise and errors within sub\-task boundaries, and robustly carries out each sub\-task without destabilizing the overall process\.
- Hallucination Mitigation and Factual Grounding: Sustaining strict factual fidelity during long\-form synthesis over dynamic, multi\-source evidence streams is notoriously difficult, and the agent must additionally possess a principled criterion for when accumulated evidence is sufficient to halt exploration\. This calls for rigorous inference\-time scaffolds that calibrate every salient assertion against verifiable evidence as it is generated, and that terminate retrieval precisely when, and only when, the evidence demonstrably suffices, rather than relying on post\-hoc verification or fixed exploration budgets\.
- Process Explainability and Auditability: For DR to be trusted in high\-stakes domains, its autonomous reasoning must be rendered inspectable\. Systems should externalize their decision traces, tool invocations, and action paths as explicit, auditable artifacts, as transparent as the methodology appendix of a rigorous study, so that users can scrutinize not only the final report but the very process by which it was produced\.

To address these challenges, we present DuMate\-DeepResearch, an end\-to\-end multi\-agent research framework\. Built on top of the Qianfan Agent Foundry, our system decouples the central cognitive brain \(Agent Core\) from the versatile execution layer \(Tool Ecosystem\)\. This decoupling not only enables independent evolution of cognition and tooling, but also exposes every planning decision and tool invocation as an inspectable artifact, directly targeting the transparency and auditability challenge\. Furthermore, we equip the framework with three cognitive mechanisms tailored to DR: \(i\) a graph\-based dynamic planner that casts the research roadmap as an evolving directed acyclic graph, expanded coarse\-to\-fine and continuously revised through reflection, re\-planning, backtracking, and parallel branching\. Unlike myopic step\-by\-step ReAct\-style reasoning, this graph maintains a global, far\-sighted view of the entire trajectory and re\-thinks its strategy whenever a tool fails or new evidence overturns an earlier assumption, jointly delivering long\-horizon foresight and dynamic scope control; \(ii\) a recursive two\-level execution design, in which the outer Research Agent delegates every complex search sub\-task to an inner *Search Agent* that is itself a complete Foundry Agent running its own planning–execution cycle\. This nesting isolates noisy, multi\-step retrieval from high\-level research strategy, so that a single failed search cannot destabilize the global trajectory, which is the key to stable execution under stochastic web conditions; and \(iii\) a rubric\-based test\-time optimization mechanism that synthesizes question\-specific evaluation rubrics dynamically and uses them as inference\-time reasoning scaffolds to ground generated claims in retrieved evidence, while also providing an adaptive termination criterion\.

We conduct extensive experiments on two deep research benchmarks\. On DeepResearch Bench, DuMate\-DeepResearch attains the best overall score among strong commercial and open baselines, establishing new state\-of\-the\-art performance\. On DeepResearch Bench II, which evaluates reports through fine\-grained expert\-derived rubrics, DuMate\-DeepResearch also achieves the best overall score and leads on the information recall and analysis dimensions\. Together, these results provide consistent evidence that the proposed architecture improves both broad report quality and rubric\-grounded evidence acquisition and synthesis\.

In summary, the main contributions of this report are as follows:

- A decoupled multi\-agent infrastructure for auditable DR: We introduce the Qianfan Agent Foundry, a highly scalable architecture that implements a transparent *understanding–planning–execution* cyclic paradigm by separating the reasoning core from the tool ecosystem, yielding a DR pipeline whose entire trajectory is auditable\.
- A graph\-based dynamic planning algorithm: We represent the research roadmap as a dynamic directed acyclic graph expanded in a coarse\-to\-fine manner and equipped with reflection, re\-planning, backtracking, and parallel branching\. In contrast to myopic ReAct\-style reasoning that commits to one next action at a time, this graph sustains a global, far\-sighted view of the trajectory and self\-revises as evidence accumulates, jointly delivering long\-horizon foresight and adaptive scope control\.
- A recursive two\-level execution framework: We instantiate the Foundry paradigm *recursively*: the outer planning agent decomposes the deep\-research task into sub\-tasks, and each complex search sub\-task is in turn solved by an inner search agent that is itself a complete Foundry Agent with its own planning–execution cycle\. This nesting isolates noisy, multi\-step retrieval from high\-level strategy, preventing a single failed search from destabilizing the global trajectory and substantially improving execution stability\.
- Rubrics as test\-time reasoning scaffolds: We adapt dynamically generated rubrics from evaluation signals into inference\-time scaffolds that calibrate generation against retrieved evidence, supporting factual grounding and bounding exploration through an adaptive stopping criterion\.
- State\-of\-the\-art empirical performance: We conduct extensive experiments on DeepResearch Bench and DeepResearch Bench II\. The results demonstrate that DuMate\-DeepResearch outperforms existing commercial and open baselines on both benchmarks, establishing new state\-of\-the\-art performance across overall report quality, information recall, and analysis\.

## 2 DuMate\-DeepResearch Framework

DuMate\-DeepResearch is an end\-to\-end Deep Research Agent built upon the Qianfan Agent Foundry\. It follows an agentic loop of task understanding, planning, and execution to carry out complex, long\-horizon research tasks\.

##### Problem Formulation\.

DuMate\-DeepResearch organizes each research session as an auditable, evidence\-grounded state\-transition process\. Given a user query $q$, the Router produces a structured task specification; the Planner maintains an evolving research plan; the Execution Module invokes tools or Search Agents and accumulates evidence; and a rubric\-guidance signal steers planning, stopping, and writing\. This design allows the system to revise its research path while preserving the global report structure and the evidence trail\.

We formalize this loop as a state\-transition system over long\-horizon research trajectories\. At iteration $t$, the agent maintains a research state

$$s_t = \langle z,\; p_t,\; e_t,\; \rho_t \rangle, \tag{1}$$

where $z = (x, \mathcal{O})$ is the fixed task context that bundles the research topic $x$ and the report outline $\mathcal{O}$; $p_t$ is the current research plan; $e_t$ is the accumulated evidence base collected from completed actions; and $\rho_t$ is the current guidance signal\. Later subsections instantiate $p_t$ as a graph\-structured plan \(Section [2\.2\.1](https://arxiv.org/html/2606.07299#S2.SS2.SSS1)\) and $\rho_t$ as a rubric\-based control signal \(Section [2\.2\.3](https://arxiv.org/html/2606.07299#S2.SS2.SSS3)\)\. The increment $\Delta e_t$ contains newly collected evidence lists and evidence summaries returned by direct tool actions or Search Agents, including source\-grounded records and consolidated findings for executed sub\-tasks; the global evidence base is their accumulation over cycles\. Starting from $s_0 = \langle z, p_0, \varnothing, \rho_0 \rangle$, each cycle plans a set of executable actions $a_t$, executes them to obtain newly collected evidence $\Delta e_t$, and folds the new information and updated guidance back into the state,

$$s_{t+1} = \mathcal{T}\bigl(s_t,\, a_t,\, \Delta e_t\bigr). \tag{2}$$

The loop continues until a stopping predicate $\textsc{Stop}(s_t)$ holds \(for example, when the plan is fully explored or the current guidance signal reports no outstanding evidence gap\), after which the Writer synthesizes the long\-form report $y$ from the accumulated evidence\. The three subsequent parts instantiate this loop: Section [2\.1\.1](https://arxiv.org/html/2606.07299#S2.SS1.SSS1) details the Router, Planner, and Execution modules; Section [2\.2\.1](https://arxiv.org/html/2606.07299#S2.SS2.SSS1) specifies the graph\-structured transition; and Section [2\.2\.3](https://arxiv.org/html/2606.07299#S2.SS2.SSS3) defines the rubric mechanism that implements the guidance signal\. Algorithm [1](https://arxiv.org/html/2606.07299#alg1) states the overall control loop\.

![Refer to caption](https://arxiv.org/html/2606.07299v1/x1.png) Figure 1: The illustration for the Qianfan Agent Foundry\.

### 2\.1 Qianfan Agent Foundry

As a foundational infrastructure designed for general LLM\-based agent construction, the Qianfan Agent Foundry consists of two decoupled components \(illustrated in Figure[1](https://arxiv.org/html/2606.07299#S2.F1)\): the Agent Core and the Agent Extension \(Tool Ecosystem\)\. While the Agent Core functions as the central cognitive brain—orchestrating reasoning, planning, and task scheduling—the Agent Extension serves as the versatile execution layer\. It provides a comprehensive suite of tools that empower the agent to interact with external environments, gather empirical evidence, and render final deliverables\. This decoupled architecture ensures both robust cognitive control and highly extensible execution capabilities\.

Algorithm 1: DuMate\-DeepResearch Agent Loop

1. Input: user query $q$, max iterations $T_{\max}$
2. $x \leftarrow \mathcal{U}(q)$ $\triangleright$ Router: task understanding and analysis
3. $\mathcal{O} \leftarrow \textsc{Outline}(x, e_{t_c})$; $z \leftarrow (x, \mathcal{O})$ $\triangleright$ Writer builds the outline from coarse\-exploration evidence $e_{t_c}$; then fixed
4. $p_0 \leftarrow \textsc{InitPlan}(x, \mathcal{O})$; $e_0 \leftarrow \varnothing$; $\rho_0 \leftarrow \textsc{InitGuidance}(x, \mathcal{O})$
5. $s_0 \leftarrow \langle z, p_0, e_0, \rho_0 \rangle$; $t \leftarrow 0$
6. while $t \le T_{\max}$ and not $\textsc{Stop}(s_t)$ do
7. $a_t \leftarrow \mathcal{P}(s_t)$ $\triangleright$ Planner: graph\-based dynamic planning
8. $\Delta e_t \leftarrow \mathcal{X}(s_t, a_t)$ $\triangleright$ Execution: evidence collection
9. $s_{t+1} \leftarrow \mathcal{T}(s_t, a_t, \Delta e_t)$ $\triangleright$ fold in evidence and updated guidance
10. $t \leftarrow t + 1$
11. end while
12. return $y \leftarrow \mathcal{W}(x, \mathcal{O}, e_t, \rho^{p})$ $\triangleright$ Writer: guidance\-conditioned synthesis
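The control loop of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `State` fields mirror Eq. (1), the plan is simplified to a flat list of pending sub-tasks, and `toy_planner`, `toy_executor`, `toy_transition`, and `toy_stop` are hypothetical stand-ins for the Planner, the Execution Module, the transition, and the stopping predicate.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class State:
    """Research state s_t = <z, p_t, e_t, rho_t> from Eq. (1)."""
    z: Tuple[str, str]                                 # fixed context (x, O)
    plan: List[str]                                    # p_t, simplified to pending sub-tasks
    evidence: List[str] = field(default_factory=list)  # e_t
    guidance: dict = field(default_factory=dict)       # rho_t


def run_loop(state, planner, executor, transition, stop, t_max):
    """Lines 6-11 of Algorithm 1: plan, execute, fold in evidence, repeat."""
    t = 0
    while t <= t_max and not stop(state):
        actions = planner(state)                    # a_t <- P(s_t)
        delta = executor(state, actions)            # Delta e_t <- X(s_t, a_t)
        state = transition(state, actions, delta)   # s_{t+1} <- T(s_t, a_t, Delta e_t)
        t += 1
    return state


# Hypothetical toy components, for illustration only.
def toy_planner(s):
    return s.plan[:1]                               # schedule one pending sub-task


def toy_executor(s, actions):
    return [f"evidence for {a}" for a in actions]


def toy_transition(s, actions, delta):
    remaining = [p for p in s.plan if p not in actions]
    return State(s.z, remaining, s.evidence + delta, s.guidance)


def toy_stop(s):
    return not s.plan                               # stop once the plan is fully explored


final = run_loop(State(("topic", "outline"), ["scope", "market"]),
                 toy_planner, toy_executor, toy_transition, toy_stop, t_max=10)
print(final.evidence)
```

Passing the four components as callables keeps the loop itself free of research policy, which matches the separation between the loop and its modules described in the text.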

#### 2\.1\.1 DuMate\-DeepResearch Core

The core of DuMate\-DeepResearch comprises several specialized modules that collaborate seamlessly to effectively handle deep research tasks\.

##### Router \(Task Understanding and Analysis\)

The Router module is responsible for the initial comprehension and deconstruction of the research task\. Given a user query, the Router extracts salient information and identifies the core research topic\. This information is consolidated into a structured representation \(e\.g\., a standardized JSON format\), which is crucial for downstream planning and execution\. Furthermore, the Router serves as an intelligent interface for user interaction: if the initial query is ambiguous or incomplete, the Router proactively prompts the user for clarification\. This design ensures that the research trajectory remains rigorously aligned with user expectations\. In the global loop, the Router produces the topic specification $x$, and the Planner schedules the Writer to generate the outline $\mathcal{O}$; together they define the context $z = (x, \mathcal{O})$ for all downstream planning\.
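As a rough illustration of what a structured task representation might look like, the sketch below serializes a small task specification to JSON and flags an underspecified query for clarification. The report does not give the schema, so the field names (`topic`, `constraints`, `needs_clarification`, `clarifying_question`) and the word-count heuristic are assumptions made for this example only.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TaskSpec:
    """Hypothetical structured output of the Router (field names are assumed)."""
    topic: str
    constraints: List[str] = field(default_factory=list)
    needs_clarification: bool = False
    clarifying_question: str = ""


def route(query: str) -> TaskSpec:
    """Toy router: treats very short queries as ambiguous (an assumed heuristic)."""
    text = " ".join(query.split())
    if len(text.split()) < 3:
        return TaskSpec(
            topic=text,
            needs_clarification=True,
            clarifying_question="Which aspect of this topic should the report focus on?",
        )
    return TaskSpec(topic=text)


spec = route("solid state battery supply chains")
print(json.dumps(asdict(spec), indent=2))
```

A real Router would use an LLM to extract the topic and decide whether to ask the user; the point here is only that downstream modules consume a fixed, machine-readable record rather than free text.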

##### Planner \(Task Thinking and Planning\)

The Planner module acts as the strategic engine, responsible for formulating the research methodology, reasoning through the investigative path, and planning future steps\. Utilizing the structured task representation from the Router, the Planner analyzes the current knowledge state to identify critical epistemic gaps\. It then strategically decomposes the overarching objective into tractable key research questions and actionable sub\-problems\. Based on this reasoning, the Planner selects the specific tools to be utilized and generates the corresponding parameters required for execution\. Its graph\-structured policy is developed in detail in Section[2\.2\.1](https://arxiv.org/html/2606.07299#S2.SS2.SSS1)\.

##### Execution Module \(Planner\-Following Task Scheduling and Execution\)

The Execution Module realizes the actions issued by the Planner, manages execution context, and collects the returned evidence; unlike the Router and Planner, it sets no research strategy of its own\. Depending on the action type, it routes execution to one of four targets: a *direct tool call*, whose interface it invokes and whose output it normalizes; a *Search Agent*, dispatched for open\-ended retrieval sub\-tasks and itself a Foundry Agent with a local planning loop \(Section [2\.2\.2](https://arxiv.org/html/2606.07299#S2.SS2.SSS2)\) rather than a single black\-box query; the *Writer*, a generation agent invoked with two prompts \(an outline prompt that turns the early coarse\-exploration evidence into the fixed outline $\mathcal{O}$, and a report prompt that synthesizes the accumulated evidence into the final long\-form report\); and a *lightweight reasoning* \(*llm*\) action that deduplicates, merges, and cross\-validates collected evidence without issuing new retrieval\. Supporting serial and parallel fan\-out across these targets, it acts as a scheduling and dispatch layer that carries out the Planner’s decisions while leaving every high\-level research choice to the Planner\.

The collaboration among these modules makes the research trajectory explicitly inspectable\. The Router maintains the structured task representation, the Planner records decision traces and sub\-task decompositions, and the Execution Module logs tool invocations and retrieved evidence\. As a result, users can inspect not only the final report but also the intermediate reasoning and action paths that produced it\.
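The dispatch-and-log behavior described above can be sketched as a small dispatcher that routes each action by type and appends every invocation to a trace. This is a minimal sketch under stated assumptions: the action dictionary shape, the type names (`tool`, `search_agent`, `writer`, `llm`), and the lambda handlers are hypothetical placeholders, not the Foundry's actual interfaces.

```python
from typing import Any, Callable, Dict, List, Tuple


def make_dispatcher(
    handlers: Dict[str, Callable[[dict], Any]],
) -> Tuple[Callable[[dict], Any], List[dict]]:
    """Return a dispatch function plus the trace list it appends to."""
    trace: List[dict] = []

    def dispatch(action: dict) -> Any:
        kind = action["type"]
        if kind not in handlers:
            raise ValueError(f"unknown action type: {kind}")
        params = action.get("params", {})
        result = handlers[kind](params)
        # Every invocation is logged so the path can be inspected afterwards.
        trace.append({"type": kind, "params": params, "result": result})
        return result

    return dispatch, trace


# Hypothetical handlers standing in for the four execution targets.
handlers = {
    "tool": lambda p: f"tool output for {p['name']}",
    "search_agent": lambda p: f"findings for {p['sub_task']}",
    "writer": lambda p: f"{p['mode']} draft",
    "llm": lambda p: sorted(set(p["evidence"])),   # toy deduplication
}

dispatch, trace = make_dispatcher(handlers)
dispatch({"type": "search_agent", "params": {"sub_task": "market size"}})
merged = dispatch({"type": "llm", "params": {"evidence": ["b", "a", "b"]}})
print(merged, len(trace))
```

Because the dispatcher holds no policy of its own, the trace is a complete record of what the Planner asked for and what came back, which is the property the auditability argument relies on.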

#### 2\.1\.2 DuMate\-DeepResearch Extension: The Tool Ecosystem

Complementing the cognitive core, DuMate\-DeepResearch integrates a comprehensive Tool Ecosystem\. Driven by the Execution Module, this ecosystem serves as the versatile execution layer for the "task scheduling and execution" phase, encompassing diverse tools for information retrieval, data analysis, and report generation\.

These tools are seamlessly integrated into the agentic execution framework, allowing for efficient coordination and utilization throughout the research process\. By leveraging this tool ecosystem, DuMate\-DeepResearch can effectively handle the diverse and complex requirements of deep research tasks, further enhancing its capabilities and performance in delivering high\-quality research outcomes\. We introduce two key tools in DuMate\-DeepResearch’s tool ecosystem as follows\.

##### Baidu Search Integration

Baidu Search provides the primary retrieval substrate for evidence acquisition in DuMate\-DeepResearch\. Rather than treating search as a single black\-box query, the Execution Module exposes retrieval as a set of structured actions, including query expansion, web search, direct URL crawling, page\-content extraction, and evidence normalization\. Returned snippets and pages are converted into evidence records that preserve source metadata, URLs, timestamps when available, and short summaries for downstream verification and citation\-aware synthesis\. This design separates retrieval infrastructure from research policy: the Planner and Search Agents decide what information is needed and how queries should evolve, while the Tool Ecosystem supplies traceable evidence for cross\-source checking and final report grounding\.
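One way to picture the evidence records described above is a small immutable structure that keeps the URL, title, optional timestamp, and a short summary. The sketch below is an assumption-laden illustration: the raw result keys (`url`, `title`, `snippet`, `published`) and the `normalize` helper are invented for this example and do not reflect the Baidu Search API or the system's actual record format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceRecord:
    """Hypothetical normalized evidence record (field names are assumed)."""
    url: str
    title: str
    summary: str
    timestamp: Optional[str] = None   # kept only when the source provides one


def normalize(raw: dict, max_summary_chars: int = 200) -> EvidenceRecord:
    """Collapse whitespace, truncate the snippet, and keep source metadata."""
    text = " ".join(raw.get("snippet", "").split())
    return EvidenceRecord(
        url=raw["url"].strip(),
        title=raw.get("title", "").strip(),
        summary=text[:max_summary_chars],
        timestamp=raw.get("published") or None,
    )


record = normalize({
    "url": " https://example.com/report ",
    "title": "Example report",
    "snippet": "Shipments   rose\n in the second quarter.",
})
print(record)
```

Keeping the URL and timestamp on every record is what lets later stages cite and cross-check a claim without re-fetching the page.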

##### Report Rendering Tools

To ensure the high quality and formatting diversity of the final deliverables, DuMate\-DeepResearch employs a decoupled, two\-stage report rendering mechanism\. Initially, the system generates a unified "pivot report," utilizing robust reasoning capabilities to guarantee logical coherence and content comprehensiveness\. Subsequently, specialized rendering tools translate this pivot report into multiple user\-desired formats \(e\.g\., Markdown, HTML, PPT\), ensuring adaptability across various presentation contexts\.
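The two-stage idea can be illustrated with a format-neutral pivot structure and two independent renderers. This is only a sketch: the pivot schema (a `title` plus a list of `sections` with `heading` and `body`) is an assumption for this example, and the report does not describe the actual pivot format or rendering tools.

```python
import html


def render_markdown(pivot: dict) -> str:
    """Render the assumed pivot structure as Markdown."""
    lines = [f"# {pivot['title']}"]
    for section in pivot["sections"]:
        lines += ["", f"## {section['heading']}", "", section["body"]]
    return "\n".join(lines)


def render_html(pivot: dict) -> str:
    """Render the same pivot structure as HTML, escaping text content."""
    parts = [f"<h1>{html.escape(pivot['title'])}</h1>"]
    for section in pivot["sections"]:
        parts.append(f"<h2>{html.escape(section['heading'])}</h2>")
        parts.append(f"<p>{html.escape(section['body'])}</p>")
    return "\n".join(parts)


# Hypothetical pivot report produced by the first stage.
pivot = {
    "title": "Market overview",
    "sections": [{"heading": "Findings", "body": "Cost per unit fell: a < b."}],
}
print(render_markdown(pivot))
print(render_html(pivot))
```

Since both renderers read the same pivot, content is reasoned about once and only presentation varies per format.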

![Refer to caption](https://arxiv.org/html/2606.07299v1/x2.png) Figure 2: The illustration for dynamic planning and test\-time optimization\.

### 2\.2 Dynamic Planning and Test\-Time Optimization

On top of the Foundry infrastructure, DuMate\-DeepResearch introduces three mechanisms that shape the long\-horizon research process\. First, *graph\-based dynamic planning* \(Section [2\.2\.1](https://arxiv.org/html/2606.07299#S2.SS2.SSS1)\) rewrites the evolving plan as evidence accumulates, maintaining a global, self\-revising roadmap instead of committing to a single next\-action chain\. Second, *recursive two\-level execution* \(Section [2\.2\.2](https://arxiv.org/html/2606.07299#S2.SS2.SSS2)\) lets the outer Research Agent delegate complex search sub\-tasks to inner Search Agents that run their own local Foundry cycles, keeping noisy retrieval separate from high\-level research strategy\. Third, *rubric\-based test\-time optimization* \(Section [2\.2\.3](https://arxiv.org/html/2606.07299#S2.SS2.SSS3)\) turns the guidance signal into active rubric instructions for planning, retrieval, stopping, and final synthesis\. We develop the three mechanisms in turn, introducing notation only where it sharpens the mechanism being described \(as shown in Figure [2](https://arxiv.org/html/2606.07299#S2.F2)\)\.

#### 2\.2\.1 Graph\-Based Dynamic Planning

![Refer to caption](https://arxiv.org/html/2606.07299v1/x3.png) Figure 3: The illustration of deep execution path graph planning and reflection\.

##### Coarse\-to\-Fine Expansion for Dynamic Scope

DuMate\-DeepResearch expands the research path in a coarse\-to\-fine manner\. Complex tasks often begin with vague intent, making it difficult to balance broad exploration with premature convergence\. The system therefore starts with a macro\-level exploratory retrieval phase that maps the research space and establishes a preliminary cognitive framework\. We use $t_c$ to denote the checkpoint at which this initial coarse\-exploration phase completes; the corresponding evidence base $e_{t_c}$ is used by the Writer to construct the fixed outline $\mathcal{O}$ in Algorithm [1](https://arxiv.org/html/2606.07299#alg1)\. Guided by the graph\-based dynamic planner, the system then transitions to a granular phase, systematically diving into defined sub\-topics to collect targeted evidence\. This progressive decomposition and integration mechanism refines the research scope as evidence accumulates, calibrating the boundary between breadth and depth without losing focus\. We formalize this roadmap at planning iteration $t$ as a DAG\-structured plan $p_t = (V_t, E_t)$, the planning component of the global state $s_t$ in Algorithm [1](https://arxiv.org/html/2606.07299#alg1)\. Each node $v \in V_t$ is a sub\-task carrying a tuple $\langle d(v),\, \chi(v) \rangle$, where $d(v) \in \mathbb{Z}^{+}$ is its depth in the coarse\-to\-fine hierarchy \(smaller values denote broader, exploratory sub\-tasks\), and $\chi(v) \in \{0, 1\}$ is a binary execution status; a directed edge $(u, v) \in E_t$ records that $v$ depends on $u$\. The coarse\-to\-fine principle then becomes a depth\-ordered expansion in which the scheduler only ever dispatches the *ready frontier*,

$$\mathcal{F}_t = \bigl\{\, v \in V_t \;:\; \chi(v) = 0 \;\wedge\; \forall (u, v) \in E_t,\; \chi(u) = 1 \,\bigr\}, \tag{3}$$

i\.e\. the unexecuted sub\-tasks whose dependencies are all satisfied\. Confining execution to $\mathcal{F}_t$ guarantees that broad, low\-depth probes are resolved before their finer descendants are instantiated, so that boundary definition reduces to a monotone, dependency\-respecting expansion rather than an unbounded search\.
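The ready frontier of Eq. (3) is straightforward to compute from the execution status and the dependency edges. The sketch below is a direct transcription of the definition; the dictionary-and-edge-list encoding and the example node names are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Set, Tuple


def ready_frontier(status: Dict[str, int], edges: Iterable[Tuple[str, str]]) -> Set[str]:
    """Unexecuted nodes (chi = 0) whose dependencies are all executed (chi = 1).

    status maps each node v to chi(v); an edge (u, v) means v depends on u.
    """
    deps: Dict[str, Set[str]] = {v: set() for v in status}
    for u, v in edges:
        deps[v].add(u)
    return {
        v for v, done in status.items()
        if done == 0 and all(status[u] == 1 for u in deps[v])
    }


# Hypothetical plan: a broad survey feeds two finer probes, which feed a comparison.
status = {"survey": 1, "market": 0, "tech": 0, "compare": 0}
edges = [("survey", "market"), ("survey", "tech"),
         ("market", "compare"), ("tech", "compare")]
frontier = ready_frontier(status, edges)
print(sorted(frontier))
```

Here `market` and `tech` are ready because `survey` has been executed, while `compare` must wait for both; the two ready nodes can be dispatched in parallel.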

##### Far\-Sighted Re\-Planning over a Dynamic Graph

The dynamic graph also gives the Planner a global structure for revising its strategy as evidence arrives \(as shown in Figure [3](https://arxiv.org/html/2606.07299#S2.F3)\)\. Myopic, step\-by\-step ReAct\-style reasoning commits to one next action at a time and lacks a global view of the trajectory; in highly stochastic web environments it can stall on dead links, API errors, or contradictory evidence\. Representing the roadmap as a dynamic graph instead gives the Planner a far\-sighted view of the entire trajectory: at each milestone \(node\) the agent evaluates intermediate outcomes against expectations, and when anomalies surface it prunes dead ends, adjusts subsequent strategy, and re\-plans alternative paths rather than greedily extending a single chain\. This graph\-level re\-planning lets the system revise earlier assumptions whenever a tool fails or new evidence overturns them, yielding resilience over long horizons\. Formally, at each iteration the Planner emits a set of parallel actions $a_t$ over the ready frontier, where each action $\alpha \in a_t$ binds a frontier sub\-task $v \in \mathcal{F}_t$ to a tool and its parameters\. Once the Execution Module returns the newly collected evidence $\Delta e_t$ and folds it into the accumulated evidence base $e_{t+1}$, the roadmap is regenerated by a single re\-planning operator

$$p_{t+1} = \Pi\bigl(p_t,\, e_{t+1},\, \rho_{t+1}\bigr), \tag{4}$$

which updates only the plan component of the global state; the full transition additionally folds in the fresh evidence and updated guidance\. Conditioned on the current plan, the accumulated evidence, and the latest guidance signal $\rho_{t+1}$, $\Pi$ may *expand* the frontier with finer sub\-tasks, *prune* unproductive branches \(backtracking away from dead links or contradictory evidence\), or *rewire* dependencies, while it always preserves executed nodes so that $\chi(v) = 1$ is monotone and no evidence is recomputed\. To curb error propagation, every candidate action first passes a lightweight reflection gate before any tool is invoked; rejected actions are revised under the critic’s feedback for a bounded number of rounds\. The loop halts and yields to report synthesis once the frontier is exhausted \($\mathcal{F}_t = \varnothing$\) or the Planner emits a terminal synthesis action, under a hard iteration bound $t \le T_{\max}$, which is exactly the stopping predicate of the global loop\. Casting expansion, reflective re\-planning, and adaptive stopping as the single operator $\Pi$ turns the long\-horizon trajectory into one auditable update rule, summarized in Algorithm [2](https://arxiv.org/html/2606.07299#alg2)\.

**Algorithm 2** Graph-Based Dynamic Planning with Reflection

1. **Input:** research topic $x$, report outline $\mathcal{O}$, max iterations $T_{\max}$
2. $p_0 \leftarrow \textsc{InitPlan}(x, \mathcal{O})$; $e_0 \leftarrow \varnothing$; $\rho_0 \leftarrow \textsc{InitGuidance}(x, \mathcal{O})$; $t \leftarrow 0$
3. **while** $t \leq T_{\max}$ **do**
4. $\mathcal{F}_t \leftarrow \{\, v \in V_t : \chi(v) = 0 \wedge \text{deps}(v)\ \text{satisfied} \,\}$ ▷ ready frontier
5. $a_t \leftarrow \textsc{Planner}(p_t, \mathcal{O}, e_t, \rho_t)$ restricted to $\mathcal{F}_t$ ▷ select parallel actions
6. **if** $a_t = \varnothing$ **or** $a_t$ is a synthesis action **then**
7. **break** ▷ adaptive stopping
8. **end if**
9. **while** reflection gate returns *revise* for $a_t$ **and** bounded rounds not reached **do**
10. revise $a_t$ under critic feedback
11. **end while**
12. $\Delta e_t \leftarrow \textsc{ExecuteParallel}(a_t)$ ▷ via tools or bounded Search Agent dispatch
13. update $\chi(\cdot)$ for executed nodes
14. $e_{t+1} \leftarrow e_t \cup \Delta e_t$ ▷ accumulate evidence for subsequent planning
15. $\rho_{t+1} \leftarrow \textsc{UpdateGuidance}(\mathcal{O}, e_{t+1})$
16. $p_{t+1} \leftarrow \Pi(p_t, e_{t+1}, \rho_{t+1})$ ▷ expand / prune / rewire
17. $t \leftarrow t + 1$
18. **end while**
19. **return** $y \leftarrow \mathcal{W}\bigl(x, \mathcal{O}, e_t, \rho^{p}\bigr)$ ▷ Writer: guidance-conditioned synthesis

A desensitized excerpt of the actual planner prompt that drives this procedure—retaining its DAG legality, depth\-bounding, and re\-planning constraints while omitting the output schema and other sensitive details—is provided in Appendix[A\.1](https://arxiv.org/html/2606.07299#A1.SS1)\.
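The invariant that the re-planning operator $\Pi$ never discards executed nodes can be sketched as follows. This is an illustrative toy under stated assumptions: the function `replan`, its `expand` and `prune` arguments, and the dict/set plan encoding are hypothetical, and in the actual system the expand/prune/rewire decisions come from the Planner rather than from explicit arguments.

```python
def replan(nodes, edges, expand=(), prune=()):
    """Toy stand-in for the re-planning operator.

    nodes:  dict mapping sub-task id -> executed flag (chi)
    edges:  set of (u, v) pairs meaning "v depends on u"
    expand: iterable of (new_id, [dependency ids]) to add
    prune:  iterable of ids to drop; executed nodes are never dropped
    """
    nodes = dict(nodes)
    edges = set(edges)
    for v in prune:
        if nodes.get(v) is False:  # only unexecuted nodes may be pruned
            del nodes[v]
            # Drop every edge touching the pruned node (a design choice of
            # this sketch: dependents are no longer blocked by it).
            edges = {(a, b) for (a, b) in edges if v not in (a, b)}
    for v, deps in expand:
        nodes.setdefault(v, False)
        edges |= {(u, v) for u in deps if u in nodes}
    return nodes, edges


nodes = {"T-1": True, "T-2": False, "T-3": False}
edges = {("T-1", "T-3"), ("T-2", "T-3")}
new_nodes, new_edges = replan(
    nodes, edges,
    expand=[("T-4", ["T-1"])],
    prune=["T-1", "T-2"],  # the request to prune executed T-1 is ignored
)
```

Keeping executed nodes in place is what makes $\chi(v) = 1$ monotone: evidence already gathered for `T-1` stays attached to the plan across re-planning rounds.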

#### 2.2.2 Recursive Two-Level Execution

Even with a sound graph-based plan, execution remains difficult because each open-ended sub-task may itself require many noisy, multi-step retrieval actions. Folding high-level strategy and local search into one flat agent lets a single failed retrieval cascade into the global trajectory. DuMate-DeepResearch instead applies the Qianfan Agent Foundry *recursively*, instantiating the same Router–Planner–Execution cycle at two nested levels with a clean division of labor.

At the *outer* level, the Research Agent owns the global state $s_t$ and the plan $p_t$: it decides *what to research* next and advances the research-planning loop of Algorithm [1](https://arxiv.org/html/2606.07299#alg1). Whenever a planned action is an open-ended retrieval sub-task, the outer Execution Module does not call a search tool directly; it dispatches an *inner* Search Agent. Crucially, this agent follows the same Foundry abstraction, with its own Router, Planner, and Execution Module, but operates over a local search state for a single sub-task. It decides *how to search*: formulating and reformulating queries, invoking the retrieval tools of the Tool Ecosystem, and consolidating the returned evidence until that sub-task is sufficiently covered. It then returns evidence lists and summaries that are appended to the current cycle's $\Delta e_t$.

We capture this nesting with a compact level-indexed notation. Let $\mathcal{A}^{(\ell)}(q)$ denote a complete Foundry Agent that solves query $q$ at nesting level $\ell \in \{0,1\}$ and returns evidence lists and summaries; the outer Research Agent is $\mathcal{A}^{(0)}$. Applied to an open-ended retrieval action $a_v$ targeting sub-task $v$, the outer execution step instantiates an inner Agent on the sub-task query $q(v)$ one level down and folds the returned evidence into $\Delta e_t$. The inner Agent $\mathcal{A}^{(1)}$ unfolds into the same Router–Planner–Execution cycle, subject to a single restriction that bounds the recursion: at the inner level, execution invokes the retrieval tools of the Tool Ecosystem directly rather than dispatching a further Agent. The nesting is therefore exactly two levels deep and terminates by construction, while the same execution abstraction appears at both levels, which is exactly what lets a complex search be carried out without conflating it with high-level planning.

The research process therefore unfolds as two nested loops—an outer research\-planning loop wrapped around many parallel inner search loops—rather than one flat trajectory, and it is this recursion that stabilizes execution\. It*isolates failure*: a stalled or unproductive search is contained within a single Search Agent and cannot derail the global plan, while the outer Research Agent simply re\-dispatches or re\-plans around it\. It*separates concerns*: the outer Planner reasons over a compact graph of sub\-tasks while each inner Agent reasons only within its own sub\-task, so neither conflates strategy with search nor confronts the full combinatorial horizon\. And because every level logs its own understanding–planning–execution trace, the recursive decomposition remains inspectable end to end\.
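The bounded recursion and its failure isolation can be sketched as a single function that calls itself at most once. This is a minimal illustration, not the Foundry API: `run_agent`, `MAX_LEVEL`, and the `plan` / `retrieve` callables are hypothetical stand-ins for the Planner and the retrieval tools.

```python
from __future__ import annotations

from typing import Callable

MAX_LEVEL = 1  # level 0 = Research Agent, level 1 = Search Agent


def run_agent(query: str, level: int,
              plan: Callable[[str, int], list[str]],
              retrieve: Callable[[str], list[str]]) -> list[str]:
    """Solve `query` at nesting `level` and return the collected evidence."""
    evidence: list[str] = []
    for sub_query in plan(query, level):
        try:
            if level < MAX_LEVEL:
                # Outer level: dispatch an inner agent, not a search tool.
                evidence += run_agent(sub_query, level + 1, plan, retrieve)
            else:
                # Inner level: call retrieval directly; recursion stops here.
                evidence += retrieve(sub_query)
        except RuntimeError:
            # Failure isolation: a failed sub-task adds no evidence and
            # does not abort its siblings.
            continue
    return evidence


def toy_plan(query: str, level: int) -> list[str]:
    return [f"{query}/a", f"{query}/b"]


def toy_retrieve(query: str) -> list[str]:
    if query.endswith("a/b"):
        raise RuntimeError("dead link")
    return [f"doc:{query}"]


result = run_agent("topic", 0, toy_plan, toy_retrieve)
```

In this toy run the retrieval for `topic/a/b` fails, yet the other three retrievals still contribute evidence, and the nesting never exceeds two levels because level 1 has no dispatch branch.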

#### 2.2.3 Rubric-Based Test-Time Optimization

##### From Evaluation to Reasoning Scaffold

DuMate-DeepResearch further uses rubrics as test-time guidance for planning and synthesis. The concept of a rubric originates from long-form output evaluation. In standard RLVR (Reinforcement Learning with Verifiable Rewards), reward signals are typically binary, which is too coarse for open-ended report generation. Rubrics provide a more structured alternative by decomposing quality into fine-grained criteria such as evidence grounding, logical coherence, and multi-source cross-validation. Rather than using rubrics only as post-hoc evaluators, we inject them into the agents' reasoning process. This turns the rubric into a live scaffold that provides explicit criteria for source calibration and evidence-grounded synthesis. We make this shift precise. A rubric is a set of criteria $\rho = \{c_1, \dots, c_k\}$ in which each criterion $c = \langle \text{name}, \text{description}, \text{guidance} \rangle$ has its *guidance* field phrased as an actionable reasoning instruction rather than a numeric score. Whereas a conventional evaluator consumes a finished report and emits a scalar reward post hoc, we inject rubric context into generation itself before outputs are produced, compelling the agent to ground claims as it reasons rather than to be penalized afterward.
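As a concrete picture of the criterion triple, the sketch below models a rubric as a list of records and renders the guidance fields into text that could be placed in an agent's context before generation. The `Criterion` class, the `render_scaffold` helper, the prompt wording, and the example guidance strings are all assumptions made for illustration; the actual prompts are the ones excerpted in the appendix.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric criterion c = <name, description, guidance>."""
    name: str
    description: str
    guidance: str  # an actionable instruction, not a numeric score


def render_scaffold(rubric: list[Criterion]) -> str:
    """Turn a rubric into instructions injected before generation."""
    lines = ["Follow these criteria while reasoning:"]
    for i, c in enumerate(rubric, start=1):
        lines.append(f"{i}. {c.name}: {c.guidance}")
    return "\n".join(lines)


rubric = [
    Criterion(
        name="evidence grounding",
        description="Claims are backed by retrieved sources.",
        guidance="Cite a retrieved source for every factual claim.",
    ),
    Criterion(
        name="multi-source cross-validation",
        description="Key figures are checked against independent sources.",
        guidance="Compare each key figure across at least two sources.",
    ),
]
scaffold = render_scaffold(rubric)
```

The point of the design is visible in the types: nothing in `Criterion` carries a score, so the rubric can only steer generation, not grade it after the fact.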

##### Dynamic Rubric Generation

Because deep research is an evolving process, the rubric cannot remain entirely static. While the research goal is fixed, the information state changes as new evidence accumulates; criteria specified at initialization may become incomplete or misaligned with the current frontier. We therefore generate and update rubrics iteratively conditioned on the accumulated knowledge. The system uses two types of rubrics: Persistent Rubrics, which define stable, topic-level quality dimensions applied uniformly across the session; and Ephemeral Rubrics, which capture transient criteria derived from the latest retrieved information. Let $\rho^{p}$ denote the persistent rubric and $\rho^{e}_{t}$ denote the ephemeral rubric available at cycle $t$. Concretely, the rubric-guidance signal $\rho_t$ introduced in Algorithms [1](https://arxiv.org/html/2606.07299#alg1) and [2](https://arxiv.org/html/2606.07299#alg2) is instantiated as an active rubric,

$$\rho_t = (\rho^{p}, \rho^{e}_{t}). \tag{5}$$

The initialization operator $\textsc{InitGuidance}$ first generates the persistent rubric from the research topic and the report outline,

$$\rho^{p} = \mathcal{G}_p(x, \mathcal{O}), \qquad \rho^{e}_{0} = \varnothing, \tag{6}$$

where $\rho^{p}$ is then held fixed to anchor stable, topic-level quality dimensions. The update operator $\textsc{UpdateGuidance}$ refreshes the ephemeral rubric at the end of every cycle for use in the next,

$$\rho^{e}_{t+1} = \mathcal{G}_e(\mathcal{O}, e_{t+1}), \qquad \rho_{t+1} = (\rho^{p}, \rho^{e}_{t+1}), \tag{7}$$

conditioned on the accumulated evidence base $e_{t+1}$, so as to target the most decision-relevant gaps exposed by the current evidence state and track the moving information frontier in lockstep with the evolving plan. Under this instantiation, the Writer consumes the persistent component $\rho^{p}$ for final synthesis, while the Planner and Search Agents use the full active rubric during iterative research:

$$a_t \sim \pi_{\mathcal{P}}\bigl(\,\cdot \mid x,\, \mathcal{O},\, p_t,\, e_t,\, \rho^{p},\, \rho^{e}_{t}\,\bigr), \qquad y \sim \pi_{\mathcal{W}}\bigl(\,\cdot \mid x,\, \mathcal{O},\, e_t,\, \rho^{p}\,\bigr), \tag{8}$$

where $a_t$ is the Planner action at cycle $t$, $y$ is the final long-form report, $\pi_{\mathcal{P}}$ and $\pi_{\mathcal{W}}$ denote the Planner and Writer policies, and $(x, \mathcal{O}, p_t, e_t)$ is the current task context: the topic, fixed report outline, evolving plan, and accumulated evidence. The active rubric components thereby cease to be graders and become a *live scaffold* for planning, while the persistent component provides the stable report-stage scaffold for prose generation.

##### Rubrics in Multi\-Agent Collaboration

Since DuMate-DeepResearch orchestrates the Agent Core and dispatched Search Agents in a hierarchical manner, with each level serving distinct objectives, the rubric strategy is designed accordingly. At the orchestration level, the active rubric $(\rho^{p}, \rho^{e}_{t})$ is refreshed after each planning-execution cycle and provided to the Planner for subsequent research decisions. At the search level, each Search Agent also receives active rubric guidance conditioned on its sub-task context and returned tool evidence. By contrast, the Writer consumes only the persistent report-stage rubric $\rho^{p}$ during final synthesis, so that dynamic evidence-gap guidance steers research control without becoming an additional moving constraint on report writing. Upon completing its search, the Search Agent returns evidence lists and summaries to the orchestration level, where they are incorporated into the accumulated evidence base. This upward evidence flow closes the loop: the orchestrator feeds the updated evidence base into the next ephemeral rubric, so that the orchestration-level rubric stays aligned with what the search level actually uncovered. Crucially, the refreshed ephemeral rubric $\rho^{e}_{t+1}$ also serves as the adaptive termination signal: once it reports no outstanding gap, the stopping predicate $\textsc{Stop}$ of Algorithm [1](https://arxiv.org/html/2606.07299#alg1) halts the loop, tying factual sufficiency directly to the stopping rule. Algorithm [3](https://arxiv.org/html/2606.07299#alg3) summarizes a single rubric-scaffolded reasoning step.

**Algorithm 3** Rubric-Scaffolded Test-Time Reasoning

1. **Input:** topic $x$, report outline $\mathcal{O}$, plan $p_t$, evidence base $e_t$, newly collected evidence $\Delta e_t$, persistent rubric $\rho^{p}$, active ephemeral rubric $\rho^{e}_{t}$ ($\rho^{e}_{0} = \varnothing$)
2. inject $\rho^{p}, \rho^{e}_{t}$ into the Planner / Search Agent context and $\rho^{p}$ into the Writer context
3. generate Planner action $a_t$ conditioned on the active rubric
4. during synthesis, generate the Writer's report $y$ conditioned on the persistent rubric
5. $\rho^{e}_{t+1} \leftarrow \mathcal{G}_e(\mathcal{O}, e_t \cup \Delta e_t)$ ▷ ephemeral rubric: refreshed for the next cycle
6. **if** $\rho^{e}_{t+1}$ reports no outstanding gap **or** the maximum number of plan iterations is reached **then**
7. signal *stop* to the Planner ▷ adaptive termination
8. **end if**
9. **return** $a_t$ during planning or $y$ during synthesis, together with $\rho^{p}, \rho^{e}_{t+1}$

Desensitized excerpts of the two-level rubric-generation prompts (the orchestration-level prompt and the search-level prompt) are provided in Appendix [A.2](https://arxiv.org/html/2606.07299#A1.SS2). They elicit the two rubric types, constrain the ephemeral criteria to the most decision-relevant evidence gaps, require the *guidance* of each criterion to be an actionable instruction rather than a numeric score, and ask the generator to flag when no further retrieval is warranted, which yields the adaptive stopping signal.
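The coupling between the ephemeral rubric and the stopping rule can be illustrated with a deliberately simple stand-in. In the system the ephemeral rubric is produced by a prompted generator; here the hypothetical `refresh_ephemeral` merely emits one gap per outline section that has no evidence yet, and `should_stop` treats `t >= t_max` as "maximum iterations reached". Both functions and the `(section, document)` evidence encoding are assumptions of this sketch.

```python
def refresh_ephemeral(outline, evidence):
    """Toy stand-in for the ephemeral rubric generator.

    outline:  list of section names from the fixed report outline
    evidence: list of (section, document) pairs collected so far
    Returns one guidance string per section that still lacks evidence.
    """
    covered = {section for section, _ in evidence}
    return [f"Find evidence for: {s}" for s in outline if s not in covered]


def should_stop(ephemeral, t, t_max):
    """Stop when no gap remains or the iteration budget is used up."""
    return not ephemeral or t >= t_max


outline = ["background", "hidden costs"]
evidence = [("background", "doc-1")]

gaps_before = refresh_ephemeral(outline, evidence)
stop_before = should_stop(gaps_before, t=1, t_max=15)

evidence.append(("hidden costs", "doc-2"))
gaps_after = refresh_ephemeral(outline, evidence)
stop_after = should_stop(gaps_after, t=2, t_max=15)
```

The same object therefore plays two roles: while it is non-empty it tells the Planner what to look for next, and once it is empty it is the termination signal.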

## 3 Experiments and Evaluation

To assess the performance of the DuMate\-DeepResearch system, we conducted comprehensive experiments on two deep research benchmarks:

- DeepResearch Bench (Du et al., [2025](https://arxiv.org/html/2606.07299#bib.bib5)): A comprehensive benchmark specifically designed for deep research agents or systems. It includes a total of 100 tasks across 22 domains in both Chinese and English. The generated report for each task is evaluated using the Reference-based and Adaptive Criteria-driven Evaluation framework, which leverages LLM-as-a-judge for evaluation.
- DeepResearch Bench II (Li et al., [2026a](https://arxiv.org/html/2606.07299#bib.bib6)): An extension of DeepResearch Bench, focusing on diagnosing deep research agents via rubrics derived from expert reports. It includes 132 tasks across 22 domains, with a total of 9,430 fine-grained binary rubrics for evaluation. The evaluation is conducted in an end-to-end manner, assessing the dimensions of Information Recall, Analysis, and Presentation.

##### Implementation Details

Key hyperparameters are set as follows: the outer planning loop runs up to 15 iterations; each inner Search Agent performs up to 10 retrieval rounds, generating up to 3 sub\-queries per round with 3 results returned per query; fan\-out parallel execution is enabled so that independent sub\-tasks on the ready frontier execute concurrently\. Baidu Search serves as the primary retrieval backend\. To account for variance in generation, all reported results for DuMate\-DeepResearch are averaged over 3 independent runs\.

##### Evaluation Protocol

For both benchmarks, baseline scores are taken from the official benchmark sources and leaderboards, and DuMate\-DeepResearch is evaluated under the corresponding official evaluation protocols\. During report generation, the system is given only the benchmark queries and does not access benchmark reference reports, expert reports, or evaluation rubrics\. This is particularly important for DeepResearch Bench II, whose evaluation rubrics are derived from expert reports; the rubrics generated by DuMate\-DeepResearch are produced independently at test time and are not derived from the benchmark’s hidden evaluation rubrics\.

### 3.1 Overall Performance

| Model/System | Comprehensiveness | Insight | Instruction Following | Readability | Overall |
| --- | --- | --- | --- | --- | --- |
| DR-Tulu | 44.08 | 44.65 | 49.56 | 42.30 | 45.49 |
| UESTC-MBSE-RAAA | 43.77 | 48.34 | 47.21 | 43.78 | 46.13 |
| OpenAI DeepResearch\* | 46.46 | 43.73 | 49.39 | 47.22 | 46.45 |
| Gemini 2.5 Pro DeepResearch\* | 49.51 | 49.45 | 50.12 | 50.00 | 49.71 |
| LangChain Open Deep Research (GPT-5 + Gensee Search) | 50.06 | 50.76 | 51.31 | 49.72 | 50.60 |
| Salesforce AIR | 50.00 | 51.09 | 50.77 | 50.32 | 50.65 |
| ThinkDepth.ai | 52.02 | 53.88 | 52.04 | 50.12 | 52.43 |
| Tavily Research | 52.84 | 53.59 | 51.92 | 49.21 | 52.44 |
| LiAuto Mind DeepResearch 1.5 | 51.54 | 55.30 | 50.45 | 51.26 | 52.54 |
| RecallRadar Intelligence | 53.91 | 53.53 | 52.18 | 52.38 | 53.19 |
| Deep Dog 1 | 53.14 | 56.10 | 51.83 | 51.18 | 53.52 |
| Bodhi Deep Research | 54.23 | 56.09 | 52.86 | 51.81 | 54.22 |
| Onyx Deep Research | 54.67 | 56.43 | 53.08 | 52.02 | 54.54 |
| TrajectoryKit | 54.10 | 57.90 | 52.91 | 52.72 | 54.92 |
| CMCC-DeepInsight | 55.66 | 58.70 | 52.53 | 50.94 | 55.24 |
| MS-Agent DeepResearch | 56.76 | 56.79 | 53.10 | 52.28 | 55.31 |
| Cellcog | 55.41 | 58.21 | 52.50 | 53.12 | 55.31 |
| NVIDIA-AIQ | 56.90 | 58.49 | 52.89 | 53.43 | 55.95 |
| Grep Deep Research | 56.82 | 58.92 | 53.38 | 53.44 | 56.23 |
| Octen DeepResearch | 56.89 | 59.00 | 53.39 | 53.83 | 56.31 |
| 1688AILab-DeepResearch | 57.32 | 59.27 | 53.51 | 53.36 | 56.53 |
| Cellcog-Max | 57.40 | 60.01 | 53.25 | 53.21 | 56.67 |
| Xiaoyi DeepResearch 6.0 | 58.58 | 59.38 | 53.58 | 53.99 | 57.00 |
| Zhipu Deep Research | 58.15 | 60.14 | 53.47 | 53.88 | 57.06 |
| iFlow-Researcher | 58.24 | 59.74 | 53.24 | 55.05 | 57.08 |
| ZTE Nebula DeepResearch | 58.37 | 59.76 | 54.06 | 54.66 | 57.27 |
| DuMate-DeepResearch | 59.48 | 61.48 | 53.87 | 54.34 | 58.03 |

Table 1: Performance of different deep research models/systems on the DeepResearch Bench. The scores are presented as percentages, and the best and second-best performances are highlighted in bold and underline, respectively. The models/systems marked with an \* represent results reproduced by the DeepResearch Bench paper. We report the performance of DuMate-DeepResearch based on average scores across multiple runs.

##### DeepResearch Bench

We report the results of DuMate-DeepResearch and the baselines on DeepResearch Bench in Table [1](https://arxiv.org/html/2606.07299#S3.T1). DuMate-DeepResearch achieves the best overall score of 58.03%, ahead of the second-best system, ZTE Nebula DeepResearch (57.27%). On the individual evaluation dimensions, it ranks first in both Comprehensiveness (59.48%) and Insight (61.48%), improving over the second-best system by 0.90 and 1.34 percentage points, respectively. It ranks second on Instruction Following (53.87%) and third on Readability (54.34%), trailing the top system by roughly 0.2 and 0.7 points on these two dimensions. These results indicate that DuMate-DeepResearch can effectively acquire and synthesize information during the deep research process, and generate high-quality reports that are comprehensive, insightful, and well-structured.

##### DeepResearch Bench II

We further evaluate on DeepResearch Bench II, which diagnoses deep research agents via fine-grained binary rubrics derived from expert reports. The benchmark assesses three dimensions: Information Recall (whether the system retrieves all key facts), Analysis (whether the system performs correct reasoning and synthesis), and Presentation (whether the report is well-structured and readable). Results are reported in Table [2](https://arxiv.org/html/2606.07299#S3.T2).

| Model/System | Information Recall | Analysis | Presentation | Overall |
| --- | --- | --- | --- | --- |
| Tongyi Deep Research | 22.95 | 35.89 | 86.13 | 29.89 |
| Perplexity Research | 33.05 | 44.47 | 79.34 | 38.58 |
| Grok Deep Search | 33.52 | 42.50 | 91.42 | 39.23 |
| Qwen3-Max Deep Research | 34.18 | 48.04 | 74.59 | 39.25 |
| Doubao Deep Research | 34.83 | 49.43 | 83.51 | 40.99 |
| Gemini-2.5-Pro Deep Research | 34.91 | 51.91 | 90.24 | 41.98 |
| Gemini-3-Pro Deep Research | 39.09 | 48.94 | 91.85 | 44.60 |
| OpenAI-GPT-o3 Deep Research | 39.98 | 49.85 | 89.16 | 45.40 |
| NVIDIA-AIQ | 49.23 | 61.55 | 93.15 | 54.50 |
| CMCC-DeepInsight | 49.60 | 62.95 | 92.94 | 55.39 |
| Xiaoyi DeepResearch 6.0 | 53.05 | 69.90 | 91.12 | 58.72 |
| iFlow-Researcher | 54.99 | 69.54 | 92.56 | 59.91 |
| DuMate-DeepResearch | 57.58 | 71.70 | 89.89 | 61.95 |

Table 2: Performance on DeepResearch Bench II. Scores are percentages. The best and second-best performances are highlighted in bold and underline, respectively.

Table [2](https://arxiv.org/html/2606.07299#S3.T2) shows that, under our evaluation on DeepResearch Bench II, DuMate-DeepResearch achieves the best overall score of 61.95%, outperforming the strongest baseline, iFlow-Researcher, by 2.04 percentage points. It also ranks first in Information Recall (57.58%) and Analysis (71.70%), improving over the second-best systems by 2.59 and 1.80 points, respectively. The rubric-based evaluation indicates that our system excels particularly in acquiring key evidence and performing evidence-grounded synthesis (the two capabilities most directly impacted by our graph-based dynamic planning and multi-turn retrieval mechanisms) while maintaining competitive Presentation quality (89.89%).

### 3.2 Detailed Analysis

##### Ablation Study

To understand the contribution of key design choices in DuMate\-DeepResearch, we conduct ablation studies on DeepResearch Bench, examining the impact of rubric\-guided generation and the choice of report\-stage model\. Average results from 3 runs are reported in Table[3](https://arxiv.org/html/2606.07299#S3.T3)\.

| Variant | Comprehensiveness | Insight | Instruction Following | Readability | Overall |
| --- | --- | --- | --- | --- | --- |
| DuMate-DeepResearch (Full) | 59.48 | 61.48 | 53.87 | 54.34 | 58.03 |
| *Rubric Ablation* | | | | | |
| w/o Rubric (Report Stage) | 59.01 | 60.73 | 53.62 | 53.82 | 57.61 |
| w/o Rubric (Full Pipeline) | 58.95 | 60.78 | 53.71 | 53.91 | 57.53 |
| *Report-Stage Model Replacement* | | | | | |
| DeepSeek V4 Pro | 58.73 | 60.66 | 53.53 | 52.64 | 57.21 |
| GLM 5.1 | 57.92 | 60.02 | 52.93 | 53.93 | 56.69 |
| MiniMax-M3 | 55.91 | 58.75 | 51.75 | 51.64 | 55.21 |
| Qwen-3.7 Max | 56.20 | 58.48 | 52.41 | 52.80 | 55.55 |

Table 3:Ablation study results on DeepResearch Bench\. “w/o Rubric \(Report Stage\)” removes rubric guidance only during report generation; “w/o Rubric \(Full Pipeline\)” removes rubric from all stages including planning and research\. The report\-stage model replacement variants substitute the default report generation model with the specified alternative while keeping all other components unchanged\.
##### Effect of Rubric Guidance

Removing the rubric from the report stage alone causes a modest but consistent drop across all dimensions (Overall: 58.03 → 57.61, −0.42), with the largest degradation on Insight (−0.75), followed by Readability (−0.52) and Comprehensiveness (−0.47). Notably, further removing the rubric from planning and research stages yields only marginal additional decline (Overall: 57.53, a further −0.08 over report-only removal). This asymmetry indicates that the rubric's primary value materializes during report synthesis, where it serves as a live scaffold for evidence-grounded claim generation, rather than during earlier information-gathering stages. The finding aligns with our design intent (Section [2.2.3](https://arxiv.org/html/2606.07299#S2.SS2.SSS3)): persistent rubrics condition the Writer policy to ground claims in retrieved evidence at generation time, and this conditioning effect dominates the rubric's contribution to overall quality.

##### Effect of Report\-Stage Model

Replacing the default report-generation model produces substantially larger quality differences than rubric removal, confirming that the synthesis model is the single most impactful component in the pipeline. DeepSeek V4 Pro comes closest to the full system (−0.82 overall) but exhibits a notable Readability deficit (−1.70), suggesting weaker long-form formatting and structural coherence despite competitive analytical ability. GLM 5.1 maintains strong Readability (53.93, only −0.41) yet shows marked drops in Comprehensiveness (−1.56) and Insight (−1.46), indicating difficulty in fully leveraging the retrieved evidence base. MiniMax-M3 and Qwen-3.7 Max incur the largest overall degradations (−2.82 and −2.48, respectively), with broad declines across all dimensions; both models appear to struggle with the long-context, multi-source synthesis demands of deep research reports. In three of the four substitutions, Insight degrades less than Comprehensiveness (DeepSeek V4 Pro is the exception, with −0.82 on Insight versus −0.75 on Comprehensiveness), suggesting that information coverage (assembling all relevant evidence into a coherent narrative) is particularly sensitive to model capability and benefits most from scale.
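The deltas quoted in the two paragraphs above follow directly from Table 3. The short script below recomputes them; the scores are copied from the table, while the variable and function names are ours.

```python
DIMENSIONS = ("Comprehensiveness", "Insight", "Instruction Following",
              "Readability", "Overall")

# Scores copied from Table 3 (percentages).
FULL = (59.48, 61.48, 53.87, 54.34, 58.03)
VARIANTS = {
    "w/o Rubric (Report Stage)": (59.01, 60.73, 53.62, 53.82, 57.61),
    "w/o Rubric (Full Pipeline)": (58.95, 60.78, 53.71, 53.91, 57.53),
    "DeepSeek V4 Pro": (58.73, 60.66, 53.53, 52.64, 57.21),
    "GLM 5.1": (57.92, 60.02, 52.93, 53.93, 56.69),
    "MiniMax-M3": (55.91, 58.75, 51.75, 51.64, 55.21),
    "Qwen-3.7 Max": (56.20, 58.48, 52.41, 52.80, 55.55),
}


def deltas(name):
    """Per-dimension difference between a variant and the full system."""
    return {
        dim: round(score - full, 2)
        for dim, score, full in zip(DIMENSIONS, VARIANTS[name], FULL)
    }


for name in VARIANTS:
    print(name, deltas(name))
```

Rounding to two decimals removes floating-point noise, since every score in the table is reported to two decimal places.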

### 3.3 Qualitative Case Study

#### 3.3.1 Coarse-to-Fine Expansion and Dynamic Boundary Definition

Case A:“How do low\-code/no\-code platforms impact traditional software development?” This ambiguous query embeds four interleaved sub\-problems: impact magnitude, efficiency vs\. maintenance cost, developer vs\. business perspectives, and future trends\. Rather than immediately committing to fine\-grained investigation, the system executes a two\-phase expansion strategy\.

##### Coarse Phase\.

The `initial_planner` (the Planner's first-stage coarse expansion) issues two parallel exploratory search tasks to map the macro landscape, followed by an outline generation task (executed by the Writer) that depends on both:

```
{
  "task_graph": [
    {"subtask_id": "T-1", "subtask_type": "search",
     "subtask_title": "LCNC market status and impact on SDLC",
     "subtask_dependencies": [], "subtask_depth": 1},
    {"subtask_id": "T-2", "subtask_type": "search",
     "subtask_title": "Efficiency gains vs. maintenance costs: empirical evidence and controversies",
     "subtask_dependencies": [], "subtask_depth": 1},
    {"subtask_id": "T-3", "subtask_type": "outline",
     "subtask_title": "Generate structured research outline",
     "subtask_dependencies": ["T-1", "T-2"], "subtask_depth": 2}
  ]
}
```
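A task graph in this shape can be checked mechanically before it is scheduled. The sketch below is one possible legality check, not the system's validator: the `validate` function and its rule that every task must be strictly deeper than its dependencies are assumptions (the rule is consistent with the case-study graphs, and it is sufficient to rule out cycles). Titles are omitted from the embedded JSON for brevity.

```python
import json

# Coarse-phase graph from Case A (titles omitted for brevity).
TASK_GRAPH_JSON = """
{"task_graph": [
  {"subtask_id": "T-1", "subtask_type": "search",
   "subtask_dependencies": [], "subtask_depth": 1},
  {"subtask_id": "T-2", "subtask_type": "search",
   "subtask_dependencies": [], "subtask_depth": 1},
  {"subtask_id": "T-3", "subtask_type": "outline",
   "subtask_dependencies": ["T-1", "T-2"], "subtask_depth": 2}
]}
"""


def validate(task_graph):
    """Return a list of problems; an empty list means the graph passed."""
    problems = []
    depth = {}
    for task in task_graph:
        tid = task["subtask_id"]
        if tid in depth:
            problems.append(f"duplicate id: {tid}")
        depth[tid] = task["subtask_depth"]
    for task in task_graph:
        tid = task["subtask_id"]
        for dep in task["subtask_dependencies"]:
            if dep not in depth:
                problems.append(f"{tid} depends on unknown task {dep}")
            elif depth[dep] >= depth[tid]:
                # Strictly increasing depth along every edge rules out cycles.
                problems.append(f"{tid} is not deeper than its dependency {dep}")
    return problems


graph = json.loads(TASK_GRAPH_JSON)["task_graph"]
problems = validate(graph)

# A deliberately broken variant: T-1 now depends on T-3 (a cycle) and on T-9.
broken = json.loads(TASK_GRAPH_JSON)["task_graph"]
broken[0]["subtask_dependencies"] = ["T-3", "T-9"]
broken_problems = validate(broken)
```

The coarse-phase graph passes, while the broken variant is rejected for both the back-edge and the dangling reference.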

##### Fine Phase\.

Upon completion of T-1 and T-2, the Writer synthesizes an 8-chapter structured outline covering background, restructuring mechanisms, efficiency verification, hidden costs, stakeholder perspectives, platform comparison, boundaries, and future trends. This outline then triggers the `planner` to expand the research into 14 targeted subtasks (T-4 through T-17) across three depth layers:

```
Depth-1 (parallel): T-4..T-13 (10 search tasks)
  - Market background, traditional dev pain points,
    6-dimension restructuring, efficiency data,
    hidden costs, stakeholder views, platform comparison,
    industry cases, capability boundaries, future trends
Depth-2 (dependent): T-14 (llm), T-15 (llm), T-16 (search)
  - T-14: Cross-validate efficiency vs. cost data
  - T-15: Build scenario-platform matching matrix
  - T-16: Supplement opposing viewpoints
Depth-3: T-17 (report) [deps: T-4..T-16]
```

Figure[4](https://arxiv.org/html/2606.07299#S3.F4)illustrates this two\-phase expansion\. The coarse phase establishes cognitive boundaries \(“what is the research space?”\) before the fine phase commits computational resources to depth\-first investigation\.

Figure 4: Coarse-to-fine expansion in Case A. The coarse phase (router → planner → 2 searches → outline generation) establishes research boundaries; the fine phase (14 subtasks across 3 depth layers) performs targeted investigation.

#### 3.3.2 Graph-Based Dynamic Planning and Reflection

Case B: “Constructing a three-dimensional evaluation framework for NEV powertrain commercialization thresholds.” The `planner` constructs a four-layer DAG with 18 nodes, where edges encode strict execution dependencies (Figure [5](https://arxiv.org/html/2606.07299#S3.F5)).

Figure 5: Task execution DAG for Case B. Depth-1: 11 parallel search tasks; depth-2: 4 `llm` tasks for integration and cross-validation; depth-3: cross-dimension synthesis (`llm`, T-16) and gap-filling search (T-17); depth-4: final report. T-17 depends on the integration tasks (T-13–T-15) and recovers the sub-segment data deferred from T-9; T-10–T-11 skip depth-2 and feed directly into T-16.

##### Reflective Evaluation\.

At each scheduling cycle, the planner performs explicit quality assessment before deciding next actions:

```
"last_task_revision":
  "T-1 (methodology): HIGH. Provides S-curve (10%/16%/50%),
   AHP+entropy weighting, TCO five-dimension framework.
   T-2 (800V+SiC): HIGH. Covers 40+ production models,
   substrate price curves, 5-10% efficiency gain data.
   T-3 (solid-state battery): HIGH. Covers three routes,
   350-500 Wh/kg density, 2025/2027/2030 milestones.
   ...
   T-9 (enterprise cases): ADEQUATE. Missing Hongqi/Lantu/
   Toyota-Mirai sub-segment data -- defer to T-17.
   T-11 (scenario forecasts): HIGH. Three-scenario matrix,
   BNEF/McKinsey/Ouyang cross-validated projections.
   Assessment: 11 search tasks complete, three-dimensional
   data coverage balanced. T-9 gap handled by T-17.
   Proceeding to T-12..T-15 (parallel llm integration)."
```

This reflection-before-action loop enables the system to: (1) confirm sufficient evidence before advancing to dependent tasks; (2) dynamically inject additional searches when gaps are detected; and (3) prune unnecessary branches when early results already satisfy requirements. The `llm`-type tasks (T-12–T-16) serve dedicated integration and cross-validation roles: synthesizing per-dimension indicators, computing composite scores, and verifying consistency across multiple search results rather than performing new searches.

#### 3.3.3 Multi-Turn Retrieval within Search Agents

Case C: “Manufacturing technology options for hollow motor shafts in NEV electric drive units.” Beyond planner-level re-planning, each search task executes a multi-turn retrieval loop internally. The Search Agent operates as a plan-execute cycle with up to 10 iterations, progressively refining queries based on intermediate results.

A single search task \(T\-1\) in this case executes 6 internal rounds with 40\+ queries:

```
Round 1 (broad): 3 search tools, 9 queries
  "hollow motor shaft NEV electric drive unit application"
  "hollow rotor shaft electric vehicle e-axle requirements"
  "hollow shaft rotor cooling 800V high speed motor NEV"

Round 2 (manufacturing-focused): 3 search tools, 9 queries
  "rotary swaging hollow rotor shaft EV production"
  "EV motor shaft material steel grade 42CrMo4 20MnCr5"
  "hairpin motor hollow shaft oil spray cooling rotor"

Round 3 (OEM-specific): 3 search tools, 9 queries
  "BYD 8-in-1 e-axle hollow rotor shaft 800V spec"
  "Tesla Model S Plaid drive unit hollow rotor shaft"
  "Hirschvogel multi-piece hollow rotor shaft laser welded"

Rounds 4-5: Progressively narrower (ISO standards, balance
  grades, specific tolerance specs)
Round 6: Final answer synthesis
```

The multi\-turn mechanism enables three retrieval strategies: \(1\)multi\-formulation query expansion\(varying terminology, synonyms, and technical jargon to maximize recall\); \(2\)progressive specificity\(broad domain→\\tomanufacturing process→\\toOEM/supplier names→\\toISO standards\); and \(3\)tool diversification\(search engine queries \+ direct URL crawling for authoritative sources\)\.
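The round structure of such a search loop can be sketched as follows. This is a simplified illustration under stated assumptions: `search_agent`, `toy_search`, `toy_refine`, and the `SPECIFICITY` ladder are hypothetical, the loop stops when refinement proposes nothing new (the real agent stops when the sub-task is judged sufficiently covered), and the defaults of 10 rounds and 3 queries per round are taken from the implementation details reported earlier.

```python
def search_agent(task, search, refine, max_rounds=10, queries_per_round=3):
    """Run bounded retrieval rounds, refining queries from earlier results."""
    seen = set()
    evidence = []
    queries = [task]  # round 1 starts from the broad task description
    rounds = 0
    while rounds < max_rounds:
        # Drop repeats within the batch and anything already issued.
        fresh = [q for q in dict.fromkeys(queries) if q not in seen]
        fresh = fresh[:queries_per_round]
        if not fresh:
            break  # nothing new to ask: treat the sub-task as covered
        rounds += 1
        for q in fresh:
            seen.add(q)
            evidence.extend(search(q))
        queries = refine(task, evidence)
    return evidence, rounds


# Toy stand-ins for the retrieval tool and the query planner.
SPECIFICITY = ["manufacturing process", "OEM supplier", "ISO standard"]


def toy_search(query):
    return [f"hit:{query}"]


def toy_refine(task, evidence):
    step = len(evidence) - 1
    return [f"{task} {SPECIFICITY[step]}"] if step < len(SPECIFICITY) else []


evidence, rounds = search_agent("hollow motor shaft", toy_search, toy_refine)
```

Each round narrows the query along the specificity ladder, mirroring the broad-to-specific progression in the log above, and the `seen` set prevents the agent from spending its round budget on repeated queries.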

#### 3\.3\.4Rubric\-Based Test\-Time Optimization

Case A: “How do low\-code/no\-code platforms impact traditional software development?” The outline’s chapter descriptions function as persistent rubrics that scaffold all downstream agents\. Figure [6](https://arxiv.org/html/2606.07299#S3.F6) illustrates the rubric propagation pathway\.

\[Diagram nodes: Planner \(dispatch\), Writer \(outline\), Persistent Rubrics \(chapter descriptions\), Search Agents \(T\-4\.\.T\-13\), Writer \(synthesis, T\-17\), Ephemeral Rubrics \(per\-query criteria\), Planner \(next\-cycle\); edge labels: inject, inject, return\.\]

Figure 6: Rubric propagation in multi\-agent collaboration\. Persistent rubrics \(from the Writer\) are injected into both the Search Agents and the Writer\. Ephemeral rubrics generated during search are returned to the Planner for next\-cycle calibration\.

##### Rubric as Reasoning Scaffold\.

Chapter 3’s rubric specifies:

> “The study shall cross\-validate LCNC efficiency claims using multi\-source data: comparing delivery cycles, headcount, and ROI between vendor claims, third\-party research \(Forrester TEI, Gartner Peer Insights\), and hands\-on testing of 5–7 mainstream platforms; reveal the differentiated realization degree of efficiency gains across scenarios\.”

This rubric propagates to Search Agents \(guiding query formulation toward multi\-source evidence\) and to the Writer \(enforcing evidence grounding\)\. The effect is directly observable in the final output, where the system produces conditional, source\-calibrated conclusions:

> “The ‘300%–500% efficiency improvement’ should be treated as the upper bound of vendor claims, not the median actually achievable by enterprises—this gap will be critically examined in Chapter 3\. \[…\] IDC’s 40\.3B RMB \(2024\) with 26\.4% CAGR provides the most rigorous baseline; Gartner’s 131B RMB figure includes broader aPaaS integration\.”

#### 3\.3\.5Report Quality and Synthesis Capability

Table [4](https://arxiv.org/html/2606.07299#S3.T4) summarizes the output quality metrics across all three cases\.

| Metric | Case A \(LCNC\) | Case B \(NEV\) | Case C \(Shaft\) |
| --- | --- | --- | --- |
| Word count† | 151K \(zh\) | 261K \(zh\) | 68K \(en\) |
| Chapters / sections | 8 / 53 | 10 / 62 | 8 / 71 |
| Citations | 114 | 196 | 21 |
| Plan iterations | 2 | 3 | 11 |
| Total subtasks | 17 | 18 | 27 |
| Structural elements in final report: | | | |
| Analytical frameworks | Mermaid, matrices | AHP\-entropy formulas | Multi\-criteria decision matrix |
| Conditional conclusions | per\-scenario | per\-route | per\-process |

†Chinese counts are in characters; English count is in words\.

Table 4: Output quality metrics for all three case studies\.

All reports exhibit key quality characteristics enabled by the proposed mechanisms: \(1\) Multi\-source cross\-validation: the system explicitly distinguishes vendor claims from third\-party measurements \(e\.g\., “Forrester TEI validates 45% cost reduction—notably more conservative than vendor\-claimed 60–80%”\); \(2\) Conditional conclusions: every major finding is bounded by scenario applicability \(e\.g\., “efficiency gains of 500–600% in simple form/approval scenarios, but only 60% in high\-complexity projects”\); \(3\) Quantitative modeling: Case B autonomously constructs a three\-dimensional, 13\-indicator evaluation framework with explicit formulas \($\mathrm{Score}_{k}=\sum_{i}W_{i}\times\sum_{j}W_{ij}\times x_{ij,k}^{\mathrm{norm}}$\) and combined AHP\-entropy weighting; \(4\) Adaptive depth: Case C demonstrates that the system scales plan iterations to 11 and total subtasks to 27 in response to retrieval difficulty, while maintaining report quality; \(5\) Full citation trails: most evidence\-backed claims link to a retrievable URL, enabling broad auditability\.
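To make the composite score in item \(3\) concrete, the sketch below evaluates the weighted sum for a toy input\. Several parts are assumptions rather than details from the report: min\-max scaling stands in for the unspecified normalization, the weights are placeholder numbers instead of AHP\-entropy outputs, and the function names are hypothetical\.

```python
def min_max_normalize(values):
    """Scale raw indicator values to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant indicator carries no signal
    return [(v - lo) / (hi - lo) for v in values]


def composite_scores(raw, dim_weights, ind_weights):
    """Compute Score_k = sum_i W_i * sum_j W_ij * x_norm[i][j][k].

    raw[i][j] is a list of raw values for indicator j of dimension i,
    one value per alternative k.
    """
    n_alt = len(raw[0][0])
    scores = [0.0] * n_alt
    for i, w_i in enumerate(dim_weights):
        for j, w_ij in enumerate(ind_weights[i]):
            x_norm = min_max_normalize(raw[i][j])
            for k in range(n_alt):
                scores[k] += w_i * w_ij * x_norm[k]
    return scores


# Toy input: 2 dimensions, 3 indicators in total, 3 alternatives.
raw = [
    [[10, 20, 15], [5, 5, 5]],   # dimension 0 has two indicators
    [[1, 3, 2]],                 # dimension 1 has one indicator
]
dim_weights = [0.5, 0.5]         # placeholder W_i, summing to 1
ind_weights = [[0.5, 0.5], [1.0]]  # placeholder W_ij, summing to 1 per dimension

scores = composite_scores(raw, dim_weights, ind_weights)
print(scores)
```

Because the weights sum to one at each level and every normalized value lies in \[0, 1\], each composite score also lies in \[0, 1\], which keeps alternatives directly comparable\.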

These qualitative observations align with the quantitative gains on DeepResearch Bench, particularly the leading performance in Comprehensiveness \(59\.48%, \+0\.9% over second\-best\) and Insight \(61\.48%, \+1\.34% over second\-best\), which directly reflect the system’s ability to acquire diverse evidence and synthesize it into structured, evidence\-grounded analysis\.

## 4Background and Related Work

### 4\.1Retrieval\-Augmented Generation and Agentic Search

Before deep research systems, the dominant paradigm for connecting LLMs with external knowledge was retrieval\-augmented generation \(RAG\), where a system retrieves a small set of relevant passages and conditions the generator on them to produce a concise answer\. Early RAG\-style systems showed that non\-parametric retrieval can substantially improve knowledge\-intensive generation \(Lewis et al\., [2020](https://arxiv.org/html/2606.07299#bib.bib18)\), and later work further integrated retrieval into language model pre\-training and few\-shot learning \(Guu et al\., [2020](https://arxiv.org/html/2606.07299#bib.bib20); Borgeaud et al\., [2022](https://arxiv.org/html/2606.07299#bib.bib21); Izacard et al\., [2023](https://arxiv.org/html/2606.07299#bib.bib22)\)\. In these systems, the search component is usually optimized for short\-answer question answering: retrieve evidence, optionally rerank or filter it, and generate an answer grounded in the retrieved context\. The retriever itself has evolved from lexical retrieval such as BM25 \(Robertson and Zaragoza, [2009](https://arxiv.org/html/2606.07299#bib.bib23)\) to dense passage retrieval \(Karpukhin et al\., [2020](https://arxiv.org/html/2606.07299#bib.bib24)\), while broader RAG surveys summarize this line as a standard way to mitigate the static\-knowledge limitation of LLMs \(Gao et al\., [2023](https://arxiv.org/html/2606.07299#bib.bib19)\)\.

A central limitation of conventional RAG is that retrieval quality depends heavily on the input query\. To address this, many systems introduce LLM\-based query rewriting, decomposition, or planning before retrieval \(Li et al\., [2025c](https://arxiv.org/html/2606.07299#bib.bib52); Chen et al\., [2025a](https://arxiv.org/html/2606.07299#bib.bib53); Li et al\., [2026c](https://arxiv.org/html/2606.07299#bib.bib55)\)\. Rewrite\-Retrieve\-Read trains a query rewriter with reinforcement learning so that the rewritten query improves downstream answer accuracy \(Ma et al\., [2023](https://arxiv.org/html/2606.07299#bib.bib25); Chen et al\., [2026](https://arxiv.org/html/2606.07299#bib.bib54)\)\. Subsequent work extends this idea by optimizing retrieval\-oriented planning with richer reward signals or multi\-agent training, such as DeepRetrieval and multi\-agent RAG optimization \(Jiang et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib26); Chen et al\., [2025b](https://arxiv.org/html/2606.07299#bib.bib27)\)\. Beyond one\-shot rewriting, iterative systems decompose complex questions into multiple dependent sub\-queries\. LLatrieval repeatedly generates supplementary queries when current evidence fails verification \(Li et al\., [2023](https://arxiv.org/html/2606.07299#bib.bib28)\), while DRAGIN uses the model’s generation state to dynamically reformulate retrieval queries \(Su et al\., [2024](https://arxiv.org/html/2606.07299#bib.bib29)\)\. Tree\- or graph\-based methods further expand the search space: RAG\-Star uses retrieval\-augmented verification and refinement over deliberative reasoning paths \(Jiang et al\., [2024](https://arxiv.org/html/2606.07299#bib.bib30)\), DeepRAG decides step by step whether to rely on parametric knowledge or retrieval \(Guan et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib31)\), and MAO\-ARAG orchestrates multiple retrieval modules through a multi\-agent adaptive RAG framework \(Chen et al\., [2025c](https://arxiv.org/html/2606.07299#bib.bib32)\)\.

Another line of work focuses on when LLMs should search\. Fixed retrieval can be inefficient and may introduce irrelevant or misleading evidence, so adaptive retrieval methods let the model decide whether additional evidence is needed\. IR\-CoT interleaves retrieval with chain\-of\-thought reasoning for multi\-step questions \(Trivedi et al\., [2022](https://arxiv.org/html/2606.07299#bib.bib33)\), while FLARE triggers retrieval based on uncertainty during generation \(Jiang et al\., [2023](https://arxiv.org/html/2606.07299#bib.bib34)\)\. Self\-RAG trains models to retrieve, generate, and critique their outputs through self\-reflection tokens \(Asai et al\., [2024](https://arxiv.org/html/2606.07299#bib.bib35)\)\. Other adaptive methods estimate retrieval necessity through model confidence, internal states, or consistency, including DRAGIN, Rowen, and SEAKR \(Su et al\., [2024](https://arxiv.org/html/2606.07299#bib.bib29); Ding et al\., [2024](https://arxiv.org/html/2606.07299#bib.bib36); Yao et al\., [2024](https://arxiv.org/html/2606.07299#bib.bib37)\)\. This direction connects naturally to tool\-using agents: ReAct frames search as an action interleaved with reasoning \(Yao et al\., [2023b](https://arxiv.org/html/2606.07299#bib.bib16)\), Search\-o1 introduces agentic search for large reasoning models \(Li et al\., [2025a](https://arxiv.org/html/2606.07299#bib.bib38)\), and Search\-R1/R1\-Searcher optimize when and what to search through reinforcement learning \(Jin et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib39); Song et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib40)\)\.

Overall, LLM\-augmented search and agentic RAG form the short\-answer foundation of deep research\. They improve evidence acquisition through retrieval, query planning, adaptive search timing, and tool\-augmented reasoning\. However, their primary objective is still usually localized answer accuracy or multi\-hop question answering efficiency\. Deep research extends this foundation from short, evidence\-grounded answers to long\-form, report\-level synthesis, requiring broader tool orchestration, persistent memory, global planning, source calibration, and structured report generation\.

### 4\.2Deep Research

Moving beyond short\-answer RAG and agentic search, recent deep research systems aim to generate long\-form, evidence\-grounded reports for complex and open\-ended user queries\. Compared with conventional RAG systems, they usually require broader information exploration, longer\-horizon planning, iterative reflection, source\-level verification, and structured report writing\. Therefore, the core challenge shifts from retrieving sufficient evidence for a localized answer to coordinating an end\-to\-end research workflow that can acquire, organize, and synthesize information across multiple steps\.

MiroThinker \(Team et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib8)\) is designed to enhance the tool\-augmented reasoning and information\-seeking capabilities of research agents\. Operating on the ReAct \(Yao et al\., [2023b](https://arxiv.org/html/2606.07299#bib.bib16)\) paradigm, it supports up to 600 tool calls within a 256K context window by retaining the most recent tool responses during exploration\. WebThinker \(Li et al\., [2025b](https://arxiv.org/html/2606.07299#bib.bib9)\) introduces autonomous deep web exploration and operates in two modes: problem solving and report generation\. DR\-Tulu \(Shao et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib15)\) addresses the drawback of static evaluation metrics in optimizing open\-ended, long\-form deep research tasks by introducing evolving rubrics, which provide measurable reward signals for RL and adapt dynamically to the policy model’s behaviors\. TTD\-DR \(Han et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib10)\) conceptualizes report generation as an iterative diffusion process, which includes planning, drafting, revision, and supplementary search\. To enhance the quality of individual agentic components, TTD\-DR introduces a self\-evolution strategy that merges multiple revised variants into a single high\-quality output\. Step\-DeepResearch \(Hu et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib13)\) adopts an Atomic Capability\-based Data Synthesis Strategy for fine\-tuning\. The strategy targets several bottlenecks in deep research systems, including planning, information seeking, reflection, and report writing\. Before SFT and RL, it introduces Agentic Mid\-training to adapt medium\-sized models to long\-context and tool\-augmented reasoning\. FS\-Researcher \(Zhu et al\., [2026b](https://arxiv.org/html/2606.07299#bib.bib14)\) frames the research task as a collaboration between two agents: a context builder and a report writer\. The system maintains a file\-system workspace, which serves as the durable external memory for both agents\. The context builder performs tool calls and knowledge base construction, while the report writer interacts with the file system and writes the report section by section\.

More recent systems further emphasize verification, scalable training data, and efficient long\-horizon search\. MiroThinker\-1\.7 and H1 \(Team et al\., [2026a](https://arxiv.org/html/2606.07299#bib.bib41)\) improve heavy\-duty research agents through verification\-enhanced data construction, scalable reinforcement learning, and inference\-time verification\. Marco DeepResearch \(Zhu et al\., [2026a](https://arxiv.org/html/2606.07299#bib.bib45)\) similarly adopts a verification\-centric design, using a dedicated verification agent and reinforcement learning for compact models\. RedSearcher \(Chu et al\., [2026](https://arxiv.org/html/2606.07299#bib.bib43)\) targets scalable and cost\-efficient long\-horizon search agents by combining decentralized multi\-agent data synthesis, compact agentic supervised fine\-tuning, and reinforcement learning\. LiteResearcher \(Li et al\., [2026b](https://arxiv.org/html/2606.07299#bib.bib48)\) also focuses on scalable agentic RL for deep research, highlighting the importance of efficient trajectory generation and policy optimization\.

Another emerging direction is to democratize deep research agents through open data and reproducible pipelines\. OpenSeeker \(Du et al\., [2026](https://arxiv.org/html/2606.07299#bib.bib42)\) fully open\-sources its training data for frontier search agents, covering prompt sets, cold\-start trajectories, and reinforcement learning data\. OpenResearcher \(Li et al\., [2026d](https://arxiv.org/html/2606.07299#bib.bib44)\) proposes a fully open pipeline for long\-horizon deep research trajectory synthesis, including synthetic task generation, high\-quality trajectory construction, and agent tuning\. OffSeeker \(Zhou et al\., [2026](https://arxiv.org/html/2606.07299#bib.bib46)\) argues that online reinforcement learning is not the only path to strong deep research agents, showing the effectiveness of offline data construction and training\. AgentFounder \(Su et al\., [2025](https://arxiv.org/html/2606.07299#bib.bib47)\) scales agents through continual pre\-training over large\-scale agentic data, while DR\-Venus \(Team et al\., [2026b](https://arxiv.org/html/2606.07299#bib.bib49)\) explores edge\-scale deep research agents trained from only 10K open data examples\.

Overall, existing deep research systems highlight several complementary directions: scaling tool\-augmented exploration, separating problem\-solving and report\-generation modes, using rubrics and verifiers as optimization signals, improving test\-time writing through iterative refinement, synthesizing capability\-specific training data, open\-sourcing reproducible training pipelines, and introducing external workspaces as persistent memory\. These studies demonstrate that deep research is not merely a longer version of RAG, but a broader agentic workflow that couples search, planning, verification, memory, and long\-form synthesis\.

## 5Conclusions

In this technical report, we presented DuMate\-DeepResearch, a multi\-agent deep research framework built on the Qianfan Agent Foundry\. By decoupling the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, the framework exposes every planning decision and tool invocation as an inspectable artifact, directly addressing the transparency and auditability challenge of agentic deep research\. On top of this infrastructure, we introduced three cognitive mechanisms tailored to the open challenges of the task: a graph\-based dynamic planner that supports coarse\-to\-fine exploration, reflection, re\-planning, backtracking, and parallel branching for far\-sighted long\-horizon research; a recursive two\-level execution design that delegates each complex search sub\-task to an inner Search Agent running its own planning loop, isolating noisy retrieval so that the global trajectory stays stable; and a rubric\-based test\-time optimization mechanism that dynamically generates task\-specific quality criteria and uses them as live reasoning scaffolds for evidence\-grounded synthesis and adaptive stopping\.

Experiments on DeepResearch Bench and DeepResearch Bench II show consistent gains across complementary evaluation protocols, with DuMate\-DeepResearch achieving the best overall scores on both benchmarks\. These results demonstrate the effectiveness of combining auditable multi\-agent infrastructure with adaptive planning and rubric\-guided reasoning for high\-quality deep research\. In future work, we plan to extend the evaluation to additional live and multimodal deep research benchmarks, broaden the Tool Ecosystem with richer domain\-specific capabilities, and further investigate rubric\-based optimization as a training\-time as well as test\-time signal\.

## Contributions and Acknowledgments

Contributors: Lingyong Yan🖂, Can Xu\*, Yukun Zhao, Wenxuan Li, Qingyang Chen, Jiulong Wu, Wenli Song, Xiangnan Li, Weixian Shi, Yiqun Chen\*, Xuchen Ma\*, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Jianmin Wu, and Dawei Yin\.

🖂 Corresponding author: [yanlingyong@baidu\.com](mailto:yanlingyong@baidu.com)\. \*Work done during an internship at Baidu AI Cloud\.

We would like to thank our colleagues at Baidu AI Cloud and across Baidu for their continuous support throughout this project\. We are also grateful to the colleagues who participated in internal evaluations and provided valuable feedback that helped shape the design and improve the quality of the system\. Finally, we thank the broader open\-source and deep research community, whose benchmarks, baselines, and prior work have been instrumental in guiding our research and development efforts\.

## References

- A\. Asai, Z\. Wu, Y\. Wang, A\. Sil, and H\. Hajishirzi \(2024\) Self\-RAG: learning to retrieve, generate, and critique through self\-reflection\. In The Twelfth International Conference on Learning Representations\.
- S\. Borgeaud, A\. Mensch, J\. Hoffmann, T\. Cai, E\. Rutherford, K\. Millican, G\. van den Driessche, J\. Lespiau, B\. Damoc, A\. Clark, D\. de Las Casas, A\. Guy, J\. Menick, R\. Ring, T\. Hennigan, S\. Huang, L\. Maggiore, C\. Jones, A\. Cassirer, A\. Brock, M\. Paganini, G\. Irving, O\. Vinyals, S\. Osindero, K\. Simonyan, J\. W\. Rae, E\. Elsen, and L\. Sifre \(2022\) Improving language models by retrieving from trillions of tokens\.
- X\. Chen, Y\. Li, H\. Cai, Z\. Ma, X\. Chen, H\. Xiong, S\. Wang, B\. He, L\. Sun, and D\. Yin \(2025a\) Multi\-agent proactive information seeking with adaptive llm orchestration for non\-factoid question answering\. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 2, pp\. 4341–4352\.
- X\. Chen, Y\. Li, Y\. Bi, S\. Wang, L\. Kong, and D\. Yin \(2026\) ReflectRAG: enhancing retrieval\-augmented generation with grpo\-optimized iterative reflection\. Neurocomputing, pp\. 134047\.
- Y\. Chen, L\. Yan, W\. Sun, X\. Ma, Y\. Zhang, S\. Wang, D\. Yin, Y\. Yang, and J\. Mao \(2025b\) Improving retrieval\-augmented generation through multi\-agent reinforcement learning\. arXiv preprint arXiv:2501\.15228\.
- Y\. Chen, E\. Zhang, L\. Yan, S\. Wang, J\. Huang, D\. Yin, and J\. Mao \(2025c\) MAO\-arag: multi\-agent orchestration for adaptive retrieval\-augmented generation\. arXiv preprint arXiv:2508\.01005\.
- Z\. Chu, X\. Wang, J\. Hong, H\. Fan, Y\. Huang, Y\. Yang, G\. Xu, C\. Zhao, C\. Xiang, S\. Hu, et al\. \(2026\) Redsearcher: a scalable and cost\-efficient framework for long\-horizon search agents\. arXiv preprint arXiv:2602\.14234\.
- H\. Ding, L\. Pang, Z\. Wei, H\. Shen, and X\. Cheng \(2024\) Retrieve only when it needs: adaptive retrieval augmentation for hallucination mitigation in large language models\. arXiv preprint arXiv:2402\.10612\.
- M\. Du, B\. Xu, C\. Zhu, X\. Wang, and Z\. Mao \(2025\) DeepResearch bench: a comprehensive benchmark for deep research agents\. arXiv:2506\.11763 \([Link](https://arxiv.org/abs/2506.11763)\)\.
- Y\. Du, R\. Ye, S\. Tang, X\. Zhu, Y\. Lu, Y\. Cai, and S\. Chen \(2026\) OpenSeeker: democratizing frontier search agents by fully open\-sourcing training data\. arXiv preprint arXiv:2603\.15594\.
- Y\. Gao, Y\. Xiong, X\. Gao, K\. Jia, J\. Pan, Y\. Bi, Y\. Dai, J\. Sun, H\. Wang, H\. Wang, et al\. \(2023\) Retrieval\-augmented generation for large language models: a survey\. arXiv preprint arXiv:2312\.10997, 2\(1\), pp\. 32\.
- X\. Guan, J\. Zeng, F\. Meng, C\. Xin, Y\. Lu, H\. Lin, X\. Han, L\. Sun, and J\. Zhou \(2025\) DeepRAG: thinking to retrieve step by step for large language models\. arXiv preprint arXiv:2502\.01142\.
- K\. Guu, K\. Lee, Z\. Tung, P\. Pasupat, and M\. Chang \(2020\) REALM: retrieval\-augmented language model pre\-training\. In Proceedings of the 37th International Conference on Machine Learning\.
- R\. Han, Y\. Chen, Z\. CuiZhu, L\. Miculicich, G\. Sun, Y\. Bi, W\. Wen, H\. Wan, C\. Wen, S\. Maître, G\. Lee, V\. Tirumalashetty, E\. Xue, Z\. Zhang, S\. Haykal, B\. Gokturk, T\. Pfister, and C\. Lee \(2025\) Deep researcher with test\-time diffusion\. arXiv:2507\.16075 \([Link](https://arxiv.org/abs/2507.16075)\)\.
- C\. Hu, H\. Du, H\. Wang, L\. Lin, M\. Chen, P\. Liu, R\. Miao, T\. Yue, W\. You, W\. Ji, W\. Yuan, W\. Deng, X\. Yuan, X\. Zhang, X\. Liu, X\. Liu, Y\. Xu, Y\. Cao, Y\. Zhang, Y\. Wang, Y\. Shu, Y\. Zhang, Y\. Zhang, Z\. Gong, Z\. Chang, B\. Li, D\. Ma, F\. Jia, H\. Wang, J\. Liu, J\. Bai, J\. Liu, M\. Liu, N\. Wang, Q\. Wu, Q\. Du, S\. Li, W\. Sun, Y\. Gong, Y\. Chen, Y\. Zhao, Y\. Lin, Z\. Ren, Z\. Wang, A\. Zhang, B\. Li, B\. Ma, K\. An, L\. Xie, M\. Li, P\. Li, S\. Yang, X\. Chen, X\. Liu, Y\. Luo, Y\. Song, Y\. Ding, Y\. Liang, Z\. Li, Z\. Zhang, Z\. Zhang, B\. Jiao, D\. Jiang, J\. Chen, J\. Li, X\. Zhang, and Y\. Zhu \(2025\) Step\-deepresearch technical report\. arXiv:2512\.20491 \([Link](https://arxiv.org/abs/2512.20491)\)\.
- G\. Izacard, P\. Lewis, M\. Lomeli, L\. Hosseini, F\. Petroni, T\. Schick, J\. Dwivedi\-Yu, A\. Joulin, S\. Riedel, and E\. Grave \(2023\) Atlas: few\-shot learning with retrieval augmented language models\.
- J\. Jiang, J\. Chen, J\. Li, R\. Ren, S\. Wang, W\. X\. Zhao, Y\. Song, and T\. Zhang \(2024\) Rag\-star: enhancing deliberative reasoning with retrieval augmented verification and refinement\. arXiv preprint arXiv:2412\.12881\.
- P\. Jiang, J\. Lin, L\. Cao, R\. Tian, S\. Kang, Z\. Wang, J\. Sun, and J\. Han \(2025\) Deepretrieval: hacking real search engines and retrievers with large language models via reinforcement learning\. arXiv preprint arXiv:2503\.00223\.
- Z\. Jiang, F\. F\. Xu, L\. Gao, Z\. Sun, Q\. Liu, J\. Dwivedi\-Yu, Y\. Yang, J\. Callan, and G\. Neubig \(2023\) Active retrieval augmented generation\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 7969–7992\.
- B\. Jin, H\. Zeng, Z\. Yue, J\. Yoon, S\. Arik, D\. Wang, H\. Zamani, and J\. Han \(2025\) Search\-r1: training llms to reason and leverage search engines with reinforcement learning\. arXiv preprint arXiv:2503\.09516\.
- V\. Karpukhin, B\. Oguz, S\. Min, P\. Lewis, L\. Wu, S\. Edunov, D\. Chen, and W\. Yih \(2020\) Dense passage retrieval for open\-domain question answering\. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, et al\. \(2020\) Retrieval\-augmented generation for knowledge\-intensive nlp tasks\. Advances in neural information processing systems 33, pp\. 9459–9474\.
- R\. Li, M\. Du, B\. Xu, C\. Zhu, X\. Wang, and Z\. Mao \(2026a\) DeepResearch bench ii: diagnosing deep research agents via rubrics from expert report\. arXiv:2601\.08536 \([Link](https://arxiv.org/abs/2601.08536)\)\.
- W\. Li, B\. Qu, B\. Pan, J\. Zhang, Z\. Liu, P\. Zhang, W\. Chen, and B\. Zhang \(2026b\) LiteResearcher: a scalable agentic rl training framework for deep research agent\. arXiv preprint arXiv:2604\.17931\.
- X\. Li, C\. Zhu, L\. Li, Z\. Yin, T\. Sun, and X\. Qiu \(2023\) Llatrieval: llm\-verified retrieval for verifiable generation\. arXiv preprint arXiv:2311\.07838\.
- X\. Li, G\. Dong, J\. Jin, Y\. Zhang, Y\. Zhou, Y\. Zhu, P\. Zhang, and Z\. Dou \(2025a\) Search\-o1: agentic search\-enhanced large reasoning models\. arXiv preprint arXiv:2501\.05366\.
- X\. Li, J\. Jin, G\. Dong, H\. Qian, Y\. Wu, J\. Wen, Y\. Zhu, and Z\. Dou \(2025b\) WebThinker: empowering large reasoning models with deep research capability\. arXiv:2504\.21776 \([Link](https://arxiv.org/abs/2504.21776)\)\.
- Y\. Li, H\. Cai, R\. Kong, X\. Chen, J\. Chen, J\. Yang, H\. Zhang, J\. Li, J\. Wu, Y\. Chen, et al\. \(2025c\) Towards ai search paradigm\. arXiv preprint arXiv:2506\.17188\.
- Y\. Li, J\. Chen, X\. Chen, Z\. Li, H\. Zhang, R\. Kong, J\. Li, X\. Ma, H\. Cai, L\. Su, et al\. \(2026c\) Retain to refine: adaptive online question answering via query routing and long\-short memory\. In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 1, pp\. 2312–2322\.
- Z\. Li, D\. Jiang, X\. Ma, H\. Zhang, P\. Nie, Y\. Zhang, K\. Zou, J\. Xie, Y\. Zhang, and W\. Chen \(2026d\) Openresearcher: a fully open pipeline for long\-horizon deep research trajectory synthesis\. arXiv preprint arXiv:2603\.20278\.
- X\. Ma, Y\. Gong, P\. He, H\. Zhao, and N\. Duan \(2023\) Query rewriting in retrieval\-augmented large language models\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing\.
- S\. Robertson and H\. Zaragoza \(2009\) The probabilistic relevance framework: bm25 and beyond\. Now Publishers Inc\.
- R\. Shao, A\. Asai, S\. Z\. Shen, H\. Ivison, V\. Kishore, J\. Zhuo, X\. Zhao, M\. Park, S\. G\. Finlayson, D\. Sontag, T\. Murray, S\. Min, P\. Dasigi, L\. Soldaini, F\. Brahman, W\. Yih, T\. Wu, L\. Zettlemoyer, Y\. Kim, H\. Hajishirzi, and P\. W\. Koh \(2025\) DR tulu: reinforcement learning with evolving rubrics for deep research\. arXiv:2511\.19399 \([Link](https://arxiv.org/abs/2511.19399)\)\.
- Z\. Shi, Y\. Chen, H\. Li, W\. Sun, S\. Ni, Y\. Lyu, R\. Fan, B\. Jin, Y\. Weng, M\. Zhu, Q\. Xie, X\. Guo, Q\. Yang, J\. Wu, J\. Zhao, X\. Tang, X\. Ma, C\. Wang, J\. Mao, Q\. Ai, J\. Huang, W\. Wang, Y\. Zhang, Y\. Yang, Z\. Tu, and Z\. Ren \(2025\) Deep research: a systematic survey\. arXiv:2512\.02038 \([Link](https://arxiv.org/abs/2512.02038)\)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. In Advances in Neural Information Processing Systems\.
- H\. Song, J\. Jiang, Y\. Min, J\. Chen, Z\. Chen, W\. X\. Zhao, L\. Fang, and J\. Wen \(2025\) R1\-searcher: incentivizing the search capability in llms via reinforcement learning\. arXiv preprint arXiv:2503\.05592\.
- L\. Su, Z\. Zhang, G\. Li, Z\. Chen, C\. Wang, M\. Song, X\. Wang, K\. Li, J\. Wu, X\. Chen, et al\. \(2025\) Scaling agents via continual pre\-training\. arXiv preprint arXiv:2509\.13310\.
- W\. Su, Y\. Tang, Q\. Ai, Z\. Wu, and Y\. Liu \(2024\) DRAGIN: dynamic retrieval augmented generation based on the information needs of large language models\. arXiv preprint arXiv:2403\.10081\.
- M\. Team, S\. Bai, L\. Bing, L\. Lei, R\. Li, X\. Li, X\. Lin, E\. Min, L\. Su, B\. Wang, et al\. \(2026a\) Mirothinker\-1\.7 & h1: towards heavy\-duty research agents via verification\. arXiv preprint arXiv:2603\.15726\.
- M\. Team, S\. Bai, L\. Bing, C\. Chen, G\. Chen, Y\. Chen, Z\. Chen, Z\. Chen, J\. Dai, X\. Dong, W\. Dou, Y\. Deng, Y\. Fu, J\. Ge, C\. Han, T\. Huang, Z\. Huang, J\. Jiao, S\. Jiang, T\. Jiao, X\. Jian, L\. Lei, R\. Li, R\. Luo, T\. Li, X\. Lin, Z\. Liu, Z\. Li, J\. Ni, Q\. Ren, P\. Sun, S\. Su, C\. Tao, B\. Wang, H\. Wang, H\. Wang, J\. Wang, J\. Wang, J\. Wang, L\. Wang, S\. Wang, W\. Wang, Z\. Wang, J\. Xu, S\. Xing, C\. Yang, H\. Ye, J\. Yu, Y\. Yu, M\. Zhong, T\. Zhao, X\. Zhu, Y\. Zhou, Y\. Zhang, and Z\. Zhu \(2025\) MiroThinker: pushing the performance boundaries of open\-source research agents via model, context, and interactive scaling\. arXiv:2511\.11793 \([Link](https://arxiv.org/abs/2511.11793)\)\.
- V\. Team, S\. Dai, Y\. Deng, J\. Lin, Y\. Song, G\. Wang, X\. Wu, Y\. Zhou, S\. Yang, Z\. Ying, et al\. \(2026b\) DR\-venus: towards frontier edge\-scale deep research agents with only 10k open data\. arXiv preprint arXiv:2604\.19859\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2022\) Interleaving retrieval with chain\-of\-thought reasoning for knowledge\-intensive multi\-step questions\. arXiv preprint arXiv:2212\.10509\.
- J\. Wang, Y\. Ming, R\. Dulepet, Q\. Chen, A\. Xu, Z\. Ke, F\. Sala, A\. Albarghouthi, C\. Xiong, and S\. Joty \(2025\) LiveResearchBench: a live benchmark for user\-centric deep research in the wild\. arXiv:2510\.14240 \([Link](https://arxiv.org/abs/2510.14240)\)\.
- L\. Wang, C\. Ma, X\. Feng, Z\. Zhang, H\. Yang, J\. Zhang, Z\. Chen, J\. Tang, X\. Chen, Y\. Lin, et al\. \(2024\) A survey on large language model based autonomous agents\. Frontiers of Computer Science 18\(6\), pp\. 186345\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023a\) Tree of thoughts: deliberate problem solving with large language models\. In Advances in Neural Information Processing Systems\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao \(2023b\) ReAct: synergizing reasoning and acting in language models\. In The Eleventh International Conference on Learning Representations \([Link](https://openreview.net/forum?id=WE_vluYUL-X)\)\.
- Z\. Yao, W\. Qi, L\. Pan, S\. Cao, L\. Hu, W\. Liu, L\. Hou, and J\. Li \(2024\) Seakr: self\-aware knowledge retrieval for adaptive retrieval augmented generation\. arXiv preprint arXiv:2406\.19215\.
- W\. Zhang, X\. Li, Y\. Zhang, P\. Jia, Y\. Wang, H\. Guo, Y\. Liu, and X\. Zhao \(2025\) Deep research: a survey of autonomous research agents\. arXiv:2508\.12752 \([Link](https://arxiv.org/abs/2508.12752)\)\.
- Y\. Zheng, D\. Fu, X\. Hu, X\. Cai, L\. Ye, P\. Lu, and P\. Liu \(2025\) DeepResearcher: scaling deep research via reinforcement learning in real\-world environments\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 414–431\. [Link](https://aclanthology.org/2025.emnlp-main.22/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.22), ISBN 979\-8\-89176\-332\-6\.
- Y\. Zhou, K\. Zheng, Q\. Chen, M\. Hu, Q\. Sun, C\. Xu, and J\. Chen \(2026\)OffSeeker: online reinforcement learning is not all you need for deep research agents\.arXiv preprint arXiv:2601\.18467\.Cited by:[§4\.2](https://arxiv.org/html/2606.07299#S4.SS2.p4.1)\.
- B\. Zhu, Q\. Jia, T\. Lan, J\. Ren, F\. Gu, F\. Jiang, L\. Wang, Z\. Xu, and W\. Luo \(2026a\)Marco deepresearch: unlocking efficient deep research agents via verification\-centric design\.arXiv preprint arXiv:2603\.28376\.Cited by:[§4\.2](https://arxiv.org/html/2606.07299#S4.SS2.p3.1)\.
- C\. Zhu, B\. Xu, M\. Du, S\. Wang, X\. Wang, Z\. Mao, and Y\. Zhang \(2026b\)FS\-researcher: test\-time scaling for long\-horizon research tasks with file\-system\-based agents\.External Links:2602\.01566,[Link](https://arxiv.org/abs/2602.01566)Cited by:[§4\.2](https://arxiv.org/html/2606.07299#S4.SS2.p2.1)\.

## Appendix A Prompt Templates

To make the cognitive mechanisms of Section [2.2](https://arxiv.org/html/2606.07299#S2.SS2) concrete and reproducible, this appendix reproduces desensitized excerpts of the core prompts that drive them. Due to product and safety constraints, we release only the high-level control logic: the full output schemas, field-level definitions, tool list, and other sensitive engineering details are omitted (marked inline by a bracketed note), while the reasoning logic and control structure are retained.

### A.1 Planner Prompt

The following desensitized excerpt corresponds to the graph-based dynamic planner of Section [2.2.1](https://arxiv.org/html/2606.07299#S2.SS2.SSS1). It maintains and updates the research DAG, enforces the structural constraints, governs when to re-plan, and emits the next batch of parallel actions.

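**Desensitized Planner Prompt (Excerpt)**

**[Role and Task]** You are a deep research planning expert. Based on the [User Requirements], the [Research Report Outline], the [Complete Plan Graph from the Previous Step], and the [Execution Results from the Previous Step], maintain and output a complete, executable, dynamically updatable research plan graph (a directed acyclic graph, DAG, of sub-tasks). Core responsibilities: (1) design a complete plan graph around the requirements and the outline; (2) using the planning history and execution results, assess whether the current plan needs to be updated; (3) ensure that the tasks cover the key questions, go deep enough, and have sound dependencies; (4) output the next action items that can be executed in parallel.

**[Reasoning Guidance Criteria (Rubric)]** This research follows the injected persistent and single-step rubrics, and every planning decision should refer to them: new tasks must satisfy the criteria; if executed results fall short, plan supplementary or verification tasks; task goals should align with the rubric's core dimensions. [The injected persistent / single-step rubric content is omitted here]

**[Hard Constraints (Excerpt)]** (1) Sub-tasks must directly serve the requirements and the outline and map explicitly to specific sections; (2) the plan graph must be a valid DAG with no cyclic dependencies, and sub-tasks at the same depth must not depend on each other; (3) the total planning depth is capped, to avoid pointlessly deep plans; (4) sub-task types are restricted by the tool list; there is exactly one report-type sub-task, placed in the final stage, and no other sub-task may produce final conclusions or write sections; (5) relative times must be converted into explicit time ranges based on the current time; (6) the output must be a valid, compact structured result. [Field-level constraints and type names omitted]

**[Re-planning Policy]** By default, keep the existing plan stable, and do not expand tasks arbitrarily just to "look more comprehensive". Update the research plan graph only when one of the following occurs: ① an executed task failed, returned invalid or clearly off-topic results, or cannot support its target section; ② a key section is not yet covered; ③ the collected information is insufficient for the depth some sections require; ④ information is conflicting, inconsistent, or uncertain, so new verification tasks are needed; ⑤ executed results expose a high-value new information gap whose filling would significantly improve report quality; ⑥ the user requirements add constraints, change goals, or shift the research focus. If none of these applies, reuse the existing unexecuted tasks as far as possible and make no unnecessary changes.

**[Task Design Rules]** *General rules*: each sub-task has a clear goal and can be executed independently; avoid over-fragmentation and merge similar tasks where possible; except for the report type, sub-tasks are responsible only for information collection, supplementation, verification, and lightweight integration checks, and must not generate final conclusions or section content. *Coverage checklist* (include items by relevance, not mechanical full coverage): historical background, current status, future trends, stakeholders, quantitative/qualitative evidence, horizontal comparison, risks and limitations, and topic-specific dimensions. *Task-type policy*: retrieval tasks acquire and cross-verify external information; lightweight reasoning tasks are used only for deduplication, merging, statistics, and consistency checks on already collected data, and must not take on report writing or synthesis; the report task is unique and placed in the final stage.

**[Output Specification]** Complete four steps in order: (1) *Task assessment*: assess, item by item, whether each result from the previous step succeeded, met the criteria, went off-topic, or lacks information (if there is no history, treat this as the first planning round); (2) *Plan decision*: state whether the plan is updated this round and the specific actions taken (add / modify / delete tasks or adjust dependencies); (3) *Plan generation*: output the complete plan graph, including executed and unexecuted tasks; (4) *Action-item generation*: select the sub-tasks whose dependencies are satisfied and that can be executed immediately in parallel, preferring those with the smallest depth and the highest information gain. [Field definitions, dependency and depth rules, and the JSON template omitted]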
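The structural constraints and the action-selection rule in the planner prompt can be illustrated with a small sketch. This is not the released implementation: the `SubTask` fields, the type names `"search"`, `"reason"`, and `"report"`, the numeric `gain` estimate, and the function names are all assumptions made for illustration, since the actual schema is omitted from the excerpt. Depth is taken here to be the length of the longest dependency chain, which guarantees that two sub-tasks at the same depth never depend on each other.

```python
from dataclasses import dataclass, field


@dataclass
class SubTask:
    task_id: str
    kind: str                                  # hypothetical: "search", "reason", "report"
    deps: list = field(default_factory=list)   # ids of prerequisite sub-tasks
    done: bool = False
    gain: float = 0.0                          # hypothetical information-gain estimate


def compute_depths(tasks):
    """Depth = length of the longest dependency chain; raises on a cycle."""
    depth, visiting = {}, set()

    def visit(tid):
        if tid in depth:
            return depth[tid]
        if tid in visiting:
            raise ValueError(f"cyclic dependency through {tid!r}")
        visiting.add(tid)
        depth[tid] = 1 + max((visit(d) for d in tasks[tid].deps), default=-1)
        visiting.discard(tid)
        return depth[tid]

    for tid in tasks:
        visit(tid)
    return depth


def validate_plan(tasks, max_depth):
    """Check the DAG, depth-cap, and single-report constraints."""
    depth = compute_depths(tasks)              # also rejects cycles
    deepest = max(depth.values())
    if deepest > max_depth:
        raise ValueError("plan exceeds the depth limit")
    reports = [t for t in tasks.values() if t.kind == "report"]
    if len(reports) != 1:
        raise ValueError("exactly one report sub-task is required")
    report = reports[0]
    if any(report.task_id in t.deps for t in tasks.values()):
        raise ValueError("no sub-task may depend on the report")
    if depth[report.task_id] != deepest:
        raise ValueError("the report must sit in the final stage")
    return depth


def next_actions(tasks, depth):
    """Unfinished tasks with all dependencies done: smallest depth, then highest gain."""
    ready = [t for t in tasks.values()
             if not t.done and all(tasks[d].done for d in t.deps)]
    ready.sort(key=lambda t: (depth[t.task_id], -t.gain))
    return [t.task_id for t in ready]
```

In this sketch a caller would run `validate_plan` after every re-planning step and then dispatch the ids returned by `next_actions` as one parallel batch. In the system described by the paper these decisions are made by the planner model under the prompt, not by fixed code.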

### A.2 Rubric-Generation Prompts

The rubric generator of Section [2.2.3](https://arxiv.org/html/2606.07299#S2.SS2.SSS3) operates at two levels. The *orchestration-level* prompt (below) is invoked after each planning–execution cycle to assess cross-sub-task integration quality and to decide whether further retrieval is warranted; the *search-level* prompt is invoked by each inner Search Agent after every tool response to steer its next retrieval step.

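**Desensitized Orchestration-Level Rubric Prompt (Excerpt)**

**[Role]** You generate the cross-sub-task information-integration quality criteria for the outer research agent. The outer agent has received several inner sub-task summaries and is deciding whether the information is sufficient to support the report, whether additional sub-tasks are needed, and how to organize the results into the outline. Generate two kinds of criteria: persistent criteria (Persistent), which apply throughout subsequent decisions, and single-step criteria (Ephemeral), which are based on the current aggregated state.

**[Persistent criteria, four to six]** Select from the following dimensions according to the topic type: ① *cross-sub-task coverage completeness* (whether every outline section is covered by a sub-task; identify mapping gaps); ② *information consistency* (whether the summaries contradict each other, and whether cross-verification or a choice between them is needed); ③ *balance of analytical depth* (whether depth is balanced across sections, and whether any section is too shallow or redundant); ④ *supportability of core arguments* (whether conclusions rest on a sufficient chain of evidence, and whether predictive conclusions include multiple scenarios or a statement of uncertainty); ⑤ *completion of hard instructions* (whether explicit requirements on format, quantity, order, and the like are satisfied collectively). The guidance of each criterion must give an executable check action aimed at "merge decisions and outline adjustment".

**[Single-step criteria, two to four]** Check coverage section by section against the outline and identify three types of cross-sub-task gaps (in decreasing priority): *coverage gap* (a section has no corresponding information), *uneven depth* (a section has only shallow information), and *conflict requiring adjudication* (several sub-tasks give inconsistent information on the same dimension). Each criterion states the gap type, the affected sections, and a suggested action.

**[Key constraints]** The guidance of each criterion must be an executable reasoning instruction, not a numeric score; when the collected information is already sufficient to support report writing, explicitly mark "no further retrieval needed" to give the planning layer an adaptive stopping signal. [The full JSON output structure and field details of the persistent / single-step criteria are omitted]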
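The three gap types and the stopping signal in the orchestration-level prompt can be made concrete with a deterministic stand-in. In the paper this assessment is produced by a language model under the prompt; the sketch below only mimics its control logic. The `Finding` record, its fields, the gap labels, and the suggested-action strings are hypothetical, and a "conflict" is approximated as two different values reported for the same dimension of the same section.

```python
from dataclasses import dataclass

# Gap types in decreasing priority, following the ordering stated in the prompt.
GAP_PRIORITY = {"coverage_gap": 0, "uneven_depth": 1, "conflict": 2}


@dataclass(frozen=True)
class Finding:
    section: str      # outline section the sub-task summary maps to
    dimension: str    # hypothetical label, e.g. "market size"
    value: str        # normalized claim reported on that dimension
    shallow: bool     # True if the summary is only concept-level


def single_step_criteria(outline, findings, max_items=4):
    """Return (gaps, stop): prioritized gaps and an adaptive stopping flag."""
    by_section = {section: [] for section in outline}
    for f in findings:
        if f.section in by_section:
            by_section[f.section].append(f)

    gaps = []
    for section in outline:
        items = by_section[section]
        if not items:
            gaps.append(("coverage_gap", section, "add a retrieval sub-task"))
            continue
        if all(f.shallow for f in items):
            gaps.append(("uneven_depth", section, "add a deepening sub-task"))
        claims = {}
        for f in items:
            claims.setdefault(f.dimension, set()).add(f.value)
        if any(len(values) > 1 for values in claims.values()):
            gaps.append(("conflict", section, "add a verification sub-task"))

    # Stable sort: gaps of equal priority keep their outline order.
    gaps.sort(key=lambda g: GAP_PRIORITY[g[0]])
    stop = not gaps   # no gaps left: signal "no further retrieval needed"
    return gaps[:max_items], stop
```

Each returned tuple carries the gap type, the affected section, and a suggested action, mirroring the three fields the prompt asks for. The `stop` flag plays the role of the explicit "no further retrieval needed" marker.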

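**Desensitized Search-Level Rubric Prompt (Excerpt)**

**[Role]** You generate the reasoning guidance criteria for the inner deep-search agent. This agent executes a single research sub-task. After each tool return, generate two kinds of criteria: persistent criteria (valid for the whole sub-task, used to judge "whether a piece of information is worth keeping or pursuing further") and single-step criteria (specific to the current return, used to guide "what to retrieve next").

**[Persistent criteria, three to five]** Based on the sub-task goal, cover: *relevance* (whether the information directly serves the sub-task goal), *specificity* (whether it contains citable data, dates, sources, or cases), *source reliability* (whether the source is authoritative and whether there is a timeliness risk), *coverage-gap awareness* (which dimensions are still empty or shallow), and *conflict identification* (whether it contradicts existing information). The guidance targets the evaluation of the current round's information.

**[Single-step criteria, two to three]** Compare "the information the goal requires" with "the content of this return" and identify three types of gaps (in decreasing priority): *missing* (required by the goal but not touched by this return), *insufficient depth* (concept-level only, lacking data, mechanisms, or cases), and *follow-up leads* (entities, times, or names worth tracking). State each criterion in the format "current state / search target / suggested action", and select those with the greatest impact on the next step. [JSON output structure and field details omitted]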
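The different lifetimes of the two kinds of criteria can be sketched as a small state object: persistent criteria are fixed for the whole sub-task, while single-step criteria are replaced after every tool return. The class names, field names, gap labels, and the `keep` parameter are hypothetical, chosen only to mirror the prompt's wording; the real output schema is omitted from the excerpt.

```python
from dataclasses import dataclass, field

# Gap types in decreasing priority, as listed in the search-level prompt.
GAP_ORDER = ("missing", "insufficient_depth", "follow_up_lead")


@dataclass
class StepCriterion:
    gap: str                 # one of GAP_ORDER
    current_state: str
    search_target: str
    suggested_action: str

    def render(self) -> str:
        # "current state / search target / suggested action" format
        return f"{self.current_state} / {self.search_target} / {self.suggested_action}"


@dataclass
class SearchRubric:
    persistent: list                                  # set once per sub-task
    ephemeral: list = field(default_factory=list)     # rebuilt per tool return

    def on_tool_return(self, candidates, keep=3):
        """Replace (not extend) the single-step criteria with the top candidates."""
        ranked = sorted(candidates, key=lambda c: GAP_ORDER.index(c.gap))
        self.ephemeral = ranked[:keep]
```

A caller would construct one `SearchRubric` per sub-task, call `on_tool_return` after each tool response with freshly generated candidates, and pass both lists to the Search Agent as guidance for its next retrieval step.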
