THUDM/DeepDive

Source: https://github.com/THUDM/DeepDive

DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL



🔥 News

  • [2025/10/02] Released the complete data construction pipeline — now fully available in the repository.
  • [2025/09/17] QA pairs and SFT trajectories have been fully open-sourced, totaling 4,108 entries. Check them out in the DeepDive dataset on Hugging Face.
  • Model and code are currently under development – coming soon!

Overview

DeepDive presents an automated approach for training deep search agents that can navigate complex, multi-step information-seeking tasks. Our method combines automated data synthesis from knowledge graphs with end-to-end multi-turn reinforcement learning to create agents capable of sophisticated long-horizon reasoning and web browsing.

Key Features

  • Automated Deep Search Data Synthesis: Generate challenging QA pairs from knowledge graphs through controlled random walks
  • Multi-Turn RL Training for Browsing: End-to-end reinforcement learning for deep search capabilities
  • Test-Time Scaling: Supports scaling via tool calls and parallel sampling

Method Overview

Stage 1: Automated Data Synthesis from Knowledge Graphs

[Figure: Data Synthesis Pipeline]

We propose an automated method to synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. The process involves three key steps:

Knowledge Graph Random Walks: Starting from an initial node v_0, we navigate through the graph for k steps to form a path P=[v_0, v_1, \ldots, v_k], where each step (v_i, v_{i+1}) is a valid edge in the graph. We choose longer path lengths (k > 5) to increase reasoning complexity.
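
As a concrete illustration, the walk can be sketched in a few lines of Python. This is a minimal sketch, not the repository's pipeline: the adjacency-map representation and the `sample_path` name are assumptions.

```python
import random

def sample_path(graph, start, k=6, rng=random.Random(0)):
    """k-step random walk from `start`: returns P = [v_0, v_1, ..., v_k].

    `graph` maps an entity id to a list of (relation, neighbor) pairs,
    so every hop (v_i, v_{i+1}) is a valid edge; k > 5 keeps the
    reasoning chain long.
    """
    path = [start]
    node = start
    for _ in range(k):
        # drop already-visited neighbors so every hop adds information
        candidates = [v for _, v in graph.get(node, []) if v not in path]
        if not candidates:
            return None  # dead end: reject this walk and resample
        node = rng.choice(candidates)
        path.append(node)
    return path

# toy usage
kg = {"A": [("born_in", "B")], "B": [("capital_of", "C")], "C": [("part_of", "D")]}
print(sample_path(kg, "A", k=3))  # ['A', 'B', 'C', 'D']
```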

Entity Obfuscation: We combine each node v_i in the path with its corresponding attributes to form an attribute-rich path:

P_A = [(v_0, [a_0^0, a_0^1, \ldots]), (v_1, [a_1^0, a_1^1, \ldots]), \ldots, (v_k, [a_k^0, a_k^1, \ldots])]

An LLM then obfuscates information along the entire path, generalizing specific details and creating “blurry entities” that require deep search to resolve.
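
A hedged sketch of how the attribute-rich path might be assembled and handed to an LLM for obfuscation; `call_llm` is a placeholder for whichever chat client the pipeline uses, and the prompt wording is illustrative, not the repository's.

```python
def build_attribute_path(path, attributes):
    """Pair each node v_i with its attribute list [a_i^0, a_i^1, ...]."""
    return [(v, attributes.get(v, [])) for v in path]

# illustrative prompt; the real pipeline's instructions will differ
OBFUSCATE_PROMPT = """Rewrite the entities below into vague descriptions
('blurry entities') that can only be resolved by deep web search.
Generalize names, dates, and places, but keep every clue verifiable.

Attribute-rich path: {path}

Return one hard question whose unique answer is the final entity."""

def synthesize_question(path, attributes, call_llm):
    attr_path = build_attribute_path(path, attributes)
    return call_llm(OBFUSCATE_PROMPT.format(path=attr_path))
```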

Difficulty Filtering: We use a frontier model (GPT-4o) with basic search to attempt each question four times. Only questions that the frontier model fails on in all four attempts are retained, ensuring high difficulty.
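
The retention rule reduces to "keep only what the frontier model misses four out of four times". A minimal sketch, with `frontier_answer` standing in for a GPT-4o-with-search call and `judge` for the answer matcher:

```python
def is_hard_enough(question, gold, frontier_answer, judge, attempts=4):
    """Retain a QA pair only if the frontier model fails every attempt."""
    for _ in range(attempts):
        prediction = frontier_answer(question)  # stub: GPT-4o + basic search
        if judge(prediction, gold):             # any success means too easy
            return False
    return True

# usage: kept = [qa for qa in candidates
#                if is_hard_enough(qa["q"], qa["a"], frontier_answer, judge)]
```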

Stage 2: End-to-End Multi-Turn Reinforcement Learning

[Figure: Multi-Turn RL Training]

We apply end-to-end multi-turn RL to strengthen the agent's long-horizon reasoning and browsing capabilities. Training follows an iterative cycle: at step t, the agent generates a chain of thought c_t, executes a browsing action a_t, and observes the returned web content o_t.
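
Schematically, one rollout of this cycle might look like the following; the message layout, `parse_reply` helper, and tool interface are illustrative assumptions, not the repository's actual formats.

```python
def rollout(policy, question, tools, parse_reply, max_turns=32):
    """One multi-turn episode: think (c_t), act (a_t), observe (o_t)."""
    messages = [{"role": "user", "content": question}]
    trajectory = []
    for _ in range(max_turns):
        reply = policy.generate(messages)      # chain of thought + action
        thought, action = parse_reply(reply)   # illustrative format parser
        if action["name"] == "answer":         # terminal answer ends the episode
            trajectory.append((thought, action, None))
            return trajectory, action["args"]["text"]
        observation = tools[action["name"]](**action["args"])  # o_t: web content
        trajectory.append((thought, action, observation))
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": observation})
    return trajectory, None  # tool-call budget exhausted without an answer
```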

Multi-Turn GRPO Training: We employ Group Relative Policy Optimization with normalized advantages:

A_i = \frac{r_i - \text{mean}(\{r_k\}_{k=1}^G)}{\text{std}(\{r_k\}_{k=1}^G)}
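
In code, the group-relative advantage is per-group reward standardization over the G rollouts sampled for one question; a minimal NumPy sketch:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean(r)) / std(r) over one group of G rollouts.

    eps guards the degenerate case where all rewards in a group are equal.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. a group of 8 rollouts of one question with binary rewards
print(group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 0]))
```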

Strict Binary Rewards: A trajectory receives reward +1 if and only if both format correctness and answer accuracy are satisfied:

r(\mathcal{T}) = \begin{cases} 1, & (\forall i, \text{Format}(c_i, a_i)) \wedge \text{Judge}(a_{\text{eos}}, a^*) \\ 0, & \text{otherwise} \end{cases}

This strict reward mechanism ensures high-quality trajectories and prevents reward hacking.
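
The reward translates directly into code; `format_ok` and `judge` are stand-ins for the format checker and the answer judge:

```python
def trajectory_reward(trajectory, final_answer, gold, format_ok, judge):
    """+1 iff every turn (c_i, a_i) is well-formatted AND the final answer
    is judged correct; otherwise 0. Zeroing the whole trajectory on any
    violation is what discourages reward hacking."""
    all_formatted = all(format_ok(thought, action)
                        for thought, action, _ in trajectory)
    return 1.0 if all_formatted and judge(final_answer, gold) else 0.0
```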

Models

| Model | Parameters | HuggingFace Hub | Performance (BrowseComp) |
| --- | --- | --- | --- |
| DeepDive-9B | 9B | coming soon | 6.3% |
| DeepDive-32B | 32B | coming soon | 14.8% |

Data

Synthetic Dataset Construction

Our automated data synthesis pipeline creates challenging QA pairs through knowledge graph random walks, entity obfuscation, and difficulty filtering. The process uses multi-hop paths (k=5-9) through KILT and AMiner knowledge graphs.

| Component | Size | Explanation |
| --- | --- | --- |
| Total Dataset | 3,250 | All QA pairs in the training corpus |
| SFT Portion | 1,016 | Subset of the data used for Supervised Fine-Tuning (SFT) |
| ↳ SFT Trajectories | 858 | Search traces collected from the SFT QA pairs via rejection sampling |
| RL Portion | 2,234 | Subset of the data used for Reinforcement Learning (RL) |

Data Example

[Figure: data example]

Results

Main Results

We evaluate DeepDive on four challenging deep search benchmarks: BrowseComp, BrowseComp-ZH, Xbench-DeepSearch, and SEAL-0. The results demonstrate consistent improvements over existing open-source models.

| Model | BrowseComp | BrowseComp-ZH | Xbench-DeepSearch | SEAL-0 |
| --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | |
| GPT-4o | 0.9* | 11.1 | 18.0* | 0.9 |
| GPT-4o† | 1.9* | 12.8 | 30.0 | 9.1 |
| Claude-3.7-Sonnet | 2.3 | 11.8 | 12.0 | 2.7 |
| Claude-3.7-Sonnet† | 4.5 | 14.2 | 29.0 | 14.4 |
| o1 | 9.9* | 29.1* | 38.0 | 11.7 |
| o4-mini | 6.1* | 15.2* | 22.3* | 2.7 |
| Claude-4-Sonnet-Thinking | 2.6 | 21.5 | 27.0 | 9.0 |
| Claude-4-Sonnet-Thinking† | 14.7 | 30.8 | 53.0 | 37.8 |
| Grok-DeepResearch | - | 12.9* | 50+ | - |
| Doubao-DeepThink | - | 26.0* | 50+ | - |
| DeepResearch | 51.5* | 42.9* | - | - |
| **Open-Source Models** | | | | |
| GLM-Z1-9B-0414 | 0.6 | 2.4 | 8.0 | 7.2 |
| GLM-Z1-9B-0414† | 0.6 | 1.7 | 3.0 | 2.7 |
| Qwen2.5-32B-Instruct | 0.6 | 9.3 | 8.7* | 2.7 |
| Qwen2.5-32B-Instruct† | 1.5 | 1.7 | 12.0 | 0.9 |
| Qwen3-235B-A22B-Instruct-2507 | 0.9 | 17.6 | 17.0 | 6.3 |
| Qwen3-235B-A22B-Instruct-2507† | 0.9 | 14.9 | 26.0 | 9.1 |
| Qwen3-235B-A22B-Thinking-2507 | 3.1 | 20.1 | 22.0 | 9.0 |
| Qwen3-235B-A22B-Thinking-2507† | 4.6 | 22.5 | 37.0 | 13.5 |
| QwQ-32B | 1.7 | 13.5 | 10.7* | 5.4 |
| QwQ-32B† | 1.3 | 14.5 | 27.0 | 4.5 |
| DeepSeek-V3-0324 | 1.5 | 24.6 | 36.0 | 6.3 |
| DeepSeek-R1 | 2.0 | 23.2 | 32.7* | 5.4 |
| DeepSeek-R1-0528 | 3.2 | 28.7 | 37.0 | 5.4 |
| GLM-4.5-Air | 21.3 | 36.3 | 65.0 | 30.6 |
| GLM-4.5 | 26.4 | 37.5 | 68.0 | 36.0 |
| **Web Agents** | | | | |
| Search-o1-32B | 2.8* | 17.9* | 25.0* | - |
| WebThinker-32B | 2.8* | 7.3* | 24.0* | - |
| WebDancer-32B | 3.8* | 18.0* | 39.0* | - |
| WebSailor-7B | 6.7* | 14.2* | 34.3* | - |
| WebSailor-32B | 10.5* | 25.5* | 53.3* | - |
| **DeepDive (Ours)** | | | | |
| DeepDive-9B (sft-only) | 5.6 | 15.7 | 35.0 | 15.3 |
| DeepDive-9B | 6.3 | 15.1 | 38.0 | 12.2 |
| DeepDive-32B (sft-only) | 9.5 | 23.0 | 48.5 | 23.9 |
| DeepDive-32B | 14.8 | 25.6 | 50.5 | 29.3 |

\* represents performance reported in existing studies. † represents equipping browsing via function call.

Generalization on Simple Search Tasks

We evaluate DeepDive not only on challenging search tasks (e.g., BrowseComp, BrowseComp-ZH) but also on simpler benchmarks such as HotpotQA, Frames, and WebWalker. DeepDive-32B (SFT and RL) consistently outperforms strong baselines, achieving >60 points on WebWalker, surpassing WebShaper-72B (52.2). These results demonstrate DeepDive’s strong and generalizable search capabilities.


Test-Time Scaling

DeepDive demonstrates remarkable test-time scaling capabilities through two mechanisms:

Tool Call Scaling: Allowing DeepDive more tool calls during inference leads to higher accuracy on complex, multi-step tasks. As shown in the BrowseComp and BrowseComp-ZH benchmark results:

  • When the maximum number of tool calls is increased, accuracy rises steadily. On BrowseComp, performance improves from 8% accuracy at 8 tool calls to 15% at 128 tool calls, showing that the model benefits from additional search and reasoning opportunities.
  • DeepDive-32B consistently outperforms its SFT-only variant, especially when the allowed tool calls exceed 32. This indicates that the reinforcement learning stage better equips the model to utilize long tool call horizons.

Parallel Sampling: Beyond a larger tool call budget, DeepDive leverages parallel sampling to further boost performance. For each question, DeepDive generates 8 independent reasoning trajectories in parallel.

  • Three answer selection strategies are considered: single-shot inference, majority voting among samples, and choosing the answer that required the fewest tool calls before submission.
  • Empirical analysis reveals a clear trend: answers submitted earlier with fewer tool calls are usually more accurate. In practice, selecting the answer with minimal tool calls among 8 samples raises accuracy from 12.0% (single-shot) to 24.8%, more than doubling performance. Majority voting also helps (18.8% accuracy), but is outperformed by minimal tool-call selection.
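
The three selection strategies reduce to a few lines; a sketch assuming each parallel sample carries its final answer and the number of tool calls it used (names are illustrative):

```python
from collections import Counter

def select_answer(samples, strategy="min_tool_calls"):
    """`samples`: list of (answer, n_tool_calls) from parallel rollouts."""
    if strategy == "single_shot":
        return samples[0][0]
    if strategy == "majority_vote":
        return Counter(answer for answer, _ in samples).most_common(1)[0][0]
    # min_tool_calls: the answer submitted after the fewest tool calls,
    # which the paper finds most accurate (12.0% -> 24.8% on BrowseComp)
    return min(samples, key=lambda s: s[1])[0]
```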

Additional Study: Semi-Automated i.i.d. Deep Search QA for RL

To further boost performance on deep search tasks, we create a semi-automated framework for generating i.i.d. QA pairs.

Training with i.i.d. data brings larger improvements. The 32B-RL model reaches 22.2% accuracy on BrowseComp, up from 14.8%, and also performs better on Chinese benchmarks.

| Model | Data | BrowseComp | BrowseComp-ZH | Xbench-DeepSearch | SEAL-0 |
| --- | --- | --- | --- | --- | --- |
| 32B (sft-only) | KG | 9.5 | 23.0 | 48.5 | 23.9 |
| 32B | KG | 14.8 | 25.6 | 50.5 | 29.3 |
| 32B (sft-only) | i.i.d. | 11.4 | 26.6 | 47.5 | 22.5 |
| 32B | i.i.d. | 22.2 | 33.9 | 56.0 | 23.0 |

After contamination analysis, both the KG data and the i.i.d. data have been adopted by the open Z-AI GLM-4.5 / GLM-4.6 models, which show strong performance on BrowseComp.

Acknowledgments

  • Built on top of GLM-4 and QwQ base models
  • Uses Slime framework for RL training
  • Powered by Serper and Jina APIs for web access

Citation

If you find DeepDive useful for your research, please cite our paper:

@misc{lu2025deepdiveadvancingdeepsearch,
      title={DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL},
      author={Rui Lu and Zhenyu Hou and Zihan Wang and Hanchen Zhang and Xiao Liu and Yujiang Li and Shi Feng and Jie Tang and Yuxiao Dong},
      year={2025},
      eprint={2509.10446},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.10446},
}
