Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

arXiv cs.AI 05/14/26, 04:00 AM Papers
Summary
Introduces MultiSearch, an RL-based framework that generates multiple queries at each reasoning step and explicitly merges retrieved information to improve signal-to-noise ratio and reasoning accuracy in question-answering tasks.
arXiv:2605.13534v1 Announce Type: new
Cached at: 05/14/26, 06:16 AM
# Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
Source: [https://arxiv.org/html/2605.13534](https://arxiv.org/html/2605.13534)
Jiabei Liu¹\*, Wenyu Mao¹\*, Junfei Tan¹, Chunxu Shen²†, Lingling Yi², Jiancan Wu¹†, Xiang Wang¹†

¹University of Science and Technology of China, ²WeChat Technical Architecture Department, Tencent Inc.

\*Equal contribution. †Corresponding author.

###### Abstract

Deep search agents have proven effective in enhancing LLMs by retrieving external knowledge during multi\-step reasoning\. However, existing methods often generate a single query for retrieval at each reasoning step, limiting information coverage and introducing high noise\. This may result in low signal\-to\-noise ratios \(SNR\) during search, degrading reasoning accuracy and leading to unnecessary reasoning steps\. In this paper, we introduce MultiSearch, an RL\-based framework that addresses these limitations through multi\-query retrieval and explicit merging of retrieved information\. At each reasoning step, MultiSearch generates queries from multiple perspectives and retrieves external information in parallel, expanding the scope of relevant information and mitigating the reliance on any single retrieval result\. Then, the agent consolidates and refines retrieved information at the merging process, improving the SNR and ensuring more accurate reasoning\. Additionally, we propose a reinforcement learning framework with a multi\-process reward design to optimize agents for both multi\-query retrieval and information consolidation\. Extensive experiments on seven benchmarks demonstrate that MultiSearch outperforms baseline methods, enhancing the SNR of retrieval and improving reasoning performance in question\-answering tasks\.

## 1 Introduction

Large language models \(LLMs\) have demonstrated strong capabilities in understanding and reasoning\[[1](https://arxiv.org/html/2605.13534#bib.bib1)\]\. However, they remain constrained on knowledge\-intensive tasks by their reliance on static internal knowledge\[[2](https://arxiv.org/html/2605.13534#bib.bib2),[3](https://arxiv.org/html/2605.13534#bib.bib3)\]\. Retrieval\-augmented generation \(RAG\) alleviates this limitation by enabling LLMs to access external knowledge during generation\[[4](https://arxiv.org/html/2605.13534#bib.bib4),[5](https://arxiv.org/html/2605.13534#bib.bib5)\]\. For complex questions, a single retrieval step is often insufficient, since the information required for answering may depend on intermediate reasoning states\. This motivates deep search agents, which retrieve external knowledge during multi\-step reasoning and use the retrieved information to support subsequent reasoning\[[6](https://arxiv.org/html/2605.13534#bib.bib6),[7](https://arxiv.org/html/2605.13534#bib.bib7),[8](https://arxiv.org/html/2605.13534#bib.bib8),[9](https://arxiv.org/html/2605.13534#bib.bib9)\]\.

Most existing deep search methods follow a ReAct-style retrieval-during-reasoning paradigm\[[10](https://arxiv.org/html/2605.13534#bib.bib10),[11](https://arxiv.org/html/2605.13534#bib.bib11),[12](https://arxiv.org/html/2605.13534#bib.bib12),[13](https://arxiv.org/html/2605.13534#bib.bib13),[14](https://arxiv.org/html/2605.13534#bib.bib14),[15](https://arxiv.org/html/2605.13534#bib.bib15)\]. At each reasoning step, the agent generates a search query conditioned on its current reasoning state, retrieves the top-$k$ relevant documents from an external knowledge source, and incorporates the retrieved information into the context for subsequent reasoning. This process repeats until the agent has gathered sufficient information to produce a final answer. To optimize this multi-step process, recent reinforcement learning (RL)-based methods model the search engine as part of the environment and train the agent on sampled reasoning-and-search trajectories\[[13](https://arxiv.org/html/2605.13534#bib.bib13),[16](https://arxiv.org/html/2605.13534#bib.bib16)\]. The policy is typically updated with outcome-level rewards, such as final-answer correctness, which encourages better query generation, tool use, and final question-answering performance.

Despite its effectiveness, this paradigm can limit the quality of the intermediate retrieval used for reasoning\. In particular, we identify two key limitations:

- Low Signal-to-Noise Ratio (SNR) of Retrieval. At each reasoning step, existing methods typically rely on a single retrieval query, which captures only one formulation of the current information need. For multi-hop or complex questions involving multiple entities, relations, or sub-questions, this may retrieve only partial information. Moreover, if the generated query is under-specified, ambiguous, or mismatched with the corpus, the retrieved top-$k$ documents may include irrelevant or noisy content. The resulting lack of useful information and presence of noise can lower the signal-to-noise ratio (SNR) of the intermediate reasoning context, degrading reasoning accuracy or leading to unnecessary search steps, as illustrated by the single-query failures in Figure [1](https://arxiv.org/html/2605.13534#S1.F1)(a).
- Underexplored fine-grained supervision. Recent RL-based deep search methods are often optimized with outcome-level rewards, such as final-answer correctness, sometimes supplemented by format or tool-use rewards\[[12](https://arxiv.org/html/2605.13534#bib.bib12),[13](https://arxiv.org/html/2605.13534#bib.bib13),[16](https://arxiv.org/html/2605.13534#bib.bib16)\]. However, these rewards provide limited feedback on intermediate behaviors during retrieval-during-reasoning. In particular, when introducing intermediate mechanisms to improve retrieval quality and reduce noise, outcome-level supervision alone may be insufficient, as it provides limited guidance on whether the agent retrieves sufficient useful information and consolidates it into a reliable context for reasoning. This motivates a multi-process reward design that provides targeted supervision for both retrieval and information consolidation, as illustrated in Figure [1](https://arxiv.org/html/2605.13534#S1.F1)(b).

![Refer to caption](https://arxiv.org/html/2605.13534v1/x1.png)

Figure 1: Comparison of classic deep search methods and MultiSearch. (a) Previous serial single-query retrieval can be affected by noisy information and may consume more reasoning steps. MultiSearch employs multi-query retrieval with explicit merging to capture more comprehensive information with fewer steps. (b) Building on prior methods, MultiSearch incorporates more targeted rewards to enable fine-grained supervision.

To address these limitations, we propose MultiSearch, an RL-based framework that improves retrieval-during-reasoning through multi-query retrieval and explicit merging of retrieved information. At each reasoning step, MultiSearch generates queries from multiple perspectives, including rephrasing, concept expansion, and question decomposition, and retrieves external information in parallel. This expands the scope of relevant information and reduces reliance on any single retrieval result. The agent then consolidates and refines the retrieved information through an explicit merging step, improving the SNR of the intermediate retrieval for subsequent reasoning. To train the agent to use these mechanisms effectively, we introduce a multi-process reward design for reinforcement learning. In addition to the outcome reward for final-answer correctness, we use a multi-query reward to encourage retrieval with multiple queries and a merging reward to encourage information consolidation for high SNR. These process-level rewards provide targeted supervision for the intermediate retrieval and merging behaviors. We optimize the resulting multi-reward objective with Group reward-Decoupled Normalization Policy Optimization (GDPO)\[[17](https://arxiv.org/html/2605.13534#bib.bib17)\], which separately normalizes heterogeneous reward signals before aggregating them for policy optimization.

We train MultiSearch with Qwen2.5-3B/7B Base and Instruct backbones, and evaluate it on seven QA benchmarks, including three single-hop datasets\[[18](https://arxiv.org/html/2605.13534#bib.bib18),[19](https://arxiv.org/html/2605.13534#bib.bib19),[20](https://arxiv.org/html/2605.13534#bib.bib20)\] and four multi-hop datasets\[[21](https://arxiv.org/html/2605.13534#bib.bib21),[22](https://arxiv.org/html/2605.13534#bib.bib22),[23](https://arxiv.org/html/2605.13534#bib.bib23),[24](https://arxiv.org/html/2605.13534#bib.bib24)\]. MultiSearch achieves the best average performance among the compared baselines\[[11](https://arxiv.org/html/2605.13534#bib.bib11),[13](https://arxiv.org/html/2605.13534#bib.bib13),[15](https://arxiv.org/html/2605.13534#bib.bib15),[16](https://arxiv.org/html/2605.13534#bib.bib16)\] across these settings. Experimental results show that the agent progressively learns to issue multiple retrieval queries and produce higher-SNR merged information, suggesting that the proposed reward design encourages the intended intermediate behaviors. Ablation studies further verify the contributions of multi-query retrieval, explicit merging, and the corresponding rewards to the overall performance.

## 2 Method

![Refer to caption](https://arxiv.org/html/2605.13534v1/x2.png)

Figure 2: The training framework of MultiSearch. (a) A question-answering example, including think, multi-query search, information, merge, and answer steps. (b) Overview of the policy optimization procedure, where the model is trained using GDPO.

In this section, we describe the MultiSearch framework. We begin by outlining the trajectory generation process, with a particular emphasis on our parallel multi-query retrieval mechanism (§[2.1](https://arxiv.org/html/2605.13534#S2.SS1)). Next, we present our multi-granularity reward design, which includes an answer reward, a multi-query reward, and a merging reward (§[2.2](https://arxiv.org/html/2605.13534#S2.SS2)). Finally, we detail the training objective based on Group reward-Decoupled Normalization Policy Optimization (§[2.3](https://arxiv.org/html/2605.13534#S2.SS3)).

### 2.1 Generation with Multi-Query Retrieval

#### Rollout Generation\.

For each question $q$ in the training datasets, the agent iteratively interacts with the search engine $\mathcal{E}$ and generates a reasoning trajectory $o$. Specifically, it generates multiple queries in <search\>…</search\> to trigger retrieval tools, and uses <information\>…</information\> to encapsulate the retrieved documents. Then, the agent explicitly extracts and merges key information from the retrieved content within <merge\>…</merge\>. The existing response concatenated with the <information\> and <merge\> blocks serves as the next input prompt for the subsequent generation step. This search $\to$ info $\to$ merge cycle continues until the agent determines that there is sufficient information and presents an answer inside <answer\>…</answer\> (*cf.* Figure [2](https://arxiv.org/html/2605.13534#S2.F2)(a)).
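The rollout cycle described above can be sketched as a short loop. This is only an illustrative sketch, not the authors' implementation: `generate` (the policy model) and `retrieve` (the search engine) are hypothetical callables, the newline-separated query format inside the search block is an assumption, and `max_steps` is an assumed safety cap.

```python
import re
from typing import Callable, List

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rollout(question: str,
            generate: Callable[[str], str],
            retrieve: Callable[[str], List[str]],
            max_steps: int = 4) -> str:
    """Run a search -> info -> merge loop and return the trajectory text.

    `generate` and `retrieve` are hypothetical stand-ins for the policy
    model and the search engine.
    """
    context = question
    for _ in range(max_steps):
        segment = generate(context)      # reasoning plus <search> or <answer>
        context += segment
        if ANSWER_RE.search(segment):    # final answer produced: stop
            break
        match = SEARCH_RE.search(segment)
        if match is None:                # no retrieval requested: stop
            break
        # Assumption: one query per line inside the <search> block.
        queries = [q.strip() for q in match.group(1).split("\n") if q.strip()]
        docs = [d for q in queries for d in retrieve(q)]
        context += "<information>" + "\n".join(docs) + "</information>"
        context += generate(context)     # policy writes the <merge> block
    return context
```

The accumulated `context` plays the role of the "existing response concatenated with the information and merge blocks" that is fed back as the next prompt.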

#### Multi\-Query Retrieval\.

To scale the retrieval from diverse perspectives, we equip the agent with three query generation strategies: rephrasing, concept expansion, and question decomposition\. Rephrasing helps retrieve documents that use different lexical or syntactic expressions, mitigating the risk of missing relevant information due to phrasing mismatches\[[25](https://arxiv.org/html/2605.13534#bib.bib25)\]\. Concept expansion broadens the search scope by adding related terms, synonyms, or hypernyms, which is particularly useful when the initial query is too narrow\[[26](https://arxiv.org/html/2605.13534#bib.bib26)\]\. Question decomposition splits a complex question into simpler sub\-questions, solves them in parallel, and then consolidates the retrieved information to produce the final answer\. At each retrieval step, the agent generates three queries for parallel retrieval, adopting one or more of the above strategies or exploring alternative strategies autonomously\.
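As a rough illustration of parallel retrieval for the queries generated at one step, the sketch below fans the queries out over a thread pool. The `retrieve` callable and the function name `retrieve_parallel` are hypothetical; the paper does not describe how parallelism is implemented, so the thread pool is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def retrieve_parallel(queries: List[str],
                      retrieve: Callable[[str], List[str]],
                      top_k: int = 3) -> Dict[str, List[str]]:
    """Issue all queries concurrently and keep the top_k documents per query."""
    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        # pool.map returns results in the same order as the input queries.
        results = list(pool.map(retrieve, queries))
    return {q: docs[:top_k] for q, docs in zip(queries, results)}
```

Because the result is keyed by query string, identical queries collapse into one entry, which is harmless here since they would return the same documents.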

#### Explicit Merging\.

After retrieval, repetitive or irrelevant documents are removed to eliminate redundancy. The agent reads the remaining documents and explicitly places relevant information within <merge\> and </merge\>. The training template is illustrated in Appendix [B](https://arxiv.org/html/2605.13534#A2). For more analysis of different integration approaches, please refer to Appendix [A.4](https://arxiv.org/html/2605.13534#A1.SS4).
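The redundancy-removal step can be approximated by an order-preserving de-duplication over the documents returned by all queries. This sketch only covers repeated documents; how irrelevant documents are filtered is not specified in the text and is not modeled here. The normalization rule (case and whitespace) is an assumption.

```python
from typing import Iterable, List


def deduplicate(docs: Iterable[str]) -> List[str]:
    """Drop repeated documents while preserving first-seen order."""
    seen = set()
    unique = []
    for doc in docs:
        key = " ".join(doc.lower().split())  # normalize case and whitespace
        if key and key not in seen:          # skip empty and repeated entries
            seen.add(key)
            unique.append(doc)
    return unique
```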

### 2.2 Reward Modeling

The reward system of MultiSearch consists of three components: \(1\) Answer Reward, which evaluates the accuracy of the final prediction\. \(2\) Multi\-Query Reward, which encourages generating multiple queries for retrieval\. \(3\) Merging Reward, which assesses the quality of the information consolidation\.

#### Answer reward\.

We compute a word-level F1 score between the predicted answer enclosed in <answer\>…</answer\> and the ground-truth answer to measure the correctness of the agent’s prediction. The answer reward is defined as:

$$
r_{\text{ans}} = \text{F1}(a_{\text{pred}}, a) = \frac{2\, n_{\text{int}}}{n_{\text{pred}} + n_{\text{truth}}} \tag{1}
$$

where $n_{\text{pred}}$ and $n_{\text{truth}}$ are the word counts of the predicted answer and the ground-truth answer, respectively, and $n_{\text{int}}$ is the word count of their intersection.
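A minimal sketch of the word-level F1 in Eq. (1) follows. Lowercasing, whitespace tokenization, and counting the intersection as a multiset are assumptions; the text does not state the exact tokenization.

```python
from collections import Counter


def f1_reward(pred: str, truth: str) -> float:
    """Word-level F1 between a predicted and a ground-truth answer (Eq. 1)."""
    pred_tokens = pred.lower().split()
    truth_tokens = truth.lower().split()
    if not pred_tokens or not truth_tokens:
        return 0.0
    # Multiset intersection: each shared word counts as often as it co-occurs.
    n_int = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    return 2 * n_int / (len(pred_tokens) + len(truth_tokens))
```

For the prediction "Barack Obama" against the ground truth "Barack Hussein Obama", $n_{\text{int}} = 2$, $n_{\text{pred}} = 2$, and $n_{\text{truth}} = 3$, giving $2 \cdot 2 / 5 = 0.8$.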

#### Multi\-Query Reward\.

We introduce a dedicated reward for the proposed multi-query retrieval mechanism. Specifically, we extract all queries within the <search\>…</search\> blocks throughout the rollout and calculate the average number of queries generated per step:

$$
r_{\text{query}} =
\begin{cases}
0.1, & n_{\text{q}} > 2 \\
0, & \text{otherwise}
\end{cases} \tag{2}
$$

where $n_{\text{q}}$ is the average number of queries per step.
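Eq. (2) can be sketched as follows. The newline-separated query format inside each search block is an assumption made for illustration; only the threshold $n_{\text{q}} > 2$ and the value 0.1 come from the text.

```python
import re

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def multi_query_reward(trajectory: str) -> float:
    """Return 0.1 if the average number of queries per search step exceeds 2."""
    blocks = SEARCH_RE.findall(trajectory)
    if not blocks:
        return 0.0
    # Assumption: one query per non-empty line inside a <search> block.
    counts = [len([q for q in block.split("\n") if q.strip()]) for block in blocks]
    n_q = sum(counts) / len(counts)
    return 0.1 if n_q > 2 else 0.0
```

Since the threshold is strict, a rollout that averages exactly two queries per step earns no multi-query reward.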

#### Merging reward\.

The merging step is designed to remove irrelevant information from the retrieved documents and to consolidate key evidence. To evaluate the quality of this integration, we aggregate all text enclosed within <merge\></merge\> blocks, and verify whether the ground-truth answer appears in any of them:

$$
r_{\text{merge}} =
\begin{cases}
0.1, & \exists\, \mathcal{M}_i \in \{\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_n\},\ \mathcal{M}_i \cap a = a \\
0, & \text{otherwise}
\end{cases} \tag{3}
$$

where $\{\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_n\}$ denotes the $n$ merging steps within a single rollout. The multi-query reward and merging reward are applied only when the answer is correct. In particular, the final rewards are defined as follows:

$$
(\mathcal{R}_{\text{ans}}, \mathcal{R}_{\text{query}}, \mathcal{R}_{\text{merge}}) =
\begin{cases}
(r_{\text{ans}}, r_{\text{query}}, r_{\text{merge}}), & r_{\text{ans}} > 0 \\
(0, 0, 0), & \text{otherwise}
\end{cases} \tag{4}
$$
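The merging reward of Eq. (3) and the gating of Eq. (4) might look like the sketch below. Interpreting $\mathcal{M}_i \cap a = a$ as a case-insensitive substring check is an assumption, and the function names are hypothetical.

```python
import re
from typing import Tuple

MERGE_RE = re.compile(r"<merge>(.*?)</merge>", re.DOTALL)


def merge_reward(trajectory: str, truth: str) -> float:
    """Return 0.1 if any <merge> block contains the ground-truth answer (Eq. 3)."""
    target = truth.lower().strip()
    if not target:
        return 0.0
    merged_blocks = MERGE_RE.findall(trajectory)
    # Assumption: containment is tested as a case-insensitive substring match.
    return 0.1 if any(target in block.lower() for block in merged_blocks) else 0.0


def gate_rewards(r_ans: float, r_query: float,
                 r_merge: float) -> Tuple[float, float, float]:
    """Keep the process rewards only when the answer reward is positive (Eq. 4)."""
    if r_ans > 0:
        return (r_ans, r_query, r_merge)
    return (0.0, 0.0, 0.0)
```

Under this gating, a rollout cannot collect the multi-query or merging reward while scoring zero on the answer.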

### 2.3 Reinforcement Learning

Group reward-Decoupled Normalization Policy Optimization (GDPO)\[[17](https://arxiv.org/html/2605.13534#bib.bib17)\] was proposed to address a key limitation of Group Relative Policy Optimization (GRPO)\[[27](https://arxiv.org/html/2605.13534#bib.bib27)\] in multi-reward RL. Unlike GRPO, which directly sums all reward components into a single rollout reward, GDPO normalizes each reward independently within the group. The resulting advantages are then aggregated and subjected to batch-wise normalization. This decoupled design preserves the distinct contributions of individual rewards, enabling the model to receive more fine-grained and informative advantage signals.

We employ GDPO as the learning algorithm for RL training. Specifically, for each input question, given a policy model $\pi_{\theta}$ and a reference model $\pi_{\text{ref}}$, GDPO samples a group of rollouts $\{o_i\}_{i=1}^{G}$. The agent is optimized by maximizing the following objective:

$$
\begin{aligned}
\mathcal{J}_{\text{GDPO}}(\theta) = \; & \mathbb{E}_{x \sim \mathcal{D},\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x; \mathcal{E})} \Bigg[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min \Bigg( \frac{\pi_{\theta}(o_{i,t} \mid x, o_{i,<t}; \mathcal{E})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid x, o_{i,<t}; \mathcal{E})} \hat{A}_{i,t}, \\
& \text{clip}\left( \frac{\pi_{\theta}(o_{i,t} \mid x, o_{i,<t}; \mathcal{E})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid x, o_{i,<t}; \mathcal{E})},\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}_{i,t} \Bigg) - \beta\, \mathbb{D}_{\text{KL}}\left[ \pi_{\theta} \,\|\, \pi_{\text{ref}} \right] \Bigg]
\end{aligned} \tag{5}
$$
where $G$ denotes the group size, $\mathcal{E}$ represents the search engine, $\epsilon$ is the clipping ratio, and $\beta$ is the coefficient of the KL divergence. $\hat{A}_{i,t}$ is the batch-wise normalized advantage for the $i$-th rollout in the current group. Written with an explicit index over the batch, where $i$ indexes the questions in $D_{\text{Batch}}$ and $j$ indexes the $G$ rollouts of each question, it is defined as:

$$
\hat{A}_{i,j,t} = \frac{A_{i,j,t} - \mathrm{mean}\{A_{i',j',t} \mid i' \in D_{\text{Batch}},\, j' = 1, \ldots, G\}}{\mathrm{std}\{A_{i',j',t} \mid i' \in D_{\text{Batch}},\, j' = 1, \ldots, G\}} \tag{6}
$$

where

$$
A_{i,j,t} = \sum_{k \in \{\text{ans}, \text{query}, \text{merge}\}} w_k A^{k}_{i,j,t}, \quad \text{with} \quad A^{k}_{i,j,t} = \frac{r^{k}_{i,j,t} - \mathrm{mean}(r^{k}_{t})}{\mathrm{std}(r^{k}_{t})} \tag{7}
$$

represents the weighted sum of the normalized advantages of the different reward components, with $w_k$ the weight of component $k$. Figure [2](https://arxiv.org/html/2605.13534#S2.F2)(b) illustrates an overview of our GDPO-based training framework.
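The two-stage normalization in Eqs. (6) and (7) can be sketched in plain Python. This is a sketch under stated assumptions rather than the reference GDPO code: the nested-list layout `rewards[k][i][j]` (reward component, question, rollout), the use of the population standard deviation, the stabilizer `EPS`, and treating the advantage as a single rollout-level value shared by all tokens are all assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List

EPS = 1e-6  # assumed stabilizer to avoid division by zero; not given in the text


def normalize(values: List[float]) -> List[float]:
    """Standardize a list of values to zero mean and (roughly) unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def gdpo_advantages(rewards: Dict[str, List[List[float]]],
                    weights: Dict[str, float]) -> List[List[float]]:
    """Compute rollout-level advantages following Eqs. (6)-(7).

    rewards[k][i][j] is reward component k for rollout j of question i.
    """
    names = list(rewards)
    n_questions = len(rewards[names[0]])
    combined = []
    for i in range(n_questions):
        total = [0.0] * len(rewards[names[0]][i])
        for k in names:
            adv_k = normalize(rewards[k][i])  # Eq. (7): per reward, per group
            total = [t + weights[k] * a for t, a in zip(total, adv_k)]
        combined.append(total)
    # Eq. (6): normalize the aggregated advantages across the whole batch.
    flat = iter(normalize([a for group in combined for a in group]))
    return [[next(flat) for _ in group] for group in combined]
```

Normalizing each component before the weighted sum keeps a low-magnitude reward (0.1 for queries or merging) from being swamped by the answer reward, and vice versa.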

## 3 Experiment

In this section, we conduct a series of experiments to address the following research questions: (1) RQ1: How does MultiSearch perform on question-answering tasks? (2) RQ2: What are the contributions of the multi-query retrieval and explicit merging mechanisms? (3) RQ3: Can the different query generation strategies effectively guide the search agent? (4) RQ4: How sensitive is MultiSearch to the number of retrieval queries and the retrieval depth?

### 3.1 Settings

#### Datasets & Evaluation Metrics\.

We evaluate MultiSearch on three general Q&A datasets: NQ\[[18](https://arxiv.org/html/2605.13534#bib.bib18)\], TriviaQA\[[20](https://arxiv.org/html/2605.13534#bib.bib20)\], PopQA\[[19](https://arxiv.org/html/2605.13534#bib.bib19)\], and four multi-hop reasoning Q&A datasets: HotpotQA\[[21](https://arxiv.org/html/2605.13534#bib.bib21)\], 2WikiMultiHopQA (2Wiki)\[[22](https://arxiv.org/html/2605.13534#bib.bib22)\], Musique\[[23](https://arxiv.org/html/2605.13534#bib.bib23)\], and Bamboogle\[[24](https://arxiv.org/html/2605.13534#bib.bib24)\]. For a fair comparison, we follow the settings of baseline methods\[[13](https://arxiv.org/html/2605.13534#bib.bib13),[16](https://arxiv.org/html/2605.13534#bib.bib16)\], training the agent on a combined dataset of NQ and HotpotQA, and adopting Exact Match (EM) as the evaluation metric. For ablations on different evaluation metrics and training datasets, please refer to Appendix [A.1](https://arxiv.org/html/2605.13534#A1.SS1) and [A.2](https://arxiv.org/html/2605.13534#A1.SS2).
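For readers unfamiliar with the metric, Exact Match is commonly computed after light answer normalization. The sketch below uses the widely used lowercase, punctuation-stripping, and article-removal convention; whether the paper applies exactly this normalization is an assumption.

```python
import re
import string
from typing import Iterable


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, golds: Iterable[str]) -> float:
    """Return 1.0 if the prediction matches any gold answer after normalization."""
    target = normalize_answer(pred)
    return float(any(target == normalize_answer(g) for g in golds))
```

Unlike the word-level F1 used as the training reward, EM gives no partial credit: "Paris, France" scores 0 against the gold answer "Paris".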

#### Baselines\.

To evaluate the effectiveness of our MultiSearch framework, we consider diverse baselines grouped into three categories: (1) methods without retrieval: direct inference with an LLM, Chain of Thought (CoT) reasoning\[[28](https://arxiv.org/html/2605.13534#bib.bib28)\], Supervised Fine-Tuning (SFT), and R1-like fine-tuning\[[29](https://arxiv.org/html/2605.13534#bib.bib29)\]. (2) methods with single-turn static retrieval: naive Retrieval-Augmented Generation (RAG)\[[5](https://arxiv.org/html/2605.13534#bib.bib5)\]. (3) methods with multi-turn dynamic retrieval: Search-o1\[[11](https://arxiv.org/html/2605.13534#bib.bib11)\], Interleaving Retrieval with Chain-of-Thought reasoning (IRCoT)\[[30](https://arxiv.org/html/2605.13534#bib.bib30)\], strong prior agentic RL methods including ReSearch\[[15](https://arxiv.org/html/2605.13534#bib.bib15)\], Search-R1\[[13](https://arxiv.org/html/2605.13534#bib.bib13)\], and AutoRefine\[[16](https://arxiv.org/html/2605.13534#bib.bib16)\], as well as the more recent Dr.Zero\[[31](https://arxiv.org/html/2605.13534#bib.bib31)\], AdaSearch\[[32](https://arxiv.org/html/2605.13534#bib.bib32)\], and CriticSearch\[[33](https://arxiv.org/html/2605.13534#bib.bib33)\].

Table 1: (RQ1) Main results. Bold denotes the best results, and underline indicates the second best. NQ, TriviaQA, and PopQA are single-hop QA datasets; HotpotQA, 2Wiki, Musique, and Bamboogle are multi-hop QA datasets.

| Methods | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | Musique | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-Base/Instruct | | | | | | | | |
| Direct Generation | 0.106 | 0.288 | 0.108 | 0.149 | 0.244 | 0.020 | 0.024 | 0.134 |
| CoT | 0.023 | 0.032 | 0.005 | 0.021 | 0.021 | 0.002 | 0.000 | 0.015 |
| IRCoT | 0.111 | 0.312 | 0.200 | 0.164 | 0.171 | 0.067 | 0.240 | 0.181 |
| SFT | 0.249 | 0.292 | 0.104 | 0.186 | 0.248 | 0.044 | 0.112 | 0.176 |
| R1-Instruct | 0.210 | 0.449 | 0.171 | 0.208 | 0.275 | 0.060 | 0.192 | 0.224 |
| R1-Base | 0.226 | 0.455 | 0.173 | 0.201 | 0.268 | 0.055 | 0.224 | 0.229 |
| RAG | 0.348 | 0.544 | 0.387 | 0.255 | 0.226 | 0.047 | 0.080 | 0.270 |
| Search-o1 | 0.238 | 0.472 | 0.262 | 0.221 | 0.218 | 0.054 | 0.320 | 0.255 |
| AdaSearch | 0.379 | 0.568 | 0.428 | 0.334 | 0.385 | 0.144 | 0.280 | 0.360 |
| CriticSearch | - | - | - | 0.414 | 0.409 | 0.180 | 0.368 | - |
| Dr.Zero | 0.391 | 0.572 | 0.431 | 0.298 | 0.291 | 0.091 | 0.200 | 0.326 |
| ReSearch-Instruct | 0.365 | 0.571 | 0.395 | 0.351 | 0.272 | 0.095 | 0.266 | 0.331 |
| ReSearch-Base | 0.427 | 0.597 | 0.430 | 0.305 | 0.272 | 0.074 | 0.128 | 0.319 |
| Search-R1-Instruct | 0.397 | 0.565 | 0.391 | 0.331 | 0.310 | 0.124 | 0.232 | 0.336 |
| Search-R1-Base | 0.421 | 0.583 | 0.413 | 0.297 | 0.274 | 0.066 | 0.128 | 0.312 |
| ZeroSearch-Instruct | 0.402 | 0.580 | 0.460 | 0.228 | 0.214 | 0.104 | 0.181 | 0.310 |
| ZeroSearch-Base | 0.434 | 0.638 | 0.484 | 0.322 | 0.356 | 0.138 | 0.153 | 0.361 |
| StepSearch-Instruct | - | - | - | 0.345 | 0.320 | 0.174 | 0.344 | - |
| StepSearch-Base | - | - | - | 0.329 | 0.339 | 0.181 | 0.328 | - |
| AutoRefine-Instruct | 0.436 | 0.597 | 0.447 | 0.404 | 0.380 | 0.169 | 0.336 | 0.396 |
| AutoRefine-Base | 0.467 | 0.620 | 0.450 | 0.405 | 0.393 | 0.157 | 0.344 | 0.405 |
| MultiSearch-Instruct | 0.470 | 0.615 | 0.444 | 0.420 | 0.412 | 0.183 | 0.371 | 0.416 |
| MultiSearch-Base | 0.471 | 0.630 | 0.455 | 0.431 | 0.413 | 0.163 | 0.390 | 0.422 |
| Qwen2.5-7B-Base/Instruct | | | | | | | | |
| Direct Inference | 0.134 | 0.408 | 0.140 | 0.183 | 0.250 | 0.031 | 0.120 | 0.181 |
| CoT | 0.048 | 0.185 | 0.054 | 0.092 | 0.111 | 0.022 | 0.232 | 0.106 |
| IRCoT | 0.224 | 0.478 | 0.301 | 0.133 | 0.149 | 0.072 | 0.224 | 0.239 |
| SFT | 0.318 | 0.354 | 0.121 | 0.217 | 0.259 | 0.066 | 0.112 | 0.207 |
| R1-Instruct | 0.270 | 0.537 | 0.199 | 0.237 | 0.292 | 0.072 | 0.293 | 0.271 |
| R1-Base | 0.297 | 0.539 | 0.202 | 0.242 | 0.273 | 0.083 | 0.296 | 0.276 |
| RAG | 0.349 | 0.585 | 0.392 | 0.299 | 0.235 | 0.058 | 0.208 | 0.304 |
| Search-o1 | 0.151 | 0.443 | 0.131 | 0.187 | 0.176 | 0.058 | 0.296 | 0.206 |
| ReasonRAG | - | - | 0.415 | 0.384 | 0.436 | 0.128 | 0.360 | - |
| CriticSearch | - | - | - | 0.442 | 0.428 | 0.194 | 0.472 | - |
| Dr.Zero | 0.406 | 0.608 | 0.416 | 0.362 | 0.347 | 0.104 | 0.360 | 0.372 |
| Search-R1-Instruct | 0.429 | 0.623 | 0.427 | 0.386 | 0.346 | 0.162 | 0.400 | 0.396 |
| Search-R1-Base | 0.395 | 0.560 | 0.388 | 0.326 | 0.297 | 0.125 | 0.360 | 0.350 |
| ZeroSearch-Instruct | 0.414 | 0.574 | 0.448 | 0.274 | 0.300 | 0.098 | 0.111 | 0.317 |
| ZeroSearch-Base | 0.430 | 0.616 | 0.414 | 0.338 | 0.346 | 0.130 | 0.139 | 0.345 |
| StepSearch-Instruct | - | - | - | 0.386 | 0.366 | 0.226 | 0.312 | - |
| StepSearch-Base | - | - | - | 0.380 | 0.385 | 0.216 | 0.467 | - |
| AutoRefine-Instruct | 0.419 | 0.618 | 0.403 | 0.408 | 0.325 | 0.188 | 0.416 | 0.397 |
| AutoRefine-Base | 0.470 | 0.649 | 0.463 | 0.445 | 0.369 | 0.204 | 0.435 | 0.434 |
| MultiSearch-Instruct | 0.472 | 0.636 | 0.437 | 0.421 | 0.351 | 0.171 | 0.463 | 0.422 |
| MultiSearch-Base | 0.491 | 0.657 | 0.458 | 0.446 | 0.416 | 0.170 | 0.476 | 0.445 |

#### Implementation details\.

In line with previous work, we use Qwen2.5-3B-Base/Instruct and Qwen2.5-7B-Base/Instruct as the backbone models. For retrieval, we use E5\[[34](https://arxiv.org/html/2605.13534#bib.bib34)\] as the search engine and the 2018 Wikipedia dump\[[35](https://arxiv.org/html/2605.13534#bib.bib35)\] as the external data source. Both the number of queries and the number of retrieved documents are set to 3. The reported results are averaged over three runs, and we omit the variance as it is relatively small. Most of the baseline results are taken directly from the corresponding original papers or from the Search-R1 paper under comparable settings, while AutoRefine is reproduced using its open-source code. Additional details of the experimental settings can be found in Appendix [C](https://arxiv.org/html/2605.13534#A3).

### 3.2 Main Results (RQ1)

Overall Performance. The performance of MultiSearch compared to baseline methods is presented in Table [1](https://arxiv.org/html/2605.13534#S3.T1). As shown, MultiSearch achieves the highest average accuracy across seven benchmarks (column Avg.) for both 3B and 7B model sizes. The gains are particularly notable on multi-hop benchmarks, where multi-query retrieval expands the scope of relevant information to cover multiple entities, relations, or sub-questions, and explicit merging consolidates the retrieved information for subsequent reasoning. Additionally, base model variants outperform their instruction-tuned counterparts. As discussed in Appendix [A.5](https://arxiv.org/html/2605.13534#A1.SS5), one possible explanation is that instruction fine-tuning may reduce the model’s adaptability to new multi-step reasoning tasks.

Retrieval Quality and Reasoning Steps. Figure 3(a) first examines the SNR of intermediate retrieved information. MultiSearch produces higher-SNR <information\> blocks than Search-R1, and the explicit <merge\> step further improves the SNR. This indicates that multi-query retrieval broadens useful information coverage, while merging helps reduce noisy content and forms a more reliable context for subsequent reasoning. This higher-SNR information helps reduce the need for follow-up searches to compensate for incomplete or noisy retrieval, consistent with the fewer search steps observed in Figure 3(b).

![Refer to caption](https://arxiv.org/html/2605.13534v1/x3.png)\(a\)
![Refer to caption](https://arxiv.org/html/2605.13534v1/x4.png)\(b\)

Figure 3: Comparison of the retrieval quality and efficiency between Search-R1 and MultiSearch. (a) Signal-to-noise ratio of <information\> and <merge\> blocks. (b) Average number of search steps across different datasets.

![Refer to caption](https://arxiv.org/html/2605.13534v1/x5.png)

Figure 4: Distribution of various query generation strategies: question decomposition (Q), concept expansion (C), rephrasing (R), and others (O).

### 3.3 Ablation on Key Components (RQ2)

To assess the contributions of key components in MultiSearch, we perform ablation studies on Qwen2.5-3B-Instruct, with results reported in Table [2](https://arxiv.org/html/2605.13534#S3.T2). Specifically, we define five variants: (1) the full MultiSearch, (2) MultiSearch without the merging reward (w/o $\mathcal{R}_{\text{merge}}$), (3) MultiSearch without both the merging reward and the multi-query reward (w/o $\mathcal{R}_{\text{merge}}$ & $\mathcal{R}_{\text{query}}$), (4) MultiSearch further excluding explicit merging (w/o $\mathcal{R}_{\text{merge}}$ & $\mathcal{R}_{\text{query}}$ & <merge\>), and (5) MultiSearch reduced to single-query retrieval with only the answer reward (w/o all). The results demonstrate that multi-query retrieval, the merging step, and our reward modeling all contribute positively to the overall performance. As shown in Figure [7](https://arxiv.org/html/2605.13534#S3.F7), the merging reward $\mathcal{R}_{\text{merge}}$ and the multi-query reward $\mathcal{R}_{\text{query}}$ gradually converge with the answer reward $\mathcal{R}_{\text{ans}}$ during training. This observation suggests that the model progressively learns to generate multiple queries and consolidate key evidence from the retrieved documents to enhance its question-answering performance.

Table 2: (RQ2) Ablation on multi-query retrieval, merging operation, and corresponding rewards.

### 3.4 Ablation on query generation strategies (RQ3)

To address RQ3, we compare the performance under different query generation strategies. In Table [3](https://arxiv.org/html/2605.13534#S3.T3), “w/ Rephrase”, “w/ Concept”, and “w/ Decompose” represent scenarios where the agent generates multiple queries under a single strategy only: rephrasing, concept expansion, or question decomposition, respectively. “Simple” refers to a setting where only multi-query generation is required, without any strategies as guidance. As illustrated in Table [3](https://arxiv.org/html/2605.13534#S3.T3), the agent guided by all strategies together outperforms the one guided by a single strategy. This suggests that the multi-perspective strategies provide complementary benefits: rephrasing helps reduce lexical mismatch, concept expansion broadens the search scope, and question decomposition helps handle complex information needs. Figure [4](https://arxiv.org/html/2605.13534#S3.F4) further shows that the trained agent uses a mixture of strategies across datasets, supporting the need for multi-perspective query generation in retrieval-during-reasoning.

Table 3: (RQ3) Ablation study on different query generation strategies.

### 3.5 Sensitivity Analysis (RQ4)

Different hyperparameters may affect the performance of deep search agents. We investigate the sensitivity of MultiSearch to the number of queries $n_{\text{q}}$ and the retrieval depth $k$. The results are presented in Figures 5 and 6, where “single-hop”, “multi-hop”, and “avg” denote the average accuracy on single-hop benchmarks, multi-hop benchmarks, and all seven benchmarks, respectively. As shown in Figure 5, performance peaks when $n_{\text{q}} = 3$. Using too few queries fails to cover sufficient information, whereas too many queries may lead to oversaturation, introducing repetitive or irrelevant information. As illustrated in Figure 6, performance improves as $k$ increases when $k \leq 3$, but begins to decline when $k \geq 4$, as the search engine starts returning less relevant documents.

![Refer to caption](https://arxiv.org/html/2605.13534v1/x6.png)

Figure 5: \(RQ4\) Performance of MultiSearch under different $n_{\text{q}}$, demonstrating the sensitivity of MultiSearch to the number of retrieval queries\.

![Refer to caption](https://arxiv.org/html/2605.13534v1/x7.png)

Figure 6: \(RQ4\) Performance of MultiSearch under different top\-$k$, demonstrating the sensitivity of MultiSearch to the retrieval depth\.
### 3\.6Different RL Methods

We further evaluate MultiSearch using GRPO and GDPO as RL algorithms\. The evaluation results are presented in Table [4](https://arxiv.org/html/2605.13534#S3.T4), and the training dynamics are shown in Figure [7](https://arxiv.org/html/2605.13534#S3.F7)\. As illustrated in the figure, both methods successfully optimize the three different objectives, guiding the agent to follow the multi\-query retrieval mechanism and merge key evidence during training\. However, compared with GDPO, GRPO exhibits less balanced reward optimization\. The multi\-query reward rises quickly in the early stage \(*cf\.* Figure [7](https://arxiv.org/html/2605.13534#S3.F7)\(b\)\), while the answer and merging rewards improve more slowly \(*cf\.* Figure [7](https://arxiv.org/html/2605.13534#S3.F7)\(a\)\(c\)\)\. Since the multi\-query reward is easier to satisfy than rewards related to answer correctness or information consolidation, it may dominate the aggregated reward signal under GRPO\. GDPO mitigates this issue by normalizing reward components separately, resulting in more robust optimization and better performance, as shown in Table [4](https://arxiv.org/html/2605.13534#S3.T4)\.
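
To make the difference between the two normalization schemes concrete, the following is a minimal sketch, not the training code used in this work\. It assumes that the per\-rollout reward is a plain sum of components, uses made\-up reward values for a group of five rollouts, and omits everything else in both algorithms \(clipping, KL regularization, and any further normalization\)\. The function names are hypothetical\.

```python
from statistics import fmean, pstdev


def zscore(values, eps=1e-6):
    """Normalize a group of rewards to zero mean and unit (population) std."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def summed_then_normalized(components):
    """GRPO-style (simplified): sum the components per rollout, then normalize once."""
    totals = [sum(rs) for rs in zip(*components)]
    return zscore(totals)


def normalized_then_summed(components):
    """GDPO-style (simplified): normalize each component separately, then sum."""
    per_component = [zscore(c) for c in components]
    return [sum(zs) for zs in zip(*per_component)]


# Hypothetical rewards for a group of 5 rollouts (illustrative values only).
r_query = [0.0, 1.0, 1.0, 0.0, 1.0]  # easy-to-satisfy, high-variance reward
r_ans = [0.2, 0.0, 0.1, 0.3, 0.0]    # harder, low-variance answer reward

adv_summed = summed_then_normalized([r_query, r_ans])
adv_decoupled = normalized_then_summed([r_query, r_ans])

# Rollout 3 has the best answer reward but no multi-query reward.
print(round(adv_summed[3], 3), round(adv_decoupled[3], 3))
```

In this toy group, rollout 3 has the best answer reward but no multi\-query reward\. Normalizing the summed reward assigns it a negative advantage because the higher\-variance multi\-query component dominates, whereas normalizing each component separately assigns it a positive one\.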

![Refer to caption](https://arxiv.org/html/2605.13534v1/x8.png)

Figure 7: Analysis of training rewards under GRPO and GDPO\.

Table 4: Comparison between GRPO and GDPO on different backbone models\. NQ, TriviaQA, and PopQA are single\-hop QA benchmarks; HotpotQA, 2Wiki, Musique, and Bamboogle are multi\-hop QA benchmarks\.

| Methods | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | Musique | Bamboogle | Avg\. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Qwen2\.5\-3B\-Base/Instruct* | | | | | | | | |
| MultiSearch\-Base \(GDPO\) | 0\.471 | 0\.630 | 0\.455 | 0\.431 | 0\.413 | 0\.163 | 0\.390 | 0\.422 |
| MultiSearch\-Instruct \(GDPO\) | 0\.470 | 0\.615 | 0\.444 | 0\.420 | 0\.412 | 0\.183 | 0\.371 | 0\.416 |
| MultiSearch\-Base \(GRPO\) | 0\.450 | 0\.606 | 0\.435 | 0\.399 | 0\.393 | 0\.165 | 0\.372 | 0\.403 |
| MultiSearch\-Instruct \(GRPO\) | 0\.426 | 0\.605 | 0\.424 | 0\.380 | 0\.366 | 0\.140 | 0\.344 | 0\.384 |
| *Qwen2\.5\-7B\-Base/Instruct* | | | | | | | | |
| MultiSearch\-Base \(GDPO\) | 0\.491 | 0\.657 | 0\.458 | 0\.446 | 0\.416 | 0\.170 | 0\.476 | 0\.445 |
| MultiSearch\-Instruct \(GDPO\) | 0\.472 | 0\.636 | 0\.437 | 0\.421 | 0\.351 | 0\.171 | 0\.463 | 0\.422 |
| MultiSearch\-Base \(GRPO\) | 0\.458 | 0\.617 | 0\.432 | 0\.389 | 0\.343 | 0\.163 | 0\.419 | 0\.403 |
| MultiSearch\-Instruct \(GRPO\) | 0\.418 | 0\.637 | 0\.424 | 0\.403 | 0\.350 | 0\.158 | 0\.415 | 0\.400 |
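
As a sanity check on how the averages are defined, the short sketch below recomputes them for the MultiSearch\-Base \(GDPO\) row on Qwen2\.5\-3B in Table 4\. It suggests that “Avg\.” is an unweighted mean over the seven datasets rather than the mean of the single\-hop and multi\-hop averages\. The variable names are ours, and computing the single\-hop and multi\-hop averages in the same unweighted way is an assumption\.

```python
# Accuracy of MultiSearch-Base (GDPO) on Qwen2.5-3B, copied from Table 4.
single_hop = {"NQ": 0.471, "TriviaQA": 0.630, "PopQA": 0.455}
multi_hop = {"HotpotQA": 0.431, "2Wiki": 0.413, "Musique": 0.163, "Bamboogle": 0.390}


def mean(values):
    """Unweighted arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)


s_avg = mean(single_hop.values())  # single-hop average (assumed unweighted)
m_avg = mean(multi_hop.values())   # multi-hop average (assumed unweighted)
avg = mean(list(single_hop.values()) + list(multi_hop.values()))

print(f"{s_avg:.3f} {m_avg:.3f} {avg:.3f}")  # prints "0.519 0.349 0.422"
```

The last value matches the 0\.422 reported in the “Avg\.” column, whereas averaging the two group means would give 0\.434\.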

## 4Related Work

#### Deep Search Agents\.

LLMs have demonstrated strong reasoning capabilities \[[28](https://arxiv.org/html/2605.13534#bib.bib28),[29](https://arxiv.org/html/2605.13534#bib.bib29),[36](https://arxiv.org/html/2605.13534#bib.bib36),[37](https://arxiv.org/html/2605.13534#bib.bib37),[38](https://arxiv.org/html/2605.13534#bib.bib38)\], but still suffer from hallucinations and limited knowledge \[[2](https://arxiv.org/html/2605.13534#bib.bib2)\]\. Retrieval\-augmented generation \(RAG\) alleviates these issues by incorporating external knowledge \[[4](https://arxiv.org/html/2605.13534#bib.bib4),[5](https://arxiv.org/html/2605.13534#bib.bib5)\]\. However, naive single\-turn RAG models struggle with complex questions, as they lack mechanisms to assess evidence sufficiency or perform iterative retrieval\. Prompt\-based multi\-turn methods partially address this limitation, yet remain suboptimal in optimizing tool\-call decisions \[[10](https://arxiv.org/html/2605.13534#bib.bib10),[30](https://arxiv.org/html/2605.13534#bib.bib30)\]\. Supervised fine\-tuning \(SFT\) effectively improves tool usage, but typically requires large\-scale labeled data, which is costly to construct and maintain \[[39](https://arxiv.org/html/2605.13534#bib.bib39),[40](https://arxiv.org/html/2605.13534#bib.bib40)\]\. Recent work has explored reinforcement learning for deep search agents\. Search\-R1 models the search engine as part of the environment and uses outcome\-level rewards to guide search behavior \[[13](https://arxiv.org/html/2605.13534#bib.bib13)\]\. R1\-Searcher introduces a two\-stage training framework that separately improves search\-call formatting and answer accuracy \[[12](https://arxiv.org/html/2605.13534#bib.bib12)\]\. Subsequent methods, such as AutoRefine and EviNote\-RAG, further enhance retrieval\-augmented reasoning with post\-retrieval processing or task\-specific rewards \[[16](https://arxiv.org/html/2605.13534#bib.bib16),[41](https://arxiv.org/html/2605.13534#bib.bib41)\]\. Other methods investigate principle\-based reward models to provide supervision for intermediate steps \[[42](https://arxiv.org/html/2605.13534#bib.bib42),[43](https://arxiv.org/html/2605.13534#bib.bib43)\]\. However, most previous methods rely on single\-query retrieval, limiting the utility of each retrieval step \[[14](https://arxiv.org/html/2605.13534#bib.bib14),[31](https://arxiv.org/html/2605.13534#bib.bib31),[44](https://arxiv.org/html/2605.13534#bib.bib44)\]\. In contrast, our MultiSearch leverages parallel multi\-query retrieval and explicit merging to improve the quality of the search process\.

#### Reinforcement Learning\.

Reinforcement Learning from Human Feedback \(RLHF\) represents an early approach that transitions models from purely imitative behavior to strategic exploration \[[45](https://arxiv.org/html/2605.13534#bib.bib45),[46](https://arxiv.org/html/2605.13534#bib.bib46)\]\. Proximal Policy Optimization \(PPO\) \[[47](https://arxiv.org/html/2605.13534#bib.bib47)\] is widely used due to its stable learning dynamics, but it can incur high computational costs\. To mitigate this, several simplified variants such as Direct Preference Optimization \(DPO\) \[[48](https://arxiv.org/html/2605.13534#bib.bib48)\] and Group Relative Policy Optimization \(GRPO\) \[[27](https://arxiv.org/html/2605.13534#bib.bib27)\] have been proposed\. While early RL methods primarily target preference alignment, recent works explore RL in retrieval\-augmented reasoning settings\. Both PPO and GRPO have demonstrated effectiveness in these scenarios\. Considering both computational efficiency and performance, we adopt Group reward\-Decoupled Normalization Policy Optimization \(GDPO\) \[[17](https://arxiv.org/html/2605.13534#bib.bib17)\], an improved variant of GRPO specifically designed to handle multi\-reward objectives\.

## 5Conclusion

In this work, we introduce MultiSearch, an RL\-based framework for improving retrieval\-during\-reasoning in deep search agents\. Instead of relying on a single query at each reasoning step, MultiSearch generates queries from multiple perspectives and retrieves external information in parallel, expanding the scope of relevant information and reducing reliance on any single retrieval result\. The retrieved information is then explicitly merged, allowing the agent to consolidate useful information and reduce noise before subsequent reasoning\. To optimize these intermediate behaviors, we further propose a multi\-process reward design that supervises both multi\-query retrieval and information consolidation\. Experiments on seven question\-answering benchmarks show that MultiSearch improves the SNR of retrieved information and achieves better reasoning performance than strong baseline methods\. These results suggest that improving both retrieval coverage and information consolidation is important for building more effective deep search agents\.

## References

- Chang et al\. \[2024\]Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al\.A survey on evaluation of large language models\.*ACM transactions on intelligent systems and technology*, 15\(3\):1–45, 2024\.
- Zhang et al\. \[2025a\]Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al\.Siren’s song in the ai ocean: A survey on hallucination in large language models\.*Computational Linguistics*, 51\(4\):1373–1418, 2025a\.
- Peng et al\. \[2023\]Cheng Peng, Xi Yang, Aokun Chen, Kaleb E Smith, Nima PourNejatian, Anthony B Costa, Cheryl Martin, Mona G Flores, Ying Zhang, Tanja Magoc, et al\.A study of generative large language model for medical research and healthcare\.*NPJ digital medicine*, 6\(1\):210, 2023\.
- Gao et al\. \[2023\]Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang\.Retrieval\-augmented generation for large language models: A survey\.*CoRR*, abs/2312\.10997, 2023\.
- Lewis et al\. \[2020\]Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen\-tau Yih, Tim Rocktäschel, et al\.Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.*Advances in neural information processing systems*, 33:9459–9474, 2020\.
- Huang et al\. \[2025\]Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al\.Deep research agents: A systematic examination and roadmap\.*arXiv preprint arXiv:2506\.18096*, 2025\.
- Xi et al\. \[2025\]Yunjia Xi, Jianghao Lin, Yongzhao Xiao, Zheli Zhou, Rong Shan, Te Gao, Jiachen Zhu, Weiwen Liu, Yong Yu, and Weinan Zhang\.A survey of llm\-based deep search agents: Paradigm, optimization, evaluation, and challenges\.*arXiv preprint arXiv:2508\.05668*, 2025\.
- Li et al\. \[2025a\]Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji\-Rong Wen, Yutao Zhu, and Zhicheng Dou\.Webthinker: Empowering large reasoning models with deep research capability\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025a\.
- Team et al\. \[2025\]Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al\.Tongyi deepresearch technical report\.*arXiv preprint arXiv:2510\.24701*, 2025\.
- Yao et al\. \[2022\]Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao\.React: Synergizing reasoning and acting in language models\.In*The eleventh international conference on learning representations*, 2022\.
- Li et al\. \[2025b\]Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou\.Search\-o1: Agentic search\-enhanced large reasoning models\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 5420–5438, 2025b\.
- Song et al\. \[2025\]Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji\-Rong Wen\.R1\-searcher: Incentivizing the search capability in llms via reinforcement learning\.*arXiv preprint arXiv:2503\.05592*, 2025\.
- Jin et al\. \[2025\]Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han\.Search\-r1: Training LLMs to reason and leverage search engines with reinforcement learning\.In*Second Conference on Language Modeling*, 2025\.
- Sun et al\. \[2025\]Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou\.Zerosearch: Incentivize the search capability of llms without searching\.*arXiv preprint arXiv:2505\.04588*, 2025\.
- Chen et al\. \[2025\]Mingyang Chen, Linzhuang Sun, Tianpeng Li, sunhaoze, ZhouYijie, Chenzheng Zhu, Haofen Wang, Jeff Z\. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen\.Research: Learning to reason with search for LLMs via reinforcement learning\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.
- \[16\]Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, and Xiang Wang\.Search and refine during think: Facilitating knowledge refinement for improved retrieval\-augmented reasoning\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*\.
- Liu et al\. \[2026\]Shih\-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min\-Hung Chen, Hongxu Yin, Yu\-Chiang Frank Wang, Kwang\-Ting Cheng, et al\.Gdpo: Group reward\-decoupled normalization policy optimization for multi\-reward rl optimization\.*arXiv preprint arXiv:2601\.05242*, 2026\.
- Kwiatkowski et al\. \[2019\]Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al\.Natural questions: a benchmark for question answering research\.*Transactions of the Association for Computational Linguistics*, 7:453–466, 2019\.
- Mallen et al\. \[2023\]Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi\.When not to trust language models: Investigating effectiveness of parametric and non\-parametric memories\.In*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 9802–9822, 2023\.
- Joshi et al\. \[2017\]Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer\.Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension\.In*Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 1601–1611, 2017\.
- Yang et al\. \[2018\]Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning\.Hotpotqa: A dataset for diverse, explainable multi\-hop question answering\.In*Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, pages 2369–2380, 2018\.
- Ho et al\. \[2020\]Xanh Ho, Anh\-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa\.Constructing a multi\-hop qa dataset for comprehensive evaluation of reasoning steps\.In*Proceedings of the 28th International Conference on Computational Linguistics*, pages 6609–6625, 2020\.
- Trivedi et al\. \[2022\]Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal\.Musique: Multihop questions via single\-hop question composition\.*Transactions of the Association for Computational Linguistics*, 10:539–554, 2022\.
- Press et al\. \[2023\]Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis\.Measuring and narrowing the compositionality gap in language models\.In*Findings of the Association for Computational Linguistics: EMNLP 2023*, pages 5687–5711, 2023\.
- Ma et al\. \[2023\]Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan\.Query rewriting in retrieval\-augmented large language models\.In Houda Bouamor, Juan Pino, and Kalika Bali, editors,*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 5303–5315, December 2023\.
- \[26\]Ellen M Voorhees\.Query expansion using lexical\-semantic relations\.In*SIGIR’94: Proceedings of the Seventeenth Annual International ACM\-SIGIR Conference on Research and Development in Information Retrieval, organised by Dublin City University*, pages 61–69\. Springer\.
- Shao et al\. \[2024\]Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Jun\-Mei Song, Mingchuan Zhang, Y\. K\. Li, Yu Wu, and Daya Guo\.Deepseekmath: Pushing the limits of mathematical reasoning in open language models\.*ArXiv*, abs/2402\.03300, 2024\.
- Wei et al\. \[2022\]Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H\. Chi, Quoc V\. Le, and Denny Zhou\.Chain\-of\-thought prompting elicits reasoning in large language models\.In*Proceedings of the 36th International Conference on Neural Information Processing Systems*\. Curran Associates Inc\., 2022\.
- Guo et al\. \[2025\]Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al\.Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning\.*arXiv preprint arXiv:2501\.12948*, 2025\.
- Trivedi et al\. \[2023\]Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal\.Interleaving retrieval with chain\-of\-thought reasoning for knowledge\-intensive multi\-step questions\.In*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 10014–10037, Toronto, Canada, July 2023\. Association for Computational Linguistics\.doi:10\.18653/v1/2023\.acl\-long\.557\.
- Yue et al\. \[2026\]Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, and Dong Wang\.Dr\. zero: Self\-evolving search agents without training data\.*CoRR*, abs/2601\.07055, 2026\.doi:10\.48550/ARXIV\.2601\.07055\.
- Lin et al\. \[2025\]Tzu\-Han Lin, Wei\-Lin Chen, Chen\-An Li, Hung\-yi Lee, Yun\-Nung Chen, and Yu Meng\.Adasearch: Balancing parametric knowledge and search in large language models via reinforcement learning\.*CoRR*, abs/2512\.16883, 2025\.doi:10\.48550/ARXIV\.2512\.16883\.
- Zhang et al\. \[2025b\]Yaocheng Zhang, Haohuan Huang, Zijun Song, Yuanheng Zhu, Qichao Zhang, Zijie Zhao, and Dongbin Zhao\.Criticsearch: Fine\-grained credit assignment for search agents via a retrospective critic\.*arXiv preprint arXiv:2511\.12159*, 2025b\.
- Wang et al\. \[2022\]Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei\.Text embeddings by weakly\-supervised contrastive pre\-training\.*ArXiv*, abs/2212\.03533, 2022\.
- Karpukhin et al\. \[2020\]Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen\-tau Yih\.Dense passage retrieval for open\-domain question answering\.In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16\-20, 2020*, pages 6769–6781\. Association for Computational Linguistics, 2020\.doi:10\.18653/V1/2020\.EMNLP\-MAIN\.550\.
- Achiam et al\. \[2023\]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al\.Gpt\-4 technical report\.*arXiv preprint arXiv:2303\.08774*, 2023\.
- Yang et al\. \[2024\]An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al\.Qwen2\.5\-math technical report: Toward mathematical expert model via self\-improvement\.*arXiv preprint arXiv:2409\.12122*, 2024\.
- Hui et al\. \[2024\]Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al\.Qwen2\.5\-coder technical report\.*arXiv preprint arXiv:2409\.12186*, 2024\.
- Schick et al\. \[2023\]Timo Schick, Jane Dwivedi\-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom\.Toolformer: Language models can teach themselves to use tools\.*Advances in neural information processing systems*, 36:68539–68551, 2023\.
- Asai et al\. \[2023\]Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi\.Self\-rag: Learning to retrieve, generate, and critique through self\-reflection\.In*The Twelfth International Conference on Learning Representations*, 2023\.
- Dai et al\. \[2025\]Yuqin Dai, Guoqing Wang, Yuan Wang, Kairan Dou, Kaichen Zhou, Zhanwei Zhang, Shuo Yang, Fei Tang, Jun Yin, Pengyu Zeng, et al\.Evinote\-rag: Enhancing rag models via answer\-supportive evidence notes\.*arXiv preprint arXiv:2509\.00877*, 2025\.
- Xu et al\. \[2025\]Peiran Xu, Zhuohao Li, Xiaoying Xing, Guannan Zhang, Debiao Li, and Kunyu Shi\.Hybrid reward normalization for process\-supervised non\-verifiable agentic tasks\.*arXiv preprint arXiv:2509\.25598*, 2025\.
- Zhang et al\. \[2026\]Wenlin Zhang, Xiangyang Li, Kuicai Dong, Yichao Wang, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Derong Xu, Zhaocheng Du, Huifeng Guo, Ruiming Tang, and Xiangyu Zhao\.Process vs\. outcome reward: Which is better for agentic RAG reinforcement learning\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2026\.
- He et al\. \[2026\]Bowei He, Minda Hu, Zenan Xu, Hongru Wang, Licheng Zong, Yankai Chen, Chen Ma, Xue Liu, Pluto Zhou, and Irwin King\.Search\-r2: Enhancing search\-integrated reasoning via actor\-refiner collaboration\.*arXiv preprint arXiv:2602\.03647*, 2026\.
- Ouyang et al\. \[2022\]Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L\. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F\. Christiano, Jan Leike, and Ryan Lowe\.Training language models to follow instructions with human feedback\.In*Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 \- December 9, 2022*, 2022\.
- Kaelbling et al\. \[1996\]Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore\.Reinforcement learning: A survey\.*Journal of artificial intelligence research*, 4:237–285, 1996\.
- Schulman et al\. \[2017\]John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov\.Proximal policy optimization algorithms\.*arXiv preprint arXiv:1707\.06347*, 2017\.
- Rafailov et al\. \[2023\]Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn\.Direct preference optimization: Your language model is secretly a reward model\.*Advances in neural information processing systems*, 36:53728–53741, 2023\.

## Appendix AMore Experimental Results

### A\.1Ablation of Evaluation Metrics

To provide a more comprehensive evaluation, we further assess MultiSearch\-3B\-Base using additional metrics, including F1 score \(F1\) and Coverage Exact Match \(CEM\)\. As shown in Table [5](https://arxiv.org/html/2605.13534#A1.T5), MultiSearch consistently outperforms the baseline methods Search\-R1 and AutoRefine across all evaluated metrics, verifying the effectiveness of our framework\.

Table 5:Ablation of Evaluation Metrics\.
### A\.2Ablation of Training Datasets

MultiSearch is trained on a combined dataset of single\-hop NQ and multi\-hop HotpotQA, following prior work settings\. To examine the effectiveness of this design, we further conduct experiments by training on single\-hop data only \(NQ\) and multi\-hop data only \(HotpotQA\), respectively\. The results are reported in Table [6](https://arxiv.org/html/2605.13534#A1.T6)\. S\-avg and M\-avg denote the average accuracy over the three single\-hop datasets and the four multi\-hop datasets, respectively\. We observe that the joint training setting achieves the best overall performance\. In contrast, training solely on NQ leads to degraded performance on multi\-hop tasks, while training solely on HotpotQA results in lower performance on single\-hop tasks\. These results suggest that joint training on both types of data provides a more balanced learning signal across different reasoning settings\.

Table 6:Comparison between different types of training datasets\.
### A\.3Impact of Different Reward Modeling Methods

As described in Section [2\.2](https://arxiv.org/html/2605.13534#S2.SS2), we adopt a conditional aggregation scheme for the answer reward $r_{\text{ans}}$ and the step\-specific rewards $r_{\text{query}}$ and $r_{\text{merge}}$, which are calculated over all the <search\> and <merge\> blocks in the rollout\. To investigate the effect of reward design, we conduct additional experiments with three variants: \(1\) Unconditional: $r_{\text{query}}$ and $r_{\text{merge}}$ are applied regardless of whether the final answer is correct\. \(2\) Turn\-Level: $r_{\text{query}}$ and $r_{\text{merge}}$ are calculated at each turn, and the final reward is averaged over all turns\. \(3\) EM: $r_{\text{ans}}$ is calculated using Exact Match instead of the F1 score\. Results in Table [7](https://arxiv.org/html/2605.13534#A1.T7) show that our reward modeling method achieves higher overall accuracy\. This suggests that associating process rewards with the final outcome may help better align intermediate behaviors with task success, and that F1\-based $r_{\text{ans}}$ could provide more informative feedback than EM in this setting\.
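
The sketch below illustrates two of the design choices compared here: F1 versus Exact Match for $r_{\text{ans}}$, and conditional versus unconditional use of the process rewards\. It is an illustration under assumptions, not the reward defined in Section 2\.2: the tokenization is simplified, and both the gating rule \(process rewards count only when $r_{\text{ans}} > 0$\) and the plain sum are hypothetical stand\-ins for the actual aggregation\.

```python
from collections import Counter


def tokens(text):
    """Lowercase and split on whitespace (a simplified normalization)."""
    return text.lower().split()


def exact_match(pred, gold):
    """1.0 only if the normalized token sequences are identical."""
    return 1.0 if tokens(pred) == tokens(gold) else 0.0


def f1_score(pred, gold):
    """Token-level F1 between a prediction and a reference answer."""
    p, g = tokens(pred), tokens(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def rollout_reward(r_ans, r_query, r_merge, conditional=True):
    """Hypothetical aggregation: process rewards count only if r_ans > 0."""
    if conditional and r_ans <= 0.0:
        return r_ans
    return r_ans + r_query + r_merge


em = exact_match("Barack Obama", "Barack Hussein Obama")
f1 = f1_score("Barack Obama", "Barack Hussein Obama")
print(em, round(f1, 3))
```

For the prediction “Barack Obama” against the reference “Barack Hussein Obama”, Exact Match gives 0 while F1 gives 0\.8, which is the kind of partial credit that may make the F1\-based signal more informative\.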

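Table 7: Comparison between different reward designs\.

| Variants | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | Musique | Bamboogle | Avg\. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Our Method | 0\.470 | 0\.615 | 0\.444 | 0\.420 | 0\.412 | 0\.183 | 0\.371 | 0\.416 |
| *Conditional vs\. Unconditional* | | | | | | | | |
| Unconditional | 0\.427 | 0\.593 | 0\.418 | 0\.352 | 0\.343 | 0\.113 | 0\.303 | 0\.364 |
| *Rollout\-Level vs\. Turn\-Level* | | | | | | | | |
| Turn\-Level | 0\.400 | 0\.602 | 0\.396 | 0\.399 | 0\.411 | 0\.150 | 0\.330 | 0\.384 |
| *F1 vs\. EM* | | | | | | | | |
| EM | 0\.461 | 0\.610 | 0\.433 | 0\.405 | 0\.403 | 0\.165 | 0\.341 | 0\.403 |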

### A\.4Impact of Different Merging Modules

As described in Section [2\.1](https://arxiv.org/html/2605.13534#S2.SS1), the merging step is carried out by the search agent itself\. We further analyze the effect of different merging strategies, including: \(1\) selecting the most relevant passages using TF\-IDF \(TF\-IDF Screening\), and \(2\) using an external LLM to merge retrieved information \(LLM Extractor\)\. To control for model size, we employ Qwen2\.5\-3B\-Instruct as the external extractor\. As shown in Table [8](https://arxiv.org/html/2605.13534#A1.T8), TF\-IDF\-based selection yields relatively poor results, suggesting that textual similarity to the question does not necessarily indicate usefulness\. The variant using an external 3B LLM also achieves lower final performance than MultiSearch, likely because the merging step is not integrated into the model’s optimization process\. These observations highlight the potential benefits of our agent\-driven merging paradigm\.
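
One way to see why lexical screening can miss useful evidence is the toy example below\. It is a minimal sketch of TF\-IDF screening under our own assumptions \(whitespace tokenization, a smoothed IDF, and a hypothetical `tfidf_screen` helper\), not the implementation used for Table 8\.

```python
import math
from collections import Counter


def tfidf_screen(question, passages, top_n=1):
    """Rank passages by the summed TF-IDF weight of the question terms they contain."""
    docs = [Counter(p.lower().split()) for p in passages]
    n_docs = len(docs)
    q_terms = set(question.lower().split())

    def idf(term):
        # Smoothed inverse document frequency over the candidate passages.
        df = sum(1 for d in docs if term in d)
        return math.log((1 + n_docs) / (1 + df)) + 1.0

    scores = []
    for d in docs:
        length = sum(d.values()) or 1
        scores.append(sum((d[t] / length) * idf(t) for t in q_terms))
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in ranked[:top_n]]


passages = [
    "hamlet hamlet hamlet the prince of denmark",
    "the tragedy was authored by william shakespeare",
]
selected = tfidf_screen("who wrote hamlet", passages, top_n=1)
print(selected[0])
```

Here the first passage is selected because it repeats a question term, even though only the second passage names the author\.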

Table 8:Comparison between different merging modules\.
### A\.5Comparison Between Base & Instruct models

We analyze the performance differences between the base and instruction\-tuned models\. As shown in Figure [8](https://arxiv.org/html/2605.13534#A1.F8), the instruction\-tuned model outperforms the base model in the early stages, but the base model gradually catches up, ultimately reaching comparable or even superior performance\. This suggests that while instruction\-tuning improves the model’s ability to follow human instructions, its prior knowledge may limit its generalization to multi\-step reasoning tasks\.

![Refer to caption](https://arxiv.org/html/2605.13534v1/x9.png)

Figure 8: Training dynamics of base and instruct models\.
### A\.6Impact of Different Retrievers

We compare two retriever variants, E5\-base\-v2 and BM25, with results reported in Table [9](https://arxiv.org/html/2605.13534#A1.T9)\. As shown, E5\-base\-v2 achieves better overall performance, suggesting that dense retrieval may provide more relevant evidence than lexical matching in our setting\.

Table 9:Comparison between different retrievers\.

## Appendix BPrompts

### B\.1Training Prompt for Search Agents

Search Agent Prompt

Role\. You are a helpful assistant\. Answer the given question with multi\-turn search engine calling\. You can search as many times as necessary\.

Input\. Question: \{question\}\.

Instructions\. Reason through the available information using <think\> and </think\>\. Issue a search request using <search\> $q_{1}, q_{2}, \ldots, q_{n}$ </search\> when missing knowledge\. The retrieved documents will be placed in <information\> and </information\>\. Generate three diverse search queries each time, applying one or more of these strategies: rephrasing, concept expansion, and question decomposition\. Extract and integrate key information from the retrieved documents in <merge\> and </merge\> after each search\.

Output\. Return a concise final answer inside <answer\> and </answer\>, without detailed illustrations\.
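
A rollout that follows this prompt can be split into its tagged blocks with a few lines of code\. The sketch below is illustrative only: the helper names and the sample rollout are hypothetical, and splitting the `<search>` block on commas is an assumption about how the queries are delimited \(it would break on queries that themselves contain commas\)\.

```python
import re

# Matches the agent-generated blocks; <information> is filled in by the environment.
TAG = re.compile(r"<(think|search|merge|answer)>(.*?)</\1>", re.DOTALL)


def parse_rollout(text):
    """Return (tag, content) pairs in order of appearance."""
    return [(m.group(1), m.group(2).strip()) for m in TAG.finditer(text)]


def split_queries(search_block, sep=","):
    """Split a <search> block into queries; the separator is an assumption."""
    return [q.strip() for q in search_block.split(sep) if q.strip()]


# Hypothetical rollout text, written only to exercise the parser.
sample = (
    "<think>I need to find the author first.</think>"
    "<search>who wrote Hamlet, Hamlet playwright, author of the play Hamlet</search>"
    "<information>retrieved documents go here</information>"
    "<merge>Hamlet was written by William Shakespeare.</merge>"
    "<answer>William Shakespeare</answer>"
)

blocks = parse_rollout(sample)
queries = split_queries(blocks[1][1])
print([tag for tag, _ in blocks], len(queries))
```

Counting the parsed `<search>` and `<merge>` blocks in this way is one simple route to checking whether a rollout follows the required format\.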

### B\.2Prompt for LLM Extractor

LLM Extractor Prompt

Role\. You are a document integration assistant\.

Input\. Question: \{question\}\. Documents: \{documents\}\.

Instructions\. Given a question and some documents, extract and integrate key information that is useful for the question\.

Output\. Output only the integrated summary, without any extra commentary\.

### B\.3Prompt for SNR Evaluation

SNR Evaluation Prompt

Role\. You are an evaluator of information quality\. Your task is to assess the Signal\-to\-Noise Ratio of a given piece of content, based on a question and supporting documents\.

Input\. Question: \{question\}\. Supporting Documents: \{documents\}\. Content to be evaluated: \{content\}\.

Instructions\. "Signal" represents information that is directly relevant to answering the question\. "Noise" represents information that is irrelevant to the question, redundant, contradictory to the supporting documents, or hallucinated\. Read the question and the supporting documents carefully, analyze the content, and determine which parts are "signal" \(useful, relevant, supported\) and which parts are "noise" \(useless, irrelevant, unsupported, or repetitive\)\.

Output\. Provide a SNR between 0 and 1\.

## Appendix CMore Experimental Details

### C\.1Training Details

MultiSearch is trained using the AdamW optimizer with a learning rate of $1\times 10^{-6}$ and a constant learning rate schedule without warmup\. Weight decay is set to 0\.01, and gradients are clipped with a maximum norm of 1\.0\. Training is conducted for 200 optimization steps using Fully Sharded Data Parallel \(FSDP\) without offloading parameters, gradients, or optimizer states\. We use 4 NVIDIA A100 GPUs for 3B models, while larger configurations are trained on 8 NVIDIA A100 GPUs\.

During training, rollout generation is performed using vLLM \([https://github\.com/vllm\-project/vllm](https://github.com/vllm-project/vllm)\) with a sampling temperature of 1\.0 and top\-p set to 0\.95\. We adopt GDPO for policy optimization with a group size of 5\. At each interaction step, the agent can issue up to 3 search queries, and the retrieval system returns the top 3 documents for each query\. Key hyperparameters are summarized in Table [10](https://arxiv.org/html/2605.13534#A3.T10)\.
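
The per\-step retrieval described above can be sketched as follows\. This is a toy illustration rather than the actual pipeline: the corpus, the word\-overlap `retrieve` function, and the de\-duplication of documents returned by several queries are all assumptions made for the example\.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy corpus; a stand-in for the real retrieval index.
CORPUS = {
    "d1": "hamlet is a tragedy written by william shakespeare",
    "d2": "william shakespeare was born in stratford upon avon",
    "d3": "the globe theatre staged many plays by shakespeare",
    "d4": "denmark is a country in northern europe",
    "d5": "macbeth is another tragedy by shakespeare",
}


def retrieve(query, k=3):
    """Hypothetical retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(
        CORPUS.items(),
        key=lambda kv: len(q & set(kv[1].split())),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


def multi_search(queries, k=3):
    """Run all queries in parallel, then merge results and drop duplicates."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        results = list(pool.map(lambda q: retrieve(q, k), queries))
    seen, merged = set(), []
    for docs in results:
        for doc_id in docs:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


queries = ["who wrote hamlet", "hamlet tragedy author", "shakespeare plays"]
merged = multi_search(queries, k=3)
print(merged)
```

With 3 queries and the top 3 documents per query, at most 9 documents enter the context at each step, and at most 36 per rollout given the limit of 4 search turns in Table 10\.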

Increasing the number of queries or retrieved documents per step may lead to longer rollouts due to the additional reasoning required\. However, the resulting increase in generation time is relatively moderate\. Moreover, because MultiSearch requires fewer reasoning steps, as shown in Figure [3b](https://arxiv.org/html/2605.13534#S3.F3.sf2), the overall runtime remains comparable despite the longer per\-step reasoning time\. Given the performance gains observed in Section [3\.5](https://arxiv.org/html/2605.13534#S3.SS5), we consider this trade\-off acceptable for the present experiments\.

Table 10: Hyperparameters used by MultiSearch\.

| Hyperparameter | Value |
| --- | --- |
| Learning Rate | $1\times 10^{-6}$ |
| Training Batch Size | 512 |
| Validation Batch Size | 512 |
| Max Response Length | 2048 |
| Micro Training Batch Size | 32 |
| Group Size $G$ | 5 |
| KL Coefficient $\beta$ | 0\.001 |
| Max Search Turns | 4 |
| Clip Ratio $\epsilon$ | 0\.2 |
| Total Training Steps | 300 |

### C\.2Dataset Details

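Table 11: Statistics of the seven datasets\. NQ, TriviaQA, and PopQA are single\-hop QA datasets; HotpotQA, 2Wiki, Musique, and Bamboogle are multi\-hop QA datasets\.

| Split | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | Musique | Bamboogle |
| --- | --- | --- | --- | --- | --- | --- | --- |
| \# Train | 79168 | 78785 | \- | 90447 | 15,000 | 19,938 | \- |
| \# Dev | 8757 | 8837 | \- | 7405 | 12576 | 2417 | \- |
| \# Test | 3610 | 11313 | 14267 | \- | \- | \- | 125 |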

## Appendix DCase Studies

Here, we present a case study comparing Search\-R1 and MultiSearch in Table [12](https://arxiv.org/html/2605.13534#A4.T12)\. The example illustrates that MultiSearch retrieves more relevant information through multi\-query retrieval and further improves the SNR of the intermediate context through explicit merging\.

Table 12: Comparison between Search\-R1 and MultiSearch\. The predictions are colored Red for incorrect and Green for correct answers\. Yellow segments denote the core evidence\.
## Appendix ELimitations

Despite the promising results, our work has several limitations:

#### Limited Task Generalization\.

The experiments are conducted exclusively on question\-answering datasets following prior work, without other types of tasks or domains\. As a result, the generality of MultiSearch beyond QA settings remains to be explored\.

#### Predefined Retrieval Corpus\.

The method relies on a static retrieval corpus for all experiments\. Its performance in fully online search environments, such as live web search engines, has not been tested, which may limit its applicability in dynamic information retrieval scenarios\.

#### Fixed Query Generation Strategies\.

While we explore multiple strategies for generating queries, the underlying framework still operates within a relatively fixed set of strategies\. This may restrict flexibility in more open\-ended or unstructured reasoning tasks\.

## Appendix F Broader Impacts

MultiSearch explores a parallel search and explicit merging framework\. By encouraging the agent to maintain multiple candidate search directions, it can help broaden the scope of information acquisition during the reasoning process, particularly in scenarios where relevant evidence is distributed across multiple sources\. This suggests a possible direction for dynamically balancing search breadth and reasoning depth depending on task difficulty\. Furthermore, by making the merging process explicit, MultiSearch introduces a more structured interface between retrieval and reasoning\. This “explore\-then\-merge” pattern encourages the agent to organize retrieved evidence prior to final inference, contributing to a clearer understanding of how retrieved passages are integrated\. In this sense, the framework could facilitate more detailed post\-hoc analysis of evidence usage, although we do not explicitly evaluate interpretability in this work\.