
# AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
Source: [https://arxiv.org/html/2605.05538](https://arxiv.org/html/2605.05538)
Susheel Suresh, Hazel Mak∗, Shangpo Chou, Fred Kroon, Sahil Bhatnagar

Microsoft Corporation

###### Abstract

We present AgenticRAG, a practical agentic harness for retrieval and analysis over enterprise knowledge bases. Standard RAG pipelines place a significant burden of grounding on the search stack, constraining the language model to a fixed candidate set chosen deep in the retrieval process. Our approach reduces this overdependence by layering a lightweight harness on top of existing enterprise search infrastructure, equipping a reasoning LLM with search, find, open, and summarize tools that enable the model to iteratively retrieve information, navigate within documents, and analyze evidence autonomously. On three open benchmarks we observe substantial gains: 49.6% recall@1 on BRIGHT (+21.8 pp over the best embedding baseline), 0.96 factuality on WixQA (+13% relative improvement), and 92% answer correctness on FinanceBench, within 2 pp of oracle access to true evidence. Ablation studies show that the most significant factor is the shift from single-shot retrieval to agentic tool use (5.9× improvement), while multi-query search and in-document navigation contribute to both quality and efficiency. We present various design choices in our agentic harness that were informed by pre-production deployments. Our results demonstrate its suitability for real-world enterprise production environments.


## 1 Introduction

Standard retrieval-augmented generation (RAG) pipelines follow a static retrieve-then-generate paradigm Lewis et al. ([2020](https://arxiv.org/html/2605.05538#bib.bib20)). In this design, the search stack effectively determines the final candidate set the large language model (LLM) will see, and the model's reasoning is constrained to that set. Modern enterprise-grade search stacks are highly optimized for scalability, latency, and multi-stage ranking pipelines built on inverted indexes, probabilistic retrieval, and learned ranking models Liu et al. ([2009](https://arxiv.org/html/2605.05538#bib.bib5)); Nogueira and Cho ([2019](https://arxiv.org/html/2605.05538#bib.bib6)); Thakur et al. ([2021](https://arxiv.org/html/2605.05538#bib.bib7)). These systems excel at keyword and short semantic queries and are strong for high-recall candidate generation. However, they are not designed to resolve situational, multi-document, or analytically complex information needs: the kinds of queries knowledge workers issue against dense corpora such as technical manuals, compliance documents, and financial reports.

Real-world RAG systems [AzureAISearch](https://arxiv.org/html/2605.05538#bib.bib1) attempt to compensate for these limitations through retrieval enhancement techniques such as HyDE Gao et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib2)), multi-query reformulation Wang et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib4)), and adaptive or iterative retrieval strategies Trivedi et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib3)); Jeong et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib34)). While these methods provide robustness to query phrasing and higher retrieval coverage, they largely preserve the same architectural assumption: retrieval decisions are finalized before substantive reasoning begins. The LLM still operates over a fixed candidate set selected deep in the search stack, without the ability to iteratively navigate documents, synthesize evidence across sources, or reassess results from a higher-level vantage point.

Recent advances in reasoning-capable language models have demonstrated strong performance on planning and iterative external tool use Yao et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib39)); Schick et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib40)). Rather than hard-coding retrieval steps, we can empower the model itself to drive the process: deciding what to search for, which documents warrant deeper investigation, and when sufficient evidence has been gathered. This relaxes the pressure on the search stack: it only needs to achieve good recall, while the model handles the final precision from its broader context. We present AgenticRAG, a practical harness that equips a reasoning LLM with four tools (search, find, open, and summarize) layered on top of existing enterprise search infrastructure. The search tool delegates to the underlying search stack for broad candidate discovery, while find and open serve as precision instruments that let the model drill into candidate documents via in-document search and full-content retrieval (with rolling window access). To manage the growing context during long reasoning chains, the harness monitors token usage and triggers the summarize tool when a threshold is reached, allowing the model to consolidate its findings while preserving key references. Our contribution is system-level: a lightweight inference-time tool harness that requires no model fine-tuning, custom embedding model, graph construction, or corpus-specific preprocessing beyond indexing documents into the existing enterprise search backend.

We evaluate on three benchmarks spanning retrieval, enterprise QA, and financial document reasoning. Our approach achieves 49.6% recall@1 on BRIGHT (+21.8 pp over the best embedding baseline), 0.96 factuality on WixQA (+13% relative), and 92.00% answer correctness on FinanceBench, within 2 pp of oracle access. Our method is deployed for pre-production evaluation, and learnings from these deployments directly inform our design choices. We provide detailed ablations analyzing the contribution of each tool, the effect of multi-query search, and model-level differences in retrieval strategy.

## 2 Related Work

Retrieval-Augmented Generation (RAG) grounds LLM generation in external corpora to mitigate parametric memory limitations Lewis et al. ([2020](https://arxiv.org/html/2605.05538#bib.bib20)); Guu et al. ([2020](https://arxiv.org/html/2605.05538#bib.bib21)). Early approaches focused on identifying relevant documents using sparse or dense vector retrieval Khattab and Zaharia ([2020](https://arxiv.org/html/2605.05538#bib.bib23)); Izacard and Grave ([2021](https://arxiv.org/html/2605.05538#bib.bib22)) to enhance performance on knowledge-intensive NLP tasks. As context windows expanded, research shifted toward scaling retrieval to trillions of tokens Borgeaud et al. ([2022](https://arxiv.org/html/2605.05538#bib.bib24)) and optimizing in-context learning Ram et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib25)); Shi et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib26)). Despite these advancements, standard RAG pipelines often struggle with "long-tail" knowledge and can suffer from hallucinations when retrieval fails Mallen et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib28)); Gao et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib29)). Furthermore, static "retrieve-then-generate" paradigms lack the flexibility to handle complex, multi-hop queries that require iterative information gathering Jiang et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib27)); Press et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib41)).

To address the brittleness of static pipelines, the field has evolved toward *Agentic* patterns, where autonomous agents (LLMs) dynamically orchestrate the retrieval process Singh et al. ([2025](https://arxiv.org/html/2605.05538#bib.bib13)); Oche et al. ([2025](https://arxiv.org/html/2605.05538#bib.bib14)). Foundational work in agentic behaviors, such as ReAct Yao et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib39)) and Toolformer Schick et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib40)), demonstrated that LLMs could effectively wield external tools to solve reasoning problems. This paradigm has been formalized in systems like Self-RAG Asai et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib30)) and Corrective RAG Yan et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib31)), which employ self-reflection mechanisms to critique retrieved content and trigger fallbacks (e.g., web search) when necessary. Recent approaches propose to integrate retrieval into planning: PlanRAG Lee et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib32)) and Search-o1 Li et al. ([2025](https://arxiv.org/html/2605.05538#bib.bib18)) separate high-level planning from low-level execution, allowing agents to decompose complex queries into sub-tasks. Similarly, Search-R1 Jin et al. ([2025](https://arxiv.org/html/2605.05538#bib.bib19)) uses reinforcement learning to train LLMs for autonomous search decisions. While effective, many of these systems are designed for open-domain search or require fine-tuning, reinforcement learning, or dedicated retrieval policies, which makes them less directly applicable to proprietary enterprise corpora that cannot be exported for training. They can also incur high latency and token costs due to recursive reasoning loops Trivedi et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib3)).

Another critical limitation in standard RAG is the "flattening" of documents into disjointed chunks, which discards valuable structural priors like headings and document boundaries. RAPTOR Sarthi et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib16)) addresses this by recursively clustering and summarizing text chunks into a tree structure, enabling retrieval at varying levels of abstraction. Similarly, HiQA Chen et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib33)) constructs multi-document hierarchical contexts. Graph RAG approaches Edge et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib35)); Scaffidi et al. ([2025](https://arxiv.org/html/2605.05538#bib.bib36)) seek to build knowledge graphs from documents to support query-focused summarization. While powerful for unifying knowledge Pan et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib37)); Wang et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib38)), graph construction is often computationally prohibitive for dynamic enterprise environments. In contrast, our AgenticRAG harness is an inference-time system that leverages a reasoning model with a "search" tool (using a fast enterprise-grade search stack) alongside "find" and "open" tools for deeper information gathering and reasoning. This positions the contribution as a deployable system integration for enterprise file systems: it works with existing search infrastructure, preserves document access controls, and avoids extensive pre-computation or retraining.

## 3 Method

### 3.1 System Overview

We present an agentic RAG system for enterprise document search and question answering over large file systems. Unlike traditional single-pass RAG pipelines, our system employs an iterative reasoning loop where a large language model (LLM) autonomously decides when to search for documents, drill into specific passages, and retrieve full content before producing a final answer.

The system addresses several challenges in enterprise RAG: (1) multi-step reasoning: complex queries require information from multiple documents; (2) context window constraints: accumulated retrieval results must fit within LLM limits; (3) grounded responses: answers must include traceable citations to source documents; and (4) multi-turn efficiency: follow-up queries should reuse previously retrieved content rather than re-executing redundant searches. Our architecture supports multiple model families and reuses existing search infrastructure for the backend implementation of the retrieval tools. The harness is lightweight in that it consists of four tools and requires no model fine-tuning, no graph construction, and no custom embedding index beyond the enterprise search stack already deployed for document discovery. Overall, the system comprises three main components (illustrative tool declarations follow the list):

1. Agentic Loop: orchestrates LLM-tool interactions, bounded by a maximum number of iterations.
2. Retrieval Tools: three tools (search, find, open) provide hierarchical access to enterprise documents. A summarize tool supports context management during long reasoning chains.
3. Conversation State: maintains message history, token accounting, and reference ID mappings that track documents across iterations.
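For concreteness, the following is a minimal sketch of how the four tools might be declared as function-calling schemas for the LLM. The parameter names (`queries`, `ref_id`, `patterns`, `start_line`, `notes`, `keep_refs`) are illustrative assumptions; the paper does not publish its exact schemas.

```python
# Illustrative tool declarations for the agentic harness (assumed parameter
# names and shapes; not the production schemas).
TOOL_SCHEMAS = [
    {
        "name": "search",
        "description": "Enterprise-wide document discovery via the existing search stack; "
                       "accepts up to 5 query reformulations per call.",
        "parameters": {"type": "object",
                       "properties": {"queries": {"type": "array", "items": {"type": "string"}, "maxItems": 5}},
                       "required": ["queries"]},
    },
    {
        "name": "find",
        "description": "In-document keyword search within one document identified by reference ID.",
        "parameters": {"type": "object",
                       "properties": {"ref_id": {"type": "string"},
                                      "patterns": {"type": "array", "items": {"type": "string"}}},
                       "required": ["ref_id", "patterns"]},
    },
    {
        "name": "open",
        "description": "Retrieve a fixed window of lines (default 1,800) from a document.",
        "parameters": {"type": "object",
                       "properties": {"ref_id": {"type": "string"},
                                      "start_line": {"type": "integer", "default": 0}},
                       "required": ["ref_id"]},
    },
    {
        "name": "summarize",
        "description": "Consolidate findings and mark which reference IDs to keep when context is pruned.",
        "parameters": {"type": "object",
                       "properties": {"notes": {"type": "string"},
                                      "keep_refs": {"type": "array", "items": {"type": "string"}}},
                       "required": ["notes", "keep_refs"]},
    },
]
```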

### 3.2 Agentic Loop

The agent processes each query through a bounded iteration loop (Figure [1](https://arxiv.org/html/2605.05538#S3.F1)). In each iteration, the agent receives the current conversation and either selects a tool to call, appending the tool call and its result to the conversation, or returns the final answer with citations.

The loop terminates under two conditions: (1) the model produces a text response, or (2) the iteration count reaches the maximum number of iterations (default: 15). When the maximum is reached without a final answer, the agent issues a forced completion request, requiring the model to respond using available information. If the token budget is exceeded during execution, the agent triggers context management (Sec. [3.4](https://arxiv.org/html/2605.05538#S3.SS4)) to free space and continues the loop. For the detailed algorithm, see Appendix [A.1](https://arxiv.org/html/2605.05538#A1.SS1).
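A minimal Python sketch of this loop, mirroring Algorithm 1 in Appendix A.1, is shown below. The `llm_call`, `execute_tool`, `count_tokens`, `manage_context`, and `force_final_answer` helpers are assumed stand-ins for the production components.

```python
# Minimal sketch of the agentic loop (cf. Algorithm 1); helper callables are
# assumed stand-ins, not the production implementation.
def agentic_loop(user_query, tools, llm_call, execute_tool, count_tokens,
                 manage_context, force_final_answer,
                 max_calls=15, token_threshold=128_000):
    conversation = [{"role": "user", "content": user_query}]
    for _ in range(max_calls):
        if count_tokens(conversation) >= token_threshold:
            manage_context(conversation)          # force a summarize call and prune
        response = llm_call(conversation, tools)  # model may answer or request tools
        if response.get("tool_calls"):
            for tool_call in response["tool_calls"]:
                result = execute_tool(tool_call)  # search / find / open / summarize
                conversation.append({"role": "assistant", "tool_call": tool_call})
                conversation.append({"role": "tool", "content": result})
        else:
            return response["text"]               # final answer with citations
    # Iteration budget exhausted: force the model to answer from what it has.
    return force_final_answer(conversation)
```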

![Agentic loop](https://arxiv.org/html/2605.05538v1/figures/agentic_loop.png)
Figure 1: Agentic loop.
### 3.3 Retrieval Tools

The system provides three retrieval tools enabling hierarchical document exploration (Table [1](https://arxiv.org/html/2605.05538#S3.T1)). The agent decides which to invoke based on its current information needs.

**Search** performs enterprise-wide document discovery by delegating to the existing enterprise search stack. In the default configuration, the model may issue up to five query reformulations in one tool call. The tool returns up to 10 results per query, each containing a snippet, title, filename, file type, and other available metadata. Results from multiple queries are combined and deduplicated. Each result receives a unique reference ID (of the form turn<m>search<n>) using a globally incrementing counter, enabling subsequent find and open operations.
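A minimal sketch of how the search tool might fan out multiple queries to an existing search backend, deduplicate results, and assign reference IDs is shown below; the `backend_search` callable and the exact reference ID format are assumptions for illustration.

```python
import itertools

# Global counter for reference IDs across the conversation (assumed scheme).
_ref_counter = itertools.count()

def search_tool(queries, backend_search, turn, max_queries=5, results_per_query=10):
    """Fan out up to `max_queries` reformulations, merge and deduplicate results,
    and tag each result with a unique reference ID for later find/open calls."""
    results, seen = [], set()
    for query in queries[:max_queries]:
        for hit in backend_search(query, top_k=results_per_query):
            if hit["doc_id"] in seen:          # deduplicate across queries
                continue
            seen.add(hit["doc_id"])
            ref_id = f"turn{turn}_search{next(_ref_counter)}"  # illustrative ID format
            results.append({
                "ref_id": ref_id,
                "title": hit.get("title"),
                "filename": hit.get("filename"),
                "file_type": hit.get("file_type"),
                "snippet": hit.get("snippet"),
            })
    return results
```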

**Find** performs targeted in-document search within a single document identified by its reference ID. Given a list of keyword patterns, lexical matching uses case-insensitive substring matching; an optional semantic find mode can also be enabled. The tool returns up to 2 matching passages per pattern. Results are deduplicated by content and truncated at a bounded token limit (~11k tokens). Find is most useful when the model knows *what* to look for, such as a revenue metric or a named concept inside a long filing.
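A sketch of the lexical find mode under these constraints (case-insensitive substring match, at most two passages per pattern, content-level deduplication, bounded output) might look as follows. Paragraph-based passage segmentation and the word-count proxy for the ~11k-token cap are illustrative assumptions.

```python
def find_tool(document_text, patterns, max_per_pattern=2, token_limit=11_000):
    """Lexical in-document find: case-insensitive substring matching over
    paragraph-sized passages, up to 2 hits per pattern, deduplicated by content
    and truncated to a rough token budget (word count used as a proxy here)."""
    passages = [p.strip() for p in document_text.split("\n\n") if p.strip()]
    matched, seen, budget = [], set(), token_limit
    for pattern in patterns:
        hits = 0
        for passage in passages:
            if pattern.lower() in passage.lower() and passage not in seen:
                cost = len(passage.split())      # crude token estimate
                if cost > budget:
                    return matched               # stop at the token cap
                matched.append({"pattern": pattern, "passage": passage})
                seen.add(passage)                # deduplicate by content
                budget -= cost
                hits += 1
                if hits >= max_per_pattern:
                    break
    return matched
```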

**Open** retrieves full document content in a fixed line window. Each call returns a window of lines (default: 1,800) starting from either the beginning (line 0) or a specific line number chosen by the agent, along with a response header indicating the viewing range and total document length (e.g., "Viewing lines [0–1799] of 3000 lines"). To access more than one portion of a file, the model makes subsequent calls with an explicit line number. This enables navigation through documents exceeding the window size while keeping each response bounded. Open is most useful when the model knows *where* to read, such as the context around a table, section heading, or line-numbered preview. The system prompt guides effective tool usage; see Appendix [A.2](https://arxiv.org/html/2605.05538#A1.SS2) for details.
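The windowed reading behavior can be sketched as follows; returning line-numbered content and the exact header wording are assumptions based on the description above.

```python
def open_tool(document_text, start_line=0, window=1_800):
    """Windowed full-content retrieval: return up to `window` lines starting at
    `start_line`, prefixed with a header describing the viewing range."""
    lines = document_text.splitlines()
    end_line = min(start_line + window, len(lines))
    header = f"Viewing lines [{start_line}-{end_line - 1}] of {len(lines)} lines"
    # Line-numbered output lets the model anchor follow-up open calls precisely.
    body = "\n".join(f"{i}: {line}"
                     for i, line in enumerate(lines[start_line:end_line], start=start_line))
    return header + "\n" + body
```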

Table 1: Retrieval Tool Specifications
### 3.4 Context Management

Since retrieval tools can load ~11k tokens from files on each call, the context window can fill up quickly. To manage this, the harness monitors token usage against a 128K-token threshold: it emits an internal warning when the conversation reaches 90% of the budget and forces summarization at the threshold. The summarize tool lets the model record its current reasoning and designate which references to preserve. The system then scans tool messages and removes content not associated with preserved reference IDs, freeing tokens while retaining cited evidence. This approach extends the effective context capacity. See Figure [2](https://arxiv.org/html/2605.05538#S3.F2) for an example conversation history before and after context management.
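A minimal sketch of this pruning step, assuming tool messages carry the reference IDs they loaded, is shown below; the message structure and the `count_tokens` helper are illustrative.

```python
def manage_context(conversation, keep_refs, count_tokens):
    """Prune tool results not tied to preserved reference IDs. Intended to run
    after a (forced) summarize tool call has recorded `keep_refs`."""
    pruned = []
    for message in conversation:
        if message.get("role") == "tool" and message.get("ref_id") not in keep_refs:
            # Replace bulky tool output with a stub so the reference stays traceable.
            message = {**message,
                       "content": f"[content for {message.get('ref_id')} removed after summarization]"}
        pruned.append(message)
    freed = count_tokens(conversation) - count_tokens(pruned)
    return pruned, freed
```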

![Context management example](https://arxiv.org/html/2605.05538v1/figures/context_management.png)
Figure 2: Example conversation history with context management via a forced summarize tool call.

## 4 Experiment Setup

Our goal is to evaluate AgenticRAG in realistic enterprise settings where knowledge workers issue complex situational queries requiring multi-step reasoning over large corpora of long, domain-specific documents. To this end we adopt the BRIGHT benchmark Su et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib9)), which contains StackExchange questions spanning eight domains. We evaluate on the long-context setting of BRIGHT, where documents correspond to entire web pages rather than snippets and the task is to retrieve the full relevant document(s) for a given query. For our agentic setting, the full BRIGHT web pages are converted to document files and indexed into the same enterprise search backend used by the search tool. Search returns snippet previews with metadata and reference IDs, and the find and open tools then access full document content through those IDs. We adopt the standard protocol of using Recall@1 to measure the relevance of retrieved documents Su et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib9)). We instruct the model in AgenticRAG to provide relevancy scores for the citations it uses when producing the answer, which induces a ranking over documents for evaluation. We also test our method on WixQA Cohen et al. ([2025](https://arxiv.org/html/2605.05538#bib.bib10)), which targets real-world support and troubleshooting enterprise scenarios that require multi-document and multi-step reasoning for procedural answers. We adopt the same LLM-based factuality metric defined in WixQA. Finally, we run evaluations on the popular FinanceBench Islam et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib11)) dataset, which contains financial questions that require deep reasoning over large company financial documents. Our metric here is answer correctness as a proxy for accurate information retrieval, since questions pertain to single documents. Detailed benchmark descriptions, query set and corpus statistics are presented in Appendix [B](https://arxiv.org/html/2605.05538#A2).
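The way model-provided relevancy scores induce a ranking for Recall@1 can be sketched as follows. The citation structure (document ID plus score) is an assumption consistent with the protocol described above; the exact scoring follows the BRIGHT evaluation code.

```python
def recall_at_k(cited, gold_doc_ids, k=1):
    """Rank the documents the agent cited by its self-reported relevancy score,
    then compute recall@k against the gold set. `cited` is a list of
    {"doc_id": ..., "relevancy": ...} dicts (assumed shape)."""
    ranked = sorted(cited, key=lambda c: c["relevancy"], reverse=True)
    top_k = {c["doc_id"] for c in ranked[:k]}
    gold = set(gold_doc_ids)
    return len(top_k & gold) / len(gold)

# Example: averaged over queries, this yields the Recall@1 values in Table 2.
score = recall_at_k(
    [{"doc_id": "doc_42", "relevancy": 0.9}, {"doc_id": "doc_7", "relevancy": 0.4}],
    gold_doc_ids=["doc_42"],
)
```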

## 5 Results

### 5.1 Long-Context Retrieval on BRIGHT

Table 2: Long-context retrieval performance on unsplit web pages of StackExchange data from the BRIGHT benchmark. Scores are reported in recall@1. Best baseline per category shown; full results in Table [9](https://arxiv.org/html/2605.05538#A3.T9).

| Category | Model | Bio. | Earth. | Econ. | Psy. | Rob. | Stack. | Sus. | Pony | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sparse | BM25 | 10.7 | 15.4 | 10.7 | 8.4 | 7.4 | 22.2 | 10.7 | 5.4 | 11.4 |
| Open-source Emb. | Qwen | 39.2 | 36.1 | 25.7 | 42.3 | 21.3 | 23.5 | 33.1 | 1.3 | 27.8 |
| Proprietary Emb. | Voyage | 34.4 | 35.4 | 26.7 | 41.6 | 12.9 | 12.8 | 31.1 | 1.3 | 24.5 |
| Reasoning Enhanced | ReDI | 28.4 | 22.4 | 21.2 | 32.0 | 19.8 | 36.3 | 21.7 | – | 26.0 |
| Ours (AgenticRAG): search, find, open, summ. | GPT-5-mini | 61.7 | 48.1 | 41.4 | 65.3 | 39.4 | 40.6 | 46.6 | 4.8 | 43.5 |
| Ours (AgenticRAG): search, find, open, summ. | Claude Sonnet 4.5 | 62.3 | 60.0 | 58.7 | 67.9 | 55.0 | 34.1 | 51.7 | 7.1 | 49.6 |

Table [2](https://arxiv.org/html/2605.05538#S5.T2) presents retrieval performance on the BRIGHT benchmark, showing the best baseline per category. Full results with all models are in Appendix Table [9](https://arxiv.org/html/2605.05538#A3.T9), including reasoning-enhanced baselines such as LLM re-rankers over BM25/SBERT and ReDI. Our agentic harness equipped with search, find, open, and summarize tools enables both Claude Sonnet 4.5 and GPT-5-mini to achieve the highest recall@1 across all eight benchmark splits compared to all baselines. With our AgenticRAG retrieval harness, Claude Sonnet 4.5 achieves 49.6% average recall@1 (+21.8 pp over Qwen, the best embedding model at 27.8%) and GPT-5-mini reaches 43.5% (+15.7 pp). Gains are consistent across domains, with the largest improvements in Economics (+33.0 pp), Earth Science (+24.6 pp), Robotics (+33.7 pp), and Psychology (+25.6 pp). Even with a vast corpus of 5.65K long documents averaging 16K tokens each, our method scales by leveraging traditional retrieval via the search tool, deeper reasoning enabled by the open/find tools, and effective context window management by the summarize tool. The best-performing reasoning-enhanced baseline, ReDI, uses a fine-tuned Qwen3-8B decomposition and retrieval-fusion model and achieves 26.0% recall@1 in the BRIGHT long-document setting. Our method outperforms it by +23.6 pp (via Claude) and +17.5 pp (via GPT-5-mini). Static approaches such as one-time query rewriting or LLM-based re-ranking cannot match the iterative reasoning that our harness provides, and the gap is evident across all splits.

### 5.2 Enterprise QA on WixQA

Figure [3](https://arxiv.org/html/2605.05538#S5.F3) presents factuality results on the WixQA benchmark, which requires multi-document analysis to answer enterprise support questions. Semantic embeddings alone fail to capture the cross-document reasoning needed for these queries, whereas the iterative search and reasoning enabled by our harness excels. On the *Expert Written* split, our method with GPT-5-mini achieves a factuality score of 0.96, compared to 0.85 for E5 retrieval and 0.83 for BM25, a 13% relative improvement over the best baseline. We observe similar gains on the *Simulated* split; see Appendix [C.3](https://arxiv.org/html/2605.05538#A3.SS3) for details.

![Factuality on WixQA Expert Written](https://arxiv.org/html/2605.05538v1/x1.png)
Figure 3: Factuality performance on the WixQA Expert Written dataset (N=200). Our agentic approach (red) substantially outperforms BM25 (blue) and E5 (green) retrieval baselines across all generation models.
### 5.3 Financial Document QA on FinanceBench

Table 3: Evaluation results on FinanceBench (N=150).

Table [3](https://arxiv.org/html/2605.05538#S5.T3) presents answer correctness on the FinanceBench dataset, which evaluates question answering over real-world financial filings. The retrieval corpus consists of 84 long financial documents averaging ~116K tokens each (~140 pages per PDF). Our agentic approach with GPT-5-mini achieves 92.00% correctness, substantially outperforming traditional RAG (by 3.8×) and the agentic tool-use baseline of Subramanian et al. ([2026](https://arxiv.org/html/2605.05538#bib.bib12)) (by 2.8×), while being more general than the latter, which relies on keyword search tools like pdfgrep, rga, and Linux commands. We also adopt a baseline where the ground-truth full-page evidence is provided directly to GPT-5-mini, bypassing agentic retrieval entirely. This oracle setting achieves 94.00% and establishes an upper bound on the model's reasoning ability given perfect evidence. Our agentic system is within 2 pp of this upper bound, which demonstrates its effectiveness. Our harness is equally effective with GPT-5-mini and Claude Sonnet 4.5.

### 5.4 Token Cost and Retrieval Efficiency

Table 4: Total token usage comparison between AgenticRAG and Single-shot Search across BRIGHT splits and FinanceBench. All values are averages per query (in thousands) and include system prompt, tool calls, tool results, and any thinking tokens. Cost ratio is AgenticRAG total tokens divided by Single-shot Search total tokens.

Table [4](https://arxiv.org/html/2605.05538#S5.T4) quantifies the end-to-end token cost of agentic retrieval. We measure total tokens consumed across the full interaction, including model thinking, tool-call arguments, retrieved tool results, and final answer generation. On BRIGHT, AgenticRAG averages 52.3K total tokens per query, compared to 20.4K for Single-shot Search, a 2.6× token overhead. This cost yields a disproportionate quality gain: Claude Sonnet 4.5 with AgenticRAG reaches 49.6% recall@1, compared to 8.41% for Single-shot Search, a 5.9× improvement. FinanceBench is more expensive, averaging 114.8K tokens per query and a 7.8× ratio over single-shot search, which reflects deep navigation over long financial filings. This higher cost is paired with 92.00% answer correctness, close to the 94.00% oracle-evidence upper bound. Tool usage in Table [5](https://arxiv.org/html/2605.05538#S5.T5) further shows that the system operates well within the 15-iteration budget, averaging 4.48–4.79 tool calls per query. The multi-query ablation provides a direct efficiency comparison: the full system achieves comparable recall with 4.79 average tool calls versus 6.79 without multi-query search, a 29% reduction in tool calls.

### 5.5 Ablation Studies

Table 5: Ablation study of agentic components averaged across all BRIGHT splits. Performance is measured by recall (R@k), along with average tool usage and per-tool statistics. Per-split results are in Table [10](https://arxiv.org/html/2605.05538#A3.T10).

#### Single-shot vs Agentic Retrieval.

From Table [5](https://arxiv.org/html/2605.05538#S5.T5), the most significant finding is the dramatic improvement from single-shot search to full agentic tool use. Single-shot search achieves only 8.41% recall@1 on average, while agentic tool use reaches 43.49% with GPT-5-mini and 49.59% with Claude Sonnet 4.5, representing 5.2× and 5.9× improvements respectively. Notably, the proprietary search stack behind our search tool trades off raw retrieval quality for speed, immense scale, and availability compared to the state-of-the-art embedding-based retrievers in Table [2](https://arxiv.org/html/2605.05538#S5.T2). However, these quality differences vanish when our agentic harness is employed with a reasoning language model. The improvement is consistent across splits (ref. Table [10](https://arxiv.org/html/2605.05538#A3.T10)).

#### Model Comparison and Tool Usage Patterns.

Claude Sonnet 4.5 achieves a +6.1 pp improvement over GPT-5-mini, outperforming it on seven of eight splits (detailed per-split results in Appendix Table [10](https://arxiv.org/html/2605.05538#A3.T10)). The two models exhibit distinct strategies that reflect an exploration–exploitation trade-off. Claude favors *exploitation*: it uses fewer search calls (2.51 vs 3.39) but opens more documents (1.54 vs 1.22) and relies more on semantic find (0.42 vs 0.14, a 3× increase), going deeper into candidate documents. GPT-5-mini favors *exploration*: it issues more search calls with reformulated queries rather than using in-document find, casting a wider net across the corpus. In the BRIGHT long-document setting, where queries have only ~1.9 golden documents on average amid a large corpus, relevant documents are sparse and broad exploration often surfaces irrelevant results.

#### Failure Patterns.

The main observed weakness is broad multi-evidence retrieval, especially on the Pony split, where each query has ~6.9 gold documents on average compared to ~1.9 across BRIGHT overall. This setting rewards recovering many related documents, whereas our harness is optimized for coarse-to-fine navigation toward a small number of high-value evidence sources. This explains why Pony remains difficult for both of our models despite large gains on scientific and technical splits where relevant documents are sparse. The pattern suggests that future trajectory policies should better detect broad evidence needs and shift from depth-first document reading to wider coverage before final ranking.

#### Component Contributions.

We ablate individual components using GPT-5-mini to understand their contributions. The most notable finding concerns multi-query search. In the default configuration, the model can issue up to 5 queries in parallel within a single search tool call, with results deduplicated and presented together. Restricting the system to single-query search (w/o Multi-query Search, 44.84%), where the model issues only one query per search call but receives the same number of results, achieves comparable recall@1 to the full system but at the cost of increased tool usage: 6.79 average tool calls compared to 4.79 for the full system, with notably more search operations (4.38 vs 3.39) and document opens (2.16 vs 1.22). This suggests that multi-query search improves efficiency by finding relevant documents with fewer iterations. Detailed analysis of these ablations is provided in Appendix [C.2](https://arxiv.org/html/2605.05538#A3.SS2).

#### Findings from Pre-Production Deployments

In our pre-production evaluation we identified several design choices that guide the model toward more optimal trajectories: (1) Surfacing document metadata in search results, like title, filename, and file type, helps the model disambiguate semantically similar snippets and avoid redundant searches; (2) Line-numbered document previews let the model anchor on specific content and jump to relevant sections in successive open calls; (3) Candidate reference retention after summarization in the context window enables the model to go deeper on promising candidates via open/find, rather than restarting retrieval with reformulated queries; (4) Having a switcher route complex, multi-intent queries to our agentic RAG harness for deeper analysis, while simple queries are routed to traditional RAG for faster answers. This routing is vital for balancing user experience, cost, and model availability. We have good early signals that it is effective, and we continue to pursue this hybrid approach.
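A minimal sketch of such a switcher is shown below; the `classify_complexity` helper (e.g., a small classifier model or heuristic) is a hypothetical stand-in, not the production router.

```python
def route_query(query, classify_complexity, agentic_rag, traditional_rag):
    """Route complex, multi-intent queries to the agentic harness and simple
    ones to single-shot RAG, trading answer depth against latency and cost."""
    label = classify_complexity(query)   # hypothetical helper returning "complex" or "simple"
    if label == "complex":
        return agentic_rag(query)        # iterative search/find/open/summarize loop
    return traditional_rag(query)        # single retrieve-then-generate pass
```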

## 6 Conclusion

We presented a practical harness for AgenticRAG that equips reasoning language models with search, find, open, and summarize tools to autonomously retrieve and reason over large enterprise corpora. Across three benchmarks, our approach achieves 49.6% recall@1 on BRIGHT (+21.8 pp over the best embedding baseline), 0.96 factuality on WixQA (+13% relative), and 92.00% answer correctness on FinanceBench, within 2 pp of oracle access. Token analysis shows that these gains require a moderate 2.6× token overhead on BRIGHT relative to single-shot search, while delivering a 5.9× recall@1 improvement. These results demonstrate that our harness effectively extracts the value of reasoning models for enterprise information retrieval tasks requiring deep, multi-step reasoning. Future work will focus on large-scale deployment, budget-aware routing between traditional and agentic RAG, deeper failure analysis, ablations over iteration and window-size budgets, and optimizing retrieval trajectories for fast iterative reasoning via fine-tuning.

## Acknowledgments

We thank Eli Coon, Kinfe Mengistu, and members of the broader Copilot Studio team for feedback and discussions during internal experimentation. Special thanks to James Cai and Alejandro Gutierrez Munoz for technical guidance and project sponsorship.

## References

- A\. Asai, Z\. Wu, Y\. Wang, A\. Sil, and H\. Hajishirzi \(2024\)Self\-rag: learning to retrieve, generate, and critique through self\-reflection\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p2.1)\.
- \[2\]AzureAISearch\(\)Agentic Retrieval \- Azure AI Search — learn\.microsoft\.com\.Note:[https://learn\.microsoft\.com/en\-us/azure/search/agentic\-retrieval\-overview](https://learn.microsoft.com/en-us/azure/search/agentic-retrieval-overview)\[Accessed 14\-02\-2026\]Cited by:[§1](https://arxiv.org/html/2605.05538#S1.p2.1)\.
- S\. Borgeaud, A\. Mensch, J\. Hoffmann, T\. Cai, E\. Rutherford, K\. Millican, G\. B\. Van Den Driessche, J\. Lespiau, B\. Damoc, A\. Clark,et al\.\(2022\)Improving language models by retrieving from trillions of tokens\.InInternational Conference on Machine Learning \(ICML\),pp\. 2206–2240\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p1.1)\.
- X\. Chenet al\.\(2024\)HiQA: a hierarchical contextual augmentation rag for multi\-documents qa\.arXiv preprint arXiv:2402\.12345\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p3.1)\.
- D\. Cohen, L\. Burg, S\. Pykhnivskyi, H\. Gur, S\. Kovynov, O\. Atzmon, and G\. Barkan \(2025\)Wixqa: a multi\-dataset benchmark for enterprise retrieval\-augmented generation\.arXiv preprint arXiv:2505\.08643\.Cited by:[Table 7](https://arxiv.org/html/2605.05538#A1.T7),[§4](https://arxiv.org/html/2605.05538#S4.p1.1)\.
- D\. Edge, H\. Trinh, N\. Cheng, J\. Bradley, A\. Chao, A\. Mody, S\. Truitt, and J\. Larson \(2024\)From local to global: a graph rag approach to query\-focused summarization\.arXiv preprint arXiv:2404\.16130\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p3.1)\.
- L\. Gao, X\. Ma, J\. Lin, and J\. Callan \(2023\)Precise zero\-shot dense retrieval without relevance labels\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 1762–1777\.Cited by:[§1](https://arxiv.org/html/2605.05538#S1.p2.1)\.
- Y\. Gao, Y\. Xiong, X\. Gao, K\. Jia, J\. Pan, Y\. Bi, Y\. Dai, J\. Sun, and H\. Wang \(2024\)Retrieval\-augmented generation for large language models: a survey\.arXiv preprint arXiv:2312\.10997\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p1.1)\.
- K\. Guu, K\. Lee, Z\. Tung, P\. Pasupat, and M\. Chang \(2020\)REALM: retrieval\-augmented language model pre\-training\.InInternational Conference on Machine Learning \(ICML\),pp\. 3929–3938\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p1.1)\.
- P\. Islam, A\. Kannappan, D\. Kiela, R\. Qian, N\. Scherrer, and B\. Vidgen \(2023\)Financebench: a new benchmark for financial question answering\.arXiv preprint arXiv:2311\.11944\.Cited by:[§B\.3](https://arxiv.org/html/2605.05538#A2.SS3.p1.2),[Table 8](https://arxiv.org/html/2605.05538#A2.T8),[§4](https://arxiv.org/html/2605.05538#S4.p1.1)\.
- G\. Izacard and E\. Grave \(2021\)Leveraging passage retrieval with generative models for open domain question answering\.InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume,pp\. 874–880\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p1.1)\.
- S\. Jeong, J\. Baek, S\. Cho, S\. J\. Hwang, and J\. C\. Park \(2024\)Adaptive\-rag: learning to adapt retrieval\-augmented large language models through question complexity\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics \(NAACL\),Cited by:[§1](https://arxiv.org/html/2605.05538#S1.p2.1)\.
- Z\. Jiang, F\. F\. Xu, J\. Araki, and G\. Neubig \(2023\)Active retrieval augmented generation\.arXiv preprint arXiv:2305\.06983\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p1.1)\.
- B\. Jinet al\.\(2025\)Search\-r1: training llms to reason and leverage search engines with reinforcement learning\.arXiv preprint arXiv:2503\.09516\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p2.1)\.
- O\. Khattab and M\. Zaharia \(2020\)ColBERT: efficient and effective passage search via contextualized late interaction over bert\.InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 39–48\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p1.1)\.
- M\. Leeet al\.\(2024\)PlanRAG: a plan\-then\-retrieval augmented generation for generative large language models as decision makers\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics \(NAACL\),Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p2.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel,et al\.\(2020\)Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.33,pp\. 9459–9474\.Cited by:[§1](https://arxiv.org/html/2605.05538#S1.p1.1),[§2](https://arxiv.org/html/2605.05538#S2.p1.1)\.
- X\. Liet al\.\(2025\)Search\-o1: agentic search\-enhanced large reasoning models\.arXiv preprint arXiv:2501\.05366\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p2.1)\.
- T\. Liuet al\.\(2009\)Learning to rank for information retrieval\.Foundations and Trends® in Information Retrieval3\(3\),pp\. 225–331\.Cited by:[§1](https://arxiv.org/html/2605.05538#S1.p1.1)\.
- A\. Mallen, A\. Asai, V\. Zhong, R\. Das, D\. Khashabi, and H\. Hajishirzi \(2023\)When not to trust language models: investigating effectiveness of parametric and non\-parametric memories\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 9802–9822\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p1.1)\.
- R\. Nogueira and K\. Cho \(2019\)Passage re\-ranking with bert\.arXiv preprint arXiv:1901\.04085\.Cited by:[§1](https://arxiv.org/html/2605.05538#S1.p1.1)\.
- A\. J\. Oche, A\. G\. Folashade, T\. Ghosal, and A\. Biswas \(2025\)A systematic review of key retrieval\-augmented generation \(rag\) systems: progress, gaps, and future directions\.arXiv preprint arXiv:2507\.18910\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p2.1)\.
- J\. Z\. Panet al\.\(2024\)Unifying large language models and knowledge graphs: a roadmap\.IEEE Transactions on Knowledge and Data Engineering\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p3.1)\.
- O\. Press, M\. Zhang, S\. Min, L\. Schmidt, N\. A\. Smith, and M\. Lewis \(2023\)Measuring and narrowing the compositionality gap in language models\.arXiv preprint arXiv:2210\.03350\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p1.1)\.
- O\. Ram, Y\. Levine, I\. Dalmedigos, D\. Schuhmann, G\. Sasho, E\. Karpas, O\. Shwartz\-Ziv, N\. Gupta, Y\. Wu, K\. Leyton\-Brown,et al\.\(2023\)In\-context retrieval\-augmented language models\.arXiv preprint arXiv:2302\.00083\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p1.1)\.
- P\. Sarthi, S\. Abdullah, A\. Tuli, S\. Khanna, A\. Goldie, and C\. D\. Manning \(2024\)RAPTOR: recursive abstractive processing for tree\-organized retrieval\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p3.1)\.
- H\. Scaffidiet al\.\(2025\)GraphRAG on technical documents \- impact of knowledge graph schema\.Transactions on Graph Data and Knowledge\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p3.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\)Toolformer: language models can teach themselves to use tools\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§1](https://arxiv.org/html/2605.05538#S1.p3.1),[§2](https://arxiv.org/html/2605.05538#S2.p2.1)\.
- W\. Shi, S\. Min, M\. Yasunaga, M\. Seo, R\. James, M\. Lewis, L\. Zettlemoyer, and W\. Yih \(2023\)REPLUG: retrieval\-augmented black\-box language models\.arXiv preprint arXiv:2301\.12652\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p1.1)\.
- A\. Singh, A\. Ehtesham, S\. Kumar, and T\. T\. Khoei \(2025\)Agentic retrieval\-augmented generation: a survey on agentic rag\.arXiv preprint arXiv:2501\.09136\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p2.1)\.
- H\. Su, H\. Yen, M\. Xia, W\. Shi, N\. Muennighoff, H\. Wang, H\. Liu, Q\. Shi, Z\. S\. Siegel, M\. Tang,et al\.\(2024\)Bright: a realistic and challenging benchmark for reasoning\-intensive retrieval\.arXiv preprint arXiv:2407\.12883\.Cited by:[Table 6](https://arxiv.org/html/2605.05538#A1.T6),[§B\.1](https://arxiv.org/html/2605.05538#A2.SS1.p1.1),[§4](https://arxiv.org/html/2605.05538#S4.p1.1)\.
- S\. Subramanian, W\. Akinfaderin, Y\. Zhang, I\. Singh, C\. Pecora, M\. Khanuja, S\. Singh, and M\. L\. Tanke \(2026\)Keyword search is all you need: achieving rag\-level performance without vector databases using agentic tool use\.External Links:[Link](https://www.amazon.science/publications/keyword-search-is-all-you-need-achieving-rag-level-performance-without-vector-databases-using-agentic-tool-use)Cited by:[§5\.3](https://arxiv.org/html/2605.05538#S5.SS3.p1.4)\.
- N\. Thakur, N\. Reimers, A\. Rücklé, A\. Srivastava, and I\. Gurevych \(2021\)Beir: a heterogenous benchmark for zero\-shot evaluation of information retrieval models\.arXiv preprint arXiv:2104\.08663\.Cited by:[§1](https://arxiv.org/html/2605.05538#S1.p1.1)\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2023\)Interleaving retrieval with chain\-of\-thought reasoning for knowledge\-intensive multi\-step questions\.InProceedings of the 61st annual meeting of the association for computational linguistics \(volume 1: long papers\),pp\. 10014–10037\.Cited by:[§1](https://arxiv.org/html/2605.05538#S1.p2.1),[§2](https://arxiv.org/html/2605.05538#S2.p2.1)\.
- L\. Wang, N\. Yang, and F\. Wei \(2023\)Query2doc: query expansion with large language models\.arXiv preprint arXiv:2303\.07678\.Cited by:[§1](https://arxiv.org/html/2605.05538#S1.p2.1)\.
- X\. Wanget al\.\(2024\)Knowledge graph\-enhanced retrieval\-augmented generation\.arXiv preprint arXiv:2402\.12345\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p3.1)\.
- S\. Yan, J\. Gu, Y\. Zhu, and Z\. Ling \(2024\)Corrective retrieval augmented generation\.arXiv preprint arXiv:2401\.15884\.Cited by:[§2](https://arxiv.org/html/2605.05538#S2.p2.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2605.05538#S1.p3.1),[§2](https://arxiv.org/html/2605.05538#S2.p2.1)\.

## Appendix A Method Details

### A.1 Agentic Loop Algorithm

The detailed agentic loop algorithm is shown in Algorithm [1](https://arxiv.org/html/2605.05538#alg1).

Algorithm 1: Agentic Loop

Input: user_query, max_calls, token_threshold
Output: formatted answer with citations

    conversation.add(user_query)
    for i = 1 to max_calls do
        if tokens(conversation) ≥ token_threshold then
            ManageContext()                      ▷ force summarize
        end if
        response ← LLM(conversation, tool_schemas)
        if response.has_tool_calls then
            for each tool_call in response.tool_calls do
                result ← ExecuteTool(tool_call)
                conversation.add(tool_call, result)
            end for
        else
            return FormatAnswer(response.text)
        end if
    end for
    return ForceFinalAnswer()

### A.2 System Instructions for Tool Use

Overall instructions include:

- Search before answering when uncertain.
- Progressively explore using find or open when snippets are insufficient.
- Reuse previous results rather than performing the same search again.
- Cite sources whenever information from tool outputs is used.

When to use search:

- Primary search tool across the enterprise corpus.
- First choice for any work-related query.
- When users reference current/changing information, enterprise-specific terms, or acronyms.
- To verify details rather than making assumptions.

When to use find:

- In-document pattern search for relevant files from search results.
- When search results do not give enough detail.
- To get a focused view of a result in relation to certain terms.

When to use open:

- Windowed full-content retrieval for relevant files from search results.
- When search result snippets are insufficient.
- To pull in more content from the most promising results.
- Can open multiple search results.
- Option to choose a line number close to the relevant content.
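To make this guidance concrete, the following fragment shows the kind of tool-call arguments a trajectory might contain when following these instructions; the query, reference IDs, and argument names are illustrative, not taken from an actual trace.

```python
# Illustrative trajectory fragment following the tool-use instructions above
# (reference IDs, arguments, and the query itself are made up for this sketch).
example_trajectory = [
    {"tool": "search", "args": {"queries": [
        "FY2023 capital expenditure guidance",
        "capex outlook 2023 annual report",
    ]}},
    # Snippets were insufficient, so drill into a promising result with find.
    {"tool": "find", "args": {"ref_id": "turn0_search3",
                              "patterns": ["capital expenditures", "capex"]}},
    # Jump near the matched section with open, using a line number anchor.
    {"tool": "open", "args": {"ref_id": "turn0_search3", "start_line": 1800}},
]
```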

Table 6: Dataset statistics for the BRIGHT benchmark Su et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib9)) long-context splits used in our evaluation.

Table 7: Dataset statistics for the WixQA Cohen et al. ([2025](https://arxiv.org/html/2605.05538#bib.bib10)) benchmark (median values).

## Appendix B Dataset Details

### B.1 BRIGHT Benchmark

We adopt the BRIGHT benchmark Su et al. ([2024](https://arxiv.org/html/2605.05538#bib.bib9)), which is designed to capture realistic enterprise scenarios of information retrieval. BRIGHT derives queries from StackExchange posts, reflecting human-authored, highly situational, and domain-specific information needs. For each query, the corpus contains positive documents cited in top-voted answers and verified by human annotators, as well as negative documents collected via search engine retrieval. The corpus consists of normalized web content (e.g., Wikipedia pages, blogs, and reports). This construction has been shown to yield realistic retrieval pools with substantial semantic overlap between relevant and irrelevant documents. We evaluate on the long-context setting of BRIGHT, where documents correspond to entire web pages rather than snippets and the task is to retrieve the full relevant document(s) for a given query. Our experiments span eight domains: Biology, Earth Science, Economics, Psychology, Robotics, Stack Overflow, Sustainable Living, and Pony. These domains cover a broad range of scientific, technical, and professional areas commonly encountered in enterprise information retrieval. Across domains, corpora contain hundreds to thousands of documents, with average document lengths ranging from several thousand to over 40k tokens. Queries themselves are also non-trivial in length, with average query sizes exceeding 100 tokens in most domains and reaching several hundred tokens in technical domains such as Robotics and Stack Overflow. Benchmark statistics are detailed in Table [6](https://arxiv.org/html/2605.05538#A1.T6). We follow the standard evaluation protocol of the BRIGHT benchmark and report Recall@1 for long-context document retrieval.

### B.2 WixQA Benchmark

WixQA targets procedural, long-form queries that require multi-step reasoning and specialized enterprise vocabulary, closely matching real-world support and troubleshooting scenarios. We use both subsets of WixQA in our experiments: *Expert Written*, containing authentic customer queries with step-by-step answers authored and validated by human domain experts, and *Simulated*, derived from multi-turn user–chatbot interactions and curated into single-turn queries with expert-validated procedural correctness. A defining characteristic of WixQA is its multi-article dependency, where answering a query may require retrieving and synthesizing information from multiple documents. All queries are grounded in a shared enterprise-scale knowledge base of 6,221 domain-specific help articles, making WixQA well suited for evaluating agentic RAG systems that must coordinate retrieval and reasoning over complex, multi-document enterprise corpora. Dataset statistics are presented in Table [7](https://arxiv.org/html/2605.05538#A1.T7).

### B.3 FinanceBench

Table 8: FinanceBench Islam et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib11)) data statistics.

FinanceBench Islam et al. ([2023](https://arxiv.org/html/2605.05538#bib.bib11)) is a human-evaluated benchmark consisting of financial questions over public company filings (10-K, 10-Q, 8-K, and earnings reports in PDF form). Questions span metrics-generated and domain-relevant categories: metrics-generated questions target specific financial line items or ratios that require the model to locate the relevant data in the document and often perform a calculation, making them straightforward to verify since each has a single unambiguous answer. Domain-relevant questions require deeper financial reasoning, such as identifying drivers of margin changes or assessing capital intensity. Each query pertains to a single document. We choose this benchmark because of the large size of its documents (averaging ~143 pages and ~117K tokens per PDF), which is representative of enterprise domains where knowledge workers routinely work with dense, information-heavy manuals, reports, and regulatory filings. Corpus statistics are presented in Table [8](https://arxiv.org/html/2605.05538#A2.T8). The evaluation metric we use is answer correctness; an LLM is used as a judge for this process, and manual review of the results is also conducted.

## Appendix C Additional Results

### C.1 Full BRIGHT Retrieval Results

Table 9: Full long-context retrieval performance on unsplit web pages of StackExchange data from the BRIGHT benchmark. Scores are reported in recall@1.

Table [9](https://arxiv.org/html/2605.05538#A3.T9) presents the complete retrieval results on the BRIGHT benchmark across all baseline models.

### C.2 Detailed Ablation Analysis

Table 10: Per-split ablation study of agentic components on BRIGHT. Performance is measured by recall (R@k), along with average tool usage and per-tool usage statistics.

Table [10](https://arxiv.org/html/2605.05538#A3.T10) provides per-split ablation results across all BRIGHT domains. Removing the summarization tool (w/o Summarize, 43.34% avg recall@1) has minimal impact, indicating that this component is rarely needed for the retrieval task. Removing semantic find (w/o Semantic Find, 46.34%) slightly *improves* average recall@1, likely because the lexical find fallback is sufficient for most in-document searches and removing the semantic option reduces latency, allowing more search iterations within the same compute budget.

### C.3 WixQA Simulated Results

As shown in Figure [4](https://arxiv.org/html/2605.05538#A3.F4), on simulated questions with expert-validated ground-truth answers, i.e., the *Simulated* split of WixQA, our method achieves 0.94 factuality, compared to 0.77 for both E5+GPT-4o and E5+Claude 3.7. The improvement is even more pronounced on this dataset, with a 22% relative gain. This suggests that agentic retrieval is particularly effective when questions require more complex reasoning or multi-hop information gathering.

![Factuality on WixQA Simulated](https://arxiv.org/html/2605.05538v1/x2.png)
Figure 4: Factuality performance on the WixQA Simulated dataset. The performance gap between agentic retrieval and traditional methods is even larger on synthetic questions that require more complex reasoning.
### C.4 Example Conversation

Figure [5](https://arxiv.org/html/2605.05538#A3.F5) shows an example conversation from FinanceBench.

![Example conversation](https://arxiv.org/html/2605.05538v1/figures/example_conversation.png)
Figure 5: Example conversation from FinanceBench.
