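@vintcessun: Feed RAG too many documents and retrieval quality drops from 75% to 40%? Vector search gets diluted by large amounts of irrelevant content, and hit rates collapse in real deployments. Root cause: heterogeneous documents are searched as one mixed pool, so noise drowns out the signal. Multi-agent orchestration looks smart, but it actually introduces a precision–faithfulness paradox: a slightly poor configuration loses on both fronts. The paper's proposed MA…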


Summary

This paper identifies 'vector search dilution' in RAG systems when scaling to large heterogeneous document collections, where accuracy dropped from 75% to 40% in a real-world deployment. The proposed MASDR-RAG method uses domain scoping via organizational metadata before retrieval, improving P@10 from 0.77 to 0.86 with low cost and easy deployment.

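Feed RAG too many documents and retrieval quality drops from 75% to 40%? Vector search gets diluted by large amounts of irrelevant content, and hit rates collapse in real deployments. Root cause: heterogeneous documents are searched as one mixed pool, so noise drowns out the signal. Multi-agent orchestration looks smart, but it actually introduces a precision–faithfulness paradox: a slightly poor configuration loses on both fronts. The MASDR-RAG approach proposed in the paper is pragmatic: first restrict the retrieval scope using metadata (such as department), then run standard hybrid retrieval, and finally generate in a single call. In the experiments, P@10 rises from 0.77 to 0.86, at low cost and with easy deployment, making it more reliable than complex orchestration.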

Cached: 2026/06/14 00:17



When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped, Model-Agnostic Retrieval

Source: https://arxiv.org/html/2606.11350

Nabaraj Subedi¹*, Ahmed Abdelaty², and Shivanand Venkanna Sheshappanavar¹
¹ Dept. of Electrical Engineering & Computer Science
² Dept. of Civil, Architectural Engineering & Construction Management
University of Wyoming, Laramie, WY 82071, USA
{nsubedi1, aahmed3, ssheshap}@uwyo.edu
* Corresponding author

Abstract

Retrieval-augmented generation degrades when scaled to large, heterogeneous document collections, where dense similarity loses discriminative power and top-$k$ retrieval increasingly returns semantically similar but contextually incorrect chunks. We refer to this failure mode as vector search dilution. We observed it firsthand, even with hybrid dense+sparse retrieval, in a deployed Wyoming Department of Transportation corpus, where scaling from 54 to 1,128 documents (88,907 chunks) reduced accuracy from 75% to below 40%. To address this dilution, we propose MASDR-RAG (Multi-Agent Scoped Domain Retrieval for RAG) and evaluate it on 200 expert-validated queries across five LLM backbones, six corpora, and two index stacks. Our results indicate that domain scoping using organizational metadata is the key fix, significantly improving P@10 from 0.77 to 0.86 ($p<0.05$). Furthermore, our investigation of multi-agent orchestration revealed a high degree of configuration dependence, creating what we call the precision–faithfulness paradox. Based on these varied outcomes, our practical recommendation is simple: scope first, then perform a single synthesis call, reserving full multi-agent orchestration for genuinely multi-domain corpora paired with native-tool-call backbones. Code and data will be made public upon acceptance.


1 Introduction

Retrieval-augmented generation (RAG) has become the dominant pattern for grounding LLM outputs in external knowledge (Lewis et al., 2020; Guu et al., 2020; Gao et al., 2024). However, the standard embed–index–retrieve–generate pipeline scales poorly on regulated enterprise corpora spanning thousands of heterogeneous documents (Barnett et al., 2024; Wu et al., 2025). As the corpus expands across heterogeneous categories, dense retrieval loses its discriminative power. The effect persists even when the Approximate Nearest Neighbor (ANN) index returns the true nearest neighbors (Malkov and Yashunin, 2020; Johnson et al., 2021): those neighbors are semantically related to the query yet contextually irrelevant.

We identify and characterize vector search dilution, a semantic scaling problem. We study this problem in the current Wyoming Department of Transportation (WYDOT) chatbot, where scaling the corpus from 54 to 1,128 documents across nine categories reduced accuracy on Standard-Specification queries from 75% to below 40%. To address this, we developed a domain-scoped retrieval framework, MASDR-RAG, together with a lightweight single-call variant, Hybrid-Routed.

Our experiments span five LLM backbones (Qwen2.5-7B-Instruct (Qwen Team, 2024), Llama-3-8B-Instruct (Grattafiori and others, 2024), and three commercial backbones via OpenRouter: Claude-Haiku-4.5, GPT-5-mini, and DeepSeek-V3), six corpora (EnterpriseComposite-9, HotpotQA-distractor (Yang et al., 2018), MULTIHOP-RAG (Tang and Yang, 2024), NQ-Open, FinanceBench, and MMLU-Pro), and two index stacks (FAISS and Neo4j HNSW). They identify domain scoping over organizational metadata as the primary driver of improved retrieval performance. In contrast, multi-agent orchestration produces configuration-dependent results. Under a Gemini production stack, it reduces RAGAS faithfulness from 0.61 to 0.35 ($p<0.01$), creating what we call the precision–faithfulness paradox. However, this effect does not reproduce under an apples-to-apples open-source stack. Through controlled ablations, we further show that this degradation is not simply a consequence of splitting retrieval into multiple calls. Instead, it arises from the difficulty of synthesizing answers across multiple sources when the retrieved evidence contains dense, near-duplicate passages. These findings point to a practical design principle for large-scale RAG systems: scope retrieval first and use a single synthesis step whenever possible. Our contributions are threefold:

  1. Diagnosis: We formalize vector search dilution and characterize how retrieval quality degrades as corpus density increases.
  2. Architecture and analysis: We introduce the multi-agent retrieval framework MASDR-RAG and the lightweight variant Hybrid-Routed, along with controlled ablations that isolate the sources of synthesis failures.
  3. Generalization: We evaluate across five LLMs, six corpora, and two retrieval stacks, showing that the findings are robust across models and indexing implementations while reducing costs relative to iterative ReAct-style baselines.

2 Related Work

RAG and Dense Retrieval.

RAG (Lewis et al., 2020) pairs a generator with a retriever and has evolved through query transformation, re-ranking, and iterative retrieval (Gao et al., 2024); agentic variants (Singh and others, 2025) let the model decide when to retrieve. Dense bi-encoders (Karpukhin et al., 2020) and late-interaction models (Khattab and Zaharia, 2020; Santhanam et al., 2022) largely supplanted sparse retrieval (Robertson and Zaragoza, 2009), while hybrid schemes (Sawarkar et al., 2024) stay competitive on multi-domain corpora. Index scaling is usually framed algorithmically via approximate nearest neighbors (Malkov and Yashunin, 2020; Johnson et al., 2021); we focus instead on a complementary semantic degradation. Prior work shows dense retrieval loses discriminative power as the index grows (Reimers and Gurevych, 2021), irrelevant passages alter generation (Cuconasu et al., 2024), and long contexts introduce noise (Jin et al., 2025). We share this diagnosis but contribute a retrieval-free, corpus-intrinsic measurement (the dilution factor $\delta$, §3) along with a deployable fix, and confirm that the issue is not dense-specific by evaluating BM25 and ColBERTv2 (§6).

Query Routing and Multi-Agent Systems.

Two lines of prior work contextualize our approach to domain scoping. Strategy routing (Jeong et al., 2024; Zhang et al., 2025; Guo et al., 2025) chooses a retrieval depth or entire pipeline per query, deciding when and how hard to retrieve, not where. Metadata filtering (Poliakov and others, 2024) masks candidates post-hoc while still indexing the whole corpus. Our scoping is orthogonal: we route to one of $K$ pre-existing organizational scopes that live in the document graph as a first-class field (source_type, document_series, article category), restricting the index at query time rather than filtering after the fact. Our trained R2-Routed variant (App. A33) demonstrates that the choice of routing target matters as much as the routing model.
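To make the contrast between post-hoc filtering and scope-first retrieval concrete, here is a minimal sketch on a toy candidate list. It is not the paper's implementation: the chunk ids, similarity scores, and helper names are hypothetical, and only the `source_type` field name comes from the text above.

```python
# Toy candidates as (chunk_id, source_type, similarity); all values are
# hypothetical and chosen only to illustrate the two strategies.
CANDIDATES = [
    ("c1", "crash", 0.95),
    ("c2", "crash", 0.93),
    ("s1", "specs", 0.92),
    ("c3", "crash", 0.91),
    ("s2", "specs", 0.80),
    ("s3", "specs", 0.75),
]


def top_k(candidates, k):
    """Return the k candidates with the highest similarity."""
    return sorted(candidates, key=lambda item: item[2], reverse=True)[:k]


def post_hoc_filter(candidates, scope, k):
    """Rank over the whole corpus first, then drop out-of-scope hits."""
    return [item for item in top_k(candidates, k) if item[1] == scope]


def scope_first(candidates, scope, k):
    """Restrict the candidate pool to the scope, then rank within it."""
    return top_k([item for item in candidates if item[1] == scope], k)


filtered = post_hoc_filter(CANDIDATES, "specs", k=3)
scoped = scope_first(CANDIDATES, "specs", k=3)
print([cid for cid, _, _ in filtered])
print([cid for cid, _, _ in scoped])
```

In this toy setting the post-hoc variant loses two of its three slots to out-of-scope neighbors, while the scope-first variant fills all three from the target scope.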

For orchestration, ReAct (Yao et al., 2023) and LangChain (Chase, 2023) provide general scaffolding for tool use. Genuinely multi-agent RAG assigns distinct roles with inter-agent messaging: MA-RAG (Nguyen and others, 2025) chains task-specific agents, and SCOUT-RAG (Li et al., 2026) runs cooperative domain-relevance and retrieval agents over graph domains. Our MASDR-RAG is deliberately simpler: a single reasoning agent with $K$ domain-scoped tools, where each “agent” is a scope-bound tool configuration. We include both multi-agent paradigms as baselines (§8) and show that, with commercial generators, multi-round orchestration triggers a faithfulness collapse that is absent with open-source backbones (Table 10).

Reranking, Iterative, and Graph RAG:

Two-stage pipelines rerank a bi-encoder top-$K$ with a cross-encoder (Nogueira et al., 2020); our ablation (App. A34) shows that while cross-encoder reranking lifts baseline faithfulness, it does not recover the multi-agent collapse, ruling out within-scope ranking noise as its sole cause. Learned-sparse retrievers such as SPLADE (Formal et al., 2022) remain competitive; we evaluate the OpenSearch neural-sparse model (OpenSearch Project, 2024) as an additional retriever baseline (§6). Iterative methods (IRCoT (Trivedi et al., 2023), Self-Ask (Press et al., 2023), and Self-RAG (Asai et al., 2024)) share ReAct’s multi-round loops, which our efficiency analysis shows are costly under open-source backbones. While Shi et al. (2023) note that LLMs are distracted by irrelevant context, we demonstrate that fragmented yet domain-precise context is similarly harmful. Finally, unlike GraphRAG (Edge et al., 2024), which builds entity–relationship graphs, we use the graph’s organizational metadata as explicit agent boundaries.

Evaluation:

RAGAS (Es et al., 2024) measures standard retrieval quality and faithfulness metrics. However, standard benchmarks, such as Natural Questions (Kwiatkowski et al., 2019), HotpotQA (Yang et al., 2018), MultiHop-RAG (Tang and Yang, 2024), and long-context suites (Yen et al., 2025), rely on homogeneous or synthetic corpora. Consequently, they fail to capture the cross-domain dilution typical of a regulated enterprise environment, motivating the multi-domain evaluation frameworks introduced in this work.

3 Vector Search Dilution

3.1 System Context

The corpus comprises 1,128 documents spanning construction specifications, design manuals, materials testing procedures, crash reports, transportation improvement programs, and administrative reports, ingested into Neo4j as Document → Section → Chunk. The production system uses Gemini Embedding (768-d), HNSW, and a BM25 full-text index. Traffic & Crash reports contribute 34.8% of chunks despite being 1.9% of documents (Table 1).

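| Category | Docs | Chunks | % | Chk/Doc |
|---|---|---|---|---|
| Standard Specs | 2 | 2,519 | 2.8 | 1,260 |
| Construction Manual | 21 | 6,641 | 7.5 | 316 |
| Materials Testing | 6 | 2,180 | 2.5 | 363 |
| Design Manual | 23 | 1,405 | 1.6 | 61 |
| Traffic & Crashes | 22 | 30,922 | 34.8 | 1,406 |
| STIP | 59 | 13,634 | 15.3 | 231 |
| Annual Reports | 46 | 2,341 | 2.6 | 51 |
| Bridge Program | 28 | 5,399 | 6.1 | 193 |
| Other | 921 | 23,866 | 26.8 | 26 |
| Total | 1,128 | 88,907 | 100 | — |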

Table 1: Document and chunk distribution by literal document_series category. Agent scope filters (App. A11) span broader related-series unions, so the per-agent counts in Table 13 exceed the per-category counts. Chunk density varies 54× across categories.
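The totals and the 54× density spread quoted for Table 1 can be re-derived from its rows. The short sketch below does only that arithmetic; the row values are copied from Table 1, while the variable names are ours.

```python
# Rows copied from Table 1: (category, documents, chunks).
TABLE_1 = [
    ("Standard Specs", 2, 2519),
    ("Construction Manual", 21, 6641),
    ("Materials Testing", 6, 2180),
    ("Design Manual", 23, 1405),
    ("Traffic & Crashes", 22, 30922),
    ("STIP", 59, 13634),
    ("Annual Reports", 46, 2341),
    ("Bridge Program", 28, 5399),
    ("Other", 921, 23866),
]

total_docs = sum(docs for _, docs, _ in TABLE_1)
total_chunks = sum(chunks for _, _, chunks in TABLE_1)

# Chunks per document for each category (the Chk/Doc column).
density = {name: chunks / docs for name, docs, chunks in TABLE_1}

# Spread between the densest and the sparsest category.
density_ratio = max(density.values()) / min(density.values())

# Share of all chunks contributed by the Traffic & Crashes category.
traffic_chunk_share = 100 * 30922 / total_chunks

print(total_docs, total_chunks)
print(f"density spread: {density_ratio:.1f}x")
print(f"Traffic & Crashes chunk share: {traffic_chunk_share:.1f}%")
```

The densest category is Traffic & Crashes and the sparsest is Other, which is where the 54× figure comes from.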

3.2 Formal Definition

Let $\mathcal{C}=\{c_{1},\dots,c_{N}\}$ be $N$ chunks partitioned into $K$ categories $\mathcal{C}_{1},\dots,\mathcal{C}_{K}$, $e:\mathcal{C}\to\mathbb{R}^{d}$ an embedding, and $q$ a query targeting category $k^{\star}$. The top-$m$ retrieval set is $R_{m}(q)=\arg\max_{S\subseteq\mathcal{C},\,|S|=m}\sum_{c\in S}\mathrm{sim}(e(q),e(c))$. Dilution occurs when global precision is much lower than scoped precision:

$$\delta(q,k^{\star}) \;=\; 1-\frac{P_{\text{global}}(q)}{P_{\text{scoped}}(q)},$$

where $P_{\text{global}}(q)$ is the fraction of the retrieval set $R_{m}(q)$ belonging to the target category $k^{\star}$ when retrieval ranges over all of $\mathcal{C}$, and $P_{\text{scoped}}(q)$ is the same fraction when retrieval is restricted to $\mathcal{C}_{k^{\star}}$ (so $P_{\text{scoped}}\approx 1$ by construction). Thus $\delta=0$ is no dilution and $\delta\to 1$ is severe dilution.
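As an illustration of the definition, the sketch below computes $\delta$ for one query on a tiny hand-made corpus. It assumes cosine similarity for $\mathrm{sim}$ and uses invented 2-D vectors and category names; because the objective is a sum of per-chunk similarities, the arg max over size-$m$ subsets is simply the $m$ most similar chunks. It also assumes the target category is non-empty.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_m(query_vec, chunks, m):
    """Return the m chunks most similar to the query.

    chunks is a list of (category, vector) pairs.
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return ranked[:m]


def category_precision(retrieved, target):
    """Fraction of retrieved chunks that belong to the target category."""
    return sum(1 for cat, _ in retrieved if cat == target) / len(retrieved)


def dilution(query_vec, chunks, target, m):
    """delta = 1 - P_global / P_scoped for a single query."""
    p_global = category_precision(top_m(query_vec, chunks, m), target)
    scoped_pool = [c for c in chunks if c[0] == target]
    p_scoped = category_precision(top_m(query_vec, scoped_pool, m), target)
    return 1.0 - p_global / p_scoped


# Hypothetical toy corpus: two "specs" chunks and three "crash" chunks.
TOY_CHUNKS = [
    ("specs", (1.0, 0.1)),
    ("specs", (0.9, 0.3)),
    ("crash", (1.0, 0.05)),
    ("crash", (1.0, 0.2)),
    ("crash", (0.2, 1.0)),
]

delta = dilution((1.0, 0.0), TOY_CHUNKS, target="specs", m=2)
print(delta)
```

Here one of the two global top-2 slots goes to a near-duplicate "crash" chunk, so global precision is 0.5 against a scoped precision of 1 and $\delta$ is 0.5.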

3.3 Empirical Measurements

Categories with smaller chunk populations suffer the most severe dilution (Design $\delta=0.53$; Specs $\delta=0.43$), while high-density categories (Construction Manual $\delta=0.10$) largely resist it. The Spearman correlation between $\log(\text{chunk count})$ and mean $\delta$ across the eight scopable categories is $\rho=-0.60$ ($p=0.12$). With $n=8$ categories, this single correlation is suggestive rather than statistically conclusive on its own; we corroborate it on the reproducible cross-DOT replication of §9, where the same correlation under the open-source BGE-M3 stack ranges from $\rho=-0.68$ (WYDOT, 10 categories) to $\rho=-0.95$ (CDOT, 10 categories).

Table 2: Per-category dilution factor with per-query ranges.

Figure 1: Dilution $\delta$ vs. chunk count, eight WYDOT scopes; Spearman $\rho=-0.60$ ($p=0.12$).

[Figure 2 diagram: User Query → Hybrid Router (Regex → LLM) → category → Orchestrator (function-calling) → nine domain agents → scoped ANN search over the Neo4j Knowledge Graph → top-$k$ chunks → chunks + query → Synthesiser LLM (Qwen-7B / Llama-8B / Gemini) → Answer. Annotation: scoped retrieval, 85–98% search-space reduction. Agent scope sizes $|\mathcal{C}|$: Specs ≈2.5k, Construction ≈6.6k, Materials ≈2.2k, Design ≈1.4k, Traffic & Crashes ≈30.9k, Bridge ≈5.4k, STIP ≈13.6k, Annual Reports ≈2.3k, Highway Safety ≈30.9k.]

Figure 2: MASDR-RAG / Hybrid-Routed data flow. A regex-then-LLM router dispatches to one of nine WYDOT domain agents; each agent ANN-searches its document_series scope in the Neo4j graph, and a Qwen-7B / Llama-8B / Gemini synthesizer generates the answer.

Geometrically, the dilution corresponds to retrieval-time source confusion: on Composite-9, the diagonal of $P(\text{retrieved source}\mid\text{gold source})$ is only 0.59 under monolithic search, lifting to 0.84 under regex scoping and 0.90 under Hybrid-Routed (Figure 5, App. A27). Scoping does not improve the embedder; it forces the retrieval neighborhood to respect the source label already present in the document graph. A t-SNE projection (App. A6) and a worked WYDOT failure case (App. A26) illustrate the same mechanism.
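One way to read the source-confusion diagonal is as a row-normalized confusion matrix over (gold source, retrieved source) pairs. The sketch below shows that computation on invented pairs; averaging the diagonal across gold sources is our assumption about how a single number such as 0.59 could be summarized, not something this section specifies.

```python
from collections import Counter, defaultdict


def source_confusion(pairs):
    """Estimate P(retrieved source | gold source) from (gold, retrieved) pairs."""
    counts = defaultdict(Counter)
    for gold, retrieved in pairs:
        counts[gold][retrieved] += 1
    return {
        gold: {src: n / sum(row.values()) for src, n in row.items()}
        for gold, row in counts.items()
    }


def mean_diagonal(matrix):
    """Average of P(retrieved == gold | gold) over gold sources (assumed summary)."""
    return sum(row.get(gold, 0.0) for gold, row in matrix.items()) / len(matrix)


# Hypothetical retrieval log: one pair per retrieved chunk.
PAIRS = [
    ("specs", "specs"), ("specs", "specs"), ("specs", "specs"), ("specs", "crash"),
    ("crash", "crash"), ("crash", "crash"),
]

matrix = source_confusion(PAIRS)
print(matrix["specs"])
print(mean_diagonal(matrix))
```

A diagonal entry below 1 means that some retrieved chunks for that gold source came from a different source, which is the confusion that scoping removes by construction.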

4 Architecture: MASDR-RAG and Hybrid-Routed

The architecture has three components: (1) Domain-scoped retrieval, where each agent restricts the search to documents matching a Neo4j metadata filter, reducing the effective search space by 85–98% (Table 13); (2) Hybrid routing, which runs a fast regex matcher first and falls back to a zero-shot LLM classifier; and (3) Multi-agent orchestration, which dispatches to nine domain agents via function calling. Figure 2 summarizes the data flow.
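The regex-then-LLM routing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's router: the scope names, the regex patterns, the rule that only a single unambiguous regex hit bypasses the LLM, and the `llm_classify` callable are all hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical scope patterns; the paper's actual rules are not shown here.
SCOPE_PATTERNS = {
    "standard_specs": re.compile(r"\b(standard spec\w*|section \d{3})\b", re.I),
    "bridge": re.compile(r"\bbridge\w*\b", re.I),
    "traffic_crashes": re.compile(r"\b(crash\w*|traffic)\b", re.I),
}


def regex_route(query: str) -> Optional[str]:
    """Return a scope only when exactly one pattern matches (assumed rule)."""
    hits = [scope for scope, pattern in SCOPE_PATTERNS.items() if pattern.search(query)]
    return hits[0] if len(hits) == 1 else None


def hybrid_route(
    query: str, llm_classify: Callable[[str, List[str]], str]
) -> Tuple[str, str]:
    """Try the cheap regex matcher first; fall back to the LLM classifier."""
    scope = regex_route(query)
    if scope is not None:
        return scope, "regex"
    return llm_classify(query, list(SCOPE_PATTERNS)), "llm"


def stub_classifier(query: str, scopes: List[str]) -> str:
    """Stand-in for a zero-shot LLM call; always answers 'bridge'."""
    return "bridge"


print(hybrid_route("What does Section 203 of the standard specs require?", stub_classifier))
print(hybrid_route("How many crashes involved a bridge last year?", stub_classifier))
```

In a real deployment `stub_classifier` would be replaced by an LLM call that returns one of the scope names; keeping it as an injected callable lets the regex path be tested without any model.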

Each agent’s scope filter reduces its effective search space by 65–98% relative to the full corpus, with a weighted average of 90.4% (per-agent breakdown in App. A13, Table 13). The orchestrator uses up to five tool-call rounds; Hybrid-Routed uses at most two LLM calls per query (one router, one synthesizer).
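The reduction for a scope of size $|\mathcal{C}_k|$ is $1-|\mathcal{C}_k|/N$. As a rough check, the sketch below applies this to the per-category chunk counts from Table 1, used here as a stand-in for the per-agent scope sizes of Table 13 (which Table 1's caption says are somewhat larger), so the outputs are approximations rather than the paper's per-agent figures.

```python
TOTAL_CHUNKS = 88907  # corpus size N from Table 1

# Per-category chunk counts from Table 1, used as approximate scope sizes.
SCOPE_CHUNKS = {
    "Standard Specs": 2519,
    "Construction Manual": 6641,
    "Materials Testing": 2180,
    "Design Manual": 1405,
    "Traffic & Crashes": 30922,
    "STIP": 13634,
    "Annual Reports": 2341,
    "Bridge Program": 5399,
}


def reduction_pct(scope_size: int, total: int = TOTAL_CHUNKS) -> float:
    """Percentage of the corpus excluded when searching only one scope."""
    return 100.0 * (1.0 - scope_size / total)


reductions = {name: reduction_pct(n) for name, n in SCOPE_CHUNKS.items()}
for name, pct in sorted(reductions.items(), key=lambda kv: kv[1]):
    print(f"{name:20s} {pct:5.1f}%")
```

Under this approximation the largest scope (Traffic & Crashes) gives the smallest reduction and the smallest scope (Design Manual) the largest, in line with the 65–98% range quoted above.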

We use orchestration for this multi-round tool loop and are explicit about what it is not: MASDR-RAG is a single reasoning agent with $K$ domain-scoped retrieval tools, and the per-domain “agents” are scope-bound tool configurations rather than autonomous agents that reason or communicate independently. The contrast with genuinely multi-agent RAG, where separate planner, extractor, and synthesis agents exchange intermediate reasoning, is drawn against the MA-RAG and SCOUT-RAG baselines in §2 and §8.

5 Evaluation: Proprietary Stack

We evaluate on 200 expert-validated WYDOT queries, with Gemini 2.5 Flash as the answer generator, 95% bootstrap CIs, and permutation tests at $\alpha=0.05$.
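For readers unfamiliar with the two statistical tools named above, here is a minimal standard-library sketch of a percentile bootstrap confidence interval and a paired sign-flip permutation test. The resample counts, the paired design, and the per-query scores are our assumptions for illustration; the section does not state which bootstrap or permutation variant was used.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-query scores."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def paired_permutation_p(a, b, n_permutations=5000, seed=0):
    """Two-sided paired permutation test on the mean difference.

    Each per-query difference keeps or flips its sign with probability 0.5.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(n_permutations):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)


# Synthetic per-query scores, invented for this example only.
baseline = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0] * 2
scoped = [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0] * 2

print(bootstrap_ci(scoped))
print(paired_permutation_p(scoped, baseline))
```

A difference would be called significant at $\alpha=0.05$ when the returned p-value falls below 0.05.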

Metrics:

We report four metrics throughout. P@10 and R@10 are precision and recall at rank 10, computed against the expert-labeled target scope of each query: a retrieved chunk counts as relevant if it belongs to that scope. Correctness (Corr) is a binary per-answer judgment: an LLM judge (Qwen-2.5-7B, distinct from every system under test) marks each generated answer as correct or incorrect against the reference answer, and we report the mean; the judge prompt and rubric are in App. A10 and App. A18. Faithfulness (Faith) is the RAGAS faithfulness score in $[0,1]$, the fraction of claims in the generated answer that are supported by the retrieved context. Unless noted, $n$ in a table is the number of queries scored.
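The scope-based precision, the mean correctness, and the faithfulness ratio can be written down directly from these definitions. The sketch below is illustrative only: the function names and example values are ours, claim extraction and support checking (which RAGAS and the LLM judge perform) are taken as given inputs, and R@10 is omitted because its denominator is not specified in this section.

```python
def precision_at_k(retrieved_scopes, target_scope, k=10):
    """Fraction of the top-k retrieved chunks whose scope matches the target."""
    top = retrieved_scopes[:k]
    return sum(1 for scope in top if scope == target_scope) / k


def mean_correctness(judgments):
    """Mean of binary per-answer judgments (1 = correct, 0 = incorrect)."""
    return sum(judgments) / len(judgments)


def faithfulness(supported_claims, total_claims):
    """Fraction of answer claims supported by the retrieved context."""
    if total_claims <= 0:
        raise ValueError("total_claims must be positive")
    return supported_claims / total_claims


# Hypothetical example: 8 of the top-10 chunks come from the target scope.
scopes = ["specs"] * 8 + ["construction"] * 2
print(precision_at_k(scopes, "specs"))
print(mean_correctness([1, 0, 1, 1]))
print(faithfulness(3, 5))
```

Because relevance is defined by scope membership, P@10 here measures whether retrieval stays inside the right document family, not whether each chunk answers the question.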

Table 3: WYDOT Gemini stack ($n=200$): scoping lifts P@10 $.77\to.86$; MASDR-RAG’s faithfulness collapses $.61\to.35$. ${}^{*}p<.05$, ${}^{**}p<.01$ vs. monolithic.

6 Open-Source Reproducibility

We re-ran all five systems with Qwen2.5-7B-Instruct and Llama-3-8B-Instruct synthesizers on BGE-M3 (Chen et al., 2024) embeddings; a single L40S GPU handles the 200-query sweep in ≈40 min (Qwen) / ≈100 min (Llama).
