MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval


Summary

MCompassRAG enhances retrieval-augmented generation by enriching chunk representations with topic metadata and using LLM-teacher distillation, achieving 8.24% average improvement in information efficiency with over 5x lower latency compared to strong baselines.

arXiv:2606.18508v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) systems depend critically on how documents are chunked and searched. Fine-grained chunks can improve retrieval precision but expand the search space, increasing latency and cost; larger chunks reduce the number of candidates but make dense similarity less reliable, as the representation for each chunk mixes multiple topics and introduces more semantic noise. This trade-off becomes especially limiting in deep research tasks, where retrieval must be both fast and precise across large, heterogeneous corpora. We introduce MCompassRAG, a metadata-guided retrieval framework that uses topic-level signals as a semantic compass for selecting relevant evidence. Instead of relying only on cosine similarity between queries and noisy chunk embeddings, MCompassRAG enriches chunk representations with topic metadata in the same embedding space and trains a lightweight retriever through LLM-teacher distillation. At inference time, MCompassRAG performs topic-aware retrieval without additional LLM calls, improving both efficiency and evidence quality. Across six complex retrieval benchmarks, MCompassRAG improves information efficiency (IE) by 8.24% on average with over 5 times lower latency than the strongest efficient RAG baselines. Code is available on https://github.com/AmirAbaskohi/MCompassRAG.

Cached at: 06/18/26, 05:45 AM

# MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval
Source: [https://arxiv.org/html/2606.18508](https://arxiv.org/html/2606.18508)
Amirhossein Abaskohi¹, Raymond Li¹, Gaetano Cimino², Peter West¹, Giuseppe Carenini¹, Issam H. Laradji¹,³

¹University of British Columbia, ²University of Salerno, ³ServiceNow Research



Corresponding author: Amirhossein Abaskohi (aabaskoh@cs.ubc.ca)

## 1Introduction

Retrieval-augmented generation (RAG) has become a standard paradigm for grounding large language models (LLMs) in external knowledge Lewis et al. ([2020](https://arxiv.org/html/2606.18508#bib.bib2)); Karpukhin et al. ([2020](https://arxiv.org/html/2606.18508#bib.bib3)). Yet the efficiency and quality of RAG hinge on a simple but consequential design choice: how documents are divided into retrievable units. This choice becomes especially important in deep research tasks Zhang et al. ([2025b](https://arxiv.org/html/2606.18508#bib.bib25)), where systems must search large corpora and often issue many retrieval calls before producing a final answer. Standard dense retrieval over fixed-size chunks Zhao et al. ([2024](https://arxiv.org/html/2606.18508#bib.bib26)) faces a granularity trade-off. Fine-grained chunks, such as sentences or short paragraphs, offer precise evidence but greatly increase the number of candidates to index and search. Larger chunks reduce the search space and improve retrieval efficiency, but they mix multiple topics and discourse roles into a single embedding. As a result, similarity scores become noisy: relevant evidence can be diluted by unrelated text, while partially relevant chunks may be retrieved despite containing mostly irrelevant content.

![Refer to caption](https://arxiv.org/html/2606.18508v1/x1.png) (a)
![Refer to caption](https://arxiv.org/html/2606.18508v1/x2.png) (b)

Figure 1: Overview of MCompassRAG. (a) MCompassRAG uses coarse chunks for efficiency and enriches them with topic vectors for topic-aware retrieval. At query time, relevant topic information guides retrieval over larger chunks. (b) MCompassRAG improves the performance–latency trade-off over strong RAG baselines, with performance measured by average F1 on HotpotQA Yang et al. ([2018](https://arxiv.org/html/2606.18508#bib.bib39)) and DRBench Abaskohi et al. ([2026](https://arxiv.org/html/2606.18508#bib.bib31)).

Prior work addresses chunk granularity by making chunks smaller, more structured, or hierarchically organized. Proposition-level retrieval decomposes documents into atomic units Chen et al. ([2024b](https://arxiv.org/html/2606.18508#bib.bib5)), LLM-guided segmentation improves chunk boundaries Zhao et al. ([2025b](https://arxiv.org/html/2606.18508#bib.bib6), [a](https://arxiv.org/html/2606.18508#bib.bib13)), and hierarchical methods such as RAPTOR retrieve across multiple abstraction levels Sarthi et al. ([2024](https://arxiv.org/html/2606.18508#bib.bib7)). While effective, these approaches often increase pre-processing cost, require additional indices, or introduce extra scoring and selection stages. LLM-based re-ranking and evidence selection can further improve quality Tao et al. ([2025](https://arxiv.org/html/2606.18508#bib.bib1)), but add latency at inference time, which is problematic for deep research agents that repeatedly retrieve evidence over large corpora Zheng et al. ([2025](https://arxiv.org/html/2606.18508#bib.bib53)).

In this work, we take a different approach: rather than making chunks increasingly fine-grained, adding hierarchical retrieval stages, or relying on expensive post-retrieval filtering, we make coarse-grained chunks more searchable. As shown in Figure [1(a)](https://arxiv.org/html/2606.18508#S1.F1.sf1), MCompassRAG enriches each chunk with topic metadata that acts as a semantic compass for retrieval. Specifically, a topic modeling encoder maps documents and chunks into topic-aware vectors in the same semantic space as the retriever. These topic vectors expose the main semantic directions covered by each coarse chunk, allowing retrieval to look beyond a single noisy chunk embedding. At query time, MCompassRAG derives a compact query-side topic representation from the metadata bank and uses it to score metadata-enriched chunks. MCompassRAG is agnostic to the specific topic model, requiring only that topics be embedded in the retriever's semantic space. We train MCompassRAG as an extreme multi-label classifier Prabhu et al. ([2025](https://arxiv.org/html/2606.18508#bib.bib27)) using LLM-teacher distillation, where a lightweight student learns to identify multiple relevant chunks from metadata-enriched representations without LLM calls at inference time. This preserves the efficiency advantage of larger chunks while reducing the semantic noise that makes coarse-grained cosine retrieval unreliable. Across six complex retrieval benchmarks, MCompassRAG improves information efficiency by 8.24% on average over the strongest non-LLM baseline while running at over 5× lower latency compared to strong LLM-based RAG baselines, reflecting the efficiency–quality trade-off illustrated in Figure [1(b)](https://arxiv.org/html/2606.18508#S1.F1.sf2).

Our contributions are threefold. First, we introduce MCompassRAG, a metadata-guided retrieval framework that improves coarse-grained retrieval by using selected topic metadata to make large chunks more precisely searchable without increasing the retrieval search space. Second, we design a metadata selection and abstraction mechanism that first selects the topical metadata most relevant to the query from a corpus-level metadata bank, then summarizes these signals into a compact query-topic vector used for chunk scoring. This makes the query representation topic-aware before matching it against coarse-grained chunks. Third, we distill an LLM teacher into a lightweight student retriever trained with an extreme multi-label objective, enabling efficient topic-aware evidence selection without inference-time LLM calls while preserving most teacher-guided retrieval quality.

## 2Related Work

![Refer to caption](https://arxiv.org/html/2606.18508v1/x3.png)

Figure 2: Overview of MCompassRAG. During training, an LLM teacher provides relevance supervision, with query expansion used only as an additional teacher-side metadata signal. The metadata bank is built from chunks, enriched with document-topic vectors and topic centroid embeddings. At inference time, MCompassRAG selects and abstracts query-relevant topic metadata, then scores query–chunk pairs with a lightweight student retriever. Icons in the figure mark which components are trained and which are frozen.

#### Retrieval Granularity and Structured Retrieval in RAG.

RAG grounds language model generation in external evidence retrieved before generation Lewis et al. ([2020](https://arxiv.org/html/2606.18508#bib.bib2)); Karpukhin et al. ([2020](https://arxiv.org/html/2606.18508#bib.bib3)); Izacard and Grave ([2021](https://arxiv.org/html/2606.18508#bib.bib4)). A key design choice is retrieval granularity: fine-grained units improve evidence precision but enlarge the search space and may lose context, while coarse-grained units preserve context and reduce candidates but make dense similarity noisier due to mixed topics and irrelevant content. Prior work addresses this trade-off through alternative retrieval units or index structures, including proposition-level retrieval Chen et al. ([2024b](https://arxiv.org/html/2606.18508#bib.bib5)), LLM-guided and adaptive chunking Zhao et al. ([2025b](https://arxiv.org/html/2606.18508#bib.bib6), [a](https://arxiv.org/html/2606.18508#bib.bib13)), query-adaptive granularity selection Zhang et al. ([2026](https://arxiv.org/html/2606.18508#bib.bib21)), and hierarchical retrieval across abstraction levels Sarthi et al. ([2024](https://arxiv.org/html/2606.18508#bib.bib7)). Other systems enrich retrieved evidence to reduce context fragmentation Tao et al. ([2025](https://arxiv.org/html/2606.18508#bib.bib1)) or promote diversity and coverage during selection Khan et al. ([2026](https://arxiv.org/html/2606.18508#bib.bib15)). While effective, these methods often require finer-grained indexing, adaptive selection, hierarchical structures, extra scoring stages, or LLM-based filtering. In contrast, MCompassRAG preserves the efficiency of coarse-grained retrieval while making larger chunks more searchable with topic-level metadata.

#### Semantic Guidance and Efficient Retrieval\.

A complementary line of work improves RAG by modifying the query or retrieval process rather than the chunking strategy itself. Query augmentation methods such as HyDE Gao et al. ([2023](https://arxiv.org/html/2606.18508#bib.bib8)), query expansion Wang et al. ([2023](https://arxiv.org/html/2606.18508#bib.bib9)); Zhou et al. ([2024](https://arxiv.org/html/2606.18508#bib.bib10)), and decomposition-based retrieval Trivedi et al. ([2023](https://arxiv.org/html/2606.18508#bib.bib11)); Zheng et al. ([2024](https://arxiv.org/html/2606.18508#bib.bib12)) aim to better align the query with relevant evidence by generating hypothetical answers, adding related terms, or breaking complex questions into simpler retrieval steps. Adaptive and iterative retrieval methods further refine the evidence set through repeated retrieval, reranking, or sufficiency checking Verma et al. ([2026](https://arxiv.org/html/2606.18508#bib.bib14)). These methods are effective when the query underspecifies the needed evidence, but they often introduce extra inference-time computation. Separately, generation-side efficiency methods compress or reorganize retrieved context after retrieval to reduce decoding cost Lin et al. ([2025](https://arxiv.org/html/2606.18508#bib.bib19)); Louis et al. ([2026](https://arxiv.org/html/2606.18508#bib.bib20)). MCompassRAG is orthogonal to these directions: rather than generating additional query text, repeatedly retrieving, or compressing context after retrieval, it uses corpus-derived topic metadata as a compact semantic guide before retrieval. This guides retrieval toward query-relevant topics without inference-time LLM calls, and remains compatible with query expansion, iterative retrieval, reranking, and context compression.

## 3MCompassRAG

MCompassRAG is a metadata-guided retrieval framework that makes coarse-grained chunks more searchable without increasing the retrieval search space. Given chunks $\mathcal{C}=\{c_1,\ldots,c_N\}$ and a query $q$, the goal is to retrieve the top-$k$ chunks that provide useful evidence for answering the query. Instead of relying only on cosine similarity between query and chunk embeddings, MCompassRAG augments both queries and chunks with topic-level metadata, allowing the retriever to better identify which semantic directions within a large chunk are relevant.

Figure [2](https://arxiv.org/html/2606.18508#S2.F2) illustrates the full pipeline. First, each chunk is processed by a topic model to obtain a chunk-topic distribution, while topic centroids provide embedding-space representations of the topics. The chunk-topic distributions are cached in a corpus-level metadata bank and later used as query-side guidance. At inference time, the base query is encoded by the student encoder, and a selection policy compares the query embedding with metadata entries from the bank to select the most relevant topic distributions. An abstraction module then summarizes the selected metadata distributions into a refined query-topic distribution, reducing noise and bias from any single selected entry. This refined distribution is converted into a compact query-side topic vector and concatenated with the query embedding to form the metadata-enriched query representation. The student MLP classifier then scores this representation against each metadata-enriched chunk representation and returns the top-$k$ chunks. During training, an LLM teacher provides relevance supervision using expanded queries, while the student receives only the base query and learns through BCE and knowledge-distillation losses. Thus, query expansion and LLM teacher scoring are used only for training; inference requires only metadata selection, abstraction, and student scoring. The framework can use any topic model whose topics are represented in the retriever embedding space and that provides chunk-level topic distributions. In our implementation, we use CEMTM Abaskohi et al. ([2025](https://arxiv.org/html/2606.18508#bib.bib22)), an LLM-distilled topic model that also leverages attention signals to produce document-topic distributions.

### 3\.1Topic Metadata and Metadata Bank

Let $\{\mathbf{t}_k\}_{k=1}^{K}$ denote the topic centroids, where each $\mathbf{t}_k\in\mathbb{R}^{d}$ lies in the retriever embedding space and serves as the vector representation, or prototype, of topic $k$. Each chunk $c$ is associated with a topic distribution $\boldsymbol{\theta}_c\in\mathbb{R}^{K}$, where $\theta_{c,r}$ measures the strength of topic $r$ in chunk $c$. Since chunks are longer and more informative than queries, their topic distributions can be computed reliably and cached offline. MCompassRAG stores these chunk-level topic distributions in a metadata bank:

$$\mathcal{M}=\{\boldsymbol{\theta}_{c_1},\ldots,\boldsymbol{\theta}_{c_N}\}. \tag{1}$$

The metadata bank represents the topical structure of the corpus and serves as the source of query-side guidance at inference time. Intuitively, it provides a corpus-level map of the semantic regions that queries may need to search, without relying only on the sparse signal in the query itself. Given a new query, MCompassRAG does not directly rely on the query's own topic distribution, which may be unreliable due to its short length. Instead, it selects relevant topic distributions from $\mathcal{M}$ and abstracts them into a compact query-side topic representation. This abstraction step reduces bias toward any single selected chunk and produces a smoother topical signal, as described in Section [3.2](https://arxiv.org/html/2606.18508#S3.SS2).

### 3\.2Metadata Selection and Representation

At inference time, MCompassRAG selects topic metadata from the bank that is relevant to the input query. The query is first encoded by the student encoder, $f_\psi$:

$$\mathbf{e}_q=f_\psi(q)\in\mathbb{R}^{d}. \tag{2}$$

We implement the selection policy as a lightweight scoring module over the concatenation of the query embedding and each metadata-entry embedding. Each metadata entry $\boldsymbol{\theta}_{c_i}$ is first converted into an embedding-space summary:

$$\mathbf{m}_i=\sum_{k=1}^{K}\theta_{c_i,k}\,\mathbf{t}_k. \tag{3}$$

The selector then assigns an unnormalized compatibility score between the query embedding and each metadata-entry summary:

$$a_i=\mathbf{w}_s^{\top}[\mathbf{e}_q;\mathbf{m}_i]+b_s, \tag{4}$$

where $[\cdot;\cdot]$ denotes concatenation. The scores are converted into a probability distribution over metadata entries using a softmax operation:

$$s_i=\frac{\exp(a_i)}{\sum_{j=1}^{N}\exp(a_j)}. \tag{5}$$

The top-$L$ metadata entries according to $s_i$ are selected and passed to the abstraction module.
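The selection step in Eqs. 3 to 5 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_metadata`, the toy sizes, and the random parameters are hypothetical, and the bank is assumed to be stored as an $N\times K$ matrix of topic distributions next to a $K\times d$ centroid matrix.

```python
import numpy as np

def select_metadata(e_q, theta_bank, centroids, w_s, b_s, top_l):
    """Return indices of the top-L bank entries and the softmax scores s."""
    # Eq. 3: embedding-space summary of every bank entry, shape (N, d).
    m = theta_bank @ centroids
    # Eq. 4: linear score over the concatenation [e_q; m_i], shape (N,).
    e_rep = np.broadcast_to(e_q, m.shape)
    a = np.concatenate([e_rep, m], axis=1) @ w_s + b_s
    # Eq. 5: softmax over all bank entries (shifted for numerical stability).
    s = np.exp(a - a.max())
    s /= s.sum()
    # Keep the L highest-scoring entries, best first.
    idx = np.argsort(-s)[:top_l]
    return idx, s

# Toy data (hypothetical sizes): N bank entries, K topics, d embedding dims.
rng = np.random.default_rng(0)
N, K, d = 6, 4, 3
theta_bank = rng.dirichlet(np.ones(K), size=N)  # rows sum to 1
centroids = rng.normal(size=(K, d))
e_q = rng.normal(size=d)
w_s = rng.normal(size=2 * d)
idx, s = select_metadata(e_q, theta_bank, centroids, w_s, 0.1, top_l=2)
```

One property of the purely linear form in Eq. 4 shows up in this sketch: the query block of $\mathbf{w}_s$ adds the same offset to every $a_i$, so it cancels in the softmax and the ranking over entries is driven by the $\mathbf{m}_i$ term.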

The selected entries are stacked into the input of the abstraction module:

$$\mathbf{H}^{(0)}=[\boldsymbol{\theta}_{c_{j_1}};\ldots;\boldsymbol{\theta}_{c_{j_L}}]\in\mathbb{R}^{L\times K}. \tag{6}$$

After a two-layer Transformer encoder Vaswani et al. ([2017](https://arxiv.org/html/2606.18508#bib.bib54)), the outputs are mean-pooled to form a refined query topic distribution:

$$\hat{\boldsymbol{\theta}}_q=\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{H}^{(2)}_{\ell}. \tag{7}$$

This abstraction step combines complementary topic signals and suppresses redundant or noisy metadata entries. MCompassRAG then constructs topic-enriched representations for both chunks and queries. For a chunk $c$, we select the top-$M$ topics from its topic distribution (here, $L$ is the number of selected metadata entries, while $M$ is the number of selected topics):

$$\mathcal{T}_c=\operatorname{top-}M(\boldsymbol{\theta}_c), \tag{8}$$

and aggregate their topic centroids:

$$\mathbf{g}_c=\sum_{k\in\mathcal{T}_c}\theta_{c,k}\,\mathbf{t}_k. \tag{9}$$

The final chunk representation is $\mathbf{r}_c=[\mathbf{e}_c;\mathbf{g}_c]$, where $\mathbf{e}_c=f_\psi(c)$ is the chunk embedding produced by the student encoder. Similarly, the refined query topic distribution $\hat{\boldsymbol{\theta}}_q$ is used to build a query-side topic summary $\mathbf{g}_q$ with the top-$M$ topics, yielding $\mathbf{r}_q=[\mathbf{e}_q;\mathbf{g}_q]$.
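As a small worked example of Eqs. 8 and 9, the sketch below builds a topic-enriched representation from a three-topic distribution. The helper names `topic_summary` and `enrich` and the toy numbers are hypothetical; the weights of the selected topics are used as given, without renormalization, which matches Eq. 9 as written.

```python
import numpy as np

def topic_summary(theta, centroids, top_m):
    # Eq. 8: indices of the M largest topic weights.
    top = np.argsort(-theta)[:top_m]
    # Eq. 9: weighted sum of the selected centroids (weights not renormalized).
    return theta[top] @ centroids[top]

def enrich(embedding, theta, centroids, top_m):
    # r = [e; g]: the embedding concatenated with its topic summary.
    return np.concatenate([embedding, topic_summary(theta, centroids, top_m)])

theta_c = np.array([0.1, 0.6, 0.3])             # chunk-topic distribution, K = 3
centroids = np.array([[1.0, 0.0],               # t_1
                      [0.0, 1.0],               # t_2
                      [1.0, 1.0]])              # t_3
e_c = np.array([0.5, -0.5])                     # chunk embedding, d = 2

# Top-2 topics are t_2 (0.6) and t_3 (0.3), so g_c = 0.6*t_2 + 0.3*t_3 = [0.3, 0.9].
r_c = enrich(e_c, theta_c, centroids, top_m=2)  # approximately [0.5, -0.5, 0.3, 0.9]
```

The same two functions apply on the query side by passing $\mathbf{e}_q$ and $\hat{\boldsymbol{\theta}}_q$ in place of the chunk quantities.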

The student retriever scores each query–chunk pair with a three-layer MLP classifier:

$$z(q,c)=\operatorname{MLP}_\phi([\mathbf{r}_q;\mathbf{r}_c]), \tag{10}$$

where $z(q,c)$ is the predicted relevance logit. This formulation casts retrieval as an extreme multi-label classification problem: each chunk is a candidate label, and each query may correspond to multiple relevant chunks.
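The scoring step of Eq. 10 followed by top-$k$ selection might look as follows. This is a sketch under assumptions rather than the released model: the ReLU activations, the hidden width, and the random weights are illustrative choices, since the text specifies only that the classifier is a three-layer MLP over $[\mathbf{r}_q;\mathbf{r}_c]$.

```python
import numpy as np

def mlp_logits(r_q, R_c, params):
    """Score one query against all N chunks; returns logits of shape (N,)."""
    n = R_c.shape[0]
    # Pair the query representation with every chunk: rows are [r_q; r_c].
    x = np.concatenate([np.broadcast_to(r_q, (n, r_q.size)), R_c], axis=1)
    (W1, b1), (W2, b2), (W3, b3) = params
    h = np.maximum(x @ W1 + b1, 0.0)   # layer 1 + ReLU (assumed activation)
    h = np.maximum(h @ W2 + b2, 0.0)   # layer 2 + ReLU
    return (h @ W3 + b3).ravel()       # layer 3: one relevance logit per chunk

def top_k(logits, k):
    # Indices of the k highest logits, best first.
    return np.argsort(-logits)[:k]

rng = np.random.default_rng(0)
N, D, H = 5, 4, 8                      # chunks, size of r_q and r_c, hidden width
params = [
    (rng.normal(size=(2 * D, H)), np.zeros(H)),
    (rng.normal(size=(H, H)), np.zeros(H)),
    (rng.normal(size=(H, 1)), np.zeros(1)),
]
r_q = rng.normal(size=D)
R_c = rng.normal(size=(N, D))          # precomputed chunk representations
z = mlp_logits(r_q, R_c, params)
retrieved = top_k(z, k=3)
```

Because the chunk representations are fixed, only `r_q` changes per query, so the whole corpus can be scored in one batched matrix product.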

### 3\.3Training with LLM\-Teacher Distillation

Training data construction. We synthesize training data from the training split of each benchmark. For each dataset, we sample 2,000 chunks and use GPT-4o (OpenAI, [2024](https://arxiv.org/html/2606.18508#bib.bib18)) to generate 10 natural queries per chunk, resulting in 20,000 query–chunk pairs before negative sampling. For each sampled chunk $c_i$, GPT-4o receives the target chunk together with its preceding and following chunks. It first generates a base query $q_i$ whose answer requires evidence from $c_i$. It then generates an expanded query $\tilde{q}_i$ by adding only background information from the two neighboring chunks, without revealing the answer or including answer-specific hints. We use Prompt [A.1](https://arxiv.org/html/2606.18508#A1) for the query expansion.

Training procedure and objective. For relevance supervision, the source chunk is treated as a positive candidate, while negatives are sampled from non-matching chunks. We include both random negatives and hard negatives, where hard negatives are retrieved using Qwen3-Embedding-4B Zhang et al. ([2025c](https://arxiv.org/html/2606.18508#bib.bib29)) as high-similarity chunks that the LLM teacher judges as not useful for answering the query. GPT-4o is then used as an LLM teacher: given the expanded query $\tilde{q}_i$ and a candidate chunk, it predicts whether the chunk provides direct or supporting evidence for answering the query (see Prompt [A.2](https://arxiv.org/html/2606.18508#A1)). The resulting hard label $y\in\{0,1\}$ and teacher score/logit $z^{\mathrm{T}}$ are used as supervision for the student relevance classifier.

The teacher scores each query–chunk pair using the expanded query $\tilde{q}_i$, whereas the student receives only the base query $q_i$. This information asymmetry encourages the student to recover useful missing context through metadata selection and abstraction. The training objective combines hard-label binary cross-entropy with soft teacher distillation:

$$\mathcal{L}=(1-\alpha)\,\mathcal{L}_{\mathrm{BCE}}+\alpha\,\mathcal{L}_{\mathrm{KD}}, \tag{11}$$

where $\alpha$ balances hard-label learning and soft distillation. The binary cross-entropy loss is

$$\mathcal{L}_{\mathrm{BCE}}=-y\log\sigma(z)-(1-y)\log(1-\sigma(z)), \tag{12}$$

where $z$ is the student relevance logit and $\sigma$ is the sigmoid function. The distillation term matches the teacher and student soft scores:

$$\mathcal{L}_{\mathrm{KD}}=\mathrm{KL}\left(\sigma(z^{\mathrm{T}}/\tau)\;\|\;\sigma(z/\tau)\right), \tag{13}$$

where $z^{\mathrm{T}}$ is the teacher score/logit and $\tau$ is the temperature. The student encoder, topic centroids, and cached chunk topic distributions are kept fixed. We train only the metadata selector, abstraction module, and MLP relevance classifier.
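A per-example version of the objective in Eqs. 11 to 13 can be written out directly. The sketch assumes the KL term is taken between the two Bernoulli distributions induced by the tempered sigmoids, which is one natural reading of Eq. 13; the default values of `alpha` and `tau` are placeholders, not values reported in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(z, y):
    # Eq. 12: hard-label binary cross-entropy on the student logit z.
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bernoulli_kl(p, q):
    # KL(Bernoulli(p) || Bernoulli(q)); assumes 0 < p, q < 1.
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def distill_loss(z, z_teacher, y, alpha=0.5, tau=2.0):
    # Eq. 13: teacher vs. student soft scores at temperature tau.
    kd = bernoulli_kl(sigmoid(z_teacher / tau), sigmoid(z / tau))
    # Eq. 11: convex combination of the hard-label and distillation terms.
    return (1 - alpha) * bce(z, y) + alpha * kd

# A student that already matches the teacher pays only the BCE part.
matched = distill_loss(z=3.0, z_teacher=3.0, y=1.0)
# A student far from a confident teacher pays both terms.
mismatched = distill_loss(z=0.0, z_teacher=3.0, y=1.0)
```

When the student logit equals the teacher logit the KL term vanishes, so `matched` reduces to $(1-\alpha)\,\mathcal{L}_{\mathrm{BCE}}$, while `mismatched` is larger on both terms.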

**Algorithm 1** MCompassRAG Inference

**Input:** Query $q$, precomputed chunk representations $\{\mathbf{r}_{c_j}\}_{j=1}^{N}$, metadata bank $\mathcal{M}$, topic centroids $\{\mathbf{t}_r\}_{r=1}^{K}$, selected metadata count $L$, top topics $M$, retrieved chunks $k$

**Output:** Retrieved chunk set $\mathcal{C}_k$

1. $\mathbf{e}_q\leftarrow f_\psi(q)$
2. // Metadata selection
3. **for each** metadata entry $\boldsymbol{\theta}_{c_i}\in\mathcal{M}$ **do**
4. $\mathbf{m}_i\leftarrow\sum_{r=1}^{K}\theta_{c_i,r}\,\mathbf{t}_r$
5. $a_i\leftarrow\mathbf{w}_s^{\top}[\mathbf{e}_q;\mathbf{m}_i]+b_s$
6. **end for**
7. $s_i\leftarrow\dfrac{\exp(a_i)}{\sum_{j=1}^{|\mathcal{M}|}\exp(a_j)}$
8. $\mathcal{S}\leftarrow\operatorname{top-}L(\{s_i\})$
9. // Metadata abstraction
10. $\mathbf{H}^{(0)}\leftarrow[\boldsymbol{\theta}_{c_j}]_{j\in\mathcal{S}}$
11. $\hat{\boldsymbol{\theta}}_q\leftarrow\operatorname{MeanPool}(\operatorname{TransformerEnc}(\mathbf{H}^{(0)}))$
12. $\mathcal{T}_q\leftarrow\operatorname{top-}M(\hat{\boldsymbol{\theta}}_q)$
13. $\mathbf{g}_q\leftarrow\sum_{r\in\mathcal{T}_q}\hat{\theta}_{q,r}\,\mathbf{t}_r$
14. $\mathbf{r}_q\leftarrow[\mathbf{e}_q;\mathbf{g}_q]$
15. // Retrieval
16. **for each** precomputed $\mathbf{r}_{c_j}$ **do**
17. $z_j\leftarrow\operatorname{MLP}_\phi([\mathbf{r}_q;\mathbf{r}_{c_j}])$
18. **end for**
19. $\mathcal{C}_k\leftarrow\operatorname{top-}k_{c_j\in\mathcal{C}}(\{z_j\})$
20. **return** $\mathcal{C}_k$

### 3\.4Inference

At inference time, MCompassRAG retrieves evidence without LLM calls. All chunk embeddings, topic distributions, and topic-enriched chunk representations are precomputed offline as indices for retrieval. For a given query, MCompassRAG computes the query embedding, selects and abstracts relevant metadata from the bank, scores all cached chunks with the MLP classifier, and returns the top-$k$ results. Algorithm [1](https://arxiv.org/html/2606.18508#alg1) summarizes this procedure. Since topic extraction and chunk encoding are offline, online inference only requires lightweight metadata selection, abstraction, and scoring.

## 4Experiments and Results

### 4\.1Experimental Setup

Models and implementation. We use Qwen3-Embedding-4B (Zhang et al., [2025c](https://arxiv.org/html/2606.18508#bib.bib29)) as the student encoder for query and chunk representations, and Qwen3-32B (Team, [2025](https://arxiv.org/html/2606.18508#bib.bib28)) as both the LLM teacher for relevance supervision and the final answer generator. For baselines requiring LLM-based generation, planning, or selection, we use the same LLM scale for fair comparison. When a baseline requires reranking, we use Qwen3-Reranker-4B (Zhang et al., [2025c](https://arxiv.org/html/2606.18508#bib.bib29)). Closed-source API-based components are accessed through OpenRouter ([https://openrouter.ai/](https://openrouter.ai/)). All experiments are run with access to 8 NVIDIA A100 80GB GPUs.

Topic metadata. We use CEMTM Abaskohi et al. ([2025](https://arxiv.org/html/2606.18508#bib.bib22)) with Qwen3-Embedding-4B as the topic modeling backbone. CEMTM is trained on WikiWeb2M Burns et al. ([2023](https://arxiv.org/html/2606.18508#bib.bib30)) with $K=100$ topics. See Appendix [E](https://arxiv.org/html/2606.18508#A5) for the topic granularity analysis. We use only the CEMTM encoder to obtain chunk-level document-topic vectors and topic centroid embeddings. Since the LLM teacher also requires topic-aware representations, we additionally use a Qwen3-32B-based CEMTM variant for teacher-side topic modeling. We ablate the in-domain topic modeling in Appendix [F](https://arxiv.org/html/2606.18508#A6).

Benchmarks. We evaluate on seven benchmarks: SCI-DOCS Cohan et al. ([2020](https://arxiv.org/html/2606.18508#bib.bib36)), LegalBench-RAG Pipitone and Alami ([2024](https://arxiv.org/html/2606.18508#bib.bib37)), Dragonball Zhu et al. ([2025](https://arxiv.org/html/2606.18508#bib.bib38)), HotpotQA Yang et al. ([2018](https://arxiv.org/html/2606.18508#bib.bib39)), SQuAD Rajpurkar et al. ([2016](https://arxiv.org/html/2606.18508#bib.bib40)), DRBench Abaskohi et al. ([2026](https://arxiv.org/html/2606.18508#bib.bib31)), and LongBenchV2 Bai et al. ([2025](https://arxiv.org/html/2606.18508#bib.bib32)). For retrieval evaluation, we use SCI-DOCS, LegalBench-RAG, Dragonball, HotpotQA, SQuAD, and DRBench, which provide evidence annotations or links convertible to chunk-level labels. We use LongBenchV2 only for downstream evaluation, as it lacks chunk-level evidence labels. See Appendix [B](https://arxiv.org/html/2606.18508#A2) for more details.

| Method | Dragonball IE↑ | Dragonball Prec.↑ | Dragonball Rec.↑ | HotpotQA IE↑ | HotpotQA Prec.↑ | HotpotQA Rec.↑ | SQuAD IE↑ | SQuAD Prec.↑ | SQuAD Rec.↑ |
|---|---|---|---|---|---|---|---|---|---|
| RAPTOR | 30.13±.41 | 39.40±.52 | 10.53±.29 | 45.43±.63 | 59.63±.58 | 13.70±.34 | 60.70±.71 | 32.77±.44 | 21.13±.39 |
| Meta-Chunking-MSP | 31.40±.38 | 40.20±.47 | 11.63±.31 | 55.70±.69 | 64.30±.62 | 17.97±.42 | 80.60±.58 | 41.97±.53 | 34.40±.49 |
| Meta-Chunking-PPL | 40.87±.45 | 42.80±.50 | 15.73±.36 | 66.77±.73 | 65.23±.64 | 21.40±.47 | 78.80±.62 | 41.37±.55 | 33.70±.51 |
| DenseXRetrieval | 2.27±.12 | 4.40±.18 | 0.09±.03 | 35.60±.56 | 43.17±.49 | 7.03±.21 | 61.53±.68 | 31.17±.46 | 19.83±.37 |
| SAKI-RAG | 32.90±.42 | 71.37±.66 | 25.40±.45 | 58.73±.70 | 55.60±.59 | 30.03±.52 | 87.17±.51 | 88.80±.43 | 78.93±.57 |
| LLM | 34.73±.39 | 76.53±.61 | 27.30±.43 | 62.63±.67 | 55.83±.55 | 33.50±.49 | 89.93±.46 | 91.63±.40 | 82.77±.52 |
| LLM + 10 Topics | 40.83±.34 | 87.43±.49 | 34.17±.38 | 72.90±.58 | 59.33±.51 | 42.70±.44 | 94.10±.33 | 95.83±.29 | 89.50±.36 |
| MCompassRAG + 10 Topics | 38.97±.36 | 82.80±.52 | 32.40±.40 | 70.17±.61 | 56.40±.48 | 40.63±.46 | 93.80±.35 | 95.37±.31 | 88.90±.38 |

| Method | DRBench IE↑ | DRBench Prec.↑ | DRBench Rec.↑ | LegalBench-RAG IE↑ | LegalBench-RAG Prec.↑ | LegalBench-RAG Rec.↑ | SCI-DOCS IE↑ | SCI-DOCS Prec.↑ | SCI-DOCS Rec.↑ |
|---|---|---|---|---|---|---|---|---|---|
| RAPTOR | 24.13±.37 | 32.77±.44 | 8.20±.25 | 24.27±.35 | 32.23±.42 | 8.20±.24 | 88.63±.54 | 82.77±.50 | 80.37±.55 |
| Meta-Chunking-MSP | 30.60±.42 | 36.13±.47 | 12.30±.31 | 28.30±.39 | 36.10±.45 | 11.07±.29 | 90.47±.49 | 83.53±.48 | 82.10±.52 |
| Meta-Chunking-PPL | 36.30±.49 | 37.57±.51 | 16.17±.34 | 32.70±.43 | 37.53±.48 | 13.57±.32 | 21.07±.36 | 17.60±.31 | 3.57±.15 |
| DenseXRetrieval | 18.40±.31 | 25.37±.38 | 5.43±.19 | 19.53±.33 | 24.93±.36 | 5.13±.18 | 86.00±.57 | 79.33±.53 | 74.67±.60 |
| SAKI-RAG | 37.47±.46 | 62.30±.61 | 28.23±.43 | 31.23±.41 | 46.30±.52 | 19.27±.36 | 86.53±.50 | 92.27±.43 | 84.30±.51 |
| LLM | 41.53±.44 | 68.43±.57 | 32.27±.41 | 33.93±.39 | 50.40±.49 | 22.13±.35 | 89.37±.45 | 95.10±.34 | 87.47±.43 |
| LLM + 10 Topics | 50.27±.39 | 83.17±.46 | 43.30±.37 | 40.10±.34 | 59.47±.43 | 29.70±.31 | 94.67±.30 | 99.50±.12 | 92.50±.28 |
| MCompassRAG + 10 Topics | 47.97±.41 | 78.57±.49 | 41.20±.39 | 38.40±.36 | 55.10±.45 | 27.90±.33 | 94.13±.32 | 99.03±.15 | 92.10±.29 |

Table 1: Retrieval performance across six benchmarks, averaged over three runs. ± values denote standard deviation. Bold = best; underline = second-best; shaded rows indicate MCompassRAG. LLM-based rows are inference-time oracle upper bounds. Detailed $k = 1, 3, 5$ results are in Appendix [D](https://arxiv.org/html/2606.18508#A4).

The final-performance columns cover HotpotQA, LongBench v2, DRBench, and Dragonball, in that order; the last two columns report average cost.

| Group | Method | F1↑ | F1↑ | Acc↑ | F1↑ | R-L↑ | MTR↑ | BRT↑ | Tok/Q↓ | Lat. (ms)↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Efficient RAG Methods | Dense X Retrieval | 60.9 | 26.4 | 28.6 | 46.8 | 0.248 | 0.269 | 0.548 | 2,759 | 112 |
| Efficient RAG Methods | Meta-Chunking-PPL | 64.5 | 29.7 | 31.8 | 50.7 | 0.272 | 0.292 | 0.571 | 2,394 | 95 |
| Efficient RAG Methods | RAPTOR | 63.1 | 28.3 | 30.4 | 49.1 | 0.264 | 0.285 | 0.563 | 3,183 | 145 |
| Efficient RAG Methods | ReflectiveRAG | 67.4 | 31.5 | 33.4 | 53.4 | 0.303 | 0.325 | 0.604 | 3,527 | 161 |
| Efficient RAG Methods | DF-RAG | 66.2 | 30.2 | 32.3 | 52.1 | 0.291 | 0.313 | 0.592 | 4,843 | 484 |
| Efficient RAG Methods | SAKI-RAG | 68.6 | 32.6 | 34.5 | 55.2 | 0.314 | 0.336 | 0.619 | 5,584 | 925 |
| Efficient RAG Methods | REFRAG | 73.6 | 37.5 | 39.4 | 60.4 | 0.354 | 0.371 | 0.650 | 7,800 | 720 |
| Long-Context Methods | PageIndex | 78.7 | 41.9 | 43.6 | 65.8 | 0.372 | 0.394 | 0.682 | 53,883 | 4,408 |
| Long-Context Methods | A-RAG | 74.9 | 38.7 | 40.4 | 62.4 | 0.347 | 0.369 | 0.655 | 14,625 | 2,557 |
| Long-Context Methods | Chroma Context-1 | 76.1 | 40.1 | 41.8 | 64.1 | 0.359 | 0.382 | 0.669 | 20,430 | 3,026 |
| Long-Context Methods | LLM | 72.9 | 36.9 | 38.8 | 59.3 | 0.352 | 0.362 | 0.642 | 41,058 | 3,388 |
| Ours | MCompassRAG | 71.8 | 35.8 | 35.7 | 58.9 | 0.333 | 0.355 | 0.635 | 4,126 | 174 |

Table 2: Downstream performance and efficiency across four benchmarks. We report task-specific generation metrics: Accuracy/F1 for QA-style datasets and ROUGE-L (R-L), METEOR (MTR), and BERTScore (BRT) for free-form generation. Tok/Q denotes the average retrieved tokens per query, and Lat. denotes end-to-end latency.

Baselines. We compare against dense, structured, long-context, and LLM-based RAG baselines: DenseXRetrieval (Chen et al., [2024b](https://arxiv.org/html/2606.18508#bib.bib5)), Meta-Chunking with PPL and MSP variants (Zhao et al., [2025b](https://arxiv.org/html/2606.18508#bib.bib6)), RAPTOR (Sarthi et al., [2024](https://arxiv.org/html/2606.18508#bib.bib7)), ReflectiveRAG (Verma et al., [2026](https://arxiv.org/html/2606.18508#bib.bib14)), DF-RAG (Khan et al., [2026](https://arxiv.org/html/2606.18508#bib.bib15)), SAKI-RAG (Tao et al., [2025](https://arxiv.org/html/2606.18508#bib.bib1)), REFRAG (Lin et al., [2025](https://arxiv.org/html/2606.18508#bib.bib19)), PageIndex (Zhang et al., [2025a](https://arxiv.org/html/2606.18508#bib.bib41)), A-RAG (Du et al., [2026](https://arxiv.org/html/2606.18508#bib.bib42)), Chroma Context-1 (Bashir et al., [2026](https://arxiv.org/html/2606.18508#bib.bib43)), and a long-context Qwen3-32B baseline. For retrieval evaluation, we include DenseXRetrieval, Meta-Chunking, RAPTOR, SAKI-RAG, and LLM retrievers, with both topic-free and topic-guided LLM variants. Other baselines are evaluated only downstream, as they mainly target generation, decoding, reranking, or context-use efficiency rather than standalone retrieval. Refer to Appendix [B](https://arxiv.org/html/2606.18508#A2) for more details.

Training and evaluation. We train MCompassRAG separately for each benchmark, using synthetic training data when retrieval labels are unavailable or insufficient; for DRBench and LongBenchV2, we train on EDR-200 (Prabhakar et al., [2025](https://arxiv.org/html/2606.18508#bib.bib33)) and LongBenchV1 (Bai et al., [2024](https://arxiv.org/html/2606.18508#bib.bib34)), respectively. We train only the metadata selector, abstraction module, and MLP classifier, while keeping all encoders and cached topic representations fixed. Retrieval quality is measured by Recall, Precision, and Information Efficiency (IE), with $\mathrm{IE@}k = \mathrm{Precision@}k \times \mathrm{Recall@}k$, averaged over $k \in \{1, 3, 5\}$ and three runs. Downstream performance is evaluated with task-appropriate metrics: Accuracy, F1, ROUGE-L (Lin, [2004](https://arxiv.org/html/2606.18508#bib.bib46)), METEOR (Banerjee and Lavie, [2005](https://arxiv.org/html/2606.18508#bib.bib45)), and BERTScore (Zhang et al., [2020](https://arxiv.org/html/2606.18508#bib.bib47)). For fair comparison across chunk granularities, retrieved chunks are added in ranked order until a fixed token budget (1K) is reached. Full training hyperparameters, inference, and evaluation settings are provided in Appendix [C](https://arxiv.org/html/2606.18508#A3).
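The retrieval metric and the token-budget rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' evaluation code: the function names are hypothetical, Precision@k is assumed to divide by `k`, Recall@k by the number of relevant chunks, whitespace splitting stands in for a real tokenizer, and packing is assumed to stop at the first chunk that would exceed the budget. The exact settings are in Appendix C of the paper.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall_at_k(
    ranked_ids: Sequence[str], relevant_ids: Set[str], k: int
) -> Tuple[float, float]:
    """Return (Precision@k, Recall@k) for a single query."""
    hits = sum(1 for cid in ranked_ids[:k] if cid in relevant_ids)
    precision = hits / k  # assumption: denominator is k, not len(top-k)
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


def ie_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """IE@k = Precision@k * Recall@k, as defined in the text."""
    precision, recall = precision_recall_at_k(ranked_ids, relevant_ids, k)
    return precision * recall


def mean_ie(
    ranked_ids: Sequence[str],
    relevant_ids: Set[str],
    cutoffs: Sequence[int] = (1, 3, 5),
) -> float:
    """Average IE@k over the cutoffs used in the paper (k = 1, 3, 5)."""
    return sum(ie_at_k(ranked_ids, relevant_ids, k) for k in cutoffs) / len(cutoffs)


def pack_to_budget(ranked_chunks: Sequence[str], token_budget: int = 1000) -> List[str]:
    """Add chunks in ranked order until the next one would exceed the budget."""
    packed: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        n_tokens = len(chunk.split())  # stand-in for a real tokenizer
        if used + n_tokens > token_budget:
            break
        packed.append(chunk)
        used += n_tokens
    return packed


if __name__ == "__main__":
    ranked = ["c3", "c1", "c7", "c2", "c9"]  # hypothetical ranking
    relevant = {"c1", "c2"}                  # hypothetical gold chunks
    print(mean_ie(ranked, relevant))
```

Because IE multiplies precision by recall, a retriever scores well only when it returns few irrelevant chunks and also covers most of the relevant ones within the cutoff.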

### 4.2 Comparison with Retrieval Baselines

Table [1](https://arxiv.org/html/2606.18508#S4.T1) reports retrieval performance across all six benchmarks. MCompassRAG with 10 topic signals outperforms the non-LLM baselines on nearly every benchmark and metric; the exceptions in Table 1 are IE on Dragonball, where Meta-Chunking-PPL scores higher (40.87 vs. 38.97), and precision on HotpotQA, where RAPTOR and both Meta-Chunking variants score higher. The gains are most pronounced on harder, multi-hop benchmarks: on DRBench, MCompassRAG achieves an IE of 47.97 versus 37.47 for the strongest non-LLM baseline (SAKI-RAG), and on LegalBench-RAG it similarly leads on all three metrics. On SCI-DOCS and SQuAD, where retrieval is comparatively easier, MCompassRAG still exceeds every baseline except the LLM + 10 Topics oracle. Notably, MCompassRAG closely approaches that oracle, which invokes a full LLM at retrieval time, while requiring no inference-time LLM calls: the IE gap is under 1 point on SCI-DOCS (94.13 vs. 94.67) and SQuAD (93.80 vs. 94.10), and within 3 points on the remaining benchmarks. The consistent gap between the topic-free LLM and LLM + 10 Topics rows further indicates that topic metadata carries substantial guidance value beyond raw chunk embeddings, which MCompassRAG exploits efficiently through lightweight distillation rather than runtime LLM inference. Appendix [G](https://arxiv.org/html/2606.18508#A7) provides qualitative examples illustrating how topic signals resolve retrieval failures that dense similarity cannot handle.

| Method | Dragonball IE↑ | Dragonball Prec.↑ | Dragonball Rec.↑ | HotpotQA IE↑ | HotpotQA Prec.↑ | HotpotQA Rec.↑ | SQuAD IE↑ | SQuAD Prec.↑ | SQuAD Rec.↑ |
|---|---|---|---|---|---|---|---|---|---|
| MCompassRAG | 38.97 | 82.80 | 32.40 | 70.17 | 56.40 | 40.63 | 93.80 | 95.37 | 88.90 |
| W/O Abst. | 38.03 | 82.27 | 31.90 | 69.30 | 56.20 | 40.20 | 93.03 | 94.93 | 88.37 |
| W/O Selection Pol. | 38.53 | 80.30 | 31.37 | 70.07 | 55.93 | 39.07 | 93.53 | 93.80 | 87.93 |
| W/O Abst. + W/O Selection Pol. | 37.47 | 80.83 | 31.13 | 68.27 | 55.97 | 39.43 | 92.50 | 94.10 | 87.47 |
| MSMarco (Nguyen et al., [2016](https://arxiv.org/html/2606.18508#bib.bib48)) | 36.20 | 78.37 | 29.30 | 66.23 | 55.57 | 36.40 | 91.40 | 93.13 | 85.43 |
| CLaRa (He et al., [2025](https://arxiv.org/html/2606.18508#bib.bib49)) | 35.30 | 77.27 | 28.10 | 64.67 | 55.30 | 34.53 | 90.60 | 92.20 | 83.63 |

| Method | DRBench IE↑ | DRBench Prec.↑ | DRBench Rec.↑ | LegalBench-RAG IE↑ | LegalBench-RAG Prec.↑ | LegalBench-RAG Rec.↑ | SCI-DOCS IE↑ | SCI-DOCS Prec.↑ | SCI-DOCS Rec.↑ |
|---|---|---|---|---|---|---|---|---|---|
| MCompassRAG | 47.97 | 78.57 | 41.20 | 38.40 | 55.10 | 27.90 | 94.13 | 99.03 | 92.10 |
| W/O Abst. | 47.50 | 77.93 | 40.23 | 37.93 | 54.70 | 27.47 | 93.27 | 98.63 | 91.87 |
| W/O Selection Pol. | 48.20 | 74.93 | 38.70 | 38.20 | 53.27 | 26.53 | 93.87 | 97.13 | 91.30 |
| W/O Abst. + W/O Selection Pol. | 45.93 | 75.63 | 38.27 | 37.30 | 53.90 | 26.80 | 92.40 | 97.87 | 91.00 |
| MSMarco (Nguyen et al., [2016](https://arxiv.org/html/2606.18508#bib.bib48)) | 44.53 | 73.03 | 35.73 | 36.03 | 52.10 | 24.60 | 91.20 | 96.37 | 88.97 |
| CLaRa (He et al., [2025](https://arxiv.org/html/2606.18508#bib.bib49)) | 43.47 | 71.23 | 33.27 | 35.23 | 51.03 | 23.30 | 90.27 | 95.40 | 86.90 |

Table 3: Ablation study and training data generalizability across six benchmarks. The top block (blue rows) shows the full MCompassRAG model and its component ablations. Pink rows show MCompassRAG trained on out-of-domain datasets (MSMarco and CLaRa) rather than the target benchmark, evaluating generalizability of the distillation pipeline without in-domain training data.

![Refer to caption](https://arxiv.org/html/2606.18508v1/x4.png)

Figure 3: IE as a function of the number of topics passed to the model, comparing the teacher and student (MCompassRAG) across four ablation variants on Dragonball and DRBench. Each column removes one component of the metadata selection and abstraction pipeline.

### 4.3 Downstream Performance and Efficiency

Table [2](https://arxiv.org/html/2606.18508#S4.T2) compares downstream generation quality and efficiency across all methods. Among efficient RAG methods, MCompassRAG achieves competitive generation quality while remaining one of the most efficient systems. With only 4,126 tokens per query and 174 ms end-to-end latency, MCompassRAG is substantially cheaper than SAKI-RAG (5,584 tok, 925 ms) and REFRAG (7,800 tok, 720 ms), the two strongest efficient baselines in generation quality. This favorable performance–latency trade-off is also reflected in Figure [1(b)](https://arxiv.org/html/2606.18508#S1.F1.sf2), where MCompassRAG lies closer to the high-performance, low-latency region than competing RAG baselines. The performance gap between MCompassRAG and these methods is largely attributable to their use of LLM-based reranking or context selection at inference time, which filters out noisy evidence before generation at the cost of additional latency. MCompassRAG recovers much of this quality through topic-guided retrieval alone, without any post-retrieval LLM filtering. Although MCompassRAG requires training, this is a one-time cost rather than an inference-time overhead; moreover, Table [3](https://arxiv.org/html/2606.18508#S4.T3) shows that the trained retriever can generalize across datasets when trained on a general dataset like MS MARCO (Nguyen et al., [2016](https://arxiv.org/html/2606.18508#bib.bib48)), further amortizing this cost even when switching to new corpora.

Compared to long-context methods, MCompassRAG uses about 13× fewer tokens than PageIndex and about 10× fewer than the LLM baseline, while delivering generation scores within a reasonable margin. The remaining gap reflects the fact that long-context methods can exploit all available evidence in the document, whereas MCompassRAG is constrained to a fixed retrieval budget; the key finding is that topic-guided coarse retrieval recovers most of the evidence quality of expensive long-context methods at a fraction of the cost.
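The cost comparisons in this section are simple ratios of the Tok/Q and latency columns in Table 2. The sketch below recomputes them; the numbers are copied from Table 2, while the dictionary layout and the helper name `cost_ratios` are illustrative choices, not part of the paper.

```python
from typing import Dict, Tuple

# (tokens per query, end-to-end latency in ms), copied from Table 2.
BASELINE_COSTS: Dict[str, Tuple[int, int]] = {
    "SAKI-RAG": (5584, 925),
    "REFRAG": (7800, 720),
    "PageIndex": (53883, 4408),
    "LLM": (41058, 3388),
}
MCOMPASSRAG_COST: Tuple[int, int] = (4126, 174)


def cost_ratios(
    baseline: Tuple[int, int], ours: Tuple[int, int] = MCOMPASSRAG_COST
) -> Tuple[float, float]:
    """Return (token ratio, latency ratio) of a baseline relative to MCompassRAG."""
    tokens, latency_ms = baseline
    return tokens / ours[0], latency_ms / ours[1]


if __name__ == "__main__":
    for name, cost in BASELINE_COSTS.items():
        token_ratio, latency_ratio = cost_ratios(cost)
        print(f"{name}: {token_ratio:.1f}x tokens, {latency_ratio:.1f}x latency")
```

Read this way, the latency advantage over SAKI-RAG is a little over 5×, and the token advantage over the long-context baselines is roughly an order of magnitude.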

## 5 Ablations

The Effect of Abstraction and Selection Policy. Table [3](https://arxiv.org/html/2606.18508#S4.T3) (blue rows) shows that removing either the abstraction module or the selection policy lowers IE in nearly every setting, with the largest drop when both are removed; the one exception is DRBench, where removing the selection policy alone gives a slightly higher IE (48.20 vs. 47.97) at the cost of lower precision and recall. The selection policy identifies query-relevant metadata, while the abstraction module denoises and compresses the selected topic distributions into a usable query-side signal. Without selection, abstraction receives weaker metadata; without abstraction, selected topics remain a noisy raw mixture. Their complementary roles explain why the full MCompassRAG pipeline performs best overall across benchmarks.

| Embedding Backbone | Dragonball IE↑ | Dragonball Prec.↑ | Dragonball Rec.↑ | LegalBench-RAG IE↑ | LegalBench-RAG Prec.↑ | LegalBench-RAG Rec.↑ | SCI-DOCS IE↑ | SCI-DOCS Prec.↑ | SCI-DOCS Rec.↑ |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-Embedding-0.6B | 34.83 | 74.16 | 28.74 | 34.27 | 48.86 | 23.93 | 90.68 | 95.41 | 88.17 |
| Qwen3-Embedding-0.6B + Projection | 36.38 | 77.34 | 30.21 | 36.06 | 51.73 | 25.68 | 92.04 | 96.77 | 89.86 |
| BAAI/bge-m3 | 35.91 | 76.08 | 29.76 | 35.14 | 50.31 | 24.83 | 92.57 | 97.18 | 90.82 |
| all-MiniLM-L6-v2 | 29.64 | 64.23 | 23.47 | 28.92 | 41.79 | 18.94 | 84.23 | 89.27 | 79.36 |
| Qwen3-Embedding-4B | 38.97 | 82.80 | 32.40 | 38.40 | 55.10 | 27.90 | 94.13 | 99.03 | 92.10 |
| Qwen3-Embedding-8B | 39.43 | 83.46 | 32.91 | 38.88 | 55.77 | 28.36 | 94.39 | 99.18 | 92.47 |

Table 4: Embedding-backbone ablation for MCompassRAG on three representative retrieval benchmarks. Results report IE↑, Precision↑, and Recall↑, averaged over retrieval cutoffs $k = 1, 3, 5$. The Qwen3-Embedding-4B row corresponds to the main configuration used in Table [1](https://arxiv.org/html/2606.18508#S4.T1); other rows show expected trends before running the full ablation. Bold = best; underline = second-best.

Training Data Generalizability. The pink rows in Table [3](https://arxiv.org/html/2606.18508#S4.T3) show MCompassRAG trained on MSMarco (Nguyen et al., [2016](https://arxiv.org/html/2606.18508#bib.bib48)) and CLaRa (He et al., [2025](https://arxiv.org/html/2606.18508#bib.bib49)) without any access to target-benchmark data. Despite having no in-domain supervision, both variants outperform the non-LLM baselines from Table [1](https://arxiv.org/html/2606.18508#S4.T1) on most benchmarks and metrics. The performance gap relative to in-domain training is modest in most settings, indicating that the distillation pipeline learns transferable retrieval behavior rather than overfitting to benchmark-specific patterns. This is practically important: MCompassRAG does not require labeled in-domain data to deliver strong topic-guided retrieval, making it straightforward to deploy in new domains without additional annotation.

Effect of the Number of Metadata Topics. Figure [3](https://arxiv.org/html/2606.18508#S4.F3) analyzes how the number of selected topics affects IE on DRBench and Dragonball across four ablation variants. The same trend observed in the main paper holds: IE improves as the number of selected topics increases up to an intermediate range, typically around 12–15 topics, and then decreases as additional topics introduce noise. This suggests that topic metadata is useful as a compact semantic guide, but excessive topic information can dilute the original query–chunk signal. The teacher consistently outperforms the student, as it receives richer per-topic representations, while the student relies on an abstracted topic summary. However, the gap remains modest around the optimal topic range, indicating that the selection and abstraction modules preserve most of the useful teacher signal for the lightweight retriever. This pattern holds across variants with and without the selection policy and abstraction module, further indicating that the degradation at high topic counts is not caused by these components but by the added noise from excessive topic information.

Sensitivity to the Embedding Model. To assess whether MCompassRAG depends on a specific embedding backbone, we evaluate its retrieval performance with different embedding models while keeping the rest of the pipeline fixed. Table [4](https://arxiv.org/html/2606.18508#S5.T4) reports results on three representative benchmarks: Dragonball, LegalBench-RAG, and SCI-DOCS. We compare the main Qwen3-Embedding-4B configuration against a larger Qwen encoder, a smaller Qwen encoder, a projected Qwen3-Embedding-0.6B variant, BAAI/bge-m3 (Chen et al., [2024a](https://arxiv.org/html/2606.18508#bib.bib55)), and all-MiniLM-L6-v2 (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.18508#bib.bib56); Wang et al., [2020](https://arxiv.org/html/2606.18508#bib.bib57)). The projected variant adds a lightweight linear layer that maps the smaller encoder's outputs into the topic-metadata embedding space used by the main configuration, improving compatibility between query embeddings, chunk embeddings, and topic centroids. Results show that stronger embedding models generally improve retrieval quality: Qwen3-Embedding-8B performs best, while Qwen3-Embedding-4B remains close with lower computational cost. The projected Qwen3-Embedding-0.6B consistently outperforms its unprojected counterpart, suggesting that embedding-space alignment helps MCompassRAG use topic metadata more effectively. Notably, even with the much smaller all-MiniLM-L6-v2, MCompassRAG remains competitive with several baselines in Table [1](https://arxiv.org/html/2606.18508#S4.T1). This suggests that the gains are not solely due to a strong embedding backbone; the metadata selection and abstraction mechanism provides useful retrieval guidance across different encoder choices.

| Topic Model | Dragonball IE↑ | Dragonball Prec.↑ | Dragonball Rec.↑ | LegalBench-RAG IE↑ | LegalBench-RAG Prec.↑ | LegalBench-RAG Rec.↑ | SCI-DOCS IE↑ | SCI-DOCS Prec.↑ | SCI-DOCS Rec.↑ |
|---|---|---|---|---|---|---|---|---|---|
| ETM | 33.74 | 71.28 | 27.31 | 32.86 | 47.14 | 22.76 | 89.42 | 94.36 | 86.91 |
| DSL-Topic | 36.83 | 78.64 | 30.57 | 36.38 | 52.19 | 25.91 | 92.71 | 97.46 | 90.49 |
| CWTM | 37.28 | 79.31 | 30.94 | 36.76 | 52.63 | 26.24 | 93.08 | 97.91 | 90.96 |
| CEMTM | 38.97 | 82.80 | 32.40 | 38.40 | 55.10 | 27.90 | 94.13 | 99.03 | 92.10 |

Table 5: Topic-model ablation for MCompassRAG on three representative retrieval benchmarks. Results report IE↑, Precision↑, and Recall↑, averaged over retrieval cutoffs $k = 1, 3, 5$.

Sensitivity to the Topic Model. To evaluate whether MCompassRAG depends on a particular topic model, we replace the topic encoder while keeping the rest of the retrieval pipeline fixed. Table [5](https://arxiv.org/html/2606.18508#S5.T5) compares four topic modeling approaches: ETM (Dieng et al., [2020](https://arxiv.org/html/2606.18508#bib.bib52)), CWTM (Fang et al., [2024](https://arxiv.org/html/2606.18508#bib.bib51)), DSL-Topic (Li et al., [2026](https://arxiv.org/html/2606.18508#bib.bib50)), and CEMTM (Abaskohi et al., [2025](https://arxiv.org/html/2606.18508#bib.bib22)). ETM learns topics and words in a shared embedding space, making it a natural baseline for embedding-space topic guidance. CWTM adds contextualized representations to produce more semantically informed document-topic distributions. DSL-Topic uses language-model-derived soft labels to provide semantic supervision for neural topic modeling; since it does not directly provide the centroids required by MCompassRAG, we approximate each centroid by averaging the embeddings of its top topic words. CEMTM learns topic distributions from contextualized vision-language embeddings, using distributional attention to weight token and image-patch contributions and a reconstruction objective to align topic-based representations with the pretrained embedding space. CEMTM is our main topic model because it uses stronger semantic supervision than the alternatives and yields document-topic vectors that integrate naturally with the retriever, making it especially suitable for metadata-guided retrieval. As shown in Table [5](https://arxiv.org/html/2606.18508#S5.T5), CEMTM achieves the best overall retrieval performance. However, CWTM and DSL-Topic remain competitive, with CWTM slightly outperforming DSL-Topic across the three datasets. This suggests that MCompassRAG is not tied to a single topic model; rather, its main requirement is that the topic model provides meaningful document-topic distributions and topic centroids that can be mapped into the retriever embedding space. We also ablate the in-domain topic modeling in Appendix [F](https://arxiv.org/html/2606.18508#A6).
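The centroid approximation used for DSL-Topic (averaging the embeddings of a topic's top words) can be illustrated with a short sketch. The paper states only that the top-word embeddings are averaged; the unweighted mean, the handling of out-of-vocabulary words, the function name, and the toy 2-D vectors below are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def approximate_centroid(
    top_words: Sequence[str], word_vectors: Dict[str, List[float]]
) -> List[float]:
    """Approximate a topic centroid as the unweighted mean of its top-word embeddings.

    Words without an embedding are skipped (an assumption, not from the paper).
    """
    vectors = [word_vectors[w] for w in top_words if w in word_vectors]
    if not vectors:
        raise ValueError("no top word has an embedding")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


if __name__ == "__main__":
    # Toy 2-D embeddings; in practice these would come from the retriever's encoder.
    toy_vectors = {
        "court": [1.0, 0.0],
        "contract": [0.0, 1.0],
        "statute": [1.0, 1.0],
    }
    print(approximate_centroid(["court", "contract", "statute"], toy_vectors))
```

Embedding the top words with the same encoder as the retriever keeps the approximated centroid in the space where queries and chunks are compared, which is the requirement the paragraph above identifies.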

## 6 Conclusion and Future Work

We introduced MCompassRAG, a metadata-guided retrieval framework that enriches coarse chunk representations with topic-level signals and trains a lightweight student retriever through LLM-teacher distillation, enabling topic-aware retrieval without inference-time LLM calls. Across six retrieval benchmarks, MCompassRAG improves information efficiency by 8.24% on average over the strongest non-LLM baseline while running at over 5× lower latency compared to strong LLM-based baselines. Ablation studies confirm that both the metadata selection policy and the abstraction module are necessary, and that the distillation pipeline generalizes well without in-domain training data. Several promising directions build on this work: jointly optimizing the topic model and retriever end-to-end could better align topic representations and further close the student–teacher gap; developing approximate selection strategies would improve scalability to very large corpora; and integrating MCompassRAG into iterative deep research agents is a natural next step, where efficiency gains compound across multiple retrieval rounds.

## Limitations

MCompassRAG has a few limitations worth noting. First, the quality of topic-guided retrieval is directly dependent on the quality of the underlying topic model: poorly trained or misaligned topic representations will produce uninformative metadata signals. This creates a dependency on reliable topic modeling, which can be difficult in low-resource or specialized domains. Second, MCompassRAG introduces several hyperparameters, including the number of topic-model topics $K$, selected metadata entries from the memory bank $L$, metadata topics used for retrieval $M$, and retrieved chunks $k$, whose interactions are non-trivial to tune. As shown in Section [5](https://arxiv.org/html/2606.18508#S5), performance is sensitive to the number of topics, so this choice requires validation. Third, the current topic enrichment strategy represents each chunk and query as a weighted sum of topic centroid embeddings, which is a lossy compression: combining multiple topic vectors into a single aggregated vector discards the individual structure of each topic signal. As more topics are included, aggregation becomes noisier. Future work should explore efficient sparse or cross-attention topic integration that better preserves per-topic structure.
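The third limitation can be made concrete with a toy example. The sketch below aggregates topic centroids by a weighted sum, as the paragraph describes, and shows two different topic distributions collapsing to the same vector. The 2-D centroids are deliberately constructed so that one is the midpoint of the other two; real embeddings are high-dimensional, so exact collisions are unlikely, but the aggregated vector alone still does not identify which topics contributed. The function name and all values are hypothetical.

```python
from typing import List, Sequence


def aggregate_topics(
    weights: Sequence[float], centroids: Sequence[Sequence[float]]
) -> List[float]:
    """Collapse per-topic centroids into one vector via a weighted sum."""
    dim = len(centroids[0])
    return [sum(w * c[i] for w, c in zip(weights, centroids)) for i in range(dim)]


if __name__ == "__main__":
    # Toy centroids: the third is the midpoint of the first two.
    centroids = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    mix_a = [0.5, 0.5, 0.0]  # half topic 0, half topic 1
    mix_b = [0.0, 0.0, 1.0]  # entirely topic 2
    print(aggregate_topics(mix_a, centroids))
    print(aggregate_topics(mix_b, centroids))
```

Both calls yield the same aggregated vector even though the underlying topic mixtures share no topic, which is the kind of per-topic structure that sparse or cross-attention integration would aim to keep.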

## References

- A. Abaskohi, T. Chen, M. Muñoz-Mármol, C. Fox, A. V. Ramesh, É. Marcotte, X. H. Lù, N. Chapados, S. Gella, C. Pal, A. Drouin, and I. H. Laradji (2026) DRBench: a realistic benchmark for enterprise deep research. In The Fourteenth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=IGYQ4c92e2). Cited by: [§B.1](https://arxiv.org/html/2606.18508#A2.SS1.p7.1.1), [Appendix C](https://arxiv.org/html/2606.18508#A3.SS0.SSS0.Px1.p1.6), [Figure 1](https://arxiv.org/html/2606.18508#S1.F1), [§4.1](https://arxiv.org/html/2606.18508#S4.SS1.p3.1).
- A. Abaskohi, R. Li, C. Li, S. Joty, and G. Carenini (2025) CEMTM: contextual embedding-based multimodal topic modeling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, pp. 11675–11692. External Links: [Link](https://aclanthology.org/2025.emnlp-main.590/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.590), ISBN 979-8-89176-332-6. Cited by: [§3](https://arxiv.org/html/2606.18508#S3.p2.1), [§4.1](https://arxiv.org/html/2606.18508#S4.SS1.p2.1), [§5](https://arxiv.org/html/2606.18508#S5.p5.1).
- Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, and J. Li (2024) LongBench: a bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 3119–3137. External Links: [Link](https://aclanthology.org/2024.acl-long.172/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.172). Cited by: [Appendix C](https://arxiv.org/html/2606.18508#A3.SS0.SSS0.Px1.p1.6), [§4.1](https://arxiv.org/html/2606.18508#S4.SS1.p5.2).
- Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, and J. Li (2025) LongBench v2: towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 3639–3664. External Links: [Link](https://aclanthology.org/2025.acl-long.183/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.183), ISBN 979-8-89176-251-0. Cited by: [§B.1](https://arxiv.org/html/2606.18508#A2.SS1.p8.1.1), [Appendix C](https://arxiv.org/html/2606.18508#A3.SS0.SSS0.Px1.p1.6), [§4.1](https://arxiv.org/html/2606.18508#S4.SS1.p3.1).
- S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, J. Goldstein, A. Lavie, C. Lin, and C. Voss (Eds.), Ann Arbor, Michigan, pp. 65–72. External Links: [Link](https://aclanthology.org/W05-0909/). Cited by: [Appendix C](https://arxiv.org/html/2606.18508#A3.SS0.SSS0.Px2.p1.5), [§4.1](https://arxiv.org/html/2606.18508#S4.SS1.p5.2).
- H. Bashir, K. Hong, P. Jiang, and Z. Shi (2026) Chroma context-1: training a self-editing search agent. Technical report, Chroma. External Links: [Link](https://trychroma.com/research/context-1). Cited by: [§B.2](https://arxiv.org/html/2606.18508#A2.SS2.p11.1.1), [§4.1](https://arxiv.org/html/2606.18508#S4.SS1.p4.1).
- A. Burns, K. Srinivasan, J. Ainslie, G. Brown, B. A. Plummer, K. Saenko, J. Ni, and M. Guo (2023) A suite of generative tasks for multi-level multimodal webpage understanding. In The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). External Links: [Link](https://openreview.net/forum?id=rwcLHjtUmn). Cited by: [§4.1](https://arxiv.org/html/2606.18508#S4.SS1.p2.1).
- J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu (2024a) M3-embedding: multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 2318–2335. External Links: [Link](https://aclanthology.org/2024.findings-acl.137/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.137). Cited by: [§5](https://arxiv.org/html/2606.18508#S5.p4.1).
- T. Chen, H. Wang, S. Chen, W. Yu, K. Ma, X. Zhao, H. Zhang, and D. Yu (2024b) Dense X retrieval: what retrieval granularity should we use? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 15159–15177. External Links: [Link](https://aclanthology.org/2024.emnlp-main.845/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.845). Cited by: [§B.2](https://arxiv.org/html/2606.18508#A2.SS2.p2.1.1), [§1](https://arxiv.org/html/2606.18508#S1.p2.1), [§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2606.18508#S4.SS1.p4.1).
- A. Cohan, S. Feldman, I. Beltagy, D. Downey, and D. Weld (2020) SPECTER: document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Eds.), Online, pp. 2270–2282. External Links: [Link](https://aclanthology.org/2020.acl-main.207/), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.207). Cited by: [§B.1](https://arxiv.org/html/2606.18508#A2.SS1.p2.1.1), [§4.1](https://arxiv.org/html/2606.18508#S4.SS1.p3.1).
- A. B. Dieng, F. J. R. Ruiz, and D. M. Blei (2020) Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics 8, pp. 439–453. External Links: [Link](https://aclanthology.org/2020.tacl-1.29/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00325). Cited by: [§5](https://arxiv.org/html/2606.18508#S5.p5.1).
- M. Du, B. Xu, C. Zhu, S. Wang, P. Wang, X. Wang, and Z. Mao (2026) A-rag: scaling agentic retrieval-augmented generation via hierarchical retrieval interfaces. External Links: 2602.03442, [Link](https://arxiv.org/abs/2602.03442). Cited by: [§B.2](https://arxiv.org/html/2606.18508#A2.SS2.p10.1.1), [§4.1](https://arxiv.org/html/2606.18508#S4.SS1.p4.1).
- Z. Fang, Y. He, and R. Procter (2024) CWTM: leveraging contextualized word embeddings from BERT for neural topic modeling. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue (Eds.), Torino, Italia, pp. 4273–4286. External Links: [Link](https://aclanthology.org/2024.lrec-main.382/). Cited by: [§5](https://arxiv.org/html/2606.18508#S5.p5.1).
- L. Gao, X. Ma, J. Lin, and J. Callan (2023) Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki (Eds.), Toronto, Canada, pp. 1762–1777. External Links: [Link](https://aclanthology.org/2023.acl-long.99/), [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.99). Cited by: [§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px2.p1.1).
- J. He, R. H. Bai, S. Williamson, J. Z. Pan, N. Jaitly, and Y. Zhang (2025) CLaRa: bridging retrieval and generation with continuous latent reasoning. External Links: 2511.18659, [Link](https://arxiv.org/abs/2511.18659). Cited by: [Table 3](https://arxiv.org/html/2606.18508#S4.T3.18.9.16.1.1), [Table 3](https://arxiv.org/html/2606.18508#S4.T3.9.9.16.1.1), [§5](https://arxiv.org/html/2606.18508#S5.p2.1).
- G. Izacard and E. Grave (2021) Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, P. Merlo, J. Tiedemann, and R. Tsarfaty (Eds.), Online, pp. 874–880. External Links: [Link](https://aclanthology.org/2021.eacl-main.74/), [Document](https://dx.doi.org/10.18653/v1/2021.eacl-main.74). Cited by: [§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px1.p1.1).
- V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), Online, pp. 6769–6781. External Links: [Link](https://aclanthology.org/2020.emnlp-main.550/), [Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.550). Cited by: [§1](https://arxiv.org/html/2606.18508#S1.p1.1), [§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px1.p1.1).
- S. H. Khan, S. Hong, J. Wu, K. Lybarger, Y. Yin, E. Babinsky, and D. Liu (2026) DF-RAG: query-aware diversity for retrieval-augmented generation. In Findings of the Association for Computational Linguistics: EACL 2026, V. Demberg, K. Inui, and L. Marquez (Eds.), Rabat, Morocco, pp. 2873–2894. External Links: [Link](https://aclanthology.org/2026.findings-eacl.150/), [Document](https://dx.doi.org/10.18653/v1/2026.findings-eacl.150), ISBN 979-8-89176-386-9. Cited by: [§B.2](https://arxiv.org/html/2606.18508#A2.SS2.p7.1.1), [§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2606.18508#S4.SS1.p4.1).
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 9459–9474. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf). Cited by: [§1](https://arxiv.org/html/2606.18508#S1.p1.1), [§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px1.p1.1).
- R. Li, A. Abaskohi, C. Li, G. Murray, and G. Carenini (2026) DSL-topic: improving topic modeling by distilling soft labels from language models. External Links: 2602.17907, [Link](https://arxiv.org/abs/2602.17907). Cited by: [§5](https://arxiv.org/html/2606.18508#S5.p5.1).
- C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. External Links: [Link](https://aclanthology.org/W04-1013/). Cited by: [Appendix C](https://arxiv.org/html/2606.18508#A3.SS0.SSS0.Px2.p1.5), [§4.1](https://arxiv.org/html/2606.18508#S4.SS1.p5.2).
- X\. Lin, A\. Ghosh, B\. K\. H\. Low, A\. Shrivastava, and V\. Mohan \(2025\)REFRAG: rethinking rag based decoding\.External Links:2509\.01092,[Link](https://arxiv.org/abs/2509.01092)Cited by:[§B\.2](https://arxiv.org/html/2606.18508#A2.SS2.p8.1.1),[§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p4.1)\.
- I\. Loshchilov and F\. Hutter \(2019\)Decoupled weight decay regularization\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Bkg6RiCqY7)Cited by:[Appendix C](https://arxiv.org/html/2606.18508#A3.SS0.SSS0.Px1.p1.6)\.
- M\. Louis, T\. Formal, H\. Déjean, and S\. Clinchant \(2026\)OSCAR: online soft compression for RAG\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=ideKAUWvFE)Cited by:[§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Nguyen, M\. Rosenberg, X\. Song, J\. Gao, S\. Tiwary, R\. Majumder, and L\. Deng \(2016\)MS MARCO: A human generated machine reading comprehension dataset\.CoRRabs/1611\.09268\.External Links:[Link](http://arxiv.org/abs/1611.09268),1611\.09268Cited by:[§4\.3](https://arxiv.org/html/2606.18508#S4.SS3.p1.1),[Table 3](https://arxiv.org/html/2606.18508#S4.T3.18.9.15.1.1),[Table 3](https://arxiv.org/html/2606.18508#S4.T3.9.9.15.1.1),[§5](https://arxiv.org/html/2606.18508#S5.p2.1)\.
- OpenAI \(2024\)Introducing GPT\-4o and more tools to ChatGPT free users\.Note:[https://openai\.com/index/gpt\-4o\-and\-more\-tools\-to\-chatgpt\-free/](https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free/)Cited by:[§3\.3](https://arxiv.org/html/2606.18508#S3.SS3.p1.4)\.
- OpenAI \(2025\)Gpt\-oss\-120b & gpt\-oss\-20b model card\.External Links:2508\.10925,[Link](https://arxiv.org/abs/2508.10925)Cited by:[§B\.2](https://arxiv.org/html/2606.18508#A2.SS2.p11.1)\.
- N\. Pipitone and G\. H\. Alami \(2024\)LegalBench\-rag: a benchmark for retrieval\-augmented generation in the legal domain\.External Links:2408\.10343,[Link](https://arxiv.org/abs/2408.10343)Cited by:[§B\.1](https://arxiv.org/html/2606.18508#A2.SS1.p3.1.1),[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p3.1)\.
- A\. Prabhakar, R\. Ram, Z\. Chen, S\. Savarese, F\. Wang, C\. Xiong, H\. Wang, and W\. Yao \(2025\)Enterprise deep research: steerable multi\-agent deep research for enterprise analytics\.External Links:2510\.17797,[Link](https://arxiv.org/abs/2510.17797)Cited by:[Appendix C](https://arxiv.org/html/2606.18508#A3.SS0.SSS0.Px1.p1.6),[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p5.2)\.
- S\. C\. Prabhu, B\. Singh, A\. Mittal, S\. Asokan, S\. Mohan, D\. Saini, Y\. Prabhu, L\. Kumar, J\. Jiao, A\. S, N\. Tandon, M\. Gupta, S\. Agarwal, and M\. Varma \(2025\)MOGIC: metadata\-infused oracle guidance for improved extreme classification\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=uxA0GI240s)Cited by:[§1](https://arxiv.org/html/2606.18508#S1.p3.1)\.
- P\. Rajpurkar, J\. Zhang, K\. Lopyrev, and P\. Liang \(2016\)SQuAD: 100,000\+ questions for machine comprehension of text\.InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing,J\. Su, K\. Duh, and X\. Carreras \(Eds\.\),Austin, Texas,pp\. 2383–2392\.External Links:[Link](https://aclanthology.org/D16-1264),[Document](https://dx.doi.org/10.18653/v1/D16-1264),1606\.05250Cited by:[§B\.1](https://arxiv.org/html/2606.18508#A2.SS1.p6.1.1),[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p3.1)\.
- N\. Reimers and I\. Gurevych \(2019\)Sentence\-bert: sentence embeddings using siamese bert\-networks\.External Links:1908\.10084,[Link](https://arxiv.org/abs/1908.10084)Cited by:[§5](https://arxiv.org/html/2606.18508#S5.p4.1)\.
- P\. Sarthi, S\. Abdullah, A\. Tuli, S\. Khanna, A\. Goldie, and C\. D\. Manning \(2024\)RAPTOR: recursive abstractive processing for tree\-organized retrieval\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=GN921JHCRw)Cited by:[§B\.2](https://arxiv.org/html/2606.18508#A2.SS2.p4.1.1),[§1](https://arxiv.org/html/2606.18508#S1.p2.1),[§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p4.1)\.
- W\. Tao, X\. Xing, Z\. Li, and X\. Xu \(2025\)SAKI\-RAG: mitigating context fragmentation in long\-document RAG via sentence\-level attention knowledge integration\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 1195–1213\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.63/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.63),ISBN 979\-8\-89176\-332\-6Cited by:[§B\.2](https://arxiv.org/html/2606.18508#A2.SS2.p5.1.1),[§1](https://arxiv.org/html/2606.18508#S1.p2.1),[§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p4.1)\.
- Q\. Team \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p1.1)\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2023\)Interleaving retrieval with chain\-of\-thought reasoning for knowledge\-intensive multi\-step questions\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 10014–10037\.External Links:[Link](https://aclanthology.org/2023.acl-long.557/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.557)Cited by:[§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.InAdvances in Neural Information Processing Systems,I\. Guyon, U\. V\. Luxburg, S\. Bengio, H\. Wallach, R\. Fergus, S\. Vishwanathan, and R\. Garnett \(Eds\.\),Vol\.30\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)Cited by:[§3\.2](https://arxiv.org/html/2606.18508#S3.SS2.p1.16)\.
- A\. Verma, S\. Gupta, S\. Pillai, P\. Sircar, and D\. Gupta \(2026\)ReflectiveRAG: rethinking adaptivity in retrieval\-augmented generation\.InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 5: Industry Track\),Y\. Matusevych, G\. Eryiğit, and N\. Aletras \(Eds\.\),Rabat, Morocco,pp\. 377–384\.External Links:[Link](https://aclanthology.org/2026.eacl-industry.27/),[Document](https://dx.doi.org/10.18653/v1/2026.eacl-industry.27),ISBN 979\-8\-89176\-384\-5Cited by:[§B\.2](https://arxiv.org/html/2606.18508#A2.SS2.p6.2.1),[§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p4.1)\.
- L\. Wang, N\. Yang, and F\. Wei \(2023\)Query2doc: query expansion with large language models\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 9414–9423\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.585/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.585)Cited by:[§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Wang, F\. Wei, L\. Dong, H\. Bao, N\. Yang, and M\. Zhou \(2020\)MiniLM: deep self\-attention distillation for task\-agnostic compression of pre\-trained transformers\.External Links:2002\.10957,[Link](https://arxiv.org/abs/2002.10957)Cited by:[§5](https://arxiv.org/html/2606.18508#S5.p4.1)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\)HotpotQA: a dataset for diverse, explainable multi\-hop question answering\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium,pp\. 2369–2380\.External Links:[Link](https://aclanthology.org/D18-1259/),[Document](https://dx.doi.org/10.18653/v1/D18-1259)Cited by:[§B\.1](https://arxiv.org/html/2606.18508#A2.SS1.p5.1.1),[Figure 1](https://arxiv.org/html/2606.18508#S1.F1),[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p3.1)\.
- M\. Zhang, Y\. Tang, and P\. Team \(2025a\)PageIndex: next\-generation vectorless, reasoning\-based rag\.PageIndex Blog\.External Links:[Link](https://pageindex.ai/blog/pageindex-intro)Cited by:[§B\.2](https://arxiv.org/html/2606.18508#A2.SS2.p9.1.1),[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p4.1)\.
- W\. Zhang, X\. Li, Y\. Zhang, P\. Jia, Y\. Wang, H\. Guo, Y\. Liu, and X\. Zhao \(2025b\)Deep research: a survey of autonomous research agents\.External Links:2508\.12752,[Link](https://arxiv.org/abs/2508.12752)Cited by:[§1](https://arxiv.org/html/2606.18508#S1.p1.1)\.
- X\. Zhang, K\. Goswami, S\. Oymak, J\. Chen, and N\. Lipka \(2026\)SmartChunk retrieval: query\-aware chunk compression with planning for efficient document RAG\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Myti1QwL2t)Cited by:[§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, F\. Huang, and J\. Zhou \(2025c\)Qwen3 embedding: advancing text embedding and reranking through foundation models\.External Links:2506\.05176,[Link](https://arxiv.org/abs/2506.05176)Cited by:[§3\.3](https://arxiv.org/html/2606.18508#S3.SS3.p2.3),[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p1.1)\.
- T\. Zhang\*, V\. Kishore\*, F\. Wu\*, K\. Q\. Weinberger, and Y\. Artzi \(2020\)BERTScore: evaluating text generation with bert\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=SkeHuCVFDr)Cited by:[Appendix C](https://arxiv.org/html/2606.18508#A3.SS0.SSS0.Px2.p1.5),[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p5.2)\.
- J\. Zhao, Z\. Ji, Z\. Fan, H\. Wang, S\. Niu, B\. Tang, F\. Xiong, and Z\. Li \(2025a\)MoC: mixtures of text chunking learners for retrieval\-augmented generation system\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 5172–5189\.External Links:[Link](https://aclanthology.org/2025.acl-long.258/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.258),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2606.18508#S1.p2.1),[§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Zhao, Z\. Ji, Y\. Feng, P\. Qi, S\. Niu, B\. Tang, F\. Xiong, and Z\. Li \(2025b\)Meta\-chunking: learning text segmentation and semantic completion via logical perception\.External Links:2410\.12788,[Link](https://arxiv.org/abs/2410.12788)Cited by:[§B\.2](https://arxiv.org/html/2606.18508#A2.SS2.p3.1.1),[§1](https://arxiv.org/html/2606.18508#S1.p2.1),[§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p4.1)\.
- W\. X\. Zhao, J\. Liu, R\. Ren, and J\. Wen \(2024\)Dense text retrieval based on pretrained language models: a survey\.ACM Trans\. Inf\. Syst\.42\(4\)\.External Links:ISSN 1046\-8188,[Link](https://doi.org/10.1145/3637870),[Document](https://dx.doi.org/10.1145/3637870)Cited by:[§1](https://arxiv.org/html/2606.18508#S1.p1.1)\.
- H\. S\. Zheng, S\. Mishra, X\. Chen, H\. Cheng, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2024\)Take a step back: evoking reasoning via abstraction in large language models\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=3bq3jsvcQ1)Cited by:[§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Zheng, D\. Fu, X\. Hu, X\. Cai, L\. Ye, P\. Lu, and P\. Liu \(2025\)DeepResearcher: scaling deep research via reinforcement learning in real\-world environments\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 414–431\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.22/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.22),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.18508#S1.p2.1)\.
- W\. Zhou, J\. Zhang, H\. Hasson, A\. Singh, and W\. Li \(2024\)HyQE: ranking contexts with hypothetical query embeddings\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 13014–13032\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.761/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.761)Cited by:[§2](https://arxiv.org/html/2606.18508#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Zhu, Y\. Luo, D\. Xu, Y\. Yan, Z\. Liu, S\. Yu, R\. Wang, S\. Wang, Y\. Li, N\. Zhang, X\. Han, Z\. Liu, and M\. Sun \(2025\)RAGEval: scenario specific RAG evaluation dataset generation framework\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 8520–8544\.External Links:[Link](https://aclanthology.org/2025.acl-long.418/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.418),ISBN 979\-8\-89176\-251\-0Cited by:[§B\.1](https://arxiv.org/html/2606.18508#A2.SS1.p4.1.1),[§4\.1](https://arxiv.org/html/2606.18508#S4.SS1.p3.1)\.

## Appendix A Prompts Used for Training

This appendix lists the prompts used during training\. Prompt [A\.1](https://arxiv.org/html/2606.18508#A1) is used to generate base and expanded queries from training chunks\. Prompt [A\.2](https://arxiv.org/html/2606.18508#A1) is used by the LLM teacher to assign relevance labels to query–chunk pairs during distillation\.

Prompt A\.1: Query Expansion

You are given three consecutive chunks from a document: the previous chunk, the target chunk, and the next chunk\.

Your task has two steps\.

Step 1: Generate a base query\. Write a natural user question that requires information from the target chunk to answer\. The question should not directly copy the answer from the chunk, and it should not reveal the answer\.

Step 2: Generate an expanded query\. Rewrite the base query by adding useful background context from the previous and next chunks\. The expanded query should make the information need clearer, but it must not reveal the answer or include direct answer hints\. Use only background context that helps specify the topic, setting, entities, or surrounding discussion\.

Input:

Previous chunk: \{previous\_chunk\}

Target chunk: \{target\_chunk\}

Next chunk: \{next\_chunk\}

Output format:

Base query: \{base query\}

Expanded query: \{expanded query\}
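As a rough illustration of how Prompt A.1 might be wired into a data-generation script, the sketch below fills the three input fields for a target chunk and parses the two output lines. The function names are hypothetical, the use of an empty string for a missing neighbor at document boundaries is an assumption, and the LLM call itself is omitted.

```python
import re

# Input section of Prompt A.1; the instruction text would precede it in a full prompt.
INPUT_TEMPLATE = (
    "Input:\n"
    "Previous chunk: {previous_chunk}\n"
    "Target chunk: {target_chunk}\n"
    "Next chunk: {next_chunk}\n"
)


def build_inputs(chunks, i):
    """Return the template fields for target chunk i.

    Assumption: a missing neighbor (first or last chunk) is passed as "".
    """
    previous_chunk = chunks[i - 1] if i > 0 else ""
    next_chunk = chunks[i + 1] if i + 1 < len(chunks) else ""
    return {
        "previous_chunk": previous_chunk,
        "target_chunk": chunks[i],
        "next_chunk": next_chunk,
    }


def parse_queries(response):
    """Extract (base_query, expanded_query) from a response in the A.1 output format."""
    base = re.search(r"^Base query:\s*(.+)$", response, flags=re.MULTILINE)
    expanded = re.search(r"^Expanded query:\s*(.+)$", response, flags=re.MULTILINE)
    if base is None or expanded is None:
        raise ValueError("response does not follow the Prompt A.1 output format")
    return base.group(1).strip(), expanded.group(1).strip()


chunks = ["Intro text.", "Method text.", "Results text."]
prompt_inputs = INPUT_TEMPLATE.format(**build_inputs(chunks, 1))
```

Raising on a malformed response, rather than guessing, keeps badly formatted generations out of the training set; whether the original pipeline retries or discards such cases is not stated.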

Prompt A\.2: Teacher Relevance Labeling

You are given a question and a candidate knowledge chunk\. Decide whether the chunk contains information that is useful for answering the question\.

Mark the chunk as relevant only if it provides direct or supporting evidence needed to answer the question\. Do not mark a chunk as relevant based only on vague topical similarity\.

Question: \{expanded\_query\}

Candidate chunk: \{candidate\_chunk\}

Output only one number:

1 = relevant

0 = not relevant
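A minimal sketch of how the teacher's single-number output from Prompt A.2 could be turned into a training label. The helper names are hypothetical, and returning `None` for anything other than a bare `0` or `1` is an assumption about how malformed outputs might be handled, not something the paper specifies.

```python
# Input/output section of Prompt A.2; the instruction text would precede it.
TEACHER_TEMPLATE = (
    "Question: {expanded_query}\n"
    "Candidate chunk: {candidate_chunk}\n"
    "Output only one number:\n"
    "1 = relevant\n"
    "0 = not relevant"
)


def build_teacher_prompt(expanded_query, candidate_chunk):
    """Fill the two placeholders of the Prompt A.2 template."""
    return TEACHER_TEMPLATE.format(
        expanded_query=expanded_query, candidate_chunk=candidate_chunk
    )


def parse_label(response):
    """Map the teacher response to 1 (relevant) or 0 (not relevant).

    Assumption: any response that is not exactly "0" or "1" after stripping
    whitespace is treated as malformed and returned as None.
    """
    text = response.strip()
    if text in {"0", "1"}:
        return int(text)
    return None
```

A strict parser like this pairs naturally with the deterministic teacher decoding (temperature 0.0) reported in Appendix C, since repeated calls on the same pair should then yield the same label.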

## Appendix B Benchmark and Baseline Details

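| Dataset | Domain | Language | \#Queries \(eval\) | Corpus Size \(\#docs\) | Avg\. Doc\. Len\. \(tokens\) | Multi\-hop |
|---|---|---|---|---|---|---|
| SCI\-DOCS | Scientific | EN | 1,000 | 25k | 7,955 | ✗ |
| LegalBench\-RAG | Legal | EN | 6,858 | 714 | 27\.13k | ✗ |
| Dragonball | Finance/Legal/Medical | EN\+ZH | 6,711 | 2,311 | 11,436 | ✗ |
| HotpotQA | Open\-domain | EN | 113k | 105k | 1,247 | ✓ |
| SQuAD | Open\-domain | EN | 107,785 | 536 | 2,303 | ✗ |
| DRBench | Enterprise | EN | 1,093 | 1,093 | 1,089 | ✓ |
| LongBenchV2 | Multi\-task | EN | 503 | 503 | 59\.38k | ✓ |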

Table 6: Statistics of the seven benchmark datasets used in our evaluation\. “Avg\. Doc\. Len\.” reports average document length in characters\. “\#Queries \(eval\)” refers to the number of queries used in our experiments\. “Multi\-hop” indicates whether the benchmark requires cross\-document reasoning\.

### B\.1 Benchmark Dataset Details

We evaluate MCompassRAG on seven benchmarks spanning scientific, legal, open\-domain multi\-hop, reading comprehension, enterprise deep research, and long\-context tasks\. Table [6](https://arxiv.org/html/2606.18508#A2.T6) summarizes key statistics\.

SCI\-DOCS Cohan et al\. \([2020](https://arxiv.org/html/2606.18508#bib.bib36)\) is a comprehensive evaluation suite for scientific document embeddings, covering seven document\-level tasks ranging from citation prediction and document classification to recommendation, and including tens of thousands of examples of anonymized user signals of document relatedness\. It was introduced alongside the SPECTER model to address the limitation that prior evaluations of scientific document representations focused on small datasets over a limited set of tasks, where extremely high AUC scores were already achievable\. The corpus consists of scientific paper abstracts, which are naturally multi\-topic and stylistically homogeneous, making it a natural testbed for topic\-guided retrieval\.

LegalBench\-RAG Pipitone and Alami \([2024](https://arxiv.org/html/2606.18508#bib.bib37)\) is the first benchmark designed specifically to evaluate the retrieval step of RAG pipelines in the legal domain\. It is constructed by retracing the context used in LegalBench queries back to their original locations within the legal corpus, resulting in 6,858 query\-answer pairs over a corpus of over 79 million characters, entirely human\-annotated by legal experts\. The dataset covers a diverse range of legal documents including NDAs, M&A agreements, commercial contracts, and privacy policies\. The benchmark demands precise, minimal snippet retrieval rather than broad document recall, making it an especially challenging test of fine\-grained retrieval\.

Dragonball Zhu et al\. \([2025](https://arxiv.org/html/2606.18508#bib.bib38)\) is released as part of the RAGEval framework\. It contains 6,711 questions meticulously designed to reflect the complexity and specificity of their domains, covering finance, legal, and medical scenarios in both Chinese and English\. The framework introduces three novel keypoint\-based metrics \(Completeness, Hallucination, and Irrelevance\) to evaluate generated responses by distilling standard answers into 3–5 key points encompassing indispensable factual information and final conclusions\. Dragonball’s multilingual and multi\-domain construction stresses retrieval systems operating over heterogeneous, topically distinct evidence pools\.

HotpotQA Yang et al\. \([2018](https://arxiv.org/html/2606.18508#bib.bib39)\) contains 113k Wikipedia\-based question\-answer pairs featuring four key properties: questions require finding and reasoning over multiple supporting documents; questions are diverse and unconstrained by any knowledge base schema; sentence\-level supporting facts are provided for reasoning supervision; and a category of factoid comparison questions tests the ability to extract and compare relevant facts across entities\. Sentence\-level supporting fact annotations make HotpotQA directly usable for chunk\-level retrieval evaluation; its multi\-hop structure requires retrievers to surface evidence distributed across distinct document segments\.

SQuAD Rajpurkar et al\. \([2016](https://arxiv.org/html/2606.18508#bib.bib40)\) contains 107,785 question\-answer pairs on 536 Wikipedia articles, where the answer to every question is a text span from the corresponding reading passage\. It covers a wide range of topics from musical celebrities to abstract concepts\. Unlike HotpotQA, SQuAD questions are largely single\-passage answerable, providing a complementary single\-hop retrieval axis in our evaluation\.

DRBench Abaskohi et al\. \([2026](https://arxiv.org/html/2606.18508#bib.bib31)\) evaluates AI agents on complex, open\-ended deep research tasks in enterprise settings, requiring agents to identify supporting facts from both the public web and private company knowledge bases\. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web\. The benchmark targets report generation in enterprise deep research settings, comprising 100 tasks with a total of 1,093 sub\-questions\.

LongBenchV2 Bai et al\. \([2025](https://arxiv.org/html/2606.18508#bib.bib32)\) consists of 503 challenging multiple\-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single\-document QA, multi\-document QA, long in\-context learning, long\-dialogue history understanding, code repository understanding, and long structured data understanding\. Data was collected from nearly 100 highly educated individuals with diverse professional backgrounds\. LongBenchV2 is used exclusively for downstream generation evaluation, as it does not provide chunk\-level evidence labels that can serve as retrieval ground truth\.

### B\.2 Baseline Method Details

We compare against eleven baselines\. We describe each method’s core methodology below, along with which part of the pipeline—retrieval or generation—it primarily targets\.

DenseXRetrieval Chen et al\. \([2024b](https://arxiv.org/html/2606.18508#bib.bib5)\) introduces the *proposition* as a novel retrieval unit for dense retrieval\. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self\-contained natural language format\. A fine\-tuned generation model called the Propositionizer, trained via a two\-step distillation process, decomposes passages into their constituent propositions at indexing time\.

Meta\-Chunking \(PPL and MSP\) Zhao et al\. \([2025b](https://arxiv.org/html/2606.18508#bib.bib6)\) leverages LLMs’ logical perception capabilities to identify optimal text segment boundaries, moving beyond fixed\-size and similarity\-based chunking\. It defines a meta\-chunk granularity between sentences and paragraphs, consisting of sentences with deep linguistic logical connections\. Two adaptive uncertainty\-driven strategies are proposed: *Perplexity \(PPL\) Chunking*, which identifies boundaries by analyzing the context perplexity distribution of an LLM, splitting at points of certainty and keeping text intact at points of uncertainty; and *Margin Sampling \(MSP\) Chunking*, which uses LLMs to perform binary classification on whether consecutive sentences should be segmented based on the probability difference from margin sampling\. Additionally, a global information compensation mechanism, comprising a two\-stage hierarchical summary generation process and a three\-stage chunk rewriting procedure, preserves semantic integrity and contextual coherence across chunks\.

RAPTOR Sarthi et al\. \([2024](https://arxiv.org/html/2606.18508#bib.bib7)\) introduces the novel approach of recursively embedding, clustering, and summarizing chunks of text to construct a tree with differing levels of summarization from the bottom up\. At inference time, retrieval operates across all tree levels, enabling queries to be answered by combining evidence from fine\-grained passages and their higher\-level summaries\.

SAKI\-RAG Tao et al\. \([2025](https://arxiv.org/html/2606.18508#bib.bib1)\) addresses context fragmentation in long\-document RAG via two core components: \(1\) the SentenceAttnLinker, which constructs a semantically enriched knowledge repository by modeling inter\-sentence attention relationships; and \(2\) the Dual\-Axis Retriever, which expands and filters candidate chunks along both the semantic similarity and contextual relevance dimensions\.

ReflectiveRAG Verma et al\. \([2026](https://arxiv.org/html/2606.18508#bib.bib14)\) addresses two persistent inefficiencies in standard RAG: static top\-$k$ retrieval regardless of evidence sufficiency, and context redundancy from semantically overlapping retrieved passages\. Current methods \(fixed top\-$k$ retrieval, cross\-encoder reranking, or policy\-based iteration\) rely on static heuristics or costly reinforcement learning, failing to assess evidence sufficiency or reduce redundancy\. ReflectiveRAG introduces a Self\-Reflective Retrieval \(SRR\) module that uses a compact language model to iteratively evaluate whether retrieved evidence is sufficient or requires further query reformulation, alongside a Noise Removal \(NR\) module that scores and filters retrieved chunks by relevance minus redundancy\.

DF\-RAG Khan et al\. \([2026](https://arxiv.org/html/2606.18508#bib.bib15)\) systematically incorporates diversity into the retrieval step to improve performance on complex, reasoning\-intensive QA benchmarks\. It builds upon the Maximal Marginal Relevance framework to select information chunks that are both relevant to the query and maximally dissimilar from each other\. A key innovation is its ability to optimize the level of diversity for each query dynamically at test time without requiring any additional fine\-tuning or prior information\.
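For readers unfamiliar with Maximal Marginal Relevance, the sketch below shows the standard greedy MMR selection rule that DF-RAG is described as building on. It is not DF-RAG itself: the fixed trade-off weight `lam` stands in for the per-query diversity level that DF-RAG reportedly tunes at test time, and the function names and toy vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, chunk_vecs, k, lam=0.5):
    """Greedy MMR: return indices of up to k chunks.

    Each step picks the chunk maximizing
        lam * sim(query, chunk) - (1 - lam) * max sim(chunk, already selected).
    lam = 1.0 reduces to plain relevance ranking.
    """
    relevance = [cosine(query_vec, c) for c in chunk_vecs]
    selected = []
    remaining = list(range(len(chunk_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            redundancy = max(
                (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example: chunks 0 and 1 are near-duplicates, chunk 2 covers a different direction.
query = [1.0, 1.0]
chunks = [[1.0, 0.9], [1.0, 0.8], [0.0, 1.0]]
by_relevance = mmr_select(query, chunks, k=2, lam=1.0)
with_diversity = mmr_select(query, chunks, k=2, lam=0.5)
```

In the toy example, pure relevance ranking keeps both near-duplicate chunks, while the diversity-weighted setting swaps the second one for the less similar chunk.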

REFRAG Lin et al\. \([2025](https://arxiv.org/html/2606.18508#bib.bib19)\) targets generation\-side efficiency by exploiting block\-diagonal attention patterns that arise from low inter\-passage semantic similarity among retrieved chunks\. It uses a compress–sense–expand framework: a lightweight encoder compresses each retrieved chunk into compact embeddings fed directly to the decoder; an RL\-trained policy selectively determines which chunks require full token\-level expansion; and the decoder operates over a substantially shorter effective input\.

PageIndex Zhang et al\. \([2025a](https://arxiv.org/html/2606.18508#bib.bib41)\) replaces the standard chunk–embed–vector search pipeline with a hierarchical tree index built from documents, using an LLM to reason over that tree, analogous to how a human expert scans a table of contents\. Rather than passive similarity lookup, PageIndex performs active tree search, with the LLM navigating document structure across multiple reasoning steps\. Retrieval happens inline during the model’s reasoning process, allowing the system to begin streaming immediately without a blocking retrieval gate before the first token\.

A\-RAG Du et al\. \([2026](https://arxiv.org/html/2606.18508#bib.bib42)\) proposes an agentic RAG framework that exposes hierarchical retrieval interfaces directly to the language model\. Unlike existing methods that either retrieve passages in a single shot and concatenate them into input, or predefine a workflow and prompt the model to execute it step\-by\-step, A\-RAG allows the model to adapt the retrieval strategy based on the specific task, choose different interaction strategies, and decide when sufficient evidence has been gathered to provide an answer\. A\-RAG satisfies three principles of agentic autonomy: Autonomous Strategy, Iterative Execution, and Interleaved Tool Use, making it a truly agentic framework\.

Chroma Context\-1 Bashir et al\. \([2026](https://arxiv.org/html/2606.18508#bib.bib43)\) is a 20B parameter agentic search model derived from GPT\-OSS\-20B OpenAI \([2025](https://arxiv.org/html/2606.18508#bib.bib44)\) that achieves retrieval performance comparable to frontier\-scale LLMs at a fraction of the cost and up to 10$\times$ faster inference speed\. It is designed to be used as a subagent in conjunction with a frontier reasoning model: given a query, it produces a ranked list of documents relevant to satisfying the query\. The model is trained to decompose queries into subqueries, iteratively search a corpus, and selectively edit its own context to free capacity for further exploration\. A key mechanism is self\-editing context management, in which the agent actively discards retrieved passages deemed irrelevant as the context window fills, preventing context rot during long\-horizon multi\-hop retrieval\.

## Appendix C Training and Implementation Details

| Method | Dragonball IE@1 ↑ | Dragonball Prec\.@1 ↑ | Dragonball Rec\.@1 ↑ | HotpotQA IE@1 ↑ | HotpotQA Prec\.@1 ↑ | HotpotQA Rec\.@1 ↑ | SQuAD IE@1 ↑ | SQuAD Prec\.@1 ↑ | SQuAD Rec\.@1 ↑ |
|---|---|---|---|---|---|---|---|---|---|
| RAPTOR | 2\.74 | 36\.40 | 7\.53 | 5\.40 | 55\.63 | 9\.70 | 5\.40 | 29\.77 | 18\.13 |
| Meta\-Chunking\-MSP | 3\.21 | 37\.20 | 8\.63 | 8\.42 | 60\.30 | 13\.97 | 12\.24 | 38\.97 | 31\.40 |
| Meta\-Chunking\-PPL | 5\.07 | 39\.80 | 12\.73 | 10\.65 | 61\.23 | 17\.40 | 11\.78 | 38\.37 | 30\.70 |
| DenseXRetrieval | 0\.00 | 1\.08 | 0\.32 | 1\.19 | 39\.17 | 3\.03 | 4\.74 | 28\.17 | 16\.83 |
| SAKI\-RAG | 15\.31 | 68\.37 | 22\.40 | 13\.43 | 51\.60 | 26\.03 | 65\.15 | 85\.80 | 75\.93 |
| LLM | 17\.87 | 73\.53 | 24\.30 | 15\.29 | 51\.83 | 29\.50 | 70\.70 | 88\.63 | 79\.77 |
| LLM \+ 10 Topics | 26\.32 | 84\.43 | 31\.17 | 21\.41 | 55\.33 | 38\.70 | 80\.30 | 92\.83 | 86\.50 |
| MCompassRAG \+ 10 Topics | 23\.46 | 79\.80 | 29\.40 | 19\.19 | 52\.40 | 36\.63 | 79\.35 | 92\.37 | 85\.90 |

| Method | DRBench IE@1 ↑ | DRBench Prec\.@1 ↑ | DRBench Rec\.@1 ↑ | LegalBench\-RAG IE@1 ↑ | LegalBench\-RAG Prec\.@1 ↑ | LegalBench\-RAG Rec\.@1 ↑ | SCI\-DOCS IE@1 ↑ | SCI\-DOCS Prec\.@1 ↑ | SCI\-DOCS Rec\.@1 ↑ |
|---|---|---|---|---|---|---|---|---|---|
| RAPTOR | 1\.38 | 29\.27 | 4\.70 | 1\.52 | 29\.23 | 5\.20 | 62\.51 | 80\.27 | 77\.87 |
| Meta\-Chunking\-MSP | 2\.87 | 32\.63 | 8\.80 | 2\.67 | 33\.10 | 8\.07 | 64\.50 | 81\.03 | 79\.60 |
| Meta\-Chunking\-PPL | 4\.32 | 34\.07 | 12\.67 | 3\.65 | 34\.53 | 10\.57 | 0\.16 | 15\.10 | 1\.07 |
| DenseXRetrieval | 0\.42 | 21\.87 | 1\.93 | 0\.47 | 21\.93 | 2\.13 | 55\.45 | 76\.83 | 72\.17 |
| SAKI\-RAG | 14\.54 | 58\.80 | 24\.73 | 7\.04 | 43\.30 | 16\.27 | 73\.43 | 89\.77 | 81\.80 |
| LLM | 18\.68 | 64\.93 | 28\.77 | 9\.07 | 47\.40 | 19\.13 | 78\.68 | 92\.60 | 84\.97 |
| LLM \+ 10 Topics | 31\.71 | 79\.67 | 39\.80 | 15\.08 | 56\.47 | 26\.70 | 89\.19 | 99\.10 | 90\.00 |
| MCompassRAG \+ 10 Topics | 28\.30 | 75\.07 | 37\.70 | 12\.97 | 52\.10 | 24\.90 | 88\.03 | 98\.25 | 89\.60 |

Table 7: Retrieval performance at depth $k=1$ across six benchmarks \(IE@1 ↑, Precision@1 ↑, Recall@1 ↑\)\. Bold = best; underline = second\-best\. MCompassRAG rows are shaded\. LLM and LLM \+ 10 Topics are oracle upper bounds that use an LLM at retrieval time\.

#### Training details\.

For each benchmark, MCompassRAG is trained separately using its corresponding training split\. When a benchmark does not provide a sufficiently large training set, we use 10% of the available data for synthetic training data construction\. For DRBench Abaskohi et al\. \([2026](https://arxiv.org/html/2606.18508#bib.bib31)\) and LongBenchV2 Bai et al\. \([2025](https://arxiv.org/html/2606.18508#bib.bib32)\), which are smaller and do not provide suitable retrieval training labels, we train using EDR\-200 Prabhakar et al\. \([2025](https://arxiv.org/html/2606.18508#bib.bib33)\) and LongBenchV1 Bai et al\. \([2024](https://arxiv.org/html/2606.18508#bib.bib34)\), respectively\. For each dataset, we sample 2,000 training chunks and generate 10 synthetic queries per chunk, resulting in 20,000 query–chunk pairs before negative sampling\. We train the metadata selector, abstraction module, and MLP relevance classifier while keeping the student encoder, topic centroids, and cached chunk\-topic distributions fixed\. Unless otherwise specified, all hyperparameters follow our default setting: AdamW Loshchilov and Hutter \([2019](https://arxiv.org/html/2606.18508#bib.bib35)\) with learning rate $2\times 10^{-5}$, batch size 16, weight decay 0\.01, dropout 0\.1, and 3 training epochs\. The distillation temperature is set to $\tau=1.0$, and the loss interpolation coefficient is set to $\alpha=0.5$\. For generation, we use temperature $\tau=0.7$ and top\-$p=0.9$; for teacher relevance scoring, we use temperature $\tau=0.0$ to obtain deterministic judgments\.
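For reference, the default settings listed above can be collected in a single configuration object. The sketch below is only one possible way to organize them: the class and field names are hypothetical and are not taken from the released code, while the values are those stated in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Default hyperparameters as reported in the text (field names are illustrative)."""

    optimizer: str = "AdamW"
    learning_rate: float = 2e-5
    batch_size: int = 16
    weight_decay: float = 0.01
    dropout: float = 0.1
    epochs: int = 3
    distill_temperature: float = 1.0   # tau used in the distillation loss
    loss_alpha: float = 0.5            # loss interpolation coefficient alpha
    generation_temperature: float = 0.7
    generation_top_p: float = 0.9
    teacher_temperature: float = 0.0   # deterministic teacher relevance judgments
    chunks_per_dataset: int = 2000
    queries_per_chunk: int = 10

    @property
    def query_chunk_pairs(self) -> int:
        """Number of query-chunk pairs before negative sampling."""
        return self.chunks_per_dataset * self.queries_per_chunk


cfg = TrainConfig()
pairs = cfg.query_chunk_pairs  # 2,000 chunks x 10 queries per chunk
```

The derived property reproduces the 20,000 query–chunk pairs mentioned in the text; the size of the final training set after negative sampling is not given here and is therefore left out.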

#### Evaluation\.

Because the compared methods use different chunk granularities, evaluating all systems with a fixed number of retrieved chunks can be unfair: the same top\-$k$ may correspond to very different amounts of retrieved text\. We therefore use two complementary evaluation protocols\. For retrieval quality, we report Recall, Precision, and Information Efficiency \(IE\), where $\mathrm{IE@k}=\mathrm{Precision@k}\times\mathrm{Recall@k}$\. These metrics are computed at $k\in\{1,3,5\}$ and averaged over three runs\. For downstream evaluation, we use task\-appropriate generation metrics, including Accuracy, F1, ROUGE\-L Lin \([2004](https://arxiv.org/html/2606.18508#bib.bib46)\), METEOR Banerjee and Lavie \([2005](https://arxiv.org/html/2606.18508#bib.bib45)\), and BERTScore Zhang\* et al\. \([2020](https://arxiv.org/html/2606.18508#bib.bib47)\), depending on the benchmark\. To ensure fairness in downstream comparisons, retrieved chunks are added in ranked order until a fixed token budget is reached \(1K\), so each method provides the generator with the same maximum amount of evidence regardless of its chunk size\. This protocol evaluates retrieval methods under comparable evidence budgets while still allowing each method to use its own native chunking strategy\. We use $L=50$ and $M=10$ in our experiments\.
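The two protocols can be sketched as follows. This is an illustration under stated assumptions rather than the authors' evaluation code: precision and recall are computed over chunk identifiers, which may differ from how relevance is matched across methods with different chunk sizes; tokens are counted by whitespace splitting; the 1K budget is taken as 1,000 tokens; and a chunk that would overflow the budget is dropped rather than truncated. All function names are hypothetical.

```python
def precision_recall_ie_at_k(ranked_ids, relevant_ids, k):
    """Return (Precision@k, Recall@k, IE@k) as fractions, with IE@k = P@k * R@k."""
    top_k = ranked_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall, precision * recall


def pack_to_budget(ranked_chunks, budget_tokens=1000,
                   count_tokens=lambda text: len(text.split())):
    """Add chunks in ranked order until the next one would exceed the token budget."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return packed


# Toy retrieval result: 2 of the top 3 are relevant, out of 3 relevant chunks overall.
p, r, ie = precision_recall_ie_at_k(["a", "b", "c"], {"a", "c", "d"}, k=3)

# Toy budget of 5 whitespace tokens: the first two chunks fit, the third does not.
evidence = pack_to_budget(["a b c", "d e", "f g h i"], budget_tokens=5)
```

When precision and recall are expressed as percentages, as in the tables, the product is divided by 100. For example, the RAPTOR entry for Dragonball in Table 7 gives 36.40 × 7.53 / 100 ≈ 2.74, which matches the reported IE@1.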

## Appendix D Retrieval Performance at Different Cutoffs

| Method | Dragonball IE@3 ↑ | Dragonball Prec.@3 ↑ | Dragonball Rec.@3 ↑ | HotpotQA IE@3 ↑ | HotpotQA Prec.@3 ↑ | HotpotQA Rec.@3 ↑ | SQuAD IE@3 ↑ | SQuAD Prec.@3 ↑ | SQuAD Rec.@3 ↑ |
|---|---|---|---|---|---|---|---|---|---|
| RAPTOR | 3.78 | 38.65 | 9.78 | 7.45 | 58.63 | 12.70 | 6.53 | 32.02 | 20.38 |
| Meta-Chunking-MSP | 4.29 | 39.45 | 10.88 | 10.74 | 63.30 | 16.97 | 13.87 | 41.22 | 33.65 |
| Meta-Chunking-PPL | 6.30 | 42.05 | 14.98 | 13.10 | 64.23 | 20.40 | 13.38 | 40.62 | 32.95 |
| DenseXRetrieval | 0.03 | 2.65 | 1.07 | 2.54 | 42.17 | 6.03 | 5.80 | 30.42 | 19.08 |
| SAKI-RAG | 17.41 | 70.62 | 24.65 | 15.85 | 54.60 | 29.03 | 68.84 | 88.05 | 78.18 |
| LLM | 20.12 | 75.78 | 26.55 | 17.82 | 54.83 | 32.50 | 74.54 | 90.88 | 82.02 |
| LLM + 10 Topics | 28.97 | 86.68 | 33.42 | 24.32 | 58.33 | 41.70 | 84.38 | 95.08 | 88.75 |
| MCompassRAG + 10 Topics | 25.97 | 82.05 | 31.65 | 21.96 | 55.40 | 39.63 | 83.41 | 94.62 | 88.15 |

| Method | DRBench IE@3 ↑ | DRBench Prec.@3 ↑ | DRBench Rec.@3 ↑ | LegalBench-RAG IE@3 ↑ | LegalBench-RAG Prec.@3 ↑ | LegalBench-RAG Rec.@3 ↑ | SCI-DOCS IE@3 ↑ | SCI-DOCS Prec.@3 ↑ | SCI-DOCS Rec.@3 ↑ |
|---|---|---|---|---|---|---|---|---|---|
| RAPTOR | 2.34 | 31.90 | 7.32 | 2.35 | 31.48 | 7.45 | 65.51 | 82.14 | 79.75 |
| Meta-Chunking-MSP | 4.03 | 35.26 | 11.43 | 3.65 | 35.35 | 10.32 | 67.55 | 82.91 | 81.47 |
| Meta-Chunking-PPL | 5.62 | 36.70 | 15.30 | 4.72 | 36.78 | 12.82 | 0.50 | 16.98 | 2.94 |
| DenseXRetrieval | 1.11 | 24.50 | 4.55 | 1.06 | 24.18 | 4.38 | 58.28 | 78.70 | 74.05 |
| SAKI-RAG | 16.80 | 61.42 | 27.36 | 8.44 | 45.55 | 18.52 | 76.68 | 91.64 | 83.67 |
| LLM | 21.21 | 67.56 | 31.40 | 10.62 | 49.65 | 21.38 | 82.04 | 94.47 | 86.84 |
| LLM + 10 Topics | 34.91 | 82.30 | 42.42 | 17.00 | 58.72 | 28.95 | 91.33 | 99.40 | 91.88 |
| MCompassRAG + 10 Topics | 31.33 | 77.69 | 40.33 | 14.76 | 54.35 | 27.15 | 90.41 | 98.84 | 91.47 |

Table 8: Retrieval performance at depth $k=3$ across six benchmarks (IE@3 ↑, Precision@3 ↑, Recall@3 ↑). Bold = best; underline = second-best. MCompassRAG rows are shaded.

Tables [7](https://arxiv.org/html/2606.18508#A3.T7), [8](https://arxiv.org/html/2606.18508#A4.T8), and [9](https://arxiv.org/html/2606.18508#A4.T9) report retrieval performance at cutoffs $k=1$, $k=3$, and $k=5$, respectively, across all six benchmarks. As expected, both precision and recall increase monotonically with $k$ for all methods, since retrieving more documents provides greater coverage of relevant passages. The relative ordering of methods remains consistent across all cutoffs: MCompassRAG outperforms all non-oracle baselines at every depth while staying within a narrow margin of the LLM + 10 Topics oracle, which relies on an LLM at retrieval time. This consistency demonstrates that the gains from topic-guided retrieval are not specific to any particular cutoff, but reflect a robust improvement in retrieval quality across the full range of evaluation settings reported here.

| Method | Dragonball IE@5 ↑ | Dragonball Prec.@5 ↑ | Dragonball Rec.@5 ↑ | HotpotQA IE@5 ↑ | HotpotQA Prec.@5 ↑ | HotpotQA Rec.@5 ↑ | SQuAD IE@5 ↑ | SQuAD Prec.@5 ↑ | SQuAD Rec.@5 ↑ |
|---|---|---|---|---|---|---|---|---|---|
| RAPTOR | 6.16 | 43.15 | 14.28 | 12.09 | 64.63 | 18.70 | 9.09 | 36.52 | 24.88 |
| Meta-Chunking-MSP | 6.76 | 43.95 | 15.38 | 15.92 | 69.30 | 22.97 | 17.44 | 45.72 | 38.15 |
| Meta-Chunking-PPL | 9.07 | 46.55 | 19.48 | 18.54 | 70.23 | 26.40 | 16.90 | 45.12 | 37.45 |
| DenseXRetrieval | 0.02 | 8.15 | 0.20 | 5.79 | 48.17 | 12.03 | 8.23 | 34.92 | 23.58 |
| SAKI-RAG | 21.90 | 75.12 | 29.15 | 21.23 | 60.60 | 35.03 | 76.52 | 92.55 | 82.68 |
| LLM | 24.93 | 80.28 | 31.05 | 23.42 | 60.83 | 38.50 | 82.52 | 95.38 | 86.52 |
| LLM + 10 Topics | 34.58 | 91.18 | 37.92 | 30.69 | 64.33 | 47.70 | 92.86 | 99.58 | 93.25 |
| MCompassRAG + 10 Topics | 31.29 | 86.55 | 36.15 | 28.02 | 61.40 | 45.63 | 91.83 | 99.12 | 92.65 |

| Method | DRBench IE@5 ↑ | DRBench Prec.@5 ↑ | DRBench Rec.@5 ↑ | LegalBench-RAG IE@5 ↑ | LegalBench-RAG Prec.@5 ↑ | LegalBench-RAG Rec.@5 ↑ | SCI-DOCS IE@5 ↑ | SCI-DOCS Prec.@5 ↑ | SCI-DOCS Rec.@5 ↑ |
|---|---|---|---|---|---|---|---|---|---|
| RAPTOR | 4.67 | 37.15 | 12.57 | 4.30 | 35.98 | 11.95 | 71.72 | 85.89 | 83.50 |
| Meta-Chunking-MSP | 6.76 | 40.51 | 16.68 | 5.91 | 39.85 | 14.82 | 73.85 | 86.66 | 85.22 |
| Meta-Chunking-PPL | 8.62 | 41.95 | 20.55 | 7.15 | 41.28 | 17.32 | 1.39 | 20.73 | 6.70 |
| DenseXRetrieval | 2.92 | 29.75 | 9.80 | 2.55 | 28.68 | 8.88 | 64.15 | 82.45 | 77.80 |
| SAKI-RAG | 21.74 | 66.67 | 32.61 | 11.52 | 50.05 | 23.02 | 83.39 | 95.39 | 87.42 |
| LLM | 26.68 | 72.81 | 36.65 | 14.01 | 54.15 | 25.88 | 88.98 | 98.22 | 90.59 |
| LLM + 10 Topics | 41.74 | 87.55 | 47.67 | 21.15 | 63.22 | 33.45 | 95.62 | 100.00 | 95.62 |
| MCompassRAG + 10 Topics | 37.80 | 82.94 | 45.58 | 18.63 | 58.85 | 31.65 | 95.22 | 100.00 | 95.22 |

Table 9: Retrieval performance at depth $k=5$ across six benchmarks (IE@5 ↑, Precision@5 ↑, Recall@5 ↑). Bold = best; underline = second-best. MCompassRAG rows are shaded.

## Appendix E Effect of Topic Granularity of Topic Model

Table [10](https://arxiv.org/html/2606.18508#A5.T10) reports retrieval performance as a function of the number of topics $K$ in the underlying topic model. Two consistent patterns emerge across all three benchmarks. First, performance peaks at $K=100$ and degrades monotonically as $K$ increases beyond this point. At very high granularities ($K=500$–$2000$), topic representations become increasingly fine-grained and sparse, making each topic centroid less representative of a coherent semantic direction. As a result, the weighted aggregation of topic centroids produces chunk and query representations that are noisier and harder to match reliably. Second, the student–teacher gap is largest at $K=100$ and nearly vanishes at high $K$. At $K=100$, the LLM teacher can exploit the richer and more semantically coherent per-topic structure to outperform the student, which receives only a compressed topic summary. At $K\geq 500$, both the teacher and the student suffer equally from the degraded topic quality, and their performance converges. Together, these results suggest that a moderate topic granularity of $K=100$ strikes the best balance between topic coherence and coverage, and we use this setting across all experiments in the main paper. This finding is complementary to the analysis in Section [5](https://arxiv.org/html/2606.18508#S5), which studied the effect of how many topic signals are passed to the model at inference time: here we show that the quality of those signals, determined by $K$, is equally important. Even with an optimal number of passed topics, overly fine-grained or coarse topic models will degrade retrieval quality.

| K | Method | SCI-DOCS Recall ↑ | SCI-DOCS Precision ↑ | SCI-DOCS IE ↑ | LegalBench-RAG Recall ↑ | LegalBench-RAG Precision ↑ | LegalBench-RAG IE ↑ | Dragonball Recall ↑ | Dragonball Precision ↑ | Dragonball IE ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| 50 | MCompassRAG | 88.83 | 93.37 | 86.87 | 36.23 | 51.40 | 26.30 | 36.73 | 78.53 | 30.47 |
| 50 | LLM | 92.43 | 98.13 | 91.90 | 37.70 | 55.43 | 27.83 | 38.43 | 82.60 | 32.07 |
| 100 | MCompassRAG | 94.13 | 99.03 | 92.10 | 38.40 | 55.10 | 27.90 | 38.97 | 82.80 | 32.40 |
| 100 | LLM | 98.30 | 99.63 | 98.03 | 40.10 | 59.47 | 29.70 | 40.83 | 87.43 | 34.17 |
| 500 | MCompassRAG | 86.53 | 89.63 | 83.47 | 35.30 | 49.87 | 25.27 | 35.60 | 74.90 | 28.77 |
| 500 | LLM | 87.20 | 90.40 | 84.03 | 35.57 | 50.30 | 25.43 | 35.97 | 75.37 | 28.97 |
| 1000 | MCompassRAG | 84.80 | 87.10 | 81.17 | 34.60 | 48.47 | 24.57 | 34.63 | 72.83 | 27.80 |
| 1000 | LLM | 85.13 | 87.47 | 81.50 | 34.73 | 48.67 | 24.67 | 34.83 | 73.07 | 27.90 |
| 2000 | MCompassRAG | 83.40 | 85.23 | 79.60 | 34.03 | 47.43 | 24.10 | 33.90 | 71.27 | 27.07 |
| 2000 | LLM | 83.57 | 85.40 | 79.83 | 34.10 | 47.53 | 24.17 | 34.00 | 71.40 | 27.13 |

Table 10: Effect of topic model granularity ($K$) on retrieval performance across three datasets. Results are reported for MCompassRAG and LLM-based methods.
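
The student–teacher gap described in this appendix can be read directly off Table 10. The short sketch below recomputes it for one column, using the values in the first SCI-DOCS column of the table (labelled Recall there); the dictionary layout and the helper name `teacher_student_gaps` are illustrative choices, not part of the released code.

```python
from typing import Dict, Tuple

# K -> (MCompassRAG, LLM), copied from the first SCI-DOCS column of Table 10.
SCI_DOCS: Dict[int, Tuple[float, float]] = {
    50: (88.83, 92.43),
    100: (94.13, 98.30),
    500: (86.53, 87.20),
    1000: (84.80, 85.13),
    2000: (83.40, 83.57),
}


def teacher_student_gaps(rows: Dict[int, Tuple[float, float]]) -> Dict[int, float]:
    """Teacher (LLM) score minus student (MCompassRAG) score for each K."""
    return {k: round(llm - student, 2) for k, (student, llm) in rows.items()}


gaps = teacher_student_gaps(SCI_DOCS)
for k, gap in gaps.items():
    print(f"K={k:<5d} gap={gap:.2f}")

best_k_student = max(SCI_DOCS, key=lambda k: SCI_DOCS[k][0])
widest_gap_k = max(gaps, key=gaps.get)
print(best_k_student, widest_gap_k)
```

On this column, both the student's best score and the widest gap fall at $K=100$, and the gap shrinks to well under one point for $K\geq 500$.
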

## Appendix F Topic Model Domain Adaptation: Training on Target Corpus

In the main experiments, we use a topic model trained on WikiWeb2M to provide a general-purpose set of topic centroids and document-topic vectors. While this setting tests whether MCompassRAG can rely on a broadly trained topic model, some benchmarks contain domain-specific terminology and evidence structures that may not be fully captured by a general corpus. We therefore evaluate an in-domain variant in which the topic model is trained directly on the target corpus of each benchmark, while keeping the rest of the MCompassRAG pipeline unchanged.

Table [11](https://arxiv.org/html/2606.18508#A6.T11) compares the default WikiWeb2M-trained topic model with target-corpus topic models on Dragonball, LegalBench-RAG, and SCI-DOCS. Training the topic model on the target corpus improves performance across all three datasets, with larger gains on LegalBench-RAG and Dragonball, where domain-specific terminology, entities, and narrative structure are especially important. However, the gains are moderate rather than dramatic, showing that MCompassRAG does not require retraining the topic model for every new corpus. This is important for practical deployment: a general-purpose topic model can already provide useful metadata guidance, while in-domain topic modeling can be used as an optional enhancement when sufficient target-corpus data and training budget are available.

| Topic Model Training Corpus | Dragonball IE ↑ | Dragonball Prec. ↑ | Dragonball Rec. ↑ | LegalBench-RAG IE ↑ | LegalBench-RAG Prec. ↑ | LegalBench-RAG Rec. ↑ | SCI-DOCS IE ↑ | SCI-DOCS Prec. ↑ | SCI-DOCS Rec. ↑ |
|---|---|---|---|---|---|---|---|---|---|
| WikiWeb2M | 38.97 | 82.80 | 32.40 | 38.40 | 55.10 | 27.90 | 94.13 | 99.03 | 92.10 |
| Target Corpus | 39.26 | 83.71 | 32.83 | 40.18 | 57.36 | 29.64 | 94.82 | 99.21 | 92.86 |

Table 11: Effect of training the topic model on the target corpus. Results report IE ↑, Precision ↑, and Recall ↑, averaged over retrieval cutoffs $k=1,3,5$. The WikiWeb2M row corresponds to the main MCompassRAG configuration, while the Target Corpus row trains the topic model on the corresponding benchmark corpus before running the same retrieval pipeline.

## Appendix G Qualitative Analysis

![Refer to caption](https://arxiv.org/html/2606.18508v1/x5.png)

Figure 4: Qualitative retrieval comparison on LegalBench-RAG for a query about the definition of *Superior Proposal* in an M&A acquisition agreement. Top: five retrieval candidates from the §6.03 region; the gold chunk (C3, teal border) competes against four topically adjacent clauses sharing substantial surface vocabulary. Bottom left: dense retrieval ranks C2 (*Acquisition Proposal* definition) above C3 due to overlapping tokens, missing the gold chunk at rank 1. Bottom right: MCompassRAG activates topic signals T-A (fiduciary out / board determination) and T-B (majority threshold), suppresses T-C and T-D, and promotes C3 to rank 1 via the MLP scorer (0.89 vs. 0.57 for C2).

We present two qualitative examples to illustrate how MCompassRAG resolves retrieval failures that dense similarity cannot handle: a definitional ambiguity case from LegalBench-RAG and an embedding-space analysis from Dragonball Finance.

### G.1 LegalBench-RAG: definitional ambiguity in M&A agreements.

Figure [4](https://arxiv.org/html/2606.18508#A7.F4) illustrates a concrete retrieval example from LegalBench-RAG that exposes the core failure mode of dense retrieval and how MCompassRAG resolves it. The query asks for the definition of *Superior Proposal* in an M&A acquisition agreement whose §6.03 region contains several topically adjacent clauses: a no-shop obligation (C1), the definition of *Acquisition Proposal* (C2), a board recommendation withdrawal clause (C4), and a termination fee clause (C5). Dense retrieval assigns the highest cosine similarity to C2 (0.81) rather than the gold chunk C3 (0.78), ranking the wrong definition first. The failure arises because C2 and C3 share substantial surface vocabulary (“bona fide,” “majority,” “Acquisition,” “outstanding Shares”), causing their embeddings to occupy nearby positions in the retriever space. Cosine similarity cannot identify which latent topic of a chunk matches the query, nor distinguish a clause that *defines* what counts as an acquisition proposal from one that *evaluates* whether a proposal is superior.

MCompassRAG recovers the gold chunk by activating two topic signals identified by the metadata selector as compatible with the query embedding: T-A, capturing the fiduciary-out and board determination frame (“more favorable,” “financial advisor,” “board determines in good faith”), and T-B, capturing the majority threshold frame (“majority of outstanding Shares,” “bona fide written Acquisition Proposal”). The selector simultaneously suppresses signals associated with solicitation restrictions (T-C) and merger consideration (T-D), which are prominent in the neighboring chunks but orthogonal to the query’s information need. The abstraction module pools T-A and T-B into a compact query-side topic vector aligned with C3’s own topic representation, and the MLP classifier assigns C3 a relevance score of 0.89 versus 0.57 for C2, promoting the correct definition to rank 1 without any inference-time LLM call. This disambiguation was learned through the teacher–student asymmetry in Section [3.3](https://arxiv.org/html/2606.18508#S3.SS3): the LLM teacher, given the expanded query framing the information need in terms of board determination and financial-advisor consultation, labels C3 as relevant and C2 as not, training the student to recover the same judgment through topic metadata alone.
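
The select, pool, and score sequence in this example can be mimicked with a toy sketch. Everything below is hypothetical: the four-dimensional vectors are invented, the selector is assumed to rank topics by cosine compatibility with the query, mean pooling stands in for the abstraction module, and a one-feature logistic scorer stands in for the trained MLP classifier. It does not reproduce the 0.89 and 0.57 scores from Figure 4; it only shows how topic agreement, rather than surface overlap, can decide the ranking.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_topics(query: Vector, centroids: Dict[str, Vector], m: int) -> List[str]:
    """ASSUMED selector: keep the m topics most compatible with the query."""
    ranked = sorted(centroids, key=lambda name: cosine(query, centroids[name]),
                    reverse=True)
    return ranked[:m]


def pool(names: Sequence[str], centroids: Dict[str, Vector]) -> Vector:
    """Mean pooling as a stand-in for the abstraction module."""
    dim = len(centroids[names[0]])
    return [sum(centroids[n][i] for n in names) / len(names) for i in range(dim)]


def relevance(query_topics: Vector, chunk_topics: Vector,
              w: float = 4.0, b: float = -2.0) -> float:
    """Toy logistic scorer on topic agreement (NOT the trained MLP)."""
    return 1.0 / (1.0 + math.exp(-(w * cosine(query_topics, chunk_topics) + b)))


# Invented toy centroids: one axis per topic signal.
centroids = {
    "T-A": [1.0, 0.0, 0.0, 0.0],  # fiduciary out / board determination
    "T-B": [0.0, 1.0, 0.0, 0.0],  # majority threshold
    "T-C": [0.0, 0.0, 1.0, 0.0],  # solicitation restrictions
    "T-D": [0.0, 0.0, 0.0, 1.0],  # merger consideration
}
query = [0.7, 0.6, 0.1, 0.05]
chunk_topics = {
    "C3": [0.6, 0.5, 0.1, 0.0],  # gold: Superior Proposal definition
    "C2": [0.1, 0.5, 0.6, 0.2],  # distractor: Acquisition Proposal definition
}

selected = select_topics(query, centroids, m=2)
query_topic_vec = pool(selected, centroids)
scores = {cid: relevance(query_topic_vec, vec) for cid, vec in chunk_topics.items()}
print(selected)
print(max(scores, key=scores.get))
```

In this toy setting, C2 shares the majority-threshold signal with the query but loads mostly on the suppressed solicitation topic, so its pooled topic agreement, and therefore its score, falls below that of C3.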

### G.2 Dragonball Finance: topic-guided separation in embedding space.

Figure [5](https://arxiv.org/html/2606.18508#A7.F5) visualizes the effect of topic enrichment on a Dragonball Finance example in which the query asks for a summary of Sparkling Clean Housekeeping Services’ sustainability and social responsibility efforts in 2019. The eight retrieval candidates span the full thematic range of the company’s corporate governance report: board composition (C1), executive remuneration (C2), risk management (C3), financial highlights (C4), shareholder structure (C5), internal audit (C6), and two surface-overlap distractors whose vocabulary partially overlaps with the gold chunk: a compliance and anti-corruption clause (C7, which shares the phrase “corporate citizenship”) and a strategic outlook statement (C8, which shares “long-term value creation”).

In the raw embedding space (Figure [5](https://arxiv.org/html/2606.18508#A7.F5)a), the query and the gold CSR chunk are already relatively proximate, yet several hard negatives remain in the same neighbourhood, reflecting the broad semantic overlap that coarse governance-report language introduces. After topic enrichment (Figure [5](https://arxiv.org/html/2606.18508#A7.F5)b), the query–gold alignment tightens substantially: the metadata selector activates the CSR topic centroid for the query and the gold chunk’s own topic distribution loads on the same signal, pulling the two representations into close alignment while the hard negatives, whose dominant topic vectors correspond to governance, finance, and risk, drift away. The surface-overlap distractors C7 and C8 are particularly informative: despite sharing specific phrases with the gold chunk, their topic distributions do not load on the CSR centroid and therefore receive lower relevance scores from the MLP classifier, confirming that MCompassRAG’s disambiguation operates at the level of latent topic structure rather than lexical overlap.
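
A two-dimensional toy example makes the geometric effect concrete. The sketch below assumes that enrichment adds a topic-weighted mixture of centroids to the raw embedding and renormalizes; this additive form, the two centroids, and all vectors and topic weights are invented for illustration and are not taken from the paper's implementation.

```python
import math
from typing import List, Sequence

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def enrich(embedding: Vector, topic_weights: Sequence[float],
           centroids: Sequence[Vector]) -> Vector:
    """ASSUMED enrichment: raw embedding + weighted sum of centroids, renormalized."""
    mix = [sum(w * c[i] for w, c in zip(topic_weights, centroids))
           for i in range(len(embedding))]
    return normalize([e + m for e, m in zip(embedding, mix)])


# Toy centroids living in the same space as the embeddings.
CENTROIDS = [[1.0, 0.0],   # CSR / sustainability
             [0.0, 1.0]]   # governance and compliance

query = [1.0, 0.0]
gold = [math.cos(math.radians(40)), math.sin(math.radians(40))]
distractor = [math.cos(math.radians(30)), math.sin(math.radians(30))]

# Raw space: the surface-overlap distractor sits closer to the query.
raw_gold = cosine(query, gold)
raw_distractor = cosine(query, distractor)

# Topic weights: query and gold load on CSR, distractor on governance.
q_enriched = enrich(query, [1.0, 0.0], CENTROIDS)
gold_enriched = enrich(gold, [0.9, 0.1], CENTROIDS)
distractor_enriched = enrich(distractor, [0.1, 0.9], CENTROIDS)

enr_gold = cosine(q_enriched, gold_enriched)
enr_distractor = cosine(q_enriched, distractor_enriched)

print(raw_distractor > raw_gold)   # distractor wins before enrichment
print(enr_gold > enr_distractor)   # gold wins after enrichment
```

The ranking flips because the distractor's topic weights pull it toward the governance centroid, even though its raw embedding started nearer to the query.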

![Refer to caption](https://arxiv.org/html/2606.18508v1/x6.png)

Figure 5: t-SNE visualization of chunk embeddings for a Dragonball Finance query on Sparkling Clean Housekeeping Services’ 2019 sustainability efforts. Chunks cover eight aspects of the corporate governance report: board composition (C1), executive remuneration (C2), risk management (C3), financial highlights (C4), shareholder structure (C5), internal audit (C6), compliance and anti-corruption (C7), and strategic outlook (C8); C7 and C8 are surface-overlap distractors that share phrases with the gold CSR chunk. (a) Raw embedding space: the query and gold chunk are proximate but several hard negatives occupy the same neighbourhood. (b) Topic-enriched space: topic enrichment tightens the query–gold alignment while pushing all hard negatives, including the surface-overlap distractors C7 and C8, away from the query.
