# LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
Source: [https://arxiv.org/html/2605.06285](https://arxiv.org/html/2605.06285)
Yijia Zheng, Marcel Worring
University of Amsterdam, Amsterdam, the Netherlands
{y.zheng, m.worring}@uva.nl
###### Abstract
Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with the retrieval system. This iterative process incurs substantial latency due to the autoregressive generation of lengthy thoughts and subqueries. To address this limitation, we propose LatentRAG, a novel framework that shifts both reasoning and retrieval from discrete language space to continuous latent space. Unlike existing explicit methods that generate natural language thoughts or subqueries token-by-token, LatentRAG produces latent tokens for thoughts and subqueries directly from the hidden states in a single forward pass. We align LLMs with dense retrieval models in the latent space, enabling retrieval over latent subquery tokens and supporting end-to-end joint optimization. To improve transparency and encourage semantically meaningful latent representations, we incorporate a parallel latent decoding mechanism that translates latent tokens back into natural language. Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with traditional single-step RAG.
## 1 Introduction
Figure 1: Comparison of performance and latency on multi-hop QA datasets. LatentRAG achieves comparable performance to competitive agentic RAG methods such as Search-R1 and AutoRefine, while maintaining efficiency on par with naive single-step RAG. Search-R1 incurs substantial latency in thought and subquery generation, whereas LatentRAG substantially reduces the time spent in these two stages, leading to the observed efficiency gains. Detailed stage-wise latency breakdowns are provided in Appendix [E.5](https://arxiv.org/html/2605.06285#A5.SS5).

Large language models (LLMs) have demonstrated strong capabilities in answering complex questions [[31](https://arxiv.org/html/2605.06285#bib.bib5), [62](https://arxiv.org/html/2605.06285#bib.bib2), [51](https://arxiv.org/html/2605.06285#bib.bib4)], but these capabilities are fundamentally bounded by their static internal knowledge [[58](https://arxiv.org/html/2605.06285#bib.bib6), [64](https://arxiv.org/html/2605.06285#bib.bib7)]. Relying solely on internal knowledge limits their performance on questions that require up-to-date information or proprietary knowledge [[63](https://arxiv.org/html/2605.06285#bib.bib8), [61](https://arxiv.org/html/2605.06285#bib.bib9)] and increases the risk of hallucinations [[21](https://arxiv.org/html/2605.06285#bib.bib10), [19](https://arxiv.org/html/2605.06285#bib.bib12)]. To improve both the factuality and transparency of LLM-generated outputs, retrieval-augmented generation (RAG) [[32](https://arxiv.org/html/2605.06285#bib.bib13), [14](https://arxiv.org/html/2605.06285#bib.bib14)] retrieves question-relevant information from an external retrieval system to augment the LLM inputs [[11](https://arxiv.org/html/2605.06285#bib.bib15), [42](https://arxiv.org/html/2605.06285#bib.bib16)]. Traditional RAG methods provide an efficient way to access external knowledge, but their single-step retrieval design limits their effectiveness on complex questions that require iterative reasoning and retrieval [[57](https://arxiv.org/html/2605.06285#bib.bib24), [50](https://arxiv.org/html/2605.06285#bib.bib23)].
Motivated by the success of tool-using LLM agents [[74](https://arxiv.org/html/2605.06285#bib.bib38), [46](https://arxiv.org/html/2605.06285#bib.bib39)], recent agentic RAG approaches [[34](https://arxiv.org/html/2605.06285#bib.bib27), [24](https://arxiv.org/html/2605.06285#bib.bib26)] replace traditional single-step retrieval with a multi-step agentic search process that alternates between generation and retrieval. In this process, the LLM acts as a search agent and iteratively decides what to retrieve. At each iteration, the agent generates a thought via chain-of-thought (CoT) reasoning [[65](https://arxiv.org/html/2605.06285#bib.bib25)] and then produces the next action, which can be either a subquery for the next retrieval step or the final answer. Each generated subquery is used to retrieve relevant documents. Unlike static single-step retrieval in traditional RAG, this multi-step agentic search process enables complex questions to be decomposed and effectively solved step by step [[35](https://arxiv.org/html/2605.06285#bib.bib42), [23](https://arxiv.org/html/2605.06285#bib.bib41)]. Although agentic RAG methods demonstrate strong performance on tasks with complex questions [[50](https://arxiv.org/html/2605.06285#bib.bib23), [36](https://arxiv.org/html/2605.06285#bib.bib40)], they incur substantial latency due to the additional multi-step interactions [[13](https://arxiv.org/html/2605.06285#bib.bib28), [55](https://arxiv.org/html/2605.06285#bib.bib31)].
To identify the latency bottlenecks of agentic RAG, we measure the average inference time across different stages for both naive single-step RAG and agentic RAG methods. As shown in Fig. [1](https://arxiv.org/html/2605.06285#S1.F1), on multi-hop question answering (QA) datasets, the total inference time of a representative agentic RAG method, Search-R1 [[24](https://arxiv.org/html/2605.06285#bib.bib26)], is 16–22× that of naive RAG. This overhead is primarily driven by the thought and subquery generation stages, which together account for approximately 90% of the total latency. Both stages involve autoregressive token-by-token generation of long outputs, where each output token depends on previously generated tokens, leading to many sequential LLM forward passes with limited parallelism. In contrast, prefill, retrieval, and final answer generation take far less time than these two stages. This comparison indicates that the latency bottlenecks of agentic RAG lie in the thought and subquery generation stages.
To reduce the thought and subquery generation latency in agentic RAG, we draw inspiration from another technique: latent reasoning. Latent reasoning [[15](https://arxiv.org/html/2605.06285#bib.bib32), [6](https://arxiv.org/html/2605.06285#bib.bib33)] is an efficient reasoning paradigm that performs reasoning within the continuous hidden states of the LLM, also referred to as latent tokens, without explicitly generating discrete language tokens. Compared to explicit reasoning, latent reasoning avoids allocating computation to non-semantic tokens that are produced solely for linguistic fluency [[7](https://arxiv.org/html/2605.06285#bib.bib37), [15](https://arxiv.org/html/2605.06285#bib.bib32)]. Furthermore, continuous latent tokens allow the LLM to directly generate high-level semantic representations, avoiding the inefficiency of explicit token-by-token generation and thereby enabling more parallelizable computation [[85](https://arxiv.org/html/2605.06285#bib.bib34), [3](https://arxiv.org/html/2605.06285#bib.bib90), [54](https://arxiv.org/html/2605.06285#bib.bib91)]. Although latent reasoning offers a promising avenue for enhancing reasoning efficiency, its application to agentic RAG remains unexplored.
In this work, we pioneer the integration of latent reasoning into the agentic RAG paradigm and, more importantly, propose a latent retrieval mechanism. Unlike the generation-only tasks studied in prior work on latent reasoning [[15](https://arxiv.org/html/2605.06285#bib.bib32), [12](https://arxiv.org/html/2605.06285#bib.bib35)], agentic RAG requires the LLM to emit explicit subquery tokens to invoke external retrieval. This explicit token generation not only incurs significant decoding overhead but also prevents gradient propagation, thereby hindering direct optimization of the LLM using retrieval signals. To overcome these limitations, we investigate whether latent tokens generated by an LLM can effectively serve as subqueries for retrieval. This introduces two challenges. (1) Data scarcity: training retrieval models typically requires large-scale paired data, often comprising hundreds of millions of query–document pairs [[77](https://arxiv.org/html/2605.06285#bib.bib86), [60](https://arxiv.org/html/2605.06285#bib.bib87)]. In contrast, agentic RAG systems are commonly developed under a training setup that provides only tens of thousands of question–answer pairs, without explicit supervision on the ground-truth documents for intermediate subqueries [[24](https://arxiv.org/html/2605.06285#bib.bib26), [49](https://arxiv.org/html/2605.06285#bib.bib45)]. This data scarcity makes it difficult to learn effective retrieval capability using conventional training paradigms for retrieval models. (2) Transparency: latent tokens inherently obscure the intermediate thoughts and subqueries, which is particularly problematic for agentic RAG, as lengthy and redundant retrieved documents make answer verification and evidence attribution [[45](https://arxiv.org/html/2605.06285#bib.bib92), [4](https://arxiv.org/html/2605.06285#bib.bib93)] time-consuming without explicit intermediate steps.
To address the aforementioned challenges, we introduce LatentRAG, an efficient agentic RAG framework that conducts reasoning and retrieval in the latent space. Specifically, we feed a sequence of special thought and subquery tokens into the LLM and use the corresponding last hidden states as latent thought and subquery tokens, respectively. These latent tokens are obtained in a single forward pass, enabling parallel computation and avoiding the inefficiency of autoregressive generation. To address challenge (1), we align the LLM with a pretrained dense retrieval model in the latent space. The latent subquery tokens are used as inputs to the retrieval model to generate latent subquery embeddings. We then minimize the KL divergence between the similarity distribution over documents induced by latent subquery embeddings and that induced by natural language subquery embeddings. This design enables fully differentiable end-to-end joint optimization of the LLM and the retrieval model. To address challenge (2) and encourage the latent tokens to capture meaningful semantics, we incorporate a parallel latent decoding mechanism that converts latent tokens into natural language thoughts and subqueries. During inference, this latent decoding process is optional, enabling a trade-off between transparency and efficiency. Since this latent decoding process depends only on the latent tokens, all thoughts and subqueries across different steps can be decoded in parallel, reducing the latency of the decoding process. Our main contributions are summarized as follows:
- We introduce LatentRAG, a novel agentic RAG framework that performs reasoning and retrieval in the latent space, reducing the latency overhead of explicit thought and subquery generation.
- We propose a latent-space alignment objective that jointly optimizes the LLM and retrieval model, enabling latent tokens to serve as effective retrieval queries while supporting end-to-end training.
- We incorporate a parallel decoding mechanism that translates latent tokens into explicit thoughts and subqueries, improving transparency while remaining more efficient than explicit agentic RAG.
Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods, with relative performance differences of less than 5%, while significantly reducing latency overhead by approximately 90% on average, approaching the latency of traditional single-step RAG.
## 2 Related Work

#### Agentic RAG.

Recent advances in RAG have shifted beyond traditional single-step methods [[32](https://arxiv.org/html/2605.06285#bib.bib13), [14](https://arxiv.org/html/2605.06285#bib.bib14)] toward agentic RAG approaches [[36](https://arxiv.org/html/2605.06285#bib.bib40), [35](https://arxiv.org/html/2605.06285#bib.bib42), [50](https://arxiv.org/html/2605.06285#bib.bib23)], which perform multi-step retrieval by iteratively generating intermediate thoughts and subqueries. Early agentic RAG methods [[57](https://arxiv.org/html/2605.06285#bib.bib24), [22](https://arxiv.org/html/2605.06285#bib.bib52), [69](https://arxiv.org/html/2605.06285#bib.bib53), [34](https://arxiv.org/html/2605.06285#bib.bib27)] primarily rely on prompting strategies to enable LLMs to interact with retrieval systems. To improve the retrieval ability of LLMs, Self-RAG [[2](https://arxiv.org/html/2605.06285#bib.bib54)] and AutoRAG [[29](https://arxiv.org/html/2605.06285#bib.bib55)] construct synthetic training data from RAG benchmark datasets for supervised fine-tuning. Some methods [[13](https://arxiv.org/html/2605.06285#bib.bib28), [8](https://arxiv.org/html/2605.06285#bib.bib29), [20](https://arxiv.org/html/2605.06285#bib.bib30)] further introduce mechanisms to balance internal knowledge and external retrieval, enabling LLMs to retrieve only when internal knowledge is insufficient. To mitigate the reliance on supervised training data and promote more flexible search strategies, a growing line of work [[24](https://arxiv.org/html/2605.06285#bib.bib26), [5](https://arxiv.org/html/2605.06285#bib.bib56), [52](https://arxiv.org/html/2605.06285#bib.bib57), [82](https://arxiv.org/html/2605.06285#bib.bib58)] formulates agentic RAG as a Markov decision process, where LLMs learn an optimal decision policy to interact with the retrieval system via reinforcement learning (RL). Recent RL-based approaches further incorporate fine-grained intermediate reward functions [[68](https://arxiv.org/html/2605.06285#bib.bib59), [67](https://arxiv.org/html/2605.06285#bib.bib60), [76](https://arxiv.org/html/2605.06285#bib.bib61), [80](https://arxiv.org/html/2605.06285#bib.bib62)] and explore parallel retrieval strategies [[81](https://arxiv.org/html/2605.06285#bib.bib63), [55](https://arxiv.org/html/2605.06285#bib.bib31), [71](https://arxiv.org/html/2605.06285#bib.bib64)]. As discussed in the introduction, all these existing methods require generating long sequences of thoughts and subqueries in the language space, leading to substantial latency. In contrast to existing approaches, we explore performing reasoning and retrieval in the latent space, avoiding long textual thought and subquery generation and achieving significant efficiency gains.
#### Latent Reasoning.

Latent reasoning [[85](https://arxiv.org/html/2605.06285#bib.bib34), [75](https://arxiv.org/html/2605.06285#bib.bib65)] reduces the latency overhead of explicit chain-of-thought (CoT) reasoning [[65](https://arxiv.org/html/2605.06285#bib.bib25)] by operating in the continuous hidden states of LLMs, but existing work primarily focuses on generation-only tasks [[12](https://arxiv.org/html/2605.06285#bib.bib35), [15](https://arxiv.org/html/2605.06285#bib.bib32)] without external retrieval. Early research explores adding filler tokens to let LLMs allocate more computation within the hidden states before generating outputs [[12](https://arxiv.org/html/2605.06285#bib.bib35), [43](https://arxiv.org/html/2605.06285#bib.bib46)]. Coconut [[15](https://arxiv.org/html/2605.06285#bib.bib32)] proposes an autoregressive latent reasoning paradigm, where each latent token, *i.e.*, a generated hidden state, is recursively fed back into the LLM to generate the next latent token. While the training process of Coconut is supervised only by the final answer, some methods [[48](https://arxiv.org/html/2605.06285#bib.bib47), [7](https://arxiv.org/html/2605.06285#bib.bib37), [59](https://arxiv.org/html/2605.06285#bib.bib66), [66](https://arxiv.org/html/2605.06285#bib.bib67)] further utilize information generated by explicit CoT as intermediate supervision to improve training. To enhance semantic consistency and address the distributional mismatch between the latent token space and the model input space, recent approaches [[78](https://arxiv.org/html/2605.06285#bib.bib68), [84](https://arxiv.org/html/2605.06285#bib.bib69), [9](https://arxiv.org/html/2605.06285#bib.bib70)] constrain latent representations to be mixtures of the language token embeddings. Some methods [[17](https://arxiv.org/html/2605.06285#bib.bib71), [70](https://arxiv.org/html/2605.06285#bib.bib72)] introduce lightweight assistant models to generate latent tokens, improving efficiency while avoiding disruption to the capabilities of the base LLM. Latent reasoning has also been extended to practical applications, including retrieval. CLaRa [[16](https://arxiv.org/html/2605.06285#bib.bib73)] leverages latent reasoning to compress retrieved information in single-step RAG, while a concurrent work, LaSER [[25](https://arxiv.org/html/2605.06285#bib.bib74)], develops a dense retrieval model based on latent reasoning. Despite the rapid advancement of latent reasoning, its application to agentic RAG introduces the challenges discussed in the introduction, leaving this area largely unexplored. In this paper, we pioneer the integration of latent reasoning into agentic RAG and further propose a latent retrieval mechanism, significantly reducing latency overhead.
## 3 Preliminaries

Following the standard setting in prior RAG research [[24](https://arxiv.org/html/2605.06285#bib.bib26), [55](https://arxiv.org/html/2605.06285#bib.bib31)], we study the question-answering (QA) task defined as follows. Given a question $q$, the objective is to generate an answer $a$ by retrieving the necessary information from a large corpus $\mathcal{D} = \{d_1, d_2, \ldots, d_N\}$, where each $d_i$ represents a document. To simplify notation, for each symbol that represents natural language text (e.g., $d_i$), we use the same symbol to denote its token sequence.
LLMs are widely used for solving the QA task. An LLM maps an input token sequence to an output sequence through two stages: *prefill* and *decoding*. In the prefill stage, all input tokens are processed in parallel to compute the key-value (KV) cache. In the decoding stage, output tokens are generated autoregressively, where each token is produced based on the KV cache of the input tokens and previously generated tokens. Due to these autoregressive dependencies, the decoding stage can only generate output tokens in a token-by-token manner, leading to substantial latency for long outputs.
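To make the cost asymmetry concrete, here is a minimal greedy-decoding sketch using the Hugging Face `transformers` API; the checkpoint name and the 32-token output length are illustrative, not taken from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM exposes the same two-stage interface; the checkpoint is illustrative.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")

prompt_ids = tok("Who wrote Hamlet?", return_tensors="pt").input_ids

# Prefill: one parallel forward pass over all input tokens builds the KV cache.
with torch.no_grad():
    out = model(prompt_ids, use_cache=True)
past = out.past_key_values
next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

# Decoding: strictly sequential; every output token costs one more forward pass.
generated = [next_id]
for _ in range(31):  # 32 output tokens => 32 sequential passes in total
    with torch.no_grad():
        out = model(next_id, past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated.append(next_id)
answer = tok.decode(torch.cat(generated, dim=-1)[0])
```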
RAG systems augment LLMs with information retrieved from an external retrieval system. Two types of retrieval models are widely adopted [[50](https://arxiv.org/html/2605.06285#bib.bib23), [10](https://arxiv.org/html/2605.06285#bib.bib48)]: sparse retrieval models, which rely on exact token-level matches, and dense retrieval models, which encode the query and documents into continuous embeddings and select the top-$k$ documents based on cosine similarity. Dense retrieval models capture deeper semantic similarity than sparse retrieval models, leading to superior performance on RAG benchmarks [[26](https://arxiv.org/html/2605.06285#bib.bib43), [37](https://arxiv.org/html/2605.06285#bib.bib49)]. Thus, in this work, we focus on dense retrieval models.
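The selection step of a dense retriever reduces to a cosine-similarity top-$k$ over precomputed document embeddings; a toy sketch, with random vectors standing in for encoder outputs:

```python
import torch
import torch.nn.functional as F

def dense_retrieve(query_emb: torch.Tensor, doc_embs: torch.Tensor, k: int = 3):
    """Return indices and scores of the k documents most cosine-similar to the query."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), doc_embs, dim=-1)  # (N,)
    scores, idx = sims.topk(k)
    return idx, scores

doc_embs = F.normalize(torch.randn(1000, 768), dim=-1)  # N=1000 document embeddings
query_emb = F.normalize(torch.randn(768), dim=-1)       # one query embedding
top_idx, top_scores = dense_retrieve(query_emb, doc_embs, k=3)
```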
Agentic RAG methods perform multi-step generation and retrieval, as shown in Fig. [2](https://arxiv.org/html/2605.06285#S4.F2)(a). At each iteration, the LLM generates a reasoning thought and a corresponding subquery, which is then used to retrieve relevant information from an external retrieval system. Formally, at iteration $t$, the historical *interaction trajectory* is denoted as a sequence:

$$
\mathcal{I}_t = (\tau_0, s_0, c_0, \ldots, \tau_{t-1}, s_{t-1}, c_{t-1}), \tag{1}
$$

where $\tau_i$ represents the $i$-th thought, $s_i$ is the $i$-th generated subquery, and $c_i$ comprises the contents of the top-$k$ documents retrieved using $s_i$. Conditioned on the question $q$ and the interaction trajectory $\mathcal{I}_t$, the agent first performs reasoning by producing the next thought $\tau_t$ and subsequently generates the next subquery $s_t$, denoted jointly as $(\tau_t, s_t) = g_{\mathrm{LLM}}(q, \mathcal{I}_t; \theta_{\mathrm{LLM}})$, where $\theta_{\mathrm{LLM}}$ represents the parameters of the LLM. After the reasoning process, if the agent concludes that sufficient information has been gathered, it generates a final answer $a$, expressed as $(\tau_t, a) = g_{\mathrm{LLM}}(q, \mathcal{I}_t; \theta_{\mathrm{LLM}})$.
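In code, this interaction loop reads as follows; `llm_step` and `retrieve` are placeholder helpers we introduce for illustration, not components named in the paper:

```python
def agentic_rag(question, llm_step, retrieve, max_steps=8):
    """Explicit agentic search loop paraphrasing the formalism above.

    llm_step(question, trajectory) -> (thought, action, payload), where action
    is "query" (payload is the subquery s_t) or "answer" (payload is a).
    retrieve(subquery) -> contents c_t of the top-k retrieved documents.
    """
    trajectory = []  # I_t = (tau_0, s_0, c_0, ..., tau_{t-1}, s_{t-1}, c_{t-1})
    for _ in range(max_steps):
        thought, action, payload = llm_step(question, trajectory)
        if action == "answer":          # (tau_t, a) = g_LLM(q, I_t)
            return payload
        context = retrieve(payload)     # c_t retrieved with subquery s_t
        trajectory.append((thought, payload, context))
    return None  # step budget exhausted without an answer
```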
## 4 Methodology

Figure 2: (1) Traditional explicit agentic RAG methods alternate between generation and retrieval, producing natural language thoughts and subqueries at each generation step to iteratively retrieve relevant documents. (2) LatentRAG only produces latent thought and subquery tokens at each generation step, and the latent subquery tokens are used for retrieval. (3) LatentRAG contains three components: Generation (Sec. [4.1](https://arxiv.org/html/2605.06285#S4.SS1)), Retrieval (Sec. [4.2](https://arxiv.org/html/2605.06285#S4.SS2)), and Latent Decoding (Sec. [4.3](https://arxiv.org/html/2605.06285#S4.SS3)).

LatentRAG adopts a procedure similar to the traditional explicit agentic RAG described in Sec. [3](https://arxiv.org/html/2605.06285#S3): the LLM agent iteratively generates thoughts and subqueries, and the subqueries are then used to retrieve relevant information. Unlike explicit agentic RAG methods that generate thoughts and subqueries in the language space, LatentRAG operates in the latent space and only produces latent tokens, *i.e.*, the last-layer hidden states, for thoughts and subqueries (Sec. [4.1](https://arxiv.org/html/2605.06285#S4.SS1)). The latent subquery tokens are then used as inputs to retrieve relevant documents (Sec. [4.2](https://arxiv.org/html/2605.06285#S4.SS2)). To improve transparency, the latent thought and subquery tokens can be decoded into natural language via the latent decoding process (Sec. [4.3](https://arxiv.org/html/2605.06285#S4.SS3)). The model is trained with a joint objective that combines losses from different components (Sec. [4.4](https://arxiv.org/html/2605.06285#S4.SS4)). The overall framework is shown in Fig. [2](https://arxiv.org/html/2605.06285#S4.F2).
### 4.1 Generation with Latent Tokens

We replace the explicit thoughts $\tau_t$ and subqueries $s_t$ in Eq. [1](https://arxiv.org/html/2605.06285#S3.E1) with sequences of special tokens $\tau^{\ell}_t$ and $s^{\ell}_t$, respectively. Here $\tau^{\ell}_t = (\texttt{<think}_1\texttt{>}, \ldots, \texttt{<think}_m\texttt{>})$ denotes a sequence of $m$ special thought tokens, and $s^{\ell}_t = (\texttt{<query}_1\texttt{>}, \ldots, \texttt{<query}_n\texttt{>})$ denotes a sequence of $n$ special subquery tokens. At iteration $t$, the interaction trajectory is denoted as a sequence:

$$
\mathcal{I}^{\ell}_t = (\tau^{\ell}_0, s^{\ell}_0, c_0, \ldots, \tau^{\ell}_{t-1}, s^{\ell}_{t-1}, c_{t-1}). \tag{2}
$$

The special tokens serve as latent computation slots, allowing the LLM to allocate additional internal computation without generating explicit natural language thoughts and subqueries. During the prefill stage, the special tokens are processed in parallel, producing their last hidden states $H^{\tau}_t$ and $H^{s}_t$, which are referred to as latent thought and subquery tokens, respectively.
Given the question $q$ and the interaction trajectory $\mathcal{I}^{\ell}_t$, we append the input with the special thought tokens $\tau^{\ell}_t$ and let the LLM decode an action token from the last latent thought token:

$$
\alpha_t = g_{\mathrm{LLM}}(q, \mathcal{I}^{\ell}_t, \tau^{\ell}_t; \theta_{\mathrm{LLM}}), \tag{3}
$$

where $\alpha_t \in \{\texttt{<query>}, \texttt{<answer>}\}$ represents whether to proceed with retrieval by generating a subquery or to terminate by producing the final answer. If $\alpha_t = \texttt{<query>}$, we append the special tokens $s^{\ell}_t$ to the input sequence $(q, \mathcal{I}^{\ell}_t, \tau^{\ell}_t)$ and let the LLM generate the latent subquery tokens:

$$
H^{s}_t = f_{\mathrm{LLM}}(s^{\ell}_t; q, \mathcal{I}^{\ell}_t, \tau^{\ell}_t, \theta_{\mathrm{LLM}}), \tag{4}
$$

where $f_{\mathrm{LLM}}$ denotes a single forward pass of the LLM. The obtained latent subquery tokens $H^{s}_t$ are used to retrieve the relevant top-$k$ documents, which constitute the retrieved content $c_t$ (described in Sec. [4.2](https://arxiv.org/html/2605.06285#S4.SS2)). If $\alpha_t = \texttt{<answer>}$, the model continues to generate the final answer $a$.
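A minimal sketch of this single-pass latent generation with `transformers`; the special-token names, their counts ($m = n = 4$), and the checkpoint are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")

# Register placeholder thought/subquery tokens (hypothetical names).
think = [f"<think_{i}>" for i in range(4)]
query = [f"<query_{i}>" for i in range(4)]
tok.add_special_tokens({"additional_special_tokens": think + query})
model.resize_token_embeddings(len(tok))

prompt = "Question: Who directed the film? " + "".join(think) + "".join(query)
ids = tok(prompt, return_tensors="pt", add_special_tokens=False).input_ids

# One parallel forward pass; no autoregressive loop over the latent positions.
with torch.no_grad():
    hidden = model(ids, output_hidden_states=True).hidden_states[-1]  # (1, seq, d)

H_tau = hidden[:, -8:-4, :]  # latent thought tokens  H_t^tau
H_s = hidden[:, -4:, :]      # latent subquery tokens H_t^s
```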
LatentRAG is trained via supervised fine-tuning (SFT) using interaction trajectories produced by existing explicit agentic RAG methods. Specifically, we replace each natural language thought $\tau_t$ and subquery $s_t$ with the corresponding special token sequences $\tau^{\ell}_t$ and $s^{\ell}_t$. The token sequences within these trajectories are formatted according to a predefined prompt template (see Appendix [D](https://arxiv.org/html/2605.06285#A4)) and then provided as input to the LLM. The LLM is optimized to generate the correct action token $\alpha_t$ and the final answer $a$ using the standard cross-entropy loss, denoted as $\mathcal{L}_{\mathrm{gen}}$.
### 4.2 Latent Retrieval

We use the generated latent subquery tokens $H^{s}_t$ to retrieve the relevant content $c_t$. Since these latent tokens reside in the output space of the LLM and are not directly compatible with the input space of the retrieval model, we add a lightweight projector module $\mathrm{Proj}_{\mathrm{ret}}$ to bridge the two spaces. The projector is composed of a bidirectional self-attention layer and a position-wise feed-forward network (FFN) layer. The projected latent subquery tokens are fed into a trainable retrieval model to obtain the latent subquery embedding:

$$
\boldsymbol{v}_{s^{\ell}_t} = f_{\mathrm{ret}}(\mathrm{Proj}_{\mathrm{ret}}(H^{s}_t); \theta_{\mathrm{ret}}). \tag{5}
$$

Here $\theta_{\mathrm{ret}}$ denotes the parameters of the retrieval model, which are initialized from a pretrained model and optimized during fine-tuning. Since ground-truth documents are not available in our setting, we train the model to produce latent subquery embeddings that approximate the retrieval behavior induced by the corresponding natural language subqueries. Specifically, each natural language subquery $s_t$ in the trajectory is encoded using a reference retrieval model to produce a reference embedding $\boldsymbol{v}_{s_t}'$:

$$
\boldsymbol{v}_{s_t}' = f_{\mathrm{ret}}(s_t; \theta_{\mathrm{ret}}'), \tag{6}
$$

where the reference retrieval model is initialized from the same pretrained model as the trainable one, but its parameters $\theta_{\mathrm{ret}}'$ remain frozen during fine-tuning. The reference embeddings are used to retrieve the top-$k$ documents from the corpus, which are treated as pseudo-relevant documents.
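A PyTorch sketch of such a projector; the hidden sizes and head count are illustrative assumptions, since the text specifies only the layer types:

```python
import torch
import torch.nn as nn

class RetrievalProjector(nn.Module):
    """Bidirectional self-attention + position-wise FFN, bridging the LLM
    output space to the retrieval model's input space (Proj_ret)."""

    def __init__(self, d_llm: int = 3584, d_ret: int = 1024, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_llm, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_llm)
        self.ffn = nn.Sequential(
            nn.Linear(d_llm, 4 * d_llm), nn.GELU(), nn.Linear(4 * d_llm, d_ret)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, n, d_llm) latent subquery tokens. No causal mask is applied,
        # so the self-attention is bidirectional.
        attn_out, _ = self.attn(h, h, h)
        h = self.norm(h + attn_out)
        return self.ffn(h)  # (batch, n, d_ret), fed into f_ret
```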
To learn subquery embeddings that align with relevant documents, a common practice is to use the InfoNCE loss [[41](https://arxiv.org/html/2605.06285#bib.bib50)], which pulls query embeddings closer to positive documents while pushing them away from negatives. However, in our setting, pseudo-relevant documents are not ground-truth annotations and may contain substantial noise. In addition, unlike large-scale dense retrieval pretraining settings that rely on hundreds of millions of labeled query–document pairs [[77](https://arxiv.org/html/2605.06285#bib.bib86), [60](https://arxiv.org/html/2605.06285#bib.bib87)], agentic RAG is typically trained with only tens of thousands of samples [[24](https://arxiv.org/html/2605.06285#bib.bib26), [49](https://arxiv.org/html/2605.06285#bib.bib45)]. Such data noise and scarcity make the standard InfoNCE objective less well suited to the agentic RAG setting.
To better leverage the prior knowledge encoded in pretrained retrieval models, we introduce a retrieval objective based on the Kullback–Leibler (KL) divergence. Specifically, for each subquery $s_t$ and each candidate document $d_i$, we compute the following cosine-similarity-based probabilities using the reference subquery embedding and the corresponding latent subquery embedding:

$$
p_i(s_t) = \frac{\exp(\cos(\boldsymbol{v}_{s_t}', \boldsymbol{v}_{d_i})/\beta)}{\sum_{j=1}^{N_d} \exp(\cos(\boldsymbol{v}_{s_t}', \boldsymbol{v}_{d_j})/\beta)}, \qquad q_i(s_t) = \frac{\exp(\cos(\boldsymbol{v}_{s^{\ell}_t}, \boldsymbol{v}_{d_i})/\beta)}{\sum_{j=1}^{N_d} \exp(\cos(\boldsymbol{v}_{s^{\ell}_t}, \boldsymbol{v}_{d_j})/\beta)}, \tag{7}
$$

where $\beta$ is a temperature parameter that controls the sharpness of the distribution and $N_d$ is the number of candidate documents; the candidate set consists of all in-batch pseudo-relevant documents. The retrieval loss is defined as the KL divergence between the two distributions:

$$
\mathcal{L}_{\mathrm{ret}} = \frac{1}{\lVert\mathcal{B}_s\rVert} \sum_{s_t \in \mathcal{B}_s} \sum_{i=1}^{N_d} p_i(s_t) \log \frac{p_i(s_t)}{q_i(s_t)}, \tag{8}
$$

where $\mathcal{B}_s$ denotes all the subqueries in a training batch. An alternative objective is to directly align $\boldsymbol{v}_{s^{\ell}_t}$ and $\boldsymbol{v}_{s_t}'$ by minimizing their cosine distance; however, our ablation experiments show that this yields lower performance than the KL objective.
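A compact sketch of Eqs. (7)–(8) over precomputed embeddings; the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def kl_retrieval_loss(v_latent, v_ref, v_docs, beta: float = 0.05):
    """KL(p || q) between the document distributions induced by the reference
    and latent subquery embeddings.

    v_latent: (B, d) latent subquery embeddings  v_{s_t^l}
    v_ref:    (B, d) frozen reference embeddings v'_{s_t}
    v_docs:   (N_d, d) in-batch pseudo-relevant document embeddings
    """
    v_latent = F.normalize(v_latent, dim=-1)
    v_ref = F.normalize(v_ref, dim=-1)
    v_docs = F.normalize(v_docs, dim=-1)
    # Cosine similarity equals the dot product of L2-normalized vectors.
    log_q = F.log_softmax(v_latent @ v_docs.T / beta, dim=-1)  # (B, N_d)
    p = F.softmax(v_ref @ v_docs.T / beta, dim=-1).detach()    # target, no grad
    return F.kl_div(log_q, p, reduction="batchmean")           # mean over B
```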
Table 1: Overall results with different retrieval models. Each cell reports EM (%) ↑ / latency in ms ↓; parenthesized percentages indicate the relative improvement or degradation compared to the corresponding baseline. Results show that our method achieves substantial efficiency gains with limited or even no performance degradation.

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2wiki | Musique | Bamboogle | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct Infer | 14.76 / 257 | 42.48 / 158 | 14.50 / 175 | 16.68 / 179 | 21.89 / 191 | 3.19 / 239 | 11.20 / 200 | 17.81 / 200 |
| **Qwen3-Embedding-0.6B** |  |  |  |  |  |  |  |  |
| Naive RAG | 25.84 / 452 | 53.32 / 334 | 35.72 / 297 | 27.41 / 300 | 21.23 / 391 | 4.96 / 422 | 11.20 / 314 | 25.67 / 359 |
| Iter-RetGen | 31.33 / 1,884 | 56.91 / 1,241 | 40.29 / 1,297 | 31.74 / 1,303 | 25.20 / 1,567 | 7.16 / 2,513 | 19.20 / 1,461 | 30.26 / 1,609 |
| Search-o1 | 22.33 / 4,178 | 44.53 / 4,102 | 31.25 / 3,422 | 25.09 / 4,886 | 28.79 / 6,301 | 10.80 / 6,928 | 29.60 / 4,741 | 27.48 / 4,937 |
| R1-Searcher | 36.34 / 7,569 | 56.10 / 7,894 | 38.37 / 7,426 | 43.34 / 9,306 | 46.99 / 8,861 | 17.83 / 10,063 | 34.40 / 7,491 | 39.05 / 8,373 |
| ZeroSearch | 34.07 / 4,194 | 54.67 / 3,866 | 40.72 / 3,484 | 30.61 / 4,136 | 34.96 / 4,667 | 11.75 / 5,027 | 32.80 / 4,166 | 34.23 / 4,220 |
| DeepRAG | 33.16 / 2,847 | 55.48 / 2,860 | 39.15 / 2,873 | 33.44 / 3,232 | 43.92 / 3,758 | 13.20 / 4,149 | 32.00 / 2,848 | 35.76 / 3,224 |
| Search-R1◆ | 43.93 / 3,553 | 61.70 / 5,198 | 47.31 / 3,588 | 43.05 / 6,589 | 43.27 / 6,925 | 19.61 / 6,846 | 38.40 / 4,904 | 42.47 / 5,372 |
| LatentRAG◆ | 46.29 / 491 (-86.2%) | 59.85 / 478 (-90.8%) | 46.63 / 501 (-86.0%) | 46.73 / 626 (-90.5%) | 44.89 / 704 (-89.8%) | 19.82 / 730 (-89.3%) | 40.00 / 623 (-87.3%) | 43.46 (+2.3%) / 593 (-89.0%) |
| AutoRefine△ | 44.35 / 4,782 | 63.09 / 4,223 | 47.19 / 4,397 | 44.06 / 5,344 | 42.53 / 5,264 | 18.66 / 5,553 | 39.20 / 4,224 | 42.73 / 4,827 |
| LatentRAG△ | 45.73 / 409 (-91.4%) | 60.18 / 400 (-90.5%) | 47.42 / 422 (-90.4%) | 47.16 / 528 (-90.1%) | 46.14 / 607 (-88.5%) | 20.73 / 639 (-88.5%) | 39.20 / 581 (-86.2%) | 43.79 (+2.5%) / 512 (-89.4%) |
| **e5-base-v2** |  |  |  |  |  |  |  |  |
| Search-R1◆ | 48.03 / 3,301 | 64.59 / 5,409 | 46.02 / 3,403 | 44.71 / 6,263 | 42.13 / 6,662 | 20.19 / 6,416 | 43.20 / 4,709 | 44.12 / 5,166 |
| LatentRAG◆ | 49.03 / 453 (-86.3%) | 61.30 / 410 (-92.4%) | 43.04 / 470 (-86.2%) | 46.41 / 588 (-90.6%) | 38.61 / 687 (-89.7%) | 18.87 / 656 (-89.8%) | 39.20 / 558 (-88.2%) | 42.35 (-4.0%) / 546 (-89.4%) |
| AutoRefine△ | 48.20 / 4,303 | 65.30 / 3,853 | 47.12 / 4,135 | 45.13 / 5,133 | 41.90 / 4,695 | 21.80 / 5,583 | 48.00 / 4,045 | 45.35 / 4,535 |
| LatentRAG△ | 49.86 / 369 (-91.4%) | 62.32 / 370 (-90.4%) | 43.88 / 407 (-90.2%) | 46.27 / 464 (-91.0%) | 41.98 / 552 (-88.2%) | 20.94 / 548 (-90.2%) | 40.80 / 519 (-87.2%) | 43.72 (-3.6%) / 462 (-89.8%) |
| **jina-embeddings-v5-text-nano** |  |  |  |  |  |  |  |  |
| Search-R1◆ | 45.79 / 3,381 | 63.16 / 5,120 | 46.54 / 3,639 | 44.31 / 6,161 | 42.95 / 6,758 | 20.40 / 6,177 | 44.80 / 4,579 | 43.99 / 5,116 |
| LatentRAG◆ | 47.37 / 456 (-86.5%) | 61.25 / 394 (-92.3%) | 46.72 / 426 (-88.3%) | 47.66 / 540 (-91.2%) | 45.06 / 641 (-90.5%) | 22.26 / 632 (-89.8%) | 43.20 / 592 (-87.1%) | 44.79 (+1.8%) / 526 (-89.7%) |
| AutoRefine△ | 45.98 / 4,730 | 64.73 / 3,903 | 46.84 / 4,109 | 44.79 / 4,962 | 42.95 / 4,879 | 20.52 / 5,614 | 43.20 / 4,053 | 44.14 / 4,607 |
| LatentRAG△ | 47.59 / 368 (-92.2%) | 61.54 / 365 (-90.7%) | 47.82 / 383 (-90.7%) | 48.10 / 467 (-90.6%) | 45.24 / 539 (-89.0%) | 22.92 / 540 (-90.4%) | 40.80 / 514 (-87.3%) | 44.86 (+1.6%) / 454 (-90.1%) |
| **harrier-oss-v1-270m** |  |  |  |  |  |  |  |  |
| Search-R1◆ | 44.40 / 3,510 | 62.80 / 4,983 | 47.14 / 3,351 | 44.88 / 6,270 | 44.16 / 6,864 | 18.62 / 6,613 | 40.00 / 4,944 | 43.14 / 5,219 |
| LatentRAG◆ | 46.15 / 485 (-86.2%) | 60.40 / 444 (-91.1%) | 45.41 / 497 (-85.2%) | 47.27 / 614 (-90.2%) | 44.99 / 698 (-89.8%) | 19.90 / 708 (-89.3%) | 34.40 / 636 (-87.1%) | 42.65 (-1.1%) / 583 (-88.8%) |
| AutoRefine△ | 43.80 / 4,457 | 64.32 / 3,899 | 47.68 / 4,153 | 45.40 / 4,703 | 43.31 / 4,910 | 20.15 / 5,511 | 43.20 / 4,360 | 43.98 / 4,570 |
| LatentRAG△ | 45.68 / 389 (-91.3%) | 60.98 / 391 (-90.0%) | 45.92 / 409 (-90.2%) | 46.74 / 507 (-89.2%) | 44.58 / 602 (-87.7%) | 20.56 / 579 (-89.5%) | 35.20 / 559 (-87.2%) | 42.81 (-2.7%) / 491 (-89.3%) |
| **F2LLM-v2-330M** |  |  |  |  |  |  |  |  |
| Search-R1◆ | 43.35 / 3,717 | 61.04 / 5,394 | 44.92 / 3,620 | 41.99 / 6,228 | 41.53 / 6,765 | 18.54 / 6,690 | 36.00 / 5,183 | 41.05 / 5,371 |
| LatentRAG◆ | 45.54 / 484 (-87.0%) | 58.56 / 432 (-92.0%) | 44.21 / 448 (-87.6%) | 44.58 / 593 (-90.5%) | 43.02 / 679 (-90.0%) | 18.37 / 667 (-90.0%) | 36.00 / 602 (-88.4%) | 41.47 (+1.0%) / 558 (-89.6%) |
| AutoRefine△ | 43.82 / 4,483 | 62.53 / 4,204 | 45.17 / 4,271 | 42.63 / 4,992 | 41.58 / 5,101 | 19.03 / 5,412 | 40.80 / 4,381 | 42.22 / 4,692 |
| LatentRAG△ | 45.32 / 403 (-91.0%) | 58.83 / 386 (-90.8%) | 44.72 / 417 (-90.2%) | 44.74 / 499 (-90.0%) | 44.08 / 583 (-88.6%) | 19.82 / 566 (-89.5%) | 36.80 / 552 (-87.4%) | 42.04 (-0.4%) / 487 (-89.6%) |
### 4.3 Latent Decoding

To improve the transparency of the decision-making process and enhance latent representation learning, we introduce a latent decoding objective. The key idea is to optimize the LLM to reconstruct the corresponding natural language sequences directly from the generated latent tokens.

We add projector modules $\mathrm{Proj}_{\tau}$ and $\mathrm{Proj}_{s}$ to map latent thought and subquery tokens into the LLM input space, respectively. The projector modules follow the same structure as the projector introduced in Sec. [4.2](https://arxiv.org/html/2605.06285#S4.SS2). The projected latent thought tokens or latent subquery tokens are then fed into the LLM to decode the corresponding natural language thought $\tau_t$ or subquery $s_t$:

$$
\tau_t = g_{\mathrm{LLM}}(\mathrm{Proj}_{\tau}(H^{\tau}_t); \theta_{\mathrm{LLM}}), \qquad s_t = g_{\mathrm{LLM}}(\mathrm{Proj}_{s}(H^{s}_t); \theta_{\mathrm{LLM}}). \tag{9}
$$

The prompts used to format these inputs are provided in Appendix [D](https://arxiv.org/html/2605.06285#A4). The decoding process is optimized using the standard cross-entropy loss between the generated sequence and the corresponding natural language target. This results in two decoding losses: a thought decoding loss $\mathcal{L}_{\mathrm{dec}}^{\tau}$ and a subquery decoding loss $\mathcal{L}_{\mathrm{dec}}^{s}$. The latent decoding loss is the combination of both terms:

$$
\mathcal{L}_{\mathrm{dec}} = \mathcal{L}_{\mathrm{dec}}^{\tau} + \mathcal{L}_{\mathrm{dec}}^{s}. \tag{10}
$$

During inference, this latent decoding process is optional, allowing the LLM agent to perform reasoning and retrieval entirely in the latent space for efficiency. When required, latent tokens can be decoded into natural language for transparency. Since each decoding process depends only on its corresponding latent tokens, all thoughts and subqueries across multiple steps can be decoded in parallel, reducing the latency of generating these natural language sequences.
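A sketch of the parallel decoding step; it assumes equal-length latent sequences so they can be stacked (padding would be needed otherwise), and relies on `generate` accepting `inputs_embeds`, which holds for most decoder-only models in `transformers`:

```python
import torch

def decode_latents_in_parallel(model, projector, latents, max_new_tokens=64):
    """Decode the latent thought/subquery tokens of all steps in one batch.

    latents: list of (n, d_llm) tensors, one per reasoning step. Because each
    decoding depends only on its own latent tokens, the steps form one batch
    and run in parallel rather than sequentially.
    """
    batch = torch.stack(latents)   # (steps, n, d_llm)
    inputs = projector(batch)      # map latents into the LLM input space
    return model.generate(inputs_embeds=inputs, max_new_tokens=max_new_tokens)
```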
### 4.4 Overall Training Objective

The overall training objective is defined as a weighted combination of the generation loss, retrieval loss, and latent decoding loss:

$$
\mathcal{L} = \mathcal{L}_{\mathrm{gen}} + \lambda_{\mathrm{ret}} \mathcal{L}_{\mathrm{ret}} + \mathcal{L}_{\mathrm{dec}}, \tag{11}
$$

where $\lambda_{\mathrm{ret}}$ controls the relative scale of the retrieval loss. We do not introduce additional scaling factors for $\mathcal{L}_{\mathrm{gen}}$ and $\mathcal{L}_{\mathrm{dec}}$, since both are derived from the standard LLM cross-entropy objective and thus have comparable magnitudes.
## 5 Experiments

### 5.1 Experimental Setup

#### Datasets.
We evaluate LatentRAG using seven common benchmark QA datasets, comprising three general QA datasets (NQ [[30](https://arxiv.org/html/2605.06285#bib.bib75)], TriviaQA [[27](https://arxiv.org/html/2605.06285#bib.bib76)], and PopQA [[38](https://arxiv.org/html/2605.06285#bib.bib77)]) and four multi-hop QA datasets (HotpotQA [[73](https://arxiv.org/html/2605.06285#bib.bib78)], 2wiki [[18](https://arxiv.org/html/2605.06285#bib.bib79)], Musique [[56](https://arxiv.org/html/2605.06285#bib.bib80)], and Bamboogle [[44](https://arxiv.org/html/2605.06285#bib.bib81)]). We use the 2018 Wikipedia dump [[28](https://arxiv.org/html/2605.06285#bib.bib82)] as the corpus for retrieval. More details of the datasets can be found in Appendix [C](https://arxiv.org/html/2605.06285#A3).
#### Baselines.

We compare LatentRAG against a diverse set of baselines covering direct inference (Direct Infer), traditional single-step RAG (Naive RAG [[32](https://arxiv.org/html/2605.06285#bib.bib13)]), prompt-based agentic RAG (Iter-RetGen [[47](https://arxiv.org/html/2605.06285#bib.bib83)], Search-o1 [[34](https://arxiv.org/html/2605.06285#bib.bib27)]), and training-based agentic RAG (R1-Searcher [[52](https://arxiv.org/html/2605.06285#bib.bib57)], ZeroSearch [[53](https://arxiv.org/html/2605.06285#bib.bib84)], DeepRAG [[13](https://arxiv.org/html/2605.06285#bib.bib28)], Search-R1 [[24](https://arxiv.org/html/2605.06285#bib.bib26)], AutoRefine [[49](https://arxiv.org/html/2605.06285#bib.bib45)]).
#### Implementation details.

Following previous works [[24](https://arxiv.org/html/2605.06285#bib.bib26), [49](https://arxiv.org/html/2605.06285#bib.bib45)], we adopt Qwen2.5-7B [[72](https://arxiv.org/html/2605.06285#bib.bib96)] as the default LLM for all methods. For training-based baselines, we use their published model weights to ensure faithful reproduction of their reported performance. Training trajectories are constructed from a combined training set of NQ and HotpotQA using Search-R1 and AutoRefine. Variants trained on trajectories generated by Search-R1 and AutoRefine are denoted as LatentRAG◆ and LatentRAG△, respectively. To reduce computational costs, we conduct the main experiments using lightweight retrieval models with fewer than 1B parameters, which are among the top-performing models on the MTEB benchmark [[39](https://arxiv.org/html/2605.06285#bib.bib85)] and cover diverse model architectures, including Qwen3-Embedding-0.6B [[77](https://arxiv.org/html/2605.06285#bib.bib86)], e5-base-v2 [[60](https://arxiv.org/html/2605.06285#bib.bib87)], jina-embeddings-v5-text-nano [[1](https://arxiv.org/html/2605.06285#bib.bib88)], harrier-oss-v1-270m ([https://huggingface.co/microsoft/harrier-oss-v1-270m](https://huggingface.co/microsoft/harrier-oss-v1-270m)), and F2LLM-v2-330M [[79](https://arxiv.org/html/2605.06285#bib.bib89)]. Unless otherwise specified, we use Qwen3-Embedding-0.6B as the default retriever. To evaluate the trade-off between performance and latency, we report the exact match (EM) score [[24](https://arxiv.org/html/2605.06285#bib.bib26)] and the average latency per question. Latency is measured on a single NVIDIA H100 GPU with 94 GB memory by default. More implementation details are in Appendix [B](https://arxiv.org/html/2605.06285#A2).
### 5.2 Main Results

#### Overall performance and latency.

As shown in Table [1](https://arxiv.org/html/2605.06285#S4.T1), advanced agentic RAG methods such as Search-R1 and AutoRefine achieve superior performance over naive single-step RAG but incur substantially higher latency, with an average overhead of around 15× the latency of single-step RAG. This latency gap is more pronounced on multi-hop QA datasets. In contrast, LatentRAG trained on trajectories from Search-R1 and AutoRefine achieves comparable performance, with relative differences within 5%, while significantly reducing latency by approximately 90%. This advantage holds consistently across diverse retrieval models. Fig. [1](https://arxiv.org/html/2605.06285#S1.F1) shows that LatentRAG significantly reduces latency in thought and subquery generation.
Compared to other retrieval models, we observe a relatively large performance drop when using e5-base-v2. To investigate the source of this discrepancy, we analyze the embedding spaces of different retrieval models. As shown in Fig. [4](https://arxiv.org/html/2605.06285#A5.F4) in the Appendix, e5-base-v2 exhibits severe anisotropy [[33](https://arxiv.org/html/2605.06285#bib.bib94), [83](https://arxiv.org/html/2605.06285#bib.bib95)], indicating that the embeddings produced by the model are highly concentrated within a narrow cone on a hypersphere. This skewed distribution makes it difficult for the LLM to adapt to the retrieval space. More analysis is provided in Appendix [E.1](https://arxiv.org/html/2605.06285#A5.SS1).
Table 2: Latency with and without decoding.

#### Latent decoding efficiency.

Latent decoding is an option for improving transparency at the cost of additional latency. To quantify this overhead, Table [2](https://arxiv.org/html/2605.06285#S5.T2) reports the latency of LatentRAG with and without latent decoding. Latent decoding increases the overall latency of LatentRAG by approximately 4–5×. Nevertheless, it still reduces latency by 63.3% and 47.4% compared to Search-R1 and AutoRefine, respectively. The efficiency gain stems from the removal of sequential dependencies, enabling parallel decoding across steps. The actual speedup is bounded by the longest sequence in the batch, which determines the number of decoding steps required. We report the *max length ratio* in Table [2](https://arxiv.org/html/2605.06285#S5.T2), defined as the fraction of tokens in the longest thought or subquery sequence over the total decoding length. A higher ratio indicates a more imbalanced distribution of sequence lengths. In particular, LatentRAG△ exhibits a larger max length ratio, which explains its less pronounced efficiency gains. Further analysis is provided in Appendix [E.2](https://arxiv.org/html/2605.06285#A5.SS2), along with case studies of decoded examples in Appendix [E.7](https://arxiv.org/html/2605.06285#A5.SS7).
Figure 3: Performance and latency results across different retrieval model and LLM sizes.

#### Scaling model size.

We study scalability along two orthogonal dimensions. For retrieval model scaling, we evaluate Qwen3-Embedding-0.6B, 4B, and 8B [[77](https://arxiv.org/html/2605.06285#bib.bib86)] with a fixed 7B LLM. For LLM scaling, we evaluate Qwen2.5-3B, 7B, and 14B with a fixed Qwen3-Embedding-0.6B retrieval model. Larger retrieval models produce higher-dimensional embeddings, resulting in a substantially larger index that cannot fit on a single GPU. To ensure a fair comparison across different model sizes, we use three H100 GPUs for retrieval deployment and one for the LLM in all scaling experiments.

As shown in Fig. [3](https://arxiv.org/html/2605.06285#S5.F3), performance improves with increasing model size along both dimensions. Scaling the retrieval model introduces negligible latency overhead, as the retrieval process can be efficiently parallelized. In contrast, scaling the LLM leads to substantial latency increases for Search-R1 due to the increased decoding time for thought and subquery generation. Our method achieves comparable performance across most settings and yields improvements in the 3B LLM setting while significantly reducing inference latency.
### 5.3 Ablation Studies

Table 3: Ablation studies on key design choices.

We conduct ablation studies on key design choices to validate their effectiveness. Specifically, we replace the KL-based retrieval objective in Eq. [8](https://arxiv.org/html/2605.06285#S4.E8) with two alternatives: (i) a cosine loss, which directly minimizes the cosine distance between the latent subquery embedding $\boldsymbol{v}_{s^{\ell}_t}$ and the corresponding reference subquery embedding $\boldsymbol{v}_{s_t}'$, and (ii) a standard InfoNCE loss [[41](https://arxiv.org/html/2605.06285#bib.bib50)], which is widely used for training retrieval models. We further consider two ablation settings: (iii) removing the pretrained retrieval model and relying solely on the LLM to produce subquery embeddings, and (iv) removing the latent decoding loss in Eq. [10](https://arxiv.org/html/2605.06285#S4.E10) during training. We report the average EM score as well as two retrieval-related metrics: (a) retrieval success rate, defined as the proportion of successful iterative retrievals whose retrieved documents contain the ground-truth answer, and (b) retrieval overlap, defined as the proportion of documents retrieved by Search-R1 that are also retrieved by our model.
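Read literally, the two metrics can be computed as in this sketch; the function names and document representations are ours, not from the paper:

```python
def retrieval_success_rate(step_retrievals, gold_answer):
    """Proportion of iterative retrievals whose documents contain the answer."""
    hits = [any(gold_answer.lower() in doc.lower() for doc in docs)
            for docs in step_retrievals]  # one list of document texts per step
    return sum(hits) / max(len(hits), 1)

def retrieval_overlap(searchr1_doc_ids, our_doc_ids):
    """Proportion of documents retrieved by Search-R1 also retrieved by ours."""
    teacher, ours = set(searchr1_doc_ids), set(our_doc_ids)
    return len(teacher & ours) / max(len(teacher), 1)
```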
As shown in Table [3](https://arxiv.org/html/2605.06285#S5.T3), LatentRAG with the proposed KL-based objective achieves a higher EM score and retrieval success rate than the cosine and InfoNCE alternatives. The cosine loss yields the highest retrieval overlap ratio, indicating closer imitation of the teacher model Search-R1; however, its performance is lower than that of the KL-based variant, suggesting that aligning too closely with the teacher model may limit model capacity and lead to suboptimal performance. Removing the pretrained retrieval model also degrades performance, highlighting the importance of the inductive bias the pretrained retrieval model provides. Finally, removing the latent decoding loss leads to performance degradation, suggesting that latent decoding not only improves transparency at inference time but also facilitates the learning of latent representations during training.
## 6 Conclusion

In this paper, we propose LatentRAG, an efficient agentic RAG framework that shifts both reasoning and retrieval from discrete language space to continuous latent space. Experiments show that LatentRAG achieves performance comparable to existing agentic RAG methods while reducing latency by approximately 90%. To improve transparency, the latent tokens can optionally be decoded into natural language at additional latency cost, while still achieving an overall 40–60% reduction in latency compared to the corresponding baselines. Experiments across different model scales further demonstrate the general applicability of LatentRAG.
## References
- [1] (2026). Jina-embeddings-v5-text: task-targeted embedding distillation. arXiv preprint arXiv:2602.15547.
- [2] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi (2024). Self-RAG: learning to retrieve, generate, and critique through self-reflection. In ICLR.
- [3] L. Barrault, P. Duquenne, M. Elbayad, A. Kozhevnikov, B. Alastruey, P. Andrews, M. Coria, G. Couairon, M. R. Costa-jussà, D. Dale, et al. (2024). Large concept models: language modeling in a sentence representation space. arXiv preprint arXiv:2412.08821.
- [4] J. E. Batista, E. Vatai, and M. Wahib (2025). SAFE: improving LLM systems using sentence-level in-generation attribution. arXiv preprint arXiv:2505.12621.
- [5] M. Chen, L. Sun, T. Li, C. Zhu, H. Wang, J. Z. Pan, W. Zhang, H. Chen, F. Yang, Z. Zhou, et al. (2025). ReSearch: learning to reason with search for LLMs via reinforcement learning. In NeurIPS.
- [6] X. Chen, A. Zhao, H. Xia, X. Lu, H. Wang, Y. Chen, W. Zhang, J. Wang, W. Li, and X. Shen (2025). Reasoning beyond language: a comprehensive survey on latent chain-of-thought reasoning. arXiv preprint arXiv:2505.16782.
- [7] J. Cheng and B. Van Durme (2024). Compressed chain of thought: efficient reasoning through dense representations. arXiv preprint arXiv:2412.13171.
- [8] Q. Cheng, X. Li, S. Li, Q. Zhu, Z. Yin, Y. Shao, L. Li, T. Sun, H. Yan, and X. Qiu (2024). Unified active retrieval for retrieval augmented generation. In Findings of EMNLP.
- [9] J. Deng, L. Pang, Z. Wei, S. Xu, Z. Duan, K. Xu, Y. Song, H. Shen, and X. Cheng (2025). Latent reasoning in LLMs as a vocabulary-space superposition. arXiv preprint arXiv:2510.15522.
- [10] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T. Chua, and Q. Li (2024). A survey on RAG meeting LLMs: towards retrieval-augmented large language models. In KDD.
- [11] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang, and H. Wang (2023). Retrieval-augmented generation for large language models: a survey. arXiv preprint arXiv:2312.10997.
- [12] S. Goyal, Z. Ji, A. S. Rawat, A. K. Menon, S. Kumar, and V. Nagarajan (2024). Think before you speak: training language models with pause tokens. In ICLR.
- [13] X. Guan, J. Zeng, F. Meng, C. Xin, Y. Lu, H. Lin, X. Han, L. Sun, and J. Zhou (2026). DeepRAG: thinking to retrieve step by step for large language models. In ICLR.
- [14] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020). Retrieval augmented language model pre-training. In ICML.
- [15] S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. Weston, and Y. Tian (2025). Training large language models to reason in a continuous latent space. In COLM.
- [16] J. He, R. H. Bai, S. Williamson, J. Z. Pan, N. Jaitly, and Y. Zhang (2025). CLaRa: bridging retrieval and generation with continuous latent reasoning. arXiv preprint arXiv:2511.18659.
- [17] Y. He, W. Zheng, Y. Zhu, Z. Zheng, L. Su, S. Vasudevan, Q. Guo, L. Hong, and J. Li (2025). SemCoT: accelerating chain-of-thought reasoning through semantically-aligned implicit tokens. In NeurIPS.
- [18] X. Ho, A. D. Nguyen, S. Sugawara, and A. Aizawa (2020). Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In COLING.
- [19] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al. (2025). A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst.
- [20] S. Jeong, J. Baek, S. Cho, S. J. Hwang, and J. C. Park (2024). Adaptive-RAG: learning to adapt retrieval-augmented large language models through question complexity. In NAACL.
- [21] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung (2023). Survey of hallucination in natural language generation. ACM Comput. Surv.
- [22] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, and G. Neubig (2023). Active retrieval augmented generation. In EMNLP.
- [23] B. Jin, J. Yoon, P. Kargupta, S. O. Arik, and J. Han (2025). An empirical study on reinforcement learning for reasoning-search interleaved LLM agents. arXiv preprint arXiv:2505.15117.
- [24] B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han (2025). Search-R1: training LLMs to reason and leverage search engines with reinforcement learning. In COLM.
- [25] J. Jin, Y. Zhang, M. Li, D. Long, P. Xie, Y. Zhu, and Z. Dou (2026). LaSER: internalizing explicit reasoning into latent space for dense retrieval. arXiv preprint arXiv:2603.01425.
- [26] J. Jin, Y. Zhu, Z. Dou, G. Dong, X. Yang, C. Zhang, T. Zhao, Z. Yang, and J. Wen (2025). FlashRAG: a modular toolkit for efficient retrieval-augmented generation research. In WWW.
- [27] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017). TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In ACL.
- [28] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020). Dense passage retrieval for open-domain question answering. In EMNLP.
- [29] D. Kim, B. Kim, D. Han, and M. Eibich (2024). AutoRAG: automated framework for optimization of retrieval augmented generation pipeline. arXiv preprint arXiv:2410.20878.
- [30] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. (2019). Natural questions: a benchmark for question answering research. TACL.
- [31] J. Lai, W. Gan, J. Wu, Z. Qi, and P. S. Yu (2024). Large language models in law: a survey. AI Open.
- [32] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS.
- [33] B. Li, H. Zhou, J. He, M. Wang, Y. Yang, and L. Li (2020). On the sentence embeddings from pre-trained language models. In EMNLP.
- [34] X. Li, G. Dong, J. Jin, Y. Zhang, Y. Zhou, Y. Zhu, P. Zhang, and Z. Dou (2025). Search-o1: agentic search-enhanced large reasoning models. In EMNLP.
- [35] Y. Li, W. Zhang, Y. Yang, W. Huang, Y. Wu, J. Luo, Y. Bei, H. P. Zou, X. Luo, Y. Zhao, et al. (2025). Towards agentic RAG with deep reasoning: a survey of RAG-reasoning systems in LLMs. In Findings of EMNLP.
- [36] M. Lin, Z. Wu, Z. Xu, H. Liu, X. Tang, Q. He, C. Aggarwal, X. Zhang, and S. Wang (2025). A comprehensive survey on reinforcement learning-based agentic search: foundations, roles, optimizations, evaluations, and applications. arXiv preprint arXiv:2510.16724.
- \[36\]M\. Lin, Z\. Wu, Z\. Xu, H\. Liu, X\. Tang, Q\. He, C\. Aggarwal, X\. Zhang, and S\. Wang\(2025\)A comprehensive survey on reinforcement learning\-based agentic search: foundations, roles, optimizations, evaluations, and applications\.arXiv preprint arXiv:2510\.16724\.Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p2.1),[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px1.p1.1)\.
- \[37\]Y\. Lyu, Z\. Li, S\. Niu, F\. Xiong, B\. Tang, W\. Wang, H\. Wu, H\. Liu, T\. Xu, and E\. Chen\(2025\)CRUD\-RAG: a comprehensive chinese benchmark for retrieval\-augmented generation of large language models\.ACM Trans\. Inf\. Syst\.\.Cited by:[§3](https://arxiv.org/html/2605.06285#S3.p3.1)\.
- \[38\]A\. Mallen, A\. Asai, V\. Zhong, R\. Das, D\. Khashabi, and H\. Hajishirzi\(2023\)When not to trust language models: investigating effectiveness of parametric and non\-parametric memories\.InACL,Cited by:[Appendix C](https://arxiv.org/html/2605.06285#A3.p1.1),[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px1.p1.1)\.
- \[39\]N\. Muennighoff, N\. Tazi, L\. Magne, and N\. Reimers\(2023\)MTEB: massive text embedding benchmark\.InEACL,Cited by:[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px3.p1.2)\.
- \[40\]nostalgebraist\(2020\)Interpreting GPT: the logit lens\.Note:[https://www\.lesswrong\.com/posts/AcKRB8wDpdaN6v6ru/interpreting\-gpt\-the\-logit\-lens](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens)Cited by:[§E\.7](https://arxiv.org/html/2605.06285#A5.SS7.SSS0.Px3.p1.1)\.
- \[41\]A\. v\. d\. Oord, Y\. Li, and O\. Vinyals\(2018\)Representation learning with contrastive predictive coding\.arXiv preprint arXiv:1807\.03748\.Cited by:[§4\.2](https://arxiv.org/html/2605.06285#S4.SS2.p2.1),[§5\.3](https://arxiv.org/html/2605.06285#S5.SS3.p1.2)\.
- \[42\]B\. Peng, Y\. Zhu, Y\. Liu, X\. Bo, H\. Shi, C\. Hong, Y\. Zhang, and S\. Tang\(2025\)Graph retrieval\-augmented generation: a survey\.ACM Trans\. Inf\. Syst\.\.Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p1.1)\.
- \[43\]J\. Pfau, W\. Merrill, and S\. R\. Bowman\(2024\)Let’s think dot by dot: hidden computation in transformer language models\.InCOLM,Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px2.p1.1)\.
- \[44\]O\. Press, M\. Zhang, S\. Min, L\. Schmidt, N\. A\. Smith, and M\. Lewis\(2023\)Measuring and narrowing the compositionality gap in language models\.InFindings of EMNLP,Cited by:[Appendix C](https://arxiv.org/html/2605.06285#A3.p1.1),[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px1.p1.1)\.
- \[45\]J\. Qi, G\. Sarti, R\. Fernández, and A\. Bisazza\(2024\)Model internals\-based answer attribution for trustworthy retrieval\-augmented generation\.InEMNLP,Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p5.1)\.
- \[46\]T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom\(2023\)Toolformer: language models can teach themselves to use tools\.InNeurIPS,Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p2.1)\.
- \[47\]Z\. Shao, Y\. Gong, Y\. Shen, M\. Huang, N\. Duan, and W\. Chen\(2023\)Enhancing retrieval\-augmented large language models with iterative retrieval\-generation synergy\.InFindings of EMNLP,Cited by:[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px2.p1.1)\.
- \[48\]Z\. Shen, H\. Yan, L\. Zhang, Z\. Hu, Y\. Du, and Y\. He\(2025\)CODI: compressing chain\-of\-thought into continuous space via self\-distillation\.InEMNLP,Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px2.p1.1)\.
- \[49\]Y\. Shi, S\. Li, C\. Wu, Z\. Liu, J\. Fang, H\. Cai, A\. Zhang, and X\. Wang\(2025\)Search and refine during think: facilitating knowledge refinement for improved retrieval\-augmented reasoning\.InNeurIPS,Cited by:[Appendix B](https://arxiv.org/html/2605.06285#A2.SS0.SSS0.Px4.p2.1),[Appendix C](https://arxiv.org/html/2605.06285#A3.p1.1),[§1](https://arxiv.org/html/2605.06285#S1.p5.1),[§4\.2](https://arxiv.org/html/2605.06285#S4.SS2.p2.1),[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px3.p1.2)\.
- \[50\]A\. Singh, A\. Ehtesham, S\. Kumar, and T\. T\. Khoei\(2025\)Agentic retrieval\-augmented generation: a survey on agentic RAG\.arXiv preprint arXiv:2501\.09136\.Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p1.1),[§1](https://arxiv.org/html/2605.06285#S1.p2.1),[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2605.06285#S3.p3.1)\.
- \[51\]K\. Singhal, T\. Tu, J\. Gottweis, R\. Sayres, E\. Wulczyn, M\. Amin, L\. Hou, K\. Clark, S\. R\. Pfohl, H\. Cole\-Lewis,et al\.\(2025\)Toward expert\-level medical question answering with large language models\.Nat\. Med\.\.Cited by:[Appendix A](https://arxiv.org/html/2605.06285#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.06285#S1.p1.1)\.
- \[52\]H\. Song, J\. Jiang, Y\. Min, J\. Chen, Z\. Chen, W\. X\. Zhao, L\. Fang, and J\. Wen\(2025\)R1\-Searcher: incentivizing the search capability in LLMs via reinforcement learning\.arXiv preprint arXiv:2503\.05592\.Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px2.p1.1)\.
- \[53\]H\. Sun, Z\. Qiao, J\. Guo, X\. Fan, Y\. Hou, Y\. Jiang, P\. Xie, Y\. Zhang, F\. Huang, and J\. Zhou\(2025\)ZeroSearch: incentivize the search capability of LLMs without searching\.arXiv preprint arXiv:2505\.04588\.Cited by:[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px2.p1.1)\.
- \[54\]J\. Tack, J\. Lanchantin, J\. Yu, A\. Cohen, I\. Kulikov, J\. Lan, S\. Hao, Y\. Tian, J\. Weston, and X\. Li\(2025\)LLM pretraining with continuous concepts\.arXiv preprint arXiv:2502\.08524\.Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p4.1)\.
- \[55\]Z\. Tan, J\. Huang, Q\. Wu, H\. Zhang, C\. Zhuang, and J\. Gu\(2026\)RAG\-R1: incentivizing the search and reasoning capabilities of LLMs through multi\-query parallelism\.InAAAI,Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p2.1),[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2605.06285#S3.p1.5)\.
- \[56\]H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal\(2022\)MuSiQue: multihop questions via single\-hop question composition\.TACL\.Cited by:[Appendix C](https://arxiv.org/html/2605.06285#A3.p1.1),[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px1.p1.1)\.
- \[57\]H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal\(2023\)Interleaving retrieval with chain\-of\-thought reasoning for knowledge\-intensive multi\-step questions\.InACL,Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p1.1),[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px1.p1.1)\.
- \[58\]C\. Wang, X\. Liu, Y\. Yue, Q\. Guo, X\. Hu, X\. Tang, T\. Zhang, C\. Jiayang, Y\. Yao, X\. Hu, Z\. Qi, W\. Gao, Y\. Wang, L\. Yang, J\. Wang, X\. Xie, Z\. Zhang, and Y\. Zhang\(2025\)Survey on factuality in large language models\.ACM Comput\. Surv\.\.Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p1.1)\.
- \[59\]J\. Wang, Z\. Wu, F\. Lai, S\. Lian, and Z\. Zeng\(2025\)SynAdapt: learning adaptive reasoning in large language models via synthetic continuous chain\-of\-thought\.arXiv preprint arXiv:2508\.00574\.Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px2.p1.1)\.
- \[60\]L\. Wang, N\. Yang, X\. Huang, B\. Jiao, L\. Yang, D\. Jiang, R\. Majumder, and F\. Wei\(2022\)Text embeddings by weakly\-supervised contrastive pre\-training\.arXiv preprint arXiv:2212\.03533\.Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p5.1),[§4\.2](https://arxiv.org/html/2605.06285#S4.SS2.p2.1),[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px3.p1.2)\.
- \[61\]M\. Wang, A\. Stoll, L\. Lange, H\. Adel, H\. Schütze, and J\. Strötgen\(2025\)Bring your own knowledge: a survey of methods for LLM knowledge expansion\.arXiv preprint arXiv:2502\.12598\.Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p1.1)\.
- \[62\]P\. Wang, T\. Liu, C\. Wang, Z\. Li, Y\. Wang, S\. Yan, C\. Jia, X\. Liu, X\. Chen, J\. Xu,et al\.\(2025\)A survey on large language models for mathematical reasoning\.ACM Comput\. Surv\.\.Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p1.1)\.
- \[63\]S\. Wang, Y\. Zhu, H\. Liu, Z\. Zheng, C\. Chen, and J\. Li\(2024\)Knowledge editing for large language models: a survey\.ACM Comput\. Surv\.\.Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p1.1)\.
- \[64\]Y\. Wang, M\. Wang, M\. A\. Manzoor, F\. Liu, G\. N\. Georgiev, R\. J\. Das, and P\. Nakov\(2024\)Factuality of large language models: a survey\.InEMNLP,Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p1.1)\.
- \[65\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.NeurIPS\.Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p2.1),[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px2.p1.1)\.
- \[66\]X\. Wei, X\. Liu, Y\. Zang, X\. Dong, Y\. Cao, J\. Wang, X\. Qiu, and D\. Lin\(2026\)SIM\-CoT: supervised implicit chain\-of\-thought\.InICLR,Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px2.p1.1)\.
- \[67\]P\. Wu, M\. Zhang, K\. Wan, W\. Zhao, K\. He, X\. Du, and Z\. Chen\(2026\)HiPRAG: hierarchical process rewards for efficient agentic retrieval augmented generation\.InICLR,Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px1.p1.1)\.
- \[68\]Y\. Xie, N\. Thomas, N\. Hansen, Y\. Fu, L\. E\. Li, and X\. Wang\(2026\)TIPS: turn\-level information\-potential reward shaping for search\-augmented LLMs\.InICLR,Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px1.p1.1)\.
- \[69\]Z\. Xinjie, F\. Gao, X\. Song, Y\. Chen, R\. Yang, Y\. Fu, Y\. Wang, Y\. Iwasawa, Y\. Matsuo, and I\. Li\(2025\)ReAgent: reversible multi\-agent reasoning for knowledge\-enhanced multi\-hop QA\.InEMNLP,Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px1.p1.1)\.
- \[70\]Y\. Xu, X\. Guo, Z\. Zeng, and C\. Miao\(2025\)SoftCoT: soft chain\-of\-thought for efficient reasoning with LLMs\.InACL,Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px2.p1.1)\.
- \[71\]Z\. Xu, Z\. Xu, R\. Zhang, C\. Zhu, S\. Yu, W\. Liu, Q\. Zhang, W\. Ding, C\. Yu, and Y\. Wang\(2026\)WideSeek\-R1: exploring width scaling for broad information seeking via multi\-agent reinforcement learning\.arXiv preprint arXiv:2602\.04634\.Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px1.p1.1)\.
- \[72\]A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu\(2024\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px3.p1.2)\.
- \[73\]Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning\(2018\)HotpotQA: a dataset for diverse, explainable multi\-hop question answering\.InEMNLP,Cited by:[Appendix C](https://arxiv.org/html/2605.06285#A3.p1.1),[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px1.p1.1)\.
- \[74\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao\(2023\)React: synergizing reasoning and acting in language models\.InICLR,Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p2.1)\.
- \[75\]X\. Yu, Z\. Chen, Y\. He, T\. Fu, C\. Yang, C\. Xu, Y\. Ma, X\. Hu, Z\. Cao, J\. Xu,et al\.\(2026\)The latent space: foundation, evolution, mechanism, ability, and outlook\.arXiv preprint arXiv:2604\.02029\.Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px2.p1.1)\.
- \[76\]F\. Zhang, X\. Niu, C\. Ying, G\. Lin, Z\. Hao, Z\. Fan, C\. Huang, J\. Keung, B\. Chen, and J\. Lin\(2026\)A2Search: ambiguity\-aware question answering with reinforcement learning\.InICLR,Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px1.p1.1)\.
- \[77\]Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin,et al\.\(2025\)Qwen3 embedding: advancing text embedding and reranking through foundation models\.arXiv preprint arXiv:2506\.05176\.Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p5.1),[§4\.2](https://arxiv.org/html/2605.06285#S4.SS2.p2.1),[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px3.p1.2),[§5\.2](https://arxiv.org/html/2605.06285#S5.SS2.SSS0.Px3.p1.1)\.
- \[78\]Z\. Zhang, X\. He, W\. Yan, A\. Shen, C\. Zhao, S\. Wang, Y\. Shen, and X\. E\. Wang\(2025\)Soft thinking: unlocking the reasoning potential of LLMs in continuous concept space\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px2.p1.1)\.
- \[79\]Z\. Zhang, Z\. Liao, H\. Yu, P\. Di, and R\. Wang\(2026\)F2LLM\-v2: inclusive, performant, and efficient embeddings for a multilingual world\.arXiv preprint arXiv:2603\.19223\.Cited by:[§5\.1](https://arxiv.org/html/2605.06285#S5.SS1.SSS0.Px3.p1.2)\.
- \[80\]Q\. Zhao, R\. Wang, D\. Xu, D\. Zha, and L\. Liu\(2025\)R\-Search: empowering LLM reasoning with search via multi\-reward reinforcement learning\.arXiv preprint arXiv:2506\.04185\.Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px1.p1.1)\.
- \[81\]S\. Zhao, T\. Yu, A\. Xu, J\. Singh, A\. Shukla, and R\. Akkiraju\(2025\)ParallelSearch: train your LLMs to decompose query and search sub\-queries in parallel with reinforcement learning\.arXiv preprint arXiv:2508\.09303\.Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px1.p1.1)\.
- \[82\]Y\. Zheng, D\. Fu, X\. Hu, X\. Cai, L\. Ye, P\. Lu, and P\. Liu\(2025\)DeepResearcher: scaling deep research via reinforcement learning in real\-world environments\.InEMNLP,Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px1.p1.1)\.
- \[83\]W\. Zhou, B\. Y\. Lin, and X\. Ren\(2021\)IsoBN: fine\-tuning BERT with isotropic batch normalization\.InAAAI,Cited by:[§E\.1](https://arxiv.org/html/2605.06285#A5.SS1.p1.1),[§5\.2](https://arxiv.org/html/2605.06285#S5.SS2.SSS0.Px1.p2.1)\.
- \[84\]Y\. Zhou, Y\. Wang, X\. Yin, S\. Zhou, and A\. R\. Zhang\(2026\)The geometry of reasoning: flowing logics in representation space\.InICLR,Cited by:[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px2.p1.1)\.
- \[85\]R\. Zhu, T\. Peng, T\. Cheng, X\. Qu, J\. Huang, D\. Zhu, H\. Wang, K\. Xue, X\. Zhang, Y\. Shan,et al\.\(2025\)A survey on latent reasoning\.arXiv preprint arXiv:2507\.06203\.Cited by:[§1](https://arxiv.org/html/2605.06285#S1.p4.1),[§2](https://arxiv.org/html/2605.06285#S2.SS0.SSS0.Px2.p1.1)\.
## Appendix A Broader Impacts and Limitations
#### Broader impacts.
This work proposes an efficient agentic RAG framework that performs reasoning and retrieval in latent space. The proposed approach can be applied to a wide range of information-seeking scenarios, such as legal or clinical question answering [[31](https://arxiv.org/html/2605.06285#bib.bib5), [51](https://arxiv.org/html/2605.06285#bib.bib4)], and can improve overall efficiency in these tasks. More broadly, whereas most existing work trains agents to use search engines originally designed for humans, this work suggests a shift from human-oriented, text-based search engines to agent-oriented, embedding-based search engines that better support agent usage. This points to a potential direction for rethinking search engines in the era of agentic systems.
#### Limitations & future work.
Our method relies on SFT over trajectories generated by existing agentic RAG methods, so its performance is partly bounded by the quality of the training data. This prevents the model from directly learning an optimal retrieval policy through interaction with the retrieval system. Nevertheless, our approach yields strong and efficient initial models that serve as an effective foundation for future research. Future work could investigate reinforcement learning to improve performance by encouraging exploration and exploitation.
## Appendix B Implementation Details
#### Training data construction.
As described in the main paper, we combine the training sets from NQ and HotpotQA to construct a unified training dataset. We then build training trajectories using interaction data generated by Search-R1 and AutoRefine on this unified dataset. Each trajectory consists of the question, intermediate reasoning thoughts, subqueries, retrieved document chunks, and the final generated answer. AutoRefine introduces an additional refinement stage to improve the initially retrieved documents; to maintain a trajectory format consistent with Search-R1, we merge the refinement text into the reasoning thoughts. We retain only trajectories that produce a correct final answer for training. To enable finer-grained control over the components of generated trajectories, we introduce a set of special tokens that explicitly mark structural elements in the output, such as `<Answer>…</Answer>`. In contrast, these tags are typically tokenized into multiple subword units in Search-R1 and AutoRefine. This difference may introduce minor variations in generation time, but its impact is negligible compared to the overall latency reduction achieved by our framework.
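For illustration, a hedged sketch of how a trajectory might be serialized with such structural tags; apart from `<Answer>…</Answer>`, the tag names and helper below are hypothetical stand-ins (the actual templates appear in Appendix D):

```python
# Minimal sketch of trajectory serialization. Tag names other than <Answer>
# are hypothetical; each tag would be registered as a single special token.
def serialize_trajectory(question, steps, answer):
    parts = [f"Question: {question}"]
    for thought, subquery, docs in steps:  # one (thought, subquery, docs) tuple per step
        parts.append(f"<Think>{thought}</Think>")
        parts.append(f"<Query>{subquery}</Query>")
        parts.append("<Information>" + " ".join(docs) + "</Information>")
    parts.append(f"<Answer>{answer}</Answer>")
    return "\n".join(parts)
```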
#### Computing resources & parallelization strategies.
For training, we optimize LatentRAG on a single compute node equipped with two NVIDIA H100 GPUs, each with 94 GB of memory. Each training job takes about 24 to 48 hours to complete. To reduce GPU memory consumption, we enable gradient checkpointing to minimize the storage of intermediate activations. For distributed training, we adopt DeepSpeed ZeRO-1 ([https://github.com/deepspeedai/DeepSpeed](https://github.com/deepspeedai/DeepSpeed)), which shards the optimizer states across GPU devices while keeping gradients and model parameters fully replicated. This design avoids the additional communication overhead of parameter and gradient sharding, maintaining efficient data-parallel training. To address the imbalance in trajectory lengths, we implement a binned batching strategy: we partition trajectories into 200 bins according to their lengths and construct each batch by sampling from a single bin. This ensures that samples within a batch have similar sequence lengths, reducing padding overhead and improving computational efficiency. We use bfloat16 precision and FlashAttention-2 ([https://github.com/dao-ailab/flash-attention](https://github.com/dao-ailab/flash-attention)) during training. We adopt LoRA ([https://github.com/microsoft/LoRA](https://github.com/microsoft/LoRA)) with a rank of 16 for parameter-efficient fine-tuning, which significantly reduces the number of trainable parameters and thereby lowers memory and computational costs.
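A minimal sketch of the binned batching idea, assuming tokenized trajectories whose length is given by `len(ex)`; the bin-edge construction and sampling order are illustrative, not the exact implementation:

```python
import random
from collections import defaultdict

def binned_batches(examples, num_bins=200, batch_size=16):
    """Group trajectories into length bins and draw each batch from a single bin."""
    lengths = sorted(len(ex) for ex in examples)
    # Equal-frequency bin edges over the observed length distribution.
    edges = [lengths[i * len(lengths) // num_bins] for i in range(1, num_bins)]
    bins = defaultdict(list)
    for ex in examples:
        b = sum(len(ex) > e for e in edges)  # index of the bin this length falls into
        bins[b].append(ex)
    for pool in (random.sample(v, len(v)) for v in bins.values() if v):
        for i in range(0, len(pool), batch_size):
            yield pool[i:i + batch_size]  # all samples share similar lengths
```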
For evaluation, we conduct experiments on a single NVIDIA H100 GPU with 94 GB of memory by default. We deploy the retrieval system using Faiss ([https://github.com/facebookresearch/faiss](https://github.com/facebookresearch/faiss)) on the GPU with half-precision indexing and load the LLM on the same device. To ensure a fair comparison across methods, we measure both LLM prefill and decoding latency using the standard forward pass implemented in Hugging Face Transformers ([https://huggingface.co/docs/transformers](https://huggingface.co/docs/transformers)). For scaling experiments, larger retrieval models produce higher-dimensional document embeddings that exceed the memory capacity of a single GPU; for example, the index built from Qwen3-Embedding-8B occupies approximately 160 GB even in float16 precision. To accommodate this, for all scaling experiments we deploy the retrieval system across three H100 GPUs while serving the LLM on a separate H100 GPU. This setup provides sufficient GPU resources for both retrieval and generation, allowing us to report latency under conditions where system bottlenecks are minimized.
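For concreteness, a minimal sketch of a half-precision GPU Faiss index of the kind described above; the embedding dimension and random vectors are placeholders, not our actual corpus:

```python
import faiss
import numpy as np

d = 768                              # placeholder embedding dimension
cpu_index = faiss.IndexFlatIP(d)     # inner-product search over normalized embeddings
res = faiss.StandardGpuResources()
opts = faiss.GpuClonerOptions()
opts.useFloat16 = True               # store the index in half precision on the GPU
index = faiss.index_cpu_to_gpu(res, 0, cpu_index, opts)

docs = np.random.rand(10_000, d).astype("float32")  # stand-in corpus embeddings
faiss.normalize_L2(docs)
index.add(docs)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 3)  # top-3 documents per query, as in our setup
```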
#### Hyperparameters.
We fine-tune the model using LoRA with rank 16 and scaling factor 64, applied to all projection weights. The model is trained for 5 epochs with a learning rate of $1\times 10^{-4}$. The maximum trajectory length is capped at 3000 tokens. For the KL divergence loss, we set the target distribution based on the similarity scores between queries and documents. Specifically, we select the temperature factor that makes the cumulative probability of the top-3 retrieved documents approach 0.5; in practice, this corresponds to setting the temperature to $\beta=0.03$ in most cases. The loss weight for the retrieval objective is set to $\lambda_{\mathrm{ret}}=1$. For the retrieval model, we remove dropout to reduce noise in the target distribution, while for the LLM we apply a dropout rate of 0.1. We use $m=4$ thought tokens for each thought generation step and $n=16$ subquery tokens for each subquery generation step. The training batch size is set to 16. The model is optimized with AdamW using $\beta_1=0.9$, $\beta_2=0.999$, and a weight decay of 0.01. For the retrieval loss, we retrieve the top-16 documents as pseudo-relevant documents and combine them with in-batch negatives, i.e., the pseudo-relevant documents from other subqueries within the same batch, to form the candidate document set over which the document probability distribution is computed.
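A minimal PyTorch sketch of the retrieval KL objective described above, under the assumption that the student distribution is also temperature-scaled; tensor names and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def retrieval_kl_loss(query_emb, doc_embs, teacher_scores, beta=0.03):
    """KL divergence between the student's latent-subquery retrieval distribution
    and a softened target from the frozen retrieval model.

    query_emb:      (B, D)    latent subquery embeddings produced by the LLM
    doc_embs:       (B, K, D) candidates: top-16 pseudo-relevant docs + in-batch negatives
    teacher_scores: (B, K)    query-document similarities from the retrieval model
    beta:           temperature, chosen so the top-3 docs hold ~0.5 target probability
    """
    target = F.softmax(teacher_scores / beta, dim=-1)
    student_scores = torch.einsum("bd,bkd->bk", query_emb, doc_embs)
    log_pred = F.log_softmax(student_scores / beta, dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```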
#### Evaluation metrics & measurement protocol.
We adopt exact match (EM) as the primary performance metric. The EM score measures whether the final predicted answer exactly matches the ground-truth answer. Before evaluation, both predicted and ground-truth answers are normalized by removing articles (e.g., a, an, the), stripping whitespace, removing punctuation, and lowercasing. For all retrieval-based methods, we retrieve the top-3 documents per query, and the maximum number of retrieval iterations is set to 4. For efficiency, we report latency, the end-to-end response time from receiving a query to generating the final answer, estimated on the first 100 questions from each dataset. To enable fine-grained latency analysis, we also report a breakdown of latency across stages: prefill, thought generation, subquery generation, retrieval, and answer generation.
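For reference, a minimal sketch of the EM computation with this normalization, following the standard open-domain QA convention:

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, ground_truth: str) -> int:
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))

# exact_match("The Eiffel Tower", "eiffel tower") -> 1
```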
Following prior work [[24](https://arxiv.org/html/2605.06285#bib.bib26), [49](https://arxiv.org/html/2605.06285#bib.bib45)], we use Qwen2.5 Instruct for inference-based methods due to its stronger instruction-following capabilities. For training-based baselines, we adopt checkpoints released in the original papers, which are based on the Base variant of Qwen2.5; under training-based settings, the Base variant has shown better performance than the Instruct variant in prior work [[24](https://arxiv.org/html/2605.06285#bib.bib26)]. For a fair comparison, we also initialize and fine-tune our model from Qwen2.5 Base.
For stage-wise latency measurement, the embedding time of a natural language subquery is included in the retrieval stage. For our method, to reduce the number of vectors transmitted to the retrieval system, we generate subquery embeddings on the model side from the latent tokens and pass only the resulting embedding vector to the retrieval system. The embedding computation time is therefore attributed to the subquery generation stage rather than the retrieval stage. This design leads to higher measured subquery generation time and lower retrieval time for our method, but it does not affect the computation of overall latency.
## Appendix C Dataset Description
Table 4: Summary of datasets.
We conduct our experiments on seven benchmark QA datasets, following previous work [[24](https://arxiv.org/html/2605.06285#bib.bib26), [49](https://arxiv.org/html/2605.06285#bib.bib45)]. These include three general QA datasets (NQ [[30](https://arxiv.org/html/2605.06285#bib.bib75)], TriviaQA [[27](https://arxiv.org/html/2605.06285#bib.bib76)], and PopQA [[38](https://arxiv.org/html/2605.06285#bib.bib77)]) and four multi-hop QA datasets (HotpotQA [[73](https://arxiv.org/html/2605.06285#bib.bib78)], 2wiki [[18](https://arxiv.org/html/2605.06285#bib.bib79)], Musique [[56](https://arxiv.org/html/2605.06285#bib.bib80)], and Bamboogle [[44](https://arxiv.org/html/2605.06285#bib.bib81)]). Instead of using the original documents provided by each dataset as the retrieval corpus, we follow [[26](https://arxiv.org/html/2605.06285#bib.bib43)] and adopt a more challenging and realistic setting using the full Wikipedia 2018 dump [[28](https://arxiv.org/html/2605.06285#bib.bib82)] as the corpus. The corpus contains 21,015,324 chunked documents, making retrieval significantly more difficult due to its large scale and diverse content. For training, we use the dataset splits provided by FlashRAG ([https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets](https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets)) and train our models on the training sets of NQ and HotpotQA. We evaluate all methods on the test sets (or development sets when test sets are unavailable) of the seven benchmark datasets. Table [4](https://arxiv.org/html/2605.06285#A3.T4) summarizes the number of QA pairs in each dataset. Bamboogle contains only 125 samples, which may lead to higher evaluation variance and less stable performance estimates than the other benchmarks.
## Appendix D Prompt Templates
In this section, we provide all prompt templates used in our framework. Double curly braces {{⋯}} denote runtime placeholders. Prompt [D.1](https://arxiv.org/html/2605.06285#A4) presents the template for latent thought and subquery generation. The latent thought and subquery tokens are derived from the hidden states at the positions of the corresponding special tokens. An action token is predicted based on the final thought token; if the action token is `<answer>`, the special subquery tokens in the prompt template are replaced with the answer token `<answer>` to trigger the answer generation process. Prompts [D.2](https://arxiv.org/html/2605.06285#A4) and [D.3](https://arxiv.org/html/2605.06285#A4) present the templates for latent thought and subquery decoding, respectively.
Prompt D.1. Prompt template for thought and subquery generation:
Answer the following question by reasoning step by step and retrieving necessary information at each step: {{QUESTION}}
`<think_1>`…`<think_m>` `<query_1>`…`<query_n>`
`<information>` Doc 1 {{TOP-1_DOCUMENT}} Doc 2 {{TOP-2_DOCUMENT}} Doc 3 {{TOP-3_DOCUMENT}} `</information>`
⋯
`<think_1>`…`<think_m>` `<query_1>`…`<query_n>`
Prompt D.2. Prompt template for latent thought decoding:
Decode the thought based on the latent representation: {{LATENT_THOUGHT_TOKENS}}
Prompt D.3. Prompt template for latent subquery decoding:
Decode the subquery based on the latent representation: {{LATENT_SUBQUERY_TOKENS}}
## Appendix E More Experimental Results
### E.1 Embedding Space Analysis of Retrieval Models
In Table [1](https://arxiv.org/html/2605.06285#S4.T1) of the main paper, our method exhibits a relatively larger performance drop with e5-base-v2 than with other retrieval models. To investigate the source of this discrepancy, we analyze differences in the geometric properties of the embedding spaces of different retrieval models. Specifically, for each retrieval model, we generate $\ell_2$-normalized embeddings for the entire Wikipedia corpus and compute the mean direction of all document embeddings produced by that model. We then measure the cosine similarity and the angular distance between each document embedding and this mean direction and visualize their distributions. A distribution skewed toward higher cosine similarities (or lower angles) indicates that the embeddings are concentrated around the mean direction rather than spread uniformly over the hypersphere, reflecting stronger anisotropy [[33](https://arxiv.org/html/2605.06285#bib.bib94), [83](https://arxiv.org/html/2605.06285#bib.bib95)] in the embedding space.
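A minimal sketch of this measurement, assuming an array of $\ell_2$-normalized document embeddings:

```python
import numpy as np

def anisotropy_stats(embeddings: np.ndarray):
    """Cosine similarity and angle of each embedding to the corpus mean direction.

    embeddings: (N, D) array of L2-normalized document embeddings.
    """
    mean_dir = embeddings.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir)                  # unit mean direction
    cos = embeddings @ mean_dir                           # (N,) cosine similarities
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return cos, angles  # skew toward cos ~ 1 / small angles signals anisotropy
```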
As shown in Fig. [4](https://arxiv.org/html/2605.06285#A5.F4), the embeddings generated by e5-base-v2 exhibit extremely high cosine similarity and low angular deviation from the mean direction, indicating severe anisotropy: the embeddings are concentrated in a narrow region of the hypersphere rather than well spread out. As a result, small approximation errors in the embedding space can lead to completely different retrieval outputs, making it difficult to train a model that faithfully approximates the behavior of the original retrieval model. Moreover, such a skewed distribution may force the LLM to deviate from its original representation geometry to adapt to this concentrated space, which could hurt the LLM's performance.
Figure 4: Distribution of cosine similarity and angle between document embeddings and their mean direction. We visualize the distributions using violin plots. In contrast to other retrieval models, e5-base-v2 yields embeddings with extremely high cosine similarity and small angular deviation, indicating collapse into a narrow cone of the hypersphere and severe anisotropy.
### E.2 Latent Decoding Efficiency Analysis
Table 5: Average latency (ms) with and without latent decoding across all datasets.
Figure 5: Latency reduction using batch latent decoding vs. max length ratio. Lower max length ratios are associated with higher latency reduction ratios. Each data point corresponds to one dataset.
As discussed in the main paper, latent decoding improves transparency at the cost of additional latency. A useful property of our method is that the decoding of thoughts and subqueries is conditionally independent given the latent tokens. This allows us to decode the different steps in parallel, in contrast to existing agentic RAG methods that generate these sequences sequentially.
To quantify the latency reduction enabled by our parallel decoding strategy, we report detailed latency measurements across multiple datasets and compare them with baseline methods. As shown in Table [5](https://arxiv.org/html/2605.06285#A5.T5), latent decoding increases latency by approximately 4–5× compared to the setting without latent decoding. Nevertheless, compared to the corresponding baseline methods, our method with latent decoding still reduces overall latency by approximately 23–68% across datasets.
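A conceptual sketch of the parallel decoding step, assuming the latent tokens for all $S$ steps have been collected and embedded; the use of `inputs_embeds` with Hugging Face `generate` is illustrative of the idea rather than the exact implementation:

```python
import torch

@torch.no_grad()
def decode_all_steps_in_parallel(model, prompt_embeds, latent_embeds, max_new_tokens=64):
    """Decode every step's thought/subquery in one batched generate call.

    prompt_embeds: (S, P, D) embedded decoding prompt, replicated for each of S steps
    latent_embeds: (S, T, D) latent thought/subquery tokens per step (padded to T)
    Because decoding is conditionally independent across steps given the latents,
    the S sequences are generated together instead of one after another.
    """
    inputs = torch.cat([prompt_embeds, latent_embeds], dim=1)  # (S, P+T, D)
    return model.generate(inputs_embeds=inputs, max_new_tokens=max_new_tokens)
```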
The efficiency gains from parallel decoding are more pronounced when sequence lengths are balanced, as this reduces padding overhead and avoids unnecessary computation. To characterize the impact of sequence length imbalance, we define the max length ratio as the ratio between the token count of the longest thought or subquery sequence and the total token count within a decoding batch. A higher max length ratio indicates a more imbalanced batch, where a single long sequence accounts for most of the tokens; such imbalance reduces the efficiency gains of parallel decoding due to increased padding overhead and additional LLM forward passes. As shown in Fig. [5](https://arxiv.org/html/2605.06285#A5.F5), the latency reduction percentage decreases as the max length ratio increases, indicating that the effectiveness of parallel decoding is strongly tied to the degree of sequence length balance. Nevertheless, across datasets with varying max length ratios, our method with latent decoding consistently achieves significant latency reductions, demonstrating the effectiveness of the parallel decoding strategy.
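The max length ratio is straightforward to compute per decoding batch; a small sketch with illustrative numbers:

```python
def max_length_ratio(seq_lengths):
    """Ratio of the longest sequence's tokens to all tokens in a decoding batch."""
    return max(seq_lengths) / sum(seq_lengths)

# Balanced batch: four 20-token sequences -> ratio 0.25 (little padding overhead).
# Imbalanced batch: [80, 5, 5, 5] -> ratio ~0.84 (one sequence dominates; padding-heavy).
```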
### E.3 Detailed Stage-wise Latency Comparison
Table [6](https://arxiv.org/html/2605.06285#A5.T6) shows the detailed stage-wise latency breakdown using the Qwen3-Embedding-0.6B retrieval model. Compared to naive single-step RAG, Search-R1 and AutoRefine introduce significant latency overhead: their average latency across all datasets is approximately 15× that of naive RAG. This overhead mainly comes from the thought and subquery generation stages, which together account for about 90% of total latency. In contrast, our method, trained on trajectories generated by Search-R1 and AutoRefine, reduces overall latency by approximately 90% compared to the corresponding baseline.
Table 6: Detailed stage-wise latency breakdowns (ms) using the Qwen3-Embedding-0.6B retrieval model. Search-R1 and AutoRefine incur significantly higher latency in the thought and subquery generation stages compared to our method.
### E.4 Impact of Trajectory Quality on Model Performance
Table 7: Effect of trajectory quality. EM scores (%, higher is better) are reported for LatentRAG◆-3B trained on trajectories generated by Search-R1◆ models of different sizes (3B, 7B, and 14B). Green values show gains obtained by training on trajectories from larger models, relative to the 3B setting.
To investigate the effect of trajectory quality on model performance, we train the same model on trajectories generated by LLMs of different sizes. As shown in Fig. [3](https://arxiv.org/html/2605.06285#S5.F3) in the main paper, larger LLMs consistently achieve better performance, suggesting that they produce higher-quality interaction trajectories. We therefore use trajectories generated by Search-R1 based on Qwen2.5 models of different scales to train our method on the Qwen2.5-3B model. As shown in Table [7](https://arxiv.org/html/2605.06285#A5.T7), LatentRAG models trained on trajectories from the 7B and 14B models yield an average improvement of approximately 15% over the variant trained on trajectories from the 3B model. These improvements indicate that our method benefits significantly from higher-quality training trajectories, highlighting the importance of the model used for trajectory generation.
### E.5 Influence of Latent Token Numbers
Figure 6: Performance under different numbers of latent thought and subquery tokens.
To investigate the impact of the number of latent tokens, we vary the number of latent thought tokens $m$ and subquery tokens $n$ and evaluate exact match scores under different configurations. As shown in Fig. [6](https://arxiv.org/html/2605.06285#A5.F6), performance remains relatively stable across settings: it increases slightly at first, peaks at 4 thought tokens and 16 subquery tokens per step, and then declines as token counts continue to increase. This suggests that while additional latent tokens provide more expressive capacity and can improve performance, excessive tokens may introduce redundancy. We therefore set $m=4$ and $n=16$ in our experiments.
### E.6 Average Token Counts and Number of Forward Passes
Table 8: Average token counts and number of forward passes per question. (in) and (out) denote input and output tokens, respectively. Due to autoregressive token-by-token generation, output tokens incur more forward passes and thus higher latency.

| Methods | Thought | Subquery | Answer | Others | Total | # Forward Passes |
| --- | --- | --- | --- | --- | --- | --- |
| Search-R1◆ | 121.8 (out) | 37.9 (out) | 9.6 (out) | 1325.5 (in) | 1325.5 (in) + 169.4 (out) | 169.4 |
| LatentRAG◆ w/o decoding | 13.8 (in) | 39.0 (in) | 5.8 (out) | 1222.5 (in) | 1275.2 (in) + 5.8 (out) | 11.7 |
| LatentRAG◆ w/ decoding | 13.8 (in) + 117.7 (out) | 39.0 (in) + 29.3 (out) | 5.8 (out) | 1436.2 (in) | 1489.0 (in) + 152.8 (out) | 52.8 |
To analyze token usage efficiency, we report the average token counts per question, distinguishing between input and output tokens: output tokens are generated autoregressively and cannot be fully parallelized, so they typically incur higher latency and cost more in practice. For example, in the OpenAI API pricing ([https://openai.com/api/pricing/](https://openai.com/api/pricing/)), output tokens are typically priced about 6× higher than input tokens. We also report the number of forward passes per question, i.e., how many sequential LLM forward computations are required, which closely relates to overall latency under sufficient hardware resources. This latency cannot easily be reduced by scaling up GPU resources, as it is fundamentally constrained by sequential dependencies in the generation process.
As shown in Table [8](https://arxiv.org/html/2605.06285#A5.T8), Search-R1 generates substantially more output tokens due to explicit thought and subquery generation, leading to a large number of LLM forward passes and explaining the high latency reported in the main paper. In contrast, our method computes latent thought or subquery tokens directly by feeding a sequence of special tokens in parallel, requiring only a single forward pass per thought or subquery. As a result, without latent decoding, our method needs less than 5% of Search-R1's output tokens, significantly reducing the number of forward passes (11.7 vs. 169.4 per question, a reduction of roughly 93% in sequential LLM computations) and thereby achieving the substantially lower latency reported in the main paper. As an option to improve transparency at the cost of additional latency, latent decoding raises the number of output tokens in our method to a level comparable to Search-R1. However, because the thought and subquery sequences across steps are conditionally independent given the latent tokens, they can be decoded in parallel, significantly reducing the number of LLM forward passes. Moreover, the decoding process attends only to the latent tokens rather than the full interaction history, which further reduces computational overhead in practice. Consequently, even with a comparable number of output tokens, our method with latent decoding requires far fewer forward passes and achieves higher efficiency.
### E.7 Case Studies
To qualitatively analyze the behavior of our method, we present several case studies of the reasoning and retrieval processes of LatentRAG\.
#### Success case analysis.
As shown in Success Cases [1](https://arxiv.org/html/2605.06285#A5.T1) & [2](https://arxiv.org/html/2605.06285#A5.T2), our method successfully learns the reasoning and retrieval patterns of the respective baseline models: models trained on trajectories from different baselines generate thoughts and subqueries similar to those of the original models. For instance, the decoded thoughts of LatentRAG◆ capture the refinement structure in the reasoning process of AutoRefine. In Success Case [3](https://arxiv.org/html/2605.06285#A5.T3), although both models arrive at the correct answer after a sequence of reasoning and retrieval steps, they exhibit redundant retrieval in the final stage. This suggests that undesirable behaviors of the teacher model may also be learned by our method, underscoring the importance of trajectory quality discussed in Appendix [E.4](https://arxiv.org/html/2605.06285#A5.SS4).
#### Failure case analysis.
As shown in Failure Cases [1](https://arxiv.org/html/2605.06285#A5.T1a) and [2](https://arxiv.org/html/2605.06285#A5.T2a), even when the reasoning and retrieval processes of our method are correct, the model sometimes fails to produce fully consistent outputs, leading to incorrect answers under exact match evaluation. This may indicate that latent representations facilitate the learning of abstract concepts but are less effective for precise lexical output. Nevertheless, our method maintains competitive performance while reducing overall latency by approximately 90%, highlighting the value of latent reasoning and retrieval in agentic RAG. Future research could investigate how to balance latent representations and precise contextual information for accurate answer generation.
#### LogitLens analysis.
To investigate what information is encoded in each latent thought or subquery token, we use LogitLens [[40](https://arxiv.org/html/2605.06285#bib.bib97)] to analyze the generated latent tokens. LogitLens projects hidden states into the vocabulary space using the unembedding matrix of the LLM, enabling inspection of the token-level information encoded in the hidden states. Figures [7](https://arxiv.org/html/2605.06285#A5.F7) & [8](https://arxiv.org/html/2605.06285#A5.F8) present the top-5 predicted language tokens by logits for each latent token. Although we do not explicitly constrain latent tokens to align with the LLM vocabulary space, the model still places these latent representations near semantically related vocabulary regions. In particular, the vocabulary tokens decoded from the first step's thought and subquery tokens are closely related to the first subquery, while those from later steps gradually shift toward vocabulary regions associated with the second subquery and eventually the final answer. Additionally, unlike natural language tokenization, which represents text through a fixed subword decomposition that may split semantic units across multiple tokens, a single latent token can encode a whole semantic concept, such as "Christianity Today" or "William Goldman". These findings suggest that performing reasoning and retrieval in the latent space may offer more flexibility and expressivity than operating in natural language space.
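A minimal sketch of the LogitLens projection described above, assuming a Qwen2.5-style Hugging Face model where `model.model.norm` is the final layer norm and `model.lm_head` the unembedding matrix (module names vary across architectures):

```python
import torch

@torch.no_grad()
def logit_lens_top_tokens(hidden_state, model, tokenizer, top_k=5):
    """Rank the vocabulary tokens closest to a latent token's hidden state.

    hidden_state: (D,) hidden state at a latent thought/subquery position.
    """
    h = model.model.norm(hidden_state)   # apply the final layer norm first
    logits = model.lm_head(h)            # project into vocabulary space
    top = torch.topk(logits, top_k)
    return [tokenizer.decode([i]) for i in top.indices.tolist()]
```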
Success Case 1: Search-R1◆ vs. LatentRAG◆. Success Case 2: AutoRefine△ vs. LatentRAG△. Success Case 3: Search-R1◆ vs. LatentRAG◆. Failure Case 1: Search-R1◆ vs. LatentRAG◆. Failure Case 2: Search-R1◆ vs. LatentRAG◆.
Figure 7: LogitLens Case Study 1 on LatentRAG◆. Latent thought and subquery tokens in the first step align with tokens related to the first subquery, "The author of The Thing of It Is…", while those in the second step shift toward tokens related to the second subquery, "William Goldman nationality". A latent token can encode a whole semantic concept, such as "The Thing of It Is…" or "William Goldman".
Figure 8: LogitLens Case Study 2 on LatentRAG◆. Latent thought and subquery tokens in the first step align with tokens related to the first subquery, "Eugene Habecker chairman of which magazine", while those in the second step shift toward tokens related to the second subquery, "Christianity Today magazine type". A latent token can encode a whole semantic concept, such as "magazine type" or "Christianity Today".