@h100envy: This paper completely changed how I think about the retrieval loop in RAG: Segment -> Decide if retrieval is needed -> …
Summary
This paper introduces a retrieval loop for RAG built on reflection tokens and on-demand retrieval: the model decides when to fetch documents and when to rely on its internal knowledge, then uses self-critique and tree-decoding to improve accuracy.
Cached at: 06/29/26, 10:32 PM
This paper completely changed how I think about the retrieval loop in RAG:
Segment -> Decide if retrieval is needed -> Fetch or skip -> Generate -> Critique your own output -> Next segment
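To make the loop concrete, here is a rough Python sketch. It is an illustration under stated assumptions, not the paper's code: `needs_retrieval`, `retrieve`, `generate`, and `critique` are hypothetical callables standing in for the model and the retriever, and the loop iterates over pre-split segments for simplicity.

```python
from typing import Callable, Dict, List


def retrieval_loop(
    segments: List[str],
    needs_retrieval: Callable[[str], bool],      # hypothetical: the model's retrieve decision
    retrieve: Callable[[str], List[str]],        # hypothetical: any passage retriever
    generate: Callable[[str, List[str]], str],   # hypothetical: the generator
    critique: Callable[[str, List[str]], float], # hypothetical: self-critique score
) -> List[Dict[str, object]]:
    """Walk the segments once and record what happened at each step."""
    trace: List[Dict[str, object]] = []
    for segment in segments:
        # Decide if retrieval is needed, then fetch or skip.
        fetched = needs_retrieval(segment)
        passages = retrieve(segment) if fetched else []
        # Generate, conditioned on the passages when there are any.
        output = generate(segment, passages)
        # Critique the output before moving on to the next segment.
        score = critique(output, passages)
        trace.append(
            {"segment": segment, "retrieved": fetched, "output": output, "score": score}
        )
    return trace
```

Passing the four pieces in as callables keeps the control flow visible; in a real system the decision and the critique would come from the same model that generates the text.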
Here is the 5-step blueprint:
Reflection tokens: the model learns retrieve/critique special tokens as part of its own vocabulary, next to normal words.
On-demand retrieval: at each segment the model decodes a token and decides itself whether to fetch docs or answer from parameters.
Critique: once passages are pulled, critique tokens rate their relevance and whether they actually support the model's own output.
Tree-decoding: beam search over critique tokens picks the continuation that maximizes utility out of K candidates.
Critic + Generator: a critic model inserts reflection tokens into the training corpus offline, and the generator then trains with plain next-token prediction, with no expensive online RLHF.
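The critique and tree-decoding steps above can be sketched as a scoring function over K candidates. This is a simplified illustration, not the paper's implementation: the critique names (`relevance`, `support`, `utility`), the weights, and the `Candidate` type are all assumptions, and a real system would read these probabilities from the model's critique-token logits at every beam step.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Candidate:
    text: str
    log_prob: float             # log-likelihood of the segment under the generator
    critique: Dict[str, float]  # probability of the desirable value per critique type


# Hypothetical weights; tuning them trades fluency against grounding.
DEFAULT_WEIGHTS: Dict[str, float] = {"relevance": 1.0, "support": 1.0, "utility": 0.5}


def candidate_score(cand: Candidate, weights: Optional[Dict[str, float]] = None) -> float:
    """Generator log-probability plus a weighted sum of critique probabilities."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    bonus = sum(w * cand.critique.get(name, 0.0) for name, w in weights.items())
    return cand.log_prob + bonus


def select_beam(
    candidates: List[Candidate],
    beam_width: int = 1,
    weights: Optional[Dict[str, float]] = None,
) -> List[Candidate]:
    """Keep the beam_width highest-scoring candidates out of the K provided."""
    ranked = sorted(candidates, key=lambda c: candidate_score(c, weights), reverse=True)
    return ranked[:beam_width]
```

With these weights, a slightly less likely continuation that is well supported by its passage can outrank a more fluent one that is not, which is the point of scoring with critique tokens rather than likelihood alone.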
Key insight: retrieval is not always good; the model should decide for itself when to pull documents and when to stay quiet.
Skipping retrieval drops PopQA accuracy by 40% relative, yet costs only 2% on fact verification (PubHealth).
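One plausible way to let the model decide for itself is to threshold the probability it puts on a retrieve token. The sketch below assumes two hypothetical token names, `[Retrieve]` and `[No Retrieve]`, and a default threshold of 0.5; neither is taken from the paper.

```python
import math
from typing import Dict


def should_retrieve(
    logits: Dict[str, float],
    threshold: float = 0.5,          # assumed default, tune per task
    yes_token: str = "[Retrieve]",   # hypothetical token name
    no_token: str = "[No Retrieve]", # hypothetical token name
) -> bool:
    """Return True when the normalized probability of the retrieve token exceeds the threshold."""
    yes, no = logits[yes_token], logits[no_token]
    # Softmax restricted to the two reflection tokens, shifted for numerical stability.
    peak = max(yes, no)
    p_yes = math.exp(yes - peak) / (math.exp(yes - peak) + math.exp(no - peak))
    return p_yes > threshold
```

A lower threshold makes the model retrieve more often, which would suit knowledge-heavy tasks where skipping retrieval hurts; a higher one keeps it quiet on tasks where retrieval adds little.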
Read this, then check the article below.
Similar Articles
@h100envy: This paper completely changed how I think about trusting retrieval in RAG: Fetch documents -> Score their quality -> Ge…
This paper presents a 5-step blueprint for improving trust in RAG by using a lightweight retrieval evaluator that scores document quality and triggers actions (correct, incorrect, ambiguous) to handle retrieval failures, with plug-and-play integration.
@omarsar0: Nice paper combining the strength of Skills and RAG. Most RAG systems retrieve on every query, whether the model needs …
Research introduces Skill-RAG, a novel approach that combines Skills with Retrieval-Augmented Generation to address inefficiencies in traditional RAG systems that retrieve on every query regardless of whether the model actually needs the information.
@_rohit_tiwari_: I wasted months trying to understand RAG. So I created this clear step-by-step guide. https://drive.google.com/file/d/1…
A clear step-by-step guide to understanding Retrieval-Augmented Generation (RAG), covering explanations, visuals, and various architectures like Naïve RAG, Advanced RAG, Graph RAG, Multimodal RAG, and Agentic RAG.
@Julian_a42f9a: Late-interaction retrieval models are widely used for their strong performance, but their representations can be utiliz…
A new paper shows that late-interaction retrieval model representations can effectively replace raw document text in RAG tasks, extending their utility beyond retrieval.
LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
LatentRAG is a novel framework that shifts reasoning and retrieval for agentic RAG into continuous latent space, reducing inference latency by approximately 90% while maintaining performance comparable to explicit methods.