@h100envy: This paper completely changed how I think about the retrieval loop in RAG: Segment -> Decide if retrieval is needed -> …

X AI KOLs Timeline 06/29/26, 06:22 PM Papers

rag retrieval retrieval-augmented-generation reflection-tokens on-demand-retrieval critique tree-decoding critic-generator

Summary

This paper introduces a novel retrieval loop for RAG that uses reflection tokens and on-demand retrieval, allowing the model to decide when to fetch documents or rely on internal knowledge, with critique and tree-decoding to improve accuracy.

Original Article

Cached at: 06/29/26, 10:32 PM

This paper completely changed how I think about the retrieval loop in RAG:

Segment -> Decide if retrieval is needed -> Fetch or skip -> Generate -> Critique your own output -> Next segment
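The loop above can be sketched as plain control flow. This is a minimal illustration, not the paper's implementation: the four callables (needs_retrieval, retrieve, generate, critique) and the toy stand-ins below are hypothetical placeholders for the model and the retriever.

```python
from typing import Callable, List, Optional

def segment_loop(
    prompt: str,
    needs_retrieval: Callable[[str], bool],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str, Optional[str]], float],
    max_segments: int = 3,
) -> List[str]:
    """Generate up to max_segments segments, retrieving only when asked to."""
    segments: List[str] = []
    for _ in range(max_segments):
        context = prompt + " " + " ".join(segments)
        # Decide: fetch passages, or skip and answer from parameters (passage = None).
        passages: List[Optional[str]] = (
            list(retrieve(context)) if needs_retrieval(context) else [None]
        )
        if not passages:
            passages = [None]
        # Generate one candidate per passage, then keep the best-critiqued one.
        candidates = [(generate(context, p), p) for p in passages]
        best_text, _ = max(candidates, key=lambda c: critique(c[0], c[1]))
        segments.append(best_text)
    return segments

# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_needs_retrieval(ctx: str) -> bool:
    return "capital" in ctx

def toy_retrieve(ctx: str) -> List[str]:
    return ["Bananas are yellow.", "Paris is the capital of France."]

def toy_generate(ctx: str, passage: Optional[str]) -> str:
    return passage if passage is not None else "answer from parameters"

def toy_critique(text: str, passage: Optional[str]) -> float:
    return 1.0 if passage is not None and "Paris" in passage else 0.0

with_docs = segment_loop("What is the capital of France?", toy_needs_retrieval,
                         toy_retrieve, toy_generate, toy_critique, max_segments=1)
without_docs = segment_loop("Say hello.", toy_needs_retrieval,
                            toy_retrieve, toy_generate, toy_critique, max_segments=1)
```

In the real system the decision and the critique come from tokens the model itself decodes; here they are ordinary functions so that the control flow is easy to follow.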

Here is the 5-step blueprint:

Reflection tokens: the model learns retrieve/critique special tokens as part of its own vocabulary, next to normal words.

On-demand retrieval: at each segment the model decodes a token and decides itself whether to fetch docs or answer from parameters.

Critique: once passages are pulled, a critique token rates each passage's relevance and whether it actually supports the model's own output.

Tree-decoding: beam search over critique tokens picks the continuation that maximizes utility out of K candidates.

Critic + Generator: a critic model inserts reflection tokens into the training corpus offline, and the generator then trains with plain next-token prediction, with no expensive online RLHF.
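One way to picture the tree-decoding step is as a ranking of K candidate segments by a combined score. The sketch below assumes the score is the segment's log-probability plus a weighted sum of critique-token probabilities; the critique names ("relevant", "supported"), the weights, and the numbers are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Candidate:
    text: str
    logprob: float              # log-probability of the generated segment
    critique: Dict[str, float]  # P(desirable critique token), per critique type

def score(c: Candidate, weights: Dict[str, float]) -> float:
    """Segment log-probability plus a weighted sum of critique probabilities."""
    return c.logprob + sum(w * c.critique.get(name, 0.0) for name, w in weights.items())

def top_k(cands: List[Candidate], weights: Dict[str, float], k: int) -> List[Candidate]:
    """Keep the k highest-scoring candidates (one beam-search step)."""
    return sorted(cands, key=lambda c: score(c, weights), reverse=True)[:k]

# Illustrative values only.
weights = {"relevant": 1.0, "supported": 1.0}
cands = [
    Candidate("A", -1.0, {"relevant": 0.9, "supported": 0.8}),
    Candidate("B", -0.5, {"relevant": 0.2, "supported": 0.1}),
    Candidate("C", -3.0, {"relevant": 0.9, "supported": 0.9}),
]
beam = top_k(cands, weights, k=2)
```

Candidate B is the most fluent by log-probability alone, but A wins once the critique terms are added, which is the point of scoring with critique tokens rather than likelihood only.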

Key insight: retrieval is not always good; the model should decide for itself when to pull documents and when to stay quiet.
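A minimal sketch of how such a decision could be made, assuming the model exposes a logit for a "retrieve" token and a logit for a "skip" token at the decision point. The function name, the two-token setup, and the threshold are assumptions for illustration.

```python
import math

def should_retrieve(logit_retrieve: float, logit_skip: float, threshold: float = 0.5) -> bool:
    """Retrieve when the normalized probability of the retrieve token exceeds threshold."""
    # Two-way softmax restricted to the retrieve/skip tokens (shifted for stability).
    m = max(logit_retrieve, logit_skip)
    e_r = math.exp(logit_retrieve - m)
    e_s = math.exp(logit_skip - m)
    return e_r / (e_r + e_s) > threshold
```

Lowering the threshold makes retrieval more frequent, which is a knob one might tune per task: low for long-tail factual QA, higher where the model's own knowledge is usually enough.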

Skipping retrieval drops PopQA accuracy by 40% relative, yet costs only 2% on fact verification (PubHealth).
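Note that "40% relative" is a relative drop, not a drop of 40 accuracy points. The helper below shows the difference; the 50.0 and 30.0 figures are made-up illustrative numbers, not results from the paper.

```python
def relative_drop(before: float, after: float) -> float:
    """Fraction of the original score that was lost."""
    return (before - after) / before

def absolute_drop(before: float, after: float) -> float:
    """Raw difference in score points."""
    return before - after

# Illustrative only: an accuracy falling from 50.0 to 30.0
# loses 20 points, which is 40% of the starting value.
rel = relative_drop(50.0, 30.0)
abs_points = absolute_drop(50.0, 30.0)
```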

Read this, then check the article below.

Similar Articles

@h100envy: This paper completely changed how I think about trusting retrieval in RAG: Fetch documents -> Score their quality -> Ge…

@omarsar0: Nice paper combining the strength of Skills and RAG. Most RAG systems retrieve on every query, whether the model needs …

@_rohit_tiwari_: I wasted months trying to understand RAG. So I created this clear step-by-step guide. https://drive.google.com/file/d/1…

@Julian_a42f9a: Late-interaction retrieval models are widely used for their strong performance, but their representations can be utiliz…

LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG
