@h100envy: This paper completely changed how I think about the retrieval loop in RAG: Segment -> Decide if retrieval is needed -> …


Summary

This paper introduces a novel retrieval loop for RAG that uses reflection tokens and on-demand retrieval, allowing the model to decide when to fetch documents or rely on internal knowledge, with critique and tree-decoding to improve accuracy.

Original Article

Cached at: 06/29/26, 10:32 PM

This paper completely changed how I think about the retrieval loop in RAG:

Segment -> Decide if retrieval is needed -> Fetch or skip -> Generate -> Critique your own output -> Next segment
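
A minimal sketch of that loop, just to make the control flow concrete. Everything here is an assumption for illustration: the function name `segment_loop`, the four callables, and the toy stand-ins are hypothetical and not taken from the paper, where these decisions are made by special tokens the model itself decodes.

```python
from typing import Callable, List, Optional


def segment_loop(
    question: str,
    needs_retrieval: Callable[[str, str], bool],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, str, Optional[str]], str],
    critique: Callable[[str, Optional[str]], float],
    max_segments: int = 3,
) -> str:
    """Hypothetical skeleton: decide -> fetch or skip -> generate -> critique."""
    answer = ""
    for _ in range(max_segments):
        # Decide whether this segment needs external documents at all.
        passages = retrieve(question + " " + answer) if needs_retrieval(question, answer) else []
        # One candidate per retrieved passage, or a single candidate from parameters only.
        candidates = [(generate(question, answer, p), p) for p in passages]
        if not candidates:
            candidates = [(generate(question, answer, None), None)]
        # Critique every candidate and keep the best-rated one.
        best_segment, _ = max(candidates, key=lambda c: critique(c[0], c[1]))
        if not best_segment:
            break
        answer += best_segment
    return answer


# Toy stand-ins (assumptions): retrieve only for the first segment.
def toy_needs_retrieval(question: str, so_far: str) -> bool:
    return so_far == ""

def toy_retrieve(query: str) -> List[str]:
    return ["doc A", "doc B"]

def toy_generate(question: str, so_far: str, passage: Optional[str]) -> str:
    return f"[{passage or 'no-doc'}] "

def toy_critique(segment: str, passage: Optional[str]) -> float:
    return 1.0 if passage == "doc B" else 0.5


result = segment_loop("q", toy_needs_retrieval, toy_retrieve,
                      toy_generate, toy_critique, max_segments=2)
print(result)
```

With these toy functions the first segment is grounded in the better-rated passage and the second is generated without retrieval.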

Here is the 5-step blueprint:

Reflection tokens: the model learns retrieve/critique special tokens as part of its own vocabulary, next to normal words.

On-demand retrieval: at each segment the model decodes a token and decides for itself whether to fetch docs or answer from its parameters.

Critique: once passages are pulled, a critique token rates their relevance and whether they actually support the model's own output.

Tree-decoding: beam search over critique tokens picks the continuation that maximizes utility out of K candidates.

Critic + Generator: a critic model inserts the tokens into the corpus offline, and the generator trains with a plain next-token objective, with no expensive online RLHF.
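
To make the critique and tree-decoding steps above concrete, here is one plausible way to rank K candidate segments: add a weighted sum of critique scores to the generator's log-probability and keep the top `beam_width`. The `Candidate` class, the critique names `relevant` and `supported`, the weights, and the exact scoring formula are all assumptions for this sketch, not the paper's definition.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    text: str
    log_prob: float             # generator log-probability of the segment
    critique: Dict[str, float]  # critique name -> score in [0, 1] (hypothetical names)


def utility(cand: Candidate, weights: Dict[str, float]) -> float:
    """Log-probability plus weighted critique scores (assumed scoring rule)."""
    bonus = sum(weights.get(name, 0.0) * score for name, score in cand.critique.items())
    return cand.log_prob + bonus


def pick_best(cands: List[Candidate], weights: Dict[str, float],
              beam_width: int = 1) -> List[Candidate]:
    """Keep the `beam_width` candidates with the highest utility."""
    return sorted(cands, key=lambda c: utility(c, weights), reverse=True)[:beam_width]


candidates = [
    Candidate("a", log_prob=-1.0, critique={"relevant": 0.9, "supported": 0.2}),
    Candidate("b", log_prob=-1.2, critique={"relevant": 0.8, "supported": 0.9}),
]
weights = {"relevant": 1.0, "supported": 1.0}
best = pick_best(candidates, weights)
print(best[0].text)
```

Candidate "a" is slightly more likely under the generator, but "b" is much better supported by its passage, so it wins once the critique scores are counted. Changing the weights at inference time shifts that trade-off without retraining.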

Key insight: retrieval is not always good; the model should decide for itself when to pull documents and when to stay quiet.
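
One way to turn "the model decides" into a tunable knob is to compare the probability of a retrieve token against a threshold. The sketch below assumes the model exposes two logits, one for "retrieve" and one for "do not retrieve"; the function name, the two-logit setup, and the default threshold are hypothetical.

```python
import math


def should_retrieve(logit_yes: float, logit_no: float, threshold: float = 0.5) -> bool:
    """Retrieve when the normalized probability of the 'yes' token exceeds the threshold."""
    # Two-way softmax, written as a sigmoid of the logit difference.
    p_yes = 1.0 / (1.0 + math.exp(logit_no - logit_yes))
    return p_yes > threshold


print(should_retrieve(2.0, 0.0))                   # model leans strongly toward retrieving
print(should_retrieve(-1.0, 1.0))                  # model leans toward answering from parameters
print(should_retrieve(2.0, 0.0, threshold=0.95))   # stricter threshold suppresses retrieval
```

A higher threshold makes retrieval rarer, and a lower one makes it more frequent, so the threshold can be tuned per task.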

Skipping retrieval drops PopQA accuracy by 40% relative, yet costs only 2% on fact verification (PubHealth).

Read this, then check the article below.
