Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing
Summary
Skill-RAG is a failure-aware RAG framework that uses hidden-state probing and skill routing to diagnose and correct query-evidence misalignment in retrieval-augmented generation. The approach detects retrieval failures and selectively applies targeted skills (query rewriting, question decomposition, evidence focusing) to improve accuracy on hard cases and out-of-distribution datasets.
Cached at: 04/20/26, 08:29 AM
# Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing

Source: https://arxiv.org/html/2604.15771

- Raymond Li, University of British Columbia, Vancouver, British Columbia, Canada
- Xi Zhu, Rutgers University, New Brunswick, New Jersey, USA
- Zhaoqian Xue, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
- Jiaojiao Han, New Jersey Institute of Technology, Newark, New Jersey, USA
- Jingcheng Niu, Technical University of Darmstadt, Darmstadt, Germany
- Fan Yang, Wake Forest University, Winston-Salem, North Carolina, USA

## Abstract

Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose, leaving the structural causes of query–evidence misalignment unaddressed. We observe that a significant portion of persistent retrieval failures stem not from the absence of relevant evidence but from an alignment gap between the query and the evidence space. We propose **Skill-RAG**, a failure-aware RAG framework that couples a lightweight hidden-state prober with a prompt-based skill router. The prober gates retrieval at two pipeline stages; upon detecting a failure state, the skill router diagnoses the underlying cause and selects among four retrieval skills (query rewriting, question decomposition, evidence focusing, and an exit skill for truly irreducible cases) to correct misalignment before the next generation attempt. Experiments across multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases persisting after multi-turn retrieval, with particularly strong gains on out-of-distribution datasets.
Representation-space analyses further reveal that the proposed skills occupy structured, separable regions of the failure state space, supporting the view that query–evidence misalignment is a typed rather than monolithic phenomenon.

Figure: Given an input query, a hidden-state prober gates retrieval decisions at two stages; upon detecting a failure state, a prompt-based skill router selects among four retrieval skills to correct query–evidence misalignment before the next generation attempt.

## 1. Introduction

Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models (LLMs) in external knowledge, substantially improving factual reliability on knowledge-intensive tasks (Lewis et al., 2020; Gao et al., 2023; Jin et al., 2025a). Building on this, adaptive and iterative retrieval mechanisms have been proposed to dynamically determine when and how often to retrieve (Jiang et al., 2023; Asai et al., 2024; Jeong et al., 2024). Yet existing methods largely treat retrieval control as a coarse-grained decision, focused on whether to retrieve and how many times, while leaving unaddressed the structural reasons behind retrieval failure and the corrective strategies they demand. As a result, a non-trivial fraction of hard cases exhibit persistent failures that repeated retrieval alone cannot resolve (Trivedi et al., 2023a; Tang and Yang, 2024).

A closer examination reveals that a significant portion of these failures stem not from the absence of relevant evidence but from a structural alignment gap: the query is poorly formulated relative to the evidence space, causing successive retrievals to surface documents that are topically adjacent yet inferentially insufficient.
Such failures exhibit structured patterns in the model's internal representations (a query too broad calls for evidence focusing, entangled premises call for decomposition, divergent surface forms call for rewriting), as evidenced by the geometric structure of failure representations we analyze in Section 4.3. We operationalize this insight by introducing **failure states**: latent representations derived from the model's hidden layers that signal when retrieval has stalled, enabling a skill router to select targeted retrieval actions in place of generic re-retrieval.

We present **Skill-RAG**, a failure-aware RAG framework that employs a lightweight hidden-state prober to detect when retrieval has stalled and gate entry into skill routing. Upon detecting a failure state, a prompt-based skill router diagnoses the underlying cause and selects among four **retrieval skills**: query rewriting (Ma et al., 2023), question decomposition (Press et al., 2023), evidence focusing (Yan et al., 2024), and an exit skill that identifies truly irreducible cases and terminates retrieval gracefully. Unlike prior work that optimizes retrieval triggering or iteration depth, Skill-RAG reframes post-retrieval recovery as a **conditional skill-selection problem**, providing fine-grained, failure-conditioned control over how LLMs acquire external knowledge.

We make three contributions:

1. We propose the first framework that integrates hidden-state prober gating with prompt-based skill routing for post-retrieval failure recovery, yielding a unified probing-and-routing pipeline that requires no additional LLM calls for either decision.
2. We introduce a transferable skill vocabulary of four retrieval skills grounded in observed failure patterns, yielding consistent improvements across multiple models and datasets and establishing a reusable taxonomy for query–evidence alignment correction.
3. Experiments across multiple benchmarks show that Skill-RAG achieves competitive or state-of-the-art performance while substantially outperforming prober-only baselines (Baek et al., 2025) on out-of-distribution datasets, highlighting the benefit of failure-conditioned skill routing beyond simple gating.

## 2. Related Work

**Adaptive and Iterative Retrieval.** Early RAG systems retrieve once before generation (Lewis et al., 2020), while subsequent work has explored adaptive and iterative mechanisms to improve retrieval coverage and efficiency. IRCoT (Trivedi et al., 2023b) interleaves chain-of-thought reasoning with retrieval, using each reasoning step to guide the next query; Iter-RetGen (Shao et al., 2023) feeds the model's previous output as context for the next retrieval round. FLARE (Jiang et al., 2023) triggers retrieval when token-level generation confidence falls below a threshold; DRAGIN (Su et al., 2024) uses attention signals to determine retrieval timing; Self-RAG (Asai et al., 2024) trains the model to emit special tokens controlling retrieval and self-critique; and Adaptive-RAG (Jeong et al., 2024) classifies query complexity to route among retrieval strategies of varying depth. Probing-RAG (Baek et al., 2025) leverages hidden-state representations to gate retrieval decisions. Despite these advances, all existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose, leaving the structural causes of query–evidence misalignment unaddressed.

**Query Reformulation and Corrective Retrieval.** A parallel line of work improves retrieval quality through query-side and evidence-side interventions.
Query rewriting methods reformulate the original query to better match corpus indexing conventions (Ma et al., 2023); question decomposition approaches break complex multi-hop queries into sequential sub-queries (Press et al., 2023); and CRAG (Yan et al., 2024) evaluates retrieved document quality and triggers corrective actions, including web search and evidence filtering, when retrieval confidence is low. While CRAG operates at the document level, assessing whether retrieved passages are relevant, Skill-RAG operates at the failure-state level, diagnosing why the model failed to generate a correct answer and routing to targeted alignment corrections accordingly. Skill-RAG unifies query rewriting, decomposition, and evidence focusing into a single failure-conditioned routing framework, selecting among skills based on the model's diagnosed failure state rather than applying any single strategy unconditionally.

## 3. Method

Figure 1 illustrates the Skill-RAG pipeline. Given an input query, a hidden-state prober first assesses whether the model's parametric knowledge suffices to answer without retrieval; if so, the answer is returned directly. Otherwise, a standard retrieval step is performed and the prober is reapplied to the augmented generation. If the retrieved evidence proves sufficient, the answer is finalized; if not, a prompt-based skill router receives the failed reasoning, answer, and retrieved evidence, diagnoses the underlying cause of misalignment, and selects one of four retrieval skills to reformulate the query or refocus the evidence. The revised query triggers a new retrieval round, and the prober gates the next iteration. This process repeats until the prober judges the model's state sufficient or a maximum number of retrieval rounds is reached.

### 3.1. Prober Training

To train the prober, we apply two retrieval strategies, no retrieval and single-step retrieval, to the training splits of our in-domain datasets (HotpotQA (Yang et al., 2018), NQ (Kwiatkowski et al., 2019), and TriviaQA (Joshi et al., 2017)), and prompt the model to produce a chain-of-thought reasoning trace followed by a final answer. For each example, we extract the hidden states corresponding to reasoning and answer tokens from the last two-thirds of the model's layers, and assign a binary label by comparing the generated answer against the gold answer. This yields a labeled dataset of hidden-state representations paired with correctness signals. The prober is implemented as a feed-forward network with a single hidden layer and a binary classification head. To leverage information across depths, we train one prober per layer and aggregate their outputs by averaging the predicted probabilities at inference time, producing a single gating signal that reflects answer readiness across representational levels.

### 3.2. Skill Router

When the prober detects a failure state, a prompt-based skill router is invoked. The router receives the original question, the model's failed reasoning and answer, and the currently retrieved evidence, diagnoses the cause of misalignment, and selects one of four retrieval skills. **Query rewriting** targets cases where the query's surface form diverges from corpus indexing conventions, producing a reformulated query better aligned with retrievable evidence (Ma et al., 2023). **Question decomposition** addresses multi-hop queries with entangled premises, generating a sequence of sub-queries that isolate each reasoning step before issuing a final retrieval query (Press et al., 2023). **Evidence focusing** handles semantically broad queries by extracting missing evidence slots from the current context and issuing a grounded query targeting the specific information gap (Yan et al., 2024).
**Exit** identifies cases where misalignment is irreducible, due to missing knowledge or model capacity limits, and terminates retrieval to avoid unnecessary inference overhead.

### 3.3. Iterative Skill Retrieval and Termination

Following skill execution, the reformulated query is issued to the retriever, and the model generates a new answer conditioned on the updated evidence. The prober then gates the next iteration. This loop continues until one of three termination conditions is met: the skill router selects exit, the prober judges the model's state sufficient (Jin et al., 2025b), or a predefined maximum number of retrieval rounds is reached.

## 4. Experiments

### 4.1. Setup

We evaluate on five open-domain QA benchmarks spanning single-hop and multi-hop reasoning. Three datasets, NQ (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), and HotpotQA (Yang et al., 2018), serve as in-domain benchmarks, from which we sample 3,000 examples for prober training and 500 for development. Two multi-hop datasets, MuSiQue (Trivedi et al., 2022) and 2WikiMultiHopQA (Ho et al., 2020), are held out as out-of-distribution (OOD) test sets, each evaluated on 500 examples. All methods use BM25 (Robertson and Zaragoza, 2009) as the retriever.
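
As a rough illustration of the control flow described in Sections 3.1 to 3.3, the sketch below averages per-layer prober probabilities into one gate and loops over skill-routed retrieval rounds until one of the three termination conditions holds. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `Attempt`, `is_sufficient`, `skill_rag`, the skill labels, the 0.5 threshold, the default of three rounds, and the choice to replace rather than accumulate evidence are all hypothetical. The generator, retriever, router, and skill executor are passed in as plain callables.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Callable, List, Sequence, Tuple

# Hypothetical labels for the four retrieval skills named in Section 3.2.
SKILLS = ("rewrite", "decompose", "focus", "exit")


@dataclass
class Attempt:
    """One generation attempt plus the per-layer prober outputs for it."""
    reasoning: str
    answer: str
    layer_probs: Sequence[float]  # P(answer is ready), one value per probed layer


def is_sufficient(layer_probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Average per-layer probabilities into a single gating decision.

    The 0.5 threshold is an assumption; the text does not state one.
    """
    return fmean(layer_probs) >= threshold


def skill_rag(
    question: str,
    generate: Callable[[str, List[str]], Attempt],
    retrieve: Callable[[str], List[str]],
    route: Callable[[str, Attempt, List[str]], str],
    apply_skill: Callable[[str, str, Attempt, List[str]], str],
    max_rounds: int = 3,  # assumed budget; the text only says "predefined"
) -> Tuple[str, str]:
    """Return (answer, stop_reason) for one question."""
    # Gate 1: can the model answer from parametric knowledge alone?
    attempt = generate(question, [])
    if is_sufficient(attempt.layer_probs):
        return attempt.answer, "no_retrieval"

    # Gate 2: one standard retrieval step, then probe the augmented generation.
    evidence = retrieve(question)
    attempt = generate(question, evidence)
    rounds = 1

    while not is_sufficient(attempt.layer_probs):
        if rounds >= max_rounds:
            return attempt.answer, "max_rounds"
        skill = route(question, attempt, evidence)
        if skill not in SKILLS:
            raise ValueError(f"unknown skill: {skill!r}")
        if skill == "exit":
            return attempt.answer, "exit"
        # The chosen skill yields a revised query that drives a new retrieval round.
        # Replacing (rather than accumulating) evidence is an assumption.
        query = apply_skill(skill, question, attempt, evidence)
        evidence = retrieve(query)
        attempt = generate(question, evidence)
        rounds += 1

    return attempt.answer, "sufficient"
```

Keeping the four components as injected callables makes the three exit paths (`"exit"`, `"sufficient"`, `"max_rounds"`) easy to exercise with stubs before wiring in a real model, prober, and BM25 retriever.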
We compare Skill-RAG against six baselines:

- **No Retrieval**, which generates answers from parametric knowledge alone
- **Single-step RAG**, which performs one round of retrieval before generation
- **FLARE** (Jiang et al., 2023), which triggers retrieval based on token-level generation uncertainty
- **DRAGIN** (Su et al., 2024), which determines retrieval timing via attention-based relevance signals
- **Adaptive-RAG** (Jeong et al., 2024), which routes queries to retrieval strategies of varying complexity via a trained classifier
- **Probing-RAG** (Baek et al., 2025), which gates retrieval decisions using hidden-state probing

We conduct experiments using Gemma2-9B as the backbone model; results across additional model families will be reported in future work. All methods use 4-shot prompting and are evaluated on Exact Match (EM) and Accuracy (ACC).

### 4.2. Main Results

Table 1 reports results on Gemma2-9B across five benchmarks. Skill-RAG achieves state-of-the-art or competitive performance on in-domain datasets, matching or surpassing Probing-RAG on both EM and ACC across HotpotQA, NQ, and TriviaQA. The most pronounced gains appear on OOD be