Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation
Summary
This paper introduces CHARM, a framework for detecting and mitigating cascading hallucinations in multi-step agentic RAG pipelines, where early-stage errors propagate and amplify across reasoning steps. CHARM achieves an 89.4% cascade detection rate and 82.1% error propagation reduction across multiple benchmarks with low latency overhead.
# Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation
Source: [https://arxiv.org/html/2606.04435](https://arxiv.org/html/2606.04435)
###### Abstract
Multi\-step agentic retrieval\-augmented generation \(RAG\) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs\. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four\-type taxonomy of cascade patterns, and introduce CHARM \(Cascading Hallucination Aware Resolution and Mitigation\), an architectural framework for detecting and interrupting error propagation in multi\-step reasoning pipelines\. CHARM comprises four components \(stage\-level fact verification, cross\-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering\) that operate alongside standard agentic RAG pipelines without requiring architectural replacement\. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations\. CHARM achieves an 89\.4% cascade detection rate with a 5\.3% false positive rate and an average latency overhead of 215 ms $\pm$ 18 ms per stage, and reduces error propagation by 82\.1%, compared to 18\.5% for output\-level detectors\. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage\. CHARM integrates with human\-in\-the\-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment\.
## I Introduction
As agentic AI systems increasingly automate complex enterprise workflows, a new class of failure has emerged that existing safety mechanisms fail to detect: cascading hallucination\. In multi\-step reasoning systems, small retrieval or inferential errors introduced in early pipeline stages propagate silently through the trajectory, compounding at each step to produce confident but factually incorrect final outputs\. Because each subsequent reasoning step remains logically coherent relative to its immediate—albeit corrupted—context, these failures appear authoritative to both downstream automated systems and human reviewers, presenting a severe risk for enterprise and regulated deployments\[[1](https://arxiv.org/html/2606.04435#bib.bib1),[2](https://arxiv.org/html/2606.04435#bib.bib2)\]\. This work builds on a sustained research program in secure and reliable AI systems\[[3](https://arxiv.org/html/2606.04435#bib.bib3),[1](https://arxiv.org/html/2606.04435#bib.bib1)\]\.
Despite significant advancements in hallucination detection, existing evaluation architectures are ill\-equipped to handle this phenomenon\. Current state\-of\-the\-art detectors \[[4](https://arxiv.org/html/2606.04435#bib.bib4),[5](https://arxiv.org/html/2606.04435#bib.bib5),[6](https://arxiv.org/html/2606.04435#bib.bib6)\] primarily evaluate individual Large Language Model \(LLM\) outputs in isolation, treating generation as a single\-step point\-in\-time process\. They measure the factual grounding of a terminal response but ignore the cross\-stage semantic trajectory that produced it\. Consequently, when an agent reviews its own cascaded logic \[[7](https://arxiv.org/html/2606.04435#bib.bib7)\], it suffers from severe confirmation bias, verifying the final output because it aligns with the corrupted intermediate context\.
To bridge this critical reliability gap, we introduce the Cascading Hallucination Aware Resolution and Mitigation \(CHARM\) framework\. This work makes three primary contributions:
1. C1: Cascading Hallucination Taxonomy\. We provide the first formal mathematical definition and classification of cascading hallucination types specific to multi\-step agentic RAG pipelines, defining four named typologies with concrete operational definitions for all core quantities\.
2. C2: CHARM Detection Framework\. We present a named, implementable four\-component detection architecture that operates continuously alongside existing RAG pipelines without requiring foundational replacement, with full component ablations confirming individual contributions\.
3. C3: Mitigation Architectures\. We propose four concrete, named mitigation patterns that interrupt error propagation at each pipeline stage, offering practitioners configurable trade\-offs between latency overhead and intervention accuracy\.
The remainder of this paper is organized as follows\. Section [II](https://arxiv.org/html/2606.04435#S2) provides background on agentic RAG pipelines and existing detection limitations\. Section [III](https://arxiv.org/html/2606.04435#S3) mathematically formalizes the cascading hallucination problem space\. Section [IV](https://arxiv.org/html/2606.04435#S4) details the CHARM architecture, while Section [V](https://arxiv.org/html/2606.04435#S5) outlines corresponding mitigation strategies\. Section [VI](https://arxiv.org/html/2606.04435#S6) presents our empirical evaluation, ablations, and novel metrics\. Section [VII](https://arxiv.org/html/2606.04435#S7) contextualizes these findings within U\.S\. national AI governance frameworks, followed by related work in Section [VIII](https://arxiv.org/html/2606.04435#S8) and concluding remarks in Section [IX](https://arxiv.org/html/2606.04435#S9)\.
## II Background
To contextualize the mechanisms of cascading errors, we must establish the foundational architecture of continuous reasoning pipelines and the limitations of current single\-step verification protocols\.
### II\-A Agentic RAG Pipeline Architecture
Standard Retrieval\-Augmented Generation \(RAG\) \[[8](https://arxiv.org/html/2606.04435#bib.bib8)\] enhances LLM outputs by fetching external knowledge\. However, as identified in our foundational System of Knowledge \(SoK\) analysis \[[1](https://arxiv.org/html/2606.04435#bib.bib1)\], the paradigm has shifted from single\-turn retrieval to agentic, multi\-step pipelines\.
As illustrated in Figure [1](https://arxiv.org/html/2606.04435#S2.F1), a standard agentic RAG pipeline operates across five sequential stages: \(1\) Query Formulation, where the agent interprets the user prompt; \(2\) Retrieval, where external knowledge is fetched; \(3\) Intermediate Reasoning, where the agent processes the context; \(4\) Tool Use, where the agent executes specific functions; and \(5\) Final Synthesis and Output\. In this architecture, the state output of stage $i$ becomes the definitive context window for stage $i+1$, creating a persistent memory chain that spans the entire generation process\.
Figure 1: A standard 5\-stage agentic RAG pipeline\. The context output is continuously passed forward as the definitive input for subsequent reasoning stages, demonstrating how early state persists throughout the trajectory\.
### II\-B Existing Hallucination Detection
Current hallucination detection methodologies generally fall into three categories, all of which exhibit structural blind spots when applied to cascading scenarios:
- Output\-Level Detection: Approaches like SelfCheckGPT \[[4](https://arxiv.org/html/2606.04435#bib.bib4)\] check the final LLM response for factual accuracy\. Because they evaluate only the terminal output, they entirely miss the intermediate stage errors that constructed the hallucination\.
- Retrieval\-Level Detection: Frameworks such as RAGAS \[[6](https://arxiv.org/html/2606.04435#bib.bib6)\] evaluate the relevance and accuracy of retrieved documents\. While effective at step 1, they fail to track how accurately that retrieved context is logically applied across subsequent reasoning steps\.
- Consistency\-Based Detection: These methods \[[4](https://arxiv.org/html/2606.04435#bib.bib4),[9](https://arxiv.org/html/2606.04435#bib.bib9)\] check the internal consistency of an LLM’s output via zero\-resource sampling or self\-reflection\. However, cascaded outputs are inherently internally consistent: they are perfectly coherent given the initial false premise\.
Additionally, the naive approach of LLM Self\-Correction \[[10](https://arxiv.org/html/2606.04435#bib.bib10),[7](https://arxiv.org/html/2606.04435#bib.bib7)\], where an agent is prompted to review its own final answer, fails due to confirmation bias\. The agent reinforces the cascade because the downstream reasoning appears logically sound relative to its corrupted memory\.
### II\-C Multi\-Step Reasoning and Error Compounding
The vulnerability of sequential reasoning is deeply rooted in the mechanics of Chain\-of\-Thought \(CoT\) prompting\[[11](https://arxiv.org/html/2606.04435#bib.bib11)\]\. While CoT significantly improves complex problem\-solving by forcing intermediate steps, it inadvertently creates pathways for logical derailment\[[12](https://arxiv.org/html/2606.04435#bib.bib12)\]\.
When an error occurs in sequential reasoning, it does not remain static; it acts as an anchor for subsequent token generation\. As the agent builds upon the flawed premise, the semantic distance between the agent’s internal state and the objective ground truth widens\. This compounding effect forms the theoretical basis for why cascading hallucinations are not merely random errors, but predictable, measurable, and highly structured pipeline failures\.
## III Problem Formalization
This section establishes the theoretical foundation for the CHARM framework by formally defining the mechanics of cascading hallucinations in multi\-step systems\. Unlike single\-step generation tasks where hallucinations occur as isolated deviations from a prompt\[[4](https://arxiv.org/html/2606.04435#bib.bib4),[5](https://arxiv.org/html/2606.04435#bib.bib5)\], agentic pipelines function as sequential state machines where the output of one stage becomes the authoritative context for the next\[[8](https://arxiv.org/html/2606.04435#bib.bib8),[13](https://arxiv.org/html/2606.04435#bib.bib13)\]\.
### III\-A Formal Definition of Cascading Hallucination
Let $P=(s_1,s_2,\dots,s_n)$ be a multi\-step agentic RAG pipeline where $s_i$ denotes the $i$\-th reasoning stage\. Let $c_i$ denote the context output of stage $i$ passed as input to stage $i+1$\.
A cascading hallucination occurs when the following four conditions are met:
1. Stage $s_i$ produces output $c_i$ containing factual error $\epsilon_i$ with respect to ground truth $G$\.
2. The corrupted context $c_i$ is propagated as valid context to $s_{i+1}$\.
3. Stage $s_{i+1}$ generates output $c_{i+1}$ that is conditionally coherent given $c_i$ but factually incorrect with respect to $G$\.
4. The error magnitude persists or grows, such that $|\epsilon_{i+1}| \geq |\epsilon_i|$, meaning the error magnitude is monotonically non\-decreasing across subsequent stages\.
This formal definition explicitly distinguishes cascading hallucinations from standard single\-step hallucinations\. In a single\-step hallucination, an error occurs but does not necessarily propagate or amplify\. In a cascading scenario, the underlying architecture actively forces the model to synthesize and compound the error across sequential reasoning layers\[[11](https://arxiv.org/html/2606.04435#bib.bib11)\]\.
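The four conditions above can be read as a check over a recorded trajectory. The following is a minimal sketch, not the authors' implementation: `StageRecord`, `find_cascade`, and the `min_error` parameter are hypothetical names, and both the per-stage error magnitudes and the local-coherence judgments are assumed to come from some external scorer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageRecord:
    """Hypothetical per-stage measurements for stage s_i."""
    error: float            # |epsilon_i|, error magnitude w.r.t. ground truth G
    locally_coherent: bool  # is c_i coherent given the previous context c_{i-1}?


def find_cascade(stages: List[StageRecord], min_error: float = 0.0) -> List[int]:
    """Return 0-based indices of the stages forming a cascade, or [] if none.

    The chain starts at the first stage whose error exceeds `min_error`
    (condition 1) and extends while each following stage is locally coherent
    (conditions 2-3) and its error does not shrink (condition 4).
    """
    start = next((i for i, s in enumerate(stages) if s.error > min_error), None)
    if start is None:
        return []  # no stage introduced an error

    chain = [start]
    for i in range(start + 1, len(stages)):
        if stages[i].locally_coherent and stages[i].error >= stages[i - 1].error:
            chain.append(i)
        else:
            break
    # A single erroneous stage is an isolated hallucination, not a cascade.
    return chain if len(chain) >= 2 else []
```

Under these assumptions, an error that appears at one stage and then shrinks is reported as no cascade, which mirrors the distinction the definition draws from single\-step hallucination.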
### III\-B Distinguishing Cascading Hallucination from Generic Error Propagation
Error propagation in sequential systems is a known phenomenon\[[12](https://arxiv.org/html/2606.04435#bib.bib12),[11](https://arxiv.org/html/2606.04435#bib.bib11)\]\. Cascading hallucination, as defined here, is a strictly more specific failure mode with four properties that jointly distinguish it from generic propagation in prior work:
TABLE I: Cascading Hallucination vs\. Generic Error Propagation
The critical distinguishing property is local coherence under global falsity: a cascading hallucination is not merely an error that persists, but one where each downstream stage generates output that is conditionally correct given its corrupted context \(Condition 3\), making it invisible to per\-step detectors \(Lemma 1\)\. Generic error propagation studied in CoT reasoning failures \[[12](https://arxiv.org/html/2606.04435#bib.bib12)\] and process supervision \[[14](https://arxiv.org/html/2606.04435#bib.bib14)\] does not require this local coherence property, and therefore does not exhibit the systematic evasion of standard detectors that motivates the CHARM architecture\. Furthermore, the Confidence Inflation Cascade type, where low\-confidence outputs propagate as high\-confidence, has received limited explicit treatment in existing error propagation literature, where confidence dynamics are rarely modeled as a first\-class propagation mechanism\.
### III\-C DAG\-Based Pipeline Model
Our foundational SoK analysis \[[1](https://arxiv.org/html/2606.04435#bib.bib1)\] identified quantifying this propagation as a critical open problem\. Doing so requires modeling the multi\-step reasoning process as a Weighted Directed Acyclic Graph \(DAG\) denoted by $\mathcal{G}=(\mathcal{V},\mathcal{E})$\.
The set of nodes $\mathcal{V}$ represents discrete pipeline stages \(retrieval, reasoning, tool\-call, synthesis, final output\)\. The set of directed edges $\mathcal{E}$ represents the context and intermediate outputs passed forward between stages\. We assign edge weights corresponding to the error propagation probability $P(\epsilon_{i+1} \mid \epsilon_i)$\. Under this model, cascade detection is defined as identifying paths in the DAG where the cumulative edge weight product exceeds a predefined safety threshold $\theta$\.
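The path\-product criterion can be sketched as a small traversal. This is an illustrative reading rather than the paper's procedure: `cascade_paths`, the adjacency\-list layout, and the example edge weights are all hypothetical, and only source\-to\-sink paths are reported.

```python
from typing import Dict, List, Tuple

# Adjacency list: node -> [(successor, P(eps_{i+1} | eps_i))]. Hypothetical layout.
Edges = Dict[str, List[Tuple[str, float]]]

# Illustrative weights only; they are not taken from the paper's evaluation.
EXAMPLE_EDGES: Edges = {
    "retrieval": [("reasoning", 0.9)],
    "reasoning": [("tool_call", 0.8), ("synthesis", 0.5)],
    "tool_call": [("synthesis", 0.9)],
    "synthesis": [("final_output", 0.95)],
}


def cascade_paths(edges: Edges, source: str, theta: float) -> List[Tuple[List[str], float]]:
    """Return source-to-sink paths whose edge-weight product exceeds theta."""
    flagged = []
    stack = [([source], 1.0)]
    while stack:
        path, prob = stack.pop()
        successors = edges.get(path[-1], [])
        if not successors and len(path) > 1:
            flagged.append((path, prob))  # reached a sink above the threshold
            continue
        for nxt, weight in successors:
            new_prob = prob * weight
            # Weights are probabilities (<= 1), so the product never grows:
            # a prefix at or below theta cannot recover and is pruned.
            if new_prob > theta:
                stack.append((path + [nxt], new_prob))
    return flagged
```

With `EXAMPLE_EDGES` and `theta=0.55`, only the path through `tool_call` survives, because the direct `reasoning` to `synthesis` branch already falls to 0.45 after two edges.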
In practice, computing exact path probabilities in $\mathcal{G}$ at inference time requires estimating $P(\epsilon_{i+1} \mid \epsilon_i)$ for each edge, which is intractable without offline calibration on held\-out trajectories\. We therefore operationalize cascade detection via a linear weighted approximation: the CRT \(Section [IV\-B](https://arxiv.org/html/2606.04435#S4.SS2)\) computes
$$\hat{p}_{\mathrm{cascade}} = w_{\mathrm{sfv}} \cdot a_{\mathrm{sfv}} + w_{\mathrm{csct}} \cdot a_{\mathrm{csct}} + w_{\mathrm{cpm}} \cdot a_{\mathrm{cpm}} \tag{1}$$
where $a_{\mathrm{sfv}}, a_{\mathrm{csct}}, a_{\mathrm{cpm}} \in [0,1]$ are the anomaly scores from each monitoring component and $w_{\mathrm{sfv}}=0.4$, $w_{\mathrm{csct}}=0.4$, $w_{\mathrm{cpm}}=0.2$ are weights calibrated on held\-out validation splits\. The cascade flag fires when $\hat{p}_{\mathrm{cascade}} \geq \theta = 0.55$, approximating the DAG path threshold\. This design choice trades formal exactness for inference\-time tractability while preserving the DAG’s theoretical interpretation of cumulative error propagation probability\.
We adopted fixed weights over a learned meta\-classifier for three reasons: \(1\) fixed weights are interpretable and directly reflect prior knowledge about component reliability \(SFV and CSCT are more calibrated than CPM\); \(2\) a learned classifier would require labeled cascade trajectories for training, creating a circular dependency with the very detection system being built; and \(3\) fixed weights transfer across datasets without retraining\. Conformal calibration of the CRT threshold $\theta$ to provide coverage guarantees is an identified future direction\.
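As a worked illustration of Equation 1 with the stated weights and threshold, consider two sets of anomaly scores. The scores are hypothetical values chosen for the arithmetic, not results from the evaluation.

```latex
% Hypothetical scores: a_sfv = 0.8, a_csct = 0.6, a_cpm = 0.1
\hat{p}_{\mathrm{cascade}} = 0.4 \cdot 0.8 + 0.4 \cdot 0.6 + 0.2 \cdot 0.1
                           = 0.32 + 0.24 + 0.02 = 0.58 \geq \theta = 0.55
                           \quad \text{(flag fires)}

% Hypothetical scores: a_sfv = 0.9, a_csct = 0.2, a_cpm = 0.3
\hat{p}_{\mathrm{cascade}} = 0.4 \cdot 0.9 + 0.4 \cdot 0.2 + 0.2 \cdot 0.3
                           = 0.36 + 0.08 + 0.06 = 0.50 < \theta = 0.55
                           \quad \text{(no flag)}
```

One consequence of these weights is that no single component can fire the flag alone: the largest single contribution is 0.4, which is below 0.55, so at least two signals must be elevated.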
We employ a DAG rather than a Markov Chain formalism because RAG pipelines are inherently directed and acyclic, and earlier retrieved context heavily persists throughout the entirety of the pipeline\. This continuous persistence of context explicitly violates the Markov memorylessness assumption\. The DAG formalism accurately captures this persistent context influence while preserving the ability to assign discrete probability weights to edges\.
Figure 2: DAG\-based representation $\mathcal{G}=(\mathcal{V},\mathcal{E})$ of a multi\-step agentic pipeline\. The highlighted path demonstrates a cascading hallucination where the cumulative error propagation weight $P(\epsilon_{i+1} \mid \epsilon_i)$ forces a high terminal divergence from ground truth\.
### III\-D Four\-Type Cascading Hallucination Taxonomy
Because errors enter the DAG at different nodes and compound in different ways, generalized detection is insufficient\. We classify cascading hallucinations into four distinct, formally named types:
- Retrieval Cascade: A false document is retrieved in step 1, causing all subsequent reasoning to build on a false premise\. The primary detection signal is source\-output semantic divergence at stage 1\.
- Inference Cascade: Correct retrieval occurs, but an incorrect inference is made at step 2, which downstream stages subsequently amplify\. The primary detection signal is an entailment score drop between the retrieved evidence and the inferred conclusion\.
- Context Poisoning Cascade: Manipulated external data corrupts the agent’s memory and all subsequent steps\. The primary detection signal is an anomalous semantic shift in context between stages\.
- Confidence Inflation Cascade: A low\-confidence output is treated as a high\-confidence output by the next step, causing false certainty to grow monotonically\. The primary detection signal is a confidence score increase despite underlying semantic drift from ground truth\.
### III\-E Theoretical Limitations of Standard Detectors
To definitively establish the necessity of the CHARM framework, we provide a formal argument demonstrating why standard per\-step hallucination detectors fail to capture cascading errors\.
Lemma 1: For each cascade type, the per\-step output passes standard detection thresholds\.
Proof\. Let standard output\-level hallucination detectors be defined by a local entailment threshold $\tau$\. In a cascading scenario, stage $s_{i+1}$ receives the corrupted context $c_i$\. By definition, $s_{i+1}$ generates $c_{i+1}$ such that it is conditionally coherent given $c_i$\. Therefore, the local entailment probability $P(c_{i+1} \mid c_i)$ remains exceptionally high, often satisfying $P(c_{i+1} \mid c_i) > \tau$\. Consequently, the per\-step detector evaluates the local generation step as factually grounded and internally consistent, completely missing the global divergence from ground truth $G$\. $\blacksquare$
Corollary 1: Per\-step detection is inherently insufficient for cascade identification\.
Proof\. Following Lemma 1, since standard single\-step detectors only evaluate the isolated local transition $P(c_n \mid c_{n-1})$, they are entirely blind to the monotonically compounding error magnitude $|\epsilon_n|$ across the broader pipeline trajectory\. Identifying a cascade mandates cross\-stage trajectory tracking and continuous semantic evaluation, capabilities that per\-step, isolation\-based detectors \[[6](https://arxiv.org/html/2606.04435#bib.bib6)\] fundamentally lack\. $\blacksquare$
### III\-F Operational Definitions
To ensure reproducibility and to connect the formal quantities above to concrete estimators, we define each core measurement as follows\.
Error magnitude $|\epsilon_i|$ at stage $s_i$ is defined stage\-adaptively to avoid the ill\-posedness of comparing non\-answer\-like intermediate outputs \(retrieval snippets, tool I/O\) directly to the final ground truth answer $G$\.
For retrieval and tool\-call stages \($s_1$, $s_4$\), where outputs are evidence snippets rather than answer\-like text, error magnitude is measured using a dual\-anchor strategy\. In standard operation, the entailment\-based veracity deficit is computed against $c_1$:
$$|\epsilon_i|^{\mathrm{early}} = 1 - \mathrm{NLI}_{\mathrm{entail}}\left(c_1,\, c_i\right) \tag{2}$$
where $\mathrm{NLI}_{\mathrm{entail}}$ is the entailment probability from the SFV cross\-encoder \(cross\-encoder/nli\-deberta\-v3\-base\)\. For Retrieval Cascade scenarios where $c_1$ is itself the corrupted anchor, this definition would undercount error until later stages detect the drift\. To address this, CHARM additionally maintains a secondary anchor: a top\-$k$ consensus summary computed from $k=3$ retrieved candidates at stage 1, rather than only the top\-1 document\. The SFV compares subsequent stage outputs against this consensus anchor in parallel with $c_1$, flagging divergence from either reference\. This dual\-anchor design reduces the risk that a corrupted single top\-1 document silently becomes the unchallenged reference for all subsequent stages\.
For reasoning and synthesis stages \($s_2$, $s_3$, $s_5$\), where outputs are answer\-proximate, error magnitude is the semantic divergence from ground truth:
$$|\epsilon_i|^{\mathrm{late}} = 1 - \mathrm{sim}\left(\phi(c_i),\, \phi(G)\right) \tag{3}$$
where $\phi(\cdot)$ is the all\-mpnet\-base\-v2 Sentence\-BERT \[[15](https://arxiv.org/html/2606.04435#bib.bib15)\] embedding and $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity\. This stage\-adaptive definition ensures that intermediate outputs are evaluated against appropriate reference anchors at each point in the trajectory rather than against the final answer they have not yet produced\.
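The two error\-magnitude definitions (Equations 2 and 3) reduce to simple arithmetic once the model outputs are available. The sketch below assumes the entailment probabilities and embedding vectors have already been computed by the models named in the text; the function names are hypothetical, and reusing the SFV threshold of 0.72 for the dual\-anchor check is an assumption, since the text does not state which threshold that check uses.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def early_error(entail_prob: float) -> float:
    """Equation 2: 1 - NLI_entail(c_1, c_i), from a precomputed probability."""
    return 1.0 - entail_prob


def late_error(stage_embedding: Sequence[float], truth_embedding: Sequence[float]) -> float:
    """Equation 3: 1 - sim(phi(c_i), phi(G)), from precomputed embeddings."""
    return 1.0 - cosine_similarity(stage_embedding, truth_embedding)


def dual_anchor_flag(entail_vs_c1: float, entail_vs_consensus: float, tau: float = 0.72) -> bool:
    """Flag divergence from either anchor (c_1 or the top-k consensus summary).

    Assumption: divergence means the entailment probability against that
    anchor falls below tau.
    """
    return min(entail_vs_c1, entail_vs_consensus) < tau
```

The dual\-anchor check fires when either reference disagrees with the stage output, so an output that still entails a corrupted $c_1$ can be caught through the consensus anchor.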
Error propagation probability $P(\epsilon_{i+1} \mid \epsilon_i)$ on DAG edge $(s_i, s_{i+1})$ is estimated empirically as the frequency with which a detected error at stage $s_i$ produces a measurable error at stage $s_{i+1}$ \(i\.e\., $|\epsilon_{i+1}| > \delta$, where $\delta=0.15$ is a calibrated minimum divergence threshold\)\. This estimation is computed offline on the training split of each dataset and applied as a fixed prior during inference\.
We note that this empirical estimation treats co\-occurrence of errors at adjacent stages as evidence of propagation rather than independent occurrence; establishing causal error transmission formally — distinguishing propagation from coincidence — remains an important open problem for future work\.
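The frequency estimate described above can be sketched as follows. `propagation_probability` is a hypothetical helper, each trajectory is assumed to be a list of per\-stage error magnitudes, and treating "detected error at stage $s_i$" as $|\epsilon_i| > \delta$ is an assumption, since the text defines the threshold only for the downstream stage.

```python
from typing import List, Optional

DELTA = 0.15  # minimum divergence threshold stated in the text


def propagation_probability(
    trajectories: List[List[float]], i: int, delta: float = DELTA
) -> Optional[float]:
    """Estimate P(eps_{i+1} | eps_i) for the edge (s_i, s_{i+1}); i is 0-based.

    Counts, among trajectories with an error at stage i, the fraction that
    also show an error at stage i + 1. Returns None when no trajectory has
    an error at stage i, because the conditional frequency is undefined.
    """
    with_error = [t for t in trajectories if t[i] > delta]
    if not with_error:
        return None
    propagated = sum(1 for t in with_error if t[i + 1] > delta)
    return propagated / len(with_error)
```

As the note above says, this counts co\-occurrence at adjacent stages; it does not separate transmitted errors from errors that arise independently at stage $i+1$.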
Cascade detection threshold $\theta$ is selected via grid search over $\theta \in \{0.40, 0.45, 0.50, 0.55, 0.60, 0.65\}$ on a held\-out validation split \(10% of each dataset\), optimizing the $F_1$ score between CDR and $(1-\mathrm{FPR})$\. The selected value $\theta=0.55$ yields the best harmonic mean across all four datasets\.
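A minimal sketch of this grid search is shown below, assuming the validation split provides one aggregated cascade score and one binary cascade label per trajectory. `select_theta` and its input layout are hypothetical; the objective is the harmonic mean of CDR and $(1-\mathrm{FPR})$ as described.

```python
from typing import List, Sequence, Tuple

THETA_GRID = (0.40, 0.45, 0.50, 0.55, 0.60, 0.65)  # grid stated in the text


def select_theta(
    scores: List[float], labels: List[bool], grid: Sequence[float] = THETA_GRID
) -> Tuple[float, float]:
    """Return (theta, objective) maximizing the harmonic mean of CDR and 1 - FPR.

    scores: aggregated cascade scores; labels: True where a cascade is present.
    Ties keep the smallest theta because the comparison below is strict.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_theta, best_value = grid[0], -1.0
    for theta in grid:
        flagged = [s >= theta for s in scores]
        true_pos = sum(1 for f, y in zip(flagged, labels) if f and y)
        false_pos = sum(1 for f, y in zip(flagged, labels) if f and not y)
        cdr = true_pos / positives if positives else 0.0
        specificity = 1.0 - (false_pos / negatives if negatives else 0.0)
        total = cdr + specificity
        value = 2.0 * cdr * specificity / total if total else 0.0
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value
```

For example, with hypothetical scores `[0.30, 0.42, 0.52, 0.58, 0.62, 0.70]` where only the last three are true cascades, the search selects 0.55, the one grid value that separates the two groups.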
Commensurability note: The stage\-adaptive definitions \(Equations [2](https://arxiv.org/html/2606.04435#S3.E2) and [3](https://arxiv.org/html/2606.04435#S3.E3)\) use different reference anchors across stages, which raises the question of whether $|\epsilon_i|^{\mathrm{early}}$ and $|\epsilon_i|^{\mathrm{late}}$ are directly comparable\. We treat the monotonicity condition $|\epsilon_{i+1}| \geq |\epsilon_i|$ \(Definition condition 4\) as a within\-stage\-type constraint rather than a cross\-stage\-type one: retrieval and tool\-call stage errors are compared against each other using $|\epsilon|^{\mathrm{early}}$, and reasoning and synthesis stage errors against each other using $|\epsilon|^{\mathrm{late}}$\. At the stage $1 \to 2$ boundary \(retrieval to reasoning\), the transition is monitored by the CSCT’s semantic drift signal rather than a direct magnitude comparison, which is more appropriate given the output type change\. This design choice acknowledges that strict numerical monotonicity across heterogeneous stage types is not measurable from a single scalar, and that the CRT’s weighted aggregation \(Equation 1\) naturally handles this by combining complementary signals optimized for each transition type\.
## IV The CHARM Framework
To address the inherent limitations of per\-step hallucination detection in multi\-step reasoning systems, we introduce CHARM \(Cascading Hallucination Aware Resolution and Mitigation\)\. CHARM is a modular, architectural framework designed to detect and interrupt error propagation across sequential pipeline stages without requiring the replacement of the underlying agentic architecture\.
### IV\-A Architecture Overview
The CHARM architecture operates as a parallel observation and enforcement layer alongside a standard agentic RAG pipeline\. As illustrated in Figure[3](https://arxiv.org/html/2606.04435#S4.F3), the system comprises three concurrent monitoring components that track the semantic and probabilistic trajectory of the agent’s context, feeding signals into a fourth component, a centralized resolution engine\. This design ensures that intermediate stage errors are caught before they can compound into confident, finalized hallucinations\.
Figure 3: The CHARM System Architecture\. The standard agentic pipeline \(left\) executes normally while the parallel CHARM layer \(center\) monitors inter\-stage context passing\. Anomaly signals trigger the Cascade Resolution Trigger \(right\), which interfaces directly with automated mitigations and human\-in\-the\-loop \(HITL\-AP\) governance protocols\.
### IV\-B CHARM Components
The framework consists of four named, interconnected components\. Table [II](https://arxiv.org/html/2606.04435#S4.T2) summarizes their technical mechanisms and the specific cascade types they detect\.
1. Stage\-Level Fact Verifier \(SFV\): The SFV checks each intermediate stage output against the initially retrieved evidence before passing it to the subsequent stage\. Utilizing cross\-encoder entailment scoring via cross\-encoder/nli\-deberta\-v3\-base \[[16](https://arxiv.org/html/2606.04435#bib.bib16)\] with entailment threshold $\tau=0.72$ \(calibrated on held\-out validation splits\), the SFV prevents the propagation of ungrounded claims\.
2. Cross\-Stage Consistency Tracker \(CSCT\): The CSCT maintains a running consistency check across all pipeline stages using all\-mpnet\-base\-v2 \[[15](https://arxiv.org/html/2606.04435#bib.bib15)\] embedding\-based cosine similarity, with a drift threshold $\delta_{\mathrm{drift}}=0.18$\. It flags contradictions or anomalous semantic shifts across the trajectory\.
3. Confidence Propagation Monitor \(CPM\): The CPM tracks the model’s self\-reported confidence scores across stages\. Let $p_i \in [0,1]$ denote the calibrated confidence score at stage $s_i$ after temperature scaling with $T=1.4$\. CPM maintains a running Bayesian estimate of the expected confidence trajectory, modeled with prior $p_i \sim \mathrm{Beta}(\alpha_i,\beta_i)$, initialized at $\alpha_1=\beta_1=2$ \(a weakly informative symmetric prior\)\. After observing $p_i$, the posterior updates as: $$\alpha_{i+1}=\alpha_i+p_i,\quad \beta_{i+1}=\beta_i+(1-p_i) \tag{4}$$ An inflation anomaly is flagged when $p_i$ exceeds the posterior predictive mean $\mu_i=\alpha_i/(\alpha_i+\beta_i)$ by more than $\Delta=0.15$, i\.e\., $p_i-\mu_i>\Delta$\. A known limitation of self\-reported LLM confidence is poor calibration \[[17](https://arxiv.org/html/2606.04435#bib.bib17)\]; CPM therefore applies temperature scaling \[[18](https://arxiv.org/html/2606.04435#bib.bib18)\] with $T=1.4$, calibrated on 500 held\-out trajectories per dataset, before the Bayesian update\. For APIs lacking logit access, CPM falls back to an NLI\-derived uncertainty proxy computed from the contradiction probability of the SFV cross\-encoder \(specifically, $1-P(\mathrm{entail})-P(\mathrm{neutral})$\) rather than the entailment score used by SFV\. This complementary signal captures epistemic uncertainty rather than factual grounding, preserving signal independence from SFV even when logit access is unavailable\.
CPM is designed as a complementary detection signal rather than a primary detector: its standalone CDR \(38\.3%\) reflects the inherent difficulty of confidence\-only cascade detection, while its \+6\.4 pp contribution to the SFV\+CSCT configuration confirms it catches Confidence Inflation Cascades that entailment and drift signals alone cannot detect\. Under the no\-logit condition, the CPM anomaly flag fires when the contradiction probability exceeds a separately calibrated threshold $\tau_{\mathrm{cpm}}=0.35$, distinct from SFV’s entailment threshold $\tau=0.72$\. Ablation under this condition shows CPM contributes \+4\.1 pp CDR above the SFV\+CSCT configuration, confirming non\-redundant complementarity even without direct logit access\.
4. Cascade Resolution Trigger \(CRT\): Operating as the final enforcement layer, the CRT aggregates signals from the SFV, CSCT, and CPM using a weighted voting scheme: SFV and CSCT each carry weight 0\.4; CPM carries weight 0\.2, reflecting the lower reliability of self\-reported confidence\. When the aggregated score reaches or exceeds $\theta=0.55$, the CRT halts the pipeline and initiates a targeted resolution strategy \(Section [V](https://arxiv.org/html/2606.04435#S5)\)\.
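The CPM update rule (Equation 4) and the inflation check from item 3 can be sketched in a few lines. This is a minimal sketch under stated assumptions: the input confidences are taken to be already temperature\-scaled, `cpm_flags` is a hypothetical name, and updating the posterior even after a flagged observation is an assumption the text does not settle.

```python
from typing import List

DELTA_INFLATION = 0.15  # Delta from the text


def cpm_flags(
    confidences: List[float],
    alpha: float = 2.0,   # alpha_1 from the text
    beta: float = 2.0,    # beta_1 from the text
    delta: float = DELTA_INFLATION,
) -> List[bool]:
    """Return one inflation flag per stage for a sequence of confidences p_i."""
    flags = []
    for p in confidences:
        mu = alpha / (alpha + beta)    # mean of the current Beta(alpha, beta)
        flags.append(p - mu > delta)   # inflation anomaly: p_i - mu_i > Delta
        alpha += p                     # Equation 4: alpha_{i+1} = alpha_i + p_i
        beta += 1.0 - p                # Equation 4: beta_{i+1} = beta_i + (1 - p_i)
    return flags
```

For a hypothetical trajectory `[0.55, 0.60, 0.62, 0.95]`, the running mean stays near 0.5 to 0.54 over the first three stages, so only the jump to 0.95 at the final stage is flagged.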
TABLE II: CHARM Component Summary and Cascade Detection Mapping
Algorithm [1](https://arxiv.org/html/2606.04435#alg1) summarizes the CRT decision procedure\.
Algorithm 1: CRT decision and mitigation routing logic\.

    Input: sfv_score, csct_score, cpm_score, stage_id
    Output: cascade_flag, mitigation_type
    p ← 0.4 × sfv_score + 0.4 × csct_score + 0.2 × cpm_score
    if p < θ (0.55) then
        return False, None
    end if
    cascade_type ← infer_type(sfv_score, csct_score, cpm_score)
    if stage_id ≤ 2 then                                 ▷ Early stage → re-retrieve
        return True, "CRR"
    else if sfv_score > 0.7 and csct_score > 0.7 then    ▷ Multi-signal → parallel verify
        return True, "PVA"
    else if stage_id ≥ 4 then                            ▷ Late stage → rollback
        return True, "PRR"
    else                                                 ▷ Default → confidence gate
        return True, "SCT"
    end if

The routing thresholds in Algorithm [1](https://arxiv.org/html/2606.04435#alg1) reflect empirical calibration: stage $\leq 2$ captures retrievals and initial inferences \(the most common cascade origin stages in our evaluation\); the $\mathit{sfv\_score} > 0.7$ and $\mathit{csct\_score} > 0.7$ threshold for PVA activation reflects the 0\.72 SFV entailment threshold minus a 0\.02 margin for aggregation noise; and stage $\geq 4$ for PRR reflects tool\-use and synthesis stages where rollback cost is justified by late\-stage cascade severity\.
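Algorithm 1 translates almost line for line into Python. The sketch below follows the listed branches and constants; `crt_decide` is a hypothetical function name, and the `infer_type` step is omitted because the algorithm does not define it and its result does not affect the routing branches.

```python
from typing import Optional, Tuple

THETA = 0.55                                       # cascade threshold
WEIGHTS = {"sfv": 0.4, "csct": 0.4, "cpm": 0.2}    # CRT voting weights
PVA_SCORE = 0.7                                    # multi-signal routing threshold


def crt_decide(
    sfv_score: float, csct_score: float, cpm_score: float, stage_id: int
) -> Tuple[bool, Optional[str]]:
    """Return (cascade_flag, mitigation_type) following Algorithm 1."""
    p = (
        WEIGHTS["sfv"] * sfv_score
        + WEIGHTS["csct"] * csct_score
        + WEIGHTS["cpm"] * cpm_score
    )
    if p < THETA:
        return False, None      # below threshold: pipeline continues
    if stage_id <= 2:
        return True, "CRR"      # early stage: re-retrieve
    if sfv_score > PVA_SCORE and csct_score > PVA_SCORE:
        return True, "PVA"      # multi-signal: parallel verify
    if stage_id >= 4:
        return True, "PRR"      # late stage: rollback
    return True, "SCT"          # default: confidence gate
```

Branch order matters here: an early\-stage cascade is always routed to re\-retrieval, even when both SFV and CSCT scores are above 0.7, because the stage check comes first.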
### IV\-CIntegration with Existing RAG Pipelines
A primary design constraint of the CHARM framework is non\-intrusiveness\. CHARM wraps around existing production pipelines implemented via LangChain\[[19](https://arxiv.org/html/2606.04435#bib.bib19)\]or LlamaIndex\[[20](https://arxiv.org/html/2606.04435#bib.bib20)\]without requiring structural teardowns\. Each component is highly modular, enabling independent deployment under computational constraints\. Detection thresholds are fully adjustable to accommodate domain\-specific risk tolerances\.
### IV-D Integration with Human-in-the-Loop Governance
To provide a complete reliability and governance stack, CHARM integrates with the Human\-in\-the\-Loop Governance for Agentic AI Pipelines \(HITL\-AP\) framework\[[21](https://arxiv.org/html/2606.04435#bib.bib21)\]\. The CRT serves as the technical bridge between automated detection and human governance\. When the CRT triggers on a low\-confidence cascade, it automatically routes to a lightweight mitigation pattern \(e\.g\., re\-retrieval\)\. When the CRT detects a high\-confidence cascade—where the system is confidently hallucinating a compounding error—it halts execution and routes the trajectory to the HITL\-AP human approval checkpoint\. Audit logs generated by the CRT feed directly into the HITL\-AP compliance logging mechanism, ensuring all error propagation trajectories are captured for enterprise review\.
## V Mitigation Architectures
While detection via the CHARM framework provides visibility into error trajectories, a robust agentic system requires automated mechanisms to interrupt and resolve these cascades\. We propose four named mitigation patterns \(M1–M4\), each offering different trade\-offs between computational overhead and mitigation success rate \(MSR\) compared to naive LLM self\-correction\[[10](https://arxiv.org/html/2606.04435#bib.bib10),[7](https://arxiv.org/html/2606.04435#bib.bib7)\]\.
### V-A M1: Cascade Re-Retrieval (CRR)
The CRR pattern is triggered when the SFV or CSCT flags a potential error at the initial retrieval or early reasoning stages. The system halts execution and triggers a fresh retrieval step with modified query parameters. While this introduces medium latency overhead (+320 ms avg.), it is the most effective method for quenching retrieval-based cascades before they reach the reasoning core.
### V-B M2: Staged Confidence Thresholding (SCT)
SCT serves as always-on baseline protection for high-throughput systems. Each stage passes its output to the next only if the CPM reports a score exceeding a dynamically calibrated threshold (+38 ms per stage gate). If the score falls below the threshold, the system triggers a localized verification step before proceeding.
### V-C M3: Parallel Verification Agent (PVA)
For high-stakes domains such as the financial and enterprise sectors discussed in Section [VII](https://arxiv.org/html/2606.04435#S7), the PVA deploys a secondary, independent verification agent running in parallel with the primary reasoning pipeline. The PVA is activated when the CRT aggregates simultaneous anomaly signals from both the SFV and CSCT, indicating multi-component cascade confidence. Independence from the primary agent is ensured through three mechanisms: (1) Model isolation: the PVA uses a different backbone LLM (GPT-4o-mini in our implementation, vs. GPT-4o for the primary agent), preventing shared token-level biases from the primary trajectory; (2) Prompt isolation: the PVA receives only the original query and the specific claim under verification, with no access to the primary agent's intermediate context, preventing confirmation bias inheritance; and (3) Knowledge base isolation: the PVA queries a separate, read-only trusted reference corpus (Wikipedia snapshot frozen at experiment time) rather than the dynamic retrieval index used by the primary pipeline. This three-layer isolation ensures that PVA verdicts represent a genuinely independent verification signal. Although this effectively doubles computational cost, it provides the highest reliability (95.2% MSR) for regulated environments where correctness is non-negotiable.
### V-D M4: Pipeline Rollback and Re-Execution (PRR)
When a cascade is detected at a late stage (e.g., stage 4 or 5), the CRT initiates a rollback to the last known clean stage identified by the CSCT, corrects the identified error via targeted prompt adjustment, and re-executes the pipeline from that point ($+1.8\times$ re-execution overhead). This ensures the final output is built on a corrected context rather than a poisoned one.
TABLE III: Comparison of CHARM Mitigation Patterns
## VI Evaluation
To validate the efficacy of the CHARM framework, we designed a comprehensive evaluation harness to test multi\-step reasoning systems under cascading failure conditions\.
### VI-A Experimental Setup and Agentic Adaptation
Evaluating agentic pipelines requires continuous reasoning environments; however, standard multi-hop QA datasets are inherently static. To address this, we developed an agentic trajectory wrapper. Rather than providing the full context window upfront, the evaluation harness forces the LLM agent to use a designated Search Tool to fetch paragraphs sequentially across multiple reasoning steps.
#### VI-A1 Implementation Stack
All experiments use GPT-4o as the backbone LLM, accessed via the OpenAI API with temperature set to 0.0 for deterministic outputs. The agentic pipeline is implemented using LangChain AgentExecutor [[19](https://arxiv.org/html/2606.04435#bib.bib19)] with a ReAct [[22](https://arxiv.org/html/2606.04435#bib.bib22)] reasoning trace. Retrieval uses a dense retriever (FAISS [[23](https://arxiv.org/html/2606.04435#bib.bib23)]) with Wikipedia paragraph embeddings encoded via text-embedding-3-small. No reranker is applied in the primary experiments; a cross-encoder reranker ablation (ms-marco-MiniLM-L-6-v2) is reported in Table [IV](https://arxiv.org/html/2606.04435#S6.T4).
#### VI-A2 CHARM Component Configuration
The SFV uses cross-encoder/nli-deberta-v3-base [[16](https://arxiv.org/html/2606.04435#bib.bib16)] with entailment threshold $\tau = 0.72$. The CSCT uses all-mpnet-base-v2 [[15](https://arxiv.org/html/2606.04435#bib.bib15)] with drift threshold $\delta_{\mathrm{drift}} = 0.18$. The CPM applies temperature scaling $T = 1.4$, calibrated on 500 held-out trajectories per dataset. The CRT aggregates signals with weights SFV: 0.4, CSCT: 0.4, CPM: 0.2 and fires at $\theta = 0.55$. In our GPT-4o experiments, stage-level confidence $p_i$ is obtained from the model's logprobs output (top_logprobs=1). Specifically, $p_i$ is computed as the mean of the top-token log-probabilities across the final sentence of the stage output, exponentiated to obtain a probability and clipped to $[0.01, 0.99]$: $p_i = \mathrm{clip}\!\left(\exp\!\left(\frac{1}{|T|}\sum_{t \in T}\log p_t\right),\, 0.01,\, 0.99\right)$, where $T$ is the set of tokens in the final sentence and $p_t$ is the top-token probability at position $t$. This per-sentence aggregation captures the model's confidence in its concluding claim rather than averaging over the full stage output. Logprob access is enabled via the OpenAI API parameter logprobs=true. The temperature-scaled value is used for the Bayesian update; the NLI-contradiction fallback is invoked only when logprobs are unavailable, which occurred in 0% of our GPT-4o experimental runs.
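The stage-level confidence formula above amounts to a clipped geometric mean of the top-token probabilities in the final sentence. A minimal sketch, assuming the per-token log-probabilities for that sentence have already been extracted from the API response (the function name `stage_confidence` and its input format are illustrative assumptions, and temperature scaling is not applied here):

```python
import math
from typing import Sequence


def stage_confidence(final_sentence_logprobs: Sequence[float],
                     lo: float = 0.01, hi: float = 0.99) -> float:
    """Clipped exp of the mean token log-probability (hypothetical helper).

    final_sentence_logprobs: natural-log probabilities of the emitted
    tokens in the final sentence of a stage output.
    """
    if not final_sentence_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(final_sentence_logprobs) / len(final_sentence_logprobs)
    # exp(mean log p) is the geometric mean of the token probabilities.
    return min(hi, max(lo, math.exp(mean_logprob)))
```

The clipping bounds keep the value strictly inside (0, 1), which avoids degenerate inputs to any downstream probabilistic update.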
#### VI-A3 Long-Context Handling
NLI cross-encoder inputs are truncated to a maximum of 512 tokens per the DeBERTa-v3 model limit. For stage outputs exceeding this limit, we apply a sliding window with stride 256 tokens and take the minimum entailment score across windows as the conservative estimate (i.e., flagging if any window falls below $\tau$). For CSCT, sentence embeddings are computed over the full stage output without truncation, as all-mpnet-base-v2 processes variable-length inputs up to 512 tokens with mean pooling; outputs exceeding this are chunked and mean-pooled across chunk embeddings. In our evaluation datasets, median stage output length was 187 tokens (HotpotQA), 312 tokens (MuSiQue), and 278 tokens (2WikiMultiHopQA), placing most outputs within single-window range.
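The sliding-window rule for long stage outputs can be illustrated as follows. This is a sketch under stated assumptions: `score_fn` is a hypothetical stand-in for the NLI cross-encoder call (it receives one token window and returns an entailment score), and the premise text is ignored when budgeting the 512-token window, which a real implementation would need to account for.

```python
from typing import Callable, List, Sequence, Tuple


def window_spans(n_tokens: int, max_len: int = 512,
                 stride: int = 256) -> List[Tuple[int, int]]:
    """Half-open (start, end) spans covering n_tokens with the given stride."""
    if n_tokens <= max_len:
        return [(0, n_tokens)]
    spans = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


def min_entailment(tokens: Sequence, score_fn: Callable[[Sequence], float],
                   tau: float = 0.72, max_len: int = 512,
                   stride: int = 256) -> Tuple[float, bool]:
    """Return (minimum window score, flagged) for one stage output."""
    scores = [score_fn(tokens[s:e])
              for s, e in window_spans(len(tokens), max_len, stride)]
    worst = min(scores)
    # Conservative rule from the text: flag if any window falls below tau.
    return worst, worst < tau
```

Taking the minimum rather than the mean means a single poorly supported window is enough to flag the stage, at the cost of more false positives on long outputs.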
#### VI-A4 Threshold Sensitivity
Detection thresholds ($\tau = 0.72$, $\delta_{\mathrm{drift}} = 0.18$, $\theta = 0.55$) were calibrated on 10% held-out validation splits of each dataset independently. Cross-domain generalization of these thresholds to substantially different corpora (e.g., scientific or legal text) is an open question; per-domain recalibration is recommended for production deployments, and is supported by CHARM's configurable threshold interface (Section [IV-C](https://arxiv.org/html/2606.04435#S4.SS3)). Threshold sensitivity analysis and cross-domain transfer evaluation are identified as planned extensions.
#### VI-A5 Backbone Independence
The CHARM detection layer \(SFV, CSCT, CPM\) operates entirely independently of the backbone LLM: all NLI inference \(cross\-encoder/nli\-deberta\-v3\-base\) and embedding computation \(all\-mpnet\-base\-v2\) use locally hosted open\-source models with no API dependency\. The backbone LLM is used solely for pipeline generation; substituting GPT\-4o with any instruction\-tuned LLM \(e\.g\., Llama\-3, Mistral\) requires no changes to CHARM components\. Full open\-source backbone evaluation is a planned extension of this work\.
#### VI-A6 Hardware and Latency Measurement
All experiments were conducted on a single NVIDIA A100 80 GB GPU (local NLI/embedding inference) with API calls routed to OpenAI endpoints. Per-stage latency (LO/s) measures wall-clock time added by CHARM components only (NLI inference, embedding computation, confidence scoring); backbone LLM latency is excluded to isolate pure framework overhead. We note that methods such as SelfCheckGPT and RAGAS require additional LLM calls beyond the primary pipeline (e.g., SelfCheckGPT samples multiple generations; RAGAS invokes LLM-based faithfulness scoring), while CHARM's detection components run on locally hosted models with no additional LLM calls. On an end-to-end wall-clock basis, CHARM's early cascade detection (average CDD = 2.1) halts the pipeline before stages 3–5 execute, saving 2–3 full LLM inference calls per detected cascade. This early-exit behavior makes CHARM's effective end-to-end overhead substantially lower than the per-stage LO/s figure suggests when a cascade is present. All reported LO/s values are averaged over five independent runs per dataset.
#### VI-A7 Datasets
We evaluate across four datasets mapped to our cascade taxonomy:
- HotpotQA [[24](https://arxiv.org/html/2606.04435#bib.bib24)]: Multi-hop reasoning (Retrieval and Inference Cascades). 500 injected trajectories; 200 clean trajectories.
- MuSiQue [[25](https://arxiv.org/html/2606.04435#bib.bib25)]: Multi-step compositional questions (Inference and Confidence Inflation Cascades). 400 injected trajectories; 150 clean trajectories.
- 2WikiMultiHopQA [[26](https://arxiv.org/html/2606.04435#bib.bib26)]: Multi-document reasoning via targeted poison injection (Context Poisoning Cascades). 400 injected trajectories; 150 clean trajectories.
- Custom Adversarial Set: 200 synthetic trajectories (50 per cascade type); 100 clean trajectories.
To support reproducibility and community benchmarking, we will release the agentic trajectory wrapper, cascade injection scripts, annotated adversarial trajectories, and the full CHARM evaluation harness at[https://github\.com/sarmishra/CHARM\-agentic\-rag](https://github.com/sarmishra/CHARM-agentic-rag)\.
### VI-B Cascade Injection and Annotation Protocol
To ensure controlled, reproducible cascade generation, we apply a four-method injection protocol mapped to cascade type:
- Retrieval Cascade injection: The top-1 retrieved document is replaced with a semantically proximate but factually incorrect document, generated via GPT-4o with explicit counterfactual instructions. Applied to HotpotQA.
- Inference Cascade injection: The retrieval stage is left clean; a misleading reasoning cue is prepended to the intermediate context at stage 2. Applied to MuSiQue.
- Context Poisoning injection: Adversarial passages are inserted into the knowledge base using a gradient-free embedding-proximal attack [[27](https://arxiv.org/html/2606.04435#bib.bib27)], ensuring the poisoned document passes retrieval relevance filtering. Applied to 2WikiMultiHopQA.
- Confidence Inflation injection: Low-confidence hedging language ("possibly", "may be") is removed from stage outputs, simulating false certainty propagation. Applied across all datasets in the Custom Adversarial Set.

Custom adversarial set construction: The 200-trajectory adversarial set comprises 50 trajectories per cascade type: 50 Retrieval Cascades (GPT-4o counterfactual top-1 replacement), 50 Inference Cascades (misleading reasoning cue injection), 50 Context Poisoning Cascades (embedding-proximal adversarial passages), and 50 Confidence Inflation Cascades (hedging language removal). All trajectories were constructed from HotpotQA questions not present in the training or validation splits used for threshold calibration, ensuring strict separation between calibration and test examples. Each trajectory was reviewed by the authors to confirm the injection produced a detectable cascade (ground truth cascade type and injection stage labeled by the constructor). The full dataset, injection scripts, and annotation schema are released at [https://github.com/sarmishra/CHARM-agentic-rag](https://github.com/sarmishra/CHARM-agentic-rag).
Ground truth annotation: Each injected trajectory is labeled with the injection stage $s_{i}^{\mathrm{inject}}$ and cascade type. Under the strict early-detection criterion, a detection is counted as a true positive (TP) if the CRT flags an anomaly at any stage $s_j$ where $j \leq i^{\mathrm{inject}} + 1$, i.e., the cascade is caught before it propagates more than one additional stage. This strict criterion drives the reported 89.4% CDR and reflects CHARM's primary design goal of early interruption. Under a liberal criterion (any detection before the final output stage $s_5$), CHARM flags 100% of all injected cascades, confirming complete coverage before terminal output.
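The two true-positive criteria can be made concrete with a small predicate. This is an illustrative sketch: the function name and arguments are assumptions, stages are numbered from 1, and the final output stage defaults to 5 as in the text. It implements the stated inequalities literally and does not treat a flag raised before the injection stage as a special case.

```python
from typing import Optional


def is_true_positive(detect_stage: Optional[int], inject_stage: int,
                     criterion: str = "strict", final_stage: int = 5) -> bool:
    """Decide whether a CRT flag counts as a true positive (hypothetical helper).

    detect_stage: stage index at which the CRT first fired, or None.
    inject_stage: labeled injection stage of the trajectory.
    """
    if detect_stage is None:
        return False
    if criterion == "strict":
        # Caught no later than one stage after the injection stage.
        return detect_stage <= inject_stage + 1
    if criterion == "liberal":
        # Caught at any point before the final output stage.
        return detect_stage < final_stage
    raise ValueError(f"unknown criterion: {criterion}")
```

For example, an injection at stage 1 that is first flagged at stage 3 is a miss under the strict criterion but a hit under the liberal one.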
FPR estimation: False positive rate is measured on a separate held-out set of clean, non-injected trajectories (is_cascade = false) drawn from the same datasets: 200 clean trajectories from HotpotQA, 150 from MuSiQue, 150 from 2WikiMultiHopQA, and 100 from the custom adversarial set (500 total). These trajectories contain no artificially introduced errors and represent legitimate multi-hop reasoning chains with correct final answers verified against dataset gold labels. The 5.3% FPR is computed as the fraction of these clean trajectories that the CRT incorrectly flags as cascades. Clean and injected sets are strictly disjoint; no trajectory appears in both.
EPR computation: Error Propagation Reduction is computed as:

$$\mathrm{EPR} = 1 - \frac{\mathrm{EM}_{\mathrm{CHARM}}}{\mathrm{EM}_{\mathrm{None}}} \tag{5}$$

where $\mathrm{EM}_{\mathrm{CHARM}}$ and $\mathrm{EM}_{\mathrm{None}}$ denote the exact-match error rates for the CHARM system and the no-detection baseline respectively, defined as the fraction of injected trajectories where the final output does not match the gold answer string (i.e., $1 - \mathrm{EM}_{\mathrm{accuracy}}$). EPR therefore measures how much CHARM reduces incorrect final outputs relative to the no-detection baseline, providing a direct measure of error propagation interruption.
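As a worked illustration of Equation 5, the sketch below computes exact-match error rates and the resulting EPR. The helper names are assumptions, the whitespace and case normalization is an assumption (the text only specifies a match against the gold answer string), and the numbers in the usage note are hypothetical rather than results from the paper.

```python
from typing import Sequence


def em_error_rate(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of final outputs that do not exactly match the gold answer.

    Assumption: answers are compared after stripping whitespace and
    case-folding; the paper does not specify its normalization.
    """
    if len(predictions) != len(golds) or not golds:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(p.strip().casefold() != g.strip().casefold()
                for p, g in zip(predictions, golds))
    return wrong / len(golds)


def epr(em_charm: float, em_none: float) -> float:
    """Error Propagation Reduction per Equation 5: 1 - EM_CHARM / EM_None."""
    if em_none <= 0:
        raise ValueError("baseline error rate must be positive")
    return 1.0 - em_charm / em_none
```

With a hypothetical baseline error rate of 0.50 and a CHARM error rate of 0.10, `epr(0.10, 0.50)` gives 0.8, i.e., an 80% reduction in incorrect final outputs.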
A pilot study on naturally occurring cascades \(without injection\) is reported in Section[VI\-G](https://arxiv.org/html/2606.04435#S6.SS7)\.
### VI-C Baseline Comparisons
We evaluate CHARM against four direct baselines and reference two process-level systems discussed qualitatively in Section [VIII-C](https://arxiv.org/html/2606.04435#S8.SS3):
1. No Detection (None): A zero-intervention baseline establishing vulnerability.
2. Output-Level Detector (SelfCheckGPT [[4](https://arxiv.org/html/2606.04435#bib.bib4)]): Evaluates only the terminal output.
3. Retrieval Fact Checker (RAGAS [[6](https://arxiv.org/html/2606.04435#bib.bib6)]): Evaluates retrieved documents without cross-stage trajectory tracking.
4. LLM Self-Correction [[7](https://arxiv.org/html/2606.04435#bib.bib7)]: An agent prompts itself to review its own final answer, demonstrating confirmation bias in cascade scenarios.
5. EVER [[28](https://arxiv.org/html/2606.04435#bib.bib28)]: A process-level incremental verification framework that rectifies hallucinations during generation. Because EVER reports only answer-level EM and F1 scores rather than cascade-specific detection metrics, a direct column-for-column comparison in Table [VI](https://arxiv.org/html/2606.04435#S6.T6) is not possible; we discuss its relationship to CHARM in Section [VIII-C](https://arxiv.org/html/2606.04435#S8.SS3).
6. IRCoT [[29](https://arxiv.org/html/2606.04435#bib.bib29)]: An interleaved retrieval-with-chain-of-thought framework evaluated on the same three datasets used here. As with EVER, IRCoT reports EM and F1 rather than cascade detection metrics; we provide a qualitative comparison in Section [VIII-C](https://arxiv.org/html/2606.04435#S8.SS3).
### VI-D Evaluation Metrics
We assess performance using six metrics, including one measurement introduced for the first time in this paper to standardize cascade evaluation:
- Cascade Detection Rate (CDR): Percentage of injected cascades identified before final output.
- False Positive Rate (FPR): Percentage of grounded trajectories incorrectly flagged.
- Error Propagation Reduction (EPR): Reduction in final output error rate, computed per Equation [5](https://arxiv.org/html/2606.04435#S6.E5).
- Mitigation Success Rate (MSR): Percentage of detected cascades successfully resolved.
- Cascade Depth at Detection (CDD): Average pipeline stage ($s_1 \dots s_n$) at which a cascade is detected. To our knowledge, no prior work standardizes cascade detection depth as a quantitative trajectory metric; while AgentHallu [[30](https://arxiv.org/html/2606.04435#bib.bib30)] localizes hallucination origin post-hoc, CDD captures detection timing at inference time as a standardized, reusable evaluation criterion. Lower values indicate earlier intervention.
- Latency Overhead per Stage (LO/s): Average additional wall-clock processing time (in milliseconds) introduced by CHARM components at each individual pipeline stage.
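The detection-side metrics above (CDR, FPR, CDD) reduce to simple counts over labeled trajectories. The sketch below assumes each trajectory is summarized as a pair `(injected, detect_stage)`, where `detect_stage` is `None` if the CRT never fired. This record format and the function names are illustrative assumptions, and any flag on an injected trajectory is counted as a detection, without applying a stage-based criterion.

```python
from typing import List, Optional, Tuple

# (injected?, stage at which the CRT fired or None); hypothetical record format.
Record = Tuple[bool, Optional[int]]


def cdr(records: List[Record]) -> float:
    """Cascade Detection Rate: share of injected trajectories that were flagged."""
    injected = [d for inj, d in records if inj]
    return sum(d is not None for d in injected) / len(injected)


def fpr(records: List[Record]) -> float:
    """False Positive Rate: share of clean trajectories that were flagged."""
    clean = [d for inj, d in records if not inj]
    return sum(d is not None for d in clean) / len(clean)


def cdd(records: List[Record]) -> float:
    """Cascade Depth at Detection: mean firing stage over detected cascades."""
    stages = [d for inj, d in records if inj and d is not None]
    return sum(stages) / len(stages)
```

Each function raises `ZeroDivisionError` on an empty subset, which is a deliberate simplification; an evaluation harness would likely report such cases explicitly.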
### VI-E Ablation Study
To quantify the contribution of individual CHARM components, we evaluate six ablated configurations on HotpotQA. Results are presented in Table [IV](https://arxiv.org/html/2606.04435#S6.T4).
TABLE IV: Component Ablation Study (CDR on HotpotQA)

The ablation confirms that SFV is the strongest individual component (61.2% CDR), consistent with its role in catching Retrieval and Inference Cascades, the most frequent types in HotpotQA's structured two-hop format. CSCT adds complementary coverage for longer semantic drift trajectories (+18.2 percentage points over SFV alone). CPM's standalone contribution is limited (38.3%), reflecting the inherent difficulty of confidence-only detection; however, its addition to SFV+CSCT yields a further +6.4 percentage point gain, confirming it catches Confidence Inflation Cascades missed by the other two components. Each component carries a meaningful detection contribution, validating the four-component architecture. Table [V](https://arxiv.org/html/2606.04435#S6.T5) reports per-mitigation effectiveness.
TABLE V: Mitigation Pattern Effectiveness

Cross-Dataset Generalization: The per-dataset CDR results in Table [VII](https://arxiv.org/html/2606.04435#S6.T7) serve as an implicit cross-dataset ablation: CHARM's performance advantage over the single-component Output-Level baseline (which approximates SFV-only behavior) is 66.4 pp on HotpotQA, 63.7 pp on MuSiQue, and 66.0 pp on 2WikiMultiHopQA, indicating that multi-component coverage benefits generalize across reasoning topologies rather than being specific to HotpotQA's two-hop structure. Additionally, under the no-logit API condition (CPM using contradiction probability fallback), Full CHARM retains +4.1 pp CDR above the SFV+CSCT configuration on HotpotQA, confirming that CPM's Bayesian trajectory modeling provides complementary signal even without direct logit access.
To assess signal independence, we computed the Pearson correlation between SFV entailment anomaly scores and CPM contradiction fallback scores across all clean and injected trajectories: $r = 0.31$ ($p < 0.001$), indicating moderate but non-redundant correlation. CPM's contradiction signal captures trajectories where confidence rises despite neutral or contradictory NLI output, a distinct pattern from SFV's entailment deficit.
While simpler temporal anomaly detectors such as EWMA or CUSUM could serve as CPM alternatives, the Beta-Bayesian formulation offers a natural probabilistic interpretation of confidence trajectory drift and produces a directly interpretable posterior mean $\mu_i$ as the expected confidence baseline. Empirical comparison against EWMA-based CPM is a planned evaluation extension.
CRT Weight and Threshold Robustness: The weights $(0.4, 0.4, 0.2)$ and threshold $\theta = 0.55$ were selected by grid search on held-out validation splits optimizing $F_1$ between CDR and $(1 - \text{FPR})$ (Section [III-F](https://arxiv.org/html/2606.04435#S3.SS6)). Equal weights $(0.33, 0.33, 0.33)$ assign CPM the same weight as the more reliable SFV and CSCT, which prior calibration experiments showed inflates FPR due to CPM's inherently noisier signal without logit access. A full ROC/AUPRC sensitivity analysis over $\theta$ and weight grids is a planned evaluation addition; the current fixed-weight design is justified by interpretability and cross-dataset transfer without retraining.
### VI-F Results and Analysis
As presented in Table [VI](https://arxiv.org/html/2606.04435#S6.T6), output-level detectors and LLM Self-Correction failed dramatically in cascading scenarios. Because downstream reasoning steps were coherent relative to the corrupted intermediate context, self-correction suffered from severe confirmation bias (12.8% CDR). RAGAS achieved 41.7% CDR by catching retrieval-stage errors but entirely missed inference and confidence inflation cascades, which occur after the retrieval stage it monitors. CHARM achieved an 89.4% CDR and an average CDD of 2.1, indicating that it typically interrupts error propagation by the second reasoning stage, with a per-stage component overhead of $215 \pm 18$ ms. Unlike SelfCheckGPT and RAGAS, CHARM's detection components require no additional LLM calls; furthermore, early cascade detection (average CDD of 2.1) halts pipeline execution before the computationally expensive later stages run, meaning CHARM's end-to-end wall-clock cost is lower than a naive per-stage comparison suggests. The 450 ms overhead reported for SelfCheckGPT and 380 ms for RAGAS reflect their inherent additional LLM sampling and faithfulness-scoring calls respectively, making a direct LO/s comparison across these methods a conservative view that understates CHARM's relative efficiency. CHARM achieved an MSR of 91.3%, indicating that when a cascade is flagged, the automated mitigation patterns (M1–M4) resolve the error and restore trajectory alignment in most cases. Process-level baselines EVER and IRCoT are discussed qualitatively in Section [VIII-C](https://arxiv.org/html/2606.04435#S8.SS3), as they report answer-level EM and F1 scores that are not directly comparable to cascade-specific detection metrics.
All reported CDR and EPR improvements over the strongest single baseline (RAGAS, CDR = 41.7%) are statistically significant at $p < 0.01$ under a paired bootstrap test [[31](https://arxiv.org/html/2606.04435#bib.bib31)] with 10,000 resamples. Resampling was performed at the trajectory level: each resample draws $N$ trajectories with replacement from the full evaluation pool ($N = 1{,}500$ injected + 500 clean trajectories across all four datasets), recomputes CDR, FPR, and EPR for both CHARM and RAGAS on the resample, and records the difference. The reported $p$-value is the fraction of resamples where RAGAS equaled or exceeded CHARM.
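The trajectory-level paired bootstrap described above can be sketched as follows for a single metric. This is a simplified illustration, not the authors' evaluation code: it compares detection counts only (one 0/1 outcome per injected trajectory for each system), and the function name, inputs, and fixed seed are assumptions.

```python
import random
from typing import Sequence


def paired_bootstrap_p(charm_hits: Sequence[int], base_hits: Sequence[int],
                       n_resamples: int = 10_000, seed: int = 0) -> float:
    """Fraction of resamples in which the baseline matches or beats CHARM.

    charm_hits / base_hits: 1 if the system detected the cascade on that
    trajectory, else 0. Index i must refer to the same trajectory in both.
    """
    if len(charm_hits) != len(base_hits) or not charm_hits:
        raise ValueError("inputs must be non-empty and paired")
    rng = random.Random(seed)
    n = len(charm_hits)
    not_better = 0
    for _ in range(n_resamples):
        # Resample trajectory indices with replacement, keeping pairs intact.
        idx = [rng.randrange(n) for _ in range(n)]
        charm = sum(charm_hits[i] for i in idx)
        base = sum(base_hits[i] for i in idx)
        if base >= charm:
            not_better += 1
    return not_better / n_resamples
```

Resampling indices rather than the two outcome lists separately is what makes the test paired: each resample evaluates both systems on the same set of trajectories.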
TABLE VI: Performance Comparison of Detection Frameworks on Cascading Trajectories. CHARM results are mean $\pm$ standard deviation over five independent runs.

Performance Across Datasets: To ensure CHARM's robustness across different reasoning topologies, we disaggregated CDR across the four evaluation datasets (Table [VII](https://arxiv.org/html/2606.04435#S6.T7)). CHARM maintained high efficacy across all datasets, with HotpotQA yielding the highest performance (92.5%) as its structured two-hop format produces cleaner semantic transitions, making drift detection more reliable than in compositional multi-document tasks. Performance dropped slightly on MuSiQue and 2WikiMultiHopQA (86.1% and 87.8% respectively) due to the inherent complexity of compositional reasoning, which occasionally masked anomalous semantic shifts from the CSCT component. CHARM consistently outperformed RAGAS, the strongest retrieval-level baseline, across all categories.
TABLE VII: Cascade Detection Rate (CDR) by Dataset

Robustness to Near-Miss Distractors: To evaluate CHARM under long-context stress conditions analogous to Self-RAG's distractor evaluation conditions [[32](https://arxiv.org/html/2606.04435#bib.bib32)], we constructed a distractor stress variant of the Custom Adversarial Set in which each trajectory included three semantically proximate but factually incorrect documents alongside the correct source. Under these conditions, CHARM's CDR dropped to 84.1% (vs. 91.2% without distractors), with FPR increasing to 7.8%. The CSCT component was most affected, as embedding-proximal distractors occasionally passed cosine similarity drift detection. This identifies adversarial embedding-proximal attacks as a meaningful attack surface and informs the adversarial robustness discussion in Section [VII-D](https://arxiv.org/html/2606.04435#S7.SS4).
### VI-G Naturally Occurring Cascade Pilot
To assess ecological validity beyond synthetic injections, we ran CHARM on 50 naturally occurring HotpotQA failure trajectories—agent runs that produced incorrect final answers without any injected perturbation, drawn from the full evaluation split\. Among these, CHARM flagged anomalous trajectory signals in 38 of 50 cases \(76%\), with the CRT triggering at stage 2\.3 on average\. Manual inspection of the 38 flagged cases confirmed cascade\-like characteristics \(local coherence with global error\) in 34 of 38 \(89\.5%\), and found independent stage errors \(non\-cascading\) in 4 cases\. The 12 unflagged cases contained errors that emerged only at the final synthesis stage, beyond CHARM’s cross\-stage monitoring window\. While a larger\-scale natural cascade corpus remains future work, this pilot provides initial evidence that CHARM’s detection generalizes beyond synthetic injection conditions\.
## VII Discussion
The empirical results demonstrate that CHARM effectively interrupts cascading errors, but the broader impact of this framework extends into enterprise AI governance\.
### VII-A Alignment with NIST AI Risk Management Frameworks
A critical imperative for responsible AI adoption in the United States is alignment with federal guidelines\. In July 2024, the National Institute of Standards and Technology \(NIST\) released the Artificial Intelligence Risk Management Framework: Generative AI Profile \(NIST AI 600\-1\)\[[2](https://arxiv.org/html/2606.04435#bib.bib2)\], explicitly identifying “Confabulation” \(hallucination\) as a primary risk category\. CHARM directly addresses this named risk by mapping its architectural mitigations to the foundational functions of the broader NIST AI RMF\[[33](https://arxiv.org/html/2606.04435#bib.bib33)\], as detailed in Table[VIII](https://arxiv.org/html/2606.04435#S7.T8)\.
TABLE VIII: CHARM Component Mapping to NIST AI Risk Management Frameworks
### VII-B Enterprise Deployment in Regulated Industries
As U\.S\. enterprises accelerate AI deployment—with 78% of organizations now using AI in at least one business function and 23% actively scaling agentic AI systems\[[34](https://arxiv.org/html/2606.04435#bib.bib34)\]—the theoretical risks of multi\-step hallucination become concrete operational vulnerabilities\. Cascading hallucinations are uniquely dangerous in regulated industries such as financial services and legal compliance, where downstream decisions are highly sensitive to initial inputs\. Furthermore, as these systems integrate with external enterprise tools, the risk of context poisoning via adversarial inputs\[[35](https://arxiv.org/html/2606.04435#bib.bib35)\]necessitates robust cross\-stage validation\. Because CHARM is highly retrofittable, it provides a practical pathway for organizations to secure their existing production\-grade deployments without requiring expensive architectural overhauls\.
### VII-C The Complete Reliability and Governance Stack
CHARM is explicitly designed to integrate with the HITL\-AP framework\[[21](https://arxiv.org/html/2606.04435#bib.bib21)\]\. Together, they form a comprehensive security stack: CHARM continuously monitors semantic trajectories and interrupts low\-confidence cascades autonomously, while routing high\-confidence cascades to the HITL\-AP human approval checkpoints\. This integrated architecture ensures agentic systems remain tethered to enterprise governance protocols\.
### VII-D Limitations and Adversarial Robustness
The computational overhead of the PVA \(M3\) may be prohibitive for latency\-sensitive applications\. While the FPR is manageable at 5\.3%, it can become elevated in highly ambiguous domains where ground truth is deeply nuanced\. The current evaluation scope is limited to text\-based agentic RAG pipelines; extending to multimodal trajectories remains an open challenge\.
Semantic Illusion Boundary:A known limitation of embedding\- and NLI\-based detectors is reduced effectiveness on “semantic illusion” hallucinations, where RLHF\-era models produce factually incorrect outputs that remain semantically proximate to the correct answer\[[36](https://arxiv.org/html/2606.04435#bib.bib36)\]\. Our current evaluation uses synthetic cascade injections \(GPT\-4o\-generated counterfactuals, context perturbations\) which are semantically distinguishable by design; performance on datasets specifically engineered to induce semantic illusions \(e\.g\., HaluEval\[[37](https://arxiv.org/html/2606.04435#bib.bib37)\]\) may differ\. Evaluating CHARM’s SFV and CSCT on such benchmarks and hybridizing with reasoning\-capable LLM judges as an alternative SFV backend is an identified extension for future work\.
Adversarial Robustness Boundaries:A sophisticated adversary aware of CHARM’s detection mechanisms could engineer context poisoning attacks specifically designed to evade the CSCT’s cosine similarity drift detection—for example, by constructing counterfactual documents that are semantically proximate to ground truth while remaining factually incorrect\. Similarly, entailment\-ambiguous injections, where the false claim is logically consistent with but not entailed by the evidence, could evade the SFV’s NLI threshold\. As demonstrated by our distractor stress test \(Section[VI\-F](https://arxiv.org/html/2606.04435#S6.SS6)\), embedding\-proximal attacks reduce CDR by 7\.1 percentage points\. Hardening CHARM against such white\-box adversarial attacks via adversarial fine\-tuning of the SFV cross\-encoder is a critical direction for future work, particularly for the Context Poisoning Cascade type\[[27](https://arxiv.org/html/2606.04435#bib.bib27)\]\.
Synthetic Injection Scope: The primary evaluation relies on controlled cascade injection rather than organically occurring cascades from real agent deployments\. While synthetic injection enables precise ground truth labeling and controlled comparison across cascade types, it may not fully represent the distribution of naturally occurring cascades\. The 200\-trajectory custom adversarial set partially mitigates this by including manually designed cascade scenarios, but human annotation of natural multi\-step agent failures at scale remains an important gap\. Constructing a human\-annotated natural cascade corpus and validating CHARM’s detection on it is an identified priority for future work\. Results should therefore be interpreted as evidence of controlled cascade detection efficacy under structured failure conditions rather than definitive performance on naturally occurring enterprise agent trajectories\.
## VIII Related Work
Our research intersects with three primary domains: hallucination detection, multi\-step reasoning evaluation, and agentic system reliability\. By isolating the phenomenon of compounding errors, we differentiate CHARM from existing point\-in\-time evaluation methods\.
### VIII\-A Hallucination Detection in LLMs
The proliferation of LLMs has driven significant research into hallucination detection and mitigation\[[9](https://arxiv.org/html/2606.04435#bib.bib9),[38](https://arxiv.org/html/2606.04435#bib.bib38)\]\. Most existing approaches evaluate individual generation outputs in isolation\. Methods like SelfCheckGPT\[[4](https://arxiv.org/html/2606.04435#bib.bib4)\] leverage zero\-resource sampling to detect inconsistencies in black\-box LLM generations\. Fine\-grained atomic evaluation frameworks such as FActScore\[[5](https://arxiv.org/html/2606.04435#bib.bib5)\] break down long\-form generations into verifiable claims\. In the context of RAG, frameworks like RAGAS\[[6](https://arxiv.org/html/2606.04435#bib.bib6)\] and ARES\[[39](https://arxiv.org/html/2606.04435#bib.bib39)\] evaluate the faithfulness of an answer against the retrieved context\. While these methods demonstrate high accuracy for single\-step generation, they inherently assume that the retrieved context is uncorrupted or that the reasoning chain is confined to a single transition\. In contrast, CHARM specifically addresses trajectory\-level error propagation across pipeline stages as a first\-class architectural concern, operating as a passive retrofit layer that models cross\-stage semantic drift and confidence inflation dynamics, capabilities not jointly addressed by any prior single framework\.
### VIII\-B Multi\-Step Reasoning Failures
The foundation for multi\-step LLM execution stems from advancements like Chain\-of\-Thought \(CoT\) prompting\[[11](https://arxiv.org/html/2606.04435#bib.bib11)\], which allows models to break complex problems into intermediate steps\. However, research into compositional reasoning errors\[[12](https://arxiv.org/html/2606.04435#bib.bib12)\] has demonstrated that LLMs frequently suffer from logical derailment as reasoning depth increases\. These works establish the theoretical foundation for error compounding, showing that local coherence does not guarantee global factual accuracy\. CHARM builds upon this theoretical foundation by operationalizing it into a detectable architectural metric \(CDD\)\.
### VIII\-C Process\-Level Verification and Planning
Recent work has begun to address error propagation at the process level rather than the output level, with methods reporting answer\-level accuracy \(EM and F1\) on the same datasets used in this paper\.
EVER\[[28](https://arxiv.org/html/2606.04435#bib.bib28)\] applies real\-time, step\-wise generation with retrieval\-based verification and rectification, reporting improvements in multi\-hop reasoning on HotpotQA under EM and F1 evaluation\. EVER explicitly targets the “snowballing” hallucination phenomenon \(errors that compound across sequential reasoning steps\), which aligns closely with the cascading failure mode formalized in this paper\. However, EVER operates at the claim level within individual generation steps and does not model cross\-stage semantic trajectory or confidence propagation across a structured pipeline\. Consequently, it cannot detect Confidence Inflation Cascades or Context Poisoning Cascades that manifest as anomalous trajectory\-level drift rather than local claim\-level contradictions\.
IRCoT\[[29](https://arxiv.org/html/2606.04435#bib.bib29)\] interleaves chain\-of\-thought reasoning with retrieval steps, achieving up to 15 F1\-point improvements on HotpotQA, 2WikiMultiHopQA, and MuSiQue over single\-step retrieval baselines, and reducing factual errors in generated CoT by up to 50%\. While IRCoT reduces upstream error propagation through iterative re\-grounding, it requires deep integration into the reasoning loop and cannot be applied as a passive retrofit layer to existing production pipelines\. Furthermore, IRCoT has no mechanism for detecting Confidence Inflation Cascades, as it does not monitor confidence trajectories\.
Self\-RAG\[[32](https://arxiv.org/html/2606.04435#bib.bib32)\] demonstrates that adaptive self\-reflective retrieval reduces error propagation in agentic generation, and its distractor evaluation conditions motivate our near\-miss stress test in Section [VI\-F](https://arxiv.org/html/2606.04435#S6.SS6)\.
Most directly related to CHARM’s CDD metric is AgentHallu\[[30](https://arxiv.org/html/2606.04435#bib.bib30)\], which performs step\-level localization of hallucination origin in multi\-agent trajectories and provides causal explanations for error emergence\. AgentHallu demonstrates that identifying where in a trajectory a hallucination originates is both feasible and practically valuable\. CHARM’s CDD metric formalizes this intuition as a standardized, quantitative evaluation criterion: while AgentHallu focuses on post\-hoc attribution across agent trajectories, CHARM detects and interrupts cascades at inference time, before the trajectory completes, targeting a different operational point in the pipeline lifecycle\.
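As a rough illustration of how trajectory\-level metrics such as CDR and CDD could be tallied from labeled runs, the sketch below uses a hypothetical `CascadeRecord` type\. It assumes depth is counted as the number of stages from the injection stage through the detecting stage, inclusive; the paper's formal definition of CDD takes precedence over this assumption, and the sample records are invented for the example\.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CascadeRecord:
    """One labeled trajectory (hypothetical schema for this sketch)."""
    origin_stage: int              # 1-indexed stage where the error was injected
    detected_stage: Optional[int]  # stage where it was first flagged, None if missed

def cascade_detection_rate(records: List[CascadeRecord]) -> float:
    """Fraction of injected cascades that were flagged at any stage."""
    if not records:
        raise ValueError("no records")
    detected = sum(1 for r in records if r.detected_stage is not None)
    return detected / len(records)

def mean_depth_at_detection(records: List[CascadeRecord]) -> float:
    """Average depth over detected cascades.

    Assumption: depth = detected_stage - origin_stage + 1.
    """
    depths = [
        r.detected_stage - r.origin_stage + 1
        for r in records
        if r.detected_stage is not None
    ]
    if not depths:
        raise ValueError("no detected cascades")
    return sum(depths) / len(depths)

# Invented sample: three detected cascades and one miss.
records = [
    CascadeRecord(origin_stage=1, detected_stage=2),
    CascadeRecord(origin_stage=1, detected_stage=3),
    CascadeRecord(origin_stage=2, detected_stage=3),
    CascadeRecord(origin_stage=1, detected_stage=None),
]
cdr = cascade_detection_rate(records)
cdd = mean_depth_at_detection(records)
print(f"CDR={cdr:.2f}, mean CDD={cdd:.2f}")
```

Under this convention a lower mean depth indicates earlier intervention, and missed cascades lower CDR without affecting the depth average\.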
Production\-grade non\-LLM verifier stacks combining retrieval\-aware relevance scoring with NLI are conceptually aligned with CHARM’s SFV component; such verifiers could serve as drop\-in SFV backends given CHARM’s modular design\. Small reasoning verifiers that provide factuality discrimination with explanations represent another natural SFV backend option for resource\-constrained deployments\. Evaluating CHARM with alternative verifier backends is a planned extension\.
TABLE IX: Qualitative Comparison of Related Frameworks

Table [IX](https://arxiv.org/html/2606.04435#S8.T9) summarizes the key differentiating dimensions\. Among the frameworks compared, CHARM is the only one that simultaneously detects cascades at inference time, monitors the full cross\-stage trajectory, tracks confidence propagation, and requires no architectural changes to the primary pipeline\.
Relative to AgentHallu’s post\-hoc attribution approach, CHARM differs by operating at inference time rather than retrospectively\. More broadly, CHARM differs from this line of work in three ways: \(1\) it operates as a non\-intrusive parallel monitoring layer requiring no architectural changes to the primary pipeline; \(2\) it jointly models semantic drift, entailment grounding, and confidence trajectory as a unified detection signal rather than any single dimension; and \(3\) it introduces a formally defined cascade taxonomy and the CDD metric that standardize evaluation for this class of failures\. Because EVER and IRCoT report EM and F1 scores while CHARM reports cascade\-specific metrics \(CDR, FPR, EPR, CDD\), direct numerical comparison is not presented; the contribution of CHARM is orthogonal, in that it provides the detection and governance infrastructure within which methods like IRCoT could operate\. Multi\-agent verification frameworks such as MARCH\[[40](https://arxiv.org/html/2606.04435#bib.bib40)\] and cryptographically\-grounded approaches such as FINCH\-ZK\[[41](https://arxiv.org/html/2606.04435#bib.bib41)\] provide complementary hallucination mitigation angles; CHARM’s SFV component could be instantiated with any such verifier backend, making the CHARM architecture extensible to these approaches\.
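The idea of a non\-intrusive monitoring layer with pluggable verifier backends can be sketched as a thin wrapper around existing stage functions\. Everything here is hypothetical: `run_with_monitor`, the `Stage` and `Check` signatures, and the keyword\-based check are illustrative stand\-ins, not CHARM's released interfaces, and checks run inline after each stage for simplicity rather than in parallel\.

```python
from typing import Callable, List, Optional, Tuple

# A stage maps the running context to a new context; stages are left unmodified.
Stage = Callable[[str], str]
# A check inspects (stage index, stage output, prior outputs) and returns True to flag.
Check = Callable[[int, str, List[str]], bool]

def run_with_monitor(
    stages: List[Stage], checks: List[Check], query: str
) -> Tuple[str, Optional[int]]:
    """Run stages in order and stop at the first stage whose output is flagged.

    Returns the last output and the 1-indexed stage that was flagged,
    or None if the trajectory completed without a flag.
    """
    history: List[str] = []
    current = query
    for index, stage in enumerate(stages, start=1):
        current = stage(current)
        if any(check(index, current, history) for check in checks):
            return current, index
        history.append(current)
    return current, None

# Toy pipeline (hypothetical): the second stage introduces an unsupported claim.
stages = [
    lambda q: q + " -> retrieved",
    lambda c: c + " -> UNSUPPORTED claim",
    lambda c: c + " -> answer",
]

def keyword_check(index: int, output: str, history: List[str]) -> bool:
    """Stand-in verifier; a real backend might use NLI or an LLM judge."""
    return "UNSUPPORTED" in output

output, flagged_at = run_with_monitor(stages, [keyword_check], "question")
print(flagged_at, output)
```

Because the stages are passed in unchanged and the checks are a separate list, swapping verifier backends in this sketch means replacing a check function rather than editing the pipeline\.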
### VIII\-D Agentic System Reliability
As LLMs evolve from isolated chatbots to autonomous agents equipped with tool use\[[42](https://arxiv.org/html/2606.04435#bib.bib42)\], evaluating system reliability has become increasingly complex\. Our foundational SoK analysis of Agentic RAG architectures\[[1](https://arxiv.org/html/2606.04435#bib.bib1)\] mapped the current design landscape and explicitly identified the lack of cross\-stage context monitoring as a critical vulnerability in enterprise deployments\. This paper directly addresses the evaluation gap identified in that prior work, providing the necessary reliability layer to support responsible AI adoption\. Real\-world cascade testbeds such as OHRBench\[[43](https://arxiv.org/html/2606.04435#bib.bib43)\], which studies multi\-stage failures originating from OCR noise in document\-heavy pipelines, represent a natural extension for evaluating CHARM beyond QA\-only settings and would validate detection capabilities on organically occurring cascades\.
## IX Conclusion
Multi\-step agentic RAG pipelines are highly vulnerable to cascading hallucinations, a failure mode where early\-stage contextual errors silently compound into confident, structurally sound fabrications\. To address this, we introduced the Cascading Hallucination Aware Resolution and Mitigation \(CHARM\) framework\. By continuously monitoring cross\-stage semantic trajectories with formally operationalized detection quantities, CHARM successfully interrupted cascading failures before they corrupted the terminal output, achieving an 89\.4% Cascade Detection Rate \(CDR\) and intervening early with an average Cascade Depth at Detection \(CDD\) of 2\.1, outperforming all four direct baselines by substantial margins while introducing only 215 ms per\-stage overhead\. Component ablations confirm that each of the four detection modules contributes meaningfully to overall cascade coverage\.
This work makes three primary contributions to the field of agentic AI reliability\. First, it formalizes a four\-type taxonomy \(Retrieval, Inference, Context Poisoning, and Confidence Inflation Cascades\) specifically tailored to multi\-step reasoning systems, with concrete operational definitions for all core formal quantities\. Second, it presents the CHARM detection architecture, comprising four modular tracking components—SFV, CSCT, CPM, and CRT—capable of identifying compounding errors without interrupting valid trajectory flow\. Finally, it outlines four implementable mitigation patterns that provide configurable recovery trade\-offs for production\-grade deployments\.
Future work will focus on two specific directions\. First, we plan to extend CHARM to multimodal agentic pipelines, mapping how semantic divergence propagates across visual and auditory reasoning chains\. Second, we aim to develop adaptive threshold calibration for CHARM components based on domain\-specific risk profiles\. Exploring this through a security lens—specifically by integrating CHARM’s context poisoning detection with the Zero Trust framework for the Model Context Protocol \(ZT\-MCP\)\[[44](https://arxiv.org/html/2606.04435#bib.bib44)\]—will be crucial for defending agentic systems against adversarial cascade scenarios in critical enterprise environments\.
To support the research community, we release all experimental artifacts, including the agentic trajectory wrapper, cascade injection scripts, annotated adversarial trajectories, and the CHARM evaluation harness, at [https://github\.com/sarmishra/CHARM\-agentic\-rag](https://github.com/sarmishra/CHARM-agentic-rag)\.
## References
- \[1\]S\. Mishra, S\. Niroula, U\. Yadav, D\. Thakur, S\. Gyawali, and S\. Gaire, “Sok: Agentic retrieval\-augmented generation \(rag\): Taxonomy, architectures, evaluation, and research directions,”*arXiv preprint arXiv:2603\.07379*, 2026\.
- \[2\]National Institute of Standards and Technology, “Artificial intelligence risk management framework: Generative ai profile \(nist ai 600\-1\),” U\.S\. Department of Commerce, Tech\. Rep\., July 2024\. \[Online\]\. Available:[https://nvlpubs\.nist\.gov/nistpubs/ai/NIST\.AI\.600\-1\.pdf](https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf)
- \[3\]S\. Mishra and H\. Reza, “A face recognition method using deep learning to identify mask and unmask objects,” in*2022 IEEE World AI IoT Congress \(AIIoT\)*\. IEEE, 2022, pp\. 091–099\.
- \[4\]P\. Manakul, A\. Liusie, and M\. J\. Gales, “Selfcheckgpt: Zero\-resource black\-box hallucination detection for generative large language models,”*arXiv preprint arXiv:2303\.08896*, 2023\.
- \[5\]S\. Min, K\. Krishna, X\. Lyu, M\. Lewis, W\.\-t\. Yih, P\. W\. Koh, M\. Iyyer, L\. Zettlemoyer, and H\. Hajishirzi, “Factscore: Fine\-grained atomic evaluation of factual precision in long form text generation,”*arXiv preprint arXiv:2305\.14251*, 2023\.
- \[6\]S\. Es, J\. James, L\. Espinosa\-Anke, and S\. Schockaert, “Ragas: Automated evaluation of retrieval augmented generation,”*arXiv preprint arXiv:2309\.15217*, 2023\.
- \[7\]N\. Shinn, F\. Labash, A\. Gopinath, and K\. Narasimhan, “Reflexion: Language agents with verbal reinforcement learning,”*arXiv preprint arXiv:2303\.11366*, 2023\.
- \[8\]P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\.\-t\. Yih, T\. Rocktäschel*et al\.*, “Retrieval\-augmented generation for knowledge\-intensive nlp tasks,” in*Advances in Neural Information Processing Systems*, vol\. 33, 2020, pp\. 9459–9474\.
- \[9\]Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. Bang, A\. Madotto, and P\. Fung, “Survey of hallucination in natural language generation,”*ACM Computing Surveys*, vol\. 55, no\. 12, pp\. 1–38, 2023\.
- \[10\]L\. Pan, M\. Saxon, R\. Connor, A\. Sharma, and W\. Y\. Wang, “Automatically correcting large language models: Survey and taxonomy,”*arXiv preprint arXiv:2308\.03188*, 2023\.
- \[11\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou*et al\.*, “Chain\-of\-thought prompting elicits reasoning in large language models,”*Advances in Neural Information Processing Systems*, vol\. 35, pp\. 24 824–24 837, 2022\.
- \[12\]N\. Dziri, X\. Lu, M\. Sclar, X\. L\. Li, L\. Jian, B\. Y\. Lin, P\. West, C\. Bhagavatula, R\. L\. Bras, J\. D\. Hwang*et al\.*, “Faith and fate: Limits of transformers on compositionality,”*Advances in Neural Information Processing Systems*, vol\. 36, 2023\.
- \[13\]Y\. Gao, Y\. Xiong, X\. Gao, K\. Jia, J\. Pan, Y\. Bi, Y\. Dai, J\. Sun, and H\. Wang, “Retrieval\-augmented generation for large language models: A survey,”*arXiv preprint arXiv:2312\.10997*, 2023\.
- \[14\]H\. Lightman*et al\.*, “Let’s verify step by step,”*arXiv preprint arXiv:2305\.20050*, 2023\.
- \[15\]N\. Reimers and I\. Gurevych, “Sentence\-bert: Sentence embeddings using siamese bert\-networks,” in*Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing*, 2019, pp\. 3982–3992\.
- \[16\]P\. He, X\. Liu, J\. Gao, and W\. Chen, “Deberta: Decoding\-enhanced bert with disentangled attention,” in*International Conference on Learning Representations*, 2021\.
- \[17\]S\. Kadavath*et al\.*, “Language models \(mostly\) know what they know,”*arXiv preprint arXiv:2207\.05221*, 2022\.
- \[18\]C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger, “On calibration of modern neural networks,” in*Proceedings of the 34th International Conference on Machine Learning \(ICML\)*, 2017\.
- \[19\]H\. Chase, “Langchain: Building applications with llms through composability,”[https://github\.com/hwchase17/langchain](https://github.com/hwchase17/langchain), 2023\.
- \[20\]J\. Liu, “Llamaindex: A data framework for large language models,”[https://github\.com/jerryjliu/llama\_index](https://github.com/jerryjliu/llama_index), 2023\.
- \[21\]S\. Mishra, “Trustworthy agentic ai pipelines: Human\-in\-the\-loop oversight architectures for secure enterprise deployment,”*ResearchGate preprint*, 2026\.
- \[22\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao, “ReAct: Synergizing reasoning and acting in language models,” in*International Conference on Learning Representations \(ICLR\)*, 2023\.
- \[23\]J\. Johnson, M\. Douze, and H\. Jégou, “Billion\-scale similarity search with GPUs,”*IEEE Transactions on Big Data*, 2019\.
- \[24\]Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning, “Hotpotqa: A dataset for diverse, explainable multi\-hop question answering,” in*Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, 2018, pp\. 2369–2380\.
- \[25\]H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal, “Musique: Multihop questions via single\-hop question composition,” in*Transactions of the Association for Computational Linguistics*, vol\. 10, 2022, pp\. 539–554\.
- \[26\]X\. Ho, A\.\-K\. D\. Nguyen, S\. Sugawara, and A\. Aizawa, “Constructing a multi\-hop qa dataset for comprehensive evaluation of reasoning steps,” in*Proceedings of the 28th International Conference on Computational Linguistics*, 2020, pp\. 6609–6625\.
- \[27\]F\. Perez and I\. Ribeiro, “Ignore previous prompt: Attack techniques for language models,” in*NeurIPS ML Safety Workshop*, 2022\.
- \[28\]H\. Kang, J\. Ni, and H\. Yao, “EVER: Mitigating hallucination in large language models through real\-time verification and rectification,”*arXiv preprint arXiv:2311\.09114*, 2023\.
- \[29\]H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal, “Interleaving retrieval with chain\-of\-thought reasoning for knowledge\-intensive multi\-step questions,” in*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(ACL\)*, 2023, pp\. 10 014–10 037\.
- \[30\]X\. Liu, X\. Yang, Z\. Li, P\. Li, and R\. He, “Agenthallu: Benchmarking automated hallucination attribution of llm\-based agents,”*arXiv preprint arXiv:2601\.06818*, 2026\.
- \[31\]B\. Efron and R\. J\. Tibshirani,*An Introduction to the Bootstrap*\. Chapman & Hall/CRC, 1994\.
- \[32\]A\. Asai, Z\. Wu, Y\. Wang, A\. Salmani, and H\. Hajishirzi, “Self\-RAG: Learning to retrieve, generate, and critique through self\-reflection,” in*International Conference on Learning Representations \(ICLR\)*, 2024\.
- \[33\]National Institute of Standards and Technology, “Artificial intelligence risk management framework \(ai rmf 1\.0\) \(nist trustworthy and responsible ai\),” U\.S\. Department of Commerce, Tech\. Rep\. NIST IR 8259, January 2023\. \[Online\]\. Available:[https://doi\.org/10\.6028/NIST\.AI\.100\-1](https://doi.org/10.6028/NIST.AI.100-1)
- \[34\]A\. Singla, A\. Sukharevsky, L\. Yee, M\. Chui, and B\. Hall, “The state of AI: How organizations are rewiring to capture value,” McKinsey & Company, Tech\. Rep\., March 2025, accessed: May 2026\. \[Online\]\. Available:[https://www\.mckinsey\.com/capabilities/quantumblack/our\-insights/the\-state\-of\-ai\-how\-organizations\-are\-rewiring\-to\-capture\-value](https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-how-organizations-are-rewiring-to-capture-value)
- \[35\]S\. Gaire, S\. Gyawali, S\. Mishra, S\. Niroula, D\. Thakur, and U\. Yadav, “Systematization of knowledge: Security and safety in the model context protocol ecosystem,”*arXiv preprint arXiv:2512\.08290*, 2025\.
- \[36\]S\. Li, S\. Park, I\. Lee, and O\. Bastani, “Traq: Trustworthy retrieval augmented question answering via conformal prediction,” in*Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, 2024, pp\. 3799–3821\.
- \[37\]J\. Li, X\. Cheng, W\. X\. Zhao, J\.\-Y\. Nie, and J\.\-R\. Wen, “HaluEval: A large\-scale hallucination evaluation benchmark for large language models,”*arXiv preprint arXiv:2305\.11747*, 2023\.
- \[38\]S\. T\. I\. Tonmoy, S\. Zaman, V\. Jain, A\. Krause, T\. Goswami*et al\.*, “A comprehensive survey of hallucination mitigation techniques in large language models,”*arXiv preprint arXiv:2401\.01313*, 2024\.
- \[39\]J\. Saad\-Falcon, O\. Khattab, C\. Potts, and M\. Zaharia, “Ares: An automated evaluation framework for retrieval\-augmented generation systems,”*arXiv preprint arXiv:2311\.09476*, 2023\.
- \[40\]Z\. Li, Y\. Zhang, P\. Cheng, J\. Song, M\. Zhou, H\. Li, S\. Hu, Y\. Qin, E\. Zhao, X\. Jiang*et al\.*, “March: Multi\-agent reinforced self\-check for llm hallucination,”*arXiv preprint arXiv:2603\.24579*, 2026\.
- \[41\]A\. Goel, D\. Schwartz, and Y\. Qi, “Zero\-knowledge llm hallucination detection and mitigation through fine\-grained cross\-model consistency,” in*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track*, 2025, pp\. 1982–1999\.
- \[42\]L\. Wang, C\. Ma, X\. Feng, Z\. Zhang, H\. Yang, J\. Zhang, Z\. Chen, J\. Tang, X\. Chen, Y\. Lin*et al\.*, “A survey on large language model based autonomous agents,”*Frontiers of Computer Science*, vol\. 18, no\. 6, p\. 186345, 2024\.
- \[43\]J\. Zhang, Q\. Zhang, B\. Wang, L\. Ouyang, Z\. Wen, Y\. Li, K\.\-H\. Chow, C\. He, and W\. Zhang, “Ocr hinders rag: Evaluating the cascading impact of ocr on retrieval\-augmented generation,” in*Proceedings of the IEEE/CVF International Conference on Computer Vision*, 2025, pp\. 17 443–17 453\.
- \[44\]S\. Mishra, “Zt\-mcp: A zero\-trust security architecture for mcp\-connected ai agents,”*ResearchGate preprint*, 2026\.