ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents
Summary
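ProvenanceGuard is a source-aware factuality verifier for MCP-driven LLM agents. It addresses cross-source conflation by decomposing an answer into atomic claims, routing each claim to source-specific evidence, checking support, and verifying attribution. In a medical-domain evaluation, it reaches a block F1 of 0.802 and a source accuracy of 0.858.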
# ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

Source: [https://arxiv.org/html/2606.18037](https://arxiv.org/html/2606.18037)

Ander Alvarez (ander.alvarez@multiversecomputing.com), Multiverse Computing, Parque Cientifico y Tecnológico de Gipuzkoa, Paseo de Miramón, 170, 20014 Donostia / San Sebastián, Spain

Santhiya Rajan (santhiya.rajan@multiversecomputing.com), Multiverse Computing, Parque Cientifico y Tecnológico de Gipuzkoa, Paseo de Miramón, 170, 20014 Donostia / San Sebastián, Spain

Samuel Mugel (sam.mugel@multiversecomputing.com), Multiverse Computing, Centre for Social Innovation, 192 Spadina Avenue Suite 509, Toronto, ON M5T 2C2, Canada

Román Orús (roman.orus@multiversecomputing.com), Multiverse Computing, Parque Cientifico y Tecnológico de Gipuzkoa, Paseo de Miramón, 170, 20014 Donostia / San Sebastián, Spain; Donostia International Physics Center, Paseo Manuel de Lardizabal 4, E-20018 San Sebastián, Spain; Ikerbasque Foundation for Science, Maria Diaz de Haro 3, E-48013 Bilbao, Spain

###### Abstract

Many tool-using LLM agents use Model Context Protocol (MCP) to produce answers from heterogeneous evidence sources, including search results, APIs, databases, clinical records, formulary tools, and other external systems exposed through tool servers. Existing factuality and faithfulness metrics typically evaluate whether an answer is supported by the available context after evidence has been pooled. This abstraction misses an important provenance-sensitive failure mode: a claim may be supported somewhere in the evidence while being attributed to the wrong source. We call this failure cross-source conflation.
We introduce ProvenanceGuard, a source-aware verifier for MCP-grounded answers. ProvenanceGuard is the retained calibrated Router+NLI system: it consumes captured MCP traces with stable tool IDs, source IDs, and raw tool outputs; decomposes an answer into atomic claims; routes claims to source-specific evidence; checks support with NLI and an attention-derived token-alignment proxy; and separately compares the claim’s stated attribution with the routed source. It returns both per-claim source verdicts and an answer-level allow/block decision, and can invoke Retrofit Attribution using Research and Revision (RARR)-style repair to revise blocked answers before re-verification. We evaluate on a frozen prospective corpus of 281 captured medical-domain MCP-agent traces. A 266-trace claim-adjudicated subset yields 2,325 LLM-assisted claim labels, split by trace into train, validation, and held-out test partitions; the 361 held-out labels are then verified by human experts, and the full 281-trace corpus is used for answer-level repair evaluation. On the 40-trace, 361-claim held-out split, ProvenanceGuard reaches block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims. Source-blind claim/evidence baselines on the same packet reach block F1 0.783 for MiniCheck, 0.758 for RAGAS Faithfulness, 0.662 for AlignScore, and 0.436 for SummaC-ZS, but none emit claim-to-source IDs. On a harder multi-source adjudicated benchmark, ProvenanceGuard reaches block F1 0.846 over frozen extracted claims from the locked test questions, while source-plus-relation accuracy drops to 0.229, showing that exact source ownership remains difficult under many semantically close candidate sources.
On the full trace set, a repair-and-reverify loop resolves all 173 blocked answers, though 144 require conservative fallback rather than substantive rewriting; on reconstructed multi-source test traces, a fresh answer-level repair rerun resolves all 59 initially blocked answers with two terminal fallbacks. In 50 generated, clinically framed conflation probes over frozen captured MCP evidence, ProvenanceGuard detects all 50 deliberately injected source-attribution swaps, with no retained wrong attribution. These results indicate that source attribution is an independent evaluation axis for factuality verification in MCP-based agents.

Keywords: Source-aware factuality verification, provenance, Model Context Protocol, tool-using LLM agents, retrieval-augmented revision, natural language inference

## I Introduction

LLM agents that use Model Context Protocol (MCP) increasingly operate over multiple external tools rather than a single retrieved passage. They call tools through MCP servers, inspect structured records, combine multiple outputs, and often generate answers that mix source-grounded facts with general background knowledge and safety disclaimers. A medical agent may combine a PubMed abstract with a patient’s FHIR record, while an enterprise agent may combine a CRM entry, a billing record, and a support ticket. In such settings, factuality verification must assess not only whether a claim is supported somewhere in the available evidence, but also whether the answer assigns the claim to the correct source.

Consider an answer that states: “According to the patient’s chart, empagliflozin reduced a mortality endpoint.” The mortality claim may be supported by a clinical-trial abstract, but not by the patient’s chart. A source-blind verifier that pools the chart and the abstract can mark the claim as supported. A source-aware verifier should instead reject the attribution because the evidential source is incorrect.
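The contrast in the empagliflozin example can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than the paper's implementation: the source IDs, the evidence strings, and the `supports` helper, which stands in for a real support checker with a simple all-terms-present test.

```python
# Toy evidence keyed by hypothetical source IDs; the texts are invented.
EVIDENCE = {
    "patient_chart": "Type 2 diabetes. Current medication: empagliflozin 10 mg daily.",
    "literature_abstract": "Empagliflozin reduced a mortality endpoint in the trial population.",
}


def supports(text, claim_terms):
    """Stand-in support check (assumption): every claim term occurs in the text."""
    lowered = text.lower()
    return all(term in lowered for term in claim_terms)


def source_blind(claim_terms):
    """Pool all evidence into one anonymous context, then check support."""
    pooled = " ".join(EVIDENCE.values())
    return supports(pooled, claim_terms)


def source_aware(claim_terms, stated_source):
    """Check support per source, then compare with the stated attribution."""
    supporting = [sid for sid, text in EVIDENCE.items() if supports(text, claim_terms)]
    if not supporting:
        return "block: unsupported"
    if stated_source not in supporting:
        return "block: conflation"
    return "allow"


claim_terms = ["empagliflozin", "reduced", "mortality"]
blind = source_blind(claim_terms)                   # pooled evidence supports the claim
aware = source_aware(claim_terms, "patient_chart")  # the chart alone does not
```

In this sketch the pooled check accepts the claim while the per-source check blocks it, because only the literature abstract supports the mortality statement.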

[Figure 1 diagram. Claim in answer: “Empagliflozin reduced a mortality endpoint.” Stated source in answer: patient chart. Candidate sources: patient chart (diagnosis, medications; same patient, wrong source); literature abstract (mortality estimate; correct supporting source); PubMed metadata (title, journal, year; same topic, partial source); FHIR labs (patient values; same chart family, no support). Source-blind support over pooled evidence: supported. Source-aware verdict: support yes, attribution block. The same claim can be supported somewhere while attributed to the wrong source.]

Figure 1: Why source-aware factuality is stricter than source-blind support. A claim can be supported by one MCP source while the answer attributes it to another. Source-blind scoring sees support in pooled evidence; ProvenanceGuard separately checks whether the supporting source matches the stated or implied attribution.

The frozen 281-trace captured-agent corpus does not by itself estimate the natural prevalence of this failure mode: the captured random traces contain single source-family outputs, or PubMed search plus metadata outputs that remain within the literature family. We therefore evaluate cross-source conflation with a targeted 50-case source-conflation probe set over frozen captured MCP evidence, where each probe contains one deliberate attribution swap.

This distinction matters because many factuality systems were designed for summary consistency or retrieval faithfulness. They typically score support against a context, rather than claim-to-source ownership. Such scores are informative for unsupported fabrication, but they are insufficient for MCP-grounded agents whose answers carry implicit or explicit provenance claims.

This paper makes four contributions:

1. We formulate source-attribution factuality for MCP-grounded answers: a verifier must evaluate both support and source ownership for each claim.
2. We introduce ProvenanceGuard, a calibrated source-aware Router+NLI verifier that decomposes answers into claims, preserves stable MCP tool IDs and source IDs from raw tool outputs, routes claims to source-specific evidence, and detects cross-source conflation.
3. We report a frozen captured medical-domain agent benchmark built from 281 captured MCP-agent traces, a 266-trace claim subset with 2,325 LLM-assisted labels, and a 40-trace held-out packet with complete MCP tool outputs and human-expert-reviewed labels.
4. We compare ProvenanceGuard against MiniCheck, RAGAS Faithfulness, AlignScore, and SummaC-ZS on the same held-out claim packet, while separately evaluating RARR-style repair on full captured traces and targeted source-conflation probes.

The central claim is limited to source-attribution factuality in MCP-grounded answers; we do not claim to solve open-domain factuality detection, clinical safety validation, or parametric-knowledge correction.

## II Related Work

Our work sits at the intersection of support verification, source attribution, and tool-grounded agent evaluation. We organize prior work by the kind of evidence object each line of work preserves.

#### Fine-grained support verification.

Factual consistency systems such as FActScore, MiniCheck, SummaC, AlignScore, and VeriScore evaluate whether generated claims are supported by evidence \[[14](https://arxiv.org/html/2606.18037#bib.bib1),[19](https://arxiv.org/html/2606.18037#bib.bib2),[13](https://arxiv.org/html/2606.18037#bib.bib3),[23](https://arxiv.org/html/2606.18037#bib.bib4),[18](https://arxiv.org/html/2606.18037#bib.bib11)\].
Fine-grained work decomposes generated text below the sentence level: Dependency Arc Entailment localizes errors at dependency arcs \[[8](https://arxiv.org/html/2606.18037#bib.bib15)\], QASemConsistency expresses predicate-argument propositions as question-answer pairs \[[2](https://arxiv.org/html/2606.18037#bib.bib16)\], and PrefixNLI studies entailment over generation prefixes \[[9](https://arxiv.org/html/2606.18037#bib.bib19)\]. These methods motivate claim-level checking, but they do not generally preserve claim-to-MCP-source IDs. They are therefore relevant to binary allow/block decisions but are not direct baselines for source attribution metrics such as Top-1 source accuracy, recall@k, mean reciprocal rank, or source-set Jaccard.

#### RAG faithfulness and attributed generation.

RAG evaluation frameworks such as RAGAS ask whether answer statements are faithful to retrieved context \[[3](https://arxiv.org/html/2606.18037#bib.bib5)\]. Other work studies answer generation with citations and attribution: ALCE evaluates citation quality for LLM-generated answers \[[7](https://arxiv.org/html/2606.18037#bib.bib6)\]; AttributedQA formalizes attributed question answering \[[1](https://arxiv.org/html/2606.18037#bib.bib7)\]; AutoAIS automates AIS-style attribution judgments \[[22](https://arxiv.org/html/2606.18037#bib.bib9),[16](https://arxiv.org/html/2606.18037#bib.bib8)\]; and TRUE consolidates factual-consistency datasets across summarization, dialogue, paraphrasing, and verification \[[11](https://arxiv.org/html/2606.18037#bib.bib10)\]. More recent attribution work localizes evidence to user-selected spans through LAQuer \[[10](https://arxiv.org/html/2606.18037#bib.bib17)\] or decomposes generation into executable attribution programs \[[21](https://arxiv.org/html/2606.18037#bib.bib18)\]. These methods are close in motivation, but faithfulness to pooled or cited context is not equivalent to source ownership.
A claim can be supported by one retrieved source while being falsely attributed to another. ALCE \[[7](https://arxiv.org/html/2606.18037#bib.bib6)\] evaluates whether LLM-generated citations point to the correct supporting passage, which is the closest existing task to source attribution. However, ALCE operates at the passage or chunk level within a single retrieved set, whereas MCP traces expose stable tool-level source IDs that require a routing step to identify which tool output is responsible for a given claim. Our work can therefore be seen as extending citation-style attribution to the tool-provenance layer.

#### Tool and source attribution in multi-source systems.

As RAG systems become tool-using agents, attribution must track which tool output supplied the evidence. Atomic Information Flow models tool outputs, LLM calls, and final responses as flows of atomic information through an orchestration graph \[[5](https://arxiv.org/html/2606.18037#bib.bib20)\]. FaithfulRAG focuses on conflicts between retrieved evidence and parametric knowledge at the fact level \[[24](https://arxiv.org/html/2606.18037#bib.bib21)\], while Answering with Faithfulness couples answer generation with faithfulness prediction \[[4](https://arxiv.org/html/2606.18037#bib.bib22)\]. Our setting differs because MCP traces expose stable tool and source identifiers. We do not infer latent information flow or resolve parametric-knowledge conflicts; we verify whether the answer’s stated or implied attribution matches the routed MCP source.

#### Post-hoc revision and trained verifiers.

RARR \[[6](https://arxiv.org/html/2606.18037#bib.bib12)\] takes a generated passage, researches evidence, and revises unsupported claims while preserving the original style and structure.
In our setting, RARR-style repair is evaluated after source-aware blocking: the verifier rejects an answer, repair attempts to produce a source-grounded revision or conservative fallback, and the same verifier rechecks the revised answer. Training-based approaches improve consistency at generation or detection time, including reinforcement learning with textual entailment feedback \[[17](https://arxiv.org/html/2606.18037#bib.bib13)\], FactCC’s synthetic inconsistency classifier \[[12](https://arxiv.org/html/2606.18037#bib.bib14)\], and RAGulator’s lightweight out-of-context detectors for grounded generation \[[15](https://arxiv.org/html/2606.18037#bib.bib23)\]. These methods are complementary to ProvenanceGuard, which operates as an independent post-hoc check on black-box MCP-agent outputs. For calibration specifically, alternatives include Platt scaling, isotonic regression, and conformal prediction; we adopt a random-forest calibrator because it handles tabular verifier features after simple one-hot encoding and numeric scaling.

#### Tool-use and agent traces.

Tool-use evaluation often studies whether agents call the right tools, complete tasks, and produce valid final answers. Our focus is narrower: after an agent has produced an answer and the tool trace is available, can a verifier decide whether each claim is supported by the source the answer cites or implies? A related MCP community proposal has identified a similar gap: the proposed verification capability for MCP servers \[[20](https://arxiv.org/html/2606.18037#bib.bib24)\] calls for structured verdicts with confidence scores, mirroring our allow/block/unavailable decision space. Our work provides a concrete implementation of this verification layer for the MCP trace setting.
## III ProvenanceGuard

[Figure 2 diagram. Node labels, in order of appearance: User query; MCP agent core; Draft answer; Tool calls and servers; Source-bearing trace; Answer + trace packet; Atomic claim decomposition; Source routing; NLI support check; Calibration score; Attribution check; Answer gate; Verified output; Block; RARR-style repair candidate.]

Figure 2: Sequential source-aware verification pipeline. The agent core calls MCP tools and produces a draft answer; ProvenanceGuard consumes the answer and captured source-bearing tool traces, decomposes the answer into claims, routes claims to MCP evidence, estimates support with NLI, token alignment, and calibration, checks attribution separately from support, and sends blocked answers through repair and re-verification.

ProvenanceGuard verifies attribution in provenance-preserving Model Context Protocol (MCP) traces. The goal is not to prove that a source is clinically or scientifically correct. The goal is narrower: when an agent answer makes a factual claim, ProvenanceGuard asks whether the claim is supported by the source it should be attributed to, and whether the answer assigns the claim to the right MCP provenance object. This framing is important in data-sensitive domains, where an offline verifier can spend more computation on reliable source attribution rather than optimizing for interactive latency.

### III.1 Trace Interface

ProvenanceGuard starts from the trace object produced by a tool-using agent. A trace contains the user request, the final assistant answer, and a list of complete tool outputs. Each evidence object is represented as

$$e_i = (\mathrm{tool}_i, \mathrm{source}_i, \mathrm{text}_i),$$

where $\mathrm{tool}_i$ identifies the tool family, $\mathrm{source}_i$ identifies the provenance-bearing source within the trace, and $\mathrm{text}_i$ is the observed tool output or structured sub-output.
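As a minimal sketch of this interface, the evidence triple and the surrounding trace can be modeled with two small data classes. The class names, field names, and sample values below are hypothetical; the paper's actual JSON layout is given in its Appendix A and is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    tool: str    # tool family (hypothetical name)
    source: str  # provenance-bearing source ID within the trace
    text: str    # observed tool output or structured sub-output


@dataclass
class Trace:
    user_request: str
    answer: str
    evidence: list[Evidence] = field(default_factory=list)

    def by_source(self) -> dict[str, list[Evidence]]:
        """Group evidence by source ID without pooling it into one context."""
        grouped: dict[str, list[Evidence]] = {}
        for ev in self.evidence:
            grouped.setdefault(ev.source, []).append(ev)
        return grouped


# Invented sample trace for illustration only.
trace = Trace(
    user_request="Is this patient on an SGLT2 inhibitor?",
    answer="According to the chart, the patient takes empagliflozin.",
    evidence=[
        Evidence("patient_history", "chart:001", "Medication: empagliflozin 10 mg daily."),
        Evidence("pubmed_search", "pubmed:q1", "Abstract text for the first query."),
        Evidence("pubmed_search", "pubmed:q1", "A second chunk from the same search."),
    ],
)
grouped = trace.by_source()
```

Keeping `source` on every evidence object is the design point: downstream stages can rank and check each source separately instead of scoring one merged context.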
Two tool calls to the same tool family (e.g., two PubMed searches with different queries) produce separate evidence objects if their source IDs differ. The verifier never collapses these evidence objects into a single anonymous context. It carries the source identifier forward because the same answer may combine patient-record evidence, literature evidence, and metadata evidence. The output of this stage is a source-preserving evidence set $E = \{e_1, \ldots, e_n\}$ paired with the assistant answer $y$. When a tool output lacks a stable source ID, the verifier falls back to the tool name as the provenance identifier. Appendix [A](https://arxiv.org/html/2606.18037#A1) gives the escaped JSON trace structure used for this interface.

### III.2 Atomic Claims

The next step decomposes the answer into checkable claims,

$$C(y) = \{c_1, \ldots, c_m\}, \quad m \leq M.$$

Each $c_j$ should express one factual proposition. Literal values that can change the meaning of a claim, such as numbers, units, dates, identifiers, medication quantities, and quoted values, are preserved exactly. If the answer names a source family, such as patient history or published literature, the decomposer also preserves that stated attribution span. Generic safety boilerplate is not treated as evidence-bearing content unless it contains a concrete factual assertion. This prevents template text from dominating the claim set while keeping the verifier accountable for factual statements that remain in the answer.

### III.3 Source Routing

For each claim, ProvenanceGuard ranks the source-specific evidence objects before support checking. Let $q_j$ be the embedding of claim $c_j$, and let $e_{s,1}, \ldots, e_{s,n_s}$ be the embedded chunks belonging to source $s$.
The source representation is the centroid

$$r_s = \frac{1}{n_s} \sum_{i=1}^{n_s} e_{s,i}.$$

The router selects the highest-scoring source,

$$\hat{s}_j = \arg\max_s \cos(q_j, r_s),$$

and records the routing margin between the top two source scores. The top-ranked source is used for single-source attribution. Top-$k$ routing is used only as an analysis of whether the correct source is near the head of the ranked list.

After routing, the premise is narrowed to claim-relevant evidence and capped by a fixed length budget. This premise-selection step controls distractor effects before NLI scoring: shorter premises reduce irrelevant context, while longer premises may improve recall but increase truncation and distraction risk. The current service default is 512 tokens for the premise–claim pair, which avoids the earlier 256-token bottleneck while staying within the base DeBERTa context window.

### III.4 Support and Alignment

The routed premise and claim are then passed to a natural-language-inference (NLI) model. The NLI relation is one of entailment, neutral, or contradiction. Entailment is evidence of support, contradiction is evidence of conflict, and neutral means the routed source does not provide enough support by itself.

ProvenanceGuard also computes a heuristic token-alignment proxy from NLI attention. This proxy is used as an additional grounding signal, not as an independently validated explanation of the NLI model. Let $P$ be the premise token indices and $H_c$ the claim token indices in the paired NLI encoding. For attention tensor $A^{\ell,h}$, the alignment score for claim token $i \in H_c$ is

$$a_i = \max_{j \in P} \left( \frac{1}{LK} \sum_{\ell=1}^{L} \sum_{h=1}^{K} A^{\ell,h}_{ij} \right),$$

where $L$ is the number of layers and $K$ is the number of attention heads.
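The alignment score $a_i$ above can be sketched in plain Python. The nested-list layout `attn[layer][head][i][j]` and the tiny attention values are assumptions made for illustration; a real NLI model would supply these weights as tensors.

```python
def alignment_scores(attn, premise_idx, claim_idx):
    """Compute a_i for each claim token i.

    attn[l][h][i][j] is the attention weight from token i to token j at
    layer l and head h (assumed layout). The weights are averaged over all
    layers and heads, then maximized over premise positions j.
    """
    n_layers = len(attn)
    n_heads = len(attn[0])
    norm = n_layers * n_heads
    scores = {}
    for i in claim_idx:
        scores[i] = max(
            sum(attn[l][h][i][j] for l in range(n_layers) for h in range(n_heads)) / norm
            for j in premise_idx
        )
    return scores


# One layer, two heads, three tokens: tokens 0 and 1 form the premise,
# token 2 is the only claim token. Values are invented.
attn = [[
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.25, 0.25]],   # head 0
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.25, 0.75, 0.0]],   # head 1
]]
scores = alignment_scores(attn, premise_idx=[0, 1], claim_idx=[2])
```

For claim token 2, the head-averaged attention is 0.375 toward premise token 0 and 0.5 toward premise token 1, so the maximum over premise positions is 0.5.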
A content token is marked weakly grounded when

$$a_i < \tau \max_k a_k.$$

The supported-token ratio is computed over non-stopword claim tokens. A claim with low token support remains blocked even if the coarse NLI label is uncertain.

Literal values receive a stricter check. If an entailed claim contains a protected value absent from the routed premise after normalization, the claim is treated as unsupported or contradictory. Conversely, a neutral NLI output can be lexically rescued only when all protected values are present and normalized content-term overlap,

$$\frac{|T_{\mathrm{claim}} \cap T_{\mathrm{premise}}|}{|T_{\mathrm{claim}}|},$$

exceeds a fixed rescue threshold. This rescue rule is restricted to exact structured evidence; otherwise neutral remains not enough evidence.

### III.5 Calibrated Claim Decision

The previous stages produce several imperfect signals: the router says which source is most likely responsible for the claim, NLI estimates whether that routed premise entails the claim, and alignment checks whether tokens and protected values are grounded. None of these signals is reliable enough on its own. A high route score can still point to a semantically related but wrong source, and a neutral NLI label can occur when structured evidence supports the claim lexically but not in the form expected by the NLI model.

[Figure 3 diagram. Upper panel: routing features (score, margin, tool), NLI label and score, and lexical, token, and protected-value features pass through one-hot encoding and scaling into a RandomForest support calibrator, which outputs $p_{\mathrm{sup}}(c_j, \hat{s}_j)$ and a support verdict at threshold 0.65. Lower panel: validation block metric (block F1 and block recall, axis 0 to 1.0) against support threshold (axis 0 to 1.0), with 0.65 marked.]

Figure 3: Calibration layer. The calibrator receives only verifier-internal routing, NLI, lexical, token-alignment, and protected-value features.
It returns a calibrated support probability for the routed source and a binary support verdict after validation-threshold selection. The lower panel shows the validation threshold sweep used to select the retained support threshold.

The calibrated claim decision is therefore the support decision for the routed source. It keeps the source identity fixed and learns an operating boundary from development data using only verifier-internal features. Calibration should tune a verifier whose raw routing and support scores are already meaningful; it should not replace source-aware modeling.

The categorical inputs are the NLI label, trace category, predicted tool family, and router status. The numeric inputs are route score, NLI score, routing margin, lexical overlap with the routed source, lexical overlap with all trace evidence, claim length, evidence-chunk count, protected-value counts, missing protected-value counts, stated-attribution indicator, and interaction features such as NLI-score times route-score and route-score times lexical overlap. These inputs are not source-blind baseline outputs; baseline systems are evaluated later and are never used to train or calibrate ProvenanceGuard.

The retained calibrator is a random-forest classifier trained on the development training split. The target is binary support for the routed source: $z_j = 1$ when claim $c_j$ is adjudicated supported by $\hat{s}_j$, and $z_j = 0$ for not-enough-evidence, unsupported, contradiction, or failed-source cases. Categorical features are one-hot encoded and numeric features are standardized before fitting. The forest uses 400 trees with maximum depth 5, minimum leaf size 8, balanced class weights, bootstrap bagging, Gini impurity splitting, and fixed seed 20260607.
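The validation threshold sweep referenced in the Figure 3 caption can be sketched as follows. The paper selects the support threshold by maximizing block F1 on validation, with block accuracy as a tie-breaker; this sketch assumes that "block" is the positive class and that a claim is blocked when its calibrated support probability falls below the threshold. The function names, probabilities, labels, and grid are invented for illustration.

```python
def block_metrics(probs, supported, threshold):
    """Return (block F1, block accuracy), treating 'block' as the positive class."""
    tp = fp = fn = tn = 0
    for p, z in zip(probs, supported):
        predicted_block = p < threshold   # supported only when p >= threshold
        actual_block = z == 0
        if predicted_block and actual_block:
            tp += 1
        elif predicted_block:
            fp += 1
        elif actual_block:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(probs)
    return f1, accuracy


def select_threshold(probs, supported, grid):
    """Maximize block F1; tuple comparison uses accuracy as the tie-breaker."""
    return max(grid, key=lambda t: block_metrics(probs, supported, t))


# Invented validation data: calibrated probabilities and adjudicated labels z_j.
val_probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
val_supported = [1, 1, 1, 0, 0, 0]
chosen = select_threshold(val_probs, val_supported, grid=[0.5, 0.65, 0.8])
```

On this toy data the sweep picks 0.65 because it is the only grid point that blocks all three unsupported claims without blocking a supported one; the agreement with the paper's retained threshold is a coincidence of the invented numbers.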
Because a random forest is not optimized by epoch-wise gradient descent, we do not report a neural training-loss curve; the relevant training objective is node-level Gini impurity reduction, and the relevant operating curve is the validation sweep over decision thresholds.

Table 1: Training details for the retained calibrated claim decision. The random forest is non-neural and has no epoch-wise loss curve; its split objective is Gini impurity, and the operating point is selected by the validation threshold curve in Figure [3](https://arxiv.org/html/2606.18037#S3.F3).

After fitting, the forest estimates

$$p_{\mathrm{sup}}(c_j, \hat{s}_j) = P(z_j = 1 \mid \phi(c_j, \hat{s}_j, E)),$$

where $\phi$ is the feature vector assembled from routing, NLI, alignment, lexical, and protected-value checks. The support threshold is selected on validation by maximizing block F1, with block accuracy used only as a tie-breaker. The selected threshold is 0.65. A claim is passed forward as supported only when $p_{\mathrm{sup}}(c_j, \hat{s}_j) \geq 0.65$; otherwise it is passed forward as blocked for the routed source.

This calibration step deliberately answers only one question: *is the claim supported by the routed MCP source?* It does not yet decide whether the answer described that source correctly. The result passed to the next stage is a tuple containing the calibrated support decision, the routed source identifier, and the evidence features that explain the decision.

### III.6 Attribution and Conflation

Attribution is the second decision. Once calibration has decided whether claim $c_j$ is supported by routed source $\hat{s}_j$, ProvenanceGuard compares that routed source with the source family stated or implied in the answer.
Explicit attribution spans are matched by lexical aliases for patient-history, chart, FHIR, PubMed, literature, metadata, and tool-name variants. If no explicit span is present, patient-specific claims default to patient-record sources, literature-general claims default to literature sources, and ambiguous claims are marked unavailable rather than silently assigned.

A claim can be supported by some MCP source and still be wrong as an attributed answer if the answer assigns it to a different source. Let $a_j$ be the source family stated or implied by claim $c_j$, and let $\hat{s}_j$ be the routed supporting source. ProvenanceGuard marks a source conflation when the calibrated support decision is positive but $a_j$ and $\hat{s}_j$ refer to incompatible provenance families. For example, a patient-specific fact may be supported by a patient-history tool output but incorrectly introduced as a literature finding. The factual content is then not the only issue: the answer has also misrepresented where the fact came from. This is the gap that source-blind support checks cannot address.

### III.7 Answer Decision

ProvenanceGuard aggregates claim decisions with a fail-closed policy. An answer is blocked if any claim has source conflation, high-confidence contradiction, missing protected values, failed routing, or insufficient support. An answer is allowed only when every factual claim is supported by the appropriate routed source. Empty evidence, malformed claims, or failed verifier components are not silently accepted.

### III.8 Repair and Reverification

Blocked answers enter a bounded revise-and-reverify loop. The repair step receives the original answer, the claim-level verifier outputs, and the routed evidence. It can rewrite unsupported spans, correct source attribution, or remove claims that cannot be grounded.
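The conflation rule of Section III.6 and the fail-closed gate of Section III.7 can be sketched together. This is an illustration under stated assumptions, not the paper's code: the `ClaimResult` fields and verdict strings are hypothetical, family incompatibility is simplified to string inequality, and an unavailable attribution is assumed to block the answer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClaimResult:
    supported: bool                # calibrated support decision for the routed source
    routed_family: Optional[str]   # None models failed routing
    stated_family: Optional[str]   # None models ambiguous (unavailable) attribution
    contradiction: bool = False    # high-confidence contradiction flag
    missing_protected_value: bool = False


def claim_verdict(c: ClaimResult) -> str:
    """Return 'supported' or the first blocking reason that applies."""
    if c.routed_family is None:
        return "failed_routing"
    if c.contradiction:
        return "contradiction"
    if c.missing_protected_value:
        return "missing_protected_value"
    if not c.supported:
        return "insufficient_support"
    if c.stated_family is None:
        return "attribution_unavailable"   # assumption: treated as blocking
    if c.stated_family != c.routed_family:
        return "source_conflation"         # supported, but by a different family
    return "supported"


def answer_decision(claims: list) -> str:
    """Fail closed: allow only if every claim is supported by the right source."""
    if not claims:
        return "block"  # empty or malformed input is not silently accepted
    return "allow" if all(claim_verdict(c) == "supported" for c in claims) else "block"


ok = ClaimResult(True, "literature", "literature")
swap = ClaimResult(True, "literature", "patient_record")
```

Here `swap` is fully supported by the literature source yet still blocks the answer, which is the case a source-blind support check would let through.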
The reported implementation is RARR-style: it follows the retrieve/revise/reverify pattern, but many blocked full-trace cases terminate through deterministic evidence-only rewrites, blocked-claim pruning, or conservative non-claim fallback rather than the original open-web RARR procedure. The revised answer is then evaluated by the same ProvenanceGuard verifier. The loop terminates when the revised answer passes source-aware verification or when remaining unverified content is replaced by a conservative non-claim response. The RARR-style repair loop is therefore not treated as a separate oracle; it is a repair mechanism whose output must satisfy the same attribution-sensitive checks that blocked the original answer.

## IV Experimental Setup

### IV.1 Captured MCP-Agent Traces

We evaluate the pipeline in Section [III](https://arxiv.org/html/2606.18037#S3) on captured traces from a medical MCP-agent evaluation stack. Each trace contains a clinician-facing question, the route selected by the agent, the observed tool calls, complete source-bearing tool outputs, and the final assistant reply. The setting is useful for source-aware evaluation because patient-record tools and literature-search tools produce adjacent but distinct provenance objects. A correct answer may need both kinds of evidence, but it should not attribute a patient-record fact to literature or a literature fact to the patient chart.

The full trace set is used for answer-level and repair evaluation. A trace-level split with LLM-assisted claims is used for claim-level support and source-attribution scoring. This instantiates the trace interface in Section [III](https://arxiv.org/html/2606.18037#S3): the verifier receives the answer and the complete list of MCP evidence objects, not a pooled retrieval context.

#### Data governance.

The frozen traces are internal MCP-agent evaluation traces collected for this benchmark.
Patient-like records, names, identifiers, and appendix examples are synthetic or de-identified benchmark artifacts; no protected health information is included in the released artifact bundle. The study evaluates verifier behavior on captured tool outputs and is not a clinical intervention or a medical-device validation study.

### IV.2 Claim Labels

The claim-level evaluation follows the decomposition step in Section [III](https://arxiv.org/html/2606.18037#S3). Answers are decomposed into atomic claims, and each claim receives an LLM-assisted support label, relation label, and best-source annotation. Splits are made by trace rather than by claim so that claims from the same assistant answer do not appear in both development and held-out evaluation. The held-out split contains 40 traces and 361 claims. Source metrics are computed only for claims with an adjudicated source target. Binary support metrics treat unsupported, not-enough-evidence, contradiction, and conflation outcomes as block decisions.

We distinguish three related but separate claim units. First, offline label construction uses model-assisted claim-packet extraction and adjudication to create frozen claims and labels; this is a benchmark-construction step, not the reported runtime decomposer. Second, claim-level scoring tables evaluate ProvenanceGuard on those frozen claims, so no runtime decomposer is invoked. Third, answer-level verification and repair must split a full answer into claims at runtime; the reported answer-level and repair experiments use the deterministic rule-based decomposer for that stage. The runtime decomposer filters boilerplate disclaimers and response-level meta-commentary before routing and NLI scoring.

### IV.3 Multi-Source Adjudicated Benchmark Extension

We also rerun ProvenanceGuard on a harder multi-source adjudicated benchmark.
This packet is harder than the primary 40\-trace held\-out split because it is represented as pairwise claim–candidate\-source rows: each claim case may have multiple candidate sources, including same\-topic distractors\. Its labels are produced by a grouped gpt\-5\.4 LLM\-as\-a\-judge pass over blind claim packets and then validated before packaging\. The fixed policy combines the benchmark training split with earlier two\-judge positive\-boost rows, while validation and test use only the benchmark’s adjudicated rows\. The locked test split contains 59 questions, 254 claim cases, and 2,587 pairwise source\-candidate rows\.

The benchmark is useful for the multi\-tool cases that the primary held\-out split underrepresents\. It contains chart\-plus\-literature cases, literature\-summary versus exact\-citation cases, same\-topic wrong\-chart candidates, and count/resource\-summary claims\. We report these as stress slices rather than as separate training objectives\.

The router\+NLI rerun is scored on frozen independently extracted answer claims associated with the same locked test questions\. This creates a small unit mismatch: the pairwise benchmark split has 254 claim cases, while the frozen extraction artifact contains 263 extracted claims for those 59 questions\. We therefore report both counts and make clear which unit each table uses\. Metrics from the pairwise comparison package and metrics from the frozen extracted\-claim rerun should not be compared as if they were computed over the same examples\.
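
The unit distinction above can be made concrete with a small sketch. The snippet groups pairwise claim–candidate-source rows by claim case and counts each unit separately. The field names (`question_id`, `claim_id`, `source_id`) and the toy rows are illustrative assumptions, not the benchmark's actual schema.

```python
from collections import defaultdict

# Toy pairwise rows: one row per (claim case, candidate source).
# Field names and values are hypothetical, not the released schema.
rows = [
    {"question_id": "q1", "claim_id": "c1", "source_id": "chart:1"},
    {"question_id": "q1", "claim_id": "c1", "source_id": "pubmed:7"},
    {"question_id": "q1", "claim_id": "c2", "source_id": "chart:1"},
    {"question_id": "q2", "claim_id": "c3", "source_id": "chart:2"},
    {"question_id": "q2", "claim_id": "c3", "source_id": "chart:9"},
    {"question_id": "q2", "claim_id": "c3", "source_id": "pubmed:4"},
]


def count_units(pairwise_rows):
    """Count questions, claim cases, and pairwise rows as separate units."""
    candidates = defaultdict(list)
    for row in pairwise_rows:
        # A claim case is identified by its question and claim identifiers.
        key = (row["question_id"], row["claim_id"])
        candidates[key].append(row["source_id"])
    return {
        "questions": len({question for question, _ in candidates}),
        "claim_cases": len(candidates),
        "pairwise_rows": len(pairwise_rows),
    }


units = count_units(rows)
print(units)
```

On the locked test split, the same three counters would read 59, 254, and 2,587. The 263 frozen extracted claims are a fourth unit that comes from a separate extraction artifact and cannot be derived from the pairwise rows, which is why the two sets of metrics are reported separately.
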
\[Figure 4 diagram\. Primary captured\-trace corpus: 281 full MCP traces; 266 claim\-labeled traces; train \+ validation, 1,597 \+ 367 claims; held\-out, 40 traces and 361 claims; claim support \+ source metrics, 260 source\-eligible; answer repair, 281 full traces\. Harder multi\-source benchmark: locked multi\-source benchmark, 59 test questions, 254 claim cases, 2,587 pairwise rows; 263 frozen extracted claims; multi\-tool stress slices; Router\+NLI \+ answer\-level RARR\-style repair\. Targeted attribution probes: frozen MCP evidence; 50 injected source swaps; controlled conflation repair\.\]

Figure 4: Evaluation datasets and units used in the paper\. The primary captured\-trace corpus supports claim\-level scoring, source\-blind baseline comparison, and full\-trace repair\. The newer locked multi\-source benchmark is reported separately because its pairwise source\-candidate rows and frozen extracted claims use different units\. A targeted 50\-case source\-conflation probe set isolates explicit attribution errors\.

### IV\.4 ProvenanceGuard Instantiation

The evaluated system instantiates each stage in Section [III](https://arxiv.org/html/2606.18037#S3)\. Atomic claims are routed to source\-specific MCP evidence by cosine similarity over source embeddings\. The routed premise is checked with NLI, token alignment, protected\-value matching, and lexical rescue\. A RandomForest calibration layer then maps the verifier features to a supported versus blocked claim decision for the routed source\. The support threshold is selected on validation before held\-out evaluation\. This calibrated support decision is passed to the attribution stage, which separately checks whether the answer’s stated or implied source matches the routed source\. The NLI scorer uses a 512\-token pair budget by default, and the joint verifier runtime now defaults to a 512\-token pair budget for new reruns; historical 256\-token backbone runs are reported as such rather than overwritten\.

Table 2: Operating constants for the reported ProvenanceGuard configuration\.
These values make explicit the non\-learned thresholds and model identifiers used in the frozen run\.

This setup is designed for offline or data\-sensitive review, where the cost of false attribution can dominate latency concerns\. It therefore favors conservative blocking and explicit source preservation over a low\-latency best\-effort answer\.

### IV\.5 Claim Decomposition Measurement

The current artifacts do not contain a separately human\-authored gold decomposition set with span\-level or proposition\-level recall\. To make the decomposition stage measurable, we compare the deterministic rule\-based decomposer against the frozen independent sentence\-level extraction artifact on the locked benchmark test questions\. We report this as reference\-set agreement, not as human gold decomposition accuracy\. The metric is still useful because it exposes over\-splitting, missed extracted claims, and protected\-value preservation errors before the routing and NLI stages\.

### IV\.6 External Baselines

MiniCheck, RAGAS Faithfulness, AlignScore, and SummaC\-ZS are run on the same held\-out claim packet as source\-blind support comparators\. These baselines estimate whether a claim is supported by pooled evidence, but they do not emit MCP source identifiers\. They are therefore compared on binary support metrics only, not on source attribution\.

### IV\.7 Repair and Conflation Evaluation

RARR\-style repair is evaluated as the repair\-and\-reverification stage described in Section [III](https://arxiv.org/html/2606.18037#S3)\. Rejected answers are rewritten using routed evidence or, when the remaining content cannot be verified, replaced by a conservative non\-claim response\. Revised answers are then scored by the same verifier\. We report a full\-trace repair run and a targeted source\-conflation probe set\.
Each probe injects one deliberate attribution error into otherwise captured benchmark evidence: patient\-record facts are attributed to literature, or literature facts are attributed to the patient chart\. This isolates the failure mode that source\-blind support checks cannot measure\.

### IV\.8 Adjudication Status

The primary held\-out labels are produced by two independent judge prompts run on a local Gemma 4 E4B instruction\-tuned model, followed by priority adjudication for disagreements, and the resulting labels are LLM\-assisted\. Human expert review is limited to the final 361\-label held\-out packet used for the reported primary evaluation\. The separate multi\-source benchmark uses the gpt\-5\.4 adjudication path described above and is reported as a distinct benchmark rather than merged into the primary held\-out packet\.

## V Results

### V\.1 RQ1: Source\-Aware Support Decisions

Finding\. ProvenanceGuard preserves source identity while making a conservative held\-out support decision\.

Table 3: Prospective captured\-agent trace claim\-level ProvenanceGuard results\. Verdict macro F1 is macro F1 over the four factual verdict labels: supported, unsupported, contradicted, and not\-enough\-evidence\. ProvenanceGuard is the retained RandomForest\-calibrated Router\+NLI verifier\. Source accuracy and source\-plus\-relation accuracy are computed over source\-eligible claims; the held\-out split has 260 source\-eligible claims\.

On the held\-out split, ProvenanceGuard reaches 0\.802 block F1 with block recall of 0\.993 and block precision of 0\.673\. It also reports source accuracy of 0\.858 and source\-plus\-relation accuracy of 0\.681 over 260 source\-eligible held\-out claims\. The low four\-way verdict macro F1 reflects the highly imbalanced held\-out verdict distribution and the rarity or absence of contradiction and conflation labels in the random held\-out split; binary block F1 is the primary deployment metric\.
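
As a consistency check on this operating point, block F1 can be recomputed from the reported block precision and recall using the standard harmonic-mean definition. The symbols $P$ and $R$ are introduced here for illustration only:

```latex
F_1 \;=\; \frac{2PR}{P + R}
    \;=\; \frac{2 \times 0.673 \times 0.993}{0.673 + 0.993}
    \;=\; \frac{1.3366}{1.666}
    \;\approx\; 0.802
```

The recomputed value agrees with the reported 0.802. It also shows that at this fail-closed operating point the F1 is limited almost entirely by precision, since recall is already close to 1.
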
The trace\-bootstrap interval for block F1 is \[0\.664, 0\.900\], reflecting the small 40\-trace held\-out split and claim clustering within traces\.

The mechanism is source preservation\. Source\-blind baselines see pooled evidence and can estimate whether a claim is supported somewhere, but they cannot tell whether the answer assigned that claim to the correct MCP source\. ProvenanceGuard keeps the routed source ID attached to each claim, giving 0\.858 source accuracy over 260 source\-eligible held\-out claims and 0\.681 source\-plus\-relation accuracy\. This supports RQ1: source\-aware verification can match or improve support detection while adding a provenance metric that source\-blind baselines cannot report\.

The limitation is precision\. ProvenanceGuard trades precision for recall: its block precision is 0\.673\. In data\-sensitive review, this fail\-closed operating point is often appropriate, but deployments with high review burden may choose a different threshold\. This pattern is consistent with prior factual\-consistency systems: NLI\-style and support\-checking methods are useful for detecting unsupported text, but they do not solve source ownership unless source identity is part of the decision\.

### V\.2 Multi\-Source Benchmark Rerun and Multi\-Tool Stress Slices

Finding\. On the harder multi\-source adjudicated benchmark, ProvenanceGuard remains a strong conservative blocker but source\-exact attribution is much harder\.

Table 4: ProvenanceGuard rerun on the locked multi\-source adjudicated benchmark\. Pairwise cases are the benchmark claim cases; frozen claims are independently extracted answer claims from the same 59 test questions and are the unit scored by the claim\-level rerun\.

The locked test split contains 254 pairwise claim cases expanded into 2,587 source\-candidate rows\. The frozen extraction artifact for the same 59 questions contains 263 extracted claims, which are the unit scored by the ProvenanceGuard router\+NLI rerun\.
On those frozen extracted claims, ProvenanceGuard reaches block F1 0\.846, but source accuracy is 0\.503 and source\-plus\-relation accuracy is 0\.229\. This is the expected direction for a benchmark with many same\-topic candidates: binary blocking remains feasible, while exact source ownership becomes substantially harder\.

Table 5: Stress slices available in the locked multi\-source adjudicated test benchmark, counted after grouping pairwise source\-candidate rows by claim case\. Slice labels are non\-exclusive diagnostics, so counts need not sum to the total number of claim cases\.

The stress\-slice counts show that the new benchmark covers the failure modes missing from simpler two\-source traces\. The test split includes 14 multi\-tool claim cases, 64 chart\-plus\-literature mixed cases, 14 literature\-summary versus exact\-citation cases, 118 same\-topic wrong\-patient/chart cases, and 240 count or resource\-summary cases\. It also includes 42 cases with semantically close wrong candidates\. These are the cases most relevant to MCP provenance: the wrong source can be topically plausible even when it is not the correct provenance object\.

Table 6: ProvenanceGuard performance by multi\-source stress slice on frozen extracted claims from the locked test questions\. Slice rows are small and are intended as diagnostics rather than powered subgroup estimates\.

The slice metrics show where exact provenance is hardest\. ProvenanceGuard retains high block F1 on chart\-plus\-literature and literature\-citation slices, but source\-plus\-relation accuracy falls to 0\.127 on same\-topic wrong\-chart or wrong\-patient cases and 0\.179 on count/resource\-summary claims\. This suggests that source\-aware benchmarks should report both support and provenance metrics: a conservative blocker can still fail to identify the exact supporting provenance object\.

Table 7: Measured ProvenanceGuard claim\-packet stage latency on the locked multi\-source test questions\.
The packet provides frozen claims, so decomposition and repair are measured separately in Table [10](https://arxiv.org/html/2606.18037#S5.T10)\.

The measured multi\-source router\+NLI rerun is fast at the claim\-packet level: mean NLI call latency is 0\.036 seconds and mean routing latency per question group is 0\.029 seconds\. The table does not include claim decomposition because the benchmark packet supplies frozen claims\.

Table 8: Claim\-decomposition agreement against the frozen independent sentence\-level extraction reference on the locked multi\-source test questions\. This is reference\-set agreement, not a separate human gold span annotation\.

Against the frozen independent extraction reference, rule\-based decomposition has high recall \(0\.935\) but lower precision \(0\.644\), producing 382 claims for 263 reference claims\. This is acceptable for a fail\-closed verifier because over\-splitting tends to increase review burden rather than silently allow unsupported content, but the protected\-value exact rate among value\-bearing matches is only 0\.563\. The paper therefore treats decomposition as a measured limitation, not a solved preprocessing step\.

Table 9: Answer\-level ProvenanceGuard repair evaluation on reconstructed multi\-source test traces\. The conservative repair policy resolved all initially blocked answers under re\-verification; two required the terminal fallback response\.

Table 10: Answer\-level RARR\-style repair rerun latency with rule\-based decomposition, local embedding and NLI scoring, one repair iteration, and post\-repair re\-verification\.

The fresh answer\-level rerun reconstructs full test\-question traces from the locked benchmark pairwise rows, runs verification with rule\-based decomposition, and enables one repair iteration plus re\-verification\.
The external correction editor was unavailable in this local run, so the repair policy used its local fallbacks: 47 evidence\-only rewrites, 10 blocked\-claim pruning repairs, and 2 terminal conservative responses\. All 59 reconstructed answers are initially blocked, all 59 are handled by this repair policy, and all 59 revised outputs pass the same verifier\. Mean elapsed time per answer is 0\.498 seconds in this local setup, with NLI and routing dominating the measured stage latencies\.

### V\.3 RQ2: Ablation Studies

Finding\. A direct raw verdict head substantially reduces the earlier calibration gap, while routing keeps the correct source near the head of the source ranking\.

The initial uncalibrated NLI\-only support decision reaches 0\.750 block F1 and only 0\.363 held\-out verdict accuracy, while the retained calibrated ProvenanceGuard run reaches 0\.802 block F1 and 0\.812 verdict accuracy\. This 0\.449 verdict\-accuracy gap showed that the naive raw label mapping was not semantically strong enough\. We therefore add a direct raw verdict head trained on the train split over routed Router\+NLI and source\-evidence features, without selecting a deployment support threshold on validation\. This raw verdict head reaches 0\.839 held\-out verdict accuracy, 0\.816 block F1, and 0\.704 source\-plus\-relation accuracy\. Threshold calibration on top of this raw head does not improve held\-out accuracy: the validation\-F1 threshold gives 0\.817 verdict accuracy and 0\.800 block F1, while a high\-recall threshold gives 0\.795 verdict accuracy and 0\.789 block F1 but restores block recall to 0\.993\. The gap to the retained calibrated verifier is therefore no longer a raw\-accuracy deficit; it is an operating\-point tradeoff\.

Router\-only source ranking gives 0\.858 Top\-1 source accuracy\. We do not treat Top\-3 or Top\-5 as meaningful performance metrics on this split because each source\-eligible held\-out claim has at most two candidate sources\.
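
The reason Top-3 and Top-5 are uninformative here can be illustrated with a minimal sketch. The `top_k_accuracy` helper, the source labels, and the toy rankings are hypothetical; the only property taken from the paper is that each claim has at most two candidate sources.

```python
def top_k_accuracy(ranked_sources, gold_sources, k):
    """Fraction of claims whose gold source is among the first k ranked candidates."""
    hits = sum(
        gold in ranked[:k]
        for ranked, gold in zip(ranked_sources, gold_sources)
    )
    return hits / len(gold_sources)


# Hypothetical router output: every claim has at most two candidate sources,
# and the gold source is always one of the candidates.
ranked = [
    ["chart", "literature"],
    ["literature", "chart"],
    ["chart"],
    ["literature", "chart"],
]
gold = ["chart", "chart", "chart", "literature"]

top1 = top_k_accuracy(ranked, gold, k=1)
top2 = top_k_accuracy(ranked, gold, k=2)
top3 = top_k_accuracy(ranked, gold, k=3)
print(top1, top2, top3)
```

In this toy setting Top-1 is 0.75 while Top-2 and Top-3 are both 1.0. Whenever the adjudicated source is among at most two candidates, any cutoff of two or more is satisfied by construction, so only Top-1 carries information about router quality.
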
Table 11: Router ablation on the held\-out claim split\. Router\-only reports whether the adjudicated source is ranked first over the 260 source\-eligible claims\. NLI\-only uses the routed NLI support decision without the calibrated evidence\-feature layer\. The calibrated source\-aware verifier is the retained ProvenanceGuard system\. Top\-3 and Top\-5 are omitted because the held\-out source candidate sets are too small to make those cutoffs discriminative\.

The mechanism is that the naive raw NLI label loses information that is present in evidence\-derived features: routing score, routing margin, lexical overlap, token support, and protected\-value coverage\. The direct raw verdict head learns a verdict decision from those features before any threshold calibration layer is applied\. Calibration is still useful for choosing a fail\-closed operating point, but it no longer needs to rescue a fundamentally weak raw verdict mapping in this experiment\. The calibrated support verdict also creates the precondition for conflation detection: only after the verifier knows which routed source supports the claim can it ask whether the answer attributed that claim to the same provenance family\. Hard route\-score cutoffs do not solve the problem: Table [15](https://arxiv.org/html/2606.18037#S5.T15) shows that increasingly strict cutoffs reject most claims and reduce source accuracy among retained source\-eligible claims\.

This confirms the RQ2 hypothesis in a bounded way\. Routing alone is necessary for attribution but does not decide support; NLI alone estimates support but does not preserve the source\-ranking behavior needed for attribution; calibrated Router\+NLI combines both\.

The limitation is that this held\-out split cannot evaluate large\-$k$ source recall: Top\-3 and Top\-5 would be tautological because each claim has at most two candidate sources\.
A larger trace set with more simultaneous MCP sources is needed to measure whether top\-$k$ routing remains strong under heavier source competition\.

We then compare the retained verifier with development variants that test alternative calibration and routing choices\. The raw verdict head has the strongest held\-out verdict accuracy and block F1, while ProvenanceGuard is retained as the fail\-closed deployment point because it preserves the highest block recall among the strong variants\. Thresholding the raw head moves it toward that conservative operating point but reduces held\-out accuracy and block F1\. The evidence\-calibrated ExtraTrees/logistic blend and calibrated\-score ensemble improve some validation statistics, but in the held\-out audit they reduce block recall and do not improve block F1 relative to the retained system\. We therefore report both the raw verdict head as the strongest uncalibrated verifier and ProvenanceGuard as the retained conservative verifier for RARR\-style repair\.

Table 12: Held\-out results for source\-aware verifier development runs\. All rows are evaluated on the same 361\-claim held\-out packet\. Source metrics are computed on the 260 tool\-source\-eligible claims\. The retained system is bolded\.

Table 13: Raw and validation\-calibrated systems from the current post\-finetune benchmark package on the locked multi\-source adjudicated benchmark\. This table uses the package’s pairwise comparison units\. The legacy pairwise\-scorer row is a diagnostic from the package and is not the retained frozen\-claim ProvenanceGuard pipeline; Table [4](https://arxiv.org/html/2606.18037#S5.T4) separately reports that frozen extracted\-claim rerun\.

The multi\-source benchmark comparison clarifies the role of calibration\. Calibration is useful when it tunes an already meaningful score into an operating threshold\. It is not a substitute for a raw model that understands source ownership\.
In the current post\-finetune benchmark package, the completed two\-head backbones already have much stronger raw source\-plus\-relation accuracy than the legacy comparison rows: Long DeBERTa and ModernBERT reach 0\.478, and ModernCE reaches 0\.420\. Validation calibration changes the operating point rather than uniformly improving every metric: block F1 increases slightly for the calibrated variants, while source\-plus\-relation accuracy is unchanged or lower\. This supports the design goal for future versions: improve the base verifier so raw relation and source scores are sensible, then use calibration only to choose a deployment operating point\.

Table 14: Raw longer\-context diagnostic using a 2048\-token ModernBERT checkpoint trained for one bounded 100\-group epoch\.

After the original 20\-epoch 2048\-token ModernBERT checkpoint proved unreadable, we trained a replacement 2048\-token ModernBERT checkpoint for one bounded 100\-group epoch and evaluated it raw on the full locked multi\-source test split\. This is a diagnostic checkpoint rather than a full finetune: case source\-plus\-relation accuracy is 0\.090, with source\-pair accuracy 0\.559\. This gives a valid long\-context raw test point, but it does not yet support the stronger claim that longer context alone makes the raw verifier sensible\.

The route\-score diagnostic explains why a hard no\-source cutoff was not retained\. Increasing the cutoff rejects most claims before NLI and reduces source accuracy among the retained source\-eligible claims, so route score alone is too blunt for support decisions\.

Table 15: Held\-out route\-score threshold diagnostic\. A hard no\-source cutoff rejects low\-score claims before NLI\. Higher cutoffs reject most claims and reduce source accuracy among retained source\-eligible claims, so this policy was not retained\.
Source accuracy is undefined when no claims are retained\.

Table 16: Blocked prediction taxonomy for ProvenanceGuard on prospective captured\-agent traces\.

The blocked\-case taxonomy explains where the system spends its recall\. On the held\-out split, 176 blocked claims are NLI\-neutral or not\-enough\-evidence cases and 29 are wrong\-route candidates\. This is consistent with the method design: ProvenanceGuard is primarily rejecting claims that cannot be grounded in the routed source, not only overt contradictions\.

### V\.4 RQ3: Comparison With Source\-Blind Baselines

Finding\. Source\-blind support baselines are competitive on binary support, but they cannot evaluate MCP source attribution\.

The comparison table is therefore read only on the support axis\. ProvenanceGuard has the highest block F1 at 0\.802, followed by MiniCheck at 0\.783 and RAGAS Faithfulness at 0\.758\. MiniCheck is close enough that the 0\.019 absolute F1 gap is not statistically significant under paired trace\-level bootstrap comparison \(one\-sided $p \approx 0.13$; two\-sided $p \approx 0.26$\)\. The practical difference is metric coverage: only ProvenanceGuard reports claim\-to\-source IDs, source accuracy, and source\-plus\-relation accuracy\.

Table 17: Held\-out captured\-agent trace claim\-level factuality comparison\. Verdict macro F1 is macro F1 over the four factual verdict labels: supported, unsupported, contradicted, and not\-enough\-evidence\. ProvenanceGuard uses only router, NLI, and evidence\-derived features\. MiniCheck, RAGAS Faithfulness, AlignScore, and SummaC\-ZS are source\-blind claim/evidence support baselines and therefore are not scored on claim\-to\-source attribution\.

The uncertainty estimates confirm that the support\-only advantage should not be overread\. ProvenanceGuard reaches 0\.802 block F1 with a 95% CI of \[0\.664, 0\.900\], MiniCheck reaches 0\.783 \[0\.645, 0\.882\], and RAGAS Faithfulness reaches 0\.758 \[0\.618, 0\.861\]\.
These intervals overlap because the held\-out split contains only 40 traces\.

Table 18: Uncertainty estimates for held\-out claim\-level metrics and targeted repair\. Claim\-level intervals use 5,000 trace\-level bootstrap resamples over the 40 held\-out traces\. Repair intervals for 50/50 probe successes use an exact binomial 95% interval rounded to two decimals because case\-level bootstrap resampling is degenerate on the fixed controlled probe set\. ProvenanceGuard source accuracy is computed over 260 source\-eligible held\-out claims\.

The mechanism is the distinction between pooled support and provenance\-sensitive support\. MiniCheck and RAGAS can determine whether evidence contains enough information to support a claim, but their output does not encode whether the supporting evidence came from the patient chart, PubMed metadata, literature search, or another MCP source\. This connects to the paper’s main hypothesis: attribution requires preserving source identity through routing and support checking\.

The limitation is scope\. The source\-blind baselines are not failed attribution systems; they are different tools\. They remain appropriate comparators for support estimation, and their strong held\-out F1 shows that pooled\-evidence support checking is a meaningful baseline\. The unexpected result is how close MiniCheck is on support F1, which suggests that the clearest contribution of ProvenanceGuard is not a large binary F1 gain but the ability to retain source ownership while maintaining competitive support detection\.

### V\.5 RQ4: Repair After Source\-Aware Blocking

Finding\. The repair loop can turn blocked captured\-trace answers into verifier\-passing answers, but many repairs require conservative fallback rather than a fully rewritten substantive answer\.

On 281 full captured traces, the pre\-repair verifier allows 108 answers and blocks 173\. The repair loop resolves all 173 blocked answers, and all 173 revised outputs pass the same Router\+NLI verifier\.
However, 144 of those resolutions use a terminal conservative response\. This full\-trace result should be separated from the targeted source\-conflation probe set in RQ5, where the controlled probes are deliberately simpler\.

Table 19: Full captured\-trace RARR\-style repair results on 281 captured MCP\-agent traces\. Rejected answers enter the repair loop and are then re\-scored by the same source\-router plus NLI verifier\. Terminal fallback means the remaining unverified content was replaced with a conservative non\-claim response\.

The answer\-level rerun in Table [9](https://arxiv.org/html/2606.18037#S5.T9) is a second repair check on the multi\-source benchmark rather than on the original 281\-trace corpus\. It is stricter in the sense that all reconstructed benchmark answers are initially blocked, but it is also smaller and uses the same deterministic rule\-based runtime decomposition configuration as the reported full\-trace repair run\.

The mechanism is fail\-closed repair\. The RARR\-style repair loop first attempts source\-grounded rewriting, then reruns the verifier\. If unsupported content remains, the terminal fallback removes the remaining factual claim rather than allowing an unverifiable answer\. This supports the repair hypothesis only in the conservative sense: the loop can prevent unverifiable content from passing, but it does not always recover a rich answer\.

The limitation is that repair success is measured against the same verifier that triggered the block\. This is appropriate for checking pipeline consistency, but it is not independent proof that the revised answer is clinically complete\. Compared with prior RARR work, this setting is stricter because revision must preserve MCP source attribution, not only improve general factual consistency against retrieved text\.

### V\.6 RQ5: Targeted Source Conflation

Finding\. ProvenanceGuard detects and repairs deliberately injected source\-conflation errors in a controlled clinical probe set\.
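
Table 18 summarizes the 50/50 probe result with an exact binomial 95% interval. A minimal sketch of that calculation follows, assuming the Clopper-Pearson construction, which the paper does not name explicitly; the function name is illustrative. When every trial succeeds, the lower bound solves $p^n = \alpha/2$ and the upper bound is 1.

```python
def exact_interval_all_successes(n, alpha=0.05):
    """Clopper-Pearson interval for the special case of n successes in n trials."""
    # With x == n, P(X >= n | p) = p**n, so the lower bound solves p**n == alpha / 2.
    lower = (alpha / 2) ** (1.0 / n)
    return lower, 1.0


lower, upper = exact_interval_all_successes(50)
print(f"[{lower:.2f}, {upper:.2f}]")
```

For 50 successes in 50 trials the lower bound is about 0.929, which rounds to the 0.93 reported for the probe set. The closed form applies only to the all-successes case; a general exact interval would need the beta quantile function.
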
In 50 clinically framed source\-conflation probes, the verifier blocks all 50 source\-conflation replies and labels all 50 as explicit conflations\. The repair policy resolves all 50 cases; all 50 pass post\-repair verification, and no revised answer retains the deliberately wrong attribution\. The exact binomial 95% interval for 50/50 successes is approximately \[0\.93, 1\.00\]; the bootstrap interval is degenerate on this fixed controlled probe set and should not be read as a generalization interval\.

Table 20: Targeted clinically framed source\-conflation probe set over frozen captured MCP evidence\. The 50 probes contain one deliberate source\-attribution error each; post\-repair verification uses the same source\-router plus NLI verifier\.

The mechanism is the separation between support and attribution\. The factual content in these probes is often supported somewhere in the trace, but the answer assigns it to the wrong provenance family\. Source\-blind support checks cannot see that failure mode because the claim may still be supported by pooled evidence\. ProvenanceGuard detects it by comparing stated attribution with the routed supporting source\. This supports the source\-conflation hypothesis for explicit attribution swaps\.

The limitation is difficulty: each probe contains one clear source swap and no adversarial attempt to hide the attribution error\. The 1\.000 result should therefore be interpreted as a diagnostic check for a constrained failure mode, not as evidence that all source\-conflation cases are solved\.

### V\.7 Adjudication Status

Finding\. The held\-out evaluation uses complete LLM\-assisted labels with human expert review\.

The held\-out packet has labels for all 361 claims\. The agreement rows reflect two independent judge prompts run on a local Gemma 4 E4B instruction\-tuned model and priority adjudication for disagreements\. Human experts then reviewed the resulting held\-out labels before they were used for evaluation\.
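
The first stage of this workflow can be sketched as follows. This is an illustrative assumption about how two judge passes might be compared, not the paper's adjudication code: the function name, claim identifiers, and toy labels are hypothetical, and the sketch only separates agreed labels from the disagreements that would be sent on for adjudication.

```python
def route_for_adjudication(judge_a, judge_b):
    """Split claims into agreed labels and a queue of disagreements.

    judge_a and judge_b map claim IDs to verdict labels from two judge prompts.
    """
    agreed, queue = {}, []
    for claim_id, label_a in judge_a.items():
        if label_a == judge_b[claim_id]:
            agreed[claim_id] = label_a
        else:
            # Disagreements are not resolved here; they are queued for adjudication.
            queue.append(claim_id)
    agreement_rate = len(agreed) / len(judge_a)
    return agreed, queue, agreement_rate


# Hypothetical labels from two judge prompts over four claims.
judge_a = {"c1": "supported", "c2": "unsupported", "c3": "supported", "c4": "not-enough-evidence"}
judge_b = {"c1": "supported", "c2": "supported", "c3": "supported", "c4": "not-enough-evidence"}

agreed, queue, agreement_rate = route_for_adjudication(judge_a, judge_b)
print(agreed, queue, agreement_rate)
```

In this toy example three of four claims agree and one is queued. How queued disagreements are resolved, and how human reviewers then check the final labels, is outside the sketch.
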
Table 21: Held\-out adjudication status\. Labels come from two independent judge prompts run on a local Gemma 4 E4B instruction\-tuned model, priority adjudication for disagreements, and human expert review of the resulting 361 held\-out labels\. This table does not claim human review of the training or validation labels; exact model identifiers and access dates are reported in Section [IX](https://arxiv.org/html/2606.18037#S9)\.

The mechanism is a two\-stage adjudication workflow: model judges provide first\-pass labels, disagreements receive priority adjudication, and human experts verify the final held\-out labels\. This supports reproducible benchmarking while incorporating expert review after model adjudication\.

## VI Discussion

The central result is not that attribution guarantees truth\. It does not\. The contribution is that a verifier for MCP\-based agents can preserve source identity while estimating support, making it possible to distinguish a supported claim from a correctly attributed supported claim\. This distinction matters when one answer combines patient\-specific records, medical literature, and tool metadata\.

The empirical pattern is consistent across the held\-out results\. ProvenanceGuard is competitive with source\-blind support baselines on binary blocking, but its distinctive value is provenance: it returns claim\-to\-source IDs and can detect source conflation\. The gain is most useful in offline or data\-sensitive settings where a conservative block is preferable to silently accepting an answer whose facts are attached to the wrong source\.

The multi\-source rerun sharpens this interpretation\. On a larger pairwise source\-candidate benchmark, binary blocking remains strong, but exact source\-plus\-relation accuracy drops sharply\. This suggests that provenance verification should be judged on two axes: support detection and source ownership\.
A model that is useful for blocking unsupported content is not necessarily good enough for exact attribution when several semantically close sources are present\.

The calibration results also change the engineering lesson\. Calibration is useful for setting a fail\-closed operating point, but it should not carry the core semantics of the verifier\. The initial raw Router\+NLI label mapping had exactly that weakness: it reached only 0\.363 held\-out verdict accuracy, compared with 0\.812 for the retained calibrated verifier\. A direct raw verdict head over routed NLI and source\-evidence features reduces this dependence, reaching 0\.839 held\-out verdict accuracy without a validation\-selected support threshold\. Threshold calibration on top of this raw head does not improve held\-out accuracy; it shifts the system toward higher block recall\. With a high\-recall threshold, the raw head reaches the same 0\.993 block recall as ProvenanceGuard but falls to 0\.795 verdict accuracy and 0\.789 block F1\. The remaining distinction is therefore an operating\-point tradeoff rather than a raw semantic failure: the raw head has higher verdict accuracy and block F1, while the retained calibrated verifier has higher block recall at a better conservative operating point\. The post\-finetune multi\-source comparison reinforces this point: stronger raw two\-head backbones reach substantially higher source\-plus\-relation accuracy before calibration, and calibration mainly adjusts the deployment threshold\. Future work should therefore prioritize stronger source\-aware raw training and harder negative candidates before adding more calibration machinery\. The replacement 2048\-token ModernBERT diagnostic run is loadable and evaluated, but its low source\-plus\-relation accuracy leaves the longer\-context hypothesis unresolved rather than supported\.

The repair results should be read with the same caution\.
RARR-style repair plus reverification can remove or correct unsupported content, and the targeted source-conflation probes show that explicit attribution swaps are detectable. But repair success against the same verifier is not independent clinical validation, and conservative fallbacks are evidence of fail-closed behavior rather than rich answer recovery.

Offline benchmark claim-packet extraction and LLM-assisted adjudication used a local Gemma 4 E4B instruction-tuned model. Runtime answer-level verification and repair used the deterministic rule-based decomposer. This is a deliberately conservative deployment setting: it keeps the verifier compatible with offline and data-sensitive environments. The framework itself is not tied to that local model. More capable state-of-the-art LLMs, including frontier Opus- or GPT-class systems, may improve the LLM-dependent offline benchmark-construction stages or repair rewriting. We treat this as a generalization hypothesis rather than a reported result, because the held-out numbers in this paper are only for the frozen local-model configuration. The new decomposition measurement also suggests where stronger runtime decomposition could help: the rule-based decomposer recovers most frozen reference claims but over-splits and preserves protected values imperfectly.

The main open question is scale. The primary held-out split is small, confidence intervals are wide, and it cannot meaningfully evaluate Top-3 or Top-5 source routing because the current traces usually expose only one or two candidate sources per claim. The multi-source benchmark is a step toward this harder setting, but it also shows that longer context and stronger source-aware raw models are needed when the candidate set contains many plausible near misses.

## VII Conclusion

We presented ProvenanceGuard, a source-aware factuality verifier for MCP-based LLM agents.
The paper’s central claim is that factuality verification in multi-tool settings must go beyond pooled evidence support: it must also determine whether each claim is attributed to the correct source. This distinction matters because a claim can be supported somewhere in the available MCP trace while still being misleadingly assigned to the wrong tool output, patient record, literature source, or metadata source.

ProvenanceGuard preserves stable MCP source identifiers, decomposes answers into atomic claims, routes claims to source-specific evidence, checks support with NLI and alignment signals, and compares stated attribution with the routed source. On a frozen corpus of captured medical-domain MCP-agent traces, the system achieved competitive blocking performance while also producing claim-to-source attribution judgments unavailable from source-blind baselines. In targeted source-conflation probes, it detected all deliberately injected attribution swaps, demonstrating the value of source-aware verification.

These results suggest that source attribution should be treated as an independent evaluation axis for tool-using LLM agents. However, the current evidence is limited by the use of one medical agent stack, labels that are LLM-assisted, a small held-out split, and controlled source-conflation probes. Future work should expand evaluation to larger, more diverse MCP environments and include more subtle source-conflation cases.

## VIII Limitations

The trace benchmark is drawn from one medical MCP-agent stack. It is representative of that stack, but it does not establish universal medical factuality or clinical safety validation. The reproducible object is the frozen trace, not future behavior of PubMed, FHIR resources, search tools, or the live agent. Tool outputs and agent behavior may vary if the same prompts are rerun at a later date.

The 2,325-label claim subset uses LLM-assisted adjudication with two independent judge prompts and priority adjudication for disagreements. Human expert verification covers only the 361 held-out labels used in the reported primary evaluation; it does not cover training or validation labels and does not establish universal medical factuality or clinical safety validation. The benchmark should therefore not be interpreted as a fully clinician-adjudicated training corpus.

The claim-level bootstrap intervals are wide because the held-out split contains only 40 traces and claims are clustered within traces. Reported uncertainty therefore uses trace-level resampling rather than treating all 361 claims as independent.

The held-out labels are dominated by supported and not-enough-evidence claims. There are few explicit contradictions and no gold conflation relation in the random held-out claim split. The targeted 50-case source-conflation probe set is therefore the most direct evidence for controlled conflation detection and repair, while the random 281-trace run evaluates full-answer behavior. Because the 50 probes are generated by injecting one explicit source-attribution swap into otherwise captured benchmark evidence, the 1.000 repair metrics do not imply that the task is solved under harder, multi-error, paraphrased, or adversarial source-conflation conditions.

The paper reports claim-decomposition agreement against a frozen independent extraction artifact, not against a separately human-authored gold atomic-claim set. This makes the decomposition numbers useful for regression testing and error analysis, but not a final estimate of human gold extraction precision and recall. Decomposition errors can still affect answer-level behavior, especially protected-value preservation, so this remains an evaluation gap.

The initial raw Router+NLI label mapping depended heavily on calibration for final verdict accuracy.
On the main held-out split, the naive raw verdict accuracy is 0.363 and the retained calibrated verifier reaches 0.812. The direct raw verdict head reduces this gap by reaching 0.839 verdict accuracy without validation-threshold calibration, but it is still trained on split-specific Router+NLI and source-evidence features rather than being an end-to-end source-aware NLI model. Threshold calibration on top of the raw head shows the expected recall–accuracy tradeoff rather than an additional held-out accuracy gain: the high-recall operating point reaches 0.993 block recall but falls to 0.795 verdict accuracy. This improves the reported calibration-dependence story, while leaving robustness under distribution shift as an evaluation gap.

The 2048-token raw ModernBERT result uses a replacement one-epoch checkpoint trained for only 100 sampled groups after the original longer run was not available for evaluation. This is enough to verify the evaluation path and report a valid raw long-context test point, but it is not a full training run. We therefore do not claim that longer context improved raw source-aware verification.

External source-blind support baselines are not source-attribution systems. They are appropriate binary decision comparators where configured, but they are not baselines for claim-to-source accuracy.

The LLM-dependent components are also a validity boundary. The reported configuration uses a local Gemma 4 E4B instruction-tuned model for offline benchmark claim-packet extraction and adjudication where LLM calls are required. Runtime answer-level verification and repair use deterministic decomposition. Stronger frontier LLMs may produce cleaner offline claim packets, more stable adjudications, or richer repairs, but those systems are not evaluated in the reported frozen run.

## IX Reproducibility

The reproducible unit is the frozen MCP trace and its derived claim-level evaluation packet.
Each reported table is generated from stored traces, claim packets, LLM-assisted labels, prediction files, baseline outputs, uncertainty resamples, and repair outputs. The artifact release preserves the raw tool outputs and stable source identifiers needed to rerun source routing and attribution checks.

The implementation used for the reported run instantiates Section [III](https://arxiv.org/html/2606.18037#S3) as follows. Source routing uses the all-MiniLM-L6-v2 SentenceTransformer model with cosine similarity over normalized embeddings. Support checking uses Moritz Laurer’s DeBERTa-v3-base checkpoint fine-tuned on MNLI, FEVER, and ANLI. Offline benchmark claim-packet extraction and the two model-adjudication passes use a local Gemma 4 E4B instruction-tuned model; runtime answer-level verification and repair use the deterministic rule-based decomposer. The multi-source adjudication path uses the `gpt-5.4` API model identifier, and the local judge prompts use the deployed Gemma 4 E4B instruction-tuned checkpoint, recorded as `gemma-4-e4b-it` where the deployment exposes model IDs; model documentation and local model-card checks were accessed on 2026-06-16. The calibrated support layer is trained on the development split, and its threshold is selected on validation before held-out scoring.

The NLI scorer truncates premise–claim pairs at a configurable token budget. The current default is 512 tokens through the NLI maximum-length setting; the prior 256-token setting is retained only in historical backbone artifacts that were run under local MPS memory constraints. New joint-verifier training and scoring runs default to 512 tokens unless a model-specific longer-context run overrides the argument. The 2048-token ModernBERT raw evaluation and replacement checkpoint are included in the released artifact bundle.
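The source-routing step described in this section (cosine similarity over normalized embeddings) can be illustrated with a minimal sketch. It is not the released implementation: the function names `l2_normalize` and `route_claim` are hypothetical, and the toy 3-dimensional vectors stand in for the sentence embeddings that all-MiniLM-L6-v2 would produce. The single-candidate rule, where the margin falls back to the score itself, is an assumption chosen to match the equal `route_score` and `routing_margin` values shown in Appendix B.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def route_claim(
    claim_vec: Sequence[float],
    source_vecs: Dict[str, Sequence[float]],
) -> Tuple[str, float, float]:
    """Return (source_id, route_score, routing_margin) for one atomic claim."""
    if not source_vecs:
        raise ValueError("at least one candidate source is required")
    query = l2_normalize(claim_vec)
    # Score every candidate source by cosine similarity, highest first.
    scored = sorted(
        (
            (sum(a * b for a, b in zip(query, l2_normalize(vec))), source_id)
            for source_id, vec in source_vecs.items()
        ),
        reverse=True,
    )
    best_score, best_id = scored[0]
    # Assumption: with a single candidate, the margin equals the score.
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    return best_id, best_score, best_score - runner_up


# Toy vectors standing in for real sentence embeddings (hypothetical values).
claim = [1.0, 1.0, 0.0]
sources = {
    "tool_output::load_patient_history": [2.0, 2.0, 0.0],
    "tool_output::search_pubmed_key_words": [1.0, 0.0, 0.0],
}
source_id, score, margin = route_claim(claim, sources)
print(source_id, round(score, 3), round(margin, 3))
```

Returning the margin alongside the score keeps the gap to the runner-up source available to later stages, which matters when several semantically close sources compete for the same claim.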

The full-trace repair outputs are stored separately from claim-level outputs because they evaluate answer-level behavior rather than isolated claim predictions. External baselines are stored as source-blind support outputs and are not used to train or calibrate ProvenanceGuard. Historical rerun summaries are not used for the manuscript numbers.

The reconstructed multi-source answer-level traces, fresh answer-level RARR-style outputs, summary, and latency files are stored together in the released artifact bundle. The raw verdict-head results and threshold-calibration variants are stored under `artifacts/router_nli_raw_verdict_head/`. The multi-source benchmark tables are generated from frozen benchmark artifacts and copied into the paper artifact bundle. The associated answer-trace reconstruction, claim-decomposition measurement, and long-context raw-evaluation outputs are preserved with the same released artifacts.

## References

- \[1\] B. Bohnet, V. Q. Tran, P. Verga, R. Aharoni, D. Andor, L. B. Soares, M. Ciaramita, J. Eisenstein, K. Ganchev, J. Herzig, K. Hui, T. Kwiatkowski, J. Ma, J. Ni, L. Sestorain Saralegui, T. Schuster, W. W. Cohen, M. Collins, D. Das, D. Metzler, S. Petrov, and K. Webster (2022) Attributed question answering: evaluation and modeling for attributed large language models. arXiv:2212.08037, [Link](https://arxiv.org/abs/2212.08037)
- \[2\] (2026) Localizing factual inconsistencies in attributable text generation. Transactions of the Association for Computational Linguistics. [Link](https://aclanthology.org/2026.tacl-1.6/)
- \[3\] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert (2024) RAGAS: automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, St. Julian’s, Malta, pp. 150–158. [Link](https://aclanthology.org/2024.eacl-demo.16/)
- \[4\] S. Filice, E. Haramaty, G. Horowitz, Z. Karnin, L. Lewin-Eytan, and A. Shtoff (2025) Generate but verify: answering with faithfulness in RAG-based question answering. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pp. 1017–1037. [Link](https://aclanthology.org/2025.ijcnlp-long.56/)
- \[5\] J. Gao, J. Zhou, Q. Sun, R. Huang, and S. Yoo (2026) Atomic information flow: a network flow model for tool attributions in RAG systems. arXiv:2602.04912, [Link](https://arxiv.org/abs/2602.04912)
- \[6\] L. Gao, Z. Dai, P. Pasupat, A. Chen, A. T. Chaganty, Y. Fan, V. Y. Zhao, N. Lao, H. Lee, D. Juan, and K. Guu (2023) RARR: researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, pp. 16477–16508. [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.910), [Link](https://aclanthology.org/2023.acl-long.910/)
- \[7\] T. Gao, H. Yen, J. Yu, and D. Chen (2023) Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, pp. 10355–10377. [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.641), [Link](https://aclanthology.org/2023.emnlp-main.641/)
- \[8\] T. Goyal and G. Durrett (2020) Evaluating factuality in generation with dependency-level entailment. In Findings of the Association for Computational Linguistics: EMNLP 2020. [Link](https://aclanthology.org/2020.findings-emnlp.322/)
- \[9\] S. Harary, E. Hirsch, A. Slobodkin, D. Wan, M. Bansal, and I. Dagan (2025) PrefixNLI: detecting factual inconsistencies as soon as they arise. arXiv:2511.01359, [Link](https://arxiv.org/abs/2511.01359)
- \[10\] E. Hirsch, A. Slobodkin, D. Wan, E. Stengel-Eskin, M. Bansal, and I. Dagan (2025) LAQuer: localized attribution queries in content-grounded generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pp. 15355–15370. [Link](https://aclanthology.org/2025.acl-long.746/)
- \[11\] O. Honovich, R. Aharoni, J. Herzig, H. Taitelbaum, D. Kukliansy, V. Cohen, T. Scialom, I. Szpektor, A. Hassidim, and Y. Matias (2022) TRUE: re-evaluating factual consistency evaluation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Seattle, United States, pp. 3905–3920. [Document](https://dx.doi.org/10.18653/v1/2022.naacl-main.287), [Link](https://aclanthology.org/2022.naacl-main.287/)
- \[12\] W. Kryściński, B. McCann, C. Xiong, and R. Socher (2020) Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9332–9346. [Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.750), [Link](https://aclanthology.org/2020.emnlp-main.750/)
- \[13\] P. Laban, T. Schnabel, P. N. Bennett, and M. A. Hearst (2021) SummaC: re-visiting NLI-based models for inconsistency detection in summarization. arXiv:2111.09525, [Link](https://arxiv.org/abs/2111.09525)
- \[14\] S. Min, K. Krishna, X. Lyu, M. Lewis, W. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi (2023) FActScore: fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. [Link](https://arxiv.org/abs/2305.14251)
- \[15\] I. Poey, J. Liu, and Q. Zhong (2025) RAGulator: lightweight out-of-context detectors for grounded text generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pp. 1057–1071. [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-industry.73), [Link](https://aclanthology.org/2025.emnlp-industry.73/)
- \[16\] H. Rashkin, V. Nikolaev, M. Lamm, L. Aroyo, M. Collins, D. Das, S. Petrov, G. S. Tomar, I. Turc, and D. Reitter (2023) Measuring attribution in natural language generation models. Computational Linguistics 49(4), pp. 777–840. [Document](https://dx.doi.org/10.1162/coli%5Fa%5F00486), [Link](https://aclanthology.org/2023.cl-4.2/)
- \[17\] P. Roit, J. Ferret, L. Shani, R. Aharoni, G. Cideron, R. Dadashi, M. Geist, S. Girgin, L. Hussenot, O. Keller, N. Momchev, S. Ramos Garea, P. Stanczyk, N. Vieillard, O. Bachem, G. Elidan, A. Hassidim, O. Pietquin, and I. Szpektor (2023) Factually consistent summarization via reinforcement learning with textual entailment feedback. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, pp. 6252–6272. [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.344), [Link](https://aclanthology.org/2023.acl-long.344/)
- \[18\] Y. Song, Y. Kim, and M. Iyyer (2024) VERISCORE: evaluating the factuality of verifiable claims in long-form text generation. In Findings of the Association for Computational Linguistics: EMNLP. [Link](https://arxiv.org/abs/2406.19276)
- \[19\] L. Tang, P. Laban, and G. Durrett (2024) MiniCheck: efficient fact-checking of LLMs on grounding documents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, pp. 8818–8847. [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.499), [Link](https://aclanthology.org/2024.emnlp-main.499/)
- \[20\] ThoughtProof (2026-04) Reasoning verification capability — verifying AI output correctness through MCP. Note: MCP GitHub Discussion #2574. [Link](https://github.com/modelcontextprotocol/modelcontextprotocol/discussions/2574)
- \[21\] D. Wan, E. Hirsch, E. Stengel-Eskin, I. Dagan, and M. Bansal (2025) GenerationPrograms: fine-grained attribution with executable programs. arXiv:2506.14580, [Link](https://arxiv.org/abs/2506.14580)
- \[22\] X. Yue, B. Wang, Z. Chen, K. Zhang, Y. Su, and H. Sun (2023) Automatic evaluation of attribution by large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, pp. 4615–4635. [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.307), [Link](https://aclanthology.org/2023.findings-emnlp.307/)
- \[23\] Y. Zha, Y. Yang, R. Li, and Z. Hu (2023) AlignScore: evaluating factual consistency with a unified alignment function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, Canada, pp. 11328–11348. [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.634), [Link](https://aclanthology.org/2023.acl-long.634/)
- \[24\] Q. Zhang, Z. Xiang, Y. Xiao, L. Wang, J. Li, X. Wang, and J. Su (2025) FaithfulRAG: fact-level conflict modeling for context-faithful retrieval-augmented generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. [Link](https://aclanthology.org/2025.acl-long.1062/)
## Appendix A Illustrative Escaped JSON Schema Fragments

```
{
  "question_id": "clin-real-v2-shuf-003",
  "user_question": "Summarize beta blockers in heart failure with atrial fibrillation.",
  "category": "literature_review",
  "route": "pubmed_research",
  "tool_calls": [
    "search_pubmed_key_words",
    "get_pubmed_article_metadata"
  ],
  "full_tool_outputs": [
    {
      "tool_name": "get_pubmed_article_metadata",
      "source_id": "tool_output::get_pubmed_article_metadata",
      "source_role": "complete_tool_output",
      "chunk_count": 3,
      "text": "Complete tool output for get_pubmed_article_metadata.\\n[1] ..."
    },
    {
      "tool_name": "search_pubmed_key_words",
      "source_id": "tool_output::search_pubmed_key_words",
      "source_role": "complete_tool_output",
      "chunk_count": 12,
      "text": "Complete tool output for search_pubmed_key_words.\\n[1] ..."
    }
  ],
  "original_evidence_chunk_count": 15,
  "assistant_answer": "A source-grounded answer summarizing the literature...",
  "final_reply_to_user": "A source-grounded answer summarizing the literature...",
  "elapsed_s": 19.97
}
```

## Appendix B Random-Forest Calibration Input and Output

```
{
  "claim_id": "clin-real-v2-shuf-014::c01",
  "candidate_claim": "The patient is Carla Mendoza.",
  "routed_source": {
    "source_id": "tool_output::load_patient_history",
    "tool_name": "load_patient_history",
    "route_score": 0.224,
    "routing_margin": 0.224
  },
  "calibrator_input": {
    "categorical_features": {
      "verifier_label": "entailment",
      "category": "chart_plus_literature",
      "pred_tool": "load_patient_history",
      "status": "routed_entailment"
    },
    "numeric_features": {
      "route_score": 0.224,
      "pred_score": 0.927,
      "routing_margin": 0.224,
      "term_overlap": 1.000,
      "all_term_overlap": 1.000,
      "claim_terms_n": 3,
      "protected_n": 1,
      "protected_missing": 0,
      "protected_missing_all": 0,
      "evidence_chunks_n": 4,
      "claim_len": 5,
      "has_stated_attr": 0,
      "score_x_route": 0.208,
      "score_x_overlap": 0.927,
      "route_x_overlap": 0.224,
      "neutral": 0,
      "entailment": 1,
      "contradiction": 0,
      "weak_route_high_score": 0,
      "strong_route": 0,
      "score_x_route_squared": 0.043
    }
  },
  "calibrator": {
    "model": "RandomForestClassifier",
    "n_estimators": 400,
    "max_depth": 5,
    "min_samples_leaf": 8,
    "class_weight": "balanced",
    "threshold": 0.650
  },
  "calibrator_output": {
    "support_probability": 0.9996,
    "support_decision": "supported",
    "pred_relation": "entailment",
    "pred_source_id": "tool_output::load_patient_history",
    "pred_tool_id": "load_patient_history"
  },
  "downstream_use": {
    "support_passes_to_attribution_matching": true,
    "answer_level_policy": "block if any claim fails support or attribution"
  }
}
```

## Appendix C Escaped Claim, Routing, NLI, and Repair Example

```
{
  "trace_evidence": {
    "tool_name": "load_patient_history",
    "source_id": "tool_output::load_patient_history",
    "text": "id: bench-6; name: Felipe Costa; label: Asthma; status: active."
  },
  "atomic_claim": {
    "text": "Felipe Costa (bench-6) has active condition Asthma",
    "stated_attribution": "According to PubMed evidence."
  },
  "routing": {
    "routed_source_id": "tool_output::load_patient_history",
    "tool_name": "load_patient_history",
    "score": 0.637,
    "margin": 0.413
  },
  "nli": {
    "relation": "entailment",
    "protected_values": ["bench-6"],
    "unsupported_content_token_indices": []
  },
  "attribution": {
    "status": "conflation",
    "reason": "Supported by patient-history evidence, not PubMed evidence."
  },
  "rarr": {
    "status": "corrected",
    "replacement": "Patient history: Felipe Costa (bench-6) has active Asthma."
  }
}
```