KG-Guard: Graph-Based Hallucination Detection for Knowledge Base Question Answering
Summary
KG-Guard is a lightweight graph-based framework for detecting hallucinations in LLM-based knowledge base question answering. It treats the LLM as a black box and uses a graph encoder with an MLP classifier to identify hallucinated answer nodes, outperforming baselines while having far fewer parameters.
# KG-Guard: Graph-Based Hallucination Detection for Knowledge Base Question Answering
Source: [https://arxiv.org/html/2606.00328](https://arxiv.org/html/2606.00328)
Albert Sawczyn¹,†, Piotr Bielak¹, Tomasz Kajdanowicz¹
¹Department of Artificial Intelligence, Wrocław University of Science and Technology, Wrocław, Poland
†albert.sawczyn@pwr.edu.pl
###### Abstract
Large language models (LLMs) are increasingly used for knowledge base question answering (KBQA), where answering requires selecting entities from a question-specific knowledge-graph subgraph. Yet LLMs are known to hallucinate across tasks, and KBQA is no exception: even when we provide a graph as the knowledge source, the model may rely on parametric knowledge instead of graph evidence or perform invalid reasoning over the given relations. Such hallucinated answer nodes can limit the practical deployment of KBQA systems, especially in high-stakes domains such as healthcare. We formulate hallucination detection in KBQA as an answer-node classification problem and propose a lightweight graph-based framework that treats the answering LLM as a black box. KG-Guard represents each KBQA instance as an augmented graph. It initializes node features with semantic representations of KG entities, marks topic entities and LLM-proposed answer nodes with learned vectors, and connects a virtual question node to the topic entities. A graph encoder then produces verification-oriented node representations, and a small MLP classifies each proposed answer node using its graph representation together with the question embedding. Experiments on WebQSP, ComplexWebQuestions, and PUGG show that our detector achieves the highest F1 on all three benchmarks (82.0, 87.4, and 84.3), outperforming LLM-as-judge and sampling-based baselines, while having $\sim 305\times$ fewer parameters than the reference approaches. Beyond detection, the node-level feedback is actionable: when flagged answers are fed back to the KBQA system for iterative refinement, downstream KBQA F1 improves by 13.0–14.5 points and Exact Match by 16.9–17.6 points.
## 1 Introduction
Large language models (LLMs) are increasingly used as reasoning components in knowledge-intensive applications. One important setting is knowledge-base question answering (KBQA), where the model must return the entity nodes from a knowledge graph that answer a given question (Lan et al., [2023](https://arxiv.org/html/2606.00328#bib.bib3)). Earlier approaches queried the full graph symbolically; most LLM-based pipelines instead retrieve a question-specific subgraph and reason over it. For example, given “What is the capital of Australia?”, the system should select *Canberra* from a subgraph built around the topic entity *Australia*. This graph structure provides an explicit grounding space whose entities and relations are easier to curate, update, and inspect than unstructured documents. Earlier KBQA systems often relied on graph-centric architectures, whereas recent pipelines increasingly use LLMs to select answer entities from retrieved subgraphs (Ma et al., [2025](https://arxiv.org/html/2606.00328#bib.bib4); Baek et al., [2023](https://arxiv.org/html/2606.00328#bib.bib9); He et al., [2024](https://arxiv.org/html/2606.00328#bib.bib17)).
LLM-based pipelines improve language understanding but also introduce hallucinations (Huang et al., [2025](https://arxiv.org/html/2606.00328#bib.bib1)). In KBQA, the LLM may rely on parametric knowledge instead of the retrieved subgraph or reason incorrectly over graph facts, leading to wrong answer nodes. To the best of our knowledge, hallucination detection for LLM-based KBQA outputs remains underexplored: prior work focuses mainly on answering KBQA or on hallucination detection in other QA settings.
Existing detectors typically treat hallucination as a text\-level problem: they ignore the KBQA subgraph, do not classify individual answer nodes, or require white\-box access to internal LLM signals unavailable through closed APIs\. We therefore treat hallucination detection in KBQA as a graph learning problem\. Returned nodes leave structured signals in the retrieved subgraph: hallucinated and factual nodes may differ in their connections to topic entities and in their question\-relevant local neighborhoods\.
We propose KG-Guard, a graph-based hallucination detector for KBQA that treats the answering LLM as a black box. Given a question, retrieved subgraph, and LLM-proposed answer nodes, KG-Guard builds semantic node and question representations, marks topic entities and answer nodes, adds a virtual question node connected to topic entities, and runs a lightweight graph encoder. Each returned node is classified from its graph representation and the question embedding using a small MLP. Thus, KG-Guard uses only the retrieved graph and LLM outputs, without internal states or activations (Chen et al., [2024](https://arxiv.org/html/2606.00328#bib.bib28); Binkowski et al., [2025](https://arxiv.org/html/2606.00328#bib.bib15)), and avoids the extra LLM calls required by judge-based (Zheng et al., [2023](https://arxiv.org/html/2606.00328#bib.bib23)) or sampling-based detectors (Manakul et al., [2023](https://arxiv.org/html/2606.00328#bib.bib12)). Its node-level feedback also supports iterative answer refinement, following evidence that fine-grained hallucination feedback can improve factuality correction (Sawczyn et al., [2026](https://arxiv.org/html/2606.00328#bib.bib13)).
[Figure 1 diagram: KBQA instance $(q,G,T)$ → LLM-based QA $f_{\mathrm{KBQA}}$ → candidate answer nodes $\hat{A}$ → KG-Guard $f_{\theta}(q,G,T,\hat{a})$, verifying each $\hat{a}\in\hat{A}$ → flagged set $\mathcal{H}=\{\hat{a}:\hat{y}_{\hat{a}}=1\}$; the loop repeats while $\mathcal{H}\neq\emptyset$ and outputs the accepted answer set $\hat{A}$ when $\mathcal{H}=\emptyset$.]

Figure 1: Role of KG-Guard in the KBQA loop. The LLM-based KBQA method maps $(q,G,T)$ to candidate answer nodes $\hat{A}$. KG-Guard labels returned nodes and feeds flagged hallucinations $\mathcal{H}$ back for targeted refinement until $\mathcal{H}=\emptyset$ or the iteration cap is reached (see Section [4.4](https://arxiv.org/html/2606.00328#S4.SS4)).

Our contributions can be summarized as follows:
- We formulate KBQA hallucination detection as answer-node classification on retrieved KG subgraphs, the first dedicated method for this problem.
- We propose KG-Guard, a lightweight black-box graph-based detector that outperforms LLM-based baselines while using $\sim 305\times$ fewer parameters.
- We show that node-level feedback signals enable iterative answer refinement (Fig. [1](https://arxiv.org/html/2606.00328#S1.F1)), improving downstream KBQA F1 by 13.0–14.5 pp. and Exact Match by 16.9–17.6 pp.
- We evaluate on WebQSP, CWQ, and PUGG against LLM-as-judge and sampling-based baselines, with ablations validating each architectural design choice.
## 2 Related Work
Recent KBQA systems increasingly combine language models with retrieved graph evidence. KAPING augments prompts with relevant KG facts for zero-shot question answering (Baek et al., [2023](https://arxiv.org/html/2606.00328#bib.bib9)). Earlier graph QA models, such as GRAFT-Net, QA-GNN, and GreaseLM, study question-aware reasoning over question-specific subgraphs or joint language–graph representations (Sun et al., [2018](https://arxiv.org/html/2606.00328#bib.bib16); Yasunaga et al., [2021](https://arxiv.org/html/2606.00328#bib.bib10); Zhang et al., [2021](https://arxiv.org/html/2606.00328#bib.bib11)). A related approach, G-Retriever, targets open-ended QA over KGs, applying RAG to construct query-relevant subgraphs via Prize-Collecting Steiner Tree (PCST) optimization (He et al., [2024](https://arxiv.org/html/2606.00328#bib.bib17)). More recent agentic approaches traverse or plan over KG reasoning paths iteratively to produce grounded answers (Sun et al., [2024](https://arxiv.org/html/2606.00328#bib.bib33); Luo et al., [2024](https://arxiv.org/html/2606.00328#bib.bib34)). GNN-RAG feeds GNN-retrieved KG reasoning paths to an LLM for answer generation (Mavromatis and Karypis, [2025](https://arxiv.org/html/2606.00328#bib.bib35)). All of these are KBQA methods: they aim to predict answer entities using graph evidence. Our goal is orthogonal: given candidate answer nodes from an external LLM-based KBQA system, we ask whether they are hallucinated, a problem that, to our knowledge, has not been studied previously.
Hallucination detection has largely been studied in free-text settings. Black-box sampling-based methods such as SelfCheckGPT estimate factuality by sampling multiple generations (Manakul et al., [2023](https://arxiv.org/html/2606.00328#bib.bib12)). Fine-grained methods verify outputs at the level of individual facts or claims (Min et al., [2023](https://arxiv.org/html/2606.00328#bib.bib36); Sawczyn et al., [2026](https://arxiv.org/html/2606.00328#bib.bib13)); our work shares this spirit by classifying individual answer nodes rather than whole generations. Another line of work exploits internal model signals such as hidden states (Azaria and Mitchell, [2023](https://arxiv.org/html/2606.00328#bib.bib30); Chen et al., [2024](https://arxiv.org/html/2606.00328#bib.bib28); Kossen et al., [2024](https://arxiv.org/html/2606.00328#bib.bib29); Farquhar et al., [2024](https://arxiv.org/html/2606.00328#bib.bib26)) or attention maps (Chuang et al., [2024](https://arxiv.org/html/2606.00328#bib.bib31); Sriramanan et al., [2024](https://arxiv.org/html/2606.00328#bib.bib27); Binkowski et al., [2025](https://arxiv.org/html/2606.00328#bib.bib15)). While valuable for the general domain, these methods are not designed for KBQA: they classify whole generations rather than individual answer nodes and ignore the available KG. Additionally, the methods based on internal signals require access to internal LLM states, limiting their applicability to closed models.
Several recent approaches apply the KG structure to hallucination detection. GraphEval (Sansford et al., [2024](https://arxiv.org/html/2606.00328#bib.bib14)) extracts atomic claims from LLM output as KG triples and verifies each against the provided textual context via NLI models; FactAlign (Rashad et al., [2024](https://arxiv.org/html/2606.00328#bib.bib18)) and knowledge-centric detection (Hu et al., [2024](https://arxiv.org/html/2606.00328#bib.bib32)) similarly extract triples from generated text and align them with a textual reference. GraphCheck constructs KGs from both claim and source document, then applies GNNs as a soft prompt to an LLM verifier (Chen et al., [2025](https://arxiv.org/html/2606.00328#bib.bib37)). All require an external textual reference and operate on free-form generated text, whereas we classify answer nodes directly on an existing structured KG.
Iterative refinement is a general strategy for correcting LLM outputs (Madaan et al., [2023](https://arxiv.org/html/2606.00328#bib.bib39); Dhuliawala et al., [2024](https://arxiv.org/html/2606.00328#bib.bib40); Sawczyn et al., [2026](https://arxiv.org/html/2606.00328#bib.bib13)). KGR uses direct triple lookup in a KG to guide revision (Guan et al., [2024](https://arxiv.org/html/2606.00328#bib.bib38)). Our correction procedure identifies hallucinated answer nodes via a trained graph-encoder classifier rather than triple-level conflict resolution.
Message-passing GNNs compute node representations by repeatedly aggregating information from local neighborhoods, as in GCN, GraphSAGE, and GIN (Kipf and Welling, [2017](https://arxiv.org/html/2606.00328#bib.bib20); Hamilton et al., [2017](https://arxiv.org/html/2606.00328#bib.bib21); Xu et al., [2019](https://arxiv.org/html/2606.00328#bib.bib22)). For our task, attention-based graph encoders are especially natural because only part of the retrieved graph may be relevant. Graph Attention Networks (GAT) learn neighbor-specific attention weights (Veličković et al., [2018](https://arxiv.org/html/2606.00328#bib.bib19)), while GraphTransformer extends this with multi-head dot-product attention for expressive message passing (Shi et al., [2021](https://arxiv.org/html/2606.00328#bib.bib41)).
## 3 Method
### 3.1 Problem Formulation
We consider a KBQA task in which each example is associated with a natural-language question and a question-specific subgraph extracted from a knowledge graph. Formally, each instance is represented as $(q,G,T,A^{*})$, where $q$ is the question and $G=(V,E)$ is the retrieved subgraph. Here, $V$ is the set of nodes and each edge $e=(u,r,v)\in E$ corresponds to a directed KG triple from node $u$ to node $v$ with relation label $r\in\mathcal{R}$, where $\mathcal{R}$ denotes the set of relation labels. We treat these relations as edge attributes rather than as a fixed heterogeneous graph schema, which allows relation labels to be encoded from their text. The set $T\subseteq V$ contains the topic entities, i.e., nodes mentioned in the question, and $A^{*}\subseteq V$ contains the gold answer nodes.
Given $(q,G,T)$, an LLM-based KBQA method $f_{\mathrm{KBQA}}$ is applied to answer the question:

$$\hat{A}=f_{\mathrm{KBQA}}(q,G,T),\qquad \hat{A}\subseteq V.$$

If the method abstains with a special *unknown* response, we set $\hat{A}=\emptyset$. In this work, we only consider examples for which the method returns at least one node, i.e., $\hat{A}\neq\emptyset$. This restriction matches the goal of our detector, which is to verify concrete answer nodes rather than abstentions.
The hallucination detection task is then formulated as a binary classification problem over individual returned nodes. If the LLM returns multiple nodes for a single question, we decompose this output into separate detection instances, one per predicted node. Thus, for each retained KBQA example $(q,G,T,A^{*},\hat{A})$ and each $\hat{a}\in\hat{A}$, we create one classification instance $(q,G,T,A^{*},\hat{a})$ and define the target label $y\in\{0,1\}$ as

$$y=\begin{cases}0, & \text{if } \hat{a}\in A^{*},\\ 1, & \text{otherwise,}\end{cases}$$

where $y=1$ denotes a hallucinated answer node and $y=0$ denotes a factual one. This definition naturally handles questions with multiple valid answers: a returned node is treated as factual if it matches any gold answer node, and hallucinated otherwise. Consequently, a single KBQA example may yield multiple hallucination-detection instances, corresponding to the different nodes proposed by the LLM. In a deployed system, however, the detector can process the graph once (single forward pass) and mark which of the returned nodes are hallucinated.
Our goal is to learn a detector $f_{\theta}(q,G,T,\hat{a})\rightarrow\{0,1\}$ that predicts whether an individual node selected by $f_{\mathrm{KBQA}}$ is factual or hallucinated.
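The decomposition of one KBQA example into per-node detection instances can be sketched as follows. This is a minimal illustration of the labeling rule above; the function name and the example nodes are hypothetical and not taken from the paper's code.

```python
from typing import Hashable, Iterable, List, Tuple


def make_detection_instances(
    predicted: Iterable[Hashable], gold: Iterable[Hashable]
) -> List[Tuple[Hashable, int]]:
    """Return one (node, label) pair per returned node; label 1 = hallucinated."""
    gold_set = set(gold)
    seen = set()
    instances = []
    for node in predicted:
        if node in seen:  # A_hat is a set, so repeated nodes are ignored
            continue
        seen.add(node)
        # y = 0 if the node matches any gold answer, y = 1 otherwise
        instances.append((node, 0 if node in gold_set else 1))
    return instances


# Hypothetical example: two returned nodes, one of which is a gold answer.
instances = make_detection_instances(["Canberra", "Sydney"], ["Canberra"])
print(instances)
```

An abstention (empty prediction list) yields no instances, which matches the restriction to examples with a non-empty returned set.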
### 3.2 KG-Guard (Hallucination Detector)
[Figure 2 diagram: the augmented graph $\widetilde{G}$ with virtual question node $v_q$ and node features $x_v=[\phi(t_v)\,\|\,M_T[\tau_v]\,\|\,M_A[\alpha_v]]$ (cyan: topic entities; orange: LLM-returned nodes) is passed to the graph encoder $g_\theta(\cdot)$, which yields the answer-node embedding $h_{\hat{a}}$; the question $q$ is passed to the text encoder $\phi(\cdot)$, which yields the question embedding $z_q$; the concatenation $[h_{\hat{a}}\,\|\,z_q]$ is passed to $\mathrm{MLP}_\psi(\cdot)$, which outputs hallucinated ($y=1$) or factual ($y=0$).]

Figure 2: KG-Guard architecture for labeling LLM-returned nodes. Node features combine semantic node representations with topic-entity marks $M_T$ and answer-node marks $M_A$. A virtual question node $v_q$ is connected to the topic entities with directed edges. The graph encoder $g_\theta$ computes answer-node representations $h_{\hat{a}}$, which are concatenated with the question embedding $z_q$ and passed to an MLP to predict whether the returned node is hallucinated ($y=1$) or factual ($y=0$).

Our method (see Figure [2](https://arxiv.org/html/2606.00328#S3.F2)) operates on top of the answer returned by the LLM-based KBQA method for a given KBQA instance (assuming $\hat{A}\neq\emptyset$).
We start from the retrieved subgraph and construct node features by combining semantic node information with task-specific marks. Knowledge-graph nodes typically do not come with informative numeric features, so we initialize them from text. Let $t_v$ denote the textual representation of node $v$, such as its label or name, and let $\phi(\cdot)$ be a semantic text encoder.
We then augment these semantic features with trainable mark embeddings. Let $\tau_v=\mathbb{I}[v\in T]$ indicate whether node $v$ is a topic entity, and let $\alpha_v=\mathbb{I}[v\in\hat{A}]$ indicate whether it was returned by the LLM-based KBQA method. We use two trainable lookup tables, $M_T\in\mathbb{R}^{2\times d_T}$ and $M_A\in\mathbb{R}^{2\times d_A}$, for topic-entity and answer-node marks, respectively. Each table contains one embedding for the positive case and one embedding for the negative case.
The initial feature vector of node $v$ is then:

$$x_v=\left[\phi(t_v)\,\|\,M_T[\tau_v]\,\|\,M_A[\alpha_v]\right],$$

where $\|$ denotes concatenation. Thus, each node representation contains three parts: a semantic representation of the node text, a learnable mark indicating whether the node is a topic entity, and a learnable mark indicating whether the node was proposed as an answer. We use the same text encoder to obtain the question embedding $z_q=\phi(q)$.
To condition message passing on the question, we augment the graph with a virtual question node $v_q$ initialized with $z_q$. This node is connected only to the topic entities $T$, rather than to every node in the retrieved subgraph. Let $\widetilde{G}$ and $\widetilde{X}$ denote the resulting augmented graph and its node features.
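The feature construction and graph augmentation described above can be sketched with NumPy. This is an illustrative sketch under several assumptions that the text does not specify: `toy_text_encoder` is a deterministic stand-in for the real text encoder $\phi$, the mark tables are random rather than trained, the virtual node's mark slots are zero-padded so that all rows share one width, and the added edges point from $v_q$ to each topic entity.

```python
import numpy as np


def toy_text_encoder(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for phi: a deterministic pseudo-embedding, NOT a real encoder."""
    seed = sum(ord(c) * (i + 1) for i, c in enumerate(text)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)


def build_augmented_graph(node_texts, edges, topic, answers, question, M_T, M_A, dim=8):
    """Return (X, edge_list, q_index) for the marked, augmented graph.

    node_texts: node labels t_v; edges: (u, v) index pairs;
    topic / answers: sets of node indices (T and A_hat).
    """
    rows = []
    for v, text in enumerate(node_texts):
        tau, alpha = int(v in topic), int(v in answers)
        # x_v = [phi(t_v) || M_T[tau_v] || M_A[alpha_v]]
        rows.append(np.concatenate([toy_text_encoder(text, dim), M_T[tau], M_A[alpha]]))
    z_q = toy_text_encoder(question, dim)
    # Assumption: zero-pad the mark slots of the virtual question node.
    pad = np.zeros(M_T.shape[1] + M_A.shape[1])
    X = np.vstack(rows + [np.concatenate([z_q, pad])])
    q_index = len(node_texts)
    # Assumption: directed edges from the virtual node to each topic entity.
    aug_edges = list(edges) + [(q_index, t) for t in sorted(topic)]
    return X, aug_edges, q_index


rng = np.random.default_rng(0)
M_T = rng.standard_normal((2, 2))  # rows: [not topic, topic]; trainable in practice
M_A = rng.standard_normal((2, 2))  # rows: [not returned, returned]
X, aug_edges, q_index = build_augmented_graph(
    ["Australia", "Canberra", "Sydney"], [(0, 1), (0, 2)],
    topic={0}, answers={1}, question="What is the capital of Australia?",
    M_T=M_T, M_A=M_A,
)
print(X.shape, aug_edges)
```

Relation labels, which the paper treats as edge attributes, are omitted here to keep the sketch short.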
The marked and augmented graph is then processed by a graph encoder $g_\theta$ to obtain node-level representations $H=g_\theta(\widetilde{G},\widetilde{X})$, where $H=\{h_v\}_{v\in V}$. We instantiate $g_\theta$ either as a GraphTransformer or as a GAT, with the GraphTransformer used as the main configuration in our experiments. For a particular returned answer node $\hat{a}$, we extract its final representation $h_{\hat{a}}$ and concatenate it with the question embedding $z_q$. The resulting vector is passed through an MLP classifier to obtain a scalar logit:

$$s_{\hat{a}}=\mathrm{MLP}_{\psi}\left([h_{\hat{a}}\,\|\,z_q]\right).$$

The final hallucination score is obtained as

$$\hat{p}(y=1\mid q,G,\hat{a})=\sigma(s_{\hat{a}}),$$

with $y=1$ denoting a hallucinated answer node. Thus, the detector combines graph structure around the returned node with the semantics of the question, while remaining independent of the internal activations of the answering LLM.
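The classification head can be illustrated with a small NumPy sketch. The two-layer shape, the ReLU activation, the dimensions, and the random (untrained) weights are all assumptions made for illustration; `h_a` stands in for the graph-encoder output, which is not computed here.

```python
import numpy as np


def hallucination_probability(h_a, z_q, W1, b1, w2, b2):
    """Compute sigma(MLP([h_a || z_q])) for one returned answer node."""
    features = np.concatenate([h_a, z_q])          # [h_a || z_q]
    hidden = np.maximum(0.0, W1 @ features + b1)   # ReLU is an assumption
    logit = float(w2 @ hidden + b2)                # scalar logit s_a
    return 1.0 / (1.0 + np.exp(-logit))            # sigmoid -> p(y = 1)


rng = np.random.default_rng(0)
h_a, z_q = rng.standard_normal(16), rng.standard_normal(8)  # hypothetical sizes
W1, b1 = rng.standard_normal((32, 24)), np.zeros(32)
w2, b2 = rng.standard_normal(32), 0.0

p = hallucination_probability(h_a, z_q, W1, b1, w2, b2)
flagged = p >= 0.5  # the 0.5 threshold is adjustable
print(round(float(p), 3), bool(flagged))
```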
## 4 Experimental Setup
Extended implementation details are provided in Appendix [A](https://arxiv.org/html/2606.00328#A1).
### 4.1 Evaluation Data
Training and evaluating our detector requires tuples $(q,G,A^{*},\hat{A})$ combining a question, a retrieved subgraph, gold answer nodes, and LLM-predicted answer nodes. No existing benchmark provides all four components. Therefore, we construct the evaluation data by running a full KBQA pipeline on three established benchmarks, while recording both gold and predicted answer nodes as detection instances. In practical deployment, direct hallucination annotation is an equally valid and less expensive alternative to annotating a KBQA dataset and then deriving the hallucination labels.
#### 4.1.1 KBQA Benchmarks
We evaluate on three KBQA benchmarks that vary in question complexity, underlying knowledge graph, and language\.
##### WebQuestionsSP
(WebQSP) (Yih et al., [2016](https://arxiv.org/html/2606.00328#bib.bib5)) (License: CC-BY 4.0.) is a standard single-hop factoid KBQA benchmark grounded in the Freebase knowledge graph (Bollacker et al., [2007](https://arxiv.org/html/2606.00328#bib.bib6)). Questions are naturally phrased real-world queries collected from Google Suggest (Berant et al., [2013](https://arxiv.org/html/2606.00328#bib.bib2)). Because the official validation split is small (246 questions), we supplement it with questions sampled from the training set. Our validation set size is 500 questions.
##### ComplexWebQuestions
(CWQ) (Talmor and Berant, [2018](https://arxiv.org/html/2606.00328#bib.bib7)) (License: GPL v2+.) extends WebQSP by programmatically appending compositional constraints to WebQSP SPARQL queries, then paraphrasing the resulting questions via crowdworkers. This procedure produces questions requiring logical reasoning, making CWQ substantially harder than WebQSP. It is also grounded in Freebase and is the largest of our three benchmarks.
##### PUGG
(Sawczyn et al., [2024](https://arxiv.org/html/2606.00328#bib.bib8)) (License: CC BY-SA 4.0.) is a Polish KBQA dataset grounded in the Wikidata knowledge graph (Vrandečić and Krötzsch, [2014](https://arxiv.org/html/2606.00328#bib.bib48)), combining naturally collected (Google Suggest) and template-based (SPARQL-paired templates and paraphrasing) questions. We chose this dataset to evaluate on a non-English language and a different KG. Because Wikidata hub nodes inflate subgraph sizes, we exclude nodes connected to more than 1000 other entities as a preprocessing step before subgraph retrieval.
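The hub-node filter can be sketched as below. This is an illustration only: it assumes that "connected to" counts distinct neighbors in either edge direction, and the function name and toy triples are hypothetical.

```python
from collections import defaultdict


def remove_hub_nodes(triples, max_neighbors=1000):
    """Drop triples that touch a node linked to more than max_neighbors entities."""
    neighbors = defaultdict(set)
    for head, _relation, tail in triples:
        # Assumption: neighbors are counted regardless of edge direction.
        neighbors[head].add(tail)
        neighbors[tail].add(head)
    hubs = {node for node, linked in neighbors.items() if len(linked) > max_neighbors}
    kept = [t for t in triples if t[0] not in hubs and t[2] not in hubs]
    return kept, hubs


# Toy example with a low cap so that "hub" exceeds it.
toy_triples = [
    ("hub", "r", "a"), ("hub", "r", "b"), ("hub", "r", "c"),
    ("a", "knows", "b"),
]
kept, hubs = remove_hub_nodes(toy_triples, max_neighbors=2)
print(kept, hubs)
```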
#### 4.1.2 Hallucination Dataset Construction
Table 1: Dataset statistics for hallucination detection. Each KBQA question may yield multiple answer nodes proposed by the LLM (#Answers); each node is labeled hallucinated (%Hallu.), correct (%Correct), or abstained (%Abstained). Abstained responses are counted in #Answers but excluded from the classification task.

##### Subgraph retrieval
Both KBQA and hallucination detection require a question-relevant subgraph from the KG; we apply the same retrieval procedure for both tasks. For all three datasets, we follow the G-Retriever pipeline (He et al., [2024](https://arxiv.org/html/2606.00328#bib.bib17)): candidate neighbors are first collected by BFS expansion from topic entities, and a Prize-Collecting Steiner Tree (PCST) formulation is then solved to select a compact, question-relevant subgraph. We use the same PCST hyperparameters as in the original G-Retriever pipeline. Subgraph statistics are reported in Appendix [E.1](https://arxiv.org/html/2606.00328#A5.SS1) (Table [13](https://arxiv.org/html/2606.00328#A5.T13)).
##### KBQA method
We use KAPING (Baek et al., [2023](https://arxiv.org/html/2606.00328#bib.bib9)) as the LLM-based KBQA method. We extend KAPING to return, in addition to the answer nodes, a textual reasoning summary and a list of reasoning triples (the KG triples cited as evidence). Our approach does not use these elements; they serve only as additional context for the LLM-based baselines (Section [4.2](https://arxiv.org/html/2606.00328#S4.SS2)). The answering LLM is constrained to return only nodes present in the retrieved subgraph via structured generation; it may also abstain with an *unknown* response. The full prompt template is provided in the code repository. As discussed in Section [3.1](https://arxiv.org/html/2606.00328#S3.SS1), abstaining examples are excluded from the hallucination-detection task. The statistics, including the fraction of unknown responses, are reported in Table [1](https://arxiv.org/html/2606.00328#S4.T1). Additional statistics of KBQA answers are reported in Appendix [E.2](https://arxiv.org/html/2606.00328#A5.SS2) and [E.3](https://arxiv.org/html/2606.00328#A5.SS3).
### 4.2 Baselines
We compare against three groups of baselines covering trivial predictors, LLM-based detection, and sampling-based consistency analysis. All selected baselines are black-box methods that operate without access to model internals, enabling a direct comparison with our approach. This also mirrors the common real-world deployment setting in which the LLM is accessed as a service.
##### Trivial baselines
*Random* assigns the hallucinated label uniformly at random; *MostFrequent* always predicts the majority class in the validation split. These establish lower bounds and quantify the effect of class imbalance on each metric.
##### Llama\-4\-Scout\-17B
We use Llama-4-Scout (AI, [2025](https://arxiv.org/html/2606.00328#bib.bib24)) as an LLM-as-judge baseline (Zheng et al., [2023](https://arxiv.org/html/2606.00328#bib.bib23)). We evaluate three prompt variants supplying progressively richer context: (1) *Graph* (G): question and subgraph serialized as a triple list; (2) *Graph + Reasoning Summary* (G+RS): the above plus the answering LLM’s reasoning summary; (3) *Graph + Reasoning Summary + Reasoning Triples* (G+RS+RT): the above plus the answering LLM’s reasoning triples. While straightforward to deploy, this approach is computationally expensive: Llama-4-Scout has 109B total parameters (17B active in its MoE architecture), requiring substantial GPU memory per detection call.
##### GPT\-5\-mini
We evaluate GPT-5-mini (Singh et al., [2026](https://arxiv.org/html/2606.00328#bib.bib25)) as a judge using two configurations: G and G+RS+RT. With Llama-4-Scout as the judge, both configurations outperformed G+RS by a substantial margin on average. Due to commercial API costs, we restrict evaluation to these two configurations.
##### SelfCheckGPT
We adapt the sampling-based method of Manakul et al. ([2023](https://arxiv.org/html/2606.00328#bib.bib12)) to structured KBQA outputs. We draw $N$ independent predictions from the LLM-based KBQA method with temperature $T=1.0$. We then compute the fraction of predictions that do *not* include a given node as the hallucination score. We evaluate with $N\in\{5,10\}$ to assess sensitivity to sample count. This baseline represents a widely used black-box hallucination detection method and is the most expensive computationally, requiring $N$ forward passes of the LLM-based KBQA method per example.
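The adapted SelfCheckGPT score can be sketched as follows. The sampled answer sets below are made up for illustration; in practice each set would come from one stochastic run of the KBQA method.

```python
def selfcheck_scores(candidate_nodes, sampled_predictions):
    """Hallucination score per node: fraction of samples that do NOT contain it."""
    n = len(sampled_predictions)
    if n == 0:
        raise ValueError("at least one sampled prediction is required")
    return {
        node: sum(node not in sample for sample in sampled_predictions) / n
        for node in candidate_nodes
    }


# Hypothetical N = 5 sampled answer sets for one question.
samples = [
    {"Canberra"}, {"Canberra", "Sydney"}, {"Canberra"}, {"Melbourne"}, {"Canberra"},
]
scores = selfcheck_scores(["Canberra", "Sydney"], samples)
print(scores)
```

A node that the sampled runs rarely reproduce receives a score close to 1 and is therefore treated as more likely hallucinated.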
### 4.3 Evaluation
For hallucination detection, we report F1 as the primary metric, computed over predicted answer nodes only; Precision, Recall, Accuracy, and AUC-PR are reported in Appendix [D.1](https://arxiv.org/html/2606.00328#A4.SS1). Each configuration is trained with 3 random seeds; we report the mean and standard deviation across seeds. For methods that rely on LLM calls, we report single-run results due to the high inference cost.
### 4.4 Hallucination Correction
While our primary focus is hallucination detection, we also examine its potential for answer correction. KG-Guard can guide answer revision within a larger system. We investigate whether feeding KG-Guard predictions back to the LLM enables it to revise its own hallucinated responses.
Starting from the initial KAPING answers, we iterate the following steps: (1) KG-Guard scores each predicted answer node. (2) Examples with no nodes flagged as hallucinated are marked resolved, and their answers remain unchanged. (3) For each active example (i.e., with at least one flagged answer node), the LLM is re-prompted in a chat-style conversation consisting of a system instruction, the original question and subgraph, the LLM’s most recent response, and a follow-up message listing the flagged answer node names. Steps 1–3 repeat until all examples are resolved or the iteration cap (max. 5) is reached. We use Llama-4-Scout as the refinement LLM.
We report downstream KBQA metrics \(F1, Precision, Recall, and Exact Match\) at the initial step and after refinement on all three datasets, together with the distribution of refinement iterations\.
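The refinement loop for a single example can be sketched as below. The `detector` and `reprompt` callables are hypothetical stand-ins for KG-Guard and the chat-style LLM call, and the toy stubs exist only to make the sketch runnable.

```python
def refine(question, subgraph, answers, detector, reprompt, max_iters=5):
    """Iteratively revise answers; return (final_answers, refinement_steps_used)."""
    answers = list(answers)
    for step in range(max_iters):
        # (1) score each predicted node; keep those flagged as hallucinated
        flagged = [a for a in answers if detector(question, subgraph, a)]
        # (2) nothing flagged: the example is resolved
        if not flagged:
            return answers, step
        # (3) re-prompt the LLM with its latest response and the flagged nodes
        answers = list(reprompt(question, subgraph, answers, flagged))
    return answers, max_iters  # iteration cap reached


# Toy stand-ins (assumptions): the detector flags every node except "Canberra";
# the "LLM" drops flagged nodes and falls back to "Canberra" if nothing is left.
def toy_detector(q, g, node):
    return node != "Canberra"


def toy_reprompt(q, g, answers, flagged):
    return [a for a in answers if a not in flagged] or ["Canberra"]


final, steps = refine(
    "What is the capital of Australia?", None, ["Sydney"], toy_detector, toy_reprompt
)
print(final, steps)
```

With `max_iters=1`, the same function performs a single refinement step.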
## 5 Results
Ablation studies on the graph encoder and input modalities are presented in Appendix[B](https://arxiv.org/html/2606.00328#A2)\.
### 5.1 Hallucination Detection
Hallucination detection results are summarized in Table [2](https://arxiv.org/html/2606.00328#S5.T2) and full results are in Appendix [D.1](https://arxiv.org/html/2606.00328#A4.SS1). KG-Guard (GraphTransformer) ranks first on every dataset, achieving 82.0 on WebQSP, 87.4 on CWQ, and 84.3 on PUGG. The strongest non-trivial baseline is GPT-5-mini (G+RS+RT), scoring 22.8, 3.2, and 3.1 F1 points below KG-Guard on WebQSP, CWQ, and PUGG, respectively.
Table 2: Hallucination detection F1 comparing KG-Guard against baselines. Avg. Rank is the average rank of each method across the three datasets (lower is better). Values with ± denote mean ± std over 3 runs. G, RS, and RT denote the retrieved subgraph, reasoning summary, and reasoning triples provided to the LLM judge, respectively.

| Method | WebQSP | CWQ | PUGG | Avg. Rank |
|---|---|---|---|---|
| KG-Guard (GraphTransformer) | 82.0 ± 0.7 | 87.4 ± 0.2 | 84.3 ± 1.0 | 1.0 |
| KG-Guard (GAT) | 79.8 ± 1.2 | 86.3 ± 0.7 | 82.7 ± 2.7 | 2.3 |
| Most Frequent | 65.9 | 83.5 | 83.5 | 3.0 |
| GPT 5 Mini (G+RS+RT) | 59.2 | 84.2 | 81.2 | 4.0 |
| GPT 5 Mini (G) | 55.9 | 82.6 | 82.6 | 4.7 |
| SelfCheck (N=10) | 48.7 | 72.2 | 73.8 | 6.3 |
| SelfCheck (N=5) | 42.1 | 67.1 | 48.3 | 8.7 |
| Llama 4 Scout (G+RS+RT) | 41.0 | 61.2 | 60.8 | 9.0 |
| Llama 4 Scout (G) | 36.7 | 60.2 | 68.8 | 9.0 |
| Llama 4 Scout (G+RS) | 36.5 | 61.3 | 61.3 | 9.0 |
| Random | 49.6 ± 1.0 | 58.3 ± 0.5 | 58.8 ± 1.8 | 9.0 |

The advantage over baselines is most visible on WebQSP. All three Llama-4-Scout judge variants fall below the Random baseline on this dataset (F1 36.5–41.0 vs. Random at 49.6), and GPT-5-mini reaches only 59.2. KG-Guard, by contrast, achieves 82.0, a substantial margin over all baselines on this benchmark.
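The Avg. Rank column of Table 2 can be reproduced from the reported mean F1 values. The sketch below ranks methods per dataset (rank 1 = highest F1) and averages the ranks; it assumes no ties within a dataset, which holds for these numbers.

```python
# Mean F1 per method on (WebQSP, CWQ, PUGG), copied from Table 2.
f1 = {
    "KG-Guard (GraphTransformer)": (82.0, 87.4, 84.3),
    "KG-Guard (GAT)": (79.8, 86.3, 82.7),
    "Most Frequent": (65.9, 83.5, 83.5),
    "GPT 5 Mini (G+RS+RT)": (59.2, 84.2, 81.2),
    "GPT 5 Mini (G)": (55.9, 82.6, 82.6),
    "SelfCheck (N=10)": (48.7, 72.2, 73.8),
    "SelfCheck (N=5)": (42.1, 67.1, 48.3),
    "Llama 4 Scout (G+RS+RT)": (41.0, 61.2, 60.8),
    "Llama 4 Scout (G)": (36.7, 60.2, 68.8),
    "Llama 4 Scout (G+RS)": (36.5, 61.3, 61.3),
    "Random": (49.6, 58.3, 58.8),
}


def average_ranks(scores):
    """Average per-dataset rank of each method (1 = best); assumes no ties."""
    methods = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {m: 0 for m in methods}
    for d in range(n_datasets):
        ordered = sorted(methods, key=lambda m: scores[m][d], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_datasets for m in methods}


avg_rank = average_ranks(f1)
for method, rank in sorted(avg_rank.items(), key=lambda kv: kv[1]):
    print(f"{method}: {rank:.1f}")
```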
On CWQ and PUGG, GPT\-5\-mini reaches84\.284\.2and82\.682\.6but still ranks belowKG\-Guardon both datasets\. Among the Llama\-4\-Scout prompt variants, differences between configurations are visible, but the best variant differs per dataset, indicating that richer context does not consistently benefit judge\-based detection\. The class distribution on these datasets is more skewed – over71%71\\%of answers are hallucinated \(Table[1](https://arxiv.org/html/2606.00328#S4.T1)\)\. Due to that, MostFrequent reaches83\.583\.5F1 on both datasets, but its AUC\-PR equals the class frequency \(71\.771\.7on CWQ,71\.671\.6on PUGG\), compared to94\.294\.2and90\.090\.0forKG\-Guard\(Tables[10](https://arxiv.org/html/2606.00328#A4.T10),[11](https://arxiv.org/html/2606.00328#A4.T11)\)\.
WebQSP consists of naturally formulated user queries, whereas CWQ questions are derived programmatically from logical reasoning templates and a portion of PUGG questions share a similar structure\. We hypothesize that judge\-based methods benefit from the regular reasoning patterns inherent to template\-derived questions; naturally phrased queries provide weaker structural regularities, making WebQSP a harder setting for LLM\-based hallucination detection\.KG\-Guard, relying on graph topology rather than linguistic regularities, is robust to this distinction\.
SelfCheckGPT improves with sample count (N = 10 over N = 5) and is the best open-source baseline, outperforming all Llama-4-Scout variants, yet remains substantially below KG-Guard while incurring higher inference cost, since it requires N forward passes of the full KBQA model per example.
LLM judges are strongly precision-skewed (GPT-5-mini: 85.3% precision at 45.3% recall on WebQSP), missing many hallucinated nodes. In contrast, KG-Guard is more balanced, and the F1 gap is driven primarily by higher recall. Moreover, KG-Guard's precision-recall trade-off can be adjusted via the classification threshold, an option unavailable to LLM-based methods (Appendix [D.2](https://arxiv.org/html/2606.00328#A4.SS2)).
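To illustrate what adjusting the classification threshold means in practice, here is a minimal sketch of a threshold sweep. The scores and labels are hypothetical toy values, not outputs of KG-Guard, and the function name is our own. The last lines check that the reported judge precision and recall are consistent with the F1 in Table 2.

```python
def precision_recall_f1(scores, labels, threshold):
    """Flag a node as hallucinated when its score is >= threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical detector scores (probability of "hallucinated") and true labels.
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, True, False, False, False]

# Lowering the threshold trades precision for recall.
for threshold in (0.7, 0.5, 0.25):
    p, r, f1 = precision_recall_f1(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")

# Harmonic mean of the reported GPT-5-mini precision and recall on WebQSP.
judge_f1 = 2 * 0.853 * 0.453 / (0.853 + 0.453)
print(f"judge F1 = {100 * judge_f1:.1f}")
```

An LLM judge that returns only a hard verdict offers no comparable knob: its operating point is fixed by the prompt and the model.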
### 5.2 Hallucination Correction
KBQA performance before and after hallucination-aware iterative refinement is summarized in Table [3](https://arxiv.org/html/2606.00328#S5.T3), and full per-dataset results are in Appendix [D.2](https://arxiv.org/html/2606.00328#A4.SS2). We observe consistent gains across all three datasets: on average, F1 improves by 13.8 points and Exact Match by 17.4 points. The EM gain is especially notable: EM requires the predicted answer set to coincide exactly with the gold set per example, so any missing or spurious answer node counts as a failure. Consistent improvement on this strict metric indicates that refinement fully resolves hallucinations in many examples rather than only shifting answers closer to correct. These results demonstrate that KG-Guard's output is actionable: given the specific hallucinated answer nodes, the LLM can make targeted corrections.
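The difference between the two metrics can be seen on a single example. The sketch below computes per-example F1 and Exact Match over sets of answer nodes; the entity identifiers are hypothetical, the function name is ours, and a non-empty gold set is assumed.

```python
def answer_set_scores(predicted, gold):
    """Per-example F1 and Exact Match over sets of answer nodes.

    Assumes a non-empty gold set.
    """
    predicted, gold = set(predicted), set(gold)
    exact_match = float(predicted == gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0, exact_match
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall), exact_match


gold = {"Q1", "Q2"}  # hypothetical entity identifiers
candidates = [
    ("initial", {"Q1", "Q3"}),        # one correct node, one hallucinated
    ("partly fixed", {"Q1"}),         # hallucination removed, one answer missing
    ("fully fixed", {"Q1", "Q2"}),    # matches the gold set exactly
]
for name, predicted in candidates:
    f1, em = answer_set_scores(predicted, gold)
    print(f"{name:13s} F1={f1:.2f}  EM={em:.0f}")
```

Dropping the hallucinated node raises F1 but leaves EM at 0; only the fully corrected set scores on EM, which is why EM gains indicate complete rather than partial corrections.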
Table 3: KBQA answer quality before (Initial) and after (Refined) iterative refinement guided by KG-Guard, measured by F1 and Exact Match (EM). Δ is the score improvement.

The metrics per refinement step are shown in Figure [3](https://arxiv.org/html/2606.00328#S5.F3). The largest gains occur after the first refinement step, after which the improvement curve flattens. The distribution of iterations per example (Figure [4](https://arxiv.org/html/2606.00328#S5.F4)) reveals how many corrections each example needed. Step 0 denotes examples already correct before any refinement. Among examples that did require refinement, the largest share resolves at the first refinement iteration, and the count drops quickly at later steps. A substantial fraction of examples reach the iteration cap without converging, indicating hallucinations that the refinement procedure could not fully resolve. Overall, a single targeted correction resolves most of the hallucinations that refinement can address at all, so further iterations add little. Practitioners who wish to reduce inference costs can therefore cap refinement at a single step with limited loss in correction quality.
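The refinement procedure described above can be sketched as a detect-then-correct loop with an iteration cap. This is an illustration under stated assumptions rather than the authors' implementation: `detect` and `refine` are hypothetical callables standing in for KG-Guard and the KBQA LLM, the loop is assumed to stop when no node is flagged, and the default cap of 5 follows the iteration cap shown in Figure 4.

```python
from typing import Callable, Set, Tuple


def refine_until_clean(
    answers: Set[str],
    detect: Callable[[Set[str]], Set[str]],
    refine: Callable[[Set[str], Set[str]], Set[str]],
    max_steps: int = 5,
) -> Tuple[Set[str], int]:
    """Alternate detection and correction until nothing is flagged or the cap is hit.

    Returns the final answers and the number of refinement steps used
    (0 means the initial answers were not flagged at all).
    """
    for step in range(max_steps):
        flagged = detect(answers)
        if not flagged:
            return answers, step
        answers = refine(answers, flagged)
    return answers, max_steps


# Toy stand-ins (hypothetical): the detector knows a fixed set of bad nodes,
# and the "LLM" drops one flagged node per call.
BAD_NODES = {"Q3", "Q4"}


def toy_detect(answers: Set[str]) -> Set[str]:
    return answers & BAD_NODES


def toy_refine(answers: Set[str], flagged: Set[str]) -> Set[str]:
    return answers - {sorted(flagged)[0]}


final, steps = refine_until_clean({"Q1", "Q3", "Q4"}, toy_detect, toy_refine)
print(final, steps)

# Capping at a single step, as suggested for cost-sensitive deployments.
capped, capped_steps = refine_until_clean(
    {"Q1", "Q3", "Q4"}, toy_detect, toy_refine, max_steps=1
)
print(capped, capped_steps)
```

Setting `max_steps=1` gives the single-step variant: one detector pass and one LLM call per flagged example.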
Figure 3: KBQA metrics per refinement step across three datasets. Step 0 is the initial unrefined prediction; the largest gains occur at step 1, after which the curve flattens.

Figure 4: Distribution of refinement iterations per example across three datasets. Step 0 denotes examples correct before any refinement; step 5 denotes examples that reached the iteration cap without resolving.
## 6 Conclusion
We formulated hallucination detection in KBQA as answer-node classification on retrieved KG subgraphs and proposed KG-Guard, a lightweight black-box graph-based detector. Our method ranks first on all three benchmarks, outperforming LLM-as-judge and sampling-based baselines while using ∼305× fewer parameters and requiring zero LLM calls at detection time. The detector generalizes across question types and languages.
Beyond detection, the node-level output of KG-Guard is actionable: feeding flagged answer nodes back to the KBQA LLM as targeted correction signals yields consistent downstream KBQA gains across all three datasets, with most of the improvement captured in a single refinement step. Together, these results establish explicit graph-based structural verification as a practical and efficient path toward more reliable LLM-based KBQA systems.
## Limitations and Future Work
KG-Guard requires training examples with hallucination labels. These can be derived from any KBQA dataset by comparing answer nodes against gold answers or by labeling hallucinations directly. Because KG-Guard performs node-level classification on the retrieved subgraph, it cannot handle cases where the LLM abstains with an *unknown* prediction: such responses are not represented as nodes in the graph. However, defining what counts as a hallucinated abstention is non-trivial, since LLMs are generally expected to abstain when the answer is genuinely uncertain. While KG-Guard is applicable to any LLM-based KBQA pipeline, our experiments cover a single pipeline. Sensitivity to alternative pipelines is future work.
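Deriving labels by comparison against gold answers can be as simple as the following sketch. The function name and entity identifiers are hypothetical, and exact identifier matching is an assumption; a real pipeline may need entity normalization first.

```python
def hallucination_labels(answer_nodes, gold_answers):
    """Label each LLM-proposed answer node: True = hallucinated (not in the gold set)."""
    gold = set(gold_answers)
    return {node: node not in gold for node in answer_nodes}


# The LLM proposed two nodes; only "Q1" is a gold answer.
labels = hallucination_labels(["Q1", "Q3"], ["Q1", "Q2"])
print(labels)

# An abstention yields no answer nodes, so there is nothing to label or classify.
abstained = hallucination_labels([], ["Q1", "Q2"])
print(abstained)
```

The empty result for an abstention mirrors the limitation noted above: with no proposed node in the graph, a node-level classifier has no input to score.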
The answering LLM also produces a free-form reasoning summary and supporting triples, which are used by the LLM-as-judge baselines. We experimented with marking the supporting triples in the graph and with incorporating the summary embedding, but neither gave consistent gains. We hypothesize that the reasoning signal is already recoverable from the subgraph. Whether more sophisticated fusion methods could exploit these signals remains open for future work.
##### Broader impacts
KG-Guard aims to improve reliability of LLM-based KBQA systems in high-stakes domains. As any ML system can produce incorrect predictions, it should augment rather than replace human factual verification.
## References
- Introducing LLaMA 4: advancing multimodal intelligence\.Technical reportExternal Links:[Link](https://ai.meta.com/blog/llama-4-multimodal-intelligence/)Cited by:[§4\.2](https://arxiv.org/html/2606.00328#S4.SS2.SSS0.Px2.p1.2)\.
- A\. Azaria and T\. Mitchell \(2023\)The internal state of an LLM knows when it‘s lying\.InFindings of the Association for Computational Linguistics: EMNLP 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 967–976\.External Links:[Link](https://aclanthology.org/2023.findings-emnlp.68/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.68)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p2.1)\.
- J\. Baek, A\. F\. Aji, and A\. Saffari \(2023\)Knowledge\-augmented language model prompting for zero\-shot knowledge graph question answering\.InProceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations \(NLRSE\),B\. Dalvi Mishra, G\. Durrett, P\. Jansen, D\. Neves Ribeiro, and J\. Wei \(Eds\.\),Toronto, Canada,pp\. 78–106\.External Links:[Link](https://aclanthology.org/2023.nlrse-1.7/),[Document](https://dx.doi.org/10.18653/v1/2023.nlrse-1.7)Cited by:[§1](https://arxiv.org/html/2606.00328#S1.p1.1),[§2](https://arxiv.org/html/2606.00328#S2.p1.1),[§4\.1\.2](https://arxiv.org/html/2606.00328#S4.SS1.SSS2.Px2.p1.1)\.
- J\. Berant, A\. Chou, R\. Frostig, and P\. Liang \(2013\)Semantic parsing on Freebase from question\-answer pairs\.InProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,D\. Yarowsky, T\. Baldwin, A\. Korhonen, K\. Livescu, and S\. Bethard \(Eds\.\),Seattle, Washington, USA,pp\. 1533–1544\.External Links:[Link](https://aclanthology.org/D13-1160/)Cited by:[§4\.1\.1](https://arxiv.org/html/2606.00328#S4.SS1.SSS1.Px1.p1.1)\.
- J\. Binkowski, D\. Janiak, A\. Sawczyn, B\. Gabrys, and T\. J\. Kajdanowicz \(2025\)Hallucination detection in LLMs using spectral features of attention maps\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 24354–24385\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1239/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1239),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.00328#S1.p4.1),[§2](https://arxiv.org/html/2606.00328#S2.p2.1)\.
- K\. Bollacker, R\. Cook, and P\. Tufts \(2007\)Freebase: a shared database of structured general human knowledge\.InProceedings of the 22nd National Conference on Artificial Intelligence \- Volume 2,AAAI’07,pp\. 1962–1963\.External Links:ISBN 9781577353232Cited by:[§4\.1\.1](https://arxiv.org/html/2606.00328#S4.SS1.SSS1.Px1.p1.1)\.
- C\. Chen, K\. Liu, Z\. Chen, Y\. Gu, Y\. Wu, M\. Tao, Z\. Fu, and J\. Ye \(2024\)INSIDE: LLMs’ internal states retain the power of hallucination detection\.External Links:[Link](https://openreview.net/forum?id=Zj12nzlQbz)Cited by:[§1](https://arxiv.org/html/2606.00328#S1.p4.1),[§2](https://arxiv.org/html/2606.00328#S2.p2.1)\.
- Y\. Chen, H\. Liu, Y\. Liu, J\. Xie, R\. Yang, H\. Yuan, Y\. Fu, P\. Y\. Zhou, Q\. Chen, J\. Caverlee, and I\. Li \(2025\)GraphCheck: breaking long\-term text barriers with extracted knowledge graph\-powered fact\-checking\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 14976–14995\.External Links:[Link](https://aclanthology.org/2025.acl-long.729/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.729),ISBN 979\-8\-89176\-251\-0Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p3.1)\.
- Y\. Chuang, L\. Qiu, C\. Hsieh, R\. Krishna, Y\. Kim, and J\. R\. Glass \(2024\)Lookback lens: detecting and mitigating contextual hallucinations in large language models using only attention maps\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 1419–1436\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.84/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.84)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p2.1)\.
- S\. Dadas, M\. Perełkiewicz, and R\. Poświata \(2024\)PIRB: a comprehensive benchmark of Polish dense and hybrid text retrieval methods\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 12761–12774\.External Links:[Link](https://aclanthology.org/2024.lrec-main.1117/)Cited by:[Appendix A](https://arxiv.org/html/2606.00328#A1.SS0.SSS0.Px2.p1.11)\.
- S\. Dhuliawala, M\. Komeili, J\. Xu, R\. Raileanu, X\. Li, A\. Celikyilmaz, and J\. Weston \(2024\)Chain\-of\-verification reduces hallucination in large language models\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 3563–3578\.External Links:[Link](https://aclanthology.org/2024.findings-acl.212/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.212)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p4.1)\.
- V\. P\. Dwivedi and X\. Bresson \(2021\)A generalization of transformer networks to graphs\.Cited by:[Appendix A](https://arxiv.org/html/2606.00328#A1.SS0.SSS0.Px2.p1.11)\.
- W\. Falcon and The PyTorch Lightning team \(2019\)PyTorch Lightning\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.3828935),[Link](https://github.com/Lightning-AI/lightning)Cited by:[Appendix A](https://arxiv.org/html/2606.00328#A1.SS0.SSS0.Px2.p1.11)\.
- S\. Farquhar, J\. Kossen, L\. Kuhn, and Y\. Gal \(2024\)Detecting hallucinations in large language models using semantic entropy\.Nature630\(8017\),pp\. 625–630\.External Links:ISSN 1476\-4687,[Document](https://dx.doi.org/10.1038/s41586-024-07421-0),[Link](https://doi.org/10.1038/s41586-024-07421-0)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p2.1)\.
- M\. Fey and J\. E\. Lenssen \(2019\)Fast graph representation learning with PyTorch Geometric\.InICLR Workshop on Representation Learning on Graphs and Manifolds,Cited by:[Appendix A](https://arxiv.org/html/2606.00328#A1.SS0.SSS0.Px2.p1.11)\.
- X\. Guan, Y\. Liu, H\. Lin, Y\. Lu, B\. He, X\. Han, and L\. Sun \(2024\)Mitigating large language model hallucinations via autonomous knowledge graph\-based retrofitting\.InProceedings of the Thirty\-Eighth AAAI Conference on Artificial Intelligence and Thirty\-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence,AAAI’24/IAAI’24/EAAI’24\.External Links:ISBN 978\-1\-57735\-887\-9,[Link](https://doi.org/10.1609/aaai.v38i16.29770),[Document](https://dx.doi.org/10.1609/aaai.v38i16.29770)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p4.1)\.
- W\. L\. Hamilton, R\. Ying, and J\. Leskovec \(2017\)Inductive representation learning on large graphs\.InProceedings of the 31st International Conference on Neural Information Processing Systems,NIPS’17,Red Hook, NY, USA,pp\. 1025–1035\.External Links:ISBN 9781510860964Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p5.1)\.
- X\. He, Y\. Tian, Y\. Sun, N\. V\. Chawla, T\. Laurent, Y\. LeCun, X\. Bresson, and B\. Hooi \(2024\)G\-retriever: retrieval\-augmented generation for textual graph understanding and question answering\.InAdvances in Neural Information Processing Systems,A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\),Vol\.37,pp\. 132876–132907\.External Links:[Document](https://dx.doi.org/10.52202/079017-4224),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/efaf1c9726648c8ba363a5c927440529-Paper-Conference.pdf)Cited by:[§1](https://arxiv.org/html/2606.00328#S1.p1.1),[§2](https://arxiv.org/html/2606.00328#S2.p1.1),[§4\.1\.2](https://arxiv.org/html/2606.00328#S4.SS1.SSS2.Px1.p1.1)\.
- X\. Hu, D\. Ru, L\. Qiu, Q\. Guo, T\. Zhang, Y\. Xu, Y\. Luo, P\. Liu, Y\. Zhang, and Z\. Zhang \(2024\)Knowledge\-centric hallucination detection\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 6953–6975\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.395/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.395)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p3.1)\.
- L\. Huang, W\. Yu, W\. Ma, W\. Zhong, Z\. Feng, H\. Wang, Q\. Chen, W\. Peng, X\. Feng, B\. Qin, and T\. Liu \(2025\)A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions\.ACM Trans\. Inf\. Syst\.43\(2\)\.External Links:ISSN 1046\-8188,[Link](https://doi.org/10.1145/3703155),[Document](https://dx.doi.org/10.1145/3703155)Cited by:[§1](https://arxiv.org/html/2606.00328#S1.p2.1)\.
- T\. N\. Kipf and M\. Welling \(2017\)Semi\-supervised classification with graph convolutional networks\.In5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24\-26, 2017, Conference Track Proceedings,External Links:[Link](https://openreview.net/forum?id=SJU4ayYgl)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p5.1)\.
- J\. Kossen, J\. Han, M\. Razzak, L\. Schut, S\. Malik, and Y\. Gal \(2024\)Semantic entropy probes: robust and cheap hallucination detection in llms\.External Links:2406\.15927,[Link](https://arxiv.org/abs/2406.15927)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p2.1)\.
- Y\. Lan, G\. He, J\. Jiang, J\. Jiang, W\. X\. Zhao, and J\. Wen \(2023\)Complex knowledge base question answering: a survey\.IEEE Transactions on Knowledge and Data Engineering35\(11\),pp\. 11196–11215\.External Links:[Document](https://dx.doi.org/10.1109/TKDE.2022.3223858)Cited by:[§1](https://arxiv.org/html/2606.00328#S1.p1.1)\.
- L\. Luo, Y\. Li, G\. Haffari, and S\. Pan \(2024\)Reasoning on graphs: faithful and interpretable large language model reasoning\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p1.1)\.
- C\. Ma, Y\. Chen, T\. Wu, A\. Khan, and H\. Wang \(2025\)Large language models meet knowledge graphs for question answering: synthesis and opportunities\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 24578–24597\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1249/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1249),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.00328#S1.p1.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, S\. Gupta, B\. P\. Majumder, K\. Hermann, S\. Welleck, A\. Yazdanbakhsh, and P\. Clark \(2023\)SELF\-refine: iterative refinement with self\-feedback\.InProceedings of the 37th International Conference on Neural Information Processing Systems,NIPS ’23,Red Hook, NY, USA\.Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p4.1)\.
- P\. Manakul, A\. Liusie, and M\. Gales \(2023\)SelfCheckGPT: zero\-resource black\-box hallucination detection for generative large language models\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 9004–9017\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.557/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.557)Cited by:[§1](https://arxiv.org/html/2606.00328#S1.p4.1),[§2](https://arxiv.org/html/2606.00328#S2.p2.1),[§4\.2](https://arxiv.org/html/2606.00328#S4.SS2.SSS0.Px4.p1.4)\.
- C\. Mavromatis and G\. Karypis \(2025\)GNN\-RAG: graph neural retrieval for efficient large language model reasoning on knowledge graphs\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 16682–16699\.External Links:[Link](https://aclanthology.org/2025.findings-acl.856/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.856),ISBN 979\-8\-89176\-256\-5Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p1.1)\.
- S\. Min, K\. Krishna, X\. Lyu, M\. Lewis, W\. Yih, P\. Koh, M\. Iyyer, L\. Zettlemoyer, and H\. Hajishirzi \(2023\)FActScore: fine\-grained atomic evaluation of factual precision in long form text generation\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 12076–12100\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.741/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.741)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p2.1)\.
- M\. Rashad, A\. Zahran, A\. Amin, A\. Abdelaal, and M\. Altantawy \(2024\)FactAlign: fact\-level hallucination detection and classification through knowledge graph alignment\.InProceedings of the 4th Workshop on Trustworthy Natural Language Processing \(TrustNLP 2024\),A\. Ovalle, K\. Chang, Y\. T\. Cao, N\. Mehrabi, J\. Zhao, A\. Galstyan, J\. Dhamala, A\. Kumar, and R\. Gupta \(Eds\.\),Mexico City, Mexico,pp\. 79–84\.External Links:[Link](https://aclanthology.org/2024.trustnlp-1.8/),[Document](https://dx.doi.org/10.18653/v1/2024.trustnlp-1.8)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p3.1)\.
- N\. Reimers and I\. Gurevych \(2019\)Sentence\-BERT: sentence embeddings using Siamese BERT\-networks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\),Hong Kong, China,pp\. 3982–3992\.External Links:[Link](https://aclanthology.org/D19-1410/),[Document](https://dx.doi.org/10.18653/v1/D19-1410)Cited by:[Appendix A](https://arxiv.org/html/2606.00328#A1.SS0.SSS0.Px2.p1.11)\.
- H\. Sansford, N\. Richardson, H\. P\. Maretic, and J\. N\. Saada \(2024\)GraphEval: A knowledge\-graph based LLM hallucination evaluation framework\.InProceedings of the Workshop on Knowledge\-infused Learning co\-located with 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining \(KDD\), Barcelona, Spain, August 26, 2024,M\. Gaur, E\. Tsamoura, E\. Raff, N\. Vedula, and S\. Parthasarathy \(Eds\.\),CEUR Workshop Proceedings,pp\. 20–31\.External Links:[Link](https://ceur-ws.org/Vol-3894/paper5.pdf)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p3.1)\.
- A\. Sawczyn, J\. Binkowski, D\. Janiak, B\. Gabrys, and T\. J\. Kajdanowicz \(2026\)FactSelfCheck: fact\-level black\-box hallucination detection for LLMs\.InFindings of the Association for Computational Linguistics: EACL 2026,V\. Demberg, K\. Inui, and L\. Marquez \(Eds\.\),Rabat, Morocco,pp\. 5603–5621\.External Links:[Link](https://aclanthology.org/2026.findings-eacl.296/),[Document](https://dx.doi.org/10.18653/v1/2026.findings-eacl.296),ISBN 979\-8\-89176\-386\-9Cited by:[§1](https://arxiv.org/html/2606.00328#S1.p4.1),[§2](https://arxiv.org/html/2606.00328#S2.p2.1),[§2](https://arxiv.org/html/2606.00328#S2.p4.1)\.
- A\. Sawczyn, K\. Viarenich, K\. Wojtasik, A\. Domogała, M\. Oleksy, M\. Piasecki, and T\. Kajdanowicz \(2024\)Developing PUGG for Polish: a modern approach to KBQA, MRC, and IR dataset construction\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 10978–10996\.External Links:[Link](https://aclanthology.org/2024.findings-acl.652/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.652)Cited by:[§4\.1\.1](https://arxiv.org/html/2606.00328#S4.SS1.SSS1.Px3.p1.1)\.
- Y\. Shi, Z\. Huang, S\. Feng, H\. Zhong, W\. Wang, and Y\. Sun \(2021\)Masked label prediction: unified message passing model for semi\-supervised classification\.InProceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI\-21,Z\. Zhou \(Ed\.\),pp\. 1548–1554\.Note:Main TrackExternal Links:[Document](https://dx.doi.org/10.24963/ijcai.2021/214),[Link](https://doi.org/10.24963/ijcai.2021/214)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p5.1)\.
- A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram, A\. Nathan, A\. Luo, A\. Helyar, A\. Madry, A\. Efremov, A\. Spyra, A\. Baker\-Whitcomb, A\. Beutel, A\. Karpenko, A\. Makelov, A\. Neitz, A\. Wei, A\. Barr, A\. Kirchmeyer, A\. Ivanov, A\. Christakis, A\. Gillespie, A\. Tam, A\. Bennett, A\. Wan, A\. Huang, A\. M\. Sandjideh, A\. Yang, A\. Kumar, A\. Saraiva, A\. Vallone, A\. Gheorghe, A\. G\. Garcia, A\. Braunstein, A\. Liu, A\. Schmidt, A\. Mereskin, A\. Mishchenko, A\. Applebaum, A\. Rogerson, A\. Rajan, A\. Wei, A\. Kotha, A\. Srivastava, A\. Agrawal, A\. Vijayvergiya, A\. Tyra, A\. Nair, A\. Nayak, B\. Eggers, B\. Ji, B\. Hoover, B\. Chen, B\. Chen, B\. Barak, B\. Minaiev, B\. Hao, B\. Baker, B\. Lightcap, B\. McKinzie, B\. Wang, B\. Quinn, B\. Fioca, B\. Hsu, B\. Yang, B\. Yu, B\. Zhang, B\. Brenner, C\. R\. Zetino, C\. Raymond, C\. Lugaresi, C\. Paz, C\. Hudson, C\. Whitney, C\. Li, C\. Chen, C\. Cole, C\. Voss, C\. Ding, C\. Shen, C\. Huang, C\. Colby, C\. Hallacy, C\. Koch, C\. Lu, C\. Kaplan, C\. Kim, C\. Minott\-Henriques, C\. Frey, C\. Yu, C\. Czarnecki, C\. Reid, C\. Wei, C\. Decareaux, C\. Scheau, C\. Zhang, C\. Forbes, D\. Tang, D\. Goldberg, D\. Roberts, D\. Palmie, D\. Kappler, D\. Levine, D\. Wright, D\. Leo, D\. Lin, D\. Robinson, D\. Grabb, D\. Chen, D\. Lim, D\. Salama, D\. Bhattacharjee, D\. Tsipras, D\. Li, D\. Yu, D\. Strouse, D\. Williams, D\. Hunn, E\. Bayes, E\. Arbus, E\. Akyurek, E\. Y\. Le, E\. Widmann, E\. Yani, E\. Proehl, E\. Sert, E\. Cheung, E\. Schwartz, E\. Han, E\. Jiang, E\. Mitchell, E\. Sigler, E\. Wallace, E\. Ritter, E\. Kavanaugh, E\. Mays, E\. Nikishin, F\. Li, F\. P\. Such, F\. de Avila Belbute Peres, F\. Raso, F\. Bekerman, F\. Tsimpourlas, F\. Chantzis, F\. Song, F\. Zhang, G\. Raila, G\. McGrath, G\. Briggs, G\. Yang, G\. Parascandolo, G\. Chabot, G\. Kim, G\. Zhao, G\. Valiant, G\. Leclerc, H\. Salman, H\. Wang, H\. Sheng, H\. Jiang, H\. 
Wang, H\. Jin, H\. Sikchi, H\. Schmidt, H\. Aspegren, H\. Chen, H\. Qiu, H\. Lightman, I\. Covert, I\. Kivlichan, I\. Silber, I\. Sohl, I\. Hammoud, I\. Clavera, I\. Lan, I\. Akkaya, I\. Kostrikov, I\. Kofman, I\. Etinger, I\. Singal, J\. Hehir, J\. Huh, J\. Pan, J\. Wilczynski, J\. Pachocki, J\. Lee, J\. Quinn, J\. Kiros, J\. Kalra, J\. Samaroo, J\. Wang, J\. Wolfe, J\. Chen, J\. Wang, J\. Harb, J\. Han, J\. Wang, J\. Zhao, J\. Chen, J\. Yang, J\. Tworek, J\. Chand, J\. Landon, J\. Liang, J\. Lin, J\. Liu, J\. Wang, J\. Tang, J\. Yin, J\. Jang, J\. Morris, J\. Flynn, J\. Ferstad, J\. Heidecke, J\. Fishbein, J\. Hallman, J\. Grant, J\. Chien, J\. Gordon, J\. Park, J\. Liss, J\. Kraaijeveld, J\. Guay, J\. Mo, J\. Lawson, J\. McGrath, J\. Vendrow, J\. Jiao, J\. Lee, J\. Steele, J\. Wang, J\. Mao, K\. Chen, K\. Hayashi, K\. Xiao, K\. Salahi, K\. Wu, K\. Sekhri, K\. Sharma, K\. Singhal, K\. Li, K\. Nguyen, K\. Gu\-Lemberg, K\. King, K\. Liu, K\. Stone, K\. Yu, K\. Ying, K\. Georgiev, K\. Lim, K\. Tirumala, K\. Miller, L\. Ahmad, L\. Lv, L\. Clare, L\. Fauconnet, L\. Itow, L\. Yang, L\. Romaniuk, L\. Anise, L\. Byron, L\. Pathak, L\. Maksin, L\. Lo, L\. Ho, L\. Jing, L\. Wu, L\. Xiong, L\. Mamitsuka, L\. Yang, L\. McCallum, L\. Held, L\. Bourgeois, L\. Engstrom, L\. Kuhn, L\. Feuvrier, L\. Zhang, L\. Switzer, L\. Kondraciuk, L\. Kaiser, M\. Joglekar, M\. Singh, M\. Shah, M\. Stratta, M\. Williams, M\. Chen, M\. Sun, M\. Cayton, M\. Li, M\. Zhang, M\. Aljubeh, M\. Nichols, M\. Haines, M\. Schwarzer, M\. Gupta, M\. Shah, M\. Y\. Guan, M\. Huang, M\. Dong, M\. Wang, M\. Glaese, M\. Carroll, M\. Lampe, M\. Malek, M\. Sharman, M\. Zhang, M\. Wang, M\. Pokrass, M\. Florian, M\. Pavlov, M\. Wang, M\. Chen, M\. Wang, M\. Feng, M\. Bavarian, M\. Lin, M\. Abdool, M\. Rohaninejad, N\. Soto, N\. Staudacher, N\. LaFontaine, N\. Marwell, N\. Liu, N\. Preston, N\. Turley, N\. Ansman, N\. Blades, N\. Pancha, N\. Mikhaylin, N\. Felix, N\. Handa, N\. Rai, N\. Keskar, N\. Brown, O\. 
Nachum, O\. Boiko, O\. Murk, O\. Watkins, O\. Gleeson, P\. Mishkin, P\. Lesiewicz, P\. Baltescu, P\. Belov, P\. Zhokhov, P\. Pronin, P\. Guo, P\. Thacker, Q\. Liu, Q\. Yuan, Q\. Liu, R\. Dias, R\. Puckett, R\. Arora, R\. T\. Mullapudi, R\. Gaon, R\. Miyara, R\. Song, R\. Aggarwal, R\. Marsan, R\. Yemiru, R\. Xiong, R\. Kshirsagar, R\. Nuttall, R\. Tsiupa, R\. Eldan, R\. Wang, R\. James, R\. Ziv, R\. Shu, R\. Nigmatullin, S\. Jain, S\. Talaie, S\. Altman, S\. Arnesen, S\. Toizer, S\. Toyer, S\. Miserendino, S\. Agarwal, S\. Yoo, S\. Heon, S\. Ethersmith, S\. Grove, S\. Taylor, S\. Bubeck, S\. Banesiu, S\. Amdo, S\. Zhao, S\. Wu, S\. Santurkar, S\. Zhao, S\. R\. Chaudhuri, S\. Krishnaswamy, Shuaiqi, Xia, S\. Cheng, S\. Anadkat, S\. P\. Fishman, S\. Tobin, S\. Fu, S\. Jain, S\. Mei, S\. Egoian, S\. Kim, S\. Golden, S\. Mah, S\. Lin, S\. Imm, S\. Sharpe, S\. Yadlowsky, S\. Choudhry, S\. Eum, S\. Sanjeev, T\. Khan, T\. Stramer, T\. Wang, T\. Xin, T\. Gogineni, T\. Christianson, T\. Sanders, T\. Patwardhan, T\. Degry, T\. Shadwell, T\. Fu, T\. Gao, T\. Garipov, T\. Sriskandarajah, T\. Sherbakov, T\. Korbak, T\. Kaftan, T\. Hiratsuka, T\. Wang, T\. Song, T\. Zhao, T\. Peterson, V\. Kharitonov, V\. Chernova, V\. Kosaraju, V\. Kuo, V\. Pong, V\. Verma, V\. Petrov, W\. Jiang, W\. Zhang, W\. Zhou, W\. Xie, W\. Zhan, W\. McCabe, W\. DePue, W\. Ellsworth, W\. Bain, W\. Thompson, X\. Chen, X\. Qi, X\. Xiang, X\. Shi, Y\. Dubois, Y\. Yu, Y\. Khakbaz, Y\. Wu, Y\. Qian, Y\. T\. Lee, Y\. Chen, Y\. Zhang, Y\. Xiong, Y\. Tian, Y\. Cha, Y\. Bai, Y\. Yang, Y\. Yuan, Y\. Li, Y\. Zhang, Y\. Yang, Y\. Jin, Y\. Jiang, Y\. Wang, Y\. Wang, Y\. Liu, Z\. Stubenvoll, Z\. Dou, Z\. Wu, and Z\. Wang \(2026\)OpenAI gpt\-5 system card\.Technical reportExternal Links:2601\.03267,[Link](https://arxiv.org/abs/2601.03267)Cited by:[§4\.2](https://arxiv.org/html/2606.00328#S4.SS2.SSS0.Px3.p1.1)\.
- G\. Sriramanan, S\. Bharti, V\. S\. Sadasivan, S\. Saha, P\. Kattakinda, and S\. Feizi \(2024\)LLM\-check: investigating detection of hallucinations in large language models\.InAdvances in Neural Information Processing Systems,A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\),Vol\.37,pp\. 34188–34216\.External Links:[Document](https://dx.doi.org/10.52202/079017-1077),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/3c1e1fdf305195cd620c118aaa9717ad-Paper-Conference.pdf)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p2.1)\.
- H\. Sun, B\. Dhingra, M\. Zaheer, K\. Mazaitis, R\. Salakhutdinov, and W\. Cohen \(2018\)Open domain question answering using early fusion of knowledge bases and text\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium,pp\. 4231–4242\.External Links:[Link](https://aclanthology.org/D18-1455/),[Document](https://dx.doi.org/10.18653/v1/D18-1455)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p1.1)\.
- J\. Sun, C\. Xu, L\. Tang, S\. Wang, C\. Lin, Y\. Gong, L\. Ni, H\. Shum, and J\. Guo \(2024\)Think\-on\-graph: deep and responsible reasoning of large language model on knowledge graph\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=nnVO1PvbTv)Cited by:[§2](https://arxiv.org/html/2606.00328#S2.p1.1)\.
- A\. Talmor and J\. Berant \(2018\)The web as a knowledge\-base for answering complex questions\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\),M\. Walker, H\. Ji, and A\. Stent \(Eds\.\),New Orleans, Louisiana,pp\. 641–651\.External Links:[Link](https://aclanthology.org/N18-1059/),[Document](https://dx.doi.org/10.18653/v1/N18-1059)Cited by:[§4\.1\.1](https://arxiv.org/html/2606.00328#S4.SS1.SSS1.Px2.p1.1)\.
- P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=rJXMpikCZ) Cited by: [Appendix A](https://arxiv.org/html/2606.00328#A1.SS0.SSS0.Px2.p1.11), [§2](https://arxiv.org/html/2606.00328#S2.p5.1).
- D. Vrandečić and M. Krötzsch (2014) Wikidata: a free collaborative knowledgebase. Commun. ACM 57 (10), pp. 78–85. External Links: [Document](https://dx.doi.org/10.1145/2629489), ISSN 0001-0782, [Link](http://doi.acm.org/10.1145/2629489) Cited by: [§4.1.1](https://arxiv.org/html/2606.00328#S4.SS1.SSS1.Px3.p1.1).
- K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=ryGs6iA5Km) Cited by: [§2](https://arxiv.org/html/2606.00328#S2.p5.1).
- M. Yasunaga, H. Ren, A. Bosselut, P. Liang, and J. Leskovec (2021) QA-GNN: reasoning with language models and knowledge graphs for question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou (Eds.), Online, pp. 535–546. External Links: [Link](https://aclanthology.org/2021.naacl-main.45/), [Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.45) Cited by: [§2](https://arxiv.org/html/2606.00328#S2.p1.1).
- W. Yih, M. Richardson, C. Meek, M. Chang, and J. Suh (2016) The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), K. Erk and N. A. Smith (Eds.), Berlin, Germany, pp. 201–206. External Links: [Link](https://aclanthology.org/P16-2033/), [Document](https://dx.doi.org/10.18653/v1/P16-2033) Cited by: [§4.1.1](https://arxiv.org/html/2606.00328#S4.SS1.SSS1.Px1.p1.1).
- X. Zhang, A. Bosselut, M. Yasunaga, H. Ren, P. Liang, C. D. Manning, and J. Leskovec (2021) GreaseLM: graph reasoning enhanced language models. In International Conference on Learning Representations, External Links: [Link](https://arxiv.org/abs/2201.08860) Cited by: [§2](https://arxiv.org/html/2606.00328#S2.p1.1).
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. Gonzalez, and I. Stoica (2023) Judging llm-as-a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36, pp. 46595–46623. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf) Cited by: [§1](https://arxiv.org/html/2606.00328#S1.p4.1), [§4.2](https://arxiv.org/html/2606.00328#S4.SS2.SSS0.Px2.p1.2).
## Appendix
This appendix provides supplementary material organized as follows:
- Appendix [A](https://arxiv.org/html/2606.00328#A1): Implementation details.
- Appendix [B](https://arxiv.org/html/2606.00328#A2): Ablation studies on graph encoder components, input modalities, and the effect of training with KBQA-derived labels.
- Appendix [C](https://arxiv.org/html/2606.00328#A3): Model size comparison and computation demands across methods.
- Appendix [D](https://arxiv.org/html/2606.00328#A4): Full F1, precision, recall, accuracy, and AUC-PR results for hallucination detection and correction.
- Appendix [E](https://arxiv.org/html/2606.00328#A5): Dataset and subgraph statistics, including subgraph structure, answer count distributions, and LLM abstention and hallucination patterns.
## Appendix A Implementation Details
##### LLM settings
Llama-4-Scout-17B is used for KAPING, the LLM-as-judge baseline, and answer refinement, with temperature 0.1 to favor the most probable outputs while allowing retries on output-parsing errors. The SelfCheckGPT baseline draws samples from the same model with temperature 1.0 for diversity. GPT-5-mini is queried with temperature 1 and default reasoning effort (medium); temperature 1 is the only value permitted by the API when reasoning is enabled.
##### KG-Guard training
Node, edge, and question embeddings are produced by a frozen Sentence Transformer \[Reimers and Gurevych, [2019](https://arxiv.org/html/2606.00328#bib.bib42)\]: all-roberta-large-v1 for WebQuestionsSP and ComplexWebQuestions, and the multilingual mmlw-retrieval-roberta-large \[Dadas et al., [2024](https://arxiv.org/html/2606.00328#bib.bib43)\] for PUGG (in both cases $d = 1024$). Node features are obtained by encoding the node label, and edge features by encoding the relation label. The marking embeddings for topic entities and answer nodes have dimensions $d_T = d_A = 20$. The learnable embeddings for edges between the virtual question node and the topic entities have dimension 1024. The graph encoder is instantiated as either a GraphTransformer \[Dwivedi and Bresson, [2021](https://arxiv.org/html/2606.00328#bib.bib44)\] or a GAT \[Veličković et al., [2018](https://arxiv.org/html/2606.00328#bib.bib19)\]. Both use 2 layers, 8 attention heads, and 256-dimensional hidden representations. We train with Adam (learning rate $10^{-3}$, weight decay $10^{-4}$), batch size 32, for up to 300 epochs with early stopping (patience 50). All hyperparameter values were chosen empirically. The implementation uses PyTorch Geometric \[Fey and Lenssen, [2019](https://arxiv.org/html/2606.00328#bib.bib45)\] and PyTorch Lightning \[Falcon and The PyTorch Lightning team, [2019](https://arxiv.org/html/2606.00328#bib.bib47)\].
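For reference, the reported hyperparameters can be collected in one place, together with a patience-based stopping rule. This is a minimal sketch rather than the authors' code: the class and field names are hypothetical, and the sketch assumes the monitored validation score is one where higher is better, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KGGuardConfig:
    """Hyperparameter values from the text; all field names are hypothetical."""
    text_dim: int = 1024           # sentence-encoder embedding size d
    topic_mark_dim: int = 20       # d_T
    answer_mark_dim: int = 20      # d_A
    question_edge_dim: int = 1024  # learnable question-edge embeddings
    num_layers: int = 2
    num_heads: int = 8
    hidden_dim: int = 256
    lr: float = 1e-3
    weight_decay: float = 1e-4
    batch_size: int = 32
    max_epochs: int = 300
    patience: int = 50


def stopping_epoch(val_scores, patience, max_epochs):
    """Return the 1-based epoch at which training stops.

    Training stops once the validation score has failed to improve for
    `patience` consecutive epochs, or when `max_epochs` is reached.
    Assumes a higher score is better.
    """
    best = float("-inf")
    epochs_since_best = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best:
            best, epochs_since_best = score, 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:
            return epoch
    return min(len(val_scores), max_epochs)


cfg = KGGuardConfig()
# Toy validation curve: improves for two epochs, then plateaus.
toy_curve = [0.50, 0.60] + [0.55] * 100
stop_at = stopping_epoch(toy_curve, cfg.patience, cfg.max_epochs)
```

With the toy curve, the best score occurs at epoch 2, so the rule stops 50 epochs later.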
##### Compute resources
Llama-4-Scout inference was performed using vLLM on 4× H100 GPUs. KG-Guard training was conducted on an NVIDIA A40 GPU (see Appendix [C](https://arxiv.org/html/2606.00328#A3) for details).
## Appendix B Additional Experiments
### B.1 Effect of Graph Encoder Components
Table 4: Ablation of graph encoder components on hallucination detection F1. Δ is the F1 change in percentage points relative to the full KG-Guard model; negative values indicate performance loss from removing the component. Mean ± std over 3 runs.

To justify the proposed graph-encoder design, we ablate node marking and the virtual question node independently and in combination. Table [4](https://arxiv.org/html/2606.00328#A2.T4) presents the results. Removing node marking (the trainable topic-entity and answer-node embeddings) costs 0.8 pp on average, while removing the virtual question node costs 0.5 pp; removing both costs 2.2 pp, which exceeds the sum of the individual effects and indicates that the two are mutually reinforcing. CWQ is notably more robust across variants, whereas WebQSP shows the largest sensitivity to ablation.
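The comparison between the combined drop and the sum of the individual drops can be checked with a few lines of arithmetic. The sketch below uses only the three average values quoted in this subsection; the variable names are illustrative.

```python
# Average F1 drops in percentage points, as quoted in the text.
drop_no_marking = 0.8
drop_no_question_node = 0.5
drop_neither = 2.2

# If the two components contributed independently, removing both
# would cost roughly the sum of the individual drops.
additive_expectation = drop_no_marking + drop_no_question_node
interaction = drop_neither - additive_expectation

print(f"sum of individual drops: {additive_expectation:.1f} pp")
print(f"combined drop:           {drop_neither:.1f} pp")
print(f"loss beyond the sum:     {interaction:.1f} pp")
```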
### B.2 Effect of Input Modalities
Table 5: Ablation of input modalities on hallucination detection F1. Δ is the F1 change in percentage points relative to the full KG-Guard model (graph + question text); negative values indicate performance loss from removing the modality. RS is the reasoning summary produced by the answering LLM and is not used in KG-Guard. Mean ± std over 3 runs.

We investigate how much each input modality contributes to detection and whether the LLM's reasoning summary (RS), which KG-Guard does not use, can substitute for graph structure. Table [5](https://arxiv.org/html/2606.00328#A2.T5) reports the results. Ablating text features (graph only) costs 1.1 pp on average, while ablating the graph encoder (question only) reduces F1 by 6.0 pp, confirming that graph structure is the primary discriminative signal and that question embeddings provide a complementary but secondary gain. Adding the reasoning summary to the text input (Q+RS) yields no improvement over question text alone (Δ of −6.1 vs. −6.0 pp), indicating that textual reasoning paths do not encode the relational evidence that the knowledge graph provides. WebQSP is the most sensitive dataset for this ablation.
### B.3 Training with KBQA Dataset
A natural alternative to our approach is to train the same model to predict correct answer nodes using KBQA gold labels, then flag LLM-proposed answers as hallucinated whenever the model judges them to be incorrect. Since the KBQA datasets are considerably larger than our hallucination detection dataset, we subsample each KBQA split to match the size and class balance of the corresponding hallucination detection split. We repeat the procedure with three random seeds for each training run (Table [7](https://arxiv.org/html/2606.00328#A2.T7)).
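One way to subsample a larger labeled pool to a target size and class balance is to draw a fixed number of examples per class with a seeded generator. The sketch below is an illustration under that assumption, not the authors' procedure; the function name, the toy pool, and the target counts are all made up.

```python
import random


def subsample_to_match(labels, n_pos, n_neg, seed):
    """Pick indices giving exactly n_pos positives and n_neg negatives.

    `labels` is a sequence of 0/1 values. Raises ValueError if either
    class has too few examples.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if len(pos) < n_pos or len(neg) < n_neg:
        raise ValueError("not enough examples in one of the classes")
    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(chosen)
    return chosen


# Toy pool of 1,000 labeled nodes (made-up numbers), sampled with three seeds.
pool = [1] * 300 + [0] * 700
samples = {
    seed: subsample_to_match(pool, n_pos=40, n_neg=60, seed=seed)
    for seed in (0, 1, 2)
}
```

Fixing the seed makes each subsample reproducible, while different seeds give the variation that is summarized as mean ± std.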
Table 6: Hallucination detection F1 when training KG-Guard on hallucination labels (Hall. Det.) versus KBQA-derived labels (KBQA). Mean ± std over 3 runs.

Table 7: Training set statistics for hallucination-specific (Hall. Det.) and KBQA-derived supervision (KBQA-original: original split; KBQA-sampled: subsampled to match Hall. Det. size for fair comparison). #Instances: total labeled nodes; #Pos./#Neg.: hallucinated/correct instances; %Pos.: class imbalance; #Graphs: unique subgraphs. Mean ± std over sampling runs.

Table [6](https://arxiv.org/html/2606.00328#A2.T6) shows that our method outperforms the KBQA-supervised model on all three datasets despite identical training set sizes, indicating that hallucination-specific labels provide a stronger training signal than KBQA gold labels for this task.
Hallucination annotations are also cheaper to collect than KBQA annotations. Rather than answering each question against the knowledge base, annotators only verify whether the proposed answer node is supported by triples in the retrieved subgraph, a local check with no domain prerequisite. Additionally, each label is a simple binary judgment, whereas KBQA annotation requires identifying one or more correct answers from the full KG.
## Appendix C Model Size and Computation Demands
Table [8](https://arxiv.org/html/2606.00328#A3.T8) compares model sizes and query-time LLM call counts across methods. KG-Guard adds only 2.52M trainable parameters on top of a frozen sentence encoder (∼358M total), whereas the LLM-based baseline (Llama-4-Scout) operates at 109B parameters, 305× larger. At fp16 precision this corresponds to ∼0.7 GB for our full inference stack versus ∼218 GB for Llama-4-Scout, which exceeds the capacity of a single high-end GPU. Each LLM call requires autoregressive generation: while the final answer spans only several tokens, *thinking* enabled in GPT-5-mini can consume hundreds of additional tokens per query. SelfCheck further multiplies this cost by applying an LLM-based KBQA pipeline $N = 5$ or $N = 10$ times per query. KG-Guard makes zero LLM calls at detection time: classification is a single feed-forward pass through the graph encoder and a small MLP.
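The size figures follow from simple arithmetic on parameter counts. The sketch below assumes 2 bytes per parameter at fp16, decimal gigabytes, and weights only (no activations or caches); the encoder count is the approximate 355M reported for all-roberta-large-v1.

```python
BYTES_PER_PARAM_FP16 = 2

encoder_params = 355e6    # frozen sentence encoder (approximate)
detector_params = 2.52e6  # trainable KG-Guard parameters
llm_params = 109e9        # Llama-4-Scout total parameters

ours_params = encoder_params + detector_params
ratio = llm_params / ours_params

# Weight memory in decimal gigabytes.
ours_gb = ours_params * BYTES_PER_PARAM_FP16 / 1e9
llm_gb = llm_params * BYTES_PER_PARAM_FP16 / 1e9

print(f"ours: {ours_params / 1e6:.0f}M params, {ours_gb:.1f} GB")
print(f"LLM:  {llm_params / 1e9:.0f}B params, {llm_gb:.0f} GB")
print(f"ratio: {ratio:.0f}x")
```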
In practical deployment over a fixed knowledge graph, node and relation embeddings can be precomputed and stored offline. During inference, only the question sentence requires encoding, after which a single graph-encoder forward pass and one MLP evaluation yield the hallucination score. This lightweight design makes KG-Guard deployable on commodity hardware, with no multi-GPU infrastructure or external API access required.
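The offline/online split described here can be illustrated with a small cache. This is a sketch under stated assumptions: `PrecomputedEmbeddings` and `toy_encode` are hypothetical names, and the toy encoder merely stands in for the frozen sentence encoder so that encoder calls can be counted.

```python
class PrecomputedEmbeddings:
    """Caches label embeddings for a fixed KG so only questions are encoded online."""

    def __init__(self, encode, labels):
        # `encode` maps a string to a vector; it runs once per distinct label, offline.
        self._encode = encode
        self._table = {label: encode(label) for label in set(labels)}

    def lookup(self, label):
        # No encoder call here, only a dictionary read.
        return self._table[label]

    def encode_question(self, question):
        # The only encoder call needed at query time.
        return self._encode(question)


calls = []


def toy_encode(text):
    """Stand-in encoder that records each call and returns a tiny fake vector."""
    calls.append(text)
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


store = PrecomputedEmbeddings(toy_encode, ["Warsaw", "Poland", "capital of"])
offline_calls = len(calls)

# Query time: node and relation vectors come from the cache.
node_vecs = [store.lookup(name) for name in ("Warsaw", "Poland")]
edge_vec = store.lookup("capital of")
question_vec = store.encode_question("What is the capital of Poland?")
online_calls = len(calls) - offline_calls
```

In this toy run the encoder is called three times offline and once per query, regardless of how many nodes or relations the subgraph contains.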
Table 8: Model size and per-query LLM calls at detection time. Encoder: frozen SentenceTransformer (all-roberta-large-v1 ≈355M for WebQSP/CWQ; mmlw-retrieval-roberta-large ≈435M for PUGG). Detector: trainable classifier parameters. Total: Encoder + Detector. × ours: total parameter ratio relative to KG-Guard. Llama 4 Scout has 109B total but 17B active parameters (MoE); SelfCheck also uses Llama but in a sampling-based paradigm (N calls per query). The GPT 5 Mini parameter count is undisclosed.
## Appendix D Additional Results
### D.1 Hallucination Detection
Table 9: Full hallucination detection results on WebQSP (F1, Precision, Recall, Accuracy, AUC-PR). G, RS, and RT denote the retrieved subgraph, reasoning summary, and reasoning triples provided to the LLM judge. Values with ± denote mean ± std over 3 runs.

Table 10: Full hallucination detection results on CWQ (F1, Precision, Recall, Accuracy, AUC-PR). G, RS, and RT denote the retrieved subgraph, reasoning summary, and reasoning triples provided to the LLM judge. Values with ± denote mean ± std over 3 runs.

Table 11: Full hallucination detection results on PUGG (F1, Precision, Recall, Accuracy, AUC-PR). G, RS, and RT denote the retrieved subgraph, reasoning summary, and reasoning triples provided to the LLM judge. Values with ± denote mean ± std over 3 runs.

Tables [9](https://arxiv.org/html/2606.00328#A4.T9), [10](https://arxiv.org/html/2606.00328#A4.T10), and [11](https://arxiv.org/html/2606.00328#A4.T11) extend the results reported in Section [5.1](https://arxiv.org/html/2606.00328#S5.SS1) with the full precision, recall, accuracy, and AUC-PR breakdown. A consistent pattern emerges: LLM-based judges are strongly precision-skewed. GPT-5-mini (G+RS+RT) reaches 85.3% precision on WebQSP but only 45.3% recall; Llama-4-Scout variants reach up to 96.7% precision on PUGG at recalls near 44%. KG-Guard achieves a more balanced trade-off: 84.9% precision and 81.3% recall on WebQSP (F1 = 82.0), showing that the F1 advantage over LLM judges is driven primarily by higher recall. KG-Guard also leads in AUC-PR on all three datasets (91.4 on WebQSP, 94.2 on CWQ, 90.0 on PUGG), versus the strongest non-trivial baseline (GPT-5-mini) at 65.5, 88.4, and 89.5, respectively.
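Because F1 is the harmonic mean of precision and recall, a large gap between the two pulls the score toward the lower value, which is why a high-precision, low-recall judge scores poorly. The sketch below applies the standard formula to the WebQSP precision and recall values quoted above. Note that an F1 computed from run-averaged precision and recall need not equal a reported mean of per-run F1 scores, so the second value is not expected to reproduce 82.0 exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (output is on the input scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs in percent, as quoted for WebQSP.
judge_f1 = f1_score(85.3, 45.3)     # GPT-5-mini (G+RS+RT)
kg_guard_f1 = f1_score(84.9, 81.3)  # KG-Guard, from run-averaged P and R

print(f"LLM judge: {judge_f1:.1f}")
print(f"KG-Guard:  {kg_guard_f1:.1f}")
```

Both systems have nearly the same precision here, so the difference of more than 20 points in the computed F1 comes almost entirely from recall.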
### D.2 Hallucination Correction
Table 12: KBQA answer quality before (Initial) and after (Refined) iterative refinement guided by KG-Guard hallucination flags. Δ is the score improvement.

Table [12](https://arxiv.org/html/2606.00328#A4.T12) extends the F1 and Exact Match results reported in Section [5.2](https://arxiv.org/html/2606.00328#S5.SS2) with Precision, Recall, and Accuracy. Gains are consistent across all five metrics and all three datasets. Precision improves by 14.8–15.6 pp, while recall improves by 9.5–12.4 pp; the larger precision gains indicate that refinement primarily eliminates incorrect answer nodes rather than recovering missing ones.
## Appendix E Additional Statistics
### E.1 Knowledge Graph Subgraphs
Table 13: Statistics of the question-specific subgraphs retrieved by PCST and used as input to KG-Guard. Avg./Max #Nodes and Avg./Max #Triples report node and edge statistics per subgraph, respectively.

Table [13](https://arxiv.org/html/2606.00328#A5.T13) reports subgraph sizes across datasets. WebQSP and CWQ subgraphs are compact (avg. 15–17 nodes, 18–20 triples), owing to Freebase's sparser entity neighborhoods compared to Wikidata. PUGG subgraphs are roughly five times larger (avg. 72–98 nodes, 83–121 triples) due to Wikidata's denser structure with many hub nodes, even after filtering nodes connected to more than 1000 entities. Maximum subgraph sizes reach thousands of triples across all datasets, requiring the graph encoder to handle substantial structural variability. These subgraphs could also exceed the practical context limits of LLMs, motivating a graph-based approach. The full subgraph size distributions are shown in Figure [5](https://arxiv.org/html/2606.00328#A5.F5); the distributions for all three datasets are heavy-tailed.
Figure 5: Subgraph size distributions across the three datasets. Top row: number of nodes; bottom row: number of triples. Counts are on a log scale.
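The text mentions filtering nodes connected to more than 1000 entities but does not describe how the filter is implemented. One plausible reading, sketched below as an assumption, counts distinct neighbors per node and drops every triple that touches a node above the threshold. The function name and the toy triples are illustrative, and the toy threshold is lowered to 3 so the effect is visible.

```python
from collections import defaultdict


def drop_hub_nodes(triples, max_neighbors=1000):
    """Remove triples touching any node with more than `max_neighbors` distinct neighbors.

    `triples` is a list of (head, relation, tail) tuples. Returns the kept
    triples and the set of nodes treated as hubs.
    """
    neighbors = defaultdict(set)
    for head, _, tail in triples:
        neighbors[head].add(tail)
        neighbors[tail].add(head)
    hubs = {node for node, nbrs in neighbors.items() if len(nbrs) > max_neighbors}
    kept = [t for t in triples if t[0] not in hubs and t[2] not in hubs]
    return kept, hubs


toy_triples = [
    ("Q1", "instance of", "human"),
    ("Q2", "instance of", "human"),
    ("Q3", "instance of", "human"),
    ("Q4", "instance of", "human"),
    ("Q1", "spouse", "Q2"),
    ("Q3", "born in", "Q5"),
]
kept, hubs = drop_hub_nodes(toy_triples, max_neighbors=3)
```

In the toy graph only the node `human` exceeds the threshold, so the four `instance of` triples are removed and the two remaining triples are kept.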
### E.2 Answer Statistics
Table 14: Average answer count per question. #GT Ans./Q: gold (ground truth) answer nodes; #LLM Ans./Q: nodes returned by the LLM-based KBQA.

Table [14](https://arxiv.org/html/2606.00328#A5.T14) reports the average number of gold and LLM-predicted answer nodes per question. WebQSP contains substantially more gold answers per question (avg. 10.2–11.5) than CWQ (1.9–2.3) or PUGG (≈1.8), reflecting that WebQSP questions admit multiple valid answer entities. The LLM consistently under-predicts on WebQSP, returning 3.0–3.3 nodes on average against 10+ gold answers. On CWQ and PUGG, the number of nodes the LLM returns is close to the gold count.
### E.3 LLM Abstention and Hallucination Patterns
Table 15: LLM abstention rate (%Unknown) and gold answer coverage (%Answer in Graph: fraction of examples where at least one gold answer node appears in the retrieved subgraph).

Table 16: LLM-based KBQA abstention rate (*unknown* answers) conditioned on gold answer presence in the retrieved subgraph. %Answer absent/%Answer present: abstention rate when the gold answer is absent from or present in the subgraph, respectively.

Table 17: Hallucination rate among non-abstained LLM-based KBQA answers, conditioned on gold answer presence in the retrieved subgraph. %Answer absent/%Answer present: hallucination rate when the gold answer is absent from or present in the subgraph, respectively. %Answer absent is always 100% by definition: if the gold answer is not in the subgraph, any non-abstained node returned by the system is hallucinated.

Table [15](https://arxiv.org/html/2606.00328#A5.T15) shows that a substantial fraction of LLM responses are abstentions (*unknown*): 16–18% on WebQSP, 19–21% on PUGG, and 35–38% on CWQ. In all cases the gold answer is present in the retrieved subgraph for fewer than two thirds of examples, reflecting the inherent difficulty of subgraph retrieval. Larger subgraphs increase answer coverage but add noise and exceed LLM context limits, while smaller subgraphs are more tractable but risk omitting the correct evidence. Tables [16](https://arxiv.org/html/2606.00328#A5.T16) and [17](https://arxiv.org/html/2606.00328#A5.T17) show how abstention and hallucination rates vary with answer presence. When the gold answer is absent from the subgraph, the LLM abstains at rates of 35–57% depending on the dataset. When the gold answer is present, abstention rates drop to 1–11% and hallucination among non-abstaining responses falls to 38–47%.
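The conditioned rates can be computed by splitting responses on whether the gold answer is in the subgraph and then counting abstentions and hallucinations within each group. The sketch below is illustrative: the record keys are hypothetical, the toy records are made up, and it counts one outcome per response, whereas the reported hallucination rates may be computed per answer node.

```python
def conditional_rates(records):
    """Abstention and hallucination rates split by gold-answer presence.

    Each record is a dict with boolean keys (hypothetical names):
    gold_in_subgraph, abstained, hallucinated.
    """
    out = {}
    for present in (False, True):
        group = [r for r in records if r["gold_in_subgraph"] == present]
        answered = [r for r in group if not r["abstained"]]
        out[present] = {
            "abstention_rate": (
                (len(group) - len(answered)) / len(group) if group else None
            ),
            # Hallucination rate is measured only over non-abstained responses.
            "hallucination_rate": (
                sum(r["hallucinated"] for r in answered) / len(answered)
                if answered else None
            ),
        }
    return out


def rec(present, abstained, hallucinated):
    return {"gold_in_subgraph": present, "abstained": abstained,
            "hallucinated": hallucinated}


# Made-up toy data: 4 responses without the gold answer, 5 with it.
records = (
    [rec(False, True, False)] * 2
    + [rec(False, False, True)] * 2
    + [rec(True, True, False)] * 1
    + [rec(True, False, True)] * 2
    + [rec(True, False, False)] * 2
)
rates = conditional_rates(records)
```

In the toy data, every non-abstained response in the answer-absent group is hallucinated, mirroring the 100% entry that holds by definition.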
These patterns confirm that hallucination in this setting is tightly coupled to whether the answer evidence is accessible in the retrieved subgraph. When the gold answer is absent, the LLM has no graph evidence to ground its response and is prone to hallucinate, underlining the need for a dedicated hallucination detection step.