Ontology-Guided Evidence Path Inference for Multi-hop Knowledge Graph Question Answering


Summary

Proposes OPI, an ontology-guided framework for multi-hop knowledge graph question answering that leverages a relation-centric ontology graph for bidirectional retrieval and iterative refinement, achieving state-of-the-art results on multiple benchmarks.

arXiv:2606.28076v1 (announce type: new)


# Ontology-Guided Evidence Path Inference for Multi-hop Knowledge Graph Question Answering
Source: [https://arxiv.org/html/2606.28076](https://arxiv.org/html/2606.28076)
Meihan Wu (Pengcheng Laboratory, Shenzhen, China, wumh@pcl.ac.cn), Cundi Fang (National University of Defense Technology, Changsha, China, fangcundi@nudt.edu.cn), Jie Peng (National University of Defense Technology, Changsha, China, pengjie@nudt.edu.cn), and Xiaodong Wang (National University of Defense Technology, Changsha, China, xdwang@nudt.edu.cn)

###### Abstract

Knowledge graph question answering \(KGQA\) aims to answer natural\-language questions by reasoning over structured facts\. Existing multi\-hop KGQA methods mainly rely on topic\-centered expansion, which faces two key challenges: the search space rapidly grows with noisy mixed\-type paths, and retrieved paths may fail to satisfy the semantic constraints of complex questions\. To address these challenges, we propose OPI, an ontology\-guided evidence path inference framework for multi\-hop KGQA\. OPI introduces a relation\-centric ontology graph to capture the head\-tail type constraints of relations, providing a compact interface for answer\-side constraints\. Based on this ontology graph, OPI first introduces a bidirectional retrieval mechanism by mapping the predicted answer type to compatible final\-hop relations and combining topic\-side prefix expansion with answer\-side final\-hop matching, thereby suppressing noisy mixed\-type expansion\. OPI further adopts an iterative refinement strategy to reassess retrieved paths and candidate answers under the question context, filtering type\-compatible but question\-irrelevant evidence for more reliable answer prediction\. Experiments on WebQSP, CWQ, and MetaQA show that OPI substantially reduces the search space, improves Hit@1/F1 by 4\.6/5\.0 points on WebQSP and 8\.9/3\.3 points on CWQ over the strongest prior results, and achieves near\-saturated Hit@1 on MetaQA with the retrieval module alone\.

PVLDB Reference Format: PVLDB, 14(1): XXX-XXX, 2020. [doi:XX.XX/XXX.XX](https://doi.org/XX.XX/XXX.XX)

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit [https://creativecommons.org/licenses/by-nc-nd/4.0/](https://creativecommons.org/licenses/by-nc-nd/4.0/) to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097.

## 1. Introduction

Knowledge graph question answering (KGQA) aims to answer natural-language questions by reasoning over entities and relations in a knowledge graph, and has been applied in domains such as healthcare (Frisoni et al., [2024](https://arxiv.org/html/2606.28076#bib.bib38)), agriculture (Yang et al., [2024](https://arxiv.org/html/2606.28076#bib.bib46)), and multimedia (Lee et al., [2024](https://arxiv.org/html/2606.28076#bib.bib47)). Driven by the rapid advances of large language models (LLMs), recent studies increasingly integrate LLMs with KGs to support question understanding, reasoning planning, and graph-grounded answer generation (Pan et al., [2024](https://arxiv.org/html/2606.28076#bib.bib45)). These methods explore diverse strategies, including iterative graph exploration, relation-path planning, and graph-constrained decoding, and have advanced multi-hop reasoning over KGs (Liu et al., [2025b](https://arxiv.org/html/2606.28076#bib.bib14); Wen et al., [2024](https://arxiv.org/html/2606.28076#bib.bib40); Sun et al., [2024](https://arxiv.org/html/2606.28076#bib.bib10); Luo et al., [2024](https://arxiv.org/html/2606.28076#bib.bib11), [2025](https://arxiv.org/html/2606.28076#bib.bib15); Tan et al., [2025](https://arxiv.org/html/2606.28076#bib.bib39); Mavromatis and Karypis, [2025](https://arxiv.org/html/2606.28076#bib.bib13)). Nevertheless, complex multi-hop KGQA remains challenging over large-scale KGs, where increasing scale, heterogeneity, and structural complexity introduce large search spaces and ambiguous reasoning evidence (Dong, [2023](https://arxiv.org/html/2606.28076#bib.bib44); Ge et al., [2021](https://arxiv.org/html/2606.28076#bib.bib43); Rabbani et al., [2023](https://arxiv.org/html/2606.28076#bib.bib42)).

A common paradigm for multi-hop KGQA is to retrieve evidence paths or subgraphs rooted at the topic entity and use them to derive answers. To improve retrieval efficiency, many methods control the expansion space through local subgraph construction, iterative graph exploration, and path pruning (Wen et al., [2024](https://arxiv.org/html/2606.28076#bib.bib40); Sun et al., [2024](https://arxiv.org/html/2606.28076#bib.bib10); Tan et al., [2025](https://arxiv.org/html/2606.28076#bib.bib39)). To improve reasoning quality, recent methods further introduce semantic signals such as relation-path planning, graph-structured prompting, and LLM-based evidence reasoning (Liu et al., [2025b](https://arxiv.org/html/2606.28076#bib.bib14); Luo et al., [2024](https://arxiv.org/html/2606.28076#bib.bib11), [2025](https://arxiv.org/html/2606.28076#bib.bib15); Mavromatis and Karypis, [2025](https://arxiv.org/html/2606.28076#bib.bib13)). Despite these advances, most retrieval procedures are still primarily driven by topic-centered expansion: they start from the topic entity and progressively explore neighboring entities and relations. Consequently, retrieval often produces numerous structurally reachable paths that are inconsistent with the expected answer type, while also introducing spurious candidate paths that fail to satisfy the complex semantic constraints of the question.

This topic-centered retrieval paradigm faces two challenges in multi-hop KGQA, as illustrated in [Figure 1](https://arxiv.org/html/2606.28076#S1.F1). The first is *path explosion*: unconstrained forward expansion from the topic entity produces a large number of candidate paths ending in heterogeneous types, the vast majority of which are irrelevant to the expected answer type. As shown in the upper part of [Figure 1](https://arxiv.org/html/2606.28076#S1.F1), paths starting from the same topic entity may branch into countries, persons, awards, clubs, languages, and other mixed-type endpoints. The second is *semantic misalignment*: even when candidate paths reach answer-type-compatible entities, they may still violate the implicit constraints of the question and thus deviate from the intended reasoning semantics. As shown in the lower part of [Figure 1](https://arxiv.org/html/2606.28076#S1.F1), several paths can lead to plausible language entities, but only the path that captures both the birth-country constraint and the official-language constraint provides the correct reasoning evidence. Therefore, an effective multi-hop KGQA framework must both mitigate the explosion of candidate paths and identify reasoning chains that satisfy the complex semantic constraints of the question.

![Refer to caption](https://arxiv.org/html/2606.28076v1/x1.png)

Figure 1. Illustration of two main challenges in multi-hop KGQA. The upper part shows that topic-centered expansion produces many mixed-type candidate paths, while the lower part shows that even answer-type-compatible paths may still violate the semantic constraints implied by the question.

To address these challenges, we propose OPI, an Ontology-guided evidence Path Inference framework for multi-hop KGQA. Rather than relying on unconstrained topic-centered expansion that produces noisy, mixed-type paths, OPI introduces a relation-centric ontology graph as the structural basis for evidence path inference. This graph abstracts the knowledge graph into a type-level representation that captures how relations connect head and tail entity types, providing a compact interface for answer-side constraints. Driven by this compact interface, OPI replaces unconstrained topic-centered retrieval with a two-stage inference paradigm that addresses the two challenges. It first leverages a bidirectional retrieval mechanism to impose answer-side constraints and prune the explosive search space early. It then applies an iterative refinement strategy to filter spurious evidence and improve semantic alignment with the question.

Specifically, OPI leverages *a bidirectional retrieval mechanism* to obtain a set of answer-type-compatible evidence paths. Given a question, OPI predicts the implied answer type and maps it to a small set of answer-type-compatible final-hop relations. It then combines topic-side prefix expansion with answer-side final-hop matching, so that retrieval is no longer driven by unconstrained topic-centered expansion alone. This bidirectional retrieval mechanism effectively suppresses mixed-type path growth and mitigates structural path explosion. After retrieval, OPI applies *an iterative answer refinement strategy* to reassess the retrieved paths and candidate answers under the question context. The generator produces an answer hypothesis from the current path and answer contexts, while the refiner evaluates the hypothesis together with the retrieved evidence and returns structured feedback. This feedback updates the focused path context and candidate answer context, enabling OPI to filter type-compatible but question-irrelevant evidence and generate more reliable final answers.

Our main contributions are summarized as follows:

- We introduce a relation-centric ontology graph, which captures how relations connect head and tail entity types and provides the structural basis for answer-side constrained path retrieval.
- We propose an ontology-guided bidirectional retrieval mechanism, which combines topic-side prefix expansion with answer-side final-hop matching to reduce noisy mixed-type expansion and alleviate path explosion.
- We design an iterative answer refinement strategy, which jointly reassesses retrieved paths and candidate answers under the question context to filter type-compatible but question-irrelevant evidence.
- We conduct extensive experiments on WebQSP, CWQ, and MetaQA, demonstrating the effectiveness of OPI and its key components across different KGQA benchmarks.

## 2. Preliminaries

In this section, we formalize the KGQA task considered in this paper by defining the knowledge graph, its type\-level abstraction, and the answer prediction objective\.

### 2.1. Knowledge Graph and Ontology Graph

Knowledge graph. A knowledge graph is denoted by $\mathcal{G}=(\mathcal{E},\mathcal{R},\mathcal{T})$, where:

- $\mathcal{E}$ is the set of entities;
- $\mathcal{R}$ is the set of relations;
- $\mathcal{T}\subseteq\mathcal{E}\times\mathcal{R}\times\mathcal{E}$ is the set of factual triples.

Each factual triple $(e_h,r,e_t)\in\mathcal{T}$ indicates that the head entity $e_h$ is connected to the tail entity $e_t$ by relation $r$.

Ontology graph. The ontology graph provides a type-level abstraction of the knowledge graph through relation signatures. It is denoted by $\mathcal{O}=(\mathcal{C},\mathcal{R},\mathcal{S})$, where:

- $\mathcal{C}$ is the set of semantic types associated with entities;
- $\mathcal{R}$ is the relation set shared with the knowledge graph;
- $\mathcal{S}\subseteq\mathcal{C}\times\mathcal{R}\times\mathcal{C}$ is the set of type-level relation signatures.

Each relation signature $(c_h,r,c_t)\in\mathcal{S}$ specifies that relation $r$ can connect head entities of type $c_h$ to tail entities of type $c_t$ at the ontology level. While the knowledge graph stores concrete entity-level facts, the ontology graph captures type-level compatibility through these relation signatures. [Figure 2](https://arxiv.org/html/2606.28076#S2.F2) illustrates how entity-level triples in the knowledge graph are abstracted into type-level relation signatures in the ontology graph.
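
The two structures defined above can be sketched as plain Python data. This is a minimal illustration, not the authors' implementation: the class names `Triple` and `Signature`, the helper `abstract_to_signatures`, and every entity, relation, and type name in the toy data are assumptions introduced here, and each entity is assumed to carry exactly one type.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Entity-level fact (e_h, r, e_t) in the knowledge graph."""
    head: str
    relation: str
    tail: str


class Signature(NamedTuple):
    """Type-level relation signature (c_h, r, c_t) in the ontology graph."""
    head_type: str
    relation: str
    tail_type: str


def abstract_to_signatures(triples, entity_type):
    """Map each triple to the signature formed by its head and tail types."""
    return {
        Signature(entity_type[t.head], t.relation, entity_type[t.tail])
        for t in triples
    }


# Toy data; all names are hypothetical.
triples = {
    Triple("Ana", "born_in", "Freedonia"),
    Triple("Ben", "born_in", "Freedonia"),
    Triple("Freedonia", "official_language", "Freedonian"),
}
entity_type = {
    "Ana": "person",
    "Ben": "person",
    "Freedonia": "country",
    "Freedonian": "language",
}

signatures = abstract_to_signatures(triples, entity_type)
```

In this toy case three entity-level triples collapse into two signatures, which is the compaction the ontology graph is meant to provide.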

![Refer to caption](https://arxiv.org/html/2606.28076v1/x2.png)

Figure 2. An example of a knowledge graph and its ontology graph. The ontology graph abstracts entity-level triples into type-level relation signatures.

### 2.2. Knowledge Graph Question Answering

Given a natural language question $q$, we assume that a topic entity $e_q\in\mathcal{E}$ mentioned in the question is identified following standard KGQA settings (Jiang et al., [2023c](https://arxiv.org/html/2606.28076#bib.bib4); Luo et al., [2024](https://arxiv.org/html/2606.28076#bib.bib11)). An evidence path rooted at $e_q$ is a sequence of connected factual triples:

$$p=\bigl(e_0\xrightarrow{r_1}e_1\xrightarrow{r_2}\cdots\xrightarrow{r_l}e_l\bigr), \tag{1}$$

where $e_0=e_q$ and $(e_{i-1},r_i,e_i)\in\mathcal{T}$ for each $1\leq i\leq l$. The terminal entity $e_l$ of an evidence path is regarded as a candidate answer grounded in the knowledge graph.

For each question, the ground-truth answer set is denoted by $\mathcal{Y}^{*}\subseteq\mathcal{E}$. The task is to learn a function

$$f_{\theta}(q,e_q,\mathcal{G},\mathcal{O})\rightarrow\hat{y}, \tag{2}$$

where $\hat{y}\in\mathcal{E}$ denotes the final predicted answer. We assume that each correct answer $y\in\mathcal{Y}^{*}$ can be reached from the topic entity through at least one bounded-length evidence path in $\mathcal{G}$. The objective is therefore to produce an answer prediction $\hat{y}$ that is graph-reachable from the topic entity and semantically consistent with the constraints expressed by the question.

## 3. Method

### 3.1. Overall Architecture

OPI consists of three tightly connected modules: ontology graph construction, ontology-guided bidirectional retrieval, and iterative answer refinement. It first abstracts the original knowledge graph into a relation-centric ontology graph, where each relation is represented by a type-level relation signature. Based on this ontology graph, OPI predicts the answer type implied by the question and maps it to a set of answer-type-compatible final-hop relations, enabling bidirectional retrieval that combines topic-side prefix expansion with answer-side final-hop matching. OPI then refines the retrieved evidence and candidate answers through a generator-refiner loop, where structured feedback iteratively updates both the path context and the answer context. [Figure 3](https://arxiv.org/html/2606.28076#S3.F3) summarizes the main inference pipeline of OPI.

![Refer to caption](https://arxiv.org/html/2606.28076v1/x3.png)

Figure 3. Overall inference pipeline of OPI. The left part shows ontology-guided bidirectional retrieval, which retrieves tail-type-compatible evidence paths. The right part shows iterative answer refinement, where a generator-refiner loop refines these paths and candidate answers to produce the final answer.

### 3.2. Ontology Graph Construction

In large\-scale knowledge graphs, the entity\-level graph is often dense and highly heterogeneous\. As the hop count increases, unconstrained topic\-centered expansion can easily produce a large number of noisy evidence paths, causing rapid search\-space growth\. This motivates us to construct a compact type\-level abstraction of the knowledge graph, so that repetitive entity\-level facts can be summarized into more stable relation\-type patterns\.

Specifically, we abstract the factual knowledge graph $\mathcal{G}$ into a relation-centric ontology graph $\mathcal{O}$. In this graph, entity-level facts are summarized by type-level relation signatures of the form $(c_h,r,c_t)$, where $c_h$ and $c_t$ denote the head and tail entity types associated with relation $r$, respectively. Rather than preserving all concrete entity-level triples, the ontology graph captures how relations connect semantic types at the schema level. Since different knowledge bases expose type information in different forms, we adopt two construction strategies: schema-based extraction for knowledge bases with explicit schema predicates, and data-driven induction for knowledge bases without explicit schema predicates.

#### 3.2.1. For Freebase-style Knowledge Bases.

When explicit schema predicates are available, relation signatures can be extracted directly from the schema. Specifically, we construct the ontology graph from the Freebase RDF dump (Microsoft, [2015](https://arxiv.org/html/2606.28076#bib.bib34)) under the canonical namespace `http://rdf.freebase.com/ns/`. For each relation, we pair `type.property.schema` with `type.property.expected_type` to obtain the head and tail types in its relation signature. The former specifies the domain type of the relation, while the latter specifies its expected range type. In this way, each schema-defined relation can be transformed into a type-level triple $(c_h,r,c_t)$.

To improve the reliability of the resulting ontology graph, we filter out non-semantic or administrative types, such as `common.topic`, and retain only relations with complete head-tail signatures. For a small number of missing or under-specified relations, we perform conservative completion using only the training split to avoid information leakage. Specifically, we aggregate the observed head and tail entity sets of each such relation from the training data, infer their dominant types, and accept the completion only when both sides admit a consistent type assignment. This strategy leverages the explicit schema structure of Freebase while improving coverage for rare or incomplete relations. The same construction principle naturally extends to other KGs with explicit schema predicates, such as DBpedia, where `rdfs:domain` and `rdfs:range` provide analogous type constraints.
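
The schema-based extraction step can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: it assumes the schema statements have already been parsed into `(subject, predicate, object)` tuples with the namespace prefix stripped, the function name `extract_signatures` is hypothetical, and dropping any relation whose domain or range is a filtered type is one possible reading of the filtering rule. The conservative completion step is not shown.

```python
SCHEMA = "type.property.schema"           # gives the domain (head) type
EXPECTED = "type.property.expected_type"  # gives the range (tail) type
FILTERED_TYPES = {"common.topic"}         # administrative types to discard


def extract_signatures(schema_triples):
    """Pair domain and range statements per relation into (c_h, r, c_t)."""
    domain, range_ = {}, {}
    for subj, pred, obj in schema_triples:
        if pred == SCHEMA:
            domain[subj] = obj
        elif pred == EXPECTED:
            range_[subj] = obj
    signatures = set()
    # Keep only relations that have both a head type and a tail type.
    for rel in domain.keys() & range_.keys():
        c_h, c_t = domain[rel], range_[rel]
        if c_h in FILTERED_TYPES or c_t in FILTERED_TYPES:
            continue  # assumption: filtered types invalidate the signature
        signatures.add((c_h, rel, c_t))
    return signatures


# Illustrative input; treat these statements as toy data.
schema_triples = [
    ("people.person.place_of_birth", SCHEMA, "people.person"),
    ("people.person.place_of_birth", EXPECTED, "location.location"),
    ("location.country.official_language", SCHEMA, "location.country"),
    ("location.country.official_language", EXPECTED, "language.human_language"),
    ("example.relation.incomplete", SCHEMA, "example.type"),
    ("example.relation.generic", SCHEMA, "example.type"),
    ("example.relation.generic", EXPECTED, "common.topic"),
]

sigs = extract_signatures(schema_triples)
```

The incomplete relation and the relation pointing at `common.topic` are both excluded, so only the two fully typed relations survive.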

#### 3.2.2. For Wiki-Movie-style Knowledge Bases.

When explicit schema predicates are unavailable, relation signatures are induced directly from data\. In Wiki\-Movie\-style KGs, the original graph does not provide schema predicates that explicitly define the domain and range types of each relation\. Therefore, we first infer entity types from the training QA pairs and their associated type\-path annotations\. Each entity is assigned a single main type according to its most frequent observation, which reduces type sparsity and avoids unstable relation signatures caused by overly fine\-grained or inconsistent type assignments\.

After obtaining entity-level type assignments, we aggregate the observed head-tail type pairs for each relation. For a relation $r$, all training triples containing $r$ are mapped from entity-level triples $(e_h,r,e_t)$ to type-level observations $(c_h,r,c_t)$. We then retain the dominant head-tail type pair as the relation signature of $r$. This data-driven strategy allows ontology graph construction even when no explicit schema is available. More generally, for KGs such as Wikidata, relation signatures can also be induced by mapping entities to type sets via `P31` (*instance of*) and optionally generalizing them with `P279` (*subclass of*), followed by relation-wise aggregation of head and tail type statistics.
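
A minimal sketch of this data-driven induction, assuming type observations are available as `(entity, type)` pairs. The function names `main_types` and `induce_signatures` and all toy names are hypothetical, and tie-breaking between equally frequent types is left to `Counter.most_common`, which the source does not specify.

```python
from collections import Counter, defaultdict


def main_types(type_observations):
    """Assign each entity the type it is most frequently observed with."""
    counts = defaultdict(Counter)
    for entity, etype in type_observations:
        counts[entity][etype] += 1
    return {e: c.most_common(1)[0][0] for e, c in counts.items()}


def induce_signatures(triples, entity_type):
    """Keep the dominant (head type, tail type) pair for each relation."""
    pair_counts = defaultdict(Counter)
    for h, r, t in triples:
        if h in entity_type and t in entity_type:
            pair_counts[r][(entity_type[h], entity_type[t])] += 1
    signatures = {}
    for r, counts in pair_counts.items():
        (c_h, c_t), _n = counts.most_common(1)[0]
        signatures[r] = (c_h, r, c_t)
    return signatures


# Toy observations and triples; all names are hypothetical.
observations = [
    ("Film_A", "movie"), ("Film_A", "movie"), ("Film_A", "title"),
    ("Film_B", "movie"), ("Director_X", "person"), ("Y2010", "year"),
]
triples = [
    ("Film_A", "directed_by", "Director_X"),
    ("Film_B", "directed_by", "Director_X"),
    ("Y2010", "directed_by", "Director_X"),  # noisy triple, outvoted below
    ("Film_A", "release_year", "Y2010"),
]

types = main_types(observations)
induced = induce_signatures(triples, types)
```

Because only the dominant pair is retained, the single noisy `directed_by` triple does not affect the induced signature.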

In both settings, the final ontology graph takes the same form: each relation is associated with a relation signature $(c_h,r,c_t)$. This unified relation-signature interface makes OPI applicable across heterogeneous knowledge graphs. More importantly, it provides the structural link from ontology construction to retrieval: in the next stage, predicted answer types are mapped to answer-type-compatible final-hop relations through exactly this interface.

### 3.3. Ontology-Guided Bidirectional Retrieval

Building on the relation-signature interface defined above, OPI introduces an ontology-guided bidirectional retrieval mechanism to alleviate path explosion. Here, *bidirectional* does not refer to conventional two-sided entity-level BFS from the topic entity and concrete answer entities. Instead, it means that retrieval is jointly constrained by topic-side prefix expansion and answer-side type constraints. Specifically, OPI first predicts the answer type implied by the question, maps it to answer-type-compatible final-hop relations through the ontology graph, and then combines topic-side prefix expansion with answer-side final-hop matching to retrieve evidence paths and candidate answers.

#### 3.3.1. Answer-Type Prediction

The relation\-signature interface makes answer\-type prediction a natural target for retrieval guidance\. Compared with direct answer\-entity prediction, answer\-type prediction is substantially more compact and stable: questions with different surface forms and different gold answers often still share the same answer\-side semantic category, such as*person*,*film*, or*location*\. Predicting the answer type before graph retrieval therefore provides a coarse but reliable target that can later be translated into structural constraints on the final hop\.

To construct supervision, given a question $q$ with topic entity $e_q$ and gold answer $e_a$, we extract the shortest factual path $w_z(e_q,e_a)=(e_q\xrightarrow{r_1}\cdots\xrightarrow{r_l}e_a)$ with relation sequence $z=(r_1,\ldots,r_l)$ in $\mathcal{G}$. We use this path only as a minimal supervision signal. In particular, we take its last-hop relation $r_l$ and query the ontology graph $\mathcal{O}$ for relation signatures of the form $(c_h,r_l,c_a)$, where $c_h$ is a head type and $c_a$ is a tail type. The tail type in such a signature is then regarded as a valid answer entity type for the question. This yields the target set

$$\mathcal{C}_a=\left\{c_a\;:\;\exists\,c_h,\ (c_h,r_l,c_a)\in\mathcal{S}\right\}. \tag{3}$$

If multiple ontology-consistent answer entity types are induced, we treat them as equally valid and define the supervision target as

$$Q(c_a\mid q,e_q,e_a,\mathcal{G},\mathcal{O})=\begin{cases}\dfrac{1}{|\mathcal{C}_a|}, & c_a\in\mathcal{C}_a,\\[4pt] 0, & \text{otherwise}.\end{cases} \tag{4}$$

Following Luo et al. ([2024](https://arxiv.org/html/2606.28076#bib.bib11)), we use this simple target distribution to supervise answer-type prediction.
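
Equations (3) and (4) can be illustrated with a short sketch. It assumes signatures are stored as `(c_h, r, c_t)` tuples; the function names and toy data are hypothetical, and returning an all-zero distribution when no signature matches is an assumption, since the source does not discuss that case.

```python
def answer_type_targets(last_hop_relation, signatures):
    """Eq. (3): tail types of all signatures that use the last-hop relation."""
    return {c_t for (_c_h, r, c_t) in signatures if r == last_hop_relation}


def target_distribution(answer_types, all_types):
    """Eq. (4): uniform mass over the valid answer types, zero elsewhere."""
    if not answer_types:
        return {c: 0.0 for c in all_types}  # assumption: no valid target
    mass = 1.0 / len(answer_types)
    return {c: (mass if c in answer_types else 0.0) for c in all_types}


# Toy signatures; all names are hypothetical.
signatures = {
    ("person", "place_of_birth", "city"),
    ("person", "place_of_birth", "country"),
    ("country", "official_language", "language"),
}
all_types = {"person", "city", "country", "language"}

c_a = answer_type_targets("place_of_birth", signatures)
q_dist = target_distribution(c_a, all_types)
```

Here the last-hop relation admits two tail types, so each receives probability 0.5 and every other type receives zero.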

We then fine-tune a large language model to generate the answer entity type conditioned on the question $q$. This defines a conditional prior

$$\log P_{\phi}(c_a\mid q)=\sum_{i=1}^{|c_a|}\log P_{\phi}(s_i\mid s_{<i},q), \tag{5}$$

where $\phi$ denotes the model parameters, $s_i$ is the $i$-th generated token, and $s_{<i}$ denotes the token prefix before position $i$. The resulting prediction is subsequently mapped to answer-type-compatible final-hop relations for retrieval.

#### 3.3.2. Answer-Type-Guided Bidirectional Retrieval

Given a predicted answer type $c_a$, OPI uses the ontology graph to identify which relations can end at entities of this type. Specifically, we search the relation-signature set $\mathcal{S}$ for all signatures whose tail type is $c_a$. For each matched signature $(c_h,r,c_a)$, the relation $r$ is regarded as a candidate final-hop relation, because it can connect some head type $c_h$ to the predicted answer type $c_a$ at the ontology level. Formally, we define the answer-type-compatible final-hop relation set as

$$\mathcal{R}^{\mathrm{last}}_{c_a}=\left\{r\;:\;\exists\,c_h,\ (c_h,r,c_a)\in\mathcal{S}\right\}. \tag{6}$$

Through this relation-signature interface, the predicted answer type is converted into a set of structurally plausible final-hop relations for subsequent retrieval.

Retrieval then proceeds bidirectionally. On the forward side, we start from the topic entity and traverse the factual graph along the original edge directions, expanding candidate path prefixes hop by hop under bounded depth and path budgets. On the answer side, rather than starting from concrete answer entities, OPI represents the answer constraint by the set of ontology-compatible final-hop relations $\mathcal{R}^{\mathrm{last}}_{c_a}$ induced by the predicted answer type. This set is defined at the ontology level and may still contain multiple admissible relations for the target tail type. Under a specific question, however, the feasible last hop is further constrained by the endpoint of the forward prefix. Concretely, for each prefix ending at a penultimate node $v$, only those relations in $\mathcal{R}^{\mathrm{last}}_{c_a}$ for which the type of $v$ matches the required head-side signature remain valid. Many ontology-legitimate final hops are therefore eliminated as soon as the forward context is fixed, and the candidate set at the last step is typically reduced to only a few relations, in many cases to a unique one.

The two sides thus meet at the penultimate node. For each forward prefix ending at $v$, we check whether $v$ can serve as a valid head entity for some $r\in\mathcal{R}^{\mathrm{last}}_{c_a}$. If so, we append the matched final hop and obtain a complete evidence path. Rather than expanding all outgoing relations at the final step, the search explicitly reserves the last hop for ontology-guided completion. This design not only enforces answer-type consistency, but also avoids introducing a large number of spurious branches that would otherwise arise from unconstrained topic-centered expansion.
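
The retrieval procedure described above might be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each entity has a single type, paths are stored as flat tuples `(e0, r1, e1, ...)`, the path budget is a plain truncation, and the names `final_hop_relations`, `bidirectional_retrieve`, `max_hops`, and `max_paths` are hypothetical.

```python
from collections import defaultdict


def final_hop_relations(answer_type, signatures):
    """Eq. (6), keeping the admissible head types for each relation."""
    allowed = defaultdict(set)
    for c_h, r, c_t in signatures:
        if c_t == answer_type:
            allowed[r].add(c_h)
    return allowed


def bidirectional_retrieve(topic, answer_type, triples, signatures,
                           entity_type, max_hops=2, max_paths=1000):
    out_edges = defaultdict(list)
    for h, r, t in triples:
        out_edges[h].append((r, t))
    last = final_hop_relations(answer_type, signatures)

    prefixes = [(topic,)]
    complete = []
    for depth in range(max_hops):
        # Answer side: complete a prefix only through a compatible final hop
        # whose head-side type matches the penultimate node v.
        for prefix in prefixes:
            v = prefix[-1]
            for r, t in out_edges[v]:
                if entity_type.get(v) in last.get(r, ()):
                    complete.append(prefix + (r, t))
        # Topic side: expand prefixes, but never beyond the penultimate hop.
        if depth < max_hops - 1:
            prefixes = [p + (r, t) for p in prefixes
                        for r, t in out_edges[p[-1]]][:max_paths]
    return complete


# Toy graph; all names are hypothetical.
triples = [
    ("Ana", "born_in", "Freedonia"),
    ("Ana", "plays_for", "FC_Example"),
    ("Ana", "won", "Example_Award"),
    ("Freedonia", "official_language", "Freedonian"),
    ("Freedonia", "capital", "Example_City"),
    ("FC_Example", "located_in", "Example_City"),
]
entity_type = {
    "Ana": "person", "Freedonia": "country", "FC_Example": "club",
    "Example_Award": "award", "Freedonian": "language", "Example_City": "city",
}
signatures = {
    ("person", "born_in", "country"), ("person", "plays_for", "club"),
    ("person", "won", "award"), ("country", "official_language", "language"),
    ("country", "capital", "city"), ("club", "located_in", "city"),
}

paths = bidirectional_retrieve("Ana", "language", triples, signatures, entity_type)
```

In the toy graph, an unconstrained two-hop expansion would also reach `Example_City` through two different branches, whereas the constrained search returns only the path ending in a `language` entity.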

This mechanism constrains the search space and reduces search complexity. Let $b$ denote the average branching factor of the factual knowledge graph. Without answer-side constraints, an $x$-hop forward expansion from the topic entity explores

$$|\mathcal{P}^{\mathrm{full}}_{x}(e_q)|=O(b^{x}). \tag{7}$$

With answer-type-guided final-hop constraints, retrieval is reformulated as searching for valid $(x-1)$-hop prefixes followed by a constrained one-hop completion. Let $\mathcal{R}^{\mathrm{last}}_{c_a}$ denote the set of final-hop relations compatible with the predicted answer type $c_a$, and let $\beta(c_a)$ denote the effective branching factor after applying these final-hop constraints at prefix endpoints. Since the constrained completion is selected from the original outgoing branches, we have $\beta(c_a)\leq b$. The resulting search space is therefore bounded by

$$|\mathcal{P}^{\mathrm{bi}}_{x}(e_q,c_a)|=O\!\left(b^{x-1}\cdot\beta(c_a)\right). \tag{8}$$

This formulation shows that OPI reduces the last-hop expansion from unconstrained entity-level branching to answer-type-constrained completion. The reduction depends on the selectivity of the answer-side constraints, rather than assuming that all ontology-compatible relation sets are uniformly small.
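
To make the two bounds concrete, the sketch below evaluates their dominant terms for illustrative values. The numbers $b=50$, $x=3$, and $\beta(c_a)=2$ are assumptions chosen for arithmetic convenience, not measurements from the paper, and constant factors hidden by the $O(\cdot)$ notation are ignored.

```python
def full_search_space(b, x):
    """Dominant term of Eq. (7): unconstrained x-hop expansion."""
    return b ** x


def constrained_search_space(b, x, beta):
    """Dominant term of Eq. (8): (x-1)-hop prefixes, then constrained last hop."""
    if beta > b:
        raise ValueError("beta(c_a) cannot exceed the branching factor b")
    return b ** (x - 1) * beta


b, x, beta = 50, 3, 2  # illustrative values only
full = full_search_space(b, x)
constrained = constrained_search_space(b, x, beta)
reduction = full / constrained
```

For these values the full expansion has 125000 candidate paths and the constrained one 5000, a reduction by the factor $b/\beta(c_a)=25$, which depends only on how selective the final-hop constraint is.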

#### 3.3.3. Fallback Retrieval and Evidence Reranking

If no answer type is predicted, or if the predicted answer type cannot be mapped to any answer\-type\-compatible final\-hop relation through the relation\-signature interface, we do not perform answer\-side constrained matching\. Instead, we revert to topic\-centered forward retrieval from the topic entity and enumerate candidate paths up to a bounded number of hops\. This fallback preserves a graph\-grounded evidence source even when answer\-type guidance is unavailable\.

After candidate paths are obtained, we retain the top-$k$ evidence paths under a fixed path budget. In the current implementation, the candidate paths and the question are encoded by a pretrained language model (e.g., SentenceBERT (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.28076#bib.bib37))), and their semantic relevance is measured in the embedding space. The top-$k$ paths are then retained and converted into readable path strings. Candidate answers are finally extracted from the tail entities of the retained paths, ensuring that the final answer space remains grounded in explicit graph evidence.
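
The reranking step can be sketched without external dependencies. Note the substitution: the toy `embed` function below is a bag-of-words counter standing in for the pretrained sentence encoder, so it illustrates only the top-$k$ selection by cosine similarity, not the quality of a real encoder. The path-string format and all function names are assumptions.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a pretrained encoder."""
    return Counter(text.lower().replace("_", " ").split())


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def path_to_string(path):
    """Render a path tuple (e0, r1, e1, ...) as a readable string."""
    return " -> ".join(path)


def rerank(question, paths, k):
    """Keep the k paths most similar to the question, plus their tail entities."""
    q_vec = embed(question)
    ranked = sorted(paths,
                    key=lambda p: cosine(q_vec, embed(path_to_string(p))),
                    reverse=True)
    top = ranked[:k]
    answers = []
    for p in top:
        if p[-1] not in answers:  # deduplicate while keeping rank order
            answers.append(p[-1])
    return top, answers


# Toy inputs; all names are hypothetical.
question = "what is the official language of the country where Ana was born"
paths = [
    ("Ana", "plays_for", "FC_Example", "located_in", "Example_City"),
    ("Ana", "born_in", "Freedonia", "official_language", "Freedonian"),
]

top_paths, candidates = rerank(question, paths, k=1)
```

Because candidate answers are read off the tail entities of the retained paths, every candidate remains tied to an explicit evidence path.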

### 3.4. Iterative Answer Refinement

After ontology\-guided bidirectional retrieval, OPI obtains a bounded set of candidate evidence paths and candidate answers extracted from their tail entities\. Although this retrieval stage substantially reduces the search space, the remaining paths may still contain competing branches, partially relevant evidence, or semantically incomplete support for the final answer\. OPI therefore does not directly return the top\-ranked candidate\. Instead, it performs iterative answer refinement through a generator\-refiner loop\.

#### 3.4.1. Generator-Refiner Loop

At iterationtt, the generator produces an answer hypothesis from the questionqqand the current retrieval context\. In the current implementation, this context includes the predicted answer\-type constraintscac\_\{a\}, the reasoning\-path contextP\(t−1\)P^\{\(t\-1\)\}, the answer contextA\(t−1\)A^\{\(t\-1\)\}, and the refinement feedback from the previous roundF\(t−1\)F^\{\(t\-1\)\}\. Thus, the first iteration is already grounded in retrieved evidence, while later iterations are further constrained by revision signals\.

Formally, the generation step at iteration $t$ is defined as

$$y^{(t)} = G_{\theta}\bigl(q, c_a, P^{(t-1)}, A^{(t-1)}, F^{(t-1)}\bigr), \tag{9}$$

where $y^{(t)}$ is the current answer hypothesis\. OPI further introduces a refiner that takes the current answer hypothesis together with a refinement\-specific evidence subset and converts it into explicit revision actions:

$$F^{(t)} = R_{\psi}\bigl(q, c_a, \bar{P}^{(t)}, y^{(t)}\bigr), \tag{10}$$

where $\bar{P}^{(t)}$ is the refinement\-specific path subset selected from the initial reranked path pool according to the current answer hypothesis\. The refiner output includes a confidence score, an issue tag, retained answers, forbidden answers, prioritized paths, supplementary paths, dropped paths, and a short feedback summary\. The generator proposes answer hypotheses, whereas the refiner converts the current answer state into structured revision actions\.

#### 3\.4\.2\. Refiner\-Guided Context Update

Starting from the second iteration, OPI no longer directly reuses the full retrieved path set in the next generator call\. Instead, it reconstructs the next\-round path context from the initial reranked path pool and the previous refinement result\. Paths aligned with retained answers are preferred\. If no retained answers are available, the system falls back to paths aligned with the previous generated answers\. When the refiner indicates conflict or answer\-set noise, OPI supplements the focused paths with a small number of neutral paths\. Prioritized paths are promoted, supplementary paths are appended, and dropped paths are removed\.

Formally, the next\-round context is updated as

$$P^{(t)} = \mathrm{Update}_{P}\bigl(P^{(0)}, y^{(t)}, F^{(t)}\bigr), \tag{11}$$

$$A^{(t)} = \mathrm{Update}_{A}\bigl(y^{(t)}, F^{(t)}\bigr), \tag{12}$$

where $P^{(0)}$ denotes the initial reranked path pool\. In the current implementation, $A^{(t)}$ corresponds to the retained answers exposed to the next generator round, while forbidden answers are injected separately through refinement feedback and explicit type\-level answer constraints\. Thus, the next\-round path context is reconstructed from the initial path pool under refinement guidance, whereas the next\-round answer context is formed by the supported answers retained by the refiner\. If the stopping criterion is met, this context update is skipped\.
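
One way to read the two update rules is sketched below\. This is an illustration only: the `RefinerFeedback` fields mirror the refiner outputs listed in the text, but their types, the issue\-tag values `"conflict"` and `"noise"`, the definition of a neutral path, and the parameter `n_neutral` are assumptions rather than details given in the paper\.

```python
from dataclasses import dataclass, field


@dataclass
class RefinerFeedback:
    # Field names follow the refiner outputs in the text; types are assumed.
    confidence: str = "low"
    issue_tag: str = ""
    retained_answers: list = field(default_factory=list)
    forbidden_answers: list = field(default_factory=list)
    prioritized_paths: list = field(default_factory=list)
    supplementary_paths: list = field(default_factory=list)
    dropped_paths: list = field(default_factory=list)
    summary: str = ""


def tail_of(path):
    """Tail entity of a path given as a tuple of (head, relation, tail)."""
    return path[-1][2]


def update_path_context(pool, generated_answers, fb, n_neutral=2):
    """Rebuild the next-round path context from the initial pool P^(0)."""
    # Prefer retained answers; otherwise fall back to the generated answers.
    focus = set(fb.retained_answers or generated_answers)
    dropped = list(fb.dropped_paths)
    focused = [p for p in pool if tail_of(p) in focus and p not in dropped]
    # Assumed tags: on conflict or noise, add a few neutral paths.
    if fb.issue_tag in {"conflict", "noise"}:
        forbidden = set(fb.forbidden_answers)
        neutral = [p for p in pool
                   if p not in focused and p not in dropped
                   and tail_of(p) not in forbidden]
        focused += neutral[:n_neutral]
    # Promote prioritized paths, then append supplementary ones.
    promoted = [p for p in fb.prioritized_paths if p not in dropped]
    rest = [p for p in focused if p not in promoted]
    extra = [p for p in fb.supplementary_paths
             if p not in dropped and p not in promoted and p not in rest]
    return promoted + rest + extra


def update_answer_context(generated_answers, fb):
    """The next-round answer context is the set of retained answers."""
    return list(fb.retained_answers)
```

Note that the path context is always rebuilt from the initial pool rather than from the previous round's context, so a path removed in one round can return later if the refiner's guidance changes\.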

#### 3\.4\.3\. Iterative Revision and Stopping

OPI adopts an adaptive stopping strategy based on two signals: refiner confidence and answer stability\. The refiner outputs a discrete confidence level together with structured revision actions\. The refinement loop terminates when the refiner assigns the highest confidence level, indicating that the current answer is sufficiently supported by the retrieved evidence\. We use this highest confidence level as a conservative stopping threshold to avoid premature termination when the evidence remains ambiguous\.

OPI also stops when the post\-refinement answer remains unchanged across consecutive iterations\. This stability criterion captures cases where further refinement no longer changes the final prediction\. If either stopping condition is satisfied, OPI returns the answer from the current refinement round\. Otherwise, the updated path context, answer context, answer\-type constraints, and refinement feedback are passed to the next generator round\. If no early stopping condition is triggered, OPI returns the answer from the final refinement round\.
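
The two stopping signals can be combined as in the sketch below\. It is illustrative rather than the authors' code: the confidence labels \(`"low"`, `"medium"`, `"high"`\), the set\-based notion of an unchanged answer, and the helper `run_refinement`, which replays precomputed per\-round outputs instead of calling a generator and refiner, are assumptions\.

```python
def should_stop(confidence, current_answers, previous_answers, top_level="high"):
    """Stop on the highest confidence level or on an unchanged answer set."""
    if confidence == top_level:
        return True
    if previous_answers is None:  # first round: nothing to compare against
        return False
    return set(current_answers) == set(previous_answers)


def run_refinement(rounds, max_rounds=3):
    """Replay per-round (confidence, post-refinement answers) pairs.

    Returns the answers of the round at which the loop ended and the number
    of rounds used. If no early stop fires, the final round's answers are
    returned, as described in the text.
    """
    previous, answers, used = None, [], 0
    for confidence, answers in rounds[:max_rounds]:
        used += 1
        if should_stop(confidence, answers, previous):
            break
        previous = answers
    return answers, used
```

For example, a first round that already reaches the highest confidence level ends the loop after one round, whereas a sequence of changing low\-confidence answers runs until the round budget is exhausted\.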

Overall, OPI closes the loop between retrieved evidence and final answer prediction\. The generator proposes answer hypotheses under graph\-grounded retrieval context, while the refiner converts the current answer state into explicit revision actions\. The path and answer contexts are then updated accordingly for the next round\. This adaptive process helps suppress noisy branches and improves answer reliability beyond single\-pass retrieval\-and\-readout\.

Algorithm 1\. OPI Inference Procedure

1: Input: Question $q$, topic entity $e_q$, knowledge graph $\mathcal{G}$, ontology graph $\mathcal{O}$, path budget $k$, maximum refinement rounds $T$

2: Output: Final answer $\hat{y}$

3: Step 1: Ontology\-Guided Bidirectional Retrieval

4: Predict answer type: $c_a \leftarrow \textsc{PredictType}(q)$

5: Map $c_a$ to valid final\-hop relations: $\mathcal{R}^{\mathrm{last}}_{c_a} \leftarrow \textsc{MapToLastHop}(c_a, \mathcal{O})$

6: if $\mathcal{R}^{\mathrm{last}}_{c_a} \neq \emptyset$ then

7: Retrieve candidate evidence paths by bidirectional search:

8: $P^{(0)} \leftarrow \textsc{BidirectionalRetrieve}(e_q, \mathcal{R}^{\mathrm{last}}_{c_a}, \mathcal{G})$

9: else

10: Retrieve candidate evidence paths by fallback search:

11: $P^{(0)} \leftarrow \textsc{FallbackRetrieve}(e_q, \mathcal{G})$

12: end if

13: Rerank and retain the top\-$k$ paths:

14: $P^{(0)} \leftarrow \textsc{RerankTopK}(q, P^{(0)}, k)$

15: Extract initial answer candidates:

16: $A^{(0)} \leftarrow \textsc{ExtractAnswers}(P^{(0)})$

17: Step 2: Iterative Answer Refinement

18: Initialize refinement feedback: $F^{(0)} \leftarrow \emptyset$

19: for $t = 1$ to $T$ do

20: Generate the current answer hypothesis:

21: $y^{(t)} \leftarrow \textsc{Generate}(q, c_a, P^{(t-1)}, A^{(t-1)}, F^{(t-1)})$

22: Select the refinement\-specific path subset:

23: $\bar{P}^{(t)} \leftarrow \textsc{SelectRefinePaths}(P^{(0)}, y^{(t)})$

24: Refine the current answer and obtain revision actions:

25: $F^{(t)} \leftarrow \textsc{Refine}(q, c_a, \bar{P}^{(t)}, y^{(t)})$

26: Derive the refined answer for the current round:

27: $y^{(t)} \leftarrow \textsc{PostRefine}(y^{(t)}, F^{(t)})$

28: if $\textsc{ShouldStop}(F^{(t)}, y^{(t)})$ then

29: break

30: end if

31: Update the next\-round path context and answer context:

32: $P^{(t)} \leftarrow \textsc{UpdatePathContext}(P^{(0)}, y^{(t)}, F^{(t)})$

33: $A^{(t)} \leftarrow \textsc{UpdateAnswerContext}(y^{(t)}, F^{(t)})$

34: end for

35: return $\hat{y} \leftarrow y^{(t)}$

## 4\. Experiments

We conduct extensive experiments to answer the following research questions: RQ1: Does OPI outperform existing KGQA methods on WebQSP and CWQ? RQ2: Is ontology\-guided bidirectional retrieval effective as an independent retrieval module? RQ3: How much does each component of OPI contribute to the overall performance? RQ4: Why do ontology\-based constraints improve retrieval effectiveness? RQ5: Is OPI robust across different LLM backbones? RQ6: Can OPI reduce retrieval cost and evidence noise?

### 4\.1\. Experimental Settings

Datasets\. We evaluate our method on three widely used KGQA benchmarks: WebQuestionsSP \(WebQSP\) \(Yih et al\., [2016](https://arxiv.org/html/2606.28076#bib.bib30)\), ComplexWebQuestions \(CWQ\) \(Talmor and Berant, [2018](https://arxiv.org/html/2606.28076#bib.bib31)\), and MetaQA \(Zhang et al\., [2018](https://arxiv.org/html/2606.28076#bib.bib33)\)\. These benchmarks are built on two knowledge graphs and cover both open\-domain and domain\-specific QA scenarios\. Specifically, WebQSP and CWQ are based on Freebase \(Bollacker et al\., [2008](https://arxiv.org/html/2606.28076#bib.bib32)\), while MetaQA is constructed on Wiki\-Movie \(Miller et al\., [2016](https://arxiv.org/html/2606.28076#bib.bib16)\)\. As shown in [Table 1](https://arxiv.org/html/2606.28076#S4.T1), they vary in dataset size, reasoning depth, and question complexity, providing a comprehensive evaluation setting for multi\-hop KGQA\.

For the Freebase setting, we evaluate OPI on WebQSP and CWQ, two standard benchmarks based on the preprocessed Freebase subgraphs widely used in prior work \(He et al\., [2021](https://arxiv.org/html/2606.28076#bib.bib2); Luo et al\., [2024](https://arxiv.org/html/2606.28076#bib.bib11)\)\. OPI constructs the relation\-centric ontology graph from the full Freebase dump for more complete type\-level constraints, while restricting retrieval and evaluation to the benchmark subgraphs for fair comparison\. WebQSP contains 4,737 questions and mainly involves relatively simple reasoning, with answers typically located within two hops of the topic entity\. CWQ is more challenging, containing 34,699 questions with compositional structures and additional constraints, often requiring up to four hops of reasoning\. We follow the standard data splits for both datasets \(Sun et al\., [2018](https://arxiv.org/html/2606.28076#bib.bib1)\)\.

For the Wiki\-Movie setting, we adopt MetaQA, a movie\-domain KGQA benchmark built on the Wiki\-Movie knowledge graph, which contains 43,234 entities, 9 relations, and 133,582 triples\. MetaQA includes more than 400K questions and is divided into three subsets by reasoning depth: MetaQA\-1hop, MetaQA\-2hop, and MetaQA\-3hop\. Following prior work \(He et al\., [2021](https://arxiv.org/html/2606.28076#bib.bib2)\), we evaluate on all three subsets and construct the one\-shot setting by randomly sampling one training instance for each question template, resulting in 161, 210, and 150 training examples for the three subsets, respectively\.

Table 1\. Statistics of the experiment datasets\.

Evaluation Metrics\. Following prior work \(Sun et al\., [2018](https://arxiv.org/html/2606.28076#bib.bib1); Luo et al\., [2024](https://arxiv.org/html/2606.28076#bib.bib11); Sun et al\., [2024](https://arxiv.org/html/2606.28076#bib.bib10)\), we use Hit@1 and F1 as the primary evaluation metrics\. Hit@1 measures whether the top\-ranked prediction matches any gold answer, reflecting top\-1 answer accuracy\. F1 evaluates the predicted answer set by jointly considering precision and recall, which is important for questions with multiple correct answers\. In the ablation studies, we additionally report precision and recall for a more fine\-grained analysis of prediction quality\.
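
The two metrics can be computed as in the following sketch\. It assumes exact string matching between predictions and gold answers and macro\-averaging of per\-question scores; the paper does not spell out either detail, so both are assumptions, as are the function names\.

```python
def hit_at_1(ranked_predictions, gold_answers):
    """1.0 if the top-ranked prediction is a gold answer, else 0.0."""
    gold = set(gold_answers)
    return 1.0 if ranked_predictions and ranked_predictions[0] in gold else 0.0


def precision_recall_f1(predictions, gold_answers):
    """Set-based precision, recall, and F1 for a single question."""
    pred, gold = set(predictions), set(gold_answers)
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0, 0.0, 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)


def evaluate(examples):
    """examples: list of (ranked_predictions, gold_answers) pairs.

    Assumption: per-question scores are averaged and reported in percent.
    """
    n = len(examples)
    hit = sum(hit_at_1(p, g) for p, g in examples) / n
    f1 = sum(precision_recall_f1(p, g)[2] for p, g in examples) / n
    return {"Hit@1": 100 * hit, "F1": 100 * f1}
```

Under this reading, a prediction set that contains all gold answers plus extra candidates keeps a perfect Hit@1 as long as a gold answer is ranked first, but loses F1 through lower precision\.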

Comparison Baselines\. We compare OPI with representative KGQA methods from four categories: embedding\-based methods, retrieval\-based methods, standalone LLMs, and KG\-enhanced LLM methods\. The first two categories cover conventional KG reasoning models that learn KG representations or retrieve question\-relevant subgraphs, while standalone LLMs evaluate reasoning with parametric knowledge alone\. We further include recent KG\-enhanced LLM methods, such as ToG, RoG, ORT, and GCR, which are the most direct competitors to OPI\. For MetaQA, we additionally compare with representative methods reported in prior work to evaluate retrieval\-only multi\-hop structural reasoning\.

Implementation Details\. OPI uses LLaMA2\-Chat\-7B as the fine\-tuned backbone LLM for answer\-type prediction\. The model is instruction\-tuned for three epochs on four A100\-40G GPUs with batch size 4, learning rate 2e\-5, cosine learning rate scheduling, and a warmup ratio of 0\.03\. The joint training time is about 4\.7 hours for WebQSP and CWQ, 44\.0 hours for the MetaQA 1–3 hop data, and 0\.42 hours for the MetaQA\-oneshot setting\. For ontology\-guided bidirectional retrieval, we set the maximum reasoning depth according to dataset hop statistics, using two hops for WebQSP and four hops for CWQ\. Retrieved evidence paths are ranked by Sentence\-BERT similarity between the question and the path text, with all\-mpnet\-base\-v2 as the ranking model, and at most 256 evidence paths are retained for each question\. For answer refinement, we use prompt\-based LLMs with temperature 0\.2, maximum output length 128, and one sampled response per question\. Since different prompt\-based models may exhibit different refinement behaviors, the number of refinement rounds is selected on the validation set: three rounds for DeepSeek\-v3 and one round for GPT\-4o\. In all cases, OPI allows at most three refinement rounds and stops early when the refiner reaches high confidence or when two consecutive rounds produce stable answers\.

### 4\.2\. Overall Evaluation \(RQ1\)

[Table 2](https://arxiv.org/html/2606.28076#S4.T2) reports the main results on WebQSP and CWQ\. We compare OPI with four groups of representative methods, including embedding\-based methods, retrieval\-based methods, standalone LLMs, and recent KG\-enhanced LLM methods\. Overall, OPI achieves the best metric\-wise performance on both datasets\. Taking the best score among the two OPI variants for each metric, OPI achieves 92\.3 Hit@1 and 76\.8 F1 on WebQSP\. Compared with the strongest previous results, it improves Hit@1 by 4\.6 points and F1 by 5\.0 points over ORT\. On the more complex CWQ dataset, OPI achieves 76\.5 Hit@1 and 62\.7 F1, outperforming the metric\-wise strongest prior results by 8\.9 points in Hit@1 and 3\.3 points in F1\. These gains show that OPI is effective not only on WebQSP questions, which are generally less compositional, but also on complex questions in CWQ that require more careful multi\-hop evidence retrieval\.

Table 2\. Comparison of methods on WebQSP and CWQ\. The backbone model used by each method is shown in parentheses\.

Beyond the overall performance, we further analyze the behavior of different method categories\. Traditional embedding\-based and retrieval\-based methods can exploit KG structure through learned representations or retrieved subgraphs, but they still struggle on compositionally complex questions, as reflected by their consistently lower scores on CWQ\. Standalone LLMs achieve reasonable results without explicit graph retrieval, but they generally lag behind KG\-enhanced LLM methods, indicating that parametric knowledge alone is insufficient for reliable multi\-hop KGQA\. Recent KGs\+LLMs methods further improve performance by combining LLM reasoning with graph evidence, yet many of them still retrieve evidence mainly from the topic entity side\. As the hop number increases, such topic\-centered expansion can introduce many type\-mixed paths and semantically ambiguous candidates\. In contrast, OPI incorporates type\-level answer\-side constraints into bidirectional retrieval and further refines candidate answers through a generator\-refiner loop, improving both evidence quality and answer selection\.

We also observe different strengths between the two OPI variants\. OPI with DeepSeek\-v3 achieves the highest Hit@1 on both WebQSP and CWQ, suggesting stronger top\-answer selection\. OPI with GPT\-4o obtains the best F1 on both datasets, suggesting better coverage and calibration when multiple candidate answers are involved\. Despite this difference, both variants consistently outperform prior KG\-enhanced LLM methods, suggesting that the improvement is not tied to a specific backbone model, but is largely attributable to OPI’s ontology\-guided bidirectional retrieval and iterative refinement framework\.

### 4\.3\. Effectiveness of Ontology\-Guided Bidirectional Retrieval \(RQ2\)

To further examine whether ontology\-guided bidirectional retrieval can provide effective graph evidence by itself, we evaluate OPI under a pure retrieval\-only setting\. In this setting, candidate answers are directly extracted from the endpoints of retrieved evidence paths, without any subsequent LLM\-based answer generation or iterative refinement\.

Table 3\. Retrieval\-only comparison with reproduced RoG and GCR on WebQSP and CWQ\.

Results on WebQSP and CWQ\. [Table 3](https://arxiv.org/html/2606.28076#S4.T3) compares OPI with reproduced RoG and GCR variants under the same retrieval\-only setting\. OPI\-BR achieves the highest Hit@1 on both datasets, reaching 95\.39 on WebQSP and 88\.95 on CWQ\. The improvement is particularly clear on CWQ, where compositional questions make topic\-centered expansion more likely to introduce noisy and type\-mixed paths\. These results indicate that answer\-side ontology constraints can effectively restrict retrieval to structurally more plausible endpoints under complex multi\-hop reasoning, thereby providing useful graph\-grounded evidence for subsequent answer generation and refinement\.

However, the F1 results show a different trend\. On WebQSP, OPI\-BR obtains an F1 of 39\.09, lower than RoG\-BR and GCR\-BR\. On CWQ, OPI\-BR improves over RoG\-BR in F1, but still lags behind GCR\-BR\. This is because RoG and GCR already incorporate question\-level semantics during path construction: RoG prompts the LLM to generate relation paths, while GCR further constrains this process with graph structure\. In contrast, OPI\-BR in this experiment mainly applies answer\-side ontology constraints to retrieve answer\-type\-compatible endpoints, but does not yet perform full question\-aware path verification or answer\-set refinement\. As a result, its raw retrieved endpoints may still contain type\-compatible but question\-irrelevant entities, which limits F1\. Overall, this comparison shows that ontology\-guided bidirectional retrieval is particularly effective for top\-answer reachability and search\-space reduction, but answer\-set completeness still benefits from the subsequent refinement stages\.

Table 4\. Comparison of methods on MetaQA\. Results of prior methods are reported as in their original papers\. Our results are reported with two decimal places for better distinction near saturation\.

Results on MetaQA\. We further evaluate the bidirectional retrieval module of OPI on MetaQA, where questions are more template\-like and relation paths are relatively regular\. Compared with WebQSP and CWQ, MetaQA relies more on structural path matching and less on complex semantic disambiguation, making it suitable for evaluating pure retrieval\-only reasoning\. Therefore, we report bidirectional retrieval results without additional LLM\-based answer generation, because the retrieved endpoints already provide a direct test of whether the structural relation path can reach the correct answer\. Here, BR denotes bidirectional retrieval and does not include any subsequent LLM\-based answer generation\.

[Table 4](https://arxiv.org/html/2606.28076#S4.T4) reports the results on the full MetaQA benchmark\. OPI\-BR achieves near\-saturated Hit@1 across all three subsets, with 100\.00 on 1\-hop, 99\.99 on 2\-hop, and 99\.96 on 3\-hop questions\. OPI\-BR\-oneshot obtains very close performance, especially on 2\-hop and 3\-hop questions\. These results show that when evidence paths are regular and answer\-side constraints are reliable, ontology\-guided bidirectional retrieval alone can already provide sufficient evidence for accurate answer prediction\.

### 4\.4\. Ablation Study \(RQ3\)

We conduct ablation studies to assess three key designs in OPI: type\-level search space, early answer\-side constraint, and iterative answer refinement\. All variants use the same Llama\-2\-7B \+ DeepSeek\-v3 setting\. [Table 5](https://arxiv.org/html/2606.28076#S4.T5) reports the results on WebQSP and CWQ in terms of Hit@1, F1, precision, and recall\.

Table 5\. Ablation study results on WebQSP and CWQ\. All variants use the same Llama\-2\-7B \+ DeepSeek\-v3 setting\.

Effect of type\-level search space\. This variant removes the type\-level path space and directly traverses the original KG from the topic entity\. When the number of outgoing edges is large, it keeps the top\-$k$ paths according to the SentenceBERT similarity between each path and the question\. This change leads to the largest performance drop: Hit@1/F1 decrease by 10\.19/13\.95 points on WebQSP and by 22\.37/18\.54 points on CWQ\. The degradation suggests that direct topic\-centered expansion is easily affected by large branching factors\. Although similarity\-based pruning reduces the search cost, it may discard correct paths before they reach the answer side\. By contrast, OPI uses type\-level relation signatures to form a more compact search space, which helps preserve answer\-relevant paths under multi\-hop expansion\.

Effect of in\-retrieval answer\-side constraint\. This variant removes answer\-side constraints from the retrieval process\. Instead, it first performs topic\-side forward expansion and then applies answer\-type\-compatible final\-hop relations as a post\-retrieval filter\. F1 decreases from 74\.91 to 66\.75 on WebQSP and from 59\.59 to 45\.61 on CWQ, showing that answer\-side constraints are more effective when they participate in retrieval rather than only filtering completed paths\. Since relevant paths may already be pruned during topic\-centered forward expansion, post\-retrieval filtering cannot recover them\. Nevertheless, this variant still outperforms the w/o type\-level search space variant, because the delayed final\-hop filter can still remove some answer\-type\-incompatible endpoints\.

Effect of iterative answer refinement\. This variant directly uses the single\-pass output of the same prompt\-based LLM as the final answer, without applying generator\-refiner iterations\. It achieves higher Hit@1 and recall on both datasets, but lower precision and F1\. For example, compared with full OPI on CWQ, recall increases from 71\.23 to 78\.06, while precision decreases from 58\.96 to 52\.32 and F1 decreases from 59\.59 to 56\.42\. This indicates that single\-pass generation tends to retain a broader answer set, but also introduces more false positives\. Iterative refinement therefore does not simply maximize coverage; instead, it trades part of the aggressive candidate retention for a cleaner answer set, suggesting that refinement acts more as a precision\-oriented filtering mechanism than as a recall\-maximizing step\.

### 4\.5\. Analysis of Ontology Graph \(RQ4\)

Table 6\. Statistics of ontology graph construction for Freebase and Wiki\-Movie\.

[Table 6](https://arxiv.org/html/2606.28076#S4.T6) reports the construction statistics of the ontology graphs for Freebase and Wiki\-Movie\. The results show that the proposed construction procedure is feasible under different schema settings\. For Freebase, which provides explicit schema predicates, the pipeline processes 71,210 schema entries from the RDF dump and completes in approximately 6,958 seconds \($\approx 1.93$ hours\) on a single machine\. For Wiki\-Movie, where explicit schema predicates are unavailable, the ontology graph is induced directly from 134,741 triples and constructed in approximately 1\.17 seconds\. This contrast mainly reflects the much larger scale and richer schema structure of Freebase, while also showing that the same ontology abstraction can be instantiated efficiently for compact domain\-specific KGs\.

The constructed ontology graphs cover most relations used in downstream KGQA benchmarks\. For WebQSP, only 19 out of 5,726 relations have missing or incomplete signatures, accounting for 0\.33%\. For CWQ, this number is 21 out of 6,576, accounting for 0\.32%\. Moreover, 19 missing or incomplete signatures are shared by WebQSP and CWQ, corresponding to only 0\.28% of the 6,837 unique relations across both datasets\. These results show that the Freebase\-derived ontology graph covers nearly all benchmark relations\. For MetaQA, all 9 relations have valid head–tail type signatures under the induced Wiki\-Movie ontology\.
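
The coverage percentages above follow directly from the reported counts, as the short check below illustrates\. The helper name `missing_rate` is hypothetical, and the shared\-relation count is derived by inclusion\-exclusion on the assumption that 6,837 is the size of the union of the two relation sets\.

```python
def missing_rate(missing, total):
    """Share of relations with missing or incomplete signatures, in percent."""
    return round(100.0 * missing / total, 2)


webqsp = missing_rate(19, 5726)   # WebQSP relations
cwq = missing_rate(21, 6576)      # CWQ relations
union = missing_rate(19, 6837)    # shared gaps over unique relations

# Inclusion-exclusion, assuming 6,837 is the union of both relation sets.
shared_relations = 5726 + 6576 - 6837

print(webqsp, cwq, union, shared_relations)
```

The rounded rates agree with the 0\.33%, 0\.32%, and 0\.28% figures quoted in the text\.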

The resulting ontology graphs are also compact relative to their underlying knowledge bases\. The Freebase ontology graph contains 32,195 relation signatures and 12,369 entity types, providing broad schema coverage without direct expansion over dense entity\-level facts\. The Wiki\-Movie ontology graph is much smaller, with only 9 relation signatures and 11 entity types, yet still preserves the type\-level constraints required by MetaQA\. These statistics suggest that the ontology graph serves as a lightweight abstraction: it retains the type semantics needed for retrieval while substantially simplifying the structure used during multi\-hop reasoning\.

Overall, these results show that ontology graph construction is scalable and effective across heterogeneous KG settings\. Explicit schema predicates provide reliable relation signatures for schema\-rich KGs, while schema\-light KGs can be handled through data\-driven induction\. Thus, the ontology graph offers a lightweight and practical type\-level interface for retrieval in multi\-hop KGQA\.

### 4\.6\. Effect of Different LLM Backbones \(RQ5\)

Table 7\. Performance of different LLM backbone variants on WebQSP and CWQ\.

To examine whether OPI depends on a specific LLM backbone, we evaluate different combinations of fine\-tuned models and prompt\-based models\. The fine\-tuned models are task\-adapted LLMs used for answer\-type prediction, including Qwen2\-1\.5B, Qwen2\-7B, and Llama\-2\-7B\. The prompt\-based models are instruction\-following LLMs used in the answer refinement stage, including DeepSeek\-v3 and GPT\-4o\. [Table 7](https://arxiv.org/html/2606.28076#S4.T7) reports the results on WebQSP and CWQ\.

Overall, OPI performs consistently well across different model combinations\. On WebQSP, all variants achieve over 90 Hit@1 and over 73 F1\. On CWQ, they also maintain strong performance, with Hit@1 ranging from 70\.57 to 76\.52 and F1 ranging from 58\.71 to 62\.73\. These results indicate that OPI is not tied to a single LLM choice\. Instead, its effectiveness largely stems from the ontology\-guided bidirectional retrieval and iterative answer refinement framework, which can be adapted to different fine\-tuned and prompt\-based models\.

We further observe complementary behaviors between prompt\-based models\. Variants using DeepSeek\-v3 generally achieve higher Hit@1, especially on CWQ, suggesting stronger top\-answer selection\. In contrast, variants using GPT\-4o obtain higher F1 on both datasets, indicating better answer\-set calibration when multiple candidate answers are involved\. Among the fine\-tuned models, Llama\-2\-7B achieves the best overall results under both prompt\-based models\. Nevertheless, Qwen2\-1\.5B and Qwen2\-7B remain competitive, showing that OPI can also operate effectively with smaller models for answer\-type prediction\.

### 4\.7\. Efficiency and Robustness Analysis \(RQ6\)

![Refer to caption](https://arxiv.org/html/2606.28076v1/x4.png)

Figure 4\. Efficiency and robustness analysis of OPI\. \(a\)\-\(b\) Search\-space and retrieval\-cost comparison between forward\-only and bidirectional retrieval\. \(c\)\-\(d\) Evidence quality comparison before answer generation\. \(e\)\-\(f\) Quality\-efficiency tradeoff of adaptive refinement\. \(g\)\-\(h\) F1 gains across question categories\.

Search\-space and retrieval\-cost reduction\. We compare OPI’s ontology\-guided bidirectional retrieval with a forward\-only baseline by measuring the average number of candidate paths, candidate answers, and retrieval time\. As shown in Figures [4](https://arxiv.org/html/2606.28076#S4.F4)\(a\) and [4](https://arxiv.org/html/2606.28076#S4.F4)\(b\), OPI reduces candidate paths by 98\.7% on WebQSP and over 99% on CWQ, candidate answers by 98\.9% and 99\.97%, and retrieval time by 95\.1% and 95\.3%, respectively\. These reductions show that answer\-side type constraints effectively prevent uncontrolled topic\-centered expansion\. By reserving the final hop for ontology\-guided matching, OPI avoids exploring many type\-incompatible evidence paths and therefore lowers retrieval cost\.

Evidence cleanliness and answer coverage\. We evaluate the retrieved evidence before answer generation using precision, recall, gold\-hit rate, and top\-1 hit rate\. Figures [4](https://arxiv.org/html/2606.28076#S4.F4)\(c\) and [4](https://arxiv.org/html/2606.28076#S4.F4)\(d\) show that OPI improves precision by 30\.85% on WebQSP and 23\.88% on CWQ, and improves top\-1 hit rate by 54\.24% and 31\.12%, respectively\. Although recall and gold\-hit rate slightly decrease, they remain at 88\.73%/93\.98% on WebQSP and 87\.19%/88\.98% on CWQ\. This indicates that OPI filters noisy evidence while preserving most answer\-relevant candidates\.

Quality\-efficiency tradeoff of adaptive refinement\. We compare fixed\-round refinement with adaptive refinement, where the refinement process stops based on answer stability and refiner confidence\. As shown in Figures [4](https://arxiv.org/html/2606.28076#S4.F4)\(e\) and [4](https://arxiv.org/html/2606.28076#S4.F4)\(f\), adaptive refinement achieves 74\.91 F1 with only 1\.31 rounds on WebQSP, slightly outperforming fixed three\-round refinement\. On CWQ, it achieves 59\.59 F1 with 1\.61 rounds, closely matching the fixed three\-round result\. This reduces the average number of refinement rounds by 56\.4% and 46\.5%, respectively, showing that adaptive refinement avoids unnecessary iterations while maintaining answer quality\.

Benefits on semantically challenging questions\. We compare the initial answer and the final refined answer across temporal, attribute/numerical, constrained, and simple questions\. Figures [4](https://arxiv.org/html/2606.28076#S4.F4)\(g\) and [4](https://arxiv.org/html/2606.28076#S4.F4)\(h\) show consistent F1 gains across all categories\. On WebQSP, the largest gain appears on simple questions, with a 5\.01\-point improvement, followed by constrained questions with a 3\.91\-point gain\. On CWQ, temporal questions benefit the most, with an 8\.23\-point improvement, while constrained and simple questions improve by 3\.37 and 3\.13 points, respectively\. These results suggest that iterative answer refinement helps OPI mitigate semantic ambiguity, complementing bidirectional retrieval, which primarily addresses structural path explosion\.

## 5\. Related Work

### 5\.1\. Multi\-hop Knowledge Graph Question Answering

Existing multi\-hop KGQA methods can be broadly grouped into embedding\-based methods, retrieval\-based methods, standalone LLM methods, and KG\-enhanced LLM methods\. Early embedding\-based methods, such as KV\-Mem \(Miller et al\., [2016](https://arxiv.org/html/2606.28076#bib.bib16)\), NSM \(He et al\., [2021](https://arxiv.org/html/2606.28076#bib.bib2)\), TransferNet \(Shi et al\., [2021](https://arxiv.org/html/2606.28076#bib.bib17)\), and KGT5 \(Saxena et al\., [2022](https://arxiv.org/html/2606.28076#bib.bib18)\), learn graph\-aware representations by encoding questions and entities in continuous spaces or propagating question\-aware signals over graph structures\. Although effective, their reasoning process is often implicit and may not preserve explicit multi\-hop path semantics\. Retrieval\-based methods instead construct question\-relevant subgraphs or evidence paths before answer prediction, as in GraftNet \(Sun et al\., [2018](https://arxiv.org/html/2606.28076#bib.bib1)\), SR\+NSM \(Zhang et al\., [2022](https://arxiv.org/html/2606.28076#bib.bib3)\), and UniKGQA \(Jiang et al\., [2023c](https://arxiv.org/html/2606.28076#bib.bib4)\)\. While these methods improve interpretability by exposing supporting evidence, they typically rely on topic\-centered expansion and may retrieve many structurally reachable but semantically irrelevant paths as reasoning depth increases\.

With the emergence of large language models, standalone LLMs have been applied to KGQA through prompting, in\-context learning, or chain\-of\-thought reasoning\. Models such as Llama\-2, Llama\-3\.1, ChatGPT, GPT\-4o, and DeepSeek\-v3 show strong language understanding and reasoning abilities \(Touvron et al\., [2023](https://arxiv.org/html/2606.28076#bib.bib5); Meta, [2024](https://arxiv.org/html/2606.28076#bib.bib6); OpenAI, [2022](https://arxiv.org/html/2606.28076#bib.bib7), [2024](https://arxiv.org/html/2606.28076#bib.bib8); DeepSeek\-AI et al\., [2024](https://arxiv.org/html/2606.28076#bib.bib9)\), but they may hallucinate unsupported facts and are limited by the absence of explicit graph\-grounded evidence\. Recent KG\-enhanced LLM methods address this limitation by incorporating retrieved triples, evidence paths, graph structures, or symbolic tools into LLM\-based reasoning\. For example, ToG explores reasoning chains over KGs with LLM guidance \(Sun et al\., [2024](https://arxiv.org/html/2606.28076#bib.bib10)\), RoG generates relation paths for graph\-grounded reasoning \(Luo et al\., [2024](https://arxiv.org/html/2606.28076#bib.bib11)\), SymAgent combines symbolic reasoning with agentic LLM behaviors \(Liu et al\., [2025a](https://arxiv.org/html/2606.28076#bib.bib12)\), GNN\-RAG integrates graph neural retrieval with LLM generation \(Mavromatis and Karypis, [2025](https://arxiv.org/html/2606.28076#bib.bib13)\), and GCR further improves KG\-enhanced reasoning through graph\-context reasoning \(Luo et al\., [2025](https://arxiv.org/html/2606.28076#bib.bib15)\)\. Recent work such as $R^{2}$ also emphasizes reliable reasoning with LLMs over KGs \(Yuan et al\., [2026](https://arxiv.org/html/2606.28076#bib.bib28)\)\. These methods substantially improve graph\-grounded answer generation, but their retrieval stages are still largely driven by topic\-side exploration or post\-retrieval evidence selection\. In contrast, OPI introduces answer\-side type constraints into the retrieval process itself, using a relation\-centric ontology graph to guide final\-hop matching and reduce noisy path explosion before LLM\-based answer refinement\.

### 5\.2\.Ontology and Type\-level Guidance

Ontology and schema information provide high\-level semantic structures for organizing entities, relations, types, and constraints in knowledge graphs\. Compared with entity\-level triples, type\-level abstractions capture more stable semantic regularities, such as the expected head and tail types of relations\. Recent studies have explored LLMs for ontology construction, enhancement, and learning \(Funk et al\., [2023](https://arxiv.org/html/2606.28076#bib.bib51); Toro et al\., [2024](https://arxiv.org/html/2606.28076#bib.bib52); Babaei Giglou et al\., [2023](https://arxiv.org/html/2606.28076#bib.bib54)\), as well as ontology construction from large RDF resources or competency\-question\-driven workflows \(Chen and Zhao, [2018](https://arxiv.org/html/2606.28076#bib.bib55); Kommineni et al\., [2024](https://arxiv.org/html/2606.28076#bib.bib56)\)\. These studies show that ontology construction and enrichment can provide useful foundations for knowledge\-intensive tasks\.

In KGQA, ontology information has been used to improve the semantic alignment between natural\-language questions and graph structures\. For example, ontology\-guided prompting and reverse\-thinking strategies have been introduced to strengthen multi\-hop reasoning and generalization \(Jiang et al\., [2025b](https://arxiv.org/html/2606.28076#bib.bib48); Liu et al\., [2025b](https://arxiv.org/html/2606.28076#bib.bib14)\), while OntoTune aligns LLMs with domain ontologies through ontology\-driven self\-training \(Liu et al\., [2025c](https://arxiv.org/html/2606.28076#bib.bib49)\)\. However, most existing methods use ontology information only as auxiliary signals, providing limited graph\-grounded constraints in large heterogeneous retrieval spaces\. Industrial systems such as Palantir further show that ontology can serve as a structured interface connecting data, reasoning, and downstream actions \(Palantir, [2024](https://arxiv.org/html/2606.28076#bib.bib57)\)\. OPI follows this motivation but specializes it for multi\-hop KGQA: it constructs a relation\-centric ontology graph from the KG itself and uses type\-level signatures to map the predicted answer type to compatible final\-hop relations, thereby integrating type\-level and entity\-level path search\.
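The idea of type-level relation signatures can be illustrated with a small toy sketch. This is not the authors' implementation: the triples, the entity types, and the function names `build_signatures`, `final_hop_relations`, and `retrieve_paths` are all hypothetical, and the predicted answer type is assumed to be given rather than inferred from the question. The sketch only shows how head-tail type pairs collected from a KG could restrict which relations are accepted as the final hop of a path.

```python
from collections import defaultdict

# Toy typed KG (hypothetical data, for illustration only).
TRIPLES = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "released_in", "2010"),
    ("Christopher Nolan", "born_in", "London"),
    ("Christopher Nolan", "spouse", "Emma Thomas"),
]
ENTITY_TYPE = {
    "Inception": "Film",
    "Christopher Nolan": "Person",
    "Emma Thomas": "Person",
    "London": "City",
    "2010": "Year",
}


def build_signatures(triples, entity_type):
    """Collect the (head type, tail type) pairs observed for each relation."""
    signatures = defaultdict(set)
    for head, rel, tail in triples:
        signatures[rel].add((entity_type[head], entity_type[tail]))
    return signatures


def final_hop_relations(signatures, answer_type):
    """Return relations whose tail type matches the predicted answer type."""
    return {
        rel
        for rel, pairs in signatures.items()
        if any(tail_type == answer_type for _, tail_type in pairs)
    }


def retrieve_paths(triples, topic, allowed_last, max_hops=2):
    """Expand from the topic entity and keep only paths that end with an
    allowed final-hop relation."""
    outgoing = defaultdict(list)
    for head, rel, tail in triples:
        outgoing[head].append((rel, tail))

    results = []
    frontier = [(topic, [])]
    for _ in range(max_hops):
        next_frontier = []
        for node, rels in frontier:
            for rel, tail in outgoing[node]:
                path = rels + [rel]
                if rel in allowed_last:  # answer-side final-hop check
                    results.append((path, tail))
                next_frontier.append((tail, path))
        frontier = next_frontier
    return results


signatures = build_signatures(TRIPLES, ENTITY_TYPE)
allowed = final_hop_relations(signatures, "City")  # answer type assumed given
print(retrieve_paths(TRIPLES, "Inception", allowed))
# [(['directed_by', 'born_in'], 'London')]
```

In this toy graph, plain two-hop expansion from `Inception` reaches four paths, while the final-hop check retains only the one ending in a `City`. A fuller version would presumably also prune the prefix expansion itself and check that the prefix end type agrees with the head type of the final-hop relation.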

## 6\.Conclusion

In this paper, we proposed OPI, an ontology\-guided evidence path inference framework for multi\-hop KGQA\. OPI introduces a relation\-centric ontology graph to make answer\-side type constraints explicit, and combines topic\-side prefix expansion with answer\-side final\-hop matching to reduce noisy path explosion\. It further employs a generator\-refiner loop to jointly reassess retrieved evidence and answer hypotheses, filtering type\-compatible but question\-irrelevant candidates\. Experiments on WebQSP, CWQ, and MetaQA show that OPI consistently outperforms representative multi\-hop KGQA methods\. These results demonstrate that OPI provides a compact and reusable relation\-centric ontology graph for retrieval, while effectively alleviating path explosion and semantic misalignment in multi\-hop KGQA\.

In future work, we plan to extend OPI to knowledge graphs with weaker or less reliable type information\. A key direction is to construct and update relation signatures when entity types are missing, noisy, or only partially available, so that ontology\-guided retrieval remains effective beyond KGs with well\-defined schemas\. Such an extension would broaden the applicability of OPI to open\-domain and dynamically evolving knowledge graphs\.

## References

- D\. Agarwal, R\. Das, S\. Khosla, and R\. Gangadharaiah \(2024\)Bring your own kg: self\-supervised program synthesis for zero\-shot kgqa\.InFindings of the association for computational linguistics: NAACL 2024,Mexico City, Mexico,pp\. 896–919\.Cited by:[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.13.10.1)\.
- H\. Babaei Giglou, J\. D’Souza, and S\. Auer \(2023\)LLMs4OL: large language models for ontology learning\.InInternational Semantic Web Conference,pp\. 408–427\.Cited by:[§5\.2](https://arxiv.org/html/2606.28076#S5.SS2.p1.1)\.
- K\. Bollacker, C\. Evans, P\. Paritosh, T\. Sturge, and J\. Taylor \(2008\)Freebase: a collaboratively created graph database for structuring human knowledge\.InProceedings of the 2008 ACM SIGMOD international conference on Management of data,Vancouver, BC, Canada,pp\. 1247–1250\.Cited by:[§4\.1](https://arxiv.org/html/2606.28076#S4.SS1.p1.1)\.
- D\. Chen and H\. Zhao \(2018\)Research on the method of extracting domain knowledge from the freebase RDF dumps\.IEEE Access6,pp\. 50306–50322\.Cited by:[§5\.2](https://arxiv.org/html/2606.28076#S5.SS2.p1.1)\.
- DeepSeek\-AI, A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan,et al\.\(2024\)DeepSeek\-v3 technical report\.Technical Report arXiv:2412\.19437,DeepSeek\-AI\.External Links:[Link](https://arxiv.org/abs/2412.19437)Cited by:[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.16.15.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p2.1)\.
- L\. Ding, N\. Ding, Q\. Tao, and P\. Shi \(2025\)Enhancing graph multi\-hop reasoning for question answering with llms: an approach based on adaptive path generation\.Journal of Intelligent Information Systems63\(5\),pp\. 1455–1485\.Cited by:[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.17.14.1)\.
- X\. L\. Dong \(2023\)Generations of knowledge graphs: the crazy ideas and the business impact\.Proc\. VLDB Endow\.16\(12\),pp\. 4130–4137\.External Links:ISSN 2150\-8097,[Link](https://doi.org/10.14778/3611540.3611636),[Document](https://dx.doi.org/10.14778/3611540.3611636)Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1)\.
- G\. Frisoni, A\. Cocchieri, A\. Presepi, G\. Moro, and Z\. Meng \(2024\)To generate or to retrieve? on the effectiveness of artificial contexts for medical open\-domain question answering\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 9878–9919\.External Links:[Link](https://aclanthology.org/2024.acl-long.533/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.533)Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1)\.
- M\. Funk, S\. Hosemann, J\. C\. Jung, and C\. Lutz \(2023\)Towards ontology construction with language models\.InJoint proceedings of the 1st workshop on Knowledge Base Construction from Pre\-Trained Language Models \(KBC\-LM\) and the 2nd challenge on Language Models for Knowledge Base Construction \(LM\-KBC\) co\-located with the 22nd International Semantic Web Conference \(ISWC 2023\),Vol\.3577\.Cited by:[§5\.2](https://arxiv.org/html/2606.28076#S5.SS2.p1.1)\.
- C\. Ge, X\. Liu, L\. Chen, Y\. Gao, and B\. Zheng \(2021\)LargeEA: aligning entities for large\-scale knowledge graphs\.Proceedings of the VLDB Endowment15\(2\),pp\. 237–245\.External Links:ISSN 2150\-8097,[Link](https://doi.org/10.14778/3489496.3489504),[Document](https://dx.doi.org/10.14778/3489496.3489504)Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1)\.
- G\. He, Y\. Lan, J\. Jiang, W\. X\. Zhao, and J\. Wen \(2021\)Improving multi\-hop knowledge base question answering by learning intermediate supervision signals\.InProceedings of the 14th ACM international conference on web search and data mining,Online,pp\. 553–561\.Cited by:[§4\.1](https://arxiv.org/html/2606.28076#S4.SS1.p2.1),[§4\.1](https://arxiv.org/html/2606.28076#S4.SS1.p3.1),[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.5.4.1),[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.5.2.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p1.1)\.
- Y\. Ji, K\. Wu, J\. Li, W\. Chen, M\. Zhong, X\. Jia, and M\. Zhang \(2024\)Retrieval and reasoning on kgs: integrate knowledge graphs into large language models for complex question answering\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Miami, Florida, USA,pp\. 7598–7610\.Cited by:[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.11.8.1)\.
- J\. Jiang, K\. Zhou, Z\. Dong, K\. Ye, W\. X\. Zhao, and J\. Wen \(2023a\)Structgpt: a general framework for large language model to reason over structured data\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 9237–9251\.Cited by:[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.8.5.1)\.
- J\. Jiang, K\. Zhou, W\. X\. Zhao, Y\. Li, and J\. Wen \(2023b\)Reasoninglm: enabling structural subgraph reasoning in pre\-trained language models for question answering over knowledge graph\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 3721–3735\.Cited by:[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.9.6.1)\.
- J\. Jiang, K\. Zhou, W\. X\. Zhao, Y\. Song, C\. Zhu, H\. Zhu, and J\. Wen \(2025a\)KG\-agent: an efficient autonomous agent framework for complex reasoning over knowledge graph\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Vienna, Austria,pp\. 9505–9523\.Cited by:[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.16.13.1)\.
- J\. Jiang, K\. Zhou, X\. Zhao, and J\. Wen \(2023c\)UniKGQA: unified retrieval and reasoning for solving multi\-hop question answering over knowledge graph\.InThe Eleventh International Conference on Learning Representations,Kigali, Rwanda\.External Links:[Link](https://openreview.net/forum?id=Z63RvyAZ2Vh)Cited by:[§2\.2](https://arxiv.org/html/2606.28076#S2.SS2.p1.3),[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.11.10.1),[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.7.4.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p1.1)\.
- L\. Jiang, J\. Huang, C\. Moller, and R\. Usbeck \(2025b\)Ontology\-guided, hybrid prompt learning for generalization in knowledge graph question answering\.In2025 19th International Conference on Semantic Computing \(ICSC\),pp\. 28–35\.Cited by:[§5\.2](https://arxiv.org/html/2606.28076#S5.SS2.p2.1)\.
- J\. Kim, Y\. Kwon, Y\. Jo, and E\. Choi \(2023\)KG\-gpt: a general framework for reasoning on knowledge graphs using large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2023,Singapore,pp\. 9410–9421\.Cited by:[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.10.7.1)\.
- V\. K\. Kommineni, B\. König\-Ries, and S\. Samuel \(2024\)From human experts to machines: an LLM supported approach to ontology and knowledge graph construction\.arXiv preprint arXiv:2403\.08345\.Cited by:[§5\.2](https://arxiv.org/html/2606.28076#S5.SS2.p1.1)\.
- J\. Lee, Y\. Wang, J\. Li, and M\. Zhang \(2024\)Multimodal reasoning with multimodal knowledge graph\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 10767–10782\.External Links:[Link](https://aclanthology.org/2024.acl-long.579/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.579)Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1)\.
- B\. Liu, J\. Zhang, F\. Lin, C\. Yang, M\. Peng, and W\. Yin \(2025a\)Symagent: a neural\-symbolic self\-learning agent framework for complex reasoning over knowledge graphs\.InProceedings of the ACM on Web Conference 2025,Sydney, Australia,pp\. 98–108\.Cited by:[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.19.18.1),[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.15.12.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p2.1)\.
- R\. Liu, L\. Luobei, J\. Li, B\. Wang, M\. Liu, D\. Wu, S\. Wang, and B\. Qin \(2025b\)Ontology\-guided reverse thinking makes large language models stronger on knowledge graph question answering\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 15269–15284\.External Links:[Link](https://aclanthology.org/2025.acl-long.741/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.741),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1),[§1](https://arxiv.org/html/2606.28076#S1.p2.1),[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.21.20.1),[§5\.2](https://arxiv.org/html/2606.28076#S5.SS2.p2.1)\.
- Z\. Liu, C\. Gan, J\. Wang, Y\. Zhang, Z\. Bo, M\. Sun, H\. Chen, and W\. Zhang \(2025c\)Ontotune: ontology\-driven self\-training for aligning large language models\.InProceedings of the ACM on Web Conference 2025,pp\. 119–133\.Cited by:[§5\.2](https://arxiv.org/html/2606.28076#S5.SS2.p2.1)\.
- L\. Luo, Y\. Li, G\. Haffari, and S\. Pan \(2024\)Reasoning on graphs: faithful and interpretable large language model reasoning\.InThe Twelfth International Conference on Learning Representations,Vienna, Austria,pp\. 1–24\.External Links:[Link](https://openreview.net/forum?id=ZGNWW7xZ6Q)Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1),[§1](https://arxiv.org/html/2606.28076#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.28076#S2.SS2.p1.3),[§3\.3\.1](https://arxiv.org/html/2606.28076#S3.SS3.SSS1.p2.13),[§4\.1](https://arxiv.org/html/2606.28076#S4.SS1.p2.1),[§4\.1](https://arxiv.org/html/2606.28076#S4.SS1.p4.1),[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.18.17.1),[Table 3](https://arxiv.org/html/2606.28076#S4.T3.1.3.1.1),[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.12.9.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p2.1)\.
- L\. Luo, Z\. Zhao, G\. Haffari, Y\. Li, C\. Gong, and S\. Pan \(2025\)Graph\-constrained reasoning: faithful reasoning on knowledge graphs with large language models\.InProceedings of the 42nd International Conference on Machine Learning,Vol\.267,Vancouver, Canada,pp\. 41540–41565\.External Links:[Link](https://proceedings.mlr.press/v267/luo25t.html)Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1),[§1](https://arxiv.org/html/2606.28076#S1.p2.1),[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.22.21.1),[Table 3](https://arxiv.org/html/2606.28076#S4.T3.1.4.2.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p2.1)\.
- C\. Mavromatis and G\. Karypis \(2025\)Gnn\-rag: graph neural retrieval for efficient large language model reasoning on knowledge graphs\.InFindings of the Association for Computational Linguistics: ACL 2025,Vienna, Austria,pp\. 16682–16699\.Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1),[§1](https://arxiv.org/html/2606.28076#S1.p2.1),[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.20.19.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p2.1)\.
- Meta \(2024\)Build the future of ai with meta llama 3\.External Links:[Link](https://llama.meta.com/llama3/)Cited by:[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.13.12.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p2.1)\.
- Microsoft \(2015\)FastRDFStore\-data\.External Links:[Link](https://download.microsoft.com/download/A/E/4/AE428B7A-9EF9-446C-85CF-D8ED0C9B1F26/FastRDFStore-data.zip)Cited by:[§3\.2\.1](https://arxiv.org/html/2606.28076#S3.SS2.SSS1.p1.1)\.
- A\. Miller, A\. Fisch, J\. Dodge, A\. Karimi, A\. Bordes, and J\. Weston \(2016\)Key\-value memory networks for directly reading documents\.InProceedings of the 2016 conference on empirical methods in natural language processing,Austin, Texas,pp\. 1400–1409\.Cited by:[§4\.1](https://arxiv.org/html/2606.28076#S4.SS1.p1.1),[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.4.3.2),[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.4.1.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p1.1)\.
- OpenAI \(2022\)Introducing ChatGPT\.External Links:[Link](https://openai.com/index/chatgpt/)Cited by:[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.14.13.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p2.1)\.
- OpenAI \(2024\)Hello gpt\-4o\.External Links:[Link](https://openai.com/index/hello-gpt-4o/)Cited by:[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.15.14.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p2.1)\.
- Palantir \(2024\)Connecting ai to decisions with the palantir ontology\.Note:[https://blog\.palantir\.com/connecting\-ai\-to\-decisions\-with\-the\-palantir\-ontology\-c73f7b0a1a72](https://blog.palantir.com/connecting-ai-to-decisions-with-the-palantir-ontology-c73f7b0a1a72)Accessed: 2026\-04\-29Cited by:[§5\.2](https://arxiv.org/html/2606.28076#S5.SS2.p2.1)\.
- S\. Pan, L\. Luo, Y\. Wang, C\. Chen, J\. Wang, and X\. Wu \(2024\)Unifying large language models and knowledge graphs: a roadmap\.IEEE Transactions on Knowledge and Data Engineering36\(7\),pp\. 3580–3599\.External Links:[Document](https://dx.doi.org/10.1109/TKDE.2024.3352100)Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1)\.
- K\. Rabbani, M\. Lissandrini, and K\. Hose \(2023\)Extraction of validating shapes from very large knowledge graphs\.Proceedings of the VLDB Endowment16\(5\),pp\. 1023–1032\.External Links:ISSN 2150\-8097,[Link](https://doi.org/10.14778/3579075.3579078),[Document](https://dx.doi.org/10.14778/3579075.3579078)Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1)\.
- N\. Reimers and I\. Gurevych \(2019\)Sentence\-bert: sentence embeddings using siamese bert\-networks\.InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing \(EMNLP\-IJCNLP\),pp\. 3982–3992\.Cited by:[§3\.3\.3](https://arxiv.org/html/2606.28076#S3.SS3.SSS3.p2.2)\.
- A\. Saxena, A\. Kochsiek, and R\. Gemulla \(2022\)Sequence\-to\-sequence knowledge graph completion and question answering\.InProceedings of the 60th annual meeting of the association for computational linguistics \(volume 1: long papers\),Dublin, Ireland,pp\. 2814–2828\.Cited by:[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.7.6.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p1.1)\.
- J\. Shi, S\. Cao, L\. Hou, J\. Li, and H\. Zhang \(2021\)Transfernet: an effective and transparent framework for multi\-hop question answering over relation graph\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,Online and Punta Cana, Dominican Republic,pp\. 4149–4158\.Cited by:[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.6.5.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p1.1)\.
- H\. Sun, T\. Bedrax\-Weiss, and W\. Cohen \(2019\)Pullnet: open domain question answering with iterative retrieval on knowledge bases and text\.InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing \(EMNLP\-IJCNLP\),Hong Kong, China,pp\. 2380–2390\.Cited by:[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.9.8.1)\.
- H\. Sun, B\. Dhingra, M\. Zaheer, K\. Mazaitis, R\. Salakhutdinov, and W\. Cohen \(2018\)Open domain question answering using early fusion of knowledge bases and text\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,Brussels, Belgium,pp\. 4231–4242\.Cited by:[§4\.1](https://arxiv.org/html/2606.28076#S4.SS1.p2.1),[§4\.1](https://arxiv.org/html/2606.28076#S4.SS1.p4.1),[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.8.7.2),[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.6.3.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p1.1)\.
- J\. Sun, C\. Xu, L\. Tang, S\. Wang, C\. Lin, Y\. Gong, L\. Ni, H\. Shum, and J\. Guo \(2024\)Think\-on\-graph: deep and responsible reasoning of large language model on knowledge graph\.InThe Twelfth International Conference on Learning Representations,Vienna, Austria\.External Links:[Link](https://openreview.net/forum?id=nnVO1PvbTv)Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1),[§1](https://arxiv.org/html/2606.28076#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.28076#S4.SS1.p4.1),[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.17.16.2),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p2.1)\.
- A\. Talmor and J\. Berant \(2018\)The web as a knowledge\-base for answering complex questions\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\),New Orleans, Louisiana,pp\. 641–651\.Cited by:[§4\.1](https://arxiv.org/html/2606.28076#S4.SS1.p1.1)\.
- X\. Tan, X\. Wang, Q\. Liu, X\. Xu, X\. Yuan, and W\. Zhang \(2025\)Paths\-over\-graph: knowledge graph empowered large language model reasoning\.InProceedings of the ACM on Web Conference 2025,WWW ’25,New York, NY, USA,pp\. 3505–3522\.External Links:ISBN 9798400712746,[Link](https://doi.org/10.1145/3696410.3714892),[Document](https://dx.doi.org/10.1145/3696410.3714892)Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1),[§1](https://arxiv.org/html/2606.28076#S1.p2.1)\.
- S\. Toro, A\. V\. Anagnostopoulos, S\. Bello, K\. Blumberg, R\. Cameron, L\. Carmody, A\. D\. Diehl, D\. M\. Dooley, W\. Duncan, P\. Fey, P\. Gaudet, N\. L\. Harris, M\. P\. Joachimiak, L\. Kiani, T\. Lubiana, M\. C\. Munoz\-Torres, S\. T\. O’Neil, D\. Osumi\-Sutherland, A\. Puig, J\. Reese, L\. Reiser, S\. M\. C\. Robb, T\. Ruemping, J\. Seager, E\. Sid, R\. Stefancsik, M\. Weber, V\. Wood, M\. A\. Haendel, and C\. J\. Mungall \(2024\)Dynamic retrieval augmented generation of ontologies using artificial intelligence \(dragon\-ai\)\.Journal of Biomedical Semantics15\(1\),pp\. 19\.Cited by:[§5\.2](https://arxiv.org/html/2606.28076#S5.SS2.p1.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale,et al\.\(2023\)Llama 2: open foundation and fine\-tuned chat models\.External Links:2307\.09288,[Document](https://dx.doi.org/10.48550/arXiv.2307.09288),[Link](https://arxiv.org/abs/2307.09288)Cited by:[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.12.11.2),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p2.1)\.
- Y\. Wen, Z\. Wang, and J\. Sun \(2024\)MindMap: knowledge graph prompting sparks graph of thoughts in large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 10370–10388\.External Links:[Link](https://aclanthology.org/2024.acl-long.558/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.558)Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1),[§1](https://arxiv.org/html/2606.28076#S1.p2.1)\.
- T\. Yang, Y\. Mei, L\. Xu, H\. Yu, and Y\. Chen \(2024\)Application of question answering systems for intelligent agriculture production and sustainable management: a review\.Resources, Conservation and Recycling204,pp\. 107497\.External Links:ISSN 0921\-3449,[Document](https://dx.doi.org/10.1016/j.resconrec.2024.107497),[Link](https://www.sciencedirect.com/science/article/pii/S0921344924000910)Cited by:[§1](https://arxiv.org/html/2606.28076#S1.p1.1)\.
- W\. Yih, M\. Richardson, C\. Meek, M\. Chang, and J\. Suh \(2016\)The value of semantic parse labeling for knowledge base question answering\.InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),Berlin, Germany,pp\. 201–206\.Cited by:[§4\.1](https://arxiv.org/html/2606.28076#S4.SS1.p1.1)\.
- C\. Yuan, R\. Lin, and C\. Guo \(2026\)Reliable reasoning: learning and inference based on the ability of large language models\.Applied Soft Computing190,pp\. 114618\.Cited by:[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.1.1),[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.1.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p2.1)\.
- H\. Zhang, L\. Zhou, and H\. Yang \(2025\)Learning to retrieve and reason on knowledge graph through active self\-reflection\.External Links:2502\.14932,[Document](https://dx.doi.org/10.48550/arXiv.2502.14932),[Link](https://arxiv.org/abs/2502.14932)Cited by:[Table 4](https://arxiv.org/html/2606.28076#S4.T4.1.14.11.1)\.
- J\. Zhang, X\. Zhang, J\. Yu, J\. Tang, J\. Tang, C\. Li, and H\. Chen \(2022\)Subgraph retrieval enhanced model for multi\-hop knowledge base question answering\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Dublin, Ireland,pp\. 5773–5784\.Cited by:[Table 2](https://arxiv.org/html/2606.28076#S4.T2.1.10.9.1),[§5\.1](https://arxiv.org/html/2606.28076#S5.SS1.p1.1)\.
- Y\. Zhang, H\. Dai, Z\. Kozareva, A\. Smola, and L\. Song \(2018\)Variational reasoning for question answering with knowledge graph\.InProceedings of the AAAI conference on artificial intelligence,Vol\.32,New Orleans, Louisiana, USA,pp\. 6069–6076\.Cited by:[§4\.1](https://arxiv.org/html/2606.28076#S4.SS1.p1.1)\.
