Stepwise Reasoning Enhancement for LLMs via External Subgraph Generation

arXiv cs.CL Papers

Summary

This paper proposes SGR, a framework that enhances LLM stepwise reasoning by integrating external knowledge graphs through query-relevant subgraph generation, combining Cypher-based reasoning with collaborative reasoning integration. Experiments on CWQ, WebQSP, GrailQA, and KQA Pro show improved reasoning accuracy over standard prompting and knowledge-enhanced baselines.

arXiv:2606.04454v1 Announce Type: new

Cached at: 06/05/26, 02:14 AM

# Stepwise Reasoning Enhancement for LLMs via External Subgraph Generation
Source: [https://arxiv.org/html/2606.04454](https://arxiv.org/html/2606.04454)
Xin Zhang¹, Yang Cao², Baoxing Wu¹, Kai Song², Siying Li¹

¹School of Information Science and Engineering, Chongqing Jiaotong University

²School of Computer Science and Technology, Chongqing University of Posts and Telecommunications

###### Abstract

Large language models have shown strong performance in natural language generation and downstream reasoning tasks, but they still struggle with logical consistency, factual grounding, and interpretability in complex multi\-step reasoning\. To address these limitations, this paper proposes SGR, a stepwise reasoning enhancement framework that integrates large language models with external knowledge graphs through query\-relevant subgraph generation\. Given an input question, SGR first extracts key entities, relations, and constraints to construct a structured schema, then retrieves compact subgraphs from a knowledge graph using schema\-guided querying\. The generated subgraphs provide explicit relational evidence that guides the language model through step\-by\-step reasoning\. In addition, SGR combines direct Cypher\-based reasoning with collaborative reasoning integration, allowing candidate answers from multiple reasoning paths to be validated and aggregated according to both model confidence and graph consistency\. Experiments on benchmark datasets including CWQ, WebQSP, GrailQA, and KQA Pro demonstrate that SGR improves reasoning accuracy and Hits@1 performance over standard prompting and several knowledge\-enhanced baselines\. Ablation studies further show that schema guidance and Neo4j\-based retrieval are both crucial to the effectiveness of the framework\. These results indicate that dynamically generated external subgraphs can improve the accuracy, robustness, and interpretability of LLM\-based reasoning\.


## 1 Introduction

Large language models (LLMs) have achieved remarkable success in natural language understanding, generation, and few-shot learning, demonstrating strong generalization across a wide range of downstream tasks (Brown et al., [2020](https://arxiv.org/html/2606.04454#bib.bib1)). Recent prompting strategies, such as chain-of-thought reasoning, further enable LLMs to decompose complex problems into intermediate steps and improve their performance on multi-step reasoning tasks (Wei et al., [2022](https://arxiv.org/html/2606.04454#bib.bib2)). Despite these advances, LLMs still face important limitations when solving knowledge-intensive reasoning problems. Their outputs may lack factual grounding, suffer from logical inconsistency, or provide reasoning processes that are difficult to verify. These issues are especially problematic in complex question answering scenarios where correct answers depend on multiple entities, relations, and constraints. Moreover, because the reasoning process of LLMs is often implicit, it remains challenging to ensure faithful interpretability and trace the evidence supporting a generated answer (Jacovi and Goldberg, [2020](https://arxiv.org/html/2606.04454#bib.bib3)).

Knowledge graphs provide a promising way to address these challenges by representing entities and relations as structured triples, enabling explicit and verifiable reasoning over external knowledge (Ji et al., [2021](https://arxiv.org/html/2606.04454#bib.bib4)). In knowledge-intensive question answering, structured resources such as Freebase have been widely used to support multi-hop inference and improve factual reliability (Yao and Van Durme, [2014](https://arxiv.org/html/2606.04454#bib.bib6)). Retrieval-augmented generation further demonstrates that supplying external evidence to language models can improve their performance on knowledge-intensive tasks (Lewis et al., [2020](https://arxiv.org/html/2606.04454#bib.bib5)). However, conventional retrieval methods often rely on unstructured passages or loosely connected facts, which may fail to preserve the relational dependencies required for complex reasoning. Recent studies have therefore explored more structured forms of reasoning, including logic-guided translation, deliberate planning, and graph-based interaction between LLMs and knowledge sources (Yang et al., [2024](https://arxiv.org/html/2606.04454#bib.bib21); Xiong et al., [2025](https://arxiv.org/html/2606.04454#bib.bib24); Jiang et al., [2023](https://arxiv.org/html/2606.04454#bib.bib11); Yao et al., [2023](https://arxiv.org/html/2606.04454#bib.bib7)). Nevertheless, how to dynamically construct compact, query-relevant graph evidence and use it to guide faithful stepwise reasoning remains an open problem.

To address this problem, we propose SGR, a stepwise reasoning enhancement framework that improves LLM reasoning by dynamically generating external subgraphs from knowledge graphs\. Instead of relying only on the implicit knowledge stored in model parameters or retrieving isolated textual evidence, SGR first converts an input question into a structured schema containing key entities, relations, and constraints\. This schema is then used to retrieve a compact query\-relevant subgraph from an external knowledge graph\. The generated subgraph provides explicit relational evidence, allowing the LLM to perform reasoning step by step along verifiable knowledge paths\.

SGR further combines two complementary reasoning strategies\. First, it performs direct reasoning enhancement by translating the generated schema into Cypher queries and executing them in Neo4j to obtain candidate answers grounded in the knowledge graph\. Second, it applies collaborative reasoning integration, in which candidate reasoning paths are validated and aggregated according to both language model confidence and graph consistency\. In this way, SGR not only improves factual grounding, but also reduces the risk of unsupported reasoning and enhances the interpretability of the final answer\.

The main contributions of this paper are summarized as follows\. First, we propose a stepwise reasoning enhancement framework that integrates LLMs with dynamically generated external subgraphs for knowledge\-intensive question answering\. Second, we introduce a schema\-guided subgraph generation strategy that extracts query\-relevant entities, relations, and constraints to retrieve compact structured evidence from knowledge graphs\. Third, we design a collaborative reasoning integration mechanism that combines Cypher\-based direct reasoning with graph\-consistency validation across multiple reasoning paths\. Finally, experiments on CWQ, WebQSP, GrailQA, and KQA Pro demonstrate that SGR improves Hits@1 and accuracy over standard prompting methods and several knowledge\-enhanced baselines, while ablation studies confirm the importance of schema guidance and Neo4j\-based retrieval\.

## 2 Related Work

### 2.1 Knowledge-Enhanced Language Models

Knowledge graphs have long been used as structured resources for representing entities, relations, and factual triples. Early studies on relational machine learning over knowledge graphs provide the foundation for embedding entities and relations into continuous spaces and performing link prediction and reasoning over structured knowledge (Nickel et al., [2015](https://arxiv.org/html/2606.04454#bib.bib8)). With the development of pre-trained language models, researchers have explored how to inject external knowledge into language representations. ERNIE incorporates knowledge information into BERT-style pre-training, showing that entity-aware representation learning can improve language understanding tasks (Sun et al., [2019](https://arxiv.org/html/2606.04454#bib.bib9)). More broadly, knowledge-enhanced pre-trained language models aim to combine symbolic knowledge with distributed representations, enabling models to better capture factual and relational information beyond text-only pre-training (Hu et al., [2023](https://arxiv.org/html/2606.04454#bib.bib10)).

In addition to representation learning, logical reasoning over knowledge graphs has been investigated as a way to improve interpretability and multi-hop inference. TILP learns temporal logical rules over knowledge graphs in a differentiable manner, demonstrating the value of explicit rule structures for temporal reasoning ([Xiong et al.](https://arxiv.org/html/2606.04454#bib.bib25)). TEILP further extends this direction by using logical reasoning for time prediction over knowledge graphs (Xiong et al., [2024b](https://arxiv.org/html/2606.04454#bib.bib26)). Recent work also shows that large language models can acquire temporal reasoning ability when properly trained or prompted, suggesting that neural models and structured temporal knowledge can complement each other (Xiong et al., [2024a](https://arxiv.org/html/2606.04454#bib.bib22)). These studies motivate our use of external subgraphs as explicit structured evidence for improving the factual grounding and interpretability of LLM reasoning.

### 2.2 Retrieval-Augmented and Knowledge-Intensive Reasoning

Retrieval-augmented models improve language generation by providing external evidence at inference or pre-training time. Retrieval-Augmented Generation retrieves relevant passages from an external corpus and conditions generation on the retrieved evidence, improving performance on knowledge-intensive NLP tasks (Lewis et al., [2020](https://arxiv.org/html/2606.04454#bib.bib5)). REALM introduces retrieval into language model pre-training, allowing the model to retrieve documents from a large corpus and use them for prediction (Guu et al., [2020](https://arxiv.org/html/2606.04454#bib.bib17)). In open-domain question answering, multi-step retriever-reader frameworks iteratively retrieve and read evidence, which helps answer complex questions requiring multiple supporting facts (Das et al., [2019](https://arxiv.org/html/2606.04454#bib.bib18)). Fusion-in-Decoder further shows that generative models can effectively leverage multiple retrieved passages by encoding them separately and fusing the evidence during decoding (Izacard and Grave, [2021](https://arxiv.org/html/2606.04454#bib.bib13)).

Several benchmarks and systems have been proposed to evaluate knowledge-intensive reasoning. KILT provides a unified benchmark for knowledge-intensive language tasks and emphasizes the importance of grounding model outputs in external sources (Petroni et al., [2021](https://arxiv.org/html/2606.04454#bib.bib19)). Open question answering over tables and text explores reasoning across heterogeneous evidence types, showing that complex QA often requires combining structured and unstructured information (Chen et al., [2020](https://arxiv.org/html/2606.04454#bib.bib12)). Knowledge-augmented prompting methods further demonstrate that LLMs can benefit from explicit knowledge retrieved from knowledge graphs in zero-shot KGQA settings (Baek et al., [2023](https://arxiv.org/html/2606.04454#bib.bib16)). Compared with these approaches, our framework focuses on dynamically generating compact query-relevant subgraphs rather than retrieving isolated text passages or loosely connected facts. This allows the model to reason over explicit relational paths and better preserve the structural dependencies required for multi-hop question answering.

### 2.3 Structured and Collaborative Reasoning with LLMs

Recent work has increasingly explored how to guide LLMs with structured reasoning processes. Least-to-most prompting decomposes complex problems into simpler subproblems and solves them sequentially, showing that stepwise reasoning can improve model performance on difficult tasks (Zhou et al., [2022](https://arxiv.org/html/2606.04454#bib.bib20)). StructGPT provides a general framework for enabling LLMs to reason over structured data through iterative reading and reasoning operations (Jiang et al., [2023](https://arxiv.org/html/2606.04454#bib.bib11)). Graph reasoning enhanced language models further incorporate graph neural reasoning into language models for question answering, demonstrating the benefit of explicitly modeling relational structures (Zhang et al., [2022](https://arxiv.org/html/2606.04454#bib.bib14)).

Another important direction is collaborative reasoning between LLMs and knowledge graphs. Recent surveys and roadmaps argue that LLMs and KGs have complementary strengths: LLMs provide flexible language understanding and generation, while KGs provide precise, interpretable, and verifiable structured knowledge (Pan et al., [2024](https://arxiv.org/html/2606.04454#bib.bib15)). This complementarity is especially important for complex knowledge-intensive QA, where reasoning often requires both semantic understanding of the question and faithful traversal of entity-relation paths. Our proposed SGR framework follows this direction by integrating LLM-based schema construction, Cypher-based graph querying, and collaborative answer validation. Unlike methods that rely only on retrieved passages or direct prompting, SGR uses generated external subgraphs to support stepwise reasoning and aggregates candidate answers according to both model confidence and graph consistency. In this way, SGR aims to improve not only answer accuracy, but also the transparency and reliability of the reasoning process.

![Refer to caption](https://arxiv.org/html/2606.04454v1/Fig1.png)

Figure 1: Pipeline of the SGR framework.

## 3 Methodology

### 3.1 Structured Subgraph Generation

#### 3.1.1 Knowledge Graph Construction

SGR uses an external knowledge graph as the structured reasoning source for generating query\-relevant evidence\. The knowledge graph is represented as

$$\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{R}), \tag{1}$$

where $\mathcal{V}$ denotes the set of entities, $\mathcal{R}$ denotes the set of relation types, and $\mathcal{E}$ denotes the set of factual triples. Each edge in the graph is represented as a triple

$$(h, r, t) \in \mathcal{E}, \tag{2}$$

where $h, t \in \mathcal{V}$ are the head and tail entities, and $r \in \mathcal{R}$ is the relation connecting them. This triple-based representation allows SGR to explicitly model factual dependencies among entities and provides a verifiable basis for multi-hop reasoning.

To support efficient graph querying, the extracted triples are stored in Neo4j\. Entities are modeled as graph nodes, while relations are modeled as directed edges between nodes\. Each node contains attributes such as entity name, type, and identifier, and each edge contains the corresponding relation label and optional constraint information\. This construction enables the system to retrieve paths, neighborhoods, and constrained relation patterns through Cypher queries\.
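The storage step described above can be sketched as follows. This is a minimal illustration rather than the authors' loader: the node label `Entity`, the property name `id`, and the helper `triple_to_cypher` are assumptions made for the example. It only builds the Cypher text and parameters, leaving execution to whichever Neo4j client is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str      # head entity identifier
    relation: str  # relation label
    tail: str      # tail entity identifier


def triple_to_cypher(triple: Triple) -> tuple[str, dict[str, str]]:
    """Build a parameterized Cypher MERGE statement for one triple."""
    # Relationship types cannot be supplied as Cypher parameters, so the label is
    # embedded in the query text, backtick-quoted, with stray backticks removed.
    rel = triple.relation.replace("`", "")
    query = (
        "MERGE (h:Entity {id: $head}) "
        "MERGE (t:Entity {id: $tail}) "
        f"MERGE (h)-[:`{rel}`]->(t)"
    )
    return query, {"head": triple.head, "tail": triple.tail}


query, params = triple_to_cypher(Triple("m.0abc", "film.film.directed_by", "m.0xyz"))
print(query)
print(params)
```

Using `MERGE` rather than `CREATE` keeps repeated loads idempotent, which matters when the same entity appears in many triples.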

Given an input question $q$, SGR first identifies the question-specific schema

$$\mathcal{S}_q = \{\mathcal{V}_q, \mathcal{R}_q, \mathcal{C}_q\}, \tag{3}$$

where $\mathcal{V}_q$ is the set of linked entities, $\mathcal{R}_q$ is the set of candidate relations, and $\mathcal{C}_q$ is the set of constraints, such as entity types, temporal restrictions, comparison conditions, or numerical limits. The schema serves as an intermediate structured representation between the natural language question and the external knowledge graph.

The entity linking module maps mentions in the question to corresponding nodes in $\mathcal{G}$. Relation extraction then identifies possible predicates that connect the linked entities to potential answer nodes. Constraint extraction further captures semantic conditions that must be satisfied during graph retrieval. For example, in questions involving time, location, ranking, or aggregation, the corresponding constraints are added to $\mathcal{C}_q$ and later used to filter candidate reasoning paths.

Through this construction process, the full knowledge graph is transformed into a searchable structured space\. Instead of allowing the language model to reason only from implicit parametric knowledge, SGR grounds the reasoning process in explicit entities, relations, and constraints\. This provides the foundation for the subsequent subgraph generation stage, where compact evidence subgraphs are dynamically retrieved according to the question schema\.
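One way to picture the question schema is as a small typed record. The sketch below is illustrative only: the class names, field names, and the example question ("Which films directed by Christopher Nolan were released after 2010?") are assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Constraint:
    kind: str      # e.g. "type", "temporal", "comparison", "numeric"
    target: str    # attribute or variable the constraint applies to
    operator: str  # e.g. "=", "<", ">="
    value: str


@dataclass
class QuestionSchema:
    entities: list[str] = field(default_factory=list)            # linked entities (V_q)
    relations: list[str] = field(default_factory=list)           # candidate relations (R_q)
    constraints: list[Constraint] = field(default_factory=list)  # constraints (C_q)


# Hypothetical schema for the example question named in the lead-in.
schema = QuestionSchema(
    entities=["Christopher Nolan"],
    relations=["directed", "release_year"],
    constraints=[Constraint("temporal", "release_year", ">", "2010")],
)
print(schema)
```

Keeping constraints as structured records, rather than free text, makes it straightforward to apply them later as filters during retrieval.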

#### 3.1.2 Subgraph Generation Process

After constructing the question schema $\mathcal{S}_q$, SGR generates a compact query-relevant subgraph from the external knowledge graph. Starting from the linked entities in $\mathcal{V}_q$, the framework expands along candidate relations in $\mathcal{R}_q$ while applying the constraints in $\mathcal{C}_q$. This process retrieves entity-relation paths that are most relevant to the input question.

A candidate reasoning path is denoted as

$$p_i = (v_0, r_1, v_1, \ldots, r_k, v_k), \tag{4}$$

where $v_0$ is a seed entity and $k$ is the path length. Each path is scored according to relation relevance, entity alignment, and constraint satisfaction. The top-ranked paths are merged to form the subgraph $\mathcal{G}_q$, while duplicate and low-relevance triples are removed. The resulting subgraph provides compact structured evidence for later reasoning and reduces noise from the full knowledge graph.
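The expand, score, and merge procedure can be sketched over an in-memory triple list. The paper does not give the scoring function, so `score_path` below is a placeholder assumption (fraction of schema relations a path covers). The function names, the hop limit, and `top_k` are likewise illustrative.

```python
from collections import defaultdict


def expand_paths(triples, seeds, relations, max_hops=2):
    """Enumerate paths of up to max_hops edges from the seeds along allowed relations."""
    out_edges = defaultdict(list)
    for h, r, t in triples:
        if r in relations:
            out_edges[h].append((h, r, t))
    paths = []
    frontier = [((), seed) for seed in seeds]  # (path so far, current node)
    for _ in range(max_hops):
        next_frontier = []
        for path, node in frontier:
            for edge in out_edges[node]:
                if edge in path:  # do not reuse an edge within one path
                    continue
                new_path = path + (edge,)
                paths.append(new_path)
                next_frontier.append((new_path, edge[2]))
        frontier = next_frontier
    return paths


def score_path(path, relations):
    """Placeholder score (an assumption): share of schema relations the path covers."""
    covered = {r for _, r, _ in path}
    return len(covered & set(relations)) / max(len(relations), 1)


def build_subgraph(triples, seeds, relations, top_k=3, max_hops=2):
    """Merge the top_k highest-scoring paths into a de-duplicated set of triples."""
    paths = expand_paths(triples, seeds, relations, max_hops)
    ranked = sorted(paths, key=lambda p: score_path(p, relations), reverse=True)
    subgraph = set()
    for path in ranked[:top_k]:
        subgraph.update(path)  # the set removes duplicate triples
    return subgraph


kg = [("A", "r1", "B"), ("B", "r2", "C"), ("A", "r3", "D")]
print(build_subgraph(kg, seeds=["A"], relations=["r1", "r2"], top_k=1))
```

A real system would fold entity alignment and constraint satisfaction into the score; the structure of the loop stays the same.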

### 3.2 Stepwise Reasoning Enhancement

Given the generated subgraph $\mathcal{G}_q$, SGR guides the language model to reason step by step over explicit graph evidence. Instead of directly producing an answer, the model follows relevant entities, relations, and constraints in the subgraph to construct an interpretable reasoning trajectory.

Let $z_t$ denote the reasoning state at step $t$. The stepwise reasoning process is formulated as

$$z_t = f_\theta(q, \mathcal{S}_q, \mathcal{G}_q, z_{<t}), \tag{5}$$

where $z_{<t}$ represents previous reasoning states. At each step, the model selects evidence from the subgraph and updates the reasoning state. Unsupported reasoning steps are filtered or assigned lower reliability. In this way, SGR improves factual grounding and makes the reasoning process easier to verify.
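The recurrence in Eq. (5) can be read as a loop in which each new state depends on the question, the subgraph, and all earlier states. In the sketch below the language model is replaced by a stand-in `propose` callable, and the schema argument is omitted for brevity. Both simplifications, and all names, are assumptions for illustration.

```python
from typing import Callable, Optional

Triple = tuple[str, str, str]
Step = tuple[Triple, bool]  # (evidence triple, is it supported by the subgraph?)
Proposer = Callable[[str, set, list], Optional[Triple]]


def stepwise_reason(question: str, subgraph: set, propose: Proposer, max_steps: int = 5):
    """Build a reasoning trajectory, flagging every step that lacks graph support."""
    states: list[Step] = []
    for _ in range(max_steps):
        triple = propose(question, subgraph, states)  # stand-in for f_theta
        if triple is None:
            break
        states.append((triple, triple in subgraph))
    return states


def follow_edges(start: str) -> Proposer:
    """Toy proposer: continue from the tail of the last step, or from `start`."""
    def propose(question, subgraph, states):
        current = states[-1][0][2] if states else start
        for triple in sorted(subgraph):
            if triple[0] == current:
                return triple
        return None
    return propose


subgraph = {("A", "r1", "B"), ("B", "r2", "C")}
print(stepwise_reason("toy question", subgraph, follow_edges("A")))
```

Recording the support flag instead of silently dropping a step keeps both options from the text available: a downstream component can either filter unsupported steps or down-weight them.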

### 3.3 Collaborative Reasoning Integration

Complex questions may produce multiple candidate reasoning paths. To improve robustness, SGR integrates answers from different paths using both model confidence and graph consistency. For each path $p_i$, the language model produces a candidate answer $a_i$ with confidence $\rho_i$, while the graph module computes a consistency score $\eta_i$.

The reliability score of each path is defined as

$$\omega_i = \lambda \rho_i + (1 - \lambda)\eta_i, \tag{6}$$

where $\lambda$ balances the two signals. The final answer is selected by aggregating support from all paths:

$$\hat{a} = \arg\max_{a} \sum_{i} \omega_i \cdot \mathbb{I}(a_i = a). \tag{7}$$

This mechanism suppresses noisy single-path predictions and favors answers supported by multiple reliable reasoning paths. Therefore, SGR improves both answer accuracy and interpretability.
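As a hedged numerical illustration of Eqs. (6) and (7), using made-up values that do not come from the paper, take $\lambda = 0.5$ and three paths, two of which agree on answer $X$:

$$
\begin{aligned}
\omega_1 &= 0.5 \cdot 0.9 + 0.5 \cdot 0.4 = 0.65 && (a_1 = X),\\
\omega_2 &= 0.5 \cdot 0.6 + 0.5 \cdot 0.8 = 0.70 && (a_2 = Y),\\
\omega_3 &= 0.5 \cdot 0.5 + 0.5 \cdot 0.7 = 0.60 && (a_3 = X).
\end{aligned}
$$

Summing per answer gives $0.65 + 0.60 = 1.25$ for $X$ and $0.70$ for $Y$, so $\hat{a} = X$ even though path 2 has the highest individual reliability score.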

Table 1: Performance comparison of different reasoning methods on CWQ, WebQSP, and GrailQA. Hits@1 and accuracy are reported where applicable, and the best results for each metric are highlighted in bold. Note: Best results are taken from prior work, including $\alpha$, $\beta$, $\gamma$, and $\delta$.

### 3.4 Direct Reasoning Enhancement

Direct reasoning enhancement aims to obtain graph-grounded candidate answers by converting the generated schema into executable Cypher queries. Unlike pure prompting methods, which rely mainly on the implicit knowledge of the language model, this component performs explicit reasoning over the external knowledge graph. Given the structured schema $\mathcal{S}_q = \{\mathcal{V}_q, \mathcal{R}_q, \mathcal{C}_q\}$, SGR constructs Cypher queries that search for entity-relation paths satisfying the semantic and logical constraints of the input question. The retrieved results provide direct evidence from Neo4j and serve as reliable candidate answers for subsequent validation and integration.

#### 3.4.1 Cypher LLM

The Cypher LLM module translates the schema generated from the input question into a structured graph query\. Specifically, the model receives the extracted entities, candidate relations, and constraints, and then generates a Cypher query that can be executed in Neo4j\. This query is designed to match reasoning paths in the knowledge graph that are consistent with the question semantics\.

Let the Cypher generation process be denoted as

$$\mathcal{Q}_c = g_\theta(q, \mathcal{S}_q), \tag{8}$$

where $g_\theta$ represents the language model used for query generation and $\mathcal{Q}_c$ is the generated Cypher query. Executing $\mathcal{Q}_c$ on the external knowledge graph produces a set of candidate answers:

$$\mathcal{A}_c = \mathrm{Exec}(\mathcal{Q}_c, \mathcal{G}), \tag{9}$$

where $\mathrm{Exec}(\cdot)$ denotes query execution in Neo4j.

The Cypher LLM module improves reasoning in two ways\. First, it transforms natural language reasoning into executable symbolic operations, making the reasoning process more explicit and verifiable\. Second, it narrows the search space by using the schema constraints, which helps avoid irrelevant entities and noisy relations\. As a result, the model can obtain candidate answers that are directly grounded in structured graph evidence rather than relying only on generated textual reasoning\.
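To make the schema-to-query step concrete, the sketch below renders a Cypher path query from a schema using a fixed template. In SGR this translation is performed by a language model, so the template is a deterministic stand-in rather than the paper's method. The `Entity` label, the `name` property, and the function name are assumptions.

```python
ALLOWED_OPS = {"=", "<>", "<", "<=", ">", ">="}


def schema_to_cypher(entity, relations, constraints=()):
    """Render a path query from `entity` along `relations`, filtered by `constraints`.

    `constraints` is a sequence of (property, operator, value) tuples applied to the
    final node of the path, which is treated as the answer node.
    """
    params = {"entity": entity}
    pattern = "(n0:Entity {name: $entity})"
    for i, rel in enumerate(relations, start=1):
        safe_rel = rel.replace("`", "")  # relationship types cannot be parameters
        pattern += f"-[:`{safe_rel}`]->(n{i})"
    answer = f"n{len(relations)}"
    clauses = []
    for k, (prop, op, value) in enumerate(constraints):
        if op not in ALLOWED_OPS or not prop.isidentifier():
            raise ValueError(f"unsupported constraint: {prop} {op}")
        params[f"c{k}"] = value  # values are passed as parameters, not inlined
        clauses.append(f"{answer}.{prop} {op} $c{k}")
    query = f"MATCH {pattern}"
    if clauses:
        query += " WHERE " + " AND ".join(clauses)
    query += f" RETURN DISTINCT {answer}.name AS answer"
    return query, params


query, params = schema_to_cypher(
    "Christopher Nolan", ["directed"], [("release_year", ">", 2010)]
)
print(query)
print(params)
```

Whatever produces the query, passing entity names and constraint values as parameters rather than splicing them into the text keeps generated queries safe to execute.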

#### 3.4.2 Answer Validation

Although Cypher\-based retrieval provides graph\-grounded candidate answers, errors may still occur due to imperfect entity linking, relation matching, or query generation\. Therefore, SGR introduces an answer validation step to verify whether each candidate answer is supported by the generated subgraph and consistent with the original question\.

For each candidate answer $a_j \in \mathcal{A}_c$, SGR evaluates its validity using both graph evidence and semantic consistency. The graph consistency score measures whether the answer can be reached through a valid reasoning path in the generated subgraph:

$$\gamma_j = C(p_j, \mathcal{G}_q), \tag{10}$$

where $p_j$ is the reasoning path associated with answer $a_j$. The semantic confidence score measures the compatibility between the question, the schema, and the candidate answer:

$$\delta_j = P_\theta(a_j \mid q, \mathcal{S}_q, p_j). \tag{11}$$

The final validation score is computed as

$$v_j = \mu \delta_j + (1 - \mu)\gamma_j, \tag{12}$$

where $\mu \in [0, 1]$ controls the balance between semantic confidence and graph consistency. Candidate answers with validation scores below a predefined threshold are filtered out:

$$\mathcal{A}_v = \{a_j \in \mathcal{A}_c \mid v_j \geq \epsilon\}. \tag{13}$$

This validation mechanism reduces the influence of incorrect or weakly supported answers. By requiring candidate answers to be both semantically plausible and structurally supported by the knowledge graph, SGR improves the reliability of direct reasoning and provides more trustworthy inputs for the later collaborative reasoning stage.
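The validation filter of Eqs. (10) to (13) can be sketched in a few lines. The paper does not define the consistency function $C$, so `graph_consistency` below assumes one simple choice, the fraction of a path's triples found in the subgraph. The default values of `mu` and `epsilon` are likewise illustrative.

```python
def graph_consistency(path, subgraph):
    """Assumed form of C(p_j, G_q): share of the path's triples present in the subgraph."""
    if not path:
        return 0.0
    return sum(1 for triple in path if triple in subgraph) / len(path)


def validate_candidates(candidates, mu=0.5, epsilon=0.5):
    """Keep candidates whose combined score v = mu*delta + (1-mu)*gamma is >= epsilon.

    `candidates` is an iterable of (answer, delta, gamma) tuples, where delta is the
    semantic confidence and gamma is the graph consistency score.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    kept = []
    for answer, delta, gamma in candidates:
        v = mu * delta + (1.0 - mu) * gamma
        if v >= epsilon:
            kept.append((answer, v))
    return kept


subgraph = {("A", "r1", "B")}
gamma = graph_consistency([("A", "r1", "B"), ("B", "r2", "C")], subgraph)
print(gamma)
print(validate_candidates([("x", 0.9, gamma), ("y", 0.2, gamma)]))
```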

### 3.5 Collaborative Reasoning Enhancement

To improve robustness, SGR integrates candidate answers from multiple reasoning paths rather than relying on a single Cypher query result. Given the query-relevant subgraph $\mathcal{G}_q$, the framework obtains a set of candidate reasoning paths:

$$\mathcal{P}_q = \{p_1, p_2, \ldots, p_M\}. \tag{14}$$

Each path $p_i$ produces a candidate answer $a_i$ with a model confidence score:

$$\rho_i = P_\theta(a_i \mid q, p_i, \mathcal{S}_q), \tag{15}$$

where $\mathcal{S}_q$ is the generated schema. SGR also evaluates the graph consistency of each path:

$$\eta_i = C(p_i, \mathcal{G}_q). \tag{16}$$

The reliability score of each path is computed by combining model confidence and graph consistency:

$$\omega_i = \lambda \rho_i + (1 - \lambda)\eta_i, \tag{17}$$

where $\lambda \in [0, 1]$ controls their relative importance. The final answer is selected by aggregating support from all paths:

$$\hat{a} = \arg\max_{a} \sum_{i=1}^{M} \omega_i \cdot \mathbb{I}(a_i = a). \tag{18}$$

This collaborative reasoning strategy reduces the influence of noisy or incomplete paths. Answers supported by multiple high-confidence and graph-consistent paths receive higher scores, while unsupported answers are suppressed. As a result, SGR improves both the reliability and interpretability of multi-hop reasoning.
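The weighted vote in Eqs. (17) and (18) is small enough to write out directly. The sketch below assumes each path has already been reduced to an `(answer, rho, eta)` tuple. The function name and the sample numbers are illustrative, not from the paper.

```python
from collections import defaultdict


def aggregate_answers(paths, lam=0.5):
    """Return the answer with the highest summed reliability, plus all per-answer totals.

    `paths` is an iterable of (answer, rho, eta): candidate answer, model confidence,
    and graph consistency for one reasoning path.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    support = defaultdict(float)
    for answer, rho, eta in paths:
        omega = lam * rho + (1.0 - lam) * eta  # per-path reliability, Eq. (17)
        support[answer] += omega               # indicator-weighted sum, Eq. (18)
    if not support:
        raise ValueError("no candidate paths to aggregate")
    best = max(support, key=support.get)
    return best, dict(support)


paths = [("X", 0.9, 0.4), ("Y", 0.6, 0.8), ("X", 0.5, 0.7)]
best, support = aggregate_answers(paths)
print(best, support)
```

Because support is summed per answer, two moderately reliable paths that agree can outvote a single stronger path, which is the behavior the text describes.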

## 4 Experiments

### 4.1 Experimental Setup

We evaluate SGR on four knowledge\-intensive question answering benchmarks: CWQ, WebQSP, GrailQA, and KQA Pro\. These datasets test multi\-hop, compositional, and structured reasoning over knowledge graphs\. For each question, SGR extracts key entities, relations, and constraints to build a schema, which is then converted into a Cypher query and executed in Neo4j to retrieve relevant subgraphs and candidate answers\. We test three variants: SGR/Cypher LLM, SGR/ChatGPT, and SGR/GPT\-4\. We compare them with standard prompting methods, including IO prompting and Chain\-of\-Thought prompting, as well as knowledge\-enhanced baselines such as StructGPT and Tree\-of\-Graphs\. Performance is measured using Hits@1 and accuracy\. We also conduct ablation studies by removing schema prompts and Neo4j retrieval to evaluate the contribution of each component\.
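The text does not spell out how the metrics are computed. Under the common KGQA convention that Hits@1 counts a question as correct when the top-ranked prediction is among the gold answers, the metric can be sketched as below. That convention, and the function name, are assumptions rather than details confirmed by the paper.

```python
def hits_at_1(predictions, gold):
    """Fraction of questions whose top-ranked prediction is in the gold answer set.

    `predictions` holds one ranked answer list per question; `gold` holds one set of
    acceptable answers per question, in the same order.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(
        1 for ranked, answers in zip(predictions, gold) if ranked and ranked[0] in answers
    )
    return hits / len(gold)


predictions = [["a", "b"], ["c"], []]
gold = [{"a"}, {"d"}, {"e"}]
print(hits_at_1(predictions, gold))
```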

Table 2: Ablation experiment results on the CWQ dataset.

### 4.2 Experimental Results

Table [1](https://arxiv.org/html/2606.04454#S3.T1) presents the main results on CWQ, WebQSP, and GrailQA. Overall, SGR consistently outperforms standard prompting baselines, showing the effectiveness of using external subgraph evidence for multi-step reasoning. On CWQ, SGR/ChatGPT improves Hits@1 from 0.376 for IO Prompt and 0.388 for CoT to 0.578, while accuracy increases to 0.526. With GPT-4, SGR further achieves the best accuracy of 0.590. On WebQSP, SGR/GPT-4 obtains 0.826 Hits@1 and 0.808 accuracy, matching the best Hits@1 result and achieving the highest accuracy among all methods. On GrailQA, SGR/GPT-4 reaches 0.756 Hits@1 and the best accuracy of 0.703, demonstrating strong generalization to complex query structures.

These results indicate that schema\-guided subgraph generation and graph\-grounded reasoning help LLMs produce more accurate and reliable answers\. The consistent gains over prompting baselines and competitive performance against existing knowledge\-enhanced methods confirm the effectiveness of the proposed SGR framework\.

### 4.3 Ablation Study

The ablation study evaluates the effects of schema guidance and Neo4j-based retrieval on the CWQ dataset. As shown in Table [2](https://arxiv.org/html/2606.04454#S4.T2), removing either component causes a clear performance drop. Without schema prompts, SGR/Cypher LLM decreases from 0.523 to 0.322 in Hits@1 and from 0.445 to 0.128 in accuracy, while SGR/ChatGPT drops from 0.578 to 0.400 in Hits@1 and from 0.526 to 0.319 in accuracy. This indicates that schema guidance is important for constructing accurate reasoning paths. Similarly, removing Neo4j retrieval reduces performance for both models, showing that external graph evidence is necessary for factual grounding. Overall, the results demonstrate that schema prompts and Neo4j retrieval are complementary components, and their combination enables SGR to achieve more accurate and reliable multi-hop reasoning.
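The absolute drops implied by the schema-prompt figures quoted above follow from simple subtraction on the reported numbers:

$$
\begin{aligned}
\text{SGR/Cypher LLM, Hits@1:} &\quad 0.523 - 0.322 = 0.201\\
\text{SGR/Cypher LLM, accuracy:} &\quad 0.445 - 0.128 = 0.317\\
\text{SGR/ChatGPT, Hits@1:} &\quad 0.578 - 0.400 = 0.178\\
\text{SGR/ChatGPT, accuracy:} &\quad 0.526 - 0.319 = 0.207
\end{aligned}
$$

For both variants, the accuracy drop is larger than the Hits@1 drop.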

![Refer to caption](https://arxiv.org/html/2606.04454v1/Fig2.png)

Figure 2: Impact of removing schema prompts.

![Refer to caption](https://arxiv.org/html/2606.04454v1/Fig3.png)

Figure 3: Impact of removing Neo4j retrieval.

### 4.4 Application Scenarios and Error Analysis

SGR is suitable for knowledge\-intensive tasks that require reliable multi\-step reasoning, such as knowledge\-based question answering, intelligent search, educational question answering, biomedical knowledge exploration, and enterprise knowledge management\. In these scenarios, answers often depend on multiple entities and relations rather than a single fact\. By generating query\-relevant subgraphs, SGR provides explicit evidence paths that help the language model produce more grounded and interpretable answers\. The framework is also useful for complex information retrieval\. Compared with direct prompting or unstructured retrieval, schema\-guided subgraph generation can preserve important relational structures while filtering out irrelevant information\. This allows the model to focus on compact and useful evidence during reasoning\.

However, SGR still produces several types of errors\. First, entity linking errors may occur when the question contains ambiguous or rare entities\. Second, relation identification errors may appear when natural language expressions do not match the relation labels in the knowledge graph\. Third, useful triples may be removed during subgraph filtering, leading to incomplete reasoning paths\. Finally, overly large subgraphs may introduce noisy evidence and mislead the model\. These errors suggest that more accurate entity linking, relation matching, and path ranking strategies are needed in future work\.

## 5 Conclusion

This paper proposed SGR, a stepwise reasoning enhancement framework that integrates LLMs with external knowledge graphs through query\-relevant subgraph generation\. By using schema\-guided retrieval, Cypher\-based reasoning, and collaborative answer integration, SGR grounds the reasoning process in structured evidence and improves interpretability\. Experiments on CWQ, WebQSP, GrailQA, and KQA Pro show that SGR improves accuracy and Hits@1 over standard prompting and knowledge\-enhanced baselines\. Ablation results further confirm that schema guidance and Neo4j\-based retrieval are key to the framework’s effectiveness\. Overall, SGR demonstrates that external subgraphs can help LLMs perform more reliable and transparent multi\-step reasoning\.

## Limitations

SGR still has several limitations\. First, its performance depends on the quality and coverage of the external knowledge graph; missing or noisy facts may lead to incorrect reasoning\. Second, errors in entity extraction, relation identification, or schema construction can affect subgraph retrieval and final answers\. Third, choosing an appropriate subgraph size is challenging, since small subgraphs may miss useful evidence while large ones may introduce noise\. Finally, SGR requires additional computation for schema generation, Cypher querying, and reasoning\-path integration, which may increase inference cost compared with direct prompting methods\.

## References

- Knowledge\-augmented language model prompting for zero\-shot knowledge graph question answering\. arXiv preprint arXiv:2306\.04136\. Cited by: [§2\.2](https://arxiv.org/html/2606.04454#S2.SS2.p2.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, et al\. \(2020\) Language models are few\-shot learners\. Advances in neural information processing systems 33, pp\. 1877–1901\. Cited by: [§1](https://arxiv.org/html/2606.04454#S1.p1.1)\.
- W\. Chen, M\. Chang, E\. Schlinger, W\. Wang, and W\. W\. Cohen \(2020\) Open question answering over tables and text\. arXiv preprint arXiv:2010\.10439\. Cited by: [§2\.2](https://arxiv.org/html/2606.04454#S2.SS2.p2.1)\.
- R\. Das, S\. Dhuliawala, M\. Zaheer, and A\. McCallum \(2019\) Multi\-step retriever\-reader interaction for scalable open\-domain question answering\. arXiv preprint arXiv:1905\.05733\. Cited by: [§2\.2](https://arxiv.org/html/2606.04454#S2.SS2.p1.1)\.
- K\. Guu, K\. Lee, Z\. Tung, P\. Pasupat, and M\. Chang \(2020\) Retrieval augmented language model pre\-training\. In International conference on machine learning, pp\. 3929–3938\. Cited by: [§2\.2](https://arxiv.org/html/2606.04454#S2.SS2.p1.1)\.
- L\. Hu, Z\. Liu, Z\. Zhao, L\. Hou, L\. Nie, and J\. Li \(2023\) A survey of knowledge enhanced pre\-trained language models\. IEEE Transactions on Knowledge and Data Engineering 36\(4\), pp\. 1413–1430\. Cited by: [§2\.1](https://arxiv.org/html/2606.04454#S2.SS1.p1.1)\.
- G\. Izacard and E\. Grave \(2021\) Leveraging passage retrieval with generative models for open domain question answering\. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume, pp\. 874–880\. Cited by: [§2\.2](https://arxiv.org/html/2606.04454#S2.SS2.p1.1)\.
- A\. Jacovi and Y\. Goldberg \(2020\) Towards faithfully interpretable nlp systems: how should we define and evaluate faithfulness? Cited by: [§1](https://arxiv.org/html/2606.04454#S1.p1.1)\.
- S\. Ji, S\. Pan, E\. Cambria, P\. Marttinen, and P\. S\. Yu \(2021\) A survey on knowledge graphs: representation, acquisition, and applications\. IEEE transactions on neural networks and learning systems 33\(2\), pp\. 494–514\. Cited by: [§1](https://arxiv.org/html/2606.04454#S1.p2.1)\.
- J\. Jiang, K\. Zhou, Z\. Dong, K\. Ye, W\. X\. Zhao, and J\. Wen \(2023\) Structgpt: a general framework for large language model to reason over structured data\. arXiv preprint arXiv:2305\.09645\. Cited by: [§1](https://arxiv.org/html/2606.04454#S1.p2.1), [§2\.3](https://arxiv.org/html/2606.04454#S2.SS3.p1.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, et al\. \(2020\) Retrieval\-augmented generation for knowledge\-intensive nlp tasks\. Advances in neural information processing systems 33, pp\. 9459–9474\. Cited by: [§1](https://arxiv.org/html/2606.04454#S1.p2.1), [§2\.2](https://arxiv.org/html/2606.04454#S2.SS2.p1.1)\.
- M\. Nickel, K\. Murphy, V\. Tresp, and E\. Gabrilovich \(2015\) A review of relational machine learning for knowledge graphs\. Proceedings of the IEEE 104\(1\), pp\. 11–33\. Cited by: [§2\.1](https://arxiv.org/html/2606.04454#S2.SS1.p1.1)\.
- S\. Pan, L\. Luo, Y\. Wang, C\. Chen, J\. Wang, and X\. Wu \(2024\) Unifying large language models and knowledge graphs: a roadmap\. IEEE Transactions on Knowledge and Data Engineering 36\(7\), pp\. 3580–3599\. Cited by: [§2\.3](https://arxiv.org/html/2606.04454#S2.SS3.p2.1)\.
- F\. Petroni, A\. Piktus, A\. Fan, P\. Lewis, M\. Yazdani, N\. De Cao, J\. Thorne, Y\. Jernite, V\. Karpukhin, J\. Maillard, et al\. \(2021\) KILT: a benchmark for knowledge intensive language tasks\. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp\. 2523–2544\. Cited by: [§2\.2](https://arxiv.org/html/2606.04454#S2.SS2.p2.1)\.
- Y\. Sun, S\. Wang, Y\. Li, S\. Feng, X\. Chen, H\. Zhang, X\. Tian, D\. Zhu, H\. Tian, and H\. Wu \(2019\) Ernie: enhanced representation through knowledge integration\. Cited by: [§2\.1](https://arxiv.org/html/2606.04454#S2.SS1.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou, et al\. \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. Advances in neural information processing systems 35, pp\. 24824–24837\. Cited by: [§1](https://arxiv.org/html/2606.04454#S1.p1.1)\.
- S\. Xiong, A\. Payani, R\. Kompella, and F\. Fekri \(2024a\) Large language models can learn temporal reasoning\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 10452–10470\. Cited by: [§2\.1](https://arxiv.org/html/2606.04454#S2.SS1.p2.1)\.
- S\. Xiong, A\. Payani, Y\. Yang, and F\. Fekri \(2025\) Deliberate reasoning in language models as structure\-aware planning with an accurate world model\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 31900–31931\. Cited by: [§1](https://arxiv.org/html/2606.04454#S1.p2.1)\.
- S\. Xiong, Y\. Yang, F\. Fekri, and J\. C\. Kerce\. TILP: differentiable learning of temporal logical rules on knowledge graphs\. In The Eleventh International Conference on Learning Representations\. Cited by: [§2\.1](https://arxiv.org/html/2606.04454#S2.SS1.p2.1)\.
- S\. Xiong, Y\. Yang, A\. Payani, J\. C\. Kerce, and F\. Fekri \(2024b\) Teilp: time prediction over knowledge graphs via logical reasoning\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 38, pp\. 16112–16119\. Cited by: [§2\.1](https://arxiv.org/html/2606.04454#S2.SS1.p2.1)\.
- Y\. Yang, S\. Xiong, A\. Payani, E\. Shareghi, and F\. Fekri \(2024\) Harnessing the power of large language models for natural language to first\-order logic translation\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 6942–6959\. Cited by: [§1](https://arxiv.org/html/2606.04454#S1.p2.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\) Tree of thoughts: deliberate problem solving with large language models\. Advances in neural information processing systems 36, pp\. 11809–11822\. Cited by: [§1](https://arxiv.org/html/2606.04454#S1.p2.1)\.
- X\. Yao and B\. Van Durme \(2014\) Information extraction over structured data: question answering with freebase\. In Proceedings of the 52nd annual meeting of the association for computational linguistics \(volume 1: long papers\), pp\. 956–966\. Cited by: [§1](https://arxiv.org/html/2606.04454#S1.p2.1)\.
- X\. Zhang, A\. Bosselut, M\. Yasunaga, H\. Ren, P\. Liang, C\. D\. Manning, and J\. Leskovec \(2022\) Greaselm: graph reasoning enhanced language models for question answering\. arXiv preprint arXiv:2201\.08860\. Cited by: [§2\.3](https://arxiv.org/html/2606.04454#S2.SS3.p1.1)\.
- D\. Zhou, N\. Schärli, L\. Hou, J\. Wei, N\. Scales, X\. Wang, D\. Schuurmans, C\. Cui, O\. Bousquet, Q\. Le, et al\. \(2022\) Least\-to\-most prompting enables complex reasoning in large language models\. arXiv preprint arXiv:2205\.10625\. Cited by: [§2\.3](https://arxiv.org/html/2606.04454#S2.SS3.p1.1)\.
