Travel-Oriented Reasoning Large Language Model via Domain-Specific Knowledge Graphs

arXiv cs.CL 06/30/26, 04:00 AM Papers

travel-domain reasoning knowledge-graphs fine-tuning qwen3 domain-specific llm

Summary

This paper proposes a modular pipeline that uses a domain-specific knowledge graph to generate multi-hop QA pairs and fine-tune a reasoning LLM (Qwen3-4B) for the travel domain, achieving 82.4% exact match accuracy, significantly outperforming the baseline.

arXiv:2606.29254v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate broad reasoning abilities but struggle with accuracy and reliability in specialized domains such as travel, where reasoning depends on precise definitions, rules, and expert-defined conceptual frameworks, and where confident but unfounded outputs arise from a reasoning failure in which the model has not internalized the underlying domain graph rather than from missing domain knowledge alone. We propose a modular pipeline for building a travel-domain reasoning LLM grounded in an expert-designed knowledge graph (KG). Our pipeline integrates a travel KG that encodes domain entities and their relationships, a bottom-up construction procedure that walks the KG to produce multi-hop question answer (QA) pairs, a supervised fine-tuning stage that embeds the domain knowledge into a reasoning-capable LLM using the generated QA pairs as auditable reasoning traces, and a travel-domain benchmark dataset that measures the fine-tuned model's accuracy and calibration. We evaluate our approach using Qwen3-4B with LoRA adaptation. Our reasoning model achieves an $82.4\%$ exact match on the benchmark. This performance significantly outperforms the pretrained Qwen3-4B baseline at $22.4\%$. A calibration analysis decomposes the residual $17.57\%$ of errors into two distinct failure modes: an over-confident multi-label decoder that predicts both correct answers plus one spurious option on most dual-answer mistakes, and a smaller reasoning failure on single-answer questions where the supporting facts are present in the KG but the model fails to reconstruct the correct multi-hop path. This split confirms that explicit KG-grounded reasoning substantially improves the accuracy and uncertainty interpretation of LLMs in specialized domains, and isolates per-option calibration and trace-length-aware decoding as the next axes of improvement.

Cached at: 06/30/26, 05:30 AM

# Travel-Oriented Reasoning Large Language Model via Domain-Specific Knowledge Graphs
Source: [https://arxiv.org/html/2606.29254](https://arxiv.org/html/2606.29254)

###### Abstract\.

Large language models (LLMs) demonstrate broad reasoning abilities but struggle with accuracy and reliability in specialized domains such as travel, where reasoning depends on precise definitions, rules, and expert\-defined conceptual frameworks, and where confident but unfounded outputs arise from a reasoning failure in which the model has not internalized the underlying domain graph rather than from missing domain knowledge alone\. We propose a modular pipeline for building a travel\-domain reasoning LLM grounded in an expert\-designed knowledge graph (KG)\. Our pipeline integrates a travel KG that encodes domain entities and their relationships, a bottom\-up construction procedure that walks the KG to produce multi\-hop question answer (QA) pairs, a supervised fine\-tuning stage that embeds the domain knowledge into a reasoning\-capable LLM using the generated QA pairs as auditable reasoning traces, and a travel\-domain benchmark dataset that measures the fine\-tuned model’s accuracy and calibration\. We evaluate our approach using Qwen3\-4B with LoRA adaptation\. Our reasoning model achieves an 82\.4% exact match on the benchmark\. This performance significantly outperforms the pretrained Qwen3\-4B baseline at 22\.4%\. A calibration analysis decomposes the residual 17\.57% of errors into two distinct failure modes: an over\-confident multi\-label decoder that predicts both correct answers plus one spurious option on most dual\-answer mistakes, and a smaller reasoning failure on single\-answer questions where the supporting facts are present in the KG but the model fails to reconstruct the correct multi\-hop path\. This split confirms that explicit KG\-grounded reasoning substantially improves the accuracy and uncertainty interpretation of LLMs in specialized domains, and isolates per\-option calibration and trace\-length\-aware decoding as the next axes of improvement\.

knowledge graphs, large language models, domain\-specific reasoning, supervised fine\-tuning, travel domain

Conference: The 5th Workshop on Uncertainty Reasoning and Quantification in Decision Making; August 2026; Jeju, Korea\.

CCS Concepts: Computing methodologies → Artificial intelligence; Computing methodologies → Machine learning\.

\* These authors contributed equally to this work\.

## 1\.Introduction

Large language models (LLMs) demonstrate remarkable reasoning capabilities across general domains ([1](https://arxiv.org/html/2606.29254#bib.bib1); [2](https://arxiv.org/html/2606.29254#bib.bib2))\. They successfully follow logical steps, generate complex explanations, and perform multi\-hop reasoning\. We generally drive this progress by pretraining models on vast text corpora and fine\-tuning them on diverse instruction datasets\. However, when we apply these models to highly specific fields such as the travel domain, they frequently struggle to maintain accuracy and reliability\. This challenge arises because travel domain reasoning requires more than general logic\. It demands strict adherence to the precise definitions, rules, and conceptual frameworks that govern the domain\.

This discrepancy highlights a critical distinction between general reasoning and domain\-specific reasoning\. General reasoning relies on broad linguistic patterns, whereas domain reasoning depends on local ontologies, explicit rules, and contextual norms that dictate how we apply and interpret knowledge ([6](https://arxiv.org/html/2606.29254#bib.bib6))\. When we fail to ground LLMs in these domain\-specific structures, the models inevitably hallucinate\. Therefore, to improve travel domain reasoning, we must ground LLMs in authoritative travel knowledge and structured logic, ensuring we generate consistent, accurate, and trustworthy outputs\.

Currently, the standard methodology for teaching reasoning capabilities relies on a top\-down approach\. We typically expect models to learn general abstractions from massive collections of facts and statements using large\-scale pretraining, reinforcement learning, and inference\-time compute\. Unfortunately, this top\-down solution yields suboptimal learning efficiency in specialized areas like the travel domain\. High\-quality travel data is scarce, making grounded reasoning rooted in structured domain knowledge absolutely essential\.

To overcome these limitations, we propose a bottom\-up approach that starts with core travel knowledge and builds upward to develop advanced reasoning capabilities\. We first encode the domain’s glossary, rules, and data structures\. We then connect the model to reliable sources and tools that enforce factual grounding\. We implement a curriculum progression that guides the model from basic definitions to applied reasoning, teaching it to reason accurately step by step\. Ultimately, we force the LLM to transition from merely producing likely\-sounding answers to generating verifiable, rule\-consistent reasoning\. By actively grounding the models in foundational knowledge and structured logic, we successfully reduce hallucinations and transform them into reliable domain assistants\.

In this paper, we introduce a novel, bottom\-up reasoning framework comprising four deeply integrated modules, as shown in Fig\. [1](https://arxiv.org/html/2606.29254#S1.F1):

- Travel Domain Knowledge Graph (KG): We encode foundational travel domain information into a robust, structured knowledge graph\.
- Bottom\-Up Travel Knowledge Construction and Verification: We generate structured training data directly from the KG\. To do this, we traverse the graph, synthesize instructions, formulate multiple\-choice scenarios, and verify the factual accuracy of the data\.
- Bottom\-Up Curriculum Learning: We train the travel domain reasoning model by systematically feeding it knowledge acquired from the KG, gradually increasing the complexity of the reasoning tasks\.
- Travel Domain Benchmark: We provide a comprehensive new benchmark that evaluates both the factual knowledge and the specific reasoning capabilities of LLMs operating within the travel sector\.

The remainder of the paper proceeds as follows\. Section [2](https://arxiv.org/html/2606.29254#S2) reviews related work on reasoning in LLMs, KG\-grounded language models, and parameter\-efficient domain adaptation\. Section [3](https://arxiv.org/html/2606.29254#S3) describes the travel domain ontology and the structure of the knowledge graph\. Section [4](https://arxiv.org/html/2606.29254#S4) details our bottom\-up procedure for synthesizing and verifying multi\-answer multiple\-choice questions from KG paths\. Section [5](https://arxiv.org/html/2606.29254#S5) presents the supervised fine\-tuning setup of the Qwen3\-4B model\. Section [6](https://arxiv.org/html/2606.29254#S6) introduces the held\-out benchmark, defines the evaluation metrics, reports our results, and characterizes the model’s residual errors as a calibration problem on multi\-answer questions\. Finally, Section [7](https://arxiv.org/html/2606.29254#S7) concludes the paper\.

Figure 1\. An overview of training a reasoning model using domain\-specific knowledge graphs\. Stages shown in the diagram: Travel\-Domain Knowledge Graph (nodes: Object, Concept, Scenario, Action, Outcome; edges: Condition, Agent\_Action); Bottom\-Up Data Synthesis (representative node selection, path enumeration up to 10 hops); Verification Pipeline (structural, LLM\-based, and RAG\-based); Training set (multiple\-choice QA pairs); Curriculum SFT (Qwen3\-4B \+ LoRA); Fine\-tuned Reasoning LLM; Benchmark (multiple\-choice QA pairs); Evaluation (sample\-averaged EM / F1 / P / R)\. Evaluation errors inform KG updates through an LLM \+ human\-in\-the\-loop update\.

## 2\.Related Work

In this section, we review the prior work in three related areas: reasoning in large language models, unification of KGs and LLMs, and domain specific models through instruction fine\-tuning\.

Reasoning in large language models\. Chain\-of\-thought (CoT) prompting elicits multi\-step reasoning by asking models to verbalize intermediate steps ([1](https://arxiv.org/html/2606.29254#bib.bib1)), and models display such capabilities in a zero\-shot setting when prompted appropriately ([2](https://arxiv.org/html/2606.29254#bib.bib2))\. Self\-consistency ([3](https://arxiv.org/html/2606.29254#bib.bib3)) further improves reliability by sampling multiple reasoning paths and marginalizing over answers\. More recently, reasoning\-specialized models such as DeepSeek\-R1 ([4](https://arxiv.org/html/2606.29254#bib.bib4)) and the Qwen3 family ([5](https://arxiv.org/html/2606.29254#bib.bib5)) show that distilled or RL\-trained CoT can be internalized into smaller dense models\. Our work complements this line by grounding CoT traces in an authoritative domain KG rather than relying solely on model\-generated rationales, thereby reducing hallucinations in the specialized travel domain\.

Knowledge graphs and LLMs\. A growing body of work combines structured knowledge with neural language models\. Pan et al\. ([6](https://arxiv.org/html/2606.29254#bib.bib6)) provide a roadmap unifying KGs and LLMs along three axes: KG\-enhanced LLMs, LLM\-augmented KGs, and their synergistic use\. Retrieval\-augmented generation (RAG) ([7](https://arxiv.org/html/2606.29254#bib.bib7)) conditions generation on retrieved passages, while QA\-GNN ([8](https://arxiv.org/html/2606.29254#bib.bib8)) and related methods reason jointly over text and a KG subgraph\. Think\-on\-Graph ([9](https://arxiv.org/html/2606.29254#bib.bib9)) performs explicit beam search over a KG at inference time to support multi\-hop QA\. These approaches typically augment inference with external structure; in contrast, we use the KG offline to synthesize verified training data and distill graph\-grounded reasoning directly into model weights via SFT\.

Domain\-specific and instruction fine\-tuning\. Instruction tuning ([10](https://arxiv.org/html/2606.29254#bib.bib10)) and parameter\-efficient adaptation have become standard for aligning LLMs to specialized tasks\. LoRA ([11](https://arxiv.org/html/2606.29254#bib.bib11)) introduces low\-rank adapters that enable fine\-tuning of large models with minimal compute, and QLoRA ([12](https://arxiv.org/html/2606.29254#bib.bib12)) extends this to quantized backbones\. Domain specialization has yielded notable results in medicine (Med\-PaLM, [13](https://arxiv.org/html/2606.29254#bib.bib13)), mathematics (MAmmoTH, [14](https://arxiv.org/html/2606.29254#bib.bib14)), and code, typically by curating or synthesizing domain corpora\. In the same way, our pipeline adopts a two\-stage curriculum that moves from direct answers to reasoning\-enhanced answers\. What sets it apart is that each stage is generated from a maintained ontology, ensuring that every training example is structurally and logically valid by construction\.

## 3\.Travel Domain Knowledge Graph

In this section, we present the travel\-domain knowledge graph that grounds every training example and reasoning trace later in the pipeline\. We first describe the ontology that specifies how entities and their relations are represented, and we then describe how we populate that ontology from travel policy documents and the taxonomy that domain experts already use in practice\.

### 3\.1\.Ontology Design

We focus on travel policy documents, in particular cancellation policies\. We design a domain\-specific ontology that captures the logic travel experts use when reasoning about these policies\. We represent every entity uniformly as a node, and we let hierarchy emerge from the relations between entities rather than from predefined categories\. We then traverse and interpret the graph purely through those relations\. Each node therefore means only what its edges say it means\. The label *Property*, for instance, defaults to a real\-estate reading in general English, but in our graph it carries only the meaning given by its travel\-policy edges (the alternatives it offers, the amenities it houses, the refund conditions it satisfies), so that unrelated meaning never leaks in\. For the same reason, concepts that sound alike, such as *Compensation* and *Property Refund*, remain distinct because they are identified by their connections rather than by their labels\.

### 3\.2\.Travel Domain KG Construction

We derive the nodes and relationships from a careful reading of travel policy documents and the taxonomy used by domain experts\. Our goal is to enumerate every relevant object and record how it connects to the others\.

Node types capture the general category of an object and encode the role it plays in an interaction\. Each type fixes a specific kind of fact and constrains how we interpret the object in context\. For example, we classify “Email to vendor – Supplier waiver request – Customer has hotel approval – Pre\-travel” as an *Action* node because it names an event in a customer\-service interaction that can trigger subsequent steps, and we classify “Refund under the property, supplier waiving” as an *Outcome* node because it names a terminal state\. By distinguishing these types, the ontology tells us what plausibly precedes or follows a given node and gives the graph a coherent structural and temporal scaffolding\.

Edge types encode the specific relationships between nodes\. *Condition* edges specialize a scenario into a more specific scenario, and *Action* edges advance the interaction through an explicit step taken by the agent\. Fig\. [2](https://arxiv.org/html/2606.29254#S3.F2) shows a representative subgraph anchored at “Cancel Due to Poor Customer Service”\. A chain of *Condition* edges narrows the context to a pre\-stay TV\-amenity issue, and an *Action* edge then moves the interaction to “Property Offers Alternative”\. From there the traveler either accepts and the path terminates at “Get Alternative Option Set Up”, or refuses and triggers a compensation chain that eventually reaches “Refund, Property Waiving” or “Consult Relocations”\. Once the traveler reports that the TV does not work, the earlier *Condition* nodes drop out without changing the rest of the graph, so the graph stays agnostic to where reasoning starts and consistently drives every event toward an outcome\.
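To make the typed\-node, typed\-edge structure concrete, the following is a minimal sketch of how a fragment of this subgraph might be stored. The `KnowledgeGraph` class, its method names, the node\-type assignments, and the exact edge list are illustrative assumptions; only the node labels and the *Condition*/*Action* edge types come from the text, and the paper does not publish its schema.

```python
from dataclasses import dataclass, field

# Node and edge type names follow the paper; the container itself is hypothetical.
NODE_TYPES = {"Object", "Concept", "Scenario", "Action", "Outcome"}
EDGE_TYPES = {"Condition", "Action"}


@dataclass
class KnowledgeGraph:
    node_type: dict = field(default_factory=dict)  # node label -> node type
    edges: dict = field(default_factory=dict)      # node label -> [(edge type, target)]

    def add_node(self, label, ntype):
        if ntype not in NODE_TYPES:
            raise ValueError(f"unknown node type: {ntype}")
        self.node_type[label] = ntype
        self.edges.setdefault(label, [])

    def add_edge(self, src, etype, dst):
        if etype not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {etype}")
        if src not in self.node_type or dst not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[src].append((etype, dst))

    def successors(self, label):
        return [dst for _, dst in self.edges[label]]


# A simplified reading of part of Fig. 2 (type assignments are assumed).
kg = KnowledgeGraph()
for label, ntype in [
    ("TV Does Not Work", "Scenario"),
    ("Pre-Stay", "Scenario"),
    ("Property Offers Alternative", "Action"),
    ("Traveler Agrees to Alternative", "Action"),
    ("Get Alternative Option Set Up", "Outcome"),
]:
    kg.add_node(label, ntype)

kg.add_edge("TV Does Not Work", "Condition", "Pre-Stay")
kg.add_edge("Pre-Stay", "Action", "Property Offers Alternative")
kg.add_edge("Property Offers Alternative", "Action", "Traveler Agrees to Alternative")
kg.add_edge("Traveler Agrees to Alternative", "Action", "Get Alternative Option Set Up")
```

Because meaning lives in the edges, a traversal can start at any node (for example `"Pre-Stay"`) and still follow `successors` toward an outcome.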

We build the initial KG manually from the ontology and internal travel documents, but the underlying domain knowledge changes over time\. We therefore keep the graph current with a human\-in\-the\-loop pipeline in which an LLM proposes candidate triples from new internal documents and a domain expert accepts, edits, or rejects each proposal before we merge it back into the KG\.

Figure 2\. Local subgraph of the travel\-domain KG anchored at *Cancel Due to Poor Customer Service*, with node fill encoding ontology type (Scenario, Action, Outcome) and edge color separating *Condition* (blue) from *Action* (green) edges\. Nodes shown: Cancel Due to Poor Customer Service; Amenity Not Available; TV Does Not Work; Pre\-Stay; Property Offers Alternative; Traveler Agrees to Alternative; Traveler Refuses Alternative; Get Alternative Option Set Up; See if Property Will Offer Refund; Property Offers Refund; Traveler Accepts Refund; Traveler Refuses Refund; Refund, Property Waiving; Consult Relocations\.

## 4\.Bottom\-Up Knowledge Construction and Verification

We adopt a bottom\-up approach to knowledge construction and verification in the travel domain by synthesizing structured and contextual understanding from granular knowledge graph elements\. Instead of relying on predefined taxonomies, we learn by traversing node relationships, performing contextual reasoning, and generating instructions that produce interpretable and domain\-grounded knowledge\. Fig\. [2](https://arxiv.org/html/2606.29254#S3.F2) shows a local subgraph anchored at one such scenario\.

Representative Node Selection\. We begin by selecting representative starting nodes through balanced sampling across node types\. We exclude outcome nodes as starting points except for a small diagnostic subset\. We partition the graph into outcome and non\-outcome nodes, where outcome nodes represent terminal decisions and are identified either by explicit labeling or by the absence of outgoing edges\. We retain only those non\-outcome nodes that can reach at least one outcome node through a directed path, verified via reachability checks\. This filtering ensures that every sampled path corresponds to a valid decision trajectory\.
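The partition and reachability filter described above can be sketched as follows. This is a minimal illustration, assuming the graph is given as a plain adjacency dictionary; the function name `select_start_nodes` and the toy node labels are hypothetical, and the balanced sampling across node types is not shown.

```python
from collections import deque


def select_start_nodes(edges, labeled_outcomes=frozenset()):
    """Return (outcomes, valid_starts) for a directed graph {node: [successors]}."""
    nodes = set(edges) | {dst for dsts in edges.values() for dst in dsts}
    # Outcome nodes: explicitly labeled, or without outgoing edges.
    outcomes = {n for n in nodes if n in labeled_outcomes or not edges.get(n)}

    # Build the reversed graph so one BFS from all outcomes finds every
    # node that has a directed path to at least one outcome.
    reverse = {n: [] for n in nodes}
    for src, dsts in edges.items():
        for dst in dsts:
            reverse[dst].append(src)

    reachable, queue = set(outcomes), deque(outcomes)
    while queue:
        node = queue.popleft()
        for pred in reverse[node]:
            if pred not in reachable:
                reachable.add(pred)
                queue.append(pred)

    return outcomes, sorted(reachable - outcomes)


# Toy graph: C and D form a cycle that never reaches an outcome.
toy_edges = {"A": ["B"], "B": ["Refund"], "C": ["D"], "D": ["C"], "Refund": []}
outcomes, starts = select_start_nodes(toy_edges)
```

In the toy graph, `C` and `D` are dropped because no directed path leads from them to an outcome, so every remaining start node corresponds to a valid decision trajectory.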

Path Length Sampling\. We then sample path lengths to control the depth of reasoning\. For each valid start and outcome pair, we generate simple directed paths up to a maximum depth of ten hops\. We prioritize shorter paths and retain up to five paths per pair to balance diversity and efficiency\. The resulting paths capture both local and multi\-hop relationships\.
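As an illustration of this step, the sketch below enumerates simple directed paths between one start node and one outcome node, caps them at ten hops, and keeps the five shortest. The function name `enumerate_paths`, the tie\-breaking rule, and the toy graph are assumptions; the paper does not specify how it prioritizes among paths of equal length.

```python
def enumerate_paths(edges, start, outcome, max_hops=10, keep=5):
    """Simple directed paths from start to outcome, shortest first, at most `keep`."""
    found = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == outcome:
            found.append(path)
            continue  # stop extending once the outcome is reached
        if len(path) - 1 >= max_hops:
            continue  # hop budget exhausted
        for nxt in edges.get(node, []):
            if nxt not in path:  # simple path: never revisit a node
                stack.append((nxt, path + [nxt]))
    # Prefer shorter paths; break ties lexicographically for determinism.
    found.sort(key=lambda p: (len(p), p))
    return found[:keep]


toy_edges = {"S": ["A", "B"], "A": ["O"], "B": ["A", "O"], "O": []}
paths = enumerate_paths(toy_edges, "S", "O")
```

Exhaustive enumeration can grow exponentially on dense graphs, so the hop cap and the per\-pair limit are what keep the procedure tractable.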

Termination Criteria\. We terminate traversal when a path reaches an outcome node\. We enforce strict validity by retaining only paths that end at verified outcome nodes and by validating each step against the graph to ensure that every transition corresponds to a valid directed edge\.

Instruction Generation\. We convert each path into a natural language instruction by combining a query derived from the start node with varying levels of contextual information from intermediate nodes\. We generate multiple variants using diverse templates such as direct questions, scenario based prompts, and conditional formulations\. We control context at four levels ranging from no context to full context\. Each instruction includes a reasoning trace that narrates the traversal and a final answer derived from the terminal outcome node\.
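A minimal sketch of how a path might be rendered at one of the four context levels follows. The question template, the function name `build_instruction`, and the rule that level 0 reveals no intermediate nodes while level 3 reveals all of them are assumptions made for illustration; the paper does not list its templates or define the intermediate levels.

```python
def build_instruction(path, level):
    """Render a KG path as question, reasoning trace, and answer at context level 0..3."""
    if not 0 <= level <= 3:
        raise ValueError("context level must be between 0 and 3")
    start, *middle, outcome = path
    # Assumed mapping: reveal ceil(len(middle) * level / 3) intermediate nodes.
    shown = middle[: (len(middle) * level + 2) // 3]
    context = f" Context: {'; '.join(shown)}." if shown else ""
    question = f"A traveler reports: {start}.{context} What is the outcome?"
    return {
        "question": question,
        "reasoning": " -> ".join(path),  # trace narrates the full traversal
        "answer": outcome,               # answer comes from the terminal node
    }


example_path = [
    "TV Does Not Work",
    "Property Offers Alternative",
    "Traveler Agrees to Alternative",
    "Get Alternative Option Set Up",
]
no_context = build_instruction(example_path, 0)
full_context = build_instruction(example_path, 3)
```

The reasoning trace always covers the whole path, so lowering the context level makes the question harder without changing the supervision target.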

Multiple Choice Formulation\. We transform these instructions into multiple choice questions\. The correct answer corresponds to the outcome node, while distractors are selected from other outcome nodes using type aware and semantically informed criteria to ensure plausibility\. When multiple paths yield the same question with different valid outcomes, we consolidate them into multi\-answer questions\. We maintain an approximately balanced training set with similar proportions of questions that have one, two, or three correct answers\.
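The consolidation into multi\-answer questions can be sketched as below. The helper names `consolidate` and `to_multiple_choice`, the seeded shuffle used to vary answer positions, and the placeholder distractor texts are assumptions; distractor selection itself is taken as given here.

```python
import random
from collections import defaultdict

LETTERS = "abcde"


def consolidate(items):
    """Merge (question, outcome) pairs that share a question into one multi-answer item."""
    grouped = defaultdict(set)
    for question, outcome in items:
        grouped[question].add(outcome)
    return {question: sorted(outs) for question, outs in grouped.items()}


def to_multiple_choice(correct, distractors, seed=0):
    """Shuffle choice texts onto letters and return (choices, gold letters)."""
    texts = list(correct) + list(distractors)
    if len(texts) > len(LETTERS) or len(set(texts)) != len(texts):
        raise ValueError("need at most five distinct choice texts")
    random.Random(seed).shuffle(texts)  # seeded so the layout is reproducible
    choices = dict(zip(LETTERS, texts))
    gold = sorted(letter for letter, text in choices.items() if text in correct)
    return choices, gold


merged = consolidate([
    ("q1", "Refund, Property Waiving"),
    ("q1", "Consult Relocations"),
    ("q2", "Refund, Property Waiving"),
])
choices, gold = to_multiple_choice(
    merged["q1"],
    ["Placeholder distractor 1", "Placeholder distractor 2", "Placeholder distractor 3"],
)
```

Here `q1` becomes a dual\-answer question because two distinct paths reach different valid outcomes for the same question text.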

Verification and Quality Assurance\. We verify each (question, reasoning, answer) triplet for logical consistency and domain grounding\. We apply both model\-based and retrieval\-augmented verification\. The model checks that reasoning steps follow from the graph structure and that all structural constraints are satisfied, including valid answer formatting and consistency\. Graph\-level validation ensures that every reasoning trace exactly matches a valid path\. Retrieval\-augmented verification cross\-checks each instance against relevant domain documents to confirm correctness\. This pipeline ensures that the final dataset shown in Table [1](https://arxiv.org/html/2606.29254#S4.T1) is accurate, interpretable, and free of structural or logical errors\.
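Of the three checks, the graph\-level validation is the most mechanical, and a minimal sketch of it is given below. The function name `trace_matches_graph` and the adjacency\-dictionary representation are assumptions; the model\-based and retrieval\-augmented checks are not sketched because the paper does not describe their interfaces.

```python
def trace_matches_graph(edges, trace, outcomes):
    """True when `trace` follows directed edges only and ends at its first outcome node."""
    if not trace or trace[-1] not in outcomes:
        return False  # must terminate at a verified outcome node
    if any(node in outcomes for node in trace[:-1]):
        return False  # traversal should have stopped at the earlier outcome
    # Every consecutive pair must be a directed edge in the graph.
    return all(dst in edges.get(src, []) for src, dst in zip(trace, trace[1:]))


toy_edges = {"S": ["A"], "A": ["O"], "O": []}
toy_outcomes = {"O"}
valid = trace_matches_graph(toy_edges, ["S", "A", "O"], toy_outcomes)
skipped_hop = trace_matches_graph(toy_edges, ["S", "O"], toy_outcomes)
```

A trace that skips a hop, such as `S -> O` in the toy graph, is rejected even though both endpoints exist, which is what makes the retained traces exact copies of KG paths.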

Table 1\. Knowledge graph and synthesized dataset statistics\.

| Statistic | Count |
| --- | --- |
| Training QA pairs | 2,764 |
| 1\-correct (single\_answer) | 882 |
| 2\-correct (dual\_answer) | 882 |
| 3\-correct (multi\_answer) | 1,000 |
| Benchmark QA pairs (§[6](https://arxiv.org/html/2606.29254#S6)) | 888 |
| 1\-correct | 718 |
| 2\-correct | 163 |
| 3\-correct | 7 |

## 5\.Bottom\-Up Curriculum Learning

We fine\-tune two LoRA adapted variants of Qwen3\-4B, a four\-billion\-parameter causal language model, with adapters on both the attention and MLP projection modules\. Both variants start from the same pretrained base and train independently under matched hyperparameters, so the only differences between them are the training data and Qwen3’s thinking\-mode flag\. The direct\-answer variant disables thinking mode and trains on question and answer pairs alone\. The reasoning\-enhanced variant enables thinking mode and trains on the same questions augmented with KG\-derived reasoning traces\.

### 5\.1\.Direct\-Answer Training

For the direct\-answer variant, we apply Qwen3’s chat template with thinking mode disabled\. The target assistant turn is a single line of the form `FINAL ANSWER: <letters>`, where `<letters>` is a comma\-separated alphabetical list of one to three correct choice letters\. The training objective is next\-token cross\-entropy over that line\. This format forces the model to internalize domain reasoning implicitly, since it sees only the question, the choices, and the set of correct letters\. At inference the model emits only the answer letters, so the reasoning stays entirely inside the weights\.
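The answer\-line format lends itself to a small formatter and parser, sketched below. The function names `format_answer` and `parse_answer` and the strictness of the regular expression are assumptions; the paper specifies only the line format and the letter range.

```python
import re

VALID_LETTERS = set("abcde")

# Matches a trailing answer line such as "FINAL ANSWER: a, c".
_ANSWER_RE = re.compile(r"FINAL ANSWER:\s*([a-e](?:\s*,\s*[a-e])*)\s*$", re.IGNORECASE)


def format_answer(gold_letters):
    """Build the target line from one to three distinct choice letters."""
    letters = sorted(set(gold_letters))
    if not 1 <= len(letters) <= 3 or not set(letters) <= VALID_LETTERS:
        raise ValueError("expected one to three letters from a to e")
    return "FINAL ANSWER: " + ", ".join(letters)


def parse_answer(text):
    """Return the predicted letter set, or an empty set when nothing can be parsed."""
    match = _ANSWER_RE.search(text.strip())
    if not match:
        return set()
    return {letter.strip().lower() for letter in match.group(1).split(",")}


target_line = format_answer(["c", "a"])
parsed = parse_answer(target_line)
```

Returning an empty set on a parse failure keeps unparseable generations distinguishable from wrong but well\-formed answers during evaluation.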

### 5\.2\.Reasoning\-Enhanced Training

For the reasoning variant, we apply Qwen3’s chat template with thinking mode enabled\. The target assistant turn carries a `<think>` block with the KG\-derived reasoning trace, followed by the explicit path and the final answer line, as shown below\. The training objective is next\-token cross\-entropy over the full span from the opening `<think>` tag to the final answer letter\. We draw every target trace from the exact KG path that generated the training example, so the model’s reasoning stays structurally faithful to the source graph\.

```
<think> [Reasoning trace] </think>
Path 1: start -> intermediate -> ... -> outcome
FINAL ANSWER: a, c
```
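A target of this shape could be assembled from a KG path as in the following sketch. The function name `build_reasoning_target` is hypothetical, and numbering additional paths as `Path 2`, `Path 3`, and so on for multi\-answer questions is an assumption, since the template above shows only `Path 1`.

```python
def build_reasoning_target(trace_text, paths, gold_letters):
    """Assemble the assistant turn: think block, one line per KG path, answer line."""
    lines = [f"<think> {trace_text} </think>"]
    for index, path in enumerate(paths, start=1):
        lines.append(f"Path {index}: " + " -> ".join(path))
    lines.append("FINAL ANSWER: " + ", ".join(sorted(gold_letters)))
    return "\n".join(lines)


target = build_reasoning_target(
    "The TV does not work before the stay, the property offers an alternative, "
    "and the traveler agrees.",
    [[
        "TV Does Not Work",
        "Property Offers Alternative",
        "Traveler Agrees to Alternative",
        "Get Alternative Option Set Up",
    ]],
    {"b"},
)
```

Building the path line directly from the list of node labels keeps the textual target in lockstep with the graph path that produced the example.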

### 5\.3\.Training Dataset Properties

A defining characteristic of this dataset is that the same question text can appear with different choice sets and therefore different correct answer positions, by design, since the KG produces multiple valid outcome paths from the same starting scenario\. This means the model cannot memorize question\-to\-letter mappings; it must learn to read the specific choice texts and match them against domain knowledge, a much harder and more generalizable task\. With one to three correct answers per question and three to four carefully constructed distractors, the model must learn precise discrimination between correct policy outcomes and semantically plausible alternatives\.

We algorithmically draw hard negatives from nearby nodes in the knowledge graph\. This ensures that distractors are semantically plausible within the travel domain\. They represent real policy outcomes that could apply under different conditions but do not match the specific scenario\. Notably, many choice texts in the dataset appear as both a correct answer and a distractor across different questions, making rote memorization of “correct” or “incorrect” answer texts impossible\.
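One way to realize "nearby" is graph distance, as in the sketch below. Treating edges as undirected and ranking candidate outcomes by hop distance from the correct outcomes is an assumption made for illustration; the paper also mentions type\-aware and semantic criteria that are not modeled here, and the function name `hard_negatives` is hypothetical.

```python
from collections import deque


def hard_negatives(edges, outcomes, correct, k=4):
    """Pick up to k incorrect outcome nodes closest (in undirected hops) to the correct ones."""
    neighbors = {}
    for src, dsts in edges.items():
        for dst in dsts:
            neighbors.setdefault(src, set()).add(dst)
            neighbors.setdefault(dst, set()).add(src)

    # Multi-source BFS from all correct outcomes at once.
    dist = {node: 0 for node in correct}
    queue = deque(sorted(correct))
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)

    candidates = [o for o in outcomes if o not in correct and o in dist]
    return sorted(candidates, key=lambda o: (dist[o], o))[:k]


toy_edges = {
    "Property Offers Alternative": ["Traveler Agrees", "Traveler Refuses"],
    "Traveler Agrees": ["Get Alternative Option Set Up"],
    "Traveler Refuses": ["Refund, Property Waiving", "Consult Relocations"],
}
toy_outcomes = {"Get Alternative Option Set Up", "Refund, Property Waiving", "Consult Relocations"}
negatives = hard_negatives(toy_edges, toy_outcomes, {"Refund, Property Waiving"})
```

In the toy graph the sibling outcome "Consult Relocations" ranks first, which matches the intuition that the hardest distractors are outcomes that apply under only slightly different conditions.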

## 6\.Travel Domain Benchmark

To evaluate the trained travel\-domain reasoning LLM, we developed a domain\-specific benchmark designed to assess both factual knowledge and policy\-mapping capabilities\. The benchmark uses the held\-out validation split (888 samples) from the KG\-generated dataset with the following properties:

- We draw all 888 validation samples from the knowledge graph\.
- We balance the answer positions across all choice letters, preventing positional shortcut learning\.
- Each benchmark question draws its choices from a pool of 91 distinct Outcome\-node strings, 74 of which appear as both correct answer and distractor across different questions, so accuracy depends on truly understanding both the question and its candidate answers rather than on memorizing the mapping between them\.

### 6\.1\.Evaluation Metrics and Results

For each benchmark question $q$, let $Y_q \subseteq \{a,b,c,d,e\}$ be the set of gold letters and $\hat{Y}_q$ the model’s predicted set\. We compute four set\-based metrics per question, namely exact match (EM), precision ($P$), recall ($R$), and F1 score ($F1$), and we arithmetically average each metric across the $N=888$ benchmark questions to obtain sample\-averaged multi\-label scores\.

$$\text{EM}_q=\mathbf{1}[\hat{Y}_q=Y_q],\;P_q=\tfrac{|Y_q\cap\hat{Y}_q|}{|\hat{Y}_q|},\;R_q=\tfrac{|Y_q\cap\hat{Y}_q|}{|Y_q|},\;F1_q=\tfrac{2P_qR_q}{P_q+R_q} \tag{1}$$

with $P_q=0$ when $\hat{Y}_q=\emptyset$ and $F1_q=0$ when $P_q+R_q=0$\. Reported numbers are the means of these per\-question values over the benchmark\.
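The per\-question metrics and their sample average translate directly into code. The sketch below follows the definitions above, including the two zero conventions; the function names `set_metrics` and `sample_average` are hypothetical, and the gold set is assumed to be non\-empty, as it is for every benchmark question.

```python
def set_metrics(gold, pred):
    """Per-question EM, precision, recall, and F1 over sets of choice letters."""
    gold, pred = set(gold), set(pred)
    overlap = len(gold & pred)
    em = 1.0 if pred == gold else 0.0
    precision = overlap / len(pred) if pred else 0.0  # P = 0 for an empty prediction
    recall = overlap / len(gold)                      # gold is assumed non-empty
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return {"EM": em, "P": precision, "R": recall, "F1": f1}


def sample_average(pairs):
    """Arithmetic mean of each metric over (gold, pred) pairs."""
    rows = [set_metrics(gold, pred) for gold, pred in pairs]
    return {name: sum(row[name] for row in rows) / len(rows) for name in rows[0]}


# A dual-answer question where the model adds one spurious option.
superset_case = set_metrics({"a", "c"}, {"a", "c", "d"})
averaged = sample_average([({"a"}, {"a"}), ({"a", "c"}, {"a", "c", "d"})])
```

The superset case shows why EM and F1 diverge on multi\-answer questions: one spurious letter drops EM to 0 while precision stays at 2/3 and recall at 1, giving an F1 of 0\.8\.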

We compare three configurations: the pretrained Qwen3\-4B baseline with no fine\-tuning, a direct\-answer fine\-tuned model (LoRA SFT without reasoning traces, LR=5e\-5), and a reasoning\-enhanced fine\-tuned model (LoRA SFT with reasoning traces, LR=5e\-5)\. Table [2](https://arxiv.org/html/2606.29254#S6.T2) reports both overall results across the 888\-sample benchmark and a breakdown by the number of correct answers per question\.

Direct\-answer fine\-tuning improves overall exact match from 22\.4% to 66\.0%, and adding chain\-of\-thought reasoning boosts it further to 82\.4%, a \+16\.4pp gain from explicit reasoning alone\. The impact is strongest on single\-answer questions (93\.3% vs\. 76\.6%), where the model reconstructs the KG path step by step before answering\. On multi\-answer questions, reasoning improves F1 from 0\.560 to 0\.770 on dual\-answer and from 0\.705 to 0\.924 on triple\-answer questions\.

Table 2\. Benchmark test set results (888 samples)\. Best model per subset in bold\.

### 6\.2\.Calibration on Multi\-Answer Questions

Table [3](https://arxiv.org/html/2606.29254#S6.T3) decomposes the residual errors of all three models\. KG\-grounded reasoning closes most of the knowledge gap rather than reshaping it, but the reasoning model’s remaining 17\.57% of errors split into two distinct failure modes that respond to different fixes\.

The first is a residual reasoning failure on single\-answer questions\. The model is right 93\.3% of the time on this slice, and 47 of its 48 errors are disjoint, where the predicted letter has no overlap with the ground truth\. These disjoint predictions therefore do not reflect missing knowledge in the graph but a learning gap in the weights, where the model has not internalized the specific multi\-hop reasoning that connects the question to its correct answer\. Calibration does not help here, because the model is not over\-confident on a known answer\. It is confidently following the wrong reasoning path\.

The second is over\-confidence on dual\-answer questions\. The model is right only 35\.0% of the time, and 66 of its 106 errors (62\.3%) are strict supersets in which it predicts both correct answers plus one spurious option, the canonical signature of an over\-confident multi\-label decoder addressable by per\-option calibration such as temperature scaling or conformal thresholding\. Correct predictions traverse 2\.2 hops on average while errors traverse 10\.3, so a length\-aware stopping rule on the trace would catch the over\-traversal that produces these spurious extras\.
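The set\-shape categories used in this analysis can be computed with plain set comparisons, as in the sketch below. The labels `disjoint`, `strict_superset`, and `empty` correspond to terms used in the text; `exact`, `strict_subset`, and `partial_overlap` are assumed names for the remaining cases, and both function names are hypothetical.

```python
from collections import Counter


def set_shape(gold, pred):
    """Classify a predicted letter set by its relation to the gold set."""
    gold, pred = set(gold), set(pred)
    if not pred:
        return "empty"            # no letter could be parsed
    if pred == gold:
        return "exact"
    if pred > gold:
        return "strict_superset"  # all correct letters plus spurious extras
    if pred < gold:
        return "strict_subset"    # only some of the correct letters
    if pred & gold:
        return "partial_overlap"
    return "disjoint"             # no overlap with the gold letters


def decompose(pairs):
    """Percentage of predictions falling into each set shape."""
    counts = Counter(set_shape(gold, pred) for gold, pred in pairs)
    total = sum(counts.values())
    return {shape: 100.0 * n / total for shape, n in counts.items()}


shares = decompose([
    ({"a", "c"}, {"a", "c", "d"}),  # over-confident superset
    ({"a"}, {"b"}),                 # wrong reasoning path
    ({"a"}, {"a"}),                 # correct
    ({"b"}, set()),                 # unparseable output
])
```

Separating strict supersets from disjoint predictions is what lets the two failure modes be attributed to different causes, since only the former is a thresholding problem.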

Table 3\. Set\-shape decomposition (%) of predictions on the 888\-question multi\-answer benchmark\. SFT models trained at LR=5e\-5\. *Empty* counts predictions from which no $\{a,\dots,e\}$ letter could be parsed\.

## 7\.Conclusion

We built a travel\-domain reasoning LLM by grounding Qwen3\-4B in an expert\-designed knowledge graph, synthesizing verified multi\-hop QA pairs from KG traversals, and fine\-tuning with reasoning traces drawn from the same paths\. The reasoning\-enhanced model achieves a 60\.02 point EM gain over the pretrained baseline and a 16\.44 point gain over a direct\-answer variant trained on the same data, evidence that explicit KG\-grounded reasoning, not fine\-tuning alone, drives the improvement\. The residual 17\.57% errors split into 7\.5% multi\-answer over\-confidence and 6\.0% single\-answer reasoning failures\. We therefore plan to extend the framework with a calibrated multi\-label decoding head, a confidence\-thresholded stopping rule on the reasoning trace, and stronger supervision on the underrepresented single\-answer reasoning patterns\.

## References

- (1) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V\. Le, and Denny Zhou\. Chain\-of\-thought prompting elicits reasoning in large language models\. In *Advances in Neural Information Processing Systems*, 2022\.
- (2) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa\. Large language models are zero\-shot reasoners\. In *Advances in Neural Information Processing Systems*, 2022\.
- (3) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou\. Self\-consistency improves chain\-of\-thought reasoning in language models\. In *International Conference on Learning Representations*, 2023\.
- (4) DeepSeek\-AI\. DeepSeek\-R1: Incentivizing reasoning capability in LLMs via reinforcement learning\. *arXiv preprint arXiv:2501\.12948*, 2025\.
- (5) Qwen Team\. Qwen3 technical report\. *arXiv preprint arXiv:2505\.09388*, 2025\.
- (6) Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu\. Unifying large language models and knowledge graphs: A roadmap\. *IEEE Transactions on Knowledge and Data Engineering*, 36(7):3580–3599, 2024\.
- (7) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen\-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela\. Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. In *Advances in Neural Information Processing Systems*, 2020\.
- (8) Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec\. QA\-GNN: Reasoning with language models and knowledge graphs for question answering\. In *Proceedings of NAACL*, 2021\.
- (9) Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel M\. Ni, Heung\-Yeung Shum, and Jian Guo\. Think\-on\-Graph: Deep and responsible reasoning of large language model on knowledge graph\. In *International Conference on Learning Representations*, 2024\.
- (10) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L\. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe\. Training language models to follow instructions with human feedback\. In *Advances in Neural Information Processing Systems*, 2022\.
- (11) Edward J\. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen\-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen\. LoRA: Low\-rank adaptation of large language models\. In *International Conference on Learning Representations*, 2022\.
- (12) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer\. QLoRA: Efficient finetuning of quantized LLMs\. In *Advances in Neural Information Processing Systems*, 2023\.
- (13) Karan Singhal, Shekoofeh Azizi, Tao Tu, S\. Sara Mahdavi, Jason Wei, Hyung Won Chung, et al\. Large language models encode clinical knowledge\. *Nature*, 620:172–180, 2023\.
- (14) Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen\. MAmmoTH: Building math generalist models through hybrid instruction tuning\. In *International Conference on Learning Representations*, 2024\.
