TCAR-Gen: Temporal Graph Retrieval with Evidence Fusion for Knowledge-Grounded Generation

arXiv cs.CL Papers

Summary

TCAR-Gen proposes a framework combining query-conditioned graph neural networks, temporal evidence fusion, and chain-of-trees reasoning for temporal graph retrieval in knowledge-grounded generation. It achieves improved recall on the Victorian Crime Diaries benchmark across multiple query types.

arXiv:2606.00029v1 (announce type: new)

# TCAR-Gen: Temporal Graph Retrieval with Evidence Fusion for Knowledge-Grounded Generation
Source: [https://arxiv.org/html/2606.00029](https://arxiv.org/html/2606.00029)
Muhammad Noman Zahid², Rizwan Ahmed Khan³\*

¹ Dipartimento di Informatica, Università di Verona, Italy

² School of Advanced Studies, University of Camerino, Italy

³ Department of Computer Science, School of Mathematics and Computer Science, Institute of Business Administration \(IBA\), Karachi, Pakistan

\* Corresponding author: Rizwan Ahmed Khan \(email: rizwankhan@iba\.edu\.pk\)

###### Abstract

Retrieval\-augmented generation systems struggle with temporal reasoning and evidence fusion when answering complex questions over historical criminal case narratives\. Existing approaches either retrieve independently of query semantics or fail to integrate multiple evidence sources coherently\. We propose Temporal Context Augmented Retrieval Generation \(TCAR\-Gen\), a framework that combines query\-conditioned graph neural networks, temporal evidence fusion, and chain\-of\-trees reasoning to ground answer generation in retrieved evidence\. On the Victorian Crime Diaries benchmark, TCAR\-Gen achieves 0\.3738 Recall@5, outperforming Vanilla RAG, Temporal RAG, GraphRAG\-C, and GraphRAG\-T across seven query types including multi\-hop reasoning and counterfactual questions\. Ablation studies reveal that the context graph, temporal penalty mechanism, and query conditioning are critical components\. Cross\-model evaluation across five language models \(GPT\-OSS 20B to TinyLlama 1\.1B\) demonstrates that TCAR\-Gen maintains robust retrieval coverage at smaller model scales, though generation quality degrades substantially with reduced model capacity\. Our work shows that explicit temporal modelling and multi\-branch evidence fusion are essential for faithful, reasoning\-intensive question answering over knowledge\-grounded corpora\.

Keywords—LLMs, Knowledge Graphs, Graph Neural Networks, Domain\-specific Information, Text Prompt Generation, Explainable AI

## 1 Introduction

In recent years, Large Language Models \(LLMs\) have advanced natural language processing and demonstrated strong performance across a wide range of tasks, including summarization, machine translation, and question answering\. Their ability to generalize across tasks with limited task\-specific supervision has made them suitable for deployment in diverse application domains\[[1](https://arxiv.org/html/2606.00029#bib.bib1)\]\. However, reliability remains a central concern, particularly in applications where responses must be factually grounded, contextually precise, and open to verification\. A key limitation of LLMs is the generation of outputs that are linguistically fluent but factually unsupported or logically inconsistent with available evidence\. This behaviour, commonly referred to as hallucination, becomes critical in knowledge\-intensive and domain\-specific tasks, where correct answers depend on access to external information and on the ability to organise that information coherently\[[2](https://arxiv.org/html/2606.00029#bib.bib2)\]\. In such scenarios, knowledge stored within model parameters is often insufficient, especially when tasks require linking multiple pieces of evidence, resolving dependencies among entities or events, or reasoning over temporally distributed information\.

Retrieval\-augmented generation addresses this limitation by supplying LLMs with external textual evidence at inference time\[[3](https://arxiv.org/html/2606.00029#bib.bib3),[4](https://arxiv.org/html/2606.00029#bib.bib4)\]\. Although this approach improves factual grounding, most retrieval pipelines rely on similarity\-based matching between queries and isolated text chunks\. This design favours local semantic overlap over broader structural relationships, which results in evidence that is incomplete, fragmented, or poorly organised for downstream reasoning, particularly in multi\-hop scenarios involving interconnected facts\. Contextual retrieval methods extend this approach by enriching document chunks with surrounding document\-level information before indexing or ranking\. This improves interpretability by situating each chunk within its local context\. Even so, this strategy remains limited when meaning depends on explicit relationships among entities, events, and temporally ordered evidence rather than on textual proximity alone\. Knowledge Graphs \(KGs\) provide a natural representation for such structure by organising information through entities and relations\. Graph Neural Networks \(GNNs\) complement this representation by learning over relational data through neighbourhood aggregation and higher\-order dependency modelling\[[5](https://arxiv.org/html/2606.00029#bib.bib5),[6](https://arxiv.org/html/2606.00029#bib.bib6),[7](https://arxiv.org/html/2606.00029#bib.bib7)\]\. These developments indicate that effective question answering in complex settings requires preserving the structural and temporal organisation of evidence rather than relying solely on text\-based retrieval\.

The present work addresses this requirement by formulating retrieval and generation as a unified reasoning process\. Instead of treating retrieval as a separate preprocessing step, the proposed framework integrates contextual, relational, and temporal signals into a single inference pipeline\. This design combines contextual chunk enrichment, query\-conditioned graph construction, temporal encoding, and multi\-branch reasoning to support evidence selection and answer generation in a coordinated manner\. The contributions of this work are threefold\. First, it proposes a context\-aware retrieval framework that integrates document\-level enrichment with query\-conditioned graph construction and temporal modelling\. Second, it introduces a reasoning pipeline that combines structured evidence retrieval with multi\-branch inference grounded in relational and temporal constraints\. Third, it provides a comprehensive empirical evaluation that examines retrieval effectiveness, generation faithfulness, and the contribution of individual components across multiple model scales\.

The remainder of this paper is organised as follows\. Section [2](https://arxiv.org/html/2606.00029#S2) reviews related work on retrieval\-augmented generation, graph\-based reasoning, temporal modelling, and multi\-step inference\. Section [3](https://arxiv.org/html/2606.00029#S3) introduces the proposed framework and its components, including context graph construction, hybrid retrieval, and reasoning mechanisms\. Section [4](https://arxiv.org/html/2606.00029#S4) describes the dataset, evaluation protocol, and baselines\. Section [5](https://arxiv.org/html/2606.00029#S5) presents and analyses the results through comparative evaluation, ablation studies, and scaling behaviour\. The paper concludes with a discussion of findings, limitations, and directions for future research\.

## 2 Related Work

Retrieval\-augmented generation \(RAG\) improves the factual grounding of large language models by combining parametric generation with external non\-parametric knowledge\. Early work showed that retrieval at inference time improves performance on knowledge\-intensive tasks by reducing dependence on information stored only in model parameters\[[8](https://arxiv.org/html/2606.00029#bib.bib8)\]\. Later studies showed that retrieval quality remains central to downstream generation, since more relevant and informative context improves both faithfulness and accuracy\[[4](https://arxiv.org/html/2606.00029#bib.bib4),[3](https://arxiv.org/html/2606.00029#bib.bib3),[9](https://arxiv.org/html/2606.00029#bib.bib9)\]\. Recent methods have moved retrieval closer to the decoding process itself, which allows generation to use retrieved evidence during multi\-step reasoning\[[10](https://arxiv.org/html/2606.00029#bib.bib10)\]\. This line of work establishes retrieval quality as a core factor in reliable generation\.

Most RAG systems still rely on semantic similarity between a query and isolated text chunks\. This strategy retrieves locally relevant passages effectively, but it often fails to preserve document\-level context and does not handle multi\-hop reasoning well when evidence is distributed across related passages\. Contextualised retrieval methods partly address this issue by enriching chunks with surrounding document information\. Even so, these methods remain largely text\-centric and do not explicitly represent relationships between pieces of evidence\. The limitation is therefore not only retrieval coverage but also the absence of structured evidence modelling\.

Graph\-based retrieval addresses this limitation by representing evidence through explicit relations\. Early systems such as GRAFT\-Net\[[11](https://arxiv.org/html/2606.00029#bib.bib11)\] and PullNet\[[12](https://arxiv.org/html/2606.00029#bib.bib12)\] showed that question\-specific subgraphs constructed from text and structured knowledge improve multi\-hop reasoning\. These studies also showed that retrieval and reasoning are more effective when treated as joint and iterative processes rather than as separate stages\. Later work extended this idea to large language models\. KGLLM links entity mentions in a query to an external knowledge graph, extracts surrounding subgraphs, linearises them into natural language, and uses this evidence to ground generation and re\-rank candidate outputs for factual consistency\[[13](https://arxiv.org/html/2606.00029#bib.bib13)\]\. This direction shows that grounding generation in verified external evidence reduces hallucination more effectively than reliance on parametric memory alone\.

GraphRAG\[[14](https://arxiv.org/html/2606.00029#bib.bib14)\] organises retrieved evidence into graph structures to support coherent synthesis over large corpora, whereas G\-Retriever\[[15](https://arxiv.org/html/2606.00029#bib.bib15)\] performs retrieval directly over textual graphs\. Structured graph traversal has also improved factual grounding in systems that decompose complex queries into sub\-questions before reasoning\[[16](https://arxiv.org/html/2606.00029#bib.bib16)\]\. RDPG extends this approach through adaptive path generation, where an LLM iteratively explores a knowledge graph, revises candidate paths, and integrates recovered paths into a chain\-of\-thought prompt for final answer generation\[[17](https://arxiv.org/html/2606.00029#bib.bib17)\]\. Similar graph\-based retrieval has also improved reliability and interpretability in domain\-specific tasks such as autonomous driving\[[18](https://arxiv.org/html/2606.00029#bib.bib18)\]\. Multi\-level graph representations further improve reasoning by preserving both global and user\-specific structure during inference\[[19](https://arxiv.org/html/2606.00029#bib.bib19)\]\. At the level of evidence selection, SIBR formulates subgraph extraction as an information bottleneck problem and produces compact evidence sets by suppressing irrelevant neighbourhood structure\[[20](https://arxiv.org/html/2606.00029#bib.bib20)\]\. GS\-KGC follows a related direction by extracting a local subgraph around a query entity, serialising it into natural language, and combining it with chain\-of\-thought reasoning and post\-generation consistency checking\[[21](https://arxiv.org/html/2606.00029#bib.bib21)\]\. These studies show that faithful generation depends more on compact and relevant evidence than on retrieval volume alone\.

Relational structure, however, is not sufficient for many real\-world reasoning tasks\. Many queries depend not only on which entities are connected but also on when events occur and in what order\. Temporal graph learning addresses this requirement by incorporating time directly into representation learning\. TGAT models continuous\-time dynamic graphs through time\-aware attention mechanisms and established a strong basis for temporal graph reasoning\[[22](https://arxiv.org/html/2606.00029#bib.bib22)\]\. Later work introduced more explicit reasoning constraints\. An iterative logic\-guided framework combines mined temporal rules with temporal graph attention and uses temporal consistency checks to remove candidates that violate ordering constraints\[[23](https://arxiv.org/html/2606.00029#bib.bib23)\]\. This result shows that temporal reasoning benefits from interaction between symbolic constraints and neural representations\.

Temporal knowledge graph research has since expanded this idea across retrieval, reasoning, and evidence selection\. Some approaches convert spatiotemporal graph data into natural language to support multi\-hop reasoning over temporal entity networks\[[24](https://arxiv.org/html/2606.00029#bib.bib24)\]\. DyMemR introduces a dynamic memory pool that retains only relevant historical quadruples and shows that selective memory is more effective than indiscriminate accumulation\[[25](https://arxiv.org/html/2606.00029#bib.bib25)\]\. TiPNN reasons over temporal paths rather than entity embeddings and transfers to entities not observed during training by encoding relation sequences and time gaps\[[26](https://arxiv.org/html/2606.00029#bib.bib26)\]\. PCRS addresses sparse temporal knowledge graphs through path completion and reinforcement learning, with an explicit temporal consistency filter to enforce chronological validity\[[27](https://arxiv.org/html/2606.00029#bib.bib27)\]\. HIPNet separates short\-term and long\-term temporal structure through dual encoders and dynamically balances the two streams according to interaction frequency\[[28](https://arxiv.org/html/2606.00029#bib.bib28)\]\. HGCT assigns importance weights to historical facts through time\-aware attention and combines this with temporal convolution to capture local dynamics and global periodicity\[[29](https://arxiv.org/html/2606.00029#bib.bib29)\]\. These models show that temporal evidence must be filtered, weighted, and organised according to relevance and chronology rather than accumulated without structure\.

A related line of work strengthens temporal reasoning through semantic integration and explicit interpretability\. Text\-enhanced temporal models combine structural quadruples with contextual mention embeddings and improve performance especially for entities with sparse graph connections\[[30](https://arxiv.org/html/2606.00029#bib.bib30)\]\. Other frameworks combine language models with temporal graph encoders to improve inductive extrapolation to previously unseen entities\[[31](https://arxiv.org/html/2606.00029#bib.bib31)\]\. Interpretability has also become more prominent\. Hybrid rule\-based models combine mined temporal rules with learned embeddings and demonstrate zero\-shot transferability across related datasets\[[32](https://arxiv.org/html/2606.00029#bib.bib32)\]\. Dynamic rule validity has also been modelled directly through LLM\-based temporal reasoning, where rules are activated only when their temporal interval is compatible with the query\[[33](https://arxiv.org/html/2606.00029#bib.bib33)\]\. Other approaches show that historical trends, meta\-learning, and reinforcement learning further improve temporal generalisation and rule induction\[[34](https://arxiv.org/html/2606.00029#bib.bib34),[35](https://arxiv.org/html/2606.00029#bib.bib35),[36](https://arxiv.org/html/2606.00029#bib.bib36)\]\. Explainable frameworks now construct local temporal subgraphs, rank relation paths, and produce auditable natural language explanations tied to specific historical events\[[37](https://arxiv.org/html/2606.00029#bib.bib37)\]\. Temporal retrieval has also supported zero\-shot reasoning for time\-sensitive queries\[[38](https://arxiv.org/html/2606.00029#bib.bib38)\]\. Constraint\-aware temporal question answering adds a further layer by extracting syntactic and semantic time constraints from questions before pruning the candidate answer space\[[39](https://arxiv.org/html/2606.00029#bib.bib39),[40](https://arxiv.org/html/2606.00029#bib.bib40)\]\. 
These results show that temporal compatibility must shape both evidence selection and downstream reasoning\.

Work on large language model reasoning provides a parallel development\. Chain\-of\-thought prompting improves performance by encouraging models to produce intermediate reasoning steps\[[41](https://arxiv.org/html/2606.00029#bib.bib41)\]\. Self\-consistency further improves robustness by aggregating multiple reasoning paths\[[42](https://arxiv.org/html/2606.00029#bib.bib42)\]\. These methods improve inference on complex tasks, but they operate mainly at the level of text generation\. They do not by themselves ensure that intermediate reasoning remains grounded in structured external evidence\. The gain in reasoning quality therefore does not automatically translate into faithful knowledge\-intensive generation\.

![Refer to caption](https://arxiv.org/html/2606.00029v1/Fig3.png) Figure 1: Schematic overview of prior work \(covering text\-based RAG, graph\-based reasoning, temporal KG, and LLM inference\) and the positioning of TCAR\-Gen\.

The existing literature shows clear progress across retrieval\-augmented generation, graph\-based reasoning, temporal modelling, and multi\-step inference\. These directions, however, have mostly developed separately\. RAG systems improve factual grounding but still emphasise semantic relevance over relational structure\. Graph\-based methods improve structural reasoning but often rely on static representations and do not include document\-level contextual enrichment\. Temporal graph approaches model dynamic dependencies effectively, but they are rarely integrated into LLM\-based retrieval pipelines\. Even recent efforts that combine temporal knowledge graphs with language models focus on reasoning in isolation rather than on a unified retrieval\-generation pipeline\[[37](https://arxiv.org/html/2606.00029#bib.bib37),[33](https://arxiv.org/html/2606.00029#bib.bib33),[32](https://arxiv.org/html/2606.00029#bib.bib32)\]\. Multi\-step reasoning methods improve inference quality, but they also lack explicit grounding in structured evidence\.

A clear gap therefore remains for unified frameworks that combine document\-level contextual enrichment, query\-conditioned graph construction, relational and temporal dependency modelling, and multi\-branch reasoning over structured evidence, as illustrated in Figure [1](https://arxiv.org/html/2606.00029#S2.F1)\. The present work addresses this gap by treating retrieval and generation as a context\-aware, graph\-structured, and temporally grounded reasoning process within a single integrated pipeline\.

## 3 Methodology

![Refer to caption](https://arxiv.org/html/2606.00029v1/Fig1.png) Figure 2: Overview of the proposed Temporal Context Augmented Retrieval Generation \(TCAR\-Gen\) framework for temporally grounded, graph\-aware retrieval and reasoning\.

Figure [2](https://arxiv.org/html/2606.00029#S3.F2) presents the overall architecture of the proposed Temporal Context Augmented Retrieval Generation \(TCAR\-Gen\) framework\. The method is designed for evidence\-grounded reasoning over temporally evolving case narratives by coupling a query\-specific context graph with graph\-aware retrieval and explicit multi\-path reasoning\. Given a natural\-language query $q$, TCAR\-Gen comprises six stages: *\(i\)* query\-specific context graph construction, *\(ii\)* temporal graph encoding with query\-conditioned attention, *\(iii\)* hybrid evidence retrieval, *\(iv\)* evidence\-grounded prompt construction, *\(v\)* Chain\-of\-Trees reasoning, and *\(vi\)* path scoring and decision fusion\. The framework is designed so that retrieval, reasoning, and verification are performed over a shared structured representation rather than as independent steps\.

Formally, let $\mathcal{G}=(\mathcal{V},\mathcal{E})$ denote the case knowledge graph, where each edge is associated with a relation type and, when available, a temporal marker\. Given a query $q$, the objective is to generate an answer $y$ supported by a ranked evidence set $\mathcal{C}_q$ and an interpretable reasoning trace $\mathcal{P}_q$\. Unlike text\-only retrieval pipelines, TCAR\-Gen first induces a query\-conditioned subgraph and then uses that subgraph to guide both evidence selection and downstream reasoning\.

### 3\.1 Query\-specific context graph construction

The first stage constructs a compact context graph $\mathcal{G}_{\mathrm{ctx}}(q)$ from the global case graph\. The purpose of this step is to restrict subsequent inference to entities, events, and narrative fragments that are likely to be relevant to the query\.

#### 3\.1\.1 Seed extraction and query grounding

Given a query $q$, a query parser extracts seed mentions such as case titles, suspect names, victim references, locations, temporal expressions, and crime descriptors\. These mentions are normalized and aligned to graph entities using lexical matching and metadata\-aware grounding rules\. Let

$$\mathcal{S}(q)=\{s_1,s_2,\ldots,s_m\} \tag{1}$$

denote the resulting seed set, where each $s_i\in\mathcal{V}$ corresponds to a graph node grounded in the query\.

#### 3\.1\.2 Multi\-hop context expansion

Starting from $\mathcal{S}(q)$, the framework performs a bounded multi\-hop expansion over $\mathcal{G}$ to collect neighboring entities, events, and chunk\-linked evidence nodes\. Expansion is constrained by a maximum hop depth and a node budget in order to control graph size\. The resulting query\-specific subgraph is defined as

$$\mathcal{G}_{\mathrm{ctx}}(q)=\big(\mathcal{V}_q,\mathcal{E}_q\big) \tag{2}$$

where $\mathcal{V}_q$ contains the seed nodes together with admissible neighbors, and $\mathcal{E}_q$ contains the relations among those nodes\.

The induced subgraph includes four types of evidence\-bearing components: entity nodes, event nodes, inter\-entity relations, and narrative chunks attached to cases or events\. This representation provides the structured context used in later retrieval and reasoning stages\.
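The bounded expansion described in Section 3\.1\.2 can be illustrated with a breadth\-first traversal\. This is a minimal sketch, not the authors' implementation: the adjacency\-list layout, the function name `expand_context`, the default values of `max_hops` and `node_budget`, and the choice to follow only outgoing edges are all assumptions made for illustration\.

```python
from collections import deque


def expand_context(adjacency, seeds, max_hops=2, node_budget=50):
    """Breadth-first expansion from seed nodes, bounded by hop depth and node count.

    adjacency: {node: [(relation, neighbor), ...]} (hypothetical layout, outgoing edges).
    Returns (nodes, edges), a stand-in for (V_q, E_q) in Eq. (2).
    """
    visited = {}  # node -> hop distance from the nearest seed
    queue = deque()
    for s in seeds:
        if s not in visited and len(visited) < node_budget:
            visited[s] = 0
            queue.append(s)
    while queue:
        node = queue.popleft()
        depth = visited[node]
        if depth == max_hops:
            continue  # hop limit reached: do not expand further from this node
        for _relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited and len(visited) < node_budget:
                visited[neighbor] = depth + 1
                queue.append(neighbor)
    nodes = set(visited)
    # Keep only relations whose two endpoints were both admitted.
    edges = [(u, rel, v) for u in nodes for rel, v in adjacency.get(u, []) if v in nodes]
    return nodes, edges


toy_graph = {
    "case:A": [("involves", "person:X"), ("has_chunk", "chunk:1")],
    "person:X": [("witness_in", "case:B")],
    "case:B": [("has_chunk", "chunk:2")],
}
nodes, edges = expand_context(toy_graph, ["case:A"], max_hops=2)
```

In this toy graph `chunk:2` lies three hops from the seed, so it is excluded when `max_hops=2`\. Lowering `node_budget` truncates the expansion earlier, in visiting order\.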

### 3\.2 Temporal graph encoding with query\-conditioned attention

The context graph narrows the search space, but not all nodes and relations contribute equally to answering a given query\. TCAR\-Gen therefore applies a temporal graph encoder followed by query\-conditioned attention in order to estimate the relevance of graph components\. The four types of evidence\-bearing components of the induced subgraph \(entity nodes, event nodes, inter\-entity relations, and narrative chunks attached to cases or events\) are illustrated in Fig\. [3](https://arxiv.org/html/2606.00029#S3.F3)\.

![Refer to caption](https://arxiv.org/html/2606.00029v1/Fig2.png) Figure 3: Global temporal knowledge graph and query\-specific subgraph with query\-conditioned attention\. The top panel shows the full Victorian Crime Diaries knowledge graph with typed entities, relations, and temporal links across cases\. The bottom panel shows the query\-induced subgraph for a poisoning–fraud comparison query, where node relevance scores \($\beta$\) highlight the most important entities and retrieved evidence chunks used for reasoning\.

#### 3\.2\.1 Node initialization

Each node $v\in\mathcal{V}_q$ is assigned an initial feature representation

$$\mathbf{h}^{(0)}_v=\big[\mathbf{x}_v\;\|\;\boldsymbol{\phi}_v\big] \tag{3}$$

where $\mathbf{x}_v$ is the semantic embedding of the node text or metadata and $\boldsymbol{\phi}_v$ is an optional type\-specific feature vector\. Relation types are mapped to trainable embeddings $\mathbf{r}_{uv}$, and temporal gaps associated with edges are encoded as continuous vectors\.

#### 3\.2\.2 Temporal message passing

For each node $v$, message passing aggregates information from its temporal neighborhood\. Let $\mathcal{N}(v)$ denote the set of incoming neighbors of $v$\. The layer\-wise update is defined as

$$\mathbf{h}^{(\ell+1)}_v=\sigma\!\left(\mathbf{W}_0\mathbf{h}^{(\ell)}_v+\sum_{u\in\mathcal{N}(v)}\alpha^{(\ell)}_{uv}\,\mathbf{W}_1\left[\mathbf{h}^{(\ell)}_u\;\|\;\boldsymbol{\psi}(\Delta t_{uv})\;\|\;\mathbf{r}_{uv}\right]\right) \tag{4}$$

where $\boldsymbol{\psi}(\Delta t_{uv})$ denotes a temporal encoding of the time difference $\Delta t_{uv}$, $\alpha^{(\ell)}_{uv}$ is a neighbor\-attention coefficient, $\sigma(\cdot)$ is a non\-linear activation function, and $\mathbf{W}_0,\mathbf{W}_1$ are trainable projection matrices\.
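To make the update in Eq\. \(4\) concrete, the following pure\-Python sketch applies one layer to a two\-node toy graph\. Several pieces are assumptions, since the text does not specify them: the sinusoidal form of the temporal encoding in `time_encoding`, ReLU as the activation, and attention coefficients `alpha` supplied as precomputed inputs rather than learned\. The toy weights are chosen only so the result is easy to verify by hand\.

```python
import math


def time_encoding(delta_t, dim=4):
    """Hypothetical sinusoidal encoding psi(delta_t); the exact form is an assumption."""
    enc = []
    for k in range(dim // 2):
        freq = 1.0 / (10.0 ** k)
        enc.extend([math.sin(freq * delta_t), math.cos(freq * delta_t)])
    return enc


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def message_passing_layer(h, in_edges, alpha, rel_emb, W0, W1):
    """One application of Eq. (4), using ReLU as sigma (an assumption).

    h: {node: vector}; in_edges: {v: [(u, delta_t, relation), ...]};
    alpha: {(u, v): coefficient}; rel_emb: {relation: vector}.
    """
    new_h = {}
    for v, h_v in h.items():
        total = matvec(W0, h_v)  # self term W0 h_v
        for u, delta_t, rel in in_edges.get(v, []):
            # Concatenate [h_u || psi(delta_t) || r_uv] before projecting with W1.
            msg = matvec(W1, h[u] + time_encoding(delta_t) + rel_emb[rel])
            a = alpha[(u, v)]
            total = [t + a * m for t, m in zip(total, msg)]
        new_h[v] = [max(0.0, t) for t in total]
    return new_h


# Toy setup: W0 is the identity and W1 copies only the h_u part of the message.
h = {"a": [1.0, 2.0], "b": [0.5, -3.0]}
in_edges = {"b": [("a", 3.0, "precedes")]}
alpha = {("a", "b"): 1.0}
rel_emb = {"precedes": [0.1, 0.2]}
W0 = [[1.0, 0.0], [0.0, 1.0]]
W1 = [[1.0] + [0.0] * 7, [0.0, 1.0] + [0.0] * 6]
h_next = message_passing_layer(h, in_edges, alpha, rel_emb, W0, W1)
```

With these toy weights, node `b` receives its own state plus the state of `a`, and the negative component is clipped by the activation\.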

#### 3\.2\.3 Query\-conditioned attention pooling

After $L$ temporal graph layers, the model computes a query\-conditioned attention distribution over node representations\. Let $\mathbf{q}$ be the query embedding\. For each node $v$, the query relevance score is

$$\beta_v=\frac{\exp\big(\mathbf{q}^{\top}\mathbf{W}_a\mathbf{h}^{(L)}_v\big)}{\sum_{u\in\mathcal{V}_q}\exp\big(\mathbf{q}^{\top}\mathbf{W}_a\mathbf{h}^{(L)}_u\big)} \tag{5}$$

The graph summary vector is then computed as

$$\mathbf{g}_{\mathrm{ctx}}=\sum_{v\in\mathcal{V}_q}\beta_v\mathbf{h}^{(L)}_v \tag{6}$$

The attention weights $\{\beta_v\}$ are subsequently reused as node\-level relevance signals during retrieval and path evaluation\.
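Equations \(5\) and \(6\) amount to a softmax over bilinear query\-node scores followed by a weighted sum\. The sketch below is illustrative only: the function name `query_attention` and the toy vectors are assumptions, and subtracting the maximum logit is a standard numerical\-stability step that leaves the softmax unchanged\.

```python
import math


def query_attention(q, W_a, h):
    """Return (beta, g_ctx): softmax relevance per node (Eq. 5) and the summary vector (Eq. 6).

    q: query embedding; W_a: matrix as a list of rows; h: {node: final-layer vector}.
    """
    nodes = list(h)
    logits = []
    for v in nodes:
        Wh = [sum(w * x for w, x in zip(row, h[v])) for row in W_a]  # W_a h_v
        logits.append(sum(qi * whi for qi, whi in zip(q, Wh)))        # q^T W_a h_v
    m = max(logits)  # shift by the max logit for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    beta = {v: e / z for v, e in zip(nodes, exps)}
    dim = len(h[nodes[0]])
    g_ctx = [sum(beta[v] * h[v][i] for v in nodes) for i in range(dim)]
    return beta, g_ctx


beta, g_ctx = query_attention(
    q=[1.0, 0.0],
    W_a=[[1.0, 0.0], [0.0, 1.0]],
    h={"a": [2.0, 0.0], "b": [0.0, 1.0]},
)
```

Here node `a` aligns with the query direction and therefore receives the larger weight\. The weights sum to one by construction\.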

### 3\.3 Hybrid semantic–graph–temporal retrieval

Retrieval is performed over narrative chunks linked to cases, events, and entities\. Instead of relying on semantic similarity alone, TCAR\-Gen assigns each candidate chunk a composite score that combines semantic relevance, graph relevance, and temporal compatibility\.

For a candidate chunk $c$, the final retrieval score is defined as

$$\mathrm{Score}(c\mid q)=\lambda_s\,\mathrm{Sim}(\mathbf{q},\mathbf{e}_c)+\lambda_g\,\mathrm{GraphRel}(c\mid\mathcal{G}_{\mathrm{ctx}},\boldsymbol{\beta})+\lambda_t\,\mathrm{TimeAlign}(c\mid q) \tag{7}$$

where $\lambda_s+\lambda_g+\lambda_t=1$\.

#### 3\.3\.1 Semantic relevance

The semantic term is computed using cosine similarity:

$$\mathrm{Sim}(\mathbf{q},\mathbf{e}_c)=\frac{\mathbf{q}^{\top}\mathbf{e}_c}{\lVert\mathbf{q}\rVert\,\lVert\mathbf{e}_c\rVert} \tag{8}$$

where $\mathbf{e}_c$ denotes the embedding of chunk $c$\.

#### 3\.3\.2 Graph relevance

Each chunk is associated with one or more nodes in the context graph through entity mentions, event references, or case\-level links\. The graph relevance term is defined as

$$\mathrm{GraphRel}(c\mid\mathcal{G}_{\mathrm{ctx}},\boldsymbol{\beta})=\max_{v\in\mathrm{Nodes}(c)}\beta_v \tag{9}$$

where $\mathrm{Nodes}(c)$ denotes the set of context\-graph nodes linked to chunk $c$\. This term prioritizes chunks attached to highly relevant graph regions\.

#### 3\.3\.3 Temporal alignment

To account for chronology, a temporal compatibility score is computed between the query time window and the time interval associated with the chunk\. Let $I_q$ and $I_c$ denote these intervals\. Their alignment is measured as

$$\mathrm{TimeAlign}(c\mid q)=\frac{|I_q\cap I_c|}{|I_q\cup I_c|} \tag{10}$$

When explicit timestamps are unavailable, the method uses coarser temporal cues derived from case metadata or narrative order\.

The top\-$K$ chunks ranked by Eq\. \([7](https://arxiv.org/html/2606.00029#S3.E7)\) form the evidence set

$$\mathcal{C}_q=\{c_1,c_2,\ldots,c_K\} \tag{11}$$
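The composite score of Eq\. \(7\) and its three terms \(Eqs\. 8 to 10\) can be combined into a small ranking routine\. This is a sketch under stated assumptions: the weights `lam=(0.5, 0.3, 0.2)` are placeholders rather than values reported in the paper, the chunk dictionary layout is hypothetical, intervals are treated as closed numeric ranges such as years, and the fallback to coarser temporal cues is not modelled\.

```python
import math


def cosine(a, b):
    """Cosine similarity of Eq. (8); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def time_align(iq, ic):
    """Interval overlap ratio of Eq. (10) for (start, end) pairs with start <= end."""
    inter = max(0.0, min(iq[1], ic[1]) - max(iq[0], ic[0]))
    union = (iq[1] - iq[0]) + (ic[1] - ic[0]) - inter
    return inter / union if union > 0 else 0.0


def hybrid_score(q_vec, q_interval, chunk, beta, lam=(0.5, 0.3, 0.2)):
    """Eq. (7); lam = (lambda_s, lambda_g, lambda_t) are placeholder weights summing to 1."""
    lam_s, lam_g, lam_t = lam
    sim = cosine(q_vec, chunk["embedding"])
    graph_rel = max((beta.get(v, 0.0) for v in chunk["nodes"]), default=0.0)  # Eq. (9)
    return lam_s * sim + lam_g * graph_rel + lam_t * time_align(q_interval, chunk["interval"])


def retrieve(q_vec, q_interval, chunks, beta, k=5, lam=(0.5, 0.3, 0.2)):
    """Return the top-k (score, chunk_id) pairs, highest score first (Eq. 11)."""
    scored = [(hybrid_score(q_vec, q_interval, c, beta, lam), c["id"]) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


chunks = [
    {"id": "c1", "embedding": [1.0, 0.0], "nodes": ["a"], "interval": (1880, 1890)},
    {"id": "c2", "embedding": [0.0, 1.0], "nodes": ["b"], "interval": (1900, 1910)},
]
top = retrieve([1.0, 0.0], (1880, 1890), chunks, beta={"a": 0.9, "b": 0.1}, k=1)
```

Because the three terms are combined linearly, a chunk with weak semantic overlap can still rank highly if it is attached to a high\-$\beta$ node and falls inside the query's time window\.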

### 3\.4 Evidence\-grounded prompt construction

The retrieved evidence is converted into a structured prompt that conditions the language model on both textual and graph\-derived context\. The prompt constructor combines the query, the retrieved evidence chunks, a compact rendering of salient graph relations, and the highest\-relevance graph components\. Let $\mathcal{T}(\cdot)$ denote the prompt template\. The final model input is

$$\mathcal{P}(q)=\mathcal{T}\big(q,\mathcal{C}_q,\mathrm{Render}(\mathcal{G}_{\mathrm{ctx}}),\mathrm{TopNodes}(\boldsymbol{\beta}),\mathrm{TopEdges}(\boldsymbol{\alpha})\big) \tag{12}$$

where $\boldsymbol{\alpha}$ denotes an aggregated edge\-salience score derived from temporal message passing\.

This prompt provides the model with both textual evidence and a compact structural summary of the active context graph\.

### 3\.5 Chain\-of\-Trees reasoning

Single\-path chain\-of\-thought reasoning may be brittle when a query admits multiple plausible interpretations or when several types of evidence must be reconciled\. TCAR\-Gen therefore adopts a*Chain\-of\-Trees*strategy that explores multiple evidence\-grounded reasoning branches before synthesizing an answer\.

Let $\mathcal{T}_q$ denote a rooted reasoning tree for query $q$\. The root node corresponds to the initial prompt $\mathcal{P}(q)$, and each child node represents a refined reasoning branch conditioned on the evidence selected so far\. In the current implementation, three branch types are considered: witness links, temporal overlap, and shared evidence\. Each branch encodes a local claim together with the evidence references used to support it\.

At depth $d$, each branch expands into a small set of candidate continuations\. The expansion proceeds recursively until a maximum depth is reached or low\-confidence branches are pruned\. This produces a set of leaf paths

$$\Pi_q=\{p_1,p_2,\ldots,p_M\} \tag{13}$$

where each path $p_j$ is a structured sequence of evidence\-grounded reasoning states\.

This branching process preserves alternative hypotheses and allows the model to compare competing explanations before final answer synthesis\.
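
The recursive expansion with pruning can be sketched as follows. This is a minimal illustration under stated assumptions: `Node`, `expand_tree`, and the `propose` callback (standing in for an LLM call that returns scored continuations) are hypothetical names, and the default depth limit and pruning threshold simply mirror the values reported in the implementation details.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    claim: str        # local claim made by this reasoning state
    evidence: tuple   # identifiers of the chunks supporting the claim
    score: float      # confidence assigned to this state
    children: list = field(default_factory=list)


def expand_tree(node, propose, depth=0, max_depth=3, min_score=0.2):
    """Return every root-to-leaf path, dropping low-scoring continuations.

    `propose(node, depth)` is a stand-in for the model call that generates
    candidate continuations of a branch.
    """
    if depth == max_depth:
        return [[node]]
    kept = [c for c in propose(node, depth) if c.score >= min_score]
    if not kept:
        return [[node]]  # nothing survives pruning, so this node is a leaf
    node.children = kept
    paths = []
    for child in kept:
        for tail in expand_tree(child, propose, depth + 1, max_depth, min_score):
            paths.append([node] + tail)
    return paths
```

At the first level, `propose` would return one candidate per branch type (witness links, temporal overlap, shared evidence); the returned list of paths plays the role of $\Pi_{q}$.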

### 3.6 Explicit path scoring and decision fusion

The final stage scores candidate reasoning paths and fuses the strongest ones into a single decision. Each path is evaluated using four criteria: evidence support, temporal consistency, graph coherence, and model confidence.

For a path $p$, the overall score is defined as

$$S(p)=w_{e}\,\mathrm{Evid}(p)+w_{t}\,\mathrm{Temp}(p)+w_{g}\,\mathrm{Graph}(p)+w_{m}\,\mathrm{Conf}(p), \tag{14}$$

where $w_{e}+w_{t}+w_{g}+w_{m}=1$.

#### 3.6.1 Evidence support

Let $\mathrm{Supp}(p)$ denote the set of retrieved chunks cited by path $p$. The evidence score is computed as the mean hybrid retrieval score of its supporting chunks:

$$\mathrm{Evid}(p)=\frac{1}{|\mathrm{Supp}(p)|}\sum_{c\in\mathrm{Supp}(p)}\mathrm{Score}(c\mid q). \tag{15}$$

#### 3.6.2 Graph coherence

Let $\mathrm{Edges}(p)$ denote the set of graph edges associated with the evidence used by path $p$. Their coherence is computed as

$$\mathrm{Graph}(p)=\frac{1}{|\mathrm{Edges}(p)|}\sum_{(u\rightarrow v)\in\mathrm{Edges}(p)}\alpha_{uv}, \tag{16}$$

where $\alpha_{uv}$ is the aggregated salience score assigned to edge $(u,v)$.

#### 3.6.3 Temporal consistency

The temporal term penalizes contradictions in event ordering or incompatible timestamps. Let $\mathcal{V}_{\mathrm{temp}}(p)$ denote the set of temporal violations detected for path $p$. The temporal consistency score is

$$\mathrm{Temp}(p)=1-\eta\,|\mathcal{V}_{\mathrm{temp}}(p)|, \tag{17}$$

where $\eta$ is a penalty coefficient and the result is clipped to the interval $[0,1]$.
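
The three component scores in Eqs. 15 to 17 are simple enough to sketch directly. The function names, the dictionary-based inputs, the handling of empty sets, and the default value of `eta` are assumptions for illustration; the paper does not report the penalty coefficient.

```python
def evidence_score(supporting_chunks, retrieval_score):
    """Eq. 15: mean hybrid retrieval score of the chunks cited by a path."""
    if not supporting_chunks:
        return 0.0  # assumption: a path citing nothing gets no evidence support
    return sum(retrieval_score[c] for c in supporting_chunks) / len(supporting_chunks)


def graph_coherence(edges, salience):
    """Eq. 16: mean aggregated salience alpha_uv over the path's edges."""
    if not edges:
        return 0.0  # assumption: a path with no edges has zero coherence
    return sum(salience[e] for e in edges) / len(edges)


def temporal_score(num_violations, eta=0.25):
    """Eq. 17: 1 - eta * |violations|, clipped to [0, 1].

    eta=0.25 is a hypothetical value chosen only for this example.
    """
    return min(1.0, max(0.0, 1.0 - eta * num_violations))
```

With this `eta`, two detected violations halve the temporal score, and four or more drive it to zero.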

#### 3.6.4 Decision synthesis

After computing $S(p)$ for all paths in $\Pi_{q}$, the framework selects the top-ranked subset

$$\Pi^{\star}_{q}=\mathrm{TopM}\big(\Pi_{q};S\big), \tag{18}$$

subject to a minimum path-quality threshold. The final answer is synthesized from $\Pi^{\star}_{q}$ by considering agreement across paths, evidence coverage, and residual uncertainty. The system therefore returns both the answer $y$ and the selected reasoning paths with their associated score breakdowns.
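
A small sketch of Eq. 14 followed by the TopM selection of Eq. 18 is given below. The weights are the path-scoring weights reported in the implementation details; the function names, the value of `m`, and the quality threshold `min_score` are hypothetical, since the paper does not state them for this step.

```python
# Path-scoring weights (w_e, w_t, w_g, w_m) as reported in the paper's setup.
WEIGHTS = {"evid": 0.3, "temp": 0.3, "graph": 0.2, "conf": 0.2}


def path_score(components, weights=WEIGHTS):
    """Eq. 14: convex combination of the four component scores."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * components[k] for k in weights)


def select_top_paths(paths, m=3, min_score=0.2):
    """Eq. 18: keep the m best paths whose score clears a quality threshold.

    `paths` maps a path name to its component scores; m and min_score are
    illustrative values, not taken from the paper.
    """
    scored = [(path_score(c), name) for name, c in paths.items()]
    kept = [(s, n) for s, n in scored if s >= min_score]
    return sorted(kept, reverse=True)[:m]
```

Returning the score alongside each path name keeps the per-path score available for the final answer's score breakdown.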

## 4 Experiments

This section presents the full experimental setup for evaluating TCAR\-Gen\. It describes the corpus and knowledge graph, the benchmark construction and its validation procedure, the baseline systems and their configurations, the evaluation metrics, and the complete implementation details required to reproduce all reported results\. The evaluation is designed to measure both retrieval effectiveness and the quality of evidence\-grounded generation across multiple model scales\.

### 4.1 Corpus and Knowledge Graph

We evaluate TCAR-Gen on the *Victorian Crime Diaries* (VCD) corpus, a curated collection of Victorian-era criminal case narratives encoded as a heterogeneous knowledge graph stored in Neo4j. Each case document originates from a Markdown source file and passes through an automated ingestion pipeline that extracts four node types (`Case`, `Person`, `Evidence`, and `Location`) connected by typed relational edges, including `APPEARS_IN`, `FOUND_IN`, `OCCURRED_AT`, and `FROM_CASE`. Each source document is additionally chunked into passage-level `Chunk` nodes that preserve the raw narrative text and are anchored to their parent `Case` via `FROM_CASE` edges. The resulting global knowledge graph $\mathcal{G}=(V,E)$ encodes multi-case forensic evidence, suspect–location co-occurrences, and timestamped event sequences spanning Victorian London. Crime types represented in the corpus include poisoning, murder, theft, conspiracy, blackmail, forgery, kidnapping, and fraud. Each `Case` node carries a structured date attribute that enables temporal ordering and cross-case chronological comparison.

This heterogeneous, temporally annotated graph constitutes a demanding testbed for graph-based temporal reasoning: queries require the system to traverse relational edges, fuse evidence from multiple chunks, and respect chronological constraints, which are precisely the capabilities that TCAR-Gen is designed to provide. The VCD corpus was selected for three reasons. First, it exhibits the relational and temporal structure that exposes weaknesses in purely similarity-based retrieval. Second, its heterogeneous graph schema maps naturally onto the TCAR-Gen framework and allows controlled comparison of graph-aware and graph-agnostic methods. Third, its domain-specific character provides a realistic evaluation setting where correct answers depend on multi-step evidence fusion rather than surface-level lexical matching. Evaluation on a single corpus is acknowledged as a scope constraint of this study; broader generalisation across open-domain benchmarks is identified as a primary direction for future work in Section [6](https://arxiv.org/html/2606.00029#S6).

### 4.2 Benchmark Construction and Gold Annotations

We construct a stratified evaluation benchmark using a purpose-built `BenchmarkGenerator` that draws queries directly from the Neo4j knowledge graph and attaches structured gold annotations to each query. The generator produces queries across seven semantically distinct types, each targeting a different dimension of forensic reasoning:

1. Narrative: holistic case reconstruction from multiple evidence chunks (1 reasoning hop; difficulty: easy).
2. Entity Lookup: targeted retrieval of suspects, locations, or physical evidence (1 hop; easy).
3. Cross-Case: identification of shared suspects, locations, or crime patterns across two distinct cases (2 hops; hard).
4. Evidence: justification of investigative conclusions through physical clues and witness testimony (1 hop; medium).
5. Temporal: sequencing of events within or across cases, including before/after and timeline comparison queries (1–2 hops; easy–medium).
6. Multi-Hop: chained reasoning across witness testimony, physical evidence, and suspect identification (3 hops; hard).
7. Counterfactual: contrastive comparison of methods, motivations, and evidence profiles across two cases (2 hops; hard).

Each generated query carries four gold annotation fields: (i) `gold_answer`, a reference answer string; (ii) `gold_evidence`, a list of ground-truth chunk identifiers of the form `<case_id>_E01_C<idx>`; (iii) `gold_entities`, the expected entity set (suspects, evidence items) the answer must reference; and (iv) `reasoning_hops`, the minimum number of graph traversal steps required to construct a correct answer. These annotations enable evaluation at both the retrieval level (whether the correct chunks are retrieved) and the generation level (whether the produced answer is grounded in, and consistent with, the gold evidence).
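
One way to picture a benchmark record is as a small immutable data class that validates the chunk identifier pattern. The class name, the `query` and `query_type` fields, and the assumption that `<idx>` is numeric are illustrative and not taken from the paper; only the four gold fields and the identifier shape come from the description above.

```python
import re
from dataclasses import dataclass

# Pattern for identifiers shaped like <case_id>_E01_C<idx>.
# Assumption: <idx> is a run of digits.
CHUNK_ID = re.compile(r"^(?P<case_id>.+)_E01_C(?P<idx>\d+)$")


@dataclass(frozen=True)
class BenchmarkQuery:
    query: str
    query_type: str            # e.g. "Multi-Hop"
    gold_answer: str           # reference answer string
    gold_evidence: tuple       # ground-truth chunk identifiers
    gold_entities: frozenset   # entities the answer must reference
    reasoning_hops: int        # minimum number of graph traversal steps

    def __post_init__(self):
        # Reject records whose evidence identifiers do not follow the pattern.
        for chunk_id in self.gold_evidence:
            if CHUNK_ID.match(chunk_id) is None:
                raise ValueError(f"malformed chunk id: {chunk_id}")
```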

The full benchmark of 160 queries is partitioned into three disjoint splits: 20 queries for GNN training supervision (*train*), 20 queries for hyperparameter selection (*dev*), and 120 queries on which all reported results are computed (*test*). The train split provides the positive and negative evidence labels used to fit the GNN encoder. The dev split is used only for model selection: grid search over retrieval weights and path scoring weights, and early stopping of GNN training. The test split is held out throughout all model development and is accessed only once, at final evaluation. All baselines are evaluated on the same 120-query test split under identical conditions. This three-way partition ensures that TCAR-Gen does not benefit from any information derived from the test queries during training or hyperparameter selection, and that the comparison with baselines is conducted on a common, unseen evaluation set. The four baseline systems (Vanilla RAG, Temporal RAG, GraphRAG-C, and GraphRAG-T) are purely inference-time methods with no trainable components requiring supervision, so their published default configurations are directly applicable without additional tuning on the dev split. The split distribution preserves the original query-type proportions across all three partitions through stratified sampling, ensuring that each difficulty level and hop count is represented in the test set.

All gold evidence annotations were manually reviewed by the authors to verify that (a) the reference answer is uniquely supported by the annotated chunks, (b) the annotated chunks are not trivially retrievable by keyword matching alone, and (c) the reasoning hop count reflects genuine multi-step inference requirements. Query templates are defined independently of the retrieval scoring function: the `BenchmarkGenerator` draws entity references from the graph schema but does not have access to TCAR-Gen's retrieval weights or attention mechanism during query generation. All baseline systems are evaluated against the same gold annotations under identical retrieval budgets, ensuring that any advantage for graph-based methods reflects structural capability rather than benchmark design.
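
The paper does not describe the sampling procedure beyond calling it stratified, so the following is only one possible sketch of a 20/20/120 split that keeps query types roughly proportional. The function name, the systematic 1:1:6 assignment, and the seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(queries, key, seed=0):
    """Split queries 1:1:6 into train/dev/test, grouped by query type.

    Queries are shuffled within each type, laid out type by type, and then
    assigned in a repeating pattern of eight slots (one train, one dev,
    six test), so every type contributes to each split in similar proportion.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for q in queries:
        by_type[key(q)].append(q)

    ordered = []
    for query_type in sorted(by_type):
        group = by_type[query_type]
        rng.shuffle(group)
        ordered.extend(group)

    splits = {"train": [], "dev": [], "test": []}
    for i, q in enumerate(ordered):
        slot = i % 8
        name = "train" if slot == 0 else "dev" if slot == 1 else "test"
        splits[name].append(q)
    return splits
```

For 160 queries this yields exactly 20 train, 20 dev, and 120 test items, and the three lists are disjoint by construction.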

### 4.3 Baselines

We compare TCAR-Gen against four baseline systems that span the design space of retrieval-augmented generation for knowledge-intensive QA, from unstructured dense retrieval to graph-structured methods with temporal attributes. All baselines operate on the same VCD corpus and Neo4j knowledge graph as TCAR-Gen. To ensure a controlled and fair comparison, all systems use the same sentence-transformer embedding model (`all-MiniLM-L6-v2`) [[43](https://arxiv.org/html/2606.00029#bib.bib43)] for semantic similarity scoring, the same top-$K$ retrieval cutoff ($K=10$), the same Neo4j graph instance, and the same generative model (GPT-OSS-20B) for answer generation. No additional tuning was applied to any baseline beyond its published default configuration, which is appropriate given that Vanilla RAG, Temporal RAG, GraphRAG-C, and GraphRAG-T are all inference-time systems without trainable components that would benefit from supervision on the dev split.

Vanilla RAG. Dense retrieval over a flat document index using cosine similarity over `all-MiniLM-L6-v2` sentence embeddings. No graph structure or temporal reasoning is applied. This baseline establishes the performance floor for unstructured similarity-based retrieval and provides a reference point for measuring the gain attributable to graph and temporal components.

Temporal RAG. Extends Vanilla RAG with a post-retrieval temporal re-ranking stage that adjusts chunk scores based on the Jaccard overlap between the query's temporal span and the candidate chunk's event date, computed using the same `TimeAlign` function defined in Equation ([7](https://arxiv.org/html/2606.00029#S3.E7)). This baseline isolates the contribution of temporal surface signals applied after retrieval, without any structural graph context.
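
To make the notion of Jaccard overlap between temporal spans concrete, the sketch below computes it for two inclusive day ranges, treating a single event date as a one-day range. This is an assumed reading for illustration only: the function name and the day-level granularity are hypothetical, and the actual `TimeAlign` definition is the one given in Equation (7).

```python
from datetime import date


def interval_jaccard(a_start, a_end, b_start, b_end):
    """Jaccard overlap of two inclusive day ranges: |A and B| / |A or B|."""
    # Number of shared days; negative when the ranges are disjoint.
    inter = (min(a_end, b_end) - max(a_start, b_start)).days + 1
    inter = max(0, inter)
    len_a = (a_end - a_start).days + 1
    len_b = (b_end - b_start).days + 1
    union = len_a + len_b - inter
    return inter / union


# A ten-day query span against a single event date inside it.
overlap = interval_jaccard(date(1888, 1, 1), date(1888, 1, 10),
                           date(1888, 1, 4), date(1888, 1, 4))
```

Under this reading, a chunk dated outside the query span receives an overlap of zero.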

GraphRAG-C. Implements the community-based GraphRAG method of [[14](https://arxiv.org/html/2606.00029#bib.bib14)], which partitions the knowledge graph into local communities via the Leiden algorithm and constructs community-level summaries from which answers are retrieved. This baseline encodes structural entity relations and provides a competitive graph-structured comparison. It does not model temporal edge ordering. We use the publicly available GraphRAG implementation with default community detection parameters and prompt templates.

GraphRAG-T. Augments GraphRAG-C with temporal edge attributes by incorporating edge timestamps into the community summary construction step. This provides both graph-structural and temporal signals and makes GraphRAG-T, by design, the most fully featured non-TCAR-Gen baseline in our comparison. It does not perform query-conditioned graph attention, evidence-level temporal alignment, or chain-of-trees reasoning.

GraphRAG-C and GraphRAG-T are both derived from the published GraphRAG framework [[14](https://arxiv.org/html/2606.00029#bib.bib14)] and differ only in whether temporal edge attributes are included during community summarisation. These two variants are included specifically to disentangle the contribution of graph structure from temporal structure within the same retrieval paradigm. Comparisons against additional methods such as G-Retriever [[15](https://arxiv.org/html/2606.00029#bib.bib15)] and RDPG [[17](https://arxiv.org/html/2606.00029#bib.bib17)] are deferred to future work owing to the significant engineering effort required to adapt these systems to the VCD schema.

The low absolute retrieval scores observed for all baselines (Recall@5 $\leq 0.074$) reflect the genuine difficulty of the VCD corpus rather than inadequate baseline tuning. Vanilla RAG and Temporal RAG are fundamentally constrained by the absence of relational structure, while GraphRAG-C and GraphRAG-T operate at the community-summary level and cannot distinguish individual evidence chunks within a community. The per-type analysis in Section [5.2](https://arxiv.org/html/2606.00029#S5.SS2) confirms this: all four baselines score 0.000 on Multi-Hop queries regardless of whether graph or temporal signals are present. All baseline systems are inference-time methods with no learnable parameters, so applying their published default configurations is not a fairness disadvantage but the appropriate symmetric treatment. In contrast, TCAR-Gen includes a trainable GNN component supervised on the train split, with early stopping on the dev split. This difference reflects the architectural design of each system rather than unequal tuning effort.

### 4.4 Evaluation Metrics

We report metrics across two evaluation dimensions: retrieval quality and generation quality. Recall@$K$ ($K\in\{3,5,10\}$) measures the fraction of ground-truth evidence chunks that appear in the top-$K$ retrieved set. Normalized Discounted Cumulative Gain (NDCG@5) evaluates ranking quality with position-discounted relevance, penalising systems that rank relevant chunks lower in the list. Mean Reciprocal Rank (MRR) captures how high the first relevant chunk ranks across all queries:

$$\mathrm{MRR}=\frac{1}{|Q|}\sum_{q\in Q}\frac{1}{\mathrm{rank}_{q}}, \tag{19}$$

where $\mathrm{rank}_{q}$ is the rank of the first relevant chunk retrieved for query $q$.

Faithfulness measures the degree to which each claim in the generated answer is supported by the retrieved evidence, computed via the RAGAS framework [[44](https://arxiv.org/html/2606.00029#bib.bib44)] as:

$$\mathrm{Faithfulness}=\frac{|\text{supported claims}|}{|\text{total claims}|}. \tag{20}$$

Answer Relevancy measures semantic alignment between the generated answer and the query, computed as the cosine similarity between their `all-MiniLM-L6-v2` embeddings. Temporal Consistency (TC) captures chronological correctness by counting temporal ordering violations in the generated answer relative to the gold event sequence:

$$\mathrm{TC}=1-\frac{\#\,\text{temporal violations}}{\#\,\text{temporal claims}+\varepsilon}, \tag{21}$$

where $\varepsilon=10^{-8}$ prevents division by zero. A temporal violation is defined as any generated claim in which an event is stated to precede or follow another event in a direction that contradicts the timestamp ordering in the gold evidence. Violation detection is implemented as a rule-based consistency checker that extracts temporal relation pairs from the generated answer using dependency parsing and compares them against the event chronology recorded in the VCD graph.
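
The retrieval metrics and Eq. 21 can be written in a few lines. This is a minimal sketch: the function names are hypothetical, NDCG is computed with binary relevance (an assumption, since the paper does not state its gain function), and a query with no relevant chunk retrieved is assumed to contribute a reciprocal rank of zero.

```python
import math


def recall_at_k(ranked, gold, k):
    """Fraction of gold chunks that appear among the top-k retrieved chunks."""
    gold = set(gold)
    return len(gold.intersection(ranked[:k])) / len(gold)


def ndcg_at_k(ranked, gold, k):
    """NDCG@k with binary relevance (assumed gain: 1 if gold, else 0)."""
    gold = set(gold)
    dcg = sum(1.0 / math.log2(i + 1)
              for i, c in enumerate(ranked[:k], start=1) if c in gold)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(gold), k) + 1))
    return dcg / ideal


def reciprocal_rank(ranked, gold):
    """1 / rank of the first relevant chunk, or 0 if none is retrieved."""
    gold = set(gold)
    for rank, chunk in enumerate(ranked, start=1):
        if chunk in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Eq. 19 over a list of (ranked, gold) pairs, one pair per query."""
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)


def tc_metric(num_violations, num_claims, eps=1e-8):
    """Eq. 21: temporal consistency of a generated answer."""
    return 1.0 - num_violations / (num_claims + eps)
```

Note that an answer with no temporal claims and no violations scores 1.0 under Eq. 21, because the numerator is zero.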

All text representations for semantic similarity, query embeddings, chunk embeddings, and the initial node feature vectors use the `all-MiniLM-L6-v2` sentence-transformer model [[43](https://arxiv.org/html/2606.00029#bib.bib43)], which produces 384-dimensional dense vectors and is held fixed across all systems and all five models in the scaling study. Seed entity extraction from queries is performed using spaCy's `en_core_web_trf` pipeline, with extracted mentions normalised to lowercase and matched to graph nodes via exact-match and edit-distance-based fuzzy matching at a threshold of 0.85; unmatched mentions are discarded.
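
The exact-then-fuzzy matching step might look like the sketch below. The paper does not say how edit distance is normalised, so the similarity `1 - distance / max(len)` is an assumption, as are the function names and the example node names.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def match_mention(mention, node_names, threshold=0.85):
    """Map a query mention to a node id, or None if nothing is close enough.

    `node_names` maps lowercased node names to node ids. Exact matches win;
    otherwise the best normalised edit similarity must reach the threshold.
    """
    mention = mention.lower()
    if mention in node_names:
        return node_names[mention]
    best_id, best_sim = None, 0.0
    for name, node_id in node_names.items():
        sim = 1.0 - levenshtein(mention, name) / max(len(mention), len(name), 1)
        if sim > best_sim:
            best_id, best_sim = node_id, sim
    return best_id if best_sim >= threshold else None


# Hypothetical node names, used only for this example.
nodes = {"inspector hargreave": "P1", "whitechapel": "L1"}
```

A threshold this high tolerates a dropped or mistyped character in a long name but rejects partial mentions such as a surname alone.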

The GNN encoder consists of three temporal message-passing layers with hidden dimension $d=256$. Node features are initialised as the concatenation of the chunk or entity text embedding (384-dimensional) and a trainable type-specific embedding (32-dimensional), giving an input dimension of 416. Temporal gaps $\Delta t_{uv}$ between connected events are encoded using a learnable time2vec projection [[45](https://arxiv.org/html/2606.00029#bib.bib45)] of dimension 32. The GNN is trained with binary cross-entropy loss using the 20-query train split, where chunks appearing in `gold_evidence` are treated as positive nodes and all other chunk nodes in the query-conditioned subgraph are treated as negatives. Training uses the Adam optimiser with a learning rate of $10^{-3}$ and weight decay of $10^{-4}$ for up to 50 epochs, with early stopping based on Recall@5 on the dev split.

Critically, the query embedding $\mathbf{q}$ used in the query-conditioned attention pooling (QCAP) of Equation ([5](https://arxiv.org/html/2606.00029#S3.E5)) is derived from the last hidden state of the generative model rather than from the fixed sentence-transformer. This design choice is intentional: using the model's contextual representation for graph attention conditioning is intended to produce richer query-node alignment than a frozen embedding model.

This choice also explains the variation in retrieval metrics observed across model sizes in the cross-model scaling study. While the GNN weights, graph structure, and hybrid scoring weights are held fixed, $\mathbf{q}$ changes with each LLM. The learned attention weights $\{\beta_{v}\}$ therefore vary across models, which produces fluctuations in the graph-relevance term in Equation ([7](https://arxiv.org/html/2606.00029#S3.E7)). These variations are expected and do not invalidate the scaling analysis: retrieval metrics in the scaling study reflect both the fixed structural component of retrieval and the variable query conditioning contributed by each LLM. The semantic similarity term $\lambda_{s}\cdot\mathrm{Sim}(q,e_{c})$ in Equation ([7](https://arxiv.org/html/2606.00029#S3.E7)) uses the fixed `all-MiniLM-L6-v2` embedding for both query and chunk, ensuring that the semantic component of retrieval is model-independent and that the observed scaling effect is localised to the graph-attention pathway.

The composite retrieval score weights are set to $\lambda_{s}=0.5$, $\lambda_{g}=0.3$, $\lambda_{t}=0.2$, selected by grid search over the dev split at intervals of 0.1 subject to $\lambda_{s}+\lambda_{g}+\lambda_{t}=1$. The Chain-of-Trees module initialises three hypothesis branches at depth zero grounded in witness-link evidence, temporal-overlap evidence, and shared physical evidence, respectively. Each branch is expanded by prompting the LLM with the branch-specific evidence subset and requesting a one-step reasoning continuation; branches are pruned if their path score $S(p)<0.2$, and the maximum tree depth is three. The model confidence term $\mathrm{Conf}(p)$ is computed as the mean token-level log-probability of the generated reasoning continuation, normalised to $[0,1]$ via min-max scaling across all branches for a given query. Path scoring weights are $w_{e}=0.3$, $w_{t}=0.3$, $w_{g}=0.2$, $w_{m}=0.2$, selected on the dev split and held constant across all queries and model sizes. The top-$K$ retrieval cutoff is $K=10$ for all systems.

The evidence-grounded prompt follows the structure `[Query] | [Retrieved Chunks] | [Graph Summary] | [Top Nodes] | [Top Edges] | [Instruction]`, where the graph summary is a linearised rendering of the five highest-scoring nodes and their incident edges as typed triples, and the instruction directs the model to produce answers grounded solely in the provided evidence, citing chunk identifiers for each supporting claim. All generative models use greedy decoding with a maximum output length of 512 tokens.

All experiments are conducted on a single CPU machine; end-to-end wall-clock time per query is approximately 4.2 s for TCAR-Gen (GPT-OSS-20B), 1.1 s for Vanilla RAG, 1.3 s for Temporal RAG, 3.8 s for GraphRAG-C, and 4.1 s for GraphRAG-T, with the additional latency of TCAR-Gen attributable to GNN inference ($\approx 0.8$ s), Chain-of-Trees expansion ($\approx 1.5$ s), and path scoring ($\approx 0.5$ s).
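
The weight selection described above amounts to a search over a coarse grid on the probability simplex. The sketch below enumerates that grid and applies a linear hybrid score; the linear form is inferred from the description of the three weighted terms, and the function names, the inclusion of zero weights in the grid, and the `evaluate` callback (standing in for a dev-split Recall@5 computation) are assumptions.

```python
from itertools import product


def simplex_grid(step=0.1):
    """All (lambda_s, lambda_g, lambda_t) on a regular grid that sum to 1."""
    n = round(1 / step)
    return [(i / n, j / n, (n - i - j) / n)
            for i, j in product(range(n + 1), repeat=2) if i + j <= n]


def hybrid_score(sim, graph_rel, time_align, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of semantic, graph-relevance, and temporal-alignment terms."""
    lam_s, lam_g, lam_t = weights
    return lam_s * sim + lam_g * graph_rel + lam_t * time_align


def grid_search(evaluate, step=0.1):
    """Return the weight triple that maximises a dev-split metric.

    `evaluate(weights)` is a hypothetical callback that would run retrieval
    on the dev queries with the given weights and return, say, Recall@5.
    """
    return max(simplex_grid(step), key=evaluate)
```

At a step of 0.1 the grid contains 66 weight triples, so an exhaustive search over a 20-query dev split is inexpensive.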

## 5 Results and Discussion

This section evaluates TCAR\-Gen through four analyses: comparative performance against baseline systems, per\-type retrieval analysis, component\-level ablation, and cross\-model scaling behaviour\. These analyses explain both the performance gains of the framework and the components that produce them\.

### 5.1 Comparative Evaluation

![Refer to caption](https://arxiv.org/html/2606.00029v1/fig2_comparative_bars.png)

Figure 4: Comparative evaluation of TCAR-Gen and four baselines on the Victorian Crime Diaries benchmark.

Retrieval and generation performance for TCAR-Gen and four baseline systems on the Victorian Crime Diaries benchmark (n=160 queries) is reported in Figure [4](https://arxiv.org/html/2606.00029#S5.F4). TCAR-Gen achieves the highest score on every retrieval metric by a wide margin. Recall@5 reaches 0.3738, compared with 0.0738 for the strongest baseline, GraphRAG-C, an absolute improvement of 0.300. The margin increases at higher cutoffs. Recall@10 reaches 0.5476, compared with 0.0738 for GraphRAG-C, a difference of 0.474. This pattern shows that the hybrid retrieval mechanism not only ranks relevant chunks more effectively but also retrieves a larger set of gold evidence as the candidate pool expands. This property is important for multi-hop queries that require several supporting chunks. NDCG@5 (0.2354 versus 0.0603 for GraphRAG-C) and MRR (0.3357 versus 0.0881 for Temporal RAG) show the same pattern. Relevant evidence appears earlier in the ranked list, which improves downstream generation by placing the most useful context inside the prompt window.

Three retrieval signals account for these gains. The context graph limits retrieval to a query-conditioned neighbourhood and prevents semantic similarity from favouring unrelated lexical matches in other case documents. The graph-relevance term, derived from the query-conditioned attention weights $\{\beta_{v}\}$ produced by the GNN encoder, assigns higher scores to chunks attached to nodes that are highly relevant to the query. The temporal alignment term removes anachronistic evidence, which is a common failure mode in the VCD corpus because many cases share entities but differ in chronological structure.

The contrast with GraphRAG-T is informative. GraphRAG-T is the most fully featured baseline by design, since it incorporates both graph structure and temporal edge attributes. Even so, GraphRAG-T reaches Recall@5 of only 0.0524, which matches Vanilla RAG, falls below GraphRAG-C, and remains far below TCAR-Gen. The result shows that the presence of structural and temporal information is not sufficient on its own. Performance depends on how these signals enter the ranking function. TCAR-Gen applies query-conditioned graph attention and evidence-level temporal alignment. GraphRAG-T applies temporal information at the community-summary level, a representation that is too coarse to distinguish individual evidence chunks.

TCAR\-Gen achieves the highest faithfulness score at 0\.7872\. The next best system, GraphRAG\-T, reaches 0\.7236, so the absolute improvement is 0\.0636\. The improvement reflects the structure of the prompt\. Retrieved chunks appear together with a compact representation of the active context graph, which constrains the model to claims that can be linked to specific evidence nodes and reduces unsupported statements\.

Temporal Consistency produces the clearest separation between systems\. TCAR\-Gen reaches 1\.0000, whereas the closest baseline, GraphRAG\-T, reaches 0\.9110\. Vanilla RAG reaches only 0\.7330\. The main source of this result is the Chain\-of\-Trees reasoning module\. Candidate reasoning paths are scored for temporal violations, and inconsistent event orders are penalised during fusion\. The system therefore avoids answers in which cause precedes effect or evidence predates the event it is used to support\. Temporal RAG reaches 0\.8820, which remains below TCAR\-Gen\. Temporal re\-ranking at the retrieval stage alone is therefore not enough\. Temporal grounding must also constrain reasoning\.

Answer Relevancy is the only metric on which TCAR-Gen does not record the highest score. GraphRAG-C reaches 0.8410, whereas TCAR-Gen reaches 0.8392, a difference of 0.0018. The gap is negligible. It likely reflects a trade-off introduced by the context-graph constraint. Because TCAR-Gen limits retrieval to the query-conditioned subgraph, the system can exclude semantically related but structurally distant chunks. GraphRAG-C draws from broader community-level summaries and therefore sometimes produces slightly broader lexical coverage. This trade-off is acceptable because TCAR-Gen records higher faithfulness and perfect Temporal Consistency.

### 5.2 Per-Type Retrieval Analysis

Table [1](https://arxiv.org/html/2606.00029#S5.T1) reports Recall@5 by query type, ordered by descending absolute margin over the best baseline. The table shows the reasoning dimensions on which TCAR-Gen contributes most.

Table 1: Recall@5 disaggregated by query type for TCAR-Gen and all baselines, ordered by descending absolute margin over the best baseline. Bold: best result per row. Reasoning hops and difficulty level are defined in Section [4.2](https://arxiv.org/html/2606.00029#S4.SS2).

TCAR-Gen achieves the highest Recall@5 on all seven query types. Multi-Hop queries show the largest absolute margin, +0.350, and all four baselines score 0.000 on this type. The result shows that semantic similarity, temporal re-ranking, and community-based graph retrieval are not sufficient to recover multi-hop evidence chains in the VCD corpus. The bounded multi-hop context graph expansion introduced in Section [3.1](https://arxiv.org/html/2606.00029#S3.SS1) is therefore necessary for this capability. Entity Lookup queries reach the highest absolute Recall@5 of any type at 0.500 and the second largest margin over the best baseline, +0.300 over GraphRAG-C. The result reflects the precision of the context graph. Seed entity mentions, including suspect names, evidence descriptors, and location references, are anchored to specific graph nodes. The graph-relevance term then assigns high priority to chunks linked to those nodes.

Counterfactual and Evidence queries produce margins of +0.250 and +0.200 over their strongest baselines. These queries require retrieval across contrasting cases or justification through physical clues and witness testimony. The relevant chunks are distributed across several graph regions connected by typed relations, and semantic baselines do not use this structure. For Counterfactual queries, temporal alignment adds an additional constraint by ensuring that evidence from the compared cases is chronologically ordered before it reaches the reasoning module. Temporal queries produce a margin of +0.167 over GraphRAG-C. Vanilla RAG and Temporal RAG both score 0.000 on this type. Temporal surface signals applied after retrieval are therefore not enough when the gold evidence lacks strong lexical overlap with the query and is connected instead through a chain of temporally ordered graph edges.

Cross-Case queries produce the smallest margin, +0.100 over Vanilla RAG and Temporal RAG. This pattern reflects a structural limitation. When a query spans two distinct case documents, the seed entities extracted from the query can activate only one of the required subgraphs. The connecting evidence then falls outside the bounded two-hop expansion. Cross-case entity resolution is therefore the most important target for future retrieval improvement, especially alias normalisation for suspects and locations that appear under different surface forms across cases.

### 5\.3Ablation Study

Table[2](https://arxiv.org/html/2606.00029#S5.T2)reports eleven ablation configurations\. Each configuration removes or replaces one architectural component and isolates the contribution of a major subsystem\. The configurations are grouped into three sets: Context Graph \(A1–A2\), QCAP\-GNN \(A3–A5\), and Hybrid Retrieval / Chain\-of\-Trees / Fusion \(A6–A11\)\.

Table 2:Ablation study on TCAR\-Gen \(GPT\-OSS\-20B\)\. Upper panel: retrieval\-focused ablations \(Recall@5\)\. Lower panel: generation\-focused ablations \(Faithfulness\)\.⋆\\stardenotes the full system\.Bold: best result per column\.
Configuration A2, which removes the context graph, produces Recall@5 of 0\.0000\. The context graph is therefore necessary for evidence retrieval in the VCD corpus\. Without a query\-conditioned subgraph, retrieval has no structural signal for ranking candidate chunks, and both the graph\-relevance and temporal\-alignment terms in Eq\. \([7](https://arxiv.org/html/2606.00029#S3.E7)\) collapse to zero\. The system then reduces to unguided semantic search\. The gap of 0\.3738 between A2 and the full system is the largest retrieval drop in the ablation study\.

Configuration A1 restricts graph expansion to Person and Evidence nodes and excludes event and location nodes\. Recall@5 decreases to 0\.3524, which is 0\.021 below the full system\. This result shows that event and location nodes provide important structural context: their removal reduces the coverage of the query\-conditioned attention weights $\{\beta_{v}\}$\. The observation suggests that node selection should depend on the query, because fixed expansion may include unnecessary nodes or exclude relevant ones\. Adaptive node budgeting based on query complexity and entity density could improve this step\.

Configurations A6, A7, and A8 isolate the contribution of each retrieval signal\. Retaining only semantic similarity in A6 \(Semantic Only\) and removing temporal alignment in A7 \(Semantic \+ Graph\) both produce Recall@5 of 0\.3429, a decrease of 0\.031 from the full system\. Removing graph relevance while retaining temporal alignment in A8 \(Semantic \+ Temporal\) produces 0\.3524\. Because A8 scores higher than A7, temporal alignment appears to provide more discriminative power than graph relevance when each is paired with semantic similarity on its own\. The full combination of all three signals, however, achieves the highest Recall@5\.

Ablations A3 to A5 examine the role of query\-conditioned attention pooling \(QCAP\) and the GNN encoder\. Replacing QCAP with mean pooling in A3 reduces faithfulness from 0\.7872 to 0\.0530\. Removing query conditioning in A4 reduces it further to 0\.0485\. Among these three, the largest degradation appears in A5, where removing the GNN reduces faithfulness to 0\.0440, a drop of 0\.743\. Without message passing, node representations cannot encode neighborhood structure, and the attention scores $\{\beta_{v}\}$ become unreliable\. These results show that faithful generation depends on structured representations, not only on retrieval quality\.

Ablations A9 and A10 evaluate the reasoning component\. Replacing the Chain\-of\-Trees module with linear chain\-of\-thought in A9 or with a single\-branch tree in A10 reduces faithfulness to 0\.0591 and 0\.0565, drops of 0\.728 and 0\.731, respectively\. These results show that multiple reasoning branches are necessary for consistent answers across different types of evidence\. The improvement does not come only from retrieving more evidence; it depends on comparing alternative reasoning paths before answer synthesis\.

Configuration A11 removes the temporal penalty and produces the lowest faithfulness score, 0\.0374, a drop of 0\.750\. This is the largest faithfulness drop among all ablations\. Once the temporal penalty is removed from the path\-scoring objective in Eq\. \([14](https://arxiv.org/html/2606.00029#S3.E14)\), the model can select reasoning paths that violate chronological order\. These inconsistencies propagate into the final answer, even when the retrieved evidence is correct\. Temporal grounding is therefore necessary for reasoning over event\-driven knowledge\.
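
The effect of the temporal penalty can be sketched as follows\. The form of Eq\. \([14](https://arxiv.org/html/2606.00029#S3.E14)\) is not reproduced here, so this example assumes a simple score: the sum of per\-step evidence scores minus a fixed penalty for each adjacent pair of steps that runs backwards in time\. The field names, penalty value, dates, and scores are hypothetical\.

```python
from datetime import date
from typing import NamedTuple


class Step(NamedTuple):
    when: date    # date of the event this reasoning step relies on
    score: float  # evidence score of the step (hypothetical scale)


def count_violations(path):
    # Count adjacent step pairs whose dates go backwards in time.
    return sum(1 for prev, nxt in zip(path, path[1:]) if nxt.when < prev.when)


def path_score(path, penalty):
    # Assumed objective: total evidence minus a penalty per chronological violation.
    return sum(step.score for step in path) - penalty * count_violations(path)


def best_path(paths, penalty):
    # Pick the candidate reasoning path with the highest penalized score.
    return max(paths, key=lambda p: path_score(p, penalty))


# Two hypothetical candidate paths over the same pair of events.
ordered = [Step(date(1888, 3, 1), 0.6), Step(date(1888, 3, 5), 0.6)]
reversed_path = [Step(date(1888, 3, 5), 0.7), Step(date(1888, 3, 1), 0.7)]

with_penalty = best_path([ordered, reversed_path], penalty=1.0)
without_penalty = best_path([ordered, reversed_path], penalty=0.0)
print(with_penalty is ordered, without_penalty is reversed_path)
```

With the penalty set to zero, the higher\-scoring but chronologically inverted path wins, which mirrors the failure mode described for A11\.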

### 5\.4Cross\-Model Scaling Analysis

Table [3](https://arxiv.org/html/2606.00029#S5.T3) and Figure [5](https://arxiv.org/html/2606.00029#S5.F5) show how TCAR\-Gen behaves across five generative models spanning an approximately 18\-fold parameter range\.

Table 3: TCAR\-Gen performance across five LLMs\. Bold: best result per row\.

Recall@5 decreases from 0\.3738 with GPT\-OSS\-20B to 0\.1711 with TinyLlama\-1\.1B, an absolute reduction of 0\.203 across an 18\-fold decrease in parameter count\. The reduction is modest relative to the model size change\. Retrieval quality in TCAR\-Gen is partly structural, determined by the fixed context graph and hybrid scoring pipeline, and partly model\-conditioned, through the LLM\-dependent query representation $\mathbf{q}$ that drives the QCAP attention weights $\{\beta_{v}\}$\. The GNN encoder and query\-conditioned attention pooling also operate outside the language model and determine chunk ranking directly\. This separation matters in practice because smaller models such as Phi\-3\-mini and TinyLlama still retain useful retrieval coverage, with Recall@5 of 0\.2247 and 0\.1711, respectively\.

Generation quality depends much more strongly on model capacity\. Faithfulness decreases from 0\.7872 with GPT\-OSS\-20B to 0\.2535 with LLaMA\-3\.1\-13B and to 0\.1472 with TinyLlama\-1\.1B, a total reduction of 0\.640\. Smaller models therefore struggle less with evidence retrieval than with synthesis of structured multi\-step arguments\. The evidence is available, but the model cannot combine it reliably into explicit claim\-evidence relations\.
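
For a quick sense of scale, the relative changes implied by the endpoint values quoted above can be computed directly\. The snippet uses only the reported Recall@5 and faithfulness figures for GPT\-OSS\-20B and TinyLlama\-1\.1B and the nominal parameter counts in the model names\. The helper name `relative_drop` is illustrative\.

```python
def relative_drop(start, end):
    # Fraction of the starting value that is lost when moving from start to end.
    return (start - end) / start


# Endpoint values reported in the text (GPT-OSS-20B vs. TinyLlama-1.1B).
recall_drop = relative_drop(0.3738, 0.1711)
faithfulness_drop = relative_drop(0.7872, 0.1472)
param_ratio = 20.0 / 1.1  # nominal parameter counts, in billions

print(f"parameter ratio:        {param_ratio:.1f}x")
print(f"Recall@5 relative drop: {recall_drop:.1%}")
print(f"faithfulness drop:      {faithfulness_drop:.1%}")
```

Under these figures, Recall@5 loses a little over half of its value while faithfulness loses roughly four fifths, which is consistent with the claim that generation is more sensitive to model capacity than retrieval\.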

![Refer to caption](https://arxiv.org/html/2606.00029v1/fig_radar_scaling.png)

Figure 5: Radar chart of Recall@5 by query type across five LLMs\.

Temporal Consistency decreases more gradually, from 1\.0000 to 0\.8163, and remains above 0\.8903 for all models of at least 3\.8B parameters\. The result suggests that the temporal penalty in Eq\. \([14](https://arxiv.org/html/2606.00029#S3.E14)\) provides useful structural support even for weaker models\. Paths that contain temporal violations are removed before synthesis, which reduces the burden on the language model itself\.

Figure [5](https://arxiv.org/html/2606.00029#S5.F5) shows that the ranking of query types by Recall@5 remains stable across all five LLMs\. Entity Lookup remains the easiest type, whereas Cross\-Case remains the most difficult\. Query difficulty in TCAR\-Gen therefore depends mainly on structural properties, especially the number of reasoning hops and the degree of cross\-document bridging, rather than only on model size\. Multi\-Hop and Counterfactual queries remain strong among the difficult categories at every scale\. This pattern reflects the effect of multi\-hop context graph expansion, which provides the same structural support across all models\.
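
Multi\-hop context graph expansion can be sketched as a bounded breadth\-first traversal\. The example below is an assumption about how such an expansion might work, not the paper's procedure: the graph, node names, and the `allowed_types` filter are hypothetical, with the filter loosely modelled on ablation A1, which restricts expansion to Person and Evidence nodes\.

```python
from collections import deque


def expand(graph, node_types, seeds, max_hops, allowed_types=None):
    # Breadth-first expansion from the seed nodes, up to max_hops edges away.
    # If allowed_types is given, only neighbours of those types are added.
    visited = {seed: 0 for seed in seeds}  # node -> hop distance from a seed
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        depth = visited[node]
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for nbr in graph.get(node, ()):
            if nbr in visited:
                continue
            if allowed_types is not None and node_types[nbr] not in allowed_types:
                continue
            visited[nbr] = depth + 1
            queue.append(nbr)
    return visited


# Hypothetical toy context graph (undirected, stored as adjacency lists).
graph = {
    "suspect": ["burglary", "witness"],
    "burglary": ["suspect", "warehouse", "ledger"],
    "warehouse": ["burglary"],
    "ledger": ["burglary"],
    "witness": ["suspect"],
}
node_types = {
    "suspect": "Person",
    "witness": "Person",
    "burglary": "Event",
    "warehouse": "Location",
    "ledger": "Evidence",
}

full = expand(graph, node_types, ["suspect"], max_hops=2)
restricted = expand(graph, node_types, ["suspect"], max_hops=2,
                    allowed_types={"Person", "Evidence"})
print(sorted(full), sorted(restricted))
```

In this toy graph the restricted expansion never reaches the ledger, because the only route to it passes through an event node\. This is one way that excluding event and location nodes could reduce coverage\.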

## 6Conclusion

This work presents a retrieval\-augmented generation framework that integrates contextual, relational, and temporal reasoning within a single pipeline\. The approach combines document\-level context enrichment, query\-conditioned graph construction, hybrid retrieval, and multi\-branch reasoning\. Context enrichment preserves document coherence, while the graph structure captures dependencies among entities and events\. Hybrid retrieval uses semantic similarity, graph relevance, and temporal alignment to select evidence\. Multi\-branch reasoning evaluates alternative evidence paths and selects consistent outputs\. The evaluation shows improvements in both retrieval and generation\. The framework retrieves more relevant and complete evidence, especially for multi\-hop and temporally constrained queries\. Generated responses show higher faithfulness and better temporal consistency\. The ablation results confirm that the graph structure and temporal reasoning are necessary for maintaining coherence\. Future work can extend this framework in several directions\. Larger and more diverse benchmarks can improve evaluation coverage\. Adaptive graph construction and learned weighting can improve flexibility\. Temporal modelling can be extended to capture richer event structures\. Efficient reasoning strategies can reduce computational overhead\. Integration with smaller models or distributed systems can improve scalability\.
