TRACE: State-Aware Query Processing over Temporal Evidence Graphs for Conversational Data
Summary
This paper presents TRACE, a query processing framework that models conversational data as temporal evidence graphs to enable state-aware reasoning over evolving user states, improving temporal and multi-hop reasoning for long-conversation QA.
# TRACE: State-Aware Query Processing over Temporal Evidence Graphs for Conversational Data
Source: [https://arxiv.org/html/2607.00339](https://arxiv.org/html/2607.00339)
Maolin Wang1, Yu Wang1, Zichun Liu1, Baiyuan Qiu2, Chenbin Zhang2, Jiguang Shen2, Haoran Yang3, Hao Miao4
###### Abstract
Conversational data is increasingly used as a persistent source of user state for long\-running assistants and AI agents\. However, querying this data remains challenging because conversations naturally evolve: plans are revised, preferences change, and later messages frequently supersede or contradict earlier information\. Existing long\-memory pipelines largely treat memories as independent text or vector objects\. This approach often retrieves semantically similar but stale evidence, offering limited support for state\-aware reasoning\. To address this problem, we present TRACE, a query processing framework over temporal evidence graphs for evolving conversational data\. TRACE models conversations as a hierarchical graph spanning events, sessions, and topics, enriched with typed temporal, causal, update, and contradiction relations\. Crucially, the framework maintains validity annotations so obsolete facts remain accessible for historical queries but are discounted for current\-state answers\. At query time, TRACE combines vector\-based note retrieval with graph\-guided evidence search, generating validity\-aware support paths and a hybrid context for answer generation\. This design separates lexical recall from evidence reconstruction, enabling bounded query\-time reasoning over long conversational histories\. Experiments on long\-conversation query\-answering \(QA\) benchmarks show that TRACE improves temporal and multi\-hop reasoning, with ablations highlighting the importance of hierarchy, update\-aware seeding, and path\-grounded evidence\. Source code is provided at https://github\.com/MorinWang/TRACE\.
## I Introduction
Conversational data is now a permanent fixture in personal assistants, enterprise copilots, customer service systems, and autonomous AI agents\[[44](https://arxiv.org/html/2607.00339#bib.bib31),[40](https://arxiv.org/html/2607.00339#bib.bib32),[28](https://arxiv.org/html/2607.00339#bib.bib2),[39](https://arxiv.org/html/2607.00339#bib.bib3)\]\. Unlike static documents, this data captures shifting user states: preferences evolve, plans change, commitments expire, and new messages often update or contradict older ones\. Querying such data requires more than retrieving relevant text spans: a system must identify which facts are still valid, which are outdated history, and how scattered pieces of evidence connect\. These are precisely the concerns addressed by temporal databases and stream processing\[[5](https://arxiv.org/html/2607.00339#bib.bib51),[2](https://arxiv.org/html/2607.00339#bib.bib54),[16](https://arxiv.org/html/2607.00339#bib.bib38),[36](https://arxiv.org/html/2607.00339#bib.bib26)\], yet current conversational memory systems lack analogous mechanisms for valid\-time tracking, update propagation, and versioned\-state querying\. Long\-term conversational QA is therefore a data\-management challenge over temporal, heterogeneous, continuously evolving interaction histories\. Failing to address it leads to hallucinated answers grounded in stale context, missed updates, or incoherent reasoning over fragmented evidence, ultimately undermining user trust and system reliability in deployed applications\.
The core difficulty stems from a mismatch between how conversational data is stored and how it is queried\. Interactions arrive as a linear, timestamped stream of messages, while many questions require a reconstructed state\. A user may share a travel plan, later change the destination, and eventually ask about the current itinerary or the reasoning behind the decision\. Answering accurately depends on connecting events scattered across sessions: the initial statement, the subsequent update, and the final outcome\. A chunk\-based retriever can pull a semantically relevant but obsolete message, missing the critical update that changes the answer\. In database terms, this is a query\-processing failure: the access path induced by semantic similarity reaches topical evidence, but not necessarily the authoritative evidence for the queried state\[[35](https://arxiv.org/html/2607.00339#bib.bib49),[18](https://arxiv.org/html/2607.00339#bib.bib50)\]\. The challenge therefore goes beyond relevance estimation; it requires temporal validity, evidence links, provenance, and conflict handling, echoing database settings where queries must return reliable answers over incomplete or inconsistent evidence\[[3](https://arxiv.org/html/2607.00339#bib.bib46),[9](https://arxiv.org/html/2607.00339#bib.bib39),[19](https://arxiv.org/html/2607.00339#bib.bib45),[15](https://arxiv.org/html/2607.00339#bib.bib71),[42](https://arxiv.org/html/2607.00339#bib.bib72)\]\.
Existing approaches only partially address this challenge\. Dense retrieval and retrieval\-augmented generation rank independent passages by relevance rather than state validity\[[22](https://arxiv.org/html/2607.00339#bib.bib10),[24](https://arxiv.org/html/2607.00339#bib.bib9)\]\. Temporal retrieval adds time constraints, but often keeps memory units as disconnected vector objects rather than records linked by update, contradiction, and provenance relationships\[[32](https://arxiv.org/html/2607.00339#bib.bib43)\]\. Long\-term memory systems cache, summarize, or organize histories into notes and hierarchies\[[44](https://arxiv.org/html/2607.00339#bib.bib31),[40](https://arxiv.org/html/2607.00339#bib.bib32)\], improving recall and compression while potentially blurring provenance and obscuring the update trail needed for temporally grounded answers\. Graph\-enhanced retrieval and database keyword search derive explicit structures over textual, graph\-shaped, structured, and semi\-structured data\[[25](https://arxiv.org/html/2607.00339#bib.bib20),[20](https://arxiv.org/html/2607.00339#bib.bib62),[11](https://arxiv.org/html/2607.00339#bib.bib63),[37](https://arxiv.org/html/2607.00339#bib.bib64)\], but these structures are usually used to locate relevant content rather than to model query\-time temporal state\. As a result, current architectures can retrieve content that appears semantically relevant without ensuring that it remains authoritative, up to date, and consistent with later conversation history\.
We argue that an ideal data model for evolving conversational QA must satisfy three requirements\. First, it must preserve a *multi\-granularity topology*: events, sessions, and topics should all be queryable units, since answers often traverse several levels\. This connects conversational memory to graph and path\-query settings where answers are assembled from related records rather than from a single tuple or document\[[30](https://arxiv.org/html/2607.00339#bib.bib65),[45](https://arxiv.org/html/2607.00339#bib.bib66),[26](https://arxiv.org/html/2607.00339#bib.bib68)\]\. Second, it must capture *state evolution*: updates and contradictions should be modeled explicitly so obsolete facts can be discounted without being deleted, consistent with temporal data and uncertain\-lineage paradigms\[[16](https://arxiv.org/html/2607.00339#bib.bib38),[36](https://arxiv.org/html/2607.00339#bib.bib26),[6](https://arxiv.org/html/2607.00339#bib.bib69)\]\. Third, it must support *evidence reconstruction*: query processing should surface coherent support paths rather than isolated snippets, allowing answers to be traced to the records and transformations that justify them\[[9](https://arxiv.org/html/2607.00339#bib.bib39),[19](https://arxiv.org/html/2607.00339#bib.bib45),[12](https://arxiv.org/html/2607.00339#bib.bib70)\]\. Meeting these requirements means reconciling temporal and graph semantics with vector retrieval under bounded online overhead, echoing adaptive query processing under changing evidence\[[4](https://arxiv.org/html/2607.00339#bib.bib27),[17](https://arxiv.org/html/2607.00339#bib.bib29)\]\.
To address these challenges, we introduce TRACE, a state\-aware query processing framework over temporal evidence graphs for conversational data\. TRACE constructs a unified evidence graph from long\-running conversations, where state\-bearing events are connected by typed temporal, causal, update, and contradiction relations, while sessions and topics serve as hierarchical structural vertices\. At query time, TRACE couples message\-level retrieval with graph\-guided evidence navigation: the text channel preserves recall over raw dialogue, while the temporal graph yields validity\-aware paths that dictate how retrieved evidence should be reconciled\. This design preserves the fidelity of the original conversational text while leveraging a structured evidence layer to reason about current states, historical facts, and multi\-hop dependencies\.
This paper makes the following key contributions:
- We formalize query\-time state reconstruction as a foundational bottleneck in long\-term conversational QA, demonstrating analytically and empirically why relevance\-centric retrieval fails when underlying facts continuously evolve\.
- We introduce a temporal evidence graph that unifies events, sessions, and topics into a coherent hierarchical topology, featuring typed semantic edges, temporal validity annotations for obsolete facts, and explicit modeling primitives for updates and contradictions\.
- We design and implement TRACE, a modular offline\-online architecture that integrates dense vector retrieval, hierarchical seed expansion, validity\-aware path traversal, multi\-signal fusion scoring, and hybrid context assembly for bounded, efficient query\-time reasoning\.
- We conduct an extensive empirical study on long\-conversation QA benchmarks\. Through rigorous ablations, we analyze the impact of graph evidence, structural hierarchy, and query\-adaptive reasoning across temporal, multi\-hop, open\-domain, and preference\-driven workloads\.
## II Methodology
TRACE maintains two complementary memory structures: a *note memory* $\mathcal{N}$ for lexical recall of raw conversation text, and a *temporal evidence graph* $G^{\star}$ for structured, state\-aware reasoning over evolving facts\. The note memory preserves what was said; the evidence graph tracks what still holds and how facts connect causally across sessions\.
Figure 1: Offline graph construction\. Beginning with session\-level summaries derived from note memory $\mathcal{N}$, TRACE produces the evidence graph $G^{\star}$ through four sequential steps: session\-level event distillation, cross\-session linking via candidate gating, temporal and evolution edge inference, and hierarchy injection\.

### II\-A Problem Formulation
Let a conversation be a sequence of sessions
$$\mathcal{C}=\langle S_{t}\rangle_{t=1}^{T},\quad S_{t}=\langle m_{t,i}\rangle_{i=1}^{n_{t}}, \tag{1}$$

where each message $m_{t,i}$ carries a speaker, content, and timestamp\. Given a query $q$ posed after the conversation, the task is to produce an answer $\hat{y}$ grounded in evidence from $\mathcal{C}$\. Unlike standard retrieval\-augmented QA, evidence here must be interpreted under temporal evolution: if two retrieved passages conflict, the system must determine which reflects the current state rather than treating both as equally valid\. The complete pipeline is summarized in Algorithm [1](https://arxiv.org/html/2607.00339#alg1)\.
Algorithm 1: TRACE Pipeline
Input: Conversation $\mathcal{C}=\langle S_{t}\rangle_{t=1}^{T}$, memory $\mathcal{N}$, query $q$
Output: Answer $\hat{y}$
1: // Offline: Graph Construction \(§[II\-C](https://arxiv.org/html/2607.00339#S2.SS3)\)
2: for each session $S_{t}$ do
3: $(V_{t}^{e},E_{t}^{r})\leftarrow D_{\mathrm{event}}(S_{t})$ ▷ Eq\. \([9](https://arxiv.org/html/2607.00339#S2.E9)\)
4: end for
5: Cross\-session linking via candidate gate $A(e)$ ▷ Eq\. \([10](https://arxiv.org/html/2607.00339#S2.E10)\)
6: Infer temporal & evolution edges $\mathcal{R}_{\mathrm{sup}}\cup\mathcal{R}_{\mathrm{evo}}$
7: Inject hierarchy: $E_{h}\leftarrow\mathrm{Inc}(\mathcal{E}_{S})\cup\mathrm{Inc}(\mathcal{E}_{T})$ ▷ Eq\. \([8](https://arxiv.org/html/2607.00339#S2.E8)\)
8: // Validity Propagation \(§[II\-D](https://arxiv.org/html/2607.00339#S2.SS4)\)
9: for each $e_{n}\xrightarrow{\texttt{updates}}e_{o}$ do
10: $v(e_{o})\leftarrow 0$; propagate to causal dependents ▷ Eq\. \([11](https://arxiv.org/html/2607.00339#S2.E11)\), \([12](https://arxiv.org/html/2607.00339#S2.E12)\)
11: end for
12: // Online: Query Processing \(§[II\-E](https://arxiv.org/html/2607.00339#S2.SS5)\)
13: $V_{q}\leftarrow V_{0}(q)\cup V_{\mathrm{ent}}\cup V_{\mathrm{upd}}\cup V_{\mathrm{hier}}$ ▷ Eq\. \([13](https://arxiv.org/html/2607.00339#S2.E13)\), \([14](https://arxiv.org/html/2607.00339#S2.E14)\)
14: $\Pi\leftarrow$ bounded bidirectional traversal from $V_{q}$ with pruning
15: for each path $\pi\in\Pi$ do
16: $S(\pi)\leftarrow\alpha\,\mathit{NS}+\beta\,\mathit{PC}+\gamma\,\mathit{TC}+\delta\,\mathit{UV}$ ▷ Eq\. \([16](https://arxiv.org/html/2607.00339#S2.E16)\)
17: end for
18: // Context Assembly & Generation \(§[II\-F](https://arxiv.org/html/2607.00339#S2.SS6)\)
19: Assemble $C_{\mathrm{note}}$, $C_{\mathrm{path}}$ under budget $B_{\mathrm{total}}$ ▷ Eq\. \([21](https://arxiv.org/html/2607.00339#S2.E21)\)
20: Select answer plan $p_{q}$ by query shape
21: $\hat{y}\leftarrow\mathcal{A}(q,\,C_{\mathrm{note}},\,C_{\mathrm{path}},\,p_{q})$ ▷ Eq\. \([3](https://arxiv.org/html/2607.00339#S2.E3)\)
22: return $\hat{y}$
TRACE maintains a composite memory state
$$\mathcal{M}=(\mathcal{N},\,G^{\star}). \tag{2}$$

$\mathcal{N}$ preserves original conversational text and supports dense retrieval; $G^{\star}$ encodes how facts relate to, support, and supersede one another over time\. A query is answered by combining relevant information drawn from both:

$$\hat{y}=\mathcal{A}(q,\,C_{\mathrm{note}},\,C_{\mathrm{path}},\,p_{q}). \tag{3}$$

$C_{\mathrm{note}}$ is the lexical context from $\mathcal{N}$ providing direct textual grounding in the original user utterances, $C_{\mathrm{path}}$ is the path\-based evidence from $G^{\star}$ providing relational structure and temporal validity status, and $p_{q}$ is a query\-shape\-specific answer plan\. The separation is deliberate: the note memory keeps answer generation grounded in original conversational text, while the evidence graph ensures that this retrieved evidence is interpreted correctly with respect to temporal evolution and supersession relationships\.
### II\-B Temporal Evidence Graph
A flat list of extracted events suffices for single\-hop factual lookup, but long\-conversation queries frequently require reasoning over chains of related facts\. Understanding that event A caused event B, which was later updated by event C, demands explicit relational structure\. Moreover, queries operate at different granularities: some ask about a specific event, others about a session’s outcome, and others about a topic spanning multiple sessions\. TRACE therefore represents evidence as a typed, directed graph with three vertex populations:
$$G=(V,E),\quad V=V_{e}\cup V_{s}\cup V_{t},\quad E=E_{r}\cup E_{h}. \tag{4}$$

*Event vertices* $V_{e}$ are the atomic units of factual content; *session vertices* $V_{s}$ preserve conversational boundaries; *topic vertices* $V_{t}$ group semantically related sessions and enable cross\-session traversal\. $E_{r}$ contains *semantic relation edges* among events, while $E_{h}$ contains *hierarchical membership edges* connecting events to sessions and sessions to topics\.
#### Event schema
Each event vertex encodes the minimal information needed for both reasoning and answer generation:
$$e=(\mathit{id},\,\tau,\,P,\,a,\,x,\,N,\,\rho,\,u,\,v). \tag{5}$$

The *reasoning fields* include the event type $\tau\in\{\texttt{action},\texttt{state\_change},\texttt{preference},\texttt{plan}\}$, the participant set $P$, time anchor $a$, and description $x$, which together support graph traversal and temporal comparison\. The *grounding fields* include source note set $N$ and provenance $\rho$, which associate each event with the note evidence used during construction and support later retrieval and audit\. The *validity fields* include valid\-until timestamp $u$ and validity score $v$: $u$ is initially unset, while $v$ is initialized as valid and updated by the validity semantics\. This schema keeps the graph compact \(one vertex per state\-bearing fact rather than per message\) while maintaining full traceability\.
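As an illustration of the schema in Eq. (5), the tuple can be written as a small record type. This is a minimal sketch rather than the authors' implementation: the class name `Event`, the Python field names, the use of a float for timestamps, and the type check are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Event types listed in the schema (tau).
EVENT_TYPES = {"action", "state_change", "preference", "plan"}


@dataclass
class Event:
    """Hypothetical record mirroring e = (id, tau, P, a, x, N, rho, u, v)."""
    id: str
    type: str                              # tau: event type
    participants: FrozenSet[str]           # P: participant set
    anchor: Optional[float]                # a: time anchor, None if unparseable
    description: str                       # x: natural-language description
    notes: FrozenSet[str] = frozenset()    # N: source note identifiers
    provenance: str = ""                   # rho: provenance
    valid_until: Optional[float] = None    # u: initially unset
    validity: float = 1.0                  # v: initialized as valid

    def __post_init__(self) -> None:
        if self.type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.type!r}")


plan = Event(
    id="e1",
    type="plan",
    participants=frozenset({"alice"}),
    anchor=1.0,
    description="Alice plans a trip to Rome",
    notes=frozenset({"n7"}),
    provenance="session 3 summary",
)
```

Keeping `valid_until` and `validity` on the record itself means the validity semantics can later update an event in place without touching its grounding fields.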
#### Semantic edges
Two events connected by a semantic edge share a logical or temporal dependency:
$$r_{ij}=(e_{i},\,e_{j},\,\ell,\,c,\,\eta),\quad\ell\in\mathcal{R},\quad c\in[0,1]. \tag{6}$$

Here $\ell$ is the relation label, $c$ is a confidence score, and $\eta$ is a natural\-language justification supporting interpretability\. The relation vocabulary is partitioned into two families with distinct downstream semantics:

$$\mathcal{R}=\mathcal{R}_{\mathrm{sup}}\cup\mathcal{R}_{\mathrm{evo}}. \tag{7}$$

The *support family* $\mathcal{R}_{\mathrm{sup}}=\{\texttt{causes},\texttt{enables},\texttt{prevents},\texttt{before}\}$ captures how facts causally or temporally depend on one another within the memory graph\. These edges carry information forward in time and form the backbone of multi\-hop reasoning chains\. The *evolution family* $\mathcal{R}_{\mathrm{evo}}=\{\texttt{updates},\texttt{contradicts}\}$ captures where the conversational state changes\. These edges carry the temporal validity signals needed to distinguish current from obsolete information\. The distinction matters because the two families are treated differently during both validity propagation and query\-time traversal: support edges form reasoning chains, while evolution edges trigger state transitions\.
Figure 2: Online query processing pipeline\. Given a query $q$, TRACE performs seed selection, bounded bidirectional path generation with pruning, evidence ranking, hybrid context assembly with answer planning, answer generation from the assembled context, and optional post\-hoc evidence tracing\.
#### Hierarchical structure
The event–session–topic hierarchy is conceptually a nested hypergraph, but TRACE realizes it as a bipartite incidence expansion so that standard graph traversal operators can apply directly and uniformly without requiring any special\-case hyperedge handling:
$$E_{h}=\mathrm{Inc}(\mathcal{E}_{S})\cup\mathrm{Inc}(\mathcal{E}_{T}). \tag{8}$$

$\mathrm{Inc}(S)$ inserts directed membership edges between each event $e\in S$ and the corresponding session vertex $\sigma_{S}$; $\mathrm{Inc}(T)$ connects each session $\sigma\in T$ with the topic vertex $\theta_{T}$\. This gives TRACE a uniform traversal substrate: a query can descend from a topic to its sessions and then to specific events, or ascend from a seed event to discover related evidence elsewhere\. The three granularities correspond to three common query patterns: event\-level for precise factual lookup, session\-level for broader contextual understanding, and topic\-level for preference aggregation or longitudinal trend tracking across sessions\.
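The incidence expansion of Eq. (8) can be sketched with plain dictionaries. The sketch below makes several assumptions not stated in the text: edges point from member to container (event to session, session to topic), the edge label `member_of` and both function names are hypothetical, and each event belongs to exactly one session and each session to exactly one topic.

```python
def incidence_edges(session_events, topic_sessions):
    """Expand session and topic groupings into ordinary directed edges."""
    edges = []
    for session, events in session_events.items():
        edges += [(event, "member_of", session) for event in events]
    for topic, sessions in topic_sessions.items():
        edges += [(session, "member_of", topic) for session in sessions]
    return edges


def related_events(seed, edges):
    """Ascend event -> session -> topic, then descend to all other events."""
    parent = {child: par for child, _, par in edges}
    children = {}
    for child, _, par in edges:
        children.setdefault(par, []).append(child)
    topic = parent[parent[seed]]
    return [
        event
        for session in children[topic]
        for event in children.get(session, [])
        if event != seed
    ]


session_events = {"s1": ["e1", "e2"], "s2": ["e3"], "s3": ["e4"]}
topic_sessions = {"travel": ["s1", "s2"], "work": ["s3"]}
h_edges = incidence_edges(session_events, topic_sessions)
siblings = related_events("e1", h_edges)
```

Because membership is stored as ordinary edges, the ascent and descent in `related_events` use the same lookups as any other traversal, which is the point of replacing hyperedges with an incidence expansion.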
### II\-C Offline Graph Construction
As illustrated in Fig\. [1](https://arxiv.org/html/2607.00339#S2.F1), the offline stage transforms session\-level textual summaries derived from the note memory $\mathcal{N}$ into a faithful, well\-connected evidence graph $G^{\star}$ through four operators applied in sequence\.
#### Step 1: Session\-level event distillation
The first operator extracts event vertices and intra\-session edges from each session\-level textual unit:
$$(V_{t}^{e},\,E_{t}^{r})=D_{\mathrm{event}}(S_{t}). \tag{9}$$

The operator runs on the session\-level textual summary of $S_{t}$ and retains source\-note identifiers linking extracted events back to $\mathcal{N}$\. This granularity provides enough local context to resolve participants, temporal references, and short\-range causal relations, while avoiding the fragmentation of message\-level extraction and the context\-window limits and long\-range ambiguity of conversation\-level extraction\.
#### Step 2: Cross\-session linking
Local subgraphs alone cannot answer queries requiring connections across sessions \(e\.g\., “Did Alice follow through on the plan she mentioned last month?”\)\. However, naively considering all cross\-session event pairs is $O(|V_{e}|^{2})$ and produces a dense graph dominated by spurious connections\. TRACE introduces a candidate gate that restricts cross\-session relation inference to plausible pairs:

$$A(e)=\{e' : J(P_{e},P_{e'})\geq\theta_{p}\;\wedge\;\Gamma(e')\in\Omega(\Gamma(e))\}. \tag{10}$$

$J$ is participant Jaccard overlap and $\Gamma(e)$ maps an event to its topic cluster\. The participant constraint enforces a necessary condition for state evolution: one person’s preference change is irrelevant to another person’s plan, so events involving disjoint participants rarely need direct edges\. The topic constraint prevents linking superficially similar events from unrelated threads\. Together, these two lightweight filters reduce the candidate space to a sparse, semantically plausible subset without requiring expensive embedding comparisons\.
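The gate in Eq. (10) reduces to two set tests per candidate pair, as the sketch below shows. The threshold value `theta_p=0.5`, the function names, and the representation of $\Omega$ as a dictionary from a topic cluster to its admissible clusters are assumptions for illustration; the text does not specify them. The sketch also omits the check that a pair spans two different sessions.

```python
def jaccard(a, b):
    """Jaccard overlap of two participant sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_gate(e, participants, topic_of, omega, theta_p=0.5):
    """Return events e' that pass both the participant and the topic filter."""
    allowed_topics = omega[topic_of[e]]          # Omega(Gamma(e))
    return [
        other
        for other in participants
        if other != e
        and jaccard(participants[e], participants[other]) >= theta_p
        and topic_of[other] in allowed_topics
    ]


participants = {
    "e1": {"alice"},
    "e2": {"alice", "bob"},
    "e3": {"carol"},
    "e4": {"alice"},
}
topic_of = {"e1": "travel", "e2": "travel", "e3": "travel", "e4": "work"}
omega = {"travel": {"travel"}, "work": {"work"}}
candidates = candidate_gate("e1", participants, topic_of, omega)
```

In this toy input only `e2` survives: `e3` shares no participant with `e1`, and `e4` matches on participants but sits in an unrelated topic cluster.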
#### Step 3: Temporal and evolution edge inference
For each candidate pair passing the gate, TRACE determines the appropriate edge type\. When both events have parseable time anchors, a deterministic `before` edge is added based on chronological ordering\. For remaining candidates, TRACE infers whether a support or evolution relation exists\. This is where the graph acquires its state\-tracking capability: `updates` edges identify where newer information supersedes older facts, and `contradicts` edges flag unresolved conflicts\.
#### Step 4: Hierarchy injection
Finally, TRACE injects session and topic vertices into the graph using the incidence expansion of Eq\. \([8](https://arxiv.org/html/2607.00339#S2.E8)\) and adds chronological edges between adjacent session vertices\. This step adds no new factual content; it purely adds navigational structure that enables multi\-granularity traversal at query time\.
### II\-D Validity Semantics
The construction stage produces `updates` and `contradicts` edges, but these edges alone do not tell the query processor which events to trust\. Validity semantics translate evolution edges into per\-event state annotations that can be evaluated during traversal\.
Each event is initialized as fully valid with $v(e)=1$, and $u(e)$ remains unconstrained until an `updates` edge is encountered\. When a newer event $e_{n}$ is connected to an older event $e_{o}$ via an `updates` edge, TRACE performs:

$$u(e_{o})=a(e_{n}),\quad v(e_{o})=0. \tag{11}$$

The older event is marked superseded as of the update time\. For current\-state queries, $e_{o}$ will be deprioritized; for historical queries asking about the state before $a(e_{n})$, the event remains available because the valid\-until timestamp records exactly when supersession occurred\.
Validity propagates one hop further: if an event $e$ depends on a superseded event through a `causes` or `enables` edge, its validity is reduced to $v(e)=0.5$\. The intuition is that a consequence of an obsolete cause is not necessarily false \(it may have already occurred\), but its evidential reliability is diminished\. The resulting three\-valued scheme, with $1$ denoting currently valid, $0.5$ denoting partially suspect or uncertain, and $0$ denoting fully superseded, avoids the complexity of probabilistic belief propagation while still providing enough granularity for practically useful distinctions at query time\.
Updates also invalidate strong causal edges incident on superseded events:
$$\iota(e_{o}\xrightarrow{\ell}e)=a(e_{n}),\quad\ell\in\{\texttt{causes},\texttt{prevents}\}. \tag{12}$$

Traversal at query time skips edges with a set invalidation timestamp, preventing reasoning chains from passing through obsolete causal links\. Importantly, `contradicts` edges behave differently: they do not modify validity scores but are retained as explicit conflict signals and penalized during path scoring\. This asymmetry reflects a semantic distinction\. An update resolves a state change where the newer version is authoritative, while a contradiction records disagreement without resolution\. Silently invalidating one side of a contradiction would inject an unsupported judgment\.
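The update rules of Eqs. (11) and (12), together with the one-hop propagation, can be sketched as a single function over dictionary-based events and edges. The data layout, the function name `apply_update`, and the key `invalidated_at` are hypothetical. Using `min` so that an already superseded dependent is not raised back to 0.5 is also an assumption, since the text does not cover that case.

```python
def apply_update(events, edges, new_id, old_id):
    """Apply one `updates` edge from events[new_id] to events[old_id]."""
    t_update = events[new_id]["anchor"]

    # Eq. (11): the older event is superseded as of the update time.
    events[old_id]["valid_until"] = t_update
    events[old_id]["validity"] = 0.0

    for edge in edges:
        if edge["src"] != old_id:
            continue
        # One-hop propagation: dependents via causes/enables become suspect.
        if edge["label"] in ("causes", "enables"):
            dependent = events[edge["dst"]]
            dependent["validity"] = min(dependent["validity"], 0.5)
        # Eq. (12): strong causal edges leaving the old event are invalidated.
        if edge["label"] in ("causes", "prevents"):
            edge["invalidated_at"] = t_update


events = {
    name: {"anchor": anchor, "validity": 1.0, "valid_until": None}
    for name, anchor in [("old", 1.0), ("new", 5.0), ("c", 2.0), ("en", 2.0), ("p", 2.0)]
}
edges = [
    {"src": "old", "dst": "c", "label": "causes"},
    {"src": "old", "dst": "en", "label": "enables"},
    {"src": "old", "dst": "p", "label": "prevents"},
]
apply_update(events, edges, new_id="new", old_id="old")
```

Note that the two label sets differ: `enables` lowers the dependent's validity but its edge stays traversable, while `prevents` invalidates the edge without lowering the dependent's validity, following the text as written.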
### II\-E Online Query Processing
With the evidence graph fully constructed and validity annotations in place, TRACE answers queries by combining lexical recall from $\mathcal{N}$ with structured traversal over $G^{\star}$, as illustrated in Fig\. [2](https://arxiv.org/html/2607.00339#S2.F2)\.
#### Seed selection
Bridging from an unstructured query to the structured graph is the first challenge\. TRACE uses dense retrieval over $\mathcal{N}$ to obtain a top\-$k$ note set $R_{k}(q)$, then maps these to event vertices:

$$V_{0}(q)=\{e : N(e)\cap R_{k}(q)\neq\emptyset\}. \tag{13}$$

Vector similarity alone has well\-known blind spots, including lexical mismatch, failure to retrieve updated versions, and inability to recover cross\-session evidence with different vocabulary\. TRACE compensates by augmenting $V_{0}$:

$$V_{q}=V_{0}\cup V_{\mathrm{ent}}\cup V_{\mathrm{upd}}\cup V_{\mathrm{hier}}. \tag{14}$$

$V_{\mathrm{ent}}$ comes from entity and participant lookup, $V_{\mathrm{upd}}$ adds one\-hop neighbors along evolution edges to ensure that both old and new versions are properly considered, and $V_{\mathrm{hier}}$ adds events from topically related sessions via the hierarchy\. Each expansion family targets a distinct retrieval failure mode\.
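Eqs. (13) and (14) amount to four set constructions followed by a union. The following sketch assumes the dense retriever has already returned the note identifiers in $R_{k}(q)$ and that query entities have been extracted; the function name, argument layout, and the reading of "topically related sessions" as "sessions sharing a topic with a seed event" are assumptions.

```python
def select_seeds(retrieved_notes, query_entities, event_notes,
                 event_participants, evo_neighbors, event_session, session_topic):
    """Build V_q = V_0 | V_ent | V_upd | V_hier from precomputed lookups."""
    # Eq. (13): events whose source notes intersect the retrieved note set.
    v0 = {e for e, notes in event_notes.items() if notes & retrieved_notes}
    # Entity/participant lookup.
    v_ent = {e for e, people in event_participants.items() if people & query_entities}
    # One-hop neighbors along updates/contradicts edges.
    v_upd = {n for e in v0 for n in evo_neighbors.get(e, ())}
    # Events in sessions that share a topic with any initial seed.
    seed_topics = {session_topic[event_session[e]] for e in v0}
    v_hier = {e for e, s in event_session.items() if session_topic[s] in seed_topics}
    return v0 | v_ent | v_upd | v_hier


seeds = select_seeds(
    retrieved_notes={"n1"},
    query_entities={"bob"},
    event_notes={"e1": {"n1"}, "e2": {"n2"}, "e3": {"n3"}, "e4": {"n4"}, "e5": {"n5"}},
    event_participants={"e1": {"alice"}, "e2": {"alice"}, "e3": {"bob"},
                        "e4": {"carol"}, "e5": {"dave"}},
    evo_neighbors={"e1": {"e2"}},
    event_session={"e1": "s1", "e2": "s2", "e3": "s3", "e4": "s4", "e5": "s5"},
    session_topic={"s1": "travel", "s2": "misc", "s3": "misc",
                   "s4": "travel", "s5": "work"},
)
```

Here dense retrieval alone finds only `e1`; the entity lookup adds `e3`, the evolution neighbor adds `e2`, and the hierarchy adds `e4`, while the unrelated `e5` stays out.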
#### Path generation
From $V_{q}$, TRACE performs bounded bidirectional traversal to generate candidate evidence paths:

$$\pi=\langle e_{1},\ell_{1},e_{2},\ldots,\ell_{L-1},e_{L}\rangle. \tag{15}$$

Each path represents a chain of evidence connecting multiple facts through explicit logical relations\. Three hard pruning rules apply during traversal: invalidated causal edges are never crossed because they represent broken reasoning links; paths violating parseable timestamps are rejected to prevent backward causation; and paths with consecutive `contradicts` edges are removed because chaining two unresolved conflicts produces no reliable inference\.
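The three pruning rules can be illustrated with a bounded depth-first enumeration. This sketch is a simplification rather than the traversal described above: it expands forward along outgoing edges only instead of bidirectionally, the hop bound `max_hops=3` is an arbitrary choice, and applying the timestamp check only to support-family edges is an assumption (an `updates` edge naturally points from a newer event to an older one).

```python
SUPPORT = {"causes", "enables", "prevents", "before"}


def generate_paths(seeds, out_edges, anchor, max_hops=3):
    """Enumerate paths <e1, l1, e2, ...> of at most max_hops edges."""
    paths = []

    def extend(path, last_label):
        if len(path) > 1:
            paths.append(tuple(path))
        if len(path) // 2 >= max_hops:           # hop bound reached
            return
        tail = path[-1]
        visited = set(path[::2])                 # events sit at even positions
        for edge in out_edges.get(tail, ()):
            label, dst = edge["label"], edge["dst"]
            if dst in visited:
                continue
            # Rule 1: never cross an invalidated edge.
            if edge.get("invalidated_at") is not None:
                continue
            # Rule 2: reject support edges that run backward in time.
            t_src, t_dst = anchor.get(tail), anchor.get(dst)
            if label in SUPPORT and None not in (t_src, t_dst) and t_src > t_dst:
                continue
            # Rule 3: no two consecutive contradicts edges.
            if label == "contradicts" and last_label == "contradicts":
                continue
            extend(path + [label, dst], label)

    for seed in sorted(seeds):
        extend([seed], None)
    return paths


out_edges = {
    "a": [{"label": "causes", "dst": "b"},
          {"label": "causes", "dst": "x", "invalidated_at": 5}],
    "b": [{"label": "contradicts", "dst": "c"},
          {"label": "before", "dst": "old"}],
    "c": [{"label": "contradicts", "dst": "d"}],
}
anchor = {"a": 1, "b": 2, "c": 3, "d": 4, "old": 0, "x": 2}
paths = generate_paths({"a"}, out_edges, anchor)
```

Each rule removes exactly one candidate in this toy graph: the invalidated edge to `x`, the backward `before` edge to `old`, and the second `contradicts` hop to `d`.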
#### Evidence ranking
Each surviving path is then scored by a combination of four signals:
$$S(\pi)=\alpha\,\mathit{NS}(\pi)+\beta\,\mathit{PC}(\pi)+\gamma\,\mathit{TC}(\pi)+\delta\,\mathit{UV}(\pi). \tag{16}$$

The default weights $(\alpha,\beta,\gamma,\delta)=(0.4,0.3,0.15,0.15)$ reflect priority ordering: relevance first, structural confidence second, temporal signals for refinement\.
*Neural relevance* averages per\-event similarity to the query:

$$\mathit{NS}(\pi)=L^{-1}\sum_{e\in\pi}\max_{n\in N(e)}\cos(\phi(q),\phi(n)). \tag{17}$$

The average rather than maximum penalizes paths containing one relevant event padded with irrelevant intermediate hops, thereby encouraging compact and focused evidence chains\.
*Path confidence* captures structural reliability:

$$\mathit{PC}(\pi)=\prod_{i=1}^{L-1}c_{i}\,d(\ell_{i})\cdot\bigl(1-0.5\,\mathbf{1}\{\texttt{contradicts}\in\pi\}\bigr). \tag{18}$$

The product form means a single weak link degrades the entire chain, which is appropriate because a reasoning chain is only as strong as its weakest individual step\. The relation discount $d(\ell_{i})$ assigns lower values to weaker or less informative relation types, and the contradiction penalty halves confidence for paths that traverse unresolved or ambiguous conflicts\.
*Temporal consistency* verifies chronological constraints:

$$\mathit{TC}(\pi)=\mathbf{1}\bigl\{a(e_{i})\leq a(e_{j})\ \text{for every}\ e_{i}\xrightarrow{\texttt{before}}e_{j}\in\pi\bigr\}. \tag{19}$$

*Update validity* reflects reliance on superseded information:

$$\mathit{UV}(\pi)=\min_{e\in\pi}v(e). \tag{20}$$

The minimum operator ensures even a single superseded event makes staleness visible\. Paths through obsolete events are not forbidden, as they may be needed for historical queries, but are ranked lower by default\.
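The four signals of Eqs. (16) to (20) can be combined as in the sketch below. It assumes the per-event similarity $\max_{n\in N(e)}\cos(\phi(q),\phi(n))$ has been precomputed and passed in as `query_sim`, that edge `i` connects `events[i]` to `events[i + 1]`, and that the contradiction penalty is applied once per path, as the prose describes. The discount table `d` is a made-up example, since the text gives no values for $d(\ell)$; treating unparseable anchors as consistent is likewise an assumption.

```python
import math


def score_path(events, edges, query_sim, d, weights=(0.4, 0.3, 0.15, 0.15)):
    """Score one path; edges[i] = (label, confidence) links events[i] -> events[i+1]."""
    alpha, beta, gamma, delta = weights

    # Eq. (17): mean per-event relevance.
    ns = sum(query_sim[e["id"]] for e in events) / len(events)

    # Eq. (18): product of confidence times discount, halved once on contradiction.
    pc = math.prod(conf * d.get(label, 1.0) for label, conf in edges)
    if any(label == "contradicts" for label, _ in edges):
        pc *= 0.5

    # Eq. (19): every `before` edge must respect parseable anchors.
    tc = 1.0
    for (label, _), src, dst in zip(edges, events, events[1:]):
        if label == "before" and None not in (src["anchor"], dst["anchor"]):
            if src["anchor"] > dst["anchor"]:
                tc = 0.0

    # Eq. (20): the least valid event on the path.
    uv = min(e["validity"] for e in events)

    return alpha * ns + beta * pc + gamma * tc + delta * uv


events = [
    {"id": "e1", "anchor": 1, "validity": 1.0},
    {"id": "e2", "anchor": 2, "validity": 0.0},   # superseded event
]
edges = [("before", 0.9)]
score = score_path(events, edges, query_sim={"e1": 0.8, "e2": 0.6}, d={"before": 0.5})
```

For this path, NS = 0.7, PC = 0.9 × 0.5 = 0.45, TC = 1, and UV = 0, so the default weights give 0.28 + 0.135 + 0.15 + 0 = 0.565. The superseded event costs the path its full update-validity contribution but does not remove it from the ranking.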
### II\-F Context Assembly and Answer Planning
The evidence graph produces structured paths, but the final answer must ultimately be generated by a language model consuming text\. TRACE assembles a hybrid context with two sections: $C_{\mathrm{note}}$ contains top dense\-retrieval results providing verbatim grounding, and $C_{\mathrm{path}}$ contains top\-ranked graph paths rendered as compact support chains with explicit relation labels and validity markers\.
Budget is allocated between sections:
$$B'_{\mathrm{path}}=B_{\mathrm{path}}+\max(0,\,B_{\mathrm{note}}-|C_{\mathrm{note}}|), \tag{21}$$

$$|C_{\mathrm{note}}|+|C_{\mathrm{path}}|\leq B_{\mathrm{total}}. \tag{22}$$

When dense retrieval returns few relevant notes \(common for cross\-session queries where relevant text is distributed thinly across the history\), the freed budget is reallocated to additional evidence paths providing the structural context that the notes alone lack\. Items are kept as atomic units during truncation, and repeated identifiers are deduplicated\.
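The reallocation in Eq. (21) can be sketched as two greedy packing passes. Several details here are assumptions rather than statements from the text: sizes are counted in whitespace-separated words as a stand-in for tokens, packing stops at the first ranked item that does not fit, and $B_{\mathrm{total}}=B_{\mathrm{note}}+B_{\mathrm{path}}$, under which Eq. (22) holds automatically.

```python
def pack(items, budget):
    """Greedily keep whole (id, text) items in rank order within the budget."""
    chosen, used, seen = [], 0, set()
    for item_id, text in items:
        if item_id in seen:                 # deduplicate repeated identifiers
            continue
        cost = len(text.split())            # word count as a token proxy
        if used + cost > budget:            # items are atomic: never cut one
            break
        chosen.append((item_id, text))
        seen.add(item_id)
        used += cost
    return chosen, used


def assemble_context(notes, paths, b_note, b_path):
    """Fill the note section first, then give unused note budget to paths."""
    c_note, note_used = pack(notes, b_note)
    b_path_prime = b_path + max(0, b_note - note_used)   # Eq. (21)
    c_path, _ = pack(paths, b_path_prime)
    return c_note, c_path, b_path_prime


notes = [("n1", "a b c"), ("n1", "a b c"), ("n2", "d e")]
paths = [("p1", " ".join(["w"] * 8)), ("p2", " ".join(["w"] * 6)), ("p3", "x y z")]
c_note, c_path, b_path_prime = assemble_context(notes, paths, b_note=10, b_path=10)
```

The notes use only 5 of their 10 units, so the path budget grows from 10 to 15 and the second path (8 + 6 = 14 words) fits where it otherwise would not.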
TRACE selects an answer plan $p_{q}$ based on the detected query shape\. Temporal queries receive date\-normalization instructions directing the model to compare timestamps before asserting currentness\. Preference queries receive a preference\-profiling plan that aggregates evolving statements over time\. Multi\-hop queries receive chain\-of\-evidence prompts closely following path structure\. Factual queries receive a concise plan prioritizing directness and brevity\. This decouples the what\-is\-available view \(evidence graph\) from the how\-to\-reason instruction \(answer plan\), allowing the same evidence to be consumed under different reasoning constraints\.
### II\-GAnswer\-Aware Evidence Tracing
After answer generation, users or downstream systems may need to verify provenance\. TRACE provides an optional post\-hoc tracing stage using the generated answer as an additional retrieval signal:
$$S_{\mathrm{sup}}(\pi)=\lambda_{a}\,\mathit{AS}(\pi)+\lambda_{q}\,\mathit{QS}(\pi)+\lambda_{c}\,\mathit{PC}(\pi)+\lambda_{t}\,\mathit{TC}(\pi)+b(\pi). \tag{23}$$

Here $\mathit{AS}$ measures similarity between the path and the generated answer to ensure relevance to what was actually said, $\mathit{QS}$ measures similarity to the original query to ensure the trace addresses the question, $\mathit{PC}$ and $\mathit{TC}$ retain their definitions from evidence ranking, and $b(\pi)$ is a bonus for paths passing through events whose descriptions overlap with answer entities\. Candidate paths with no lexical overlap with the query–answer pair are filtered as clearly irrelevant\. This stage does not revise the answer; it operates purely as a post\-hoc explanation mechanism, exposing the evidence chain for auditing whether the answer is faithfully grounded\.
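A minimal sketch of the tracing score in Eq\. \(23\), assuming the five per\-path quantities have already been computed\. The weight values are hypothetical \(the $\lambda$ values are not listed in this section\), and `PathFeatures`, `support_score`, and `trace_support` are illustrative names\.

```python
from dataclasses import dataclass


@dataclass
class PathFeatures:
    path_id: str
    answer_sim: float   # AS: similarity to the generated answer
    query_sim: float    # QS: similarity to the original query
    pc: float           # PC: as defined for evidence ranking
    tc: float           # TC: as defined for evidence ranking
    bonus: float        # b: answer-entity overlap bonus
    has_overlap: bool   # any lexical overlap with the query-answer pair


# Hypothetical weights, chosen only to make the example concrete.
WEIGHTS = {"a": 0.4, "q": 0.2, "c": 0.2, "t": 0.2}


def support_score(f: PathFeatures, w=WEIGHTS) -> float:
    """Eq. (23): weighted sum of the four terms plus the bonus."""
    return (w["a"] * f.answer_sim + w["q"] * f.query_sim
            + w["c"] * f.pc + w["t"] * f.tc + f.bonus)


def trace_support(paths, w=WEIGHTS):
    """Drop paths with no lexical overlap, then rank by support score."""
    kept = [p for p in paths if p.has_overlap]
    return sorted(kept, key=lambda p: support_score(p, w), reverse=True)
```

The filter runs before scoring, so a path with high similarity scores but no lexical overlap never appears in the trace\.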
## III Experiments
We conduct experiments to answer the following questions:
- RQ1: Does TRACE outperform existing memory systems on long\-term conversational QA across diverse question types and dataset scales?
- RQ2: Is TRACE robust across LLM backbones?
- RQ3: How does each component of TRACE \(evidence graph, hierarchy, query\-shape\-aware reasoning\) contribute to the overall performance?
- RQ4: How does TRACE handle superseded facts, contradictions, and chain revisions that stress update\-aware reasoning?
- RQ5: What are the construction costs and online efficiency characteristics of TRACE’s evidence graph?
- RQ6: Does the evidence graph provide interpretable and faithful reasoning traces?
### III\-A Experimental Setup
#### Datasets
We evaluate TRACE on two long\-term conversational QA benchmarks spanning moderate\-length and haystack\-scale dialogue\. LoCoMo \[[28](https://arxiv.org/html/2607.00339#bib.bib2)\] provides a publicly released 10\-conversation subset averaging 24K tokens per dialogue \($\sim$9K over the full 50\-conversation set\) with 1,540 non\-adversarial questions\. LongMemEvalS \[[39](https://arxiv.org/html/2607.00339#bib.bib3)\] contains 500 questions over dialogue histories averaging $\sim$115K tokens, of which 470 are non\-adversarial and span six evidence\-based question types; the original *abstention* \(false\-premise\) type is excluded\. Following prior work, adversarial questions are excluded on both benchmarks\.
#### Stress test benchmark
LoCoMo and LongMemEval evaluate long\-term memory broadly, but the slices that exercise temporal update reasoning are heavily diluted by single\-hop and simple recall questions that do not require any form of state\-aware retrieval\. To probe how a memory system handles superseded facts, contradictions, and chain revisions in isolation \(RQ4\), we construct a dedicated update\-heavy benchmark called the *Stress Test*, which is derived entirely from LoCoMo conversations\. We mine LoCoMo session pairs for typed\-edge candidates carrying update or contradiction signals, yielding 772 raw candidates across four type labels: `chain_modification`, `fact_override`, `contradiction_resolution`, and `temporal_correction`\. From this pool, we apply *coverage\-based representative selection* to retain 100 candidates that maximize coverage over \(sample, type, signal\-pattern\) tuples, and then manually re\-author each against the LoCoMo session summaries and dialog text\. This yields the final 59\-item benchmark in natural descriptive style \(mean 8\.8 words per reference\), preserving both type coverage \(22/18/15/4 across the four update types\) and difficulty coverage \(medium 28, hard 16, easy 15\)\. Edge\-direction ambiguities are resolved by treating the LoCoMo session index as the chronology authority, where a larger index corresponds to a later point in time\.
Each item contains the following fields: the question, the `current_valid_answer`, the `outdated_answer` \(i\.e\., the answer that was correct under an earlier session state but has since been superseded\), an `evidence_path` that records the chain of events and edges connecting the current state to its predecessors, the source LoCoMo sample index, and a difficulty label\. The `outdated_answer` field plays a key role in our evaluation protocol: it allows the temporal\-aware judge to measure `outdated_leakage` separately from `valid_acc`, so that a system returning a previously correct but now superseded answer is penalized rather than rewarded, even though that answer was valid at some earlier point in the conversation\. We note that our selection procedure operates as an engineered heuristic over discrete $(\text{sample},\text{type})$ buckets rather than a formal coreset algorithm; to avoid overstating the theoretical guarantee, we use the term *representative selection* throughout the paper\.
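The selection step can be pictured with the following sketch of one possible coverage heuristic\. It is an assumption for illustration only: the text does not specify the procedure beyond maximizing coverage over \(sample, type, signal\-pattern\) tuples, and the dictionary keys and function names below are hypothetical\.

```python
def coverage_key(candidate):
    """The (sample, type, signal-pattern) tuple a candidate covers."""
    return (candidate["sample"], candidate["type"], candidate["pattern"])


def select_representatives(candidates, k):
    """Keep up to k candidates, preferring ones that cover a new tuple."""
    chosen, covered, leftovers = [], set(), []
    for cand in candidates:
        key = coverage_key(cand)
        if key not in covered and len(chosen) < k:
            covered.add(key)       # first candidate seen for this tuple
            chosen.append(cand)
        else:
            leftovers.append(cand)
    # If every tuple is covered and slots remain, fill them in input order.
    chosen.extend(leftovers[: k - len(chosen)])
    return chosen
```

A single pass like this covers each distinct tuple once before admitting duplicates, which matches the bucket\-style heuristic described above more closely than a formal coreset method would\.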
#### Baselines
We compare TRACE against seven representative systems whose design positions are reviewed in Sec\. [IV](https://arxiv.org/html/2607.00339#S4)\. Full\-Context concatenates the full dialogue history into the LLM context and serves as a non\-retrieval reference\. RAG indexes each turn as an independent note and retrieves the top\-10 turn\-level notes by dense similarity \[[22](https://arxiv.org/html/2607.00339#bib.bib10),[24](https://arxiv.org/html/2607.00339#bib.bib9)\], with no LLM\-based memory evolution\. LangMem \[[23](https://arxiv.org/html/2607.00339#bib.bib48)\] runs in a backbone\-specific variant: under GPT\-4o\-mini we mount LangMem’s memory tools on a LangGraph `create_react_agent` hot\-path agent; under DeepSeek\-V4\-Flash \(which does not autonomously emit memory calls\) we substitute LangMem’s program\-driven `create_memory_store_manager` variant\. Mem0 \[[13](https://arxiv.org/html/2607.00339#bib.bib4)\] retrieves the top\-30 semantically similar facts per speaker at query time\. Nemori \[[29](https://arxiv.org/html/2607.00339#bib.bib5)\] is reproduced from the official release on our matched cohort, using the same backbone LLM and embedding model as all other systems\. Zep \[[33](https://arxiv.org/html/2607.00339#bib.bib6)\] is reproduced at session granularity with top\-20 edge and node retrieval and reciprocal rank\-fusion reranking\. A\-Mem \[[41](https://arxiv.org/html/2607.00339#bib.bib1)\] shares the same per\-turn note schema and top\-10 retrieval as RAG, and additionally invokes the backbone LLM at ingestion time to evolve cross\-note links on LoCoMo; this evolution step is disabled on LongMemEvalS, where the 199K\-turn cache renders full evolution computationally infeasible\.
TRACE inherits A\-Mem’s per\-turn note ingestion as its bottom layer \(Sec\. [II\-C](https://arxiv.org/html/2607.00339#S2.SS3)\); we therefore designate A\-Mem as the primary point of comparison and report $\Delta$ rows against it in Tables [I](https://arxiv.org/html/2607.00339#S3.T1) and [II](https://arxiv.org/html/2607.00339#S3.T2)\. TRACE plus the seven baselines are evaluated end\-to\-end on LoCoMo \($n{=}1540$\) under both backbones; on LongMemEvalS \($n{=}470$\) we compare TRACE against A\-Mem and Nemori, with Full\-Context as a non\-retrieval reference\.

TABLE I: Main results on LoCoMo by backbone and question category\. Bold marks the best memory\-augmented system within each backbone; underline marks the runner\-up\. $\Delta$ rows report TRACE minus A\-Mem and TRACE minus Nemori in percentage points \(pp\)\. † See Sec\. [III\-A](https://arxiv.org/html/2607.00339#S3.SS1) for the LangMem variant on DeepSeek\-V4\-Flash\.

#### Backbones
To probe cross\-backbone robustness \(RQ2\), we evaluate every system with two LLMs of distinct families: `gpt-4o-mini` \(OpenAI, accessed through the OpenRouter API\) and `deepseek-v4-flash` \(DeepSeek, accessed through the official DeepSeek API\)\. The same backbone is used for both memory construction and answer generation within each setting\.
#### Metrics
Following the LLM\-as\-a\-judge paradigm \[[43](https://arxiv.org/html/2607.00339#bib.bib7)\], our primary metric on both benchmarks is the binary LoCoMo grading protocol of Mem0 \[[13](https://arxiv.org/html/2607.00339#bib.bib4)\], as also adopted by Nemori \[[29](https://arxiv.org/html/2607.00339#bib.bib5)\], in which a separate judge LLM scores each generated answer against the reference as a binary correctness verdict\. The judge is `gpt-4o-mini` at temperature 0 with JSON output, run three times independently; we report the mean score across runs\. On LoCoMo, we additionally report token\-level F1 and BLEU\-1 of the original evaluation protocol \[[28](https://arxiv.org/html/2607.00339#bib.bib2)\] to align with prior reports; on LongMemEvalS, we report F1 only and omit BLEU\-1, since references are verbose free\-form sentences for which n\-gram overlap is dominated by stylistic surface form rather than answer correctness\. For the Stress Test \(RQ4\), we report `valid_acc` \(fraction of correct current\-state answers\), `outdated_leakage` \(fraction returning the superseded answer\), and `strict net` \(= `valid_acc` − `outdated_leakage`\) under a strict\-aware judging protocol\.
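The three Stress Test metrics reduce to simple counting once the judge has labeled each item\. The sketch below assumes the judge emits two booleans per item \(answer is currently valid; answer returns the superseded value\); this per\-item format and the function name are assumptions, not the paper’s judging code\.

```python
def stress_metrics(judgments):
    """judgments: list of (is_valid, leaks_outdated) booleans, one per item."""
    if not judgments:
        raise ValueError("at least one judged item is required")
    n = len(judgments)
    valid_acc = sum(1 for valid, _ in judgments if valid) / n
    outdated_leakage = sum(1 for _, leak in judgments if leak) / n
    return {
        "valid_acc": valid_acc,
        "outdated_leakage": outdated_leakage,
        # strict net rewards current answers and penalizes stale ones
        "strict_net": valid_acc - outdated_leakage,
    }
```

With the scores reported for TRACE in Sec\. III\-D, the same subtraction gives $0.390-0.076=0.314$\.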
#### Implementation details
All retrieval\-based systems use the `all-MiniLM-L6-v2` sentence encoder \[[34](https://arxiv.org/html/2607.00339#bib.bib11),[38](https://arxiv.org/html/2607.00339#bib.bib12)\]\. TRACE uses one configuration across both benchmarks; LongMemEvalS applies haystack\-scale adapters \(chronological session chunking instead of LLM topic clustering, with update detection disabled because LongMemEval has one fact per topic, no within\-session contradictions, and prohibitive pairwise checks at scale\) that leave retrieval and scoring unchanged\. Online answer generation culminates in one final LLM call per question on a 10K\-token hybrid context \(notes plus causal evidence\)\. Memory caches and graph indices are built once per \(system, backbone, dataset\) and reused across reruns\.
### III\-B Main Results \(RQ1, RQ2\)
TABLE II: Main results on LongMemEvalS by backbone and question type\. Bold marks the best memory\-augmented system within each row; underline marks the runner\-up; Full\-Context is excluded from the ranking pool as a non\-retrieval reference\. $\Delta$ columns report TRACE minus A\-Mem and TRACE minus Nemori in percentage points \(pp\)\.

#### Overall effectiveness \(RQ1\)
Table [I](https://arxiv.org/html/2607.00339#S3.T1) reports results on LoCoMo across four question categories\. Under GPT\-4o\-mini, TRACE achieves the highest overall LLM judge score \(0\.661\), F1 \(0\.452\), and BLEU\-1 \(0\.379\), outperforming the strongest retrieval baselines Nemori and A\-Mem by \+5\.7 and \+12\.8 pp on LLM score, respectively\. The gains are most pronounced on Temporal questions \(\+18\.6 pp vs\. Nemori, \+19\.9 pp vs\. A\-Mem in LLM score\) and Multi\-Hop questions \(\+8\.9/\+9\.7 pp\), both of which require reasoning over cross\-session evidence chains, precisely the capability targeted by TRACE’s path\-based evidence ranking\. On Single\-Hop questions TRACE remains competitive with Nemori \(−0\.9 pp LLM\) while exceeding it on F1 \(\+3\.0\) and BLEU \(\+7\.3\), indicating that the graph layer does not harm simple retrieval performance\.
On the Open\-Domain category, TRACE leads on LLM score \(\+10\.7 pp vs\. Nemori\) but trails on F1 and BLEU\. These questions are broad conversational prompts with verbose multi\-sentence references; TRACE’s concise answers receive high semantic judgments but lower n\-gram overlap\. The same pattern appears across both backbones and benchmarks, indicating that TRACE’s advantage is clearest under correctness\-oriented metrics rather than surface\-overlap metrics\.
Table [II](https://arxiv.org/html/2607.00339#S3.T2) extends the evaluation to the haystack\-scale LongMemEvalS \($\sim$115K tokens per dialogue, 470 questions\)\. Under GPT\-4o\-mini, TRACE achieves 0\.667 LLM and 0\.457 F1 overall, surpassing Nemori by \+4\.5/\+5\.8 pp and A\-Mem by \+13\.8/\+7\.0 pp\. On the single\-session\-preference type, which requires aggregation of evolving user preferences, TRACE improves over both baselines by more than 50 pp \(LLM\), validating the design decision to include preference\-profiling answer plans\. On single\-session\-assistant questions, which probe retrieval of assistant\-generated content, TRACE’s 0\.964 LLM score demonstrates near\-ceiling performance\. The multi\-session category, the most demanding cross\-session reasoning type, sees a \+6\.3 pp LLM and \+10\.8 pp F1 improvement over A\-Mem\. TRACE’s temporal\-reasoning results on LongMemEvalS reveal an interesting trade\-off: TRACE achieves the highest LLM score \(0\.606\) under GPT\-4o\-mini but trails Nemori on F1 \(0\.329 vs\. 0\.344\)\. Manual inspection shows that LongMemEval’s temporal\-reasoning reference answers contain precise date strings; TRACE’s evidence paths sometimes retrieve the correct temporal relation but express it through relative phrasing, which lowers exact token overlap while still preserving semantic correctness\.
#### Cross\-backbone robustness \(RQ2\)
Both tables demonstrate that TRACE’s overall rankings are preserved under DeepSeek\-V4\-Flash, a backbone from a different model family with a distinct tokenizer and instruction\-following profile\. On LoCoMo \(Table [I](https://arxiv.org/html/2607.00339#S3.T1)\), TRACE leads overall \(0\.725 LLM, 0\.505 F1, 0\.435 BLEU\) by \+3\.9/\+11\.6 pp over Nemori/A\-Mem on LLM score\. On LongMemEvalS \(Table [II](https://arxiv.org/html/2607.00339#S3.T2)\), TRACE achieves 0\.733 LLM and 0\.503 F1 overall, with the largest single\-category gain again on single\-session\-preference \(\+45\.9 pp vs\. Nemori, \+47\.7 pp vs\. A\-Mem\)\. The absolute performance on DeepSeek is generally higher than on GPT\-4o\-mini, consistent with the stronger Full\-Context ceiling, yet TRACE’s relative advantage over baselines remains stable across both backbones\. This consistency confirms that the performance gains stem from the evidence graph structure and query\-processing pipeline rather than from backbone\-specific prompt engineering\.
The one notable exception is the single\-session\-assistant type on DeepSeek, where A\-Mem achieves a perfect 1\.000 LLM score, suggesting the task is nearly solved by strong dense retrieval within a single session on this backbone\. TRACE’s 0\.964 on the same slice is not meaningfully different from ceiling, and the −3\.6 pp gap reflects one or two borderline judgments rather than a systematic failure\. A second exception is temporal\-reasoning on DeepSeek, where Nemori’s LLM score \(0\.748\) exceeds TRACE’s \(0\.656\)\. DeepSeek’s stronger instruction\-following ability amplifies Nemori’s timeline\-oriented prompting strategy on this category; however, TRACE’s overall advantage across all six categories remains substantial \(\+17\.0 pp LLM overall\)\.
### III\-C Ablation Study \(RQ3\)
TABLE III: Compact overall ablation summary on GPT\-4o\-mini

Figure 3: Ablation impact across benchmarks\. Each cell shows the LLM\-score change in percentage points relative to TRACE for an ablated variant on \(a\) LoCoMo and \(b\) LongMemEvalS\. Negative values indicate performance loss after removing a component\.

#### Component contributions
Table [III](https://arxiv.org/html/2607.00339#S3.T3) reports overall ablation metrics, while Fig\. [3](https://arxiv.org/html/2607.00339#S3.F3) summarizes category\-level LLM changes\. Each row removes one component from the full system \(R0\); row R5 reproduces the A\-Mem baseline for reference\.
*Topic hierarchy \(R1: w/o Topic\)\.* Removing topic vertices reduces overall LoCoMo LLM score by only 0\.1 pp but LongMemEvalS LLM by 1\.5 pp\. The small LoCoMo effect is expected because that benchmark’s 10 conversations are short enough that session\-level edges already provide sufficient cross\-context links\. On the haystack\-scale LongMemEvalS, topic vertices serve as navigational shortcuts across hundreds of sessions, and their removal affects temporal\-reasoning \(−4\.7 pp\) and single\-session\-assistant \(−3\.5 pp\) most\. This validates the multi\-granularity topology design principle: at scale, topic\-level grouping becomes a necessary structural scaffold for efficient traversal\.

*Full hierarchy \(R2: w/o Hier\)\.* Removing both topic and session vertices reduces LoCoMo overall LLM to 0\.644 \(−1\.7 pp\) and LongMemEvalS LLM to 0\.650 \(−1\.7 pp\)\. The fact that R2 remains well above the A\-Mem reference \(R5\) by \+11\.1/\+12\.1 pp overall confirms that the event\-level graph, together with the same query\-shape\-aware retrieval and answer\-planning stack, provides substantial value\. The hierarchy adds a consistent but moderate additional gain on top of the event\-semantic foundation\.

*Graph evidence \(R3: w/o graph\)\.* When the entire graph layer is removed and the system falls back to note\-only retrieval with query\-shape\-aware prompting, LoCoMo overall LLM drops to 0\.599 \(−6\.2 pp\) and LongMemEvalS LLM to 0\.609 \(−5\.8 pp\)\. The gap concentrates on categories that require relational reasoning: Multi\-Hop on LoCoMo \(−5\.4 pp\), Single\-Hop \(−9\.2 pp\), and single\-session\-preference on LongMemEvalS \(−18\.9 pp\)\. This demonstrates that path\-based evidence provides information that flat retrieval cannot recover, even when the same underlying notes are available, because the graph encodes *how* facts relate rather than merely *which* facts exist\. The R3 result also establishes that the query\-shape planner alone \(without graph evidence\) still outperforms A\-Mem by \+6\.6/\+8\.0 pp overall, confirming that the planner provides independent value through better reasoning instructions\.

*Query\-shape\-aware reasoning \(R4: w/o L3 CoT\)\.* Replacing the query\-shape\-specific answer plan with a generic prompt produces the largest single\-component drop on LongMemEvalS and one of the largest drops on LoCoMo: −5\.2 pp overall LLM on LoCoMo and −11\.7 pp on LongMemEvalS\. On LoCoMo, the Temporal category is most affected \(−17\.6 pp LLM\), consistent with the design rationale that temporal queries need explicit date\-normalization reasoning\. On LongMemEvalS, the single\-session\-preference type collapses from 0\.900 to 0\.500 \(−40\.0 pp\), confirming that the preference\-profiling plan is high\-impact for aggregating evolving statements into a coherent current\-state answer\. Notably, R4 still outperforms R5 \(A\-Mem\) overall by \+7\.6/\+2\.1 pp, indicating that graph evidence contributes value even without the tailored plan, but the two components are strongly complementary: good evidence without good reasoning instructions yields only partial improvement, and the converse also holds\.
#### Summary
The ablation reveals a layered contribution structure\. Query\-shape\-aware reasoning \(L3\) and graph evidence provide the two main gains, with a balance that depends on the benchmark and question type: L3 supports plan\-aware temporal and preference reasoning, while graph paths recover evidence missed by flat retrieval for factual and multi\-hop questions\. The hierarchical structure provides a modest but consistent gain by enabling multi\-granularity navigation\. No single component accounts for the full improvement over baselines, confirming that TRACE’s effectiveness emerges from the interaction of its architectural layers rather than any single mechanism\.
### III\-D Update\-Heavy Slice Results \(RQ4\)
TABLE IV: Update\-tracking on the Stress Test and the C3 update\-aware seed ablation

TABLE V: Graph statistics, build cost, and online efficiency across benchmarks and backbones

#### Stress test performance
Table [IV](https://arxiv.org/html/2607.00339#S3.T4) evaluates how well each system handles superseded facts, contradictions, and chain revisions on the purpose\-built 59\-item Stress Test\. TRACE achieves the highest valid accuracy \(0\.390\) and strict net score \(\+0\.314\) among memory\-augmented systems, surpassing Nemori by \+9\.4 pp and A\-Mem by \+4\.2 pp on strict net\. Crucially, TRACE’s outdated leakage \(0\.076\) matches the Full\-Context reference and is lower than both Nemori \(0\.102\) and A\-Mem \(0\.085\)\. This validates the update\-validity semantics: by marking superseded events with $v(e){=}0$ and deprioritizing paths that traverse them via the $\mathit{UV}(\pi)$ scoring term \(Eq\. [20](https://arxiv.org/html/2607.00339#S2.E20)\), TRACE avoids surfacing stale facts as if they were current\.
The comparison with Full\-Context is also instructive\. Full\-Context achieves the highest valid accuracy \(0\.542\) by having access to the entire conversation, but its outdated leakage \(0\.076\) is identical to TRACE’s\. This suggests that TRACE’s retrieval pipeline successfully identifies the same temporal resolution cues that are available in the full context, while using $5\times$ fewer tokens\. The gap in valid accuracy \(0\.542 vs\. 0\.390\) indicates that some update chains in the Stress Test require evidence distributed so broadly across the conversation that bounded path traversal does not fully recover all supporting passages, a known limitation of any retrieval\-based approach relative to full\-context processing\.
#### Update\-aware seed ablation \(C3\)
Disabling the update\-aware seed expansion \($V_{\mathrm{upd}}$ in Eq\. [14](https://arxiv.org/html/2607.00339#S2.E14)\) causes the Stress Test strict net to drop from \+0\.314 to \+0\.212 \(−10\.17 pp after accounting for both valid\_acc and leak changes\)\. The valid accuracy drops by 8\.47 pp and outdated leakage increases by 1\.69 pp, confirming that the expansion along evolution edges is essential for retrieving the most recent version of a revised fact\. Without this expansion, the system relies solely on dense similarity to find the current state; when the current version uses different vocabulary from the query \(a common pattern in preference changes\), the older, more lexically similar version is retrieved instead\. On the broader LoCoMo benchmark, the ablation reduces overall LLM score by 0\.93 pp, with the effect concentrated in the Open\-Domain sub\-category \(−7\.29 pp\), a category where user preferences frequently evolve and where the update\-aware seed ensures the current preference is retrieved alongside the historical one\.

TABLE VI: Three\-layer interpretability evaluation of TRACE\. The temporal\-class type\-label errors observed at the graph layer reappear at the explanation layer as the dominant driver of category\-2 path\-faithfulness failure \(90\.3% propagation\), confirmed by an independent rule\-based corroborator’s \+27\.6 pp clean\-vs\-flagged gap\.

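The effect of update\-aware seeding can be pictured with a small sketch that follows update edges forward from each dense\-retrieval seed\. This is an illustration under stated assumptions, not the definition in Eq\. \(14\): `expand_seeds` is a hypothetical name, and the `updates` mapping assumes each event has at most one direct successor\.

```python
def expand_seeds(seeds, updates):
    """Add every later version reachable from a seed via update edges.

    updates maps an event id to the id of the event that supersedes it.
    """
    expanded = list(seeds)
    seen = set(seeds)
    for seed in seeds:
        current = seed
        # Walk the revision chain; the seen-set also guards against cycles.
        while current in updates and updates[current] not in seen:
            current = updates[current]
            seen.add(current)
            expanded.append(current)
    return expanded
```

If dense retrieval returns only the older, lexically closer event `e1`, the walk still reaches its latest revision:

```python
expand_seeds(["e1", "e9"], {"e1": "e2", "e2": "e3"})
# returns ["e1", "e9", "e2", "e3"]
```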
#### Cache\-controlled diagnostic
To test whether the Stress Test results merely inherit A\-Mem’s evolved note cache, we run a two\-by\-two cache diagnostic using the default evolution cache and a UUID\-remapped no\-evolution cache\. The no\-evolution cache strips A\-Mem’s ingestion\-time note links and evolved metadata while preserving graph\-to\-note lookup compatibility for TRACE\. Removing note evolution reduces A\-Mem’s strict net from \+0\.271 to \+0\.076 \(−19\.5 pp\), while TRACE drops from \+0\.314 to \+0\.153 \(−16\.1 pp\)\. On the matched no\-evolution cache, TRACE still retains a direction\-consistent \+7\.7 pp advantage over A\-Mem\. As a cross\-arm diagnostic, comparing the paper\-default TRACE setting against A\-Mem without ingestion\-time evolution increases the gap to \+23\.7 pp\. These results indicate that A\-Mem’s stress performance is heavily supported by ingestion\-time evolution metadata, whereas TRACE retains an independent graph\-side signal even after the metadata has been stripped away\.
### III\-E Graph Construction and Efficiency Analysis \(RQ5\)
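The gaps in the two\-by\-two diagnostic can be re\-derived from the rounded strict\-net scores as a quick check\. Values recomputed from rounded scores may differ from the reported gaps in the last digit \(for example $0.238$ versus the reported $+23.7$ pp\), presumably because the reported gaps were computed before rounding\.

$$
\begin{aligned}
\text{A-Mem, evolution} \rightarrow \text{no evolution:}\quad & 0.271-0.076=0.195 \quad (19.5\ \text{pp drop})\\
\text{TRACE, evolution} \rightarrow \text{no evolution:}\quad & 0.314-0.153=0.161 \quad (16.1\ \text{pp drop})\\
\text{TRACE vs. A-Mem, matched no-evolution cache:}\quad & 0.153-0.076=0.077 \quad (+7.7\ \text{pp})\\
\text{TRACE (default) vs. A-Mem (no evolution):}\quad & 0.314-0.076=0.238 \quad (\approx +23.7\ \text{pp reported})
\end{aligned}
$$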
#### Graph statistics
Table [V](https://arxiv.org/html/2607.00339#S3.T5) characterizes the evidence graph at both benchmark scales\. On LoCoMo \(10 conversations, 5,882 notes\), TRACE distills approximately 2,300–2,500 events with 7,100–8,200 edges, yielding an average fanout of $\sim$3\.1 edges per event\. On LongMemEvalS \(19,195 sessions, 199,509 notes\), the graph scales to 128–152K events with 398–455K edges\. The structural\-to\-semantic edge ratio is approximately 75/25 at both scales, reflecting the dominance of membership and session\-ordering edges over the rarer semantic relations\. Here structural edges refer to membership and session\-level `temporal_before` edges, while semantic edges include `causes`, `enables`, `prevents`, event\-level `temporal_before`, `updates`, and `contradicts`\. Update\-aware edges \(`updates` \+ `contradicts`\) number 185–362 on LoCoMo but only 1–8 on LongMemEvalS; this is expected because the latter benchmark’s conversations rarely revisit the same topic within a single dialogue sequence, meaning within\-conversation fact evolution is sparse\. Note that Sessions/Topic on LongMemEvalS equals the fixed chunk size of 50, as LLM topic clustering is bypassed by chronological windows at the 19K\-session scale\. Nonetheless, the temporal hierarchy and session\-ordering structure still enable multi\-session reasoning on LongMemEvalS, as demonstrated by the multi\-session category gains in Table [II](https://arxiv.org/html/2607.00339#S3.T2)\.
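The structural/semantic split and the fanout statistic can be computed as in the sketch below\. The `(edge_type, level)` edge format and the `"membership"` label are assumptions for illustration; the relation names `causes`, `enables`, `prevents`, `temporal_before`, `updates`, and `contradicts` are taken from the text\.

```python
from collections import Counter

SEMANTIC_TYPES = {"causes", "enables", "prevents", "updates", "contradicts"}


def edge_class(edge_type: str, level: str) -> str:
    """Classify an edge as 'structural' or 'semantic'."""
    if edge_type == "membership":  # hypothetical label for membership edges
        return "structural"
    if edge_type == "temporal_before":
        # session-level ordering is structural, event-level ordering is semantic
        return "structural" if level == "session" else "semantic"
    if edge_type in SEMANTIC_TYPES:
        return "semantic"
    raise ValueError(f"unknown edge type: {edge_type}")


def graph_summary(edges, n_events: int) -> dict:
    """edges: iterable of (edge_type, level) pairs."""
    edges = list(edges)
    classes = Counter(edge_class(t, lvl) for t, lvl in edges)
    total = len(edges)
    return {
        "fanout": total / n_events,  # edges per event
        "structural_share": classes["structural"] / total,
        "update_aware": sum(t in {"updates", "contradicts"} for t, _ in edges),
    }
```

As a sanity check on the reported figures, 7,100 edges over 2,300 events gives a fanout of about 3\.1\.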
#### Construction cost
Build time is 4\.7–5\.9 hours on LoCoMo \(7\.5–8\.3 s per event\) and 38–45 hours on LongMemEvalS \(0\.9–1\.1 s per event\)\. The roughly order\-of\-magnitude reduction in per\-event time at larger scale is attributable to the candidate gating mechanism \(Eq\. [10](https://arxiv.org/html/2607.00339#S2.E10)\): with more events, the participant and topic filters reject a proportionally larger share of candidate pairs before any LLM call, so the per\-event cost of LLM edge classification falls\. Storage is modest: 6 MB for LoCoMo and 273–318 MB for the full LongMemEvalS graph \(JSON serialization\)\. These numbers indicate that the offline construction is a one\-time investment whose cost is dominated by LLM calls rather than storage or memory, and that the candidate gate successfully controls the quadratic blowup that would otherwise make cross\-session linking intractable at scale\.
#### Online cost and latency
TRACE’s median online latency is 947/728 ms per question on LoCoMo and 1,171/1,025 ms on LongMemEvalS \(GPT\-4o\-mini/DeepSeek\)\. The latency overhead compared to A\-Mem \(703/564 ms on LoCoMo\) comes from graph traversal and path scoring; however, TRACE remains well within interactive response bounds \(<1\.2 s\)\. On LongMemEvalS, Full\-Context requires 6\.4/5\.1 s and 98–104K tokens per question\. TRACE reduces this to 6,948/6,058 tokens, a $14\times$–$17\times$ reduction, while achieving higher accuracy \(Table [II](https://arxiv.org/html/2607.00339#S3.T2)\)\. The efficiency gain derives directly from the bounded path generation: rather than feeding the entire conversation to the LLM, TRACE supplies only the top\-scored paths and notes within the 10K\-token budget\. The token comparison also highlights an important design trade\-off relative to A\-Mem\.
On LoCoMo, TRACE uses fewer tokens than A\-Mem \(3,719 vs\. 6,642 under GPT\-4o\-mini\) because note deduplication against graph paths compresses redundancy\. On LongMemEvalS, TRACE uses more tokens \(6,948 vs\. 2,313\) because the path budget is allocated to cross\-session evidence that A\-Mem’s flat retrieval does not attempt to surface\. This selective budget expansion directly enables the multi\-session and temporal\-reasoning gains observed in Table[II](https://arxiv.org/html/2607.00339#S3.T2), demonstrating that the additional tokens carry high\-value structural information rather than undifferentiated context padding\.
### III\-F Interpretability Analysis \(RQ6\)
Table [VI](https://arxiv.org/html/2607.00339#S3.T6) assesses TRACE’s interpretability at three layers: graph fidelity, pre\-reasoning cue support, and post\-reasoning explanation faithfulness\.

*Graph layer: Edge Audit\.* On a random sample of 100 edges, human annotators confirm 94\.0% factual accuracy and 89\.0% type\-label accuracy\. The 11 pp type\-label error concentrates in the temporal class \(`temporal_before` on same\-timestamp events\), where session\-level fallback ordering occasionally produces incorrect direction assignments\.

*Pre\-reasoning layer: Cue Glanceability\.* In a 22\-question pilot, 81\.8% of TRACE’s rendered cues contain the information needed to answer correctly, vs\. 18\.2% for a similarity\-only baseline \(\+63\.6 pp\)\. The misleading rate is 22\.7% vs\. 54\.6%, and the average glanceability rating is 3\.73/5\.

*Post\-reasoning layer: Path Faithfulness\.* On 145 annotated triples, overall path faithfulness is 42\.5% \(Wilson 95% CI \[35\.0, 50\.9\]\): Single\-Hop 65\.0%, Multi\-Hop 45\.0%, Open\-Domain 36\.0%, Temporal 22\.5%\. The temporal weakness traces to graph\-layer type\-label errors: 90\.3% of category\-2 failures involve a misclassified temporal edge, confirmed by an independent rule\-based corroborator \($p{=}0.010$, \+27\.6 pp gap\)\.

These results confirm two interpretability properties: the typed\-edge vocabulary makes state\-transition reasons human\-readable without inspecting raw text, and the co\-presence of stale and current nodes preserves full revision history for auditing\. The temporal edge imprecision at construction time propagates through path ranking into the generated answer, indicating that improving time\-anchor resolution \(e\.g\., turn\-level ordering\) would yield compounding gains across all three layers\. The high fidelity on non\-temporal edges \($\sim$97%\) confirms that the typed\-edge schema provides a reliable interpretability foundation for the majority of question types\.
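For readers who want to reproduce interval estimates such as the Wilson 95% CI quoted for path faithfulness, a standard Wilson score interval can be computed as follows\. The function name is illustrative, and the counts in the usage example are placeholders, since the raw success count behind the 42\.5% figure is not stated in this section\.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Placeholder counts for illustration only: 50 successes out of 100 trials.
low, high = wilson_interval(50, 100)
```

Unlike the normal\-approximation interval, the Wilson interval stays inside \[0, 1\] and behaves sensibly for small samples such as the per\-category slices above\.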
## IV Related Work
#### Long\-term memory and temporal validity
Systems such as A\-Mem \[[41](https://arxiv.org/html/2607.00339#bib.bib1)\], Mem0 \[[13](https://arxiv.org/html/2607.00339#bib.bib4)\], Nemori \[[29](https://arxiv.org/html/2607.00339#bib.bib5)\], Zep \[[33](https://arxiv.org/html/2607.00339#bib.bib6)\], and LangMem \[[23](https://arxiv.org/html/2607.00339#bib.bib48)\] address what should be persisted across interactions, storing notes, entities, summaries, graph facts, or interaction traces\. These systems make memory persistent and searchable, but most stored objects are exposed to the downstream model as flat retrieved artifacts rather than as a structured state\-reconstruction problem that accounts for how facts evolve over a conversation history\. On the temporal side, databases formalize valid time and transaction time semantics \[[16](https://arxiv.org/html/2607.00339#bib.bib38),[36](https://arxiv.org/html/2607.00339#bib.bib26)\], while uncertainty and lineage models attach confidence and derivation context to individual records \[[6](https://arxiv.org/html/2607.00339#bib.bib69),[12](https://arxiv.org/html/2607.00339#bib.bib70)\]\. Temporal KG QA methods retrieve and rerank facts under time constraints \[[32](https://arxiv.org/html/2607.00339#bib.bib43)\], but they typically begin from an already well\-structured knowledge graph with explicitly recorded temporal annotations, an assumption that does not hold for raw conversational data\. TRACE adopts memory\-note ingestion as its lexical layer, but promotes the reasoning object to a temporal evidence graph derived from raw dialogue sessions, where validity annotations, update edges, and contradiction edges collectively determine how past statements should affect current answers\. This bridges the gap between memory persistence and temporally coherent reasoning\.
#### Graph\-based retrieval, path reasoning, and evidence provenance
Graph\-based retrieval systems organize unstructured corpora into explicit structures for search and reasoning\. GraphReader\[[25](https://arxiv.org/html/2607.00339#bib.bib20)\] agentically walks an entity graph for long\-document QA\. Database keyword search and graph\-query systems provide the query\-processing counterpart, using schema graphs, tuple trees, ranked answer graphs, RDF exploration, and subgraph matching to connect relevant records\[[1](https://arxiv.org/html/2607.00339#bib.bib58),[7](https://arxiv.org/html/2607.00339#bib.bib59),[21](https://arxiv.org/html/2607.00339#bib.bib60),[20](https://arxiv.org/html/2607.00339#bib.bib62),[11](https://arxiv.org/html/2607.00339#bib.bib63),[37](https://arxiv.org/html/2607.00339#bib.bib64),[30](https://arxiv.org/html/2607.00339#bib.bib65),[45](https://arxiv.org/html/2607.00339#bib.bib66),[26](https://arxiv.org/html/2607.00339#bib.bib68)\]\. Provenance further ensures that answers remain traceable to supporting records\[[9](https://arxiv.org/html/2607.00339#bib.bib39),[19](https://arxiv.org/html/2607.00339#bib.bib45),[12](https://arxiv.org/html/2607.00339#bib.bib70)\], enabling downstream consumers to verify the derivation chain from source data to final output\. These approaches improve reasoning and inspectability, but primarily target corpus organization or graph query answering over relatively stable knowledge bases where facts do not routinely supersede one another\. TRACE differs in treating graph paths as temporal evidence objects for dialogue state: a path may contain superseded, supporting, or conflicting events, and its temporal consistency directly affects how the final answer is assembled and which nodes are promoted or demoted during generation\.
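One way to picture a path as a temporal evidence object is the short Python sketch below. It is an illustration under assumptions, not the ranking function used by TRACE: the node encoding `(turn_index, status)`, the status labels, and the multiplicative penalties of 0.5 are hypothetical. It demotes a path once for each node that is not currently valid and once more if the path is not in chronological order.

```python
from typing import List, Tuple

# Each path node is (turn_index, status); status is one of "valid",
# "superseded", or "conflicting". These labels are illustrative assumptions.
PathNode = Tuple[int, str]


def is_time_ordered(path: List[PathNode]) -> bool:
    """True when turn indices never decrease along the path."""
    turns = [turn for turn, _ in path]
    return all(a <= b for a, b in zip(turns, turns[1:]))


def path_score(path: List[PathNode], base: float = 1.0,
               stale_penalty: float = 0.5,
               order_penalty: float = 0.5) -> float:
    """Scale the base score down per non-valid node and once if out of order."""
    score = base
    for _, status in path:
        if status != "valid":
            score *= stale_penalty
    if not is_time_ordered(path):
        score *= order_penalty
    return score


clean = [(3, "valid"), (9, "valid")]
stale = [(3, "superseded"), (9, "valid")]
scrambled = [(9, "valid"), (3, "valid")]
```

Here `clean` keeps the full base score, while `stale` and `scrambled` are each halved, so a ranker built on such scores would prefer chronologically consistent paths through currently valid events without discarding the others.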
#### Adaptive query processing over evolving conversational evidence
The query\-processing view in TRACE is closest to database work that adapts execution behavior according to data and query conditions\. Classical optimizers select access paths and plans from statistics\[[35](https://arxiv.org/html/2607.00339#bib.bib49),[18](https://arxiv.org/html/2607.00339#bib.bib50)\], adaptive query processing revises execution at runtime in response to changing data distributions\[[4](https://arxiv.org/html/2607.00339#bib.bib27),[17](https://arxiv.org/html/2607.00339#bib.bib29),[14](https://arxiv.org/html/2607.00339#bib.bib28)\], and stream systems support standing queries over continuously arriving inputs with bounded memory and latency\[[5](https://arxiv.org/html/2607.00339#bib.bib51),[10](https://arxiv.org/html/2607.00339#bib.bib52),[27](https://arxiv.org/html/2607.00339#bib.bib53),[2](https://arxiv.org/html/2607.00339#bib.bib54)\]\. View maintenance studies how derived results should remain current when base data changes\[[8](https://arxiv.org/html/2607.00339#bib.bib56),[31](https://arxiv.org/html/2607.00339#bib.bib57)\], a concern that closely parallels keeping conversational answers consistent as new turns introduce updates or retractions\. TRACE couples this adaptive principle to temporal evidence semantics, selecting query\-shape\-specific plans and expanding seeds through validity\-aware graph paths to adapt to both question complexity and evidence dynamics\.
## V Conclusion
We presented TRACE, a state\-aware query processing framework that models conversations as a hierarchical evidence graph with typed causal, temporal, and update edges and graded validity annotations\. At query time, it combines vector retrieval with graph\-guided path search and query\-shape\-aware answer planning\. Experiments on two benchmarks across two backbones show consistent gains over existing memory systems, particularly on temporal, multi\-hop, and preference questions\. On a purpose\-built Stress Test, TRACE matches Full\-Context on outdated leakage while using 5× fewer tokens\. Ablations confirm that graph evidence, hierarchy, and query\-shape reasoning each contribute complementary value\. Future work includes incremental graph construction for deployed assistants, turn\-level time\-anchor resolution to reduce temporal edge errors, and extension to multi\-agent collaboration traces\.
## References
- \[1\]\(2002\)DBXplorer: enabling keyword search over relational databases\.InProceedings of the 2002 ACM SIGMOD international conference on Management of data,pp\. 627–627\.Cited by:[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px2.p1.1)\.
- \[2\]A\. Arasu, S\. Babu, and J\. Widom\(2006\)The cql continuous query language: semantic foundations and query execution\.The VLDB Journal15\(2\),pp\. 121–142\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p1.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px3.p1.1)\.
- \[3\]M\. Arenas, L\. Bertossi, and J\. Chomicki\(1999\)Consistent query answers in inconsistent databases\.InProceedings of the eighteenth ACM SIGMOD\-SIGACT\-SIGART symposium on Principles of database systems,pp\. 68–79\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p2.1)\.
- \[4\]R\. Avnur and J\. M\. Hellerstein\(2000\)Eddies: continuously adaptive query processing\.InProceedings of the 2000 ACM SIGMOD international conference on Management of data,pp\. 261–272\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p4.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px3.p1.1)\.
- \[5\]B\. Babcock, S\. Babu, M\. Datar, R\. Motwani, and J\. Widom\(2002\)Models and issues in data stream systems\.InProceedings of the twenty\-first ACM SIGMOD\-SIGACT\-SIGART symposium on Principles of database systems,pp\. 1–16\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p1.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px3.p1.1)\.
- \[6\]O\. Benjelloun, A\. D\. Sarma, A\. Halevy, and J\. Widom\(2006\)ULDBs: databases with uncertainty and lineage\.InVLDB,Vol\.6,pp\. 953–964\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p4.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px1.p1.1)\.
- \[7\]G\. Bhalotia, A\. Hulgeri, C\. Nakhe, S\. Chakrabarti, and S\. Sudarshan\(2002\)Keyword searching and browsing in databases using banks\.InProceedings 18th international conference on data engineering,pp\. 431–440\.Cited by:[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px2.p1.1)\.
- \[8\]J\. A\. Blakeley, P\. Larson, and F\. W\. Tompa\(1986\)Efficiently updating materialized views\.ACM SIGMOD Record15\(2\),pp\. 61–71\.Cited by:[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px3.p1.1)\.
- \[9\]P\. Buneman, S\. Khanna, and T\. Wang\-Chiew\(2001\)Why and where: a characterization of data provenance\.InInternational conference on database theory,pp\. 316–330\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p2.1),[§I](https://arxiv.org/html/2607.00339#S1.p4.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px2.p1.1)\.
- \[10\]J\. Chen, D\. J\. DeWitt, F\. Tian, and Y\. Wang\(2000\)NiagaraCQ: a scalable continuous query system for internet databases\.InProceedings of the 2000 ACM SIGMOD international conference on Management of data,pp\. 379–390\.Cited by:[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px3.p1.1)\.
- \[11\]Y\. Chen, W\. Wang, Z\. Liu, and X\. Lin\(2009\)Keyword search on structured and semi\-structured data\.InProceedings of the 2009 ACM SIGMOD International Conference on Management of data,pp\. 1005–1010\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p3.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px2.p1.1)\.
- \[12\]J\. Cheney, L\. Chiticariu, and W\. Tan\(2009\)Provenance in databases: why, how, and where\.Foundations and trends in databases1\(4\),pp\. 379–474\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p4.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px1.p1.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px2.p1.1)\.
- \[13\]P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav\(2025\)Mem0: building production\-ready ai agents with scalable long\-term memory\.arXiv preprint arXiv:2504\.19413\.Cited by:[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px3.p1.7),[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px5.p1.2),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px1.p1.1)\.
- \[14\]A\. Deshpande, Z\. Ives, and V\. Raman\(2007\)Adaptive query processing\.Now Publishers Inc\.Cited by:[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px3.p1.1)\.
- \[15\]X\. L\. Dong, L\. Berti\-Equille, and D\. Srivastava\(2013\)Data fusion: resolving conflicts from multiple sources\.InHandbook of Data Quality: Research and Practice,pp\. 293–318\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p2.1)\.
- \[16\]C\. Dyreson, F\. Grandi, W\. Käfer, N\. Kline, N\. Lorentzos, Y\. Mitsopoulos, A\. Montanari, D\. Nonen, E\. Peressi, B\. Pernici,et al\.\(1994\)A consensus glossary of temporal database concepts\.ACM Sigmod Record23\(1\),pp\. 52–64\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p1.1),[§I](https://arxiv.org/html/2607.00339#S1.p4.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px1.p1.1)\.
- \[17\]A\. Gounaris, N\. W\. Paton, A\. A\. Fernandes, and R\. Sakellariou\(2002\)Adaptive query processing: a survey\.InBritish National Conference on Databases,pp\. 11–25\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p4.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px3.p1.1)\.
- \[18\]G\. Graefe and W\. J\. McKenna\(1993\)The volcano optimizer generator: extensibility and efficient search\.InProceedings of IEEE 9th international conference on data engineering,pp\. 209–218\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p2.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px3.p1.1)\.
- \[19\]T\. J\. Green, G\. Karvounarakis, and V\. Tannen\(2007\)Provenance semirings\.InProceedings of the twenty\-sixth ACM SIGMOD\-SIGACT\-SIGART symposium on Principles of database systems,pp\. 31–40\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p2.1),[§I](https://arxiv.org/html/2607.00339#S1.p4.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px2.p1.1)\.
- \[20\]H\. He, H\. Wang, J\. Yang, and P\. S\. Yu\(2007\)BLINKS: ranked keyword searches on graphs\.InProceedings of the 2007 ACM SIGMOD international conference on Management of data,pp\. 305–316\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p3.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px2.p1.1)\.
- \[21\]V\. Hristidis and Y\. Papakonstantinou\(2002\)Discover: keyword search in relational databases\.InVLDB’02: Proceedings of the 28th International Conference on Very Large Databases,pp\. 670–681\.Cited by:[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px2.p1.1)\.
- \[22\]V\. Karpukhin, B\. Oguz, S\. Min, P\. Lewis, L\. Wu, S\. Edunov, D\. Chen, and W\. Yih\(2020\)Dense passage retrieval for open\-domain question answering\.InProceedings of the 2020 conference on empirical methods in natural language processing \(EMNLP\),pp\. 6769–6781\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p3.1),[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px3.p1.7)\.
- \[23\]LangChain AI\(2026\)LangMem: long\-term memory for llm agents\.GitHub\.Note:Accessed: 2026\-06\-01External Links:[Link](https://github.com/langchain-ai/langmem)Cited by:[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px3.p1.7),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px1.p1.1)\.
- \[24\]P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel,et al\.\(2020\)Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.Advances in neural information processing systems33,pp\. 9459–9474\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p3.1),[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px3.p1.7)\.
- \[25\]S\. Li, Y\. He, H\. Guo, X\. Bu, G\. Bai, J\. Liu, J\. Liu, X\. Qu, Y\. Li, W\. Ouyang,et al\.\(2024\)Graphreader: building graph\-based agent to enhance long\-context abilities of large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 12758–12786\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p3.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px2.p1.1)\.
- \[26\]L\. Libkin, W\. Martens, and D\. Vrgoč\(2016\)Querying graphs with data\.Journal of the ACM \(JACM\)63\(2\),pp\. 1–53\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p4.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px2.p1.1)\.
- \[27\]S\. Madden, M\. Shah, J\. M\. Hellerstein, and V\. Raman\(2002\)Continuously adaptive continuous queries over streams\.InProceedings of the 2002 ACM SIGMOD international conference on Management of data,pp\. 49–60\.Cited by:[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px3.p1.1)\.
- \[28\]A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang\(2024\)Evaluating very long\-term conversational memory of llm agents\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 13851–13870\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p1.1),[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px1.p1.10),[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px5.p1.2)\.
- \[29\]J\. Nan, W\. Ma, W\. Wu, and Y\. Chen\(2025\)Nemori: self\-organizing agent memory inspired by cognitive science\.arXiv preprint arXiv:2508\.03341\.Cited by:[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px3.p1.7),[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px5.p1.2),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px1.p1.1)\.
- \[30\]T\. Neumann and G\. Weikum\(2008\)RDF\-3x: a risc\-style engine for rdf\.Proceedings of the VLDB Endowment1\(1\),pp\. 647–659\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p4.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px2.p1.1)\.
- \[31\]T\. Palpanas, R\. Sidle, R\. Cochrane, and H\. Pirahesh\(2002\)Incremental maintenance for non\-distributive aggregate functions\.InVLDB’02: Proceedings of the 28th International Conference on Very Large Databases,pp\. 802–813\.Cited by:[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px3.p1.1)\.
- \[32\]X\. Qian, Y\. Zhang, Y\. Zhao, B\. Zhou, X\. Sui, L\. Zhang, and K\. Song\(2024\)TimeR4: time\-aware retrieval\-augmented large language models for temporal knowledge graph question answering\.InProceedings of the 2024 conference on empirical methods in natural language processing,pp\. 6942–6952\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p3.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px1.p1.1)\.
- \[33\]P\. Rasmussen, P\. Paliychuk, T\. Beauvais, J\. Ryan, and D\. Chalef\(2025\)Zep: a temporal knowledge graph architecture for agent memory\.arXiv preprint arXiv:2501\.13956\.Cited by:[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px3.p1.7),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px1.p1.1)\.
- \[34\]N\. Reimers and I\. Gurevych\(2019\)Sentence\-bert: sentence embeddings using siamese bert\-networks\.InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing \(EMNLP\-IJCNLP\),pp\. 3982–3992\.Cited by:[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px6.p1.1)\.
- \[35\]P\. G\. Selinger, M\. M\. Astrahan, D\. D\. Chamberlin, R\. A\. Lorie, and T\. G\. Price\(1979\)Access path selection in a relational database management system\.InProceedings of the 1979 ACM SIGMOD international conference on Management of data,pp\. 23–34\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p2.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px3.p1.1)\.
- \[36\]R\. T\. Snodgrass\(2012\)The tsql2 temporal query language\.Vol\.330,Springer Science & Business Media\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p1.1),[§I](https://arxiv.org/html/2607.00339#S1.p4.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px1.p1.1)\.
- \[37\]T\. Tran, H\. Wang, S\. Rudolph, and P\. Cimiano\(2009\)Top\-k exploration of query candidates for efficient keyword search on graph\-shaped \(rdf\) data\.In2009 IEEE 25th International Conference on Data Engineering,pp\. 405–416\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p3.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px2.p1.1)\.
- \[38\]W\. Wang, F\. Wei, L\. Dong, H\. Bao, N\. Yang, and M\. Zhou\(2020\)Minilm: deep self\-attention distillation for task\-agnostic compression of pre\-trained transformers\.Advances in neural information processing systems33,pp\. 5776–5788\.Cited by:[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px6.p1.1)\.
- \[39\]D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu\(2024\)Longmemeval: benchmarking chat assistants on long\-term interactive memory\.arXiv preprint arXiv:2410\.10813\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p1.1),[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px1.p1.10)\.
- \[40\]J\. Xu, A\. Szlam, and J\. Weston\(2022\)Beyond goldfish memory: long\-term open\-domain conversation\.InProceedings of the 60th annual meeting of the association for computational linguistics \(volume 1: long papers\),pp\. 5180–5197\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p1.1),[§I](https://arxiv.org/html/2607.00339#S1.p3.1)\.
- \[41\]W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang\(2026\)A\-mem: agentic memory for llm agents\.Advances in Neural Information Processing Systems38,pp\. 17577–17604\.Cited by:[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px3.p1.7),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px1.p1.1)\.
- \[42\]X\. Yin, J\. Han, and P\. S\. Yu\(2007\)Truth discovery with multiple conflicting information providers on the web\.InProceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining,pp\. 1048–1052\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p2.1)\.
- \[43\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing,et al\.\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems36,pp\. 46595–46623\.Cited by:[§III\-A](https://arxiv.org/html/2607.00339#S3.SS1.SSS0.Px5.p1.2)\.
- \[44\]W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang\(2024\)Memorybank: enhancing large language models with long\-term memory\.InProceedings of the AAAI conference on artificial intelligence,Vol\.38,pp\. 19724–19731\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p1.1),[§I](https://arxiv.org/html/2607.00339#S1.p3.1)\.
- \[45\]L\. Zou, J\. Mo, L\. Chen, M\. T\. Özsu, and D\. Zhao\(2011\)GStore: answering sparql queries via subgraph matching\.Proceedings of the VLDB Endowment4\(8\),pp\. 482–493\.Cited by:[§I](https://arxiv.org/html/2607.00339#S1.p4.1),[§IV](https://arxiv.org/html/2607.00339#S4.SS0.SSS0.Px2.p1.1)\.