REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs

arXiv cs.CL Papers

Summary

REAL is a reasoning-enhanced graph framework for long-term memory management of LLMs that uses temporal and confidence-aware directed property graphs with non-destructive temporal updates and hybrid beam search retrieval, achieving an average improvement of 22.72%.

arXiv:2606.10694v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly expected to interact with users over long time horizons. However, due to their finite context window, LLMs cannot retain all past interactions, making long-term memory management essential for storing, updating, and retrieving historical information beyond the context limit. Although recent memory systems attempt to address this issue by storing historical information externally, existing approaches suffer from three key limitations: flat text-based memory organizations fail to capture explicit relations among memories, structured memory systems often destructively overwrite evolving facts, and current retrieval mechanisms remain query-agnostic and passive when evidence is incomplete. REAL constructs long-term conversational memory as a temporal and confidence-aware directed property graph, where each atomic fact is represented with entities, relations, valid-time intervals, confidence scores, and exploration intent labels. During memory construction, REAL adopts a non-destructive temporal update strategy that preserves parallel fact versions and their validity intervals, enabling faithful tracking of fact evolution. During retrieval, REAL anchors query-relevant root entities, decouples their exploration intents, and performs semantic evaluator-guided hybrid beam search to extract compact memory subgraphs. It further incorporates counterfactual inference to repair unreliable retrieval states and recover missing memory evidence through implicit logical relations. Comprehensive experiments demonstrate that REAL substantially improves long-term memory performance over flat-text, graph-based, and existing memory baselines, achieving an average improvement of 22.72\%.
Original Article
View Cached Full Text

Cached at: 06/10/26, 06:12 AM

# REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs
Source: [https://arxiv.org/html/2606.10694](https://arxiv.org/html/2606.10694)
Keer Lu‡, Liwei Chen§, Guoqing Jiang§, Zhiheng Qin§, Yunhuai Liu‡†¶, Wentao Zhang†¶ ‡School of Computer Science, Peking University,§Kuaishou Technology, †Center for Data Science, Academy for Advanced Interdisciplinary Studies, Peking University keer\.lu@stu\.pku\.edu\.cn, \{yunhuai\.liu, wentao\.zhang\}@pku\.edu\.cn

###### Abstract

Large Language Models \(LLMs\) are increasingly expected to interact with users over long time horizons\. However, due to their finite context window, LLMs cannot retain all past interactions, making long‑term memory management essential for storing, updating, and retrieving historical information beyond the context limit\. Although recent memory systems attempt to address this issue by storing historical information externally, existing approaches suffer from three key limitations: flat text\-based memory organizations fail to capture explicit relations among memories, structured memory systems often destructively overwrite evolving facts, and current retrieval mechanisms remain query\-agnostic and passive when evidence is incomplete\.

To address these challenges, we proposeREAL, a reasoning\-enhanced graph framework for long\-term memory management of LLMs\.REALconstructs long\-term conversational memory as a temporal and confidence\-aware directed property graph, where each atomic fact is represented with entities, relations, valid\-time intervals, confidence scores, and exploration intent labels\. During memory construction,REALadopts a non\-destructive temporal update strategy that preserves parallel fact versions and their validity intervals, enabling faithful tracking of fact evolution\. During retrieval,REALanchors query\-relevant root entities, decouples their exploration intents, and performs semantic evaluator\-guided hybrid beam search to extract compact memory subgraphs\. It further incorporates counterfactual inference to repair unreliable retrieval states and recover missing memory evidence through implicit logical relations\. Comprehensive experiments demonstrate thatREALsubstantially improves long\-term memory performance over flat\-text, graph\-based, and existing state\-of\-the\-art memory management methods, achieving an average improvement of 22\.72%\.

$\\P$$\\P$footnotetext:Corresponding Authors\.## IIntroduction

Large Language Models \(LLMs\)\[[1](https://arxiv.org/html/2606.10694#bib.bib42),[54](https://arxiv.org/html/2606.10694#bib.bib43),[11](https://arxiv.org/html/2606.10694#bib.bib45)\]have demonstrated remarkable capabilities in long\-duration tasks, interacting with users over long time horizons that span multiple sessions and hundreds of conversation turns\[[57](https://arxiv.org/html/2606.10694#bib.bib2)\]\. Such long‑term interactions highlight the importance ofLLM memory111“Memory” discussed in this paper is distinct from volatile hardware RAM\., which is essential for retaining and retrieving relevant information over extended periods\[[19](https://arxiv.org/html/2606.10694#bib.bib22),[14](https://arxiv.org/html/2606.10694#bib.bib46),[24](https://arxiv.org/html/2606.10694#bib.bib20)\]\. Despite their potential, maintaining effective memory over prolonged periods remains a critical challenge\. The context window of modern LLMs is severely limited, which typically ranges from 8K to 256K tokens\. As tasks lengthen and the accumulated historical contents exceed the model’s fixed context window, it faces inherent limitations that force migration of information to persistent storage\[[8](https://arxiv.org/html/2606.10694#bib.bib25)\]\.

One straightforward solution is to increase the model’s context window length\[[1](https://arxiv.org/html/2606.10694#bib.bib42),[43](https://arxiv.org/html/2606.10694#bib.bib47)\]\. However, these advances only postpone the inherent limitation without resolving it:\(1\) Unbounded Context Accumulation: The infinitely accumulating dialogue history will inevitably exceed the finite context limits as interactions evolve over weeks or months\.\(2\) Attention Decay over Distance: Even if the past information is included in the same context window, since the model’s attention mechanism degrades with distance\[[28](https://arxiv.org/html/2606.10694#bib.bib48)\], it struggles to retrieve or utilize distant tokens effectively, especially when relevant details are scattered across many tokens\[[14](https://arxiv.org/html/2606.10694#bib.bib46)\]\.\(3\) Thematic Discontinuity: More importantly, real‑world conversations often jump between topics, e\.g\., the user might mention a travel plan, then spend hours troubleshooting a software bug, before returning to discuss hotel recommendations for the journey\. Key facts can become lost among vast amounts of unrelated tokens, which makes full‑context reasoning inefficient\[[26](https://arxiv.org/html/2606.10694#bib.bib49)\]\. Given these inherent limitations, simply enlarging the context window is neither sufficient nor sustainable\.

To overcome these limitations, there is a need for LLMs to adopt memory systems that move beyond static context expansion, where the model could retain crucial information, integrate related concepts, and retrieve relevant details on demand\. A more principled approach is to decouple the model’s long\-term memory from its finite working memory, i\.e\., the context window\. This separation calls for anexternal memory repositorythat persistently stores historical information outside the LLM, indexes it for efficient retrieval, and dynamically supplies relevant subsets back into the context window when needed\[[53](https://arxiv.org/html/2606.10694#bib.bib23)\]\. Such a repository acts as a dedicateddata management system: it extracts facts from the dialogue stream, organizes them into a queryable structure, supports incremental updates, and provides fast, accurate retrieval\.

However, realizing such a memory repository is non‑trivial\. Through analysis, we identified that the challenges in existing LLM long\-term memory management strategies stem precisely from the lack of such a well\-designed memory mechanism, and can be traced to the following three key aspects:

![Refer to caption](https://arxiv.org/html/2606.10694v1/x1.png)Figure 1:Illustration of existing challenges and design motivation\. Finite context windows, attention decay over long distances, and thematic discontinuity make it difficult for LLMs to preserve and retrieve historical information directly from the context window, motivating the need for an external memory repository\. However, existing memory systems suffer from:\(C1\)flat text\-based organization lacks explicit memory links and multi\-hop retrieval capability,\(C2\)destructive updates erase fact evolution, and\(C3\)query\-agnostic and passive retrieval fails to recover missing evidence through counterfactual reasoning\.C1: Ineffectiveness of Flat Memory Organization\.Current memory repositories predominantly adopt flat text organization, where the accumulated history is segmented into text chunks, each independently embedded and stored in a vector database\. Memory retrieval is then performed by nearest\-neighbor search on the query embedding to return the top‑kksemantically similar chunks, which are then appended to the prompt\[[57](https://arxiv.org/html/2606.10694#bib.bib2)\]\. While this approach offers simplicity in implementation, it suffers from fundamental deficiencies that severely impede retrieval efficiency and contextual relevance\. First, each text chunk is treated as an isolated semantic unit, and the repository maintains no explicit links between related chunks, making it impossible to navigate from a retrieved chunk to its logical neighbours without performing another similarity search\. As a result, the system cannot perform path\-based reasoning or multi\-hop retrieval\. Second, purely semantic similarity\-based retrieval proves inadequate, which often retrieves lexically similar but semantically irrelevant chunks while missing contextually important evidence\[[20](https://arxiv.org/html/2606.10694#bib.bib35)\]\.

C2: Lack of Fact Evolution\.Recognizing the limitations of flat text organizations, subsequent work has moved toward structured memory management\. However, even when the memory is organized with structure, current LLM memory management systems adopt a destructive update paradigm, in which the newly acquired facts directly overwrite existing ones without preserving their evolutionary trajectories\[[8](https://arxiv.org/html/2606.10694#bib.bib25),[53](https://arxiv.org/html/2606.10694#bib.bib23)\]\. Such replacement strategy fails to capture the temporal dynamics inherent in real\-world knowledge, where the user preferences, plans, and relationships evolve continuously, and facts are refined over time\[[41](https://arxiv.org/html/2606.10694#bib.bib50)\]\. For instance, when a user’s preference changes from “prefers Italian cuisine” to “prefers Japanese cuisine”, most studies simply replace the old fact with the new one, erasing the historical context that could inform future recommendations or explain behavioral shifts\. This gap directly motivates our proposal:designing a structured memory management framework in which facts are annotated with explicit valid\-time intervals, and conflicting updates are preserved as parallel versions instead of being destructively overwritten, thereby preserving the complete evolutionary history of every fact\.

C3: Inefficiency and Passivity in Memory Retrieval\.The retrieval mechanism of the memory repository itself persists as a critical bottleneck\. Existing memory retrieval methods, whether based on flat vector similarity\[[37](https://arxiv.org/html/2606.10694#bib.bib52)\], tree‑based indexing\[[27](https://arxiv.org/html/2606.10694#bib.bib54)\], or graph traversal\[[38](https://arxiv.org/html/2606.10694#bib.bib53)\], lack query‑aware exploration strategies\. Conventional approaches, such as fixed‑step random walks\[[51](https://arxiv.org/html/2606.10694#bib.bib51)\], or Personalized PageRank \(PPR\)\[[15](https://arxiv.org/html/2606.10694#bib.bib38)\], execute the same retrieval pattern regardless of the query’s specific intent\. Tree‑based methods\[[27](https://arxiv.org/html/2606.10694#bib.bib54)\]often rely on fixed‑depth traversal or static thresholds, missing deep or branching dependencies related to the query\. Additionally, they are purely retrieval‑oriented and operate passively\[[21](https://arxiv.org/html/2606.10694#bib.bib63)\]\. When the retrieved evidence lacks sufficient coverage to answer the query, they simply return “I don’t know”\. They cannot perform counterfactual reasoning, i\.e\., leveraging implicit logical relations such as symmetry \(“likes”⇔\\Leftrightarrow“dislikes”, “purchased”⇔\\Leftrightarrow“returned”\), hyponymy \(“basketball shoes”⊂\\subset“sports shoes”\), or temporal succession to infer new facts from the existing memory structure\. Such passivity limits the model’s ability to handle incomplete or ambiguous queries\.

To address the challenges outlined above, we introduceREAL, aReasoning\-Enhanced grAph framework forLong\-term memory management of LLMs\. ForC1andC2, REAL organizes long\-term conversational memory as a temporal, confidence\-aware directed property graph, where each atomic fact is represented with entities, relations, valid\-time intervals, confidence scores, and exploration intent labels\. During memory construction, REAL incrementally updates the graph in a non\-destructive manner, preserving their temporal evolution through parallel versions with explicit validity intervals\. ForC3during memory retrieval, REAL first anchors query\-relevant root entities and decouples their exploration intents, then performs semantic evaluator\-guided hybrid beam search to extract a compact evidence subgraph\. Each candidate traversal path is evaluated by query relevance, logical coherence, and entity\-specific answer sufficiency, enabling the model to adaptively decide whether to stop or expand a beam\. When normal expansion becomes unreliable, REAL further invokes a counterfactual inference mechanism to generate alternative traversal hypotheses and recover missing evidence from implicit relations\. In summary, our contributions are as follows:

- •Multi\-Attribute Memory Construction Mechanism\.We propose a multi\-attribute memory construction mechanism that represents long\-term conversational memory as a temporal, confidence\-aware directed property graph, and adopt a non\-destructive temporal update strategy that preserves parallel fact versions with explicit validity intervals \(forC1,C2\)\.
- •Reasoning\-Enhanced Adaptive Retrieval Strategy\.We integrate mechanisms of exploration intent decoupling, semantic evaluator\-guided hybrid beam search, and counterfactual inference to enable query\-aware and robust memory evidence discovery during the memory retrieval phase \(forC3\)\.
- •Performance and Effectiveness\.Comprehensive evaluations show thatREALbolsters the model’s long\-term memory performances over existing state\-of\-the\-art memory management methods with an average improvement of 22\.72%\.

## IIPreliminary

In this section, we introduce the background information and oundational concepts throughout the paper\.

### II\-AMemory Mechanism of LLMs

In real\-world applications such as personal AI assistants\[[16](https://arxiv.org/html/2606.10694#bib.bib41)\]or longitudinal medical consultations\[[47](https://arxiv.org/html/2606.10694#bib.bib40)\], interaction histories often span months or even years\. For these long\-duration tasks, the memory mechanism of Large Language Models \(LLMs\) plays an extremely important role in determining how to accumulate knowledge, process historical experiences, retrieve relevant information to inform decisions, and so on\[[57](https://arxiv.org/html/2606.10694#bib.bib2)\]\.

Definition[II](https://arxiv.org/html/2606.10694#S2)\.1 \(Conversation Stream Modeling\)\.Formally, we define a multi\-turn conversation as asessionU=\(u1,u2,…,uT\)U=\(u\_\{1\},u\_\{2\},\\dots,u\_\{T\}\), where eachconversation turnut=\{ct,τt\}u\_\{t\}=\\\{c\_\{t\},\\tau\_\{t\}\\\}consists of thetextual contentctc\_\{t\}, which is typically a user query or an LLM response, and anabsolute timestampτt\\tau\_\{t\}whereτt∈ℝ≥0\\tau\_\{t\}\\in\\mathbb\{R\}\_\{\\geq 0\}\. Multiple sessions in the chronological order form aconversation stream𝒳=\(U1,U2,…,UN\)\\mathcal\{X\}=\(U\_\{1\},U\_\{2\},\\dots,U\_\{N\}\)\.

Definition[II](https://arxiv.org/html/2606.10694#S2)\.2 \(Memory Repository\)\.Generally, the model maintains amemory repositoryℳ\\mathcal\{M\}to persistently store historical information extracted from theconversation stream𝒳\\mathcal\{X\}\.ℳ\\mathcal\{M\}can adopt multiple data organization paradigms:

- •Flat\-Text\-based Structure: Conversation stream𝒳\\mathcal\{X\}is split into fixed‑length text chunks and stored in a vector database\. Retrieval is performed by nearest neighbor search to obtain the top‑kksemantically similar chunks\.
- •Graph‑based Structure: These strategies incorporate graph databases for memory storage and retrieval processes, where the conversation stream𝒳\\mathcal\{X\}is represented as an entity‑relation graph to support structured path traversal and multi‑hop reasoning\.

### II\-BProblem Formulation of LLM Memory Management

Given the conversation stream𝒳\\mathcal\{X\}, the model should maintain amemory repositoryℳ\\mathcal\{M\}such that for any new user queryqq, the model can retrieve the most relevant subsetℳ\|q\\mathcal\{M\}\|\_\{q\}fromℳ\\mathcal\{M\}and generate an accurate answeraa\. The core problems of memory management can be decomposed into two subproblems:

- •Memory Construction: Given the accumulated conversation stream𝒳\\mathcal\{X\}, the model transforms𝒳\\mathcal\{X\}intoℳ\\mathcal\{M\}via a memory construction processΦconst\\Phi\_\{\\text\{const\}\}: ℳ=Φconst​\(𝒳\)\\mathcal\{M\}=\\Phi\_\{\\text\{const\}\}\(\\mathcal\{X\}\)\(1\)The central aspect of the research is how to extract, organize, and continuously updateℳ\\mathcal\{M\}from the streaming conversation input𝒳\\mathcal\{X\}, such thatℳ\\mathcal\{M\}compactly represents historical information without losing critical details\.
- •Memory Retrieval: For a new user queryqq, the model performs a memory retrieval processΦret\\Phi\_\{\\text\{ret\}\}: ℳ\|q=Φret​\(q,ℳ\)⊆ℳ\\mathcal\{M\}\|\_\{q\}=\\Phi\_\{\\text\{ret\}\}\(q,\\mathcal\{M\}\)\\subseteq\\mathcal\{M\}\(2\)whereℳ\|q\\mathcal\{M\}\|\_\{q\}denotes the retrieved memory subset\. The final answeraais generated conditioned on the queryqqand the memory subsetℳ\|q\\mathcal\{M\}\|\_\{q\}\. The essence of the research is how to efficiently and accurately locate the subset of evidenceℳ\|q\\mathcal\{M\}\|\_\{q\}from the large\-scale memory repositoryℳ\\mathcal\{M\}that best supports answering the queryqq\.

## IIISystem Overview

![Refer to caption](https://arxiv.org/html/2606.10694v1/x2.png)Figure 2:Overview of REAL pipeline\. During memory construction, conversation streams are converted into atomic facts in the form of[Equation4](https://arxiv.org/html/2606.10694#S3.E4)and incrementally stored in a temporal, confidence\-aware directed property graph \([Equation3](https://arxiv.org/html/2606.10694#S3.E3)\)\. During the memory retrieval phase, REAL anchors root entities from the query, decouples their exploration intents, and performs semantic evaluator\-guided hybrid beam search to adaptively choose to stop, expand, or invoke counterfactual inference\. The retrieved paths are globally pruned and aggregated into a compact memory evidence subgraph for final answer generation\.In this section, we present the proposed REAL framework\. As illustrated in[Figure2](https://arxiv.org/html/2606.10694#S3.F2), REAL consists of the following core phases: memory construction, which incrementally builds a temporal and confidence\-aware memory graph, and memory retrieval, which extracts a compact query\-specific evidence subgraph for answer generation\. We also analyze the computational complexity and efficiency of REAL\.

### III\-AMemory Construction

The first phase in REAL framework is constructing the memory, which involves creating a multi\-dimensional representation of entities and their relationships within a defined temporal context\. This memory structure acts as the backbone for reasoning and subsequent memory retrieval\. Here we represent the memory repository as a directed property graph:

ℳ=\{\(h,r,t,τ,c,ι\)∣h,t∈ℰ,r∈ℛ,τ⊆\[0,∞\),c∈\(0,1\],ι∈I\}\\mathcal\{M\}=\\\{\(h,r,t,\\tau,c,\\iota\)\\mid h,t\\in\\mathcal\{E\},r\\in\\mathcal\{R\},\\tau\\subseteq\[0,\\infty\),c\\in\(0,1\],\\iota\\in I\\\}

\(3\)whereℰ\\mathcal\{E\}denotes the set of all entities, e\.g\., users, objects, concepts, etc\.,ℛ\\mathcal\{R\}represents the set of all relations, and\(h,r,t\)\(h,r,t\)is a triple in which the head entityhhis connected to the tail entityttvia the relationrr\. Each edge carries a temporal intervalτ=\[τs,τe\]⊆\[0,∞\)\\tau=\[\\tau\_\{s\},\\tau\_\{e\}\]\\subseteq\[0,\\infty\), along with a confidence scorec∈\(0,1\]c\\in\(0,1\]and an exploration intent labelι\\iota\.

Atomic Fact Extraction\.Given an incoming conversation stream𝒳\\mathcal\{X\}, we segment it into sessions and then into individual turns\. For each turn, we call an LLM to extract a set of atomic facts\. Each fact is output in the form of a sextuple:

m=\(h,r,t,\[τs,τe\],c,ι\)m=\(h,r,t,\[\\tau\_\{s\},\\tau\_\{e\}\],c,\\iota\)\(4\)whereτs\\tau\_\{s\}is the utterance timestamp or the explicit time mentioned in the utterance, andτe\\tau\_\{e\}is initially set to∞\\infty, which means “still valid”\. The LLM is prompted to assign confidence scores and exploration intents based on linguistic cues\. For instance, statements containing hedging modal verbs, e\.g\., “might” and “perhaps”, receive lower confidence, whereas definitive declarative statements are assigned higher confidence\. Moreover, words like “because” trigger the exploration intent ofCAUSAL, while “before/after” triggerTEMPORAL\. We then detail how these attributes, incluing the temporal interval\[τs,τe\]\[\\tau\_\{s\},\\tau\_\{e\}\], confidence scorecc, and exploration intentι\\iota, are determined and maintained during memory construction:

Incremental Graph Update with Non‑Destructive Temporal Evolution\.When a new atomic factmnew=\(h,r,t,τnew,cnew\)m\_\{\\text\{new\}\}=\(h,r,t,\\tau\_\{\\text\{new\}\},c\_\{\\text\{new\}\}\)arrives, we first retrieve any existing factmold=\(h,r,t′,τold,cold\)m\_\{\\text\{old\}\}=\(h,r,t^\{\\prime\},\\tau\_\{\\text\{old\}\},c\_\{\\text\{old\}\}\)with the same head entity and relation but possibly different tail entity\. The update behavior is determined by thecardinality typeof relationrr:

- •Single‑Valued Relations: At any valid time interval, at most one tail entity can be true, e\.g\., “current\_living\_city”\. If there existsmoldm\_\{\\text\{old\}\}with the same tail entityt′=tt^\{\\prime\}=t, we then merge the intervalsτnew=\[min⁡\(τs,new,τs,old\),∞\)\\tau\_\{\\text\{new\}\}=\[\\min\(\\tau\_\{s,\\text\{new\}\},\\tau\_\{s,\\text\{old\}\}\),\\infty\)\. Otherwise, if there existsmoldm\_\{\\text\{old\}\}with a different tail entityt′≠tt^\{\\prime\}\\neq t, and the intervals overlap, it indicates that aconflicting factoccurs\. In this case, we do not overwrite\. Instead, we close the old interval by setting itsτe\\tau\_\{e\}toτs,new\\tau\_\{s,\\text\{new\}\}, when the time the new factmnewm\_\{\\text\{new\}\}becomes true, and insert the new fact with its own interval\. Parallel edges with non‑overlapping intervals thus record the complete evolution of the attribute\.
- •Multi‑Valued Relations: Multiple values can coexist even with overlapping intervals, e\.g\., “likes\_food”, “friend\_of”\. We simply addmnewm\_\{\\text\{new\}\}as a new parallel edge, without closing any existing interval\. If the exact same triple\(h,r,t\)\(h,r,t\)already exists with an overlapping interval, we merge the intervals as in the same‑value case\.
- •Unknown Cardinality: To avoid data loss, we regard the cardinality type of relationrrdefault tomulti‑valued relationsin this case, while the system may later learn cardinality constraints from observed patterns\.

This cardinality‑aware temporal update strategy ensures that we neither mistakenly erase valid parallel facts \(e\.g\., two liked foods\) nor fail to capture genuine attribute evolution \(e\.g\., a living city change\)\. It forms the foundation for faithful historical backtracking in long‑term conversation memory\.

Confidence Stratification and Upgrade\.To prevent unconfirmed statements during memory retrieval, we introduce the confidence stratification strategy\. Each atomic fact stored in the memory graphℳ\\mathcal\{M\}is associated with a confidence scorec∈\(0,1\]c\\in\(0,1\], which is initially assigned by the LLM during fact extraction according to linguistic reliability cues\. Specifically, statements containing hedging expressions or modal verbs, such as “might”, “perhaps”, and “maybe”, are assigned lower confidence scores, whereas assertive declarative statements, such as those containing “is”, “was”, or “always”, are assigned higher confidence scores\. During the memory retrieval phase, only facts whose current confidence satisfiesc≥θstablec\\geq\\theta\_\{\\text\{stable\}\}are considered as core memory evidence for beam expansion\. Facts below this threshold are excluded from normal retrieval to reduce noise, unless the query explicitly requests uncertain or hypothetical information, e\.g\., the query contains uncertainty keywords such as “guess”, “maybe”, “speculate”\. When subsequent dialogue turns provide additional confirmation for a low\-confidence fact, its confidence score can be upgraded accordingly, allowing previously uncertain memories to become eligible for future retrieval\.

Exploration Intent Enrichment\.During memory construction phase, we further attach an exploration intent labelι∈\{\\iota\\in\\\{FACT,CAUSAL,TEMPORAL,CONTRAST,EVOLUTION\}\\\}to each extracted atomic fact, which describes what type of future retrieval or reasoning the memory fact may support\. For example, the sentence “I used to prefer Italian food, but recently I choose Japanese food more often” yields facts with anEVOLUTIONintent, because the memory describes a change in user preference over time\. A conversation record such as “I stopped using the fitness app because it kept sending too many notifications” is labeled aCAUSALintent, since the fact supports future why/how reasoning\. Similarly, utterances involving explicit time markers, e\.g\., “before moving to Seattle”, are assignedTEMPORALintent, while utterances comparing alternatives or expressing conflicts are assignedCONTRASTintent\. The stored intent labelι\\iotais later used during memory retrieval\. When a user query is issued, the root entity anchoring module extracts the query\-side exploration intent, and the beam search prioritizes memory edges whose stored exploration intent labels are compatible with the query intent\.

### III\-BMemory Retrieval

Input:User queryqq, Memory repositoryℳ\\mathcal\{M\}, Beam widthkk,Maximum search depthDmaxD\_\{\\text\{max\}\}, Stable confidence thresholdθstable\\theta\_\{\\text\{stable\}\}, Anchored entities with exploration intents\{\(ei,ιi\)\}i=1n\\\{\(e\_\{i\},\\iota\_\{i\}\)\\\}\_\{i=1\}^\{n\},Relevance thresholdδQ​R\\delta\_\{QR\}, Logical coherence thresholdδL​C\\delta\_\{LC\};

Parameter:Memory beamℬ\\mathcal\{B\}, Semantic evaluation scoreS​\(⋅\)S\(\\cdot\);

Output:Retrieved memory subset

ℳq\\mathcal\{M\}\_\{q\};

1

2

𝒫f​i​n​a​l←∅\\mathcal\{P\}\_\{final\}\\leftarrow\\emptyset;

3

4for*each\(ei,ιi\)∈\{\(ei,ιi\)\}i=1n\(e\_\{i\},\\iota\_\{i\}\)\\in\\\{\(e\_\{i\},\\iota\_\{i\}\)\\\}\_\{i=1\}^\{n\}*do

5Initialize beam set:

ℬi\(0\)←\{\(ei,ιi\)\}\\mathcal\{B\}\_\{i\}^\{\(0\)\}\\leftarrow\\\{\(e\_\{i\},\\iota\_\{i\}\)\\\};

6

7for*t=0t=0toD*max*−1D\_\{\\text\{max\}\}\-1*do

8

ℬc​a​n​d​\(i\)\(t\)←∅\\mathcal\{B\}\_\{cand\(i\)\}^\{\(t\)\}\\leftarrow\\emptyset;

9

10for*each traversal memory pathPi,j\(t\)∈ℬi\(t\)P\_\{i,j\}^\{\(t\)\}\\in\\mathcal\{B\}\_\{i\}^\{\(t\)\}*do

//Candidate Node Expansion

11Retrieve direct neighbors𝒩​\(ei,j\(t\)\)\\mathcal\{N\}\(e\_\{i,j\}^\{\(t\)\}\)ofei,j\(t\)e\_\{i,j\}^\{\(t\)\}fromℳ\\mathcal\{M\};

12

13

𝒩r​a​w​\(i,j\)\(t\),𝒫′i,j\(t\)←∅\\mathcal\{N\}\_\{raw\(i,j\)\}^\{\(t\)\},\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}\\leftarrow\\emptyset;

14

15for*eachu∈𝒩​\(ei,j\(t\)\)u\\in\\mathcal\{N\}\(e\_\{i,j\}^\{\(t\)\}\)*do

16if**TimeValid*​\(u,q\)\\textsc\{TimeValid\}\(u,q\)andc​\(ei,j\(t\),u\)≥θ*stable*c\(e\_\{i,j\}^\{\(t\)\},u\)\\geq\\theta\_\{\\text\{stable\}\}and*IntentCompatible*​\(ιu,ιi,j\(t\)\)\\textsc\{IntentCompatible\}\(\\iota\_\{u\},\\iota\_\{i,j\}^\{\(t\)\}\)*then

17

𝒩r​a​w​\(i,j\)\(t\)←𝒩r​a​w​\(i,j\)\(t\)∪\{u\}\\mathcal\{N\}\_\{raw\(i,j\)\}^\{\(t\)\}\\leftarrow\\mathcal\{N\}\_\{raw\(i,j\)\}^\{\(t\)\}\\cup\\\{u\\\};

18

19𝒩c​a​n​d​\(i,j\)\(t\)←TopM​\(𝒩r​a​w​\(i,j\)\(t\),Sim​\(q,r​\(ei,j\(t\),u\)⊕u\)\)\\mathcal\{N\}\_\{cand\(i,j\)\}^\{\(t\)\}\\leftarrow\\textsc\{TopM\}\(\\mathcal\{N\}\_\{raw\(i,j\)\}^\{\(t\)\},\\textsc\{Sim\}\(q,r\(e\_\{i,j\}^\{\(t\)\},u\)\\oplus u\)\);

20

21

𝒫′i,j\(t\)←\{Pi,j\(t\)⊕\(u,ιu\)\|u∈𝒩cand​\(i,j\)\(t\)\}\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}\\leftarrow\\big\\\{P\_\{i,j\}^\{\(t\)\}\\oplus\(u,\\iota\_\{u\}\)\\;\\big\|\\;u\\in\\mathcal\{N\}\_\{\\text\{cand\}\(i,j\)\}^\{\(t\)\}\\big\\\};

//Traversal Path Scoring

22for*eachp∈𝒫′i,j\(t\)∪\{Pi,j\(t\)\}p\\in\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}\\cup\\\{P\_\{i,j\}^\{\(t\)\}\\\}*do

23ComputeS​\(p\)=\(SQ​R​\(p\),SL​C​\(p\),SA​S​\(p\)\)S\(p\)=\(S\_\{QR\}\(p\),S\_\{LC\}\(p\),S\_\{AS\}\(p\)\);

24

//Counterfactual Repair

25if**NeedCounterfactual*​\(𝒫′i,j\(t\),Pi,j\(t\)\)\\textsc\{NeedCounterfactual\}\(\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\},P\_\{i,j\}^\{\(t\)\}\)*then

26𝒞i,j\(t\)←CounterfactualInfer​\(q,ℳ,Pi,j\(t\),ιi,j\(t\)\)\\mathcal\{C\}\_\{i,j\}^\{\(t\)\}\\leftarrow\\textsc\{CounterfactualInfer\}\(q,\\mathcal\{M\},P\_\{i,j\}^\{\(t\)\},\\iota\_\{i,j\}^\{\(t\)\}\);

27

28𝒫c​f​\(i,j\)\(t\)←ConstrainedTraverse​\(ℳ,𝒞i,j\(t\)\)\\mathcal\{P\}\_\{cf\(i,j\)\}^\{\(t\)\}\\leftarrow\\textsc\{ConstrainedTraverse\}\(\\mathcal\{M\},\\mathcal\{C\}\_\{i,j\}^\{\(t\)\}\);

29

ℬc​a​n​d​\(i\)\(t\)←ℬc​a​n​d​\(i\)\(t\)∪𝒫c​f​\(i,j\)\(t\)\\mathcal\{B\}\_\{cand\(i\)\}^\{\(t\)\}\\leftarrow\\mathcal\{B\}\_\{cand\(i\)\}^\{\(t\)\}\\cup\\mathcal\{P\}\_\{cf\(i,j\)\}^\{\(t\)\};

30

//Stop\-or\-Expand Decision

31if*SA​S​\(Pi,j\(t\)\)≥max⁡\{SA​S​\(p\)∣p∈𝒫′i,j\(t\)\}S\_\{AS\}\(P\_\{i,j\}^\{\(t\)\}\)\\geq\\max\\\{S\_\{AS\}\(p\)\\mid p\\in\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}\\\}*then

32

𝒫f​i​n​a​l←𝒫f​i​n​a​l∪\{Pi,j\(t\)\}\\mathcal\{P\}\_\{final\}\\leftarrow\\mathcal\{P\}\_\{final\}\\cup\\\{P\_\{i,j\}^\{\(t\)\}\\\};

33

34else

35𝒫b​e​t​t​e​r←\{p∈𝒫′i,j\(t\)∣SA​S​\(p\)\>SA​S​\(Pi,j\(t\)\)\}\\mathcal\{P\}\_\{better\}\\leftarrow\\\{\\,p\\in\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}\\mid S\_\{AS\}\(p\)\>S\_\{AS\}\(P\_\{i,j\}^\{\(t\)\}\)\\\};

36

37

ℬc​a​n​d​\(i\)\(t\)←ℬc​a​n​d​\(i\)\(t\)∪𝒫b​e​t​t​e​r\\mathcal\{B\}\_\{cand\(i\)\}^\{\(t\)\}\\leftarrow\\mathcal\{B\}\_\{cand\(i\)\}^\{\(t\)\}\\cup\\mathcal\{P\}\_\{better\};

38

39if*ℬc​a​n​d​\(i\)\(t\)=∅\\mathcal\{B\}\_\{cand\(i\)\}^\{\(t\)\}=\\emptyset*then

40break;

41

//Global Beam Pruning

42

ℬi\(t\+1\)←TopK​\(ℬc​a​n​d​\(i\)\(t\),S​\(P∣q\),k\)\\mathcal\{B\}\_\{i\}^\{\(t\+1\)\}\\leftarrow\\textsc\{TopK\}\(\\mathcal\{B\}\_\{cand\(i\)\}^\{\(t\)\},S\(P\\mid q\),k\);

43

44if*all active beams have selectedStop*then

45break;

46

47

ℳq←AggregateEvidence​\(𝒫f​i​n​a​l\)\\mathcal\{M\}\_\{q\}\\leftarrow\\textsc\{AggregateEvidence\}\(\\mathcal\{P\}\_\{final\}\);

48

49return

ℳq\\mathcal\{M\}\_\{q\};

Algorithm 1Semantic Evaluator\-Guided Hybrid Beam Search for Memory Retrieval \([SectionIII\-B](https://arxiv.org/html/2606.10694#S3.SS2)\)Given a user queryqq, the goal of the memory retreival stage is to produce a small memory subsetℳq\\mathcal\{M\}\_\{q\}from the entire memory repositoryℳ\\mathcal\{M\}that maximally supports answeringqq\. The memory retrieval stage contains the following phases:

#### III\-B1Root Entity Anchoring and Intent Decoupling

For each user queryqq, we prompt the LLM to automatically extract a set of root entities as entry points of the memory repositoryℳ\\mathcal\{M\}\. Unlike conventional entity extraction that only identifies surface mentions, our root entity anchoring module further decouples the exploration intent of each entity\. Specifically, the exploration intent is inferred from the semantic goal of the query, the entity role, the expected answer type, temporal expressions, dialogue history, and the relation schema of the memory repository\. The output is in the form of\{\(ei,ιi\)\}i=1n\\\{\(e\_\{i\},\\iota\_\{i\}\)\\\}\_\{i=1\}^\{n\}, whereei∈ℰe\_\{i\}\\in\\mathcal\{E\}denotes the initial root entity for exploration, andιi∈I\\iota\_\{i\}\\in Iis the intent direction for that anchored entity\.

#### III\-B2Semantic Evaluator\-Guided Hybrid Beam Search

We formulate the expansion from each anchored entity as a hybrid beam search process that integrates reinforcement learning \(RL\) to learn optimal expansion actions\. The model receives a user queryqq, a set ofinitialcandidate entities along with their exploration intents\{\(ei,ιi\)\}i=1n\\\{\(e\_\{i\},\\iota\_\{i\}\)\\\}\_\{i=1\}^\{n\}, and aims to retrieve an optimal memory subsetℳq\\mathcal\{M\}\_\{q\}within the entire memory repositoryℳ\\mathcal\{M\}over a series of stepst=1,2,…,Tt=1,2,\.\.\.,T\.

Here we set the beam width to bekk\. Start from each initial root entityeie\_\{i\}in\{\(ei,ιi\)\}i=1n\\\{\(e\_\{i\},\\iota\_\{i\}\)\\\}\_\{i=1\}^\{n\}, in the context of beam search, at each steptt, thejj\-th beam\(j=1,…,k\)\(j=1,\.\.\.,k\)independently maintains a traversal pathPi,jtP\_\{i,j\}^\{t\}from the root entityeie\_\{i\}to the current node, which records both visited entities and their exploration intents, i\.e\.,Pi,j\(t\)=\(ei,ιi,ei,j\(1\),ιi,j\(1\),…,ei,j\(t\),ιi,j\(t\)\)P\_\{i,j\}^\{\(t\)\}=\(e\_\{i\},\\iota\_\{i\},e\_\{i,j\}^\{\(1\)\},\\iota\_\{i,j\}^\{\(1\)\},\\ldots,e\_\{i,j\}^\{\(t\)\},\\iota\_\{i,j\}^\{\(t\)\}\)\.

The operational details are described below\.

I\.Candidate Node Expansion\.For each traversal memory pathPi,j\(t\)P\_\{i,j\}^\{\(t\)\}of thejj\-th beam\(j=1,…,k\)\(j=1,\.\.\.,k\)at steptt, we first retrieve the direct neighboring nodes of the current tail entityei,j\(t\)e\_\{i,j\}^\{\(t\)\}inPi,j\(t\)P\_\{i,j\}^\{\(t\)\}\. To reduce the search space and ensure retrieval quality, we apply three cascading filters based on properties of the memory repositoryℳ\\mathcal\{M\}constructed in[SectionIII\-A](https://arxiv.org/html/2606.10694#S3.SS1):

- •Temporal Filtering: A neighbor is retained only if its valid time intervalτ=\[τs,τe\]\\tau=\[\\tau\_\{s\},\\tau\_\{e\}\]contains the query’s reference timeτq\\tau\_\{q\}\(if explicitly provided, e\.g\., “last month”\)\. When no temporal constraint is given, it is default to the current dialogue time\. Neighbors whose time intervals lie completely in the future or have expired before the query’s temporal scope are excluded\.
- •Confidence Thresholds: Only neighbors with a confidence scorec≥θstablec\\geq\\theta\_\{\\text\{stable\}\}are considered for steady retrieval\. Low‑confidence edges \(c<θstablec<\\theta\_\{\\text\{stable\}\}\) are excluded from the normal retrieval process, unless the query explicitly contains uncertainty markers \(e\.g\., “guess”, “maybe”\)\.
- •Exploration Intent Compatibility: Each fact inℳ\\mathcal\{M\}carries a pre‑assigned intent labelιe\.\\iota\_\{e\}\.The neighbor is kept only ifιe\\iota\_\{e\}is compatible with the current beam’s exploration intentιi,j\(t\)\\iota\_\{i,j\}^\{\(t\)\}\. For example, for a causal intent, only edges labeledCAUSAL\(e\.g\., “causes”, “leads\_to”\) are allowed, while for a temporal intent, edges labeledTEMPORAL\(e\.g\., “precedes”, “follows”\) are retained\.

After applying the above filters, the remaining neighbor nodes form the candidate set𝒩raw​\(i,j\)\(t\)\\mathcal\{N\}\_\{\\text\{raw\}\(i,j\)\}^\{\(t\)\}\. From𝒩raw​\(i,j\)\(t\)\\mathcal\{N\}\_\{\\text\{raw\}\(i,j\)\}^\{\(t\)\}, we further select the top‑mmcandidates based on the semantic similarity score, e\.g\., cosine similarity between the query embedding and the concatenation of the edge’s relation name and target entity in the atomic fact, denoted as𝒩cand​\(i,j\)\(t\)\\mathcal\{N\}\_\{\\text\{cand\}\(i,j\)\}^\{\(t\)\}\.

𝒫′i,j\(t\)=\{Pi,j\(t\)⊕\(u,ιu\)\|u∈𝒩cand​\(i,j\)\(t\)\}\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}=\\big\\\{P\_\{i,j\}^\{\(t\)\}\\oplus\(u,\\iota\_\{u\}\)\\;\\big\|\\;u\\in\\mathcal\{N\}\_\{\\text\{cand\}\(i,j\)\}^\{\(t\)\}\\big\\\}\(5\)where⊕\\oplusdenotes path concatenation, andιu\\iota\_\{u\}is the updated exploration intent after visitinguu\. The final candidate expanded paths𝒫′i,j\(t\)\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}is then passed to the semantic evaluator for scoring\.

II\.Traversal Path Scoring\.We employ the semantic evaluator to assess the current trsversal pathPi,j\(t\)P\_\{i,j\}^\{\(t\)\}and all possible expanded paths𝒫′i,j\(t\)\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}of thejj\-th beam\(j=1,…,k\)\(j=1,\.\.\.,k\)at stepttunder a unified scoring function\. The scoring rubric considers:

- •Query Relevance\(SQ​RS\_\{QR\}\): The degree of semantic correlation between the trsversal memory path and the user’s query\. High relevance indicates that the path’s content directly pertains to the core information need expressed in the query, while low relevance suggests off‑topic facts\. For instance, a path describing “user→\\rightarrowlikes→\\rightarrowItalian cuisine” would receive high relevance score for the query “What is the user’s favorite food?”, while a path about “user→\\rightarrowworks→\\rightarrowtech company” would score low\.
- •Logical Coherence\(SL​CS\_\{LC\}\): Whether the sequence of entities and relations in the trsversal memory path forms a plausible and consistent reasoning chain\. For example, a path “user→\\rightarrowbought→\\rightarrowred shoes” followed by “red shoes→\\rightarrowmade\_of→\\rightarrowleather” is coherent, whereas “user→\\rightarrowlikes→\\rightarrowred shoes” immediately followed by “red shoes→\\rightarrowis→\\rightarrowexpensive” without any bridging relation would be considered incoherent\. High logical coherence indicates that the path can serve as a reliable basis for reasoning, while low coherence suggests fragmentation or irrelevant jumps that may mislead the answer generation\.
- •Entity‑Specific Answer Sufficiency\(SA​SS\_\{AS\}\): To what degree the current memory pathPi,j\(t\)P\_\{i,j\}^\{\(t\)\}contain enough information to answer the sub‑question that directly involves the root entityeie\_\{i\}within the overall query\. Since each traversal memory pathPi,j\(t\)P\_\{i,j\}^\{\(t\)\}originates from a single root entityeie\_\{i\}extracted from the query, it is not expected to fully resolve the entire multi‑entity or multi‑hop query\. Instead, this metric evaluates whether the path already contains sufficient information to answer the part of the query that directly involves that root entity \(e\.g\., the attribute value, the causal link, or the temporal event associated with this entity\)\. A high score indicates that the sub‑question concerning this entity is satisfied, even if other parts of the query remain unanswered\.

III\.Decision Procedure per Beam\.After obtaining the unified scores for the current traversal pathPi,j\(t\)P\_\{i,j\}^\{\(t\)\}and its expanded candidates𝒫′i,j\(t\)\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}from the semantic evaluator, the next phase is to determine the operation for each beam by directly comparing the utility of stopping at the current path versus expanding to a new node\. At each steptt, for thejj\-th beam\(j=1,…,k\)\(j=1,\.\.\.,k\)rooted at the initial entityeie\_\{i\}, its decisionai,j\(t\)a\_\{i,j\}^\{\(t\)\}belongs to the decision space𝒜\\mathcal\{A\}, which is defined as:

𝒜=\{Stop,Expand,Counterfactual\_Infer\}\\mathcal\{A\}=\\\{\\texttt\{Stop\},\\texttt\{Expand\},\\texttt\{Counterfactual\\\_Infer\}\\\}\(6\)The decision for the current beam is made by first comparing theentity‑specific answer sufficiency score\(SA​SS\_\{AS\}\) in a unified metric space\. If the current pathPi,j\(t\)P\_\{i,j\}^\{\(t\)\}achieves a sufficiently higher score than all possible expanded paths𝒫′i,j\(t\)\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}, the beam selects theStopaction:

ai,j\(t\)=Stop,if​SA​S​\(Pi,j\(t\)\)≥max⁡\{SA​S​\(p\)\|p∈𝒫′i,j\(t\)\}a\_\{i,j\}^\{\(t\)\}=\\texttt\{Stop\},\\quad\\text\{if \}S\_\{AS\}\(P\_\{i,j\}^\{\(t\)\}\)\\geq\\max\\left\\\{S\_\{AS\}\(p\)\\;\\middle\|\\;p\\in\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}\\right\\\}\(7\)
Once theStopdecision is selected, the current memory pathPi,j\(t\)P\_\{i,j\}^\{\(t\)\}is added to the final evidence path pool\. If one or more expanded paths obtain higher scores than the current pathPi,j\(t\)P\_\{i,j\}^\{\(t\)\}, theExpandaction is chosen\. We insert those expanded paths whose entity\-specific answer sufficiency score exceedsSA​S​\(Pi,j\(t\)\)S\_\{AS\}\(P\_\{i,j\}^\{\(t\)\}\)into the global candidate path poolℬc​a​n​d​\(i\)\(t\)\\mathcal\{B\}\_\{cand\(i\)\}^\{\(t\)\}of the initial entityeie\_\{i\}for subsequent global top\-kkbeam pruning:

ℬc​a​n​d​\(i\)\(t\)←ℬc​a​n​d​\(i\)\(t\)∪\{SA​S​\(p\)\>SA​S​\(Pi,j\(t\)\)\|p∈𝒫′i,j\(t\)\}\\mathcal\{B\}\_\{cand\(i\)\}^\{\(t\)\}\\leftarrow\\mathcal\{B\}\_\{cand\(i\)\}^\{\(t\)\}\\cup\\left\\\{S\_\{AS\}\(p\)\>S\_\{AS\}\(P\_\{i,j\}^\{\(t\)\}\)\\;\\middle\|\\;p\\in\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}\\right\\\}\(8\)
TheCounterfactual\_Inferaction is triggered when the normal expansion process becomes unreliable\. This occurs when one of the following conditions holds:

- •No valid neighbor remains after filtering:\|𝒩cand​\(i,j\)\(t\)\|=0\|\\mathcal\{N\}\_\{\\text\{cand\}\(i,j\)\}^\{\(t\)\}\|=0\.
- •The maximum query\-neighbor similarity is below a threshold:max⁡\{SQ​R​\(p\)\|p∈𝒫′i,j\(t\)\}<δQ​R\\max\\left\\\{S\_\{QR\}\(p\)\\;\\middle\|\\;p\\in\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}\\;\\right\\\}<\\delta\_\{QR\}\.
- •All expanded candidates have low logical coherence quality:max⁡\{SL​C​\(p\)\|p∈𝒫′i,j\(t\)\}<δL​C\\max\\left\\\{S\_\{LC\}\(p\)\\;\\middle\|\\;p\\in\\mathcal\{P^\{\\prime\}\}\_\{i,j\}^\{\(t\)\}\\;\\right\\\}<\\delta\_\{LC\}\.

whereδQ​R\\delta\_\{QR\}denotes the minimum semantic similarity threshold between candidate neighbors and the query, andδL​C\\delta\_\{LC\}denotes the minimum acceptable logical coherence quality\.Counterfactual\_Inferdoes not directly produce an answer, it repairs the retrieval process by generating alternative traversal hypotheses based on the rules described in[TableI](https://arxiv.org/html/2606.10694#S3.T1)instead\. These hypotheses are converted into constrained graph traversals, and the recovered paths𝒫c​f​\(i,j\)\(t\)\\mathcal\{P\}\_\{cf\(i,j\)\}^\{\(t\)\}are scored and merged back into the same global beam candidate poolℬc​a​n​d​\(i\)\(t\)\\mathcal\{B\}\_\{cand\(i\)\}^\{\(t\)\}\. To prevent unbounded search overhead, we impose a counterfactual traversal budgetBc​f​\(i,j\)\(t\)=max⁡\{0,m−\|𝒩r​e​l​\(i,j\)\(t\)\|\}B\_\{cf\(i,j\)\}^\{\(t\)\}=\\max\\\{0,m\-\|\\mathcal\{N\}\_\{rel\(i,j\)\}^\{\(t\)\}\|\\\}as the maximum number of counterfactually recovered paths retained for each triggered beam, wheremmis the maximum candidate expansion size and𝒩r​e​l​\(i,j\)\(t\)\\mathcal\{N\}\_\{rel\(i,j\)\}^\{\(t\)\}denotes the set of reliable normal candidates whose corresponding expanded paths satisfy the minimum query relevance and logical coherence requirements, i\.e\.,SQ​R​\(p\)≥δQ​RS\_\{QR\}\(p\)\\geq\\delta\_\{QR\}andSL​C​\(p\)≥δL​CS\_\{LC\}\(p\)\\geq\\delta\_\{LC\}\. For thejj\-th beam rooted ateie\_\{i\}at steptt, the recovered path set satisfies\|𝒫c​f​\(i,j\)\(t\)\|≤Bc​f​\(i,j\)\(t\)\|\\mathcal\{P\}\_\{cf\(i,j\)\}^\{\(t\)\}\|\\leq B\_\{cf\(i,j\)\}^\{\(t\)\}\. Only the top\-Bc​f​\(i,j\)\(t\)B\_\{cf\(i,j\)\}^\{\(t\)\}recovered paths, ranked by the semantic evaluator, are merged back into the global beam candidate poolℬc​a​n​d​\(i\)\(t\)\\mathcal\{B\}\_\{cand\(i\)\}^\{\(t\)\}\.

TABLE I:Counterfactual inference rules\.CategoryExampleSymmetrylikes\(A,B\)→\\rightarrowinterested\_in\(A,B\)HyponymysubClassOf\(A,B\)∧\\wedgelikes\(C,A\)→\\rightarrowlikes\(C,B\)Temporal Successionafter\(A,B\)∧\\wedgehappened\(B\)→\\rightarrow¬\\neghappened\(A\)Causalitycauses\(A,B\)∧\\wedgetrue\(A\)→\\rightarrowtrue\(B\)Co‑Occurrenceco\_occurs\(A,B\)∧\\wedgementions\(C,A\)→\\rightarrowmay\_mention\(C,B\)

IV\.Global Beam Pruning\.After all beams have made their decisions at steptt, all candidate paths are aggregated into the global candidate path poolℬc​a​n​d​\(i\)\(t\)\\mathcal\{B\}\_\{cand\(i\)\}^\{\(t\)\}, and only the top\-kkpaths are retained for the next iteration\. The procedure continues until all beams selectStop, no valid candidate path remains, or the maximum search depthDmaxD\_\{\\text\{max\}\}is reached\.

### III\-CAnswer Generation

For answer generation, the retrieved memory subsetℳq\\mathcal\{M\}\_\{q\}is serialized into a compact evidence context containing the selected paths, timestamps, and confidence scores\. The serialization preserves both graph structure and metadata\. For each retrieved path, we include its root entity, traversal sequence, relation labels, valid\-time intervals, and confidence scores\. A typical serialized evidence path is formatted as:

e0→r1,τ1,c1e1→r2,τ2,c2⋯→rn,τn,cnene\_\{0\}\\xrightarrow\{r\_\{1\},\\tau\_\{1\},c\_\{1\}\}e\_\{1\}\\xrightarrow\{r\_\{2\},\\tau\_\{2\},c\_\{2\}\}\\cdots\\xrightarrow\{r\_\{n\},\\tau\_\{n\},c\_\{n\}\}e\_\{n\}\(9\)
Given the serialized memory context, the answer generator is instructed to produce a response to the user queryqqbased on the retrieved memory evidence\.

### III\-DComplexity and Efficiency Analysis

In this section, we analyze the computational complexity of REAL\. Let\|ℳ\|\|\\mathcal\{M\}\|denote the number of atomic memory facts stored in the repository,\|ℰ\|\|\\mathcal\{E\}\|the number of entities, andddthe average out\-degree of an entity in the memory graph\. For a queryqq,nnis the number of anchored root entities,kkdenotes the beam width,DmaxD\_\{\\text\{max\}\}is the maximum search depth, andmmrepresents the number of top candidates retained after local neighbor filtering based on semantic similarity\.

Memory Construction\.For each incoming conversation turn, the model extracts a small set of atomic facts, each represented as\(h,r,t,\[τs,τe\],c,ι\)\(h,r,t,\[\\tau\_\{s\},\\tau\_\{e\}\],c,\\iota\)in the form of[Equation4](https://arxiv.org/html/2606.10694#S3.E4)\. Given an index over the head entity and relation pair\(h,r\)\(h,r\), updating a new fact only requires retrieving existing facts with the same\(h,r\)\(h,r\)key rather than scanning the entire memory repository\. Therefore, the graph update cost for each fact isO​\(degr⁡\(h\)\)O\(\\deg\_\{r\}\(h\)\), wheredegr⁡\(h\)\\deg\_\{r\}\(h\)is the number of existing facts associated with the same head entity and relation\. In practice,degr⁡\(h\)≪\|ℳ\|\\deg\_\{r\}\(h\)\\ll\|\\mathcal\{M\}\|, making the update process output\-sensitive and scalable to continuously growing memory streams\. The additional storage overhead of REAL is linear in the number of atomic facts, i\.e\.,O​\(\|ℳ\|\)O\(\|\\mathcal\{M\}\|\), since each fact is stored once with a constant\-size set of attributes described in[Equation4](https://arxiv.org/html/2606.10694#S3.E4)\.

Memory Retrieval\.At query time, REAL starts fromnnanchored root entities and performs hybrid beam search\. At each step, each active beam visits the direct neighbors of its current tail entity, and the candidate filtering is bounded by the average degreedd\. Selecting the top\-mmcandidates results inO​\(d​log⁡m\)O\(d\\log m\)time per beam step\. Thus, the total cost of normal candidate expansion isO​\(n⋅k⋅Dmax⋅d​log⁡m\)O\\left\(n\\cdot k\\cdot D\_\{\\text\{max\}\}\\cdot d\\log m\\right\)\. After candidate expansion, REAL evaluates the current path and its expanded paths using the semantic evaluator\. Since each beam keeps at mostmmexpanded candidates, the number of path evaluations is bounded byO​\(n⋅k⋅Dmax⋅\(m\+1\)\)O\\left\(n\\cdot k\\cdot D\_\{\\text\{max\}\}\\cdot\(m\+1\)\\right\), where the additional one corresponds to the current path used for the stop\-or\-expand decision\. In implementation, we batched these path evaluations into a single semantic\-evaluation call per beam, which reduces the number of LLM calls from per\-path scoring to per\-beam scoring\.

Counterfactual Inference\.This mechanism is only invoked when normal expansion becomes unreliable\. Letρ\\rhodenote the empirical triggering rate of counterfactual repair, andB¯c​f\\bar\{B\}\_\{cf\}denote the average of the per\-beam counterfactual traversal budget over all triggered beam steps\. The additional retrieval cost introduced by counterfactual repair isO​\(ρ⋅n⋅k⋅Dmax⋅B¯c​f\)O\\left\(\\rho\\cdot n\\cdot k\\cdot D\_\{\\text\{max\}\}\\cdot\\bar\{B\}\_\{cf\}\\right\)\. SinceB¯c​f\\bar\{B\}\_\{cf\}is budgeted andρ\\rhois typically much smaller than 1\.0, counterfactual repair improves robustness without bringing unbounded graph exploration\.

Global Beam Pruning\.At each beam steptt, the candidate pool for each root entityeie\_\{i\}contains the reliable normal candidates and the counterfactually recovered candidates from all active beams\. Since the counterfactual traversal budget is defined asBc​f​\(i,j\)\(t\)=max⁡\{0,m−\|𝒩r​e​l​\(i,j\)\(t\)\|\}B\_\{cf\(i,j\)\}^\{\(t\)\}=\\max\\\{0,m\-\|\\mathcal\{N\}\_\{rel\(i,j\)\}^\{\(t\)\}\|\\\}, the recovered paths𝒫c​f​\(i,j\)\(t\)\\mathcal\{P\}\_\{cf\(i,j\)\}^\{\(t\)\}only fill the remaining reliable candidate slots of the current beam\. Therefore, for each beam, the number of candidates inserted into the global pool is bounded by\|𝒩r​e​l​\(i,j\)\(t\)\|\+\|𝒫c​f​\(i,j\)\(t\)\|≤\|𝒩r​e​l​\(i,j\)\(t\)\|\+Bc​f​\(i,j\)\(t\)≤m\|\\mathcal\{N\}\_\{rel\(i,j\)\}^\{\(t\)\}\|\+\|\\mathcal\{P\}\_\{cf\(i,j\)\}^\{\(t\)\}\|\\leq\|\\mathcal\{N\}\_\{rel\(i,j\)\}^\{\(t\)\}\|\+B\_\{cf\(i,j\)\}^\{\(t\)\}\\leq m, which means the candidate pool for each root entityeie\_\{i\}contains at mostk⋅mk\\cdot mpaths at steptt\. Keeping the top\-kkcandidates can be implemented with a bounded heap, requiringO​\(k​m​log⁡k\)O\(km\\log k\)per root entity per step\. Therefore, the overall retrieval complexity isO​\(n⋅Dmax​\[k​d​log⁡m\+k​\(m\+1\)\+ρ​k​B¯c​f\+k​m​log⁡k\]\)O\\left\(n\\cdot D\_\{\\text\{max\}\}\\left\[kd\\log m\+k\(m\+1\)\+\\rho k\\bar\{B\}\_\{cf\}\+km\\log k\\right\]\\right\)\. Under this setting, sincekk,DmaxD\_\{\\text\{max\}\}, andmmare controlled hyperparameters, the query\-time retrieval cost is primarily determined by the local neighborhood size, i\.e\., the average out\-degreeddof an entity in the memory graph, whered≪\|ℳ\|d\\ll\|\\mathcal\{M\}\|\.

## IVExperiments and Results

### IV\-AExperimental Setup

Benchmarks\.Dialogue\-style memory benchmarkssimulate real\-world multi\-session conversations, which directly evaluates long\-duration conversational memory, personalization, temporal updates, and multi\-session reasoning\. For this scenario, we have selected the following datasets: \(1\)LoCoMo\[[34](https://arxiv.org/html/2606.10694#bib.bib6)\]: A benchmark designed to evaluate whether language models can maintain and retrieve information from extended multi\-session dialogues, where each conversation contains an average of approximately 300 turns and 9K tokens\. \(2\)LongMemEval\[[49](https://arxiv.org/html/2606.10694#bib.bib7)\]: A comprehensive benchmark that evaluates five core memory abilities: information extraction, multi\-session reasoning, temporal reasoning, knowledge updates, and abstention\. \(3\)PersonaMem\[[22](https://arxiv.org/html/2606.10694#bib.bib56)\]: The benchmark encompasses curated user profiles and simulated user\-LLM interaction histories, with each history containing up to 60 sessions across diverse real\-world personalization tasks\.Long\-form, document\-style benchmarksprovide controlled environments for evaluating evidence retrieval, multi\-hop path reasoning, and document\-level knowledge composition\. Given this setting, we adopt the following datasets: \(1\)HotpotQA\[[56](https://arxiv.org/html/2606.10694#bib.bib57)\]: A widely used multi\-hop question answering benchmark built from Wikipedia, which contains 113K question\-answer pairs that require models to locate and reason over multiple supporting documents\. \(2\)2WikiMultihopQA\[[18](https://arxiv.org/html/2606.10694#bib.bib58)\]: This benchmark combines structured and unstructured data, exploiting Wikidata\-style structured relations together with Wikipedia textual evidence, which provides more explicit and comprehensive reasoning supervision on long\-term memory tasks\. \(3\)MuSiQue\[[45](https://arxiv.org/html/2606.10694#bib.bib59)\]: It is a challenging benchmark comprising approximately 25K questions that require 2–4 reasoning hops, with explicitly connected steps where one hop often depends on the answer to a previous one\.

Baselines\.We compare REAL against several representative baselines that cover full historical context, flat text\-based memory strategies, and graph\-based memory approaches\.

- •Full Context: This strategy directly concatenates the entire conversation history into the model’s input context window\.
- •Flat Memory Chunk: In this setting, the dialogue history is segmented into flat text chunks and stored in a vector database\. At inference time, the top\-kkmost similar chunks are retrieved according to cosine similarity, and then concatenated with the query as context for answer generation\.
- •MemGPT\[[36](https://arxiv.org/html/2606.10694#bib.bib24)\]: Inspired by operating systems, this memory\-augmented framework introduces virtual context management to organize memory into different tiers and move information between them when needed\.
- •Mem0\[[8](https://arxiv.org/html/2606.10694#bib.bib25)\]: This memory\-centric framework dynamically extracts and consolidates information from ongoing conversations to support long\-term conversational coherence\. It represents memories primarily as textual memory entries and retrieves them through semantic matching\.
- •Vanilla Memory Graph: This baseline extracts facts as entity\-relation triples\. During retrieval, it starts from entities mentioned in the query and performs fixed\-depth graph traversal according to simple graph proximity\.
- •Mem0g\[[8](https://arxiv.org/html/2606.10694#bib.bib25)\]: It denotes the graph\-memory variant ofMem0\. Different from thevanilla memory graphstrategy, its update phase employs conflict detection and resolution mechanisms when integrating new information into the existing graph\.
- •A\-MEM\[[53](https://arxiv.org/html/2606.10694#bib.bib23)\]: Drawing on the Zettelkasten principle\[[2](https://arxiv.org/html/2606.10694#bib.bib62)\], it dynamically organizes memory into interconnected graphs\. When a new memory is added, the system constructs a structured memory note with contextual descriptions, keywords, and tags, and then links it to related historical memories\.

Evaluation Metrics\.To assess the correctness of the final answer, we adopt two evaluation strategies\. The first is Exact Match \(EM\), which considers a prediction correct only when it exactly matches the ground‑truth answer\. However, such exact match metric is too strict for our setting, since the result is described by natural language\. Therefore, we also consider LLM\-as\-judge \(LJ\) for automatic evaluation, where we use DeepSeek\-V3\[[30](https://arxiv.org/html/2606.10694#bib.bib1)\]to score answer correctness\.

TABLE II:Comparison of REAL with baselines on long\-term memory\-related benchmarks, reporting Exact Match \(EM, %\) and LLM\-as\-Judge \(LJ, %\) results\. The best results are highlighted inbold, and the second best results areunderlined\.ModelMethodDialogue\-Style Memory BenchmarksLong\-Form, Document\-Style BenchmarksLoCoMoLongMemEvalPersonaMemHotpotQA2WikiMultihopQAMuSiQueEMLJEMLJEMLJEMLJEMLJEMLJClose\-Sourced Frontier ModelsDeepSeek\-V3Full Context35\.2446\.8438\.4649\.7822\.5233\.1027\.8051\.6533\.1436\.2519\.3730\.25Flat\-Text\-Based Memory BaselinesFlat Memory Chunk45\.6058\.5350\.1262\.2024\.0035\.0526\.4350\.6835\.5238\.6418\.4529\.67MemGPT48\.3261\.2853\.4665\.8425\.8737\.5628\.1552\.3734\.4636\.6520\.5932\.36Mem052\.6566\.6758\.7371\.2830\.1442\.4627\.6551\.7937\.7539\.6823\.5734\.61Graph\-Based Memory BaselinesVanilla Memory Graph49\.4362\.6454\.2066\.0827\.3638\.9829\.2854\.0540\.5843\.8722\.2533\.65Mem0g53\.9667\.1260\.8073\.6731\.0443\.2531\.7655\.7538\.0440\.0624\.5834\.69A\-MEM55\.1268\.8059\.6372\.5732\.5845\.4532\.8056\.0739\.9542\.3326\.5235\.90REAL \(ours\)59\.9873\.7665\.9278\.5837\.1949\.2136\.4059\.0643\.6246\.1730\.3441\.73Open\-Sourced Base / Instruct ModelsQwen3\-32BFull Context31\.0643\.2835\.1446\.5219\.2030\.1624\.3545\.8029\.9332\.4512\.4924\.54Flat\-Text\-Based Memory BaselinesFlat Memory Chunk42\.3655\.1047\.0059\.3820\.8532\.1425\.2547\.1030\.4933\.6718\.2527\.45MemGPT44\.7457\.8349\.8262\.0922\.9534\.2525\.5448\.6632\.5736\.8617\.4425\.96Mem049\.1163\.0555\.2368\.2025\.0238\.5026\.8548\.9733\.5537\.6519\.5130\.62Graph\-Based Memory BaselinesVanilla Memory Graph45\.9259\.2851\.4564\.1024\.2236\.4025\.5148\.0435\.9538\.8320\.8431\.76Mem0g51\.7466\.3056\.4869\.3528\.5440\.4229\.0351\.3735\.0537\.8021\.5432\.09A\-MEM50\.6564\.5157\.6970\.4531\.3442\.8628\.2550\.6336\.5239\.1622\.8533\.66REAL \(ours\)54\.8869\.4760\.1473\.4934\.3245\.0530\.5653\.7139\.0742\.7925\.5736\.04LLaMA\-3\.3\-70B\-InstructFull Context34\.8045\.6637\.2848\.3420\.6231\.5725\.4247\.3531\.5134\.0517\.3928\.15Flat\-Text\-Based Memory BaselinesFlat Memory Chunk43\.8756\.4848\.6560\.0522\.4033\.6825\.5447\.6530\.5632\.6319\.4630\.95MemGPT46\.1659\.4051\.2863\.4424\.2035\.7327\.5749\.6732\.1534\.6218\.4629\.61Mem050\.6764\.8556\.5869\.9128\.4241\.0628\.8650\.7335\.5538\.1621\.7532\.65Graph\-Based Memory BaselinesVanilla Memory Graph47\.4061\.0252\.8065\.4625\.7437\.5227\.6550\.6933\.5235\.6724\.9635\.62Mem0g52\.1266\.3557\.8870\.9731\.1943\.2530\.6252\.2741\.5845\.7923\.2532\.98A\-MEM55\.3870\.7259\.0872\.1730\.0642\.8031\.1653\.7038\.7541\.6524\.5533\.86REAL \(ours\)58\.6972\.8362\.7876\.0535\.9848\.2833\.6456\.7745\.6548\.7028\.5939\.18

Models and Implementation\.To ensure fair evaluation, we usethe same backbone modelthroughout the end\-to\-end \(E2E\) experiments, isolating performance gains as attributable to the proposed memory framework, not to differences in model capability\. We utilize DeepSeek\-V3\[[30](https://arxiv.org/html/2606.10694#bib.bib1)\]as the close\-sourced frontier models, while using Qwen3\-32B\[[54](https://arxiv.org/html/2606.10694#bib.bib43)\]and LLaMA3\.3\-70B\-Instruct\[[9](https://arxiv.org/html/2606.10694#bib.bib60)\]as the open\-sourced backbone models in our main experiments\. For candidate pre\-ranking, i\.e\., selecting𝒩cand​\(i,j\)\(t\)\\mathcal\{N\}\_\{\\text\{cand\}\(i,j\)\}^\{\(t\)\}from𝒩raw​\(i,j\)\(t\)\\mathcal\{N\}\_\{\\text\{raw\}\(i,j\)\}^\{\(t\)\}, we employed the BGE\-m3 model\[[6](https://arxiv.org/html/2606.10694#bib.bib61)\]as our embedding model\. When conducting evaluations, we used a single\-node machine equipped with 8 NVIDIA H800 GPUs\. Depending on the CUDA memory requirements and computational demands of tasks, the assessment process for each task utilized between 1 to 8 GPUs\.

Hyper\-Parameters Setting\.During the memory retrieval phase, the beam widthkk, maximum search depthDmaxD\_\{\\text\{max\}\}, maximum number of candidate neighbors per expansionmm, stable confidence thresholdθstable\\theta\_\{\\text\{stable\}\}, query relevance thresholdδQ​R\\delta\_\{QR\}, and logical coherence thresholdδL​C\\delta\_\{LC\}were adjusted to optimize the balance between retrieval accuracy and computational cost\. As analyzed in[SectionIV\-C2](https://arxiv.org/html/2606.10694#S4.SS3.SSS2), we have performed experiments under a range of hyperparameter settings with the Qwen3\-32B model, and selected the optimal settings for hyperparameters in our REAL framework\. Overall, the default configuration used in our main experiments isk=5k=5,Dmax=3D\_\{\\text\{max\}\}=3,m=5m=5,θstable=0\.8\\theta\_\{\\text\{stable\}\}=0\.8,δQ​R=0\.5\\delta\_\{QR\}=0\.5, andδL​C=0\.5\\delta\_\{LC\}=0\.5\.

### IV\-BMain Results

The main results of baselines and REAL are demonstrated in[TableII](https://arxiv.org/html/2606.10694#S4.T2)\. and we summarize the observations below\.

REAL is effective across different models\.Experimental results in[TableII](https://arxiv.org/html/2606.10694#S4.T2)show that REAL consistently outperforms other baseline methods across all backbone models in terms of long\-term memory management\. Compared with thefull context, which directly feeds the full context into the model, REAL enhances average downstream task performances by 49\.33%, which suggests that simply exposing the model to the entire history is not sufficient for reliable long\-term memory reasoning\. As opposed toflat\-text\-based memory methods, includingFlat Memory Chunk,MemGPT, andMem0, our method yields a 23\.85% gain, underscoring its superiority over approaches that rely on isolated text chunks or plain textual memory units\. Moreover, models enhanced with REAL and it exceeds the average performance ofgraph\-based memory baselines, such asvanilla memory graph,Mem0g, andA\-MEM, by 12\.71%, further demonstrating that merely converting memory into a graph is not enough, where temporal and reasoning‑aware memory mechanisms are essential\.

TABLE III:Ablations on the impact of each component using theQwen3\-32Bmodel, reporting Exact Match \(EM, %\) and LLM\-as\-Judge \(LJ, %\) results\. “w/o”:removingthe component from the original REAL while keeping the remaining settings unchanged\. The best and second best scores are inboldandunderlined\.MethodDialogue\-Style Memory BenchmarksLong\-Form, Document\-Style BenchmarksAvg\.LoCoMoLongMemEvalPersonaMemHotpotQA2WikiMultihopQAMuSiQueEMLJEMLJEMLJEMLJEMLJEMLJLJREAL54\.8869\.4760\.1473\.4934\.3245\.0530\.5653\.7139\.0742\.7925\.5736\.0453\.43Memory Construction Phasew/oTemporal Interval53\.4268\.2758\.0571\.6132\.5543\.2330\.4253\.5038\.5542\.4025\.3435\.8652\.48↓0\.95w/oConfidence Stratification54\.4468\.8558\.5972\.1533\.9644\.5730\.2653\.1338\.9842\.5525\.5835\.9552\.87↓0\.56w/oExploration Intent53\.2067\.6259\.4572\.6433\.1543\.7129\.8952\.8538\.2042\.1725\.0435\.5852\.43↓1\.00Memory Retrieval Phasew/oSemantic Evaluator52\.2066\.8356\.4570\.9232\.2742\.9529\.7752\.2036\.5040\.9523\.6735\.2851\.52↓1\.91w/oCounterfactual Inference54\.1468\.6759\.1572\.3733\.6644\.0529\.6552\.3137\.4541\.2624\.4435\.3952\.34↓1\.09

TABLE IV:Search hyperparameter analysis of REAL onQwen3\-32B\. We vary one hyperparameter at a time while keeping others fixed to the default setting:k=5k=5,Dmax=3D\_\{\\text\{max\}\}=3,m=5m=5,θstable=0\.8\\theta\_\{\\text\{stable\}\}=0\.8,δQ​R=0\.5\\delta\_\{QR\}=0\.5, andδL​C=0\.5\\delta\_\{LC\}=0\.5\. “Rel\. Latency”: the normalized retrieval latency relative to the best\-performing default configuration, where1\.001\.00corresponds to the default setting and larger values indicate higher retrieval cost\. “\#Nodes”: the average number of memory graph nodes visited during memory retrieval, reflecting the explored search\-space size\. The best and second best results are highlighted inboldandunderlined\.Hyperparameter SettingDialogue\-Style Memory BenchmarksLong\-Form, Document\-Style BenchmarksAvg\.Rel\.Latency\#NodesLoCoMoLongMemEvalPersonaMemHotpotQA2WikiMultihopQAMuSiQueEMLJEMLJEMLJEMLJEMLJEMLJLJBeam Widthkkk=1k=150\.3264\.0155\.2168\.1030\.0640\.4226\.4448\.2734\.9037\.1421\.3830\.9848\.150\.428\.6k=3k=353\.2168\.1558\.7372\.1832\.9143\.9629\.3252\.2937\.7441\.2424\.3634\.9252\.120\.7316\.1𝐤=𝟓\\mathbf\{k=5\}54\.8869\.4760\.1473\.4934\.3245\.0530\.5653\.7139\.0742\.7925\.5736\.0453\.431\.0023\.8k=8k=854\.3769\.0559\.7173\.1034\.0344\.7630\.1153\.2038\.6242\.3625\.1535\.7153\.031\.3935\.7k=10k=1053\.7068\.4259\.0972\.4833\.4144\.1029\.6752\.6538\.1041\.8224\.6935\.1052\.431\.6543\.9Maximum Search DepthDmaxD\_\{\\text\{max\}\}Dmax=1D\_\{\\text\{max\}\}=147\.9061\.0252\.3265\.8729\.7839\.5024\.3145\.7031\.8535\.2119\.8429\.0346\.060\.389\.4Dmax=2D\_\{\\text\{max\}\}=253\.2167\.9058\.7271\.8632\.8543\.5128\.9451\.8636\.8540\.0823\.9034\.3851\.600\.6917\.3𝐃max=𝟑\\mathbf\{D\_\{\\text\{max\}\}=3\}54\.8869\.4760\.1473\.4934\.3245\.0530\.5653\.7139\.0742\.7925\.5736\.0453\.431\.0023\.8Dmax=4D\_\{\\text\{max\}\}=455\.2569\.8459\.5872\.9634\.9445\.5231\.1254\.1939\.6143\.1225\.0935\.6653\.541\.3232\.5Dmax=5D\_\{\\text\{max\}\}=553\.5168\.1058\.8472\.1233\.2543\.8729\.4552\.4837\.9241\.5024\.5034\.9552\.171\.5840\.8Candidate Expansion Sizemmm=3m=352\.7267\.5658\.2971\.3532\.5443\.2228\.5551\.1036\.4839\.8523\.4233\.7751\.140\.8218\.6𝐦=𝟓\\mathbf\{m=5\}54\.8869\.4760\.1473\.4934\.3245\.0530\.5653\.7139\.0742\.7925\.5736\.0453\.431\.0023\.8m=10m=1055\.4870\.1260\.8074\.0234\.0144\.6130\.2353\.2539\.7143\.1825\.2235\.7253\.481\.3133\.5m=15m=1553\.6568\.3159\.1172\.4133\.4243\.9729\.6152\.5138\.0541\.4524\.7235\.0952\.291\.5842\.6m=20m=2052\.8167\.4458\.2571\.6232\.7943\.2828\.9651\.7637\.3640\.7924\.0134\.2051\.521\.8251\.3Stable Confidence Thresholdθstable\\theta\_\{\\text\{stable\}\}θstable=0\.6\\theta\_\{\\text\{stable\}\}=0\.653\.1067\.9558\.5271\.7433\.0243\.5228\.8251\.3636\.9240\.1723\.8333\.8151\.431\.3435\.4θstable=0\.7\\theta\_\{\\text\{stable\}\}=0\.754\.2468\.9859\.4272\.7133\.7144\.3129\.8852\.7938\.3641\.8324\.9435\.4252\.671\.1529\.1θstable=0\.8\\mathbf\{\\theta\_\{\\text\{stable\}\}=0\.8\}54\.8869\.4760\.1473\.4934\.3245\.0530\.5653\.7139\.0742\.7925\.5736\.0453\.431\.0023\.8θstable=0\.9\\theta\_\{\\text\{stable\}\}=0\.951\.6266\.7456\.8470\.3531\.6342\.5027\.5650\.1135\.8939\.0722\.8332\.6850\.240\.7615\.2Counterfactual Inference ThresholdsδQ​R\\delta\_\{QR\}andδL​C\\delta\_\{LC\}δQ​R=0\.3,δL​C=0\.3\\delta\_\{QR\}=0\.3,\\delta\_\{LC\}=0\.353\.0267\.8258\.4371\.7032\.9743\.3528\.7051\.4236\.8440\.0423\.6133\.6851\.340\.9122\.1δQ​R=0\.4,δL​C=0\.4\\delta\_\{QR\}=0\.4,\\delta\_\{LC\}=0\.454\.4169\.0459\.3672\.7233\.6544\.2529\.7752\.6738\.0841\.5625\.0235\.6552\.650\.9522\.8δ𝐐𝐑=0\.5,δ𝐋𝐂=0\.5\\mathbf\{\\delta\_\{QR\}=0\.5,\\delta\_\{LC\}=0\.5\}54\.8869\.4760\.1473\.4934\.3245\.0530\.5653\.7139\.0742\.7925\.5736\.0453\.431\.0023\.8δQ​R=0\.6,δL​C=0\.6\\delta\_\{QR\}=0\.6,\\delta\_\{LC\}=0\.654\.0768\.8559\.7272\.9933\.9644\.6930\.1353\.2338\.5442\.3124\.7835\.2152\.881\.1627\.9δQ​R=0\.7,δL​C=0\.7\\delta\_\{QR\}=0\.7,\\delta\_\{LC\}=0\.753\.5268\.1258\.9171\.8633\.2243\.8829\.3852\.1137\.6841\.2024\.2634\.6951\.981\.3833\.4

![Refer to caption](https://arxiv.org/html/2606.10694v1/x3.png)Figure 3:Search hyperparameter statistics of REAL\.TABLE V:Resource overhead comparison on theQwen3\-32Bmodel setting\. “Construction Latency”: the average memory construction latency per dialogue turn\. “Retrieval Latency”: the average memory retrieval latency per query\. “Answer Latency”: the average latency of generating the final answer from the retrieved memory context\.MethodLatency \(ms\)\#NodesAvg\.LJConstructionRetrievalAnswerFull Context––12869\.14–37\.13Flat Memory Chunk248\.73681\.652643\.05–42\.47MemGPT1123\.691362\.482185\.38–44\.61Mem01260\.961183\.592028\.21–47\.83Vanilla Memory Graph1483\.671097\.542364\.5241\.646\.40Mem0g1762\.701620\.571857\.7534\.849\.56A\-MEM1919\.721781\.051965\.8131\.550\.21REAL2073\.042052\.231716\.1223\.853\.43

### IV\-CAblation Studies and Discussions

To provide a detailed understanding of where the performance gains of REAL come from and how the framework behaves under different deployment constraints, we conducted ablation studies on the designed components, the sensitivity analysis of hyperparameters, as well as efficiency assessment, with results shown in[TableIII](https://arxiv.org/html/2606.10694#S4.T3),[TableIV](https://arxiv.org/html/2606.10694#S4.T4)and[TableV](https://arxiv.org/html/2606.10694#S4.T5)\.

#### IV\-C1Component Ablation

we perform component ablation on the Qwen3\-32B model to quantify the contribution of each major design choice in REAL, including temporal fact modeling, confidence\-aware filtering, exploration intent guidance, semantic path scoring, and counterfactual inference\.

The results of the component ablation are reported in[TableIII](https://arxiv.org/html/2606.10694#S4.T3)\. Overall, removing any major component leads to a performance drop, which indicates that the proposed framework benefits from the integration of these core components\. In particular, removing temporal intervals hurts performance on dialogue\-style memory benchmarks by 3\.03%, where user preferences, plans, and relations evolve over time\. When confidence stratification is removed, answer generation becomes less stable, resulting in an average performance drop of 1\.08% across all datasets\. Removing the exploration intent labelι\\iotaweakens the query\-aware nature of the retrieval process, thus leading to a degradation of 2\.05% in overall model performance\. Disabling the semantic evaluator leads to a substantial decline in answer accuracy by 4\.44%, mainly due to poorer memory retrieval quality\. The performance gap of 2\.22% demonstrates the robustness of memory retrieval by repairing unreliable expansion states through counterfactual inference rather than directly generating unsupported answers\.

#### IV\-C2Search Hyperparameter Sensitivity

We further investigate the sensitivity of hyperparameterkk,DmaxD\_\{\\text\{max\}\},mm,θstable\\theta\_\{\\text\{stable\}\},δQ​R\\delta\_\{QR\}, andδL​C\\delta\_\{LC\}on the Qwen3\-32B model to understand the trade\-off among accuracy, latency, and search cost, with experimental results summarized in[TableIV](https://arxiv.org/html/2606.10694#S4.T4)and[Figure3](https://arxiv.org/html/2606.10694#S4.F3)\.

We experimented withkkvalues of\[1,3,5,8,10\]\\left\[1,3,5,8,10\\right\]and observed that retrieval performance improved askkincreased from 1 to 5, while further increasingkkintroduced additional semantic evaluation cost and even degraded performance\. Thus, we setk=5k=5as the default beam width\. We then varied the maximum search depthDmaxD\_\{\\text\{max\}\}in\[1,2,3,4,5\]\\left\[1,2,3,4,5\\right\]and found thatDmax=3D\_\{\\text\{max\}\}=3was sufficient to capture evidence paths, whereas deeper traversal increased latency and semantic drift\. Therefore, we selectedDmax=3D\_\{\\text\{max\}\}=3\. For the candidate expansion sizemm, we tested\[3,5,10,15,20\]\\left\[3,5,10,15,20\\right\]\. A smallmmreduced search cost but occasionally missed useful branching evidence, while a largemmincreased the number of candidate paths requiring semantic evaluation\. We found thatm=5m=5achieved a favorable trade\-off between candidate diversity and computational overhead\. For confidence\-aware retrieval, we variedθstable\\theta\_\{\\text\{stable\}\}in\[0\.6,0\.7,0\.8,0\.9\]\\left\[0\.6,0\.7,0\.8,0\.9\\right\]and selectedθstable=0\.8\\theta\_\{\\text\{stable\}\}=0\.8, which effectively filtered speculative memories while preserving sufficient evidence coverage\. Finally, for counterfactual repair, we experimented withδQ​R\\delta\_\{QR\}andδL​C\\delta\_\{LC\}values in\[0\.3,0\.4,0\.5,0\.6,0\.7\]\\left\[0\.3,0\.4,0\.5,0\.6,0\.7\\right\]\. We observed that lower thresholds under\-triggered counterfactual repair, while higher thresholds increased inference cost and introduced more inferred noise\. Considering the trade\-off between repair effectiveness and retrieval efficiency, we setδQ​R=0\.5\\delta\_\{QR\}=0\.5andδL​C=0\.5\\delta\_\{LC\}=0\.5\.

![Refer to caption](https://arxiv.org/html/2606.10694v1/x4.png)Figure 4:A case study on LoCoMo using DeepSeek\-V3 as the backbone model\. The phrase “home country” is associated with the earlier memory, while answering the query requires the model to further identify Caroline’s current life plan\.
#### IV\-C3Efficiency Analysis

We further evaluate the computational resource overhead of REAL across benchmarks on Qwen3\-32B deployed with 2×\\timesH800 GPUs\. We report memory construction latency, retrieval latency, answer latency, visited memory nodes \(for graph structure memory\), and answer accuracy\. As forfull context, memory construction, retrieval latency, and visited nodes are not applicable because the entire historical context is directly provided to the answer generator\.

As shown in[TableV](https://arxiv.org/html/2606.10694#S4.T5), REAL adds latency during memory construction and retrieval\. However, these costs are controlled by hybrid beam search and candidate node filtering\. Compared tograph\-basedmethods, REAL visits the fewest graph nodes and passes a compact memory context to the answer generator while achieving the highest answer accuracy\. It indicates that REAL makes favorable trade‑offs: the extra computation is effectively converted into superior task performance\.

### IV\-DCase Studies

In[Figure4](https://arxiv.org/html/2606.10694#S4.F4), we present a case study from the LoCoMo dataset on the DeepSeek\-V3 model to illustrate how REAL differs from typical memory approaches, e\.g\.,flat memory chunkandvanilla memory graph, in long\-term memory management\. In this case, REAL retrieves a compact memory subgraph that connectsCaroline’s currentadoption processto her likely future decision, while avoiding distraction from lexically similar but insufficient historical memories\.

## VRelated Work

### V\-ALong\-Term Memory Management in LLMs

With the growing deployment of Large Language Models \(LLMs\) in long\-duration applications, such as multi\-turn dialogue\[[34](https://arxiv.org/html/2606.10694#bib.bib6),[33](https://arxiv.org/html/2606.10694#bib.bib30)\], software engineering\[[55](https://arxiv.org/html/2606.10694#bib.bib3)\], and open\-world exploration\[[40](https://arxiv.org/html/2606.10694#bib.bib5)\], the demand for persistent, queryable, and evolvable memory has become acute, positioning long\-term memory management as a pivotal challenge\[[50](https://arxiv.org/html/2606.10694#bib.bib19),[57](https://arxiv.org/html/2606.10694#bib.bib2),[24](https://arxiv.org/html/2606.10694#bib.bib20)\]\. A widely adopted approach to memory management in LLM\-based systems involves directly incorporating the complete history information as the long\-context input at each interaction turn\[[4](https://arxiv.org/html/2606.10694#bib.bib10),[3](https://arxiv.org/html/2606.10694#bib.bib9),[10](https://arxiv.org/html/2606.10694#bib.bib8)\]\. While such method offers simplicity and effectiveness in low\-interaction settings, its scalability is constrained by the finite context window of language models\. Concurrently, long contexts often include irrelevant or redundant details that compromise the model’s reasoning performance\[[39](https://arxiv.org/html/2606.10694#bib.bib11),[32](https://arxiv.org/html/2606.10694#bib.bib12),[49](https://arxiv.org/html/2606.10694#bib.bib7)\]\. To overcome the aforementioned limitations, existing solutions can be broadly divided into two categories:\(1\) Architectural modificationsembed memory capabilities directly into the model architecture, which include refining position embeddings\[[58](https://arxiv.org/html/2606.10694#bib.bib27),[59](https://arxiv.org/html/2606.10694#bib.bib28)\], enhancing attention mechanisms\[[31](https://arxiv.org/html/2606.10694#bib.bib32),[48](https://arxiv.org/html/2606.10694#bib.bib29),[25](https://arxiv.org/html/2606.10694#bib.bib33)\], and integrating explicit memory modules\[[17](https://arxiv.org/html/2606.10694#bib.bib31)\]\. However, these methods depend on access to model\-internal states, which is unavailable for closed\-sourced or API\-based language models\.\(2\) Summarization\-oriented approachesare emerging as a dominant trend, which attempt to manage long\-term memory from the perspective of context compression, developing techniques to summarize prolonged historical contexts into concise representations\[[7](https://arxiv.org/html/2606.10694#bib.bib14),[23](https://arxiv.org/html/2606.10694#bib.bib15),[35](https://arxiv.org/html/2606.10694#bib.bib13),[52](https://arxiv.org/html/2606.10694#bib.bib16),[46](https://arxiv.org/html/2606.10694#bib.bib17),[12](https://arxiv.org/html/2606.10694#bib.bib18)\]\.

The storage structure and retrieval effectiveness constitute the central research emphasis in summarization\-oriented memory management approaches\. Existing methods can be generally classified into two categories:flat‑text‑basedmethods andgraph‑basedmethods\. Flat‑text‑based approaches typically rely on vector embeddings of fixed‑length text chunks stored in a vector database, retrieving the top‑kksemantically similar chunks via nearest neighbor search\[[57](https://arxiv.org/html/2606.10694#bib.bib2)\]\. Representative examples include MemGPT\[[36](https://arxiv.org/html/2606.10694#bib.bib24)\], Mem0\[[8](https://arxiv.org/html/2606.10694#bib.bib25)\]and RMM\[[42](https://arxiv.org/html/2606.10694#bib.bib26)\], which continuously update a running summary of the dialogue as new turns arrive\. However, these strategies struggle to capture the intricate relationships among different pieces of information, and cannot perform multi‑hop reasoning across multiple documents, which impedes the effectiveness of long\-term memory organization\[[13](https://arxiv.org/html/2606.10694#bib.bib34)\]\.

### V\-BGraph\-Based Memory Frameworks

To address the above issues, researchers have begun to develop graph‑based strategies by organizing the accumulated contextual data into a graph, which serves as a structured memory index, enabling retrieval that goes beyond shallow semantic matching\[[29](https://arxiv.org/html/2606.10694#bib.bib36),[13](https://arxiv.org/html/2606.10694#bib.bib34)\]\. Typically, Mem0g\[[8](https://arxiv.org/html/2606.10694#bib.bib25)\]represents memory as a labeled graph, with entities serving as nodes and relations as edges\. A\-MEM\[[53](https://arxiv.org/html/2606.10694#bib.bib23)\]creates interconnected knowledge networks through dynamic indexing and linking of memories\. HippoRAG\[[12](https://arxiv.org/html/2606.10694#bib.bib18)\]draws inspiration from the hippocampal memory indexing theory\[[44](https://arxiv.org/html/2606.10694#bib.bib37)\], and orchestrates LLMs, knowledge graphs, and the Personalized PageRank \(PPR\) algorithm\[[15](https://arxiv.org/html/2606.10694#bib.bib38)\]to perform single‑step, multi‑hop retrieval\. However, existing graph\-based memory methods usually adopt a destructive update paradigm\[[41](https://arxiv.org/html/2606.10694#bib.bib50)\], and their retrieval procedures are often query\-agnostic and operate passively\[[21](https://arxiv.org/html/2606.10694#bib.bib63)\]\. These limitations motivate our temporal, confidence\-aware, and reasoning\-enhanced graph memory framework\.

## VIConclusion

In this paper, we proposed REAL, a reasoning\-enhanced graph framework for long\-term memory management of Large Language Models \(LLMs\)\. During memory construction, REAL represents historical interactions as a temporal and confidence\-aware directed property graph, where atomic facts are annotated with valid\-time intervals, confidence scores, and exploration intent labels\. During memory retrieval, REAL anchors query\-relevant root entities, decouples their exploration intents, and performs semantic evaluator\-guided hybrid beam search to extract compact memory subgraphs\. Furthermore, when normal expansion becomes unreliable, counterfactual inference generates alternative traversal hypotheses to recover missing evidence from implicit logical relations\. Extensive experiments demonstrate the effectiveness of REAL over flat\-text, graph\-based, and existing memory management strategies\. Overall, our results suggest that treating LLM memory as a structured, temporally evolving, and query\-adaptive data management problem is a promising direction for building more reliable long\-term LLM systems\.

## Acknowledgment

This work is supported by Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China \(JYB2025XDXM113\), National Natural Science Foundation of China \(92470121, 62402016\), National Key R&D Program of China \(2024YFA1014003\), Zhongguancun Academy \(C20250204, C20250602\), Beijing Major Science and Technology Project \(Z251100008125043, Z251100008425023\), and High\-performance Computing Platform of Peking University\.

## AI\-Generated Content Acknowledgement

We acknowledge the use of the generative AI assistants ChatGPT\[[5](https://arxiv.org/html/2606.10694#bib.bib4)\]and DeepSeek\-R1\[[11](https://arxiv.org/html/2606.10694#bib.bib45)\]in preparing this article\. These tools were employed exclusively for language refinement \(grammar, word choice, and readability\) throughout the manuscript \(from[SectionI](https://arxiv.org/html/2606.10694#S1)to[SectionVI](https://arxiv.org/html/2606.10694#S6)\)\. All core scientific contributions, including the task formalization, system design of our REAL pipeline, experimental strategy, and result analysis, were developed solely by the human authors\. The authors carefully reviewed and edited all AI‑suggested modifications to ensure accuracy and alignment with their original work\.

## References

- \[1\]J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat,et al\.\(2023\)Gpt\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p1.1),[§I](https://arxiv.org/html/2606.10694#S1.p2.1)\.
- \[2\]\(2022\)How to take smart notes: one simple technique to boost writing, learning and thinking\.Sönke Ahrens\.Cited by:[7th item](https://arxiv.org/html/2606.10694#S4.I1.i7.p1.1)\.
- \[3\]S\. An, Z\. Ma, Z\. Lin, N\. Zheng, J\. Lou, and W\. Chen\(2024\)Make your llm fully utilize the context\.Advances in Neural Information Processing Systems37,pp\. 62160–62188\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[4\]I\. Beltagy, M\. E\. Peters, and A\. Cohan\(2020\)Longformer: the long\-document transformer\.arXiv preprint arXiv:2004\.05150\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[5\]T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.NeurIPS33,pp\. 1877–1901\.Cited by:[AI\-Generated Content Acknowledgement](https://arxiv.org/html/2606.10694#Sx2.p1.1)\.
- \[6\]J\. Chen, S\. Xiao, P\. Zhang, K\. Luo, D\. Lian, and Z\. Liu\(2024\)Bge m3\-embedding: multi\-lingual, multi\-functionality, multi\-granularity text embeddings through self\-knowledge distillation\.arXiv preprint arXiv:2402\.03216\.Cited by:[§IV\-A](https://arxiv.org/html/2606.10694#S4.SS1.p4.2)\.
- \[7\]A\. Chevalier, A\. Wettig, A\. Ajith, and D\. Chen\(2023\)Adapting language models to compress contexts\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 3829–3846\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[8\]P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav\(2025\)Mem0: building production\-ready ai agents with scalable long\-term memory\.arXiv preprint arXiv:2504\.19413\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p1.1),[§I](https://arxiv.org/html/2606.10694#S1.p6.1),[4th item](https://arxiv.org/html/2606.10694#S4.I1.i4.p1.1),[6th item](https://arxiv.org/html/2606.10694#S4.I1.i6.p1.1),[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p2.1),[§V\-B](https://arxiv.org/html/2606.10694#S5.SS2.p1.1)\.
- \[9\]A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§IV\-A](https://arxiv.org/html/2606.10694#S4.SS1.p4.2)\.
- \[10\]Y\. Fu, R\. Panda, X\. Niu, X\. Yue, H\. Hajishirzi, Y\. Kim, and H\. Peng\(2024\)Data engineering for scaling language models to 128k context\.InProceedings of the 41st International Conference on Machine Learning,pp\. 14125–14134\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[11\]D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi,et al\.\(2025\)Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p1.1),[AI\-Generated Content Acknowledgement](https://arxiv.org/html/2606.10694#Sx2.p1.1)\.
- \[12\]B\. J\. Gutiérrez, Y\. Shu, Y\. Gu, M\. Yasunaga, and Y\. Su\(2024\)Hipporag: neurobiologically inspired long\-term memory for large language models\.Advances in neural information processing systems37,pp\. 59532–59569\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1),[§V\-B](https://arxiv.org/html/2606.10694#S5.SS2.p1.1)\.
- \[13\]B\. J\. Gutiérrez, Y\. Shu, W\. Qi, S\. Zhou, and Y\. Su\(2025\)From rag to memory: non\-parametric continual learning for large language models\.InInternational Conference on Machine Learning,pp\. 21497–21515\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p2.1),[§V\-B](https://arxiv.org/html/2606.10694#S5.SS2.p1.1)\.
- \[14\]K\. Hatalis, D\. Christou, J\. Myers, S\. Jones, K\. Lambert, A\. Amos\-Binks, Z\. Dannenhauer, and D\. Dannenhauer\(2023\)Memory matters: the need to improve long\-term memory in llm\-agents\.InProceedings of the AAAI Symposium Series,Vol\.2,pp\. 277–280\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p1.1),[§I](https://arxiv.org/html/2606.10694#S1.p2.1)\.
- \[15\]T\. H\. Haveliwala\(2002\)Topic\-sensitive pagerank\.InProceedings of the 11th international conference on World Wide Web,pp\. 517–526\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p7.3),[§V\-B](https://arxiv.org/html/2606.10694#S5.SS2.p1.1)\.
- \[16\]J\. He, C\. Treude, and D\. Lo\(2025\)Llm\-based multi\-agent systems for software engineering: literature review, vision, and the road ahead\.ACM Transactions on Software Engineering and Methodology34\(5\),pp\. 1–30\.Cited by:[§II\-A](https://arxiv.org/html/2606.10694#S2.SS1.p1.1)\.
- \[17\]Z\. He, Y\. Cao, Z\. Qin, N\. Prakriya, Y\. Sun, and J\. Cong\(2025\-04\)HMT: hierarchical memory transformer for efficient long context language processing\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 8068–8089\.External Links:ISBN 979\-8\-89176\-189\-6Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[18\]X\. Ho, A\. D\. Nguyen, S\. Sugawara, and A\. Aizawa\(2020\)Constructing a multi\-hop qa dataset for comprehensive evaluation of reasoning steps\.InProceedings of the 28th International Conference on Computational Linguistics,pp\. 6609–6625\.Cited by:[§IV\-A](https://arxiv.org/html/2606.10694#S4.SS1.p1.1)\.
- \[19\]C\. Hu, J\. Fu, C\. Du, S\. Luo, J\. Zhao, and H\. Zhao\(2023\)Chatdb: augmenting llms with databases as their symbolic memory\.arXiv preprint arXiv:2306\.03901\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p1.1)\.
- \[20\]S\. Hu, Y\. Wei, J\. Ran, Z\. Yao, X\. Han, H\. Wang, R\. Chen, and L\. Zou\(2026\)Does memory need graphs? a unified framework and empirical analysis for long\-term dialog memory\.InProceedings of the 64th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p5.1)\.
- \[21\]S\. Ji, Y\. Li, and B\. Hooi\(2026\)Memory is reconstructed, not retrieved: graph memory for llm agents\.InICLR 2026 Workshop on Memory for LLM\-Based Agentic Systems,Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p7.3),[§V\-B](https://arxiv.org/html/2606.10694#S5.SS2.p1.1)\.
- \[22\]B\. Jiang, Z\. Hao, Y\. M\. Cho, B\. Li, Y\. Yuan, S\. Chen, L\. Ungar, C\. J\. Taylor, and D\. Roth\(2025\)Know me, respond to me: benchmarking llms for dynamic user profiling and personalized responses at scale\.InNeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling,Cited by:[§IV\-A](https://arxiv.org/html/2606.10694#S4.SS1.p1.1)\.
- \[23\]H\. Jiang, Q\. Wu, C\. Lin, Y\. Yang, and L\. Qiu\(2023\)Llmlingua: compressing prompts for accelerated inference of large language models\.InProceedings of the 2023 conference on empirical methods in natural language processing,pp\. 13358–13376\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[24\]J\. Kang, M\. Ji, Z\. Zhao, and T\. Bai\(2025\)Memory os of ai agent\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 25972–25981\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p1.1),[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[25\]J\. Kang, W\. Wu, F\. Christianos, A\. J\. Chan, F\. D\. Greenlee, G\. Thomas, M\. Purtorab, and A\. Toulis\(2025\)Lm2: large memory models for long context reasoning\.InWorkshop on Reasoning and Planning for Large Language Models,Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[26\]M\. Levy, A\. Jacoby, and Y\. Goldberg\(2024\-08\)Same task, more tokens: the impact of input length on the reasoning performance of large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15339–15353\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p2.1)\.
- \[27\]H\. Li, Q\. Ai, J\. Zhan, J\. Mao, Y\. Liu, Z\. Liu, and Z\. Cao\(2023\)Constructing tree\-based index for efficient and effective dense retrieval\.InProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 131–140\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p7.3)\.
- \[28\]Y\. Li, Y\. Huang, M\. E\. Ildiz, A\. S\. Rawat, and S\. Oymak\(2024\)Mechanics of next token prediction with self\-attention\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 685–693\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p2.1)\.
- \[29\]L\. Liang, Z\. Bo, Z\. Gui, Z\. Zhu, L\. Zhong, P\. Zhao, M\. Sun, Z\. Zhang, J\. Zhou, W\. Chen,et al\.\(2025\)Kag: boosting llms in professional domains via knowledge augmented generation\.InCompanion Proceedings of the ACM on Web Conference 2025,pp\. 334–343\.Cited by:[§V\-B](https://arxiv.org/html/2606.10694#S5.SS2.p1.1)\.
- \[30\]A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan,et al\.\(2024\)Deepseek\-v3 technical report\.arXiv preprint arXiv:2412\.19437\.Cited by:[§IV\-A](https://arxiv.org/html/2606.10694#S4.SS1.p3.1),[§IV\-A](https://arxiv.org/html/2606.10694#S4.SS1.p4.2)\.
- \[31\]H\. Liu, M\. Zaharia, and P\. Abbeel\(2024\)RingAttention with blockwise transformers for near\-infinite context\.InThe Twelfth International Conference on Learning Representations,Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[32\]N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang\(2024\)Lost in the middle: how language models use long contexts\.Transactions of the association for computational linguistics12,pp\. 157–173\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[33\]K\. Lu, X\. Nie, Z\. Liang, D\. Pan, S\. Zhang, K\. Zhao, W\. Chen, Z\. Zhou, G\. Dong, B\. Cui,et al\.\(2025\)DataSculpt: a holistic data management framework for long\-context llms training\.In2025 IEEE 41st International Conference on Data Engineering \(ICDE\),pp\. 4234–4247\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[34\]A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang\(2024\)Evaluating very long\-term conversational memory of llm agents\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 13851–13870\.Cited by:[§IV\-A](https://arxiv.org/html/2606.10694#S4.SS1.p1.1),[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[35\]J\. Mu, X\. Li, and N\. Goodman\(2023\)Learning to compress prompts with gist tokens\.Advances in Neural Information Processing Systems36,pp\. 19327–19352\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[36\]C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez\(2023\)MemGPT: towards llms as operating systems\.arXiv preprint arXiv:2310\.08560\.Cited by:[3rd item](https://arxiv.org/html/2606.10694#S4.I1.i3.p1.1),[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p2.1)\.
- \[37\]J\. J\. Pan, J\. Wang, and G\. Li\(2024\)Survey of vector database management systems\.The VLDB Journal33\(5\),pp\. 1591–1615\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p7.3)\.
- \[38\]M\. A\. Rodriguez and P\. Neubauer\(2012\)The graph traversal pattern\.InGraph data management: Techniques and applications,pp\. 29–46\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p7.3)\.
- \[39\]F\. Shi, X\. Chen, K\. Misra, N\. Scales, D\. Dohan, E\. H\. Chi, N\. Schärli, and D\. Zhou\(2023\)Large language models can be easily distracted by irrelevant context\.InInternational Conference on Machine Learning,pp\. 31210–31227\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[40\]M\. Shridhar, X\. Yuan, M\. Cote, Y\. Bisk, A\. Trischler, and M\. Hausknecht\(2021\)ALFWorld: aligning text and embodied environments for interactive learning\.InInternational Conference on Learning Representations,Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[41\]X\. Tan, X\. Wang, Q\. Liu, X\. Xu, X\. Yuan, L\. Zhu, and W\. Zhang\(2026\)Memotime: memory\-augmented temporal knowledge graph enhanced large language model reasoning\.InProceedings of the ACM Web Conference 2026,pp\. 4220–4231\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p6.1),[§V\-B](https://arxiv.org/html/2606.10694#S5.SS2.p1.1)\.
- \[42\]Z\. Tan, J\. Yan, I\. Hsu, R\. Han, Z\. Wang, L\. Le, Y\. Song, Y\. Chen, H\. Palangi, G\. Lee,et al\.\(2025\)In prospect and retrospect: reflective memory management for long\-term personalized dialogue agents\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 8416–8439\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p2.1)\.
- \[43\]G\. Team, P\. Georgiev, V\. I\. Lei, R\. Burnell, L\. Bai, A\. Gulati, G\. Tanzer, D\. Vincent, Z\. Pan, S\. Wang,et al\.\(2024\)Gemini 1\.5: unlocking multimodal understanding across millions of tokens of context\.arXiv preprint arXiv:2403\.05530\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p2.1)\.
- \[44\]T\. J\. Teyler and P\. DiScenna\(1986\)The hippocampal memory indexing theory\.\.Behavioral neuroscience100\(2\),pp\. 147\.Cited by:[§V\-B](https://arxiv.org/html/2606.10694#S5.SS2.p1.1)\.
- \[45\]H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal\(2022\)MuSiQue: multihop questions via single\-hop question composition\.Transactions of the Association for Computational Linguistics10,pp\. 539–554\.Cited by:[§IV\-A](https://arxiv.org/html/2606.10694#S4.SS1.p1.1)\.
- \[46\]W\. Wang, L\. Dong, H\. Cheng, X\. Liu, X\. Yan, J\. Gao, and F\. Wei\(2023\)Augmenting language models with long\-term memory\.Advances in Neural Information Processing Systems36,pp\. 74530–74543\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[47\]W\. Wang, Z\. Ma, Z\. Wang, C\. Wu, J\. Ji, W\. Chen, X\. Li, and Y\. Yuan\(2025\)A survey of llm\-based agents in medicine: how far are we from baymax?\.Findings of the Association for Computational Linguistics: ACL 2025,pp\. 10345–10359\.Cited by:[§II\-A](https://arxiv.org/html/2606.10694#S2.SS1.p1.1)\.
- \[48\]Y\. Wang, D\. Krotov, Y\. Hu, Y\. Gao, W\. Zhou, J\. Mcauley, D\. Gutfreund, R\. Feris, and Z\. He\(2025\)M\+: extending memoryllm with scalable long\-term memory\.InInternational Conference on Machine Learning,pp\. 63308–63323\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[49\]D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu\(2025\)LongMemEval: benchmarking chat assistants on long\-term interactive memory\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§IV\-A](https://arxiv.org/html/2606.10694#S4.SS1.p1.1),[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[50\]Y\. Wu, S\. Liang, C\. Zhang, Y\. Wang, Y\. Zhang, H\. Guo, R\. Tang, and Y\. Liu\(2025\)From human memory to ai memory: a survey on memory mechanisms in the era of llms\.arXiv preprint arXiv:2504\.15965\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[51\]F\. Xia, J\. Liu, H\. Nie, Y\. Fu, L\. Wan, and X\. Kong\(2019\)Random walks: a review of algorithms and applications\.IEEE Transactions on Emerging Topics in Computational Intelligence4\(2\),pp\. 95–107\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p7.3)\.
- \[52\]F\. Xu, W\. Shi, and E\. Choi\(2023\)Recomp: improving retrieval\-augmented lms with context compression and selective augmentation\.InThe Twelfth International Conference on Learning Representations,Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[53\]W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang\(2025\)A\-mem: agentic memory for llm agents\.arXiv preprint arXiv:2502\.12110\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p3.1),[§I](https://arxiv.org/html/2606.10694#S1.p6.1),[7th item](https://arxiv.org/html/2606.10694#S4.I1.i7.p1.1),[§V\-B](https://arxiv.org/html/2606.10694#S5.SS2.p1.1)\.
- \[54\]A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p1.1),[§IV\-A](https://arxiv.org/html/2606.10694#S4.SS1.p4.2)\.
- \[55\]J\. Yang, C\. E\. Jimenez, A\. Wettig, K\. Lieret, S\. Yao, K\. Narasimhan, and O\. Press\(2024\)Swe\-agent: agent\-computer interfaces enable automated software engineering\.Advances in Neural Information Processing Systems37,pp\. 50528–50652\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[56\]Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning\(2018\)HotpotQA: a dataset for diverse, explainable multi\-hop question answering\.InProceedings of the 2018 conference on empirical methods in natural language processing,pp\. 2369–2380\.Cited by:[§IV\-A](https://arxiv.org/html/2606.10694#S4.SS1.p1.1)\.
- \[57\]Z\. Zhang, Q\. Dai, X\. Bo, C\. Ma, R\. Li, X\. Chen, J\. Zhu, Z\. Dong, and J\. Wen\(2025\)A survey on the memory mechanism of large language model\-based agents\.ACM Transactions on Information Systems43\(6\),pp\. 1–47\.Cited by:[§I](https://arxiv.org/html/2606.10694#S1.p1.1),[§I](https://arxiv.org/html/2606.10694#S1.p5.1),[§II\-A](https://arxiv.org/html/2606.10694#S2.SS1.p1.1),[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1),[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p2.1)\.
- \[58\]L\. Zhao, X\. Feng, X\. Feng, W\. Zhong, D\. Xu, Q\. Yang, H\. Liu, B\. Qin, and T\. Liu\(2024\)Length extrapolation of transformers: a survey from the perspective of positional encoding\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 9959–9977\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.
- \[59\]C\. Zheng, Y\. Gao, H\. Shi, M\. Huang, J\. Li, J\. Xiong, X\. Ren, M\. Ng, X\. Jiang, Z\. Li,et al\.\(2024\)Dape: data\-adaptive positional encoding for length extrapolation\.Advances in Neural Information Processing Systems37,pp\. 26659–26700\.Cited by:[§V\-A](https://arxiv.org/html/2606.10694#S5.SS1.p1.1)\.

Similar Articles

Learning to Refine Hidden States for Reliable LLM Reasoning

arXiv cs.LG

Proposes ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations in LLMs before decoding, improving reasoning reliability and efficiency compared to chain-of-thought methods.

Stepwise Reasoning Enhancement for LLMs via External Subgraph Generation

arXiv cs.CL

This paper proposes SGR, a framework that enhances LLM stepwise reasoning by integrating external knowledge graphs through query-relevant subgraph generation, combining Cypher-based reasoning with collaborative reasoning integration. Experiments on CWQ, WebQSP, GrailQA, and KQA Pro show improved reasoning accuracy over standard prompting and knowledge-enhanced baselines.

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

arXiv cs.CL

G-Long proposes a graph-enhanced memory management framework for long-term dialogue agents, using a fine-tuned small language model for structured triplet extraction and associative retrieval, achieving state-of-the-art performance in response generation and memory retrieval with reduced computational overhead.