H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure

arXiv cs.CL Papers

Summary

H-Mem is a novel memory mechanism for LLM-based agents that uses a hybrid structure combining a temporal and semantic tree with a knowledge graph to model memory evolution and improve retrieval, achieving state-of-the-art performance on QA benchmarks.

arXiv:2605.15701v1 Announce Type: new Abstract: Memory data are ubiquitous in Large Language Model (LLM)-based agents (e.g., OpenClaw and Manus). A few recent works have attempted to exploit agents' memory for improving their performance on the question-answering (QA) task, but they lack a principled mechanism for effectively modeling how memory data evolves over time and retrieving memory data effectively, leading to poor performance in memory utilization. To fill this gap, we present H-Mem, a novel memory mechanism via a hybrid structure that can not only effectively model the evolution of agent memory over a long period of time, but also provide an efficient memory retrieval approach. Particularly, H-Mem builds a temporal and semantic tree structure that allows the short-term memory data to evolve progressively into long-term memory data, where the latter provides summarized information about the former, while simultaneously constructing a knowledge graph to capture the relationships between entities in memory. Moreover, it offers an effective memory retrieval approach by exploiting the hybrid structure of the tree and graph structures. Extensive experiments on three agent memory benchmarks show that H-Mem achieves state-of-the-art performance on the QA task.

# H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure
Source: [https://arxiv.org/html/2605.15701](https://arxiv.org/html/2605.15701)
Jiawei Yu¹, Yixiang Fang¹, Xilin Liu², Yuchi Ma²

¹The Chinese University of Hong Kong, Shenzhen; ²Huawei Cloud Computing Technologies CO., LTD.

jiaweiyu1@link.cuhk.edu.cn, fangyixiang@cuhk.edu.cn, {liuxilin3, mayuchi1}@huawei.com

###### Abstract

Memory data are ubiquitous in Large Language Model (LLM)-based agents (e.g., OpenClaw and Manus). A few recent works have attempted to exploit agents' memory for improving their performance on the question-answering (QA) task, but they lack a principled mechanism for effectively modeling how memory data evolves over time and retrieving memory data effectively, leading to poor performance in memory utilization. To fill this gap, we present H-Mem, a novel memory mechanism via a hybrid structure that can not only effectively model the evolution of agent memory over a long period of time, but also provide an efficient memory retrieval approach. Particularly, H-Mem builds a temporal and semantic tree structure that allows the short-term memory data to evolve progressively into long-term memory data, where the latter provides summarized information about the former, while simultaneously constructing a knowledge graph to capture the relationships between entities in memory. Moreover, it offers an effective memory retrieval approach by exploiting the hybrid structure of the tree and graph structures. Extensive experiments on three agent memory benchmarks show that H-Mem achieves state-of-the-art performance on the QA task.

## 1 Introduction

LLM-based agents such as OpenClaw \[[17](https://arxiv.org/html/2605.15701#bib.bib6)\] and Manus \[[13](https://arxiv.org/html/2605.15701#bib.bib7)\] have received tremendous attention owing to their powerful abilities in solving complex real-world tasks such as QA. During the interactions between users and agents, a large amount of memory data is generated and accumulated. Generally, agent memory data refers to the information accumulated by an agent during interactions, such as conversation history and task execution records. By exploiting the memory data, the agent can not only gain a clear understanding of users' preferences and behaviors but also improve its performance on understanding context, maintaining conversational coherence, and executing complex tasks. As a result, a crucial component of modern agents is the memory mechanism, which stores and manipulates the agent memory data.

To enable an LLM\-based agent to exploit memory data, a naive memory mechanism is to store the memory as plain text, retrieve all memory data associated with a specific user, and then use the retrieved data to accomplish a task for that user\. However, due to the finite context window of LLMs, this mechanism cannot effectively or efficiently process large amounts of memory data, especially when the agent has interacted with users over a long period of time\. To alleviate this bottleneck, existing systems often apply Retrieval\-Augmented Generation \(RAG\) techniques to memory data\[[10](https://arxiv.org/html/2605.15701#bib.bib35),[23](https://arxiv.org/html/2605.15701#bib.bib2)\]; that is, the agent only retrieves relevant memory information from an external memory database when solving a task\.

Table 1: Taxonomy of representative agent memory methods.

Under this memory-based RAG paradigm, existing methods differ not only in how memory data are indexed before retrieval, but also in their retrieval mechanisms, which determine how relevant evidence is accessed from the index. According to the memory index structures, existing memory mechanisms can be roughly classified into three categories, as reported in Table [1](https://arxiv.org/html/2605.15701#S1.T1). In Table [1](https://arxiv.org/html/2605.15701#S1.T1), *Memory Evolution* denotes temporal-window-based consolidation from short-term memories to long-term summaries, while *Multi-hop Reasoning* denotes entity- or relation-level traversal across memory fragments. A first class of methods adopts the *vector index*, a single-level organization of memory, in which memory fragments are stored as independent entries. To enable efficient memory retrieval, these fragments are often encoded as embeddings, and some methods further store these embeddings in a vector database \[[27](https://arxiv.org/html/2605.15701#bib.bib11),[2](https://arxiv.org/html/2605.15701#bib.bib18)\]. A second class of methods mainly explores the *tree index*, where the semantic topics of the memory data are hierarchically organized across multiple levels, with lower levels preserving fine-grained semantic topics and higher levels providing abstract or persistent representations of those topics \[[20](https://arxiv.org/html/2605.15701#bib.bib38),[7](https://arxiv.org/html/2605.15701#bib.bib41),[11](https://arxiv.org/html/2605.15701#bib.bib30),[6](https://arxiv.org/html/2605.15701#bib.bib42)\]. To query the memory fragments about a specific topic, these methods traverse the tree structure explicitly in a bottom-up or top-down manner. A third class of methods mainly uses the *graph index*, where entities and relationships are represented as nodes and edges, respectively \[[19](https://arxiv.org/html/2605.15701#bib.bib21)\]. By following the links between entities, they naturally support fast relational retrieval and multi-hop reasoning.

Despite this progress, existing memory mechanisms suffer from two major limitations: First, their index designs are still limited in modeling memory evolution, where short\-term memory can be progressively consolidated into long\-term memory, as suggested by studies of human memory consolidation\[[14](https://arxiv.org/html/2605.15701#bib.bib49),[21](https://arxiv.org/html/2605.15701#bib.bib50)\]\. This is primarily because they fail to explicitly take the temporal dimension into account, which renders them incapable of differentiating between short\- and long\-term semantic topics within the memory data\. Second, they cannot accurately retrieve the relevant evidence from the memory index when performing QA tasks\. Specifically, the vector index\-based methods are efficient for similarity search, but they treat memory data as independent entries and therefore cannot explicitly capture either temporal abstraction or entity\-level relational dependencies\. The tree index\-based methods cannot accurately capture the multi\-hop relationships between entities; and the graph index\-based methods cannot identify the consolidated memory data due to the lack of a memory evolution mechanism\. Overall, these methods mainly rely on a single index \(i\.e\., vector, tree, or graph\), so they cannot accurately retrieve the relevant evidence from memory data\. Thus, existing works lack a principled mechanism that can jointly model long\-term memory evolution and support accurate retrieval\.

To address the aforementioned limitations, we propose H-Mem, a novel memory mechanism via a hybrid structure of tree and graph. The key distinction of H-Mem is not merely using a tree index together with a graph index, but coupling temporal-semantic memory evolution with entity-centered multi-hop reasoning. The tree structure of H-Mem organizes memory data both temporally and semantically, where each tree node retains memory information regarding a specific semantic topic within a pre-defined time window. Specifically, each leaf node stores an event of the agent's original memory fragment, containing a semantic topic (e.g., a message in a conversation) generated at a specific timestamp, while the upper-level nodes store the memory summaries of fine-grained semantic topics in their lower levels, covering their respective time windows. To enable memory evolution, H-Mem performs a temporal-and-semantic consolidation; that is, given two tree nodes in the same level whose time windows are very close, if the semantic similarity between their memory data exceeds a predefined threshold, they can share the same parent node, whose memory summary preserves the consolidated information of these two nodes. This temporal and semantic tree structure allows short-term memory to evolve progressively into long-term memory. Furthermore, the graph structure of H-Mem maintains a knowledge graph of entities and their relationships extracted from the memory data, recording the entity-centered information beyond temporal order and capturing multi-hop relationships between entities across different memory fragments. Overall, the tree and graph structures complement each other, and this hybrid structure overcomes the reliance on a single index that is prevalent in existing works.

Based on this hybrid structure, H-Mem includes an effective retrieval method. Given a query $Q$, it first decomposes $Q$ into sub-queries and generates a retrieval workflow for each sub-query. Then, for each sub-query, it locates original memory fragments and multi-hop relevant entities in the graph. Afterwards, it searches the tree in a bottom-up manner for relevant evidence, which is used to complete the RAG process. We have evaluated H-Mem against representative SOTA baselines on three public long-term memory benchmarks covering diverse QA scenarios. The results show that H-Mem achieves superior F1 scores and accuracy while maintaining competitive index and retrieval efficiency. Further analyses validate the contribution of the temporal tree, the knowledge graph, and the agent-assisted retrieval strategy.

Our principal contributions are summarized as follows:

- We propose H-Mem, a novel memory mechanism that can effectively model the evolution of agent memory over a long time by using a hybrid structure of tree and graph.
- Based on the hybrid structure above, we develop an effective method for retrieving the relevant memory evidence to support QA tasks.
- Experiments on three public long-term agent memory benchmarks show that H-Mem achieves SOTA performance in solving QA tasks while maintaining competitive efficiency.

## 2 Related Work

### 2.1 Retrieval-Augmented Generation (RAG)

Recently, many works have explored how LLMs can access external information beyond their parametric knowledge and immediate prompt context. Simply extending the context window is insufficient, as the key challenge is how to select, organize, and reuse the external information effectively. Within this landscape, RAG has become a widely used technique for incorporating external knowledge at inference time \[[10](https://arxiv.org/html/2605.15701#bib.bib35),[28](https://arxiv.org/html/2605.15701#bib.bib1)\]. Given a question $Q$, it retrieves the relevant information from an external database, combines it with $Q$ to form the prompt, and then feeds the prompt into the LLM for generation. Various types of RAG techniques have been studied: naive RAG retrieves relevant passages from external corpora, graph-based RAG leverages a graph-structured index for multi-hop and relation-aware reasoning \[[4](https://arxiv.org/html/2605.15701#bib.bib36),[5](https://arxiv.org/html/2605.15701#bib.bib37)\], and agentic RAG incorporates retrieval into an adaptive reasoning loop so that the model can decide when and how to retrieve during multi-step problem solving \[[1](https://arxiv.org/html/2605.15701#bib.bib39),[9](https://arxiv.org/html/2605.15701#bib.bib40)\].

### 2.2 Agent Memory-based RAG and Agent Memory Mechanisms

Since the agent memory data can be considered as a kind of external information, it is natural to use it for RAG\. The memory\-based RAG techniques\[[23](https://arxiv.org/html/2605.15701#bib.bib2)\]often first extract useful information from memory data, such as user preferences and events, then organize them into some index structures, and finally retrieve relevant evidence and inject it into the prompt when answering a question\. However, different from traditional RAG techniques, which often use static documents to provide factual grounding, the memory\-based RAG techniques operate over stateful, interaction\-derived memory data that evolves over time, and aim to understand context, maintain conversational coherence, and execute complex tasks\. Therefore, they heavily rely on the memory mechanisms, which not only provide an effective organization of the memory data but also offer effective methods for evolving and retrieving the memory data\.

According to the memory index structures, existing memory mechanisms can be roughly classified into three categories: (1) The vector-based memory methods, such as MemoryBank \[[27](https://arxiv.org/html/2605.15701#bib.bib11)\] and Mem0 \[[2](https://arxiv.org/html/2605.15701#bib.bib18)\], store interaction-derived memory as independent embeddings and retrieve relevant memory from ongoing interactions. (2) The tree-based memory methods, such as MemTree \[[20](https://arxiv.org/html/2605.15701#bib.bib38)\], introduce dynamic tree-structured representations to organize memory at different abstraction levels. MemOS \[[11](https://arxiv.org/html/2605.15701#bib.bib30)\] also supports tree-like textual memory modules within its MemCube abstraction. Related hierarchical memory methods, such as MemoryOS \[[7](https://arxiv.org/html/2605.15701#bib.bib41)\] and EverMemOS \[[6](https://arxiv.org/html/2605.15701#bib.bib42)\], also organize memories across multiple levels or structured units, emphasizing memory management and long-term reuse. (3) The graph-based memory methods, such as Zep \[[19](https://arxiv.org/html/2605.15701#bib.bib21)\], construct temporal knowledge graphs for agent memory, enabling relational access to evolving facts and entities. Additionally, recent works have explored structured and adaptive memory mechanisms from related perspectives. A-Mem \[[25](https://arxiv.org/html/2605.15701#bib.bib19)\] studies agentic memory mechanisms, while multi-granularity memory methods \[[24](https://arxiv.org/html/2605.15701#bib.bib24)\] investigate memory association and selection across different abstraction levels.

As aforementioned, although the above works have achieved some promising progress, their index designs are still limited in modeling the evolution of memory data, which progressively consolidates short\-term memory fragments into long\-term memory fragments\. Besides, they cannot accurately retrieve the relevant evidence from the memory index when performing QA tasks\. Therefore, it is desirable to study a novel memory mechanism that can not only effectively model the evolution of agent memory over a long period of time, but also provide an efficient memory retrieval approach\.

## 3 Our Proposed Memory Mechanism H-Mem

To effectively support memory-based RAG, we propose H-Mem, a novel memory mechanism for evolving and retrieving agent memory. H-Mem consists of two stages, Offline Indexing and Online Retrieval: the former builds a hybrid structure of tree and graph, and the latter performs agentic memory retrieval by exploiting the hybrid structure.

![Refer to caption](https://arxiv.org/html/2605.15701v1/x1.png)

![Refer to caption](https://arxiv.org/html/2605.15701v1/x2.png)

Figure 1: The offline indexing stage of H-Mem.

### 3.1 Overview

Let $\mathcal{F}=\{f_i\}$ be the set of original memory fragments of the memory data. In the offline indexing stage, as depicted in Figure [1](https://arxiv.org/html/2605.15701#S3.F1), H-Mem builds a hybrid structure for $\mathcal{F}$, mainly consisting of two parts:

- Tree: We build a temporal and semantic tree $\mathcal{T}$, where each node retains memory information regarding a specific semantic topic within a pre-defined time window. In the temporal view, the nodes at every level from the leaves to the root have pre-defined time windows (e.g., one day, one week, one month, etc.), and the time window of a parent node covers its child nodes' time windows. In the semantic view, each leaf node contains a memory event extracted from an original memory fragment, and each non-leaf node stores a memory summary, providing an abstract and persistent representation of the fine-grained semantic topics of the memory events/summaries in its child nodes.
- Graph: We build a knowledge graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ contains the entities extracted from the original memory fragments and $\mathcal{E}$ denotes relationships between entities. Besides, each entity is linked to the original memory fragment containing it and may have a profile.

The key rationale behind our design is that the tree above can effectively support memory evolution in both temporal and semantic dimensions across different granularities, while the graph records the entity\-centered information beyond the temporal dimension and captures multi\-hop relationships between entities across different memory fragments\. Additionally, to enable efficient semantic search, we maintain the embedding vectors of the memory fragments/events/summaries\.
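As a rough illustration of the two parts described above, the hybrid structure could be held in a pair of small containers. This is only a sketch under our own assumptions: the class and field names (`TreeNode`, `HybridMemory`, `add_child`, `link_entity`) are hypothetical and do not come from the paper, and embeddings and entity profiles are left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TreeNode:
    """One node of the temporal and semantic tree (illustrative fields only)."""
    node_id: int
    level: int                      # 1 = leaf level, larger = coarser time window
    window: tuple[float, float]     # (start, end) of the node's time window
    text: str                       # memory event (leaf) or memory summary (non-leaf)
    parent: Optional[int] = None
    children: list[int] = field(default_factory=list)


@dataclass
class HybridMemory:
    """Tree nodes plus a knowledge graph stored as (head, relation, tail) triples."""
    nodes: dict[int, TreeNode] = field(default_factory=dict)
    edges: set[tuple[str, str, str]] = field(default_factory=set)
    entity_fragments: dict[str, set[int]] = field(default_factory=dict)

    def add_child(self, parent_id: int, child_id: int) -> None:
        parent, child = self.nodes[parent_id], self.nodes[child_id]
        # Property stated in the text: a parent's window covers its children's windows.
        if not (parent.window[0] <= child.window[0] and child.window[1] <= parent.window[1]):
            raise ValueError("parent window must cover child window")
        parent.children.append(child_id)
        child.parent = parent_id

    def link_entity(self, entity: str, fragment_id: int) -> None:
        # Each entity is linked to the original memory fragments that contain it.
        self.entity_fragments.setdefault(entity, set()).add(fragment_id)
```

Keeping the window-coverage check inside `add_child` is a design choice of this sketch; it makes the temporal invariant explicit wherever the tree is modified.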

In the online retrieval stage, H-Mem first decomposes a query $Q$ into sub-queries. Then, for each of them, it searches the relevant memory information within some specific time windows by using the hybrid structure. Finally, it combines the memory information found for all the sub-queries as the relevant evidence identified from the memory data.

### 3.2 Offline Indexing

We now introduce the construction process of the hybrid structure\.

Tree Construction. We build a temporal and semantic tree $\mathcal{T}$ in an incremental manner. Assume that $\mathcal{T}$ has $L$ levels, where the leaf nodes are at the first level, and each level $l$ has two hyperparameters $\alpha_l$ and $\beta_l$, where $\alpha_l$ is the similarity threshold between a node and its child node, and $\beta_l$ is the size of the time window. As shown in Figure [1](https://arxiv.org/html/2605.15701#S3.F1)(a), given a list of original memory fragments, which may include speakers, timestamps, and conversation data, we first extract a set of memory events from each memory fragment, where each memory event preserves a fine-grained semantic topic. Next, we create a leaf node $x$ for each newly extracted memory event and update the tree in a bottom-up manner. At each level $l$, the newly inserted node is assigned to the corresponding temporal window with size $\beta_l$ according to its timestamp. Only the existing nodes within the same temporal window are considered as candidates for semantic consolidation. Within this temporal window, we compute pairwise semantic similarities between the newly inserted node and existing lower-level nodes. If the newly inserted node is similar to an existing candidate cluster according to the threshold $\alpha_l$, we add it to this candidate cluster. Then, we update a non-leaf node $y$ as the parent node of this candidate cluster in the upper level, and generate a memory summary by consolidating the memory events/summaries from its child nodes. Otherwise, we create a new non-leaf node $z$ as the parent node of the newly inserted node in the upper level. Afterwards, we repeat the above process by updating the parents of the newly generated node $y$ or $z$, until we reach the root node. In this way, the tree supports memory evolution across different temporal windows; that is, the leaf nodes retain fine-grained short-term memory, while upper-level nodes provide abstract or persistent long-term memory.
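The per-level consolidation step can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name `assign_to_parent` is hypothetical, a cluster is represented only by its members' embeddings, the similarity to a cluster is taken as the maximum cosine similarity to any member (an assumption, since the text does not fix this), and the LLM-generated memory summary is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def assign_to_parent(clusters, timestamp, embedding, alpha, beta):
    """Place a new node at one level of the tree.

    clusters: dict mapping a window index to a list of clusters, where each
              cluster is a list of member embeddings sharing one parent.
    alpha:    similarity threshold for this level.
    beta:     time-window size for this level (same unit as timestamp).
    Returns (window_index, cluster_index, created_new_parent).
    """
    window = int(timestamp // beta)              # temporal assignment
    candidates = clusters.setdefault(window, [])
    best_idx, best_sim = None, alpha
    for idx, members in enumerate(candidates):   # only clusters in the same window
        sim = max(cosine(embedding, m) for m in members)
        if sim >= best_sim:
            best_idx, best_sim = idx, sim
    if best_idx is None:                         # no similar cluster: new parent (node z)
        candidates.append([embedding])
        return window, len(candidates) - 1, True
    candidates[best_idx].append(embedding)       # join cluster: parent (node y) is updated
    return window, best_idx, False
```

In a full pipeline the same routine would be applied again to the updated or newly created parent at the next level, with that level's own threshold and window size, until the root is reached.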

Graph Construction. H-Mem also incrementally builds a knowledge graph $\mathcal{G}$. First, it extracts entities and relations from each original memory fragment $f_i$. Second, the extracted entities are normalized and resolved into entity nodes through entity disambiguation based on text normalization, lemmatization, token overlap, and fuzzy string matching. If a resolved entity exactly matches an existing entity node and their types are compatible, it is merged into that node; otherwise, it is kept as a new entity node, and some associated edges may be inserted. Besides, the merged or new entity node is linked to the original memory fragment containing it. Third, the extracted relations are mapped to their resolved head and tail entity nodes and then inserted into the graph as edges if the same edge does not already exist. Therefore, the graph provides an entity-centered view of memory and captures multi-hop relations between entities across different memory fragments. In addition, H-Mem maintains profiles for salient entities, which are selected based on the number of fragment–entity links associated with an entity, together with predefined important entity types such as persons, organizations, and locations. For each salient entity, H-Mem maintains a profile keeping both persistent and recent memory data.
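A minimal sketch of the entity-resolution step, using only the Python standard library, might look like the following. Several simplifications are assumptions of this sketch rather than details from the paper: lemmatization and token overlap are omitted, type compatibility is reduced to type equality, fuzzy matching uses `difflib.SequenceMatcher` with a hypothetical threshold of 0.9, and the names `normalize` and `resolve_entity` are illustrative.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def resolve_entity(name, etype, nodes, fuzzy_threshold=0.9):
    """Merge an extracted entity into an existing node or create a new one.

    nodes: set of (normalized_name, type) pairs already in the graph.
    Returns the (normalized_name, type) node the entity resolves to.
    """
    key = normalize(name)
    for existing_name, existing_type in sorted(nodes):
        if existing_type != etype:          # types must be compatible (here: equal)
            continue
        ratio = SequenceMatcher(None, existing_name, key).ratio()
        if existing_name == key or ratio >= fuzzy_threshold:
            return (existing_name, existing_type)
    nodes.add((key, etype))                 # no match: keep as a new entity node
    return (key, etype)


def add_edge(edges, head, relation, tail):
    """Insert a relation only if the same edge does not already exist."""
    triple = (head, relation, tail)
    if triple in edges:
        return False
    edges.add(triple)
    return True
```

For example, `resolve_entity("huawei  cloud.", "org", nodes)` resolves to the same node as an earlier `"Huawei Cloud"` of type `"org"`, while the same surface form with a different type is kept as a separate node.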

Overall, H-Mem constructs a hybrid structure with some vectors, a tree, and a graph that are built incrementally. It can not only effectively model the evolution of agent memory over a long time, but also provide a structure supporting the efficient memory retrieval introduced in the next subsection.

### 3.3 Online Retrieval

Given a query $Q$, H-Mem identifies relevant memory evidence by searching over the hybrid structure in three steps: Retrieval Planning, Evidence Retrieval, and Generation Process.

1\) Retrieval Planning. H-Mem first decomposes $Q$ into a list of sub-queries $\{Q_k\}_{k=1}^{K}$ with dependency relations in an agentic manner. For each $Q_k$, it uses an LLM to infer a memory scope with a label in {Short, Long, Mixed}, indicating the temporal granularity for tree-based retrieval, where Short focuses on short-term memory evidence, Long focuses on long-term memory evidence, and Mixed considers both. Besides, it infers explicit temporal hints if available, such as specific dates and relative time cues. Finally, it generates a retrieval workflow for $Q_k$.

2\) Evidence Retrieval. Following the retrieval workflow, H-Mem retrieves relevant evidence for each $Q_k$ by first exploring entities in the graph and then searching the tree in the hybrid memory structure. Specifically, by extracting entities from $Q_k$, it locates seed entities in the graph through NLP-based lexical entity matching and vector-based semantic similarity search. Starting from the seed entities, it then performs multi-hop expansion in the graph to identify more related entities.
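The multi-hop expansion from seed entities can be illustrated with a bounded breadth-first search. This is a sketch under our own assumptions: the graph is treated as an undirected adjacency mapping, the hop limit `max_hops` is a hypothetical parameter, and relation types are ignored.

```python
from collections import deque


def expand_entities(adjacency, seeds, max_hops):
    """Return {entity: hop distance} for entities within max_hops of any seed.

    adjacency: dict mapping an entity to an iterable of neighboring entities.
    seeds:     seed entities located from the sub-query; unknown seeds are skipped.
    """
    dist = {s: 0 for s in seeds if s in adjacency}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:       # do not expand beyond the hop limit
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:     # first visit gives the shortest hop count
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist
```

The returned hop distances could then be used, for instance, to prefer evidence attached to entities closer to the seeds.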

Based on the identified entities, H-Mem further explores the tree to identify relevant memory evidence in a bottom-up manner. Specifically, it first maps the entities to their original memory fragments and then links them to memory events in the tree. Afterwards, it searches the tree according to the inferred memory scope: if the scope is Short, it only uses the original memory fragments and memory events in the leaf nodes of the tree; if the scope is Long, it only uses the memory events and memory summaries in the tree; and if the scope is Mixed, it uses the original memory fragments, memory events, and memory summaries.
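The scope rule above amounts to a lookup from the scope label to the kinds of evidence that may be used. The snippet below is a small sketch; the dictionary keys follow the labels in the text, while the `kind` field and the function name `filter_by_scope` are hypothetical.

```python
# Evidence kinds allowed under each inferred memory scope, as described in the text.
SCOPE_SOURCES = {
    "Short": {"fragment", "event"},
    "Long": {"event", "summary"},
    "Mixed": {"fragment", "event", "summary"},
}


def filter_by_scope(candidates, scope):
    """Keep only candidates whose kind is allowed under the given scope.

    candidates: list of dicts with at least a "kind" key
                ("fragment", "event", or "summary").
    """
    allowed = SCOPE_SOURCES[scope]
    return [c for c in candidates if c["kind"] in allowed]
```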

For each piece of memory evidence, denoted by $m$, H-Mem considers three aspects between $m$ and $Q_k$: semantic similarity, temporal relevance, and memory robustness. The semantic similarity, denoted $S(m,Q_k)$, is the cosine similarity between the embedding vectors of $m$ and $Q_k$. For temporal relevance, let $I_m=[s_m,e_m]$ denote the time interval of memory evidence $m$, and $I_k=[s_k,e_k]$ denote the temporal interval inferred from the query $Q_k$. The temporal relevance is computed by jointly considering temporal overlap and normalized center distance:

$$T(m,Q_k)=\lambda\cdot\frac{\left|I_m\cap I_k\right|}{\left|I_m\cup I_k\right|+\epsilon}+(1-\lambda)\left(1-\frac{|c_m-c_k|}{\left|I_m\cup I_k\right|+\epsilon}\right), \tag{1}$$

where $c_m$ and $c_k$ are the centers of $I_m$ and $I_k$, $\lambda\in[0,1]$ controls the balance between temporal overlap and temporal distance, and $\epsilon$ is a small constant for numerical stability.
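Eq. (1) can be written directly as a small function. This sketch makes one interpretive assumption that the text does not settle: $|I_m\cup I_k|$ is taken as the total length of the set union (sum of the two lengths minus the overlap). Under that reading, the second term can become negative for intervals that are far apart, and the value is not clamped here. The defaults for `lam` and `eps` are arbitrary.

```python
def temporal_relevance(i_m, i_k, lam=0.5, eps=1e-9):
    """Temporal relevance of Eq. (1) for closed intervals i_m=(s_m, e_m), i_k=(s_k, e_k)."""
    s_m, e_m = i_m
    s_k, e_k = i_k
    overlap = max(0.0, min(e_m, e_k) - max(s_m, s_k))    # |I_m ∩ I_k|
    union = (e_m - s_m) + (e_k - s_k) - overlap          # |I_m ∪ I_k| (assumed: set-union length)
    c_m = (s_m + e_m) / 2.0                              # interval centers
    c_k = (s_k + e_k) / 2.0
    overlap_term = overlap / (union + eps)
    distance_term = 1.0 - abs(c_m - c_k) / (union + eps)
    return lam * overlap_term + (1.0 - lam) * distance_term
```

As a worked case, for $I_m=[0,10]$, $I_k=[5,15]$, and $\lambda=0.5$, the overlap is 5, the union length is 15, and the center distance is 5, so $T \approx 0.5\cdot\tfrac{1}{3}+0.5\cdot\tfrac{2}{3}=0.5$.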

The memory robustness reflects how likely a memory event or summary is to be retained during memory evolution. Inspired by the Ebbinghaus forgetting curve \[[3](https://arxiv.org/html/2605.15701#bib.bib43)\], we formulate the memory robustness of $m$ at the query time $t$ as

$$R(m,t)=\exp\!\left(-\frac{t-r_m}{\tau\left(1+\eta\ln(1+n_m)\right)}\right), \tag{2}$$

where $r_m$ is the most recent timestamp when $m$ is consolidated, $\tau>0$ controls the forgetting time scale, $\eta\geq 0$ controls the reinforcement effect of repeated mentions, and $n_m$ is the number of times that $m$ is consolidated in the memory evolution. A higher robustness value indicates that $m$ is more recent or has been repeatedly mentioned, suggesting that it is more likely to serve as reliable evidence.
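Eq. (2) translates into a one-line function. The sketch below assumes that `t`, `r_m`, and `tau` share the same time unit and that `t >= r_m`; the parameter values in the usage note are illustrative only.

```python
import math


def memory_robustness(t, r_m, n_m, tau, eta):
    """Memory robustness of Eq. (2).

    t:   query time
    r_m: most recent time at which the memory was consolidated
    n_m: number of times the memory has been consolidated
    tau: forgetting time scale (> 0)
    eta: reinforcement strength of repeated mentions (>= 0)
    """
    effective_scale = tau * (1.0 + eta * math.log(1.0 + n_m))
    return math.exp(-(t - r_m) / effective_scale)
```

With `n_m = 0` the expression reduces to plain exponential decay $\exp(-(t-r_m)/\tau)$; each additional consolidation stretches the effective time scale logarithmically, so the same elapsed time yields a higher robustness.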

Finally, all retrieved evidence is de\-duplicated and ranked into an evidence chain by jointly considering semantic similarity, temporal relevance, and memory robustness as follows:

$$\mathcal{F}(m,Q_k,t)=\theta_1 S(m,Q_k)+\theta_2 T(m,Q_k)+\theta_3 R(m,t), \tag{3}$$

where $\theta_1$, $\theta_2$, and $\theta_3$ are non-negative weights, and $t$ is the query time of $Q$. Evidence with higher $\mathcal{F}(m,Q_k,t)$ scores is prioritized when constructing the final evidence chain for answering $Q$.
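The de-duplication and ranking step with the score of Eq. (3) can be sketched as below. The input layout (tuples of an evidence id with precomputed $S$, $T$, and $R$ values), the `top_k` cutoff, and the function name are assumptions of this sketch; duplicates are resolved by keeping the first occurrence of each id.

```python
def rank_evidence(items, thetas, top_k=None):
    """De-duplicate evidence by id and rank it by the weighted score of Eq. (3).

    items:  iterable of (evidence_id, S, T, R) tuples with precomputed components.
    thetas: (theta1, theta2, theta3), non-negative weights.
    Returns a list of (evidence_id, score) sorted by descending score.
    """
    t1, t2, t3 = thetas
    scores = {}
    for evidence_id, s, t, r in items:
        if evidence_id in scores:        # drop repeated evidence, keep first occurrence
            continue
        scores[evidence_id] = t1 * s + t2 * t + t3 * r
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]
```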

3\) Generation Process. For each $Q_k$, H-Mem generates a sub-answer using its retrieved evidence. If $Q_k$ depends on other sub-queries, the answers to these prerequisite sub-queries are also used as additional context. Finally, H-Mem synthesizes all sub-query answers as $\Psi(\{\mathcal{A}_k\}_{k=1}^{K})$, where $\mathcal{A}_k$ denotes the answer of $Q_k$ and an LLM is invoked to complete the synthesis process $\Psi$.
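Answering sub-queries so that prerequisites come first is a topological-ordering problem. The sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+); the callback `answer_fn` stands in for the LLM call and, like the function name `answer_in_order`, is hypothetical. The final synthesis step $\Psi$ is not shown.

```python
from graphlib import TopologicalSorter


def answer_in_order(deps, answer_fn):
    """Answer sub-queries so that prerequisites are always answered first.

    deps:      dict mapping a sub-query id to the set of ids it depends on.
    answer_fn: callable (sub_query_id, context) -> answer, where context maps
               each prerequisite id to its already computed answer.
    Returns a dict mapping every sub-query id to its answer.
    """
    answers = {}
    # static_order() yields each node after all of its predecessors;
    # it raises graphlib.CycleError if the dependencies are cyclic.
    for qid in TopologicalSorter(deps).static_order():
        context = {p: answers[p] for p in deps.get(qid, ())}
        answers[qid] = answer_fn(qid, context)
    return answers
```

A stub such as `lambda qid, ctx: f"{qid}({','.join(sorted(ctx.values()))})"` is enough to check the ordering before plugging in a real model call.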

## 4 Experiments

We present the experimental setup in Section [4.1](https://arxiv.org/html/2605.15701#S4.SS1) and discuss the results in Sections [4.2](https://arxiv.org/html/2605.15701#S4.SS2) and [4.3](https://arxiv.org/html/2605.15701#S4.SS3).

### 4.1 Setup

Datasets. We evaluate H-Mem on three public long-term agent memory benchmarks: LoCoMo \[[12](https://arxiv.org/html/2605.15701#bib.bib15)\], LongMemEvalS \[[22](https://arxiv.org/html/2605.15701#bib.bib16)\], and REALTALK \[[8](https://arxiv.org/html/2605.15701#bib.bib17)\]. LoCoMo evaluates long-term conversational memory over ultra-long multi-session dialogues and contains 1,540 questions over 10 dialogues, covering single-hop, multi-hop, temporal, and open-domain questions. LongMemEvalS evaluates long-term interactive memory in assistant-style settings; in the S-setting, each conversation contains roughly 115K tokens, and the benchmark includes 500 questions spanning core capabilities. Both LoCoMo and LongMemEvalS are constructed in controlled LLM-simulated settings. In contrast, REALTALK is built from crowdsourced real-world human–human conversations, providing a more realistic testbed for persistent conversational memory. Together, these datasets cover both controlled and realistic settings, as well as diverse long-term memory demands.

Metrics. We consider two complementary evaluation metrics: F1 and LLM-Judge Accuracy. F1 is computed from token-level precision and recall between the predicted answer and the reference answer, and measures partial lexical overlap. Since long-term memory questions often admit semantically correct yet lexically diverse responses, we additionally report LLM-Judge Accuracy following prior long-term memory QA evaluation protocols \[[18](https://arxiv.org/html/2605.15701#bib.bib13)\], which marks a prediction as correct only if an LLM judge determines that it is semantically consistent with the gold answer. This combination allows us to evaluate both lexical-level answer overlap and semantic correctness.
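For reference, token-level F1 is commonly computed as in the sketch below. The tokenization here (lowercasing and whitespace splitting) is an assumption; the text does not state the exact answer normalization used in the experiments.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For instance, the prediction "the cat sat" against the reference "the cat" shares two tokens, giving precision 2/3, recall 1, and F1 of 0.8.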

Baselines and Configurations. We consider six representative agent memory methods that reflect different design choices for agent memory: MemoryOS \[[7](https://arxiv.org/html/2605.15701#bib.bib41)\], Mem0 \[[2](https://arxiv.org/html/2605.15701#bib.bib18)\], MemTree \[[20](https://arxiv.org/html/2605.15701#bib.bib38)\], MemOS \[[11](https://arxiv.org/html/2605.15701#bib.bib30)\], Zep \[[19](https://arxiv.org/html/2605.15701#bib.bib21)\], and EverMemOS \[[6](https://arxiv.org/html/2605.15701#bib.bib42)\], which are listed in Table [1](https://arxiv.org/html/2605.15701#S1.T1). To ensure a fair comparison, we evaluate H-Mem and all baselines with the same embedding model, re-ranking model, and LLM-as-a-judge prompt. This unified configuration reduces variance introduced by method-specific evaluators and makes LLM-Judge Accuracy directly comparable across systems.

Table 2: The F1 and LLM-Judge Accuracy (Acc.) for each question category on LoCoMo.

### 4.2 Main Results

We report F1 and LLM-Judge Accuracy on LoCoMo, LongMemEvalS, and REALTALK in Tables [2](https://arxiv.org/html/2605.15701#S4.T2), [3](https://arxiv.org/html/2605.15701#S4.T3), and [4](https://arxiv.org/html/2605.15701#S4.T4), respectively. Overall, H-Mem consistently outperforms the baselines, with the largest gains appearing on multi-hop and temporal questions.

On LoCoMo, H-Mem achieves the best overall F1 under both backbone LLMs and remains competitive in LLM-Judge Accuracy. The improvements on multi-hop and temporal categories indicate that graph-based expansion and temporal tree retrieval help locate dispersed and time-sensitive evidence. On LongMemEvalS, H-Mem shows stronger gains, especially on multi-session reasoning, temporal reasoning, and knowledge update, suggesting that the hybrid memory structure is effective for long histories and updated information. On REALTALK, H-Mem also obtains the best overall performance, showing that the proposed retrieval mechanism remains useful under noisier real-world conversations. These results demonstrate that combining temporal memory abstraction with entity-centered graph retrieval improves long-term memory QA.

Table 3: Overall results on LongMemEvalS, where SSU, MS, SSP, TR, KU, and SSA are the abbreviations of single\-session\-user, multi\-session, single\-session\-preference, temporal reasoning, knowledge update, and single\-session\-assistant, respectively\.

Table 4: Overall results on REALTALK\.

Besides QA accuracy, we also compare H\-Mem with the baselines on cost in both the offline indexing and online retrieval stages\. For offline indexing, we use three metrics: indexing time, indexing token cost, and index storage, where the first two measure the time and token cost of building the index, respectively, and the last reports the final index size\. As shown in Figure [2](https://arxiv.org/html/2605.15701#S4.F2)\(a\), H\-Mem has a moderate indexing time\. This indicates that constructing the hybrid structure does not make the offline indexing process prohibitively slow\. Figures [2](https://arxiv.org/html/2605.15701#S4.F2)\(b\)–\(c\) show that H\-Mem has relatively high indexing token cost and index storage\. Its indexing token cost is close to that of the high\-cost group, which includes MemoryOS, MemTree, and EverMemOS\. Its index storage is higher than that of most baselines, while still lower than that of EverMemOS\. This is expected because H\-Mem needs to construct memory events, memory summaries, entities, relations, and entity profiles, while maintaining both the temporal and semantic tree and the knowledge graph\. Therefore, the additional offline cost is consistent with the goal of modeling memory evolution and supporting multi\-hop retrieval\.

For online retrieval, we report retrieval latency and retrieval token cost, which measure the latency and token cost during retrieval, respectively\. As shown in Figure [2](https://arxiv.org/html/2605.15701#S4.F2)\(d\), the retrieval latency of H\-Mem remains lower than that of EverMemOS\. The latency overhead of H\-Mem mainly comes from query decomposition, graph exploration, bottom\-up tree search, and evidence re\-ranking\. In terms of retrieval token cost, Figure [2](https://arxiv.org/html/2605.15701#S4.F2)\(e\) shows that H\-Mem is higher than most baselines but lower than Zep\. This is mainly because H\-Mem retrieves evidence from both the tree and the graph, while deduplication and re\-ranking help control the final evidence chain\. Overall, H\-Mem introduces additional indexing and retrieval costs compared with several simpler memory mechanisms, but these costs remain reasonable considering its more expressive hybrid memory structure\.

![Refer to caption](https://arxiv.org/html/2605.15701v1/x3.png)

Figure 2: Cumulative indexing and retrieval costs of H\-Mem and baselines on LoCoMo, including indexing time, indexing token cost, index storage, retrieval latency, and retrieval token cost\.

### 4\.3 Ablation Study

We perform ablation studies to identify the contribution of key components in H\-Mem\. Specifically, w/o tree disables the tree and removes bottom\-up tree search; w/o graph disables entity nodes, relations, and multi\-hop entity expansion; w/o long\-term memory removes upper\-level memory summaries and only searches memory events and original memory fragments; w/o memory robustness removes the robustness term from evidence scoring; w/o missing\-info query disables follow\-up retrieval when the first\-pass evidence is insufficient; and w/o entity profile removes entity profiles\.

As shown in Table [5](https://arxiv.org/html/2605.15701#S4.T5), removing the tree causes the largest performance drop among all ablation variants\. This shows that the tree structure is the most critical component of H\-Mem, since it organizes memory data across different time windows and semantic granularities and supports bottom\-up retrieval\. The second largest drop comes from removing the graph\. This confirms that entity\-centered information and multi\-hop relationships are important complements to the tree\. The variant without long\-term memory also shows a clear performance drop, indicating that memory summaries are useful for preserving abstract and persistent information at retrieval\. The variants without memory robustness, missing\-information query, and entity profile also show consistent but smaller drops\. This indicates that repeated memory reinforcement, follow\-up retrieval for insufficient first\-pass evidence, and entity\-centered profiles all provide useful support for improving memory retrieval\.

Overall, the ablation results show that the performance gains of H\-Mem mainly come from the cooperation between the temporal and semantic tree and the entity\-centered knowledge graph\.

Table 5: Ablation study on LoCoMo\.

## 5 Conclusion

In this work, we present H\-Mem, a novel memory mechanism for evolving and retrieving agent memory via a hybrid structure\. Particularly, H\-Mem builds a temporal and semantic tree to organize memory data across different time windows and semantic granularities, where short\-term memory data can evolve progressively into long\-term memory data\. It also constructs a knowledge graph to capture entity\-centered information and multi\-hop relations\. To support the QA task, H\-Mem retrieves relevant evidence by exploiting the above hybrid structure\. Our method achieves state\-of\-the\-art performance on three public long\-term memory benchmarks, demonstrating its superiority over existing baselines for the QA task\. In the future, we will further improve H\-Mem to support multimodal memory data and explore its deployment in real\-world agent applications\.

## References

- \[1\] A\. Asai, Z\. Wu, Y\. Wang, A\. Sil, and H\. Hajishirzi \(2023\) Self\-rag: learning to retrieve, generate, and critique through self\-reflection\. arXiv preprint arXiv:2310\.11511\. External Links: [Link](https://arxiv.org/abs/2310.11511)\.
- \[2\] P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav \(2025\) Mem0: building production\-ready ai agents with scalable long\-term memory\. arXiv preprint arXiv:2504\.19413\. External Links: [Link](https://arxiv.org/abs/2504.19413)\.
- \[3\] H\. Ebbinghaus \(1913\) Memory: a contribution to experimental psychology\. Teachers College, Columbia University, New York\. Note: Original work published 1885\.
- \[4\] D\. Edge, H\. Trinh, N\. Cheng, J\. Bradley, A\. Chao, A\. Mody, S\. Truitt, D\. Metropolitansky, R\. O\. Ness, and J\. Larson \(2024\) From local to global: a graph rag approach to query\-focused summarization\. arXiv preprint arXiv:2404\.16130\. External Links: [Link](https://arxiv.org/abs/2404.16130)\.
- \[5\] B\. J\. Gutiérrez, Y\. Shu, Y\. Gu, M\. Yasunaga, and Y\. Su \(2024\) HippoRAG: neurobiologically inspired long\-term memory for large language models\. In Advances in Neural Information Processing Systems\. External Links: [Link](https://arxiv.org/abs/2405.14831)\.
- \[6\] C\. Hu, X\. Gao, Z\. Zhou, D\. Xu, Y\. Bai, X\. Li, H\. Zhang, T\. Li, C\. Zhang, L\. Bing, and Y\. Deng \(2026\) EverMemOS: a self\-organizing memory operating system for structured long\-horizon reasoning\. arXiv preprint arXiv:2601\.02163\. External Links: [Link](https://arxiv.org/abs/2601.02163)\.
- \[7\] J\. Kang, M\. Ji, Z\. Zhao, and T\. Bai \(2025\) Memory os of ai agent\. arXiv preprint arXiv:2506\.06326\. External Links: [Link](https://arxiv.org/abs/2506.06326)\.
- \[8\] D\. Lee, A\. Maharana, J\. Pujara, X\. Ren, and F\. Barbieri \(2025\) REALTALK: a 21\-day real\-world dataset for long\-term conversation\. arXiv preprint arXiv:2502\.13270\. External Links: [Link](https://arxiv.org/abs/2502.13270)\.
- \[9\] M\. Lee, S\. An, and M\. Kim \(2024\-06\) PlanRAG: a plan\-then\-retrieval augmented generation for generative large language models as decision makers\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), Mexico City, Mexico, pp\. 6537–6555\. External Links: [Link](https://aclanthology.org/2024.naacl-long.364/), [Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.364)\.
- \[10\] P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela \(2020\) Retrieval\-augmented generation for knowledge\-intensive nlp tasks\. In Advances in Neural Information Processing Systems, Vol\. 33\. External Links: [Link](https://arxiv.org/abs/2005.11401)\.
- \[11\] Z\. Li, C\. Xi, C\. Li, D\. Chen, B\. Chen, S\. Song, S\. Niu, H\. Wang, J\. Yang, C\. Tang, Q\. Yu, J\. Zhao, Y\. Wang, P\. Liu, Z\. Lin, P\. Wang, J\. Huo, T\. Chen, K\. Chen, K\. Li, Z\. Tao, H\. Lai, H\. Wu, B\. Tang, Z\. Wang, Z\. Fan, N\. Zhang, L\. Zhang, J\. Yan, M\. Yang, T\. Xu, W\. Xu, H\. Chen, H\. Wang, H\. Yang, W\. Zhang, Z\. J\. Xu, S\. Chen, and F\. Xiong \(2025\) MemOS: a memory os for ai system\. arXiv preprint arXiv:2507\.03724\. External Links: [Link](https://arxiv.org/abs/2507.03724)\.
- \[12\] A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang \(2024\) Evaluating very long\-term conversational memory of llm agents\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics\. External Links: [Link](https://aclanthology.org/2024.acl-long.747/)\.
- \[13\] Manus AI \(2025\) Manus: experience ai that acts\. Note: [https://manus\.is/](https://manus.is/) Accessed: 2026\-04\-27\.
- \[14\] J\. L\. McGaugh \(2000\) Memory–a century of consolidation\. Science 287\(5451\), pp\. 248–251\.
- \[15\] OpenAI \(2024\) GPT\-4o mini: advancing cost\-efficient intelligence\. Note: [https://openai\.com/index/gpt\-4o\-mini\-advancing\-cost\-efficient\-intelligence/](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) Accessed: 2026\-05\-04\.
- \[16\] OpenAI \(2025\) Introducing gpt\-4\.1 in the api\. Note: [https://openai\.com/index/gpt\-4\-1/](https://openai.com/index/gpt-4-1/) Accessed: 2026\-05\-04\.
- \[17\] OpenClaw Contributors \(2026\) OpenClaw: personal ai assistant\. Note: [https://github\.com/openclaw/openclaw](https://github.com/openclaw/openclaw) Accessed: 2026\-04\-27\.
- \[18\] C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez \(2023\) MemGPT: towards llms as operating systems\. arXiv preprint arXiv:2310\.08560\. External Links: [Link](https://arxiv.org/abs/2310.08560)\.
- \[19\] P\. Rasmussen, P\. Paliychuk, T\. Beauvais, J\. Ryan, and D\. Chalef \(2025\) Zep: a temporal knowledge graph architecture for agent memory\. arXiv preprint arXiv:2501\.13956\. External Links: [Link](https://arxiv.org/abs/2501.13956)\.
- \[20\] A\. Rezazadeh, Z\. Li, W\. Wei, and Y\. Bao \(2025\) From isolated conversations to hierarchical schemas: dynamic tree memory representation for llms\. In International Conference on Learning Representations\.
- \[21\] L\. R\. Squire and P\. Alvarez \(1995\) Retrograde amnesia and memory consolidation: a neurobiological perspective\. Current Opinion in Neurobiology 5\(2\), pp\. 169–177\.
- \[22\] D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu \(2025\) LongMemEval: benchmarking chat assistants on long\-term interactive memory\. In The Thirteenth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=pZiyCaVuti)\.
- \[23\] Y\. Wu, T\. Lin, Y\. Zhou, F\. Zhang, Q\. Guo, X\. Zhou, S\. Wang, X\. Liu, Y\. Ma, and Y\. Fang \(2026\) Memory in the llm era: modular architectures and strategies in a unified framework\. External Links: 2604\.01707, [Link](https://arxiv.org/abs/2604.01707)\.
- \[24\] D\. Xu, Y\. Wen, P\. Jia, Y\. Zhang, W\. Zhang, Y\. Wang, H\. Guo, R\. Tang, X\. Zhao, E\. Chen, and T\. Xu \(2025\) From single to multi\-granularity: toward long\-term memory association and selection of conversational agents\. arXiv preprint arXiv:2505\.19549\. External Links: [Link](https://arxiv.org/abs/2505.19549)\.
- \[25\] W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang \(2025\) A\-mem: agentic memory for llm agents\. arXiv preprint arXiv:2502\.12110\. External Links: [Link](https://arxiv.org/abs/2502.12110)\.
- \[26\] Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, F\. Huang, and J\. Zhou \(2025\) Qwen3 embedding: advancing text embedding and reranking through foundation models\. arXiv preprint arXiv:2506\.05176\. External Links: [Link](https://arxiv.org/abs/2506.05176)\.
- \[27\] W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang \(2024\) MemoryBank: enhancing large language models with long\-term memory\. Proceedings of the AAAI Conference on Artificial Intelligence 38\(17\), pp\. 19724–19731\. External Links: [Document](https://dx.doi.org/10.1609/aaai.v38i17.29946)\.
- \[28\] Y\. Zhou, Y\. Su, Y\. Sun, S\. Wang, T\. Wang, R\. He, Y\. Zhang, S\. Liang, X\. Liu, Y\. Ma, et al\. \(2025\) In\-depth analysis of graph\-based rag in a unified framework\. arXiv preprint arXiv:2503\.04338\. External Links: [Link](https://arxiv.org/abs/2503.04338)\.

## Appendix A Experiment Details

### A\.1 Dataset and Index Statistics

This subsection provides supplementary corpus statistics and index statistics for the benchmarks used in the main paper\. Table [6](https://arxiv.org/html/2605.15701#A1.T6) summarizes the basic properties of the three datasets, including dataset scale, average conversation length, construction setting, and data source\.

Table 6: Dataset details\.

The three datasets cover complementary evaluation settings\. LoCoMo provides dense question supervision over long multi\-session dialogues, while LongMemEvalS contains substantially longer conversational histories and is therefore more challenging for long\-context memory retrieval\. REALTALK further complements the two LLM\-simulated benchmarks with real\-world human–human conversations, making it useful for evaluating robustness under noisier conversational conditions\.

Table [7](https://arxiv.org/html/2605.15701#A1.T7) reports the average index statistics after offline indexing, including raw conversation size, fragment count, hierarchical event count, and entity\-graph statistics\.

Table 7: Average index statistics per conversation after offline indexing\.

The index statistics show that different benchmarks stress different aspects of long\-term memory\. LongMemEvalS produces the largest event hierarchy and entity graph, reflecting its longer conversational histories and more complex cross\-session dependencies\. REALTALK has the largest number of fragments on average, which is consistent with the greater heterogeneity and noisiness of real\-world conversations\. Across all datasets, H\-Mem expands raw fragments into multi\-level events and entity graphs, enabling retrieval at both temporal and entity\-centric granularities\.

### A\.2 Implementation Details

Model configuration\. We evaluate H\-Mem with GPT\-4o\-mini \[[15](https://arxiv.org/html/2605.15701#bib.bib44)\] and GPT\-4\.1\-mini \[[16](https://arxiv.org/html/2605.15701#bib.bib45)\] as backbone LLMs\. For semantic retrieval and evidence reranking, the default configuration uses Qwen3\-Embedding\-4B and Qwen3\-Reranker\-4B \[[26](https://arxiv.org/html/2605.15701#bib.bib32)\]\. We also evaluate a lighter configuration with Qwen3\-Embedding\-0\.6B and Qwen3\-Reranker\-0\.6B \[[26](https://arxiv.org/html/2605.15701#bib.bib32)\] in the hyperparameter analysis\. All experiments are conducted on a Linux server equipped with an Intel Xeon 2\.0GHz CPU, 1024GB of memory, and 8 NVIDIA GeForce RTX A5000 GPUs, each with 24GB of VRAM\.

Baseline configuration\. For each baseline, we follow its original memory organization and retrieval design as closely as possible\. To reduce confounding factors, all baselines are evaluated under the same experimental framework\. When semantic retrieval or reranking is required, all methods use the same embedding and reranking configurations\. All methods are evaluated with the same answer simplification, normalization, F1 computation, and LLM\-Judge protocol\.

Tree Index Hyperparameter\. For the temporal\-semantic tree, we use four levels corresponding to day\-, week\-, month\-, and year\-level memory organization\. The leaf level stores fine\-grained memory events, while upper levels store consolidated summaries over longer temporal windows\. Specifically, $\beta_{l}$ denotes the temporal window size at level $l$, and $\alpha_{l}$ denotes the semantic clustering threshold for consolidation within the corresponding temporal window\. By default, we use day, week, month, and year as the temporal windows from L1 to L4, and set the consolidation thresholds of L2, L3, and L4 to $0.8$, $0.7$, and $0.6$, respectively\. The threshold gradually decreases at higher levels because higher\-level summaries are expected to capture more abstract and persistent memory patterns\. We also tested two alternative threshold schedules: a conservative schedule $(0.9, 0.8, 0.7)$ and an aggressive schedule $(0.7, 0.6, 0.5)$\. The conservative schedule retains more fine\-grained nodes but makes the upper\-level index more fragmented, while the aggressive schedule yields a more compact index but risks over\-consolidating heterogeneous memories\. To avoid unnecessary high\-level consolidation when the memory history is short, the maximum active level is determined by the memory age: histories shorter than 7 days activate day and week levels; histories between 7 and 30 days activate day, week, and month levels; otherwise, all four levels are activated\.
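The level-activation rule described above can be sketched in Python as follows. This is an illustrative sketch, not the H-Mem implementation: the names `LEVELS`, `DEFAULT_THRESHOLDS`, and `active_levels` are hypothetical, and treating a history of exactly 30 days as part of the three-level case is an assumption, since the text does not specify the boundary.

```python
# Hypothetical names; only the numeric defaults come from the text above.
LEVELS = ["day", "week", "month", "year"]  # L1..L4 temporal windows

# Default consolidation thresholds for L2, L3, and L4.
DEFAULT_THRESHOLDS = {"week": 0.8, "month": 0.7, "year": 0.6}


def active_levels(history_days: float) -> list[str]:
    """Return the tree levels activated for a memory history of this age."""
    if history_days < 7:
        return LEVELS[:2]   # day and week
    if history_days <= 30:  # assumption: the 30-day boundary is inclusive
        return LEVELS[:3]   # day, week, and month
    return LEVELS           # all four levels


print(active_levels(3))
print(active_levels(45))
```

Under this sketch, a 3-day history activates only the day and week levels, while a 45-day history activates all four.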

Memory robustness and scoring hyperparameter\. For evidence ranking, H\-Mem combines semantic similarity, temporal alignment, and memory robustness\. The event\-level relevance score is computed as

$$s(m,q)=w_{\mathrm{sem}}\cdot\mathrm{sim}(m,q)+w_{\mathrm{time}}\cdot\mathrm{time}(m,q)+w_{\mathrm{mem}}\cdot R(m,t),$$

where $\mathrm{sim}(m,q)$ is the semantic similarity between memory $m$ and query $q$, $\mathrm{time}(m,q)$ measures temporal alignment when an explicit temporal hint is available, and $R(m,t)$ denotes the memory robustness score\. We set $w_{\mathrm{sem}}=0.70$, $w_{\mathrm{time}}=0.15$, and $w_{\mathrm{mem}}=0.15$ by default\. When no explicit temporal hint is available, the temporal alignment term is set to zero\.
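As a worked illustration of the weighted score, the sketch below combines three precomputed component scores with the default weights. The function name `relevance_score` and its arguments are hypothetical, and it assumes each component already lies in $[0,1]$, which the text does not state explicitly.

```python
def relevance_score(sim: float, time_align: float, robustness: float,
                    has_temporal_hint: bool = True,
                    w_sem: float = 0.70, w_time: float = 0.15,
                    w_mem: float = 0.15) -> float:
    """Weighted sum of semantic, temporal, and robustness scores.

    Assumes all three inputs are precomputed and lie in [0, 1].
    """
    if not has_temporal_hint:
        # Without an explicit temporal hint, the temporal term is zero.
        time_align = 0.0
    return w_sem * sim + w_time * time_align + w_mem * robustness


with_hint = relevance_score(0.8, 1.0, 0.5)
without_hint = relevance_score(0.8, 1.0, 0.5, has_temporal_hint=False)
print(with_hint, without_hint)
```

With these inputs the score is $0.7\cdot0.8+0.15\cdot1.0+0.15\cdot0.5=0.785$ when a temporal hint is present and $0.635$ otherwise, so the semantic term dominates the ranking under the default weights.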

The memory robustness score follows an Ebbinghaus\-style decay with reinforcement:

$$R(m,t)=\exp\left(-\frac{t-r_{m}}{\tau\left(1+\eta\log(1+n_{m})\right)}\right),$$

where $r_{m}$ is the latest reinforcement timestamp of memory $m$, $n_{m}$ is the number of reinforcements, $\eta$ controls the reinforcement effect, and $\tau$ is the decay time scale\. We set $\tau=365$ days and $\eta=0.5$ by default\. In implementation, both $t-r_{m}$ and $\tau$ are converted to seconds\. Thus, when $\tau=365$ days, an unreinforced memory after one year has

$$R=\exp(-1)\approx 36.8\%,$$

corresponding to a decay of about $63.2\%$\. This choice is reasonable for long\-term agent memory because many user preferences, relationships, and recurring facts should remain retrievable over a year\-scale horizon rather than being rapidly forgotten\. At the same time, the robustness term is only a weak ranking prior with $w_{\mathrm{mem}}=0.15$, so it does not dominate semantic relevance or explicit temporal matching\. Repeatedly reinforced memories decay more slowly through the factor $1+\eta\log(1+n_{m})$, which allows stable and recurring memories to remain more salient during retrieval\.
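The decay formula can be checked numerically with a short sketch. The function name `robustness`, its argument names, and the clamping of negative elapsed time to zero are assumptions made for illustration; the defaults $\tau=365$ days and $\eta=0.5$, the natural logarithm, and the conversion to seconds follow the description above.

```python
import math

SECONDS_PER_DAY = 86_400


def robustness(now_s: float, last_reinforced_s: float, n_reinforcements: int,
               tau_days: float = 365.0, eta: float = 0.5) -> float:
    """Ebbinghaus-style decay with reinforcement; timestamps are in seconds."""
    # Assumption: a reinforcement timestamp in the future counts as zero elapsed time.
    elapsed_s = max(0.0, now_s - last_reinforced_s)
    tau_s = tau_days * SECONDS_PER_DAY
    # More reinforcements stretch the effective time scale, slowing the decay.
    stretch = 1.0 + eta * math.log(1.0 + n_reinforcements)
    return math.exp(-elapsed_s / (tau_s * stretch))


one_year_s = 365 * SECONDS_PER_DAY
print(round(robustness(one_year_s, 0.0, 0), 3))  # unreinforced memory after one year
print(robustness(one_year_s, 0.0, 3) > robustness(one_year_s, 0.0, 0))
```

The first call reproduces the $\exp(-1)\approx 0.368$ value stated above, and the second shows that a memory reinforced three times retains a higher robustness score after the same elapsed time.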

Entity extraction and disambiguation\. For graph construction, H\-Mem processes each memory fragment through entity and relation extraction, entity resolution, and relation insertion\. Specifically, H\-Mem uses an LLM\-based information extraction prompt to extract entities and relations from each memory fragment\. Each extracted entity is represented as a structured record containing its surface name, entity type, optional text span, role, salience score, and auxiliary metadata\. The entity type is normalized into a predefined type set, including person, organization, location, event, product, work, date, time, and other\. Each extracted relation contains a source entity, a target entity, a relation label, a confidence score, and an optional text span\. If LLM\-based entity extraction fails or returns no valid entity, H\-Mem falls back to spaCy NER to obtain a best\-effort entity set\.

After extraction, the extracted entities are normalized and resolved into entity nodes\. The normalization step lower\-cases entity names, removes unnecessary punctuation, normalizes whitespace, maps entity types into the predefined type set, and optionally uses lemmatization\. For each newly extracted entity, H\-Mem first checks whether it exactly matches an existing entity node with a compatible entity type\. If no exact match is found, the entity is compared with existing entity names and aliases using token overlap and fuzzy string matching\. If the entity satisfies the matching criteria, it is merged into the existing entity node, and its surface name is stored as an alias when applicable\. Otherwise, a new entity node is created\. Each resolved entity node is then linked to the original memory fragment containing it, preserving provenance for later evidence verification\.
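The resolution order described above (exact match, then token overlap and fuzzy matching, then node creation) can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' code: all function names are hypothetical, type compatibility is reduced to type equality, token overlap is measured with Jaccard similarity, fuzzy matching uses `difflib.SequenceMatcher` from the Python standard library, the two criteria are combined with a logical OR, and the thresholds `overlap_min` and `fuzzy_min` are placeholder values. Lemmatization and provenance links are omitted.

```python
import re
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the whitespace tokens of two names."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def resolve_entity(name, etype, nodes, overlap_min=0.6, fuzzy_min=0.9):
    """Return the node for `name`, merging into an existing node when possible.

    `nodes` is a list of dicts with keys 'name', 'type', and 'aliases' (a set).
    """
    norm = normalize_name(name)
    same_type = [n for n in nodes if n["type"] == etype]  # simplified compatibility
    # Step 1: exact match on the canonical name or a stored alias.
    for node in same_type:
        if norm == node["name"] or norm in node["aliases"]:
            return node
    # Step 2: token overlap or fuzzy string match (placeholder thresholds).
    for node in same_type:
        for cand in [node["name"], *sorted(node["aliases"])]:
            fuzzy = SequenceMatcher(None, norm, cand).ratio()
            if token_overlap(norm, cand) >= overlap_min or fuzzy >= fuzzy_min:
                node["aliases"].add(norm)  # keep the new surface form as an alias
                return node
    # Step 3: no match, so create a new entity node.
    node = {"name": norm, "type": etype, "aliases": set()}
    nodes.append(node)
    return node


nodes = []
first = resolve_entity("Alice Johnson", "person", nodes)
typo = resolve_entity("Alice Johnsonn", "person", nodes)
print(typo is first, sorted(first["aliases"]))
```

In this sketch, a misspelled variant is merged into the existing node and stored as an alias, while the same surface name with a different entity type would create a separate node.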

We distinguish entity merging from graph repair\. Entity merging indicates that two extracted entities are resolved as the same real\-world entity and therefore share one entity node\. Graph repair does not merge nodes or modify aliases\. Instead, for short single\-token names or nickname\-like variants that remain as separate nodes, H\-Mem may add an overlap edge based on prefix/suffix matching\. This edge is used only to improve graph traversal recall during retrieval and does not indicate identity equivalence\.
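A prefix/suffix test of the kind used for graph repair might look like the sketch below. The function name `is_overlap_candidate` and the minimum length `min_len` are hypothetical; the text does not specify the exact matching rule, so this shows only one plausible reading in which the shorter name must be a single token.

```python
def is_overlap_candidate(a: str, b: str, min_len: int = 3) -> bool:
    """Decide whether two separate entity nodes could get an overlap edge."""
    short, long_ = sorted((a.lower().strip(), b.lower().strip()), key=len)
    # Identical names are a merge question, not a repair question; the shorter
    # name must be a single token of at least `min_len` characters (assumption).
    if short == long_ or " " in short or len(short) < min_len:
        return False
    return long_.startswith(short) or long_.endswith(short)


print(is_overlap_candidate("Sam", "Samantha"))    # prefix match
print(is_overlap_candidate("Beth", "Elizabeth"))  # suffix match
print(is_overlap_candidate("Al", "Alice"))        # too short under min_len=3
```

A positive result here would only justify adding a traversal edge between the two nodes; consistent with the distinction drawn above, it would not merge them or alter their aliases.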

Finally, the extracted relations are mapped to resolved entity nodes before being inserted into the graph\. That is, the source and target entities of each relation are first resolved to their corresponding entity nodes\. The relation is then inserted as an entity–entity edge with its relation label, confidence weight, timestamp, and supporting evidence\. If the same relation between the same resolved entities already exists, H\-Mem merges the supporting evidence instead of creating a duplicate edge\. Since the current implementation does not rely on calibrated extraction confidence for pruning, it uses conservative type\-compatible matching, bounded fuzzy matching, and provenance links to reduce the impact of extraction noise\. During retrieval, the provenance links allow entity and relation evidence to be verified against the original memory fragments before answer generation\.
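The duplicate-free insertion step can be illustrated with a small in-memory graph keyed by (source, label, target). The `Edge` class, the `insert_relation` function, and the dictionary representation are hypothetical; on a duplicate, this sketch merges only the supporting evidence and leaves the stored confidence and timestamp unchanged, since the text does not say how those fields are updated.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    label: str
    confidence: float
    timestamp: float
    evidence: set = field(default_factory=set)  # IDs of supporting fragments


def insert_relation(graph: dict, src_id: str, dst_id: str, label: str,
                    confidence: float, timestamp: float, fragment_id: str) -> Edge:
    """Insert an entity-entity edge, merging evidence if it already exists."""
    key = (src_id, label, dst_id)  # endpoints are already-resolved node IDs
    edge = graph.get(key)
    if edge is None:
        edge = Edge(label, confidence, timestamp, {fragment_id})
        graph[key] = edge
    else:
        # Same relation between the same nodes: merge provenance only.
        edge.evidence.add(fragment_id)
    return edge


graph = {}
insert_relation(graph, "alice", "bob", "works_with", 0.9, 1.0, "frag-1")
edge = insert_relation(graph, "alice", "bob", "works_with", 0.8, 2.0, "frag-2")
print(len(graph), sorted(edge.evidence))
```

Keeping fragment IDs on each edge is what would let relation evidence be traced back to the original memory fragments at retrieval time.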

### A\.3 Evaluation Details

We provide the implementation details for the evaluation protocol used in the main paper\.

Answer Simplification\. Before lexical evaluation, the predicted answer is simplified into a short factual form\. This step reduces verbosity artifacts in generative answers and makes token\-overlap metrics more faithful to semantic correctness\. Answer simplification is only used for lexical F1 evaluation and does not affect the generated answer used by the LLM judge\. The full answer simplification prompt is provided in Appendix [C\.2](https://arxiv.org/html/2605.15701#A3.SS2)\.

Normalization and F1 Computation\. After simplification, both the prediction and the reference answer are normalized before F1 computation\. The normalization pipeline includes Unicode normalization, lower\-casing, punctuation and article removal, and whitespace collapsing\. Let $\hat{A}$ denote the normalized prediction and $A$ denote the normalized reference answer\. After tokenization, we compute token\-level precision, recall, and F1 as

$$P=\frac{|\mathrm{Tok}(\hat{A})\cap\mathrm{Tok}(A)|}{|\mathrm{Tok}(\hat{A})|},\quad R=\frac{|\mathrm{Tok}(\hat{A})\cap\mathrm{Tok}(A)|}{|\mathrm{Tok}(A)|},\quad\mathrm{F1}=\frac{2PR}{P+R},$$

where $\mathrm{Tok}(\cdot)$ denotes the token multiset after normalization, and $\cap$ denotes multiset intersection\. If both precision and recall are zero, the F1 score is set to zero\.
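A minimal sketch of this metric is given below, using `collections.Counter` for the multiset intersection. The function names are hypothetical, and three details are assumptions because the text does not fix them: NFKC as the Unicode normalization form, ASCII punctuation removal via `string.punctuation`, and whitespace tokenization.

```python
import re
import string
import unicodedata
from collections import Counter


def normalize_answer(text: str) -> str:
    """Unicode-normalize, lower-case, drop punctuation and articles, collapse spaces."""
    text = unicodedata.normalize("NFKC", text).lower()  # NFKC is an assumption
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 with multiset intersection between prediction and reference."""
    pred = normalize_answer(prediction).split()
    ref = normalize_answer(reference).split()
    common = sum((Counter(pred) & Counter(ref)).values())  # multiset overlap
    if common == 0:
        return 0.0  # covers the case where precision and recall are both zero
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("The cat sat.", "a cat sat"))
print(token_f1("Paris, France", "Paris"))
```

The first pair normalizes to identical token lists and scores 1.0. The second has precision 0.5 and recall 1.0, giving an F1 of 2/3, which illustrates how a verbose prediction is penalized and why answers are simplified before lexical scoring.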

LLM\-Judge Protocol\. For answer evaluation, we follow the LLM\-as\-a\-judge prompt used in MemGPT \[[18](https://arxiv.org/html/2605.15701#bib.bib13)\], where the judge is given the question, the gold answer, and the predicted answer, and returns CORRECT or WRONG under a semantically tolerant criterion\. The same judging prompt is used for all methods to ensure direct comparability\. The full LLM\-Judge prompt is provided in Appendix [C\.2](https://arxiv.org/html/2605.15701#A3.SS2)\.

### A\.4 Additional Experiments

We provide additional analyses of H\-Mem on retrieval hyperparameters and retrieval planning behaviors, including the retriever scale, memory\-scope distribution, and missing\-information query behavior\.

![Refer to caption](https://arxiv.org/html/2605.15701v1/x4.png)

Figure 3: Sensitivity to the top\-$k$ retrieval budget on LoCoMo\.

∙ Effect of top\-$k$\. We analyze the sensitivity of H\-Mem to the top\-$k$ retrieval budget on LoCoMo\. In our implementation, $k$ controls the budget for entity\-related fragment retrieval and memory event retrieval\. Figure [3](https://arxiv.org/html/2605.15701#A1.F3) reports the F1 and LLM\-Judge Accuracy under different top\-$k$ settings\. Overall, H\-Mem remains stable across a moderate range of $k$\. Increasing $k$ provides a larger candidate pool and improves F1, suggesting that more candidate evidence helps recover useful supporting information\. Accuracy also improves as $k$ increases from 5 to 30, but slightly decreases at $k=50$, indicating that overly large candidate pools may introduce redundant or noisy context\. Therefore, we use the setting that balances F1 and LLM\-Judge Accuracy as the default configuration in the main experiments\.

∙ Effect of retrieval model components\. Table [8](https://arxiv.org/html/2605.15701#A1.T8) compares different retrieval model configurations, including an embedding\-only setting, a lighter 0\.6B embedding/reranker pair, and a stronger 4B embedding/reranker pair\. Compared with the embedding\-only setting, adding the reranker improves both F1 and LLM\-Judge Accuracy, showing that reranking helps prioritize more useful evidence for downstream reasoning\. Notably, the embedding\-only setting also performs well\. This is because H\-Mem retrieves over clean and atomic memory events rather than raw conversational chunks\. These extracted events contain less irrelevant dialogue context and have clearer semantic boundaries, which makes embedding similarity more reliable for matching queries to candidate evidence\. The stronger 4B configuration further achieves slightly higher F1 and accuracy than the 0\.6B configuration, while the lighter model pair remains competitive\. This indicates that H\-Mem benefits from stronger retrieval models, but its performance is mainly supported by the proposed memory structure rather than by retrieval model scale alone\.

Table 8: Sensitivity to embedding and reranker models on LoCoMo\.

∙ Memory scope distribution\. We analyze the memory scopes predicted by the retrieval planner\. For each sub\-query, the planner selects one scope from Short, Long, and Mixed, which determines the memory levels used during tree\-based retrieval\. Short is mainly used for moment\-specific evidence, such as concrete events, recent actions, or temporally localized facts\. Long is used when the query requires stable or persistent memory, such as long\-term preferences, relationships, or recurring facts\. Mixed is selected when both fine\-grained situational evidence and higher\-level memory summaries are needed\. Figure [4](https://arxiv.org/html/2605.15701#A1.F4) reports the distribution of predicted memory scopes\.

![Refer to caption](https://arxiv.org/html/2605.15701v1/x5.png)

Figure 4: Distribution of memory scopes predicted by the retrieval planner\.

The distribution shows that the planner does not rely on a fixed retrieval granularity\. Instead, it adaptively selects the retrieval scope according to the information need of each sub\-query\. Across the three datasets, Short remains the most frequently selected scope, while Long also accounts for a substantial portion of sub\-queries, especially on LoCoMo and REALTALK\. This indicates that long\-term conversational QA requires both moment\-specific evidence and stable reusable memory\. We further compare the adaptive scope strategy with a fixed Mixed retrieval policy, where every sub\-query retrieves both fine\-grained events and higher\-level summaries\. The fixed Mixed policy also achieves comparable QA performance, with an accuracy of 92\.21%\. This shows that retrieving both fine\-grained events and higher\-level summaries can cover most required evidence\. However, the fixed Mixed policy uses a much larger evidence context and incurs about 1\.8× the retrieval token cost of the planner\-based strategy\. Therefore, the main advantage of memory scope prediction is retrieval efficiency: it preserves comparable answer quality while reducing token usage by adaptively selecting Short, Long, or Mixed for each sub\-query\.

∙ Missing-information query analysis. We further analyze how often H-Mem triggers missing-information queries. A missing-information query is generated only when the first-pass evidence is insufficient to answer a sub-query with the required specificity. Typical cases include unresolved entities, vague references, missing temporal anchors, underspecified event descriptions, and multiple plausible candidates. Table [9](https://arxiv.org/html/2605.15701#A1.T9) reports the triggering statistics of missing-information queries.

Table 9: Triggering statistics of missing-information queries.

Instead of simply increasing the retrieval depth for all queries, H-Mem performs targeted follow-up retrieval only when an evidence gap is detected. The generated missing-information query is required to ask for the missing slot rather than paraphrasing the original sub-query, and it must contain at least one concrete anchor from the first-pass evidence. This strategy helps recover missing bridge evidence while avoiding unnecessary retrieval cost for queries that can already be answered from the first-pass evidence.
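The two constraints on a missing-information query (it contains a concrete anchor, and it is not a paraphrase) can be illustrated with a small post-hoc check. This is only a sketch: the paper does not say how these constraints are enforced, and they may be enforced purely through the prompt. The function name `is_valid_followup`, the token-level Jaccard overlap used as a crude paraphrase proxy, and the `max_overlap` threshold are all assumptions introduced here.

```python
import re
from typing import Iterable, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_valid_followup(
    sub_query: str,
    followup: str,
    anchors: Iterable[str],
    max_overlap: float = 0.9,  # hypothetical threshold
) -> bool:
    """Check a follow-up query against the two stated constraints.

    anchors: concrete phrases taken from first-pass evidence.
    """
    followup_lower = followup.lower()
    has_anchor = any(a.lower() in followup_lower for a in anchors)

    q, f = _tokens(sub_query), _tokens(followup)
    union = q | f
    # Jaccard overlap of token sets; 1.0 means identical vocabularies.
    overlap = len(q & f) / len(union) if union else 1.0
    return has_anchor and overlap < max_overlap


sub_query = "Which popular music composer's tunes does Tim enjoy playing on the piano?"
followup = ("Which popular music composer created the theme from the movie "
            "that Tim enjoys playing on the piano?")
print(is_valid_followup(sub_query, followup, anchors=["theme from the movie"]))
```

The example strings are taken from case study (B) in Appendix B. A lexical overlap test cannot detect paraphrases that use different words, so a real system would likely need a semantic check instead.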

### A.5 Controlled Stability Analysis

To examine the stability of H-Mem, we conduct three controlled repeated evaluations on representative main settings, including LoCoMo, LongMemEval$_S$, and REALTALK. Since H-Mem is a retrieval-based memory system and does not train a task-specific neural model, these repeated evaluations are used to measure evaluation-pipeline stability rather than training-run variability. To reduce accidental variation introduced by the planning stage, we reuse the same planner outputs across the three runs, including the decomposed sub-queries and predicted memory scopes. In addition, all LLM-based components are evaluated with temperature set to 0.

For each setting, we report the LLM-Judge Accuracy and F1 score of each run, together with the mean and sample standard deviation across the three runs. The reported error range corresponds to $\pm 1$ sample standard deviation.
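As a brief illustration of the reported statistics, the sketch below computes the mean and the sample standard deviation (with the $n-1$ denominator) over repeated runs. The function name `summarize_runs` and the three scores are placeholders chosen for the example; they are not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(scores) < 2:
        raise ValueError("sample standard deviation needs at least two runs")
    # statistics.stdev uses the n-1 (sample) denominator.
    return mean(scores), stdev(scores)


# Placeholder scores for three runs, NOT values from the paper.
run_scores = [90.0, 91.0, 92.0]
avg, sd = summarize_runs(run_scores)
print(f"{avg:.2f} +/- {sd:.2f}")
```

With only three runs, the sample standard deviation is a coarse estimate of spread, which suits the stated goal of checking pipeline stability rather than estimating a population variance.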

Table 10: Controlled stability analysis over three repeated evaluations. We reuse the same planner outputs across runs and set the temperature of all LLM-based components to 0. We report the F1 score and LLM-Judge Accuracy of each run, together with the mean and sample standard deviation. The error range corresponds to $\pm 1$ sample standard deviation.

The results show that H-Mem remains stable across repeated evaluations. The standard deviations are small for both LLM-Judge Accuracy and F1, indicating that the reported improvements are not mainly caused by incidental evaluation fluctuations. This is consistent with the controlled evaluation setup, where the planner outputs are fixed and deterministic decoding is used for all LLM-based components.

### A.6 Limitations

Although H-Mem improves long-term conversational memory retrieval, it still has several limitations. First, H-Mem relies on LLM-based memory construction and retrieval planning. Therefore, its performance can be affected by imperfect memory-unit extraction, inaccurate query decomposition, and incorrect memory-scope prediction. Second, the hybrid tree-graph memory structure introduces additional offline indexing cost and storage overhead compared with simpler fragment-level memory systems. Although the online retrieval latency remains moderate, the indexing stage may become more expensive for extremely long or frequently updated conversations. Third, our experiments are conducted on existing long-term memory benchmarks. While these benchmarks cover simulated and real-world conversational settings, further evaluation is needed in deployed interactive agents, longer usage periods, and broader real-world domains. Finally, long-term memory systems may store and retrieve user-specific information, which introduces potential privacy risks such as unintended retention, sensitive inference, or inappropriate memory reuse. Practical deployments should therefore include explicit user consent, memory editing and deletion mechanisms, access control, and transparent memory auditing.

### A.7 Assets and Licenses

We use existing benchmarks, baselines, and model APIs or publicly available models in our experiments. The datasets and baselines are cited in the main paper and used only for research evaluation. For model usage, we follow the corresponding API terms of service or model usage terms. We do not redistribute any third-party datasets, model weights, or proprietary model outputs as new assets.

## Appendix B Case Studies

We present two representative qualitative examples of H-Mem. The first illustrates multi-step sub-query decomposition for a multi-entity question, while the second illustrates missing-information guided retrieval, where the system detects that first-pass evidence is insufficiently specific and issues a bridge-style follow-up query.

(A) Case Study: Multi-step Sub-query Decomposition

Question. What subject have Caroline and Melanie both painted?

Planner output. H-Mem decomposes the question into two entity-specific sub-queries:

- q1: What subjects has Caroline painted? (SHORT, global)
- q2: What subjects has Melanie painted? (SHORT, global)

Retrieved evidence.

- q1: Caroline shared paintings involving sunsets and floral themes.
- q2: Melanie shared paintings including a lake sunrise, a sunset scene, and other nature-inspired subjects.

Prediction. sunsets

Gold answer. Sunsets

(B) Case Study: Missing-Information Guided Retrieval

Question. Which popular music composer's tunes does Tim enjoy playing on the piano?

First-pass evidence. The first retrieval pass finds evidence that Tim enjoys playing *a theme from a movie he really likes* on the piano, but the composer is not explicitly named.

Reasoner decision. The first-pass reasoner marks the sub-query as missing_info=true, because the current evidence is suggestive but still underspecified for answering the question at the required level of specificity.

Generated missing-info query. Which popular music composer created the theme from the movie that Tim enjoys playing on the piano?

Second-pass evidence and answer. The follow-up retrieval uncovers the missing bridge between Tim's favorite movie theme and the intended composer, allowing H-Mem to produce the final answer: John Williams

Gold answer. John Williams

Figure 5: Two representative qualitative examples of H-Mem. (A) H-Mem decomposes a multi-entity question into two entity-specific sub-queries and retrieves evidence under global coverage before final synthesis. (B) When first-pass evidence is insufficiently specific, H-Mem explicitly detects missing information and generates a bridge-style follow-up query for targeted second-pass retrieval.

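The control flow in case (B) can be sketched as a two-pass loop around a retriever and a reasoner. This is a simplified illustration under stated assumptions: the `Verdict` dataclass, the `answer_sub_query` function, and the callable signatures for `retrieve` and `reason` are hypothetical names introduced here, standing in for components whose actual interfaces the paper does not specify in this appendix.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Hypothetical reasoner output for one sub-query."""
    answer: Optional[str]
    missing_info: bool
    followup_query: Optional[str] = None


Retriever = Callable[[str], List[str]]
Reasoner = Callable[[str, List[str]], Verdict]


def answer_sub_query(
    sub_query: str, retrieve: Retriever, reason: Reasoner
) -> Optional[str]:
    """Answer a sub-query with at most one targeted follow-up retrieval."""
    evidence = retrieve(sub_query)
    verdict = reason(sub_query, evidence)

    # Second pass only when the reasoner flags an evidence gap.
    if verdict.missing_info and verdict.followup_query:
        evidence = evidence + retrieve(verdict.followup_query)
        verdict = reason(sub_query, evidence)

    return verdict.answer
```

Because the second retrieval runs only when `missing_info` is set, sub-queries that are answerable from first-pass evidence incur a single retrieval call, matching the cost argument made for targeted follow-up retrieval.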
## Appendix C Prompt Templates

This appendix provides the core prompt templates used by H-Mem. We include prompts for memory construction, evaluation, retrieval planning, and missing-information query generation.

### C.1 Memory Construction Prompts

- Memory extraction prompt
- Memory consolidation prompt

### C.2 Evaluation Prompts

- LLM-Judge prompt
- Answer simplification prompt

### C.3 Retrieval Planning Prompts

- Subquery planner prompt
- Missing-information query prompt
