RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents
Summary
RecMem is a recurrence-based memory consolidation method for long-running LLM agents that reduces token consumption by up to 87% while improving accuracy, by only invoking LLMs when semantically similar interactions recur.
# RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents
Source: [https://arxiv.org/html/2605.16045](https://arxiv.org/html/2605.16045)
Zijie Dai¹, Shiyuan Deng², Sheng Guan³, Yizhou Tian¹, Xin Yao⁴, Xiao Yan⁵, James Cheng¹

¹Department of Computer Science and Engineering, The Chinese University of Hong Kong; ²Huawei Cloud; ³School of Computer Science, Beijing University of Posts and Telecommunications; ⁴Huawei Theory Lab; ⁵Institute for Math and AI, Wuhan University

caiusdai@link.cuhk.edu.hk, dengshiyuan@huawei.com
###### Abstract
Memory systems often organize user-agent interactions as retrievable external memory and are crucial for long-running agents because they overcome the limited context windows of LLMs. However, existing memory systems invoke LLMs to process every incoming interaction for memory extraction, and such an *eager memory consolidation* scheme leads to substantial token consumption. To tackle this problem, we propose RecMem by rethinking when memory consolidation should be conducted. RecMem stores incoming interactions in a subconscious memory layer and encodes them using lightweight embedding models for retrieval. LLMs are only invoked to extract episodic and semantic memory when sustained recurrence is observed for semantically similar interactions. Such *recurrence-based consolidation* works because these interactions correspond to a semantic cluster with rich information and thus are worth extraction and summarization. To improve accuracy, RecMem also incorporates a semantic refinement mechanism that recovers the fine-grained facts omitted by memory extraction. Experiments show that RecMem reduces the memory construction token cost of three SOTA memory systems by up to 87% while exceeding their accuracy. Our code is available at [https://github.com/CaiusDai/RecMem](https://github.com/CaiusDai/RecMem).
Dr. Shiyuan Deng is the corresponding author.
## 1 Introduction
Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks Guo et al. ([2024](https://arxiv.org/html/2605.16045#bib.bib5)); Shao et al. ([2024](https://arxiv.org/html/2605.16045#bib.bib24)). However, enabling LLMs to function as long-running agents requires accumulating experience over extended user-agent interactions Jiang et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib9)). In practice, this is hindered by two critical limitations: current LLMs cannot retain information beyond their limited context windows Liu et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib16)), and they often under-utilize relevant evidence even when it is present in long inputs, due to the lost-in-the-middle effect Liu et al. ([2023](https://arxiv.org/html/2605.16045#bib.bib17)).
To address these limitations, memory systems have emerged as an essential component for building long-running LLM agents Jiang et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib9)); Zhang et al. ([2024](https://arxiv.org/html/2605.16045#bib.bib32)), and many solutions have been proposed with different memory structures and memory extraction methods Xu et al. ([2025b](https://arxiv.org/html/2605.16045#bib.bib30)); Chhikara et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib3)); Rezazadeh et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib23)); Packer et al. ([2024](https://arxiv.org/html/2605.16045#bib.bib21)); Maharana et al. ([2024](https://arxiv.org/html/2605.16045#bib.bib18)). For example, Zep Rasmussen et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib22)) constructs temporal knowledge graphs by abstracting relational triplets from interactions; Mem0 Chhikara et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib3)) extracts atomic facts from interactions for similarity-based retrieval; A-Mem Xu et al. ([2025b](https://arxiv.org/html/2605.16045#bib.bib30)) organizes interactions as connected notes, and a note can update the contents of its neighbors.
[Figure 1 panels: (a) Eager memory consolidation; (b) Recurrence-based consolidation (ours); (c) Task accuracy; (d) Memory construction cost]

Figure 1: Comparing RecMem with existing memory systems. (a) Existing systems conduct eager memory consolidation for every incoming interaction; (b) our RecMem conducts recurrence-based consolidation selectively from a subconscious memory; (c)-(d) task accuracy and memory construction cost on the LoCoMo benchmark.

Despite the differences among existing memory systems, we observe that they all adopt an *eager memory consolidation* scheme. In particular, for every incoming user-agent interaction, they invoke LLMs to extract facts and merge these facts with existing memory contents. This scheme avoids missing information in the interactions but incurs substantial token cost for memory construction, as shown in Figure [1](https://arxiv.org/html/2605.16045#S1.F1)(d), which makes it expensive to utilize these memory systems in practice. We argue that running LLM-based memory consolidation for every interaction is overkill. For instance, some interactions may convey little information or contain noise, while others are unrelated to existing ones and can be queried directly without consolidation. Hence, it is possible to reduce the memory construction cost by choosing more judiciously when to conduct memory consolidation.
Similar insights emerge from cognitive science. The multi-store theory (Atkinson and Shiffrin, [1968](https://arxiv.org/html/2605.16045#bib.bib2)) and the Complementary Learning Systems framework (Kumaran et al., [2016](https://arxiv.org/html/2605.16045#bib.bib13); O’Reilly et al., [2014](https://arxiv.org/html/2605.16045#bib.bib20); McClelland et al., [1995](https://arxiv.org/html/2605.16045#bib.bib19)) both converge on a common principle: isolated experiences remain in transient or rapidly-encoded stores, and only repeated or recurring patterns drive consolidation into stable long-term memory. This principle directly motivates RecMem’s recurrence-driven consolidation scheme.
Motivated by these insights, we propose RecMem, an efficient memory system for long-running agents that conducts fewer LLM-based memory consolidations in a *recurrence-driven* manner. In particular, RecMem introduces a subconscious memory layer that buffers the raw user-agent interactions via lightweight embeddings, enabling cost-effective retrieval without invoking LLMs. Memory consolidation is only conducted when an incoming interaction can find a sufficient number of semantically similar or related interactions in the subconscious memory, and LLMs are utilized to extract episodic summaries and semantic facts from these interactions. This works because these interactions form a semantic cluster with rich information that is worth consolidating, which resembles how long-term memory is formed from transient memory in cognitive science.
RecMem also incorporates a *semantic refinement* mechanism to improve accuracy. Specifically, LLM-based extraction, especially event-level episodic summarization, may omit fine-grained but query-critical details, leading to lossy long-term memory. Our semantic refinement revisits the raw interactions associated with each episodic memory, extracts the missing and persistent facts that are not captured by the episodic memory, and distills them into a semantic memory to avoid information loss.
We empirically evaluate RecMem on two challenging long-term memory benchmarks (i.e., LoCoMo Maharana et al. ([2024](https://arxiv.org/html/2605.16045#bib.bib18)) and LongMemEval-S Wu et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib28))) and compare with three SOTA memory systems (i.e., Mem0 Chhikara et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib3)), A-Mem Xu et al. ([2025b](https://arxiv.org/html/2605.16045#bib.bib30)), and MemoryOS Kang et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib12))). The results show that RecMem yields higher question answering accuracy than all baselines on both datasets while drastically reducing the token cost for memory construction. In particular, on the LoCoMo benchmark in Figure [1](https://arxiv.org/html/2605.16045#S1.F1)(d), RecMem reduces the token consumption by up to 7.8x over the baselines. Moreover, RecMem’s query-time token cost remains comparable to existing memory systems, so the construction-time savings translate into lower end-to-end cost over long interaction histories.
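As a quick consistency check between the two headline numbers, and assuming that the 7.8x factor and the 87% figure from the abstract describe the same best-case comparison (the text does not state this explicitly), a reduction factor $r$ corresponds to a fractional saving of $1 - 1/r$:

$$1 - \frac{1}{7.8} \approx 0.872,$$

i.e., roughly an 87% reduction in memory construction tokens.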
Our contributions are summarized as follows:
- We identify a fundamental inefficiency in existing LLM memory systems, i.e., *eager memory consolidation* for every interaction leads to a high memory construction token cost.
- Inspired by cognitive science, we propose *recurrence-based consolidation* to save token cost by conducting memory consolidation only when an incoming interaction can find a sufficient number of semantically similar or related interactions.
- We present RecMem, a three-tier memory architecture that realizes this paradigm. Combining a lightweight subconscious store with a novel semantic refinement mechanism, RecMem achieves high accuracy while substantially reducing token cost.
## 2 Preliminaries
### 2.1 Problem Setting: Conversational Memory
Recent work on LLM-based agents increasingly focuses on conversational memory, where the agent accumulates information through long-term, multi-turn interactions Hu et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib8)); Xu et al. ([2025a](https://arxiv.org/html/2605.16045#bib.bib29)); Maharana et al. ([2024](https://arxiv.org/html/2605.16045#bib.bib18)). Formally, we denote the interaction history available at time step $t$ as a sequence $\mathcal{O}_{1:t} = \{o_1, \ldots, o_t\}$. Each interaction unit $o_t$ is defined as a tuple:

$$o_t = (s_t, x_t, \tau_t) \tag{1}$$

where $s_t \in \{\text{user}, \text{assistant}\}$ represents the speaker role, $x_t$ denotes the message content, and $\tau_t$ is the timestamp. Given a query $q$, the objective is to retrieve relevant evidence from an external memory derived from $\mathcal{O}_{1:t}$ to support reasoning and response generation.
Although conversational settings may appear more specific than general memory scenarios, they capture a fundamental property of real-world deployment: information arrives as a stream over time, and the agent must continually manage an ever-growing interaction history to support future queries and reasoning Zhang et al. ([2024](https://arxiv.org/html/2605.16045#bib.bib32)). This formulation contrasts with retrieval-augmented generation (RAG), which typically assumes static or pre-ingested knowledge sources Lewis et al. ([2021](https://arxiv.org/html/2605.16045#bib.bib14)); Han et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib6)). In conversational memory, the key challenge is not retrieval, which can largely leverage existing techniques, but how the system constructs and updates the underlying memory from ongoing interactions in an online manner.
### 2.2 Memory Systems
We focus on training-free, text-based external memory systems for LLM agents in streaming conversational settings. For brevity, we refer to such systems as *memory systems* in the remainder of this paper. Parametric memory approaches Fang et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib4)); Wang et al. ([2025a](https://arxiv.org/html/2605.16045#bib.bib26)) require retraining or architectural modification to absorb new information and are thus less applicable in our setting Hu et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib8)), while RL-based methods Yan et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib31)); Wang et al. ([2025b](https://arxiv.org/html/2605.16045#bib.bib27)) are orthogonal to our focus, as they operate on top of a given memory architecture.
Most existing memory systems construct long-term memory by incrementally transforming incoming interactions (or short windows thereof) into retrievable memory units, such as summaries Kang et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib12)); Packer et al. ([2024](https://arxiv.org/html/2605.16045#bib.bib21)); Zhong et al. ([2023](https://arxiv.org/html/2605.16045#bib.bib33)), atomic facts Chhikara et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib3)); Wang and Chen ([2025](https://arxiv.org/html/2605.16045#bib.bib25)), or structured nodes (e.g., graphs/trees) Hogan et al. ([2021](https://arxiv.org/html/2605.16045#bib.bib7)); Rasmussen et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib22)); Rezazadeh et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib23)), and then rely on similarity-based retrieval or hybrid search Kang et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib12)); Rasmussen et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib22)) to supply evidence at query time. We defer a detailed taxonomy of memory representations, retrieval mechanisms, and construction pipelines to Appendix [A](https://arxiv.org/html/2605.16045#A1).
## 3 The RecMem Framework
### 3.1 Overview
RecMem is a three-tier memory system guided by the principle that not all interactions warrant LLM-level consolidation. Incoming messages are first organized as atomic interaction units and written to a *subconscious* store with only lightweight structuring and vectorization, making the raw interaction history directly accessible through embedding-based retrieval (§[3.2](https://arxiv.org/html/2605.16045#S3.SS2)). Building on this store, RecMem performs *recurrence-based consolidation*: instead of consolidating every turn, it invokes LLM-based processing only when the system observes clear evidence that similar interaction content recurs, thereby reserving LLM invocation for cases where aggregation is likely to be beneficial. Once triggered, RecMem produces an *episodic* abstraction over the selected turns (§[3.3](https://arxiv.org/html/2605.16045#S3.SS3)), and then applies *semantic refinement* to recover fine-grained, reusable facts that may be omitted by episodic abstraction, grounded in the episode and its underlying interactions (§[3.4](https://arxiv.org/html/2605.16045#S3.SS4)). At query time, RecMem retrieves a small budget of items from the subconscious, episodic, and semantic stores, and answers by conditioning the LLM on the merged context (§[3.5](https://arxiv.org/html/2605.16045#S3.SS5)).
Our use of episodic and semantic memory follows the convention in previous LLM memory literature (Li and Li, [2024](https://arxiv.org/html/2605.16045#bib.bib15); Wang and Chen, [2025](https://arxiv.org/html/2605.16045#bib.bib25)). Specifically, episodic memory in RecMem stores temporally anchored event narratives, which are coherent summaries of how a topic evolves across multiple interaction turns, with explicit time grounding. Semantic memory stores atomic facts about general knowledge, user preferences, constraints, and entity relations.
RecMem’s design mirrors human memory: most experiences remain unconsolidated unless repeatedly activated Atkinson and Shiffrin ([1968](https://arxiv.org/html/2605.16045#bib.bib2)); O’Reilly et al. ([2014](https://arxiv.org/html/2605.16045#bib.bib20)); McClelland et al. ([1995](https://arxiv.org/html/2605.16045#bib.bib19)). By avoiding eager LLM-based consolidation of transient interactions, RecMem substantially reduces token consumption while preserving both event-level coherence and stable user-centric knowledge as memories. To facilitate understanding, Appendix [B](https://arxiv.org/html/2605.16045#A2) provides a minimal running example that walks through the memory ingestion workflow.
### 3.2 Subconscious Memory
The subconscious memory manager maintains a faithful record of interaction history at minimal computational cost. A critical design consideration here is the granularity at which conversational information is represented. Existing systems adopt diverse ingestion strategies, ranging from processing individual messages Xu et al. ([2025b](https://arxiv.org/html/2605.16045#bib.bib30)); Wang and Chen ([2025](https://arxiv.org/html/2605.16045#bib.bib25)); Rasmussen et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib22)) or interaction pairs Chhikara et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib3)) to accumulating larger, fixed-size context buffers Kang et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib12)); Packer et al. ([2024](https://arxiv.org/html/2605.16045#bib.bib21)). Static grouping or buffering may conflate temporally adjacent but semantically unrelated topics, diluting the specificity of embeddings. Conversely, ingesting messages in isolation risks fragmenting the semantic context, as an assistant’s response often relies heavily on the preceding user query for its meaning.
To address these issues, RecMem treats each *message exchange* (a user-assistant turn) as an atomic unit. Formally, we define a multi-turn conversation between a user and an assistant with $t$ turns as

$$\mathcal{H}_t = \bigl(u_1, u_2, \ldots, u_t\bigr), \tag{2}$$

$$u_i = \bigl(m^{\mathrm{usr}}_i, m^{\mathrm{ast}}_i, \tau_i\bigr). \tag{3}$$

Here, $u_i$ represents an interaction unit at turn $i$, composed of the user message $m^{\mathrm{usr}}_i$, the assistant response $m^{\mathrm{ast}}_i$, and a timestamp $\tau_i$. The history $\mathcal{H}_t$ is a time-ordered sequence of these units.
As interaction units arrive in a streaming manner, each new message turn $u_i$ is processed independently. We formally define the constructed subconscious memory unit $s_i$ as:

$$s_i = (v_i, u_i) \quad \text{where} \quad v_i = \Phi(u_i). \tag{4}$$

These units, computed via the dense vector encoder $\Phi(\cdot)$, are immediately indexed into the subconscious memory store $\mathcal{S}_{\mathrm{sub}}$. This store is implemented as a vector database to support efficient semantic retrieval and incremental updates without the need for batching or access to future context. This fine-grained representation encourages focused semantic embeddings at the level of individual interaction units, making it well-suited for streaming settings.
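The following is a minimal sketch of the subconscious store described above, not the authors' implementation. It assumes a brute-force in-memory list in place of a vector database, and `toy_embed` is a hypothetical hashed bag-of-words stand-in for the encoder $\Phi$. The names `InteractionUnit`, `SubconsciousStore`, `toy_embed`, and `cosine` are illustrative only.

```python
import math
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InteractionUnit:
    """u_i = (m_usr, m_ast, tau_i): one user-assistant exchange."""
    user_msg: str
    assistant_msg: str
    timestamp: float


def toy_embed(text: str, dim: int = 32) -> list[float]:
    """Hypothetical stand-in for Phi: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[sum(map(ord, token)) % dim] += 1.0  # deterministic bucket per token
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class SubconsciousStore:
    """S_sub as an append-only list of s_i = (v_i, u_i), searched by brute force."""
    entries: list[tuple[list[float], InteractionUnit]] = field(default_factory=list)

    def add(self, unit: InteractionUnit) -> list[float]:
        vec = toy_embed(f"{unit.user_msg} {unit.assistant_msg}")
        self.entries.append((vec, unit))  # indexed immediately, no batching
        return vec

    def top_k(self, query_vec: list[float], k: int):
        scored = [(cosine(query_vec, v), u) for v, u in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = SubconsciousStore()
v1 = store.add(InteractionUnit("Any hiking tips?", "Pack water and start early.", 1.0))
store.add(InteractionUnit("What should I cook tonight?", "Try a vegetable stir-fry.", 2.0))
best_sim, best_unit = store.top_k(v1, k=1)[0]
```

Embedding the user message and the assistant response together reflects the exchange-level granularity argued for above. A real deployment would swap the list for a vector index and `toy_embed` for an actual embedding model.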
#### Recurrence\-based Memory Consolidation
Echoing cognitive principles Atkinson and Shiffrin ([1968](https://arxiv.org/html/2605.16045#bib.bib2)); O’Reilly et al. ([2014](https://arxiv.org/html/2605.16045#bib.bib20)); McClelland et al. ([1995](https://arxiv.org/html/2605.16045#bib.bib19)), we propose *recurrence-based consolidation*: raw interactions are retained in the subconscious buffer, with LLM-based abstraction triggered only when retrieval signals indicate sustained recurrence. Specifically, for a newly arriving unit $s_i = (v_i, u_i)$, the system queries $\mathcal{S}_{\mathrm{sub}}$ to retrieve the set $\mathcal{N}_i$ containing the top-$k$ units ranked by cosine similarity to $v_i$. We then filter these candidates to define the *relevant set* based on strict semantic proximity:

$$\mathcal{R}_i = \{\, s_j \in \mathcal{N}_i \mid \cos(v_i, v_j) \geq \theta_{\mathrm{sim}} \,\}. \tag{5}$$

Consolidation is triggered only if the relevant set size meets a recurrence count threshold (i.e., $|\mathcal{R}_i| \geq \theta_{\mathrm{count}}$). In such cases, the cluster $\mathcal{C}_i = \mathcal{R}_i \cup \{s_i\}$ is promoted to the higher-level memory modules, i.e., episodic memory and semantic memory; otherwise, $s_i$ remains in $\mathcal{S}_{\mathrm{sub}}$. This ensures that consolidation is applied only to interactions with demonstrated, sustained recurrence.
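The trigger rule in Eq. (5) and the count threshold can be sketched as follows. This is an illustrative sketch under two assumptions not stated in the text: the store is queried before the new unit is indexed (so the unit cannot match itself), and neighbours are found by brute force. The function name `recurrence_trigger` is hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recurrence_trigger(new_vec, stored_vecs, k, theta_sim, theta_count):
    """Return (triggered, relevant_indices) for a newly arriving embedding v_i."""
    scored = sorted(
        ((cosine(new_vec, v), j) for j, v in enumerate(stored_vecs)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbours = scored[:k]                                       # N_i: top-k by cosine
    relevant = [j for sim, j in neighbours if sim >= theta_sim]   # R_i, Eq. (5)
    return len(relevant) >= theta_count, relevant                 # |R_i| >= theta_count


stored = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.6]]
triggered, relevant = recurrence_trigger(
    [1.0, 0.0], stored, k=3, theta_sim=0.7, theta_count=3
)
```

When `triggered` is true, the cluster $\mathcal{C}_i$ consists of the units at `relevant` plus the new unit itself. Note that $\theta_{\mathrm{count}}$ can never be met if it exceeds $k$, so $k$ bounds the usable range of the count threshold.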
### 3.3 Episodic Memory
Episodic memory captures event-level structure across multiple turns. To ensure memory remains compact, RecMem adopts a merge-first strategy. Upon the arrival of a subconscious unit $s_i = (v_i, u_i)$, we retrieve the nearest-neighbor episode $E^{\star}$ from the episodic store $\mathcal{S}_{\mathrm{epi}}$. Let $v_{E^{\star}} = \Phi(E^{\star})$ denote the embedding of this episode. We strictly enforce an in-place update if semantic similarity permits:

$$E^{\star} \leftarrow \operatorname{LLM}_{\mathrm{merge}}(E^{\star}, u_i) \quad \text{if} \quad \cos(v_i, v_{E^{\star}}) \geq \theta_{\mathrm{sim}}, \tag{6}$$

where $\operatorname{LLM}_{\mathrm{merge}}$ integrates the content of the new turn $u_i$ into the narrative of $E^{\star}$.
Without such a merge\-first step, each recurrence\-triggered consolidation on a topic would produce a fresh episode in parallel with existing ones on the same topic, fragmenting the episodic representation of an evolving thread across multiple disconnected entries\. Merge\-first collapses these into a single continually\-updated narrative, keeping the episodic store compact and the per\-topic narrative coherent as the conversation evolves\.
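A minimal sketch of the merge-first check in Eq. (6) is shown below. The `llm_merge` and `embed` callables are hypothetical stand-ins for the LLM merge prompt and the encoder $\Phi$. Re-encoding the episode after the merge is an assumption, made because $v_{E^{\star}}$ is defined as $\Phi(E^{\star})$; the text does not say when this embedding is refreshed.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_first(new_vec, new_text, episodes, theta_sim, llm_merge, embed):
    """Try Eq. (6); return the index of the updated episode, or None if no merge."""
    if not episodes:
        return None
    sims = [cosine(new_vec, ep["vec"]) for ep in episodes]
    best = max(range(len(episodes)), key=sims.__getitem__)  # nearest episode E*
    if sims[best] < theta_sim:
        return None  # unit stays in S_sub and waits for the recurrence trigger
    episodes[best]["text"] = llm_merge(episodes[best]["text"], new_text)
    episodes[best]["vec"] = embed(episodes[best]["text"])  # assumption: re-encode
    return best


# Stubs so the sketch runs without an LLM or an embedding service.
def stub_merge(episode_text, turn_text):
    return f"{episode_text} {turn_text}"


def stub_embed(text):
    return [1.0, 0.0, 0.0] if "Kyoto" in text else [0.0, 1.0, 0.0]


episodes = [
    {"text": "User is planning a Kyoto trip.", "vec": [1.0, 0.0, 0.0]},
    {"text": "User is preparing for a job interview.", "vec": [0.0, 1.0, 0.0]},
]
merged_idx = merge_first(
    [0.9, 0.1, 0.0], "User booked a ryokan.", episodes, 0.7, stub_merge, stub_embed
)
skipped_idx = merge_first(
    [0.0, 0.0, 1.0], "User asked about tax forms.", episodes, 0.7, stub_merge, stub_embed
)
```

The first call updates the existing Kyoto episode in place rather than creating a parallel entry. The second finds no sufficiently similar episode and leaves the episodic store untouched.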
If merging is not applicable, the unit waits for the *recurrence-based consolidation* trigger. Given the triggered cluster $\mathcal{C}_i$ (from §[3.2](https://arxiv.org/html/2605.16045#S3.SS2.SSS0.Px1)), we extract the interaction units $\mathcal{U}_{\mathcal{C}} = \{u_j \mid (v_j, u_j) \in \mathcal{C}_i\}$. We then sort these units by their timestamps to form a temporal sequence $U^{\mathrm{seq}}_i = (u^{(1)}, \ldots, u^{(|\mathcal{C}_i|)})$ for episodic memory consolidation. The consolidation prompt is designed for *inductive organization* rather than simple summarization. The LLM processes the formatted sequence to synthesize coherent narratives, segmenting disparate sub-topics if necessary:

$$\mathcal{M}^{\mathrm{epi}}_i = \operatorname{LLM}_{\mathrm{epi}}\left(\bigoplus_{k=1}^{|\mathcal{C}_i|} \operatorname{Fmt}(u^{(k)})\right). \tag{7}$$

Here, $\operatorname{Fmt}(\cdot)$ denotes a fixed template that formats each interaction unit into a textual representation. The output $\mathcal{M}^{\mathrm{epi}}_i$ is a set of new episodic units; each episode $E \in \mathcal{M}^{\mathrm{epi}}_i$ is then encoded via $\Phi(\cdot)$ and stored in $\mathcal{S}_{\mathrm{epi}}$. We provide a complete prompt list in Appendix [F](https://arxiv.org/html/2605.16045#A6) for reference.
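The input side of Eq. (7), i.e., sorting the cluster by timestamp and concatenating the formatted units, can be sketched as below. The template inside `fmt` is a hypothetical example, not the paper's actual $\operatorname{Fmt}(\cdot)$ (the real prompts are in its Appendix F). The dictionary keys are illustrative, and the LLM call itself is omitted.

```python
from datetime import datetime, timezone


def fmt(unit: dict) -> str:
    """Hypothetical Fmt(.) template: a timestamp header plus both messages."""
    stamp = datetime.fromtimestamp(unit["timestamp"], tz=timezone.utc)
    header = stamp.strftime("%Y-%m-%d %H:%M")
    return f"[{header} UTC]\nUser: {unit['user']}\nAssistant: {unit['assistant']}"


def build_episodic_input(cluster: list[dict]) -> str:
    """Order the cluster chronologically (U_seq) and join the formatted units."""
    ordered = sorted(cluster, key=lambda u: u["timestamp"])
    return "\n\n".join(fmt(u) for u in ordered)


# Units arrive here in retrieval order, not in time order.
cluster = [
    {"timestamp": 86400, "user": "I booked the flight.", "assistant": "Great, which dates?"},
    {"timestamp": 0, "user": "I want to visit Kyoto.", "assistant": "Spring is a good season."},
]
prompt_body = build_episodic_input(cluster)
```

Sorting matters because the cluster is assembled by similarity, so its members can be far apart in time. Chronological order gives the LLM the explicit time grounding that episodic narratives need.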
### 3.4 Semantic Memory
Semantic memory complements episodic memory by storing fine-grained facts that may be missed by event-level summaries. It also mitigates a side effect of the merge-first strategy in episodic memory: as an episode absorbs more turns through repeated merges, its summary necessarily becomes broader and more abstract, which can dilute its retrieval precision for queries that target a specific detail buried within that episode. Semantic memory counteracts this by storing the same details as independent, narrowly-scoped entries, so that precise factual queries can hit them directly without having to surface the entire encompassing episode. In RecMem, we construct this memory layer through a process called *Semantic Refinement*. By strictly tying semantic extraction to episodic construction, this mechanism ensures that facts remain grounded in the current episodic context while explicitly recovering precise details that were abstracted away.
Formally, when a new episode $E \in \mathcal{M}^{\mathrm{epi}}_i$ is generated from the source interaction units $\mathcal{U}_{\mathcal{C}}$, we first retrieve related existing semantic facts to provide historical context:

$$\mathcal{V} = \operatorname{TopK}_k\bigl(\mathcal{S}_{\mathrm{sem}}, \Phi(E)\bigr). \tag{8}$$

We then employ an LLM-based refiner to deduce new facts. Conditioned on the raw interaction units $\mathcal{U}_{\mathcal{C}}$, the episodic summary $E$, and the retrieved facts $\mathcal{V}$, the model is instructed to perform two parallel tasks: (1) *Detail Recovery*, which scans the raw texts in $\mathcal{U}_{\mathcal{C}}$ to identify critical entities omitted by the summary $E$; and (2) *Fact Maintenance*, which prevents redundancy by filtering out known information in $\mathcal{V}$ while updating evolving user states (e.g., preference changes).
The extraction process is formulated as:

$$\mathcal{M}^{\mathrm{sem}}_i = \operatorname{LLM}_{\mathrm{refine}}\bigl(E, \mathcal{U}_{\mathcal{C}}, \mathcal{V}\bigr). \tag{9}$$

Each extracted fact $f \in \mathcal{M}^{\mathrm{sem}}_i$ is stored as an independent entry to preserve retrieval specificity. This design reduces redundancy and enables incremental updates of user facts while keeping retrieval efficient.
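The data flow of Eqs. (8) and (9) can be sketched as follows. Both `embed` and `llm_refine` are hypothetical callables. The stub refiner only filters out already-known facts, so it illustrates the redundancy-filtering half of Fact Maintenance but not the updating of evolving user states.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def semantic_refinement(episode, raw_units, sem_store, k, embed, llm_refine):
    """Retrieve context facts V (Eq. 8), refine (Eq. 9), store each new fact separately."""
    query = embed(episode)
    ranked = sorted(sem_store, key=lambda e: cosine(query, e["vec"]), reverse=True)
    known = [e["fact"] for e in ranked[:k]]             # V = TopK_k(S_sem, Phi(E))
    new_facts = llm_refine(episode, raw_units, known)   # M_sem = LLM_refine(E, U_C, V)
    for fact in new_facts:
        sem_store.append({"fact": fact, "vec": embed(fact)})  # one entry per fact
    return new_facts


# Stubs so the sketch runs without an LLM or an embedding service.
def stub_embed(text):
    return [1.0, 0.0] if "Kyoto" in text else [0.0, 1.0]


def stub_refine(episode, raw_units, known):
    candidates = ["User flies to Kyoto on April 3.", "User is vegetarian."]
    return [c for c in candidates if c not in known]


sem_store = [{"fact": "User is vegetarian.", "vec": [0.0, 1.0]}]
raw_units = ["User: I fly to Kyoto on April 3. Assistant: Noted, any food needs?"]
added = semantic_refinement(
    "User planned a Kyoto trip.", raw_units, sem_store, 5, stub_embed, stub_refine
)
```

Passing the raw units alongside the episode is what lets the refiner recover the travel date that an event-level summary would drop. Passing the retrieved facts keeps the already-stored preference from being duplicated.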
### 3.5 Question Answering
To generate an answer, RecMem first encodes the user query $q$ into a vector representation $v_q = \Phi(q)$ and retrieves the most relevant entries from the subconscious ($\mathcal{S}_{\mathrm{sub}}$), episodic ($\mathcal{S}_{\mathrm{epi}}$), and semantic ($\mathcal{S}_{\mathrm{sem}}$) stores. To manage the context window efficiently while ensuring diverse coverage, we enforce a fixed subconscious retrieval budget and *couple* the episodic and semantic budgets by setting $k_{\mathrm{sem}} = 2k_{\mathrm{epi}}$, yielding three context sets: $\mathcal{K}_{\mathrm{sub}}$, $\mathcal{K}_{\mathrm{epi}}$, and $\mathcal{K}_{\mathrm{sem}}$. The final answer is then generated by conditioning the LLM on the retrieved contexts alongside the original query.
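The budgeted three-store retrieval can be sketched as below, using the default budgets reported later in the paper ($k_{\mathrm{sub}} = 10$, $k_{\mathrm{epi}} = 5$). The store layout, the brute-force ranking, and the names `top_k`, `build_context`, and `make_store` are assumptions for illustration. How the three sets are laid out in the final prompt is not specified here and is left out.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(store, query_vec, k):
    ranked = sorted(store, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return [e["text"] for e in ranked[:k]]


def build_context(query_vec, sub, epi, sem, k_sub=10, k_epi=5):
    """Retrieve K_sub, K_epi, K_sem with the coupled budget k_sem = 2 * k_epi."""
    k_sem = 2 * k_epi
    return {
        "subconscious": top_k(sub, query_vec, k_sub),
        "episodic": top_k(epi, query_vec, k_epi),
        "semantic": top_k(sem, query_vec, k_sem),
    }


def make_store(prefix, n):
    # Toy 2-D embeddings: entry i sits at angle 0.1 * i from the query direction.
    return [
        {"text": f"{prefix}-{i}", "vec": [math.cos(0.1 * i), math.sin(0.1 * i)]}
        for i in range(n)
    ]


ctx = build_context(
    [1.0, 0.0], make_store("sub", 12), make_store("epi", 7), make_store("sem", 15)
)
```

Coupling the budgets means a single knob, $k_{\mathrm{epi}}$, scales both higher-level stores, with twice as many short semantic facts as longer episodic narratives.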
### 3.6 Discussions
#### Setting the hyper-parameters
RecMem’s consolidation behavior is controlled by two key hyper-parameters, i.e., the similarity threshold $\theta_{\mathrm{sim}}$ for relevant interactions and the recurrence threshold $\theta_{\mathrm{count}}$ to trigger consolidation. Larger $\theta_{\mathrm{sim}}$ and $\theta_{\mathrm{count}}$ make consolidation more conservative and favor more frequent patterns, while lower thresholds make consolidation more active and improve coverage for subtle details. Based on our empirical experience, we recommend $\theta_{\mathrm{sim}} = 0.7,\ \theta_{\mathrm{count}} = 5$ for casual open-ended settings and $\theta_{\mathrm{sim}} = 0.6,\ \theta_{\mathrm{count}} = 4$ for longer and task-oriented interactions. For question answering, we fix the aggregate retrieval budgets across the memory layers and use $k_{\mathrm{sub}} = 10$, $k_{\mathrm{epi}} = 5$, and $k_{\mathrm{sem}} = 10$ (i.e., $k_{\mathrm{sem}} = 2k_{\mathrm{epi}}$) by default. Sensitivity experiments for the hyper-parameters are reported in Appendix [C](https://arxiv.org/html/2605.16045#A3).
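The recommended settings can be captured in a small configuration object. The class and preset names are hypothetical; only the numeric values come from the text above. Deriving `k_sem` from `k_epi` encodes the coupling $k_{\mathrm{sem}} = 2k_{\mathrm{epi}}$ rather than storing it as an independent field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecMemConfig:
    theta_sim: float    # similarity threshold for the relevant set
    theta_count: int    # recurrence count needed to trigger consolidation
    k_sub: int = 10     # subconscious retrieval budget at query time
    k_epi: int = 5      # episodic retrieval budget at query time

    @property
    def k_sem(self) -> int:
        return 2 * self.k_epi  # coupled semantic budget


PRESETS = {
    # Casual, open-ended conversation: more conservative consolidation.
    "casual_open_ended": RecMemConfig(theta_sim=0.7, theta_count=5),
    # Longer, task-oriented interaction: more active consolidation.
    "long_task_oriented": RecMemConfig(theta_sim=0.6, theta_count=4),
}
```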
#### Robustness of threshold choice
A natural concern is whether RecMem’s performance depends on precise threshold calibration. Our sensitivity analysis in Appendix [C](https://arxiv.org/html/2605.16045#A3) shows that this is not the case: overall accuracy varies smoothly and stays within a narrow band around the recommended defaults, so performance does not hinge on selecting a brittle operating point. Within this robust range, the thresholds instead serve as a strategic dial between memory selectivity and consolidation sensitivity. Higher values of $\theta_{\mathrm{sim}}$ and $\theta_{\mathrm{count}}$ render RecMem more *conservative*, prioritizing high-confidence patterns, which suits casual open-ended conversations where signal is sparse and noise filtering matters. Conversely, lower thresholds make the system more *active* in consolidation, which is ideal for task-completion workflows where capturing subtle details is critical.
#### Generality of recurrence\-based consolidation
Although RecMem organizes episodic and semantic memory as flat entries for similarity-based retrieval, recurrence-based consolidation is a general idea and is not limited to specific memory structures. The key to recurrence-based consolidation is to use a cheap subconscious memory to buffer the incoming interactions and to trigger consolidation for higher memory layers based on recurrence; the higher memory layers can also adopt alternative structures (e.g., a knowledge graph).
## 4 Experimental Evaluation
### 4.1 Experiment Settings
To ensure a fair and standardized comparison, we strictly adhere to the incremental evaluation protocol established in prior studies Chhikara et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib3)); Xu et al. ([2025b](https://arxiv.org/html/2605.16045#bib.bib30)); Kang et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib12)). In this setting, message turns are streamed sequentially into the memory system to mimic the natural flow of ongoing dialogues Hu et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib8)), followed by multi-round query sessions.
#### Datasets
We evaluate RecMem on two English benchmarks selected to represent distinct interaction modalities: social companionship and long-context task completion. LoCoMo Maharana et al. ([2024](https://arxiv.org/html/2605.16045#bib.bib18)) features companion-style, life-sharing dialogues, consisting of 10 multi-session conversations (avg. 16k tokens) with questions that probe reasoning over evolving personal history. In contrast, LongMemEval-S Wu et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib28)) focuses on agentic, task-oriented interactions with substantially longer contexts. Comprising 500 conversations averaging 115k tokens, it poses a rigorous test for memory systems under realistic, high-load user-assistant workflows. Detailed statistics and question types of these two datasets are provided in Appendix [D](https://arxiv.org/html/2605.16045#A4).
#### Baselines
We compare RecMem against various types of representative baselines:
- Full Context, which feeds all historical interactions to the LLM for answering each question.
- Naive RAG, a standard RAG baseline that segments the interactions into chunks and retrieves the relevant chunks based on embedding similarity. We employ a chunking strategy that respects message integrity (see Appendix [E.1](https://arxiv.org/html/2605.16045#A5.SS1)).
- Mem0 Chhikara et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib3)), which employs a fact-extraction pipeline to dynamically extract salient information from interactions and manages memory consistency via LLM-based update operations (e.g., add, update, delete).
- A-Mem Xu et al. ([2025b](https://arxiv.org/html/2605.16045#bib.bib30)), an agentic memory system inspired by the Zettelkasten note-taking method Kadavy ([2021](https://arxiv.org/html/2605.16045#bib.bib11)); Ahrens ([2017](https://arxiv.org/html/2605.16045#bib.bib1)), which organizes the interactions as discrete "memory notes" that are connected via entity linking to facilitate associative retrieval.
- MemoryOS Kang et al. ([2025](https://arxiv.org/html/2605.16045#bib.bib12)), an OS-inspired hierarchical framework that manages information via short-term, mid-term, and long-term memory tiers. It also incorporates a dedicated module to maintain evolving user and agent personas to enable personalized interactions.
Table 1: Results on the LoCoMo benchmark\. Bold and underline mark the best and second\-best accuracies\.
Table 2: Results on the LongMemEval\-S benchmark\. Bold and underline mark the best and second\-best accuracies\.
#### Performance Metrics
We compare RecMem with the baselines along two dimensions\.
- Question answering accuracy\. We report accuracy as the fraction of questions answered correctly\. Following Chhikara et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib3)\), we use GPT\-4o\-mini as an LLM judge and treat its judgment score as the primary metric\. We prioritize this semantic evaluation over token\-overlap metrics like F1 score, which can under\-estimate correctness for open\-ended generation with paraphrases\. For completeness, we also report F1 and a comparison in Appendix [E\.4](https://arxiv.org/html/2605.16045#A5.SS4)\. All reported task scores are averaged over three runs\.
- Computation efficiency\. We measure LLM token usage \(input plus output\) in two phases: \(1\) construction cost, averaged per conversation during memory ingestion, and \(2\) query cost, averaged per question during answering\.
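The two efficiency metrics above reduce to simple averages over logged LLM calls\. The following is a minimal sketch of that accounting, assuming a hypothetical per\-call log record \(`LLMCall`, with illustrative `phase`, `input_tokens`, and `output_tokens` fields\); none of these names come from the RecMem code base\.

```python
from dataclasses import dataclass


@dataclass
class LLMCall:
    # Hypothetical log record; the field names are illustrative only.
    phase: str           # "construction" or "query"
    input_tokens: int
    output_tokens: int


def total_tokens(calls, phase):
    """Sum input plus output tokens over all calls made in one phase."""
    return sum(c.input_tokens + c.output_tokens for c in calls if c.phase == phase)


def construction_cost(calls, num_conversations):
    """Average construction tokens per conversation."""
    return total_tokens(calls, "construction") / num_conversations


def query_cost(calls, num_questions):
    """Average query tokens per question."""
    return total_tokens(calls, "query") / num_questions


# Toy log: two ingestion calls and two answering calls.
calls = [
    LLMCall("construction", 900, 100),
    LLMCall("construction", 1800, 200),
    LLMCall("query", 450, 50),
    LLMCall("query", 270, 30),
]
```

Keeping the two phases separate matters because construction cost scales with the length of the interaction stream, whereas query cost scales with the number of questions asked\.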
#### Implementation Details
To evaluate the generalization of our approach, we conduct experiments using two distinct LLM backends: GPT\-4o\-mini and GPT\-4\.1\-mini\. To ensure fair comparison, for any given benchmark result, RecMem and all baselines share the identical underlying model version\. For LLM calls, we set temperature=0\.0 and use text\-embedding\-3\-small to generate vector embeddings\. For RecMem, we configure the recurrence\-based consolidation thresholds to adapt to the distinct interaction densities of each benchmark: $\theta_{\text{sim}} = 0.7$, $\theta_{\text{count}} = 5$ for LoCoMo, and $\theta_{\text{sim}} = 0.6$, $\theta_{\text{count}} = 4$ for LongMemEval\-S\. More detailed experiment settings are in Appendix [E](https://arxiv.org/html/2605.16045#A5)\.
### 4\.2Main Results
Tables [1](https://arxiv.org/html/2605.16045#S4.T1) and [2](https://arxiv.org/html/2605.16045#S4.T2) demonstrate that RecMem offers a strong efficiency–performance trade\-off compared to prior memory systems\. For each task, we highlight the best result in bold and the second\-best result with underlining\. Across both benchmarks and backbone models, RecMem substantially reduces construction\-time token consumption while preserving competitive end\-task performance, indicating that eager consolidation is not required to achieve effective long\-term memory\. We emphasize that our goal is not to dominate every individual task category, but rather to achieve the highest overall accuracy among memory systems under a drastically reduced construction\-cost budget\.
#### LoCoMo\.
On LoCoMo with GPT\-4\.1\-mini, RecMem uses only 193\.2K construction tokens on average, compared to 1520\.8K for Mem0 and 1459\.9K for A\-Mem, corresponding to reductions of 87\.3% and 86\.8%, respectively\. A similar reduction pattern holds for GPT\-4o\-mini\. Despite this drastic decrease in construction cost, RecMem achieves the highest overall score among memory\-based methods, indicating that recurrence\-based memory consolidation can retain strong long\-term memory performance while avoiding the systematic overhead of processing every turn through the LLM\. We also note that Full Context slightly outperforms RecMem on LoCoMo, which is consistent with LoCoMo’s relatively short conversations \(approximately 16K tokens per conversation\), where full\-context inference remains feasible\. However, as shown in Table [2](https://arxiv.org/html/2605.16045#S4.T2), this behavior does not generalize to substantially longer settings\.
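The reported reductions follow directly from the average construction token counts quoted above\. A short check of the arithmetic, using only those three figures \(the helper name `reduction_pct` is illustrative\):

```python
def reduction_pct(ours, baseline):
    """Relative token reduction of `ours` versus `baseline`, in percent."""
    return round(100.0 * (1.0 - ours / baseline), 1)


# Average construction tokens (in thousands) on LoCoMo with GPT-4.1-mini,
# as quoted in the text.
recmem_k = 193.2
mem0_k = 1520.8
amem_k = 1459.9

print(reduction_pct(recmem_k, mem0_k))  # 87.3
print(reduction_pct(recmem_k, amem_k))  # 86.8
```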
#### LongMemEval\-S\.
A similar but more complex pattern emerges on LongMemEval\-S, where conversations are substantially longer and closer to real\-world long\-lived agents\. With GPT\-4\.1\-mini, RecMem reduces construction tokens by 77\.5% relative to Mem0 and 71\.1% relative to A\-Mem, while achieving the best overall score among all evaluated methods, including Full Context and RAG\.
At the category level, different systems exhibit complementary strengths, and we do not claim a universal winner across all question types\. Importantly, RecMem is not designed to dominate every category in isolation; rather, it targets robust *overall* capability with a much smaller construction\-cost budget\. Our results support this goal: despite large reductions in construction tokens, RecMem attains the best overall score on LongMemEval\-S\.
Beyond the aggregate metric, RecMem’s clearest and most consistent gains appear on temporal reasoning, where long\-range dependencies are central\. We argue this is a structural consequence of recurrence\-based consolidation rather than an artifact of tuning\. Temporal reasoning requires two capabilities: cross\-time linking of co\-referent mentions, and reconstructing their chronological order\. Eager consolidation systems are disadvantaged on the former: by committing to summary boundaries at each turn or local buffer, they anchor later mentions of an evolving topic to different summaries, fragmenting the thread\. RecMem addresses both capabilities by construction: similarity\-based clustering in subconscious memory \(§[3\.2](https://arxiv.org/html/2605.16045#S3.SS2)\) aggregates co\-referent mentions regardless of temporal distance, and timestamp\-sorted episodic consolidation \(§[3\.3](https://arxiv.org/html/2605.16045#S3.SS3)\) reconstructs chronological order within each cluster\. Semantic refinement additionally extracts time\-anchored facts grounded in the raw interaction units, serving as a second safeguard for fine\-grained temporal evidence that episodic abstraction may compress away\.
#### Construction vs\. query cost\.
RecMem’s efficiency gains come from reducing construction\-time LLM usage\. Query\-time token consumption stays within a comparable range across memory\-based methods under our evaluation protocol because they retrieve a similar amount of evidence for answering, whereas construction\-time usage diverges sharply depending on how frequently and how heavily a method invokes LLM processing during ingestion\. In streaming deployments where new turns arrive continually, these construction\-time differences accumulate over time and can dominate total LLM usage, making construction a critical and often overlooked cost driver\.
### 4\.3Ablation Study
We conduct an ablation study to quantify the contribution of each RecMem module by disabling one component at a time\. Figure [2](https://arxiv.org/html/2605.16045#S4.F2) reports results on LoCoMo using GPT\-4\.1\-mini as the backbone model\. For each ablated variant, we keep the retrieval budget of the non\-ablated modules unchanged to ensure fairness\.
Figure 2: Ablation study for RecMem on LoCoMo\.
As shown in Figure [2](https://arxiv.org/html/2605.16045#S4.F2), removing any module reduces performance, indicating that the three memory tiers are complementary\. The largest drop occurs when removing subconscious memory \($81.10 \rightarrow 51.88$\)\. This sharp degradation is expected because subconscious memory is the only faithful carrier of raw interaction units: information that does not trigger recurrence\-based consolidation remains exclusively in this module, so disabling it eliminates access to a substantial fraction of query\-relevant evidence\.
We observe an asymmetric contribution between episodic and semantic memory\. Removing episodic memory causes only a small drop \($81.10 \rightarrow 79.94$\), whereas removing semantic memory yields a larger but still bounded drop \($81.10 \rightarrow 70.58$\)\. This asymmetry reflects their division of labor under semantic refinement: episodic memories mainly capture high\-level structure and cross\-turn linkage, while semantic memories prioritize fine\-grained factual details\. Since semantic refinement explicitly recovers details omitted by episodic abstraction and stores them as semantic facts, semantic memory can partially cover missing evidence when episodic memory is removed; in contrast, episodic summaries can only weakly substitute for the detailed facts lost without semantic memory, leading to the larger degradation\.
To isolate the effect of semantic refinement, we evaluate a Direct Extraction variant that extracts semantic facts directly from raw conversations, without using episodic memories as a reference for detecting omitted details\. At inference time, this variant answers using only subconscious retrieval and the extracted semantic facts\. We remove the refinement\-specific guidance tied to episodic summaries in the semantic extraction prompt and leave the other parts intact\. The score drops from $79.94$ to $74.22$, showing that episodic memory provides an essential reference signal for semantic refinement, improving semantic memory quality beyond naive fact extraction from raw dialogue\.
#### Additional Experiments
Beyond the consolidation thresholds discussed above, we conduct three additional sets of analyses to characterize RecMem’s behavior\. Appendix [C\.1](https://arxiv.org/html/2605.16045#A3.SS1) provides a full sensitivity analysis of the consolidation hyperparameters $\theta_{\text{sim}}$ and $\theta_{\text{count}}$, examining both accuracy and construction cost\. Appendix [C\.2](https://arxiv.org/html/2605.16045#A3.SS2) studies the retrieval\-side budgets $k_{\text{sub}}$, $k_{\text{epi}}$, and $k_{\text{sem}}$ to identify how much evidence is needed at query time\. Appendix [E\.4](https://arxiv.org/html/2605.16045#A5.SS4) additionally reports F1 scores for completeness, along with a discussion of why we treat LLM\-as\-Judge as the primary metric for open\-ended generation\.
## 5Conclusion
We present RecMem, an efficiency\-aware memory system for long\-running LLM agents that challenges the prevailing paradigm of eager memory consolidation\. By explicitly modeling raw interactions within a lightweight subconscious memory and deferring LLM\-based abstraction until triggered by recurrence, RecMem demonstrates that high\-fidelity long\-term memory does not necessitate exhaustive processing of every interaction\. Across LoCoMo and LongMemEval\-S, this strategy substantially reduces memory construction cost while preserving competitive task performance\. More broadly, RecMem reframes memory consolidation as a dynamic, recurrence\-driven process\. We hope this work encourages the community to reconsider when and why information should be consolidated in long\-running agent tasks, and to treat computational cost as a first\-class criterion when evaluating future memory systems\.
## 6Limitations
Despite the empirical strengths and efficiency gains of RecMem, several limitations merit discussion\.
#### Dependence on Heuristic Thresholds\.
RecMem relies on static similarity \($\theta_{\text{sim}}$\) and recurrence \($\theta_{\text{count}}$\) thresholds to govern the consolidation process\. While our experiments demonstrate that these parameters can be tuned to accommodate different interaction densities \(e\.g\., casual conversation vs\. task completion\), they currently remain manually specified\. This dependency means RecMem may benefit from threshold recalibration when deploying to domains with substantially different interaction densities, although Appendix [C](https://arxiv.org/html/2605.16045#A3) shows that such recalibration can be coarse rather than precise\. Developing adaptive or learnable triggering mechanisms that dynamically adjust to user behavior is a promising direction for future work\.
#### Recurrence as a Proxy for Salience\.
Our design is predicated on the assumption that information worthy of long\-term abstraction tends to recur\. While this aligns with many cognitive theories and conversational patterns, it may risk overlooking rare but critical events—such as a one\-off safety instruction or a unique user constraint—that appear only once\. To mitigate this risk, the subconscious memory layer functions as a persistent safety net: every interaction unit is preserved verbatim and remains directly retrievable at query time, regardless of whether it has been consolidated\. Nevertheless, non\-recurring content does not benefit from the cross\-turn linking of episodic memory or the fact\-level refinement of semantic memory, which may weaken reasoning over these details\. Developing a lightweight salience signal beyond pure recurrence to promote rare but high\-value events is a promising direction for future work\.
## 7Ethical Considerations
We evaluate RecMem only on publicly available benchmarks in an offline setting, and we do not deploy or test it in real user\-facing applications\. Nevertheless, long\-term memory mechanisms can raise dual\-use concerns: when integrated into real applications, persistent memory may be misused for profiling or surveillance beyond the intended personalization benefits\. We therefore recommend that practical deployments incorporate clear user\-facing disclosures and safeguards such as access controls and user\-controllable deletion/retention policies\.
A second risk arises from unintended harms due to incorrect memory\. Errors in consolidation or retrieval can surface outdated or spurious details and lead to overconfident but incorrect responses, which may be consequential in high\-stakes settings\. We encourage future work to incorporate uncertainty\-aware retrieval, confidence calibration, and monitoring against memory poisoning or prompt\-injection attempts\.
Finally, RecMem reduces unnecessary LLM invocations compared to eager extraction baselines, which can lower compute and associated environmental footprint when operating over long interaction histories\.
## References
- Ahrens \(2017\)S\. Ahrens\. 2017\.[*How to Take Smart Notes: One Simple Technique to Boost Writing, Learning and Thinking – for Students, Academics and Nonfiction Book Writers*](https://books.google.com.sg/books?id=lHDsDwAAQBAJ)\.Sönke Ahrens\.
- Atkinson and Shiffrin \(1968\)Richard C\. Atkinson and Richard M\. Shiffrin\. 1968\.[Human memory: A proposed system and its control processes](https://api.semanticscholar.org/CorpusID:22958289)\.In*The psychology of learning and motivation*\.
- Chhikara et al\. \(2025\)Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav\. 2025\.[Mem0: Building production\-ready ai agents with scalable long\-term memory](https://arxiv.org/abs/2504.19413)\.*Preprint*, arXiv:2504\.19413\.
- Fang et al\. \(2025\)Yunhao Fang, Weihao Yu, Shu Zhong, Qinghao Ye, Xuehan Xiong, and Lai Wei\. 2025\.[Artificial hippocampus networks for efficient long\-context modeling](https://arxiv.org/abs/2510.07318)\.*Preprint*, arXiv:2510\.07318\.
- Guo et al\. \(2024\)Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y\. Wu, Y\. K\. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang\. 2024\.[Deepseek\-coder: When the large language model meets programming – the rise of code intelligence](https://arxiv.org/abs/2401.14196)\.*Preprint*, arXiv:2401\.14196\.
- Han et al\. \(2025\)Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A\. Rossi, Subhabrata Mukherjee, Xianfeng Tang, Qi He, Zhigang Hua, Bo Long, Tong Zhao, Neil Shah, Amin Javari, Yinglong Xia, and Jiliang Tang\. 2025\.[Retrieval\-augmented generation with graphs \(graphrag\)](https://arxiv.org/abs/2501.00309)\.*Preprint*, arXiv:2501\.00309\.
- Hogan et al\. \(2021\)Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia D’amato, Gerard De Melo, Claudio Gutierrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, Axel\-Cyrille Ngonga Ngomo, Axel Polleres, Sabbir M\. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan Sequeda, Steffen Staab, and Antoine Zimmermann\. 2021\.[Knowledge graphs](https://doi.org/10.1145/3447772)\.*ACM Computing Surveys*, 54\(4\):1–37\.
- Hu et al\. \(2025\)Yuanzhe Hu, Yu Wang, and Julian McAuley\. 2025\.[Evaluating memory in llm agents via incremental multi\-turn interactions](https://arxiv.org/abs/2507.05257)\.*Preprint*, arXiv:2507\.05257\.
- Jiang et al\. \(2025\)Xun Jiang, Feng Li, Han Zhao, Jiahao Qiu, Jiaying Wang, Jun Shao, Shihao Xu, Shu Zhang, Weiling Chen, Xavier Tang, Yize Chen, Mengyue Wu, Weizhi Ma, Mengdi Wang, and Tianqiao Chen\. 2025\.[Long term memory: The foundation of ai self\-evolution](https://arxiv.org/abs/2410.15665)\.*Preprint*, arXiv:2410\.15665\.
- Johnson et al\. \(2017\)Jeff Johnson, Matthijs Douze, and Hervé Jégou\. 2017\.[Billion\-scale similarity search with gpus](https://arxiv.org/abs/1702.08734)\.*Preprint*, arXiv:1702\.08734\.
- Kadavy \(2021\)David Kadavy\. 2021\.*Digital Zettelkasten: Principles, Methods, & Examples*\.Kadavy, Incorporated\.
- Kang et al\. \(2025\)Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai\. 2025\.[Memory os of ai agent](https://arxiv.org/abs/2506.06326)\.*Preprint*, arXiv:2506\.06326\.
- Kumaran et al\. \(2016\)Dharshan Kumaran, Demis Hassabis, and James L\. McClelland\. 2016\.[What learning systems do intelligent agents need? complementary learning systems theory updated](https://doi.org/10.1016/j.tics.2016.05.004)\.*Trends in Cognitive Sciences*, 20\(7\):512–534\.
- Lewis et al\. \(2021\)Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela\. 2021\.[Retrieval\-augmented generation for knowledge\-intensive nlp tasks](https://arxiv.org/abs/2005.11401)\.*Preprint*, arXiv:2005\.11401\.
- Li and Li \(2024\)Jitang Li and Jinzheng Li\. 2024\.[Memory, consciousness and large language model](https://arxiv.org/abs/2401.02509)\.*Preprint*, arXiv:2401\.02509\.
- Liu et al\. \(2025\)Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, and 29 others\. 2025\.[Advances and challenges in foundation agents: From brain\-inspired intelligence to evolutionary, collaborative, and safe systems](https://arxiv.org/abs/2504.01990)\.*Preprint*, arXiv:2504\.01990\.
- Liu et al\. \(2023\)Nelson F\. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang\. 2023\.[Lost in the middle: How language models use long contexts](https://api.semanticscholar.org/CorpusID:259360665)\.*Transactions of the Association for Computational Linguistics*, 12:157–173\.
- Maharana et al\. \(2024\)Adyasha Maharana, Dong\-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang\. 2024\.[Evaluating very long\-term conversational memory of llm agents](https://arxiv.org/abs/2402.17753)\.*Preprint*, arXiv:2402\.17753\.
- McClelland et al\. \(1995\)James L\. McClelland, Bruce L\. McNaughton, and Randall C\. O’Reilly\. 1995\.[Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory\.](https://api.semanticscholar.org/CorpusID:2832081)*Psychological review*, 102 3:419–457\.
- O’Reilly et al\. \(2014\)Randall C\. O’Reilly, Rajan Bhattacharyya, Michael D\. Howard, and Nicholas Ketz\. 2014\.[Complementary learning systems](https://doi.org/10.1111/j.1551-6709.2011.01214.x)\.*Cognitive Science*, 38\(6\):1229–1248\.
- Packer et al\. \(2024\)Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G\. Patil, Ion Stoica, and Joseph E\. Gonzalez\. 2024\.[Memgpt: Towards llms as operating systems](https://arxiv.org/abs/2310.08560)\.*Preprint*, arXiv:2310\.08560\.
- Rasmussen et al\. \(2025\)Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef\. 2025\.[Zep: A temporal knowledge graph architecture for agent memory](https://arxiv.org/abs/2501.13956)\.*Preprint*, arXiv:2501\.13956\.
- Rezazadeh et al\. \(2025\)Alireza Rezazadeh, Zichao Li, Wei Wei, and Yujia Bao\. 2025\.[From isolated conversations to hierarchical schemas: Dynamic tree memory representation for llms](https://arxiv.org/abs/2410.14052)\.*Preprint*, arXiv:2410\.14052\.
- Shao et al\. \(2024\)Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y\. K\. Li, Y\. Wu, and Daya Guo\. 2024\.[Deepseekmath: Pushing the limits of mathematical reasoning in open language models](https://arxiv.org/abs/2402.03300)\.*Preprint*, arXiv:2402\.03300\.
- Wang and Chen \(2025\)Yu Wang and Xi Chen\. 2025\.[Mirix: Multi\-agent memory system for llm\-based agents](https://arxiv.org/abs/2507.07957)\.*Preprint*, arXiv:2507\.07957\.
- Wang et al\. \(2025a\)Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, and Zexue He\. 2025a\.[M\+: Extending memoryllm with scalable long\-term memory](https://arxiv.org/abs/2502.00592)\.*Preprint*, arXiv:2502\.00592\.
- Wang et al\. \(2025b\)Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu\. 2025b\.[Mem\-α\\alpha: Learning memory construction via reinforcement learning](https://arxiv.org/abs/2509.25911)\.*Preprint*, arXiv:2509\.25911\.
- Wu et al\. \(2025\)Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai\-Wei Chang, and Dong Yu\. 2025\.[Longmemeval: Benchmarking chat assistants on long\-term interactive memory](https://arxiv.org/abs/2410.10813)\.*Preprint*, arXiv:2410\.10813\.
- Xu et al\. \(2025a\)Derong Xu, Yi Wen, Pengyue Jia, Yingyi Zhang, wenlin zhang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, Enhong Chen, and Tong Xu\. 2025a\.[From single to multi\-granularity: Toward long\-term memory association and selection of conversational agents](https://arxiv.org/abs/2505.19549)\.*Preprint*, arXiv:2505\.19549\.
- Xu et al\. \(2025b\)Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang\. 2025b\.[A\-mem: Agentic memory for llm agents](https://arxiv.org/abs/2502.12110)\.*Preprint*, arXiv:2502\.12110\.
- Yan et al\. \(2025\)Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Kristian Kersting, Jeff Z\. Pan, Hinrich Schütze, Volker Tresp, and Yunpu Ma\. 2025\.[Memory\-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning](https://arxiv.org/abs/2508.19828)\.*Preprint*, arXiv:2508\.19828\.
- Zhang et al\. \(2024\)Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji\-Rong Wen\. 2024\.[A survey on the memory mechanism of large language model based agents](https://arxiv.org/abs/2404.13501)\.*Preprint*, arXiv:2404\.13501\.
- Zhong et al\. \(2023\)Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang\. 2023\.[Memorybank: Enhancing large language models with long\-term memory](https://arxiv.org/abs/2305.10250)\.*Preprint*, arXiv:2305\.10250\.
## Appendix ADetailed Taxonomy of Memory Systems
In this section, we provide a structured taxonomy of existing memory systems, focusing on two critical dimensions: memory consolidation \(how raw interactions are transformed into long\-term storage\) and retrieval mechanisms \(how relevant information is accessed during query time\)\.
### A\.1Memory Consolidation Paradigms
Memory consolidation transforms raw interaction streams into retrievable long\-term storage, as described in §[1](https://arxiv.org/html/2605.16045#S1)\. A critical commonality across existing works is their reliance on an eager consolidation strategy\. In these systems, every incoming interaction—regardless of its informational value or redundancy—eventually triggers an LLM\-driven processing pipeline\. This approach assumes that all user inputs require active structuring or abstraction, incurring constant computational overhead to maintain the memory state\. We categorize these paradigms by their consolidation targets:
#### Graph and Structure\-based Consolidation\.
These systems treat memory construction as a continuous structural maintenance task\. Upon receiving a new message, the system must compute embeddings, identify entities, and execute structural updates \(e\.g\., creating nodes or re\-balancing trees\) to integrate the new information into the existing topology\.
1. A\-Mem Xu et al\. \([2025b](https://arxiv.org/html/2605.16045#bib.bib30)\): Inspired by the Zettelkasten method Kadavy \([2021](https://arxiv.org/html/2605.16045#bib.bib11)\); Ahrens \([2017](https://arxiv.org/html/2605.16045#bib.bib1)\), it treats interactions as discrete "notes" in a network, where consolidation involves generating embeddings and establishing associative links between new and existing notes\.
2. TreeMem Rezazadeh et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib23)\): Maintains a hierarchical summary tree\. New information is not just appended but traverses down to specific leaf nodes based on semantic relevance, forcing a recursive chain of summary updates from the leaf back up to the root to keep the hierarchy consistent\.
3. Zep Rasmussen et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib22)\): Parses interactions into a "Temporal Knowledge Graph\." It actively extracts entities and relationships from each turn, modeling them as nodes and edges while explicitly updating the temporal metadata of these connections\.
4. Mem0 \(Graph Variant\) Chhikara et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib3)\): Extends atomic fact extraction by organizing data into a graph\. It requires per\-turn analysis to identify multi\-hop relationships between entities, dynamically updating the graph structure as the conversation evolves\.
#### Fact and Summary\-based Consolidation
These systems function as active distillers, where the LLM is invoked at every turn \(or small buffer intervals\) to parse information into compressed formats\. The goal is to immediately strip away redundancy and store only the event summaries or extracted facts\.
1. Mem0 Chhikara et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib3)\): Runs a dedicated extraction pipeline after every user message\. It prompts the LLM to identify atomic facts \(e\.g\., entity\-relation triplets\), instructing it to add, update, or delete records in the vector database to reflect the latest state\.
2. MemoryOS Kang et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib12)\): Features a multi\-tiered architecture \(Short\-, Mid\-, and Long\-term memories\) to manage context flow, emphasizing a dedicated Profile Memory module that explicitly maintains evolving user personas and agent guidelines\.
3. Mirix Wang and Chen \([2025](https://arxiv.org/html/2605.16045#bib.bib25)\): Routes every interaction through a parallel extraction pipeline\. Raw text is simultaneously processed by distinct modules to distill specific "Knowledge" facts and "Event" summaries, creating a synchronized update across multiple memory stores\.
4. MemGPT Packer et al\. \([2024](https://arxiv.org/html/2605.16045#bib.bib21)\): Treats memory management as an operating system process, employing self\-directed function calls to actively summarize and compress ongoing interactions into a fixed\-size "Core Memory" block, ensuring key persona and user details are preserved while offloading raw history\.
### A\.2Retrieval Mechanisms
While memory consolidation determines how information is stored, retrieval mechanisms define how relevant context is accessed to support reasoning\. Existing approaches range from simple semantic matching to complex, structure\-aware traversal algorithms\.
#### Dense Vector Retrieval
This prevalent paradigm relies on high\-dimensional embeddings to measure semantic overlap, commonly using similarity\-search libraries such as FAISS Johnson et al\. \([2017](https://arxiv.org/html/2605.16045#bib.bib10)\) for efficient nearest\-neighbor lookup\. A representative system is Mem0 Chhikara et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib3)\), which retrieves relevant atomic facts by computing the cosine similarity between the query and stored embeddings, selecting the top\-$k$ entries based purely on semantic relevance scores\.
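As a concrete illustration of top\-$k$ cosine retrieval, the sketch below scores every stored entry against the query and keeps the best $k$\. It is a brute\-force toy in pure Python with hand\-written 3\-dimensional vectors standing in for real embeddings; it is not how Mem0 or FAISS implement search, and the function names are illustrative\.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, store, k):
    """Return the texts of the k stored entries most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy store of (text, embedding) pairs; the vectors are made up for illustration.
store = [
    ("likes hiking", [1.0, 0.0, 0.0]),
    ("allergic to peanuts", [0.0, 1.0, 0.0]),
    ("owns a dog", [0.7, 0.7, 0.0]),
]
results = top_k([0.9, 0.1, 0.0], store, k=2)
```

Here `results` ranks "likes hiking" first and "owns a dog" second, because the query vector points mostly along the first axis\.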
Figure 3: A simplified memory ingestion process in RecMem\.
#### Structure\-Aware Retrieval
These systems leverage the topological structure established during consolidation \(graphs or trees\) to expand retrieval beyond simple similarity\. TreeMem Rezazadeh et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib23)\) utilizes a top\-down tree pruning strategy; starting from the root, it evaluates child nodes based on their summaries and prunes irrelevant branches to efficiently narrow the search to specific leaf nodes\. Similarly, A\-Mem Xu et al\. \([2025b](https://arxiv.org/html/2605.16045#bib.bib30)\) employs associative retrieval: upon locating an initial "note" via vector search, it traverses established entity links to fetch connected notes, mimicking the human ability to associate disparate memories through shared concepts\.
#### Hybrid Retrieval
To mitigate the precision limitations of pure vector search \(e\.g\., missing exact keyword matches\), some systems adopt a multi\-metric strategy\. MemoryOS Kang et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib12)\) implements a weighted hybrid retrieval mechanism\. Instead of relying on a single metric, it calculates a unified relevance score by linearly combining cosine similarity \(for semantic understanding\) and Jaccard similarity \(for exact keyword overlap\)\. This approach ensures that specific entities are recalled even if their semantic embeddings are distant, balancing fuzzy semantic matching with precise lexical matching\.
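A linear combination of cosine and Jaccard similarity can be sketched as follows\. This is an illustration of the general idea only: the mixing weight `alpha`, the whitespace tokenizer, and the function names are assumptions made for the example, not details taken from MemoryOS\.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(text_a, text_b):
    """Jaccard overlap of lowercase whitespace tokens (a deliberately naive tokenizer)."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def hybrid_score(query_vec, query_text, mem_vec, mem_text, alpha=0.5):
    """Weighted sum of semantic (cosine) and lexical (Jaccard) relevance.

    alpha is a hypothetical mixing weight chosen for illustration.
    """
    semantic = cosine(query_vec, mem_vec)
    lexical = jaccard(query_text, mem_text)
    return alpha * semantic + (1.0 - alpha) * lexical


# Orthogonal embeddings (cosine = 0) but shared keywords: the lexical term
# still yields a non-zero score, so the entry is not lost.
score = hybrid_score([1.0, 0.0], "order at SweetLeaf", [0.0, 1.0], "SweetLeaf order")
```

In the toy call, the two texts share two of three distinct tokens, so the lexical term alone keeps the entry retrievable even though the embeddings are orthogonal\.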
## Appendix BRunning Example
We briefly illustrate RecMem’s ingestion\-time behavior with a minimal three\-turn interaction in Figure [3](https://arxiv.org/html/2605.16045#A1.F3)\. For clarity, we set the recurrence threshold to $\theta_{\text{count}} = 2$: consolidation is triggered once a topic is observed in at least two interaction units after passing the topical\-similarity check\. For simplicity, we do not fix an exact similarity threshold here and instead use natural language to describe when two turns are treated as relevant\. To keep the example concise, we present only the recurrence\-triggered construction path, and therefore omit the merge\-first episodic in\-place update\.
#### Turn 1: subconscious write without consolidation\.
The user first asks for suggestions to order a birthday cake\. RecMem ingests this user–assistant exchange as one interaction unit and appends it to the subconscious memory\. Since the “cake” topic has been observed only once so far, the corresponding set $R_i$ does not satisfy the recurrence condition $|R_i| \geq \theta_{\text{count}}$, and thus no LLM\-based consolidation is triggered\. Instead, RecMem computes a lightweight embedding for this unit and stores it together with the raw text in the subconscious vector index, enabling efficient similarity\-based retrieval in future turns\.
Figure 4: Sensitivity of consolidation thresholds on LoCoMo \(GPT\-4\.1\-mini\)\. \(a\) Overall score vs\. $\theta_{\text{sim}}$\. \(b\) Overall score vs\. $\theta_{\text{count}}$\. \(c\) Memory\-construction token consumption vs\. $\theta_{\text{count}}$\.
#### Turn 2: no recurrence under the similarity check\.
The user then switches to an unrelated topic \(washing dark jeans\)\. RecMem uses the new unit’s embedding to retrieve relevant turns from the subconscious store, and then forms $R_i$ by keeping only those with similarity above the topical threshold\. The cake\-related unit from Turn 1 is unrelated to this turn, so $|R_i|=1$ and consolidation is not triggered\. The new turn is then stored in the subconscious store\.
#### Turn 3: recurrence\-based consolidation and semantic refinement\.
When the user returns to the cake topic, the new unit retrieves the prior cake\-related unit\(s\), which pass the similarity check against $\theta_{\text{sim}}$\. Since the recurrence count now satisfies $|R_i|\geq\theta_{\text{count}}$, RecMem triggers consolidation and produces two complementary artifacts\. Episodic memory abstracts the recurring turns into a coherent, intent\-level narrative, e\.g\., the user is preparing a birthday cake order for their sister Mia, and the assistant recommends an allergy\-safe ordering strategy\. This abstraction focuses on a high\-level event summary but may compress away valuable details\. Semantic refinement preserves such details by extracting atomic facts from the underlying raw turns, such as: \(i\) the user has a sister named Mia; \(ii\) Mia is allergic to peanuts; and \(iii\) the user plans to place an order at SweetLeaf with concrete cake/message specifications\. Semantic refinement also uses related existing semantic memories to assist extraction, but we omit this aspect here to keep the example minimal\.
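The three turns above can be traced with a small sketch of the recurrence check\. It is an illustration under stated assumptions rather than RecMem's implementation: the bag\-of\-words `embed` function stands in for a real embedding model, `THETA_SIM = 0.25` is a hypothetical threshold chosen for this toy embedding, the turn texts are invented, and $R_i$ is taken to include the incoming unit itself, as the counts in the example imply\.

```python
import math
import re
from collections import Counter

THETA_COUNT = 2    # recurrence threshold used in the running example
THETA_SIM = 0.25   # hypothetical similarity threshold for the toy embedding below

def embed(text):
    """Toy bag-of-words vector; a stand-in for a lightweight embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

class SubconsciousMemory:
    def __init__(self):
        self.units = []  # (raw text, embedding) pairs

    def ingest(self, text):
        """Store the unit; return the recurring set R_i if consolidation should fire."""
        vec = embed(text)
        related = [t for t, v in self.units if cosine(vec, v) >= THETA_SIM]
        self.units.append((text, vec))
        recurring = related + [text]  # R_i includes the incoming unit itself
        return recurring if len(recurring) >= THETA_COUNT else None

turns = [
    "Any suggestions for ordering a birthday cake?",
    "How should I wash dark jeans?",
    "For the birthday cake order, my sister Mia is allergic to peanuts.",
]
memory = SubconsciousMemory()
results = [memory.ingest(turn) for turn in turns]
# Turns 1 and 2 yield None; Turn 3 yields the two cake-related units.
```

In a full system, the returned set would be passed to the LLM for episodic and semantic extraction; here it is only returned so that the trigger condition can be inspected\.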
## Appendix C Hyperparameter Analysis
In this section, we conduct a sensitivity analysis of RecMem’s key hyperparameters under a controlled\-variable protocol\.
Figure 5: Sensitivity of retrieval budgets on LoCoMo \(GPT\-4\.1\-mini\)\. \(a\) Overall score vs\. subconscious retrieval budget $k_{\mathrm{sub}}$\. \(b\) Overall score vs\. episodic budget $k_{\mathrm{epi}}$ with $k_{\mathrm{sem}}=2k_{\mathrm{epi}}$\.
We organize the discussion into two parts: \(i\) consolidation\-stage thresholds, including the recurrence count threshold \($\theta_{\text{count}}$\) and the similarity threshold \($\theta_{\text{sim}}$\), which determine when interaction clusters are promoted from subconscious memory to higher\-level episodic and semantic memories; and \(ii\) retrieval\-stage budgeting, where we cap the number of retrieved items from each memory tier to control context length\. For retrieval, we treat the subconscious and episodic budgets \($k_{\text{sub}}$, $k_{\text{epi}}$\) as the only free hyperparameters, and set the semantic budget as a fixed function of the episodic budget, $k_{\text{sem}}=2k_{\text{epi}}$\.
To ensure fair comparison and isolate causal effects, in each experiment we vary only one target hyperparameter and freeze all others to the default LoCoMo configuration\. Unless otherwise stated, we use $\theta_{\text{count}}=5$ and $\theta_{\text{sim}}=0.7$ for consolidation, and $k_{\text{sub}}=10$, $k_{\text{epi}}=5$ \(thus $k_{\text{sem}}=10$\) for retrieval\. When sweeping $k_{\text{epi}}$, we update $k_{\text{sem}}$ accordingly via $k_{\text{sem}}=2k_{\text{epi}}$, while keeping all remaining hyperparameters fixed\. All experiments in this section are conducted on LoCoMo using GPT\-4\.1\-mini as the backbone model\.
### C\.1 Consolidation Hyperparameters
We study two consolidation\-stage thresholds that govern demand\-driven memory promotion: the similarity threshold $\theta_{\text{sim}}$, which controls how interaction units are clustered in subconscious memory, and the recurrence threshold $\theta_{\text{count}}$, which controls when a cluster is consolidated into episodic/semantic memories\.
#### Impact of $\theta_{\text{sim}}$\.
As shown in Figure [4\(a\)](https://arxiv.org/html/2605.16045#A2.F4.sf1), $\theta_{\text{sim}}$ exhibits a clear peak around the default setting $\theta_{\text{sim}}=0.7$\. When $\theta_{\text{sim}}$ is too low, semantically unrelated interactions are merged into the same cluster, reducing topical coherence and making the downstream summarization step noisier\. Conversely, when $\theta_{\text{sim}}$ is too high, related interactions are fragmented across multiple small clusters, weakening recurrence signals and delaying \(or preventing\) consolidation for genuinely recurring topics\. Overall, the sharp optimum suggests that the best choice on LoCoMo is unambiguous and that RecMem is reasonably robust in the neighborhood of $\theta_{\text{sim}}=0.7$\.
#### Impact of $\theta_{\text{count}}$: quality–cost trade\-off\.
Figure [4\(b\)](https://arxiv.org/html/2605.16045#A2.F4.sf2) and Figure [4\(c\)](https://arxiv.org/html/2605.16045#A2.F4.sf3) highlight a more explicit effectiveness–efficiency tension for $\theta_{\text{count}}$\. A lower $\theta_{\text{count}}$ triggers consolidation earlier, so each consolidation event typically includes fewer raw interaction units\. This smaller consolidation context can better preserve fine\-grained details \(less compression pressure during episodic abstraction\), but it is also more aggressive and therefore increases construction\-time token consumption due to more frequent consolidations\. Accordingly, token cost decreases smoothly as $\theta_{\text{count}}$ increases \(Figure [4\(c\)](https://arxiv.org/html/2605.16045#A2.F4.sf3)\)\.
In contrast, the performance curve is not smooth: we observe a clear degradation when increasing $\theta_{\text{count}}$ from $5$ to $6$ \(Figure [4\(b\)](https://arxiv.org/html/2605.16045#A2.F4.sf2)\), while the corresponding token reduction remains comparatively gradual\. We attribute this drop to two compounding factors at higher thresholds: \(i\) consolidation becomes overly conservative, leaving some recurring patterns insufficiently represented in episodic/semantic memory at query time; and \(ii\) once consolidation is finally triggered, the accumulated cluster is larger, which increases summarization difficulty and raises the likelihood that salient details are omitted or poorly organized \(even with semantic refinement\)\. Taken together, these results indicate that $\theta_{\text{count}}=5$ is the best operating point on LoCoMo: it retains the accuracy benefits of earlier, detail\-preserving consolidation while avoiding unnecessary consolidation overhead, and it prevents the disproportionate quality loss observed at more conservative threshold settings such as $\theta_{\text{count}}=6$ \(81\.1 vs\. 78\.9\)\.
### C\.2 Retrieval Hyperparameters
We next analyze the retrieval\-stage budgets that control how much evidence is surfaced from each memory tier at query time\. Recall that we treat the subconscious and episodic budgets \($k_{\text{sub}}$, $k_{\text{epi}}$\) as the only free retrieval hyperparameters, and set the semantic budget deterministically as $k_{\text{sem}}=2k_{\text{epi}}$\. Thus, sweeping $k_{\text{epi}}$ implicitly scales the total retrieved memory volume, while sweeping $k_{\text{sub}}$ isolates the contribution of raw, fine\-grained interaction evidence\.
Figure [5\(a\)](https://arxiv.org/html/2605.16045#A3.F5.sf1) and Figure [5\(b\)](https://arxiv.org/html/2605.16045#A3.F5.sf2) show a consistent diminishing\-returns trend: increasing retrieval budgets yields substantial gains at small values, but improvements become marginal as budgets grow\. We therefore adopt compact defaults that retain most of the performance benefit while limiting retrieved context length, setting $k_{\text{sub}}=10$ and $k_{\text{epi}}=5$ \(thus $k_{\text{sem}}=10$\)\.
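The coupling between the budgets can be made explicit in a small configuration sketch\. The class and field names below are hypothetical and only mirror the symbols in the text; the sweep values are illustrative and are not the grid used in the experiments\.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalBudgets:
    """Per-tier retrieval caps; only k_sub and k_epi are free parameters."""
    k_sub: int = 10  # subconscious (raw interaction) budget
    k_epi: int = 5   # episodic budget

    @property
    def k_sem(self) -> int:
        # The semantic budget is tied to the episodic budget: k_sem = 2 * k_epi.
        return 2 * self.k_epi

    @property
    def total(self) -> int:
        # Upper bound on the number of retrieved items across all three tiers.
        return self.k_sub + self.k_epi + self.k_sem

default = RetrievalBudgets()
# Sweeping k_epi rescales k_sem automatically while k_sub stays fixed.
sweep = [RetrievalBudgets(k_epi=k) for k in (1, 3, 5)]
```

Deriving `k_sem` as a property rather than storing it keeps the constraint from being violated when `k_epi` is swept\.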
## Appendix D Evaluation Datasets
This section provides detailed specifications and preprocessing protocols for the two benchmarks used in our experiments: LoCoMo Maharana et al\. \([2024](https://arxiv.org/html/2605.16045#bib.bib18)\) and LongMemEval\-S Wu et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib28)\)\.
### D\.1 LoCoMo
LoCoMo \(Long\-Context Memory\) is a benchmark designed to evaluate memory systems in casual, social settings\. Unlike standard user\-agent interactions, the source texts consist of multi\-session human\-to\-human dialogues between two distinct speakers, simulating the natural evolution of a long\-term relationship\.
#### Data Statistics\.
The dataset consists of 10 independent, human\-annotated conversations\. Each conversation spans multiple sessions, simulating a relationship that evolves over time\.
- Total Conversations: 10
- Average Length: ≈16,000 tokens per conversation
- Total Questions \(Used\): 1,540
- Dialogue Style: Casual, multi\-turn, life\-sharing, highly contextual\.
#### Task Categories\.
The benchmark originally includes five question categories\. Following standard protocols established in prior works Chhikara et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib3)\); Kang et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib12)\); Wang and Chen \([2025](https://arxiv.org/html/2605.16045#bib.bib25)\), we evaluate on the first four categories and exclude the adversarial set:
1. Single\-hop Retrieval: Questions requiring the retrieval of a specific fact mentioned in a single past session\.
2. Multi\-hop Reasoning: Questions that require synthesizing information distributed across multiple distinct sessions to derive an answer\.
3. Temporal Reasoning: Questions testing the system’s ability to understand the sequence of events and relative time expressions\.
4. Open\-domain Knowledge: Questions that require combining memory retrieval with external world knowledge\.
5. Adversarial \(Excluded\): Questions designed to trick the model with false premises\. We exclude this category as it lacks reliable ground\-truth answers for automated evaluation\.
### D\.2 LongMemEval\-S
LongMemEval\-S is a subset of the LongMemEval benchmark, curated to evaluate memory systems in agentic, task\-oriented interactions with long context windows\.
#### Data Statistics\.
Unlike the social nature of LoCoMo, LongMemEval\-S features functional interactions where the user seeks specific assistance\.
- Total Conversations: 500
- Average Context Length: ≈115k tokens \(approx\. 30–40 sessions\)\.
- Total Questions: 500
- Dialogue Style: Task\-oriented, high information density\.
#### Task Categories\.
To assess memory capabilities at a granular level, the benchmark stratifies queries into six distinct types:
1. Single\-session\-user: Evaluates the retrieval of specific details explicitly mentioned by the user within the bounds of a single conversation session\.
2. Single\-session\-assistant: Tests the system’s ability to recall information provided by the assistant itself within a single session, ensuring consistency in the agent’s own history\.
3. Single\-session\-preference: Assesses whether the model can effectively apply retrieved user information to generate personalized, context\-aware responses\.
4. Multi\-session: Requires the aggregation of disjoint pieces of information scattered across two or more sessions to derive a complete answer\.
5. Knowledge\-update: Probes the system’s capacity to track dynamic changes in the user’s life state and supersede outdated information with new updates\.
6. Temporal\-reasoning: Demands chronological deduction by synthesizing both the session metadata \(timestamps\) and explicit time expressions found in the text\.
## Appendix E Experiment Details
### E\.1 Baseline Configurations
To ensure fair and reliable comparisons, we configure each baseline to faithfully reflect its original design choices, rather than enforcing a unified ingestion or prompting pipeline\. Below, we describe in detail the implementation and prompting decisions used in our experiments\.
To enable a fair comparison of computational costs, we instrumented all baseline codebases with unified token\-tracking logic while leaving their core memory components intact\. For the LoCoMo benchmark, all memory\-system baselines considered in this work provide official implementations\. We therefore reuse their original prompts and evaluation code without modification\.
For the LongMemEval\-S benchmark, where standardized reference implementations are not available, we implement the evaluation pipeline while preserving each method’s ingestion strategy as used in its LoCoMo setup\. Concretely, we adopt: \(i\) A\-Mem’s per\-message ingestion, \(ii\) Mem0’s dual\-speaker ingestion with two messages per turn, and \(iii\) MemoryOS’s ingestion based on user–assistant QA pairs\. We make this choice to respect the baselines’ intended memory abstractions; forcing all methods to share RecMem’s ingestion logic would conflate design differences and bias the comparison\.
A\-Mem and MemoryOS each have two official codebases; we adopt the ones used in their papers Xu et al\. \([2025b](https://arxiv.org/html/2605.16045#bib.bib30)\); Kang et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib12)\) to ensure reproduction of the reported setting\. For Mem0, we use its local\-deployment version to enable token\-consumption tracking\. We also disable graph construction, as Mem0 reports that its graph variant can lead to a performance drop Chhikara et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib3)\)\.
For the RAG\-2048 baseline, we adopt a conservative chunking strategy that preserves message integrity: we never split a message across two chunks\. Messages are accumulated sequentially until the chunk reaches the 2048\-token budget\. If adding the next message would exceed this limit, we still include the entire message \(rather than truncating it\) to preserve semantic completeness, and then start a new chunk from the subsequent message\.
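The chunking rule above can be sketched as follows\. This is an illustrative reading of the description, not the evaluation code: `chunk_messages` and `count_tokens_whitespace` are hypothetical names, and whitespace splitting stands in for whichever tokenizer is actually used to measure the 2048\-token budget\.

```python
def count_tokens_whitespace(message):
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(message.split())

def chunk_messages(messages, budget=2048, count_tokens=count_tokens_whitespace):
    """Group whole messages into chunks; a message is never split across chunks."""
    chunks, current, used = [], [], 0
    for message in messages:
        # Always take the whole message, even if it pushes the chunk past the budget.
        current.append(message)
        used += count_tokens(message)
        if used >= budget:
            # Budget reached or exceeded: close this chunk and start a new one
            # from the subsequent message.
            chunks.append(current)
            current, used = [], 0
    if current:
        chunks.append(current)  # trailing partial chunk
    return chunks

# Tiny budget to make the overflow behavior visible.
demo = chunk_messages(["a b c", "d e f", "g"], budget=5)
print(demo)  # [['a b c', 'd e f'], ['g']]
```

Note that, under this rule, a chunk may exceed the nominal budget by up to one message, which is the price of keeping every message intact\.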
### E\.2 Evaluation Prompt Consistency
To ensure a fair and standardized comparison, we strictly enforce prompt consistency across all evaluated methods\. For any given dataset, the exact same evaluation prompt is employed for the LLM judge across all baselines and RecMem, ensuring that performance differences originate solely from the memory systems’ capabilities rather than variations in the evaluation criteria\. Specifically, our prompt sources are as follows:
#### LongMemEval\-S\.
We adopt the official evaluation prompt provided by the benchmark authors Wu et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib28)\) without modification\.
#### LoCoMo\.
We follow the evaluation protocol established in previous work Chhikara et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib3)\), which adapts the evaluation prompt elements originally designed by MemGPT Packer et al\. \([2024](https://arxiv.org/html/2605.16045#bib.bib21)\)\.
### E\.3 Answer Prompt Consistency
For LoCoMo, we use each baseline’s original answering prompt\. For LongMemEval\-S, we use the main body of RecMem’s answering prompt as a shared answer template across methods, so that performance differences primarily reflect the underlying memory mechanisms rather than prompt engineering\. For the Full\-Context and RAG\-2048 baselines, we also use the same answering prompt as RecMem for consistency\.
Because RecMem and MemoryOS both adopt multi\-module memory architectures, their answering prompts include a short module description that clarifies the roles of different memory sources\. For MemoryOS, we retain the prompt format used in its LoCoMo implementation\. For RecMem, we include a brief module\-role description to prevent the answer agent from double\-counting overlapping evidence retrieved from different modules\. Baselines with a single memory source do not require such clarification and therefore use only the shared main prompt body\.
### E\.4 Discussion on F1 Score
While the F1 score, which measures token\-level overlap with the reference answer, is one of the standard metrics in prior works Xu et al\. \([2025b](https://arxiv.org/html/2605.16045#bib.bib30)\); Kang et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib12)\); Chhikara et al\. \([2025](https://arxiv.org/html/2605.16045#bib.bib3)\), we observed it to be unreliable for evaluating long\-context memory systems where semantic correctness is paramount\. The F1 score penalizes correct answers that differ in phrasing from the ground truth\. For instance, if the ground truth is “16 March, 2023”, and the model generates “Gina opened her online clothing store on 2023\-03\-16”, the F1 score approaches 0 despite the answer being factually correct\. Consequently, we prioritize LLM\-as\-Judge in our main analysis\.
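The date example can be checked with a short token\-level F1 sketch\. The normalization below \(lowercasing, deleting punctuation, dropping English articles, splitting on whitespace\) follows the common SQuAD\-style convention and is an assumption; the baselines' exact F1 implementations may tokenize differently, in which case the score for this pair could be small but non\-zero\.

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style normalization (assumed): lowercase, strip punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction, ground_truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred, gold = normalize(prediction), normalize(ground_truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = "16 March, 2023"
pred = "Gina opened her online clothing store on 2023-03-16"
f1 = token_f1(pred, gold)
print(f1)  # 0.0: "2023-03-16" collapses to the single token "20230316"
```

Under this normalization the two answers share no tokens, so the factually correct prediction receives a score of exactly 0\.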
For LoCoMo, many prior evaluations treat F1 as a primary metric and enforce strict output\-length constraints \(e\.g\., “the answer should be less than 5 words”\) to optimize token overlap\. To maintain comparability with these reporting conventions, we retain such length constraints when evaluating on this dataset\. For transparency, we additionally report the resulting F1 scores in Table [3](https://arxiv.org/html/2605.16045#A5.T3)\. In contrast, for LongMemEval\-S, since all methods utilize a shared prompt body, we remove these artificial constraints to avoid penalizing valid, grounded answers that may exceed rigid word counts\.
Table 3: F1 score and LLM\-judge score on LoCoMo\.
### E\.5 Retrieval Top\-K
For RAG\-2048, we set the retrieval top\-$K$ to 3 on LoCoMo, reflecting its relatively short conversation length, and to 5 on LongMemEval\-S, where conversations are longer and often require aggregating evidence across more chunks\. As shown by the query\-token statistics in Tables [2](https://arxiv.org/html/2605.16045#S4.T2) and [1](https://arxiv.org/html/2605.16045#S4.T1), these settings allow the RAG baseline to retrieve a comparable amount of information to other methods under similar query\-time budgets\.
For all other baselines, we keep their retrieval budgets consistent across LoCoMo and LongMemEval\-S, following their default design choices\. Concretely, A\-Mem retrieves 10 memory notes\. Mem0 retrieves 60 memory facts in total\. MemoryOS retrieves all memories from short\-term memory, together with 10 memories from mid\-term memory, 5 memories from long\-term memory, as well as its qualified assistant knowledge and user knowledge components\.
A special case is Mem0 on LoCoMo: since LoCoMo includes dual\-speaker question types, Mem0 retrieves 30 facts per speaker \(i\.e\., 60 total\) to balance coverage across user and assistant perspectives\. In contrast, LongMemEval\-S is dominated by user\-centric questions, with relatively few assistant\-centric queries\. Therefore, while keeping the total budget fixed at 60, we allocate 45 retrieved facts to the user side and 15 to the assistant side, which better reflects Mem0’s intended strengths under the LongMemEval\-S query distribution\.
## Appendix F LLM Prompts
This appendix reports the primary prompts used in RecMem, including \(i\) episodic memory generation, \(ii\) episodic memory merging, \(iii\) semantic memory generation, and \(iv\) the final answer prompt\. To improve readability and facilitate reproduction, we present each prompt in figures instead of inline text\. Each memory\-related prompt follows a consistent structure with three components: \(a\) a role and goal specification, \(b\) detailed instructions, and \(c\) explicit output\-format constraints\.
#### Episodic memory generation prompt\.
Figures [6](https://arxiv.org/html/2605.16045#A7.F6)–[8](https://arxiv.org/html/2605.16045#A7.F8) show the role/goal description, instructions, and required output format for episodic memory generation\.
#### Semantic memory generation prompt\.
Figures [9](https://arxiv.org/html/2605.16045#A7.F9)–[11](https://arxiv.org/html/2605.16045#A7.F11) present the corresponding components for semantic memory generation\.
#### Episodic memory merging prompt\.
Figures [12](https://arxiv.org/html/2605.16045#A7.F12)–[14](https://arxiv.org/html/2605.16045#A7.F14) provide the prompt used to merge newly consolidated content into existing episodic memories\.
#### Answer prompt\.
Figures [15](https://arxiv.org/html/2605.16045#A7.F15) and [16](https://arxiv.org/html/2605.16045#A7.F16) report the role/goal and instruction components of the answer prompt used during evaluation\.
## Appendix G Licenses and Terms of Use
#### Licenses\.
We use publicly released benchmarks under their original licenses: LoCoMo \(CC BY\-NC 4\.0\) and LongMemEval\-S \(MIT License\)\. We do not redistribute these datasets; instead, we refer readers to their official releases\. For baselines, we use publicly available implementations under the licenses stated in their official repositories \(Mem0: Apache License 2\.0; A\-Mem: MIT License; MemoryOS: Apache License 2\.0\)\. We do not repackage or redistribute third\-party artifacts beyond what is permitted by their original licenses\.
#### Terms of Use\.
LoCoMo and LongMemEval\-S were released as research benchmarks for evaluating conversational assistants\. We use them strictly in the intended offline evaluation setting, following the benchmark protocols\. We do not redistribute the datasets and only report aggregated results, consistent with their stated licenses and access conditions\. RecMem is evaluated on these benchmarks, but its core mechanisms are applicable to a broader class of long\-running conversational agent settings and can be integrated into practical systems\. Real\-world deployment should be adapted to the target workflow and comply with applicable data licenses/terms and usage conditions\.
Figure 6: Episodic Memory Generation Role Description
Figure 7: Episodic Memory Generation Instruction
Figure 8: Episodic Memory Output Format
Figure 9: Semantic Memory Generation Role Description
Figure 10: Semantic Memory Generation Instruction
Figure 11: Semantic Memory Output Format
Figure 12: Episodic Merging Role Description
Figure 13: Episodic Merging Instruction
Figure 14: Episodic Merging Output Format
Figure 15: Answering Role Description
Figure 16: Answering Instruction