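@yibie: Recommending this one: teams from SJTU and Tsinghua systematically benchmarked 12 Agent memory systems. It is not a "our model is better" paper; it breaks down how to choose a memory system from a data management perspective: when to use RAG, when to use a vector database, and when to use a knowledge graph. Long-term memory for Agents…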

X AI KOLs Timeline 2026/06/26 01:07 Paper

agent-memory llm-agents data-management benchmark memory-systems retrieval-augmented knowledge-graphs

Summary

This paper from SJTU and Tsinghua systematically evaluates 12 agent memory systems from a data management perspective, decomposing memory into four modules and providing guidelines on when to use RAG, vector databases, or knowledge graphs for long-term agent memory.



Cached at: 2026/06/26 10:09

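Recommending this one: teams from SJTU and Tsinghua systematically benchmarked 12 Agent memory systems. It is not a "our model is better" paper; it breaks down how to choose a memory system from a data management perspective: when to use RAG, when to use a vector database, and when to use a knowledge graph.

How should an Agent's long-term memory be built? A systematic comparison of 12 memory systems

Paper: Are We Ready For An Agent-Native Memory System? Shanghai Jiao Tong University × Tsinghua × MemTensor, submitted June 23

The core framework splits Agent memory into 4 modules: representation and storage → extraction → retrieval and routing → maintenance. Each module is evaluated independently.

Key findings
• No single architecture is best in every scenario. How well the memory structure matches the workload bottleneck determines effectiveness
• Localized maintenance (updating only the affected parts) is far cheaper than global reorganization, and the results are still decent
• In the end-to-end evaluation of 12 systems on 5 benchmarks (11 datasets), the differences show up mainly in retrieval precision and long-horizon stability

Practical takeaways, if you are building an Agent that needs long-term memory:
• Simple QA scenarios → vector retrieval is enough
• Multi-hop reasoning → a knowledge graph layer is needed
• Frequently updated knowledge → localized maintenance beats global rebuilding

Paper: https://arxiv.org/abs/2606.24775 Code: https://github.com/MEMTRON/AgentMemory-Bench…

#Agent #LongTermMemory #SystemEvaluation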

Are We Ready For An Agent-Native Memory System?

Source: https://arxiv.org/html/2606.24775

Abstract.

Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, critical system-level concerns, including operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored.

In this paper, we present a systematic experimental study of agent memory from a data management perspective. We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance. Under this framework, we evaluate 12 representative memory systems and two reference baselines across five benchmark workloads spanning 11 datasets. Our extensive end-to-end evaluation shows that no single architecture dominates across all scenarios; instead, effectiveness depends heavily on how well the memory structure aligns with the workload bottleneck. Furthermore, through fine-grained ablation studies, we quantify the individual effects of each module on representation fidelity, retrieval precision, update correctness, and long-horizon stability. Finally, we reveal cost-performance trade-offs under realistic workloads, showing that localized maintenance is more cost-efficient than global reorganization. Based on these findings, we identify promising directions towards building truly agent-native memory systems. The code is publicly available at https://github.com/OpenDataBox/MemoryData.

1. Introduction

The rapid evolution of Large Language Model (LLM) agents has sparked a large body of research and industrial effort in building agent memory, i.e., the data management system of the LLM agent that supports long-horizon stateful execution and personalized interaction (Luo et al., 2026; Khan et al., 2025; Liu et al., 2026b; Singh et al., 2024; OpenAI, 2026; Microsoft, 2025; Google, 2025).

As shown in Figure 1, existing agent memory systems span a diverse set of architectural designs. (1) Stream-and-Reflection Memory System (e.g., MemoryBank (Zhong et al., 2024)) maintains experiences as timestamped memory streams and periodically summarizes them into higher-level reflections that are written back into the stream; (2) Hierarchical Tiered Memory System (e.g., MemGPT (Packer et al., 2023)) organizes memory into multiple levels with different capacities and access properties, separating core memory from archival storage with explicit movement (e.g., eviction and promotion) across tiers; (3) Knowledge Graph Memory System (e.g., $\text{Mem0}^{g}$ (Chhikara et al., 2025), Zep (Rasmussen et al., 2025)) represents entities, relations, and their temporal evolution in structured forms (e.g., temporal knowledge graphs), often incorporating entity disambiguation and conflict resolution; (4) Composite Hybrid Memory System (e.g., A-MEM (Xu and others, 2025)) routes schema-aware memory objects across multiple storage substrates, explicitly separating runtime state (e.g., KV caches) from long-term storage (e.g., vector, graph, keyword indexes), managed by dedicated maintenance modules. However, this rapid proliferation has also led to a highly fragmented landscape that lacks systematic evaluation from a data management perspective, raising a natural question: Are we ready for an agent-native memory system?

Figure 1. Typical Execution Workflows of Agent Memory.

In this paper, we revisit this question for agent memory. In particular, we focus on system-level memory over textual, structured, and even parametric representations (Chhikara et al., 2025; Packer et al., 2023; Rasmussen et al., 2025; Xu and others, 2025), a fundamental infrastructure component for modern autonomous agents. We focus on memory-centric systems, rather than task-specific agent frameworks where memory is an auxiliary module (1; Zhou et al., 2026). System-level memory is the persistent data management system that maintains information beyond a single inference step (e.g., historical interactions, environmental observations, and intermediate tool executions), decoupled from the LLM's parametric weights and volatile context window. Agent frameworks rely on these external memory systems (e.g., Mem0 (Chhikara et al., 2025), Letta (Packer et al., 2023), Zep (Rasmussen et al., 2025), and A-MEM (Xu and others, 2025)) to actively write, update, index, and route relevant context back into the reasoning loop. The capability of a long-horizon agent largely depends on the reliability and efficiency of this memory layer. An agent adopting a poorly designed memory architecture can suffer from factual contradictions, catastrophic forgetting, or unacceptable latencies during continuous execution (Du, 2026; Zheng et al., 2025b).

Recent benchmarks (Maharana et al., 2024; Wu et al., 2024; MemoryAgentBench Team, 2026; Tan et al., 2025) have evaluated agent memory and shown that external memory can improve agent performance on tasks requiring factual recall and long-context understanding. However, these evaluations are largely rooted in natural language processing and have multiple limitations when treating agent memory as a data management system (see Section 2). First, they fail to evaluate many representative memory architectures (e.g., systems such as MemoChat, MemTree, and LightMem have not been included in prior evaluations) under unified workloads, making principled cross-system comparisons difficult. Efforts from the database community (Wu et al., 2026) limit their scope to a few chatbot-centric datasets (e.g., LoCoMo and LongMemEval only), neglecting complex agentic execution scenarios. Second, existing benchmarks predominantly rely on single-sided, end-to-end task success metrics (e.g., F1 and BLEU scores) rather than a comprehensive evaluation suite. They fail to explicitly isolate and measure multi-dimensional performance indicators such as evidence-level retrieval fidelity, dynamic update robustness under conflicting knowledge, and long-horizon stability. Third, they rarely measure key operational costs from a systems perspective, such as index construction time and query latency, which are critical for production deployments. Last, they treat memory systems as monolithic black boxes rather than decomposing them into fundamental data management modules for isolated, fine-grained analysis.

We overcome these limitations and conduct comprehensive experiments and analyses from a data management perspective. The contributions are as follows.

(1) Technology Decomposition and Taxonomy (Section 3). We decompose existing agent memory systems into four core components: (i) memory representation and storage, (ii) memory extraction, (iii) memory retrieval and routing, and (iv) memory maintenance. For each component, we further establish a structured taxonomy (https://github.com/OpenDataBox/awesome-agent-memory) by categorizing existing approaches according to their underlying design principles, enabling principled comparisons.

(2) Overall End-to-End Performance Evaluation (Section 4). We conduct end-to-end evaluations under a unified and fair testbed (e.g., unified time-overhead traces; https://github.com/OpenDataBox/MemoryData) across five distinct benchmark workloads encompassing 11 datasets. Our study includes 12 representative memory systems, each embodying different combinations of representation, storage, routing, and maintenance strategies. We evaluate their performance from five perspectives: task effectiveness (RQ1), retrieval fidelity (RQ2), dynamic update robustness (RQ3), long-horizon stability (RQ4), and operational cost (RQ5).

(3) Fine-Grained Technical Component Evaluation (Section 5). Leveraging our four-module framework, we conduct controlled and fine-grained experiments on representative strategies within each technical component. By systematically generating controlled variants that modify one module at a time, we quantify their performance trade-offs and assess their individual impacts on representation fidelity, routing precision, and update correctness.

(4) Insightful Findings. Based on our experimental results and in-depth analysis, we distill a set of findings regarding the cost-performance trade-offs of agent memory systems:

❶ Are Memory Systems Effective Across Different Agent Request Workloads? No single memory architecture dominates all scenarios. Composite hybrid systems lead on conversational QA, while graph-based methods excel in single-hop factual recall but struggle with temporal reasoning. Moreover, effective memory systems remain robust across LLM backbone variants because they externalize evidence localization before answer generation.

❷ How Accurately Do Memory Systems Retrieve Stored Evidence? Explicit query planning and balanced hybrid search maximize contextual relevance. However, retrieval accuracy degrades significantly as the temporal distance between the evidence and the query increases, exposing limitations of similarity-based retrieval.

❸ Are Memory Systems Robust Under Dynamic Updates? Graph-based methods handle knowledge updates most reliably, whereas popular fact-extraction plugins and append-only stores struggle with targeted overwrites. Systems lacking lifecycle management return stale facts, leading to "hallucinations of the past".

❹ Do Memory Systems Remain Stable Over Long Horizons? Many append-only memory stores suffer from catastrophic degradation as evidence becomes more distant. For time-dependent queries, raw long-context retrieval still outperforms most memory-backed approaches, indicating that standard semantic consolidation often destroys crucial chronological cues.

❺ What Are the Operational Costs of Agent Memory? Highly structured systems incur orders-of-magnitude higher index construction time and query latency than lightweight stores, yet do not consistently deliver proportional accuracy gains.

❻ When Do Individual Memory Components Go Wrong? Each layer of abstraction (e.g., compression, summarization, and fact extraction) progressively discards information. Furthermore, fine-grained LLM-based extraction can yield modest precision gains but substantially degrade multi-hop reasoning. Finally, conservative memory consolidation serves as the best default maintenance strategy, whereas delayed flushing creates a deceptive trade-off between surface-level coverage and actual answerability.

2. Preliminaries

To support the discussion in the rest of this paper, we first clarify the scope of agent memory from a data management perspective. Although recent studies have examined memory from viewpoints such as cognitive taxonomy, agent architecture, and graph-based organization (Zhang et al., 2025; Du, 2026; Hu et al., 2025; Wu et al., 2026; Tang and others, 2026; Yang et al., 2026), the underlying concept is still often treated primarily as an algorithmic component of the LLM or agent pipeline (Zhang et al., 2025; Du, 2026; Hu et al., 2025). In contrast, we study agent memory as a standalone data management object and system infrastructure, with explicit attention to how it is represented, stored, retrieved, updated, and maintained under real agent workloads. Under this view, we introduce a set of definitions below.

Table 1. Taxonomy and Characteristics of Agent Memory Systems.

| Category | Method | Memory Representation & Storage: Representation | Memory Representation & Storage: Storage | Memory Extraction | Memory Retrieval & Query Routing | Memory Maintenance |
|---|---|---|---|---|---|---|
| Sequential Context | MemoChat (Lu et al., 2023) | ❶ Token-Level Sequence (Structured JSON Memos) | ❶ Transient In-Context Registers | ❸ Schema-Constrained Extraction (LLM Topic Segmentation) | ❹ Autonomous Agentic Routing (LLM Topic Selection) | ❸ LLM-Driven Semantic Consolidation (Turn-Triggered) |
| Sequential Context | Mem0 (Chhikara et al., 2025) | ❶ Token-Level Sequence (Discrete Facts) | ❷ Specialized Single-Engine (Vector DB) | ❷ Schema-Free Extraction | ❷ Semantic-Based Retrieval | ❸ LLM-Driven Semantic Consolidation (Tool-Calling) |
| Sequential Context | MEM1 (Zhou et al., 2025) | ❶ Token-Level Sequence | ❶ Transient In-Context Registers | ❶ Raw Sequence Concatenation | ❶ Native Attention-Based Retrieval | ❷ Capacity-Driven Physical Eviction |
| Sequential Context | MemAgent (Yu et al., 2025) | ❶ Token-Level Sequence | ❶ Transient In-Context Registers | ❶ Raw Sequence Concatenation (Recursive Summaries) | ❶ Native Attention-Based Retrieval | ❷ Capacity-Driven Physical Eviction (RL Overwrite) |
| Structural Topological | MemTree (Rezazadeh et al., 2025) | ❷ Graph & Tree-Based Topology (Hierarchical Tree) | ❷ Specialized Single-Engine (Vector DB) | ❷ Schema-Free Extraction (Top-Down Embedding) | ❷ Semantic-Based Retrieval (Collapsed Tree) | ❸ LLM-Driven Semantic Consolidation (Recursive Aggregation) |
| Structural Topological | Zep (Rasmussen et al., 2025) | ❷ Graph & Tree-Based Topology (Temporal KG) | ❷ Specialized Single-Engine (Graph DB) | ❸ Schema-Constrained Extraction (Triplets) | ❺ Multi-Stage Hybrid Execution (Dense + BM25 + BFS) | ❶ Timestamp-Based Multi-Versioning (Logical Invalidation) |
| Structural Topological | $\text{Mem0}^{g}$ (Chhikara et al., 2025) | ❷ Graph & Tree-Based Topology (Labeled Graph) | ❸ Heterogeneous Multi-Engine (Vector + Graph DB) | ❸ Schema-Constrained Extraction (Entity-Relation) | ❸ Topological Subgraph Traversal | ❶ Timestamp-Based Multi-Versioning |
| Structural Topological | Cognee (Markovic et al., 2025) | ❷ Graph & Tree-Based Topology (Entity-Relation Triplets) | ❸ Heterogeneous Multi-Engine (Graph + Vector + Relational DB) | ❸ Schema-Constrained Extraction (ECL Pipeline via Pydantic) | ❸ Topological Subgraph Traversal (Dense-Seeded Triplet Extraction) | ❶ Timestamp-Based Multi-Versioning (Hash-Based Deduplication) |
| Multi-Paradigm Hybrid | LightMem (Fang et al., 2025) | ❸ Heterogeneous Composite (Tripartite Schema) | ❷ Specialized Single-Engine (Relational DB) | ❷ Schema-Free Extraction (Entropy-Gated) | ❷ Semantic-Based Retrieval | ❶ Timestamp-Based Multi-Versioning (Append-Only Logs) |
| Multi-Paradigm Hybrid | SimpleMem (Liu et al., 2026a) | ❸ Heterogeneous Composite | ❸ Heterogeneous Multi-Engine (Vector DB + BM25 + SQL) | ❸ Schema-Constrained Extraction | ❹ Autonomous Agentic Routing (Query Expansion) | ❸ LLM-Driven Semantic Consolidation (On-the-Fly Synthesis) |
| Multi-Paradigm Hybrid | MemOS (Li et al., 2025) | ❸ Heterogeneous Composite (MemCube) | ❸ Heterogeneous Multi-Engine (Vector + Graph DB) | ❸ Schema-Constrained Extraction (Semantic Parser) | ❺ Multi-Stage Hybrid Execution (Boolean + Semantic) | ❶ Timestamp-Based Multi-Versioning (Differential Writes) |
| Multi-Paradigm Hybrid | MemoryOS (Kang et al., 2025b) | ❸ Heterogeneous Composite (Segment-Page) | ❸ Heterogeneous Multi-Engine (Keyword Index + Vector DB) | ❸ Schema-Constrained Extraction | ❺ Multi-Stage Hybrid Execution (Hierarchical Routing) | ❷ Capacity-Driven Physical Eviction (Heat-Based Eviction) |
| Multi-Paradigm Hybrid | A-MEM (Xu and others, 2025) | ❸ Heterogeneous Composite (Atomic Notes) | ❸ Heterogeneous Multi-Engine (Vector + Graph DB) | ❸ Schema-Constrained Extraction (JSON Attributes) | ❸ Topological Subgraph Traversal | ❸ LLM-Driven Semantic Consolidation (Mutation & Pruning) |
| Multi-Paradigm Hybrid | Letta (Packer et al., 2023) | ❸ Heterogeneous Composite (Context Tiers) | ❷ Specialized Single-Engine (Relational DB) | ❸ Schema-Constrained Extraction | ❹ Autonomous Agentic Routing (Function Calling) | ❷ Capacity-Driven Physical Eviction (Queue Flush) |

Memory Types. For an LLM agent, a wide variety of information is produced and may need to be memorized, including dialogue history, tool execution logs, distilled facts, and user preferences (Zhang et al., 2025). Following established cognitive frameworks, memory can be broadly organized along two axes (Hu et al., 2025; Tang and others, 2026). (1) Along the temporal axis, short-term memory holds the volatile state of an ongoing session, while long-term memory persists across sessions. (2) Along the functional axis, long-term memory is further divided into information such as concrete past events (episodic memory), abstracted factual knowledge (semantic memory) (Hu et al., 2025; Wu et al., 2026), reusable action strategies (procedural memory), and user preferences.

Agent Memory. We define the agent memory $\mathcal{M}$ as the persistent data management object (Packer et al., 2023; Wu et al., 2026) that maintains this cumulative state beyond a single inference step and makes it accessible to the agent during future reasoning and action (Zhang et al., 2025; Hu et al., 2025).

Agent Memory System. To operationalize agent memory $\mathcal{M}$, a robust infrastructure is required (Packer et al., 2023; Hu et al., 2025). As shown in Table 1, from a data systems perspective (Wu et al., 2026; Li et al., 2024), we formalize the agent memory system as a tuple of four modules: $\mathcal{M}_{sys}=\langle\mathcal{R},\mathcal{S},\mathcal{Q},\mathcal{U}\rangle$, where each module governs a distinct phase of the memory lifecycle.

• (1) Memory Representation and Storage $\mathcal{R}$: A mapping that defines the logical and physical memory format with a data model of two facets: (a) logical representation, spanning simple primitives (discrete tokens, continuous vectors) to complex topologies (knowledge graphs, trees, and composites); and (b) physical storage, utilizing transient registers, specialized single-engine databases, or multi-engine backends for persistence and indexing.

• (2) Memory Extraction $\mathcal{S}$: A mechanism governing how heterogeneous input streams (e.g., multi-turn dialogues, tool logs) are transformed into logical memory primitives via pipelines such as raw sequence concatenation, schema-free semantic extraction, or schema-constrained structured extraction.

• (3) Memory Retrieval and Routing $\mathcal{Q}$: A function that dynamically identifies relevant memory subsets based on a query context, utilizing specific routing algorithms to traverse indices. Mechanisms span native attention-based retrieval, semantic $K$-nearest neighbor search, topological subgraph traversal, autonomous agentic routing via LLM planning, and multi-stage hybrid execution.

• (4) Memory Maintenance $\mathcal{U}$: Policies governing the dynamic lifecycle of memory entries, decomposed into three sub-operations: (a) Conflict Resolution and Versioning handles contradictions via multi-versioning, invalidation, or precedence rules; (b) Capacity Management enforces bounded growth through constraint-based hard eviction (e.g., FIFO, token limits) or score-based priority eviction (e.g., temporal decay); and (c) Semantic Consolidation utilizes the LLM to merge redundant assertions into dense summaries or execute CRUD operations via tool-calling interfaces.
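To make the four-module tuple concrete, the following is a minimal sketch of one toy system that exposes all four roles. Everything in it is an illustrative assumption rather than any evaluated system's design: the class and method names (`ToyMemorySystem`, `extract`, `retrieve`, `maintain`), the `key=value` input format, and the keyword-match retrieval are hypothetical, chosen only to show where $\mathcal{R}$, $\mathcal{S}$, $\mathcal{Q}$, and $\mathcal{U}$ sit in code.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """One stored memory unit (hypothetical shape, not taken from the paper)."""
    key: str
    text: str
    timestamp: int
    valid: bool = True


class ToyMemorySystem:
    """Toy instance of M_sys = <R, S, Q, U> backed by an in-process list."""

    def __init__(self, capacity: int = 100) -> None:
        self.capacity = capacity
        # R: discrete text entries (logical) kept in a Python list (physical).
        self.entries: list[MemoryEntry] = []

    def extract(self, raw: str, timestamp: int) -> list[MemoryEntry]:
        # S: parse "key=value" statements separated by ';' into entries.
        # A real system would call an LLM here; this parser is a stand-in.
        out = []
        for part in raw.split(";"):
            if "=" in part:
                key, value = part.split("=", 1)
                out.append(MemoryEntry(key.strip(), value.strip(), timestamp))
        return out

    def maintain(self, new_entries: list[MemoryEntry]) -> None:
        # U(a): conflict resolution by logically invalidating older entries
        # that share a key with an incoming entry.
        for new in new_entries:
            for old in self.entries:
                if old.valid and old.key == new.key:
                    old.valid = False
            self.entries.append(new)
        # U(b): capacity management with hard FIFO eviction.
        while len(self.entries) > self.capacity:
            self.entries.pop(0)

    def retrieve(self, query: str) -> list[str]:
        # Q: return valid entries whose key occurs in the query text.
        q = query.lower()
        return [e.text for e in self.entries if e.valid and e.key.lower() in q]


mem = ToyMemorySystem(capacity=3)
mem.maintain(mem.extract("city=Paris; diet=vegetarian", timestamp=1))
mem.maintain(mem.extract("city=Berlin", timestamp=2))
print(mem.retrieve("Which city does the user live in?"))
```

Separating `maintain` from `extract` mirrors the decomposition above: the same extracted entries could be fed to a different maintenance policy (for example, score-based eviction) without touching the other modules.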

Distinction from RAG and Context Engineering. Retrieval-Augmented Generation (RAG) (Gao et al., 2023; Khan et al., 2025) typically operates as a stateless, read-only retrieval primitive: given a query, it fetches relevant passages from a static corpus to augment a single generation step. Context engineering (Anthropic Engineering, 2025) is the broader practice of curating the finite LLM context window at each inference turn (e.g., dynamically selecting prompts, tool descriptions, and retrieved facts) to mitigate context rot (Anthropic Engineering, 2025). In contrast, an agent memory system (1) is a persistent and updatable infrastructure for managing agent-specific state over time and (2) governs the full long-term memory lifecycle, including memory representation, storage, retrieval, and maintenance, rather than merely packing the current context window.

Distinction from Traditional Database Workloads. Agent memory workloads differ substantially from conventional database OLTP / OLAP workloads (Packer et al., 2023; Liu et al., 2026b). First, memory access is often semantic rather than purely predicate-based (Kang et al., 2025a; Caminal and others, 2025). Queries are commonly expressed through natural language, partial context, or latent intent, and therefore rely on approximate matching, query rewriting, or LLM-guided retrieval rather than only exact logical predicates over rigid schemas. Second, memory contents evolve under continuous and potentially conflicting observations. Unlike conventional transactional settings, where updates typically overwrite tuples under a predefined schema and consistency model, agent memory must accommodate uncertain, partial, and sometimes contradictory information collected across time, tools, and environments (Hu et al., 2025; Zheng et al., 2025b). Third, agent memory workloads are highly heterogeneous in both access pattern and granularity. A single workload may combine long-context synthesis, episodic recall, structured fact lookup, temporal reasoning, and streaming updates. As a result, practical systems often require hybrid execution strategies that combine semantic retrieval, structured filtering, and topology-aware traversal within one memory architecture (Wu et al., 2026; Packer et al., 2023). These properties distinguish agent memory from traditional databases, motivating dedicated abstractions and evaluation methodologies.

3. Method Overview

In this section, we analyze existing agent memory systems across the four components defined in Section 2 and establish a unified taxonomy that summarizes representative component methods.

3.1. Memory Representation and Storage

Figure 2. Memory Representation Methods.

This module consists of two components: (1) logical representation, which defines the structural encoding and organization exposed to the agent system, directly dictating capacity, accessibility, and trade-offs in expressiveness, retrieval granularity, and downstream reasoning compatibility; and (2) physical storage, which designates the persistence and indexing structures, such as volatile in-context registers, dense vector engines, or topological graph databases.

3.1.1. Logical Representation

As displayed in Figure 2, this component acts as a bridge between raw data and the execution environment by organizing memory into explicit models, such as graphs or vector spaces. It determines how efficiently a system can search, combine, and use historical context for complex tasks.

❶ Token-Level Sequence Representation. This category models memory as flat, one-dimensional sequences lacking explicit structural abstractions (e.g., graphs or hierarchies). Memory is represented either as discrete, human-readable natural language tokens or as implicit, continuous latent vector tokens (e.g., fact embeddings, hidden states, or KV-cache tensors).

▶ Explicit Discrete Text Token. This category models memory as human-readable strings or independent factual statements. For instance, Mem0 isolates memory into discrete natural language facts extracted directly from interaction history. Similarly, MemoChat structures multi-turn dialogues into discrete JSON blocks (topics, summaries, raw turns) to maintain topical coherence within a plain-text paradigm. While these systems externalize their plain-text memory, others retain it within the active processing window: MemAgent restricts its internal belief state to a strictly bounded text sequence (e.g., 1024 tokens), and MEM1 encapsulates internal-state summaries within specialized boundary tags (e.g., <IS>).

▶ Implicit Continuous Vector Token. Departing from readable text tokens, this sub-category encodes memory as continuous vectors, which may be materialized either as external embeddings attached to facts and summaries for semantic retrieval, or as model-side latent states such as compressed internal states and attention caches. For example, Mem0 represents extracted facts as dense semantic embeddings, and MemoRAG utilizes specifically initialized weight matrices to compress raw inputs into high-dimensional Key-Value (KV) cache tensors. Although these vector-token representations reduce explicit tokenization burdens and integrate naturally with retrieval or inference pipelines, they sacrifice structural interpretability and are difficult to manipulate via fine-grained operations, such as predicate-level filtering or targeted updates to encoded facts.

❷ Graph and Tree-Based Topological Representation. This category abstracts memory into structured graph and tree topologies with interconnected nodes and edges, allowing conversational entities, high-level concepts, and their temporal or semantic relationships to be explicitly modeled and computationally traversed.

Figure 3. Memory Storage Methods.

▶ Temporal Knowledge Graphs. This sub-category models memory using graph topologies to map entities and their interconnections, natively supporting temporal reasoning and conflict detection. For example, Zep partitions memory into formally defined, temporally-aware knowledge graphs (e.g., episode, entity, and community subgraphs). Similarly, $\text{Mem0}^{g}$ formalizes memory as a directed labeled graph, where vertices represent entities and edges encapsulate relationship triplets (e.g., "LIVES_IN"). To aid temporal reasoning, entity nodes are enriched with structural metadata like semantic types, dense embeddings, and creation timestamps.
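As a rough illustration of how timestamps on a labeled graph enable conflict detection and point-in-time queries, the sketch below stores relationship triplets as edges with a validity interval. It is not the implementation of Zep or $\text{Mem0}^{g}$: the names `TemporalEdge`, `TemporalGraph`, `assert_fact`, and `query` are hypothetical, and the rule that each (subject, relation) pair holds at most one object at a time is a simplifying assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemporalEdge:
    """A directed, labeled edge with a validity interval (hypothetical shape)."""
    subject: str
    relation: str
    obj: str
    valid_from: int
    invalid_at: Optional[int] = None  # None means "still believed true"


class TemporalGraph:
    def __init__(self) -> None:
        self.edges: list[TemporalEdge] = []

    def assert_fact(self, subject: str, relation: str, obj: str, at: int) -> None:
        """Add a fact; logically invalidate a conflicting current edge.

        Assumption: a (subject, relation) pair is single-valued, so a new
        object contradicts the old one. Old edges are kept, not deleted.
        """
        for edge in self.edges:
            if (edge.subject, edge.relation) != (subject, relation):
                continue
            if edge.invalid_at is not None:
                continue
            if edge.obj == obj:
                return  # identical fact already current; nothing to do
            edge.invalid_at = at
        self.edges.append(TemporalEdge(subject, relation, obj, at))

    def query(self, subject: str, relation: str, as_of: int) -> list[str]:
        """Return objects that were valid for (subject, relation) at as_of."""
        return [
            e.obj
            for e in self.edges
            if e.subject == subject
            and e.relation == relation
            and e.valid_from <= as_of
            and (e.invalid_at is None or as_of < e.invalid_at)
        ]


graph = TemporalGraph()
graph.assert_fact("alice", "LIVES_IN", "Paris", at=1)
graph.assert_fact("alice", "LIVES_IN", "Berlin", at=5)
print(graph.query("alice", "LIVES_IN", as_of=3))  # state before the move
print(graph.query("alice", "LIVES_IN", as_of=7))  # state after the move
```

Keeping the invalidated edge instead of overwriting it is what lets the same store answer both "where does Alice live now?" and "where did Alice live at time 3?".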

▶ Hierarchical Tree Structures. This sub-category organizes knowledge into recursive, hierarchical structures, preserving highly granular observations at terminal leaves and broad semantic abstractions at ancestor nodes. For example, MemTree models memory as a dynamic, directed tree schema. Each node is structured as a tuple containing textual content, a dense embedding, topological pointers, and a depth scalar. Within this topology, deep leaf nodes retain isolated facts (e.g., a player scoring), while ancestor nodes provide high-level conceptual summaries (e.g., the match result), with a specialized root node serving as the definitive entry point.
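The node tuple described above (content, embedding, pointers, depth) maps naturally onto a small record type. The sketch below is an assumption-laden illustration, not MemTree's code: `TreeNode`, `add_child`, and `collapse` are hypothetical names, and the two-dimensional embeddings are placeholder values.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TreeNode:
    """Node tuple: text content, dense embedding, pointers, and depth."""
    content: str
    embedding: list[float]
    depth: int = 0
    # Pointers are excluded from repr/eq to avoid walking the whole tree.
    parent: Optional[TreeNode] = field(default=None, repr=False, compare=False)
    children: list[TreeNode] = field(default_factory=list, repr=False, compare=False)

    def add_child(self, content: str, embedding: list[float]) -> TreeNode:
        """Attach a child one level deeper and return it."""
        child = TreeNode(content, embedding, depth=self.depth + 1, parent=self)
        self.children.append(child)
        return child


def collapse(node: TreeNode) -> list[TreeNode]:
    """Flatten the subtree rooted at node into a pre-order list."""
    flat = [node]
    for child in node.children:
        flat.extend(collapse(child))
    return flat


# Root is the entry point; ancestors summarize, leaves hold isolated facts.
root = TreeNode("ROOT", [0.0, 0.0])
summary = root.add_child("Home team won the match 2-1", [0.9, 0.1])
leaf = summary.add_child("Player scored in minute 88", [0.8, 0.3])
print([(n.content, n.depth) for n in collapse(root)])
```

A flattening helper such as `collapse` is one simple way to expose every node, summary or leaf, to a single similarity search over the whole tree.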

❸ Heterogeneous Composite Representation. This category moves past simple token sequences and standard graphs by packaging memory into complex, multi-part data containers. These architectures directly combine unstructured text with highly structured metadata (e.g., timestamps, categorical labels, vector embeddings, and network links) to form a single functional unit. For example, MemOS proposes the MemCube, a unified data object that organizes memory into three distinct payloads (plain-text, activation, and parametric memory) alongside structured details (e.g., ID tags).

3.1.2. Physical Storage and Indexing

As shown in Figure 3, this component manages how data is physically stored and accessed, relying on systems like in-memory caches, files, vector engines, or databases. It sets the actual capacity limits and determines the speed, throughput, and overall scalability of memory operations.

❶ Transient In-Context Register. To eliminate disk I/O and external traversal latency, this category retains memory exclusively within the active hardware state (e.g., dynamic context windows or KV caches). MemoChat avoids dedicated external memory engines and keeps structured JSON-style memos within the LLM context input during its memorization-retrieval-response loop, while MemAgent directly stores summary tokens as Key-Value (KV) cache tensors via dense positional embeddings.

Figure 4. Memory Extraction Methods.

❷ Specialized Single-Engine Storage. This category physically warehouses formulated units within a standalone, homogeneous backend strictly tailored to the memory's logical structure. Depending on the ingestion paradigm, architectures deploy specific backend topologies: (1) Dense Vector Databases are utilized to project data into continuous high-dimensional spaces; Mem0 and MemTree use centralized vector stores, while Letta leverages PostgreSQL with the pgvector extension; (2) Graph Databases are deployed to enforce topological constraints; both Zep and $\text{Mem0}^{g}$ execute predefined Cypher queries to physically persist logical graph components into Neo4j; (3) Relational SQL Engines are used to serialize structural and temporal schemas; LightMem incrementally appends factual streams to preserve global relational states; (4) File or Object Stores preserve raw interaction artifacts (e.g., conversation histories or tool-execution logs) as files or object blobs.

❸ Heterogeneous Multi-Engine Storage. This category dynamically constructs multiple index types or distributes data across heterogeneous backends (e.g., pairing a dense vector store with a topological graph database). SimpleMem ingests memory into LanceDB with an IVF-PQ mechanism that concurrently maintains dense embeddings, sparse BM25 indices, and SQL predicates. MemoryOS relies on a hybrid index fusing dense cosine similarity with discrete Jaccard similarity. Conversely, MemOS delegates serialized payloads to highly specialized independent backends, fusing Vector and Graph databases via a standardized memory adapter interface.
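The idea of fusing a dense cosine score with a discrete Jaccard score can be sketched in a few lines. The linear combination and the weight `alpha` below are assumptions chosen for illustration; the text does not specify how MemoryOS actually combines the two signals, and `hybrid_score` is a hypothetical helper name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two keyword sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_score(
    query_vec: list[float],
    query_keywords: set[str],
    doc_vec: list[float],
    doc_keywords: set[str],
    alpha: float = 0.5,  # assumed weight; not taken from the source
) -> float:
    """Weighted sum of dense and keyword similarity."""
    dense = cosine(query_vec, doc_vec)
    sparse = jaccard(query_keywords, doc_keywords)
    return alpha * dense + (1.0 - alpha) * sparse


# A document matching on both signals outranks one matching only on keywords.
both = hybrid_score([1.0, 0.0], {"trip", "paris"}, [1.0, 0.0], {"trip", "paris"})
keywords_only = hybrid_score([1.0, 0.0], {"trip", "paris"}, [0.0, 1.0], {"trip", "paris"})
print(both > keywords_only)
```

The keyword term gives exact-match terms a say even when embeddings are close for unrelated reasons, which is the usual motivation for pairing the two indices.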

3.2. Memory Extraction

Memory extraction concerns how raw interaction traces are computationally processed. It covers the extraction pipeline, i.e., how language models extract, summarize, or parse unstructured text into logical structures. As shown in Figure 4, it defines how the agent memory system transforms heterogeneous input streams (e.g., multi-turn dialogues and tool execution logs) into logical memory primitives prior to physical persistence.

❶ Raw Sequence Concatenation. To minimize computational overhead, this category bypasses explicit extraction prompts, formulating memory directly as raw token concatenations or transient state summaries (e.g., appending recent dialogue turns directly into a prompt buffer). Systems such as MEM1 and MemAgent retain their newly formulated structures exclusively within the active computational state without secondary parsing.

❷ Schema-Free Semantic Extraction. This category systematically distills raw, unstructured inputs into independent, high-value informational units, representing them either as explicit free-form texts or as compressed, continuous latent vectors. By isolating core knowledge from broader conversational context, it supports precise and granular retrieval. For example, Mem0 actively parses interactions to extract and store discrete, standalone factual statements (e.g., "User is vegetarian and dairy-free").

Figure 5. Memory Retrieval Methods.

❸ Schema-Constrained Structured Extraction. This category prompts the LLM to parse raw inputs and synchronously populate a rigidly predefined structural schema, producing strictly typed data rather than free-form text. The constrained output takes the form of either topological entity-relation triplets for graph insertion or multi-modal relational payloads for hybrid storage, depending on the target backend. Zep and Mem0$^{g}$ extract typed directed relational edges (e.g., LIVES_IN, WORKS_AT) conforming to predefined graph schemas, with Zep additionally applying a reflection-inspired verification step to suppress hallucinated triplets. MemoChat populates predefined structural fields to ensure data predictability by leveraging LLMs to segment conversations into strict JSON schemas.
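As a minimal sketch of schema-constrained extraction, the snippet below validates LLM-emitted triplets against a predefined edge schema before graph insertion. The `SCHEMA` table, the JSON field names (`head`, `relation`, `tail`, `type`, `name`), and `parse_triplets` are illustrative assumptions, not the interfaces of Zep or Mem0$^{g}$.

```python
import json

# Hypothetical edge schema: relation -> (required head type, required tail type).
SCHEMA = {
    "LIVES_IN": ("Person", "City"),
    "WORKS_AT": ("Person", "Organization"),
}

def parse_triplets(llm_output):
    """Keep only triplets whose relation and entity types match SCHEMA."""
    accepted = []
    for item in json.loads(llm_output):
        relation = item.get("relation")
        if relation not in SCHEMA:
            continue  # relation outside the predefined schema is rejected
        head_type, tail_type = SCHEMA[relation]
        head, tail = item.get("head", {}), item.get("tail", {})
        if (head.get("type") == head_type and tail.get("type") == tail_type
                and "name" in head and "name" in tail):
            accepted.append((head["name"], relation, tail["name"]))
    return accepted

raw = json.dumps([
    {"head": {"type": "Person", "name": "Ana"}, "relation": "LIVES_IN",
     "tail": {"type": "City", "name": "Lisbon"}},
    {"head": {"type": "Person", "name": "Ana"}, "relation": "LIKES",
     "tail": {"type": "Food", "name": "Tofu"}},
    {"head": {"type": "City", "name": "Lisbon"}, "relation": "WORKS_AT",
     "tail": {"type": "Organization", "name": "Acme"}},
])
triplets = parse_triplets(raw)
```

Only the first triplet survives here: the second uses an undeclared relation and the third violates the head type, which is the sense in which the output is strictly typed rather than free-form.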

3.3.Memory Retrieval and Query Routing

Memory retrieval and query routing determine how the agent memory system dynamically identifies and extracts relevant historical context to inform the overarching agent’s current reasoning state. As shown in Figure 5, this module encompasses the complete query execution spectrum, defining the operational algorithms, predicate evaluations, and agentic workflows utilized to traverse indices.

❶ Native Attention-Based Retrieval. To bypass external database I/O, this category uses the transformer’s native computational graph as the sole retrieval engine, relying entirely on self-attention mechanisms to implicitly weight and route information (e.g., scanning dialogue tokens directly within the KV cache). MEM1 performs implicit retrieval via self-attention over the current sequence, utilizing a two-dimensional attention mask to preserve causal consistency. MemAgent implements routing by concatenating blocks directly into the prompt template, enabling standard attention-based decoding without external cross-encoder reranking.

❷ Semantic-Based Dense Retrieval. Operating over continuous latent spaces, this category maps query tensors against uniform vector indices to extract localized spatial neighbors (e.g., executing a standard $K$-Nearest Neighbors (KNN) search). Mem0 calculates vector embeddings for incoming queries to execute a dense similarity search, fetching a constrained subset of facts. LightMem utilizes efficient cosine-similarity distance calculations over dense embeddings, bypassing computationally expensive iterative reranking. MemTree implements a collapsed-tree architecture that mathematically flattens its hierarchy, broadcasting inbound vectors to compute global cosine-similarity distributions across all candidates.
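The dense retrieval step can be illustrated with a brute-force cosine-similarity KNN search. This is a sketch under simplifying assumptions: the three-dimensional embeddings are toy values, and `knn_search` is a hypothetical helper rather than the API of any system above, which would typically use an approximate index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_search(query, memory, k=2):
    """Return the k facts whose embeddings are most similar to the query.

    memory is a list of (fact_text, embedding) pairs.
    """
    scored = [(cosine(query, emb), text) for text, emb in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy embeddings, chosen by hand for illustration only.
memory = [
    ("User is vegetarian", [0.9, 0.1, 0.0]),
    ("User lives in Berlin", [0.0, 1.0, 0.1]),
    ("User avoids dairy", [0.8, 0.2, 0.1]),
]
top = knn_search([1.0, 0.0, 0.0], memory, k=2)
```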

❸ Topological Subgraph Traversal. Departing from continuous vector spaces, this category retrieves information by traversing explicit relationship edges to extract semantic clusters structurally grounded in knowledge graphs (e.g., hopping from a User node to a linked Preference node). Mem0$^{g}$ deploys an entity-centric heuristic to recursively traverse local subgraphs synchronously with semantic triplet evaluations. A-MEM identifies candidate anchors via dense $K$-Nearest Neighbor selection, then executes localized graph traversal to access topologically adjacent memory nodes explicitly linked within the same conceptual cluster.
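A hop-limited breadth-first expansion captures the anchor-then-traverse pattern described above. The sketch assumes the anchors have already been selected (e.g., by a dense KNN step) and that the graph is a plain adjacency dictionary; `expand_from_anchors` and the node names are hypothetical.

```python
from collections import deque

def expand_from_anchors(graph, anchors, max_hops=1):
    """Breadth-first expansion from anchor nodes, limited to max_hops edges."""
    visited = {anchor: 0 for anchor in anchors}  # node -> hop distance, insertion-ordered
    queue = deque(anchors)
    while queue:
        node = queue.popleft()
        if visited[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited[neighbor] = visited[node] + 1
                queue.append(neighbor)
    return list(visited)

graph = {
    "User": ["Preference:vegetarian", "City:Berlin"],
    "Preference:vegetarian": ["Recipe:tofu"],
}
one_hop = expand_from_anchors(graph, ["User"], max_hops=1)
two_hop = expand_from_anchors(graph, ["User"], max_hops=2)
```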

❹ Autonomous Agentic Routing. Rather than executing deterministic database scans, this category delegates retrieval to the LLM itself, which functions as an active, autonomous query planner that generates tool-call invocations or drafts implicit search criteria.

▶ Function Call Invocation. This sub-category bridges the LLM with external storage by generating explicit function call commands to directly execute predefined database operations (e.g., outputting a valid JSON payload to trigger an external database API). For example, Letta orchestrates self-directed memory retrieval where the LLM evaluates its active context to explicitly generate localized function calls (e.g., emitting an archival_storage.search() command to extract targeted historical logs).

▶ Generative Query Expansion. Unlike rigid function calling, this approach uses natural language generation to synthesize intermediate clues or decompose complex intents before mapping them to the index (e.g., rewriting vague prompts into descriptive search strings). SimpleMem uses an Intent-Aware Retrieval Planning module where the LLM dissects queries, calculates adaptive search depths, and synthesizes optimized query variants.

Figure 6. Memory Maintenance Methods.

❺ Multi-Stage Hybrid Execution. To overcome the recall limitations of single-paradigm searches, this category executes multi-engine query pipelines that orchestrate multi-dimensional candidate generation followed by downstream reranking frameworks.

▶ Sequential Hybrid Routing. This sub-category chains retrieval paradigms into a strictly ordered pipeline, systematically pruning the search space with deterministic predicates before executing fine-grained semantic extraction (e.g., applying strict SQL date filters before computationally expensive vector searches). MemoryOS executes a federated routing strategy featuring coarse-grained predicate evaluation followed by fine-grained semantic ranking strictly within isolated segments. It algebraically fuses rule-based structural Boolean filtering with dense semantic similarity routing.

▶ Parallel Ensemble Retrieval. In contrast to sequential filtering, this approach maximizes initial recall by simultaneously dispatching queries to multiple distinct indexing algorithms, followed by a late-stage fusion and reranking phase to optimize the aggregated pool (e.g., concurrently fetching candidates via BM25 and dense vector search, then cross-encoding the results). Zep executes simultaneous cosine semantic scans, Okapi BM25 full-text searches, and topological BFS, subsequently optimizing precision via RRF, MMR, and computationally intensive cross-encoder models.
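The late-stage fusion step can be sketched with Reciprocal Rank Fusion (RRF), which scores each candidate as the sum of $1/(k + \text{rank})$ over the ranked lists that contain it. The constant $k = 60$ is a commonly used default and is assumed here; the three input rankings are made up, and this is not Zep's actual pipeline.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; rank positions start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense = ["m3", "m1", "m2"]   # hypothetical cosine-similarity ranking
bm25 = ["m1", "m4", "m3"]    # hypothetical BM25 ranking
graph = ["m1", "m3"]         # hypothetical BFS ranking
fused = reciprocal_rank_fusion([dense, bm25, graph])
```

Because RRF uses only rank positions, it needs no score calibration across engines, which is one reason it suits heterogeneous retrievers.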

3.4.Memory Maintenance

Memory maintenance concerns how memory is updated, maintained, compressed, forgotten, and eventually removed over time. As shown in Figure 6, it captures the dynamic behavior of memory after it has been created, including how new information is incorporated, how outdated or conflicting content is revised, and how the system controls memory growth under limited resources.

Figure 7. Effectiveness of Memory Systems over LoCoMo, MemoryAgentBench (LongMemEval), and LifelongAgentBench (DB-Bench).

❶ Timestamp-Based Multi-Versioning. Rather than executing physical row deletions, this category preserves historical continuity by utilizing timestamp metadata and append-only logs to logically deprecate expired facts. Operating via explicit metadata mutations, Zep and Mem0$^{g}$ avoid physical deletion by marking obsolete or conflicting relationships as logically invalid using validity flags and timestamps. Taking an append-only approach, LightMem incrementally inserts timestamped factual streams, while SimpleMem resolves contradictions through strict chronological precedence using ISO-8601 timestamps. Synthesizing these techniques, MemOS leverages a structured Update API to execute differential writes, seamlessly updating provenance IDs to generate multi-version chains.
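A minimal sketch of logical invalidation over an append-only log is given below. It assumes writes arrive in chronological order and that all timestamps are ISO-8601 strings in the same format and time zone, so string comparison matches chronological order. `FactVersion` and `VersionedMemory` are hypothetical names, not taken from any of the systems above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FactVersion:
    subject: str
    relation: str
    value: str
    valid_from: str                   # ISO-8601 timestamp of the write
    invalid_at: Optional[str] = None  # set on supersession; rows are never deleted

class VersionedMemory:
    def __init__(self):
        self.log = []  # append-only list of FactVersion

    def write(self, subject, relation, value, ts):
        """Invalidate the currently valid version (if any), then append the new one."""
        for version in self.log:
            if (version.subject, version.relation) == (subject, relation) \
                    and version.invalid_at is None:
                version.invalid_at = ts
        self.log.append(FactVersion(subject, relation, value, ts))

    def current(self, subject, relation):
        """Return the value that has not been invalidated."""
        for version in reversed(self.log):
            if (version.subject, version.relation) == (subject, relation) \
                    and version.invalid_at is None:
                return version.value
        return None

    def as_of(self, subject, relation, ts):
        """Return the value that was valid at timestamp ts."""
        for version in self.log:
            if (version.subject, version.relation) == (subject, relation) \
                    and version.valid_from <= ts \
                    and (version.invalid_at is None or ts < version.invalid_at):
                return version.value
        return None

mem = VersionedMemory()
mem.write("user", "LIVES_IN", "Paris", "2024-01-01T00:00:00Z")
mem.write("user", "LIVES_IN", "Berlin", "2025-03-01T00:00:00Z")
```

Keeping the superseded row lets the store answer both latest-state and historical queries from the same log.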

❷ Capacity-Driven Physical Eviction. In contrast to timestamp-based multi-versioning, this category manages unbounded memory growth by physically dropping or unconditionally overwriting data. It executes this physical pruning through either strict deterministic constraints or dynamically calculated eviction scores.

▶ Constraint-Based Hard Eviction. This sub-category enforces rigid execution bounds by utilizing deterministic rules (such as strict FIFO queues, fixed sequence boundaries, or hard token limits) to unconditionally evict older states. Executing structural overwrites, MemAgent implements a programmatic scheduling algorithm that unconditionally replaces older memory sequences with newly synthesized summary blocks at every fixed segment boundary. Enforcing hard capacity limits, MEM1 operates through a system-enforced truncation mechanism that executes an automated FIFO pruning protocol to evict older tags once active context thresholds are breached. Operating via threshold flushes, Letta strictly handles buffer capacities via an OS-inspired queue manager; when the token count breaches a terminal limit, it forces a flush sequence to evict older messages into secondary recall storage.
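The threshold-flush behavior can be sketched as a FIFO buffer with a hard token budget that moves the oldest messages into secondary storage. The whitespace token count and the `FifoBuffer` class are simplifying assumptions for illustration; real systems use a model tokenizer and their own queue managers.

```python
from collections import deque

class FifoBuffer:
    """Active context with a hard token limit; overflow is flushed oldest-first."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.messages = deque()  # active context, oldest on the left
        self.archive = []        # secondary recall storage
        self.tokens = 0

    @staticmethod
    def count_tokens(text):
        return len(text.split())  # crude whitespace approximation

    def append(self, text):
        self.messages.append(text)
        self.tokens += self.count_tokens(text)
        # Evict oldest messages until the budget holds (always keep the newest one).
        while self.tokens > self.max_tokens and len(self.messages) > 1:
            oldest = self.messages.popleft()
            self.tokens -= self.count_tokens(oldest)
            self.archive.append(oldest)

buf = FifoBuffer(max_tokens=6)
for message in ["a b c", "d e", "f g h"]:
    buf.append(message)
```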

▶ Score-Based Priority Eviction. Rather than relying on static capacity limits, this sub-category dynamically forces the physical obsolescence of data by continuously calculating temporal decay or access-frequency scores. Quantifying access frequency, MemoryOS measures segment vitality via a scalar Heat score that balances retrieval frequency against exponential temporal decay, executing priority evictions that physically target the lowest-heat segments.
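One way to combine retrieval frequency with exponential temporal decay is shown below. The exact Heat formula used by MemoryOS is not given here, so the product form, the half-life parameter, and the field names `hits` and `last_access` are assumptions chosen for illustration.

```python
import math

def heat(access_count, last_access, now, half_life=3600.0):
    """Assumed score: access count scaled by exponential decay since last access."""
    decay = math.exp(-math.log(2) * (now - last_access) / half_life)
    return access_count * decay

def evict_lowest_heat(segments, now, capacity):
    """Keep the `capacity` hottest segments; return (kept, evicted)."""
    ranked = sorted(
        segments,
        key=lambda seg: heat(seg["hits"], seg["last_access"], now),
        reverse=True,
    )
    return ranked[:capacity], ranked[capacity:]

segments = [
    {"id": "a", "hits": 10, "last_access": 0},     # popular but one half-life old
    {"id": "b", "hits": 2, "last_access": 3600},   # just accessed
    {"id": "c", "hits": 1, "last_access": 3600},   # just accessed, rarely used
]
kept, evicted = evict_lowest_heat(segments, now=3600, capacity=2)
```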

❸ LLM-Driven Semantic Consolidation. Operating as a cognitive governor, this category leverages the LLM to dynamically resolve logical conflicts and abstract redundant observations into dense summaries prior to query or persistence phases.

▶ Inline Semantic Compaction. During the active write phase, this sub-category dynamically evaluates and consolidates newly ingested data against existing memory nodes, systematically merging redundant assertions prior to database transaction commitment (e.g., compressing three similar dialogue turns into one dense summary node). SimpleMem executes online semantic synthesis on-the-fly, systematically merging structurally similar assertions into singular dense abstractions prior to database transaction commitment. MemTree utilizes a core scheduling operation that recursively triggers a semantic summarization prompt across all parent nodes to dynamically fuse historical states with novel payloads.

▶ Tool-Driven CRUD Execution. In contrast to automated fusion, this sub-category operationalizes maintenance through discrete, programmed state-mutations guided explicitly by LLM-driven tool interfaces that issue explicit Create, Read, Update, or Delete (CRUD) commands. Mem0 operationalizes its dynamic maintenance strictly through structured LLM tool-calling interfaces encompassing discrete programmed state-mutations such as UPDATE and DELETE.
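Tool-driven maintenance amounts to dispatching LLM-emitted commands onto a small set of store mutations. The sketch below assumes a JSON payload with `op` and `args` fields and an in-memory `FactStore`; both are hypothetical and do not reproduce Mem0's actual tool schema.

```python
import json

class FactStore:
    """In-memory fact table mutated only through explicit tool calls."""

    def __init__(self):
        self.facts = {}
        self._next_id = 1

    def apply(self, tool_call_json):
        """Execute one tool call of the assumed form {"op": ..., "args": {...}}."""
        call = json.loads(tool_call_json)
        op, args = call["op"], call.get("args", {})
        if op == "ADD":
            fact_id = str(self._next_id)
            self._next_id += 1
            self.facts[fact_id] = args["text"]
            return fact_id
        if op == "UPDATE":
            if args["id"] not in self.facts:
                raise KeyError(f"unknown fact id: {args['id']}")
            self.facts[args["id"]] = args["text"]
            return args["id"]
        if op == "DELETE":
            self.facts.pop(args["id"], None)
            return args["id"]
        if op == "NOOP":
            return None
        raise ValueError(f"unsupported operation: {op}")

store = FactStore()
first = store.apply('{"op": "ADD", "args": {"text": "User is vegetarian"}}')
store.apply('{"op": "ADD", "args": {"text": "User lives in Paris"}}')
store.apply('{"op": "UPDATE", "args": {"id": "2", "text": "User lives in Berlin"}}')
store.apply('{"op": "DELETE", "args": {"id": "1"}}')
```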

❹ Continuous Parametric Optimization. Completely decoupling state updates from online inference latency, this category executes heavy neural optimizations as asynchronous background processes, modifying the actual model parameters rather than the external database schema (e.g., running continuous fine-tuning on overnight batches). For example, MemoRAG leaves active inference tokens strictly static and read-only, optimizing extraction quality exclusively during an offline training phase via a Reinforcement Learning with Generation Feedback (RLGF) algorithmic framework.

4.End-to-End Assessment

In this section, we conduct a systematic evaluation of agent memory systems across five research questions. Across five distinct benchmark workloads and 11 datasets, we assess 12 representative memory systems against baselines to characterize their performance. Specifically, the five research questions are as follows.

4.1.Overall Effectiveness (RQ1)

Experimental Setting. For “Do different agent memory systems successfully improve end-to-end task performance across workloads?”, we evaluate 12 representative memory systems and two reference baselines (Long Context and Embedding RAG) on the three end-to-end workloads to assess whether memory improves task success beyond the underlying LLM. Specifically, we use: (1) LoCoMo (Maharana et al., 2024): a long-conversation QA benchmark that tests episodic, temporal, and open-domain memory over multi-turn interactions, for which we report the unweighted mean of category-level Exact Match (EM) and Answer F1 on the four-category queries; (2) LongMemEval (Wu et al., 2024): a multi-session long-memory benchmark that evaluates whether systems can reconnect facts across sessions and reason over temporally distributed evidence, for which we report Substring EM, ROUGE-L F1, ROUGE-L Recall, and GPT-5.4-based LLM Judge Accuracy from MemoryAgentBench (MemoryAgentBench Team, 2026); and (3) DB-Bench: a benchmark from LifelongAgentBench (Zheng et al., 2025a) that evaluates whether memory supports procedural execution across database operations, for which we report Exact Match (EM) and Task Success Rate.

O1-(Cross-Workload Effectiveness): No single memory system dominates all workloads, but methods that preserve task-critical evidence through structure-guided filtering remain the most competitive overall. As shown in Figure 7, the leading systems shift across workloads: (1) Structure-aware systems lead LongMemEval, where Zep reaches 48.0 LLM Judge Accuracy and Cognee attains 35.3 ROUGE-L F1; (2) Hybrid filtering is strongest on LoCoMo exactness, where MemOS reaches 11.5 Exact Match (EM); and (3) Trace-preserving memories remain strongest on DB-Bench, where Long Context achieves 48.20 EM and MemoChat reaches 55.40 Task Success Rate. However, among methods with full workload coverage, MemoryOS and MemOS remain closest to the frontier overall, suggesting that robustness comes not from a single universal memory form, but from preserving the right evidence at the right level of abstraction before final matching. In particular, (1) Temporal or graph-organized memory is most useful for cross-session aggregation and event-order reasoning (e.g., scattered personal facts in LongMemEval); (2) Summary-first or coarse-to-fine routing is useful for exact grounding in long but semantically coherent dialogues (e.g., recovering a specific date or personal detail in LoCoMo); and (3) Trace-preserving memory is necessary when correctness depends on intermediate state changes and operation order (e.g., dependent UPDATE and INSERT operations in DB-Bench).

O2-(Beyond Exact Match): EM remains informative for tasks with canonical, directly grounded outputs, but it becomes insufficient when correctness depends on paraphrastic synthesis or executable success. As shown in Figure 7, Exact Match (EM) is still a meaningful signal on LoCoMo, where many questions target short grounded facts, as reflected by MemOS achieving the best Exact Match (EM). On LongMemEval, however, the stronger systems are more clearly separated once semantic equivalence is considered through ROUGE-L and LLM Judge Accuracy, indicating that cross-session reasoning often yields correct answers that do not share a single canonical surface form. On DB-Bench, the limitation is even clearer: Long Context achieves the best Exact Match (EM), but MemoChat attains a substantially higher Task Success Rate, showing that exact output matching does not fully capture whether memory supports successful execution. These results suggest that Exact Match (EM) is most appropriate when answers are short, canonical, and locally verifiable (e.g., a venue name or object attribute in LoCoMo), but should be complemented once tasks require cross-session synthesis or end-task state validation (e.g., composing a semantically correct answer from multiple sessions in LongMemEval or reaching the correct table state in DB-Bench).

Finding 1. (Workload-Aligned Memory). RQ1 suggests that strong agent memory is not defined by a single universal representation, but by how well it supports the dominant workload bottleneck: (1) for dispersed cross-session reasoning, relation- and time-aware retrieval is most effective, as in Zep and Cognee; (2) for long but semantically coherent dialogue, coarse-to-fine filtering improves exact grounding, as in MemOS and MemoryOS; and (3) for stateful execution, preserving interaction traces is more critical than exact lexical matching alone, as in Long Context.

4.2.Memory Retrieval Fidelity (RQ2)

Figure 8. Retrieval Results of Memory Systems over LoCoMo.

Experimental Setting. For “How accurately can a memory system surface the stored evidence required by a query?”, we evaluate eight representative memory systems to assess evidence-level retrieval fidelity independently of downstream answer generation. Specifically, we use LoCoMo (Maharana et al., 2024), which provides source-level gold evidence for queries with diverse evidence distances. We report: (1) Recall@K, where a hit requires the top-$k$ retrieved source-id groups to contain the annotated gold evidence, and (2) Recall@10 over six evidence distance gap bins (1–5 to 26–31), defined by the session distance between the query’s final session and the earliest supporting evidence, to measure long-range retrieval accuracy.
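The Recall@K criterion above can be sketched as follows. The setting does not spell out whether a hit needs all or any of the annotated gold ids, so this sketch assumes that all of them must appear in the top-$k$; the query records and the helper names are hypothetical.

```python
def hit_at_k(retrieved, gold, k):
    """Assumed criterion: every gold source id appears among the top-k retrieved ids."""
    return set(gold) <= set(retrieved[:k])

def recall_at_k(queries, k):
    """Percentage of queries that count as a hit at cutoff k."""
    hits = sum(hit_at_k(q["retrieved"], q["gold"], k) for q in queries)
    return 100.0 * hits / len(queries)

# Made-up retrieval results for two queries.
queries = [
    {"retrieved": ["s3", "s1", "s7"], "gold": ["s1"]},
    {"retrieved": ["s2", "s5", "s9"], "gold": ["s9", "s4"]},
]
r1 = recall_at_k(queries, 1)
r2 = recall_at_k(queries, 2)
r3 = recall_at_k(queries, 3)
```

Under this all-gold criterion, the second query never counts as a hit because `s4` is never retrieved, which illustrates why multi-evidence questions reward evidence completeness rather than top-1 ranking.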

O3-(Structured Evidence Expansion): Retrieval fidelity depends less on surfacing one relevant memory early than on preserving an explicitly organized memory structure that can gather complete and temporally distant evidence. As shown in Figure 8, the results show a clear difference between early-hit precision and overall evidence completeness: SimpleMem achieves the highest Recall@1 (39.0), but A-MEM and MemTree become clearly stronger at larger retrieval budgets, reaching 69.5/85.9 and 59.7/80.5 on Recall@5/@10, respectively, while also remaining much more stable as the evidence distance gap increases; by contrast, the flat Embedding RAG baseline drops sharply after the shortest-gap bin. This pattern suggests that strong memory retrieval is not mainly a top-1 ranking problem, but an evidence-completion problem in which the required support may be old, scattered, or spread across multiple turns (e.g., personal details mentioned in different sessions or dated events referenced much later). More specifically, the results point to three different retrieval behaviors: (1) compression-oriented memory is effective for surfacing one highly relevant item early (e.g., a single salient personal detail or recent conversational fact); (2) linked or hierarchical memory organization is more effective for gathering complementary supporting evidence across the ranked results (e.g., combining pet names or dated events mentioned in different sessions); and (3) flat dense retrieval remains competitive mainly when the needed evidence is still close to the current context (e.g., recent conversational facts). For queries that require scattered or temporally distant support, the most competitive systems are therefore those that organize memory as a structured evidence space rather than a flat similarity cache.

Finding 2. (Evidence-Centric Memory Organization). RQ2 shows that retrieval quality depends more on how a system organizes evidence for later reconstruction than on how well it ranks one relevant memory first. Specifically, (1) early localization and evidence assembly should be treated as separate design targets; (2) explicit structure, such as links or hierarchy, is most valuable when supporting evidence is scattered or temporally distant, as in A-MEM and MemTree; and (3) flat similarity search is mainly effective for short-range access.

4.3.Memory Evolution Robustness (RQ3)

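Table 2. Robustness over Memory Update Settings.

| Method | LoCoMo Temporal: Exact Match | LoCoMo Temporal: Answer F1 | LongMemEval Knowledge Update: Substring EM | LongMemEval Knowledge Update: ROUGE-L F1 | LongMemEval Temporal Reasoning: Substring EM | LongMemEval Temporal Reasoning: ROUGE-L F1 |
|---|---|---|---|---|---|---|
| Long Context | 8.1 | 26.9 | 20.0 | 18.0 | 12.0 | 24.0 |
| Embedding RAG | 1.6 | 7.9 | 20.0 | 17.8 | 10.7 | 22.7 |
| Mem0 | 3.2 | 6.0 | 15.6 | 17.1 | 10.7 | 22.4 |
| MemoChat | 2.4 | 15.4 | 8.9 | 12.9 | 10.7 | 25.3 |
| Cognee | 4.0 | 28.1 | 37.8 | 34.0 | 18.7 | 35.8 |
| Zep | 4.8 | 18.1 | 44.4 | 36.8 | 13.3 | 30.5 |
| MemTree | 5.6 | 18.6 | 31.1 | 30.6 | 8.0 | 29.9 |
| Letta (MemGPT) | 0.0 | 7.1 | 17.8 | 5.7 | 12.0 | 8.8 |
| LightMem | 4.0 | 20.1 | 15.6 | 20.2 | 12.0 | 28.6 |
| SimpleMem | 4.4 | 8.1 | 6.7 | 7.4 | 8.0 | 22.6 |
| MemOS | 8.9 | 28.0 | 28.9 | 30.5 | 12.0 | 31.1 |
| MemoryOS | 3.2 | 22.7 | 35.6 | 32.2 | 16.0 | 31.6 |
| A-MEM | 4.8 | 17.7 | 26.7 | 22.8 | 8.0 | 22.5 |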

Experimental Setting. For “Can agent memory systems reliably incorporate revised facts, preserve the correct temporal state after updates, and remain robust across answer backbones?”, we conduct two experiments: (1) Update Robustness Comparison, which evaluates whether systems can absorb fact revisions and answer temporally grounded queries after updates; and (2) Backbone Robustness Ablation, which tests whether this behavior remains stable when only the LLM backbone changes. In (1) Update Robustness Comparison, Table 2 compares 11 representative memory systems on Knowledge Update and Temporal Reasoning from LongMemEval (Wu et al., 2024), and Temporal from LoCoMo (Maharana et al., 2024). The two LongMemEval (Wu et al., 2024) slices are reported with Substring EM and ROUGE-L F1, while the LoCoMo (Maharana et al., 2024) slice is reported with Exact Match (EM) and Answer F1. In (2) Backbone Robustness Ablation, Figure 9 evaluates 6 representative memory settings under 4 LLM backbones on LoCoMo (Maharana et al., 2024).

O4-(Temporal State Externalization): No single memory system dominates all update-oriented slices, but methods that preserve temporally valid evidence through structured organization remain the most competitive overall. As shown in Table 2, the leading systems shift across slices: (1) Graph- or relation-organized memory is strongest on direct fact revision, where Zep leads Knowledge Update with 44.4 Substring EM and 36.8 ROUGE-L F1; (2) Relationally organized retrieval is strongest on temporally dispersed evidence, where Cognee leads Temporal Reasoning with 18.7 Substring EM and 35.8 ROUGE-L F1; and (3) Hybrid filtered memory is strongest on exact latest-state grounding, where MemOS attains the highest LoCoMo Exact Match (EM) at 8.9 while Cognee attains the highest Answer F1 at 28.1. However, among methods with full slice coverage, Cognee, MemOS, and MemoryOS remain closest to the frontier overall, indicating that robustness comes not from a single universal memory form, but from preserving the right temporal evidence at the right structure level. In particular, (1) temporal or graph-organized memory is most useful for revised personal facts and dated events (e.g., aggregating scattered updates to preferences, purchases, or past activities in LongMemEval); (2) hybrid or coarse-to-fine filtering is most useful when correctness depends on the currently valid state (e.g., recovering the latest date, attribute, or event order from long but semantically coherent dialogues in LoCoMo); and (3) flat context accumulation or dense similarity alone is weakest when stale mentions must be separated from updated ones (e.g., distinguishing an earlier personal detail from its later correction after repeated mentions over time).

Figure 9. Ablation of LLM Backbones.

O5-(Backbone Robustness): Backbone variation changes absolute answer quality more than it changes which memory pipeline remains effective, indicating that stable update behavior is determined primarily before final generation. As shown in Figure 9, Answer F1 generally rises under stronger generators, yet the overall ordering changes only modestly: (1) MemOS remains the strongest memory-based configuration at 32.2, 41.2, 38.6, and 41.2; and (2) the only notable reversal is local, where A-MEM overtakes MemTree under GPT-5.4-mini and GPT-5.4. This stability implies that stronger backbones mostly improve answer realization after relevant evidence has already been localized, rather than compensating for weak temporal grounding. For example, for date-grounded latest-state queries in the current LoCoMo temporal outputs, MemOS remains correct across all four backbones, whereas Embedding RAG remains incorrect. More concretely, methods with stronger external organization yield a more stable evidence set for different LLMs, whereas methods that rely more heavily on LLM-side synthesis exhibit greater cross-backbone movement.

Finding 3. (Temporal Update Fidelity). RQ3 suggests that reliable post-update behavior is a pipeline-level design problem rather than a pure model-capacity problem. In particular, (1) revisability should be built into the memory representation so later facts can be bound to the same entity or event rather than appended as undifferentiated text, as in Zep and Cognee; (2) query-time selectivity should match the workload bottleneck, using filtered or hybrid routing when the task requires the currently valid state, as in MemOS and MemoryOS; and (3) LLM scaling is most valuable only after grounding has succeeded, so stronger backbones should refine answer expression rather than serve as the primary mechanism for resolving stale or conflicting memories.

4.4.Long Horizon Memory Stability (RQ4)

Experimental Setting. For “How stable are agent memory systems as the effective memory horizon increases, either through longer contexts or more distant supporting evidence?”, we evaluate 12 representative memory systems across 3 benchmarks to assess robustness to increasing context length and temporal distance. Specifically, we use: (1) LongBench (Bai et al., 2024), which evaluates controlled long-context difficulty in question answering, reported with Accuracy over Short, Medium, and Long context-length buckets to measure context-length robustness; (2) LongMemEval (Wu et al., 2024), which evaluates multi-session memory as the amount of prior interaction grows, reported with ROUGE-L F1 over bins of historical session count to measure multi-session stability; and (3) LoCoMo (Maharana et al., 2024), which evaluates memory drift when supporting evidence lies far back in the conversation, reported with Answer F1 over bins of evidence-distance gap between the final session and the earliest supporting evidence.

O6-(Long-Horizon Evidence Preservation): Memory remains more stable at longer horizons when evidence is organized through explicit relational links or hierarchical consolidation, rather than left as flat text for direct matching. As shown in Figure 10, in LongBench, SimpleMem stays nearly unchanged from the Short to Medium buckets (35.2 to 34.9 Accuracy), whereas Long Context drops from 42.6 to 19.0, indicating that larger prompts alone do not sustain answer quality once long inputs accumulate distractors. In LoCoMo, the contrast is sharper: Embedding RAG falls from 37.1 to 7.4 Answer F1 as the evidence gap widens, while graph- or consolidated-memory systems such as Cognee, MemOS, and MemoryOS remain substantially higher across the same bins; LongMemEval shows the same advantage for methods that preserve cross-session structure over longer histories. This indicates that the main difficulty at longer horizons is not memory volume, but whether the representation keeps distant facts connected to the abstractions needed for answering. More specifically, graph- or temporally organized memory preserves entity–event–time relations for distant facts (e.g., recovering a repeated personal event many sessions earlier), while hierarchical or summary-first organization preserves session-level structure (e.g., first locating the relevant session before resolving a specific local detail) so the LLM can narrow attention before final generation. Pure long-context prompting and flat dense memory provide neither form of support, and therefore degrade more sharply as the effective horizon grows.

Figure 10. (a) Context-length robustness on LongBench; (b) Session-history growth on LongMemEval; (c) Temporal evidence-distance drift on LoCoMo.

Finding 4. (Horizon-Structured Memory). RQ4 indicates that, as the effective memory horizon grows, the main challenge shifts from storing more history to choosing the right abstraction over it: (1) Multi-view filtering helps when long inputs contain many distractors, as in SimpleMem; (2) Relation-aware indexing helps when supporting facts are separated by many turns or sessions, as in Cognee and Zep; and (3) Coarse-to-fine summarization helps when the system must first identify the relevant session before resolving a local detail, as in MemOS and MemoryOS.

4.5.Memory Operation Cost (RQ5)

Experimental Setting. For “What is the operational cost of each memory system in terms of utility–latency trade-off and cross-workload latency footprint?”, we evaluate 8 representative memory systems using the unified time-overhead traces recorded by our runner. We quantify two aspects: (1) Utility–latency trade-off, measured by Avg. Operation Latency/Query and Normalized Utility; and (2) Cross-workload latency footprint, measured by Outlier-Filtered Avg. Total Latency/Query. For (1), Avg. Operation Latency/Query is computed as memory construction time plus query time and interpreted as amortized per-query cost for systems with cumulative or bursty writes, while Normalized Utility is the mean of six min–max normalized answer-quality metrics from the current LoCoMo (Maharana et al., 2024) and LongMemEval (Wu et al., 2024) runs. For (2), we report Outlier-Filtered Avg. Total Latency/Query on three benchmarks.
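The Normalized Utility aggregation can be sketched as a per-metric min–max normalization across systems followed by a mean. The 0–100 scaling and the handling of constant columns are assumptions, and the scores below are made-up values with two metrics instead of six.

```python
def normalized_utility(scores):
    """Mean of per-metric min-max normalized scores, scaled to 0-100.

    scores maps each system name to a list of metric values in a fixed order.
    """
    systems = list(scores)
    n_metrics = len(next(iter(scores.values())))
    totals = {system: 0.0 for system in systems}
    for j in range(n_metrics):
        column = [scores[system][j] for system in systems]
        lo, hi = min(column), max(column)
        for system in systems:
            # A constant column carries no signal; contribute 0 by assumption.
            totals[system] += (scores[system][j] - lo) / (hi - lo) if hi > lo else 0.0
    return {system: 100.0 * totals[system] / n_metrics for system in systems}

# Hypothetical raw scores for three systems on two metrics.
utility = normalized_utility({
    "A": [10.0, 30.0],
    "B": [20.0, 10.0],
    "C": [30.0, 20.0],
})
```

Because normalization is relative to the evaluated pool, the resulting utilities are comparable only within one set of systems and metrics.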

Figure 11. Operation Cost of Memory Systems.

O7-(Localized Maintenance): The most cost-efficient memory mechanisms are those that localize maintenance to a bounded subset of memory state, whereas mechanisms that repeatedly reorganize a large global state are the least efficient. As shown in Figure 11, (1) Among memory-augmented systems, LightMem and MemTree occupy the strongest efficiency frontier, with LightMem reaching 48.3 Normalized Utility at 3.67 s Avg. Operation Latency/Query and MemTree reaching 63.5 at 15.9 s; both are clearly more efficient than MemoChat (28.0 at 15.4 s), Mem0 (21.4 at 35.9 s), and A-MEM (57.7 at 17.9 s); (2) Higher-utility structured systems move markedly to the expensive side: MemoryOS reaches 82.0 Normalized Utility only at 28.6 s, while Cognee and Zep exceed 84 utility only after 116.5 s and 155.1 s; (3) The workload-specific latency view sharpens the same separation on LongBench: LightMem remains at 17.3 s and MemTree at 116.7 s, whereas Mem0, MemoChat, MemoryOS, and A-MEM rise to 374.2, 460.2, 490.0, and 552.1 s, respectively. This indicates that operational efficiency is governed less by whether a system uses structure than by how widely each write propagates through that structure. Specifically, (1) segmented compression and bounded hybrid retrieval keep LightMem close to the low-cost regime; (2) path-local tree aggregation allows MemTree to preserve substantially more utility without global refresh; and (3) graph-wide consolidation, multi-store synchronization, or repeated whole-memory rewriting yield stronger organization but impose the heaviest operational cost as memory grows.

Finding 5. (Operational Scaling Rule). RQ5 shows that efficiency is governed by maintenance scope rather than structure alone. (1) Localized update and search yield the strongest cost–utility balance, as in LightMem and MemTree; (2) Richer organization helps only when its upkeep avoids broad recomputation; otherwise overhead offsets its gains, as in Cognee and MemoryOS; (3) Under long-context workloads, whole-memory coordination becomes the dominant cost driver.

5.Fine-Grained Component Comparison

To understand the root causes behind end-to-end performance differences, we decompose agent memory systems into four fundamental modules. By systematically generating controlled variants that modify one module at a time, we evaluate the contribution of each module to overall system performance.

5.1.Memory Representation and Storage (M1)

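Table 3. Ablation of Representation and Storage Mechanisms.

| Method | Variant | LoCoMo EM | LoCoMo Ans. F1 | LongMemEval Substr. EM | LongMemEval ROUGE-L F1 |
|---|---|---|---|---|---|
| LightMem | User-Only Raw | 24.2 | 38.9 | 26.0 | 31.4 |
| LightMem | User-Only Summary | 8.5 | 15.6 | 11.7 | 17.4 |
| LightMem | User-Only Compressed | 23.6 | 38.6 | 10.7 | 19.1 |
| MemTree | Flat-biased | 18.2 | 30.7 | 23.0 | 29.9 |
| MemTree | Deeper Tree | 18.7 | 31.2 | 23.3 | 30.9 |
| Mem0 | Default | 3.2 | 6.2 | 9.3 | 16.5 |
| Mem0 | Graph Store | 3.0 | 6.5 | 8.3 | 15.9 |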

Experimental Setting. For “How do memory abstraction level and structural organization affect factual fidelity and downstream reasoning effectiveness?”, we evaluate three representation-focused variants: (1) LightMem (Fang et al., 2025) compares User-Only Raw, which stores verbatim user utterances, User-Only Summary, which rewrites each session into an LLM-generated abstractive summary, and User-Only Compressed, which removes filler and redundant tokens while preserving the original phrasing and factual content; (2) MemTree (Rezazadeh et al., 2025) compares a shallow Flat-biased Setting with a Deeper Tree Setting to examine how hierarchical text organization affects memory fidelity. We assess the trade-off between fine-grained fact preservation and multi-step reasoning using LoCoMo for compositional reasoning and LongMemEval for multi-session factual retrieval, reporting exactness- and overlap-based metrics (e.g., EM and ROUGE-L F1).

O8 (Content Fidelity): Retaining the original conversational content is more important than increasing abstraction or hierarchy for sustaining both factual recall and reasoning quality. Table 3 shows that LightMem User-Only Raw achieves the best result on all four metrics. User-Only Compressed remains close on LoCoMo (Ans. F1: 38.6 vs. 38.9; EM: 23.6 vs. 24.2) but drops sharply on LongMemEval (Substr. EM: 10.7 vs. 26.0), User-Only Summary is substantially weaker on both benchmarks, and the deeper MemTree setting provides only modest gains over the flat setting. This indicates that the main performance boundary is the amount of recoverable evidence the representation preserves, rather than whether it applies stronger abstraction or a deeper structure. In particular, (1) Raw text is most effective when the task requires recovering exact session-level details (e.g., recalling a title such as “Nu, pogodi!”); (2) Light compression can still support compositional reasoning when the main meaning is preserved, but becomes unreliable for exact detail matching (e.g., relating two earlier events while missing the precise date or name); and (3) Deeper hierarchy can improve organization, but cannot restore information removed during representation (e.g., a parent node helps navigate related sessions, but cannot recover omitted details).

Finding 6. (Representation Granularity). M1 shows that preserving usable evidence matters more than making memory more compact or more structured. (1) High-retention forms best support exact detail recovery, as in LightMem User-Only Raw; (2) Light compression can preserve reasoning, but weakens exact matching, as in LightMem User-Only Compressed; (3) Hierarchy mainly improves access, but cannot restore removed content, as reflected by the MemTree (Deeper Tree) variant.
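To make the "recoverable evidence" argument concrete, the sketch below checks whether a gold answer string survives in two stored representations. It is a minimal illustration under stated assumptions: the normalization (lowercasing, stripping ASCII punctuation) and the metric definitions are common conventions and may differ from the paper's exact scoring scripts, and the stored texts are invented around the “Nu, pogodi!” example.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed convention)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)


def substring_em(prediction, gold):
    """True if the normalized gold answer appears anywhere in the prediction."""
    return normalize(gold) in normalize(prediction)


def token_f1(prediction, gold):
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "Nu, pogodi!"
# Invented memory contents for illustration only.
raw_memory = "I rewatched Nu, pogodi! with my niece last weekend."
summary_memory = "The user watched an old cartoon with a family member."

print(substring_em(raw_memory, gold))      # the title is still recoverable
print(substring_em(summary_memory, gold))  # the title was abstracted away
```

Once the summary drops the title, no retrieval or hierarchy layered on top can bring it back, which is the point of item (3) in Finding 6.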

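5.2. Memory Extraction (M2)

Table 4. Ablation of Memory Extraction Strategies.

| Method | Variant | LoCoMo EM | LoCoMo Ans. F1 | LongMemEval Substr. EM | LongMemEval ROUGE-L F1 |
|---|---|---|---|---|---|
| MemoChat | Heuristic Topic | 23.0 | 33.5 | 10.7 | 18.6 |
| MemoChat | LLM Topic | 22.5 | 34.4 | 7.3 | 15.9 |
| MemOS | Fast Memorize | 25.5 | 40.8 | 20.7 | 26.1 |
| MemOS | Fine Memorize | 2.5 | 5.0 | 22.3 | 30.2 |
| LightMem | User-Only Raw | 24.2 | 38.9 | 26.0 | 31.4 |
| LightMem | Hybrid Raw | 25.5 | 39.7 | 25.3 | 31.4 |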

Experimental Setting. For “How do write-time extraction choices affect factual fidelity and downstream reasoning effectiveness?”, we compare extraction-related variants in three groups: (1) MemoChat (Lu et al., 2023), which contrasts Heuristic Topic and LLM Topic segmentation; (2) MemOS (Li et al., 2025), which contrasts Fast Memorize and Fine Memorize on the same tree_text backend; and (3) LightMem (Fang et al., 2025), which contrasts User-Only Raw and Hybrid Raw by extracting raw memory from user turns only or from both user and assistant turns. We evaluate LongMemEval for multi-session factual retrieval fidelity and LoCoMo for downstream multi-step reasoning, reporting exactness- and overlap-based measures (e.g., EM and ROUGE-L F1).

O9 (Coverage-Preserving Extraction): Coverage-preserving write-time extraction provides the most stable balance between factual retrieval and downstream reasoning. As shown in Table 4, MemoChat Heuristic Topic improves LongMemEval over LLM Topic (10.7 vs. 7.3 Substr. EM; 18.6 vs. 15.9 ROUGE-L F1) while keeping LoCoMo nearly unchanged (23.0/33.5 vs. 22.5/34.4 EM/Ans. F1). MemOS Fast Memorize far exceeds Fine Memorize on LoCoMo (25.5 vs. 2.5 EM; 40.8 vs. 5.0 Ans. F1) despite lower LongMemEval scores (20.7 vs. 22.3 Substr. EM; 26.1 vs. 30.2 ROUGE-L F1). LightMem Hybrid Raw slightly improves LoCoMo over User-Only Raw (25.5 vs. 24.2 EM; 39.7 vs. 38.9 Ans. F1) with nearly unchanged LongMemEval results (25.3 vs. 26.0 Substr. EM; both 31.4 ROUGE-L F1). These results suggest that broader, less selective extraction better preserves the context needed for downstream answerability, even when more selective extraction yields modest gains on lexical factual retrieval. In particular, (1) conservative topic grouping is less likely to split a sustained thread or isolate a brief aside (e.g., a one-off hobby mention); (2) lighter memorization is more likely to retain details that later need to be combined for reasoning; and (3) including both user and assistant turns can preserve clarifying cues that user-only extraction may miss (e.g., a date or refined phrasing).

Finding 7. (Late Filtering Principle). M2 suggests that memory extraction should preserve context at write time rather than aggressively filter details: (1) Coarser segmentation helps thread-spanning questions by keeping related cues together; (2) Limited rewriting supports compositional reasoning by retaining details that matter only when combined later; (3) Storing both user and assistant turns helps clarification-heavy dialogues by preserving refined formulations for later access.
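The difference between user-only and hybrid extraction in item (3) can be sketched with a role filter over dialogue turns. This is a hedged illustration, not LightMem's actual extraction code: the `Turn` type, the `extract_raw` helper, and the sample dialogue (including the date) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    role: str  # "user" or "assistant"
    text: str


def extract_raw(turns, roles):
    """Keep verbatim text from turns whose role is in `roles` (no rewriting)."""
    return [f"{turn.role}: {turn.text}" for turn in turns if turn.role in roles]


# Invented dialogue: the assistant turn carries the clarified date.
dialogue = [
    Turn("user", "I'm going to the concert next Friday."),
    Turn("assistant", "Noted, that is Friday, March 14."),
    Turn("user", "Right, please remind me the day before."),
]

user_only = extract_raw(dialogue, roles={"user"})
hybrid = extract_raw(dialogue, roles={"user", "assistant"})

print(any("March 14" in entry for entry in user_only))
print(any("March 14" in entry for entry in hybrid))
```

In this toy case the resolved date lives only in the assistant turn, so a user-only memory keeps the relative phrase "next Friday" but loses the absolute date that a later question may need.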

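5.3. Memory Retrieval and Routing (M3)

Table 5. Ablation of Retrieval and Routing Mechanisms.

| Method | Variant | LoCoMo Ans. F1 | LoCoMo Recall | LongMemEval Substr. EM | LongMemEval ROUGE-L F1 |
|---|---|---|---|---|---|
| A-MEM | Hybrid-Balanced | 24.6 | 49.9 | 27.5 | 25.9 |
| A-MEM | Hybrid Sparse-Leaning | 23.0 | 44.3 | 24.3 | 22.8 |
| SimpleMem | No Planning | 18.7 | 86.4 | 17.0 | 22.9 |
| SimpleMem | Planning Only | 20.7 | 90.6 | 21.7 | 27.9 |
| SimpleMem | Planning + Reflect | 20.0 | 88.6 | 21.3 | 26.1 |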

Experimental Setting. For “How do retrieval fusion and reasoning-mediated routing affect retrieval relevance and provenance-sensitive precision?”, we compare variants in two groups: (1) A-MEM under Hybrid-Balanced, which uses a moderate dense–sparse fusion, versus Hybrid Sparse-Leaning, which increases the sparse contribution; and (2) SimpleMem under No Planning, which retrieves directly, versus Planning Only, which adds an explicit planning step, and Planning + Reflect, which further introduces a lightweight reflection stage. We evaluate on LongMemEval to measure scattered-history retrieval relevance, and on LoCoMo to assess provenance-sensitive memory access and supporting-memory identification, reporting overlap-based measures (e.g., Substr. EM, ROUGE-L F1).

O10 (Planning and Fusion): Explicit planning and balanced retrieval fusion provide the strongest improvement in retrieval effectiveness. As shown in Table 5, A-MEM achieves its best performance with Hybrid-Balanced, reaching 24.6 Ans. F1 and 27.5 Substr. EM, compared with 23.0 and 24.3 under Hybrid Sparse-Leaning, while SimpleMem achieves its best performance with Planning Only, reaching 20.7 Ans. F1, 90.6 Strict Rec., 21.7 Substr. EM, and 27.9 ROUGE-L F1, above both No Planning and Planning + Reflect. These results indicate that stronger retrieval and routing performance comes from adding useful structure, rather than from simply increasing sparse matching or extra reasoning steps. In particular, (1) moderate fusion appears more effective than sparse-leaning fusion for preserving both answer quality and relevance (e.g., semantically related but lexically varied facts); (2) explicit planning consistently improves over direct retrieval (e.g., multi-constraint memory queries); and (3) adding reflection on top of planning does not yield further gains, suggesting that extra deliberation may weaken rather than improve routing decisions.

Finding 8. (Retrieval Strategy Guidance). M3 indicates that retrieval quality improves most from targeted structure rather than added complexity: (1) moderate hybrid fusion is preferable when evidence is semantically related but lexically diverse; (2) lightweight planning is effective for constrained memory lookup; and (3) once a route is already specified, extra reflection brings limited benefit and mainly adds overhead.
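Dense–sparse fusion, as contrasted in item (1), is often implemented as a weighted sum of a dense similarity and a sparse lexical score. The sketch below assumes that form; the paper does not state A-MEM's fusion formula or weights here, so `dense_weight`, the two-dimensional toy embeddings, and the term-overlap sparse score are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sparse_overlap(query, doc):
    """Fraction of query terms that also appear in the document (toy sparse score)."""
    query_terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    return len(query_terms & doc_terms) / len(query_terms)


def rank(query, query_vec, docs, dense_weight):
    """Rank doc ids by dense_weight * dense + (1 - dense_weight) * sparse."""
    scored = []
    for doc_id, (text, vec) in docs.items():
        dense = cosine(query_vec, vec)
        sparse = sparse_overlap(query, text)
        scored.append((dense_weight * dense + (1 - dense_weight) * sparse, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


query = "what pet does the user own"
query_vec = (1.0, 0.0)  # hypothetical embedding
docs = {
    # Semantically relevant, but shares no words with the query.
    "kitten": ("adopted a kitten named Miso", (0.9, 0.3)),
    # Lexically similar, but about something else.
    "car": ("the user does not own a car", (0.1, 1.0)),
}

balanced = rank(query, query_vec, docs, dense_weight=0.5)
sparse_leaning = rank(query, query_vec, docs, dense_weight=0.2)
print(balanced, sparse_leaning)
```

With these toy numbers the balanced weighting ranks the semantically relevant memory first, while the sparse-leaning weighting promotes the lexically overlapping distractor, which mirrors the "semantically related but lexically varied" case in the text.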

5.4. Memory Maintenance (M4)

Figure 12. Ablation of Maintenance Strategies.

Experimental Setting. For “How do consolidation aggressiveness, flush timing, and summary granularity affect update correctness and long-horizon memory consistency?”, we compare maintenance-relevant variants in two groups: (1) MemoChat under default multi-topic consolidation versus Topic1, which forces each window into a single-topic summary; and (2) MemoryOS under default immediate consolidation versus Delayed-Flush, which enlarges the short-term buffer before backend writes, and Conservative-Merge, which raises the topic-similarity threshold for stricter assimilation. We evaluate on LoCoMo to assess whether these maintenance choices preserve updated facts and coherent memory use over extended contexts.

O11 (Conservative Consolidation): Conservative consolidation is more effective than delayed flushing or overly coarse summarization for maintaining answer-relevant memory. As shown in Figure 12, the stricter-merge variant, MemoryOS (Conservative-Merge), improves over default MemoryOS from 23.2 to 23.5 in Ans. F1 and from 22.4 to 22.8 in Substr. EM, whereas delaying flushes lowers the same system to 20.6/19.5. Forcing single-topic summaries in MemoChat also underperforms its default setting, at 16.2/16.8 versus 16.6/18.4. Long Context remains highest on Substr. EM at 23.7. This indicates that maintenance is most effective when it selectively consolidates evidence without either leaving it unresolved or compressing it too aggressively, while raw context still better preserves exact phrasing. In particular, (1) conservative merging can retain related details for later recomposition (e.g., dispersed hobby mentions); (2) delayed flushing leaves more evidence unresolved before retrieval (e.g., activity split across turns).

Finding 9. (Maintenance Design Principle). M4 suggests that memory maintenance works best under a balanced update regime: (1) conservative integration preserves cross-turn linkages for long-horizon reasoning; (2) delayed flushing leaves recent evidence fragmented at query time; and (3) overly coarse summarization obscures sparse but useful cues.
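The two MemoryOS knobs discussed above, flush timing and merge threshold, can be sketched with a buffered store. This is a toy model under explicit assumptions rather than MemoryOS's implementation: `BufferedMemory`, the Jaccard similarity, the threshold values, and the sample entries are hypothetical, and buffered items are assumed to be invisible to long-term retrieval until flushed.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity (stand-in for a topic-similarity score)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)


class BufferedMemory:
    """Toy store: writes wait in a short-term buffer, then merge into topic groups."""

    def __init__(self, flush_size, merge_threshold):
        self.flush_size = flush_size            # larger value = delayed flush
        self.merge_threshold = merge_threshold  # higher value = stricter merging
        self.buffer = []
        self.store = []  # long-term store: list of groups (lists of texts)

    def write(self, text):
        self.buffer.append(text)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        for text in self.buffer:
            self._consolidate(text)
        self.buffer = []

    def _consolidate(self, text):
        # Merge into the first group whose anchor entry is similar enough.
        for group in self.store:
            if jaccard(group[0], text) >= self.merge_threshold:
                group.append(text)
                return
        self.store.append([text])


# Invented entries for illustration only.
entries = [
    "user likes hiking on weekends",
    "user likes painting on weekends",
    "user bought hiking boots",
]


def build(flush_size, merge_threshold):
    memory = BufferedMemory(flush_size, merge_threshold)
    for entry in entries:
        memory.write(entry)
    return memory


loose = build(flush_size=1, merge_threshold=0.25)
conservative = build(flush_size=1, merge_threshold=0.5)
delayed = build(flush_size=4, merge_threshold=0.25)
print(len(loose.store), len(conservative.store), len(delayed.store))
```

Under the loose threshold all three entries collapse into one group, the stricter threshold keeps the boots purchase as its own entry, and the delayed-flush configuration still has nothing in the long-term store when a query would arrive after the third write.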

6. Conclusion

We present a comprehensive review of existing agent memory systems from a data management perspective. We conduct a thorough end-to-end performance evaluation of typical agent memory systems and explore their suitable application scenarios. We also examine the impact of the individual building blocks by constructing multiple memory module variants, thereby identifying the most effective methods for representation, extraction, routing, and maintenance, as well as the most influential factors governing operational costs and long-horizon stability. Finally, we summarize the findings, present guidance for users in selecting suitable memory architectures, and outline promising research directions. We will also release the testbed and evaluation framework.

References

Anthropic. Claude Code. Website. Cited by: §1.
Anthropic Engineering (2025) Effective context engineering for AI agents. Note: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents Cited by: §2.
Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, and J. Li (2024) LongBench: A bilingual, multitask benchmark for long context understanding. In ACL (1), pp. 3119–3137. Cited by: §4.4.
L. Caminal et al. (2025) Filtered vector search: state-of-the-art and research opportunities. In Proceedings of the VLDB Endowment, Vol. 18, pp. 5488–5491. Cited by: §2.
P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav (2025) Mem0: building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413. Cited by: §1, Table 1.
P. Du (2026) Memory for autonomous llm agents: mechanisms, evaluation, and emerging frontiers. arXiv preprint arXiv:2603.07670. Cited by: §1, §2.
J. Fang, X. Deng, H. Xu, Z. Jiang, Y. Tang, Z. Xu, S. Deng, Y. Yao, M. Wang, S. Qiao, H. Chen, and N. Zhang (2025) LightMem: lightweight and efficient memory-augmented generation. CoRR abs/2510.18866. Cited by: Table 1, §5.1, §5.2.
Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, Q. Guo, M. Wang, and H. Wang (2023) Retrieval-augmented generation for large language models: A survey. CoRR abs/2312.10997. Cited by: §2.
Google (2025) Memory – agent development kit (ADK). Note: https://google.github.io/adk-docs/sessions/memory/ Cited by: §1.
Y. Hu, S. Liu, Y. Yue, G. Zhang, B. Liu, F. Zhu, J. Lin, et al. (2025) Memory in the age of AI agents. arXiv preprint arXiv:2512.13564. Cited by: §2.
G. Kang, Z. Ge, J. Hu, X. Zhang, L. Wang, and J. Zhan (2025a) BigVectorBench: heterogeneous data embedding and compound queries are essential in evaluating vector databases. Proceedings of the VLDB Endowment 18(6), pp. 1536–1549. Cited by: §2.
J. Kang, M. Ji, Z. Zhao, and T. Bai (2025b) Memory OS of AI agent. In EMNLP, pp. 25961–25970. Cited by: Table 1.
A. Khan, Y. Luo, W. Zhang, M. Zhou, and X. Zhou (2025) Retrieval-augmented generation (RAG): what is there for data management researchers?. ACM SIGMOD Record 54(4). Cited by: §1, §2.
G. Li, X. Zhou, and X. Zhao (2024) LLM for data management. Proc. VLDB Endow. 17(12), pp. 4213–4216. Cited by: §2.
Z. Li, S. Song, C. Xi, H. Wang, C. Tang, S. Niu, D. Chen, J. Yang, C. Li, Q. Yu, J. Zhao, Y. Wang, P. Liu, Z. Lin, P. Wang, J. Huo, T. Chen, K. Chen, K. Li, Z. Tao, J. Ren, H. Lai, H. Wu, B. Tang, Z. Wang, Z. Fan, N. Zhang, L. Zhang, J. Yan, M. Yang, T. Xu, W. Xu, H. Chen, H. Wang, H. Yang, W. Zhang, Z. J. Xu, S. Chen, and F. Xiong (2025) MemOS: A memory OS for AI system. CoRR abs/2507.03724. Cited by: Table 1, §5.2.
J. Liu, Y. Su, P. Xia, S. Han, Z. Zheng, C. Xie, M. Ding, and H. Yao (2026a) SimpleMem: efficient lifelong memory for LLM agents. CoRR abs/2601.02553. Cited by: Table 1.
S. Liu, S. Ponnapalli, S. Shankar, S. Zeighami, A. Zhu, S. Agarwal, R. Chen, S. Suwito, S. Yuan, I. Stoica, M. Zaharia, A. Cheung, N. Crooks, J. E. Gonzalez, and A. G. Parameswaran (2026b) Supporting our AI overlords: redesigning data systems to be agent-first. In Proceedings of the 16th Annual Conference on Innovative Data Systems Research (CIDR). Cited by: §1, §2.
J. Lu, S. An, M. Lin, G. Pergola, Y. He, D. Yin, X. Sun, and Y. Wu (2023) MemoChat: tuning llms to use memos for consistent long-range open-domain conversation. CoRR abs/2308.08239. Cited by: Table 1, §5.2.
Y. Luo, G. Li, J. Fan, and N. Tang (2026) Data agents: levels, state of the art, and open problems. arXiv preprint arXiv:2602.04261. Note: SIGMOD 2026 Tutorial. Cited by: §1.
A. Maharana, D. Lee, S. Turishcheva, K. Nham, G. Jandaghi, J. Pujara, and X. Ren (2024) Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). Cited by: §1, §4.1, §4.2, §4.3, §4.4, §4.5.
V. Markovic, L. Obradovic, L. Hajdu, and J. Pavlovic (2025) Optimizing the interface between knowledge graphs and llms for complex reasoning. CoRR abs/2505.24478. Cited by: Table 1.
MemoryAgentBench Team (2026) Evaluating memory in LLM agents via incremental multi-turn interactions. In Fourteenth International Conference on Learning Representations (ICLR). Cited by: §1, §4.1.
Microsoft (2025) Introducing copilot memory: a more productive and personalized AI. Note: https://techcommunity.microsoft.com/blog/microsoft365copilotblog/introducing-copilot-memory Cited by: §1.
OpenAI (2026) Context engineering for personalization – state management with long-term memory notes using OpenAI agents SDK. Note: https://developers.openai.com/cookbook/examples/agents_sdk/context_personalization/ Cited by: §1.
C. Packer, V. Fang, S. G. Patil, K. Lin, S. Wooders, and J. E. Gonzalez (2023) MemGPT: towards LLMs as operating systems. arXiv preprint arXiv:2310.08560. Cited by: §1, Table 1, §2.
P. Rasmussen, P. Paliychuk, T. Beauvais, and J. Ryan (2025) Zep: a temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956. Cited by: §1, Table 1.
A. Rezazadeh, Z. Li, W. Wei, and Y. Bao (2025) From isolated conversations to hierarchical schemas: dynamic tree memory representation for llms. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. Cited by: Table 1, §5.1.
H. Singh, N. Verma, Y. Wang, M. Bharadwaj, H. Fashandi, K. Ferreira, and C. Lee (2024) Personal large language model agents: a case study on tailored travel planning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, Miami, Florida, US, pp. 486–514. Cited by: §1.
H. Tan, Z. Zhang, C. Ma, X. Chen, Q. Dai, and Z. Dong (2025) MemBench: towards more comprehensive evaluation on the memory of llm-based agents. In ACL (Findings), Findings of ACL, pp. 19336–19352. Cited by: §1.
Z. Tang et al. (2026) LLM agent memory: a survey from a unified representation. arXiv preprint arXiv:2603.0359. Cited by: §2.
D. Wu, H. Wang, W. Yu, Y. Zhang, and K. Chang (2024) LongMemEval: benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813. Cited by: §1, §4.1, §4.3, §4.4, §4.5.
Y. Wu, T. Lin, Y. Zhou, F. Zhang, Q. Guo, X. Zhou, S. Wang, X. Liu, Y. Ma, and Y. Fang (2026) Memory in the LLM era: modular architectures and strategies in a unified framework. Proceedings of the VLDB Endowment. Cited by: §1, §2.
W. Xu et al. (2025) A-MEM: agentic memory for LLM agents. arXiv preprint arXiv:2502.12110. Cited by: §1, Table 1.
C. Yang, C. Zhou, Y. Xiao, S. Dong, L. Zhuang, et al. (2026) Graph-based agent memory: taxonomy, techniques, and applications. arXiv preprint arXiv:2602.05665. Cited by: §2.
H. Yu, T. Chen, J. Feng, J. Chen, W. Dai, Q. Yu, Y. Zhang, W. Ma, J. Liu, M. Wang, and H. Zhou (2025) MemAgent: reshaping long-context LLM with multi-conv rl-based memory agent. CoRR abs/2507.02259. Cited by: Table 1.
Z. Zhang, X. Bo, C. Ma, R. Li, X. Chen, Q. Dai, J. Zhu, Z. Dong, and J. Wen (2025) A survey on the memory mechanism of large language model based agents. ACM Transactions on Information Systems. Cited by: §2.
J. Zheng, X. Cai, Q. Li, D. Zhang, Z. Li, Y. Zhang, L. Song, and Q. Ma (2025a) LifelongAgentBench: evaluating LLM agents as lifelong learners. CoRR abs/2505.11942. Cited by: §4.1.
J. Zheng, C. Shi, X. Cai, Q. Li, D. Zhang, C. Li, D. Yu, and Q. Ma (2025b) Lifelong learning of large language model based agents: a roadmap. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §2.
W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang (2024) MemoryBank: enhancing large language models with long-term memory. In AAAI, pp. 19724–19731. Cited by: §1.
W. Zhou, X. Zhou, Q. He, G. Li, B. He, Q. Xu, and F. Wu (2026) Automating database-native function code synthesis with llms. Proc. ACM Manag. Data 3(4), pp. 141:1–141:26. Cited by: §1.
Z. Zhou, A. Qu, Z. Wu, S. Kim, A. Prakash, D. Rus, J. Zhao, B. K. H. Low, and P. P. Liang (2025) MEM1: learning to synergize memory and reasoning for efficient long-horizon agents. CoRR abs/2506.15841. Cited by: Table 1.
