@yibie: Recommend this article. The teams from SJTU and Tsinghua systematically evaluated 12 agent memory systems. It's not one of those "our model is better" papers but rather breaks down how to choose memory systems from a data management perspective—when to use RAG, when to use vector databases, when to use knowledge graphs. Long-term memory for agents...

X AI KOLs Timeline 06/26/26, 01:07 AM Papers

agent-memory llm-agents data-management benchmark memory-systems retrieval-augmented knowledge-graphs

Summary

This paper from SJTU and Tsinghua systematically evaluates 12 agent memory systems from a data management perspective, decomposing memory into four modules and providing guidelines on when to use RAG, vector databases, or knowledge graphs for long-term agent memory.

Recommend this article. The teams from SJTU and Tsinghua systematically evaluated 12 agent memory systems. It's not one of those "our model is better" papers but rather breaks down how to choose memory systems from a data management perspective: when to use RAG, when to use vector databases, when to use knowledge graphs.

How to do long-term memory for agents? A systematic comparison of 12 memory systems.

Paper: Are We Ready For An Agent-Native Memory System? Shanghai Jiao Tong University × Tsinghua × MemTensor, submitted June 23

Core Framework
Decomposes agent memory into 4 modules: Representation & Storage → Extraction → Retrieval & Routing → Maintenance. Each module evaluated independently.

Key Findings
• No single architecture is optimal for all scenarios. The alignment between memory structure and workload bottlenecks determines performance.
• Local maintenance (updating only affected parts) is much more efficient than global reconstruction, with no loss in effectiveness.
• End-to-end evaluation across 12 systems on 5 benchmarks (11 datasets) shows differences mainly in retrieval precision and long-horizon stability.

Practical Takeaways
If you're building agents that need long-term memory:
• Simple Q&A scenarios → vector retrieval is sufficient.
• Multi-hop reasoning → needs a knowledge graph layer.
• Frequently updated knowledge → local maintenance is better than global rebuild.

Paper: https://arxiv.org/abs/2606.24775
Code: https://github.com/MEMTRON/AgentMemory-Bench…

#Agent #LongTermMemory #SystemEvaluation

Original Article


Cached at: 06/26/26, 10:09 AM

Recommend this one: A joint team from Shanghai Jiao Tong University and Tsinghua University systematically evaluated 12 agent memory systems. This isn’t one of those “our model is better” papers. Instead, it deconstructs how to choose a memory system from a data management perspective — when to use RAG, when to use a vector database, and when to use a knowledge graph.

How to build long-term memory for agents? A systematic comparison of 12 memory systems

Paper: Are We Ready For An Agent-Native Memory System? Submitted on June 23 by Shanghai Jiao Tong University × Tsinghua University × MemTensor

Core Framework
Decomposes agent memory into 4 modules: Representation & Storage → Extraction → Retrieval & Routing → Maintenance. Each module is evaluated independently.

Key Findings
• No single architecture is optimal for all scenarios. The effectiveness depends on how well the memory structure aligns with the workload bottleneck.
• Localized maintenance (updating only the affected parts) is much more efficient than global reorganization, with comparable effectiveness.
• End-to-end evaluation of 12 systems across 5 benchmarks (11 datasets) shows that differences primarily lie in retrieval precision and long-horizon stability.

Practical Implications
If you are building an agent that requires long-term memory:
• Simple Q&A scenarios → Vector retrieval is sufficient
• Multi-hop reasoning needed → Requires a knowledge graph layer
• Frequently updated knowledge → Localized maintenance is better than global reconstruction
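To make the "localized maintenance vs. global reconstruction" point concrete, here is a minimal sketch, assuming a toy inverted keyword index. The `KeywordIndex` class, its method names, and the `tokens_processed` work counter are hypothetical illustrations, not anything taken from the paper or its code.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately crude assumption)."""
    return text.lower().split()


class KeywordIndex:
    """Toy inverted index mapping token -> set of memory ids."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)
        self.tokens_processed = 0  # crude proxy for maintenance work

    def _add(self, mem_id, text):
        for tok in tokenize(text):
            self.postings[tok].add(mem_id)
            self.tokens_processed += 1
        self.docs[mem_id] = text

    def _remove(self, mem_id):
        for tok in tokenize(self.docs.pop(mem_id)):
            self.postings[tok].discard(mem_id)
            self.tokens_processed += 1

    def update_local(self, mem_id, text):
        """Localized maintenance: touch only the postings of one memory."""
        if mem_id in self.docs:
            self._remove(mem_id)
        self._add(mem_id, text)

    def rebuild_global(self, docs):
        """Global reconstruction: throw the index away and re-index everything."""
        self.docs, self.postings = {}, defaultdict(set)
        for mem_id, text in docs.items():
            self._add(mem_id, text)

    def search(self, token):
        return set(self.postings.get(token.lower(), set()))
```

With 100 four-token memories and a single changed entry, the local path in this toy processes 8 tokens while the global rebuild processes 400, and both indexes answer the same lookups afterwards. Real systems differ in what "work" means (LLM calls, embedding computation, graph rewrites), so the counter is only a stand-in.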

Paper: https://arxiv.org/abs/2606.24775 Code: https://github.com/MEMTRON/AgentMemory-Bench…

#Agent #LongTermMemory #SystemEvaluation

Are We Ready For An Agent-Native Memory System?

Source: https://arxiv.org/html/2606.24775

Abstract.

Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, critical system-level concerns, including operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored.

In this paper, we present a systematic experimental study of agent memory from a data management perspective. We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance. Under this framework, we evaluate 12 representative memory systems and two reference baselines across five benchmark workloads spanning 11 datasets. Our extensive end-to-end evaluation shows that no single architecture dominates across all scenarios; instead, effectiveness depends heavily on how well the memory structure aligns with the workload bottleneck. Furthermore, through fine-grained ablation studies, we quantify their individual effects on representation fidelity, retrieval precision, update correctness, and long-horizon stability. Finally, we reveal cost-performance trade-offs under realistic workloads, showing localized maintenance is more cost-efficient than global reorganization. Based on these findings, we identify promising directions towards building truly agent-native memory systems. The code is publicly available at https://github.com/OpenDataBox/MemoryData.
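The four-module decomposition in the abstract can be pictured as four swappable parts behind one read/write interface. The following is a rough sketch under that reading only; `ModularMemory`, `MemoryRecord`, and the three helper functions are hypothetical names chosen for illustration and do not come from the paper's codebase.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MemoryRecord:
    record_id: int
    text: str
    turn: int


@dataclass
class ModularMemory:
    extract: Callable[[str], list]           # (ii) memory extraction
    score: Callable[[str, str], float]       # (iii) retrieval and routing
    maintain: Callable[[list], list]         # (iv) memory maintenance
    records: list = field(default_factory=list)  # (i) representation and storage
    next_id: int = 0

    def write(self, raw: str, turn: int) -> None:
        """Extract memory units from raw input, store them, then run maintenance."""
        for text in self.extract(raw):
            self.records.append(MemoryRecord(self.next_id, text, turn))
            self.next_id += 1
        self.records = self.maintain(self.records)

    def read(self, query: str, k: int = 3) -> list:
        """Return the k stored texts with the highest retrieval score."""
        ranked = sorted(self.records, key=lambda r: self.score(query, r.text), reverse=True)
        return [r.text for r in ranked[:k]]


def sentence_extractor(raw):
    """Toy extraction: one memory unit per sentence."""
    return [s.strip() for s in raw.split(".") if s.strip()]


def overlap_score(query, text):
    """Toy retrieval score: fraction of query tokens present in the text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def keep_latest(limit):
    """Toy maintenance: capacity-driven eviction of the oldest records."""
    def _maintain(records):
        return records[-limit:]
    return _maintain
```

Because each module is a plain callable, a controlled variant that changes one module at a time (for example, swapping `keep_latest` for a no-op) leaves the other three untouched, which is the kind of isolation the ablation setup relies on.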


1. Introduction

The rapid evolution of Large Language Model (LLM) agents has sparked a large body of exciting research and industrial efforts in building agent memory, i.e., the data management system of the LLM agent that supports long-horizon stateful execution and personalized interaction (Luo et al., 2026 (https://arxiv.org/html/2606.24775#bib.bib225); Khan et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib226); Liu et al., 2026b (https://arxiv.org/html/2606.24775#bib.bib227); Singh et al., 2024 (https://arxiv.org/html/2606.24775#bib.bib253); OpenAI, 2026 (https://arxiv.org/html/2606.24775#bib.bib243); Microsoft, 2025 (https://arxiv.org/html/2606.24775#bib.bib246); Google, 2025 (https://arxiv.org/html/2606.24775#bib.bib247)).

As shown in Figure 1 (https://arxiv.org/html/2606.24775#S1.F1), existing agent memory systems span a diverse set of architectural designs. (1) Stream-and-Reflection Memory System (e.g., MemoryBank (Zhong et al., 2024 (https://arxiv.org/html/2606.24775#bib.bib255))) maintains experiences as timestamped memory streams and periodically summarizes them into higher-level reflections that are written back into the stream; (2) Hierarchical Tiered Memory System (e.g., MemGPT (Packer et al., 2023 (https://arxiv.org/html/2606.24775#bib.bib230))) organizes memory into multiple levels with different capacities and access properties, separating core memory from archival storage with explicit movement (e.g., eviction and promotion) across tiers; (3) Knowledge Graph Memory System (e.g., Mem0g (Chhikara et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib231)), Zep (Rasmussen et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib233))) represents entities, relations, and their temporal evolution in structured forms (e.g., temporal knowledge graphs), often incorporating entity disambiguation and conflict resolution; (4) Composite Hybrid Memory System (e.g., A-MEM (Xu and others, 2025 (https://arxiv.org/html/2606.24775#bib.bib232))) routes schema-aware memory objects across multiple storage substrates, explicitly separating runtime state (e.g., KV caches) from long-term storage (e.g., vector, graph, keyword indexes), managed by dedicated maintenance modules. However, this rapid proliferation has also led to a highly fragmented landscape that lacks systematic evaluation from a data management perspective, raising a natural question: Are we ready for an agent-native memory system?

Figure 1. Typical Execution Workflows of Agent Memory.

In this paper, we revisit this question for agent memory. In particular, we focus on system-level memory over textual, structured, and even parametric representations (Chhikara et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib231); Packer et al., 2023 (https://arxiv.org/html/2606.24775#bib.bib230); Rasmussen et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib233); Xu and others, 2025 (https://arxiv.org/html/2606.24775#bib.bib232)), a fundamental infrastructure component for modern autonomous agents. We focus on memory-centric systems, rather than task-specific agent frameworks where memory is an auxiliary module (1 (https://arxiv.org/html/2606.24775#bib.bib269); Zhou et al., 2026 (https://arxiv.org/html/2606.24775#bib.bib222)). Agent memory is the persistent data management system that maintains information beyond a single inference step (e.g., historical interactions, environmental observations, and intermediate tool executions), decoupled from the LLMs' parametric weights and volatile context windows. Agent frameworks rely on these external memory systems (e.g., Mem0 (Chhikara et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib231)), Letta (Packer et al., 2023 (https://arxiv.org/html/2606.24775#bib.bib230)), Zep (Rasmussen et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib233)), and A-MEM (Xu and others, 2025 (https://arxiv.org/html/2606.24775#bib.bib232))) to actively write, update, index, and route relevant context back into the reasoning loop. The capability of a long-horizon agent largely depends on the reliability and efficiency of this memory layer. An agent adopting a poorly designed memory architecture can suffer from factual contradictions, catastrophic forgetting, or unacceptable latencies during continuous execution (Du, 2026 (https://arxiv.org/html/2606.24775#bib.bib258); Zheng et al., 2025b (https://arxiv.org/html/2606.24775#bib.bib257)).

Recent benchmarks (Maharana et al., 2024 (https://arxiv.org/html/2606.24775#bib.bib240); Wu et al., 2024 (https://arxiv.org/html/2606.24775#bib.bib241); MemoryAgentBench Team, 2026 (https://arxiv.org/html/2606.24775#bib.bib242); Tan et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib256)) have evaluated agent memory and shown that external memory can improve agent performance on tasks requiring factual recall and long-context understanding. However, these evaluations are largely rooted in natural language processing and have multiple limitations when treating agent memory as a data management system (see Section 2 (https://arxiv.org/html/2606.24775#S2)). First, they fail to evaluate many representative memory architectures (e.g., systems such as MemoChat, MemTree, and LightMem have not been included in prior evaluations) under unified workloads, making principled cross-system comparisons difficult. Efforts from the database community (Wu et al., 2026 (https://arxiv.org/html/2606.24775#bib.bib259)) limit their scope to a few chatbot-centric datasets (e.g., LoCoMo and LongMemEval only), neglecting complex agentic execution scenarios. Second, existing benchmarks predominantly rely on single-sided, end-to-end task success metrics (e.g., F1 and BLEU scores) rather than a comprehensive evaluation suite. They fail to explicitly isolate and measure multi-dimensional performance indicators such as evidence-level retrieval fidelity, dynamic update robustness under conflicting knowledge, and long-horizon stability. Third, they rarely measure key operational costs from a systems perspective, such as index construction time and query latency, which are critical for production deployments. Last, they treat memory systems as monolithic black boxes rather than decomposing them into fundamental data management modules for isolated, fine-grained analysis.
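The operational costs named above, index construction time and query latency, can be recorded with a very small harness. This is a sketch of one way to do it, assuming the memory system exposes a build step and a query step as plain callables; `measure_costs` and its parameters are hypothetical and are not the paper's testbed.

```python
import time


def measure_costs(build_index, run_query, corpus, queries):
    """Time one index build and each query; returns wall-clock seconds.

    build_index(corpus) -> index and run_query(index, query) -> result are
    caller-supplied callables (assumed interface, for illustration only).
    """
    start = time.perf_counter()
    index = build_index(corpus)
    build_seconds = time.perf_counter() - start

    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(index, query)
        latencies.append(time.perf_counter() - start)

    return {
        "build_seconds": build_seconds,
        "num_queries": len(latencies),
        "mean_latency": sum(latencies) / len(latencies) if latencies else 0.0,
        "max_latency": max(latencies) if latencies else 0.0,
    }


def toy_build(corpus):
    """Toy 'index': a lowercased copy of the corpus."""
    return [doc.lower() for doc in corpus]


def toy_query(index, query):
    """Toy query: linear substring scan."""
    return [doc for doc in index if query.lower() in doc]
```

Calling `measure_costs(toy_build, toy_query, corpus, queries)` returns a small dictionary per system, which makes it straightforward to set cost next to accuracy. A fuller harness would repeat runs and also count LLM tokens or API calls, which wall-clock time alone does not capture.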

We overcome these limitations and conduct comprehensive experiments and analyses from a data management perspective. The contributions are as follows.

(1) Technology Decomposition and Taxonomy (Section 3 (https://arxiv.org/html/2606.24775#S3)). We decompose existing agent memory systems into four core components: (i) memory representation and storage, (ii) memory extraction, (iii) memory retrieval and routing, and (iv) memory maintenance. For each component, we further establish a structured taxonomy (https://github.com/OpenDataBox/awesome-agent-memory) by categorizing existing approaches according to their underlying design principles, enabling principled comparisons.

(2) Overall End-to-End Performance Evaluation (Section 4 (https://arxiv.org/html/2606.24775#S4)). We conduct end-to-end evaluations under a unified and fair testbed (e.g., unified time-overhead traces; https://github.com/OpenDataBox/MemoryData) across five distinct benchmark workloads encompassing 11 datasets. Our study includes 12 representative memory systems, each embodying different combinations of representation, storage, routing, and maintenance strategies. We evaluate their performance from five perspectives: task effectiveness (RQ1), retrieval fidelity (RQ2), dynamic update robustness (RQ3), long-horizon stability (RQ4), and operational cost (RQ5).

(3) Fine-Grained Technical Component Evaluation (Section 5 (https://arxiv.org/html/2606.24775#S5)). Leveraging our four-module framework, we conduct controlled and fine-grained experiments on representative strategies within each technique component. By systematically generating controlled variants that modify one module at a time, we quantify their performance trade-offs and assess their individual impacts on representation fidelity, routing precision, and update correctness.

(4) Insightful Findings. Based on our experimental results and in-depth analysis, we distill a set of insightful findings regarding the cost–performance trade-offs of agent memory systems:

❶ Are Memory Systems Effective Across Different Agent Request Workloads? No single memory architecture dominates all scenarios. Composite hybrid systems lead on conversational QA, while graph-based methods excel in single-hop factual recall but struggle with temporal reasoning. Moreover, effective memory systems remain robust across LLM backbone variants because they externalize evidence localization before answer generation.

❷ How Accurately Do Memory Systems Retrieve Stored Evidence? Explicit query planning and balanced hybrid search maximize contextual relevance. However, retrieval accuracy degrades significantly as the temporal distance between the evidence and the query increases, exposing limitations of similarity-based retrieval.

❸ Are Memory Systems Robust Under Dynamic Updates? Graph-based methods handle knowledge updates most reliably, whereas popular fact-extraction plugins and append-only stores struggle with targeted overwrites. Systems lacking lifecycle management return stale facts, leading to “hallucinations of the past”.

❹ Do Memory Systems Remain Stable Over Long Horizons? Many append-only memory stores suffer from catastrophic degradation as evidence becomes more distant. For time-dependent queries, raw long-context retrieval still outperforms most memory-backed approaches, indicating that standard semantic consolidation often destroys crucial chronological cues.

❺ What Are the Operational Costs of Agent Memory? Highly structured systems incur orders-of-magnitude higher index construction time and query latency than lightweight stores, yet do not consistently deliver proportional accuracy gains.

❻ When Do Individual Memory Components Go Wrong? Each layer of abstraction (e.g., compression, summarization, and fact extraction) progressively discards information. Furthermore, fine-grained LLM-based extraction can yield modest precision gains but substantially degrade multi-hop reasoning. Finally, conservative memory consolidation serves as the best default maintenance strategy, whereas delayed flushing creates a deceptive trade-off between surface-level coverage and actual answerability.
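Finding ❸ above contrasts append-only stores with systems that can perform targeted overwrites. The sketch below illustrates how a stale fact can surface, assuming facts are simplified to (subject, attribute, value) triples. Both classes are hypothetical, and the first-match read in `AppendOnlyStore` is a deliberately naive stand-in for retrieval that carries no recency signal, not a description of any evaluated system.

```python
class AppendOnlyStore:
    """Keeps every written fact; nothing is ever overwritten or retired."""

    def __init__(self):
        self.facts = []

    def write(self, subject, attribute, value):
        self.facts.append((subject, attribute, value))

    def read(self, subject, attribute):
        # Naive assumption: return the first match, ignoring recency.
        for s, a, v in self.facts:
            if (s, a) == (subject, attribute):
                return v
        return None


class KeyedStore:
    """Targeted overwrite: one current value per (subject, attribute) key."""

    def __init__(self):
        self.current = {}
        self.history = []  # superseded values, kept for auditing

    def write(self, subject, attribute, value):
        key = (subject, attribute)
        if key in self.current:
            self.history.append((key, self.current[key]))
        self.current[key] = value

    def read(self, subject, attribute):
        return self.current.get((subject, attribute))
```

If both stores receive `("alice", "city", "Paris")` followed by `("alice", "city", "Berlin")`, the append-only read returns the outdated "Paris", while the keyed store returns "Berlin" and moves "Paris" into its history. Keeping the superseded value rather than deleting it is one possible design choice here; it preserves the chronological trail that time-dependent queries may need.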

2. Preliminaries

To support the discussion in the rest of this paper, we first clarify the scope of agent memory from a data management perspective. Although recent studies have examined memory from viewpoints such as cognitive taxonomy, agent architecture, and graph-based organization (Zhang et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib260); Du, 2026 (https://arxiv.org/html/2606.24775#bib.bib258); Hu et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib238); Wu et al., 2026 (https://arxiv.org/html/2606.24775#bib.bib259); Tang and others, 2026 (https://arxiv.org/html/2606.24775#bib.bib261); Yang et al., 2026 (https://arxiv.org/html/2606.24775#bib.bib239)), the underlying concept is still often treated primarily as an algorithmic component of the LLM or agent pipeline (Zhang et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib260); Du, 2026 (https://arxiv.org/html/2606.24775#bib.bib258); Hu et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib238)). In contrast, we study agent memory as a standalone data management object and system infrastructure, with explicit attention to how it is represented, stored, retrieved, updated, and maintained under real agent workloads. Under this view, we introduce a set of definitions below.

Table 1. Taxonomy and Characteristics of Agent Memory Systems.

Columns: Category | Method | Memory Representation & Storage (Representation, Storage) | Memory Extraction | Memory Retrieval & Query Routing | Memory Maintenance

MemoChat (Lu et al., 2023 (https://arxiv.org/html/2606.24775#bib.bib4))
• Representation: ❶ Token-Level Sequence (Structured JSON Memos)
• Storage: ❶ Transient In-Context Registers
• Memory Extraction: ❸ Schema-Constrained Extraction (LLM Topic Segmentation)
• Memory Retrieval & Query Routing: ❹ Autonomous Agentic Routing (LLM Topic Selection)
• Memory Maintenance: ❸ LLM-Driven Semantic Consolidation (Turn-Triggered)

Mem0 (Chhikara et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib231))
• Representation: ❶ Token-Level Sequence (Discrete Facts)
• Storage: ❷ Specialized Single-Engine (Vector DB)
• Memory Extraction: ❷ Schema-Free Extraction
• Memory Retrieval & Query Routing: ❷ Semantic-Based Retrieval
• Memory Maintenance: ❸ LLM-Driven Semantic Consolidation (Tool-Calling)

MEM1 (Zhou et al., 2025 (https://arxiv.org/html/2606.24775#bib.bib266))
• Representation: ❶ Token-Level Sequence
• Storage: ❶ Transient In-Context Registers
• Memory Extraction: ❶ Raw Sequence Concatenation
• Memory Retrieval & Query Routing: ❶ Native Attention-Based Retrieval
• Memory Maintenance: ❷ Capacity-Driven Physical Eviction

SequentialConte



Similar Articles

@chenchengpro: The more fancy "memory" architectures you stack on an LLM Agent, the better the results? Not necessarily. A new paper tested 12 memory systems and found no universal winner. It decomposes Agent memory like a database — representation & storage, extraction, retrieval & routing, and maintenance — and tested Mem0, Letta, Zep, C…

@wquguru: https://x.com/wquguru/status/2069641926752780384

@servasyy_ai: https://x.com/servasyy_ai/status/2057463627255570937

@TencentAI_News: We spent 6 months on one problem: agents losing context in long sessions. Ended up building and open-sourcing an agent …

@rohanpaul_ai: AI agents should treat memory as a changing web of useful connections, not static storage. Most agent memory systems re…
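The paper proposes FluxMem, a system that treats agent memory as a continuously evolving graph structure and improves memory effectiveness by dynamically repairing connections and distilling skills. Experiments show that it outperforms existing methods on multiple tasks; for example, it reaches 95.06% accuracy on LoCoMo and improves on Kimi K2 by 12.73 points on GAIA.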