Tag
ComMem proposes complementary memory systems inspired by biological memory to improve test-time adaptation of vision-language models, outperforming state-of-the-art methods on 15 benchmarks.
This article argues that good AI memory systems should emerge from evaluation rather than be designed from the top down. The author treats memory as a second-order effect that evolves under pressure and proposes a longitudinal evaluation framework.
This paper from SJTU and Tsinghua systematically evaluates 12 agent memory systems from a data management perspective, decomposing memory into four modules and providing guidelines on when to use RAG, vector databases, or knowledge graphs for long-term agent memory.
This paper systematically evaluates 12 LLM agent memory systems, decomposes them into four modules, finds that no single architecture dominates all scenarios, and reveals cost-performance trade-offs and common issues (e.g., 'past hallucinations').
An exploration of how AI agent memory systems often miss crucial cognitive processes like working memory, drawing parallels to anterograde amnesia, and offering design guidance for more effective solutions.
MEMPROBE is a benchmark that evaluates long-term memory in LLM agents by reconstructing hidden user states from the agent's memory after interaction.
This paper presents a systematic experimental study of agent memory systems from a data management perspective, decomposing memory into four core modules and evaluating 12 representative systems across 11 datasets, finding no single architecture dominates and highlighting cost-performance trade-offs.
Discussion of different schools of thought for building memory systems in LLMs, with a focus on graph memory and its potential for human creativity and inductive bias.
A developer working on an AI agent wrapper observes that the agent's hallucinations of user responses can actually aid problem-solving, and proposes treating such hallucinations as imagined events rather than errors.
This article argues that filesystems, due to their long history and extensive inclusion in LLM training data, offer a natural and intuitive primitive for AI agent memory, outperforming traditional databases and APIs for exploratory reasoning and persistent context.
A detailed guide on building an agentic research framework using a multi-LLM system with persistent memory, allowing researchers to avoid re-explaining context across sessions by leveraging file-based identity, project docs, and memory indices.
This paper introduces Engram, an open-source bi-temporal memory engine for LLM agents. It retrieves a compact context slice (∼9.6k tokens) and outperforms the full-history baseline (79k tokens) by 10.4 accuracy points on LongMemEval, using a hybrid read path that fuses dense, lexical, graph, and temporal signals.
A reflection on agent memory as primarily an infrastructure/data-management problem rather than an AI problem, focusing on practical complexities like permissions, scopes, and revision history.
CL-Bench is a new expert-validated benchmark across six domains that evaluates whether LLM-based agents genuinely learn from sequential experience. It finds that naive in-context learning often outperforms dedicated memory systems, indicating current architectures add overhead rather than genuine learning.
An in-depth analysis of ChatGPT Dreaming V3's memory architecture, explaining how it synthesizes a coherent memory state from raw sources and comparing it to other open-source memory frameworks like mem0, supermemory, and Letta.
This paper evaluates eight memory systems for LLM agents across five diverse scenarios, finding that giving agents active control over storage and retrieval (rather than passive pipelines) yields the best cross-scenario generalization, leading to the proposed AutoMEM framework.
The article highlights the lack of version control and observability in AI memory systems compared to code version control, and questions the current state of tooling for memory history.
MemPro is a system-level evolution framework that treats the memory construction–retrieval pipeline as an evolvable program, using an Evolving Agent to iteratively diagnose failures and create improved versions. Experiments on long-horizon benchmarks show consistent improvement over static and prompt-level baselines with favorable performance–cost trade-off.
The article warns that AI memory systems, while impressive in demos, often lead to stale facts, conflicting preferences, and broken summaries, creating future debugging nightmares and technical debt.
MemTrace automatically traces errors in LLM memory systems by converting memory pipelines into executable graphs, identifying root causes of failures, and self-correcting to improve performance by up to 7.62%.
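The MemTrace entry above mentions turning a memory pipeline into an executable graph and locating the root cause of a failure. The sketch below is one way to picture that idea and is not MemTrace's implementation: the `Stage` class, the `trace_root_causes` function, the per-stage `check` callbacks, and the toy extract/store/retrieve pipeline are all hypothetical names chosen for illustration. It assumes stages are listed in dependency order and reports a failing stage as a root cause only when nothing upstream of it has already failed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Stage:
    """One node of a (hypothetical) memory pipeline graph."""
    name: str
    run: Callable[[Dict[str, Any]], Any]   # receives outputs of its dependencies
    check: Callable[[Any], bool]           # returns True if the output looks valid
    deps: List[str] = field(default_factory=list)


def trace_root_causes(stages: List[Stage]) -> List[str]:
    """Execute stages in the given (topological) order and return the names
    of failing stages that have no failing or tainted ancestor."""
    outputs: Dict[str, Any] = {}
    tainted: Dict[str, bool] = {}
    roots: List[str] = []
    for stage in stages:
        outputs[stage.name] = stage.run({d: outputs[d] for d in stage.deps})
        failed = not stage.check(outputs[stage.name])
        upstream_tainted = any(tainted[d] for d in stage.deps)
        if failed and not upstream_tainted:
            roots.append(stage.name)
        # A stage is tainted if it failed itself or consumed tainted input.
        tainted[stage.name] = failed or upstream_tainted
    return roots


# Toy pipeline (assumption): extraction silently drops the fact "has a dog".
stages = [
    Stage(
        name="extract",
        run=lambda inputs: ["lives in Paris"],
        check=lambda out: "has a dog" in out,
    ),
    Stage(
        name="store",
        run=lambda inputs: list(inputs["extract"]),
        check=lambda out: isinstance(out, list),
        deps=["extract"],
    ),
    Stage(
        name="retrieve",
        run=lambda inputs: [m for m in inputs["store"] if "dog" in m],
        check=lambda out: len(out) > 0,
        deps=["store"],
    ),
]

roots = trace_root_causes(stages)
print(roots)
```

In this toy run the retrieval stage also fails its check, but it is not reported because its input was already tainted by the extraction failure. A real system would need far richer checks than the hand-written lambdas used here.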