
# Human-Inspired Memory Architecture for LLM Agents
Source: [https://arxiv.org/html/2605.08538](https://arxiv.org/html/2605.08538)
Doga Kerestecioglu, Alexei Robsky, Clemens Vasters, Anshul Sharma, Yitzhak Kesselman (Microsoft)

###### Abstract

Current LLM agents lack principled mechanisms for managing persistent memory across long interaction horizons. We present a biologically-grounded memory architecture comprising six cognitive mechanisms: (1) sleep-phase consolidation, (2) interference-based forgetting, (3) engram maturation, (4) reconsolidation upon retrieval, (5) entity knowledge graphs, and (6) hybrid multi-cue retrieval. Each mechanism addresses a specific failure mode of naive memory accumulation. We introduce a synthetic calibration methodology that derives all pipeline thresholds without benchmark data exposure, eliminating a common source of evaluation leakage. We evaluate on two benchmarks. First, a VSCode issue-tracking dataset (13K issues, 120K events) where deduplication-based consolidation achieves 97.2% retention precision with 58% store reduction (+21.8 pp over baseline). Second, the LongMemEval personal-chat benchmark, where we conduct the first streaming M-tier evaluation (475 sessions, ~540K unique turns). At a 200K-token context budget, our pipeline matches raw retrieval accuracy (70.1% vs. 71.2%, overlapping 95% CI) while exposing a tunable accuracy/store-size operating curve. At S-tier scale (50 sessions), dedup-based consolidation yields a +13.3 pp improvement in preference recall.

## 1 Introduction

Large Language Model (LLM) agents have demonstrated remarkable capabilities in reasoning, planning, and task execution. Yet a fundamental limitation constrains their utility in enterprise settings: the absence of persistent, adaptive memory. Current approaches fall into three categories, each with shortcomings.

**Stateless agents** treat each interaction independently, losing all context between sessions. This forces users to repeatedly re-establish context and prevents agents from learning from past interactions.

**Context window approaches** attempt to maintain memory by expanding the prompt with historical information. Recent advances have extended context windows to millions of tokens, but this approach scales cost without scaling intelligence (i.e., the agent still cannot prioritize, forget, or learn). Similarly, approaches based on rolling summarization suffer from compounding information loss as history grows.

**Vector database approaches (RAG)** store and retrieve information based on embedding similarity. While an improvement over stateless designs, these systems treat all information equally, lack mechanisms for consolidation or forgetting, and cannot evolve memories based on new information.

We propose a fundamentally different approach inspired by human neurology and present a memory architecture grounded in the neuroscience of human memory systems. Our design implements the key mechanisms that make biological memory effective by introducing (1) multi-tier storage, (2) offline consolidation, (3) adaptive forgetting, (4) gradual maturation, and (5) reconsolidation upon retrieval. The architecture is designed for enterprise-grade scalability, governance, and integration.

#### Contributions.

This paper makes the following contributions:

- A biologically-grounded memory architecture mapping six cognitive mechanisms to system components, with detailed specifications for each. Four mechanisms have ablation evidence in this paper (*consolidation*, *forgetting*, *graph retrieval*, *importance scoring*). Two are implemented for operational completeness but require deployment conditions absent from current benchmarks: *maturation* requires repeated retrieval over weeks to activate, and *reconsolidation* requires cross-session contradictions, both of which are structurally absent from LongMemEval's construction.
- A synthetic calibration methodology that derives all pipeline thresholds from LLM-generated corpora produced from a fixed specification (no benchmark exposure), eliminating evaluation leakage.
- A streaming evaluation protocol that processes sessions sequentially in temporal order, simulating realistic agent deployment.
- A nine-configuration ablation study on LongMemEval S-tier with bootstrapped 95% confidence intervals, isolating the contributions of consolidation, forgetting, reconsolidation, and graph retrieval.
- A streaming M-tier evaluation (475 sessions per question, ~540K unique turns), demonstrating that the pipeline matches raw retrieval accuracy at a 200K-token context budget and exposes a tunable accuracy/store-size operating curve at lower budgets.

The remainder of this paper is organized as follows. §[2](https://arxiv.org/html/2605.08538#S2) reviews the biological foundations. §[3](https://arxiv.org/html/2605.08538#S3) presents the technical architecture. §§[4](https://arxiv.org/html/2605.08538#S4)–[6](https://arxiv.org/html/2605.08538#S6) detail the consolidation, forgetting, and maturation mechanisms. §[7](https://arxiv.org/html/2605.08538#S7) covers retrieval and agent integration. §[8](https://arxiv.org/html/2605.08538#S8) describes experimental methodology and §[9](https://arxiv.org/html/2605.08538#S9) presents evaluation results. §§[10](https://arxiv.org/html/2605.08538#S10)–[12](https://arxiv.org/html/2605.08538#S12) address related work, limitations, and conclusions.

## 2 Biological Foundations

Our architecture draws on six established neuroscientific principles. Table [1](https://arxiv.org/html/2605.08538#S2.T1) summarizes the mapping from biological mechanism to system design.

Table 1: Mapping from system mechanism to neuroscience inspiration. The core insight from complementary learning systems theory (McClelland et al., [1995](https://arxiv.org/html/2605.08538#bib.bib12)) is that rapid episodic encoding (hippocampus) and slow semantic extraction (neocortex) serve different roles. Our architecture mirrors this: a vector store provides high-fidelity episodic retrieval while a knowledge graph accumulates semantic relationships through consolidation. Sleep-phase consolidation (Frankland and Bontempi, [2005](https://arxiv.org/html/2605.08538#bib.bib4)) runs offline to deduplicate and merge redundant traces. Forgetting combines exponential trace decay (the Ebbinghaus forgetting curve) with retrieval-induced interference (Anderson, [2003](https://arxiv.org/html/2605.08538#bib.bib15)). Memory maturation follows the finding of Kitamura et al. ([2017](https://arxiv.org/html/2605.08538#bib.bib6)) that engrams form immediately but remain "silent" for days before becoming explicitly retrievable. Reconsolidation (Nader et al., [2000](https://arxiv.org/html/2605.08538#bib.bib21)) enables retrieved memories to be updated with new information during a lability window, preventing stale facts from persisting indefinitely. The graph layer is grounded in semantic-network and spreading-activation theories from cognitive psychology.

## 3 Technical Architecture

The architecture maps three biological memory tiers to system components: *short-term* (prefrontal cortex → hot cache, in-memory with TTL of minutes to hours), *medium-term* (hippocampus → warm episodic store, full fidelity with TTL of days to weeks), and *long-term* (neocortex → knowledge graph, semantic and permanent). Concretely, the system comprises three layers: (1) an *ingestion layer* that stores raw events with embeddings and metadata enrichment; (2) an *episodic store* providing time-indexed vector search over recent memories with tiered caching; and (3) a *semantic graph* organizing long-term memories by entity relationships, enabling multi-hop traversal queries. All three layers share a unified data layer, eliminating data movement between services and enabling unified governance.
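The tier mapping above can be sketched as a simple age-based routing rule. This is an illustrative reading, not the paper's implementation: the TTL values below are placeholders for the "minutes to hours" and "days to weeks" ranges in the text, and real routing also depends on consolidation state.

```python
# Illustrative sketch of the three-tier mapping; TTLs are placeholder
# values for the ranges stated in the text, not configuration from the paper.

TIERS = [
    ("hot_cache", 6 * 3600),         # short-term: in-memory, TTL on the order of hours
    ("episodic_store", 14 * 86400),  # medium-term: full fidelity, TTL days to weeks
    ("knowledge_graph", None),       # long-term: semantic, permanent (no TTL)
]

def tier_for_age(age_seconds):
    """Route a memory to the first tier whose TTL has not yet expired."""
    for name, ttl in TIERS:
        if ttl is None or age_seconds < ttl:
            return name
    return "knowledge_graph"
```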

## 4 Memory Consolidation Pipeline

The consolidation pipeline implements the biological sharp-wave ripple mechanism through scheduled batch processing (every 6 hours by default, tunable per domain) that identifies, validates, transforms, and promotes valuable memories to long-term storage. Events not promoted remain in the episodic store, subject to TTL-based expiration; active forgetting mechanisms operate independently (Section [5](https://arxiv.org/html/2605.08538#S5)).

#### Importance scoring.

Each pending event is scored for long-term retention value using five factors (Table [2](https://arxiv.org/html/2605.08538#S4.T2)):

$$S(e)=\sum_{i=1}^{5}w_{i}\cdot f_{i}(e) \qquad (1)$$

where $f_{i}$ represents each scoring factor and $w_{i}$ its weight. Events are classified by composite score: promote (top 20%), retain (middle 60%), and prune (bottom 20%).

Table 2: Importance scoring factors with default weights.

#### Downstream stages.

Before filtering, a temporal validation step detects out-of-order arrivals, duplicates, and causal inversions, quarantining anomalous events (TTL: 15 min) to prevent "agent déjà vu." Score-based filtering then downweights automated and low-authority events while preserving high-surprise system alerts. Promoted events are transformed into semantic summaries via LLM-generated gists and clustering, then integrated into the knowledge graph with entity edges. Newly integrated memories begin in a "silent" state with low activation strength (Section [6](https://arxiv.org/html/2605.08538#S6)), ensuring only stable knowledge influences long-term reasoning.
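The importance score of Eq. 1 and the promote/retain/prune split can be sketched as follows. This is a minimal illustration: the factor values, weights, and tie-handling at the percentile cutoffs are hypothetical, not the paper's calibrated implementation.

```python
# Sketch of the five-factor importance score (Eq. 1) and the
# promote/retain/prune classification. Weights and tie-handling are illustrative.

def importance(factors, weights):
    """S(e) = sum_i w_i * f_i(e) over five scoring factors."""
    assert len(factors) == len(weights) == 5
    return sum(w * f for w, f in zip(weights, factors))

def classify(scores):
    """Promote top 20%, retain middle 60%, prune bottom 20% by composite score."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    promote_cut = ranked[max(0, int(0.2 * n) - 1)]   # lowest score still in top 20%
    prune_cut = ranked[min(n - 1, int(0.8 * n))]     # highest score in bottom 20%
    labels = []
    for s in scores:
        if s >= promote_cut:
            labels.append("promote")
        elif s <= prune_cut:
            labels.append("prune")
        else:
            labels.append("retain")
    return labels
```

With ties at a cutoff, this sketch rounds toward promotion; a production pipeline would need an explicit tie-breaking rule.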

## 5 Adaptive Forgetting

Our architecture treats forgetting as essential maintenance that improves retrieval accuracy, reduces interference, and ensures relevance.

#### Passive decay.

Events not consolidated are automatically removed when their TTL expires. For events awaiting consolidation, importance scores decay:

$$I(t)=I_{0}\cdot e^{-\lambda t} \qquad (2)$$

where $\lambda$ is the decay rate (empirically optimized: $\lambda=0.001$, corresponding to a half-life of ≈29 days) and $t$ is hours since encoding.
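A minimal sketch of the decay curve in Eq. 2, using the paper's optimized rate of λ = 0.001 per hour, also confirms the quoted half-life: ln(2)/0.001 ≈ 693 hours ≈ 29 days.

```python
import math

# Passive-decay curve (Eq. 2) with the paper's empirically optimized rate.
LAMBDA = 0.001  # decay rate per hour

def decayed_importance(i0, hours):
    """I(t) = I0 * exp(-lambda * t), with t in hours since encoding."""
    return i0 * math.exp(-LAMBDA * hours)

# Half-life check: ln(2) / lambda = ~693 hours, i.e. ~28.9 days,
# matching the ~29-day figure in the text.
half_life_days = math.log(2) / LAMBDA / 24
```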

#### Interference-based forgetting.

When memories share features (similar content, overlapping entities), they create retrieval interference. We compute an interference score and selectively forget high-interference, low-value memories:

$$I_{\text{interference}}=\sum_{j}w_{j}\cdot\text{sim}(m_{i},m_{j}) \qquad (3)$$

where $w_{j}$ represents interference weights (retroactive = 0.6, proactive = 0.4), reflecting the finding that new learning more strongly disrupts old memories.

#### Graceful degradation.

Before complete forgetting, memories undergo progressive fidelity reduction through six levels, from the full episodic record (L0, 100%) through summary (L2, 50%) and gist (L3, 25%) to a tombstone record (L5, 0%) that preserves only the fact that a memory existed. Degradation is triggered by age combined with memory score, not storage economics.
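The interference score of Eq. 3 can be sketched directly. The 0.6/0.4 retroactive/proactive weights come from the text; the memory representation (timestamp, content) and the pluggable similarity function are illustrative assumptions.

```python
# Sketch of the interference score (Eq. 3). Newer memories disrupting older
# ones (retroactive interference) are weighted 0.6; older disrupting newer
# (proactive) are weighted 0.4, as in the text. Representation is illustrative.

def interference(target, others, sim):
    """Weighted sum of similarities between `target` and each other memory.

    target, others: (timestamp, content) pairs; sim: similarity in [0, 1].
    """
    score = 0.0
    for other in others:
        w = 0.6 if other[0] > target[0] else 0.4  # newer -> retroactive weight
        score += w * sim(target[1], other[1])
    return score
```

High-interference, low-value memories would then be candidates for forgetting or fidelity reduction.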

## 6 Memory Maturation Dynamics

Following the Kitamura et al. finding that engrams form immediately but remain "silent" before becoming retrievable, our architecture implements memory maturation. When an event is consolidated, the full-fidelity episodic record remains immediately retrievable, while a summarized semantic version is created in the knowledge graph with activation_strength = 0.0. This dual-trace design ensures agents remain responsive: recent events are always available from the episodic store while the semantic layer accumulates only verified, stable knowledge.

Activation strength evolves according to a sigmoid function:

$$A(t)=\frac{1}{1+e^{-(t-t_{1/2})/k}} \qquad (4)$$

where $t_{1/2}$ is the maturation half-life (default: 168 hours) and $k$ is the slope parameter (default: 48). A memory starts silent ($A\approx 0.03$), reaches the retrieval threshold ($A=0.5$) at one week, and is fully mature ($A>0.9$) at two weeks. Below threshold, memories can still exert implicit *priming* effects, influencing the relevance scoring of other memories without being explicitly surfaced. This mirrors the biological distinction between implicit and explicit memory.
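The maturation sigmoid of Eq. 4, with the stated defaults, reproduces the trajectory described above: silent at consolidation, at threshold after one week, mature after two.

```python
import math

# Activation-strength sigmoid (Eq. 4) with the paper's defaults:
# maturation half-life t_1/2 = 168 hours (one week), slope k = 48.

def activation(t, t_half=168.0, k=48.0):
    """A(t) = 1 / (1 + exp(-(t - t_half) / k)), t in hours since consolidation."""
    return 1.0 / (1.0 + math.exp(-(t - t_half) / k))

# A(0)   ~ 0.03  -> silent
# A(168) =  0.5  -> retrieval threshold at one week
# A(336) >  0.9  -> fully mature at two weeks
```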

## 7 Retrieval and Agent Integration

### 7.1 Hybrid Retrieval

Memory retrieval combines episodic and semantic pathways, mirroring the brain's dual retrieval systems. Critically, the system favors episodic retrieval for recent queries, ensuring users never experience delays due to semantic maturation.

**Episodic retrieval.** Vector similarity search across the episodic store's hot and warm tiers, with temporal filters for session and recent memories. This is the primary path for recently formed memories.

**Semantic retrieval.** Knowledge graph multi-hop traversal for relationship-aware, schema-grounded knowledge. This path supplements episodic retrieval with mature, abstracted knowledge and becomes primary for older information after episodic TTL expiration.

**Hybrid GraphRAG.** Vector search seeds graph traversal, combining recency with relational context. Retrieval spans all three tiers with priority ordering: (1) short-term hot cache for the current session (sub-second, highest priority); (2) warm vector store for recent episodic memories (filtered by importance score); (3) knowledge graph traversal for mature semantic memories (filtered by activation strength). Results are merged, deduplicated, and ranked with a recency boost.
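The merge-deduplicate-rank step above can be sketched as follows. The field names, the exponential form of the recency boost, and its parameters are assumptions for illustration; the paper only states that results are merged, deduplicated, and ranked with a recency boost in tier-priority order.

```python
# Illustrative merge/dedup/rank across the three tiers. Field names and the
# half-life-style recency boost are assumptions, not the paper's formula.

def merge_and_rank(hot, warm, graph, now, recency_weight=0.1, half_life_hours=72.0):
    """Merge results (dicts with 'id', 'score', 'timestamp' in seconds),
    deduplicate keeping the highest-priority copy, rank with a recency boost."""
    seen, merged = set(), []
    for tier_rank, results in enumerate((hot, warm, graph)):  # priority order
        for r in results:
            if r["id"] in seen:
                continue  # dedup: keep the copy from the higher-priority tier
            seen.add(r["id"])
            age_h = (now - r["timestamp"]) / 3600.0
            boost = recency_weight * 0.5 ** (age_h / half_life_hours)
            merged.append((r["score"] + boost, -tier_rank, r))
    # Rank by boosted score, breaking ties in favor of higher-priority tiers.
    merged.sort(key=lambda x: (x[0], x[1]), reverse=True)
    return [r for _, _, r in merged]
```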

### 7.2 Reconsolidation

Retrieved memories enter a labile state and remain modifiable for a configurable window (default: 60 minutes, with optimal values being domain-dependent), implementing biological reconsolidation. When new information is detected through explicit retrieval with new context, contradiction detection, or elaborative retrieval, the system blends it with existing memory content using an adaptive strength based on confidence, recency, and contradiction severity.

Memory scores also adjust based on outcomes: memories that contribute to successful decisions are reinforced, while errors are preserved as learning signals.
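A hedged sketch of the blending step: the text says adaptive strength is based on confidence, recency, and contradiction severity, but the combining weights and the linear-interpolation update below are illustrative assumptions, not the paper's formula.

```python
# Sketch of reconsolidation blending within the lability window. The weight
# mix in blend_strength and the linear interpolation are assumptions.

LABILITY_WINDOW_MIN = 60  # default window from the text

def blend_strength(confidence, recency, contradiction_severity):
    """Adaptive blend weight in [0, 1]; higher means new information dominates.
    All three inputs assumed normalized to [0, 1]."""
    s = 0.4 * confidence + 0.3 * recency + 0.3 * contradiction_severity
    return max(0.0, min(1.0, s))

def reconsolidate(old_vec, new_vec, strength):
    """Interpolate the stored representation toward the new information."""
    return [(1 - strength) * o + strength * n for o, n in zip(old_vec, new_vec)]
```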

## 8 Experimental Methodology

A central challenge in evaluating memory systems is *threshold leakage*: parameters tuned on benchmark data inflate reported accuracy. We address this through synthetic calibration, deriving all pipeline thresholds from LLM-generated corpora produced from a fixed specification with zero exposure to evaluation benchmarks.

### 8.1 Synthetic Calibration

We construct two synthetic corpora for threshold derivation:

#### Similarity thresholds.

Eight topically diverse personal-chat sessions (88 turns) are embedded with the same model used in evaluation (text-embedding-3-large, 3072 dimensions). We compute within-session and cross-session similarity distributions. The near-dedup threshold is set at the 99th percentile of all-pairs similarity (0.559); cluster distance at $1-P_{95}$ of within-session similarity (0.404); the interference threshold at $P_{90}$ of within-session similarity (0.542). These percentile rules transfer across domains without retuning.
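The percentile rules above can be sketched directly; the nearest-rank quantile helper is a simple stand-in for whatever percentile implementation the authors used.

```python
# Percentile-rule calibration: near-dedup at P99 of all-pairs similarity,
# cluster distance at 1 - P95 of within-session similarity, interference at
# P90. The nearest-rank quantile helper is an assumption, not the paper's code.

def percentile(values, p):
    """Nearest-rank percentile, p in [0, 100]."""
    ordered = sorted(values)
    idx = int(round(p / 100.0 * (len(ordered) - 1)))
    return ordered[min(len(ordered) - 1, max(0, idx))]

def calibrate(all_pairs_sim, within_session_sim):
    return {
        "near_dedup": percentile(all_pairs_sim, 99),
        "cluster_distance": 1.0 - percentile(within_session_sim, 95),
        "interference": percentile(within_session_sim, 90),
    }
```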

#### Importance weights.

Fifty LLM-generated sessions (483 turns: 377 substantive, 106 filler) spanning 14 topics over three simulated months, generated from a fixed specification of topic list, substantive/filler ratio, and returning-topic structure. Each turn carries an explicit substantive/filler label embedded in the generation spec. We compute per-signal ROC AUC for four signals (content length, embedding surprise, turn position, recency) and derive weights via AUC-excess normalization. Content length (AUC = 0.77, weight = 0.363) and turn position (weight = 0.325) dominate, while recency (AUC = 0.51, weight = 0.019) provides negligible discrimination. These four signals are the empirical instantiation of the five-factor importance score in Eq. [1](https://arxiv.org/html/2605.08538#S4.E1): content length and embedding surprise correspond to Bayesian Surprise, turn position corresponds to Entity Salience (substantive turns concentrate near session anchors), and recency corresponds to the Recency factor. Frequency and Outcome were dropped from the calibrated formula because (a) within-session frequency is degenerate at the turn level and (b) LongMemEval contains no goal-completion signal. The importance score implemented in evaluation is therefore a four-signal weighted sum with weights derived entirely from synthetic data.
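One natural reading of "AUC-excess normalization" is that each signal's weight is its ROC-AUC excess over chance (0.5), normalized to sum to one. This interpretation is an assumption; the paper's exact formula may differ (the reported weights are close to, but not exactly, what this rule yields).

```python
# Sketch of AUC-excess weight normalization as the name suggests: weight_i
# proportional to max(AUC_i - 0.5, 0). This interpretation is an assumption.

def auc_excess_weights(aucs):
    """aucs: dict mapping signal name -> ROC AUC. Returns normalized weights."""
    excess = {k: max(0.0, v - 0.5) for k, v in aucs.items()}
    total = sum(excess.values())
    return {k: (e / total if total else 0.0) for k, e in excess.items()}
```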

### 8.2 Evaluation Protocol

#### Streaming evaluation.

For multi-session benchmarks, we process sessions sequentially in temporal order, consolidating every $N$ sessions and applying forgetting at each consolidation step. This simulates how a deployed agent would accumulate and manage memories over time, where the consolidation pipeline only has access to past sessions when deciding what to retain.
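The streaming protocol reduces to a simple loop; `consolidate` and `forget` below are stand-ins for the pipeline stages described in §§4–5.

```python
# Minimal sketch of the streaming evaluation loop: sessions arrive in
# temporal order, consolidation runs every N sessions, forgetting runs at
# each consolidation step. Stage functions are placeholders.

def stream_sessions(sessions, n, consolidate, forget):
    store = []
    for i, session in enumerate(sessions, start=1):
        store.extend(session)          # ingest this session's turns
        if i % n == 0:                 # consolidate every N sessions
            store = forget(consolidate(store))
    return store
```

Crucially, each consolidation decision sees only past sessions, exactly as a deployed agent would.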

#### Temporal context.

All configurations include the current date and timestamps on retrieved memories in the answer prompt, so any deployed agent knows the date. This consistently contributes +10 percentage points versus date-unaware prompts across all benchmarks.

#### Judge.

We use the LongMemEval evaluation protocol with task-specific prompts for each question type (knowledge update, multi-session, single-session, temporal reasoning). A GPT-4o judge evaluates whether the model response contains the ground-truth answer.

#### Models and AI tools.

Our evaluation pipeline uses the following LLM and embedding models as components: (1) text-embedding-3-large (3072 dimensions) for all memory embeddings and retrieval; (2) GPT-4o for answer generation given retrieved context and for LLM-as-judge evaluation; (3) GPT-4o-mini for entity extraction in graph-enhanced retrieval configurations. All models are accessed via Azure AI Foundry. The memory lifecycle mechanisms (consolidation, forgetting, maturation, reconsolidation) are implemented as deterministic algorithms operating on embeddings and scores; these processes do not invoke LLMs.

## 9 Evaluation

We evaluate on two benchmarks spanning different domains: software engineering (VSCode issue tracking) and personal conversation (LongMemEval). All thresholds are fixed from synthetic calibration (§[8.1](https://arxiv.org/html/2605.08538#S8.SS1)). We expect our memory mechanisms to perform best for long-running agents executing repeated tasks, which makes software development the more fitting domain; we also evaluate on a standardized benchmark to assess how memory compression fares for general agent usage.

### 9.1 VSCode Issue Tracking

#### Dataset.

13,127 real VSCode GitHub issues with full timelines (December 2025–February 2026), yielding 120,000 events. Events include issue creation, comments, label changes, status transitions, and assignments. Embeddings are computed with text-embedding-3-large (3072 dimensions).

#### Evaluation.

Temporal streaming with quarterly windows. Unlike LongMemEval, which evaluates end-to-end QA accuracy, the VSCode benchmark has no associated question-answering task; it instead measures retention precision: whether the pipeline correctly retains events that will be referenced by future activity while discarding unimportant ones.
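A set-based reading of the retention-precision metric: the fraction of retained events that are later referenced by future activity. The exact bookkeeping in the paper may differ; this is a sketch of the definition as stated.

```python
# Sketch of retention precision: |retained ∩ future_referenced| / |retained|.
# The set-based formulation is an assumption consistent with the definition.

def retention_precision(retained, future_referenced):
    retained, future_referenced = set(retained), set(future_referenced)
    if not retained:
        return 0.0
    return len(retained & future_referenced) / len(retained)
```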

#### Results.

Consolidation and forgetting achieve **97.2% retention precision** with **58% store reduction**, a +21.8 percentage point improvement over the keep-everything baseline (75.4%). Graph retrieval and maturation are not yet integrated into the VSCode streaming pipeline; the result therefore represents a lower bound for the full architecture. The memory store self-regulates at 300–500 events regardless of input volume.

#### Key findings.

(1) Consolidation (deduplication and near-dedup) drives the majority of the quality improvement. (2) The optimal decay rate is $\lambda=0.001$ (half-life ≈ 29 days), indicating that production agents need longer memory horizons than human biology suggests; the half-life relates to the rhythm of the domain rather than daily cycles.

### 9.2 LongMemEval: S-Tier (50 Sessions)

#### Benchmark.

LongMemEval (Wu et al., [2025](https://arxiv.org/html/2605.08538#bib.bib26)) provides 500 personal-chat questions across six categories: knowledge update, multi-session reasoning, single-session (user facts, assistant recall, preferences), and temporal reasoning. Each S-tier question includes approximately 50 conversation sessions (~500 turns) spanning weeks to months.

#### Configurations.

We evaluate nine pipeline configurations as an ablation matrix, varying consolidation aggressiveness (dedup-only vs. aggressive clustering), forgetting strategy (fixed thresholds vs. adaptive token targets), and reconsolidation (enabled/disabled). All configurations use episodic (vector) retrieval with $k=10$.

#### Results.

Table [3](https://arxiv.org/html/2605.08538#S9.T3) presents the full ablation with 95% bootstrap confidence intervals (10K resamples) on overall accuracy.

Table 3: LongMemEval S-tier accuracy (%) by question type with 95% bootstrap CI on Overall. KU: knowledge update, MS: multi-session, SS-P: single-session preference, SS-A: single-session assistant, SS-U: single-session user, Temp: temporal reasoning. Top: moderate configurations (overlap baseline CI). Bottom: aggressive configurations (significantly worse).
#### Analysis.

All five moderate configurations have overall accuracy CIs that overlap the raw-RAG baseline ([74.8, 82.0]), establishing that the pipeline is statistically non-destructive at S-tier scale across a range of memory budgets. The aggressive configurations (Adaptive-10K, Aggressive consolidation) fall well outside this CI and are unambiguously harmful. Three findings emerge:

1. **Preference recall improves directionally.** Single-session preference accuracy increases from 56.7% (baseline) to 70.0% (+13.3 pp) for dedup-only and dedup+recon. With only 30 preference questions per tier, the per-category CIs are wide and this difference is not individually significant ([40.0, 73.3] vs. [53.3, 86.7]); we report it as a directional signal that warrants the M-tier replication in §[9.3](https://arxiv.org/html/2605.08538#S9.SS3).
2. **Aggressive consolidation is destructive.** Agglomerative clustering with merge reduces accuracy to 48.4% [44.0, 52.8], as merging turns into cluster summaries destroys the specific details needed for factual question answering. This confirms that consolidation should deduplicate, not summarize.
3. **Reconsolidation has marginal impact at S-tier scale.** With only 50 sessions, there are few genuine contradictions to detect (the dedup+recon CI fully overlaps dedup-only). The mechanism's value is expected to increase with longer interaction horizons.

### 9.3 LongMemEval: M-Tier (475 Sessions)

#### Benchmark.

The M-tier of LongMemEval extends each question's history from 50 to 475 sessions (~4,900 turns per question, ~540K unique turns across the cache). To our knowledge this is the first published M-tier evaluation under streaming conditions. Total raw history exceeds the 128K context window of GPT-4o, requiring all configurations (including the baseline) to apply some form of selection.

#### Configurations.

We sweep adaptive token targets at 25K, 50K, 115K (90% of the GPT-4o context window), and 200K, all with dedup-based consolidation, forgetting, and reconsolidation enabled (top-$k=35$). Raw RAG uses $k=10$ with no consolidation as the strongest baseline LongMemEval permits.

Table 4: LongMemEval M-tier accuracy (%) by token budget with 95% bootstrap CI on Overall. Cells in bold indicate the pipeline matches or exceeds raw RAG.
#### Analysis.

At the 200K-token budget, the pipeline's overall CI ([66.0, 74.2]) overlaps raw RAG ([67.2, 75.0]): the −1.1 pp aggregate gap is within sampling noise. The pipeline directionally beats raw RAG on the multi-session (+1.2 pp) and temporal-reasoning (+3.0 pp) categories, where reasoning over consolidated, deduplicated history is more useful than retrieving from raw turns. Aggressive token budgets (25K, 50K) are statistically distinct from raw RAG and confirm that under-budgeting destroys factual recall, particularly for single-session questions where the relevant turn is irreversibly removed. The cross-over point where memory lifecycle management ceases to be destructive lies between 115K and 200K tokens; together, the four budgets sketch a tunable accuracy/store-size operating curve, with the pipeline's value at M-tier scale lying in providing this configurable trade-off at parity rather than in absolute accuracy gains. We expect the gap to invert in operational settings where the same user returns with related queries that benefit from consolidated semantic structure. For context, Wu et al. ([2025](https://arxiv.org/html/2605.08538#bib.bib26)) report 72.0% on LongMemEval-M with their best-optimized pipeline (fact-augmented key expansion, chain-of-note reading, Stella V5 retrieval); our simple RAG baseline (71.2%) and 200K pipeline (70.1%) are competitive without any conversational-RAG-specific engineering.

## 10 Related Work

#### Agent memory systems.

MemGPT (Packer et al., [2024](https://arxiv.org/html/2605.08538#bib.bib22)) introduces a virtual memory hierarchy with page-in/page-out operations and rolling summarization. While effective for medium-horizon tasks, the summarization chain compounds errors over time; our consolidation approach avoids this by deduplicating rather than summarizing. Reflexion (Shinn et al., [2023](https://arxiv.org/html/2605.08538#bib.bib23)) enables agents to reflect on failures, but stores reflections as unstructured text without lifecycle management. RAISE (Shao et al., [2023](https://arxiv.org/html/2605.08538#bib.bib24)) and Generative Agents (Park et al., [2023](https://arxiv.org/html/2605.08538#bib.bib25)) implement memory retrieval with recency, importance, and relevance scoring but lack consolidation, forgetting, and reconsolidation mechanisms. Direct empirical comparison with these systems under our current evaluations is not feasible, as none provide LongMemEval results; we compare against the benchmark's published baselines.

#### Long-context memory benchmarks.

LongMemEval (Wu et al., [2025](https://arxiv.org/html/2605.08538#bib.bib26)) provides the first systematic evaluation of long-term memory in chat assistants, with S-tier (50 sessions) and M-tier (475 sessions) questions. Prior benchmarks either focus on single-session recall or use synthetic data. Our streaming evaluation protocol extends LongMemEval's batch evaluation to simulate realistic agent deployment.

#### Biological memory in AI.

The connection between biological memory and AI systems has been explored in complementary learning systems theory (McClelland et al., [1995](https://arxiv.org/html/2605.08538#bib.bib12)) and its application to catastrophic forgetting in neural networks (Kirkpatrick et al., [2017](https://arxiv.org/html/2605.08538#bib.bib27)). Our work applies these principles at the system architecture level rather than the weight level, implementing consolidation, forgetting, maturation, and reconsolidation as explicit pipeline stages operating on stored memories.

## 11 Limitations

#### Mechanisms not isolated by ablation.

Two of the six proposed mechanisms are not empirically discriminated in current experiments, for structural reasons tied to benchmark design. *Maturation* (§[6](https://arxiv.org/html/2605.08538#S6)) requires memories to be retrieved repeatedly over an extended period before activation differences become measurable; LongMemEval questions each draw on an independent haystack with no shared retrieval history across questions, so memories never accumulate repeated access signals. All experiments therefore run with uniform activation strength ($A=1.0$). *Reconsolidation* (§[7.2](https://arxiv.org/html/2605.08538#S7.SS2)) is enabled in the dedup+recon configuration, but its CI ([72.0, 79.6]) fully overlaps dedup-only ([73.0, 80.4]) because LongMemEval's construction deliberately avoids cross-session contradictions (user attributes are non-conflicting by design). Both mechanisms are present in the architecture because operational multi-user agents *do* exhibit repeated retrieval and contradictory updates; both remain design-rationale claims rather than ablation evidence within this paper.

#### Statistical power.

We report single-run results with 95% bootstrap confidence intervals (10K resamples) on the 500-question evaluation rather than variance across multiple judge runs, due to computational cost (each M-tier configuration requires 8–13 hours of API calls). The reported CIs reflect question-sampling variance; judge variance is not separately estimated. We note that Wu et al. ([2025](https://arxiv.org/html/2605.08538#bib.bib26)) characterize this evaluation protocol's GPT-4o judge as achieving >97% agreement with human experts, suggesting judge variance is small relative to the sampling CIs we report. Per-category CIs are wide (especially SS-P at $n=30$), and several reported differences (e.g., +13.3 pp preference recall) are directional rather than individually significant. Aggregate-level claims (overall pipeline non-destructive vs. raw RAG; aggressive configurations destructive) are robust across the bootstrap.

#### Comparison to prior memory systems\.

We compare against LongMemEval’s published baselines \(Raw RAG with text\-embedding\-3\-large \+ GPT\-4o\) but do not benchmark against MemGPTPackeret al\.\([2024](https://arxiv.org/html/2605.08538#bib.bib22)\), ReflexionShinnet al\.\([2023](https://arxiv.org/html/2605.08538#bib.bib23)\), RAISEShaoet al\.\([2023](https://arxiv.org/html/2605.08538#bib.bib24)\), or Generative AgentsParket al\.\([2023](https://arxiv.org/html/2605.08538#bib.bib25)\), none of which publish LongMemEval results\. Apples\-to\-apples comparison would require porting each system to the streaming protocol; this is left to future work\.

#### Domain breadth and downstream task.

Our evaluation is limited to two domains. The VSCode benchmark uses retention precision (a proxy for downstream utility) rather than a downstream QA task, because the dataset has no associated questions. Cross-domain validation on three additional domains (fashion retail, F1 racing, security operations) shows the mechanisms generalize qualitatively, but quantitative evaluation remains future work. Graph-enhanced retrieval (Table [3](https://arxiv.org/html/2605.08538#S9.T3), dedup+hybrid: 74.8%) shows comparable but not superior performance at S-tier scale; entity-based traversal is expected to provide greater benefit in stores exceeding 1,000 memories, where vector similarity alone faces precision challenges.
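The two VSCode metrics reduce to simple ratio computations. The sketch below uses the standard definitions (the paper's exact operationalization may differ): retention precision is the fraction of retained memories judged relevant, and store reduction is the relative shrinkage of the consolidated store.

```python
def retention_precision(retained_ids, relevant_ids):
    """Fraction of retained memories that are judged relevant."""
    retained, relevant = set(retained_ids), set(relevant_ids)
    return len(retained & relevant) / len(retained)

def store_reduction(original_size, consolidated_size):
    """Relative shrinkage of the memory store after consolidation."""
    return 1 - consolidated_size / original_size

# Toy numbers: 3 of 4 retained memories are relevant; store shrank 100 -> 42.
p = retention_precision([1, 2, 3, 4], [1, 2, 3, 9])   # 0.75
r = store_reduction(100, 42)                           # 0.58
```

Note the tension these two numbers capture jointly: aggressive consolidation raises store reduction but risks lowering retention precision, which is the accuracy/store-size operating curve the paper tunes.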

#### Benchmark-architecture alignment.

Our architecture targets operational agents accumulating high-volume, repetitive event streams over months, where the same actions recur, preferences evolve, facts become stale, and consolidation removes genuinely redundant observations (57.9% reduction on VSCode). The critical challenge in this setting is discarding noise without losing signal. LongMemEval is structurally different: it presents a single linear conversation of roughly 5,000 turns with no repeated actions, no redundant content, and no agent decision-making. In this setting, consolidation has nothing to deduplicate, maturation cannot strengthen memories through repeated access, and reconsolidation has no contradictions to resolve. Applied to LongMemEval, our pipeline compresses a store with minimal structural redundancy, an adversarial case for our design. The S-tier near-parity (76.8% vs. 78.4%) and M-tier parity at 200K (70.1% vs. 71.2%) are therefore best read as a *non-destruction bound*: the architecture does not lose information even on a benchmark type it was not designed for. The VSCode result (97.2% precision, +21.8 pp over baseline) reflects the intended operational scenario.

#### Future work.

Four extensions are actively underway. First, integrating maturation dynamics into the streaming pipeline so that frequently accessed memories resist forgetting and low-activation memories are preferentially pruned. Second, scaling graph-enhanced retrieval with ingestion-time entity extraction, where entity-based traversal is expected to improve precision over pure vector similarity in large stores. Third, evaluating the pipeline against MemGPT and Reflexion under the streaming protocol on both LongMemEval tiers and the VSCode benchmark. Fourth, extending the VSCode dataset into a downstream agent benchmark for long-running agents that carry out repeated tasks, a setting that current memory benchmarks under-represent. Concretely, we are building a memory-augmented issue-triage agent that operates over the consolidated store and answers operational questions (e.g., duplicate detection, owner suggestion, regression linking), converting the current retention-precision proxy into an end-to-end task metric while exercising the high-volume, repetitive event regime our architecture targets.

## 12 Conclusion

We have presented a biologically-grounded memory architecture for LLM agents that implements the full memory lifecycle through six cognitive mechanisms. Our synthetic calibration methodology eliminates evaluation leakage by deriving all thresholds from LLM-generated corpora produced from a fixed specification, independent of any benchmark.

Empirical evaluation yields three principal findings. First, deduplication-based consolidation is the dominant mechanism, driving the majority of quality improvement on the VSCode benchmark (97.2% retention precision, 58% store reduction). Second, the pipeline is statistically non-destructive on LongMemEval at both S-tier (76.8% vs. 78.4% baseline, overlapping 95% CI) and M-tier (70.1% vs. 71.2% baseline at a 200K-token budget), while exposing a tunable accuracy/store-size operating curve at lower budgets and yielding directional gains on multi-session (+1.2 pp) and temporal reasoning (+3.0 pp) at M-tier. Third, the pipeline provides a directional +13.3 pp improvement in S-tier preference recall, surfacing user preferences buried in raw retrieval.

## References

- M. C. Anderson (2003) Rethinking interference theory: executive control and the mechanisms of forgetting. Journal of Memory and Language 49(4), pp. 415–445. Cited by: [§2](https://arxiv.org/html/2605.08538#S2.p2.1).
- P. W. Frankland and B. Bontempi (2005) The organization of recent and remote memories. Nature Reviews Neuroscience 6(2), pp. 119–130. Cited by: [§2](https://arxiv.org/html/2605.08538#S2.p2.1).
- J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), pp. 3521–3526. Cited by: [§10](https://arxiv.org/html/2605.08538#S10.SS0.SSS0.Px3.p1.1).
- T. Kitamura, S. K. Ogawa, D. S. Roy, T. Okuyama, M. D. Morrissey, L. M. Smith, R. L. Redondo, and S. Tonegawa (2017) Engrams and circuits crucial for systems consolidation of a memory. Science 356(6333), pp. 73–78. Cited by: [§2](https://arxiv.org/html/2605.08538#S2.p2.1).
- J. L. McClelland, B. L. McNaughton, and R. C. O'Reilly (1995) Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review 102(3), pp. 419–457. Cited by: [§10](https://arxiv.org/html/2605.08538#S10.SS0.SSS0.Px3.p1.1), [§2](https://arxiv.org/html/2605.08538#S2.p2.1).
- K. Nader, G. E. Schafe, and J. E. LeDoux (2000) Fear memories require protein synthesis in the amygdala for reconsolidation after retrieval. Nature 406(6797), pp. 722–726. Cited by: [§2](https://arxiv.org/html/2605.08538#S2.p2.1).
- C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez (2024) MemGPT: towards LLMs as operating systems. In Advances in Neural Information Processing Systems, Vol. 37. Cited by: [§10](https://arxiv.org/html/2605.08538#S10.SS0.SSS0.Px1.p1.1), [§11](https://arxiv.org/html/2605.08538#S11.SS0.SSS0.Px3.p1.1).
- J. S. Park, J. C. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein (2023) Generative agents: interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. Cited by: [§10](https://arxiv.org/html/2605.08538#S10.SS0.SSS0.Px1.p1.1), [§11](https://arxiv.org/html/2605.08538#S11.SS0.SSS0.Px3.p1.1).
- R. Shao, C. Chen, J. Jia, and B. Xiao (2023) RAISE: retrieval-augmented interaction simulation engine for automatic LLM agent evaluation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Cited by: [§10](https://arxiv.org/html/2605.08538#S10.SS0.SSS0.Px1.p1.1), [§11](https://arxiv.org/html/2605.08538#S11.SS0.SSS0.Px3.p1.1).
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 36. Cited by: [§10](https://arxiv.org/html/2605.08538#S10.SS0.SSS0.Px1.p1.1), [§11](https://arxiv.org/html/2605.08538#S11.SS0.SSS0.Px3.p1.1).
- D. Wu, H. Wang, W. Yu, Y. Zhang, K. Chang, and D. Yu (2025) LongMemEval: benchmarking chat assistants on long-term interactive memory. In Proceedings of the International Conference on Learning Representations (ICLR). Cited by: [§10](https://arxiv.org/html/2605.08538#S10.SS0.SSS0.Px2.p1.1), [§11](https://arxiv.org/html/2605.08538#S11.SS0.SSS0.Px2.p1.2), [§9.2](https://arxiv.org/html/2605.08538#S9.SS2.SSS0.Px1.p1.1), [§9.3](https://arxiv.org/html/2605.08538#S9.SS3.SSS0.Px3.p1.5).
