Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

arXiv cs.CL Papers

Summary

This paper introduces Engram, an open-source bi-temporal memory engine for LLM agents that retrieves a compact context slice (∼9.6k tokens) to outperform the full-history baseline (79k tokens) by 10.4 accuracy points on LongMemEval, using a hybrid read path fusing dense, lexical, graph, and temporal signals.

arXiv:2606.09900v1 Announce Type: new Abstract: Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history into the prompt -- is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and benchmark numbers are reported on inconsistent, non-reproducible harnesses, so one system appears at wildly different scores across sources. We present Engram, an open-source, dual-process memory engine on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact -- invalidating, never deleting, so every fact keeps provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time ("as-of") filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram's lean configuration -- answering from a ~9.6k-token retrieved slice, never the full history -- scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p < 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0/500 errored. The gain needs a hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.

# A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History
Source: [https://arxiv.org/html/2606.09900](https://arxiv.org/html/2606.09900)
Liuyin Wang, Independent Researcher. Correspondence: liuyinwangthu@gmail.com. Code, raw logs, and the reproducible harness: [https://github.com/ly-wang19/engram](https://github.com/ly-wang19/engram).

\(June 2026\)

###### Abstract

Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround (replaying the entire history into the prompt) is expensive, slow, and, as distractors accumulate, *less* accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and the field’s benchmark numbers are reported on inconsistent, non-reproducible harnesses, so the same system appears at wildly different scores across sources. We present Engram, an open-source, dual-process memory engine built on a bi-temporal data model. A fast write path appends lossless episodes without an LLM on the critical path; an asynchronous consolidation path extracts atomic (*subject*, *predicate*, *object*) facts, builds a bi-temporal knowledge graph, and resolves contradictions *without* an LLM call per fact, invalidating rather than deleting, so every fact retains provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time (“as-of”) temporal filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval$_S$ benchmark, graded by the *official* category-specific judge, Engram’s lean configuration, which answers from a $\sim$9.6k-token retrieved slice and never the full history, scores 83.6% versus 73.2% for the full-context baseline ($+10.4$ points, McNemar exact $p < 10^{-6}$) while using $\sim 8\times$ fewer tokens (9.6k vs. 79k), with 0 of 500 questions errored under both systems. We find the gain is load-bearing on the read path being *hybrid*: a facts-only path loses recall, while facts plus retrieved raw chunks recover detail.
Beyond the system, we contribute a single neutral, in-repo evaluation harness with the official judge baked in, the full-context baseline in every table, and the raw per-question logs published, and we document the measurement-integrity pitfalls (truncation, home-grown judges, full-history “leaks”) that silently distort memory-benchmark numbers. Every number in this paper ships with the command to reproduce it.

## 1 Introduction

LLM agents are stateless across sessions. The pragmatic fix, concatenating the whole conversation history into the prompt, scales badly on three axes at once: token cost grows linearly with history, latency follows, and accuracy *degrades* as irrelevant turns crowd the window and the model is forced to locate a needle among distractors \[[12](https://arxiv.org/html/2606.09900#bib.bib12)\]. A long-term memory layer that stores, structures, and selectively retrieves the past is the natural alternative, and a growing body of systems pursues it \[[14](https://arxiv.org/html/2606.09900#bib.bib14), [15](https://arxiv.org/html/2606.09900#bib.bib15), [2](https://arxiv.org/html/2606.09900#bib.bib2), [16](https://arxiv.org/html/2606.09900#bib.bib16), [6](https://arxiv.org/html/2606.09900#bib.bib6)\]; see Du \[[4](https://arxiv.org/html/2606.09900#bib.bib4)\] for a recent (2026) survey.

Two gaps remain wide open. First, accuracy: most memory systems are reported as cheaper or faster than full-context, but *not more accurate*. Beating full-context *on accuracy*, with a precisely retrieved slice that outperforms the noisy full window, is the harder and more valuable target, because it turns memory from a cost optimization into a quality improvement. Second, reproducibility: memory benchmarks are reported on inconsistent harnesses with different ingestion, answer prompts, and judges, so a single system can appear at 58%, 66%, or 92% depending on the source, and different papers give contradictory orderings. In a field where every number is contested, a neutral harness anyone can re-run is itself a contribution.

We address both with Engram, an open-source memory engine, and its in-repo evaluation harness. Our design follows a single principle (*a number we cannot reproduce does not exist*) and a composition thesis: no single mechanism wins, so we compose typed memory, a bi-temporal knowledge graph, multi-signal retrieval fusion, salience decay, and asynchronous consolidation, and build in the seams between them.

#### Contributions\.

1. A dual-process, bi-temporal memory engine (§[3](https://arxiv.org/html/2606.09900#S3)) with cheap, non-destructive conflict resolution: a contradicted fact is *invalidated* (with a `supersedes` chain and full provenance), never overwritten, and the common case is resolved with *no* LLM call.
2. The empirical result that a lean retrieved context beats full-context on accuracy (§[5](https://arxiv.org/html/2606.09900#S5)): $+10.4$ points (83.6% vs. 73.2%) at $\sim 8\times$ fewer tokens on the full 500-question LongMemEval$_S$ under the official judge (i.e., removing distractors *raises* accuracy), together with the finding that a *hybrid* facts-plus-chunks read path is load-bearing (facts alone lose recall).
3. A neutral, reproducible harness (§[4](https://arxiv.org/html/2606.09900#S4)): one in-repo pipeline with the official judge baked in, the full-context baseline in every table, the same answerer and judge applied to every system by construction, and the raw per-question logs published. We additionally document the measurement-integrity pitfalls we found and fixed.
4. A per-category analysis (§[5](https://arxiv.org/html/2606.09900#S5)) isolating where bi-temporal modeling pays off (knowledge-update 87.5%, temporal 81.1%) and where headroom remains (multi-session aggregation, preference).

Engram is dual-licensed (AGPL-3.0 plus a commercial license), self-hostable, and runs end-to-end with zero setup (no API keys, no external services) via deterministic offline fallbacks, so the architecture and the demo are reproducible without network access.

## 2 Related Work

#### Memory systems for LLM agents\.

MemGPT \[[14](https://arxiv.org/html/2606.09900#bib.bib14)\] frames the LLM as an operating system that pages memory in and out of a limited context window. Generative Agents \[[15](https://arxiv.org/html/2606.09900#bib.bib15)\] introduce a memory stream with importance, recency, and relevance scoring plus periodic reflection. Mem0 \[[2](https://arxiv.org/html/2606.09900#bib.bib2)\] targets production agents with scalable extract-and-store memory. Zep/Graphiti \[[16](https://arxiv.org/html/2606.09900#bib.bib16)\] is the closest in spirit to Engram: a temporally-aware knowledge-graph memory with bi-temporal edges. HippoRAG \[[6](https://arxiv.org/html/2606.09900#bib.bib6)\] uses a graph plus personalized PageRank for multi-hop retrieval, and its successor HippoRAG 2 \[[7](https://arxiv.org/html/2606.09900#bib.bib7)\] reframes the same machinery as non-parametric continual memory. The most recent agentic-memory systems add LLM-driven control over how memories are written and organized: A-MEM \[[21](https://arxiv.org/html/2606.09900#bib.bib21)\] builds an interlinked, Zettelkasten-style note network; MemoryOS \[[9](https://arxiv.org/html/2606.09900#bib.bib9)\] imposes an operating-system-style short/mid/long-term hierarchy; and GAM \[[19](https://arxiv.org/html/2606.09900#bib.bib19)\] decouples memory encoding from consolidation over hierarchical graphs. Engram differs in *composing* a bi-temporal graph with a hybrid (facts + raw chunks) read path and an explicit, cheap conflict-resolution policy, and, distinctively, in shipping the neutral harness that lets these design choices be measured rather than asserted.

#### Long context and the cost of distractors\.

“Lost in the middle” \[[12](https://arxiv.org/html/2606.09900#bib.bib12)\] shows that LLMs use long contexts unevenly and that accuracy degrades when relevant evidence is buried among distractors. This is the mechanism our headline result exploits: a filtered, precisely retrieved slice can be *more* accurate than the full window, not merely cheaper.

#### Benchmarks\.

LongMemEval \[[18](https://arxiv.org/html/2606.09900#bib.bib18)\] evaluates chat assistants on long-term interactive memory across categories (single/multi-session, knowledge-update, temporal reasoning, preference, and abstention), with category-specific judge prompts. LOCOMO \[[13](https://arxiv.org/html/2606.09900#bib.bib13)\] evaluates very long-term conversational memory, and more recent benchmarks extend the setting to multimodal histories (Mem-Gallery \[[1](https://arxiv.org/html/2606.09900#bib.bib1)\]). We report on LongMemEval$_S$ here and treat LOCOMO and additional backbones as immediate future work (§[6](https://arxiv.org/html/2606.09900#S6)).

#### Retrieval\.

Engram’s read path builds on dense retrieval \[[10](https://arxiv.org/html/2606.09900#bib.bib10)\] with modern dense sentence embeddings (the BGE family \[[20](https://arxiv.org/html/2606.09900#bib.bib20)\]; we use the English `bge-small-en-v1.5`), classical lexical scoring (BM25 \[[17](https://arxiv.org/html/2606.09900#bib.bib17)\]), and rank fusion via Reciprocal Rank Fusion \[[3](https://arxiv.org/html/2606.09900#bib.bib3)\], in the retrieval-augmented generation tradition \[[11](https://arxiv.org/html/2606.09900#bib.bib11)\]. The dual-process framing draws on the System-1/System-2 distinction in cognitive science \[[8](https://arxiv.org/html/2606.09900#bib.bib8)\], and salience decay on the classical forgetting curve \[[5](https://arxiv.org/html/2606.09900#bib.bib5)\].

## 3 The Engram System

Engram is a dual-process memory system: a fast online write path (System-1) and a slow asynchronous consolidation path (System-2) that writes into a typed, bi-temporal memory, which a hybrid read path queries (Figure [1](https://arxiv.org/html/2606.09900#S3.F1)).

[Figure 1 diagram: `add(messages)` enters the System-1 hot write path (no LLM, <50 ms): append lossless Episode; identity resolution (sessions/devices); light embed + enqueue. An async queue feeds System-2 async consolidation (seconds): extract atomic Facts $(s, p, o)$; build bi-temporal knowledge graph; conflict detect $\to$ invalidate; salience + decay. System-2 writes facts into typed memory (pluggable stores): Episodic; Semantic graph (bi-temporal); Profile / Identity; Procedural. The read path (hybrid retrieval, <100 ms) retrieves from typed memory: query decompose (multi-hop); dense + BM25 + graph + recency; RRF fusion (+ rerank); bi-temporal as-of filter; abstention gate; assemble context. `search(query)` returns an answer-ready context.]

Figure 1: The Engram dual-process architecture. A hot write path (System-1) never blocks on an LLM; an asynchronous consolidation path (System-2) extracts atomic facts, builds the bi-temporal knowledge graph, and resolves conflicts non-destructively; both feed a typed, bi-temporal memory backed by pluggable stores; and a hybrid read path retrieves a compact, provenance-tagged slice (dense + lexical + graph + recency, fused by RRF), applies an as-of temporal filter and an abstention gate, and assembles the answer context.

### 3.1 Bi-temporal data model

Two time axes are first\-class everywhere, which is what makes knowledge\-updates and “as\-of” queries intrinsic rather than bolted\-on\.

- Episode: a raw, lossless turn/event, stamped with *event time* (when it happened in the world) and *ingested-at* (transaction time, when we recorded it).
- Fact: an atomic (*subject*, *predicate*, *object*) claim with surface text, an embedding, a *salience* and *confidence*, and *provenance* (the source episode ids). It carries two time axes: *valid time* (`valid_at`/`invalid_at`, when the claim is true in the world) and *transaction time* (`created_at`/`expired_at`, when we learned or retracted it), plus a `supersedes` pointer to the fact it replaces.
- Entity / Relation: graph nodes and edges; edges carry the same bi-temporal stamps.

The invariant: *never hard-delete a contradicted fact; invalidate it* (set `invalid_at`) and keep the history. Every fact can therefore answer “where did this come from?” and “what did it replace?” (Figure [2](https://arxiv.org/html/2606.09900#S3.F2)).

[Figure 2 diagram: a world-time axis marked $t_0$, $t_1$, and now. Fact A, “works at Tencent”: `valid_at` $= t_0$, `invalid_at` $= t_1$ (kept; invalidated, not deleted). Fact B, “works at Moonshot AI”: `valid_at` $= t_1$, `supersedes` A. As-of $t_0$, *“where does X work?”* $\Rightarrow$ Tencent; as-of now, *“where does X work?”* $\Rightarrow$ Moonshot AI.]

Figure 2: Bi-temporal facts make contradictions and “as-of” queries first-class. When Fact B arrives at $t_1$, Engram sets A.`invalid_at` $= t_1$ and B.`supersedes` $=$ A: Fact A is *invalidated, not deleted*, so the history (and provenance) survives. A point-in-time query then resolves against the valid fact for that time (Tencent as-of $t_0$, Moonshot AI as-of now), which is exactly what the knowledge-update and temporal categories require.
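
The Tencent/Moonshot AI example can be sketched in a few lines of Python. This is a minimal illustration, not Engram's code: the `Fact` dataclass, `supersede`, and `as_of` are hypothetical names, only the valid-time axis is modeled (transaction time is omitted), and the half-open interval `[valid_at, invalid_at)` is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fact:
    id: str
    subject: str
    predicate: str
    obj: str
    valid_at: float                      # valid time: when the claim became true
    invalid_at: Optional[float] = None   # valid time: when it stopped being true
    supersedes: Optional[str] = None     # id of the fact this one replaces


def supersede(old: Fact, new: Fact) -> None:
    """Invalidate `old` in place (it is never deleted) and link `new` to it."""
    old.invalid_at = new.valid_at
    new.supersedes = old.id


def as_of(facts: List[Fact], subject: str, predicate: str, t: float) -> Optional[Fact]:
    """Return the fact for (subject, predicate) that is valid at world time t."""
    for f in facts:
        if (f.subject, f.predicate) != (subject, predicate):
            continue
        # Assumed convention: a fact is valid on [valid_at, invalid_at).
        if f.valid_at <= t and (f.invalid_at is None or t < f.invalid_at):
            return f
    return None


T0, T1, NOW = 0.0, 10.0, 20.0  # illustrative timestamps
a = Fact("A", "X", "works_at", "Tencent", valid_at=T0)
b = Fact("B", "X", "works_at", "Moonshot AI", valid_at=T1)
supersede(a, b)
store = [a, b]  # both facts survive; A is only marked invalid
```

Querying `as_of(store, "X", "works_at", T0)` resolves to Fact A and the same query at `NOW` resolves to Fact B, while `b.supersedes` still points back to A for provenance.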

### 3.2 System-1: the hot write path

On `add(messages)`, Engram appends a lossless episode, resolves identity (linking a user/entity across sessions and devices), computes a light embedding, and enqueues the episode for consolidation. No LLM runs on this path, keeping it within a sub-50 ms budget so writes never block the agent.

### 3.3 System-2: asynchronous consolidation

Off the critical path, the consolidation engine (i) extracts atomic facts from episodes (rule-based by default, LLM-based when a model is configured), (ii) builds the bi-temporal knowledge graph of entities and relations, (iii) detects conflicts and invalidates superseded facts, and (iv) scores salience and applies decay/reinforcement so unreinforced memories fade and the store stays small and fast. Hierarchical abstraction (session summaries $\rightarrow$ profile) runs here as well.
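
As a rough picture of step (iv), the sketch below applies an exponential forgetting curve to a salience value and bumps it on reinforcement. The functional form mirrors the recency term $e^{-\Delta t/\tau}$ used later in the read path, but the function names, the 30-day `tau_days`, and the additive `boost` are illustrative assumptions, not values from the paper or the repository.

```python
import math


def decayed_salience(salience: float, dt_days: float, tau_days: float = 30.0) -> float:
    """Exponential decay: salience shrinks by a factor of e every tau_days (assumed value)."""
    return salience * math.exp(-dt_days / tau_days)


def reinforce(salience: float, boost: float = 0.2) -> float:
    """Bump salience when a memory is re-observed or retrieved, capped at 1.0."""
    return min(1.0, salience + boost)
```

Under these assumptions an unreinforced memory with salience 1.0 decays to $e^{-1}$ of its value after 30 days, while a memory that keeps being reinforced stays near the cap.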

### 3.4 Cheap-then-escalate conflict resolution

When a new fact arrives for an existing (*subject*, *predicate*) slot with a different object, Engram resolves it in increasing order of cost:

1. Slot match (exact) signals a likely update; embedding similarity catches the same attribute under a different free-form predicate; content subsumption (one claim $\subseteq$ the other) separates contradiction from elaboration.
2. If the claims are clearly contradictory and temporally ordered, invalidate the old one (`old.invalid_at` $\leftarrow$ `new.valid_at`) and set `new.supersedes` $\leftarrow$ `old.id`. *No LLM call.*
3. Only genuinely ambiguous cases escalate to an LLM adjudicator.

This is the cost win over systems that invoke an LLM on every fact, while preserving production\-grade temporal correctness and a complete audit trail\.
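
The three tiers can be sketched as a single decision function. This is an illustration under stated assumptions rather than Engram's implementation: `resolve`, `Decision`, and `subsumes` are hypothetical names, the token-subset test is a crude stand-in for content subsumption, and `predicate_sim` is assumed to be an embedding similarity computed elsewhere, with an arbitrary 0.85 threshold.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    NO_CONFLICT = "no_conflict"
    ELABORATION = "elaboration"
    SUPERSEDE = "supersede"   # resolved cheaply, no LLM call
    ESCALATE = "escalate"     # ambiguous: hand to an LLM adjudicator


@dataclass
class Fact:
    id: str
    subject: str
    predicate: str
    obj: str
    valid_at: float
    invalid_at: Optional[float] = None
    supersedes: Optional[str] = None


def subsumes(a: str, b: str) -> bool:
    """Crude subsumption check: the tokens of one claim are a subset of the other's."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return ta <= tb or tb <= ta


def resolve(old: Fact, new: Fact, predicate_sim: float, sim_threshold: float = 0.85) -> Decision:
    # Tier 1: exact slot match, or the same attribute under a different predicate.
    same_slot = old.subject == new.subject and (
        old.predicate == new.predicate or predicate_sim >= sim_threshold
    )
    if not same_slot or old.obj == new.obj:
        return Decision.NO_CONFLICT
    if subsumes(old.obj, new.obj):
        return Decision.ELABORATION
    # Tier 2: contradictory and temporally ordered, so invalidate without an LLM.
    if old.invalid_at is None and new.valid_at > old.valid_at:
        old.invalid_at = new.valid_at
        new.supersedes = old.id
        return Decision.SUPERSEDE
    # Tier 3: anything else is ambiguous.
    return Decision.ESCALATE
```

The point of the ordering is that the common update case exits at tier 2 with two field assignments, and only the residue reaches the adjudicator.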

### 3.5 The hybrid read path

On `search(query)`, Engram (1) understands the query and, for multi-hop questions, decomposes it into sub-queries; (2) retrieves in parallel through four complementary channels: dense semantic, BM25 lexical, graph $n$-hop from the query’s entities, and recency/salience; (3) fuses the ranked lists with Reciprocal Rank Fusion \[[3](https://arxiv.org/html/2606.09900#bib.bib3)\] and an optional cross-encoder rerank over the top-$k$ (off by default); (4) applies a bi-temporal “as-of” filter (what we believed was true at time $T$); (5) passes an abstention gate that declines to answer when the evidence is absent; and (6) assembles a deduplicated, provenance-tagged, token-budgeted context. The per-item score combines all signals:

$$
\begin{split}
\mathrm{score}(\textit{item} \mid q) ={}& w_{\text{sem}} \cos(q, \textit{item}) + w_{\text{lex}}\,\mathrm{bm25}(q, \textit{item}) + w_{\text{graph}}\,\mathrm{prox}(\textit{item}) \\
&{}+ w_{\text{rec}}\,e^{-\Delta t/\tau} + w_{\text{sal}}\,\mathrm{sal}(\textit{item}),
\end{split}
\tag{1}
$$

where the weights are configuration-driven and tuned on the harness rather than hand-set. Crucially, the assembled context is *hybrid*: it contains both the conflict-resolved bi-temporal facts and the most relevant raw session chunks (plus session-level summaries). §[5](https://arxiv.org/html/2606.09900#S5) shows this hybrid composition is necessary: facts alone lose recall.
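
Eq. (1) can be read as a plain weighted sum over five per-item signals. The sketch below is illustrative only: the `Signals` container, the weight values, and `tau` are hypothetical placeholders (the tuned weights are the ones the paper lists in its Table 1), and the BM25 and proximity inputs are assumed to be pre-normalised to $[0, 1]$.

```python
import math
from dataclasses import dataclass


@dataclass
class Signals:
    cos: float    # cosine similarity between query and item
    bm25: float   # lexical score, assumed pre-normalised to [0, 1]
    prox: float   # graph proximity to the query's entities
    dt: float     # item age (Delta t), in the same unit as tau
    sal: float    # salience


# Placeholder weights for illustration; NOT the paper's tuned values.
WEIGHTS = {"sem": 0.4, "lex": 0.3, "graph": 0.1, "rec": 0.1, "sal": 0.1}


def score(s: Signals, w: dict = WEIGHTS, tau: float = 30.0) -> float:
    """Weighted sum of the five terms of Eq. (1)."""
    return (
        w["sem"] * s.cos
        + w["lex"] * s.bm25
        + w["graph"] * s.prox
        + w["rec"] * math.exp(-s.dt / tau)
        + w["sal"] * s.sal
    )
```

With everything else equal, an older item scores lower only through the recency term, so the size of $w_{\text{rec}}$ controls how strongly freshness can override semantic or lexical match.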

#### Pluggable backends\.

Every external dependency—LLM, embedder, vector store, graph store, lexical index—sits behind an interface with a zero\-dependency offline fallback \(a hashing embedder, a rule\-based extractor, in\-memory stores\)\. The end\-to\-end loop and the unit tests therefore run deterministically with no API keys and no services; real backends \(BGE embeddings, LanceDB/Qdrant/pgvector, Kuzu/Neo4j, any LLM via LiteLLM\) slot in behind the same interfaces\.
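
One way to picture the interface-plus-fallback pattern is a structural `Embedder` protocol with a deterministic hashing implementation behind it. The names `Embedder`, `HashingEmbedder`, and `cosine` are hypothetical, and the feature-hashing scheme shown is a generic technique, not necessarily the fallback the repository ships.

```python
import hashlib
import math
from typing import List, Protocol


class Embedder(Protocol):
    def embed(self, text: str) -> List[float]: ...


class HashingEmbedder:
    """Zero-dependency fallback: signed feature hashing over whitespace tokens."""

    def __init__(self, dim: int = 64):
        self.dim = dim

    def embed(self, text: str) -> List[float]:
        vec = [0.0] * self.dim
        for tok in text.lower().split():
            # sha256 is stable across processes, unlike Python's salted hash().
            h = int.from_bytes(hashlib.sha256(tok.encode("utf-8")).digest()[:8], "big")
            sign = 1.0 if (h >> 63) & 1 else -1.0
            vec[h % self.dim] += sign
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec


def cosine(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))
```

Because the fallback satisfies the same protocol as a real model, a caller typed against `Embedder` does not change when a BGE-backed implementation is swapped in, and tests that use `HashingEmbedder` give identical vectors on every run.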

## 4 Experimental Setup

#### Benchmark\.

LongMemEval$_S$ \[[18](https://arxiv.org/html/2606.09900#bib.bib18)\]: 500 questions, each over a haystack of $\sim$50 sessions ($\sim$115k tokens), pulled from the public release. Questions span seven categories (Figure [4](https://arxiv.org/html/2606.09900#S5.F4)). We grade with the *official* category-specific judge prompts, including the temporal off-by-one tolerance, knowledge-update old-information tolerance, preference-rubric leniency, the “contains the answer” semantics, and unanswerable (abstention) detection, rather than a home-grown judge.

#### Models\.

Embedder: `BAAI/bge-small-en-v1.5` (local, no API key). System-2 fact extractor: `doubao-seed-1.6-flash`. Answerer: `doubao-seed-2.0-pro`. Judge: `DeepSeek-V3.2`, a strict standard judge. Models are addressed as `provider:model` and resolved via LiteLLM, so any OpenAI-compatible endpoint (OpenAI, DeepSeek, a local model) can be substituted and the same commands run.

#### Systems\.

`engram_lean` (our headline) retrieves a small hybrid slice (conflict-resolved bi-temporal facts + the most relevant raw session chunks + session summaries) and answers from *that alone* ($\sim$9.6k tokens), never the full history. `full_context` is the baseline: stuff the entire haystack into the prompt ($\sim$79k tokens). The harness applies the *same answerer and judge to every system*, so any within-run comparison is apples-to-apples by construction.

#### Retrieval configuration\.

The hybrid read path fuses five ranked signals by weighted Reciprocal Rank Fusion ($k_{\text{RRF}} = 60$) with the weights in Table [1](https://arxiv.org/html/2606.09900#S4.T1) (repository defaults, tuned on the harness, not hand-set per question). The lean slice assembles up to 8 conflict-resolved facts, the top-15 fused items, 2 raw session chunks, and 28 session-level summaries; exact flag semantics are in the reproduce command below.
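
Weighted Reciprocal Rank Fusion gives each item a contribution of $w_s / (k_{\text{RRF}} + \mathrm{rank}_s)$ from every signal $s$ that ranks it, and sums them. The sketch below is a minimal version with $k = 60$ as stated above; the function name, the signal names, and the equal weights in the usage are illustrative assumptions, not the repository's defaults.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(rankings: Dict[str, List[str]], weights: Dict[str, float], k: int = 60) -> List[str]:
    """Fuse per-signal ranked id lists; each hit contributes w / (k + rank), rank starting at 1."""
    scores = defaultdict(float)
    for signal, ranked_ids in rankings.items():
        w = weights.get(signal, 0.0)
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += w / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda i: (-scores[i], i))


rankings = {
    "dense": ["a", "b", "c"],
    "bm25": ["b", "a", "d"],
    "graph": ["b"],
}
fused = weighted_rrf(rankings, {"dense": 1.0, "bm25": 1.0, "graph": 1.0})
```

Because only ranks enter the formula, signals with incomparable raw score scales (cosine, BM25, graph distance) can be fused without normalisation; an item found by several channels, like `b` here, rises above one that a single channel ranks first.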

Table 1: Retrieval-fusion weights for Eq. ([1](https://arxiv.org/html/2606.09900#S3.E1)) and the RRF constant, as used by the headline `engram_lean` configuration (defaults in `engram/config.py`).

#### Reproduce\.

The headline run is a single command \(raw per\-question logs are committed to the repository\):

    python eval/bench.py --data s --limit 500 \
      --systems engram_lean,full_context \
      --answerer volcano:doubao-seed-2-0-pro \
      --judge volcano:deepseek-v3-2 --extractor volcano:doubao-seed-1-6-flash \
      --embedder bge-small --reasoning --persona \
      --chunks 2 --topk 15 --extract-k 8 --summ-k 28 --n-summaries 28

`python eval/report.py <run.jsonl>` recomputes every table below from the logs.

#### Measurement\-integrity notes\.

We document the bugs that silently inflate or deflate memory\-benchmark numbers, because they are the reason cross\-source numbers disagree:

1. Lean, not full-history (the honest test). An earlier headline prepended facts *above the entire history*; because that system *contains* full-context, it cannot really lose to it and does not validate the retrieval thesis. The headline is now `engram_lean`, which answers from a $\sim$9.6k-token retrieved slice.
2. Full-context truncation. The baseline was once capped below the haystack size, feeding it only the oldest sessions; this *deflated the baseline*. Fixed so it receives the whole haystack. (Any “full-context only scores 30%” claim predates this fix.)
3. Official judge, not a home-grown one. A generic “same info?” judge was *stricter* than the official LongMemEval judge and made scores non-comparable; we use the official prompts verbatim.
4. Abstention questions are graded by the official *unanswerable* judge.
5. Reliability. The LLM client uses exponential-backoff retry with transient/permanent error classification; the headline run completed with 0 errored questions of 500 under both systems.
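
The retry policy in the reliability note might look roughly like the following. This is a generic sketch, not the harness's client: the exception classes, `call_with_retry`, the delay constants, and the jitter range are all hypothetical, and a real client would map provider errors (rate limits, timeouts, bad requests) onto the two classes.

```python
import random
import time


class TransientError(Exception):
    """Worth retrying, e.g. a rate limit or timeout."""


class PermanentError(Exception):
    """Not worth retrying, e.g. an invalid request."""


def call_with_retry(fn, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(); retry transient failures with capped, jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except PermanentError:
            raise  # retrying cannot help, surface immediately
        except TransientError:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids synchronized retries
```

Separating the two error classes matters for measurement integrity: a question that silently fails after a transient error would otherwise be scored as wrong, or dropped, and shift the reported accuracy.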

## 5 Results

#### Headline\.

Table [2](https://arxiv.org/html/2606.09900#S5.T2) and Figure [3](https://arxiv.org/html/2606.09900#S5.F3) report the full 500-question result. Engram’s lean configuration beats the full-context baseline by $+10.4$ points (83.6% vs. 73.2%) while using $\sim 8\times$ fewer tokens (9.6k vs. 79k), with 0 of 500 errored under both. The filtered slice is not merely cheaper; it is *more accurate* than the full window, consistent with the distractor mechanism of Liu et al. \[[12](https://arxiv.org/html/2606.09900#bib.bib12)\]. And because the retrieved slice is bounded, Engram’s cost stays flat as history grows, whereas the full-context cost grows with it. The margin is statistically decisive: across the 500 paired questions, `engram_lean` is correct on 81 that the baseline misses versus 29 the other way (McNemar’s exact test, $p < 10^{-6}$), and a paired bootstrap puts the 95% CI of the gain at $[+6.4, +14.4]$ points. All statistics are recomputed from the committed logs by `paper/compute_stats.py`.
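
The headline statistics can be sanity-checked from the numbers quoted above alone. The sketch below is not `paper/compute_stats.py`; it is an independent, standard-library check in which the correct-answer counts are derived from the reported percentages (an assumption that they are exact over 500 questions) and `mcnemar_exact` is a hypothetical helper implementing the usual two-sided exact binomial form of McNemar's test on the discordant pairs.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value: binomial test of min(b, c) out of b + c at p = 0.5."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


N = 500
lean_correct = round(0.836 * N)   # derived from the reported 83.6%
full_correct = round(0.732 * N)   # derived from the reported 73.2%
only_lean, only_full = 81, 29     # discordant pairs quoted in the text

net_gain_points = 100 * (only_lean - only_full) / N
p_value = mcnemar_exact(only_lean, only_full)
```

The two views agree: the net of the discordant pairs (81 minus 29) equals the difference in correct counts, which is the $+10.4$-point gain, and the exact test on 110 discordant pairs falls below the $10^{-6}$ bound quoted in the text.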

Table 2: LongMemEval$_S$, 500 questions, official judge. Same answerer (`doubao-seed-2.0-pro`) and judge (`DeepSeek-V3.2`) applied to both systems. Accuracy is shown with a Wilson 95% confidence interval; the paired difference is significant (McNemar exact $p < 10^{-6}$).

![Refer to caption](https://arxiv.org/html/2606.09900v1/x1.png)

Figure 3: Accuracy vs. average context tokens on LongMemEval$_S$ (500 questions, official judge). `engram_lean` (star) sits up and to the left of the full-context baseline (square): $+10.4$ points at $\sim 8\times$ fewer tokens. The `engram_full` variant (open circle) prepends the same facts above the *whole* history and lands at 83.4%: the structured facts carry the accuracy, while the full history adds tokens, not correctness.

#### Lean retrieval matches full\-history accuracy at 1/8 the cost\.

For reference, in the same 500-question run a non-lean variant that prepends the conflict-resolved facts *above the full history* (`engram_full`, $\sim$79k tokens) scores 83.4% (416/499; one question errored under this variant, versus none for `engram_lean` and full-context). A paired McNemar test against `engram_lean`’s 83.6% finds *no* significant difference ($p = 0.91$): the structured facts contribute essentially all of the accuracy gain, while the full history adds tokens, not correctness. We therefore headline the lean number.

#### Per\-category\.

Figure [4](https://arxiv.org/html/2606.09900#S5.F4) breaks `engram_lean` down by category. Bi-temporal modeling pays off where it should: *knowledge-update* (most-recent-wins via invalidation) reaches 87.5% and *temporal-reasoning* (date-stamped context + as-of filtering) 81.1%. *Abstention* reaches 86.7% under the official unanswerable judge: the system declines when memory lacks the answer rather than hallucinating. The headroom is concentrated in *multi-session* aggregation (counting/aggregating across many sessions, 79.3%) and *single-session-preference* (73.3%, a category that is hard field-wide).

![Refer to caption](https://arxiv.org/html/2606.09900v1/x2.png)

Figure 4: Per-category accuracy of `engram_lean` on the full 500-question set (with per-category $n$; dashed line is the 83.6% overall). The two categories where bi-temporal modelling is decisive, *knowledge-update* and *temporal-reasoning*, are highlighted. Headroom concentrates in *multi-session* aggregation and *single-session-preference* (hard field-wide).

#### The read path must be hybrid\.

A load-bearing finding from development: a *facts-only* read path, answering purely from extracted $(s, p, o)$ facts, *loses recall* relative to the hybrid path, because extraction drops detail that some questions need verbatim. Adding the most relevant raw session chunks back alongside the conflict-resolved facts restores that detail; the facts contribute conflict-resolved, bi-temporal signal and the chunks restore specificity. The headline configuration is hybrid for exactly this reason, and we caution against shipping facts-only QA. We report this as a design observation; a controlled facts-only ablation on the full 500-question set under the same judge is reported as future work (§[6](https://arxiv.org/html/2606.09900#S6)), not claimed here.

#### Efficiency\.

The retrieved context averages $\sim$9.6k tokens, $\sim 8\times$ leaner than the $\sim$79k full-context baseline, and the lean read path is what keeps cost flat as history grows. Retrieval itself is sub-second; the $\sim$60 s p50 end-to-end latency is dominated by the answerer model’s generation call, not by Engram.

## 6 Discussion and Limitations

We report the result openly rather than as a leaderboard “win,” precisely because our thesis is that cross\-harness numbers are not comparable\. The honest scope of the present evidence:

- One benchmark, one answerer backbone. The headline is LongMemEval$_S$ with a single answerer model. The evaluation discipline we advocate requires multiple backbones (a small open model *and* a frontier model) and multiple benchmarks; extending to LOCOMO \[[13](https://arxiv.org/html/2606.09900#bib.bib13)\] and a second backbone is immediate next work, so that memory quality is shown not to depend on one model’s ability to read our structures.
- Small samples mislead. During development an 18-item slice once read 83% when the full-set truth was $\sim$58%; we therefore report *only* full-500 numbers, and every number in this paper is a full-set number with committed logs.
- Open categories. Multi-session aggregation and preference are where headroom remains; the multi-hop query planner and tuned bi-temporal conflict resolution are the levers we expect to move them.
- Single run; no controlled component ablation yet. We report one full-set run per system and do not yet quantify run-to-run variance from answerer stochasticity, nor a controlled full-set ablation of the read path’s components (notably the facts-only vs. hybrid comparison of §[5](https://arxiv.org/html/2606.09900#S5), which we state as an observation). Repeated-run variance and a component ablation are immediate next work.
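
The “small samples mislead” point can be made concrete with Wilson score intervals, the same interval type the results table uses. In this sketch `wilson` is a hypothetical helper implementing the standard formula, 15 of 18 is an assumed count consistent with an 18-item slice reading 83%, and 418 of 500 is the headline count implied by 83.6%.

```python
from math import sqrt


def wilson(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


small = wilson(15, 18)     # assumed count behind the 18-item reading
full = wilson(418, 500)    # headline accuracy on the full set

small_width = small[1] - small[0]
full_width = full[1] - full[0]
```

At 18 items the interval spans several times the width of the 500-question interval, so a slice-level reading says very little about where the full-set number will land.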

## 7 Conclusion

Engram shows that a precisely retrieved, bi-temporal, hybrid context can beat the full-context baseline *on accuracy* ($+10.4$ points at $\sim 8\times$ fewer tokens on the full LongMemEval$_S$ under the official judge), turning long-term memory from a cost optimization into a quality improvement. Equally, we contribute the neutral, reproducible harness that makes such a claim checkable: the official judge baked in, the full-context baseline in every table, documented measurement-integrity pitfalls, and the raw logs published. In a field where every number is contested, the trustworthy scoreboard is itself the result. The system, the harness, and the logs are open source.

## Ethics and Broader Impact

A long-term memory layer necessarily persists personal data across sessions, which raises real privacy obligations. Two of Engram’s core design choices are mitigations as much as features: *non-destructive invalidation with full provenance* yields an auditable trail of what was believed, when, and from which source, which makes targeted deletion and “right to be forgotten” requests tractable rather than best-effort; and *salience decay* lets unreinforced personal details fade by default. Because every component has a zero-dependency offline fallback, Engram can run fully locally (no user data need leave the operator’s machine), and the AGPL-3.0 license keeps self-hosted data under the operator’s control. The principal risks are the obverse of the benefits: a memory store concentrates sensitive information and must be secured, access-controlled, and scoped to genuine consent; and a memory that confidently surfaces a *stale* or *wrong* fact can mislead, which is precisely why the bi-temporal model and the abstention gate (decline when memory lacks the answer) are first-class rather than optional.

## Reproducibility Statement

Every number in this paper is produced by the in-repo harness and backed by committed per-question logs (prediction, gold, correctness, tokens, and latency for all 500 questions). The end-to-end loop and unit tests run with no API keys and no external services via deterministic offline fallbacks; the benchmark numbers require an OpenAI-compatible model endpoint for the answerer, extractor, and judge, all configurable. The exact command, model identifiers, embedder, and hyper-parameters are in §[4](https://arxiv.org/html/2606.09900#S4); raw logs and the harness are in the repository. The significance tests and confidence intervals in §[5](https://arxiv.org/html/2606.09900#S5) are recomputed from the committed logs by `paper/compute_stats.py`, with no model calls.

## References

- Bei et al\. \[2026\] Yuanchen Bei, Tianxin Wei, Xuying Ning, Yanjun Zhao, Zhining Liu, Xiao Lin, Yada Zhu, Hendrik Hamann, Jingrui He, and Hanghang Tong\. Mem\-Gallery: Benchmarking multimodal long\-term conversational memory for MLLM agents\. *arXiv preprint arXiv:2601\.03515*, 2026\.
- Chhikara et al\. \[2025\] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav\. Mem0: Building production\-ready AI agents with scalable long\-term memory\. *arXiv preprint arXiv:2504\.19413*, 2025\.
- Cormack et al\. \[2009\] Gordon V\. Cormack, Charles L\. A\. Clarke, and Stefan Buettcher\. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods\. In *Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval*, pages 758–759, 2009\.
- Du \[2026\] Pengfei Du\. Memory for autonomous LLM agents: Mechanisms, evaluation, and emerging frontiers\. *arXiv preprint arXiv:2603\.07670*, 2026\.
- Ebbinghaus \[1913\] Hermann Ebbinghaus\. *Memory: A Contribution to Experimental Psychology*\. Teachers College, Columbia University, 1913\. Original work published 1885\.
- Gutiérrez et al\. \[2024\] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su\. HippoRAG: Neurobiologically inspired long\-term memory for large language models\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2024\.
- Gutiérrez et al\. \[2025\] Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su\. From RAG to memory: Non\-parametric continual learning for large language models\. In *International Conference on Machine Learning \(ICML\)*, 2025\.
- Kahneman \[2011\] Daniel Kahneman\. *Thinking, Fast and Slow*\. Farrar, Straus and Giroux, 2011\.
- Kang et al\. \[2025\] Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai\. Memory OS of AI agent\. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, 2025\.
- Karpukhin et al\. \[2020\] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen\-tau Yih\. Dense passage retrieval for open\-domain question answering\. In *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pages 6769–6781, 2020\.
- Lewis et al\. \[2020\] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen\-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela\. Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2020\.
- Liu et al\. \[2024\] Nelson F\. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang\. Lost in the middle: How language models use long contexts\. *Transactions of the Association for Computational Linguistics \(TACL\)*, 12:157–173, 2024\.
- Maharana et al\. \[2024\] Adyasha Maharana, Dong\-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang\. Evaluating very long\-term conversational memory of LLM agents\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\)*, 2024\.
- Packer et al\. \[2023\] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G\. Patil, Ion Stoica, and Joseph E\. Gonzalez\. MemGPT: Towards LLMs as operating systems\. *arXiv preprint arXiv:2310\.08560*, 2023\.
- Park et al\. \[2023\] Joon Sung Park, Joseph C\. O’Brien, Carrie J\. Cai, Meredith Ringel Morris, Percy Liang, and Michael S\. Bernstein\. Generative agents: Interactive simulacra of human behavior\. In *Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology \(UIST\)*, 2023\.
- Rasmussen et al\. \[2025\] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef\. Zep: A temporal knowledge graph architecture for agent memory\. *arXiv preprint arXiv:2501\.13956*, 2025\.
- Robertson and Zaragoza \[2009\] Stephen Robertson and Hugo Zaragoza\. The probabilistic relevance framework: BM25 and beyond\. *Foundations and Trends in Information Retrieval*, 3\(4\):333–389, 2009\.
- Wu et al\. \[2025\] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai\-Wei Chang, and Dong Yu\. LongMemEval: Benchmarking chat assistants on long\-term interactive memory\. In *International Conference on Learning Representations \(ICLR\)*, 2025\.
- Wu et al\. \[2026\] Zhaofen Wu, Hanrong Zhang, Fulin Lin, Wujiang Xu, Xinran Xu, Yankai Chen, Henry Peng Zou, Shaowen Chen, Weizhi Zhang, Xue Liu, Philip S\. Yu, and Hongwei Wang\. GAM: Hierarchical graph\-based agentic memory for LLM agents\. *arXiv preprint arXiv:2604\.12285*, 2026\.
- Xiao et al\. \[2024\] Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian\-Yun Nie\. C\-Pack: Packed resources for general Chinese embeddings\. In *Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval \(SIGIR\)*, 2024\. Introduces the BGE embedding models, incl\. bge\-small\-en\-v1\.5; arXiv:2309\.07597\.
- Xu et al\. \[2025\] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang\. A\-MEM: Agentic memory for LLM agents\. *arXiv preprint arXiv:2502\.12110*, 2025\.

## Appendix A Prompts

For full reproducibility we reproduce the exact prompts the harness uses, verbatim from the repository \(non\-ASCII characters are normalised for typesetting\)\. The *answerer* system prompt \(used with \-\-reasoning; it instructs evidence aggregation, most\-recent\-wins on conflicts, and abstention only as a last resort\):

You answer the user’s question using the provided dated context\. The context has two parts: a FACTS

index \(a digest\) AND the full dated CONVERSATIONS below it\. The answer is USUALLY present \-\- search BOTH

parts thoroughly before concluding anything\. The dates are real; use them for temporal reasoning\.

DIG FOR THE ANSWER FIRST\. Most questions ARE answerable from the history \-\- your job is to find the

evidence, not to give up\. If the FACTS digest doesn’t contain it, scan the full CONVERSATIONS below\.

WHEN THE QUESTION NEEDS MULTIPLE EVIDENCE PIECES \(counting, summing, ordering, date arithmetic, duration,

earliest/latest/most\-recent, multi\-step lookup, knowledge updated over time\):

EVIDENCE: list EVERY relevant dated item from the context, one per line with its date\. Be EXHAUSTIVE

\-\- scan the whole history; a missed item makes a count or total wrong\. Re\-read before finalizing\.

REASON: count distinct items / sum / sort by date / compute the date difference / pick the most recent\.

Show the work in 1\-2 lines\. For durations compute end\_date \- start\_date explicitly\.

ANSWER: <a single concise final answer \-\- no reasoning on this line\>

FOR DATE/TIME/NUMBER ANSWERS: give the MOST SPECIFIC value the context supports \(exact date \> month\+year

\> year; exact duration\)\. Read the exact figure from the text \-\- don’t round \(e\.g\. 27 minutes 45 seconds,

not 28 minutes\)\.

WHEN THE QUESTION ASKS FOR A RECOMMENDATION/PREFERENCE: do NOT refuse, do NOT ask a clarifying question,

do NOT give generic advice\. Ground the answer in the user’s OWN stated history: name the SPECIFIC people,

places, brands, tools, experiences or constraints they mentioned, and tailor the suggestion to those\.

FOR SIMPLE SINGLE\-FACT QUESTIONS: go straight to ’ANSWER: <fact\>’\.

CURRENT\-STATE questions \(’what is my current/latest X’, ’where do I work now’\): report ONLY the most

recent value by date and explicitly disregard older, superseded ones\. The newest dated statement wins\.

When facts conflict, ALWAYS trust the most recent one by date\.

ABSTENTION \-\- a careful LAST resort, only after searching the ENTIRE history \(facts AND all conversations\)

and the SPECIFIC thing asked is genuinely never stated\. Do NOT abstain just because it wasn’t in the FACTS

digest\. If the question presupposes something the user truly never mentioned, reply exactly ’ANSWER: I

don’t know’ rather than guessing\. Never fabricate a value\. The ’ANSWER:’ line is REQUIRED\.
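Because the prompt requires a final ANSWER: line, a caller has to separate that line from any EVIDENCE and REASON lines before grading\. The sketch below shows one way to do so; it is not taken from the harness\. The names extract\_answer and is\_abstention, the choice to take the *last* matching line, and the apostrophe normalisation are assumptions for illustration\.

```python
import re

# Matches a line that starts with "ANSWER:" (leading spaces or tabs allowed).
ANSWER_RE = re.compile(r"^[ \t]*ANSWER:[ \t]*(.*)$", re.MULTILINE)


def extract_answer(completion):
    """Return the text of the last 'ANSWER:' line, or None if there is none.

    Taking the last match is an assumption: it tolerates a model that
    mentions 'ANSWER:' earlier while reasoning.
    """
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    return matches[-1].strip()


def is_abstention(answer):
    """True if the extracted answer is the prompt's abstention string."""
    if answer is None:
        return False
    # Treat the typographic apostrophe (U+2019) like a plain ASCII one.
    normalized = answer.replace("\u2019", "'").strip().lower()
    return normalized.rstrip(".") == "i don't know"


sample = (
    "EVIDENCE: 2023-05-01 ran a 5k in 27 minutes 45 seconds\n"
    "REASON: single dated item, read the exact figure\n"
    "ANSWER: 27 minutes 45 seconds"
)
parsed = extract_answer(sample)
```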

The answer template \(the context is the only thing that varies across systems\):

Today’s date: \{qdate\}

\{context\}

Question: \{question\}

Answer:
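The template has three slots, and only the context slot differs between the systems being compared\. A minimal sketch of filling it follows\. The constant ANSWER\_TEMPLATE, the function build\_user\_prompt, the blank\-line layout, and the sample values are assumptions for illustration, not the harness’s code\.

```python
# Layout mirrors the template above; the exact whitespace is an assumption.
ANSWER_TEMPLATE = (
    "Today's date: {qdate}\n\n"
    "{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)


def build_user_prompt(qdate, context, question):
    """Fill the three slots; only `context` changes between systems."""
    return ANSWER_TEMPLATE.format(qdate=qdate, context=context, question=question)


# Invented sample values: same date and question, two different contexts.
question = "Where do I work now?"
lean_prompt = build_user_prompt("2023-06-01", "FACTS: (retrieved slice)", question)
full_prompt = build_user_prompt("2023-06-01", "CONVERSATIONS: (entire history)", question)
```

Holding the date, question, and surrounding text fixed means any accuracy difference between two runs is attributable to the context alone\.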

The System\-2 fact\-extractor system prompt \(a non\-English worked example is elided for typesetting; values are kept in the user’s original language, only the predicate is English snake\_case\):

You are a precise information\-extraction engine for a long\-term memory system\. From a multi\-turn

conversation, extract the atomic, durable facts it states about the user and the people/things they

mention \(identities, attributes, preferences, relationships, possessions, goals/plans, and events with

their times\)\. Output ONLY a JSON array of objects, each with keys "subject", "predicate", "object",

"text"\. Use short snake\_case predicates \(e\.g\. works\_at, lives\_in, favorite\_color, owns, married\_to,

born\_in, visited\)\. Capture PREFERENCES explicitly with predicates like likes, dislikes, prefers, avoids,

allergic\_to, favorite\_<thing\>; for each preference or dislike stated, output a SEPARATE fact\. Resolve

first\-person \(’I’, ’my’, ’me’\) to the user’s name when known, otherwise to "user"\. Capture a stated name as

\{"subject":"user","predicate":"name","object":"<Name\>"\}\. LANGUAGE: keep subject and object VALUES \(names,

places, brands, free text\) in the SAME language the user used \-\- do NOT translate them; only the predicate

stays English snake\_case\. "text" is a natural one\-sentence statement in the conversation’s language\. Do

NOT infer or invent facts that are not stated\. If there are no durable facts, output \[\]\.
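The prompt fixes an output contract: a JSON array of objects with the four keys "subject", "predicate", "object", and "text", or \[\] when nothing durable is stated\. A defensive parser for that contract might look like the sketch below\. The function parse\_facts and its policy of silently dropping malformed items are assumptions for illustration; the repository’s extractor may validate differently\.

```python
import json

REQUIRED_KEYS = ("subject", "predicate", "object", "text")


def parse_facts(raw):
    """Parse extractor output into a list of fact dicts.

    Returns [] for non-JSON or non-array output, and skips items that lack
    any required key or carry a non-string or empty value (assumed policy).
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    facts = []
    for item in data:
        if not isinstance(item, dict):
            continue
        if all(isinstance(item.get(k), str) and item[k].strip() for k in REQUIRED_KEYS):
            facts.append({k: item[k].strip() for k in REQUIRED_KEYS})
    return facts


# Invented sample output: one well-formed fact and one malformed item.
raw_output = (
    '[{"subject": "user", "predicate": "lives_in", "object": "Lisbon", '
    '"text": "The user lives in Lisbon."}, '
    '{"subject": "user", "predicate": "owns"}]'
)
facts = parse_facts(raw_output)
```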

We grade with the *official* LongMemEval category\-specific judge prompts, reproduced verbatim so scores are leaderboard\-comparable:

\[single\-session\-user / single\-session\-assistant / multi\-session\]

I will give you a question, a correct answer, and a response from a model\. Please answer yes if the

response contains the correct answer\. Otherwise, answer no\. If the response is equivalent to the correct

answer or contains all the intermediate steps to get the correct answer, you should also answer yes\. If

the response only contains a subset of the information required by the answer, answer no\.

Question: \{q\} Correct Answer: \{a\} Model Response: \{r\}

Is the model response correct? Answer yes or no only\.

\[temporal\-reasoning\] \(as above, plus:\)

\.\.\. do not penalize off\-by\-one errors for the number of days\. If the question asks for the number of

days/weeks/months and the model makes off\-by\-one errors \(e\.g\., predicting 19 days when the answer is 18\),

the model’s response is still correct\.

\[knowledge\-update\] \(as above, plus:\)

\.\.\. If the response contains some previous information along with an updated answer, the response should

be considered correct as long as the updated answer is the required answer\.

\[single\-session\-preference\]

I will give you a question, a rubric for desired personalized response, and a response\. Answer yes if the

response satisfies the desired response\. The model need not reflect all points in the rubric; the response

is correct as long as it recalls and utilizes the user’s personal information correctly\.

\[abstention / unanswerable\]

I will give you an unanswerable question, an explanation, and a response\. Answer yes if the model correctly

identifies the question as unanswerable \(it may say the information is incomplete, or give other

information but not the asked information\)\.
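The bracketed headers above imply a routing step: each question type maps to one judge prompt family, and the judge’s yes/no reply becomes a correctness bit\. The sketch below illustrates that routing\. The family labels, the function names, the is\_abstention flag \(how a question is marked unanswerable is left to the caller\), and the startswith rule are assumptions for illustration; the official evaluation script may decide these differently\.

```python
# Question type -> judge prompt family, following the bracketed headers above.
# The family labels are hypothetical names for this sketch.
JUDGE_FAMILY = {
    "single-session-user": "base",
    "single-session-assistant": "base",
    "multi-session": "base",
    "temporal-reasoning": "base+off-by-one",
    "knowledge-update": "base+updated-answer",
    "single-session-preference": "rubric",
}


def judge_family(question_type, is_abstention=False):
    """Pick the judge prompt family; abstention overrides the question type."""
    if is_abstention:
        return "abstention"
    return JUDGE_FAMILY[question_type]


def parse_verdict(judge_reply):
    """Map the judge's reply to a bool (assumed rule: reply starts with 'yes')."""
    return judge_reply.strip().lower().startswith("yes")


def accuracy(verdicts):
    """Fraction of True verdicts; 0.0 for an empty list."""
    verdicts = list(verdicts)
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Invented judge replies, for illustration only.
score = accuracy(parse_verdict(r) for r in ["yes", "No.", " Yes", "no"])
```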

## Appendix B Qualitative Examples

Table [3](https://arxiv.org/html/2606.09900#A2.T3) shows representative cases drawn from the committed 500\-question logs where engram\_lean answers correctly and the full\-context baseline does not\. Two failure modes recur\. \(i\) Lost in the middle \[[12](https://arxiv.org/html/2606.09900#bib.bib12)\]: the evidence *is* present in full\-context’s ∼79k\-token window, yet it returns “I don’t know,” while the lean ∼9\.6k\-token slice surfaces it\. \(ii\) Stale values on knowledge\-update: full\-context returns an incorrect, non\-current value \(“30 dozen,” “26 minutes and 30 seconds”\) while the lean bi\-temporal slice, applying most\-recent\-wins, returns the current one\. These are illustrative, not cherry\-picked headline numbers; all 500 per\-question records \(prediction, gold, correctness\) are in the repository\.

Table 3: Representative cases from the LongMemEval\_S logs where engram\_lean is correct and full\-context is wrong \(ID = the benchmark’s question id; predictions verbatim, lightly truncated\)\.
