Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History
Summary
This paper introduces Engram, an open-source bi-temporal memory engine for LLM agents that retrieves a compact context slice (∼9.6k tokens) to outperform the full-history baseline (79k tokens) by 10.4 accuracy points on LongMemEval, using a hybrid read path fusing dense, lexical, graph, and temporal signals.
# A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History
Source: [https://arxiv.org/html/2606.09900](https://arxiv.org/html/2606.09900)
Liuyin Wang, Independent Researcher\. Correspondence: liuyinwangthu@gmail\.com\. Code, raw logs, and the reproducible harness: [https://github\.com/ly\-wang19/engram](https://github.com/ly-wang19/engram)\.
\(June 2026\)
###### Abstract
Long\-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround, replaying the entire history into the prompt, is expensive, slow, and, as distractors accumulate, *less* accurate\. Most memory systems win on cost or latency but still lose to the full\-context baseline on accuracy, and the field’s benchmark numbers are reported on inconsistent, non\-reproducible harnesses, so the same system appears at wildly different scores across sources\. We present Engram, an open\-source, dual\-process memory engine built on a bi\-temporal data model\. A fast write path appends lossless episodes without an LLM on the critical path; an asynchronous consolidation path extracts atomic \(*subject*, *predicate*, *object*\) facts, builds a bi\-temporal knowledge graph, and resolves contradictions *without* an LLM call per fact, invalidating rather than deleting, so every fact retains provenance and a supersession chain\. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point\-in\-time \(“as\-of”\) temporal filter, and assembles a compact, provenance\-tagged context\. On the full 500\-question LongMemEval\_S benchmark, graded by the *official* category\-specific judge, Engram’s lean configuration, which answers from a ∼9\.6k\-token retrieved slice and never the full history, scores 83\.6% versus 73\.2% for the full\-context baseline \(\+10\.4 points, McNemar exact $p < 10^{-6}$\) while using ∼8× fewer tokens \(9\.6k vs\. 79k\), with 0 of 500 questions errored under both systems\. We find the gain is load\-bearing on the read path being *hybrid*: a facts\-only path loses recall, while facts plus retrieved raw chunks recover detail\.
Beyond the system, we contribute a single neutral, in\-repo evaluation harness with the official judge baked in, the full\-context baseline in every table, and the raw per\-question logs published—and we document the measurement\-integrity pitfalls \(truncation, home\-grown judges, full\-history “leaks”\) that silently distort memory\-benchmark numbers\. Every number in this paper ships with the command to reproduce it\.
## 1 Introduction
LLM agents are stateless across sessions\. The pragmatic fix, concatenating the whole conversation history into the prompt, scales badly on three axes at once: token cost grows linearly with history, latency follows, and accuracy *degrades* as irrelevant turns crowd the window and the model is forced to locate a needle among distractors \[[12](https://arxiv.org/html/2606.09900#bib.bib12)\]\. A long\-term memory layer that stores, structures, and selectively retrieves the past is the natural alternative, and a growing body of systems pursues it \[[14](https://arxiv.org/html/2606.09900#bib.bib14),[15](https://arxiv.org/html/2606.09900#bib.bib15),[2](https://arxiv.org/html/2606.09900#bib.bib2),[16](https://arxiv.org/html/2606.09900#bib.bib16),[6](https://arxiv.org/html/2606.09900#bib.bib6)\]; see Du \[[4](https://arxiv.org/html/2606.09900#bib.bib4)\] for a recent \(2026\) survey\.
Two gaps remain wide open\. First, accuracy: most memory systems are reported as cheaper or faster than full\-context, but *not more accurate*\. Beating full\-context *on accuracy*, with a precisely retrieved slice that outperforms the noisy full window, is the harder and more valuable target, because it turns memory from a cost optimization into a quality improvement\. Second, reproducibility: memory benchmarks are reported on inconsistent harnesses with different ingestion, answer prompts, and judges, so a single system can appear at 58%, 66%, or 92% depending on the source, and different papers give contradictory orderings\. In a field where every number is contested, a neutral harness anyone can re\-run is itself a contribution\.
We address both with Engram, an open\-source memory engine, and its in\-repo evaluation harness\. Our design follows a single principle \(*a number we cannot reproduce does not exist*\) and a composition thesis: no single mechanism wins, so we compose typed memory, a bi\-temporal knowledge graph, multi\-signal retrieval fusion, salience decay, and asynchronous consolidation, and build in the seams between them\.
#### Contributions\.
1. A dual\-process, bi\-temporal memory engine \(§[3](https://arxiv.org/html/2606.09900#S3)\) with cheap, non\-destructive conflict resolution: a contradicted fact is *invalidated* \(with a `supersedes` chain and full provenance\), never overwritten, and the common case is resolved with *no* LLM call\.
2. The empirical result that a lean retrieved context beats full\-context on accuracy \(§[5](https://arxiv.org/html/2606.09900#S5)\): \+10\.4 points \(83\.6% vs\. 73\.2%\) at ∼8× fewer tokens on the full 500\-question LongMemEval\_S under the official judge \(i\.e\., removing distractors *raises* accuracy\), together with the finding that a *hybrid* facts\-plus\-chunks read path is load\-bearing \(facts alone lose recall\)\.
3. A neutral, reproducible harness \(§[4](https://arxiv.org/html/2606.09900#S4)\): one in\-repo pipeline with the official judge baked in, the full\-context baseline in every table, the same answerer and judge applied to every system by construction, and the raw per\-question logs published\. We additionally document the measurement\-integrity pitfalls we found and fixed\.
4. A per\-category analysis \(§[5](https://arxiv.org/html/2606.09900#S5)\) isolating where bi\-temporal modeling pays off \(knowledge\-update 87\.5%, temporal 81\.1%\) and where headroom remains \(multi\-session aggregation, preference\)\.
Engram is dual\-licensed \(AGPL\-3\.0 plus a commercial license\), self\-hostable, and runs end\-to\-end with zero setup \(no API keys, no external services\) via deterministic offline fallbacks, so the architecture and the demo are reproducible without network access\.
## 2 Related Work
#### Memory systems for LLM agents\.
MemGPT \[[14](https://arxiv.org/html/2606.09900#bib.bib14)\] frames the LLM as an operating system that pages memory in and out of a limited context window\. Generative Agents \[[15](https://arxiv.org/html/2606.09900#bib.bib15)\] introduce a memory stream with importance, recency, and relevance scoring plus periodic reflection\. Mem0 \[[2](https://arxiv.org/html/2606.09900#bib.bib2)\] targets production agents with scalable extract\-and\-store memory\. Zep/Graphiti \[[16](https://arxiv.org/html/2606.09900#bib.bib16)\] is the closest in spirit to Engram: a temporally\-aware knowledge\-graph memory with bi\-temporal edges\. HippoRAG \[[6](https://arxiv.org/html/2606.09900#bib.bib6)\] uses a graph plus personalized PageRank for multi\-hop retrieval, and its successor HippoRAG 2 \[[7](https://arxiv.org/html/2606.09900#bib.bib7)\] reframes the same machinery as non\-parametric continual memory\. The most recent agentic\-memory systems add LLM\-driven control over how memories are written and organized: A\-MEM \[[21](https://arxiv.org/html/2606.09900#bib.bib21)\] builds an interlinked, Zettelkasten\-style note network; MemoryOS \[[9](https://arxiv.org/html/2606.09900#bib.bib9)\] imposes an operating\-system\-style short/mid/long\-term hierarchy; and GAM \[[19](https://arxiv.org/html/2606.09900#bib.bib19)\] decouples memory encoding from consolidation over hierarchical graphs\. Engram differs in *composing* a bi\-temporal graph with a hybrid \(facts \+ raw chunks\) read path and an explicit, cheap conflict\-resolution policy, and, distinctively, in shipping the neutral harness that lets these design choices be measured rather than asserted\.
#### Long context and the cost of distractors\.
“Lost in the middle” \[[12](https://arxiv.org/html/2606.09900#bib.bib12)\] shows that LLMs use long contexts unevenly and that accuracy degrades when relevant evidence is buried among distractors\. This is the mechanism our headline result exploits: a filtered, precisely retrieved slice can be *more* accurate than the full window, not merely cheaper\.
#### Benchmarks\.
LongMemEval \[[18](https://arxiv.org/html/2606.09900#bib.bib18)\] evaluates chat assistants on long\-term interactive memory across categories \(single/multi\-session, knowledge\-update, temporal reasoning, preference, and abstention\), with category\-specific judge prompts\. LOCOMO \[[13](https://arxiv.org/html/2606.09900#bib.bib13)\] evaluates very long\-term conversational memory, and more recent benchmarks extend the setting to multimodal histories \(Mem\-Gallery \[[1](https://arxiv.org/html/2606.09900#bib.bib1)\]\)\. We report on LongMemEval\_S here and treat LOCOMO and additional backbones as immediate future work \(§[6](https://arxiv.org/html/2606.09900#S6)\)\.
#### Retrieval\.
Engram’s read path builds on dense retrieval \[[10](https://arxiv.org/html/2606.09900#bib.bib10)\] with modern dense sentence embeddings \(the BGE family \[[20](https://arxiv.org/html/2606.09900#bib.bib20)\]; we use the English `bge-small-en-v1.5`\), classical lexical scoring \(BM25 \[[17](https://arxiv.org/html/2606.09900#bib.bib17)\]\), and rank fusion via Reciprocal Rank Fusion \[[3](https://arxiv.org/html/2606.09900#bib.bib3)\], in the retrieval\-augmented generation tradition \[[11](https://arxiv.org/html/2606.09900#bib.bib11)\]\. The dual\-process framing draws on the System\-1/System\-2 distinction in cognitive science \[[8](https://arxiv.org/html/2606.09900#bib.bib8)\], and salience decay on the classical forgetting curve \[[5](https://arxiv.org/html/2606.09900#bib.bib5)\]\.
## 3 The Engram System
Engram is a dual\-process memory system: a fast online write path \(System\-1\) and a slow asynchronous consolidation path \(System\-2\) that writes into a typed, bi\-temporal memory, which a hybrid read path queries \(Figure [1](https://arxiv.org/html/2606.09900#S3.F1)\)\.
\[Figure 1 diagram: `add(messages)` → System\-1 hot write path \(no LLM, <50 ms: append lossless episode; identity resolution across sessions/devices; light embed \+ enqueue\) → async queue → System\-2 asynchronous consolidation \(seconds: extract atomic facts \(s, p, o\); build bi\-temporal knowledge graph; conflict detect → invalidate; salience \+ decay\) → typed memory on pluggable stores \(episodic; semantic graph, bi\-temporal; profile/identity; procedural\) → read path \(hybrid retrieval, <100 ms: query decompose \(multi\-hop\); dense \+ BM25 \+ graph \+ recency; RRF fusion \(\+ rerank\); bi\-temporal as\-of filter; abstention gate; assemble context\) → `search(query)` returns an answer\-ready context\.\]

Figure 1: The Engram dual\-process architecture\. A hot write path \(System\-1\) never blocks on an LLM; an asynchronous consolidation path \(System\-2\) extracts atomic facts, builds the bi\-temporal knowledge graph, and resolves conflicts non\-destructively; both feed a typed, bi\-temporal memory backed by pluggable stores; and a hybrid read path retrieves a compact, provenance\-tagged slice \(dense \+ lexical \+ graph \+ recency, fused by RRF\), applies an as\-of temporal filter and an abstention gate, and assembles the answer context\.

### 3\.1 Bi\-temporal data model
Two time axes are first\-class everywhere, which is what makes knowledge\-updates and “as\-of” queries intrinsic rather than bolted\-on\.
- Episode: a raw, lossless turn/event, stamped with *event time* \(when it happened in the world\) and *ingested\-at* \(transaction time, when we recorded it\)\.
- Fact: an atomic \(*subject*, *predicate*, *object*\) claim with surface text, an embedding, a *salience* and *confidence*, and *provenance* \(the source episode ids\)\. It carries two time axes: *valid time* \(`valid_at`/`invalid_at`, when the claim is true in the world\) and *transaction time* \(`created_at`/`expired_at`, when we learned or retracted it\), plus a `supersedes` pointer to the fact it replaces\.
- Entity / Relation: graph nodes and edges; edges carry the same bi\-temporal stamps\.
The invariant: *never hard\-delete a contradicted fact; invalidate it* \(set `invalid_at`\) and keep the history\. Every fact can therefore answer “where did this come from?” and “what did it replace?” \(Figure [2](https://arxiv.org/html/2606.09900#S3.F2)\)\.
\[Figure 2 diagram: a world\-time axis marked $t_0$, $t_1$, and now\. Fact A, “works at Tencent”, has `valid_at` $= t_0$ and `invalid_at` $= t_1$ \(kept; invalidated, not deleted\)\. Fact B, “works at Moonshot AI”, has `valid_at` $= t_1$ and supersedes A\. As\-of $t_0$, *“where does X work?”* ⇒ Tencent; as\-of now, *“where does X work?”* ⇒ Moonshot AI\.\]

Figure 2: Bi\-temporal facts make contradictions and “as\-of” queries first\-class\. When Fact B arrives at $t_1$, Engram sets `A.invalid_at` $= t_1$ and `B.supersedes` $=$ A: Fact A is *invalidated, not deleted*, so the history \(and provenance\) survives\. A point\-in\-time query then resolves against the valid fact for that time \(Tencent as\-of $t_0$, Moonshot AI as\-of now\), which is exactly what the knowledge\-update and temporal categories require\.
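
The Tencent / Moonshot AI example of Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not Engram's actual schema: the `Fact` class, the `supersede` and `as_of` helpers, and the concrete dates are hypothetical, and only the valid-time axis (`valid_at`/`invalid_at`) is modelled; transaction time is omitted for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    """Hypothetical minimal fact record; field names follow the text."""
    id: str
    subject: str
    predicate: str
    obj: str
    valid_at: datetime                     # valid time: claim becomes true
    invalid_at: Optional[datetime] = None  # valid time: claim stops being true
    supersedes: Optional[str] = None       # id of the fact this one replaces


def supersede(old: Fact, new: Fact) -> None:
    """Invalidate `old` in place instead of deleting it."""
    old.invalid_at = new.valid_at
    new.supersedes = old.id


def as_of(facts: list[Fact], subject: str, predicate: str, t: datetime) -> Optional[Fact]:
    """Return the fact for (subject, predicate) that is valid at time t, if any."""
    for f in facts:
        if (f.subject, f.predicate) != (subject, predicate):
            continue
        # Half-open validity interval: [valid_at, invalid_at)
        if f.valid_at <= t and (f.invalid_at is None or t < f.invalid_at):
            return f
    return None


# Illustrative dates only; the paper does not give concrete values for t0, t1.
t0, t1 = datetime(2023, 1, 1), datetime(2025, 1, 1)
a = Fact("A", "X", "works_at", "Tencent", valid_at=t0)
b = Fact("B", "X", "works_at", "Moonshot AI", valid_at=t1)
supersede(a, b)
store = [a, b]  # both facts are retained; nothing is deleted

print(as_of(store, "X", "works_at", datetime(2024, 6, 1)).obj)  # Tencent
print(as_of(store, "X", "works_at", datetime(2026, 1, 1)).obj)  # Moonshot AI
```

Because Fact A stays in the store with its `invalid_at` stamp, the same query answers differently depending on the as-of time, and `b.supersedes` still points back to what was replaced.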
### 3\.2 System\-1: the hot write path
On `add(messages)`, Engram appends a lossless episode, resolves identity \(linking a user/entity across sessions and devices\), computes a light embedding, and enqueues the episode for consolidation\. No LLM runs on this path, keeping it within a sub\-50 ms budget so writes never block the agent\.
### 3\.3 System\-2: asynchronous consolidation
Off the critical path, the consolidation engine \(i\) extracts atomic facts from episodes \(rule\-based by default, LLM\-based when a model is configured\), \(ii\) builds the bi\-temporal knowledge graph of entities and relations, \(iii\) detects conflicts and invalidates superseded facts, and \(iv\) scores salience and applies decay/reinforcement so unreinforced memories fade and the store stays small and fast\. Hierarchical abstraction \(session summaries → profile\) runs here as well\.
### 3\.4 Cheap\-then\-escalate conflict resolution
When a new fact arrives for an existing \(*subject*, *predicate*\) slot with a different object, Engram resolves it in increasing order of cost:
1. Slot match \(exact\) signals a likely update; embedding similarity catches the same attribute under a different free\-form predicate; content subsumption \(one claim ⊆ the other\) separates contradiction from elaboration\.
2. If the claims are clearly contradictory and temporally ordered, invalidate the old one \(`old.invalid_at` ← `new.valid_at`\) and set `new.supersedes` ← `old.id`\. *No LLM call\.*
3. Only genuinely ambiguous cases escalate to an LLM adjudicator\.
This is the cost win over systems that invoke an LLM on every fact, while preserving production\-grade temporal correctness and a complete audit trail\.
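
The three\-step policy above can be sketched as a single decision function. This is an illustration under stated assumptions rather than Engram's implementation: the `Fact`, `Decision`, `subsumes`, and `resolve` names are hypothetical, content subsumption is approximated by token\-set containment, and the embedding\-similarity check for free\-form predicates is left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    NO_CONFLICT = "no_conflict"    # same claim or an elaboration; keep both
    INVALIDATE_OLD = "invalidate"  # clear, temporally ordered contradiction
    ESCALATE = "escalate"          # ambiguous; hand to an LLM adjudicator


@dataclass
class Fact:
    id: str
    subject: str
    predicate: str
    obj: str
    valid_at: float                      # timestamp on the valid-time axis
    invalid_at: Optional[float] = None
    supersedes: Optional[str] = None


def subsumes(a: str, b: str) -> bool:
    """Crude subsumption proxy (an assumption): one token set contains the other."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return ta <= tb or tb <= ta


def resolve(old: Fact, new: Fact) -> Decision:
    # Step 1: only an exact (subject, predicate) slot match is treated as an update.
    if (old.subject, old.predicate) != (new.subject, new.predicate):
        return Decision.NO_CONFLICT
    if old.obj == new.obj or subsumes(old.obj, new.obj):
        return Decision.NO_CONFLICT  # elaboration, not contradiction
    # Step 2: contradictory and temporally ordered -> invalidate, no LLM call.
    if new.valid_at > old.valid_at:
        old.invalid_at = new.valid_at
        new.supersedes = old.id
        return Decision.INVALIDATE_OLD
    # Step 3: anything else is ambiguous and would go to the adjudicator.
    return Decision.ESCALATE


old = Fact("f1", "X", "works_at", "Tencent", valid_at=1.0)
new = Fact("f2", "X", "works_at", "Moonshot AI", valid_at=2.0)
decision = resolve(old, new)
print(decision, old.invalid_at, new.supersedes)
```

Only the `ESCALATE` branch would cost a model call; the common update case is handled by comparisons and two field writes.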
### 3\.5 The hybrid read path
On `search(query)`, Engram \(1\) understands the query and, for multi\-hop questions, decomposes it into sub\-queries; \(2\) retrieves in parallel through four complementary channels: dense semantic, BM25 lexical, graph $n$\-hop from the query’s entities, and recency/salience; \(3\) fuses the ranked lists with Reciprocal Rank Fusion \[[3](https://arxiv.org/html/2606.09900#bib.bib3)\] and an optional cross\-encoder rerank over the top\-$k$ \(off by default\); \(4\) applies a bi\-temporal “as\-of” filter \(what we believed was true at time $T$\); \(5\) passes an abstention gate that declines to answer when the evidence is absent; and \(6\) assembles a deduplicated, provenance\-tagged, token\-budgeted context\. The per\-item score combines all signals:

$$
\mathrm{score}(\textit{item} \mid q) = w_{\text{sem}} \cos(q, \textit{item}) + w_{\text{lex}}\,\mathrm{bm25}(q, \textit{item}) + w_{\text{graph}}\,\mathrm{prox}(\textit{item}) + w_{\text{rec}}\,e^{-\Delta t/\tau} + w_{\text{sal}}\,\mathrm{sal}(\textit{item}) \tag{1}
$$

where the weights are configuration\-driven and tuned on the harness rather than hand\-set\. Crucially, the assembled context is *hybrid*: it contains both the conflict\-resolved bi\-temporal facts and the most relevant raw session chunks \(plus session\-level summaries\)\. §[5](https://arxiv.org/html/2606.09900#S5) shows this hybrid composition is necessary: facts alone lose recall\.
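
As a reading aid for Eq\. \(1\), the sketch below evaluates the five\-term weighted sum for one item. The `Signals` container, the `score` function, the weight values, and the decay constant `tau` are all placeholders chosen for illustration; they are not the tuned defaults referred to in the text, and BM25 is assumed to be pre\-normalised to \[0, 1\] so the terms are comparable.

```python
import math
from dataclasses import dataclass


@dataclass
class Signals:
    cos_sim: float     # cos(q, item)
    bm25: float        # bm25(q, item), assumed pre-normalised to [0, 1]
    graph_prox: float  # prox(item)
    age: float         # delta-t, in the same unit as tau
    salience: float    # sal(item)


# Placeholder weights for illustration only (not the paper's tuned values).
WEIGHTS = {"sem": 0.4, "lex": 0.2, "graph": 0.2, "rec": 0.1, "sal": 0.1}


def score(s: Signals, w: dict = WEIGHTS, tau: float = 30.0) -> float:
    """Weighted sum of the five signals, term by term as in Eq. (1)."""
    return (
        w["sem"] * s.cos_sim
        + w["lex"] * s.bm25
        + w["graph"] * s.graph_prox
        + w["rec"] * math.exp(-s.age / tau)
        + w["sal"] * s.salience
    )


fresh = Signals(cos_sim=0.8, bm25=0.5, graph_prox=0.3, age=0.0, salience=0.6)
stale = Signals(cos_sim=0.8, bm25=0.5, graph_prox=0.3, age=300.0, salience=0.6)
print(score(fresh), score(stale))  # the stale item scores lower via the recency term
```

With all other signals equal, the only difference between the two items is the recency term $e^{-\Delta t/\tau}$, which shrinks toward zero as the item ages.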
#### Pluggable backends\.
Every external dependency—LLM, embedder, vector store, graph store, lexical index—sits behind an interface with a zero\-dependency offline fallback \(a hashing embedder, a rule\-based extractor, in\-memory stores\)\. The end\-to\-end loop and the unit tests therefore run deterministically with no API keys and no services; real backends \(BGE embeddings, LanceDB/Qdrant/pgvector, Kuzu/Neo4j, any LLM via LiteLLM\) slot in behind the same interfaces\.
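
The interface\-plus\-fallback pattern can be illustrated with an embedder. The sketch below assumes a hypothetical `Embedder` protocol and a hashed bag\-of\-words `HashingEmbedder`; Engram's real interface and hashing scheme may differ. The point is only that a deterministic, standard\-library\-only implementation can stand in for a learned model behind the same method signature.

```python
import hashlib
import math
from typing import List, Protocol


class Embedder(Protocol):
    """Hypothetical interface; any object with this method can be plugged in."""
    def embed(self, text: str) -> List[float]: ...


class HashingEmbedder:
    """Deterministic, dependency-free fallback: signed hashed bag-of-words."""

    def __init__(self, dim: int = 256) -> None:
        self.dim = dim

    def embed(self, text: str) -> List[float]:
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            sign = 1.0 if digest[4] % 2 == 0 else -1.0
            vec[bucket] += sign
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]  # unit-normalised (or all zeros)


def cosine(a: List[float], b: List[float]) -> float:
    """Dot product; equals cosine similarity for unit-normalised inputs."""
    return sum(x * y for x, y in zip(a, b))


def nearest(embedder: Embedder, query: str, docs: List[str]) -> str:
    """Works with any backend that satisfies the Embedder protocol."""
    q = embedder.embed(query)
    return max(docs, key=lambda d: cosine(q, embedder.embed(d)))


offline = HashingEmbedder()
docs = ["works at Tencent", "likes green tea"]
print(nearest(offline, "works at Tencent", docs))
```

A real backend such as a BGE model would replace `HashingEmbedder` without any change to `nearest`, which is the property that lets tests run with no API keys.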
## 4 Experimental Setup
#### Benchmark\.
LongMemEval\_S \[[18](https://arxiv.org/html/2606.09900#bib.bib18)\]: 500 questions, each over a haystack of ∼50 sessions \(∼115k tokens\), pulled from the public release\. Questions span seven categories \(Figure [4](https://arxiv.org/html/2606.09900#S5.F4)\)\. We grade with the *official* category\-specific judge prompts, including the temporal off\-by\-one tolerance, knowledge\-update old\-information tolerance, preference\-rubric leniency, the “contains the answer” semantics, and unanswerable \(abstention\) detection, rather than a home\-grown judge\.
#### Models\.
Embedder: `BAAI/bge-small-en-v1.5` \(local, no API key\)\. System\-2 fact extractor: `doubao-seed-1.6-flash`\. Answerer: `doubao-seed-2.0-pro`\. Judge: `DeepSeek-V3.2`, a strict standard judge\. Models are addressed as `provider:model` and resolved via LiteLLM, so any OpenAI\-compatible endpoint \(OpenAI, DeepSeek, a local model\) can be substituted and the same commands run\.
#### Systems\.
`engram_lean` \(our headline\) retrieves a small hybrid slice \(conflict\-resolved bi\-temporal facts \+ the most relevant raw session chunks \+ session summaries\) and answers from *that alone* \(∼9\.6k tokens\), never the full history\. `full_context` is the baseline: stuff the entire haystack into the prompt \(∼79k tokens\)\. The harness applies the *same answerer and judge to every system*, so any within\-run comparison is apples\-to\-apples by construction\.
#### Retrieval configuration\.
The hybrid read path fuses five ranked signals by weighted Reciprocal Rank Fusion \($k_{\text{RRF}} = 60$\) with the weights in Table [1](https://arxiv.org/html/2606.09900#S4.T1) \(repository defaults, tuned on the harness, not hand\-set per question\)\. The lean slice assembles up to 8 conflict\-resolved facts, the top\-15 fused items, 2 raw session chunks, and 28 session\-level summaries; exact flag semantics are in the reproduce command below\.
Table 1: Retrieval\-fusion weights for Eq\. \([1](https://arxiv.org/html/2606.09900#S3.E1)\) and the RRF constant, as used by the headline `engram_lean` configuration \(defaults in `engram/config.py`\)\.
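
For readers unfamiliar with weighted Reciprocal Rank Fusion, the sketch below shows one common formulation: each channel contributes $w_c / (k + \text{rank})$ per item, with ranks starting at 1 and $k = 60$. The `weighted_rrf` function, the channel names, the item ids, and the weights are hypothetical placeholders, not the values from Table 1, and the harness may differ in details such as tie\-breaking.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[str]:
    """Fuse ranked lists: score(d) = sum over channels of w_c / (k + rank_c(d))."""
    scores: Dict[str, float] = defaultdict(float)
    for channel, ranked_ids in rankings.items():
        w = weights.get(channel, 1.0)
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += w / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


rankings = {
    "dense": ["c1", "c2", "c3"],
    "bm25": ["c2", "c1", "c4"],
    "graph": ["c2", "c5"],
}
weights = {"dense": 1.0, "bm25": 1.0, "graph": 0.5}  # placeholders, not Table 1
fused = weighted_rrf(rankings, weights)
print(fused[:2])  # ['c2', 'c1']
```

Because only ranks enter the formula, channels with incomparable raw scores (cosine similarity, BM25, graph proximity) can be fused without any score normalisation; an item that several channels rank highly, like `c2` here, rises to the top.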
#### Reproduce\.
The headline run is a single command \(raw per\-question logs are committed to the repository\):
> python eval/bench\.py \-\-data s \-\-limit 500 \\
> \-\-systems engram\_lean,full\_context \\
> \-\-answerer volcano:doubao\-seed\-2\-0\-pro \\
> \-\-judge volcano:deepseek\-v3\-2 \-\-extractor volcano:doubao\-seed\-1\-6\-flash \\
> \-\-embedder bge\-small \-\-reasoning \-\-persona \\
> \-\-chunks 2 \-\-topk 15 \-\-extract\-k 8 \-\-summ\-k 28 \-\-n\-summaries 28

`python eval/report.py <run.jsonl>` recomputes every table below from the logs\.
#### Measurement\-integrity notes\.
We document the bugs that silently inflate or deflate memory\-benchmark numbers, because they are the reason cross\-source numbers disagree:
1. Lean, not full\-history \(the honest test\)\. An earlier headline prepended facts *above the entire history*; because that system *contains* full\-context, it cannot really lose to it and does not validate the retrieval thesis\. The headline is now `engram_lean`, which answers from a ∼9\.6k\-token retrieved slice\.
2. Full\-context truncation\. The baseline was once capped below the haystack size, feeding it only the oldest sessions; this *deflated the baseline*\. Fixed so it receives the whole haystack\. \(Any “full\-context only scores 30%” claim predates this fix\.\)
3. Official judge, not a home\-grown one\. A generic “same info?” judge was *stricter* than the official LongMemEval judge and made scores non\-comparable; we use the official prompts verbatim\.
4. Abstention questions are graded by the official *unanswerable* judge\.
5. Reliability\. The LLM client uses exponential\-backoff retry with transient/permanent error classification; the headline run completed with 0 errored questions of 500 under both systems\.
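
The reliability item above describes a standard pattern that can be sketched as follows. The exception classes, the `call_with_retry` helper, and its default parameters are hypothetical; how the actual client classifies provider errors is not specified in the text.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Worth retrying: timeouts, rate limits, 5xx-style failures."""


class PermanentError(Exception):
    """Not worth retrying: malformed request, authentication failure, etc."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry transient failures with capped exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except PermanentError:
            raise  # fail fast: retrying cannot help
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries
    raise ValueError("max_attempts must be at least 1")


calls = {"n": 0}


def flaky() -> str:
    """Simulated model call that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return "ok"


delays: list = []
result = call_with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, calls["n"], len(delays))  # ok 3 2
```

Separating the two error classes matters for benchmark integrity: a transient failure that is silently counted as a wrong answer deflates a score, while retrying a permanent failure only wastes the budget.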
## 5 Results
#### Headline\.
Table [2](https://arxiv.org/html/2606.09900#S5.T2) and Figure [3](https://arxiv.org/html/2606.09900#S5.F3) report the full 500\-question result\. Engram’s lean configuration beats the full\-context baseline by \+10\.4 points \(83\.6% vs\. 73\.2%\) while using ∼8× fewer tokens \(9\.6k vs\. 79k\), with 0 of 500 errored under both\. The filtered slice is not merely cheaper; it is *more accurate* than the full window, consistent with the distractor mechanism of Liu et al\. \[[12](https://arxiv.org/html/2606.09900#bib.bib12)\]\. And because the retrieved slice is bounded, Engram’s cost stays flat as history grows, whereas the cost of full\-context grows with it\. The margin is statistically decisive: across the 500 paired questions `engram_lean` is correct on 81 that the baseline misses versus 29 the other way \(McNemar’s exact test, $p < 10^{-6}$\), and a paired bootstrap puts the 95% CI of the gain at \[\+6\.4, \+14\.4\] points\. All statistics are recomputed from the committed logs by `paper/compute_stats.py`\.
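
The significance claim can be checked by hand from the two discordant counts quoted above. The sketch below is not the repository's `paper/compute_stats.py`; it is an independent illustration that assumes the usual two\-sided exact \(binomial\) form of McNemar's test and the standard Wilson score interval, with hypothetical function names.

```python
from math import comb, sqrt


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n, k = b + c, min(b, c)
    # Under H0 each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def wilson(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Discordant pairs reported in the text: 81 lean-only vs. 29 baseline-only.
p_value = mcnemar_exact(b=29, c=81)
gain = (81 - 29) / 500  # 0.104, i.e. the +10.4-point difference
print(f"gain = {gain:.3f}, McNemar exact p = {p_value:.2e}")

# 83.6% and 73.2% of 500 correspond to 418 and 366 correct answers.
print("engram_lean  Wilson 95% CI:", wilson(418, 500))
print("full_context Wilson 95% CI:", wilson(366, 500))
```

Note that only the 110 discordant questions enter the test; the questions both systems get right or both get wrong carry no information about which system is better.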
Table 2: LongMemEval\_S, 500 questions, official judge\. Same answerer \(`doubao-seed-2.0-pro`\) and judge \(`DeepSeek-V3.2`\) applied to both systems\. Accuracy is shown with a Wilson 95% confidence interval; the paired difference is significant \(McNemar exact $p < 10^{-6}$\)\.

Figure 3: Accuracy vs\. average context tokens on LongMemEval\_S \(500 questions, official judge\)\. `engram_lean` \(star\) sits up and to the left of the full\-context baseline \(square\): \+10\.4 points at ∼8× fewer tokens\. The `engram_full` variant \(open circle\) prepends the same facts above the *whole* history and lands at 83\.4%: the structured facts carry the accuracy, while the full history adds tokens, not correctness\.
#### Lean retrieval matches full\-history accuracy at 1/8 the cost\.
For reference, in the same 500\-question run a non\-lean variant that prepends the conflict\-resolved facts *above the full history* \(`engram_full`, ∼79k tokens\) scores 83\.4% \(416/499; one question errored under this variant, versus none for `engram_lean` and full\-context\)\. A paired McNemar test against `engram_lean`’s 83\.6% finds *no* significant difference \($p = 0.91$\): the structured facts contribute essentially all of the accuracy gain, while the full history adds tokens, not correctness\. We therefore headline the lean number\.
#### Per\-category\.
Figure [4](https://arxiv.org/html/2606.09900#S5.F4) breaks `engram_lean` down by category\. Bi\-temporal modeling pays off where it should: *knowledge\-update* \(most\-recent\-wins via invalidation\) reaches 87\.5% and *temporal\-reasoning* \(date\-stamped context \+ as\-of filtering\) 81\.1%\. *Abstention* reaches 86\.7% under the official unanswerable judge: the system declines when memory lacks the answer rather than hallucinating\. The headroom is concentrated in *multi\-session* aggregation \(counting/aggregating across many sessions, 79\.3%\) and *single\-session\-preference* \(73\.3%, a category that is hard field\-wide\)\.
Figure 4: Per\-category accuracy of `engram_lean` on the full 500\-question set \(with per\-category $n$; dashed line is the 83\.6% overall\)\. The two categories where bi\-temporal modelling is decisive, *knowledge\-update* and *temporal\-reasoning*, are highlighted\. Headroom concentrates in *multi\-session* aggregation and *single\-session\-preference* \(hard field\-wide\)\.
#### The read path must be hybrid\.
A load\-bearing finding from development: a *facts\-only* read path, answering purely from extracted \(s, p, o\) facts, *loses recall* relative to the hybrid path, because extraction drops detail that some questions need verbatim\. Adding the most relevant raw session chunks back alongside the conflict\-resolved facts restores that detail; the facts contribute conflict\-resolved, bi\-temporal signal and the chunks restore specificity\. The headline configuration is hybrid for exactly this reason, and we caution against shipping facts\-only QA\. We report this as a design observation; a controlled facts\-only ablation on the full 500\-question set under the same judge is left as future work \(§[6](https://arxiv.org/html/2606.09900#S6)\), not claimed here\.
#### Efficiency\.
The retrieved context averages ∼9\.6k tokens, ∼8× leaner than the ∼79k full\-context baseline, and the lean read path is what keeps cost flat as history grows\. Retrieval itself is sub\-second; the ∼60 s p50 end\-to\-end latency is dominated by the answerer model’s generation call, not by Engram\.
## 6 Discussion and Limitations
We report the result openly rather than as a leaderboard “win,” precisely because our thesis is that cross\-harness numbers are not comparable\. The honest scope of the present evidence:
- One benchmark, one answerer backbone\. The headline is LongMemEval\_S with a single answerer model\. The evaluation discipline we advocate requires multiple backbones \(a small open model *and* a frontier model\) and multiple benchmarks; extending to LOCOMO \[[13](https://arxiv.org/html/2606.09900#bib.bib13)\] and a second backbone is immediate next work, so that memory quality is shown not to depend on one model’s ability to read our structures\.
- Small samples mislead\. During development an 18\-item slice once read 83% when the full\-set truth was ∼58%; we therefore report *only* full\-500 numbers, and every number in this paper is a full\-set number with committed logs\.
- Open categories\. Multi\-session aggregation and preference are where headroom remains; the multi\-hop query planner and tuned bi\-temporal conflict resolution are the levers we expect to move them\.
- Single run; no controlled component ablation yet\. We report one full\-set run per system and do not yet quantify run\-to\-run variance from answerer stochasticity, nor a controlled full\-set ablation of the read path’s components \(notably the facts\-only vs\. hybrid comparison of §[5](https://arxiv.org/html/2606.09900#S5), which we state as an observation\)\. Repeated\-run variance and a component ablation are immediate next work\.
## 7 Conclusion
Engram shows that a precisely retrieved, bi\-temporal, hybrid context can beat the full\-context baseline *on accuracy* \(\+10\.4 points at ∼8× fewer tokens on the full LongMemEval\_S under the official judge\), turning long\-term memory from a cost optimization into a quality improvement\. Equally, we contribute the neutral, reproducible harness that makes such a claim checkable: the official judge baked in, the full\-context baseline in every table, documented measurement\-integrity pitfalls, and the raw logs published\. In a field where every number is contested, the trustworthy scoreboard is itself the result\. The system, the harness, and the logs are open source\.
## Ethics and Broader Impact
A long\-term memory layer necessarily persists personal data across sessions, which raises real privacy obligations\. Two of Engram’s core design choices are mitigations as much as features: *non\-destructive invalidation with full provenance* yields an auditable trail of what was believed, when, and from which source, which makes targeted deletion and “right to be forgotten” requests tractable rather than best\-effort; and *salience decay* lets unreinforced personal details fade by default\. Because every component has a zero\-dependency offline fallback, Engram can run fully locally \(no user data need leave the operator’s machine\), and the AGPL\-3\.0 license keeps self\-hosted data under the operator’s control\. The principal risks are the obverse of the benefits: a memory store concentrates sensitive information and must be secured, access\-controlled, and scoped to genuine consent; and a memory that confidently surfaces a *stale* or *wrong* fact can mislead, which is precisely why the bi\-temporal model and the abstention gate \(decline when memory lacks the answer\) are first\-class rather than optional\.
## Reproducibility Statement
Every number in this paper is produced by the in\-repo harness and backed by committed per\-question logs \(prediction, gold, correctness, tokens, and latency for all 500 questions\)\. The end\-to\-end loop and unit tests run with no API keys and no external services via deterministic offline fallbacks; the benchmark numbers require an OpenAI\-compatible model endpoint for the answerer, extractor, and judge, all configurable\. The exact command, model identifiers, embedder, and hyper\-parameters are in §[4](https://arxiv.org/html/2606.09900#S4); raw logs and the harness are in the repository\. The significance tests and confidence intervals in §[5](https://arxiv.org/html/2606.09900#S5) are recomputed from the committed logs by `paper/compute_stats.py`, with no model calls\.
## References
- Bei et al\. \[2026\] Yuanchen Bei, Tianxin Wei, Xuying Ning, Yanjun Zhao, Zhining Liu, Xiao Lin, Yada Zhu, Hendrik Hamann, Jingrui He, and Hanghang Tong\. Mem\-Gallery: Benchmarking multimodal long\-term conversational memory for MLLM agents\. *arXiv preprint arXiv:2601\.03515*, 2026\.
- Chhikara et al\. \[2025\] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav\. Mem0: Building production\-ready AI agents with scalable long\-term memory\. *arXiv preprint arXiv:2504\.19413*, 2025\.
- Cormack et al\. \[2009\] Gordon V\. Cormack, Charles L\. A\. Clarke, and Stefan Buettcher\. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods\. In *Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval*, pages 758–759, 2009\.
- Du \[2026\] Pengfei Du\. Memory for autonomous LLM agents: Mechanisms, evaluation, and emerging frontiers\. *arXiv preprint arXiv:2603\.07670*, 2026\.
- Ebbinghaus \[1913\] Hermann Ebbinghaus\. *Memory: A Contribution to Experimental Psychology*\. Teachers College, Columbia University, 1913\. Original work published 1885\.
- Gutiérrez et al\. \[2024\] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su\. HippoRAG: Neurobiologically inspired long\-term memory for large language models\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2024\.
- Gutiérrez et al\. \[2025\] Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su\. From RAG to memory: Non\-parametric continual learning for large language models\. In *International Conference on Machine Learning \(ICML\)*, 2025\.
- Kahneman \[2011\] Daniel Kahneman\. *Thinking, Fast and Slow*\. Farrar, Straus and Giroux, 2011\.
- Kang et al\. \[2025\] Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai\. Memory OS of AI agent\. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, 2025\.
- Karpukhin et al\. \[2020\] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen\-tau Yih\. Dense passage retrieval for open\-domain question answering\. In *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pages 6769–6781, 2020\.
- Lewis et al\. \[2020\] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen\-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela\. Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. In *Advances in Neural Information Processing Systems \(NeurIPS\)*, 2020\.
- Liu et al\. \[2024\] Nelson F\. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang\. Lost in the middle: How language models use long contexts\. *Transactions of the Association for Computational Linguistics \(TACL\)*, 12:157–173, 2024\.
- Maharana et al\. \[2024\] Adyasha Maharana, Dong\-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang\. Evaluating very long\-term conversational memory of LLM agents\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\)*, 2024\.
- Packer et al\. \[2023\] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G\. Patil, Ion Stoica, and Joseph E\. Gonzalez\. MemGPT: Towards LLMs as operating systems\. *arXiv preprint arXiv:2310\.08560*, 2023\.
- Park et al\. \[2023\] Joon Sung Park, Joseph C\. O’Brien, Carrie J\. Cai, Meredith Ringel Morris, Percy Liang, and Michael S\. Bernstein\. Generative agents: Interactive simulacra of human behavior\. In *Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology \(UIST\)*, 2023\.
- Rasmussen et al\. \[2025\] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef\. Zep: A temporal knowledge graph architecture for agent memory\. *arXiv preprint arXiv:2501\.13956*, 2025\.
- Robertson and Zaragoza \[2009\] Stephen Robertson and Hugo Zaragoza\. The probabilistic relevance framework: BM25 and beyond\. *Foundations and Trends in Information Retrieval*, 3\(4\):333–389, 2009\.
- Wu et al\. \[2025\] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai\-Wei Chang, and Dong Yu\. LongMemEval: Benchmarking chat assistants on long\-term interactive memory\. In *International Conference on Learning Representations \(ICLR\)*, 2025\.
- Wu et al\. \[2026\] Zhaofen Wu, Hanrong Zhang, Fulin Lin, Wujiang Xu, Xinran Xu, Yankai Chen, Henry Peng Zou, Shaowen Chen, Weizhi Zhang, Xue Liu, Philip S\. Yu, and Hongwei Wang\. GAM: Hierarchical graph\-based agentic memory for LLM agents\. *arXiv preprint arXiv:2604\.12285*, 2026\.
- Xiao et al\. \[2024\] Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian\-Yun Nie\. C\-Pack: Packed resources for general Chinese embeddings\. In *Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval \(SIGIR\)*, 2024\. Introduces the BGE embedding models, incl\. bge\-small\-en\-v1\.5; arXiv:2309\.07597\.
- Xu et al\. \[2025\] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang\. A\-MEM: Agentic memory for LLM agents\. *arXiv preprint arXiv:2502\.12110*, 2025\.
## Appendix A Prompts
For full reproducibility we reproduce the exact prompts the harness uses, verbatim from the repository \(non\-ASCII characters are normalised for typesetting\)\. The *answerer* system prompt \(used with \-\-reasoning; it instructs evidence aggregation, most\-recent\-wins on conflicts, and abstention only as a last resort\):
You answer the user’s question using the provided dated context\. The context has two parts: a FACTS
index \(a digest\) AND the full dated CONVERSATIONS below it\. The answer is USUALLY present \-\- search BOTH
parts thoroughly before concluding anything\. The dates are real; use them for temporal reasoning\.
DIG FOR THE ANSWER FIRST\. Most questions ARE answerable from the history \-\- your job is to find the
evidence, not to give up\. If the FACTS digest doesn’t contain it, scan the full CONVERSATIONS below\.
WHEN THE QUESTION NEEDS MULTIPLE EVIDENCE PIECES \(counting, summing, ordering, date arithmetic, duration,
earliest/latest/most\-recent, multi\-step lookup, knowledge updated over time\):
EVIDENCE: list EVERY relevant dated item from the context, one per line with its date\. Be EXHAUSTIVE
\-\- scan the whole history; a missed item makes a count or total wrong\. Re\-read before finalizing\.
REASON: count distinct items / sum / sort by date / compute the date difference / pick the most recent\.
Show the work in 1\-2 lines\. For durations compute end\_date \- start\_date explicitly\.
ANSWER: <a single concise final answer \-\- no reasoning on this line\>
FOR DATE/TIME/NUMBER ANSWERS: give the MOST SPECIFIC value the context supports \(exact date \> month\+year
\> year; exact duration\)\. Read the exact figure from the text \-\- don’t round \(e\.g\. 27 minutes 45 seconds,
not 28 minutes\)\.
WHEN THE QUESTION ASKS FOR A RECOMMENDATION/PREFERENCE: do NOT refuse, do NOT ask a clarifying question,
do NOT give generic advice\. Ground the answer in the user’s OWN stated history: name the SPECIFIC people,
places, brands, tools, experiences or constraints they mentioned, and tailor the suggestion to those\.
FOR SIMPLE SINGLE\-FACT QUESTIONS: go straight to ’ANSWER: <fact\>’\.
CURRENT\-STATE questions \(’what is my current/latest X’, ’where do I work now’\): report ONLY the most
recent value by date and explicitly disregard older, superseded ones\. The newest dated statement wins\.
When facts conflict, ALWAYS trust the most recent one by date\.
ABSTENTION \-\- a careful LAST resort, only after searching the ENTIRE history \(facts AND all conversations\)
and the SPECIFIC thing asked is genuinely never stated\. Do NOT abstain just because it wasn’t in the FACTS
digest\. If the question presupposes something the user truly never mentioned, reply exactly ’ANSWER: I
don’t know’ rather than guessing\. Never fabricate a value\. The ’ANSWER:’ line is REQUIRED\.
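The answerer prompt above requires a final ’ANSWER:’ line, optionally preceded by EVIDENCE and REASON lines\. A minimal sketch of how a caller might pull out that final line before grading follows\. Whether the harness post\-processes completions this way is an assumption, and extract\_final\_answer is a hypothetical name\.

```python
def extract_final_answer(completion: str) -> str:
    """Return the text after the last 'ANSWER:' marker.

    Hypothetical helper: falls back to the whole stripped completion when the
    model omitted the marker, so nothing is silently dropped.
    """
    answer = None
    for line in completion.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("ANSWER:"):
            # Keep overwriting so the last ANSWER line wins.
            answer = stripped[len("ANSWER:"):].strip()
    return answer if answer is not None else completion.strip()


# Toy completion in the EVIDENCE / REASON / ANSWER layout described above.
sample = (
    "EVIDENCE: 2023-05-01 ran a 5k in 27 minutes 45 seconds\n"
    "REASON: single dated item, read the exact figure\n"
    "ANSWER: 27 minutes 45 seconds"
)
print(extract_final_answer(sample))
```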
The answer template \(the context is the only thing that varies across systems\):
Today’s date: \{qdate\}
\{context\}
Question: \{question\}
Answer:
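To make concrete the point that only the context varies across systems, the sketch below fills the same template for two systems\. The line breaks follow the listing above; build\_prompt and the sample strings are hypothetical and not taken from the repository\.

```python
# Template text as listed above; exact whitespace in the repository may differ.
ANSWER_TEMPLATE = "Today’s date: {qdate}\n{context}\nQuestion: {question}\nAnswer:"


def build_prompt(qdate: str, context: str, question: str) -> str:
    """Fill the shared answer template (hypothetical helper)."""
    # str.format only interprets braces in the template itself, so a context
    # that contains literal braces (e.g. JSON) is inserted unchanged.
    return ANSWER_TEMPLATE.format(qdate=qdate, context=context, question=question)


question = "Where do I work now?"
lean_prompt = build_prompt("2023/05/30", "FACTS: ...\nCONVERSATIONS: ...", question)
full_prompt = build_prompt("2023/05/30", "<entire dated history>", question)

# Both prompts share the same first and last lines; only the middle differs.
print(lean_prompt.splitlines()[0] == full_prompt.splitlines()[0])
print(lean_prompt.splitlines()[-2:] == full_prompt.splitlines()[-2:])
```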
The System\-2 fact\-extractor system prompt \(a non\-English worked example is elided for typesetting; values are kept in the user’s original language, only the predicate is English snake\_case\):
You are a precise information\-extraction engine for a long\-term memory system\. From a multi\-turn
conversation, extract the atomic, durable facts it states about the user and the people/things they
mention \(identities, attributes, preferences, relationships, possessions, goals/plans, and events with
their times\)\. Output ONLY a JSON array of objects, each with keys "subject", "predicate", "object",
"text"\. Use short snake\_case predicates \(e\.g\. works\_at, lives\_in, favorite\_color, owns, married\_to,
born\_in, visited\)\. Capture PREFERENCES explicitly with predicates like likes, dislikes, prefers, avoids,
allergic\_to, favorite\_<thing\>; for each preference or dislike stated, output a SEPARATE fact\. Resolve
first\-person \(’I’, ’my’, ’me’\) to the user’s name when known, otherwise to "user"\. Capture a stated name as
\{"subject":"user","predicate":"name","object":"<Name\>"\}\. LANGUAGE: keep subject and object VALUES \(names,
places, brands, free text\) in the SAME language the user used \-\- do NOT translate them; only the predicate
stays English snake\_case\. "text" is a natural one\-sentence statement in the conversation’s language\. Do
NOT infer or invent facts that are not stated\. If there are no durable facts, output \[\]\.
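The extractor prompt above fixes an output contract: a JSON array of objects with the keys "subject", "predicate", "object", and "text", and snake\_case predicates\. A minimal validation sketch under that contract follows\. The function name, the regular expression, and the choice to drop malformed items rather than repair them are assumptions, not the repository’s behaviour\.

```python
import json
import re

REQUIRED_KEYS = ("subject", "predicate", "object", "text")
# Assumed reading of "short snake_case": lowercase words joined by underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def parse_facts(raw: str) -> list:
    """Parse an extractor reply and keep only well-formed fact objects."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # not JSON at all: treat as "no durable facts"
    if not isinstance(data, list):
        return []
    facts = []
    for item in data:
        if not isinstance(item, dict):
            continue
        # Every required key must be a non-empty string.
        if not all(isinstance(item.get(k), str) and item[k].strip() for k in REQUIRED_KEYS):
            continue
        if not SNAKE_CASE.match(item["predicate"]):
            continue
        # Values are kept verbatim apart from outer whitespace (no translation).
        facts.append({k: item[k].strip() for k in REQUIRED_KEYS})
    return facts


raw_reply = (
    '[{"subject":"user","predicate":"lives_in","object":"München",'
    '"text":"Der Nutzer lebt in München."},'
    '{"subject":"user","predicate":"Works At","object":"Acme","text":"x"},'
    '{"subject":"user","predicate":"owns"}]'
)
print(parse_facts(raw_reply))
```

In this toy reply only the first object survives: the second has a predicate that is not snake\_case, and the third lacks required keys\.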
We grade with the *official* LongMemEval category\-specific judge prompts, reproduced verbatim so scores are leaderboard\-comparable:
\[single\-session\-user / single\-session\-assistant / multi\-session\]
I will give you a question, a correct answer, and a response from a model\. Please answer yes if the
response contains the correct answer\. Otherwise, answer no\. If the response is equivalent to the correct
answer or contains all the intermediate steps to get the correct answer, you should also answer yes\. If
the response only contains a subset of the information required by the answer, answer no\.
Question: \{q\} Correct Answer: \{a\} Model Response: \{r\}
Is the model response correct? Answer yes or no only\.
\[temporal\-reasoning\] \(as above, plus:\)
\.\.\. do not penalize off\-by\-one errors for the number of days\. If the question asks for the number of
days/weeks/months and the model makes off\-by\-one errors \(e\.g\., predicting 19 days when the answer is 18\),
the model’s response is still correct\.
\[knowledge\-update\] \(as above, plus:\)
\.\.\. If the response contains some previous information along with an updated answer, the response should
be considered correct as long as the updated answer is the required answer\.
\[single\-session\-preference\]
I will give you a question, a rubric for desired personalized response, and a response\. Answer yes if the
response satisfies the desired response\. The model need not reflect all points in the rubric; the response
is correct as long as it recalls and utilizes the user’s personal information correctly\.
\[abstention / unanswerable\]
I will give you an unanswerable question, an explanation, and a response\. Answer yes if the model correctly
identifies the question as unanswerable \(it may say the information is incomplete, or give other
information but not the asked information\)\.
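Each judge prompt above asks for a yes/no verdict, so per\-category accuracy can be tallied directly from the judge replies\. The sketch below shows one way to do that\. The first\-word parsing rule and both function names are assumptions; the official grading script may read the verdict differently\.

```python
from collections import defaultdict


def is_yes(judge_reply: str) -> bool:
    """Count a reply as correct only if its first word is 'yes' (assumed rule)."""
    words = judge_reply.strip().lower().split()
    return bool(words) and words[0].strip(".,!:;\"'") == "yes"


def accuracy_by_category(records):
    """records: iterable of (question_type, judge_reply) -> {question_type: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for question_type, reply in records:
        totals[question_type] += 1
        hits[question_type] += is_yes(reply)  # bool adds as 0 or 1
    return {q: hits[q] / totals[q] for q in totals}


# Toy judge replies, not values from the committed logs.
toy_records = [
    ("temporal-reasoning", "yes"),
    ("temporal-reasoning", "No."),
    ("knowledge-update", "Yes."),
]
print(accuracy_by_category(toy_records))
```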
## Appendix B Qualitative Examples
Table [3](https://arxiv.org/html/2606.09900#A2.T3) shows representative cases drawn from the committed 500\-question logs where engram\_lean answers correctly and the full\-context baseline does not\. Two failure modes recur\. \(i\) Lost in the middle \[[12](https://arxiv.org/html/2606.09900#bib.bib12)\]: the evidence *is* present in full\-context’s ∼79k\-token window, yet it returns “I don’t know,” while the lean ∼9\.6k\-token slice surfaces it\. \(ii\) Stale values on knowledge\-update: full\-context returns an incorrect, non\-current value \(“30 dozen,” “26 minutes and 30 seconds”\) while the lean bi\-temporal slice, which applies most\-recent\-wins, returns the current one\. These are illustrative, not cherry\-picked headline numbers; all 500 per\-question records \(prediction, gold, correctness\) are in the repository\.
Table 3: Representative cases from the LongMemEvalS logs where engram\_lean is correct and full\-context is wrong \(ID = the benchmark’s question id; predictions verbatim, lightly truncated\)\.