User as Engram: Internalizing Per-User Memory as Local Parametric Edits

arXiv cs.AI Papers

Summary

Proposes User as Engram, a method to store per-user memory as sparse local parametric edits in a hash-keyed memory table, inspired by hippocampal engrams, achieving better reasoning accuracy and memory efficiency compared to per-user LoRA.

arXiv:2606.19172v1 Announce Type: new

Cached at: 06/18/26, 05:42 AM

# User as Engram: Internalizing Per-User Memory as Local Parametric Edits
Source: [https://arxiv.org/html/2606.19172](https://arxiv.org/html/2606.19172)
###### Abstract

Personal memory in a language model is two problems, not one: *content* (the specific facts about a user) and *reasoning skill* (the ability to turn those facts into answers). The brain keeps the two apart (a sparse, local *engram* in the hippocampus for each episode, a slow neocortex for the shared skills that interpret it), so a new fact need not overwrite everything else. Most personalization today keeps a user’s facts *outside* the weights, in a natural-language memory file or a retrieval index. When facts are written *into* the model instead, the standard recipe is the per-user LoRA adapter, which does the opposite of the brain, folding content and skill into one global weight delta. Writing a user’s facts as a LoRA contaminates text unrelated to them; writing the same facts as local *Engram rows* leaves it mathematically untouched, a roughly 33,000× smaller footprint on unrelated text (measured as extra loss; Section 3).

We therefore propose User as Engram: store a user’s content as surgical edits to the hash-keyed memory table of an Engram model, and carry the reasoning skill in one *shared* adapter. This layered design matches per-user LoRA’s direct recall while delivering 5.6× higher indirect-reasoning accuracy on average, and never makes a single user worse at reasoning than the untouched base. The edit is a glass box: writing a fact switches on its lookup at exactly the trigger, adds the value the answer needs, leaves every other position unchanged to the last bit, and fails if written into the wrong layer. Because different users’ facts land in disjoint hash slots, their edits *compose*: many users live in one shared table at once, stacking additively and losslessly, where a per-user LoRA, a single global weight delta, admits only one. Compared with retrieval, a per-user Engram table does not grow with the population the retriever must search, so past ∼100 facts it overtakes a retrieval pipeline on a 2.5× larger model.

## 1 Introduction

Maya has talked to her AI assistant for the better part of a year. Over those months it has learned that her cardiologist is Dr. Elena Vasquez, that Vasquez practices at Lakeside Cardiology, that Maya is severely allergic to penicillin, and that she went vegetarian in 2024. A useful assistant must do three things with this *personal memory*: recall a fact on demand (*“who is my cardiologist?”*), reason over it when the question is indirect (*“I’m visiting my daughter in California next month, where should I go for a heart check-up?”*), and never let one user’s facts leak into another’s. Serving millions of such users from a single transformer (Vaswani et al., [2017](https://arxiv.org/html/2606.19172#bib.bib7)) backbone is an open architectural problem.

Human memory already solves a version of it. The brain does not record a new fact by nudging every synapse a little; it writes the episode as a sparse, local trace (an *engram*) in the hippocampus, while the slow, distributed neocortex supplies the general skills that interpret it (Semon, [1921](https://arxiv.org/html/2606.19172#bib.bib2); Tonegawa et al., [2015](https://arxiv.org/html/2606.19172#bib.bib4); McClelland et al., [1995](https://arxiv.org/html/2606.19172#bib.bib3)). Keeping the fast, local store apart from the slow, shared one is exactly what lets us learn that Maya is allergic to penicillin without disturbing how we reason about allergies in general.

Personal memory in a language model has the same two-part structure, *content* (the user’s specific facts) and *reasoning skill* (turning facts into answers), and the two pull in opposite directions. Content is private and differs from user to user, so it wants its own local store; the reasoning skill is common to everyone, so it should be learned once and shared (Figure [1](https://arxiv.org/html/2606.19172#S1.F1)).

![Refer to caption](https://arxiv.org/html/2606.19172v1/x1.png)

Figure 1: User as Engram splits personal memory into two layers, mirroring the brain’s complementary learning systems. Top (content). Each user’s facts are written as a few local row overrides (colored) in the Engram model’s one hash-keyed memory table; the table’s other rows hold the general knowledge learned at pretraining (gray). A write touches only that user’s rows ($\Delta$bpb $= +0.0001$ on all other text), and different users hash to different addresses, so they never overlap: the fast, sparse, local trace of a hippocampal engram. Bottom (reasoning skill). A single shared LoRA on the frozen Mini-Engram backbone, trained once across other users and amortized over everyone: the slow, shared neocortex. The only per-user state is the handful of colored rows; everything below is shared.

Three families of methods address personal memory. In deployed systems the two that dominate are both non-parametric: in-context learning (ICL, Brown et al. [2020](https://arxiv.org/html/2606.19172#bib.bib8); Min et al. [2022](https://arxiv.org/html/2606.19172#bib.bib10)), most often an automatically-extracted natural-language memory file, keeps the facts in the prompt, and retrieval-augmented generation (RAG, Lewis et al. [2020](https://arxiv.org/html/2606.19172#bib.bib11); Gao et al. [2023](https://arxiv.org/html/2606.19172#bib.bib12)) fetches them at query time. Both leave the model’s weights untouched, and we compare against retrieval head-to-head throughout (Sections [6.4](https://arxiv.org/html/2606.19172#S6.SS4)–[6.5](https://arxiv.org/html/2606.19172#S6.SS5)). The third family writes the facts *into* the weights, and only this one conflates content and reasoning skill, storing both in the *same* parameters.

Among in-weights methods the standard recipe is a per-user LoRA (Hu et al., [2022](https://arxiv.org/html/2606.19172#bib.bib23)) trained on each user’s facts (Section [2](https://arxiv.org/html/2606.19172#S2)). A LoRA has no address: to make one fact more likely, gradient descent bends the $Q/K/V/O$ directions that fit it most cheaply, and those same directions fire on unrelated text, so the edit is global by construction, not a side-effect of training. A global delta is also a single-tenant one: two users’ LoRAs cannot share a model without merging into a combined delta that interferes, so each user needs their own copy of the weights.

#### Our approach: content in a memory table, skill in a shared adapter\.

User as Engram writes each of Maya’s facts as a surgical edit to the hash-keyed memory table of an Engram-pretrained model (Cheng et al., [2026](https://arxiv.org/html/2606.19172#bib.bib6)), whose gated lookups fire only on suffix N-grams via deterministic hashing; the reasoning skill is one *shared* LoRA trained once across a held-out population, so the only per-user state is the memory rows, swapped in per request. The table is *content-addressable* and inert off its key, so writing a fact changes the output only where its trigger N-gram is read (Section [4.3](https://arxiv.org/html/2606.19172#S4.SS3), Figure [7](https://arxiv.org/html/2606.19172#S4.F7)). That is why the table, not the weights, is the right home for a user’s facts: writes are isolated, the suffix-N-gram index matches how personal facts are queried (by their surface form), and each row is a fixed ∼88 KB whose address swaps in per request, so storage grows with users, not model size. Isolation buys composability for free: because different users occupy disjoint addresses, their override maps commute and stack additively into one shared table, putting many users (or a corporate store layered under a personal one) in the same model at once (Figure [1](https://arxiv.org/html/2606.19172#S1.F1), Section [5.3](https://arxiv.org/html/2606.19172#S5.SS3)).

#### Organization of the paper\.

The rest of the paper develops four core claims:

1. The cost of storing facts in shared weights (Section [3](https://arxiv.org/html/2606.19172#S3)). On a controlled base, writing a fact as a LoRA rather than as an Engram row is a ∼33,000× difference in extra loss on unrelated text (3-seed mean), a property of how a LoRA stores a fact, not of how we tuned it. The extra loss is always there; the visible damage (worse reasoning) appears only on weak bases, and vanishes on a strong instruction-tuned base that answers the question on its own. We turn that damage on and off by changing base strength.
2. User as Engram, with the mechanism opened up (Section [4](https://arxiv.org/html/2606.19172#S4)): a way to write a user’s facts as small, local edits to an Engram model’s memory table. Unlike a LoRA, whose change is spread invisibly across the whole model, every step of an Engram write can be watched on the trained model: it switches on the memory lookup at exactly its trigger, adds the value the answer needs and nothing else (every other position is unchanged to the last bit), and stops working if written into the wrong layer (Figures [7](https://arxiv.org/html/2606.19172#S4.F7) and [8](https://arxiv.org/html/2606.19172#S4.F8)). We release four Mini-Engram checkpoints (178 M–1.22 B; a nano-scale public replication of Engram), all code and data, and a 50-line multi-tenant server with zero cross-user leakage.
3. A layered design (Section [6](https://arxiv.org/html/2606.19172#S6)) that pairs per-user Engram content with one shared reasoning adapter (an artificial complementary learning system) and beats every all-in-one baseline on direct recall, indirect reasoning (5.6× over per-user LoRA on average), contamination, and per-user reasoning regressions. The lead survives a personal → medical schema shift.
4. A deployment-scale comparison with retrieval (Sections [6.4](https://arxiv.org/html/2606.19172#S6.SS4)–[6.5](https://arxiv.org/html/2606.19172#S6.SS5)): which method wins is set by facts-per-user and population size, not benchmark size. Because a per-user Engram table does not grow with the population, it overtakes a strong retrieval pipeline (running on a 2.5× larger model) once the knowledge base passes ∼100 facts.

## 2 Background

#### The Engram architecture\.

Cheng et al. ([2026](https://arxiv.org/html/2606.19172#bib.bib6)), part of the DeepSeek line of sparse models (Dai et al., [2024](https://arxiv.org/html/2606.19172#bib.bib14); DeepSeek-AI, [2024](https://arxiv.org/html/2606.19172#bib.bib55)), introduce *conditional memory* as a form of sparsity that complements mixture-of-experts (MoE, Shazeer et al. [2017](https://arxiv.org/html/2606.19172#bib.bib13); Dai et al. [2024](https://arxiv.org/html/2606.19172#bib.bib14)). At each token position $t$, suffix N-grams of canonicalized tokens are hashed via $K$ multiplicative-XOR heads into prime-sized embedding tables $E_{n,k}$. Retrieved rows $e_t$ are projected by learned $W_K, W_V$ and gated by an attention-style scalar $\alpha_t = \sigma\left(\mathrm{RMS}(h_t)^{\top}\,\mathrm{RMS}(W_K e_t)/\sqrt{d}\right)$. The output $\alpha_t W_V e_t$ is added to the residual stream at the selected Engram layer. Crucially, the address $z_{t,n,k}$ is deterministic from the input token IDs (known before the forward pass), so the table can be offloaded to host DRAM without GPU contention.
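The lookup described above can be sketched in a few lines. This is a minimal illustration, not the released Engram code: the mixing function in `ngram_address`, the multiplier constants, the toy table sizes, and the choice to concatenate the $K$ retrieved rows into $e_t$ are all assumptions made for the example. Only the gate formula follows the text.

```python
import numpy as np

MASK64 = (1 << 64) - 1  # keep the running hash within 64 bits


def ngram_address(token_ids, multiplier, table_size):
    """Map one suffix N-gram to a row index (illustrative multiply-then-XOR mix)."""
    h = 0
    for tok in token_ids:
        h = ((h * multiplier) & MASK64) ^ tok
    return h % table_size


def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x) + eps)


def engram_read(h_t, suffix, tables, multipliers, W_K, W_V):
    """Return (alpha_t, alpha_t * W_V e_t) for a single position."""
    rows = [
        tables[k][ngram_address(suffix, m, tables[k].shape[0])]
        for k, m in enumerate(multipliers)
    ]
    e_t = np.concatenate(rows)  # assumption: heads are concatenated
    d = h_t.shape[0]
    score = rms_norm(h_t) @ rms_norm(W_K @ e_t) / np.sqrt(d)
    alpha = 1.0 / (1.0 + np.exp(-score))  # sigmoid gate
    return alpha, alpha * (W_V @ e_t)


# Toy configuration (hypothetical sizes; 101 and 103 are small primes).
rng = np.random.default_rng(0)
d, row_dim = 16, 4
table_sizes = [101, 103]
multipliers = [0x9E3779B97F4A7C15, 0xC2B2AE3D27D4EB4F]  # arbitrary odd constants
tables = [rng.normal(size=(p, row_dim)) for p in table_sizes]
W_K = rng.normal(size=(d, row_dim * len(table_sizes)))
W_V = rng.normal(size=(d, row_dim * len(table_sizes)))

alpha, out = engram_read(rng.normal(size=d), [5, 17, 42], tables, multipliers, W_K, W_V)
```

Because `ngram_address` depends only on token IDs, the rows a prompt will read can be listed before any forward pass, which is the property that makes prefetching from host memory possible.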

#### LoRA\-as\-memory family\.

Per-user/per-document LoRA adapters (Su et al., [2025](https://arxiv.org/html/2606.19172#bib.bib25); Tan et al., [2024](https://arxiv.org/html/2606.19172#bib.bib28); Zhuang et al., [2024](https://arxiv.org/html/2606.19172#bib.bib29); Charakorn et al., [2025](https://arxiv.org/html/2606.19172#bib.bib31); Tan et al., [2024b](https://arxiv.org/html/2606.19172#bib.bib32); Bini et al., [2025](https://arxiv.org/html/2606.19172#bib.bib30)) train a low-rank delta (Hu et al., [2022](https://arxiv.org/html/2606.19172#bib.bib23); Houlsby et al., [2019](https://arxiv.org/html/2606.19172#bib.bib24); Mangrulkar et al., [2022](https://arxiv.org/html/2606.19172#bib.bib15)) on $Q/K/V$ (and sometimes MLP) matrices on top of a frozen base. LoRA is one member of a broad parameter-efficient fine-tuning family that also includes prefix-tuning (Li & Liang, [2021](https://arxiv.org/html/2606.19172#bib.bib94)) and adaptive-rank and quantized variants (Zhang et al., [2023](https://arxiv.org/html/2606.19172#bib.bib95); Dettmers et al., [2023](https://arxiv.org/html/2606.19172#bib.bib96)). The delta is fit per user via NTP on an (observation, fact, QA) mixture, via knowledge distillation from a teacher (DyPRAG (Tan et al., [2025](https://arxiv.org/html/2606.19172#bib.bib26)), DistilledPRAG (Chen et al., [2025](https://arxiv.org/html/2606.19172#bib.bib27))), or via hypernetwork emission (T2L (Charakorn et al., [2025](https://arxiv.org/html/2606.19172#bib.bib31))). All variants *edit the model weights globally*: every forward pass sees the LoRA’s change.

#### Where User as Engram sits\.

These two, the Engram store and the per-user LoRA, anchor the landscape of personal-memory methods, which sorts along two axes (Figure [2](https://arxiv.org/html/2606.19172#S2.F2)): whether a method pays context tokens at query time, and whether storing a fact leaves the rest of the model untouched. In-context learning, retrieval, and external memory systems leave the model untouched but pay context on every query; per-user LoRA and knowledge editing pay no context but change the weights globally. User as Engram is the only one in the remaining corner (a local edit at zero context cost), and the rest of the paper earns that placement. We defer the broader literature to Section [9](https://arxiv.org/html/2606.19172#S9).

![Refer to caption](https://arxiv.org/html/2606.19172v1/x2.png)

Figure 2: Where User as Engram sits among personal-memory methods. Context-based methods (in-context learning, retrieval, memory systems) leave the model untouched but pay context tokens at query time (top-left); weight-based methods (per-user LoRA, knowledge editing) pay no context but edit the model globally (bottom-right). User as Engram occupies the remaining corner: a local edit at zero context cost. Axes are qualitative.

## 3 Per\-user LoRA contaminates globally

Weights are an excellent place to keep knowledge: pretraining packs millions of facts into them accurately, at roughly 2–4 bits per parameter (Allen-Zhu & Li, [2025](https://arxiv.org/html/2606.19172#bib.bib81)). But that capacity is *shared* and the facts are *incompressible*: each must be stored explicitly and ends up entangled with every other, a distinct share of the weights that does not shrink with better training (Li, [2026b](https://arxiv.org/html/2606.19172#bib.bib82)). That is fine for the facts everyone queries. It is a poor place to keep *one user’s* facts written as a *per-user weight edit*, and not for lack of capacity. A per-user LoRA edits $Q/K/V/O$ in every layer, so its edit is unconditioned (part of every forward pass) and must be fit per user from a handful of examples. Two problems follow, and we measure both: the edit is global, so it contaminates text that has nothing to do with the user; and the facts, once trained in, are *recalled but not readily reasoned over*. The first is an architectural cost paid on every base; the second matters most for personalization, and is why a fact belongs in a store the shared reasoning can read rather than in a per-user copy of the weights.

### 3\.1 Architectural contamination on the same base \(Mini\-Engram\-d20\)

We hold the base fixed \(Mini\-Engram\-d20@1536, 1\.22 B parameters\) and train a per\-user LoRA \(rank\-64 on Q/K/V, 1,500 steps per fact, single\-token gold\) on the synthetic user fact sets used throughout\. For each of 20 test users we measure val\_bpb on a held\-out 262 K\-token ClimbMix shard—text unrelated to the user’s facts—before and after the edit\. Per\-user Engram\-row Joint OPT is measured under the identical protocol on the same base\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x3.png)

Figure 3: Architectural contamination on held-out text, Mini-Engram-d20 (canonical seed S0, $n = 20$ users). A per-user LoRA more than triples val bpb on text *unrelated* to the user’s facts ($+1.784$; per-user range $[+0.44, +3.74]$; 20/20 users worse, since even the smallest per-user change is positive), while a per-user Engram row leaves it unchanged to four decimals ($+0.00005$; 0/20 worse): ∼34,000× less, by design. The 3-seed mean ratio is ∼33,000× (Appendix [N](https://arxiv.org/html/2606.19172#A14)).

Figure [3](https://arxiv.org/html/2606.19172#S3.F3) shows the gap; the mechanism is the point. A LoRA cannot write a fact without bending a function every input shares: to lift the probability of Maya’s cardiologist it moves $Q/K/V/O$ along directions that also fire on unrelated text, so the loss on held-out text rises (more than *tripling* here), and it rises a little further with each fact added, because every new fact bends the function again. An Engram row is written to an address instead: it is read only when its trigger N-gram is hashed, and leaves the loss on everything else unchanged to four decimal places. This is the difference between editing a function and writing to a store, and it is precisely what makes co-locating many users feasible in one and hopeless in the other: contamination from a shared delta accumulates and cannot be quarantined per user, whereas addressed writes have zero cross-talk by construction (Section [7](https://arxiv.org/html/2606.19172#S7)).
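The contamination number is a before/after difference in bits per byte on the same held-out text. The paper does not spell out its val_bpb routine in this section, so the sketch below uses the usual definition (summed token negative log-likelihood in nats, converted to bits and divided by the byte length of the text); the function names and inputs are hypothetical.

```python
import math


def bits_per_byte(token_nll_nats, num_bytes):
    """Total negative log-likelihood (nats) converted to bits, per byte of text."""
    return sum(token_nll_nats) / (math.log(2) * num_bytes)


def contamination(nll_before, nll_after, num_bytes):
    """Change in bpb on unrelated text caused by an edit (positive means worse)."""
    return bits_per_byte(nll_after, num_bytes) - bits_per_byte(nll_before, num_bytes)


# Toy check: 8 tokens, each costing exactly one bit, spread over 8 bytes.
one_bit = [math.log(2)] * 8
baseline = bits_per_byte(one_bit, 8)
no_change = contamination(one_bit, one_bit, 8)
```

Normalizing by bytes rather than tokens keeps the metric comparable across tokenizers, and evaluating the same shard before and after the edit isolates the effect of the write itself.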

### 3\.2 Recall without reasoning

A per-user LoRA learns to *recall* its facts (direct recall is near-perfect), but the harder half of personal memory is *reasoning over* them, and that is where it struggles. The reason is composability. Knowledge acquired in pretraining is laid down in a form the model’s reasoning can chain; a per-user adapter fit to a handful of examples instead memorizes a flat trigger → answer mapping that the shared reasoning does not readily pick up and combine. To answer an indirect question the adapter would have to carry the reasoning skill as well, learned from the same few examples, which is hard to train and does not transfer to facts it never saw. This recall-without-reasoning gap is the recurring finding of the per-user-adapter line (Su et al., [2025](https://arxiv.org/html/2606.19172#bib.bib25); Tan et al., [2024](https://arxiv.org/html/2606.19172#bib.bib28); Zhuang et al., [2024](https://arxiv.org/html/2606.19172#bib.bib29)): the facts go in, but indirect questions over them do not reliably come out.

What a user sees in addition depends on the base. We replicate the per-user LoRA recipe (rank-64 NTP on observation/fact/QA mixtures, 12 epochs × 200 samples) on four instruction-tuned bases, namely Qwen2.5-3B and Qwen2.5-7B (Qwen Team, [2025a](https://arxiv.org/html/2606.19172#bib.bib51)), Llama-3.1-8B (Grattafiori et al., [2024](https://arxiv.org/html/2606.19172#bib.bib53)), and Mistral-7B (Jiang et al., [2023](https://arxiv.org/html/2606.19172#bib.bib54)), and on one base LM (Figure [4](https://arxiv.org/html/2606.19172#S3.F4)):

![Refer to caption](https://arxiv.org/html/2606.19172v1/x4.png)

Figure 4: Cross-base LoRA scaling: mean change in indirect recall (adapter − base) with the fraction of users it hurts. On the base LM (Mini-Engram-d20) a per-user LoRA disrupts fragile completion behavior (85% worse, $\Delta = -0.133$); on every instruction-tuned base the reasoning skill absorbs the perturbation (0–20% worse, $\Delta$ from $+0.088$ to $+0.217$).

The split is clean (Figure [4](https://arxiv.org/html/2606.19172#S3.F4)). On a base LM the global perturbation also disrupts fragile completion behavior, so indirect recall drops outright; on an instruction-tuned base the base’s own reasoning carries the indirect question and masks the adapter’s inability to provide it. The strength of the base thus controls whether the damage is visible at all: we can make it appear or vanish by changing how strong the base is (Appendix [O.9](https://arxiv.org/html/2606.19172#A15.SS9)). The reasoning has to come from somewhere, and Section [6](https://arxiv.org/html/2606.19172#S6) supplies it from one shared skill instead of asking every per-user edit to learn it again.

## 4 Method: User as Engram

![Refer to caption](https://arxiv.org/html/2606.19172v1/x5.png)

Figure 5: Where the Engram lives in the transformer. The backbone is a standard transformer; at a few designated Engram layers ($\bigstar$) a content-addressable lookup runs alongside attention and the MLP. At a position, the recent tokens’ suffix $N$-gram is hashed by $K$ heads into $K$ table addresses (deterministic, hence known before the forward pass); the retrieved rows are projected by $W_K, W_V$, gated by $\alpha$, and added to the residual stream, but the gate fires only where the trigger $N$-gram is present. Unlike attention and the MLP, which touch every position and every weight, the Engram read is sparse and addressable; that is what lets a single written row change behavior locally.

We instantiate the “local per-user edit” concretely: surgically write per-user fact rows into the hash-keyed memory table of an Engram-pretrained model. Figure [5](https://arxiv.org/html/2606.19172#S4.F5) shows where that table sits in the network; the subsections below walk through how a fact is written into it (Figure [6](https://arxiv.org/html/2606.19172#S4.F6)) and how the writes are served per user (Figure [9](https://arxiv.org/html/2606.19172#S4.F9)).

### 4\.1 Where the Engram lives

Most of an Engram model is an ordinary transformer (Vaswani et al., [2017](https://arxiv.org/html/2606.19172#bib.bib7)); the only addition is a content-addressable lookup spliced into a few layers (Figure [5](https://arxiv.org/html/2606.19172#S4.F5); two layers in our Mini-Engrams). At every position, the recent tokens (a suffix $N$-gram) are hashed by $K$ deterministic multiplicative-XOR heads into addresses in a large embedding table $E$. The retrieved rows are projected by $W_K, W_V$, weighted by an attention-style gate $\alpha = \sigma(\cdot)$, and *added to the residual stream*, exactly where attention and the MLP also write. The lookup is thus a read from memory, but keyed by the surface form of the recent tokens rather than by query–key similarity across the sequence.

Two properties make this a genuine store rather than just more parameters, and both are what we exploit. First, the address is a deterministic function of the input token IDs, known *before* the forward pass, so we know exactly which rows a given query will read, and the table can be offloaded to host DRAM. Second, the gate makes the read sparse and conditional: most positions retrieve essentially nothing, and a stored row affects the output only at positions whose $N$-gram addresses it. This is the addressable, sparse store that per-user facts want, the opposite of the global weight edit of Section [3](https://arxiv.org/html/2606.19172#S3). Attention and the MLP are dense and global (every position, every weight), whereas the Engram read touches a handful of rows at a handful of positions, so editing one row changes the model only where that row is addressed. The rest of this section turns that into a per-user memory.

### 4\.2 Surgical row insertion

![Refer to caption](https://arxiv.org/html/2606.19172v1/x6.png)

Figure 6: Inserting a user fact into the Engram. A fact is reduced to *where* and *what*: the trigger’s suffix $N$-gram hashes to a sparse set of row addresses $R_f$ (*where*), and one of three strategies writes a value $e^{\star}$ into those rows (*what*). UNEMBED_P solves for $e^{\star}$ in closed form; OPT takes a few gradient steps per fact; Joint OPT (our default beyond ∼30 facts/user) optimizes all of the user’s rows together so they do not interfere. The result is a per-user override map of ∼88 KB; every row outside $R_f$, and the entire backbone, is left bit-identical.

To write a fact, we put its answer where its trigger is read. The trigger’s suffix $N$-gram hashes to a fixed, sparse set of table rows $R_f$ (about 16 rows for our $N = 3$, $K = 8$ configuration), deterministically, touching no other row. What remains is *what* to write into those rows, for which we use three strategies of increasing cost and fidelity (per-strategy recall against a RANDOM control in Appendix [D](https://arxiv.org/html/2606.19172#A4)):

- UNEMBED_P (closed form): solve in one step for the row value whose projection steers the next-token distribution toward the answer: a single matrix–vector product, $< 1$ ms, no training.
- OPT: refine that value with a few gradient steps per fact (∼1 s), trading time for accuracy.
- Joint OPT (our default beyond ∼30 facts/user): optimize *all* of a user’s rows together rather than one fact at a time (full algorithm in Appendix [M](https://arxiv.org/html/2606.19172#A13); convergence behavior in Appendix [L](https://arxiv.org/html/2606.19172#A12)).
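As a rough picture of what "a few gradient steps per fact" means, the toy below optimizes a single row value so that the gold token becomes more likely. It is a sketch under strong simplifying assumptions that are not taken from the paper: one position, a fixed gate `alpha`, the written value feeding the unembedding `U` directly, a zero initial value rather than the closed-form one, and made-up dimensions. The actual OPT and Joint OPT procedures are given in the paper's appendices.

```python
import numpy as np


def gold_loss(e, h, U, W_V, alpha, gold):
    """Cross-entropy of the gold token when the row value e is read with gate alpha."""
    logits = U @ (h + alpha * (W_V @ e))
    logits = logits - logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[gold]


def opt_row(h, U, W_V, alpha, gold, steps=300, lr=0.05):
    """Plain gradient descent on the row value (toy stand-in for per-fact OPT)."""
    e = np.zeros(W_V.shape[1])
    for _ in range(steps):
        logits = U @ (h + alpha * (W_V @ e))
        p = np.exp(logits - logits.max())
        p /= p.sum()
        p[gold] -= 1.0  # gradient of the loss with respect to the logits
        e -= lr * alpha * (W_V.T @ (U.T @ p))
    return e


# Hypothetical toy sizes: vocabulary 50, model width 16, row width 8.
rng = np.random.default_rng(0)
V, d, r = 50, 16, 8
U = rng.normal(size=(V, d)) / np.sqrt(d)
W_V = rng.normal(size=(d, r)) / np.sqrt(r)
h = rng.normal(size=d)
alpha, gold = 0.9, 7

loss_before = gold_loss(np.zeros(r), h, U, W_V, alpha, gold)
e_star = opt_row(h, U, W_V, alpha, gold)
loss_after = gold_loss(e_star, h, U, W_V, alpha, gold)
```

Only `e` is updated; `U`, `W_V`, and `h` stay frozen, mirroring the idea that a write changes table rows and nothing in the backbone.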

Joint OPT is the default because of the one place this design has to earn its keep. When many of a user’s facts are live at once, their rows are read in the same forward passes and interfere; optimizing them jointly lets the rows settle into values that coexist, which is what holds recall up as facts-per-user grows (Section [5](https://arxiv.org/html/2606.19172#S5)). Crucially, every write still lands only in that user’s $R_f$, so however many facts or domains are stacked, the store stays additive and private: the locality is the architecture’s, not a constraint we have to impose.

Read together, the two figures give the method in one sentence: remembering a fact is writing a few rows. A fact decomposes into an address set $R_f$ (*where*, fixed by the trigger’s hash) and a value $e^{\star}$ (*what*, set by one of the strategies above), and the write touches nothing else. The locality we rely on is therefore *inherited from the architecture* (the read was already sparse and addressed; Section [4.1](https://arxiv.org/html/2606.19172#S4.SS1)) rather than enforced by a separate mechanism. That is why per-user and per-domain writes compose without a combiner and never leak across users (Section [5](https://arxiv.org/html/2606.19172#S5)), and Section [4.3](https://arxiv.org/html/2606.19172#S4.SS3) verifies that “touches nothing else” is literally exact.
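One way to picture "compose without a combiner" is to treat each user's writes as a map from row address to row value. The sketch below is an assumption-laden illustration, not the paper's server code: the dictionary representation, the replace-the-row semantics, the helper names, and the second user "sam" are all hypothetical. What it shows is that maps with disjoint addresses give the same table in either order and leave every other row untouched.

```python
import numpy as np


def compose(*override_maps):
    """Merge per-user override maps; disjoint addresses make the order irrelevant."""
    merged = {}
    for overrides in override_maps:
        clash = merged.keys() & overrides.keys()
        if clash:
            raise ValueError(f"address collision: {sorted(clash)}")
        merged.update(overrides)
    return merged


def apply_overrides(table, overrides):
    """Return a patched copy of the table; rows without an override are unchanged."""
    patched = table.copy()
    for addr, row in overrides.items():
        patched[addr] = row
    return patched


base = np.zeros((32, 4))  # stand-in for the shared pretrained table
maya = {3: np.full(4, 1.0), 17: np.full(4, 2.0)}
sam = {8: np.full(4, 3.0)}

maya_then_sam = apply_overrides(base, compose(maya, sam))
sam_then_maya = apply_overrides(base, compose(sam, maya))
```

Raising on a collision, rather than silently overwriting, is a design choice for the sketch: it makes the disjointness assumption explicit instead of relying on it implicitly.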

### 4\.3 The local edit is a glass box

![Refer to caption](https://arxiv.org/html/2606.19172v1/x7.png)

Figure 7: Addressed write vs. global function-bend. Per-layer, per-position residual-stream change $\|x^{(\ell)}_{\text{after}} - x^{(\ell)}_{\text{before}}\|$ on the *same* trigger sentence and base (Mini-Engram-d12@1280, log color scale). (a) An Engram row insertion is *exactly* 0.000 at every position before the Engram layer (causality) and at every non-trigger position after it; only the trigger column moves. (b) A per-user LoRA fit to the *same single fact* moves the residual at every position and every layer, and perturbs unrelated text (“The capital of France is”) by mean 107. Storing a fact in a shared function changes the function everywhere; storing it at an address does not.

A LoRA’s change is spread invisibly across the whole model, so you cannot point to what it did. An Engram write is the opposite: you can watch every step of it on the trained model. We follow one inserted fact through the write and check each step on the trained Mini-Engrams (full detail, with the depth and multi-hop probes, in Appendix [B](https://arxiv.org/html/2606.19172#A2)). Three things are worth seeing, and Figures [7](https://arxiv.org/html/2606.19172#S4.F7) and [8](https://arxiv.org/html/2606.19172#S4.F8) show each.

#### \(1\) The write switches on its own lookup, and adds just the value it should\.

A row reaches the output only through a gated value, $\alpha_t\,W_V e$: a switch $\alpha_t$ that decides whether the lookup is read, times the value $W_V e$ the row carries. Writing a fact does both at once. The switch at the trigger turns from nearly off to nearly on, while it stays off everywhere else (Figure [8](https://arxiv.org/html/2606.19172#S4.F8)a): the row we write is also what turns its own lookup on. And the change the row makes to the output points almost exactly along the value it carries (Figure [8](https://arxiv.org/html/2606.19172#S4.F8)b): the switch and the small convolution around it scale the value but do not turn it into something else. The same alignment was reported for the original Engram code on a fresh, untrained model; we confirm it on the trained model, for the strategy we actually deploy.

#### \(2\) Nothing else moves—exactly nothing\.

We write only the trigger’s rows, and the trigger’s last few tokens are unique in the sentence, so every other position reads the rows it always read and its output is unchanged to the last bit. This holds for every one of the 16 test facts and both writing strategies: the largest change at any non-trigger position, at any layer, is 0 (Figure [7](https://arxiv.org/html/2606.19172#S4.F7)a). A LoRA that learns the same fact does the opposite: it moves every position in every layer, and shifts unrelated text as well (Figure [7](https://arxiv.org/html/2606.19172#S4.F7)b). This is the line between editing a function and writing to a store. Note where the locality comes from: not from the switch being off elsewhere, but from the fact that we only ever changed the rows at one address.
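The locality argument can be reproduced on a toy lookup: edit the one row a trigger addresses, then compare every position before and after. Everything here is a stand-in chosen for illustration (a single hash head with a simple polynomial hash instead of the multiplicative-XOR heads, a raw table read with no gate, projection, or backbone, and a made-up token sequence), so it demonstrates only the addressing logic, not the paper's per-layer residual measurement.

```python
import numpy as np

N, TABLE_SIZE, DIM = 3, 1009, 4


def address(ngram):
    """Stand-in polynomial hash of a suffix N-gram (not the Engram hash)."""
    a = 0
    for tok in ngram:
        a = (a * 31 + tok) % TABLE_SIZE
    return a


def read_all(tokens, table):
    """Row read at every position that has a full N-gram of context."""
    out = np.zeros((len(tokens), DIM))
    for t in range(N - 1, len(tokens)):
        out[t] = table[address(tokens[t - N + 1 : t + 1])]
    return out


rng = np.random.default_rng(0)
table = rng.normal(size=(TABLE_SIZE, DIM))
tokens = [3, 7, 1, 4, 9, 2]
trigger_pos = 4  # its suffix 3-gram is (1, 4, 9), unique in this sequence

before = read_all(tokens, table)
edited = table.copy()
edited[address(tokens[trigger_pos - N + 1 : trigger_pos + 1])] += 1.0  # the write
after = read_all(tokens, edited)

per_position_change = np.abs(after - before).max(axis=1)
max_non_trigger_change = np.delete(per_position_change, trigger_pos).max()
```

The non-trigger change is exactly zero here because those positions index rows that were never written, which is the same reason the text gives: locality comes from which rows changed, not from a gate being closed.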

![Refer to caption](https://arxiv.org/html/2606.19172v1/x8.png)
Figure 8: The mechanism on the trained model (Mini-Engram-d20, 16 facts). (a) The write opens its own gate: the trigger-position gate $\alpha$ rises from 0.02 to 0.99 (OPT) while non-trigger positions stay at 0.04. (b) The deployed row’s residual change is cosine 0.999 to its value-path projection $W_V e$ (left bar); what the row *encodes* relative to the gold token’s unembedding is exact for the closed-form solution (0.59) and drifts as gradient OPT/Joint-OPT trade direct gold-alignment for higher recall (0.16–0.18). (c) Exact locality: the maximum non-trigger residual change over all 16 facts is 0.000 for both strategies, versus an order-1 change at the trigger.

#### (3) It only works near the end of the network.

The Engram lookup is read at a late layer, after the model has mostly made up its mind about the next token\. That is why a row can change the answer cleanly\. To check this is the real cause and not a coincidence, we write the same facts into the model’s*early*lookup instead: at the same effort, recall drops from perfect to about a quarter, because an edit made early gets reworked by all the layers above it\. Where the lookup sits is not a free choice—it has to be late enough that the value it adds is close to the final answer\.

What this kind of write*cannot*do is join two facts together\. The lookup only matches the words it was given, so it answers “who is my doctor” but not “where does my doctor work” unless that second step was stored too \(Section[5\.6](https://arxiv.org/html/2606.19172#S5.SS6)\)\. That is the reason we keep the facts and the reasoning apart \(Section[6](https://arxiv.org/html/2606.19172#S6)\): the store can hold a fact but cannot chain it, so the chaining has to come from somewhere else\. The split between facts and reasoning is forced by how the mechanism works, not a convenience we chose\.

### 4\.4 Per\-user override tables and additive composition

The production design uses per-user override tables. Each user has a small dictionary $\{r_i \mapsto v_i\}$ of fact rows. At request time, the server saves the originals at the affected addresses, writes the user’s overrides, runs the forward pass, and restores the originals. Cross-user leakage is *zero by design*: user A’s overrides are not in the table when user B queries. (We also benchmarked a shared table with a per-user hash salt and rejected it for lower recall; Appendix [I](https://arxiv.org/html/2606.19172#A9).)

![Refer to caption](https://arxiv.org/html/2606.19172v1/x9.png)
Figure 9: EngramServer architecture. Per user, an override map of (row index, row vector) pairs lives in DRAM. On each request, the server saves the originals at the affected addresses (∼2 ms), writes the user’s overrides, runs the forward pass, then restores. The Engram lookup at the configured layer transparently sees the override values; the gate fires only at the trigger N-gram. There is no router and no graph rewrite.

Override maps with disjoint addresses commute, so corporate facts + user facts (or any number of domain-specific Engrams) stack additively at inference without retraining a combiner. This mirrors how Stable Diffusion LoRAs additively stack, but at the row level. Section [5](https://arxiv.org/html/2606.19172#S5) confirms that additive composition is lossless when the domains have disjoint trigger templates.
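A minimal sketch of the save, write, forward, restore cycle, and of why disjoint override maps commute. The helper `apply_overrides` and the toy table are hypothetical and are not the EngramServer code; the forward pass is left out, and a NumPy array stands in for the GPU-resident table.

```python
from contextlib import contextmanager

import numpy as np


@contextmanager
def apply_overrides(table, overrides):
    """Temporarily write {row_index: row_vector} into `table`, then restore."""
    rows = list(overrides)
    saved = table[rows].copy()  # save the originals at the affected addresses
    try:
        for r in rows:
            table[r] = overrides[r]  # write the overrides
        yield table  # the caller would run the forward pass here
    finally:
        table[rows] = saved  # restore, even if the forward pass raises


table = np.zeros((16, 4), dtype=np.float32)
pristine = table.copy()
user_rows = {2: np.full(4, 1.0, dtype=np.float32)}  # hypothetical user facts
org_rows = {7: np.full(4, 2.0, dtype=np.float32)}   # hypothetical corporate facts

with apply_overrides(table, user_rows), apply_overrides(table, org_rows):
    user_then_org = table.copy()
with apply_overrides(table, org_rows), apply_overrides(table, user_rows):
    org_then_user = table.copy()

same_either_order = np.array_equal(user_then_org, org_then_user)
restored = np.array_equal(table, pristine)
print(same_either_order, restored)
```

Because the two maps touch different rows, the order of application does not matter, and the table is back to its original contents once both contexts exit. The same code would not commute if the two maps shared an address.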

## 5 Experiments

Before composing the store with a shared reasoning skill (Section [6](https://arxiv.org/html/2606.19172#S6)), we first characterize the bare per-user store on its own (how well facts surface, what they cost, and where it breaks) so that the layered design’s gains can be attributed cleanly to the skill it adds. This section answers four questions about User-as-Engram, in order. (a) How accurately do facts surface? (Section [5.2](https://arxiv.org/html/2606.19172#S5.SS2)) For one fact and for many at once. (b) What does it cost? (Section [5.3](https://arxiv.org/html/2606.19172#S5.SS3)) Storage and composition vs. a per-user LoRA. (c) Does writing the fact beat retrieving it? (Sections [5.4](https://arxiv.org/html/2606.19172#S5.SS4)–[5.5](https://arxiv.org/html/2606.19172#S5.SS5)) Against memory-system and RAG baselines on the same answering LM, including paraphrased queries. (d) Where does it break? (Section [5.6](https://arxiv.org/html/2606.19172#S5.SS6)) Multi-hop chaining: the store matches words but does not compose them. The capacity, token-budget, and dense-size scaling ablations that justify our recipe are in Appendix [O](https://arxiv.org/html/2606.19172#A15). We use three fact corpora throughout: the base 200-fact benchmark (100 USER + 100 ORG facts), the XL corpus of 1,000 USER + 1,000 ORG templated facts (Section [O.3](https://arxiv.org/html/2606.19172#A15.SS3)), and the XXL corpus of 3,132 distinct trigger templates (high-density and distinct-template stress tests).

### 5\.1 Setup: the Mini\-Engram models we insert into

Before measuring per-user insertion we need an Engram backbone to insert into. Cheng et al. ([2026](https://arxiv.org/html/2606.19172#bib.bib6)) released only architectural reference code; no Engram-trained weights are public. We graft Engram into Karpathy’s nanochat (Karpathy, [2026](https://arxiv.org/html/2606.19172#bib.bib22)), a GPT-2-style backbone (Radford et al., [2019](https://arxiv.org/html/2606.19172#bib.bib9)) optimized with Muon+AdamW (Jordan et al., [2024](https://arxiv.org/html/2606.19172#bib.bib16)), and train four Mini-Engram models on a single Blackwell GPU at the ablation-optimal recipe (Section [O.2](https://arxiv.org/html/2606.19172#A15.SS2); pretraining curves in Appendix [C](https://arxiv.org/html/2606.19172#A3)): the same large Engram table (50 K × 256, 51.2 M params), trained at Karpathy’s 12 tokens-per-param (t/p) budget on the scaling params for every dense size.

#### Model\-name notation\.

Throughout the paper we use d{depth}@{width} for the ablation-optimal Mini-Engrams (d12@768, d12@1280, d20@1536) and v1/v2 to denote earlier and final training runs at the same shape. Where the width is omitted (e.g. d12 without a suffix), we mean the v2 (ablation-optimal at width 768, 339 M total) checkpoint. The four headline checkpoints, their param counts and token budgets are in Table [1](https://arxiv.org/html/2606.19172#S5.T1); all four share the same 51.2 M Engram table.

Table 1: Mini-Engram pretraining at the ablation-optimal recipe. All use the large Engram (50 K × 256 = 51.2 M Engram params). Trained on a single NVIDIA RTX PRO 6000 Blackwell (102 GB) in bf16.

#### We reproduce Engram’s two signature findings.

Suppressing the Engram lookup hurts factual recall far more than reading comprehension. This is the factual-vs-reading sensitivity asymmetry of Cheng et al. ([2026](https://arxiv.org/html/2606.19172#bib.bib6)) (their §6.3), at the ∼10× smaller magnitude expected for a model two orders of magnitude smaller on a corpus three orders of magnitude smaller. The Engram layers also act as effective extra depth: a LogitLens probe shows our small Engram model resolving its prediction earlier than its dense-only twin, matching their §6.1 (Appendix Figure [21](https://arxiv.org/html/2606.19172#A2.F21)).

### 5\.2 Per\-fact recall and within\-user density

![Refer to caption](https://arxiv.org/html/2606.19172v1/x10.png)
Figure 10: Within-user fact-density: top-1 (left) and top-5 (right) recall vs. number of facts inserted simultaneously into one user’s override map. Joint OPT (blue) closes most of the gap to LoRA rank-64 (red dashed) at 161× less storage (88 KB vs. 14.2 MB at 100 facts).

A single inserted fact surfaces within ∼8 points of the in-context ceiling, at three orders of magnitude lower per-user cost than training a full LoRA (Appendix [D](https://arxiv.org/html/2606.19172#A4)). The harder question is what happens when many facts are live at once: the rows in a single user’s override table interfere during the forward pass. We sweep that density curve on Mini-Engram-d12 (Figure [10](https://arxiv.org/html/2606.19172#S5.F10)).

Joint OPT is the recommended default for ≥30 facts/user: it tracks multi-fact LoRA rank-64 to within a few top-5 points while keeping top-5 recall above 90% out to ∼300 facts/user, all at up to ∼161× less per-user storage (Figure [10](https://arxiv.org/html/2606.19172#S5.F10); per-$n$ recall and storage in Appendix [O](https://arxiv.org/html/2606.19172#A15)).

#### What the density ceiling is—and what it is not\.

Three ablations triangulate the ceiling to a single cause, ruling out the obvious suspects. (i) It is *not* backbone capacity: holding Engram capacity fixed and scaling the dense backbone does not lift recall at 1,000 facts; it slightly degrades it (top-1 0.35 → 0.28 from d12@768 to d20@1536, Section [8.3](https://arxiv.org/html/2606.19172#S8.SS3)). (ii) It is *not* the table or the hash: per-fact *independent* OPT, where each fact gets private rows, shows *no* ceiling to 1,000 facts (≥ 0.98 top-1, Appendix [O.3](https://arxiv.org/html/2606.19172#A15.SS3)). (iii) It is *not* pretraining: a multi-fact-in-the-loss finetune only accelerates convergence; given enough OPT steps the baseline reaches the same asymptote (Section [O.1](https://arxiv.org/html/2606.19172#A15.SS1)). What remains is the one thing all three share: gradient interference among the co-active rows in a single user’s table, at a fixed inference-time optimization budget. This is why the path to breaking it is a recipe change (multi-fact-in-the-loss, scoped tables), not further dense scaling, and why it scales with facts-per-user rather than with the model.

### 5\.3 Cost, storage, and composition

Storage. A user’s whole table is ∼88 KB at 100 facts/user and grows about linearly with the fact count (∼0.9 KB/fact), while a per-user LoRA is a fact-count-independent 14.2 MB, which is ∼161× larger at 100 facts/user and ∼1700× at 10 facts/user. At a million users that is 100 GB versus 14.2 TB: the difference between one server and a distributed store (Appendix Figure [37](https://arxiv.org/html/2606.19172#A15.F37)). Figure [11](https://arxiv.org/html/2606.19172#S5.F11) places all the methods on one cost–quality plot: a single fact matches the in-context ceiling at a fraction of LoRA’s cost, and at high density Joint OPT keeps pace with even rank-64 LoRA at tens of times less storage, with the gap turning in Engram’s favor by 1,000 facts.
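The storage comparison is simple arithmetic, reproduced here as a sketch. The per-user sizes come from the text; decimal units (1 MB = 1000 KB) are an assumption, and the script uses the 88 KB figure directly, so its fleet total comes out on the order of the 100 GB quoted above rather than exactly equal to it.

```python
KB, MB, GB, TB = 1e3, 1e6, 1e9, 1e12  # decimal units (an assumption)

engram_bytes_per_user = 88 * KB   # at 100 facts/user, from the text
lora_bytes_per_user = 14.2 * MB   # independent of fact count, from the text
facts_per_user = 100
users = 1_000_000

ratio = lora_bytes_per_user / engram_bytes_per_user
kb_per_fact = engram_bytes_per_user / facts_per_user / KB
engram_total_gb = engram_bytes_per_user * users / GB
lora_total_tb = lora_bytes_per_user * users / TB

print(f"LoRA / Engram per-user storage: {ratio:.0f}x")
print(f"Engram cost per fact: {kb_per_fact:.2f} KB")
print(f"Engram, 1M users: {engram_total_gb:.0f} GB")
print(f"LoRA, 1M users: {lora_total_tb:.1f} TB")
```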

![Refer to caption](https://arxiv.org/html/2606.19172v1/x11.png)
Figure 11: Cost-quality trade-off across personal-memory methods at the same answering LM (Mini-Engram-d20@1536). User-as-Engram Joint OPT matches a per-user LoRA’s LOCOMO F1 at 161× smaller per-user storage (88 KB vs. 14.2 MB at 100 facts). Retrieval baselines store each fact at a higher per-fact cost than an Engram row and still cap out below Engram on quality.

Composition. Two users’ tables, or a user’s facts and a company’s, occupy different addresses, so they simply add up, with no combiner to train. The one exception is when two sets of facts share the same trigger words: their addresses collide and the last write wins. The rule, then, is to add tables when their triggers differ and to keep separate per-user tables when they do not (Appendix [F](https://arxiv.org/html/2606.19172#A6)).
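The composition rule can be written as a small merge that reports collisions. This is an illustrative sketch: `compose_tables`, the integer addresses, and the string payloads (stand-ins for row vectors) are all hypothetical.

```python
def compose_tables(*tables):
    """Merge {address: row} maps and report addresses written more than once."""
    merged, collisions = {}, set()
    for tbl in tables:
        for addr, row in tbl.items():
            if addr in merged:
                collisions.add(addr)  # same trigger address: last write wins
            merged[addr] = row
    return merged, collisions


# Disjoint triggers: the tables simply add up.
user_facts = {101: "user: favorite spice", 202: "user: doctor"}
org_facts = {303: "org: travel policy"}
merged, collisions = compose_tables(user_facts, org_facts)
print(len(merged), collisions)

# Shared trigger: both tables write address 101, and the later one overwrites.
other_user = {101: "other user: favorite spice"}
clash, clash_addrs = compose_tables(user_facts, other_user)
print(clash[101], clash_addrs)
```

A non-empty collision set is the signal to keep the tables separate per user instead of adding them.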

### 5\.4 Comparison against memory systems

![Refer to caption](https://arxiv.org/html/2606.19172v1/x12.png) (a) Retrieval wins on exact-trigger queries, but Engram pays *zero* context tokens (0 vs. 16–63) for its 68%.
![Refer to caption](https://arxiv.org/html/2606.19172v1/x13.png) (b) Engram multi-trigger insertion wins on paraphrases by 22 points, again at zero context.

Figure 12: Direct recall vs. memory systems, all sharing the same Mini-Engram-d12 answering LM (100 USER facts, XXL corpus). [12(a)](https://arxiv.org/html/2606.19172#S5.F12.sf1) When the query is the fact’s stored prefix, nearest-neighbor retrieval is near-perfect and beats Engram’s 68% top-1, but at a context-token cost Engram does not pay. [12(b)](https://arxiv.org/html/2606.19172#S5.F12.sf2) When the query is a paraphrase, retrieval drops to 60–75% while multi-trigger Engram insertion reaches 96.9% top-1.

We compare User-as-Engram against state-of-the-art memory systems that target the same use case (per-user / personal memory). All systems use the *same* Mini-Engram-d12 as the final-answer LM; the only difference is how facts are stored and retrieved. Our MEM0 and MEMMACHINE baselines follow the retrieval recipes of Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2606.19172#bib.bib60)) and MemMachine (Wang et al., [2026](https://arxiv.org/html/2606.19172#bib.bib64)). The sentence encoder used by the RAG / MEM0 / MEMMACHINE baselines is all-MiniLM-L6-v2 (Wang et al., [2020](https://arxiv.org/html/2606.19172#bib.bib57); Reimers & Gurevych, [2019](https://arxiv.org/html/2606.19172#bib.bib56)) (80 M parameters; comparable in size to a production retriever).

This subsection probes*direct*fact recall under trigger\-exact and paraphrased queries; the separate*indirect\-reasoning*comparison \(does retrieval recover the gold fact across hops, and what happens as the KB grows?\) is deferred to Sections[6\.4](https://arxiv.org/html/2606.19172#S6.SS4)–[6\.5](https://arxiv.org/html/2606.19172#S6.SS5)\.

#### Setup\.

100 USER facts (XXL corpus), asked two ways: with the fact’s exact stored prompt (easy for retrieval, since the words match), and with a paraphrase (“I love the spice” for “My favorite spice is”; 16 facts × 4 rewordings).

On exact\-trigger queries the words line up with what was stored, so nearest\-neighbor retrieval is near\-perfect and beats Engram’s 68% top\-1—but it pays 16–63 context tokens for the answer, where Engram pays none \(Figure[12](https://arxiv.org/html/2606.19172#S5.F12)\)\.

The story flips on paraphrased queries. When the query is reworded, the sentence encoder confuses surface forms and retrieval slips, while a single-trigger Engram misses the reworded N-gram entirely. The fix is to write a row at every anticipated paraphrase: multi-trigger insertion then wins on paraphrase by a clear margin at zero context cost (Figure [12](https://arxiv.org/html/2606.19172#S5.F12)), because the contract is explicit (the trigger N-gram) rather than left to an encoder’s notion of similarity.

#### Scale invariance\.

This verdict \(*retrieval wins on the verbatim prefix; multi\-trigger Engram wins under paraphrase*\) is invariant to dense scale: the Engram numbers barely move across our three sizes, and only the retrieval baselines drift as their answering\-LM improves \(Appendix[O](https://arxiv.org/html/2606.19172#A15)\)\.

Our retrieval baselines are the retrieval step itself (top-3 sentence-encoder lookup over atomic facts), not the full Mem0 or MemMachine systems with their extraction and consolidation machinery; we skip those because our facts are already structured (Maharana et al., [2024](https://arxiv.org/html/2606.19172#bib.bib19); Wu et al., [2024](https://arxiv.org/html/2606.19172#bib.bib18)).

#### Paraphrase generalization\.

The mechanism behind the paraphrase result is worth stating, because it is also a limit: the hash addresses a token-level suffix $N$-gram, so different surface forms of the same fact map to *different* rows. A fact written once therefore transfers to a paraphrase only when the paraphrase ends in the same tokens; writing the fact at every anticipated phrasing (multi-trigger insertion) is what closes the gap and drives the 96.9% vs. 75.0% win above. The contract is explicit (the trigger $N$-gram) rather than left to an encoder’s notion of similarity. Single- vs. multi-trigger generalization is detailed in Appendix [E](https://arxiv.org/html/2606.19172#A5), with per-paraphrase rank data in Appendix [K](https://arxiv.org/html/2606.19172#A11).
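The suffix-keyed addressing described above can be sketched in a few lines. Everything here is an assumption made for illustration: the SHA-256 stand-in hash, the suffix length `N = 3`, whitespace tokenization, and the stored value `"cumin"`. Only the 50 K row count echoes the paper’s table.

```python
import hashlib

NUM_SLOTS = 50_000  # row count of the paper's table; the hash is a stand-in
N = 3               # suffix length, chosen for illustration


def slot(tokens, n=N, num_slots=NUM_SLOTS):
    """Map the last n tokens of a prompt to a table row."""
    key = " ".join(tokens[-n:]).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_slots


table = {}


def write_fact(triggers, value):
    """Write the same value at the row addressed by each trigger phrasing."""
    for trigger in triggers:
        table[slot(trigger)] = value


def read(query):
    return table.get(slot(query))


stored = "My favorite spice is".split()
same_suffix = "Well , my favorite spice is".split()
paraphrase = "I love the spice".split()

write_fact([stored], "cumin")              # single-trigger insertion
hit_same_suffix = read(same_suffix)        # same last N tokens, same row

write_fact([stored, paraphrase], "cumin")  # multi-trigger insertion
hit_paraphrase = read(paraphrase)
print(hit_same_suffix, hit_paraphrase)
```

A query that ends in the stored suffix reaches the row no matter what precedes it. A reworded query ends in different tokens, so it will almost always hash to a different row and only hits once that phrasing has been written too.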

### 5\.5 LOCOMO single\-hop fact recall

To evaluate the same memory systems on a published long-term conversational benchmark, we run all eight systems from Section [5.4](https://arxiv.org/html/2606.19172#S5.SS4) on LOCOMO (Maharana et al., [2024](https://arxiv.org/html/2606.19172#bib.bib19)). We use *all 10 LOCOMO conversations* and take the first 80 single-hop (non-adversarial) questions per conversation, for a total of ∼800 question-answer pairs grounded in dialog per model.

#### Setup\.

Retrieval baselines store the gold evidence turns; User\-as\-Engram stores the gold \(question, answer\) pair\. Every system uses the same Mini\-Engram\-d12 to write the final answer, so the only difference is how the answer is found\. We score token\-F1 against the gold answer \(details in Appendix[A](https://arxiv.org/html/2606.19172#A1)\)\.
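For readers unfamiliar with the metric, a common form of token-F1 is sketched below. The lowercasing and whitespace tokenization are assumptions; the exact normalization used in the paper is in its Appendix A and may differ.

```python
from collections import Counter


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred = prediction.lower().split()
    ref = gold.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


exact = token_f1("Patel", "Patel")
verbose = token_f1("Patel works at the big Globex office downtown", "Patel")
wrong = token_f1("Smith", "Patel")
print(exact, round(verbose, 3), wrong)
```

The verbose answer contains the gold token but scores well below the exact one, because seven of its eight tokens count against precision.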

User-as-Engram Joint OPT clears every retrieval baseline on single-hop token-F1 (Figure [13](https://arxiv.org/html/2606.19172#S5.F13)). Absolute scores are low (Mini-Engram-d12 is a small base LM with no instruction tuning, and token-F1 punishes verbosity), so the relative ordering is what matters. This is a best-case-for-each benchmark: retrieval stores the gold evidence sentence, Engram stores the gold $(q, a)$ pair, and the only question is which extracts the answer more reliably under a fixed LM. Engram wins because the gate fires on the question’s trigger N-gram and biases the next token toward the gold answer, whereas retrieval can miss the relevant turn and, even when it hits, the in-context evidence still competes with the base LM’s priors.

#### Scaling at the optimal config\.

We repeated the same LOCOMO setup with all four trained Mini\-Engrams at our ablation\-optimal recipe \(Section[O\.2](https://arxiv.org/html/2606.19172#A15.SS2)\); the Engram lead widens monotonically with dense size \(Figure[13](https://arxiv.org/html/2606.19172#S5.F13)\)\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x14.png)
Figure 13: LOCOMO single-hop, full 10 conversations, scaling with Mini-Engram dense size. (a) *Token-F1*: User-as-Engram Joint OPT (blue) beats every retrieval baseline at every scale. (b) *LLM-judge accuracy* (Qwen2.5-14B-Instruct): the ranking flips, with retrieval baselines (MEM0_LIKE, MEMMACHINE) beating Joint OPT by 0.05–0.10 because token-F1 over-credits Engram for the correct first answer token despite a noisy continuation.

But the win depends on the kind of question. Across LOCOMO’s four answerable categories, writing the fact wins on single-hop, multi-hop, and reasoning questions, where the stored question → answer pair is exactly what is needed, but loses on open-domain questions, where the answer is a long verbatim span of a sentence the model never saw and a one-token nudge cannot rebuild it (Figure [14](https://arxiv.org/html/2606.19172#S5.F14) and Table [2](https://arxiv.org/html/2606.19172#S5.T2)). So User-as-Engram is a specialized tool, not a drop-in replacement for retrieval: a real system would route by question type, or use Engram for the facts it was taught and retrieval for free-form answers from the session.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x15.png)
Figure 14: LOCOMO category breakdown (token-F1), Engram Joint OPT (blue) vs. the best retrieval baseline (red; max of MEM0_LIKE and MEMMACHINE_LIKE) across three dense scales. Engram wins single-hop, multi-hop, and reasoning at every scale, since the per-fact (question, answer) loss encodes the question → answer map that retrieval must chain across evidence sentences, but loses open-domain, where the answer is a verbatim span of an unseen evidence sentence that a single-token bias cannot reconstruct.

Table 2: LOCOMO category × model token-F1. Engram Joint OPT vs. the best retrieval baseline (MEM0_LIKE or MEMMACHINE_LIKE) per cell. The Engram win is category-dependent (Figure [14](https://arxiv.org/html/2606.19172#S5.F14)).

#### One caveat about the score, and its fix.

Token-F1 gives credit for partial word overlap, and our basic insertion only trains the *first* answer token, so it can score well by getting that first word right over an otherwise noisy answer. When we re-score with an LLM judge (Zheng et al., [2023](https://arxiv.org/html/2606.19172#bib.bib58)) (Qwen2.5-14B, Appendix [A](https://arxiv.org/html/2606.19172#A1)), first-token Engram indeed slips behind retrieval on single-hop, but training the *whole* answer with multi-token Joint OPT fixes it and again overtakes the best retrieval baseline on the larger models. The paraphrase and storage wins above score only the first token, so they are unaffected. (The judge is stricter than LOCOMO’s official one, so these numbers are not directly comparable to its leaderboard.)

### 5\.6 Multi\-hop reasoning over inserted facts

We probe chaining directly: store two facts \(“my doctor is Patel”, “Patel works at Globex”\) and ask a question that needs both \(“where does my doctor work?”\)\. On 63 such pairs, half worded so the question ends in the same words as the second stored fact and half not, we can separate a true chain from a lucky word\-match\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x16.png)
Figure 15: Multi-hop reasoning over Engram-inserted facts (n = 63 chained pairs on Mini-Engram-d20, two OPT-15 insertions per item). Per-fact direct recall is 99.2% (both facts are insertable), so the gap is chaining, not insertion. When the query’s suffix N-gram coincides with Fact-2’s trigger, the gate fires and recall is 90.6%; when true chaining is required, it collapses to 12.9% (Wilson 95% CIs shown). The gate is a surface-N-gram hash: it matches, it does not compose.

The result is almost all-or-nothing (Figure [15](https://arxiv.org/html/2606.19172#S5.F15)): when the question happens to end in the second fact’s words, recall is high; when a real chain is needed, it falls to near chance, and the few hits are word-match coincidences. Since each fact alone is recalled essentially perfectly, the gap is chaining, not storage. This is the limit the store shares with a per-user LoRA: it matches words, it does not reason across them. The recite-then-reason recipe (Section [8](https://arxiv.org/html/2606.19172#S8)) is the natural fix.
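The Wilson intervals mentioned in the Figure 15 caption follow a standard closed form, sketched here. The per-condition counts are not stated in the source; 29/32 and 4/31 are an inference, chosen because they reproduce the reported 90.6% and 12.9% and sum to the 63 pairs.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts consistent with the reported percentages.
match_lo, match_hi = wilson_interval(29, 32)  # suffix matches Fact-2's trigger
chain_lo, chain_hi = wilson_interval(4, 31)   # true chaining required

print(f"suffix match: [{match_lo:.3f}, {match_hi:.3f}]")
print(f"true chain:   [{chain_lo:.3f}, {chain_hi:.3f}]")
```

Under these assumed counts the two intervals do not overlap, which is consistent with the all-or-nothing reading of the result.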

## 6 Layered architecture: shared reasoning skill \+ per\-user content

So far: per-user LoRA contaminates (Section [3](https://arxiv.org/html/2606.19172#S3)) and per-user Engram preserves direct recall but lacks in-context reasoning skill (the per-user Engram alone reaches only 23% indirect_any at n = 20 in Figure [16](https://arxiv.org/html/2606.19172#S6.F16)). Can the two be *composed* without inheriting LoRA’s damage? Our thesis: personal memory splits into *content* (per-user, trigger-keyable, low-cost) and *reasoning skill* (shared patterns, learnable from held-out users); storing each the right way beats every all-in-one baseline on every measure. Two hypotheses make this concrete:

- H1 (naive combination): the naive stack of per-user LoRA + per-user Engram inherits LoRA’s indirect-reasoning damage; Engram’s local edit does not rescue the composition.
- H2 (layered architecture): one *shared* LoRA (cross-user reasoning skill) + per-user Engram (local content), the layered design, splits memory correctly. The shared LoRA’s contamination is paid once and shared across all users, and the per-user Engram adds none on top (the layered design’s Δbpb equals the shared LoRA’s alone, by Section [4](https://arxiv.org/html/2606.19172#S4)’s locality property).

### 6\.1 Design

Train a single shared LoRA on cross\-user in\-context\-reasoning data: for each held\-out training user \(here, u020–u029\), render each indirect QA as a completion\-format sample “Facts: <fact1\>\. <fact2\>\. \.\.\. Q: <q\> A: <gold\>”, using only the facts the question needs\. This teaches the LoRA the*pattern*“given facts in context, do the reasoning,” without memorizing any specific user’s content\. At inference, attach the shared LoRA \+ the test user’s Engram\-row override and ask the indirect question without facts in the prompt: the Engram provides content \(via trigger N\-gram lookup\), the shared LoRA provides the skill \(Figure[1](https://arxiv.org/html/2606.19172#S1.F1)\)\.
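The two prompt formats can be made concrete with a pair of rendering helpers. This is a sketch: the function names are hypothetical, the example facts reuse the doctor/Globex pair from Section 5.6, and the inference format follows the “Q: <q>\nA:” probe described in Section 6.2.

```python
def render_training_sample(facts, question, gold):
    """Completion-format sample for the shared LoRA: facts are in context."""
    fact_text = " ".join(f"{fact.rstrip('.')}." for fact in facts)
    return f"Facts: {fact_text} Q: {question} A: {gold}"


def render_inference_prompt(question):
    """Indirect probe at test time: no facts in the prompt."""
    return f"Q: {question}\nA:"


sample = render_training_sample(
    ["My doctor is Patel", "Patel works at Globex"],
    "Where does my doctor work?",
    "Globex",
)
probe = render_inference_prompt("Where does my doctor work?")
print(sample)
print(probe)
```

The asymmetry is the point of the design: training samples carry the facts in the prompt so the adapter learns the reasoning pattern, while the test-time probe carries none and relies on the per-user rows for content.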

### 6\.2 Six\-condition head\-to\-head

We compare six parametric conditions, shown as the six bars (in this order) in Figure [16](https://arxiv.org/html/2606.19172#S6.F16): the untouched base, per-user LoRA, the per-user Engram alone (content, no shared skill), the naive LoRA + Engram stack, the shared LoRA alone (skill, no per-user content), and the layered design (per-user Engram + shared LoRA); the retrieval variants are introduced in Section [6.4](https://arxiv.org/html/2606.19172#S6.SS4). We evaluate every per-user combination on n = 20 test users (n = 10 training users held out for the shared LoRA), measuring direct top-1/top-5 recall on the user’s facts (completion-format prompts), indirect top-1 and indirect_any on completion-format indirect probes (“Q: <q\>\\nA:”, greedy 16-token continuation), and val_bpb on a held-out 262 K-token ClimbMix shard (Figure [16](https://arxiv.org/html/2606.19172#S6.F16)).

![Refer to caption](https://arxiv.org/html/2606.19172v1/x17.png)
Figure 16: The six parametric conditions on n = 20 test users (Mini-Engram-d20, seed S0; bars left-to-right as listed in the text). Left: the layered design (Engram + shared LoRA) matches per-user LoRA’s direct recall (100% vs. 99%) while delivering 7.4× its indirect_any on this seed (44% vs. 6%); neither the per-user Engram (no skill) nor the shared LoRA alone (never saw the user’s facts) suffices. Right: the layered design adds *zero* contamination on top of the shared skill (Δbpb +0.386, equal to the shared LoRA alone), while per-user LoRA and the naive LoRA+Engram stack pay +1.78/+1.82. The retrieval conditions appear in Figure [28](https://arxiv.org/html/2606.19172#A8.F28); 3 seeds in Appendix [N](https://arxiv.org/html/2606.19172#A14) (3-seed mean: layered 41%, per-user LoRA 7%).

### 6.3 Results: the layered design wins on every measure

The headline is the layered-design-vs-per-user-LoRA contrast (Figure [16](https://arxiv.org/html/2606.19172#S6.F16)). The layered design matches per-user LoRA on direct recall (100% vs. 99%) but answers indirect questions 7.4× more often on this canonical seed (44% vs. 6%; 5.6× averaged over three seeds), with 4.6× less damage to unrelated text (all of it from the shared skill, none from the per-user Engram) and not a single user out of 20 made worse than the untouched model (against 17 of 20 for the LoRA), all at 88 KB per user instead of 14.2 MB. The gap is stable across three seeds (paired-bootstrap 95% CI on the layered − LoRA indirect difference: [+31, +37] pp; per-seed spread in Appendix [N](https://arxiv.org/html/2606.19172#A14)).

H1 holds: bolting an Engram onto a per-user LoRA does not rescue its reasoning; the LoRA’s damage dominates the combination. H2 holds on all three counts: adding the per-user Engram on top of the shared skill costs *zero* extra damage to unrelated text (its Δbpb equals the shared LoRA’s alone), keeps direct recall at 100%, and roughly doubles indirect accuracy over either the Engram alone or the untouched model. Neither piece works on its own (the Engram has the facts but no reasoning, the shared skill has reasoning but has never seen the user’s facts), and only together do they win.

### 6\.4 Does RAG just close the gap instead?

The obvious objection to the layered design is: why not just retrieve the facts and put them in the prompt? We test this two ways (the naive-RAG and RAG+shared-LoRA conditions in the context-cost plot, Appendix [H](https://arxiv.org/html/2606.19172#A8), Figure [28](https://arxiv.org/html/2606.19172#A8.F28)). On the Engram base itself, putting facts in the prompt sits *below* the layered design: the base LM never learned to use an in-context fact list, and even handing it the exact right facts does not help, because the missing piece is the reasoning skill, not the facts. The fairer test feeds the retrieved facts to a real instruction-tuned model (Qwen2.5-3B-Instruct, about 2.5× larger). Even there, the layered design, with no retrieval and an empty prompt, lands within a few points of the larger model, and adding retrieval on top of the shared skill passes it once context is free. The plain rule: no room in the prompt → use the layered design; room to spare → add retrieval on top of the shared skill.

### 6\.5 RAG\-vs\-Engram trade\-offs sharpen as the KB grows

The experiments above used each user’s native 34 facts. Production fact counts are typically larger ($10^2$–$10^3$: chat history, calendar, preferences, contacts). We augment each test user’s 34 facts with distractors sampled from the 30-user schema-family pool (u000–u029 minus the active user; 1008 facts) to reach $N \in \{34, 100, 200, 300, 500, 1000\}$. Indirect probes are unchanged; the retriever now discriminates the gold fact among $N$ candidates. The layered design is unaffected: the per-user Engram table holds only the test user’s facts, independent of population size.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x18.png)
Figure 17: Indirect-reasoning accuracy and retrieval recall vs. KB size (n = 20 users, 20 indirect probes each; KB = test user’s 34 facts + distractors sampled from the 30-user schema-family pool, log $x$-axis, $N \in \{34, 100, 200, 300, 500, 1000\}$). (a) The layered design (horizontal solid line at 44%) is invariant to KB size; the retrieval-based conditions degrade monotonically. Naive RAG and Qwen-3B + RAG fall below the layered design at KB ≥ 100; at KB = 1000 the gap is 14 pp. RAG top-3 + shared LoRA is the most robust to KB growth (54% → 47%) because the shared reasoning skill compensates when retrieval misses, though its lead over the layered design shrinks from 10 pp to 3 pp across the sweep. (b) Retrieval recall (fraction of probes where the retrieved set covers every fact the question needs) drops by 7× for top-3 (62% → 9%) between KB = 34 and KB = 1000, which is the mechanistic driver of (a).

Retrieval degrades quickly: as the candidate pool grows, top-3 recall on all-MiniLM-L6-v2 collapses several-fold purely from more candidates in the same embedding space, which drives the accuracy drop. Naive RAG, and even Qwen-3B + RAG, fall below the layered design once the KB passes ∼100 facts, while the layered design holds flat, since its per-user table never grows. The clean reading (Figure [17](https://arxiv.org/html/2606.19172#S6.F17)): *which method wins shifts with deployment scale*. At a few dozen facts, RAG on a larger model is the accuracy ceiling and the layered design is the zero-context option; by a few hundred facts, the layered design decisively wins.
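The retrieval-recall measure in panel (b) is stricter than per-fact hit rate: a probe counts only if the retrieved set covers every fact it needs. A minimal sketch of that definition, with made-up fact identifiers:

```python
def coverage_recall(probes):
    """Fraction of probes whose retrieved set contains every needed fact.

    `probes` is a list of (retrieved_ids, needed_ids) pairs.
    """
    covered = sum(
        1 for retrieved, needed in probes if set(needed) <= set(retrieved)
    )
    return covered / len(probes)


# Hypothetical top-3 retrievals for three two-fact probes.
probes = [
    (["f1", "f2", "f9"], ["f1", "f2"]),  # both needed facts retrieved
    (["f1", "f7", "f8"], ["f1", "f2"]),  # one needed fact missing
    (["f3", "f4", "f5"], ["f4", "f3"]),  # covered, order does not matter
]
recall = coverage_recall(probes)
print(round(recall, 3))
```

Under this definition a multi-fact question is lost as soon as one needed fact falls out of the top-k, which is one reason recall can fall faster than a per-fact hit rate as distractors are added.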

#### Where each side’s failure comes from\.

The two limits are mechanistically distinct. Engram’s density ceiling (Section [5.2](https://arxiv.org/html/2606.19172#S5.SS2)) comes from forward-pass interference among co-active rows inside one user’s table, and is bounded by per-user fact count. RAG’s retrieval ceiling comes from nearest-neighbor confusion across an ever-growing candidate pool, and is bounded by whatever pool the retriever sees (per-user, per-tenant, or per-corpus). For $10^2$–$10^3$ facts/user, Engram density is the limiting factor on the parametric side; for $10^4$+ facts in a shared corpus, retrieval is the limiting factor on the RAG side. The two costs scale on different axes and cross around $N \approx 100$.

## 7 Multi\-tenant serving system

We implement EngramServer, a $\sim$50\-line wrapper around an Engram\-pretrained model\. The base and global tables stay on the GPU; each user’s small override map lives in CPU memory\. A request resolves the user, writes their rows, runs the forward pass, and restores the originals, with no router, no custom kernel, and no graph rewrite\. Writing and restoring the rows are cheap relative to the forward pass in every configuration we measured: a sub\-millisecond array write on the ablation\-optimal d12@1280 \(Figure [18](https://arxiv.org/html/2606.19172#S7.F18)\), and at most $\approx$2\.2 ms each in the earlier per\-component breakdown on the smaller d12v2 checkpoint \(Appendix [G](https://arxiv.org/html/2606.19172#A7)\)\. The frozen forward pass dominates the rest\.
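The request loop described above (resolve the user, write rows, run the forward pass, restore) can be sketched as follows. This is a toy under stated assumptions, not the authors' EngramServer: the class and method names are hypothetical, a Python list of rows stands in for the GPU-resident table, and `toy_forward` stands in for the frozen model.

```python
from contextlib import contextmanager


class EngramServerSketch:
    """Toy stand-in for the write / forward / restore request loop."""

    def __init__(self, table, forward_fn):
        self.table = table            # shared table: list of rows (stands in for a GPU tensor)
        self.forward_fn = forward_fn  # frozen forward pass: (table, query) -> output
        self.overrides = {}           # user_id -> {slot: row}, kept in host memory

    def register(self, user_id, rows):
        # Copy rows so later caller-side mutation cannot change stored overrides.
        self.overrides[user_id] = {slot: list(row) for slot, row in rows.items()}

    @contextmanager
    def user_rows(self, user_id):
        rows = self.overrides.get(user_id, {})
        saved = {slot: self.table[slot] for slot in rows}  # remember the originals
        try:
            for slot, row in rows.items():
                self.table[slot] = list(row)               # write the user's rows
            yield self.table
        finally:
            for slot, row in saved.items():
                self.table[slot] = row                     # restore, even on error

    def handle(self, user_id, query):
        with self.user_rows(user_id) as table:
            return self.forward_fn(table, query)


def toy_forward(table, slot):
    # Hypothetical "model": reads one slot and sums its row.
    return sum(table[slot])


table = [[0.0, 0.0] for _ in range(8)]
server = EngramServerSketch(table, toy_forward)
server.register("alice", {3: [1.0, 2.0]})
server.register("bob", {5: [4.0, 0.5]})

alice_out = server.handle("alice", 3)  # alice's row is live during her request
bob_out = server.handle("bob", 3)      # bob's request sees the base row at slot 3
```

Restoring in a `finally` clause is a design choice of this sketch: it keeps the shared table clean even if the forward pass raises, so one user's rows are never left in place for the next request.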

#### It scales flat in the number of users\.

Figure [18](https://arxiv.org/html/2606.19172#S7.F18) reports throughput, latency, storage, and recall across four deployment scales\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x19.png)

Figure 18: EngramServer throughput across four deployment scales on one idle Blackwell GPU\. On the ablation\-optimal d12@1280 the server reaches 232 req/s \(30 u/50 f\) and holds 226 req/s at 100 u/100 f \(within 3% as tenants triple\) at 4\.4 ms p50 latency, with a sub\-millisecond override apply \(0\.03 ms p50; the larger per\-component figures in Appendix [G](https://arxiv.org/html/2606.19172#A7) are from an earlier, slower benchmark on the smaller d12v2 checkpoint\), 88 KB/user, and 62%/96% top\-1/top\-5 own\-fact recall\.

Per\-request work does not depend on how many users share the server, so throughput and latency barely move as users triple \(Figure [18](https://arxiv.org/html/2606.19172#S7.F18)\); each extra user costs one small row\-swap and no shared state\. Two users are never in the table at the same time, so one cannot read another’s facts: cross\-user leakage is zero by construction\. \(The only residual overlap is coincidence, such as two users with the same favorite spice, and it is small\.\) By contrast, serving per\-user LoRAs needs custom CUDA kernels and batch\-routing \(S\-LoRA \(Sheng et al\., [2024](https://arxiv.org/html/2606.19172#bib.bib33)\), Punica \(Chen et al\., [2024](https://arxiv.org/html/2606.19172#bib.bib34)\)\); here the write is just an array assignment\. Batched multi\-user serving would add one gather per Engram layer, which we do not benchmark\.

## 8 Discussion and limitations

### 8\.1 When to use what

Which method to use turns on three things: *how many facts each user has*, *whether the queries need indirect reasoning*, and *whether context budget or per\-user storage is the tighter constraint*\. Table [3](https://arxiv.org/html/2606.19172#S8.T3) gives the recommended method for each case\.

Table 3: Cheat sheet for which method to use\. “invariant” = the number does not degrade with KB / population size\.

The most important row is the second: the layered design is the right choice for the common case \(10–1000 facts/user, indirect reasoning matters, many users\), and decisively beats RAG with a $\sim$2\.5$\times$ larger instruction\-tuned LM at $N\geq 100$\. *Any* method that reaches 100% direct top\-1 pays a price in indirect reasoning; the question is whether that price is baked in \(a global LoRA\), manageable \(a local Engram, where the trigger N\-gram is the contract\), or set by retrieval \(RAG, where the encoder must pick the gold fact out of $N$ candidates\)\.

### 8\.2 Open limitations

#### Shared multi\-hop reasoning gap\.

Both LoRA and Engram are surface\-trigger keyed; neither composes facts across triggers \(“if my doctor is Patel and Patel works at Globex, what is my doctor’s employer?”\)\. User\-as\-Engram inherits this gap from per\-user LoRA\. A candidate fix is to mix recite\-then\-reason traces into Engram pretraining; we leave it to future work\.
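The gap can be illustrated with a deliberately simplified toy. Everything here is an assumption made for illustration: a Python dict keyed on the trailing token N-gram stands in for the hashed, gated Engram lookup, and stored values are strings rather than vectors added to the hidden state.

```python
def lookup(table, tokens, n=2):
    """Return the stored value if the last n tokens match a trigger exactly, else None."""
    return table.get(tuple(tokens[-n:]))


# Two facts, each keyed on its own surface trigger.
table = {
    ("my", "doctor"): "Patel",
    ("Patel", "works_at"): "Globex",
}

# Single hop: the trigger appears verbatim, so the lookup fires.
single_hop = lookup(table, ["who", "is", "my", "doctor"])

# Two hops in one query: the second trigger needs the token "Patel",
# which never appears on the surface, so nothing fires.
two_hop = lookup(table, ["my", "doctor", "works_at"])

# Recite-then-reason: surface the first answer, then query the second trigger.
recited = lookup(table, [single_hop, "works_at"])
```

In this toy the chained question only succeeds once the intermediate answer is written out, which is the intuition behind mixing recite-then-reason traces into training.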

#### Within\-user density ceiling\.

Joint OPT at 1000 facts reaches 35% top\-1\. The slot space is highly sparse \($<1\%$ occupancy\), so the constraint is forward\-pass interference among co\-active rows, not hash collisions\. Per\-user scoped tables \(loading only the rows relevant to the current query, via a small classifier\) are a candidate future fix; another is the multi\-fact\-in\-the\-loss recipe change in Section [8\.3](https://arxiv.org/html/2606.19172#S8.SS3)\.
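A back-of-the-envelope check shows why collisions are an unlikely culprit at this occupancy. The sketch assumes uniform hashing with one slot per key, and the table size is a hypothetical value chosen only to give sub-1% occupancy; neither is taken from the paper.

```python
def collision_fraction(num_keys, num_slots):
    """Expected fraction of keys that share a slot with at least one other key,
    assuming each key hashes uniformly and independently to one slot."""
    if num_keys < 2:
        return 0.0
    return 1.0 - (1.0 - 1.0 / num_slots) ** (num_keys - 1)


num_keys = 1000        # facts written for one user
num_slots = 200_000    # hypothetical table size, giving 0.5% occupancy

occupancy = num_keys / num_slots
frac = collision_fraction(num_keys, num_slots)
```

Under these assumptions the expected collision fraction stays below the occupancy itself, so well under 1% of keys would collide, far too few to explain a top-1 recall of 35%.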

#### Engram pretraining required\.

Our method assumes an Engram\-pretrained base\. We trained four ourselves at 178 M, 339 M, 625 M, and 1\.22 B total parameters; production deployment depends on either DeepSeek releasing weights or pretraining your own \($\sim$10 h on a single Blackwell GPU at our 625 M scale; production\-scale Engrams require correspondingly larger compute budgets\)\.

### 8\.3 Future work: training\-recipe changes

Three changes to the training recipe target the limits we documented, and we expect more from them per unit of compute than from simply making the model bigger\. *\(1\) Put many facts in the loss during training*, so the model learns to hold several at once\. We tested a lighter version of this by finetuning d12@1280, and it helps at high density, but only by reaching the same recall in fewer optimization steps, not by raising the ceiling \(Appendix [O\.1](https://arxiv.org/html/2606.19172#A15.SS1), Figure [30](https://arxiv.org/html/2606.19172#A15.F30)\)\. This pins the density ceiling on inference\-time optimization, not pretraining\. *\(2\) Train with random per\-user offsets to the hash*, which would let the gate read rows it never saw in training and bring back a per\-user\-salt option\. *\(3\) Mix in “recall the fact, then reason” examples*, the most direct attack on the chaining gap\.

#### A negative result worth recording\.

We tried improving the shared skill by training it on reasoning traces from a teacher model\. It cut the contamination but also *lowered* accuracy \(Appendix [O](https://arxiv.org/html/2606.19172#A15)\)\. The reason is instructive: those traces assume the model first writes out a thinking step, while our test reads only the final answer\. The recipe and the way you score it have to be designed together; you cannot mix and match them\. Remaining follow\-ups for the layered design: a larger shared\-skill training set, training the per\-user table and the shared skill together, and starting from a chat\-tuned base\.

## 9 Related Work

#### Memory architectures and adjacent work\.

Engram \(Cheng et al\., [2026](https://arxiv.org/html/2606.19172#bib.bib6)\) sits in a long line of trainable key–value memory modules \(Geva et al\., [2021](https://arxiv.org/html/2606.19172#bib.bib93); Lample et al\., [2019](https://arxiv.org/html/2606.19172#bib.bib35); He, [2024](https://arxiv.org/html/2606.19172#bib.bib36); Berges et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib37); Huang et al\., [2024b](https://arxiv.org/html/2606.19172#bib.bib38), [2025](https://arxiv.org/html/2606.19172#bib.bib39); Yu et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib40); Pagnoni et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib41); Liu et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib42)\), but none of them addresses writing a single user’s fact into the memory at inference time\. Stacking edits without a combiner echoes LoRA stacking for Stable Diffusion \(CivitAI, [2024](https://arxiv.org/html/2606.19172#bib.bib43)\); LoRAHub \(Huang et al\., [2023](https://arxiv.org/html/2606.19172#bib.bib17)\) instead trains one\. A mirror\-image line keeps memory *outside* the weights but executable: User as Code \(Li, [2026](https://arxiv.org/html/2606.19172#bib.bib1)\) compiles a user’s history into typed state and rules, which we contrast with our in\-weights edits in Section [5](https://arxiv.org/html/2606.19172#S5)\.

#### Memory systems for LLM agents\.

A large body of work gives an agent long\-term memory by keeping facts *outside* the weights and retrieving them at query time: paging systems \(Packer et al\., [2023](https://arxiv.org/html/2606.19172#bib.bib59)\); extraction\- and graph\-based stores \(Chhikara et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib60); Xu et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib61); Rasmussen et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib62); Li et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib63); Hu et al\., [2026](https://arxiv.org/html/2606.19172#bib.bib65); Wang et al\., [2026](https://arxiv.org/html/2606.19172#bib.bib64)\); and a line that learns *how* to read and write the store \(Yan et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib67); Yu et al\., [2026](https://arxiv.org/html/2606.19172#bib.bib68); Wang et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib69)\), with surveys cataloguing the space \(Zhang et al\., [2024](https://arxiv.org/html/2606.19172#bib.bib70); Wu et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib71); Du, [2026](https://arxiv.org/html/2606.19172#bib.bib72)\)\. None writes a fact *into* the model\. \(One concurrent system even shares our name \(Patel & Patel, [2025](https://arxiv.org/html/2606.19172#bib.bib66)\) while doing the opposite: it orchestrates memories in natural language\.\) We use Mem0\- and MemMachine\-style retrieval as our strongest baselines \(Section [5\.4](https://arxiv.org/html/2606.19172#S5.SS4)\) and ask a different question: *when* does writing a fact into the weights beat keeping it in a retrievable store \(Pollertlam & Kornsuwannawit, [2026](https://arxiv.org/html/2606.19172#bib.bib73)\)?

#### Personalizing language models\.

Adapting a model to an individual user is an active area in its own right \(Zhang et al\., [2024](https://arxiv.org/html/2606.19172#bib.bib75); Liu et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib76); Xu et al\., [2026](https://arxiv.org/html/2606.19172#bib.bib77)\), with benchmarks such as LaMP \(Salemi et al\., [2024](https://arxiv.org/html/2606.19172#bib.bib74)\) measuring personalized generation\. Most methods personalize through the prompt or a retrieval store; the per\-user LoRA line above is the parametric alternative, and User as Engram is our attempt to win the locality of a prompt at the zero\-context cost of a weight edit\.

#### Retrieval and non\-parametric memory\.

Retrieval\-augmented generation \(Lewis et al\., [2020](https://arxiv.org/html/2606.19172#bib.bib11); Gao et al\., [2023](https://arxiv.org/html/2606.19172#bib.bib12)\) is the dominant non\-parametric route, spanning retrieve\-then\-read pretraining, large\-scale retrieval pretraining, and self\-reflective retrieval \(Guu et al\., [2020](https://arxiv.org/html/2606.19172#bib.bib88); Khandelwal et al\., [2020](https://arxiv.org/html/2606.19172#bib.bib89); Borgeaud et al\., [2022](https://arxiv.org/html/2606.19172#bib.bib90); Izacard et al\., [2023](https://arxiv.org/html/2606.19172#bib.bib91); Asai et al\., [2024](https://arxiv.org/html/2606.19172#bib.bib92)\)\. Its quality is bounded by the encoder’s ability to surface the right evidence among many candidates, which is exactly the failure mode we measure as the knowledge base grows \(Section [6\.5](https://arxiv.org/html/2606.19172#S6.SS5)\)\.

#### Personal memory benchmarks\.

Long\-term personal memory is evaluated by LOCOMO \(Maharana et al\., [2024](https://arxiv.org/html/2606.19172#bib.bib19)\), LongMemEval \(Wu et al\., [2024](https://arxiv.org/html/2606.19172#bib.bib18)\), BEAM \(Tavakoli et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib20)\) \(multi\-turn evolving beliefs\), and PersonaMem\-v2 \(Jiang et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib21)\)\. We use synthetic per\-user fact corpora for the within\-user fact\-density sweeps \(tighter control over the parameter we vary\), and LOCOMO \(Section [5\.5](https://arxiv.org/html/2606.19172#S5.SS5)\) for end\-to\-end evaluation against retrieval baselines\. We chose LOCOMO over LongMemEval and BEAM because its per\-question evidence pointers let us train a per\-fact Engram row and a retrieval entry against the same gold sentence, giving an apples\-to\-apples “same gold, different method” comparison\. LongMemEval and BEAM remain open for future work\.

#### Knowledge editing and recitation\.

Knowledge editing changes a fact the model already holds, either by locating and rewriting weights \(Meng et al\., [2022](https://arxiv.org/html/2606.19172#bib.bib44), [2023](https://arxiv.org/html/2606.19172#bib.bib45); Dai et al\., [2022](https://arxiv.org/html/2606.19172#bib.bib80); Yao et al\., [2023](https://arxiv.org/html/2606.19172#bib.bib84)\) or by routing around them \(Mitchell et al\., [2022a](https://arxiv.org/html/2606.19172#bib.bib78), [b](https://arxiv.org/html/2606.19172#bib.bib79); Wang et al\., [2024](https://arxiv.org/html/2606.19172#bib.bib83)\), with ripple\-effect benchmarks measuring the fallout \(Cohen et al\., [2023](https://arxiv.org/html/2606.19172#bib.bib46), [2024](https://arxiv.org/html/2606.19172#bib.bib47); Meng et al\., [2022b](https://arxiv.org/html/2606.19172#bib.bib48)\)\. We do the opposite: we *add* facts the model never saw and want everything else left alone\. We reuse recitation methods, which surface a stored fact before reasoning over it \(Wu et al\., [2025](https://arxiv.org/html/2606.19172#bib.bib49); Sun et al\., [2023](https://arxiv.org/html/2606.19172#bib.bib50)\), in Section [3](https://arxiv.org/html/2606.19172#S3)\.

#### Complementary learning systems and continual learning\.

The extra loss a per\-user LoRA imposes on unrelated text is a modern instance of *catastrophic interference*: training on something new overwrites what was there before \(McCloskey & Cohen, [1989](https://arxiv.org/html/2606.19172#bib.bib85)\), which decades of continual\-learning work fight by protecting important weights \(Kirkpatrick et al\., [2017](https://arxiv.org/html/2606.19172#bib.bib86)\) or replaying old data, with the problem persisting for large language models \(Shi et al\., [2024](https://arxiv.org/html/2606.19172#bib.bib87)\)\. The brain avoids it with a different architecture, and the word *engram* points to how: an engram is the physical trace a memory leaves in neural tissue \(Semon, [1921](https://arxiv.org/html/2606.19172#bib.bib2)\), and *engram cells* \(sparse neuron populations whose reactivation triggers recall\) sit largely in the hippocampus \(Tonegawa et al\., [2015](https://arxiv.org/html/2606.19172#bib.bib4)\), while general skills live in the slow, distributed neocortex\. Complementary learning systems theory \(McClelland et al\., [1995](https://arxiv.org/html/2606.19172#bib.bib3); Kumaran et al\., [2016](https://arxiv.org/html/2606.19172#bib.bib5)\) argues this split is what lets a new episode be written without overwriting old skills\. User as Engram is the same division made architectural: the per\-user Engram row is the fast, sparse, local trace \(the hippocampal engram\); the shared LoRA and frozen backbone are the slow, distributed skill \(the neocortex\); and our measured locality \($\Delta$bpb $+0.00005$ on unrelated text, Section [3](https://arxiv.org/html/2606.19172#S3)\) is the engineering form of the *pattern separation* that keeps the two from interfering\.
The analogy is also honest about where our mechanism falls short: the hippocampus performs *pattern completion* \(recovering a whole memory from a partial or indirect cue\), whereas our gated trigger\-N\-gram lookup fires only on a near\-exact surface match, which is exactly the multi\-hop and paraphrase gap of Sections [5\.6](https://arxiv.org/html/2606.19172#S5.SS6) and [5\.4](https://arxiv.org/html/2606.19172#S5.SS4.SSS0.Px3)\. The biology thus predicts the precise capability we still lack\. Where continual learning asks one set of weights to absorb new facts gracefully, we sidestep the interference by writing each fact to its own address and leaving the shared weights untouched\.

## 10 Conclusion

Personal memory is two jobs, not one: remembering a user’s specific facts, and the general skill of reasoning with them\. The standard in\-weights recipe asks one per\-user adapter to do both, and that is the root of the trouble: an adapter that must *hold* the facts cannot help also reshaping how the model thinks, so remembering and reasoning end up pulling on the same weights and the user pays for it in damaged knowledge and weaker reasoning\. User as Engram separates the two: a user’s facts become a small, local edit, and the reasoning skill lives in one model everyone shares\.

The deeper lesson is that the trouble is not unique to adapters\. Every method that reaches near\-perfect recall pays the same price in weaker reasoning over the stored facts; what differs is only *where* the price lands\. A per\-user LoRA pays it in the shared weights, on every query and stacking with every user; retrieval pays it in a search index that degrades as the candidate pool grows; User as Engram pays it at the one address its fact occupies, so the cost grows with a single user’s facts and nothing else\.

A century ago Richard Semon coined *engram* for the trace an experience etches into living tissue, a mark local enough that one memory does not erase the next\. The brain pairs that sparse, local trace with a slow cortex that learns how to use it, and the pairing is what lets a person remember a new fact without forgetting how to think\. Maya’s assistant needs the same discipline\. When we stop asking one set of weights to be both the memory and the mind, and instead give each user a small private trace beside a skill everyone shares, personalization stops fighting the model it runs on\. That, more than any single number, is what *User as Engram* is for\.

## Acknowledgements

Pine Copilot, Claude Code, and Claude Opus 4\.8 were used during this research\. We thank BSQL Networking for hosting the NVIDIA RTX Pro 6000 GPU on which these experiments were run\.

## References

- Li \[2026\]B\. Li\.User as code: Executable memory for personalized agents\.arXiv:2606\.16707, 2026\.
- Semon \[1921\]R\. Semon\.*The Mneme*\.George Allen & Unwin, London, 1921\.\(English translation of*Die Mneme*, 1904; origin of the term “engram”\)\.
- McClelland et al\. \[1995\]J\. L\. McClelland, B\. L\. McNaughton, and R\. C\. O’Reilly\.Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory\.*Psychological Review*, 102\(3\):419–457, 1995\.
- Tonegawa et al\. \[2015\]S\. Tonegawa, X\. Liu, S\. Ramirez, and R\. Redondo\.Memory engram cells have come of age\.*Neuron*, 87\(5\):918–931, 2015\.
- Kumaran et al\. \[2016\]D\. Kumaran, D\. Hassabis, and J\. L\. McClelland\.What learning systems do intelligent agents need? Complementary learning systems theory updated\.*Trends in Cognitive Sciences*, 20\(7\):512–534, 2016\.
- Cheng et al\. \[2026\]X\. Cheng, W\. Zeng, D\. Dai, Q\. Chen, B\. Wang, Z\. Xie, K\. Huang, X\. Yu, Z\. Hao, Y\. Li, H\. Zhang, H\. Zhang, D\. Zhao, and W\. Liang\.Conditional memory via scalable lookup: A new axis of sparsity for large language models\.arXiv:2601\.07372, January 2026\.
- Vaswani et al\. \[2017\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, L\. Kaiser, and I\. Polosukhin\.Attention is all you need\.NeurIPS 2017\.
- Brown et al\. \[2020\]T\. Brown et al\.Language models are few\-shot learners\.NeurIPS 2020\.
- Radford et al\. \[2019\]A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, and I\. Sutskever\.Language models are unsupervised multitask learners \(GPT\-2\)\.OpenAI technical report, 2019\.
- Min et al\. \[2022\]S\. Min, X\. Lyu, A\. Holtzman, M\. Artetxe, M\. Lewis, H\. Hajishirzi, and L\. Zettlemoyer\.Rethinking the role of demonstrations: What makes in\-context learning work?EMNLP 2022\.
- Lewis et al\. \[2020\]P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela\.Retrieval\-augmented generation for knowledge\-intensive NLP tasks\.NeurIPS 2020\.
- Gao et al\. \[2023\]Y\. Gao et al\.Retrieval\-augmented generation for large language models: A survey\.arXiv:2312\.10997, 2023\.
- Shazeer et al\. \[2017\]N\. Shazeer, A\. Mirhoseini, K\. Maziarz, A\. Davis, Q\. Le, G\. Hinton, and J\. Dean\.Outrageously large neural networks: The sparsely\-gated mixture\-of\-experts layer\.ICLR 2017\.
- Dai et al\. \[2024\]D\. Dai et al\.DeepSeekMoE: Towards ultimate expert specialization in mixture\-of\-experts language models\.arXiv:2401\.06066, 2024\.
- Mangrulkar et al\. \[2022\]S\. Mangrulkar et al\.PEFT: State\-of\-the\-art parameter\-efficient fine\-tuning methods\.[https://github\.com/huggingface/peft](https://github.com/huggingface/peft), 2022\.
- Jordan et al\. \[2024\]K\. Jordan et al\.Muon: An optimiser for hidden layers in neural networks\.GitHub / blog post, 2024\.
- Huang et al\. \[2023\]C\. Huang et al\.LoRAHub: Efficient cross\-task generalization via dynamic LoRA composition\.arXiv:2307\.13269, 2023\.
- Wu et al\. \[2024\]D\. Wu et al\.LongMemEval: Benchmarking chat assistants on long\-term memory\.arXiv 2024\.
- Maharana et al\. \[2024\]A\. Maharana et al\.LOCOMO: Evaluating very long\-term conversational memory of LLM agents\.ACL 2024\.
- Tavakoli et al\. \[2025\]M\. Tavakoli, A\. Salemi, C\. Ye, M\. Abdalla, H\. Zamani, and J\. R\. Mitchell\.Beyond a million tokens: Benchmarking and enhancing long\-term memory in LLMs \(BEAM\)\.arXiv:2510\.27246, 2025\.
- Jiang et al\. \[2025\]B\. Jiang, Y\. Yuan, M\. Shen, Z\. Hao, Z\. Xu, Z\. Chen, Z\. Liu, A\. R\. Vijjini, J\. He, H\. Yu, R\. Poovendran, G\. Wornell, L\. Ungar, D\. Roth, S\. Chen, and C\. J\. Taylor\.PersonaMem\-v2: Towards personalized intelligence via learning implicit user personas and agentic memory\.arXiv:2512\.06688, 2025\.
- Karpathy \[2026\]A\. Karpathy\.nanochat: an experimental training harness for LLMs\.[https://github\.com/karpathy/nanochat](https://github.com/karpathy/nanochat), 2026\.
- Hu et al\. \[2022\]E\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen\.LoRA: Low\-rank adaptation of large language models\.ICLR 2022\.
- Houlsby et al\. \[2019\]N\. Houlsby et al\.Parameter\-efficient transfer learning for NLP\.ICML 2019\.
- Su et al\. \[2025\]W\. Su et al\.Parametric retrieval\-augmented generation \(PRAG\)\.arXiv:2501\.15915, 2025\.
- Tan et al\. \[2025\]Z\. Tan et al\.DyPRAG: Dynamic parametric RAG\.arXiv:2505\.19386, 2025\.
- Chen et al\. \[2025\]J\. Chen, H\. Zhang, L\. Pang, Y\. Tong, H\. Zhou, Y\. Zhan, W\. Lin, and Z\. Zheng\.Privacy\-preserving reasoning with knowledge\-distilled parametric retrieval\-augmented generation \(DistilledPRAG\)\.arXiv:2509\.01088, 2025\.
- Tan et al\. \[2024\]Z\. Tan, Q\. Liu, and M\. Jiang\.Democratizing large language models via personalized parameter\-efficient fine\-tuning \(OPPU\)\.EMNLP 2024 \(arXiv:2402\.04401\)\.
- Zhuang et al\. \[2024\]Y\. Zhuang et al\.HYDRA: Per\-user adapters for personalised LLMs\.arXiv 2024\.
- Bini et al\. \[2025\]M\. Bini, O\. Bohdal, U\. Michieli, Z\. Akata, M\. Ozay, and T\. Ceritli\.MemLoRA: Distilling expert adapters for on\-device memory systems\.arXiv:2512\.04763, 2025\.
- Charakorn et al\. \[2025\]R\. Charakorn, E\. Cetin, Y\. Tang, and R\. T\. Lange\.Text\-to\-LoRA: Instant transformer adaption\.arXiv:2506\.06105, 2025\.
- Tan et al\. \[2024b\]Z\. Tan et al\.PER\-PCS: Per\-user post\-hoc LoRA composition\.arXiv 2024\.
- Sheng et al\. \[2024\]Y\. Sheng, S\. Cao, D\. Li, et al\.S\-LoRA: Serving thousands of concurrent LoRA adapters\.arXiv:2311\.03285, 2024\.
- Chen et al\. \[2024\]L\. Chen, Z\. Ye, Y\. Wu, et al\.Punica: Multi\-tenant LoRA serving\.MLSys 2024\.
- Lample et al\. \[2019\]G\. Lample, A\. Sablayrolles, M\. Ranzato, L\. Denoyer, and H\. Jégou\.Large memory layers with product keys\.NeurIPS 2019\.
- He \[2024\]P\. He\.PEER: Mixture of one million experts\.arXiv 2024\.
- Berges et al\. \[2025\]V\. Berges, B\. Oğuz, D\. Haziza, W\. Yih, L\. Zettlemoyer, and G\. Ghosh\.Memory layers at scale\.ICML 2025\.
- Huang et al\. \[2024b\]Z\. Huang, Q\. Min, H\. Huang, D\. Zhu, Y\. Zeng, R\. Guo, and X\. Zhou\.Ultra\-sparse memory network \(Ultra\-Mem\)\.arXiv:2411\.12364, 2024 \(ICLR 2025\)\.
- Huang et al\. \[2025\]J\. Huang et al\.OverEncoding: hashed N\-gram embeddings via averaging\.2025\.
- Yu et al\. \[2025\]L\. Yu et al\.SCONE: scalable contextual N\-gram embeddings\.2025\.
- Pagnoni et al\. \[2025\]A\. Pagnoni, R\. Pasunuru, P\. Rodriguez, et al\.BLT: byte latent transformer with hashed N\-gram embeddings\.arXiv:2412\.09871, 2025\.
- Liu et al\. \[2025\]A\. Liu et al\.SuperBPE: word\-level BPE for compositional patterns\.2025\.
- CivitAI \[2024\]CivitAI Community\.LoRA stacking patterns for Stable Diffusion\.[https://civitai\.com/](https://civitai.com/), 2024\.
- Meng et al\. \[2022\]K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov\.Locating and editing factual associations in GPT \(ROME\)\.NeurIPS 2022\.
- Meng et al\. \[2023\]K\. Meng et al\.MEMIT: Mass\-editing memory in a transformer\.ICLR 2023\.
- Cohen et al\. \[2023\]R\. Cohen et al\.Evaluating the ripple effects of knowledge editing in language models \(MQuAKE\)\.2023\.
- Cohen et al\. \[2024\]R\. Cohen et al\.RippleEdits: A benchmark for ripple effects of model editing\.2024\.
- Meng et al\. \[2022b\]K\. Meng et al\.CounterFact: a counterfactual editing benchmark\.2022\.
- Wu et al\. \[2025\]D\. Wu, J\.\-C\. Gu, K\.\-W\. Chang, and N\. Peng\.Self\-routing RAG: Binding selective retrieval with knowledge verbalization\.arXiv:2504\.01018, 2025\.
- Sun et al\. \[2023\]Z\. Sun et al\.Recitation\-augmented language models\.ICLR 2023\.
- Qwen Team \[2025a\]A\. Yang, B\. Yang, B\. Zhang, et al\.Qwen2\.5 technical report\.arXiv:2412\.15115, 2025\.
- Qwen Team \[2025b\]A\. Yang, A\. Li, B\. Yang, et al\.Qwen3 technical report\.arXiv:2505\.09388, 2025\.
- Grattafiori et al\. \[2024\]A\. Grattafiori, A\. Dubey, A\. Jauhri, et al\.The Llama 3 herd of models\.arXiv:2407\.21783, 2024\.
- Jiang et al\. \[2023\]A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, et al\.Mistral 7B\.arXiv:2310\.06825, 2023\.
- DeepSeek\-AI \[2024\]DeepSeek\-AI\.DeepSeek\-V3 technical report\.arXiv:2412\.19437, 2024\.
- Reimers & Gurevych \[2019\]N\. Reimers and I\. Gurevych\.Sentence\-BERT: Sentence embeddings using Siamese BERT\-networks\.EMNLP\-IJCNLP 2019\.
- Wang et al\. \[2020\]W\. Wang, F\. Wei, L\. Dong, H\. Bao, N\. Yang, and M\. Zhou\.MiniLM: Deep self\-attention distillation for task\-agnostic compression of pre\-trained transformers\.NeurIPS 2020\.
- Zheng et al\. \[2023\]L\. Zheng, W\.\-L\. Chiang, Y\. Sheng, et al\.Judging LLM\-as\-a\-judge with MT\-Bench and Chatbot Arena\.NeurIPS 2023 Datasets and Benchmarks\. arXiv:2306\.05685\.
- Packer et al\. \[2023\]C\. Packer, S\. Wooders, K\. Lin, et al\.MemGPT: Towards LLMs as operating systems\.arXiv:2310\.08560, 2023\.
- Chhikara et al\. \[2025\]P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav\.Mem0: Building production\-ready AI agents with scalable long\-term memory\.arXiv:2504\.19413, 2025\.
- Xu et al\. \[2025\]W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang\.A\-MEM: Agentic memory for LLM agents\.arXiv:2502\.12110, 2025\.
- Rasmussen et al\. \[2025\]P\. Rasmussen, P\. Paliychuk, T\. Beauvais, J\. Ryan, and D\. Chalef\.Zep: A temporal knowledge graph architecture for agent memory\.arXiv:2501\.13956, 2025\.
- Li et al\. \[2025\]Z\. Li, S\. Song, H\. Wang, et al\.MemOS: An operating system for memory\-augmented generation in large language models\.arXiv:2505\.22101, 2025\.
- Wang et al\. \[2026\]S\. Wang, E\. Yu, O\. Love, T\. Zhang, T\. Wong, S\. Scargall, and C\. Fan\.MemMachine: A ground\-truth\-preserving memory system for personalized AI agents\.arXiv:2604\.04853, 2026\.
- Hu et al\. \[2026\]C\. Hu, X\. Gao, Z\. Zhou, et al\.EverMemOS: A self\-organizing memory operating system for structured long\-horizon reasoning\.arXiv:2601\.02163, 2026\.
- Patel & Patel \[2025\]D\. Patel and S\. Patel\.ENGRAM: Effective, lightweight memory orchestration for conversational agents\.arXiv:2511\.12960, 2025\.
- Yan et al\. \[2025\]S\. Yan, X\. Yang, Z\. Huang, et al\.Memory\-R1: Enhancing large language model agents to manage and utilize memories via reinforcement learning\.arXiv:2508\.19828, 2025\.
- Yu et al\. \[2026\]Y\. Yu, L\. Yao, Y\. Xie, et al\.Agentic memory: Learning unified long\-term and short\-term memory management for LLM agents\.arXiv:2601\.01885, 2026\.
- Wang et al\. \[2025\]Y\. Wang, R\. Takanobu, Z\. Liang, et al\.Mem\-α\\alpha: Learning memory construction via reinforcement learning\.arXiv:2509\.25911, 2025\.
- Zhang et al\. \[2024\]Z\. Zhang, X\. Bo, C\. Ma, et al\.A survey on the memory mechanism of large language model based agents\.arXiv:2404\.13501, 2024\.
- Wu et al\. \[2025\]Y\. Wu, S\. Liang, C\. Zhang, et al\.From human memory to AI memory: A survey on memory mechanisms in the era of LLMs\.arXiv:2504\.15965, 2025\.
- Du \[2026\]P\. Du\.Memory for autonomous LLM agents: Mechanisms, evaluation, and emerging frontiers\.arXiv:2603\.07670, 2026\.
- Pollertlam & Kornsuwannawit \[2026\]N\. Pollertlam and W\. Kornsuwannawit\.Beyond the context window: A cost\-performance analysis of fact\-based memory vs\. long\-context LLMs for persistent agents\.arXiv:2603\.04814, 2026\.
- Salemi et al\. \[2024\]A\. Salemi, S\. Mysore, M\. Bendersky, and H\. Zamani\.LaMP: When large language models meet personalization\.ACL 2024\. arXiv:2304\.11406\.
- Zhang et al\. \[2024\]Z\. Zhang, R\. A\. Rossi, B\. Kveton, et al\.Personalization of large language models: A survey\.arXiv:2411\.00027, 2024\.
- Liu et al\. \[2025\]J\. Liu, Z\. Qiu, Z\. Li, et al\.A survey of personalized large language models: Progress and future directions\.arXiv:2502\.11528, 2025\.
- Xu et al\. \[2026\]Y\. Xu, Q\. Chen, Z\. Ma, et al\.Toward personalized LLM\-powered agents: Foundations, evaluation, and future directions\.arXiv:2602\.22680, 2026\.
- Mitchell et al\. \[2022a\]E\. Mitchell, C\. Lin, A\. Bosselut, C\. Finn, and C\. D\. Manning\.Fast model editing at scale\.ICLR 2022\.
- Mitchell et al\. \[2022b\] E\. Mitchell, C\. Lin, A\. Bosselut, C\. D\. Manning, and C\. Finn\. Memory\-based model editing at scale\. ICML 2022\.
- Dai et al\. \[2022\] D\. Dai, L\. Dong, Y\. Hao, Z\. Sui, B\. Chang, and F\. Wei\. Knowledge neurons in pretrained transformers\. ACL 2022\.
- Allen\-Zhu & Li \[2025\] Z\. Allen\-Zhu and Y\. Li\. Physics of language models: Part 3\.3, knowledge capacity scaling laws\. ICML 2025\. arXiv:2404\.05405\.
- Li \[2026b\] B\. Li\. Incompressible knowledge probes: Estimating black\-box LLM parameter counts via factual capacity\. arXiv:2604\.24827, 2026\.
- Wang et al\. \[2024\] P\. Wang, Z\. Li, N\. Zhang, et al\. WISE: Rethinking the knowledge memory for lifelong model editing of large language models\. NeurIPS 2024\.
- Yao et al\. \[2023\] Y\. Yao, P\. Wang, B\. Tian, et al\. Editing large language models: Problems, methods, and opportunities\. EMNLP 2023\.
- McCloskey & Cohen \[1989\] M\. McCloskey and N\. J\. Cohen\. Catastrophic interference in connectionist networks: The sequential learning problem\. *Psychology of Learning and Motivation*, 24:109–165, 1989\.
- Kirkpatrick et al\. \[2017\] J\. Kirkpatrick, R\. Pascanu, N\. Rabinowitz, et al\. Overcoming catastrophic forgetting in neural networks\. *PNAS*, 114\(13\):3521–3526, 2017\.
- Shi et al\. \[2024\] H\. Shi, Z\. Xu, H\. Wang, et al\. Continual learning of large language models: A comprehensive survey\. arXiv:2404\.16789, 2024\.
- Guu et al\. \[2020\] K\. Guu, K\. Lee, Z\. Tung, P\. Pasupat, and M\.\-W\. Chang\. REALM: Retrieval\-augmented language model pre\-training\. ICML 2020\.
- Khandelwal et al\. \[2020\] U\. Khandelwal, O\. Levy, D\. Jurafsky, L\. Zettlemoyer, and M\. Lewis\. Generalization through memorization: Nearest neighbor language models\. ICLR 2020\.
- Borgeaud et al\. \[2022\] S\. Borgeaud, A\. Mensch, J\. Hoffmann, et al\. Improving language models by retrieving from trillions of tokens\. ICML 2022\.
- Izacard et al\. \[2023\] G\. Izacard, P\. Lewis, M\. Lomeli, et al\. Atlas: Few\-shot learning with retrieval augmented language models\. *JMLR*, 24\(251\):1–43, 2023\.
- Asai et al\. \[2024\] A\. Asai, Z\. Wu, Y\. Wang, A\. Sil, and H\. Hajishirzi\. Self\-RAG: Learning to retrieve, generate, and critique through self\-reflection\. ICLR 2024\.
- Geva et al\. \[2021\] M\. Geva, R\. Schuster, J\. Berant, and O\. Levy\. Transformer feed\-forward layers are key\-value memories\. EMNLP 2021\.
- Li & Liang \[2021\] X\. L\. Li and P\. Liang\. Prefix\-tuning: Optimizing continuous prompts for generation\. ACL 2021\.
- Zhang et al\. \[2023\] Q\. Zhang, M\. Chen, A\. Bukharin, et al\. AdaLoRA: Adaptive budget allocation for parameter\-efficient fine\-tuning\. ICLR 2023\.
- Dettmers et al\. \[2023\] T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer\. QLoRA: Efficient finetuning of quantized LLMs\. NeurIPS 2023\.

## Appendix A LOCOMO diagnostics: token\-F1 vs\. LLM\-judge

The category breakdown \(Figure [14](https://arxiv.org/html/2606.19172#S5.F14)\) and the underlying per\-cell token\-F1 for every system, scale, and category \(Table [2](https://arxiv.org/html/2606.19172#S5.T2)\) are in the body \(Section [5\.5](https://arxiv.org/html/2606.19172#S5.SS5)\)\. This appendix collects the two diagnostic figures behind the LLM\-judge discussion of Section [5\.5](https://arxiv.org/html/2606.19172#S5.SS5): Figure [19](https://arxiv.org/html/2606.19172#A1.F19) shows that token\-F1 over\-credits Engram relative to an LLM judge, and Figure [20](https://arxiv.org/html/2606.19172#A1.F20) shows that multi\-token Joint\-OPT closes the resulting judge gap\.
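To make the metric mismatch concrete, here is a minimal sketch of a SQuAD\-style token\-F1, assuming lowercase whitespace tokenization\. The exact normalization behind the LOCOMO scores is not given in this appendix, so `token_f1` and the example strings are illustrative only\. It shows how a correct first token followed by a noisy continuation still earns partial credit\.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between two whitespace-tokenized, lowercased strings."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the first token is right, the continuation is noise.
noisy = token_f1("Portland maybe in the rain", "Portland Oregon")
clean = token_f1("Portland Oregon", "Portland Oregon")
```

In this toy case the noisy answer scores 2/7 \(about 0\.29\) rather than 0, which is the kind of partial credit a strict judge would not necessarily grant\.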

![Refer to caption](https://arxiv.org/html/2606.19172v1/x20.png)

Figure 19: LOCOMO token\-F1 vs\. LLM\-judge accuracy across all system $\times$ dense\-size cells\. Retrieval baselines \(red/orange\) sit near the $y = x$ line: their token\-F1 and LLM\-judge scores roughly agree\. User\-as\-Engram OPT and Joint OPT \(blue/cyan\) sit systematically *above* the line, indicating that token\-F1 over\-credits Engram\. The asymmetry is the metric mismatch documented in Section [5\.5](https://arxiv.org/html/2606.19172#S5.SS5)\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x21.png)

Figure 20: LOCOMO single\-hop under an LLM judge \(Qwen2\.5\-14B\), matched multi\-token pipeline\. First\-token OPT lets token\-F1 over\-credit Engram for a correct first token over a noisy continuation \(Figure [19](https://arxiv.org/html/2606.19172#A1.F19)\); training the *full* answer with multi\-token Joint\-OPT fixes this and overtakes the best retrieval baseline from d12@1280 upward \(\+27% at 625 M, \+29% at 1\.22 B\)\. Below that cross\-over the dense backbone is too small for the continuation to fall into place after the Engram\-anchored prefix\.

## Appendix B Mechanistic analysis

This appendix consolidates the mechanistic evidence that is referenced piecewise in the main text \(locality in Section [4\.3](https://arxiv.org/html/2606.19172#S4.SS3), the multi\-hop decomposition in Section [5\.6](https://arxiv.org/html/2606.19172#S5.SS6)\) into a single account of *how* a per\-user fact becomes a retrievable, leak\-free parametric edit\. The picture is deliberately simple: User\-as\-Engram is not an opaque learned circuit but a content\-addressable memory whose every step is either deterministic \(hashing, gating\) or directly measurable \(the value the row writes, the gate it opens, the locality of its effect\)\. We trace the path of one inserted fact through five observable stages, verifying each on the trained Mini\-Engrams \(d12@1280 and d20, 16 canonical facts\)\. All numbers below are on the trained models unless stated\.

#### \(1\) Addressing: a fact’s trigger N\-gram is a hash key\.

A fact is written at the row\(s\) that its trigger suffix N\-gram hashes to, via the deterministic multiplicative\-XOR hash of the Engram architecture \[Cheng et al\., [2026](https://arxiv.org/html/2606.19172#bib.bib6)\]\. The address is a pure function of surface tokens \(no learning, no per\-user state\), which is why two users’ edits never interfere unless their *distinguishing* tokens collide\. A collision audit on the randomly\-initialized Engram demonstration code \[Cheng et al\., [2026](https://arxiv.org/html/2606.19172#bib.bib6)\] measures this directly: across 100 synthetic users $\times$ 30 facts, pairwise overlap on user\-distinguishing key tokens \(Patel, Portland, 1991\) is just 4\.5%, and a 30\-fact user occupies 0\.027% of the demo table \(the table is 99\.97% empty after one user\)\. Addressing is sparse and content\-keyed, so raw table capacity is not the limiting factor \(Section [O\.2](https://arxiv.org/html/2606.19172#A15.SS2)\)\.
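The addressing step can be sketched as follows\. This is a minimal illustration of a multiplicative\-XOR hash over a suffix N\-gram, not the Engram implementation: the multipliers, the 32\-bit truncation, the table size, and the name `ngram_row` are all assumptions made for the example\.

```python
# Hypothetical odd multipliers, one per N-gram position (not the real Engram constants).
MULTIPLIERS = (0x9E3779B1, 0x85EBCA6B, 0xC2B2AE35)
TABLE_ROWS = 50_000  # illustrative table size


def ngram_row(token_ids, n=3, table_rows=TABLE_ROWS):
    """Map the suffix N-gram of a token sequence to a table row, deterministically."""
    if not 1 <= n <= len(MULTIPLIERS):
        raise ValueError("n must be between 1 and len(MULTIPLIERS)")
    suffix = token_ids[-n:]
    h = 0
    for tok, mult in zip(suffix, MULTIPLIERS):
        # Multiply, truncate to 32 bits, then XOR-combine the positions.
        h ^= (tok * mult) & 0xFFFFFFFF
    return h % table_rows


# Two sequences that end in the same trigger 3-gram address the same row,
# regardless of what precedes the trigger.
row_a = ngram_row([5, 17, 42, 99])
row_b = ngram_row([1, 17, 42, 99])
```

Because the address depends only on the trailing token ids, no learned or per\-user state enters the lookup, which is the property the collision audit above relies on\.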

#### \(2\) The write opens its own gate and injects exactly its value path\.

A row reaches the residual stream only through the gated value $\alpha_t\,W_V e$, so a write has to do two things, and we can watch both\. *It opens its own gate\.* Because the gate’s key is $W_K e$, writing the row raises the trigger\-position gate from $\alpha \approx 0.015$ \(an untrained synthetic trigger barely fires\) to $\alpha \approx 0.99$ under the deployed OPT strategy \(0\.59 under closed\-form UNEMBED\_P\), while every non\-trigger position stays at $\alpha \approx 0.03$–$0.04$ \(identical on d12@1280 and d20\)\. *It injects its value path\.* The residual change the deployed OPT row induces at the trigger has cosine 0\.998 \(d12@1280\) / 0\.999 \(d20\) with the analytic $W_V e$ projection: the gate and short convolution scale the injection but do not redirect it\. This reproduces, on the trained model and for the strategy we deploy, the read/write existence proof on the random\-init demonstration code \(where writing a marker into one trigger’s rows moved the trigger position by 116\.96 vs\. a mean 0\.65 elsewhere, cosine 0\.998 to the predicted $W_V$ projection\)\. \(The small\-magnitude closed\-form marker is dominated by the convolution nonlinearity, so its *direction* is noisier; OPT drives the row norm up, $99 \to 132$, until the value path dominates\.\) The trained model corroborates this from the behavioral side via the sensitivity asymmetry \(Section [5\.1](https://arxiv.org/html/2606.19172#S5.SS1)\): ablating the Engram pathway degrades *factual* top\-5 recall \(93\.3% retained\) while *reading* comprehension is untouched \(100% retained\), a 6\.7 pp asymmetry that localises factual content to the Engram pathway\.
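The two effects of a write can be illustrated with a toy gated read\. This is a sketch under stated assumptions: the gate is taken to be a biased sigmoid of a scaled dot product between the hidden state and $W_K e$, the short convolution is omitted, and all matrices and vectors are made\-up 2\-dimensional values\. With the convolution left out, the injected change is exactly parallel to $W_V e$\.

```python
import math


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def gated_value(hidden, row, W_K, W_V, bias=4.0):
    """Return (alpha, residual change) for one position reading one row."""
    key = matvec(W_K, row)    # gate key W_K e
    value = matvec(W_V, row)  # value path W_V e
    score = dot(hidden, key) / math.sqrt(len(hidden)) - bias
    alpha = 1.0 / (1.0 + math.exp(-score))  # assumed sigmoid gate
    return alpha, [alpha * v for v in value]


# Toy parameters (assumptions, not trained weights).
W_K = [[2.0, 0.0], [0.0, 2.0]]
W_V = [[0.0, 1.0], [1.0, 0.0]]
hidden = [1.0, 0.5]

alpha_before, _ = gated_value(hidden, [0.01, 0.01], W_K, W_V)  # near-empty row
written_row = [3.0, 3.0]
alpha_after, delta = gated_value(hidden, written_row, W_K, W_V)
alignment = cosine(delta, matvec(W_V, written_row))
```

Writing the row changes the key the gate sees, so the same write that supplies the value also raises $\alpha$; the gate then only rescales the value path\.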

#### \(2b\) What the row *encodes* vs\. what it *optimizes*\.

A subtler reading concerns the row’s relation to the gold token\. The closed\-form UNEMBED\_P row points its value path straight at the gold token’s unembedding \(cosine 0\.59–0\.65 to $\text{lm\_head}[y]$\): “the row stores the gold value” is then literally true\. Gradient OPT and Joint\-OPT trade some of that direct alignment \(cosine drops to 0\.16–0\.24\) for higher recall, shaping the *full* next\-token head rather than only the gold unembedding direction\. The interpretability is therefore in the exact value\-path injection \(stage 2\), not in the row equalling a single unembedding vector: the deployed row is a value the network reads cleanly, optimized against the true objective\.

#### \(3\) Locality is exact—and it comes from addressing, not from the gate\.

It is tempting to attribute locality to the gate \(“it fires only on the trigger”\), but the gate measurement above shows the gate does *not* single out an arbitrary synthetic trigger before the write\. Locality has a simpler source: we write only the trigger’s rows, and the trigger’s suffix $N$\-gram is unique in the sequence, so every other position reads unchanged rows and its output is bit\-identical\. We verify the consequence not on one fact but across *all 16 canonical facts and both insertion strategies*: the maximum residual change over every non\-trigger position and every layer is 0\.000, and the change before the Engram layer is 0\.000, exact through every subsequent attention/MLP layer \(the trigger position itself moves by order 1\)\. Figure [7](https://arxiv.org/html/2606.19172#S4.F7) contrasts this with a per\-user LoRA fit to the *same* fact, whose effect is nonzero at every position and every layer and which perturbs unrelated text by mean 107\. Because the edit is *addressed*, a per\-user write contaminates no unrelated forward pass \($\Delta$bpb $+0.00005$ vs\. a global LoRA’s $+1.78$; Section [3](https://arxiv.org/html/2606.19172#S3)\), and cross\-user leakage is zero by design \(Section [4](https://arxiv.org/html/2606.19172#S4)\)\.
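The addressing argument for locality can be checked on a toy lookup\. In this sketch a tuple of tokens stands in for the hash address, the table is a plain dictionary, and only the lookup is modelled \(no attention or MLP layers\); the sentence and the written value are hypothetical\.

```python
def suffix_key(tokens, pos, n=2):
    """Suffix N-gram ending at position pos (a tuple stands in for the hash address)."""
    return tuple(tokens[max(0, pos - n + 1): pos + 1])


def read_all(tokens, table, n=2):
    """Value each position reads from the table; unwritten rows read as 0.0."""
    return [table.get(suffix_key(tokens, pos, n), 0.0) for pos in range(len(tokens))]


tokens = ["my", "favorite", "spice", "is"]
base_table = {}
before = read_all(tokens, base_table)

edited_table = dict(base_table)
edited_table[("spice", "is")] = 1.0  # write only the trigger's row
after = read_all(tokens, edited_table)

# Positions whose read value differs between the base and the edited table.
changed = [pos for pos, (b, a) in enumerate(zip(before, after)) if b != a]
```

Only the trigger position \(index 3\) reads a different value; every other position addresses a row that was never written, so its read is unchanged by construction rather than by any gating decision\.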

#### \(4\) Depth: the edit lands where the model has already “deepened”\.

Engram rows are read at a late Engram layer, after the network has effectively committed to a next\-token distribution\. The LogitLens trace \(Figure [21](https://arxiv.org/html/2606.19172#A2.F21)\) shows Mini\-Engram\-d8 converging to its output distribution *faster* than the base \(layer\-3 KL is 3\.66 lower\), reproducing [Cheng et al\.](https://arxiv.org/html/2606.19172#bib.bib6)’s “effective deepening” at our scale\. A row therefore overrides a near\-final prediction at exactly the position where it is decoded, rather than steering an early, still\-malleable representation \(the place where a global LoRA change does its damage\)\. We make this dependence *causal* rather than merely correlational by re\-running the insertion at the model’s *early* Engram layer instead of the late one \(layer 2 vs\. layer 11 on d20, layer 2 vs\. 7 on d12@1280; identical OPT\-15 budget\): top\-1 recall collapses from 1\.00 at the late layer to 0\.25 \(d20\) / 0\.31 \(d12@1280\) at the early layer, because an early injection must survive the rest of the transformer stack and largely washes out\. The depth at which the Engram reads is not incidental: it is where a value\-path edit can override the prediction\.

#### \(5\) Limit: the gate is a surface\-N\-gram hash, so it cannot compose across triggers\.

The same mechanism that makes the edit local also bounds what it can do\. The multi\-hop decomposition \(Section [5\.6](https://arxiv.org/html/2606.19172#S5.SS6), Figure [15](https://arxiv.org/html/2606.19172#S5.F15)\) is the cleanest mechanistic readout in the paper: on a balanced 63\-pair corpus, chained queries succeed 91% of the time when the query’s suffix N\-gram *overlaps* the second fact’s trigger, but only 13% \($\lesssim 10\%$ after filtering first\-fact\-wins coincidences\) when true composition is required, even though per\-fact direct recall is 99\.2% in both subsets\. The near\-bimodal split is exactly what a surface\-keyed lookup predicts: the gate fires on a matching trigger and returns the stored token, or it does not fire and the gold token is far from the top\. There is no chaining circuit to find because, by design, there is none\.

#### Synthesis\.

The five stages compose into one sentence: *User\-as\-Engram hashes a fact’s surface trigger to a sparse row whose write both opens its own gate \($\alpha$: $0.02 \to 0.99$\) and injects exactly its value path \(cosine 0\.999 to the predicted projection\), at a late, already\-deepened layer \(early insertion recall collapses $1.00 \to 0.25$\), with the change exactly 0\.000 at every other position, and therefore retrieves single facts with zero contamination but cannot compose across facts\.* This is why the method cleanly separates *content* \(the rows\) from *reasoning skill* \(composition, which must come from the shared LoRA in the layered design of Section [6](https://arxiv.org/html/2606.19172#S6)\): the content/skill split the paper proposes is not an engineering convenience but a direct consequence of the mechanism\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x22.png)

Figure 21: LogitLens KL by layer \(Mini\-Engram\-d8 vs\. base\-d8\)\. Engram converges faster at layer 3 \($-3.66$ KL gap\), confirming [Cheng et al\.](https://arxiv.org/html/2606.19172#bib.bib6)’s effective\-deepening claim at our scale \(stage 4 of the mechanism above\)\.

## Appendix C Pretraining curves

Figure [22](https://arxiv.org/html/2606.19172#A3.F22) shows the Mini\-Engram pretraining runs that every later insertion experiment builds on\. Adding the Engram table is not a tax on language modeling: at iso\-FLOPs the d8 Engram model reaches marginally lower validation bits\-per\-byte than the d8 baseline \(0\.924 vs\. 0\.929 at step 4 000\), reproducing [Cheng et al\.](https://arxiv.org/html/2606.19172#bib.bib6)’s effective\-deepening finding at our much smaller scale, and the larger d12 Engram model \(more backbone, longer training\) is substantially better \(0\.849 bpb\)\. These checkpoints are the base models for the row\-insertion results in the body\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x23.png)

Figure 22: Mini\-Engram pretraining curves\. Engram d8 reaches lower validation bpb than base d8 at iso\-FLOPs \(matching Cheng et al\.’s finding at much smaller scale\); Engram d12 is substantially better thanks to the larger backbone and longer training\.

## Appendix D Insertion strategy comparison

Figure [23](https://arxiv.org/html/2606.19172#A4.F23) compares five ways of writing a fact into Engram rows at the hardest density we test \(100 facts/user, Mini\-Engram\-d12\)\. A random marker \(RANDOM, a control\) and the token input embedding \(WTE\) never recover the fact \(0% top\-1\), confirming the recall signal is not an artefact of the metric\. The closed\-form pseudo\-inverse marker \(UNEMBED\_P\) reaches only 6%/18% top\-1/top\-5: it orients the row correctly but the gate suppresses its magnitude\. Fifteen steps of per\-row gradient descent \(OPT\-15\) lift this to 38%/44%, and Joint OPT \(optimizing all touched rows together against the user’s full fact set\) reaches 68%/96%, recovering most of what a per\-user LoRA achieves at this density \(Figure [10](https://arxiv.org/html/2606.19172#S5.F10)\) at $161\times$ less storage\. The earlier single\-fact\-at\-a\-time comparison on the original 16\-fact USER\+ORG benchmark \(Mini\-Engram\-d8\) is reported in Appendix [J](https://arxiv.org/html/2606.19172#A10)\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x24.png)

Figure 23: Top\-1 and top\-5 recall at 100 facts/user, Mini\-Engram\-d12\. RANDOM \(control\) confirms our signal isn’t noise; UNEMBED\_P alone is too weak; OPT\-15 helps; Joint OPT closes most of the gap to full LoRA fine\-tuning at $161\times$ less storage\.

## Appendix E Paraphrase generalization

Figure [24](https://arxiv.org/html/2606.19172#A5.F24) tests whether a fact written at one trigger phrasing answers the same question asked differently, across five held\-out paraphrases per fact\. Single\-trigger insertion \(write once, at the canonical phrasing\) averages 50% top\-1: 4/5 for the “doctor” fact, 3/5 for “Globex hours”, 2/5 for “Stark HQ”, and 1/5 for “spice”\. Generalization tracks suffix $N$\-gram overlap: paraphrases that end in the same tokens as the trained trigger transfer for free, while divergent surface forms \(e\.g\. “I love the spice” vs\. “my favorite spice is”\) do not\. Multi\-trigger insertion \(write the fact at all five phrasings, at $5\times$ the per\-fact OPT cost\) drives every paraphrase to top\-1\. The full per\-paraphrase rank data are tabulated in Appendix [K](https://arxiv.org/html/2606.19172#A11)\.
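The overlap criterion can be sketched as a simple check on the trailing tokens\. This is an illustration only: it assumes whitespace tokenization and a fixed $N = 2$, whereas the model operates on its own tokenizer and N\-gram orders, and the first paraphrase string is made up for the example\.

```python
def shares_suffix_ngram(trigger: str, paraphrase: str, n: int = 2) -> bool:
    """True when both phrasings end in the same n tokens (lowercased, whitespace-split)."""
    trigger_tokens = trigger.lower().split()
    paraphrase_tokens = paraphrase.lower().split()
    if len(trigger_tokens) < n or len(paraphrase_tokens) < n:
        return False
    return trigger_tokens[-n:] == paraphrase_tokens[-n:]


trained = "my favorite spice is"
# Hypothetical paraphrase that keeps the trained ending: expected to address the same row.
same_ending = shares_suffix_ngram(trained, "honestly my favorite spice is")
# Divergent surface form quoted in the text: a different suffix, so a different row.
different_ending = shares_suffix_ngram(trained, "I love the spice")
```

Under this reading, multi\-trigger insertion simply writes the fact once per distinct suffix, which is why its cost grows with the number of phrasings covered\.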

![Refer to caption](https://arxiv.org/html/2606.19172v1/x25.png)

Figure 24: Paraphrase generalization across 4 facts $\times$ 5 paraphrases each\. Single\-trigger insertion gets 50% top\-1 for free \(suffix N\-gram overlap\); multi\-trigger insertion \(insert at all 5 paraphrases\) gets 100% at $5\times$ the per\-fact OPT cost\.

## Appendix F Multi\-domain composition

Figure [25](https://arxiv.org/html/2606.19172#A6.F25) stacks $D$ fact domains into one Engram table and reports per\-domain top\-1 as $D$ grows\. The user domain \(user\_v1\) is essentially unaffected by adding disjoint\-template corporate domains: it holds 66% $\to$ 63% $\to$ 64% as org\_v1 and org\_v2 are layered in \($D = 1$ to 3\)\. The collapse appears at $D = 4$, when a second *user\-template* domain \(multi\_user\_0\) is added: because it reuses the same trigger templates, the two user domains collide in the address hash and user\_v1 falls to 8% while the newcomer reads 67%\. Domains that share templates interfere \(the corporate domains also degrade each other, org\_v1 28% $\to$ 18%\), whereas template\-disjoint domains compose without loss, consistent with the corp\+user composition measured directly in Section [5\.3](https://arxiv.org/html/2606.19172#S5.SS3)\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x26.png)

Figure 25: Multi\-domain additive composition heatmap\. Each cell shows per\-domain top\-1 recall when $D$ domains are stacked\. Heavy degradation when domains share trigger templates \(high address overlap\)\. Disjoint\-template domains \(corp \+ user, demonstrated in main text\) compose without loss\.

## Appendix G Serving latency CDF

Figure [26](https://arxiv.org/html/2606.19172#A7.F26) shows the per\-request latency distribution under multi\-tenant serving \(the earlier d12 v2 checkpoint, 339 M, 30 users $\times$ 50 facts, 600 single\-token recall probes\), and Figure [27](https://arxiv.org/html/2606.19172#A7.F27) breaks the median request into its three components\. Applying a user’s override map and restoring the base table afterwards are both cheap and tightly bounded \(median 2\.2 ms, p99 2\.3 ms each\), so the swap adds no latency tail\. The forward pass dominates: a median of 16\.5 ms out of a 23\.2 ms end\-to\-end total \(p99 27\.8 ms\), so per\-user memory adds under 20% overhead on top of a request the model would run anyway\. These absolute numbers are higher than the ablation\-optimal d12@1280 serving run reported in Figure [18](https://arxiv.org/html/2606.19172#S7.F18) \(0\.03 ms apply, 4\.4 ms p50, 226 req/s on an idle GPU\); both the apply cost and the forward pass are faster on that newer benchmark\. The qualitative conclusion is the same in both: the apply/restore swap is a small fraction of per\-request latency, and per\-request work is independent of how many tenants share the server\.
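The apply/forward/restore cycle described above can be sketched as a context manager\. This is a minimal illustration, assuming the table is an indexable sequence of rows and a user’s override map is a dictionary from row index to replacement value; `user_overrides` is a hypothetical name and the forward pass is left out\.

```python
from contextlib import contextmanager


@contextmanager
def user_overrides(table, override_map):
    """Temporarily install one user's rows in the shared table, then restore the base rows."""
    saved = {row: table[row] for row in override_map}  # remember the base rows
    try:
        for row, value in override_map.items():
            table[row] = value  # apply
        yield table
    finally:
        for row, value in saved.items():
            table[row] = value  # restore, even if the request raised


table = [0.0] * 8  # toy base table with scalar rows
with user_overrides(table, {2: 1.5, 5: -0.5}) as live:
    during = list(live)  # the forward pass would run here
after = list(table)
```

Both the apply and the restore touch only the rows in the current user’s map, so in this sketch their cost depends on that user’s fact count and not on how many other tenants exist\.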

![Refer to caption](https://arxiv.org/html/2606.19172v1/x27.png)

Figure 26: Per\-request latency CDF on Mini\-Engram\-d12 \(30 users $\times$ 50 facts $\times$ 600 requests\)\. Override apply \(blue\) and restore \(orange\) are both sub\-3 ms at p99; the forward pass \(grey\) dominates total latency \(black\)\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x28.png)

Figure 27: Per\-request latency breakdown, Mini\-Engram\-d12 \(30 users $\times$ 50 facts $\times$ 600 requests\)\. \(a\) Override\-apply latency: median 2\.2 ms\. \(b\) Median per\-request latency by component: apply \(2\.3 ms\) \+ forward \(16\.5 ms\) \+ restore \(2\.3 ms\); the frozen forward pass dominates, and the override apply\+restore overhead \($\sim$4\.6 ms\) is under 20% of the 23\.2 ms end\-to\-end median\.

## Appendix H Qwen\-3B \+ RAG details

The Qwen\-3B \+ RAG measurements in Section [6\.4](https://arxiv.org/html/2606.19172#S6.SS4) use Qwen/Qwen2\.5\-3B\-Instruct in bf16 with greedy decoding \($\leq 32$ new tokens\)\. The chat\-template prompt is:

```
[SYSTEM]
You are answering personal-memory questions about the user. Use only
the facts provided. Answer with the shortest possible span (typically
1-3 words). Do not explain or add any extra text.

[USER]
Facts about the user:
- <retrieved fact 1>.
- <retrieved fact 2>.
...

Question: <indirect question from indirect_qa>
Answer:
```

Retrieval uses the same all\-MiniLM\-L6\-v2 encoder as the RAG/MEM0/MEMMACHINE baselines of Section [5\.4](https://arxiv.org/html/2606.19172#S5.SS4), with the per\-user fact set as the index\. For each indirect probe the question text is embedded and cosine\-similarity top\-$k$ retrieval returns the facts to put in the context block\. The same 20\-user/20\-probe split as Figure [16](https://arxiv.org/html/2606.19172#S6.F16) is evaluated\.
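The retrieval step and the user turn of the prompt can be sketched as below\. To keep the example self\-contained it uses hand\-written 3\-dimensional vectors in place of all\-MiniLM\-L6\-v2 embeddings, and the facts, the question, and the helper names \(`top_k`, `build_user_turn`\) are hypothetical\.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def top_k(query_vec, fact_vecs, k):
    """Indices of the k facts most cosine-similar to the query, best first."""
    ranked = sorted(range(len(fact_vecs)),
                    key=lambda i: cosine(query_vec, fact_vecs[i]), reverse=True)
    return ranked[:k]


def build_user_turn(facts, question):
    """Fill the [USER] part of the prompt template shown above."""
    lines = ["Facts about the user:"]
    lines += [f"- {fact}." for fact in facts]
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


# Hypothetical per-user index; toy vectors stand in for encoder outputs.
facts = ["The user was born in 1991",
         "The user lives in Portland",
         "The user's doctor is Dr. Patel"]
fact_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.9, 0.1, 0.0]

hits = top_k(query_vec, fact_vecs, k=2)
user_turn = build_user_turn([facts[i] for i in hits], "How old is the user?")
```

In a real run the vectors would come from the encoder and the resulting text would be wrapped in the model’s chat template together with the system message\.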

![Refer to caption](https://arxiv.org/html/2606.19172v1/x29.png)

Figure 28: Indirect\-reasoning accuracy \(indirect\_any\) vs\. avg context tokens per query on the 20\-user split\. The layered design and the shared LoRA alone anchor the 0\-context end; RAG top\-3 \+ shared LoRA lifts accuracy to 54% at 44 context tokens; Qwen\-3B \+ RAG reaches 55–57% at 82–390 tokens on a $\sim 2.5\times$ larger backbone\. Naive RAG on the same Mini\-Engram\-d20 base sits *below* the layered design: putting facts in context is not the same as having the reasoning skill\.

Two answer\-match criteria are reported: *indirect\_top1* requires the first generated word to equal \(case\-insensitive\) the first word of the gold answer; *indirect\_any* requires the gold span to occur anywhere in the generation\. This matches the substring matcher used for the layered design\.
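The two criteria can be written down directly\. This is a sketch under two assumptions not stated in the text: words are split on whitespace without stripping punctuation, and the substring test in `indirect_any` is case\-insensitive\. The example strings are made up\.

```python
def indirect_top1(generation: str, gold: str) -> bool:
    """First generated word equals the first gold word, ignoring case."""
    gen_words = generation.split()
    gold_words = gold.split()
    if not gen_words or not gold_words:
        return False
    return gen_words[0].lower() == gold_words[0].lower()


def indirect_any(generation: str, gold: str) -> bool:
    """Gold span occurs anywhere in the generation (case-insensitive by assumption)."""
    return bool(gold) and gold.lower() in generation.lower()


# A generation that contains the answer but does not lead with it.
lenient_only = (indirect_top1("In Portland", "Portland"),
                indirect_any("In Portland", "Portland"))
# A generation that leads with the answer satisfies both criteria.
both = (indirect_top1("portland Oregon", "Portland"),
        indirect_any("portland Oregon", "Portland"))
```

The first case passes only the lenient criterion, which is why indirect\_any is never lower than indirect\_top1 on the same generations when the gold answer is a single word\.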

#### Why oracle\-top\-1 underperforms RAG top\-1\.

A non\-obvious result in the Qwen\-3B \+ RAG experiment is that oracle top\-1 retrieval \(using the known required facts to fetch the ground\-truth fact\) gives 46% indirect\_any, *below* RAG top\-1’s 55%\. Inspecting the data, retrieval often returns facts that are *adjacent* to the required key in semantic space and that happen to also contain the answer surface \(e\.g\. retrieving the “birth\-year” fact when asked about “current age” covers both the year and the implicit age\)\. RAG with $k = 1$ thus over\-samples the information needed for indirect reasoning, while strict oracle retrieval leaves the LM to derive the answer from one single\-attribute fact, which Qwen\-3B often refuses to do under our “shortest span” system prompt\.

## Appendix I Alternative architectures we considered

For completeness we report two alternative designs we benchmarked before settling on per\-user override tables\.

#### Shared table with per\-user hash salt \(deprecated\)\.

In this design, every user’s facts coexist in one physical table; their hashes are XOR\-salted by user\-id so identical surface triggers map to disjoint rows\. We benchmark on Mini\-Engram\-d12 \(Table [4](https://arxiv.org/html/2606.19172#A9.T4)\):

Table 4: Shared\-table\-with\-salt: own\-fact recall vs\. leak rate\. All numbers on Mini\-Engram\-d12 with the XL corpus\.

The salt design provides architectural privacy at the cost of recall: salted addresses point to untrained rows \(the model never saw salt during pretraining\), so the gate fires more weakly and OPT must work harder\. Because per\-user override tables \(main body Section [7](https://arxiv.org/html/2606.19172#S7)\) provide stronger privacy by design *and* use the trained address space, we recommend overrides as the production design and keep salt as a fallback for cases where the storage backend cannot afford per\-user override maps\.
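The salted addressing can be sketched by XOR\-ing a per\-user value into the N\-gram hash before reducing it to a row index\. Everything concrete here is an assumption for illustration: the multipliers, deriving the salt by multiplying the user id, the power\-of\-two table size, and the names `base_hash` and `salted_row`\.

```python
MULTIPLIERS = (0x9E3779B1, 0x85EBCA6B, 0xC2B2AE35)  # hypothetical hash constants
SALT_MULTIPLIER = 0x9E3779B1                          # hypothetical salt derivation
TABLE_ROWS = 1 << 16                                  # illustrative power-of-two table


def base_hash(token_ids):
    """Unsalted multiplicative-XOR hash of the trailing N-gram (32-bit)."""
    h = 0
    for tok, mult in zip(token_ids[-len(MULTIPLIERS):], MULTIPLIERS):
        h ^= (tok * mult) & 0xFFFFFFFF
    return h


def salted_row(token_ids, user_id, table_rows=TABLE_ROWS):
    """Row index after XOR-ing a user-specific salt into the hash."""
    salt = (user_id * SALT_MULTIPLIER) & 0xFFFFFFFF
    return (base_hash(token_ids) ^ salt) % table_rows


trigger = [17, 42, 99]  # the same surface trigger for both users
row_user1 = salted_row(trigger, user_id=1)
row_user2 = salted_row(trigger, user_id=2)
```

In this toy the two users land on different rows for the same trigger\. With any finite table, collisions between users remain possible in general, and the salted rows are ones the pretrained model never addressed, which matches the recall cost described above\.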

#### 100\-user serving: UNEMBED\_P vs\. Joint OPT\.

At 100 users $\times$ 100 facts on Mini\-Engram\-d12@1280, the cheap \(closed\-form\) UNEMBED\_P strategy reaches 9\.2% top\-1 / 23\.4% top\-5 at 2\.3 ms apply latency\. Joint OPT \(1000 steps, the production setting\) on the same configuration reaches 62% top\-1 / 96% top\-5 at 226 req/s and 4\.4 ms p50 latency on an idle GPU \(Figure [18](https://arxiv.org/html/2606.19172#S7.F18)\), matching the $\sim$60%/95% prediction extrapolated from the 30\-user data and consistent with the per\-user\-density curve at d12@1280 \(Section [5\.2](https://arxiv.org/html/2606.19172#S5.SS2): 67% top\-1 / 99% top\-5 at 100 facts/user\)\. The serving\-protocol cross\-user leak under Joint OPT is 6\.1% from gold\-value coincidence \(7 of 114 cross\-user probes\), versus 46% under the all\-coresident protocol where every user’s overrides occupy the same table at once\.

## Appendix J Insertion strategies \(full\)

Table [5](https://arxiv.org/html/2606.19172#A10.T5) reports the first comparison of insertion strategies on Mini\-Engram\-d8 with the original 16\-fact USER\+ORG benchmark \(single\-fact\-at\-a\-time\):

Table 5: Insertion strategy comparison on the original 8 USER \+ 8 ORG benchmark, Mini\-Engram\-d8\. RANDOM is a control; UNEMBED\_P is the closed\-form pseudo\-inverse; OPT is 15\-step gradient on the row\. Numbers are mean across the 16 facts\.

The DUAL strategy \(jointly satisfying gate\-K and value\-V via least squares\) was tested, but proved *worse* than UNEMBED\_P \(4/16 $+\Delta$logit\): over\-constraining $e$ drives it off the trained\-row distribution and the gate suppresses it\. We dropped DUAL\.

## Appendix K Per\-paraphrase rank table

Table [6](https://arxiv.org/html/2606.19172#A11.T6) gives the full per\-paraphrase rank data behind Section [5\.4](https://arxiv.org/html/2606.19172#S5.SS4.SSS0.Px3) and Figure [24](https://arxiv.org/html/2606.19172#A5.F24), single\-trigger insertion on Mini\-Engram\-d8:

Table 6: Per\-paraphrase rank of the gold token after single\-trigger OPT insertion\. “$\star$” marks rank 0 \(top\-1\)\.

Generalization is high when paraphrases share suffix N\-grams \(e\.g\., all “doctor” paraphrases end in “Dr\.”\); poor when the surface form diverges \(“I love the spice” vs\. “My favorite spice is”\)\. Inserting at all 5 paraphrases per fact \(multi\-trigger\) drives every paraphrase to rank 0\.

## Appendix L Joint OPT convergence

Figure [29](https://arxiv.org/html/2606.19172#A12.F29) traces the Joint OPT optimization behind the density curves of Section [5\.2](https://arxiv.org/html/2606.19172#S5.SS2)\. At 100 facts/user the joint objective converges to $\approx$0\.8 cross\-entropy within $\sim$1 500 steps; at 1 000 facts/user it plateaus near 2\.0, the higher floor reflecting the within\-user address interference that caps recall at high density\. Convergence is monotone and stable in both regimes, so the fixed step budgets we use \(2 000 steps at 100 facts, 8 000 at 1 000\) leave no easy gains on the table\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x30.png)

Figure 29: Joint OPT convergence on Mini\-Engram\-d12\. At 100 facts/user, loss converges to 0\.8 within 1500 steps; at 1000 facts, loss saturates around 2\.0 due to higher within\-user fact\-density interference \(Section [5\.2](https://arxiv.org/html/2606.19172#S5.SS2)\)\.

## Appendix M Joint OPT algorithm

1. For each fact $f$: compute the trigger’s global rows $R_f$ \(deterministic from tokens\) and the UNEMBED\_P initial marker $e_f^{(0)} = W_V^{\dagger} U_{y_f}$\.
2. Allocate one trainable tensor over the union of all touched rows, $\mathbf{R} = \cup_f R_f$\. Initialize each row as $\mathbf{R}[i] = \mathrm{mean}_{f : i \in R_f}\, e_f^{(0)}[i]$ \(the average of the UNEMBED\_P rows where multiple facts share an address\)\.
3. Zero out the embedding rows in $\mathbf{R}$\. Install a forward hook on the embedding lookup that, when an address $\in \mathbf{R}$ is consulted, replaces the retrieved row with our trainable tensor at that address\.
4. For $K$ steps: sample a random fact $f$, compute the cross\-entropy loss on $y_f$ at the trigger position, and backpropagate only to $\mathbf{R}$\.
5. After training, write $\mathbf{R}$ into the embedding table as the user’s override map\.

We use $K = 2000$ for 100 facts and $K = 8000$ for 1000 facts, Adam LR 0\.5\. Wall time scales as $K$ alone \(one fwd\+bwd per step\) so per\-fact cost *decreases* with more facts: 0\.44 s/fact at $N = 100$, 0\.23 s/fact at $N = 1000$\.
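The five steps above can be sketched on a toy readout\. This is not the authors’ implementation and it simplifies heavily: $W_V$ is taken to be the identity \(so the UNEMBED\_P marker is just the gold unembedding vector, reused for every row of a fact\), the unembedding is a 4\-token identity matrix, a dictionary of trainable rows replaces the forward hook, and plain SGD stands in for Adam\. All names and sizes are illustrative\.

```python
import math
import random

DIM = 4
# Toy unembedding U (vocab of 4, one row per token); an assumption, not a trained head.
U = [[1.0 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]
# Each fact: (touched row addresses R_f, gold token id y_f). Row 1 is shared by two facts.
FACTS = [((0, 1), 0), ((1, 2), 1), ((3,), 2)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probs(rows, addresses):
    """Toy value path: sum the touched rows (W_V = I), then unembed and normalize."""
    h = [sum(rows[a][k] for a in addresses) for k in range(DIM)]
    return softmax([sum(U[v][k] * h[k] for k in range(DIM)) for v in range(len(U))])


def mean_loss(rows):
    return sum(-math.log(probs(rows, r)[y]) for r, y in FACTS) / len(FACTS)


def predict(rows, addresses):
    p = probs(rows, addresses)
    return max(range(len(p)), key=lambda v: p[v])


def joint_opt(steps=600, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Steps 1-3: UNEMBED_P init (U_y, since W_V = I), averaged where facts share a row.
    # The dict holds only the touched rows, playing the role of the hooked tensor R.
    rows = {}
    for a in sorted({a for r, _ in FACTS for a in r}):
        owners = [U[y] for r, y in FACTS if a in r]
        rows[a] = [sum(vec[k] for vec in owners) / len(owners) for k in range(DIM)]
    initial = mean_loss(rows)
    # Step 4: sample a fact, take one cross-entropy gradient step on its rows only.
    for _ in range(steps):
        addresses, y = rng.choice(FACTS)
        p = probs(rows, addresses)
        grad_logits = [p[v] - (1.0 if v == y else 0.0) for v in range(len(U))]
        grad_h = [sum(U[v][k] * grad_logits[v] for v in range(len(U))) for k in range(DIM)]
        for a in addresses:
            rows[a] = [w - lr * g for w, g in zip(rows[a], grad_h)]
    # Step 5: the returned rows are the user's override map.
    return rows, initial, mean_loss(rows)


override_map, initial_loss, final_loss = joint_opt()
recall = sum(predict(override_map, r) == y for r, y in FACTS) / len(FACTS)
```

Because every step is a single forward and backward pass on one sampled fact, the run time of this loop depends on the step count and not on the number of facts, which is the scaling the wall\-time remark above describes\.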

## Appendix N Seed variance of the layered measurement

The headline layered\-design vs\. per\-user\-LoRA numbers in Figure [16](https://arxiv.org/html/2606.19172#S6.F16) are from a single LoRA training run; for transparency, we re\-ran the same configuration with two further RNG states and compared \(Table [7](https://arxiv.org/html/2606.19172#A14.T7)\)\. Because the training\-data ordering, LoRA\-weight initialization, and Adam\-state warmup are stochastic, per\-user val\_bpb $\Delta$ varies across seeds even with identical hyperparameters\.

Table 7: Three\-seed variance on the layered\-design vs\. per\-user\-LoRA contrast, Mini\-Engram\-d20, $n = 20$ users per seed, identical config across seeds \(rank\-64 LoRA, 1500 steps, lr 5e\-4, lora\_alpha 128; per\-user Engram J\-OPT 1500 steps lr 0\.5; eval\_tokens 524288; seeds differ only in LoRA\-weight initialization and Adam\-state / data\-ordering RNG\)\. The architectural property \(per\-user Engram $\Delta$bpb at noise floor $\sim 10^{-5}$\) is seed\-invariant; the LoRA contamination magnitude varies by $\sim$13% and the layered design’s indirect\_any varies by $\sim$10 pp across seeds while the qualitative ranking \(layered $\gg$ LoRA on indirect, LoRA $\gg$ layered on contamination\) is preserved in every seed\.

Reading: the architectural\-property comparison \(the per\-user Engram’s $\Delta$bpb $\ll$ per\-user LoRA’s by $\sim 33{,}000\times$\) is seed\-stable to four decimal places: every seed’s Engram is at the noise floor and every seed’s LoRA is in $[+1.56, +1.78]$\. The conclusion that the layered design beats per\-user LoRA on every measure is also seed\-stable: the layered design is never worse than the no\-edit base on indirect \(0/60 across seeds\), while per\-user LoRA is worse in 49/60 = 82%\. The exact magnitudes vary modestly: seed\-to\-seed layered indirect\_any spans 34\.8%–44\.5% \(mean 41\.2%\), and the layered\-vs\-LoRA indirect ratio spans $4.8\times$ \(S2\) to $7.4\times$ \(S0\)\. The body headlines the canonical seed S0 \($7.4\times$, 44% vs\. 6%\) and cites the 3\-seed mean \($5.6\times$, 41% vs\. 7%\) alongside it; this appendix makes the per\-seed variance explicit so readers can see which numbers depend on which seed\.

## Appendix O Extended experiments and ablations

This appendix collects the capacity, fact\-count, dense\-size, and rank ablations, and the layered\-architecture robustness studies, that the main text summarizes in figures\.

### O\.1 Multi\-fact\-in\-the\-loss finetune, and the teacher\-trace negative result

We test the multi\-fact\-in\-the\-loss recipe via a continuation finetune of d12@1280 that injects many synthetic foreign rows per batch and adapts the gate, value projection, and dense backbone to handle them, while general LM ability is preserved \(val\_bpb unchanged\)\. At a fixed inference budget it improves recall at high density \(\+14% relative at $n = 1000$, 8k steps\); with the budget unpinned \(12k steps\) both the baseline and the multi\-fact \(MF\) model reach $\sim$34\.5% at $n = 1000$ \(Figure [30](https://arxiv.org/html/2606.19172#A15.F30)\)\. So MF accelerates Joint\-OPT convergence rather than raising the asymptotic ceiling: an economic win, not an architectural one\. The teacher\-trace variants \(245 Qwen3\-8B \[Qwen Team, [2025b](https://arxiv.org/html/2606.19172#bib.bib52)\] traces of observation/statement/QA/$\langle$think$\rangle$\-reasoning\) cut contamination \($\Delta$bpb $+0.386 \to +0.233$ / $+0.196$\) but lowered indirect\_any \($44.5\% \to 38.5\%$ / $28.0\%$; bootstrap 95% CIs strictly negative\), because that trace format scores a $\langle$think$\rangle$ block while our eval reads only the top answer token\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x31.png)

Figure 30: Multi\-fact\-in\-the\-loss finetune vs\. baseline d12@1280 on the Joint\-OPT density curve \(matched OPT\-step schedule\)\. At a fixed inference budget MF improves recall at high density \(\+14% relative at $n = 1000$, 8k steps\)\. With the budget unpinned \(12k steps\) both reach $\sim$34\.5% at $n = 1000$: MF *accelerates convergence* rather than raising the asymptotic ceiling\. val\_bpb is preserved \(0\.774 vs\. 0\.770\)\.

### O\.2 Engram capacity ablation

How does Engram\-table size interact with token budget at a fixed dense backbone? We probe this as a $5 \times 3 \times 2$ matrix: five capacity levels \($v$ slots per N\-gram, $e$ total embed\-dim per N\-gram; Table [8](https://arxiv.org/html/2606.19172#A15.T8)\), three token budgets, two dense sizes \(Figure [31](https://arxiv.org/html/2606.19172#A15.F31)\)\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x32.png)

Figure 31: Engram capacity $\times$ token\-budget response surface, at Mini\-Engram\-d8 v2 \(left\) and d12 v2 \(right\)\. Top row: E1 USER OPT top\-1 \(100\-fact recall\)\. Bottom row: LOCOMO Joint OPT F1\. The same *large* \(50 K $\times$ 256\) Engram size wins at the highest token budget for *both* dense scales; *tiny* is under\-capacity, *xlarge* is over\-provisioned\. The optimum scales in tokens, not capacity\.

Table 8: Capacity levels used in the ablation\. Total Engram params $= 4 \cdot v \cdot e$ \(2 N\-gram orders $\times$ 2 layers\)\.

#### Three regularities\.

\(i\) *Capacity has an optimum\.* Both endpoints under\-perform: *tiny* cannot address enough distinct facts; *xlarge* under\-trains each slot\. \(ii\) *The optimum is the same Engram size at both dense scales*: *large* \(50 K × 256 = 51\.2 M params\) wins at the highest token budget for both d8 and d12\. The optimum scales in tokens, not in capacity\. \(iii\) *Below the catch\-up token budget, smaller Engrams win\.* At d8 with 0\.5 B tokens, *small* \(10 M\) beats *large* \(51\.2 M\) on ins\-OPT, 1\.00 vs\. 0\.62: the larger table is under\-trained per slot\. The cross\-over occurs around 1 B tokens for d8 and 1\.32 B tokens for d12\.

#### Practical rule\.

For an Engram pretrained at Karpathy 12 t/p of scaling\-params, use the *large* \(50 K × 256\) Engram table from d12@768 onward\. For d8\-class models with ≤ 1 B training tokens, use *small* \(20 K × 128\) instead\. Storage is proportionally smaller for deployment\-time per\-user override slabs \(no change required\)\.
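As a quick check of the capacity arithmetic, the sketch below applies the Table 8 formula \(total Engram params $= 4 \cdot v \cdot e$\) to the two configurations named in the text. The helper name `engram_params` and its keyword arguments are illustrative assumptions, and "K" is read as 1,000; only the $(v, e)$ pairs for *small* and *large* come from the source.

```python
def engram_params(v: int, e: int, ngram_orders: int = 2, layers: int = 2) -> int:
    """Total Engram table parameters: orders * layers * v * e (= 4*v*e here)."""
    return ngram_orders * layers * v * e

# (v, e) pairs quoted in the text; the remaining capacity levels of Table 8
# are not reproduced in this section, so they are left out.
configs = {"small": (20_000, 128), "large": (50_000, 256)}

for name, (v, e) in configs.items():
    print(f"{name}: {engram_params(v, e) / 1e6:.2f} M params")
# small: 10.24 M params
# large: 51.20 M params
```

Both values agree with the 10 M and 51\.2 M figures quoted in the regularities above.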

### O\.3 Fact\-count scaling: per\-fact independent OPT to 1000 facts

Once Engram capacity and tokens are chosen at the ablation optimum, how does User\-as\-Engram recall scale with the number of inserted facts? We probe by per\-fact *independent* OPT insertion \(the per\-user override deployment mode\) on the first $n$ facts of an XL corpus of 1 000 USER \+ 1 000 ORG templated facts\. ICL@1000 is the in\-context ceiling at the hardest end\. Figure [32](https://arxiv.org/html/2606.19172#A15.F32) shows the full curves\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x33.png)

Figure 32: Per\-fact independent OPT recall vs\. fact count $n$\. Left: USER OPT top\-1\. Right: ORG OPT top\-1\. The d12@1280 and d20@1536 optimal cells \(green, purple\) hold near\-1\.00 recall from $n=100$ to $n=1000$\. No ceiling visible in this per\-fact independent\-OPT setting\.

Per\-fact recall is approximately flat in $n$ up to 1000 facts\. The best d8 and d12@1280 cells hold $\geq 0.98$ top\-1 from $n=100$ to $n=1000$\. The crucial caveat is that this is *independent* per\-fact OPT: each fact gets its own private hash rows, with no within\-table interference between facts \(the per\-user override deployment mode of Section [4](https://arxiv.org/html/2606.19172#S4)\)\. *In that setting, we observe no recall ceiling up to 1,000 facts on a 625 M\-parameter base LM\.* The density ceiling appears under Joint OPT into a shared table \(Section [5\.2](https://arxiv.org/html/2606.19172#S5.SS2)\): 68% → 35% top\-1 from 100 to 1,000 facts at d12@768\.

### O\.4 Dense\-size scaling at optimal config

Pulling the four trained Mini\-Engrams into one row each \(Figure [33](https://arxiv.org/html/2606.19172#A15.F33)\):

![Refer to caption](https://arxiv.org/html/2606.19172v1/x34.png)

Figure 33: Dense\-size scaling at the ablation\-optimal recipe\. Left: LOCOMO Joint OPT first\-token token\-F1 climbs from 0\.134 \(d8 v2\) to 0\.233 \(d20@1536\); the MEMMACHINE\_LIKE retrieval baseline climbs more slowly because its quality is bounded by sentence\-encoder retrieval, not the LM\. Right: val\_bpb \(grey\) drops monotonically with dense\+tokens, while E1 USER OPT single\-fact recall \(blue\) saturates at 1\.00 from d12@1280 onward \(small dip at d20 at iso\-12 t/p\)\.

#### Three observations\.

\(i\) Val bpb scales smoothly with dense × tokens: 0\.91 → 0\.83 → 0\.77 → 0\.73\. \(ii\) Insertion\-OPT and E1 saturate from d12@768 onward\. The 16\-fact ins\-OPT probe reaches 1\.00 at every dense size with ≥ 1\.32 B tokens; E1 USER/ORG OPT reaches 1\.00 at d12@1280@3\.34 B, dipping slightly to 0\.97 at d20@1536@7\.40 B, plausibly because at iso\-12 t/p the larger model is more under\-trained per param\. \(iii\) LOCOMO Joint OPT first\-token F1 scales monotonically with dense size from 0\.134 \(178 M\) to 0\.233 \(1\.22 B\), and the multi\-token token\-F1 variant climbs from 0\.159 to 0\.310\. Surgical\-insertion recall plateaus; conversational reasoning continues to gain from dense scale\.

#### Reading the gap\.

Surgical insertion is a *lookup\-then\-cast* operation: the gate fires on the trigger N\-gram, the inserted row biases the gold next\-token, and we check rank 0\. Once the base LM is fluent enough to make the gate selective \(≥ d12@768\), recall saturates\. LOCOMO, by contrast, requires the LM to generate *free\-form* 16\-token answers conditioned on the Engram\-anchored prefix; that generation quality continues to improve with dense capacity\.
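The lookup\-then\-cast description can be made concrete with a toy sketch. Everything below is an assumption made for illustration rather than the paper's implementation: the "backbone" is a random hidden state per token, the gate is a hard "slot written or not" switch instead of a learned gate, the hash is a simple polynomial hash, and the inserted row is computed in closed form as a stand\-in for the OPT procedure. All names \(`slot_of`, `logits_at`, `insert_fact`, `rank_of`\) are hypothetical.

```python
import math
import random

VOCAB, DIM, SLOTS, NGRAM = 50, 16, 1024, 2
rng = random.Random(0)

def unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy dense backbone: one fixed hidden state per token id, and one
# unit-norm readout row per vocabulary id.
hidden = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
readout = [unit([rng.gauss(0, 1) for _ in range(DIM)]) for _ in range(VOCAB)]

table = {}  # slot -> value row; an unwritten slot keeps the gate closed

def slot_of(ngram):
    """Toy deterministic hash of a token-id N-gram into the table."""
    h = 0
    for t in ngram:
        h = (h * 1_000_003 + t + 1) % SLOTS
    return h

def logits_at(tokens, pos):
    h = list(hidden[tokens[pos]])
    if pos + 1 >= NGRAM:
        row = table.get(slot_of(tuple(tokens[pos + 1 - NGRAM: pos + 1])))
        if row is not None:  # hard gate: fires only on a written slot
            h = [a + b for a, b in zip(h, row)]
    return [dot(w, h) for w in readout]

def rank_of(logits, gold):
    """Number of tokens scoring strictly above gold (0 means top-1)."""
    return sum(1 for x in logits if x > logits[gold])

def insert_fact(trigger, gold, margin=1.0):
    """Closed-form stand-in for OPT: write c * readout[gold] into the slot,
    with c just large enough for gold to lead every other logit by `margin`."""
    table.pop(slot_of(trigger), None)  # measure the unedited logits
    base = logits_at(list(trigger), len(trigger) - 1)
    c = max((base[j] - base[gold] + margin) / (1.0 - dot(readout[j], readout[gold]))
            for j in range(VOCAB) if j != gold)
    table[slot_of(trigger)] = [max(c, 0.0) * x for x in readout[gold]]

tokens = [5, 3, 7, 9]
trigger, gold = (3, 7), 42
before = [logits_at(tokens, p) for p in range(len(tokens))]
insert_fact(trigger, gold)
after = [logits_at(tokens, p) for p in range(len(tokens))]

print("rank of gold at trigger:", rank_of(after[2], gold))
print("other positions unchanged:", all(before[p] == after[p] for p in (0, 1, 3)))
```

In this toy, the edit moves the gold token to rank 0 at the trigger position and leaves the logits at every other position bit\-identical, which is the locality property the text describes. It says nothing about free\-form generation quality, which depends on the dense backbone.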

### O\.5 LoRA rank ablation at 100 facts

For completeness we also ablate the multi\-fact LoRA rank at fixed training budget \(2000 steps, lr 5e\-4, all $Q/K/V$ projections\):

Every rank ≥ 8 saturates at the recall ceiling within a 2000\-step budget\. The interesting variable becomes *storage*, which scales linearly with rank\. The takeaway: at 100 facts/user, all LoRAs of rank ≥ 8 hit the recall ceiling, so the LoRA\-vs\-Engram trade\-off collapses to recall\-vs\-storage\. Engram Joint OPT is 20× smaller than the smallest LoRA at the ceiling \(88 KB vs\. 1\.8 MB rank\-8\), in exchange for a ∼35\-pt top\-1 gap \(65% vs\. 100%; top\-5 gap is 2 pt: 98% vs\. 100%\)\. That gap closes by $n=1{,}000$, where LoRA rank\-64 also slips to 38% top\-1 \(Section [5\.2](https://arxiv.org/html/2606.19172#S5.SS2)\)\.

### O\.6 Inserted\-fact recall in free continuation

Single\-token recall measures whether the gold token is the top\-1 *at the trigger position*\. In conversational use the model continues for many tokens; we test whether the inserted fact surfaces in the first 8 generated tokens \(greedy decoding\) on 30 facts after Joint OPT \(1500 steps\)\.

Dense scaling sharpens generation behavior\. At the smallest scale the gold token is the immediate top\-1 only 23% of the time but *anywhere in the next 8 tokens* 53% of the time: the model “approaches” the fact across a few tokens\. From d12@1280 onward, the gold either lands at position 0 or never appears in the window: the bigger model commits earlier rather than drifting\. We caution that $n=30$ is small \(Wilson 95% CI on 27/30 is \[74%, 97%\]; on 18/30 is \[42%, 75%\]\), and the result is consistent with the bigger model simply missing the fact when its first guess is wrong\. Either way, the 90% first\-token rate at d20@1536 is the practical recall in conversational use\.
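For readers who want to reproduce the quoted intervals, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. The function name `wilson_interval` is an illustrative choice, and the 95% level is taken as $z \approx 1.96$; the counts 27/30 and 18/30 are the ones given in the text.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z ~ 1.96 for 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

for k in (27, 18):
    lo, hi = wilson_interval(k, 30)
    print(f"{k}/30: [{lo:.0%}, {hi:.0%}]")
# 27/30: [74%, 97%]
# 18/30: [42%, 75%]
```

The intervals are wide at this sample size, which is why the text treats the 30\-fact comparison as suggestive rather than conclusive.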

### O\.7 Rank ablation: r=16 is the sweet spot

We sweep the shared LoRA rank $r \in \{4, 16, 64\}$ on the full 20\-user split \(Figure [34](https://arxiv.org/html/2606.19172#A15.F34)\) and observe a clear, non\-monotonic curve\. Smaller ranks under\-fit the reasoning skill; larger ranks both contaminate more \(more parameters, more global perturbation\) *and* learn the reasoning skill worse, presumably because the LoRA’s capacity exceeds what the 510\-sample training corpus can profitably constrain\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x35.png)

Figure 34: Shared\-LoRA rank ablation \(the layered design, $n=20$\)\. The layered design preserves 100% direct recall at every rank; indirect\_any peaks at $r=16$ \(44%\) while contamination grows with rank\. The $r=64$ surprise: 4× the capacity gives *worse* reasoning \(35%\) at 2\.7× the contamination, plausibly because capacity exceeds what the 510\-sample shared corpus can constrain\.

The $r=64$ row is the surprise: a 4× larger shared LoRA gives *worse* indirect performance \(35% vs\. 44% at $r=16$\) at 2\.7× more contamination\. We recommend $r=16$; future work could revisit this once the shared training corpus is scaled beyond 510 samples\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x36.png)

Figure 35: Cost\-quality trade\-off for the layered architecture \(Section [6](https://arxiv.org/html/2606.19172#S6)\)\. Same Mini\-Engram\-d20 base, $n=20$ test users, six method combinations\. The layered design gives the best trade\-off: it matches the highest indirect\-reasoning accuracy \(44% indirect\_any\) at low storage \(88 KB/user\) and low contamination \(Δbpb \+0\.39\)\. It beats per\-user LoRA on every measure\. The naive LoRA\+Engram stack \(the “add Engram on top of LoRA” baseline\) is beaten too\.

### O\.8 Cross\-schema generalization

The headline layered\-design vs\. per\-user\-LoRA numbers train the shared LoRA on held\-out users in the same synthetic schema as the test users\. We test whether the reasoning skill transfers *across* schemas\. We train the rank\-16 shared LoRA on the original first\-person personal\-facts schema \(u020–u029; “My name is…”, “I was born in…”\) and evaluate the full six\-condition head\-to\-head on a held\-out medical schema \(m000–m019, third\-person patient framing: “The patient’s name is…”, “MRN: …”; 29 facts, 20 indirect probes per user; Figure [36](https://arxiv.org/html/2606.19172#A15.F36)\)\.

![Refer to caption](https://arxiv.org/html/2606.19172v1/x37.png)

Figure 36: The layered claim survives a personal → medical schema shift \(shared LoRA trained on personal\-schema users, tested on medical\-schema users\)\. The layered design’s indirect\_any decays from 44% to 31% \(a realistic 30% relative drop\) while per\-user LoRA goes 6% → 4%, so the layered design’s lead is preserved \(7\.4× → 7\.6×\); locality is unchanged \(the layered design’s Δbpb equals the shared LoRA’s, \+0\.386\)\.

#### The layered claim survives cross\-schema, with a realistic decay\.

The layered design’s indirect\_any degrades from 44% \(within\-schema, Figure [16](https://arxiv.org/html/2606.19172#S6.F16)\) to 31% \(cross\-schema, Figure [36](https://arxiv.org/html/2606.19172#A15.F36)\), a 30% relative drop\. Over the same gap, per\-user LoRA’s indirect\_any goes from 6% \(within\-schema\) to 4% \(cross\-schema\), so the layered design’s lead is preserved \(7\.4× within\-schema; 7\.6× cross\-schema\)\. The architectural locality is guaranteed by design: the layered design’s Δbpb equals the shared LoRA’s, \+0\.386, in both cases, identical to four decimal places, because the per\-user Engram still fires only on its trigger N\-grams\. *Architectural decomposition transfers across schemas\.*

### O\.9 The layered design on a Q/A\-format\-adapted Engram base

The headline layered\-design vs\. per\-user\-LoRA numbers are on a base LM \(Mini\-Engram\-d20\)\. Section [3](https://arxiv.org/html/2606.19172#S3)’s cross\-base table shows that per\-user LoRA’s *user\-visible* damage shrinks dramatically on instruction\-tuned bases \(85% of users worse on a base LM, 0–20% on four instruction\-tuned bases\)\. Does the layered design still dominate per\-user LoRA when the latter itself becomes a stronger baseline?

We SFT Mini\-Engram\-d20 on ∼2000 single\-token \(Q, A\) pairs from the same fact templates the layered experiment uses \(1500 steps; lr $1 \times 10^{-4}$; post\-embedding params trained; Engram tables frozen\)\. The SFT is intentionally narrow \(no general\-text corpus mixed in\), so it tests Q/A format adaptation rather than full chat\-tuning\. val\_bpb on raw ClimbMix rises from 0\.73 to 4\.42, indicating that the model has been pushed toward the Q/A distribution at the cost of general\-text LM ability\. We then train a fresh rank\-16 shared LoRA on this SFT’d base \(shared\_lora\_d20\_sft/r16, 2000 steps on u020–u029\) and rerun the six\-condition head\-to\-head; the numbers below carry the conclusion\.

#### The layered design still beats per\-user LoRA on every measure; the lead shrinks as the baseline strengthens\.

The qualitative picture survives Q/A adaptation: the layered design matches per\-user LoRA on direct \(100% vs\. 100%\), beats it on indirect \(1\.6× on top\-1, 2\.8× on indirect\_any\), and the architectural locality is preserved \(the layered design and the shared LoRA alone differ in Δbpb by $1 \times 10^{-4}$, on the same order as the $5 \times 10^{-5}$ original\-base noise floor\)\. The cross\-base prediction reproduces: per\-user LoRA’s indirect\_top1 jumps from 1\.3% \(untouched base\) to 23\.5% on this adapted base, matching Section [3](https://arxiv.org/html/2606.19172#S3)’s “instruction\-tuned bases absorb the LoRA perturbation” pattern\. The layered design’s lead over per\-user LoRA on indirect\_any compresses from 7\.4× \(base LM\) to 2\.8× \(SFT’d\), but it remains the best method\.

#### Caveats\.

The SFT here is narrow Q/A adaptation, not full chat\-tuning: mixing general\-text in the SFT corpus or starting from a true instruction\-tuned base \(the cross\-base table’s Qwen/Llama/Mistral setup, but with an Engram\-pretrained backbone\) is the natural follow\-up\. The aggressive SFT also collapses the untouched base’s and the per\-user Engram’s indirect to floor \(1%\), which inflates the layered design’s relative lead over the untouched base but not the layered\-vs\-LoRA contrast that matters for deciding which method to use\.

### O\.10 Storage and cost sharing

![Refer to caption](https://arxiv.org/html/2606.19172v1/x38.png)

Figure 37: Per\-user storage scales linearly with user count\. Engram override is ∼161× smaller at 100 facts/user \(88 KB vs\. 14\.2 MB\) and the gap widens to ∼1700× at 10 facts/user \(LoRA’s parameter count is fact\-count\-independent while Engram grows linearly\)\. At 1 M users × 100 facts/user, Engram storage is 100 GB vs\. per\-user LoRA’s 14\.2 TB\.

Per user, the layered design costs the Engram’s 88 KB; the shared LoRA is one global 11\.8 MB \(rank\-16\) file shared by all users\. For 1 M users that is 100 GB \+ 12 MB ≈ 100 GB, indistinguishable from Engram alone, versus 14\.2 TB for per\-user LoRA at the same recall ceiling\. Figure [35](https://arxiv.org/html/2606.19172#A15.F35) plots all six method combinations against storage and indirect\-reasoning accuracy\.
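The storage comparison is simple arithmetic, sketched below under stated assumptions: decimal units \(1 KB = 1,000 bytes\), which the text does not specify, and the per\-user sizes quoted above as the only inputs. The constant and function names are hypothetical.

```python
KB, MB, GB, TB = 1e3, 1e6, 1e9, 1e12  # decimal units: an assumption

ENGRAM_PER_USER = 88 * KB    # Engram override at 100 facts/user (from the text)
LORA_PER_USER = 14.2 * MB    # per-user LoRA (from the text)
SHARED_LORA = 11.8 * MB      # one rank-16 shared LoRA for all users (from the text)

def layered_total(users: int) -> float:
    """Bytes for the layered design: per-user Engram rows plus one shared LoRA."""
    return users * ENGRAM_PER_USER + SHARED_LORA

def per_user_lora_total(users: int) -> float:
    """Bytes when every user carries a private LoRA."""
    return users * LORA_PER_USER

users = 1_000_000
print(f"per-user ratio: {LORA_PER_USER / ENGRAM_PER_USER:.0f}x")
print(f"layered design: {layered_total(users) / GB:.1f} GB")
print(f"per-user LoRA:  {per_user_lora_total(users) / TB:.1f} TB")
```

With these inputs the per\-user ratio comes out near 161× and the per\-user\-LoRA total at 14\.2 TB, as quoted. The layered total lands near 88 GB, the same order as the round 100 GB figure in the text, and the shared LoRA contributes a negligible share of it.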
