Memory-Managed Long-Context Attention: A Preliminary Study of Editable Request-Local Memory

arXiv cs.CL Papers

Summary

This paper investigates memory-managed long-context attention, a research direction that separates efficient state compression from explicit editable memory slots. In controlled synthetic and oracle-key experiments, a hybrid that combines a fast recurrent or sparse backbone with explicit memory management covers overwrite, versioning, anti-pollution, and no-write-signal cases that pure fixed-state or pure sparse methods miss. On a small LongBench v1 subset, naive lexical selection does not generalize, so learned open-domain selection remains the main open problem.

arXiv:2606.28876v1 Announce Type: new

# A Preliminary Study of Editable Request-Local Memory
Source: [https://arxiv.org/html/2606.28876](https://arxiv.org/html/2606.28876)
## Memory\-Managed Long\-Context Attention: A Preliminary Study of Editable Request\-Local Memory

\(June 27, 2026\)

###### Abstract

Long\-context language models often conflate two different goals: compressing history into an efficient state, and maintaining reliable long\-term memory\. Linear, recurrent, and sparse attention reduce the cost of processing long sequences, but they do not by themselves specify when a fact should be written, overwritten, protected from distractors, or discarded\. We study *memory\-managed long\-context attention*, a research route that separates a fast recurrent or sparse backbone from explicit editable request\-local memory slots and query\-time sparse fallback\.

Across structured synthetic tasks, token/chunk/sequence bridges, generated natural language, and local frozen\-model diagnostics, pure fixed\-state or pure sparse methods fail some overwrite, version, anti\-pollution, or no\-write\-signal cases, while a hybrid covers both routes\. A small 2,097,152\-token mechanism stress test reaches 50/50 pooled accuracy with 2–132 active chunks\. A 2\.74M\-parameter minimal causal event\-token model reaches 595/600 with lite write supervision, supporting proof of trainability rather than scale\. A six\-family frozen\-hidden\-state bridge reaches 1079/1080 controlled pointer accuracy, but it uses generator\-provided integer key IDs and separately encoded canonical key strings; it is an oracle\-metadata probe, not open\-text entity resolution\. Local non\-leaderboard RULER 4K diagnostics remain close to full context, whereas a 33\-record LongBench v1 16K subset shows that naive lexical selection is not general\. The evidence separates three claims: controlled slot lifecycle is feasible, sparse fallback is needed when writes lack future\-query signals, and learned open\-domain selection remains the main architectural bottleneck\. We do not claim a final generative architecture, global slot\-trajectory convergence, or systems superiority\.

## 1 Introduction

Efficient long\-context modeling has made rapid progress through linear attention, state\-space and recurrent hybrids, and sparse attention\[[17](https://arxiv.org/html/2606.28876#bib.bib15),[26](https://arxiv.org/html/2606.28876#bib.bib18),[13](https://arxiv.org/html/2606.28876#bib.bib19),[31](https://arxiv.org/html/2606.28876#bib.bib6),[33](https://arxiv.org/html/2606.28876#bib.bib17),[32](https://arxiv.org/html/2606.28876#bib.bib23),[10](https://arxiv.org/html/2606.28876#bib.bib26)\]\. These methods reduce either the per\-token state size or the number of attended positions, enabling longer windows than dense softmax attention\. Yet a compressed state is not necessarily a managed memory\. A model that accumulates information into a fixed matrix or vector must still decide which events deserve persistence, how to handle a newer fact that conflicts with an older one, how to protect a stable fact from distractor pollution, and how to expose memory state in a bounded decode path\.

We study the following hypothesis:

> State compression and memory management are separate design problems\. Long\-context models need an explicit lifecycle for memory writes, overwrites, conflict handling, and eviction, not only a cheaper attention state\.

The proposed direction, memory\-managed long\-context attention, keeps the efficiency motivation of linear, recurrent, and sparse attention but adds a small editable memory table\. A fast state handles local and high\-throughput sequence processing\. Stable slots store selected events with metadata such as confidence, version, usage, and conflict score\. Sparse query\-time retrieval is used as a fallback when the model had no causal signal during prefill to know which ordinary fact would later be queried\. The intended result is not a universal replacement for full attention, but a Pareto candidate for ultra\-long contexts where a small active decode set must still support durable and editable facts\.

This draft is deliberately conservative\. We do not claim a final state\-of\-the\-art model, an official benchmark leaderboard result, or a production kernel\. Instead, we report a staged evidence chain that has been implemented locally and use it to define the next architecture proof\.

The preliminary contributions are:

- We formulate long\-context memory as an explicit lifecycle problem: write, read, overwrite, version, protect, and evict\.
- We implement a sequence of memory\-managed prototypes, from vector tasks to token, chunk, and sequence backbones plus natural\-language bridges\.
- We compare against fixed\-state, sliding\-window, query\-sparse, and MSA\-style sparse proxies, including a 2M\-token replicate\.
- We add a frozen published\-model diagnostic harness for RULER and LongBench, showing where sparse/hybrid context selection works and where it fails\.
- We train a small internal memory\-managed backbone and then reproduce a frozen\-hidden\-state memory adapter across six model families and three adapter seeds\.
- We identify the remaining gap: oracle metadata\-assisted canonical keys and pointer routing must be replaced by learned open\-text grounding, slot matching, and generative memory injection\.

#### Evidence logic\.

The study separates three components that are often conflated\. First, an oracle\-assisted controlled key protocol defines the grounding interface and isolates whether hidden representations can support memory lifecycle; it is an experimental instrument, not a solved parser\. Second, the editable slot mechanism tests write, overwrite, protection, and eviction behavior\. Third, a deliberately simple lexical selector tests query\-time fallback on real benchmark records\. Its relative success on RULER and failure on LongBench identify which selector constraints remain unsatisfied\. We therefore specify the target parser/selector properties below rather than presenting the current components as a complete open\-domain algorithm\.

## 2 Method Sketch

### 2\.1 Fast State and Stable Memory

Let a backbone maintain a fast state $S_t$ for local sequence processing, as in linear attention or recurrent hybrids\. In parallel, a request\-local stable memory contains $M$ slots,

$$\mathcal{M}_t=\{(k_i,v_i,c_i,u_i,\tau_i,z_i)\}_{i=1}^{M},$$

where $k_i$ and $v_i$ are the slot key and value, $c_i$ is confidence, $u_i$ is usage, $\tau_i$ is time or version metadata, and $z_i$ stores auxiliary conflict or source features\. The memory is request\-local in the current formulation; we do not assume cross\-user persistent memory\.

At read time, the model computes a sparse active set,

$$A(q_t)=\operatorname{TopK}_i\left(q_t^{\top}k_i+\lambda c_i+\gamma u_i+b(\tau_i)\right),$$

and returns a weighted value over only $|A|\ll M$ slots\. This preserves an interpretable active decode set\.
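As a rough illustration of the read rule, the following Python sketch scores every slot with the additive form above and keeps the top\-K active set\. The softmax weighting over the active set, the linear age penalty standing in for the recency term, and the constants `lam`, `gamma`, and `rate` are assumptions made for this example; the paper does not specify them\.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Slot:
    key: List[float]     # k_i
    value: List[float]   # v_i
    confidence: float    # c_i
    usage: float         # u_i
    version: int         # tau_i (time or version metadata)


def recency_bias(version: int, now: int, rate: float = 0.01) -> float:
    """Hypothetical b(tau): a linear penalty on age (an assumption)."""
    return -rate * (now - version)


def read(query: Sequence[float], slots: Sequence[Slot], k: int, now: int,
         lam: float = 0.5, gamma: float = 0.1) -> Tuple[List[int], List[float]]:
    """Return the top-k active slot indices and a weighted value over them."""
    scored = []
    for i, slot in enumerate(slots):
        dot = sum(q * x for q, x in zip(query, slot.key))
        score = (dot + lam * slot.confidence + gamma * slot.usage
                 + recency_bias(slot.version, now))
        scored.append((score, i))
    top = sorted(scored, reverse=True)[:k]

    # Softmax over the active set only (assumed weighting scheme).
    peak = max(score for score, _ in top)
    weights = [math.exp(score - peak) for score, _ in top]
    total = sum(weights)
    out = [0.0] * len(slots[0].value)
    for w, (_, i) in zip(weights, top):
        for d, x in enumerate(slots[i].value):
            out[d] += (w / total) * x
    return [i for _, i in top], out


slots = [
    Slot([1.0, 0.0], [1.0, 0.0], confidence=1.0, usage=0.0, version=10),
    Slot([0.0, 1.0], [0.0, 1.0], confidence=1.0, usage=0.0, version=10),
    Slot([0.9, 0.0], [0.5, 0.5], confidence=0.0, usage=0.0, version=10),
]
active, mixed = read([1.0, 0.0], slots, k=2, now=10)
```

In this toy setting the two slots whose keys align with the query form the active set, and the third slot never contributes to the returned value, which is the bounded\-read behavior the section describes\.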

### 2\.2 Write, Overwrite, and Eviction

For each candidate event $e_t$, a write controller estimates whether it should enter stable memory:

$$p_t=\sigma(f_{\theta}(e_t,S_t,q_{\mathrm{local}},r_t)),$$

where $r_t$ may include event type, salience, recency, and conflict features\. A candidate either updates a matched slot or occupies an evicted slot\. A simplified overwrite update is:

$$v_i\leftarrow(1-\alpha_i)v_i+\alpha_i v_t,\quad c_i\leftarrow g(c_i,p_t,\Delta_i),$$

where $\Delta_i$ measures conflict or version evidence\. For same\-key facts, the memory manager prefers newer high\-confidence writes rather than averaging incompatible values\. Eviction uses confidence, usage, recency, and importance so that distractor\-heavy tails do not flush early durable facts\.
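The lifecycle described above can be sketched as a small slot table\. This is a minimal illustration, not the authors' implementation: the class `SlotTable`, the write threshold, the retention score \(confidence plus a small usage bonus\), and the rule that a candidate must beat the weakest slot before evicting it are all hypothetical choices\. Same\-key writes use full replacement, which corresponds to setting the interpolation weight to 1\.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Entry:
    value: str
    confidence: float
    usage: int
    version: int


class SlotTable:
    """Toy editable memory with same-key overwrite and guarded eviction."""

    def __init__(self, capacity: int, write_threshold: float = 0.5,
                 usage_weight: float = 0.1) -> None:
        self.capacity = capacity
        self.write_threshold = write_threshold
        self.usage_weight = usage_weight
        self.entries: Dict[int, Entry] = {}

    def retention(self, key: int) -> float:
        # Assumed retention score: confidence plus a small usage bonus.
        entry = self.entries[key]
        return entry.confidence + self.usage_weight * entry.usage

    def write(self, key: int, value: str, p_write: float, version: int) -> str:
        if p_write < self.write_threshold:
            return "skipped"          # controller declines to write
        if key in self.entries:
            if version < self.entries[key].version:
                return "stale"        # never roll back to an older version
            # Same-key newer fact: replace rather than average.
            self.entries[key] = Entry(value, p_write, 0, version)
            return "overwritten"
        if len(self.entries) >= self.capacity:
            victim = min(self.entries, key=self.retention)
            if self.retention(victim) >= p_write:
                return "rejected"     # protect durable slots from distractors
            del self.entries[victim]
        self.entries[key] = Entry(value, p_write, 0, version)
        return "inserted"


table = SlotTable(capacity=2)
log = [
    table.write(1, "blue", 0.90, version=1),
    table.write(1, "green", 0.90, version=2),   # newer same-key fact
    table.write(1, "red", 0.90, version=1),     # older version arrives late
    table.write(2, "paris", 0.95, version=3),
    table.write(3, "noise", 0.60, version=4),   # weak distractor, table full
    table.write(4, "oslo", 0.99, version=5),    # strong fact evicts weakest slot
]
```

The guarded eviction step is the part that keeps a long tail of low\-confidence distractors from flushing earlier high\-confidence facts; a plain FIFO policy would not have that property\.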

#### Conditional piecewise stability\.

We do not require global convergence under a non\-stationary stream: a newer fact should cause a discrete state change\. Consider one stable entity\-version segment with fixed slot assignment, no eviction or erroneous writes, and

$$v_{t+1}=(1-\alpha_t)v_t+\alpha_t x_t,\qquad 0<\alpha_t\leq 1.$$

If $x_t=\mu$ in the segment and $\sum_t\alpha_t=\infty$, then

$$\lVert v_t-\mu\rVert=\lVert v_T-\mu\rVert\prod_{s=T}^{t-1}(1-\alpha_s)\longrightarrow 0.$$

For conditionally unbiased noisy observations with bounded variance, the additional condition $\sum_t\alpha_t^2<\infty$ gives the corresponding stochastic\-approximation target\. An explicit newer\-version event may reset or overwrite the slot and begin a new stable segment\. These are conditional design targets, not a global convergence theorem for the learned system: state\-dependent matching, thresholded writes, and eviction form a switched process\. In particular, a merely bounded but persistent false\-write rate can bias the limit or induce oscillation unless its cumulative update mass vanishes or cancels\.
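A small numeric check makes the role of the divergent\-sum condition concrete\. The sketch below uses scalar values and two hypothetical step\-size schedules chosen only for illustration: a harmonic schedule whose sum diverges, and a geometric schedule whose sum converges\. With a constant target, the residual is multiplied by the product of the \(1 − α\) factors, so it vanishes only in the first case\.

```python
def residual_ratio(alphas):
    """Product of (1 - alpha_s): the factor multiplying the initial error."""
    ratio = 1.0
    for a in alphas:
        ratio *= 1.0 - a
    return ratio


def run_segment(v, mu, alphas):
    """Apply v <- (1 - alpha) * v + alpha * mu with a constant target mu."""
    for a in alphas:
        v = (1.0 - a) * v + a * mu
    return v


n = 1000
divergent = [1.0 / (s + 2) for s in range(n)]   # sum grows like log n
summable = [0.5 ** (s + 2) for s in range(n)]   # sum converges to 0.5

ratio_divergent = residual_ratio(divergent)     # telescopes to 1 / (n + 1)
ratio_summable = residual_ratio(summable)       # stays bounded away from 0
final = run_segment(0.0, 1.0, divergent)        # error equals ratio_divergent
```

Under the summable schedule more than half of the initial error survives no matter how long the segment runs, which is why the divergence of the step\-size sum is stated as a condition rather than assumed\.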

### 2\.3 Sparse Fallback

A bounded causal memory cannot know during prefill which fully ordinary fact will be queried in the future\. We therefore combine explicit slots with query\-time sparse retrieval over chunk summaries\. The sparse path recovers uniform recall cases; the explicit memory path handles overwrite and version lifecycle\. This hybrid is essential: pure memory fails no\-signal uniform recall, while pure sparse retrieval often retrieves a stale version when the task requires overwrite semantics\. Figure [1](https://arxiv.org/html/2606.28876#S2.F1) distinguishes the implemented routing path from the planned generative integration\.
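The two\-path arrangement can be sketched as a simple router\. This is an illustrative toy, assuming integer keys, a dictionary as the slot table, and dot\-product similarity over chunk summary vectors; the function names `route` and `sparse_topk` are hypothetical and do not come from the paper's repository\.

```python
from typing import Dict, List, Sequence, Tuple, Union


def sparse_topk(query_vec: Sequence[float],
                chunk_summaries: Sequence[Sequence[float]],
                k: int) -> List[int]:
    """Rank chunk summaries by dot-product similarity and keep the top k."""
    scored = sorted(
        ((sum(q * c for q, c in zip(query_vec, vec)), idx)
         for idx, vec in enumerate(chunk_summaries)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


def route(query_key: int, query_vec: Sequence[float],
          memory: Dict[int, str],
          chunk_summaries: Sequence[Sequence[float]],
          k: int = 2) -> Tuple[str, Union[str, List[int]]]:
    """Memory takes precedence on a key match; otherwise fall back to sparse."""
    if query_key in memory:
        return ("memory", memory[query_key])
    return ("sparse", sparse_topk(query_vec, chunk_summaries, k))


memory = {7: "version-2 value"}                 # slot holds the latest version
chunks = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]   # toy chunk summary vectors

hit = route(7, [1.0, 0.0], memory, chunks)      # key was written: read the slot
miss = route(3, [1.0, 0.0], memory, chunks)     # no write signal: sparse path
```

Giving the slot precedence on a key match is what avoids returning a stale version from the sparse index, while the fallback covers facts that were never written\.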

### 2\.4 Oracle Metadata\-Assisted Canonical\-Key Protocol

Phase 13 is a controlled representation bridge, not a learned or open\-domain parser\. Each generated sample supplies event sentences, a query sentence, integer `key_ids`, integer `value_ids`, a target event index, and optional write labels\. The frozen backbone mean\-pools last\-layer hidden states over each complete event and query sentence\. Separately, the protocol constructs the canonical string `entity N` from each generator\-provided key ID, encodes it with the same frozen backbone, and mean\-pools the complete canonical string\. Event and query representations concatenate their full\-sentence vectors with these canonical\-key vectors\.

No runtime regex, named\-entity recognizer, exact\-string span matcher, source\-sentence key\-span pooling, or value\-span extractor is used\. The query key is selected from generator metadata via the target event index\. Same\-key replacement and memory\-versus\-sparse branch arbitration use exact integer key equality\. Event type remains expressed by the controlled sentence template; version is represented by stream order and repeated key IDs; the lite\-write variant additionally receives write labels\. The adapter predicts an event pointer rather than generating the value\. Consequently, Phase 13 isolates whether frozen hidden representations can support a memory lifecycle when key identity is supplied; it does not establish entity discovery, alias resolution, coreference, ambiguous or nested entity handling, or learned slot matching\.
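The representation step of this protocol amounts to two mean\-pooling operations followed by concatenation\. The sketch below shows only that arithmetic, using toy token vectors as stand\-ins for frozen last\-layer hidden states; the helper names and the two\-dimensional vectors are assumptions for illustration, and no backbone is loaded\.

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_states: Sequence[Sequence[float]]) -> Vector:
    """Average per-token vectors into one vector for the whole string."""
    n = len(token_states)
    return [sum(column) / n for column in zip(*token_states)]


def event_representation(sentence_states: Sequence[Sequence[float]],
                         canonical_key_states: Sequence[Sequence[float]]) -> Vector:
    """Concatenate the full-sentence vector with the canonical-key vector."""
    return mean_pool(sentence_states) + mean_pool(canonical_key_states)


# Stand-ins for hidden states of an event sentence (2 tokens) and of the
# separately encoded canonical key string (3 tokens).
sentence_states = [[1.0, 2.0], [3.0, 4.0]]
key_states = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
rep = event_representation(sentence_states, key_states)
```

Because the key vector comes from a separately encoded string built from the generator's key ID, the concatenated representation carries key identity that the pooled sentence alone does not isolate\.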

The intended future grounding component has the contract

$$\mathcal{P}_{\phi}(x_t)\longmapsto\{(s_j,k_j,e_j,r_j,\rho_j)\}_{j=1}^{J_t}.$$

Here $s_j$ denotes a source span and $k_j$ its normalized entity or event key\. The terms $e_j$, $r_j$, and $\rho_j$ represent value\-bearing evidence, an event/version/conflict label, and confidence\. The current protocol replaces $\mathcal{P}_{\phi}$ with generator metadata\. A learned implementation must recover this structure from open text and calibrate uncertainty rather than assume exact IDs\.

### 2\.5 Target Selector and Lifecycle Properties

Given grounded candidates, the missing learned selector can be stated as

$$\mathcal{S}_{\theta}(q_t,C_{\leq t},\mathcal{M}_t)\longmapsto(A_t,\pi_t,a_t),\qquad |A_t|\leq K,$$

where $A_t$ is the active memory set, $\pi_t$ contains match and conflict scores, and $a_t$ is a calibrated abstention/fallback decision\. Together with the writer and lifecycle manager, the target system should provide:

1. *key consistency*: paraphrases of the same entity or event map to compatible keys;
2. *version monotonicity*: a recognized newer version supersedes rather than averages with an incompatible old value;
3. *distractor stability*: irrelevant additions do not corrupt or evict protected relevant slots;
4. *bounded activation*: $K$ does not grow linearly with context length;
5. *conflict separability*: same\-key contradiction is distinguished from different\-key semantic similarity;
6. *calibrated abstention*: uncertain matches fall back to sparse retrieval instead of forcing a memory read; and
7. *piecewise stability*: without new relevant evidence, assignments and slot contents become quiescent, while explicit newer versions may cause bounded jumps followed by a new stable segment\.

These are desired properties and evaluation criteria; the current oracle\-assisted bridge does not prove them\.

\[Figure 1 diagram\. Nodes: context token/event stream; backbone representations and fast state $S_t$; write controller \(salience/conflict/budget\); editable memory slots $(k,v,c,u,\tau,z)$; sparse chunk/key index \(query\-time fallback\); query representation $q_t$; branch router \(memory precedence if key matched\); selected memory/value states \(small active set\)\. Outputs: current evidence \(pointer/routing output\) and next architecture step \(LM generation conditioning\)\. Edge labels: matched\-key read; fallback candidates\.\]

Figure 1: Memory\-managed sequence path\. The solid path summarizes mechanisms implemented in the controlled prototypes: a fast backbone, budgeted editable slots, sparse fallback, and memory\-first branch arbitration\. The dashed generation\-conditioning path is proposed next work and is not part of the reported pointer experiments\.

## 3 Experimental Route

The experiments are arranged as a staged validity check\. Early stages ask whether the mechanism is measurable at all\. Later stages ask whether the behavior survives tokenization, chunking, sequence backbones, generated natural text, local benchmark harnesses, official data ingestion, and frozen published\-model inference\.

#### Tasks\.

The controlled suite includes Associative Recall, Overwrite Facts, Temporal Versioning, Distractor Pollution, and Streaming QA\. Later natural\-language and benchmark\-style suites include labeled needle, uniform needle, multi\-needle conflict, variable tracking, and multi\-hop route tasks\.

#### Baselines\.

We compare to full scan or full context where feasible, tail/sliding windows, FIFO memory, pure linear or Delta/GDN/KDA\-style fixed\-state proxies\[[31](https://arxiv.org/html/2606.28876#bib.bib6),[14](https://arxiv.org/html/2606.28876#bib.bib4),[28](https://arxiv.org/html/2606.28876#bib.bib5)\], local query\-sparse proxies inspired by NSA and DeepSeek Sparse Attention\[[32](https://arxiv.org/html/2606.28876#bib.bib23),[10](https://arxiv.org/html/2606.28876#bib.bib26)\], MSA\-style document/chunk sparse proxies\[[6](https://arxiv.org/html/2606.28876#bib.bib24)\], pure explicit memory, and memory\+sparse hybrid variants\. These proxy names indicate mechanism families only; they are not faithful implementations or vendor kernels\.

#### Published models\.

The frozen diagnostic harness uses local Llama and Qwen instruct weights\[[12](https://arxiv.org/html/2606.28876#bib.bib27),[24](https://arxiv.org/html/2606.28876#bib.bib28)\]with vLLM\[[19](https://arxiv.org/html/2606.28876#bib.bib33)\]\. It compares full\-context, tail\-window, query\-sparse, and memory\+sparse prompts on local non\-leaderboard RULER 4K and 16K\-comparable LongBench v1 subsets\[[15](https://arxiv.org/html/2606.28876#bib.bib21),[1](https://arxiv.org/html/2606.28876#bib.bib22)\]\. A separate controlled bridge extracts frozen hidden states from Llama, Qwen, Mistral\-Nemo, Gemma, GLM, and InternLM\[[12](https://arxiv.org/html/2606.28876#bib.bib27),[24](https://arxiv.org/html/2606.28876#bib.bib28),[21](https://arxiv.org/html/2606.28876#bib.bib29),[11](https://arxiv.org/html/2606.28876#bib.bib30),[27](https://arxiv.org/html/2606.28876#bib.bib31),[5](https://arxiv.org/html/2606.28876#bib.bib32)\], then trains a small writer/read adapter per family\. The latter validates controlled representation compatibility under oracle canonical\-key scaffolding and pointer routing; it does not evaluate open\-text parsing or generation\.

## 4 Results

### 4\.1 Controlled Evidence

Table [1](https://arxiv.org/html/2606.28876#S4.T1) summarizes the main controlled experiments\. The consistent pattern is that explicit memory solves overwrite and anti\-pollution when write signals exist, sparse retrieval solves no\-signal recall, and the hybrid combines both behaviors\.

Table 1: Controlled evidence chain\. These are local mechanism\-validation experiments, not official benchmark scores\.

### 4\.2 2M\-Token Sparse\-Hybrid Scaling

The largest controlled run evaluates 2,097,152\-token synthetic contexts over five seeds and two trials per seed, with five scenarios and eight method rows per sample\. Table [2](https://arxiv.org/html/2606.28876#S4.T2) shows the 2M slice\.

Table 2: 2M\-token synthetic replicate summary\. This 50\-trial slice is a small mechanism stress test, not a statistically powered benchmark\. The hybrid uses sparse retrieval for no\-write\-signal recall and explicit memory for overwrite/version lifecycle\. The dense\-stride row is a local proxy, not the published DSA or NSA kernel\. Wilson 95% CI for the hybrid pooled result is approximately \[0\.929, 1\.0\]\.

These results do not prove that the proxy implementation is faster than optimized sparse kernels, but they clarify the mechanism tradeoff\. Query\-time sparse methods can recover ordinary recall but may activate a large candidate set and lack explicit stale\-version handling\. Explicit memory has bounded decode reads but needs causal write evidence\. The hybrid is the smallest design in this study that covers both boundaries\.

### 4\.3 Trainable Backbone and Six\-Family Hidden\-State Bridge

Phase 12 is a minimal causal event\-token proof of trainability, not model\-scale evidence\. Its 2\.74M\-parameter backbone jointly trains fast state, a soft memory relaxation, a 32\-slot hard memory evaluation path, and top\-16 sparse fallback\. Over three training seeds and 600 evaluation samples per variant, answer\-only training reaches 575/600 \(95\.83%, Wilson 95% CI \[0\.939, 0\.972\]\) but writes about 2,445 times on the longest streaming setting\. A 0\.25\-weight write auxiliary reaches 595/600 \(99\.17%, CI \[0\.981, 0\.996\]\) and reduces streaming writes to roughly one or two\. This supports trainability, while showing that answer supervision alone has not yet produced a reliably sparse writer\.
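The Wilson intervals quoted for these counts can be recomputed from the success and trial totals alone\. The following is a standard Wilson score interval written as a small helper; the function name `wilson_interval` and the choice of a two\-sided 95% normal quantile are assumptions of this sketch, not code from the paper's repository\.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959964):
    """Two-sided Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


answer_only = wilson_interval(575, 600)   # Phase 12 answer-only training
lite_write = wilson_interval(595, 600)    # Phase 12 with write auxiliary
stress_2m = wilson_interval(50, 50)       # 2M-token hybrid pooled result
```

Rounded to three decimals, these calls agree with the intervals reported in the text for 575/600, 595/600, and the 50/50 stress test\. The 50/50 case also shows why the 2M slice is described as underpowered: a perfect score over 50 trials still leaves a lower bound near 0\.93\.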

Phase 13 replaces learned toy embeddings with last\-layer representations from six frozen model families\. Full\-event representations feed the writer; oracle metadata\-derived canonical\-key representations feed a shared key/query retriever\. Table [3](https://arxiv.org/html/2606.28876#S4.T3) aggregates three newly trained adapter seeds per model, 60 controlled evaluation samples per seed, and ten settings up to 2,048 events\.

Table 3: Oracle\-key controlled frozen\-hidden\-state pointer/routing results\. The pooled Wilson 95% intervals are \[0\.980, 0\.993\] for answer\-only and \[0\.995, 1\.000\] for lite\-write training\. Generator\-provided integer key IDs determine canonical key strings, same\-key replacement, and branch arbitration; these are neither open\-text parsing nor generative benchmark scores\.

For the lite variant, pooled accuracy across the same 1,080 evaluations is 91\.57% for an unconstrained full\-hidden readout, 59\.72% for hidden sparse top\-16, 80\.00% for editable memory alone, 18\.52% for tail\-64, and 99\.91% for the hybrid\. Two failed pilots determined the final routing rule\. Whole\-sentence mean pooling produced 0% uniform recall on Llama because it did not isolate the entity key\. After adding the separately encoded oracle canonical\-key vectors, a unified write\-biased read score could still suppress a correctly retrieved ordinary fact\. The final controlled hybrid therefore reads an authoritative editable slot when an exact metadata key match is present and otherwise falls back to pure sparse key similarity\. Figure [2](https://arxiv.org/html/2606.28876#S4.F2) summarizes the per\-family hybrid scores and the pooled lite\-write ablations\.

\[Figure 2 plots\. \(a\) Hybrid accuracy by frozen backbone: accuracy \(%\) on a 0 to 100 axis for Llama, Qwen, Mistral, Gemma, GLM, and InternLM, with answer\-only and lite\-write series\. \(b\) Pooled lite\-write ablations: Tail64 18\.52, Sparse16 59\.72, Memory 80\.00, Full 91\.57, Hybrid 99\.91\.\]

Figure 2: Phase 13 oracle\-key controlled pointer accuracy over three adapter seeds\. \(a\) Each model/variant has 180 evaluations; the InternLM answer\-only writer is the main cross\-family failure\. \(b\) Pooled lite\-write ablations use 1,080 evaluations each\. Values are computed from the committed result rows; generator\-provided exact key IDs and canonical key strings remain supplied\.

### 4\.4 Official Data and Frozen Published\-Model Diagnostics

The external\-data pipeline ingests 9,051 normalized records: 8,418 LongBench v1/E records, 503 LongBench v2 records, and 130 RULER 4K smoke records \[[15](https://arxiv.org/html/2606.28876#bib.bib21),[1](https://arxiv.org/html/2606.28876#bib.bib22)\]\. Phase 9 and 10 verify recoverability, sparse selection, prediction JSONL output, and task\-matched scoring\. Phase 11 runs real generation with two local published models\. Table [4](https://arxiv.org/html/2606.28876#S4.T4) reports aggregate scores on bounded local diagnostic subsets; it is not a leaderboard evaluation\.

Table 4: Frozen published\-model diagnostics\. RULER supports sparse/hybrid context selection close to full\-context and far above tail\-window truncation\. LongBench v1 shows the opposite pressure: naive lexical sparse selection loses useful context on mixed QA, summarization, code, and classification tasks\. These are local diagnostic scores, not official leaderboard submissions\.

The RULER result supports the route for retrieval\-style and structured long\-context tasks\. The main weakness is variable tracking: sparse/hybrid prompts sometimes omit or reorder update chains, reducing exact overwrite accuracy relative to full context\. The LongBench failure is not an incidental implementation detail; it identifies the missing component of the architecture\. A simple lexical selector is insufficient for broad long\-context understanding, so task\-aware query extraction, learned reranking and abstention, or an internal memory\-managed selector is needed\.

## 5 Discussion

#### Three separated claims\.

The evidence supports three deliberately separated claims\. First, explicit slot lifecycle can implement overwrite, version handling, and anti\-pollution under controlled keys and bounded active decode\. Second, sparse fallback is necessary when a causal writer receives no signal about a future query\. Third, the LongBench negative control and the oracle dependence of Phase 13 identify learned open\-domain selection as the main unresolved architectural bottleneck\. Phase 12 is only a minimal proof of trainability, and Phase 13 establishes controlled representation compatibility across six frozen model families\.

#### What is not yet supported\.

The evidence does not support a claim of universal superiority over DeltaNet, Gated DeltaNet, Kimi Delta Attention, Native/DeepSeek Sparse Attention, Memory Sparse Attention, RACE Attention, Titans, Infini\-attention, or VLA\[[31](https://arxiv.org/html/2606.28876#bib.bib6),[14](https://arxiv.org/html/2606.28876#bib.bib4),[28](https://arxiv.org/html/2606.28876#bib.bib5),[32](https://arxiv.org/html/2606.28876#bib.bib23),[10](https://arxiv.org/html/2606.28876#bib.bib26),[6](https://arxiv.org/html/2606.28876#bib.bib24),[16](https://arxiv.org/html/2606.28876#bib.bib25),[2](https://arxiv.org/html/2606.28876#bib.bib2),[22](https://arxiv.org/html/2606.28876#bib.bib3),[23](https://arxiv.org/html/2606.28876#bib.bib1)\]\. Several rows are proxy baselines\. Phase 13 uses generator\-provided exact key IDs, canonical\-key encodings, controlled pointer targets, and separately trained adapters; it does not perform open\-text grounding or inject memory into language generation\. System latency has not been measured with a custom kernel\. No global convergence result is established for the switched write/match/evict process\.

#### Why frozen\-model evaluation is still useful\.

Published\-model evaluation fixes the language model and isolates the context\-construction question\. If sparse/hybrid prompts fail even under strong frozen models, the routing or scoring pipeline is suspect\. If they work on RULER but fail on LongBench, the result tells us which tasks require a learned selector or an internal backbone\. This is a diagnostic step before, not a substitute for, training the architecture\.

## 6 Related Work

Linear attention and efficient sequence models reduce attention cost by replacing quadratic softmax attention with recurrent or kernelized states\[[17](https://arxiv.org/html/2606.28876#bib.bib15),[7](https://arxiv.org/html/2606.28876#bib.bib16),[26](https://arxiv.org/html/2606.28876#bib.bib18),[13](https://arxiv.org/html/2606.28876#bib.bib19),[9](https://arxiv.org/html/2606.28876#bib.bib20)\]\. Delta\-style models and recent hybrid architectures improve the expressiveness and update dynamics of fixed states\[[31](https://arxiv.org/html/2606.28876#bib.bib6),[14](https://arxiv.org/html/2606.28876#bib.bib4),[28](https://arxiv.org/html/2606.28876#bib.bib5)\]\. VLA frames linear attention as stable associative memory\[[23](https://arxiv.org/html/2606.28876#bib.bib1)\]\. Our focus differs: we do not only stabilize the state; we add explicit memory lifecycle management outside the fast state\.

Long\-term and neural memory systems augment transformers with recurrence, memory layers, or retrieved memories\[[8](https://arxiv.org/html/2606.28876#bib.bib13),[25](https://arxiv.org/html/2606.28876#bib.bib14),[30](https://arxiv.org/html/2606.28876#bib.bib7),[29](https://arxiv.org/html/2606.28876#bib.bib8),[3](https://arxiv.org/html/2606.28876#bib.bib9),[2](https://arxiv.org/html/2606.28876#bib.bib2),[22](https://arxiv.org/html/2606.28876#bib.bib3)\]\. RAG\-style and nearest\-neighbor methods retrieve external text or hidden states\[[20](https://arxiv.org/html/2606.28876#bib.bib10),[4](https://arxiv.org/html/2606.28876#bib.bib11),[18](https://arxiv.org/html/2606.28876#bib.bib12)\]\. These works motivate memory augmentation, but our studied unit is request\-local editable memory inside the attention/backbone path, with explicit overwrite and anti\-pollution metadata\.

Sparse attention systems reduce the attended set through fixed or learned patterns, ranging from BigBird structured sparsity to NSA hierarchical selection, DeepSeek learned token indexer, and MSA document/chunk memory retrieval\[[33](https://arxiv.org/html/2606.28876#bib.bib17),[32](https://arxiv.org/html/2606.28876#bib.bib23),[10](https://arxiv.org/html/2606.28876#bib.bib26),[6](https://arxiv.org/html/2606.28876#bib.bib24)\]\. RACE instead replaces softmax similarity and uses random projections with soft locality\-sensitive hashing to obtain a strictly linear attention layer\[[16](https://arxiv.org/html/2606.28876#bib.bib25)\]; it should not be conflated with the local query\-sparse proxies in our tables\. Sparse retrieval is complementary to our method: it is strong when no write signal exists, but by itself does not define same\-key overwrite or slot lifecycle\. RULER and LongBench provide benchmark pressure for these distinctions\[[15](https://arxiv.org/html/2606.28876#bib.bib21),[1](https://arxiv.org/html/2606.28876#bib.bib22)\]\.

## 7 Reproducibility Notes

The local repository contains runnable stages and result artifacts\. Key commands include:

```
conda run -n memory-attention-phase1 \
  python -m phase4_competitive_baselines.run_sparse_hybrid_replicates

conda run -n memory-attention-phase1 \
  python -m phase6_real_context_bridge.run_natural_context_bridge

conda run -n memory-attention-phase1 \
  python -m phase7_real_benchmark_harness.benchmark_harness

conda run -n memory-attention-phase1 \
  python -m phase11_published_model_harness.run_llm_eval \
  --models llama3_1_8b_instruct,qwen2_5_14b_instruct \
  --benchmarks ruler --max-per-dataset 10 \
  --out-dir phase11_published_model_harness/results/ruler_4k_llm_eval_full \
  --max-model-len 16384 --max-tokens 96 --batch-size 2

conda run -n memory-attention-phase1 \
  python -m phase12_trainable_memory_backbone.run_replicates

conda run -n memory-attention-phase1 \
  python -m phase13_frozen_hidden_memory.run_phase13
```

The most relevant result files are:

- phase4\_competitive\_baselines/results/sparse\_hybrid\_replicates/
- phase6\_real\_context\_bridge/results/natural\_context\_bridge/
- phase7\_real\_benchmark\_harness/results/local\_harness/
- phase11\_published\_model\_harness/results/ruler\_4k\_llm\_eval\_full/
- phase11\_published\_model\_harness/results/longbench\_v1\_16k\_llm\_eval/
- phase12\_trainable\_memory\_backbone/results/trainable\_replicates/
- phase13\_frozen\_hidden\_memory/results/adapters\_keyspan\_final/all\_six\_models/

## 8 Limitations and Next Steps

This v1 draft is best read as a route paper\. The main limitations are:

- The strongest long\-context results are still controlled or generated\-context experiments\.
- Several competitors are proxy implementations rather than faithful published kernels\.
- The 50\-trial 2M slice is a mechanism stress test and is underpowered as a statistical benchmark\.
- Lite write supervision remains materially better calibrated than answer\-only writing on some families; InternLM answer\-only shows a clear distractor\-pollution failure\.
- Phase 12 is a 2\.74M\-parameter event\-token proof of trainability, not evidence of language\-model\-scale quality\.
- Phase 13 uses oracle generator metadata, canonical key strings, and exact integer IDs for lifecycle matching; open\-text discovery, aliases, coreference, ambiguity, and learned slot matching are not solved\.
- Phase 13 evaluates pointer/routing accuracy, not memory\-conditioned answer generation or full\-model fine\-tuning\.
- RULER 4K and LongBench v1 results are bounded local diagnostic subsets, not official leaderboard evaluations\.
- Slot count is bounded, but the full learned write/match/evict trajectory has no global convergence guarantee; only conditional piecewise stability is characterized\.
- LongBench v1 shows that naive lexical sparse routing is insufficient and makes learned task\-aware selection the next architectural bottleneck\.
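
To make the "underpowered" caveat concrete, the sketch below computes a Wilson score interval for the accuracy counts reported in the abstract. The choice of the Wilson interval and the assumption that trials are independent and identically distributed are ours, not the paper's; pooled trials that share generators or seeds would violate that assumption, and the helper name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Two-sided Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% coverage under i.i.d. trials.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Counts as reported in the abstract.
reported = [
    ("2M-token stress test", 50, 50),
    ("2.74M event-token model", 595, 600),
    ("frozen-hidden-state bridge", 1079, 1080),
]
for label, k, n in reported:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} -> [{lo:.3f}, {hi:.3f}]")
```

Under these assumptions, a perfect 50/50 score still leaves a lower bound of roughly 0.93, which is why the 2M slice is better read as a mechanism check than as a precise accuracy estimate.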

The next decisive experiment is to replace oracle metadata\-assisted canonical keys with a token\-level learned grounding head, contrastive query–event matching, hard same\-entity/version negatives, and a budgeted differentiable writer\. The selected memory states should then enter a frozen or parameter\-efficiently tuned language model as soft\-prefix or gated cross\-attention memory tokens and be trained with generation loss\. In parallel, LongBench requires a task\-aware hierarchical selector with dense/lexical first\-stage retrieval, learned reranking, neighboring\-chunk expansion, calibrated abstention, and task\-specific budgets\. Faithful Delta/GDN/KDA, DSA/NSA, and MSA/RACE baselines plus custom\-kernel measurements remain necessary for a final systems claim\.
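
As one illustration of the proposed contrastive query–event matching, the sketch below computes an InfoNCE-style loss for a single query against one positive event and a set of negatives. It is a minimal pure-Python example under our own assumptions: the embeddings are made-up 2-D vectors, cosine similarity and a temperature of 0.1 are arbitrary choices, and `info_nce` is a hypothetical helper rather than part of the released code.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for one query: negative log-softmax of the positive."""
    q = normalize(query)
    candidates = [positive] + list(negatives)
    # Cosine similarity scaled by temperature; index 0 is the positive.
    logits = [dot(q, normalize(c)) / temperature for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]


# Illustrative embeddings (assumed, not taken from the paper).
query = [1.0, 0.0]
current_version = [0.9, 0.1]   # the event the query should match
older_version = [0.8, 0.3]     # hard negative: same entity, stale version
unrelated_event = [-1.0, 0.2]  # easy negative

loss_hard = info_nce(query, current_version, [older_version])
loss_easy = info_nce(query, current_version, [unrelated_event])
print(f"hard negative loss: {loss_hard:.4f}")
print(f"easy negative loss: {loss_easy:.2e}")
```

The hard same-entity negative yields a much larger loss than the unrelated one, which is the intuition behind mining such negatives: easy negatives contribute almost no training signal for distinguishing versions of the same key.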

## 9 Conclusion

The preliminary evidence suggests that long\-context efficiency alone is not enough\. A reliable ultra\-long context model should manage memory lifecycle explicitly: which events are written, which facts are overwritten, which distractors are rejected, and which small active set is exposed during decode\. Memory\-managed long\-context attention is a candidate design for that goal\. The minimal trainable backbone and oracle\-key six\-family bridge provide controlled feasibility evidence, but the strongest final claim still depends on learned open\-text grounding, piecewise\-stable slot management, generative integration, faithful baselines, and systems measurements\.

## AI\-Assisted Tooling Disclosure

The authors used AI\-assisted programming and writing tools for code development, drafting, and editing\. The authors reviewed and verified the experimental outputs, claims, and manuscript content and take responsibility for the work\.

## Author Contributions

Junyi Zou led the research direction, method development, implementation, experiments, result analysis, and manuscript drafting\. Avrova Donz contributed to early method discussions and provided feedback on the research framing\.

## References

- \[1\] Y\. Bai et al\. \(2023\)\. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding\. arXiv:2308\.14508\. [https://arxiv\.org/abs/2308\.14508](https://arxiv.org/abs/2308.14508)
- \[2\] A\. Behrouz, P\. Zhong, and V\. Mirrokni \(2025\)\. Titans: Learning to Memorize at Test Time\. arXiv:2501\.00663\. [https://arxiv\.org/abs/2501\.00663](https://arxiv.org/abs/2501.00663)
- \[3\] V\. Berges et al\. \(2024\)\. Memory Layers at Scale\. arXiv:2412\.09764\. [https://arxiv\.org/abs/2412\.09764](https://arxiv.org/abs/2412.09764)
- \[4\] S\. Borgeaud et al\. \(2021\)\. Improving Language Models by Retrieving from Trillions of Tokens\. arXiv:2112\.04426\. [https://arxiv\.org/abs/2112\.04426](https://arxiv.org/abs/2112.04426)
- \[5\] Z\. Cai et al\. \(2024\)\. InternLM2 Technical Report\. arXiv:2403\.17297\. [https://arxiv\.org/abs/2403\.17297](https://arxiv.org/abs/2403.17297)
- \[6\] Y\. Chen, R\. Chen, S\. Yi, X\. Zhao, X\. Li, J\. Zhang, J\. Sun, C\. Hu, Y\. Han, L\. Bing, Y\. Deng, and T\. Chen \(2026\)\. MSA: Memory Sparse Attention for Efficient End\-to\-End Memory Model Scaling to 100M Tokens\. arXiv:2603\.23516\. [https://arxiv\.org/abs/2603\.23516](https://arxiv.org/abs/2603.23516)
- \[7\] K\. Choromanski et al\. \(2020\)\. Rethinking Attention with Performers\. arXiv:2009\.14794\. [https://arxiv\.org/abs/2009\.14794](https://arxiv.org/abs/2009.14794)
- \[8\] Z\. Dai et al\. \(2019\)\. Transformer\-XL: Attentive Language Models Beyond a Fixed\-Length Context\. arXiv:1901\.02860\. [https://arxiv\.org/abs/1901\.02860](https://arxiv.org/abs/1901.02860)
- \[9\] T\. Dao and A\. Gu \(2024\)\. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality\. arXiv:2405\.21060\. [https://arxiv\.org/abs/2405\.21060](https://arxiv.org/abs/2405.21060)
- \[10\] DeepSeek\-AI et al\. \(2025\)\. DeepSeek\-V3\.2: Pushing the Frontier of Open Large Language Models\. arXiv:2512\.02556\. [https://arxiv\.org/abs/2512\.02556](https://arxiv.org/abs/2512.02556)
- \[11\] Gemma Team et al\. \(2025\)\. Gemma 3 Technical Report\. arXiv:2503\.19786\. [https://arxiv\.org/abs/2503\.19786](https://arxiv.org/abs/2503.19786)
- \[12\] A\. Grattafiori et al\. \(2024\)\. The Llama 3 Herd of Models\. arXiv:2407\.21783\. [https://arxiv\.org/abs/2407\.21783](https://arxiv.org/abs/2407.21783)
- \[13\] A\. Gu and T\. Dao \(2023\)\. Mamba: Linear\-Time Sequence Modeling with Selective State Spaces\. arXiv:2312\.00752\. [https://arxiv\.org/abs/2312\.00752](https://arxiv.org/abs/2312.00752)
- \[14\] A\. Hatamizadeh, Y\. Choi, and J\. Kautz \(2026\)\. Gated DeltaNet\-2: Decoupling Erase and Write in Linear Attention\. arXiv:2605\.22791\. [https://arxiv\.org/abs/2605\.22791](https://arxiv.org/abs/2605.22791)
- \[15\] C\. Hsieh et al\. \(2024\)\. RULER: What’s the Real Context Size of Your Long\-Context Language Models? arXiv:2404\.06654\. [https://arxiv\.org/abs/2404\.06654](https://arxiv.org/abs/2404.06654)
- \[16\] S\. Joshi, A\. Chowdhury, A\. Kanakamedala, E\. Singh, E\. Tu, and A\. Shrivastava \(2025\)\. RACE Attention: A Strictly Linear\-Time Attention Layer for Training on Outrageously Large Contexts\. arXiv:2510\.04008\. [https://arxiv\.org/abs/2510\.04008](https://arxiv.org/abs/2510.04008)
- \[17\] A\. Katharopoulos et al\. \(2020\)\. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention\. arXiv:2006\.16236\. [https://arxiv\.org/abs/2006\.16236](https://arxiv.org/abs/2006.16236)
- \[18\] U\. Khandelwal et al\. \(2019\)\. Generalization through Memorization: Nearest Neighbor Language Models\. arXiv:1911\.00172\. [https://arxiv\.org/abs/1911\.00172](https://arxiv.org/abs/1911.00172)
- \[19\] W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\)\. Efficient Memory Management for Large Language Model Serving with PagedAttention\. arXiv:2309\.06180\. [https://arxiv\.org/abs/2309\.06180](https://arxiv.org/abs/2309.06180)
- \[20\] P\. Lewis et al\. \(2020\)\. Retrieval\-Augmented Generation for Knowledge\-Intensive NLP Tasks\. arXiv:2005\.11401\. [https://arxiv\.org/abs/2005\.11401](https://arxiv.org/abs/2005.11401)
- \[21\] Mistral AI Team and NVIDIA \(2024\)\. Mistral NeMo\. Official model release\. [https://mistral\.ai/news/mistral\-nemo/](https://mistral.ai/news/mistral-nemo/)
- \[22\] T\. Munkhdalai, M\. Faruqui, and S\. Gopal \(2024\)\. Leave No Context Behind: Efficient Infinite Context Transformers with Infini\-attention\. arXiv:2404\.07143\. [https://arxiv\.org/abs/2404\.07143](https://arxiv.org/abs/2404.07143)
- \[23\] V\. Pandey and G\. Singh \(2026\)\. Variational Linear Attention: Stable Associative Memory for Long\-Context Transformers\. arXiv:2605\.11196\. [https://arxiv\.org/abs/2605\.11196](https://arxiv.org/abs/2605.11196)
- \[24\] Qwen et al\. \(2024\)\. Qwen2\.5 Technical Report\. arXiv:2412\.15115\. [https://arxiv\.org/abs/2412\.15115](https://arxiv.org/abs/2412.15115)
- \[25\] J\. W\. Rae et al\. \(2019\)\. Compressive Transformers for Long\-Range Sequence Modelling\. arXiv:1911\.05507\. [https://arxiv\.org/abs/1911\.05507](https://arxiv.org/abs/1911.05507)
- \[26\] Y\. Sun et al\. \(2023\)\. Retentive Network: A Successor to Transformer for Large Language Models\. arXiv:2307\.08621\. [https://arxiv\.org/abs/2307\.08621](https://arxiv.org/abs/2307.08621)
- \[27\] Team GLM et al\. \(2024\)\. ChatGLM: A Family of Large Language Models from GLM\-130B to GLM\-4 All Tools\. arXiv:2406\.12793\. [https://arxiv\.org/abs/2406\.12793](https://arxiv.org/abs/2406.12793)
- \[28\] Kimi Team et al\. \(2025\)\. Kimi Linear: An Expressive, Efficient Attention Architecture\. arXiv:2510\.26692\. [https://arxiv\.org/abs/2510\.26692](https://arxiv.org/abs/2510.26692)
- \[29\] W\. Wang et al\. \(2023\)\. Augmenting Language Models with Long\-Term Memory\. arXiv:2306\.07174\. [https://arxiv\.org/abs/2306\.07174](https://arxiv.org/abs/2306.07174)
- \[30\] Y\. Wu, M\. N\. Rabe, D\. Hutchins, and C\. Szegedy \(2022\)\. Memorizing Transformers\. arXiv:2203\.08913\. [https://arxiv\.org/abs/2203\.08913](https://arxiv.org/abs/2203.08913)
- \[31\] S\. Yang, J\. Kautz, and A\. Hatamizadeh \(2024\)\. Gated Delta Networks: Improving Mamba2 with Delta Rule\. arXiv:2412\.06464\. [https://arxiv\.org/abs/2412\.06464](https://arxiv.org/abs/2412.06464)
- \[32\] J\. Yuan, H\. Gao, D\. Dai, J\. Luo, L\. Zhao, Z\. Zhang, Z\. Xie, Y\. X\. Wei, L\. Wang, Z\. Xiao, Y\. Wang, C\. Ruan, M\. Zhang, W\. Liang, and W\. Zeng \(2025\)\. Native Sparse Attention: Hardware\-Aligned and Natively Trainable Sparse Attention\. arXiv:2502\.11089\. [https://arxiv\.org/abs/2502\.11089](https://arxiv.org/abs/2502.11089)
- \[33\] M\. Zaheer et al\. \(2020\)\. Big Bird: Transformers for Longer Sequences\. arXiv:2007\.14062\. [https://arxiv\.org/abs/2007\.14062](https://arxiv.org/abs/2007.14062)
