Parallel Manifold Steering: Efficient Adaptation of Large Associative Memories via Residual Energy Shaping

Summary

This paper proposes H-Res, a method to adapt large transformer models by shaping the energy landscape of associative memories without modifying weights or adding prompts, preserving memory capacity and outperforming LoRA.

arXiv:2606.24396v1 Announce Type: new

# Parallel Manifold Steering: Efficient Adaptation of Large Associative Memories via Residual Energy Shaping
Source: [https://arxiv.org/html/2606.24396](https://arxiv.org/html/2606.24396)
###### Abstract

Large Transformer models function as Dense Associative Memories (DAMs), retrieving knowledge via high-dimensional attractor dynamics driven by the self-attention mechanism (Ramsauer et al., [2020](https://arxiv.org/html/2606.24396#bib.bib1); wu2024attention). However, adapting these frozen memory systems to new tasks presents a fundamental “Plasticity-Stability” dilemma. Current methods either risk catastrophic interference by modifying synaptic weights directly (e.g., LoRA) (Hu et al., [2021](https://arxiv.org/html/2606.24396#bib.bib2)) or degrade associative capacity by clogging the retrieval buffer with static prompt tokens (e.g., VPT) (Jia et al., [2022](https://arxiv.org/html/2606.24396#bib.bib3)). In this work, we propose **H-Res** (Hierarchical Residual Steering), a mechanism that modulates the effective energy landscape of the Transformer without altering its global equilibrium or expanding its sequence length. By formulating adaptation as a control problem on the activation manifold (Chen et al., [2018](https://arxiv.org/html/2606.24396#bib.bib16)), H-Res learns a state-dependent vector field that steers token trajectories into task-specific basins of attraction. We formally prove that H-Res preserves the attention entropy of the foundation model and facilitates Neural Collapse (Papyan et al., [2020](https://arxiv.org/html/2606.24396#bib.bib7)). Empirically, Manifold Steering outperforms global weight modification by 26% on associative retrieval tasks and eliminates the computational overhead of prompt-based methods, scaling effectively to structured domains (Zhai et al., [2019](https://arxiv.org/html/2606.24396#bib.bib11)).

## 1 Introduction

The convergence of modern Deep Learning and classical Neuroscience has revealed a unified perspective: large-scale Transformers are not merely feed-forward function approximators but Associative Memory Networks governed by energy minimization principles (Krotov and Hopfield, [2016](https://arxiv.org/html/2606.24396#bib.bib4); Han et al., [2023](https://arxiv.org/html/2606.24396#bib.bib29)). In this framework, the pre-trained weights of a Large Language Model (LLM) or Vision Transformer (ViT) (Dosovitskiy et al., [2021](https://arxiv.org/html/2606.24396#bib.bib5); Radford et al., [2019](https://arxiv.org/html/2606.24396#bib.bib10)) define a complex high-dimensional energy landscape $E(\mathbf{x})$, where “correct” outputs correspond to deep local minima (attractors).

The challenge of Adaptation (fine-tuning a general-purpose memory for a specific downstream task) is fundamentally a problem of reshaping this energy landscape. The ideal adaptation mechanism should create a new, task-specific basin of attraction local to the input query, without destroying the global structure of the pre-trained memories (Catastrophic Forgetting) and without reducing the bandwidth available for memory retrieval.

![Refer to caption](https://arxiv.org/html/2606.24396v1/figures/energy_landscape_3d.png) (a) Manifold Steering on Energy Landscape
![Refer to caption](https://arxiv.org/html/2606.24396v1/x1.png) (b) Vector Field: LoRA (Chaotic) vs H-Res (Convergent)

Figure 1: The Geometry of Adaptation. (a) While standard training might trap a model in a pre-trained local minimum (Red), H-Res introduces a residual force field that steers the latent state across energy barriers into the task-optimal global minimum (Cyan). (b) Comparing the gradient fields: LoRA’s global weight shifts induce chaotic updates (Left), while H-Res learns a smooth, convergent vector field directing states to the attractor (Right).

### 1.1 The Adaptation Dilemma in Associative Systems

Current approaches to adapting these massive memory systems suffer from distinct theoretical flaws when viewed through the lens of dynamical systems:

- Global Deformation (Synaptic Modification): Methods like Low-Rank Adaptation (LoRA) (Hu et al., [2021](https://arxiv.org/html/2606.24396#bib.bib2); Dettmers et al., [2024](https://arxiv.org/html/2606.24396#bib.bib25)) modify the synaptic weights $W$ directly ($W' = W + \Delta W$). While efficient (Aghajanyan et al., [2021](https://arxiv.org/html/2606.24396#bib.bib26)), this acts as a global deformation of the energy landscape. Even a low-rank update shifts the equilibrium for all memories stored in the network. This introduces Interference, where the gradients of the new task distort the retrieval dynamics of the pre-trained knowledge (McCandlish et al., [2018](https://arxiv.org/html/2606.24396#bib.bib12)).
- Buffer Congestion (Context Expansion): Visual Prompt Tuning (VPT) (Jia et al., [2022](https://arxiv.org/html/2606.24396#bib.bib3)) and Prefix Tuning (Li and Liang, [2021](https://arxiv.org/html/2606.24396#bib.bib8)) attempt to steer the model by injecting learnable “context vectors” (prompts) into the input sequence. In associative memory terms, this is equivalent to crowding the retrieval buffer. By appending $p$ prompt tokens to a sequence of length $N$, these methods increase the retrieval complexity from $O(N^2)$ to $O((N+p)^2)$ and dilute the probability mass of the attention mechanism (Vaswani et al., [2017](https://arxiv.org/html/2606.24396#bib.bib6)), weakening the signal-to-noise ratio of true associative recall.
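As a rough worked example of the cost argument in the second item, take $N = 197$ tokens and $p = 50$ prompts. Both values are illustrative assumptions, not settings reported in the paper. The quadratic attention cost then grows by

$$\frac{(N+p)^2}{N^2} = \left(\frac{247}{197}\right)^2 \approx 1.57,$$

and, in the simplest case where every token receives the same attention logit, the share of attention mass left for the original tokens is

$$\frac{N}{N+p} = \frac{197}{247} \approx 0.80.$$

The second figure is only a baseline: prompts trained to be salient would be expected to take a larger share than the uniform case suggests.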

## 2 Methodology

We introduce H-Res (Hierarchical Residual Steering), a method that rejects both global weight modification and context expansion. Instead, H-Res operates by injecting a residual control signal directly into the state evolution of the network, inspired by Residual Adapters (Rebuffi et al., [2017](https://arxiv.org/html/2606.24396#bib.bib28); Houlsby et al., [2019](https://arxiv.org/html/2606.24396#bib.bib9)) and Neural ODEs (Chen et al., [2018](https://arxiv.org/html/2606.24396#bib.bib16)).

### 2.1 Manifold Steering: The Vector Field

Let $z_l \in \mathbb{R}^{N \times d}$ be the latent state at layer $l$. If we view a Transformer layer as a discrete dynamical system updating a state $z_l$ to $z_{l+1}$, H-Res introduces a parallel control term $\mathcal{H}(z_l)$:

$$z_{l+1} = \text{Attn}(z_l) + \text{FFN}(z_l) + \lambda \cdot \mathcal{H}_{\theta}(z_l) \tag{1}$$

Here, $\mathcal{H}_{\theta}(z_l)$ acts as a learnable vector field on the activation manifold. It is parameterized as a bottleneck Multi-Layer Perceptron (MLP) using the GeLU activation (Hendrycks and Gimpel, [2016](https://arxiv.org/html/2606.24396#bib.bib21)) to enforce a low-rank constraint on the control signal:

$$\mathcal{H}_{\theta}(x) = W_{up} \cdot \sigma(W_{down} \cdot x) \tag{2}$$

where $W_{down} \in \mathbb{R}^{r \times d}$ projects the high-dimensional state onto a low-dimensional “control manifold”, and $W_{up} \in \mathbb{R}^{d \times r}$ projects the correction back. $r \ll d$ is the bottleneck rank (typically $r = 32$). Because $\mathcal{H}$ is additive and state-dependent (Zhang et al., [2020](https://arxiv.org/html/2606.24396#bib.bib23)), it steers the trajectory only when the input state enters the receptive field of the task. Note that while we term this “Manifold Steering,” it functions as a parallel residual adapter that is architecturally orthogonal to (separate from) the frozen backbone, avoiding direct interference with the pre-trained weights.
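Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `HResAdapter`, `steered_layer`, `frozen_attn` and `frozen_ffn` are hypothetical, the two frozen sublayers are random stand-ins for a real backbone, and the initialization scale of `W_down` is a guess (the text only specifies the zero initialization of `W_up`, in Section 2.3).

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GeLU (Hendrycks and Gimpel, 2016).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class HResAdapter:
    """Bottleneck control term of Eq. (2), applied independently to each token."""

    def __init__(self, d, r=32, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: small Gaussian init for the down-projection.
        self.W_down = rng.normal(0.0, 0.02, size=(r, d))
        # Zero init for the up-projection, as described in Section 2.3.
        self.W_up = np.zeros((d, r))

    def __call__(self, z):
        # z has shape (N, d); rows are tokens, so the matrices act from the right.
        return gelu(z @ self.W_down.T) @ self.W_up.T

def steered_layer(z, attn, ffn, adapter, lam=1.0):
    # Eq. (1): frozen sublayers plus the scaled parallel control term.
    return attn(z) + ffn(z) + lam * adapter(z)

# Stand-in frozen sublayers (hypothetical; a real backbone would go here).
rng = np.random.default_rng(1)
N, d = 8, 64
mix = rng.normal(size=(d, d)) / np.sqrt(d)

def frozen_attn(z):
    return z @ mix

def frozen_ffn(z):
    return np.tanh(z)

z = rng.normal(size=(N, d))
adapter = HResAdapter(d, r=32)
out = steered_layer(z, frozen_attn, frozen_ffn, adapter)
n_trainable = adapter.W_down.size + adapter.W_up.size  # 2 * r * d parameters
```

The sequence length `N` is unchanged by the adapter, and the trainable parameter count per layer is $2rd$, which is where the low-rank constraint shows up in practice.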

### 2.2 Energy Minimization Dynamics

Following Ramsauer et al. ([2020](https://arxiv.org/html/2606.24396#bib.bib1)), the update rule of the self-attention mechanism can be viewed as minimizing an energy function $E(\xi)$ via a concave-convex procedure. The standard update is:

$$\xi^{new} = \text{softmax}(\beta W_Q W_K^T) W_V \tag{3}$$

which corresponds to minimizing the Lagrangian of the Hopfield energy. H-Res modifies this dynamic by adding a residual gradient term $\mathcal{H}(\xi)$ that effectively reshapes the local optimization landscape without altering the global energy function:

$$\xi^{final} = \xi^{new} + \mathcal{H}(\xi) \approx \xi^{new} - \nabla_{\xi} E_{task}(\xi) \tag{4}$$

where $\mathcal{H} \approx -\nabla E_{task}$, so the added term moves the state down the task energy.
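A toy instance of this two-part update is sketched below. It uses the pattern-matrix form of the retrieval step from Ramsauer et al. (2020), $\xi^{new} = X\,\text{softmax}(\beta X^T \xi)$, and a quadratic task energy $E_{task}(\xi) = \tfrac{1}{2}\lVert \xi - t \rVert^2$. The stored patterns, the target `t`, the step size `eta` and the quadratic energy are all illustrative assumptions. In H-Res the control term is learned rather than given in closed form.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def hopfield_update(X, xi, beta):
    # One retrieval step; the columns of X are the stored patterns.
    return X @ softmax(beta * (X.T @ xi))

def control_term(xi, target, eta):
    # H(xi) = -eta * grad E_task(xi) for E_task = 0.5 * ||xi - target||^2.
    return -eta * (xi - target)

# Four orthogonal stored patterns (illustrative).
X = 3.0 * np.eye(4)
xi = X[:, 0] + np.array([0.0, 0.2, -0.1, 0.1])  # noisy query near pattern 0
target = X[:, 2]                                 # hypothetical task-preferred state

xi_new = hopfield_update(X, xi, beta=2.0)             # frozen retrieval step
xi_final = xi_new + control_term(xi, target, eta=0.3)  # steered state

dist_before = np.linalg.norm(xi_new - target)
dist_after = np.linalg.norm(xi_final - target)
```

With these values the frozen step retrieves pattern 0 almost exactly, and the control term moves the result part of the way toward the task target without changing the stored patterns `X`.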

### 2.3 Zero-Initialization: Preserving the Energy Minimum

A critical flaw in Prompt Tuning strategies is the Initialization Shock. Randomly initialized prompts distort the attention probability distribution at $t = 0$. To address this, we explicitly initialize the up-projection matrix $W_{up}$ to zeros:

$$W_{up} \leftarrow \mathbf{0} \implies \mathcal{H}_{\theta_{init}}(z) = \mathbf{0} \tag{5}$$

This ensures that at initialization, the control signal is null, and the effective update rule is exactly the pre-trained model. This property guarantees that H-Res begins optimization from the global minimum of the pre-trained energy landscape, allowing for smooth trajectory optimization (Lian et al., [2022](https://arxiv.org/html/2606.24396#bib.bib24)).
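A short derivation shows why training can still leave this null state. It assumes a gradient-based optimizer and a non-zero initialization of $W_{down}$, neither of which is specified in the text. Write $a = \sigma(W_{down} x)$ for the bottleneck activation and $g = \partial \mathcal{L} / \partial \mathcal{H}_{\theta}(x)$ for the loss gradient arriving at the adapter output. Applying the chain rule to Eq. (2) gives

$$\frac{\partial \mathcal{L}}{\partial W_{up}} = g\, a^{T}, \qquad \frac{\partial \mathcal{L}}{\partial W_{down}} = \left[ \left( W_{up}^{T} g \right) \odot \sigma'(W_{down} x) \right] x^{T}.$$

At $W_{up} = \mathbf{0}$ the second gradient vanishes while the first is generally non-zero, so the first update changes only $W_{up}$, and $W_{down}$ starts to receive gradient from the second step onward.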

### 2.4 Theoretical Proof: Attention Entropy and Fidelity

We formally prove that H-Res preserves the Associative Bandwidth of the foundation model.

Lemma 1 (VPT Entropy Expansion): In the VPT framework, the sequence length increases to $N+p$. The new attention distribution $A'_{cls}$ is defined over $N+p$ elements. Because learned prompts $P$ are optimized for saliency, they attract probability mass from visual patches $X$, increasing the Shannon Entropy and blurring retrieval (Bahri et al., [2020](https://arxiv.org/html/2606.24396#bib.bib19)).

Lemma 2 (H-Res Fidelity Preservation): H-Res operates on a constant sequence length $N$. Since the adapter is applied parallel to the self-attention block (He et al., [2016](https://arxiv.org/html/2606.24396#bib.bib18)), the attention weights remain untouched by synthetic tokens. The entropy $H(A_{cls})$ remains minimal, preserving the “spatial eye” of the foundation model.
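The dilution effect described in Lemma 1 can be checked numerically on a single attention row. The logits below are made-up values chosen for illustration, not measurements from any model. The drop in mass on the original patches holds for any finite prompt logits, because the softmax denominator can only grow. The entropy change depends on the logits, so the increase seen here should be read as one example rather than a general guarantee.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; zero entries contribute nothing.
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Hypothetical CLS-to-patch attention logits for N = 6 patches.
patch_logits = np.array([2.0, 0.5, 0.0, -0.5, 1.0, 0.2])
# Hypothetical logits for p = 3 learned prompt tokens.
prompt_logits = np.array([1.5, 1.5, 1.5])

a = softmax(patch_logits)                                        # frozen model
a_vpt = softmax(np.concatenate([patch_logits, prompt_logits]))   # with prompts

n = len(patch_logits)
patch_mass = float(a_vpt[:n].sum())   # attention mass left on real patches
h_base, h_vpt = entropy(a), entropy(a_vpt)

print(f"mass on patches: 1.000 -> {patch_mass:.3f}")
print(f"entropy (nats): {h_base:.3f} -> {h_vpt:.3f}")
```

Under H-Res the row `a` is left as it is, since no tokens are added to the sequence.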

### 2.5 Multi-Task Orthogonality via Null-Space Projection

To ensure that an expert for Task B does not disrupt the manifold of Task A, we implement a Null-Space Projection (NSP). Let $\Sigma_{prev}$ be the covariance matrix of the hidden features for all previous tasks. We project the gradients of the new task into the null space of $\Sigma_{prev}$:

$$\nabla\theta_{new} \leftarrow \left( I - \Sigma_{prev} \left( \Sigma_{prev}^{T} \Sigma_{prev} \right)^{-1} \Sigma_{prev}^{T} \right) \nabla\theta_{new} \tag{6}$$

This ensures that the residual “nudge” is mathematically invisible to the feature spaces of prior tasks (Power et al., [2022](https://arxiv.org/html/2606.24396#bib.bib27)).
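The projection in Eq. (6) can be sketched as follows. The term $\Sigma(\Sigma^T\Sigma)^{-1}\Sigma^T$ is the orthogonal projector onto the column space of $\Sigma$, so with an orthonormal basis $U$ of the prior-task feature subspace it reduces to $UU^T$. The sketch takes $U$ from an SVD of the feature matrix, which avoids inverting a rank-deficient matrix. Three things here are assumptions: the helper name `null_space_projector`, the choice of subspace rank `k`, and the simplification of treating the gradient as a vector in feature space.

```python
import numpy as np

def null_space_projector(features, k):
    """Return P = I - U U^T, where U spans the top-k feature directions.

    features: (n_samples, d) hidden features collected on prior tasks.
    k: assumed rank of the prior-task subspace (a hyperparameter here).
    """
    _, _, Vt = np.linalg.svd(features, full_matrices=False)
    U = Vt[:k].T                                   # (d, k), orthonormal columns
    return np.eye(features.shape[1]) - U @ U.T

rng = np.random.default_rng(0)
d, k = 12, 4

# Synthetic prior-task features that lie exactly in a k-dimensional subspace.
basis = rng.normal(size=(k, d))
prev_features = rng.normal(size=(200, k)) @ basis

P = null_space_projector(prev_features, k)

grad = rng.normal(size=d)   # stand-in for a new-task gradient
grad_proj = P @ grad        # component invisible to prior-task features

leak = np.abs(prev_features @ grad_proj).max()
```

Here `leak` is zero up to floating-point error, which is the sense in which an update along `grad_proj` leaves the responses to prior-task features unchanged.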

## 3 Empirical Evaluation

We evaluate H-Res against LoRA (Hu et al., [2021](https://arxiv.org/html/2606.24396#bib.bib2)) and Soft Prompting (VPT) (Jia et al., [2022](https://arxiv.org/html/2606.24396#bib.bib3)) on SQuAD (Associative Retrieval), WikiText (Generative Dynamics), and VTAB-1k (Visual Adaptation).

### 3.1 Efficiency vs. Fidelity Trade-off

![Refer to caption](https://arxiv.org/html/2606.24396v1/x2.png)

Figure 2: Efficiency vs. Fidelity Pareto Frontier. Left Axis (Red): SQuAD Retrieval Loss (Lower is better). H-Res achieves a lower retrieval loss (3.78) than LoRA (5.17) and VPT (5.61). Right Axis (Blue): WikiText Generation Speed (Higher is better). H-Res matches the speed of LoRA and outperforms VPT, confirming the theoretical $O(N^2)$ advantage.

As shown in Figure [2](https://arxiv.org/html/2606.24396#S3.F2), H-Res dominates the Pareto frontier. On SQuAD, H-Res achieves a validation loss of 3.78, a 26% improvement over LoRA. This confirms our hypothesis that global weight deformation distorts the fine-grained attractors. Furthermore, H-Res avoids the computational penalty of VPT, maintaining high throughput for generation tasks (Devlin et al., [2019](https://arxiv.org/html/2606.24396#bib.bib20); Touvron et al., [2021](https://arxiv.org/html/2606.24396#bib.bib22)).
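The relative improvements follow directly from the three losses quoted in Figure 2:

$$\frac{5.17 - 3.78}{5.17} = \frac{1.39}{5.17} \approx 0.269, \qquad \frac{5.61 - 3.78}{5.61} = \frac{1.83}{5.61} \approx 0.326.$$

The first ratio corresponds to the reported 26% improvement over LoRA. The second, relative to VPT, is not stated in the text and is computed here only from the quoted values.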

### 3.2 Visual Adaptation (VTAB-1k)

We benchmark H-Res V2600 against VPT on the VTAB-1k suite (Zhai et al., [2019](https://arxiv.org/html/2606.24396#bib.bib11)).

Table 1: Main Results: H-Res V2600 vs. Visual Prompt Tuning (VPT)

H-Res outperforms VPT in natural domains (59.37% vs 58.90

### 3.3 Ablation Study

Table [2](https://arxiv.org/html/2606.24396#S3.T2) shows that H-Res scales more effectively than VPT. While increasing prompt length in VPT can lead to optimization instability (accuracy drops from 76.54% to 70.48

Table 2: Ablation Study: H-Res vs. VPT on Latent Adaptation Tasks

## 4 Discussion

### 4.1 Manifold Steering vs. Global Deformation

The success of H-Res suggests a paradigm shift in PEFT. Rather than modifying the memories themselves (weights) or the queries (prompts), we should modify the dynamics of retrieval. By learning a residual vector field, H-Res effectively "surfs" the pre-trained energy landscape (Sohl-Dickstein et al., [2015](https://arxiv.org/html/2606.24396#bib.bib34)).

### 4.2 Generalization to Non-Transformer Architectures (SSMs)

Unlike Prompt Tuning, which relies on the $O(N^2)$ attention mechanism to integrate prompts, H-Res is model-agnostic. It operates entirely in the residual stream, making it naturally compatible with emerging sub-quadratic architectures like Mamba (Gu and Dao, [2023](https://arxiv.org/html/2606.24396#bib.bib31)) and S4 (Gu et al., [2022](https://arxiv.org/html/2606.24396#bib.bib32)). In these State Space Models (SSMs), the hidden state $h_t$ is updated via a linear recurrence. Inserting extra "prompt tokens" disrupts the continuous-time approximation of these models. H-Res, however, can act as a "Control Input" $u(t)$ in the state equation $\dot{h}(t) = Ah(t) + Bu(t)$, enabling efficient adaptation of SSMs without architectural modification.
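One way to read the control-input idea is sketched below on a toy linear SSM. The text does not say how the control signal would be wired in, so everything here is an assumption made for illustration: the additive form `u = x + lam * H(h)`, the forward-Euler discretization, the use of `tanh` in place of GeLU, and the helper names `steering` and `simulate`.

```python
import numpy as np

def steering(h, W_down, W_up):
    # Bottleneck control term computed from the current hidden state.
    return W_up @ np.tanh(W_down @ h)

def simulate(xs, A, B, W_down, W_up, lam=1.0, dt=0.1):
    """Forward-Euler rollout of dh/dt = A h + B u with u = x + lam * H(h)."""
    h = np.zeros(A.shape[0])
    traj = []
    for x in xs:
        u = x + lam * steering(h, W_down, W_up)   # assumed wiring
        h = h + dt * (A @ h + B @ u)
        traj.append(h.copy())
    return np.array(traj)

rng = np.random.default_rng(0)
n, m, r, T = 4, 2, 2, 20           # state dim, input dim, bottleneck rank, steps
A = -0.5 * np.eye(n)               # frozen, stable state matrix
B = rng.normal(size=(n, m))        # frozen input matrix
xs = rng.normal(size=(T, m))       # input sequence; its length never changes
W_down = rng.normal(size=(r, n))

base = simulate(xs, A, B, W_down, np.zeros((m, r)))           # zero-init adapter
steered = simulate(xs, A, B, W_down, 0.5 * np.ones((m, r)))   # non-zero adapter
```

With the up-projection at zero the rollout matches the frozen model exactly, and in both cases the recurrence runs over the same `T` steps, so no extra tokens enter the sequence.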

### 4.3 The Thermodynamics of Adaptation

H-Res facilitates Neural Collapse (Papyan et al., [2020](https://arxiv.org/html/2606.24396#bib.bib7)), where intra-class features converge to the class mean. The residual adapter acts as a Maxwell’s Demon, reducing the entropy of the latent state by filtering out task-irrelevant noise (higher energy states) and funneling trajectories into low-energy attractors. This thermodynamic perspective aligns with recent findings on the statistical mechanics of deep learning (Bahri et al., [2020](https://arxiv.org/html/2606.24396#bib.bib19)), suggesting that adaptation is equivalent to cooling the system into a new ordered phase.
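The collapse of intra-class features toward the class mean can be quantified with a within-class to between-class scatter ratio. The function below is a simplified trace-based proxy for the variability metric of Papyan et al. (2020), not their exact definition. The synthetic features and the contraction factor `alpha` are illustrative assumptions standing in for the effect of a trained adapter.

```python
import numpy as np

def within_between_ratio(feats, labels):
    # trace(S_within) / trace(S_between); smaller means tighter class clusters.
    mu_global = feats.mean(axis=0)
    s_within, s_between = 0.0, 0.0
    for c in np.unique(labels):
        fc = feats[labels == c]
        mu_c = fc.mean(axis=0)
        s_within += ((fc - mu_c) ** 2).sum()
        s_between += len(fc) * ((mu_c - mu_global) ** 2).sum()
    return s_within / s_between

def contract_to_class_means(feats, labels, alpha):
    # Pull every feature toward its class mean by a factor alpha in (0, 1).
    out = feats.copy()
    for c in np.unique(labels):
        mask = labels == c
        mu_c = feats[mask].mean(axis=0)
        out[mask] = mu_c + alpha * (feats[mask] - mu_c)
    return out

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 50)
centers = np.array([[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]])
feats = centers[labels] + rng.normal(size=(150, 2))   # synthetic latent states

before = within_between_ratio(feats, labels)
after = within_between_ratio(contract_to_class_means(feats, labels, 0.1), labels)
```

Because the contraction leaves the class means fixed, the ratio falls by exactly `alpha ** 2`, which makes the metric easy to sanity-check before applying it to real activations.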

## 5 Conclusion

We have presented H-Res, a framework that resolves the Plasticity-Stability dilemma in Associative Memories via Parallel Residual Steering. By replacing input-space prompting with latent-space manifold modulation, H-Res preserves the associative capacity, sequence length, and energy landscape of the pre-trained model. Our results confirm that H-Res is not only more efficient ($O(N^2)$) but also uniquely capable of maintaining high-fidelity associative retrieval in complex cognitive tasks, setting the stage for universal adaptation in next-generation architectures like Mamba.

## References

- A. Aghajanyan, L. Zettlemoyer, and S. Gupta (2021). Intrinsic dimensionality explains the effectiveness of language model fine-tuning. ACL.
- Y. Bahri, J. Kadmon, S. Ganguli, et al. (2020). Statistical mechanics of deep learning. Annual Review of Condensed Matter Physics.
- R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud (2018). Neural ordinary differential equations. NeurIPS 31.
- T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer (2024). QLoRA: efficient finetuning of quantized LLMs. NeurIPS.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. NAACL.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, et al. (2021). An image is worth 16x16 words: transformers for image recognition at scale. ICLR.
- A. Gu and T. Dao (2023). Mamba: linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
- A. Gu, K. Goel, and C. Ré (2022). Efficiently modeling long sequences with structured state spaces. ICLR.
- X. Y. Han et al. (2023). Associative memory in transformers. ICLR Workshop on Associative Memory.
- K. He, X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition. CVPR.
- D. Hendrycks and K. Gimpel (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
- N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, et al. (2019). Parameter-efficient transfer learning for NLP. ICML.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, et al. (2021). LoRA: low-rank adaptation of large language models. ICLR.
- M. Jia, L. Tang, B. Chen, C. Cardie, S. Belongie, et al. (2022). Visual prompt tuning. ECCV.
- D. Krotov and J. J. Hopfield (2016). Dense associative memory for pattern recognition. NeurIPS 29.
- X. L. Li and P. Liang (2021). Prefix-tuning: optimizing continuous prompts for generation. ACL.
- D. Lian, D. Zhou, J. Feng, and X. Wang (2022). Scaling & shifting your features: a new baseline for efficient model tuning. NeurIPS.
- S. McCandlish, J. Kaplan, D. Amodei, and OpenAI Dota Team (2018). An empirical model of large-batch training. arXiv preprint arXiv:1812.06162.
- V. Papyan, X. Y. Han, and D. L. Donoho (2020). Prevalence of neural collapse during the terminal phase of deep learning training. PNAS 117.
- A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra (2022). Grokking: generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177.
- A. Radford, J. Wu, R. Child, D. Luan, et al. (2019). Language models are unsupervised multitask learners. OpenAI blog.
- H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, et al. (2020). Hopfield networks is all you need. ICLR.
- S. Rebuffi, H. Bilen, and A. Vedaldi (2017). Learning multiple visual domains with residual adapters. NeurIPS.
- J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015). Nonequilibrium thermodynamics of stochastic learning. arXiv preprint arXiv:1506.03233.
- H. Touvron, M. Cord, M. Douze, F. Massa, et al. (2021). Training data-efficient image transformers & distillation through attention. ICML.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, et al. (2017). Attention is all you need. NeurIPS 30.
- X. Zhai, J. Puigcerver, A. Kolesnikov, P. Ruyssen, et al. (2019). The visual task adaptation benchmark. arXiv preprint arXiv:1910.04867.
- J. O. Zhang, A. Sax, A. Zamir, L. Guibas, and J. Malik (2020). Side-tuning: a baseline for network adaptation via additive side networks. ECCV.
