SHARP: Sleep-based Hierarchical Accelerated Replay for Long Range Non-Stationary Temporal Pattern Recognition
Summary
SHARP introduces a bio-inspired framework that separates memory accumulation from pattern recognition, using accelerated replay during offline sleep phases to learn long-range non-stationary temporal patterns in streaming settings. It improves context retention on text8 and PG-19 while maintaining computational efficiency.
# SHARP: Sleep-based Hierarchical Accelerated Replay for Long Range Non-Stationary Temporal Pattern Recognition
Source: [https://arxiv.org/html/2606.00732](https://arxiv.org/html/2606.00732)
Jayanta Dey¹, Shikhar Srivastava², Itamar Lerner³, Christopher Kanan², Dhireesha Kudithipudi¹
¹Department of Computer Engineering, University of Texas at San Antonio, USA; ²Department of Computer Science, University of Rochester, USA; ³Department of Psychology, University of Texas at San Antonio, USA
jayanta.dey@utsa.edu, ssrivas9@ur.rochester.edu, itamar.lerner@utsa.edu, ckanan@cs.rochester.edu, dhireesha.kudithipudi@utsa.edu
###### Abstract
Learning long-range non-stationary temporal patterns remains a core challenge for modern sequence models, particularly in strict streaming settings. In these settings, data arrive sequentially and must be processed in a single pass without simultaneously revisiting past observations. For long-range credit assignment, standard architectures, including recurrent neural networks and transformers, are constrained by either the truncated backpropagation-through-time horizon or the explicit input window length. To address these limitations, we propose SHARP (Sleep-based Hierarchical Accelerated Replay), a framework that decomposes temporal learning into two complementary components: a memory module that accumulates a structured history of past inputs, and a pattern-recognition module that operates over this memory. This separation enables resource- and compute-efficient adaptation to non-stationary dynamics by eliminating the need for backpropagation through time across many steps for long-range credit assignment. Inspired by the accelerated replay observed in rodents during slow-wave sleep, SHARP incorporates offline (sleep) phases in which temporally structured memory traces are replayed in an accelerated form and integrated into higher-level memory representations, improving long-range context retention. Through controlled simulations and ablation studies, we characterize the key properties of the proposed framework. On benchmark datasets such as text8 and PG-19, we demonstrate that SHARP improves over recurrent baselines by retaining next-token predictive performance on previously seen data while continuing to learn from the current stream and generalizing to future unseen data. These gains are enabled by its hierarchical structure, which yields an exponentially increasing effective temporal context with only linear-time computational cost.

Implementation of SHARP is available at [https://github.com/jdey4/sharp](https://github.com/jdey4/sharp).

## 1 Introduction
In many real-world settings, observations arrive sequentially without the possibility of revisiting past data. Learning algorithms must therefore continually integrate new information while preserving the structure of prior experience ([Harun et al.,](https://arxiv.org/html/2606.00732#bib.bib65)). This imposes a strict constraint: learning must proceed online, with limited opportunity for long-horizon credit assignment. The challenge is further exacerbated under distribution shift, where the underlying data-generating process evolves over time.
From a modeling perspective, continual learning under streaming constraints can be naturally formulated as a sequential learning problem. To generalize under these constraints, a system must retain information about past inputs even after they are no longer directly accessible. Classical sequence models such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) attempt to encode memory within recurrent dynamics. However, their effective memory is governed by backpropagation through time (BPTT), which limits credit assignment to a finite temporal horizon and introduces numerical instabilities such as vanishing and exploding gradients. Although recurrent models have a theoretically unbounded context memory, in practice their memory is lossy: information dissipates, interferes, or becomes entangled over time, restricting the reliable capture of long-range temporal structure (Bengio et al., [1994](https://arxiv.org/html/2606.00732#bib.bib5); Hochreiter and Schmidhuber, [1997](https://arxiv.org/html/2606.00732#bib.bib9)). A common consequence of the limited temporal horizon is the degradation of previously acquired knowledge as new information is incorporated, known as catastrophic forgetting (McCloskey and Cohen, [1989](https://arxiv.org/html/2606.00732#bib.bib52); McClelland et al., [1995](https://arxiv.org/html/2606.00732#bib.bib53); Doan et al., [2021](https://arxiv.org/html/2606.00732#bib.bib54); Vogelstein et al., [2025](https://arxiv.org/html/2606.00732#bib.bib44)). Indeed, one way to understand catastrophic forgetting is as a consequence of limited long-range credit assignment, which biases learning toward the current task and degrades generalization to past tasks.
Existing regularization-based approaches, developed to mitigate catastrophic forgetting, do not treat memory as an explicit structural component; instead, memory in these models emerges implicitly through gradient-based optimization. For instance, a neural network trained sequentially on multiple tasks tends to overwrite previously learned representations unless additional mechanisms, such as Elastic Weight Consolidation (EWC) (Kirkpatrick et al., [2017](https://arxiv.org/html/2606.00732#bib.bib26)) or Learning without Forgetting (LwF) (Li and Hoiem, [2017](https://arxiv.org/html/2606.00732#bib.bib62)), are introduced to preserve past knowledge in the weights. Alternatively, replay-based approaches maintain external buffers to revisit past samples (Shin et al., [2017](https://arxiv.org/html/2606.00732#bib.bib56); Chaudhry et al., [2019](https://arxiv.org/html/2606.00732#bib.bib59); van de Ven et al., [2020](https://arxiv.org/html/2606.00732#bib.bib55); Buzzega et al., [2020](https://arxiv.org/html/2606.00732#bib.bib60); Channappayya et al., [2023](https://arxiv.org/html/2606.00732#bib.bib61)). In these models, replay often serves a composite role, bundling together explicit data storage, old-task rehearsal, and the refinement of predictive models through supervised updates (Lopez-Paz and Ranzato, [2017](https://arxiv.org/html/2606.00732#bib.bib77); Chaudhry et al., [2018](https://arxiv.org/html/2606.00732#bib.bib76); Rolnick et al., [2019](https://arxiv.org/html/2606.00732#bib.bib58); Rebuffi et al., [2017](https://arxiv.org/html/2606.00732#bib.bib75); Shin et al., [2017](https://arxiv.org/html/2606.00732#bib.bib56); Buzzega et al., [2020](https://arxiv.org/html/2606.00732#bib.bib60)). Despite their differences, the above strategies treat memory either as constrained weight plasticity or as stored raw data, rather than as a structured dynamical system.
In general, an organized memory system is essential for generalization in a sequential and continually evolving environment (Dorovatas et al., [2026](https://arxiv.org/html/2606.00732#bib.bib66)).
Biological systems appear to circumvent the above limitations through structured memory organization that takes place, at least partially, during sleep (O’Reilly et al., [2014](https://arxiv.org/html/2606.00732#bib.bib49); Kumaran et al., [2016](https://arxiv.org/html/2606.00732#bib.bib48); Lutz et al., [2026](https://arxiv.org/html/2606.00732#bib.bib46)). In particular, evidence from rodents suggests that during slow-wave sleep (SWS), sequential memories of previously encoded experiences are reactivated in the hippocampus on a shorter timescale than originally experienced, essentially implementing accelerated (or “time-compressed”) replay. One recent theory, the temporal scaffolding hypothesis (TSH), posits that such accelerated replay has a functional role, enabling the consolidation of long-range associations that are difficult to form during online experience alone (Lerner, [2017](https://arxiv.org/html/2606.00732#bib.bib17); Lerner and Gluck, [2019](https://arxiv.org/html/2606.00732#bib.bib25)).
Figure 1: Conceptual overview of slow-wave sleep-based temporal learning. During wake, environmental interaction drives causal relation learning by updating plastic memory and pattern modules. Salient experiences are tagged and later replayed during sleep, where accelerated replay provides richer temporal context for memory consolidation.

Informed by TSH, the current work presents a new learning framework, SHARP (Sleep-based Hierarchical Accelerated Replay), that learns to detect temporal patterns using a “wake” phase and a “sleep” phase (Figure [1](https://arxiv.org/html/2606.00732#S1.F1)). The sleep phase incorporates an abstracted approximation of accelerated replay (explained in detail in Section [2.2](https://arxiv.org/html/2606.00732#S2.SS2)) that allows the system to traverse longer temporal contexts during consolidation than would be feasible during online learning, thus improving memory retention. Replay during sleep is restricted to unsupervised memory consolidation to allow better future memory retention and prediction, while the learning of causal and predictive relationships is driven exclusively by wake-time interactions with the environment. These two processes are accomplished within two separate modules: (i) a hierarchical memory module that accumulates experience without credit assignment from prediction and (ii) a hierarchical pattern-recognition module that operates over this memory to perform prediction (Figure [2](https://arxiv.org/html/2606.00732#S1.F2)). This separation avoids confounding memory storage with credit assignment and provides a stable substrate for pattern learning.

Figure 2: Sleep-based hierarchical accelerated replay framework. Left (Wake phase): The context knowledge base (upper context blocks) remains non-plastic while the model interacts with the environment. Lower context-block memories are progressively accelerated into their immediately higher blocks. Right (Sleep phase (SWS)): The context knowledge base is updated offline, while C-1 and P-1 stop receiving environmental input and feedback, respectively, and tagged wake experiences are replayed to the context knowledge base.

Unlike modern state space models (SSMs) (Gu et al., [2021](https://arxiv.org/html/2606.00732#bib.bib33); Gu and Dao, [2023](https://arxiv.org/html/2606.00732#bib.bib81)), where memory is encoded within a single dynamical system, our framework explicitly learns and organizes memory representations hierarchically across multiple layers (see Figure [2](https://arxiv.org/html/2606.00732#S1.F2)). During the wake phase, only the lowest layer remains plastic, accumulating recent experiences. Salient events are selectively tagged (Yang et al., [2024](https://arxiv.org/html/2606.00732#bib.bib45)) and later consolidated into higher layers during offline sleep phases through accelerated sequential replay. The higher layers thus form a stable context knowledge base, while lower layers capture rapidly changing dynamics. This hierarchical consolidation expands the effective temporal context available to downstream pattern-recognition modules in a familiar environment, while keeping most parameters frozen during online interaction. In this paper, we demonstrate that accelerated replay effectively extends the model’s usable context window, enabling it to capture long-range dependencies without requiring long-horizon BPTT.
In what follows, we describe the problem setting, formalize the desired properties of memory and acceleration, describe the model architecture and learning dynamics, and evaluate an instantiation of the proposed framework through controlled simulations and two benchmark datasets.
## 2 Technical Background
### 2.1 Problem Setting
Let $\{X_1, X_2, \cdots, X_t\}$ be a stochastic process, i.e., a sequence of random variables, where each variable takes a value from a finite set $\mathcal{A} = \{a_1, a_2, \cdots, a_K\} \subseteq \mathbb{R}^D$. The variable $X_t$ represents the state of the process at time $t$, governed by a set of underlying state transition probability laws $\mathcal{P} = \{\mathbf{p}_1, \mathbf{p}_2, \cdots, \mathbf{p}_t\}$. Each $\mathbf{p}_t \in \mathcal{P}$ defines the probability of transitioning to any state $a_i \in \mathcal{A}$, $\forall i = 1, \cdots, K$, given all previous states up to time $(t-1)$, that is,

$$\mathbf{p}_t = [p_t(a_i)]_{i=1}^{K} = [P(X_t = a_i \mid X_1, \cdots, X_{t-1})]_{i=1}^{K}. \tag{1}$$

Given a sequence of states evolving according to an unknown transition rule $\mathcal{P}$, a sequential learner $f: \mathcal{A}^T \rightarrow [0,1]^K$, having access to an input window of the past $T$ states, estimates $\mathbf{p}_t$:

$$\hat{\mathbf{p}}_t = [\hat{p}_t(a_i)]_{i=1}^{K} = f(x_{t-T}, \cdots, x_{t-1}), \tag{2}$$

where $x_t$ is the value of the random variable $X_t$ at time $t$. The state at time $t$ is estimated as the $\operatorname{arg\,max}$ of $\hat{\mathbf{p}}_t$:

$$\hat{x}_t = \operatorname*{arg\,max}_{a_i \in \mathcal{A}} \hat{p}_t(a_i), \quad \forall i = 1, \cdots, K. \tag{3}$$

If the transition probability between states, $\mathbf{p}_t$, depends on more past states than those captured within the input window, then the learner must integrate an internal mechanism to retain the memory of the previous states. The estimation accuracy of the next sample depends on how effectively the model retains and utilizes past information. Unlike traditional training setups, we adopt a one-pass learning regime in which the model observes training samples in an online streaming manner and cannot optimize for multiple parallel sequence segments simultaneously.
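The one-pass regime and the role of the window length $T$ can be illustrated with a minimal sketch. The count-based learner below is a stand-in chosen for brevity, not the model studied in this paper; the function name `stream_accuracy`, the Laplace-smoothed counts, and the predict-then-update ordering are all assumptions made for illustration.

```python
from collections import defaultdict, deque

def stream_accuracy(stream, alphabet, T):
    """One pass over `stream`: predict x_t from the last T symbols, then update.

    Hypothetical count-based learner (not the paper's model). Each symbol is
    seen exactly once and is used for a single update after it is predicted.
    """
    # Laplace-smoothed counts of the next symbol for every length-T window.
    counts = defaultdict(lambda: {a: 1 for a in alphabet})
    window = deque(maxlen=T)
    correct = total = 0
    for x in stream:
        if len(window) == T:
            c = counts[tuple(window)]
            z = sum(c.values())
            p_hat = {a: c[a] / z for a in alphabet}           # analogue of Eq. (2)
            x_hat = max(alphabet, key=lambda a: p_hat[a])     # analogue of Eq. (3)
            correct += int(x_hat == x)
            total += 1
            c[x] += 1                                         # single online update
        window.append(x)
    return correct / total if total else 0.0

# The repeating pattern "aabb" needs two past symbols to predict the next one.
stream = list("aabb" * 50)
acc_short = stream_accuracy(stream, ["a", "b"], T=1)
acc_long = stream_accuracy(stream, ["a", "b"], T=2)
```

On this toy stream the $T=2$ learner becomes nearly perfect after a few symbols, whereas the $T=1$ learner cannot resolve the ambiguity, which is the situation in which an internal memory mechanism becomes necessary.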
#### Memory encoding desiderata
While prior work has explored various forms of memory, including associative memory (Hopfield, [1982](https://arxiv.org/html/2606.00732#bib.bib70); Kanerva, [1988](https://arxiv.org/html/2606.00732#bib.bib71)) and sparse or quantized representations (Olshausen and Field, [1997](https://arxiv.org/html/2606.00732#bib.bib72); Van Den Oord et al., [2017](https://arxiv.org/html/2606.00732#bib.bib73); Razavi et al., [2019](https://arxiv.org/html/2606.00732#bib.bib74)), these approaches primarily emphasize storage or compression. In contrast, we focus on the *dynamics* of memory, i.e., how information about past inputs is continuously maintained and updated over time. To support stable downstream processing, we seek continuous latent states that preserve similarity structure across inputs, enabling pattern-recognition modules to operate on temporally coherent signals without explicit access to past samples (Ba et al., [2016](https://arxiv.org/html/2606.00732#bib.bib67)).
In our framework, a pattern-recognition map $f(\cdot)$ operates on a dynamic memory encoding $m(\cdot,\cdot)$ to produce the current state transition probability, i.e., $\hat{\mathbf{p}}_t = f\big(m(x_{t-1}, h_{t-1})\big)$, where $h_{t-1}$ denotes the previous memory state. We now describe desirable properties of the memory encoding $m(\cdot,\cdot)$.
Let $\mathcal{S} = \{S_1, S_2, \cdots\}$ denote the set of all length-$s$ sequences over the elements of $\mathcal{A}$, equipped with a metric $d_{\mathcal{S}}$. For a sequence $S_i \in \mathcal{S}$, let $h_i = m(S_i, h'_i) \in \mathcal{H} \subseteq \mathbb{R}^P$ denote the encoding after observing the $s$-th sample, where $(\mathcal{H}, d_{\mathcal{H}})$ is a metric space and $h'_i$ is the previous state. Ideally, we seek memory encodings that approximately preserve similarity structure across sequences, such that sequences that are close under $d_{\mathcal{S}}$ map to nearby representations under $d_{\mathcal{H}}$, up to bounded distortion. This can be expressed as:

$$\frac{1}{C}\, d_{\mathcal{S}}(S_i, S_j) \leq d_{\mathcal{H}}(h_i, h_j) \leq C\, d_{\mathcal{S}}(S_i, S_j),$$

for some constant $C > 0$. The upper bound ensures that similar sequences map to nearby representations, and the lower bound ensures that distinct sequences remain distinguishable. We define *memory capacity* as the size of the largest subset of $\mathcal{S}$ whose elements remain distinguishable under the above representation, and *memory span* as the temporal extent or sequence length $s$ over which information is retained. These terms serve as conceptual tools for characterizing memory encodings.
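The bounded-distortion condition can be checked numerically on a small set of sequences. The sketch below is illustrative only: the choice of Hamming distance for $d_{\mathcal{S}}$ and Euclidean distance for $d_{\mathcal{H}}$, the helper name `distortion_constant`, and both toy encoders are assumptions, since the text leaves the metrics and the encoder abstract.

```python
import math
from itertools import combinations, product

def hamming(s1, s2):
    """Assumed sequence metric d_S: number of positions that differ."""
    return sum(a != b for a, b in zip(s1, s2))

def euclidean(h1, h2):
    """Assumed encoding metric d_H: Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def distortion_constant(seqs, encode):
    """Smallest C >= 1 with d_S / C <= d_H <= C * d_S over all distinct pairs.

    Returns math.inf if two distinct sequences share an encoding, i.e. the
    lower bound cannot hold and the sequences are not distinguishable.
    """
    C = 1.0
    for s_i, s_j in combinations(seqs, 2):
        d_s = hamming(s_i, s_j)
        if d_s == 0:
            continue  # identical sequences impose no constraint
        d_h = euclidean(encode(s_i), encode(s_j))
        if d_h == 0:
            return math.inf
        C = max(C, d_h / d_s, d_s / d_h)
    return C

seqs = list(product([0, 1], repeat=3))             # all binary sequences with s = 3
C_lossless = distortion_constant(seqs, lambda s: [float(v) for v in s])
C_lossy = distortion_constant(seqs, lambda s: [float(sum(s))])
```

The position-preserving encoder yields a finite constant ($\sqrt{3}$ for this toy set), while the sum encoder collapses distinct sequences onto the same point, so no finite $C$ exists. In the terminology above, the second encoder has a smaller memory capacity for the same span.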
### 2.2 Acceleration
We use acceleration as a computational abstraction of compressed-time biological replay. In SHARP, this is implemented through temporal downsampling over learned memory states as information passes from one layer to the next: each higher layer receives a downsampled sequence of the lower layer’s states, so each higher-level transition summarizes a longer span of environment-level experience. During wake, the environment determines the pace of incoming inputs; therefore, downsampling makes higher-layer states evolve more slowly (they have to “wait” for the environmental inputs to proceed before updates can occur). During sleep, however, replay is not paced by environmental input, allowing higher layers to traverse memory states faster than online experience (Figure [2](https://arxiv.org/html/2606.00732#S1.F2)). Therefore, the combination of downsampled states, each containing information on several timesteps from the previous layer, and replay of these states free from the environmental pace effectively creates acceleration of inputs from one layer to the next.

Figure 3: Wake–sleep temporal scaling in SHARP (see Figure [2](https://arxiv.org/html/2606.00732#S1.F2)). During wake, higher layers update at slower rates (C-2 updates once every two environment-level steps). During sleep, C-2 input states are generated sequentially from C-1. Thus, generating four C-2 level input states corresponds to traversing eight environment-level inputs in four steps.

We define acceleration as passing only every $\alpha$-th lower-level memory state to the next higher level. Suppose a lower-level memory state $h_t$ summarizes the most recent $s$ inputs, i.e., the window $\{x_{t-s+1}, \ldots, x_t\}$. If the higher level receives the subsequence $\{h_1, h_{1+\alpha}, h_{1+2\alpha}, \ldots\}$, then it processes fewer states while each received state summarizes multiple environment-level inputs. When the memory encoding is sufficiently lossless, this downsampled sequence still preserves the relevant temporal structure but allows the higher level to traverse a longer environment-level history per update. We call $\alpha$ the acceleration factor. Figure [3](https://arxiv.org/html/2606.00732#S2.F3) shows the update schedule of a $3$-layer hierarchy accelerated by $2\times$.
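The definition above amounts to strided subsampling of lower-level states. The following sketch tracks only time indices, not actual memory vectors, and the helper names `accelerate` and `covered_steps` are hypothetical. It assumes an idealized lossless memory in which $h_t$ summarizes exactly the window $\{x_{t-s+1}, \ldots, x_t\}$.

```python
def accelerate(states, alpha):
    """Pass only every alpha-th lower-level state upward: h_1, h_{1+alpha}, ..."""
    return states[::alpha]

def covered_steps(kept_times, s):
    """Environment steps summarized by the kept states, assuming each h_t
    losslessly summarizes the last s inputs (clipped at step 1)."""
    steps = set()
    for t in kept_times:
        steps.update(range(max(1, t - s + 1), t + 1))
    return sorted(steps)

times = list(range(1, 9))                    # lower-level states h_1 .. h_8

matched = accelerate(times, alpha=2)         # alpha equal to the span s = 2
matched_cover = covered_steps(matched, s=2)

excessive = accelerate(times, alpha=4)       # alpha larger than the span s = 2
excessive_cover = covered_steps(excessive, s=2)
```

With $\alpha = s = 2$, the four states passed upward jointly summarize a contiguous stretch of the input, so the higher level traverses the same history in half as many steps. With $\alpha = 4 > s$, some environment steps are summarized by no transmitted state, which illustrates why the acceleration factor has to be matched to the memory span.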
### 2.3 Framework Description
Figure [2](https://arxiv.org/html/2606.00732#S1.F2) illustrates the general architecture of SHARP. We describe an instance of this general framework in detail in Section [3](https://arxiv.org/html/2606.00732#S3). As depicted in the figure, the model consists of a total of $L$ hierarchical blocks, each comprising a context block and a corresponding pattern-recognition block. The context blocks are organized bottom-up and the pattern-recognition blocks are arranged top-down. The memory contents in the context blocks are not credit-assigned from the prediction objective, and hence gradients do not flow from the pattern-recognition blocks back to the memory blocks. Memory in each context block is accelerated, in the form of downsampling by a factor of $\alpha$ as explained above, when transferred from a lower block to its immediate upper block. Note that this acceleration is possible because the memory operates without credit assignment. In practice, however, memory representations that rely on credit assignment may be imperfect. In particular, when the temporal credit assignment at the lowest level is constrained by a limited input horizon, the resulting memory states may not capture long-range dependencies. Under such conditions, acceleration can amplify these distortions, as higher-level blocks operate on accelerated representations derived from imperfect lower-level memory. This motivates the need for smooth, sufficiently lossless memory encodings that do not depend on credit assignment. Importantly, SHARP induces a bootstrapping process through sleep-time replay: higher-level memory (e.g., C-2) starts noisy but improves through replay during sleep. This leads to better predictions in the next wake phase. The improved predictions then improve the quality of future replay (see Figure [8](https://arxiv.org/html/2606.00732#S3.F8)). Repeating this cycle progressively refines the overall representation.
During wake-time, the upper context blocks evolve on exponentially slower timescales than C-1: for every $\alpha$ updates of C-1, C-2 is updated once, and more generally, C-$\ell$ is updated once every $\alpha^{\ell-1}$ updates of C-1 (see Figure [3](https://arxiv.org/html/2606.00732#S2.F3)). Similarly, the learning rate for each upper pattern block is kept $\gamma\times$ slower than that of its immediate lower pattern block, so that the upper pattern blocks do not overfit to transient lower-level fluctuations. This hierarchical separation of timescales instead encourages the pattern-recognition modules to operate over a broader temporal context, thereby improving generalization and reducing forgetting. Importantly, it also governs how upper blocks are trained during sleep. Once the model detaches from the environment, learning is no longer constrained by the pace of incoming data and does not require waiting for new observations to arrive. Consequently, upper blocks can be trained rapidly on accelerated memory traces generated by lower blocks. This sleep-time training optimizes the upper memory representation to better retain the trajectory of the current experience. In this work, SHARP enters the sleep phase at a fixed interval. In principle, however, the sleep mode could be triggered by conditions closer to those encountered by biological systems, for example, a period of inactivity due to diminished environmental input, as occurs at nighttime.
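The wake-time schedule described above can be sketched as follows. The function names, the choice to align updates at steps divisible by $\alpha^{\ell-1}$, and the example values of $\alpha$, $\gamma$, and the base learning rate are assumptions for illustration; the text fixes only the update ratios between levels.

```python
def updated_levels(step, alpha, num_levels):
    """Context levels (1-indexed) updated at wake step `step` (1-indexed).

    Level l fires once every alpha**(l-1) steps; the phase alignment
    (firing when the step index is divisible) is an assumption.
    """
    return [l for l in range(1, num_levels + 1) if step % alpha ** (l - 1) == 0]

def pattern_lr(base_lr, gamma, level):
    """Learning rate of pattern block `level`, gamma-times slower per level."""
    return base_lr / gamma ** (level - 1)

def total_context_updates(num_steps, alpha, num_levels):
    """Total number of context-block updates over `num_steps` wake steps."""
    return sum(len(updated_levels(t, alpha, num_levels))
               for t in range(1, num_steps + 1))

schedule = {t: updated_levels(t, alpha=2, num_levels=3) for t in range(1, 9)}
updates_in_8_steps = total_context_updates(8, alpha=2, num_levels=3)
lr_top = pattern_lr(base_lr=0.1, gamma=10, level=3)
```

Over $N$ wake steps the total number of context updates is $N(1 + \alpha^{-1} + \alpha^{-2} + \cdots) < N\alpha/(\alpha - 1)$, which stays linear in $N$ regardless of depth, while the period of level $\ell$ grows as $\alpha^{\ell-1}$. This is one way to read the claim of exponentially growing temporal context at linear-time cost.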
## 3 Mechanistic Simulation Studies of Context and Pattern Blocks
Figure 4: Example context and pattern recognition blocks. (a) The context block encodes a window of $s$ inputs into $h_t$ and reconstructs it without credit assignment. (b) The pattern block applies FiLM using context $c_t^{l+1}$ to modulate $h_t^{l}$ and produce $c_t^{l}$.

Before the detailed analyses, we summarize the instantiated architecture. SHARP uses $L$ context blocks for hierarchical memory and $L$ pattern-recognition blocks for prediction. During wake, the lowest context block processes the stream while prediction losses update only pattern blocks; during sleep, tagged wake states are replayed offline to train higher context blocks through temporally downsampled memory traces.
### 3.1 Architecture Instantiation
Figure [2](https://arxiv.org/html/2606.00732#S1.F2) presents a general framework; in this work, we instantiate the context blocks using recurrent autoencoders. Specifically, we employ an RNN encoder–decoder architecture (Sutskever et al., [2014](https://arxiv.org/html/2606.00732#bib.bib68)), as illustrated in Figure [4](https://arxiv.org/html/2606.00732#S3.F4)a, where the encoder maps an input window to a latent state and the decoder reconstructs the sequence from this representation during training. To promote continuity in the learned representations, the hidden state of each input window is initialized from the previous window during training. This encourages the memory to evolve smoothly over time and retain information across overlapping windows.
Figure [4](https://arxiv.org/html/2606.00732#S3.F4)b illustrates the pattern recognition block used in this work. At each level, the upper-level context modulates the current-layer memory state via Feature-wise Linear Modulation (FiLM) (Perez et al., [2018](https://arxiv.org/html/2606.00732#bib.bib69)). Specifically, the memory state undergoes feature-wise affine conditioning, where scaling and shifting parameters are generated from the upper-layer context. The modulated representation is then processed by a multilayer perceptron (MLP) to produce an updated context. This context is subsequently passed to the next lower layer or used for next-token prediction at the lowest layer. Below, we examine the empirical properties of each block.
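The pattern block described above can be written compactly. The following dependency-free sketch is a schematic of FiLM followed by an MLP, not the released implementation: the linear (bias-free) generators for the scale and shift, the ReLU activations, and all function and parameter names are assumptions.

```python
def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def film(h, c_upper, W_scale, W_shift):
    """Feature-wise affine conditioning of memory state h by upper context.

    scale and shift are generated from c_upper (assumed linear generators);
    the output is scale * h + shift, applied per feature.
    """
    scale = matvec(W_scale, c_upper)
    shift = matvec(W_shift, c_upper)
    return [g * x + b for g, x, b in zip(scale, h, shift)]

def mlp(x, layers):
    """MLP over the modulated state; ReLU between layers (assumed), none at the end."""
    for i, (W, b) in enumerate(layers):
        x = [z + bi for z, bi in zip(matvec(W, x), b)]
        if i < len(layers) - 1:
            x = [max(0.0, z) for z in x]
    return x

def pattern_block(h, c_upper, W_scale, W_shift, layers):
    """c_t^l = MLP(FiLM(h_t^l; c_t^{l+1}))."""
    return mlp(film(h, c_upper, W_scale, W_shift), layers)

h = [1.0, 2.0]                       # current-layer memory state h_t^l
c_upper = [1.0]                      # upper-layer context c_t^{l+1}
W_scale = [[2.0], [0.5]]             # yields scale = [2.0, 0.5]
W_shift = [[0.0], [1.0]]             # yields shift = [0.0, 1.0]
identity_layer = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
c_out = pattern_block(h, c_upper, W_scale, W_shift, [identity_layer])
```

Because the upper context enters only through the per-feature scale and shift, it can reweight the lower-level memory state without being concatenated to it, which keeps the interface between levels narrow.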
Figure 5: Memory capabilities show carryover across regimes, with stronger transfer generally observed from harder (e.g., nonlinear) to simpler sequences (e.g., linear). Colors denote BPTT steps; shading shows the interquartile range across $10$ runs; the dashed line is chance performance.
### 3.2 Simulation Environments
We use three simulation environments to systematically probe different properties of the instantiated blocks: linear, nonlinear, and random. See Appendix [A](https://arxiv.org/html/2606.00732#A1) for details. In terms of resource requirements and sequence compressibility, the difficulty of retaining these three sequence types can be ordered as: Random > Nonlinear > Linear.
Figure 6: Loss-thresholded updates stabilize representations, reducing the drift observed under standard training. Drift is measured as the mean $L_2$ distance between hidden states from a fixed probe sequence across checkpoints.
### 3.3 Experiments with Context Blocks
#### Memory Capability Carryover across Sequential Distribution Shifts
We train an RNN autoencoder with hidden size $100$ on one of three simulated sequence regimes using varying BPTT windows $T$, and evaluate memory retention across all regimes. Retention is measured via a linear probe that reconstructs a past token $x_i$ from the hidden state $h_t$, where $i \leq t$. Varying the offset $t - i + 1$ and measuring the reconstruction error of the probe provides a direct estimate of how much past information is preserved in $h_t$. Each row of Figure [5](https://arxiv.org/html/2606.00732#S3.F5), read from left to right, reveals intrinsic differences in regime difficulty, with the linear regime being the easiest. The off-diagonal panels show cross-distribution retention on unseen regimes. Memory representations learned on harder regimes transfer more effectively to simpler ones, while representations learned on simpler regimes are less reliable under harder test regimes. This indicates that the learned representations preserve sufficient temporal information under harder-to-simpler sequence shifts.
Inspired by this observation, we quantify sequence hardness using the decoder reconstruction error and update memory weights only when this error exceeds a threshold. To reduce noise, we maintain an exponential moving average of the reconstruction error with smoothing factor $0.1$. This selective update biases learning toward harder sequences while avoiding unnecessary computation on easier inputs.
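The gating rule above can be sketched as a small stateful helper. The class name `GatedUpdater` and the threshold value are hypothetical, and the sketch assumes that the smoothing factor $0.1$ is the weight placed on the newest error, which the text does not spell out.

```python
class GatedUpdater:
    """Decide whether to update memory weights from a smoothed reconstruction error."""

    def __init__(self, threshold, smoothing=0.1):
        self.threshold = threshold    # hypothetical hardness threshold
        self.smoothing = smoothing    # assumed weight on the newest error
        self.ema = None

    def should_update(self, recon_error):
        # Exponential moving average of the reconstruction error.
        if self.ema is None:
            self.ema = recon_error
        else:
            self.ema = (self.smoothing * recon_error
                        + (1.0 - self.smoothing) * self.ema)
        # Update weights only while the smoothed error is above the threshold.
        return self.ema > self.threshold

gate = GatedUpdater(threshold=0.5)
first = gate.should_update(1.0)                         # a hard input opens the gate
decisions = [gate.should_update(0.0) for _ in range(7)] # then errors drop to zero
```

In this toy trace the gate stays open for several easy inputs after a hard one and then closes, so a single low-error sample does not immediately freeze the memory and a single noisy spike does not immediately unfreeze it.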
Figure 7: Increasing the depth of the pattern recognition head improves performance up to an optimal depth, after which deeper heads slow learning.
#### Hidden State Drift
The RNN autoencoder is an overparameterized model with a non-identifiable latent space, meaning that multiple hidden-state configurations can yield similar reconstruction error. As a result, the hidden representations can drift along the solution manifold during training without degrading reconstruction performance. This drift introduces a moving target for downstream pattern-recognition blocks, making it harder for them to learn stable mappings from memory states. To quantify this effect, we measure the mean $L_2$ distance between hidden states extracted from a fixed probe sequence at consecutive training checkpoints (every 1000 steps) in the nonlinear simulation setting. Figure [6](https://arxiv.org/html/2606.00732#S3.F6) shows that standard training without any update constraint exhibits persistent representational drift. In contrast, applying a reconstruction-error-based threshold to gate memory updates mitigates this drift, leading to more stable representations once the training loss stabilizes.
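The drift measure can be computed directly from hidden states saved at each checkpoint. The sketch below assumes that every checkpoint stores one hidden state per position of the same fixed probe sequence; the function names and the toy values are illustrative only.

```python
import math

def mean_l2_drift(states_prev, states_curr):
    """Mean L2 distance between hidden states of the same probe positions
    taken from two consecutive checkpoints."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(h0, h1)))
             for h0, h1 in zip(states_prev, states_curr)]
    return sum(dists) / len(dists)

def drift_curve(checkpoints):
    """Drift between each pair of consecutive checkpoints."""
    return [mean_l2_drift(a, b) for a, b in zip(checkpoints, checkpoints[1:])]

# Toy example: three checkpoints, two probe positions, 2-dimensional states.
checkpoints = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[3.0, 4.0], [1.0, 1.0]],   # first probe state moved by 5, second unchanged
    [[3.0, 4.0], [1.0, 1.0]],   # no further change
]
curve = drift_curve(checkpoints)
```

A curve that decays to zero indicates that the representation has stopped moving, whereas a curve that stays positive after the reconstruction loss has plateaued is the signature of drift along the solution manifold.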
### 3.4 Experiments with Pattern Recognition Blocks
We study the effect of pattern recognition head complexity by varying the depth of the MLP in Figure [7](https://arxiv.org/html/2606.00732#S3.F7). As shown, increasing depth initially reduces the number of samples required to reach near-optimal performance, with the best performance achieved at depth $3$. Beyond this point, deeper heads slow down learning, indicating a trade-off between representational capacity and sample efficiency. These results suggest the existence of an optimal complexity for the pattern recognition head. For simplicity and to keep experiments computationally tractable, we fix the MLP depth to $2$ in all subsequent experiments. However, we note that further tuning of the architectural components described above may yield additional performance gains.
Figure 8: Sleep replay converges to the wake hidden-state distribution while prediction performance improves. Left: Distributional discrepancy between hidden states measured using Maximum Mean Discrepancy (MMD). Replay MMD (blue) compares sleep-generated states with wake states, the Wake MMD baseline (green) measures the discrepancy between two subsets of wake states, and Noise MMD (orange) compares wake states with random noise. As training progresses, replay discrepancy rapidly decreases and approaches the wake baseline while remaining far below the noise reference, indicating that replayed states become indistinguishable from wake representations. Right: Prediction error over training samples.
### 3.5 Wake Experience Generation and Sleep Replay
The sleep phase in Figure [2](https://arxiv.org/html/2606.00732#S1.F2) illustrates how wake-time experience is regenerated offline. At the lowest level, Pattern Block 1 predicts the next token conditioned on the current memory state from Context 1, $h_t^{1}$, and the modulating context from Block 2, $c_t^{2}$. During the wake phase, while interacting with the environment, salient experiences, identified by reconstruction errors that exceed a fixed threshold, are tagged and stored as context pairs $(h_t^{1}, c_t^{2})$. In this work, we maintain a fixed-size queue buffer of such context tags. The sleep phase is triggered at a regular interval. During sleep, Pattern Block 1 generates sequences by conditioning on these stored context tags. The predicted token is fed back into Context 1, updating its memory state $h_t^{1}$, which in turn influences subsequent predictions. Through this iterative process, the model can regenerate a particular wake-time sequence segment. Memory blocks are updated sequentially during sleep: at each step, one block is trained while all others remain frozen. This enables stable consolidation of representations across layers, preventing higher-level updates from depending on unstable lower-layer dynamics.
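The tagging and replay loop described above can be sketched with a bounded queue and two stand-in callables. Everything named here is hypothetical: `TagBuffer`, `replay`, `predict`, and `update_memory` are illustrative placeholders for Pattern Block 1 and Context 1, and holding $c_t^{2}$ fixed during a replay segment is a simplifying assumption.

```python
from collections import deque

class TagBuffer:
    """Fixed-size queue of context tags (h1, c2) for salient wake experiences."""

    def __init__(self, capacity, threshold):
        self.tags = deque(maxlen=capacity)   # oldest tags are evicted first
        self.threshold = threshold

    def maybe_tag(self, recon_error, h1, c2):
        # Salience is identified by reconstruction error above a fixed threshold.
        if recon_error > self.threshold:
            self.tags.append((h1, c2))
            return True
        return False

def replay(tag, predict, update_memory, steps):
    """Regenerate a segment from one tag without environmental input.

    predict(h1, c2) stands in for Pattern Block 1, update_memory(h1, token)
    for Context 1; c2 is held fixed here as a simplification.
    """
    h1, c2 = tag
    tokens = []
    for _ in range(steps):
        token = predict(h1, c2)          # next-token prediction
        h1 = update_memory(h1, token)    # prediction is fed back into memory
        tokens.append(token)
    return tokens

# Toy wake phase: integer "states" and four reconstruction errors.
buf = TagBuffer(capacity=2, threshold=0.5)
for i, err in enumerate([0.9, 0.1, 0.8, 0.7]):
    buf.maybe_tag(err, h1=i, c2=1)

# Toy sleep phase with stand-in dynamics over a 3-symbol alphabet.
segment = replay((0, 1), predict=lambda h, c: (h + c) % 3,
                 update_memory=lambda h, tok: tok, steps=3)
```

Since the replay loop never reads from the environment, its pace is limited only by compute, which is what allows higher blocks to be trained on many regenerated segments within a single sleep phase.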
To quantify the quality of sleep-time generation, we measure the Maximum Mean Discrepancy (MMD) between the wake-time and sleep-time hidden-state distributions at Block 1 on the nonlinear simulation. As shown in Figure [8](https://arxiv.org/html/2606.00732#S3.F8), the discrepancy between sleep-generated and wake-time hidden states decreases as the pattern-recognition head improves.
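For readers unfamiliar with MMD, the following is a minimal sketch of a squared-MMD estimate between two sets of hidden states. The kernel is not specified in the text, so the Gaussian (RBF) kernel, the bandwidth of 1.0, and the biased V-statistic estimator are assumptions; the Gaussian samples are synthetic stand-ins for wake states and for an unrelated reference.

```python
import math
import random


def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel between two vectors given as lists of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


rng = random.Random(0)


def sample(n, dim, mean):
    """Draw n synthetic 'hidden states' from an isotropic Gaussian."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


wake_a = sample(50, 4, 0.0)   # one subset of wake states
wake_b = sample(50, 4, 0.0)   # a second subset from the same distribution
shifted = sample(50, 4, 3.0)  # stand-in for a mismatched distribution

wake_baseline = mmd_squared(wake_a, wake_b)
shifted_reference = mmd_squared(wake_a, shifted)
print(wake_baseline, shifted_reference)
```

Two subsets drawn from the same distribution give a small but nonzero value, which is why a wake-versus-wake baseline is a useful floor when reading the replay curve.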
### 3.6 Ablation Study
In this experiment, we compare a 3-layer SHARP model with a 3-layer vanilla RNN on the nonlinear simulation under identical constraints ($\mathrm{BPTT}=4$). We vary the required temporal context by increasing the number of past community dependencies needed to predict the next token. Figure [9](https://arxiv.org/html/2606.00732#S3.F9) shows that the acceleration factor must be matched to the available memory span to effectively extend the contextual reach of the pattern blocks. While larger acceleration enables access to longer temporal dependencies, excessive acceleration compresses temporal structure beyond what the memory span (set by BPTT) can preserve, leading to degraded performance. In this work, we set the acceleration factor equal to the BPTT length across all experiments. Interestingly, higher-level representations become effective only after sufficient training, leading to delayed but sharper gains for longer context requirements. This reflects an implicit curriculum over temporal scales: short-range dependencies are learned first, followed by progressively longer-range structure as hierarchical memory stabilizes.
Figure 9: Acceleration in SHARP is constrained by memory span. Too little acceleration limits temporal reach, while excessive acceleration discards information. The $4\times$ no-sleep variant converges more slowly, and vanilla RNN fails once required context exceeds the BPTT horizon.

We further include an ablation of the sleep mechanism by comparing the $4\times$ model with and without sleep. In this experiment, wake-time computation is unchanged: context states evolve according to the hierarchical downsampling schedule, the lowest context block is updated by its reconstruction objective, and pattern blocks are updated by prediction loss. The only removed component is the offline sleep phase, i.e., replay-based consolidation updates to higher context blocks. We observe that sleep does not provide uniform improvement in all regimes. Instead, its benefit emerges only when the required temporal context significantly exceeds the effective memory span of the lowest level, at which point higher-level representations become useful. When the task is solvable within the local credit assignment horizon, sleep provides no advantage.
Crucially, under the same constraint, the vanilla RNN fails to reach optimal performance once the required context exceeds the BPTT horizon. In contrast, SHARP successfully captures longer dependencies through accelerated replay, without increasing the number of BPTT steps. This highlights that, despite having multiple layers, the vanilla RNN fails to effectively utilize its hierarchical structure. Moreover, unlike the RNN, the context knowledge base weights in SHARP are not updated during the active (wake) phase, further improving computational efficiency. Detailed pseudocode for the approach is provided in Appendix [B](https://arxiv.org/html/2606.00732#A2).
## 4 Benchmark Experiments
We evaluate the proposed SHARP model in a single-pass streaming language modeling setting at the character and subword levels, where observations arrive sequentially without revisiting past data. This setup directly tests a model’s ability to retain and utilize long-range information under limited credit assignment, which is central to our formulation of the framework. To this end, we consider two standard character-level benchmarks with complementary properties: text8 (Mahoney, [2011](https://arxiv.org/html/2606.00732#bib.bib78)) and PG-19 (Rae et al., [2019](https://arxiv.org/html/2606.00732#bib.bib13)), both of which exhibit non-stationarity in their token distributions, with PG-19 consisting of long book-length sequences with substantial long-range dependencies (see Appendix Figure [13](https://arxiv.org/html/2606.00732#A3.F13)). Together, these datasets allow us to evaluate both short-term prediction and long-horizon memory retention. Due to limited computational resources, we report PG-19 results from a single run.
We compare against standard recurrent architectures including vanilla RNN, GRU, and LSTM with identical embedding size (100), hidden dimension (512), and depth ($L=5$). All recurrent baselines are trained online with truncated context ($T=4$), with hidden states propagated sequentially. We additionally include a Clockwork RNN (Koutnik et al., [2014](https://arxiv.org/html/2606.00732#bib.bib29)) with multi-timescale modules. Moreover, to contextualize performance with direct context access, we include Transformer baselines with comparable parameter budgets. Transformers operate on fixed-length windows (e.g., $T=\alpha^{L}=1024$) without internal memory, serving as a reference for explicit context rather than streaming memory. In addition to the total parameters, we report the number of active parameters updated during the online phase, reflecting the complexity of wake-time training. Preprocessing and hyperparameter details are provided in Appendices [C](https://arxiv.org/html/2606.00732#A3) and [D](https://arxiv.org/html/2606.00732#A4).
#### Evaluation Protocol
Performance is measured using bits-per-character (BPC) (Chung et al., [2016](https://arxiv.org/html/2606.00732#bib.bib41)), defined as $\mathbb{E}\left[-\log_{2} p(x_{t+1} \mid x_{\leq t})\right]$. BPC provides a calibrated, information-theoretic measure of predictive uncertainty, making it particularly suitable for sequential settings with partial stochasticity, where accuracy alone may be misleading. BPC captures how well the model assigns probability mass to the true next token rather than only whether the top prediction is correct; lower BPC indicates better performance. We report three complementary metrics: Forward BPC is evaluated on unseen future data (held-out 1M tokens), Current BPC on the most recent 1M tokens to assess short-term adaptation, and Backward BPC on early training data (first 1M tokens) to measure retention of past performance.
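As a concrete reading of the definition, the sketch below averages $-\log_2$ of the probability that a model assigned to each true next character. The function name `bits_per_character` and the example probabilities are illustrative assumptions; in practice the probabilities would come from the model's softmax output at each evaluated position.

```python
import math


def bits_per_character(true_token_probs):
    """Mean of -log2 p(x_{t+1} | x_{<=t}) over the evaluated positions."""
    return -sum(math.log2(p) for p in true_token_probs) / len(true_token_probs)


# Eight positions where the true character always received probability 1/4.
flat_bpc = bits_per_character([0.25] * 8)

# Two positions: probability 0.5 on the first true character, 0.25 on the second.
mixed_bpc = bits_per_character([0.5, 0.25])

print(flat_bpc, mixed_bpc)
```

Because the metric depends on the probability of the true token and not on the argmax, two models with the same accuracy can differ in BPC.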
Table 1: Forward, backward, and current BPC on text8. Error bars show standard deviation across 9 runs. The Context-1 block ($\sim$1M parameters, within 5M active) is updated only when reconstruction loss exceeds a threshold, so its updates become increasingly infrequent during training. $T$ denotes the BPTT window or input length, and $\alpha$ the acceleration factor; here, for SHARP, acceleration $\alpha=T$ and total layers $L=5$. Wall-time is reported as amortized elapsed time per 1000 online next-token predictions using identical computational resources.

Figure 10: Sleep ablations on text8. Sleep-enabled SHARP achieves the lowest forward, current, and backward BPC.

Figure 11: Thresholded updates improve computational efficiency and performance stability. Performance is insensitive to the choice of smaller thresholds, while removing thresholding causes performance to degrade over time.
### 4.1 Character-Level Modeling
#### Results on text8
Table [1](https://arxiv.org/html/2606.00732#S4.T1) shows that on text8, SHARP consistently achieves the best performance across forward, current, and backward BPC, indicating improved retention, quick adaptation, and generalization in the streaming setting. Moreover, the performance of the baselines reflects their architectural specifications. Vanilla RNNs perform competitively due to their simplicity and stable optimization under short BPTT, but lack mechanisms to extend effective context beyond the truncation horizon. LSTMs exhibit high variance across runs, likely due to the difficulty of reliably gating long-range dependencies under strict single-pass training and limited credit assignment, leading to unstable memory retention. GRUs provide a more stable trade-off between expressivity and optimization, resulting in relatively consistent performance but still constrained by short-term credit assignment. Clockwork RNN, despite its multi-timescale design, operates through a horizontal hierarchy of modules that differ only in their update frequency. While this enables modeling of multiple temporal resolutions, it does not induce a hierarchy of representations: all modules operate within the same representational space without progressively transforming or abstracting information. As a result, there is no notion of higher-level generalized representations or lower-level specialized features. In contrast, SHARP constructs a vertical hierarchy in which representations are progressively compressed and reorganized across layers, enabling structured abstraction of temporal information. Transformer models achieve strong performance by directly attending to recent inputs, but rely on explicit access to past tokens and incur quadratic complexity, making them less suitable for strict streaming constraints.
In contrast, SHARP separates memory from prediction and propagates information hierarchically via acceleration, enabling efficient long-range retention with linear complexity.
We further ablate SHARP on text8 by considering three variants: removing the offline sleep phase, removing the slower learning-rate schedule for upper pattern blocks, and training the full model during wake only without freezing the context knowledge base. For faster experimentation, these ablations use a smaller hidden size of 128. Figure [10](https://arxiv.org/html/2606.00732#S4.F10) shows that the sleep-enabled model consistently achieves the lowest BPC across forward, current, and backward evaluations. Removing sleep degrades performance, indicating that offline replay-based consolidation improves future generalization, current adaptation, and retention of earlier stream statistics. The no-pattern-slowdown and wake-only all-trainable variants also underperform the full sleep-enabled model, suggesting that both sleep-time consolidation and hierarchical timescale separation contribute to SHARP’s gains.
We also examine how sensitive performance is to the update threshold. Figure [11](https://arxiv.org/html/2606.00732#S4.F11) shows that reconstruction-thresholded memory updates rapidly reduce wake-time context-block updates to near zero after the initial training phase. Moderate thresholds ($\tau=10^{-3}, 10^{-2}$) achieve the best forward, current, and backward BPC, while updating at every step ($\tau=0$) causes performance to degrade over time. This suggests that thresholding both reduces computation and prevents unnecessary memory drift from over-updating on already well-reconstructed inputs.
Table 2: PG-19 benchmark results. Left: character-level BPC on PG-19. Right: subword-level bits-per-token (BPT) using the pretrained GPT-2 tokenizer and frozen embeddings. All results are from a single run.
(a) Character-level PG-19
(b) Subword-level PG-19
#### Results on PG-19
PG-19 introduces stronger distribution shifts across books. Consistent with text8, SHARP demonstrates improved forward generalization and reduced backward degradation, suggesting that hierarchical replay enables better transfer of learned structure across distributions (Table [2](https://arxiv.org/html/2606.00732#S4.T2)a). In contrast, standard recurrent baselines exhibit significant performance gaps between current and backward metrics, indicating instability under distribution shift. Compared with the character-level setting on text8, the gap to Transformers is larger because Transformers directly attend over a 1024-token window, whereas SHARP uses only local wake-time credit assignment. Moreover, because Transformers have direct access to the input window, they rely less on internal memory formation and can devote more capacity to predictive mapping over the visible context.
### 4.2 Subword-Level Modeling on PG-19
We further evaluate whether SHARP can utilize pretrained representations in a more realistic subword-level setting with a 50,257-token GPT-2 vocabulary, rather than the 27-character vocabulary used above. In contrast to the character-level setup, where text is lowercased and punctuation is removed, the subword setup preserves richer lexical structure through the GPT-2 tokenizer. We use pretrained, frozen GPT-2 token embeddings as the input front end for both SHARP and recurrent baselines, and train the models online to predict the next subword token under the same strict streaming constraint with BPTT $=4$. For faster runtime, we use 4 SHARP hierarchy levels and 4 MLP layers per pattern block while keeping other settings unchanged from character-level modeling. We trained on the first 217 books without any truncation. As shown in Table [2](https://arxiv.org/html/2606.00732#S4.T2)b, all models achieve BPT far below the chance level of $\log_{2}(50{,}257) \approx 15.62$, and SHARP improves over recurrent baselines across forward, current, and backward BPT. SHARP’s consistent improvement over RNN, LSTM, and GRU suggests that hierarchical accelerated replay remains useful beyond low-level character transitions, extending to larger subword vocabularies with pretrained semantic embeddings.
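The chance level quoted above is the cost, in bits, of a predictor that spreads probability uniformly over the vocabulary. A short check of that arithmetic follows; the helper name `chance_level_bits` is hypothetical.

```python
import math


def chance_level_bits(vocab_size):
    """Bits per token for a uniform predictor over vocab_size symbols."""
    return math.log2(vocab_size)


subword_chance = chance_level_bits(50257)  # GPT-2 vocabulary
char_chance = chance_level_bits(27)        # 27-character vocabulary

print(round(subword_chance, 2), round(char_chance, 2))
```

The two reference points differ by more than a factor of three, so BPC and BPT values from the character-level and subword-level tables are not directly comparable.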
## 5 Discussion
We introduced SHARP, a hierarchical framework that separates memory (without credit assignment) from pattern recognition (with credit assignment), extending the temporal context via accelerated replay of lower-level memory. Our experiments suggest that separating memory from prediction enables SHARP to extend effective context beyond the BPTT horizon through progressively compressed temporal summaries. This produces a hierarchical organization in which higher levels capture longer-range dependencies without requiring long-horizon wake-time credit assignment.
While our current instantiation demonstrates promising results, it can be optimized further. The framework is modular and agnostic to the choice of memory representation: more expressive memory modules (e.g., LSTM autoencoders) or alternative mechanisms may further enhance acceleration and extend the effective temporal horizon. Similarly, improving pattern recognition modules to better utilize the available context may further enhance performance. In addition, SHARP can be integrated with modern state space models such as Mamba to construct temporal hierarchies through acceleration, enabling efficient long-range sequence modeling.
In future work, we aim to explore alternative memory mechanisms (e.g., Hebbian) inspired by hippocampal place cells, where memory is represented as a superposition of basis elements. Each basis corresponds to a token or a contiguous input segment. Such memory representations, where each memory element can be identified separately, may enable downstream pattern modules to perform selective retrieval akin to attention, without requiring explicit selection at storage time. This suggests a pathway toward attention-like mechanisms in streaming settings, where relevance emerges during retrieval rather than being predetermined at memory encoding. More broadly, we emphasize that not only the capacity but also the quality of memory representations should be studied in depth along with pattern-recognition mechanisms to build more robust and generalizable continual learning systems.
## 6 Acknowledgment
This work was graciously supported by the NSF EFRI under Award 2317706, and in part by NSF NAIAD under Award 2332744 and the U.S. Department of Energy’s (DoE) ESTEEM Center.
## References
- Using fast weights to attend to the recent past. Advances in Neural Information Processing Systems 29.
- Y. Bengio, P. Simard, and P. Frasconi (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2), pp. 157–166.
- P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara (2020) Dark experience for general continual learning: a strong, simple baseline. Advances in Neural Information Processing Systems 33, pp. 15920–15930.
- S. Channappayya, B. R. Tamma, et al. (2023) Augmented memory replay-based continual learning approaches for network intrusion detection. Advances in Neural Information Processing Systems 36, pp. 17156–17169.
- A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny (2018) Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420.
- A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, and M. Ranzato (2019) On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486.
- J. Chung, S. Ahn, and Y. Bengio (2016) Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704.
- T. Doan, M. A. Bennani, B. Mazoure, G. Rabusseau, and P. Alquier (2021) A theoretical analysis of catastrophic forgetting through the NTK overlap matrix. In International Conference on Artificial Intelligence and Statistics, pp. 1072–1080.
- V. Dorovatas, M. Schwerin, A. D. Bagdanov, L. Caccia, A. Carta, L. Charlin, B. Hammer, T. L. Hayes, T. Hess, C. Kanan, et al. (2026) Modular memory is the key to continual learning agents. arXiv preprint arXiv:2603.01761.
- A. Gu and T. Dao (2023) Mamba: linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
- A. Gu, K. Goel, and C. Ré (2021) Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.
- M. Y. Harun, J. Gallardo, T. L. Hayes, R. Kemker, and C. Kanan. SIESTA: efficient online continual learning with sleep. Transactions on Machine Learning Research.
- S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
- J. J. Hopfield (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79(8), pp. 2554–2558.
- P. Kanerva (1988) Sparse distributed memory. MIT Press.
- J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), pp. 3521–3526.
- J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber (2014) A clockwork RNN. In International Conference on Machine Learning, pp. 1863–1871.
- D. Kumaran, D. Hassabis, and J. L. McClelland (2016) What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences 20(7), pp. 512–534.
- I. Lerner and M. A. Gluck (2019) Sleep and the extraction of hidden regularities: a systematic review and the importance of temporal rules. Sleep Medicine Reviews 47, pp. 39–50.
- I. Lerner (2017) Sleep is for the brain. In Computational Models of Brain and Behavior, pp. 245–256. DOI: [10.1002/9781119159193.ch18](https://dx.doi.org/10.1002/9781119159193.ch18).
- Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12), pp. 2935–2947.
- D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems 30.
- N. D. Lutz, M. Harkotte, and J. Born (2026) Sleep’s contribution to memory formation. Physiological Reviews 106(1), pp. 363–483.
- M. Mahoney (2011) Large text compression benchmark.
- J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly (1995) Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review 102(3), pp. 419.
- M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165.
- R. C. O’Reilly, R. Bhattacharyya, M. D. Howard, and N. Ketz (2014) Complementary learning systems. Cognitive Science 38(6), pp. 1229–1248.
- B. A. Olshausen and D. J. Field (1997) Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research 37(23), pp. 3311–3325.
- E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2018) FiLM: visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
- J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap (2019) Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507.
- A. Razavi, A. Van den Oord, and O. Vinyals (2019) Generating diverse high-fidelity images with VQ-VAE-2. Advances in Neural Information Processing Systems 32.
- S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010.
- D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne (2019) Experience replay for continual learning. In Advances in Neural Information Processing Systems, pp. 350–360.
- H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pp. 2990–2999.
- W. Sun, X. Song, P. Li, L. Yin, Y. Zheng, and S. Liu (2025) The curse of depth in large language models. arXiv preprint arXiv:2502.05795.
- I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 27.
- G. M. van de Ven, H. T. Siegelmann, and A. S. Tolias (2020) Brain-inspired replay for continual learning with artificial neural networks. Nature Communications 11, pp. 4069.
- A. Van Den Oord, O. Vinyals, et al. (2017) Neural discrete representation learning. Advances in Neural Information Processing Systems 30.
- J. T. Vogelstein, J. Dey, H. S. Helm, W. LeVine, R. D. Mehta, T. M. Tomita, H. Xu, A. Geisa, Q. Wang, G. M. Van De Ven, et al. (2025) Simple lifelong learning machines. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- W. Yang, C. Sun, R. Huszár, T. Hainmueller, K. Kiselev, and G. Buzsáki (2024) Selection of experience for memory by hippocampal sharp wave ripples. Science 383(6690), pp. 1478–1483.
## Appendix A Simulation Environments
#### Linear:
In this simulation, the token sequence {A, B, C, D, E, F, G} is repeated periodically. We refer to this setting as *Linear*, since a sequence model with purely linear dynamics should, in principle, be able to capture the underlying periodic rule without requiring nonlinear transformations.
Figure 12: Token generation graph for the nonlinear simulation. Two communities, Community 1 {A, B, C} and Community 2 {D, E, F}, are connected by a hub token G. From G, either community can be entered with equal probability. The traversal direction depends on the past $K$ community visits.
#### Nonlinear:
Figure [12](https://arxiv.org/html/2606.00732#A1.F12) illustrates the transition graph for the nonlinear simulation. The system consists of two token communities: Community 1 {A, B, C} and Community 2 {D, E, F}. Within each community, tokens are traversed in either a clockwise (ABC, BCA, CAB, DEF, EFD, FDE) or counterclockwise (ACB, CBA, BAC, DFE, FED, EDF) direction. A special token G acts as a hub connecting the two communities. From G, the next token is sampled uniformly from {A, B, C, D, E, F}, thereby selecting both the community and the starting point of traversal. Once a community is entered, the process completes exactly one full traversal (three steps) before deterministically returning to G.
Crucially, the traversal direction is determined by the parity of the last $K$ community visits. Let $v_{t-K},\ldots,v_{t-1}\in\{0,1\}$ denote the community indices of the previous $K$ visits. If $\sum_{i=t-K}^{t-1} v_{i}$ is even, traversal proceeds clockwise; otherwise, it proceeds counterclockwise. Thus, correct prediction requires retaining memory over the past $K$ visits, inducing a fixed-length temporal dependency. An example sequence generated for $K=2$ is: ‘CAB-G-DEF-G-DFE-G-ABC-G-FED-G-EDF-G-FED-G-FDE-G-DEF-G-EFD-G-EFD’. Note that for $K=2$, the direction at the second token of a community depends on the preceding 7 tokens.
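The parity rule as stated can be turned into a small generator. This is a sketch under assumptions the text does not fix: Community 1 is given index 0 and Community 2 index 1, the stream starts as if Community 1 had been visited $K$ times, and every visit is followed by the hub token. The names `generate_nonlinear`, `COMMUNITIES`, and `HUB` are hypothetical.

```python
import random

COMMUNITIES = ("ABC", "DEF")  # Community 1 and Community 2, in clockwise order
HUB = "G"


def generate_nonlinear(num_visits, k, seed=0):
    """Emit num_visits blocks of three community tokens, each followed by G."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    history = [0] * k  # assumed initial history: k visits to Community 1
    tokens = []
    for _ in range(num_visits):
        first = rng.choice("ABCDEF")  # uniform over the six tokens reachable from G
        community = 0 if first in COMMUNITIES[0] else 1
        ring = COMMUNITIES[community]
        start = ring.index(first)
        # An even sum over the previous k community indices means clockwise.
        step = 1 if sum(history[-k:]) % 2 == 0 else -1
        tokens.extend(ring[(start + step * i) % 3] for i in range(3))
        tokens.append(HUB)
        history.append(community)
    return "".join(tokens)


stream = generate_nonlinear(num_visits=6, k=2, seed=0)
print("-".join(stream[i:i + 3] + "-G" for i in range(0, len(stream), 4)))
```

Increasing `k` lengthens the history that the direction depends on without changing the local statistics of the stream, which is what makes the required context easy to vary in the ablation.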
The environment is partially stochastic: while traversal within a community is deterministic given the direction, the transition from G is uniform over six tokens. Consequently, the optimal achievable accuracy is $\frac{3+\frac{1}{6}}{4} \approx 79.17\%$.
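The bound can be unpacked position by position. Assuming a predictor that tracks the last $K$ visits perfectly, each four-token cycle contains one prediction that is correct with probability $\frac{1}{6}$ (the first token after G) and three predictions that are deterministic (the second and third community tokens and the return to G):

$$
\mathrm{Acc}^{*} = \frac{1}{4}\left(\frac{1}{6} + 1 + 1 + 1\right) = \frac{19}{24} \approx 79.17\%.
$$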
#### Random:
This simulation represents the theoretically most difficult case to retain\. At each time step, a token is sampled uniformly at random from the set \{A, B, C, D, E, F, G\}\. Since the sequence contains no underlying structure, a model cannot utilize implicit regularities to compress its memory footprint\.
## Appendix B Pseudocode
Algorithm 1: SHARP wake and sleep phases
1: Input: stream $\{x_{t}\}$, context blocks $\{m^{\ell}\}_{\ell=1}^{L}$, pattern blocks $\{f^{\ell}\}_{\ell=1}^{L}$, acceleration factor $\alpha$
2: Initialize memory states $\{h^{\ell}\}_{\ell=1}^{L}$ and context queue buffer $\mathcal{B}$
3: for each time step $t=1,2,\cdots$ do ▷ Wake phase
4: update lowest memory: $h^{1}_{t}\leftarrow m^{1}(x_{t},h^{1}_{t-1})$
5: (selective update) update $m^{1}$ using reconstruction loss if needed
6: propagate states bottom-up with acceleration: $h^{\ell}_{t}\leftarrow m^{\ell}(h^{\ell-1}_{t},h^{\ell}_{t-1})$ if $t\equiv 0 \pmod{\alpha^{\ell-1}}$, for $\ell=2,\cdots,L$
7: construct top-down context: $c^{\ell}_{t}\leftarrow f^{\ell}(h^{\ell}_{t},c^{\ell+1}_{t})$, for $\ell=L,\cdots,2$
8: predict next state transition probability: $\hat{\mathbf{p}}_{t}\leftarrow f^{1}(h^{1}_{t},c^{2}_{t})$
9: update pattern-recognition blocks using prediction loss
10: if memory update is triggered then
11: store $(h^{1}_{t},c^{2}_{t})$ in buffer $\mathcal{B}$ ▷ context tags
12: end if
13: if sleep is triggered then ▷ Sleep phase
14: for each layer $\ell=2,\dots,L$ do
15: sample tagged context from buffer $\mathcal{B}$
16: replay $(\ell-1)$-th layer states with stride $\alpha$
17: update $m^{\ell}$ to reconstruct replayed states
18: end for
19: end if
20: end for
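The update schedule in line 6 can be made concrete with a short sketch. It assumes the condition $t \equiv 0 \pmod{\alpha^{\ell-1}}$ is read literally for every level, including level 1, which then updates at every step; the helper names `levels_updated` and `total_memory_updates` are hypothetical.

```python
def levels_updated(t, alpha, num_levels):
    """Levels whose memory state is refreshed at step t (level 1 always is)."""
    return [level for level in range(1, num_levels + 1)
            if t % alpha ** (level - 1) == 0]


def total_memory_updates(num_steps, alpha, num_levels):
    """Count memory-state updates across all levels over num_steps steps."""
    return sum(len(levels_updated(t, alpha, num_levels))
               for t in range(1, num_steps + 1))


alpha, num_levels = 4, 5
span = alpha ** num_levels  # 4**5 = 1024 steps
updates = total_memory_updates(span, alpha, num_levels)

print(levels_updated(16, alpha, num_levels))
print(span, updates)
```

Under this reading, the per-level update counts over 1024 steps form the geometric series 1024 + 256 + 64 + 16 + 4, so the total stays below twice the number of steps even though the top level integrates information spanning the whole window. This is one way to see how the temporal span can grow exponentially in the depth while the update count stays linear in the stream length.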
## Appendix C Data Preprocessing and Evaluation Protocol
Table 3: Hyperparameters and configuration choices used by SHARP. We distinguish method-specific hyperparameters from standard architecture, data, and optimization settings.

| Category | Hyperparameter / Setting | Role |
| --- | --- | --- |
| **Method-specific hyperparameters** | | |
| Hierarchy | Number of layers $L$ | Number of context and pattern-recognition levels |
| Temporal scaling | Acceleration factor $\alpha$ | Downsampling factor between adjacent context levels |
| Pattern slow-down | Learning rate slow-down factor $\gamma$ | Updates upper pattern blocks more slowly |
| Memory update | Reconstruction threshold $\tau$ | Triggers selective memory updates and context tagging |
| Replay storage | Context buffer size | Number of tagged wake states retained for sleep replay |
| Sleep schedule | Sleep interval | Frequency of offline consolidation phases |
| Replay length | Replay sequence length | Number of replayed states used during sleep updates |
| **Architecture and scale settings** | | |
| Pattern block | MLP depth | Number of layers in each pattern-recognition head |
| Representation | Hidden size | Dimensionality of memory and recurrent states |
| Input encoding | Embedding dimension | Dimensionality of token embeddings |
| Data-dependent | Vocabulary size | Number of discrete tokens in the dataset |
| **Training and optimization settings** | | |
| Credit horizon | BPTT window $T$ | Local wake-time credit-assignment horizon |
| Optimization | Optimizer | Algorithm used for gradient-based updates |
| Optimization | Learning rate | Step size for trainable parameters |
| Regularization | Weight decay | Weight regularization coefficient |

### C.1 Text8
We use the standardtext8corpus as a continuous character stream\. A character\-level vocabulary is constructed directly from the dataset, and the text is encoded into integer token IDs\. To obtain error bar estimates under a single\-pass setting, we partition the corpus into99disjoint segments of1010M characters each, and train a separate model on each segment independently\. Training within each segment is performed sequentially without shuffling\.
Evaluation is performed within each segment using three disjoint portions of the stream: the initial11M tokens are used to assess retention \(backward\), the most recent11M tokens of the training stream are used for current performance, and a held\-out future portion \(last11M tokens oftext8\) is used to measure generalization \(forward\)\. For fairness, context length at evaluation is fixed across all approaches, and the predicted \(last\) positions’ loss is measured\. Results are aggregated across the99runs\.
To assess non-stationarity at the scale relevant to the model, we compute the Hellinger distance between character distributions (histograms) over non-overlapping windows of length 1024, matching the effective context size. Distances are averaged over all window pairs separated by a fixed lag. As shown in Figure [13](https://arxiv.org/html/2606.00732#A3.F13), the distance increases rapidly at small lags and saturates thereafter, indicating that local token distributions evolve over short temporal scales and that the distribution shift reaches its maximum beyond a certain lag. This demonstrates that the input stream is non-stationary at the scale of the model's available context.
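The lag-averaged Hellinger analysis can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' script: the helper names are hypothetical, histograms are normalized to probabilities, and the demo stream is synthetic random data rather than text8.

```python
import numpy as np


def window_histograms(tokens, window, vocab_size):
    """Normalized token histograms over non-overlapping windows."""
    tokens = np.asarray(tokens)
    n_windows = len(tokens) // window
    chunks = tokens[: n_windows * window].reshape(n_windows, window)
    hists = np.zeros((n_windows, vocab_size))
    for i, row in enumerate(chunks):
        hists[i] = np.bincount(row, minlength=vocab_size)
    return hists / window


def hellinger(p, q):
    """Hellinger distance H(p, q) = sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2))."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2, axis=-1))


def mean_distance_by_lag(hists, max_lag):
    """Average distance over all window pairs separated by each lag."""
    return np.array(
        [hellinger(hists[:-lag], hists[lag:]).mean() for lag in range(1, max_lag + 1)]
    )


# Synthetic stand-in for an encoded character stream (27 symbols).
rng = np.random.default_rng(0)
tokens = rng.integers(0, 27, size=1024 * 50)
hists = window_histograms(tokens, window=1024, vocab_size=27)
curve = mean_distance_by_lag(hists, max_lag=10)
```

On a stationary synthetic stream like this one the curve should stay roughly flat; the rise and plateau reported for text8 and PG-19 reflect real distribution shift.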
Figure 13: Context-scale non-stationarity in text8 and PG-19. Average Hellinger distance between character distributions (histograms) computed over non-overlapping $4^{5}=1024$-token windows, as a function of lag (window separation). Each point represents the mean distance between all window pairs separated by a fixed lag. Both datasets exhibit a sharp increase in distance at small lags followed by a plateau, indicating that local token distributions diverge rapidly but stabilize beyond a characteristic separation.

#### Text8 100M Character Sequence Modeling
To complement the nine-fold cross-validation results in the main text, we evaluate each model on the full 90M-character text8 training split as a single continuous sequence, keeping the same model configurations as in our main experiments. The final 10M characters are held out as a test set. This regime more closely mirrors deployment conditions where a model encounters one long, non-repeating stream of data, and it amplifies differences in catastrophic forgetting: a model that loses information about early portions of the stream will exhibit a large gap between its forward (future) and backward (past) BPC.
Table 4: Forward, backward, and current performance on text8 (100M regime, single run). All models are trained on the full 90M-character text8 training split; only one run per model is available, so no error bars are reported. $T$ denotes the BPTT window or input length; for SHARP, $\alpha$ is the acceleration factor. $\dagger$ RNN backward/current BPC approaches the random-predictor ceiling ($\log_{2} 27 \approx 4.76$ BPC), indicating severe catastrophic forgetting under 90M-step online training; gated architectures (LSTM, GRU) do not exhibit this collapse.
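For reference, the random-predictor ceiling mentioned in the caption follows from the standard definition of bits per character. The worked step below assumes a uniform predictor over the 27-symbol vocabulary (a–z and space); the notation $N$ and $p$ is introduced here for illustration only.

$$
\mathrm{BPC} = -\frac{1}{N}\sum_{t=1}^{N}\log_{2} p(x_{t}\mid x_{<t}),
\qquad
p(x_{t}\mid x_{<t}) = \frac{1}{27}
\;\Rightarrow\;
\mathrm{BPC} = \log_{2} 27 = 3\log_{2} 3 .
$$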
### C.2 PG-19
For PG-19, all books are normalized to a fixed 27-character vocabulary consisting of a–z and space. Specifically, text is lowercased, all non-alphabetic characters are mapped to spaces, and consecutive spaces are collapsed. Training data is constructed by sequentially concatenating books from the training split until a total budget of 100M characters is reached. Books shorter than 20K normalized characters are discarded to avoid degenerate sequences. To ensure computational tractability and balanced contribution across books, each training book is truncated to at most 2M characters. For evaluation, we select up to 5 held-out books from the validation split (or the test split if needed), each truncated to at most 1M characters. These books are never seen during training and are used to assess generalization. Evaluation is performed sequentially within each book without resetting hidden states. For fairness, the context length at evaluation is fixed across all approaches, and the loss is measured at the predicted (last) positions. In addition, retention is evaluated on the first 3 training books (each truncated to at most 1M characters), while current performance is measured on the last 3 training books under the same truncation. This allows us to characterize both memory retention and adaptation within the training stream.
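The normalization and stream-construction steps above can be sketched in a few lines. This is a minimal sketch, not the authors' pipeline: the function names are hypothetical, books are assumed to be joined without a separator, leading and trailing spaces are left in place, and the final book is assumed to be cut at the budget boundary, none of which the text specifies.

```python
import re


def normalize_book(text):
    """Map text to the 27-symbol vocabulary: a-z plus space."""
    text = text.lower()
    text = re.sub(r"[^a-z]", " ", text)  # non-alphabetic characters become spaces
    return re.sub(r" +", " ", text)      # collapse runs of spaces


def build_training_stream(books, budget, min_len, max_len):
    """Concatenate normalized books in order until the character budget is reached."""
    pieces, total = [], 0
    for raw in books:
        if total >= budget:
            break
        book = normalize_book(raw)
        if len(book) < min_len:          # discard short books
            continue
        book = book[:max_len]            # per-book truncation
        piece = book[: budget - total]   # assumption: cut the last book at the budget
        pieces.append(piece)
        total += len(piece)
    return "".join(pieces)


# Toy example with tiny limits; the text uses budget=100M, min_len=20K, max_len=2M.
stream = build_training_stream(
    ["ab", "Abc Def!", "xyz xyz xyz"], budget=10, min_len=3, max_len=6
)
```

The toy limits only keep the example readable; the same function applies unchanged with the values stated in the text.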
To examine non-stationarity at the scale relevant to the model, we compute the Hellinger distance between character distributions (histograms) over non-overlapping windows of length 1024 within the sequence. Distances are averaged over all window pairs separated by a fixed lag. As shown in Figure [13](https://arxiv.org/html/2606.00732#A3.F13), the distance increases rapidly at small lags and plateaus thereafter, indicating that local token distributions change over short temporal scales while remaining stable beyond a characteristic separation. This shows that the input stream is not locally stationary even at the scale of the model's effective context.
#### Sequence construction
For the recurrent baselines on both datasets, next-token prediction is formulated using a fixed context window of length 4. Given a token sequence $\{x_{t}\}$, each training or evaluation sample is constructed as
$$(x_{t-4}, x_{t-3}, x_{t-2}, x_{t-1}) \rightarrow x_{t}.$$

These subsequences are extracted in a sliding-window fashion with stride 1, ensuring that every position in the sequence contributes a training example.
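The sliding-window construction can be illustrated with NumPy. This is a sketch under the assumption that samples are materialized as (context, target) arrays; the function name `sliding_window_samples` is hypothetical, and only positions with a full 4-token history yield a sample.

```python
import numpy as np


def sliding_window_samples(tokens, context=4):
    """Build (context, target) pairs with stride 1.

    Row i holds tokens[i : i + context]; its target is tokens[i + context].
    """
    tokens = np.asarray(tokens)
    # Each window has context + 1 entries: the inputs followed by the target.
    windows = np.lib.stride_tricks.sliding_window_view(tokens, context + 1)
    return windows[:, :context], windows[:, context]


contexts, targets = sliding_window_samples(np.arange(7), context=4)
```

`sliding_window_view` returns read-only views, so no copy of the stream is made; a stream of length $n$ yields $n - 4$ samples.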
Importantly, while inputs are constructed from short local windows, hidden states are propagated sequentially across the stream without resetting, allowing the model to accumulate information over long temporal horizons beyond the fixed context window. In contrast, Transformers do not maintain persistent hidden states across the stream and rely solely on explicit context windows, highlighting the distinction between learned memory and direct context access.
## Appendix D Hyperparameters
Table 5: Hyperparameters for SHARP on benchmark datasets.

Table 6: Hyperparameters for RNN, LSTM, and GRU baselines on benchmark datasets.

Table 7: Hyperparameters for the Clockwork RNN baseline on benchmark datasets.

Table 8: Hyperparameters for the Transformer baseline on benchmark datasets. Two parameter budgets ($\sim$10M and $\sim$5M for the char-level tasks; $\sim$22.7M and $\sim$18.0M for the sub-word task) each use a training context length of 1024. The architecture follows a Pre-LN LLaMA-style stack (RMSNorm, RoPE, SwiGLU). Our implementation builds on code from Sun et al. ([2025](https://arxiv.org/html/2606.00732#bib.bib80)).