Language Models Need Sleep

Summary

This paper introduces a sleep-like consolidation mechanism for Transformer-based LLMs that periodically converts recent context into persistent fast weights in SSM blocks, clearing the KV cache to improve long-horizon reasoning without increasing inference latency.

Cached at: 05/26/26, 06:57 PM

# Language Models Need Sleep
Source: [https://arxiv.org/html/2605.26099](https://arxiv.org/html/2605.26099)
Sangyun Lee (Carnegie Mellon University), Sean McLeish (University of Maryland), Tom Goldstein (University of Maryland), Giulia Fanti (Carnegie Mellon University)

###### Abstract

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During the sleep, the model performs $N$ offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. During inference, this shifts extra computation to the sleep while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which a regular transformer as well as SSM-attention hybrid models fail. We then show that increasing sleep duration $N$ for our models improves performance, with the largest gains on examples that require deeper reasoning.

## 1 Introduction

Large Language Models (LLMs) are commonly based on the transformer architecture \[[51](https://arxiv.org/html/2605.26099#bib.bib4)\], which stores context in an attention cache and retrieves past tokens as needed. This memory mechanism is central to their performance, but it scales poorly: total attention compute grows quadratically with context length, while cache memory grows linearly.

Recent efficient sequence models \[[42](https://arxiv.org/html/2605.26099#bib.bib49),[18](https://arxiv.org/html/2605.26099#bib.bib50),[16](https://arxiv.org/html/2605.26099#bib.bib51),[2](https://arxiv.org/html/2605.26099#bib.bib31)\] mitigate this cost by introducing fixed-size fast weight memories \[[53](https://arxiv.org/html/2605.26099#bib.bib1),[14](https://arxiv.org/html/2605.26099#bib.bib52),[43](https://arxiv.org/html/2605.26099#bib.bib41)\] interleaved with full self-attention. This hybrid design brings together two complementary forms of memory: attention for high-fidelity access to recent tokens, and weight-based memory for compressed information beyond the active context window. Hybrid models are now common among large-scale frontier models \[[49](https://arxiv.org/html/2605.26099#bib.bib61)\].

However, scalable memory is not the same as scalable reasoning. A fast weight memory may support long-range recall \[[42](https://arxiv.org/html/2605.26099#bib.bib49)\], but it is unclear whether it can support deep computation over tokens that are no longer present in the KV cache. We find that the performance of vanilla SSM-attention hybrid models degrades (under the same token budget) as the required reasoning depth increases, *even when the amount of information to store is held fixed*. This suggests that the bottleneck is not merely memory capacity, as suggested by prior work \[[27](https://arxiv.org/html/2605.26099#bib.bib32),[2](https://arxiv.org/html/2605.26099#bib.bib31)\], but the amount of computation available for transforming evicted context into a useful internal state.

Sleep. In animals, the transfer from short-term memory to long-term memory is thought to be supported by hippocampal replay \[[33](https://arxiv.org/html/2605.26099#bib.bib15)\], especially during sleep \[[41](https://arxiv.org/html/2605.26099#bib.bib16)\]; in this phase, short-term hippocampal memories are reactivated and consolidated into cortical synaptic weights. Sleep makes animals unable to respond to external stimuli, suggesting that it must provide enough cognitive benefit to justify this cost \[[41](https://arxiv.org/html/2605.26099#bib.bib16)\]. Inspired by these biological processes, we propose a method for transferring context-window memory into persistent weights. When the model’s context window becomes full during inference, the model enters a “sleep” in which it performs multiple forward passes over the accumulated context and recursively updates its fast weights via a learned local rule. As in animal sleep, the model receives no external input tokens during this phase. After consolidation, the context window is cleared, and the model resumes operation with updated fast weights. During training, the model is optimized end-to-end by backpropagating through the entire process to maximize task performance after sleep.

Our architecture is also motivated by results on depth-recurrent or looped neural networks \[[23](https://arxiv.org/html/2605.26099#bib.bib22),[17](https://arxiv.org/html/2605.26099#bib.bib24),[4](https://arxiv.org/html/2605.26099#bib.bib25)\]. Prior work shows that dynamic-depth models can outperform fixed-depth counterparts on sequential reasoning tasks and solve hard problem instances that fixed-depth models cannot, by scaling the amount of compute spent at prediction time. Our key insight is that recurrence can be used not only for prediction but also for memory consolidation. Converting observed tokens into useful weight memory is itself a nontrivial computation, and need not be achievable in a single pass. Indeed, many learning algorithms, such as gradient descent, improve through iterative weight updates. Thus, allocating more recurrent computation during fast weight formation gives the model more steps to transform context into representations that support later prediction. We find that increasing the depth of recurrence, or *sleep duration*, improves reasoning after sleep. Unlike previous looped models, our model does not need to loop at prediction time: the additional computation has already been spent on forming fast weights that support later single-pass prediction.

We introduce and evaluate LLM sleep on carefully designed *synthetic tasks* where a model must answer questions about context that has already been evicted, using only a single forward pass. These synthetic tasks allow us to vary reasoning depth while holding memory load fixed, providing a clean stress test of whether sleep-time computation can convert transient context into fast weights that support later inference. We summarize our contributions as follows:

- In a controlled setting, we show that as the reasoning depth of a problem increases, vanilla State-Space Models (SSMs) such as Gated Delta Nets (GDNs) fail despite having enough fast weight capacity.
- We propose an architecture that combines recurrent computation with fast weight memory blocks, and show that increasing the number of recursions for our architecture improves performance over GDNs. We observe the largest gains on problem instances that require the deepest reasoning.
- We further validate the efficacy of our architecture on GSM-Infinite, a natural language math-reasoning dataset, using pre-trained LLM initializations.

Overall, these results support the central claim that a sleep\-like offline recurrence can organize evicted context into weights to support later reasoning\.

## 2 Related Work

Fast weights and linear recurrent neural networks. Linear recurrent neural networks or SSMs can be viewed as maintaining an online fast weight memory rather than a KV cache, which grows linearly with sequence length. In this view, linear attention corresponds to a recurrent update over a fixed-size, matrix-valued state, where key-value mappings are written and queried \[[29](https://arxiv.org/html/2605.26099#bib.bib42),[43](https://arxiv.org/html/2605.26099#bib.bib41)\]. Recent variants improve this memory with delta-rule updates and gates, enabling more selective writing, overwriting, and forgetting \[[54](https://arxiv.org/html/2605.26099#bib.bib46),[53](https://arxiv.org/html/2605.26099#bib.bib1),[55](https://arxiv.org/html/2605.26099#bib.bib45),[14](https://arxiv.org/html/2605.26099#bib.bib52)\]. These mechanisms underlie recent efficient hybrid language models \[[24](https://arxiv.org/html/2605.26099#bib.bib10),[39](https://arxiv.org/html/2605.26099#bib.bib9)\] and help explain why linear networks can offer favorable recall, throughput, and memory tradeoffs. They still struggle with exact copying and retrieval relative to full attention in some cases due to a fixed memory size, as pointed out by prior work \[[2](https://arxiv.org/html/2605.26099#bib.bib31),[27](https://arxiv.org/html/2605.26099#bib.bib32)\]. In contrast to these works, we show that such models can fail as the required reasoning depth to solve a task increases, *even when the amount of information to store is held fixed*.

Context compression. There are several methods for processing long contexts at test time by condensing contextual information. Ge et al. \[[21](https://arxiv.org/html/2605.26099#bib.bib11)\] propose using a language model to compress long contexts into a shorter sequence of hidden states, which are then passed to the language model in place of the original long context. Eyuboglu et al. \[[20](https://arxiv.org/html/2605.26099#bib.bib47)\] use offline self-study to learn a small KV cache that can substitute for the full-context cache. This line of work shares our goal of spending offline computation once to turn a long context into a compact state that can be reused later. These methods shorten what remains in the attention context, whereas our method transfers evicted context into weight-based memory.

Context distillation. Context distillation \[[46](https://arxiv.org/html/2605.26099#bib.bib66),[3](https://arxiv.org/html/2605.26099#bib.bib65)\] aims to distill active context into model weights by training a model that lacks the context to imitate a teacher that has it \[[46](https://arxiv.org/html/2605.26099#bib.bib66),[3](https://arxiv.org/html/2605.26099#bib.bib65),[8](https://arxiv.org/html/2605.26099#bib.bib68)\], reconstruct it \[[11](https://arxiv.org/html/2605.26099#bib.bib67)\], predict its continuation \[[8](https://arxiv.org/html/2605.26099#bib.bib68),[11](https://arxiv.org/html/2605.26099#bib.bib67)\], or answer questions about it \[[47](https://arxiv.org/html/2605.26099#bib.bib70),[9](https://arxiv.org/html/2605.26099#bib.bib69),[8](https://arxiv.org/html/2605.26099#bib.bib68)\]. Instead of doing gradient descent on predefined losses, our method uses a learned recurrent forward pass to transfer context to weights.

Test-time training. Tandon et al. \[[48](https://arxiv.org/html/2605.26099#bib.bib48)\] replace full attention with sliding-window attention and perform test-time gradient updates on a subset of MLP layers. At inference time, their method optimizes a standard cross-entropy loss on the observed context, storing long-range information in temporary parameter updates rather than in a full KV cache. They perform only one gradient step for distilling each context chunk. By contrast, our method uses a learned recurrent forward pass as the memory-update rule, allowing more flexible forms of consolidation that need not correspond to a one-step gradient descent on a fixed scalar objective. They primarily evaluate perplexity on general web-text data, where retrieval and reasoning demands are entangled; we instead use synthetic tasks that independently control reasoning depth and problem length, showing that additional sleep-time computation is most beneficial when reasoning depth increases. Zhang et al. \[[56](https://arxiv.org/html/2605.26099#bib.bib72)\] attach a LoRA adapter that updates model weights from the current context chunk and evaluate this approach in a reinforcement-learning setting. Unlike ours, their method updates the weights only once per chunk.

Depth-recurrent models. Increasing the depth of language models is known to increase their expressivity \[[35](https://arxiv.org/html/2605.26099#bib.bib23)\]. Depth recurrence is one way to increase depth in transformer models and one method for making them Turing complete \[[17](https://arxiv.org/html/2605.26099#bib.bib24)\]. Moreover, the depth of these models can be adaptive \[[23](https://arxiv.org/html/2605.26099#bib.bib22),[19](https://arxiv.org/html/2605.26099#bib.bib38),[44](https://arxiv.org/html/2605.26099#bib.bib39),[5](https://arxiv.org/html/2605.26099#bib.bib40)\]. Recent work has scaled these depth-adaptive language models to large scales, both training from scratch \[[22](https://arxiv.org/html/2605.26099#bib.bib20),[58](https://arxiv.org/html/2605.26099#bib.bib34)\] and as a post-training objective \[[34](https://arxiv.org/html/2605.26099#bib.bib37)\]. Detailed analyses of how best to train depth-recurrent models suggest the recurrent depth should be scaled with training compute \[[40](https://arxiv.org/html/2605.26099#bib.bib35),[45](https://arxiv.org/html/2605.26099#bib.bib36)\].

Offline planning. Successful planning in structured environments often requires combining newly observed information with memories of earlier states. A longstanding view is that animals perform this integration online at choice time \[[50](https://arxiv.org/html/2605.26099#bib.bib57),[36](https://arxiv.org/html/2605.26099#bib.bib56)\]. However, integrating distant memories at choice time can be time-consuming, and offline planning during off-task rest can amortize such cost \[[36](https://arxiv.org/html/2605.26099#bib.bib56)\]. Consistent with this view, Momennejad et al. \[[36](https://arxiv.org/html/2605.26099#bib.bib56)\] show that neural evidence of offline replay during rest predicts improved planning performance for human subjects. Recent work from the machine learning community studies related mechanisms with artificial neural networks. Lin et al. \[[30](https://arxiv.org/html/2605.26099#bib.bib54)\] propose scaling offline compute by letting LLMs generate expected questions from users and precompute quantities needed to solve them. Chalvidal et al. \[[10](https://arxiv.org/html/2605.26099#bib.bib55)\] train a single-layer network on reinforcement-learning environments and show that recursive Hebbian-like weight updates support fast adaptation. In this paper, we show that recursively updating fast weights during a sleep-like offline phase improves reasoning over evicted context while preserving a strict prediction-phase latency constraint.

## 3 Preliminaries

### 3.1 Sequence mixers

Attention. Softmax attention \[[51](https://arxiv.org/html/2605.26099#bib.bib4)\] is a sequence-mixing operation in which each token retrieves information from previous tokens according to query-key similarity. For the token representation $\boldsymbol{x}_t$ at timestep $t$, define

$$\boldsymbol{q}_t = \mathbf{W}_Q \boldsymbol{x}_t, \qquad \boldsymbol{k}_t = \mathbf{W}_K \boldsymbol{x}_t, \qquad \boldsymbol{v}_t = \mathbf{W}_V \boldsymbol{x}_t, \tag{1}$$

where $\boldsymbol{q}_t, \boldsymbol{k}_t, \boldsymbol{v}_t \in \mathbb{R}^{d}$ are column vectors, and $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V$ are learned projection matrices with compatible shapes. Self-attention stores all previous keys $\boldsymbol{k}_t$ and values $\boldsymbol{v}_t$ in $\mathbf{K}_t = [\boldsymbol{k}_1, \ldots, \boldsymbol{k}_t]^{\top} \in \mathbb{R}^{t \times d}$ and $\mathbf{V}_t = [\boldsymbol{v}_1, \ldots, \boldsymbol{v}_t]^{\top} \in \mathbb{R}^{t \times d}$, then computes

$$\boldsymbol{o}_t = \mathbf{V}_t^{\top} \operatorname{softmax}\!\left(\frac{\mathbf{K}_t \boldsymbol{q}_t}{\sqrt{d}}\right). \tag{2}$$

This allows $\boldsymbol{x}_t$ to attend to any previous token, but requires storing $\mathbf{K}_t$ and $\mathbf{V}_t$, the KV cache, whose size grows linearly with sequence length.
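As a concrete illustration of Equations 1 and 2, the following pure-Python sketch computes the attention output for one query at a time while the cache grows. It is not the authors' code: the identity projections and the toy two-dimensional inputs are assumptions made only to keep the example short.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Equation 2 for one query: o_t = V_t^T softmax(K_t q_t / sqrt(d))."""
    d = len(q)
    scores = [sum(ki * qi for ki, qi in zip(k, q)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# Assumption for illustration: W_Q = W_K = W_V = identity, so q_t = k_t = v_t = x_t.
keys, values, outputs = [], [], []
for x_t in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    keys.append(x_t)    # the KV cache gains one key ...
    values.append(x_t)  # ... and one value per processed token
    outputs.append(attend(x_t, keys, values))
```

After three tokens the cache holds three key-value pairs, which is the linear growth in sequence length noted above.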

Linear recurrent layers. By contrast, linear recurrent layers, including many SSM-style architectures, store the past in a fixed-size fast-weight state. A simple Mamba2-style \[[14](https://arxiv.org/html/2605.26099#bib.bib52)\] update can be written as a gated Hebbian-like outer-product rule \[[25](https://arxiv.org/html/2605.26099#bib.bib60),[43](https://arxiv.org/html/2605.26099#bib.bib41)\]:

$$\mathbf{S}_t = \alpha_t \mathbf{S}_{t-1} + \beta_t \boldsymbol{v}_t \boldsymbol{k}_t^{\top}, \qquad \boldsymbol{o}_t = \mathbf{S}_t \boldsymbol{q}_t. \tag{3}$$

Here $\alpha_t \in (0,1)$ is a data-dependent forget gate and $\beta_t \in (0,1)$ is a data-dependent input gate, both computed from $\boldsymbol{x}_t$. Unlike the KV cache $\mathbf{K}_t$ and $\mathbf{V}_t$, the fast weight $\mathbf{S}_t$ does not grow in size with $t$. This makes linear recurrent layers more memory-efficient, but also more lossy: past tokens must be compressed into a fixed-size weight-based memory. In our experiments we use Gated Delta Networks (GDNs), which add a delta-rule correction to this update; however, the specific update rule does not matter for our discussion.
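The update in Equation 3 can be sketched in a few lines of pure Python. This is an illustrative toy rather than the GDN layer used in the experiments: the gates are fixed constants here (an assumption; in the paper they are computed from the input), and the helper names `fast_weight_step` and `read` are hypothetical.

```python
def fast_weight_step(S, k, v, alpha, beta):
    """One step of Equation 3: S_t = alpha * S_{t-1} + beta * v k^T."""
    return [[alpha * S[i][j] + beta * v[i] * k[j] for j in range(len(k))]
            for i in range(len(v))]

def read(S, q):
    """Query the memory: o_t = S_t q_t."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]

d = 2
S = [[0.0] * d for _ in range(d)]  # zero-initialized d x d fast weights

# Assumption for illustration: constant gates alpha = beta = 0.5.
S = fast_weight_step(S, k=[1.0, 0.0], v=[0.0, 1.0], alpha=0.5, beta=0.5)
S = fast_weight_step(S, k=[0.0, 1.0], v=[1.0, 0.0], alpha=0.5, beta=0.5)

# Querying with the first key retrieves the first value, attenuated by the
# input gate (0.5) and one forget step (0.5).
o = read(S, [1.0, 0.0])
```

The state stays a $d \times d$ matrix no matter how many tokens are written, which is the fixed-size property contrasted with the KV cache above; the price is that older writes decay.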

In a language model, a sequence-mixing layer is combined with normalization, residual connections, and an MLP layer to form a block. We write $\mathcal{B}^{\mathrm{attn}}_{\ell}$ for a block whose sequence-mixing layer is attention, and $\mathcal{B}^{\mathrm{ssm}}_{\ell}$ for a block whose sequence-mixing layer is a linear recurrent layer.

For example, an attention-only language model is formed by stacking attention blocks $D$ times between an embedding layer and an output projection:

$$\mathrm{Embed} \rightarrow \mathcal{B}^{\mathrm{attn}}_{0} \rightarrow \cdots \rightarrow \mathcal{B}^{\mathrm{attn}}_{\ell} \rightarrow \mathcal{B}^{\mathrm{attn}}_{\ell+1} \rightarrow \cdots \rightarrow \mathcal{B}^{\mathrm{attn}}_{D-1} \rightarrow \mathrm{OutProj}. \tag{4}$$

Hybrid models. Recent hybrid sequence models \[[42](https://arxiv.org/html/2605.26099#bib.bib49),[18](https://arxiv.org/html/2605.26099#bib.bib50),[16](https://arxiv.org/html/2605.26099#bib.bib51),[2](https://arxiv.org/html/2605.26099#bib.bib31)\] mitigate the cost of self-attention layers by interleaving them with SSM blocks \[[53](https://arxiv.org/html/2605.26099#bib.bib1),[14](https://arxiv.org/html/2605.26099#bib.bib52),[43](https://arxiv.org/html/2605.26099#bib.bib41)\] with fixed-size fast-weight memories. For example:

$$\mathrm{Embed} \rightarrow \mathcal{B}^{\mathrm{attn}}_{0} \rightarrow \mathcal{B}^{\mathrm{ssm}}_{1} \rightarrow \mathcal{B}^{\mathrm{attn}}_{2} \rightarrow \mathcal{B}^{\mathrm{ssm}}_{3} \rightarrow \cdots \rightarrow \mathcal{B}^{\mathrm{attn}}_{D-1} \rightarrow \mathrm{OutProj}. \tag{5}$$
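To make the memory trade-off between the two layouts concrete, here is a small sketch that builds an alternating layout and counts cached key-value pairs. The function names are hypothetical, and the strict attention/SSM alternation is an assumption for illustration; real hybrid models may mix the block types in other ratios.

```python
def hybrid_layout(depth):
    """Alternate attention and SSM blocks, starting with attention."""
    return ["attn" if i % 2 == 0 else "ssm" for i in range(depth)]

def kv_cache_entries(layout, seq_len):
    """Each attention block caches one key-value pair per token.

    SSM blocks keep a fixed-size state instead, so they add nothing here.
    """
    return sum(seq_len for kind in layout if kind == "attn")

attention_only = ["attn"] * 4
hybrid = hybrid_layout(4)  # attn -> ssm -> attn -> ssm

full_cache = kv_cache_entries(attention_only, seq_len=100)
hybrid_cache = kv_cache_entries(hybrid, seq_len=100)
```

With four blocks and 100 tokens, the alternating layout caches half as many key-value pairs as the attention-only stack.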

### 3.2 Synthetic reasoning tasks

To begin, we study two synthetic tasks to understand our changes in a controlled setting\.

Rule 110. Rule 110 \[[13](https://arxiv.org/html/2605.26099#bib.bib59)\] is a simple one-dimensional binary cellular automaton that evolves a binary string according to a fixed local transition rule. The general problem of predicting Rule 110 after $t$ steps is P-complete \[[37](https://arxiv.org/html/2605.26099#bib.bib58)\], and no efficient general parallel shortcut is known. Training a neural network to predict the $t$-th state is therefore a good test of whether the model can carry out deep sequential computation.
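For readers unfamiliar with the automaton, a minimal Rule 110 simulator is sketched below. The periodic (wrap-around) boundary is an assumption, since the text does not say how the ends of the string are handled; the function names are illustrative.

```python
RULE = 110  # binary 01101110: bit p gives the next cell for neighborhood p

def step(state):
    """Apply one Rule 110 transition to a list of 0/1 cells.

    Assumption: periodic boundary, i.e. the string wraps around.
    """
    n = len(state)
    return [
        (RULE >> ((state[(i - 1) % n] << 2) | (state[i] << 1) | state[(i + 1) % n])) & 1
        for i in range(n)
    ]

def rollout(state, t):
    """Apply t transitions sequentially; each step needs the previous one."""
    for _ in range(t):
        state = step(state)
    return state

initial = [0, 0, 0, 1, 0, 0, 0, 0]
after_one = rollout(initial, 1)
after_two = rollout(initial, 2)
```

The loop in `rollout` is inherently sequential, which is why the rollout step $t$ serves as a proxy for reasoning depth.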

Depo. Depo is a multi-hop knowledge retrieval task introduced by Allen-Zhu and Li \[[1](https://arxiv.org/html/2605.26099#bib.bib2)\] to evaluate the reasoning depth of a language model. Each sequence consists of a shuffled directed cycle followed by queries; each query asks for the node reached after $k$ outgoing edges from a start node, with larger $k$ requiring deeper graph traversal.
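The structure of a Depo instance can be sketched as follows. This is a hedged reconstruction from the description above, not the original generator: the integer node names, the `(start, k, answer)` query tuples, and the function name `make_depo_instance` are all assumptions, and the actual token format is not specified here.

```python
import random

def make_depo_instance(n, k, num_queries, seed=0):
    """Build a shuffled directed cycle on n nodes plus k-hop queries."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    # One directed cycle: order[i] -> order[(i + 1) % n].
    edges = [(order[i], order[(i + 1) % n]) for i in range(n)]
    rng.shuffle(edges)  # edges are presented in shuffled order
    successor = dict(edges)

    queries = []
    for _ in range(num_queries):
        start = rng.randrange(n)
        node = start
        for _ in range(k):  # follow k outgoing edges
            node = successor[node]
        queries.append((start, k, node))
    return edges, queries

edges, queries = make_depo_instance(n=8, k=3, num_queries=4)
```

Because the edges arrive in shuffled order, answering a $k$-hop query requires chaining $k$ lookups, so $k$ controls the traversal depth while the amount of stored information (the $n$ edges) stays fixed.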

These tasks allow us to vary reasoning demand while holding sequence length fixed, isolating a model’s reasoning capability from its information retrieval capability\.

## 4 Motivating example: Can attention-SSM hybrid models reason about context they can no longer attend to?

Attention-SSM hybrid models are often motivated by the idea that fast-weight memory can compensate for limited attention windows \[[42](https://arxiv.org/html/2605.26099#bib.bib49)\], compressing information from past tokens once they are no longer directly accessible. In this section, we explore a case where this hybrid mechanism fails.

Consider the following example drawing on cellular automaton Rule 110 \[[13](https://arxiv.org/html/2605.26099#bib.bib59)\]. In this setting, we train the model on four independent length-24 binary strings, each representing an initial state for Rule 110. Here, we use a character-level tokenizer (i.e., ‘0’ and ‘1’ define tokens). The four states are unrelated to each other (i.e., they are not obtained by unrolling the previous state). After processing all four binary strings, of combined length $T := 24 \times 4 = 96$, the model must later predict the first bit of each state after $t$ transitions. Since four label tokens follow the states, the full sequence has $T + 4 = 100$ tokens. An example sequence is:

The first answer token 1 (label 0) is obtained by unrolling 0101…1101 (state 0) $t$ times and taking its first bit, and so on. The rollout step $t$ controls the reasoning depth required to solve this task: when $t=0$ (no rollout), this becomes a simple first-bit retrieval task, and the task becomes more difficult as $t$ increases.

To stress-test whether an SSM can complement self-attention by providing past information, we impose a strict context window size as well as a *hard-eviction constraint*: we clear the context window every $24$ tokens, and we denote this with $L=24$. This means that the model can only see one state in context at a time and must fully encode this information into its fast weights $\mathbf{S}_t$, as the KV cache $\mathbf{K}_t$ and $\mathbf{V}_t$ are *fully evicted* before moving on to the next state. The hard eviction boundary is denoted by \|\|.

This hard eviction constraint naturally divides a sequence into two distinct phases:

- the *consolidation phase* (the first 96 tokens in the example sequence), during which the model must encode context into its fast weights $\mathbf{S}_t$; and
- the *prediction phase* (the last 4 tokens in the example sequence), during which the model predicts the answer tokens.

We impose a *prediction-phase latency constraint*: during the prediction phase, each answer token is predicted with a single standard forward pass. Extra loops or chain-of-thought tokens are disallowed because they increase prediction latency. Thus, all information needed to predict the labels must already have been consolidated into the fast weights before the prediction phase begins.
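The setup described in this section can be sketched as a small data generator. It is a reconstruction under stated assumptions rather than the authors' pipeline: the Rule 110 boundary is taken to be periodic, the states are sampled uniformly at random, and `make_example` and `windows` are hypothetical names.

```python
import random

RULE = 110

def step(state):
    """One Rule 110 transition (assumption: periodic boundary)."""
    n = len(state)
    return [
        (RULE >> ((state[(i - 1) % n] << 2) | (state[i] << 1) | state[(i + 1) % n])) & 1
        for i in range(n)
    ]

def make_example(t, num_states=4, width=24, seed=0):
    """Return (tokens, loss_mask): four states followed by four label bits."""
    rng = random.Random(seed)
    states = [[rng.randint(0, 1) for _ in range(width)] for _ in range(num_states)]
    labels = []
    for s in states:
        for _ in range(t):  # unroll each state t times
            s = step(s)
        labels.append(s[0])  # the label is the first bit after t steps
    tokens = [bit for s in states for bit in s] + labels
    loss_mask = [0] * (num_states * width) + [1] * num_states
    return tokens, loss_mask

def windows(seq, L):
    """Split into non-overlapping context windows of length at most L."""
    return [seq[i:i + L] for i in range(0, len(seq), L)]

tokens, loss_mask = make_example(t=32)
chunks = windows(tokens, L=24)  # hard eviction happens between chunks
```

With $L=24$ this yields four consolidation windows of 24 tokens and one prediction window of 4 tokens, so each state must be written into the fast weights before the next window begins.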

Under this hard eviction constraint, a standard transformer cannot do better than random guessing, as the KV cache has been destroyed before the prediction is made. SSMs or attention-SSM hybrid models can do better than random guessing because they can store the initial states in their fast weights. For example, one way to solve this task is to simulate the $t$-step state evolution once the context is full, store the first bit of each evolved state in the fast weights, and retrieve this bit at prediction time. However, [Figure 2(a)](https://arxiv.org/html/2605.26099#S6.F2.sf1) shows that the performance of a 4-layer GDN-attention hybrid model (with an attention $\rightarrow$ GDN $\rightarrow$ attention $\rightarrow$ GDN layout) drops rapidly as $t$ increases. This drop is not due to the memory-capacity limitation identified in prior work \[[27](https://arxiv.org/html/2605.26099#bib.bib32),[2](https://arxiv.org/html/2605.26099#bib.bib31)\]: we vary only $t$ while keeping the sequence length $T$ fixed. Instead, the difficulty comes from the deep sequential computation needed to simulate the automaton for $t$ steps, which a fixed-depth model cannot scale to match.

On task failures. When we say that a model fails or degrades on a task, we do not mean that the architecture could never learn the task with unlimited data, compute, or training time. Our claims concern performance under a fixed training-token budget. This budgeted setting matters because reasoning-intensive data is sparse even in web-scale corpora. Budget-controlled synthetic tasks can expose, earlier and more clearly, trends that align with phenomena observed in larger-scale pretraining \[[1](https://arxiv.org/html/2605.26099#bib.bib2)\].

## 5 LLM Sleep: Offline Recursive Memory Consolidation

We now present a solution to the above example: a *sleep* during LLM training, in which the model performs recursion during a consolidation phase before evicting tokens from attention layers once the context window is full. In this way, we can scale compute to handle deep reasoning tasks (e.g., a large $t$ from our motivating example) while still obeying a prediction-phase latency constraint. For example, if we loop over all $D$ blocks, it looks like:

$$\mathrm{Embed} \rightarrow \left[\mathcal{B}^{\mathrm{attn}}_{0} \rightarrow \mathcal{B}^{\mathrm{ssm}}_{1} \rightarrow \cdots \rightarrow \mathcal{B}^{\mathrm{attn}}_{D-1}\right]^{\times N} \rightarrow \mathrm{OutProj} \tag{6}$$

where the superscript $\times N$ denotes $N$ looped passes over the architecture.

Algorithm 1: Our LLM sleep training with hard eviction.

1. Input: tokens $x$, loss mask $m$, window size $L$, sleep passes $N$
2. Zero-initialize SSM fast weights $\mathbf{S}$
3. Split $x, m$ into non-overlapping chunks of length at most $L$
4. for each token chunk $c$ and its loss mask $m_c$ do
5. $h \leftarrow \mathrm{Embed}(c)$
6. if $m_c$ is all-zero then ($\triangleright$ consolidation phase)
7. for $n = 1, \ldots, N$ do
8. $h, \mathbf{S} \leftarrow \mathrm{Blocks}(h, \mathbf{S})$
9. end for
10. else ($\triangleright$ prediction phase)
11. $h, \mathbf{S} \leftarrow \mathrm{Blocks}(h, \mathbf{S})$
12. $\mathcal{L} \leftarrow \mathrm{MaskedCE}(\mathrm{OutProj}(h), c, m_c)$ ($\triangleright$ masked cross-entropy loss)
13. end if
14. end for
15. Backpropagate $\mathcal{L}$ and take an optimizer step

[Figure 1](https://arxiv.org/html/2605.26099#S5.F1) describes the architecture in detail. We initialize from an SSM-attention hybrid model with a fixed context-window size $L$, where the attention cache is fully evicted every $L$ tokens. Before evicting the KV cache every $L$ tokens, the model performs $N$ recurrent passes to iteratively update the fast weights inside the SSM blocks following [Equation 3](https://arxiv.org/html/2605.26099#S3.E3); with $N=1$, it reduces to a vanilla SSM-attention hybrid model. We call the phase when the model is iteratively updating the fast weights a *sleep*.

After recurrently refining the fast weights, the KV cache is evicted and the next $L$ tokens are processed. After processing the full context, the model predicts the answer based on the refined memory and current context *in a single forward pass*. The model is trained to minimize the prediction error by backpropagating through the entire computational graph shown in [Equation 6](https://arxiv.org/html/2605.26099#S5.E6), similarly to other depth-recurrent models \[[17](https://arxiv.org/html/2605.26099#bib.bib24),[23](https://arxiv.org/html/2605.26099#bib.bib22)\]. Unlike prior depth-recurrent models, where the gradient flows through recursively refined feature vectors, here the gradient flows through the refined fast weights because we discard the refined features after sleep. [Algorithm 1](https://arxiv.org/html/2605.26099#alg1) summarizes the training procedure.
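The control flow of the procedure can be illustrated with a forward-only toy in pure Python. This sketch is not the authors' implementation: the fast weights are collapsed to a single scalar, `toy_embed` and `toy_blocks` are hypothetical stand-ins for the real embedding and block stack, and the loss and backpropagation steps are omitted because they need an autodiff framework.

```python
def split_chunks(seq, L):
    """Non-overlapping chunks of length at most L."""
    return [seq[i:i + L] for i in range(0, len(seq), L)]

def run_with_sleep(tokens, loss_mask, L, N, embed, blocks):
    """Forward pass with N sleep passes per consolidation chunk."""
    S = 0.0  # zero-initialized fast weights (scalar stand-in, an assumption)
    passes = {"sleep": 0, "wake": 0}
    outputs = []
    for c, m_c in zip(split_chunks(tokens, L), split_chunks(loss_mask, L)):
        h = embed(c)
        if not any(m_c):  # all-zero mask: consolidation phase
            for _ in range(N):
                h, S = blocks(h, S)
                passes["sleep"] += 1
        else:  # prediction phase: exactly one pass
            h, S = blocks(h, S)
            passes["wake"] += 1
            outputs.append(h)
        # Hard eviction: h is dropped at the chunk boundary; only S carries over.
    return outputs, S, passes

def toy_embed(chunk):
    return [float(tok) for tok in chunk]

def toy_blocks(h, S):
    """Hypothetical block stack: reads S into the features and updates S."""
    S_new = 0.5 * S + sum(h) / len(h)
    return [x + S_new for x in h], S_new

tokens = [1, 0] * 48 + [1, 1, 0, 1]
loss_mask = [0] * 96 + [1] * 4
outputs, S, passes = run_with_sleep(tokens, loss_mask, L=24, N=3,
                                    embed=toy_embed, blocks=toy_blocks)
```

For the 100-token example with $L=24$ and $N=3$, the consolidation chunks account for twelve block-stack passes while the prediction chunk uses one, which is how extra compute is moved into sleep without changing prediction-phase cost.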

![Refer to caption](https://arxiv.org/html/2605.26099v1/x1.png)

Figure 1: At the eviction boundary, an SSM-attention hybrid performs $N$ offline recurrent passes over the current context before discarding the attention cache. These recurrent passes update the fast weights in the SSM blocks, allowing later predictions to use consolidated context without wake-time looping.

## 6 Experiments

Our experiments test whether longer sleep, implemented by increasing $N$, produces fast weights that support deeper reasoning over states that are no longer present in the attention cache. This requires more than storing evicted tokens: the model must encode past context into fast weights $(\mathbf{S}_t)$ in a form that supports nontrivial computation after the cache has been cleared, while still using only a single forward pass at prediction time. We evaluate this question across increasingly difficult settings. First, the cellular automaton task varies the rollout step $t$, isolating the depth of reasoning required over each evicted state. Second, the Depo task \[[1](https://arxiv.org/html/2605.26099#bib.bib2)\] adds a harder compression problem: the model must encode a fragmented graph into fast weights and later answer unseen multi-hop queries over it. Finally, we consider GSM-Infinite \[[57](https://arxiv.org/html/2605.26099#bib.bib3)\], where we fine-tune the pre-trained Jet-Nemotron 2B \[[24](https://arxiv.org/html/2605.26099#bib.bib10)\] and Ouro 1.4B \[[58](https://arxiv.org/html/2605.26099#bib.bib34)\] on a synthetic math-reasoning dataset.

Experiment details. Following McLeish et al. \[[34](https://arxiv.org/html/2605.26099#bib.bib37)\], we use the Muon optimizer for all experiments. We fix the AdamW learning rate to $5\mathrm{e}{-}5$ and tune only the Muon learning rate. For [Section 4](https://arxiv.org/html/2605.26099#S4) and [Section 6.1](https://arxiv.org/html/2605.26099#S6.SS1), we use a 4-layer GDN-attention hybrid model with hidden dimension $d=256$. We tune the Muon learning rate on the $N=1$ model, giving the no-loop baseline an advantage, and use the selected value, $2\mathrm{e}{-}3$, for all looped models. For [Section 6.2](https://arxiv.org/html/2605.26099#S6.SS2), we use the Jet-Nemotron architecture \[[24](https://arxiv.org/html/2605.26099#bib.bib10)\], an SSM-attention hybrid model fine-tuned from Qwen 2.5 1.5B by replacing some attention layers with Jet layers, which use dynamic convolution instead of the fixed convolution in GDN. To roughly match the small model size in Allen-Zhu and Li \[[1](https://arxiv.org/html/2605.26099#bib.bib2)\], we train a 10-layer model from scratch with hidden dimension $d=512$. We apply the same tuning protocol as above: tune the Muon learning rate on the $N=1$ baseline and use $2\mathrm{e}{-}3$ for the looped models. For [Section 6.3](https://arxiv.org/html/2605.26099#S6.SS3), we use pre-trained Ouro 1.4B \[[58](https://arxiv.org/html/2605.26099#bib.bib34)\] and Jet-Nemotron 2B \[[24](https://arxiv.org/html/2605.26099#bib.bib10)\] models, and set the Muon learning rate to $1\mathrm{e}{-}3$ following McLeish et al. \[[34](https://arxiv.org/html/2605.26099#bib.bib37)\]. The automaton experiments require less than one A6000 GPU-day. The Depo and GSM-Infinite experiments require roughly 1–2 H100 GPU-days per run. For the batch size, we use 512 for automaton, 128 for Depo, and 256 for GSM-Infinite. For fair comparison, we fix random seeds so that all runs use exactly the same data ordering.

### 6\.1 Task: Cellular automaton

In [Section 4](https://arxiv.org/html/2605.26099#S4), we saw that vanilla SSM\-attention hybrid models fail on the automaton task when $t$ is large, as hybrid models cannot scale compute when performing memory consolidation\. In [Figure 2\(b\)](https://arxiv.org/html/2605.26099#S6.F2.sf2), we use the same architecture as in [Figure 2\(a\)](https://arxiv.org/html/2605.26099#S6.F2.sf1): a 4\-layer GDN\-attention hybrid model with an attention $\rightarrow$ GDN $\rightarrow$ attention $\rightarrow$ GDN layout\. Our method additionally uses the ‘sleep’ during the consolidation phase discussed in [Section 5](https://arxiv.org/html/2605.26099#S5), where we use recurrence to iteratively update the fast weights\. We study 2 to 4 recurrent updates here\.

We train this looped hybrid architecture on a setting that requires substantial reasoning compute and is challenging for the non\-recurrent architecture: $t=32$\. In [Figure 2\(b\)](https://arxiv.org/html/2605.26099#S6.F2.sf2), “2 loops”, “3 loops”, and “4 loops” mean the model uses a sleep for memory consolidation, while “no loop” is the baseline\. [Figure 2\(b\)](https://arxiv.org/html/2605.26099#S6.F2.sf2) shows that the non\-looped model remains close to random guessing, reaching only about 10% exact accuracy after nearly 5B training tokens\. Adding offline passes improves both learning speed and final accuracy under the same token budget: two loops achieve approximately 20% accuracy, while three and four loops achieve above 30%\. Because the context length, eviction rule, and prediction\-phase computation are fixed across these runs, the improvement comes from additional consolidation\-time computation during sleep\.
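To make the role of the rollout step $t$ concrete, the following minimal sketch rolls an elementary cellular automaton forward $t$ steps. It assumes Rule 110 with periodic boundaries; the rule number, the boundary handling, and the names `step` and `rollout` are illustrative assumptions rather than a description of the exact task setup.

```python
def step(state, rule=110):
    """Apply one update of an elementary cellular automaton (periodic boundary)."""
    n = len(state)
    out = []
    for i in range(n):
        left, center, right = state[(i - 1) % n], state[i], state[(i + 1) % n]
        idx = (left << 2) | (center << 1) | right  # neighborhood as a 3-bit index
        out.append((rule >> idx) & 1)              # read the matching bit of the rule number
    return out


def rollout(state, t, rule=110):
    """Apply `step` t times; each step needs the result of the previous one."""
    for _ in range(t):
        state = step(state, rule)
    return state
```

Each of the $t$ updates depends on the output of the previous one, which is why a larger $t$ demands more sequential computation over the stored state.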

![Refer to caption](https://arxiv.org/html/2605.26099v1/x2.png) \(a\) Effect of rollout step $t$\.
![Refer to caption](https://arxiv.org/html/2605.26099v1/x3.png) \(b\) Effect of offline looping for $t=32$\.

Figure 2: Increasing $N$ improves performance on cellular automaton\. Left: Each curve represents a different number of rollout steps $t$ for a hybrid attention\-SSM architecture, as in the motivating example section\. Increasing $t$ makes the task harder for a vanilla attention\-GDN hybrid model\. We early\-stop 4\- and 8\-step runs as they converge earlier\. Right: For a challenging reasoning task \($t=32$\), additional offline sleep loops improve accuracy while preserving single\-pass wake\-time prediction\.

### 6\.2 Task: Depo

Next, we evaluate Depo, the $k$\-hop knowledge retrieval task introduced by Allen\-Zhu and Li \[[1](https://arxiv.org/html/2605.26099#bib.bib2)\]\. Each sequence consists of a shuffled directed cycle followed by queries; each query asks for the node reached after $k$ outgoing edges from a start node, with larger $k$ requiring deeper graph traversal\. Allen\-Zhu and Li \[[1](https://arxiv.org/html/2605.26099#bib.bib2)\] show that SSMs perform substantially worse than transformers on this task, despite having enough fast weight capacity to store the context\. This suggests that the bottleneck is not storage alone, but organizing stored edges into a representation that supports later multi\-hop retrieval\[[38](https://arxiv.org/html/2605.26099#bib.bib53)\]\. An example sequence from Depo is:

$$\overbrace{\text{b->a, f->l, ...}\ {\color{red}|}\ \text{...}\ {\color{red}|}\ \text{..., e->b}}^{\text{shuffled directed cycle}}\ \ |\ \ \overbrace{\text{1 hop after a:}\ {\color{red}\text{c}} \quad \text{...} \quad \text{4 hops after e:}\ {\color{red}\text{d}}}^{\text{query and answer}}$$

Here $|$ denotes an eviction boundary, and red text denotes answer tokens\.
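As an illustration of the task structure, the sketch below builds one shuffled directed cycle and answers a $k$-hop query by following successor edges. The function name `make_depo_example`, the edge string format, and the seeding are hypothetical choices for this example; they are not the generator used by Allen-Zhu and Li or in our experiments.

```python
import random


def make_depo_example(nodes, k, start, seed=0):
    """Return (shuffled edge list, node reached after k hops from `start`)."""
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)                           # random cycle order over the nodes
    succ = {u: order[(i + 1) % len(order)] for i, u in enumerate(order)}
    edges = [f"{u}->{v}" for u, v in succ.items()]
    rng.shuffle(edges)                           # edges are presented out of order
    node = start
    for _ in range(k):                           # k sequential lookups = k-hop traversal
        node = succ[node]
    return edges, node


edges, answer = make_depo_example("abcdef", k=2, start="a")
```

Because the edges arrive shuffled, answering a query requires chaining $k$ dependent lookups, so a model that only stores the edges still has to organize them before it can traverse the cycle.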

In our setting, each cycle contains up to 75 nodes and spans up to 300 tokens; shorter instances are left\-padded to 300 tokens at both train and test time\. The query\-answer portion then follows, with 10 query\-answer pairs spanning up to 60 tokens, making the total sequence length $T=360$\. The model’s window size is $L=75$, so each cycle is fragmented across four cache windows\. When the model predicts the query answers, the cycle context has already been evicted from the KV cache\. Depo is harder than the cellular automaton task for two reasons\. First, each cycle is fragmented across four cache windows, whereas each automaton state fits within a single window\. Second, the model must form a query\-agnostic representation because both $k$ and the start node are randomly sampled for each example, whereas $t$ is fixed in the automaton task\.
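The window arithmetic in this setup can be checked in a few lines. The variable names below are illustrative, and the sketch assumes the cache is cleared each time it reaches $L$ tokens.

```python
import math

L = 75               # attention window size in tokens
cycle_tokens = 300   # padded length of the shuffled cycle
query_tokens = 60    # 10 query-answer pairs

T = cycle_tokens + query_tokens                # total sequence length
cycle_windows = math.ceil(cycle_tokens / L)    # windows needed to hold the cycle
first_query_window = cycle_tokens // L         # 0-indexed window where queries begin

print(T, cycle_windows, first_query_window)
```

Under this assumption the queries start in the fifth window, after all four cycle windows have been evicted.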

In Depo, $k$ controls task difficulty: larger $k$ makes the query more difficult because the model must perform a longer multi\-hop traversal to recover the answer\. Following Allen\-Zhu and Li \[[1](https://arxiv.org/html/2605.26099#bib.bib2)\], we uniformly sample $k$ from $[1,16]$ during training and measure test loss on held\-out examples with $k \in \{1,2,4,8,16\}$\.

![Refer to caption](https://arxiv.org/html/2605.26099v1/x4.png)

Figure 3: Increasing $N$ improves performance on Depo\. Test loss of a 4\-layer GDN\-attention hybrid on the $k$\-hop knowledge retrieval task\. Additional offline loops accelerate learning, especially for more reasoning\-intensive, higher\-hop queries\.

[Figure 3](https://arxiv.org/html/2605.26099#S6.F3) shows test loss on held\-out examples over training steps, with each subplot corresponding to a hop count $k \in \{1,2,4,8,16\}$ and each curve comparing a model with $N \in \{1,2,4\}$ offline loops\. We see that increasing the number of offline loops improves learning speed for queries that require 4 or more hops\. The 1\-loop model makes little progress on 4\-hop and harder queries, and the 2\-loop model similarly stalls on 8\-hop and harder queries\. Within our training budget, only the 4\-loop model begins to improve on the hardest 16\-hop task\.

### 6\.3 Task: GSM\-Infinite

To test whether the trend from the controlled tasks extends to pretrained LLMs, we evaluate on GSM\-Infinite\[[57](https://arxiv.org/html/2605.26099#bib.bib3)\], a synthetic reasoning benchmark modeled after GSM8K\[[12](https://arxiv.org/html/2605.26099#bib.bib5)\]\. GSM\-Infinite is still structured enough for controlled analysis, but realistic enough that training on it can improve a model’s reasoning capabilities on other tasks\[[28](https://arxiv.org/html/2605.26099#bib.bib8)\]\. As GSM\-Infinite is procedurally generated, we can generate distinct training and evaluation datasets from the same distribution, similarly to Kabra et al\.\[[28](https://arxiv.org/html/2605.26099#bib.bib8)\]\. Our evaluation set consists of 1,600 held\-out examples\. The dataset controls problem length by adding distractor tokens that resemble the rest of the problem, making them difficult to ignore, and controls difficulty by varying the number of arithmetic operations required to solve the problem\. Unlike on retrieval\-focused long\-context tasks such as RULER\[[26](https://arxiv.org/html/2605.26099#bib.bib7)\], simple retrieval\-augmented baselines fail\[[57](https://arxiv.org/html/2605.26099#bib.bib3)\] on GSM\-Infinite, indicating that the task requires both long\-context processing and multi\-step reasoning\. GSM\-Infinite is challenging even for reasoning\-optimized frontier models, whose accuracy decays as the number of required operations increases\[[57](https://arxiv.org/html/2605.26099#bib.bib3)\]\.

In our experiments, each problem contains between 2,000 and 3,300 tokens, and the number of operations is sampled uniformly from $[1,8]$\. We place the question before the context and exclude Chain\-of\-Thought traces from the data, forcing the model to produce the final answer in a single prediction\-time forward pass\. This order gives the model the query before it reads the long problem context, allowing it to selectively consolidate information relevant to the question while ignoring filler tokens\. We set the model’s context\-window size to $L=2000$, so a full problem does not fit in the active context window and the model cannot attend to a majority of the problem context at prediction time\.

There are two complementary ways of instantiating our method from a pre\-trained model: starting from an SSM\-attention hybrid and fine\-tuning it with sleep\-time recurrence, or starting from a depth\-recurrent model and adding SSM memory layers\. We explore both, fine\-tuning the hybrid Jet\-Nemotron 2B\[[24](https://arxiv.org/html/2605.26099#bib.bib10)\] and the recurrent Ouro 1\.4B\[[58](https://arxiv.org/html/2605.26099#bib.bib34)\]\. Jet\-Nemotron is an SSM\-attention hybrid model fine\-tuned from Qwen 2\.5 1\.5B by replacing some attention layers with Jet layers, which use dynamic convolution instead of the fixed convolution in GDN\. Ouro is a looped attention\-only model, so we insert 6 Jet layers without MLP layers to augment Ouro with fast weight memory while increasing the total parameter count by less than 10%\.

For Jet, we loop over the middle 14 of the 28 blocks\. Looping over only the middle blocks is common practice in depth\-recurrent models\[[34](https://arxiv.org/html/2605.26099#bib.bib37),[22](https://arxiv.org/html/2605.26099#bib.bib20)\]\. For Ouro, we loop over all blocks, following how the model was pre\-trained\[[58](https://arxiv.org/html/2605.26099#bib.bib34)\]\. To keep memory cost during training manageable while using a reasonable batch size, we use $N \in \{1,2,4\}$ for Ouro\. Since Jet loops over only half of its blocks, we use $N \in \{1,2,4,6\}$\.

![Refer to caption](https://arxiv.org/html/2605.26099v1/x5.png) \(a\) Jet\-Nemotron 2B\.
![Refer to caption](https://arxiv.org/html/2605.26099v1/x6.png) \(b\) Ouro 1\.4B\.

Figure 4: Increasing $N$ improves performance on GSM\-Infinite\. GSM\-Infinite accuracy over training steps\. Subplots group examples by the number of arithmetic operations required by the problem, and colors indicate the number of offline loops $N$ used before cache eviction\. Additional loops improve accuracy most clearly on harder problems with more operations, where single\-loop models have less sleep\-time computation available to organize the evicted context into useful fast weights\.

[Figure 4](https://arxiv.org/html/2605.26099#S6.F4) shows the accuracy trend over training steps, with each subplot corresponding to a different number of operations required to solve the problem, ranging from 2 to 8\. We see that the trend from the from\-scratch training experiments persists in a more realistic math\-reasoning setting\. For easier two\- and four\-operation problems, accuracy often approaches saturation regardless of the number of loops, especially for Jet, which has more fast weight memory capacity than Ouro\. However, as the number of required operations increases, the gap between loop counts widens: additional offline recurrence improves both final accuracy and learning speed on the six\- and eight\-operation settings\. For Jet, six loops improve final accuracy on six\-operation problems from $0.742$ to $0.812$ and on eight\-operation problems from $0.351$ to $0.388$\. For Ouro, four loops improve final accuracy from $0.419$ to $0.615$ on six\-operation problems and from $0.210$ to $0.272$ on eight\-operation problems\. The gap is wider for Ouro, which may reflect its depth\-recurrent pretraining\. These results suggest that sleep\-time computation can support multi\-step reasoning even on realistic math\-reasoning data and with pre\-trained LLMs\.
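The accuracy pairs quoted above can be turned into absolute and relative gains with a short script. The numbers are copied from the text; the dictionary layout and variable names are illustrative only.

```python
# (model, operations) -> (accuracy before, accuracy after), as reported in the text
reported = {
    ("Jet", 6): (0.742, 0.812),
    ("Jet", 8): (0.351, 0.388),
    ("Ouro", 6): (0.419, 0.615),
    ("Ouro", 8): (0.210, 0.272),
}

gains = {}
for key, (before, after) in reported.items():
    absolute = after - before
    gains[key] = (absolute, absolute / before)   # (absolute gain, relative gain)

for (model, ops), (absolute, relative) in gains.items():
    print(f"{model:4s} {ops}-op: +{absolute:.3f} absolute, +{relative:.1%} relative")
```

On both operation counts the absolute and relative gains are larger for Ouro than for Jet, which is the sense in which the gap is wider for Ouro.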

### 6\.4 Sliding\-window eviction

So far, we have assumed that the model’s context window is completely evicted whenever it is full\. We can instead use a sliding\-window eviction strategy: after sleep, the model retains the most recent $L-1$ tokens in the attention cache and evicts only older tokens\. This does not increase peak inference\-time memory: the active context is still capped at $L$ tokens, as in sliding\-window attention \(SWA\)\. With $N=1$, this reduces to a standard SWA\-SSM hybrid model\[[42](https://arxiv.org/html/2605.26099#bib.bib49)\]; with $N>1$, the model performs additional recursive consolidation before older context leaves the attention cache\.
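The difference between the two eviction rules can be sketched with a toy cache simulation. This is a simplified illustration: the function name `simulate_eviction` is hypothetical, tokens are plain integers, and it assumes a sleep is triggered every time the cache reaches $L$ tokens, which may not match the exact sleep schedule used in training.

```python
def simulate_eviction(tokens, L, sliding):
    """Return (cache contents after each sleep, peak cache size)."""
    cache, snapshots, peak = [], [], 0
    for tok in tokens:
        cache.append(tok)
        peak = max(peak, len(cache))
        if len(cache) == L:                  # cache is full: sleep, then evict
            # sliding window keeps the newest L - 1 tokens; hard eviction keeps none
            cache = cache[1:] if sliding else []
            snapshots.append(list(cache))
    return snapshots, peak


hard, hard_peak = simulate_eviction(range(6), L=3, sliding=False)
slide, slide_peak = simulate_eviction(range(6), L=3, sliding=True)
```

In both cases the peak cache size is $L$, matching the claim that sliding-window eviction does not raise peak inference-time memory.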

We evaluate this strategy on GSM\-Infinite with $L=512$, so the total sequence length $T$ is roughly $4$–$6\times$ the window size\. We fine\-tune Ouro 1\.4B with $N \in \{1,2,4\}$\. Analogously to observations in prior work\[[7](https://arxiv.org/html/2605.26099#bib.bib64)\], we find that giving the model access to a sliding\-window KV cache can leave the newly inserted Jet layers underutilized\. We therefore first warm up only the Jet layers for one epoch and then train the full model for two epochs\. This SSM\-only warm\-up stage is standard when converting attention\-only models into attention\-SSM hybrids\[[52](https://arxiv.org/html/2605.26099#bib.bib62),[6](https://arxiv.org/html/2605.26099#bib.bib63),[24](https://arxiv.org/html/2605.26099#bib.bib10)\]\. We find that for $N>1$, using hard eviction during the warm\-up stage is crucial for the model to learn to refine the fast weights\.

![Refer to caption](https://arxiv.org/html/2605.26099v1/x7.png)

Figure 5: Increasing $N$ improves accuracy on GSM\-Infinite with sliding\-window eviction\. GSM\-Infinite accuracy with sliding\-window eviction over training steps\. We fine\-tune Ouro 1\.4B with window size $L=512$ and compare $N \in \{1,2,4\}$ sleep passes\.

[Figure 5](https://arxiv.org/html/2605.26099#S6.F5) shows accuracy over training steps, where the curve labeled “no loop” corresponds to the SWA\-SSM hybrid baseline with $N=1$\. Increasing $N$ improves accuracy at all operation counts, matching the trend in [Figure 4](https://arxiv.org/html/2605.26099#S6.F4)\. Unlike in [Figure 4](https://arxiv.org/html/2605.26099#S6.F4), where the window size is $L=2000$, this baseline performs poorly even on two\-operation problems, which are the least reasoning\-heavy and therefore more directly stress retrieval under distractor tokens\. On the other hand, using loops drastically improves accuracy from 0\.596 to 0\.905, a 52% relative improvement\. This suggests that when the active attention window is several times smaller than the sequence length, longer sleep duration helps not only with multi\-step reasoning, but also with compressing and retrieving relevant context\.
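For reference, the 52% figure is a relative improvement over the baseline accuracy, computed from the two reported values:

$$\frac{0.905 - 0.596}{0.596} = \frac{0.309}{0.596} \approx 0.518 \approx 52\%.$$

The corresponding absolute gain is $0.309$, or about 31 percentage points.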

### 6\.5 Training throughput

![Refer to caption](https://arxiv.org/html/2605.26099v1/x8.png) \(a\) Parallel SWA vs windowed\.
![Refer to caption](https://arxiv.org/html/2605.26099v1/x9.png) \(b\) Effect of varying $N$\.

Figure 6: Recurrence across context windows incurs minimal training overhead; recurrent depth increases cost linearly\. Training throughput comparison on one NVIDIA H200 GPU\. Sequence length is set to 12,000\. \(a\) When the window size $L$ is sufficiently large, processing context windows serially does not meaningfully change throughput compared to the fully parallel baseline\. \(b\) Throughput is roughly inversely proportional to $N$\. For each setting, batch size is tuned to maximize GPU utilization\. \(b\) additionally uses activation checkpointing along the context\-chunk axis to prevent out\-of\-memory errors\. FlashAttention 2\[[15](https://arxiv.org/html/2605.26099#bib.bib75)\] is used\.

Here we analyze how our method affects training throughput, measured in tokens processed per second, compared to a SWA\-SSM hybrid baseline\. We use the Ouro 1\.4B model from [Section 6\.3](https://arxiv.org/html/2605.26099#S6.SS3)\.

Recurrence across context windows\. Unlike standard teacher\-forced transformer training, which can process all token positions in parallel, our training is recurrent across context windows: before window $j+1$ can be processed, the model must finish processing window $j$ and perform the $N$ sleep passes that refine the fast weights\. The updated fast weights then become the state used to process window $j+1$, creating a sequential dependency across windows\. This prevents full parallelization along the sequence axis\. However, this loss of sequence\-axis parallelism need not hinder wall\-clock training time when the window size $L$ is large enough to keep the GPU saturated, as can occur in long\-context training regimes where both $T$ and $L$ are large \([Figure 6\(a\)](https://arxiv.org/html/2605.26099#S6.F6.sf1)\)\.
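The dependency structure described above can be written as a schematic loop. This is a sketch under stated assumptions: `process` and `sleep` are hypothetical placeholders for the wake-time forward pass and one sleep pass, and the toy callables below only show how state flows between windows; they do not implement the learned fast-weight update rule.

```python
def forward_over_windows(windows, N, process, sleep, state):
    """Process windows in order; window j + 1 only sees the state left after window j sleeps."""
    outputs = []
    for window in windows:                       # sequential across windows
        outputs.append(process(window, state))   # tokens inside a window can run in parallel
        for _ in range(N):                       # N sleep passes refine the state
            state = sleep(window, state)
    return outputs, state


# Toy stand-ins: the "state" is a running sum, refined once per sleep pass.
outputs, final_state = forward_over_windows(
    windows=[[1, 2], [3]],
    N=2,
    process=lambda window, state: state,
    sleep=lambda window, state: state + sum(window),
    state=0,
)
```

The outer loop cannot be parallelized because each iteration consumes the state produced by the previous one, while the work inside `process` still scales with the window size $L$.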

Recurrent\-depth cost\. In addition, as in other depth\-recurrent models, training cost grows roughly linearly with the number of recurrent steps $N$, as shown in [Figure 6\(b\)](https://arxiv.org/html/2605.26099#S6.F6.sf2)\. However, as we see in our experiments, increasing recurrence consistently improves task performance compared to non\-recurrent models\.

## 7 Discussion and Limitations

Our method preserves single\-pass prediction\-phase latency by moving the extra recurrent computation into the consolidation phase, but this gain is not free: during training, we need to perform $N$ deeper forward and backward passes, which can make training slow and unstable\. Tackling these challenges is an active topic in recurrent\-depth training, with possible approaches including implicit gradients\[[4](https://arxiv.org/html/2605.26099#bib.bib25)\] and truncated backpropagation through time\[[22](https://arxiv.org/html/2605.26099#bib.bib20),[34](https://arxiv.org/html/2605.26099#bib.bib37)\], as well as various techniques to stabilize training\[[40](https://arxiv.org/html/2605.26099#bib.bib35),[22](https://arxiv.org/html/2605.26099#bib.bib20)\]\.

Sleep makes training sequential across both the context and depth dimensions, but this sequentiality is also why our method shows gains on the tasks we consider, whose solutions are themselves sequential\. Many reasoning, simulation, and decision\-making problems targeted by modern machine learning appear to have this property\[[32](https://arxiv.org/html/2605.26099#bib.bib73)\]\. Attempting to solve inherently sequential tasks with fully parallel computation encourages brittle shortcut solutions\[[31](https://arxiv.org/html/2605.26099#bib.bib74),[32](https://arxiv.org/html/2605.26099#bib.bib73)\]\.

## 8 Conclusion

We propose a sleep\-like process in which a model performs multiple recursive forward passes to iteratively refine its fast weights before evicting the corresponding context from the attention cache\. Unlike a vanilla attention\-SSM hybrid model, a model with sleep can reason deeply about past context that it can no longer attend to\. Across controlled synthetic tasks and a more realistic mathematical reasoning benchmark, we show that increasing the number of recursions, or sleep duration, improves the model’s ability to perform deep sequential computation over evicted context\.

## Broader Impact

This work studies memory consolidation and reasoning in language models, which are important ingredients for building more capable long\-context systems\. Our contribution is primarily methodological and is evaluated on controlled synthetic tasks and modest\-scale pretrained models\. We therefore do not expect the risks to exceed those of other work in this area\.

## Acknowledgements

We gratefully acknowledge Modal for providing generous GPU resources\.

## References

- \[1\]Z\. Allen\-Zhu and Y\. Li\(2025\)Physics of language models: part 4\.1, architecture design and the magic of canon layers\.InThe Thirty\-Ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=kxv0M6I7Ud)Cited by:[§3\.2](https://arxiv.org/html/2605.26099#S3.SS2.p3.2),[§4](https://arxiv.org/html/2605.26099#S4.p6.1),[§6\.2](https://arxiv.org/html/2605.26099#S6.SS2.p1.3),[§6\.2](https://arxiv.org/html/2605.26099#S6.SS2.p4.5),[§6](https://arxiv.org/html/2605.26099#S6.p1.3),[§6](https://arxiv.org/html/2605.26099#S6.p2.8)\.
- \[2\]S\. Arora, S\. Eyuboglu, M\. Zhang, A\. Timalsina, S\. Alberti, D\. Zinsley, J\. Zou, A\. Rudra, and C\. Ré\(2024\)Simple linear attention language models balance the recall\-throughput tradeoff\.arXiv preprint arXiv:2402\.18668\.Cited by:[§1](https://arxiv.org/html/2605.26099#S1.p2.1),[§1](https://arxiv.org/html/2605.26099#S1.p3.1),[§2](https://arxiv.org/html/2605.26099#S2.p1.1),[§3\.1](https://arxiv.org/html/2605.26099#S3.SS1.p5.1),[§4](https://arxiv.org/html/2605.26099#S4.p5.8)\.
- \[3\]A\. Askell, Y\. Bai, A\. Chen, D\. Drain, D\. Ganguli, T\. Henighan, A\. Jones, N\. Joseph, B\. Mann, N\. DasSarma, N\. Elhage, Z\. Hatfield\-Dodds, D\. Hernandez, J\. Kernion, K\. Ndousse, C\. Olsson, D\. Amodei, T\. Brown, J\. Clark, S\. McCandlish, C\. Olah, and J\. Kaplan\(2021\)A general language assistant as a laboratory for alignment\.External Links:2112\.00861,[Link](https://arxiv.org/abs/2112.00861)Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p3.1)\.
- \[4\]S\. Bai, J\. Z\. Kolter, and V\. Koltun\(2019\)Deep equilibrium models\.Advances in neural information processing systems32\.Cited by:[§1](https://arxiv.org/html/2605.26099#S1.p5.1),[§7](https://arxiv.org/html/2605.26099#S7.p1.1)\.
- \[5\]A\. Bansal, A\. Schwarzschild, E\. Borgnia, Z\. Emam, F\. Huang, M\. Goldblum, and T\. Goldstein\(2022\)End\-to\-end algorithm synthesis with recurrent networks: extrapolation without overthinking\.Advances in Neural Information Processing Systems35,pp\. 20232–20242\.Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p5.1)\.
- \[6\]A\. Bick, K\. Y\. Li, E\. P\. Xing, J\. Z\. Kolter, and A\. Gu\(2024\)Transformers to ssms: distilling quadratic knowledge to subquadratic models\.Advances in neural information processing systems37,pp\. 31788–31812\.Cited by:[§6\.4](https://arxiv.org/html/2605.26099#S6.SS4.p2.6)\.
- \[7\]L\. Cabannes, M\. Beck, G\. Szilvasy, M\. Douze, M\. Lomeli, J\. Copet, P\. Mazaré, G\. Synnaeve, and H\. Jégou\(2025\)Short window attention enables long\-term memorization\.arXiv preprint arXiv:2509\.24552\.Cited by:[§6\.4](https://arxiv.org/html/2605.26099#S6.SS4.p2.6)\.
- \[8\]L\. Caccia, A\. Ansell, E\. Ponti, I\. Vulić, and A\. Sordoni\(2025\)Training plug\-n\-play knowledge modules with deep context distillation\.External Links:2503\.08727,[Link](https://arxiv.org/abs/2503.08727)Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p3.1)\.
- \[9\]B\. Cao, D\. Cai, and W\. Lam\(2025\)InfiniteICL: breaking the limit of context window size via long short\-term memory transformation\.External Links:2504\.01707,[Link](https://arxiv.org/abs/2504.01707)Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p3.1)\.
- \[10\]M\. Chalvidal, T\. Serre, and R\. VanRullen\(2022\)Meta\-reinforcement learning with self\-modifying networks\.Advances in Neural Information Processing Systems35,pp\. 7838–7851\.Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p6.1)\.
- \[11\]T\. Chen, H\. Fang, P\. Xia, X\. Liu, B\. V\. Durme, L\. Zettlemoyer, J\. Gao, and H\. Cheng\(2024\)Generative adapter: contextualizing language models in parameters with a single forward pass\.External Links:2411\.05877,[Link](https://arxiv.org/abs/2411.05877)Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p3.1)\.
- \[12\]K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§6\.3](https://arxiv.org/html/2605.26099#S6.SS3.p1.1)\.
- \[13\]M\. Cooket al\.\(2004\)Universality in elementary cellular automata\.Complex systems15\(1\),pp\. 1–40\.Cited by:[§3\.2](https://arxiv.org/html/2605.26099#S3.SS2.p2.2),[§4](https://arxiv.org/html/2605.26099#S4.p2.4)\.
- \[14\]T\. Dao and A\. Gu\(2024\)Transformers are ssms: generalized models and efficient algorithms through structured state space duality\.arXiv preprint arXiv:2405\.21060\.Cited by:[§1](https://arxiv.org/html/2605.26099#S1.p2.1),[§2](https://arxiv.org/html/2605.26099#S2.p1.1),[§3\.1](https://arxiv.org/html/2605.26099#S3.SS1.p2.8),[§3\.1](https://arxiv.org/html/2605.26099#S3.SS1.p5.1)\.
- \[15\]T\. Dao\(2024\)Flashattention\-2: faster attention with better parallelism and work partitioning\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 35549–35562\.Cited by:[Figure 6](https://arxiv.org/html/2605.26099#S6.F6),[Figure 6](https://arxiv.org/html/2605.26099#S6.F6.6.3.3)\.
- \[16\]S\. De, S\. L\. Smith, A\. Fernando, A\. Botev, G\. Cristian\-Muraru, A\. Gu, R\. Haroun, L\. Berrada, Y\. Chen, S\. Srinivasan,et al\.\(2024\)Griffin: mixing gated linear recurrences with local attention for efficient language models\.arXiv preprint arXiv:2402\.19427\.Cited by:[§1](https://arxiv.org/html/2605.26099#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.26099#S3.SS1.p5.1)\.
- \[17\]M\. Dehghani, S\. Gouws, O\. Vinyals, J\. Uszkoreit, and Ł\. Kaiser\(2018\)Universal transformers\.arXiv preprint arXiv:1807\.03819\.Cited by:[§1](https://arxiv.org/html/2605.26099#S1.p5.1),[§2](https://arxiv.org/html/2605.26099#S2.p5.1),[§5](https://arxiv.org/html/2605.26099#S5.p4.1)\.
- \[18\]X\. Dong, Y\. Fu, S\. Diao, W\. Byeon, Z\. Chen, A\. S\. Mahabaleshwarkar, S\. Liu, M\. Van Keirsbilck, M\. Chen, Y\. Suhara,et al\.\(2024\)Hymba: a hybrid\-head architecture for small language models\.arXiv preprint arXiv:2411\.13676\.Cited by:[§1](https://arxiv.org/html/2605.26099#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.26099#S3.SS1.p5.1)\.
- \[19\]M\. Elbayad, J\. Gu, E\. Grave, and M\. Auli\(2019\)Depth\-adaptive transformer\.arXiv preprint arXiv:1910\.10073\.Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p5.1)\.
- \[20\]S\. Eyuboglu, R\. Ehrlich, S\. Arora, N\. Guha, D\. Zinsley, E\. Liu, W\. Tennien, A\. Rudra, J\. Zou, A\. Mirhoseini,et al\.\(2025\)Cartridges: lightweight and general\-purpose long context representations via self\-study\.arXiv preprint arXiv:2506\.06266\.Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p2.1)\.
- \[21\]T\. Ge, J\. Hu, L\. Wang, X\. Wang, S\. Chen, and F\. Wei\(2023\)In\-context autoencoder for context compression in a large language model\.arXiv preprint arXiv:2307\.06945\.Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p2.1)\.
- \[22\]J\. Geiping, S\. M\. McLeish, N\. Jain, J\. Kirchenbauer, S\. Singh, B\. R\. Bartoldson, B\. Kailkhura, A\. Bhatele, and T\. Goldstein\(2025\)Scaling up test\-time compute with latent reasoning: a recurrent depth approach\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p5.1),[§6\.3](https://arxiv.org/html/2605.26099#S6.SS3.p4.2),[§7](https://arxiv.org/html/2605.26099#S7.p1.1)\.
- \[23\]A\. Graves\(2016\)Adaptive computation time for recurrent neural networks\.arXiv preprint arXiv:1603\.08983\.Cited by:[§1](https://arxiv.org/html/2605.26099#S1.p5.1),[§2](https://arxiv.org/html/2605.26099#S2.p5.1),[§5](https://arxiv.org/html/2605.26099#S5.p4.1)\.
- \[24\]Y\. Gu, Q\. Hu, S\. Yang, H\. Xi, J\. Chen, S\. Han, and H\. Cai\(2025\)Jet\-Nemotron: efficient language model with post neural architecture search\.arXiv preprint arXiv:2508\.15884\.Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p1.1),[§6\.3](https://arxiv.org/html/2605.26099#S6.SS3.p3.1),[§6\.4](https://arxiv.org/html/2605.26099#S6.SS4.p2.6),[§6](https://arxiv.org/html/2605.26099#S6.p1.3),[§6](https://arxiv.org/html/2605.26099#S6.p2.8)\.
- \[25\]D\. O\. Hebb\(2005\)The organization of behavior: a neuropsychological theory\.Psychology press\.Cited by:[§3\.1](https://arxiv.org/html/2605.26099#S3.SS1.p2.8)\.
- \[26\]C\. Hsieh, S\. Sun, S\. Kriman, S\. Acharya, D\. Rekesh, F\. Jia, Y\. Zhang, and B\. Ginsburg\(2024\)RULER: what’s the real context size of your long\-context language models?\.InFirst Conference on Language Modeling,Cited by:[§6\.3](https://arxiv.org/html/2605.26099#S6.SS3.p1.1)\.
- \[27\]S\. Jelassi, D\. Brandfonbrener, S\. M\. Kakade, and E\. Malach\(2024\)Repeat after me: transformers are better than state space models at copying\.arXiv preprint arXiv:2402\.01032\.Cited by:[§1](https://arxiv.org/html/2605.26099#S1.p3.1),[§2](https://arxiv.org/html/2605.26099#S2.p1.1),[§4](https://arxiv.org/html/2605.26099#S4.p5.8)\.
- \[28\]A\. Kabra, Y\. Yin, A\. Gong, K\. Stankeviciute, D\. Go, J\. Lee, K\. Z\. Luo, C\. P\. Gomes, and K\. Q\. Weinberger\(2026\)Learning from synthetic data improves multi\-hop reasoning\.InInternational Conference on Learning Representations,Cited by:[§6\.3](https://arxiv.org/html/2605.26099#S6.SS3.p1.1)\.
- \[29\]A\. Katharopoulos, A\. Vyas, N\. Pappas, and F\. Fleuret\(2020\)Transformers are rnns: fast autoregressive transformers with linear attention\.InInternational conference on machine learning,pp\. 5156–5165\.Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p1.1)\.
- \[30\]K\. Lin, C\. Snell, Y\. Wang, C\. Packer, S\. Wooders, I\. Stoica, and J\. E\. Gonzalez\(2025\)Sleep\-time compute: beyond inference scaling at test\-time\.arXiv preprint arXiv:2504\.13171\.Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p6.1)\.
- \[31\]B\. Liu, J\. T\. Ash, S\. Goel, A\. Krishnamurthy, and C\. Zhang\(2022\)Transformers learn shortcuts to automata\.arXiv preprint arXiv:2210\.10749\.Cited by:[§7](https://arxiv.org/html/2605.26099#S7.p2.1)\.
- \[32\]Y\. Liu, K\. Preechakul, K\. Kuwaranancharoen, and Y\. Bai\(2025\)The serial scaling hypothesis\.arXiv preprint arXiv:2507\.12549\.Cited by:[§7](https://arxiv.org/html/2605.26099#S7.p2.1)\.
- \[33\]J\. L\. McClelland, B\. L\. McNaughton, and R\. C\. O’Reilly\(1995\)Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory\.\.Psychological review102\(3\),pp\. 419\.Cited by:[§1](https://arxiv.org/html/2605.26099#S1.p4.1)\.
- \[34\]S\. McLeish, A\. Li, J\. Kirchenbauer, D\. S\. Kalra, B\. R\. Bartoldson, B\. Kailkhura, A\. Schwarzschild, J\. Geiping, T\. Goldstein, and M\. Goldblum\(2025\)Teaching pretrained language models to think deeper with retrofitted recurrence\.arXiv preprint arXiv:2511\.07384\.Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p5.1),[§6\.3](https://arxiv.org/html/2605.26099#S6.SS3.p4.2),[§6](https://arxiv.org/html/2605.26099#S6.p2.8),[§7](https://arxiv.org/html/2605.26099#S7.p1.1)\.
- \[35\]W\. Merrill and A\. Sabharwal\(2025\)A little depth goes a long way: the expressive power of log\-depth transformers\.arXiv preprint arXiv:2503\.03961\.Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p5.1)\.
- \[36\]I\. Momennejad, A\. R\. Otto, N\. D\. Daw, and K\. A\. Norman\(2018\)Offline replay supports planning in human reinforcement learning\.elife7,pp\. e32548\.Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p6.1)\.
- \[37\]T\. Neary and D\. Woods\(2006\)P\-completeness of cellular automaton rule 110\.InInternational Colloquium on Automata, Languages, and Programming,pp\. 132–143\.Cited by:[§3\.2](https://arxiv.org/html/2605.26099#S3.SS2.p2.2)\.
- \[38\]S\. Noroozizadeh, V\. Nagarajan, E\. Rosenfeld, and S\. Kumar\(2025\)Deep sequence models tend to memorize geometrically; it is unclear why\.arXiv preprint arXiv:2510\.26745\.Cited by:[§6\.2](https://arxiv.org/html/2605.26099#S6.SS2.p1.3)\.
- \[39\]NVIDIA\(2025\)NVIDIA Nemotron Nano 2: an accurate and efficient hybrid Mamba\-Transformer reasoning model\.arXiv preprint arXiv:2508\.14444\.Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p1.1)\.
- \[40\]H\. Prairie, Z\. Novack, T\. Berg\-Kirkpatrick, and D\. Y\. Fu\(2026\)Parcae: scaling laws for stable looped language models\.arXiv preprint arXiv:2604\.12946\.Cited by:[§2](https://arxiv.org/html/2605.26099#S2.p5.1),[§7](https://arxiv.org/html/2605.26099#S7.p1.1)\.
- \[41\]B\. Rasch and J\. Born\(2013\)About sleep’s role in memory\.Physiological reviews\.Cited by:[§1](https://arxiv.org/html/2605.26099#S1.p4.1)\.
- \[42\] L\. Ren, Y\. Liu, Y\. Lu, Y\. Shen, C\. Liang, and W\. Chen \(2024\) Samba: simple hybrid state space models for efficient unlimited context language modeling\. arXiv preprint arXiv:2406\.07522\. Cited by: [§1](https://arxiv.org/html/2605.26099#S1.p2.1), [§1](https://arxiv.org/html/2605.26099#S1.p3.1), [§3\.1](https://arxiv.org/html/2605.26099#S3.SS1.p5.1), [§4](https://arxiv.org/html/2605.26099#S4.p1.1), [§6\.4](https://arxiv.org/html/2605.26099#S6.SS4.p1.4)\.
- \[43\] I\. Schlag, K\. Irie, and J\. Schmidhuber \(2021\) Linear transformers are secretly fast weight programmers\. In International Conference on Machine Learning, pp\. 9355–9366\. Cited by: [§1](https://arxiv.org/html/2605.26099#S1.p2.1), [§2](https://arxiv.org/html/2605.26099#S2.p1.1), [§3\.1](https://arxiv.org/html/2605.26099#S3.SS1.p2.8), [§3\.1](https://arxiv.org/html/2605.26099#S3.SS1.p5.1)\.
- \[44\] A\. Schwarzschild, E\. Borgnia, A\. Gupta, F\. Huang, U\. Vishkin, M\. Goldblum, and T\. Goldstein \(2021\) Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks\. Advances in Neural Information Processing Systems 34, pp\. 6695–6706\. Cited by: [§2](https://arxiv.org/html/2605.26099#S2.p5.1)\.
- \[45\] K\. Schwethelm, D\. Rueckert, and G\. Kaissis \(2026\) How much is one recurrence worth? iso\-depth scaling laws for looped language models\. arXiv preprint arXiv:2604\.21106\. Cited by: [§2](https://arxiv.org/html/2605.26099#S2.p5.1)\.
- \[46\] C\. Snell, D\. Klein, and R\. Zhong \(2022\) Learning by distilling context\. External Links: 2209\.15189, [Link](https://arxiv.org/abs/2209.15189)\. Cited by: [§2](https://arxiv.org/html/2605.26099#S2.p3.1)\.
- \[47\] J\. Tack, J\. Kim, E\. Mitchell, J\. Shin, Y\. W\. Teh, and J\. R\. Schwarz \(2024\) Online adaptation of language models with a memory of amortized contexts\. External Links: 2403\.04317, [Link](https://arxiv.org/abs/2403.04317)\. Cited by: [§2](https://arxiv.org/html/2605.26099#S2.p3.1)\.
- \[48\] A\. Tandon, K\. Dalal, X\. Li, D\. Koceja, M\. Rød, S\. Buchanan, X\. Wang, J\. Leskovec, S\. Koyejo, T\. Hashimoto, et al\. \(2025\) End\-to\-end test\-time training for long context\. arXiv preprint arXiv:2512\.23675\. Cited by: [§2](https://arxiv.org/html/2605.26099#S2.p4.1)\.
- \[49\] Qwen Team \(2026\-02\) Qwen3\.5: accelerating productivity with native multimodal agents\. External Links: [Link](https://qwen.ai/blog?id=qwen3.5)\. Cited by: [§1](https://arxiv.org/html/2605.26099#S1.p2.1)\.
- \[50\] E\. C\. Tolman \(1948\) Cognitive maps in rats and men\. Psychological Review 55 \(4\), pp\. 189\. Cited by: [§2](https://arxiv.org/html/2605.26099#S2.p6.1)\.
- \[51\] A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. In Advances in Neural Information Processing Systems\. Cited by: [§1](https://arxiv.org/html/2605.26099#S1.p1.1), [§3\.1](https://arxiv.org/html/2605.26099#S3.SS1.p1.2)\.
- \[52\] J\. Wang, D\. Paliotta, A\. May, A\. M\. Rush, and T\. Dao \(2024\) The mamba in the llama: distilling and accelerating hybrid models\. Advances in Neural Information Processing Systems 37, pp\. 62432–62457\. Cited by: [§6\.4](https://arxiv.org/html/2605.26099#S6.SS4.p2.6)\.
- \[53\] S\. Yang, J\. Kautz, and A\. Hatamizadeh \(2024\) Gated delta networks: improving mamba2 with delta rule\. arXiv preprint arXiv:2412\.06464\. Cited by: [§1](https://arxiv.org/html/2605.26099#S1.p2.1), [§2](https://arxiv.org/html/2605.26099#S2.p1.1), [§3\.1](https://arxiv.org/html/2605.26099#S3.SS1.p5.1)\.
- \[54\] S\. Yang, B\. Wang, Y\. Shen, R\. Panda, and Y\. Kim \(2023\) Gated linear attention transformers with hardware\-efficient training\. arXiv preprint arXiv:2312\.06635\. Cited by: [§2](https://arxiv.org/html/2605.26099#S2.p1.1)\.
- \[55\] S\. Yang, B\. Wang, Y\. Zhang, Y\. Shen, and Y\. Kim \(2024\) Parallelizing linear transformers with the delta rule over sequence length\. Advances in Neural Information Processing Systems 37, pp\. 115491–115522\. Cited by: [§2](https://arxiv.org/html/2605.26099#S2.p1.1)\.
- \[56\] Z\. Zhang, X\. Liu, H\. Cheng, H\. Sun, C\. Xu, and J\. Gao \(2026\) Training large reasoning models efficiently via progressive thought encoding\. arXiv preprint arXiv:2602\.16839\. Cited by: [§2](https://arxiv.org/html/2605.26099#S2.p4.1)\.
- \[57\] Y\. Zhou, H\. Liu, Z\. Chen, Y\. Tian, and B\. Chen \(2025\) GSM\-Infinite: how do your LLMs behave over infinitely increasing reasoning complexity and context length? In ICML 2025 Workshop on Long\-Context Foundation Models\. Cited by: [§6\.3](https://arxiv.org/html/2605.26099#S6.SS3.p1.1), [§6](https://arxiv.org/html/2605.26099#S6.p1.3)\.
- \[58\] R\. Zhu, Z\. Wang, K\. Hua, T\. Zhang, Z\. Li, H\. Que, B\. Wei, Z\. Wen, F\. Yin, H\. Xing, et al\. \(2025\) Scaling latent reasoning via looped language models\. arXiv preprint arXiv:2510\.25741\. Cited by: [§2](https://arxiv.org/html/2605.26099#S2.p5.1), [§6\.3](https://arxiv.org/html/2605.26099#S6.SS3.p3.1), [§6\.3](https://arxiv.org/html/2605.26099#S6.SS3.p4.2), [§6](https://arxiv.org/html/2605.26099#S6.p1.3), [§6](https://arxiv.org/html/2605.26099#S6.p2.8)\.
