HRM-Text: Efficient Pretraining Beyond Scaling

arXiv cs.CL Papers

Summary

HRM-Text introduces a Hierarchical Recurrent Model that decouples computation into slow and fast layers, enabling efficient pretraining from scratch on only 40 billion tokens and a $1,500 budget, achieving competitive performance with larger models.

arXiv:2605.20613v1 Announce Type: new Abstract: The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.

Cached at: 05/21/26, 06:34 AM

# HRM-Text: Efficient Pretraining Beyond Scaling
Source: [https://arxiv.org/html/2605.20613](https://arxiv.org/html/2605.20613)
Guan Wang1,∗,†, Changling Liu1,∗, Chenyu Wang2, Cai Zhou2, Yuhao Sun1, Yifei Wu1, Shuai Zhen1, Luca Scimeca1, Yasin Abbasi Yadkori1,†

1 Sapient Intelligence, 2 MIT

###### Abstract

The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and a $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2–7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.

†Corresponding author. ∗Equal contribution. Contact: research@sapient.inc. Code available at: [github.com/sapientinc/HRM-Text](https://github.com/sapientinc/HRM-Text)

![Refer to caption](https://arxiv.org/html/2605.20613v1/x1.png)

Figure 1: Pretraining efficiency. Trained from scratch in 1.9 days on 16 GPUs, HRM-Text 1B achieved performance competitive with substantially larger 2–7B foundation models while utilizing up to 432× less compute and 900× fewer training tokens.

## 1 Introduction

The remarkable success of large language models (LLMs) is currently driven by a monolithic recipe: massive, multi-stage pipelines that begin with broad unsupervised pretraining over internet-scale raw text. While undeniably effective, this brute-force scaling paradigm is highly inefficient in data-limited regimes. Massive compute is spent predicting prompt-like or task-irrelevant text simply to build generalized representations [37](https://arxiv.org/html/2605.20613#bib.bib26), [31](https://arxiv.org/html/2605.20613#bib.bib23), [63](https://arxiv.org/html/2605.20613#bib.bib38). Consequently, this extreme computational barrier has largely locked the broader research community out of foundational pretraining exploration. The prevailing assumption is that without immense compute clusters and trillions of tokens, investigating new architectures or training from scratch is futile.

This brute-force data hunger stands in stark contrast to human intelligence, which can grasp governing rules and perform heuristic-guided search from only a few examples. In our previous work, we introduced the Hierarchical Recurrent Model (HRM), a dual-timescale architecture inspired by the functional organization of the biological frontoparietal loop [69](https://arxiv.org/html/2605.20613#bib.bib18). By decoupling deliberation into a slow-evolving strategic layer and a fast-evolving execution layer, HRM provided a structural inductive bias that helped avoid local stagnation and successfully guided symbolic search on combinatorial tasks.

However, scaling recurrent architectures to the open-ended complexities of language modeling introduces severe gradient-instability risks [6](https://arxiv.org/html/2605.20613#bib.bib76), [13](https://arxiv.org/html/2605.20613#bib.bib74), [34](https://arxiv.org/html/2605.20613#bib.bib75), [78](https://arxiv.org/html/2605.20613#bib.bib77). A structural prior alone is insufficient; achieving competitive open-domain performance requires a holistic co-design. In this paper, we demonstrate that architecture and training methods are profoundly important once again. We explore two major, synergistic directions to realize this sample-efficient engine:

- Architectural Exploration: To achieve deep computation without a proportional explosion in parameter counts, we build upon HRM’s modular, multi-timescale recurrence. The fast $L$-module performs local iterative refinement, while the slow $H$-module maintains stable semantic context across cycles [69](https://arxiv.org/html/2605.20613#bib.bib18). To make this deep recurrence mathematically viable for language, we introduce stabilization techniques like MagicNorm and warmup deep credit assignment, which bound forward activation variance while maintaining backward optimization stability [71](https://arxiv.org/html/2605.20613#bib.bib70), [44](https://arxiv.org/html/2605.20613#bib.bib71), [62](https://arxiv.org/html/2605.20613#bib.bib28).
- Objective Exploration: We challenge the dogma of autoregressive pretraining on raw text. Since models are primarily used for conditional generation at inference time, we pretrain HRM-Text directly from scratch on instruction-response pairs [70](https://arxiv.org/html/2605.20613#bib.bib39), [55](https://arxiv.org/html/2605.20613#bib.bib40), [46](https://arxiv.org/html/2605.20613#bib.bib44). We optimize a task-completion objective, computing the negative log-likelihood loss exclusively over the response: $-\log P(x_a \mid x_q)$ [61](https://arxiv.org/html/2605.20613#bib.bib17), [53](https://arxiv.org/html/2605.20613#bib.bib37), [55](https://arxiv.org/html/2605.20613#bib.bib40). We pair this with a PrefixLM attention mask, which allows full bidirectional (encoder-like) attention across the instruction tokens while preserving standard causal generation for the response [45](https://arxiv.org/html/2605.20613#bib.bib35), [17](https://arxiv.org/html/2605.20613#bib.bib36), [53](https://arxiv.org/html/2605.20613#bib.bib37), [63](https://arxiv.org/html/2605.20613#bib.bib38).

When these two directions are combined, the result is an empirical existence proof that defies the current scaling dogma. Trained from scratch on a low budget of only 40B unique tokens, HRM-Text achieves strong performance on most benchmarks against contemporary open models like Llama, Qwen, Gemma, OLMo, Ouro, and Huginn [48](https://arxiv.org/html/2605.20613#bib.bib19), [72](https://arxiv.org/html/2605.20613#bib.bib21), [64](https://arxiv.org/html/2605.20613#bib.bib20), [50](https://arxiv.org/html/2605.20613#bib.bib85), [77](https://arxiv.org/html/2605.20613#bib.bib67), [23](https://arxiv.org/html/2605.20613#bib.bib7). Strikingly, it reaches this performance neighborhood using roughly $100\text{-}900\times$ fewer training tokens and $96\text{-}432\times$ less estimated training compute than these baselines, as shown in [Figure 1](https://arxiv.org/html/2605.20613#S0.F1) and [Table 4](https://arxiv.org/html/2605.20613#S3.T4).

We do not present HRM-Text as the final or optimal language model, but rather as proof that specific structural priors and targeted training objectives can radically alter the compute-to-performance ratio. Because the entry price is vastly reduced, this methodology democratizes foundational AI research. Pretraining from scratch is accessible again, and we invite the community to join us in exploring how far smart architectures and focused objectives can go.

## 2 Methods

![Refer to caption](https://arxiv.org/html/2605.20613v1/x2.png)

Figure 2: HRM-Text architecture. (a) Dual-timescale recurrent design comprising L and H modules. (b) L/H module internals featuring MagicNorm: PreNorm blocks followed by a final norm. (c) Sigmoid-gated multi-head self-attention. (d) PrefixLM mask enabling bidirectional attention on the instruction.

HRM-Text builds upon an improved HRM architecture, featuring a dual-timescale recurrence [69](https://arxiv.org/html/2605.20613#bib.bib18). The forward pass is initialized with a high-level state, $z_H^0$, derived from the input token embeddings, alongside a fixed low-level state, $z_L^0$. The core processing sequence consists of two high-level cycles. Each cycle executes three fast $L$-module updates followed by a single slow $H$-module update. Token logits are generated by applying a linear head to the final $H$-module state. We employ a warmup deep credit assignment strategy: gradients are initially backpropagated through only the final two recurrent steps, expanding to the final five steps as training progresses.
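The nested schedule described above (two high-level cycles, each with three fast updates followed by one slow update) can be sketched as a plain loop. This is only an illustration of the update order: the function name `hrm_forward`, the module signatures, and the way each module receives the other module's state are assumptions, not details confirmed by the text.

```python
from typing import Callable, List, Tuple

State = List[float]  # toy stand-in for a tensor of hidden states


def hrm_forward(
    z_h0: State,
    z_l0: State,
    l_module: Callable[[State, State], State],
    h_module: Callable[[State, State], State],
    h_cycles: int = 2,
    l_steps: int = 3,
) -> Tuple[State, List[str]]:
    """Run the nested schedule: each H cycle is l_steps L updates, then one H update."""
    z_h, z_l = z_h0, z_l0
    trace: List[str] = []
    for _ in range(h_cycles):
        for _ in range(l_steps):
            # Assumption: the fast module sees the current slow state as context.
            z_l = l_module(z_l, z_h)
            trace.append("L")
        # Assumption: the slow module reads the refined fast state.
        z_h = h_module(z_h, z_l)
        trace.append("H")
    # A linear head would map the final slow state to token logits.
    return z_h, trace


# Toy modules (hypothetical): simple averaging updates on float lists.
mix = lambda a, b: [0.5 * x + 0.5 * y for x, y in zip(a, b)]
final_h, trace = hrm_forward([1.0, 0.0], [0.0, 0.0], mix, mix)
print("".join(trace))
```

With the default settings the trace contains eight module steps in total, six fast and two slow.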

Internally, both the $H$ and $L$ recurrent modules are structured using MagicNorm. Additionally, we utilize parameterless RMSNorm (omitting the learnable $\gamma$ parameter) [74](https://arxiv.org/html/2605.20613#bib.bib87), SwiGLU activation functions [58](https://arxiv.org/html/2605.20613#bib.bib88), Rotary Position Embeddings (RoPE) [60](https://arxiv.org/html/2605.20613#bib.bib89), and a sigmoid-gated self-attention mechanism [52](https://arxiv.org/html/2605.20613#bib.bib90).

In contrast to standard autoregressive pretraining on raw text, we optimize a task-completion objective. The model is pretrained directly on instruction-response pairs $(x_q, x_a)$ from scratch using a negative log-likelihood (NLL) loss computed exclusively over the response, $-\log P(x_a \mid x_q)$. This objective is naturally paired with a PrefixLM attention mask, enabling full bidirectional attention across the instruction tokens.

In the following sections, we detail the specific mechanics that enable HRM-Text’s extreme efficiency. Section [2.1](https://arxiv.org/html/2605.20613#S2.SS1) delves into our novel stabilization techniques, while Section [2.2](https://arxiv.org/html/2605.20613#S2.SS2) explores the task-completion pretraining objective and PrefixLM masking strategy.

### 2.1 Scaling to language with recurrence

#### 2.1.1 Stabilization via MagicNorm

Although the original HRM demonstrated strong performance on symbolic tasks, scaling recurrent architectures to language modeling introduces severe gradient-instability risks. Transformer design already involves a compromise in the placement of normalization layers [71](https://arxiv.org/html/2605.20613#bib.bib70), [44](https://arxiv.org/html/2605.20613#bib.bib71); recurrence amplifies this compromise because the same transformation is repeatedly applied over many steps.

PostNorm [67](https://arxiv.org/html/2605.20613#bib.bib25) places the normalization outside the residual branch:

$$h_l = \text{Norm}\left(h_{l-1} + \text{Sublayer}(h_{l-1})\right)$$

This effectively bounds activation variance and can improve expressivity, but it disrupts the clean identity path and can lead to vanishing gradients in deeper networks [44](https://arxiv.org/html/2605.20613#bib.bib71).

PreNorm places the normalization inside the residual branch:

$$h_l = h_{l-1} + \text{Sublayer}(\text{Norm}(h_{l-1}))$$

This maintains a direct identity path, $h_L = h_0 + \sum_{l=1}^{L} \text{Sublayer}(\cdot)$, allowing gradients to flow more directly to early layers. However, the unnormalized residual accumulation can cause hidden-state variance to grow with depth, which may lead to representation collapse or reduced performance relative to PostNorm.

MagicNorm: To address this tradeoff in recurrent models, we introduce MagicNorm, which exploits the asymmetry between the forward and backward computational horizons induced by truncated backpropagation through time (TBPTT).

Let $N$ denote the total number of recurrent forward steps and $K$ denote the truncated backward horizon, where $K \ll N$. In MagicNorm, each recurrent module is composed of $L$ internal PreNorm blocks, but is capped with a final normalization layer at its exit:

$$z_n = \text{Norm}\left(z_{n-1} + \sum_{l=1}^{L} \text{Sublayer}_l(\text{Norm}(\cdot))\right)$$

During the *forward pass*, the recurrent state $z$ is subjected to $N$ module-level normalization operations. Because these norms sit directly on the main recurrent pathway, they bound activation variance at the end of every recurrent step. This prevents the unbounded variance growth of pure PreNorm and gives the recurrent core PostNorm-like forward stability.

Conversely, during the *backward pass*, the truncated gradient horizon means the error signal passes through the module-level normalization only $K$ times. Within that same horizon, the gradient also flows through $L$ internal PreNorm identity connections. Since $K$ is small relative to the full recurrence depth $N$, MagicNorm behaves more like a stable PreNorm architecture during optimization.
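To make the forward-stability argument concrete, the following minimal sketch contrasts a pure PreNorm recurrence with a MagicNorm-style step that caps the PreNorm blocks with a module-level norm. It uses parameterless RMSNorm as described earlier; the toy sublayer, the function names, and the step count are illustrative assumptions, and the sketch says nothing about the backward pass.

```python
import math
from typing import Callable, List

Vec = List[float]


def rms(x: Vec) -> float:
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rms_norm(x: Vec, eps: float = 1e-6) -> Vec:
    """Parameterless RMSNorm: rescale to unit RMS, with no learnable gain."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]


def prenorm_stack(z: Vec, sublayers: List[Callable[[Vec], Vec]]) -> Vec:
    """Internal PreNorm blocks: h <- h + Sublayer(Norm(h))."""
    h = z
    for f in sublayers:
        h = [a + b for a, b in zip(h, f(rms_norm(h)))]
    return h


def magicnorm_step(z: Vec, sublayers: List[Callable[[Vec], Vec]]) -> Vec:
    """One recurrent step: PreNorm blocks capped by a final module-level norm."""
    return rms_norm(prenorm_stack(z, sublayers))


# Toy sublayer (hypothetical): a fixed elementwise nonlinearity, used twice.
toy = [lambda h: [math.tanh(2.0 * v) for v in h]] * 2
z_pre = z_magic = [1.0, -0.5, 0.25, 2.0]
for _ in range(50):
    z_pre = prenorm_stack(z_pre, toy)       # pure PreNorm recurrence
    z_magic = magicnorm_step(z_magic, toy)  # MagicNorm-style recurrence
print(rms(z_pre), rms(z_magic))
```

In this toy setting the pure PreNorm state keeps growing with the number of recurrent steps, while the capped state stays at roughly unit RMS after every step.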

#### 2.1.2 Warmup deep credit assignment

The original HRM uses a fixed 1-step gradient strategy, backpropagating only through the last two recurrent steps (last $H$ and last $L$). We extend this approach with warmup deep credit assignment. The schedule is motivated by temporal-curriculum principles: early optimization is restricted to short credit-assignment paths, and longer paths are introduced only after the model has reached a more stable regime. This design is also consistent with biological accounts of temporal learning, where local traces can support delayed credit assignment [35](https://arxiv.org/html/2605.20613#bib.bib82), reward-predictive signals can shift from reward-proximal events to earlier cues [4](https://arxiv.org/html/2605.20613#bib.bib79), and developmental curricula can improve sequence learning by exposing learners to shorter-range structure before longer-range dependencies [19](https://arxiv.org/html/2605.20613#bib.bib83).

Operationally, we dynamically adjust the backward gradient horizon, $K$. During early pretraining, we compute gradients through only the last two recurrent steps ($K=2$), then linearly warm up the horizon to the last five steps ($K=5$). This progressive deepening allows the model to exploit longer recurrent computation while reducing exposure to the optimization pathologies that often arise from long gradient paths at initialization. Because the warmup phase backpropagates through fewer recurrent steps than the final setting, it also reduces the average backward-pass computation and accelerates early training.
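A linear warmup of the backward horizon from 2 to 5 steps could be scheduled as in the sketch below. The start and end points of the warmup (`warmup_start`, `warmup_end`) and the flooring to an integer are assumptions for illustration; the text does not state when the warmup begins or ends.

```python
def backward_horizon(step: int, warmup_start: int, warmup_end: int,
                     k_min: int = 2, k_max: int = 5) -> int:
    """Linearly warm the truncated-backpropagation horizon K from k_min to k_max."""
    if step <= warmup_start:
        return k_min
    if step >= warmup_end:
        return k_max
    # Integer arithmetic keeps the schedule exact and floors to a whole step count.
    progress = (step - warmup_start) * (k_max - k_min)
    return k_min + progress // (warmup_end - warmup_start)


# Hypothetical schedule: hold K=2 until step 1000, reach K=5 at step 4000.
for s in (0, 1000, 2000, 3000, 4000):
    print(s, backward_horizon(s, warmup_start=1000, warmup_end=4000))
```

In a training loop, the recurrent states older than the last K steps would be detached from the computation graph so that gradients stop there.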

### 2.2 Task-completion objective and PrefixLM

The dominant paradigm for training foundation models relies on a resource-intensive, multi-stage pipeline. From T5 through modern large language models [53](https://arxiv.org/html/2605.20613#bib.bib37), training typically begins with broad unsupervised pretraining and is followed by higher-quality mid-training.

In the pretraining phase, models are trained on internet-scale raw corpora to learn general language representations. In the mid-training (or annealing) phase, the model is refined on high-quality text, usually instruction-like data. In both phases, the model optimizes an NLL objective over all tokens.

While effective, this approach can be inefficient in the data- and resource-limited regime. Broad raw-text pretraining consumes most of the compute and data, and much of the token-level loss is spent on predicting prompt-like or task-irrelevant text. Yet at inference time, models are applied primarily to conditional generation: given a query or instruction, they must produce an appropriate response.

To improve sample efficiency, HRM-Text omits broad raw-text pretraining and trains exclusively on instruction-response pairs from scratch. Given an example containing an instruction and response, $x = (x_q, x_a)$, we optimize the NLL of the response conditioned on the instruction:

$$-\log P(x_a \mid x_q)$$

By not predicting the instruction tokens, the model concentrates its parameter updates on generating accurate responses. [Figure 3](https://arxiv.org/html/2605.20613#S2.F3)(a) illustrates this effect. Although the total loss is comparable with and without the task-completion objective, the error associated with the response component is substantially lower.
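The difference between the response-only objective and a full-sequence causal loss comes down to which token positions contribute to the sum. The sketch below assumes per-token log-probabilities have already been computed by some model; the function names and the toy probabilities are hypothetical.

```python
import math
from typing import List


def response_only_nll(token_logprobs: List[float], num_instruction_tokens: int) -> float:
    """Sum -log p over response tokens only; instruction tokens contribute no loss."""
    return -sum(token_logprobs[num_instruction_tokens:])


def full_sequence_nll(token_logprobs: List[float]) -> float:
    """Baseline: standard causal LM loss over every token in the sequence."""
    return -sum(token_logprobs)


# Toy example: two instruction tokens followed by two response tokens.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.5)]
print(response_only_nll(logprobs, num_instruction_tokens=2))
print(full_sequence_nll(logprobs))
```

Only the last two positions enter the response-only loss here, so gradient signal is spent entirely on the answer tokens.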

Furthermore, this single-stage conditional objective naturally aligns with a PrefixLM attention mask [53](https://arxiv.org/html/2605.20613#bib.bib37). Because the model is never required to autoregressively predict the instruction $x_q$, we remove the causal masking over the instruction segment: all instruction tokens attend to one another bidirectionally, while standard causal masking is maintained over the response sequence. This gives HRM-Text an encoder–decoder-like separation inside a decoder-style implementation. The instruction segment is first integrated as a fully visible context, analogous to an encoder-side representation, while the response segment is generated autoregressively, analogous to a decoder.
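A PrefixLM mask of this kind can be written down directly from the rule in the paragraph above. This is a minimal sketch with a hypothetical helper name; `True` means the query position may attend to the key position.

```python
from typing import List


def prefix_lm_mask(prefix_len: int, total_len: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        # Causal rule (j <= i) plus full visibility of the instruction prefix.
        [j <= i or j < prefix_len for j in range(total_len)]
        for i in range(total_len)
    ]


# Two instruction tokens followed by two response tokens.
for row in prefix_lm_mask(prefix_len=2, total_len=4):
    print("".join("x" if allowed else "." for allowed in row))
```

Instruction rows see the whole instruction but none of the response, while response rows remain lower-triangular.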

[Figure 3](https://arxiv.org/html/2605.20613#S2.F3)(b) shows that PrefixLM leads to higher attention softmax entropy, indicating attention over a more diverse set of tokens. [Figure 3](https://arxiv.org/html/2605.20613#S2.F3)(c) shows that causal attention is more localized, whereas PrefixLM attention is more global and diverse. Together, the response-only conditional loss and PrefixLM attention improve sample efficiency in the data- and compute-restricted regime.

![Refer to caption](https://arxiv.org/html/2605.20613v1/x3.png)

Figure 3: Task-completion and PrefixLM improve response modeling. (a) Compared with full causal language modeling $P(x)$, response-only training $P(x_a \mid x_q)$ lowers response-token NLL. PrefixLM further improves response loss. (b) PrefixLM increases layerwise attention entropy relative to causal masking, suggesting broader use of the prompt. (c) Attention maps illustrate the qualitative difference: causal attention remains mostly local and triangular, while PrefixLM enables global bidirectional interactions among prompt tokens.

## 3 Results

The central question of this paper is whether a model trained from random initialization under a small pretraining budget can reach a meaningful open-model performance regime. We approach this question as a small-budget design exploration: first, whether architectural choices can improve the use of fixed training compute, and second, whether the objective and input structure can increase the yield of each training example. Finally, we compare HRM-Text with contemporary fully open and open-weight models to quantify its efficiency relative to current pretraining practice, and analyze whether the recurrent architecture increases effective depth. Training details for all models are provided in [Section 4](https://arxiv.org/html/2605.20613#S4).

Across these experiments, HRM-Text is trained from scratch on the task-formatted mixture described in [Section 4.1](https://arxiv.org/html/2605.20613#S4.SS1), using only 40B unique tokens. We report all results from a single HRM-Text checkpoint.

### 3.1 Architecture efficiency under matched training compute

The first part of this exploration asks how much architecture design can improve the use of a fixed training budget. We test this by comparing standard Transformers, larger matched-FLOPs Transformers, Looped Transformers [16](https://arxiv.org/html/2605.20613#bib.bib63), RINS [3](https://arxiv.org/html/2605.20613#bib.bib91), and HRM under matched training compute.

Table 1: Training FLOPs-matched comparison of recurrent architectures and Transformer models. Bold denotes the highest score in each column, and underline denotes the second highest.

[Table 1](https://arxiv.org/html/2605.20613#S3.T1) compares training-FLOPs-matched recurrent architectures (including HRM, looped Transformers, and RINS) with standard Transformers. For recursive models, the value in the recursions column indicates total compute per forward pass, expressed as a multiple of the compute required if recurrence is not present. For example, H2L3 denotes 2 outer H cycles, with 3 L steps inside each outer cycle, giving $2 \times (3+1) = 8$ total H/L module steps. Since each H or L module contains half of the non-embedding parameters of the full HRM recurrent core, this corresponds to $8 \times 0.5 = 4$ recursions in the table. For standard Transformer models, the value is 1.

Looped Transformers and RINS generally outperform Transformer models of the same size, showing that recurrent or looped computation is an effective architectural direction. When compared with a larger Transformer under a matched training-FLOPs budget, however, their advantage is less consistent. HRM is a strong instance of this architecture-design space and performs well against the listed baselines, including the larger deep Transformer.

Within recurrent designs, we further compare HRM with TRM to separate hierarchical dual-timescale recurrence from a shared-parameter dual-timescale recurrent variant.

Table 2: Performance and stability comparison against TRM. HRM maintains stable training dynamics across all scales, whereas TRM suffers from severe instability at the 1B parameter scale. Furthermore, at the 0.6B scale, HRM achieves competitive performance across most benchmarks while requiring 2× less compute than TRM.

TRM is an HRM variant that shares the H and L module parameters to achieve strong results on symbolic reasoning problems at smaller scale [36](https://arxiv.org/html/2605.20613#bib.bib15). [Table 2](https://arxiv.org/html/2605.20613#S3.T2) compares HRM and TRM. Since TRM shares parameters across H-L modules, there are two ways to approximately match FLOPs: keeping the overall parameter count fixed and reducing the number of recursions, or keeping the recursive structure fixed and reducing the parameter count. In the first setting, TRM training is less stable, likely because the reduced recursion weakens the intended iterative computation. In the second setting, the additional recursion stabilizes training and improves performance, but the model still lags behind FLOPs-matched HRM. HRM achieves generally comparable or stronger performance while using substantially fewer FLOPs than TRM in this comparison.

These results support the first part of the small-budget design exploration: recurrent and looped architectures can improve benchmark yield under fixed training compute, and HRM is one effective point in this broader architecture-design space.

### 3.2 Task-completion objective and PrefixLM yield

The second part of this exploration asks whether the training objective and input structure can increase the yield of each training example. We test this through an incremental ablation that starts with a standard Transformer trained on full question–answer pairs using causal attention, then adds the task-completion objective, PrefixLM attention, and finally the HRM architecture. All experiments are FLOPs-matched.

As shown in [Table 3](https://arxiv.org/html/2605.20613#S3.T3), the task-completion objective, PrefixLM training, and the HRM architecture each significantly contribute to overall performance. Introducing the task-completion objective establishes initial gains across all benchmarks, while PrefixLM training further enhances these results compared to standard causal masking. Ultimately, transitioning from a standard Transformer to the HRM architecture delivers a final, consistent performance increase across the board.

Table 3: Performance Comparison across Model Architectures and Objectives

### 3.3 Comparison with contemporary open models

After exploring architecture, objective, and input structure under the small-budget setting, we compare the resulting HRM-Text checkpoint with contemporary fully open and open-weight models trained with substantially larger budgets.

[Figure 1](https://arxiv.org/html/2605.20613#S0.F1) and [Table 4](https://arxiv.org/html/2605.20613#S3.T4) compare HRM-Text 1B with contemporary fully open and open-weight models, including Llama, Qwen, Gemma, OLMo, and the recurrent models Huginn and Ouro. HRM-Text achieves strong performance among these models on most benchmarks, while remaining competitive on MMLU despite its smaller parameter count and limited 40B unique-token pretraining budget. This pattern is consistent with the role of HRM-Text: recurrent depth and task-completion pretraining improve reasoning and task execution, while broad factual-knowledge coverage remains more sensitive to model scale and data breadth. HRM-Text reaches this performance range with $96\text{-}432\times$ less estimated training compute and roughly $100\text{-}900\times$ fewer training tokens than the compared open baselines. This comparison supports the paper’s central question by showing that a small, task-completion-oriented pretraining run can enter the performance range of open models trained with far larger token and compute budgets.

| Model | Architecture | FLOPs ($10^{21}$) | Tokens (T) | MMLU | ARC-C | Hella. | Wino. | BoolQ | DROP | GSM8K | MATH |
|---|---|---|---|---|---|---|---|---|---|---|---|
| *Fully open* | | | | | | | | | | | |
| HRM-Text 1B | Recurrent | 1 | 0.06 | 60.7 | 81.9 | 63.4 | 72.4 | 86.2 | 82.2 | 84.5 | 56.2 |
| Huginn 3.5B | Recurrent | 127 | 0.8 | 31.4 | 38.2 | 65.2 | 59.4 | 69.8 | 17.8 | 34.6 | 12.6 |
| Olmo3 7B | Dense | 252 | 6 | 65.8 | 81.6 | 72.7 | 64.6 | 85.4 | 71.5 | 75.5 | 40.0 |
| *Open weight* | | | | | | | | | | | |
| Llama3.2 3B | Dense | 162 | 9 | 58.0 | 69.1 | 47.1 | 52.4 | 76.2 | 45.2 | 77.7 | 48.0 |
| Gemma3 4B | Dense | 96 | 4 | 59.6 | 56.2 | 77.2 | 64.7 | 72.3 | 60.1 | 38.4 | 24.2 |
| Qwen3.5 2B | Dense | 432 | 36 | 64.5 | 81.0 | 64.6 | 56.7 | 80.5 | 30.8 | 53.0 | 34.2 |
| Ouro 1.4B | Recurrent | 259 | 7 | 67.4 | 60.9 | 74.3 | 72.3 | 83.6 | 49.7 | 78.9 | 22.4 |

Table 4: Evaluation results of HRM-Text 1B and contemporary fully open or open-weight models.

Our reported scaling experiments extend to 3B parameters for Transformers and 1B parameters for HRM-Text. Within this range, the results show that models trained with a limited amount of data can remain competitive with contemporary industrial-scale pretraining efforts that use much larger datasets (up to 36T tokens). Demonstrating similar efficiency gains at larger model scales is left to future work.

### 3.4 Effective depth analysis

We hypothesize that HRM’s effectiveness stems from its recurrence, which increases the amount of useful internal computation. We test this hypothesis by examining whether HRM exhibits greater effective depth than standard and looped Transformer baselines.

![Refer to caption](https://arxiv.org/html/2605.20613v1/x4.png)

Figure 4: Effective depth analysis. (a) Each layer of HRM consistently shows considerable change relative to its previous layer, indicating that deep layers of HRM still make meaningful contributions to the hidden states. (b) HRM has smaller cosine similarity of block-wise representations, while other model variants suffer more from the common layer-representation over-smoothing issue, analogously to standard Transformers.

![Refer to caption](https://arxiv.org/html/2605.20613v1/x5.png)

Figure 5: Per-layer logit lens KL. HRM shows the largest logit lens KL in deep layers, while both standard and looped Transformers converge to stable distributions in shallow layers.

[Figure 4](https://arxiv.org/html/2605.20613#S3.F4) illustrates effective depth from two perspectives: (a) the norm of the difference between adjacent recurrent blocks, and (b) the cosine similarity of block-wise representations. Both metrics suggest that HRM maintains more active representational change across depth than standard Transformers and other looped models.
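The two effective-depth metrics can be computed from a list of per-block hidden states. The sketch below is a simplified reading: it compares only adjacent blocks for both metrics, which is an assumption (the cosine-similarity panel may compare other block pairs), and `adjacent_block_metrics` is a hypothetical helper name.

```python
import math
from typing import List, Tuple

Vec = List[float]


def adjacent_block_metrics(states: List[Vec]) -> List[Tuple[float, float]]:
    """For consecutive block outputs, return (L2 norm of the difference, cosine similarity)."""
    metrics = []
    for prev, cur in zip(states, states[1:]):
        diff_norm = math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, cur)))
        dot = sum(p * c for p, c in zip(prev, cur))
        denom = math.sqrt(sum(p * p for p in prev)) * math.sqrt(sum(c * c for c in cur))
        metrics.append((diff_norm, dot / denom))
    return metrics


# Toy hidden states for three blocks (made-up values).
for diff_norm, cosine in adjacent_block_metrics([[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]):
    print(round(diff_norm, 3), round(cosine, 3))
```

A large difference norm together with a low cosine similarity indicates that a block is still reshaping the representation rather than passing it through nearly unchanged.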

Following Hu et al. [32](https://arxiv.org/html/2605.20613#bib.bib69), we also use logit lens analysis to evaluate how early the model’s output distribution begins to stabilize. We decode hidden states from different layers using the model’s output projection head, then compute the KL divergence between each probed prediction and the final model distribution. As shown in [Figure 5](https://arxiv.org/html/2605.20613#S3.F5), both the standard Transformer and looped Transformer converge to a stable output distribution in relatively early layers, suggesting that their deeper layers make smaller incremental contributions. In contrast, HRM retains larger KL values in deeper layers, indicating greater effective depth.
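The logit lens measurement can be sketched as follows, assuming the per-layer logits have already been obtained by applying the output head to each layer's hidden state. The direction of the divergence, KL(final || probed), is an assumption, since the text does not specify it, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def logit_lens_kl(layer_logits: List[List[float]]) -> List[float]:
    """KL between the final-layer distribution and each probed layer's distribution."""
    final = softmax(layer_logits[-1])
    return [kl_divergence(final, softmax(logits)) for logits in layer_logits]


# Toy logits for three layers over a two-token vocabulary (made-up values).
print(logit_lens_kl([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]))
```

A curve that drops to near zero in early layers means the prediction has already stabilized, whereas values that stay large in deep layers indicate that those layers still change the output distribution.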

## 4 Training details

### 4.1 Dataset

We train HRM-Text exclusively on open-source datasets, comprising general instructions, rewritten knowledge, mathematical and symbolic tasks, textbook exercises, and web-extracted questions. The initial corpus contains approximately 176.5B tokens across 593.7M documents. From this, we sample 40B unique tokens for a total training duration of 60B tokens, with repetition governed by the stratified sampling schedule described below. Table [5](https://arxiv.org/html/2605.20613#S4.T5) summarizes the dataset composition.

Table 5: Source datasets used for HRM-Text training, grouped by type.

Table 6: Stratified sampling limits used to construct the training mixture.

To control response properties during inference, we prepend specific condition tags to the instructions based on the target response style. We utilize four primary conditions: `direct` (answer-only), `cot` (chain-of-thought), `synth` (synthetic answer style), and `noisy` (web-crawl text with uneven formatting). As outlined in the “Condition” column of Table [5](https://arxiv.org/html/2605.20613#S4.T5), this approach leverages conditioned training [68](https://arxiv.org/html/2605.20613#bib.bib2), [18](https://arxiv.org/html/2605.20613#bib.bib4) to enable explicit selection of the model’s output format at generation time.

To concentrate the training signal on final task completions, we strip all text enclosed within <think\>…</think\> boundaries prior to training\. This eliminates explicit long\-CoT traces mostly produced by reinforcement learning with verifiable rewards \(RLVR\) training [15](https://arxiv.org/html/2605.20613#bib.bib24), aligning with our objective for HRM\-Text to rely on its internal hierarchical computation rather than explicit reasoning steps\.
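The stripping step can be approximated with a single regular expression\. The sketch below is an assumption about how such a filter might look, not the authors' preprocessing code; the function name is hypothetical, and unclosed tags are deliberately left untouched because the text does not say how they are handled\.

```python
import re

# Non-greedy match so that several separate <think> spans are removed independently;
# DOTALL lets a span run across newlines.
THINK_SPAN = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_think(response: str) -> str:
    """Remove every <think>...</think> span and trim surrounding whitespace."""
    return THINK_SPAN.sub("", response).strip()

cleaned = strip_think("<think>2 + 2\nis 4</think>The answer is 4.")
```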

We employ SeqIO\-style stratified sampling [54](https://arxiv.org/html/2605.20613#bib.bib59), treating each dataset or task as an independent stratum rather than sampling uniformly from a pooled corpus\. To ensure a balanced training mixture and prevent over\-representation of massive datasets, we apply caps on the number of documents per task or per dataset, while upsampling smaller datasets\. The specific sampling limits and multipliers are detailed in [Table 6](https://arxiv.org/html/2605.20613#S4.T6)\.
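A minimal sketch of cap\-and\-upsample mixing is shown below\. It is illustrative only: it does not reproduce SeqIO's own mixing logic, the function name is hypothetical, and the caps and multipliers are placeholder values rather than the limits reported in Table 6\.

```python
import random

def build_mixture(strata, caps, multipliers, seed=0):
    """Build a training mixture from independent strata.

    strata:      name -> list of documents
    caps:        name -> maximum number of documents kept (large strata are subsampled)
    multipliers: name -> integer repeat factor (small strata are upsampled)
    """
    rng = random.Random(seed)
    mixture = []
    for name, docs in strata.items():
        cap = caps.get(name, len(docs))
        kept = rng.sample(docs, min(cap, len(docs)))
        mixture.extend(kept * multipliers.get(name, 1))
    rng.shuffle(mixture)
    return mixture

# Placeholder strata: one large dataset that is capped, one small dataset that is repeated.
strata = {
    "big": [f"big-{i}" for i in range(1000)],
    "small": [f"small-{i}" for i in range(10)],
}
mixture = build_mixture(strata, caps={"big": 100}, multipliers={"small": 3})
```

Capping bounds the share of any single large source, while integer multipliers account for the gap between unique tokens and total training tokens through controlled repetition\.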

### 4\.2Dataset Contamination

While our pretraining data originates from widely used public sources, many of which enforce decontamination measures, residual contamination may still persist given the scale of pretraining\. To rigorously assess whether our models’ benchmark performance artificially benefits from exposure to test examples, we conduct a statistical test adapted from the Llama family [66](https://arxiv.org/html/2605.20613#bib.bib86)\.

We tokenize questions from all evaluated benchmarks \(excluding few\-shot exemplars\) and identify $n$\-gram matches against the fully tokenized pretraining corpus\. A sample’s contamination percentage is defined as the fraction of its tokens participating in these matched $n$\-grams\.
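The contamination percentage defined above can be computed as follows\. This is a toy sketch with hypothetical function names: it holds the corpus $n$\-grams in an in\-memory set, which would not scale to a full pretraining corpus, and the example uses $n=3$ on made\-up token IDs purely for readability\.

```python
def ngram_set(tokens, n):
    """All n-grams (as tuples) occurring in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_fraction(sample_tokens, corpus_ngrams, n):
    """Fraction of sample tokens that lie inside at least one n-gram found in the corpus."""
    if not sample_tokens:
        return 0.0
    covered = [False] * len(sample_tokens)
    for i in range(len(sample_tokens) - n + 1):
        if tuple(sample_tokens[i:i + n]) in corpus_ngrams:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(sample_tokens)

corpus_ngrams = ngram_set([1, 2, 3, 4, 5, 6], n=3)
fraction = contamination_fraction([9, 2, 3, 4, 8, 8], corpus_ngrams, n=3)
```

Here only the trigram \(2, 3, 4\) matches, so three of the six sample tokens count as contaminated\.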

To determine if contamination inflates performance, we partition the evaluation data into four overlapping subsets based on contamination percentage: “Clean” \($<20\%$\), “Not Clean” \($\geq 20\%$\), “Not Dirty” \($<80\%$\), and “Dirty” \($\geq 80\%$\)\. For contamination to be deemed significantly impactful, “clean” samples must perform demonstrably worse than average, while “dirty” samples perform demonstrably better\. For each subset of size $k$, we compute the empirical mean performance $\bar{X}$ and the test statistic $Z_k = (\bar{X} - \mu_k)/\sigma_k$, where $\mu_k$ and $\sigma_k$ are the mean and standard deviation of the sampling distribution for size $k$\. We conclude that dataset contamination provides a statistically significant performance boost only if $|Z_k| > 2$ across all four subsets\.
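The test can be sketched as below\. Two points are assumptions rather than details given in the text: $\mu_k$ is taken to be the mean score over the whole benchmark, and $\sigma_k$ is taken to be the population standard deviation divided by $\sqrt{k}$ \(the standard error of a size\-$k$ sample\)\. Function names and the toy data are hypothetical\.

```python
import math
from statistics import mean, pstdev

def z_statistic(subset_scores, all_scores):
    """Z_k = (X_bar - mu_k) / sigma_k under the assumptions stated in the lead-in."""
    k = len(subset_scores)
    mu_k = mean(all_scores)                      # assumed: overall benchmark mean
    sigma_k = pstdev(all_scores) / math.sqrt(k)  # assumed: standard error for size k
    return (mean(subset_scores) - mu_k) / sigma_k

def contamination_z(scores, contamination, low=0.20, high=0.80):
    """Z statistic for each of the four overlapping subsets (empty subsets are omitted)."""
    pairs = list(zip(scores, contamination))
    subsets = {
        "clean": [s for s, c in pairs if c < low],
        "not_clean": [s for s, c in pairs if c >= low],
        "not_dirty": [s for s, c in pairs if c < high],
        "dirty": [s for s, c in pairs if c >= high],
    }
    return {name: z_statistic(sub, scores) for name, sub in subsets.items() if sub}

def is_significant(z_by_subset, threshold=2.0):
    """Significant only if all four subsets exist and every |Z_k| exceeds the threshold."""
    return len(z_by_subset) == 4 and all(abs(z) > threshold for z in z_by_subset.values())

# Toy extreme case: every uncontaminated sample fails, every contaminated one succeeds.
scores = [0.0] * 50 + [1.0] * 50
contamination = [0.0] * 50 + [1.0] * 50
z_values = contamination_z(scores, contamination)
```

The sketch assumes the benchmark scores are not all identical; otherwise the standard deviation is zero and the statistic is undefined\.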

We applied this test to HRM\-Text 0\.6B and 1B using $n=13$ and $n=20$, on all benchmarks shown in [Table 4](https://arxiv.org/html/2605.20613#S3.T4)\. HRM\-Text 0\.6B exhibited no significant contamination in either setting\. HRM\-Text 1B showed statistical significance on the DROP benchmark for $n=13$ \(as shown in [Table 7](https://arxiv.org/html/2605.20613#S4.T7)\), but not for $n=20$\. Nonetheless, HRM\-Text 1B still achieves a score of 81\.1 on the strictly clean subset \(0% average contamination, 5904 samples\) of DROP, indicating strong baseline generalization despite a marginal potential benefit from contamination\.

Overall, these analyses show that HRM\-Text’s benchmark performance is unlikely to be artificially driven by prior exposure to test examples\.

Table 7: Contamination analysis results for the DROP dataset on HRM\-Text 1B\.

### 4\.3Architecture and optimization details

HRM\-Text 1B took 46 hours to pretrain on two 8×H100 nodes, costing around $1,472 \(assuming $2 per H100 hour\)\. We summarize the model, optimization, and infrastructure settings below\.
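The quoted cost follows directly from the stated numbers, as the short calculation below shows\. The throughput line is a derived estimate rather than a reported figure: it assumes the 46 hours cover the full 60B\-token training run described in the dataset section\.

```python
hours = 46
nodes = 2
gpus_per_node = 8
usd_per_gpu_hour = 2.0

gpu_hours = hours * nodes * gpus_per_node   # 46 * 16 = 736 H100-hours
cost_usd = gpu_hours * usd_per_gpu_hour     # 736 * 2 = 1472 USD

# Derived estimate (assumption: all 60B training tokens are processed within the 46 hours).
total_tokens = 60_000_000_000
tokens_per_second = total_tokens / (hours * 3600)
tokens_per_gpu_second = tokens_per_second / (nodes * gpus_per_node)
```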

##### Tokenizer\.

We employ a Byte\-Pair Encoding \(BPE\) tokenizer with a vocabulary size of 65,536, trained using the tokenizers library\.

##### Model configuration\.

Each module is a transformer comprising 16 layers, with a hidden size of 1536 and a head size of 128\. We use a context size of 4,096 and RoPE positional encoding with $\theta = 10{,}000$\. The RMSNorm $\epsilon$ parameter is set to $10^{-6}$\. All models are trained in bfloat16 precision, and all model weights are initialized using LeCun normal\.
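For reference, the stated hyperparameters can be collected in a small configuration object\. This is a hypothetical sketch, not the released code: the class and field names are invented, the head count assumes standard multi\-head attention in which heads evenly partition the hidden size, and the inverse\-frequency formula is the usual RoPE definition $\theta^{-2i/d}$ rather than something stated in the text\.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModuleConfig:
    """Per-module settings as stated in the text; field names are illustrative."""
    num_layers: int = 16
    hidden_size: int = 1536
    head_size: int = 128
    context_size: int = 4096
    rope_theta: float = 10_000.0
    rms_norm_eps: float = 1e-6

    @property
    def num_heads(self) -> int:
        # Assumption: heads evenly partition the hidden dimension.
        if self.hidden_size % self.head_size != 0:
            raise ValueError("hidden_size must be divisible by head_size")
        return self.hidden_size // self.head_size

def rope_inv_freq(head_size: int, theta: float):
    """Standard RoPE inverse frequencies, one per pair of head dimensions."""
    return [theta ** (-2 * i / head_size) for i in range(head_size // 2)]

config = ModuleConfig()
inv_freq = rope_inv_freq(config.head_size, config.rope_theta)
```

Under these assumptions each module has 12 attention heads and 64 rotary frequencies per head\.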

##### Optimization\.

We use the Adam\-atan2 optimizer [20](https://arxiv.org/html/2605.20613#bib.bib27) with $\beta_1 = 0.9$, $\beta_2 = 0.95$, and a weight decay of 0\.1\. The learning rate is linearly warmed up over 2,000 steps and then held constant at $2.2 \times 10^{-4}$\. No gradient clipping is applied\. The batch size is 196,608 tokens\. Rather than applying standard learning rate decay, we maintain an Exponential Moving Average \(EMA\) of the model weights with a decay rate of 0\.9999\. Both our final evaluations and the publicly released model weights use this EMA checkpoint\.
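The schedule and the EMA update can be sketched as follows\. The code is a plain\-Python illustration with hypothetical function names, not the training loop itself: the exact warmup starting value is an assumption, weights are modeled as flat lists of floats, and the step count is derived from the 60B\-token training duration and the stated batch size\.

```python
def learning_rate(step, peak=2.2e-4, warmup_steps=2000):
    """Linear warmup to the peak, then constant (no decay)."""
    if step < warmup_steps:
        # Assumption: warmup ramps from peak/warmup_steps at step 0 up to the peak.
        return peak * (step + 1) / warmup_steps
    return peak

def ema_update(ema_weights, weights, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

# Derived, not reported: optimizer steps implied by 60B tokens at 196,608 tokens per batch.
total_steps = 60_000_000_000 // 196_608

ema = ema_update([0.0, 1.0], [1.0, 1.0])
```

With a decay of 0\.9999 the average has an effective horizon of roughly $1/(1-0.9999) = 10{,}000$ steps, so the EMA checkpoint smooths over the most recent few percent of the run, which is how it can stand in for an explicit learning\-rate decay phase\.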

##### Infrastructure\.

The parallelization framework is based on PyTorch FSDP2 [42](https://arxiv.org/html/2605.20613#bib.bib3)\. All models are trained in a single continuous run\. We do not use intermediate checkpointing or crash recovery, and we do not skip loss spikes\.

## 5Discussion

### 5\.1Toward decoupling knowledge and reasoning

Our results suggest a direction for partially decoupling factual coverage from reasoning computation\. HRM\-Text is trained on only 40B unique tokens, and explicitly knowledge\-oriented sources constitute only a fraction of the task\-formatted mixture\. Nevertheless, the model achieves strong performance on reasoning\-heavy benchmarks such as MATH and GSM8K, while retaining nontrivial performance on broader knowledge benchmarks such as MMLU\. This pattern suggests that a compact recurrent model can learn useful task\-execution and reasoning behavior without requiring the same degree of broad factual memorization typically associated with trillion\-token pretraining\.

This observation motivates future systems that separate a compact reasoning core from factual storage\. In such systems, recurrent models like HRM\-Text could specialize in computation, planning, and task execution, while factual breadth is supplied by curated corpora, retrieval\-augmented stores, or learned memory modules\. Recent conditional\-memory approaches such as Engram point in a related direction: instead of forcing the Transformer backbone to simulate static pattern lookup through dense computation, they introduce scalable memory lookup as a complementary sparsity axis, freeing neural computation for global context integration and reasoning[10](https://arxiv.org/html/2605.20613#bib.bib84)\. HRM\-Text does not yet implement retrieval or conditional memory, but its results suggest that combining small recurrent reasoning models with external or learned knowledge stores is a promising direction for future work\.

### 5\.2Adaptive computation time \(ACT\)

Wang et al\. [69](https://arxiv.org/html/2605.20613#bib.bib18) equipped HRM with an adaptive computation module that allows simpler problems to terminate earlier, reducing computation while maintaining near\-optimal performance\. We do not use this component in HRM\-Text in order to keep the design and training procedure simpler, but it remains a promising direction for improving both performance and computational efficiency\. The current recurrent schedule provides additional effective serial depth, but it also increases inference\-time computation relative to a single\-pass Transformer\. ACT would allow easy prompts or tokens to halt after fewer recurrent cycles while reserving the full recurrent budget for harder cases, potentially recovering a substantial portion of the inference overhead\. We therefore view ACT as a natural complement to HRM\-Text’s recurrent\-depth design: HRM supplies depth when reasoning is needed, while ACT can make that depth conditional rather than fixed\.

### 5\.3PrefixLM with inference frameworks

PrefixLM can run inside standard text\-generation inference frameworks such as vLLM without requiring a fundamentally different serving stack\. The main requirement is custom attention\-mask handling during prefilling, so that instruction tokens can attend bidirectionally while response tokens remain autoregressive\.

Using a PrefixLM\-style attention pattern in multi\-turn chat also requires careful KV\-cache logic: user tokens need full attention within each user segment, while assistant tokens must preserve causal generation\. This is an engineering constraint rather than a conceptual limitation, but it should be addressed explicitly in production inference systems\.
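The attention pattern described in this subsection can be made concrete with a small mask builder\. This is a sketch under stated assumptions rather than the mask used by any particular inference framework: roles are given per token as the strings `"user"` and `"assistant"`, consecutive tokens with the same role form one segment, and every token may always attend to earlier positions\. The function name is hypothetical\.

```python
def prefix_lm_mask(roles):
    """Return mask[i][j] == True if token i may attend to token j.

    User tokens attend bidirectionally within their own segment; all tokens
    attend causally to earlier positions; assistant tokens never see the future.
    """
    # Assign a segment id to each token: a new segment starts when the role changes.
    segment_ids = []
    current, previous_role = -1, None
    for role in roles:
        if role != previous_role:
            current += 1
            previous_role = role
        segment_ids.append(current)

    n = len(roles)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i
            same_user_segment = roles[i] == "user" and segment_ids[i] == segment_ids[j]
            mask[i][j] = causal or same_user_segment
    return mask

# Two-turn chat: user (2 tokens), assistant (2), user (2), assistant (1).
roles = ["user", "user", "assistant", "assistant", "user", "user", "assistant"]
mask = prefix_lm_mask(roles)
```

Because keys and values of a user segment depend on the whole segment, a serving stack would have to prefill each user turn in full before caching it, while assistant tokens can be cached incrementally as in ordinary causal decoding\.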

## 6Conclusion

We introduced HRM\-Text as an empirical existence proof that highly efficient pretraining is achievable\. Inspired by biological multi\-timescale processing, we co\-designed a hierarchical recurrent architecture with a targeted task\-completion objective\. This demonstrates that there is at least one model family capable of reaching competitive performance without relying on the massive compute and internet\-scale raw text that dominate current paradigms\.

By drastically reducing the compute\-to\-performance ratio, this work opens up significant potential for future research\. Foundational pretraining is no longer locked inside highly resourced institutions; it is now computationally accessible to small labs, academic groups, and even individuals\. We hope this democratization empowers the broader community to actively explore, train, and innovate on new architectures from scratch\.

## 7Related Work

The literature on recurrent neural networks and language modeling is extensive\. In this section, we discuss the most relevant papers\.

### 7\.1Scaling laws and efficient pretraining

Language\-model development is driven by scaling laws and compute\-optimal training, which together prescribe jointly increasing parameters, data, and compute[37](https://arxiv.org/html/2605.20613#bib.bib26),[31](https://arxiv.org/html/2605.20613#bib.bib23)\. This underlies the dominant recipe: large decoder\-only Transformers trained on massive corpora and refined via mid\- and post\-training[8](https://arxiv.org/html/2605.20613#bib.bib22),[24](https://arxiv.org/html/2605.20613#bib.bib78),[43](https://arxiv.org/html/2605.20613#bib.bib43),[49](https://arxiv.org/html/2605.20613#bib.bib42)\. While this scaling paradigm has produced strong models, it concentrates pretraining among compute\-rich organizations, reinforcing a growing compute divide[7](https://arxiv.org/html/2605.20613#bib.bib72),[2](https://arxiv.org/html/2605.20613#bib.bib73)\. HRM\-Text instead explores whether improved architectures, objectives, and data curation can shift the cost–performance frontier, complementing scaling laws by increasing per\-token and per\-FLOP efficiency\.

### 7\.2Conditional sequence modeling and PrefixLM

The distinction between modeling conditional answers, $P_\theta(x_a \mid x_q)$, and full text streams, $P_\theta(x)$, predates modern LLMs\. Early sequence\-to\-sequence models and encoder–decoder transformers explicitly model outputs conditioned on inputs [61](https://arxiv.org/html/2605.20613#bib.bib17), [12](https://arxiv.org/html/2605.20613#bib.bib32), [5](https://arxiv.org/html/2605.20613#bib.bib34), [67](https://arxiv.org/html/2605.20613#bib.bib25)\. T5 [53](https://arxiv.org/html/2605.20613#bib.bib37) later unified NLP tasks as text\-to\-text generation, reinforcing this conditional framing\. In the instruction\-tuning phase of language modeling, NLP datasets are converted into instruction–response pairs, and a mask is often applied so that loss is only computed on the response tokens [70](https://arxiv.org/html/2605.20613#bib.bib39), [55](https://arxiv.org/html/2605.20613#bib.bib40)\. Scaling approaches like FLAN show that such task formatting improves generalization [46](https://arxiv.org/html/2605.20613#bib.bib44), [14](https://arxiv.org/html/2605.20613#bib.bib41)\.

Decoder\-only models concatenate the prompt and response into a single causal stream and predict all tokens\. Although scalable, this is inefficient: the prompt is known at inference time, yet training still assigns loss to reconstruct it\.

PrefixLM\-style objectives bridge decoder\-only models and conditional generation: prefix tokens attend bidirectionally, while outputs remain causal[45](https://arxiv.org/html/2605.20613#bib.bib35),[17](https://arxiv.org/html/2605.20613#bib.bib36),[53](https://arxiv.org/html/2605.20613#bib.bib37),[63](https://arxiv.org/html/2605.20613#bib.bib38)\. HRM\-Text builds directly on this lineage by making conditional modeling the primary pretraining objective, using response\-only loss and PrefixLM masking to combine encoder–decoder behavior with decoder\-only simplicity\.
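As a hedged illustration of the contrast drawn in this subsection, the two training losses can be written side by side\. The symbols $\mathcal{L}_{\text{full}}$ and $\mathcal{L}_{\text{resp}}$ and the token index $t$ are notation introduced here for illustration; $x = (x_q, x_a)$ denotes the concatenated prompt and response, following the $P_\theta$ notation used earlier in this subsection\.

$$
\mathcal{L}_{\text{full}} = -\sum_{t=1}^{|x|} \log P_\theta\left(x_t \mid x_{<t}\right),
\qquad
\mathcal{L}_{\text{resp}} = -\sum_{t=1}^{|x_a|} \log P_\theta\left(x_{a,t} \mid x_q,\, x_{a,<t}\right).
$$

The first sum spends part of the training signal on reconstructing prompt tokens, whereas the second scores only response tokens conditioned on the full prompt\. PrefixLM masking additionally lets the prompt tokens in $x_q$ attend to one another bidirectionally, which the causal factorization in $\mathcal{L}_{\text{full}}$ does not allow\.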

### 7\.3Latent computation and recurrent language models

A line of work seeks to improve model capability by increasing internal computation rather than just scaling parameters or output tokens\. Universal Transformers introduced recurrent depth to self\-attention[16](https://arxiv.org/html/2605.20613#bib.bib63), and later recurrent or block\-recurrent Transformer variants reuse parameters across steps or layers[34](https://arxiv.org/html/2605.20613#bib.bib75),[13](https://arxiv.org/html/2605.20613#bib.bib74),[56](https://arxiv.org/html/2605.20613#bib.bib64)\. These approaches echo classic recurrent\-network ideas but inherit the challenge of unstable long\-range credit assignment[6](https://arxiv.org/html/2605.20613#bib.bib76),[78](https://arxiv.org/html/2605.20613#bib.bib77)\.

Recent latent\-reasoning approaches refine hidden states internally before emitting an answer [27](https://arxiv.org/html/2605.20613#bib.bib65), [39](https://arxiv.org/html/2605.20613#bib.bib66)\. Recurrent\-depth language models such as Huginn [23](https://arxiv.org/html/2605.20613#bib.bib7) and looped language models such as Ouro [76](https://arxiv.org/html/2605.20613#bib.bib9) scale this idea to language modeling and test\-time computation\. Meanwhile, CCDD [75](https://arxiv.org/html/2605.20613#bib.bib6) establishes a connection between looped Transformers and continuous diffusion language models with latent\-reasoning advantages\. These works demonstrate that latent recurrence is a promising alternative to purely token\-level reasoning, but many still rely on large token budgets, stage\-wise training, or extensive test\-time recurrence\.

HRM\-Text builds on the Hierarchical Reasoning Model, which uses a two\-timescale recurrent design for symbolic reasoning [69](https://arxiv.org/html/2605.20613#bib.bib18)\. Like prior work, it emphasizes richer internal computation, but it differs in that it is trained from scratch under a small token budget and uses a hierarchical dual\-timescale architecture\. Related work such as TRM explores even smaller recursive models [36](https://arxiv.org/html/2605.20613#bib.bib15), suggesting that hierarchy, temporal separation, and recurrence can enable useful serial computation, though applying them to language remains challenging due to larger states and broader data\.

### 7\.4Stable recurrent optimization

Stability is a key challenge for recurrent\-depth language models\. In Transformers, normalization placement trades off forward stability and gradient flow: PostNorm stabilizes activations but is harder to optimize at depth, while PreNorm improves gradients but risks residual growth and reduced expressivity[71](https://arxiv.org/html/2605.20613#bib.bib70),[44](https://arxiv.org/html/2605.20613#bib.bib71)\. Recurrence intensifies this issue, as repeated transformations create long products of Jacobian\-like operators during backpropagation\. Prior work shows exact long\-horizon credit assignment is often impractical[6](https://arxiv.org/html/2605.20613#bib.bib76),[62](https://arxiv.org/html/2605.20613#bib.bib28), and studies of random matrix products and neural gradients suggest deep multiplicative paths lead to heavy\-tailed, lognormal\-like variability[26](https://arxiv.org/html/2605.20613#bib.bib29),[11](https://arxiv.org/html/2605.20613#bib.bib30),[30](https://arxiv.org/html/2605.20613#bib.bib31)\.

HRM\-Text addresses these stability issues using architecture\-specific techniques: MagicNorm and warm\-up for deep credit assignment\. These design choices distinguish HRM\-Text from generic looped Transformers and are crucial to making recurrent depth stable at language\-model scale\.

## Acknowledgements

We thank Sen Song, Jiacheng You, and Andy L\. Siy for their insightful discussions\.

## References

- P\. Aggarwal, M\. Ghazvininejad, S\. Kim, I\. Kulikov, J\. Lanchantin, X\. Li, T\. Li, B\. Liu, G\. Neubig, A\. Ovalle, S\. Saha, S\. Sukhbaatar, S\. Welleck, J\. Weston, C\. Whitehouse, A\. Williams, J\. Xu, P\. Yu, W\. Yuan, J\. Zhang, and W\. Zhao \(2026\)Reasoning over mathematical objects: on\-policy reward modeling and test time aggregation\.External Links:2603\.18886Cited by:[Table 5](https://arxiv.org/html/2605.20613#S4.T5.2.4.3.2.1.1)\.
- N\. Ahmed and M\. Wahed \(2020\)The de\-democratization of ai: deep learning and the compute divide in artificial intelligence research\.arXiv preprint arXiv:2010\.15581\.Cited by:[§7\.1](https://arxiv.org/html/2605.20613#S7.SS1.p1.1)\.
- I\. Alabdulmohsin and X\. Zhai \(2026\)Recursive inference scaling: a winning path to scalable inference in language and multimodal systems\.Advances in Neural Information Processing Systems38,pp\. 109020–109049\.Cited by:[§3\.1](https://arxiv.org/html/2605.20613#S3.SS1.p1.1.2.1)\.
- R\. Amo, S\. Matias, A\. Yamanaka, K\. F\. Tanaka, N\. Uchida, and M\. Watabe\-Uchida \(2022\)A gradual temporal shift of dopamine responses mirrors the progression of temporal difference error in machine learning\.Nature neuroscience25\(8\),pp\. 1082–1092\.Cited by:[§2\.1\.2](https://arxiv.org/html/2605.20613#S2.SS1.SSS2.p1.2)\.
- D\. Bahdanau, K\. Cho, and Y\. Bengio \(2015\)Neural machine translation by jointly learning to align and translate\.InInternational Conference on Learning Representations,Cited by:[§7\.2](https://arxiv.org/html/2605.20613#S7.SS2.p1.2)\.
- Y\. Bengio, P\. Simard, and P\. Frasconi \(1994\)Learning long\-term dependencies with gradient descent is difficult\.IEEE Transactions on Neural Networks5\(2\),pp\. 157–166\.External Links:[Document](https://dx.doi.org/10.1109/72.279181)Cited by:[§1](https://arxiv.org/html/2605.20613#S1.p3.1),[§7\.3](https://arxiv.org/html/2605.20613#S7.SS3.p1.1),[§7\.4](https://arxiv.org/html/2605.20613#S7.SS4.p1.1)\.
- T\. Besiroglu, S\. A\. Bergerson, A\. Michael, L\. Heim, X\. Luo, and N\. Thompson \(2024\)The compute divide in machine learning: a threat to academic contribution and scrutiny?\.arXiv preprint arXiv:2401\.02452\.Cited by:[§7\.1](https://arxiv.org/html/2605.20613#S7.SS1.p1.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.InAdvances in neural information processing systems,Vol\.33,pp\. 1877–1901\.Cited by:[§7\.1](https://arxiv.org/html/2605.20613#S7.SS1.p1.1)\.
- Y\. Chen, Z\. Yang, Z\. Liu, C\. Lee, P\. Xu, M\. Shoeybi, B\. Catanzaro, and W\. Ping \(2026\)Acereason\-nemotron: advancing math and code reasoning through reinforcement learning\.InAdvances in neural information processing systems,Vol\.38,pp\. 110320–110345\.Cited by:[Table 5](https://arxiv.org/html/2605.20613#S4.T5.2.6.5.2.1.1)\.
- X\. Cheng, W\. Zeng, D\. Dai, Q\. Chen, B\. Wang, Z\. Xie, K\. Huang, X\. Yu, Z\. Hao, Y\. Li, H\. Zhang, H\. Zhang, D\. Zhao, and W\. Liang \(2026\)Conditional memory via scalable lookup: a new axis of sparsity for large language models\.arXiv preprint arXiv:2601\.07372\.Cited by:[§5\.1](https://arxiv.org/html/2605.20613#S5.SS1.p2.1)\.
- B\. Chmiel, L\. Ben\-Uri, M\. Shkolnik, E\. Hoffer, R\. Banner, and D\. Soudry \(2020\)Neural gradients are near\-lognormal: improved quantized and sparse training\.arXiv preprint arXiv:2006\.08173\.External Links:2006\.08173Cited by:[§C\.1](https://arxiv.org/html/2605.20613#A3.SS1.p2.1),[§7\.4](https://arxiv.org/html/2605.20613#S7.SS4.p1.1)\.
- K\. Cho, B\. van Merriënboer, Ç\. Gülçehre, D\. Bahdanau, F\. Bougares, H\. Schwenk, and Y\. Bengio \(2014\)Learning phrase representations using RNN encoder–decoder for statistical machine translation\.InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing,pp\. 1724–1734\.External Links:[Document](https://dx.doi.org/10.3115/v1/D14-1179)Cited by:[§7\.2](https://arxiv.org/html/2605.20613#S7.SS2.p1.2)\.
- J\. R\. Chowdhury and C\. Caragea \(2024\)Investigating recurrent transformers with dynamic halt\.arXiv preprint arXiv:2402\.00976\.Cited by:[§1](https://arxiv.org/html/2605.20613#S1.p3.1),[§7\.3](https://arxiv.org/html/2605.20613#S7.SS3.p1.1)\.
- H\. W\. Chung, L\. Hou, S\. Longpre, B\. Zoph, Y\. Tay, W\. Fedus, Y\. Li, X\. Wang, M\. Dehghani, S\. Brahma,et al\.\(2024\)Scaling instruction\-finetuned language models\.Journal of Machine Learning Research25,pp\. 1–53\.Cited by:[§7\.2](https://arxiv.org/html/2605.20613#S7.SS2.p1.2)\.
- DeepSeek\-AI \(2025\)DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.External Links:2501\.12948,[Document](https://dx.doi.org/10.48550/arXiv.2501.12948),[Link](https://arxiv.org/abs/2501.12948)Cited by:[§4\.1](https://arxiv.org/html/2605.20613#S4.SS1.p3.1.3.1)\.
- M\. Dehghani, S\. Gouws, O\. Vinyals, J\. Uszkoreit, and Ł\. Kaiser \(2019\)Universal transformers\.InInternational Conference on Learning Representations,Cited by:[§3\.1](https://arxiv.org/html/2605.20613#S3.SS1.p1.1.1.1),[§7\.3](https://arxiv.org/html/2605.20613#S7.SS3.p1.1)\.
- L\. Dong, N\. Yang, W\. Wang, F\. Wei, X\. Liu, Y\. Wang, J\. Gao, M\. Zhou, and H\. Hon \(2019\)Unified language model pre\-training for natural language understanding and generation\.InAdvances in Neural Information Processing Systems,Vol\.32\.Cited by:[2nd item](https://arxiv.org/html/2605.20613#S1.I1.i2.p1.1),[§7\.2](https://arxiv.org/html/2605.20613#S7.SS2.p3.1)\.
- Y\. Dong, Z\. Wang, M\. Sreedhar, X\. Wu, and O\. Kuchaiev \(2023\)Steerlm: attribute conditioned sft as an \(user\-steerable\) alternative to rlhf\.InFindings of the Association for Computational Linguistics: EMNLP 2023,pp\. 11275–11288\.Cited by:[§4\.1](https://arxiv.org/html/2605.20613#S4.SS1.p2.1)\.
- J\. L\. Elman \(1993\)Learning and development in neural networks: the importance of starting small\.Cognition48\(1\),pp\. 71–99\.External Links:[Document](https://dx.doi.org/10.1016/0010-0277%2893%2990058-4)Cited by:[§2\.1\.2](https://arxiv.org/html/2605.20613#S2.SS1.SSS2.p1.2)\.
- K\. E\. Everett, L\. Xiao, M\. Wortsman, A\. A\. Alemi, R\. Novak, P\. J\. Liu, I\. Gur, J\. Sohl\-Dickstein, L\. P\. Kaelbling, J\. Lee, and J\. Pennington \(2024\)Scaling exponents across parameterizations and optimizers\.InForty\-first International Conference on Machine Learning,Cited by:[§4\.3](https://arxiv.org/html/2605.20613#S4.SS3.SSS0.Px3.p1.3.1.1)\.
- R\. Fan, Z\. Wang, and P\. Liu \(2025\)MegaScience: pushing the frontiers of post\-training datasets for science reasoning\.External Links:2507\.16812Cited by:[Table 5](https://arxiv.org/html/2605.20613#S4.T5.2.7.6.2.1.1)\.
- B\. Gao, F\. Song, Z\. Yang, Z\. Cai, Y\. Miao, Q\. Dong, L\. Li, C\. Ma, L\. Chen, Z\. Tang,et al\.\(2025\)Omni\-math: a universal olympiad level mathematic benchmark for large language models\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 100540–100569\.Cited by:[Table 5](https://arxiv.org/html/2605.20613#S4.T5.2.4.3.2.1.1)\.
- J\. Geiping, S\. M\. McLeish, N\. Jain, J\. Kirchenbauer, S\. Singh, B\. R\. Bartoldson, B\. Kailkhura, A\. Bhatele, and T\. Goldstein \(2026\)Scaling up test\-time compute with latent reasoning: a recurrent depth approach\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=S3GhJooWIC)Cited by:[§1](https://arxiv.org/html/2605.20613#S1.p5.2),[§7\.3](https://arxiv.org/html/2605.20613#S7.SS3.p2.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§7\.1](https://arxiv.org/html/2605.20613#S7.SS1.p1.1)\.
- E\. Guha, R\. Marten, S\. Keh, N\. Raoof, G\. Smyrnis, H\. Bansal, M\. Nezhurina, J\. Mercat, T\. Vu, Z\. Sprague, A\. Suvarna, B\. Feuer, L\. Chen, Z\. Khan, E\. Frankel, S\. Grover, C\. Choi, N\. Muennighoff, S\. Su, W\. Zhao, J\. Yang,et al\.\(2025\)OpenThoughts: data recipes for reasoning models\.External Links:2506\.04178Cited by:[Table 5](https://arxiv.org/html/2605.20613#S4.T5.2.6.5.2.1.1)\.
- B\. Hanin and M\. Nica \(2020\)Products of many large random matrices and gradients in deep neural networks\.Communications in Mathematical Physics376\(1\),pp\. 287–322\.Cited by:[§C\.1](https://arxiv.org/html/2605.20613#A3.SS1.p2.1),[§7\.4](https://arxiv.org/html/2605.20613#S7.SS4.p1.1)\.
- S\. Hao, S\. Sukhbaatar, D\. Su, X\. Li, Z\. Hu, J\. E\. Weston, and Y\. Tian \(2025\)Training large language models to reason in a continuous latent space\.InConference on Language Modeling,Cited by:[§7\.3](https://arxiv.org/html/2605.20613#S7.SS3.p2.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the math dataset\.InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track,Cited by:[Table 5](https://arxiv.org/html/2605.20613#S4.T5.2.5.4.2.1.1),[Table 5](https://arxiv.org/html/2605.20613#S4.T5.2.8.7.2.1.1)\.
- J\. Ho and T\. Salimans \(2022\)Classifier\-free diffusion guidance\.arXiv preprint arXiv:2207\.12598\.Cited by:[Appendix D](https://arxiv.org/html/2605.20613#A4.p1.1)\.
- L\. Hodgkinson and M\. W\. Mahoney \(2021\)Multiplicative noise and heavy tails in stochastic optimization\.InProceedings of the 38th International Conference on Machine Learning \(ICML\),Proceedings of Machine Learning Research, Vol\.139,pp\. 4262–4272\.External Links:[Link](https://proceedings.mlr.press/v139/hodgkinson21a.html)Cited by:[§C\.1](https://arxiv.org/html/2605.20613#A3.SS1.p2.1),[§7\.4](https://arxiv.org/html/2605.20613#S7.SS4.p1.1)\.
- J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. d\. L\. Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark,et al\.\(2022\)Training compute\-optimal large language models\.arXiv preprint arXiv:2203\.15556\.Cited by:[§1](https://arxiv.org/html/2605.20613#S1.p1.1),[§7\.1](https://arxiv.org/html/2605.20613#S7.SS1.p1.1)\.
- Y\. Hu, C\. Zhou, and M\. Zhang \(2025\)What affects the effective depth of large language models?\.arXiv preprint arXiv:2512\.14064\.Cited by:[§3\.4](https://arxiv.org/html/2605.20613#S3.SS4.p3.1)\.
- HuggingFace H4 \(2023\)No robots\.Note:[https://huggingface\.co/datasets/HuggingFaceH4/no\_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)Dataset cardCited by:[Table 5](https://arxiv.org/html/2605.20613#S4.T5.2.2.1.2.1.1)\.
- D\. Hutchins, I\. Schlag, Y\. Wu, E\. Dyer, and B\. Neyshabur \(2022\)Block\-recurrent transformers\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 33248–33261\.Cited by:[§1](https://arxiv.org/html/2605.20613#S1.p3.1),[§7\.3](https://arxiv.org/html/2605.20613#S7.SS3.p1.1)\.
- E\. M\. Izhikevich \(2007\)Solving the distal reward problem through linkage of STDP and dopamine signaling\.Cerebral Cortex17\(10\),pp\. 2443–2452\.External Links:[Document](https://dx.doi.org/10.1093/cercor/bhl152)Cited by:[§2\.1\.2](https://arxiv.org/html/2605.20613#S2.SS1.SSS2.p1.2)\.
- A\. Jolicoeur\-Martineau \(2025\)Less is more: recursive reasoning with tiny networks\.External Links:2510\.04871,[Link](https://arxiv.org/abs/2510.04871)Cited by:[§3\.1](https://arxiv.org/html/2605.20613#S3.SS1.p5.1),[§7\.3](https://arxiv.org/html/2605.20613#S7.SS3.p3.1)\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\)Scaling laws for neural language models\.arXiv preprint arXiv:2001\.08361\.Cited by:[§1](https://arxiv.org/html/2605.20613#S1.p1.1),[§7\.1](https://arxiv.org/html/2605.20613#S7.SS1.p1.1)\.
- T\. Karras, M\. Aittala, T\. Kynkäänniemi, J\. Lehtinen, T\. Aila, and S\. Laine \(2024\)Guiding a diffusion model with a bad version of itself\.Advances in Neural Information Processing Systems37,pp\. 52996–53021\.Cited by:[Appendix D](https://arxiv.org/html/2605.20613#A4.p1.1)\.
- Y\. Koishekenov, A\. Lipani, and N\. Cancedda \(2025\)Encode, think, decode: scaling test\-time reasoning with recursive latent thoughts\.External Links:2510\.07358Cited by:[§7\.3](https://arxiv.org/html/2605.20613#S7.SS3.p2.1)\.
- A\. N\. Lee, C\. J\. Hunter, and N\. Ruiz \(2023\)Platypus: quick, cheap, and powerful refinement of llms\.External Links:2308\.07317Cited by:[Table 5](https://arxiv.org/html/2605.20613#S4.T5.2.4.3.2.1.1)\.
- J\. Li, E\. Beeching, L\. Tunstall, B\. Lipkin, R\. Soletskyi, S\. C\. Huang, K\. Rasul, L\. Yu, A\. Jiang, Z\. Shen, Z\. Qin, B\. Dong, L\. Zhou, Y\. Fleureau, G\. Lample, and S\. Polu \(2024\)NuminaMath\.Numina\.Note:[https://huggingface\.co/datasets/AI\-MO/NuminaMath\-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)Cited by:[Table 5](https://arxiv.org/html/2605.20613#S4.T5.2.4.3.2.1.1)\.
- W\. Liang, T\. Liu, L\. Wright, W\. Constable, A\. Gu, C\. Huang, I\. Zhang, W\. Feng, H\. Huang, J\. Wang, S\. Purandare, G\. Nadathur, and S\. Idreos \(2025\)TorchTitan: one\-stop pytorch native solution for production ready LLM pretraining\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=SFN6Wm7YBI)Cited by:[§4\.3](https://arxiv.org/html/2605.20613#S4.SS3.SSS0.Px4.p1.1)\.
- E\. Liu, G\. Neubig, and C\. Xiong \(2025\)Midtraining bridges pretraining and posttraining distributions\.External Links:2510\.14865Cited by:[§7\.1](https://arxiv.org/html/2605.20613#S7.SS1.p1.1)\.
- L. Liu, X. Liu, J. Gao, W. Chen, and J. Han (2020). Understanding the difficulty of training transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5747–5763.
- P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer (2018). Generating wikipedia by summarizing long sequences. In International Conference on Learning Representations.
- S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, and A. Roberts (2023). The flan collection: designing data and methods for effective instruction tuning. In Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 202, pp. 22631–22648.
- X. Ma, Q. Liu, D. Jiang, G. Zhang, Z. Ma, and W. Chen (2026). General-reasoner: advancing llm reasoning across all domains. In Advances in Neural Information Processing Systems, Vol. 38, pp. 56596–56618.
- Meta AI (2024). Llama 3: state-of-the-art open weight language models. Technical report, Meta. [Link](https://ai.meta.com/llama/).
- K. Mo, Y. Shi, W. Weng, Z. Zhou, S. Liu, H. Zhang, and A. Zeng (2025). Mid-training of large language models: a survey. arXiv:2510.06826.
- T. Olmo, A. Ettinger, A. Bertsch, B. Kuehl, D. Graham, D. Heineman, D. Groeneveld, F. Brahman, F. Timbers, H. Ivison, et al. (2025). Olmo 3. arXiv preprint arXiv:2512.13961.
- PleIAs (2025). PleIAs/synth · datasets at hugging face. [Link](https://huggingface.co/datasets/PleIAs/SYNTH).
- Z. Qiu, Z. Wang, B. Zheng, Z. Huang, K. Wen, S. Yang, R. Men, L. Yu, F. Huang, S. Huang, et al. (2026). Gated attention for large language models: non-linearity, sparsity, and attention-sink-free. Advances in Neural Information Processing Systems 38, pp. 100092–100118.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), pp. 1–67.
- A. Roberts, H. W. Chung, A. Levskaya, G. Mishra, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, et al. (2023). Scaling up models and data with t5x and seqio. Journal of Machine Learning Research 24(377), pp. 1–8.
- V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. Le Scao, A. Raja, et al. (2022). Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
- N. Saunshi, N. Dikkala, Z. Li, S. Kumar, and S. J. Reddi (2025). Reasoning with latent thoughts: on the power of looped transformers. In International Conference on Learning Representations.
- D. Saxton, E. Grefenstette, F. Hill, and P. Kohli (2019). Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations.
- N. Shazeer (2020). Glu variants improve transformer. arXiv preprint arXiv:2002.05202.
- D. Sileo (2024). Tasksource: a large collection of nlp tasks with a structured dataset preprocessing framework. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation.
- J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu (2024). Roformer: enhanced transformer with rotary position embedding. Neurocomputing 568, pp. 127063.
- I. Sutskever, O. Vinyals, and Q. V. Le (2014). Sequence to sequence learning with neural networks. arXiv:1409.3215. [Link](https://arxiv.org/abs/1409.3215).
- C. Tallec and Y. Ollivier (2017). Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209.
- Y. Tay, M. Dehghani, V. Q. Tran, X. Garcia, J. Wei, X. Wang, H. W. Chung, S. Shakeri, D. Bahri, T. Schuster, H. S. Zheng, D. Zhou, N. Houlsby, and D. Metzler (2023). UL2: unifying language learning paradigms. In International Conference on Learning Representations.
- G. Team (2025). Gemma 3 technical report. arXiv:2503.19786. [Link](https://arxiv.org/abs/2503.19786).
- S. Toshniwal, W. Du, I. Moshkov, B. Kisacanin, A. Ayrapetyan, and I. Gitman (2025). OpenMathInstruct-2: accelerating ai for math with massive open-source instruction data. In International Conference on Learning Representations.
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023). Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
- G. Wang, S. Cheng, X. Zhan, X. Li, S. Song, and Y. Liu (2024). OpenChat: advancing open-source language models with mixed-quality data. In The Twelfth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=AOJyfhWYHf).
- G. Wang, J. Li, Y. Sun, X. Chen, C. Liu, Y. Wu, M. Lu, S. Song, and Y. A. Yadkori (2025). Hierarchical reasoning model. arXiv preprint arXiv:2506.21734.
- J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le (2022). Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
- R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu (2020). On layer normalization in the transformer architecture. In International conference on machine learning, pp. 10524–10533.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- W. Yuan, J. Yu, S. Jiang, K. Padthe, Y. Li, D. Wang, I. Kulikov, K. Cho, Y. Tian, J. Weston, et al. (2026). Naturalreasoning: reasoning in the wild with 2.8 m challenging questions. In Advances in Neural Information Processing Systems, Vol. 38.
- B. Zhang and R. Sennrich (2019). Root mean square layer normalization. Advances in neural information processing systems 32.
- C. Zhou, C. Yang, Y. Hu, C. Wang, C. Zhang, M. Zhang, L. Mackey, T. Jaakkola, S. Bates, and D. Zhang (2026). Coevolutionary continuous discrete diffusion: make your diffusion language model a latent reasoner. In Forty-third International Conference on Machine Learning.
- R. Zhu, Z. Wang, K. Hua, T. Zhang, Z. Li, H. Que, B. Wei, Z. Wen, F. Yin, H. Xing, L. Li, J. Shi, K. Ma, S. Li, T. Kergan, A. Smith, X. Qu, M. Hui, B. Wu, Q. Min, H. Huang, X. Zhou, W. Ye, J. Liu, J. Yang, Y. Shi, C. Lin, E. Zhao, T. Cai, G. Zhang, W. Huang, Y. Bengio, and J. Eshraghian (2025a). Scaling latent reasoning via looped language models. arXiv:2510.25741. [Link](https://arxiv.org/abs/2510.25741).
- R. Zhu, Z. Wang, K. Hua, T. Zhang, Z. Li, H. Que, B. Wei, Z. Wen, F. Yin, H. Xing, L. Li, J. Shi, K. Ma, S. Li, T. Kergan, A. Smith, X. Qu, M. Hui, B. Wu, Q. Min, H. Huang, X. Zhou, W. Ye, J. Liu, J. Yang, Y. Shi, C. Lin, E. Zhao, T. Cai, G. Zhang, W. Huang, Y. Bengio, and J. Eshraghian (2025b). Scaling latent reasoning via looped language models. arXiv:2510.25741.
- N. Zucchet and A. Orvieto (2024). Recurrent neural networks: vanishing and exploding gradients are not the end of the story. In Advances in Neural Information Processing Systems.

## Appendix

## Appendix A FLOPs estimation

For dense models, we use the standard training-FLOPs estimate $F = 6ND$, where $N$ is the number of parameters and $D$ is the number of training tokens.

For recurrent models, we account separately for the forward and backward recurrent unrolls. We count $2ND$ for forward computation and $4ND$ for backward computation, then scale these terms by the number of recurrent steps included in each pass.
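The accounting above can be written out as a small calculation. The following is a minimal sketch, not the authors' script: it assumes that $N$ is the parameter count applied at every recurrent step and that each pass is scaled by a single integer step count. The function names and the example step counts are hypothetical and chosen only for illustration.

```python
def dense_train_flops(n_params: float, n_tokens: float) -> float:
    """Standard dense estimate F = 6ND (2ND forward + 4ND backward)."""
    return 6.0 * n_params * n_tokens


def recurrent_train_flops(
    n_params: float,
    n_tokens: float,
    forward_steps: int,
    backward_steps: int,
) -> float:
    """Recurrent estimate: scale each pass by its own number of unrolled steps.

    Assumption: every recurrent step applies all n_params parameters once.
    backward_steps may be smaller than forward_steps when the backward
    pass is truncated.
    """
    forward = 2.0 * n_params * n_tokens * forward_steps
    backward = 4.0 * n_params * n_tokens * backward_steps
    return forward + backward


# Hypothetical example: 1B parameters, 40B tokens, 4 forward steps,
# 2 backpropagated steps (step counts are illustrative only).
example = recurrent_train_flops(1e9, 40e9, forward_steps=4, backward_steps=2)
print(f"{example:.3e} FLOPs")
```

With one forward step and one backward step, the recurrent estimate reduces to the dense $6ND$ formula, which is a convenient sanity check.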

## Appendix B Evaluation details

Table 8: Shared evaluation configuration.

The evaluation prompt contains the original benchmark question and, when required by the benchmark protocol, the corresponding few-shot examples. Few-shot examples are added only for few-shot evaluations. We do not add an additional system prompt. Unless otherwise specified, decoding is deterministic with temperature zero and a maximum context length of 3072 tokens.

Baseline scores are taken from the original papers when those scores are available under comparable settings. When paper-reported numbers are unavailable, we evaluate the corresponding open-weight model directly. All few-shot evaluations are run with the same configuration used for HRM-Text and use the vLLM inference engine. Chain-of-thought evaluations are run with lm_eval_harness.

## Appendix C Stable optimization in recurrent-depth models

### C.1 Gradient Stability Under Deep BPTT in HRM

Backpropagation through time (BPTT) is the canonical mechanism for training recurrent computation graphs, yet extensive prior work has established that propagating gradients through the full unroll is often unnecessary and can be detrimental to optimization. Truncating the backward horizon can instead improve practical convergence by trading exact long-range credit assignment for gradients that behave more favorably as stochastic estimators. This bias–stability tradeoff has been formalized and leveraged in the recurrent literature, motivating both principled truncation schemes and analyses of when and why truncation improves training dynamics [62](https://arxiv.org/html/2605.20613#bib.bib28). In contrast, analogous diagnostics remain underdeveloped for HRM and other modern looped architectures, where repeated application of a shared block induces an implicit recurrence and the effective backward depth is controlled by the number of backpropagated loop iterations.

![Refer to caption](https://arxiv.org/html/2605.20613v1/x6.png) (a) Mean absolute gradient magnitude over training.
![Refer to caption](https://arxiv.org/html/2605.20613v1/x7.png) (b) Log-magnitude dispersion.

Figure 6: (a) Full BPTT exhibits rare but substantially larger gradient-magnitude spikes compared with the truncated setting, suggesting that longer backward horizons introduce intermittent high-amplitude gradient events. (b) Values are normalized within diagnostic checkpoints to isolate the effect of backward depth from global training-time drift. Deeper H cycling increases log-magnitude dispersion.

We hypothesize that the instabilities observed under deep BPTT in looped architectures are a consequence of the intrinsically multiplicative structure of gradient propagation through repeated iterations. Specifically, gradients backpropagate through products of Jacobian-like operators across loop steps, and theory for products of many random matrices predicts that the logarithm of the norm of such products is approximately Gaussian, implying lognormal-like variability in gradient magnitudes and increasing separation between typical and extreme values as backward depth grows [26](https://arxiv.org/html/2605.20613#bib.bib29). Complementing this theoretical perspective, empirical evidence suggests that neural gradient magnitudes are often close to lognormal, consistent with multiplicative mechanisms that concentrate mass near small magnitudes while producing comparatively heavier tails [11](https://arxiv.org/html/2605.20613#bib.bib30), [30](https://arxiv.org/html/2605.20613#bib.bib31).
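The multiplicative picture in this hypothesis can be illustrated with a toy simulation. The sketch below is not drawn from the paper's experiments: it replaces the loop Jacobians with independent Gaussian random matrices (a strong simplifying assumption), pushes a unit vector through a product of `depth` such matrices, and records the log-norm of the result. The function name, matrix dimension, and sample count are all hypothetical.

```python
import numpy as np


def log_norm_of_products(depth, dim=8, n_samples=1000, seed=0):
    """Log-norm of a unit vector after `depth` random linear maps.

    Assumption: i.i.d. Gaussian matrices stand in for loop Jacobians.
    Entries are scaled by 1/sqrt(dim) so that one factor roughly
    preserves the squared norm in expectation.
    """
    rng = np.random.default_rng(seed)
    mats = rng.normal(scale=1.0 / np.sqrt(dim), size=(n_samples, depth, dim, dim))
    v = np.ones((n_samples, dim)) / np.sqrt(dim)  # unit-norm start vectors
    for t in range(depth):
        # Apply the t-th matrix of every sample to that sample's vector.
        v = np.einsum("nij,nj->ni", mats[:, t], v)
    return np.log(np.linalg.norm(v, axis=1))


for depth in (2, 8, 32):
    log_norms = log_norm_of_products(depth)
    print(f"depth={depth:2d}  std of log-norm={log_norms.std():.3f}")
```

In this toy setting the spread of the log-norm grows with depth, which is the qualitative behavior the hypothesis attributes to deeper backward unrolls. It says nothing about the actual Jacobians of a trained HRM.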

To test these hypotheses, we perform a targeted study of gradient dynamics in HRM as we systematically increase the number of backward H and L cycles while holding the forward computation fixed. We first quantify instability using the *mean absolute gradient magnitude* and show in Figure [6](https://arxiv.org/html/2605.20613#A3.F6)a that extending the backward horizon toward full BPTT yields substantially more intermittent high-amplitude gradient events over training. We then characterize distributional heterogeneity using a complementary dispersion measure in Figure [6](https://arxiv.org/html/2605.20613#A3.F6)b, which reports *log-magnitude dispersion*, defined as $\mathrm{Std}(\log(|g| + \varepsilon))$. This measure supports the view that deeper backward cycling increases multiplicative heterogeneity in gradient magnitudes. Notably, the increase in log-magnitude dispersion is driven primarily by the H-cycle depth rather than the L-cycle depth. We therefore interpret the H dimension as the dominant contributor to gradient-magnitude spread in these experiments.
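The two diagnostics named above are simple to compute from a flattened gradient. The following is a minimal sketch under stated assumptions: the statistics are taken over all gradient entries at once, and the value of $\varepsilon$ is a placeholder, since the text does not specify either choice. The function names are hypothetical.

```python
import numpy as np


def mean_abs_grad(grads):
    """Mean absolute gradient magnitude over all entries."""
    g = np.asarray(grads, dtype=float).ravel()
    return float(np.mean(np.abs(g)))


def log_magnitude_dispersion(grads, eps=1e-12):
    """Std(log(|g| + eps)) over all entries.

    Assumption: eps = 1e-12 is an illustrative value, not the paper's.
    """
    g = np.asarray(grads, dtype=float).ravel()
    return float(np.std(np.log(np.abs(g) + eps)))


# Two toy gradients with the same typical scale but different spread.
narrow = np.array([0.5, -1.0, 2.0])
wide = np.array([1e-3, -1.0, 1e3])
print(mean_abs_grad(narrow), log_magnitude_dispersion(narrow))
print(mean_abs_grad(wide), log_magnitude_dispersion(wide))
```

Because the dispersion is computed in log space, it is insensitive to a global rescaling of the gradient and responds only to how widely the magnitudes are spread across orders of magnitude.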

![Refer to caption](https://arxiv.org/html/2605.20613v1/x8.png)

Figure 7: Mechanistic evidence for multiplicative gradient instability. Left: Jacobian growth increases with deeper backward cycling, consistent with stronger amplification through products of loop Jacobians. Right: paired full-vs-truncated gradient magnitudes show that full BPTT produces rare, disproportionately large gradient events at the same diagnostic checkpoints.

Finally, we examine whether the observed gradient spikes are consistent with a multiplicative amplification mechanism. Figure [7](https://arxiv.org/html/2605.20613#A3.F7)a shows that Jacobian growth increases with backward depth, indicating that deeper backpropagation through the recurrent computation amplifies some directions more strongly. Figure [7](https://arxiv.org/html/2605.20613#A3.F7)b provides a paired comparison between the truncated setting used for training and the full-BPTT reference at identical diagnostic checkpoints. The paired scatter shows that full BPTT is often comparable to truncation but occasionally produces much larger gradient magnitudes, consistent with the hypothesis that full backward unrolling primarily harms optimization through rare, high-amplitude tail events rather than through a uniform increase in gradient scale.

The truncation setting used in our experiments is the closest setting to full BPTT that remains stable during training, as illustrated in Figure [6](https://arxiv.org/html/2605.20613#A3.F6)a and Figure [7](https://arxiv.org/html/2605.20613#A3.F7).

### C.2 Gradient stability across recurrent architectures

We further compare HRM against RINs and the Universal Transformer through the lens of gradient stability\. Since all three architectures reuse computation over depth or recurrence, stable gradient dynamics are an important part of whether the architecture can be trained effectively, rather than merely whether it has sufficient expressive capacity\. We therefore evaluate two complementary statistics across runs: the median absolute gradient magnitude and the tail\-to\-median ratio\.
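The two statistics can be sketched as follows. The text does not say which quantile defines the "tail", so the 99th percentile used here is an assumption, as are the function names and the small constant guarding the division.

```python
import numpy as np


def median_abs_grad(grads):
    """Median absolute gradient magnitude over all entries."""
    mags = np.abs(np.asarray(grads, dtype=float)).ravel()
    return float(np.median(mags))


def tail_to_median_ratio(grads, tail_q=99.0, eps=1e-12):
    """Ratio of an upper-quantile magnitude to the median magnitude.

    Assumption: the tail is taken as the 99th percentile of |g|.
    """
    mags = np.abs(np.asarray(grads, dtype=float)).ravel()
    return float(np.percentile(mags, tail_q) / (np.median(mags) + eps))


# Synthetic comparison: light-tailed vs heavy-tailed gradient entries.
rng = np.random.default_rng(0)
light = rng.standard_normal(10_000)
heavy = rng.standard_cauchy(10_000)
print(tail_to_median_ratio(light), tail_to_median_ratio(heavy))
```

A higher median indicates a stronger typical gradient signal, while a lower tail-to-median ratio indicates that rare large entries dominate less. The two are read together for that reason.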

![Refer to caption](https://arxiv.org/html/2605.20613v1/x9.png) (a) Median absolute gradient magnitude across runs.
![Refer to caption](https://arxiv.org/html/2605.20613v1/x10.png) (b) Tail-to-median gradient ratio across runs.

Figure 8: Gradient stability comparison between RINs, HRM, and the Universal Transformer. HRM maintains a strong gradient signal while exhibiting increasingly even gradient dynamics over training, matching the stability of the Universal Transformer more closely than RINs.

Figure [8(a)](https://arxiv.org/html/2605.20613#A3.F8.sf1) shows that HRM and the Universal Transformer maintain a stronger training signal as optimization progresses: their median absolute gradient magnitudes remain higher than those of RINs across training. At the same time, Figure [8(b)](https://arxiv.org/html/2605.20613#A3.F8.sf2) shows that this signal does not come from increasingly unstable or rare extreme updates. Instead, HRM and the Universal Transformer exhibit lower tail-to-median ratios over training, indicating that the gradient distribution becomes more even, less heavy-tailed, and less dominated by rare large updates.

This places HRM in the favorable regime of retaining useful gradient signal while avoiding the instability associated with heavy\-tailed gradient dynamics\. Combined with the main results, this suggests that HRM preserves the training stability of stronger recurrent\-depth baselines while delivering better downstream performance\.

## Appendix D Inference-time analysis

Table 9: Inference-time auto-guidance. We report the base performance of standard inference, and the best performance along with the corresponding guidance scale $w \in \{-0.5, -0.1, 0, 0.1, 0.5\}$.

At inference time, we enable the auto-guidance [38](https://arxiv.org/html/2605.20613#bib.bib80) mechanism specifically designed for HRM, which guides itself by interpolating or extrapolating logits from various recursion depths. While having similar motivations, auto-guidance is more efficient than classifier-free guidance (CFG) [29](https://arxiv.org/html/2605.20613#bib.bib81): it induces zero computation overhead because the hidden representations from shallow loops are already accessible at decoding time.

In particular, suppose we have the final hidden state $h$ and another hidden state $h'$ from an earlier recurrent loop, both decoded by the LM head. Auto-guidance with guidance scale $w$ is calculated as:

$$\text{logits}_w = (1 + w) \cdot \text{logits}(h) - w \cdot \text{logits}(h')$$

Here $w = 0$ recovers the standard final prediction; $w > 0$ corresponds to extrapolation past the final layer and away from the shallower layer, treating the shallower prediction as a negative direction; and $w < 0$ corresponds to interpolation, where the model balances predictions from shallow and deep recurrent states.
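The guidance rule above is a single linear combination of two logit vectors. The sketch below applies it to toy logits over a three-token vocabulary. The function name and the example values are hypothetical, and in a real model both inputs would come from applying the LM head to the final and an earlier recurrent state.

```python
import numpy as np


def auto_guidance_logits(logits_final, logits_shallow, w):
    """Return (1 + w) * logits(h) - w * logits(h').

    w = 0 gives the final prediction, w > 0 extrapolates away from the
    shallow prediction, and w < 0 interpolates between the two.
    """
    logits_final = np.asarray(logits_final, dtype=float)
    logits_shallow = np.asarray(logits_shallow, dtype=float)
    return (1.0 + w) * logits_final - w * logits_shallow


# Toy logits (illustrative values only).
deep = np.array([2.0, 1.0, 0.0])
shallow = np.array([1.0, 1.5, 0.0])

for w in (-0.5, -0.1, 0.0, 0.1, 0.5):
    guided = auto_guidance_logits(deep, shallow, w)
    print(f"w={w:+.1f}  logits={guided}  argmax={int(np.argmax(guided))}")
```

At $w = -0.5$ the result is the plain average of the two logit vectors, and at $w = 0.5$ the gap between the deep and shallow predictions is widened by half.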

[Table 9](https://arxiv.org/html/2605.20613#A4.T9) reports HRM performance with and without auto-guidance, where the guidance scale is searched over $w \in \{-0.5, -0.1, 0, 0.1, 0.5\}$. We use an HRM model with two high-level loops and interpolate or extrapolate the logits from these two $H$ modules. Because the intermediate hidden states are already available, auto-guidance introduces no additional computation. It slightly improves performance at test time, and the best guidance scale varies across benchmarks, suggesting that different tasks may benefit from different effective recurrent depths.

Auto-guidance is also closely related to adaptive computation time (ACT) and test-time scaling (TTS). When interpolation ($w < 0$) yields better results, the task may not require the full recurrent depth, suggesting that early stopping could improve efficiency. Conversely, when extrapolation ($w > 0$) performs better, the task may benefit from deeper recurrent computation and could be a candidate for adaptive test-time scaling. The results in [Table 9](https://arxiv.org/html/2605.20613#A4.T9) therefore suggest that HRM inference can support adaptive control of recurrent depth, balancing efficiency and performance at test time.
