Tutorial on the loop transformer architecture (rumored to be the major Mythos improvement; 19 minutes)

Reddit r/singularity 07/01/26, 07:13 PM Papers

Summary

Looped Transformers build recursion directly into the architecture so that reasoning happens in the hidden states, avoiding the inefficiency of chain-of-thought, which must simulate iteration by generating discrete tokens. Recent research shows they perform well on multi-hop reasoning and can be further improved through stabilization techniques and adaptive recursion.


Original Article


Cached at: 07/01/26, 08:15 PM

TL;DR: Recurrent Transformers implement internal reasoning by building recursion directly into the architecture, avoiding the inefficiency of chain-of-thought, which must simulate iteration by generating discrete tokens. Recent research shows they excel at multi-hop reasoning and can be further improved with stabilization techniques and adaptive recurrence.

## From Chain-of-Thought to Recurrent Reasoning

Large language models often think by generating a long string of tokens before outputting a result. This "chain-of-thought" method has proven extremely effective and has become standard for nearly all contemporary reasoning models. However, it is far from elegant: the model must repeatedly decompress its internal state into text, append it to the context, and re-embed it into hidden states. Additionally, the reasoning process is constrained by sampling discrete tokens instead of directly optimizing the computation itself.

This naturally raises the question: can recursion be built into the architecture itself, allowing the model to update hidden states directly rather than simulating iteration through external tokens?

The "Recurrent Transformer," which suddenly gained attention in April 2026, is the embodiment of this idea. Its core is extremely simple: a standard Transformer uses a long sequence of distinct layers, while a Recurrent Transformer takes just a few layers to form a block, then runs that block repeatedly, with the hidden states output from the previous round used directly as input to the next. In this way, the model can refine its representations over multiple internal steps without being forced to compress each step into a token.

## Multi-Hop Reasoning and Generalization

To test reasoning ability, researchers often use multi-hop reasoning as a benchmark. Take "Who is the wife of the 44th President of the United States?" as an example: the model must first identify that the 44th President is Barack Obama, then connect him to Michelle Obama.
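
The two-hop question above can be read as one shared step applied twice. The following is a symbolic analogy only, assuming a hypothetical lookup table `FACTS` and hypothetical helpers `hop` and `run_loops`; a real Recurrent Transformer performs this refinement on hidden states, not on strings.

```python
# Toy knowledge base: (entity, relation) -> entity.
# The two facts come from the example question; the table itself is hypothetical.
FACTS = {
    ("United States", "44th president"): "Barack Obama",
    ("Barack Obama", "wife"): "Michelle Obama",
}


def hop(state):
    """One shared reasoning step: resolve the next pending relation, if any."""
    entity, relations = state
    if not relations:
        return state  # nothing left to resolve, so the state no longer changes
    head, *rest = relations
    return FACTS[(entity, head)], tuple(rest)


def run_loops(query, n_loops):
    """Apply the same step n_loops times, feeding each output back in."""
    state = query
    for _ in range(n_loops):
        state = hop(state)
    return state


query = ("United States", ("44th president", "wife"))
for n in (1, 2, 3):
    print(n, run_loops(query, n))
```

With one loop the query is only half resolved; with two it reaches the answer, and further loops leave the state unchanged. The loop count plays the role of the compute budget.
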
A standard Transformer must complete all hops in a single forward pass, and it has no iterative mechanism to fall back on as the number of hops grows. Chain-of-thought provides this iteration through external tokens, but at the cost of a decompression-compression loop. Recurrent Transformers differ: the number of recurrences becomes an adjustable "compute knob," so the model can be given more iterations to think longer, with the expectation of better results.

In the paper *Loop, Think, and Generalize*, published in April 2026, the authors used this setup and found that the model not only performed well on hop counts seen during training but could also handle multi-hop reasoning beyond the depth seen during training when given more recurrence iterations at inference time. More importantly, they observed three phases during training:

1. **Memorization phase**: training set accuracy rises and saturates.
2. **In-distribution generalization phase**: in-distribution test set performance peaks, indicating the model can handle new problems that follow the same structure.
3. **Systematic generalization phase**: the model can combine knowledge fragments in ways never seen during training, achieving out-of-distribution generalization.

This indicates that Recurrent Transformers learn a reusable process of composition and inference, something standard Transformers struggle to achieve.

## Stability: The Contribution of PARC

However, directly introducing recursion into the architecture brings stability issues. The same hidden state is repeatedly subjected to the same transformation, so small deviations can be amplified, errors accumulate, and the model may drift away from a stable trajectory or even diverge.

Another paper from April 2026, *PARC*, explicitly models recurrence as a dynamical system on the residual stream and analyzes the root of the instability. The residual stream is the main channel through which hidden states accumulate information across multiple recurrences.
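
To make the dynamical-system view concrete, here is a minimal sketch of a weight-shared residual update applied repeatedly while tracking the norm of the hidden state. Everything in it is an assumption for illustration: the linear map `W`, the helper `loop_block`, the parameter `max_update_norm`, and the norm cap itself, which is a generic way of constraining updates and not PARC's actual method.

```python
import math


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def norm(h):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in h))


def loop_block(h, W, n_loops, max_update_norm=None):
    """Apply the same residual update h <- h + W h for n_loops rounds.

    If max_update_norm is given, each update is rescaled so its norm
    does not exceed that value (an illustrative constraint only).
    Returns the final state and the state norm after every round.
    """
    norms = [norm(h)]
    for _ in range(n_loops):
        update = matvec(W, h)  # same weights on every round
        if max_update_norm is not None:
            size = norm(update)
            if size > max_update_norm:
                update = [x * max_update_norm / size for x in update]
        h = [a + b for a, b in zip(h, update)]  # residual accumulation
        norms.append(norm(h))
    return h, norms


W = [[0.3, 0.1], [0.0, 0.3]]  # toy weights, chosen so the plain loop expands
h0 = [1.0, 1.0]
_, plain = loop_block(h0, W, n_loops=20)
_, capped = loop_block(h0, W, n_loops=20, max_update_norm=0.1)
print(f"plain:  {plain[0]:.2f} -> {plain[-1]:.2f}")
print(f"capped: {capped[0]:.2f} -> {capped[-1]:.2f}")
```

With this `W`, the uncapped state norm grows geometrically across rounds, while the capped run can grow by at most `max_update_norm` per round.
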
If the dynamics are unstable, the norm of the hidden state grows, leading to loss spikes. PARC's solution stabilizes the recurrence by constraining and normalizing the recurrent updates, preventing uncontrolled accumulation.

## Mechanism Analysis: Is the Model Truly Reasoning?

The paper *Mechanistic Analysis of Recurrent Reasoning Models* (also published in April 2026) answers this question by tracking latent states across multiple recurrences. Since hidden states are extremely high-dimensional, the researchers used PCA (principal component analysis) to project them onto 2D/3D maps. They found that recurrent models tend to move toward stable trajectories: fixed points or cyclic paths. As hidden states stabilize, the behavior of attention heads also becomes consistent, indicating that the internal computations performed by the block become increasingly stable and predictable.

Further analysis revealed that even with weight sharing, early, middle, and late recurrences can play distinct roles:

- **Early recurrences**: construct a rough representation of the problem, collect and organize relevant information, with updates that are large and exploratory.
- **Middle recurrences**: combine pieces of information and propagate relationships; updates become more structured.
- **Late recurrences**: updates diminish, the model stabilizes its representation and converges toward the final answer.

Recurrence is not merely repeating the same computation. Because the input differs at every step, the same function behaves differently each time, naturally forming a progression from rough understanding to refined solution.

## Mixture-of-Recursions (MoR): Adaptive Computation Allocation

Another issue with Recurrent Transformers is that every token undergoes the same number of recurrences, regardless of its complexity. The paper *Mixture-of-Recursions* (July 2025) aims to break this uniform allocation.
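
One way to see why a uniform loop count can be wasteful is to look at how quickly updates shrink, as the mechanistic analysis reports for late recurrences. The sketch below assumes a toy contractive update that pulls the state toward a fixed point; the update rule, `rate`, `target`, and the function name `loops_until_settled` are all illustrative assumptions and are not taken from any of the papers.

```python
import math


def loops_until_settled(h, target, rate=0.5, tol=1e-3, max_loops=100):
    """Iterate h <- h + rate * (target - h) until the update is smaller than tol.

    Returns the number of loops actually applied and the size of each update.
    """
    sizes = []
    for _ in range(max_loops):
        step = [rate * (t - x) for t, x in zip(target, h)]
        size = math.sqrt(sum(s * s for s in step))
        if size < tol:
            break  # the state has effectively stopped moving
        h = [x + s for x, s in zip(h, step)]
        sizes.append(size)
    return len(sizes), sizes


target = [1.0, -1.0]  # hypothetical fixed point of the toy update
easy_loops, easy_sizes = loops_until_settled([0.99, -1.0], target)
hard_loops, hard_sizes = loops_until_settled([9.0, 5.0], target)
print("easy token loops:", easy_loops)
print("hard token loops:", hard_loops)
```

In this toy, each update is half the size of the previous one, so a state that starts near the fixed point settles after a few loops while a distant one needs many more. That per-token variation is what adaptive recurrence tries to exploit.
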
MoR introduces a router that decides how many recurrences each token actually needs. The paper explores two main approaches:

- **Token-choice routing**: predicts the number of steps each token should take before the recurrence loop begins. The computational path is fixed, but this heavily depends on the correctness of the initial prediction and cannot be adjusted later.
- **Expert-choice routing**: makes decisions step-by-step; at each iteration, the model can choose to continue or exit. This is more flexible and can adapt to evolving representations, but training and stabilization are more complex.

Under the same three-layer recurrence setting, expert-choice routing achieved an average few-shot accuracy of 42.6%, while token-choice routing dropped to 40%. This suggests that committing to a full computation path in advance is a handicap: it is difficult to accurately judge the required amount of thinking before iteration begins.

## KV Cache Optimization

Once recurrence becomes adaptive, the KV cache bottleneck immediately becomes apparent. In a standard Transformer, each layer stores key-value tensors for every token, consuming significant memory and bandwidth during long-context decoding. A naive recurrent model makes this worse because, even with parameter sharing, it must maintain separate KV caches for each recurrence depth. The MoR paper changed how the KV cache works to accommodate adaptive recurrence, thereby reducing memory traffic, a key enabler for further efficiency gains.

## Summary

Recurrent Transformers provide a more elegant architectural-level solution for test-time computation. From naive recurrence to stability control, and from adaptive allocation to cache optimization, this direction is rapidly maturing. It is not merely a replacement for chain-of-thought but could become the foundational design for the next generation of reasoning models, and some suspect that Claude 3 uses a similar technique under the hood.

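
The two routing strategies described in the adaptive-computation section can be contrasted in a small sketch. It follows the article's description (decide the depth up front versus decide at every step) and is not the MoR implementation; the scalar "state", the halving `step`, the threshold in `keep_going`, and the constant `predict_depth` are all hypothetical stand-ins.

```python
def upfront_routing(states, predict_depth, step):
    """Commit to a loop count per token before the loop starts."""
    results = []
    for h in states:
        depth = predict_depth(h)  # decided once, from the initial state only
        for _ in range(depth):
            h = step(h)
        results.append((h, depth))
    return results


def per_step_routing(states, keep_going, step, max_loops):
    """Re-decide after every loop whether the token continues or exits."""
    results = []
    for h in states:
        used = 0
        while used < max_loops and keep_going(h):
            h = step(h)
            used += 1
        results.append((h, used))
    return results


# Toy setup: the state is a scalar "remaining difficulty" that each loop halves.
def step(h):
    return h / 2


def keep_going(h):
    return h > 0.1  # hypothetical exit threshold


def predict_depth(h):
    return 2  # deliberately crude predictor: same guess for every token


tokens = [0.15, 3.0]  # one easy token, one hard token
fixed = upfront_routing(tokens, predict_depth, step)
adaptive = per_step_routing(tokens, keep_going, step, max_loops=8)
print("upfront: ", fixed)
print("per-step:", adaptive)
```

Here the per-step router spends one loop on the easy token and five on the hard one, whereas the upfront guess spends two on both. The predictor is intentionally weak, so this only illustrates the failure mode the article describes and says nothing about how large the gap is in practice.
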
**Source**: Tutorial on the loop transformer architecture (rumored to be the major Mythos improvement; 19 minutes) (https://www.youtube.com/watch?v=nYwid6Q5HXk)


Similar Articles

@ZhihuFrontier: Half a year ago, a Zhihu contributor predicted that the next Transformer would absorb loops, recurrent state, sparse ro…

@DorothyDDU: LoopCoder-v2 is out Loop Transformers reuse the same block for recurrent hidden-state refinement — letting models “thin…

@retr0sushi_: looped transformer -> hyper-looped transformer -> looped world model ??

Simply Stabilizing the Loop via Fully Looped Transformer

@askalphaxiv: Another cool research on Looped Transformers They ask the question: "Can we loop a frozen, off-the-shelf checkpoint dir…
