@askalphaxiv: Another cool research on Looped Transformers They ask the question: "Can we loop a frozen, off-the-shelf checkpoint dir…


Summary

This research introduces a technique for looping frozen, off-the-shelf transformer checkpoints at inference time: transformer layers are treated as Euler steps in a residual ODE, and naive loops are replaced with damped Runge–Kutta substeps. This gives the model extra latent compute without fine-tuning, architecture changes, or new weights, with the best gains on hard multiple-choice knowledge tasks like MMLU-Pro, GPQA, and ARC.


Cached at: 05/27/26, 03:18 AM

Another cool piece of research on Looped Transformers

They ask the question: “Can we loop a frozen, off-the-shelf checkpoint directly at inference time without any modifications?”

Naive repetition pushes hidden states outside the distribution that later layers expect, so performance drops.

But if you treat transformer layers as Euler steps in a residual ODE and replace naive loops with damped Runge–Kutta substeps, it is possible.
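To make the ODE view concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation. A residual layer h + f(h) is read as one Euler step of size 1 for dh/dt = f(h), so repeating it k times integrates for total time k, while k substeps of size damping / k keep the total time at damping. The names `naive_loop`, `damped_rk2_loop`, `toy_branch`, and the `damping` parameter are all hypothetical, and the midpoint (RK2) rule is just one Runge–Kutta scheme picked for illustration; the post does not say which scheme or damping rule the authors use.

```python
import math
from typing import Callable, List

Vector = List[float]
Branch = Callable[[Vector], Vector]  # residual branch f(h) of a frozen layer


def axpy(a: float, x: Vector, y: Vector) -> Vector:
    """Return a * x + y elementwise."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def layer(f: Branch, h: Vector) -> Vector:
    """Ordinary residual layer h + f(h): one Euler step of size 1."""
    return axpy(1.0, f(h), h)


def naive_loop(f: Branch, h: Vector, k: int) -> Vector:
    """Apply the same layer k times, i.e. integrate for total time k."""
    for _ in range(k):
        h = layer(f, h)
    return h


def damped_rk2_loop(f: Branch, h: Vector, k: int, damping: float = 1.0) -> Vector:
    """k midpoint (RK2) substeps of size damping / k; total time = damping.

    Assumption: "damped" is modeled here as shrinking the step size.
    """
    dt = damping / k
    for _ in range(k):
        k1 = f(h)                      # slope at the current state
        k2 = f(axpy(0.5 * dt, k1, h))  # slope at the midpoint
        h = axpy(dt, k2, h)            # advance using the midpoint slope
    return h


def toy_branch(h: Vector) -> Vector:
    """Hypothetical stand-in for a frozen block: tanh of a fixed 2x2 map."""
    w = [[-0.6, 0.2], [0.1, -0.4]]
    return [math.tanh(sum(wij * hj for wij, hj in zip(row, h))) for row in w]


if __name__ == "__main__":
    h0 = [1.0, -2.0]
    print("single pass:", layer(toy_branch, h0))
    print("naive x4:   ", naive_loop(toy_branch, h0, 4))
    print("damped RK2: ", damped_rk2_loop(toy_branch, h0, 4))
```

In this sketch each RK2 substep calls the branch twice, so k substeps cost 2k evaluations of the same frozen weights. That is where the extra test-time compute comes from, and the total integration time stays at `damping` rather than growing with k.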

This lets frozen models get extra latent compute at test time with no fine-tuning, no new weights, and no architecture changes.

And the best gains show up on hard multiple-choice knowledge tasks like MMLU-Pro, GPQA, and ARC.

Similar Articles

Simply Stabilizing the Loop via Fully Looped Transformer

arXiv cs.LG

This paper identifies gradient oscillation and residual explosion as causes of training instability in Looped Transformers, and proposes Fully Looped Transformer with two parameter-free modifications (Fully Looped Architecture and Attention Injection) to stabilize training up to 12 loop iterations, achieving up to 13.2% improvement in downstream performance.

Skip a Layer or Loop It? Learning Program-of-Layers in LLMs

Hugging Face Daily Papers

This paper introduces PoLar, a framework that learns input-specific execution programs for frozen transformer layers, allowing layers to be skipped, kept, or repeated. It improves accuracy and reduces inference overhead compared to fixed-depth methods.

LoopQ: Quantization for Recursive Transformers

arXiv cs.LG

LoopQ is a post-training quantization framework for looped language models that addresses distribution shift, state reuse, and error accumulation. It achieves 68.8% average accuracy improvement under 4-bit weights and activations.

LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

Hugging Face Daily Papers

LoopUS is a post-training framework that converts pretrained LLMs into looped architectures for improved reasoning performance via latent-refinement and adaptive early exiting. It addresses computational costs and capability preservation issues found in existing looped computation methods.