@huskydogewoof: My takes and thoughts are as follows (sorry for being verbose, but I hope you will enjoy …


Summary

The author shares thoughts on making convergence a reliable halting signal for iterative weight-tied models, discussing tricks from papers like DEQ, Huginn, Ouro, and EqR, and highlighting the roles of pre-norm and input injection.

Cached at: 06/18/26, 06:10 AM

My takes and thoughts are as follows (sorry for being verbose, but I hope you will enjoy this thread if you are interested in weight-tied / iterative / loop models):

1/ This work shows that, beyond EqR's approach of adding stochasticity, other tricks can also make convergence a reliable halting signal for iterative weight-tied models, indicating a good alignment between the fixed point and the solution.

Actually, at the early stage of the EqR project, I intended to follow this path, which is more spiritually aligned with Deep Equilibrium Models (DEQ): replacing the separate ACT head with convergence-based halting, both during training and inference.
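
To make "convergence-based halting" concrete, here is a minimal NumPy sketch. It is not the EqR or DEQ implementation: the toy update `step`, the tolerance `tol`, the cap `max_iters`, and the relative-residual criterion are all assumptions picked for illustration. The loop stops as soon as consecutive states stop changing, instead of asking a separate learned halting head.

```python
import numpy as np

def iterate_until_converged(step, z0, tol=1e-4, max_iters=64):
    """Apply a weight-tied update until the relative residual drops below tol.

    Returns (state, iterations_used, converged_flag).
    """
    z = z0
    for k in range(1, max_iters + 1):
        z_next = step(z)
        # Relative change between consecutive iterates (hypothetical halting rule).
        residual = np.linalg.norm(z_next - z) / (np.linalg.norm(z_next) + 1e-8)
        z = z_next
        if residual < tol:
            return z, k, True
    return z, max_iters, False

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W *= 0.5 / np.linalg.norm(W, 2)   # rescale to spectral norm 0.5 so the toy map contracts
x = rng.standard_normal(8)        # stands in for the input embedding

def step(z):
    # Toy weight-tied "block": the same W and x are reused at every iteration.
    return np.tanh(W @ z + x)

z_star, n_iters, converged = iterate_until_converged(step, np.zeros(8))
print(n_iters, converged)
```

The toy map is a contraction by construction, so halting is guaranteed here; a real looped transformer block has no such guarantee, which is where the stabilizing tricks come in.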

However, my preliminary results did not show positive signals, so I eventually gave up on that direction. Now, after looking at the tricks used in this paper:

a. pre-norm instead of post-norm,
b. residual scaling and damping to stabilize the recurrent dynamics,
c. input mixing / conditioning preservation across iterations,

it seems that, although somewhat complicated, using convergence for halting is not impossible. It just requires many tricks to improve contraction and convergence.
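
The three tricks (a)-(c) can be combined in a toy update like the one below. The thread does not give the paper's exact formulation, so treat this as one possible reading: the function `looped_step`, the damping factor `alpha`, the residual scale `beta`, and the parameter-free `layer_norm` are all hypothetical names and values.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Parameter-free layer norm over a single feature vector."""
    return (h - h.mean()) / np.sqrt(h.var() + eps)

def looped_step(z, x, W, alpha=0.5, beta=0.1):
    """One damped, pre-normalized update with the input mixed back in."""
    # (a) pre-norm: only the branch input is normalized, not the carried state.
    branch = np.tanh(W @ layer_norm(z))
    # (b) residual scaling via beta, (c) input mixing by re-adding x every step.
    proposal = x + beta * branch
    # (b) damping: move only a fraction alpha of the way toward the proposal.
    return (1.0 - alpha) * z + alpha * proposal

rng = np.random.default_rng(0)
d = 16
W = rng.standard_normal((d, d))
W /= np.linalg.norm(W, 2)   # spectral norm 1
x = rng.standard_normal(d)  # stands in for the input embedding

z = np.zeros(d)
residuals = []
for _ in range(100):
    z_next = looped_step(z, x, W)
    residuals.append(np.linalg.norm(z_next - z))
    z = z_next
print(residuals[0], residuals[-1])
```

In this toy setting a small `beta` and an `alpha` below 1 both shrink the Lipschitz constant of the update, which is the sense in which such tricks "improve contraction".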

Among those tricks, I like the discussion around pre-norm and post-norm the most.

As pointed out in the Huginn and DEQ papers, input injection is important for weight-tied models [*]. However, Ouro from ByteDance does not explicitly perform input injection, yet it still works reasonably well.

Why? My interpretation is simple: it uses a pre-norm-like design (not standard pre-norm, but one where the residual can be passed across layers more directly), which already helps preserve the conditioning signal from the input.

People use post-norm to fight against the known instability of weight-tied models with loops, but that may not be the best choice.
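
For reference, the textbook forms of the two update rules for a looped block can be written as below. These are the standard definitions, not Ouro's actual design (which is only described here as pre-norm-like); the symbols $z_k$ (state after $k$ loops), $f$ (the shared block), and $\mathrm{LN}$ (layer norm) are notation assumed for this sketch.

```latex
\begin{align*}
\text{pre-norm:}  \quad & z_{k+1} = z_k + f(\mathrm{LN}(z_k))
  \;\Longrightarrow\; z_K = z_0 + \sum_{k=0}^{K-1} f(\mathrm{LN}(z_k)), \\
\text{post-norm:} \quad & z_{k+1} = \mathrm{LN}\bigl(z_k + f(z_k)\bigr).
\end{align*}
```

In the pre-norm form the input embedding $z_0$ survives as an additive term after any number of loops, whereas post-norm re-normalizes the whole state, input included, at every iteration.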

==============================

DEQ: https://arxiv.org/abs/1909.01377
Huginn: https://arxiv.org/abs/2502.05171
Ouro: https://ouro-llm.github.io
EqR: https://arxiv.org/abs/2605.21488

[*Why is input injection important?] A fixed point corresponds to the infinite-depth limit of an iterative weight-tied model. For such a fixed point to be useful, the conditioning signal from the input must be preserved throughout these infinite iterations; otherwise, the dynamics may converge to an input-agnostic attractor.
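
A small numerical illustration of this footnote, under toy assumptions (the 4-dimensional tanh map, the matrix `W`, the bias `b`, and the two inputs `x_a` and `x_b` are all made up for the example): when the input only sets the initial state of a contractive loop, two different inputs end at the same attractor; when the input is re-injected at every iteration, the fixed points stay input-dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)  # spectral norm 0.5, so both maps below contract
b = rng.standard_normal(d)        # fixed bias shared by every input

def run(step, z0, n_iters=100):
    """Apply the same update n_iters times (a finite proxy for infinite depth)."""
    z = z0
    for _ in range(n_iters):
        z = step(z)
    return z

x_a = 2.0 * np.ones(d)
x_b = -2.0 * np.ones(d)

def step_no_injection(z):
    # The input is absent from the update; it can only enter through z0.
    return np.tanh(W @ z + b)

def make_step_with_injection(x):
    # The input x is added back at every iteration.
    return lambda z: np.tanh(W @ z + x)

# Without injection: the two inputs are used only as initial states.
gap_without = np.linalg.norm(run(step_no_injection, x_a) - run(step_no_injection, x_b))

# With injection: same initial state, different injected inputs.
z0 = np.zeros(d)
gap_with = np.linalg.norm(
    run(make_step_with_injection(x_a), z0) - run(make_step_with_injection(x_b), z0)
)
print(gap_without, gap_with)
```

The first gap collapses to numerical zero because a contraction has a unique fixed point regardless of where it starts, which is exactly the input-agnostic attractor described above.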

More below


Deep Equilibrium Models

Source: https://arxiv.org/abs/1909.01377 View PDF (https://arxiv.org/pdf/1909.01377)

Abstract: We present a new approach to modeling sequential data: the deep equilibrium model (DEQ). Motivated by an observation that the hidden layers of many existing deep sequence models converge towards some fixed point, we propose the DEQ approach that directly finds these equilibrium points via root-finding. Such a method is equivalent to running an infinite depth (weight-tied) feedforward network, but has the notable advantage that we can analytically backpropagate through the equilibrium point using implicit differentiation. Using this approach, training and prediction in these networks require only constant memory, regardless of the effective “depth” of the network. We demonstrate how DEQs can be applied to two state-of-the-art deep sequence models: self-attention transformers and trellis networks. On large-scale language modeling tasks, such as the WikiText-103 benchmark, we show that DEQs 1) often improve performance over these state-of-the-art models (for similar parameter counts); 2) have similar computational requirements to existing models; and 3) vastly reduce memory consumption (often the bottleneck for training large sequence models), demonstrating an up-to 88% memory reduction in our experiments. The code is available at https://github.com/locuslab/deq.

Submission history

From: Shaojie Bai
[v1] Tue, 3 Sep 2019 18:02:50 UTC (721 KB)
[v2] Mon, 28 Oct 2019 22:25:01 UTC (720 KB)

Francesco Bertolotti (@f14bertolotti): This TRM variant makes a transformer block a contractive map, so that looping becomes a fixed-point process. They leverage this by approximating gradients with Neumann series (Truncated BPTT). Very cool work!
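
The link between a contractive map and a Neumann-series gradient can be written out as follows. This is the standard implicit-differentiation identity, not a transcription of the paper's derivation; the symbols $f_\theta$, $z^*$, $J$, and the truncation length $K$ are notation assumed here.

```latex
\begin{align*}
z^* &= f_\theta(z^*, x), \qquad
J = \left.\frac{\partial f_\theta}{\partial z}\right|_{z = z^*}, \\
\frac{\partial z^*}{\partial \theta}
  &= (I - J)^{-1}\,\frac{\partial f_\theta}{\partial \theta}
   = \sum_{k=0}^{\infty} J^{k}\,\frac{\partial f_\theta}{\partial \theta}
   \;\approx\; \sum_{k=0}^{K-1} J^{k}\,\frac{\partial f_\theta}{\partial \theta}.
\end{align*}
```

The series converges when the spectral radius of $J$ is below 1, which a contractive block guarantees. Keeping the first $K$ terms matches backpropagating through only the last $K$ loop iterations at the fixed point, hence the "truncated BPTT" reading.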
