@huskydogewoof: My takes and thoughts are as follows (sorry for being verbose, but I hope you will enjoy …

X AI KOLs Timeline 06/17/26, 07:37 PM News

weight-tied-models iterative-models deep-equilibrium-models convergence pre-norm post-norm input-injection

Summary

The author shares thoughts on making convergence a reliable halting signal for iterative weight-tied models, discussing tricks from papers like DEQ, Huginn, Ouro, and EqR, and highlighting the roles of pre-norm and input injection.



Cached at: 06/18/26, 06:10 AM

My takes and thoughts are as follows (sorry for being verbose, but I hope you will enjoy this thread if you are interested in weight-tied / iterative / loop models):

1/ This work shows that, beyond EqR’s approach of adding stochasticity, other tricks can also make convergence a reliable halting signal for iterative weight-tied models, indicating a good alignment between the fixed point and the solution.

Actually, at the early stage of the EqR project, I intended to follow this path, which is more spiritually aligned with Deep Equilibrium Models (DEQ): replacing the separate ACT head with convergence-based halting, both during training and inference.
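For readers unfamiliar with the idea, here is a minimal sketch of convergence-based halting: loop a weight-tied step and stop once the relative change in the state falls below a tolerance. The function name `iterate_until_converged`, the toy `make_step` block, and the values of `tol` and `max_iters` are all illustrative assumptions, not taken from EqR or any of the cited papers.

```python
import math

def iterate_until_converged(step, z0, tol=1e-4, max_iters=64):
    """Apply `step` repeatedly; halt when the relative update is below `tol`.

    Returns (state, iterations_used, converged_flag).
    """
    z = list(z0)
    for k in range(1, max_iters + 1):
        z_next = step(z)
        # Euclidean norm of the update and of the new state.
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_next, z)))
        norm = math.sqrt(sum(a * a for a in z_next))
        z = z_next
        if diff <= tol * (norm + 1e-12):
            return z, k, True   # convergence is the halting signal
    return z, max_iters, False  # iteration budget exhausted


def make_step(x):
    # Hypothetical toy "block": a contraction whose fixed point depends on x.
    return lambda z: [0.5 * math.tanh(zi) + xi for zi, xi in zip(z, x)]


z_star, steps, converged = iterate_until_converged(make_step([1.0, -2.0]), [0.0, 0.0])
```

The number of iterations used then varies per input, which is the role a separate ACT head would otherwise play.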

However, my preliminary results did not show positive signals, so I eventually gave up on that direction. Now, after looking at the tricks used in this paper:

a. pre-norm instead of post-norm,
b. residual scaling and damping to stabilize the recurrent dynamics,
c. input mixing / conditioning preservation across iterations,

it seems that, although somewhat complicated, using convergence for halting is not impossible. It just requires many tricks to improve contraction and convergence.
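As a rough illustration of why damping helps contraction, consider a scalar toy problem. The `block` function, its slope of -1.5, and the damping factor `alpha` are hypothetical choices for this sketch and are not the settings used in the paper under discussion.

```python
def block(z, x):
    # Hypothetical toy block with input injection. The slope of -1.5 means
    # the plain iteration z <- block(z, x) is not a contraction.
    return -1.5 * z + x


def run(x, alpha, iters=30, z0=0.0):
    """Damped iteration: z <- (1 - alpha) * z + alpha * block(z, x)."""
    z = z0
    for _ in range(iters):
        z = (1.0 - alpha) * z + alpha * block(z, x)
    return z


x = 5.0
undamped = run(x, alpha=1.0)  # alpha = 1 is the plain loop; it blows up
damped = run(x, alpha=0.5)    # effective slope is -0.25; settles near x / 2.5
```

Both loops share the same fixed point, z = x / 2.5, but only the damped one reaches it, because damping shrinks the effective slope from -1.5 to (1 - alpha) + alpha * (-1.5) = -0.25.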

Among those tricks, I like the discussion around pre-norm and post-norm the most.

As pointed out in the Huginn and DEQ papers, input injection is important for weight-tied models [*]. However, Ouro from ByteDance does not explicitly perform input injection, yet it still works reasonably well.

Why? My interpretation is simple: it uses a pre-norm-like design (not standard pre-norm, but one where the residual can be passed across layers more directly), which already helps preserve the conditioning signal from the input.

People use post-norm to fight against the known instability of weight-tied models with loops, but that may not be the best choice.
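One simplified way to see the difference is to write out the two looped updates. The notation is an assumption for illustration and does not come from the cited papers: f_θ is a single weight-tied block, Norm is the normalization layer, and z_0 is the input embedding.

```latex
\[
\text{post-norm:}\quad z_{k+1} = \mathrm{Norm}\big(z_k + f_\theta(z_k)\big)
\qquad
\text{pre-norm:}\quad z_{k+1} = z_k + f_\theta\big(\mathrm{Norm}(z_k)\big)
\]
\[
\text{pre-norm, unrolled:}\quad z_K = z_0 + \sum_{k=0}^{K-1} f_\theta\big(\mathrm{Norm}(z_k)\big)
\]
```

In the pre-norm form, z_0 survives as an explicit additive term after any number of loops. In the post-norm form it is re-normalized together with the update at every step. This is only a simplified picture of standard pre-norm, which is not exactly Ouro's design.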

==============================

DEQ: https://arxiv.org/abs/1909.01377
Huginn: https://arxiv.org/abs/2502.05171
Ouro: https://ouro-llm.github.io
EqR: https://arxiv.org/abs/2605.21488

[*Why is input injection important?] A fixed point corresponds to the infinite-depth limit of an iterative weight-tied model. For such a fixed point to be useful, the conditioning signal from the input must be preserved throughout these infinite iterations; otherwise, the dynamics may converge to an input-agnostic attractor.
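A toy demonstration of the footnote, assuming a hypothetical scalar contraction in place of a real block: when the input enters only through the initial state, every input ends up at the same attractor. When the input is re-injected at each step, the fixed point depends on it.

```python
import math

def fixed_point(step, z0, iters=200):
    """Run a fixed number of iterations of `step` starting from z0."""
    z = z0
    for _ in range(iters):
        z = step(z)
    return z


def no_injection(z):
    # The input only enters through the initial state z0.
    return 0.5 * math.tanh(z)


def with_injection(x):
    # The input x is re-added at every iteration.
    return lambda z: 0.5 * math.tanh(z) + x


inputs = [-2.0, 0.5, 3.0]
without = [fixed_point(no_injection, z0=x) for x in inputs]
with_x = [fixed_point(with_injection(x), z0=x) for x in inputs]
```

Here `without` collapses to (numerically) zero for all three inputs, while `with_x` holds three distinct fixed points.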

More below

Deep Equilibrium Models

Source: https://arxiv.org/abs/1909.01377 View PDF (https://arxiv.org/pdf/1909.01377)

Abstract: We present a new approach to modeling sequential data: the deep equilibrium model (DEQ). Motivated by an observation that the hidden layers of many existing deep sequence models converge towards some fixed point, we propose the DEQ approach that directly finds these equilibrium points via root-finding. Such a method is equivalent to running an infinite depth (weight-tied) feedforward network, but has the notable advantage that we can analytically backpropagate through the equilibrium point using implicit differentiation. Using this approach, training and prediction in these networks require only constant memory, regardless of the effective “depth” of the network. We demonstrate how DEQs can be applied to two state-of-the-art deep sequence models: self-attention transformers and trellis networks. On large-scale language modeling tasks, such as the WikiText-103 benchmark, we show that DEQs 1) often improve performance over these state-of-the-art models (for similar parameter counts); 2) have similar computational requirements to existing models; and 3) vastly reduce memory consumption (often the bottleneck for training large sequence models), demonstrating an up-to 88% memory reduction in our experiments. The code is available at https://github.com/locuslab/deq.
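The implicit differentiation mentioned in the abstract can be summarized as follows. The notation is an assumption for this sketch: z* is the equilibrium state, x the injected input, L the loss, and J_f the Jacobian of the block with respect to its state, evaluated at z*.

```latex
\[
z^\star = f_\theta(z^\star, x),
\qquad
J_f = \left.\frac{\partial f_\theta(z, x)}{\partial z}\right|_{z = z^\star}
\]
\[
\frac{\partial L}{\partial \theta}
= \frac{\partial L}{\partial z^\star}\,
\big(I - J_f\big)^{-1}\,
\frac{\partial f_\theta(z^\star, x)}{\partial \theta}
\]
```

Only the equilibrium z* appears in the gradient, not the intermediate iterates, which is why memory does not grow with the number of iterations.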

Submission history

From: Shaojie Bai
[v1] Tue, 3 Sep 2019 18:02:50 UTC (721 KB)
[v2] Mon, 28 Oct 2019 22:25:01 UTC (720 KB)

Francesco Bertolotti (@f14bertolotti): This TRM variant makes a transformer block a contractive map, so that looping becomes a fixed-point process. They leverage this by approximating gradients with Neumann series (Truncated BPTT). Very cool work!
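A minimal sketch of the Neumann-series idea, assuming the block is contractive so that its Jacobian J at the fixed point has spectral radius below 1. In that case inv(I - J) equals I + J + J^2 + ..., and backpropagating through only the last K iterations corresponds to keeping the first K terms. The 2x2 Jacobian, the vector `v`, and the helper names are hypothetical and not taken from the paper.

```python
def vec_mat(v, J):
    """Row vector times matrix (v @ J) for plain Python lists."""
    n = len(J[0])
    return [sum(v[i] * J[i][j] for i in range(len(v))) for j in range(n)]


def neumann_vjp(v, J, num_terms):
    """Approximate v @ inv(I - J) by v @ (I + J + ... + J^(num_terms - 1))."""
    total = list(v)  # k = 0 term: v @ I
    term = list(v)
    for _ in range(num_terms - 1):
        term = vec_mat(term, J)  # next power: v @ J^k
        total = [t + s for t, s in zip(total, term)]
    return total


# Hypothetical Jacobian of a contractive block at its fixed point.
J = [[0.2, 0.1],
     [0.0, 0.3]]
v = [1.0, 1.0]  # stand-in for dL/dz*

approx = neumann_vjp(v, J, num_terms=30)
truncated = neumann_vjp(v, J, num_terms=2)  # cheaper, less accurate
```

The truncation error shrinks geometrically with the number of terms, at a rate set by how contractive the block is. This is one reason the contraction tricks above matter for training as well as for halting.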
