@huskydogewoof: My takes and thoughts are as follows (sorry for being verbose, but I hope you will enjoy …
Summary
The author shares thoughts on making convergence a reliable halting signal for iterative weight-tied models, discussing tricks from papers like DEQ, Huggin, Ouro, and EqR, and highlighting the roles of pre-norm and input injection.
Cached at: 06/18/26, 06:10 AM
My takes and thoughts are as follows (sorry for being verbose, but I hope you will enjoy this thread if you are interested in weight-tied / iterative / loop models):
1/ This work shows that, beyond EqR's approach of adding stochasticity, other tricks can also make convergence a reliable halting signal for iterative weight-tied models, which indicates good alignment between the fixed point and the solution.
In fact, in the early stages of the EqR project, I intended to follow this path, which is closer in spirit to Deep Equilibrium Models (DEQ): replacing the separate ACT head with convergence-based halting during both training and inference.
However, my preliminary results did not show positive signals, so I eventually gave up on that direction. Now, after looking at the tricks used in this paper:
a. pre-norm instead of post-norm,
b. residual scaling and damping to stabilize the recurrent dynamics,
c. input mixing / conditioning preservation across iterations,
it seems that, although somewhat complicated, using convergence for halting is not impossible. It just requires many tricks to improve contraction and convergence.
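To make the three tricks concrete, here is a minimal toy sketch in plain Python, not the paper's implementation. Everything in it is my own assumption for illustration: the names `rms_norm`, `block`, and `run_until_converged`, the tanh block, rescaling `W` as a stand-in for residual scaling, and the unusually large `eps=1.0`, chosen only so the toy is provably contractive.

```python
import math
import random

def rms_norm(z, eps=1.0):
    # Assumption: eps=1.0 is far larger than usual; it keeps this map
    # 1-Lipschitz so the toy loop below is guaranteed to contract.
    scale = math.sqrt(sum(v * v for v in z) / len(z) + eps)
    return [v / scale for v in z]

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def block(z, x, W, U):
    # Pre-norm style: normalize the block's input, not its output.
    # U @ x re-injects the input at every iteration (input injection).
    a = matvec(W, rms_norm(z))
    b = matvec(U, x)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

def run_until_converged(x, W, U, damping=0.5, tol=1e-6, max_iters=200):
    z = [0.0] * len(x)
    for step in range(1, max_iters + 1):
        f = block(z, x, W, U)
        # Damped update: move only part of the way toward the block output.
        z_next = [(1.0 - damping) * zi + damping * fi for zi, fi in zip(z, f)]
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_next, z)))
        if delta < tol:
            # Convergence itself is the halting signal; no separate ACT head.
            return z_next, step, True
        z = z_next
    return z, max_iters, False

random.seed(0)
dim = 8
W = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
# Rescale W so its Frobenius norm (an upper bound on its spectral norm) is 0.5.
frob = math.sqrt(sum(w * w for row in W for w in row))
W = [[0.5 * w / frob for w in row] for row in W]
U = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
x = [random.gauss(0, 1) for _ in range(dim)]

z_star, steps, converged = run_until_converged(x, W, U)
print(steps, converged)
```

In this toy the contraction is guaranteed by construction; in a real looped transformer it is not, which is presumably why several tricks have to be stacked.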
Among those tricks, I like the discussion around pre-norm and post-norm the most.
As pointed out in the Huggin and DEQ papers, input injection is important for weight-tied models [*]. However, Ouro from ByteDance does not explicitly perform input injection, yet it still works reasonably well.
Why? My interpretation is simple: it uses a pre-norm-like design (not standard pre-norm, but one where the residual can be passed across layers more directly), which already helps preserve the conditioning signal from the input.
People use post-norm to fight against the known instability of weight-tied models with loops, but that may not be the best choice.
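To make the pre-norm vs post-norm contrast concrete, here are the two textbook update rules for a looped block $F$ with normalization $\mathrm{Norm}$. The symbols are mine, and this is standard pre-norm, not Ouro's exact variant, so treat it as a sketch of the intuition only:

```latex
\begin{aligned}
\text{post-norm:}\quad & z_{k+1} = \mathrm{Norm}\big(z_k + F(z_k)\big) \\
\text{pre-norm:}\quad  & z_{k+1} = z_k + F\big(\mathrm{Norm}(z_k)\big)
  \;\;\Longrightarrow\;\; z_K = z_0 + \sum_{k=0}^{K-1} F\big(\mathrm{Norm}(z_k)\big)
\end{aligned}
```

In the pre-norm form, $z_0$ (the input embedding) survives as an identity term after any number of loops, while post-norm re-normalizes the whole sum at every step.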
==============================
DEQ: https://arxiv.org/abs/1909.01377
Huggin: https://arxiv.org/abs/2502.05171
Ouro: https://ouro-llm.github.io
EqR: https://arxiv.org/abs/2605.21488
[*Why is input injection important?] A fixed point corresponds to the infinite-depth limit of an iterative weight-tied model. For such a fixed point to be useful, the conditioning signal from the input must be preserved throughout these infinite iterations; otherwise, the dynamics may converge to an input-agnostic attractor.
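A scalar toy makes the footnote easy to check. This is only an illustrative sketch under my own assumptions (the `iterate` helper, the tanh map, and `w=0.5` are made up for the example): without injection the input only sets the initial state, and a contractive loop forgets it.

```python
import math

def iterate(x, inject, w=0.5, steps=200):
    z = x  # the input sets the initial state in both variants
    for _ in range(steps):
        # With injection, x is fed back in at every iteration;
        # without it, the loop only ever sees its own state.
        z = math.tanh(w * z + (x if inject else 0.0))
    return z

for x in (-1.0, 0.5, 2.0):
    print(x, iterate(x, inject=False), iterate(x, inject=True))
```

Without injection every input collapses to the same attractor at zero, while with injection the fixed point keeps the sign and scale information of `x`.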
More below
Deep Equilibrium Models
Source: https://arxiv.org/abs/1909.01377 View PDF (https://arxiv.org/pdf/1909.01377)
Abstract: We present a new approach to modeling sequential data: the deep equilibrium model (DEQ). Motivated by an observation that the hidden layers of many existing deep sequence models converge towards some fixed point, we propose the DEQ approach that directly finds these equilibrium points via root-finding. Such a method is equivalent to running an infinite depth (weight-tied) feedforward network, but has the notable advantage that we can analytically backpropagate through the equilibrium point using implicit differentiation. Using this approach, training and prediction in these networks require only constant memory, regardless of the effective "depth" of the network. We demonstrate how DEQs can be applied to two state-of-the-art deep sequence models: self-attention transformers and trellis networks. On large-scale language modeling tasks, such as the WikiText-103 benchmark, we show that DEQs 1) often improve performance over these state-of-the-art models (for similar parameter counts); 2) have similar computational requirements to existing models; and 3) vastly reduce memory consumption (often the bottleneck for training large sequence models), demonstrating an up-to 88% memory reduction in our experiments. The code is available at https://github.com/locuslab/deq.
Submission history
From: Shaojie Bai
[v1] Tue, 3 Sep 2019 18:02:50 UTC (721 KB)
[v2] Mon, 28 Oct 2019 22:25:01 UTC (720 KB)
Francesco Bertolotti (@f14bertolotti): This TRM variant makes a transformer block a contractive map, so that looping becomes a fixed-point process. They leverage this by approximating gradients with Neumann series (Truncated BPTT). Very cool work!
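For readers unfamiliar with the Neumann-series trick, here is the generic identity it rests on, written as a sketch. The notation is assumed, not taken from the paper: the loop is $z_{k+1} = f_\theta(z_k, x)$, $J$ is its Jacobian at the fixed point with spectral radius below 1, and $K$ is a hypothetical truncation depth.

```latex
\begin{aligned}
z^\star &= f_\theta(z^\star, x), \qquad
J = \left.\frac{\partial f_\theta}{\partial z}\right|_{z^\star} \\
\frac{\partial z^\star}{\partial \theta}
  &= (I - J)^{-1} \left.\frac{\partial f_\theta}{\partial \theta}\right|_{z^\star} \\
(I - J)^{-1} &= \sum_{k=0}^{\infty} J^{k} \;\approx\; \sum_{k=0}^{K} J^{k}
  \qquad \text{for } \rho(J) < 1
\end{aligned}
```

Keeping only the first $K$ terms corresponds to backpropagating through the last few iterations at the fixed point, which is why a contractive block makes the truncation error small.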