residual-scaling

On the Residual Scaling of Looped Transformers: Stability and Transferability

arXiv cs.LG · 2026-06-18

This paper analyzes residual scaling in looped (weight-tied) transformers, showing that weight sharing requires stronger scaling (1/N) than standard residual networks, and derives a factored parameterization that enables hyperparameter transfer across loop counts without retuning.
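The scaling claim in this summary can be made concrete with a toy linear model. The sketch below is an illustration only, not the paper's analysis or its factored parameterization: it assumes N is the loop count, replaces the transformer block with a random linear map, and takes 1/sqrt(N) as the "standard" residual scaling for comparison, a common choice that this summary does not itself specify. All function names are hypothetical. With a tied weight W, N residual steps compose to (I + αW)^N, which stays bounded in N for α = 1/N (it approaches exp(W)) but can grow rapidly for α = 1/sqrt(N). With independent weights per layer, 1/sqrt(N) is already enough to keep the norm bounded in this toy setting.

```python
import numpy as np


def tied_norm(W, x, n_loops, alpha):
    """Norm after n_loops residual steps h <- h + alpha * W @ h with one shared W."""
    h = x.copy()
    for _ in range(n_loops):
        h = h + alpha * (W @ h)
    return float(np.linalg.norm(h))


def untied_norm(rng, x, n_layers, alpha):
    """Same update, but every layer draws its own independent weight matrix."""
    dim = x.shape[0]
    h = x.copy()
    for _ in range(n_layers):
        W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        h = h + alpha * (W @ h)
    return float(np.linalg.norm(h))


def growth_table(dim=32, counts=(4, 16, 64, 256), seed=0):
    """Rows of (N, tied 1/N, tied 1/sqrt(N), untied 1/sqrt(N)) output norms."""
    rng = np.random.default_rng(seed)
    # Assumption: a Gaussian linear map stands in for the transformer block.
    W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    x = rng.standard_normal(dim)
    x /= np.linalg.norm(x)  # unit-norm input
    rows = []
    for n in counts:
        rows.append((
            n,
            tied_norm(W, x, n, 1.0 / n),
            tied_norm(W, x, n, 1.0 / np.sqrt(n)),
            untied_norm(rng, x, n, 1.0 / np.sqrt(n)),
        ))
    return rows


if __name__ == "__main__":
    for n, tied_inv_n, tied_inv_sqrt, untied_inv_sqrt in growth_table():
        print(f"N={n:4d}  tied 1/N: {tied_inv_n:8.3f}  "
              f"tied 1/sqrt(N): {tied_inv_sqrt:12.3e}  "
              f"untied 1/sqrt(N): {untied_inv_sqrt:8.3f}")
```

Running the script prints one row per loop count. In this toy model the tied 1/N column should stay of order one as N grows, while the tied 1/sqrt(N) column is expected to grow quickly, which matches the qualitative statement that weight sharing calls for stronger scaling.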
