Tag
This paper derives scaling rules for Gated Delta Networks using Maximal Update Parametrization (μP), enabling zero-shot hyperparameter transfer across model widths for efficient sub-quadratic LLM architectures. Experiments confirm stable learning-rate transfer under both AdamW and SGD with μP, whereas the learning rate fails to transfer under standard parametrization.
This paper introduces a framework to quantify hyperparameter transfer in LLMs and finds that the benefit of Maximal Update Parametrization (μP) over standard parametrization (SP) in AdamW training largely comes from increasing the embedding-layer learning rate. It also explores the impact of weight decay and other factors.
This paper extends the Maximal Update Parametrization (μP) framework to grouped-query attention (GQA), deriving scaling laws for hyperparameter transfer across model architectures. It introduces spectral norm conditions for feature learning and addresses issues with low-rank weight matrices in GQA.