hyperparameter-transfer

Unlocking Feature Learning in Gated Delta Networks at Scale

arXiv cs.LG · 3d ago

This paper derives scaling rules for Gated Delta Networks using Maximal Update Parametrization (μP), enabling zero-shot hyperparameter transfer across model widths for efficient sub-quadratic LLM architectures. Experiments confirm stable learning-rate transfer under both AdamW and SGD, whereas standard parametrization fails.
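As a rough illustration of what width-wise learning-rate transfer means in practice, the sketch below applies the commonly cited μP rule for Adam-style optimizers: matrix-like hidden weights get a learning rate that shrinks as 1/width, while the embedding learning rate stays fixed. This is an assumption for illustration only, not the rules derived in the paper for Gated Delta Networks. The function name `mup_adam_learning_rates` and the group names are hypothetical.

```python
def mup_adam_learning_rates(base_lr, base_width, width):
    """Illustrative per-group learning rates under a generic muP-style rule.

    Assumption: Adam-style optimizer, hidden (matrix-like) weights scale
    their learning rate as 1/width relative to a tuned base model, and the
    embedding learning rate is left unchanged. Not taken from the paper.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    width_mult = width / base_width  # how much wider the target model is
    return {
        "embedding": base_lr,            # unchanged as width grows
        "hidden": base_lr / width_mult,  # shrinks as 1/width
    }


# Tune base_lr once at width 256, then reuse it at width 1024.
rates = mup_adam_learning_rates(base_lr=1e-3, base_width=256, width=1024)
print(rates)
```

Under this sketch, the learning rate tuned on the small model is reused directly for the wide one, which is the sense in which the transfer is "zero-shot".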

@maximelabonne: Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate (first screenshot, Kalra and Ba…

X AI KOLs Following · 2026-05-22

This paper introduces a framework to quantify hyperparameter transfer in LLMs and finds that, in AdamW training, the benefit of μP over standard parametrization (SP) comes largely from increasing the embedding layer learning rate. It also explores the impact of weight decay and other factors.

GQA-μP: The maximal parameterization update for grouped query attention

arXiv cs.LG · 2026-05-18

This paper extends the maximal update parameterization (μP) framework to grouped-query attention (GQA), deriving scaling laws for hyperparameter transfer across model architectures. It introduces spectral norm conditions for feature learning and addresses issues with low-rank weight matrices in GQA.
