delta-networks


Unlocking Feature Learning in Gated Delta Networks at Scale

arXiv cs.LG · 3d ago

This paper derives scaling rules for Gated Delta Networks using Maximal Update Parametrization (μP), enabling zero-shot hyperparameter transfer across model widths for efficient sub-quadratic LLM architectures. Experiments confirm stable learning-rate transfer under both AdamW and SGD, whereas transfer fails under standard parametrization.
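To make "zero-shot hyperparameter transfer across model widths" concrete, here is a minimal sketch of the width-scaling idea for an ordinary dense hidden layer trained with an Adam-style optimizer. It assumes the commonly cited μP convention that the hidden-matrix learning rate scales as 1/width and the initialization standard deviation as 1/sqrt(width). The function names and the base width are hypothetical, and this is not the Gated Delta Network rule set derived in the paper, which may differ per parameter group.

```python
import math


def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Scale a learning rate tuned at base_width to a new width.

    Assumption (not taken from the paper): Adam-style optimizer,
    hidden weight matrices, learning rate proportional to 1/width.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / width


def mup_hidden_init_std(base_std: float, base_width: int, width: int) -> float:
    """Scale an init standard deviation tuned at base_width to a new width.

    Assumption: std proportional to 1/sqrt(fan_in), with fan_in == width.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_std * math.sqrt(base_width / width)


# Hypothetical example: hyperparameters tuned on a small proxy model
# of width 256, then reused without retuning at width 1024.
lr_wide = mup_hidden_lr(base_lr=1e-3, base_width=256, width=1024)
std_wide = mup_hidden_init_std(base_std=0.02, base_width=256, width=1024)
print(f"lr={lr_wide:.2e}, init_std={std_wide:.3f}")
```

Under these assumptions a 4x wider model uses a 4x smaller hidden learning rate and a 2x smaller initialization scale, so the value tuned on the small model carries over instead of being searched again at the larger width.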
