Weight normalization: A simple reparameterization to accelerate training of deep neural networks
Summary
OpenAI presents weight normalization, a reparameterization technique that decouples the length of each weight vector from its direction. This improves training convergence and computational efficiency, and because it introduces no dependencies between the examples in a minibatch, it is suitable for RNNs and noise-sensitive applications.
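The reparameterization summarized above can be illustrated with a minimal sketch. It assumes the usual form w = g * v / ||v||, where the scalar g carries the length and the vector v carries the direction. The function names (`weight_norm`, `weight_norm_grads`) and the toy values are hypothetical, chosen only for illustration, and this is not the authors' implementation.

```python
import math

def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def weight_norm(v, g):
    """Effective weights w = g * v / ||v||; the length of w is |g| for any nonzero v."""
    n = norm(v)
    return [g * x / n for x in v]

def weight_norm_grads(v, g, grad_w):
    """Map a gradient with respect to w back to gradients with respect to g and v."""
    n = norm(v)
    # Gradient for the length parameter: projection of grad_w onto the unit direction.
    grad_g = sum(gw * x for gw, x in zip(grad_w, v)) / n
    # Gradient for the direction parameter: grad_w scaled by g/||v||,
    # with its component along v removed.
    grad_v = [(g / n) * gw - (g * grad_g / n**2) * x for gw, x in zip(grad_w, v)]
    return grad_g, grad_v

# Toy values (assumed, not from the source).
v = [3.0, 4.0]   # direction parameter
g = 2.0          # length parameter
w = weight_norm(v, g)
grad_g, grad_v = weight_norm_grads(v, g, [1.0, 0.0])
```

In this sketch, rescaling `v` leaves `w` unchanged, and `grad_v` is orthogonal to `v`, so a gradient step on `v` only changes the direction of the effective weights. Nothing here depends on other examples in a minibatch.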
Similar Articles
Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors | Alexander Hägele
This blog post introduces Magnitude-Direction (MD) Decoupling, a method that separates neural network weight matrices into direction and magnitude components optimized with separate learning rates. Experiments show improved performance across Adam and Muon optimizers, automatic learning rate transfer across model widths, and scaling benefits in large Mixture-of-Experts models.
Mitigating the Curse of Dimensionality in Uniform Convergence of Deep Neural Networks via Smooth Activations
This paper establishes a theoretical framework showing that smooth activations in deep neural networks can mitigate the curse of dimensionality in uniform convergence, providing non-asymptotic guarantees and outperforming ReLU networks in worst-case reliability.
Unified Neural Scaling Laws
Presents a unified neural scaling law that accurately models deep neural network scaling across multiple dimensions including parameters, dataset size, training steps, and compute, validated across diverse architectures and tasks.
Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models
This paper systematically studies scale vectors in LLM normalization layers, showing they optimize training through a self-amplifying preconditioning effect, and proposes three lightweight improvements that enhance performance and scaling behavior with negligible overhead.
Learning sparse neural networks through L₀ regularization
OpenAI proposes a practical L₀ regularization method for neural networks that encourages weights to become exactly zero during training, enabling network pruning for improved speed and generalization. The method uses stochastic gates and introduces the hard concrete distribution to make the non-differentiable L₀ norm optimization tractable via gradient descent.