Weight normalization: A simple reparameterization to accelerate training of deep neural networks

OpenAI Blog Papers

Summary

OpenAI presents weight normalization, a reparameterization technique that decouples weight vector length from direction to improve neural network training convergence and computational efficiency without introducing minibatch dependencies, making it suitable for RNNs and noise-sensitive applications.


Cached at: 04/20/26, 02:45 PM

# Weight normalization: A simple reparameterization to accelerate training of deep neural networks

Source: [https://openai.com/index/weight-normalization/](https://openai.com/index/weight-normalization/)

OpenAI

## Abstract

We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time. We demonstrate the usefulness of our method on applications in supervised image recognition, generative modelling, and deep reinforcement learning.
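
The abstract describes the reparameterization only in words. As a rough illustration, the sketch below assumes the usual form w = g · v / ‖v‖, where the scalar `g` carries the length of the weight vector and `v` carries only its direction. The function names `weight_norm` and `weight_norm_grads` are hypothetical and chosen for this example; it is written in plain Python for a single weight vector and is not the authors' implementation.

```python
import math


def weight_norm(v, g):
    """Build w = g * v / ||v||, so that ||w|| equals |g| whatever the scale of v."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction vector v must be non-zero")
    return [g * x / norm for x in v]


def weight_norm_grads(v, g, grad_w):
    """Map a gradient with respect to w onto the parameters (v, g).

    Obtained by applying the chain rule to w = g * v / ||v||.
    """
    norm = math.sqrt(sum(x * x for x in v))
    # dL/dg = (grad_w . v) / ||v||
    grad_g = sum(gw * x for gw, x in zip(grad_w, v)) / norm
    # dL/dv = (g / ||v||) * grad_w - (g * dL/dg / ||v||^2) * v
    grad_v = [
        (g / norm) * gw - (g * grad_g / norm**2) * x
        for gw, x in zip(grad_w, v)
    ]
    return grad_v, grad_g


if __name__ == "__main__":
    v, g = [3.0, 4.0], 2.0
    w = weight_norm(v, g)
    print("w =", w)
    print("||w|| =", math.sqrt(sum(x * x for x in w)))
    print("grads (v, g) =", weight_norm_grads(v, g, grad_w=[1.0, 0.0]))
```

Two properties of this form are easy to check by hand: rescaling `v` leaves `w` unchanged, and the gradient with respect to `v` is orthogonal to `v`, so a gradient step on `v` changes the direction of `w` while `g` alone controls its length. Nothing in the computation depends on other examples in a minibatch, which is the property the abstract contrasts with batch normalization.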

Similar Articles

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors | Alexander Hägele

Reddit r/LocalLLaMA

This blog post introduces Magnitude-Direction (MD) Decoupling, a method that separates neural network weight matrices into direction and magnitude components optimized with separate learning rates. Experiments show improved performance across Adam and Muon optimizers, automatic learning rate transfer across model widths, and scaling benefits in large Mixture-of-Experts models.

Unified Neural Scaling Laws

Hugging Face Daily Papers

Presents a unified neural scaling law that accurately models deep neural network scaling across multiple dimensions including parameters, dataset size, training steps, and compute, validated across diverse architectures and tasks.

Learning sparse neural networks through L₀ regularization

OpenAI Blog

OpenAI proposes a practical L₀ regularization method for neural networks that encourages weights to become exactly zero during training, enabling network pruning for improved speed and generalization. The method uses stochastic gates and introduces the hard concrete distribution to make the non-differentiable L₀ norm optimization tractable via gradient descent.