Weight normalization: A simple reparameterization to accelerate training of deep neural networks

OpenAI Blog · Papers

Summary

OpenAI presents weight normalization, a reparameterization technique that decouples weight vector length from direction to improve neural network training convergence and computational efficiency without introducing minibatch dependencies, making it suitable for RNNs and noise-sensitive applications.


# Weight normalization: A simple reparameterization to accelerate training of deep neural networks

Source: [https://openai.com/index/weight-normalization/](https://openai.com/index/weight-normalization/) (OpenAI)

## Abstract

We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time. We demonstrate the usefulness of our method on applications in supervised image recognition, generative modelling, and deep reinforcement learning.
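As a rough illustration of the reparameterization described in the abstract, the sketch below writes a weight vector w as a scalar gain g times a direction v / ||v||, which is the decomposition weight normalization trains over; the gradient expressions follow from applying the chain rule to that decomposition. Function and variable names here are illustrative, not taken from the paper's code.

```python
import numpy as np

def weight_norm_forward(v, g, x, b):
    """Compute y = w.x + b with w reparameterized as w = g * v / ||v||.

    v : unnormalized direction vector, shape (d,)
    g : scalar gain, trained separately from the direction
    x : input vector, shape (d,)
    b : scalar bias
    """
    w = g * v / np.linalg.norm(v)   # decouple length (g) from direction (v / ||v||)
    return np.dot(w, x) + b

def weight_norm_backward(v, g, grad_w):
    """Map a gradient w.r.t. w onto the new parameters (g, v) via the chain rule."""
    norm_v = np.linalg.norm(v)
    grad_g = np.dot(grad_w, v) / norm_v
    grad_v = (g / norm_v) * grad_w - (g * grad_g / norm_v ** 2) * v
    return grad_g, grad_v
```

In practice one would not hand-code the backward pass; frameworks such as PyTorch expose this reparameterization directly (e.g. `torch.nn.utils.weight_norm`), with automatic differentiation handling the gradients.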

Similar Articles

Learning sparse neural networks through L₀ regularization

OpenAI Blog

OpenAI proposes a practical L₀ regularization method for neural networks that encourages weights to become exactly zero during training, enabling network pruning for improved speed and generalization. The method uses stochastic gates and introduces the hard concrete distribution to make the non-differentiable L₀ norm optimization tractable via gradient descent.
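As a hedged sketch of the stochastic-gate idea mentioned in that summary: the hard concrete distribution stretches a binary concrete sample beyond [0, 1] and then clips it, producing gates that can be exactly zero (pruning a weight) while remaining differentiable in their parameters. The constants and helper names below follow the common presentation of the method and are illustrative, not the authors' code.

```python
import numpy as np

# Typical stretch limits and temperature used in the hard concrete formulation
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hard_concrete(log_alpha, rng):
    """Draw gates z in [0, 1] that can be exactly 0 or 1 for each parameter."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / BETA)  # binary concrete sample
    s_stretched = s * (ZETA - GAMMA) + GAMMA                        # stretch past [0, 1]
    return np.clip(s_stretched, 0.0, 1.0)                           # "hard" clipping to [0, 1]

def expected_l0_penalty(log_alpha):
    """Differentiable surrogate for the expected number of non-zero gates."""
    return np.sum(sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA)))
```

The penalty term is what replaces the non-differentiable L₀ norm: its gradient with respect to `log_alpha` pushes gates toward exactly zero during training.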

Understanding neural networks through sparse circuits

OpenAI Blog

OpenAI researchers present methods for training sparse neural networks that are easier to interpret by forcing most weights to zero, enabling the discovery of small, disentangled circuits that can explain model behavior while maintaining performance. This work aims to advance mechanistic interpretability as a complement to post-hoc analysis of dense networks and support AI safety goals.

Are Flat Minima an Illusion?

arXiv cs.LG

This paper challenges the common belief that flat minima cause better generalization in neural networks, arguing that 'weakness'—a reparameterization-invariant measure of function simplicity—is the true driver. Empirical results on MNIST and Fashion-MNIST show that weakness predicts generalization while sharpness anticorrelates, and the large-batch generalization advantage vanishes as training data increases.

How AI training scales

OpenAI Blog

OpenAI researchers discovered that the gradient noise scale, a simple statistical metric, predicts the parallelizability of neural network training across a wide range of tasks. They found that more complex tasks and more powerful models tolerate larger batch sizes, suggesting future AI systems can scale further through increased parallelization.
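For intuition about the metric referenced above, the "simple" gradient noise scale can be estimated as the trace of the per-example gradient covariance divided by the squared norm of the mean gradient; a larger value suggests that larger batches remain useful. The naive estimator below is a sketch under that assumed form of the statistic, not the blog post's code, and is only practical for small models where per-example gradients fit in memory.

```python
import numpy as np

def simple_noise_scale(per_example_grads):
    """Estimate B_simple = trace(Sigma) / |G|^2 from per-example gradients.

    per_example_grads : array of shape (n_examples, n_params)
    """
    g_mean = per_example_grads.mean(axis=0)            # estimate of the true gradient G
    variance = per_example_grads.var(axis=0, ddof=1)   # per-parameter gradient variance
    return variance.sum() / np.dot(g_mean, g_mean)     # trace(Sigma) / |G|^2
```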