OpenAI researchers present methods for training sparse neural networks that are easier to interpret. Forcing most weights to zero makes it possible to discover small, disentangled circuits that explain model behavior while the model maintains its performance. The work aims to advance mechanistic interpretability as a complement to post-hoc analysis of dense networks and to support AI safety goals.
OpenAI proposes a practical L₀ regularization method for neural networks that encourages weights to become exactly zero during training, enabling network pruning for improved speed and generalization. The method attaches a stochastic gate to each weight and introduces the hard concrete distribution, which makes optimization of the non-differentiable L₀ norm tractable via gradient descent.
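
The gating idea in the L₀ summary can be illustrated with a minimal NumPy sketch. It assumes the usual formulation of the hard concrete distribution: a logistic sample is stretched to an interval slightly wider than [0, 1] and then clipped, so a gate can land exactly on 0 or 1. The constants `GAMMA`, `ZETA`, and `BETA`, the function names, and the example `log_alpha` values are illustrative assumptions, not taken from the summary above.

```python
import numpy as np

# Assumed stretch interval (GAMMA, ZETA) and temperature BETA; illustrative values.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_gates(log_alpha, rng, eps=1e-6):
    """Draw one hard concrete gate per entry of log_alpha (hypothetical helper)."""
    # Uniform noise, kept away from 0 and 1 so the logs stay finite.
    u = rng.uniform(eps, 1.0 - eps, size=log_alpha.shape)
    # Logistic (binary concrete) sample in (0, 1), shifted by log_alpha.
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / BETA)
    # Stretch to (GAMMA, ZETA), then clip so mass piles up exactly at 0 and 1.
    s_bar = s * (ZETA - GAMMA) + GAMMA
    return np.clip(s_bar, 0.0, 1.0)


def expected_l0(log_alpha):
    """Differentiable surrogate for the L0 penalty: sum of P(gate != 0)."""
    return sigmoid(log_alpha - BETA * np.log(-GAMMA / ZETA)).sum()


rng = np.random.default_rng(0)
log_alpha = np.array([-4.0, 0.0, 4.0])  # per-weight gate parameters (made up)
weights = np.array([0.5, -1.2, 2.0])    # example weights (made up)

z = sample_gates(log_alpha, rng)
sparse_weights = weights * z            # gated weights used in the forward pass
penalty = expected_l0(log_alpha)        # would be added to the training loss
```

In this sketch, a strongly negative `log_alpha` drives its gate to exactly zero in most samples, which is what allows the corresponding weight to be pruned. The `penalty` term is smooth in `log_alpha`, so it can be minimized with gradient descent alongside the task loss.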