Tag
A Python implementation of the Gumbel-Sinkhorn neural network for sorting lists of numbers, based on the 2018 paper by Mena et al.
CODA reparameterizes memory-bound operations in LLM training to fuse them into the matmul epilogue, achieving near state-of-the-art performance with LLM-generated kernels.
This paper challenges the common belief that flat minima cause better generalization in neural networks, arguing that 'weakness', a reparameterization-invariant measure of function simplicity, is the true driver. Empirical results on MNIST and Fashion-MNIST show that weakness predicts generalization while sharpness is anticorrelated with it, and that the large-batch generalization advantage vanishes as training data increases.