Tag
This paper presents a large-scale empirical study of the Derivative Regularization (DREG) penalty, showing that it achieves high accuracy and noise robustness, particularly with GELU activation and in data-scarce regimes, and positioning it as a general-purpose plug-and-play regularizer for neural networks.
Explains the mathematical concepts of gradient, Jacobian, and Hessian as fundamental tools in AI model training, describing how they measure change and what roles they play in optimization.
This paper performs full Jacobian eigendecomposition across production-scale LLMs, revealing a learned spectral gradient from rotation-dominated early layers to symmetric late layers, along with a low-rank bottleneck that compresses perturbations. The results link perturbation propagation and compression to network functional topology.