This paper analyzes the precision loss that the attention sink phenomenon causes in FP8 attention when the softmax output is cast to FP8 (E4M3). It shows that forward KV iteration causes non-sink attention values to underflow, and proposes reverse iteration and a static scaling factor S=256 to eliminate the underflow, achieving a 3-10x MSE improvement.
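The underflow mechanism can be illustrated with a small simulation. This is a minimal sketch, not the paper's implementation: it assumes the common E4M3 layout (largest finite value 448, smallest positive subnormal 2^-9), and it assumes the static scale is applied by multiplying each softmax probability by S before the cast and dividing afterwards. The helper names `quantize_e4m3` and `cast_probability` are hypothetical.

```python
import math

E4M3_MAX = 448.0       # largest finite E4M3 value
MIN_NORMAL_EXP = -6    # exponent of the smallest normal value, 2**-6
MANTISSA_BITS = 3      # explicit mantissa bits in E4M3


def quantize_e4m3(x: float) -> float:
    """Round a non-negative float to the nearest E4M3 value (ties to even),
    saturating at 448. Simulation only; no real FP8 type is involved."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    _, e = math.frexp(x)                   # x = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)       # subnormals share the smallest exponent
    step = 2.0 ** (exp - MANTISSA_BITS)    # spacing between neighbouring values
    return min(round(x / step) * step, E4M3_MAX)


def cast_probability(p: float, scale: float = 1.0) -> float:
    """Assumed scheme: scale, cast to E4M3, then undo the scale."""
    return quantize_e4m3(p * scale) / scale


S = 256.0  # static scale from the summary; softmax outputs <= 1 stay below 448
for p in (1.0, 2.0**-9, 2.0**-12, 2.0**-17):
    unscaled = cast_probability(p)
    scaled = cast_probability(p, S)
    print(f"p={p:.3e}  unscaled={unscaled:.3e}  scaled={scaled:.3e}")
```

Under these assumptions, a probability of 2^-12 is flushed to zero without scaling but survives the cast with S=256, because the scale moves the smallest representable probability from 2^-9 down to 2^-17.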
This article recounts how Geoffrey Hinton persisted in his research for three decades during the AI winter, when neural networks were abandoned by academia. He eventually gained fame with AlexNet in the 2012 ImageNet competition and won the Nobel Prize in Physics in 2024.
The article discusses the surprising robustness of model distillation to the choice of training distribution, even when that distribution overlaps little with the target distribution, and the implications for on-policy and off-policy distillation.
A tweet highlights Chris Potts' talk on how large language models learn linguistic structures, reinforcing the view that LLMs capture syntax and semantics.
This paper argues that transformer architectures are inherently succinct, meaning they can represent certain functions more efficiently than other models. It presents theoretical analysis and proofs.
This post explores DINOv3 vision embeddings by generating images that correspond to specific embedding directions, using gradient optimization and augmentation strategies to invert the model.
This ICML 2026 paper introduces Derivative Informed XC-Loss (DI-Loss), a training approach for machine-learned exchange-correlation functionals that incorporates first and second derivative supervision on the Grassmannian of density matrices. Across four architectures, DI-Loss reduces total-energy MAE by 66% compared to energy and density supervision alone, and improves excited-state predictions in TDDFT calculations.
This paper presents a theoretical framework for deep reinforcement learning in continuous environments, modeling it as a continuous-time stochastic process using stochastic control theory. The authors characterize an actor-critic algorithm's dynamics in the infinite width limit of two-layer networks, deriving an equation for infinitesimal changes in state distribution under a vanishingly small learning rate.
This paper presents AIcon2abs, a methodology combining visual programming and WiSARD weightless neural networks to help general audiences, including children, understand AI concepts through hands-on learning activities. The approach integrates training and classification as first-class programming constructs to make the distinction between learning machines and conventional programs more intuitive.
A creative dialogue explores the idea that large language models are fundamentally just matrices of weights, challenging notions of understanding and sentience.
Curatube is a distraction-free interface for YouTube playlists, designed to help focus on learning. It currently features the Neural Networks: Zero to Hero course by Andrej Karpathy.
This paper theoretically demonstrates that two-layer neural networks trained on group composition tasks learn spectral representations, with neurons converging to irreducible representations and achieving rotational rank-one alignment, providing a representation-theoretic account of feature learning.
This paper presents an exact decomposition of the curvature exponent α in neural network loss landscapes, explaining why it varies across layer types. It introduces the spectral alignment decomposition and derives a spectral transfer identity linking curvature, gradient rank decay, and Hessian exponents, validated across architectures and datasets.
This paper provides a theoretical analysis of how neural networks learn structured representations during group composition tasks, proving that training dynamics drive neurons to converge to irreducible group representations with exponential convergence rates. The work establishes a representation-theoretic account of feature learning and characterizes a low-rank compression phenomenon for matrix-valued group representations.
This paper investigates why larger models outperform smaller ones. Using formal analysis and experiments, it attributes the gap to data-induced competition for neural resources.
Stanford CS224N course notes provide a clear introduction to the mathematics of backpropagation and gradient computation in neural networks, covering chain rule, computational graphs, and vectorized derivatives.
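The chain rule at the heart of backpropagation can be checked numerically. The following is a minimal sketch, not taken from the course notes: it assumes a single sigmoid unit with a squared-error loss, and the function names and input values are illustrative only.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loss(w: float, b: float, x: float, y: float) -> float:
    """Squared error of a single sigmoid unit: (sigmoid(w*x + b) - y)**2."""
    return (sigmoid(w * x + b) - y) ** 2


def backprop(w: float, b: float, x: float, y: float) -> tuple[float, float]:
    """Gradients dL/dw and dL/db, built node by node with the chain rule."""
    z = w * x + b                 # forward pass: pre-activation
    a = sigmoid(z)                # forward pass: activation
    dL_da = 2.0 * (a - y)         # derivative of the squared error
    da_dz = a * (1.0 - a)         # derivative of the sigmoid
    dL_dz = dL_da * da_dz         # chain rule through the activation
    return dL_dz * x, dL_dz       # dz/dw = x, dz/db = 1


def numeric_grad(w: float, b: float, x: float, y: float, eps: float = 1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    dw = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
    db = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
    return dw, db


analytic = backprop(0.5, -0.2, 1.5, 1.0)
numeric = numeric_grad(0.5, -0.2, 1.5, 1.0)
print(analytic, numeric)
```

The two gradient pairs should agree to several decimal places; a finite-difference check like this is a common sanity test for a hand-derived backward pass.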
This paper benchmarks five uncertainty quantification methods for neural network predictions of turbine gas temperature, evaluating trade-offs in coverage, width, and stability to guide prognostics and health management in engines.
The Bit-Mass Theory proposes that the total number of weight bits, not the computation format, determines model accuracy. Experiments on MNIST show equivalent performance between binary and floating-point networks at the same bit-mass.
This paper establishes an exact correspondence between neural network training and Hamilton-Jacobi initial-value problems, unifying deep learning architectures through a deformation parameter.
This paper introduces the log-alignment ratio (LAR), a training-time metric that measures parameter-activation alignment and predicts generalization by capturing the spread of weight and activation spectra. Experiments on grokking and a 3B-parameter language model show LAR tracks the transition from memorization to generalization and flags overfitting without held-out data.