This paper investigates how neural networks maintain high accuracy even when over 90% of input features are corrupted, deriving a centroid-based decision rule in the high-noise limit using a mean-field approach.
This paper introduces MF-Net, a recurrent dynamical model that represents multivariate systems through a shared field state and learns a mechanical transition for joint evolution. It achieves competitive forecasting while enabling interpretable structural readout of learned relations.
The article discusses a deficiency in executive control within transformer attention mechanisms, highlighting limitations in how transformers manage sequential dependencies.
A tweet by Karan (@kmeanskaran) outlines a learning roadmap for balancing ML and AI, covering Python, neural networks, NLP, LLMs, deployment, and agentic AI. A reply from Amit asks for beginner guidance.
This paper introduces the Hierarchical Emergence Framework (HEF), which explains how diverse systems such as neural networks and biological evolution converge to similar internal representations through phase transitions in mechanism landscapes under physical and informational constraints. The framework is validated empirically with 111 grokking experiments that confirm universal convergence and identify a critical energy threshold.
This paper addresses the open question of maximum step size for gradient descent convergence on non-L-smooth objectives, introducing adaptive methods that operate at the edge of stability and can minimize sharpness globally.
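As background for the step-size question, the classical L-smooth case can be checked numerically. This is a minimal sketch of the textbook threshold, not the paper's adaptive method: on the quadratic f(x) = 0.5 * L * x^2, plain gradient descent contracts only when the step size is below 2/L. The function name `gradient_descent_1d` and the values of `L`, `x0`, and `iters` are illustrative assumptions.

```python
def gradient_descent_1d(L: float, step: float, x0: float = 1.0, iters: int = 100) -> float:
    """Run gradient descent on f(x) = 0.5 * L * x**2 and return the final iterate."""
    x = x0
    for _ in range(iters):
        x -= step * L * x  # the gradient of f at x is L * x
    return x


L = 4.0                    # hypothetical curvature (smoothness constant)
threshold = 2.0 / L        # classical stability limit for this quadratic

# Each update multiplies x by (1 - step * L), so |1 - step * L| < 1 is required.
below = abs(gradient_descent_1d(L, 0.9 * threshold))  # factor -0.8: shrinks
above = abs(gradient_descent_1d(L, 1.1 * threshold))  # factor -1.2: grows

print(f"step below 2/L: |x| = {below:.3e}")
print(f"step above 2/L: |x| = {above:.3e}")
```

Objectives that are not L-smooth have no single global constant of this kind, which is why the maximum admissible step size is described as an open question there.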
This paper analyzes precision loss in FP8 attention caused by the attention sink phenomenon when the softmax output is cast to FP8 (E4M3). It shows that forward KV iteration causes non-sink attention values to underflow, and proposes reverse iteration and a static scaling factor S=256 to eliminate the underflow, achieving a 3-10x reduction in MSE.
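The effect of the static scaling factor can be illustrated with a simplified E4M3 rounding model. This is a rough sketch under stated assumptions, not the paper's kernel: it assumes the common E4M3 layout (3 mantissa bits, smallest positive value 2^-9, largest finite value 448), models only the scaling step and not the reverse KV iteration, and the helper `quantize_e4m3` and the probability `p` are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the assumed E4M3 layout


def quantize_e4m3(x: float) -> float:
    """Round a non-negative float to the nearest E4M3 value, saturating at the maximum."""
    if x <= 0.0:
        return 0.0
    if x >= E4M3_MAX:
        return E4M3_MAX
    _, exp = math.frexp(x)          # x = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)            # subnormals share the minimum normal exponent
    step = 2.0 ** (e - 3)           # 3 mantissa bits give 8 steps per binade
    return round(x / step) * step   # Python's round() breaks ties to even


S = 256.0   # static scaling factor from the summary
p = 1e-4    # hypothetical non-sink softmax probability

direct = quantize_e4m3(p)            # cast without scaling
scaled = quantize_e4m3(p * S) / S    # scale, cast, then undo the scale

print(direct)   # 0.0, the value underflows
print(scaled)   # about 9.9e-05, the value survives
```

Because softmax outputs lie in [0, 1], multiplying by 256 keeps the largest value at 256, below the 448 limit, while moving the smallest representable probability from 2^-9 down to 2^-17.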
This article recounts how Geoffrey Hinton persisted in his research for three decades during the AI winter, when neural networks were abandoned by academia. He eventually gained fame with AlexNet in the 2012 ImageNet competition and won the Nobel Prize in Physics in 2024.
The article discusses the surprising robustness of model distillation to the choice of training distribution, even when that distribution overlaps little with the target distribution, and the implications for on-policy and off-policy distillation.
A tweet highlights Chris Potts' talk on how large language models learn linguistic structures, reinforcing the view that LLMs capture syntax and semantics.
This paper argues that transformer architectures are inherently succinct, meaning they can represent certain functions more efficiently than other models. It presents theoretical analysis and proofs.
This post explores DINOv3 vision embeddings by generating images that correspond to specific embedding directions, using gradient optimization and augmentation strategies to invert the model.
This ICML 2026 paper introduces Derivative Informed XC-Loss (DI-Loss), a training approach for machine-learned exchange-correlation functionals that incorporates first and second derivative supervision on the Grassmannian of density matrices. Across four architectures, DI-Loss reduces total-energy MAE by 66% compared to energy and density supervision alone, and improves excited-state predictions in TDDFT calculations.
This paper presents a theoretical framework for deep reinforcement learning in continuous environments, modeling it as a continuous-time stochastic process using stochastic control theory. The authors characterize an actor-critic algorithm's dynamics in the infinite width limit of two-layer networks, deriving an equation for infinitesimal changes in state distribution under a vanishingly small learning rate.
This paper presents AIcon2abs, a methodology combining visual programming and WiSARD weightless neural networks to help general audiences, including children, understand AI concepts through hands-on learning activities. The approach integrates training and classification as first-class programming constructs to make the distinction between learning machines and conventional programs more intuitive.
A creative dialogue explores the idea that large language models are fundamentally just matrices of weights, challenging notions of understanding and sentience.
Curatube is a distraction-free interface for YouTube playlists, designed to help focus on learning. It currently features the Neural Networks: Zero to Hero course by Andrej Karpathy.
This paper theoretically demonstrates that two-layer neural networks trained on group composition tasks learn spectral representations, with neurons converging to irreducible representations and achieving rotational rank-one alignment, providing a representation-theoretic account of feature learning.
This paper presents an exact decomposition of the curvature exponent α in neural network loss landscapes, explaining why it varies across layer types. It introduces the spectral alignment decomposition and derives a spectral transfer identity linking curvature, gradient rank decay, and Hessian exponents, validated across architectures and datasets.
This paper provides a theoretical analysis of how neural networks learn structured representations during group composition tasks, proving that training dynamics drive neurons to converge to irreducible group representations with exponential convergence rates. The work establishes a representation-theoretic account of feature learning and characterizes a low-rank compression phenomenon for matrix-valued group representations.