Tag
This paper introduces the log-alignment ratio (LAR), a training-time metric that measures parameter-activation alignment and predicts generalization by capturing the spread of weight and activation spectra. Experiments on grokking and a 3B-parameter language model show LAR tracks the transition from memorization to generalization and flags overfitting without held-out data.
This paper investigates why larger models outperform smaller ones, attributing the gap to reduced gradient interference and better resource allocation, which let larger models learn rare and complex tasks even with infinite data. Experiments on synthetic data and OLMo models verify that larger models avoid overwriting rare-task features due to weaker gradient updates for common tasks.
This paper develops a PAC-Bayesian framework for physics-informed machine learning, providing high-probability generalization guarantees for unbounded losses. It proposes a multi-task perspective that jointly handles data fidelity, PDE residuals, and boundary conditions, and introduces a self-bounding learning algorithm.
This paper identifies neural network training as a search through Hamilton-Jacobi initial-value problems, showing that residual networks, transformers, and RNNs discretize the same class of viscous Hamilton-Jacobi equations. It derives quantitative consequences including minimax-optimal generalization rates, adversarial robustness bounds, and a closed-form influence function.
An annotated version of a paper showing that a simple neural network with just two neurons can control a bicycle, highlighting minimal requirements for stable locomotion.
Proposes a verification-based algorithm to compute provable bounds on exact SHAP values for neural networks, scaling to much larger search spaces than prior exact methods.
This paper introduces a bifurcation theory of representation dynamics to detect when neural networks acquire structured representations during training, using a Hessian analysis of a GMM probe. The resulting ratio β/β_c serves as a label-free phase coordinate that predicts the onset of usable structure and can forecast feature interpretability in sparse autoencoders early in training.
ai-by-hand-excel is an open-source collection of Excel workbooks that teach AI concepts like neural networks, backpropagation, and transformers by letting users inspect the math cell by cell, making model internals more intuitive.
This paper introduces a novel task, transitive inference with exceptions, and analytically characterizes how neural network models (kernel ridge regression) balance relational generalization and memorization. The theory is validated in pretrained language models, showing systematic mistakes predicted by the theory.
This thread explains the intuition behind the Jacobian matrix and its widespread applications in AI and machine learning, including backpropagation, normalizing flows, computer vision, and robotics.
Figure AI's F.03 humanoid robots, powered by the Helix-02 neural network, autonomously sorted 249,560 packages over 200 hours without hardware failure, approaching human-level efficiency.
A Chinese article that organizes and translates 20 hand-drawn AI illustrations created by @sairahul1, covering core concepts from neural networks to agents, suitable for beginners to systematically understand the AI technology stack.
The author argues that deterministic decision trees will always outperform neural networks, claiming that AI's successes are only due to computational limits on building such trees.
This position paper argues that sampling-based inference in Bayesian neural networks has achieved computational parity with optimization-based methods and is poised to supersede them, offering superior uncertainty quantification and prediction performance.
This paper introduces the Representation Gap, a metric for neural network generalization error with better asymptotic dynamics. Using a geometric perspective and optimal quantization theory, the authors show it is governed by the intrinsic dimension of the task, and verify this empirically on synthetic and realistic datasets.
This paper develops a mean-field theory of dropout as a perturbation at the edge of chaos in neural networks, deriving scaling laws for correlation decay and establishing distinct universality classes for smooth and ReLU-like activations. It also yields optimal dropout scheduling that reduces test loss with no extra computational cost.
This paper extends Equilibrium Propagation to skew-gradient systems and demonstrates an equivalence between deep Energy-Based Models and Hamiltonian neural networks, focusing on diffusively coupled FitzHugh-Nagumo neurons. It derives a layer-wise Hamiltonian recurrence relation for inference in such networks.
This paper introduces novel methods for generating high-quality embeddings for Horn logic reasoning using triplet loss, including techniques for balanced training example generation and hard example emphasis, which improve the efficiency of downstream logical reasoning.
This paper proposes collocational bootstrapping, a mechanism by which statistical word co-occurrence cues can aid the acquisition of English subject-verb agreement, supported by neural network simulations and analysis of child-directed speech.
This paper studies symmetrization of loss functions for robust training under label noise, introducing SGCE and alpha-MAE loss functions that interpolate between multi-class unhinged loss and Mean Absolute Error, with theoretical guarantees and competitive empirical performance.