Tag
This paper presents an information-processing theory of consciousness and argues that instantiating conscious subsystems in AI could enable superior adaptation without extensive training, potentially leading to AGI.
This paper provides optimal high-probability bounds for stochastic gradient descent under Markovian noise for smooth objectives satisfying the Polyak-Łojasiewicz (PL) condition, closing gaps between expectation and high-probability guarantees and extending to heavy-tailed settings with matching lower bounds.
This book develops an effective theory for deep neural networks, showing that their predictions are nearly-Gaussian and governed by the depth-to-width ratio, and introduces representation group flow to analyze signal propagation and learning dynamics.
This paper investigates why overparameterization succeeds in neural networks, comparing the lottery ticket hypothesis with escape dimensions as explanations.
This paper argues that vanilla conditional diffusion models fundamentally fail at compositional generation when the target distribution is out-of-distribution, due to score estimation error, and that inference-time corrections cannot fully compensate.
This paper presents a knowledge-based theory of capital, examining the value of both natural and artificial intelligence from an economic perspective.
This paper develops a formal account of what generalist agents must store in memory to act near-optimally across multiple environments and goals, presenting a separation theorem that memory is necessary for domain disambiguation and transition-model reconstruction.
This paper provides a theoretical explanation for why diffusion models can generate clean samples without explicit noise-level conditioning, attributing it to high-dimensional geometry and analyzing why some model parameterizations succeed while others collapse.
This paper shows that latent reasoning in transformer-based reasoning models (TRMs) functions as a policy improvement operator, and proposes an algorithm that improves learning and inference efficiency by up to 18x.
This paper provides guidance on the appropriate use of different Schatten-p norms in deep learning, analyzing their theoretical properties and practical implications for model regularization and optimization.
This paper presents theoretical bounds for uncertainty estimation and generalization in modern deep learning models.
This paper identifies a failure mode in which predictors collapse to a point on unidentified counterfactual couplings, showing that prediction cannot represent uncertainty over cross-world couplings. It proposes a framework that uses a positive semidefinite coupling kernel to bound counterfactuals, and shows that enforcing the kernel constraints yields tractable bounds.
This paper argues that transformer architectures are inherently succinct, meaning they can represent certain functions more efficiently than other models. It presents theoretical analysis and proofs.
This paper applies stereological theory to LLM benchmarks, revealing that current leaderboards measure only 3–5 independent dimensions, creating geometric blind spots that dominate statistical noise. It provides theoretical bounds on benchmark coverage and a submodular algorithm for efficient benchmark selection.
This article explores four layers of the role physics plays in AI, from the computational skeleton at the bottom to the methodological layer, and argues that the methodology of physics is migrating from the natural world to the AI domain.
This paper presents an exact decomposition of the curvature exponent α in neural network loss landscapes, explaining why it varies across layer types. It introduces the spectral alignment decomposition and derives a spectral transfer identity linking curvature, gradient rank decay, and Hessian exponents, validated across architectures and datasets.
This paper derives exact closed-form expressions for gradients and test loss after one and two steps of gradient descent in two-layer and three-layer linear neural networks, characterizing optimal learning rate selection and revealing a distinct early-training regime where unequal layer-wise learning rates are initially optimal.
This paper investigates why larger models outperform smaller ones and, through formal analysis and experiments, attributes the gap to data-induced competition for neural resources.
This paper proves that learning by predicting latent representations (as in world models like JEPA and data2vec) requires exponentially less data than predicting tokens (as in LLMs) for hierarchical data with hidden structure.
This paper establishes an exact correspondence between neural network training and Hamilton-Jacobi initial-value problems, unifying deep learning architectures through a deformation parameter.