Tag
This paper demonstrates that two-layer neural networks trained with gradient-based methods can achieve the optimal computational-statistical tradeoff for learning Gaussian single-index models, matching the SQ lower bound up to polylogarithmic factors for all generative exponents and extending to sparse settings with a novel weight perturbation technique.
GRAPE is a training framework that progressively exposes parameter space during adversarial training, achieving higher robust accuracy with fewer parameters than fixed-structure methods on CIFAR-10.
This thread argues that standard transformers have a topological flaw: once a state representation reaches the top layer, they cannot update beliefs over time, causing collapse as depth increases.
This thread discusses the concept of 'Jagged Intelligence' in AI, framing it as a consequence of AI learning being an ill-posed inverse problem, and argues that external stabilizers like scaffolding and verification are essential.
The article proposes Implicit Variational Rejection Sampling (IVRS), which integrates implicit distributions with rejection sampling to improve posterior approximation in variational inference, and introduces the Implicit Resampling Evidence Lower Bound (IR-ELBO) as a tighter variational lower bound.
This paper introduces neural slack variables, a primal-side approach that converts constraint enforcement into a regression problem by coupling the primary network with a jointly learned auxiliary network, achieving zero violations on monotonicity and convexity tests and enabling arbitrage-free learning of volatility surfaces.
This paper demonstrates that the weight norm causally controls the timescale of grokking in neural networks, reconciling conflicting accounts. Through interventions, it shows that grokking follows an exponential delay law and that norm magnitude dominates grokking time over learning rate across architectures.
This paper argues that recent claims that neural networks have solved Fodor and Pylyshyn's systematicity challenge are premature. The authors show that the meta-learning for compositionality model fails to generalize out-of-distribution and behaves unsystematically even on in-distribution problems, concluding the challenge remains unmet.
This thread presents a technique to encode a functional QR code into neural network weights using natural language text during training, enabling hidden information embedding in models trained on benign data.
Singular Learning Theory (SLT) uses algebraic geometry to explain why neural networks generalize well despite their degeneracies, introducing the real log canonical threshold (RLCT) as a measure of model complexity.
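For context on why the RLCT acts as a complexity measure, the standard asymptotic expansion of the Bayesian free energy from Watanabe's theory can be written as below. The symbols are assumptions introduced here for illustration and do not appear in the summary: $F_n$ is the free energy on $n$ samples, $L_n$ the empirical negative log-likelihood, $w_0$ an optimal parameter, $\lambda$ the RLCT, and $d$ the parameter count.

```latex
F_n = n\,L_n(w_0) + \lambda \log n + O_p(\log\log n),
\qquad \lambda \le \frac{d}{2}.
```

For regular models $\lambda = d/2$, which recovers the usual BIC penalty; singular models such as neural networks can have a strictly smaller $\lambda$, so they pay a smaller complexity penalty than their parameter count suggests.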
This post recommends a book for systematically learning the basics of large language models: "Foundations of Large Language Models", written by Tong Xiao and Jingbo Zhu of the Northeastern University NLP Lab and NiuTrans Research.
This paper investigates reducing the computational complexity of deep neural networks for EEG analysis on wearable devices by applying parameter quantization and electrode reduction techniques, demonstrating significant complexity reduction with minimal accuracy loss for epileptic seizure detection.
Tokyo Institute of Technology has released free machine learning course materials covering topics like regression, neural networks, SVM, clustering, and PCA, with hands-on code using NumPy, scikit-learn, and PyTorch.
This paper investigates how neural networks maintain high accuracy even when over 90% of input features are corrupted, deriving a centroid-based decision rule in the high-noise limit using a mean-field approach.
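As a rough illustration of what a centroid-based decision rule looks like, here is a minimal nearest-centroid classifier in plain Python. This is a generic sketch, not the rule derived in the paper: the function names `fit_centroids` and `predict`, the toy data, and the squared-Euclidean distance are all assumptions made for illustration.

```python
def fit_centroids(X, y):
    """Return a dict mapping each class label to the mean of its training rows."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Toy data (hypothetical): the first two features carry the class signal,
# the third is a nuisance feature.
X_train = [[1.0, 1.0, 0.0], [0.8, 1.2, 0.1], [-1.0, -1.0, 0.0], [-1.2, -0.8, -0.1]]
y_train = [1, 1, 0, 0]
centroids = fit_centroids(X_train, y_train)

# Test points whose third feature is heavily corrupted.
noisy_positive = [0.5, 0.2, 3.0]
noisy_negative = [-0.3, -0.4, -2.0]
print(predict(centroids, noisy_positive), predict(centroids, noisy_negative))
```

In this toy case the corrupted feature shifts the distance to both centroids by a similar amount, so the decision is still driven by the informative features.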
This paper introduces MF-Net, a recurrent dynamical model that represents multivariate systems through a shared field state and learns a mechanical transition for joint evolution. It achieves competitive forecasting while enabling interpretable structural readout of learned relations.
The article discusses a deficiency in executive control within transformer attention mechanisms, highlighting limitations in how transformers manage sequential dependencies.
A tweet by Karan (@kmeanskaran) outlining a learning roadmap for balancing ML and AI, covering Python, neural networks, NLP, LLMs, deployment, and agentic AI, with a reply from Amit seeking beginner guidance.
This paper introduces the Hierarchical Emergence Framework (HEF), which explains how diverse systems such as neural networks and biological evolution converge to similar internal representations through phase transitions in mechanism landscapes under physical and informational constraints. The framework is validated empirically with 111 grokking experiments that confirm universal convergence and identify a critical energy threshold.
This paper addresses the open question of maximum step size for gradient descent convergence on non-L-smooth objectives, introducing adaptive methods that operate at the edge of stability and can minimize sharpness globally.
This paper analyzes precision loss in FP8 attention due to the attention sink phenomenon when casting the softmax output to FP8 (E4M3). It shows that forward KV iteration causes underflow of non-sink attention values, and proposes reverse iteration and a static scaling factor S=256 to eliminate underflow, achieving 3-10x MSE improvement.
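The underflow and the effect of a static scale can be sketched with a small software model of E4M3 rounding. This is a simplified illustration rather than the paper's kernel: `quantize_e4m3` is a hypothetical helper that assumes round-to-nearest-even with saturation at 448 and a smallest subnormal of 2^-9, and it does not model the reverse KV iteration.

```python
import math

E4M3_MAX = 448.0        # largest finite E4M3 value
E4M3_MIN_EXP = -6       # exponent of the smallest normal value
MANTISSA_BITS = 3


def quantize_e4m3(x: float) -> float:
    """Round x to the nearest value representable in E4M3 (simplified model)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # Below the smallest normal exponent the grid spacing stays fixed (subnormals).
    e = max(math.floor(math.log2(a)), E4M3_MIN_EXP)
    step = 2.0 ** (e - MANTISSA_BITS)
    # Python's round() rounds halves to even, matching round-to-nearest-even.
    return sign * min(round(a / step) * step, E4M3_MAX)


S = 256.0               # static scaling factor named in the summary
p_small = 2.0 ** -12    # a non-sink attention probability
p_sink = 1.0            # an attention-sink probability close to 1

unscaled = quantize_e4m3(p_small)            # flushes to zero
rescaled = quantize_e4m3(S * p_small) / S    # survives the cast
sink_scaled = quantize_e4m3(S * p_sink)      # 256 stays below the 448 limit

print(unscaled, rescaled, sink_scaled)
```

Under this model, probabilities below about 2^-10 are lost when cast directly, while multiplying by 256 first lowers that threshold by a factor of 256 and still keeps the largest possible softmax output within the E4M3 range.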