Tag
This paper investigates whether weight norm directly controls the grokking delay in neural networks or whether its effect is mediated by logit scale and softmax saturation under cross-entropy loss. Experiments show that the delay is almost entirely explained by the effective logit scale, with weight norm contributing negligibly.
This paper demonstrates that when transformers grok modular multiplication, the dense Fourier spectrum observed in previous work is an artifact of using the additive Fourier transform; using the multiplicative character transform reveals a sparse representation, leading to a reverse-engineered 'Discrete-Log Clock' algorithm analogous to the clock algorithm for modular addition.
The paper proposes that grokking in deep neural networks arises from noise-driven escape from metastable phases in first-order L2 phase transitions, demonstrating that delayed generalization follows Arrhenius scaling and reproduces canonical grokking curves.
This paper demonstrates that the weight norm causally controls the timescale of grokking in neural networks, reconciling conflicting accounts. Through interventions, it shows that grokking follows an exponential delay law and that norm magnitude dominates grokking time over learning rate across architectures.
This paper introduces the Hierarchical Emergence Framework (HEF), which explains how diverse systems such as neural networks and biological evolution converge to similar internal representations through phase transitions in mechanism landscapes under physical and informational constraints. The framework is validated empirically with 111 grokking experiments that confirm universal convergence and identify a critical energy threshold.
This paper introduces an exposure-based framework to study grokking-like delayed generalization during LLM pre-training, using BLiMP minimal pairs and critical phrases. The authors observe delayed generalization across five grammatical phenomena and analyze internal changes such as concept vector predictability and attention head concentration.
This paper introduces the log-alignment ratio (LAR), a training-time metric that measures parameter-activation alignment and predicts generalization by capturing the spread of weight and activation spectra. Experiments on grokking and a 3B-parameter language model show LAR tracks the transition from memorization to generalization and flags overfitting without held-out data.
This paper introduces a bifurcation theory of representation dynamics to detect when neural networks acquire structured representations during training, using a Hessian analysis of a GMM probe. The resulting ratio β/β_c serves as a label-free phase coordinate that predicts the onset of usable structure and can forecast feature interpretability in sparse autoencoders early in training.
This paper investigates how weight decay acts as a control parameter for transitioning between memorization and generalization in transformers trained on modular arithmetic, and introduces two cheap online diagnostic metrics from attention activations that track these dynamics.
This paper presents the first quantitative prediction of the grokking delay under AdamW, deriving a closed-form law and validating it on algorithmic tasks with high accuracy.
This paper applies graph spectral analysis (the Fiedler value) and Scheffer's critical-slowing-down indicators to predict grokking in neural networks, detecting it 21,000 steps before the loss changes, across five reproducible experiments.
This paper proposes a unified theoretical framework for phase transitions in deep learning (grokking, emergent capabilities) and non-equilibrium chemistry, describing both as driven informational systems governed by two gradient fields.
This paper proposes distributional spectral diagnostics to localize grokking transitions in Transformer models before test accuracy rises. It uses empirical distributions and Hankel dynamic mode decomposition to create a monitoring signal that discriminates between grokking and non-grokking runs.
This empirical study validates theoretical findings on feature repulsion and spectral lock-in during the grokking phenomenon in two-layer neural networks, demonstrating how activation functions influence the transition from memorization to generalization.
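As a small illustration of the idea behind the modular multiplication summary above: for a prime p, the nonzero residues form a cyclic group, so multiplication mod p becomes addition mod p - 1 once each residue is replaced by its discrete logarithm. The sketch below only checks that identity by brute force for a small prime. It is not the paper's 'Discrete-Log Clock' algorithm, and the function names (`primitive_root`, `discrete_log_table`, `multiply_via_logs`) are hypothetical helpers chosen for this example.

```python
def primitive_root(p):
    """Return the smallest primitive root modulo an odd prime p (brute force)."""
    for g in range(2, p):
        seen = set()
        x = 1
        for _ in range(p - 1):
            x = (x * g) % p
            seen.add(x)
        # g is a primitive root exactly when its powers hit every nonzero residue.
        if len(seen) == p - 1:
            return g
    raise ValueError("no primitive root found; p should be an odd prime")


def discrete_log_table(p, g):
    """Map each nonzero residue a to the exponent k with g**k % p == a."""
    table = {}
    x = 1
    for k in range(p - 1):
        table[x] = k
        x = (x * g) % p
    return table


def multiply_via_logs(a, b, p, g, dlog):
    """Multiply nonzero residues a and b mod p by adding their discrete logs."""
    return pow(g, (dlog[a] + dlog[b]) % (p - 1), p)


# Assumed toy setting: a small prime, chosen only to keep the check fast.
p = 13
g = primitive_root(p)
dlog = discrete_log_table(p, g)

# Multiplication mod p agrees with addition of discrete logs mod p - 1.
assert all(
    multiply_via_logs(a, b, p, g, dlog) == (a * b) % p
    for a in range(1, p)
    for b in range(1, p)
)
```

Under this reindexing, modular multiplication has the same structure as modular addition, which is consistent with the summary's claim that a transform adapted to the multiplicative group can expose a sparse representation where the additive Fourier transform looks dense.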