grokking

#grokking

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

arXiv cs.LG · 2026-06-18

The paper investigates whether weight norm directly controls the grokking delay in neural networks or whether its effect is instead mediated by logit scale and softmax saturation under cross-entropy loss. Experiments show that the delay is almost entirely explained by the effective logit scale, with weight norm contributing negligibly.

#grokking

The Discrete-Log Clock: How a Transformer Learns Modular Multiplication

arXiv cs.LG · 2026-06-17

This paper demonstrates that when transformers grok modular multiplication, the dense Fourier spectrum observed in previous work is an artifact of using the additive Fourier transform. Using the multiplicative character transform instead reveals a sparse representation, from which the authors reverse-engineer a 'Discrete-Log Clock' algorithm analogous to the clock algorithm for modular addition.

#grokking

Noise-Driven Escape from Metastable Phases explains Grokking in Deep Neural Networks

arXiv cs.LG · 2026-06-17

The paper proposes that grokking in deep neural networks arises from noise-driven escape from metastable phases in first-order L2 phase transitions, demonstrating that delayed generalization follows Arrhenius scaling and reproduces canonical grokking curves.

#grokking

The Weight Norm Sets the Grokking Timescale: A Causal Delay Law

arXiv cs.LG · 2026-06-15

This paper demonstrates that the weight norm causally controls the timescale of grokking in neural networks, reconciling conflicting accounts. Through interventions, it shows that grokking follows an exponential delay law and that norm magnitude, rather than learning rate, dominates grokking time across architectures.

#grokking

Emergence via Phase Transitions: Mechanism Landscapes and Universal Convergence Across Complex Systems

arXiv cs.LG · 2026-06-09

This paper introduces the Hierarchical Emergence Framework (HEF), which explains how diverse systems such as neural networks and biological evolution converge to similar internal representations through phase transitions in mechanism landscapes under physical and informational constraints. The framework is validated empirically with 111 grokking experiments that confirm universal convergence and identify a critical energy threshold.

#grokking

A Pre-Training Analogue of Grokking in Language Models: Tracing Delayed Grammatical Generalization

arXiv cs.LG · 2026-06-02

This paper introduces an exposure-based framework to study grokking-like delayed generalization during LLM pre-training, using BLiMP minimal pairs and critical phrases. The authors observe delayed generalization across five grammatical phenomena and analyze internal changes such as concept vector predictability and attention head concentration.

#grokking

A Training-Time Diagnostic for Generalization via the Log-Alignment Ratio

arXiv cs.LG · 2026-05-29

This paper introduces the log-alignment ratio (LAR), a training-time metric that measures parameter-activation alignment and predicts generalization by capturing the spread of weight and activation spectra. Experiments on grokking and a 3B-parameter language model show LAR tracks the transition from memorization to generalization and flags overfitting without held-out data.

#grokking

Feature Lottery? A Bifurcation Theory of Concept Emergence

arXiv cs.LG · 2026-05-26

This paper introduces a bifurcation theory of representation dynamics to detect when neural networks acquire structured representations during training, using a Hessian analysis of a GMM probe. The resulting ratio β/β_c serves as a label-free phase coordinate that predicts the onset of usable structure and can forecast feature interpretability in sparse autoencoders early in training.

#grokking

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

arXiv cs.LG · 2026-05-21

This paper investigates how weight decay acts as a control parameter for the transition between memorization and generalization in transformers trained on modular arithmetic, and introduces two cheap online diagnostic metrics, computed from attention activations, that track these dynamics.

#grokking

First-Passage Prediction of Grokking Delay: A Calibrated Law under AdamW with Causal Validation

arXiv cs.LG · 2026-05-20

This paper presents the first quantitative prediction of the grokking delay under AdamW, deriving a closed-form law and validating it on algorithmic tasks with high accuracy.

#grokking

Graph spectral analysis (Fiedler value + Scheffer CSD indicators) predicts grokking 21k steps before loss function - five reproducible experiments [R]

Reddit r/MachineLearning · 2026-05-19

This post applies graph spectral analysis (the Fiedler value) and Scheffer critical-slowing-down (CSD) indicators to predict grokking in neural networks, reporting detection 21,000 steps before the loss function changes across five reproducible experiments.

#grokking

Phase Transitions in Driven Informational Systems: A Two-Field Perspective on Learning Theory and Non-Equilibrium Chemistry

arXiv cs.LG · 2026-05-19

This paper proposes a unified theoretical framework for phase transitions in deep learning (grokking, emergent capabilities) and non-equilibrium chemistry, describing both as driven informational systems governed by two gradient fields.

#grokking

Distributional Spectral Diagnostics for Localizing Grokking Transitions

arXiv cs.LG · 2026-05-12

This paper proposes distributional spectral diagnostics to localize grokking transitions in Transformer models before test accuracy rises. It combines empirical distributions with Hankel dynamic mode decomposition to build a monitoring signal that discriminates between grokking and non-grokking runs.

#grokking

Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking

arXiv cs.LG · 2026-05-12

This empirical study validates theoretical findings on feature repulsion and spectral lock-in during grokking in two-layer neural networks, demonstrating how activation functions influence the transition from memorization to generalization.
