training-dynamics

Predictable GRPO: A Closed-Form Model of Training Dynamics

arXiv cs.LG · yesterday

The paper presents a closed-form reduced-order model of GRPO training dynamics that reduces them to a damped oscillator and derives predictions for stability, group-size invariance, and loss curvature. The model is validated across multiple models and benchmarks.

@rosinality: https://arxiv.org/abs/2606.29858 Why does power-law scaling occur? Loss of individual tokens follows a sigmoidal curve,…

X AI KOLs Timeline · 2d ago

This paper presents a token-level framework showing that power-law scaling in language model loss arises from the aggregation of sigmoidal learning curves of individual tokens, and demonstrates that reshaping training distributions based on token learning times can accelerate validation loss reduction by 11%.

A Gravitational Interpretation of Fine-Tuning Reversion

arXiv cs.LG · 2d ago

The paper proposes a gravitational interpretation of fine-tuning reversion, in which early training creates dominant behavioral manifolds that later alignment only shallowly displaces, producing a persistent reversion direction. Experiments show that blocking this direction reduces harmfulness with minimal task cost.

Training Dynamics of Neural Software Defect Predictors under Coupled Data-Quality Issues

arXiv cs.LG · 2026-06-25

This paper investigates how training dynamics of neural networks for software defect prediction are affected by coupled data-quality issues such as class imbalance and overlap, proposing an interaction-aware empirical protocol.

Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

arXiv cs.LG · 2026-06-09

The paper identifies repetition mismatch as a primary reason data mixture experiments fail to scale, and proposes a repetition-controlled subsampling procedure that allows small-scale experiments to recover near-optimal mixtures using far fewer tokens.

Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics

arXiv cs.AI · 2026-06-08

This position paper argues that a scientific understanding of AI must go beyond post-hoc analysis and instead study the training dynamics that shape model behavior, with implications for predicting, intervening, and designing training procedures for desired properties like capabilities and safety.

On the Geometry of On-Policy Distillation

Hugging Face Daily Papers · 2026-06-05

This paper characterizes the unique parameter space dynamics of on-policy distillation (OPD) for large language models, showing that it exhibits relaxed off-principal updates and subspace locking, distinguishing it from supervised fine-tuning and reinforcement learning with verifiable rewards.

Edge of Stability Selectively Shapes Learning Across the Data Distribution

arXiv cs.LG · 2026-06-04

MIT researchers show that the edge of stability (EoS) in neural network training is not merely a global optimization phenomenon but selectively redistributes learning across subsets of the training distribution, amplifying progress on some data groups while suppressing others. They identify two key conditions governing this allocation: gradient alignment with the top Hessian eigenvector and sustained non-vanishing gradient magnitude.

Your transformer's attention entropy collapse isn't a bug. It's the model doing exactly what you trained it to do. Here's how to fix it with a three-line temperature schedule. arXiv-able. Self-contained proof. No citations needed.

Reddit r/ArtificialInteligence · 2026-06-02

The article explains that attention entropy collapse in deep transformer layers is a geometric consequence of training, not a bug, and proposes a three-line temperature schedule to prevent it.

Balancing Learning Rates Across Layers: Exact Two-Step Dynamics and Optimal Scaling in Linear Neural Networks

arXiv cs.LG · 2026-06-02

This paper derives exact closed-form expressions for gradients and test loss after one and two steps of gradient descent in two-layer and three-layer linear neural networks, characterizing optimal learning rate selection and revealing a distinct early-training regime where unequal layer-wise learning rates are initially optimal.

Neural Networks Provably Learn Spectral Representations for Group Composition

Hugging Face Daily Papers · 2026-06-02

This paper provides a theoretical analysis of how neural networks learn structured representations during group composition tasks, proving that training dynamics drive neurons to converge to irreducible group representations with exponential convergence rates. The work establishes a representation-theoretic account of feature learning and characterizes a low-rank compression phenomenon for matrix-valued group representations.

Do Deep Networks Forget Initialization? A Forgetting-Time View of Practical Inductive Bias

arXiv cs.LG · 2026-05-29

This paper introduces the concept of 'initialization memory' to study how much of the random initialization bias survives training in deep networks, showing that low-learning-rate SGD preserves initialization while Adam-family optimizers erase it, and linking this to forgetting dynamics.

Distributional Spectral Diagnostics for Localizing Grokking Transitions

arXiv cs.LG · 2026-05-12

This paper proposes distributional spectral diagnostics to localize grokking transitions in Transformer models before test accuracy rises. It uses empirical distributions and Hankel dynamic mode decomposition to create a monitoring signal that discriminates between grokking and non-grokking runs.
