This paper identifies a spectral phenomenon in large language model pre-training called Stability of Singular Distribution (SoSD): the singular value spectrum stabilizes early in training while the parameters themselves continue to evolve. The authors prove that this stabilization marks the transition to the slow-descent phase of training, and they analyze how training strategies such as WSD and Muon affect this behavior.
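To make the idea of a stable spectrum alongside changing parameters concrete, the following toy sketch compares two weight matrices by (a) the distance between their normalized singular value distributions and (b) the relative change in the raw parameters. This is only an illustration under stated assumptions: the function names, the sum-to-one normalization, and the L1 distance are hypothetical choices made here, not the metric defined in the paper, and the "checkpoints" are synthetic matrices rather than real model weights.

```python
import numpy as np


def singular_distribution(weight):
    """Singular values of `weight`, normalized to sum to 1 (assumed normalization)."""
    s = np.linalg.svd(weight, compute_uv=False)  # returned in descending order
    return s / s.sum()


def spectrum_shift(w_a, w_b):
    """L1 distance between the two normalized singular value distributions."""
    return float(np.abs(singular_distribution(w_a) - singular_distribution(w_b)).sum())


def relative_param_change(w_a, w_b):
    """Frobenius norm of the parameter difference, relative to the norm of w_a."""
    return float(np.linalg.norm(w_b - w_a) / np.linalg.norm(w_a))


# Synthetic stand-ins for two checkpoints of one weight matrix.
rng = np.random.default_rng(0)
w_early = rng.standard_normal((64, 32))

# Multiplying by an orthogonal matrix changes the entries of the matrix
# but leaves its singular values unchanged.
q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
w_late = q @ w_early

print("spectrum shift:       ", spectrum_shift(w_early, w_late))
print("relative param change:", relative_param_change(w_early, w_late))
```

In this constructed case the spectrum shift is zero up to floating-point error while the parameter change is large, which is the extreme version of the situation the summary describes. On real checkpoints one would expect both quantities to be nonzero, and the interesting question is how their trajectories differ over training.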