Tag
This paper presents a token-level framework showing that power-law scaling in language model loss arises from aggregating the sigmoidal learning curves of individual tokens. It also demonstrates that reshaping the training distribution according to token learning times can accelerate validation loss reduction by 11%.
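
The aggregation idea can be illustrated with a toy simulation. The sketch below is not the paper's model: it assumes Zipf-distributed token frequencies, a learning time inversely proportional to frequency, and a per-token loss that is logistic in log-time. The names `token_loss`, `aggregate_loss`, and `fit_loglog_slope`, and the parameters `zipf_exponent` and `width`, are all hypothetical choices made for illustration.

```python
import math

def token_loss(step, learn_time, width=0.5):
    """Sigmoidal per-token loss in log-time (assumed form):
    close to 1 well before learn_time, close to 0 well after it."""
    z = (math.log(step) - math.log(learn_time)) / width
    return 1.0 / (1.0 + math.exp(z))

def aggregate_loss(step, freqs, learn_times):
    """Frequency-weighted sum of the per-token sigmoids at a given step."""
    return sum(p * token_loss(step, t) for p, t in zip(freqs, learn_times))

def fit_loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Assumption: Zipfian token frequencies over a finite vocabulary.
num_tokens = 20000
zipf_exponent = 1.5
raw = [rank ** -zipf_exponent for rank in range(1, num_tokens + 1)]
total = sum(raw)
freqs = [r / total for r in raw]

# Assumption: rarer tokens are learned later (learning time = 1 / frequency).
learn_times = [1.0 / p for p in freqs]

# Evaluate the aggregate loss on a log-spaced grid from 1e2 to 1e4 steps.
steps = [10 ** (k / 4) for k in range(8, 17)]
losses = [aggregate_loss(s, freqs, learn_times) for s in steps]

# A roughly constant log-log slope indicates approximate power-law decay.
slope = fit_loglog_slope(steps, losses)
print(f"fitted log-log slope: {slope:.3f}")
```

Under these assumptions, each token's curve is a sharp step on a log-time axis, yet the weighted sum decays smoothly: at step `t` the loss is dominated by tokens with learning time above `t`, and the Zipfian tail mass of those tokens shrinks as a power of `t`. For `zipf_exponent = 1.5` the fitted slope should land in the neighborhood of -1/3, with some deviation from the finite vocabulary.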