@rosinality: https://arxiv.org/abs/2606.29858 Why does power-law scaling occur? Loss of individual tokens follows a sigmoidal curve,…
Summary
This paper presents a token-level framework showing that power-law scaling in language model loss arises from the aggregation of sigmoidal learning curves of individual tokens, and demonstrates that reshaping training distributions based on token learning times can accelerate validation loss reduction by 11%.
https://t.co/bsAl3FGITv
Why does power-law scaling occur? Loss of individual tokens follows a sigmoidal curve, and the aggregation of these curves with different times of learning makes a power-law curve. Quanta hypothesis again? https://t.co/Iw2qn3m6QE
Smooth Scaling Laws Hide Stepwise Token Learning
Source: https://arxiv.org/html/2606.29858
Authors: Pingjie Wang (Dots Studio, Xiaohongshu Inc.; Shanghai Jiao Tong University; [email protected]), Zechen Hu (1) (Dots Studio, Xiaohongshu Inc.; [email protected]), Peiru Yang (1, 2) (Dots Studio, Xiaohongshu Inc.; Tsinghua University; [email protected]), Fu Guo (Dots Studio, Xiaohongshu Inc.; [email protected]), Debing Zhang (3) (Dots Studio, Xiaohongshu Inc.; [email protected])
Footnotes: (1) Equal contribution. (2) Work done during an internship at Dots Studio, Xiaohongshu Inc. (3) Corresponding author.
Abstract
Language model loss follows remarkably regular scaling laws over model and data size, yet it remains unclear why the aggregate loss should exhibit a power-law form. Existing explanations often attribute this regularity to a heavy-tailed spectrum of pattern difficulty in natural language, but this view has not been directly validated at token-level granularity in large-scale real-data training. We present a token-level framework that decomposes scaling laws into localized learning events of individual contextualized tokens. By fitting token loss trajectories with sigmoids, we show that token learning is concentrated in localized transitions, giving rise to a learning-time spectrum that dominates the scaling-law shape. Across more than one hundred pre-training runs on large and diverse real-language corpora with modern LLM architectures, scaling up to 6B parameters and 300B training tokens, the measured learning-time spectrum quantitatively reconstructs the validation loss derivative along the training-step $T$, data-scale $D$, and model-scale $M$ axes. We further show that the same signal is actionable: by reshaping the training distribution according to when tokens become learnable, we alter the optimization trajectory and achieve 11% faster validation-loss reduction. These results provide direct empirical evidence that scaling laws are governed primarily by the distribution of token-level learning times, and that this distribution can be used not only to explain scaling behavior but also to improve training performance.
1 Introduction
Figure 1: A. Token-level loss trajectories. Individual token losses are well fit by sigmoids (top), showing that token learning is concentrated in localized transitions at specific learning times rather than spread uniformly over training. After aligning tokens by learning time, their loss curves collapse to a highly similar shape (bottom). B. Learning-time spectrum and empirical loss derivative. The learning-time spectrum is heavy-tailed and approximately power-law, and the loss derivative reconstructed from the measured learning-time spectrum together with the shared learning pulse closely matches the empirical one along the training-step $T$, data-scale $D$, and model-scale $M$ axes. These results indicate that the power-law form of the global loss is governed primarily by the distribution of token learning times, rather than by individual token trajectories.
Scaling laws show that language model loss and downstream performance follow predictable trends as model size, data, and training compute increase [1,2]. In recent large-scale training practice, they have directly guided compute allocation and motivated major shifts in training strategy, especially toward compute-optimal parameter-data tradeoffs [2,3]. However, a basic question remains: Why should the aggregate loss of a language model follow a power law?
To explain this phenomenon, recent work attributes power-law scaling to the spectrum of patterns inherent in the training data [4,5,6,7,8]. The intuition is that natural language does not consist of uniformly difficult features; it exhibits a long-tail distribution spanning from high-frequency, low-order linguistic rules to low-frequency, high-order complexities [9,10]. From this perspective, training acts as a progressive coverage of this spectrum, where models first master dominant patterns before capturing rarer, more sophisticated dependencies [11,12,13]. Under this view, power-law scaling is rooted in the intrinsic statistical structure of natural language itself. However, despite the broad appeal of this data-distribution perspective, it has not yet been decomposed at a sufficiently fine-grained level or directly validated in large-scale real-data training runs with industrial-scale models.
In this work, we introduce a token-level perspective to characterize the relationship between token-level loss dynamics, data distribution structure, and aggregate model loss, thereby unraveling the origin of power-law scaling. Our main contribution is a direct empirical demonstration on real-world data that the power-law form of the global loss fundamentally arises from the statistical distribution of token-level learning events. As illustrated in Fig. 1(A, top), individual token losses do not decrease uniformly: they remain on a plateau before and after a localized transition, a pattern that is accurately captured by sigmoid fits. When tokens are grouped by their learning times (the sigmoid centers), the resulting learning-time spectrum is long-tailed and itself follows a power law: many tokens are learned early, and progressively fewer are learned at later stages. At the same time, Fig. 1(A, bottom) shows that after aligning tokens by their learning times, the local shapes of their loss drops are highly similar. These two observations suggest a simple explanation for the global loss curve: the macroscopic power-law scaling behavior should be governed primarily by the spectrum of learning times, i.e., by how many tokens are learned at each stage.
We validate our theory on large-scale real-world corpora and modern Large Language Model (LLM) architectures used in industrial pre-training pipelines. Our experiments span models from 290M to 6B parameters and training budgets from 1B to 300B tokens, comprising over 110 runs using 1,178 A100 GPU-days, scaling along three axes: training-step $T$, data-scale $D$, and model-scale $M$. As summarized in Fig. 1(B), taking the derivative makes each token's loss drop visible as a localized learning pulse, indicating when that token is being learned. The measured learning-time spectrum, combined with the shared pulse, can reconstruct the empirical loss derivative along all of the $T$, $D$, and $M$ axes. Prior work has explored improving LLM pre-training by macroscopically adjusting data mixtures or schedules [14,15,16], but such interventions are often guided by heuristic signals rather than a fine-grained theory. Building on the decomposition, we further use the learning-time signal to reshape the training distribution, upweighting samples whose token-level learning events concentrate in a target training interval and downweighting samples that contribute little to that interval. This intervention changes the subsequent optimization trajectory and yields an 11% acceleration in validation-loss reduction relative to the original distribution, showing that the learning-time spectrum is not only explanatory but also practical for controlling scaling behavior. Our main findings are summarized as follows:
- Token learning is non-uniform and transition-like. Individual token losses remain on a plateau before and after a localized transition, a pattern well captured by sigmoid fits.
- Learning pulse shape is shared across learning times. The derivative of each token's loss forms a localized learning pulse, and these pulses share a similar shape after alignment by learning time.
- The scaling law is dominated by the learning-time spectrum. The validation loss is driven by the learning-time distribution, i.e., by how many tokens are learned at each axis location.
- The scaling behavior can be reshaped. Measured learning times can be used to reshape the training distribution, change the subsequent optimization trajectory, and therefore accelerate validation-loss reduction.
2 Related Work
Empirical Scaling Laws in Language Modeling.
In language modeling, validation loss scales as a power law with model size, data size, and training compute [1]. This empirical regularity has become a practical guide for compute allocation and parameter-data tradeoffs in large-scale training [2]. Subsequent work further investigates the optimal allocation of compute resources to maximize model performance, alongside exploring scaling behaviors under specific conditions [17,18,3]. Beyond final convergence with respect to model size $N$ and data size $D$, the continuous training trajectory itself exhibits a predictable power-law decay with respect to the number of training steps $T$ [1,19,20].
Data-Distribution Explanations of Scaling Laws.
Existing explanations for scaling laws largely attribute them to the long-tailed structure of the data distribution, suggesting that the observed power laws reflect a long-tail spectrum of learnable patterns or latent features in the data [4,6,7]. For example, Hutter [4] shows that a Zipfian data distribution induces a power-law learning curve, with the exponent set by the Zipf exponent. Furthermore, several works support this view through toy models and synthetic data experiments [5,21,8,22]. In particular, Michaud et al. [5] use an MLP toy model on a Zipf-distributed sparse parity synthetic dataset to show that neural scaling can be decomposed into the learning of distinct tasks. By employing different types of synthetic data structures, Cagnetta et al. [8] study power-law learning curves using PCFG-generated hierarchical compositional data, whereas Barkeshli et al. [22] use synthetic data from function learning on random graphs. Taken together, these works form a broad line of explanation that links scaling laws to properties of the data distribution, such as compressibility, spectral structure, hierarchy, and distributions of learning complexity. However, most existing studies remain at the level of theoretical analysis or coarse-grained macroscopic statistics, without directly characterizing scaling behavior at the token level. Moreover, many of these analyses are conducted in toy settings, with limited validation on large-scale real-world data or modern model architectures used in industrial practice.
Token-Level Training Dynamics.
Recent work has also highlighted the importance of token-level training dynamics by tracking how individual token instances are learned during training. Several studies provide qualitative evidence that token-level loss trajectories are highly heterogeneous, with different tokens exhibiting markedly different learning patterns [23,24,25,26,27,28]. For example, Lin et al. [26] track the loss dynamics of a large number of tokens during training and show that token losses do not decrease uniformly, but instead separate into components associated with effective learning and noise. Chang et al. [27] further show that the learning trajectories of individual token instances are reproducible across training runs, suggesting that token learning is shaped by sequential learning dependencies. They also relate these dynamics to text-level statistical features. Together, these works show that learning is highly heterogeneous at the token level. However, they mainly provide qualitative observations of these dynamics, and do not explicitly model how such micro-level heterogeneity aggregates into the global scaling laws observed at the loss level. In contrast, our work directly connects token-level training dynamics to scaling laws, and validates this connection on modern LLMs with industry-scale data and architectures.
3 Decomposing Scaling Laws
3.1 From Macroscopic Power Laws to Token-Level Decomposition
Scaling laws are typically expressed at the level of aggregate loss: along axis $a$, validation loss follows a regular power-law decay of the form $L(a) = k a^{-\alpha} + E$ [1,2], where $a$ can be training steps $T$, data scale $D$, or model scale $M$. This leads to a central question: What microscopic mechanism gives rise to the power-law decay of aggregate loss? Three natural hypotheses about what drives the power law are:
- Hyp. A: it comes mainly from overall training dynamics, such as optimization, scheduling, or noise.
- Hyp. B: it comes from individual token-loss trajectories themselves decreasing as power laws.
- Hyp. C: it comes from different tokens being learned at different points along the scaling axis.
To distinguish these possibilities, we decompose the validation loss $L(a)$ back to the token level: $L(a) = \frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}} \ell_i(a)$. Here $\mathcal{I}$ is the set of contextualized token instances from the validation set, and $\ell_i(a)$ is the loss of token instance $i$ measured along axis $a$. More directly, the loss derivative tracks where learning occurs: $L'(a) := -\frac{dL(a)}{da} = -\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}} \frac{d\ell_i(a)}{da}$. Therefore, our core task is to decompose $L'(a)$ into token-level loss derivatives and identify which factor truly governs its macroscopic shape.
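As a minimal sketch of this decomposition, the Python snippet below averages per-token losses into an aggregate curve and checks that the finite-difference derivative of the mean equals the mean of the per-token derivatives. The checkpoint grid and loss values are made-up toy numbers, and the function names are illustrative rather than taken from the paper.

```python
def aggregate_loss(token_losses):
    """Mean over token instances at each checkpoint.

    token_losses[i][j] is the loss of token instance i at checkpoint j.
    """
    n_tokens = len(token_losses)
    n_ckpt = len(token_losses[0])
    return [sum(row[j] for row in token_losses) / n_tokens for j in range(n_ckpt)]


def neg_finite_difference(values, axis_points):
    """Approximate -d(values)/da between consecutive axis points."""
    return [
        -(values[j + 1] - values[j]) / (axis_points[j + 1] - axis_points[j])
        for j in range(len(values) - 1)
    ]


# Hypothetical checkpoints (training steps) and token-level losses.
steps = [100, 200, 400, 800]
token_losses = [
    [5.0, 2.0, 1.0, 1.0],  # token learned early
    [6.0, 6.0, 3.0, 1.5],  # token learned later
    [4.0, 4.0, 4.0, 4.0],  # token not learned in this window
]

agg = aggregate_loss(token_losses)
derivative_of_mean = neg_finite_difference(agg, steps)

per_token_derivs = [neg_finite_difference(row, steps) for row in token_losses]
mean_of_derivatives = aggregate_loss(per_token_derivs)

for x, y in zip(derivative_of_mean, mean_of_derivatives):
    print(f"{x:.6f} {y:.6f}")
```

Because averaging and differencing are both linear, the two columns agree up to floating-point rounding, which is what lets the aggregate derivative be studied one token at a time.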
3.2 Parameterizing Token-Level Learning Events
Figure 2: Representative token-level loss trajectories across three independent runs and the corresponding sigmoid fits. The sigmoid fits capture the dominant learning event for each token, which is characterized by a localized sharp descent that occurs at a specific learning time. This suggests that token learning occurs as concentrated events rather than uniformly spread across the axis. Based on the location of the learning time, tokens can be grouped into 4 categories: learned, early-learning, late-learning, and unlearned tokens.
We begin with an observational study comparing the trajectories of the same token instances across independent runs with different random seeds. Representative cases are shown in Fig. 2.
Three observations consistent with prior work on token-level learning dynamics [25,26,27] are visible. First, the same token exhibits highly similar loss descent timing and trajectory shape across different runs, indicating that its learning behavior is determined mainly by token-level structure rather than by optimizer noise, which weakens Hypothesis A. Second, individual token losses do not follow power-law decay, which argues against Hypothesis B. Third, the key difference across token instances is when their loss starts to drop, as tokens begin their sharp loss decrease at distinct stages of training. As shown in Fig. 2, the token-level loss typically begins on a high initial plateau, where the model cannot yet predict the token reliably. Once the model learns the relevant predictive pattern, the token loss drops rapidly over a localized interval, after which it enters a lower-loss plateau. Thus, token learning is not uniformly spread across the axis, but occurs as a concentrated *learning event*. Nam et al. [29] model the emergence of task-level abilities [30] on synthetic data with sigmoid functions. Motivated by our observations above, we further model real-world token-level learning events with the following sigmoid parameterization:
$$\ell_i(a) \approx \ell_i^{\mathrm{post}} + \Delta_i\left[1 - \sigma\!\left(\lambda_i(a-\tau_i)\right)\right], \tag{1}$$
where $\sigma(z) = 1/(1+e^{-z})$ and $\Delta_i := \ell_i^{\mathrm{pre}} - \ell_i^{\mathrm{post}}$. The parameters in this model have direct interpretations: $\tau_i$ is the *learning time*, the center of the dominant descent; $\lambda_i$ controls the sharpness of the transition and therefore the temporal width of the learning event; $\Delta_i$ is the total loss reduction contributed by this event; and $\ell_i^{\mathrm{pre}}$ and $\ell_i^{\mathrm{post}}$ are the initial and final plateaus. Importantly, we use the sigmoid not to claim exact logistic dynamics, but as a simple, interpretable model for extracting the dominant learning event from noisy token-level trajectories.
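The fitting procedure is not spelled out at this point, so the following is only one plausible sketch of how Eq. 1 could be fit to a single token trajectory. It assumes a coarse grid search over candidate $(\tau_i, \lambda_i)$ pairs and, for each pair, solves for $\ell_i^{\mathrm{post}}$ and $\Delta_i$ in closed form by ordinary least squares, since Eq. 1 is linear in those two parameters. The grid values, function names, and the noiseless toy trajectory are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def token_loss(a, l_pre, l_post, lam, tau):
    """Eq. 1: post-plateau plus a sigmoidal drop of size l_pre - l_post."""
    delta = l_pre - l_post
    return l_post + delta * (1.0 - sigmoid(lam * (a - tau)))


def fit_sigmoid(axis, losses, taus, lams):
    """Grid search over (tau, lam); least squares for l_post and delta.

    The sign of delta is left unconstrained, so ascending trajectories
    can also be fit.
    """
    n = len(axis)
    mean_l = sum(losses) / n
    best = None
    for tau in taus:
        for lam in lams:
            h = [1.0 - sigmoid(lam * (a - tau)) for a in axis]
            mean_h = sum(h) / n
            var_h = sum((x - mean_h) ** 2 for x in h)
            if var_h < 1e-12:  # sigmoid is flat on this axis; skip
                continue
            cov = sum((x - mean_h) * (y - mean_l) for x, y in zip(h, losses))
            delta = cov / var_h
            l_post = mean_l - delta * mean_h
            sse = sum((l_post + delta * x - y) ** 2 for x, y in zip(h, losses))
            if best is None or sse < best[0]:
                best = (sse, {"l_post": l_post, "delta": delta, "lam": lam, "tau": tau})
    return best[1] if best else None


# Hypothetical noiseless trajectory sampled at evenly spaced checkpoints.
axis = list(range(0, 4001, 250))
losses = [token_loss(a, l_pre=6.0, l_post=1.0, lam=0.01, tau=2000.0) for a in axis]

fit = fit_sigmoid(
    axis,
    losses,
    taus=[500.0 * j for j in range(1, 8)],
    lams=[0.002, 0.005, 0.01, 0.02],
)
print(fit)
```

On real, noisy trajectories a finer grid or a gradient-based refinement would likely be needed; the sketch only illustrates how the four parameters of Eq. 1 map onto a trajectory.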
The loss derivative of token $i$ is $\ell_i'(a) := -\frac{d\ell_i(a)}{da} \approx \Delta_i\lambda_i\,\sigma\!\left(\lambda_i(a-\tau_i)\right)\left[1-\sigma\!\left(\lambda_i(a-\tau_i)\right)\right]$. Using $\sigma(z)(1-\sigma(z)) = \frac{1}{4}\,\mathrm{sech}^2(z/2)$, this can be equivalently written as
$$\ell_i'(a) \approx \frac{\Delta_i\lambda_i}{4}\,\mathrm{sech}^2\!\left(\frac{\lambda_i(a-\tau_i)}{2}\right). \tag{2}$$
We call $\ell_i'(a)$ the token's *learning pulse*. It is centered at $\tau_i$, has width on the order of $\lambda_i^{-1}$, and integrates to the total loss reduction: $\int_{-\infty}^{\infty}\ell_i'(a)\,da = \Delta_i$. In other words, $\Delta_i$ identifies *how much* is learned, $\lambda_i$ identifies *how abruptly* it is learned, and $\tau_i$ shows *when* it is learned. Such a localized learning pulse is visualized in Fig. 2 (bottom).
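The stated properties of the learning pulse can be checked numerically. The sketch below evaluates Eq. 2 in an overflow-safe form, using the identity $\frac{1}{4}\mathrm{sech}^2(z/2) = e^{-|z|}/(1+e^{-|z|})^2$, and integrates it with the trapezoidal rule. The parameter values and function names are illustrative assumptions.

```python
import math


def learning_pulse(a, delta, lam, tau):
    """Eq. 2, (delta * lam / 4) * sech^2(lam * (a - tau) / 2).

    Written with exp(-|z|) so that large |z| underflows to 0
    instead of overflowing in cosh.
    """
    e = math.exp(-abs(lam * (a - tau)))
    return delta * lam * e / (1.0 + e) ** 2


def integrate_pulse(delta, lam, tau, half_width=40.0, n=4000):
    """Trapezoidal integral of the pulse over tau +/- half_width / lam."""
    lo = tau - half_width / lam
    hi = tau + half_width / lam
    h = (hi - lo) / n
    total = 0.5 * (learning_pulse(lo, delta, lam, tau) + learning_pulse(hi, delta, lam, tau))
    for j in range(1, n):
        total += learning_pulse(lo + j * h, delta, lam, tau)
    return total * h


# Hypothetical token: loss drops by 5.0 around step 2000.
delta, lam, tau = 5.0, 0.01, 2000.0

peak = learning_pulse(tau, delta, lam, tau)  # height at the center: delta * lam / 4
area = integrate_pulse(delta, lam, tau)      # should be close to delta
print(peak, area)
```

The peak height scales with $\Delta_i\lambda_i$ while the area depends only on $\Delta_i$, so a sharper transition gives a taller, narrower pulse with the same total loss reduction.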
Under this model, Eq. 2 shows that a token contributes to aggregate loss reduction only in a neighborhood of its learning time. At any axis location $a$, tokens with $\tau_i$ close to $a$ contribute most of the descent rate, while tokens with $\tau_i$ far from $a$ contribute little because they are either not yet undergoing their main learning event or have already completed it. Therefore,
$$L'(a) = \frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}} \ell_i'(a) \approx \frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}} \frac{\Delta_i\lambda_i}{4}\,\mathrm{sech}^2\!\left(\frac{\lambda_i(a-\tau_i)}{2}\right). \tag{3}$$
This admits a direct interpretation as the superposition of the token-level learning events that are active around $a$, which is the bridge to the decomposition developed next.
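To illustrate how Eq. 3 can produce a power law from pulses that are not themselves power laws, the following sketch builds a synthetic token population. It assumes every token shares the same $\Delta_i$ and $\lambda_i$ and that learning times sit on the quantiles of a Pareto density $p(\tau) \propto \tau^{-2}$ for $\tau \ge 1$. These choices are toy assumptions for illustration, not measurements from the paper.

```python
import math


def pulse(a, delta, lam, tau):
    """Token-level learning pulse of Eq. 2 in an overflow-safe form."""
    e = math.exp(-abs(lam * (a - tau)))
    return delta * lam * e / (1.0 + e) ** 2


def aggregate_derivative(a, deltas, lams, taus):
    """Eq. 3: mean of token-level learning pulses at axis location a."""
    total = sum(pulse(a, d, l, t) for d, l, t in zip(deltas, lams, taus))
    return total / len(taus)


# Hypothetical population: identical pulse shape, heavy-tailed learning times.
# Inverse CDF of the Pareto density tau^-2 on [1, inf): tau = 1 / (1 - u).
n_tokens = 20000
taus = [1.0 / (1.0 - (i + 0.5) / n_tokens) for i in range(n_tokens)]
deltas = [1.0] * n_tokens
lams = [1.0] * n_tokens

d20 = aggregate_derivative(20.0, deltas, lams, taus)
d80 = aggregate_derivative(80.0, deltas, lams, taus)

# Log-log slope of the aggregate derivative between a = 20 and a = 80.
slope = math.log(d80 / d20) / math.log(80.0 / 20.0)
print(slope)
```

In this toy setting the slope comes out close to $-2$, the exponent of the assumed learning-time density, even though each individual pulse decays exponentially away from its center.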
3.3 Decoupling Learning Shape and Time
The decomposition in Eq. 3 suggests that the aggregate loss derivative is formed by many token-level learning pulses. Each pulse contains two pieces of information: its local pulse shape (determined by $\Delta_i$ and $\lambda_i$), which describes how the loss decreases once the token is being learned, and its pulse center (determined by $\tau_i$), which specifies when this decrease occurs.
To explain the source of the macroscopic power law, we need to separate these two effects. We define $\tau_i$ as the *learning time* of token $i$, which is the center of the token's *learning pulse* $\ell_i'(a)$; both are clearly visible in Fig. 2. Under this view, $L'(a)$ is the superposition of token-local pulses occurring at different learning times $\tau_i$. To isolate the shape component, we remove the timing variation by shifting each token's pulse so that its center is placed at zero. This gives the average centered learning pulse
$$g(a) := \frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}} \ell_i'(a+\tau_i), \tag{4}$$
which serves as a typical template for the local shape of a token-level learning event. Once the local pulse shape is summarized by $g$, what remains is to describe how the pulse centers $\tau_i$ are distributed over the axis. We capture this timing component with the empirical learning-time spectrum
$$p(\tau) := \frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}} \delta(\tau-\tau_i), \tag{5}$$
where $\delta(\cdot)$ is the Dirac delta. Together, $g$ and $p$ separate the two factors that were mixed in Eq. 3: $g$ describes the typical local pulse shape, while $p$ describes how these pulses are distributed over training time. If the aligned pulses have similar shapes, so that $\ell_i'(a) \approx g(a-\tau_i)$, each token-level pulse can be approximated as the same template shifted to its learning time. Then Eq. 3 becomes a superposition of shifted copies of $g$:
$$L'(a) \approx \frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}} g(a-\tau_i) \approx \int p(\tau)\,g(a-\tau)\,d\tau = (p*g)(a). \tag{6}$$
Under this decomposition, $g(a)$ captures the local shape of a typical learning event, while $p(\tau)$ captures how many learning events occur at each axis location. *If $g$ is approximately shared across tokens, the macroscopic shape of $L'(a)$ is governed primarily by the learning-time spectrum $p(\tau)$ rather than by token-specific loss shape.* This is the hypothesis to be tested in our experiments.
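The pieces of Eqs. 4 to 6 can be sketched on a discrete axis as follows. The code assumes learning times take integer values, so that the spectrum $p$ is a simple histogram, and uses a toy population in which every token has the same $\Delta_i$ and $\lambda_i$. In that idealized case the convolution reconstruction matches the exact sum of pulses; with heterogeneous shapes the match would only be approximate. All names and numbers are illustrative.

```python
import math
from collections import Counter


def pulse(a, delta, lam, tau):
    """Token-level learning pulse of Eq. 2 in an overflow-safe form."""
    e = math.exp(-abs(lam * (a - tau)))
    return delta * lam * e / (1.0 + e) ** 2


def centered_pulse(a, tokens):
    """Eq. 4: average of token pulses after shifting each center to zero."""
    return sum(pulse(a + tau, d, lam, tau) for d, lam, tau in tokens) / len(tokens)


def spectrum(tokens):
    """Eq. 5 on a discrete axis: fraction of tokens at each learning time."""
    counts = Counter(tau for _, _, tau in tokens)
    return {tau: c / len(tokens) for tau, c in counts.items()}


def reconstruct(a, p, tokens):
    """Eq. 6: sum over tau of p(tau) * g(a - tau)."""
    return sum(w * centered_pulse(a - tau, tokens) for tau, w in p.items())


def exact_derivative(a, tokens):
    """Eq. 3: mean of the individual token pulses."""
    return sum(pulse(a, d, lam, tau) for d, lam, tau in tokens) / len(tokens)


# Hypothetical tokens as (delta, lam, tau): many learned early, few late.
tokens = []
for tau in range(1, 41):
    count = max(1, round(200 / tau ** 2))
    tokens += [(1.0, 0.8, float(tau))] * count

p = spectrum(tokens)
for a in (2.0, 10.0, 25.0):
    print(a, reconstruct(a, p, tokens), exact_derivative(a, tokens))
```

Replacing the shared `(1.0, 0.8)` shape with token-specific values is a simple way to probe how much shape variation the reconstruction tolerates.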
Moreover, while $p(\tau)$ is defined formally here, the hierarchical and compositional structure of natural language provides a linguistic reason for it to follow a power-law distribution, because such regularities create a long-tailed spectrum of learning times, as discussed in Appendix A.1. We also compare $p(\tau)$ with existing data-centric curriculum proxies and pre-training intuitions, including data domain, token frequency, and n-gram dependency length. The results show that $p(\tau)$ is consistent with these widely held macroscopic data-strategy heuristics, suggesting that the learning-time spectrum provides a more fundamental explanation for why such proxies are useful; details are provided in Appendix A.2.
4 Step-Axis Decomposition
Figure 3: A. Empirical validation loss on the step axis and the aggregate loss implied by token-wise sigmoid fits for a 1.5B model trained on 300B tokens. Both exhibit a clear power-law regime after the initial transient stage. B. Learning pulses estimated from tokens grouped into 40 bins by learning time; after grouping by learning time, the pulse shapes remain highly similar across bins. C. Empirical learning-time spectrum, which is strongly heavy-tailed and is well approximated by a power law. D. Empirical loss derivative, sigmoid-implied derivative, power-law fit, and spectrum-dominated reconstruction on the step axis. Their close agreement shows that the shape of the macroscopic loss derivative is determined primarily by the learning-time spectrum rather than by fine-grained variation in token-level transition shape.
4.1 Experimental Setup
We validate the pulse-spectrum mechanism in two stages. We first test it on controllable synthetic data, where the underlying difficulty distribution can be specified explicitly and the full decomposition can be verified cleanly; these results are deferred to Appendix A.3. We then turn to industrial-scale real-data pre-training and begin on the training-step axis $T$, since token-level learning events unfold most directly over optimization time and prior work has shown that loss also exhibits regular power-law-like decay over training steps [1,31,20,32,33]. The step axis is therefore the natural first testbed for establishing the full token-to-loss decomposition before extending the analysis to $D$ and $M$.
For the step-axis analysis, we train a 1.5B MoE model on 300B tokens from an industrial pre-training corpus containing mixed Chinese, English, mathematics, reasoning, books, papers, and code. We evaluate on a fixed validation set sampled from the same underlying distribution and record token-level losses across training checkpoints so that each contextualized token can be analyzed as a function of optimization step. Because real token trajectories are noisy and need not be monotone, we do not enforce $\Delta_i > 0$ in the sigmoid fitting and allow both descending and ascending trajectories, including forgetting-like behavior. Training uses Adam with $\beta_1 = 0.9$ and $\beta_2 = 0.95$ and a warmup ratio of 10% without decay. The step-axis study consumed 133 A100 GPU days overall; in addition to the main runs, we performed two extra runs for the token-level analysis in Fig. 2.
4.2 Step-Axis Results
*Observation 1: The aggregate loss on the step axis exhibits a clear power-law-like regime after the initial transient stage.* As shown in Fig. 3(A), once training enters the stable learning phase, the validation loss decays regularly over optimization steps and follows a power-law pattern.
*Observation 2: Token-wise sigmoid fits accurately recover the aggregate loss trajectory.* Fig. 3(A) shows that the average loss implied by the fitted token-level sigmoids closely matches the empirical validation loss, indicating that the dominant token-level learning events are already well captured by the parameterization in Section 3.
*Observation 3: Token-level learning pulses are highly similar, whereas the learning-time spectrum is strongly heavy-tailed.* Fig. 3(B, C) shows that the main cross-token variation lies primarily in *when* tokens are learned, rather than in the detailed local shape of *how* they are learned.
*Observation 4: The empirical loss derivative is governed primarily by the learning-time spectrum.* In Fig. 3(D), the empirical loss derivative, the sigmoid-implied derivative, and the spectrum-dominated reconstruction closely agree, showing that replacing token-specific pulses by a shared pulse leaves the macroscopic derivative shape largely unchanged.
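A power-law fit of the loss derivative, as compared in Fig. 3(D), can be sketched with a log-log regression. Differentiating $L(a) = k a^{-\alpha} + E$ removes the constant $E$ and gives $L'(a) = k\alpha\,a^{-(\alpha+1)}$, so the log-log slope of the derivative is $-(\alpha+1)$. The snippet below is a minimal illustration on a synthetic curve; the values of `k`, `alpha`, `E`, and the checkpoint grid are assumed for the example and are not from the paper.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(ys) against log(xs)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    den = sum((x - mx) ** 2 for x in lx)
    return num / den


# Hypothetical power-law loss curve sampled at geometrically spaced steps.
k, alpha, E = 3.0, 0.3, 1.7
steps = [1000 * 2 ** j for j in range(10)]
loss = [k * a ** -alpha + E for a in steps]

# Negative finite-difference derivative, placed at geometric midpoints.
mids = [math.sqrt(a * b) for a, b in zip(steps, steps[1:])]
deriv = [
    -(lb - la) / (b - a)
    for a, b, la, lb in zip(steps, steps[1:], loss, loss[1:])
]

slope = loglog_slope(mids, deriv)
alpha_hat = -slope - 1.0
print(alpha_hat)
```

Working on the derivative avoids having to estimate the irreducible term $E$ jointly with the exponent, at the cost of amplifying noise when checkpoints are closely spaced.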
Taken together, these observations establish the full mechanism on the step axis: sigmoid-parameterized token transitions recover the empirical loss dynamics, learning pulses are approximately shared, and the remaining macroscopic shape is governed primarily by the learning-time spectrum. We next test whether the same mechanism continues to hold on the $D$ and $M$ axes.
Figure 4: A. Training-loss trajectories from a 50-run data-scale sweep with budgets ranging from 1B to 35B tokens, used to construct the data-scale frontier. B. Frontier validation loss on the $D$ axis together with its sigmoid fit; the resulting envelope remains well described by a power law. C. Learning-time spectrum, empirical loss derivative, sigmoid-implied derivative, power-law fit, and spectrum-dominated reconstruction on the data axis.
5 Data and Model Axes Decomposition
Having established the full decomposition on the step axis, we next ask whether the same pulse-spectrum mechanism also governs the classical scaling-law axes of data scale and model scale. These experiments follow standard scaling-law practice: for each setting, we tune the training configuration to obtain near-optimal performance rather than inheriting a single fixed recipe from the step-axis analysis. The data source remains the same as before, while the scaling variable changes from training-token budget on the $D$ axis to model size on the $M$ axis.
5.1 Data-Axis Results
For the data-scale analysis, we fix the model to a 290M MoE architecture and vary the training-token budget from 1B to 35B tokens. Each run uses a complete warmup-constant-decay schedule. To obtain a reliable scaling curve rather than an artifact of suboptimal training choices, we run a large sweep of 50 pre-training experiments and search over training configurations for each budget following standard scaling-law practice. The resulting family of trajectories allows us to construct the envelope validation-loss curve on the $D$ axis, which exhibits a clear power-law trend and serves as the macroscopic object to be explained. The final configuration for data-axis training consumed 358 A100 GPU days in total.
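The exact envelope construction is not detailed here. One simple reading, shown below purely as an assumption, is to keep the lowest final validation loss among all configurations tried at each token budget. The run records are invented toy values and `frontier` is a hypothetical helper name.

```python
def frontier(runs):
    """Lower envelope over runs.

    runs: iterable of (token_budget, final_validation_loss) pairs,
    possibly with several configurations per budget.
    Returns a list of (token_budget, best_loss) sorted by budget.
    """
    best = {}
    for budget, loss in runs:
        if budget not in best or loss < best[budget]:
            best[budget] = loss
    return sorted(best.items())


# Hypothetical sweep: two configurations at the smaller budgets, one at the largest.
runs = [
    (1_000_000_000, 3.9),
    (1_000_000_000, 3.7),
    (4_000_000_000, 3.3),
    (4_000_000_000, 3.4),
    (35_000_000_000, 2.9),
]

envelope = frontier(runs)
print(envelope)
```

The resulting per-budget curve is the kind of object that the token-level decomposition then treats as a function of $D$.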
We then perform the same token-level decomposition with data scale as the horizontal axis, fitting each validation token as a function of $D$ and estimating the corresponding learning-time spectrum, shared pulse, and loss-derivative reconstruction. As shown in Fig. 4, the same mechanism reappears on the data axis: the token-level sigmoid fits capture the dominant transitions, the learning-time spectrum remains strongly heavy-tailed, and the spectrum-dominated reconstruction closely matches the empirical loss derivative. Thus, even under a frontier constructed from recipe-optimized runs, the macroscopic scaling curve is governed primarily by where learning events are distributed along the data-scale axis, rather than by token-specific variation in local transition shape.
5.2 Model-Axis Results
Figure 5: A. Training-loss trajectories from a 60-run model-scale sweep spanning 0.42 to 5.00 GFLOPs/token, used to construct the model-scale frontier. B. Frontier validation loss on the $M$ axis together with its sigmoid fit; the resulting envelope again follows a clear power law. C. Learning-time spectrum, empirical loss derivative, sigmoid-implied derivative, power-law fit, and spectrum-dominated reconstruction on the model-scale axis.
We use non-embedding FLOPs per token $M$ rather than parameter count $N$ to represent model scale. For MoE models, parameter count alone is not a sufficiently faithful measure of effective scale, because it does not reflect the amount of computation and capacity actually activated during training. In particular, as noted in prior work [34], $N$ can introduce substantial distortions when comparing small and medium-scale models. Using $M$ therefore provides a more stable and comparable horizontal axis for scaling-law analysis.
For the model-scale analysis, we fix the training-token budget to 10B tokens and train 60 model variants spanning 0.42 to 5.00 GFLOPs/token, corresponding roughly to 400M to 6B parameters. Following standard scaling-law practice, we tune the architecture and training configuration for each setting to obtain a near-optimal frontier on the $M$ axis. The final configuration of model-axis training consumed 422 A100 GPU days in total. We then perform the same token-level decomposition with model scale as the horizontal axis. As shown in Fig. 5, the frontier loss again follows a power law, the learning-time spectrum remains strongly heavy-tailed, and the spectrum-dominated reconstruction closely matches the empirical loss derivative. Thus, the same pulse-spectrum mechanism persists when the horizontal axis is changed from training progress or data scale to model scale.
6 Reshaping Scaling Behavior with Learning-Time Signals
Figure 6: Intervening on the training distribution using the measured learning-time signal. A. We first train a reference model and save checkpoints to estimate learning times for a randomly sampled set of 1B candidate tokens. B. Training samples are then ranked by the fraction of token-level learning events that occur within a target step interval (here, steps 2000–3800) and partitioned into 10 buckets. C. We reshape the training distribution by downweighting the lowest-ranked buckets and upweighting the highest-ranked buckets. D. Continuing training from step 2000 with the reshaped distribution yields a faster decline in training loss than the original distribution, despite a higher irreducible-loss floor. E. The reshaped distribution also improves validation-loss dynamics, achieving an 11% acceleration relative to the original distribution, which shows that the measured learning-time signal is not only descriptive but can also be used to controllably reshape the scaling curve.
The previous sections establish learning time as an explanatory signal for macroscopic loss dynamics. We now ask whether it is also actionable: can a measured learning-time signal be used to reshape the training distribution and thereby alter subsequent scaling behavior in a controlled way?
To test this, we first train a 290M MoE reference model and save a sequence of checkpoints together with the corresponding loss trajectories of a randomly sampled set of 1B candidate tokens. We then fit the token-level sigmoid to estimate the learning time of each token. Because autoregressive next-token prediction requires processing the full prefix, token-level intervention is not a practical training unit. We therefore operate at the sample level, which is the smallest unit at which the data distribution can be reshaped efficiently in pre-training. For each sample, we compute a score based on the fraction of its tokens whose estimated learning times fall within a target interval of training steps (here, steps 2000–3800), and rank samples accordingly. The ranked samples are partitioned into 10 buckets, after which we reshape the training distribution by downweighting the lowest-ranked buckets and upweighting the highest-ranked buckets. We then continue training from the beginning of the target interval using the reshaped distribution and compare the resulting training and validation trajectories against the original distribution.
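The scoring, ranking, and bucketing steps described above can be sketched as follows. The text does not give the actual reweighting scheme, so the linear ramp of bucket weights, the inclusive treatment of the interval endpoints, and all function names are assumptions made for illustration. The toy example uses 2 buckets instead of 10 to stay readable.

```python
def sample_score(token_taus, lo, hi):
    """Fraction of a sample's tokens whose learning time lies in [lo, hi]."""
    if not token_taus:
        return 0.0
    return sum(lo <= t <= hi for t in token_taus) / len(token_taus)


def bucketize(scores, n_buckets):
    """Rank samples by score and split the ranks into near-equal buckets.

    Returns one bucket index per sample; 0 is the lowest-ranked bucket.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    buckets = [0] * len(scores)
    for rank, i in enumerate(order):
        buckets[i] = rank * n_buckets // len(scores)
    return buckets


def bucket_weights(n_buckets, low=0.5, high=1.5):
    """Hypothetical linear ramp: downweight low buckets, upweight high ones."""
    if n_buckets == 1:
        return [1.0]
    return [low + (high - low) * b / (n_buckets - 1) for b in range(n_buckets)]


# Hypothetical estimated learning times (in steps) for the tokens of 4 samples.
samples = [
    [100, 2500, 3000, 9000],
    [2100, 2200, 3700, 3900],
    [50, 60, 70, 80],
    [2000, 3800, 5000, 6000],
]

scores = [sample_score(taus, 2000, 3800) for taus in samples]
buckets = bucketize(scores, n_buckets=2)
weights = bucket_weights(2)
sample_weights = [weights[b] for b in buckets]
print(scores, buckets, sample_weights)
```

In practice the per-sample weights would feed a data sampler; how aggressively to ramp them is a tuning choice that this sketch does not address.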
As shown in Fig. 6, the resulting buckets exhibit clearly ordered loss behavior: higher-ranked samples contain more tokens whose learning events fall within the target interval and therefore undergo larger loss reductions during that phase. Reshaping the training distribution accordingly leads to a faster decline in training loss than the original distribution, indicating that the measured learning-time signal captures an actionable notion of sample learnability. This intervention also improves validation dynamics, yielding an 11% acceleration relative to the original distribution. These results show that the learning-time signal is not only descriptive but can also be used to intervene on macroscopic scaling behavior in a controlled way.
7 Conclusion
We introduced a token-level framework for decomposing scaling laws through the distribution of learning times. Across the training-step, data-scale, and model-scale axes, we found that aligned learning pulses are approximately shared and that the macroscopic loss derivative is governed primarily by the learning-time spectrum. This provides direct empirical evidence that the power-law form of language-model loss is determined mainly by when tokens are learned. We further showed that the same signal is actionable: reshaping the training distribution with learning-time estimates alters optimization dynamics and improves validation-loss reduction. These results connect token-level learning dynamics to macroscopic scaling behavior and open a path toward data-centric control of large-scale model training.
References
- Kaplan et al. [2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Hoffmann et al. [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pages 30016–30030, 2022.
- Gadre et al. [2025] Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, et al. Language models scale reliably with over-training and on downstream tasks. In The Thirteenth International Conference on Learning Representations, 2025.
- Hutter [2021] Marcus Hutter. Learning curve theory. arXiv preprint arXiv:2102.04074, 2021.
- Michaud et al. [2023] Eric Michaud, Ziming Liu, Uzay Girit, and Max Tegmark. The quantization model of neural scaling. Advances in Neural Information Processing Systems, 36:28699–28722, 2023.
- Maloney et al. [2022] Alexander Maloney, Daniel A. Roberts, and James Sully. A solvable model of neural scaling laws. arXiv preprint arXiv:2210.16859, 2022.
- Brill [2024] Ari Brill. Neural scaling laws rooted in the data distribution. arXiv preprint arXiv:2412.07942, 2024.
- Cagnetta et al. [2025] Francesco Cagnetta, Hyunmo Kang, and Matthieu Wyart. Learning curves theory for hierarchically compositional data with power-law distributed features. In International Conference on Machine Learning, pages 6149–6164. PMLR, 2025.
- Piantadosi [2014] Steven T Piantadosi. Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5):1112–1130, 2014.
- Linders and Louwerse [2023] Guido M Linders and Max M Louwerse. Zipf’s law revisited: Spoken dialog, linguistic units, parameters, and the principle of least effort. Psychonomic Bulletin & Review, 30(1):77–101, 2023.
- Zhang et al. [2021] Yian Zhang, Alex Warstadt, Xiaocheng Li, and Samuel Bowman. When do you need billions of words of pretraining data? In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pages 1112–1125, 2021.
- Rahaman et al. [2019] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In International conference on machine learning, pages 5301–5310. PMLR, 2019.
- Xu et al. [2020] Benfeng Xu, Licheng Zhang, Zhendong Mao, Quan Wang, Hongtao Xie, and Yongdong Zhang. Curriculum learning for natural language understanding. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 6095–6104, 2020.
- Xie et al. [2023] Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36:69798–69818, 2023.
- Albalak et al. [2023] Alon Albalak, Liangming Pan, Colin Raffel, and William Yang Wang. Efficient online data mixing for language model pre-training. arXiv preprint arXiv:2312.02406, 2023.
- Zhang et al. [2026] Yang Zhang, Amr Mohamed, Hadi Abdine, Guokan Shang, and Michalis Vazirgiannis. Beyond random sampling: Efficient language model pretraining via curriculum learning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5776–5794, 2026.
- Porian et al. [2024] Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, and Yair Carmon. Resolving discrepancies in compute-optimal scaling of language models. Advances in Neural Information Processing Systems, 37:100535–100570, 2024.
- Chen et al. [2025] Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, and Jingang Wang. Revisiting scaling laws for language models: The role of data quality and training strategies. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23881–23899, 2025.
- Li et al. [2025] Binghui Li, Fengling Chen, Zixun Huang, Lean Wang, and Lei Wu. Functional scaling laws in kernel regression: Loss dynamics and learning rate schedules. arXiv preprint arXiv:2509.19189, 2025.
- Luo et al. [2025] Kairong Luo, Haodong Wen, Shengding Hu, Zhenbo Sun, Zhiyuan Liu, Maosong Sun, Kaifeng Lyu, and Wenguang Chen. A multi-power law for loss curve prediction across learning rate schedules. In The Thirteenth International Conference on Learning Representations, 2025.
- Allen-Zhu and Li [2023] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 1, learning hierarchical language structures. arXiv preprint arXiv:2305.13673, 2023.
- Barkeshli et al. [2026] Maissam Barkeshli, Alberto Alfarano, and Andrey Gromov. On the origin of neural scaling laws: from random graphs to natural language. arXiv preprint arXiv:2601.10684, 2026.
- Carlini et al. [2022] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, 2022.
- Biderman et al. [2023] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International conference on machine learning, pages 2397–2430. PMLR, 2023.
- Xia et al. [2023] Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, and Veselin Stoyanov. Training trajectories of language models across scales. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13711–13738, 2023.
- Lin et al. [2024] Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, et al. Not all tokens are what you need for pretraining. Advances in Neural Information Processing Systems, 37:29029–29063, 2024.
- Chang et al. [2024] Tyler A Chang, Zhuowen Tu, and Benjamin K Bergen. Characterizing learning curves during language model pre-training: Learning, forgetting, and stability. Transactions of the Association for Computational Linguistics, 12:1346–1362, 2024.
- Michaelov et al. [2025] James A Michaelov, Roger P Levy, and Ben Bergen. Language model behavioral phases are consistent across architecture, training data, and scale. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Nam et al. [2024] Yoonsoo Nam, Nayara Fonseca, Seok H Lee, Chris Mingard, and Ard A Louis. An exactly solvable model for emergence and scaling laws in the multitask sparse parity problem. Advances in Neural Information Processing Systems, 37:39632–39693, 2024.
- Wei et al. [2022] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
- Hestness et al. [2017] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
- Choshen et al. [2025] Leshem Choshen, Yang Zhang, and Jacob Andreas. A hitchhiker’s guide to scaling law estimation. In International Conference on Machine Learning, pages 10683–10699. PMLR, 2025.
- Bordelon et al. [2024] Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A dynamical model of neural scaling laws. In Proceedings of the 41st International Conference on Machine Learning, pages 4345–4382, 2024.
- Bi et al. [2024] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
- Chen et al. [2023] Mayee Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, and Christopher Ré. Skill-it! a data-driven skills framework for understanding and training language models. Advances in Neural Information Processing Systems, 36:36000–36040, 2023.
- Naïr et al. [2024] Marwa Naïr, Kamel Yamani, Lynda Lhadj, and Riyadh Baghdadi. Curriculum learning for small code language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 390–401, 2024.
- Platanios et al. [2019] Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom Mitchell. Competence-based curriculum learning for neural machine translation. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 1162–1172, 2019.
- Oh et al. [2024] Byung-Doh Oh, Shisen Yue, and William Schuler. Frequency explains the inverse correlation of large language models’ size, training data amount, and surprisal’s fit to reading times. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2644–2663, 2024.
- Varre et al. [2025] Aditya Varre, Gizem Yüce, and Nicolas Flammarion. Learning in-context $n$-grams with transformers: Sub-$n$-grams are near-stationary points. In International Conference on Machine Learning, pages 60924–60963. PMLR, 2025.
- Cagnetta et al. [2026] Francesco Cagnetta, Allan Raventós, Surya Ganguli, and Matthieu Wyart. Deriving neural scaling laws from the statistics of natural language. arXiv preprint arXiv:2602.07488, 2026.
- Allen-Zhu [2025] Zeyuan Allen-Zhu. Physics of language models: Part 4.1, architecture design and the magic of canon layers. arXiv preprint arXiv:2512.17351, 2025.
Appendix A Appendix
A.1 Why the token distribution in natural language is power-law
We now give a compact structural account of why token “difficulty” in natural language can be approximately power-law distributed. The “difficulty” of a token is not an intrinsic label of its surface form, but a dynamical quantity: for a token $x$, let $\tau(x)$ be the learning time at which its loss undergoes its main drop. A token is hard when the regularities needed to predict it are learned late. Since natural language contains long-tailed linguistic units and usage patterns [9,10], the relevant object is not the distribution of tokens alone, but the distribution of learnable regularities that tokens depend on. Let $\mathcal{R}$ be the set of latent regularities, and let $\ell(r)\in\{1,2,\dots\}$ denote the compositional level of regularity $r$. Because natural language is hierarchical and compositional, the number of candidate regularities grows rapidly with level, while each concrete high-level regularity becomes more specific and is reused less often. A minimal abstraction is
$$N_{\ell}\propto a^{\ell},\qquad f_{\ell}\propto q^{-\ell},\qquad a,q>1,$$
where $N_{\ell}$ is the number of regularities at level $\ell$ and $f_{\ell}$ is their typical frequency. Eliminating $\ell$ gives
$$N(f)\propto f^{-\alpha},\qquad\alpha=\frac{\log a}{\log q},$$
so a power-law tail arises from the combination of compositional growth in possible patterns and decreasing reuse frequency of each concrete pattern.
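The elimination of $\ell$ can be written out step by step. The lines below only restate the two proportionalities above, with constants of proportionality suppressed; the numerical values in the closing remark are illustrative and do not come from the paper.

```latex
\begin{aligned}
f_{\ell}\propto q^{-\ell}
  &\;\Longrightarrow\; \ell = -\frac{\log f}{\log q} + \text{const},\\
N_{\ell}\propto a^{\ell} = e^{\ell\log a}
  &\;\Longrightarrow\; N(f)\propto \exp\!\Big(-\frac{\log a}{\log q}\,\log f\Big)
   = f^{-\log a/\log q} = f^{-\alpha}.
\end{aligned}
```

As a sanity check with illustrative values $a=4$ and $q=2$, the exponent is $\alpha=2$, so halving the typical frequency quadruples the number of regularities at that frequency.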
To connect this linguistic structure to training dynamics, assume that the time needed to learn a regularity increases with its level or rarity. Let $\tau(r)$ be the learning time of regularity $r$. With an increasing map $T$ and, in the simplest case, exponential growth in learning time,
$$\tau(r)=T(\ell(r)),\qquad\tau_{\ell}\propto s^{\ell},\qquad s>1.$$
Since the cumulative number of regularities up to level $\ell$ scales as $a^{\ell}$ and $\ell$ grows logarithmically with $\tau$, the induced distribution over regularity learning times is heavy-tailed. Finally, a token becomes easy only after the regularities needed to predict it have been acquired. If $\mathcal{R}(x)$ denotes those regularities, then
$$\tau(x)\approx\max_{r\in\mathcal{R}(x)}\tau(r),$$
so a heavy-tailed learning-time distribution over regularities transfers to a heavy-tailed, approximately power-law distribution over token learning times. Under this view, token difficulty is the visible projection of hierarchical linguistic structure through the dynamics of model learning.
A.2 Alignment Between Learning Time and Macroscopic Data Consensus
Many prior studies and industrial practices have formed a set of stable macroscopic intuitions about how language data should be organized during pre-training [35,14,27]. Our goal here is to test whether these macroscopic consensus views reappear at the microscopic level as an ordering over token learning times $\tau$. If our interpretation is correct, then data associated with simpler and more reusable regularities should have smaller mean $\tau$, whereas data associated with rarer, more abstract, or more long-range regularities should have larger mean $\tau$. Fig. 7 shows that this is indeed what we observe. Even under mixed training, the model does not learn all token groups simultaneously, but instead self-organizes into a learning-time order that closely matches existing intuitions.
Figure 7: Alignment between token learning time and three macroscopic data regularities. Left: mean learning time $\tau$ across validation subsets from different domains. General-text domains such as books, papers, and general web data are learned earlier, while more specialized domains such as math, reasoning, and code are learned later. Middle: mean learning time as a function of token frequency bins. Higher-frequency tokens have smaller mean $\tau$, whereas lower-frequency tokens are learned later. Right: mean learning time as a function of the context length needed to recover $90\%$ of full-context predictive performance. Token instances that require longer contexts have systematically larger mean $\tau$.
A.2.1 From General Text to Reasoning-Heavy Data
At the macroscopic level, curriculum-learning studies have repeatedly suggested that reasoning-heavy data should not be injected too early or too aggressively, and that pre-training often benefits from a progression from general text to more specialized domains, as well as from basic syntax to more abstract reasoning [35,36]. The left panel of Fig. 7 provides a microscopic validation of this consensus through the learning-time spectrum. When we compute the mean $\tau$ for validation tokens grouped by domain, tokens from books, papers, and general web text appear earliest, while math and knowledge-oriented data lie in the middle, and explicit reasoning, QA, and code tokens appear later. In particular, code has the largest mean $\tau$, and reasoning-oriented data are also clearly shifted to the right relative to general-language domains.
This ordering suggests that even when all domains are mixed together during training, the model still follows an implicit internal curriculum rather than learning every domain uniformly from the start. General-language tokens, which mainly reflect common lexical and syntactic structure, tend to be acquired earlier. By contrast, tokens tied to abstract logic, symbolic manipulation, or domain-specific conventions tend to require later learning events. We also observe that Chinese tokens have slightly larger mean $\tau$ than English tokens, which may reflect differences in tokenization granularity, corpus balance, and the higher sparsity of some Chinese lexical or domain-specific patterns in the mixed pre-training distribution.
A.2.2 Token Frequency
At the macroscopic level, natural language is well known to have a long-tailed frequency structure, and models are widely believed to acquire high-frequency words and shallow patterns before rare tokens and more complex usages [37,38,27]. The middle panel of Fig. 7 shows that the learning-time analysis is highly consistent with this view. For each token, we compute its corpus frequency and then examine how mean $\tau$ changes across frequency bins. The overall trend is strongly decreasing: high-frequency tokens have smaller mean $\tau$, while low-frequency tokens are learned substantially later.
Although there is mild local fluctuation, the global negative correlation is clear across the entire frequency range. This means that frequent and widely reused tokens tend to trigger earlier learning events, whereas rare and long-tail tokens tend to populate the late part of the learning-time spectrum. In this sense, the long tail of token frequency is mirrored by a long tail in $p(\tau)$. The microscopic distribution of learning times therefore offers a direct dynamical explanation for why rare linguistic material contributes disproportionately to the late-training regime.
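The frequency-bin analysis can be illustrated with a short sketch. The binning rule here (one bin per decade of corpus frequency) and the function name are assumptions made for illustration; the text does not state how the bins were chosen, and the toy numbers are not measurements.

```python
import math
from collections import defaultdict


def mean_tau_by_log_freq_bin(freqs, taus, bins_per_decade=1):
    """Average learning time within log10-frequency bins.

    freqs: corpus frequency (count > 0) of each token.
    taus:  estimated learning time of each token.
    Returns {bin_index: mean tau}, with bins ordered from rare to frequent.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for f, t in zip(freqs, taus):
        b = math.floor(math.log10(f) * bins_per_decade)
        sums[b] += t
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Toy data in which rarer tokens are learned later.
toy_freqs = [5, 8, 50, 500, 900]
toy_taus = [4000, 3000, 2000, 1000, 500]
binned = mean_tau_by_log_freq_bin(toy_freqs, toy_taus)
```

A decreasing sequence of bin means, read from the rarest bin to the most frequent one, corresponds to the negative correlation described above.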
A.2.3 N-gram Regularities and Dependency Length
At the macroscopic level, previous work has argued that large language models first absorb low-order and short-range $n$-gram regularities, and only later transition to higher-order patterns and longer-range dependencies [27,39,40]. To test this microscopically, we assign each token instance a dependency length defined as the minimum context length required to recover $90\%$ of its full-context predictive performance. The right panel of Figure 7 shows a clear monotone trend: token instances that can be recovered from short contexts have smaller mean $\tau$, whereas token instances that require longer contexts have progressively larger mean $\tau$. The increase continues steadily from short-range cases to the $>256$-token regime.
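The dependency-length definition can be sketched as a small function. This is an illustration under stated assumptions: "predictive performance" is taken here to be any higher-is-better score (for example, the probability assigned to the correct token), which this section does not pin down, and the function name and the set of probed context lengths are hypothetical.

```python
def dependency_length(perf_by_context, full_perf, threshold=0.9):
    """Smallest context length that recovers `threshold` of full-context performance.

    perf_by_context: {context_length: performance}, higher is better.
    full_perf:       performance with the full context.
    Returns None if no probed context length reaches the threshold.
    """
    for k in sorted(perf_by_context):
        if perf_by_context[k] >= threshold * full_perf:
            return k
    return None


# Toy token instance whose prediction only becomes good with 64 tokens of context.
toy_perf = {1: 0.10, 4: 0.35, 16: 0.62, 64: 0.66, 256: 0.70}
toy_dep_len = dependency_length(toy_perf, full_perf=0.70)
```

Grouping token instances by this value and averaging their learning times would give a curve of the kind shown in the right panel of Figure 7.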
This pattern indicates that later learning events are not arbitrary, but are systematically associated with token instances whose prediction depends on more extended contextual structure. In other words, short local regularities tend to be learned earlier, while long-range and higher-order dependencies are learned later. The familiar macroscopic story of “short $n$-grams first, long dependencies later” is therefore visible again at the microscopic level as a rightward shift of mean $\tau$. Together with the domain and frequency results above, this supports the broader claim that many apparently distinct curriculum effects can be unified as differences in when tokens enter the learned set.
A.3 Experiments on Controllable Synthetic Data
A.3.1 Why Controllable Synthetic Data
We introduce controllable synthetic data to test the core causal claim of this paper in a setting where the underlying difficulty distribution is explicitly designed rather than only indirectly observed. On natural language corpora, one can measure token-level learning times after training, but it is difficult to intervene on the latent distribution that generates them while holding the rest of the training pipeline fixed. Synthetic data removes this ambiguity: the task family, the difficulty variable, the train-validation split, and the sampling distribution can all be specified a priori, so changes in the global loss curve can be traced back to changes in the learning-time spectrum much more cleanly.
To this end, we build our synthetic benchmark on top of the PhysicsLM synthetic pretraining playground [21,41]. We choose PhysicsLM because it provides a deterministic and fully programmable family of algorithmic tasks whose compositional structure is explicit at data-generation time. Among its tasks, we use Mano [41], a modular arithmetic reasoning task in which each example is a compositional expression built from addition, subtraction, and multiplication over a finite value space with modulus $23$. The target is the final answer token, whose vocabulary is restricted to tokens $5000$–$5022$. This setup is especially suitable for our purpose because the supervision is concentrated on a single answer token, so each training example induces a sharply defined learning event instead of mixing many heterogeneous supervised positions within one sequence.
The key controllable variable in Mano is the expression length, denoted by $L$, which measures the compositional depth of the arithmetic expression and therefore acts as an explicit difficulty variable. Larger $L$ requires more intermediate operations to be composed before the final answer can be predicted, so it naturally induces later and harder learning events. PhysicsLM allows us to specify the sampling distribution over $L$ directly in the data-generation config. This is the central notion of “control” in our synthetic setup: we keep the task semantics, tokenizer, model, optimizer, and training procedure fixed, and intervene only on the distribution over compositional difficulty. As a result, any systematic change in the aggregate loss curve can be attributed to a change in the distribution of learning times induced by the data distribution, rather than to a change in architecture or optimization.
A.3.2 Data Construction and Training Setup
Following this principle, we construct two training datasets, each containing $51.2$M non-overlapping Mano examples generated in incremental mode from deterministic seeds. The first is a baseline dataset with a uniform difficulty distribution over $L\in\{1,\dots,10\}$. The second is a power-law-skewed dataset whose sampling weights over $L$ are chosen to approximate a discrete distribution proportional to $L^{-2}$ on the same support. We use separate held-out validation sets generated from non-overlapping seed ranges, so training and validation examples are disjoint by construction. In the main synthetic runs, evaluation is performed on a $10$K uniform validation set for diagnosing how learning progresses across difficulty levels.
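The two sampling distributions over $L$ can be written down directly. The sketch below computes normalized weights proportional to $L^{-2}$ on $\{1,\dots,10\}$ alongside the uniform baseline; the function name is hypothetical, and how these weights are passed to the PhysicsLM data-generation config is not shown because this section does not describe that interface.

```python
def power_law_weights(levels, exponent=2.0):
    """Normalized sampling weights proportional to level ** -exponent."""
    raw = [level ** -exponent for level in levels]
    total = sum(raw)
    return [r / total for r in raw]


levels = list(range(1, 11))               # expression lengths L = 1, ..., 10
uniform = [1.0 / len(levels)] * len(levels)
skewed = power_law_weights(levels, exponent=2.0)

# Under the skewed distribution, L = 1 is sampled 100 times as often as L = 10.
ratio_first_to_last = skewed[0] / skewed[-1]
```

With these weights, roughly 65% of the skewed dataset consists of $L=1$ expressions, compared with 10% under the uniform baseline.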
All synthetic runs use the same training configuration so that the data distribution is the only intended source of variation. We train an $8$-layer, $8$-head transformer with hidden size $512$ and RoPE positional encoding for $400$K optimization steps using Adam with learning rate $3\times 10^{-5}$, batch size $128$, sequence length $64$, BF16 training, and a loss mask that supervises only the final answer token. This last choice is important: because the loss is concentrated on the answer token, the measured learning dynamics reflect when the model becomes able to solve the underlying compositional problem, rather than being diluted by easy prompt tokens.
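A few back-of-envelope quantities follow from the stated configuration. The arithmetic below is exact; the interpretations in the comments (a single pass over the data, and a chance-level loss measured in nats over only the 23 answer tokens) are assumptions, not statements from the paper.

```python
import math

steps = 400_000
batch_size = 128
examples_seen = steps * batch_size        # 51,200,000 training examples

# Answer tokens 5000..5022 inclusive correspond to the 23 residues mod 23.
answer_vocab = 5022 - 5000 + 1

# Cross-entropy of a uniform guess over the answer vocabulary, in nats.
# Assumes the model already restricts its prediction to the 23 answer tokens.
chance_loss = math.log(answer_vocab)      # about 3.14
```

The number of examples consumed matches the stated dataset size of $51.2$M, which would be consistent with roughly one pass over the training set.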
A.3.3 Learning Order Under a Uniform Difficulty Distribution
Figure 8: Synthetic learning dynamics under the uniform-Mano training distribution. From left to right: overall validation loss, validation loss by difficulty level, overall validation accuracy, and validation accuracy by difficulty level. The aggregate validation loss decreases smoothly but does not show a clear power-law regime; instead, under the uniform difficulty distribution it is much closer to a roughly linear decay over training. At the same time, the per-difficulty curves reveal a strict learning order: low-difficulty samples ($L1$, $L2$, $L3$) are mastered early, whereas higher-difficulty samples improve only later and remain substantially harder throughout training. The same progressive ordering is visible in accuracy as well.
Figure 9: Token-level loss trajectories on uniform-Mano. Each row corresponds to one difficulty level from $L1$ to $L10$, and each row shows five sampled tokens from that level. In each panel, the blue curve is the true token loss trajectory over training, the red dashed curve is its sigmoid fit, and the green curve is the average loss trajectory of the corresponding difficulty level. Several regularities are immediately visible. First, except for the easiest $L1$ cases that start decreasing before the main observation window, most tokens begin from a nearly common high-loss state and end near zero, as expected for a deterministic single-answer task. Second, individual token learning is concentrated in a localized transition interval rather than being uniformly distributed over training. Third, different tokens mainly vary in the timing and sharpness of this transition, while the sigmoid fit captures the dominant step-like behavior well. These observations motivate modeling training as a collection of token-level learning events and decomposing aggregate loss dynamics in terms of their distribution over learning times.
Before turning to token-level fitting, we first verify two basic empirical properties of the uniform-Mano setting, as shown in Figure 8. First, because the training distribution over difficulty levels is uniform rather than power-law, the aggregate validation loss does not exhibit the clear power-law shape observed in real-language pre-training. Instead, it decreases in a much more nearly linear manner over the main training regime. This point is important for our overall argument: the synthetic experiment is not designed to reproduce a power law by construction, but to show that when the underlying difficulty distribution is changed, the shape of the aggregate learning curve changes accordingly.
Second, the learning dynamics are strongly ordered by difficulty. When we evaluate the model separately on validation subsets of different expression lengths, we observe a clear progression from $L1$ to $L10$: the easiest examples are learned first, and the hardest examples are learned last. The same ordering appears consistently in both validation loss and validation accuracy. This shows that the synthetic data indeed contains a meaningful difficulty hierarchy, and that learning over this hierarchy is highly non-uniform rather than simultaneous across all levels. In other words, different parts of the data distribution enter the learned set at different times, which is precisely the condition needed for a nontrivial learning-time spectrum to emerge.
A.3.4 Token-Level Learning Dynamics
We next examine token-level loss trajectories under the same uniform-Mano setting. As shown in Figure 9, because each sample has a unique correct answer token and the loss is computed only on that answer position, the token-level dynamics are especially clean. Before a token is learned, the model is close to guessing among a fixed and small answer vocabulary, so most tokens start from nearly the same initial loss plateau. Empirically, this common starting point is visible for almost all difficulty levels except $L1$, whose easiest tokens often begin to decrease so early that part of the initial drop already occurs before the main observation window. At the other end, once the model has learned the corresponding rule and places essentially all probability mass on the unique correct answer, the terminal loss is naturally driven to zero.
More importantly, the trajectories show that learning is not spread uniformly across the entire training horizon. Instead, each token stays near an initial steady state, then undergoes a relatively sharp transition within a localized interval, and finally enters a second steady state near zero. Different tokens do not differ mainly in whether they exhibit such a transition, but in when this transition happens and how sharp it is. This is exactly the token-level signature of a learning event: a token contributes meaningfully to the aggregate loss decrease only around its own transition window, while contributing little before it is learned or after it has already saturated. From this perspective, the macroscopic loss curve is not a smooth primitive object, but the superposition of many asynchronous token-level learning events occurring at different times.
A.3.5 Token-level Learning Pulse
Figure 10: Scatter plots of token-level sigmoid parameters against learning time on uniform-Mano. Left: loss-drop magnitude $\Delta_{i}$ versus learning time $\tau_{i}$. Right: sharpness parameter $\lambda_{i}$ versus learning time $\tau_{i}$. Both quantities vary across tokens, showing that different tokens do not share exactly the same transition amplitude or speed. However, neither $\Delta_{i}$ nor $\lambda_{i}$ exhibits a strong systematic dependence on $\tau_{i}$, indicating that later-learned tokens are not qualitatively different from earlier-learned ones in a way that would by itself determine the global loss shape. This supports the view that the dominant macroscopic factor is the distribution of learning times rather than a strong temporal drift in token-specific pulse shape.
We then fit a sigmoid to every validation token and extract its three token-level parameters $(\tau_{i},\lambda_{i},\Delta_{i})$, corresponding respectively to learning time, transition sharpness, and loss-drop magnitude. These quantities summarize when each token is learned, how quickly the transition happens, and how much loss it removes once learned. Having obtained these parameters for all validation tokens, we next ask whether the macroscopic loss shape is driven mainly by variation in the token-specific pulse shape itself, or by the distribution of learning times across tokens.
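A per-token sigmoid fit of this kind can be sketched as below. The parameterization is an assumption, since the exact functional form is given in the main text rather than in this appendix: the loss is modeled as `floor + delta * sigma(-lam * (t - tau))`, and the fit uses a simple grid search over `tau` and `lam` with `delta` and `floor` solved in closed form. The fitting procedure actually used in the paper may differ (for example, a gradient-based fit or a logarithmic step axis).

```python
import math


def _logistic(z):
    """Numerically stable 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_loss(t, tau, lam, delta, floor=0.0):
    """Loss that drops from floor + delta to floor around t = tau."""
    return floor + delta * _logistic(-lam * (t - tau))


def fit_sigmoid(ts, losses, tau_grid, lam_grid):
    """Grid search over (tau, lam); delta and floor by linear least squares."""
    n = len(ts)
    mean_y = sum(losses) / n
    best = None
    for tau in tau_grid:
        for lam in lam_grid:
            s = [_logistic(-lam * (t - tau)) for t in ts]
            mean_s = sum(s) / n
            var_s = sum((x - mean_s) ** 2 for x in s)
            if var_s == 0.0:
                continue  # flat basis function, nothing to fit
            cov = sum((x - mean_s) * (y - mean_y) for x, y in zip(s, losses))
            delta = cov / var_s
            floor = mean_y - delta * mean_s
            sse = sum((floor + delta * x - y) ** 2 for x, y in zip(s, losses))
            if best is None or sse < best[0]:
                best = (sse, tau, lam, delta, floor)
    _, tau, lam, delta, floor = best
    return {"tau": tau, "lam": lam, "delta": delta, "floor": floor}


# Noise-free toy trajectory with a transition at step 40.
toy_ts = list(range(100))
toy_losses = [sigmoid_loss(t, tau=40, lam=0.5, delta=3.0, floor=0.1) for t in toy_ts]
toy_fit = fit_sigmoid(toy_ts, toy_losses, tau_grid=range(0, 100, 5),
                      lam_grid=[0.1, 0.25, 0.5, 1.0])
```

Solving for `delta` and `floor` in closed form keeps the search two-dimensional, which is one simple way to make fitting many tokens inexpensive.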
We first examine the relationship between the fitted shape parameters and learning time. As shown in Figure10, bothλi\lambda_{i}andΔi\Delta_{i}vary across tokens, meaning that different tokens indeed have different drop magnitudes and different transition speeds. However, neither quantity shows a strong systematic trend withτi\tau_{i}. In other words, although token-level learning events are not literally identical, their local shape variation is not organized primarily along the learning-time axis. This already suggests that the timing distribution may be more important than the detailed per-token pulse parameters in explaining the global loss trajectory.
Figure 11:Similarity of learning pulses across learning times on uniform-Mano.Left: the upper panel shows the averaged learning pulse for different learning-time groups on the original training axis, while the lower panel shows the same pulses after aligning their centers. The strong collapse after alignment indicates that pulses learned at different times have nearly the same shape up to translation.Middle: singular-value spectrum of the aligned pulse matrix, together with the cumulative energy ratio; the first singular mode explains about86.3%86.3\%of the energy and the first two modes explain about94.8%94.8\%, showing that the aligned pulse family is strongly low-rank.Right: the top four singular vectors of the aligned pulse matrix. The leading mode closely matches the canonical pulse shape, while the higher modes account only for small residual variation. Together these results justify replacing token-specific pulses by a shared average templateggwhen reconstructing aggregate loss dynamics.We next visualize the learning pulse induced by tokens at different learning times and align them by their centers. Figure11shows that before alignment, the pulses appear at different positions along the training axis simply because different token groups are learned at different times. Once centered, however, these pulses almost collapse onto the same shape, revealing a high degree of invariance across learning times. We further quantify this by performing SVD on the matrix of center-aligned learning pulses. The resulting spectrum is strongly low-rank: the first singular mode alone explains about86.3%86.3\%of the energy, and the first two modes together explain about94.8%94.8\%. Moreover, the leading singular vector itself has essentially the same localized shape as the aligned learning pulse family. This provides direct evidence that, after averaging over individual token noise, learning pulses at different times are nearly the same object up to translation.
This observation is central for the decomposition developed in the main text. Although individual tokens can still have visibly different trajectories, their average learning pulse is highly stable once aligned by learning time. We can therefore use a shared template $g$ to represent the typical token-level learning event, and decouple the macroscopic loss decrease from the detailed shape of each token's drop. Under this view, the aggregate loss dynamics are governed primarily by how many tokens undergo this shared learning event at each time, namely by the distribution of learning times along the training axis.
A.3.6 From Learning-Time Spectrum to Loss Decrease Rate
Having established that token-level trajectories are well fit by sigmoids and that the aligned learning pulses are highly similar, we can now test the full reconstruction implied by our framework. The key question is whether the learning-time spectrum $p(\tau)$ already captures the dominant structure of the macroscopic loss decrease rate, once the token-specific pulse shape is replaced by a shared average template.
Figure 12: Reconstruction of synthetic aggregate dynamics from token-level sigmoid fits. Left: the true mean validation loss and the mean loss obtained by averaging token-wise sigmoid fits. The two curves are nearly indistinguishable, showing that the sigmoid decomposition faithfully captures the aggregate loss trajectory and that the dominant token-level learning events are already encoded by the fitted parameters. Middle: the true loss decrease rate and the empirical learning-time spectrum $p(\tau)$ on the same step axis. Their shapes are closely matched, indicating that the temporal distribution of learning events already tracks the main structure of macroscopic loss reduction. Right: comparison among the true loss decrease rate, the rate implied by the mean sigmoid fit, and the rate reconstructed from the shared average pulse $g$ together with $p(\tau)$. All three curves nearly overlap, and the reconstruction using the averaged pulse is effectively a smoothed version of the true rate. This shows that $p(\tau)$ captures the principal features of the loss decrease rate, and that once a shared pulse template is fixed, the aggregate rate is determined primarily by how learning events are distributed over time rather than by detailed token-specific pulse variations.
Figure 12 verifies this prediction directly. We first compare the true mean validation loss with the average of all token-wise sigmoid fits. The two curves are almost identical, showing that the fitted sigmoid family already provides an accurate description of the aggregate synthetic loss curve. This means the decomposition is not merely qualitatively plausible, but quantitatively faithful at the level of the overall loss.
We then place the empirical learning-time spectrum $p(\tau)$ and the loss decrease rate on the same step axis. Their distributions are closely matched: the times at which many tokens are learned coincide with the times at which the aggregate loss falls most rapidly. This is exactly the behavior predicted by the pulse-spectrum view, in which the rate of macroscopic improvement is governed by how densely token-level learning events are distributed over time.
Finally, using the shared average learning pulse together with $p(\tau)$, we reconstruct the loss decrease rate and compare it with both the true rate and the rate implied by the full sigmoid decomposition. All three curves nearly overlap, and the reconstruction from the averaged pulse is essentially a smoothed version of the true loss decrease rate. This shows that the dominant structure of aggregate loss reduction is determined mainly by the learning-time spectrum $p(\tau)$. By contrast, the precise shape of $g$ is secondary: once learning pulses are broadly similar, macroscopic loss decay is governed primarily by when tokens learn, not by the fine details of how each individual pulse is shaped.
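One way to picture this reconstruction is as a discrete convolution of the learning-time spectrum with the shared pulse. The sketch below is an illustration under stated assumptions rather than the authors' code: the logistic parametrization in `sigmoid_loss`, the names `drop` and `width`, the lognormal choice of learning times, and all numeric settings are hypothetical, and the per-token shape parameters are drawn independently of the learning time, mirroring the weak trend reported above.

```python
import numpy as np

def sigmoid_loss(t, tau, drop, width):
    """Hypothetical per-token curve: the loss falls by `drop` around step `tau`."""
    return drop / (1.0 + np.exp((t - tau) / width))

rng = np.random.default_rng(0)
steps = np.arange(1000)
n_tokens = 2000

# Assumed token parameters, sampled independently of the learning time.
taus = np.clip(rng.lognormal(mean=5.8, sigma=0.35, size=n_tokens), 150, 850).astype(int)
drops = rng.uniform(0.8, 1.2, size=n_tokens)
widths = rng.uniform(15.0, 25.0, size=n_tokens)

# "True" aggregate loss and its decrease rate (negative time derivative).
losses = sigmoid_loss(steps[None, :], taus[:, None], drops[:, None], widths[:, None])
mean_loss = losses.mean(axis=0)
true_rate = -np.gradient(mean_loss, steps)

# Learning-time spectrum p(tau): fraction of tokens learned at each step.
p = np.bincount(taus, minlength=steps.size) / n_tokens

# Shared template g: average of the center-aligned per-token pulses.
u = np.arange(-200, 201)
z = 1.0 / (1.0 + np.exp(u[None, :] / widths[:, None]))
g = (drops[:, None] / widths[:, None] * z * (1.0 - z)).mean(axis=0)

# Reconstructed rate: rate(t) is approximately sum over tau of p(tau) * g(t - tau).
reconstructed = np.convolve(p, g, mode="same")

rel_err = np.linalg.norm(reconstructed - true_rate) / np.linalg.norm(true_rate)
print(f"relative L2 error of the reconstruction: {rel_err:.4f}")
```

Because `g` has odd length and is centered at offset zero, `mode="same"` keeps the reconstructed rate on the original step axis. In this toy setting the error stays small precisely because pulse shape does not depend on learning time; making `widths` vary strongly with `taus` is a quick way to see where the shared-template approximation would start to degrade.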
A.3.7 Power-Law Synthetic Distribution
We finally repeat the synthetic analysis under a power-law data distribution to test whether the conclusions above depend on the uniform-difficulty setting. Concretely, we construct both the training and validation sets from the same Mano task family, but replace the uniform length distribution by a power-law distribution over difficulty levels. Both sets are drawn from the same probability mass over $L \in \{1, \dots, 10\}$, which is chosen to approximate a discrete law proportional to $L^{-2}$. As before, the task definition, model architecture, optimizer, and answer-only supervision remain unchanged, so the only intended intervention is the distribution of compositional difficulty.
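For concreteness, a discrete law proportional to $L^{-2}$ over ten levels can be built and sampled as follows. This is only a sketch: the text says the probability mass approximates such a law, so the exact normalization, the helper name `power_law_pmf`, and the sample size are assumptions.

```python
import numpy as np

def power_law_pmf(max_level=10, exponent=2.0):
    """Probability mass proportional to L**(-exponent) on levels 1..max_level."""
    levels = np.arange(1, max_level + 1)
    weights = levels.astype(float) ** (-exponent)
    return levels, weights / weights.sum()

levels, pmf = power_law_pmf()

# Draw difficulty levels for a hypothetical synthetic split.
rng = np.random.default_rng(0)
sampled = rng.choice(levels, size=100_000, p=pmf)

for level, prob in zip(levels, pmf):
    print(f"L={level:2d}  p={prob:.4f}")
```

Under exact normalization, level 1 carries roughly 65% of the mass and each level $L$ is $L^2$ times rarer than level 1, so the hardest levels contribute only a small fraction of the samples.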
Figure 13: Overview of synthetic learning dynamics under the power-law Mano distribution. First: validation loss by difficulty level. Even under the power-law data distribution, the losses of different difficulty levels still exhibit a clear learning order, with easier levels improving earlier and harder levels improving later. Second: learning pulses associated with different learning times. The pulse family remains localized and qualitatively similar across the training axis. Third: mean validation loss under the power-law distribution, showing the true loss together with the sigmoid-implied loss; the two curves remain closely matched. Fourth: validation-loss decrease rate under the power-law distribution, comparing the true decrease rate, the sigmoid-implied decrease rate, and the rate reconstructed from the averaged learning pulse together with $p(\tau)$. The close agreement among these curves shows that the pulse-spectrum decomposition continues to hold in the power-law setting as well.
Figure 13 shows that the main conclusions from the uniform synthetic setting remain valid after switching to a power-law data distribution. First, the by-difficulty validation losses still display a clear progression from easy to hard levels, so the synthetic task continues to induce an ordered hierarchy of learning events rather than simultaneous learning across all difficulties. Second, the learning pulses associated with different learning times remain sharply localized and visually similar, indicating that the shared-pulse approximation is not an artifact of the uniform-distribution experiment.
The aggregate-level conclusions also persist. The mean validation loss is again accurately captured by the average of token-wise sigmoid fits, showing that the token-level decomposition remains quantitatively faithful. Likewise, the true loss decrease rate, the sigmoid-implied rate, and the rate reconstructed from the averaged learning pulse together with $p(\tau)$ still track one another closely. This means that even when the underlying data distribution itself is power-law, the same mechanism continues to explain the macroscopic loss curve: the dominant structure of loss reduction is determined mainly by the distribution of learning times, while the detailed token-specific pulse shape plays a comparatively secondary role.
Limitations and Future Work.
This work makes a substantial step toward explaining the origin of neural scaling laws by resolving aggregate power-law behavior into token-level learning events and their learning-time spectrum, thereby providing a more mechanistic understanding of why macroscopic loss follows such regular forms. Despite this progress, several important directions remain open for future work. First, the current analysis is computationally expensive, as it requires training-recipe sweeping, dense checkpointing, and token-level trajectory fitting across multiple scaling axes, which currently limits broader coverage over larger experimental grids and additional model families. Second, our current data-reshaping experiments operate at the sample level rather than the token level, since autoregressive next-token prediction couples each target token with its preceding context during the forward pass. The present intervention should therefore be viewed as an initial demonstration that the measured learning-time signal is actionable, while finer-grained token-level reshaping and more complete curriculum-style scheduling remain important directions for future work.