Prescriptive Scaling Laws for Data Constrained Training
Summary
A modified scaling law accounting for data repetition effects provides compute-optimal training strategies for data-constrained scenarios, showing that beyond a point further repetition is counterproductive and compute is better spent on model capacity.
Source: https://huggingface.co/papers/2605.01640
Abstract
A modified scaling law accounts for data repetition effects and provides compute-optimal training strategies for data-constrained scenarios.
Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique. This limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice. Beyond a point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law’s recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay (λ=1.0) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.
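The abstract does not give the functional form of the law, so the following is only a rough sketch of the idea, not the paper's method. It assumes the standard Chinchilla loss (with the constants reported by Hoffmann et al., 2022), the usual C ≈ 6ND compute approximation, and a hypothetical penalty that grows linearly with the number of extra epochs. The penalty shape, the coefficient values, the data and compute budgets, and all function names are illustrative assumptions.

```python
# Chinchilla fit constants reported by Hoffmann et al. (2022).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_loss(n_params, n_tokens):
    """Chinchilla loss, which treats every training token as unique."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def repeated_data_loss(n_params, unique_tokens, epochs, c_overfit):
    """Chinchilla loss on all tokens seen, plus an additive penalty.

    ASSUMPTION: the linear-in-extra-epochs penalty is a placeholder,
    not the form fitted in the paper. It is zero for a single epoch.
    """
    tokens_seen = unique_tokens * epochs
    penalty = c_overfit * (epochs - 1)
    return chinchilla_loss(n_params, tokens_seen) + penalty


def best_epochs(compute, unique_tokens, c_overfit, max_epochs=64):
    """Grid-search the epoch count that minimises loss at fixed compute.

    Uses C ~= 6 * N * D, so more epochs over the same data means a
    smaller model, and fewer epochs means a larger one.
    """
    def loss_at(epochs):
        n_params = compute / (6 * unique_tokens * epochs)
        return repeated_data_loss(n_params, unique_tokens, epochs, c_overfit)

    return min(range(1, max_epochs + 1), key=loss_at)


# Hypothetical budget: 1e9 unique tokens and 6e19 FLOPs.
UNIQUE_TOKENS, COMPUTE = 1e9, 6e19
C_BASE = 0.01                 # illustrative overfitting coefficient
C_STRONG_WD = 0.3 * C_BASE    # about 70% smaller, as in the case study

print("baseline epochs:", best_epochs(COMPUTE, UNIQUE_TOKENS, C_BASE))
print("strong weight decay epochs:",
      best_epochs(COMPUTE, UNIQUE_TOKENS, C_STRONG_WD))
```

Under these assumptions the search settles on a finite number of epochs: past that point the penalty outweighs the gain from seeing more tokens, and the remaining compute is better spent on parameters. A smaller coefficient shifts the optimum toward more repetition, which is how a single coefficient allows training configurations to be compared.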
Similar Articles
Scaling Laws for Mixture Pretraining Under Data Constraints
This paper studies the trade-off between scarce target data and abundant generic data in mixture pretraining, finding that repetition is a key driver of performance and that mixture training tolerates 15-20 repetitions of target data. It introduces a repetition-aware scaling law to optimize mixture configurations under data constraints.
Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws
This paper studies data-constrained language model pretraining, proposing masked-input regularization (MIR) to improve validation loss and downstream performance, and SoftQ, a scaling law that better captures model-data interaction under repeated data.
@lilianweng: A super long overdue (3+ years?) post on scaling laws. Compute is expensive. Scaling laws are a way to help us reason a…
Lilian Weng's blog post provides a comprehensive overview of scaling laws in deep learning, covering their derivation, compute-optimal allocation, and the debate between Kaplan et al. and Chinchilla.
Scaling Laws, Carefully (25 minute read)
A comprehensive overview of scaling laws in deep learning, tracing their theoretical roots and empirical findings, and explaining how loss decreases predictably with model size, data, and compute.
Scaling laws for reward model overoptimization
OpenAI researchers empirically study how reward model overoptimization affects performance, establishing scaling laws that show the relationship between proxy reward optimization and ground truth performance varies by optimization method and scales predictably with model size.