Prescriptive Scaling Laws for Data Constrained Training

Hugging Face Daily Papers 05/02/26, 12:00 AM Papers

Summary

A modified scaling law accounting for data repetition effects provides compute-optimal training strategies for data-constrained scenarios, showing that beyond a point further repetition is counterproductive and compute is better spent on model capacity.

Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique. This limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice. Beyond a point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law's recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay (λ=1.0) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.
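The allocation argument above can be illustrated with a small numerical sketch. The abstract does not state the penalty's functional form, so the term `c * (epochs - 1)` below is a hypothetical placeholder, not the paper's law. The Chinchilla-style constants are the commonly cited fit from Hoffmann et al. (2022), and compute is approximated as 6 * N * D FLOPs; the helper names `loss` and `best_epochs` are likewise assumptions made for illustration.

```python
import numpy as np

# Commonly cited Chinchilla fit constants (Hoffmann et al., 2022); illustrative only.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params, unique_tokens, epochs, c=0.01):
    """Chinchilla-style loss plus an additive repetition penalty.

    ASSUMPTION: the linear penalty c * (epochs - 1) is a stand-in chosen for
    this sketch; the paper's actual penalty form is not given in the abstract.
    """
    total_tokens = unique_tokens * epochs
    base = E + A / n_params**ALPHA + B / total_tokens**BETA
    penalty = c * np.maximum(epochs - 1.0, 0.0)
    return base + penalty


def best_epochs(compute, unique_tokens, c=0.01):
    """Grid-search the epoch count that minimizes loss at a fixed FLOP budget."""
    epochs = np.linspace(1.0, 64.0, 2521)
    # Standard approximation: compute ~= 6 * params * tokens seen.
    n_params = compute / (6.0 * unique_tokens * epochs)
    losses = loss(n_params, unique_tokens, epochs, c)
    i = int(np.argmin(losses))
    return float(epochs[i]), float(n_params[i]), float(losses[i])


if __name__ == "__main__":
    for compute in (1e21, 1e23):
        e, n, l = best_epochs(compute, unique_tokens=1e9)
        print(f"C={compute:.0e}: epochs={e:.1f}, params={n:.2e}, loss={l:.3f}")
```

Under these assumed numbers, raising the budget a hundredfold barely moves the preferred epoch count while the preferred parameter count grows by well over an order of magnitude. This matches the qualitative claim that, past some point, extra compute is better spent on model capacity than on further repetition. Setting `c=0` recovers the unpenalized behavior, where the search keeps asking for more passes over the data.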

Original Article

Cached at: 05/08/26, 06:29 PM

Source: https://huggingface.co/papers/2605.01640

Abstract

A modified scaling law accounts for data repetition effects and provides compute-optimal training strategies for data-constrained scenarios.

Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique. This limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice. Beyond a point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law’s recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay (λ=1.0) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.
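Because overfitting is isolated in a single coefficient, comparing two training configurations reduces to comparing two fitted scalars. The sketch below shows one way such a comparison might be done. It assumes a hypothetical linear relation between excess loss and the number of extra epochs, and it uses synthetic, noise-free numbers chosen only to echo the roughly 70% figure quoted above. None of the values are data from the paper, and `fit_overfit_coefficient` is a name invented for this example.

```python
import numpy as np


def fit_overfit_coefficient(repeat_feature, excess_loss):
    """Least-squares fit of c in excess_loss ~= c * repeat_feature (no intercept)."""
    x = np.asarray(repeat_feature, dtype=float)
    y = np.asarray(excess_loss, dtype=float)
    return float(x @ y / (x @ x))


# SYNTHETIC illustration data, not measurements from the paper.
extra_epochs = np.array([1.0, 3.0, 7.0, 15.0])
excess_baseline = 0.010 * extra_epochs   # hypothetical run with standard weight decay
excess_strong_wd = 0.003 * extra_epochs  # hypothetical run with strong weight decay

c_base = fit_overfit_coefficient(extra_epochs, excess_baseline)
c_wd = fit_overfit_coefficient(extra_epochs, excess_strong_wd)
reduction = 1.0 - c_wd / c_base  # fractional drop in the overfitting coefficient
```

With real runs, `excess_loss` would be the gap between the measured loss and the loss predicted for the same token count without repetition, and the regressor would be whatever repetition feature the fitted law actually uses.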

Get this paper in your agent:

hf papers read 2605.01640

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Scaling Laws for Mixture Pretraining Under Data Constraints

Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

@lilianweng: A super long overdue (3+ years?) post on scaling laws. Compute is expensive. Scaling laws are a way to help us reason a…

Scaling Laws, Carefully (25 minute read)

Scaling laws for reward model overoptimization
