Prescriptive Scaling Laws for Data Constrained Training

Hugging Face Daily Papers

Summary

A modified scaling law accounting for data repetition effects provides compute-optimal training strategies for data-constrained scenarios, showing that beyond a point further repetition is counterproductive and compute is better spent on model capacity.

Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique. This limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice. Beyond a point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law's recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay (λ=1.0) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.

Cached at: 05/08/26, 06:29 PM

Source: https://huggingface.co/papers/2605.01640

Abstract

A modified scaling law accounts for data repetition effects and provides compute-optimal training strategies for data-constrained scenarios.

Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique. This limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice. Beyond a point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law’s recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay (λ=1.0) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.
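The qualitative behavior described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's fitted law: the abstract does not give the functional form of the overfitting penalty, so the linear-in-extra-epochs penalty `k * (epochs - 1)`, the coefficient values, the unique-token count, and the helper names are all hypothetical. The base term uses the Chinchilla parametric fit from Hoffmann et al. (2022), and compute is approximated by the common rule C ≈ 6·N·D.

```python
import numpy as np

# Chinchilla parametric fit (Hoffmann et al., 2022): L = E + A/N^alpha + B/D^beta
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_loss(n_params, n_tokens):
    """Loss predicted when every training token is unique."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def repeated_loss(n_params, unique_tokens, epochs, k):
    """Chinchilla loss plus a HYPOTHETICAL additive overfitting penalty.

    The penalty k * (epochs - 1) is an illustrative placeholder with a
    single coefficient k; it is not the functional form from the paper.
    """
    n_tokens = unique_tokens * epochs
    return chinchilla_loss(n_params, n_tokens) + k * (epochs - 1.0)


def best_allocation(compute, unique_tokens, k, max_epochs=1000.0):
    """Grid-search the epoch count that minimizes loss at fixed compute.

    Assumes compute ~= 6 * N * D, so choosing the epoch count fixes N.
    Returns (epochs, n_params, loss) at the minimum.
    """
    epochs = np.geomspace(1.0, max_epochs, 400)
    n_params = compute / (6.0 * unique_tokens * epochs)
    losses = repeated_loss(n_params, unique_tokens, epochs, k)
    i = int(np.argmin(losses))
    return float(epochs[i]), float(n_params[i]), float(losses[i])


if __name__ == "__main__":
    unique_tokens = 1e9  # hypothetical data budget
    for k in (0.0, 0.01, 0.003):  # no penalty, baseline, ~70% smaller
        for compute in (1e22, 1e24):
            ep, n, loss = best_allocation(compute, unique_tokens, k)
            print(f"k={k:<6} C={compute:.0e} epochs={ep:7.1f} "
                  f"N={n:.2e} loss={loss:.3f}")
```

Under these assumptions, the penalized optimum stops adding epochs as compute grows and puts the extra budget into parameters, whereas with `k = 0` the search keeps repeating data up to the grid limit. Shrinking `k` by about 70%, in the spirit of the weight-decay case study, raises the number of epochs that remain worthwhile.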

Get this paper in your agent:

hf papers read 2605.01640

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Scaling Laws for Mixture Pretraining Under Data Constraints

arXiv cs.LG

This paper studies the trade-off between scarce target data and abundant generic data in mixture pretraining, finding that repetition is a key driver of performance and that mixture training tolerates 15-20 repetitions of target data. It introduces a repetition-aware scaling law to optimize mixture configurations under data constraints.

Scaling Laws, Carefully (25 minute read)

TLDR AI

A comprehensive overview of scaling laws in deep learning, tracing their theoretical roots and empirical findings, and explaining how loss decreases predictably with model size, data, and compute.

Scaling laws for reward model overoptimization

OpenAI Blog

OpenAI researchers empirically study how reward model overoptimization affects performance, establishing scaling laws that show the relationship between proxy reward optimization and ground truth performance varies by optimization method and scales predictably with model size.