InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

Hugging Face Daily Papers

Summary

InfoLaw is a data-aware scaling framework that predicts model loss based on token consumption, model size, data mixture weights, and repetition, enabling efficient data-recipe selection under varying compute budgets.

Upweighting high-quality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. However, standard scaling laws do not reliably extrapolate across mixture recipes or under repetition, which leaves the choice of an optimal data recipe at scale underdetermined. To solve this, we introduce InfoLaw (Information Scaling Laws), a data-aware scaling framework that predicts loss from consumed tokens, model size, data mixture weights, and repetition. The key idea is to model pretraining as information accumulation, where quality controls information density and repetition induces scale-dependent diminishing returns. We first collect model performance after training on datasets that vary in scale, quality distribution, and repetition level. We then fit a model of information so that the accumulated information accurately predicts the observed performance. InfoLaw predicts performance on unseen data recipes and larger-scale runs (up to 7B parameters and 425B tokens) with 0.15% mean and 0.96% max absolute error in loss, and it extrapolates reliably across overtraining levels, enabling efficient data-recipe selection under varying compute budgets.
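
The link between upweighting and repetition described in the abstract can be illustrated with simple arithmetic. The sketch below is not InfoLaw itself; it only computes how many passes each data source receives for a given token budget and mixture. The function name `epochs_per_source`, the two-bucket corpus, and all sizes and weights are hypothetical values chosen for illustration.

```python
def epochs_per_source(total_tokens, weights, source_sizes):
    """Return the number of passes over each source for a token budget.

    Assumes tokens are sampled from each source in proportion to its
    mixture weight, so source i contributes weights[i] * total_tokens
    tokens drawn from a pool of source_sizes[i] unique tokens.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return {
        name: weights[name] * total_tokens / source_sizes[name]
        for name in weights
    }


# Hypothetical corpus: 50B high-quality and 950B lower-quality unique tokens.
sizes = {"high": 50e9, "low": 950e9}
budget = 400e9  # hypothetical training budget in tokens

# Sampling in proportion to source size: no source is repeated.
proportional = epochs_per_source(budget, {"high": 0.05, "low": 0.95}, sizes)

# Upweighting the high-quality bucket to half of the mixture.
upweighted = epochs_per_source(budget, {"high": 0.50, "low": 0.50}, sizes)

for label, result in [("proportional", proportional), ("upweighted", upweighted)]:
    print(label, {name: round(value, 2) for name, value in result.items()})
```

Under these assumed numbers, proportional sampling sees each bucket for about 0.4 of a pass, while the upweighted recipe traverses the high-quality bucket about 4 times. Raising the budget (overtraining) scales every count linearly, which is the regime where the abstract says repetition can start to hurt.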

Cached at: 05/13/26, 12:18 AM

Source: https://huggingface.co/papers/2605.02364

Similar Articles

Model Merging Scaling Laws in Large Language Models

Hugging Face Daily Papers

This paper establishes empirical scaling laws for language model merging, identifying power-law relationships between model size, expert count, and performance to enable predictive planning for optimal model composition.

Scaling Laws for Mixture Pretraining Under Data Constraints

arXiv cs.LG

This paper studies the trade-off between scarce target data and abundant generic data in mixture pretraining, finding that repetition is a key driver of performance and that mixture training tolerates 15-20 repetitions of target data. It introduces a repetition-aware scaling law to optimize mixture configurations under data constraints.

Scaling laws for neural language models

OpenAI Blog

Foundational empirical study demonstrating power-law scaling relationships between language model performance and model size, dataset size, and compute budget, with implications for optimal training allocation and sample efficiency.

Can LLMs Take Retrieved Information with a Grain of Salt?

arXiv cs.CL

This paper investigates how large language models adapt to the certainty of retrieved information, identifying systematic limitations in handling uncertainty. It proposes an interaction strategy that reduces obedience errors by 25% without modifying model weights.