InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
Summary
InfoLaw is a data-aware scaling framework that predicts model loss based on token consumption, model size, data mixture weights, and repetition, enabling efficient data-recipe selection under varying compute budgets.
Source: https://huggingface.co/papers/2605.02364
Abstract
Upweighting high-quality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. However, standard scaling laws do not reliably extrapolate across mixture recipes or under repetition, which leaves the selection of optimal data recipes at scale underdetermined. To solve this, we introduce InfoLaw (Information Scaling Laws), a data-aware scaling framework that predicts loss from consumed tokens, model size, data mixture weights, and repetition. The key idea is to model pretraining as information accumulation, where quality controls information density and repetition induces scale-dependent diminishing returns. We first collect model performance after training on datasets that vary in scale, quality distribution, and repetition level. We then build a model of information such that the accumulated information accurately predicts the observed model performance. InfoLaw predicts performance on unseen data recipes and larger-scale runs (up to 7B, 425B tokens) with 0.15% mean and 0.96% max absolute error in loss, and it extrapolates reliably across overtraining levels, enabling efficient data-recipe selection under varying compute budgets.
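The link between upweighting and repetition can be made concrete with simple arithmetic. The sketch below is not InfoLaw itself and uses no formula from the paper; it only computes how many passes each data source receives for a given mixture and token budget. The function name `epochs_per_source`, the source names, and all token counts are hypothetical values chosen for illustration.

```python
def epochs_per_source(weights, unique_tokens, budget):
    """Return the number of passes (epochs) made over each data source.

    weights:       mixture weight per source (normalized here to sum to 1)
    unique_tokens: number of unique tokens available in each source
    budget:        total number of tokens consumed during training
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("mixture weights must sum to a positive value")
    return {
        name: (weights[name] / total) * budget / unique_tokens[name]
        for name in weights
    }


# Hypothetical setup: a small high-quality source and a large web source.
unique_tokens = {"high_quality": 10e9, "web": 200e9}
budget = 100e9  # tokens consumed in the run (illustrative)

# Mild versus strong upweighting of the high-quality source.
mild = epochs_per_source({"high_quality": 0.2, "web": 0.8}, unique_tokens, budget)
strong = epochs_per_source({"high_quality": 0.6, "web": 0.4}, unique_tokens, budget)

for label, result in (("mild", mild), ("strong", strong)):
    print(label, {name: round(value, 2) for name, value in result.items()})
```

With these illustrative numbers, raising the high-quality weight from 0.2 to 0.6 raises the number of passes over that source from about 2 to about 6, while the web source stays below one pass. A larger token budget at a fixed mixture increases repetition in the same way, which is the overtraining case the abstract describes.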
Similar Articles
Model Merging Scaling Laws in Large Language Models
This paper establishes empirical scaling laws for language model merging, identifying power-law relationships between model size, expert count, and performance to enable predictive planning for optimal model composition.
Scaling Laws for Mixture Pretraining Under Data Constraints
This paper studies the trade-off between scarce target data and abundant generic data in mixture pretraining, finding that repetition is a key driver of performance and that mixture training tolerates 15-20 repetitions of target data. It introduces a repetition-aware scaling law to optimize mixture configurations under data constraints.
Scaling laws for neural language models
Foundational empirical study demonstrating power-law scaling relationships between language model performance and model size, dataset size, and compute budget, with implications for optimal training allocation and sample efficiency.
Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws
This paper studies data-constrained language model pretraining, proposing masked-input regularization (MIR) to improve validation loss and downstream performance, and SoftQ, a scaling law that better captures model-data interaction under repeated data.
Can LLMs Take Retrieved Information with a Grain of Salt?
This paper investigates how large language models adapt to the certainty of retrieved information, identifying systematic limitations in handling uncertainty. It proposes an interaction strategy that reduces obedience errors by 25% without modifying model weights.