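@AI_Whisper_X: The Bitter Lesson, round two: given enough compute, the best data filter is no filter. My main takeaway from this paper: is Rich Sutton's bitter lesson now coming for the data side? Stanford's Hashimoto has released "A Bitter Lesson for Data Fi…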

X AI KOLs Timeline 2026/05/24 10:43 Paper

data-filtering scaling-laws common-crawl pretraining large-language-models compute-scaling

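Summary

A Stanford University paper argues that, given enough compute, the best data filtering strategy is no filtering. Its experiments show that large models are robust to low-quality data and that the unfiltered pool performs better at larger scales. The conclusion applies to standard pretraining of dense models, however, and filtering still matters when compute is constrained.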

Cached: 2026/05/24 20:37

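The Bitter Lesson, round two: given enough compute, the best data filter is no filter.

My main takeaway from this paper: is Rich Sutton's bitter lesson now coming for the data side? Stanford's Hashimoto has released "A Bitter Lesson for Data Filtering", and the core conclusion fits in one sentence: given enough compute, the best data filter is no filter. Their point is that the data-cleaning pipelines the industry has spent years polishing may lose their edge once scaling is large enough. At least in this paper's setting, many filtering strategies that look sensible at small compute end up losing, once scaled up, to the crudest option: just use the full pool. The experimental design is not complicated. They scale down Common Crawl and its various filtered versions (lightly and heavily filtered) by the same proportion, then watch which pool ultimately trains the best model as model size and training steps grow. The result: in the experiments on a 670M-token CC subset, the unfiltered full pool beat every filtered version they tested. They then scaled the pool size up by two orders of magnitude, and at least for the CC vs RefinedWeb comparison the trend held. That said, their largest experiment used only a 10B-token pool, which is still a very small scale.

They also ran a more extreme test: injecting low-quality data into the training pool. ① First build a vocabulary of 10,000 random words, then sample from it at random to assemble documents. ② Completely shuffle the word order of CC documents. Of the two, the shuffled-word documents were injected at up to 8 times the size of the original pool. The result: sufficiently large models are remarkably robust to this kind of low-quality data. The most counterintuitive finding is that the shuffled-word documents not only failed to hurt the 330M model but actually helped it beat the pure CC pool (except for the +800% setting, which had not been trained long enough).

They also built a scaling law to make a prediction: the full 240T-token CC pool in DCLM-Pool could become the optimal choice as early as 1e30 FLOPs. And 1e30 is not that unimaginable. Pretraining compute for today's frontier models is on the order of 5e26 FLOPs, and existing forecasts suggest a single training run could reach 1e29 FLOPs by 2030. In other words, the tipping point where "no filtering is actually better" may be closer than we think. This echoes the core observation in Sutton's original essay: trying to encode your domain knowledge into the algorithm tends, in the long run, to lose to simpler methods that scale gracefully with compute.

But one caveat needs to be stated clearly: while compute is still the bottleneck, filtering still matters. More importantly, compute demand keeps growing as models get bigger, so we may never reach the day when compute is not the bottleneck, haha. The authors also spell out the scope: they study standard pretraining of dense models, with no data curricula, data weights, or post-training; MoE, synthetic data, and late-stage training data strategies may well be a different story. And from another angle, if the filtering itself were perfect, we could of course filter. https://arxiv.org/html/2605.19407v1#S6…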

A Bitter Lesson for Data Filtering

Source: https://arxiv.org/html/2605.19407v1

Christopher Mohri (Department of Computer Science, Stanford University), John Duchi (Departments of Statistics and Electrical Engineering, Stanford University), Tatsunori Hashimoto (Department of Computer Science, Stanford University)

Abstract

We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally “poor” data.

1 Introduction

The standard approach to select pretraining data for language models is to filter text from sources like Common Crawl (CC) (Common Crawl, 2024). It is widely documented that in compute-constrained regimes, where one must train on a subset of CC, different data selection strategies can have a large impact on performance. This is intuitive: all else equal, it seems natural to train on “higher-quality” data. As a result, a large body of research has emerged to tackle the data selection problem, with the goal of finding the best subset for pretraining language models (Albalak et al., 2024; Li et al., 2025a).

However, not only is large-scale filter ablation heuristic and expensive, but filtering removes data, which is at odds with scaling trends that prescribe ever-increasing amounts of data to improve model performance. For example, the heavily-filtered DCLM-Baseline dataset keeps ∼1% of the original CC, leading to about 3.8 trillion tokens (Li et al., 2025a). While this is still enormous, it falls short of the Chinchilla-optimal token budget for a 1 trillion parameter model, even after accounting for diminishing returns when epoching (Muennighoff et al., 2025). The current trend is also to over-train relative to Chinchilla-optimal, which prescribes even more tokens to allow for (relatively) smaller models that are financially feasible to serve (Sardana et al., 2025).

We begin by testing the hypothesis that data filtering is necessary at all in the large compute limit. While large-scale machine learning has moved toward task-agnostic pretraining (Raffel et al., 2023), and there is anecdotal evidence that larger computational budgets benefit from looser data filters (Goyal et al., 2024; Muennighoff et al., 2025), removing all data filtering would be an extreme intervention that uses data considered to be actively harmful (Raffel et al., 2023). Our goal in this work is to take this extreme seriously and study the limits of (low-quality) data for transformer pretraining.
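The token shortfall described above can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes the commonly quoted rule of thumb of roughly 20 training tokens per parameter for a Chinchilla-style budget (the paper does not state the ratio it uses), and it treats up to 4 epochs as equally useful as fresh data, which is a simplification of the repetition results attributed to Muennighoff et al.

```python
# Assumption (not from this paper): ~20 tokens per parameter as a
# rule-of-thumb Chinchilla-style budget.
TOKENS_PER_PARAM = 20

params = 1e12                  # 1 trillion parameter model
dclm_baseline_tokens = 3.8e12  # DCLM-Baseline size quoted in the text
useful_epochs = 4              # simplification: repeats count fully up to 4 epochs

chinchilla_tokens = TOKENS_PER_PARAM * params
effective_tokens = useful_epochs * dclm_baseline_tokens

print(f"rule-of-thumb budget:      {chinchilla_tokens:.2e} tokens")
print(f"DCLM-Baseline x {useful_epochs} epochs:  {effective_tokens:.2e} tokens")
print(f"budget / available:        {chinchilla_tokens / effective_tokens:.2f}x")
```

Under these assumptions the filtered dataset still falls short of the budget even with repetition, which is the tension the paragraph points at.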

We find evidence that rejects the hypothesis that data filtering is necessary, and that eventually, no existing data filter is likely to improve upon training directly on Common Crawl. In our experiments, we scale down both CC and its filtered versions to keep their relative sizes intact, and then scale computational resources for pretraining on these different datasets. Our two main levers to do so are scaling model size (which requires more compute per training step) and training steps (which eventually leads to epoching). When comparing the best achieved performance, regardless of computational cost, our main finding is that the full pool outperforms our selected filters.

Our findings are robust as we scale our experiments by 2 orders of magnitude, and we find that we can continue to see the effects from our small pool experiments as long as the models are sufficiently large. Furthermore, we find a predictable relationship between pool size, training steps, and model size which enables us to build scaling laws that predict how much compute is needed for no filter to be optimal for a particular pool size. Using this, we find that the 240 trillion token Common Crawl pool from DCLM-Pool may become optimal as soon as 1e+30 FLOPs.

These initial findings lead us to study the robustness of pretraining to “junk” data. Surprisingly, sufficiently large models are highly robust to irrelevant or junk data and can extract useful information even from highly noisy data. We test this using randomly generated strings and documents with shuffled word orders. While performance degrades at low compute budgets, sufficiently trained large models close the gap. Remarkably, these models even benefit from shuffled-word documents, despite only the unigram distribution of the documents remaining intact.

Overall, our experiments suggest that sufficiently large models that are trained for sufficiently long can benefit from the full CC dataset. While it is possible to construct harmful data, which could for example be non-factual content that looks identical to high-quality data, we do not find large amounts of this in CC. As a result, data filtering may suffer from the bitter lesson (Sutton, 2019) in which human-designed filters that perform well at the small scale are eventually replaced by simple, no-filter approaches that scale more gracefully with compute.

We structure the paper as follows. In Section 2, we provide the basic experimental setup, followed by experiments on filtering in Section 3. We then move to adding data to our CC pool in Section 4, and scaling the pool size in Section 5. We finish with edge cases in Section 6 and a theoretical model in Section 7 to provide a post-hoc explanation of the observed phenomena.

1.1 Related Work

Data-constrained pretraining. Several prior works consider the data-constrained pretraining regime. Muennighoff et al. (2025) derive scaling laws that factor data repetition into the original Chinchilla scaling laws, finding diminishing returns after around 4 epochs on the data and that adding code data and using looser perplexity-based filters mitigates data scarcity. However, the authors recommend filtering “noisy datasets” and train on subsets of C4 (Raffel et al., 2023), while the current work directly trains on (parsed) Common Crawl and finds evidence in support of no filtering. Kim et al. (2025) study the question of algorithmic improvements in a data-constrained but compute-unlimited setting. We share a similar experimental setup (where we take subsets of a dataset, scale compute on this subset, and then scale the subset size) but differ in the object of analysis (dataset filtering).

Loose data filters. The closest work to ours is Goyal et al. (2024), who argue that filter thresholds should depend on the compute budget, showing evidence for vision-language models. They derive a scaling law to predict the filtering threshold as a function of compute budget, and conclude that “less aggressive filtering is best” with “large compute” but do not identify the parameter scaling interactions that are critical to our work, and do not show our main findings that for language models, no filter can be the best filter. Fang et al. (2025) tackle a related question by artificially repeating “high-quality” data to match the scale of loosely filtered data. They find that the former can outperform the latter in low-compute regimes, but the high compute regime studied in this work remains fully speculative in their work. Finally, Gao (2021) finds that filtering aggressively can hurt performance, speculating that this follows from Goodhart’s law [1984], and Saada et al. (2025) find that filtering with a quality classifier may improve downstream benchmarks but not validation losses on “high-quality” data.

On the theoretical side, Cheng et al. (2024) develop theoretical models of the data cleaning process, arguing that given models that have enough fidelity to model noisy data generation schemes, it is better to not clean data, while cleaning data can yield more robust learning when models are not perfect. This prediction dovetails with our subsequent findings.

Low quality data. Recent works’ exploration of the impact of low-quality or intentionally degraded data on model performance motivates our experiments in Section 4. Allen-Zhu and Li (2024) find that “junk data” significantly reduces knowledge capacity in a synthetic data setting, which aligns with our findings on sufficient model sizes. Counterintuitively, Li et al. (2025b) argue that pretraining on toxic data leads to better representations, which makes it easier to remove toxic behavior during the post-training phase. Investigating the limits of data structure, Sinha et al. (2021) train on shuffled-word data similar to our shuffled-word experiments, arguing that the success of masked language models is primarily due to modeling “higher-order word co-occurrence statistics”. Finally, Ru et al. (2025) train models on randomly generated integers similar to our randomly generated text in Section 4, and notice only a small performance drop.

2 Preliminaries

We begin with our problem setup. Our goal is to measure the value of a dataset in terms of best possible performance, regardless of computational cost, on metrics of interest such as perplexity and downstream benchmarks. More formally, for a training algorithm $\mathcal{A}$ which accepts as arguments a dataset $D$ of any size, parameter count $M$, and training steps $N$, and outputs a model $\theta\in\Theta$ to be evaluated at a loss $\ell\colon\Theta\to\mathbb{R}$, our goal is to find the best achievable performance

$$\mathcal{L}^{\star}(D):=\min_{M,N}\;\ell(\mathcal{A}(D,M,N)), \tag{1}$$

as a function of the pretraining data. Our formulation has an unconstrained minimum over parameter count $M$ and training steps $N$ in an attempt to extract all the “juice” out of a dataset, no matter its size. Empirically, we compute this minimum by varying $M$ and $N$ over several orders of magnitude until either performance improvements start to plateau or we run out of compute.
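As a rough illustration of how the minimum in (1) might be approximated in practice, the sketch below takes the best loss over a finite grid of model sizes and step counts. Everything here is hypothetical: `best_achievable_loss` and `toy_train_and_eval` are names invented for this example, and the toy loss surface is a made-up stand-in for real training runs, not a result from the paper.

```python
from itertools import product

def best_achievable_loss(train_and_eval, dataset, model_sizes, step_counts):
    """Approximate L*(D) by the minimum loss over a finite (M, N) grid.

    Returns (loss, M, N) for the best configuration found.
    """
    best = None
    for M, N in product(model_sizes, step_counts):
        loss = train_and_eval(dataset, M, N)
        if best is None or loss < best[0]:
            best = (loss, M, N)
    return best

def toy_train_and_eval(dataset, M, N):
    # Hypothetical stand-in for A(D, M, N) followed by evaluation:
    # loss falls with model size and is U-shaped in steps, mimicking
    # overfitting when a small dataset is epoched too many times.
    sweet_spot = 4 * len(dataset)
    return 1.0 / M + (N - sweet_spot) ** 2 / sweet_spot ** 2

toy_dataset = list(range(10))  # placeholder for a tokenized subset
result = best_achievable_loss(toy_train_and_eval, toy_dataset,
                              model_sizes=[1, 2, 4],
                              step_counts=[10, 20, 40, 80])
print(result)
```

In a real study each grid point is a separate, expensive training run, so the grid is extended only until improvements plateau or the compute budget runs out, as the text describes.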

Since we do not have the compute budget to train on all of Common Crawl (let alone perform multiple epochs), our experiments are structured around randomly sampled subsets. Let $D_{cc}$ be the entire CC, $D_{cc,m}\subseteq D_{cc}$ be a randomly sampled subset of $m$ tokens, and $f(D_{cc,m})\subseteq D_{cc,m}$ be a filtered variant of the subset. In Section 3, we compare $\mathcal{L}^{\star}(D_{cc,m})$ and $\mathcal{L}^{\star}(f(D_{cc,m}))$ for standard filtering functions $f$ such as DCLM-Baseline and RefinedWeb and our smallest subset size $m$, to test if the commonly removed documents $D_{cc,m}\setminus f(D_{cc,m})$ are indeed helpful for improving performance. In Section 4, we test model robustness by injecting various “junk data” $J$ to form $D_{cc,m}\cup J$, challenging the hypothesis that $\mathcal{L}^{\star}(D_{cc,m})<\mathcal{L}^{\star}(D_{cc,m}\cup J)$ holds.

Our smaller scale experiments implicitly assume that the better of $\mathcal{L}^{\star}(f(D_{cc,m}))$ and $\mathcal{L}^{\star}(D_{cc,m})$ does not change (or at least changes predictably) with $m$, which allows us to scale down and study the function $\mathcal{L}^{\star}$ at reasonable compute budgets. To investigate whether this is indeed the case, and understand how performance changes as a function of $m$, $M$, and $N$, we additionally scale over the pool size $m$ in Section 5.

2.1 Experiment details

We use the version of Common Crawl provided by Li et al. (2025a) in their DCLM-Pool dataset, which is all of CC before 2023 with text extracted from HTML via resiliparse (Bevendorff et al., 2018). This dataset is 240 trillion GPT-NeoX (Black et al., 2022) tokens and our randomly sampled subsets range from about 670 million to 10 billion tokens. When filtering, we use the code provided by Li et al. (2025a). We do not use any specialized data curricula or data weights.

Our models are Llama-style dense transformers ranging from 15 million to 7 billion parameters, trained with the Meta Lingua code repository (Videau et al., 2024). For each of the models, we tune the training step count and weight decay, following prior studies to increase repeatability of the data (Fang et al., 2025; Kim et al., 2025). As is standard, we set the learning rate to decay with model size (Brown et al., 2020; Kaplan et al., 2020), with an initial tuning stage to determine the decay. We release our configuration files on GitHub (https://github.com/chrismohrii/bitter-lesson-data-filtering).

Our main metrics of interest are the loss (negative log-likelihood) on various datasets, since this is known to correlate with downstream performance and provides smoother measurements than common question-answering benchmarks (likely due to their small size). These datasets are the English portion of C4 (Raffel et al., 2023), Fineweb-Edu (Penedo et al., 2024), which is a pretraining dataset targeting educational texts, and Cosmopedia (Ben Allal et al., 2024), a dataset of synthetically-generated texts. We primarily plot the average loss across these three, but the trends are the same for each individually as well. We also provide results on common benchmarks such as ARC-Easy (Clark et al., 2018) and PIQA (Bisk et al., 2019) in Appendix B. Since our experiments use pool sizes of only up to 10B tokens, we do not expect to suffer from test set contamination.

3 Data Filtering

In this section, we test the hypothesis that standard filtered versions of CC achieve a lower loss than the unfiltered CC. Returning to our formulation in (1), when $D_{cc,m}$ is an $m$-token subset of CC and $f$ is a filtering function, we are interested in the best of $\mathcal{L}^{\star}(D_{cc,m})$ and $\mathcal{L}^{\star}(f(D_{cc,m}))$. While we evaluate a representative set of standard and relaxed filters, an exhaustive search over the exponential space of subsets is computationally intractable. Our objective is instead to benchmark open curation strategies against the pool and identify if models are able to extract signal from “low-quality” data.

We focus on our smallest CC pool size (about 670 million tokens) where ablations are the cheapest, and curate five filtered versions of this pool by applying the filters described below, all of which are used in Li et al. (2025a). The first three are individual filters applied in the initial “heuristic cleaning” stage of DCLM-Baseline, and ablating them alone gives us pretraining datasets that are larger and more loosely filtered than standard. The fourth gives the end result of the “heuristic cleaning” stage, and the last gives the result of the full filtering pipeline.

Figure 1: Comparison of models on 670M token CC pool and five filtered subsets. For sufficiently large models (330M+), the unfiltered pool (black) outperforms all five filters (colors) after sufficiently many optimization steps (x-axis, tokens under multiple epochs).

Figure 2: Pareto frontier of Figure 1 showing that in high-compute regimes, pool becomes optimal.

English filter. This filter first obtains an English score for a document using a fastText classifier (Joulin et al., 2016) and then applies a threshold to this score. According to our tokenizer, 28.2% of the data is left after applying this filter.

Repetition filter. This filter originates from the data curation stage of the Gopher model, with the motivation that “excessive repetition is often linked with uninformative content” (Rae et al., 2022). It splits documents into segments of various granularities, such as lines, paragraphs, or n-grams, and applies a threshold on the duplicate fraction of these segments. According to our tokenizer, 45.3% of the data is left after applying this filter.

Stop word filter. This filter ensures that a document contains at least 2 occurrences of English stop words from the following list: “the”, “be”, “to”, “of”, “and”, “that”, “have”, and “with”. According to our tokenizer, 50.4% of the data is left after applying this filter.

RefinedWeb. This consists of the filters above along with other similar filters, in an attempt to reproduce the RefinedWeb dataset (Penedo et al., 2023). According to our tokenizer, 13% of the data is left after applying this filter.

DCLM-Baseline. This dataset applies deduplication and quality-based filtering with a fastText classifier to RefinedWeb. According to our tokenizer, 2.1% of the original pool data is left after applying this filter. We address questions of severe data scarcity in Appendix B.
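To make two of the heuristic filters above concrete, here is a minimal sketch of a stop word filter and a line-level repetition check. It is not the DCLM implementation: the whitespace tokenization, the reading of "2 occurrences" as a raw count, the restriction to line-level duplicates, and the `max_dup_line_fraction` default are all assumptions made for illustration.

```python
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}

def passes_stop_word_filter(text, min_occurrences=2):
    """Keep a document if it has at least `min_occurrences` stop words.

    Assumption: lowercase + whitespace splitting; the real pipeline may
    tokenize differently or count distinct stop words instead.
    """
    words = text.lower().split()
    hits = sum(1 for w in words if w in STOP_WORDS)
    return hits >= min_occurrences

def duplicate_line_fraction(text):
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    seen = set()
    duplicates = 0
    for ln in lines:
        if ln in seen:
            duplicates += 1
        else:
            seen.add(ln)
    return duplicates / len(lines)

def passes_repetition_filter(text, max_dup_line_fraction=0.3):
    # The threshold is a hypothetical default, not a value from the paper.
    return duplicate_line_fraction(text) <= max_dup_line_fraction

print(passes_stop_word_filter("The capital of France is Paris"))
print(passes_stop_word_filter("htb hqovl bwdws wesqae wcb"))
print(duplicate_line_fraction("menu\nmenu\nhome\nmenu"))
```

Even this simple version shows why such filters are aggressive: any document without a couple of common English function words is discarded outright, regardless of what else it contains.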

In Figure 1, we show the average loss for each dataset as compute is varied with both model size and training steps. Each point consists of a separate training run, with its own warmup and cosine decay learning rate schedule. Overall, the pool (CC) reaches the best loss of 3.37 on the 1B model, and its loss has not visibly plateaued from scaling model size. Outperforming the filtered datasets requires both a sufficiently large model and a sufficiently large training step count. While we have not trained the 15M model until the loss starts to increase again because the loss continues to decrease even at a training budget of 100B total tokens, it does not appear as though the pool will ever outperform any of the first four filtered datasets. As we transition to the larger models, we observe crossing points on the loss curves between the pool and filtered versions, and these crossing points appear earlier as model size increases.

In Figure 2, we take the same runs from Figure 1 and derive a compute-performance Pareto frontier. We calculate the compute for a run with the standard $6NM$ approximation (Kaplan et al., 2020), where $N$ is the number of total training tokens and $M$ is the number of model parameters. As compute is increased, the pool transitions from the worst-performing dataset to the best. Perhaps surprisingly, not all datasets enjoy a point on the overall Pareto frontier: at every given compute level, there are at least two better-performing datasets than the repetition filtered dataset.
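The compute estimate and frontier construction described above can be sketched as follows. The function names and the run list are hypothetical, and the loss values are arbitrary placeholders chosen to show the mechanics, not numbers from the paper.

```python
def training_flops(total_tokens, n_params):
    """Standard approximation C = 6 * N * M (N tokens, M parameters)."""
    return 6 * total_tokens * n_params

def pareto_frontier(runs):
    """Return the runs not dominated in (compute, loss).

    `runs` is a list of (label, flops, loss). A run is kept if no run
    with less-or-equal compute achieves a lower-or-equal loss.
    """
    frontier = []
    best_loss = float("inf")
    # Sweep from cheapest to most expensive; keep strict improvements.
    for label, flops, loss in sorted(runs, key=lambda r: (r[1], r[2])):
        if loss < best_loss:
            frontier.append((label, flops, loss))
            best_loss = loss
    return frontier

# Placeholder runs (NOT paper results): compute in arbitrary units.
runs = [
    ("filtered", 1, 4.0), ("pool", 1, 5.0),
    ("filtered", 10, 3.8), ("pool", 10, 3.5),
    ("filtered", 100, 3.7), ("pool", 100, 3.2),
]
for label, flops, loss in pareto_frontier(runs):
    print(label, flops, loss)
```

In this toy input the filtered dataset owns the low-compute end of the frontier and the pool owns the rest, which is the qualitative pattern the text attributes to Figure 2.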

Overall, these experiments suggest that pretraining is surprisingly resilient. Even at our scale, we see that the pool eventually beats the performance of all the filtered variants. This can be counterintuitive, since we might expect some junk data to hurt model performance. To further explore this phenomenon, we create artificial low quality data to probe the limits of pretraining robustness in the next section.

4 Data Injection

We now test the limits of model robustness by deliberately injecting low-quality data. We investigate the hypothesis that the best achievable performance strictly degrades when curated “junk” distributions are added to the pretraining pool. More formally, if $D_{cc,m}$ is a subset of CC and $J$ represents the injected low-quality dataset, we are interested in the best of $\mathcal{L}^{\star}(D_{cc,m})$ and $\mathcal{L}^{\star}(D_{cc,m}\cup J)$. Our first variant of $J$ is designed to be devoid of any useful signal, and the second is designed to have some useful signal but of extremely low quality (examples in Figure 4).

Randomly generated strings. We define a vocabulary of 10,000 words by uniformly sampling 3 to 8 characters from the lowercase English alphabet (a-z). We then sample uniformly from these words and concatenate them with a space character to form documents.

Additional shuffled pool documents. We take additional CC documents that are not included in our CC subset and randomly shuffle the order of the words in each document.
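The two junk constructions above are simple enough to sketch directly. This is one possible reading, not the authors' code: it assumes word lengths are drawn uniformly from 3 to 8, that the vocabulary is deduplicated, that "words" are whitespace-separated, and that document length is a free parameter (`n_words` is a name invented here).

```python
import random
import string

def make_random_vocab(rng, vocab_size=10_000, min_len=3, max_len=8):
    """Build a vocabulary of distinct random lowercase words."""
    vocab = set()
    while len(vocab) < vocab_size:
        length = rng.randint(min_len, max_len)  # inclusive on both ends
        word = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        vocab.add(word)
    return sorted(vocab)

def random_string_document(rng, vocab, n_words):
    """Sample words uniformly from the vocabulary and join with spaces."""
    return " ".join(rng.choice(vocab) for _ in range(n_words))

def shuffle_words(rng, document):
    """Randomly permute the whitespace-separated words of a document."""
    words = document.split()
    rng.shuffle(words)
    return " ".join(words)

rng = random.Random(0)
vocab = make_random_vocab(rng, vocab_size=100)  # small size for a quick demo
print(random_string_document(rng, vocab, n_words=12))
print(shuffle_words(rng, "The capital of France is Paris"))
```

Shuffling preserves each document's bag of words, so only the unigram statistics survive, while the random-string documents carry no information about natural text at all.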

Figure 3: 670M-token CC pool versus junk-injected versions. Plots show a surprising robustness to random data (top) for large models with consistent gains from low-quality (shuffled) data (bottom) with sufficiently many epoched training steps (x-axis).

In Figure 3, we provide the comparison of the two new datasets alongside the CC pool when varying model size and training step count. We have included varying amounts of injected junk data, up to 8 times the pool size in the shuffled words case, leaving only about 10% of untouched CC documents. In both cases, it is immediate that the injected data has not completely reduced model performance to random performance, which would result in a cross-entropy loss or negative log-likelihood of $-\log(1/V)$ where $V$ is the vocabulary size, giving approximately 10.8 with our tokenizer.
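The random-guessing baseline quoted above is easy to reproduce. The snippet assumes a vocabulary of about 50,000 tokens, a round number chosen here because it is close to the size of the GPT-NeoX tokenizer; the exact vocabulary size is not given in the text.

```python
import math

V = 50_000  # assumed round vocabulary size, not stated in the paper

# A model that predicts every token uniformly at random pays -log(1/V)
# nats per token, i.e. log(V).
uniform_loss = -math.log(1 / V)
print(round(uniform_loss, 2))  # roughly 10.82
```

Any run whose loss sits well below this value has learned something beyond uniform guessing, which is the sense in which the junk-injected models are said not to have collapsed.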

For both dataset variants, a sufficiently large model is required to match the pool performance. With the 15M model, there is a separation in the loss curves, regardless of the ratio of injected documents. As we transition to larger models, this gap closes. On the 330M model, we even see that all of the shuffled datasets—except the +800% shuffled dataset—surpass the pool performance after around 11B training tokens. We have not trained the +800% shuffled dataset past 100B tokens, but we expect it will also surpass pool performance since its loss has not visibly plateaued. We also expect it to cross this threshold even earlier on the larger 1B model because of its faster-decreasing loss. In the case of the randomly generated strings, the random datasets appear to more closely match the performance of the pool, but overall, the gaps are still closing with model size.

Our intuition for the differences between these datasets is that the shuffled words are more “confusing” for a smaller model, whereas the randomly generated strings are more clearly drawn from a different distribution. As we scale model size, and thus perhaps the ability to differentiate between the two distributions, there is more signal to extract from the shuffled data as it contains additional unseen pool documents with the unigram distribution intact. If, for example, we shuffled the sentence “The capital of France is Paris”, we would still see “France” and “Paris” together, helping the model understand that there may be some connection between the two. We attribute the improved performance with +20% random to either a potential regularization effect or an unintended similarity to natural text, which generally features words of similar lengths separated by space characters.

this RC [English]WLtoys topics cannot You cannot and Quadcopter Instruction and Replies post attachments in your other …

htb hqovl bwdws wesqae wcb xkk xhkqfm jhvbvutr nqxm ykzpnklm trgikh nymn dcncwn osyrr zpvrrly yhdsrr nyvo ynx …

Figure 4: Examples of “low-quality” documents injected into CC pool. Left: documents with shuffled word order. Right: documents with randomly generated strings.

5 Scaling Pool Size

Do our experiments have implications for large-scale pretraining where the pool is all of CC? While suggestive, our 670M pool sample is quite far from the available internet stock of 200-500 trillion tokens (Villalobos et al., 2024), and scale effects could significantly change our conclusions.

To address these concerns, we turn to scaling studies that show our effects are consistent across scale by varying our pool size and model sizes across 2 orders of magnitude, and build up to a prediction of the compute threshold where the CC pool in DCLM-Pool (240T tokens) outperforms RefinedWeb. Due to the computational costs of these runs, we focus solely on the comparison between CC and $f=$ RefinedWeb, with the goal of making a prediction on the better of $\mathcal{L}^{\star}(D_{cc})$ and $\mathcal{L}^{\star}(f(D_{cc}))$.

Figure 5: Top: 1B model performance as we vary the pool size; the total needed steps for pool to outperform RefinedWeb grows rapidly. Bottom: Crossing point as a function of pool size for various model sizes. Markers each represent a crossing point (e.g. top panel), with text showing the epoch count. Epochs above the largest observed crossing point (121.6 epochs) are shaded to indicate unreliability at extreme epoch counts. Dashed lines show second-order polynomial fits used to interpolate data and show growth trends.

Understanding how pool size affects performance requires us to map out the joint space of pool size $m$, model parameters $M$ and step count $N$. As a simplifying first step, we represent step count as a function of the other two variables,

$$N^{\star}(M,m):=\min\left\{N\colon\ell(\mathcal{A}(D_{cc,m},M,N))<\min_{N'}\ell(\mathcal{A}(f(D_{cc,m}),M,N'))\right\},$$

where we have taken the minimal winning $N$ (if one exists) as the output of the function. Given our intuition and experimental evidence that performance improves with larger models when sweeping over step count (see Figures 1 and 3), this serves as a succinct representation of our 3 variable space.
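Read operationally, the definition of $N^{\star}$ amounts to a small search over measured loss curves. The sketch below assumes each curve is available as a dictionary from step count to final loss for one fixed model size and pool size; the function name and the toy curves are invented for illustration and are not data from the paper.

```python
def crossing_point(pool_curve, filtered_curve):
    """Smallest step count N at which the pool beats the filtered
    dataset's best loss over all of its step counts, or None.

    Each curve maps step count -> loss for a fixed model and pool size.
    """
    best_filtered = min(filtered_curve.values())
    winners = [n for n, loss in pool_curve.items() if loss < best_filtered]
    return min(winners) if winners else None

# Toy curves (placeholders): a larger model where a crossing exists...
pool_large = {1: 5.0, 2: 4.0, 4: 3.4, 8: 3.3}
filtered_large = {1: 4.5, 2: 3.8, 4: 3.5, 8: 3.6}
print(crossing_point(pool_large, filtered_large))

# ...and a smaller model where the pool never catches up.
pool_small = {1: 5.5, 2: 5.0, 4: 4.8, 8: 4.9}
filtered_small = {1: 5.0, 2: 4.6, 4: 4.5, 8: 4.7}
print(crossing_point(pool_small, filtered_small))
```

Returning `None` corresponds to the "if one exists" qualifier: with grid-based measurements the result is also only as fine-grained as the step counts that were actually run.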

Our step count function $N^{\star}$ has predictable behavior in both of its arguments. When we fix $M=1$B and increase the pool, we make two important observations. First, we see that the point at which the pool performance becomes better than RefinedWeb ($N^{\star}$) grows rapidly (top half of Figure 5), and the precise quantitative rate of growth is super-linear (roughly 10 epochs are needed for the 10B-token pool, compared to roughly three epochs for the 2B-token pool and one epoch for the 670M-token pool). Our second observation is that the validation losses are nonmonotone even with tuned weight decay regularization, suggesting that in extreme epochs (100+), the two may cease to cross.

Figure 6: Scaling laws for optimality of no data filtering. Two scaling laws with token-per-parameter scaling (in orange) and epoch constraints (in blue) both give highly linear scaling and predict similar budgets (1e+30 FLOPs).

We now also vary model size $M$ to understand the joint scaling behavior as model size grows with pool size. Figure 5 shows a sweep over $N^{\star}(M,m)$ with each panel varying $M$ and the x-axis varying $m$. On the leftmost plot with the 80M model, we can clearly see that crossing points cease to exist, even across our evaluated pool sizes: while there is a crossing point for the smallest 670M-token pool, there is no longer a crossing point on the largest 10B-token pool as indicated by the dark orange marker. As high-epoch regimes can become nonmonotone, we mark those regions in orange in the plot to indicate that they are unlikely to have any crossing points. As we scale up $M$, however, we see that the epoch counts needed for the pool to win rapidly decrease as a function of model size.

With these observations and our experimental measurements of $N^{\star}$, we can answer our question of what happens when we scale our pool sizes to the current CC pool size (240T tokens in DCLM-Pool). Are compute levels in the near future likely to reach a point where the entire CC pool is better than RefinedWeb? We follow a simple procedure to build a compute scaling law on top of our $N^{\star}$ function (Figure 5), fitting two types of scaling laws to be robust to misspecification. In our first approach, we start by specifying a token-to-non-embedding-parameter ratio (600:1, following DeepSeek V4). For each model size, this ratio immediately specifies the number of training steps $(N^{\star})$ as well as the compute level $(C=6MN^{\star})$. We can then estimate the pool size corresponding to this $N^{\star}$ for each model (using a fitted quadratic to interpolate among our observed data points as described in Appendix A.1) and build a scaling law against $C$. In our second approach, we instead specify an epoch count (4, based on Muennighoff et al. (2025)). The epoch count specifies a linear constraint which intersects $N^{\star}$ for each model at a single point (cf. the orange 120-epoch line in Figure 5). This point specifies the pool size and compute level, which we can then also use to build a scaling law.

In contrast to the training steps $N^{\star}$, our compute scaling laws are highly linear (Figure 6), with $R^{2}$ above 0.99, and both give similar predictions, near 1e+30 FLOPs for the crossing point. This compute level is quite high, with the best current estimates of frontier pretraining compute near 5e+26 (xAI, 2025), but this is far from an outlandish amount of near-future compute, with existing forecasts predicting 1e+29 FLOP training runs by 2030 (Owen, 2025).
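The first approach can be illustrated in two small pieces: turning a fixed token-to-parameter ratio into a token count and compute level, and fitting a straight line in log-log space. This is a sketch under assumptions, not the authors' procedure: the quadratic interpolation step from the appendix is omitted, the fit is assumed to be log-compute against log-pool-size, and the data points are an exact synthetic power law used only to check the fitting mechanics.

```python
import math

def ratio_constrained_budget(n_params, tokens_per_param=600):
    """Tokens and FLOPs implied by a fixed token-to-parameter ratio."""
    n_tokens = tokens_per_param * n_params
    flops = 6 * n_params * n_tokens  # C = 6 * M * N
    return n_tokens, flops

def fit_log_log_line(xs, ys):
    """Least squares fit of log10(y) = intercept + slope * log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def extrapolate(x, slope, intercept):
    """Evaluate the fitted power law at a new x."""
    return 10 ** (intercept + slope * math.log10(x))

print(ratio_constrained_budget(10**9))  # a 1B-parameter model at 600:1

# SYNTHETIC points on an exact power law y = 10 * x**2 (not paper data).
xs = [10, 100, 1000]
ys = [1e3, 1e5, 1e7]
slope, intercept = fit_log_log_line(xs, ys)
print(slope, intercept, extrapolate(1e4, slope, intercept))
```

With real measurements, `xs` would be the interpolated pool sizes and `ys` the corresponding compute levels per model size, and the extrapolation target would be the 240T-token pool.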

6Model Degradation

In all of our experiments so far, we have seen that regardless of the distribution, more data helps if we are free to train a sufficiently large model for sufficiently long. We should not expect this to be a universal property in machine learning, as a large body of research has been dedicated to the problem of domain adaptation and learning under distribution shift(Mansouret al.,2023; Awasthiet al.,2023). Instead, we hypothesize that language models are highly resistant to covariate shifts, and it is “incorrectly labeled” data or data with shifts in the conditional distribution from a target metric that can be detrimental. For example, we expect that a model trained on sufficient instances of “The capital of France is Copenhagen” will learn the wrong capital of France.

Table 1:Average GPT5-mini judgements on keyword-matched CC data for select MMLU categories. Refer to caption Figure 7:330M model: loss of670670M pool subset versus+200%+200\%dataset.While CC is too large to exhaustively search through and contains non-factual content such as conspiracy theories, we argue that such actively harmful content is relatively low frequency. We provide a very brief study to support this with a corpus analysis of MMLU-related documents in CC(Hendryckset al.,2021). We first match keywords, and then we prompt GPT5-mini to classify whether the document supports, refutes, is related, or is unrelated to the question and answer. We target MMLU subjects such as world religions, where there are very rare keywords. We present our analysis in Table1. While our search did find mostly unrelated or related but neither supporting nor refuting documents, the average number of documents in support is at least an order of magnitude larger than refuting. In AppendixC, we develop some theory to provide an analysis of when filtering should help, in terms of how factual or correctly labeled a dataset is.

We now move to a case of distribution shift from our experiments with shuffled word order documents in Section 4. Our metrics were the average validation loss across the entire sequence, but we may expect to suffer from a distribution shift with the loss on the initial tokens in a document, because we changed the distribution from the natural distribution of first tokens that appear in CC. In the case of predicting the very first token, it is impossible to detect whether a document is shuffled by having access only to the empty prefix.

In Figure 7, we compare the average CC validation loss for CC and the $+200\%$ shuffled dataset when we look at the loss on initial segments of the document. As we transition from the full average to the loss on only the very first token, the $+200\%$ shuffled dataset loses its advantage over the pool. We do not expect this behavior to change with larger models. However, as most use cases of language models involve more than just a few tokens, we do not anticipate that this is a meaningful degradation.
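The initial-segment comparison amounts to averaging per-token losses over only the first few positions. A minimal sketch, assuming per-token validation losses are available as a `(num_docs, seq_len)` array; the numbers below are hypothetical and chosen only to mimic the qualitative pattern described above, not taken from Figure 7.

```python
import numpy as np

def prefix_mean_loss(per_token_loss: np.ndarray, k: int) -> float:
    """Mean loss over the first k token positions, averaged over documents.

    per_token_loss[i, t] is the negative log-likelihood of token t in document i.
    """
    if not 1 <= k <= per_token_loss.shape[1]:
        raise ValueError("k must be between 1 and seq_len")
    return float(per_token_loss[:, :k].mean())

# Hypothetical losses: the second model ties on the first token
# and is better on later tokens.
pool_model = np.tile([5.0, 4.0, 3.0, 3.0], (3, 1))
shuffled_model = np.tile([5.0, 3.8, 2.8, 2.7], (3, 1))

# Positive gap means the shuffled-augmented model has lower loss.
gap_by_prefix = {
    k: prefix_mean_loss(pool_model, k) - prefix_mean_loss(shuffled_model, k)
    for k in (1, 2, 4)
}
```

In this toy setup the gap shrinks to zero as the prefix length goes to one token, which is the shape of the effect the paragraph describes.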

7 Theoretical Models

We might ask whether the results we have identified are predictable: ought we expect them? We present two theoretical models, one here and one in the appendices, that exhibit the behaviors we see, suggesting the types of behavior we identify might hold more broadly.

Heuristically, we might hypothesize that once a (transformer) model is large enough, it can pass “bad” data through components that do not interact with components representing “good” data, and when a model is not large enough, this cannot occur. Our experiments are consistent with this: large models absorb unfiltered data without penalty, while smaller models cannot. In low-rank matrix factorization—the simplest 1 hidden layer (linear) neural network—we see exactly this behavior at the population level.

To make this more precise, consider predicting vector-valued outputs $y$ (tokens) using a rank $r$ linear transformation of an input $x$. Assume the pairs $(x,y)\in\mathbb{R}^{d}\times\mathbb{R}^{m}$ come from one of $k$ tasks, where task $i$ occurs with probability $p_{i}>0$ and generates $Y=u_{\star,i}\,v_{\star,i}^{\top}X_{i}+\xi$ for independent mean-zero noise $\xi\in\mathbb{R}^{m}$, where $\Sigma_{i}=\mathbb{E}[X_{i}X_{i}^{\top}]$ satisfy $\operatorname{tr}(\Sigma_{i}\Sigma_{i'})=0$ for $i\neq i'$, so that tasks have orthogonal inputs: one may exactly separate them. The next proposition, whose proof is in Appendix C, follows.

Proposition 7.1 (Rank Necessity under Orthogonal Inputs).

Let the conditions above hold and $M_{\star,i}=u_{\star,i}\,v_{\star,i}^{\top}$, and define $M_{\star}=\sum_{i=1}^{k}M_{\star,i}$ and $\Sigma=\sum_{i=1}^{k}p_{i}\Sigma_{i}$. If $\sigma_{1}\geq\cdots\geq\sigma_{\rho}>0$ are the $\rho\leq k$ positive singular values of $M_{\star}\Sigma^{1/2}$, then for any model rank $r$,

$$\min_{\begin{subarray}{c}U\in\mathbb{R}^{m\times r}\\ V\in\mathbb{R}^{d\times r}\end{subarray}}\mathbb{E}\big[\|Y-UV^{\top}X\|^{2}\big]=\sum_{j=r+1}^{\rho}\sigma_{j}^{2}+\mathbb{E}\big[\|\xi\|^{2}\big],$$

where the sum evaluates to $0$ if $r\geq\rho$.

The result makes clear that, given a large enough model rank $r$, a matrix factorization can optimally represent the prediction problem (so long as $r\geq k$). On the other hand, without enough capacity ($r<\rho$), model performance necessarily degrades with interference of the tasks in $Y$-space, as the singular values of $M_{\star}\Sigma^{1/2}$ capture. Moreover, at least at this population level, (regularized) gradient-based methods are guaranteed to find the optimal matrices $U$ and $V$, because the objective $\mathbb{E}[\|Y-UV^{\top}X\|^{2}]$ has no non-strict saddle points when $r\geq k$ (Baldi and Hornik, 1988; Zhu et al., 2018), and gradient descent converges to local minimizers with probability 1 (Lee et al., 2016). In a fairly precise sense, then, this simple matrix factorization model exhibits much of the behavior we see in experiments: with enough capacity, noise (tasks) can be immediately absorbed, while smaller models suffer, and first-order methods are sufficient for optimal fitting.
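The proposition can be checked numerically on a small instance. The sketch below is an illustration under stated assumptions, not the authors' code: tasks use disjoint coordinate blocks with diagonal covariances (one simple way to get $\operatorname{tr}(\Sigma_i\Sigma_{i'})=0$), all sizes and probabilities are arbitrary, and the rank-$r$ minimizer is built as $M=A_r(\Sigma^{1/2})^{\dagger}$, the construction used in the Appendix C proof.

```python
import numpy as np

rng = np.random.default_rng(0)
k, block, m = 4, 3, 5            # tasks, input dims per task, output dim (arbitrary)
d = k * block
p = np.array([0.4, 0.3, 0.2, 0.1])

# Task i only uses its own coordinate block, so tr(Sigma_i Sigma_j) = 0 for i != j.
Sigmas, Ms = [], []
for i in range(k):
    diag = np.zeros(d)
    diag[i * block:(i + 1) * block] = rng.uniform(0.5, 2.0, size=block)
    Sigmas.append(np.diag(diag))
    u = rng.normal(size=m)
    v = np.zeros(d)
    v[i * block:(i + 1) * block] = rng.normal(size=block)  # v lies in range(Sigma_i)
    Ms.append(np.outer(u, v))

M_star = sum(Ms)
Sigma = sum(pi * S for pi, S in zip(p, Sigmas))
Sigma_half = np.sqrt(Sigma)      # elementwise sqrt is valid because Sigma is diagonal
A = M_star @ Sigma_half
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def excess_loss(M: np.ndarray) -> float:
    """g(M): population loss minus the noise term, summed task by task."""
    return float(sum(pi * np.trace((Mi - M) @ S @ (Mi - M).T)
                     for pi, Mi, S in zip(p, Ms, Sigmas)))

def best_rank_r(r: int) -> np.ndarray:
    """M = A_r (Sigma^{1/2})^+, with A_r the rank-r truncated SVD of A."""
    A_r = (U[:, :r] * s[:r]) @ Vt[:r]
    return A_r @ np.linalg.pinv(Sigma_half)

# For each rank r: (achieved excess loss, predicted tail sum of squared singular values).
results = {r: (excess_loss(best_rank_r(r)), float(np.sum(s[r:] ** 2)))
           for r in range(k + 2)}
```

Each pair in `results` should agree up to floating-point error, and the excess loss should reach zero once `r` is at least the number of tasks.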

8 Discussion

While we have identified ways that scaling compute appears to make filtering immaterial, there are several limitations that lead to natural next steps for research in this direction.

Deviations from vanilla pretraining. Our setting is restricted to pretraining on dense transformer models, without any data curricula, data weights, or post-training. There may be more unstable architectures such as Mixture of Experts models (MoEs), or phenomena in later stages of training, that require more careful choices with the pretraining data. Other changes, like pretraining on synthetic data, can also have an effect. If we view synthetic data as just augmenting “high quality” data and assume that “low quality” data does provide useful information for improving metrics, then including synthetic data may just increase the compute level for the crossing points as in Section 5 by providing more effective tokens. However, if low-quality data mainly acts as a regularizer, then synthetic data may be strictly better.

Duplicate documents. The expected fraction of duplicate documents increases with subset size. At our subset size, it is likely much smaller than in the entire CC. We do not expect that our general conclusions would change, especially as we epoch the data, but this is a variable that likely does not play a large role in our experiments.

Compute. The compute required for raw Common Crawl to outperform our tested filters is large, up to around $1e30$ FLOPs with our projections in Figure 6. When compute is a bottleneck, we expect filtering to still be important.

AI-generated content. We expect the fraction of AI-generated content in CC to increase, with likely a small amount in our pre-2023 DCLM-Pool dataset. It is unclear whether this will be detrimental.

Factuality. We have conducted an initial study into CC factuality or correctness with Table 1, but there are likely some rare edge cases where models trained on the full pool learn inaccuracies.

References

A. Albalak, Y. Elazar, S. M. Xie, S. Longpre, N. Lambert, X. Wang, N. Muennighoff, B. Hou, L. Pan, H. Jeong, C. Raffel, S. Chang, T. Hashimoto, and W. Y. Wang (2024). A survey on data selection for language models. arXiv:2402.16827.
Physics of language models: part 3.3, knowledge capacity scaling laws. arXiv:2404.05405.
P. Awasthi, C. Cortes, and C. Mohri (2023). Theory and algorithm for batch distribution drift problems. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, F. Ruiz, J. Dy, and J. van de Meent (Eds.), Proceedings of Machine Learning Research, Vol. 206, pp. 9826–9851.
P. Baldi and K. Hornik (1988). Neural networks and principal component analysis: learning from examples without local minima. Neural Networks 2, pp. 53–58.
L. Ben Allal, A. Lozhkov, G. Penedo, T. Wolf, and L. von Werra (2024). Cosmopedia.
J. Bevendorff, B. Stein, M. Hagen, and M. Potthast (2018). Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl. In Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018), L. Azzopardi, A. Hanbury, G. Pasi, and B. Piwowarski (Eds.), Lecture Notes in Computer Science, Berlin Heidelberg New York.
Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi (2019). PIQA: reasoning about physical commonsense in natural language. arXiv:1911.11641.
S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, J. Phang, M. Pieler, U. S. Prashanth, S. Purohit, L. Reynolds, J. Tow, B. Wang, and S. Weinbach (2022). GPT-NeoX-20B: an open-source autoregressive language model. In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, A. Fan, S. Ilic, T. Wolf, and M. Gallé (Eds.), virtual+Dublin, pp. 95–136.
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020). Language models are few-shot learners. arXiv:2005.14165.
C. Cheng, H. Asi, and J. Duchi (2024). How many labelers do you have? a closer look at gold-standard labels. arXiv:2206.12041.
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018). Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457.
Common Crawl (2024). Common crawl corpus. https://commoncrawl.org. Accessed: 2026-04-20.
A. Fang, H. Pouransari, M. Jordan, A. Toshev, V. Shankar, L. Schmidt, and T. Gunter (2025). Datasets, documents, and repetitions: the practicalities of unequal data quality. arXiv:2503.07879.
L. Gao (2021). An empirical exploration in quality filtering of text data. arXiv:2109.00698.
C. A. E. Goodhart (1984). Problems of monetary management: the uk experience. In Monetary Theory and Practice: The UK Experience, pp. 91–121. ISBN 978-1-349-17295-5.
S. Goyal, P. Maini, Z. C. Lipton, A. Raghunathan, and J. Z. Kolter (2024). Scaling laws for data filtering – data curation cannot be compute agnostic. arXiv:2404.07177.
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021). Measuring massive multitask language understanding. arXiv:2009.03300.
A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2016). Bag of tricks for efficient text classification. arXiv:1607.01759.
J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020). Scaling laws for neural language models. arXiv:2001.08361.
K. Kim, S. Kotha, P. Liang, and T. Hashimoto (2025). Pre-training under infinite compute. arXiv:2509.14786.
J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht (2016). Gradient descent only converges to minimizers. In Proceedings of the Twenty Ninth Annual Conference on Computational Learning Theory.
J. Li, A. Fang, G. Smyrnis, M. Ivgi, M. Jordan, S. Gadre, H. Bansal, E. Guha, S. Keh, K. Arora, S. Garg, R. Xin, N. Muennighoff, R. Heckel, J. Mercat, M. Chen, S. Gururangan, M. Wortsman, A. Albalak, Y. Bitton, M. Nezhurina, A. Abbas, C. Hsieh, D. Ghosh, J. Gardner, M. Kilian, H. Zhang, R. Shao, S. Pratt, S. Sanyal, G. Ilharco, G. Daras, K. Marathe, A. Gokaslan, J. Zhang, K. Chandu, T. Nguyen, I. Vasiljevic, S. Kakade, S. Song, S. Sanghavi, F. Faghri, S. Oh, L. Zettlemoyer, K. Lo, A. El-Nouby, H. Pouransari, A. Toshev, S. Wang, D. Groeneveld, L. Soldaini, P. W. Koh, J. Jitsev, T. Kollar, A. G. Dimakis, Y. Carmon, A. Dave, L. Schmidt, and V. Shankar (2025a). DataComp-lm: in search of the next generation of training sets for language models. arXiv:2406.11794.
K. Li, Y. Chen, F. Viégas, and M. Wattenberg (2025b). When bad data leads to good models. arXiv:2505.04741.
Y. Mansour, M. Mohri, and A. Rostamizadeh (2023). Domain adaptation: learning bounds and algorithms. arXiv:0902.3430.
N. Muennighoff, A. M. Rush, B. Barak, T. L. Scao, A. Piktus, N. Tazi, S. Pyysalo, T. Wolf, and C. Raffel (2025). Scaling data-constrained language models. arXiv:2305.16264.
D. Owen (2025). What will ai look like in 2030?. Epoch AI.
G. Penedo, H. Kydlíček, L. B. Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. V. Werra, and T. Wolf (2024). The fineweb datasets: decanting the web for the finest text data at scale. arXiv:2406.17557.
G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay (2023). The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv:2306.01116.
J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. van den Driessche, L. A. Hendricks, M. Rauh, P. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E. Elsen, S. Jayakumar, E. Buchatskaya, D. Budden, E. Sutherland, K. Simonyan, M. Paganini, L. Sifre, L. Martens, X. L. Li, A. Kuncoro, A. Nematzadeh, E. Gribovskaya, D. Donato, A. Lazaridou, A. Mensch, J. Lespiau, M. Tsimpoukelli, N. Grigorev, D. Fritz, T. Sottiaux, M. Pajarskas, T. Pohlen, Z. Gong, D. Toyama, C. de Masson d’Autume, Y. Li, T. Terzi, V. Mikulik, I. Babuschkin, A. Clark, D. de Las Casas, A. Guy, C. Jones, J. Bradbury, M. Johnson, B. Hechtman, L. Weidinger, I. Gabriel, W. Isaac, E. Lockhart, S. Osindero, L. Rimell, C. Dyer, O. Vinyals, K. Ayoub, J. Stanway, L. Bennett, D. Hassabis, K. Kavukcuoglu, and G. Irving (2022). Scaling language models: methods, analysis & insights from training gopher. arXiv:2112.11446.
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2023). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683.
J. Ru, Y. Xie, X. Zhuang, Y. Yin, Z. Guo, Z. Liu, Q. Ren, and Y. Zou (2025). Do we really have to filter out random noise in pre-training data for language models?. arXiv:2502.06604.
T. N. Saada, L. Bethune, M. Klein, D. Grangier, M. Cuturi, and P. Ablin (2025). The data-quality illusion: rethinking classifier-based quality filtering for llm pretraining. arXiv:2510.00866.
M. Sap, H. Rashkin, D. Chen, R. LeBras, and Y. Choi (2019). SocialIQA: commonsense reasoning about social interactions. arXiv:1904.09728.
N. Sardana, J. Portes, S. Doubov, and J. Frankle (2025). Beyond chinchilla-optimal: accounting for inference in language model scaling laws. arXiv:2401.00448.
K. Sinha, R. Jia, D. Hupkes, J. Pineau, A. Williams, and D. Kiela (2021). Masked language modeling and the distributional hypothesis: order word matters pre-training for little. arXiv:2104.06644.
R. S. Sutton (2019). The bitter lesson. Incomplete Ideas (blog). http://www.incompleteideas.net/IncIdeas/BitterLesson.html.
M. Videau, B. Y. Idrissi, D. Haziza, L. Wehrstedt, J. Copet, O. Teytaud, and D. Lopez-Paz (2024). Meta Lingua: a minimal PyTorch LLM training library.
P. Villalobos, A. Ho, J. Sevilla, T. Besiroglu, L. Heim, and M. Hobbhahn (2024). Will we run out of data? limits of llm scaling based on human-generated data. arXiv:2211.04325.
xAI (2025). Grok 4 model card.
Z. Zhu, Q. Li, G. Tang, and M. B. Wakin (2018). Global optimality in low-rank matrix optimization. IEEE Transactions on Signal Processing 66(13), pp. 3614–3628.

Appendix A Experimental Details

Hyperparameters. Across all experiments, we use a context length of $1024$ tokens, batch size of $2^{19}=524,288$ tokens, and a $500$ training step warmup. We provide model-specific details in Table 2. All runs have a weight decay tuned in $[0.1,0.5]$. The learning rates for the models were obtained with an initial tuning stage (and they also match the default learning rates for the 1B and 7B models in the Lingua repository).
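These hyperparameters fix how step counts translate into tokens and epochs. A small arithmetic sketch: the batch size, context length, and warmup come from the text, while treating the 670M-token pool as exactly 670,000,000 tokens is an assumption made for illustration.

```python
CONTEXT_LENGTH = 1024          # tokens per sequence (from the text)
BATCH_TOKENS = 2 ** 19         # 524,288 tokens per step (from the text)
WARMUP_STEPS = 500             # from the text
POOL_TOKENS = 670_000_000      # assumed exact size of the 670M-token CC pool

sequences_per_batch = BATCH_TOKENS // CONTEXT_LENGTH
warmup_tokens = WARMUP_STEPS * BATCH_TOKENS
steps_per_epoch = POOL_TOKENS / BATCH_TOKENS

def epochs_over_pool(steps: int, pool_tokens: int = POOL_TOKENS) -> float:
    """Number of passes over the pool after `steps` gradient steps."""
    return steps * BATCH_TOKENS / pool_tokens

epochs_at_4096_steps = epochs_over_pool(2 ** 12)
```

Under these assumptions one pass over the pool takes a little under 1,278 steps, so step counts at larger powers of 2 correspond to multiple epochs.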

Training details and compute. Throughout the plots (for example Figure 1), we vary the training steps as powers of $2$. We evaluate a model $5$ times during training and report the best checkpoint (which is almost always the last one, except for rare cases when the training steps are very large compared to data size). All experiments were conducted on H200 GPUs. Each run used only data parallelism on a single 8-GPU node, except for the 7B model which also uses FSDP, with run times varying from less than an hour up to 2-3 days. The combined cost of all our experiments exceeds 20,000 H200 GPU hours.

Table 2: Model architecture configurations.

### A.1 Scaling law fits

In several of our plots, we fit scaling laws to our empirically obtained measurements.

Figure 5. In the bottom half, we fit a second-degree polynomial to the log-log plot due to the super-linear and eventually infinite behavior. The hollow points on the plot are obtained from training runs at the given pool size, but with step counts prior to the crossing point. In those cases, we fit a power law to the (decaying) loss, and extrapolate the first training step or token count where the pool surpasses the best RefinedWeb loss (achieved or extrapolated). The cases where no crossing is ever predicted are marked with an orange “x” on the plot, and only appear on the 80M model size plot.
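The extrapolation step for the hollow points can be sketched as follows. This assumes a pure power law $L(n)=a\,n^{-b}$ with no irreducible-loss offset, which may differ from the functional form actually used; the loss values and the target loss are synthetic.

```python
import numpy as np

def fit_power_law(steps, losses):
    """Least-squares fit of loss = a * steps**(-b) in log-log space; returns (a, b)."""
    slope, intercept = np.polyfit(np.log(steps), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)

def crossing_step(a, b, target_loss):
    """Step count at which the fitted curve reaches target_loss, or None."""
    if b <= 0:  # loss is not decaying, so no crossing is ever predicted
        return None
    return (a / target_loss) ** (1.0 / b)

# Synthetic losses generated from a = 10, b = 0.1 at power-of-2 step counts.
steps = 2.0 ** np.arange(8, 13)
losses = 10.0 * steps ** -0.1

a, b = fit_power_law(steps, losses)
# Hypothetical "best RefinedWeb loss" to be matched by the pool.
n_cross = crossing_step(a, b, target_loss=4.0)
```

Returning `None` when the fitted exponent is non-positive plays the role of the "no crossing is ever predicted" marker.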

Figure 6. We use a standard power law, where the input is pool size and the output is the compute target. We use the number of non-embedding parameters as the model size when computing for example the 600 token/parameter ratio.
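A power law between pool size and crossing compute is a straight line in log-log space, so the fit and its $R^{2}$ can be sketched with ordinary least squares. All data points below are synthetic (an exponent of 1.5 with small fixed perturbations) and do not reproduce the values behind Figure 6.

```python
import numpy as np

def fit_loglog(x, y):
    """Fit log10(y) = slope * log10(x) + intercept; return (slope, intercept, r2)."""
    lx, ly = np.log10(x), np.log10(y)
    slope, intercept = np.polyfit(lx, ly, deg=1)
    residuals = ly - (slope * lx + intercept)
    r2 = 1.0 - residuals.var() / ly.var()
    return float(slope), float(intercept), float(r2)

def predict(x, slope, intercept):
    """Evaluate the fitted power law at pool size x."""
    return 10.0 ** (slope * np.log10(x) + intercept)

# Synthetic (pool size in tokens, crossing compute in FLOPs) pairs.
pool_tokens = np.array([1e8, 1e9, 1e10, 1e11])
noise = np.array([0.02, -0.03, 0.03, -0.02])   # fixed log10 perturbations
crossing_flops = 10.0 ** (1.5 * np.log10(pool_tokens) + 6.0 + noise)

slope, intercept, r2 = fit_loglog(pool_tokens, crossing_flops)
extrapolated = predict(1e13, slope, intercept)  # extrapolate beyond the fitted range
```

The same `predict` call, applied to a much larger pool size, is how a projected crossing compute would be read off such a fit.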

Appendix B Additional experiments

In this section, we begin with downstream benchmark results to complement the validation loss metrics from the main text. We use PIQA, ARC-Easy, and SocialIQA (Sap et al., 2019) as these have easy enough questions to provide signal at our scale.

In Figure 8, we provide plots analogous to those in Figure 1 but for the benchmarks, and Figure 9 is similarly analogous to Figure 2. We do the same for the injected datasets: Figure 12 shows the datasets with random injection and Figure 13 shows the datasets with shuffled word order. These plots are in general much noisier than the perplexity-based ones in the main text, likely due to the relatively small number of questions in the benchmarks. However, the trends are roughly the same.

Figure 8: Ablation of 670M token CC pool and five filtered versions. Each plot is a different model size and the total tokens x-axis corresponds to the number of gradient steps taken (with epoching).

Figure 9: Pareto frontier of compute vs. benchmark performance for CC pool and filtered datasets. The frontier is formed with the same runs as in Figure 8.

Finally, we address the potential confounding in Section 3 when we used the DCLM-Baseline filter on the 670M Common Crawl pool, which retains roughly 2% of the data and potentially results in severe data scarcity with respect to model size. While we did train a very small 15M parameter model in that setting, and note that no matter the subset size, DCLM-Baseline will always be about 2 orders of magnitude smaller than the pool, we provide an experiment here where we instead use 100M DCLM-Baseline tokens. This increases the size by roughly an order of magnitude. Figure 10 adds this artificially-increased DCLM-Baseline to Figure 1, and Figure 11 adds it to the Pareto curve of Figure 2. Even though the dataset now has more tokens than in the RefinedWeb subset, the pool and looser variants still outperform it with sufficient model size and training.

Figure 10: 670M token CC pool and five filtered versions. Each plot is a different model size and the total tokens x-axis corresponds to the number of gradient steps taken (with epoching). The arrow shows the change in DCLM-Baseline performance with about an order of magnitude more tokens.

Figure 11: Pareto frontier of compute vs. average negative log-likelihood for CC pool and filtered datasets. The frontier is formed with the same runs as in Figure 1. The arrow shows the change in DCLM-Baseline performance with about an order of magnitude more tokens.

Figure 12: 670M CC pool and random injection datasets. Each row is a downstream benchmark.

Figure 13: 670M CC pool and shuffled-word injection datasets. Each row is a downstream benchmark.

Appendix C Proofs and Additional Theory

We now restate Proposition 7.1 from Section 7 and give its proof.

Consider predicting vector-valued outputs $y$ (tokens) using a rank $r$ linear transformation of an input $x$. Assume the pairs $(x,y)\in\mathbb{R}^{d}\times\mathbb{R}^{m}$ come from one of $k$ tasks, where task $i$ occurs with probability $p_{i}>0$ and generates $Y=u_{\star,i}\,v_{\star,i}^{\top}X_{i}+\xi$ for independent mean-zero noise $\xi\in\mathbb{R}^{m}$, where $\Sigma_{i}=\mathbb{E}[X_{i}X_{i}^{\top}]$ satisfy $\operatorname{tr}(\Sigma_{i}\Sigma_{i'})=0$ for $i\neq i'$, so that tasks have orthogonal inputs: one may exactly separate them. We assume without loss of generality that $v_{\star,i}\in\operatorname{range}(\Sigma_{i})$. The next proposition follows.

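Proposition C.1 (Rank Necessity under Orthogonal Inputs).

Let the conditions above hold and $M_{\star,i}=u_{\star,i}\,v_{\star,i}^{\top}$, and define $M_{\star}=\sum_{i=1}^{k}M_{\star,i}$ and $\Sigma=\sum_{i=1}^{k}p_{i}\Sigma_{i}$. If $\sigma_{1}\geq\cdots\geq\sigma_{\rho}>0$ are the $\rho\leq k$ positive singular values of $M_{\star}\Sigma^{1/2}$, then for any model rank $r$,

$$\min_{\begin{subarray}{c}U\in\mathbb{R}^{m\times r}\\ V\in\mathbb{R}^{d\times r}\end{subarray}}\mathbb{E}\big[\|Y-UV^{\top}X\|^{2}\big]=\sum_{j=r+1}^{\rho}\sigma_{j}^{2}+\mathbb{E}\big[\|\xi\|^{2}\big],$$

where the sum evaluates to $0$ if $r\geq\rho$.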

Proof.

Let $M=UV^{\top}$. We first decouple the noise $\xi$:

$$\begin{aligned}
\mathbb{E}\big[\|Y-MX\|^{2}\big]
&=\sum_{i=1}^{k}p_{i}\,\mathbb{E}\big[\|M_{\star,i}\,X_{i}+\xi-M\,X_{i}\|^{2}\big]\\
&=\underbrace{\sum_{i=1}^{k}p_{i}\,\mathbb{E}\big[\|(M_{\star,i}-M)X_{i}\|^{2}\big]}_{=:g(M)}+\mathbb{E}\big[\|\xi\|^{2}\big],
\end{aligned}$$

where the scalar cross-term $2\,\mathbb{E}[\xi^{\top}(M_{\star,i}-M)X_{i}]=0$ vanishes by independence and zero mean. Since the noise term is independent of $M$, it suffices to minimize $g(M)$ over matrices of rank at most $r$. We rewrite the multi-task objective into a single-target objective:

$$\begin{aligned}
g(M)
&=\sum_{i=1}^{k}p_{i}\,\operatorname{tr}\!\big((M_{\star,i}-M)\,\Sigma_{i}\,(M_{\star,i}-M)^{\top}\big)\\
&=\sum_{i=1}^{k}p_{i}\,\operatorname{tr}\!\Big(\big(M_{\star}-M-\sum_{j\neq i}M_{\star,j}\big)\Sigma_{i}\big(M_{\star}-M-\sum_{j\neq i}M_{\star,j}\big)^{\top}\Big)\\
&=\operatorname{tr}\!\big((M_{\star}-M)\,\Sigma\,(M_{\star}-M)^{\top}\big)\\
&=\big\|(M_{\star}-M)\,\Sigma^{1/2}\big\|_{F}^{2},
\end{aligned}$$

where the cross terms in the second step vanish by $M_{\star,j}\Sigma_{i}=u_{\star,j}v_{\star,j}^{\top}\,\mathbb{E}[X_{i}X_{i}^{\top}]=0$ for $i\neq j$ since $v_{\star,j}^{\top}X_{i}=0$ almost surely.

Let $A=M_{\star}\Sigma^{1/2}$ have positive singular values $\sigma_{1}\geq\cdots\geq\sigma_{\rho}$. The rank-constrained minimization reduces to

$$\min_{\operatorname{rank}(M)\leq r}g(M)=\min_{\operatorname{rank}(M)\leq r}\|A-M\Sigma^{1/2}\|_{F}^{2}.$$

For any $M$ with $\operatorname{rank}(M)\leq r$, the matrix $B=M\Sigma^{1/2}$ has rank at most $r$. By the Eckart–Young–Mirsky theorem, the squared Frobenius distance between $A$ and any rank-$r$ matrix $B$ is at least $\sum_{j=r+1}^{\rho}\sigma_{j}^{2}$. This lower bound is exactly achievable: let $A_{r}$ be the rank-$r$ truncated SVD of $A$. Because $A$ is formed by right-multiplying by $\Sigma^{1/2}$, its row space and therefore the row space of $A_{r}$ lies entirely within $\operatorname{range}(\Sigma^{1/2})$. Thus, setting $M=A_{r}(\Sigma^{1/2})^{\dagger}$ yields a valid matrix with $\operatorname{rank}(M)\leq r$ that satisfies $M\Sigma^{1/2}=A_{r}$. The minimum achievable excess loss is therefore $\sum_{j=r+1}^{\rho}\sigma_{j}^{2}$, which vanishes if and only if $r\geq\rho$. ∎

C.1 Theoretical conditions for filter improvement

We now give a simple model that explains when filtering can improve or degrade performance. To understand this, we hypothesize that a sufficiently trained large model’s conditional distributions can be defined by a similarity measure $s\colon\mathcal{X}\times\mathcal{X}\to\mathbb{R}_{+}$ over test inputs $x\in\mathcal{X}$ and train inputs $x_{i}\in\mathcal{X}$ from a training dataset $D=\{(x_{i},y_{i})\}_{i}$:

$$\mathbb{P}_{D}(y\mid x):=\sum_{i\in D}\frac{s(x,x_{i})}{\sum_{j\in D}s(x,x_{j})}\,\mathbf{1}_{y_{i}=y}.$$

The conditional distribution is the fraction of examples in $D$ with the same label $y$, weighted by $s$. According to the definition, we can affect the model’s prediction at a given test input $x$ by including a similar $x'$ in the training dataset $D$.
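The similarity-weighted predictor can be made concrete with a toy example. The Gaussian kernel on scalar inputs is a hypothetical choice of $s$ made only for illustration; the model above leaves $s$ unspecified.

```python
import math
from collections import defaultdict

def similarity(x: float, xi: float, bandwidth: float = 1.0) -> float:
    """Hypothetical similarity s(x, x_i): a Gaussian kernel on scalar inputs."""
    return math.exp(-((x - xi) ** 2) / (2 * bandwidth ** 2))

def predict_dist(x: float, dataset) -> dict:
    """P_D(y | x): similarity-weighted fraction of training examples with label y."""
    weights = defaultdict(float)
    for xi, yi in dataset:
        weights[yi] += similarity(x, xi)
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

D = [(0.0, "a"), (0.1, "a"), (3.0, "b")]
before = predict_dist(2.9, D)
# Adding one training example close to the test input shifts the prediction toward its label.
after = predict_dist(2.9, D + [(2.9, "a")])
```

Here the two distant "a" examples barely influence the prediction at `x = 2.9`, while a single nearby "a" example moves most of the mass.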

Let us consider the error (in KL divergence) of this predictor $\mathbb{P}_{D}$ compared to a predictor using filtered data $\mathbb{P}_{\phi\circ D}$, where $\phi:\mathcal{X}\times\mathcal{Y}\to\{0,1\}$ is a filter and $\phi\circ D\subseteq D$. We use the notation $D_{|y}$ to denote the restriction of $D$ to examples $(x_{i},y_{i})$ with $y_{i}=y$ and $D_{|\neq y}$ to denote the restriction of $D$ to examples $(x_{i},y_{i})$ with $y_{i}\neq y$. Expectations are defined with respect to an $s(x,\cdot)$-weighted dataset; we assume the relevant weighted masses are nonzero so that the displayed conditional distributions and expectations are well-defined. We find a simple characterization of the error difference.

Fact C.2 (Characterization of Filter Improvement).

Given Dirac target conditional $\mathbb{P}_{t}(\cdot\mid x)$ with all mass on $y^{\star}$, the improvement of $\mathbb{P}_{\phi\circ D}$ with respect to $\mathbb{P}_{D}$ in KL divergence to $\mathbb{P}_{t}$ is

$$\mathrm{KL}(\mathbb{P}_{t}\,\|\,\mathbb{P}_{D})-\mathrm{KL}(\mathbb{P}_{t}\,\|\,\mathbb{P}_{\phi\circ D})=-\log\left(\mathbb{P}_{D}(y^{\star}\mid x)+\big(1-\mathbb{P}_{D}(y^{\star}\mid x)\big)\frac{\mathbb{E}_{D_{|\neq y^{\star}}}[\phi(X,Y)]}{\mathbb{E}_{D_{|y^{\star}}}[\phi(X,Y)]}\right).$$

Proof.

In the following, we drop the first argument to $s(\cdot,\cdot)$ to simplify notation. We first simplify the difference using the definition of KL divergence:

$$\mathrm{KL}(\mathbb{P}_{t}\,\|\,\mathbb{P}_{D})-\mathrm{KL}(\mathbb{P}_{t}\,\|\,\mathbb{P}_{\phi\circ D})=\sum_{y\in\mathcal{Y}}\mathbb{P}_{t}(y\mid x)\log\frac{\mathbb{P}_{\phi\circ D}(y\mid x)}{\mathbb{P}_{D}(y\mid x)}.$$

We analyze the likelihood ratio:

$$\begin{aligned}
\frac{\mathbb{P}_{D}(y\mid x)}{\mathbb{P}_{\phi\circ D}(y\mid x)}
&=\frac{\sum_{i\in D,\,y_{i}=y}\frac{s(x_{i})}{\sum_{j\in D}s(x_{j})}}{\sum_{i\in D,\,y_{i}=y}\frac{s(x_{i})\phi(x_{i},y_{i})}{\sum_{j\in D}s(x_{j})\phi(x_{j},y_{j})}}\\
&=\frac{\sum_{j\in D}s(x_{j})\phi(x_{j},y_{j})}{\sum_{j\in D}s(x_{j})}\cdot\frac{\sum_{i\in D,\,y_{i}=y}s(x_{i})}{\sum_{i\in D,\,y_{i}=y}s(x_{i})\phi(x_{i},y_{i})}\\
&=\frac{\mathbb{E}_{D}[\phi(X,Y)]}{\mathbb{E}_{D_{|y}}[\phi(X,Y)]}\\
&=\frac{\mathbb{P}_{D}(y\mid x)\,\mathbb{E}_{D_{|y}}[\phi(X,Y)]+\big(1-\mathbb{P}_{D}(y\mid x)\big)\,\mathbb{E}_{D_{|\neq y}}[\phi(X,Y)]}{\mathbb{E}_{D_{|y}}[\phi(X,Y)]}\\
&=\mathbb{P}_{D}(y\mid x)+\big(1-\mathbb{P}_{D}(y\mid x)\big)\frac{\mathbb{E}_{D_{|\neq y}}[\phi(X,Y)]}{\mathbb{E}_{D_{|y}}[\phi(X,Y)]}.
\end{aligned}$$

Plugging this back in, we find that the general difference is

$$-\sum_{y\in\mathcal{Y}}\mathbb{P}_{t}(y\mid x)\log\left(\mathbb{P}_{D}(y\mid x)+\big(1-\mathbb{P}_{D}(y\mid x)\big)\frac{\mathbb{E}_{D_{|\neq y}}[\phi(X,Y)]}{\mathbb{E}_{D_{|y}}[\phi(X,Y)]}\right).$$

Fact C.2 follows as a special case by setting $\mathbb{P}_{t}(y\mid x)=\mathbf{1}_{y=y^{\star}}$. ∎

The fact shows that two terms appear in the difference: the prevalence of the label $y^{\star}$ in the original dataset $\mathbb{P}_{D}(y^{\star}\mid x)$ and a measurement of filter performance via the ratio of the false positive rate to the true positive rate,

$$\frac{\mathbb{E}_{D_{|\neq y^{\star}}}[\phi(X,Y)]}{\mathbb{E}_{D_{|y^{\star}}}[\phi(X,Y)]}.$$

When $\mathbb{P}_{D}(y^{\star}\mid x)<1$, filtering improves the KL if and only if this ratio is less than 1. If the prevalence is already high, there is little improvement possible, and otherwise $\phi$ must distinguish correct from incorrect labels on the $s$-weighted dataset. In the case of CC, Table 1 suggests that the prevalence on select MMLU subjects is already high. In cases of strong filtering, e.g. removing $99\%$ of the data including all $x'$ with high $s(x,x')$, the true positive rate may approach zero, making the ratio large and the KL worse.
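Fact C.2 can be sanity-checked numerically by computing the KL improvement two ways: directly from the filtered and unfiltered predictors, and through the false-positive to true-positive ratio. The similarity weights, labels, and filters below are hypothetical values for a single fixed test input.

```python
import math

def kl_improvement_direct(weights, labels, keep, y_star):
    """KL(P_t || P_D) - KL(P_t || P_filtered) for a Dirac target on y_star."""
    p_d = sum(w for w, y in zip(weights, labels) if y == y_star) / sum(weights)
    kept = [(w, y) for w, y, k in zip(weights, labels, keep) if k]
    p_f = sum(w for w, y in kept if y == y_star) / sum(w for w, _ in kept)
    return math.log(p_f / p_d)

def kl_improvement_formula(weights, labels, keep, y_star):
    """Right-hand side of Fact C.2, using s-weighted true and false positive rates."""
    pos = [(w, k) for w, y, k in zip(weights, labels, keep) if y == y_star]
    neg = [(w, k) for w, y, k in zip(weights, labels, keep) if y != y_star]
    p_d = sum(w for w, _ in pos) / sum(weights)
    tpr = sum(w * k for w, k in pos) / sum(w for w, _ in pos)
    fpr = sum(w * k for w, k in neg) / sum(w for w, _ in neg)
    return -math.log(p_d + (1 - p_d) * fpr / tpr)

weights = [1.0, 2.0, 0.5, 1.5, 1.0]            # s(x, x_i) for the fixed test input
labels = ["y*", "y*", "other", "other", "other"]
good_filter = [1, 1, 0, 1, 0]                   # keeps all correct labels, drops some wrong
bad_filter = [1, 0, 1, 1, 1]                    # drops a correct label, keeps all wrong

good = (kl_improvement_direct(weights, labels, good_filter, "y*"),
        kl_improvement_formula(weights, labels, good_filter, "y*"))
bad = (kl_improvement_direct(weights, labels, bad_filter, "y*"),
       kl_improvement_formula(weights, labels, bad_filter, "y*"))
```

The first filter has a ratio below 1 and a positive improvement; the second has a ratio above 1 and makes the KL worse, matching the if-and-only-if condition stated above.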
