@BetaTomorrow: Title: A Bitter Lesson for Data Filtering Authors : Christopher Mohri , John Duchi, Tatsunori Hashimoto (@tatsu_hashimo…

X AI KOLs Following 06/12/26, 04:46 AM Papers

data-filtering large-language-models generalization scaling-laws curation manifold-learning

Summary

This paper argues that for large enough models, unfiltered data can improve generalization by providing weak perturbations, contrary to the common assumption that only high-quality filtered data is beneficial. The authors caution that harmful conditional shifts can still damage models, but over-curation may remove useful perturbations.

Title: A Bitter Lesson for Data Filtering
Authors: Christopher Mohri, John Duchi, Tatsunori Hashimoto (@tatsu_hashimoto)

Filtering helps when the model lacks enough capacity to separate manifold regions. But when the model is large enough, unfiltered data supplies weak stochastic perturbations across a broader manifold. These perturbations can activate more intrinsic pathways, stabilize more fixed-point basins, and improve generalization.

The “bitter lesson” here is not only that scale beats curation; it is that over-curation may remove the very perturbations needed for fixed-point construction in high-order nonlinear data.

One caution: this should not be overstated as “all data is good.” The paper itself says harmful conditional shifts can still damage the model, for example systematically false statements that look like normal high-quality text. Deep Manifold would say the same: useful perturbation nudges the manifold; adversarial or wrong conditional structure can anchor the wrong fixed point.

** Dataualism ** https://x.com/BetaTomorrow/status/2048580677290070016… #DeepManifoldInterpretation

Cached at: 06/13/26, 02:17 PM

Turing Post (@TheTuringPost): Wow, this is interesting..

@Stanford researchers put a common assumption to the test: large models need only “high-quality” filtered training data.

What if the best filter is no filter at all?

They compared full Common Crawl data with heavily filtered versions of it and got

Similar Articles

The Curse of Depth in Large Language Models

Singular Learning Theory: AI learns like ice melts

(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable

Prefill Awareness in Large Language Models

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
