@BetaTomorrow: Title: A Bitter Lesson for Data Filtering. Authors: Christopher Mohri, John Duchi, Tatsunori Hashimoto (@tatsu_hashimo…
Summary
This paper argues that for large enough models, unfiltered data can improve generalization by providing weak perturbations, contrary to the common assumption that only high-quality filtered data is beneficial. The authors caution that harmful conditional shifts can still damage models, but over-curation may remove useful perturbations.
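To make "filtered" versus "unfiltered" concrete, here is a minimal sketch of a score-threshold filter and the share of tokens it keeps. Everything in it is an assumption for illustration: the `Doc` type, the `quality` scores, the 0.8 threshold, and the whitespace token count are hypothetical and are not taken from the paper, which may filter in a different way.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    quality: float  # hypothetical classifier score in [0, 1], not from the paper


def filter_by_quality(docs, threshold):
    """Keep only documents whose score meets the threshold."""
    return [d for d in docs if d.quality >= threshold]


def token_count(docs):
    """Whitespace-token count, a crude stand-in for a real tokenizer."""
    return sum(len(d.text.split()) for d in docs)


# Toy corpus with made-up scores, for illustration only.
corpus = [
    Doc("a carefully edited encyclopedia entry", 0.95),
    Doc("forum reply with typos and slang", 0.40),
    Doc("product listing boilerplate text here", 0.20),
    Doc("well written tutorial on sorting", 0.85),
]

filtered = filter_by_quality(corpus, threshold=0.8)
retained = token_count(filtered) / token_count(corpus)

print(f"kept {len(filtered)} of {len(corpus)} docs, {retained:.0%} of tokens")
# prints "kept 2 of 4 docs, 48% of tokens"
```

The sketch only shows what a filter discards. Whether the discarded text helps or hurts a model is the empirical question the paper addresses, and this toy example says nothing about it.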
Cached at: 06/13/26, 02:17 PM
Title: A Bitter Lesson for Data Filtering
Authors: Christopher Mohri, John Duchi, Tatsunori Hashimoto (@tatsu_hashimoto)

Filtering helps when the model lacks enough capacity to separate manifold regions. But when the model is large enough, unfiltered data supplies weak stochastic perturbations across a broader manifold. These perturbations can activate more intrinsic pathways, stabilize more fixed-point basins, and improve generalization.

The “bitter lesson” here is not only that scale beats curation; it is that over-curation may remove the very perturbations needed for fixed-point construction in high-order nonlinear data.

One caution: this should not be overstated as “all data is good.” The paper itself says harmful conditional shifts can still damage the model, for example systematically false statements that look like normal high-quality text. Deep Manifold would say the same: a useful perturbation nudges the manifold, while adversarial or wrong conditional structure can anchor the wrong fixed point.

**Dataualism**
https://x.com/BetaTomorrow/status/2048580677290070016…
#DeepManifoldInterpretation
Turing Post (@TheTuringPost): Wow, this is interesting..
@Stanford researchers put a common assumption to the test: large models need only “high-quality” filtered training data.
What if the best filter is no filter at all?
They compared full Common Crawl data with heavily filtered versions of it and got
Similar Articles
The Curse of Depth in Large Language Models
This paper introduces the Curse of Depth in LLMs, where deep layers become ineffective due to Pre-Layer Normalization causing output variance explosion. The authors propose LayerNorm Scaling to mitigate this, showing consistent improvements in pre-training and fine-tuning across model sizes up to 7B.
Singular Learning Theory: AI learns like ice melts
Singular Learning Theory (SLT) uses algebraic geometry to explain why neural networks generalize well despite their degeneracies, introducing the real log canonical threshold (RLCT) as a measure of model complexity.
(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable
This paper proposes that reliability in AI-assisted social science research depends on decision architecture—how cognitive labor is divided between humans and machines. Through a pre-specified factorial experiment, the authors show that an unconstrained multi-agent baseline fails in 72% of runs, while one organized with three architectural commitments (LLMs restricted to reasoning, deterministic data/estimation, and three human decision gates) fails in only 16%.
Prefill Awareness in Large Language Models
This paper investigates whether frontier language models can detect when their prior assistant messages have been inserted or edited (prefill awareness). The study finds that models like Claude Opus 4.5 exhibit substantial prefill awareness, detecting tampered prefills in up to 35% of cases without false positives, which could compromise the validity of prefill-based safety evaluations.
Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
This paper examines when and why self-reported psychometric measures predict the actual behavior of large language models, finding that fine-grained, behavior-specific instruments (Theory of Planned Behavior) achieve human-level coherence within a shared conversation, while broad traits like Big 5 do not.