@kothasuhas: really really cool work. TLDR: it probably does not make sense to filter _any_ data in the infinite compute regime
Summary
New research suggests that, with sufficient compute, filtering training data for language models may be unnecessary: large models can tolerate a surprising amount of nominally low-quality data and can sometimes even benefit from it.
Cached at: 05/22/26, 09:53 PM
really really cool work. TLDR: it probably does not make sense to filter any data in the infinite compute regime https://t.co/61P9AOZe2b
Tatsunori Hashimoto (@tatsu_hashimoto): Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally ‘low quality’ data, and can sometimes even benefit.
Similar Articles
A Bitter Lesson for Data Filtering (1 minute read)
This paper investigates data filtering for large model pretraining and finds that in the high-compute, data-scarce regime, filtering may not be necessary and can even be detrimental; sufficiently trained large models benefit from nominally low-quality data.
@tatsu_hashimoto: Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data …
Surprising new results show that for large LMs with enough compute, the best data filter might be no filter, as they tolerate low-quality data well.
@AI_Whisper_X: Bitter Lesson Part Two: If you have enough compute, the best data filter is no filter. The biggest takeaway from reading this paper is that Rich Sutton's bitter lesson is now coming to the data side? Stanford's Hashimoto published "A Bitter Lesson for Data Filtering"...
A research paper from Stanford University proposes that with sufficient compute, the best data filtering strategy is no filtering. Experiments show that large-scale models are robust to low-quality data, and unfiltered data pools perform better at larger scales. However, this conclusion applies to standard pre-training of dense models, and filtering remains important when compute is limited.
@FrancoisChauba1: If you train on (unsorted list, bubble sort procedure, sorted list) traces, you will never test time compute (TTC) your…
A critique arguing that training LLMs on human-generated data limits their ability to discover novel solutions via test-time compute, and that true AGI requires models that can explore hypothesis spaces more broadly, similar to AlphaZero.
@yoonholeee: https://x.com/yoonholeee/status/2064027464926716154
The author argues that text optimization (prompts, context, memory) is a legitimate and sample-efficient learning mechanism that should be taken more seriously by the ML community, enabling a new scaling axis of update-time compute.