@tatsu_hashimoto: Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data …
Summary
New results on DCLM suggest that, with enough compute, the best data filter for LMs might be no filter: large models tolerate a surprising amount of nominally low-quality data and sometimes even benefit from it.
Cached Full Text
Cached at: 05/21/26, 09:37 PM
Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally ‘low quality’ data, and can sometimes even benefit. https://t.co/VhshLOWBIx
Similar Articles
@kothasuhas: really really cool work. TLDR: it probably does not make sense to filter _any_ data in the infinite compute regime
New research suggests that with sufficient compute, filtering training data for language models may be unnecessary, and models can benefit from low-quality data.
A Bitter Lesson for Data Filtering (1 minute read)
This paper investigates data filtering for large model pretraining and finds that in the high-compute, data-scarce regime, filtering may not be necessary and can even be detrimental; sufficiently trained large models benefit from nominally low-quality data.
@AI_Whisper_X: Bitter Lesson Part Two: If you have enough compute, the best data filter is no filter. The biggest takeaway from reading this paper is that Rich Sutton's bitter lesson is now coming to the data side? Stanford's Hashimoto published "A Bitter Lesson for Data Filtering"...
A research paper from Stanford University proposes that with sufficient compute, the best data filtering strategy is no filtering. Experiments show that large-scale models are robust to low-quality data, and unfiltered data pools perform better at larger scales. However, this conclusion applies to standard pre-training of dense models, and filtering remains important when compute is limited.
@Tono_Ken3: Oh man, I did it! It went off—DeepSeek-V4-Flash-FP8 8 parallel aggregate 400TPS!! Local LLM revolution yesssssss lol
Reports an aggregate throughput of 400 tokens per second with DeepSeek-V4-Flash-FP8 at 8-way parallelism on local hardware, which the author celebrates as a milestone for local LLM inference.
@rohanpaul_ai: Another good news for local-LLM from atomic[.]chat, that runs 100% offline on your computer. They just showed MTP (Mult…
atomic.chat's MTP technique speeds up local LLM inference by drafting multiple tokens and verifying them together, achieving up to a 137% speedup on the Qwen 27B dense model with zero accuracy loss.