@tatsu_hashimoto: Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data …

X AI KOLs Following 05/21/26, 03:51 PM Papers

lm data-filtering scaling compute low-quality-data dclm

Summary

Surprising new results suggest that, with enough compute, the best data filter for LMs (on DCLM) might be no filter: large models tolerate a surprising amount of nominally low-quality data and can sometimes even benefit from it.

Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data filter for LMs (on DCLM) might be no filter. Why? Large models can tolerate a surprising amount of nominally 'low quality' data, and can sometimes even benefit. https://t.co/VhshLOWBIx

Original Article


Cached at: 05/21/26, 09:37 PM


Similar Articles

@kothasuhas: really really cool work. TLDR: it probably does not make sense to filter _any_ data in the infinite compute regime

X AI KOLs Following

New research suggests that with sufficient compute, filtering training data for language models may be unnecessary, and that large models can sometimes even benefit from nominally low-quality data.

A Bitter Lesson for Data Filtering (1 minute read)

TLDR AI

This paper investigates data filtering for large model pretraining and finds that in the high-compute, data-scarce regime, filtering may not be necessary and can even be detrimental; sufficiently trained large models benefit from nominally low-quality data.

@AI_Whisper_X: Bitter Lesson Part Two: If you have enough compute, the best data filter is no filter. The biggest takeaway from reading this paper is that Rich Sutton's bitter lesson is now coming to the data side? Stanford's Hashimoto published "A Bitter Lesson for Data Filtering"...

X AI KOLs Timeline

A research paper from Stanford University proposes that with sufficient compute, the best data filtering strategy is no filtering. Experiments show that large-scale models are robust to low-quality data, and unfiltered data pools perform better at larger scales. However, this conclusion applies to standard pre-training of dense models, and filtering remains important when compute is limited.

@Tono_Ken3: Oh man, I did it! It went off—DeepSeek-V4-Flash-FP8 8 parallel aggregate 400TPS!! Local LLM revolution yesssssss lol

X AI KOLs Timeline

Reached an aggregate 400 tokens per second with DeepSeek-V4-Flash-FP8 at 8-way parallelism on local hardware, which the author presents as a milestone for local LLM inference.

@rohanpaul_ai: More good news for local LLMs from atomic[.]chat, which runs 100% offline on your computer. They just showed MTP (Mult…