data-filtering

#data-filtering

@BetaTomorrow: Title: A Bitter Lesson for Data Filtering. Authors: Christopher Mohri, John Duchi, Tatsunori Hashimoto (@tatsu_hashimo…

X AI KOLs Following ↗ · 2d ago Cached

This paper argues that for large enough models, unfiltered data can improve generalization by providing weak perturbations, contrary to the common assumption that only high-quality filtered data is beneficial. The authors caution that harmful conditional shifts can still damage models, but over-curation may remove useful perturbations.

@lateinteraction: dspy.GEPA used in pretraining data curation in the new Microsoft AI effort :-)

X AI KOLs Following ↗ · 2026-06-03 Cached

GEPA-optimized LLM judges from dspy are used for data filtering in Microsoft's MAI-Thinking-1 model pre-training pipeline.

@AI_Whisper_X: Bitter Lesson Part Two: If you have enough compute, the best data filter is no filter. The biggest takeaway from reading this paper is that Rich Sutton's bitter lesson is now coming to the data side? Stanford's Hashimoto published "A Bitter Lesson for Data Filtering"...

X AI KOLs Timeline ↗ · 2026-05-24 Cached

A research paper from Stanford University proposes that with sufficient compute, the best data filtering strategy is no filtering. Experiments show that large-scale models are robust to low-quality data, and unfiltered data pools perform better at larger scales. However, this conclusion applies to standard pre-training of dense models, and filtering remains important when compute is limited.

@kothasuhas: really really cool work. TLDR: it probably does not make sense to filter _any_ data in the infinite compute regime

X AI KOLs Following ↗ · 2026-05-21 Cached

New research suggests that with sufficient compute, filtering training data for language models may be unnecessary, and models can benefit from low-quality data.

@tatsu_hashimoto: Some new results I found surprising that I’m tweeting for Chris (who isn’t on here). With enough compute, the best data …

X AI KOLs Following ↗ · 2026-05-21 Cached

Surprising new results show that for large LMs with enough compute, the best data filter might be no filter, as they tolerate low-quality data well.

A Bitter Lesson for Data Filtering (1 minute read)

TLDR AI ↗ · 2026-05-21 Cached

This paper investigates data filtering for large model pretraining and finds that in the high-compute, data-scarce regime, filtering may not be necessary and can even be detrimental; sufficiently trained large models benefit from nominally low-quality data.

Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance

arXiv cs.AI ↗ · 2026-05-20 Cached

This position paper advocates for developing 'data probes'—synthetic sequences from random processes—to systematically study how data characteristics affect LLM performance, aiming to move beyond empirical heuristics.

GradShield: Alignment Preserving Finetuning

arXiv cs.CL ↗ · 2026-05-15 Cached

GradShield introduces a principled filtering method to preserve LLM safety alignment during fine-tuning by computing a Finetuning Implicit Harmfulness Score and using adaptive thresholding to remove harmful data, achieving low attack success rates while maintaining utility.

DALL·E 2 pre-training mitigations

OpenAI Blog ↗ · 2022-06-28 Cached

OpenAI describes the pre-training data filtering and active learning techniques used to reduce harmful content in DALL·E 2, while also addressing unintended bias amplification caused by data filtering—particularly demographic biases in generated images.
