Tag
This paper argues that for large enough models, unfiltered data can improve generalization by providing weak perturbations, contrary to the common assumption that only high-quality filtered data is beneficial. The authors caution that harmful conditional shifts can still damage models, but over-curation may remove useful perturbations.
GEPA-optimized LLM judges from DSPy are used for data filtering in the pre-training pipeline of Microsoft's MAI-Thinking-1 model.
A research paper from Stanford University proposes that with sufficient compute, the best data filtering strategy is no filtering. Experiments show that large-scale models are robust to low-quality data, and unfiltered data pools perform better at larger scales. However, this conclusion applies to standard pre-training of dense models, and filtering remains important when compute is limited.
New research suggests that with sufficient compute, filtering training data for language models may be unnecessary, and models can benefit from low-quality data.
Surprising new results show that for large LMs with enough compute, the best data filter might be no filter, as they tolerate low-quality data well.
This paper investigates data filtering for large model pretraining and finds that in the high-compute, data-scarce regime, filtering may not be necessary and can even be detrimental; sufficiently trained large models benefit from nominally low-quality data.
This position paper advocates for developing 'data probes'—synthetic sequences from random processes—to systematically study how data characteristics affect LLM performance, aiming to move beyond empirical heuristics.
GradShield introduces a principled filtering method to preserve LLM safety alignment during fine-tuning by computing a Finetuning Implicit Harmfulness Score and using adaptive thresholding to remove harmful data, achieving low attack success rates while maintaining utility.
OpenAI describes the pre-training data filtering and active learning techniques used to reduce harmful content in DALL·E 2, while also addressing unintended bias amplification caused by data filtering—particularly demographic biases in generated images.