Tag
Kyutai Labs trains 6B-parameter models on Common Crawl data ordered sequentially from 2018 to 2025, showing that the performance drop on recent years disappears, and open-sources the checkpoints for continual learning research.
A research paper from Stanford University proposes that with sufficient compute, the best data filtering strategy is no filtering. Experiments show that large-scale models are robust to low-quality data, and unfiltered data pools perform better at larger scales. However, this conclusion applies to standard pre-training of dense models, and filtering remains important when compute is limited.
A Hugging Face Space lets users run SQL queries over 2.19 billion web pages from Common Crawl without downloading the data, using DuckDB to read directly from Hugging Face storage buckets.
Learn how to set up and use Common Crawl data locally for web data processing tasks.
This paper investigates data filtering for large model pretraining and finds that in the high-compute, data-scarce regime, filtering may not be necessary and can even be detrimental; sufficiently trained large models benefit from nominally low-quality data.
A survey analyzing over 300,000 web feeds on the top 500k sites reveals that while feeds remain prevalent, most are abandoned or low quality because CMSs generate them automatically. The author used AI agents to process Common Crawl data and calls for better feed management practices.