common-crawl

#common-crawl

Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

arXiv cs.CL ↗ · 2026-06-11 Cached

This paper introduces WebGraphMix, a lightweight framework that uses web graph centrality scores from Common Crawl to select pretraining data, showing that mixing central and peripheral documents improves language model performance.

@kyutai_labs: We train 6B-param models on Common Crawl ordered sequentially from 2018 to 2025, so that the freshest data is seen last…

X AI KOLs Following ↗ · 2026-05-26 Cached

Kyutai Labs trains 6B-parameter models on Common Crawl data ordered sequentially from 2018 to 2025, so that the freshest data is seen last. With this ordering, the performance drop on recent years disappears. The checkpoints are open-sourced for continual learning research.

@AI_Whisper_X: Bitter Lesson Part Two: If you have enough compute, the best data filter is no filter. The biggest takeaway from reading this paper is that Rich Sutton's bitter lesson is now coming to the data side? Stanford's Hashimoto published "A Bitter Lesson for Data Filtering"...

X AI KOLs Timeline ↗ · 2026-05-24 Cached

A research paper from Stanford University proposes that with sufficient compute, the best data filtering strategy is no filtering. Experiments show that large-scale models are robust to low-quality data, and unfiltered data pools perform better at larger scales. However, this conclusion applies to standard pre-training of dense models, and filtering remains important when compute is limited.

@vanstriendaniel: You can now run SQL over 2.19 BILLION web pages. Zero download! @CommonCrawl April 2026 crawl + URL index are on @huggi…

X AI KOLs Following ↗ · 2026-05-22 Cached

A Hugging Face Space allows running SQL queries over 2.19 billion web pages from Common Crawl without downloading, using DuckDB to read directly from Hugging Face storage buckets.

@lhoestq: You don't know you actually need local Common Crawl

X AI KOLs Timeline ↗ · 2026-05-22 Cached

Learn how to set up and use Common Crawl data locally for web data processing tasks.

A Bitter Lesson for Data Filtering (1 minute read)

TLDR AI ↗ · 2026-05-21 Cached

This paper investigates data filtering for large model pretraining and finds that in the high-compute, data-scarce regime, filtering may not be necessary and can even be detrimental; sufficiently trained large models benefit from nominally low-quality data.

Web Feeds in 2026: A Survey

Lobsters Hottest ↗ · 2026-05-11 Cached

A survey analyzing over 300,000 web feeds on the top 500k sites reveals that while feeds remain prevalent, most are abandoned or low quality due to automatic CMS generation. The author used AI agents to process Common Crawl data and calls for better feed management practices.
