Tag
This paper introduces WebGraphMix, a lightweight framework that uses web graph centrality scores from Common Crawl to select pretraining data, showing that mixing central and peripheral documents improves language model performance.
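
The idea of mixing central and peripheral documents can be illustrated with a minimal sketch. This is not the paper's implementation: the summary above does not state the actual selection rule, so the function name `mix_by_centrality`, the `top_quantile` cutoff that separates "central" from "peripheral", and the `central_fraction` mixing ratio are all hypothetical. Centrality scores are assumed to be precomputed and attached to each document.

```python
import random


def mix_by_centrality(docs, n_select, central_fraction=0.5, top_quantile=0.2, seed=0):
    """Select n_select document ids by mixing central and peripheral documents.

    docs: list of (doc_id, centrality_score) pairs; scores are assumed precomputed.
    central_fraction and top_quantile are illustrative assumptions, not values
    taken from the paper.
    """
    rng = random.Random(seed)

    # Rank documents from most to least central.
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)

    # Treat the top `top_quantile` share as "central", the rest as "peripheral".
    cut = max(1, int(len(ranked) * top_quantile))
    central, peripheral = ranked[:cut], ranked[cut:]

    # Split the selection budget between the two pools, capped by pool size.
    n_central = min(len(central), round(n_select * central_fraction))
    n_peripheral = min(len(peripheral), n_select - n_central)

    # Sample without replacement from each pool and return the ids.
    chosen = rng.sample(central, n_central) + rng.sample(peripheral, n_peripheral)
    return [doc_id for doc_id, _ in chosen]


if __name__ == "__main__":
    # Toy corpus: 100 documents whose score equals their index.
    toy_docs = [(f"d{i}", float(i)) for i in range(100)]
    print(mix_by_centrality(toy_docs, n_select=10))
```

Setting `central_fraction` to 1.0 or 0.0 reduces the sketch to a central-only or peripheral-only baseline, which is the kind of comparison a mixing claim would be measured against.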