Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

arXiv cs.CL Papers

Summary

This paper introduces WebGraphMix, a lightweight framework that uses web graph centrality scores from Common Crawl to select pretraining data, showing that mixing central and peripheral documents improves language model performance.

arXiv:2606.11499v1 Announce Type: new Abstract: The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence on labeled data. We propose WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host-level web graph and uses them to vary the proportion of central versus peripheral documents in the pretraining mixture. We hypothesize that central hosts expose models to reusable abstractions, while peripheral hosts encode specialized, long-tail knowledge. WebGraphMix computes centrality scores efficiently at web scale, requiring no model training, labeled data, or downstream supervision. We integrate WebGraphMix into the DataComp-LM pipeline and train models at 400M and 1B parameter scales with 8B and 28B tokens respectively, evaluating on 23 tasks ranging from factual knowledge to symbolic reasoning. Our experiments show that central and peripheral web regions encode complementary capabilities. Mixture combining both at a ratio of 1:1 achieves 41.4% on average, compared to 39.8% for uniform sampling. Combining structural scores with document-level quality classifier scores further improves performance to 43.8%. These findings demonstrate that web graph topology is a meaningful axis for pretraining data curation, capturing information that is largely orthogonal to existing content-based approaches.

Cached at: 06/11/26, 01:37 PM

# Pretraining Data Selection via Web Graph Centrality
Source: [https://arxiv.org/html/2606.11499](https://arxiv.org/html/2606.11499)
Vedant Badoni, Danqi Chen, Xinyi Wang \{vedantbadoni, danqic\}@princeton\.edu, wangxinyilinda@gmail\.com Princeton Language and Intelligence

###### Abstract

The performance of modern language models depends critically on pretraining data composition\. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence on labeled data\. We propose WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host\-level web graph and uses them to vary the proportion of central versus peripheral documents in the pretraining mixture\. We hypothesize that central hosts expose models to reusable abstractions, while peripheral hosts encode specialized, long\-tail knowledge\. WebGraphMix computes centrality scores efficiently at web scale, requiring no model training, labeled data, or downstream supervision\. We integrate WebGraphMix into the DataComp\-LM pipeline and train models at 400M and 1B parameter scales with 8B and 28B tokens respectively, evaluating on 23 tasks ranging from factual knowledge to symbolic reasoning\. Our experiments show that central and peripheral web regions encode complementary capabilities\. A mixture combining both at a ratio of 1:1 achieves 41\.4% on average, compared to 39\.8% for uniform sampling\. Combining structural scores with document\-level quality classifier scores further improves performance to 43\.8%\. These findings demonstrate that web graph topology is a meaningful axis for pretraining data curation, capturing information that is largely orthogonal to existing content\-based approaches\.

## 1 Introduction

The performance of modern language models \(LMs\) depends critically on the composition of their pretraining data\. While neural scaling laws \(Kaplan et al\., [2020](https://arxiv.org/html/2606.11499#bib.bib4); Hoffmann et al\., [2022](https://arxiv.org/html/2606.11499#bib.bib34)\) characterize how data size affects performance, far less is understood about how the structure of large\-scale web corpora should influence data selection\. In practice, modern pretraining pipelines rely on massive web dumps that are filtered, deduplicated, and sampled at the document level \(Albalak et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib5)\)\. These pipelines implicitly treat documents as independent units, applying heuristic quality filters or domain classifiers without considering relationships between documents \(Soldaini et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib6)\)\. As a result, existing approaches largely ignore how information is organized across the web\.

However, the web is fundamentally a graph\. Webpages and hosts are connected through hyperlinks, forming a large\-scale network that encodes topical structure, citation patterns, and information flow\. We hypothesize that a document’s structural position in this graph may correlate with the type and transferability of knowledge it provides during pretraining\. Structurally central documents—those that lie on many shortest paths or connect diverse regions—act as hubs or bridges between otherwise weakly connected communities, and are more likely to co\-occur with heterogeneous contexts and expose models to reusable abstractions\. In contrast, peripheral documents may encode specialized or long\-tail content that is less broadly shared\. From a language modeling perspective, this suggests that graph structure may influence the diversity and overlap of token\-level learning signals, and therefore shape the capabilities learned during pretraining\.

![Refer to caption](https://arxiv.org/html/2606.11499v1/x3.png)
Figure 1: Subgraph of the Common Crawl host\-level web graph\. Node size is proportional to the node's Betweenness centrality score\.

In this work, we introduce WebGraphMix, a graph\-based data selection framework that leverages web\-scale structural signals to construct pretraining mixtures\. WebGraphMix operates directly on the hyperlink graph and is fully unsupervised\. We compute centrality measures over a large Common Crawl host\-level graph and use these scores to partition data into structurally distinct subsets\. We then construct training mixtures that emphasize \(i\) structurally *central* data, \(ii\) structurally *peripheral* data, and \(iii\) combinations of the two, enabling controlled investigation of how graph position affects downstream model behavior\. We mainly test two ways of computing graph centrality: Betweenness centrality \(Freeman, [1977](https://arxiv.org/html/2606.11499#bib.bib40)\) and Katz centrality \(Katz, [1953](https://arxiv.org/html/2606.11499#bib.bib41)\)\. We also tried PageRank\-based \(Page et al\., [1999](https://arxiv.org/html/2606.11499#bib.bib25)\) scoring, but it failed to show improvement, which is consistent with the observation from DCLM \(Li et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib2)\)\.

WebGraphMix differs from prior domain\-based and quality\-based approaches\. Domain\-based methods construct semantic taxonomies \(e\.g\., topic and format categories\) \(Wettig et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib8)\) or optimize coarse\-grained domain mixtures \(e\.g\., arXiv, GitHub, Common Crawl\) through regression or proxy training \(Xie et al\., [2023](https://arxiv.org/html/2606.11499#bib.bib18); Liu et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib33)\), while quality\-based methods score documents by abstract qualities \(e\.g\., educational value, difference between raw web and curated high\-quality data\) \(Penedo et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib1); Sachdeva et al\., [2026](https://arxiv.org/html/2606.11499#bib.bib15); Wettig et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib7); Li et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib2); Gunasekar et al\., [2023](https://arxiv.org/html/2606.11499#bib.bib39)\)\. In contrast, WebGraphMix does not require a taxonomy, classifier, or regression model\. It uses only structural signals intrinsic to the web graph, which makes it lightweight and directly transferable across corpora that expose hyperlink structure\.

We integrate WebGraphMix into the standardized DataComp\-LM \(DCLM\) pipeline \(Li et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib2)\) and train models at 400M and 1B parameter scales with 8B and 28B tokens, respectively\. Centrality scores for the full Common Crawl host graph \(13\.9M nodes, 439\.6M edges\) take fewer than 9 GPU hours to compute in total and can then be reused across all downstream experiments\. All training runs use identical tokenization, shuffling, and optimization procedures to isolate the effect of data selection, and we evaluate on a wide range of 23 tasks from the DCLM CORE v2 benchmark \(Li et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib2)\)\.

Our results show that graph structure provides a meaningful and complementary signal for pretraining data curation\. At 1B scale, selecting documents from structurally central hosts improves performance on Symbolic & Algorithmic Reasoning by \+1\.4% over uniform sampling, while selecting from peripheral hosts improves Science & Factual Knowledge and Commonsense & Reasoning\. These opposing effects indicate that different regions of the web graph encode distinct capability\-relevant signals, and motivate mixture sampling: combining 50% central and 50% peripheral documents with betweenness centrality reaches 41\.4% average across all 23 tasks, compared to 39\.8% for uniform sampling\. Combining the centrality signal with the DCLM\-fasttext quality classifier through multiplicative & divisive scoring further improves performance to 43\.8%, indicating that web graph topology captures information that is largely orthogonal to content\-based quality signals\.

Together, our results suggest that treating the web as a structured graph, rather than an unordered corpus, opens a new direction for studying the relationship between data distribution and model capabilities\.

## 2 Related Work

#### Heuristic filtering & deduplication\.

Existing approaches to data curation largely operate at the document level and treat documents as independent units\. The first stage of curation usually applies heuristic filtering and deduplication\. Rule\-based filters remove boilerplate, spam, and malformed text \(Raffel et al\., [2020](https://arxiv.org/html/2606.11499#bib.bib10); Rae et al\., [2021](https://arxiv.org/html/2606.11499#bib.bib11); Penedo et al\., [2023](https://arxiv.org/html/2606.11499#bib.bib12)\), while deduplication techniques such as MinHash \(Broder, [1997](https://arxiv.org/html/2606.11499#bib.bib13); Lee et al\., [2022](https://arxiv.org/html/2606.11499#bib.bib14)\) and Bloom\-filter\-based methods \(Soldaini et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib6)\) eliminate near\-duplicate documents to reduce memorization\. Frameworks such as DataComp\-LM \(DCLM\) \(Li et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib2)\) standardize these preprocessing steps and enable compute\-controlled comparisons\. While effective at improving data cleanliness and diversity, these methods do not model relationships between documents\.

#### Document quality scoring\.

The second stage of curation usually assigns scalar quality scores to documents and selects data based on ranking\. FineWeb\-Edu \(Penedo et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib1)\), DCLM\-fasttext \(Li et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib2)\), QuRating \(Wettig et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib7)\), and Ask\-LLM \(Sachdeva et al\., [2026](https://arxiv.org/html/2606.11499#bib.bib15)\) estimate properties such as educational value or the difference between curated high\-quality corpora and low\-quality corpora\. Benchmark\-Targeted Ranking \(BETR\) \(Mizrahi et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib16)\) explicitly aligns pretraining data with downstream tasks by selecting documents similar to benchmark examples, achieving substantial gains under scaling\-law analysis\. Other approaches use perplexity \(Wenzek et al\., [2020](https://arxiv.org/html/2606.11499#bib.bib17)\), n\-gram overlap \(Xie et al\., [2023](https://arxiv.org/html/2606.11499#bib.bib18)\), or attention\-based signals \(Hua et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib19)\) to identify useful data\. Despite their diversity, these methods share a common formulation: data selection is treated as a ranking problem over independently scored documents\.

#### Domain mixture optimization\.

The third stage of curation usually introduces higher\-level structure by partitioning web data into domains and optimizing mixture weights\. Most work in this line, such as DoReMi \(Xie et al\., [2023](https://arxiv.org/html/2606.11499#bib.bib18)\), RegMix \(Liu et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib33)\), TiKMiX \(Wang et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib22)\), DoGE \(Fan et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib36)\), and Aioli \(Chen et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib20)\), uses a coarse\-grained, pre\-defined domain categorization and optimizes the mixture weights using proxy models, regression, or influence\-based techniques\. To demystify the domain taxonomy of pretraining data, methods such as Skill\-it \(Chen et al\., [2023](https://arxiv.org/html/2606.11499#bib.bib35)\), WebOrganizer \(Wettig et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib8)\), Nemotron\-CLIMB \(Diao et al\., [2026](https://arxiv.org/html/2606.11499#bib.bib21)\), and Group\-MATES \(Yu et al\., [2026](https://arxiv.org/html/2606.11499#bib.bib23)\) define their own data domains before optimizing the mixture, either by clustering or by constructing a compact and interpretable domain taxonomy\. These approaches can yield strong empirical gains, but typically require substantial computation, model training, or downstream supervision\.

Underlying all these approaches is a shared assumption: documents are evaluated primarily based on their content or similarity, rather than on how they relate to one another\. Even when structure is introduced \(e\.g\., domains or clusters\), it is derived from semantic similarity or learned representations, not from the native connectivity of the web\.

#### Useful web graph structure\.

In contrast, the web is fundamentally a graph: hyperlinks connect pages and hosts into a large\-scale network encoding citation, topical proximity, and information flow\. Graph\-based methods such as PageRank \(Page et al\., [1999](https://arxiv.org/html/2606.11499#bib.bib25)\) and HITS \(Kleinberg, [1999](https://arxiv.org/html/2606.11499#bib.bib26)\) have long exploited this structure for ranking and retrieval\. Recent work, Craw4LLM \(Yu et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib27)\), introduces quality\-aware crawling to improve crawler efficiency\. It uses webpage quality, rather than graph connectivity, as the crawler scheduler’s priority score, reducing crawled pages to 21% of the baseline while matching its performance\. While Craw4LLM incorporates quality signals during crawling, we reintroduce web graph structure *after* crawling for data selection\. A complementary direction leverages web metadata at training time: MeCo \(Gao et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib24)\) conditions on URL information to improve data efficiency and enable controllable inference, with gains persisting even under URL anonymization, which suggests that grouping documents by source provides useful structural signal\. Unlike these approaches, our method operates purely at the data selection stage\.

To the best of our knowledge, prior work has not used graph\-theoretic position as a direct signal for selecting and weighting documents within an already\-crawled corpus for pretraining\.

## 3 Our Method: WebGraphMix

We introduce WebGraphMix, a lightweight pretraining data selection framework that leverages structural signals from the web graph\. Rather than scoring documents independently based on content, our method assigns *centrality scores* based on each document’s position in the global hyperlink network and uses these scores to guide sampling\.

### 3\.1 Web Graph Construction

We operate on the Common Crawl host\-level graph \(we use `cc-main-2023-24-sep-nov-feb-host` from [https://commoncrawl\.org/web\-graphs](https://commoncrawl.org/web-graphs)\), where each node represents a web host \(e\.g\., wikipedia\.org\) and directed edges correspond to hyperlinks between hosts\. Formally, we define a directed graph $G=(V,E)$, where $v\in V$ denotes a host and $(u,v)\in E$ indicates that host $u$ links to host $v$\. This host\-level representation aggregates all documents from the same host into a single node, yielding a large\-scale graph with 13\.9M nodes and 439\.6M edges\.

The raw pretraining corpus we use, Corpus\-200B \([https://huggingface\.co/datasets/WebOrganizer/Corpus\-200B](https://huggingface.co/datasets/WebOrganizer/Corpus-200B)\) from Wettig et al\. \([2025](https://arxiv.org/html/2606.11499#bib.bib8)\), is a pre\-processed version of the `1b-1x` Common Crawl pool from DataComp\-LM \(Li et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib2)\), cleaned with RefinedWeb filters \(Penedo et al\., [2023](https://arxiv.org/html/2606.11499#bib.bib12)\) and BFF deduplication \(Dirk Groeneveld, [2024](https://arxiv.org/html/2606.11499#bib.bib31)\)\. Each document in the preprocessed corpus is mapped to its corresponding host via its URL\. We discard the roughly 5% of documents in the corpus whose host does not appear in the web graph\. Centrality scores are computed at the host level and inherited by all associated documents\. Specifically, if a host $v$ has centrality score $c(v)$, then each document $d_i$ from that host is assigned score $s_i=c(v)$\.
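The document-to-host mapping described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `url` field, and the assumption that `host_centrality` is keyed by plain lowercase hostnames are all hypothetical, and the released graph files may use a different host-name notation that would need converting first.

```python
from typing import Optional
from urllib.parse import urlparse


def host_of(url: str) -> Optional[str]:
    """Return the lowercase hostname of a URL, or None if it has none."""
    host = urlparse(url).hostname
    return host.lower() if host else None


def assign_scores(docs, host_centrality):
    """Give each document the score of its host, s_i = c(v).

    Documents whose host is missing from the graph are dropped,
    mirroring the discard step described in the text.
    `docs` is a list of dicts with a "url" key (hypothetical schema).
    """
    kept, dropped = [], 0
    for doc in docs:
        host = host_of(doc["url"])
        if host is None or host not in host_centrality:
            dropped += 1
            continue
        kept.append({**doc, "host": host, "score": host_centrality[host]})
    return kept, dropped


docs = [
    {"url": "https://en.wikipedia.org/wiki/Graph", "text": "a"},
    {"url": "https://unknown.example/page", "text": "b"},
]
kept, dropped = assign_scores(docs, {"en.wikipedia.org": 0.7})
```

Because scores live at the host level, the lookup table has one entry per host rather than one per document, which is what makes the scores cheap to reuse across experiments.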

### 3\.2 Centrality Score

We quantify structural importance using classical graph centrality measures that capture complementary aspects of connectivity\.

Betweenness centrality \(Freeman, [1977](https://arxiv.org/html/2606.11499#bib.bib40)\) measures how frequently a node lies on shortest paths between other nodes:

$$c_B(v)=\sum_{s\neq v\neq t}\frac{\sigma(s,t\mid v)}{\sigma(s,t)}, \tag{1}$$

where $s,t,v\in V$, $\sigma(s,t)$ is the number of shortest paths from node $s$ to node $t$, and $\sigma(s,t\mid v)$ counts those passing through node $v$\. Nodes with high betweenness act as bridges between otherwise weakly connected regions\. Representative crawled content from the hosts with the highest and lowest betweenness centrality scores is shown in [Table 1](https://arxiv.org/html/2606.11499#S3.T1)\.
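To make Equation 1 concrete, here is a small pure-Python sketch of Brandes-style betweenness for a directed, unweighted graph. It is meant for toy graphs only and is not the GPU implementation used in the paper; the adjacency-dict input format and the function name are assumptions made for illustration.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness for a directed, unweighted graph.

    `adj` maps each node to a list of its successors.
    """
    nodes = set(adj) | {w for nbrs in adj.values() for w in nbrs}
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        preds = {v: [] for v in nodes}
        sigma[s], dist[s] = 1, 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Diamond graph: a reaches d through either b or c.
diamond = betweenness({"a": ["b", "c"], "b": ["d"], "c": ["d"]})
```

In the diamond graph, the only pair with an intermediate node is (a, d); it has two shortest paths, one through each of b and c, so each bridge node receives half of that pair's contribution.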

Table 1: Host centrality reflects different types of web content\. High\-betweenness hosts tend to contain broadly reusable, cross\-domain patterns, whereas low\-betweenness hosts more often contain specialized or long\-tail information\. Examples are based on representative URLs observed in the crawl\. For the actual text crawled from these URLs, see Appendix [C](https://arxiv.org/html/2606.11499#A3)\.

Katz centrality \(Katz, [1953](https://arxiv.org/html/2606.11499#bib.bib41)\) captures recursive influence by aggregating contributions from all walks in the graph:

$$c_K(v_i)=\eta\sum_{j}A_{ij}\,c_K(v_j)+\tau, \tag{2}$$

where $A$ is the adjacency matrix, $v_i,v_j\in V$, $i$ and $j$ both index all nodes in the graph, $\lambda_{\max}$ is the largest eigenvalue of $A$, the condition $0<\eta<1/\lambda_{\max}$ ensures convergence, and $\tau$ is a bias term\. This assigns higher scores to nodes connected to other influential nodes, while attenuating longer paths\. These measures capture complementary notions of structural importance: Betweenness emphasizes cross\-community connectivity, while Katz centrality reflects global influence\.
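Equation 2 can be solved by simple fixed-point iteration, sketched below on a dense matrix. This is an illustrative toy, assuming `A[i][j]` is exactly the entry that appears in the equation (conventions for edge direction vary between libraries); the function name, tolerance, and iteration cap are arbitrary choices, not values from the paper.

```python
def katz(A, eta, tau=1.0, tol=1e-10, max_iter=1000):
    """Iterate c <- eta * A c + tau until the update is below `tol`.

    `A` is a dense adjacency matrix given as a list of rows.
    Convergence requires 0 < eta < 1 / lambda_max(A).
    """
    n = len(A)
    c = [0.0] * n
    for _ in range(max_iter):
        new = [eta * sum(A[i][j] * c[j] for j in range(n)) + tau
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, c)) < tol:
            return new
        c = new
    raise RuntimeError("no convergence; eta may be >= 1/lambda_max")


# Node 0 has no contributors, node 1 draws on node 0,
# node 2 draws on both node 0 and node 1.
A = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
]
scores = katz(A, eta=0.5, tau=1.0)
```

Here node 0 keeps only the bias term, node 1 gets 0.5 · 1 + 1 = 1.5, and node 2 gets 0.5 · (1 + 1.5) + 1 = 2.25, which shows how a node's score grows with the scores of the nodes it is connected to.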

PageRank \(Page et al\., [1999](https://arxiv.org/html/2606.11499#bib.bib25)\) is a specific variant of eigenvector centrality\. Prior work has shown that eigenvector centrality can be used in place of PageRank in directed networks with lower computational cost while preserving rank correlation \(Chandrashekhar et al\., [2022](https://arxiv.org/html/2606.11499#bib.bib32)\)\. We ran ablations using eigenvector centrality but found it did not yield improvements over the baseline\. A similar conclusion was reached by DCLM \(Li et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib2)\): they find that varying the top\-quantile data selection based on PageRank scores does not outperform uniform sampling\. We therefore focus on Betweenness and Katz centrality in the main paper, as they are shown to be effective and capture distinct and complementary notions of structural importance: bridging versus weighted influence\.

#### Efficiency and scalability\.

A key advantage of WebGraphMix is that centrality scores can be computed efficiently at web scale using distributed graph algorithms\. We implement centrality computation over the host graph using GPU\-parallelized primitives and graph partitioning with the cuGraph library \([https://github\.com/rapidsai/cugraph](https://github.com/rapidsai/cugraph)\)\. In practice, computing Katz centrality \(Foster et al\., [2001](https://arxiv.org/html/2606.11499#bib.bib42)\) for the full Common Crawl host graph took under 3 hours on one H100 GPU, and computing Betweenness centrality \(Brandes, [2001](https://arxiv.org/html/2606.11499#bib.bib30)\) took under 6 hours on 4 H100 GPUs, after which the scores can be reused across all downstream experiments\.

Unlike prior data selection methods that require repeated model training, gradient computation, or proxy evaluation, this is a*compute\-efficient one\-time preprocessing step*\. Once computed, centrality scores incur negligible overhead during data sampling\.

### 3\.3 Centrality\-Guided Data Sampling

Each host can be viewed as a subdomain embedded within the global web graph\. Hosts differ substantially in their structural roles: some connect multiple regions of the graph and act as hubs or bridges, while others lie in sparsely connected or peripheral regions\. We hypothesize that these structural differences correspond to differences in the type of knowledge encoded: structurally central hosts are more likely to expose models to broadly reusable and cross\-domain patterns, whereas peripheral hosts tend to contain specialized or long\-tail information\. This can be qualitatively observed from [Table 1](https://arxiv.org/html/2606.11499#S3.T1), where we show crawled content of central hosts and peripheral hosts\. To study this effect, we construct data mixtures that vary systematically across the centrality spectrum\.

Given host\-level centrality scores $c(v)$, each document inherits a score $s_i=c(v_i)$ based on its host $v_i$\. We then construct training datasets under a fixed token budget using the following sampling strategies\.

Top\-$K$ \(Central\) sampling: We select documents whose hosts fall within the top percentile of the centrality distribution \(e\.g\., the top 25% or 75%\), emphasizing structurally central regions of the web\.

Bottom\-$K$ \(Peripheral\) sampling: We select documents from the lowest percentile of the centrality distribution, focusing on peripheral or long\-tail regions\.

Mixed sampling: To test whether central and peripheral regions provide complementary signals, we construct mixtures combining both strata:

$$\alpha\cdot\text{Top-}K+(1-\alpha)\cdot\text{Bottom-}K, \tag{3}$$

where the mixture ratio $\alpha\in\{0,0.25,0.5,0.75,1\}$\. Documents are sampled proportionally until the token budget is reached\.
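The three sampling strategies can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's pipeline: it ranks documents by their inherited score rather than ranking hosts, it reads "sampled proportionally" as giving each stratum a share of the token budget equal to its mixture weight, and the `score` and `tokens` fields and all function names are hypothetical.

```python
import random


def split_by_percentile(docs, k):
    """Return (top, bottom): the top and bottom fraction `k` of docs by score."""
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    n = max(1, int(len(ranked) * k))
    return ranked[:n], ranked[-n:]


def fill(pool, budget, rng):
    """Draw documents from `pool` in random order until `budget` tokens are used.

    The last document added may overshoot the budget slightly.
    """
    pool = pool[:]
    rng.shuffle(pool)
    chosen, used = [], 0
    for doc in pool:
        if used >= budget:
            break
        chosen.append(doc)
        used += doc["tokens"]
    return chosen


def mixed_sample(docs, k, alpha, token_budget, seed=0):
    """alpha * Top-K + (1 - alpha) * Bottom-K under a token budget.

    alpha = 1 gives pure Top-K sampling, alpha = 0 pure Bottom-K.
    """
    rng = random.Random(seed)
    top, bottom = split_by_percentile(docs, k)
    return (fill(top, alpha * token_budget, rng)
            + fill(bottom, (1 - alpha) * token_budget, rng))


docs = [{"score": float(s), "tokens": 10} for s in range(1, 9)]
mix = mixed_sample(docs, k=0.25, alpha=0.5, token_budget=40)
```

With eight equally sized documents, `k=0.25`, and `alpha=0.5`, the mixture contains the two highest-scoring and the two lowest-scoring documents, each stratum contributing half of the 40-token budget.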

### 3\.4 Combining Structural and Quality Signals

In addition to pure structural selection, we explore combining centrality scores with document\-level quality scores\. We use the quality scores produced by DCLM\-fasttext \(Li et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib2)\), a bigram model trained to classify high\-quality text sampled from different sources against low\-quality text sampled from a RefinedWeb \(Penedo et al\., [2023](https://arxiv.org/html/2606.11499#bib.bib12)\) reproduction\. We normalize both the centrality scores and the quality scores by:

$$\hat{s}_i=\exp\left(s_i-\max_j s_j\right), \tag{4}$$

where $i$ and $j$ both index all hosts in the web graph\. This gives us a score within $(0,1]$\. After normalizing both signals, we combine graph centrality and document quality in two complementary ways\. For Top\-$K$ selection, we use additive and multiplicative scores,

$$\hat{s}_i^{\mathrm{add}}=\hat{s}_i^{\mathrm{centrality}}+\hat{s}_i^{\mathrm{quality}},\qquad \hat{s}_i^{\mathrm{mult}}=\hat{s}_i^{\mathrm{centrality}}\cdot\hat{s}_i^{\mathrm{quality}},$$

which favor documents that are both central in the web graph and high quality\. For Bottom\-$K$ selection, we instead use contrastive scores,

$$\hat{s}_i^{\mathrm{sub}}=\hat{s}_i^{\mathrm{centrality}}-\hat{s}_i^{\mathrm{quality}},\qquad \hat{s}_i^{\mathrm{div}}=\hat{s}_i^{\mathrm{centrality}}/\hat{s}_i^{\mathrm{quality}},$$

and select documents with the lowest scores, thereby prioritizing high\-quality documents that are less central\. Documents are ranked by the corresponding combined score and selected under the same token budget\. These strategies allow us to test whether graph structure provides a signal complementary to document quality\.
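The normalization in Equation 4 and the four combined scores can be written out directly. The sketch below is illustrative only: function names and the list-based inputs are assumptions, and the two toy items are made up to show the ranking behaviour.

```python
import math


def normalize(scores):
    """Equation 4: exp(s_i - max_j s_j), which maps scores into (0, 1]."""
    top = max(scores)
    return [math.exp(s - top) for s in scores]


def combine(centrality, quality):
    """Return the additive, multiplicative, subtractive and divisive scores.

    Note: for raw scores far below the maximum, exp() can underflow to 0.0,
    which would make the division fail; the toy values here avoid that.
    """
    c, q = normalize(centrality), normalize(quality)
    return {
        "add": [a + b for a, b in zip(c, q)],
        "mult": [a * b for a, b in zip(c, q)],
        "sub": [a - b for a, b in zip(c, q)],
        "div": [a / b for a, b in zip(c, q)],
    }


# Item 0: peripheral but high quality. Item 1: central but lower quality.
combined = combine(centrality=[0.0, 1.0], quality=[1.0, 0.0])
lowest_div = min(range(2), key=lambda i: combined["div"][i])
```

In this toy case the divisive score is lowest for item 0, the peripheral but high-quality item, which is the kind of document Bottom-$K$ selection with contrastive scores is meant to surface.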

## 4 Experiments

### 4\.1 Experimental Setup

All experiments are conducted using the official DataComp\-LM \(DCLM\) framework, which provides standardized data pools, fixed model architectures, compute\-optimal token budgets, and a fully reproducible training and evaluation pipeline\. We evaluate two compute scales: `400m-1x`, which trains a 412M\-parameter model on approximately 8\.2B tokens, and `1b-1x`, which trains a 1\.4B\-parameter model on approximately 28B tokens\. We mainly report 1B model results in the main paper as they are more significant\. Full 400M model results can be found in Appendix [B](https://arxiv.org/html/2606.11499#A2)\.

We report task\-level and average accuracy on the DCLM CORE v2 benchmark \(Li et al\., [2024](https://arxiv.org/html/2606.11499#bib.bib2)\), which consists of 23 tasks\. As described in [Table 5](https://arxiv.org/html/2606.11499#A1.T5) in Appendix [A](https://arxiv.org/html/2606.11499#A1), the eval tasks are classified into 5 categories \(as marked in the metadata: [https://github\.com/mlfoundations/dclm/blob/main/eval/eval\_meta\_data\.csv](https://github.com/mlfoundations/dclm/blob/main/eval/eval_meta_data.csv)\): Commonsense & Reasoning, QA & Comprehension, Science & Factual Knowledge, Symbolic & Algo Reasoning, and Language Understanding\. In the rest of the paper, we use *Commonsense*, *Comprehension*, *Knowledge*, *Reasoning*, and *Language* as abbreviations\. We also look into each category of tasks to better understand the distinct effects of Top\-$K$ \(central\) and Bottom\-$K$ \(peripheral\) sampling\.

We consider the following baselines: Random: uniformly randomly select from the data pool; Quality: select documents with the top\-$K$ quality scores produced by DCLM\-fasttext; WebOrganizer \(Wettig et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib8)\): a mixture over topic and format domain pairs predicted by the RegMix \(Liu et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib33)\) pipeline; WebOrganizer\+ \(Wettig et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib8)\): the WebOrganizer domain mixture combined with the DCLM\-fasttext quality filter; PageRank: select documents with the top\-$K$ eigenvector centrality, which can be used in place of the classical PageRank algorithm \(Chandrashekhar et al\., [2022](https://arxiv.org/html/2606.11499#bib.bib32)\)\.

### 4\.2 Main Results

Recall that WebGraphMix selects pretraining data by computing host\-level centrality scores over the Common Crawl web graph and constructing a *mixed* dataset that combines documents from the top \(central\) and bottom \(peripheral\) quantiles of the centrality distribution, with mixture ratio $\alpha$ controlling the balance between the two strata \(see [Equation 3](https://arxiv.org/html/2606.11499#S3.E3)\)\. Among the two centrality measures explored, we find *betweenness centrality* most effective; it identifies hosts that bridge otherwise weakly connected regions of the web, yielding documents with broadly reusable, cross\-domain patterns\. WebGraphMix\+ further incorporates document\-level quality scores from DCLM\-fasttext: for the Top\-$K$ stratum, documents are ranked by the *multiplicative* combination of centrality and quality scores; for the Bottom\-$K$ stratum, documents are ranked by the *division* of centrality by quality, thereby surfacing high\-quality documents that are structurally peripheral\.

[Table 2](https://arxiv.org/html/2606.11499#S4.T2) shows the overall benchmark performance comparison between our best methods and baselines, averaged over task categories\. When mixing Top\-$K$ and Bottom\-$K$ documents with $\alpha=0.5$ ranked by Betweenness centrality score, WebGraphMix improves upon the random selection baseline by 1\.6% on average, and WebGraphMix\+, which additionally uses the quality score, improves upon the Top\-$K$ quality score selection baseline by 1\.5% on average\. Note that using the quality score alone improves upon the random selection baseline by 2\.5%, so our method combined with the quality score improves upon the random selection baseline by 4% in total\. WebGraphMix improves over the random selection baseline on all task categories, while WebGraphMix\+ improves over the Quality baseline on 4 out of 5 categories\.
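The reported gains can be cross-checked against the averages quoted in the abstract. The arithmetic below uses only numbers stated in the text; the Quality baseline average of 42.3% is not quoted directly and is inferred here from the stated 2.5-point gain over random selection.

```latex
\begin{align*}
\text{WebGraphMix} - \text{Random} &= 41.4 - 39.8 = 1.6 \\
\text{Quality (inferred)} &= 39.8 + 2.5 = 42.3 \\
\text{WebGraphMix+} - \text{Quality} &= 43.8 - 42.3 = 1.5 \\
\text{WebGraphMix+} - \text{Random} &= 43.8 - 39.8 = 4.0
\end{align*}
```

All differences are in absolute percentage points of average accuracy, and the 4.0-point total is the sum of the 2.5-point quality gain and the 1.5-point gain from adding the structural signal.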

The WebOrganizer baseline requires significant human effort for proposing the domain taxonomies\. It also requires substantial computation: training 512 proxy models of 50M parameters and fitting a gradient\-boosted regression model to optimize domain weights toward specific target tasks, namely MMLU \(Hendrycks et al\., [2021](https://arxiv.org/html/2606.11499#bib.bib38)\) and HellaSwag \(Zellers et al\., [2019](https://arxiv.org/html/2606.11499#bib.bib37)\)\. This explains WebOrganizer\+’s strong performance on Commonsense \(HellaSwag is a commonsense sentence\-completion benchmark\) but limits the method’s generalizability to other capability categories\. WebGraphMix\+ slightly outperforms WebOrganizer\+ on overall average while requiring no proxy training, no labeled targets, and no benchmark\-specific tuning\.

Table 2: Accuracy on DCLM CORE v2 benchmark at 1B scale, averaged by task category\. WebGraphMix uses betweenness centrality with an $\alpha=0.5$ Top\-$K$/Bottom\-$K$ mixture; WebGraphMix\+ additionally combines centrality with the DCLM\-fasttext quality score via multiplication and division\. Note that while our WebGraphMix is close to the WebOrganizer baseline, our method is significantly cheaper and more transferable\. Per\-task results are reported in Appendix [B](https://arxiv.org/html/2606.11499#A2)\.

The effectiveness of WebGraphMix scales with model size\. As shown in [Table 3](https://arxiv.org/html/2606.11499#S4.T3), the gain from the best mixture strategy over the baseline grows from 0\.1% at 400M parameters to 1\.6% at 1B parameters, and the gain from combining quality scores grows from 0\.6% at 400M parameters to 1\.5% at 1B parameters\. This is consistent with the scaling behavior observed in other data selection work \(Mizrahi et al\., [2025](https://arxiv.org/html/2606.11499#bib.bib16); Yu et al\., [2026](https://arxiv.org/html/2606.11499#bib.bib23)\) and suggests that our method may provide larger gains at larger scales\.

Table 3: Best average accuracy by strategy category at 400M and 1B scale\. Gain is computed relative to either the Random baseline or the Quality baseline\. `+-` means combining with addition for Top\-$K$ and subtraction for Bottom\-$K$\. `*/` means combining with multiplication for Top\-$K$ and division for Bottom\-$K$\. Full per\-task results are in Appendix [B](https://arxiv.org/html/2606.11499#A2)\.

## 5 Analysis

### 5.1 Structural Position Differentially Affects Capability Categories

[Table 4](https://arxiv.org/html/2606.11499#S5.T4) breaks down performance by capability category for Top-$K$ (central) and Bottom-$K$ (peripheral) sampling strategies at the 1B scale. The results show that the effect of structural position is highly capability-dependent, and that central and peripheral regions of the web encode different types of useful information.

#### Bottom-$K$ sampling consistently improves factual and commonsense knowledge.

The clearest and most consistent pattern appears in the *Knowledge* and *Commonsense* task categories. In *Knowledge*, both Bottom-$K$ strategies outperform the random baseline: Betweenness Bottom-$K$ improves from 34.2% to 35.4% (+1.2%), while Katz Bottom-$K$ reaches 35.3% (+1.1%). In contrast, Betweenness Top-$K$ slightly hurts performance (-0.3%).

A similar trend appears for *Commonsense*. Katz Bottom-$K$ achieves the best score (57.8%, +0.5%), while Katz Top-$K$ substantially underperforms the baseline (56.1%, -1.2%). Betweenness Bottom-$K$ is roughly neutral (+0.1%), while Betweenness Top-$K$ again slightly decreases performance (-0.6%).

These results suggest that peripheral regions of the web contain useful long\-tail and diverse knowledge signals that are beneficial for factual recall and commonsense reasoning tasks\.

#### Structured reasoning benefits from both Top-$K$ and Bottom-$K$ sampling.

Unlike the knowledge-oriented categories, *Reasoning* improves under all centrality-based sampling strategies. Katz Top-$K$ achieves the strongest result (20.4%, +1.4%), followed closely by Katz Bottom-$K$ (+1.2%). Betweenness Top-$K$ and Bottom-$K$ produce similar gains (+0.9% and +0.8% respectively).

This indicates that reasoning tasks benefit from structural selection in general. However, the relatively stronger gains from Top-$K$ sampling methods suggest that highly influential hosts may contain more structured or procedural content useful for these tasks.

#### Comprehension and language understanding exhibit asymmetric behavior\.

For *Comprehension*, Bottom-$K$ and Top-$K$ behave very differently. Katz Bottom-$K$ improves over baseline (+0.7%), while Katz Top-$K$ substantially hurts performance (-2.4%), the largest degradation in the table. A similar but weaker pattern appears for *Language Understanding*, with only Katz Bottom-$K$ improving.

These results suggest that aggressive concentration on structurally central hosts may reduce linguistic diversity or contextual variability, which are important for comprehension\-oriented tasks\.

#### Centrality metric matters\.

The two centrality measures also behave differently. Katz centrality generally produces larger, less stable positive and negative shifts than Betweenness centrality, such as the strongest gains on *Reasoning* (+1.4%) but also the largest degradation on *Comprehension* (-2.4%), suggesting that recursive influence captures a stronger structural signal than shortest-path bridging.

Table 4: Average accuracy across different task categories for Top-$K$ and Bottom-$K$ sampling at 1B scale. Betw. denotes Betweenness centrality scores. Katz denotes Katz centrality scores.

### 5.2 Centrality Score Distribution

Figure [2](https://arxiv.org/html/2606.11499#S5.F2) shows the distributions of Betweenness and Katz centrality scores at three levels of aggregation: hosts only, weighted by documents per host, and weighted by tokens per host. The latter two reflect the effective score distribution over the training corpus, since each document inherits its host's centrality score.
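The three aggregation levels can be illustrated with a minimal sketch. It assumes hypothetical per-host records (score, document count, token count) with made-up host names, and it bins scores by base-10 exponent purely for illustration; the binning used for Figure 2 may differ.

```python
import math
from collections import Counter

# Hypothetical per-host records: (centrality score, documents, tokens).
HOSTS = {
    "hub.example": (3e-5, 400, 900_000),
    "mid.example": (2e-9, 50, 60_000),
    "edge.example": (4e-13, 5, 4_000),
    "isolated.example": (0.0, 1, 500),
}

def log_bin(score):
    """Bin a score by its base-10 exponent; exact zeros get their own bin."""
    return "zero" if score == 0.0 else math.floor(math.log10(score))

def distribution(hosts, level):
    """Normalized histogram over score bins at one aggregation level.

    level is "host" (each host counts once), "doc" (weight = documents per
    host), or "token" (weight = tokens per host).
    """
    weight_of = {
        "host": lambda docs, tokens: 1,
        "doc": lambda docs, tokens: docs,
        "token": lambda docs, tokens: tokens,
    }[level]
    mass = Counter()
    for score, docs, tokens in hosts.values():
        # Every document (and token) inherits the score of its host.
        mass[log_bin(score)] += weight_of(docs, tokens)
    total = sum(mass.values())
    return {b: m / total for b, m in mass.items()}

host_dist = distribution(HOSTS, "host")
doc_dist = distribution(HOSTS, "doc")
token_dist = distribution(HOSTS, "token")
```

In this toy example the highest bin holds 25% of hosts but roughly 88% of documents, which mirrors the shift of mass toward higher scores under document and token weighting.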

The two metrics exhibit very different shapes. On a log scale, Betweenness (Fig. [2](https://arxiv.org/html/2606.11499#S5.F2)a–c) is bell-shaped and roughly symmetric, with the bulk of mass between $10^{-15}$ and $10^{-5}$. A discrete spike at zero reflects a structural artifact of the Common Crawl host graph: it consists of roughly a dozen weakly connected components, and hosts in the smaller components receive near-zero betweenness because the shortest paths through them are bounded by their component size. Katz centrality (Fig. [2](https://arxiv.org/html/2606.11499#S5.F2)d–f) is instead sharply right-skewed: most hosts cluster near the low end of the score range ($\sim 2.75 \times 10^{-4}$), with a long, sparse tail of high-scoring hosts that are recursively linked to other influential hosts. Document- and token-weighting shifts mass slightly toward higher scores in both cases, since central hosts contribute more content to the corpus, but the qualitative shapes are preserved.
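The recursive character of Katz centrality can be seen on a toy graph. The sketch below applies the standard Katz recurrence $x_i = \beta + a \sum_{j \to i} x_j$ with a plain fixed-point iteration. The graph, the attenuation value $a$ (named `attenuation` to avoid confusion with the mixture ratio $\alpha$), the value of `beta`, and the absence of normalization are all illustrative assumptions, so the absolute scores are not comparable to those in Figure 2.

```python
def katz_scores(nodes, edges, attenuation=0.1, beta=1.0, iterations=50):
    """Fixed-point iteration for x_i = beta + attenuation * sum(x_j for j -> i).

    The iteration converges when attenuation is below the reciprocal of the
    largest eigenvalue of the adjacency matrix (always true for a DAG).
    """
    scores = {n: beta for n in nodes}
    for _ in range(iterations):
        updated = {n: beta for n in nodes}
        for src, dst in edges:
            # Each in-link passes on an attenuated share of the source's score.
            updated[dst] += attenuation * scores[src]
        scores = updated
    return scores

# Hypothetical host graph: three leaf hosts link to "hub", which links to
# "auth"; leaf "a" also links to "side".
NODES = ["a", "b", "c", "hub", "auth", "side"]
EDGES = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "auth"), ("a", "side")]

scores = katz_scores(NODES, EDGES)
```

Hosts without in-links stay at the floor `beta`, while `auth` outranks `side` even though each has a single in-link, because the in-link to `auth` comes from a better-connected host.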

![Refer to caption](https://arxiv.org/html/2606.11499v1/img/score_distributions.png)

Figure 2: Histograms of Betweenness and Katz centrality score distributions over hosts, documents, and tokens.

### 5.3 Mixture Sampling

We investigate the effect of mixing Top-$K$ and Bottom-$K$ sampling by varying the proportion of Top-$K$ and Bottom-$K$ documents in the sampled data. We find that a mixture at around 1:1 yields the strongest performance. [Figure 3](https://arxiv.org/html/2606.11499#S5.F3)(a) summarizes the 23-task averages across mixture ratios and centrality metrics.
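A minimal sketch of such a mixture, assuming a hypothetical document schema `(doc_id, host_centrality, n_tokens)` and a greedy fill of each side's token budget; the helper names and the greedy rule are illustrative and not necessarily the exact procedure used in our pipeline. Here `alpha` is the share of the token budget drawn from the most central hosts.

```python
def mixture_sample(docs, token_budget, alpha):
    """Select documents for a Top-K/Bottom-K mixture.

    docs: iterable of (doc_id, host_centrality, n_tokens); hypothetical schema.
    alpha: target share of the token budget drawn from the most central hosts.
    Returns (top_ids, bottom_ids).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Most central documents first.
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)

    def take(candidates, budget, exclude):
        picked, used = [], 0
        for doc_id, _, n_tokens in candidates:
            if used >= budget:
                break
            if doc_id in exclude:
                continue
            picked.append(doc_id)
            used += n_tokens
        return picked

    top_ids = take(ranked, alpha * token_budget, exclude=set())
    # Walk the ranking from the least central end, skipping anything already taken.
    bottom_ids = take(reversed(ranked), (1 - alpha) * token_budget, exclude=set(top_ids))
    return top_ids, bottom_ids

# Toy corpus: ten documents of 100 tokens each, centrality increasing with index.
DOCS = [(f"doc{i}", float(i), 100) for i in range(10)]
top_ids, bottom_ids = mixture_sample(DOCS, token_budget=400, alpha=0.5)
```

Setting `alpha=1.0` or `alpha=0.0` recovers pure Top-$K$ or pure Bottom-$K$ sampling, so the same routine covers both extremes and every ratio in between.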

![Refer to caption](https://arxiv.org/html/2606.11499v1/x4.png)

Figure 3: Average accuracy of mixture sampling with Betweenness and Katz centrality scores at 1B scale, varying the ratio of Top-$K$ and Bottom-$K$ tokens. (a) Centrality score only. (b) Centrality score combined with quality score. \+\- means combining with addition for Top-$K$ and subtraction for Bottom-$K$. \*/ means combining with multiplication for Top-$K$ and division for Bottom-$K$.

#### Mixing outperforms pure sampling.

Neither pure Top-$K$ nor pure Bottom-$K$ achieves the gain of the $\alpha = 0.5$ mixture. This confirms our central hypothesis: central and peripheral web regions encode complementary capabilities, and balancing them yields a better data mixture than either extreme alone.

#### The optimal ratio is roughly balanced\.

Across betweenness mixtures, $\alpha = 0.5$ outperforms both $\alpha = 0.25$ (40.5%) and $\alpha = 0.75$ (41.0%). This suggests that neither central nor peripheral documents should dominate: the best pretraining data draws roughly equally from both structural extremes of the web graph.

Betweenness mixtures are consistently stronger than or equal to Katz mixtures at the 1B scale, with the gap largest at $\alpha = 0.5$ (+0.6%). There is a clear non-monotonic pattern for betweenness: performance peaks at $\alpha = 0.5$ and declines on either side. This inverted-U shape supports the complementarity hypothesis: too much of either extreme hurts. Katz mixtures show a similar trend, with performance improving as the proportion of central documents increases (39.5% to 40.8% to 41.0%) but then dropping to 39.2% for pure Katz Top-$K$. This may reflect the different nature of Katz centrality, which emphasizes local connectivity rather than global bridging.

At 400M, mixture improvements are smaller but present. Katz $\alpha = 0.25$ improves 0.1% over the baseline, while Betweenness $\alpha = 0.75$ makes gains on specific tasks such as ARC Easy (2.6%), Winogrande (2.4%), and ARC Challenge (2.5%). This weaker signal is expected: smaller models have less capacity to leverage the complementary information from different web regions.

### 5.4 Combining with Quality Scores

[Figure 3](https://arxiv.org/html/2606.11499#S5.F3)(b) reports results for combining centrality with quality scores at 1B scale. The headline finding is that centrality extracts robustly positive value *on top of* the quality filter: every one of the 18 reported configurations exceeds the quality-only baseline of 42.3%, with gains ranging from 0.5% to 1.5%. The strongest configuration, *Multiply Betweenness 50% Top*, achieves an average of 43.8%, a 1.5% improvement over quality-only and a 4.0% improvement over random sampling. This indicates that web graph centrality is not merely competitive with content-based quality scoring but consistently complementary to it. The signal centrality captures (structural position in the hyperlink graph) appears largely orthogonal to what DCLM-fasttext captures, so combining the two yields broadly compounding gains.
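The two combination rules can be sketched as follows. The sketch assumes both scores are already normalized to a common positive range; the function names, the `eps` guard on the division, and the four example documents are hypothetical and serve only to illustrate the rules.

```python
def combined_score(quality, centrality, mode, side, eps=1e-12):
    """Combine normalized quality and centrality scores for one document.

    mode: "add" (addition for Top-K, subtraction for Bottom-K) or
          "mul" (multiplication for Top-K, division for Bottom-K).
    side: "top" rewards high centrality, "bottom" rewards low centrality.
    eps guards the division and is an assumption of this sketch.
    """
    if mode not in ("add", "mul") or side not in ("top", "bottom"):
        raise ValueError("unknown mode or side")
    if mode == "add":
        return quality + centrality if side == "top" else quality - centrality
    return quality * centrality if side == "top" else quality / (centrality + eps)

def rank(docs, mode, side):
    """Return document ids sorted by combined score, best first."""
    return sorted(
        docs,
        key=lambda d: combined_score(docs[d][0], docs[d][1], mode, side),
        reverse=True,
    )

# Hypothetical documents: id -> (normalized quality, normalized centrality).
DOCS = {"a": (0.9, 0.9), "b": (0.9, 0.1), "c": (0.2, 0.95), "d": (0.6, 0.5)}

add_top = rank(DOCS, "add", "top")
mul_top = rank(DOCS, "mul", "top")
mul_bottom = rank(DOCS, "mul", "bottom")
```

In this example, document `c` (low quality, very central) ranks second under addition but third under multiplication, which illustrates how the product penalizes documents that are weak on either signal.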

## 6 Conclusion

We introduced WebGraphMix, a lightweight pretraining data selection framework that uses structural position in the Common Crawl web graph as a signal for sampling documents. Our results show that different regions of the web graph encode complementary capabilities: structurally central hosts improve symbolic and procedural reasoning more, while peripheral hosts improve commonsense and factual knowledge more. Mixing these regions outperforms either extreme alone, and combining centrality with quality-based filtering yields further gains. Unlike prior data selection methods that require auxiliary model training, influence estimation, or domain taxonomy construction, WebGraphMix computes centrality scores once over a publicly available web graph using standard graph algorithms, requiring less than 9 GPU-hours. The resulting signal is lightweight, reusable, and complementary to existing content-based approaches, suggesting that web graph topology is a promising new axis for pretraining data curation.

## Acknowledgments

This research is partially funded by the National Science Foundation \(IIS\-2211779\) and a Sloan Research Fellowship\. This research is also supported by Princeton Language and Intelligence \(PLI\) and Princeton AI Lab\. The experiments in this work were conducted on the Della high\-performance computing cluster, as a part of Princeton Research Computing resources\.

## References

- A\. Albalak, Y\. Elazar, S\. M\. Xie, S\. Longpre, N\. Lambert, X\. Wang, N\. Muennighoff, B\. Hou, L\. Pan, H\. Jeong, C\. Raffel, S\. Chang, T\. Hashimoto, and W\. Y\. Wang \(2024\)A survey on data selection for language models\.arXiv preprint arXiv:2402\.16827\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2402.16827),[Link](https://arxiv.org/abs/2402.16827)Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p1.1)\.
- U\. Brandes \(2001\)A faster algorithm for betweenness centrality\.Journal of Mathematical Sociology25\(2\),pp\. 163–177\.External Links:[Document](https://dx.doi.org/10.1080/0022250X.2001.9990249)Cited by:[§3\.2](https://arxiv.org/html/2606.11499#S3.SS2.SSS0.Px1.p1.1)\.
- A\. Z\. Broder \(1997\)On the resemblance and containment of documents\.Proceedings\. Compression and Complexity of SEQUENCES 1997 \(Cat\. No\.97TB100171\),pp\. 21–29\.External Links:[Link](https://api.semanticscholar.org/CorpusID:11748509)Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px1.p1.1)\.
- S\. S\. Chandrashekhar, M\. Srivastava, B\. Jaganathan, and P\. Shukla \(2022\)PageRank algorithm using eigenvector centrality–new approach\.arXiv preprint arXiv:2201\.05469\.Cited by:[§3\.2](https://arxiv.org/html/2606.11499#S3.SS2.p4.1),[§4\.1](https://arxiv.org/html/2606.11499#S4.SS1.p3.1)\.
- M\. F\. Chen, M\. Y\. Hu, N\. Lourie, K\. Cho, and C\. Re \(2025\)Aioli: a unified optimization framework for language model data mixing\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=sZGZJhaNSe)Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px3.p1.1)\.
- M\. F\. Chen, N\. Roberts, K\. Bhatia, J\. WANG, C\. Zhang, F\. Sala, and C\. Re \(2023\)Skill\-it\! a data\-driven skills framework for understanding and training language models\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=IoizwO1NLf)Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Diao, Y\. Yang, Y\. Fu, X\. Dong, D\. SU, M\. Kliegl, Z\. CHEN, P\. Belcak, Y\. Suhara, H\. Yin, M\. Patwary, Y\. C\. Lin, J\. Kautz, and P\. Molchanov \(2026\)Nemotron\-CLIMB: clustering\-based iterative data mixture bootstrapping for language model pre\-training\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=aBlqKPkc4a)Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px3.p1.1)\.
- Dirk Groeneveld \(2024\)BFF: the big friendly filter\.Note:[https://github\.com/allenai/bff](https://github.com/allenai/bff)Bloom filter\-based n\-gram deduplication tool for language model pretraining dataCited by:[§3\.1](https://arxiv.org/html/2606.11499#S3.SS1.p2.4)\.
- S\. Fan, M\. Pagliardini, and M\. Jaggi \(2024\)DOGE: domain reweighting with generalization estimation\.InInternational Conference on Machine Learning,pp\. 12895–12915\.Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px3.p1.1)\.
- K\. C\. Foster, S\. Q\. Muth, J\. J\. Potterat, and R\. B\. Rothenberg \(2001\)A faster Katz status score algorithm\.Computational & Mathematical Organization Theory7\(4\),pp\. 275–285\.External Links:ISSN 1572\-9346,[Document](https://dx.doi.org/10.1023/A%3A1013470632383),[Link](https://doi.org/10.1023/A:1013470632383)Cited by:[§3\.2](https://arxiv.org/html/2606.11499#S3.SS2.SSS0.Px1.p1.1)\.
- L\. C\. Freeman \(1977\)A set of measures of centrality based on betweenness\.Sociometry40\(1\),pp\. 35–41\.External Links:ISSN 00380431, 23257938,[Link](http://www.jstor.org/stable/3033543)Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p3.1),[§3\.2](https://arxiv.org/html/2606.11499#S3.SS2.p2.7)\.
- T\. Gao, A\. Wettig, L\. He, Y\. Dong, S\. Malladi, and D\. Chen \(2025\)Metadata conditioning accelerates language model pre\-training\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=DdMMzlI5YE)Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px4.p1.1)\.
- S\. Gunasekar, Y\. Zhang, J\. Aneja, C\. C\. T\. Mendes, A\. Del Giorno, S\. Gopi, M\. Javaheripi, P\. Kauffmann, G\. de Rosa, O\. Saarikivi,et al\.\(2023\)Textbooks are all you need\.arXiv preprint arXiv:2306\.11644\.Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p4.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=d7KBjmI3GmQ)Cited by:[§4\.2](https://arxiv.org/html/2606.11499#S4.SS2.p3.1)\.
- J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. de Las Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark, T\. Hennigan, E\. Noland, K\. Millican, G\. van den Driessche, B\. Damoc, A\. Guy, S\. Osindero, K\. Simonyan, E\. Elsen, J\. W\. Rae, O\. Vinyals, and L\. Sifre \(2022\)Training compute\-optimal large language models\.arXiv preprint arXiv:2203\.15556\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2203.15556),[Link](https://arxiv.org/abs/2203.15556)Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p1.1)\.
- K\. Hua, S\. Wu, G\. Zhang, and K\. Shen \(2025\)Attentioninfluence: adopting attention head influence for weak\-to\-strong pretraining data selection\.arXiv preprint arXiv:2505\.07293\.Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\)Scaling laws for neural language models\.arXiv preprint arXiv:2001\.08361\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2001.08361),[Link](https://arxiv.org/abs/2001.08361)Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p1.1)\.
- L\. Katz \(1953\)A new status index derived from sociometric analysis\.Psychometrika18\(1\),pp\. 39–43\.External Links:[Document](https://dx.doi.org/10.1007/BF02289026)Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p3.1),[§3\.2](https://arxiv.org/html/2606.11499#S3.SS2.p3.7)\.
- J\. M\. Kleinberg \(1999\)Authoritative sources in a hyperlinked environment\.Journal of the ACM46\(5\),pp\. 604–632\.External Links:[Document](https://dx.doi.org/10.1145/324133.324140)Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px4.p1.1)\.
- K\. Lee, D\. Ippolito, A\. Nystrom, C\. Zhang, D\. Eck, C\. Callison\-Burch, and N\. Carlini \(2022\)Deduplicating training data makes language models better\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 8424–8445\.External Links:[Link](https://aclanthology.org/2022.acl-long.577/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.577)Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Li, A\. Fang, G\. Smyrnis, M\. Ivgi, M\. Jordan, S\. Gadre, H\. Bansal, E\. Guha, S\. Keh, K\. Arora,et al\.\(2024\)Datacomp\-lm: in search of the next generation of training sets for language models\.Advances in Neural Information Processing Systems37,pp\. 14200–14282\.Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p3.1.5),[§1](https://arxiv.org/html/2606.11499#S1.p4.1),[§1](https://arxiv.org/html/2606.11499#S1.p5.1),[§1](https://arxiv.org/html/2606.11499#S1.p5.1.2),[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.11499#S3.SS1.p2.4),[§3\.2](https://arxiv.org/html/2606.11499#S3.SS2.p4.1),[§3\.4](https://arxiv.org/html/2606.11499#S3.SS4.p1.6),[§4\.1](https://arxiv.org/html/2606.11499#S4.SS1.p2.2)\.
- Q\. Liu, X\. Zheng, N\. Muennighoff, G\. Zeng, L\. Dou, T\. Pang, J\. Jiang, and M\. Lin \(2025\)RegMix: data mixture as regression for language model pre\-training\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=5BjQOUXq7i)Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p4.1),[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.11499#S4.SS1.p3.1)\.
- D\. Mizrahi, A\. B\. L\. Larsen, J\. Allardice, S\. Petryk, Y\. Gorokhov, J\. Li, A\. Fang, J\. Gardner, T\. Gunter, and A\. Dehghan \(2025\)Language models improve when pretraining data matches target tasks\.arXiv preprint arXiv:2507\.12466\.Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2606.11499#S4.SS2.p4.1)\.
- L\. Page, S\. Brin, R\. Motwani, and T\. Winograd \(1999\)The pagerank citation ranking: bringing order to the web\.Technical ReportStanford InfoLab\.Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p3.1),[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px4.p1.1),[§3\.2](https://arxiv.org/html/2606.11499#S3.SS2.p4.1)\.
- G\. Penedo, H\. Kydlíček, A\. Lozhkov, M\. Mitchell, C\. Raffel, L\. Von Werra, T\. Wolf,et al\.\(2024\)The fineweb datasets: decanting the web for the finest text data at scale\.Advances in Neural Information Processing Systems37,pp\. 30811–30849\.Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p4.1),[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px2.p1.1)\.
- G\. Penedo, Q\. Malartic, D\. Hesslow, R\. Cojocaru, H\. Alobeidli, A\. Cappelli, B\. Pannier, E\. Almazrouei, and J\. Launay \(2023\)The refinedweb dataset for falcon LLM: outperforming curated corpora with web data only\.InThirty\-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=kM5eGcdCzq)Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.11499#S3.SS1.p2.4),[§3\.4](https://arxiv.org/html/2606.11499#S3.SS4.p1.6)\.
- J\. W\. Rae, S\. Borgeaud, T\. Cai, K\. Millican, J\. Hoffmann, F\. Song, J\. Aslanides, S\. Henderson, R\. Ring, S\. Young,et al\.\(2021\)Scaling language models: methods, analysis & insights from training gopher\.arXiv preprint arXiv:2112\.11446\.Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of machine learning research21\(140\),pp\. 1–67\.Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Sachdeva, B\. Coleman, W\. Kang, J\. Ni, L\. Hong, E\. H\. Chi, J\. Caverlee, J\. McAuley, and D\. Z\. Cheng \(2026\)How to train data\-efficient LLMs\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=yKUbw7q1IA)Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p4.1),[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px2.p1.1)\.
- L\. Soldaini, R\. Kinney, A\. Bhagia, D\. Schwenk, D\. Atkinson, R\. Authur, B\. Bogin, K\. Chandu, J\. Dumas, Y\. Elazar, V\. Hofmann, A\. H\. Jha, S\. Kumar, L\. Lucy, X\. Lyu, N\. Lambert, I\. Magnusson, J\. Morrison, N\. Muennighoff, A\. Naik, C\. Nam, M\. E\. Peters, A\. Ravichander, K\. Richardson, Z\. Shen, E\. Strubell, N\. Subramani, O\. Tafjord, P\. Walsh, L\. Zettlemoyer, N\. A\. Smith, H\. Hajishirzi, I\. Beltagy, D\. Groeneveld, J\. Dodge, and K\. Lo \(2024\)Dolma: an open corpus of three trillion tokens for language model pretraining research\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\),External Links:[Document](https://dx.doi.org/10.48550/arXiv.2402.00159),[Link](https://arxiv.org/abs/2402.00159)Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p1.1),[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Wang, B\. Liu, F\. Liu, Y\. Guo, J\. Deng, X\. Wu, W\. Zhou, X\. Zhou, and T\. Wang \(2025\)TiKMiX: take data influence into dynamic mixture for language model pre\-training\.arXiv preprint arXiv:2508\.17677\.Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px3.p1.1)\.
- G\. Wenzek, M\. Lachaux, A\. Conneau, V\. Chaudhary, F\. Guzmán, A\. Joulin, and E\. Grave \(2020\)CCNet: extracting high quality monolingual datasets from web crawl data\.InProceedings of the Twelfth Language Resources and Evaluation Conference,N\. Calzolari, F\. Béchet, P\. Blache, K\. Choukri, C\. Cieri, T\. Declerck, S\. Goggi, H\. Isahara, B\. Maegaard, J\. Mariani, H\. Mazo, A\. Moreno, J\. Odijk, and S\. Piperidis \(Eds\.\),Marseille, France,pp\. 4003–4012\(eng\)\.External Links:[Link](https://aclanthology.org/2020.lrec-1.494/),ISBN 979\-10\-95546\-34\-4Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Wettig, A\. Gupta, S\. Malik, and D\. Chen \(2024\)QuRating: selecting high\-quality data for training language models\.InProceedings of the 41st International Conference on Machine Learning \(ICML\),External Links:[Document](https://dx.doi.org/10.48550/arXiv.2402.09739),[Link](https://arxiv.org/abs/2402.09739)Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p4.1),[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Wettig, K\. Lo, S\. Min, H\. Hajishirzi, D\. Chen, and L\. Soldaini \(2025\)Organize the web: constructing domains enhances pre\-training data curation\.InProceedings of the 42nd International Conference on Machine Learning \(ICML\),External Links:[Document](https://dx.doi.org/10.48550/arXiv.2502.10341),[Link](https://arxiv.org/abs/2502.10341)Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p4.1),[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px3.p1.1),[§3\.1](https://arxiv.org/html/2606.11499#S3.SS1.p2.4),[§4\.1](https://arxiv.org/html/2606.11499#S4.SS1.p3.1)\.
- S\. M\. Xie, H\. Pham, X\. Dong, N\. Du, H\. Liu, Y\. Lu, P\. Liang, Q\. V\. Le, T\. Ma, and A\. W\. Yu \(2023\)DoReMi: optimizing data mixtures speeds up language model pretraining\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=lXuByUeHhd)Cited by:[§1](https://arxiv.org/html/2606.11499#S1.p4.1),[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Yu, Z\. Liu, and C\. Xiong \(2025\)Craw4LLM: efficient web crawling for LLM pretraining\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 13843–13851\.External Links:[Link](https://aclanthology.org/2025.findings-acl.712/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.712),ISBN 979\-8\-89176\-256\-5Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px4.p1.1)\.
- Z\. Yu, F\. Peng, J\. Lei, A\. Overwijk, W\. Yih, and C\. Xiong \(2026\)Group\-level data selection for efficient pretraining\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=uX4dyc7Z5Z)Cited by:[§2](https://arxiv.org/html/2606.11499#S2.SS0.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2606.11499#S4.SS2.p4.1)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)Hellaswag: can a machine really finish your sentence?\.InProceedings of the 57th annual meeting of the association for computational linguistics,pp\. 4791–4800\.Cited by:[§4\.2](https://arxiv.org/html/2606.11499#S4.SS2.p3.1)\.

## Appendix A Experiment Details

After selecting documents according to our graph\-based sampling strategy, we construct an untokenized dataset in JSONL format consistent with DCLM specifications\. Tokenization and shuffling are performed using DCLM’s official Rust\-based tokshuf pipeline\. Specifically, we tokenize with the GPT‑NeoX tokenizer at sequence length 2049, following DCLM’s standard configuration\. The Rust pipeline produces WebDataset shards and generates the corresponding manifest file required by the DCLM training script\. For each experiment, we create a dataset reference JSON to integrate seamlessly with the DCLM workflow\. We do not modify tokenizer settings, sequence length, sharding configuration, or preprocessing logic\. By using the official tokenize\-and\-shuffle implementation, we maintain identical preprocessing behavior to prior DCLM submissions and eliminate potential implementation\-induced variation\.

Model training is executed using DCLM’s training\.train entrypoint, which builds upon the OpenLM framework\. All experiments follow fixed DCLM scale\-specific recipes\. We evaluate two compute scales: 400M‑1x, which trains a 412M\-parameter model on approximately 8\.2B tokens, and 1B‑1x, which trains a 1\.4B\-parameter model on approximately 28B tokens\. For each scale, DCLM specifies the model architecture, number of layers, hidden size, attention heads, learning rate schedule, warmup steps, batch size, weight decay, gradient accumulation, and total number of training tokens\. We use these configurations exactly as provided, without modification\. In practice, we train with slightly more raw tokens than the nominal DCLM target to account for token loss during shuffling and padding, ensuring the effective training token count matches the intended compute budget\. The 400M model takes around 20 hours on 4 H100 GPUs while the 1B model takes around 90 hours on 4 H100 GPUs\.
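The cost figures above can be put side by side with a short calculation. The sketch only restates numbers quoted in this paper (approximate wall-clock hours, GPU counts, parameter and token counts, and the 9 GPU-hour upper bound for the centrality computation); the variable and function names are illustrative.

```python
# Figures quoted in the text; only the arithmetic is added here.
RUNS = {
    "400M-1x": {"params": 412e6, "tokens": 8.2e9, "hours": 20, "gpus": 4},
    "1B-1x": {"params": 1.4e9, "tokens": 28e9, "hours": 90, "gpus": 4},
}
CENTRALITY_GPU_HOURS = 9  # upper bound stated in the Conclusion

def summarize(run):
    """Return (GPU-hours, tokens per parameter) for one training run."""
    return run["hours"] * run["gpus"], run["tokens"] / run["params"]

summary = {name: summarize(run) for name, run in RUNS.items()}
# One-time centrality cost as a fraction of a single 1B training run.
centrality_share = CENTRALITY_GPU_HOURS / summary["1B-1x"][0]
```

Under these figures a single 1B run costs about 360 GPU-hours, so the one-time centrality computation is at most about 2.5% of one such run, and both scales train on roughly 20 tokens per parameter.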

Table 5: DCLM CORE v2 evaluation tasks used in our experiments, along with their categories.

| Task | Category | Few-shot | Description |
| --- | --- | --- | --- |
| HellaSwag | Commonsense & Reasoning | 0/10 | Sentence completion, grounded commonsense |
| CommonsenseQA | Commonsense & Reasoning | 10 | 5-choice commonsense QA |
| COPA | Commonsense & Reasoning | 0 | Causal reasoning, cause/effect |
| PIQA | Commonsense & Reasoning | 10 | Physical commonsense (2-choice) |
| Winograd | Commonsense & Reasoning | 0 | Pronoun resolution, commonsense |
| Winogrande | Commonsense & Reasoning | 0 | Large-scale Winograd-style |
| BoolQ | QA & Comprehension | 10 | Binary yes/no QA from passages |
| SQuAD (v2) | QA & Comprehension | 10 | Extractive QA; may be unanswerable |
| CoQA | QA & Comprehension | 0 | Conversational QA |
| OpenBookQA | QA & Comprehension | 0 | Multi-step reasoning + commonsense |
| ARC Easy | Science & Factual Knowledge | 10 | Grade-school science (easy), 4-choice |
| ARC Challenge | Science & Factual Knowledge | 10 | Grade-school science (hard), 4-choice |
| Jeopardy | Science & Factual Knowledge | 10 | Diverse trivia, generative |
| QA Wikidata | Science & Factual Knowledge | 10 | Big-Bench factual completions |
| MMLU | Science & Factual Knowledge | 5 | 57-subject academic QA (aggregate) |
| LSAT-AR | Science & Factual Knowledge | 3 | Analytical reasoning from LSAT |
| CS Algorithms | Symbolic & Algo Reasoning | 10 | Big-Bench: recursion, DP execution |
| Dyck Languages | Symbolic & Algo Reasoning | 10 | Big-Bench: balanced bracket completion |
| Operators | Symbolic & Algo Reasoning | 10 | Big-Bench: novel operator definitions |
| Repeat Copy Logic | Symbolic & Algo Reasoning | 10 | Big-Bench: words repeating and ordering |
| LAMBADA | Language Understanding | 0 | Last-word prediction, long context |
| Language Identification | Language Understanding | 10 | Big-Bench: identify written language |

Table 6: Licenses for existing assets used in this paper.

## Appendix B Full Results

Pure centrality sampling at 1B scale tells a nuanced story. Bottom-$K$ outperforms Top-$K$ on average: Katz Bottom-$K$ achieves 0.405 (+0.4pp over baseline), while Katz Top-$K$ achieves only 0.392 (-0.9pp). This pattern holds across both centrality metrics.

At the smaller 400M scale, the pattern is weaker. The baseline (0.325) ties with Katz Bottom-$K$ (0.325) as the best average. Pure centrality strategies show less consistent improvement, likely because the 400M model has less capacity to exploit the structural signal. However, the same task-level asymmetry exists: Katz Top-$K$ achieves 0.601 on BoolQ (+7.3pp over baseline) while the baseline wins on bigbench\_qa\_wikidata (0.444 vs. 0.387, +5.7pp).

Table 7: Accuracy on 23 tasks from DCLM CORE v2 benchmark. All 1.4B models are trained on 28B tokens selected by baseline methods and our WebGraphMix method. Our methods use the betweenness centrality scores with a Top-$K$ and Bottom-$K$ mixture of 1:1. We use multiplication for combining with the quality scores, denoted as Ours+. Note that while our WebGraphMix is close to the WebOrganizer baseline, our method is significantly cheaper and more transferable.

Table 8: Pure centrality sampling at 1B scale (1.4B parameters, 28B tokens). Each column selects documents whose hosts fall in the highest (Top-$K$) or lowest (Bottom-$K$) centrality stratum. Katz Bottom-$K$ achieves the highest average, suggesting peripheral web regions encode complementary capabilities at this scale.

Table 9: Mixture sampling at 1B scale (1.4B parameters, 28B tokens). Each column combines a specified percentage of Top-$K$ (central) documents with the remainder drawn from Bottom-$K$ (peripheral) documents. Betweenness 50% Top achieves the highest average (0.414), outperforming both the uniform baseline (0.398) and all pure sampling strategies from Table [8](https://arxiv.org/html/2606.11499#A2.T8).

Table 10: Pure centrality sampling at 400M scale (412M parameters, 8.2B tokens). The baseline (uniform sampling) achieves the highest average (0.325), with pure centrality strategies performing comparably. At this smaller scale, the signal from centrality alone does not consistently outperform uniform sampling.

Table 11: Mixture sampling at 400M scale (412M parameters, 8.2B tokens). Katz 25% Top achieves the highest average (0.326), marginally outperforming the uniform baseline (0.325). Betweenness 75% Top also shows gains on several individual tasks, indicating that the complementary signal from mixing central and peripheral documents is present even at smaller model scales.

Table 12: Additive quality–centrality combination at 400M scale (412M parameters, 8.2B tokens). Normalized quality and centrality scores are summed, and documents are ranked by the combined score. Add Katz Top-$K$ achieves the highest average (0.351), substantially outperforming both the uniform baseline (0.325) and the quality-only filter (0.345), demonstrating that structural centrality provides an additive signal on top of content-based quality scoring.

Table 13: Multiplicative quality–centrality combination at 400M scale (412M parameters, 8.2B tokens). Normalized quality and centrality scores are multiplied, and documents are ranked by the product. Both Multiply Betweenness Top-$K$ and Multiply Katz Top-$K$ achieve a tied best average of 0.345, matching the quality-only baseline. Bottom-$K$ variants underperform, confirming that multiplicative combination is most effective when selecting structurally central documents.

Table 14: Additive quality–betweenness centrality combination at 1B scale (1.4B parameters, 28B tokens). Normalized quality and betweenness centrality scores are summed, and documents are ranked by the combined score. Add Betw. 25% achieves the highest average (0.437), outperforming both the uniform baseline (0.398) and the quality-only filter (0.423), demonstrating that betweenness centrality provides an additive signal on top of content-based quality scoring.

Table 15: Additive quality–Katz centrality combination at 1B scale (1.4B parameters, 28B tokens). Normalized quality and Katz centrality scores are summed, and documents are ranked by the combined score. All Katz variants achieve comparable average scores around 0.431–0.432, outperforming both the uniform baseline (0.398) and the quality-only filter (0.423), demonstrating that Katz centrality provides a consistent additive signal across sampling thresholds.

Table 16: Multiplicative quality–betweenness centrality combination at 1B scale (1.4B parameters, 28B tokens). Normalized quality and betweenness centrality scores are multiplied, and documents are ranked by the product. Mult. Betw. 50% achieves the highest average (0.438), outperforming both the uniform baseline (0.398) and the quality-only filter (0.423). Bottom-$K$ variants remain competitive at this scale, unlike the 400M setting.

Table 17: Multiplicative quality–Katz centrality combination at 1B scale (1.4B parameters, 28B tokens). Normalized quality and Katz centrality scores are multiplied, and documents are ranked by the product. All Katz variants achieve comparable average scores around 0.430–0.432, outperforming the uniform baseline (0.398) and the quality-only filter (0.423). Bottom-$K$ variants remain competitive at this scale, unlike the 400M setting.

#### Best combination strategy shifts with scale.

As a secondary observation, we note that the best combination strategy reverses between the two scales: at 400M, Add Katz Top was strongest, while at 1B, Multiply Betweenness 50% takes over\. This may partially be attributed to the differing selectivity of the two strategies\. Multiplicative scoring strongly suppresses documents that are low on either signal, while additive scoring is more permissive\. At 400M, where the model has limited capacity, the broader signal from additive combination appears more useful; at 1B, the sharper selectivity of multiplicative combination yields better results\. While interesting, this reversal is less practically important than the broader finding that *both* combination strategies, in nearly all configurations, extract real value from centrality on top of quality filtering\.
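The difference in selectivity between the two rules can be seen on a toy example\. The sketch below is illustrative only: the min\-max normalization, the function names, and the four document scores are assumptions made for this example, not the normalization or data used in the experiments\.

```python
def min_max(values):
    """Rescale a list of floats to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_documents(quality, centrality, mode):
    """Return document indices ordered from highest to lowest combined score."""
    q = min_max(quality)      # assumed normalization (min-max), not from the paper
    c = min_max(centrality)
    if mode == "add":
        combined = [qi + ci for qi, ci in zip(q, c)]
    elif mode == "multiply":
        combined = [qi * ci for qi, ci in zip(q, c)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)


# Hypothetical scores for four documents:
#   doc 0: high quality, low centrality (lopsided)
#   doc 1: moderate on both signals (balanced)
#   doc 2: low quality, high centrality (lopsided)
#   doc 3: lowest on both signals
quality = [1.0, 0.6, 0.2, 0.0]
centrality = [0.1, 0.4, 1.0, 0.0]

add_order = rank_documents(quality, centrality, "add")        # [2, 0, 1, 3]
mult_order = rank_documents(quality, centrality, "multiply")  # [1, 2, 0, 3]
print("additive:      ", add_order)
print("multiplicative:", mult_order)
```

In this toy setting the additive rule ranks both lopsided documents above the balanced one, whereas the multiplicative rule moves the balanced document to the top, which mirrors the permissive versus selective behavior described above\.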

## Appendix C Example Pretraining Documents

Table 18: Top hosts by betweenness centrality score\. The ten highest\-scoring hosts from the web graph, with a representative URL snippet for each\. Scores are computed over the host\-level graph and reported in scientific notation\.

Table 19: Bottom hosts by betweenness centrality score\. The ten lowest\-scoring hosts from the web graph, with a representative URL snippet for each\. Scores are computed over the host\-level graph and reported in scientific notation\.
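To make the Top\-K and Bottom\-K strata concrete, the following sketch ranks documents by their host score and draws a fixed share from each end of the ranking\. It is a minimal illustration under stated assumptions: the host names, the scores, the stratum fraction, and the function name are all hypothetical, and the paper's actual stratum thresholds are not reproduced here\.

```python
import random


def mix_top_bottom(docs, host_score, k_fraction, top_share, n_select, seed=0):
    """Draw n_select documents: a top_share portion from the most central
    stratum and the remainder from the least central stratum."""
    # Rank documents by the centrality score of their host, most central first.
    ranked = sorted(docs, key=lambda d: host_score[d["host"]], reverse=True)
    k = max(1, int(len(ranked) * k_fraction))
    if 2 * k > len(ranked):
        raise ValueError("top and bottom strata would overlap")
    top, bottom = ranked[:k], ranked[-k:]
    n_top = round(n_select * top_share)
    n_bottom = n_select - n_top
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return (rng.sample(top, min(n_top, len(top)))
            + rng.sample(bottom, min(n_bottom, len(bottom))))


# Hypothetical host-level scores (illustrative values only).
host_score = {
    "hub.example": 9e-3,
    "portal.example": 4e-3,
    "niche.example": 2e-6,
    "fringe.example": 1e-7,
}
# Two hypothetical documents per host.
docs = [{"id": f"{host}/{i}", "host": host} for host in host_score for i in range(2)]

# 1:1 mixture: half of the selection from the top stratum, half from the bottom.
selected = mix_top_bottom(docs, host_score, k_fraction=0.5, top_share=0.5, n_select=4)
for d in selected:
    print(d["id"], host_score[d["host"]])
```

Changing top\_share to 0\.25 or 0\.75 gives the other mixture ratios of the kind reported in the mixture sampling tables\.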

## Appendix D Limitations and future work

Our experiments are conducted at 400M and 1B parameter scales with 8B and 28B training tokens respectively, following the DCLM 1b\-1x reference setting\. The scaling pattern we observe—gains that grow with model size—suggests that further improvements may be achievable at larger scales, but verifying this requires substantially more compute\. Our centrality scores are also computed at the host level and inherited by all documents from a given host; a finer\-grained page\-level graph could capture intra\-host structural variation that is currently averaged out\. We focused on betweenness and Katz centrality because they capture distinct notions of structural importance \(cross\-community bridging vs\. recursive influence\), but other graph\-theoretic measures—including hierarchical decomposition \(k\-core, k\-truss\), random\-walk\-based methods beyond Katz, and motif\-based scores—remain unexplored\. Finally, combining WebGraphMix with domain\-based methods such as WebOrganizer is a natural next step: graph centrality and semantic taxonomies operate on independent axes, and combining them may yield further compounding gains in the same way that combining centrality with content\-based quality does\.
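As a small illustration of host\-level scoring and inheritance, the sketch below computes Katz centrality on a toy directed host graph by fixed\-point iteration and then assigns each document the score of its host\. Everything here is an assumption for illustration: the four hosts, the attenuation factor alpha, the constant beta, and the unnormalized formulation are not taken from the paper, and betweenness centrality is omitted for brevity\.

```python
def katz_centrality(edges, alpha=0.1, beta=1.0, max_iter=1000, tol=1e-12):
    """Katz centrality via the fixed point x_v = alpha * sum_{u -> v} x_u + beta.

    Converges when alpha is below the reciprocal of the largest eigenvalue
    of the adjacency matrix (true for the toy graph used below).
    """
    hosts = sorted({h for edge in edges for h in edge})
    in_links = {h: [] for h in hosts}
    for src, dst in edges:
        in_links[dst].append(src)
    score = {h: beta for h in hosts}
    for _ in range(max_iter):
        updated = {h: alpha * sum(score[s] for s in in_links[h]) + beta
                   for h in hosts}
        converged = max(abs(updated[h] - score[h]) for h in hosts) < tol
        score = updated
        if converged:
            break
    return score


# Hypothetical host graph: three hosts link to a hub, and the hub links back to one.
edges = [
    ("a.example", "hub.example"),
    ("b.example", "hub.example"),
    ("c.example", "hub.example"),
    ("hub.example", "a.example"),
]
host_score = katz_centrality(edges)

# Every document inherits the score of its host, so documents from the same
# host are indistinguishable to the selector.
documents = [("doc-0", "hub.example"), ("doc-1", "hub.example"), ("doc-2", "b.example")]
doc_score = {doc_id: host_score[host] for doc_id, host in documents}
print(doc_score)
```

Because doc\-0 and doc\-1 share a host, they receive identical scores; this is the intra\-host variation that a page\-level graph could recover\.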

## Appendix E Broader Impact

This work introduces a graph\-based framework for pretraining data selection that operates on the structural topology of the web rather than on document content\. We discuss both potential positive and negative societal implications\.

#### Positive impacts\.

WebGraphMix offers a computationally lightweight alternative to data selection methods that require training auxiliary models, running proxy evaluations, or constructing domain taxonomies\. By replacing these resource\-intensive steps with a one\-time centrality computation \(fewer than 9 GPU\-hours total\), our approach lowers the barrier to principled data curation, particularly for resource\-constrained research groups\. More broadly, improving the efficiency of pretraining data selection reduces the total compute—and therefore energy—spent on training language models, since better data can substitute for additional training steps or larger model sizes\.

#### Potential risks and limitations\.

Graph\-based selection introduces a new axis of bias that differs from content\-based filtering\. Web graph centrality reflects the *linking behavior* of web publishers, which is shaped by commercial incentives, language demographics, and historical web development patterns\. Structurally central hosts tend to be large, English\-dominant platforms \(e\.g\., social media sites, major reference sites\), while peripheral hosts include small organizations, non\-English content, and niche communities\. Selecting data based on centrality scores therefore risks amplifying the structural inequalities already present in the web’s link topology, for example by systematically underrepresenting content from regions or languages with less interconnected web infrastructure\.

Our mixture\-based approach partially mitigates this concern by explicitly including peripheral documents alongside central ones, and our results show that peripheral regions contribute valuable capabilities that central regions lack\. However, we do not conduct a systematic analysis of how centrality\-based selection affects demographic, linguistic, or geographic representation in the resulting training data, and we encourage future work in this direction\.

Finally, the centrality scores and selection scripts we plan to release are metadata annotations on an already\-public corpus and do not introduce new privacy risks beyond those inherent in Common Crawl itself\. The models trained in this work are small\-scale research artifacts \(up to 1\.4B parameters\) not intended for deployment\.
