Spokes: Optimizing for Diverse Pretraining Data Selection
Summary
This paper introduces Spokes, a probabilistic diversification framework using the G-Vendi score to optimize diversity in pretraining data selection, achieving significant improvements in downstream task performance on FineWeb and DCLM by jointly optimizing quality and diversity.
# Spokes: Optimizing for Diverse Pretraining Data Selection
Source: [https://arxiv.org/html/2606.15216](https://arxiv.org/html/2606.15216)
Clarence Lee (DSO National Laboratories), Yejin Choi (Stanford University), Luke Zettlemoyer (University of Washington), Pang Wei Koh (University of Washington), Hai Leong Chieu (DSO National Laboratories)
###### Abstract
Diversity plays a critical role in data selection, improving performance under fixed data budgets by reducing redundancy and repetition. However, optimizing for diversity is inherently challenging, as it is a set-level property that depends on interactions between data points rather than individual examples. As a result, existing approaches typically rely on proxies or approximations, which often fail to ensure sufficiently diverse subsets. In this work, we directly optimize diversity by introducing a probabilistic diversification framework based on the G-Vendi score, optimized via exponentiated gradient descent. Our method produces subsets that are substantially more diverse than those obtained via random sampling, achieving a +489 increase in G-Vendi score on a 500k-sample subset. We evaluate our approach on FineWeb and DCLM, where it consistently outperforms existing methods. Notably, Spokes (diversity-only) improves average downstream performance by +0.4 and +0.5 points over random sampling on DCLM and FineWeb, respectively. More importantly, jointly optimizing for both quality and diversity yields the strongest results: Spokes achieves gains of +1.5 and +1.4 points on DCLM and FineWeb, outperforming all baselines, including semantic deduplication and quality filtering.
## 1 Introduction
Data diversity is a key factor in the construction of pretraining corpora. Prior work has shown that explicitly incorporating diversity into data mixture design and optimization can improve downstream performance (Liu et al., [2025](https://arxiv.org/html/2606.15216#bib.bib4); Fan et al., [2025](https://arxiv.org/html/2606.15216#bib.bib14); Jung et al., [2025](https://arxiv.org/html/2606.15216#bib.bib1)). However, diversity is fundamentally a set-level objective: it is measured as a property of a set rather than of an individual element. Directly optimizing it is computationally intractable due to the combinatorial nature of subset selection, an issue that becomes more pronounced at pretraining scale. To address this, practical methods have been developed and are already used in modern pretraining pipelines. These approaches typically rely on coarse-grained structure, such as clustering data into topics (Wettig et al., [2025](https://arxiv.org/html/2606.15216#bib.bib18)), skills (Chandiramani et al., [2026](https://arxiv.org/html/2606.15216#bib.bib19)), or unsupervised clusters (Liu et al., [2025](https://arxiv.org/html/2606.15216#bib.bib4); Diao et al., [2025](https://arxiv.org/html/2606.15216#bib.bib20)). While these methods approximate diversity by ensuring that each cluster is represented in the final mix, they still rely on low-resolution partitions of the data, and effective diversity can at best be approximated by random sampling of the data. As a result, long-tail or underrepresented knowledge may not be adequately captured. Fine-grained signals from individual data points can be a more reliable basis for building highly diverse sets, but a gap remains in the literature on how to scale this reliably.
This gap motivates our central question: How can we reliably extract diverse subsets from pretraining corpora at scale, so as to achieve better downstream outcomes?
Just as diverse normalised vectors distribute evenly around the unit circle in 2D space, evoking the spokes of a wheel, our method, Spokes, is a principled approach to obtaining diverse sets of data. Rather than relying on heuristics, we solve a global optimization problem that analyzes the contribution of every data point. We then use the result to extract the data points that contribute to the overall diversity of the set.
Our method reveals that dense, highly diverse subsets can be extracted from pretraining corpora. While the diversity scores of random samples saturate quickly, selecting data points using the optimised weights from Spokes led to a significant 489-point increase in G-Vendi (Jung et al., [2025](https://arxiv.org/html/2606.15216#bib.bib1)) score on a 500k-sample subset. In contrast, existing methods that aim to improve diversity, such as SemDeDup (Abbas et al., [2023](https://arxiv.org/html/2606.15216#bib.bib7)), led to only a 7-point increase in G-Vendi score.
While existing work treats the G-Vendi score as a post-hoc evaluation measure of diversity, we directly optimise for it and introduce a set of practical strategies (see [Section 4](https://arxiv.org/html/2606.15216#S4)) that allow us to extract diverse sets from the large-scale pretraining corpora FineWeb and DCLM. Not only do batch-level diversity scores increase with the chosen subsets, but downstream performance also improves on both datasets. This shows that the benefits of diversity translate reliably across datasets with different levels of filtering and quality.
While quality filters have traditionally been a strong baseline for data selection, as many quality signals are designed to correlate with evaluation performance, Spokes demonstrates that diversity is also an important axis to consider. By trading a lower average quality score from a model-based classifier (2.95 vs. 3.18) for batches with higher G-Vendi scores (425 vs. 315), we achieve significant improvements in evaluation score over the quality-only baseline: +1.0 on DCLM and +1.9 on FineWeb.
As such, we investigate direct diversity optimization for data selection at pretraining scale. Our contributions are as follows: (1) We introduce Spokes, a scalable technique for diversity optimization that operates efficiently at the scale of modern pretraining datasets. (2) We show that Spokes is effective and is able to extract highly diverse subsets from existing pretraining corpora. (3) We jointly optimize for both quality and diversity, which yields consistent gains in pretraining performance.
## 2 Background
### 2.1 G-Vendi as a diversity measure
The *Vendi* score (Friedman and Dieng, [2022](https://arxiv.org/html/2606.15216#bib.bib2)) was introduced as a principled metric for quantifying the diversity of a dataset. Concretely, given a set of representations, the Vendi score constructs a similarity matrix and examines the exponential of the entropy of its spectrum (eigenvalues). When the data points are very similar, most of the spectrum is concentrated in a few directions. When the data points are diverse, the spectrum is more evenly spread, indicating many independent directions.
The *G-Vendi score* (Jung et al., [2025](https://arxiv.org/html/2606.15216#bib.bib1)) extends this idea by measuring diversity in *gradient space* rather than in representation space. Instead of comparing input examples directly, G-Vendi compares the gradients that each example induces during training. Formally, let $\nabla\ell(x;\theta)$ denote the gradient of the loss with respect to model parameters $\theta$ for a data sample $x$, computed via backpropagation under a proxy model. One benefit of the G-Vendi score is that it encourages orthogonality among representation vectors, thereby promoting diversity in the learned set. In gradient space, this reduces redundancy among update directions, leading to more independent and informative optimization steps. As a result, parameter updates interfere less with one another and yield higher information gain per iteration, improving data efficiency.
When using cosine similarity, the similarity between two samples is given by the dot product of their $\ell_2$-normalized representations. Let $g_i \in \mathbb{R}^{d}$ denote the per-sample gradient for data point $i$, and define $X \in \mathbb{R}^{n \times d}$ as the matrix whose $i$-th row is $g_i^{\top}$, where each $g_i$ is $\ell_2$-normalized. The resulting similarity (kernel) matrix is then given by $K = XX^{\top} \in \mathbb{R}^{n \times n}$, so that each entry $K_{ij} = g_i^{\top} g_j$ corresponds to the cosine similarity between the gradients of data points $i$ and $j$. Let $\{\sigma_i\}$ denote the eigenvalues of $K$, and define normalized eigenvalues $\lambda_i = \sigma_i / \mathrm{Tr}(K)$, which form a probability distribution. The G-Vendi score is then defined as the exponential of the Shannon entropy of this spectrum:
$$\mathrm{G\text{-}Vendi}(K) = \exp\left(-\sum_{i} \lambda_i \log \lambda_i\right). \tag{1}$$
In practice, computing the full kernel matrix is unnecessary. We exploit the fact that the Vendi score kernel is positive semidefinite and that the Gram matrix $X^{\top}X$ shares the same non-zero eigenvalues as the kernel matrix $K = XX^{\top}$. This allows us to work directly with the Gram matrix in $\mathbb{R}^{d \times d}$, which is significantly more computationally efficient and scalable when $d \ll n$.
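The definition in Eq. (1), together with the Gram-matrix shortcut, can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `g_vendi`, the `eps` cutoff for discarding numerically zero eigenvalues, and the rule for picking the smaller of the two matrices are all assumptions made here.

```python
import numpy as np

def g_vendi(G, eps=1e-12):
    """Exponential of the Shannon entropy of the normalized kernel spectrum.

    G is an (n, d) array of per-sample gradient representations (hypothetical
    input; in the paper these come from a proxy model).
    """
    # l2-normalize rows so that X @ X.T is the cosine-similarity kernel K.
    X = G / np.linalg.norm(G, axis=1, keepdims=True)
    n, d = X.shape
    # X.T @ X (d x d) and X @ X.T (n x n) share the same non-zero eigenvalues,
    # so we decompose whichever is smaller.
    M = X.T @ X if d < n else X @ X.T
    sigma = np.clip(np.linalg.eigvalsh(M), 0.0, None)
    lam = sigma / sigma.sum()      # normalized eigenvalues, sum to 1
    lam = lam[lam > eps]           # treat 0 * log(0) as 0
    return float(np.exp(-(lam * np.log(lam)).sum()))
```

Under this sketch, a set of identical vectors scores 1 and a set of $n$ mutually orthogonal vectors scores $n$, which matches the reading of the score as an effective number of independent directions.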
## 3 Spokes: optimizing for data diversity in gradient space
### 3.1 Spokes: Scalable optimization to achieve high G-Vendi subsets
Given the strong downstream performance associated with G-Vendi, we introduce Spokes, a method for extracting high G-Vendi subsets from large-scale pretraining corpora. In addition to diversity, our formulation incorporates per-example quality scores, motivated by the role of quality filtering in modern data selection pipelines. We control the trade-off between quality and diversity using a tunable parameter $\alpha \in [0,1]$, where $\alpha = 0$ recovers diversity-only optimization. In our experiments, quality scores are obtained using the FineWeb-Edu classifier (Penedo et al., [2024](https://arxiv.org/html/2606.15216#bib.bib12)), though the formulation is compatible with any per-example quality metric.
We begin with a discrete subset selection problem in which the objective is to select a subset $S \subseteq [n]$ of fixed size $k$ that jointly maximizes quality and diversity. Let each example $x_i$ have an associated quality score $q_i$, and let $K_S$ denote the similarity kernel restricted to the selected subset. Following the log-form quality-weighted Vendi objective of Nguyen and Dieng ([2024](https://arxiv.org/html/2606.15216#bib.bib3)), we define the optimization problem as
$$\max_{S \subseteq [n],\, |S| = k} \quad \alpha \ln\left(\frac{1}{k}\sum_{i \in S} q_i\right) + (1-\alpha) \ln \mathrm{Vendi}(K_S). \tag{2}$$
While this formulation directly captures the desired trade\-off between quality and diversity, optimizing it is computationally intractable due to the combinatorial nature of subset selection\. As such, we used a relaxed optimization as we describe below\.
To construct the similarity representation used by G-Vendi, we first compute gradient-based embeddings using a proxy model:
$$g_i = \nabla_{\theta}\ell(x_i), \quad i = 1, \dots, n. \tag{3}$$
Because these gradients are high-dimensional, we apply a random projection using a Rademacher matrix $R \in \{-1,+1\}^{k \times d}$:
$$z_i = \frac{1}{\sqrt{k}} R g_i. \tag{4}$$
Using the projected embeddings, we define pairwise similarities through a cosine similarity kernel. We then relax the discrete optimization problem by assigning a continuous non-negative weight to every data point, $w \in \Delta^{n}$. Under this relaxation, the quality objective becomes the expected quality
$$Q(w) = \sum_{i=1}^{n} w_i q_i. \tag{5}$$
To extend the subset kernel $K_S$ to the relaxed setting, we define a weight-dependent kernel
$$K(w)_{ij} = \sqrt{w_i w_j} \cdot \frac{z_i^{\top} z_j}{\|z_i\| \|z_j\|}. \tag{6}$$
The resulting relaxed optimization problem is
$$\max_{w \in \Delta^{n}} \; \alpha \ln Q(w) + (1-\alpha) \ln \mathrm{Vendi}\big(K(w)\big). \tag{7}$$
We optimize this objective using exponentiated gradient descent while enforcing the simplex constraint through normalization at each iteration, preventing collapse onto a small number of high-weight examples. After optimization, we recover a discrete subset by selecting the top-$k$ entries of the learned weight vector. We demonstrate the effectiveness of this strategy in [Figure 4](https://arxiv.org/html/2606.15216#S6.F4). The complete formulation of Spokes is summarized in [Algorithm 1](https://arxiv.org/html/2606.15216#alg1).
Algorithm 1: Spokes: Probabilistic G-Vendi via Exponentiated Gradient Descent
1. Input: Dataset $\{x_1, \dots, x_n\}$, quality scores $\{q_i\}_{i=1}^{n}$, learning rate $\eta > 0$, iterations $T$, initial distribution $w^{(0)} \in \Delta^{n}$, trade-off parameter $\alpha \in [0,1]$
2. Step 1: Compute gradient embeddings using a proxy model: $g_i = \nabla_{\theta}\ell(x_i), \quad i = 1, \dots, n$
3. Step 2: Sample a Rademacher projection matrix $R \in \{-1,+1\}^{k \times d}$ and project embeddings: $z_i = \frac{1}{\sqrt{k}} R g_i, \quad \forall i$
4. Step 3: Construct weighted cosine similarity kernel: $K(w)_{ij} = \sqrt{w_i w_j} \cdot \frac{z_i^{\top} z_j}{\|z_i\| \|z_j\|}$
5. Step 4: Define weighted quality: $Q(w) = \sum_{i=1}^{n} w_i q_i$
6. Optimization objective: maximize $\alpha \ln Q(w) + (1-\alpha) \ln \mathrm{Vendi}(K(w))$ subject to $w \in \Delta^{n}$
7. Step 5: Define loss: $\mathcal{L}(w) = -\left(\alpha \ln Q(w) + (1-\alpha) \ln \mathrm{Vendi}(K(w))\right)$
8. for $t = 0$ to $T-1$ do
9. Compute gradient: $g^{(t)} = \nabla_{w}\mathcal{L}(w^{(t)})$
10. Exponentiated gradient update: $\tilde{w}_i^{(t+1)} = w_i^{(t)} \exp(-\eta g_i^{(t)}), \quad \forall i$
11. Normalize: $w^{(t+1)} = \frac{\tilde{w}^{(t+1)}}{\sum_{j=1}^{n} \tilde{w}_j^{(t+1)}}$
12. end for
13. Output: Subset $S = \mathrm{Top\text{-}k}(w^{(T)})$
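The loop in Algorithm 1 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' code: the paper does not say how $\nabla_w \mathcal{L}$ is computed, so the closed-form gradient below is an assumption. It uses the fact that, for unit-normalized rows $z_i$ and $w$ on the simplex, $K(w)$ shares its non-zero eigenvalues with the $d \times d$ matrix $M(w) = \sum_i w_i z_i z_i^{\top}$, whose trace is 1, so that $\partial \ln \mathrm{Vendi} / \partial w_i = -z_i^{\top} \log M(w)\, z_i - 1$. The names `spokes_weights`, `log_vendi`, and `top_k`, and the defaults for `eta`, `steps`, and `eps`, are hypothetical.

```python
import numpy as np

def _unit_rows(Z):
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def log_vendi(Z, w, eps=1e-12):
    """ln Vendi(K(w)), computed from the d x d matrix sum_i w_i z_i z_i^T."""
    Z = _unit_rows(Z)
    lam = np.clip(np.linalg.eigvalsh((Z * w[:, None]).T @ Z), 0.0, None)
    lam = lam / lam.sum()
    lam = lam[lam > eps]
    return float(-(lam * np.log(lam)).sum())

def spokes_weights(Z, q, alpha=0.0, eta=0.5, steps=20, eps=1e-12):
    """Exponentiated gradient descent on the relaxed objective (Eq. 7)."""
    Z = _unit_rows(Z)
    n = Z.shape[0]
    w = np.full(n, 1.0 / n)                      # uniform initial distribution
    for _ in range(steps):
        M = (Z * w[:, None]).T @ Z               # d x d, trace 1 on the simplex
        evals, evecs = np.linalg.eigh(M)
        log_evals = np.log(np.clip(evals, eps, None))
        P = Z @ evecs                            # coordinates in the eigenbasis
        div_grad = -(P ** 2) @ log_evals - 1.0   # d ln Vendi / d w_i (assumed form)
        qual_grad = q / (w @ q)                  # d ln Q / d w_i
        loss_grad = -(alpha * qual_grad + (1.0 - alpha) * div_grad)
        step = -eta * loss_grad
        w = w * np.exp(step - step.max())        # shift by max for stability
        w = w / w.sum()                          # renormalize onto the simplex
    return w

def top_k(w, k):
    """Indices of the k largest weights."""
    return np.argsort(-w)[:k].tolist()
```

As a small usage example, with four copies of one direction and two distinct orthogonal directions, the learned weights shift mass away from the duplicated direction and the two unique points are ranked first. At pretraining scale one would likely rely on automatic differentiation and GPU linear algebra rather than this dense CPU sketch.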
### 3.2 The optimisation leads to smooth trade-offs between quality and diversity
Spokes is effective in practice, and we observe a smooth trade-off between quality and diversity. [Figure 2](https://arxiv.org/html/2606.15216#S3.F2) shows that increasing $\alpha$ improves average quality while reducing the G-Vendi score in a controlled manner. This behavior is consistent when we optimise over the full pretraining set and measure the G-Vendi and quality scores of a batch sampled from this set.
Figure 1:Quality–diversity trade\-off acrossα\\alpha\.
Figure 2:Overlap between quality\-only, diversity\-only, and joint optimization subsets\.
To select a final value of $\alpha$, we measure overlap between the jointly optimized subset and the two extreme solutions (quality-only and diversity-only). For DCLM, $\alpha = 0.001$ yields a balanced trade-off, with 64.8% overlap with the diversity-only subset and 68.8% overlap with the quality-only subset. For FineWeb, which contains a higher proportion of low-quality documents and therefore induces a natural bias toward higher-quality selections, we use a lower $\alpha = 0.0005$, obtaining 70.0% and 87.0% overlap, respectively. In both cases, the chosen $\alpha$ substantially increases the average batch G-Vendi scores relative to quality-only selection.
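The overlap statistic used to pick $\alpha$ is not defined formally in the text. One plausible reading, sketched below as an assumption, is the fraction of the jointly optimized subset that also appears in a reference subset of the same size; `subset_overlap` is a hypothetical helper name.

```python
def subset_overlap(selected, reference):
    """Fraction of `selected` indices that also appear in `reference`."""
    selected, reference = set(selected), set(reference)
    if not selected:
        raise ValueError("selected subset is empty")
    return len(selected & reference) / len(selected)
```

Under this reading, one would compute `subset_overlap(joint, diversity_only)` and `subset_overlap(joint, quality_only)` for each candidate $\alpha$ and pick a value where neither overlap dominates.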
### 3.3 Time complexity of Spokes
Spokes can be used as a drop-in alternative to existing data selection methods such as SemDeDup (Abbas et al., [2023](https://arxiv.org/html/2606.15216#bib.bib7)). The key advantage of Spokes is improved computational efficiency, achieved by operating directly on a Gram matrix rather than relying on pairwise similarity computations.
At each iteration, Spokes constructs a Gram matrix over the currently selected subset. Let the subset size be $N$ and the representation dimension be $d$. The per-iteration computational cost is $O(Nd^{2})$, leading to a total cost of $O(TNd^{2})$ over $T$ optimization steps. In practice, $T$ is small (typically $T < 20$) because the selection objective converges quickly to a representative subset.
To further improve scalability, the dataset can be partitioned into $p$ disjoint subsets of size approximately $N/p$, which are optimized independently. This yields a per-partition cost of $O\!\left(T\frac{N}{p}d^{2}\right)$, resulting in an approximately linear speedup in $p$ under the assumption of independent optimization. In our experiments, we set $p = 1$, as the optimization remains efficient at full scale. The optimization process used around 8 NVIDIA H100 GPU hours.
In contrast, SemDeDup (Abbas et al., [2023](https://arxiv.org/html/2606.15216#bib.bib7)) relies on pairwise similarity computations between samples, incurring a dominant cost that scales quadratically with the subset size, $O(N^{2}/k)$, where $k$ is the number of clusters. This quadratic dependence makes SemDeDup substantially more expensive at scale. Moreover, SemDeDup typically requires a larger number of clusters $k$ than the number of partitions $p$ used in Spokes, further increasing its computational cost. In our experiments, the semantic deduplication procedure took around 16 NVIDIA H100 GPU hours.
Beyond efficiency, Spokes also offers greater control over the final subset size. Target dataset sizes can be obtained directly by selecting the top-$k$ samples, whereas SemDeDup requires sweeping over multiple $\epsilon$ thresholds to achieve a desired data budget.
## 4 Improving efficiency in Spokes
Beyond the relaxed optimisation procedure in [Section 3.1](https://arxiv.org/html/2606.15216#S3.SS1), we make several modifications to the pipeline to make the G-Vendi optimisation procedure more scalable.
### 4.1 Approximating gradients using the last $n$ layers
Embedding-based approaches, such as those used in semantic deduplication, are already standard practice in pretraining pipelines (Abbas et al., [2023](https://arxiv.org/html/2606.15216#bib.bib7)). These methods typically rely on full-model embeddings, which only require one forward pass. In contrast, our method requires gradient-based representations, which are substantially more expensive if computed over the entire model.
To reduce this overhead, we approximate full gradients by restricting computation to the last $n$ transformer layers. This truncation significantly lowers computational cost while preserving the structure of the resulting similarity kernel (see [Table 1](https://arxiv.org/html/2606.15216#S4.T1)). In practice, this makes gradient computation comparable to embedding-based methods when $n$ is small (e.g., 2–3 layers).
To select an appropriate truncation depth, we perform a small calibration study using 100 samples. We compute similarity kernels from truncated gradients and measure their agreement with full-gradient kernels using average row-wise Spearman correlation. Empirically, we find strong agreement between kernel matrices constructed from truncated gradients and those from full gradients. In particular, Spearman correlation between pairwise similarities remains high, with diminishing returns beyond two layers for Qwen3-0.6B-Base and three layers for Qwen2.5-0.5B-Instruct.
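The agreement measure in this calibration study can be sketched as below. The exact procedure is not spelled out in the text, so two choices here are assumptions: the diagonal (self-similarity) entry of each row is excluded, and ranks are computed without tie handling, which is adequate for continuous similarity values. `rowwise_spearman` and `_ranks` are hypothetical names.

```python
import numpy as np

def _ranks(v):
    """Ranks 0..m-1 of the entries of v (no tie handling)."""
    order = np.argsort(v)
    ranks = np.empty(len(v), dtype=float)
    ranks[order] = np.arange(len(v))
    return ranks

def rowwise_spearman(K_ref, K_approx):
    """Mean over rows of the Spearman correlation between two kernels."""
    n = K_ref.shape[0]
    scores = []
    for i in range(n):
        mask = np.arange(n) != i              # drop the self-similarity entry
        r_ref = _ranks(K_ref[i, mask])
        r_approx = _ranks(K_approx[i, mask])
        # Spearman correlation is the Pearson correlation of the ranks.
        scores.append(np.corrcoef(r_ref, r_approx)[0, 1])
    return float(np.mean(scores))
```

Here `K_ref` would be the cosine kernel built from full gradients and `K_approx` the kernel built from the last-$n$-layer gradients; a value near 1 means each sample ranks its neighbours in nearly the same order under both kernels.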
Table 1: Average Spearman correlation with full gradients.
Since our focus is on pretraining corpora rather than instruction-tuned settings, we use Qwen3-0.6B-Base as the default model for gradient computation and restrict gradient computation to the last two layers, which provides a strong trade-off between accuracy and efficiency.
### 4.2 For subset selection, a smaller $k$ can be chosen for the Johnson–Lindenstrauss projection dimension
Despite restricting gradient computation to the final two layers, the resulting gradient representations remain extremely high-dimensional, comprising 200 million parameters. Direct pairwise similarity computation in this space is therefore computationally prohibitive. To address this, we employ a dimensionality reduction scheme based on the Johnson–Lindenstrauss (JL) lemma (Johnson et al., [1984](https://arxiv.org/html/2606.15216#bib.bib21)), instantiated via Rademacher random projections (Achlioptas, [2003](https://arxiv.org/html/2606.15216#bib.bib22)), which approximately preserve pairwise distances after projection to a lower dimension.
Formally, let $X \subset \mathbb{R}^{d}$ be a set of $n$ points and let $\varepsilon \in (0,1)$. We construct a random projection matrix $R \in \mathbb{R}^{k \times d}$, where $R_{ij} \in \{+1,-1\}$ are i.i.d. Rademacher variables. If
$$k \geq \frac{4+2\beta}{\varepsilon^{2}/2 - \varepsilon^{3}/3} \ln(n), \tag{8}$$
then with probability at least $1 - n^{-\beta}$, the embedding $f(u) = \frac{1}{\sqrt{k}} R u$ approximately preserves all pairwise Euclidean distances over $X$, such that
$$(1-\varepsilon)\|u-v\|^{2} \leq \|f(u)-f(v)\|^{2} \leq (1+\varepsilon)\|u-v\|^{2} \quad \forall\, u,v \in X. \tag{9}$$
For example, with $N = 3 \times 10^{8}$, $\varepsilon = 0.11$, and $\beta = 1$, this bound yields $k \approx 18035$, which is computationally expensive in both storage and runtime at pretraining scale.
In practice, however, we are performing subset selection, so even if the similarities are distorted, the chosen subsets may still be accurate, depending on the proportion of data selected. To study this, we empirically evaluate the effect of $k$ on subset stability. We sample 200k documents and compute gradient-based representations using the last two model layers. We then perform G-Vendi optimization across different values of $k$ and measure the overlap between selected subsets. We sweep $k$ from 512 to 16,384 and compare each resulting subset against a high-dimensional reference using $k = 17{,}920$. Despite large changes in projection dimension, subset selection remains highly stable across all settings (see Appendix [A](https://arxiv.org/html/2606.15216#A1)).
In particular, even at $k = 1024$, we observe 95.5% overlap with the reference subset at a 0.5 selection proportion, indicating that low-dimensional JL embeddings are sufficient for G-Vendi optimization in practice. Given the high overlap between the chosen subsets and the memory savings from a lower dimension, we use $k = 1024$.
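A small-scale sketch of the Rademacher projection $f(u) = \frac{1}{\sqrt{k}} R u$ and of how well it preserves cosine similarities is given below. The sizes used in the usage note (20 vectors in 2000 dimensions) are toy values chosen for illustration, and `rademacher_project` and `cosine_kernel` are hypothetical helper names. The sketch materializes $R$ in memory, which would not be feasible for the roughly 200-million-dimensional gradients described above; a real pipeline would presumably generate and apply $R$ in blocks.

```python
import numpy as np

def rademacher_project(G, k, seed=0):
    """Project rows of G from d to k dimensions with a +/-1 random matrix."""
    rng = np.random.default_rng(seed)
    d = G.shape[1]
    # Entries are i.i.d. uniform on {-1, +1}.
    R = rng.integers(0, 2, size=(k, d), dtype=np.int8) * 2 - 1
    return (G @ R.T.astype(G.dtype)) / np.sqrt(k)

def cosine_kernel(Z):
    """Pairwise cosine similarities between the rows of Z."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return Zn @ Zn.T
```

For example, calling `Z = rademacher_project(G, k=1024)` on random Gaussian `G` of shape `(20, 2000)` and comparing `cosine_kernel(G)` with `cosine_kernel(Z)` gives entrywise differences on the order of $1/\sqrt{k}$, which is small enough to leave most similarity rankings, and hence most selected subsets, unchanged.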
### 4.3 Controlling for length
Prior work has shown that sequence length can influence embedding similarity-based metrics. As a diagnostic baseline, we first observe that in an embedding-based Vendi setup using the Qwen3-0.6B-embedding model (Zhang et al., [2025](https://arxiv.org/html/2606.15216#bib.bib13)), the selected subset has a lower average sequence length (827 tokens) than the random baseline (1230 tokens), despite higher diversity scores. This effect is consistent with known pooling behavior in embedding models, where token representations are aggregated over context.
To reduce length\-driven confounding effects in this baseline comparison, we truncate all sequences to a maximum length of 768 tokens, approximately matching the median length of a random DCLM sample \(710 tokens\)\. Under this setting, gradient\-based subsets still exhibit higher average sequence length \(1482 tokens\) compared to both embedding\-based subsets \(1136\) and random sampling \(1230\)\. However, it is likely that the increased sequence length in the G\-Vendi case comes from the natural diversity of the dataset as all samples are subject to the same length constraint\.
## 5 Experimental Setup
### 5.1 Dataset Selection
We evaluate our method on two large-scale pretraining corpora with differing filtering regimes to assess robustness across data distributions.
First, we use the DCLM dataset (Dolmino subset) (OLMo et al., [2024](https://arxiv.org/html/2606.15216#bib.bib10)), which undergoes relatively aggressive filtering and curation. Second, we use FineWeb (Penedo et al., [2024](https://arxiv.org/html/2606.15216#bib.bib12)), a less aggressively filtered web-scale corpus. These datasets differ in their noise profiles and redundancy, providing a useful testbed for evaluating the effect of diversity-based selection under varying data quality conditions.
### 5.2 Baselines
We compare against commonly used data selection strategies in large-scale language model pretraining: (1) Random sampling, which serves as a standard baseline. (2) Semantic deduplication, which removes near-duplicate samples using embedding-based clustering. (3) Quality filtering, which selects high-quality subsets based on heuristic or learned quality scores. The quality scores are based on the FineWeb-Edu classifier (Penedo et al., [2024](https://arxiv.org/html/2606.15216#bib.bib12)).
We evaluate these methods by pretraining from scratch with a LLaMA-1B architecture (Grattafiori et al., [2024](https://arxiv.org/html/2606.15216#bib.bib23)) on 100 billion tokens.
### 5.3 Hyperparameters used
We used a constant learning rate of 3e-4 for all experiments. We also used a batch size of 2048 with sequence length 4096. All pretraining-from-scratch experiments were run on a LLaMA-1B architecture using the Megatron-LM library (Shoeybi et al., [2019](https://arxiv.org/html/2606.15216#bib.bib17)). Each run took around 576 NVIDIA H100 GPU hours.
For the semantic deduplication experiments, we tuned epsilon until 50% of the data was filtered. For the DCLM subset, we used 10,000 clusters and an epsilon of 0.005; for FineWeb, we used 20,000 clusters and an epsilon of 0.25. More clusters were used for FineWeb because it has almost twice as many data samples despite having the same token count. For the quality filters, thresholds of 3.0625 and 1.0693 were used for DCLM and FineWeb, respectively.
### 5.4 Evaluation
We evaluate models on 10 English benchmarks from the OLMES evaluation suite (Gu et al., [2025](https://arxiv.org/html/2606.15216#bib.bib15)), which was chosen for its strong discriminative power among small models and its use of standardized prompting, consistent formatting, and cloze-style task formulations.
## 6 Results
### 6.1 Spokes improves batch-level gradient diversity
We find that our optimization procedure extracts substantially more diverse subsets from pretraining corpora than random sampling typically recovers. For example, a random sample of 500k documents drawn from a 200B-token pool of DCLM (Li et al., [2024](https://arxiv.org/html/2606.15216#bib.bib11)) from the Dolmino dataset (OLMo et al., [2024](https://arxiv.org/html/2606.15216#bib.bib10)) (173M documents) achieves a relatively low G-Vendi score of 315, as shown in [Figure 4](https://arxiv.org/html/2606.15216#S6.F4). In contrast, using the optimised weights of Spokes to select $k$ data points produces a consistent scaling effect: as the top-$k$ cutoff on the weights optimized by [Algorithm 1](https://arxiv.org/html/2606.15216#alg1) increases, the G-Vendi score improves monotonically, as shown in [Figure 4](https://arxiv.org/html/2606.15216#S6.F4). This shows that Spokes is highly effective at recovering high-diversity subsets that uniform sampling fails to capture.
Figure 3: G-Vendi scores scale as the number of top-$k$ samples increases.
Figure 4: Vendi score using different representations and optimisation techniques.
The benefits of our subset selection extend beyond identifying small, diverse sets at the highest weight points. When performing full subset selection for the experiments in [Section 5](https://arxiv.org/html/2606.15216#S5), we observe that the average batch diversity remains consistently high throughout training, increasing from 315 to 520. This indicates that higher-diversity batches are encountered at every update step of the pretraining stage with the diverse selected data.
### 6.2 Gradients are more amenable to optimization
We compare different representations for measuring diversity, focusing on embeddings and gradients. As shown in [Figure 4](https://arxiv.org/html/2606.15216#S6.F4), embedding-based representations saturate quickly on large corpora, limiting their ability to discriminate between subsets. In contrast, gradient-based representations remain sensitive under optimization.
While random subsets achieve similar Vendi scores under both representations, applying our selection procedure leads to a substantial increase in G-Vendi score when using gradients, whereas embeddings show limited improvement. This indicates that gradients capture finer-grained structure that is not reflected in embedding similarity alone. This observation is consistent with prior work showing stronger alignment between gradient-space structure and out-of-distribution generalization (Jung et al., [2025](https://arxiv.org/html/2606.15216#bib.bib1)), which justifies our shift to gradients rather than embeddings.
### 6.3 Spokes improves pretraining outcomes reliably across different datasets
Table 2: Performance on world knowledge benchmarks. Best results are highlighted, second best are underlined.
Overall, our results show that: (1) diversity optimization alone yields consistent improvements over random sampling; (2) jointly optimizing quality and diversity produces the best overall performance, outperforming all other baselines; and (3) quality and diversity play complementary roles.
#### Diversity alone provides a strong signal\.
Across both DCLM and FineWeb, Spokes consistently outperforms random sampling. On DCLM, diversity-only selection improves the average score from 52.5 to 52.9; on FineWeb, from 45.1 to 45.6. These results show that optimizing for diversity alone yields consistent gains across corpora, even without explicit quality supervision. This suggests that a more diverse gradient representation of the underlying data distribution is sufficient to improve downstream generalization.
#### Joint optimization yields the best overall performance\.
Using joint optimization of both quality and diversity with [Algorithm 1](https://arxiv.org/html/2606.15216#alg1), we achieve the strongest overall results, outperforming both the quality-only and diversity-only baselines. The gains are considerably larger: the average score improves from 52.5 to 54.0 on DCLM and from 45.1 to 46.5 on FineWeb.
Notably, jointly optimized subsets have*lower*average quality scores than quality\-only subsets \(2\.95 vs\. 3\.18\), yet achieve better downstream performance\. This discrepancy suggests that standard quality metrics do not fully capture data utility and including diversity in the optimization can contribute complementary and useful training signals\.
#### Quality and diversity capture complementary strengths\.
Performance differences across tasks reveal a clear trade\-off\. On DCLM, diversity\-only selection performs well on commonsense reasoning tasks \(e\.g\., HellaSwag \(Zellers et al\., [2019](https://arxiv.org/html/2606.15216#bib.bib26)\)\), while quality\-based filtering is more effective on knowledge\-intensive benchmarks \(e\.g\., MMLU \(Hendrycks et al\., [2020](https://arxiv.org/html/2606.15216#bib.bib27)\)\)\. Spokes bridges this gap: joint optimization matches or exceeds the stronger baseline on most tasks while avoiding the weaknesses of optimizing for either objective alone\. For example, on DCLM, diversity\-only selection improves HellaSwag but lags on MMLU relative to quality filtering, whereas Spokes remains competitive on both\. These results indicate that quality and diversity encode distinct but complementary properties of the training data that are both useful for subset construction under fixed data budgets\.
## 7 Limitations
We acknowledge that a key limitation of our method is the computational cost of gradient evaluation\. While we restrict computation to the final two layers, which already yields a significant speedup, there remains substantial room for further efficiency improvements\. Prior work has shown that gradient computation can be made effectively compute\-optimal \(Yin and Rush, [2024](https://arxiv.org/html/2606.15216#bib.bib24)\) in regimes where the fixed cost of training the model dominates the cost of gradient evaluation\. This assumption aligns with our setting, where we expect selected data to be reused over multiple large pretraining runs\. Additional acceleration strategies may be possible by leveraging modern techniques such as Cut Cross Entropy \(Wijmans et al\., [2024](https://arxiv.org/html/2606.15216#bib.bib25)\), which reduce memory overhead and improve the efficiency of gradient computation, particularly in smaller models\.
## 8 Conclusion
We propose Spokes, a scalable method for diverse subset selection using G\-Vendi scores\. Diversity, as a set\-level property, has remained challenging to optimize directly, with prior approaches typically relying on proxy objectives or heuristics such as coarse\-grained clustering\. Spokes addresses this gap through a principled, global optimization framework that enables direct, data\-point\-level subset selection, producing consistently more diverse subsets than existing baselines\.
Across multiple corpora, Spokes yields subsets that improve downstream performance over all baseline methods under matched data budgets\. Beyond diversity alone, Spokes supports the joint optimization of quality and diversity, enabling the construction of subsets that better balance complementary training signals\. Empirically, we find that subsets selected by Spokes consistently outperform those produced by prior methods across evaluation benchmarks\.
These findings suggest that direct optimization of diversity is not only tractable at scale, but also practically beneficial, positioning Spokes as a useful tool for constructing both high\-quality and diverse subsets from large pretraining corpora\.
## References
- Semdedup: data\-efficient learning at web\-scale through semantic deduplication\. arXiv preprint arXiv:2303\.09540\. Cited by: [§1](https://arxiv.org/html/2606.15216#S1.p4.1), [§3\.3](https://arxiv.org/html/2606.15216#S3.SS3.p1.1), [§3\.3](https://arxiv.org/html/2606.15216#S3.SS3.p4.4), [§4\.1](https://arxiv.org/html/2606.15216#S4.SS1.p1.1)\.
- D\. Achlioptas \(2003\) Database\-friendly random projections: Johnson\-Lindenstrauss with binary coins\. Journal of Computer and System Sciences 66\(4\), pp\. 671–687\. Cited by: [§4\.2](https://arxiv.org/html/2606.15216#S4.SS2.p1.1)\.
- A\. Chandiramani, A\. Blakeman, A\. Olaoye, A\. Gupta, A\. Somasamudramath, A\. Khattar, A\. Adesoba, A\. Renduchintala, A\. Asif, A\. Agrawal, et al\. \(2026\) Nemotron 3 super: open, efficient mixture\-of\-experts hybrid mamba\-transformer model for agentic reasoning\. arXiv preprint arXiv:2604\.12374\. Cited by: [§1](https://arxiv.org/html/2606.15216#S1.p1.1)\.
- S\. Diao, Y\. Yang, Y\. Fu, X\. Dong, D\. Su, M\. Kliegl, Z\. Chen, P\. Belcak, Y\. Suhara, H\. Yin, et al\. \(2025\) Nemotron\-climb: clustering\-based iterative data mixture bootstrapping for language model pre\-training\. arXiv preprint arXiv:2504\.13161\. Cited by: [§1](https://arxiv.org/html/2606.15216#S1.p1.1)\.
- Z\. Fan, Y\. Xian, Y\. Sun, and L\. Shen \(2025\) Joint selection for large\-scale pre\-training data via policy gradient\-based mask learning\. arXiv preprint arXiv:2512\.24265\. Cited by: [§1](https://arxiv.org/html/2606.15216#S1.p1.1)\.
- D\. Friedman and A\. B\. Dieng \(2022\) The vendi score: a diversity evaluation metric for machine learning\. arXiv preprint arXiv:2210\.02410\. Cited by: [§2\.1](https://arxiv.org/html/2606.15216#S2.SS1.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. Cited by: [§5\.2](https://arxiv.org/html/2606.15216#S5.SS2.p2.1)\.
- Y\. Gu, O\. Tafjord, B\. Kuehl, D\. Haddad, J\. Dodge, and H\. Hajishirzi \(2025\) Olmes: a standard for language model evaluations\. In Findings of the Association for Computational Linguistics: NAACL 2025, pp\. 5005–5033\. Cited by: [§5\.4](https://arxiv.org/html/2606.15216#S5.SS4.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2020\) Measuring massive multitask language understanding\. arXiv preprint arXiv:2009\.03300\. Cited by: [§6\.3](https://arxiv.org/html/2606.15216#S6.SS3.SSS0.Px3.p1.1)\.
- W\. B\. Johnson, J\. Lindenstrauss, et al\. \(1984\) Extensions of lipschitz mappings into a hilbert space\. Contemporary Mathematics 26\(189\-206\), pp\. 1\. Cited by: [§4\.2](https://arxiv.org/html/2606.15216#S4.SS2.p1.1)\.
- J\. Jung, S\. Han, X\. Lu, S\. Hallinan, D\. Acuna, S\. Prabhumoye, M\. Patwary, M\. Shoeybi, B\. Catanzaro, and Y\. Choi \(2025\) Prismatic synthesis: gradient\-based data diversification boosts generalization in llm reasoning\. arXiv preprint arXiv:2505\.20161\. Cited by: [§1](https://arxiv.org/html/2606.15216#S1.p1.1), [§1](https://arxiv.org/html/2606.15216#S1.p4.1), [§2\.1](https://arxiv.org/html/2606.15216#S2.SS1.p2.3), [§6\.2](https://arxiv.org/html/2606.15216#S6.SS2.p2.1)\.
- J\. Li, A\. Fang, G\. Smyrnis, M\. Ivgi, M\. Jordan, S\. Y\. Gadre, H\. Bansal, E\. Guha, S\. S\. Keh, K\. Arora, et al\. \(2024\) Datacomp\-lm: in search of the next generation of training sets for language models\. Advances in Neural Information Processing Systems 37, pp\. 14200–14282\. Cited by: [§6\.1](https://arxiv.org/html/2606.15216#S6.SS1.p1.1)\.
- F\. Liu, W\. Zhou, B\. Liu, Z\. Yu, Y\. Zhang, H\. Lin, Y\. Yu, B\. Zhang, X\. Zhou, T\. Wang, et al\. \(2025\) Quadmix: quality\-diversity balanced data selection for efficient llm pretraining\. arXiv preprint arXiv:2504\.16511\. Cited by: [§1](https://arxiv.org/html/2606.15216#S1.p1.1)\.
- Q\. Nguyen and A\. B\. Dieng \(2024\) Quality\-weighted vendi scores and their application to diverse experimental design\. arXiv preprint arXiv:2405\.02449\. Cited by: [§3\.1](https://arxiv.org/html/2606.15216#S3.SS1.p2.5)\.
- T\. OLMo, P\. Walsh, L\. Soldaini, D\. Groeneveld, K\. Lo, S\. Arora, A\. Bhagia, Y\. Gu, S\. Huang, M\. Jordan, et al\. \(2024\) 2 olmo 2 furious\. arXiv preprint arXiv:2501\.00656\. Cited by: [§5\.1](https://arxiv.org/html/2606.15216#S5.SS1.p2.1), [§6\.1](https://arxiv.org/html/2606.15216#S6.SS1.p1.1)\.
- G\. Penedo, H\. Kydlíček, A\. Lozhkov, M\. Mitchell, C\. A\. Raffel, L\. Von Werra, T\. Wolf, et al\. \(2024\) The fineweb datasets: decanting the web for the finest text data at scale\. Advances in Neural Information Processing Systems 37, pp\. 30811–30849\. Cited by: [§3\.1](https://arxiv.org/html/2606.15216#S3.SS1.p1.2), [§5\.1](https://arxiv.org/html/2606.15216#S5.SS1.p2.1), [§5\.2](https://arxiv.org/html/2606.15216#S5.SS2.p1.1)\.
- M\. Shoeybi, M\. Patwary, R\. Puri, P\. LeGresley, J\. Casper, and B\. Catanzaro \(2019\) Megatron\-lm: training multi\-billion parameter language models using model parallelism\. arXiv preprint arXiv:1909\.08053\. Cited by: [§5\.3](https://arxiv.org/html/2606.15216#S5.SS3.p1.1)\.
- A\. Wettig, K\. Lo, S\. Min, H\. Hajishirzi, D\. Chen, and L\. Soldaini \(2025\) Organize the web: constructing domains enhances pre\-training data curation\. arXiv preprint arXiv:2502\.10341\. Cited by: [§1](https://arxiv.org/html/2606.15216#S1.p1.1)\.
- E\. Wijmans, B\. Huval, A\. Hertzberg, V\. Koltun, and P\. Krähenbühl \(2024\) Cut your losses in large\-vocabulary language models\. arXiv preprint arXiv:2411\.09009\. Cited by: [§7](https://arxiv.org/html/2606.15216#S7.p1.1)\.
- J\. O\. Yin and A\. M\. Rush \(2024\) Compute\-constrained data selection\. arXiv preprint arXiv:2410\.16208\. Cited by: [§7](https://arxiv.org/html/2606.15216#S7.p1.1)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\) Hellaswag: can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp\. 4791–4800\. Cited by: [§6\.3](https://arxiv.org/html/2606.15216#S6.SS3.SSS0.Px3.p1.1)\.
- Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, et al\. \(2025\) Qwen3 embedding: advancing text embedding and reranking through foundation models\. arXiv preprint arXiv:2506\.05176\. Cited by: [§4\.3](https://arxiv.org/html/2606.15216#S4.SS3.p1.1)\.
## Appendix A Finding k for the Johnson–Lindenstrauss transform
We empirically study an appropriate choice of $k$ for applying a Johnson–Lindenstrauss transform to project high\-dimensional gradients into a lower\-dimensional space\. We evaluate this by measuring how closely the selected subset matches a reference subset obtained in a high\-dimensional setting, using overlap as the agreement metric\. In our experiments, the reference dimensionality is $k=17{,}920$\.
We observe that substantially smaller values of $k$ are sufficient in practice\. For instance, at a 50% subset selection ratio, using $k=1{,}024$ achieves a 95\.5% overlap with the subset selected under the reference dimensionality\. Given the scale of our experiments, we consider this level of agreement acceptable\.

Figure 5: Subset overlap \(%\) with the $k=17{,}920$ reference across different Johnson–Lindenstrauss projection dimensions\. The white star shows our chosen dimension and subset proportion\.
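The overlap measurement described above can be sketched as follows\. This is a minimal illustration under stated assumptions: it uses a binary\-coin \(plus or minus one\) random projection in the style of Achlioptas, synthetic vectors in place of real gradients, and a hypothetical stand\-in selection rule \(`select_top_half`, which keeps the half of the points with the largest norm\) in place of the actual selection procedure\. The function names and toy dimensions are not from the paper\.

```python
import numpy as np


def binary_coin_projection(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Project rows of x to k dimensions with a random +/-1 matrix scaled by 1/sqrt(k)."""
    d = x.shape[1]
    signs = rng.choice([-1.0, 1.0], size=(d, k))
    return x @ signs / np.sqrt(k)


def select_top_half(x: np.ndarray) -> set:
    """Hypothetical stand-in for a selection rule: keep the half with the largest norm."""
    norms = np.linalg.norm(x, axis=1)
    keep = x.shape[0] // 2
    return set(np.argsort(-norms)[:keep].tolist())


def overlap(selected: set, reference: set) -> float:
    """Fraction of the reference subset that also appears in the selected subset."""
    return len(selected & reference) / len(reference)


rng = np.random.default_rng(0)
n, d = 400, 2048
# Synthetic "gradients" whose norms vary between roughly 0.5 and 2.0.
scales = rng.uniform(0.5, 2.0, size=(n, 1))
x = scales * rng.normal(size=(n, d)) / np.sqrt(d)

# Reference subset chosen in the original (high-dimensional) space.
reference = select_top_half(x)

# Overlap with the reference for several projection dimensions.
overlaps = {
    k: overlap(select_top_half(binary_coin_projection(x, k, rng)), reference)
    for k in (16, 64, 256)
}
print(overlaps)
```

The same loop, run with the real selection procedure and the real gradient features, would produce the kind of overlap\-versus\-$k$ curve summarized in Figure 5\.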
## Appendix B Societal Impact
Our work advances a more principled understanding of data diversity in LLM pretraining, with potential benefits for improving generalization through training on diverse data\. These methods should be applied with careful attention to ethical guidelines and responsible model deployment\.