GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

arXiv cs.LG Papers

Summary

GEM reformulates LLM data curation as a variational problem on the hypersphere, using geometric entropy mixing and a minorize-maximize algorithm to discover balanced semantic clusters, achieving state-of-the-art improvements in data mixing strategies by up to 1.2% average downstream accuracy.

arXiv:2605.26121v1 (Announce Type: new)

Cached at: 05/27/26, 09:02 AM

# GEM: Geometric Entropy Mixing for Optimal LLM Data Curation
Source: [https://arxiv.org/html/2605.26121](https://arxiv.org/html/2605.26121)
###### Abstract

LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorization flaws: human taxonomies suffer from ontological misalignment, and Euclidean clustering fails to address embedding anisotropy. We introduce GEM (Geometric Entropy Mixing), a framework reformulating data curation as a variational problem on the hypersphere augmented with a mixing-balance regularizer. By decoupling the generative prior and optimizing the objective via a provable MM (Minorize-Maximize) algorithm, GEM effectively counteracts cluster collapse to discover balanced semantic structures invisible to Euclidean heuristics. We employ teacher-student distillation to scale this geometric fidelity to web-scale corpora and introduce the Geometric Influence Score (GIS) for interpretable taxonomy generation. Experiments with 1.1B-parameter models demonstrate that GEM establishes a new state-of-the-art when integrated into mixing strategies like DoReMi and RegMix, improving average downstream accuracy by up to 1.2% and offering a robust coordinate system for predictable data mixing.

Pretrain, Data mixing, Data curation

## 1 Introduction

Data curation has emerged as the decisive factor in the performance of Large Language Models (LLMs) (Hoffmann et al., [2022](https://arxiv.org/html/2605.26121#bib.bib25); Gunasekar et al., [2023](https://arxiv.org/html/2605.26121#bib.bib26); Penedo et al., [2023a](https://arxiv.org/html/2605.26121#bib.bib27)), shifting the research frontier from sheer parameter scaling to the strategic “mixing” of heterogeneous data sources. As scaling laws (Kaplan et al., [2020](https://arxiv.org/html/2605.26121#bib.bib24)) evolve, the core challenge lies in partitioning massive-scale, unstructured corpora into semantically distinct and balanced clusters, which is a prerequisite for any principled data mixing strategy (Ye et al., [2024](https://arxiv.org/html/2605.26121#bib.bib32)). However, contemporary approaches to data classification generally fall into two categories, both of which face fundamental theoretical and practical bottlenecks.

![Refer to caption](https://arxiv.org/html/2605.26121v1/x1.png) Figure 1: Schematic illustration of geometric mismatch in semantic clustering. While (a) taxonomy-based approaches are hindered by rigid misalignment and high costs, and (b) Euclidean clustering fails to handle embedding anisotropy, leading to cluster collapse, (c) our proposed GEM framework utilizes MM-based inference to generate balanced, semantically distinct partitions on the hypersphere with superior efficiency.

The first category, taxonomy-based methods, relies on human-defined categorical hierarchies (Brown et al., [2020](https://arxiv.org/html/2605.26121#bib.bib15); Touvron et al., [2023b](https://arxiv.org/html/2605.26121#bib.bib29)). These approaches typically utilize high-capacity LLMs or ensembles to assign labels to documents. However, as illustrated in Figure [1](https://arxiv.org/html/2605.26121#S1.F1)(a), this paradigm suffers from a critical ontological misalignment: human-centric categories often do not reflect the latent semantic granularity required for self-supervised learning. Empirical evidence suggests that even state-of-the-art models exhibit low inter-annotator consistency when classifying nuanced web data, indicating that human taxonomies fail to capture the true underlying distribution of model-relevant knowledge (Maini et al., [2024](https://arxiv.org/html/2605.26121#bib.bib31); Abbas et al., [2023](https://arxiv.org/html/2605.26121#bib.bib30)). Furthermore, the cost of labeling renders this approach unsustainable: model development is dynamic and data is continuously updated, so constant re-annotation of the corpus is operationally prohibitive.

Alternatively, unsupervised approaches like K-Means (MacQueen, [1967](https://arxiv.org/html/2605.26121#bib.bib22)) provide a scalable option but are predicated on Euclidean geometry. This creates a fundamental mismatch with modern neural embeddings (e.g., BGE (Xiao et al., [2024](https://arxiv.org/html/2605.26121#bib.bib35)), RoBERTa (Liu et al., [2019](https://arxiv.org/html/2605.26121#bib.bib39))), which inherently reside on a high-dimensional hyperspherical manifold optimized for cosine similarity. This geometric discrepancy is exacerbated by anisotropy, the so-called “cone effect” (Li et al., [2020](https://arxiv.org/html/2605.26121#bib.bib37)), where representations concentrate in narrow, non-uniform sub-regions. Consequently, applying Euclidean clustering to this Riemannian space precipitates “cluster collapse,” as illustrated in Figure [1](https://arxiv.org/html/2605.26121#S1.F1)(b), where dominant clusters swallow the semantic long-tail, severely limiting the diversity requisite for model generalization (Ethayarajh, [2019](https://arxiv.org/html/2605.26121#bib.bib36); Gao et al., [2021](https://arxiv.org/html/2605.26121#bib.bib38)).

To bridge this gap, we introduce GEM (Geometric Entropy Mixing), which aligns semantic partitioning with the intrinsic Riemannian geometry of neural representations. As conceptualized in Figure [1](https://arxiv.org/html/2605.26121#S1.F1)(c), GEM departs from Euclidean heuristics by formulating the clustering task as an entropy-regularized variational objective augmented with a mixing-balance regularizer on the unit hypersphere. By explicitly decoupling the generative prior and integrating a balance regularizer on the empirical mass into the von Mises-Fisher Mixture Model (vMFMM), our method effectively mitigates embedding anisotropy and prevents cluster collapse. This allows GEM to discover fine-grained semantic structures and long-tail distributions that remain latent to traditional distance-based methods, providing a more expressive semantic basis for data mixing. From a systems perspective, GEM is architected for web-scale deployment through a Teacher-Student distillation pipeline that achieves linear time complexity with respect to corpus size. Furthermore, to bridge the gap between geometric clustering and human-centric data curation, we introduce a Geometric Influence Score (GIS)-based sampling method to generate an interpretable, fine-grained taxonomy with descriptions for each semantic category. Extensive experiments demonstrate that the data mixtures derived via GEM consistently yield superior scaling laws, manifesting in lower validation perplexity and enhanced performance across diverse downstream benchmarks compared to competitive baselines.

Our primary contributions are summarized as follows:

- Geometric formulation with balance regularization. We propose a hyperspherical variational framework with a novel mixing-balance regularizer to effectively prevent cluster collapse under embedding anisotropy.
- Provable MM-based inference algorithm. We derive a provable MM (Minorize-Maximize) algorithm that guarantees monotonic ascent, ensuring stable convergence for the regularized objective.
- Scalable deployment with interpretability. We enable linear-time inference via teacher–student distillation and introduce the Geometric Influence Score (GIS) for interpretable taxonomy generation.
- Consistent gains in data mixing. Experiments with 1.1B models demonstrate consistent performance gains over strong baselines across diverse benchmarks.

![Refer to caption](https://arxiv.org/html/2605.26121v1/x2.png) Figure 2: Schematic overview of the GEM framework. The pipeline consists of two phases: (1) Geometric Optimization (Teacher): We perform entropy-regularized clustering on the hypersphere using a Mixture of von Mises-Fisher (vMF) distributions. A Minorize–Maximize (MM) algorithm iteratively updates the Riemannian parameters ($\mu, \kappa$) on a seed corpus $\mathcal{X}_{seed}$ to discover semantic structures. (2) Scalable Distillation (Student): The converged geometric partitions are used to pseudo-label the mass corpus $\mathcal{X}$, guided by the GIS score. These labels are then distilled into a lightweight FastText classifier, enabling efficient inference at scale.

## 2 Related Work

Data Selection and Mixing for LLMs. Data mixing strategies are pivotal for optimizing LLM training stability and generalization (Brown et al., [2020](https://arxiv.org/html/2605.26121#bib.bib15); Touvron et al., [2023a](https://arxiv.org/html/2605.26121#bib.bib10); Team, [2023](https://arxiv.org/html/2605.26121#bib.bib16)). Recent research has introduced adaptive reweighting frameworks. For instance, DoReMi (Xie et al., [2023](https://arxiv.org/html/2605.26121#bib.bib14)), DoGE (Fan et al., [2023](https://arxiv.org/html/2605.26121#bib.bib49)), Aioli (Chen et al., [2024](https://arxiv.org/html/2605.26121#bib.bib50)) and RegMix (Liu et al., [2024](https://arxiv.org/html/2605.26121#bib.bib13)) utilize training signals such as gradient alignment, excess loss or performance regression to dynamically adjust domain weights. Extending this granularity, SampleMix (Xi et al., [2025](https://arxiv.org/html/2605.26121#bib.bib28)) performs evaluation of each individual sample, while TikMix (Wang et al., [2025](https://arxiv.org/html/2605.26121#bib.bib19)) dynamically recalibrates mixing weights based on data influence. Furthermore, QuadMix (Liu et al., [2025](https://arxiv.org/html/2605.26121#bib.bib18)) introduces a unified objective to assess both data quality and diversity. Nevertheless, these methods typically treat the underlying categorization as an exogenous constant, so their effectiveness is fundamentally upper-bounded by the quality of the initial partition. If the taxonomy is semantically misaligned or noisy, even these sophisticated mixing algorithms will struggle to isolate high-utility data features. To address this limitation, we argue that refined structural granularity is a prerequisite for effective mixing. We propose a geometry-aware classification scheme that induces semantically coherent partitions from the latent space, enabling robust mixing on high-entropy web data.

Pretraining Data Categorization. Several recent works have explicitly addressed the problem of categorizing large-scale pre-training data, which can be broadly divided into taxonomy-based and unsupervised methods. Taxonomy-based approaches rely on predefined label systems and assign categories using supervised classifiers or LLMs. Systems such as WebOrganizer and TnT-LLM employ LLM-based pipelines to annotate web documents into manually designed taxonomies (Wan et al., [2024](https://arxiv.org/html/2605.26121#bib.bib20); Wettig et al., [2025](https://arxiv.org/html/2605.26121#bib.bib12)). While these methods yield human-interpretable labels, they suffer from two limitations: (i) the imposed taxonomies reflect human-defined ontologies rather than the latent semantic structure learned by the model, leading to potential ontological misalignment; and (ii) large-scale inference with LLMs incurs substantial computational cost, limiting scalability. Unsupervised categorization methods avoid manual labels and instead cluster representations produced by pretrained encoders. Typical techniques include K-Means or density-based clustering algorithms such as HDBSCAN (MacQueen, [1967](https://arxiv.org/html/2605.26121#bib.bib22); McInnes et al., [2017](https://arxiv.org/html/2605.26121#bib.bib21)), as adopted in systems like NVIDIA Climb (Diao et al., [2025](https://arxiv.org/html/2605.26121#bib.bib23)). Although scalable and label-free, these approaches generally operate in Euclidean space and rely on distance-based objectives. In high-dimensional embedding spaces, however, distances tend to concentrate, making Euclidean proximity a weak proxy for semantic similarity.

## 3 Methodology

We introduce GEM (Geometric Entropy Mixing), a spherical mixture modeling framework for unsupervised semantic partitioning of web-scale text embeddings. GEM is built on directional statistics on the unit hypersphere $\mathcal{S}^{d-1}$ and optimizes an entropy-regularized variational objective augmented with an explicit mixing-balance regularizer to mitigate cluster collapse induced by embedding anisotropy. Figure [2](https://arxiv.org/html/2605.26121#S1.F2) gives an overview. Below, we describe (i) the geometric problem setup, (ii) an entropy-aware vMF mixture formulation with a balance regularizer, (iii) a scalable *MM-based* (minorize–maximize) inference scheme with provable monotonic ascent, (iv) an interpretable taxonomy generation pipeline leveraging Geometric Influence Scores (GIS), and (v) a teacher-student distillation framework for efficient deployment on trillion-token corpora.

### 3.1 Problem Reformulation

We consider unsupervised semantic partitioning for a massive corpus of normalized text embeddings $\mathcal{X}=\{x_i\}_{i=1}^{N}\subset\mathbb{R}^{d}$, where each $x_i$ is $\ell_2$-normalized and thus lies on the unit hypersphere: $x_i\in\mathcal{S}^{d-1}\coloneqq\{x\in\mathbb{R}^{d}:\|x\|_2=1\}$. Our goal is to learn a partition $\mathcal{C}=\{C_1,\dots,C_K\}$ such that clusters are distinguishable by *semantic directionality*, yielding a robust semantic basis for downstream data mixing in LLM pre-training.

Motivation: concentration on high-dimensional spheres. A classical concentration phenomenon suggests that Euclidean proximity becomes less informative in high dimensions; on $\mathcal{S}^{d-1}$, angles between random directions concentrate near $\pi/2$.

###### Lemma 3.1 (Concentration on hyperspheres (Ledoux, [2001](https://arxiv.org/html/2605.26121#bib.bib51))).

Let $x\sim\mathrm{Unif}(\mathcal{S}^{d-1})$. For any fixed $p\in\mathcal{S}^{d-1}$ and any $\epsilon>0$,

$$\mathbb{P}\!\left(\left|\langle x,p\rangle\right|\leq\epsilon\right)\;\geq\;1-2\exp\!\left(-\frac{d\epsilon^{2}}{2}\right). \tag{1}$$

Remark. Lemma [3.1](https://arxiv.org/html/2605.26121#S3.Thmtheorem1) provides an intuition that, for $d\gg 1$, random directions are nearly orthogonal. Real neural embeddings are not uniform on $\mathcal{S}^{d-1}$; they often exhibit strong anisotropy and “hubness”, making purely Euclidean clustering unstable. This motivates modeling directional coherence using spherical distributions whose sufficient statistic is cosine similarity.
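The bound in Eq. (1) can be checked empirically. The following is a minimal sketch, assuming NumPy is available; the function names, the sample size, and the choices $d=64$ and $\epsilon=0.5$ are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def sample_unit_sphere(n, d, rng):
    """Draw n points uniformly from S^{d-1} by normalizing Gaussian vectors."""
    g = rng.standard_normal((n, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def concentration_check(n=20000, d=64, eps=0.5, seed=0):
    """Return (empirical P(|<x, p>| <= eps), lower bound from Eq. (1))."""
    rng = np.random.default_rng(seed)
    x = sample_unit_sphere(n, d, rng)
    p = np.zeros(d)
    p[0] = 1.0  # any fixed unit vector works, by rotational symmetry
    empirical = float(np.mean(np.abs(x @ p) <= eps))
    bound = 1.0 - 2.0 * float(np.exp(-d * eps**2 / 2.0))
    return empirical, bound

empirical, bound = concentration_check()
print(f"empirical={empirical:.5f}  lower bound={bound:.5f}")
```

Increasing `d` at fixed `eps` pushes both numbers toward 1, which is the "nearly orthogonal" intuition of the remark. The check only covers the uniform case; it says nothing about anisotropic embeddings.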

Variational learning objective. We seek directional parameters $\Theta$ and soft assignments $\Gamma=\{\gamma_{ik}\}$ that fit a spherical mixture model while explicitly encouraging *balanced* cluster masses to improve the diversity of induced data mixtures. Let the empirical (soft) cluster mass be

$$\pi_k(\Gamma)\;\coloneqq\;\frac{1}{N}\sum_{i=1}^{N}\gamma_{ik},\qquad \boldsymbol{\pi}(\Gamma)\in\Delta^{K-1},\qquad \text{where } \mathbf{u}\coloneqq\tfrac{1}{K}\mathbf{1}. \tag{2}$$

We optimize an entropy-regularized variational lower bound (ELBO) augmented with a mixing-balance regularizer:

$$\max_{\Theta,\Gamma}\;\underbrace{\sum_{i=1}^{N}\sum_{k=1}^{K}\gamma_{ik}\log\!\big(\alpha_k f_{ik}(\Theta)\big)+\sum_{i=1}^{N}H(\gamma_i)}_{\text{Geometric Fidelity (ELBO)}}\;-\;\frac{\lambda}{2}\underbrace{\big\|\boldsymbol{\pi}(\Gamma)-\mathbf{u}\big\|_2^2}_{\text{Mixing-balance}},\qquad \lambda>0, \tag{3}$$

where $H(\gamma_i)\coloneqq-\sum_{k=1}^{K}\gamma_{ik}\log\gamma_{ik}$ and $f_{ik}(\Theta)\coloneqq f_{\mathrm{vMF}}(x_i\mid\mu_k,\kappa_k)$ (defined in Sec. [3.2](https://arxiv.org/html/2605.26121#S3.SS2)). The ELBO term in Eq. ([3](https://arxiv.org/html/2605.26121#S3.E3)) is a lower bound of the marginal log-likelihood $\sum_{i=1}^{N}\log\sum_{k=1}^{K}\alpha_k f_{ik}(\Theta)$, and becomes exact when $\gamma_i$ equals the posterior responsibilities. The balance term penalizes deviation from uniform usage of clusters, mitigating degenerate solutions and stabilizing partitions under anisotropic embedding distributions.
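The relationship between the ELBO term and the marginal log-likelihood can be verified numerically. The sketch below assumes NumPy and takes as input a matrix `log_joint` whose entry $(i,k)$ is $\log(\alpha_k f_{ik}(\Theta))$; the function names and the random inputs are hypothetical and only serve the check.

```python
import numpy as np

def elbo(log_joint, gamma):
    """ELBO term of Eq. (3): sum_ik gamma_ik * log(alpha_k f_ik) + sum_i H(gamma_i)."""
    safe = np.clip(gamma, 1e-300, None)  # avoids log(0); 0 * log(0) is treated as 0
    return float(np.sum(gamma * log_joint) - np.sum(gamma * np.log(safe)))

def marginal_log_lik(log_joint):
    """sum_i log sum_k alpha_k f_ik, computed with a stable log-sum-exp."""
    top = log_joint.max(axis=1, keepdims=True)
    return float(np.sum(top[:, 0] + np.log(np.exp(log_joint - top).sum(axis=1))))

def posterior(log_joint):
    """Posterior responsibilities gamma_ik proportional to alpha_k f_ik."""
    z = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
log_joint = rng.normal(size=(100, 4)) + np.log(0.25)  # uniform prior alpha_k = 1/4
uniform = np.full((100, 4), 0.25)
print(elbo(log_joint, uniform) <= marginal_log_lik(log_joint))
print(np.isclose(elbo(log_joint, posterior(log_joint)), marginal_log_lik(log_joint)))
```

Any valid assignment gives a value at or below the marginal log-likelihood, and the posterior responsibilities close the gap. The balance term then pulls the maximizer of Eq. (3) away from the plain posterior whenever the posterior masses are uneven.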

### 3.2 Entropy-Aware Directional Mixture Modeling

von Mises–Fisher (vMF) components. We instantiate the directional likelihood via a mixture of von Mises–Fisher (movMF) distributions (Banerjee et al., [2005](https://arxiv.org/html/2605.26121#bib.bib54)), the canonical directional family on $\mathcal{S}^{d-1}$. A component $k$ has mean direction $\mu_k\in\mathcal{S}^{d-1}$ and concentration $\kappa_k\geq 0$:

$$\begin{aligned} f_{\mathrm{vMF}}(x\mid\mu_k,\kappa_k) &= C_d(\kappa_k)\exp(\kappa_k\mu_k^{\top}x),\\ C_d(\kappa_k) &= \frac{\kappa_k^{d/2-1}}{(2\pi)^{d/2}I_{d/2-1}(\kappa_k)}, \end{aligned} \tag{4}$$

where $I_{\nu}(\cdot)$ is the modified Bessel function of the first kind. The sufficient statistic $\mu_k^{\top}x$ is exactly cosine similarity, aligning the generative likelihood with modern embedding metrics.
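Eq. (4) is usually evaluated in log space, because $I_{d/2-1}(\kappa)$ overflows or underflows quickly in high dimensions. The sketch below is one way to do this with the Python standard library only, assuming $\kappa>0$ and using the power series $I_\nu(\kappa)=\sum_{m\ge 0}\frac{(\kappa/2)^{2m+\nu}}{m!\,\Gamma(m+\nu+1)}$. The function names and the fixed number of series terms are assumptions of this sketch, not the authors' implementation; a production version would more likely rely on a scaled Bessel routine from a numerical library.

```python
import math

def log_bessel_i(nu, kappa, terms=400):
    """log I_nu(kappa) for kappa > 0 via the power series, summed with log-sum-exp.

    The fixed term count is adequate for moderate kappa; very large kappa
    would need more terms or an asymptotic expansion.
    """
    log_half = math.log(kappa / 2.0)
    logs = [
        (2 * m + nu) * log_half - math.lgamma(m + 1) - math.lgamma(m + nu + 1)
        for m in range(terms)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def log_vmf_normalizer(d, kappa):
    """log C_d(kappa) from Eq. (4)."""
    nu = d / 2.0 - 1.0
    return nu * math.log(kappa) - (d / 2.0) * math.log(2.0 * math.pi) - log_bessel_i(nu, kappa)

def log_vmf(x, mu, kappa):
    """log f_vMF(x | mu, kappa) for unit vectors x and mu given as sequences."""
    cosine = sum(a * b for a, b in zip(x, mu))
    return log_vmf_normalizer(len(x), kappa) + kappa * cosine

# Sanity check against the closed form for d = 3: C_3(kappa) = kappa / (4*pi*sinh(kappa)).
kappa = 2.0
closed_form = math.log(kappa / (4.0 * math.pi * math.sinh(kappa)))
print(abs(log_vmf_normalizer(3, kappa) - closed_form) < 1e-9)
```

The density depends on $x$ only through the cosine with $\mu_k$, so for a fixed $\kappa$ the ranking of points under a component is the same as their ranking by cosine similarity.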

Decoupling the generative prior from the empirical mass. To avoid the “rich-get-richer” feedback in standard EM due to learned mixture weights, we fix the generative prior $\alpha_k\equiv 1/K$ for all $k$. The resulting marginal likelihood is

$$P(x_i\mid\Theta)\;=\;\sum_{k=1}^{K}\alpha_k f_{\mathrm{vMF}}(x_i\mid\mu_k,\kappa_k),\qquad \alpha_k=\frac{1}{K}. \tag{5}$$

Importantly, the balance regularizer in Eq. ([3](https://arxiv.org/html/2605.26121#S3.E3)) acts on the empirical assignment mass $\boldsymbol{\pi}(\Gamma)$, not on the generative prior $\boldsymbol{\alpha}$.

###### Proposition 3.2 (Concavity, smoothness, and gradient form of the mixing-balance regularizer).

Let $\mathbf{u}\coloneqq\tfrac{1}{K}\mathbf{1}$ and define

$$R(\boldsymbol{\pi})\;\coloneqq\;-\frac{\lambda}{2}\big\|\boldsymbol{\pi}-\mathbf{u}\big\|_2^2,\qquad \lambda>0, \tag{6}$$

for $\boldsymbol{\pi}\in\Delta^{K-1}$. Then $R$ is (i) concave on $\mathbb{R}^{K}$ (hence also on $\Delta^{K-1}$), (ii) differentiable with gradient

$$\nabla_{\pi_k}R(\boldsymbol{\pi})\;=\;-\lambda(\pi_k-u_k),\qquad u_k=\tfrac{1}{K}, \tag{7}$$

and (iii) $\lambda$-smooth with respect to the Euclidean norm, i.e., its gradient is $\lambda$-Lipschitz:

$$\big\|\nabla R(\boldsymbol{\pi})-\nabla R(\boldsymbol{\pi}')\big\|_2\;\leq\;\lambda\big\|\boldsymbol{\pi}-\boldsymbol{\pi}'\big\|_2,\qquad \forall\,\boldsymbol{\pi},\boldsymbol{\pi}'\in\mathbb{R}^{K}. \tag{8}$$

###### Proof.

Expanding Eq. ([6](https://arxiv.org/html/2605.26121#S3.E6)) gives $R(\boldsymbol{\pi})=-\tfrac{\lambda}{2}\sum_{k=1}^{K}(\pi_k-u_k)^2$. Differentiating w.r.t. $\pi_k$ yields Eq. ([7](https://arxiv.org/html/2605.26121#S3.E7)). Moreover, $R$ is a negative quadratic function with constant Hessian $\nabla^2 R(\boldsymbol{\pi})=-\lambda I_K\preceq 0$, which implies concavity on $\mathbb{R}^{K}$. Finally, $\nabla R(\boldsymbol{\pi})=-\lambda(\boldsymbol{\pi}-\mathbf{u})$ is an affine map, hence $\|\nabla R(\boldsymbol{\pi})-\nabla R(\boldsymbol{\pi}')\|_2=\lambda\|\boldsymbol{\pi}-\boldsymbol{\pi}'\|_2$, establishing $\lambda$-smoothness. ∎
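The three claims of Proposition 3.2 are easy to confirm numerically. The sketch below assumes NumPy; the helper names and the finite-difference step are illustrative choices, not part of the paper.

```python
import numpy as np

def balance_reg(pi, lam):
    """R(pi) = -(lam / 2) * ||pi - u||^2 with u the uniform vector, as in Eq. (6)."""
    k = pi.shape[0]
    return -0.5 * lam * float(np.sum((pi - 1.0 / k) ** 2))

def balance_reg_grad(pi, lam):
    """Closed-form gradient from Eq. (7)."""
    k = pi.shape[0]
    return -lam * (pi - 1.0 / k)

def finite_diff_grad(pi, lam, h=1e-6):
    """Central finite differences of R, used only to cross-check Eq. (7)."""
    grad = np.zeros_like(pi)
    for j in range(pi.shape[0]):
        step = np.zeros_like(pi)
        step[j] = h
        grad[j] = (balance_reg(pi + step, lam) - balance_reg(pi - step, lam)) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
lam = 3.0
a, b = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
# (ii) gradient form
print(np.allclose(balance_reg_grad(a, lam), finite_diff_grad(a, lam), atol=1e-6))
# (iii) the Lipschitz bound of Eq. (8) holds with equality for this quadratic
lhs = np.linalg.norm(balance_reg_grad(a, lam) - balance_reg_grad(b, lam))
print(np.isclose(lhs, lam * np.linalg.norm(a - b)))
# (i) midpoint concavity
print(balance_reg((a + b) / 2.0, lam) >= 0.5 * (balance_reg(a, lam) + balance_reg(b, lam)))
```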

### 3.3 MM-Based Inference and Parameter Updates

Directly maximizing Eq. ([3](https://arxiv.org/html/2605.26121#S3.E3)) is nontrivial because $R(\boldsymbol{\pi}(\Gamma))$ couples all samples through the global mass $\boldsymbol{\pi}(\Gamma)$. We therefore derive a scalable minorize–maximize (MM) update that yields a provable monotonic ascent of the objective.

E-step: an MM surrogate with guaranteed ascent. Fix $\Theta=\Theta^{(t)}$ and denote $\boldsymbol{\pi}^{(t)}\coloneqq\boldsymbol{\pi}(\Gamma^{(t)})$. Since $R$ is concave and $\lambda$-smooth, it admits the following global quadratic minorizer: for any $\boldsymbol{\pi},\boldsymbol{\pi}'\in\Delta^{K-1}$,

$$R(\boldsymbol{\pi})\;\geq\;R(\boldsymbol{\pi}')+\big\langle\nabla R(\boldsymbol{\pi}'),\,\boldsymbol{\pi}-\boldsymbol{\pi}'\big\rangle-\frac{\lambda}{2}\big\|\boldsymbol{\pi}-\boldsymbol{\pi}'\big\|_2^2. \tag{9}$$

Applying Eq. ([9](https://arxiv.org/html/2605.26121#S3.E9)) with $\boldsymbol{\pi}'=\boldsymbol{\pi}^{(t)}$ and $\boldsymbol{\pi}=\boldsymbol{\pi}(\Gamma)$ yields a tight lower bound of the regularizer at $\Gamma^{(t)}$. Consequently, we define the MM surrogate (a global lower bound that matches the objective at $\Gamma^{(t)}$):

$$\begin{aligned} \widetilde{\mathcal{F}}_t(\Gamma)\;\coloneqq\;{}&\sum_{i=1}^{N}\sum_{k=1}^{K}\gamma_{ik}\log\!\big(\alpha_k f_{ik}(\Theta^{(t)})\big)+\sum_{i=1}^{N}H(\gamma_i)\\ &+R(\boldsymbol{\pi}^{(t)})+\big\langle\nabla R(\boldsymbol{\pi}^{(t)}),\,\boldsymbol{\pi}(\Gamma)-\boldsymbol{\pi}^{(t)}\big\rangle-\frac{\lambda}{2}\big\|\boldsymbol{\pi}(\Gamma)-\boldsymbol{\pi}^{(t)}\big\|_2^2. \end{aligned} \tag{10}$$

We then update assignments by maximizing this surrogate:

$$\Gamma^{(t+1)}\;\in\;\arg\max_{\Gamma:\;\gamma_i\in\Delta^{K-1}}\;\widetilde{\mathcal{F}}_t(\Gamma). \tag{11}$$

Monotonicity guarantee. Let $\mathcal{F}(\Theta,\Gamma)$ denote the objective in Eq. ([3](https://arxiv.org/html/2605.26121#S3.E3)). By construction, $\widetilde{\mathcal{F}}_t(\Gamma)\leq\mathcal{F}(\Theta^{(t)},\Gamma)$ for all $\Gamma$, and $\widetilde{\mathcal{F}}_t(\Gamma^{(t)})=\mathcal{F}(\Theta^{(t)},\Gamma^{(t)})$. Therefore, maximizing the surrogate yields a monotonic ascent:

$$\mathcal{F}(\Theta^{(t)},\Gamma^{(t+1)})\;\geq\;\mathcal{F}(\Theta^{(t)},\Gamma^{(t)}). \tag{12}$$

A complete monotonic convergence statement for the full GEM iterations is provided in Appendix [A](https://arxiv.org/html/2605.26121#A1).

Practical solver for the MM E-step. The surrogate in Eq. ([10](https://arxiv.org/html/2605.26121#S3.E10)) is a concave objective over $\{\gamma_i\in\Delta^{K-1}\}$ (entropy plus a concave quadratic in $\boldsymbol{\pi}(\Gamma)$), and can be efficiently optimized with a few steps of projected (mirror) ascent. In practice, we solve Eq. ([11](https://arxiv.org/html/2605.26121#S3.E11)) approximately to a prescribed tolerance; any update that increases $\widetilde{\mathcal{F}}_t$ preserves the ascent property in Eq. ([12](https://arxiv.org/html/2605.26121#S3.E12)).
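One possible realization of the "projected (mirror) ascent" step is entropic mirror ascent on the rows of $\Gamma$, sketched below with NumPy. Two points are assumptions of this sketch rather than statements from the paper. First, because $R$ is an exact quadratic, the right-hand side of Eq. (9) equals $R(\boldsymbol{\pi})$, so the code evaluates the objective of Eq. (3) at fixed $\Theta$ directly instead of assembling the surrogate term by term. Second, the step size $\eta = 1/(1+\lambda/N)$ is a conservative choice derived from a relative-smoothness argument; the paper does not state a step size. The function names and the input `log_lik`, with entries $\log(\alpha_k f_{ik}(\Theta^{(t)}))$, are hypothetical.

```python
import numpy as np

def objective(log_lik, gamma, lam):
    """Objective of Eq. (3) at fixed Theta, with log_lik[i, k] = log(alpha_k f_ik)."""
    k = gamma.shape[1]
    pi = gamma.mean(axis=0)
    entropy = -np.sum(gamma * np.log(np.clip(gamma, 1e-300, None)))
    return float(np.sum(gamma * log_lik) + entropy - 0.5 * lam * np.sum((pi - 1.0 / k) ** 2))

def mirror_ascent_e_step(log_lik, gamma, lam, n_steps=20):
    """Entropic mirror ascent on the soft assignments (each row stays on the simplex)."""
    n, k = gamma.shape
    eta = 1.0 / (1.0 + lam / n)  # assumed step size, not taken from the paper
    for _ in range(n_steps):
        pi = gamma.mean(axis=0)
        # Gradient of the objective w.r.t. gamma_ik, up to per-row constants:
        # log_lik_ik - log(gamma_ik) - (lam / n) * (pi_k - 1/k).
        target = log_lik - (lam / n) * (pi - 1.0 / k)
        logits = (1.0 - eta) * np.log(np.clip(gamma, 1e-300, None)) + eta * target
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        gamma = np.exp(logits)
        gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma

rng = np.random.default_rng(0)
n, k = 200, 5
log_lik = rng.normal(size=(n, k)) + np.array([2.0, 0.0, 0.0, 0.0, 0.0])  # one dominant cluster
gamma0 = np.full((n, k), 1.0 / k)
plain = mirror_ascent_e_step(log_lik, gamma0, lam=0.0, n_steps=1)        # ordinary posterior
balanced = mirror_ascent_e_step(log_lik, gamma0, lam=5.0 * n, n_steps=200)
print("cluster masses without balance:", np.round(plain.mean(axis=0), 3))
print("cluster masses with balance:   ", np.round(balanced.mean(axis=0), 3))
```

With $\lambda=0$ and $\eta=1$ a single step reproduces the usual posterior responsibilities. Note that the ELBO term grows with $N$ while the balance term does not, so in this sketch $\lambda$ is set proportional to $N$ to make the regularizer visible.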

M-step: closed-form updates on the sphere. Given $\Gamma^{(t+1)}$, update the empirical mass:

$$\pi_k^{(t+1)}=\frac{1}{N}\sum_{i=1}^{N}\gamma_{ik}^{(t+1)}. \tag{13}$$

For mean directions, the vMF maximum-likelihood update normalizes the weighted resultant vector:

$$r_k^{(t+1)}=\sum_{i=1}^{N}\gamma_{ik}^{(t+1)}x_i,\qquad \mu_k^{(t+1)}=\frac{r_k^{(t+1)}}{\|r_k^{(t+1)}\|_2+\varepsilon}, \tag{14}$$

where $\varepsilon>0$ is a small constant to avoid numerical issues (e.g., near-empty clusters with $\|r_k\|\approx 0$). For concentration parameters, let

$$\bar{R}_k^{(t+1)}\coloneqq\frac{\|r_k^{(t+1)}\|_2}{\sum_{i=1}^{N}\gamma_{ik}^{(t+1)}}\in[0,1), \tag{15}$$

and estimate $\kappa_k$ using the standard high-dimensional approximation:

$$\kappa_k^{(t+1)}\approx\frac{\bar{R}_k^{(t+1)}d-(\bar{R}_k^{(t+1)})^{3}}{1-(\bar{R}_k^{(t+1)})^{2}}. \tag{16}$$
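Eqs. (13)–(16) translate almost line by line into array operations. The following is a minimal NumPy sketch; the function name `m_step`, the default `eps`, and the clipping of $\bar{R}_k$ slightly below 1 (to keep the denominator of Eq. (16) positive in floating point) are assumptions of this sketch.

```python
import numpy as np

def m_step(x, gamma, eps=1e-12):
    """Closed-form M-step for unit-norm rows x (N x d) and soft assignments gamma (N x K).

    Returns (pi, mu, kappa) following Eqs. (13)-(16).
    """
    n, d = x.shape
    mass = gamma.sum(axis=0)                       # sum_i gamma_ik, shape (K,)
    pi = mass / n                                  # Eq. (13)
    r = gamma.T @ x                                # weighted resultant vectors, shape (K, d)
    r_norm = np.linalg.norm(r, axis=1)
    mu = r / (r_norm[:, None] + eps)               # Eq. (14)
    r_bar = r_norm / np.maximum(mass, eps)         # Eq. (15)
    r_bar = np.clip(r_bar, 0.0, 1.0 - 1e-6)        # assumed guard against division by zero
    kappa = (r_bar * d - r_bar**3) / (1.0 - r_bar**2)  # Eq. (16)
    return pi, mu, kappa

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 8))
x /= np.linalg.norm(x, axis=1, keepdims=True)
gamma = rng.dirichlet(np.ones(3), size=50)
pi, mu, kappa = m_step(x, gamma)
print(pi.sum(), np.linalg.norm(mu, axis=1), kappa)
```

A tightly clustered component has $\bar{R}_k$ close to 1 and therefore a large $\kappa_k$, while a diffuse component has $\bar{R}_k$ near 0 and $\kappa_k$ near 0.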
Summary. GEM combines (i) a spherical vMF mixture with fixed uniform generative prior $\boldsymbol{\alpha}$ and (ii) a mixing-balance regularizer on the empirical mass $\boldsymbol{\pi}(\Gamma)$, optimized via an MM-based E-step with guaranteed monotonic ascent (Eqs. ([9](https://arxiv.org/html/2605.26121#S3.E9))–([12](https://arxiv.org/html/2605.26121#S3.E12))) and closed-form M-step updates (Eqs. ([14](https://arxiv.org/html/2605.26121#S3.E14))–([16](https://arxiv.org/html/2605.26121#S3.E16))). This yields semantically coherent yet diverse clusters suitable for robust data mixing in LLM pre-training. The pseudocode of the overall process is provided in Appendix [I](https://arxiv.org/html/2605.26121#A9).

![Refer to caption](https://arxiv.org/html/2605.26121v1/x3.png) Figure 3: Visualization of the discovered latent semantic taxonomy. The area of each rectangle is proportional to the square root of the number of tokens in that domain within the pre-training corpus.

### 3.4 Interpretable Taxonomy Generation

To transform the unsupervised partition into an interpretable taxonomy for human-understandable data mixing, we employ a Geometric Influence Score (GIS) to select representative samples from each cluster, which are then summarized by LLMs to generate semantic labels. The generated taxonomy is presented in Figure [3](https://arxiv.org/html/2605.26121#S3.F3). The detailed formulation of GIS and the taxonomy generation process are provided in Appendix [B.1](https://arxiv.org/html/2605.26121#A2.SS1).

### 3.5 Scalable Deployment of GEM for Data Mixing

Direct application of GEM to trillion-token corpora is computationally prohibitive due to the cost of iterative EM-style optimization. We bridge this gap using a Teacher-Student distillation framework. In this framework, GEM serves as the “Teacher,” operating on a representative seed corpus to discover latent semantic structures. We then utilize the proposed Geometric Influence Score (GIS) to curate a high-confidence, balanced training set from these clusters. This dataset supervises a “Student” model, a lightweight linear classifier which approximates the GEM-induced partition function. This design aligns with established industry practices for processing web-scale data, where linear classifiers are preferred because they meet strict latency constraints (Wenzek et al., [2020](https://arxiv.org/html/2605.26121#bib.bib56); Soldaini et al., [2024](https://arxiv.org/html/2605.26121#bib.bib57)). By distilling the hyperspherical geometry into a fast inference model, we enable semantic partitioning over the full pre-training corpus with negligible computational overhead. Detailed procedures for the clustering, pseudo-labeling, and distillation phases are provided in Appendix [C](https://arxiv.org/html/2605.26121#A3).
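The curation step between Teacher and Student can be pictured as a per-cluster top-$m$ selection. The sketch below, assuming NumPy, shows only that balanced selection. The GIS formula lives in Appendix B.1 and is not reproduced here, so `scores` is a placeholder for whatever confidence score is used; the function name and the toy inputs are hypothetical, and training the linear Student classifier on the selected samples is left out.

```python
import numpy as np

def balanced_top_m(labels, scores, m):
    """Return indices of the m highest-scoring samples for each cluster label.

    `scores` is a stand-in for a per-sample confidence such as GIS; the same
    cap m for every cluster keeps the distillation set balanced.
    """
    keep = []
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        order = idx[np.argsort(-scores[idx], kind="stable")]
        keep.append(order[:m])
    return np.concatenate(keep)

labels = np.array([0, 0, 0, 1, 1])
scores = np.array([0.1, 0.9, 0.5, 0.3, 0.8])
print(balanced_top_m(labels, scores, m=2))
```

Clusters with fewer than `m` members contribute all of their samples, so a strict balance would additionally require upsampling or a smaller cap.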

Table 1: Main results on downstream tasks. We compare our GEM sampling strategy against baseline methods (K-Means, TOS) and existing semantic organizers (WebOrganizer) across three different pre-training frameworks. The best results within each group are highlighted in **bold**, and the second best results are highlighted by <u>underline</u>.

| Model Variation | Science QA | Commonsense Reasoning | Logic & Linguistics | Average |
|---|---|---|---|---|
| Under DoReMi | | | | |
| K-Means | 32.18 | 34.21 | 53.43 | 39.94 |
| TOS | 32.49 | 34.29 | 53.98 | 40.25 |
| WebOrganizer Topic | <u>34.68</u> | 38.26 | <u>55.35</u> | 42.76 |
| WebOrganizer Format | 34.44 | <u>38.73</u> | 55.19 | <u>42.79</u> |
| GEM (Ours) | **34.79** | **39.96** | **57.11** | **43.95** |
| Under Perf | | | | |
| K-Means | 32.48 | 35.11 | 55.52 | 41.04 |
| TOS | 32.54 | 34.51 | 53.29 | 40.11 |
| WebOrganizer Topic | 35.05 | <u>39.73</u> | 57.90 | 44.23 |
| WebOrganizer Format | <u>35.06</u> | <u>39.73</u> | <u>57.97</u> | <u>44.25</u> |
| GEM (Ours) | **35.96** | **40.43** | **57.98** | **44.79** |
| Under RegMix | | | | |
| K-Means | 31.63 | <u>34.86</u> | **55.67** | 40.72 |
| TOS | 32.18 | 34.35 | 52.40 | 39.64 |
| WebOrganizer Topic | 33.90 | 33.83 | 52.50 | 40.08 |
| WebOrganizer Format | **34.12** | 33.94 | 54.26 | <u>40.77</u> |
| GEM (Ours) | <u>34.07</u> | **35.30** | <u>54.97</u> | **41.45** |
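The arithmetic behind Table 1 can be reproduced in a few lines. The numbers below are copied from the table; the observation that the Average column matches the unweighted mean of the three category scores is inferred from those numbers and is not stated in the text. The variable and function names are illustrative.

```python
# Each row: (Science QA, Commonsense Reasoning, Logic & Linguistics, Average), from Table 1.
TABLE_1 = {
    "DoReMi": {
        "K-Means": (32.18, 34.21, 53.43, 39.94),
        "TOS": (32.49, 34.29, 53.98, 40.25),
        "WebOrganizer Topic": (34.68, 38.26, 55.35, 42.76),
        "WebOrganizer Format": (34.44, 38.73, 55.19, 42.79),
        "GEM": (34.79, 39.96, 57.11, 43.95),
    },
    "Perf": {
        "K-Means": (32.48, 35.11, 55.52, 41.04),
        "TOS": (32.54, 34.51, 53.29, 40.11),
        "WebOrganizer Topic": (35.05, 39.73, 57.90, 44.23),
        "WebOrganizer Format": (35.06, 39.73, 57.97, 44.25),
        "GEM": (35.96, 40.43, 57.98, 44.79),
    },
    "RegMix": {
        "K-Means": (31.63, 34.86, 55.67, 40.72),
        "TOS": (32.18, 34.35, 52.40, 39.64),
        "WebOrganizer Topic": (33.90, 33.83, 52.50, 40.08),
        "WebOrganizer Format": (34.12, 33.94, 54.26, 40.77),
        "GEM": (34.07, 35.30, 54.97, 41.45),
    },
}

def max_average_mismatch(table):
    """Largest gap between a reported Average and the mean of the three category scores."""
    return max(
        abs(sum(row[:3]) / 3.0 - row[3])
        for group in table.values()
        for row in group.values()
    )

def gem_gain(table):
    """GEM's Average minus the best baseline Average, per mixing strategy."""
    gains = {}
    for strategy, group in table.items():
        best_baseline = max(row[3] for name, row in group.items() if name != "GEM")
        gains[strategy] = round(group["GEM"][3] - best_baseline, 2)
    return gains

print(max_average_mismatch(TABLE_1))
print(gem_gain(TABLE_1))
```

The largest gain over the best baseline is the one under DoReMi, which is consistent with the "up to 1.2%" figure quoted in the abstract after rounding.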

## 4 Experiments

In this section, we evaluate the effectiveness of our proposed taxonomy in the context of large\-scale data mixing\. We aim to demonstrate that our classification scheme provides a more granular and semantically coherent partitioning of web data, leading to superior downstream model performance when combined with various data selection and reweighting algorithms\.

### 4\.1Experimental Settings

Datasets and Models\.To systematically investigate the effectiveness of data classification for data mixing on unlabeled web\-scale corpora, we construct our pre\-training dataset from raw CommonCrawl \(CC\)\(Raffelet al\.,[2020](https://arxiv.org/html/2605.26121#bib.bib58)\)data\. We apply a rigorous cleaning and filtering pipeline that closely follows the protocols established in RefinedWeb\(Penedoet al\.,[2023b](https://arxiv.org/html/2605.26121#bib.bib9)\), ensuring high\-quality and noise\-reduced training data\. We adopt a LLaMA\-style Transformer architecture\(Touvronet al\.,[2023a](https://arxiv.org/html/2605.26121#bib.bib10)\)with 1\.1B parameters\. The model configuration is trained with a fixed compute budget of 25 billion tokens to ensure fair and controlled comparisons across different data mixing strategies\. Details of the pre\-training hyperparameters and experimental infrastructure are provided in Appendix[D](https://arxiv.org/html/2605.26121#A4)\.

Categorization Baselines. We compare our proposed taxonomy against several established data organization paradigms: (1) TOS (Peng et al., [2025](https://arxiv.org/html/2605.26121#bib.bib11)), which employs a human-curated hierarchical semantic structure for domain partitioning; (2) WebOrganizer (Domain) and (3) WebOrganizer (Format), which leverage metadata-driven signals such as source URL domains and structural layout features, respectively (Wettig et al., [2025](https://arxiv.org/html/2605.26121#bib.bib12)); and (4) K-Means (MacQueen, [1967](https://arxiv.org/html/2605.26121#bib.bib22)), an unsupervised clustering baseline that partitions data based on document-level embeddings. Details of implementation and training are provided in Appendix [E](https://arxiv.org/html/2605.26121#A5).

Data Mixing Strategies. To evaluate the downstream utility of these taxonomies, we integrate them with the following mixing algorithms: (1) Perf, which utilizes a sensitivity-driven approach by upsampling high-impact categories identified through preliminary performance gains (Peng et al., [2025](https://arxiv.org/html/2605.26121#bib.bib11)); (2) RegMix, which formulates data mixing as a regression task to predict model performance across different distribution vectors (Liu et al., [2024](https://arxiv.org/html/2605.26121#bib.bib13)); and (3) DoReMi, which exploits a distributionally robust optimization objective to minimize the maximum excess loss via a proxy model (Xie et al., [2023](https://arxiv.org/html/2605.26121#bib.bib14)). We provide the detailed search space and computational budgets for ratio optimization in Appendix [F](https://arxiv.org/html/2605.26121#A6).

Downstream Task Evaluation. We evaluate the zero-shot generalization capabilities of our pre-trained models using the OLMES (Gu et al., [2025](https://arxiv.org/html/2605.26121#bib.bib48)) framework across nine benchmark datasets. To provide a granular analysis of model performance, we categorize these benchmarks into three core dimensions: (1) Science QA: ARC-Challenge (Clark et al., [2018](https://arxiv.org/html/2605.26121#bib.bib40)), ARC-Easy (Clark et al., [2018](https://arxiv.org/html/2605.26121#bib.bib40)), SciQ (Welbl et al., [2017](https://arxiv.org/html/2605.26121#bib.bib41)), and OpenBookQA (Mihaylov et al., [2018](https://arxiv.org/html/2605.26121#bib.bib42)). (2) Commonsense Reasoning: HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2605.26121#bib.bib44)), PIQA (Bisk et al., [2020](https://arxiv.org/html/2605.26121#bib.bib45)), and CommonsenseQA (Talmor et al., [2019](https://arxiv.org/html/2605.26121#bib.bib43)). (3) Logic & Linguistics: WinoGrande (Sakaguchi et al., [2021](https://arxiv.org/html/2605.26121#bib.bib46)) and COPA (Roemmele et al., [2011](https://arxiv.org/html/2605.26121#bib.bib47)). By adhering to the standardized evaluation matrix in OLMES (Gu et al., [2025](https://arxiv.org/html/2605.26121#bib.bib48)), we ensure consistency in prompt templates and scoring metrics.

### 4.2 Main Results

We evaluate the effectiveness of our proposed method by integrating the GEM-based taxonomy into three distinct data mixing frameworks: DoReMi, Perf, and RegMix. Table [1](https://arxiv.org/html/2605.26121#S3.T1) presents the comparative performance across three aggregated capabilities and the overall average.

Downstream Performance Analysis. The results in Table [1](https://arxiv.org/html/2605.26121#S3.T1) show that GEM achieves the best average accuracy under all three data mixing frameworks. Under the DoReMi framework, GEM exhibits the most pronounced improvements, surpassing the strongest baseline, WebOrganizer, by 1.23 and 1.76 percentage points on Commonsense Reasoning and Logic & Linguistics, respectively. This trend persists within the Perf framework, where GEM attains the best results in Science QA and Commonsense Reasoning and narrowly leads in Logic & Linguistics (57.98% versus 57.97% for the runner-up). In the RegMix setting, the K-Means baseline holds an advantage in Logic & Linguistics (55.67% versus 54.97%) and WebOrganizer Format is marginally ahead in Science QA (34.12% versus 34.07%), but GEM leads in Commonsense Reasoning and attains the best average (41.45%). Collectively, these results underscore the robustness of GEM: its taxonomy improves downstream accuracy regardless of the underlying mixing strategy.
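The margins quoted above can be recomputed directly from the DoReMi block of Table 1. The short sketch below does so; it also checks the observation (inferred from the numbers, not stated in the paper) that the Average column matches the unweighted mean of the three category scores. The helper names are illustrative only.

```python
# DoReMi block of Table 1:
# (Science QA, Commonsense Reasoning, Logic & Linguistics, Average)
doremi = {
    "K-Means": (32.18, 34.21, 53.43, 39.94),
    "TOS": (32.49, 34.29, 53.98, 40.25),
    "WebOrganizer Topic": (34.68, 38.26, 55.35, 42.76),
    "WebOrganizer Format": (34.44, 38.73, 55.19, 42.79),
    "GEM (Ours)": (34.79, 39.96, 57.11, 43.95),
}


def mean_of_categories(row):
    """Unweighted mean of the three category scores, rounded like the table."""
    return round(sum(row[:3]) / 3, 2)


def margin_over_best_baseline(table, column, ours="GEM (Ours)"):
    """Gap in percentage points between `ours` and the best other row."""
    best_other = max(v[column] for k, v in table.items() if k != ours)
    return round(table[ours][column] - best_other, 2)


# Recomputed averages, one per method.
recomputed = {name: mean_of_categories(row) for name, row in doremi.items()}

# Margins for Science QA, Commonsense, Logic & Linguistics, Average.
margins = [margin_over_best_baseline(doremi, c) for c in range(4)]
```

Under these assumptions the average margin comes out at 1.16 points, which is in line with the "up to 1.2%" figure quoted in the abstract.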

![Refer to caption](https://arxiv.org/html/2605.26121v1/x4.png)

Figure 4: Downstream loss predictability across taxonomies. Violin plots show per-subcategory Spearman correlation ($\rho$) between RegMix and ground-truth validation loss, averaged over 10 splits. Dots: subcategories; boxes: median/IQR; whiskers: $1.5\times$ IQR.

Evaluating Taxonomy Quality via Data Mixing Predictability. A robust taxonomy for data mixing must yield well-conditioned mixing coordinates, where small adjustments to mixture weights induce consistent and predictable shifts in validation loss. We quantify this property using RegMix (Liu et al., [2024](https://arxiv.org/html/2605.26121#bib.bib13)) as a *predictability probe*. Specifically, we measure the Spearman rank correlation ($\rho$) between the ground-truth validation loss and the loss predicted by RegMix on held-out mixture vectors (implementation details in Appendix [G](https://arxiv.org/html/2605.26121#A7)). To reduce variance due to a particular split of mixture vectors, we repeat the procedure over 10 independent train/test splits and report, for each sub-taxonomy unit, the average $\rho$ across splits. As illustrated in Figure [4](https://arxiv.org/html/2605.26121#S4.F4), the choice of taxonomy significantly impacts predictability. Baselines such as K-Means and WebOrganizer exhibit lower and more dispersed $\rho$ values, indicating unstable mixing dynamics dominated by redundant or entangled factors. In contrast, GEM demonstrates superior predictability, achieving consistently higher $\rho$ with a notably tighter distribution. This reduction in variance suggests that GEM induces a taxonomy with minimized factor entanglement and a smoother optimization landscape, thereby enabling more sample-efficient mixture search and reliable control over data composition during pre-training.
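The probe described above reduces to a rank correlation averaged over splits. The following is a minimal, standard-library sketch of that final step only: it assumes the predicted losses have already been produced by some fitted regressor (the RegMix fitting itself is not reproduced here), and the loss values in `splits` are made up purely for illustration.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman(pred, true):
    """Spearman rho = Pearson correlation of the rank vectors."""
    return pearson(ranks(pred), ranks(true))


def mean_rho_over_splits(splits):
    """splits: list of (predicted_losses, true_losses), one per held-out split."""
    return mean(spearman(p, t) for p, t in splits)


# Hypothetical held-out losses for one sub-taxonomy unit over two splits.
splits = [
    ([2.91, 3.05, 2.98, 3.20], [2.90, 3.10, 3.00, 3.15]),
    ([3.00, 2.95, 3.10, 3.05], [2.97, 3.01, 3.12, 3.04]),
]
avg_rho = mean_rho_over_splits(splits)
```

Because only the ordering of held-out mixtures matters, a regressor that is miscalibrated in absolute loss but preserves the ranking would still score well on this probe.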

![Refer to caption](https://arxiv.org/html/2605.26121v1/x5.png)

Figure 5: Sensitivity to the number of clusters $K$. Accuracy (%) on Science QA, Commonsense Reasoning, Logic & Linguistics, and Average as $K$ varies from 12 to 48 (GEM + Perf).

Impact of Cluster Granularity. In our primary experiments, we set the number of clusters to $K=24$ to ensure a controlled comparison across all baselines. To further investigate the optimal granularity for the GIS module, we conduct a sensitivity analysis by varying $K$ from 12 to 48. As illustrated in Figure [5](https://arxiv.org/html/2605.26121#S4.F5), the model performance manifests a clear dependence on cluster density. Specifically, we observe a consistent performance trajectory that peaks at $K=36$, achieving an Average score of 41.21%. This trend suggests that increasing cluster granularity facilitates the capture of more refined latent semantic patterns. However, performance begins to plateau or slightly degrade as $K$ reaches 48. This marginal decline is likely attributable to the over-fragmentation of the embedding space, where excessive partitioning may introduce stochastic noise or impede the model's ability to derive robust, generalized representations.

![Refer to caption](https://arxiv.org/html/2605.26121v1/x6.png)

Figure 6: Ablation study on clustering mechanisms. The results highlight the improvements gained by shifting from Euclidean to Riemannian geometry and by incorporating the regularizer.

### 4.3 Ablation Study

To rigorously assess the contribution of each component within our framework, we perform an ablation study comparing GEM against three baseline variants: K-Means, Spherical K-Means (hyperspherical geometry with hard assignments), and Vanilla vMF (Riemannian geometry without regularizer). As visualized in Figure [6](https://arxiv.org/html/2605.26121#S4.F6), the results exhibit a monotonic performance improvement that aligns with the theoretical fidelity of the clustering objective. The standard K-Means baseline yields the lowest average accuracy of 38.5%, corroborating the limitations of Euclidean metrics in high-dimensional embedding spaces. Transitioning to Spherical K-Means increases the average accuracy to 40.6%. Vanilla vMF further elevates performance to 41.3% by leveraging probabilistic soft assignments to capture semantic nuances. Crucially, the proposed GEM framework achieves the highest average accuracy of 42.1%, with notable gains in Science QA at 36.0% and Logic & Linguistics at 57.0%. These findings confirm that the gradient-consistent entropy constraint is essential for mitigating anisotropic cluster collapse, thereby ensuring a semantically balanced taxonomy that facilitates robust downstream generalization.
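The step from Spherical K-Means to a vMF mixture is the step from hard to soft assignments on the unit sphere. The sketch below contrasts the two for a single point. It is a simplified illustration, not the paper's E-step: it assumes a shared concentration `kappa` and equal mixing weights (so the vMF normalizer cancels), and it omits GEM's balance regularizer entirely.

```python
import math


def normalize(v):
    """Project a vector onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hard_assign(x, mus):
    """Spherical K-Means style: index of the most cosine-similar mean direction."""
    sims = [dot(x, mu) for mu in mus]
    return max(range(len(mus)), key=lambda k: sims[k])


def soft_assign(x, mus, kappa):
    """vMF-style responsibilities under the simplifying assumptions above:
    gamma_k is proportional to exp(kappa * mu_k^T x)."""
    logits = [kappa * dot(x, mu) for mu in mus]
    m = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [wi / z for wi in w]


# Two orthogonal mean directions and a point lying between them.
mus = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
x = normalize([1.0, 0.9])

label = hard_assign(x, mus)             # a single index
gammas = soft_assign(x, mus, kappa=10)  # a distribution over clusters
```

For a point near the boundary, the hard assignment discards the fact that the second cluster is almost as close, while the soft responsibilities retain it; larger `kappa` values push the soft assignment back toward the hard one.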

![Refer to caption](https://arxiv.org/html/2605.26121v1/x7.png)

Figure 7: t-SNE visualization of GEM-induced clusters (3D). Colors denote the 24 taxonomy topics generated by GIS-based sampling and LLM summarization.

## 5 Discussion

A key message of GEM is that "better" categorization for LLM data mixing should be evaluated not only by human interpretability, but also by whether the induced coordinates make mixture search controllable and predictable. Our RegMix-based predictability probe (Figure [4](https://arxiv.org/html/2605.26121#S4.F4)) indicates that GEM yields a better-conditioned mixing simplex, in which small weight perturbations produce more consistent loss orderings, suggesting reduced factor entanglement relative to Euclidean heuristics. Although GEM clusters are unlabeled, GIS-guided representative selection makes interpretability a practical byproduct rather than a fragile post-hoc step: the t-SNE visualization (Figure [7](https://arxiv.org/html/2605.26121#S4.F7)) shows multiple discernible local neighborhoods alongside overlap in dense regions, reflecting genuine semantic proximity among discovered topics rather than mere labeling noise.

Several avenues remain for future exploration. First, extending GEM to multi-trillion-token regimes and diverse model architectures is needed to further validate its scaling properties. Second, investigating GIS as a standalone data curation primitive can clarify its efficacy in prioritizing high-signal samples beyond mere partitioning. Finally, exploring the co-optimization of taxonomy discovery and downstream objectives offers a compelling direction; such bi-level optimization frameworks could enable a self-improving cycle where the semantic basis and model performance mutually adapt to enhance task-specific generalization.

## 6 Conclusion

We presented GEM (Geometric Entropy Mixing), a geometry-aware framework for unsupervised data categorization that aligns semantic partitioning with hyperspherical embedding structures. By combining a vMF mixture model with an explicit mixing-balance regularizer and an MM-based inference scheme with monotonic ascent, GEM mitigates anisotropy-induced cluster collapse and yields semantically coherent yet balanced partitions. To scale to web-scale corpora, we distill GEM's decisions into a lightweight classifier and employ GIS-guided sampling for interpretable taxonomies. Extensive experiments with 1.1B-parameter models show that GEM consistently improves downstream performance across multiple mixing frameworks. GEM offers a principled foundation for future work on jointly optimizing taxonomy discovery and data mixture learning.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

- A. Abbas, K. Tirumala, D. Simig, S. Ganguli, and A. S. Morcos (2023) Semdedup: data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p2.1).
- A. Banerjee, I. S. Dhillon, J. Ghosh, S. Sra, and G. Ridgeway (2005) Clustering on the unit hypersphere using von mises-fisher distributions. Journal of Machine Learning Research 6(9). Cited by: [§3.2](https://arxiv.org/html/2605.26121#S3.SS2.p1.4).
- Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al. (2020) Piqa: reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34, pp. 7432–7439. Cited by: [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p4.1).
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in neural information processing systems 33, pp. 1877–1901. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p2.1), [§2](https://arxiv.org/html/2605.26121#S2.p1.1).
- M. F. Chen, M. Y. Hu, N. Lourie, K. Cho, and C. Ré (2024) Aioli: a unified optimization framework for language model data mixing. arXiv preprint arXiv:2411.05735. Cited by: [§2](https://arxiv.org/html/2605.26121#S2.p1.1).
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457. Cited by: [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p4.1).
- S. Diao, Y. Yang, Y. Fu, X. Dong, D. Su, M. Kliegl, Z. Chen, P. Belcak, Y. Suhara, H. Yin, et al. (2025) Climb: clustering-based iterative data mixture bootstrapping for language model pre-training. arXiv preprint arXiv:2504.13161. Cited by: [§2](https://arxiv.org/html/2605.26121#S2.p2.1).
- K. Ethayarajh (2019) How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings. arXiv preprint arXiv:1909.00512. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p3.1).
- S. Fan, M. Pagliardini, and M. Jaggi (2023) Doge: domain reweighting with generalization estimation. arXiv preprint arXiv:2310.15393. Cited by: [§2](https://arxiv.org/html/2605.26121#S2.p1.1).
- T. Gao, X. Yao, and D. Chen (2021) Simcse: simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p3.1).
- Y. Gu, O. Tafjord, B. Kuehl, D. Haddad, J. Dodge, and H. Hajishirzi (2025) Olmes: a standard for language model evaluations. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 5005–5033. Cited by: [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p4.1).
- S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi, et al. (2023) Textbooks are all you need. arXiv preprint arXiv:2306.11644. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p1.1).
- J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. (2022) Training compute-optimal large language models. arXiv preprint arXiv:2203.15556. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p1.1).
- A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017) Bag of tricks for efficient text classification. In Proceedings of the 15th conference of the European chapter of the association for computational linguistics: volume 2, short papers, pp. 427–431. Cited by: [Appendix C](https://arxiv.org/html/2605.26121#A3.p3.1), [Appendix E](https://arxiv.org/html/2605.26121#A5.p1.2).
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p1.1).
- M. Ledoux (2001) The concentration of measure phenomenon. American Mathematical Soc. Cited by: [Lemma 3.1](https://arxiv.org/html/2605.26121#S3.Thmtheorem1).
- B. Li, H. Zhou, J. He, M. Wang, Y. Yang, and L. Li (2020) On the sentence embeddings from pre-trained language models. arXiv preprint arXiv:2011.05864. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p3.1).
- F. Liu, W. Zhou, B. Liu, Z. Yu, Y. Zhang, H. Lin, Y. Yu, B. Zhang, X. Zhou, T. Wang, et al. (2025) Quadmix: quality-diversity balanced data selection for efficient llm pretraining. arXiv preprint arXiv:2504.16511. Cited by: [§2](https://arxiv.org/html/2605.26121#S2.p1.1).
- Q. Liu, X. Zheng, N. Muennighoff, G. Zeng, L. Dou, T. Pang, J. Jiang, and M. Lin (2024) Regmix: data mixture as regression for language model pre-training. arXiv preprint arXiv:2407.01492. Cited by: [§2](https://arxiv.org/html/2605.26121#S2.p1.1), [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p3.1), [§4.2](https://arxiv.org/html/2605.26121#S4.SS2.p3.5).
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p3.1).
- J. MacQueen (1967) Multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281–297. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p3.1), [§2](https://arxiv.org/html/2605.26121#S2.p2.1), [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p2.1).
- P. Maini, S. Seto, R. Bai, D. Grangier, Y. Zhang, and N. Jaitly (2024) Rephrasing the web: a recipe for compute and data-efficient language modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14044–14072. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p2.1).
- L. McInnes, J. Healy, S. Astels, et al. (2017) Hdbscan: hierarchical density based clustering. J. Open Source Softw. 2(11), pp. 205. Cited by: [§2](https://arxiv.org/html/2605.26121#S2.p2.1).
- T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018) Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789. Cited by: [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p4.1).
- G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay (2023a) The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p1.1).
- G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay (2023b) The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116. Cited by: [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p1.1).
- J. Peng, X. Zhuang, J. Qiu, R. Ma, J. Yu, H. Zhu, and C. He (2025) Topic over source: the key to effective data mixing for language models pre-training. External Links: 2502.16802, [Link](https://arxiv.org/abs/2502.16802). Cited by: [Appendix F](https://arxiv.org/html/2605.26121#A6.p2.1), [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p2.1), [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p3.1).
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research 21(140), pp. 1–67. Cited by: [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p1.1).
- M. Roemmele, C. A. Bejan, and A. S. Gordon (2011) Choice of plausible alternatives: an evaluation of commonsense causal reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, pp. 90–95. Cited by: [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p4.1).
- K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2021) Winogrande: an adversarial winograd schema challenge at scale. Communications of the ACM 64(9), pp. 99–106. Cited by: [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p4.1).
- L. Soldaini, R. Kinney, A. Bhagia, D. Schwenk, D. Atkinson, R. Authur, B. Bogin, K. Chandu, J. Dumas, Y. Elazar, et al. (2024) Dolma: an open corpus of three trillion tokens for language model pretraining research. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: long papers), pp. 15725–15788. Cited by: [§3.5](https://arxiv.org/html/2605.26121#S3.SS5.p1.1).
- A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) Commonsenseqa: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158. Cited by: [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p4.1).
- I. Team (2023) Internlm: a multilingual language model with progressively enhanced capabilities. Cited by: [§2](https://arxiv.org/html/2605.26121#S2.p1.1).
- H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. (2023a) Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971. Cited by: [§2](https://arxiv.org/html/2605.26121#S2.p1.1), [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p1.1).
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023b) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p2.1).
- M. Wan, T. Safavi, S. K. Jauhar, Y. Kim, S. Counts, J. Neville, S. Suri, C. Shah, R. W. White, L. Yang, et al. (2024) Tnt-llm: text mining at scale with large language models. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pp. 5836–5847. Cited by: [§2](https://arxiv.org/html/2605.26121#S2.p2.1).
- Y. Wang, B. Liu, F. Liu, Y. Guo, J. Deng, X. Wu, W. Zhou, X. Zhou, and T. Wang (2025) TiKMiX: take data influence into dynamic mixture for language model pre-training. arXiv preprint arXiv:2508.17677. Cited by: [§2](https://arxiv.org/html/2605.26121#S2.p1.1).
- J. Welbl, N. F. Liu, and M. Gardner (2017) Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209. Cited by: [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p4.1).
- G. Wenzek, M. Lachaux, A. Conneau, V. Chaudhary, F. Guzmán, A. Joulin, and E. Grave (2020) CCNet: extracting high quality monolingual datasets from web crawl data. In Proceedings of the twelfth language resources and evaluation conference, pp. 4003–4012. Cited by: [§3.5](https://arxiv.org/html/2605.26121#S3.SS5.p1.1).
- A. Wettig, K. Lo, S. Min, H. Hajishirzi, D. Chen, and L. Soldaini (2025) Organize the web: constructing domains enhances pre-training data curation. arXiv preprint arXiv:2502.10341. Cited by: [§2](https://arxiv.org/html/2605.26121#S2.p2.1), [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p2.1).
- X. Xi, D. Kong, J. Yang, J. Yang, Z. Chen, W. Wang, J. Wang, X. Cai, S. Zhang, and W. Ye (2025) SampleMix: a sample-wise pre-training data mixing strategey by coordinating data quality and diversity. arXiv preprint arXiv:2503.01506. Cited by: [§2](https://arxiv.org/html/2605.26121#S2.p1.1).
- S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, D. Lian, and J. Nie (2024) C-pack: packed resources for general chinese embeddings. In Proceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, pp. 641–649. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p3.1).
- S. M. Xie, H. Pham, X. Dong, N. Du, H. Liu, Y. Lu, P. S. Liang, Q. V. Le, T. Ma, and A. W. Yu (2023) Doremi: optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems 36, pp. 69798–69818. Cited by: [§2](https://arxiv.org/html/2605.26121#S2.p1.1), [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p3.1).
- J. Ye, P. Liu, T. Sun, J. Zhan, Y. Zhou, and X. Qiu (2024) Data mixing laws: optimizing data mixtures by predicting language modeling performance. arXiv preprint arXiv:2403.16952. Cited by: [§1](https://arxiv.org/html/2605.26121#S1.p1.1).
- R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) Hellaswag: can a machine really finish your sentence? arXiv preprint arXiv:1905.07830. Cited by: [§4.1](https://arxiv.org/html/2605.26121#S4.SS1.p4.1).

## Appendix A Monotonic Convergence of GEM

Monotonic ascent guarantee\.

###### Theorem A.1 (Monotonic convergence of GEM).

Let $\mathcal{F}(\Theta,\Gamma)$ denote the GEM objective in Eq. ([3](https://arxiv.org/html/2605.26121#S3.E3)). Fix an iteration $t$ and construct the MM surrogate $\widetilde{\mathcal{F}}_{t}(\Gamma)$ as in Eq. ([10](https://arxiv.org/html/2605.26121#S3.E10)) using $\boldsymbol{\pi}^{(t)}\coloneqq\boldsymbol{\pi}(\Gamma^{(t)})$. Assume the E-step returns $\Gamma^{(t+1)}$ such that

$$\widetilde{\mathcal{F}}_{t}(\Gamma^{(t+1)})\;\geq\;\widetilde{\mathcal{F}}_{t}(\Gamma^{(t)}),\tag{17}$$

and the M-step returns $\Theta^{(t+1)}$ such that

$$\mathcal{F}(\Theta^{(t+1)},\Gamma^{(t+1)})\;\geq\;\mathcal{F}(\Theta^{(t)},\Gamma^{(t+1)}).\tag{18}$$

Then GEM produces a monotone non-decreasing sequence of objective values:

$$\mathcal{F}(\Theta^{(t+1)},\Gamma^{(t+1)})\;\geq\;\mathcal{F}(\Theta^{(t)},\Gamma^{(t)}),\qquad\forall\,t\geq 0.\tag{19}$$

Consequently, if $\mathcal{F}$ is upper bounded on the feasible set, the sequence $\{\mathcal{F}(\Theta^{(t)},\Gamma^{(t)})\}_{t\geq 0}$ converges to a finite limit.

###### Proof\.

We first show that $\widetilde{\mathcal{F}}_{t}$ is a valid minorizer of $\Gamma\mapsto\mathcal{F}(\Theta^{(t)},\Gamma)$. Let $R(\boldsymbol{\pi})\coloneqq-\tfrac{\lambda}{2}\|\boldsymbol{\pi}-\mathbf{u}\|_{2}^{2}$. By Proposition [3.2](https://arxiv.org/html/2605.26121#S3.Thmtheorem2), $R$ is concave and $\lambda$-smooth, hence it satisfies the global quadratic minorization inequality in Eq. ([9](https://arxiv.org/html/2605.26121#S3.E9)), i.e., for all $\boldsymbol{\pi}\in\Delta^{K-1}$,

$$R(\boldsymbol{\pi})\;\geq\;R(\boldsymbol{\pi}^{(t)})+\big\langle\nabla R(\boldsymbol{\pi}^{(t)}),\,\boldsymbol{\pi}-\boldsymbol{\pi}^{(t)}\big\rangle-\frac{\lambda}{2}\big\|\boldsymbol{\pi}-\boldsymbol{\pi}^{(t)}\big\|_{2}^{2}.$$

Substituting $\boldsymbol{\pi}=\boldsymbol{\pi}(\Gamma)$ yields, for all feasible $\Gamma$,

$$\widetilde{\mathcal{F}}_{t}(\Gamma)\;\leq\;\mathcal{F}(\Theta^{(t)},\Gamma),\qquad\widetilde{\mathcal{F}}_{t}(\Gamma^{(t)})\;=\;\mathcal{F}(\Theta^{(t)},\Gamma^{(t)}),\tag{20}$$

where the equality follows because $\boldsymbol{\pi}(\Gamma^{(t)})=\boldsymbol{\pi}^{(t)}$ makes Eq. ([9](https://arxiv.org/html/2605.26121#S3.E9)) tight.

Using Eq. ([20](https://arxiv.org/html/2605.26121#A1.E20)) and the E-step condition ([17](https://arxiv.org/html/2605.26121#A1.E17)), we obtain

$$\mathcal{F}(\Theta^{(t)},\Gamma^{(t+1)})\;\geq\;\widetilde{\mathcal{F}}_{t}(\Gamma^{(t+1)})\;\geq\;\widetilde{\mathcal{F}}_{t}(\Gamma^{(t)})\;=\;\mathcal{F}(\Theta^{(t)},\Gamma^{(t)}).$$

Finally, applying the M-step condition ([18](https://arxiv.org/html/2605.26121#A1.E18)) gives

$$\mathcal{F}(\Theta^{(t+1)},\Gamma^{(t+1)})\;\geq\;\mathcal{F}(\Theta^{(t)},\Gamma^{(t+1)})\;\geq\;\mathcal{F}(\Theta^{(t)},\Gamma^{(t)}),$$

which proves Eq. ([19](https://arxiv.org/html/2605.26121#A1.E19)). If $\mathcal{F}$ is upper bounded on the feasible set, any monotone non-decreasing sequence $\{\mathcal{F}(\Theta^{(t)},\Gamma^{(t)})\}_{t\geq 0}$ converges to a finite limit. ∎
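As a sanity check on the minorization step, one can expand the right-hand side of the quadratic bound directly, using only the definition of $R$ given in the proof. The shorthand $a$ and $b$ below is introduced here for brevity and does not appear in the paper; note that $\nabla R(\boldsymbol{\pi}^{(t)}) = -\lambda a$.

```latex
% Shorthand (not from the paper): a = \pi^{(t)} - u,  b = \pi - \pi^{(t)}
\begin{aligned}
R(\boldsymbol{\pi}^{(t)})
  + \big\langle \nabla R(\boldsymbol{\pi}^{(t)}),\, b \big\rangle
  - \tfrac{\lambda}{2}\|b\|_2^2
&= -\tfrac{\lambda}{2}\|a\|_2^2 - \lambda\langle a, b\rangle - \tfrac{\lambda}{2}\|b\|_2^2 \\
&= -\tfrac{\lambda}{2}\|a + b\|_2^2 \\
&= -\tfrac{\lambda}{2}\|\boldsymbol{\pi} - \mathbf{u}\|_2^2
 \;=\; R(\boldsymbol{\pi}).
\end{aligned}
```

For this particular quadratic regularizer the bound therefore holds with equality for every $\boldsymbol{\pi}$, which is consistent with the inequality used in the proof and with its tightness at $\boldsymbol{\pi}^{(t)}$. How the surrogate in Eq. (10) uses these terms is defined in the main text and is not rederived here.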

## Appendix B Interpretability Analysis: GIS and Taxonomy Details

### B.1 Interpretability Sampling via GIS

Unsupervised clustering yields a partition function $x_{i}\mapsto\arg\max_{k}\gamma_{ik}$, but the resulting cluster indices are not directly human-interpretable. To enable domain-level control in downstream data mixing (e.g., "increase Math by 10%"), we convert each cluster into a semantic label by prompting an LLM with a small set of *representative* samples from that cluster. The key challenge is that naive random sampling often selects either boundary points (ambiguous assignments) or geometrically central but semantically isolated points under anisotropic embedding distributions. We therefore propose a principled selection criterion, the Geometric Influence Score (GIS), that is consistent with the GEM objective in Sec. [3](https://arxiv.org/html/2605.26121#S3).

##### Definition\.

LetΓ\\GammaandΘ=\{\(μk,κk\)\}k=1K\\Theta=\\\{\(\\mu\_\{k\},\\kappa\_\{k\}\)\\\}\_\{k=1\}^\{K\}be the learned GEM assignments and vMF parameters \(Sec\.[3\.2](https://arxiv.org/html/2605.26121#S3.SS2)–[3\.3](https://arxiv.org/html/2605.26121#S3.SS3)\)\. For a samplexix\_\{i\}and clusterkk, define the*cluster\-conditional*GIS as

GISk​\(xi\)≔log⁡\(γi​k\+ε\)⏟certainty\+log⁡fvMF​\(xi∣μk,κk\)⏟directional coherence\+β​log⁡\(ρk​\(xi\)\+ε\)⏟local support,\\mathrm\{GIS\}\_\{k\}\(x\_\{i\}\)~\\coloneqq~\\underbrace\{\\log\\\!\\big\(\\gamma\_\{ik\}\+\\varepsilon\\big\)\}\_\{\\text\{certainty\}\}~\+~\\underbrace\{\\log f\_\{\\mathrm\{vMF\}\}\(x\_\{i\}\\mid\\mu\_\{k\},\\kappa\_\{k\}\)\}\_\{\\text\{directional coherence\}\}~\+~\\underbrace\{\\beta\\log\\\!\\big\(\\rho\_\{k\}\(x\_\{i\}\)\+\\varepsilon\\big\)\}\_\{\\text\{local support\}\},\(21\)whereε\>0\\varepsilon\>0is a small constant for numerical stability andβ≥0\\beta\\geq 0controls the strength of the density term\.

Using the vMF likelihood defined in Sec\.[3\.2](https://arxiv.org/html/2605.26121#S3.SS2),

log⁡fvMF​\(xi∣μk,κk\)=log⁡Cd​\(κk\)\+κk​μk⊤​xi\.\\log f\_\{\\mathrm\{vMF\}\}\(x\_\{i\}\\mid\\mu\_\{k\},\\kappa\_\{k\}\)~=~\\log C\_\{d\}\(\\kappa\_\{k\}\)\+\\kappa\_\{k\}\\mu\_\{k\}^\{\\top\}x\_\{i\}\.\(22\)Sincelog⁡Cd​\(κk\)\\log C\_\{d\}\(\\kappa\_\{k\}\)is independent ofiifor fixedkk, ranking samples*within the same cluster*is equivalent to usingκk​μk⊤​xi\\kappa\_\{k\}\\mu\_\{k\}^\{\\top\}x\_\{i\}; we keep the fulllog⁡fvMF\\log f\_\{\\mathrm\{vMF\}\}in Eq\. \([21](https://arxiv.org/html/2605.26121#A2.E21)\) to make the probabilistic meaning explicit and to remain fully consistent with the generative model\.

To penalize semantically isolated points, we estimate a cluster-aware local density by restricting neighbors to the same cluster:

$$
\rho_k(x_i) \coloneqq \frac{1}{M} \sum_{x_j \in \mathcal{N}^{(k)}_{M}(x_i)} x_i^{\top} x_j, \qquad \mathcal{N}^{(k)}_{M}(x_i) \subseteq \{x_j : \arg\max_{\ell} \gamma_{j\ell} = k\}, \tag{23}
$$

where $\mathcal{N}^{(k)}_{M}(x_i)$ denotes the $M$ nearest neighbors of $x_i$ *within cluster $k$* under cosine similarity (equivalently Euclidean distance on $\mathcal{S}^{d-1}$). This cluster-restricted density avoids cross-cluster contamination and directly measures whether $x_i$ lies in a densely populated region of the learned semantic direction.

For each cluster $k$, we select representatives by ranking points assigned to $k$ using $\mathrm{GIS}_k(x_i)$ and taking the top-$S$ samples:

$$
\mathcal{R}_k \coloneqq \operatorname{TopS}\Big(\{\mathrm{GIS}_k(x_i)\}_{i : \arg\max_{\ell} \gamma_{i\ell} = k}\Big). \tag{24}
$$

The set $\mathcal{R}_k$ is then used as few-shot context to prompt an LLM for a concise semantic label, yielding an interpretable taxonomy consistent with the GEM-induced partition.
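The selection in Eqs. (21)–(24) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `gis_representatives` and its default values are hypothetical, the vMF term uses the ranking-equivalent form $\kappa_k \mu_k^{\top} x_i$ (dropping the per-cluster constant $\log C_d(\kappa_k)$), each point is excluded from its own neighborhood, and negative mean similarities are clipped to zero before the logarithm. The last two choices are assumptions that the text does not specify.

```python
import numpy as np

def gis_representatives(X, gamma, mu, kappa, k, M=10, S=5, beta=1.0, eps=1e-8):
    """Return (indices, scores) of the top-S GIS-ranked points hard-assigned to cluster k.

    X: (N, d) unit-norm embeddings; gamma: (N, K) soft assignments;
    mu: (K, d) unit mean directions; kappa: (K,) concentrations.
    """
    members = np.flatnonzero(gamma.argmax(axis=1) == k)
    if members.size == 0:
        return members, np.empty(0)
    Xk = X[members]
    # Certainty term of Eq. (21): log(gamma_ik + eps).
    certainty = np.log(gamma[members, k] + eps)
    # Directional coherence, up to the constant log C_d(kappa_k) (Eq. 22).
    coherence = kappa[k] * (Xk @ mu[k])
    # Local support (Eq. 23): mean cosine similarity to the M nearest
    # neighbours inside the cluster (assumption: the point itself is excluded).
    m = min(M, members.size - 1)
    if m > 0:
        sims = Xk @ Xk.T
        np.fill_diagonal(sims, -np.inf)
        rho = np.sort(sims, axis=1)[:, -m:].mean(axis=1)
    else:
        rho = np.zeros(members.size)
    # Assumption: clip at zero so the logarithm stays defined.
    support = beta * np.log(np.clip(rho, 0.0, None) + eps)
    scores = certainty + coherence + support
    order = np.argsort(-scores)[:S]
    return members[order], scores[order]
```

The dense within-cluster similarity matrix costs memory quadratic in the cluster size, so for large clusters an approximate nearest-neighbor index would be the more practical choice.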

### B.2 GIS Taxonomy Details

#### B.2.1 Taxonomy Description

Table 2 provides a comprehensive overview of the semantic taxonomies identified by our GEM framework. Each taxonomy represents a distinct semantic cluster discovered through our entropy-regularized vMF mixture modeling approach, enabling fine-grained categorization of the pre-training corpus for optimal data mixing strategies.

Table 2: Detailed overview of our topic definitions. This table lists the specific topics and their corresponding descriptions to decrease uncertainty and define domain boundaries.

| Taxonomy | Description |
| --- | --- |
| Consumer Services and Products | This topic encompasses a range of services and products aimed at improving consumer experiences, emphasizing sustainability, marketing, and quality in various industries. |
| Technology and Software Development | This topic encompasses the development, functionality, and improvement of software and technology solutions, focusing on web development, software tools, and user experience enhancements. |
| Community Engagement and News | The topic focuses on the dissemination of local news, events, and initiatives that encourage community participation and awareness. |
| Global Travel Experiences | This topic encompasses the exploration and enjoyment of diverse travel destinations and cultural experiences worldwide, facilitated by convenient booking and travel arrangements. |
| Manufacturing Innovations | This topic encompasses the latest advancements and innovations in manufacturing processes and product design across diverse industries, emphasizing quality, efficiency, and technological improvements. |
| Film and TV Entertainment | This topic encompasses the latest developments, releases, and events in the film and television industry, particularly focusing on superhero and action genres, film festivals, and popular franchises. |
| Political Ideological Tensions | The topic encompasses the ongoing political and ideological conflicts in the United States, emphasizing leadership, civil unrest, religious freedom, and conservative principles. |
| Home and Property Maintenance | This topic encompasses a range of services aimed at maintaining and improving residential and commercial properties, ensuring they are clean, functional, and aesthetically pleasing. |
| Personal Reflections and Musings | This topic involves sharing personal anecdotes, contemplations, and humorous insights on everyday life and broader philosophical questions, often in a casual and engaging narrative style. |
| Inclusive and Adaptive Education | This topic focuses on educational strategies and programs designed to promote inclusivity, accessibility, and adaptability to meet the diverse needs of students and educators in evolving academic and career landscapes. |
| Website User Interaction | This topic covers the various aspects of user engagement with websites, including privacy settings, account management, and customer support services. |
| Science-Backed Natural Health Solutions | This topic focuses on the integration of scientific research and natural ingredients to promote health and wellness, offering solutions for skincare, energy enhancement, and weight management. |
| Personal Financial Management | This topic encompasses strategies and resources for effectively managing personal finances in diverse situations, including moving, property division, home improvement, and budgeting. |
| Community and Local Governance | The topic encompasses community resilience, local government initiatives, and administrative challenges within county-level governance. |
| Global Financial Markets and Investment Strategies | This topic encompasses the analysis of economic trends, investment strategies, and the impact of government policies on global financial markets. |
| Sports Events and Updates | This topic encompasses the latest developments, competitions, and strategic changes in various sports, providing insights into the dynamic world of sports management and events. |
| Technology-Driven Industry Automation | The cluster explores how cutting-edge technologies and automation are transforming industry practices, enhancing operational efficiency, and driving innovation across sectors. |
| Digital Marketing Strategies | This topic covers methods and tools for optimizing digital marketing to improve business visibility, audience engagement, and overall success. |
| Healthcare Challenges and Innovations | This topic encompasses the exploration of healthcare challenges, patient experiences, and innovative approaches to treatment and care delivery. |
| Online Slot Gaming | Online Slot Gaming refers to the digital version of traditional slot machines, offering players an engaging and convenient way to experience casino excitement with diverse themes and features. |
| Consumer Product Design and Retail | This topic encompasses the design, quality, and retail aspects of various consumer products, focusing on unique craftsmanship and customer shopping experiences. |
| Adult Entertainment Industry | This topic encompasses the production, distribution, and consumption of adult content, including live performances, online platforms, and the economic aspects of the adult entertainment sector. |
| Culinary Exploration and Sustainability | This topic encompasses the art of cooking, the enjoyment of diverse flavors, and the significance of sustainable food practices. |
| Geopolitical Conflict and Human Rights Violations | This topic encompasses the examination of violence, military actions, and human rights abuses in conflict zones, highlighting the systematic nature of these issues and their impact on civilians and freedom of expression. |

#### B.2.2 Prompt for Taxonomy Generation

Prompt for Taxonomy Generation:

You are an expert data taxonomist. I will provide you with {len(indices)} documents that belong to the same semantic cluster. Your task is to:

1. Summarize the common theme and content of these documents in 2-3 sentences.
2. Based on the summary, assign a single, concise, and high-quality 'Topic Label' (2-5 words) that best describes this cluster.
3. Describe the topic in a sentence.

Documents: {docs_content}

Please strictly follow the output format:

Summary: summary content

Topic: topic label

Description: topic description
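As an illustration of how this prompt might be filled in and its reply consumed, the sketch below formats the template for one cluster and extracts the three requested fields. The helper names `build_prompt` and `parse_reply`, the `[1]`, `[2]`, ... document numbering, and the regular-expression parsing are assumptions made for this example. The paper does not describe its parsing code, and the call to the LLM itself is deliberately left out.

```python
import re

PROMPT_TEMPLATE = (
    "You are an expert data taxonomist. I will provide you with {n_docs} documents "
    "that belong to the same semantic cluster. Your task is to:\n"
    "1. Summarize the common theme and content of these documents in 2-3 sentences.\n"
    "2. Based on the summary, assign a single, concise, and high-quality 'Topic Label' "
    "(2-5 words) that best describes this cluster.\n"
    "3. Describe the topic in a sentence.\n"
    "Documents: {docs_content}\n"
    "Please strictly follow the output format:\n"
    "Summary: summary content\nTopic: topic label\nDescription: topic description"
)

def build_prompt(docs):
    """Fill the template with the representative documents of one cluster."""
    # Assumption: documents are numbered and separated by blank lines.
    docs_content = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return PROMPT_TEMPLATE.format(n_docs=len(docs), docs_content=docs_content)

def parse_reply(reply):
    """Extract the Summary / Topic / Description fields; missing fields become None."""
    fields = {}
    for key in ("Summary", "Topic", "Description"):
        match = re.search(rf"^{key}:\s*(.+)$", reply, flags=re.MULTILINE)
        fields[key.lower()] = match.group(1).strip() if match else None
    return fields
```

Returning `None` for a missing field makes it easy to detect replies that ignore the requested format and to retry them.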

## Appendix C Details of Scalable Deployment

To scale GEM to trillion\-token datasets, we employ a two\-phase Teacher\-Student approach\. This ensures that the high\-quality geometric partitions discovered by GEM can be applied to web\-scale data without incurring the computational cost of the EM algorithm on the entire corpus\.

Phase 1: Clustering and GIS-Guided Pseudo-Labeling. We first apply GEM to a randomly sampled subset $\mathcal{X}_{\mathrm{seed}} \subset \mathcal{X}$ to obtain the converged directional parameters $\Theta^{*} = \{(\mu_k, \kappa_k)\}_{k=1}^{K}$. Using these learned directions as anchors, we retrieve candidate samples from the larger unlabeled pool. To mitigate topic-frequency skew and ensure the student model learns from semantically representative prototypes, we enforce cluster-wise balanced sampling governed by the GIS score (defined in Eq. [21](https://arxiv.org/html/2605.26121#A2.E21)). Specifically, for each cluster $C_k$, we select the top-$M$ samples ranked by their GIS values, yielding a balanced pseudo-labeled dataset:

$$
\mathcal{D}_{\mathrm{train}} = \bigcup_{k=1}^{K} \left\{ (w_i, y_i = k) \;\middle|\; x_i \in \mathrm{Top\text{-}}M\bigl(\mathrm{GIS}(x \mid \Theta^{*})\bigr) \right\}, \tag{25}
$$

where $w_i$ denotes the raw text corresponding to embedding $x_i$. By prioritizing samples with high GIS, we ensure that the distilled dataset consists of candidates with high model certainty, geometric centrality, and dense local manifolds, effectively filtering out noise and ambiguous samples near decision boundaries.

Phase 2: Lightweight Student Classifier Distillation. We train a lightweight text-space classifier $f_{\phi} : w \mapsto \Delta^{K-1}$ to approximate the GEM-induced partition function. In our implementation, we adopt FastText (Joulin et al., [2017](https://arxiv.org/html/2605.26121#bib.bib34)) due to its constant-time inference complexity with respect to corpus size and its strong inductive bias toward linear, directionally separable decision boundaries, which aligns well with the directional nature of vMF distributions.

The student model is optimized via cross-entropy loss on the curated dataset $\mathcal{D}_{\mathrm{train}}$:

$$
\mathcal{L}_{\mathrm{student}}(\phi) = -\sum_{(w, y) \in \mathcal{D}_{\mathrm{train}}} \log P_{\mathrm{FastText}}(y \mid w; \phi). \tag{26}
$$

Once trained, this classifier is deployed to process the full-scale pre-training corpus, assigning a probability distribution over the $K$ latent topics to every document.
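A minimal sketch of how the balanced set in Eq. (25) could be assembled and serialized for a fastText-style classifier is given below. The function name `build_distillation_set` and its arguments are hypothetical; the only convention taken from fastText is its supervised input format, one example per line prefixed with `__label__<class>`. Training itself is not shown. With the `fasttext` Python package, a file containing these lines could be passed to `fasttext.train_supervised(input=path)`, with hyperparameters that the paper does not report.

```python
def build_distillation_set(texts, labels, gis_scores, top_m):
    """Keep the top_m highest-GIS texts per cluster, formatted as fastText lines."""
    by_cluster = {}
    for text, label, score in zip(texts, labels, gis_scores):
        by_cluster.setdefault(label, []).append((score, text))
    lines = []
    for label in sorted(by_cluster):
        # Balanced sampling: every cluster contributes at most top_m examples.
        ranked = sorted(by_cluster[label], key=lambda pair: pair[0], reverse=True)
        for _, text in ranked[:top_m]:
            # One example per line, so collapse newlines and repeated spaces.
            flat = " ".join(text.split())
            lines.append(f"__label__{label} {flat}")
    return lines
```

Capping every cluster at the same `top_m` is what keeps frequent topics from dominating the student's training signal.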

## Appendix D Pre-training Details

We conduct pre-training experiments using the Llama Factory framework on a high-performance compute node equipped with 8 $\times$ NVIDIA GeForce RTX 5090 GPUs. The models are optimized using the AdamW optimizer with a peak learning rate of $4 \times 10^{-4}$, supplemented by a linear warmup phase of 2,000 steps and a subsequent cosine annealing schedule. To ensure training stability and throughput, we employ a global batch size of 256, achieved through a configuration of 8 GPUs, a micro-batch size of 16, and 2 gradient accumulation steps. Regularization is applied via a weight decay of 0.1, and the maximum sequence length is constrained to a 2,048-token cutoff. All comparative experiments maintain these standardized hyperparameters to isolate the empirical impact of the GEM-induced taxonomy on downstream performance.
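The batch-size arithmetic and the shape of the learning-rate schedule described above can be checked with a short sketch. The peak rate, warmup length, and batch configuration come from the text; the total number of steps and the final learning rate are not reported, so `total_steps` is a free argument and `min_lr=0.0` is an assumption.

```python
import math

PEAK_LR = 4e-4        # peak learning rate stated in the text
WARMUP_STEPS = 2000   # linear warmup length stated in the text

def global_batch_size(num_gpus=8, micro_batch=16, grad_accum=2):
    """Sequences consumed per optimizer step."""
    return num_gpus * micro_batch * grad_accum

def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS, min_lr=0.0):
    """Linear warmup followed by cosine annealing (min_lr is an assumed value)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With 8 GPUs, a micro-batch of 16, and 2 accumulation steps, `global_batch_size()` gives 256 sequences, which at a 2,048-token cutoff is at most 524,288 tokens per optimizer step.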

## Appendix E Details of Implementation for Data Classification Strategies

To operationalize the semantic partitions at scale, we train a lightweight fastText (Joulin et al., [2017](https://arxiv.org/html/2605.26121#bib.bib34)) classifier to approximate the clustering assignment. For a rigorous comparison, we curated balanced datasets for both the $K$-Means baseline and the proposed GEM framework by randomly sampling 5,000 documents per cluster, which were subsequently partitioned into training, validation, and test sets following an 8:1:1 ratio. While the classifier trained on $K$-Means annotations achieved a test accuracy of 72.92%, the model distilled from GEM partitions attained a superior accuracy of 75.13%. This performance advantage indicates that the partitions generated by GEM on the Riemannian manifold possess higher intrinsic semantic consistency and clearer separation boundaries compared to Euclidean clusters, thereby reducing label noise and facilitating more robust generalization in the student classifier.

## Appendix F Details of Implementation for Data Mixing Strategies

In this section, we provide the specific implementation details for the data mixing strategies compared in our main experiments\.

Perf. Perf utilizes a sensitivity-driven heuristic inspired by task-oriented sampling to identify and amplify high-value domains. This strategy is similar to PerfRe introduced by Peng et al. ([2025](https://arxiv.org/html/2605.26121#bib.bib11)). The process begins with a sensitivity profiling phase, where we measure the contribution of each domain by individually upsampling it by 30% in temporary experimental mixtures and evaluating the resulting proxy model on downstream tasks. Guided by these performance rankings, we construct the final training distribution through a stratified upsampling scheme: the top two domains are upsampled by 40%, the third and fourth ranked domains are upsampled by 20%, and the remaining data budget is distributed uniformly across all other categories.
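One possible reading of the stratified upsampling scheme is sketched below. It interprets "upsampled by 40%" as multiplying a domain's base share by 1.4 and "distributed uniformly" as splitting the leftover probability mass equally among the unranked domains. Both interpretations, as well as the function name `perf_mixture`, are assumptions made for illustration.

```python
def perf_mixture(base_weights, ranking):
    """Build a Perf-style mixture.

    base_weights: dict mapping domain -> base share (shares sum to 1).
    ranking: domains ordered from most to least beneficial in profiling.
    """
    if len(ranking) < 4:
        raise ValueError("need at least four ranked domains")
    # Top two domains get +40%, the third and fourth get +20%.
    factors = [1.4, 1.4, 1.2, 1.2]
    mix = {d: base_weights[d] * f for d, f in zip(ranking, factors)}
    rest = [d for d in base_weights if d not in mix]
    remaining = 1.0 - sum(mix.values())
    if not rest or remaining < 0:
        raise ValueError("upsampled domains leave no budget for the others")
    # Remaining budget is shared equally by all other domains.
    for d in rest:
        mix[d] = remaining / len(rest)
    return mix
```

For eight equally weighted domains, this yields 0.175 for each of the top two, 0.15 for the next two, and 0.0875 for each of the remaining four.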

REGMIX. REGMIX formulates data-mixture selection as a regression problem. It first samples a diverse set of candidate mixtures (e.g., from a Dirichlet distribution anchored to the base token distribution) and trains proxy models under these mixtures to obtain a scalar target signal for each candidate (typically a validation-domain loss). REGMIX then fits a regressor from mixture weights to the target signal, and uses large-scale mixture simulation with fast regression inference to rank candidates and derive a robust final training distribution (e.g., by selecting or averaging the top-ranked predicted mixtures). In our instantiation, we train 256 proxy models with 1M parameters, each for 1B tokens.
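The sample, fit, simulate, and average loop can be sketched as follows. This is a toy illustration under stated assumptions: `proxy_loss` is a hypothetical callable standing in for "train a proxy model on this mixture and return its validation loss", the regressor is plain linear least squares, and the Dirichlet concentration, simulation size, and `top` value are arbitrary illustrative defaults rather than settings from the paper.

```python
import numpy as np

def regmix_select(proxy_loss, base, n_proxy=256, n_sim=20000, top=100, conc=1.0, seed=0):
    """Return a mixture obtained by averaging the top-ranked simulated candidates."""
    rng = np.random.default_rng(seed)
    base = np.asarray(base, dtype=float)
    alpha = conc * len(base) * base  # Dirichlet anchored to the base distribution
    # 1) Sample candidate mixtures and measure their proxy losses.
    W = rng.dirichlet(alpha, size=n_proxy)
    y = np.array([proxy_loss(w) for w in W])
    # 2) Fit mixture weights -> loss. Weights sum to one, so a constant
    #    offset is absorbed by the coefficients and no intercept is needed.
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    # 3) Simulate many mixtures, rank by predicted loss, average the best.
    sim = rng.dirichlet(alpha, size=n_sim)
    best = sim[np.argsort(sim @ coef)[:top]]
    return best.mean(axis=0)
```

Averaging several top-ranked candidates instead of taking the single best one makes the result less sensitive to regression error on any individual simulated mixture.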

DoReMi. DoReMi formulates data-mixture learning as a distributionally robust (minimax) reweighting problem. It first trains a *reference* model, and then trains a *proxy* model while *online* updating domain weights based on per-domain *excess loss* relative to the reference model, typically using exponentiated-gradient-style updates with smoothing. The resulting (time-averaged) weights are then used as a fixed data mixture for training the final model. In our setting, both the reference and proxy models have 120M parameters and are each trained for 10B tokens.
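A single domain-weight update of the kind described above might look like the sketch below. The step size `eta` and the smoothing coefficient are illustrative assumptions, since the values used in these experiments are not given, and the excess loss is clipped at zero so that domains where the proxy already matches the reference are not upweighted.

```python
import numpy as np

def doremi_step(alpha, proxy_loss, ref_loss, eta=1.0, smoothing=1e-3):
    """One exponentiated-gradient update of the domain weights alpha (all > 0)."""
    alpha = np.asarray(alpha, dtype=float)
    # Per-domain excess loss of the proxy over the reference, clipped at zero.
    excess = np.maximum(np.asarray(proxy_loss) - np.asarray(ref_loss), 0.0)
    # Multiplicative update, computed in log space for numerical stability.
    logits = np.log(alpha) + eta * excess
    logits -= logits.max()
    new = np.exp(logits)
    new /= new.sum()
    # Mix with the uniform distribution so no domain weight collapses to zero.
    uniform = np.full_like(new, 1.0 / len(new))
    return (1.0 - smoothing) * new + smoothing * uniform
```

The final mixture would then be the average of the weight vectors produced over all proxy-training steps, matching the time-averaged weights mentioned in the text.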

Table 3: Results on Science QA benchmarks.

Table 4: Results on Commonsense Reasoning benchmarks.

Table 5: Results on Logic & Linguistics benchmarks.

## Appendix G Details of Implementation for Data Mixing Predictability

We sample 256 mixture vectors $\{w_j\}_{j=1}^{256}$ from the simplex and, following the standard RegMix protocol, train one proxy model (1M parameters) per mixture to obtain the corresponding validation losses $\{L_j\}_{j=1}^{256}$. We then fit the RegMix regression model on pairs $(w_j, L_j)$ using an 80/20 split over mixture points (205 for training, 51 for testing). To assess predictability on the held-out mixtures, we compute the Spearman rank correlation ($\rho$) between the ground-truth losses and the RegMix-predicted losses on the test mixtures. Higher $\rho$ indicates that the regression model preserves the ordering of losses induced by mixture weights, suggesting that the taxonomy yields less noisy, less collinear, and more disentangled mixing factors.
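The evaluation protocol can be illustrated with a small NumPy sketch. The helper names are hypothetical, the regressor is plain linear least squares standing in for the RegMix regression model, and the rank computation assumes there are no tied losses, which is reasonable for continuous values.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation for tie-free inputs (Pearson on ranks)."""
    ranks_a = np.argsort(np.argsort(a)).astype(float)
    ranks_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

def mixing_predictability(W, L, train_frac=0.8, seed=0):
    """Fit on a random 80% of mixtures and report rho on the held-out 20%.

    W: (n, K) mixture vectors; L: (n,) measured proxy validation losses.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(W))
    n_train = round(train_frac * len(W))   # 205 when n = 256
    train, test = perm[:n_train], perm[n_train:]
    coef, *_ = np.linalg.lstsq(W[train], L[train], rcond=None)
    rho = spearman_rho(L[test], W[test] @ coef)
    return rho, len(train), len(test)
```

Because only the ordering of the held-out losses matters, $\rho$ is insensitive to any monotone miscalibration of the regressor's predictions.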

## Appendix H Additional Results

### H.1 Full Downstream Results

We report detailed results on all nine benchmarks, categorized into three dimensions: Science QA (Table [3](https://arxiv.org/html/2605.26121#A6.T3)), Commonsense Reasoning (Table [4](https://arxiv.org/html/2605.26121#A6.T4)), and Logic & Linguistics (Table [5](https://arxiv.org/html/2605.26121#A6.T5)).

![Refer to caption](https://arxiv.org/html/2605.26121v1/x8.png)

Figure 8: Pilot Study on Downscaled Proxies. The model exhibits a consistent and monotonic improvement in Average performance (from 41.05% to 42.12%), confirming the stability of our optimization trajectory at a reduced scale.

### H.2 Pilot Study on Downscaled Proxies

To efficiently validate our training configuration and investigate early-stage capability acquisition, we conduct a controlled pilot study using a downscaled model setup across a range of 5B to 25B tokens. As illustrated in Figure [8](https://arxiv.org/html/2605.26121#A8.F8), the model exhibits a consistent and monotonic improvement in Average performance (from 41.05% to 42.12%), confirming the stability of our optimization trajectory at a reduced scale. A granular analysis reveals diverging learning dynamics across task categories: while Science QA and Commonsense Reasoning show steady gains, suggesting a continuous integration of world knowledge, the performance in Logic & Linguistics plateaus relatively early after 5B tokens. This observation indicates that fundamental linguistic structures are captured during the initial phase of pre-training, whereas complex reasoning benefits more significantly from extended token exposure.

## Appendix I Pseudocode of GEM

Algorithm [1](https://arxiv.org/html/2605.26121#alg1) summarizes the computational pipeline of GEM, which maximizes the entropy-regularized variational objective in Eq. ([3](https://arxiv.org/html/2605.26121#S3.E3)) via an MM-based (minorize–maximize) alternating optimization scheme. The algorithm starts with spherical initialization, then iteratively performs (i) an E-step that increases the MM surrogate $\widetilde{\mathcal{F}}_t$ in Eq. ([10](https://arxiv.org/html/2605.26121#S3.E10)) (hence guaranteeing monotonic ascent of $\mathcal{F}$; Theorem [A.1](https://arxiv.org/html/2605.26121#A1.Thmtheorem1)), and (ii) an M-step that updates vMF parameters in closed form. After convergence, GEM optionally selects representative samples using Geometric Influence Scores (GIS) for taxonomy generation, and distills the learned partition into a lightweight student classifier for web-scale deployment (details in Appendix C).

Algorithm 1: GEM: Geometric Entropy Mixing via MM-Based vMF Clustering

Input: Embeddings $\mathcal{X} = \{x_i\}_{i=1}^{N} \subset \mathcal{S}^{d-1}$, number of clusters $K$, balance strength $\lambda > 0$, maximum iterations $T_{\max}$, tolerance $\varepsilon_{\mathrm{stop}} > 0$, small constant $\varepsilon > 0$ for numerical stability, optional: GIS neighborhood size $k_{\mathrm{nn}}$ and top-$M$ representatives per cluster.

Output: Soft assignments $\Gamma = \{\gamma_{ik}\}$, parameters $\Theta = \{(\mu_k, \kappa_k)\}_{k=1}^{K}$, (optional) representative sets $\{\mathcal{S}_k\}_{k=1}^{K}$.

1. // Initialization
2. Initialize $\{\mu_k^{(0)}\}_{k=1}^{K}$ by spherical $k$-means on $\mathcal{X}$.
3. Initialize $\kappa_k^{(0)} \leftarrow \kappa_{\mathrm{init}} \geq 0$ for all $k$.
4. Initialize responsibilities $\gamma_{ik}^{(0)} \leftarrow \frac{1}{K}$ for all $i, k$.
5. Set fixed generative prior $\alpha_k \leftarrow \frac{1}{K}$ for all $k$.
6. for $t = 0$ to $T_{\max} - 1$ do
7. Compute empirical masses $\pi_k^{(t)} \leftarrow \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}^{(t)}$ for all $k$.
8. Set $\boldsymbol{\pi}^{(t)} \leftarrow (\pi_1^{(t)}, \dots, \pi_K^{(t)})$ and $\mathbf{u} \leftarrow \tfrac{1}{K} \mathbf{1}$.
9. Compute $\nabla R(\boldsymbol{\pi}^{(t)}) \leftarrow -\lambda (\boldsymbol{\pi}^{(t)} - \mathbf{u})$ {Eq. ([7](https://arxiv.org/html/2605.26121#S3.E7))}
10. // E-step (MM): increase the surrogate $\widetilde{\mathcal{F}}_t$
11. Define $\widetilde{\mathcal{F}}_t(\Gamma)$ by Eq. ([10](https://arxiv.org/html/2605.26121#S3.E10)) (using $\Theta^{(t)}$ and $\boldsymbol{\pi}^{(t)}$).
12. Obtain $\Gamma^{(t+1)}$ such that $\widetilde{\mathcal{F}}_t(\Gamma^{(t+1)}) \geq \widetilde{\mathcal{F}}_t(\Gamma^{(t)})$ {Eq. ([17](https://arxiv.org/html/2605.26121#A1.E17))}
13. Implementation note: maximize $\widetilde{\mathcal{F}}_t$ over $\{\gamma_i \in \Delta^{K-1}\}$ using a few steps of projected/mirror ascent; any update satisfying the inequality is valid.
14. // M-step: closed-form vMF updates
15. for $k = 1$ to $K$ do
16. $r_k^{(t+1)} \leftarrow \sum_{i=1}^{N} \gamma_{ik}^{(t+1)} x_i$.
17. $N_k^{(t+1)} \leftarrow \sum_{i=1}^{N} \gamma_{ik}^{(t+1)}$.
18. $\mu_k^{(t+1)} \leftarrow \frac{r_k^{(t+1)}}{\|r_k^{(t+1)}\|_2 + \varepsilon}$ {Eq. ([14](https://arxiv.org/html/2605.26121#S3.E14))}
19. $\bar{R}_k^{(t+1)} \leftarrow \frac{\|r_k^{(t+1)}\|_2}{N_k^{(t+1)} + \varepsilon}$.
20. $\kappa_k^{(t+1)} \leftarrow \frac{\bar{R}_k^{(t+1)} d - (\bar{R}_k^{(t+1)})^{3}}{1 - (\bar{R}_k^{(t+1)})^{2} + \varepsilon}$ {Eq. ([16](https://arxiv.org/html/2605.26121#S3.E16))}
21. end for
22. // Stopping criterion
23. Compute $\Delta \leftarrow \big|\mathcal{F}(\Theta^{(t+1)}, \Gamma^{(t+1)}) - \mathcal{F}(\Theta^{(t)}, \Gamma^{(t)})\big|$ {Eq. ([3](https://arxiv.org/html/2605.26121#S3.E3))}
24. if $\Delta \leq \varepsilon_{\mathrm{stop}}$ then
25. break
26. end if
27. end for
28. // Optional: GIS-based representative selection for interpretability (Appendix [B.1](https://arxiv.org/html/2605.26121#A2.SS1))
29. if GIS sampling is enabled then
30. for $k = 1$ to $K$ do
31. $\mathcal{I}_k \leftarrow \{i : \arg\max_j \gamma_{ij} = k\}$ {hard assignment for sampling only}
32. for each $i \in \mathcal{I}_k$ do
33. Compute a GIS score $\mathrm{GIS}(x_i; k)$ using $\gamma_{ik}$, $\mu_k$, $\kappa_k$, and $\rho_{\mathrm{local}}(x_i)$ (see Appendix [B.1](https://arxiv.org/html/2605.26121#A2.SS1)).
34. end for
35. $\mathcal{S}_k \leftarrow$ top-$M$ samples in $\mathcal{I}_k$ with the largest GIS scores.
36. end for
37. end if
38. return $\Gamma^{(t+1)}$, $\Theta^{(t+1)}$, and (if enabled) $\{\mathcal{S}_k\}_{k=1}^{K}$.
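The closed-form parts of Algorithm 1 translate directly into vectorized NumPy. The sketch below covers the balance gradient of line 9 and the M-step of lines 16 to 20; the E-step is omitted because it depends on the surrogate in Eq. (10), which is defined in the main text. The function names are hypothetical and the default `eps` is an assumed value.

```python
import numpy as np

def balance_gradient(gamma, lam):
    """Line 9: gradient of the balance regularizer at the empirical masses."""
    pi = gamma.mean(axis=0)                    # line 7: pi_k = (1/N) sum_i gamma_ik
    u = np.full_like(pi, 1.0 / len(pi))        # line 8: uniform target
    return -lam * (pi - u)

def m_step(X, gamma, eps=1e-8):
    """Lines 16-20: closed-form vMF updates for all clusters at once.

    X: (N, d) unit-norm embeddings; gamma: (N, K) soft assignments.
    Returns mu of shape (K, d) and kappa of shape (K,).
    """
    d = X.shape[1]
    r = gamma.T @ X                            # line 16: weighted resultant vectors
    n = gamma.sum(axis=0)                      # line 17: soft cluster sizes
    r_norm = np.linalg.norm(r, axis=1)
    mu = r / (r_norm[:, None] + eps)           # line 18: mean directions
    r_bar = r_norm / (n + eps)                 # line 19: mean resultant lengths
    kappa = (r_bar * d - r_bar**3) / (1.0 - r_bar**2 + eps)  # line 20
    return mu, kappa
```

A tightly concentrated cluster has a mean resultant length close to one, so its denominator approaches `eps` and its estimated concentration becomes large, which is the behavior expected of the vMF concentration parameter.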
