C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment

arXiv cs.CL Papers

Summary

C-Mining proposes an unsupervised framework for discovering cultural seeds in LLM training data by exploiting cross-lingual geometric misalignment in embedding spaces, enabling scalable synthetic data generation for cultural alignment without manual or LLM supervision.

arXiv:2604.15675v1 (announce type: new)

# C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment

Source: https://arxiv.org/html/2604.15675

Pufan Zeng1,3, Yilun Liu1✉, Mingchen Dai1,3, Mengyao Piao1, Chunguang Zhao1, Lingqi Miao1, Shimin Tao1, Weibin Meng2, Minggui He1, Chenxin Liu1, Zhenzhen Qin1, Li Zhang1, Hongxia Ma1, Boxing Chen2, Daimeng Wei1

## Abstract

Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating cultural specificity as an abstract concept rather than a measurable signal. In this paper, we address this "quantification gap" by proposing C-Mining, an unsupervised framework that transforms the discovery of cultural seeds from a subjective selection process into a computable data mining formulation. Our approach exploits a novel geometric insight, leveraging the cross-lingual misalignment of cultural concepts within pre-trained embedding spaces as a quantifiable discovery signal. By systematically identifying these regions characterized by pronounced linguistic exclusivity and geometric isolation, while actively filtering out noise, C-Mining automatically extracts high-fidelity Culture Points (CPs) from raw multilingual corpora without reliance on human or LLM supervision, reducing preparation costs by more than 150-fold. We further leverage the mined knowledge to steer the synthesis of diverse instruction-tuning datasets. Extensive experiments demonstrate that this seed-centric approach significantly enhances cultural understanding and reasoning capabilities, achieving a +6.03 point improvement on CulturalBench-Hard and surpassing state-of-the-art baselines, providing a scalable, quantifiable solution for high-quality cultural data synthesis.

**Keywords:** Data Mining, Large Language Models, Unsupervised Learning, Seed Discovery, Synthetic Data Generation

✉ Corresponding author. Email: [email protected]

## 1. Introduction

The training landscape of Large Language Models (LLMs) is fundamentally shaped by imbalanced data distributions, where English-centric corpora overwhelmingly dominate the pre-training objectives (Yang et al., 2025; Grattafiori et al., 2024). This statistical hegemony leads to a "representation collapse" for the long tail of localized knowledge, causing high-resource narratives to systematically overshadow specific regional nuances (Yu et al., 2025). A direct consequence of this skew is the model's failure to capture diverse cultural contexts, often resulting in hallucinations where dominant norms are imposed on local scenarios (Saha et al., 2025). Correcting these deep-seated biases exceeds the capacity of generic prompting (Etxaniz et al., 2024), necessitating Supervised Fine-Tuning (SFT) on targeted domain data. However, the efficacy of such alignment depends entirely on a data-centric intervention: specifically, the synthesis of high-quality, culture-specific samples to effectively restore these underrepresented distributions.

Acquiring such high-fidelity training data, however, presents a fundamental dilemma between scalability and quality. Manual curation of native corpora is prohibitively expensive and unscalable (Ge et al., 2024). To address this scarcity, the community has adopted a "cultural seeds + LLM" synthesis paradigm, where specific cultural knowledge (seeds) is used to guide LLMs in generating large-scale instruction datasets (Shiao and Papalexakis, 2024; Li et al., 2024a; Xu et al., 2025b). Yet this approach faces a critical mining bottleneck: without high-fidelity seeds to actively constrain the generation space, the synthesizing LLM inevitably regresses to its dominant, high-resource priors (Horych et al., 2025). An efficient way to discover high-quality seeds is therefore the decisive factor in this pipeline (Du et al., 2025; Riaz et al., 2025), determining whether the synthetic data effectively bridges the distribution gap or merely amplifies existing biases (Srivastava, 2025; Xu et al., 2025a).
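To make the "cultural seeds + LLM" paradigm concrete, the following is a minimal sketch of how a seed might constrain the generation space. It only builds prompt strings and calls no model. The function name `build_seed_prompts`, the prompt wording, and the `questions_per_seed` parameter are illustrative assumptions and are not taken from the paper.

```python
def build_seed_prompts(seeds, culture, questions_per_seed=2):
    """Return one generation prompt per (seed, slot) pair.

    Hypothetical helper: each prompt names the seed explicitly so that a
    generator is tied to it instead of falling back on generic knowledge.
    """
    template = (
        "You are writing instruction-tuning data about {culture} culture.\n"
        "Seed concept: {seed}\n"
        "Write question #{index} that can only be answered correctly by "
        "someone who understands this concept in its native context, "
        "then give the answer."
    )
    return [
        template.format(culture=culture, seed=seed, index=i + 1)
        for seed in seeds
        for i in range(questions_per_seed)
    ]


# Toy usage with a single seed taken from the paper's running example.
prompts = build_seed_prompts(["Jianghu"], culture="Chinese")
```

In this sketch the quality of the output depends entirely on what is passed in as `seeds`, which is the dependency the paragraph above describes.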

Although seeds play this critical role, the community treats them as static prerequisites rather than scientific objectives, disproportionately prioritizing downstream synthesis over rigorous seed construction (Li et al., 2024a; Du et al., 2025; Riaz et al., 2025). Lacking a systematic framework, prevailing methods rely on subjective proxies, using LLMs or human annotators as the sole judges of cultural relevance. As a result, these approaches face converging limitations regarding coverage, reliability, and scalability:

(1) **Superficial coverage:** LLM-driven methods often gravitate towards surface stereotypes (Saha et al., 2025). Expert evaluation reveals that seeds filtered by these approaches (Fung et al., 2024) often exhibit constrained cultural specificity, failing to capture the long-tail nuances accessible to native speakers.

(2) **Biased quality:** Unguided synthesis risks reinforcing bias loops (Srivastava, 2025). For instance, models trained via established self-improvement pipelines (Xu et al., 2025a) demonstrate suboptimal performance in downstream cultural reasoning tasks, indicating that high-volume synthesis cannot compensate for the lack of high-fidelity seeds.

(3) **Inconsistent scalability:** While expert curation can ensure quality, it is inherently unscalable. Validating seeds for thousands of global subcultures requires prohibitive investment, making manual oversight impractical for comprehensive cultural alignment.

To bridge this seed gap, we advocate for a knowledge-discovery paradigm that transforms seed discovery from a subjective selection process into a computable data mining task. We propose C-Mining, an unsupervised framework that automatically extracts high-fidelity cultural seeds from raw multilingual corpora. By exploiting the geometric misalignment inherent in frozen multilingual embeddings, our method effectively operationalizes "cultural specificity" as a measurable topological signature, enabling the objective discovery of high-value seeds without relying on human or LLM subjective judgment.

Our approach is grounded in the alignment mechanisms inherent in multilingual pretraining. During pretraining on vast multilingual corpora, LLMs perform unsupervised alignment: shared universal concepts spontaneously converge across languages because they are semantically equivalent (Wang et al., 2025; Liu et al., 2025). Culture-specific concepts have no such equivalents in other languages and therefore do not converge. Consequently, unique cultural knowledge manifests as distinct geometric signatures: islands characterized by minimal cross-lingual alignment yet high intra-lingual homogeneity (Kozlowski et al., 2025; Lim et al., 2025). Noise (e.g., rare tokens) may also appear unaligned, so C-Mining isolates authentic cultural knowledge by filtering on the semantic density of the embeddings, which keeps the signal stable using unsupervised analysis alone.

For instance, while universal terms like *Apple* or *Mathematics* exhibit strong cross-lingual alignment by mapping closely to their cross-lingual equivalents, the Chinese term *Jianghu*—representing a unique socio-moral order in ancient China—remains geometrically anchored as a dense, isolated cluster within its native linguistic space, resisting forced alignment with global semantic spaces. C-Mining treats this misalignment not as a defect, but as a discriminative signal. By traversing the embedding space to identify these unaligned regions, we extract representative terms defined as Culture Points (CPs) to pilot data synthesis.
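The geometric intuition in the two paragraphs above can be illustrated with a toy sketch. It assumes cosine similarity, a nearest-neighbour alignment score, a k-nearest-neighbour density score, and hand-picked thresholds. None of these choices is stated in this part of the paper, so the code is an illustration of the idea rather than the C-Mining scoring itself. The 4-D vectors, the neighbour terms `wulin` and `xiake`, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def cross_lingual_alignment(vec, foreign_vecs):
    """Highest similarity to any embedding from the other language."""
    return max(cosine(vec, f) for f in foreign_vecs)


def intra_lingual_density(vec, native_vecs, k=2):
    """Mean similarity to the k most similar same-language embeddings."""
    sims = sorted((cosine(vec, n) for n in native_vecs), reverse=True)
    return sum(sims[:k]) / k


def label(term, native, foreign, align_max=0.5, density_min=0.8, k=2):
    """Classify a term as universal, culture_point, or noise.

    The thresholds are arbitrary assumptions chosen for the toy data.
    """
    vec = native[term]
    others = [v for t, v in native.items() if t != term]
    if cross_lingual_alignment(vec, list(foreign.values())) > align_max:
        return "universal"      # aligned with a foreign equivalent
    if intra_lingual_density(vec, others, k) >= density_min:
        return "culture_point"  # unaligned but part of a dense native cluster
    return "noise"              # unaligned and isolated, e.g. a rare token


# Toy "Chinese" and "English" embeddings (made-up values).
zh = {
    "pingguo": [1.0, 0.1, 0.0, 0.0],     # apple
    "shuxue": [0.1, 1.0, 0.0, 0.0],      # mathematics
    "jianghu": [0.0, 0.0, 1.0, 0.0],
    "wulin": [0.1, 0.0, 1.0, 0.0],       # hypothetical neighbour of jianghu
    "xiake": [0.0, 0.1, 1.0, 0.0],       # hypothetical neighbour of jianghu
    "rare_token": [0.0, 0.0, 0.0, 1.0],
}
en = {
    "apple": [0.9, 0.2, 0.0, 0.0],
    "mathematics": [0.2, 0.9, 0.0, 0.0],
}

labels = {term: label(term, zh, en) for term in zh}
```

On this toy data, `pingguo` and `shuxue` come out as universal, the `jianghu` cluster as culture points, and `rare_token` as noise, which mirrors the distinction between aligned concepts, dense unaligned islands, and unaligned noise.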

This approach systematically resolves the aforementioned bottlenecks: it overcomes **superficial coverage** by mining the long tail of knowledge directly from raw corpora instead of relying on high-visibility stereotypes; it **reduces bias** by deriving anchors from stable, native usage patterns rather than biased model predictions; and it achieves **scalability** via a fully unsupervised pipeline that eliminates the need for prohibitive human intervention.

Extensive experiments demonstrate that fine-tuning with CP-based instructions significantly enhances LLMs' cultural reasoning capabilities, suggesting that seed quality plays a pivotal role in determining the upper bound of cultural alignment.

In summary, our main contributions are:

- We transform the challenge of cultural specificity, traditionally viewed as abstract and difficult to quantify, into a computable data mining formulation. This paradigm shift provides a novel, quantitative solution path for cultural data synthesis, moving beyond subjective curation to objective metric computation.

- We introduce C-Mining, a novel unsupervised framework to mine high-fidelity cultural seeds by quantifying the geometric misalignment of embeddings without reliance on external supervision, thereby enabling both scalability and quality in cultural data synthesis while reducing preparation costs by more than 150-fold.

- We leverage the mined Culture Points (CPs) to synthesize instruction-tuning datasets, demonstrating significant improvements in cultural reasoning (e.g., +6.03 points on CulturalBench-Hard (Chiu et al., 2025)); in addition, we will release our code and data to the community.

![Figure 1. Overview of the C-Mining pipeline. The algorithm leverages the geometric properties of frozen embeddings to identify CPs—knowledge characterized by high intra-lingual homogeneity and low cross-lingual alignment—serving as authentic seeds for instruction tuning.](https://arxiv.org/html/2604.15675#S1.F1)

## 2. Related Work

### 2.1. Cultural Alignment and Data Synthesis

Recent advancements in aligning LLMs with diverse cultural contexts have primarily focused on post-training data synthesis. A dominant paradigm involves leveraging sociological frameworks, such as the World Values Survey (WVS; EVS/WVS, 2024), as initial anchors. Representative frameworks like CultureLLM (Li et al., 2024a) and CulturePark (Li et al., 2024b) employ LLMs to extract and synthesize cultural data anchored in the WVS. Taking a different route, CultureSynth (Nguyen et al., 2023) employs LLMs to expand upon generic cultural keywords, followed by a knowledge retrieval process to construct cross-lingual QA pairs. Concurrently, other approaches focus on curation strategies: CultureBank (Shi et al., 2024) utilizes a custom-trained classifier to categorize cultural content from online sources, while CultureFit (Feng et al., 2025) directly extracts seeds from pre-existing cultural benchmarks to drive its synthesis pipeline.

Despite these strides, current methodologies face two critical bottlenecks stemming from their reliance on model capabilities. First, regarding **LLM-based seed extraction**, relying on models to curate or filter initial anchors—whether from sociological surveys or open-ended queries—often restricts coverage to high-visibility cultural symbols. This extraction process tends to overlook the "long tail" of subtle, localized nuances, resulting in a dataset that reflects the model's existing selection bias rather than authentic cultural breadth (Saha et al., 2025; Durmus et al., 2023). Second, regarding **LLM-based seed expansion**, employing models to expand these seeds into complex new seeds risks a "self-reinforcing loop." Even with valid seeds, the excessive expansion process often regresses to dominant Western perspectives due to pretraining inertia, leading to homogenized synthetic data that lacks the specific "cultural soul" of the target language (Guo et al., 2025b).

### 2.2. Distributional Divergence in Multilingual Spaces

Research on multilingual LLMs has extensively explored how different languages share a unified semantic space. Ideally, multilingual pretraining induces a shared alignment where concepts possess cross-lingual universality (Conneau et al., 2020). However, empirical evidence suggests that this alignment is highly non-uniform. While high-frequency, globally shared concepts tend to converge, distinct linguistic nuances often resist alignment, leading to significant representational divergence (Søgaard et al., 2018; Vulić et al., 2020). This phenomenon creates a stratified embedding space: a dense, aligned core dominated by common (cross-lingually shared) knowledge, surrounded by sparse, unaligned peripheries containing language-specific semantics (Lauscher et al., 2020; Etxaniz et al., 2024).
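One simple way to picture the split between an aligned core and an unaligned periphery is a mutual nearest-neighbour test across two languages. This is a swapped-in illustration, not a method attributed to the works cited above or to C-Mining, and the vectors and function names are hypothetical. A source term is placed in the core when its nearest target term points back to it, and in the periphery otherwise.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(vec, candidates):
    """Return the key in `candidates` whose vector is most similar to `vec`."""
    return max(candidates, key=lambda name: cosine(vec, candidates[name]))


def split_core_periphery(src, tgt):
    """Split source terms by whether they have a mutual nearest neighbour."""
    core, periphery = {}, []
    for s_name, s_vec in src.items():
        t_name = nearest(s_vec, tgt)
        if nearest(tgt[t_name], src) == s_name:
            core[s_name] = t_name        # the two terms pick each other
        else:
            periphery.append(s_name)     # no target term points back
    return core, periphery


# Toy embeddings (made-up values).
src = {
    "pingguo": [1.0, 0.1, 0.0],
    "shuxue": [0.1, 1.0, 0.0],
    "jianghu": [0.3, 0.2, 1.0],
}
tgt = {
    "apple": [0.9, 0.2, 0.0],
    "mathematics": [0.2, 0.9, 0.0],
}

core, periphery = split_core_periphery(src, tgt)
```

Here `pingguo` and `shuxue` pair with `apple` and `mathematics`, while `jianghu` has no target term that selects it in return and so falls into the periphery.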

Current methodologies predominantly focus on improving the "transfer" of knowledge from the core to the periphery to mitigate this gap (Muennighoff et al., 2023). Consequently, the unaligned periphery is often overlooked or treated solely as a source of performance degradation. In contrast, our approach re-evaluates the utility of these divergent regions. We posit that the resistance to alignment is not a failure of the model, but a geometric indicator of unique semantic content, which can be systematically mined to guide more authentic instruction tuning.

## 3. Methodology

### 3.1. Overview

As LLMs scale globally, equipping them to perceive cultural nuances remains a critical challenge. We propose a framework centered on CPs, seeds deeply embedded with cultural semantics, to guide the generation of culturally aligned datasets. The core of this framework is C-Mining, an unsupervised algorithm designed to autonomously extract CPs from raw multilingual corpora by exploiting geometric misalignments in embedding spaces. As illustrated in Figure 1, C-Mining consists of two primary stages:

(1) Monolingual High-Quality Data Filtering
