Tag
This paper introduces the 'culture funnel' concept, demonstrating that cultural signals in LLM training data sharply decline during post-training stages. The authors release a 5.6M-sample tagged dataset to help preserve cultural grounding in model alignment.
This paper defines cultural diversity as a new evaluation dimension for multi-agent systems, measured as pairwise differences in responses to the World Values Survey. Experiments show that current models lack the value diversity of human societies. Mixing backbones can improve both alignment and diversity, whereas interaction between agents reduces diversity.
This paper introduces BLADE, a culturally aligned instruction-tuning dataset of 4,196 interaction pairs for fixing honorific failures and pragmatic gaps in multilingual Bangla generation. Fine-tuning models like DeepSeek-8B and LLaMA-3.2-3B on this dataset yields substantial improvements in structural fidelity and honorific alignment.
AlignCultura introduces CulturaX, a UNESCO-grounded dataset and two-stage pipeline for culturally aligning LLMs, showing 4–6% HHH gains and 18% fewer cultural failures on Qwen3-8B and DeepSeek-R1-Distill-Qwen-7B.
C-Mining proposes an unsupervised framework for discovering cultural seeds in LLM training data by exploiting cross-lingual geometric misalignment in embedding spaces, enabling scalable synthetic data generation for cultural alignment without manual or LLM supervision.
This paper introduces Anthropogenic Regional Adaptation, a paradigm for optimizing vision-language models to specific regional contexts while maintaining global generalization. The authors propose GG-EZ, an adaptation method using regional data filtering and model merging, demonstrating 5–15% improvements in cultural relevance for Southeast Asia across three VL architectures.