Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency
Summary
This paper proposes a self-supervised framework using multilingual self-consistency and a self-critique mechanism to transfer cultural knowledge across languages, achieving a 5.03% average improvement on English queries in the BLEnD benchmark by surfacing latent cultural knowledge from local-language representations.
View Cached Full Text
Cached at: 05/22/26, 08:45 AM
# Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency
Source: [https://arxiv.org/html/2605.22137](https://arxiv.org/html/2605.22137)
Andrew Ivan Soegeng1, Patrick Sutanto2, Tan Sang Nguyen2 1SAP,2School of Computing, National University of Singapore andrew\.soegeng@sap\.com \{sutanto\.patrick, tansang\.nguyen\}@u\.nus\.edu
###### Abstract
Although Large Language Models \(LLMs\) demonstrate strong capabilities across various tasks, they exhibit significant performance discrepancies across languages\. While prompting LLMs in English typically yields the highest general performance, it often induces a Western\-centric bias, hindering the model’s ability to accurately reflect diverse cultural knowledge\. We hypothesize that LLMs already possess rich cultural knowledge embedded within local\-language representations, but fail to retrieve it when prompted in English\. To bridge this cross\-lingual knowledge gap, we propose a novel self\-supervised framework\. Our method leverages multilingual self\-consistency to identify the most reliable cultural responses across languages, combined with a self\-critique mechanism to transfer this knowledge to the weaker language\. Evaluations on the BLEnD benchmark demonstrate that our approach significantly improves cultural alignment—boosting performance on English queries by an average of 5\.03%—relying entirely on self\-generated data\. Ultimately, our work demonstrates that latent cultural knowledge can be successfully surfaced and propagated across languages, enabling more culturally equitable and consistent LLMs\.
rmTeXGyreTermesX \[\*devanagari\]rmLohit Devanagari \[\*arabic\]rmNoto Sans Arabic\\bbl@luahyphenate\\directluaif Babel\.locale\_mapped == nil then Babel\.locale\_mapped = true Babel\.linebreaking\.add\_before\(Babel\.locale\_map, 1\) Babel\.loc\_to\_scr = Babel\.chr\_to\_loc = Babel\.chr\_to\_loc or end Babel\.locale\_props\[4\]\.letters = false\\directluaif Babel\.script\_blocks\[’Hans’\] then Babel\.loc\_to\_scr\[4\] = Babel\.script\_blocks\[’Hans’\] Babel\.locale\_props\[4\]\.lg = 89 end\\directluaif Babel\.script\_blocks\[’Hans’\] then Babel\.loc\_to\_scr\[4\] = Babel\.script\_blocks\[’Hans’\] end\\IfFontExistsTFNoto Sans CJK SC\[chinese\]rm\[AutoFakeSlant=0\.15,SmallCapsFont=Noto Sans CJK SC\]Noto Sans CJK SC\\bbl@patterns@luachinese
Cross\-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self\-Consistency
## 1Introduction
Large Language Models \(LLMs\) have achieved remarkable progress across diverse natural language processing tasks, including logical reasoningMondorf and Plank \([2024](https://arxiv.org/html/2605.22137#bib.bib30)\), question answeringBrownet al\.\([2020](https://arxiv.org/html/2605.22137#bib.bib31)\), and multilingual understandingWorkshopet al\.\([2023](https://arxiv.org/html/2605.22137#bib.bib32)\)\. Nevertheless, despite their strong overall performance, these models often exhibit uneven behavior across different languagesHuanget al\.\([2023](https://arxiv.org/html/2605.22137#bib.bib33)\)and struggle to generate culturally appropriate responses when applied beyond dominant linguistic and cultural contextsNaouset al\.\([2024](https://arxiv.org/html/2605.22137#bib.bib34)\)\.
A key challenge for multilingual LLMs is their inability to access knowledge across languages consistently\. Although models often achieve the strongest performance when prompted in English, this can introduce Western\-centric bias and limit their ability to reflect diverse cultural knowledge\. Interestingly, the same models may produce more culturally appropriate responses when prompted in local languagesYinget al\.\([2025](https://arxiv.org/html/2605.22137#bib.bib36)\); Myunget al\.\([2024](https://arxiv.org/html/2605.22137#bib.bib5)\), suggesting that relevant knowledge is already present but not effectively retrieved or transferred across languages\. This mismatch limits the reliability of LLMs in diverse real\-world settings\.
To address these limitations, prior work has explored several directions for improving cross\-lingual and cultural alignment in LLMs\. Prompt\-based methods attempt to inject cultural knowledge at inference timeWanget al\.\([2024](https://arxiv.org/html/2605.22137#bib.bib24)\), but often require careful design and fail to capture deeper cultural understandingDURMUSet al\.\([2024](https://arxiv.org/html/2605.22137#bib.bib14)\); Kovačet al\.\([2023](https://arxiv.org/html/2605.22137#bib.bib25)\)\. Training\-based approaches instead rely on curated datasets from surveys, social media, or multilingual sources to enhance cultural awarenessLiet al\.\([2024a](https://arxiv.org/html/2605.22137#bib.bib26)\); Shiet al\.\([2024](https://arxiv.org/html/2605.22137#bib.bib27)\); Adilazuardaet al\.\([2025](https://arxiv.org/html/2605.22137#bib.bib15)\), though these methods are costly and difficult to scale\. More recent work leverages critique\-based data synthesis and self\-consistency to improve cultural knowledge and model reliabilityFenget al\.\([2025](https://arxiv.org/html/2605.22137#bib.bib1)\); Wanget al\.\([2025](https://arxiv.org/html/2605.22137#bib.bib29)\)\. However, these approaches either depend on stronger external models or are primarily evaluated in structured settings, leaving open challenges for scalable and robust alignment in open\-ended generation\.
In this work, we propose a novel framework that leverages multilingual self\-consistency to improve cultural knowledge alignment across languages\. Instead of relying on external annotations, our method exploits the model’s own responses across multiple languages to identify reliable knowledge\. By comparing response consistency across languages, we can determine which language yields more stable and coherent answers and use this signal to construct self\-supervised training data\.
Our contributions can be summarized as follows:
- •We propose a self\-supervised multilingual self\-consistency framework to generate reliable training signals without human annotation\.
- •We introduce a cross\-lingual knowledge transfer mechanism that leverages stronger\-language responses to improve weaker\-language performance\.
## 2Related Works
Cross\-Lingual Performance DisparitiesMultilingual Large Language Models \(LLMs\) have been shown to yield different answers when a query is posed in different languages, leading to high performance variance across languages\(Xuanet al\.,[2025](https://arxiv.org/html/2605.22137#bib.bib17); Bandarkaret al\.,[2024](https://arxiv.org/html/2605.22137#bib.bib18); Pontiet al\.,[2020](https://arxiv.org/html/2605.22137#bib.bib19)\)\. This discrepancy is primarily attributed to English\-skewed training data, which causes models to route their internal reasoning through English\(Weihuaet al\.,[2026](https://arxiv.org/html/2605.22137#bib.bib20); Schutet al\.,[2025](https://arxiv.org/html/2605.22137#bib.bib21); Wendleret al\.,[2024](https://arxiv.org/html/2605.22137#bib.bib22)\)\. Consequently, as the target language drifts further from English, models suffer significant performance degradation, particularly in low\-resource languages\(Huanget al\.,[2024](https://arxiv.org/html/2605.22137#bib.bib28)\)\. Conversely, in some cases, LLMs may possess cultural knowledge in a local language but fail to retrieve or translate it when prompted in English\(Myunget al\.,[2024](https://arxiv.org/html/2605.22137#bib.bib5)\)\.
Cultural Bias and Western\-CentricityCultural bias has emerged as a significant concern in current LLMs as these models tend to reflect the dominant values of their pretraining corpora, often marginalizing other demographic groups\(Liet al\.,[2025](https://arxiv.org/html/2605.22137#bib.bib12)\)\. Consequently, LLMs predominantly exhibit a bias toward Western values\(Mushtaqet al\.,[2025](https://arxiv.org/html/2605.22137#bib.bib16); Liet al\.,[2024b](https://arxiv.org/html/2605.22137#bib.bib13)\)\. Moreover, simply prompting these models to adopt a specific cultural perspective often yields answers grounded in superficial stereotypes rather than a deep understanding of the underlying cultural nuances\(DURMUSet al\.,[2024](https://arxiv.org/html/2605.22137#bib.bib14)\)\. Such biases raise serious safety concerns, particularly regarding the deployment of LLMs in underrepresented or localized cultural contexts\(Azmiet al\.,[2025](https://arxiv.org/html/2605.22137#bib.bib11)\)\.
Approaches to Cultural AlignmentVarious strategies attempt to mitigate cultural bias in LLMs\. One approach aims to directly inject cultural knowledge via prompting\(Wanget al\.,[2024](https://arxiv.org/html/2605.22137#bib.bib24)\)\. However, such methods often yield only a shallow understanding and require extensive domain expertise to design\(DURMUSet al\.,[2024](https://arxiv.org/html/2605.22137#bib.bib14); Kovačet al\.,[2023](https://arxiv.org/html/2605.22137#bib.bib25)\)\. Other approaches fine\-tune models to improve their awareness of specific cultures\. These methods often involve curating training datasets from surveys\(Liet al\.,[2024a](https://arxiv.org/html/2605.22137#bib.bib26)\), social media\(Shiet al\.,[2024](https://arxiv.org/html/2605.22137#bib.bib27)\), or a combination of diverse sources\(Adilazuardaet al\.,[2025](https://arxiv.org/html/2605.22137#bib.bib15)\)\. Recent work also leverages stronger LLMs to generate critiques that improve cultural data quality\(Fenget al\.,[2025](https://arxiv.org/html/2605.22137#bib.bib1)\)\. Another line of research demonstrates that self\-supervision via self\-consistency can enhance cultural knowledge, even without relying on stronger models\(Wanget al\.,[2025](https://arxiv.org/html/2605.22137#bib.bib29); Zhanget al\.,[2025a](https://arxiv.org/html/2605.22137#bib.bib35)\)\. Building on these foundations, our work demonstrates how to further enhance cultural alignment in open\-ended generation by combining self\-critique and self\-consistency across languages, eliminating the reliance on stronger LLMs\.
## 3Methodology
We adapt CulFiTFenget al\.\([2025](https://arxiv.org/html/2605.22137#bib.bib1)\)by omitting the ineffective Direct Preference Optimization \(DPO\) phase and introducing self\-supervised ground truth generation\. The pipeline has two stages: \(1\)Bilingual Question Generationfor synthesizing English and local\-language query pairs, and \(2\)Self\-Supervised Ground Truth Generationto distill reliable cultural knowledge from a base modelℳ\\mathcal\{M\}via multilingual self\-consistency\.
Figure 1:Overview of the self\-supervised ground truth generation via multilingual self\-consistency\. The model generatesNNresponses per language; the language with higher intra\-language consistency is selected as the stronger language, and the most consistent answer in the stronger language is translated to the weaker language and set as the ground truth\.### 3\.1Bilingual Question Generation
We convert assertive statementssis\_\{i\}from the CANDLENguyenet al\.\([2023](https://arxiv.org/html/2605.22137#bib.bib3)\)and CultureAtlasFunget al\.\([2024](https://arxiv.org/html/2605.22137#bib.bib4)\)datasets into coherent knowledge paragraphspip\_\{i\}by promptingℳ\\mathcal\{M\}\. We then promptℳ\\mathcal\{M\}to generate culturally grounded questions\{q1,…,qK\}\\\{q\_\{1\},\\ldots,q\_\{K\}\\\}frompip\_\{i\}and extract each question’s country of originoko\_\{k\}and primary languageℓk\\ell\_\{k\}\. Retaining only non\-English questions supported by Google Translate, we translate each into the local language to form bilingual pairs:
𝒬=\{\(qken,qkℓk\)∣k=1,…,\|𝒬\|\}\\mathcal\{Q\}=\\left\\\{\\left\(q\_\{k\}^\{\\text\{en\}\},\\;q\_\{k\}^\{\\ell\_\{k\}\}\\right\)\\mid k=1,\\ldots,\|\\mathcal\{Q\}\|\\right\\\}\(1\)whereqkenq\_\{k\}^\{\\text\{en\}\}andqkℓkq\_\{k\}^\{\\ell\_\{k\}\}are the English and local language translations\.
### 3\.2Self\-Supervised Ground Truth Generation
Given𝒬\\mathcal\{Q\}, we generate self\-supervised ground truth answers based onℳ\\mathcal\{M\}’s response consistency\.
#### Response Sampling\.
For eachqkq\_\{k\}and languageλ∈\{en,ℓk\}\\lambda\\in\\\{\\text\{en\},\\;\\ell\_\{k\}\\\}, we sampleNNindependent responses fromℳ\\mathcal\{M\}:
𝒜kλ=\{ak,1λ,…,ak,Nλ\}\\mathcal\{A\}\_\{k\}^\{\\lambda\}=\\left\\\{a\_\{k,1\}^\{\\lambda\},\\;\\ldots,\\;a\_\{k,N\}^\{\\lambda\}\\right\\\}\(2\)
#### Intra\-Language Consistency\.
We evaluate internal agreement within each language’s responses using pairwise cosine similarity of Qwen3\-Embedding\-0\.6BZhanget al\.\([2025b](https://arxiv.org/html/2605.22137#bib.bib2)\)embeddings,𝐞\(⋅\)\\mathbf\{e\}\(\\cdot\)\. The consistency scoreCkλC\_\{k\}^\{\\lambda\}is the average similarity over all\(N2\)\\binom\{N\}\{2\}unique pairs:
Ckλ=2N\(N−1\)∑1≤i<j≤Ncos\(𝐞\(ak,iλ\),𝐞\(ak,jλ\)\)C\_\{k\}^\{\\lambda\}=\\frac\{2\}\{N\(N\-1\)\}\\sum\_\{1\\leq i<j\\leq N\}\\cos\\\!\\left\(\\mathbf\{e\}\(a\_\{k,i\}^\{\\lambda\}\),\\;\\mathbf\{e\}\(a\_\{k,j\}^\{\\lambda\}\)\\right\)\(3\)
#### Ground Truth Selection\.
The language with higher consistency \(CkλC\_\{k\}^\{\\lambda\}\) is designated thestrongerlanguageλ\+\\lambda^\{\+\}; the other is theweakerlanguageλ−\\lambda^\{\-\}\. The self\-supervised ground truthgk∗g\_\{k\}^\{\*\}is theλ\+\\lambda^\{\+\}response with the highest average pairwise similarity to its peers:
gk∗=argmaxak,iλ\+1N−1∑j=1j≠iNcos\(𝐞\(ak,iλ\+\),𝐞\(ak,jλ\+\)\)g\_\{k\}^\{\*\}=\\operatorname\*\{arg\\,max\}\_\{a\_\{k,i\}^\{\\lambda^\{\+\}\}\}\\frac\{1\}\{N\-1\}\\sum\_\{\\begin\{subarray\}\{c\}j=1\\\\ j\\neq i\\end\{subarray\}\}^\{N\}\\cos\\\!\\left\(\\mathbf\{e\}\(a\_\{k,i\}^\{\\lambda^\{\+\}\}\),\\;\\mathbf\{e\}\(a\_\{k,j\}^\{\\lambda^\{\+\}\}\)\\right\)\(4\)Next, we translategk∗g\_\{k\}^\{\*\}intoλ−\\lambda^\{\-\}via Google Translate, producingg^kλ−\\hat\{g\}\_\{k\}^\{\\lambda^\{\-\}\}and use that as the ground truth to improve the model on the weaker language\.
#### Critique\-Augmented Training Targets\.
FollowingFenget al\.\([2025](https://arxiv.org/html/2605.22137#bib.bib1)\), we sample a responsemkm\_\{k\}from𝒜kλ−\\mathcal\{A\}\_\{k\}^\{\\lambda^\{\-\}\}and promptℳ\\mathcal\{M\}to generate a critiqueckc\_\{k\}comparingmkm\_\{k\}againstg^kλ−\\hat\{g\}\_\{k\}^\{\\lambda^\{\-\}\}\.
#### Training Data Construction\.
Each instance comprises the tuple\(qkλ−,mk,ck,g^kλ−\)\(q\_\{k\}^\{\\lambda^\{\-\}\},m\_\{k\},c\_\{k\},\\hat\{g\}\_\{k\}^\{\\lambda^\{\-\}\}\)\. We restrict the dataset to instances associated with the 16 cultural regions defined in the BLEnD benchmark\. To mitigate catastrophic forgetting, we augment this corpus with the Aya DatasetSinghet al\.\([2024b](https://arxiv.org/html/2605.22137#bib.bib23)\), maintaining a 3:1 ratio of cultural to general examples\.
## 4Experimental Setup
Table 1:BLEnD evaluation results \(SEM\-B scores\) comparing Llama 3\.1 8B Instruct \(baseline\) and our method across 16 countries/regions \(see Table[5](https://arxiv.org/html/2605.22137#A2.T5)in the Appendix for country code mappings\)\.Boldvalues indicates a statistically significantly better model \(p<0\.05p<0\.05\)Table 2:Macro\-averaged accuracy \(%\) on general reasoning benchmarks\.### 4\.1Base Model and Target Languages
We use Llama 3\.1 8B InstructGrattafioriet al\.\([2024](https://arxiv.org/html/2605.22137#bib.bib8)\)\(ℳ\\mathcal\{M\}\) for training data synthesis and fine\-tuning\. We target the 13 languages \(English, Chinese, Spanish, Indonesian, Korean, Greek, Persian, Arabic, Azerbaijani, Sundanese, Assamese, Hausa, and Amharic\) and 16 regions of the BLEnD benchmarkMyunget al\.\([2024](https://arxiv.org/html/2605.22137#bib.bib5)\)\. The pipeline yields 5,007 cultural instances, combined with 1,668 Aya Dataset instances \(6,675 total\)\. Appendix[4](https://arxiv.org/html/2605.22137#A2.T4)details the cultural data distribution\. Hyperparameters are in Appendix[A](https://arxiv.org/html/2605.22137#A1)\.
### 4\.2Evaluation
We evaluate cultural alignment using BLEnDMyunget al\.\([2024](https://arxiv.org/html/2605.22137#bib.bib5)\)\(52\.6k QA pairs across 16 regions in English and local languages\)\. We assess general commonsense and multilingual reasoning preservation using Multilingual HellaSwagLaiet al\.\([2023](https://arxiv.org/html/2605.22137#bib.bib6)\)and Global MMLUSinghet al\.\([2024a](https://arxiv.org/html/2605.22137#bib.bib7)\)\. Significance is reported via paired bootstrap resampling \(p<0\.05p<0\.05\)\.
## 5Results
### 5\.1Main Results
Our method significantly improves English\-setting performance across all 16 regions \(\+5\.03% average\) \(Table[1](https://arxiv.org/html/2605.22137#S4.T1)\)\. However, severe data scarcity \(¡1% of total cultural data\) for low\-resource languages \(Hausa, Assamese, Sundanese\) caused catastrophic forgetting, reducing local\-language performance in these regions\.
General capabilities remain largely intact \(Table[2](https://arxiv.org/html/2605.22137#S4.T2)\), with minor decreases on Multilingual HellaSwag \(\-0\.85%\) and Global MMLU \(\-1\.42%\)\. Per\-language breakdowns are in Appendices[6](https://arxiv.org/html/2605.22137#A2.T6)and[7](https://arxiv.org/html/2605.22137#A2.T7)\.
### 5\.2Ablation Study
Table 3:Ablation results evaluating the impact of the filter and consistency components\. Active components are indicated with a checkmark \(✓\)\. Scores represent averaged accuracy \(%\)\.We perform an ablation study to evaluate the effectiveness of our proposed ground\-truth selection mechanism, which relies on cross\-lingual self\-consistency\. We compare this against a variant where the ground\-truth response is selected randomly\. Additionally, we analyze the impact of filtering the training data to include only the specific languages evaluated in our study\. The results are presented in Table[3](https://arxiv.org/html/2605.22137#S5.T3)\.
First, our findings show that filtering the data to our target languages does not degrade overall performance, while significantly improving training efficiency by reducing the dataset size\. More importantly, removing the self\-consistency module \(i\.e\., using random selection\) results in a substantial performance drop in the English setting compared to our full method\. Despite this degradation, the random\-selection model still outperforms the unaligned base model\. We attribute this underlying performance gain to the critique\-augmented training format, which has been demonstrated to be a crucial component for alignment in prior work\(Fenget al\.,[2025](https://arxiv.org/html/2605.22137#bib.bib1)\)\.
## 6Conclusion
In this work, we propose a self\-supervised framework for improving cross\-lingual cultural alignment in large language models through multilingual self\-consistency\. By leveraging the model’s own responses across languages, our approach identifies reliable knowledge and transfers it from stronger to weaker languages without requiring external annotations or stronger teacher models\. Experimental results on the BLEnD benchmark demonstrate that our method significantly improves cultural alignment, particularly in English settings, while largely preserving general reasoning capabilities\.
Despite these gains, our analysis reveals limitations in low\-resource languages, where data scarcity can lead to performance degradation\. This suggests that while self\-supervision is effective, it remains sensitive to the availability and balance of multilingual data\. Overall, our findings highlight the potential of exploiting latent cross\-lingual knowledge within LLMs to achieve more culturally consistent and equitable behavior\.
## References
- From surveys to narratives: rethinking cultural value adaptation in LLMs\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 18052–18079\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.912/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.912),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p3.1),[§2](https://arxiv.org/html/2605.22137#S2.p3.1)\.
- M\. F\. Azmi, M\. D\. Al Kautsar, A\. F\. Wicaksono, and F\. Koto \(2025\)IndoSafety: culturally grounded safety for LLMs in Indonesian languages\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 9135–9166\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.465/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.465),ISBN 979\-8\-89176\-332\-6Cited by:[§2](https://arxiv.org/html/2605.22137#S2.p2.1)\.
- L\. Bandarkar, D\. Liang, B\. Muller, M\. Artetxe, S\. N\. Shukla, D\. Husa, N\. Goyal, A\. Krishnan, L\. Zettlemoyer, and M\. Khabsa \(2024\)The belebele benchmark: a parallel reading comprehension dataset in 122 language variants\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 749–775\.External Links:[Link](https://aclanthology.org/2024.acl-long.44/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.44)Cited by:[§2](https://arxiv.org/html/2605.22137#S2.p1.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei \(2020\)Language models are few\-shot learners\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 1877–1901\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p1.1)\.
- E\. DURMUS, K\. Nguyen, T\. Liao, N\. Schiefer, A\. Askell, A\. Bakhtin, C\. Chen, Z\. Hatfield\-Dodds, D\. Hernandez, N\. Joseph,et al\.\(2024\)Towards measuring the representation of subjective global opinions in language models\.InFirst Conference on Language Modeling,Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p3.1),[§2](https://arxiv.org/html/2605.22137#S2.p2.1),[§2](https://arxiv.org/html/2605.22137#S2.p3.1)\.
- R\. Feng, S\. Gao, X\. Chen, L\. Chen, and S\. Shang \(2025\)CulFiT: a fine\-grained cultural\-aware LLM training paradigm via multilingual critique data synthesis\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 22413–22430\.External Links:[Link](https://aclanthology.org/2025.acl-long.1092/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1092),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p3.1),[§2](https://arxiv.org/html/2605.22137#S2.p3.1),[§3\.2](https://arxiv.org/html/2605.22137#S3.SS2.SSS0.Px4.p1.6),[§3](https://arxiv.org/html/2605.22137#S3.p1.1),[§5\.2](https://arxiv.org/html/2605.22137#S5.SS2.p2.1)\.
- Y\. Fung, R\. Zhao, J\. Doo, C\. Sun, and H\. Ji \(2024\)Massively multi\-cultural knowledge acquisition & lm benchmarking\.External Links:2402\.09369,[Link](https://arxiv.org/abs/2402.09369)Cited by:[§3\.1](https://arxiv.org/html/2605.22137#S3.SS1.p1.8)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.External Links:2407\.21783Cited by:[§4\.1](https://arxiv.org/html/2605.22137#S4.SS1.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.arXiv preprint arXiv:2106\.09685\.External Links:2106\.09685Cited by:[Appendix A](https://arxiv.org/html/2605.22137#A1.p1.3)\.
- H\. Huang, T\. Tang, D\. Zhang, X\. Zhao, T\. Song, Y\. Xia, and F\. Wei \(2023\)Not all languages are created equal in llms: improving multilingual capability by cross\-lingual\-thought prompting\.InFindings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6\-10, 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Findings of ACL,pp\. 12365–12394\.External Links:[Link](https://doi.org/10.18653/v1/2023.findings-emnlp.826),[Document](https://dx.doi.org/10.18653/V1/2023.FINDINGS-EMNLP.826)Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p1.1)\.
- Y\. Huang, C\. Fan, Y\. Li, S\. Wu, T\. Zhou, X\. Zhang, and L\. Sun \(2024\)1\+1\>\>2: can large language models serve as cross\-lingual knowledge aggregators?\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 13394–13412\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.743/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.743)Cited by:[§2](https://arxiv.org/html/2605.22137#S2.p1.1)\.
- G\. Kovač, M\. Sawayama, R\. Portelas, C\. Colas, P\. F\. Dominey, and P\. Oudeyer \(2023\)Large language models as superpositions of cultural perspectives\.arXiv preprint arXiv:2307\.07870\.Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p3.1),[§2](https://arxiv.org/html/2605.22137#S2.p3.1)\.
- V\. Lai, C\. Nguyen, N\. Ngo, T\. Nguyen, F\. Dernoncourt, R\. Rossi, and T\. Nguyen \(2023\)Okapi: instruction\-tuned large language models in multiple languages with reinforcement learning from human feedback\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,Singapore,pp\. 318–327\.External Links:[Link](https://aclanthology.org/2023.emnlp-demo.28/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-demo.28)Cited by:[§4\.2](https://arxiv.org/html/2605.22137#S4.SS2.p1.1)\.
- C\. Li, M\. Chen, J\. Wang, S\. Sitaram, and X\. Xie \(2024a\)Culturellm: incorporating cultural differences into large language models\.Advances in Neural Information Processing Systems37,pp\. 84799–84838\.Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p3.1),[§2](https://arxiv.org/html/2605.22137#S2.p3.1)\.
- H\. Li, A\. Goel, K\. He, and X\. Ren \(2025\)Attributing culture\-conditioned generations to pretraining corpora\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.22137#S2.p2.1)\.
- H\. Li, L\. Jiang, N\. Dziri, X\. Ren, and Y\. Choi \(2024b\)CULTURE\-gen: revealing global cultural perception in language models through natural language prompting\.InFirst Conference on Language Modeling,Cited by:[§2](https://arxiv.org/html/2605.22137#S2.p2.1)\.
- P\. Mondorf and B\. Plank \(2024\)Beyond accuracy: evaluating the reasoning behavior of large language models – a survey\.External Links:2404\.01869,[Link](https://arxiv.org/abs/2404.01869)Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p1.1)\.
- A\. Mushtaq, R\. Naeem, I\. Taj, I\. Ghaznavi, and J\. Qadir \(2025\)Towards inclusive educational ai: auditing frontier llms for cultural biases through a multiplexity lens\.In2025 IEEE Global Engineering Education Conference \(EDUCON\),pp\. 1–10\.Cited by:[§2](https://arxiv.org/html/2605.22137#S2.p2.1)\.
- J\. Myung, N\. Lee, Y\. Zhou, J\. Jin, R\. A\. Putri, D\. Antypas, H\. Borkakoty, E\. Kim, C\. Perez\-Almendros, A\. A\. Ayele, V\. Gutiérrez\-Basulto, Y\. Ibáñez\-García, H\. Lee, S\. H\. Muhammad, K\. Park, A\. S\. Rzayev, N\. White, S\. M\. Yimam, M\. T\. Pilehvar, N\. Ousidhoum, J\. Camacho\-Collados, and A\. Oh \(2024\)BLEnD: a benchmark for llms on everyday knowledge in diverse cultures and languages\.arXiv preprint arXiv:2406\.09948\.Note:NeurIPS 2024 Datasets & Benchmark TrackCited by:[§1](https://arxiv.org/html/2605.22137#S1.p2.1),[§2](https://arxiv.org/html/2605.22137#S2.p1.1),[§4\.1](https://arxiv.org/html/2605.22137#S4.SS1.p1.1),[§4\.2](https://arxiv.org/html/2605.22137#S4.SS2.p1.1)\.
- T\. Naous, M\. J\. Ryan, A\. Ritter, and W\. Xu \(2024\)Having beer after prayer? measuring cultural bias in large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 16366–16393\.External Links:[Link](https://aclanthology.org/2024.acl-long.862/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.862)Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p1.1)\.
- T\. Nguyen, S\. Razniewski, A\. Varde, and G\. Weikum \(2023\)Extracting cultural commonsense knowledge at scale\.InProceedings of the ACM Web Conference 2023,WWW ’23,New York, NY, USA,pp\. 1907–1917\.External Links:ISBN 9781450394161,[Link](https://doi.org/10.1145/3543507.3583535),[Document](https://dx.doi.org/10.1145/3543507.3583535)Cited by:[§3\.1](https://arxiv.org/html/2605.22137#S3.SS1.p1.8)\.
- E\. M\. Ponti, G\. Glavaš, O\. Majewska, Q\. Liu, I\. Vulić, and A\. Korhonen \(2020\)XCOPA: a multilingual dataset for causal commonsense reasoning\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 2362–2376\.External Links:[Link](https://aclanthology.org/2020.emnlp-main.185/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.185)Cited by:[§2](https://arxiv.org/html/2605.22137#S2.p1.1)\.
- L\. Schut, Y\. Gal, and S\. Farquhar \(2025\)Do multilingual llms think in english?\.InICLR 2025 Workshop on Building Trust in Language Models and Applications,Cited by:[§2](https://arxiv.org/html/2605.22137#S2.p1.1)\.
- W\. Shi, R\. Li, Y\. Zhang, C\. Ziems, S\. Yu, R\. Horesh, R\. A\. D\. Paula, and D\. Yang \(2024\)CultureBank: an online community\-driven knowledge base towards culturally aware language technologies\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 4996–5025\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.288/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.288)Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p3.1),[§2](https://arxiv.org/html/2605.22137#S2.p3.1)\.
- S\. Singh, A\. Romanou, C\. Fourrier, D\. I\. Adelani, J\. G\. Ngui, D\. Vila\-Suero, P\. Limkonchotiwat, K\. Marchisio, W\. Q\. Leong, Y\. Susanto, R\. Ng, S\. Longpre, W\. Ko, M\. Smith, A\. Bosselut, A\. Oh, A\. F\. T\. Martins, L\. Choshen, D\. Ippolito, E\. Ferrante, M\. Fadaee, and B\. Ermis \(2024a\)Global mmlu: understanding and addressing cultural and linguistic biases in multilingual evaluation\.arXiv preprint arXiv:2412\.03304\.Cited by:[§4\.2](https://arxiv.org/html/2605.22137#S4.SS2.p1.1)\.
- S\. Singh, F\. Vargus, D\. D’souza, B\. F\. Karlsson, A\. Mahendiran, W\. Ko, H\. Shandilya, J\. Patel, D\. Mataciunas, L\. O’Mahony, M\. Zhang, R\. Hettiarachchi, J\. Wilson, M\. Machado, L\. Moura, D\. Krzemiński, H\. Fadaei, I\. Ergun, I\. Okoh, A\. Alaagib, O\. Mudannayake, Z\. Alyafeai, V\. Chien, S\. Ruder, S\. Guthikonda, E\. Alghamdi, S\. Gehrmann, N\. Muennighoff, M\. Bartolo, J\. Kreutzer, A\. Üstün, M\. Fadaee, and S\. Hooker \(2024b\)Aya dataset: an open\-access collection for multilingual instruction tuning\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 11521–11567\.External Links:[Link](https://aclanthology.org/2024.acl-long.620/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.620)Cited by:[§3\.2](https://arxiv.org/html/2605.22137#S3.SS2.SSS0.Px5.p1.1)\.
- W\. Wang, W\. Jiao, J\. Huang, R\. Dai, J\. Huang, Z\. Tu, and M\. Lyu \(2024\)Not all countries celebrate thanksgiving: on the cultural dominance in large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 6349–6384\.External Links:[Link](https://aclanthology.org/2024.acl-long.345/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.345)Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p3.1),[§2](https://arxiv.org/html/2605.22137#S2.p3.1)\.
- Y\. Wang, Z\. Fan, Q\. Wang, Y\. R\. Fung, and H\. Ji \(2025\)CALM: unleashing the cross\-lingual self\-aligning ability of language model question answering\.InFindings of the Association for Computational Linguistics: NAACL 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 2809–2817\.External Links:[Link](https://aclanthology.org/2025.findings-naacl.152/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.152),ISBN 979\-8\-89176\-195\-7Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p3.1),[§2](https://arxiv.org/html/2605.22137#S2.p3.1)\.
- Z\. Weihua, X\. Huang, Z\. Liu, T\. K\. Vangani, B\. Zou, X\. Tao, Y\. Wu, A\. Aw, N\. F\. Chen, and R\. K\. Lee \(2026\)AdaMCoT: rethinking cross\-lingual factual reasoning through adaptive multilingual chain\-of\-thought\.Proceedings of the AAAI Conference on Artificial Intelligence40\(40\),pp\. 33863–33871\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/40678),[Document](https://dx.doi.org/10.1609/aaai.v40i40.40678)Cited by:[§2](https://arxiv.org/html/2605.22137#S2.p1.1)\.
- C\. Wendler, V\. Veselovsky, G\. Monea, and R\. West \(2024\)Do llamas work in English? on the latent language of multilingual transformers\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15366–15394\.External Links:[Link](https://aclanthology.org/2024.acl-long.820/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.820)Cited by:[§2](https://arxiv.org/html/2605.22137#S2.p1.1)\.
- B\. Workshop, :, T\. L\. Scao, A\. Fan, C\. Akiki, E\. Pavlick, S\. Ilić, D\. Hesslow, R\. Castagné, A\. S\. Luccioni, F\. Yvon, M\. Gallé, J\. Tow, A\. M\. Rush, S\. Biderman, A\. Webson, P\. S\. Ammanamanchi, T\. Wang, B\. Sagot, N\. Muennighoff, A\. V\. del Moral, O\. Ruwase, R\. Bawden, S\. Bekman, A\. McMillan\-Major, I\. Beltagy, H\. Nguyen, L\. Saulnier, S\. Tan, P\. O\. Suarez, V\. Sanh, H\. Laurençon, Y\. Jernite, J\. Launay, M\. Mitchell, C\. Raffel, A\. Gokaslan, A\. Simhi, A\. Soroa, A\. F\. Aji, A\. Alfassy, A\. Rogers, A\. K\. Nitzav, C\. Xu, C\. Mou, C\. Emezue, C\. Klamm, C\. Leong, D\. van Strien, D\. I\. Adelani, D\. Radev, E\. G\. Ponferrada, E\. Levkovizh, E\. Kim, E\. B\. Natan, F\. D\. Toni, G\. Dupont, G\. Kruszewski, G\. Pistilli, H\. Elsahar, H\. Benyamina, H\. Tran, I\. Yu, I\. Abdulmumin, I\. Johnson, I\. Gonzalez\-Dios, J\. de la Rosa, J\. Chim, J\. Dodge, J\. Zhu, J\. Chang, J\. Frohberg, J\. Tobing, J\. Bhattacharjee, K\. Almubarak, K\. Chen, K\. Lo, L\. V\. Werra, L\. Weber, L\. Phan, L\. B\. allal, L\. Tanguy, M\. Dey, M\. R\. Muñoz, M\. Masoud, M\. Grandury, M\. Šaško, M\. Huang, M\. Coavoux, M\. Singh, M\. T\. Jiang, M\. C\. Vu, M\. A\. Jauhar, M\. Ghaleb, N\. Subramani, N\. Kassner, N\. Khamis, O\. Nguyen, O\. Espejel, O\. de Gibert, P\. Villegas, P\. Henderson, P\. Colombo, P\. Amuok, Q\. Lhoest, R\. Harliman, R\. Bommasani, R\. L\. López, R\. Ribeiro, S\. Osei, S\. Pyysalo, S\. Nagel, S\. Bose, S\. H\. Muhammad, S\. Sharma, S\. Longpre, S\. Nikpoor, S\. Silberberg, S\. Pai, S\. Zink, T\. T\. Torrent, T\. Schick, T\. Thrush, V\. Danchev, V\. Nikoulina, V\. Laippala, V\. Lepercq, V\. Prabhu, Z\. Alyafeai, Z\. Talat, A\. Raja, B\. Heinzerling, C\. Si, D\. E\. Taşar, E\. Salesky, S\. J\. Mielke, W\. Y\. Lee, A\. Sharma, A\. Santilli, A\. Chaffin, A\. Stiegler, D\. Datta, E\. Szczechla, G\. Chhablani, H\. Wang, H\. Pandey, H\. Strobelt, J\. A\. Fries, J\. Rozen, L\. Gao, L\. Sutawika, M\. S\. Bari, M\. S\. Al\-shaibani, M\. Manica, N\. Nayak, R\. Teehan, S\. Albanie, S\. Shen, S\. Ben\-David, S\. H\. Bach, T\. Kim, T\. Bers, T\. Fevry, T\. Neeraj, U\. Thakker, V\. Raunak, X\. Tang, Z\. Yong, Z\. Sun, S\. Brody, Y\. Uri, H\. Tojarieh, A\. Roberts, H\. W\. Chung, J\. Tae, J\. Phang, O\. Press, C\. Li, D\. Narayanan, H\. Bourfoune, J\. Casper, J\. Rasley, M\. Ryabinin, M\. Mishra, M\. Zhang, M\. Shoeybi, M\. Peyrounette, N\. Patry, N\. Tazi, O\. Sanseviero, P\. von Platen, P\. Cornette, P\. F\. Lavallée, R\. Lacroix, S\. Rajbhandari, S\. Gandhi, S\. Smith, S\. Requena, S\. Patil, T\. Dettmers, A\. Baruwa, A\. Singh, A\. Cheveleva, A\. Ligozat, A\. Subramonian, A\. Névéol, C\. Lovering, D\. Garrette, D\. Tunuguntla, E\. Reiter, E\. Taktasheva, E\. Voloshina, E\. Bogdanov, G\. I\. Winata, H\. Schoelkopf, J\. Kalo, J\. Novikova, J\. Z\. Forde, J\. Clive, J\. Kasai, K\. Kawamura, L\. Hazan, M\. Carpuat, M\. Clinciu, N\. Kim, N\. Cheng, O\. Serikov, O\. Antverg, O\. van der Wal, R\. Zhang, R\. Zhang, S\. Gehrmann, S\. Mirkin, S\. Pais, T\. Shavrina, T\. Scialom, T\. Yun, T\. Limisiewicz, V\. Rieser, V\. Protasov, V\. Mikhailov, Y\. Pruksachatkun, Y\. Belinkov, Z\. Bamberger, Z\. Kasner, A\. Rueda, A\. Pestana, A\. Feizpour, A\. Khan, A\. Faranak, A\. Santos, A\. Hevia, A\. Unldreaj, A\. Aghagol, A\. Abdollahi, A\. Tammour, A\. HajiHosseini, B\. Behroozi, B\. Ajibade, B\. Saxena, C\. M\. Ferrandis, D\. McDuff, D\. Contractor, D\. Lansky, D\. David, D\. Kiela, D\. A\. Nguyen, E\. Tan, E\. Baylor, E\. Ozoani, F\. Mirza, F\. Ononiwu, H\. Rezanejad, H\. Jones, I\. Bhattacharya, I\. Solaiman, I\. Sedenko, I\. Nejadgholi, J\. Passmore, J\. Seltzer, J\. B\. Sanz, L\. Dutra, M\. Samagaio, M\. Elbadri, M\. Mieskes, M\. Gerchick, M\. Akinlolu, M\. McKenna, M\. Qiu, M\. Ghauri, M\. Burynok, N\. Abrar, N\. Rajani, N\. Elkott, N\. Fahmy, O\. Samuel, R\. An, R\. Kromann, R\. Hao, S\. Alizadeh, S\. Shubber, S\. Wang, S\. Roy, S\. Viguier, T\. Le, T\. Oyebade, T\. Le, Y\. Yang, Z\. Nguyen, A\. R\. Kashyap, A\. Palasciano, A\. Callahan, A\. Shukla, A\. Miranda\-Escalada, A\. Singh, B\. Beilharz, B\. Wang, C\. Brito, C\. Zhou, C\. Jain, C\. Xu, C\. Fourrier, D\. L\. Periñán, D\. Molano, D\. Yu, E\. Manjavacas, F\. Barth, F\. Fuhrimann, G\. Altay, G\. Bayrak, G\. Burns, H\. U\. Vrabec, I\. Bello, I\. Dash, J\. Kang, J\. Giorgi, J\. Golde, J\. D\. Posada, K\. R\. Sivaraman, L\. Bulchandani, L\. Liu, L\. Shinzato, M\. H\. de Bykhovetz, M\. Takeuchi, M\. Pàmies, M\. A\. Castillo, M\. Nezhurina, M\. Sänger, M\. Samwald, M\. Cullan, M\. Weinberg, M\. D\. Wolf, M\. Mihaljcic, M\. Liu, M\. Freidank, M\. Kang, N\. Seelam, N\. Dahlberg, N\. M\. Broad, N\. Muellner, P\. Fung, P\. Haller, R\. Chandrasekhar, R\. Eisenberg, R\. Martin, R\. Canalli, R\. Su, R\. Su, S\. Cahyawijaya, S\. Garda, S\. S\. Deshmukh, S\. Mishra, S\. Kiblawi, S\. Ott, S\. Sang\-aroonsiri, S\. Kumar, S\. Schweter, S\. Bharati, T\. Laud, T\. Gigant, T\. Kainuma, W\. Kusa, Y\. Labrak, Y\. S\. Bajaj, Y\. Venkatraman, Y\. Xu, Y\. Xu, Y\. Xu, Z\. Tan, Z\. Xie, Z\. Ye, M\. Bras, Y\. Belkada, and T\. Wolf \(2023\)BLOOM: a 176b\-parameter open\-access multilingual language model\.External Links:2211\.05100,[Link](https://arxiv.org/abs/2211.05100)Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p1.1)\.
- W\. Xuan, R\. Yang, H\. Qi, Q\. Zeng, Y\. Xiao, A\. Feng, D\. Liu, Y\. Xing, J\. Wang, F\. Gao, J\. Lu, Y\. Jiang, H\. Li, X\. Li, K\. Yu, R\. Dong, S\. Gu, Y\. Li, X\. Xie, F\. Juefei\-Xu, F\. Khomh, O\. Yoshie, Q\. Chen, D\. Teodoro, N\. Liu, R\. Goebel, L\. Ma, E\. Marrese\-Taylor, S\. Lu, Y\. Iwasawa, Y\. Matsuo, and I\. Li \(2025\)MMLU\-ProX: a multilingual benchmark for advanced large language model evaluation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 1513–1532\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.79/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.79),ISBN 979\-8\-89176\-332\-6Cited by:[§2](https://arxiv.org/html/2605.22137#S2.p1.1)\.
- J\. Ying, W\. Tang, Y\. Zhao, Y\. Cao, Y\. Rong, and W\. Zhang \(2025\)Disentangling language and culture for evaluating multilingual large language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 22230–22251\.External Links:[Link](https://aclanthology.org/2025.acl-long.1082/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1082),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2605.22137#S1.p2.1)\.
- X\. Zhang, Y\. Liang, F\. Meng, S\. Zhang, Y\. Chen, J\. Xu, and J\. Zhou \(2025a\)CM\-align: consistency\-based multilingual alignment for large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 25689–25702\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.1401/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1401),ISBN 979\-8\-89176\-335\-7Cited by:[§2](https://arxiv.org/html/2605.22137#S2.p3.1)\.
- Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, F\. Huang, and J\. Zhou \(2025b\)Qwen3 embedding: advancing text embedding and reranking through foundation models\.arXiv preprint arXiv:2506\.05176\.Cited by:[§3\.2](https://arxiv.org/html/2605.22137#S3.SS2.SSS0.Px2.p1.3)\.
## Appendix ATraining Details
We fine\-tuneℳ\\mathcal\{M\}using LoRAHuet al\.\([2022](https://arxiv.org/html/2605.22137#bib.bib9)\)applied to all linear layers with rankr=16r=16\. Training is conducted with a learning rate of1×10−51\\times 10^\{\-5\}using a cosine learning rate scheduler with a warmup ratio of 0\.1\. We use a per\-device batch size of 2 with no gradient accumulation with 8 H200 GPUs, and train for 1,000 steps with a maximum sequence length of 4,096 tokens\. Training is performed in bfloat16 precision\.
## Appendix BMulti\-Turn Dialogue Format
Each training instance is structured as a six\-turn dialogue between a user and an assistant\. The cross\-entropy loss during supervised fine\-tuning is computed exclusively over the assistant turns \(turns 2, 4, and 6\), with all user turns masked\.
1. 1\.User:qkλ−q\_\{k\}^\{\\lambda^\{\-\}\}\(culturally grounded question\)
2. 2\.Assistant:mkm\_\{k\}\(sampled answer in weaker language\)
3. 3\.User:Critique request \(prompt to identify errors\)
4. 4\.Assistant:ckc\_\{k\}\(critique ofmkm\_\{k\}\)
5. 5\.User:Refinement request \(prompt to correct the answer\)
6. 6\.Assistant:g^kλ−\\hat\{g\}\_\{k\}^\{\\lambda^\{\-\}\}\(translated ground truth\)
This structure guides the model through a self\-reflective reasoning process: first reproducing a potentially flawed response, then identifying its shortcomings, and finally producing a corrected answer grounded in cross\-lingual consensus\.
Table 4:Distribution of synthesized cultural training instances across regions, sorted by total count\.Table 5:Country/region codes used in Table[1](https://arxiv.org/html/2605.22137#S4.T1)\.Table 6:Multilingual HellaSwag per\-language accuracy \(%\)\.Table 7:Global MMLU per\-language accuracy \(%\)\.Similar Articles
When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models
This paper introduces CulturalNB, a dataset of Bengali cultural question-answer pairs, and evaluates nine LLMs for cross-lingual cultural bias. Findings show that English prompting increases global narrative substitution and reduces local perspectives, revealing that cultural failures in LLMs are grounding and prioritization issues, not just missing knowledge.
CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations
This paper introduces CroCo, a method for cross-lingual contrastive preference tuning on self-generated responses, showing that a reward model trained on English preferences can effectively rank responses in other languages, improving model performance across 14 languages without language-specific annotations.
AlignCultura: Towards Culturally Aligned Large Language Models?
AlignCultura introduces CulturaX, a UNESCO-grounded dataset and two-stage pipeline for culturally aligning LLMs, showing 4–6 % HHH gains and 18 % fewer cultural failures on Qwen3-8B and DeepSeek-R1-Distill-Qwen-7B.
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
This paper proposes two new metrics—Knowledge Separability Score (KSS) and Knowledge Persistence Score (KPS)—to evaluate cross-linguistic information removal in multilingual machine unlearning for LLMs, addressing shortcomings of prior per-language evaluation protocols.
DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge
This paper presents the DFKI-MLT system for SemEval-2026 Task 7 on cultural awareness, which applies activation steering to multilingual LLMs using language vectors from parallel FLORES data. The system achieved 86.96% accuracy in the MCQ track, ranking 7th out of 17 teams, and post-hoc analyses reveal that gains are layer-sensitive and vary across language-region pairs.