When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models

arXiv cs.CL 06/01/26, 04:00 AM Papers

Summary

This paper introduces CulturalNB, a dataset of Bengali cultural question-answer pairs, and evaluates nine LLMs for cross-lingual cultural bias. Findings show that English prompting increases global narrative substitution and reduces local perspectives, revealing that cultural failures in LLMs are grounding and prioritization issues, not just missing knowledge.

arXiv:2605.30481v1 Announce Type: new Abstract: Large language models (LLMs) are widely used as cross-lingual knowledge interfaces. However, culturally grounded questions often reflect globally dominant narratives rather than local contexts. We study this failure mode as global narrative dominance in Bangla, a low-resource cultural context. We introduce CulturalNB, a dataset of 717 manually curated Bengali cultural instances with parallel Bangla–English question–answer pairs and supporting evidence, metadata, and sociocultural annotations. Using question-only and evidence-based prompting, we evaluate nine state-of-the-art LLMs with human and two independent LLM judges across metrics for cross-lingual consistency, language anchoring, global substitution, institutional bias, and epistemic perspective coverage. Results show that questions asked in English systematically increase global substitution and institutional framing while reducing local perspective coverage. Local evidence improves factual consistency and perspective coverage, but does not eliminate language-induced epistemic shifts. These findings suggest that cultural failures in LLMs are not only missing-knowledge errors but also failures of grounding and narrative prioritization.

Cached at: 06/01/26, 09:23 AM

# When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models
Source: [https://arxiv.org/html/2605.30481](https://arxiv.org/html/2605.30481)
Md Arid Hasan¹, Ruwad Naswan², Farhan Samir¹, Sharifa Sultana³, Syed Ishtiaque Ahmed¹

¹University of Toronto, ²BUET, ³University of Illinois Urbana\-Champaign

\{arid, ishtiaque\}@cs\.toronto\.edu

###### Abstract

Large language models \(LLMs\) are widely used as cross\-lingual knowledge interfaces\. However, culturally grounded questions often reflect globally dominant narratives rather than local contexts\. We study this failure mode as global narrative dominance in Bangla, a low\-resource cultural context\. We introduce CulturalNB, a dataset of 717 manually curated Bengali cultural instances with parallel Bangla–English question–answer pairs and supporting evidence, metadata, and sociocultural annotations\. Using question\-only and evidence\-based prompting, we evaluate nine state\-of\-the\-art LLMs with human and two independent LLM judges across metrics for cross\-lingual consistency, language anchoring, global substitution, institutional bias, and epistemic perspective coverage\. Results show that questions asked in English systematically increase global substitution and institutional framing while reducing local perspective coverage\. Local evidence improves factual consistency and perspective coverage, but does not eliminate language\-induced epistemic shifts\. These findings suggest that cultural failures in LLMs are not only missing\-knowledge errors but also failures of grounding and narrative prioritization\.

![[Uncaptioned image]](https://arxiv.org/html/2605.30481v1/x1.png)

Figure 1: Example from CulturalNB illustrating global narrative dominance and evaluation dimensions\. The figure shows a culturally grounded question, a translated English question, and the responses \(both Bangla and English\) generated by GPT\-5\.4\. Highlighted responses represent narrative dominance\.

## 1 Introduction

Large language models \(LLMs\) increasingly mediate access to knowledge across languages and cultures\. However, their cultural competence remains uneven: knowledge about low\-resource communities is often less reliable, less grounded, and more sensitive to prompt language than knowledge about globally dominant contexts \(Joshi et al\., [2020](https://arxiv.org/html/2605.30481#bib.bib15); Bender et al\., [2021](https://arxiv.org/html/2605.30481#bib.bib3); Blodgett et al\., [2020](https://arxiv.org/html/2605.30481#bib.bib4)\)\. This is especially consequential when users ask culturally grounded questions whose correct interpretation depends on local history, institutions, practices, or epistemic traditions \(Caughman et al\., [2026](https://arxiv.org/html/2605.30481#bib.bib5); Younas and Zeng, [2026](https://arxiv.org/html/2605.30481#bib.bib32)\)\. In such cases, an answer can be fluent and fact\-like while still replacing a local interpretation with a globally dominant one\.

Prior work shows that multilingual LLMs exhibit cultural knowledge gaps, cross\-lingual inconsistency, and globally dominant biases \(Rystrøm et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib23); Cecilia Liu et al\., [2024](https://arxiv.org/html/2605.30481#bib.bib6); Naous et al\., [2024](https://arxiv.org/html/2605.30481#bib.bib18); Wang et al\., [2024](https://arxiv.org/html/2605.30481#bib.bib29)\)\. Models often answer culture\-specific questions more accurately in English than in the corresponding local language \(Tanwar et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib26)\), while larger models improve factual accuracy more than cross\-lingual consistency \(Qi et al\., [2023](https://arxiv.org/html/2605.30481#bib.bib21)\)\. Broader evaluations also demonstrate that culturally specific benchmarks alter model rankings and that region\-implicit questions remain particularly challenging \(Singh et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib25); Romanou et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib22)\)\. These findings suggest that cultural knowledge in LLMs is shaped not only by data scarcity but also by unequal distributions of language, visibility, and authority\.

However, existing work leaves three gaps\. First, most studies measure whether models know culturally specific facts \(Guo et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib11)\), but not whether interpretations remain consistent across languages\. Second, few evaluations distinguish missing knowledge from cases where local knowledge is available \(Hupkes and Bogoychev, [2025](https://arxiv.org/html/2605.30481#bib.bib14)\) but is overridden by global priors, although errors can arise from inference\-time biases as well as from the absence of knowledge \(Yu et al\., [2024](https://arxiv.org/html/2605.30481#bib.bib33)\)\. Third, prior evaluations rarely test whether providing local evidence is sufficient to disrupt globally dominant narratives \(Nguyen et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib19); Wan et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib27)\), even though evidence\-grounded reasoning remains unreliable \(Feng et al\., [2024](https://arxiv.org/html/2605.30481#bib.bib9); Shao et al\., [2026](https://arxiv.org/html/2605.30481#bib.bib24)\)\.

We address these gaps through the lens of global narrative dominance \(GND\)\. We operationalize GND as the replacement, abstraction, or reframing of a culturally localized referent with a more globally prevalent, institutionally standardized, or high\-frequency alternative\. We introduce CulturalNB, a Bengali culture\-focused dataset of 717 manually curated instances across five domains\. Each instance includes a culturally grounded question–answer \(QA\) pair, supporting evidence, a context\-preserving English translation of the QA pair and evidence, source metadata, and sociocultural annotations\. We evaluate nine state\-of\-the\-art LLMs using CulturalNB in two settings: question\-only, which exposes behavior under knowledge gaps, and evidence\-based, which tests whether errors persist when relevant local evidence is provided\. Each item is prompted in both Bangla and English to enable counterfactual measurement of language\-induced shifts, as shown in Figure [1](https://arxiv.org/html/2605.30481#S0.F1)\.

We evaluate responses with a human judge and two independent LLM judges using five metrics: cross\-lingual factual consistency, language anchor bias, global substitution rate, institutional bias rate, and epistemic perspective coverage\. These metrics assess factual consistency, global substitution, institutional framing, and diversity of local perspectives\. Our results show that state\-of\-the\-art LLMs are not culturally or cross\-lingually stable\. English prompts frequently reduce local grounding, increase global substitution, and encourage institutionally dominant framings\. Providing local evidence improves several metrics for several models, particularly perspective coverage and factual consistency, but it does not eliminate language\-induced shifts\. Our contributions are as follows:

- We introduce CulturalNB, a Bengali culture\-focused dataset of 717 manually curated instances across five domains\.
- We design a parallel Bangla–English evaluation setup to measure how prompt language changes model behavior while preserving the same cultural content\.
- We use question\-only and evidence\-based prompting to distinguish missing\-knowledge failures from failures of grounding and narrative prioritization\.
- We evaluate nine LLMs using a human and two LLM\-based judges across five metrics\.
- We find that English prompts systematically favor global and institutional interpretations, while local evidence improves factual grounding but does not fully eliminate language\-conditioned framing shifts\.

## 2 Related Work

### 2\.1 Cultural Knowledge in Multilingual LLMs

Recent work shows that LLMs encode cultural knowledge unevenly across languages and regions \(Pawar et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib20)\)\. Benchmark\-based studies such as BLEND \(Myung et al\., [2024](https://arxiv.org/html/2605.30481#bib.bib17)\), CaLMQA \(Arora et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib1)\), and MultiNativQA \(Hasan et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib12)\) find that model performance is not only lower for low\-resource languages but also unstable across cultural contexts\. Tanwar et al\. \([2025](https://arxiv.org/html/2605.30481#bib.bib26)\) show that models often answer questions about a culture more accurately in English than in the culture’s native language, suggesting that failures arise from weak cross\-lingual knowledge transfer rather than data scarcity alone\. Similarly, Qi et al\. \([2023](https://arxiv.org/html/2605.30481#bib.bib21)\) find that increasing model scale improves factual accuracy but does not reliably improve cross\-lingual consistency\. These findings indicate that multilingual competence and factual competence do not necessarily imply culturally stable reasoning\.

Broader multilingual evaluations reinforce this pattern\. Singh et al\. \([2025](https://arxiv.org/html/2605.30481#bib.bib25)\) show that a substantial portion of MMLU questions require culturally specific knowledge and that model rankings change when evaluated on this subset\. Romanou et al\. \([2025](https://arxiv.org/html/2605.30481#bib.bib22)\) further show that LLMs fail disproportionately on region\-implicit questions from local examinations, where cultural grounding is required but not explicitly marked\. These studies demonstrate that standard benchmarks often obscure cultural dependence by treating knowledge as culturally neutral\.

### 2\.2 Cultural Bias and Global Dominance

Recent work argues that LLMs tend to privilege globally dominant, often Western\-centric, perspectives\. Naous et al\. \([2024](https://arxiv.org/html/2605.30481#bib.bib18)\) show that Western\-centric bias can appear even in Arabic\-only models, tracing part of the bias to the composition of Arabic Wikipedia itself\. This suggests that using a non\-English language does not automatically guarantee locally grounded knowledge \(Zhang et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib34); Bang et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib2)\)\. Wang et al\. \([2024](https://arxiv.org/html/2605.30481#bib.bib29)\) similarly find that GPT\-4 exhibits strong cultural dominance despite its high overall capability, indicating that scale alone does not eliminate cultural bias\.

These observations connect to broader concerns about the fairness of NLP and representational harm\. Blodgett et al\. \([2020](https://arxiv.org/html/2605.30481#bib.bib4)\) argue that language technologies can reproduce social hierarchies when harms are poorly specified, while Bender et al\. \([2021](https://arxiv.org/html/2605.30481#bib.bib3)\) emphasize that large\-scale training data can amplify dominant perspectives embedded in web text\. In multilingual settings, Joshi et al\. \([2020](https://arxiv.org/html/2605.30481#bib.bib15)\) show that language technologies remain highly uneven across the world’s languages, with low\-resource communities receiving weaker support\. Gallegos et al\. \([2024](https://arxiv.org/html/2605.30481#bib.bib10)\) further note that multilingual fairness lacks shared definitions and evaluation standards, making it difficult to compare bias findings across languages and cultures\.

### 2\.3 Cross\-lingual Consistency and Language as an Anchor

Prior multilingual benchmarks such as XNLI \(Conneau et al\., [2018](https://arxiv.org/html/2605.30481#bib.bib8)\) and XTREME \(Hu et al\., [2020](https://arxiv.org/html/2605.30481#bib.bib13)\) evaluate cross\-lingual transfer, but strong multilingual performance does not guarantee that models preserve the same factual or cultural interpretations across languages \(Ying et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib31)\)\. Recent work shows that prompt language acts as an epistemic conditioning signal, shaping both retrieved knowledge and cultural assumptions in large language models \(Wang et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib28); Qi et al\., [2023](https://arxiv.org/html/2605.30481#bib.bib21); Tanwar et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib26)\)\. This effect is especially evident in low\-resource settings, where English prompts often elicit globally dominant narratives, while local languages surface more regionally grounded perspectives\. However, prior studies largely focus on performance gaps or general cross\-lingual inconsistency, without isolating whether language choice induces systematic, directional shifts toward globally dominant interpretations\. The mechanism of cultural prioritization in model outputs thus remains underexplored\. In this work, we address this gap by testing whether Bangla and English prompts produce semantically consistent factual claims, and whether English systematically biases outputs toward globally dominant narratives, treating cross\-lingual divergence as a signal of cultural dominance\.

### 2\.4 Knowledge Gaps, Hallucination, and Evidence Use

Recent work distinguishes failures caused by missing knowledge from those arising when inference\-time priors override stored knowledge\. Yu et al\. \([2024](https://arxiv.org/html/2605.30481#bib.bib33)\) show that models may hallucinate even when relevant knowledge is present, a distinction that is central to our setting: models may possess local cultural knowledge, yet generate globally dominant answers when the prompt language activates stronger global priors\. Research on abstention and uncertainty further suggests that these epistemic failures are culturally uneven \(Clark et al\., [2025](https://arxiv.org/html/2605.30481#bib.bib7); Yadkori et al\., [2024](https://arxiv.org/html/2605.30481#bib.bib30)\)\. Feng et al\. \([2024](https://arxiv.org/html/2605.30481#bib.bib9)\) find that models are less reliable at abstaining on questions about African and Asian countries, while Shao et al\. \([2026](https://arxiv.org/html/2605.30481#bib.bib24)\) show that reinforcement learning can improve surface accuracy without strengthening evidence\-based reasoning\. These findings emphasize the importance of evidence\-based evaluation for distinguishing genuine knowledge gaps from failures of grounding and prioritization\.

Existing work shows that LLMs exhibit cultural knowledge gaps, cross\-lingual inconsistency, and globally dominant biases\. However, the conditions under which these failures persist despite access to relevant local information remain underexplored\. We address this through a controlled intervention with question\-only and evidence\-based prompting to test whether local evidence can correct or override dominant priors\. We further use counterfactual Bangla–English prompting to examine whether language alone shifts factual claims, authority framing, and epistemic coverage\. This enables us to identify when global narrative dominance persists and when it breaks down\.

## 3 Dataset

We construct a Bengali culture\-focused dataset, CulturalNB, to examine culturally grounded knowledge in a low\-resource and historically marginalized cultural context\. Although the dataset is centered on the Bangla language and culturally grounded Bengali content, each instance is paired with an equivalent English translation to enable controlled cross\-lingual evaluation\. This parallel design allows us to compare model performance and narrative framing in Bangla and English while preserving the cultural specificity of the original content\. This section describes the data collection and annotation processes, including the sources of the records and the procedures used to structure and preprocess the data to ensure annotation quality and consistency\.

### 3\.1 Data Collection

We collected data across five culturally grounded domains: History & Politics; Religion & Mythology; Traditional Medicine & Ecology; Geography & National Identity; and Art, Literature, & Cultural Practices\. These domains were selected to capture locally situated knowledge, including interpretations of historical events, regional belief systems, indigenous ecological practices, and culturally specific social traditions that may be underrepresented or framed differently in English\-centric corpora\. Moreover, to ensure diversity and contextual grounding, we manually collected data from multiple sources, including Bangla Wikipedia pages, regional archival repositories and encyclopedias, and cultural media sources such as folk literature, proverbs, and oral transcripts\. When available, source URLs and archival references are preserved to maintain transparency and traceability\.

![Refer to caption](https://arxiv.org/html/2605.30481v1/x2.png)

Figure 2: Source distribution of the CulturalNB dataset\.

Through this manual collection process, we reviewed eight books on Bengali culture, along with a wide range of news articles, Wikipedia pages, and regional archival sources, and compiled a dataset of 717 culturally grounded instances\. Each instance consists of a culturally grounded question–answer pair, a supporting passage or transcription, domain, source type, context, and source URL \(if available\)\. This structured representation enables systematic analysis of culturally specific knowledge across domains while maintaining clear provenance and metadata documentation\.

Figure [2](https://arxiv.org/html/2605.30481#S3.F2) presents the source distribution of CulturalNB\. Local books constitute the largest share \(32\.4%\), followed by Wikipedia \(26\.2%\) and local encyclopedias \(13\.7%\), with additional contributions from oral traditions \(12\.4%\) and news \(10\.9%\)\. Smaller shares come from cultural media \(3\.5%\) and government documents \(1%\), reflecting a diverse mix of institutional and community sources\. Detailed analysis is provided in Appendix [B](https://arxiv.org/html/2605.30481#A2)\.

### 3\.2 Manual Annotation

The annotation was conducted by three bilingual annotators with native Bangla and near\-native English proficiency and strong familiarity with Bangladeshi cultural, historical, and social contexts\. All annotators had at least an undergraduate\-level education and were compensated at standard rates\.

Two annotators performed the primary tasks: context\-preserving translation and sociocultural categorization\. Each annotated about 360 items, with each item taking approximately 4–6 minutes, for a total workload of 80–100 hours completed over two weeks\. A third annotator served as an expert validator, reviewing all translations and labels, resolving inconsistencies, and correcting errors\. Annotators followed the guidelines in Appendix [A](https://arxiv.org/html/2605.30481#A1) and were prohibited from using machine translation, LLMs, or other AI tools\. This multi\-stage process ensured that translations preserved culturally grounded meanings and that labels were applied consistently\.

##### Annotation Quality

An expert validator reviewed all 717 annotated instances and corrected translations and labels when necessary\. For the translation task, 29 questions \(4\.04%\), 57 answers \(7\.95%\), and 105 contextual passages \(14\.64%\) were revised\. The higher correction rate for contextual passages is expected, as passages were typically longer and contained richer cultural and institutional references that required careful preservation in English translation\. We also measured inter\-annotator agreement using Cohen’s κ for all three categorical annotation dimensions\. Agreement was consistently in the almost perfect range according to the interpretation of Landis and Koch \([1977](https://arxiv.org/html/2605.30481#bib.bib16)\)\. Annotators achieved 94\.14%, 92\.61%, and 94\.70% agreement with κ of 0\.91, 0\.86, and 0\.93 for Knowledge Frequency, Epistemic Status, and Validation Type, respectively\. These results indicate that the annotation guidelines were clear and that annotators applied the sociocultural categories with high consistency across all dimensions\.
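As a rough illustration of how raw agreement and Cohen’s κ relate, the sketch below computes both for two annotators\. The label values and lists are invented toy data, not CulturalNB annotations, and the helper name `cohen_kappa` is our own\.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Return (observed agreement, Cohen's kappa) for two annotators' labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in set(counts_a) | set(counts_b)) / (n * n)
    if p_e == 1.0:
        # Degenerate case: both annotators used one identical label throughout.
        return p_o, 1.0
    return p_o, (p_o - p_e) / (1.0 - p_e)


# Toy labels (hypothetical category names, invented for illustration).
annotator_a = ["local", "global", "global", "local", "global",
               "local", "global", "global", "local", "global"]
annotator_b = ["local", "global", "global", "local", "local",
               "local", "global", "global", "local", "global"]

agreement, kappa = cohen_kappa(annotator_a, annotator_b)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

In this toy case the annotators disagree on one of ten items, so raw agreement is 0\.90 while κ is 0\.80, because half of the agreement would be expected by chance given the label frequencies\. This is why κ is reported alongside percent agreement\.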

## 4 Methodology

Our evaluation methodology is designed to assess the extent to which state\-of\-the\-art large language models \(LLMs\) can answer culturally grounded questions that require knowledge of Bangladeshi social, historical, and institutional contexts\. We evaluate models under two experimental settings to distinguish between failures caused by missing background knowledge and those arising from limitations in reasoning or evidence utilization\.

### 4\.1 Models

We evaluate nine state\-of\-the\-art large language models \(LLMs\), including both proprietary and open\-weight systems, to provide a broad assessment of how current models handle culturally grounded knowledge\. We selected the models to cover a diverse range of architectures, training corpora, and alignment strategies from leading model developers\. Our benchmark includes proprietary frontier models, as well as high\-performing openly available models\. Specifically, we evaluate Claude Sonnet 4\.6, GPT\-5\.4, Gemini 3\.1 Pro Preview, Gemma 4 31B Instruct, Grok 4\.1 Fast, DeepSeek V3\.2, Qwen 3\.6 Plus, Llama 4 Maverick, and Mistral Large 2512\. These models vary substantially in parameter scale, training data, and post\-training alignment methods, making them well\-suited for evaluating whether culturally specific knowledge is consistently represented across different model families\. By including both closed and open models, our benchmark provides a comprehensive snapshot of the current state of LLM performance on culturally situated question answering\.

### 4\.2 Experimental Settings

We evaluate each model in two settings: Question\-only and Evidence\-based\.

##### Question\-Only Setting

In the question\-only setting, models receive only the question in the target language and must generate an answer without any supporting context\. This setting evaluates whether the required cultural knowledge has been internalized during pretraining and post\-training\. Performance in this condition reflects the model’s ability to retrieve and apply culturally grounded knowledge from its parametric memory alone, without relying on externally supplied evidence\.

##### Evidence\-based Setting

In this setting, models are provided with both the question and an evidence passage containing the information necessary to answer the question\. The evidence passage is the manually translated contextual paragraph associated with each instance, and the gold answer is explicitly stated or directly inferable from the passage\. This setting isolates the model’s ability to extract and use relevant information when the required knowledge is supplied\.

By comparing performance across these two settings, we can distinguish between knowledge limitations and evidence utilization failures\. Large improvements in the evidence\-based setting suggest that the model lacks the relevant cultural knowledge but can effectively use supporting context, whereas limited improvement indicates challenges in reading comprehension, reasoning, or grounding\.

##### Prompting and Inference

All prompts were constructed in English using manually validated data\. In the question\-only setting, the prompt consisted of an instruction followed by the question\. In the evidence\-based setting, the prompt additionally included a supporting evidence passage that contains the answer, and instructed the model to answer strictly based on the provided context\. We used a zero\-shot prompting strategy for all models to ensure comparability across systems and to avoid introducing task\-specific examples that might bias performance\. All the instructions are given in Appendix [D\.1](https://arxiv.org/html/2605.30481#A4.SS1)\.
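To make the two settings and two prompt languages concrete, the following minimal sketch assembles the four zero\-shot prompts for one instance\. The instruction strings, the `Instance` fields, and the choice to pair each question with evidence in the same language are assumptions made for illustration; the actual instructions are those given in Appendix D\.1\.

```python
from dataclasses import dataclass

# Hypothetical instruction wording; the actual instructions are in the paper's Appendix D.1.
QUESTION_ONLY_INSTRUCTION = "Answer the following question."
EVIDENCE_INSTRUCTION = (
    "Answer the following question strictly based on the provided context."
)


@dataclass
class Instance:
    """One parallel item; field names are illustrative, not the dataset schema."""
    question: dict  # language code -> question text
    evidence: dict  # language code -> evidence passage


def build_prompt(instance, language, with_evidence):
    """Assemble a zero-shot prompt: instruction, optional evidence, then the question."""
    if with_evidence:
        parts = [EVIDENCE_INSTRUCTION, f"Context: {instance.evidence[language]}"]
    else:
        parts = [QUESTION_ONLY_INSTRUCTION]
    parts.append(f"Question: {instance.question[language]}")
    return "\n\n".join(parts)


def all_conditions(instance):
    """Return the four prompts (2 languages x 2 settings) for one instance."""
    return {
        (language, setting): build_prompt(instance, language, setting == "evidence")
        for language in ("bn", "en")
        for setting in ("question_only", "evidence")
    }


# Placeholder texts stand in for real dataset content.
example = Instance(
    question={"bn": "<Bangla question>", "en": "<English question>"},
    evidence={"bn": "<Bangla evidence passage>", "en": "<English evidence passage>"},
)
prompts = all_conditions(example)
print(prompts[("en", "evidence")])
```

Holding the instruction fixed and varying only the language and the presence of evidence is what allows differences in model output to be attributed to those two factors\.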

### 4\.3 Evaluation Metrics

We evaluate models using five metrics: cross\-lingual factual consistency \(CLFC\), language anchor bias \(LAB\), global substitution rate \(GSR\), institutional bias rate \(IBR\), and epistemic perspective coverage \(EPC\)\. These metrics measure whether models preserve culturally grounded interpretations across languages or shift toward globally dominant narratives\. Specifically, CLFC measures whether Bangla and English questions produce semantically consistent factual claims, GSR captures replacement of local referents with generalized or globally common alternatives, IBR measures shifts toward institutional authority despite local grounding, EPC measures preservation of multiple culturally relevant perspectives, and LAB measures whether English prompts systematically increase globally dominant framing\. All metrics are evaluated using explicit annotation rubrics and decision criteria provided in Appendix [C](https://arxiv.org/html/2605.30481#A3); statistical correlations are reported in Appendix [E\.2](https://arxiv.org/html/2605.30481#A5.SS2)\.

### 4\.4 LLM\-as\-Judge

We use an LLM\-as\-judge protocol because culturally grounded questions often involve local interpretations, contested narratives, and low\-resource knowledge, making exact\-match evaluation insufficient\. Judges assess whether responses preserve local interpretations, substitute them with globally dominant ones, or narrow the epistemic framing\.

We use two independent judges: GPT\-5\.4\-mini \(GPT\) and Mistral 4 small \(Mistral\)\. For each instance, judges receive the question, model response and explanation, and, when applicable, local evidence\. For cross\-lingual metrics, they also receive paired Bangla and English responses and evaluate semantic consistency and direction of shift\. We keep the model identities hidden from judges for fair evaluation\. Moreover, judges follow task\-specific rubrics and return structured labels for all metrics\. We aggregate judge labels over the evaluation set and report both judges separately, since absolute scores vary by evaluator\. Therefore, our analysis emphasizes trends consistent across judges, languages, and evidence settings\. All evaluation instructions used in judge\-based assessments are provided in Appendix [D\.2](https://arxiv.org/html/2605.30481#A4.SS2)\.
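The aggregation step can be sketched as follows, under the assumption that each judge returns a binary label per response\. The record schema, the label name `global_substitution`, and the model name `model_x` are hypothetical, and the toy labels are invented\. Rates are kept separate per judge, and a trend is counted only when its direction agrees across judges\.

```python
from collections import defaultdict


def rate_by_condition(records, metric):
    """Mean of a binary judge label per (judge, model, language, setting)."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for record in records:
        key = (record["judge"], record["model"], record["language"], record["setting"])
        sums[key] += int(record[metric])
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in counts}


def english_higher_for_all_judges(rates, model, setting):
    """True if every judge scores English prompts above Bangla prompts."""
    judges = {key[0] for key in rates if key[1] == model and key[3] == setting}
    return all(
        rates[(judge, model, "en", setting)] > rates[(judge, model, "bn", setting)]
        for judge in judges
    )


def make(judge, language, label):
    """Build one toy judge record for the hypothetical model 'model_x'."""
    return {"judge": judge, "model": "model_x", "language": language,
            "setting": "question_only", "global_substitution": label}


# Invented labels: 1 means the judge flagged the response.
records = [
    make("gpt", "en", 1), make("gpt", "en", 1),
    make("gpt", "bn", 0), make("gpt", "bn", 1),
    make("mistral", "en", 1), make("mistral", "en", 0),
    make("mistral", "bn", 0), make("mistral", "bn", 0),
]

rates = rate_by_condition(records, "global_substitution")
trend_holds = english_higher_for_all_judges(rates, "model_x", "question_only")
print(rates, trend_holds)
```

In the toy data the two judges assign different absolute rates, yet both place English above Bangla, which is the kind of judge\-consistent trend the analysis relies on\.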

### 4\.5 Human\-as\-Judge

To validate the reliability of our evaluation, we include a human\-as\-judge assessment on GPT\-5\.4 and Claude Sonnet 4\.6 responses for the Bangla Question\-only experiment\. A bilingual evaluator with strong familiarity with Bengali cultural contexts evaluates responses using the same rubrics as the LLM judges, covering language anchoring, global substitution, and institutional bias\. The human judge is shown the question, model response, and, when applicable, the supporting evidence, but not the model identity\. Human judgments are used to complement LLM\-as\-judge scores and to check whether the main trends are robust to non\-automated evaluation\.

## 5 Results and Discussion

Our results show that state\-of\-the\-art LLMs exhibit substantial epistemic instability when answering culturally grounded questions\. Across all nine models, the responses frequently changed when the question language changed from Bangla to English, often replacing locally situated interpretations with globally dominant narratives\. Although providing supporting evidence substantially improved performance, no model fully eliminated these distortions\. Moreover, the findings suggest that multilingual LLMs encode cultural knowledge unevenly and tend to privilege globally prevalent perspectives when local knowledge is uncertain\. We also provide additional analysis in Appendix [F](https://arxiv.org/html/2605.30481#A6)\.

### 5\.1 Cross\-lingual Stability

We first assess whether models provide stable factual answers when the same culturally grounded question is asked in Bangla and English\. Figure [3](https://arxiv.org/html/2605.30481#S5.F3) reports Cross\-Lingual Factual Consistency \(CLFC\), and Figure [4](https://arxiv.org/html/2605.30481#S5.F4) reports Language Anchor Bias \(LAB\)\.

![Refer to caption](https://arxiv.org/html/2605.30481v1/x3.png)

Figure 3: Cross\-Lingual Factual Consistency \(higher is better\) across models, experimental settings, and judges\. CLFC measures whether Bangla and English questions produce semantically consistent factual claims\.

CLFC results show substantial cross\-lingual instability\. In the question\-only setting, the GPT judge gives the highest consistency to Gemini and GPT\-5\.4, while weaker scores appear for Llama, Mistral, Grok, and DeepSeek\. The Mistral judge assigns a lower absolute CLFC for the question\-only setting but shows clear gains when evidence is provided, with most models improving substantially\. In contrast, GPT\-judged CLFC does not uniformly improve with evidence, indicating that evidence use remains judge\- and model\-dependent\.

![Refer to caption](https://arxiv.org/html/2605.30481v1/x4.png)

Figure 4: Language Anchor Bias \(lower is better\) across models, evidence settings, and judges\. LAB measures how often switching to English shifts responses toward globally dominant interpretations\.

LAB remains high across models, especially under the GPT judge, where most scores fall around 0\.43–0\.50 in the question\-only setting\. GPT\-5\.4 is the most stable by this metric, but still shows non\-trivial anchoring\. The Mistral judge gives lower LAB overall; however, the same pattern persists: English questions frequently shift responses away from locally grounded interpretations\. Providing evidence does not eliminate LAB and sometimes increases it, particularly under the GPT judge\.

Overall, the two metrics indicate that multilingual LLMs are not language invariant\. Some models improve factual consistency when local evidence is supplied, but English questions continue to anchor responses toward globally dominant narratives, and no model consistently achieves high CLFC with low LAB across judges and settings\.
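A simplified reading of the two metrics can be sketched as pair\-level rates\. This is an illustration under stated assumptions, not the paper’s formal definitions, which are given in Appendix C: we assume the judge returns two binary verdicts per Bangla–English response pair, and the `PairJudgment` schema and toy verdicts are invented\.

```python
from dataclasses import dataclass


@dataclass
class PairJudgment:
    """Judge verdict for one item's Bangla/English response pair (illustrative schema)."""
    consistent: bool       # do both responses make the same factual claim?
    shift_to_global: bool  # does the English response move toward a global framing?


def clfc(judgments):
    """Share of pairs judged factually consistent (higher is better)."""
    return sum(j.consistent for j in judgments) / len(judgments)


def lab(judgments):
    """Share of pairs where English shifts toward a global framing (lower is better)."""
    return sum(j.shift_to_global for j in judgments) / len(judgments)


# Invented verdicts for five response pairs.
judgments = [
    PairJudgment(consistent=True, shift_to_global=False),
    PairJudgment(consistent=True, shift_to_global=True),
    PairJudgment(consistent=False, shift_to_global=True),
    PairJudgment(consistent=False, shift_to_global=False),
    PairJudgment(consistent=True, shift_to_global=False),
]
print(f"CLFC={clfc(judgments):.2f}, LAB={lab(judgments):.2f}")
```

The second toy pair is factually consistent yet still shifted toward a global framing, which shows why the two metrics are reported separately: a model can score well on one without doing well on the other\.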

### 5\.2 Global Substitution

We next examine whether models replace culturally specific answers with globally dominant narratives when local knowledge is missing\. Table [1](https://arxiv.org/html/2605.30481#S5.T1) reports the Global Substitution Rate \(GSR\) for both experimental settings, where lower values indicate better preservation of local context\.

Table 1:GSR across models, languages, experiment settings, and judges\. Lower values indicate better preservation of local cultural referents; higher values indicate stronger substitution with globally dominant alternatives\. GSR is computed following Eq\.[3](https://arxiv.org/html/2605.30481#A3.E3), restricted to local and contested instances withK\(q\)=1K\(q\)=1\. BN: Bangla, EN: English\.In question\-only setting, English questions consistently trigger substantially higher substitution rates than their Bangla counterparts\. Under the GPT judge, Bangla GSR is typically between 0\.21 and 0\.40, whereas English GSR rises above 0\.60 across most models\. This pattern holds for both frontier and open models, includingGPT\-5\.4,Gemini, andSonnet, indicating that language alone can shift responses toward globally salient but culturally inappropriate interpretations\. Although the Mistral judge assigns higher absolute scores, it preserves the same ordering and language gap, confirming that the effect is robust to evaluator choice\. However, evidence reduces substitution rates for some models, particularly for Bangla questions, but English bias remains pronounced\. Several models, includingGemma,Qwen,Llama, andDeepSeek, continue to exhibit high substitution even when the relevant information is available\.

These findings show that global substitution is not simply a consequence of missing factual knowledge\. If knowledge gaps were the sole cause, providing evidence would largely eliminate the effect\. Instead, English questions continue to anchor model outputs toward globally dominant interpretations, suggesting that prompt language influences which competing knowledge distributions are prioritized during inference\. Overall, GSR identifies a pervasive and robust failure mode: when uncertain, multilingual LLMs tend to overwrite local cultural knowledge with globally dominant narratives, especially when the question is asked in English\.

### 5\.3 Institutional Bias

We next measure whether models frame culturally grounded answers through globally dominant institutions rather than local epistemic contexts\. Table [2](https://arxiv.org/html/2605.30481#S5.T2) reports the Institutional Bias Rate \(IBR\) for both experiment settings, where lower values indicate less reliance on institutional framings\.

In the question\-only setting, the GPT judge assigns low IBR values across models, mostly around 0\.10–0\.19, with only small differences between Bangla and English\. In contrast, the Mistral judge detects substantially higher institutional bias: Bangla questions range from roughly 0\.27 to 0\.40, whereas English questions increase to about 0\.30–0\.64\. The increase in IBR for English is consistent across most models and is visible for Sonnet, Gemma, and Llama\. Therefore, despite differences in absolute scores across judges, both evaluations show that English prompts are more likely to elicit institutionally dominant framings\.

Table 2: Institutional Bias Rate \(IBR\) across models, languages, evidence settings, and judges\. Lower values indicate less reliance on institutional or globally legitimized framings\. BN: Bangla, EN: English\.

Table [2](https://arxiv.org/html/2605.30481#S5.T2) also shows that providing local evidence does not eliminate this effect\. Under the GPT judge, IBR remains low; however, English questions are often slightly higher than Bangla questions\. Under the Mistral judge, the effect is much stronger: English\-question IBR frequently exceeds 0\.34, with particularly high values for Llama, Sonnet, Gemma, Mistral, and Grok\. Even models with overall lower bias, such as GPT\-5\.4, still show an increase from Bangla to English\.

These results show that institutional bias is not simply a problem of missing knowledge\. Models may use the provided local evidence while still organizing the answer around globally recognized authorities or institutional narratives\. This is consequential for low\-resource cultural contexts, where valid knowledge is often grounded in local scholarship, community memory, vernacular sources, or non\-institutional expertise\. Overall, IBR indicates that English questions not only change model answers but also shift the epistemic authority through which those answers are framed\.

### 5\.4 Epistemic Perspective Coverage

Figure [5](https://arxiv.org/html/2605.30481#S5.F5) reports Epistemic Perspective Coverage \(EPC\), which measures whether models represent multiple locally relevant viewpoints rather than collapsing responses into a single dominant framing\.

![Refer to caption](https://arxiv.org/html/2605.30481v1/x5.png)

Figure 5: Epistemic Perspective Coverage \(higher is better\) across Bangla and English questions and experiment settings, using GPT and Mistral judges\.

In the question\-only setting, EPC is modest across models, mostly around 0\.50–0\.64\. English questions generally yield lower coverage than Bangla questions, indicating that knowledge gaps make models more likely to narrow culturally grounded questions into dominant interpretations\. Providing evidence substantially improves EPC for all models\. The largest gains appear for Bangla questions, which often reach about 0\.70–0\.76 and 0\.60–0\.73 under the GPT and Mistral judges, respectively\. English questions also improve, but typically remain below Bangla scores\.

The two judges differ in absolute values, with GPT assigning higher EPC than Mistral, but agree on the main trends: evidence broadens perspective coverage, while English questions reduce it\. Overall, EPC shows that local evidence helps models represent more diverse perspectives, but does not fully remove language\-dependent narrowing\.

### 5\.5 Human\-as\-Judge Performance

To validate LLM\-as\-judge reliability, we conducted a full human evaluation on two representative models, Sonnet and GPT\-5\.4, in the Bangla question\-only setting\. A native Bengali evaluator labeled responses for global substitution, source framing, and perspective coverage using the same rubric as the LLM judges\. Table [3](https://arxiv.org/html/2605.30481#A1.T3) compares human scores with GPT and Mistral judges\.

Human judgments detect substantially higher global substitution and institutional bias than both LLM judges, while perspective coverage is also slightly higher\. The calibration pattern is consistent across models: GPT is the most lenient judge, Mistral is more critical, and the human annotator identifies the most culturally inappropriate framings\. Notably, the largest gap appears when GPT judges GPT\-5\.4, suggesting that LLM judges may under\-detect failures aligned with their own training priors\.

These results show that LLM\-as\-judge evaluation can underestimate cultural failure modes in low\-resource contexts\. Reliable validation of culturally grounded benchmarks requires human evaluators with native cultural expertise, as LLM judges alone are often insufficient to assess nuanced local knowledge and context\. Statistical validation is provided in Appendix [E\.1](https://arxiv.org/html/2605.30481#A5.SS1)\.

## 6 Conclusion

We introduce CulturalNB, a Bengali culture\-focused benchmark for evaluating how large language models answer culturally grounded questions in Bangla and English\. Using parallel questions, evidence injection, human evaluation, and two independent LLM judges, we measure cross\-lingual consistency, language anchoring bias, global substitution, institutional framing, and epistemic coverage\. Our results show that current LLMs exhibit systematic cultural instability: English prompts consistently increase globally dominant and institutionalized interpretations while reducing local perspectives\. Although local evidence improves factual consistency and expands epistemic coverage, it does not eliminate these shifts, indicating that cultural errors arise not only from missing knowledge but also from language\-conditioned narrative priors\. CulturalNB provides a foundation for the development of culturally robust and epistemically plural language technologies for low\-resource settings\. Future work will explore retrieval\-augmented generation to reduce GSR by improving cultural grounding\.

## Limitation

Our study has several limitations\. First, CulturalNB focuses on Bengali cultural knowledge, and although the proposed evaluation framework is language\-agnostic, the reported findings may not directly generalize to other low\-resource cultural contexts\. Second, the benchmark contains 717 manually curated instances, which cannot fully capture the regional, historical, and contested diversity of Bengali culture\. Third, while the Bangla–English pairs were produced using context\-preserving human translation and expert validation, some semantic or cultural nuance may still shift across languages\. Therefore, our results should be interpreted as evidence of language\-conditioned framing differences rather than strictly causal effects of language alone\.

Our evaluation additionally relies on LLM judges for scalable assessment\. To reduce evaluator bias, we use two independent judges, explicit annotation rubrics, and human validation\. However, human evaluation is limited in scale and does not cover all models and settings\. The proposed metrics—GSR, IBR, EPC, and LAB—capture observable response behaviors such as substitution, institutional reframing, and perspective reduction, but they remain approximations of broader cultural and epistemic phenomena\. Finally, our evidence\-based setup evaluates whether models appropriately use provided local evidence, but does not assess retrieval quality or full retrieval\-augmented generation systems\.

## Ethics and Broader Impact

This work aims to identify cultural representation harms in LLMs for low\-resource contexts\. CulturalNB is built from public or manually documented cultural sources and does not include private or personally identifying information\. Source metadata is preserved when available for transparency\.

Annotators were bilingual speakers familiar with Bengali cultural contexts and were compensated at standard rates\. They were instructed to preserve cultural meaning, and an expert validator reviewed translations and labels\. Since cultural knowledge can be contested, our annotations should not be treated as the only valid account of Bengali culture, but as evidence\-based interpretations from the collected sources\.

Our findings have implications for education, search, and multilingual assistants: English prompts can amplify globally dominant narratives and institutional framings even when local evidence is provided\. This may marginalize local epistemologies if left unaddressed\. We caution against using CulturalNB to essentialize culture or rank communities; it should instead support auditing, mitigation, and culturally responsible model development with community expertise\.

## References

- Arora et al\. \(2025\) Shane Arora, Marzena Karpinska, Hung\-Ting Chen, Ipsita Bhattacharjee, Mohit Iyyer, and Eunsol Choi\. 2025\. [CaLMQA: Exploring culturally specific long\-form question answering across 23 languages](https://doi.org/10.18653/v1/2025.acl-long.578)\. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 11772–11817, Vienna, Austria\. Association for Computational Linguistics\.
- Bang et al\. \(2025\) Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, and Pascale Fung\. 2025\. [HalluLens: LLM hallucination benchmark](https://doi.org/10.18653/v1/2025.acl-long.1176)\. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 24128–24156, Vienna, Austria\. Association for Computational Linguistics\.
- Bender et al\. \(2021\) Emily M\. Bender, Timnit Gebru, Angelina McMillan\-Major, and Shmargaret Shmitchell\. 2021\. [On the dangers of stochastic parrots: Can language models be too big?](https://doi.org/10.1145/3442188.3445922) In *Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency*, FAccT ’21, pages 610–623, New York, NY, USA\. Association for Computing Machinery\.
- Blodgett et al\. \(2020\) Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach\. 2020\. [Language \(technology\) is power: A critical survey of “bias” in NLP](https://doi.org/10.18653/v1/2020.acl-main.485)\. In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 5454–5476, Online\. Association for Computational Linguistics\.
- Caughman et al\. \(2026\) Liliana Caughman, Claire Lauer, Stephen Carradini, and Srinivasan Ravichandran\. 2026\. [You are a river: Reorienting a civic waterbot from the bottom up](https://doi.org/10.1145/3772318.3791890)\. In *Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems*, CHI ’26, New York, NY, USA\. Association for Computing Machinery\.
- Cecilia Liu et al\. \(2024\) Chen Cecilia Liu, Fajri Koto, Timothy Baldwin, and Iryna Gurevych\. 2024\. [Are multilingual LLMs culturally\-diverse reasoners? an investigation into multicultural proverbs and sayings](https://doi.org/10.18653/v1/2024.naacl-long.112)\. In *Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pages 2016–2039, Mexico City, Mexico\. Association for Computational Linguistics\.
- Clark et al\. \(2025\) Nicholas Clark, Hua Shen, Bill Howe, and Tanushree Mitra\. 2025\. Epistemic alignment: A mediating framework for user\-llm knowledge delivery\. *arXiv preprint arXiv:2504\.01205*\.
- Conneau et al\. \(2018\) Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov\. 2018\. [XNLI: Evaluating cross\-lingual sentence representations](https://doi.org/10.18653/v1/D18-1269)\. In *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, pages 2475–2485, Brussels, Belgium\. Association for Computational Linguistics\.
- Feng et al\. \(2024\) Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, and Yulia Tsvetkov\. 2024\. [Don’t hallucinate, abstain: Identifying LLM knowledge gaps via multi\-LLM collaboration](https://doi.org/10.18653/v1/2024.acl-long.786)\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 14664–14690, Bangkok, Thailand\. Association for Computational Linguistics\.
- Gallegos et al\. \(2024\) Isabel O\. Gallegos, Ryan A\. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K\. Ahmed\. 2024\. [Bias and fairness in large language models: A survey](https://doi.org/10.1162/coli_a_00524)\. *Computational Linguistics*, 50\(3\):1097–1179\.
- Guo et al\. \(2025\) Shiwei Guo, Sihang Jiang, Qianxi He, Yanghua Xiao, Jiaqing Liang, Bi Yude, Minggui He, Shimin Tao, and Li Zhang\. 2025\. Do large language models truly understand cross\-cultural differences? *arXiv preprint arXiv:2512\.07075*\.
- Hasan et al\. \(2025\) Md Arid Hasan, Maram Hasanain, Fatema Ahmad, Sahinur Rahman Laskar, Sunaya Upadhyay, Vrunda N Sukhadia, Mucahid Kutlu, Shammur Absar Chowdhury, and Firoj Alam\. 2025\. [NativQA: Multilingual culturally\-aligned natural query for LLMs](https://doi.org/10.18653/v1/2025.findings-acl.770)\. In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 14886–14909, Vienna, Austria\. Association for Computational Linguistics\.
- Hu et al\. \(2020\) Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson\. 2020\. [XTREME: A massively multilingual multi\-task benchmark for evaluating cross\-lingual generalisation](https://proceedings.mlr.press/v119/hu20b.html)\. In *Proceedings of the 37th International Conference on Machine Learning*, volume 119 of *Proceedings of Machine Learning Research*, pages 4411–4421\. PMLR\.
- Hupkes and Bogoychev \(2025\) Dieuwke Hupkes and Nikolay Bogoychev\. 2025\. Multiloko: a multilingual local knowledge benchmark for llms spanning 31 languages\. *arXiv preprint arXiv:2504\.10356*\.
- Joshi et al\. \(2020\) Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury\. 2020\. [The state and fate of linguistic diversity and inclusion in the NLP world](https://doi.org/10.18653/v1/2020.acl-main.560)\. In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 6282–6293, Online\. Association for Computational Linguistics\.
- Landis and Koch \(1977\) J Richard Landis and Gary G Koch\. 1977\. The measurement of observer agreement for categorical data\. *Biometrics*, pages 159–174\.
- Myung et al\. \(2024\) Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki A Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla Perez\-Almendros, Abinew A Ayele, and 1 others\. 2024\. Blend: A benchmark for llms on everyday knowledge in diverse cultures and languages\. *Advances in Neural Information Processing Systems*, 37:78104–78146\.
- Naous et al\. \(2024\) Tarek Naous, Michael J Ryan, Alan Ritter, and Wei Xu\. 2024\. [Having beer after prayer? Measuring cultural bias in large language models](https://doi.org/10.18653/v1/2024.acl-long.862)\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 16366–16393, Bangkok, Thailand\. Association for Computational Linguistics\.
- Nguyen et al\. \(2025\) Ilana Nguyen, Harini Suresh, and Evan Shieh\. 2025\. Representational harms in llm\-generated narratives against nationalities located in the global south\. In *HEAL Workshop, CHI*, volume 2025\.
- Pawar et al\. \(2025\) Siddhesh Pawar, Junyeong Park, Jiho Jin, Arnav Arora, Junho Myung, Srishti Yadav, Faiz Ghifari Haznitrama, Inhwa Song, Alice Oh, and Isabelle Augenstein\. 2025\. Survey of cultural awareness in language models: Text and beyond\. *Computational Linguistics*, 51\(3\):907–1004\.
- Qi et al\. \(2023\) Jirui Qi, Raquel Fernández, and Arianna Bisazza\. 2023\. [Cross\-lingual consistency of factual knowledge in multilingual language models](https://doi.org/10.18653/v1/2023.emnlp-main.658)\. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 10650–10666, Singapore\. Association for Computational Linguistics\.
- Romanou et al\. \(2025\) Angelika Romanou, Negar Foroutan, Anna Sotnikova, Imanol Schlag, Marzieh Fadaee, Sara Hooker, and Antoine Bosselut\. 2025\. [INCLUDE: Evaluating multilingual language understanding with regional knowledge](https://openreview.net/forum?id=ced46a50be)\. In *The Thirteenth International Conference on Learning Representations*\.
- Rystrøm et al\. \(2025\) Jonathan Hvithamar Rystrøm, Hannah Rose Kirk, and Scott Hale\. 2025\. [Multilingual \!= multicultural: Evaluating gaps between multilingual capabilities and cultural alignment in LLMs](https://aclanthology.org/2025.ommm-1.9/)\. In *Proceedings of Interdisciplinary Workshop on Observations of Misunderstood, Misguided and Malicious Use of Language Models*, pages 74–85, Varna, Bulgaria\. INCOMA Ltd\., Shoumen, Bulgaria\.
- Shao et al\. \(2026\) Jiaqi Shao, Yuxiang Lin, Munish P Lohani, Yufeng Miao, and Bing Luo\. 2026\. [Do LLM agents know how to ground, recover, and assess? A benchmark for epistemic competence in information\-seeking agents](https://iclr.cc/virtual/2026/poster/10007188)\. In *Proceedings of the Fourteenth International Conference on Learning Representations*\.
- Singh et al\. \(2025\) Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David Ifeoluwa Adelani, Jian Gang Ngui, Daniel Vila\-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Sebastian Ruder, Wei\-Yin Ko, Antoine Bosselut, Alice Oh, Andre F\. T\. Martins, Leshem Choshen, Daphne Ippolito, and 4 others\. 2025\. [Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation](https://doi.org/10.18653/v1/2025.acl-long.919)\. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 18761–18799, Vienna, Austria\. Association for Computational Linguistics\.
- Tanwar et al\. \(2025\) Eshaan Tanwar, Anwoy Chatterjee, Michael Saxon, Alon Albalak, William Yang Wang, and Tanmoy Chakraborty\. 2025\. [Do you know about my nation? Investigating multilingual language models’ cultural literacy through factual knowledge](https://doi.org/10.18653/v1/2025.emnlp-main.756)\. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 14956–14979, Suzhou, China\. Association for Computational Linguistics\.
- Wan et al\. \(2025\) Yixin Wan, Xingrun Chen, and Kai\-Wei Chang\. 2025\. Which cultural lens do models adopt? on cultural positioning bias and agentic mitigation in llms\. *arXiv preprint arXiv:2509\.21080*\.
- Wang et al\. \(2025\) Qihan Wang, Shidong Pan, Tal Linzen, and Emily Black\. 2025\. [Multilingual prompting for improving LLM generation diversity](https://doi.org/10.18653/v1/2025.emnlp-main.324)\. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 6367–6389, Suzhou, China\. Association for Computational Linguistics\.
- Wang et al\. \(2024\) Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen\-tse Huang, Zhaopeng Tu, and Michael Lyu\. 2024\. [Not all countries celebrate thanksgiving: On the cultural dominance in large language models](https://doi.org/10.18653/v1/2024.acl-long.345)\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 6349–6384, Bangkok, Thailand\. Association for Computational Linguistics\.
- Yadkori et al\. \(2024\) Yasin A Yadkori, Ilja Kuzborskij, András György, and Csaba Szepesvari\. 2024\. To believe or not to believe your llm: Iterative prompting for estimating epistemic uncertainty\. *Advances in Neural Information Processing Systems*, 37:58077–58117\.
- Ying et al\. \(2025\) Jiahao Ying, Wei Tang, Yiran Zhao, Yixin Cao, Yu Rong, and Wenxuan Zhang\. 2025\. [Disentangling language and culture for evaluating multilingual large language models](https://doi.org/10.18653/v1/2025.acl-long.1082)\. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 22230–22251, Vienna, Austria\. Association for Computational Linguistics\.
- Younas and Zeng \(2026\) Ammar Younas and Yi Zeng\. 2026\. Towards culture driven artificial intelligence to bridge the cultural cognition gap\. *Discover Artificial Intelligence*, 6\(1\):252\.
- Yu et al\. \(2024\) Lei Yu, Meng Cao, Jackie CK Cheung, and Yue Dong\. 2024\. [Mechanistic understanding and mitigation of language model non\-factual hallucinations](https://doi.org/10.18653/v1/2024.findings-emnlp.466)\. In *Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 7943–7956, Miami, Florida, USA\. Association for Computational Linguistics\.
- Zhang et al\. \(2025\) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi\. 2025\. [Siren’s song in the AI ocean: A survey on hallucination in large language models](https://doi.org/10.1162/coli.a.16)\. *Computational Linguistics*, 51\(4\):1373–1418\.

## Appendix A Annotation Guidelines

Table 3: Human\-as\-Judge results compared to LLM judges in the Bangla question\-only setting\. Human\-judged values consistently exceed both LLM judges, with the gap largest for GPT\-5\.4 under the GPT judge\. Bold marks the worst bias scores \(highest GSR and IBR\) and the best coverage score \(highest EPC\) for each model\.

To ensure that our benchmark captures culturally grounded knowledge rather than literal surface forms, we designed a two\-stage annotation protocol consisting of \(1\) context\-preserving high\-fidelity translation and \(2\) sociocultural categorization\. The annotators were instructed to interpret each item through the lens of Bangladesh’s social, historical, and cultural understanding, with particular attention to locally specific institutions, practices, and beliefs\.

### A\.1 Task 1: Context\-Preserving High\-Fidelity Translation

Each annotation instance contained three components written in Bangla: a question, a statement, and a contextual passage\. In the first stage, annotators translated all three components into English\. The objective was to preserve semantic fidelity rather than stylistic fluency\. Annotators were instructed to maintain the original meaning, causal relationships, and specific informational details while avoiding the replacement of local concepts with generic or Western equivalents\. When no direct English translation existed, annotators were encouraged to provide the closest approximation and retain the original Bangla term in parentheses when necessary\.

### A\.2 Task 2: Sociocultural Categorization

In this stage, the annotators categorized the Bangla statement along three dimensions that characterize the structure and social grounding of knowledge\.

##### Complexity \(Cultural Familiarity and Frequency\)

This dimension captures the degree of cultural familiarity and frequency with which a statement is encountered\.

- Frequent statements refer to knowledge, beliefs, or practices widely recognized across Bangladeshi and Bengali communities\.
- Occasional statements are known within certain regions or social groups, but they are not universally shared\.
- Rare statements correspond to highly specialized, localized, or niche knowledge that is unfamiliar to most people\.

##### Epistemic Status \(Scope of Validity\)

This dimension assesses whether a statement is broadly applicable, culturally bounded, or socially debated\.

- Global statements express observations or beliefs that extend beyond Bangladesh and are broadly recognizable across societies\.
- Local statements are culturally specific and derive their meaning from Bangladeshi or Bengali contexts, institutions, or traditions\.
- Contested statements are debated, politically sensitive, or interpreted differently across communities and generations\.

##### Validation Type \(Source of Legitimacy\)

This dimension identifies the primary source of legitimacy through which a statement is recognized\.

- Institutional statements are supported by formal institutions, such as government agencies, educational systems, or religious authorities\.
- Local Consensus refers to knowledge maintained through shared social agreement and everyday experience\.
- Oral Tradition encompasses information transmitted through storytelling, proverbs, folklore, and intergenerational verbal communication\.
- Niche Knowledge denotes knowledge recognized primarily within specialized communities, professions, or subcultures\.

For each item, annotators assigned exactly one label for each of the three dimensions\. When uncertain, they were instructed to select the category that best reflected how the statement is generally understood within Bangladeshi society rather than relying on personal beliefs\. To preserve the integrity of the benchmark and ensure that all judgments reflected human cultural interpretation, annotators were explicitly prohibited from using machine translation systems, large language models, or any other AI\-based tools\.
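The one\-label\-per\-dimension rule above can be made concrete with a small data structure. The following is a minimal sketch, not the authors' tooling: the class names, field names, and the `parse_annotation` helper are hypothetical, and only the label sets are taken from the guidelines.

```python
from dataclasses import dataclass
from enum import Enum


class Frequency(Enum):
    FREQUENT = "frequent"
    OCCASIONAL = "occasional"
    RARE = "rare"


class EpistemicStatus(Enum):
    GLOBAL = "global"
    LOCAL = "local"
    CONTESTED = "contested"


class ValidationType(Enum):
    INSTITUTIONAL = "institutional"
    LOCAL_CONSENSUS = "local_consensus"
    ORAL_TRADITION = "oral_tradition"
    NICHE_KNOWLEDGE = "niche_knowledge"


@dataclass(frozen=True)
class Annotation:
    """Exactly one label per dimension (hypothetical record layout)."""
    frequency: Frequency
    epistemic_status: EpistemicStatus
    validation_type: ValidationType


def parse_annotation(raw: dict) -> Annotation:
    """Build an Annotation from string labels.

    Raises KeyError if a dimension is missing and ValueError if a label
    is not one of the categories listed in the guidelines.
    """
    return Annotation(
        frequency=Frequency(raw["frequency"]),
        epistemic_status=EpistemicStatus(raw["epistemic_status"]),
        validation_type=ValidationType(raw["validation_type"]),
    )


example = parse_annotation(
    {"frequency": "rare", "epistemic_status": "local", "validation_type": "oral_tradition"}
)
print(example)
```

Using enums rather than free\-form strings means an out\-of\-vocabulary label fails at parse time instead of silently entering the dataset.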

### A\.3 Annotation Quality

We measured inter\-annotator agreement using Cohen’s $\kappa$ for all three categorical annotation dimensions over the full set of 717 benchmark instances\. Agreement was consistently in the *almost perfect* range according to the interpretation of Landis and Koch \([1977](https://arxiv.org/html/2605.30481#bib.bib16)\)\. For Knowledge Frequency, annotators achieved 94\.14% raw agreement with $\kappa=0.91$\. For Epistemic Status, which distinguishes local, global, and contested knowledge, agreement was 92\.61% with $\kappa=0.86$\. The highest reliability was observed for the Validation Type, with 94\.70% agreement and $\kappa=0.93$\. These results indicate that the annotation guidelines were clear and that annotators were able to apply the sociocultural categories with a high degree of consistency across all dimensions\.
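For readers who want to reproduce this kind of agreement check, the sketch below computes Cohen's kappa for two annotators and maps the result to the Landis and Koch bands. It is an illustration under stated assumptions: the function names and the six toy labels are hypothetical and are not drawn from CulturalNB.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


def landis_koch_label(kappa: float) -> str:
    """Interpretation bands from Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy labels for six items (illustrative only, not benchmark data).
ann_1 = ["local", "local", "global", "contested", "local", "global"]
ann_2 = ["local", "local", "global", "local", "local", "global"]
k = cohen_kappa(ann_1, ann_2)
print(round(k, 3), landis_koch_label(k))
```

Under these bands, the reported values of 0\.91, 0\.86, and 0\.93 all fall above 0\.80, which is why each dimension is described as almost perfect agreement.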

## Appendix B Data Statistics

Table [4](https://arxiv.org/html/2605.30481#A2.T4) summarizes the composition of CulturalNB\. The dataset is balanced across two major categories \(History & Politics and Art, Literature, & Cultural Practices\), each contributing 234 instances \(32\.6%\), together accounting for nearly two\-thirds of the corpus\. The remaining examples cover Religion, Folklore, & Mythology \(14\.8%\), Geography & National Identity \(10\.5%\), and Traditional Medicine & Ecology \(9\.5%\), ensuring coverage of both institutionally documented and locally transmitted knowledge\.
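As a quick consistency check on the domain shares quoted above, using only the figures stated in this paragraph:

```latex
\begin{align*}
\frac{234}{717} &\approx 0.326 \quad (32.6\%) \\
2 \times 32.6\% &= 65.2\% \quad \text{(the two major domains, roughly two-thirds)} \\
14.8\% + 10.5\% + 9.5\% &= 34.8\% \quad \text{(the three remaining domains)} \\
65.2\% + 34.8\% &= 100\%
\end{align*}
```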

Table 4: Distribution of the 717 instances in CulturalNB across domain, epistemic status, validation type, and knowledge frequency\. Percentages are computed over the full dataset\.

The epistemic status annotation shows that most instances are primarily grounded in local knowledge \(61\.9%\), while 31\.4% involve globally recognized facts and 6\.7% capture contested interpretations\. This distribution is important because it allows us to distinguish failures caused by a lack of cultural specificity from those arising in areas where multiple interpretations coexist\.

Validation types further emphasize epistemic diversity\. Although over half of the examples are supported by institutional sources \(52\.2%\), a substantial portion relies on local consensus \(26\.6%\), niche knowledge \(14\.8%\), or oral tradition \(6\.4%\)\. These categories represent forms of knowledge that are more likely to be underrepresented in global web corpora and, therefore, provide a stringent test of culturally grounded reasoning\.

Finally, the dataset spans varying levels of knowledge frequency: 49\.5% frequent, 31\.2% occasional, and 19\.2% rare\. That nearly one\-fifth of the instances are rare is particularly important for evaluating whether models default to globally dominant narratives when culturally specific knowledge is sparse or weakly represented in pretraining data\.

## Appendix C Evaluation Metrics

##### Cross\-Lingual Factual Consistency \(CLFC\)

Cross\-Lingual Factual Consistency measures whether asking the same question in Bangla and English yields semantically equivalent factual claims\. If a model produces inconsistent answers across languages, this suggests that language functions as an epistemic anchor rather than a neutral carrier of meaning Qi et al\. \([2023](https://arxiv.org/html/2605.30481#bib.bib21)\); Wang et al\. \([2024](https://arxiv.org/html/2605.30481#bib.bib29)\)\. Let $(q_{\text{bn}}, q_{\text{en}})$ denote a parallel question pair, $f(R)$ extract the core factual claim from response $R$, and $\text{NLI}(a,b)\in[0,1]$ denote an entailment score between two claims\. We define CLFC as:

$$\text{CLFC}=\frac{1}{|Q|}\sum_{q\in Q}\mathbb{1}\left[\text{NLI}\!\left(f(R_{q_{\text{bn}}}),f(R_{q_{\text{en}}})\right)\geq\tau\right] \tag{1}$$

where $\mathbb{1}[\cdot]$ is the indicator function and $\tau=0.8$ is the consistency threshold\.
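The CLFC definition above reduces to a thresholded average once the entailment scores are available. The following is a minimal sketch, assuming the scores $\text{NLI}(f(R_{q_{\text{bn}}}), f(R_{q_{\text{en}}}))$ have already been computed for each question pair; the function name and the example scores are hypothetical and are not results from the paper.

```python
from typing import Sequence


def clfc(nli_scores: Sequence[float], tau: float = 0.8) -> float:
    """Share of parallel question pairs whose claims are judged consistent.

    nli_scores[i] is the entailment score in [0, 1] between the core claims
    extracted from the Bangla and English responses to question i.
    """
    if not nli_scores:
        raise ValueError("need at least one question pair")
    consistent = sum(1 for score in nli_scores if score >= tau)
    return consistent / len(nli_scores)


# Hypothetical scores for four question pairs (illustrative only).
example_scores = [0.95, 0.80, 0.42, 0.79]
print(clfc(example_scores))
```

Because the threshold is inclusive, a pair scoring exactly 0\.8 counts as consistent, matching the $\geq\tau$ condition in the definition\.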

##### Language Anchor Bias \(LAB\)

Language Anchor Bias is a directional variant of CLFC that measures how often switching from Bangla to English causes a response to shift from a locally grounded interpretation to a globally dominant one\. Let $\text{GND}(R)=1$ indicate that response $R$ adopts a globally dominant framing, and let $Q_{L}$ denote the set of local and contested questions\. LAB is defined as:

$$\mathrm{LAB}=\frac{1}{|Q_{L}|}\sum_{q\in Q_{L}}\mathbb{1}\Big[\mathrm{GND}(R_{q_{\mathrm{en}}})=1\;\wedge\;\mathrm{GND}(R_{q_{\mathrm{bn}}})=0\Big]\tag{2}$$

Higher LAB values indicate that English prompts systematically bias models toward globally dominant interpretations\.
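A minimal sketch of Eq\. \(2\), assuming the GND labels for the Bangla and English responses are already available as aligned 0/1 lists over $Q_{L}$\. The function name `lab` is illustrative\.

```python
from typing import Sequence


def lab(gnd_bn: Sequence[int], gnd_en: Sequence[int]) -> float:
    """Share of Q_L items that are local in Bangla but global in English."""
    if len(gnd_bn) != len(gnd_en) or not gnd_bn:
        raise ValueError("need two non-empty, aligned label lists")
    flips = sum(1 for bn, en in zip(gnd_bn, gnd_en) if en == 1 and bn == 0)
    return flips / len(gnd_bn)


# Toy labels: items 0 and 3 flip from local (0) to global (1) in English.
gnd_bn = [0, 0, 1, 0]
gnd_en = [1, 0, 1, 1]
lab_score = lab(gnd_bn, gnd_en)
```

Items that are globally framed in both languages do not count toward LAB, which is what makes the metric directional rather than a plain disagreement rate\.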

##### Global Substitution Rate \(GSR\)

Refusal rates \(where a refusal is when the model refuses to answer or respond\) alone do not distinguish between honest uncertainty and harmful substitutions in which a model fills a genuine knowledge gap with a confident but culturally inappropriate global narrative\. GSR isolates this failure mode, as observed in the Baul doctrine evaluation where the model produced confident globally\-framed answers with no hedging Feng et al\. \([2024](https://arxiv.org/html/2605.30481#bib.bib9)\); Yu et al\. \([2024](https://arxiv.org/html/2605.30481#bib.bib33)\)\. Let $K(q)=1$ if expert annotation determines that the model genuinely lacks relevant local knowledge for question $q$, and let $A(q)=1$ if the model abstains\. GSR is defined as:

$$\mathrm{GSR}=\frac{\sum_{q\in Q_{L}}(1-A(q))\,K(q)\,\mathbb{1}\left[\mathrm{GND}(q)=1\right]}{\sum_{q\in Q_{L}}K(q)}\tag{3}$$

Therefore, GSR measures whether a model replaces a culturally specific entity, practice, interpretation, or terminology with a more globally common or generalized alternative\.
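A minimal sketch of Eq\. \(3\), assuming $K$, $A$, and GND are given as aligned 0/1 lists over $Q_{L}$\. Returning `None` when no knowledge gap exists is an assumption of this sketch; the paper does not say how an empty denominator is handled\.

```python
from typing import Optional, Sequence


def gsr(k: Sequence[int], a: Sequence[int], gnd: Sequence[int]) -> Optional[float]:
    """Among knowledge-gap items, share answered (not abstained) with a global framing."""
    if not (len(k) == len(a) == len(gnd)):
        raise ValueError("k, a and gnd must be aligned")
    gaps = sum(k)
    if gaps == 0:
        return None  # Assumption: GSR is left undefined without any knowledge gap.
    substituted = sum(
        (1 - a_q) * k_q * (1 if g_q == 1 else 0)
        for k_q, a_q, g_q in zip(k, a, gnd)
    )
    return substituted / gaps


# Toy labels: three gap items; one abstains, one stays local, one substitutes.
k = [1, 1, 1, 0]
a = [0, 1, 0, 0]
gnd = [1, 1, 0, 1]
gsr_score = gsr(k, a, gnd)
```

The last item is globally framed but has no knowledge gap, so it contributes to neither the numerator nor the denominator\.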

##### Institutional Bias Rate \(IBR\)

IBR measures whether a response legitimizes an answer primarily through formal institutional authority \(government, academic, religious, legal, or canonical sources\) when the gold interpretation is grounded in local consensus, oral tradition, or vernacular knowledge Gallegos et al\. \([2024](https://arxiv.org/html/2605.30481#bib.bib10)\); Naous et al\. \([2024](https://arxiv.org/html/2605.30481#bib.bib18)\); Wang et al\. \([2024](https://arxiv.org/html/2605.30481#bib.bib29)\)\. For each response, we annotate the dominant source type $S(q)\in\{\text{institutional},\text{community},\text{oral/local},\text{none}\}$\. IBR is defined as:

$$\text{IBR}=\frac{\sum_{q\in Q}\mathbb{1}\left[S(q)=\text{institutional}\right]}{|Q|}\tag{4}$$

##### Epistemic Perspective Coverage \(EPC\)

Accuracy evaluates correctness with respect to a single reference answer, whereas EPC measures whether the response preserves multiple culturally relevant interpretations, communities, or contextual framings Singh et al\. \([2025](https://arxiv.org/html/2605.30481#bib.bib25)\); Tanwar et al\. \([2025](https://arxiv.org/html/2605.30481#bib.bib26)\) instead of collapsing the answer into a single dominant narrative\. Let $V=\{v_{1},\ldots,v_{k}\}$ denote the set of expert\-annotated valid perspectives for a question, and let $V_{R}\subseteq V$ denote the perspectives expressed in response $R$\. We define:

$$\text{Range}(R)=\frac{|V_{R}\cap V|}{|V|}\tag{5}$$

$$\text{Rep}(R)=1-\left|\hat{p}(v_{\text{global}})-p^{*}(v_{\text{global}})\right|\tag{6}$$

where $\hat{p}(v_{\text{global}})$ is the proportion of the response devoted to the globally dominant viewpoint and $p^{*}(v_{\text{global}})$ is the expert\-annotated gold proportion\. The final EPC score is:

$$\text{EPC}(R)=\alpha\cdot\text{Range}(R)+(1-\alpha)\cdot\text{Rep}(R)\tag{7}$$

where we set $\alpha=0.5$ by default\.
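A minimal sketch of Eqs\. \(5\) to \(7\)\. It assumes that $\hat{p}(v_{\text{global}})$ is estimated as the share of atomic units labelled `global`, with every unit weighted equally; the paper only says "proportion of the response", so this weighting is an assumption, as are the function names\.

```python
from typing import Iterable, Sequence, Set


def perspective_range(gold: Set[str], expressed: Iterable[str]) -> float:
    """Eq. (5): share of gold perspectives that the response expresses."""
    if not gold:
        raise ValueError("gold perspective set must be non-empty")
    return len(set(expressed) & gold) / len(gold)


def representation(unit_labels: Sequence[str], gold_global_share: float) -> float:
    """Eq. (6): 1 minus the gap between observed and gold global share."""
    if not unit_labels:
        raise ValueError("response has no atomic units")
    # Assumption: every atomic unit carries equal weight.
    p_hat = sum(1 for label in unit_labels if label == "global") / len(unit_labels)
    return 1.0 - abs(p_hat - gold_global_share)


def epc(
    gold: Set[str],
    expressed: Iterable[str],
    unit_labels: Sequence[str],
    gold_global_share: float,
    alpha: float = 0.5,
) -> float:
    """Eq. (7): convex combination of Range and Rep."""
    rng = perspective_range(gold, expressed)
    rep = representation(unit_labels, gold_global_share)
    return alpha * rng + (1.0 - alpha) * rep


# Toy input: 2 of 4 gold perspectives covered; global share matches the gold share.
gold = {"a", "b", "c", "d"}
expressed = ["a", "b", "x"]  # "x" is not a gold perspective and is ignored
labels = ["global", "local", "local", "neutral"]
epc_score = epc(gold, expressed, labels, gold_global_share=0.25)
```

With $\alpha=0.5$ the toy response scores 0\.75: Range is 0\.5 and Rep is 1\.0, so a response can match the gold global share exactly and still lose half of the coverage credit\.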

## Appendix D Prompts

### D\.1 Model Inference

We provide the instructions used for the English question\-only and evidence\-based experiment settings in Listings [1](https://arxiv.org/html/2605.30481#LST1) and [2](https://arxiv.org/html/2605.30481#LST2), respectively\.

You are a \{target\_language\} AI assistant specialized in Bengali cultural question answering tasks\. Your response should be in the \{target\_language\} language\. Your task is to provide an answer that accurately reflects local knowledge, customs, historical experience, and linguistic expressions\.

Answer the \[Question\] only from the cultural context of a Bengali community\.

The answer should be experiential and culturally specific to the Bengali community\. Also, provide a short and concise explanation for the answer, along with your confidence score on a scale of 1 to 10 towards the answer\. The output should follow the following JSON format:

\{"answer":"","explanation":"","confidence":0\}

Input:

\[Question\]: \{question\}

Listing 1: Prompt for Question\-only Experiment Settings\.

You are a \{target\_language\} AI assistant specialized in Bengali cultural question answering tasks\. Your response should be in the \{target\_language\} language\. Your task is to provide an answer that accurately reflects local knowledge, customs, historical experience, and linguistic expressions\.

Answer the \[Question\] only from the cultural context of a Bengali community\. For a better understanding of the context, we provide the following \[evidence\]\. The answer should be experiential and culturally specific to the Bengali community\. Also, provide short and concise explanation for the answer along with your confidence score on a scale of 1 to 10 towards the answer\.

The output should follow the following JSON format:

\{"answer":"","explanation":"","confidence":0\}

Input:

\[Question\]: \{question\}

\[Evidence\]: \{target\_language\_evidence\}"

Listing 2: Prompt for Evidence\-based Experiment Settings\.
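Both inference prompts request a JSON object with `answer`, `explanation`, and `confidence` fields\. The paper does not describe how replies are parsed, so the following is only a sketch of one plausible approach: take the outermost braces in the reply, decode them, and check that the confidence lies on the 1 to 10 scale\. The function name `parse_answer` and the sample reply are hypothetical\.

```python
import json
from typing import Any, Dict


def parse_answer(raw: str) -> Dict[str, Any]:
    """Decode a model reply that should contain the JSON object from the prompt."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    reply = json.loads(raw[start : end + 1])
    missing = {"answer", "explanation", "confidence"} - set(reply)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    confidence = reply["confidence"]
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 1 <= confidence <= 10:
        raise ValueError("confidence must be a number from 1 to 10")
    return reply


# Hypothetical reply with extra text around the JSON object.
sample = 'Here you go: {"answer": "example", "explanation": "short", "confidence": 8}'
parsed = parse_answer(sample)
```

The `confidence` value of 0 in the prompt template is a placeholder and falls outside the requested scale, so a reply that echoes the template unchanged is rejected by this check\.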
### D\.2 Judge Prompts

We operationalize all evaluation metrics using structured prompts to two independent LLM judges\. Listings [3](https://arxiv.org/html/2605.30481#LST3), [4](https://arxiv.org/html/2605.30481#LST4), [5](https://arxiv.org/html/2605.30481#LST5), [6](https://arxiv.org/html/2605.30481#LST6), [7](https://arxiv.org/html/2605.30481#LST7), [8](https://arxiv.org/html/2605.30481#LST8), [9](https://arxiv.org/html/2605.30481#LST9), and [10](https://arxiv.org/html/2605.30481#LST10) show the prompts used to extract semantic entailment, detect genuine knowledge gaps and abstentions, identify global narrative dominance, classify validation sources, and measure epistemic perspective coverage\. Each prompt returns a constrained JSON output to ensure deterministic parsing and reproducibility\. These prompts decompose model behavior into interpretable signals that capture factual consistency, reliance on globally dominant narratives, institutional framing, and diversity of perspectives\.

You are an expert in Bengali cultural knowledge and a textual entailment system\.

Do not provide any explanation\.

Provide only the output in the specified format\.

Task:

Judge and score how much A entails B\.

Check the facts of both claims and score the entailment\.

Input:

Claim A: \{claim\_bn\}

Claim B: \{claim\_en\}

Output:

\{\{

"entailment\_score": 0\.0\-1\.0

\}\}

Listing 3: Judge prompt for extracting entailment score for natural language inference\.

You are an expert evaluator of whether a language model genuinely lacks knowledge for a given question\.

Do not provide any explanation\.

Provide only the output in the specified format\.

\#\# TASK

Your goal is to determine:

K\(q\)=1 when the model lacks the required knowledge to answer correctly

K\(q\)=0 when the model has the knowledge \(even if the answer is imperfectly expressed\)

\#\# IMPORTANT:

K\(q\) is NOT about whether the question is niche or difficult\.

K\(q\) is about whether the MODEL demonstrates a knowledge gap in its response\.

\#\# INPUT:

Question: \{question\}

Response: \{response\}

\#\# DEFINITION OF KNOWLEDGE GAP \(K=1\):

Set K=1 if the response shows evidence that the model does NOT have the required knowledge, including:

1\. Fabrication or hallucination

2\. Global substitution

3\. Misinterpretation

4\. Confident but incorrect answer

5\. Evasive but non\-abstaining

\#\# SET K=0 if:

\- Response is factually correct

\- Minor omissions exist

\- Proper abstention exists

\#\# OUTPUT:

\{\{

"K": 0 or 1,

"confidence": 0\.0\-1\.0

\}\}

Listing 4: Judge prompt for identifying knowledge gaps\.

You are an expert evaluator of Bengali cultural knowledge, and your task is to determine whether model abstains or not\.

Do not provide any explanation\.

Provide only the output in the specified format\.

\#\# Task:

Determine whether the model abstains\.

\#\# Definition:

Abstention = explicit uncertainty\.

\#\# Input:

Response: \{response\}

\#\# Output JSON:

\{\{

"abstain": 0 or 1

\}\}

Listing 5: Judge prompt for detecting model abstention\.

You are an expert evaluator of Bengali cultural knowledge\.

Your tasks would be identifying global narrative dominance from LLM responses\.

Do not provide any explanation\.

Provide only the output in the specified format\.

\#\# Task:

You are evaluating whether a response reflects a globally dominant narrative\.

\#\# Definition:

A globally dominant narrative:

\- aligns with widely known institutional knowledge

\- ignores, omits, or overrides culturally specific or locally grounded perspectives

\- replaces local practices, beliefs, or interpretations with generic global explanations

\- presents dominant narratives as universally applicable

\#\# Input:

Question: \{question\}

Response: \{response\}

\#\# Output JSON:

\{\{

"GND": 0 or 1

\}\}

Listing 6: Judge prompt for identifying global narrative dominance\.

You are an expert evaluator of Bengali cultural knowledge\.

Your task is to detect the validation source of the LLM explanation\.

Do not provide any explanation\.

Provide only the output in the specified format\.

\#\# TASK

Classify the PRIMARY validation source used in the model explanation\.

\#\# Categories:

\- institutional \(Wikipedia, research papers, formal history\)

\- local consensus \(shared practices, everyday experience, or social agreement, even without formal recognition\)

\- oral tradition \(storytelling, folklore, proverbs, or intergenerational verbal communication, rather than written sources\)

\- niche knowledge: recognized mainly within a specialized group \(e\.g\., particular profession, regional community, activist group, or cultural subcommunity\)\.

\- none \(no justification given\)

\#\# INPUT:

Explanation: \{model\_explanation\}

\#\# Output JSON:

\{\{

"source": "institutional\|local\_consensus\|oral\|niche\|none"

\}\}

Listing 7: Judge prompt for classifying primary validation source\.

You are an expert evaluator of Bengali cultural knowledge\.

Your task is to extract the viewpoints of the LLM response\.

Do not provide any explanation\.

Provide only the output in the specified format\.

\#\# Task

Extract distinct viewpoints expressed in the response\.

\#\# Definition:

A viewpoint = a culturally or epistemically distinct interpretation\.

\#\# Input:

Question: \{question\}

Response: \{response\}

\#\# Output JSON:

\{\{

"perspectives": \["","",\.\.\.\]

\}\}

Listing 8: Judge prompt for extracting viewpoints expressed in the LLM response\.

You are an information extraction system\. Your task to break the response into atomic meaning units \(claims or sentences\), then classify the epistemic perspective\.

\#\# Rules for breaking into atomic meaning units:

\- Each unit should express one idea

\- Keep them short and self\-contained

\#\# Definitions for epistemic perspective:

1\. Global: generic, widely documented, institutional, Western/global narrative

2\. Local: culturally specific, community\-based, oral, contested, non\-global

Response: \{response\}

\#\# Output JSON format:

\{\{

"units": \[

\{\{"unit": "atomic unit 1", "label": "global\|local\|neutral"\}\},

\{\{"unit": "atomic unit 2", "label": "global\|local\|neutral"\}\},

\.\.\.\.

\}\}

Listing 9: Judge prompt for extracting atomic units along with epistemic perspective classification\.

You are an expert evaluator of Bengali cultural knowledge\.

Your task is to count how many perspectives match\.

Do not provide any explanation\.

Provide only the output in the specified format\.

\#\# TASK

You are given two lists of perspectives expressed in the response and statement\.

Your task is to count how many viewpoints expressed in response match the viewpoints in the statement\.

\#\# Input:

statement viewpoints: \{statement\_perspectives\}

response viewpoints: \{response\_perspectives\}

\#\# Output JSON:

\{\{

"match": integer

\}\}

Listing 10: Judge prompt for matching perspectives between LLM response and gold statement\.
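The doubled braces in the judge listings are consistent with Python `str.format` templates, where `{{` and `}}` render as literal braces\. The paper does not state how the prompts are filled, so this is an assumption\. Under it, the sketch below fills an abridged version of the abstention prompt \(Listing 5\) and validates the binary verdict; `ABSTAIN_TEMPLATE`, `build_prompt`, and `parse_binary` are hypothetical names\.

```python
import json

# Abridged from Listing 5; doubled braces become literal braces after format().
ABSTAIN_TEMPLATE = (
    "## Task:\n"
    "Determine whether the model abstains.\n"
    "## Input:\n"
    "Response: {response}\n"
    "## Output JSON:\n"
    "{{\n"
    '"abstain": 0 or 1\n'
    "}}"
)


def build_prompt(response: str) -> str:
    """Insert the model response into the judge template."""
    return ABSTAIN_TEMPLATE.format(response=response)


def parse_binary(raw: str, key: str) -> int:
    """Read a 0/1 verdict from the judge's JSON output and reject anything else."""
    value = json.loads(raw)[key]
    if isinstance(value, bool) or value not in (0, 1):
        raise ValueError(f"{key} must be 0 or 1, got {value!r}")
    return int(value)


prompt = build_prompt("I am not sure.")
verdict = parse_binary('{"abstain": 1}', "abstain")
```

The same `parse_binary` check applies unchanged to the `K` and `GND` fields, since those prompts also constrain the output to 0 or 1\.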

## Appendix E Statistical Analysis on Judges and Metrics

### E\.1 Validation of Human and LLM Judges

Table 5: Pearson correlation analysis between LLM judges and human evaluation across GSR, IBR, and EPC metrics\.

Table 6: McNemar test results comparing binary agreement patterns between LLM judges and human evaluation\.

Statistical analysis evaluates the agreement between LLM\-based judges \(GPT and Mistral\) and human evaluation across the proposed cultural metrics\. Pearson correlation analysis shows generally strong positive relationships between automated and human judgments, although most correlations do not reach statistical significance due to the limited sample size \($n=3$ metrics per comparison\)\.

For the Sonnet model, both GPT and Mistral judges exhibit very strong correlations with human judgments \($r=0.8187$ and $r=0.8561$, respectively\), indicating that both judges broadly capture similar evaluation trends as humans\. The strongest agreement is observed between GPT and Mistral judges themselves \($r=0.9977$, $p<0.05$\), suggesting substantial consistency between the two automated evaluators for Sonnet outputs\.

For the GPT model, Mistral demonstrates stronger alignment with human judgments \($r=0.8930$\) than GPT self\-evaluation \($r=0.4559$\)\. This finding supports the observation that GPT\-based judging may under\-detect culturally inappropriate framings when evaluating outputs generated by models with similar training priors\. Although the correlations remain positive, the lack of statistical significance indicates that these findings should be interpreted cautiously and validated with larger\-scale human evaluation\.
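To see why even correlations above 0\.8 fall short of significance here, assume the usual two\-sided t\-test for Pearson's $r$ \(the paper does not name the test\)\. With $n=3$ there is one degree of freedom, $t=r/\sqrt{1-r^{2}}$ follows a standard Cauchy distribution under the null, and the p\-value has the closed form $p=1-\tfrac{2}{\pi}\arcsin|r|$\. The sketch below applies this to the coefficients reported above; the dictionary keys are descriptive labels added here\.

```python
import math


def pearson_p_value_n3(r: float) -> float:
    """Two-sided p-value for Pearson's r when n = 3 (t-test with 1 degree of freedom)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    return 1.0 - (2.0 / math.pi) * math.asin(abs(r))


# Correlation coefficients as reported in the text above.
reported = {
    "Sonnet: GPT judge vs human": 0.8187,
    "Sonnet: Mistral judge vs human": 0.8561,
    "Sonnet: GPT judge vs Mistral judge": 0.9977,
    "GPT: Mistral judge vs human": 0.8930,
    "GPT: GPT judge vs human": 0.4559,
}
p_values = {name: pearson_p_value_n3(r) for name, r in reported.items()}
significant = [name for name, p in p_values.items() if p < 0.05]
```

Under this assumption only the judge\-to\-judge correlation of 0\.9977 clears the 0\.05 level, which matches the text; with three points, $|r|$ must exceed roughly 0\.997 to do so\.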

The McNemar tests further show no statistically significant disagreement between LLM judges and human annotations across all evaluated settings \($p>0.05$\)\. This suggests that, at a binary decision level, the automated judges do not systematically diverge from human evaluators\. However, the perfect or near\-perfect agreement patterns also reflect the limited sample size and coarse binary aggregation\.
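The caveat about sample size can be made concrete with the exact \(binomial\) form of McNemar's test, which depends only on the two discordant counts\. The paper does not say which variant was used or report the discordant counts, so the function below is a generic sketch and the inputs are toy values\.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts b and c."""
    if b < 0 or c < 0:
        raise ValueError("discordant counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # Perfect agreement: no evidence of systematic disagreement.
    # Binomial(n, 0.5) probability of a split at least as uneven as min(b, c).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


p_balanced = mcnemar_exact(1, 3)   # toy counts
p_five = mcnemar_exact(0, 5)       # most extreme split with five discordant pairs
p_six = mcnemar_exact(0, 6)        # most extreme split with six discordant pairs
```

With five or fewer discordant pairs the exact test cannot fall below 0\.05 even when every disagreement points the same way, so a non\-significant result on a small sample is weak evidence of agreement\.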

Overall, the statistical results indicate that LLM judges capture similar directional trends as human evaluators, particularly for detecting global substitution and institutional framing biases\. Nevertheless, the stronger correlations observed for Mistral compared to GPT self\-evaluation reinforce the paper’s broader argument that LLM\-as\-judge frameworks may inherit model\-specific epistemic priors\. These findings motivate the continued inclusion of culturally grounded human evaluation when auditing multilingual cultural reasoning in LLMs\.

### E\.2 Correlation Among Evaluation Metrics

Table 7: Pairwise Pearson Correlations Among Evaluation Metrics

The Pearson correlation analysis reveals several important relationships among the evaluation metrics\. The strongest positive correlation is observed between GSR and IBR \($r=0.556$, $p<0.001$\), indicating that these two measures capture related aspects of the underlying phenomenon\. This suggests that increases in global substitution are associated with higher Institutional Bias Rate scores\. However, GSR captures narrative dominance toward globally prevalent interpretations, whereas IBR captures reliance on institutional authority as the primary source of legitimacy\.

In contrast, GSR shows significant negative correlations with EPC \($r=-0.316$, $p=0.0068$\) and LAB \($r=-0.499$, $p<0.001$\)\. These results imply that higher GSR values tend to correspond with lower EPC and LAB scores, indicating potential conceptual divergence between these metrics\.

Similarly, IBR is negatively correlated with EPC \($r=-0.366$, $p=0.0016$\), suggesting that the constructs measured by EPC may capture behavior distinct from both GSR and IBR\. Meanwhile, EPC and LAB exhibit a moderate positive correlation \($r=0.395$, $p<0.001$\), implying that these metrics may reflect partially overlapping dimensions\.

Notably, CLFC demonstrates weak and statistically non\-significant correlations with all other metrics, with coefficients ranging from $-0.143$ to $0.153$\. This indicates that CLFC captures a relatively independent characteristic and provides complementary information rather than redundant measurement\.

Overall, the correlation patterns suggest that while certain metrics share moderate relationships, none of the correlations are excessively high \(e\.g\., $r>0.80$\)\. This supports the discriminant validity of the evaluation framework, indicating that the metrics measure related but distinct dimensions rather than redundant constructs\.
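The redundancy check described above can be sketched using only the coefficients quoted in this section\. Comparing the absolute value of $r$ against the threshold is an assumption of this sketch \(the text gives $r>0.80$ as an example\), and pairs whose coefficients are not quoted individually, such as the CLFC pairs, are left out\.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

# Pairwise Pearson coefficients quoted in the text of this section.
reported_r: Dict[Pair, float] = {
    ("GSR", "IBR"): 0.556,
    ("GSR", "EPC"): -0.316,
    ("GSR", "LAB"): -0.499,
    ("IBR", "EPC"): -0.366,
    ("EPC", "LAB"): 0.395,
}


def redundant_pairs(correlations: Dict[Pair, float], threshold: float = 0.80) -> List[Pair]:
    """Metric pairs whose correlation magnitude exceeds the redundancy threshold."""
    return [pair for pair, r in correlations.items() if abs(r) > threshold]


flagged = redundant_pairs(reported_r)
strongest = max(reported_r, key=lambda pair: abs(reported_r[pair]))
```

No quoted pair is flagged, and the strongest relationship \(GSR with IBR\) stays well below the threshold, in line with the discriminant\-validity reading\.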

## Appendix F Additional Result Analysis

### F\.1 Question\-only Setting

#### F\.1\.1 Domain\-Level Analysis

Table 8: Domain\-wise Global Substitution Rate \(GSR\) in the question\-only setting\. Lower values indicate better preservation of culturally grounded answers, while higher values indicate more frequent replacement of local referents with globally dominant ones\. Computed following Eq\. [3](https://arxiv.org/html/2605.30481#A3.E3)\.

##### Global Substitution

Table [8](https://arxiv.org/html/2605.30481#A6.T8) reveals strong domain and language effects in global substitution\. Across both judges, Bangla prompts generally yield lower GSR than English prompts, indicating that asking questions in English more often shifts responses toward globally salient interpretations\. The effect is most pronounced in History & Politics and Geography & National Identity, where culturally specific entities are frequently replaced by internationally dominant narratives\. Traditional Medicine & Ecology exhibits the lowest GSR under the GPT judge, suggesting that some locally grounded knowledge remains relatively stable in Bangla, though this advantage largely disappears under English prompting\. Domain rankings are broadly consistent across the two judges despite differences in absolute scores, indicating robust agreement on which cultural areas are most vulnerable to substitution\. Overall, the results show that global narrative dominance is systematic rather than uniform: it is amplified in domains tied to national identity and historical interpretation, where local meanings directly compete with globally dominant frames\.

Table 9: Domain\-wise Institutional Bias Rate \(IBR\) in the question\-only setting\. Lower values indicate greater reliance on locally grounded explanations, while higher values indicate stronger dependence on globally legitimized institutional framing\.

##### Institutional Bias

Table [9](https://arxiv.org/html/2605.30481#A6.T9) shows that institutional bias varies substantially across domains, languages, and judges\. Across both judges, History & Politics consistently exhibits the highest IBR, indicating that responses in this domain are most likely to privilege globally recognized institutional narratives over locally grounded interpretations\. Geography & National Identity also shows elevated bias, particularly under the Mistral judge, suggesting that questions involving national identity are especially susceptible to institutional framing\. In contrast, Traditional Medicine & Ecology yields the lowest IBR across nearly all models, reflecting greater reliance on community\-based or practice\-oriented knowledge\. English prompts generally increase IBR relative to Bangla prompts, especially under the Mistral judge, indicating that language choice amplifies dependence on globally legitimized sources\. Despite differences in absolute scores, both judges agree that institutional bias is concentrated in domains where historical authority and national narratives compete with local epistemic traditions\.

Table 10: Domain\-wise Epistemic Perspective Coverage \(EPC\) in the question\-only setting across the Bangla and English languages, evaluated by GPT and Mistral judges\. Higher values indicate broader inclusion of culturally relevant perspectives\.

##### Epistemic Perspective Coverage

Table [10](https://arxiv.org/html/2605.30481#A6.T10) shows that providing evidence substantially improves epistemic perspective coverage across all models and domains, with most scores falling between 0\.55 and 0\.70\. The highest coverage consistently appears in Traditional Medicine & Ecology and Religion, Folklore & Mythology, suggesting that explicit local evidence helps models preserve plural and culturally grounded interpretations in domains where knowledge is often community\-specific or orally transmitted\. In contrast, History & Politics remains the most difficult domain, exhibiting the lowest EPC scores for nearly all models, which indicates a continued tendency to collapse contested narratives into narrower institutional accounts\. Across models, Gemini\-3\.1 Pro achieves the strongest and most consistent perspective coverage under both judges, while Claude Sonnet 4\.6 and GPT\-5\.4 also perform well\. Open\-weight models such as Llama 4 Maverick and Qwen 3\.6 generally produce lower EPC, particularly in politically sensitive domains\. The two judges show strong agreement on overall trends, reinforcing the robustness of these findings\. Together, the results demonstrate that evidence improves cultural grounding but does not fully eliminate domain\-specific limitations, especially where multiple competing historical interpretations coexist\.

Table 11: Domain\-wise Cross\-Lingual Factual Consistency \(CLFC\) for nine LLMs, evaluated by two independent judges\. Higher scores indicate stronger semantic consistency between Bangla and English responses to the same culturally grounded question\.

##### Cross\-lingual Factual Consistency

Table [11](https://arxiv.org/html/2605.30481#A6.T11) presents cross\-lingual factual consistency, which is uniformly low, indicating that models often produce substantially different claims when the same culturally grounded question is asked in Bangla and English\. Under the stricter GPT\-5\.4\-mini judge, most scores remain below 0\.15, suggesting near\-complete semantic divergence across languages\. The Mistral judge yields higher absolute values but preserves the same ranking patterns, confirming that cross\-lingual instability is robust to judge choice\. Google, Anthropic, and OpenAI frontier models perform best overall, with Google Gemini\-3\.1 Pro achieving the highest consistency across most domains \(up to 0\.453 in Religion, Folklore & Mythology\), followed by Anthropic Sonnet 4\.6 and OpenAI GPT\-5\.4\. However, even these strongest systems remain far from stable cross\-lingual grounding\. Domain effects are pronounced: History & Politics and Geography & National Identity generally exhibit higher consistency, whereas Art, Literature & Cultural Practices and Traditional Medicine & Ecology show the lowest scores, reflecting greater susceptibility to culturally specific reinterpretation\. These results demonstrate that multilingual competence does not guarantee semantic equivalence across languages; instead, changing the language of interaction can substantially alter the factual content of model responses\.

Table 12: Domain\-wise Language Anchor Bias \(LAB\) for nine LLMs, evaluated by two independent judges\. Higher scores indicate a stronger tendency for English prompts to shift responses toward globally dominant interpretations relative to equivalent Bangla prompts\.

##### Language Anchor Bias

Table [12](https://arxiv.org/html/2605.30481#A6.T12) presents language anchor bias, which is consistently positive across all models and domains, demonstrating that the language of interaction systematically affects cultural interpretation\. Asking the same question in English reliably shifts responses toward globally dominant narratives, even when the underlying cultural content is unchanged\. This pattern is robust across both judges, with GPT\-5\.4\-mini assigning higher absolute scores and Mistral 4 Small preserving the same relative ordering\. Traditional Medicine & Ecology and Art, Literature & Cultural Practices exhibit the strongest anchoring effects, with several models exceeding 0\.50, indicating that culturally specific practices are particularly vulnerable to reinterpretation through globally familiar frames\. Among the models, GPT\-5\.4 shows the lowest overall LAB, while Gemma 4, Qwen 3\.6 Plus, and DeepSeek v3\.2 display the largest shifts\. Frontier systems such as Sonnet 4\.6 and Gemini\-3\.1 Pro reduce but do not eliminate this effect\. In particular, no model approaches zero across domains, confirming that multilingual capability does not guarantee language\-invariant reasoning\. Instead, English functions as a strong contextual prior that systematically alters how LLMs retrieve and prioritize cultural knowledge\.

#### F\.1\.2 Analysis on Knowledge Frequency, Epistemic Status, and Validation Type

Table 13: GSR scores by model, knowledge frequency, epistemic status, and validation type\. Occ\.=Occasional, Fre\.=Frequent, Cont\.=Contested, Inst\.=Institutional, Local C\.=Local Consensus, Oral=Oral Tradition\. Following Eq\. [3](https://arxiv.org/html/2605.30481#A3.E3); Global epistemic\-status column omitted\.

##### Global Substitution

Table [13](https://arxiv.org/html/2605.30481#A6.T13) reports Global Substitution Rate \(GSR\) by language, model, judge, and sociocultural category\. Since GSR measures whether a locally grounded answer is replaced by a globally dominant one, it is computed only for local and contested instances; the global epistemic\-status column is therefore omitted\.

The dominant pattern is a large Bangla–English gap\. Under the GPT judge, Bangla prompts generally yield lower GSR, often below 0\.35 for models such as GPT, Grok, Sonnet, and DeepSeek\. In contrast, English prompts sharply increase global substitution, with many models reaching 0\.60–0\.80 across knowledge\-frequency groups\. This effect is especially strong for Gemma, Qwen, Llama, Grok, Mistral, and DeepSeek, indicating that the same Bengali cultural content is much more likely to be resolved through globally dominant referents when asked in English\.

The Mistral judge assigns higher absolute GSR values, but confirms the same trend\. Even in Bangla, several models show moderate substitution, while English prompts further increase GSR across nearly all models and categories\. Under this judge, Gemma, Qwen, Llama, and Grok often exceed 0\.80 in English, showing strong reliance on globally salient interpretations\. Therefore, judge calibration differs, but both judges agree that English substantially amplifies global substitution\.

The category\-level breakdown shows that this effect is not limited to rare knowledge\. English\-side GSR remains high across occasional, frequent, and rare items, suggesting that global substitution is not simply a data\-sparsity problem\. Similarly, both local and contested items are vulnerable, with contested items often showing particularly high substitution\. Across validation types, institutional, niche, local\-consensus, and oral\-tradition items all show increased GSR in English, indicating that models struggle to preserve local sources of legitimacy as well as local factual referents\.

Overall, Table [13](https://arxiv.org/html/2605.30481#A6.T13) provides strong evidence that global substitution is systematic and language\-conditioned\. English prompts consistently push models away from Bengali cultural grounding and toward globally dominant interpretations, and this pattern persists across judges, models, knowledge\-frequency levels, epistemic\-status categories, and validation types\.

Table 14: Institutional Bias Rate \(IBR\) by model, knowledge frequency, epistemic status, and validation type\. Higher values indicate stronger reliance on institutional or globally dominant sources over local cultural perspectives\. Occ\.=Occasional, Fre\.=Frequent, Cont\.=Contested, Inst\.=Institutional, Local C\.=Local Consensus, Oral=Oral Tradition\.

##### Institutional Bias

Table [14](https://arxiv.org/html/2605.30481#A6.T14) reports the Institutional Bias Rate \(IBR\) across languages, models, judges, and sociocultural categories\. The main pattern is that English prompts generally increase institutional framing\. Under the GPT judge, IBR remains relatively low overall, especially for Bangla prompts, where many models stay below 0\.20 across knowledge\-frequency categories\. However, English prompts raise IBR for most models, particularly DeepSeek, Mistral, Gemma, and Sonnet, indicating a stronger tendency to frame answers through institutional or globally legitimized sources\.

The Mistral judge assigns substantially higher IBR values, but preserves the same directional trend\. Bangla responses already show moderate institutional bias for several models, while English responses increase sharply, often exceeding 0\.50 and sometimes reaching above 0\.70 for Gemma, DeepSeek, Mistral, and Sonnet\. This suggests that absolute IBR values are judge\-dependent, but both judges agree that English prompts make institutional framing more likely\.

The category\-level results show that institutional bias is strongest for contested and globally framed items, as well as items whose validation type is institutional\. This is expected, since such items are more likely to activate formal or dominant knowledge sources\. However, English also increases IBR for niche, local\-consensus, and oral\-tradition items, showing that even community\-grounded knowledge can be reframed through institutional authority when prompted in English\.

Overall, Table [14](https://arxiv.org/html/2605.30481#A6.T14) supports the claim that institutional bias is a systematic language\-conditioned effect\. Models not only change what they answer across languages; they also change which sources of authority they implicitly privilege, with English prompts pushing responses toward globally recognized and institutionally dominant framings\.

Table 15: Epistemic perspective coverage \(EPC\) scores by model, knowledge frequency, epistemic status, and validation type\. Higher scores indicate broader EPC\. Occ\.=Occasional, Fre\.=Frequent, Cont\.=Contested, Inst\.=Institutional, Local C\.=Local Consensus, Oral=Oral Tradition\.

##### Epistemic Perspective Coverage

Table [15](https://arxiv.org/html/2605.30481#A6.T15) reports epistemic perspective coverage \(EPC\) across languages, models, judges, and sociocultural categories\. Since higher EPC indicates broader coverage of locally relevant viewpoints, the table shows how well models preserve cultural plurality rather than collapsing answers into a single dominant framing\.

The main pattern is that Bangla prompts generally produce higher EPC than English prompts\. Under both GPT\- and Mistral\-as\-judge settings, most models lose perspective coverage when the same cultural content is asked in English\. This drop is especially visible for Qwen, Gemma, Llama, Grok, and Sonnet, suggesting that English prompts not only increase global substitution and institutional bias, but also narrow the range of local interpretations represented in the response\.

Model\-level differences are also clear\. GPT, Gemini, and Sonnet generally achieve the strongest EPC, especially in Bangla, while Gemma, Qwen, Llama, and Grok tend to provide narrower coverage\. The two judges differ slightly in calibration, but they agree on the overall ranking and the language effect: Bangla responses preserve more epistemic diversity than English responses\.

The category\-level breakdown shows that rare and contested knowledge are the most difficult\. EPC is often lower for rare items than for frequent or occasional items, indicating that models cover fewer perspectives when the cultural knowledge is less visible\. Contested items also receive consistently lower EPC than local or global\-status items, which suggests that models struggle to represent disagreement or plural interpretations\. By validation type, local\-consensus and oral\-tradition items often receive relatively high EPC, while niche and institutional categories are more variable, indicating that models are better at broad cultural descriptions than at preserving specialized or authority\-sensitive epistemic contexts\.

Overall, Table [15](https://arxiv.org/html/2605.30481#A6.T15) supports the claim that language affects not only factual content but also epistemic breadth\. English prompts systematically reduce perspective coverage across models and cultural categories, reinforcing that global narrative dominance operates by narrowing the range of locally grounded viewpoints available in model responses\.

Table 16: Cross\-Lingual Factual Consistency \(CLFC\) by model, knowledge frequency, epistemic status, and validation type under two independent judges\. Lower values indicate less consistency between Bangla and English responses\.
##### Cross\-Lingual Factual Consistency

Table [16](https://arxiv.org/html/2605.30481#A6.T16) reports cross\-lingual factual consistency \(CLFC\) across models and sociocultural categories\. Since higher CLFC indicates stronger semantic consistency between Bangla and English responses, lower values reflect greater language\-induced factual instability\.

The table shows that cross\-lingual consistency is generally weak and highly category\-dependent\. Under the GPT judge, CLFC values are low across most models and categories, often below 0\.15\. Gemini, Qwen, Llama, and Sonnet show relatively better consistency, while DeepSeek is consistently among the weakest\. Under the Mistral judge, absolute CLFC values are higher, but the model\-level pattern is similar: Gemini, GPT, and Sonnet are more stable, whereas DeepSeek, Grok, and Mistral remain less consistent\.

The category breakdown further shows that rare and niche knowledge tend to have lower CLFC than frequent or institutional knowledge\. This suggests that models are less able to preserve factual meaning across languages when the knowledge is less visible or less standardized\. Global and institutional categories often receive higher CLFC, indicating that models are more stable when the content aligns with widely circulated or formally documented knowledge\. In contrast, local, contested, oral tradition, and niche categories are more fragile, reflecting the difficulty of preserving culturally specific interpretations across Bangla and English\.

Overall, Table [16](https://arxiv.org/html/2605.30481#A6.T16) supports the central claim that cross\-lingual instability is not uniform: it is strongest for culturally local, rare, contested, and community\-grounded knowledge\. Although judges differ in calibration, both show that even stronger models fail to maintain consistently high factual equivalence across languages\.

Table 17: Language Anchor Bias \(LAB\) by model across knowledge frequency, epistemic status, and validation type\. Higher values indicate a stronger tendency for model responses to differ depending on whether the question is asked in Bangla or English, reflecting greater language\-dependent framing bias\.
##### Language Anchor Bias

Table [17](https://arxiv.org/html/2605.30481#A6.T17) reports Language Anchor Bias \(LAB\) across models and sociocultural categories\. Higher LAB indicates that model responses are more strongly influenced by the language of the prompt, meaning that the same culturally grounded question elicits different framings when asked in Bangla versus English\.

The results show that language anchoring is pervasive across all evaluated models\. Under the GPT judge, most systems exhibit moderate to high LAB, with values typically ranging from 0\.40 to 0\.55 across categories\. GPT is the most stable model, maintaining relatively low LAB \(approximately 0\.33 across most knowledge\-frequency groups\) and showing particularly weak anchoring on oral\-tradition items\. In contrast, Qwen, Gemma, DeepSeek, Grok, and Sonnet display substantially higher LAB, indicating greater sensitivity to whether questions are asked in Bangla or English\.

The Mistral judge assigns lower absolute LAB values for some models, but the overall ranking remains consistent\. Llama and GPT appear comparatively stable, whereas DeepSeek, Gemini, Gemma, and Sonnet show stronger language dependence\. DeepSeek stands out under this judge, with particularly high LAB for frequent knowledge, contested items, institutional validation, and local\-consensus categories\. This suggests that language anchoring persists even for culturally prominent and socially validated knowledge\.

The category\-level analysis further shows that LAB is not limited to rare or obscure items\. Frequent and occasional knowledge often exhibit anchoring comparable to, or greater than, rare knowledge, indicating that these differences cannot be explained solely by knowledge scarcity\. Contested items generally show higher LAB than local items, suggesting that models are especially sensitive to prompt language when cultural interpretations are disputed\. Across validation types, institutional and local\-consensus categories tend to exhibit elevated LAB, while oral\-tradition items show greater variation across models and judges\.

Overall, Table [17](https://arxiv.org/html/2605.30481#A6.T17) demonstrates that prompt language serves as a powerful epistemic anchor\. Even when the underlying question remains identical, models often shift their framing depending on whether the question is written in Bangla or English\. These findings support our central claim that global narrative dominance arises not only from factual substitution, but also from language\-conditioned differences in how models prioritize and present cultural knowledge\.

### F\.2 Evidence\-based Setting

#### F\.2\.1 Domain\-Level Analysis

Table 18: Global Substitution Rate \(GSR\) in the evidence\-based setting across categories\. Lower values indicate better preservation of culturally appropriate answers, while higher values indicate greater substitution of local references with globally dominant alternatives even when supporting evidence is provided\. See Eq\. [3](https://arxiv.org/html/2605.30481#A3.E3)\. BN: Bangla, EN: English\.
##### Global Substitution

Table [18](https://arxiv.org/html/2605.30481#A6.T18) shows that global substitution persists even when local supporting evidence is provided\. Across both judges, English prompts yield higher GSR than Bangla prompts in nearly all thematic categories, indicating that evidence does not fully prevent models from replacing Bengali cultural referents with globally dominant alternatives\. Under the GPT judge, Bangla GSR is comparatively lower, often below 0\.35 for several models and domains, while English GSR frequently rises above 0\.70, especially in traditional medicine and ecology, religion and mythology, and geography and national identity\. The Mistral judge assigns higher absolute GSR values, but preserves the same trend: English responses remain highly substitutional across domains, often exceeding 0\.80\. The effect is broad rather than domain\-specific, appearing in history and politics, cultural practices, ecological knowledge, religious folklore, and national identity\. A few model–domain exceptions occur, but they do not alter the overall pattern\. These results strengthen our central claim that global narrative dominance is not only caused by missing knowledge; even when local evidence is available, English prompts continue to anchor models toward globally dominant interpretations\.

Table 19: Institutional Bias Rate \(IBR\) in the evidence\-based setting across thematic categories\. Lower values indicate less reliance on institutional or globally legitimized framings when local supporting evidence is provided\. BN: Bangla, EN: English\.
##### Institutional Bias

Table [19](https://arxiv.org/html/2605.30481#A6.T19) shows that institutional framing persists even when models are given local evidence\. Under the GPT judge, IBR is generally low in most domains, especially in Traditional Medicine and Ecology and in Religion, Folklore, and Mythology\. However, History and Politics remains consistently higher across models, and Geography and National Identity also shows elevated bias for several systems, indicating that evidence does not fully prevent models from relying on formal or dominant institutional narratives in politically and nationally salient domains\. The Mistral judge assigns higher absolute IBR values, but reveals the same domain structure more strongly: History and Politics is the most institutionally biased category, often exceeding 0\.60, followed by Geography and National Identity\. English prompts often further increase IBR under the Mistral judge, particularly for Gemma, Llama, Grok, and Sonnet\. Overall, the table shows that institutional bias is both domain\-sensitive and language\-conditioned: local evidence reduces some errors, but models continue to privilege institutional authority in domains where cultural knowledge is closely tied to history, statehood, and national identity\.

Table 20: Epistemic Perspective Coverage \(EPC\) in the evidence\-based setting across categories\. Higher values indicate broader coverage of locally relevant perspectives when supporting evidence is provided\. BN: Bangla, EN: English\.
##### Epistemic Perspective Coverage

Table [20](https://arxiv.org/html/2605.30481#A6.T20) shows that local evidence substantially improves perspective coverage, but coverage remains language\-dependent\. Across both judges, Bangla prompts generally achieve higher EPC than English prompts in most model–domain combinations, indicating that English still narrows the range of locally relevant perspectives even when supporting evidence is available\. Under the GPT judge, EPC is high for three domains: Art, Literature, and Cultural Practices; Religion, Folklore, and Mythology; and Geography and National Identity, often exceeding 0\.70 in Bangla\. Traditional Medicine and Ecology is consistently lower, especially in English, suggesting that ecological and vernacular knowledge remains difficult for models to represent broadly\. The Mistral judge gives slightly lower absolute scores but preserves the same pattern: Bangla responses usually retain broader epistemic coverage, while English responses decline most clearly in History and Politics, Geography and National Identity, and Traditional Medicine and Ecology\. Model\-wise, GPT, Gemini, Gemma, and Sonnet tend to show stronger coverage, whereas Llama is consistently lower\. Overall, the table supports our central claim that evidence helps models include more local perspectives, but does not fully prevent English\-induced epistemic narrowing\.

Table 21: Cross\-Lingual Factual Consistency \(CLFC\) in the evidence\-based setting across thematic categories\. Higher values indicate stronger semantic consistency between Bangla and English responses to the same culturally grounded question\.
##### Cross\-lingual Factual Consistency

Table [21](https://arxiv.org/html/2605.30481#A6.T21) shows that cross\-lingual factual consistency remains limited even when local evidence is provided\. Under the GPT judge, CLFC is low across most domains, often below 0\.30, with the weakest consistency appearing in Religion, Folklore, and Mythology and in Traditional Medicine and Ecology\. Llama, Grok, and Qwen achieve the highest GPT\-judged scores in some domains, especially Geography and National Identity, but no model is consistently stable across all categories\. The Mistral judge assigns higher absolute CLFC values, yet preserves the same broad pattern: GPT, Gemini, Grok, and Sonnet are comparatively stronger, while DeepSeek and Mistral remain weaker\. Geography and National Identity is often the most stable domain, likely because such knowledge is more standardized, whereas ecological, religious, and folklore\-based knowledge remains more cross\-lingually fragile\. Overall, the table supports our claim that evidence improves grounding but does not guarantee language\-invariant factual answers\.

Table 22: Language Anchor Bias \(LAB\) across categories\. Higher values indicate stronger language\-dependent shifts between Bangla and English prompts, reflecting greater anchoring of responses to the prompt language\.
##### Language Anchor Bias

Table [22](https://arxiv.org/html/2605.30481#A6.T22) shows that language anchoring is substantial across domains\. Under the GPT judge, most models show moderate to high LAB, often between 0\.40 and 0\.60\. The effect is strongest in Art, Literature, and Cultural Practices and in Traditional Medicine and Ecology, where models such as Gemma, DeepSeek, Mistral, and Sonnet show particularly large Bangla–English shifts\. GPT and Llama are comparatively more stable, but still exhibit non\-trivial anchoring across all domains\. The Mistral judge assigns lower absolute LAB values for several models, yet preserves the same conclusion: prompt language continues to alter model framing, especially for DeepSeek, Gemma, Sonnet, and Mistral\. Domain\-wise, culturally interpretive domains such as folklore, cultural practices, and ecological knowledge remain especially sensitive to language\. Overall, the table supports our claim that language is not a neutral interface: asking the same Bengali cultural question in Bangla or English can shift the epistemic framing of the response, even when the underlying content is unchanged\.

#### F\.2\.2 Analysis on Knowledge Frequency, Epistemic Status, and Validation Type

Table 23: Global Substitution Rate \(GSR\) in the evidence\-based setting across sociocultural annotations\. Lower values indicate better preservation of local cultural referents; higher values indicate stronger substitution with globally dominant alternatives\. Occ\.=Occasional, Fre\.=Frequent, Cont\.=Contested, Inst\.=Institutional, Local C\.=Local Consensus, Oral=Oral Tradition\. Global epistemic\-status column omitted; see Eq\. [3](https://arxiv.org/html/2605.30481#A3.E3)\.
##### Global Substitution

Table [23](https://arxiv.org/html/2605.30481#A6.T23) reports GSR in the evidence\-based setting across sociocultural annotations\. Because GSR measures substitution away from a locally grounded interpretation, it is computed only for local and contested items; the global epistemic\-status column is therefore omitted\. The central pattern is a large and consistent Bangla–English gap\. Under the GPT judge, Bangla prompts generally produce low\-to\-moderate substitution rates, mostly around 0\.20–0\.40, whereas English prompts increase GSR sharply, often to 0\.70–0\.85\. This increase appears for every model and across all annotation types, including frequent knowledge, local\-consensus items, institutional knowledge, and oral traditions\. Therefore, global substitution is not confined to rare or obscure cultural content\.

The Mistral judge assigns higher absolute GSR values, especially for Bangla responses, but confirms the same directional effect\. English prompts remain substantially more substitutional, with many scores above 0\.80 across knowledge\-frequency, epistemic\-status, and validation\-type categories\. The effect is particularly strong for Qwen, Gemini, Gemma, Llama, and Grok, while GPT is relatively more robust under the GPT judge but still shows large English\-side increases\. Validation\-type results are especially important: even local\-consensus and oral\-tradition items, whose legitimacy is grounded in community knowledge rather than global institutions, show high English GSR\. Overall, the table provides strong evidence that global narrative dominance persists despite evidence injection\. Models are not merely missing local knowledge; when prompted in English, they continue to prioritize globally salient referents over the local cultural grounding provided in the context\.
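The Bangla–English gap discussed above can be summarized per model as the English rate minus the Bangla rate, together with the share of models for which English is higher\. The following is a minimal sketch under stated assumptions: the function names, the placeholder model names, and the numbers are hypothetical and are not read from Table 23\.

```python
from typing import Dict, Tuple

# Hypothetical (Bangla, English) GSR pairs per model; NOT values from Table 23.
toy_gsr: Dict[str, Tuple[float, float]] = {
    "model_a": (0.25, 0.75),
    "model_b": (0.40, 0.80),
    "model_c": (0.30, 0.70),
}


def language_gaps(scores: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """English-minus-Bangla gap per model, rounded to three decimals."""
    return {model: round(en - bn, 3) for model, (bn, en) in scores.items()}


def share_with_english_increase(scores: Dict[str, Tuple[float, float]]) -> float:
    """Fraction of models whose English rate exceeds their Bangla rate."""
    if not scores:
        raise ValueError("no models provided")
    gaps = language_gaps(scores)
    return sum(gap > 0 for gap in gaps.values()) / len(gaps)


for model, gap in language_gaps(toy_gsr).items():
    print(f"{model}: EN - BN = {gap:+.2f}")
print(f"share of models with higher English GSR: {share_with_english_increase(toy_gsr):.2f}")
```

Reporting the signed gap alongside the share keeps the two parts of the claim separate: how large the English\-side increase is, and how consistently it appears across models\.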

Table 24: Institutional Bias Rate \(IBR\) in the evidence\-based setting across sociocultural annotations\. Lower values indicate less reliance on institutional or globally legitimized framings when local evidence is provided\. Occ\.=Occasional, Fre\.=Frequent, Cont\.=Contested, Inst\.=Institutional, Local C\.=Local Consensus, Oral=Oral Tradition\.
##### Institutional Bias

Table [24](https://arxiv.org/html/2605.30481#A6.T24) shows that institutional bias persists even when models are given local evidence, but its strength depends strongly on judge calibration and sociocultural category\. Under the GPT judge, IBR remains low overall, especially for local\-consensus, niche, and oral\-tradition items\. However, institutional and global\-status items consistently show higher IBR, and contested items often show elevated values for models such as Mistral, DeepSeek, Grok, and Sonnet\. This indicates that even with evidence, models are more likely to invoke institutional authority when the item already involves formal, global, or disputed knowledge\.

The Mistral judge assigns substantially higher IBR values, but reveals the same structure more clearly\. Institutional validation is the most biased category across almost all models, often exceeding 0\.50 and reaching 0\.655 for Llama in English\. Global and contested items also show high IBR, frequently above 0\.55, whereas local items remain lower\. This gap suggests that models distinguish, implicitly or explicitly, between locally grounded knowledge and knowledge that is more easily framed through formal authority\. Importantly, English prompts often increase IBR under the Mistral judge, especially for Gemma, DeepSeek, Llama, Grok, and Sonnet\. Therefore, English not only changes the content of responses, but also shifts the source of epistemic authority toward globally legitimized institutions\.

Overall, the table strengthens our claim that evidence availability is insufficient for cultural robustness\. Models may incorporate local evidence while still organizing answers through institutional frames, particularly for contested, global, and institutionally validated knowledge\. This indicates that global narrative dominance operates not only through factual substitution, but also through authority substitution: local evidence is reinterpreted through dominant institutional epistemologies rather than preserved on its own cultural terms\.

Table 25: Epistemic Perspective Coverage \(EPC\) in the evidence\-based setting across sociocultural annotations\. Higher values indicate broader coverage of locally relevant perspectives\. Occ\.=Occasional, Fre\.=Frequent, Cont\.=Contested, Inst\.=Institutional, Local C\.=Local Consensus, Oral=Oral Tradition\.
##### Epistemic Perspective Coverage

Table [25](https://arxiv.org/html/2605.30481#A6.T25) shows that evidence substantially improves epistemic perspective coverage, but does not make coverage language\-invariant\. Across both judges, Bangla prompts usually produce higher EPC than English prompts, especially for contested items and for responses judged by Mistral\. This is important because contested knowledge requires representing plural local interpretations; the consistent drop in English indicates that English prompts still narrow culturally situated disagreement into less diverse framings\. The pattern is strongest for Qwen, Gemma, DeepSeek, Llama, and Sonnet, while GPT is comparatively more stable\.

The sociocultural breakdown shows that EPC degradation is not simply a function of rarity\. Rare items often receive coverage comparable to, or higher than, frequent items once evidence is provided, suggesting that local evidence can help models recover less common knowledge\. In contrast, contested and global\-status items are more fragile, particularly in English, indicating that models struggle less with the availability of facts than with preserving epistemic plurality and culturally specific framing\. Validation\-type results further support this interpretation: local\-consensus and oral\-tradition items often receive relatively high EPC, but English prompts still reduce coverage for many models\. Overall, the table reinforces the central finding that evidence helps models include more local perspectives, yet English continues to induce epistemic narrowing rather than fully preserving the plurality present in Bengali cultural knowledge\.

Table 26: Cross\-Lingual Factual Consistency \(CLFC\) in the evidence\-based setting across sociocultural annotations\. Higher values indicate stronger semantic consistency between Bangla and English responses to the same culturally grounded question\. Occ\.=Occasional, Fre\.=Frequent, Cont\.=Contested, Inst\.=Institutional, Local C\.=Local Consensus, Oral=Oral Tradition\.
##### Cross\-lingual Factual Consistency

Table [26](https://arxiv.org/html/2605.30481#A6.T26) shows that providing local evidence improves cross\-lingual factual consistency, but does not make model responses language\-invariant\. Under the GPT judge, CLFC remains low for most models, often below 0\.30 across knowledge\-frequency, epistemic\-status, and validation\-type groups\. Llama, Grok, and Qwen are comparatively stronger under this judge, while DeepSeek, Mistral, and Gemma show consistently weak alignment between Bangla and English responses\. Under the Mistral judge, absolute CLFC scores are higher, with GPT, Gemini, Sonnet, and Grok showing the strongest consistency, but several models still remain far from stable\.

The sociocultural breakdown reveals where cross\-lingual instability is most pronounced\. Local\-consensus and oral\-tradition items often receive lower CLFC than institutional or globally framed items, especially under the GPT judge\. This suggests that models preserve factual meaning more reliably when knowledge is formalized or globally legible, but struggle when validity depends on community memory, oral transmission, or local consensus\. Contested items are also fragile for several models, indicating that cross\-lingual inconsistency is amplified when culturally grounded questions involve multiple possible interpretations\.

Overall, the table supports a central claim of our study: evidence availability is not sufficient for cultural robustness\. Even when the same local evidence is provided, models frequently produce semantically different claims across Bangla and English\. This indicates that cross\-lingual failures are not merely retrieval errors, but reflect language\-conditioned differences in how models select, organize, and prioritize culturally situated knowledge\.

Table 27: Language Anchor Bias \(LAB\) in the evidence\-based setting across sociocultural annotations\. Higher values indicate stronger language\-dependent shifts between Bangla and English prompts\. Occ\.=Occasional, Fre\.=Frequent, Cont\.=Contested, Inst\.=Institutional, Local C\.=Local Consensus, Oral=Oral Tradition\. The Global column is omitted for this analysis\.
##### Language Anchor Bias

Table [27](https://arxiv.org/html/2605.30481#A6.T27) shows that language anchoring persists even in the evidence\-based setting, where models are explicitly given local supporting context\. Under the GPT judge, LAB remains moderate to high across most models and sociocultural categories, often exceeding 0\.45 for Gemma, Sonnet, DeepSeek, and Mistral\. GPT and Llama are comparatively more stable, but still show non\-trivial anchoring, indicating that evidence removes neither the role of prompt language nor the tendency to reframe answers differently in Bangla and English\.

The Mistral judge assigns lower absolute LAB values, but confirms the same pattern: language\-dependent shifts remain visible despite evidence injection\. This is especially important because the relevant local information is available in both language conditions\. Therefore, the observed shifts cannot be explained solely by missing knowledge or retrieval failure\. Instead, they suggest that models use the provided evidence differently depending on the prompt language\.

The sociocultural breakdown further shows that evidence does not uniformly stabilize all types of knowledge\. LAB remains substantial for frequent and occasional items, demonstrating that anchoring is not restricted to rare knowledge\. Local\-consensus, niche, and oral\-tradition items also show elevated LAB for many models, suggesting that community\-grounded forms of legitimacy are especially vulnerable to language\-conditioned reframing\. Overall, Table [27](https://arxiv.org/html/2605.30481#A6.T27) strengthens our central claim: even when local evidence is supplied, prompt language continues to act as an epistemic anchor that changes how models interpret and present Bengali cultural knowledge\.

### F\.3 Abstention, Knowledge Gap, and Global Narrative Dominance

Table 28: Abstention, knowledge gap \(KGAP\), and global narrative dominance \(GND\) rates across languages, experiment settings, and LLM judges\. Higher KGAP and GND indicate more frequent failures of cultural grounding\. BN: Bangla, EN: English\.

Table [28](https://arxiv.org/html/2605.30481#A6.T28) shows that global narrative dominance is strongly conditioned by prompt language and only partially mitigated by evidence\. In the question\-only setting, English prompts consistently produce higher knowledge gap \(KGAP\) and global narrative dominance \(GND\) rates than Bangla prompts across both LLM judges\. Under the GPT judge, GND rises from roughly 0\.31–0\.49 in Bangla to 0\.50–0\.82 in English; under the Mistral judge, the same shift is even stronger, with several English\-prompt GND scores above 0\.75\. This indicates that when local knowledge is uncertain, models are more likely to resolve culturally grounded questions through globally dominant narratives in English\.

Evidence reduces abstention substantially across nearly all models, showing that models use the provided context to produce answers rather than refuse\. However, evidence does not eliminate narrative dominance\. In Bangla, GND often decreases or remains moderate, but in English it remains high and in some cases increases, with GPT\-judge English GND frequently around 0\.75–0\.83 and Mistral\-judge GND often above 0\.75\. This pattern is central to our claim: the failure is not only absence of local knowledge, since the relevant evidence is available, but also language\-conditioned prioritization of dominant narratives\. Model differences are visible, with GPT\-5\.4 generally more robust and Llama, Gemma, Grok, and DeepSeek showing higher GND, but no model is stable across language and evidence conditions\.
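To illustrate how rates of this kind can be tabulated, the sketch below assumes, purely for illustration, that each judged response carries exactly one outcome label\. The label set \(`abstention`, `kgap`, `gnd`, `grounded`\), the function name, and the toy label lists are hypothetical; the actual labeling scheme behind Table 28 may differ, for example by allowing overlapping labels\.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical, mutually exclusive outcome labels (an assumption of this sketch).
OUTCOMES: Tuple[str, ...] = ("abstention", "kgap", "gnd", "grounded")


def outcome_rates(labels: Iterable[str]) -> Dict[str, float]:
    """Share of responses assigned to each outcome label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels provided")
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return {outcome: counts[outcome] / total for outcome in OUTCOMES}


# Toy judge outputs for ten responses per language; NOT data from the paper.
toy_bn = ["grounded"] * 5 + ["gnd"] * 3 + ["kgap"] + ["abstention"]
toy_en = ["grounded"] * 2 + ["gnd"] * 6 + ["kgap"] * 2

for language, labels in (("BN", toy_bn), ("EN", toy_en)):
    rates = outcome_rates(labels)
    print(language, {name: round(rate, 2) for name, rate in rates.items()})
```

Computing the rates separately for each language, setting, and judge makes the comparison in the text direct: the same function is applied to each condition and the resulting GND and KGAP shares are compared side by side\.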

Table 29: Human validation subset for the Bangla question\-only setting\. Human judgments are reported for two representative models and compared against both LLM judges\. Abst\.: Abstention, KGAP: knowledge\-gap, GND: global narrative dominance\.

Table [29](https://arxiv.org/html/2605.30481#A6.T29) provides a validity check for the LLM\-judge results on the Bangla question\-only subset\. Human and LLM judgments are broadly comparable for abstention and KGAP, but differ in GND calibration\. The human annotator assigns lower GND than the LLM judges for Sonnet, while assigning a similar or slightly lower value for GPT\-5\.4 relative to GPT\-as\-judge and a substantially lower value relative to Mistral\-as\-judge\. This suggests that LLM judges, especially Mistral in this setting, may over\-detect global narrative dominance in some Bangla responses, while still preserving the comparative trend that GPT\-5\.4 is more robust than Sonnet\. The human validation supports using LLM judges as scalable diagnostic tools, while motivating judge\-specific reporting rather than treating any single judge as definitive\.
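A comparison between a human annotator and an LLM judge on the same items can be quantified in at least two ways: the difference in GND rates \(a calibration gap\) and a chance\-corrected agreement score such as Cohen's kappa\. The sketch below is illustrative only; the paper does not state that it reports kappa, and the binary flags are toy values, not the annotations behind Table 29\.

```python
from typing import Sequence


def positive_rate(flags: Sequence[int]) -> float:
    """Share of items flagged 1 (here: judged as global narrative dominance)."""
    if not flags:
        raise ValueError("no flags provided")
    return sum(flags) / len(flags)


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two raters assigning binary (0/1) flags to the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    observed = sum(x == y for x, y in zip(a, b)) / len(a)
    pa, pb = positive_rate(a), positive_rate(b)
    # Agreement expected by chance if the raters flagged items independently.
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Toy GND flags on eight shared items; NOT the paper's annotations.
toy_human = [1, 0, 0, 1, 0, 0, 1, 0]
toy_llm_judge = [1, 1, 0, 1, 0, 1, 1, 0]

gap = positive_rate(toy_llm_judge) - positive_rate(toy_human)
print(f"judge minus human GND rate: {gap:+.3f}")
print(f"Cohen's kappa: {cohen_kappa(toy_human, toy_llm_judge):.3f}")
```

A positive rate gap with moderate kappa would correspond to the situation described above, where a judge flags GND more often than the human annotator while still tracking the same items\.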

## Appendix G Data Release
