Characterizing Cultural Localization in AI-Generated Stories

arXiv cs.CL Papers

Summary

This paper proposes a method to measure templated cultural localization in AI-generated stories. It finds that only a small fraction of the vocabulary (9-17%) distinguishes nationalities while the remaining narratives rely on shared templates, and that cultural markers from 19 countries, mostly in the Global South, are on average offensive.

arXiv:2606.14626v1 Announce Type: new

# Characterizing Cultural Localization in AI-Generated Stories
Source: [https://arxiv.org/html/2606.14626](https://arxiv.org/html/2606.14626)
Shaily Bhatt\* \(Carnegie Mellon University, shaily@cmu\.edu\), Supriti Vijay\* \(Carnegie Mellon University, supritiv@cs\.cmu\.edu\), Jeremiah Milbauer \(Carnegie Mellon University, jmilbaue@cs\.cmu\.edu\), Fernando Diaz \(Carnegie Mellon University, diazf@acm\.org\)

###### Abstract

The global use of artificial intelligence has increased interest in assessing the ability to generate culturally localized content, including stories\. Cultural localization in stories often occurs through either templated localization—the use of cultural markers \(e\.g\., names, locations\) in a generic narrative—or holistic localization—the variation of plots, values, and themes, in addition to cultural markers\. We propose a method to measure the degree to which content was generated through templated localization\. Specifically, we identify the lexical tokens that distinguish stories across nationalities and measure the similarity of the narratives that remain after removing them\. In stories generated by five models on 125 topics for 193 nationalities, our method is able to detect that only a small subset \(9\-17%\) of the vocabulary accounts for the variation across nationalities and that the narratives that remain after removing them contain repeated multi\-word sequences, suggesting the presence of a shared culturally\-agnostic narrative template\. Finally, we characterize the cultural markers for their stereotypicality and offensiveness, finding that markers from 19 countries, mostly located in the Global South, are on average offensive\.

\*Equal contribution: Shaily Bhatt and Supriti Vijay\.

## 1 Introduction

Large language models \(LLMs\) are increasingly being used globally, requiring them to tailor their generations to diverse sociocultural contexts when instructed\. For example, when given the instruction \[Write a story about ‘honesty’ for an Indian kid\.\], a model must generate a narrative that is localized to the Indian cultural context\.

\{Mark, Biren\} stopped at a \{coffee, tea\} shop in \{Chicago, Bangalore\} after \{baseball, cricket\} practice\. He paid \{ten\-dollars, fifty\-rupees\} for a \{sandwich, samosa\}\. Outside, he noticed that the \{cashier, vendor\} had returned too much change\. Although he was already heading toward the \{bus stop, metro station\}, he went back and returned the extra money\. The \{cashier, vendor\} thanked him for his honesty\.

\(a\) Templated localization\.

Mark paid for his coffee and bagel at Pigeon Bagels in Pittsburgh and stepped outside\. On the sidewalk, he counted his change and noticed an extra five\-dollar bill\. He immediately walked back in\. The cashier looked up from the register, embarrassed\. “Sorry you had to come back,” she said\. Mark smiled as he returned the bill\.

Biren had dropped off his last passenger of the night\. It was a short ride, but took 45 minutes in Mumbai’s traffic\. He was almost home when he noticed that his passenger had left a bag in the auto\. He was tired but took a U\-turn to return it\. The passenger received the bag with relief, gratitude, and surprise at Biren’s honesty\.

\(b\) Holistic localization\.

Figure 1: Example stories for the instruction \[Write a story about ‘honesty’ for a *nationality* kid\], where *nationality* is \{Indian, American\}\.

Narrative localization can take many forms, including culturally specific plot tropes Colby \([1973](https://arxiv.org/html/2606.14626#bib.bib81)\), the encoding of particular values Hobson et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib51)\); Wu et al\. \([2023](https://arxiv.org/html/2606.14626#bib.bib84)\), variations in narrative structure Song \([2017](https://arxiv.org/html/2606.14626#bib.bib83)\), and culturally relevant entities such as names or locations Bhatt and Diaz \([2024](https://arxiv.org/html/2606.14626#bib.bib37)\)\. We consider the two forms of localization shown in Figure [1](https://arxiv.org/html/2606.14626#S1.F1)\. In *templated localization*, cultural markers \(e\.g\., names, locations\) are inserted into a culturally\-agnostic narrative template Fan et al\. \([2019](https://arxiv.org/html/2606.14626#bib.bib136)\); Ford et al\. \([2018](https://arxiv.org/html/2606.14626#bib.bib137)\); Wiseman et al\. \([2018](https://arxiv.org/html/2606.14626#bib.bib135)\); Khanuja et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib86)\)\. On the other hand, *holistic localization* employs culturally specific plots and settings, in addition to cultural markers\.

How models localize stories has broader impacts on cultural production and preservation\. Templated localization may be the appropriate choice in contexts where the goal is to preserve the content while making it more relatable to the audience Khanuja et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib86), [2025](https://arxiv.org/html/2606.14626#bib.bib174)\)\. When used in the general context, however, the resulting stories will often reflect homogeneous narratives and values, which can lead to cultural harms like erasure Qadri et al\. \([2025a](https://arxiv.org/html/2606.14626#bib.bib27)\); Shelby et al\. \([2023](https://arxiv.org/html/2606.14626#bib.bib129)\), imposition of Western values Shelby et al\. \([2023](https://arxiv.org/html/2606.14626#bib.bib129)\); Bhatt et al\. \([2022](https://arxiv.org/html/2606.14626#bib.bib134)\); Sambasivan et al\. \([2021](https://arxiv.org/html/2606.14626#bib.bib149)\), or reduced creative diversity Agarwal et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib53)\); Doshi and Hauser \([2024](https://arxiv.org/html/2606.14626#bib.bib74)\)\. Further, model outputs may implicitly rely on a limited set of culturally associated tokens, which prior analyses have shown can reflect stereotypical cultural associations Bhagat et al\. \([2026](https://arxiv.org/html/2606.14626#bib.bib65)\); Rooein et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib69)\)\. Therefore, tools to detect the degree of templated localization can help anticipate and avoid potential harms\. While studies have demonstrated that stories generated in the presence of cultural cues contain lexical variation Bhatt and Diaz \([2024](https://arxiv.org/html/2606.14626#bib.bib37)\), misrepresentations Bhagat et al\. \([2026](https://arxiv.org/html/2606.14626#bib.bib65)\), and geographical disparity Bhagat et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib12)\), they have yet to provide methods for understanding the presence of templated localization\.

We propose a two\-stage method to detect the presence of templated localization in model generations\. First, we identify the set of lexical items that function as unique cultural markers for each cultural identity in the generated stories\. Next, we measure the homogeneity of the text sequences that remain after these cultural markers have been removed using multi\-word similarity\. Convergence of these remaining sequences across stories that differ in cultural markers would indicate the presence of a shared culturally\-agnostic narrative template\. We further characterize the stereotypicality and offensiveness of the cultural markers using the SeeGULL dataset Jha et al\. \([2023](https://arxiv.org/html/2606.14626#bib.bib18)\)\. We evaluate stories generated by five LLMs for 193 cultural identities, operationalized through nationalities, for 125 story topics curated from prior work Bhatt and Diaz \([2024](https://arxiv.org/html/2606.14626#bib.bib37)\) and established frameworks of variation in cultural values like the World Values Survey Haerpfer et al\. \([2022](https://arxiv.org/html/2606.14626#bib.bib126)\) and Hofstede’s Cultural Dimensions Hofstede and Minkov \([2013](https://arxiv.org/html/2606.14626#bib.bib66)\)\. Our code and data are [publicly available](https://github.com/shaily99/templated_localization)\.

Our method reveals that localization primarily occurs through surface\-level lexical differences, suggesting that stories may use a homogeneous underlying narrative\. We find that cultural markers, constituting only 9\-17% of the vocabulary across models, are the only distinguishing characteristics of the stories\. Moreover, the narratives that remain after removing these markers exhibit higher multi\-token similarity across nationalities than the original stories\. Finally, we find that countries for which cultural markers are, on average, offensive are mostly located in the Global South, predominantly in Africa and West Asia, with dominant languages that are lower\-resourced\. Taken together, our findings demonstrate the ability of our method to characterize cultural localization in AI\-generated stories\.

## 2 Background

Narrative generation systems often decompose stories into two levels: a structural plan describing events and character relationships, and a surface realization that renders this plan into natural language\. Although many early natural language generation systems used slot\-filling approaches to populate manually written templates \(Reiter and Dale, [1997](https://arxiv.org/html/2606.14626#bib.bib153); van Deemter et al\., [2005](https://arxiv.org/html/2606.14626#bib.bib152)\), learning\-based approaches either automatically select Zhou and Hovy \([2004](https://arxiv.org/html/2606.14626#bib.bib145)\) or generate Ford et al\. \([2018](https://arxiv.org/html/2606.14626#bib.bib137)\); Wiseman et al\. \([2018](https://arxiv.org/html/2606.14626#bib.bib135)\); Fabbri et al\. \([2020](https://arxiv.org/html/2606.14626#bib.bib146)\); Gangadharaiah and Narayanaswamy \([2020](https://arxiv.org/html/2606.14626#bib.bib148)\) templates\. Such plan\-based systems generate a story in multiple steps, including generating templates of plot and character actions, followed by rendering these plans into natural language\. Methods of narrative planning have ranged from story grammars Pemberton \([1989](https://arxiv.org/html/2606.14626#bib.bib101)\); Ryan \([2017](https://arxiv.org/html/2606.14626#bib.bib102)\) to symbolic planners Riedl and Young \([2010](https://arxiv.org/html/2606.14626#bib.bib106)\); McIntyre and Lapata \([2010](https://arxiv.org/html/2606.14626#bib.bib104), [2009](https://arxiv.org/html/2606.14626#bib.bib140)\), and finally, to neural models Martin et al\. \([2018](https://arxiv.org/html/2606.14626#bib.bib141)\); Xu et al\. \([2018](https://arxiv.org/html/2606.14626#bib.bib142)\); Yao et al\. \([2019](https://arxiv.org/html/2606.14626#bib.bib111)\); Goldfarb\-Tarrant et al\. \([2020](https://arxiv.org/html/2606.14626#bib.bib143)\)\.
Narrative planning has also been integrated into prompt\-based story generation from LLMs Xie and Riedl \([2024](https://arxiv.org/html/2606.14626#bib.bib132)\); Li et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib144)\)\. Narratives have also been studied computationally by decomposing them into attributes such as the setting, agents, and events Piper et al\. \([2021](https://arxiv.org/html/2606.14626#bib.bib172)\); Hamilton et al\. \([2026](https://arxiv.org/html/2606.14626#bib.bib171)\)\. This separation between narrative structure and its linguistic realization suggests that variation in generated stories can arise either through modifications in the underlying template or the lexical content used to instantiate it\.

This distinction between structural plans and surface realization parallels theories of language variation across cultural identities\. Sociolinguistic scholars have argued that social meaning and identity can be conveyed, constructed, and interpreted through various channels, including \(1\) micro\-linguistic structures such as phonetic sounds or lexical choices, \(2\) macro\-linguistic forms like narrative forms or discursive orientations such as stance, \(3\) entire linguistic systems such as the choice of language or dialect, and even \(4\) material styles such as the choice of clothing Eckert \([2012](https://arxiv.org/html/2606.14626#bib.bib95), [2008](https://arxiv.org/html/2606.14626#bib.bib97)\); Bucholtz and Hall \([2005](https://arxiv.org/html/2606.14626#bib.bib96)\)\. Importantly, these channels include narrative forms such as stance\. Scholars of folk narrative have shown that plot and character tropes are culturally specific: the narrative structure of Russian folktales differs systematically from that of North Alaskan stories Colby \([1973](https://arxiv.org/html/2606.14626#bib.bib81)\), and similar differences have been documented in stories across other traditions Polti \([1916](https://arxiv.org/html/2606.14626#bib.bib82)\); Song \([2017](https://arxiv.org/html/2606.14626#bib.bib83)\); Hobson et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib51)\); Wu et al\. \([2023](https://arxiv.org/html/2606.14626#bib.bib84)\)\. A story can therefore represent cultural identity through the use of culturally\-specific entities \(the names, places, and objects within it\) or through differences in the narrative itself\. Since narrative generation can separate structure from surface realization, and cultural identity can be encoded at both levels, cultural localization in generated stories could occur either through surface markers or through narrative differences\.

As LLMs are deployed globally, a growing body of work has investigated their cultural competence, that is, their ability to generate outputs that reflect culturally specific knowledge, norms, and values\. While intrinsic evaluations of cultural competence focus on the ability to recall cultural values Durmus et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib14)\); Masoud et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib88)\); AlKhamissi et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib36)\); Ramezani and Xu \([2023](https://arxiv.org/html/2606.14626#bib.bib19)\), norms Dwivedi et al\. \([2023](https://arxiv.org/html/2606.14626#bib.bib33)\); Rao et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib70)\), artifacts Seth et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib24)\), and knowledge Li et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib25)\); Singh et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib89)\); Maji et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib91)\); Sahoo et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib92)\); Chang et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib93)\); Myung et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib94)\), extrinsic evaluations focus on user\-facing generative tasks Bhatt and Diaz \([2024](https://arxiv.org/html/2606.14626#bib.bib37)\); [Sparck Jones and R\. Galliers](https://arxiv.org/html/2606.14626#bib.bib175)\.
Prior work has examined the content produced for diverse cultural identities in extrinsic tasks such as open\-ended question answering, story generation, scientific writing, creating travel itineraries, and writing assistance, finding that the cultural knowledge of LLMs may not always be reflected in generative settings Bhatt and Diaz \([2024](https://arxiv.org/html/2606.14626#bib.bib37)\); Bhagat et al\. \([2026](https://arxiv.org/html/2606.14626#bib.bib65)\); cultural representation is often stereotypical or misrepresentative Rooein et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib69)\); Bhagat et al\. \([2026](https://arxiv.org/html/2606.14626#bib.bib65), [2025](https://arxiv.org/html/2606.14626#bib.bib12)\); and generations do not adhere to expected cultural writing styles Agarwal et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib53)\); Bhatt et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib71)\)\. While this work demonstrates that LLMs can incorporate culturally salient tokens, it remains unclear whether the content reflects narrative differences beyond surface\-level lexical variation\.

Independent of cultural evaluation, recent studies have shown that LLM\-generated text often exhibits substantial homogeneity across outputs\. Prior work has evaluated the homogeneity of generated outputs along various dimensions, including the syntactic, semantic, and narrative levels\. Specifically, LLMs have been shown to generate recurring syntactic patterns, semantically similar concepts, homogeneous discourse structures, and epistemic claims Shaib et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib26)\); Sourati et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib28)\); Wang et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib75)\); Jiang et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib72)\); Wright et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib73)\); Namuduri et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib167)\)\. Finally, both qualitative and quantitative studies of LLM\-generated stories find a lack of plot diversity, recurring narrative themes, a lack of pacing and tension, and positive endings Xu et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib44)\); Tian et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib42)\); Beguš \([2024](https://arxiv.org/html/2606.14626#bib.bib46)\); Priyanshu and Vijay \([2024](https://arxiv.org/html/2606.14626#bib.bib30)\)\. If LLM outputs tend to reuse shared narrative structures, then cultural adaptation in generated stories may occur primarily through surface\-level markers rather than through holistic localization\.

Together, these observations suggest that cultural variation in generated stories may arise primarily through surface\-level lexical markers rather than through deeper narrative differences\. However, existing work has not directly examined whether stories generated across cultures exhibit templated localization, where cultural markers are inserted into culturally\-agnostic narrative templates\.

## 3 Method

We are interested in measuring the degree to which generated stories across cultures reflect templated or holistic localization\. We distinguish between these two as follows:

#### Templated localization

refers to localization when culture is represented through isolated lexical items\. Here, cultural markers such as cultural artifacts, relevant names and locations, or other entities are inserted into culturally\-agnostic templates that reflect homogeneous narrative structures, plots, settings, themes, and values\.

#### Holistic localization

refers to localization when culture shapes the narrative\. Here, cultural markers are distributed throughout the story, resulting in culturally\-specific narrative structures, plots, themes, and values\.

### 3\.1 Overview

Given a prompt \[Write a children’s story about *topic* for a/an *nationality* kid in English\.\], we are interested in understanding if a generated story is composed of \(1\) a culturally\-agnostic template about *topic* shared across nationalities and \(2\) a set of cultural markers inserted into that template\. To do so, we fix *topic* and vary *nationality* to produce stories\. Our method analyzes these stories in two stages\. First, we identify the set of cultural markers that distinguish the stories \(§[3\.2](https://arxiv.org/html/2606.14626#S3.SS2)\)\. Second, we measure the similarity of the narrative that remains after these cultural markers are removed \(§[3\.3](https://arxiv.org/html/2606.14626#S3.SS3)\)\. Finally, we characterize the stereotypicality of the cultural markers \(§[3\.4](https://arxiv.org/html/2606.14626#S3.SS4)\)\.

To make cultural localization tractable for computational analysis, we impose several methodological constraints on the scope of our study\. First, because templated localization assumes exact repeated language across cultural contexts, we adopt lexical units \(words\) as our unit of analysis, allowing us to leverage existing natural language processing tools\. Second, while imperfect, nationality serves as a proxy for culture consistent with existing research Adilazuarda et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib15)\), making the analysis amenable to classification methods\. Finally, we restrict our analysis to English to facilitate lexical comparison, leaving cross\-lingual template detection as an area for future study\.

### 3\.2 Identifying Cultural Markers

The first step of our method identifies the minimal set of words per nationality whose removal renders stories across cultures indistinguishable\. Under templated localization, lexical differences will be concentrated in a small number of cultural markers, whose removal will eliminate variation\. By contrast, under holistic localization, the differences would be distributed throughout the story, requiring many words to be removed before stories converge\.

#### Scoring candidate cultural markers\.

Let $s_{t,c} \in S$ be the generated story for topic $t \in T$ and culture $c \in C$\. The vocabulary $V$ is the union of words present across all stories\. For each $c \in C$, we score every word $w \in V$ according to its normalized pointwise mutual information \(NPMI\) with $c$ \(Appendix [A](https://arxiv.org/html/2606.14626#A1)\)\. We refer to these scored words as the candidate cultural markers of $c$\.
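The scoring step above can be sketched with the standard NPMI definition, $\mathrm{NPMI}(w, c) = \log\frac{p(w, c)}{p(w)\,p(c)} \,/\, \left(-\log p(w, c)\right)$. The paper's exact estimator is given in its Appendix A, which is not reproduced here, so the story-level counting and whitespace tokenization below are assumptions, and `npmi_scores` is a hypothetical helper name.

```python
import math
from collections import Counter


def npmi_scores(stories):
    """Score (word, culture) pairs by NPMI.

    stories: list of (culture, text) pairs.
    Probabilities are estimated from story-level counts (an assumption):
    p(w) is the fraction of stories containing w, p(c) the fraction of
    stories written for culture c, and p(w, c) the fraction that are both.
    """
    n = len(stories)
    word_count = Counter()     # stories containing w
    culture_count = Counter()  # stories for culture c
    joint_count = Counter()    # stories for culture c that contain w
    for culture, text in stories:
        culture_count[culture] += 1
        for w in set(text.lower().split()):
            word_count[w] += 1
            joint_count[(w, culture)] += 1

    scores = {}
    for (w, c), n_wc in joint_count.items():
        p_wc = n_wc / n
        if p_wc == 1.0:
            # -log p(w, c) would be zero; NPMI is undefined here.
            continue
        p_w = word_count[w] / n
        p_c = culture_count[c] / n
        pmi = math.log(p_wc / (p_w * p_c))
        scores[(w, c)] = pmi / -math.log(p_wc)
    return scores
```

Under this reading, a word that appears in every story of one culture and nowhere else scores 1, while a word spread evenly across cultures scores 0.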

#### Identifying distinguishing cultural markers\.

Given the candidate cultural markers for $c$, the final set of cultural markers for $c$ is $V_c^k \subset V$, composed of the top $k\%$ candidates with the highest NPMI values\. Let $\overline{s}_{t,c}^{k}$ be the story $s_{t,c}$ with $V_c^k$ removed\. In order to determine $k$, we measure the ability of a classifier to identify the culture of $\overline{s}_{t,c}^{k}$ amongst the set $\overline{S}_t^k = \{\overline{s}_{t,c}^{k}\}_{c \in C}$\. Specifically, at varying values of $k$, we record the F$_1$ of the classifier\. If a subset of the vocabulary is the only identifiable characteristic of the stories across cultures, then masking words with high cultural association should make the stories indistinguishable\. While the performance of the classifier should drop more significantly when words with higher cultural association are masked, our measurement question is how many culturally\-associated words need to be removed\. We refer to $V_c^k$ as the subset of words whose removal makes the stories in $\overline{S}_t^k$ indistinguishable\. We refer to the resulting stories in $\overline{S}_t^k$ as the template images\.
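A minimal sketch of the selection and masking step, under stated assumptions: markers are replaced by a `[MASK]` token rather than deleted, matching is case-insensitive on whitespace tokens, and "indistinguishable" is read as the classifier F$_1$ falling to or below chance. All function names are hypothetical, and the classifier itself is outside the sketch.

```python
def top_k_markers(scores_for_culture, k):
    """Return the top k% of candidate words for one culture, ranked by NPMI.

    scores_for_culture: {word: npmi}; k: percentage in [0, 100].
    """
    ranked = sorted(scores_for_culture, key=scores_for_culture.get, reverse=True)
    n_keep = int(len(ranked) * k / 100)
    return set(ranked[:n_keep])


def mask_story(text, markers, mask_token="[MASK]"):
    """Replace every marker token in a story with a mask token."""
    return " ".join(
        mask_token if w.lower() in markers else w for w in text.split()
    )


def smallest_indistinguishable_k(f1_by_k, chance_f1):
    """Smallest k whose recorded F1 is at or below chance, or None.

    f1_by_k: {k: F1 of the nationality classifier on stories masked at k}.
    """
    for k in sorted(f1_by_k):
        if f1_by_k[k] <= chance_f1:
            return k
    return None
```

In use, one would compute `top_k_markers` per culture for each candidate $k$, mask the test stories, score them with the trained classifier, and pass the resulting F$_1$ values to `smallest_indistinguishable_k`.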

### 3\.3 Homogeneity of Remaining Narratives

The second step of our method detects the presence of a shared generic narrative template by measuring the homogeneity of the template images remaining after removing the cultural markers\.

Although template images are indistinguishable by construction, we need a method to determine whether this is due to randomness or homogeneity amongst images\. We can measure the homogeneity of template images by computing the average pair\-wise similarity amongst elements\. Such measures have been used in prior work on measuring homogeneity in a corpus Padmakumar and He \([2024](https://arxiv.org/html/2606.14626#bib.bib48)\); Shaib et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib165)\)\. If stories reuse a template, replacing cultural markers with masked tokens would make the resulting text sequences more similar than the original stories\. Consider two stories from two cultures: \[A cat sat on the table\.\] and \[A dog sat on the floor\.\]\. Let \{cat, table\} and \{dog, floor\} be the markers of the respective cultures\. Then, masking these markers will produce the same n\-gram sequence \[A *mask* sat on the *mask*\], resulting in higher similarity than both the original stories and stories in which random words are masked\.

While this stylized demonstration of homogeneity suggests that template images are exact duplicates, in practice, due to model stochasticity, we need similarity metrics robust to small perturbations amongst template images\. To do so, we adopt two metrics for analyzing multi\-word sequences\. In the first, we calculate the length of the longest common substring \(LCS\), normalized by the length of the stories\. In the second, we measure the similarity between the sets of $n$\-grams present in pairs of template images, using Jaccard similarity\. This method has been used to robustly detect duplicates in large corpora such as web crawls Broder et al\. \([1997](https://arxiv.org/html/2606.14626#bib.bib125)\)\.
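The two metrics can be sketched on the cat/dog example above. Several details are assumptions rather than the paper's stated choices: both metrics operate on whitespace tokens, the common substring is a contiguous run of tokens, and its length is normalized by the longer of the two stories.

```python
def ngrams(tokens, n):
    """Set of contiguous n-token tuples in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_ngram_similarity(a, b, n=4):
    """Jaccard similarity between the n-gram sets of two texts."""
    grams_a, grams_b = ngrams(a.split(), n), ngrams(b.split(), n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def normalized_lcs(a, b):
    """Longest common contiguous token run, divided by the longer length."""
    x, y = a.split(), b.split()
    if not x or not y:
        return 0.0
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the common run that ended at the previous tokens.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best / max(len(x), len(y))


original_a = "A cat sat on the table"
original_b = "A dog sat on the floor"
masked = "A [MASK] sat on the [MASK]"  # both stories after masking markers
```

On the originals, the longest shared run is "sat on the" (3 of 6 tokens) and no 4-gram is shared, whereas the two masked stories are identical under both metrics.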

Since our goal is to measure whether template images are shared across cultures, multi\-word similarity offers a relatively simple yet efficient method to compare pairs of text sequences that remain after removing cultural markers, unlike other representations like discourse structures, narrative components, or themes that require more manual or computational effort Namuduri et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib167)\); Beguš \([2024](https://arxiv.org/html/2606.14626#bib.bib46)\); Piper et al\. \([2021](https://arxiv.org/html/2606.14626#bib.bib172)\)\.

### 3\.4 Characterizing Cultural Markers

Finally, we characterize cultural markers for their degree of stereotypicality\. Assume we have access to a set of stereotypical attributes for $c$, denoted as $Z_c$\. We calculate the overlap between $Z_c$ and $V_c^k$ by measuring the precision of stereotypes in the cultural markers \(Appendix [B](https://arxiv.org/html/2606.14626#A2)\)\.
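One plausible reading of this precision, sketched below, is the fraction of a culture's markers that also appear in its stereotype set, $|V_c^k \cap Z_c| \,/\, |V_c^k|$. The exact definition is in the paper's Appendix B, which is not reproduced here, so exact string matching between markers and attributes is an assumption and `stereotype_precision` is a hypothetical name.

```python
def stereotype_precision(markers, stereotypes):
    """Fraction of cultural markers that are listed as stereotypes.

    markers: set of marker words for one culture (V_c^k).
    stereotypes: set of stereotypical attributes for that culture (Z_c).
    Returns 0.0 when there are no markers.
    """
    if not markers:
        return 0.0
    return len(markers & stereotypes) / len(markers)
```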

## 4 Experimental Materials

#### Story Topics\.

We curate a set of 125 story topics that consists of 35 topics from prior work Bhatt and Diaz \([2024](https://arxiv.org/html/2606.14626#bib.bib37)\), and 90 based on the World Values Survey Haerpfer et al\. \([2022](https://arxiv.org/html/2606.14626#bib.bib126)\), Hofstede’s cultural dimensions Hofstede and Minkov \([2013](https://arxiv.org/html/2606.14626#bib.bib66)\), and the Moral Foundations Theory Graham et al\. \([2012](https://arxiv.org/html/2606.14626#bib.bib128)\)\. We select these frameworks as they are known to capture variation in values across cultures and have been utilized to evaluate AI systems’ knowledge of cultural values Durmus et al\. \([2024](https://arxiv.org/html/2606.14626#bib.bib14)\); Masoud et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib88)\)\. To curate this list, two authors read and discussed each dimension of the three frameworks and distilled it into a topic\. For example, question Q110 from the World Values Survey about rating the amount of corruption in the country is distilled into the topic ‘corruption’\. Similarly, question Q03 from Hofstede’s survey on the importance of getting recognition for good performance in the workplace is distilled into ‘recognition\.’ The complete list of topics and their corresponding sources is available in our [data](https://github.com/shaily99/templated_localization)\.

#### Prompts\.

We use a simple prompt template, \[Write a children’s story about *topic* for a/an *nationality* kid in English\.\]\. Similar to prior works, we opt for a simple instruction to generate a story Rooein et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib69)\); Bhagat et al\. \([2026](https://arxiv.org/html/2606.14626#bib.bib65)\); Bhatt and Diaz \([2024](https://arxiv.org/html/2606.14626#bib.bib37)\), leaving examination of localization behavior in other user interaction patterns and domains to future work\. We generate prompts for each of the 125 topics for all of the 193 nationalities, resulting in 24,125 prompts\.
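The prompt grid described above can be sketched as a Cartesian product of topics and nationalities. The template string mirrors the one quoted in the text; the vowel-based choice between "a" and "an" is a naive assumption (it mishandles cases such as "Ugandan"), and `build_prompts` and `article_for` are hypothetical names.

```python
from itertools import product

TEMPLATE = (
    "Write a children’s story about {topic} "
    "for {article} {nationality} kid in English."
)


def article_for(nationality):
    """Naive a/an choice based on the first letter (an assumption)."""
    return "an" if nationality[0].lower() in "aeiou" else "a"


def build_prompts(topics, nationalities):
    """One prompt per (topic, nationality) pair, topics varying slowest."""
    return [
        TEMPLATE.format(topic=t, article=article_for(n), nationality=n)
        for t, n in product(topics, nationalities)
    ]
```

With the full lists, the grid has 125 × 193 = 24,125 entries, matching the count reported above.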

#### Models\.

We generate stories from five models\. Two are closed\-source: GPT 3\.5 Turbo and GPT 4o Mini, queried through the OpenAI API in June 2025\. Three are open\-weights models of varying sizes: Llama 3\.1 8B Instruct, Llama 3\.3 70B Instruct \(Llama Team, AI @ Meta, [2024](https://arxiv.org/html/2606.14626#bib.bib139)\), and Gemma 3 12B Instruct \(Gemma Team, [2025](https://arxiv.org/html/2606.14626#bib.bib138)\), hosted locally using vLLM with 8\-bit quantization\. This selection balances recency, size, and open\-source availability, demonstrating the effectiveness of our method across a range of models\. For all models, we set the temperature to 0\.7 and the maximum tokens to 1000\. To account for non\-determinism during generation, we sample five responses per prompt\. This results in 120,625 stories from each model\.

Table 1: Examples of template images, the number of cultures they were found in, and their respective cultural markers\.

#### Nationality Classifier\.

We train the nationality classifier used in Section [3\.2](https://arxiv.org/html/2606.14626#S3.SS2) as a multi\-class classifier that assigns each story to one of 193 nationalities\. We fine\-tune the mmBERT model Marone et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib78)\) using a classification head\. We use 5\-fold cross\-validation with a 60:20:20 split across folds for training, validation, and testing, respectively\. All the classifiers are trained for a maximum of fifty epochs, with early stopping patience set to five epochs\. The validation split is used for early stopping and to pick the best classifier across hyperparameter combinations, including learning rate and batch size \(best parameters reported in Appendix [C](https://arxiv.org/html/2606.14626#A3)\)\. We then record the performance of the classifier on the test split\. Additionally, we record the performance on the masked stories created from this test split\. We run the experiment independently for each of the five LLMs\.
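One way to obtain a 60:20:20 split from 5 folds is to rotate the roles of five equal parts: one part for testing, the next for validation, and the remaining three for training. This is only a plausible reading of the setup above, not the paper's confirmed procedure, and `five_fold_splits` is a hypothetical name; stratification by nationality is omitted for brevity.

```python
def five_fold_splits(n_items, n_folds=5):
    """Yield (train, val, test) index lists, rotating roles across folds.

    Items are assigned to parts by index modulo n_folds. In fold i,
    part i is the test set, part i+1 (wrapping around) is the validation
    set, and the remaining parts form the training set.
    """
    parts = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    for i in range(n_folds):
        v = (i + 1) % n_folds
        test = parts[i]
        val = parts[v]
        train = [idx for j, p in enumerate(parts) if j not in (i, v) for idx in p]
        yield train, val, test
```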

#### Template Image Similarity\.

We compare the average similarity amongst template images with the average pair\-wise similarity amongst \(1\) original stories, and \(2\) stories when an equivalent number of random words are masked\. When computing $n$\-gram similarity, we use $n=4$\.
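The random-masking control and the pair-wise averaging can be sketched as follows. The sketch assumes whitespace tokens, a `[MASK]` replacement token, and a caller-supplied similarity function; `mask_random_words` and `mean_pairwise` are hypothetical names.

```python
import random
from itertools import combinations


def mask_random_words(text, n_mask, seed=0, mask_token="[MASK]"):
    """Mask n_mask randomly chosen token positions (the control condition)."""
    tokens = text.split()
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(tokens)), min(n_mask, len(tokens))))
    return " ".join(mask_token if i in chosen else w for i, w in enumerate(tokens))


def mean_pairwise(texts, sim):
    """Average of sim(a, b) over all unordered pairs of texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    return sum(sim(a, b) for a, b in pairs) / len(pairs)
```

The comparison then amounts to calling `mean_pairwise` three times with the same similarity function: on the template images, on the original stories, and on stories passed through `mask_random_words` with the same number of masked tokens.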

#### Stereotype Data\.

In order to characterize the stereotypicality of cultural markers, we calculate the precision of stereotypes using the stereotypical attributes released in the SeeGULL dataset Jha et al\. \([2023](https://arxiv.org/html/2606.14626#bib.bib18)\)\. To create this dataset, candidate stereotypes were first sourced from language models, followed by obtaining annotations to rate the candidates as stereotypical \(or not\) from raters residing in the respective countries \(in\-group regional raters\) and North American annotators \(out\-group raters\)\. We use all attributes that were labeled as a stereotype by at least one regional rater as our reference set of stereotypes \($Z_c$\)\. We present evaluation results for the 156 countries from the SeeGULL dataset that overlap with the 193 nationalities in our list\. Further, SeeGULL provides an offensiveness score for every stereotype\. Specifically, a stereotype is rated as non\-offensive \(\-1\), neutral \(0\), or offensive \(Likert scale of 1\-5\), averaged across three raters\. We use this to calculate the average offensiveness of stereotypical cultural markers\.

## 5Results

### 5\.1Identifying Cultural Markers

Figure [2](https://arxiv.org/html/2606.14626#S5.F2) shows the $F_1$ of the nationality classifier for all integer values of $k$ between 0 and 99 for stories generated by GPT 4o Mini\. The $F_1$ on the original, unmasked stories is 0\.968, indicating that the classifier is able to reliably predict the nationality\. For reference, randomly guessing the nationality would achieve an $F_1$ of 0\.005\.
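As a rough check on this baseline, assume the 193 classes are balanced and guesses are uniform. Per-class precision and recall are then each about $1/193$ in expectation, which gives approximately:

$$
F_1 \approx \frac{2 \cdot \frac{1}{193} \cdot \frac{1}{193}}{\frac{1}{193} + \frac{1}{193}} = \frac{1}{193} \approx 0.0052
$$

This agrees with the reported 0.005 up to rounding.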

We observe that masking increasing numbers of highly culturally associated words dramatically degrades both the macro\-averaged $F_1$ and the class\-wise $F_1$ of the classifier, suggesting that the ability to distinguish stories is concentrated on a small number of cultural markers\. More concretely, we find that the classifier performance drops to random guessing when the top 11% of the highly associated cultural words are masked\. Results for other models \(Appendix [D](https://arxiv.org/html/2606.14626#A4)\) indicate similar fractions of cultural markers: GPT 3\.5 Turbo \(11%\), Llama 3\.3 70B Instruct \(9%\), Gemma 3 12B Instruct \(9%\), and Llama 3\.1 8B Instruct \(17%\)\.
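The masking step can be illustrated with a small sketch. It assumes the vocabulary has already been ranked from most to least culturally associated (the association score itself is defined in the paper's method section and is not reproduced here) and that masked words are replaced by a placeholder token. The function name, the `[MASK]` placeholder, and the lowercase matching are illustrative assumptions.

```python
def mask_top_markers(tokens, ranked_vocab, k_percent, mask_token="[MASK]"):
    """Mask every token that falls in the top k_percent of a ranked vocabulary.

    ranked_vocab: lowercase words, ordered from most to least culturally
    associated (assumed to be computed elsewhere).
    """
    cutoff = len(ranked_vocab) * k_percent // 100
    to_mask = set(ranked_vocab[:cutoff])
    return [mask_token if tok.lower() in to_mask else tok for tok in tokens]


if __name__ == "__main__":
    # Toy ranking for illustration only.
    ranked = ["lagos", "amara", "market", "walked", "the", "to", "in", "a", "of", "and"]
    story = "Amara walked to the market in Lagos".split()
    print(" ".join(mask_top_markers(story, ranked, k_percent=20)))
```

Sweeping `k_percent` from 0 to 99 and re-evaluating the classifier on the masked test stories would produce a curve of the kind shown in Figure 2.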

To ensure that our results were not an artifact of merely removing words, we compared the $F_1$ to that obtained when masking random words\. While the $F_1$ of the classifier in this condition also decreases as $k$ increases, it drops more slowly than when words ordered by cultural association are masked\. We find that, for all values of $k$, the $F_1$ when random words are masked is higher than that when words with the highest cultural association are masked \(one\-sided paired $t$\-test, $p < 0.05$\)\.
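A one-sided paired comparison of this kind can be sketched with the standard library alone. The example computes the paired $t$ statistic for the hypothesis that random masking yields a higher $F_1$ than cultural-marker masking. The pairing unit (for example, one pair per fold) and the toy numbers are assumptions for illustration; a $p$-value would normally be obtained from a statistics package rather than from a critical-value table.

```python
import math
import statistics


def paired_t_statistic(random_f1, cultural_f1):
    """Return (t, degrees_of_freedom) for the paired differences random - cultural.

    A large positive t supports the one-sided alternative that F1 under
    random masking exceeds F1 under cultural-marker masking.
    """
    if len(random_f1) != len(cultural_f1):
        raise ValueError("paired samples must have equal length")
    diffs = [r - c for r, c in zip(random_f1, cultural_f1)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Toy values, not results from the paper.
    random_f1 = [0.95, 0.90, 0.85, 0.80]
    cultural_f1 = [0.80, 0.55, 0.30, 0.10]
    t, df = paired_t_statistic(random_f1, cultural_f1)
    # One-sided critical value at alpha = 0.05 with 3 degrees of freedom is 2.353.
    print(t, df, t > 2.353)
```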

### 5\.2Homogeneity of Remaining Narratives

We now turn to evaluating the homogeneity in template images that remain after masking cultural markers from the stories using multi\-word similarity \(§ [3\.3](https://arxiv.org/html/2606.14626#S3.SS3)\)\. Table [1](https://arxiv.org/html/2606.14626#S4.T1) shows examples of template images that were repeated across nationalities\. Table [2](https://arxiv.org/html/2606.14626#S5.T2) shows the average multi\-word similarity amongst the original stories and their template images\. We break similarity down into inter\-group similarity \(amongst stories across cultures\) and intra\-group similarity \(amongst stories within a culture\)\.

For inter\-group similarity, across both LCS and Jaccard, we find that similarity amongst template images is higher than amongst original stories\. Appendix Table [5](https://arxiv.org/html/2606.14626#A5.T5) shows that masking an equivalent number of random words results in lower similarity than masking cultural markers\. Together, these findings demonstrate that the sequences remaining after masking cultural markers contain a latent culturally\-agnostic narrative template\.

![Refer to caption](https://arxiv.org/html/2606.14626v1/x1.png)

Figure 2: Identifying cultural markers\. $F_1$ of nationality classifier as a function of number of masked words\. Results for GPT 4o Mini generation\. Results for other models can be found in Appendix [D](https://arxiv.org/html/2606.14626#A4)\.

Table 2: Narrative homogeneity\. Multi\-word similarity amongst stories on the same topic in either their original form or masked \(Section [3\.3](https://arxiv.org/html/2606.14626#S3.SS3)\)\. Inter\-group measures the similarity amongst stories for different nationalities\. Intra\-group measures the similarity amongst stories for the same nationality\. The last column group divides the inter\-group similarity by the intra\-group similarity to control for similarity attributable to a cross\-nationality template\.

Comparing the inter\-group with the intra\-group similarity, we observe a consistently higher average similarity in the latter, suggesting that, even after removing cultural markers, some signals of nationality remain in the masked stories\. While intra\-group similarity is higher than inter\-group similarity, the differences are modest, and we speculate that a relatively large fraction of this similarity is attributable to a generic latent template\. In the final column group, we show that inter\-group similarity after masking accounts for more than 70% of the intra\-group similarity, across both measures\.

### 5\.3Presence of Stereotypes

The top 10 countries with the highest stereotype precision and offensiveness of the cultural markers for GPT 4o Mini stories are in Table [3](https://arxiv.org/html/2606.14626#S5.T3)\. Results for other models and examples are in Appendix [F](https://arxiv.org/html/2606.14626#A6)\.

The top countries with the highest stereotype precision tend to be countries where higher\-resourced languages are dominant\. We find that the stereotypes present in cultural markers have varying degrees of offensiveness across countries\. In 42 of the 61 countries with non\-zero stereotype precision, the average offensiveness score is negative or neutral, indicating that the stereotypical cultural markers were rated as non\-offensive by the annotators\. The 19 countries with positive average offensiveness scores are primarily located in the Global South, predominantly in Africa and West Asia, with dominant languages being lower\-resourced\.

Table 3: Stereotypes\. Top 10 countries with highest stereotype precision and offensiveness for cultural markers in stories generated by GPT 4o Mini\.

## 6Discussion

Our work provides an analytical lens for extrinsic cultural competence by examining how models, in response to story\-generation prompts, adapt narratives across cultural identities\. Our results suggest that cultural localization in AI\-generated stories occurs primarily through lexical insertion of cultural markers in a culturally\-agnostic template rather than through holistic changes to the narrative\. This reduces cultural representation to a small number of recognizable cultural symbols, resulting in localization underpinned by a homogeneous narrative worldview\. Evaluations that do not analyze the mechanisms of cultural representation in text may overestimate AI’s cultural competence\.

### 6\.1Localization

The observed increase in multi\-word similarity after removing cultural markers indicates the presence of latent narrative templates reused across nationalities\. If differences in stories were distributed throughout the narrative \(holistic localization\), masking a subset of the vocabulary \(the cultural markers\) would not significantly increase multi\-word sequence similarity, as the remaining narratives would still diverge structurally\. While we observe slightly higher similarity within stories from a nationality, our experiments suggest that this may largely be attributed to generic, cross\-nationality templates\. As a result, current LLM story generation seems to behave similarly to template\-based generation pipelines despite being trained end\-to\-end\.

Templated localization likely arises from systemic behavior resulting from LLM training\. Despite being instructed to generate narratives for varying cultural identities, models risk reverting to globally dominant narrative schemas\. This indicates that the assessment and improvement of cultural competence of AI in narrative localization needs to be broadened from the incorporation of culturally salient entities to other channels, such as narrative structures, values, stance, dialect, and so on, as suggested by the sociolinguistics literature\.

Even within the narrative channel, human raters will need specialized knowledge to make reliable assessments\. While prior work in evaluating model generations has advocated for the recruitment of participants with lived cultural experience Bhagat et al\. \([2026](https://arxiv.org/html/2606.14626#bib.bib65)\); Agarwal et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib53)\); Qadri et al\. \([2025b](https://arxiv.org/html/2606.14626#bib.bib169)\), non\-experts and experts can differ in their judgments, for example, when evaluating the quality of machine translated text Freitag et al\. \([2021](https://arxiv.org/html/2606.14626#bib.bib168)\) or success in emulating writing styles Chakrabarty and Dhillon \([2026](https://arxiv.org/html/2606.14626#bib.bib170)\)\. Since narratives within a cultural context can vary subtly, recruitment should be done with care, potentially requiring deeper expertise with the domain \(e\.g\., scholars\)\. This echoes recent calls to develop AI in collaboration with humanities experts \(Biega et al\., [2025](https://arxiv.org/html/2606.14626#bib.bib155); Hemment and Kommers, [2025](https://arxiv.org/html/2606.14626#bib.bib154); Born et al\., [2021](https://arxiv.org/html/2606.14626#bib.bib156)\)\.

While template\-based generation does not inherently exhibit templated localization \(e\.g\., a culture\-conditioned template\), we observe that cultural homogenization can surface through generic template\-like behavior from LLMs, which presumably respond without using explicit templates\. This highlights the importance of understanding how implicit structuring \(e\.g\., templates, plans\) or explicit tool use \(e\.g\., retrieval\-augmented generation\) can result in narrative homogenization\. This requires developing methods for identifying and measuring homogenization throughout the reasoning and tool\-use process\.

### 6\.2Stereotyping

While the cultural markers for most countries contained neutral or non\-offensive stereotypical attributes, the presence of offensive stereotypes for particular regions \(§ [5\.3](https://arxiv.org/html/2606.14626#S5.SS3)\) indicates potential for uneven representational harms\. Further, when AI systems are used to access cultural representations of communities through tasks like narrative generation, either by members within or those outside the group, stereotypes that are neutral or non\-offensive can propagate homogeneous stereotypical markers and narratives *within* the community Wang et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib164)\); Seth et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib166)\)\. This suggests the need to enrich notions of stereotypes to include *narrative stereotypes*, or stereotypes in narrative structures, styles, and plots, as well as those found in other modes of cultural representation\.

## 7Conclusion

In this work, we propose a method to examine how large language models localize narratives when generating stories for different cultural contexts\. Specifically, we assess whether localization is templated, where cultural markers are substituted into a culturally\-agnostic template, as opposed to holistic localization, where cultural context shapes the narrative through differences in plot, themes, or values throughout the story\. Our method first identifies the cultural markers that distinguish stories across cultural contexts and then measures the similarity of the narratives that remain after removing these markers\. Across five models, 125 topics, and 193 nationalities, we find that cultural variation is limited to a small subset of vocabulary; masking only 9\-17% of culturally\-associated words renders stories across cultures indistinguishable\. Moreover, many of these markers are stereotypical, and markers from 19 countries, primarily located in the Global South, are, on average, offensive, while those from the rest are non\-offensive or neutral\. Further, after masking these cultural markers, the remaining narratives become more similar across cultures, indicating the presence of shared narrative templates\. Overall, our method reveals that current AI\-generated stories primarily exhibit templated localization\. This suggests that evaluations of cultural competence that do not account for the mechanism of localization may overestimate models’ competence in generating localized narratives and highlights the need for methods that capture deeper narrative variation\.

## 8Limitations

We evaluate models by providing a single prompt and evaluating the resultant generation\. The use of these systems in the real world might involve more complex interactions, like multi\-turn conversations and detailed prompts Walsh \([2025](https://arxiv.org/html/2606.14626#bib.bib127)\)\. An important direction of future research here is to understand the degree of detail in the prompt or during a multi\-turn interaction that results in the model breaking out of its default homogeneous behavior, and the impact thereof on users from different sociocultural backgrounds\. Moreover, we focus on closed and open\-source LLMs in a zero\-shot prompting setting\. We leave the examination of other types of specialized systems, such as [Sudowrite](https://arxiv.org/html/2606.14626v1/sudowrite.com), a commercial tool for fiction writers, tools with narrative planning Xie and Riedl \([2024](https://arxiv.org/html/2606.14626#bib.bib132)\), or tools that are personalized for users or communities Hamna et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib133)\), to future work\. Our method of measuring whether localization is templated will be useful to evaluate how more detailed prompts, stronger models, or other interventions impact the presence of templated localization in LLM generations\.

We analyzed narratives generated when LLMs are instructed to write children’s stories\. While we based our selection of topics for these stories on established frameworks of cross\-cultural variation in values, thus expecting that stories written for these topics may manifest these variations, we acknowledge that the genre of the narratives can have an impact on the homogeneity\. We hope that the community will utilize our framework to extend the evaluation to other genres of narratives, such as screenplays, fiction for adults, essays, and even multimodal narratives like films\.

We operationalize culture through the proxy of nationality\. Future work must examine the homogenization effects at different levels of granularity within cultures, such as within a specific country Bhagat et al\. \([2026](https://arxiv.org/html/2606.14626#bib.bib65)\) and for other axes of identity Seth et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib166)\)\.

We relied on the SeeGULL dataset Jha et al\. \([2023](https://arxiv.org/html/2606.14626#bib.bib18)\) as our source of reference stereotypical attributes\. Since the candidate stereotypical attributes in SeeGULL were sourced from language models \(albeit different models than the ones evaluated here\), this may impact the degree of stereotypicality we observe\. We chose this dataset for its broad coverage of nationalities and ratings obtained from raters residing in those countries\. Future work should explore the use of different methods of collection of reference stereotypes, such as those created with community participation Dev et al\. \([2023](https://arxiv.org/html/2606.14626#bib.bib173)\)\.

Finally, all stories we evaluated were generated in English, and we focused on examining the similarity in the narratives as operationalized through multi\-word sequences\. This was done to facilitate word\-level analysis in identifying similarities to surface latent templates across stories\. An important direction of future research is to characterize the underlying narratives of these stories either computationally or manually through other forms of narrative representations such as discourse structure, themes, narrative events, characters, values, and so on Hamilton et al\. \([2026](https://arxiv.org/html/2606.14626#bib.bib171)\); Namuduri et al\. \([2025](https://arxiv.org/html/2606.14626#bib.bib167)\); Beguš \([2024](https://arxiv.org/html/2606.14626#bib.bib46)\); Piper et al\. \([2021](https://arxiv.org/html/2606.14626#bib.bib172)\)\.

## Acknowledgments

We thank Anjali Kantharuban, Joel Mire, and Saujas Vaduguru for their feedback on early drafts of the manuscript\. This work partially used computational resources from Bridges\-2 Brown et al\. \([2021](https://arxiv.org/html/2606.14626#bib.bib176)\) at Pittsburgh Supercomputing Center through allocation CIS250960 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support \(ACCESS\) program, which is supported by National Science Foundation grants \#2138259, \#2138286, \#2138307, \#2137603, and \#2138296\. This research was funded by the [National Institute of Standards and Technology \(NIST\)](https://ror.org/05xpvk416) and the [Carnegie Mellon University AI Measurement Science and Engineering Center \(AIMSEC\)](https://ror.org/05x2bcf33)\. Shaily Bhatt \(ORCID: 0000\-0001\-9616\-6264\) and Fernando Diaz \(ORCID: 0000\-0003\-2345\-1288\) were funded by NIST through Federal Award ID Number 60NANB24D231\.

## References

- M\. F\. Adilazuarda, S\. Mukherjee, P\. Lavania, S\. S\. Singh, A\. F\. Aji, J\. O’Neill, A\. Modi, and M\. Choudhury \(2024\) Towards Measuring and Modeling “Culture” in LLMs: A Survey\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 15763–15784\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.882), [Link](https://aclanthology.org/2024.emnlp-main.882) Cited by: [§3\.1](https://arxiv.org/html/2606.14626#S3.SS1.p2.1)\.
- D\. Agarwal, M\. Naaman, and A\. Vashistha \(2025\) AI Suggestions Homogenize Writing Toward Western Styles and Diminish Cultural Nuances\. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI ’25\. External Links: [Document](https://dx.doi.org/10.1145/3706598.3713564), [Link](http://dx.doi.org/10.1145/3706598.3713564) Cited by: [§1](https://arxiv.org/html/2606.14626#S1.p3.1), [§2](https://arxiv.org/html/2606.14626#S2.p3.1), [§6\.1](https://arxiv.org/html/2606.14626#S6.SS1.p3.1)\.
- B\. AlKhamissi, M\. ElNokrashy, M\. Alkhamissi, and M\. Diab \(2024\) Investigating Cultural Alignment of Large Language Models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 12404–12422\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.671), [Link](https://aclanthology.org/2024.acl-long.671) Cited by: [§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- N\. Beguš \(2024\) Experimental Narratives: A Comparison of Human Crowdsourced Storytelling and AI Storytelling\. Humanities and Social Sciences Communications 11\(1\), pp\. 1392\. External Links: [Document](https://dx.doi.org/10.1057/s41599-024-03868-8), ISSN 2662\-9992, [Link](https://doi.org/10.1057/s41599-024-03868-8) Cited by: [§2](https://arxiv.org/html/2606.14626#S2.p4.1), [§3\.3](https://arxiv.org/html/2606.14626#S3.SS3.p4.1), [§8](https://arxiv.org/html/2606.14626#S8.p5.1)\.
- K\. Bhagat, S\. Bhatt, A\. Velagapudi, A\. Vashistha, S\. Dave, and D\. Pruthi \(2026\) TALES: a taxonomy and analysis of cultural representations in LLM\-generated stories\. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems, CHI ’26, New York, NY, USA\. External Links: ISBN 9798400722783, [Link](https://doi.org/10.1145/3772318.3790519), [Document](https://dx.doi.org/10.1145/3772318.3790519) Cited by: [§1](https://arxiv.org/html/2606.14626#S1.p3.1), [§2](https://arxiv.org/html/2606.14626#S2.p3.1), [§4](https://arxiv.org/html/2606.14626#S4.SS0.SSS0.Px2.p1.1), [§6\.1](https://arxiv.org/html/2606.14626#S6.SS1.p3.1), [§8](https://arxiv.org/html/2606.14626#S8.p3.1)\.
- K\. Bhagat, K\. Vasisht, and D\. Pruthi \(2025\) Richer Output for Richer Countries: Uncovering Geographical Disparities in Generated Stories and Travel Recommendations\. In Findings of the Association for Computational Linguistics: NAACL 2025, L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 4660–4668\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.262), ISBN 979\-8\-89176\-195\-7, [Link](https://aclanthology.org/2025.findings-naacl.262/) Cited by: [§1](https://arxiv.org/html/2606.14626#S1.p3.1), [§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- S\. Bhatt, T\. August, and M\. Antoniak \(2025\) Research Borderlands: Analysing Writing Across Research Cultures\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 26238–26266\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1272), ISBN 979\-8\-89176\-251\-0, [Link](https://aclanthology.org/2025.acl-long.1272/) Cited by: [Appendix A](https://arxiv.org/html/2606.14626#A1.p1.1), [§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- S\. Bhatt, S\. Dev, P\. Talukdar, S\. Dave, and V\. Prabhakaran \(2022\) Re\-contextualizing Fairness in NLP: The Case of India\. In Proceedings of the 2nd Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\), Y\. He, H\. Ji, S\. Li, Y\. Liu, and C\. Chang \(Eds\.\), Online only, pp\. 727–740\. External Links: [Link](https://aclanthology.org/2022.aacl-main.55) Cited by: [§1](https://arxiv.org/html/2606.14626#S1.p3.1)\.
- S\. Bhatt and F\. Diaz \(2024\) Extrinsic Evaluation of Cultural Competence in Large Language Models\. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 16055–16074\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.942), [Link](https://aclanthology.org/2024.findings-emnlp.942) Cited by: [§1](https://arxiv.org/html/2606.14626#S1.p2.1), [§1](https://arxiv.org/html/2606.14626#S1.p3.1), [§1](https://arxiv.org/html/2606.14626#S1.p4.1), [§2](https://arxiv.org/html/2606.14626#S2.p3.1), [§4](https://arxiv.org/html/2606.14626#S4.SS0.SSS0.Px1.p1.1), [§4](https://arxiv.org/html/2606.14626#S4.SS0.SSS0.Px2.p1.1)\.
- A\. Biega, G\. Born, F\. Diaz, M\. L\. Gray, and R\. Qadri \(2025\) Towards a Multidisciplinary Vision for Culturally Inclusive Generative AI \(Dagstuhl Seminar 25022\)\. Dagstuhl Reports 15\(1\), pp\. 33–49\. Cited by: [§6\.1](https://arxiv.org/html/2606.14626#S6.SS1.p3.1)\.
- G\. Born, J\. Morris, F\. Diaz, and A\. Anderson \(2021\) Artificial Intelligence, Music Recommendation, and the Curation of Culture\. Schwartz Reisman Institute for Technology and Society White Paper\. Cited by: [§6\.1](https://arxiv.org/html/2606.14626#S6.SS1.p3.1)\.
- A\. Z\. Broder, S\. C\. Glassman, M\. S\. Manasse, and G\. Zweig \(1997\) Syntactic Clustering of the Web\. Computer Networks and ISDN Systems 29\(8\-13\), pp\. 1157–1166 \(en\)\. External Links: [Document](https://dx.doi.org/10.1016/S0169-7552%2897%2900031-7), ISSN 01697552, [Link](https://linkinghub.elsevier.com/retrieve/pii/S0169755297000317) Cited by: [§3\.3](https://arxiv.org/html/2606.14626#S3.SS3.p3.1)\.
- S\. T\. Brown, P\. Buitrago, E\. Hanna, S\. Sanielevici, R\. Scibek, and N\. A\. Nystrom \(2021\) Bridges\-2: a platform for rapidly\-evolving and data intensive research\. In Practice and Experience in Advanced Research Computing 2021: Evolution Across All Dimensions, PEARC ’21, New York, NY, USA\. External Links: ISBN 9781450382922, [Link](https://doi.org/10.1145/3437359.3465593), [Document](https://dx.doi.org/10.1145/3437359.3465593) Cited by: [Acknowledgments](https://arxiv.org/html/2606.14626#Sx1.p1.1)\.
- M\. Bucholtz and K\. Hall \(2005\) Identity and Interaction: A Sociocultural Linguistic Approach\. Discourse Studies 7\(4\-5\), pp\. 585–614 \(en\)\. External Links: [Document](https://dx.doi.org/10.1177/1461445605054407), ISSN 1461\-4456, [Link](https://doi.org/10.1177/1461445605054407) Cited by: [§2](https://arxiv.org/html/2606.14626#S2.p2.1)\.
- T\. Chakrabarty and P\. S\. Dhillon \(2026\) Can Good Writing Be Generative? Expert\-Level AI Writing Emerges through Fine\-Tuning on High\-Quality Books\. arXiv\. Note: arXiv:2601\.18353 \[cs\] External Links: [Link](http://arxiv.org/abs/2601.18353), [Document](https://dx.doi.org/10.48550/arXiv.2601.18353) Cited by: [§6\.1](https://arxiv.org/html/2606.14626#S6.SS1.p3.1)\.
- T\. A\. Chang, C\. Arnett, and Authors at the 5th Multilingual Representation Learning \(MRL\) Workshop \(2025\) Global PIQA: Evaluating Physical Commonsense Reasoning Across 100\+ Languages and Cultures\. Vol\. abs/2510\.24081\. External Links: [Link](https://arxiv.org/abs/2510.24081) Cited by: [§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- B\. N\. Colby \(1973\) A Partial Grammar of Eskimo Folktales\. American Anthropologist 75\(3\), pp\. 645–662\. External Links: ISSN 0002\-7294, [Link](https://www.jstor.org/stable/671780) Cited by: [§1](https://arxiv.org/html/2606.14626#S1.p2.1), [§2](https://arxiv.org/html/2606.14626#S2.p2.1)\.
- S\. Dev, J\. Goyal, D\. Tewari, S\. Dave, and V\. Prabhakaran \(2023\) Building Socio\-culturally Inclusive Stereotype Resources with Community Engagement\. arXiv\. Note: arXiv:2307\.10514 \[cs\.CL\] External Links: [Link](http://arxiv.org/abs/2307.10514), [Document](https://dx.doi.org/10.48550/arXiv.2307.10514) Cited by: [§8](https://arxiv.org/html/2606.14626#S8.p4.1)\.
- A\. R\. Doshi and O\. P\. Hauser \(2024\) Generative AI enhances individual creativity but reduces the collective diversity of novel content\. Science Advances 10\(28\), pp\. eadn5290\. External Links: [Document](https://dx.doi.org/10.1126/sciadv.adn5290), [Link](https://www.science.org/doi/10.1126/sciadv.adn5290) Cited by: [§1](https://arxiv.org/html/2606.14626#S1.p3.1)\.
- E\. Durmus, K\. Nguyen, T\. Liao, N\. Schiefer, A\. Askell, A\. Bakhtin, C\. Chen, Z\. Hatfield\-Dodds, D\. Hernandez, N\. Joseph, L\. Lovitt, S\. McCandlish, O\. Sikder, A\. Tamkin, J\. Thamkul, J\. Kaplan, J\. Clark, and D\. Ganguli \(2024\) Towards Measuring the Representation of Subjective Global Opinions in Language Models\. In First Conference on Language Modeling\. External Links: [Link](https://openreview.net/forum?id=zl16jLb91v) Cited by: [§2](https://arxiv.org/html/2606.14626#S2.p3.1), [§4](https://arxiv.org/html/2606.14626#S4.SS0.SSS0.Px1.p1.1)\.
- A\. Dwivedi, P\. Lavania, and A\. Modi \(2023\) EtiCor: Corpus for Analyzing LLMs for Etiquettes\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 6921–6931\. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.428), [Link](https://aclanthology.org/2023.emnlp-main.428) Cited by: [§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- P\. Eckert \(2008\) Variation and the Indexical Field\. Journal of Sociolinguistics 12\(4\), pp\. 453–476 \(en\)\. External Links: [Document](https://dx.doi.org/10.1111/j.1467-9841.2008.00374.x), ISSN 1360\-6441, 1467\-9841, [Link](https://onlinelibrary.wiley.com/doi/10.1111/j.1467-9841.2008.00374.x) Cited by: [§2](https://arxiv.org/html/2606.14626#S2.p2.1)\.
- P\. Eckert \(2012\) Three Waves of Variation Study: The Emergence of Meaning in the Study of Sociolinguistic Variation\. Annual Review of Anthropology 41\(1\), pp\. 87–100 \(en\)\. External Links: [Document](https://dx.doi.org/10.1146/annurev-anthro-092611-145828), ISSN 0084\-6570, 1545\-4290, [Link](https://www.annualreviews.org/doi/10.1146/annurev-anthro-092611-145828) Cited by: [§2](https://arxiv.org/html/2606.14626#S2.p2.1)\.
- A\. Fabbri, P\. Ng, Z\. Wang, R\. Nallapati, and B\. Xiang \(2020\) Template\-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\), Online, pp\. 4508–4513\. External Links: [Link](https://aclanthology.org/2020.acl-main.413/), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.413) Cited by: [§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- A\. Fan, M\. Lewis, and Y\. Dauphin \(2019\) Strategies for Structuring Story Generation\. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A\. Korhonen, D\. Traum, and L\. Màrquez \(Eds\.\), Florence, Italy, pp\. 2650–2660\. External Links: [Document](https://dx.doi.org/10.18653/v1/P19-1254), [Link](https://aclanthology.org/P19-1254) Cited by: [§1](https://arxiv.org/html/2606.14626#S1.p2.1)\.
- N\. Ford, D\. Duckworth, M\. Norouzi, and G\. Dahl \(2018\) The Importance of Generation Order in Language Modeling\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\), Brussels, Belgium, pp\. 2942–2946\. External Links: [Document](https://dx.doi.org/10.18653/v1/D18-1324), [Link](https://aclanthology.org/D18-1324) Cited by: [§1](https://arxiv.org/html/2606.14626#S1.p2.1), [§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- M\. Freitag, G\. Foster, D\. Grangier, V\. Ratnakar, Q\. Tan, and W\. Macherey \(2021\) Experts, Errors, and Context: A Large\-Scale Study of Human Evaluation for Machine Translation\. Transactions of the Association for Computational Linguistics 9, pp\. 1460–1474\. Cambridge, MA\. External Links: [Link](https://aclanthology.org/2021.tacl-1.87), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00437) Cited by: [§6\.1](https://arxiv.org/html/2606.14626#S6.SS1.p3.1)\.
- R\. Gangadharaiah and B\. Narayanaswamy \(2020\) Recursive Template\-based Frame Generation for Task Oriented Dialog\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\), Online, pp\. 2059–2064\. External Links: [Link](https://aclanthology.org/2020.acl-main.186/), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.186) Cited by: [§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- Gemma Team \(2025\) Gemma 3 Technical Report\. arXiv\. Note: arXiv:2503\.19786 \[cs\] External Links: [Link](http://arxiv.org/abs/2503.19786), [Document](https://dx.doi.org/10.48550/arXiv.2503.19786) Cited by: [§4](https://arxiv.org/html/2606.14626#S4.SS0.SSS0.Px3.p1.1)\.
- S\. Goldfarb\-Tarrant, T\. Chakrabarty, R\. Weischedel, and N\. Peng \(2020\) Content Planning for Neural Story Generation with Aristotelian Rescoring\. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\), Online, pp\. 4319–4338\. External Links: [Link](https://aclanthology.org/2020.emnlp-main.351/), [Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.351) Cited by: [§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- J\. Graham, J\. Haidt, S\. Koleva, M\. Motyl, R\. Iyer, S\. P\. Wojcik, and P\. H\. Ditto \(2012\) Moral Foundations Theory: The Pragmatic Validity of Moral Pluralism\. External Links: [Link](https://api.semanticscholar.org/CorpusID:2570757) Cited by: [§4](https://arxiv.org/html/2606.14626#S4.SS0.SSS0.Px1.p1.1)\.
- C\. Haerpfer, R\. Inglehart, A\. Moreno, C\. Welzel, K\. Kizilova, J\. Diez\-Medrano, M\. Lagos, P\. Norris, E\. Ponarin, and B\. Puranen \(Eds\.\) \(2022\) World Values Survey: Round Seven \- Country\-Pooled Datafile Version 5\.0\. JD Systems Institute & WVSA Secretariat, Madrid, Spain & Vienna, Austria\. External Links: [Document](https://dx.doi.org/10.14281/18241.20) Cited by: [§1](https://arxiv.org/html/2606.14626#S1.p4.1), [§4](https://arxiv.org/html/2606.14626#S4.SS0.SSS0.Px1.p1.1)\.
- S\. Hamilton, M\. Wilkens, and A\. Piper \(2026\) NarraBench: A Comprehensive Framework for Narrative Benchmarking\. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\), V\. Demberg, K\. Inui, and L\. Marquez \(Eds\.\), Rabat, Morocco, pp\. 3786–3801\. External Links: ISBN 979\-8\-89176\-380\-7, [Link](https://aclanthology.org/2026.eacl-long.176/), [Document](https://dx.doi.org/10.18653/v1/2026.eacl-long.176) Cited by: [§2](https://arxiv.org/html/2606.14626#S2.p1.1), [§8](https://arxiv.org/html/2606.14626#S8.p5.1)\.
- Hamna, D\. Sudharsan, A\. Seth, R\. Budhiraja, D\. Khullar, V\. Jain, K\. Bali, A\. Vashistha, and S\. Segal \(2025\) Kahani: culturally\-nuanced visual storytelling tool for non\-western cultures\. In Proceedings of the 2025 ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies, COMPASS ’25, New York, NY, USA, pp\. 379–400\. External Links: ISBN 9798400714849, [Link](https://doi.org/10.1145/3715335.3735478), [Document](https://dx.doi.org/10.1145/3715335.3735478) Cited by: [§8](https://arxiv.org/html/2606.14626#S8.p1.1)\.
- D\. Hemment and C\. Kommers \(2025\)Doing AI Differently: Rethinking the foundations of AI via the humanities\.Technical reportThe Alan Turing Institute\.Cited by:[§6\.1](https://arxiv.org/html/2606.14626#S6.SS1.p3.1)\.
- D\. G\. Hobson, H\. Zhou, D\. Ruths, and A\. Piper \(2024\)Story Morals: Surfacing value\-driven narrative schemas using large language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 12998–13032\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.723),[Link](https://aclanthology.org/2024.emnlp-main.723)Cited by:[§1](https://arxiv.org/html/2606.14626#S1.p2.1),[§2](https://arxiv.org/html/2606.14626#S2.p2.1)\.
- G\. Hofstede and M\. Minkov \(2013\)Values Survey Module 2013 Manual\.External Links:[Link](https://geerthofstede.com/research-and-vsm/vsm-2013/)Cited by:[§1](https://arxiv.org/html/2606.14626#S1.p4.1),[§4](https://arxiv.org/html/2606.14626#S4.SS0.SSS0.Px1.p1.1)\.
- A\. Jha, A\. Mostafazadeh Davani, C\. K\. Reddy, S\. Dave, V\. Prabhakaran, and S\. Dev \(2023\)SeeGULL: A Stereotype Benchmark with Broad Geo\-Cultural Coverage Leveraging Generative Models\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 9851–9870\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.548),[Link](https://aclanthology.org/2023.acl-long.548)Cited by:[§1](https://arxiv.org/html/2606.14626#S1.p4.1),[§4](https://arxiv.org/html/2606.14626#S4.SS0.SSS0.Px6.p1.1),[§8](https://arxiv.org/html/2606.14626#S8.p4.1)\.
- L\. Jiang, Y\. Chai, M\. Li, M\. Liu, R\. Fok, N\. Dziri, Y\. Tsvetkov, M\. Sap, and Y\. Choi \(2025\)Artificial hivemind: the open\-ended homogeneity of language models \(and beyond\)\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=saDOrrnNTz)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p4.1)\.
- S\. Khanuja, V\. Iyer, X\. He, and G\. Neubig \(2025\)Towards Automatic Evaluation for Image Transcreation\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 7034–7047\.External Links:ISBN 979\-8\-89176\-189\-6,[Link](https://aclanthology.org/2025.naacl-long.359/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.359)Cited by:[§1](https://arxiv.org/html/2606.14626#S1.p3.1)\.
- S\. Khanuja, S\. Ramamoorthy, Y\. Song, and G\. Neubig \(2024\)An Image Speaks a Thousand Words, but can Everyone Listen? On Image Transcreation for Cultural Relevance\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 10258–10279\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.573/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.573)Cited by:[§1](https://arxiv.org/html/2606.14626#S1.p2.1),[§1](https://arxiv.org/html/2606.14626#S1.p3.1)\.
- H\. Li, L\. Jiang, N\. Dziri, X\. Ren, and Y\. Choi \(2024\)CULTURE\-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=DbsLm2KAqP)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- J\. Li, Y\. Chen, Z\. Liu, M\. Tan, L\. Zhang, Y\. Li, R\. Luo, L\. Chen, J\. Luo, A\. Argha, H\. Alinejad\-Rokny, W\. Zhou, and M\. Yang \(2025\)STORYTELLER: An Enhanced Plot\-Planning Framework for Coherent and Cohesive Story Generation\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 20818–20846\.External Links:[Link](https://aclanthology.org/2025.findings-acl.1071/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1071),ISBN 979\-8\-89176\-256\-5Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- Llama Team, AI @ Meta \(2024\)The Llama 3 Herd of Models\.arXiv\.Note:arXiv:2407\.21783 \[cs\]External Links:[Link](http://arxiv.org/abs/2407.21783),[Document](https://dx.doi.org/10.48550/arXiv.2407.21783)Cited by:[§4](https://arxiv.org/html/2606.14626#S4.SS0.SSS0.Px3.p1.1)\.
- L\. Lucy, J\. Dodge, D\. Bamman, and K\. Keith \(2023\)Words as Gatekeepers: Measuring Discipline\-specific Terms and Meanings in Scholarly Publications\.InFindings of the Association for Computational Linguistics: ACL 2023,A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 6929–6947\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.433),[Link](https://aclanthology.org/2023.findings-acl.433)Cited by:[Appendix A](https://arxiv.org/html/2606.14626#A1.p1.1)\.
- A\. Maji, R\. Kumar, A\. Ghosh, Anushka, and S\. Saha \(2025\)SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models’ Knowledge of Indian Culture\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 4434–4451\.External Links:[Link](https://aclanthology.org/2025.findings-acl.228/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.228),ISBN 979\-8\-89176\-256\-5Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- M\. Marone, O\. Weller, W\. Fleshman, E\. Yang, D\. Lawrie, and B\. V\. Durme \(2025\)MmBERT: A Modern Multilingual Encoder with Annealed Language Learning\.Vol\.abs/2509\.06888\.External Links:[Link](https://arxiv.org/abs/2509.06888)Cited by:[Appendix C](https://arxiv.org/html/2606.14626#A3.p1.2),[§4](https://arxiv.org/html/2606.14626#S4.SS0.SSS0.Px4.p1.1)\.
- L\. Martin, P\. Ammanabrolu, X\. Wang, W\. Hancock, S\. Singh, B\. Harrison, and M\. Riedl \(2018\)Event Representations for Automated Story Generation with Deep Neural Nets\.Proceedings of the AAAI Conference on Artificial Intelligence32\(1\) \(en\)\.External Links:ISSN 2374\-3468, 2159\-5399,[Link](https://ojs.aaai.org/index.php/AAAI/article/view/11430),[Document](https://dx.doi.org/10.1609/aaai.v32i1.11430)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- R\. I\. Masoud, Z\. Liu, M\. Ferianc, P\. Treleaven, and M\. Rodrigues \(2025\)Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede’s Cultural Dimensions\.InProceedings of the 31st International Conference on Computational Linguistics,O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\),Abu Dhabi, UAE,pp\. 8474–8503\.External Links:[Link](https://aclanthology.org/2025.coling-main.567/)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p3.1),[§4](https://arxiv.org/html/2606.14626#S4.SS0.SSS0.Px1.p1.1)\.
- N\. McIntyre and M\. Lapata \(2009\)Learning to Tell Tales: A Data\-driven Approach to Story Generation\.InProceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP,K\. Su, J\. Su, J\. Wiebe, and H\. Li \(Eds\.\),Suntec, Singapore,pp\. 217–225\.External Links:[Link](https://aclanthology.org/P09-1025/)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- N\. McIntyre and M\. Lapata \(2010\)Plot Induction and Evolutionary Search for Story Generation\.InProceedings of the 48th Annual Meeting of the Association for Computational Linguistics,J\. Hajič, S\. Carberry, S\. Clark, and J\. Nivre \(Eds\.\),Uppsala, Sweden,pp\. 1562–1572\.External Links:[Link](https://aclanthology.org/P10-1158)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- J\. Myung, N\. Lee, Y\. Zhou, J\. Jin, R\. A\. Putri, D\. Antypas, H\. Borkakoty, E\. Kim, C\. Pérez\-Almendros, A\. A\. Ayele, V\. Gutiérrez\-Basulto, Y\. Ibáñez\-García, H\. Lee, S\. H\. Muhammad, K\. Park, A\. Rzayev, N\. White, S\. M\. Yimam, M\. T\. Pilehvar, N\. Ousidhoum, J\. Camacho\-Collados, and A\. Oh \(2024\)BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/8eb88844dafefa92a26aaec9f3acad93-Abstract-Datasets%5C_and%5C_Benchmarks%5C_Track.html)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- R\. Namuduri, Y\. Wu, A\. A\. Zheng, M\. Wadhwa, G\. Durrett, and J\. J\. Li \(2025\)QUDsim: quantifying discourse similarities in LLM\-generated text\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=zFz1BJu211)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p4.1),[§3\.3](https://arxiv.org/html/2606.14626#S3.SS3.p4.1),[§8](https://arxiv.org/html/2606.14626#S8.p5.1)\.
- V\. Padmakumar and H\. He \(2024\)Does Writing with Language Models Reduce Content Diversity?\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,External Links:[Link](https://openreview.net/forum?id=Feiz5HtCD0)Cited by:[§3\.3](https://arxiv.org/html/2606.14626#S3.SS3.p2.1)\.
- L\. Pemberton \(1989\)A modular approach to story generation\.InFourth Conference of the European Chapter of the Association for Computational Linguistics,H\. Somers and M\. McGee Wood \(Eds\.\),Manchester, England\.External Links:[Link](https://aclanthology.org/E89-1030)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- A\. Piper, R\. J\. So, and D\. Bamman \(2021\)Narrative Theory for Computational Narrative Understanding\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\),Online and Punta Cana, Dominican Republic,pp\. 298–311\.External Links:[Link](https://aclanthology.org/2021.emnlp-main.26/),[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.26)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1),[§3\.3](https://arxiv.org/html/2606.14626#S3.SS3.p4.1),[§8](https://arxiv.org/html/2606.14626#S8.p5.1)\.
- G\. Polti \(1916\)The Thirty\-Six Dramatic Situations\.The Writer,Boston, MA\.Note:First published in French as "Les trente\-six situations dramatiques", 1895External Links:[Link](https://archive.org/details/thirtysixdramati00polt)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p2.1)\.
- A\. Priyanshu and S\. Vijay \(2024\)The Silent Curriculum: How Does LLM Monoculture Shape Educational Content and Its Accessibility?\.External Links:[Link](https://arxiv.org/abs/2407.10371)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p4.1)\.
- R\. Qadri, A\. M\. Davani, K\. Robinson, and V\. Prabhakaran \(2025a\)Risks of Cultural Erasure in Large Language Models\.Vol\.abs/2501\.01056\.External Links:[Link](https://arxiv.org/abs/2501.01056)Cited by:[§1](https://arxiv.org/html/2606.14626#S1.p3.1)\.
- R\. Qadri, M\. Diaz, D\. Wang, and M\. Madaio \(2025b\)The Case for "Thick Evaluations" of Cultural Representation in AI\.arXiv\.Note:arXiv:2503\.19075 \[cs\]External Links:[Link](http://arxiv.org/abs/2503.19075),[Document](https://dx.doi.org/10.48550/arXiv.2503.19075)Cited by:[§6\.1](https://arxiv.org/html/2606.14626#S6.SS1.p3.1)\.
- A\. Ramezani and Y\. Xu \(2023\)Knowledge of Cultural Moral Norms in Large Language Models\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 428–446\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.26),[Link](https://aclanthology.org/2023.acl-long.26)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- A\. Rao, A\. Yerukola, V\. Shah, K\. Reinecke, and M\. Sap \(2025\)NormAd: A Framework for Measuring the Cultural Adaptability of Large Language Models\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 2373–2403\.External Links:[Link](https://aclanthology.org/2025.naacl-long.120/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.120),ISBN 979\-8\-89176\-189\-6Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- E\. Reiter and R\. Dale \(1997\)Building Applied Natural Language Generation Systems\.Natural Language Engineering3\(1\),pp\. 57–87\.Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- M\. O\. Riedl and R\. M\. Young \(2010\)Narrative Planning: Balancing Plot and Character\.J\. Artif\. Int\. Res\.39\(1\),pp\. 217–268\.External Links:ISSN 1076\-9757Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- D\. Rooein, V\. Zouhar, D\. Nozza, and D\. Hovy \(2025\)Biased Tales: Cultural and Topic Bias in Generating Children’s stories\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 52–72\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.3/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.3),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.14626#S1.p3.1),[§2](https://arxiv.org/html/2606.14626#S2.p3.1),[§4](https://arxiv.org/html/2606.14626#S4.SS0.SSS0.Px2.p1.1)\.
- J\. Ryan \(2017\)Grimes’ Fairy Tales: A 1960s Story Generator\.InInteractive Storytelling: 10th International Conference on Interactive Digital Storytelling, ICIDS 2017 Funchal, Madeira, Portugal, November 14–17, 2017, Proceedings,Berlin, Heidelberg,pp\. 89–103\.External Links:[Document](https://dx.doi.org/10.1007/978-3-319-71027-3%5F8),ISBN 978\-3\-319\-71026\-6,[Link](https://doi.org/10.1007/978-3-319-71027-3_8)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- P\. Sahoo, M\. Brahma, and M\. S\. Desarkar \(2025\)DIWALI \- Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 33599–33626\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1706),ISBN 979\-8\-89176\-332\-6,[Link](https://aclanthology.org/2025.emnlp-main.1706/)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- N\. Sambasivan, E\. Arnesen, B\. Hutchinson, T\. Doshi, and V\. Prabhakaran \(2021\)Re\-imagining Algorithmic Fairness in India and Beyond\.InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency,FAccT ’21,New York, NY, USA,pp\. 315–328\.External Links:ISBN 978\-1\-4503\-8309\-7,[Link](https://dl.acm.org/doi/10.1145/3442188.3445896),[Document](https://dx.doi.org/10.1145/3442188.3445896)Cited by:[§1](https://arxiv.org/html/2606.14626#S1.p3.1)\.
- A\. Seth, S\. Ahuja, K\. Bali, and S\. Sitaram \(2024\)DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 5323–5337\.External Links:[Link](https://aclanthology.org/2024.lrec-main.474)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- A\. Seth, M\. Choudhury, S\. Sitaram, K\. Toyama, A\. Vashistha, and K\. Bali \(2025\)How Deep Is Representational Bias in LLMs? The Cases of Caste and Religion\.Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society8\(3\),pp\. 2319–2330\(en\)\.External Links:ISSN 3065\-8365,[Link](https://ojs.aaai.org/index.php/AIES/article/view/36718),[Document](https://dx.doi.org/10.1609/aies.v8i3.36718)Cited by:[§6\.2](https://arxiv.org/html/2606.14626#S6.SS2.p1.1),[§8](https://arxiv.org/html/2606.14626#S8.p3.1)\.
- C\. Shaib, Y\. Elazar, J\. J\. Li, and B\. C\. Wallace \(2024\)Detection and Measurement of Syntactic Templates in Generated Text\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 6416–6431\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.368/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.368)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p4.1)\.
- C\. Shaib, V\. S\. Govindarajan, J\. Barrow, J\. Sun, A\. Siu, B\. C\. Wallace, and A\. Nenkova \(2025\)Standardizing the measurement of text diversity: a tool and comparative analysis\.InProceedings of The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations,X\. Liu and A\. Purwarianti \(Eds\.\),Mumbai, India,pp\. 36–46\.External Links:[Link](https://aclanthology.org/2025.ijcnlp-demo.5/),ISBN 979\-8\-89176\-301\-2Cited by:[§3\.3](https://arxiv.org/html/2606.14626#S3.SS3.p2.1)\.
- R\. Shelby, S\. Rismani, K\. Henne, A\. Moon, N\. Rostamzadeh, P\. Nicholas, N\. Yilla\-Akbari, J\. Gallegos, A\. Smart, E\. Garcia, and G\. Virk \(2023\)Sociotechnical Harms of Algorithmic Systems: Scoping a Taxonomy for Harm Reduction\.AIES ’23,New York, NY, USA,pp\. 723–741\.External Links:[Document](https://dx.doi.org/10.1145/3600211.3604673),ISBN 979\-8\-4007\-0231\-0,[Link](https://dl.acm.org/doi/10.1145/3600211.3604673)Cited by:[§1](https://arxiv.org/html/2606.14626#S1.p3.1)\.
- S\. Singh, A\. Romanou, C\. Fourrier, D\. I\. Adelani, J\. G\. Ngui, D\. Vila\-Suero, P\. Limkonchotiwat, K\. Marchisio, W\. Q\. Leong, Y\. Susanto, R\. Ng, S\. Longpre, S\. Ruder, W\. Ko, A\. Bosselut, A\. Oh, A\. Martins, L\. Choshen, D\. Ippolito, E\. Ferrante, M\. Fadaee, B\. Ermis, and S\. Hooker \(2025\)Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 18761–18799\.External Links:[Link](https://aclanthology.org/2025.acl-long.919/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.919),ISBN 979\-8\-89176\-251\-0Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- S\. Song \(2017\)Narrative Structures in Korean Folktales: A Comparative Analysis of Korean and English Versions\.Topics in Linguistics18\(2\),pp\. 1–23\(en\)\.External Links:[Document](https://dx.doi.org/10.1515/topling-2017-0007),ISSN 2199\-6504, 1337\-7590,[Link](https://topling.ukf.sk/index.php/topling/article/view/42)Cited by:[§1](https://arxiv.org/html/2606.14626#S1.p2.1),[§2](https://arxiv.org/html/2606.14626#S2.p2.1)\.
- Z\. Sourati, F\. Karimi\-Malekabadi, M\. Ozcan, C\. McDaniel, A\. Ziabari, J\. Trager, A\. Tak, M\. Chen, F\. Morstatter, and M\. Dehghani \(2025\)The Shrinking Landscape of Linguistic Diversity in the Age of Large Language Models\.Vol\.abs/2502\.11266\.External Links:[Link](https://arxiv.org/abs/2502.11266)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p4.1)\.
- K\. Sparck Jones and J\. R\. Galliers\. Evaluating Natural Language Processing Systems: An Analysis and Review\.External Links:ISBN 3\-540\-61309\-9Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p3.1)\.
- Y\. Tian, T\. Huang, M\. Liu, D\. Jiang, A\. Spangher, M\. Chen, J\. May, and N\. Peng \(2024\)Are Large Language Models Capable of Generating Human\-Level Narratives?\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 17659–17681\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.978),[Link](https://aclanthology.org/2024.emnlp-main.978/)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p4.1)\.
- K\. van Deemter, E\. Krahmer, and M\. Theune \(2005\)Squibs and Discussions: Real versus Template\-Based Natural Language Generation: A False Opposition?\.Computational Linguistics31\(1\),pp\. 15–24\.Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- M\. Walsh \(2025\)AI Fiction in the Wild\.Note:UC Berkeley School of Information EventAccessed: 2026\-03\-07External Links:[Link](https://www.ischool.berkeley.edu/events/2025/ai-fiction-wild)Cited by:[§8](https://arxiv.org/html/2606.14626#S8.p1.1)\.
- A\. Wang, X\. Bai, S\. Barocas, and S\. L\. Blodgett \(2025\)Measuring machine learning harms from stereotypes requires understanding who is harmed by which errors in what ways\.InProceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency,FAccT ’25,New York, NY, USA,pp\. 746–762\.Cited by:[§6\.2](https://arxiv.org/html/2606.14626#S6.SS2.p1.1)\.
- X\. Wang, A\. Antoniades, Y\. Elazar, A\. Amayuelas, A\. Albalak, K\. Zhang, and W\. Y\. Wang \(2024\)Generalization v\.s\. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data\.Vol\.abs/2407\.14985\.External Links:[Link](https://arxiv.org/abs/2407.14985)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p4.1)\.
- S\. Wiseman, S\. Shieber, and A\. Rush \(2018\)Learning Neural Templates for Text Generation\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium,pp\. 3174–3187\.External Links:[Document](https://dx.doi.org/10.18653/v1/D18-1356),[Link](https://aclanthology.org/D18-1356)Cited by:[§1](https://arxiv.org/html/2606.14626#S1.p2.1),[§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- D\. Wright, S\. Masud, J\. Moore, S\. Yadav, M\. Antoniak, P\. E\. Christensen, C\. Y\. Park, and I\. Augenstein \(2025\)Epistemic Diversity and Knowledge Collapse in Large Language Models\.Vol\.abs/2510\.04226\.External Links:[Link](https://arxiv.org/abs/2510.04226)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p4.1)\.
- W\. Wu, L\. Wang, and R\. Mihalcea \(2023\)Cross\-Cultural Analysis of Human Values, Morals, and Biases in Folk Tales\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 5113–5125\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.311),[Link](https://aclanthology.org/2023.emnlp-main.311)Cited by:[§1](https://arxiv.org/html/2606.14626#S1.p2.1),[§2](https://arxiv.org/html/2606.14626#S2.p2.1)\.
- K\. Xie and M\. Riedl \(2024\)Creating Suspenseful Stories: Iterative Planning with Large Language Models\.InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),Y\. Graham and M\. Purver \(Eds\.\),St\. Julian’s, Malta,pp\. 2391–2407\.External Links:[Link](https://aclanthology.org/2024.eacl-long.147)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1),[§8](https://arxiv.org/html/2606.14626#S8.p1.1)\.
- J\. Xu, X\. Ren, Y\. Zhang, Q\. Zeng, X\. Cai, and X\. Sun \(2018\)A Skeleton\-Based Model for Promoting Coherence Among Sentences in Narrative Story Generation\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium,pp\. 4306–4315\.External Links:[Link](https://aclanthology.org/D18-1462/),[Document](https://dx.doi.org/10.18653/v1/D18-1462)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- W\. Xu, N\. Jojic, S\. Rao, C\. Brockett, and B\. Dolan \(2025\)Echoes in AI: Quantifying lack of plot diversity in LLM outputs\.Proceedings of the National Academy of Sciences122\(35\),pp\. e2504966122\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2504966122),[Link](https://www.pnas.org/doi/abs/10.1073/pnas.2504966122)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p4.1)\.
- L\. Yao, N\. Peng, R\. M\. Weischedel, K\. Knight, D\. Zhao, and R\. Yan \(2019\)Plan\-and\-Write: Towards Better Automatic Storytelling\.InThe Thirty\-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty\-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 \- February 1, 2019,pp\. 7378–7385\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v33i01.33017378),[Link](https://doi.org/10.1609/aaai.v33i01.33017378)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.
- J\. Zhang, W\. Hamilton, C\. Danescu\-Niculescu\-Mizil, D\. Jurafsky, and J\. Leskovec \(2017\)Community Identity and User Engagement in a Multi\-Community Landscape\.Proceedings of the International AAAI Conference on Web and Social Media11\(1\),pp\. 377–386\(en\)\.External Links:[Document](https://dx.doi.org/10.1609/icwsm.v11i1.14904),ISSN 2334\-0770, 2162\-3449,[Link](https://ojs.aaai.org/index.php/ICWSM/article/view/14904)Cited by:[Appendix A](https://arxiv.org/html/2606.14626#A1.p1.1)\.
- L\. Zhou and E\. Hovy \(2004\)Template\-Filtered Headline Summarization\.InText Summarization Branches Out,Barcelona, Spain,pp\. 56–60\.External Links:[Link](https://aclanthology.org/W04-1010/)Cited by:[§2](https://arxiv.org/html/2606.14626#S2.p1.1)\.

## Appendix A NPMI Operationalization

To compute normalized pointwise mutual information \(NPMI\), we count the number of times that a word appears in stories of a specific culture and in stories across all cultures\. A higher positive NPMI indicates a stronger association between the word and the culture\. NPMI has been used in prior work to measure the association between vocabulary and different classes within a corpus, such as specific communities \(Zhang et al\., [2017](https://arxiv.org/html/2606.14626#bib.bib76); Lucy et al\., [2023](https://arxiv.org/html/2606.14626#bib.bib77); Bhatt et al\., [2025](https://arxiv.org/html/2606.14626#bib.bib71)\)\.
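The counting described above can be sketched as follows. This is a minimal illustration that assumes the standard NPMI definition, PMI divided by the negative log of the joint probability, with probabilities estimated from raw token counts and whitespace tokenization. The function name `npmi_scores` and the input layout are hypothetical, and the tokenization and smoothing used in the paper may differ.

```python
import math
from collections import Counter


def npmi_scores(stories_by_culture):
    """Return {(word, culture): NPMI} for every observed word-culture pair.

    Assumption: whitespace tokenization and unsmoothed count-based estimates.
    """
    word_culture = Counter()   # occurrences of a word in one culture's stories
    word_total = Counter()     # occurrences of a word across all cultures
    culture_total = Counter()  # token count per culture
    for culture, stories in stories_by_culture.items():
        for story in stories:
            for word in story.lower().split():
                word_culture[(word, culture)] += 1
                word_total[word] += 1
                culture_total[culture] += 1

    n = sum(culture_total.values())
    scores = {}
    for (word, culture), joint in word_culture.items():
        p_wc = joint / n
        p_w = word_total[word] / n
        p_c = culture_total[culture] / n
        pmi = math.log(p_wc / (p_w * p_c))
        # Dividing by -log p(w, c) bounds the score to [-1, 1];
        # the degenerate single-pair corpus (p_wc == 1) is mapped to 0.
        scores[(word, culture)] = pmi / -math.log(p_wc) if p_wc < 1 else 0.0
    return scores


toy = {"A": ["the sari was red"], "B": ["the kilt was red"]}
scores = npmi_scores(toy)
```

In this toy corpus, a word that occurs only in one culture's story ("sari") gets a positive score, while a word shared evenly across cultures ("the") scores zero.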

## Appendix B Stereotypicality in Cultural Markers

We measure the stereotypicality of cultural markers by calculating the precision of the set of cultural markers \($V_c^k$\) against the reference set of stereotypical attributes \($Z_c$\)\. Specifically, precision is calculated as follows, where $|A|$ denotes the cardinality of the set $A$:

$$\frac{|V_c^k \cap Z_c|}{|V_c^k|} \tag{1}$$

We further measure the average offensiveness of the lexical items in the intersection set\. Specifically, letting $O(x)$ be the offensiveness score of a stereotypical attribute $x$, we calculate average offensiveness as:

$$\frac{\sum_{x \in V_c^k \cap Z_c} O(x)}{|V_c^k \cap Z_c|} \tag{2}$$
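Equations (1) and (2) translate directly into set operations. The sketch below is illustrative only: the function names are hypothetical, the reference set $Z_c$ is assumed to be the keys of a dictionary mapping each stereotypical attribute to its offensiveness score, and the marker list and scores are made-up values rather than data from the paper or from SeeGULL.

```python
def stereotype_precision(markers, stereotypes):
    """Eq. (1): fraction of cultural markers that appear in the reference set."""
    markers, stereotypes = set(markers), set(stereotypes)
    if not markers:
        return 0.0
    return len(markers & stereotypes) / len(markers)


def average_offensiveness(markers, offensiveness):
    """Eq. (2): mean offensiveness over markers that are also stereotypes.

    `offensiveness` maps each attribute in the reference set to its score.
    Returns None when the intersection is empty (the average is undefined).
    """
    overlap = set(markers) & set(offensiveness)
    if not overlap:
        return None
    return sum(offensiveness[x] for x in overlap) / len(overlap)


# Made-up example: four markers, two of which are in the reference set.
markers = ["festival", "mystic", "river", "untidy"]
offensiveness = {"mystic": 0.5, "untidy": 1.5, "religious": 0.0}

precision = stereotype_precision(markers, offensiveness.keys())
avg_offense = average_offensiveness(markers, offensiveness)
```

Returning `None` for an empty intersection is a design choice of this sketch; it keeps countries with no overlapping markers from being reported as having zero offensiveness.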

## Appendix C Classifier Parameters

We fine\-tuned the mmBERT model \(Marone et al\., [2025](https://arxiv.org/html/2606.14626#bib.bib78)\) with a classifier head to predict the culture of a story in our corpus as a 193\-way classification task\. All classifiers were trained for at most 50 epochs with an early stopping criterion of 5 epochs, using the validation macro $F_1$ to decide when to stop\. Each classifier was trained on 60% of the stories, with 20% each held out for validation and testing\. The maximum sequence length was set to 768 and a batch size of 32 was used\. After preliminary trials, the learning rate was set to $3 \times 10^{-5}$ with a warm\-up schedule of 500 steps\. A random seed of 47 was used for reproducibility\. Table [4](https://arxiv.org/html/2606.14626#A3.T4) shows the validation accuracy and best epoch for each final classifier\.
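The data split and stopping rule above can be sketched without any training framework. The helper names below are hypothetical, and the sketch assumes that "an early stopping criterion of 5 epochs" means a patience of 5 epochs without improvement in validation macro $F_1$; the training code actually used is not shown in the paper.

```python
import random


def split_indices(n, seed=47, train=0.6, val=0.2):
    """Shuffle n story indices and split them 60/20/20 by default."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * train)
    n_val = round(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def best_epoch_with_early_stopping(val_f1_per_epoch, patience=5, max_epochs=50):
    """Return (best_epoch, best_f1), stopping after `patience` stale epochs.

    Assumption: an epoch is stale when validation macro F1 does not
    strictly improve on the best value seen so far.
    """
    best_f1, best_epoch, stale = float("-inf"), 0, 0
    for epoch, f1 in enumerate(val_f1_per_epoch[:max_epochs], start=1):
        if f1 > best_f1:
            best_f1, best_epoch, stale = f1, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_f1


train_idx, val_idx, test_idx = split_indices(1000)
# Made-up validation curve: the late jump to 0.9 is never reached
# because training stops after five epochs without improvement.
stopped = best_epoch_with_early_stopping(
    [0.1, 0.5, 0.6, 0.6, 0.59, 0.58, 0.6, 0.57, 0.9]
)
```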

Table 4: Best validation accuracy and corresponding epoch for each LLM and fold\.

## Appendix D Complete Classifier Results

![Refer to caption](https://arxiv.org/html/2606.14626v1/x2.png) \(a\) GPT 3\.5 Turbo
![Refer to caption](https://arxiv.org/html/2606.14626v1/x3.png) \(b\) Llama 3\.3 70B Instruct
![Refer to caption](https://arxiv.org/html/2606.14626v1/x4.png) \(c\) Gemma 3 12B Instruct
![Refer to caption](https://arxiv.org/html/2606.14626v1/x5.png) \(d\) Llama 3\.1 8B Instruct

Figure 3: Identifying cultural markers\. $F_1$ of the nationality classifier as a function of the number of masked words\.

Figure [3](https://arxiv.org/html/2606.14626#A4.F3) shows the performance of the culture classifier for the remaining four LLMs\. As with the results described in § 5\.1, for all LLMs, classifier performance drops sharply when the culturally associated vocabulary is masked\. The drop is much more uniform under random masking\. For all models, performance on the original stories is near perfect\.

## Appendix E Complete Homogeneity Results

Table 5: Narrative homogeneity\. Multi\-word similarity amongst stories on the same topic in either their original form or when random words equivalent to the number of cultural markers are masked\. Intra\-group measures the similarity amongst stories for the same nationality\. The last column group divides the inter\-group similarity by the intra\-group similarity to control for similarity attributable to a cross\-nationality template\.

Table [5](https://arxiv.org/html/2606.14626#A5.T5) shows the homogeneity, measured as the average similarity of stories when a number of random words equal to the number of cultural markers is masked\.
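The inter\-group to intra\-group ratio in the last column group can be illustrated with a small sketch. The similarity measure here is an assumption: Jaccard overlap of word trigrams stands in for the paper's multi\-word similarity, which is defined in the main text and may differ. All function names are hypothetical.

```python
from itertools import combinations


def ngrams(text, n=3):
    """Set of word n-grams in a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_similarity(pairs, n=3):
    """Average n-gram Jaccard similarity over a list of story pairs."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(jaccard(ngrams(x, n), ngrams(y, n)) for x, y in pairs) / len(pairs)


def inter_intra_ratio(stories_by_nationality, n=3):
    """Inter-group similarity divided by intra-group similarity."""
    # Intra-group: pairs of stories written for the same nationality.
    intra = [
        pair
        for stories in stories_by_nationality.values()
        for pair in combinations(stories, 2)
    ]
    # Inter-group: pairs of stories written for different nationalities.
    inter = [
        (x, y)
        for s1, s2 in combinations(stories_by_nationality.values(), 2)
        for x in s1
        for y in s2
    ]
    intra_sim = mean_similarity(intra, n)
    inter_sim = mean_similarity(inter, n)
    return inter_sim / intra_sim if intra_sim else float("nan")


template = "once upon a time there was a child"
shared = {"A": [template, template], "B": [template, template]}
distinct = {
    "A": [template, template],
    "B": ["far away in the hills lived a weaver"] * 2,
}
ratio_shared = inter_intra_ratio(shared)
ratio_distinct = inter_intra_ratio(distinct)
```

Under this stand\-in measure, a ratio near 1 means stories for different nationalities are as similar to each other as stories for the same nationality, which is what a shared template would produce; a ratio near 0 means the groups share little multi\-word content.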

## Appendix F Complete Stereotyping Results

Table 6: Stereotypes\. Top 10 countries with highest stereotype precision and offensiveness as rated by regional raters for cultural markers from GPT 3\.5 Turbo\.

Table 7: Stereotypes\. Top 10 countries with highest stereotype precision and offensiveness as rated by regional raters for cultural markers from Llama 3\.1 8B Instruct\.

Table 8: Stereotypes\. Top 10 countries with highest stereotype precision and offensiveness as rated by regional raters for cultural markers from Llama 3\.3 70B Instruct\.

Table 9: Stereotypes\. Top 10 countries with highest stereotype precision and offensiveness as rated by regional raters for cultural markers from Gemma 3 12B Instruct\.

Tables 6 to 9 show the top 10 countries with highest stereotype precision and offensiveness for all models\. We see patterns similar to those observed in regional results for GPT 4o Mini stories in § 5\.3\. Table [10](https://arxiv.org/html/2606.14626#A6.T10) shows examples of stereotypical attributes found in the cultural markers, as calculated using the SeeGULL dataset\.

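| Model | Nationality | Stereotypes |
| --- | --- | --- |
| GPT 4o Mini | Syria | violent |
| GPT 4o Mini | China | witty, sophisticated |
| GPT 4o Mini | Bangladesh | beggar |
| GPT 4o Mini | Gambia | unreliable |
| GPT 4o Mini | India | mystic, religious, witty, vegetarian, untidy, undisciplined |
| GPT 3\.5 Turbo | Denmark | progressive |
| GPT 3\.5 Turbo | Afghanistan | backward, killer |
| GPT 3\.5 Turbo | Iran | oppressive |
| GPT 3\.5 Turbo | Pakistan | criminal, paranoid, uneducated |
| GPT 3\.5 Turbo | Turkey | ruthless, conservative |
| Llama 3\.1 8B Instruct | Cameroon | tribal, talkative |
| Llama 3\.1 8B Instruct | German | punctual, thorough, blonde |
| Llama 3\.1 8B Instruct | Italy | expressive, temperamental, foodie |
| Llama 3\.1 8B Instruct | Kenya | industrious |
| Llama 3\.1 8B Instruct | Liberia | barbaric, uneducated |
| Llama 3\.3 70B Instruct | Vietnam | smelly, communist |
| Llama 3\.3 70B Instruct | Japan | ninja, courteous, samurai |
| Llama 3\.3 70B Instruct | Nepal | rational |
| Llama 3\.3 70B Instruct | Nigeria | smelly, witch |
| Llama 3\.3 70B Instruct | South Africa | apartheid, unfriendly |
| Gemma 3 12B Instruct | Lebanon | terrorist |
| Gemma 3 12B Instruct | Uganda | untrustworthy |
| Gemma 3 12B Instruct | Britain | proper, aloof |
| Gemma 3 12B Instruct | Venezuela | dumb |
| Gemma 3 12B Instruct | Israel | pushy |

Table 10: Examples of cultural markers that overlap with stereotypes from the SeeGULL dataset\.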
