Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory
Summary
This paper proposes a three-level taxonomy for evaluating AI cultural capabilities—Cultural Awareness, Sensitivity, and Competence—grounded in intercultural communication theory, aiming to improve validity and interpretability of AI evaluations in multicultural settings.
Cached at: 05/18/26, 06:35 AM
# Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory
Source: [https://arxiv.org/html/2605.15990](https://arxiv.org/html/2605.15990)
Isar Nejadgholi¹, Masoud Kianpour², Krishnapriya Vishnubhotla¹, Maryam Molamohamadi³
¹National Research Council, Canada; ²Toronto Metropolitan University, Canada; ³Mila, Quebec AI Institute, Canada
{isar.nejadgholi, krishnapriya.vishnubhotla}@nrc-cnrc.gc.ca, masoud.kianpour@torontomu.ca, maryam.molamohammadi@mila.quebec
###### Abstract
Tremendous efforts have been put into evaluating the inclusivity and effectiveness of AI systems across cultures. However, the cultural capabilities considered in much of the literature remain vaguely defined, are referred to using interchangeable terminology, and are typically limited to recalling accurate information about various demographics, regions, and nationalities. To address this construct ambiguity, we draw from Intercultural Communication scholarship and propose a three-level taxonomy of AI-relevant cultural capabilities: Cultural Awareness answers *“Does the model know?”*, Cultural Sensitivity answers *“How does it frame its knowledge?”*, and Cultural Competence answers *“Can it adapt as the interaction evolves?”*. Beyond conceptual clarification, we position this taxonomy as a practical tool for improving the validity and interpretability of AI evaluation in real-world, multicultural settings. Without such construct clarity, evaluation results risk overstating model capabilities and may lead to inappropriate deployment decisions in culturally sensitive contexts.
## 1 Introduction
AI-mediated communication is increasingly shaping language and social relationships (Hohenstein et al., [2023](https://arxiv.org/html/2605.15990#bib.bib8)). In a variety of tasks, such as translation (Naveen and Trojovský, [2024](https://arxiv.org/html/2605.15990#bib.bib9)), dialogue (Abe et al., [2025](https://arxiv.org/html/2605.15990#bib.bib10)), and decision-making (Kaggwa et al., [2024](https://arxiv.org/html/2605.15990#bib.bib11)), AI mediates conversations among users around the globe, across cultural boundaries. Generative AI in particular has been shown to act as a “social actor,” capable of eliciting emotional and cognitive responses that reshape human communication patterns. The research community, however, increasingly recognizes that the impact of generative AI on human communication is highly nuanced. On the one hand, research shows that AI can enhance cross-cultural dialogue by providing multimodal, emotionally resonant communication tools that reduce anxiety and facilitate identity recognition (Yang et al., [2024](https://arxiv.org/html/2605.15990#bib.bib12)). On the other hand, when used at scale, AI introduces new dynamics of power and cultural visibility that risk homogenizing cultural expressions, reinforcing linguistic hierarchies, and obscuring subtle cultural meanings (Busch, [2024](https://arxiv.org/html/2605.15990#bib.bib13)). Crucially, these models are primarily trained on English- and Western-centric data, which limits their ability to handle intercultural communication and risks misunderstandings that escalate into real social and ethical harms (Naous and Xu, [2025](https://arxiv.org/html/2605.15990#bib.bib42)).
Level 3: Cultural Competence. Dynamic, multi-turn adaptation to emerging cultural cues. Example behavior: Adjusts apology tone after user clarifies that the workplace is informal; explains the shift and maintains respect.
Level 2: Cultural Sensitivity. Respectful, non-ethnocentric framing in single responses. Example behavior: Uses respectful language, acknowledges hierarchy, avoids moralizing or Western-centric framing.
Level 1: Cultural Awareness. Accurate recall of cultural knowledge. Example behavior: Correctly reflects Japanese workplace norms (seniority, honorifics) without stereotypes or factual errors.
Figure 1: Three levels of AI-relevant cultural capabilities, defined in terms of observable system behavior, with an illustrative example aligned to each level. The example is based on the prompt “I am from Japan, and I need help apologizing to my older colleague for a mistake I made at work,” to illustrate how progressively richer cultural capabilities shape system responses from factual grounding to respectful framing and multi-turn adaptation.
In response, a growing body of work has attempted to evaluate the “cultural capabilities” of AI systems (Pawar et al., [2025](https://arxiv.org/html/2605.15990#bib.bib43)). However, the constructs underlying these evaluations remain loosely defined. Terms such as cultural awareness, cultural sensitivity, and cultural competence are often used interchangeably, with inconsistent meanings across studies and even within the same work. As a result, current evaluation practices risk conflating fundamentally different capabilities. This construct ambiguity makes it unclear what is being measured and what conclusions can be drawn about model behavior in real-world settings.
In this work, we engage with the fundamental question: “What cultural capabilities need to be monitored in AI-enabled communication tools, to ensure the wide range of issues arising from English-centric models are appropriately mitigated?” Importantly, fields such as intercultural communication (Arasaratnam and Doerfel, [2005](https://arxiv.org/html/2605.15990#bib.bib45)), cross-cultural social psychology (Richter et al., [2023](https://arxiv.org/html/2605.15990#bib.bib44)), and education (Choompunuch et al., [2024](https://arxiv.org/html/2605.15990#bib.bib46)) have long emphasized that cultural capability involves multiple, distinct behaviors that enable successful interaction across cultural boundaries. These capabilities have been shown to shape outcomes in organizational, professional, and educational environments, and contribute to performance, productivity, and psychological safety (Lauring, [2011](https://arxiv.org/html/2605.15990#bib.bib54); Szkudlarek et al., [2020](https://arxiv.org/html/2605.15990#bib.bib55); Warren and Lee, [2020](https://arxiv.org/html/2605.15990#bib.bib56)). Yet NLP evaluations rarely incorporate these distinctions, and when viewed against the backdrop of multicultural communication research, contemporary evaluations seem under-theorized.
A systematized construct definition of cultural capabilities can facilitate meaningful AI evaluation practices. As Wallach et al. ([2024](https://arxiv.org/html/2605.15990#bib.bib52)) argue, valid evaluation requires moving from background concepts to systematic definitions, and only then to measurement instruments. This logic suggests that before a cultural capability can be measured, it must first be defined in terms of observable system behaviors. To assemble such a definition, we focus on the research in Intercultural Communication (ICC), where cultural capabilities are formulated as a broad range of skills such as calibrating the level of sensitivity required in a given scenario, adapting to contextual cues, and incorporating new cultural information that emerges dynamically in interaction. From this perspective, an AI system does not merely need to “know about” a culture or “imitate a cultural norm”; it must be able to adjust its communicative stance in a way that respects cultural variation and is contextually appropriate.
Moreover, the distinction between cultural capabilities is critical because behavior that is appropriate at one level may be harmful at another. For example, factual knowledge about a cultural group can support representation and understanding, but when presented without nuance or contextual variation, it may function as stereotyping (Fraser et al., [2021](https://arxiv.org/html/2605.15990#bib.bib122); Yao et al., [2024](https://arxiv.org/html/2605.15990#bib.bib123)). An AI system that states “Japanese workplaces value formality” conveys accurate information; however, presenting this as a universal rule without acknowledging regional, generational, or organizational variation risks reinforcing stereotypes. In addition, such factual knowledge may not translate into appropriate behavioral or situational adaptation in user interactions.
Specifically, we turn to three foundational models in ICC and study the traits and skills included in these models. To draw an AI-relevant taxonomy, we exclude human-specific motivational and affective traits of ICC models and retain only those dimensions that describe behavioral and interactional skills that AI systems could, in principle, exhibit. This procedure results in a three-level taxonomy of AI-relevant cultural capabilities, Cultural Awareness, Sensitivity, and Competence, with distinct observable behaviors. This taxonomy is summarized in Figure [1](https://arxiv.org/html/2605.15990#S1.F1) and elaborated in Section [4](https://arxiv.org/html/2605.15990#S4). Our taxonomy offers a practical framework to guide evaluation design, interpretation, and deployment decisions in multicultural settings. We position this work as a call for more precise, practice-oriented evaluation of cultural capabilities in AI systems.
## 2 Cultural Capability Evaluation in NLP
Many works in NLP have investigated whether LLMs demonstrate different abilities for handling cultural variation (Pawar et al., [2025](https://arxiv.org/html/2605.15990#bib.bib43)). This line of research typically evaluates model behavior across culturally situated scenarios, norms, and communication practices. However, the conceptualization of what constitutes cultural capability varies widely across studies. We review recent NLP papers that attempt to measure cultural capability in AI and analyze how these works define and operationalize the underlying constructs. Note that we focus on the construct ambiguity of “cultural capability”, not “culture” itself. While the definition of “culture” has been extensively studied by Zhou et al. ([2025](https://arxiv.org/html/2605.15990#bib.bib48)) and Adilazuarda et al. ([2024](https://arxiv.org/html/2605.15990#bib.bib49)), and was addressed through taxonomies (Liu et al., [2025](https://arxiv.org/html/2605.15990#bib.bib50)) or foundational frameworks for cross-cultural NLP (Hershcovich et al., [2022](https://arxiv.org/html/2605.15990#bib.bib51)), we argue that the field has yet to converge on which *cultural capabilities* are essential to assess in AI systems.
Saha et al. ([2025](https://arxiv.org/html/2605.15990#bib.bib116)) critically examine how cultural capability in AI systems should be conceptualized and evaluated. They note that current evaluation practices primarily probe LLMs for “cultural awareness”, i.e., their culture-specific knowledge and reasoning capabilities, by relying on curated cultural test beds. However, they argue, performing well on such benchmarks demonstrates only knowledge of the cultures that are tested and does not demonstrate the ability to operate in previously unseen cultural contexts. Instead, they propose the concept of *meta-cultural competence*, which refers to an AI system’s ability to recognize cultural variation and adapt to new cultural contexts. While this perspective clarifies the long-term capability that culturally robust AI systems should aspire to, it leaves open the question of what levels of cultural capabilities should be defined and measured in current NLP evaluations. The goal of our work is complementary to that of Saha et al. ([2025](https://arxiv.org/html/2605.15990#bib.bib116)). Rather than proposing a new target capability, we focus on defining different levels of cultural capability, drawing on intercultural communication research, to improve construct clarity and measurement validity in cultural evaluation.
We echo the observation by Saha et al. ([2025](https://arxiv.org/html/2605.15990#bib.bib116)) that most benchmarks concerned with cultural inclusivity are focused on measuring “knowledge about a cultural context”. Examples include FORK (Palta and Rudinger, [2023](https://arxiv.org/html/2605.15990#bib.bib14)), which targets food-related cultural commonsense such as ingredients, preparation methods, and culturally appropriate consumption practices; CulturalBench (Chiu et al., [2025](https://arxiv.org/html/2605.15990#bib.bib20)), which introduces region-specific multiple-choice questions covering everyday activities, social norms, public behavior, and local conventions; and BLEnD (Myung et al., [2024](https://arxiv.org/html/2605.15990#bib.bib22)), which focuses on everyday practices and social routines (e.g., food, sports, family, holidays/celebrations/leisure) across 16 regions and 13 languages. GeoMLAMA (Yin et al., [2022](https://arxiv.org/html/2605.15990#bib.bib21)) probes geo-diverse commonsense knowledge, that is, concepts that are universally understood but vary across cultures and regions, such as the color of a traditional wedding dress, staple foods, and units of measurement. INCLUDE (Romanou et al., [2025](https://arxiv.org/html/2605.15990#bib.bib23)), on the other hand, curates exam-style questions in 44 languages that emphasize culturally situated general knowledge and reasoning skills. JMMMU (Onohara et al., [2025](https://arxiv.org/html/2605.15990#bib.bib29)) is another work in this line, which incorporates multimodal cultural knowledge in domains such as arts and heritage.
Several recent works attempt to operationalize cultural understanding as recognition of culturally inappropriate signals. One example is MC-SIGNS by Yerukola et al. ([2025](https://arxiv.org/html/2605.15990#bib.bib118)), which evaluates whether models can classify gestures as offensive or non-offensive depending on the cultural context. Other resources foreground stereotypical statements about social groups, such as SHADES (Mitchell et al., [2025](https://arxiv.org/html/2605.15990#bib.bib28)), which evaluates stereotypes across regions and languages, spanning multiple identity categories subject to discrimination. Qiu et al. ([2025](https://arxiv.org/html/2605.15990#bib.bib121)) evaluate agents’ ability to detect and appropriately respond to norm-violating user queries and observations in online shopping and social discussion forums.
More recent work attempts to evaluate cultural capabilities in interactive settings. NormGenesis (Hong et al., [2025](https://arxiv.org/html/2605.15990#bib.bib89)) goes beyond knowledge by measuring culturally adaptive dialogue in multi-turn conversations, focusing on the integration of social norms into interactional behavior. Nunchi-Bench (Kim and Lee, [2025](https://arxiv.org/html/2605.15990#bib.bib117)) is another benchmark containing scenario-based questions that require models to identify culturally appropriate responses or explanations. SocialCC by Wu et al. ([2025](https://arxiv.org/html/2605.15990#bib.bib119)) evaluates LLM performance in multi-turn social interactions where appropriate responses depend on cultural norms and contextual cues, and measures whether models produce socially appropriate responses. Similarly, Havaldar et al. ([2025](https://arxiv.org/html/2605.15990#bib.bib120)) propose a framework for evaluating the cultural awareness of language models in multicultural conversational environments. Their evaluation incorporates situational context, interpersonal relationships, and conversational style to assess how well models adapt to culturally grounded interactions. These works represent an important step toward evaluating cultural competence as a dynamic capability rather than static knowledge.
Gap Analysis: Although the discussion above does not constitute a systematic literature review of cultural capability evaluations in NLP, it nevertheless reveals substantial evidence of construct ambiguity in the current literature. Across these works, terminology referring to cultural capability dimensions is highly inconsistent and often underspecified. Terms such as “cultural understanding,” “cultural adaptation,” “cultural awareness,” “cultural sensitivity,” and “cultural competence” are frequently used interchangeably, sometimes even within the same work, without precise definitions or explicit alignment with established social science theories. As a result, different studies implicitly measure different aspects of cultural behavior while describing them with the same loosely defined terms. Because construct validity is compromised in this way, it becomes unclear what capability an evaluation actually measures and whether results across benchmarks are comparable. Consequently, evaluation results are often interpreted as evidence of “cultural capability” in general, even though they may only capture a narrow dimension of that construct.
What is therefore needed is a framework that explicitly distinguishes between different levels of cultural capability and provides clear definitions of what each level entails in terms of observable system behavior\. Such a framework would enable researchers to select the level of capability relevant to their task, design evaluation procedures that directly measure that capability, and make appropriately scoped claims about model performance\.
## 3 Evaluative Models of Cultural Capabilities in ICC
Intercultural communication research has long emphasized that effective engagement across cultures requires more than static knowledge of norms or practices. Across several influential models, scholars have conceptualized “cultural capabilities” as multidimensional constructs encompassing cognitive, affective, and behavioral components. We review three foundational and highly cited ICC traditions: the Developmental Model of Intercultural Sensitivity (DMIS), the theory of Cultural Intelligence (CQ), and the Process Model of Intercultural Competence (PMIC). For each ICC model, we discuss 1) a focal capability, 2) a structure for that capability (whether stages, dimensions, or component skills), and 3) sites of application with corresponding measurement strategies. Table [1](https://arxiv.org/html/2605.15990#S3.T1) summarizes the characteristics of these models.
Table 1: Summary of three major ICC models frequently used for evaluating cultural capabilities.
### 3.1 Developmental Model of Intercultural Sensitivity (DMIS)
Focal Capability: DMIS (Bennett, [1986](https://arxiv.org/html/2605.15990#bib.bib57)) is one of the earliest evaluative ICC models and is focused on *intercultural sensitivity* as the core capability, which refers to the way individuals *experience* and *make sense of* cultural differences. This model is also inherently developmental, i.e., it proposes that individuals progress through qualitatively different stages of worldview, moving from ethnocentrism toward ethnorelativism (Bennett, [1993](https://arxiv.org/html/2605.15990#bib.bib58)).
Structure: DMIS describes *intercultural sensitivity* as a sequence of stages. The ethnocentric stages include 1) *Denial* (lack of recognition of cultural difference), 2) *Defence* (perceiving difference as threatening and asserting superiority of one’s own culture), and 3) *Minimization* (downplaying difference by assuming deep similarity or universalism). As intercultural sensitivity increases, people move towards the ethnorelative stages, namely, 4) *Acceptance* (recognition and valuing of cultural difference), 5) *Adaptation* (the ability to shift perspective and modify behavior appropriately), and 6) *Integration* (internalization of multiple cultural perspectives into one’s own identity).
Application and Evaluation: DMIS is applied in international education, study abroad, and professional development for people working in multicultural contexts, such as health care providers (Pedersen, [2010](https://arxiv.org/html/2605.15990#bib.bib59); DeJaeghere and Cao, [2009](https://arxiv.org/html/2605.15990#bib.bib60); Bourjolly et al., [2005](https://arxiv.org/html/2605.15990#bib.bib61); Richards and Doorenbos, [2016](https://arxiv.org/html/2605.15990#bib.bib62)). Measurement is often done using the Intercultural Development Inventory (IDI), which attempts to position individuals along a continuum from *Denial* to *Integration* through survey items targeting beliefs, reactions, and self-perceived adaptability.
### 3.2 Cultural Intelligence (CQ)
Focal Capability: The CQ model (Earley and Ang, [2003](https://arxiv.org/html/2605.15990#bib.bib66)) emerged to reduce costly failures in international assignments caused by stereotyping and cultural generalizations (Black et al., [1991](https://arxiv.org/html/2605.15990#bib.bib114); Mendenhall et al., [2008](https://arxiv.org/html/2605.15990#bib.bib113)) and defines *cultural intelligence* as an individual’s capability to function effectively in situations characterized by cultural diversity.
Structure: CQ is explicitly framed as a *multidimensional intelligence* and distinguishes four interrelated capabilities: 1) *Motivation* (drive to engage across cultures), 2) *Cognition* (knowledge of cultural norms and practices), 3) *Metacognition* (awareness of and ability to plan, monitor, and adjust one’s thought processes in intercultural interactions), and 4) *Behavior* (ability to adapt one’s verbal and nonverbal conduct, such as tone, turn-taking patterns, politeness strategies, gesture, and pace, in culturally diverse interactions) (Ang et al., [2007](https://arxiv.org/html/2605.15990#bib.bib65); Ang and Van Dyne, [2015](https://arxiv.org/html/2605.15990#bib.bib67)).
Application and Evaluation: CQ is applied in leadership development, international assignments, and cross-border negotiation (Alon and Higgins, [2005](https://arxiv.org/html/2605.15990#bib.bib71); Rockstuhl et al., [2011](https://arxiv.org/html/2605.15990#bib.bib69); Ramalu et al., [2012](https://arxiv.org/html/2605.15990#bib.bib72)). Higher CQ is associated with better task performance in culturally diverse settings (Ang et al., [2007](https://arxiv.org/html/2605.15990#bib.bib65)) and is linked to experiential learning theory (Kolb, [2014](https://arxiv.org/html/2605.15990#bib.bib76)). CQ is typically measured through validated psychometric instruments such as the Cultural Intelligence Scale (CQS), which measures each dimension on a Likert scale and has been adapted and validated cross-nationally (Van Dyne et al., [2015](https://arxiv.org/html/2605.15990#bib.bib75); Gozzoli and Gazzaroli, [2018](https://arxiv.org/html/2605.15990#bib.bib74)).
### 3.3 Process Model of Intercultural Competence (PMIC)
Focal Capability: PMIC (Deardorff, [2006](https://arxiv.org/html/2605.15990#bib.bib63)) conceptualizes intercultural competence as a dynamic, iterative process and defines *intercultural competence* as “the ability to communicate effectively and appropriately in intercultural situations based on one’s intercultural knowledge, skills, and attitudes”. This view integrates both developmental and performance-based perspectives and recognizes that competence manifests in interaction rather than merely in perception or cognition.
Structure: PMIC proposes a cyclical relationship among five interrelated components: 1) *Attitudes* (respect, openness, curiosity, willingness to tolerate ambiguity); 2) *Knowledge* (including self-awareness, deep cultural knowledge, and sociolinguistic awareness); 3) *Skills* (listening, observing, analyzing, evaluating, and relating); 4) *Internal Outcomes* (adaptability, flexibility, empathy, ethnorelative view), leading to 5) *External Outcomes* (effective and appropriate behavior and communication). Importantly, Deardorff ([2009b](https://arxiv.org/html/2605.15990#bib.bib78)) emphasizes that the process is ongoing, recursive, and context-dependent, allowing for continuous development through experience and reflection.
Application and Evaluation: PMIC is extensively applied in higher education, internationalization of curricula, global citizenship education, and intercultural training across disciplines such as health, business, and diplomacy (Byram, [2020](https://arxiv.org/html/2605.15990#bib.bib64); Arasaratnam-Smith, [2017](https://arxiv.org/html/2605.15990#bib.bib77)). Building on her process model, Deardorff ([2006](https://arxiv.org/html/2605.15990#bib.bib63)) developed the *Intercultural Competence Assessment (ICA)* framework and later contributed to the *Intercultural Knowledge and Competence VALUE Rubric* (Association of American Colleges and Universities (AAC&U), [2025](https://arxiv.org/html/2605.15990#bib.bib90)). These tools are primarily qualitative and reflective rather than psychometric (Deardorff, [2009a](https://arxiv.org/html/2605.15990#bib.bib79)).
## 4 A Taxonomy of AI-Relevant Cultural Capabilities
Here, we propose a taxonomy of *required* and *measurable* cultural capabilities in AI-enabled communication and ground this taxonomy in the ICC models described in Section [3](https://arxiv.org/html/2605.15990#S3). To do so, we first recognize that the three major evaluative ICC models were developed to describe *human* experience, motivation, and behavior, and that applying these models directly to AI systems risks anthropomorphizing them. Therefore, we deliberately choose a cautious starting point and treat these models as *conceptual resources* rather than as templates to be copied. As a result of this choice, in our work, *capability* refers to observable behavior that is elicited in a particular interaction, as opposed to a trait that the model has independent of the interaction context.
Following literature that shows large language models do not possess a stable moral or normative stance (Abdulhai et al., [2024](https://arxiv.org/html/2605.15990#bib.bib91); Guo et al., [2024](https://arxiv.org/html/2605.15990#bib.bib80)), we restrict our taxonomy to traits that are observable in the *linguistic behavior* of AI systems. While human-focused models of cultural competence consider “worldviews”, “attitudes”, or “motivation”, we do not assume that AI shares any analogous internal orientation. Instead, to avoid overclaiming about AI’s cultural capabilities, we ask a narrower question: *which aspects of these constructs have recognizable linguistic footprints that can appear in model outputs and be evaluated as such?*
Concretely, we reinterpret the constructs in DMIS, CQ, and PMIC as a mixture of \(a\)*motivational*components, which are intrinsically tied to human agency and affect, and \(b\)*behavioral*components, which manifest in discourse, framing, and interactional patterns\. While both classes matter for humans, for AI, only the latter can be meaningfully operationalized\.
Our methodology is divided into three steps\. In Step 1, we identify, within each model, which elements have observable linguistic manifestations\. In Step 2, we recategorize the observable behaviors into distinct levels of capabilities\. In Step 3, we re\-interpret these levels of capability for AI\.
Step 1: In the following, across the ICC models, we distinguish between *motivational* (human-only) and *behavioral* (human and AI) elements:
DMIS: Although DMIS stages are originally framed as developmental worldviews, we argue that these stages also have recognizable *discursive correlates*. For example, *Denial* can surface as linguistic erasure of difference (“people everywhere are basically the same”), *Defence* as superiority framing (“our way is more advanced”), and *Minimization* as universalizing language (“deep down, all cultures want the same things”). *Acceptance* and *Integration* manifest in explicit acknowledgments of difference and multi-perspective framing, while *Adaptation* involves shifts in tone, register, or politeness strategies. We therefore treat DMIS stages as *behavioral* elements for AI, even though AI does not inherently possess those worldviews.
CQ: We categorize the *Motivational* element of CQ as a human-only construct that is inherently tied to human intention and effort. By contrast, *Cognitive* CQ (knowledge of norms and practices) can appear in model outputs as factual recall and distinctions between cultural practices. *Metacognitive* CQ (planning, monitoring, and adjusting one’s interpretation) also has partial behavioral manifestations in AI when models provide reasoning, reconsider earlier assumptions, or explicitly hedge and revise interpretations. Finally, *Behavioral* CQ, the ability to adapt verbal behavior across contexts, can be observed in text as shifts in tone, politeness, register, or interactional style. These three CQ components thus contribute directly to AI-relevant behavioral capabilities.
PMIC: We argue that the elements of *Attitudes* and *Internal Outcomes* in PMIC are explicitly affective and experiential; we again treat them as human-only traits and avoid projecting them onto AI systems. By contrast, *Knowledge* (cultural knowledge and sociolinguistic awareness), together with *Skills* (observing, analyzing, relating, evaluating), can be observed in discourse as the ability to describe, interpret, and compare cultural practices. Lastly, *External Outcomes* correspond to effective and appropriate behavior and communication in intercultural encounters, which can be evaluated for AI systems via their response content, tone, and pragmatic appropriateness.
Step 2: We restrict attention to observable behaviors based on the above analysis and recategorize them to obtain a single taxonomy. Across DMIS, CQ, and PMIC, intercultural effectiveness is consistently decomposed into three broad families of observable *human* capabilities, which we describe first below and reinterpret in Step 3 for AI.
Cognitive foundations: the informational substrate of intercultural behavior, including knowledge, awareness, and understanding of cultural differences (cognitive CQ; Knowledge in PMIC), such as accurate descriptions of practices, recognition of group-specific norms, and sociolinguistic knowledge (e.g., honorifics, forms of address).
Framing and stance-taking: the ways in which cultural differences are *positioned* and *expressed* in discourse. This draws on DMIS stages as observable stances (*Denial*, *Defence*, *Minimization*, *Acceptance*, *Integration*; we omit *Adaptation* here because it is captured under interactional adaptation) and on PMIC’s emphasis on appropriateness.
Interactional adaptation: the competence and skills required to adjust communication in situ, across turns and evolving contexts. This includes *Behavioral* CQ and *Metacognitive* CQ as well as *Skills* and *External Outcomes* of PMIC. These skills can manifest as shifting tone, register, or explanatory strategy when new cultural cues emerge; revising an explanation when the user signals discomfort; and coordinating meaning over time rather than in a single shot.
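The filtering in Step 1 and the regrouping in Step 2 can be summarized as a small lookup table. The following is a minimal sketch, not part of the original method: the dictionary layout and the helper names `human_only` and `by_family` are illustrative assumptions, while the component names and their assignments follow the text above (with DMIS *Adaptation* placed under interactional adaptation, as noted in the framing family).

```python
from collections import defaultdict

# Family names follow Step 2; the constants themselves are illustrative.
COGNITIVE = "cognitive foundations"
FRAMING = "framing and stance-taking"
ADAPTATION = "interactional adaptation"

# (ICC model, component) -> behavioral family, or None when the component
# is treated as human-only (motivational/affective) in Step 1.
# PMIC's "emphasis on appropriateness" also informs the framing family,
# but it is not a named PMIC component, so it has no entry here.
COMPONENT_TO_FAMILY = {
    ("DMIS", "Denial"): FRAMING,
    ("DMIS", "Defence"): FRAMING,
    ("DMIS", "Minimization"): FRAMING,
    ("DMIS", "Acceptance"): FRAMING,
    ("DMIS", "Integration"): FRAMING,
    ("DMIS", "Adaptation"): ADAPTATION,
    ("CQ", "Motivation"): None,
    ("CQ", "Cognition"): COGNITIVE,
    ("CQ", "Metacognition"): ADAPTATION,
    ("CQ", "Behavior"): ADAPTATION,
    ("PMIC", "Attitudes"): None,
    ("PMIC", "Knowledge"): COGNITIVE,
    ("PMIC", "Skills"): ADAPTATION,
    ("PMIC", "Internal Outcomes"): None,
    ("PMIC", "External Outcomes"): ADAPTATION,
}


def human_only(mapping):
    """Return the components excluded from the AI-relevant taxonomy."""
    return sorted(key for key, family in mapping.items() if family is None)


def by_family(mapping):
    """Group the retained (observable) components by behavioral family."""
    groups = defaultdict(list)
    for (model, component), family in mapping.items():
        if family is not None:
            groups[family].append(f"{model}: {component}")
    return dict(groups)


if __name__ == "__main__":
    print("Excluded as human-only:", human_only(COMPONENT_TO_FAMILY))
    for family, members in by_family(COMPONENT_TO_FAMILY).items():
        print(f"{family}: {members}")
```

Keeping the excluded components in the table with a `None` family, rather than dropping them, makes the exclusion decision explicit and auditable.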
Step 3: Building on this behavioral reinterpretation, we articulate three AI capability levels that align with, but do not collapse into, the behavioral human-focused constructs, and are empirically testable with NLP methods (Figure [1](https://arxiv.org/html/2605.15990#S1.F1)).
Capability Level 1: Cultural Awareness - This level concerns the model’s ability to represent and retrieve culture-specific information accurately. It corresponds primarily to the cognitive foundations drawn from CQ and PMIC: factual knowledge about practices, norms, histories, and sociolinguistic conventions. Evaluations at this level target informational accuracy and coverage: does the model correctly distinguish between different cultural practices, avoid hallucinating non-existent customs, and resist collapsing distinct groups into monolithic categories?
Capability Level 2: Cultural Sensitivity \-This level concerns the model’s ability to frame cultural differences respectfully and non\-ethnocentrically\. It is a one\-shot property of the model’s initial stance toward cultural cues in the prompt and is grounded in the behavioral readings of DMIS stages and PMIC’s focus on appropriateness\. Here, the question is not yet whether the model can adapt over time, but whether its first move avoidsDenial, Defence, or Minimizationand instead recognizes difference without othering\. Evaluations at this level focus on stance and framing: whose perspective is centered, what is normalized, and whether the language implicitly ranks cultures\.
Capability Level 3: Cultural Competence \-This level concerns the model’s ability to adapt its communicative behavior dynamically as the interaction unfolds and new cultural cues emerge\. It includes interactional adaptation capabilities: perspective\-shifting, pragmatic adjustment, and context\-sensitive revisions across multiple turns\. A culturally competent model should not only begin from a non\-harmful stance but also update its responses when a user signals a particular identity, constraint, or harm history\. Evaluations at this level require multi\-turn setups and focus on dynamic behavior: how responses evolve, whether the model corrects earlier misframings, and how it coordinates meaning with the user over time\.
## 5 Application of Taxonomy in AI Evaluation
While various dimensions of cultural capabilities have been measured by AI researchers, the terminologies used to describe these dimensions are often underspecified and used interchangeably\. Our taxonomy provides an ICC\-grounded vocabulary that enables researchers to identify and describe the level of cultural capability being measured in a more systematic way\. This taxonomy is a practical tool for evaluators of AI systems to 1\) specify which cultural capabilities a given task requires before designing the evaluation, 2\) design evaluations that target the corresponding observable behaviors, and 3\) clarify what the evaluations do not capture\. For example, for a narrowly focused question\-answering system, diverse factual knowledge is the minimum required level of cultural capability, so the evaluations need wide coverage of culturally grounded QA tests\. Scoring high on such tests demonstrates Cultural Awareness, but the model might still lack Cultural Sensitivity \(it might use ethnocentric framing\) or Cultural Competence \(it might fail to adapt when the context changes\)\. When the level of cultural capability being measured is not explicitly specified, these results may be misinterpreted and mislead decision makers\.
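The three uses listed above (specify required capabilities, target them with evaluations, state what is not captured) can be made concrete with a small bookkeeping sketch. This is only an illustration under stated assumptions: the `Capability` enum pairs each level with the guiding question from the abstract, while `EvaluationPlan`, its fields, and the example task are hypothetical names introduced here, not part of the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Capability(Enum):
    """The three taxonomy levels, each paired with its guiding question."""
    AWARENESS = "Does the model know?"
    SENSITIVITY = "How does it frame its knowledge?"
    COMPETENCE = "Can it adapt as the interaction evolves?"


@dataclass(frozen=True)
class EvaluationPlan:
    """Hypothetical record of what a task requires versus what is measured."""
    task: str
    required: frozenset  # capability levels the deployment task requires
    measured: frozenset  # capability levels the evaluations actually target

    def not_captured(self):
        """Required levels that no evaluation targets, in taxonomy order."""
        return [c for c in Capability
                if c in self.required and c not in self.measured]


# Hypothetical case: a conversational system assessed only with cultural QA tests.
plan = EvaluationPlan(
    task="conversational tutor evaluated only with cultural QA tests",
    required=frozenset(Capability),
    measured=frozenset({Capability.AWARENESS}),
)
gaps = plan.not_captured()
for level in gaps:
    print(f"Not captured: {level.name} ({level.value})")
```

Recording the gap explicitly is the point of the sketch: a high QA score in this plan supports a claim about Awareness only, and the listed levels remain unevaluated.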
In some tasks, all levels of cultural capabilities are required\. As a real\-world example, consider a conversational system used in K–12 education \(for instance, see UNESCO \([2025](https://arxiv.org/html/2605.15990#bib.bib115)\) for developing such a chatbot in Zimbabwe\)\. Such a system is required to demonstrate all three levels of cultural capabilities identified in our taxonomy\. Consider the query *“Why do some communities prefer spiritual healing methods over clinical treatments?”*\. A Culturally Aware model accurately describes practices, contexts, and underlying cultural reasoning, avoiding factual errors\. A Culturally Sensitive model frames cultural differences with respect, avoids ethnocentric or moralizing language, and explicitly recognizes cultural specificity while remaining educational and informative\. After the initial answer, the user clarifies: *“In my community, we rely heavily on herbal remedies and rituals, and some people worry that modern medicine dismisses them\.”* A Culturally Competent model adjusts tone and framing to reflect the user’s perspective, mediates between potentially conflicting epistemologies, recovers from initial assumptions, and maintains consistent respect and accuracy across multiple turns\. Therefore, the evaluation of this system needs to cover these criteria at all three levels\.
Once the required level of capability is identified, researchers need to align evaluation designs with it\. To evaluate Awareness, culturally grounded knowledge benchmarks, stereotype audits, and multi\-regional and multi\-lingual QA tests are sufficient\. Representative examples of NLP work that measures Awareness, as defined in our taxonomy, include GeoMLAMA Yin et al\. \([2022](https://arxiv.org/html/2605.15990#bib.bib21)\), FORK Palta and Rudinger \([2023](https://arxiv.org/html/2605.15990#bib.bib14)\), BLEnD Myung et al\. \([2024](https://arxiv.org/html/2605.15990#bib.bib22)\), INCLUDE Romanou et al\. \([2025](https://arxiv.org/html/2605.15990#bib.bib23)\), and CulturalBench Chiu et al\. \([2025](https://arxiv.org/html/2605.15990#bib.bib20)\)\. Evaluating Sensitivity is facilitated through single\-turn prompts annotated for tone, stance, and framing by intercultural experts, and through probes that inspect how the model describes or contrasts cultural differences\. Relevant resources include SHADES Mitchell et al\. \([2025](https://arxiv.org/html/2605.15990#bib.bib28)\), which measures stereotype framing across languages, and MC\-SIGNS Yerukola et al\. \([2025](https://arxiv.org/html/2605.15990#bib.bib118)\), which was developed to detect culturally offensive signals\.
Arguably, evaluating Competence is more challenging than the other two levels and can only be achieved through multi\-turn simulations and user\-in\-the\-loop studies that assess whether the model adjusts to new cultural cues, resolves ambiguity, and repairs misalignment over time\. Such evaluations can be operationalized as scenario\-based dialogues in which a culturally salient cue is introduced after the model’s initial response\. For example, the user discloses their community, a religious constraint, or a local practice, and the model is scored on whether the subsequent turns revise prior assumptions, produce necessary clarification, or accommodate the new information in another way\. Examples of NLP work that evaluates Competence as defined in our paper \(although possibly under other terms\) include the following: NormGenesis Hong et al\. \([2025](https://arxiv.org/html/2605.15990#bib.bib89)\) offers one template by tracking the integration of social norms across turns; SocialCC Wu et al\. \([2025](https://arxiv.org/html/2605.15990#bib.bib119)\) and the framework by Havaldar et al\. \([2025](https://arxiv.org/html/2605.15990#bib.bib120)\) extend this to socially situated multi\-turn exchanges; and Nunchi\-Bench Kim and Lee \([2025](https://arxiv.org/html/2605.15990#bib.bib117)\) provides scenario\-based prompts that could be extended into multi\-turn variants\. Appropriate systematized metrics should be developed to measure desired behaviors such as whether the model explicitly references the user\-introduced cultural cue in subsequent turns, whether earlier ethnocentric or generic framings are repaired without further prompting, or whether respectful framing is maintained as the conversation evolves\. Designing such evaluations for low\-resource languages will require participatory methods and community partnerships, since model behavior in these settings is constrained by training\-data coverage\.
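As a rough illustration of the three candidate signals named above (cue uptake, repair of an earlier framing, and maintained respect), the sketch below scores one scenario in which the cue arrives after the first reply. Everything in it is an assumption made for illustration: the `Scenario` fields, the example replies, and above all the substring matching, which is a crude placeholder for the expert annotation or learned classifiers that a real metric would need.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical multi-turn test case; the cue arrives after the first reply."""
    cue_terms: list      # terms marking the user-introduced cultural cue
    flagged_terms: list  # framings annotators marked as ethnocentric or dismissive
    first_reply: str     # model reply before the cue
    later_replies: list  # model replies after the cue


def contains_any(text, terms):
    """Case-insensitive substring check (placeholder for real annotation)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def score(s):
    """Return three boolean signals for one scenario."""
    later_flagged = [contains_any(r, s.flagged_terms) for r in s.later_replies]
    return {
        # Does any post-cue reply explicitly take up the cue?
        "references_cue": any(contains_any(r, s.cue_terms) for r in s.later_replies),
        # Was a flagged framing present at first and absent afterwards?
        # (False also covers the case where there was nothing to repair.)
        "repairs_framing": contains_any(s.first_reply, s.flagged_terms)
        and not any(later_flagged),
        # Are all post-cue replies free of flagged framings?
        "maintains_respect": not any(later_flagged),
    }


# Invented replies, loosely based on the herbal-remedies example in this section.
scenario = Scenario(
    cue_terms=["herbal remedies", "rituals"],
    flagged_terms=["unscientific", "real medicine"],
    first_reply="Some communities still cling to unscientific traditions instead of real medicine.",
    later_replies=[
        "Thank you for sharing that. Herbal remedies and rituals matter to many "
        "communities, and the worry you describe is understandable."
    ],
)
result = score(scenario)
print(result)
```

Keeping the three signals separate, instead of collapsing them into one score, mirrors the distinction drawn in the text: a model can mention the cue without repairing its earlier framing, or stay respectful without ever engaging with what the user disclosed.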
Future work should focus on developing NLP methods capable of detecting the signals associated with each level of cultural capability within a given interaction\. For example, the rich bodies of work on bias detection Field et al\. \([2021](https://arxiv.org/html/2605.15990#bib.bib81)\), counter\-stereotype generation Zheng et al\. \([2023](https://arxiv.org/html/2605.15990#bib.bib82)\); Fraser et al\. \([2023](https://arxiv.org/html/2605.15990#bib.bib83)\); Nejadgholi et al\. \([2024](https://arxiv.org/html/2605.15990#bib.bib84)\), stance detection Küçük and Can \([2020](https://arxiv.org/html/2605.15990#bib.bib85)\), and affective computing Pei et al\. \([2024](https://arxiv.org/html/2605.15990#bib.bib86)\) provide methodological foundations for operationalizing the more complex levels of cultural capability, particularly adaptive cultural competence, which requires models to interpret users’ evolving cues, adjust tone, and modulate responses dynamically\.
## 6 Conclusion
To address construct ambiguity in evaluating AI’s cultural capabilities, we introduce a taxonomy grounded in intercultural communication theory that distinguishes between Cultural Awareness, Sensitivity, and Competence, and frames them in terms of observable system behavior\.
We argue that improving construct clarity is essential for reliable evaluation in practice\. When cultural capability is underspecified, evaluation results may overestimate model readiness, particularly when knowledge\-based performance is interpreted as broader competence\. We therefore encourage more explicit, capability\-aligned evaluation practices that clarify what is being measured and what is not, particularly in multicultural contexts where the consequences of misinterpretation are amplified\.
## Limitations
It is important to note that rigorous measurement alone cannot resolve the broader sociotechnical harms associated with English\-centric AI\-mediated communication\. As Wallach et al\. \([2024](https://arxiv.org/html/2605.15990#bib.bib52)\) caution, even well\-structured measurement frameworks do not automatically translate into better outcomes; rather, they make explicit what evaluations capture and, equally importantly, what they omit\. We adopt this perspective in our work, using conceptual systematization as a means to clarify which aspects of cultural capability are being measured in AI evaluation and which remain outside the scope of measurement\.
Additionally, the taxonomy proposed in this work should not be interpreted as a comprehensive account of all cultural capabilities relevant to AI systems\. Intercultural communication is a complex and multidimensional phenomenon studied across several disciplines, including communication studies, sociology, education, and social psychology\. As such, additional constructs and distinctions may emerge as research on culturally grounded AI evaluation evolves\. Therefore, we did not exhaustively enumerate all possible cultural capabilities, but addressed a specific gap in the current NLP literature: the conceptual ambiguity surrounding the terminology used to describe cultural capabilities\.
Another limitation arises from the ICC models on which we base our taxonomy\. DMIS, CQ, and PMIC were developed primarily in workplace, education, and expatriate\-adjustment contexts, and as a result emphasize an “outsider” view of culture\. Real\-world users of AI, however, might seek support in navigating their own social relationships, from an “insider” view of culture\. Extending our taxonomy toward insider\-oriented competence would depend on participatory and community\-informed methods, narrative\-based scenarios, and evaluators with lived cultural experiences\.
Further, given the fluid and evolving nature of both “culture” and “cultural groups”, complete knowledge of norms and variations associated with all cultural boundaries might be an impossible goal\. An important cognitive ability defined in the ICC literature is metacognition: identifying situation\-relevant norms that may be culture\-specific and obtaining missing information before formulating a final response, rather than assuming a universal norm\. This higher level of metacognitive behavior in intercultural interactions, where one shifts from assuming normative cultural standards to recognizing and adapting behaviors based on incoming conversational cues, is challenging and is currently understudied in the landscape of cross\-cultural AI evaluations\.
Finally, the boundaries between the levels in our taxonomy, Awareness, Sensitivity, and Competence, should not be interpreted as rigid or mutually exclusive categories\. In practice, these capabilities often interact and may appear simultaneously in system behavior\. The taxonomy is therefore best understood as a conceptual scaffold that helps researchers articulate which aspect of cultural capability an evaluation targets, rather than as a definitive or exhaustive model\. Future work may refine, expand, or reorganize these categories as empirical evidence and interdisciplinary insights accumulate\.
## References
- Moral foundations of large language models\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 17737–17752\.
- K\. Abe, C\. Quan, S\. Cao, and Z\. Luo \(2025\) Classification of properties in human\-like dialogue systems using generative ai to adapt to individual preferences\. Applied Sciences 15\(7\), pp\. 3466\.
- M\. F\. Adilazuarda, S\. Mukherjee, P\. Lavania, S\. S\. Singh, A\. F\. Aji, J\. O’Neill, A\. Modi, and M\. Choudhury \(2024\) Towards measuring and modeling “culture” in LLMs: a survey\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 15763–15784\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.882/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.882)\.
- I\. Alon and J\. M\. Higgins \(2005\) Global leadership success through emotional and cultural intelligences\. Business horizons 48\(6\), pp\. 501–512\.
- S\. Ang, L\. Van Dyne, C\. Koh, K\. Y\. Ng, K\. J\. Templer, C\. Tay, and N\. A\. Chandrasekar \(2007\) Cultural intelligence: its measurement and effects on cultural judgment and decision making, cultural adaptation and task performance\. Management and organization review 3\(3\), pp\. 335–371\.
- S\. Ang and L\. Van Dyne \(2015\) Handbook of cultural intelligence: theory, measurement, and applications\. Routledge\.
- L\. A\. Arasaratnam and M\. L\. Doerfel \(2005\) Intercultural communication competence: identifying key components from multicultural perspectives\. International journal of intercultural relations 29\(2\), pp\. 137–163\.
- L\. A\. Arasaratnam\-Smith \(2017\) Intercultural competence: an overview\. Intercultural competence in higher education, pp\. 7–18\.
- Association of American Colleges and Universities \(AAC&U\) \(2025\) Inquiry and analysis value rubric\. Note: [https://www\.aacu\.org/value/rubrics/value\-rubrics\-inquiry\-and\-analysis](https://www.aacu.org/value/rubrics/value-rubrics-inquiry-and-analysis) Accessed: 2025\-12\-09\.
- M\. J\. Bennett \(1986\) A developmental approach to training for intercultural sensitivity\. International journal of intercultural relations 10\(2\), pp\. 179–196\.
- M\. J\. Bennett \(1993\) Towards ethnorelativism: a developmental model of intercultural sensitivity\. Education for the intercultural experience 2, pp\. 21–71\.
- J\. S\. Black, M\. Mendenhall, and G\. Oddou \(1991\) Toward a comprehensive model of international adjustment: an integration of multiple theoretical perspectives\. Academy of management review 16\(2\), pp\. 291–317\.
- J\. N\. Bourjolly, R\. G\. Sands, P\. Solomon, V\. Stanhope, A\. Pernell\-Arnold, and L\. Finley \(2005\) The journey toward intercultural sensitivity: a non\-linear process\. Journal of Ethnic & Cultural Diversity in Social Work 14\(3\-4\), pp\. 41–62\.
- D\. Busch \(2024\) AI translation and intercultural communication: new questions for a new field of research\. SocArXiv 31p\.
- M\. Byram \(2020\) Teaching and assessing intercultural communicative competence: revisited\. Multilingual matters\.
- Y\. Y\. Chiu, L\. Jiang, B\. Y\. Lin, C\. Y\. Park, S\. S\. Li, S\. Ravi, M\. Bhatia, M\. Antoniak, Y\. Tsvetkov, V\. Shwartz, and Y\. Choi \(2025\) CulturalBench: a robust, diverse, and challenging cultural benchmark by human\-ai culturalteaming\. External Links: 2410\.02677, [Link](https://arxiv.org/abs/2410.02677)\.
- B\. Choompunuch, K\. Kamdee, and P\. Taksino \(2024\) Exploring the components of multicultural competence among pre\-service teacher students in thailand: an approach utilizing confirmatory factor analysis\. European Journal of Investigation in Health, Psychology and Education 14\(9\), pp\. 2476–2490\.
- D\. K\. Deardorff \(2009a\) Synthesizing conceptualizations of intercultural competence: a summary and emerging themes\. In The SAGE Handbook of Intercultural Competence, D\. K\. Deardorff \(Ed\.\), pp\. 264–270\. External Links: ISBN 9781412960458\.
- D\. K\. Deardorff \(2006\) Identification and assessment of intercultural competence as a student outcome of internationalization\. Journal of studies in international education 10\(3\), pp\. 241–266\.
- D\. K\. Deardorff \(2009b\) The sage handbook of intercultural competence\. Sage Publications\.
- J\. G\. DeJaeghere and Y\. Cao \(2009\) Developing us teachers’ intercultural competence: does professional development matter? International Journal of Intercultural Relations 33\(5\), pp\. 437–447\.
- P\. C\. Earley and S\. Ang \(2003\) Cultural intelligence: individual interactions across cultures\. Stanford University Press, Stanford, CA\.
- A\. Field, S\. L\. Blodgett, Z\. Talat, and Y\. Tsvetkov \(2021\) A survey of race, racism, and anti\-racism in nlp\. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing \(Volume 1: long papers\), pp\. 1905–1925\.
- K\. C\. Fraser, S\. Kiritchenko, I\. Nejadgholi, and A\. Kerkhof \(2023\) What makes a good counter\-stereotype? evaluating strategies for automated responses to stereotypical text\. In Proceedings of the First Workshop on Social Influence in Conversations \(SICon 2023\), pp\. 25–38\.
- K\. C\. Fraser, I\. Nejadgholi, and S\. Kiritchenko \(2021\) Understanding and countering stereotypes: a computational approach to the stereotype content model\. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\), C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\), Online, pp\. 600–616\. External Links: [Link](https://aclanthology.org/2021.acl-long.50/), [Document](https://dx.doi.org/10.18653/v1/2021.acl-long.50)\.
- C\. Gozzoli and D\. Gazzaroli \(2018\) The cultural intelligence scale \(cqs\): a contribution to the italian validation\. Frontiers in psychology 9, pp\. 1183\.
- R\. Guo, I\. Nejadgholi, H\. Dawkins, K\. C\. Fraser, and S\. Kiritchenko \(2024\) Adaptable moral stances of large language models on sexist content: implications for society and gender discourse\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 19548–19564\.
- S\. Havaldar, Y\. M\. Cho, S\. Rai, and L\. Ungar \(2025\) Culturally\-aware conversations: a framework & benchmark for LLMs\. In Proceedings of the Fourth Workshop on Bridging Human\-Computer Interaction and Natural Language Processing \(HCI\+NLP\), S\. L\. Blodgett, A\. C\. Curry, S\. Dev, S\. Li, M\. Madaio, J\. Wang, S\. T\. Wu, Z\. Xiao, and D\. Yang \(Eds\.\), Suzhou, China, pp\. 220–229\. External Links: [Link](https://aclanthology.org/2025.hcinlp-1.18/), [Document](https://dx.doi.org/10.18653/v1/2025.hcinlp-1.18), ISBN 979\-8\-89176\-353\-1\.
- D\. Hershcovich, S\. Frank, H\. Lent, M\. de Lhoneux, M\. Abdou, S\. Brandl, E\. Bugliarello, L\. Cabello Piqueras, I\. Chalkidis, R\. Cui, C\. Fierro, K\. Margatina, P\. Rust, and A\. Søgaard \(2022\) Challenges and strategies in cross\-cultural NLP\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\), Dublin, Ireland, pp\. 6997–7013\. External Links: [Link](https://aclanthology.org/2022.acl-long.482/), [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.482)\.
- J\. Hohenstein, R\. F\. Kizilcec, D\. DiFranzo, Z\. Aghajari, H\. Mieczkowski, K\. Levy, M\. Naaman, J\. Hancock, and M\. F\. Jung \(2023\) Artificial intelligence in communication impacts language and social relationships\. Scientific reports 13\(1\), pp\. 5487\.
- M\. Hong, J\. Choi, and J\. Kim \(2025\) NormGenesis: multicultural dialogue generation via exemplar\-guided social norm modeling and violation recovery\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 33781–33819\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.1715/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1715), ISBN 979\-8\-89176\-332\-6\.
- S\. Kaggwa, T\. F\. Eleogu, F\. Okonkwo, O\. A\. Farayola, P\. U\. Uwaoma, and A\. Akinoso \(2024\) AI in decision making: transforming business strategies\. International Journal of Research and Scientific Innovation 10\(12\), pp\. 423–444\.
- K\. Kim and S\. Lee \(2025\) Nunchi\-bench: benchmarking language models on cultural reasoning with a focus on Korean superstition\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 15328–15342\. External Links: [Link](https://aclanthology.org/2025.findings-acl.794/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.794), ISBN 979\-8\-89176\-256\-5\.
- D\. A\. Kolb \(2014\) Experiential learning: experience as the source of learning and development\. FT press\.
- D\. Küçük and F\. Can \(2020\) Stance detection: a survey\. ACM Computing Surveys \(CSUR\) 53\(1\), pp\. 1–37\.
- J\. Lauring \(2011\) Intercultural organizational communication: the social organizing of interaction in international encounters\. The Journal of Business Communication \(1973\) 48\(3\), pp\. 231–255\.
- C\. C\. Liu, I\. Gurevych, and A\. Korhonen \(2025\) Culturally aware and adapted NLP: a taxonomy and a survey of the state of the art\. Transactions of the Association for Computational Linguistics 13, pp\. 652–689\. External Links: [Link](https://aclanthology.org/2025.tacl-1.31/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00760)\.
- M\. Mendenhall, M\. J\. Stevens, A\. Bird, G\. Oddou, and J\. Osland \(2008\) Specification of the content domain of the intercultural effectiveness scale\. The Kozai monograph series 1\(2\), pp\. 1–22\.
- M\. Mitchell, G\. Attanasio, I\. Baldini, M\. Clinciu, J\. Clive, P\. Delobelle, M\. Dey, S\. Hamilton, T\. Dill, J\. Doughman, R\. Dutt, A\. Ghosh, J\. Z\. Forde, C\. Holtermann, L\. Kaffee, T\. Laud, A\. Lauscher, R\. L\. Lopez\-Davila, M\. Masoud, N\. Nangia, A\. Ovalle, G\. Pistilli, D\. Radev, B\. Savoldi, V\. Raheja, J\. Qin, E\. Ploeger, A\. Subramonian, K\. Dhole, K\. Sun, A\. Djanibekov, J\. Mansurov, K\. Yin, E\. V\. Cueva, S\. Mukherjee, J\. Huang, X\. Shen, J\. Gala, H\. Al\-Ali, T\. Djanibekov, N\. Mukhituly, S\. Nie, S\. Sharma, K\. Stanczak, E\. Szczechla, T\. Timponi Torrent, D\. Tunuguntla, M\. Viridiano, O\. Van Der Wal, A\. Yakefu, A\. Névéol, M\. Zhang, S\. Zink, and Z\. Talat \(2025\) SHADES: towards a multilingual assessment of stereotypes in large language models\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 11995–12041\. External Links: [Link](https://aclanthology.org/2025.naacl-long.600/), [Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.600), ISBN 979\-8\-89176\-189\-6\.
- J\. Myung, N\. Lee, Y\. Zhou, J\. Jin, R\. A\. Putri, D\. Antypas, H\. Borkakoty, E\. Kim, C\. Perez\-Almendros, A\. A\. Ayele, V\. Gutiérrez\-Basulto, Y\. Ibáñez\-García, H\. Lee, S\. H\. Muhammad, K\. Park, A\. S\. Rzayev, N\. White, S\. M\. Yimam, M\. T\. Pilehvar, N\. Ousidhoum, J\. Camacho\-Collados, and A\. Oh \(2024\) BLEnD: a benchmark for LLMs on everyday knowledge in diverse cultures and languages\. In Advances in Neural Information Processing Systems, A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\), Vol\. 37, pp\. 78104–78146\. External Links: [Document](https://dx.doi.org/10.52202/079017-2483), [Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/8eb88844dafefa92a26aaec9f3acad93-Paper-Datasets_and_Benchmarks_Track.pdf)\.
- T\. Naous and W\. Xu \(2025\) On the origin of cultural biases in language models: from pre\-training data to linguistic phenomena\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pp\. 6423–6443\.
- P\. Naveen and P\. Trojovský \(2024\) Overview and challenges of machine translation for contextually appropriate translations\. Iscience 27\(10\)\.
- I\. Nejadgholi, K\. C\. Fraser, A\. Kerkhof, and S\. Kiritchenko \(2024\) Challenging negative gender stereotypes: a study on the effectiveness of automated counter\-stereotypes\. arXiv preprint arXiv:2404\.11845\.
- S\. Onohara, A\. Miyai, Y\. Imajuku, K\. Egashira, J\. Baek, X\. Yue, G\. Neubig, and K\. Aizawa \(2025\) JMMMU: a Japanese massive multi\-discipline multimodal understanding benchmark for culture\-aware evaluation\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 932–950\. External Links: [Link](https://aclanthology.org/2025.naacl-long.43/), [Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.43), ISBN 979\-8\-89176\-189\-6\.
- S\. Palta and R\. Rudinger \(2023\) FORK: a bite\-sized test set for probing culinary cultural biases in commonsense reasoning models\. In Findings of the Association for Computational Linguistics: ACL 2023, pp\. 9952–9962\.
- S\. Pawar, J\. Park, J\. Jin, A\. Arora, J\. Myung, S\. Yadav, F\. G\. Haznitrama, I\. Song, A\. Oh, and I\. Augenstein \(2025\) Survey of cultural awareness in language models: text and beyond\. Computational Linguistics, pp\. 1–96\.
- P\. J\. Pedersen \(2010\) Assessing intercultural effectiveness outcomes in a year\-long study abroad program\. International Journal of intercultural relations 34\(1\), pp\. 70–80\.
- G\. Pei, H\. Li, Y\. Lu, Y\. Wang, S\. Hua, and T\. Li \(2024\) Affective computing: recent advances, challenges, and future trends\. Intelligent Computing 3, pp\. 0076\.
- H\. Qiu, A\. Fabbri, D\. Agarwal, K\. Huang, S\. Tan, N\. Peng, and C\. Wu \(2025\) Evaluating cultural and social awareness of LLM web agents\. In Findings of the Association for Computational Linguistics: NAACL 2025, L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\), Albuquerque, New Mexico, pp\. 3978–4005\. External Links: [Link](https://aclanthology.org/2025.findings-naacl.222/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.222), ISBN 979\-8\-89176\-195\-7\.
- S\. S\. Ramalu, R\. C\. Rose, J\. Uli, and N\. Kumar \(2012\) Cultural intelligence and expatriate performance in global assignment: the mediating role of adjustment\. International Journal of Business and Society 13\(1\), pp\. 19\.
- C\. A\. Richards and A\. Z\. Doorenbos \(2016\) Intercultural competency development of health professions students during study abroad in india\. Journal of nursing education and practice 6\(12\), pp\. 89\.
- N\. F\. Richter, C\. Schlaegel, V\. Taras, I\. Alon, and A\. Bird \(2023\) Reviewing half a century of measuring cross\-cultural competence: aligning theoretical constructs and empirical measures\. International Business Review 32\(4\), pp\. 102122\.
- T\. Rockstuhl, S\. Seiler, S\. Ang, L\. Van Dyne, and H\. Annen \(2011\) Beyond general intelligence \(iq\) and emotional intelligence \(eq\): the role of cultural intelligence \(cq\) on cross\-border leadership effectiveness in a globalized world\. Journal of Social Issues 67\(4\), pp\. 825–840\.
- A\. Romanou, N\. Foroutan, A\. Sotnikova, S\. H\. Nelaturu, S\. Singh, R\. Maheshwary, M\. Altomare, Z\. Chen, M\. A\. Haggag, S\. A, A\. Amayuelas, A\. H\. Amirudin, D\. Boiko, M\. Chang, J\. Chim, G\. Cohen, A\. K\. Dalmia, A\. Diress, S\. Duwal, D\. Dzenhaliou, D\. F\. E\. Florez, F\. Farestam, J\. M\. Imperial, S\. B\. Islam, P\. Isotalo, M\. Jabbarishiviari, B\. F\. Karlsson, E\. Khalilov, C\. Klamm, F\. Koto, D\. Krzemiński, G\. A\. de Melo, S\. Montariol, Y\. Nan, J\. Niklaus, J\. Novikova, J\. S\. O\. Ceron, D\. Paul, E\. Ploeger, J\. Purbey, S\. Rajwal, S\. S\. Ravi, S\. Rydell, R\. Santhosh, D\. Sharma, M\. P\. Skenduli, A\. S\. Moakhar, B\. soltani moakhar, A\. K\. Tarun, A\. T\. Wasi, T\. O\. Weerasinghe, S\. Yilmaz, M\. Zhang, I\. Schlag, M\. Fadaee, S\. Hooker, and A\. Bosselut \(2025\) INCLUDE: evaluating multilingual language understanding with regional knowledge\. In The Thirteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=k3gCieTXeY)\.
- S\. Saha, S\. K\. Pandey, and M\. Choudhury \(2025\)Meta\-cultural competence: climbing the right hill of cultural awareness\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 8025–8042\.External Links:[Link](https://aclanthology.org/2025.naacl-long.408/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.408),ISBN 979\-8\-89176\-189\-6Cited by:[§2](https://arxiv.org/html/2605.15990#S2.p2.1),[§2](https://arxiv.org/html/2605.15990#S2.p3.1)\.
- B\. Szkudlarek, J\. S\. Osland, L\. Nardon, and L\. Zander \(2020\)Communication and culture in international business–moving the field forward\.Journal of World Business55\(6\),pp\. 101126\.Cited by:[§1](https://arxiv.org/html/2605.15990#S1.p3.1)\.
- UNESCO \(2025\)Terms of reference: UNESCO WhatsApp chatbots \(bot development and AI integration\)\.Note:[https://www\.unesco\.org/en/articles/terms\-reference\-unesco\-whatsapp\-chatbots\-bot\-development\-and\-ai\-integration](https://www.unesco.org/en/articles/terms-reference-unesco-whatsapp-chatbots-bot-development-and-ai-integration)Accessed: 2026\-01\-09Cited by:[§5](https://arxiv.org/html/2605.15990#S5.p2.1)\.
- L\. Van Dyne, S\. Ang, and C\. Koh \(2015\)Development and validation of the CQS: the cultural intelligence scale\.InHandbook of cultural intelligence,pp\. 34–56\.Cited by:[§3\.2](https://arxiv.org/html/2605.15990#S3.SS2.p3.1)\.
- H\. Wallach, M\. Desai, N\. Pangakis, A\. F\. Cooper, A\. Wang, S\. Barocas, A\. Chouldechova, C\. Atalla, S\. L\. Blodgett, E\. Corvi,et al\.\(2024\)Evaluating generative AI systems is a social science measurement challenge\.arXiv preprint arXiv:2411\.10939\.Cited by:[§1](https://arxiv.org/html/2605.15990#S1.p4.1),[Limitations](https://arxiv.org/html/2605.15990#Sx1.p1.1)\.
- M\. Warren and W\. W\. Lee \(2020\)Intercultural communication in professional and workplace settings\.InThe Routledge handbook of language and intercultural communication,pp\. 473–486\.Cited by:[§1](https://arxiv.org/html/2605.15990#S1.p3.1)\.
- J\. Wu, J\. Lian, D\. Wang, and H\. M\. Meng \(2025\)SocialCC: interactive evaluation for cultural competence in language agents\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 33242–33271\.External Links:[Link](https://aclanthology.org/2025.acl-long.1594/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1594),ISBN 979\-8\-89176\-251\-0Cited by:[§2](https://arxiv.org/html/2605.15990#S2.p5.1),[§5](https://arxiv.org/html/2605.15990#S5.p4.1)\.
- S\. Yang, H\. Zhao, and W\. Luo \(2024\)The impact of artificial intelligence on intercultural communication\.InBelonging in Culturally Diverse Societies\-Official Structures and Personal Customs,Cited by:[§1](https://arxiv.org/html/2605.15990#S1.p1.1)\.
- B\. Yao, M\. Jiang, T\. Bobinac, D\. Yang, and J\. Hu \(2024\)Benchmarking machine translation with cultural awareness\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 13078–13096\.Cited by:[§1](https://arxiv.org/html/2605.15990#S1.p5.1)\.
- A\. Yerukola, S\. Gabriel, N\. Peng, and M\. Sap \(2025\)Mind the gesture: evaluating AI sensitivity to culturally offensive non\-verbal gestures\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 25041–25080\.External Links:[Link](https://aclanthology.org/2025.acl-long.1218/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1218),ISBN 979\-8\-89176\-251\-0Cited by:[§2](https://arxiv.org/html/2605.15990#S2.p4.1),[§5](https://arxiv.org/html/2605.15990#S5.p3.1)\.
- D\. Yin, H\. Bansal, M\. Monajatipoor, L\. H\. Li, and K\. Chang \(2022\)GeoMLAMA: geo\-diverse commonsense probing on multilingual pre\-trained language models\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\),Abu Dhabi, United Arab Emirates,pp\. 2039–2055\.External Links:[Link](https://aclanthology.org/2022.emnlp-main.132/),[Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.132)Cited by:[§2](https://arxiv.org/html/2605.15990#S2.p3.1),[§5](https://arxiv.org/html/2605.15990#S5.p3.1)\.
- Y\. Zheng, B\. Ross, and W\. Magdy \(2023\)What makes good counterspeech? a comparison of generation approaches and evaluation metrics\.InProceedings of the 1st Workshop on CounterSpeech for Online Abuse \(CS4OA\),pp\. 62–71\.Cited by:[§5](https://arxiv.org/html/2605.15990#S5.p5.1)\.
- N\. Zhou, D\. Bamman, and I\. L\. Bleaman \(2025\)Culture is not trivia: sociocultural theory for cultural NLP\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 25869–25886\.External Links:[Link](https://aclanthology.org/2025.acl-long.1256/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1256),ISBN 979\-8\-89176\-251\-0Cited by:[§2](https://arxiv.org/html/2605.15990#S2.p1.1)\.