
# CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
Source: [https://arxiv.org/html/2604.19262](https://arxiv.org/html/2604.19262)
Peiqin Lin1, Chenyang Lyu1, Wenjiang Luo2, Haotian Ye3, Md Mehrab Hossain5, Chunlan Ma3, Shaoxiong Ji4,5, Younes Samih6, Bo Zeng1, Fan Jiang1, Yuanbin Cao1, Dilda Duisenbek2, Adrian Neo Sau Xun2, Daria Pozdniakova2, Liubou Misevich2, Nevena Marinković2, Ngoc Gia Linh Nguyen2, Thi Khanh Linh Do2, Sarakmatak Sophy2, Baotian Hu8, Guanhua Chen9, Gongbo Tang2, Alham Fikri Aji7, Longyue Wang1, Weihua Luo1

1Alibaba Group, 2Beijing Language and Culture University, 3LMU Munich, 4ELLIS Institute Finland, 5University of Turku, 6IBM Research AI, UAE, 7MBZUAI, 8Harbin Institute of Technology, Shenzhen, 9Southern University of Science and Technology

###### Abstract

Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks—where models must reason within real-world, context-rich scenarios—largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs' multilingual and multicultural competence on grounded tasks. CulturALL is built via a human–AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage, and each item is carefully designed to present a high level of difficulty. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. The experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement. (Code and data are publicly available at [https://github.com/AIDC-AI/Marco-LLM](https://github.com/AIDC-AI/Marco-LLM).)


## 1 Introduction

![Refer to caption](https://arxiv.org/html/2604.19262v1/x1.png)

(a) Example-level comparison.

![Refer to caption](https://arxiv.org/html/2604.19262v1/x2.png)

(b) Benchmark-level comparison.

Figure 1: (a) Example-level: Q1 is multilingual only; Q2 adds cultural knowledge; Q3 requires all three, posing the hardest challenge. (b) Benchmark-level: existing representative benchmarks test at most two axes, while CulturALL spans all three.

As LLMs are adopted across the globe, it is imperative to evaluate how well they perform in diverse languages and cultures. Existing multilingual and multicultural benchmarks, e.g., BLEND (Myung et al., [2024](https://arxiv.org/html/2604.19262#bib.bib1346)), INCLUDE (Romanou et al., [2024](https://arxiv.org/html/2604.19262#bib.bib1362)), and Global MMLU (Singh et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1390)), cover a wide range of languages and cultures, but their content is dominated by encyclopedic trivia. Consequently, they say little about how LLMs perform on the everyday tasks people actually care about, e.g., planning a trip or making an online purchase. Recent efforts have started to introduce grounded evaluations, e.g., CultureBank (Shi et al., [2024](https://arxiv.org/html/2604.19262#bib.bib1368)) and NORMAD (Rao et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1403)). However, they are mostly in English and cover only a narrow band of grounded tasks, mainly social interactions. This gap prompts a key question: How effectively can LLMs tackle the diverse grounded tasks users face across different languages and cultures?

![Refer to caption](https://arxiv.org/html/2604.19262v1/x3.png)

Figure 2: CulturALL is a comprehensive and challenging benchmark. It contains 2,610 samples in 14 languages across 51 regions, distributed among 16 topics to capture the full breadth of grounded tasks. As the given example illustrates, each item presents a grounded scenario followed by its question. Successfully solving each item requires an LLM to fuse these cues with its stored knowledge and reason to the correct answer.

A truly capable LLM must solve grounded tasks across diverse linguistic and cultural contexts, because these tasks reflect what users actually need. These tasks are particularly challenging because they probe three complementary capacities of an LLM: (1) language comprehension (multilingual): the capacity to accurately parse and interpret a user's native tongue; (2) cultural knowledge acquisition (multicultural): the ability to access and recall long-tail, domain-specific cultural facts; and (3) contextual reasoning (grounded): the skill of integrating that information and synthesizing it into an accurate response. As illustrated in Fig. [1(a)](https://arxiv.org/html/2604.19262#S1.F1.sf1), Q1 merely tests an LLM's multilingual ability, and Q2 adds a cultural fact. In contrast, Q3 requires the full chain of multilingual, multicultural, and grounded reasoning: the LLM must first interpret the Chinese query, identify the relevant late-August festival in China and its customs, recall the symbolic meanings of different flowers, and finally synthesize this information into a concise, culturally appropriate reply via reasoning. Coordinating this chain of culturally grounded reasoning is anything but trivial.

In response, we introduce CulturALL, the first benchmark to assess LLM performance in grounded scenarios across diverse languages and cultures (Fig. [1(b)](https://arxiv.org/html/2604.19262#S1.F1.sf2)). CulturALL is constructed using a novel human-LLM collaborative framework that leverages expert annotators for factual accuracy and elevated difficulty, while LLMs assist in generating and enriching diverse scenarios, ensuring comprehensive coverage and challenging samples. Fig. [2](https://arxiv.org/html/2604.19262#S1.F2) shows CulturALL's extensive language and cultural coverage, as well as its challenging nature. CulturALL has two defining characteristics: 1) coverage: its 2,610 samples span 16 topics that encompass diverse facets of daily life and society, covering cultures from 51 regions across 14 languages; 2) difficulty: answering each scenario-based question is hard because it requires LLMs to integrate nuanced cultural knowledge with strong multi-step reasoning skills.

Using CulturALL, we analyze existing LLMs and find that they struggle with culturally grounded tasks, and that improving their performance requires effective web search and strong reasoning capabilities. In summary, our contributions are threefold.

- We design a unified human-LLM framework, which can be applied to create benchmarks with wide coverage and high difficulty.
- We present CulturALL, the first benchmark explicitly designed to assess LLMs' multilingual and multicultural competence across a wide spectrum of realistic tasks.
- We benchmark state-of-the-art LLMs on CulturALL and deliver an in-depth analysis, highlighting key strengths and failure modes.

## 2 CulturALL: Construction and Statistics

![Refer to caption](https://arxiv.org/html/2604.19262v1/x4.png)

Figure 3: The data construction framework of CulturALL: 1) Cultural Topic Sourcing: assemble a list of cultural topics; 2) Sample Creation: craft original items for each topic; 3) Sample Enrichment: enhance realism and increase difficulty; 4) Release-Ready: complete sample information and conduct quality validation.

Robust evaluation of LLMs on multilingual and multicultural tasks requires datasets that are both diverse and challenging. To achieve this at scale, we introduce a unified human-LLM framework that combines human expertise with the generative power of LLMs, resulting in CulturALL, which offers broad coverage and high difficulty.

An overview of this framework is shown in Fig. [3](https://arxiv.org/html/2604.19262#S2.F3). The framework begins with cultural topic sourcing (§[2.1](https://arxiv.org/html/2604.19262#S2.SS1)), compiling an extensive list of cultural topics and illustrative examples. Next is sample creation (§[2.2](https://arxiv.org/html/2604.19262#S2.SS2)), where we draft seed instances for these topics, drawing on sources such as personal experience and online materials. These drafts are then refined during sample enrichment (§[2.3](https://arxiv.org/html/2604.19262#S2.SS3)) to increase their difficulty and better mirror grounded scenarios. The final stage, Release-Ready (§[2.4](https://arxiv.org/html/2604.19262#S2.SS4)), completes each sample with topic/region labels and an English translation, and then conducts thorough quality checks.

We define a culture group as the population of a single country or region. To capture broad cultural expertise, we collaborate with annotators from a wide range of countries, regions, and linguistic backgrounds. Real-world queries seldom state their cultural origin explicitly, so LLMs must infer it from implicit cues—vocabulary, idioms, institutions, and other context signals. For this reason, each annotator composes samples in the dominant language of their locale—e.g., English in the United States and Mandarin Chinese in mainland China—embedding authentic local references that models must recognize and interpret. All annotation details, including annotator information and guidelines, are provided in §[A](https://arxiv.org/html/2604.19262#A1).

### 2.1 Stage 1: Cultural Topic Sourcing

To spur the creation of grounded tasks across a broad spectrum of cultural topics, the cultural-topic-sourcing stage aims to generate a comprehensive list that covers nearly every facet of daily life through human–LLM collaboration. We first compile a preliminary set of topics with concise scope descriptions, drawing on prior research (Yin et al., [2022](https://arxiv.org/html/2604.19262#bib.bib888); Romanou et al., [2024](https://arxiv.org/html/2604.19262#bib.bib1362); Chiu et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1387)) and heuristics, and then engage gpt-4o-2024-11-20 in several iterative rounds to merge, refine, and expand both the topics and their accompanying descriptions. With the final list in place, we craft seed examples from personal experience and instruct gpt-4o-2024-11-20 to expand them, yielding a pool of 160 illustrative instances (10 per topic) that serve as scaffolding for the subsequent sample-creation stage. The complete topic list, accompanied by descriptions and three representative culture-related scenarios, appears in Tab. [4](https://arxiv.org/html/2604.19262#A2.T4) (§[B](https://arxiv.org/html/2604.19262#A2)); the complete set of examples will be released publicly.
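For illustration, the following minimal sketch shows one way the iterative merge/refine/expand loop could be driven; the `chat` helper, the prompt wording, and the number of rounds are assumptions for readability, not the authors' exact implementation.

```python
# Hypothetical sketch of the iterative topic-refinement loop described above.
# `chat` stands in for a call to gpt-4o-2024-11-20; prompt text and round
# count are illustrative assumptions.
import json
from typing import Callable

def refine_topics(seed_topics: dict[str, str],
                  chat: Callable[[str], str],
                  rounds: int = 3) -> dict[str, str]:
    """Iteratively merge, refine, and expand a {topic: description} mapping."""
    topics = dict(seed_topics)
    for _ in range(rounds):
        prompt = (
            "Here is a list of cultural topics with scope descriptions:\n"
            f"{json.dumps(topics, ensure_ascii=False, indent=2)}\n"
            "Merge overlapping topics, sharpen the descriptions, and add any "
            "missing facets of daily life. Return the full list as JSON."
        )
        topics = json.loads(chat(prompt))
    return topics
```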

### 2.2 Stage 2: Sample Creation

#### 2.2.1 Sample Format

Tab. [1](https://arxiv.org/html/2604.19262#S2.T1) outlines the schema that each sample must adhere to. During annotation, the `language` field is predefined based on the source data or the annotator's information, while `region` and `topic` are automatically generated by the LLM (see §[2.4.1](https://arxiv.org/html/2604.19262#S2.SS4.SSS1)). All remaining fields are reviewed and completed by the annotators. Below, we detail the requirements for the overall sample and for each field that must be verified or completed by annotators.

Table 1: Metadata for each item.

##### Sample

Craft a culturally grounded item that evaluates an LLM's ability to employ cultural knowledge. Cultural knowledge includes, but is not limited to, local vocabulary, social norms, cultural commonsense, regulations, and domain-specific knowledge. Generic trivia (e.g., math puzzles or textbook facts) is out of scope. Two items are considered distinct only if they probe different knowledge or reasoning steps, not if they are merely paraphrases of each other.

##### Scenario

Construct a grounded scenario, withholding any explicit hints that would let a model solve the task without the relevant cultural knowledge.

##### Question

Ensure the query arises from the scenario and cannot be answered correctly without an understanding of the relevant cultural knowledge.

##### Answer

To facilitate automatic evaluation, answers should be objective and as brief as possible. If an objective free-form answer is impractical, convert the question to a four-option multiple-choice format (A–D) and return only the chosen letter.

##### Explanation

When appropriate, supply the cultural or domain knowledge that supports the answer. These explanations make CulturALL more transparent for readers and pave the way for using CulturALL in future free-text evaluation tasks.
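To make the schema concrete, the sketch below renders the fields above as a Python dataclass; the field names are assumptions inferred from Tab. 1 and the descriptions here, and the released data may use different keys.

```python
# Hypothetical rendering of the per-sample schema described above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CulturALLSample:
    language: str  # predefined from the source data or the annotator's profile
    region: str    # ISO 3166-1 alpha-2 code, filled in by the LLM in Stage 4
    topic: str     # one of the 16 predefined topics, filled in by the LLM
    scenario: str  # grounded scenario, with no explicit hints
    question: str  # must require the relevant cultural knowledge
    answer: str    # brief objective answer, or a single letter A-D for MCQ items
    explanation: Optional[str] = None          # supporting cultural/domain knowledge
    english_translation: Optional[str] = None  # added in the Release-Ready stage
```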

#### 2.2.2 Cultural Knowledge Sourcing

##### Personal Experience (Human)

To capture unwritten social cues, emerging slang, and region-specific practices, we ask annotators to draw from their personal experiences. These first-hand contributions result in scenarios that are both authentic and deeply rooted in context. Annotators receive a detailed list of topics with descriptions and examples (§[2.1](https://arxiv.org/html/2604.19262#S2.SS1)), which they can adapt, use as inspiration for new culturally relevant instances, or supplement with ideas from local forums.

##### Cross-lingual Inspiration (Human)

An example in one language often sparks analogous ideas in annotators who speak other languages. For instance, a Chinese query about obtaining a visa for Hong Kong may inspire a French annotator to create a comparable scenario involving a French employee applying for a Belgian work permit. To facilitate this transfer, we translate existing samples into English (see §[2.4.2](https://arxiv.org/html/2604.19262#S2.SS4.SSS2) for details), which serves as a shared pivot that enables native speakers of other languages to more easily create parallel data.

##### Existing Datasets (LLM)

Many prior cultural benchmarks contain culture-relevant items yet lack explicit grounding in real-world contexts. We refine these items by rewriting them with gpt-4o-2024-11-20, anchoring each item in a concrete scenario while preserving its original knowledge requirements. Details are provided in §[C](https://arxiv.org/html/2604.19262#A3).

##### Online Resources (LLM)

We collect culture-rich materials from online resources, focusing primarily on mining posts from Xiaohongshu, guided by our cultural topic example list. For each target country/region (Tab. [5](https://arxiv.org/html/2604.19262#A3.T5), §[C](https://arxiv.org/html/2604.19262#A3)), we combine the region name with topic seeds generated during the cultural-topic-sourcing stage (in Chinese) as search queries to efficiently surface relevant local content. This crawl returns 3,518 pages. Each page is translated into the country/region's dominant language with gpt-4o-2024-11-20 (Fig. [6](https://arxiv.org/html/2604.19262#A4.F6), §[D](https://arxiv.org/html/2604.19262#A4)). The retrieved content is then supplied to gpt-4o-2024-11-20 as raw input for drafting candidate items (Fig. [8](https://arxiv.org/html/2604.19262#A4.F8), §[D](https://arxiv.org/html/2604.19262#A4)).
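A minimal sketch of this mining pipeline is given below; `search_posts`, `translate`, and `draft_item` are hypothetical helpers standing in for the crawl and the gpt-4o-2024-11-20 calls, and the query format is an assumption based on the description above.

```python
# Hypothetical sketch of the online-resource mining step: compose region +
# topic-seed queries (in Chinese), translate each retrieved page into the
# region's dominant language, and draft candidate items from it.
def mine_region(region_name_zh: str, topic_seeds_zh: list[str],
                search_posts, translate, draft_item,
                dominant_language: str) -> list[dict]:
    items = []
    for seed in topic_seeds_zh:
        query = f"{region_name_zh} {seed}"          # search query, as described above
        for page in search_posts(query):            # crawl relevant posts
            translated = translate(page, target=dominant_language)
            items.append(draft_item(translated))    # LLM drafts a candidate item
    return items
```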

### 2.3 Stage 3: Sample Enrichment

To ensure CulturALL better reflects challenging grounded demands, every draft item undergoes a process of “up-leveling,” as illustrated in Fig. [3](https://arxiv.org/html/2604.19262#S2.F3). Specifically, we assess the difficulty of the original samples and categorize them into hard and easy examples. Hard examples are forwarded to Stage 4, while easy examples undergo a difficulty elevation process to increase their difficulty where possible.

#### 2.3.1 Difficulty Measure (LLM)

Inspired by Phan et al. ([2025](https://arxiv.org/html/2604.19262#bib.bib1407)) and Fabbri et al. ([2025a](https://arxiv.org/html/2604.19262#bib.bib1408)), we utilize three LLMs—gpt-4o-2024-11-20, claude-3.5-sonnet-1022, and qwen-max-2024-09-19—to quantify the difficulty of items produced by Stage 2. An item is classified as challenging if at most one of the three LLMs provides the correct answer. Such items are forwarded to Stage 4. Otherwise, they proceed to the difficulty elevation process for further complexity enhancement.
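The filtering rule itself is simple; the sketch below states it directly, assuming `answers` maps each probe model to a boolean correctness judgment (the helper and mapping are illustrative, not the authors' code).

```python
# Difficulty filter: an item is "challenging" if at most one of the three
# probe LLMs answers it correctly; such items skip elevation and go to Stage 4.
PROBE_MODELS = ["gpt-4o-2024-11-20", "claude-3.5-sonnet-1022", "qwen-max-2024-09-19"]

def is_challenging(answers: dict[str, bool]) -> bool:
    num_correct = sum(answers[m] for m in PROBE_MODELS)
    return num_correct <= 1
```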

#### 2.3.2 Difficulty Elevation (Human)

We introduce three complementary enrichment strategies to guide human annotators in elevating the difficulty of existing samples:

##### Long-Tail Swap

Common entities are replaced with rarer ones, e.g., substituting the general location "Hong Kong" with "MacLehose Trail," a lesser-known hiking route within the region.

##### More/Less Context

Additional situational details are introduced, requiring the answer to hinge on conditional, multi-step reasoning (e.g., determining whether a traveler holds a prior visa). Conversely, unnecessary context that gives hints to LLMs can be removed to increase the challenge.

##### Compositional Example

Two independent knowledge points are combined into a single query—for example, merging the entry requirements for both Hong Kong and Bangkok—forcing the model to engage in compositional reasoning.

Annotators are encouraged to apply one or more of these techniques to enhance sample difficulty. If no opportunity for improvement exists, annotators may leave the sample unchanged. These refinements amplify task complexity, moving beyond superficial matching and compelling models to demonstrate deeper understanding and advanced reasoning capabilities.

### 2.4 Stage 4: Release-Ready

#### 2.4.1 Metadata Completion (LLM)

The Metadata Completion step utilizes gpt-4o-2024-11-20 to fill in the `region` and `topic` fields. For `region`, the prompt shown in Fig. [9](https://arxiv.org/html/2604.19262#A4.F9) (§[D](https://arxiv.org/html/2604.19262#A4)) is used to generate the corresponding ISO 3166-1 alpha-2 code. For `topic`, we prompt the LLM to select the most suitable topic from the predefined list (§[2.1](https://arxiv.org/html/2604.19262#S2.SS1)) based on the created sample, using the prompt provided in Fig. [10](https://arxiv.org/html/2604.19262#A4.F10) (§[D](https://arxiv.org/html/2604.19262#A4)).
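A rough sketch of this step is shown below; the `chat` helper and the prompt wording are assumptions (the actual prompts appear in Fig. 9 and Fig. 10 of the appendix).

```python
# Hypothetical sketch of metadata completion with an LLM: predict the ISO
# 3166-1 alpha-2 region code and pick a topic from the predefined list.
def complete_metadata(sample_text: str, topic_list: list[str], chat) -> dict:
    region = chat(
        "Return only the ISO 3166-1 alpha-2 code of the country/region this "
        f"sample is about:\n{sample_text}"
    ).strip()
    topic = chat(
        "Select the single most suitable topic for this sample from "
        f"{topic_list} and return only the topic name:\n{sample_text}"
    ).strip()
    return {"region": region, "topic": topic}
```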

#### 2.4.2 Translation (LLM)

In this sub-step, gpt-4o-2024-11-20 translates each sample into English, using the translation prompt in Fig. [6](https://arxiv.org/html/2604.19262#A4.F6) (§[D](https://arxiv.org/html/2604.19262#A4)). This process provides a unified reference language for readers while also inspiring annotators to develop new cross-regional samples.

#### 2.4.3 Quality Control (Human)

Data quality is maintained via a peer-review process in which each annotator cross-checks samples created by others or by LLMs. During the review, annotators are asked to check the following aspects:

##### Region/Topic Correctness

The assigned region should be a valid ISO 3166-1 alpha-2 code, and the topic must belong to the predefined list. Both must accurately align with the content of the sample.

##### Requirement Adherence

Each sample must comply with the requirements outlined in §[2.2.1](https://arxiv.org/html/2604.19262#S2.SS2.SSS1).

##### Translation Quality

The LLM's translation output should correctly align with the source input.

##### Sensitive or Offensive Content

The output should not include personally identifiable information or harmful, offensive, or inappropriate content.

Based on these criteria, annotators have three options: accept, revise, or reject. A sample should be marked as accept if it meets all four criteria. If issues are identified, the annotator should attempt to revise the sample so that it fully satisfies the requirements. However, if the sample cannot be revised to meet the criteria, it should be marked as reject. In total, 75.0% of the samples were marked as accept, 8.1% as revise, and 16.9% as reject. To ensure robustness, we randomly selected 100 samples from the final dataset for cross-checking, and all of them aligned with the criteria.

![Refer to caption](https://arxiv.org/html/2604.19262v1/x5.png)

Figure 4: Distributions across topics, languages, and regions. The first row shows (a) the topic distribution and (b) the language distribution; the second row shows (c) the region distribution.

### 2.5 Statistics

The resulting CulturALL comprises 2,610 samples, spanning 14 languages and 51 regions. The distributions of samples across topics, languages, and regions are illustrated in Fig. [4](https://arxiv.org/html/2604.19262#S2.F4).

To ensure that CulturALL remains genuinely challenging, it excludes any item that is correctly solved by all 15 model settings (§[3](https://arxiv.org/html/2604.19262#S3)). We categorize the 2,610 examples in CulturALL based on how many of the 15 settings answered them correctly. Items solved by 10–14 settings are labeled as Easy (470 items, 18.01%), while those solved by 5–9 settings form the Medium subset (700 items, 26.82%). Finally, items solved by at most four settings are grouped into the Hard subset (1,440 items, 55.17%). Fig. [13](https://arxiv.org/html/2604.19262#A5.F13) in §[E](https://arxiv.org/html/2604.19262#A5) further illustrates the language distribution of examples according to the number of settings that answered them correctly.
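The bucketing rule can be written down directly; the sketch below is a straightforward restatement of the thresholds above, with the helper name chosen for illustration.

```python
# Difficulty bucketing by the number of the 15 evaluated settings that answer
# an item correctly. Items solved by all 15 settings are excluded entirely.
def difficulty_bucket(num_correct_settings: int) -> str:
    if num_correct_settings == 15:
        return "excluded"
    if num_correct_settings >= 10:
        return "Easy"    # 470 items, 18.01%
    if num_correct_settings >= 5:
        return "Medium"  # 700 items, 26.82%
    return "Hard"        # 1,440 items, 55.17%
```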

## 3 Setup

##### Model Selection

Tab. [2](https://arxiv.org/html/2604.19262#S3.T2) presents the 15 experiments conducted to benchmark top-performing LLMs on CulturALL. Our analysis includes 8 leading LLMs from the Text Arena leaderboard as of 18 August 2025 ([https://lmarena.ai/leaderboard/text](https://lmarena.ai/leaderboard/text)), spanning both open-source and proprietary models. The experiments feature 15 distinct configurations, achieved by varying reasoning capabilities and the inclusion of web search. This setup facilitates a systematic evaluation of key factors such as reasoning effort, web-search integration, and model size.

Table 2: Performance of evaluated LLMs with diverse settings on the complete CulturALL dataset and its three subsets, categorized by difficulty level. All results are reported as accuracy (%). Reasoning: reasoning capability. Web: the use of web search. Open: open-source availability. Experiment Name: Model Name_Open_Web.
##### Prompt Design

To benchmark LLMs with CulturALL, we use zero-shot prompting. We additionally require the LLMs to answer each question in as few words as possible to ease the follow-up evaluation. The concrete prompt is provided in Fig. [11](https://arxiv.org/html/2604.19262#A4.F11) of §[D](https://arxiv.org/html/2604.19262#A4).
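As an illustration only, a zero-shot prompt of this form might look like the following; the exact wording used in the paper is given in Fig. 11, and this template is an assumption.

```python
# Hypothetical zero-shot prompt template for querying an evaluated LLM.
def build_prompt(scenario: str, question: str) -> str:
    return (
        f"{scenario}\n\n{question}\n\n"
        "Answer the question in as few words as possible."
    )
```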

##### Metric

Since all reference answers in CulturALL are strictly objective, automatic assessment can be applied. During evaluation, the judge LLM, gpt-4o-2024-11-20, is provided with the full item (scenario, question, gold answer, and optional explanation) alongside the prediction from the evaluated LLM. Using the prompt provided in Fig. [12](https://arxiv.org/html/2604.19262#A4.F12), the judge assesses whether the prediction aligns with the reference answer. Each item yields a binary outcome (correct or incorrect), and overall performance is measured as accuracy, defined as the proportion of correctly judged items.
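The scoring loop reduces to counting binary verdicts; the sketch below assumes a hypothetical `judge` helper that wraps the judge prompt from Fig. 12 and returns True when the prediction matches the reference.

```python
# Hypothetical LLM-as-judge scoring loop: accuracy is the fraction of items
# whose prediction the judge marks as matching the gold answer.
def evaluate(samples: list[dict], predictions: list[str], judge) -> float:
    correct = 0
    for sample, prediction in zip(samples, predictions):
        verdict = judge(
            scenario=sample["scenario"],
            question=sample["question"],
            gold_answer=sample["answer"],
            explanation=sample.get("explanation"),
            prediction=prediction,
        )
        correct += int(bool(verdict))
    return correct / len(samples)
```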

## 4 Results

### 4.1 Performance Across LLMs

The performance of evaluated LLMs across various configurations on the complete CulturALL dataset is detailed in Tab. [2](https://arxiv.org/html/2604.19262#S3.T2). The resulting scores support several key observations:

##### Best Setting Still Falls Short

Among all experiments, gemini-2.5-pro_auto_true, which utilizes gemini-2.5-pro with its strongest reasoning capability and web search integration, achieves the highest accuracy at 44.48%. However, this performance remains far from ideal, indicating significant room for LLM improvement in handling the challenging scenarios presented by CulturALL.

##### Open-Source LLMs Lag Behind

As shown in the results, gemini-2.5-pro (ID 2), gpt-5 (ID 5), and claude-opus-4 (ID 8) achieve accuracy rates of 37.89%, 37.59%, and 36.70%, respectively, demonstrating comparable performance among proprietary models. In contrast, the open-source qwen series shows a significant performance gap, with its top-performing setting, qwen3-235b-a22b_high_false (ID 14), achieving only 23.68%. This disparity highlights the challenges faced by open-source models in addressing multilingual and culturally grounded tasks, particularly when competing with advanced proprietary alternatives.

##### Higher Variants Consistently Outperform Their Counterparts

The performance comparisons align with claims about model capabilities. Gemini-2.5-pro (ID 1) outperforms gemini-2.5-flash (ID 4) by 10.80%, supporting its position as the more powerful variant within the Gemini series. Similarly, claude-opus-4 (ID 8) achieves a 3.94% higher accuracy than claude-sonnet-4 (ID 10), reinforcing its superior performance.

##### Reasoning Effort Affects LLMs Unevenly

Reasoning capabilities play a crucial role in determining the performance of LLMs, but the impact varies across models. For example, gemini-2.5-pro with the most advanced reasoning setting of "auto" (ID 1) outperforms the same model configured with minimal reasoning (128 tokens, ID 3) by 5.21%, highlighting the positive effect of greater reasoning effort. However, increasing reasoning capabilities for gpt-5 (ID 5 vs. 6 vs. 7) and claude-opus-4 (ID 8 vs. 9) fails to raise their scores by a considerable margin. This discrepancy could stem from the fact that these LLMs lack sufficiently robust multi-step reasoning abilities to navigate complex cultural scenarios effectively.

##### Web Search Plays a Critical Role

Equipping gemini-2.5-pro with a web-search tool raises its score by 6.59% (ID 1 vs. 2), demonstrating the benefit of external retrieval. In contrast, the qwen models gain little advantage from the web-search tool (ID 11 vs. 12, ID 13 vs. 14), suggesting they are not yet sufficiently trained to leverage web-search results for solving grounded tasks.

### 4.2 Performance Across Difficulty Levels

As shown in Tab. [2](https://arxiv.org/html/2604.19262#S3.T2), the performance of LLMs across the three subsets highlights several key patterns. First, the relative ranking of different experimental settings remains consistent across Easy, Medium, and Hard tasks, indicating stable differences in their capabilities. Second, the performance gap between commercial and open-source models becomes more pronounced as tasks grow easier: on the Easy subset, gemini-2.5-pro_auto_true (ID 1) achieves an impressive accuracy of 92.55%, while the best-performing open-source model (ID 14) achieves only 67.23%, trailing by a substantial margin of 25.32%. Finally, all systems continue to struggle with the Hard subset; even the top-performing setting (ID 1) achieves only 18.47%, highlighting the need for further advancements.

![Refer to caption](https://arxiv.org/html/2604.19262v1/x6.png)

Figure 5: Performance of various experimental settings across 14 languages. X-axis: languages (along with their sample counts); Y-axis: different experimental settings.

### 4.3 Performance Across Languages

We further investigate the performance of different experimental settings across languages, as depicted in the heatmap in Fig. [5](https://arxiv.org/html/2604.19262#S4.F5). Certain languages, such as Serbian, are relatively less challenging, whereas others, like Japanese, pose greater difficulties. Notably, the performance rankings of languages vary significantly across different experimental settings. For example, gemini-2.5-flash_auto_true achieves an accuracy of 69.80% in Bengali, demonstrating competitive results against state-of-the-art settings. However, in other languages, e.g., Arabic and Chinese, its performance lags significantly behind alternative configurations.

### 4.4 English vs. In-Language Prompts

Previous work (Yin et al., [2022](https://arxiv.org/html/2604.19262#bib.bib888)) revealed that samples written in English tend to achieve superior performance compared to those written in native languages, as LLMs typically exhibit greater proficiency in English. In our study, we evaluated performance using native prompts and their English translations under the experimental setting of ID 1. The results show that employing English prompts yields an accuracy of 36.40%, which is 8.08% lower than the accuracy achieved with the original native prompts. We hypothesize that this discrepancy arises because the native language inherently reflects the cultural context of the scenario, whereas translation may dilute or lose these nuances. These findings highlight the urgent need to enhance LLMs' capabilities to adapt to diverse cultural and linguistic contexts.

## 5 Related Work

A wide range of benchmarks have been introduced to assess the general capabilities of LLMs (Hendrycks et al., [2021](https://arxiv.org/html/2604.19262#bib.bib765); Wang et al., [2024b](https://arxiv.org/html/2604.19262#bib.bib1380)). As LLMs become ubiquitous across the world, researchers are paying growing attention to their performance across diverse languages (Wu et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1394)) and cultures (Hershcovich et al., [2022](https://arxiv.org/html/2604.19262#bib.bib826); Adilazuarda et al., [2024](https://arxiv.org/html/2604.19262#bib.bib1366); Pawar et al., [2024](https://arxiv.org/html/2604.19262#bib.bib1361); Liu et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1406)).

Early efforts to assess LLMs under multilingual and multicultural settings have curated knowledge bases and probing tasks to test LLMs' capacity for culture-specific knowledge acquisition (Yin et al., [2022](https://arxiv.org/html/2604.19262#bib.bib888); Wang et al., [2024a](https://arxiv.org/html/2604.19262#bib.bib1378); Myung et al., [2024](https://arxiv.org/html/2604.19262#bib.bib1346); Zhou et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1405); Hasan et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1389); Arora et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1386)). Findings from these studies show that LLMs still display pronounced cultural biases and uneven performance across different regions of the world (Naous et al., [2024](https://arxiv.org/html/2604.19262#bib.bib1288); Mitchell et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1401)). While these benchmarks shed light on what LLMs know about diverse cultures in different languages, they do not fully assess multilingual and multicultural competence, which requires LLMs not only to store cultural knowledge but also to apply it flexibly in grounded scenarios (Rao et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1403)).

To obtain a clearer picture of LLMs' multilingual and multicultural competence, recent benchmarks have shifted from decontextualized trivia to grounded scenarios, covering social interaction (Rao et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1403); Yin et al., [2024](https://arxiv.org/html/2604.19262#bib.bib1381); Qiu et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1402)), psychology (Jin et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1399)), and cultural proverbs (Liu et al., [2024](https://arxiv.org/html/2604.19262#bib.bib1376)). However, they are usually monolingual or confined to a narrow domain, leaving them a long way from the breadth and diversity of situations that arise in grounded applications.

## 6 Conclusion

In this paper, we introduce CulturALL, the first benchmark designed to evaluate the multilingual and multicultural capabilities of LLMs across grounded tasks. CulturALL is built using a human-LLM collaboration framework, ensuring both comprehensiveness and a high level of challenge. Through an in-depth analysis of LLMs on CulturALL, we highlight the critical need to enhance their information retrieval and reasoning skills.

## Limitations

The scope and design of this study inevitably come with certain limitations, which we outline in this section. Addressing these limitations in future work can help provide a more nuanced and comprehensive understanding of LLMs' performance across cultural contexts.

##### Coverage Bias

As illustrated in Fig. [4](https://arxiv.org/html/2604.19262#S2.F4), certain topics, languages, and regions are underrepresented in the dataset. Nevertheless, our proposed unified data construction framework provides a practical foundation for expanding CulturALL to improve coverage in the future.

##### Focus on Regional Cultural Groupings

In this work, we define cultural groups predominantly based on geographic regions. While this approach provides a high-level understanding of general cultural differences, it does not fully capture more nuanced or cross-cutting aspects of culture. Factors such as religion, age, socio-economic status, gender, and education significantly influence cultural perspectives and may intersect in ways that are not represented by regional groupings alone. Future studies should explore these more fine-grained cultural dimensions to offer a holistic assessment of LLMs' culturally grounded reasoning capabilities.

##### Reliance on Objective Answer Evaluation

To ensure consistency and reproducibility in our experimental setup, we focus exclusively on tasks with objective, verifiable answers to enable automatic evaluation. While these tasks serve as a robust benchmark for model performance, they do not account for the complexities of free-text generation, which is a key feature of LLMs in multilingual and culturally nuanced applications. Investigating free-text generation alongside objective reasoning tasks is an important avenue for future work to better understand LLMs' ability to engage with subjective, open-ended questions influenced by cultural relativism.

##### Exclusion of Multimodal Inputs

Our study focuses entirely on text-based inputs without considering multimodal contexts, such as the integration of visual, auditory, or other non-textual signals. However, cultural understanding often extends beyond textual communication to include visual symbolism, nonverbal gestures, and audio cues, all of which hold significant meaning in cultural interactions. Future research should explore the impact of multimodality on LLM performance when tackling culturally grounded tasks to better model the complexities of human communication.

## Ethical Considerations

##### Annotation Process and Annotator Profile

Prior to starting the annotation process, annotators undergo a comprehensive briefing on the guidelines detailed in §[A.2](https://arxiv.org/html/2604.19262#A1.SS2). Each annotator must first label a pilot batch of 10 randomly selected examples. Only those who complete this trial accurately—demonstrating full comprehension of the guidelines—may advance to the main annotation phase. During production, we run continuous spot-checks and feedback rounds to keep quality high.

The annotation team consists of native speakers or individuals with extensive immersion in the target language (see §[A.1](https://arxiv.org/html/2604.19262#A1.SS1) for demographics). Annotators were informed about the purpose of the data collection, its intended use, and storage policies through detailed instructions and a privacy agreement.

##### Reproducibility Challenges and Mitigation Strategies

To facilitate reproducibility, we will make publicly available: (i) the key code components used for data collection, processing, and evaluation; (ii) the finalized CulturALL dataset accompanied by detailed documentation of its construction and evaluation workflow; and (iii) the full set of model outputs, performance scores, and all experimental configurations required to replicate our results.

All artifacts will be made publicly available at the time of publication, enabling anyone to fully reproduce the entire workflow, from raw inputs to the final results presented in our tables.

## References

- M\. F\. Adilazuarda, S\. Mukherjee, P\. Lavania, S\. Singh, A\. F\. Aji, J\. O’Neill, A\. Modi, and M\. Choudhury \(2024\)Towards measuring and modeling "culture" in llms: A survey\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12\-16, 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),pp\. 15763–15784\.External Links:[Link](https://doi.org/10.18653/v1/2024.emnlp-main.882),[Document](https://dx.doi.org/10.18653/V1/2024.EMNLP-MAIN.882)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p1.1)\.
- Arora et al. \(2025\)CaLMQA: exploring culturally specific long\-form question answering across 23 languages\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),pp\. 11772–11817\.External Links:[Link](https://aclanthology.org/2025.acl-long.578/)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p2.1)\.
- Y\. Y\. Chiu, L\. Jiang, B\. Y\. Lin, C\. Y\. Park, S\. S\. Li, S\. Ravi, M\. Bhatia, M\. Antoniak, Y\. Tsvetkov, V\. Shwartz, and Y\. Choi \(2025\)CulturalBench: A robust, diverse and challenging benchmark for measuring lms’ cultural knowledge through human\-ai red\-teaming\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),pp\. 25663–25701\.External Links:[Link](https://aclanthology.org/2025.acl-long.1247/)Cited by:[4th item](https://arxiv.org/html/2604.19262#A3.I1.i4.p1.1),[§2\.1](https://arxiv.org/html/2604.19262#S2.SS1.p1.1)\.
- A\. R\. Fabbri, D\. Mares, J\. Flores, M\. Mankikar, E\. Hernandez, D\. Lee, B\. Liu, and C\. Xing \(2025a\)MultiNRC: a challenging and native multilingual reasoning evaluation benchmark for llms\.arXiv preprint arXiv:2507\.17476\.Cited by:[§2\.3\.1](https://arxiv.org/html/2604.19262#S2.SS3.SSS1.p1.1)\.
- A\. R\. Fabbri, D\. Mares, J\. Flores, M\. Mankikar, E\. Hernandez, D\. Lee, B\. Liu, and C\. Xing \(2025b\)MultiNRC: A challenging and native multilingual reasoning evaluation benchmark for llms\.CoRRabs/2507\.17476\.External Links:[Link](https://doi.org/10.48550/arXiv.2507.17476),[Document](https://dx.doi.org/10.48550/ARXIV.2507.17476),2507\.17476Cited by:[6th item](https://arxiv.org/html/2604.19262#A3.I1.i6.p1.1)\.
- Md\. A\. Hasan, M\. Hasanain, F\. Ahmad, S\. R\. Laskar, S\. Upadhyay, V\. N\. Sukhadia, M\. Kutlu, S\. A\. Chowdhury, and F\. Alam \(2025\)NativQA: multilingual culturally\-aligned natural query for llms\.InFindings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 \- August 1, 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),pp\. 14886–14909\.External Links:[Link](https://aclanthology.org/2025.findings-acl.770/)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p2.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.In9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\-7, 2021,External Links:[Link](https://openreview.net/forum?id=d7KBjmI3GmQ)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p1.1)\.
- D\. Hershcovich, S\. Frank, H\. C\. Lent, M\. de Lhoneux, M\. Abdou, S\. Brandl, E\. Bugliarello, L\. C\. Piqueras, I\. Chalkidis, R\. Cui, C\. Fierro, K\. Margatina, P\. Rust, and A\. Søgaard \(2022\)Challenges and strategies in cross\-cultural NLP\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2022, Dublin, Ireland, May 22\-27, 2022,S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),pp\. 6997–7013\.External Links:[Link](https://doi.org/10.18653/v1/2022.acl-long.482),[Document](https://dx.doi.org/10.18653/V1/2022.ACL-LONG.482)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p1.1)\.
- Z\. Jin, M\. Kleiman\-Weiner, G\. Piatti, S\. Levine, J\. Liu, F\. G\. Adauto, F\. Ortu, A\. Strausz, M\. Sachan, R\. Mihalcea, Y\. Choi, and B\. Schölkopf \(2025\)Language model alignment in multilingual trolley problems\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=VEqPDZIDAh)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p3.1)\.
- C\. C\. Liu, I\. Gurevych, and A\. Korhonen \(2025\)Culturally aware and adapted NLP: A taxonomy and a survey of the state of the art\.Trans\. Assoc\. Comput\. Linguistics13,pp\. 652–689\.External Links:[Link](https://doi.org/10.1162/tacl%5C_a%5C_00760),[Document](https://dx.doi.org/10.1162/TACL%5FA%5F00760)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p1.1)\.
- C\. Liu, F\. Koto, T\. Baldwin, and I\. Gurevych \(2024\)Are multilingual llms culturally\-diverse reasoners? an investigation into multicultural proverbs and sayings\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), NAACL 2024, Mexico City, Mexico, June 16\-21, 2024,K\. Duh, H\. Gómez\-Adorno, and S\. Bethard \(Eds\.\),pp\. 2016–2039\.External Links:[Link](https://doi.org/10.18653/v1/2024.naacl-long.112),[Document](https://dx.doi.org/10.18653/V1/2024.NAACL-LONG.112)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p3.1)\.
- M\. Mitchell, G\. Attanasio, I\. Baldini, M\. Clinciu, J\. Clive, P\. Delobelle, M\. Dey, S\. Hamilton, T\. Dill, J\. Doughman, R\. Dutt, A\. Ghosh, J\. Z\. Forde, C\. Holtermann, L\. Kaffee, T\. Laud, A\. Lauscher, R\. L\. Lopez\-Davila, M\. Masoud, N\. Nangia, A\. Ovalle, G\. Pistilli, D\. Radev, B\. Savoldi, V\. Raheja, J\. Qin, E\. Ploeger, A\. Subramonian, K\. D\. Dhole, K\. Sun, A\. Djanibekov, J\. Mansurov, K\. Yin, E\. V\. Cueva, S\. Mukherjee, J\. Huang, X\. Shen, J\. Gala, H\. Al\-Ali, T\. Djanibekov, N\. Mukhituly, S\. Nie, S\. Sharma, K\. Stanczak, E\. Szczechla, T\. T\. Torrent, D\. Tunuguntla, M\. Viridiano, O\. V\. D\. Wal, A\. Yakefu, A\. Névéol, M\. Zhang, S\. Zink, and Z\. Talat \(2025\)SHADES: towards a multilingual assessment of stereotypes in large language models\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 \- Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 \- May 4, 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),pp\. 11995–12041\.External Links:[Link](https://doi.org/10.18653/v1/2025.naacl-long.600),[Document](https://dx.doi.org/10.18653/V1/2025.NAACL-LONG.600)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p2.1)\.
- J\. Myung, N\. Lee, Y\. Zhou, J\. Jin, R\. A\. Putri, D\. Antypas, H\. Borkakoty, E\. Kim, C\. Pérez\-Almendros, A\. A\. Ayele, V\. Gutiérrez\-Basulto, Y\. Ibáñez\-García, H\. Lee, S\. H\. Muhammad, K\. Park, A\. S\. Rzayev, N\. White, S\. M\. Yimam, M\. T\. Pilehvar, N\. Ousidhoum, J\. Camacho\-Collados, and A\. Oh \(2024\)BLEnD: A benchmark for llms on everyday knowledge in diverse cultures and languages\.CoRRabs/2406\.09948\.External Links:[Link](https://doi.org/10.48550/arXiv.2406.09948),[Document](https://dx.doi.org/10.48550/ARXIV.2406.09948),2406\.09948Cited by:[§1](https://arxiv.org/html/2604.19262#S1.p1.1),[§5](https://arxiv.org/html/2604.19262#S5.p2.1)\.
- T\. Naous, M\. J\. Ryan, A\. Ritter, and W\. Xu \(2024\)Having beer after prayer? measuring cultural bias in large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),pp\. 16366–16393\.External Links:[Link](https://doi.org/10.18653/v1/2024.acl-long.862),[Document](https://dx.doi.org/10.18653/V1/2024.ACL-LONG.862)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p2.1)\.
- S\. Pawar, J\. Park, J\. Jin, A\. Arora, J\. Myung, S\. Yadav, F\. G\. Haznitrama, I\. Song, A\. Oh, and I\. Augenstein \(2024\)Survey of cultural awareness in language models: text and beyond\.CoRRabs/2411\.00860\.External Links:[Link](https://doi.org/10.48550/arXiv.2411.00860),[Document](https://dx.doi.org/10.48550/ARXIV.2411.00860),2411\.00860Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p1.1)\.
- L\. Phan, A\. Gatti, Z\. Han, N\. Li, J\. Hu, H\. Zhang, C\. B\. C\. Zhang, M\. Shaaban, J\. Ling, S\. Shi,et al\.\(2025\)Humanity’s last exam\.arXiv preprint arXiv:2501\.14249\.Cited by:[§2\.3\.1](https://arxiv.org/html/2604.19262#S2.SS3.SSS1.p1.1)\.
- H\. Qiu, A\. R\. Fabbri, D\. Agarwal, K\. Huang, S\. Tan, N\. Peng, and C\. Wu \(2025\)Evaluating cultural and social awareness of LLM web agents\.InFindings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29 \- May 4, 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),pp\. 3978–4005\.External Links:[Link](https://doi.org/10.18653/v1/2025.findings-naacl.222),[Document](https://dx.doi.org/10.18653/V1/2025.FINDINGS-NAACL.222)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p3.1)\.
- A\. Rao, A\. Yerukola, V\. Shah, K\. Reinecke, and M\. Sap \(2025\)NormAd: A framework for measuring the cultural adaptability of large language models\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 \- Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 \- May 4, 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),pp\. 2373–2403\.External Links:[Link](https://doi.org/10.18653/v1/2025.naacl-long.120),[Document](https://dx.doi.org/10.18653/V1/2025.NAACL-LONG.120)Cited by:[§1](https://arxiv.org/html/2604.19262#S1.p1.1),[§5](https://arxiv.org/html/2604.19262#S5.p2.1),[§5](https://arxiv.org/html/2604.19262#S5.p3.1)\.
- A\. Romanou, N\. Foroutan, A\. Sotnikova, Z\. Chen, S\. H\. Nelaturu, S\. Singh, R\. Maheshwary, M\. Altomare, M\. A\. Haggag, S\. A, A\. Amayuelas, A\. H\. Amirudin, V\. Aryabumi, D\. Boiko, M\. Chang, J\. Chim, G\. Cohen, A\. K\. Dalmia, A\. Diress, S\. Duwal, D\. Dzenhaliou, D\. F\. E\. Florez, F\. Farestam, J\. M\. Imperial, S\. B\. Islam, P\. Isotalo, M\. Jabbarishiviari, B\. F\. Karlsson, E\. Khalilov, C\. Klamm, F\. Koto, D\. Krzeminski, G\. A\. de Melo, S\. Montariol, Y\. Nan, J\. Niklaus, J\. Novikova, J\. S\. O\. Ceron, D\. Paul, E\. Ploeger, J\. Purbey, S\. Rajwal, S\. S\. Ravi, S\. Rydell, R\. Santhosh, D\. Sharma, M\. P\. Skenduli, A\. S\. Moakhar, B\. S\. Moakhar, R\. Tamir, A\. K\. Tarun, A\. T\. Wasi, T\. O\. Weerasinghe, S\. Yilmaz, M\. Zhang, I\. Schlag, M\. Fadaee, S\. Hooker, and A\. Bosselut \(2024\)INCLUDE: evaluating multilingual language understanding with regional knowledge\.CoRRabs/2411\.19799\.External Links:[Link](https://doi.org/10.48550/arXiv.2411.19799),[Document](https://dx.doi.org/10.48550/ARXIV.2411.19799),2411\.19799Cited by:[3rd item](https://arxiv.org/html/2604.19262#A3.I1.i3.p1.1),[§1](https://arxiv.org/html/2604.19262#S1.p1.1),[§2\.1](https://arxiv.org/html/2604.19262#S2.SS1.p1.1)\.
- W\. Shi, R\. Li, Y\. Zhang, C\. Ziems, S\. Yu, R\. Horesh, R\. de Paula, and D\. Yang \(2024\)CultureBank: an online community\-driven knowledge base towards culturally aware language technologies\.InFindings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12\-16, 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),pp\. 4996–5025\.External Links:[Link](https://doi.org/10.18653/v1/2024.findings-emnlp.288),[Document](https://dx.doi.org/10.18653/V1/2024.FINDINGS-EMNLP.288)Cited by:[§1](https://arxiv.org/html/2604.19262#S1.p1.1)\.
- S\. Singh, A\. Romanou, C\. Fourrier, D\. I\. Adelani, J\. G\. Ngui, D\. Vila\-Suero, P\. Limkonchotiwat, K\. Marchisio, W\. Q\. Leong, Y\. Susanto, R\. Ng, S\. Longpre, S\. Ruder, W\. Ko, A\. Bosselut, A\. Oh, A\. F\. T\. Martins, L\. Choshen, D\. Ippolito, E\. Ferrante, M\. Fadaee, B\. Ermis, and S\. Hooker \(2025\)Global MMLU: understanding and addressing cultural and linguistic biases in multilingual evaluation\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),pp\. 18761–18799\.External Links:[Link](https://aclanthology.org/2025.acl-long.919/)Cited by:[5th item](https://arxiv.org/html/2604.19262#A3.I1.i5.p1.1),[§1](https://arxiv.org/html/2604.19262#S1.p1.1)\.
- B\. Wang, Z\. Liu, X\. Huang, F\. Jiao, Y\. Ding, A\. Aw, and N\. Chen \(2024a\)SeaEval for multilingual foundation models: from cross\-lingual alignment to cultural reasoning\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), NAACL 2024, Mexico City, Mexico, June 16\-21, 2024,K\. Duh, H\. Gómez\-Adorno, and S\. Bethard \(Eds\.\),pp\. 370–390\.External Links:[Link](https://doi.org/10.18653/v1/2024.naacl-long.22),[Document](https://dx.doi.org/10.18653/V1/2024.NAACL-LONG.22)Cited by:[2nd item](https://arxiv.org/html/2604.19262#A3.I1.i2.p1.1),[§5](https://arxiv.org/html/2604.19262#S5.p2.1)\.
- Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, T\. Li, M\. Ku, K\. Wang, A\. Zhuang, R\. Fan, X\. Yue, and W\. Chen \(2024b\)MMLU\-pro: A more robust and challenging multi\-task language understanding benchmark\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/ad236edc564f3e3156e1b2feafb99a24-Abstract-Datasets%5C_and%5C_Benchmarks%5C_Track.html)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p1.1)\.
- M\. Wu, W\. Wang, S\. Liu, H\. Yin, X\. Wang, Y\. Zhao, C\. Lyu, L\. Wang, W\. Luo, and K\. Zhang \(2025\)The bitter lesson learned from 2,000\+ multilingual benchmarks\.CoRRabs/2504\.15521\.External Links:[Link](https://doi.org/10.48550/arXiv.2504.15521),[Document](https://dx.doi.org/10.48550/ARXIV.2504.15521),2504\.15521Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p1.1)\.
- D\. Yin, H\. Bansal, M\. Monajatipoor, L\. H\. Li, and K\. Chang \(2022\)GeoMLAMA: geo\-diverse commonsense probing on multilingual pre\-trained language models\.CoRRabs/2205\.12247\.External Links:[Link](https://doi.org/10.48550/arXiv.2205.12247),[Document](https://dx.doi.org/10.48550/arXiv.2205.12247),2205\.12247Cited by:[1st item](https://arxiv.org/html/2604.19262#A3.I1.i1.p1.1),[§2\.1](https://arxiv.org/html/2604.19262#S2.SS1.p1.1),[§4\.4](https://arxiv.org/html/2604.19262#S4.SS4.p1.1),[§5](https://arxiv.org/html/2604.19262#S5.p2.1)\.
- D\. Yin, H\. Qiu, K\. Huang, K\. Chang, and N\. Peng \(2024\)SafeWorld: geo\-diverse safety alignment\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/e8aad0aaa1309659a7d7e4c21202d9d0-Abstract-Conference.html)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p3.1)\.
- L\. Zhou, T\. Karidi, W\. Liu, N\. Garneau, Y\. Cao, W\. Chen, H\. Li, and D\. Hershcovich \(2025\)Does mapo tofu contain coffee? probing llms for food\-related cultural knowledge\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 \- Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 \- May 4, 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),pp\. 9840–9867\.External Links:[Link](https://doi.org/10.18653/v1/2025.naacl-long.496),[Document](https://dx.doi.org/10.18653/V1/2025.NAACL-LONG.496)Cited by:[§5](https://arxiv.org/html/2604.19262#S5.p2.1)\.

## Appendix A Annotation

### A.1 Annotators

Table 3: Annotator profiles: nationality, target language, and background with the target language.

We relied on 15 volunteer annotators from universities and industrial research labs. All volunteers are native speakers of the target language, or near-native with more than five consecutive years of residence and study in the language community. Table [3](https://arxiv.org/html/2604.19262#A1.T3) gives an overview of their backgrounds.

### A.2 Guideline

As illustrated in Fig. [3](https://arxiv.org/html/2604.19262#S2.F3), human annotators are assigned four distinct tasks, which are detailed below.

#### A.2.1 Task A: Sample Creation (Personal Experience)

1. Read §[2.2.1](https://arxiv.org/html/2604.19262#S2.SS2.SSS1) to fully understand the sample requirements.
2. Browse the full topic list—including descriptions and seed examples (§[2.1](https://arxiv.org/html/2604.19262#S2.SS1))—to select a topic that interests you.
3. Craft an original sample based on your personal experience whenever possible, using seed items and local forums as inspiration.

#### A.2.2 Task B: Sample Creation (Cross-lingual Inspiration)

1. Read §[2.2.1](https://arxiv.org/html/2604.19262#S2.SS2.SSS1) to fully understand the sample requirements.
2. Read the English translations of examples originally created in other languages.
3. Whenever possible, write a culturally plausible example in your native language that is similar to the provided ones.

#### A.2.3 Task C: Difficulty Elevation

1. Review §[2.2.1](https://arxiv.org/html/2604.19262#S2.SS2.SSS1) to fully understand the sample requirements and §[2.3.2](https://arxiv.org/html/2604.19262#S2.SS3.SSS2) to learn strategies for increasing difficulty.
2. Retrieve an easy sample for your language and region.
3. If possible, enhance its difficulty using the elevation techniques described in §[2.3.2](https://arxiv.org/html/2604.19262#S2.SS3.SSS2).

#### A.2.4 Task D: Quality Control

1. Verify the sample against four criteria: Region/Topic Correctness, Requirement Adherence, Translation Quality, and Sensitive or Offensive Content, as outlined in §[2.4.3](https://arxiv.org/html/2604.19262#S2.SS4.SSS3).
2. Accept the sample if it fully satisfies all four criteria without any issues.
3. If issues are identified, revise the sample to ensure it meets all requirements.
4. If the sample cannot be revised to meet the criteria, mark it as reject (a minimal sketch of how these outcomes can be recorded follows this list).
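For illustration only, the accept/revise/reject decision above can be captured as a small per-sample record. The sketch below uses our own naming assumptions (the paper does not prescribe any data structure); the four boolean fields mirror the criteria in step 1.

```python
# Minimal sketch of a Task D review record; field and label names are
# illustrative, not the paper's actual annotation schema.
from dataclasses import dataclass, field
from enum import Enum


class Decision(str, Enum):
    ACCEPT = "accept"   # satisfies all four criteria without any issue
    REVISE = "revise"   # issues found, but the reviewer can fix them
    REJECT = "reject"   # cannot be revised to meet the criteria


@dataclass
class Review:
    sample_id: str
    region_topic_correct: bool
    requirements_met: bool
    translation_ok: bool
    no_sensitive_content: bool
    decision: Decision = field(init=False)

    def __post_init__(self) -> None:
        # Accept only when every criterion passes; otherwise the reviewer
        # first tries to revise, escalating to REJECT manually if unfixable.
        if all([self.region_topic_correct, self.requirements_met,
                self.translation_ok, self.no_sensitive_content]):
            self.decision = Decision.ACCEPT
        else:
            self.decision = Decision.REVISE
```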

## Appendix B Topics

Tab. [4](https://arxiv.org/html/2604.19262#A2.T4) presents the full topic list, each entry paired with a short description and three illustrative culture-specific scenarios; the complete set of seed examples will be released publicly.

| Topic | Description | Example 1 | Example 2 | Example 3 |
| --- | --- | --- | --- | --- |
| Belief | Systems of conviction that shape values, rituals, institutions, life-cycle events, and views on existence, covering religious faith, spiritual practice, secular ethics, and cultural traditions (e.g., funerary customs and ideas of an afterlife). | Typical length and order of a wedding ceremony | Dietary restrictions during major religious holidays | Whether to pull the lever in the classic trolley-problem dilemma |
| Commerce | Buying, selling, marketing, and payment of goods and services, from daily necessities to luxury fashion, across bricks-and-mortar shops, e-commerce sites, and mobile wallets. | Typical opening hours for supermarkets | Return policy for online purchases | Legal limits on alcohol sales in retail stores |
| Education | Formal and informal learning, teaching, research, and skill-building for all ages, settings, and disciplines. | Courses normally taken in middle school | National university-entrance-exam format | Grading scale used in secondary schools |
| Entertainment | Media, arts, sports, games, performances, hobbies, and events created for leisure and enjoyment. | Popular sport clubs | National mascots or iconic cartoon characters | Gambling age and casino legality |
| Finance | Earning, saving, budgeting, investing, insuring, transferring, and distributing wealth during life and after death. | Color that signals a stock-price rise or fall on trading screens | Common payment methods in everyday shopping | Typical tax-filing deadline for individuals |
| Food | Agriculture, sourcing, processing, cooking, nutrition, beverages, and dining culture from farm to table. | Typical breakfast foods | Is tipping expected in restaurants? | Common allergens that must be listed on packaged food |
| Government | Public policy, legislation, courts, law enforcement, defense, emergency response, and civic administration. | Highway speed limits | Emergency number to call when lost in the mountains | Length of mandatory military or civil service |
| Habitat | Homes, buildings, infrastructure, utilities, urban planning, ecosystems, weather patterns, and sustainability practices. | Typical home-heating system | Floor-numbering convention in multi-story buildings | Recycling rules for household waste |
| Health | Physical, mental, and emotional well-being: prevention, treatment, fitness, wellness, palliative, and end-of-life care. | Standard childhood-vaccination schedule | Prescription vs. over-the-counter drug availability | Legal age of consent for medical decisions |
| Heritage | Past events, living traditions, festivals, monuments, and other cultural inheritances, and their study, preservation, and commemoration. | Date and rituals of New-Year celebrations | Historic event marked by a public holiday | Customs from a particular historical period |
| Language | Official and minority languages, scripts, dialects, idioms, emotional nuance, politeness levels, sign language, literacy, and translation norms. | Order of family and given names on official documents | Appropriate greetings and honorifics in business | Meaning and proper use of a common proverb |
| Pets | Care, health, training, companionship, and welfare of domesticated animals. | Rules for bringing pets on public transport | Mandatory rabies vaccination for dogs | Cultural status of certain animals |
| Science | Systematic inquiry into the natural world and its applications: research, engineering, technology, and innovation. | Unit used to state distance between two cities | Standard format for writing dates | Whether smartphones support dual-SIM use |
| Social | Family, friendships, romance, community networks, demographics, and social issues. | Table etiquette at family gatherings | Meaning of two women holding hands in public | Typical blind-dating process |
| Travel | Planning, transport, logistics, accommodation, tourism, and movement of people or goods. | Information needed before booking a city trip | Visa rules for a 90-day tourist stay | Cost of popular tourist attractions |
| Work | Careers, labor markets, workplaces, productivity tools, and professional development. | Statutory length of paid annual leave | Legal steps for ending an employment contract | Region-specific unique occupations |

Table 4: Cultural topics with concise descriptions and illustrative examples.

## Appendix C Adapting Existing Datasets

##### Data Sources

We repurpose six public benchmarks:

- GEOMLAMA (Yin et al., [2022](https://arxiv.org/html/2604.19262#bib.bib888)): QA pairs in five language–region pairs.
- SeaEval (Wang et al., [2024a](https://arxiv.org/html/2604.19262#bib.bib1378)): multiple-choice questions in four language–region pairs.
- INCLUDE (Romanou et al., [2024](https://arxiv.org/html/2604.19262#bib.bib1362)): 44 languages. We retain items whose `regional_feature` is `region implicit`, `region explicit`, or `culture` (see the filtering sketch after this list).
- CulturalBench (Chiu et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1387)): 45 countries/regions, English only. Each item is translated into the dominant local language (Tab. [6](https://arxiv.org/html/2604.19262#A3.T6)).
- Global-MMLU (Singh et al., [2025](https://arxiv.org/html/2604.19262#bib.bib1390)): 42 languages. We keep only culture-sensitive questions.
- MultiNRC (Fabbri et al., [2025b](https://arxiv.org/html/2604.19262#bib.bib1397)): QA pairs in three language–region pairs.
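As referenced in the INCLUDE bullet above, items are kept only when their `regional_feature` is `region implicit`, `region explicit`, or `culture`. The snippet below is a minimal sketch of such a filter, assuming the data is loaded through Hugging Face `datasets`; the dataset identifier and split are placeholders, and only the kept feature values come from the text.

```python
# Illustrative filter for the INCLUDE subset described above.
# The dataset ID and split are placeholders; only the kept regional_feature
# values follow the description in the bullet list.
from datasets import load_dataset

KEPT_FEATURES = {"region implicit", "region explicit", "culture"}

include = load_dataset("path/to/include", split="test")  # placeholder dataset ID
culture_subset = include.filter(
    lambda ex: ex.get("regional_feature") in KEPT_FEATURES
)
print(f"Kept {len(culture_subset)} of {len(include)} items")
```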

##### Translation

Whenever an item is not already in the target language, we translate it with `gpt-4o-2024-11-20` using the prompt shown in Fig. [6](https://arxiv.org/html/2604.19262#A4.F6) (§[D](https://arxiv.org/html/2604.19262#A4)).
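A minimal sketch of this translation call, assuming an OpenAI-compatible Python client; the helper function and temperature setting are our own choices, while the model name and prompt wording follow the text above and Fig. 6.

```python
# Sketch of the translation step; the wrapper function is illustrative.
from openai import OpenAI

client = OpenAI()


def translate(source: str, target_language: str) -> str:
    # Prompt wording follows Fig. 6.
    prompt = (
        f"Translate the following text to {target_language}:\n"
        f"{source}\n"
        "Do not output anything else."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()


# Example: translate("What time do supermarkets usually open?", "Japanese")
```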

##### Grounding

The translated (or original) text is then converted into our sample format with the prompt in Fig. [7](https://arxiv.org/html/2604.19262#A4.F7), ensuring that each item preserves the original cultural knowledge while satisfying CulturALL's annotation schema.
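Because every grounded item must follow the [Scenario]/[Question]/[Answer]/[Explanation] output format used by the prompts in Figs. 7 and 8, a small parser is useful when post-processing model responses. The sketch below is illustrative; the function name and error handling are our own assumptions.

```python
# Sketch of parsing a grounded-sample response into the four CulturALL fields.
import re

FIELD_PATTERN = re.compile(
    r"\[Scenario\](?P<scenario>.*?)"
    r"\[Question\](?P<question>.*?)"
    r"\[Answer\](?P<answer>.*?)"
    r"\[Explanation\](?P<explanation>.*)",
    re.DOTALL,
)


def parse_grounded_sample(raw: str) -> dict:
    """Split a model response into scenario/question/answer/explanation fields."""
    match = FIELD_PATTERN.search(raw)
    if match is None:
        raise ValueError("response does not follow the expected output format")
    return {key: value.strip() for key, value in match.groupdict().items()}
```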

Table 5: ISO 639-1 language codes (Lang) with language names and their representative ISO 3166-1 alpha-2 country/region codes (Reg) and region names, sorted by Lang.

Table 6: ISO 3166-1 alpha-2 country/region codes (Reg) and region names with their representative ISO 639-1 language codes (Lang) and language names, sorted alphabetically by Reg.

## Appendix D Prompts

This section lists the prompts employed in our study, including translation (Fig. [6](https://arxiv.org/html/2604.19262#A4.F6)), grounded sample generation based on existing datasets (Fig. [7](https://arxiv.org/html/2604.19262#A4.F7)), grounded sample generation based on online resources (Fig. [8](https://arxiv.org/html/2604.19262#A4.F8)), region labeling (Fig. [9](https://arxiv.org/html/2604.19262#A4.F9)), topic labeling (Fig. [10](https://arxiv.org/html/2604.19262#A4.F10)), CulturALL evaluation (Fig. [11](https://arxiv.org/html/2604.19262#A4.F11)), and prediction judgment (Fig. [12](https://arxiv.org/html/2604.19262#A4.F12)).

**Translation**

Translate the following text to {target_language}:
{source}
Do not output anything else.

Figure 6: Prompt used for the translation task, where {target_language} is the target language name and {source} is the input source text for translation.

**Grounded Sample Generation (Existing Datasets)**

You are given a sample describing some cultural knowledge:
{source_excerpt}
Generate a grounded item (scenario + question + answer + explanation) that assesses the consultant's grasp of the cultural knowledge.
Ensure the generated item preserves the same cultural knowledge as the example. Do not modify the choices or the correct answer.
The generated sample should be in the same language as the given sample.
The output format should be:
[Scenario] XXX
[Question] XXX
[Answer] XXX
[Explanation] XXX
Do not output any other things.

Figure 7: Prompt used for grounded-sample creation. Placeholders {source_topic}, {source_excerpt}, and {topic_list} are replaced with the corresponding inputs.

**Grounded Sample Generation (Online Resources)**

You are given a sample describing some cultural knowledge:
{source_excerpt}
Generate a grounded item (scenario + question + answer + explanation) that assess the consultant's grasp of the cultural knowledge.
Ensure the generated sample preserves the same cultural knowledge as the provided example.
The answer should be objective and as brief as possible. If an objective free-form answer is impractical, convert the question to a four-option multiple-choice format (A–D) and return only the chosen letter.
The generated sample should be in the same language as the given sample.
The output format should be:
[Scenario] XXX
[Question] XXX
[Answer] XXX
[Explanation] XXX
Do not output any other things.

Figure 8: Prompt used for grounded-sample creation from online resources. The placeholder {source_excerpt} is replaced with the source text.

**Region Labeling**

Based on the following text and its written language, retrieve the corresponding ISO 3166-1 alpha-2 country code (only one lowercase two-letter country code).
Scenario: {scenario}
Question: {question}
Do not output any other things.

Figure 9: Prompt used for region classification, where {scenario} and {question} are the provided fields of the given sample.

**Topic Labeling**

Based on the following text, select the most appropriate topic from the topic list {topic_list}.
Scenario: {scenario}
Question: {question}
Do not output any other things.

Figure 10: Prompt used for topic classification. {scenario} and {question} are the provided fields of the given sample. {topic_list} is the predefined list.

**CulturALL Evaluation**

Using the scenario as context, answer the question in as few words as possible.
Scenario: {scenario}
Question: {question}
Answer:

Figure 11: Prompt used for CulturALL evaluation, where {scenario} and {question} are the provided fields of the given sample.

**Judgement**

Please evaluate whether the model prediction is correct based on the given scenario, question, answer, and explanation.
Scenario: {scenario}
Question: {question}
Answer: {answer}
Explanation: {explanation}
Model prediction: {prediction}
Only output 1 or 0 with no additional text. 1 means correct, 0 means incorrect.

Figure 12: Prompt used to evaluate the model's prediction based on the given scenario, question, answer, and explanation. The prompt ensures binary evaluation for correctness.
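To make the evaluation pipeline concrete, the sketch below chains the evaluation prompt (Fig. 11) with the binary judgment prompt (Fig. 12) using an OpenAI-compatible client. The function names, client setup, and accuracy helper are our own illustrative choices, not the paper's released code.

```python
# Minimal sketch of the evaluate-then-judge loop; prompt wording follows
# Figs. 11 and 12, everything else is illustrative.
from openai import OpenAI

client = OpenAI()


def answer(sample: dict, model: str) -> str:
    prompt = (
        "Using the scenario as context, answer the question in as few words as possible.\n"
        f"Scenario: {sample['scenario']}\n"
        f"Question: {sample['question']}\n"
        "Answer:"
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()


def judge(sample: dict, prediction: str, judge_model: str) -> int:
    prompt = (
        "Please evaluate whether the model prediction is correct based on the "
        "given scenario, question, answer, and explanation.\n"
        f"Scenario: {sample['scenario']}\n"
        f"Question: {sample['question']}\n"
        f"Answer: {sample['answer']}\n"
        f"Explanation: {sample['explanation']}\n"
        f"Model prediction: {prediction}\n"
        "Only output 1 or 0 with no additional text. 1 means correct, 0 means incorrect."
    )
    resp = client.chat.completions.create(
        model=judge_model, messages=[{"role": "user", "content": prompt}]
    )
    return int(resp.choices[0].message.content.strip()[0])


def accuracy(samples: list[dict], model: str, judge_model: str) -> float:
    scores = [judge(s, answer(s, model), judge_model) for s in samples]
    return sum(scores) / len(scores)
```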
## Appendix E Difficulty Distribution Across Languages

![Refer to caption](https://arxiv.org/html/2604.19262v1/x7.png)

Figure 13: Language distribution of examples based on the number of settings that answered them correctly. X-axis: number of settings that answered correctly; Y-axis: count of examples.
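For reference, the per-example counts behind Fig. 13 can be computed by summing binary correctness over all evaluated settings and bucketing the totals by language. The sketch below assumes a nested results structure (setting, then example id, mapping to 0/1) and a mapping from example id to language; both structures are illustrative, not the paper's actual data layout.

```python
# Illustrative computation of the Fig. 13 histogram, under assumed data layouts.
from collections import Counter, defaultdict


def difficulty_distribution(results: dict[str, dict[str, int]],
                            example_lang: dict[str, str]) -> dict[str, Counter]:
    """Return, per language, a histogram over 'number of settings answered correctly'."""
    # Sum 0/1 correctness per example across all evaluated settings.
    correct_counts: Counter = Counter()
    for per_example in results.values():
        for example_id, score in per_example.items():
            correct_counts[example_id] += score

    # Bucket examples by language and by how many settings got them right.
    histogram: dict[str, Counter] = defaultdict(Counter)
    for example_id, n_correct in correct_counts.items():
        histogram[example_lang[example_id]][n_correct] += 1
    return histogram
```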

Similar Articles

The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring

arXiv cs.CL

A new cross-domain benchmark (Metacognitive Monitoring Battery) with 524 items evaluates LLM self-monitoring capabilities across six cognitive domains using human psychometric methodology. Applied to 20 frontier LLMs, it reveals three distinct metacognitive profiles and shows that accuracy rank and metacognitive sensitivity rank are largely inverted.