Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
Summary
This paper introduces the first parallel Arabic cultural QA benchmark spanning Modern Standard Arabic and multiple dialects, converting multiple-choice questions to open-ended formats and evaluating LLMs with chain-of-thought reasoning to address gaps in culturally grounded and dialect-specific knowledge.
# Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants

Source: https://arxiv.org/html/2510.24328

###### Abstract

Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains limited across languages and their varieties. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset into one in which QA pairs are aligned in parallel across language varieties, making it, to our knowledge, the first of its kind. A large portion of the resulting test set is further validated through targeted human annotation and native-speaker post-editing. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, showing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed results on n-gram-based metrics.

**Keywords:** Cultural Knowledge, Everyday Knowledge, Open-Ended Question, Chain-of-Thought

Hunzalah Hassan Bhatti, Firoj Alam
Qatar Computing Research Institute, Qatar

## 1. Introduction

Cultural information underpins human identity, behavior, and social interaction, encompassing shared beliefs, values, customs, languages, traditions, and collective practices.
In today's tightly coupled information-communication ecosystem, hundreds of millions of users interact with LLMs for everyday queries, often asking about local norms, holidays, cuisine, or etiquette, where culturally grounded interpretations are essential (Pawar et al. 2025; Hasan et al. 2025). Yet despite rapid progress in multilingual understanding and reasoning, LLM performance remains uneven across languages, dialects, and culturally specific domains (Wei et al. 2022; Muennighoff et al. 2023). The issue is especially salient for Arabic, where MSA coexists with numerous regional dialects that differ in phonology, morphology, lexicon, and usage (Alwajih et al. 2025a; Sadallah et al. 2025). Beyond modeling challenges, widely used MCQ evaluations can mask deficiencies in reasoning by enabling superficial answer-selection strategies such as label bias or option-guessing, complicating fair cross-lingual and cross-format comparison (Raman et al. 2025; Li et al. 2024b).

**Figure 1:** Example QA instances shown in two formats: multiple-choice question (MCQ) and open-ended question (OEQ). Flags in parentheses indicate representative countries where each dialect is widely spoken.

A central open problem is how to **measure** and **improve** an LLM's ability to understand and generate responses to such culturally embedded queries, especially in multilingual settings with substantial dialectal variation. Moreover, MCQs have long been the dominant format for evaluating QA performance in LLMs due to their simplicity, automatic scoring, and structured answer space (Myrzakhan et al. 2024). However, models can sometimes exploit the test format rather than genuinely understanding the question. This leads to a form of selection bias, such as consistently favoring certain options (e.g., always choosing "A") regardless of content. To address these challenges, parallel efforts have emerged to develop culturally aligned language models (Wang et al.
2023) and to enable their efficient deployment in low-compute environments (Hu et al. 2022). At the same time, new culturally relevant datasets, targeted benchmarks, and evaluation protocols are beginning to operationalize the measurement of everyday cultural knowledge (Myung et al. 2024; Li et al. 2024a; Mousi et al. 2025; Alam et al. 2025a,b). Collectively, these trends demonstrate the need for new resources, evaluations, and models that are grounded in underrepresented dialectal varieties and culturally contextualized content.

To shed light on these challenges, we introduce a comprehensive method for developing a new resource for underrepresented language varieties. Starting from an existing MSA MCQ dataset (Alwajih et al. 2025b), we perform the following steps: (i) translate the questions into several Arabic dialects and English, followed by manual post-editing; (ii) convert the MCQs into OEQs that require free-form answers; (iii) evaluate a range of zero-shot and fine-tuned LLMs on the resulting benchmark; and (iv) create CoT annotations and fine-tune models on them to encourage explicit reasoning for OEQs. An example MCQ and its OEQ counterpart with CoT are shown in Figure 1. Our approach allows us to isolate and study the impact of question format, language variety, and reasoning supervision on model performance. We find that OEQ settings present greater challenges than MCQ settings, especially in dialectal Arabic.

Our contributions are as follows:

- We construct a multilingual and multidialectal QA dataset, **ArabicCulturalQA**, by translating MSA MCQs into English and Arabic dialects. The dataset is publicly available for research use.
- We convert the dataset into OEQs in all language variants, enabling a more rigorous evaluation of model knowledge.
- A substantial portion of the test set is human annotated by native speakers: dialectal MCQs are post-edited, and the conversion from MSA MCQs to MSA OEQs is manually reviewed to ensure linguistic and semantic fidelity.
- We benchmark a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings.
- We generate chain-of-thought (CoT) annotations for OEQs and fine-tune models.

This work represents the first effort to unify dialectal Arabic QA, open-ended reasoning, and CoT fine-tuning in a single benchmark, offering new insights into LLM performance on culturally rich, linguistically diverse data.

## 2. Related Work

### 2.1. General Capabilities of LLMs

LLMs have shown strong generalization across a broad range of NLP tasks, including text generation, translation, summarization, and reasoning (Abdelali et al. 2024). At sufficient scale, LLMs exhibit emergent abilities, such as multi-step inference and commonsense reasoning (Bubeck et al. 2023; Wei et al. 2022). Prompting techniques like few-shot and chain-of-thought (CoT) significantly enhance performance on reasoning-heavy tasks (Kojima et al. 2022; Wei et al. 2022). However, most evaluations focus on English or high-resource languages. Performance often degrades on morphologically rich or low-resource languages such as Arabic, particularly in dialectal contexts (Mousi et al. 2025; Muennighoff et al. 2023).

### 2.2. Cultural and Everyday Knowledge

Recent research has highlighted the limitations of LLMs in capturing culturally grounded, everyday knowledge. Myung et al. (2024) introduced BLEnD, a multilingual benchmark comprising 52.6K QA pairs across 13 languages and 16 regions, designed to evaluate models' understanding of daily-life knowledge. Similarly, Hasan et al. (2025) developed MultiNativQA, featuring 64K QA pairs covering nine locations in seven languages.
Across these studies, results consistently show that LLMs underperform on questions about underrepresented cultures and often default to Western-centric norms. In the Arabic context, Sadallah et al. (2025) proposed ArabCulture, a benchmark of 3.5K MSA-based MCQs curated by native speakers from 13 Arab countries to assess culturally specific commonsense reasoning. Likewise, Alwajih et al. (2025a) introduced Palm, a dialect-rich dataset encompassing all 22 Arab countries.

### 2.3. MCQ to OEQ

Many evaluation benchmarks use MCQs because they allow straightforward automatic scoring: the model selects an option (A/B/C/D) that can be directly compared with the correct answer. However, recent studies show that this format may introduce artificial performance gains and mask a model's actual reasoning ability (Molfese et al. 2025; Chandak et al. 2025; Myrzakhan et al. 2024). For instance, LLMs often display a **selection bias**, favoring certain options (e.g., consistently choosing "A") due to training artifacts.

To mitigate these issues, several works propose converting MCQs into OEQs that require the model to generate answers without predefined choices (Myrzakhan et al. 2024). This forces reliance on internal knowledge and reasoning rather than elimination or guessing. Yet this conversion introduces new challenges: some MCQs become ambiguous once options are removed, and others may yield multiple valid answers unless carefully rephrased. Moreover, evaluating free-form responses is inherently harder, as correctness depends on comparing generated text with gold answers that may differ in wording. Prior work addresses this by using LLM-based evaluation pipelines (e.g., GPT-4) to judge open-ended answers against human references with high reliability (Myrzakhan et al. 2024). Overall, shifting from MCQ to open-ended formats holds promise for revealing deeper model understanding, but it demands careful question selection and robust evaluation protocols.

### 2.4. Chain-of-Thought (CoT) Reasoning

CoT prompting has emerged as a powerful technique for enhancing reasoning capabilities in LLMs. Instead of producing an answer directly, the model is encouraged to generate an explicit, step-by-step reasoning path before reaching a final conclusion (Wei et al. 2022). By articulating these intermediate steps, models can decompose complex problems into manageable components, leading to substantial gains in accuracy. Remarkably, even without task-specific training, simply adding "Let's think step by step" to the prompt can induce this behavior in sufficiently large models, a method known as **zero-shot CoT** (Qi et al. 2023). This simple prompting strategy has demonstrated significant improvements across a wide range of reasoning tasks, including mathematical problem solving and commonsense reasoning. Furthermore, Qi et al. (2023) introduced a **self-consistency** mechanism, in which the model generates multiple reasoning chains and selects the most frequent answer, further enhancing performance. While most existing studies emphasize inference-time CoT, recent research has explored **CoT fine-tuning** to transfer reasoning skills to smaller or multilingual models (Puerto et al. 2025). However, to the best of our knowledge, no prior work has applied CoT fine-tuning to Arabic open-ended QA datasets, particularly those covering dialectal varieties; doing so is a key contribution of our study.

## 3. Datasets

Our data, **ArabicCulturalQA**, is based on the **PalmX 2025 - General Culture Evaluation (PalmX-GC)** dataset, which assesses a model's understanding of Arab culture, including customs, history, geography, arts, cuisine, notable figures, and everyday life across the 22 Arab countries. All questions and answers are written in MSA and manually verified, providing a high-quality benchmark for culturally grounded QA (Alwajih et al. 2025b).
The dataset comprises 2,000 training, 500 development, and 2,000 test examples, all in MCQ format. We use PalmX-GC as the basis for creating dialectal MCQ and OEQ variants.

**Figure 2:** Pipeline for the dataset construction process.

### 3.1. Dialectal MCQ

To broaden cultural and linguistic coverage beyond MSA, we translate PalmX-GC into four Arabic dialects (Egyptian, Levantine, Gulf, and Maghrebi) and into English using GPT-4.1, followed by quality checking. We selected these dialects because they (i) cover the largest speaker populations and broadest geographic span in the Arab world, (ii) capture major points on the Arabic dialect continuum, and (iii) represent the main language of everyday communication and online discourse. Including English serves two purposes: it provides a shared reference baseline for cross-lingual comparison, helping disentangle language modeling from culture-specific knowledge, and it reflects real usage, where users often ask culturally grounded questions in English about Arabic contexts. This design allows us to probe (a) format sensitivity (MCQ→OEQ), (b) dialect sensitivity (MSA vs. regional varieties), and (c) cross-lingual transfer (Arabic↔English) within a single controlled benchmark.

We employed controlled prompting to translate each MSA MCQ into the four dialects and English. The prompts explicitly enforced semantic equivalence while allowing lexical and stylistic adaptation to dialectal norms. This approach ensured that the dialectal phrasing preserved the original question's intent without causing any semantic drift from its MSA counterpart.

### 3.2. MCQ to OEQ

We converted the MSA MCQs into OEQs using GPT-4.1. Each MCQ was transformed into a natural QA pair by rephrasing the original question and its correct option into a single, self-contained QA instance. The remaining distractors were used only to guide contextual understanding but were excluded from the final prompt.
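
The conversion step just described can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the class names `MCQItem` and `OEQItem`, the functions `build_conversion_prompt` and `to_oeq`, the prompt wording, and the sample item are all assumptions made for illustration, and the call to the converter model is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class MCQItem:
    """One multiple-choice item. Field names are illustrative, not the PalmX-GC schema."""
    question: str
    options: Dict[str, str]  # option label -> option text
    answer: str              # label of the correct option


@dataclass
class OEQItem:
    """Open-ended item: a self-contained question and its gold answer only."""
    question: str
    gold_answer: str


def build_conversion_prompt(item: MCQItem) -> str:
    """Build a hypothetical instruction asking a converter model to rewrite an MCQ.

    The distractors appear here only as context for the converter.
    """
    correct = item.options[item.answer]
    distractors = [text for label, text in item.options.items() if label != item.answer]
    return (
        "Rewrite the multiple-choice question below as one self-contained "
        "open-ended question and give its answer.\n"
        f"Question: {item.question}\n"
        f"Correct answer: {correct}\n"
        f"Other options (context only, do not mention them): {'; '.join(distractors)}\n"
    )


def to_oeq(item: MCQItem, rewritten_question: str) -> OEQItem:
    """Pair the rewritten question with the gold answer; distractors are dropped."""
    return OEQItem(question=rewritten_question, gold_answer=item.options[item.answer])


# Made-up example item, used only to exercise the sketch.
example = MCQItem(
    question="Which city is the capital of Egypt?",
    options={"A": "Alexandria", "B": "Cairo", "C": "Giza", "D": "Luxor"},
    answer="B",
)
prompt = build_conversion_prompt(example)
# In a real pipeline the rewritten question would come from the converter model.
oeq = to_oeq(example, "What is the capital city of Egypt?")
print(prompt)
print(oeq)
```

Keeping the converter prompt and the final `OEQItem` as separate objects makes the separation explicit: distractors can inform the rewrite, but they never appear in the item that is later shown to the evaluated model.
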
We filtered out QA items where conversion was structurally infeasible, such as questions dependent on visible alternatives, to avoid ill-posed or underspecified open-ended forms. This process ensured that the resulting OEQs were