UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

arXiv cs.CL 06/08/26, 04:00 AM Papers
urdu benchmark llm-evaluation multilingual natural-language-processing low-resource-language
Summary
UrduMMLU is a new benchmark of 26,431 multiple-choice questions across 26 subjects for evaluating LLMs on Urdu language understanding, sourced from native educational materials. Evaluation of 30 LLMs reveals Gemini-3.5-Flash performs best, while open-source models and region-specific subjects pose significant challenges.
arXiv:2606.07167v1 Announce Type: new Abstract: Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.
# A Massive Multitask Benchmark for Urdu Language Understanding
Source: [https://arxiv.org/html/2606.07167](https://arxiv.org/html/2606.07167)
Ahmer Tabassum∗¹, Sarfraz Ahmad∗¹, Hasan Iqbal∗¹, Owais Aijaz¹, Momina Ahsan¹, Preslav Nakov¹

¹MBZUAI

∗Equal contribution.

{ahmer.tabassum, sarfraz.ahmad, hasan.iqbal}@mbzuai.ac.ae

[Project](https://mbzuai-nlp.github.io/UrduMMLU/) · [UrduMMLU](https://huggingface.co/datasets/MBZUAI/UrduMMLU) · [Code](https://github.com/mbzuai-nlp/urdu-mmlu) · [Leaderboard](https://mbzuai-nlp.github.io/UrduMMLU/leaderboard.html)

###### Abstract

Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.

## 1 Introduction

Evaluating the knowledge and reasoning abilities of Large Language Models (LLMs) has become central to Natural Language Processing (NLP). Benchmarks such as MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2606.07167#bib.bib6)) and MMLU-Pro (Wang et al., [2024](https://arxiv.org/html/2606.07167#bib.bib7)) are widely used for this purpose, but they are in English and largely reflect English-language educational and cultural contexts. This limits their ability to test whether model competence transfers across language, script, and regional knowledge. As a result, these benchmarks provide only a partial view of model performance in multilingual and culturally diverse settings.

![Refer to caption](https://arxiv.org/html/2606.07167v1/x1.png)

Figure 1: The 16-stage UrduMMLU construction pipeline (left) and the resulting 26,431-MCQ benchmark broken down by 5 domains and 26 subdomains (right); wedge size is proportional to MCQ count.

The issue is especially important for Urdu, a language spoken by over 230 million people, with a long literary and educational tradition, but limited broad-coverage evaluation resources. Existing Urdu benchmarks focus mainly on reading comprehension, syntactic diagnostics, task-level NLP evaluation, or translated reasoning benchmarks (Kazi and Khoja, [2026](https://arxiv.org/html/2606.07167#bib.bib2); Kazi et al., [2025](https://arxiv.org/html/2606.07167#bib.bib3); Adeeba et al., [2025](https://arxiv.org/html/2606.07167#bib.bib4); Tahir et al., [2025](https://arxiv.org/html/2606.07167#bib.bib5); Shafique et al., [2026](https://arxiv.org/html/2606.07167#bib.bib1)). Multilingual benchmarks that include Urdu, such as MMLU-ProX (Xuan et al., [2025](https://arxiv.org/html/2606.07167#bib.bib11)), Global-MMLU (Singh et al., [2025](https://arxiv.org/html/2606.07167#bib.bib9)), and IndicMMLU-Pro (KJ et al., [2025](https://arxiv.org/html/2606.07167#bib.bib10)), also rely mainly on translated questions. As a result, they only partially capture knowledge grounded in Urdu-medium education, Urdu literature, local history, religious studies, and civic curricula.

Recent language-specific benchmarks such as ArabicMMLU (Koto et al., [2024](https://arxiv.org/html/2606.07167#bib.bib15)), CMMLU (Li et al., [2024](https://arxiv.org/html/2606.07167#bib.bib16)), IndoMMLU (Koto et al., [2023](https://arxiv.org/html/2606.07167#bib.bib17)), KMMLU (Son et al., [2025](https://arxiv.org/html/2606.07167#bib.bib18)), and KazMMLU (Togmanov et al., [2025](https://arxiv.org/html/2606.07167#bib.bib19)) highlight the importance of evaluation grounded in local educational material. Following this direction, we introduce UrduMMLU, the first broad-coverage, natively written MMLU-style benchmark for Urdu. UrduMMLU contains 26,431 MCQs across 26 subjects and five domains, collected from Urdu MCQ banks and public SSC/HSSC examination PDFs. It combines answer-labeled questions with exam-derived questions annotated through dual human annotation and strict consensus filtering, and it covers both standard academic subjects and Urdu- and region-specific content. Figure [1](https://arxiv.org/html/2606.07167#S1.F1) summarizes the resulting subject distribution.

We evaluate 30 open-source and closed-source LLMs on UrduMMLU under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs in 1-, 3-, and 5-shot settings. Gemini-3.5-Flash (Google DeepMind, [2026](https://arxiv.org/html/2606.07167#bib.bib40)) achieves the highest accuracy at 90.20% and 90.34%, while the strongest open-source model trails by 7.79 and 8.92 points.

Across models, performance remains substantially higher on STEM subjects than on Urdu\-centered Humanities, with many systems losing 25 to 40 points on Urdu literature, Urdu language, and Islamic studies\. These results show that strong English\-centered benchmark performance does not reliably transfer to Urdu educational and cultural knowledge\. They also highlight the need for benchmarks that better capture linguistic and cultural diversity beyond English\.

The main contributions of this work are:

- We introduce UrduMMLU, a natively written Urdu MMLU-style benchmark with 26,431 MCQs across 26 subjects and five domains, covering both standard academic subjects and Urdu- and region-specific knowledge.
- We produce human-annotated gold answers for the exam-derived portion of the benchmark using dual annotation and strict consensus filtering.
- We conduct 60 zero-shot evaluations across 30 open-source and closed-source LLMs under English and Urdu prompt settings, and 24 additional few-shot evaluations across four open-source LLMs.
- We release the dataset and evaluation code to support future work on Urdu-capable language models.

## 2 Related Work

##### Urdu evaluation resources:

Existing Urdu resources cover reading comprehension, cross-lingual question answering, syntax, and task-level NLP. UQuAD+ (Kazi and Khoja, [2026](https://arxiv.org/html/2606.07167#bib.bib2)) provides annotated Urdu reading comprehension, while Kazi et al. ([2025](https://arxiv.org/html/2606.07167#bib.bib3)) study Urdu-English QA with UQuAD1.0 (Kazi and Khoja, [2021](https://arxiv.org/html/2606.07167#bib.bib48)) and SQuAD2.0 (Rajpurkar et al., [2018](https://arxiv.org/html/2606.07167#bib.bib49)). UrBLiMP (Adeeba et al., [2025](https://arxiv.org/html/2606.07167#bib.bib4)) evaluates Urdu syntax via minimal pairs, and Tahir et al. ([2025](https://arxiv.org/html/2606.07167#bib.bib5)) benchmark models across Urdu NLP tasks. For reasoning, UrduBench (Shafique et al., [2026](https://arxiv.org/html/2606.07167#bib.bib1)) translates MGSM (Shi et al., [2023](https://arxiv.org/html/2606.07167#bib.bib44)), CommonsenseQA (Talmor et al., [2019](https://arxiv.org/html/2606.07167#bib.bib45)), OpenBookQA (Mihaylov et al., [2018](https://arxiv.org/html/2606.07167#bib.bib47)), and MATH-500 (Lightman et al., [2024](https://arxiv.org/html/2606.07167#bib.bib46)) into Urdu, and UrduFactCheck (Ahmad et al., [2025](https://arxiv.org/html/2606.07167#bib.bib52)) targets factual QA. These resources remain task-specific, diagnostic, or translation-derived. In contrast, UrduMMLU evaluates broad educational knowledge using questions originally written for Urdu-speaking educational settings.

##### Multilingual benchmarks:

MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2606.07167#bib.bib6)) and MMLU-Pro (Wang et al., [2024](https://arxiv.org/html/2606.07167#bib.bib7)) are widely used for evaluating general knowledge and reasoning. Several multilingual extensions adapt these benchmarks through translation. MMLU-ProX (Xuan et al., [2025](https://arxiv.org/html/2606.07167#bib.bib11)) extends MMLU-Pro to 29 languages using LLM-based translation and expert review, while Global-MMLU (Singh et al., [2025](https://arxiv.org/html/2606.07167#bib.bib9)) studies cultural and linguistic bias in multilingual evaluation.

IndicMMLU-Pro (KJ et al., [2025](https://arxiv.org/html/2606.07167#bib.bib10)) adapts MMLU-Pro to nine Indic languages, including Urdu. Other multilingual exam-based resources, such as EXAMS (Hardalov et al., [2020](https://arxiv.org/html/2606.07167#bib.bib13)), INCLUDE (Romanou et al., [2025](https://arxiv.org/html/2606.07167#bib.bib12)), and MILU (Verma et al., [2025](https://arxiv.org/html/2606.07167#bib.bib14)), collect examination questions across multiple languages and regions. However, Urdu still appears primarily in translated or cross-lingual settings rather than through a dedicated native benchmark, limiting fair knowledge assessment in cultural context.

##### Localized MMLU-style benchmarks:

Recent work increasingly builds MMLU-style benchmarks from local educational material instead of translating English benchmarks. ArabicMMLU (Koto et al., [2024](https://arxiv.org/html/2606.07167#bib.bib15)), CMMLU (Li et al., [2024](https://arxiv.org/html/2606.07167#bib.bib16)), IndoMMLU (Koto et al., [2023](https://arxiv.org/html/2606.07167#bib.bib17)), KMMLU (Son et al., [2025](https://arxiv.org/html/2606.07167#bib.bib18)), and KazMMLU (Togmanov et al., [2025](https://arxiv.org/html/2606.07167#bib.bib19)) show that language-specific curricula and regional cultural knowledge remain important for evaluating LLMs beyond English. UrduMMLU follows this direction for Urdu by combining regional SSC/HSSC examination material, native Urdu MCQ banks, human annotation for exam-derived questions, and broad coverage of both standard academic subjects and Urdu- and Pakistan-specific knowledge.

## 3 UrduMMLU

UrduMMLU is a broad-coverage benchmark for evaluating knowledge and reasoning in Urdu. Unlike translation-based multilingual benchmarks, UrduMMLU draws its questions directly from Urdu educational and examination material. The benchmark contains 26,431 MCQs across 26 subdomains and five domains, covering both standard academic subjects and Urdu- and region-specific content such as Urdu literature, Urdu language, Islamic studies, and Pakistan studies. Appendix [A.1](https://arxiv.org/html/2606.07167#A1.SS1) and Figure [7](https://arxiv.org/html/2606.07167#A1.F7) provide detailed benchmark statistics and subdomain distributions. We collect questions from Urdu MCQ banks and public SSC/HSSC examination PDFs, and produce gold answers for exam-derived questions through dual human annotation with strict consensus filtering. We design UrduMMLU around broad subject coverage, faithful representation of Urdu educational material, and reliable multiple-choice evaluation through clean text extraction, normalized metadata, and verified gold labels. Figure [1](https://arxiv.org/html/2606.07167#S1.F1) summarizes the overall construction pipeline.

### 3.1 Data Sources

We collect candidate questions from two source families. The first consists of public SSC and HSSC examination PDFs from Pakistan covering school- and high school-level subjects such as mathematics, physics, chemistry, biology, computer science, Urdu, Islamic studies, Pakistan studies, and economics. The second consists of native Urdu MCQ websites that publish answer-labeled questions for examination preparation. Together, these sources allow UrduMMLU to cover both globally shared academic subjects and region-specific educational content taught in Urdu-medium curricula. We treat all collected items as candidates and include them in the final benchmark only after cleaning, answer annotation or verification, deduplication, and release packaging.

### 3.2 Raw MCQ Extraction

For PDF-based sources, we use a multi-stage extraction pipeline to recover Urdu MCQs from heterogeneous examination layouts. We first convert each PDF into page images and use Claude Opus 4.7 (Anthropic, [2026a](https://arxiv.org/html/2606.07167#bib.bib28)) as an OCR model to classify each page, filtering out English-only pages, non-MCQ pages, answer keys, and unrelated material. For the remaining pages, we extract question stems, answer options, source metadata, and page-level provenance using a vision-language OCR procedure. We design the extraction prompt specifically for Urdu examination documents. The prompt preserves Urdu question text, answer options, poetry, quotations, and other context required to answer the question correctly. In bilingual material, we ignore English text unless it forms a structural part of the Urdu question, and we discard unreadable questions rather than reconstructing missing content. For web-based sources, we directly scrape question stems, answer options, category labels, and answer keys when available.

### 3.3 Metadata and Schema Normalization

The collected sources use heterogeneous category names, grade labels, and answer formats, so we normalize all examples into a unified representation. We map source-specific labels to a controlled set of subdomains. For example, we map variants such as "Everyday Science" and "General Science" to "general science", and mathematics-related labels such as "maths", "General Mathematics", and "riazi" to "mathematics".
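The label mapping above can be sketched as a lookup over case-folded source labels. The entries mirror the examples in the text; the name `normalize_subdomain` and the choice to return `None` for unmapped labels are assumptions for illustration, not the released code.

```python
from typing import Optional

# Hypothetical lookup table: keys are case-folded source labels,
# values are controlled subdomain names (entries taken from the text above).
SUBDOMAIN_MAP = {
    "everyday science": "general science",
    "general science": "general science",
    "maths": "mathematics",
    "general mathematics": "mathematics",
    "riazi": "mathematics",
}


def normalize_subdomain(label: str) -> Optional[str]:
    """Map a raw source label to a controlled subdomain, or None if unmapped."""
    key = " ".join(label.split()).casefold()  # collapse whitespace, ignore case
    return SUBDOMAIN_MAP.get(key)


print(normalize_subdomain("Everyday Science"))
print(normalize_subdomain("  General   Mathematics "))
```

Returning `None` rather than guessing keeps unmapped labels visible for manual review, which seems in keeping with the conservative filtering described elsewhere in the pipeline.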

For curriculum-derived material, we normalize grade labels into regional examination levels: Grade 9 to SSC-I, Grade 10 to SSC-II, Grade 11 to HSSC-I, and Grade 12 to HSSC-II. Table [13](https://arxiv.org/html/2606.07167#A3.T13) in Appendix [C.1](https://arxiv.org/html/2606.07167#A3.SS1) summarizes the final domain hierarchy, subdomains, acronyms, and examination levels covered in UrduMMLU. We also canonicalize the MCQ schema to support consistent evaluation. Each released item stores a question, four answer options, normalized domain and subdomain labels, academic level, source metadata, and answer annotations. We remove ambiguous index-based answer fields because different sources follow different option-ordering conventions.
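A minimal sketch of the grade mapping and a canonical item record follows. The grade-to-level table comes from the text; the `MCQItem` class, its field names, and the decision to store the gold answer as option text instead of an index are assumptions, since the released schema is described only in the appendix.

```python
from dataclasses import dataclass
from typing import List, Optional

# Grade-to-level mapping as stated in the text.
GRADE_TO_LEVEL = {9: "SSC-I", 10: "SSC-II", 11: "HSSC-I", 12: "HSSC-II"}


@dataclass
class MCQItem:
    """Hypothetical record layout; field names are illustrative only."""
    question: str
    options: List[str]        # exactly four answer options
    answer: str               # gold answer as option text (assumed), not an index
    domain: str
    subdomain: str
    level: Optional[str]      # e.g. "SSC-I"; None when the source has no grade
    source: str

    def __post_init__(self) -> None:
        # Structural checks that mirror the four-option requirement.
        if len(self.options) != 4:
            raise ValueError("expected exactly four answer options")
        if self.answer not in self.options:
            raise ValueError("gold answer must match one of the options")
```

Storing the answer as text is one way to sidestep the option-ordering ambiguity mentioned above: the label stays valid even if options are reordered.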

### 3.4 Cleaning and Quality Control

We apply several cleaning and validation steps to reduce noise from OCR, web scraping, and heterogeneous source formatting\. First, we normalize Urdu text representation through right\-to\-left display normalization, punctuation and quote normalization, standardization of fill\-in\-the\-blank markers, and Unicode normalization for visually similar Arabic and Urdu codepoints\. We then enforce structural validity by removing items with missing, empty, duplicate, or malformed answer options, discarding examples with invalid option counts, and standardizing option fields into a consistent schema format\. Next, we deduplicate the candidate pool\. We merge exact duplicates with consistent answers while preserving source provenance and discard duplicate groups with conflicting labels\. To handle OCR and wording variations, we additionally apply conservative near\-duplicate filtering based on high question\-token overlap together with answer\-option overlap\. Finally, we remove residual non\-Urdu artifacts, including a small number of English OCR artifacts that survived earlier filtering stages\. We use the resulting cleaned pool for annotation, answer verification, and final benchmark construction\.
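Two of these steps, codepoint normalization and near-duplicate filtering, can be illustrated as below. This is a sketch under stated assumptions: the two-entry confusables table, the Jaccard similarity measure, and both thresholds are illustrative choices and are not taken from the paper.

```python
import unicodedata

# Arabic codepoints that are visually similar to their Urdu counterparts.
# This small table is an illustrative assumption, not the paper's full list.
CONFUSABLES = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize_text(text: str) -> str:
    """Apply NFC, map look-alike codepoints, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(CONFUSABLES)
    return " ".join(text.split())


def jaccard(a: set, b: set) -> float:
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(q1, opts1, q2, opts2, q_thresh=0.9, opt_thresh=0.75):
    """Flag a pair only when BOTH question tokens and options overlap strongly."""
    q_sim = jaccard(set(normalize_text(q1).split()), set(normalize_text(q2).split()))
    o_sim = jaccard({normalize_text(o) for o in opts1},
                    {normalize_text(o) for o in opts2})
    return q_sim >= q_thresh and o_sim >= opt_thresh
```

Requiring both conditions is what makes the filter conservative: two questions with the same stem but different options are kept as distinct items.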

### 3.5 Human Annotation

The exam-derived portion of UrduMMLU did not include answer keys, so we produced gold labels through annotation. We organized annotation batches by subdomain and assigned each item to two annotators with relevant subject familiarity. Annotators selected the correct answer, marked questions as unsure, flagged problematic items, and could suggest light corrections to question text, answer options, and subdomain labels.

Seventeen annotators participated in the process; 94.1% identified Urdu as their native language, and most held either a bachelor's degree (47.1%) or a master's degree (41.2%). Appendix [B.1](https://arxiv.org/html/2606.07167#A2.SS1) reports full demographic and satisfaction details. We applied a strict consensus rule and retained an item only when both annotators selected the same valid answer without flags or unsure labels. This process helped ensure high annotation quality and label reliability. In total, 17,565 exam-extracted MCQs entered annotation, and 14,459 satisfied the consensus criteria. The main exclusion reasons included answer disagreement (1,611 items), flags (1,247), unsure selections (243), and incomplete annotations (5). Annotators also corrected 141 domain labels during the process. Overall observed agreement reached 89.98%, with a simplified Cohen's κ = 0.8663. After verification, deduplication, and release packaging, the final benchmark retained 12,759 human-annotated exam-derived questions. Annotators additionally verified the correctness of pre-existing answer labels for web-derived MCQs.
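The strict consensus rule can be sketched as follows, together with one possible reading of the "simplified" κ. The annotation record layout (`answer`, `flagged`, `unsure`) and the A to D option labels are hypothetical. The κ function assumes chance agreement is fixed at 1/4 for four options; the text does not define the simplification, so this is an assumption that happens to land close to the reported value.

```python
from typing import Optional

VALID_ANSWERS = {"A", "B", "C", "D"}  # assumed option labels


def consensus_label(ann1: dict, ann2: dict) -> Optional[str]:
    """Return the agreed answer, or None if the item fails strict consensus."""
    for ann in (ann1, ann2):
        if ann.get("flagged") or ann.get("unsure"):
            return None                      # any flag or unsure mark rejects
        if ann.get("answer") not in VALID_ANSWERS:
            return None                      # missing or invalid answer rejects
    return ann1["answer"] if ann1["answer"] == ann2["answer"] else None


def uniform_chance_kappa(observed: float, n_options: int = 4) -> float:
    """Kappa with chance agreement fixed at 1/n_options (an assumed simplification)."""
    expected = 1.0 / n_options
    return (observed - expected) / (1.0 - expected)


# Observed agreement of 89.98% gives roughly 0.866 under this assumption.
print(round(uniform_chance_kappa(0.8998), 3))
```

The exclusion counts in the text are consistent with the retained total: 17,565 − (1,611 + 1,247 + 243 + 5) = 14,459.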

Table 1: Composition of the final UrduMMLU benchmark by source type. Web-derived questions use human-validated published answers, while exam-derived questions use dual human annotation with strict consensus filtering.

### 3.6 Final Benchmark

The final release of UrduMMLU contains 26,431 Urdu MCQs after cleaning, annotation, answer verification, deduplication, and release packaging. Answer-labeled web sources contribute 13,672 questions, while exam-derived sources contribute 12,759 questions annotated through dual human labeling and strict consensus filtering (Table [1](https://arxiv.org/html/2606.07167#S3.T1)). Appendix [A](https://arxiv.org/html/2606.07167#A1) reports statistics for the larger cleaned candidate pool before annotation and final selection. UrduMMLU spans 26 subdomains grouped into five domains: STEM, Humanities, Social Sciences, Profession, and Other. Humanities and Social Sciences constitute the largest portions of the benchmark, reflecting strong coverage of Urdu language, Urdu literature, Islamic studies, Pakistan studies, and related educational content.

Table [3](https://arxiv.org/html/2606.07167#S3.T3) summarizes the domain-level composition, while Table [13](https://arxiv.org/html/2606.07167#A3.T13) lists the corresponding subdomains and academic levels. Table [2](https://arxiv.org/html/2606.07167#S3.T2) reports average question and answer lengths, and Appendix [C](https://arxiv.org/html/2606.07167#A3) describes the dataset schema.

Table 2: Average character length of questions and correct answers in UrduMMLU, grouped by domain and academic level. Values denote mean character counts.

Table 3: Distribution of questions and subdomains across the five domains in UrduMMLU.

## 4 Experiments

We evaluate UrduMMLU with generation-based protocols that require each model to select an answer option for an Urdu MCQ. We run a large zero-shot evaluation across 30 open- and closed-source LLMs using both English and Urdu instruction prompts. We also run a focused few-shot study on four open-source LLMs using 1-, 3-, and 5-shot settings in both prompt languages. All evaluations use the same benchmark format and accuracy metric, which allows direct comparison across model families, prompt languages, and shot settings.

### 4.1 Models, Prompting, and Decoding

We evaluate 30 LLMs spanning a broad range of model sizes, access regimes, and training backgrounds, including proprietary API systems, open\-weight multilingual instruction\-tuned models, compact models, mixture\-of\-experts architectures, reasoning\-oriented variants, and Urdu\- or regionally specialized models\.

Table [14](https://arxiv.org/html/2606.07167#A4.T14) in Appendix [D.1](https://arxiv.org/html/2606.07167#A4.SS1) lists the full model roster. This setup allows us to compare open-source and closed-source systems and examine transfer to native Urdu educational content. We evaluate each model with English and Urdu prompt templates while keeping the Urdu question stem, answer options, and response format fixed. The two settings differ only in the instruction language and field labels. Appendix [D.2](https://arxiv.org/html/2606.07167#A4.SS2) provides the prompt templates in Figures [16](https://arxiv.org/html/2606.07167#A4.F16) and [17](https://arxiv.org/html/2606.07167#A4.F17). We use temperature 0 whenever deterministic decoding is available and otherwise follow provider-specific reasoning settings. We set the maximum output length to 4096 tokens, batch API requests with a concurrency of 10, and decode locally evaluated Hugging Face models greedily.

| Model | En STEM | En H | En SS | En P | En O | En Overall | En Inv% ↓ | Ur STEM | Ur H | Ur SS | Ur P | Ur O | Ur Overall | Ur Inv% ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Open-source Models: >25B Parameters** | | | | | | | | | | | | | | |
| DeepSeek-V4-Flash | 97.49 | 68.71 | 90.05 | 89.85 | 86.44 | 82.41 | 0.1 | 97.57 | 67.32 | 88.95 | 89.44 | 84.88 | 81.42 | 0.3 |
| Gemma-4-26B-A4B-IT | 85.92 | 57.43 | 73.83 | 75.49 | 70.55 | 69.23 | <0.1 | 87.21 | 57.73 | 75.41 | 77.82 | 71.79 | 70.23 | <0.1 |
| Gemma-4-31B-IT | 93.33 | 63.62 | 81.70 | 83.69 | 79.49 | 76.38 | <0.1 | 93.86 | 63.25 | 82.06 | 84.10 | 78.39 | 76.39 | <0.1 |
| LLaMA-3.3-70B | 81.32 | 56.30 | 75.45 | 76.51 | 73.77 | 68.56 | 0 | 78.39 | 56.10 | 73.43 | 71.28 | 71.65 | 67.00 | <0.1 |
| Qwen3.6-27B | 90.46 | 55.95 | 73.19 | 74.77 | 69.60 | 69.22 | 0 | 91.12 | 55.71 | 74.55 | 74.46 | 70.55 | 69.70 | 0 |
| Qwen3.6-35B-A3B | 88.77 | 56.81 | 73.52 | 75.49 | 69.82 | 69.39 | 0 | 96.32 | 58.12 | 84.43 | 84.50 | 77.99 | 75.46 | 0.4 |
| LLaMA-4-Scout-17B-16E | 84.14 | 57.03 | 71.20 | 74.15 | 69.52 | 67.82 | 0 | 85.59 | 56.55 | 72.49 | 72.62 | 69.01 | 68.20 | <0.1 |
| LLaMA-4-Maverick-17B-128E | 91.98 | 63.27 | 80.11 | 81.64 | 80.66 | 75.47 | <0.1 | 92.38 | 63.25 | 81.30 | 79.59 | 80.81 | 75.83 | <0.1 |
| **Open-source Models: ≤25B Parameters** | | | | | | | | | | | | | | |
| BLOOMZ-1.1B | 23.52 | 27.21 | 24.39 | 26.06 | 25.63 | 25.52 | 0.5 | 24.53 | 25.83 | 25.34 | 23.49 | 24.47 | 25.27 | <0.1 |
| BLOOMZ-1.7B | 28.94 | 27.70 | 33.33 | 37.17 | 31.43 | 30.19 | 2.5 | 28.74 | 28.76 | 33.84 | 32.60 | 27.51 | 30.57 | 24.8 |
| BLOOMZ-3B | 27.89 | 30.25 | 33.26 | 32.34 | 28.39 | 30.68 | 6.5 | 26.56 | 27.70 | 30.35 | 32.64 | 26.75 | 28.62 | 74.2 |
| BLOOMZ-7B | 27.75 | 27.81 | 33.56 | 31.83 | 28.11 | 29.73 | 11.2 | 29.24 | 30.88 | 33.35 | 33.45 | 30.73 | 31.36 | 34.6 |
| Gemma-2-9B-IT | 67.08 | 47.21 | 58.40 | 60.00 | 54.58 | 55.28 | 0 | 69.02 | 48.08 | 60.53 | 62.15 | 55.82 | 56.80 | 0 |
| Gemma-3-4B-IT | 49.97 | 37.87 | 47.68 | 47.79 | 45.57 | 43.93 | 0 | 51.79 | 38.27 | 48.69 | 50.15 | 46.23 | 44.88 | 0 |
| LLaMA-3.2-3B | 37.12 | 26.78 | 38.06 | 36.00 | 35.60 | 32.98 | 0 | 37.24 | 29.32 | 39.97 | 38.36 | 38.02 | 34.85 | 0 |
| LLaMA-3.1-8B | 46.96 | 36.38 | 49.66 | 48.51 | 44.54 | 43.30 | 0 | 46.49 | 37.61 | 49.85 | 49.54 | 44.98 | 43.84 | 0 |
| Ministral-3-3B | 55.04 | 43.69 | 48.99 | 49.08 | 47.91 | 47.90 | <0.1 | 57.25 | 43.07 | 52.07 | 52.26 | 48.86 | 49.16 | <0.1 |
| Ministral-3-8B | 67.81 | 45.27 | 58.99 | 59.59 | 54.43 | 54.77 | 0 | 71.37 | 45.74 | 62.74 | 61.54 | 57.00 | 56.99 | 0 |
| Phi-4-mini | 37.07 | 28.85 | 38.10 | 40.41 | 32.67 | 33.85 | <0.1 | 37.08 | 28.70 | 38.97 | 38.67 | 35.85 | 34.15 | <0.1 |
| Phi-3.5-mini | 33.76 | 22.89 | 30.90 | 32.21 | 28.28 | 28.03 | 0 | 33.83 | 27.25 | 31.75 | 34.22 | 30.57 | 30.31 | 0.4 |
| Qwen3-4B-Instruct-2507 | 68.61 | 42.33 | 51.48 | 52.62 | 47.84 | 50.84 | 0 | 68.75 | 43.00 | 53.30 | 53.23 | 47.69 | 51.70 | 0 |
| Qwen3-8B | 70.70 | 39.21 | 53.99 | 56.62 | 50.18 | 50.97 | 0 | 74.37 | 30.87 | 56.54 | 57.48 | 49.38 | 48.97 | 0.5 |
| **Proprietary Models** | | | | | | | | | | | | | | |
| Claude-Haiku-4.5 | 90.49 | 58.57 | 75.86 | 75.90 | 74.14 | 71.40 | 0.1 | 91.96 | 59.31 | 77.06 | 78.26 | 74.29 | 72.45 | <0.1 |
| Claude-Sonnet-4.6 | 96.34 | 72.69 | 87.36 | 87.18 | 86.01 | 82.91 | 0 | 96.26 | 72.69 | 87.53 | 87.18 | 85.86 | 82.94 | 0 |
| Gemini-3.1-Flash-Lite | 96.85 | 74.20 | 90.01 | 90.26 | 86.08 | 84.56 | <0.1 | 97.09 | 74.38 | 90.10 | 90.36 | 85.57 | 84.68 | <0.1 |
| Gemini-3.5-Flash | 97.75 | 84.98 | 92.15 | 92.10 | 91.43 | 90.20 | 0.1 | 97.81 | 85.31 | 92.14 | 91.38 | 91.72 | 90.34 | <0.1 |
| GPT-5.4-mini | 88.34 | 62.82 | 77.52 | 79.08 | 75.24 | 73.43 | 0 | 88.25 | 62.35 | 78.45 | 79.59 | 75.09 | 73.51 | 0 |
| GPT-5.4 | 95.13 | 69.29 | 86.35 | 85.85 | 84.10 | 80.81 | 0 | 97.40 | 74.82 | 89.37 | 87.08 | 83.74 | 84.53 | 0.4 |
| **Urdu Models** | | | | | | | | | | | | | | |
| Qalb-1.0-8B | 38.18 | 29.99 | 37.68 | 40.31 | 39.56 | 34.77 | 0 | 36.26 | 32.72 | 37.78 | 39.34 | 42.55 | 35.52 | 11.3 |
| Alif-1.0-8B | 41.09 | 25.74 | 41.40 | 39.87 | 41.04 | 34.72 | 0.6 | 33.27 | 29.00 | 36.26 | 37.67 | 42.93 | 32.68 | 12.6 |

En: English prompt; Ur: Urdu prompt; H: Humanities; SS: Social Sciences; P: Profession; O: Other. Domain and Overall columns report accuracy in % (↑ higher is better).

Table 4: Model performance on UrduMMLU. Accuracy (%) under English and Urdu prompts across five domains and overall average. Inv% denotes the percentage of unparsable or malformed outputs (lower is better). Boxed values mark the best overall score per column, while bold values indicate the best score within each model group.

### 4.2 Evaluation Protocols

##### Zero\-shot evaluation:

We use a generation\-based zero\-shot protocol in which each input contains the domain, subdomain, academic level, Urdu question, and labeled answer options\. We evaluate all 30 models under both English and Urdu prompt templates, resulting in 60 zero\-shot runs\. Since the question and answer options remain unchanged across settings, this protocol isolates the effect of instruction language on the same Urdu MCQs\.
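The input layout described above can be sketched as a simple template function. The actual templates are given in the paper's appendix and are not reproduced here, so the instruction wording, the English field labels, and the A to D option labels below are all assumptions for illustration.

```python
from typing import Sequence

OPTION_LABELS = ("A", "B", "C", "D")  # assumed labels; the paper's templates may differ


def build_zero_shot_prompt(domain: str, subdomain: str, level: str,
                           question: str, options: Sequence[str]) -> str:
    """Assemble one zero-shot input with metadata, question, and labeled options."""
    if len(options) != len(OPTION_LABELS):
        raise ValueError("expected exactly four answer options")
    lines = [
        # Hypothetical English instruction; the Urdu setting would swap only
        # the instruction language and field labels, not the question itself.
        "Answer the following Urdu multiple-choice question.",
        "Reply with the letter of the correct option only.",
        f"Domain: {domain}",
        f"Subdomain: {subdomain}",
        f"Level: {level}",
        f"Question: {question}",
    ]
    lines += [f"{label}. {text}" for label, text in zip(OPTION_LABELS, options)]
    lines.append("Answer:")
    return "\n".join(lines)


print(build_zero_shot_prompt("STEM", "mathematics", "SSC-I",
                             "<Urdu question text>",
                             ["<option 1>", "<option 2>", "<option 3>", "<option 4>"]))
```

Keeping the question and options as untouched arguments reflects the protocol's key property: only the surrounding instruction text varies between the two prompt settings.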

##### Few\-shot evaluation:

We conduct a controlled few-shot study on four open-source LLMs under 1-, 3-, and 5-shot settings with both English and Urdu prompts, yielding 24 runs. We reserve 200 validated MCQs as a demonstration pool for lm-evaluation-harness (Gao et al., [2024](https://arxiv.org/html/2606.07167#bib.bib50)), which we use only for prompt construction, demonstration sampling, and execution management. All models generate structured answers that we parse and compare against the gold labels.

### 4.3 Evaluation Measure

We use accuracy as the primary evaluation measure by comparing the generated answer with the gold label\. Alongside accuracy, we report the invalid\-output rate, defined as the percentage of unparsable, malformed, or error outputs\. We further analyze results by domain, subdomain, academic level, prompt language, and model category to examine performance differences across standard academic and Urdu\- or region\-specific subjects\.
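The two measures can be computed together as in the sketch below. The parser is hypothetical: it accepts an output only when exactly one standalone option letter appears, which is a stand-in for the paper's structured-answer parsing. Counting invalid outputs as incorrect in the accuracy denominator is also an assumption; the text does not state how invalid outputs enter the accuracy figure.

```python
import re
from typing import List, Optional

OPTION_RE = re.compile(r"\b[ABCD]\b")  # assumed option labels


def parse_answer(output: str) -> Optional[str]:
    """Return the single option letter found in the output, else None (invalid)."""
    letters = set(OPTION_RE.findall(output))
    return letters.pop() if len(letters) == 1 else None


def score(outputs: List[str], gold: List[str]) -> dict:
    """Accuracy and invalid-output rate, both as percentages of all items."""
    if not gold or len(outputs) != len(gold):
        raise ValueError("outputs and gold must be non-empty and aligned")
    preds = [parse_answer(o) for o in outputs]
    correct = sum(p == g for p, g in zip(preds, gold))
    invalid = sum(p is None for p in preds)
    n = len(gold)
    return {"accuracy": 100.0 * correct / n, "invalid_rate": 100.0 * invalid / n}


result = score(["A", "Answer: B", "I think A or B", ""], ["A", "C", "A", "D"])
print(result)
```

In this toy run one of four outputs is correct and two are unparsable, so reporting the invalid rate alongside accuracy makes clear how much of the error comes from formatting rather than knowledge.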

## 5 Results

We first analyze overall zero-shot performance across all evaluated models and then examine how performance changes across domains, prompt languages, model scales, and few-shot settings. Table [4](https://arxiv.org/html/2606.07167#S4.T4) summarizes zero-shot accuracy on UrduMMLU under English and Urdu prompts, together with invalid-output rates. We focus on four main findings: a small set of models performs strongly, STEM transfers much better than Urdu-centered Humanities, prompt language has limited effect for most models, and few-shot prompting gives modest but insufficient gains. Appendix [E.1](https://arxiv.org/html/2606.07167#A5.SS1) provides per-subdomain and per-level results.

### 5.1 Overall Model Performance

Gemini-3.5-Flash leads all models with 90.20% accuracy under the English prompt and 90.34% under the Urdu prompt, while no other model exceeds 85%. Gemini-3.1-Flash-Lite (Google, [2026a](https://arxiv.org/html/2606.07167#bib.bib39)), GPT-5.4 (Singh et al., [2026](https://arxiv.org/html/2606.07167#bib.bib51)), Claude-Sonnet-4.6 (Anthropic, [2026b](https://arxiv.org/html/2606.07167#bib.bib37)), and DeepSeek-V4-Flash (DeepSeek-AI, [2026](https://arxiv.org/html/2606.07167#bib.bib25)) are the next-best.

DeepSeek-V4-Flash gives the strongest open-source result, at 82.41% under the English prompt and 81.42% under the Urdu prompt. Even so, it trails Gemini-3.5-Flash by 7.79 and 8.92 points. Performance drops sharply outside this top tier. In the ≤25B open-source group, Gemma-2-9B-IT (Team et al., [2024](https://arxiv.org/html/2606.07167#bib.bib43)) and Ministral-3-8B (Liu et al., [2026](https://arxiv.org/html/2606.07167#bib.bib21)) lead at roughly 55–57%, while Qwen3-4B and Qwen3-8B (Yang et al., [2025](https://arxiv.org/html/2606.07167#bib.bib23)) remain near 50%. BLOOMZ (Muennighoff et al., [2023](https://arxiv.org/html/2606.07167#bib.bib24)) models stay close to the 25% random baseline despite multilingual pretraining that includes Urdu. The two Urdu-specific models, Qalb-1.0-8B (Hassan et al., [2026](https://arxiv.org/html/2606.07167#bib.bib22)) and Alif-1.0-8B (Shafique et al., [2025](https://arxiv.org/html/2606.07167#bib.bib20)), also remain below 36%, showing that Urdu-focused tuning alone does not produce strong broad-coverage Urdu knowledge.

### 5.2 Domain-Level Performance

Domain-level results reveal the clearest pattern in UrduMMLU. Nearly every model that performs above chance scores highest on STEM and lowest on Humanities.

Under the Urdu prompt, Gemini-3.5-Flash scores 97.81% on STEM and 85.31% on Humanities, a gap of 12.50 points, while DeepSeek-V4-Flash drops from 97.57% to 67.32%. GPT-5.4 and Claude-Sonnet-4.6 lose more than 22 points, and several Qwen models lose more than 35 points between the two domains. Figure [2](https://arxiv.org/html/2606.07167#S5.F2) illustrates this trend for representative top-performing models from each section of Table [4](https://arxiv.org/html/2606.07167#S4.T4). This pattern highlights the main challenge that UrduMMLU exposes. STEM questions rely on scientific and mathematical concepts that transfer more consistently across languages, whereas the Humanities domain requires stronger coverage of Urdu literature, Urdu language, Islamic studies, ethics, and other culturally grounded subjects. Many models can process Urdu well enough to answer science questions, but they struggle on Urdu literary, linguistic, and religious content. Social Sciences generally falls between STEM and Humanities, reflecting a mix of globally shared and region-specific knowledge.
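The gaps quoted above follow directly from the Urdu-prompt STEM and Humanities columns of Table 4. The short check below recomputes them; the values are copied from the table, and the dictionary name is only illustrative.

```python
# (STEM, Humanities) accuracy under the Urdu prompt, copied from Table 4.
URDU_PROMPT = {
    "Gemini-3.5-Flash": (97.81, 85.31),
    "DeepSeek-V4-Flash": (97.57, 67.32),
    "GPT-5.4": (97.40, 74.82),
    "Claude-Sonnet-4.6": (96.26, 72.69),
    "Qwen3.6-27B": (91.12, 55.71),
    "Qwen3.6-35B-A3B": (96.32, 58.12),
    "Qwen3-8B": (74.37, 30.87),
}

# STEM minus Humanities, in accuracy points.
gaps = {model: round(stem - hum, 2) for model, (stem, hum) in URDU_PROMPT.items()}

for model, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{model:20s} {gap:6.2f}")
```

Sorted this way, Gemini-3.5-Flash has the smallest gap and the three Qwen models listed have the largest, all above 35 points.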

![Refer to caption](https://arxiv.org/html/2606.07167v1/x2.png) Figure 2: STEM and Humanities accuracy on UrduMMLU under the Urdu prompt for top representative models from each model group\. All models score lower on Humanities\.

![Refer to caption](https://arxiv.org/html/2606.07167v1/x3.png) Figure 3: Overall accuracy on UrduMMLU under English and Urdu prompts for representative models from each model group\. Prompt language has only a small effect on overall performance\.

Table 5: Few\-shot performance on UrduMMLU\. Accuracy \(%\) at 0\-, 1\-, 3\-, and 5\-shot settings under English and Urdu instruction prompts\. Coloured deltas in parentheses are relative to the 0\-shot baseline of the same model under the same prompt; green indicates a gain and red indicates a loss\. The Mean row aggregates the four evaluated models per shot setting\.

### 5\.3Prompt\-Language Effects

Changing the prompt language usually has little effect on overall accuracy\. Figure [3](https://arxiv.org/html/2606.07167#S5.F3) compares representative models from each group in Table [4](https://arxiv.org/html/2606.07167#S4.T4)\. The English and Urdu prompt results nearly overlap for all four: Gemini\-3\.5\-Flash changes by \+0\.14 points, DeepSeek\-V4\-Flash by −0\.99, Gemma\-2\-9B\-IT by \+1\.52, and Qalb\-1\.0\-8B by \+0\.75\. The full table shows the same general pattern\. Most models move by less than one point when the prompt changes from English to Urdu\.

A few models show larger prompt effects: Qwen3\.6\-35B\-A3B gains 6\.07 points with the Urdu prompt and GPT\-5\.4 gains 3\.72, while Qwen3\-8B and Alif\-1\.0\-8B lose about two points\. However, these shifts remain much smaller than the STEM\-Humanities gaps\. We therefore attribute the main difficulty of UrduMMLU to Urdu\-specific content rather than to the instruction language\.
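
The prompt\-language effect is the Urdu\-prompt accuracy minus the English\-prompt accuracy for the same model\. As a minimal sketch, the snippet below recomputes that delta for the two models whose overall accuracies under both prompts are stated in the text; the helper name `prompt_delta` and the data layout are assumptions made for illustration\.

```python
# Overall accuracy (%) under each instruction language, as reported in the text.
overall = {
    "Gemini-3.5-Flash": {"english": 90.20, "urdu": 90.34},
    "DeepSeek-V4-Flash": {"english": 82.41, "urdu": 81.42},
}


def prompt_delta(acc):
    """Urdu-prompt accuracy minus English-prompt accuracy, in points.

    Hypothetical helper: a positive value means the Urdu prompt helped.
    """
    return round(acc["urdu"] - acc["english"], 2)


deltas = {model: prompt_delta(acc) for model, acc in overall.items()}

for model, delta in deltas.items():
    print(f"{model}: {delta:+.2f} points")
```

Both deltas stay within one point in absolute value, which is consistent with the sign convention used for the per\-model changes reported in this section\.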

### 5\.4Invalid\-Output Rates

Invalid\-output rates provide a useful complement to accuracy\. Most modern proprietary and open\-source models follow the required response format, with invalid\-output rates below 0\.1%\. This includes Gemini, Claude, GPT, Gemma, LLaMA \(Meta, [2025](https://arxiv.org/html/2606.07167#bib.bib33)\), Ministral, and most larger Qwen models\. Smaller and weaker models show a different pattern\. Under the Urdu prompt, BLOOMZ\-3B returns invalid answers for 74\.2% of examples, BLOOMZ\-7B for 34\.6%, and BLOOMZ\-1\.7B for 24\.8%\. The Urdu\-targeted models also degrade under the Urdu prompt: Qalb\-1\.0\-8B reaches an invalid\-output rate of 11\.3%, and Alif\-1\.0\-8B reaches 12\.6%\. These failures matter because accuracy over parseable outputs can hide severe formatting breakdowns\. Reporting invalid outputs separately shows which models can both answer Urdu questions and follow Urdu evaluation instructions reliably; Appendix [F](https://arxiv.org/html/2606.07167#A6) gives one real example of each failure mode\.
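
To illustrate why invalid outputs are worth reporting separately, the sketch below scores a toy batch of responses in two ways: over all examples and over parseable outputs only\. The parsing rule \(a lone option letter A to D, optionally bracketed or followed by a period or colon\), the function names, and the toy data are all assumptions for illustration; the actual answer\-extraction rule used in the evaluation may differ\.

```python
import re

# Hypothetical parsing rule: accept only a single option letter A-D,
# optionally wrapped in parentheses or followed by "." or ":".
ANSWER_PATTERN = re.compile(r"^\s*\(?([ABCD])\)?[.:]?\s*$")


def parse_choice(response):
    """Return the option letter, or None when the output is invalid."""
    match = ANSWER_PATTERN.match(response)
    return match.group(1) if match else None


def score(responses, gold):
    """Compute the invalid-output rate and two accuracy variants."""
    parsed = [parse_choice(r) for r in responses]
    total = len(gold)
    invalid = sum(p is None for p in parsed)
    correct = sum(p == g for p, g in zip(parsed, gold))
    parseable = total - invalid
    return {
        "invalid_rate": invalid / total,
        # Invalid outputs count as wrong.
        "accuracy_all": correct / total,
        # Invalid outputs are dropped from the denominator.
        "accuracy_parseable": correct / parseable if parseable else 0.0,
    }


# Toy data, not taken from the benchmark.
responses = ["A", "(B)", "C.", "I think it is the second option", ""]
gold = ["A", "B", "D", "B", "C"]
report = score(responses, gold)
print(report)
```

In this toy batch two of five outputs are invalid, so accuracy over parseable outputs \(2 of 3\) looks considerably better than accuracy over all examples \(2 of 5\)\. A model with a high invalid\-output rate can therefore appear stronger than it is if only the former is reported\.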

![Refer to caption](https://arxiv.org/html/2606.07167v1/x4.png) Figure 4: Few\-shot accuracy on UrduMMLU for LLaMA\-3\.1\-8B \(Grattafiori et al\., [2024](https://arxiv.org/html/2606.07167#bib.bib30)\), Gemma\-3\-4B\-IT, Qwen3\-8B, and Qwen3\-4B\-Instruct\-2507 under English \(solid\) and Urdu \(dotted\) prompts\. Accuracy generally improves from zero\-shot to five\-shot across both prompt languages, although the gains remain modest\.

### 5\.5Few\-Shot Performance

Table [5](https://arxiv.org/html/2606.07167#S5.T5) and Figure [4](https://arxiv.org/html/2606.07167#S5.F4) summarize few\-shot evaluation for LLaMA\-3\.1\-8B, Gemma\-3\-4B\-IT, Qwen3\-8B, and Qwen3\-4B\-Instruct\-2507\.

We evaluate each model at 1\-, 3\-, and 5\-shot under English and Urdu prompts using validated demonstrations from a held\-out pool\. Few\-shot prompting improves almost every setting: 23 of 24 configurations outperform their zero\-shot baselines\. Under the English prompt, mean gains reach \+1\.15, \+2\.20, and \+2\.67 points at 1\-, 3\-, and 5\-shot, while the Urdu prompt yields gains of \+1\.50, \+2\.35, and \+2\.28\. Qwen3\-8B under the Urdu prompt shows the largest improvement, increasing from 48\.97% at zero\-shot to 53\.49% at five\-shot\. Despite these gains, few\-shot prompting does not change the overall ranking\. Even at five\-shot, all four models remain well below the ≥25B open\-source tier and far behind proprietary models\. Few\-shot prompting also reduces prompt\-language differences, with every English\-Urdu gap staying within 0\.71 points at five\-shot\. However, it does not compensate for missing Urdu\-specific knowledge\.
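
The count of 24 configurations follows from four models, two prompt languages, and three shot settings\. The sketch below enumerates that grid and recomputes the largest reported gain relative to the zero\-shot baseline; the variable names and the helper `gain_over_zero_shot` are illustrative assumptions, and only the Qwen3\-8B accuracies stated in the text are used\.

```python
from itertools import product

models = ["LLaMA-3.1-8B", "Gemma-3-4B-IT", "Qwen3-8B", "Qwen3-4B-Instruct-2507"]
prompt_languages = ["English", "Urdu"]
shot_settings = [1, 3, 5]

# Every (model, prompt language, shot count) combination that is compared
# against the zero-shot baseline of the same model and prompt.
configs = list(product(models, prompt_languages, shot_settings))
print(len(configs))  # 24


def gain_over_zero_shot(few_shot_acc, zero_shot_acc):
    """Hypothetical helper: few-shot gain in percentage points."""
    return round(few_shot_acc - zero_shot_acc, 2)


# Qwen3-8B under the Urdu prompt: 48.97% at zero-shot, 53.49% at five-shot.
qwen_gain = gain_over_zero_shot(53.49, 48.97)
print(f"Qwen3-8B (Urdu, 5-shot): {qwen_gain:+.2f} points")
```

Deltas are taken against the zero\-shot score of the same model under the same prompt, matching the convention described in the Table 5 caption\.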

## 6Conclusion and Future Work

We introduced UrduMMLU, a broad\-coverage, natively written MMLU\-style benchmark for Urdu with 26,431 MCQs across 26 subjects and five domains, collected from Urdu MCQ banks and public SSC/HSSC examination PDFs\. The benchmark combines standard academic subjects with Urdu\- and region\-specific content and uses dual human annotation with strict consensus filtering for exam\-derived questions\. Evaluating 30 open\-source and closed\-source LLMs under English and Urdu prompts reveals a clear gap in current model capability\. Gemini\-3\.5\-Flash performs best at 90\.20% and 90\.34% accuracy, while the strongest open\-source model trails by 7\.79 and 8\.92 points\. Models perform substantially better on STEM than on Urdu\-centered Humanities, often losing 25 to 40 points on Urdu literature, Urdu language, and Islamic studies\. Prompt language has limited effect for most models, and few\-shot prompting yields only modest gains\. Overall, UrduMMLU shows that strong English\-centered benchmark performance does not ensure reliable Urdu educational and cultural knowledge, and it provides a stronger foundation for evaluating Urdu\-capable LLMs\.

Future work can extend UrduMMLU beyond MCQ\-based evaluation through open\-ended generation, summarization, and translation tasks\. Expanding the benchmark to include Indian Urdu curricula, undergraduate material, professional examinations, and dialectal content would further broaden its scope\. Psychometrics also remains difficult for all evaluated models, motivating future Urdu reasoning benchmarks focused on analogies, logical patterns, and aptitude\-style tasks\. Finally, the weak performance of Urdu\-targeted models highlights the need for stronger continued pretraining and instruction tuning on native Urdu educational and literary material\.

## Limitations

##### Curriculum and source scope:

UrduMMLU focuses on the Pakistani SSC/HSSC curriculum and a limited set of Urdu MCQ websites targeting the same educational setting\. Strong performance therefore reflects competence on Pakistani secondary\-school material rather than Urdu in its full linguistic diversity\. The benchmark does not cover undergraduate content, Indian Urdu curricula, dialectal variation, or Urdu–English code\-switching\. Although we reduce source skew through deduplication, annotation, and balancing, Ustad 360 still contributes 58\.8% of the cleaned candidate pool\.

##### Format and ceiling effects:

UrduMMLU uses a four\-option multiple\-choice format and therefore does not evaluate open\-ended writing, summarization, translation quality, long\-form reasoning, or conversational ability\. Psychometrics partially offsets this limitation by introducing reasoning\-heavy questions; however, no model exceeds 60% accuracy on this subdomain\. Future work should extend evaluation toward more open\-ended Urdu tasks\.

##### Prompt\-language and few\-shot effects are limited:

English and Urdu instruction wrappers, together with 1\-, 3\-, and 5\-shot prompting, change accuracy by only a few points and rarely alter model rankings\. We also do not evaluate option\-order robustness, chain\-of\-thought prompting, or specialized reasoning modes\. Our setup therefore prioritizes consistency and comparability over fully optimized prompting configurations\.

## Ethical Statement & Broad Impact

We develop UrduMMLU to support more inclusive multilingual evaluation for Urdu, a widely spoken but underrepresented language in NLP research\. The benchmark draws from publicly available educational and examination material and aims to improve evaluation coverage beyond English\-centered benchmarks\.

##### Transparency and Reproducibility:

We release the dataset, evaluation code, and prompting protocols to support reproducible research and transparent comparison across models\. We also document the dataset construction pipeline, annotation procedure, and evaluation setup in detail\.

##### Annotation and Data Quality:

We use dual human annotation with strict consensus filtering for exam\-derived questions and additionally verify answer labels for web\-derived items\. We further apply cleaning, deduplication, and normalization procedures to reduce OCR noise, malformed questions, and metadata inconsistencies\.

##### Bias and Scope Limitations:

UrduMMLU primarily reflects the Pakistani SSC/HSSC curriculum and the educational content available through Urdu MCQ resources\. As a result, it may not fully represent other Urdu\-speaking communities, dialects, or educational systems\. The benchmark also contains culturally and regionally grounded subjects such as Islamic studies and Pakistan studies that reflect the underlying curriculum sources\.

##### Broader Impact:

We hope UrduMMLU supports the development of stronger Urdu\-capable language models and more representative multilingual evaluation\. At the same time, benchmark scores should not be interpreted as complete measures of reasoning ability, factual reliability, or cultural understanding beyond the educational scope represented in the dataset\.

## References

- M\. Abdin, J\. Aneja, H\. Awadalla, A\. Awadallah, A\. A\. Awan, N\. Bach, A\. Bahree, A\. Bakhtiari, J\. Bao, H\. Behl, A\. Benhaim, M\. Bilenko, J\. Bjorck, S\. Bubeck, M\. Cai, Q\. Cai, V\. Chaudhary, D\. Chen, D\. Chen, W\. Chen, Y\. Chen, Y\. Chen, H\. Cheng, P\. Chopra, X\. Dai, M\. Dixon, R\. Eldan, V\. Fragoso, J\. Gao, M\. Gao, M\. Gao, A\. Garg, A\. D\. Giorno, A\. Goswami, S\. Gunasekar, E\. Haider, J\. Hao, R\. J\. Hewett, W\. Hu, J\. Huynh, D\. Iter, S\. A\. Jacobs, M\. Javaheripi, X\. Jin, N\. Karampatziakis, P\. Kauffmann, M\. Khademi, D\. Kim, Y\. J\. Kim, L\. Kurilenko, J\. R\. Lee, Y\. T\. Lee, Y\. Li, Y\. Li, C\. Liang, L\. Liden, X\. Lin, Z\. Lin, C\. Liu, L\. Liu, M\. Liu, W\. Liu, X\. Liu, C\. Luo, P\. Madan, A\. Mahmoudzadeh, D\. Majercak, M\. Mazzola, C\. C\. T\. Mendes, A\. Mitra, H\. Modi, A\. Nguyen, B\. Norick, B\. Patra, D\. Perez\-Becker, T\. Portet, R\. Pryzant, H\. Qin, M\. Radmilac, L\. Ren, G\. de Rosa, C\. Rosset, S\. Roy, O\. Ruwase, O\. Saarikivi, A\. Saied, A\. Salim, M\. Santacroce, S\. Shah, N\. Shang, H\. Sharma, Y\. Shen, S\. Shukla, X\. Song, M\. Tanaka, A\. Tupini, P\. Vaddamanu, C\. Wang, G\. Wang, L\. Wang, S\. Wang, X\. Wang, Y\. Wang, R\. Ward, W\. Wen, P\. Witte, H\. Wu, X\. Wu, M\. Wyatt, B\. Xiao, C\. Xu, J\. Xu, W\. Xu, J\. Xue, S\. Yadav, F\. Yang, J\. Yang, Y\. Yang, Z\. Yang, D\. Yu, L\. Yuan, C\. Zhang, C\. Zhang, J\. Zhang, L\. L\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, and X\. Zhou \(2024\)Phi\-3 technical report: a highly capable language model locally on your phone\.External Links:2404\.14219,[Link](https://arxiv.org/abs/2404.14219)Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.25.25.5.1.1)\.
- F\. Adeeba, B\. Dillon, H\. Sajjad, and R\. Bhatt \(2025\)Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p2.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Ahmad, H\. Iqbal, M\. Ahsan, N\. Naeem, M\. A\. R\. Khan, A\. Riaz, M\. A\. Manzoor, Y\. Wang, and P\. Nakov \(2025\)UrduFactCheck: an agentic fact\-checking framework for Urdu with evidence boosting and benchmarking\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 22788–22802\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.1240/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.1240),ISBN 979\-8\-89176\-335\-7Cited by:[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px1.p1.1)\.
- Anthropic \(2025\)Claude Haiku 4\.5\.Note:[https://www\.anthropic\.com/claude/haiku](https://www.anthropic.com/claude/haiku)Accessed 2026\-05\-23Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.26.26.5.1.1)\.
- Anthropic \(2026a\)Claude Opus 4\.7 system card\.Note:[https://www\.anthropic\.com/news/claude\-opus\-4\-7](https://www.anthropic.com/news/claude-opus-4-7)Accessed: 2026\-05\-26Cited by:[§3\.2](https://arxiv.org/html/2606.07167#S3.SS2.p1.1)\.
- Anthropic \(2026b\)Claude Sonnet 4\.6 System Card\.Note:[https://www\-cdn\.anthropic\.com/78073f739564e986ff3e28522761a7a0b4484f84\.pdf](https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf)Accessed 2026\-05\-23Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.27.27.5.1.1),[§5\.1](https://arxiv.org/html/2606.07167#S5.SS1.p1.3)\.
- DeepSeek\-AI \(2026\)DeepSeek\-V4: towards highly efficient million\-token context intelligence\.Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.14.14.5.1.1),[§5\.1](https://arxiv.org/html/2606.07167#S5.SS1.p1.3)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\)The language model evaluation harness\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.12608602),[Link](https://zenodo.org/records/12608602)Cited by:[§D\.3](https://arxiv.org/html/2606.07167#A4.SS3.p1.1),[§4\.2](https://arxiv.org/html/2606.07167#S4.SS2.SSS0.Px2.p1.1)\.
- Gemma Team \(2025\)Gemma 3 technical report\.arXiv preprint arXiv:2503\.19786\.External Links:[Link](https://arxiv.org/abs/2503.19786)Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.15.15.5.1.1)\.
- Google DeepMind \(2026\)Gemini 3\.5 Flash Model Card\.Note:[https://deepmind\.google/models/model\-cards/gemini\-3\-5\-flash/](https://deepmind.google/models/model-cards/gemini-3-5-flash/)Accessed 2026\-05\-23Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.29.29.5.1.1),[§1](https://arxiv.org/html/2606.07167#S1.p4.1)\.
- Google \(2026a\)Gemini 3\.1 Flash\-Lite\.Note:[https://ai\.google\.dev/gemini\-api/docs/models/gemini\-3\.1\-flash\-lite](https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-lite)Accessed 2026\-05\-23Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.28.28.5.1.1),[§5\.1](https://arxiv.org/html/2606.07167#S5.SS1.p1.3)\.
- Google \(2026b\)Gemma 4 model card\.Note:[https://ai\.google\.dev/gemma/docs/core/model\_card\_4](https://ai.google.dev/gemma/docs/core/model_card_4)Accessed 2026\-05\-23Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.17.17.5.1.1),[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.18.18.5.1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. 
Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. 
Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. 
Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\)The Llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.20.20.5.1.1),[Figure 4](https://arxiv.org/html/2606.07167#S5.F4)\.
- M\. Hardalov, T\. Mihaylov, D\. Zlatkova, Y\. Dinkov, I\. Koychev, and P\. Nakov \(2020\)EXAMS: a multi\-subject high school examinations dataset for cross\-lingual and multilingual question answering\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 5427–5444\.External Links:[Link](https://aclanthology.org/2020.emnlp-main.438/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.438)Cited by:[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px2.p2.1)\.
- M\. T\. Hassan, J\. Ahmed, and M\. Awais \(2026\)Qalb: largest state\-of\-the\-art Urdu large language model for 230m speakers with systematic continued pre\-training\.External Links:2601\.08141,[Link](https://arxiv.org/abs/2601.08141)Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.5.5.5.1.1),[§5\.1](https://arxiv.org/html/2606.07167#S5.SS1.p2.10)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=d7KBjmI3GmQ)Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p1.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Kazi and S\. Khoja \(2021\)UQuAD1\.0: development of an Urdu question answering training data for machine reading comprehension\.arXiv preprint arXiv:2111\.01543\.External Links:2111\.01543,[Link](https://arxiv.org/abs/2111.01543)Cited by:[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Kazi and S\. Khoja \(2026\)UQuAD\+: benchmark dataset for Urdu machine reading comprehension\.ACM Trans\. Asian Low\-Resour\. Lang\. Inf\. Process\.25\(2\)\.External Links:ISSN 2375\-4699,[Link](https://doi.org/10.1145/3759455),[Document](https://dx.doi.org/10.1145/3759455)Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p2.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Kazi, M\. Rahim, and S\. A\. Khoja \(2025\)Crossing language boundaries: evaluation of large language models on Urdu\-English question answering\.InProceedings of the First Workshop on Natural Language Processing for Indo\-Aryan and Dravidian Languages,R\. Weerasinghe, I\. Anuradha, and D\. Sumanathilaka \(Eds\.\),Abu Dhabi,pp\. 141–151\.External Links:[Link](https://aclanthology.org/2025.indonlp-1.17/)Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p2.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px1.p1.1)\.
- S\. KJ, A\. Kumar, L\. Balaji, N\. Kotecha, V\. Jain, A\. Chadha, and S\. Bhaduri \(2025\)IndicMMLU\-Pro: benchmarking Indic large language models on multi\-task language understanding\.External Links:2501\.15747,[Link](https://arxiv.org/abs/2501.15747)Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p2.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px2.p2.1)\.
- F\. Koto, N\. Aisyah, H\. Li, and T\. Baldwin \(2023\)Large language models only pass primary school exams in Indonesia: a comprehensive test on IndoMMLU\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 12359–12374\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.760/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.760)Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p3.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px3.p1.1)\.
- F\. Koto, H\. Li, S\. Shatnawi, J\. Doughman, A\. Sadallah, A\. Alraeesi, K\. Almubarak, Z\. Alyafeai, N\. Sengupta, S\. Shehata, N\. Habash, P\. Nakov, and T\. Baldwin \(2024\)ArabicMMLU: assessing massive multitask language understanding in Arabic\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 5622–5640\.External Links:[Link](https://aclanthology.org/2024.findings-acl.334/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.334)Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p3.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Li, Y\. Zhang, F\. Koto, Y\. Yang, H\. Zhao, Y\. Gong, N\. Duan, and T\. Baldwin \(2024\)CMMLU: measuring massive multitask language understanding in Chinese\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 11260–11285\.External Links:[Link](https://aclanthology.org/2024.findings-acl.671/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.671)Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p3.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\)Let’s verify step by step\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=v8L0pN6EOi)Cited by:[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px1.p1.1)\.
- A\. H\. Liu, K\. Khandelwal, S\. Subramanian, V\. Jouault, A\. Rastogi, A\. Sadé, A\. Jeffares, A\. Jiang, A\. Cahill, A\. Gavaudan, A\. Sablayrolles, A\. Héliou, A\. You, A\. Ehrenberg, A\. Lo, A\. Eliseev, A\. Calvi, A\. Sooriyarachchi, B\. Bout, B\. Rozière, B\. D\. Monicault, C\. Lanfranchi, C\. Barreau, C\. Courtot, D\. Grattarola, D\. Dabert, D\. de las Casas, E\. Chane\-Sane, F\. Ahmed, G\. Berrada, G\. Ecrepont, G\. Guinet, G\. Novikov, G\. Kunsch, G\. Lample, G\. Martin, G\. Gupta, J\. Ludziejewski, J\. Rute, J\. Studnia, J\. Amar, J\. Delas, J\. S\. Roberts, K\. Yadav, K\. Chandu, K\. Jain, L\. Aitchison, L\. Fainsin, L\. Blier, L\. Zhao, L\. Martin, L\. Saulnier, L\. Gao, M\. Buyl, M\. Jennings, M\. Pellat, M\. Prins, M\. Poirée, M\. Guillaumin, M\. Dinot, M\. Futeral, M\. Darrin, M\. Augustin, M\. Chiquier, M\. Schimpf, N\. Grinsztajn, N\. Gupta, N\. Raghuraman, O\. Bousquet, O\. Duchenne, P\. Wang, P\. von Platen, P\. Jacob, P\. Wambergue, P\. Kurylowicz, P\. R\. Muddireddy, P\. Chagniot, P\. Stock, P\. Agrawal, Q\. Torroba, R\. Sauvestre, R\. Soletskyi, R\. Menneer, S\. Vaze, S\. Barry, S\. Gandhi, S\. Waghjale, S\. Gandhi, S\. Ghosh, S\. Mishra, S\. Aithal, S\. Antoniak, T\. L\. Scao, T\. Cachet, T\. S\. Sorg, T\. Lavril, T\. N\. Saada, T\. Chabal, T\. Foubert, T\. Robert, T\. Wang, T\. Lawson, T\. Bewley, T\. Bewley, T\. Edwards, U\. Jamil, U\. Tomasini, V\. Nemychnikova, V\. Phung, V\. Maladière, V\. Richard, W\. Bouaziz, W\. Li, W\. Marshall, X\. Li, X\. Yang, Y\. E\. Ouahidi, Y\. Wang, Y\. Tang, and Z\. Ramzi \(2026\)Ministral 3\.External Links:2601\.08584,[Link](https://arxiv.org/abs/2601.08584)Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.3.3.5.1.1),[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.4.4.5.1.1),[§5\.1](https://arxiv.org/html/2606.07167#S5.SS1.p2.10)\.
- Meta \(2024a\)Llama 3\.2 3b instruct model card\.Note:[https://huggingface\.co/meta\-llama/Llama\-3\.2\-3B\-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)Accessed 2026\-05\-23Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.19.19.5.1.1)\.
- Meta \(2024b\)Llama 3\.3 70b instruct model card\.Note:[https://huggingface\.co/meta\-llama/Llama\-3\.3\-70B\-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)Accessed 2026\-05\-23Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.23.23.5.1.1)\.
- Meta \(2025\)Llama 4 model card\.Note:[https://github\.com/meta\-llama/llama\-models/blob/main/models/llama4/MODEL\_CARD\.md](https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md)Accessed 2026\-05\-23Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.21.21.5.1.1),[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.22.22.5.1.1),[§5\.4](https://arxiv.org/html/2606.07167#S5.SS4.p1.6)\.
- Microsoft, :, A\. Abouelenin, A\. Ashfaq, A\. Atkinson, H\. Awadalla, N\. Bach, J\. Bao, A\. Benhaim, M\. Cai, V\. Chaudhary, C\. Chen, D\. Chen, D\. Chen, J\. Chen, W\. Chen, Y\. Chen, Y\. Chen, Q\. Dai, X\. Dai, R\. Fan, M\. Gao, M\. Gao, A\. Garg, A\. Goswami, J\. Hao, A\. Hendy, Y\. Hu, X\. Jin, M\. Khademi, D\. Kim, Y\. J\. Kim, G\. Lee, J\. Li, Y\. Li, C\. Liang, X\. Lin, Z\. Lin, M\. Liu, Y\. Liu, G\. Lopez, C\. Luo, P\. Madan, V\. Mazalov, A\. Mitra, A\. Mousavi, A\. Nguyen, J\. Pan, D\. Perez\-Becker, J\. Platin, T\. Portet, K\. Qiu, B\. Ren, L\. Ren, S\. Roy, N\. Shang, Y\. Shen, S\. Singhal, S\. Som, X\. Song, T\. Sych, P\. Vaddamanu, S\. Wang, Y\. Wang, Z\. Wang, H\. Wu, H\. Xu, W\. Xu, Y\. Yang, Z\. Yang, D\. Yu, I\. Zabir, J\. Zhang, L\. L\. Zhang, Y\. Zhang, and X\. Zhou \(2025\)Phi\-4\-Mini technical report: compact yet powerful multimodal language models via Mixture\-of\-LoRAs\.External Links:2503\.01743,[Link](https://arxiv.org/abs/2503.01743)Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.24.24.5.1.1)\.
- T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal \(2018\)Can a suit of armor conduct electricity? a new dataset for open book question answering\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium,pp\. 2381–2391\.External Links:[Link](https://aclanthology.org/D18-1260/),[Document](https://dx.doi.org/10.18653/v1/D18-1260)Cited by:[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Muennighoff, T\. Wang, L\. Sutawika, A\. Roberts, S\. Biderman, T\. Le Scao, M\. S\. Bari, S\. Shen, Z\. X\. Yong, H\. Schoelkopf, X\. Tang, D\. Radev, A\. F\. Aji, K\. Almubarak, S\. Albanie, Z\. Alyafeai, A\. Webson, E\. Raff, and C\. Raffel \(2023\)Crosslingual generalization through multitask finetuning\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 15991–16111\.External Links:[Link](https://aclanthology.org/2023.acl-long.891/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.891)Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.10.10.5.1.1),[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.11.11.5.1.1),[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.12.12.5.1.1),[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.13.13.5.1.1),[§5\.1](https://arxiv.org/html/2606.07167#S5.SS1.p2.10)\.
- P\. Rajpurkar, R\. Jia, and P\. Liang \(2018\)Know what you don’t know: unanswerable questions for SQuAD\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),I\. Gurevych and Y\. Miyao \(Eds\.\),Melbourne, Australia,pp\. 784–789\.External Links:[Link](https://aclanthology.org/P18-2124/),[Document](https://dx.doi.org/10.18653/v1/P18-2124)Cited by:[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Romanou, N\. Foroutan, A\. Sotnikova, S\. H\. Nelaturu, S\. Singh, R\. Maheshwary, M\. Altomare, Z\. Chen, M\. A\. Haggag, S\. A, A\. Amayuelas, A\. H\. Amirudin, D\. Boiko, M\. Chang, J\. Chim, G\. Cohen, A\. K\. Dalmia, A\. Diress, S\. Duwal, D\. Dzenhaliou, D\. F\. E\. Florez, F\. Farestam, J\. M\. Imperial, S\. B\. Islam, P\. Isotalo, M\. Jabbarishiviari, B\. F\. Karlsson, E\. Khalilov, C\. Klamm, F\. Koto, D\. Krzemiński, G\. A\. de Melo, S\. Montariol, Y\. Nan, J\. Niklaus, J\. Novikova, J\. S\. O\. Ceron, D\. Paul, E\. Ploeger, J\. Purbey, S\. Rajwal, S\. S\. Ravi, S\. Rydell, R\. Santhosh, D\. Sharma, M\. P\. Skenduli, A\. S\. Moakhar, B\. soltani moakhar, A\. K\. Tarun, A\. T\. Wasi, T\. O\. Weerasinghe, S\. Yilmaz, M\. Zhang, I\. Schlag, M\. Fadaee, S\. Hooker, and A\. Bosselut \(2025\)INCLUDE: evaluating multilingual language understanding with regional knowledge\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=k3gCieTXeY)Cited by:[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px2.p2.1)\.
- M\. A\. Shafique, K\. Mehreen, M\. Arham, M\. Amjad, S\. Butt, and H\. Farooq \(2025\)Alif: advancing Urdu large language models via multilingual synthetic data distillation\.InProceedings of the 5th Workshop on Multilingual Representation Learning \(MRL 2025\),D\. I\. Adelani, C\. Arnett, D\. Ataman, T\. A\. Chang, H\. Gonen, R\. Raja, F\. Schmidt, D\. Stap, and J\. Wang \(Eds\.\),Suzhou, China,pp\. 271–284\.External Links:[Link](https://aclanthology.org/2025.mrl-main.19/),[Document](https://dx.doi.org/10.18653/v1/2025.mrl-main.19),ISBN 979\-8\-89176\-345\-6Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.2.2.5.1.1),[§5\.1](https://arxiv.org/html/2606.07167#S5.SS1.p2.10)\.
- M\. Shafique, A\. Mehboob, L\. Fiaz, M\. Qadeer, and H\. Farooq \(2026\)Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p2.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px1.p1.1)\.
- F\. Shi, M\. Suzgun, M\. Freitag, X\. Wang, S\. Srivats, S\. Vosoughi, H\. W\. Chung, Y\. Tay, S\. Ruder, D\. Zhou, D\. Das, and J\. Wei \(2023\)Language models are multilingual chain\-of\-thought reasoners\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=fR3wGCk-IXp)Cited by:[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram, A\. Nathan, A\. Luo, A\. Helyar, A\. Madry, A\. Efremov, A\. Spyra, A\. Baker\-Whitcomb, A\. Beutel, A\. Karpenko, A\. Makelov, A\. Neitz, A\. Wei, A\. Barr, A\. Kirchmeyer, A\. Ivanov, A\. Christakis, A\. Gillespie, A\. Tam, A\. Bennett, A\. Wan, A\. Huang, A\. M\. Sandjideh, A\. Yang, A\. Kumar, A\. Saraiva, A\. Vallone, A\. Gheorghe, A\. G\. Garcia, A\. Braunstein, A\. Liu, A\. Schmidt, A\. Mereskin, A\. Mishchenko, A\. Applebaum, A\. Rogerson, A\. Rajan, A\. Wei, A\. Kotha, A\. Srivastava, A\. Agrawal, A\. Vijayvergiya, A\. Tyra, A\. Nair, A\. Nayak, B\. Eggers, B\. Ji, B\. Hoover, B\. Chen, B\. Chen, B\. Barak, B\. Minaiev, B\. Hao, B\. Baker, B\. Lightcap, B\. McKinzie, B\. Wang, B\. Quinn, B\. Fioca, B\. Hsu, B\. Yang, B\. Yu, B\. Zhang, B\. Brenner, C\. R\. Zetino, C\. Raymond, C\. Lugaresi, C\. Paz, C\. Hudson, C\. Whitney, C\. Li, C\. Chen, C\. Cole, C\. Voss, C\. Ding, C\. Shen, C\. Huang, C\. Colby, C\. Hallacy, C\. Koch, C\. Lu, C\. Kaplan, C\. Kim, C\. Minott\-Henriques, C\. Frey, C\. Yu, C\. Czarnecki, C\. Reid, C\. Wei, C\. Decareaux, C\. Scheau, C\. Zhang, C\. Forbes, D\. Tang, D\. Goldberg, D\. Roberts, D\. Palmie, D\. Kappler, D\. Levine, D\. Wright, D\. Leo, D\. Lin, D\. Robinson, D\. Grabb, D\. Chen, D\. Lim, D\. Salama, D\. Bhattacharjee, D\. Tsipras, D\. Li, D\. Yu, D\. Strouse, D\. Williams, D\. Hunn, E\. Bayes, E\. Arbus, E\. Akyurek, E\. Y\. Le, E\. Widmann, E\. Yani, E\. Proehl, E\. Sert, E\. Cheung, E\. Schwartz, E\. Han, E\. Jiang, E\. Mitchell, E\. Sigler, E\. Wallace, E\. Ritter, E\. Kavanaugh, E\. Mays, E\. Nikishin, F\. Li, F\. P\. Such, F\. de Avila Belbute Peres, F\. Raso, F\. Bekerman, F\. Tsimpourlas, F\. Chantzis, F\. Song, F\. Zhang, G\. Raila, G\. McGrath, G\. Briggs, G\. Yang, G\. Parascandolo, G\. Chabot, G\. Kim, G\. Zhao, G\. Valiant, G\. Leclerc, H\. Salman, H\. Wang, H\. Sheng, H\. Jiang, H\. 
Wang, H\. Jin, H\. Sikchi, H\. Schmidt, H\. Aspegren, H\. Chen, H\. Qiu, H\. Lightman, I\. Covert, I\. Kivlichan, I\. Silber, I\. Sohl, I\. Hammoud, I\. Clavera, I\. Lan, I\. Akkaya, I\. Kostrikov, I\. Kofman, I\. Etinger, I\. Singal, J\. Hehir, J\. Huh, J\. Pan, J\. Wilczynski, J\. Pachocki, J\. Lee, J\. Quinn, J\. Kiros, J\. Kalra, J\. Samaroo, J\. Wang, J\. Wolfe, J\. Chen, J\. Wang, J\. Harb, J\. Han, J\. Wang, J\. Zhao, J\. Chen, J\. Yang, J\. Tworek, J\. Chand, J\. Landon, J\. Liang, J\. Lin, J\. Liu, J\. Wang, J\. Tang, J\. Yin, J\. Jang, J\. Morris, J\. Flynn, J\. Ferstad, J\. Heidecke, J\. Fishbein, J\. Hallman, J\. Grant, J\. Chien, J\. Gordon, J\. Park, J\. Liss, J\. Kraaijeveld, J\. Guay, J\. Mo, J\. Lawson, J\. McGrath, J\. Vendrow, J\. Jiao, J\. Lee, J\. Steele, J\. Wang, J\. Mao, K\. Chen, K\. Hayashi, K\. Xiao, K\. Salahi, K\. Wu, K\. Sekhri, K\. Sharma, K\. Singhal, K\. Li, K\. Nguyen, K\. Gu\-Lemberg, K\. King, K\. Liu, K\. Stone, K\. Yu, K\. Ying, K\. Georgiev, K\. Lim, K\. Tirumala, K\. Miller, L\. Ahmad, L\. Lv, L\. Clare, L\. Fauconnet, L\. Itow, L\. Yang, L\. Romaniuk, L\. Anise, L\. Byron, L\. Pathak, L\. Maksin, L\. Lo, L\. Ho, L\. Jing, L\. Wu, L\. Xiong, L\. Mamitsuka, L\. Yang, L\. McCallum, L\. Held, L\. Bourgeois, L\. Engstrom, L\. Kuhn, L\. Feuvrier, L\. Zhang, L\. Switzer, L\. Kondraciuk, L\. Kaiser, M\. Joglekar, M\. Singh, M\. Shah, M\. Stratta, M\. Williams, M\. Chen, M\. Sun, M\. Cayton, M\. Li, M\. Zhang, M\. Aljubeh, M\. Nichols, M\. Haines, M\. Schwarzer, M\. Gupta, M\. Shah, M\. Y\. Guan, M\. Huang, M\. Dong, M\. Wang, M\. Glaese, M\. Carroll, M\. Lampe, M\. Malek, M\. Sharman, M\. Zhang, M\. Wang, M\. Pokrass, M\. Florian, M\. Pavlov, M\. Wang, M\. Chen, M\. Wang, M\. Feng, M\. Bavarian, M\. Lin, M\. Abdool, M\. Rohaninejad, N\. Soto, N\. Staudacher, N\. LaFontaine, N\. Marwell, N\. Liu, N\. Preston, N\. Turley, N\. Ansman, N\. Blades, N\. Pancha, N\. Mikhaylin, N\. Felix, N\. Handa, N\. Rai, N\. Keskar, N\. Brown, O\. 
Nachum, O\. Boiko, O\. Murk, O\. Watkins, O\. Gleeson, P\. Mishkin, P\. Lesiewicz, P\. Baltescu, P\. Belov, P\. Zhokhov, P\. Pronin, P\. Guo, P\. Thacker, Q\. Liu, Q\. Yuan, Q\. Liu, R\. Dias, R\. Puckett, R\. Arora, R\. T\. Mullapudi, R\. Gaon, R\. Miyara, R\. Song, R\. Aggarwal, R\. Marsan, R\. Yemiru, R\. Xiong, R\. Kshirsagar, R\. Nuttall, R\. Tsiupa, R\. Eldan, R\. Wang, R\. James, R\. Ziv, R\. Shu, R\. Nigmatullin, S\. Jain, S\. Talaie, S\. Altman, S\. Arnesen, S\. Toizer, S\. Toyer, S\. Miserendino, S\. Agarwal, S\. Yoo, S\. Heon, S\. Ethersmith, S\. Grove, S\. Taylor, S\. Bubeck, S\. Banesiu, S\. Amdo, S\. Zhao, S\. Wu, S\. Santurkar, S\. Zhao, S\. R\. Chaudhuri, S\. Krishnaswamy, Shuaiqi, Xia, S\. Cheng, S\. Anadkat, S\. P\. Fishman, S\. Tobin, S\. Fu, S\. Jain, S\. Mei, S\. Egoian, S\. Kim, S\. Golden, S\. Mah, S\. Lin, S\. Imm, S\. Sharpe, S\. Yadlowsky, S\. Choudhry, S\. Eum, S\. Sanjeev, T\. Khan, T\. Stramer, T\. Wang, T\. Xin, T\. Gogineni, T\. Christianson, T\. Sanders, T\. Patwardhan, T\. Degry, T\. Shadwell, T\. Fu, T\. Gao, T\. Garipov, T\. Sriskandarajah, T\. Sherbakov, T\. Korbak, T\. Kaftan, T\. Hiratsuka, T\. Wang, T\. Song, T\. Zhao, T\. Peterson, V\. Kharitonov, V\. Chernova, V\. Kosaraju, V\. Kuo, V\. Pong, V\. Verma, V\. Petrov, W\. Jiang, W\. Zhang, W\. Zhou, W\. Xie, W\. Zhan, W\. McCabe, W\. DePue, W\. Ellsworth, W\. Bain, W\. Thompson, X\. Chen, X\. Qi, X\. Xiang, X\. Shi, Y\. Dubois, Y\. Yu, Y\. Khakbaz, Y\. Wu, Y\. Qian, Y\. T\. Lee, Y\. Chen, Y\. Zhang, Y\. Xiong, Y\. Tian, Y\. Cha, Y\. Bai, Y\. Yang, Y\. Yuan, Y\. Li, Y\. Zhang, Y\. Yang, Y\. Jin, Y\. Jiang, Y\. Wang, Y\. Wang, Y\. Liu, Z\. Stubenvoll, Z\. Dou, Z\. Wu, and Z\. Wang \(2026\)OpenAI GPT\-5 System Card\.External Links:2601\.03267,[Link](https://arxiv.org/abs/2601.03267)Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.30.30.5.1.1),[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.31.31.5.1.1),[§5\.1](https://arxiv.org/html/2606.07167#S5.SS1.p1.3)\.
- S\. Singh, A\. Romanou, C\. Fourrier, D\. I\. Adelani, J\. G\. Ngui, D\. Vila\-Suero, P\. Limkonchotiwat, K\. Marchisio, W\. Q\. Leong, Y\. Susanto, R\. Ng, S\. Longpre, S\. Ruder, W\. Ko, A\. Bosselut, A\. Oh, A\. Martins, L\. Choshen, D\. Ippolito, E\. Ferrante, M\. Fadaee, B\. Ermis, and S\. Hooker \(2025\)Global MMLU: understanding and addressing cultural and linguistic biases in multilingual evaluation\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 18761–18799\.External Links:[Link](https://aclanthology.org/2025.acl-long.919/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.919),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p2.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px2.p1.1)\.
- G\. Son, H\. Lee, S\. Kim, S\. Kim, N\. Muennighoff, T\. Choi, C\. Park, K\. M\. Yoo, and S\. Biderman \(2025\)KMMLU: measuring massive multitask language understanding in Korean\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 4076–4104\.External Links:[Link](https://aclanthology.org/2025.naacl-long.206/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.206),ISBN 979\-8\-89176\-189\-6Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p3.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px3.p1.1)\.
- M\. H\. Tahir, S\. Shams, L\. Fiaz, F\. Adeeba, and S\. Hussain \(2025\)Benchmarking the performance of pre\-trained LLMs across Urdu NLP tasks\.InProceedings of the First Workshop on Challenges in Processing South Asian Languages \(CHiPSAL 2025\),K\. Sarveswaran, A\. Vaidya, B\. Krishna Bal, S\. Shams, and S\. Thapa \(Eds\.\),Abu Dhabi, UAE,pp\. 17–34\.External Links:[Link](https://aclanthology.org/2025.chipsal-1.3/)Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p2.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant \(2019\)CommonsenseQA: a question answering challenge targeting commonsense knowledge\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),J\. Burstein, C\. Doran, and T\. Solorio \(Eds\.\),Minneapolis, Minnesota,pp\. 4149–4158\.External Links:[Link](https://aclanthology.org/N19-1421/),[Document](https://dx.doi.org/10.18653/v1/N19-1421)Cited by:[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px1.p1.1)\.
- G\. Team, M\. Riviere, S\. Pathak, P\. G\. Sessa, C\. Hardin, S\. Bhupatiraju, L\. Hussenot, T\. Mesnard, B\. Shahriari, A\. Ramé, J\. Ferret, P\. Liu, P\. Tafti, A\. Friesen, M\. Casbon, S\. Ramos, R\. Kumar, C\. L\. Lan, S\. Jerome, A\. Tsitsulin, N\. Vieillard, P\. Stanczyk, S\. Girgin, N\. Momchev, M\. Hoffman, S\. Thakoor, J\. Grill, B\. Neyshabur, O\. Bachem, A\. Walton, A\. Severyn, A\. Parrish, A\. Ahmad, A\. Hutchison, A\. Abdagic, A\. Carl, A\. Shen, A\. Brock, A\. Coenen, A\. Laforge, A\. Paterson, B\. Bastian, B\. Piot, B\. Wu, B\. Royal, C\. Chen, C\. Kumar, C\. Perry, C\. Welty, C\. A\. Choquette\-Choo, D\. Sinopalnikov, D\. Weinberger, D\. Vijaykumar, D\. Rogozińska, D\. Herbison, E\. Bandy, E\. Wang, E\. Noland, E\. Moreira, E\. Senter, E\. Eltyshev, F\. Visin, G\. Rasskin, G\. Wei, G\. Cameron, G\. Martins, H\. Hashemi, H\. Klimczak\-Plucińska, H\. Batra, H\. Dhand, I\. Nardini, J\. Mein, J\. Zhou, J\. Svensson, J\. Stanway, J\. Chan, J\. P\. Zhou, J\. Carrasqueira, J\. Iljazi, J\. Becker, J\. Fernandez, J\. van Amersfoort, J\. Gordon, J\. Lipschultz, J\. Newlan, J\. Ji, K\. Mohamed, K\. Badola, K\. Black, K\. Millican, K\. McDonell, K\. Nguyen, K\. Sodhia, K\. Greene, L\. L\. Sjoesund, L\. Usui, L\. Sifre, L\. Heuermann, L\. Lago, L\. McNealus, L\. B\. Soares, L\. Kilpatrick, L\. Dixon, L\. Martins, M\. Reid, M\. Singh, M\. Iverson, M\. Görner, M\. Velloso, M\. Wirth, M\. Davidow, M\. Miller, M\. Rahtz, M\. Watson, M\. Risdal, M\. Kazemi, M\. Moynihan, M\. Zhang, M\. Kahng, M\. Park, M\. Rahman, M\. Khatwani, N\. Dao, N\. Bardoliwalla, N\. Devanathan, N\. Dumai, N\. Chauhan, O\. Wahltinez, P\. Botarda, P\. Barnes, P\. Barham, P\. Michel, P\. Jin, P\. Georgiev, P\. Culliton, P\. Kuppala, R\. Comanescu, R\. Merhej, R\. Jana, R\. A\. Rokni, R\. Agarwal, R\. Mullins, S\. Saadat, S\. M\. Carthy, S\. Cogan, S\. Perrin, S\. M\. R\. Arnold, S\. Krause, S\. Dai, S\. Garg, S\. Sheth, S\. Ronstrom, S\. Chan, T\. Jordan, T\. Yu, T\. Eccles, T\. 
Hennigan, T\. Kocisky, T\. Doshi, V\. Jain, V\. Yadav, V\. Meshram, V\. Dharmadhikari, W\. Barkley, W\. Wei, W\. Ye, W\. Han, W\. Kwon, X\. Xu, Z\. Shen, Z\. Gong, Z\. Wei, V\. Cotruta, P\. Kirk, A\. Rao, M\. Giang, L\. Peran, T\. Warkentin, E\. Collins, J\. Barral, Z\. Ghahramani, R\. Hadsell, D\. Sculley, J\. Banks, A\. Dragan, S\. Petrov, O\. Vinyals, J\. Dean, D\. Hassabis, K\. Kavukcuoglu, C\. Farabet, E\. Buchatskaya, S\. Borgeaud, N\. Fiedel, A\. Joulin, K\. Kenealy, R\. Dadashi, and A\. Andreev \(2024\)Gemma 2: improving open language models at a practical size\.External Links:2408\.00118,[Link](https://arxiv.org/abs/2408.00118)Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.16.16.5.1.1),[§5\.1](https://arxiv.org/html/2606.07167#S5.SS1.p2.10)\.
- M\. Togmanov, N\. Mukhituly, D\. Turmakhan, J\. Mansurov, M\. Goloburda, A\. Sakip, Z\. Xie, Y\. Wang, B\. Syzdykov, N\. Laiyk, A\. F\. Aji, E\. Kochmar, P\. Nakov, and F\. Koto \(2025\)KazMMLU: evaluating language models on Kazakh, Russian, and regional knowledge of Kazakhstan\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 14403–14416\.External Links:[Link](https://aclanthology.org/2025.acl-long.701/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.701),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p3.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Verma, M\. S\. U\. R\. Khan, V\. Kumar, R\. Murthy, and J\. Sen \(2025\)MILU: a multi\-task Indic language understanding benchmark\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 10076–10132\.External Links:[Link](https://aclanthology.org/2025.naacl-long.507/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.507),ISBN 979\-8\-89176\-189\-6Cited by:[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px2.p2.1)\.
- Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, T\. Li, M\. Ku, K\. Wang, A\. Zhuang, R\. Fan, X\. Yue, and W\. Chen \(2024\)MMLU\-Pro: a more robust and challenging multi\-task language understanding benchmark\.InProceedings of the 38th International Conference on Neural Information Processing Systems,NIPS ’24,Red Hook, NY, USA\.External Links:ISBN 9798331314385,[Link](https://dl.acm.org/doi/10.5555/3737916.3740934)Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p1.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Xuan, R\. Yang, H\. Qi, Q\. Zeng, Y\. Xiao, A\. Feng, D\. Liu, Y\. Xing, J\. Wang, F\. Gao, J\. Lu, Y\. Jiang, H\. Li, X\. Li, K\. Yu, R\. Dong, S\. Gu, Y\. Li, X\. Xie, F\. Juefei\-Xu, F\. Khomh, O\. Yoshie, Q\. Chen, D\. Teodoro, N\. Liu, R\. Goebel, L\. Ma, E\. Marrese\-Taylor, S\. Lu, Y\. Iwasawa, Y\. Matsuo, and I\. Li \(2025\)MMLU\-ProX: a multilingual benchmark for advanced large language model evaluation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 1513–1532\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.79/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.79),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2606.07167#S1.p2.1),[§2](https://arxiv.org/html/2606.07167#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.6.6.5.1.1),[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.7.7.5.1.1),[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.8.8.5.1.1),[Table 14](https://arxiv.org/html/2606.07167#A4.T14.3.9.9.5.1.1),[§5\.1](https://arxiv.org/html/2606.07167#S5.SS1.p2.10)\.

## Appendix A Candidate Pool Analysis

We construct UrduMMLU in two stages\. First, an automatic preprocessing pipeline collects and cleans multiple\-choice questions from Pakistani examination boards and Urdu MCQ websites to produce a candidate pool\. Second, annotation, verification, deduplication, and balancing transform this pool into the final benchmark used in all evaluations\. This appendix analyzes both stages and shows how the dataset composition changes throughout the construction process\.

Figure [5](https://arxiv.org/html/2606.07167#A1.F5) shows the distribution of UrduMMLU items across the four Pakistani examination levels: SSC\-I, SSC\-II, HSSC\-I, and HSSC\-II\. The left panel reports absolute item counts, while the right panel reports the within\-level domain distribution\. The figure highlights two consistent trends\. First, Humanities dominates the SSC levels, where language and literature subjects occupy a larger portion of the curriculum\. Second, STEM and Social Sciences become more prominent at the HSSC levels, where students specialize into science, commerce, and humanities tracks\. The level distribution therefore reflects the structure of the Pakistani curriculum rather than collection artifacts\.

We also analyze question length because stem length can influence model performance and varies across domains\. Figure [6](https://arxiv.org/html/2606.07167#A1.F6) summarizes the overall length distribution and the domain\-wise split between short and long stems using a 9\-word threshold\. Most UrduMMLU stems are short, but the dataset retains a substantial long\-question tier\. STEM has the most balanced short/long distribution, while Humanities and Profession contain relatively more short stems\.
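The short/long split described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that question length is counted in whitespace-delimited words (the text does not state the tokenization), and the names `length_tier` and `tier_counts_by_domain` are hypothetical.

```python
from collections import Counter

# Threshold stated in the text: stems with at most 9 words count as short.
SHORT_MAX_WORDS = 9


def length_tier(stem: str) -> str:
    """Return 'short' or 'long' from a whitespace-based word count (an assumption)."""
    return "short" if len(stem.split()) <= SHORT_MAX_WORDS else "long"


def tier_counts_by_domain(items):
    """Count short and long stems per domain; items are (domain, stem) pairs."""
    counts = Counter()
    for domain, stem in items:
        counts[(domain, length_tier(stem))] += 1
    return counts
```

Dividing each count by its domain total would give within-domain percentages of the kind shown in the right panel of Figure 6.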

![Refer to caption](https://arxiv.org/html/2606.07167v1/x5.png) Figure 5: Distribution of UrduMMLU items across Pakistani examination levels, grouped by domain\. Left: absolute item counts per level\. Right: within\-level domain distribution\. Humanities dominates SSC\-I and SSC\-II, while STEM and Social Sciences become more prominent at the HSSC levels, reflecting the structure of the Pakistani secondary\-school curriculum\.

![Refer to caption](https://arxiv.org/html/2606.07167v1/x6.png)\(a\) Distribution of question lengths in words\. The dashed vertical line at 9 words marks the short/long boundary\.
![Refer to caption](https://arxiv.org/html/2606.07167v1/x7.png)\(b\) Domain\-wise counts of short and long questions, with within\-domain percentages shown in parentheses\.

Figure 6: Question\-length analysis for UrduMMLU\. Left: histogram of question lengths\. Right: domain\-wise counts of short \(≤ 9 words\) and long \(\> 9 words\) questions\. STEM is closest to a balanced split, while Humanities and Profession skew shorter\.

### A\.1 Source and Domain Distributions

Tables [6](https://arxiv.org/html/2606.07167#A1.T6) and [7](https://arxiv.org/html/2606.07167#A1.T7) compare the cleaned candidate pool \(Raw\) and released benchmark \(Final\), distinguishing the initial collection distribution from the curated evaluation benchmark\.

| Source | Raw | Final | % Share |
|---|---|---|---|
| Ustad 360 | 23,788 | 11,068 | 41\.9% |
| MCQTimes | 6,099 | 5,918 | 22\.4% |
| TestPointPK | 3,619 | 3,502 | 13\.2% |
| ETest | 3,102 | 2,783 | 10\.5% |
| FBISE | 2,406 | 1,459 | 5\.5% |
| ExamAunty | 643 | 540 | 2\.0% |
| GoTest | 566 | 515 | 1\.9% |
| PakMCQs | 434 | 414 | 1\.6% |
| BISE Multan 2025 | 440 | 232 | 0\.9% |
| Total | 40,427 | 26,431 | 100\.0 |

Table 6: Source distribution of the cleaned candidate pool \(Raw\) and the released UrduMMLU benchmark \(Final\)\. Percentages and share bars correspond to the final benchmark distribution\.

| Domain | Raw | Final | % Share |
|---|---|---|---|
| Humanities | 11,539 | 11,010 | 41\.7% |
| Social Sciences | 14,626 | 7,968 | 30\.2% |
| STEM | 11,590 | 5,113 | 19\.3% |
| Other | 2,030 | 1,365 | 5\.2% |
| Profession | 642 | 975 | 3\.7% |
| Total | 40,427 | 26,431 | 100\.0 |

Table 7: Domain distribution before and after final benchmark selection\. Raw denotes the cleaned candidate pool, while Final denotes the released UrduMMLU benchmark\.

##### Source distribution:

The cleaned candidate pool contains 40,427 items collected from nine Pakistani examination and MCQ\-bank sources, of which 26,431 survive into the final benchmark\. Table[6](https://arxiv.org/html/2606.07167#A1.T6)shows that the raw pool is heavily concentrated in a few large sources\. Ustad 360 alone contributes 23,788 raw items, while the four largest sources together account for more than 90% of the pool\. The final benchmark is substantially less skewed\.

Annotation, deduplication, and balancing reduce the relative share of the largest sources, while smaller sources such as MCQTimes, TestPointPK, and ETest contribute proportionally more to the released benchmark\. BISE Multan 2025 shows the largest reduction because of a high duplicate rate against other examination sources\. Figure [7](https://arxiv.org/html/2606.07167#A1.F7) expands the domain\-level statistics from Table [7](https://arxiv.org/html/2606.07167#A1.T7) to the subdomain level\. Humanities is dominated by Urdu Literature and Urdu Language, whereas Social Sciences and STEM distribute more evenly across multiple medium\-sized subdomains\.

![Refer to caption](https://arxiv.org/html/2606.07167v1/x8.png) Figure 7: Final UrduMMLU item counts by subdomain, grouped by domain\. Urdu Literature and Urdu Language contribute the largest shares, while Social Sciences and STEM distribute across a larger number of medium\-sized subdomains\.
##### Domain distribution:

Table [7](https://arxiv.org/html/2606.07167#A1.T7) compares the cleaned candidate pool and the final benchmark across domains\. The candidate pool distributes relatively evenly across Humanities, Social Sciences, and STEM, while Other and Profession remain much smaller\. The final benchmark shifts toward Humanities, which grows from 28\.5% to 41\.7%, while Social Sciences and STEM decrease to 30\.2% and 19\.3%, respectively\. Profession is the only domain whose absolute count increases during balancing \(642 → 975\), which improves coverage for reliable domain\-level evaluation\.
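The share shift can be recomputed directly from the Raw and Final counts in Table 7. The sketch below is only an arithmetic check; the helper name `shares` is hypothetical, and recomputed values may differ from the reported percentages in the last digit because of rounding.

```python
# Raw and Final item counts per domain, copied from Table 7.
RAW = {"Humanities": 11539, "Social Sciences": 14626, "STEM": 11590,
       "Other": 2030, "Profession": 642}
FINAL = {"Humanities": 11010, "Social Sciences": 7968, "STEM": 5113,
         "Other": 1365, "Profession": 975}


def shares(counts):
    """Return each domain's percentage of the column total, rounded to one decimal."""
    total = sum(counts.values())
    return {domain: round(100 * n / total, 1) for domain, n in counts.items()}


raw_shares = shares(RAW)
final_shares = shares(FINAL)

for domain in RAW:
    print(f"{domain:16s} {raw_shares[domain]:5.1f}% -> {final_shares[domain]:5.1f}%")
```

Comparing the two columns makes the direction of each change explicit: Humanities gains share, Social Sciences and STEM lose share, and Profession is the only domain whose absolute count rises.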

These changes reflect a deliberate balancing step rather than artifacts of preprocessing or cleaning\. We down\-sample overrepresented STEM and Social Sciences items and preserve underrepresented Profession items to better align the benchmark with the structure of the Pakistani SSC and HSSC curriculum shown in Figure [5](https://arxiv.org/html/2606.07167#A1.F5)\. This process improves coverage across domains while maintaining alignment with the underlying curriculum, resulting in a benchmark that provides a balanced representation of subjects encountered in Pakistani education\.

## Appendix B Annotation Details

The exam\-derived portion of UrduMMLU came from Pakistani examination boards and MCQ sources that did not provide answer keys\. To produce reliable gold labels, we recruited 17 Urdu\-fluent annotators and ran a dual\-annotator consensus process supported by a custom dashboard and written guidelines\. This appendix documents the annotator pool, the annotation guidelines and dashboard, the inclusion and edit\-resolution rules, and the resulting agreement statistics\.

### B\.1 Annotator Demographics and Feedback

The annotation pool consisted of 17 annotators recruited for native Urdu fluency and familiarity with the Pakistani school curriculum\. Table [8](https://arxiv.org/html/2606.07167#A2.T8) summarizes the demographic profile of the pool\. The annotators were approximately gender\-balanced \(52\.9% female, 47\.1% male\), predominantly native Urdu speakers \(94\.1%\), and concentrated in the 18–34 age range\. Educationally, 88\.3% held at least a bachelor’s degree and 41\.2% held a master’s degree, which is important for a benchmark that targets SSC\- and HSSC\-level subject content\. Most annotators also reported between one and six years of professional experience\.

After completing their assigned batches, all annotators filled out a post\-task satisfaction survey\. Table [9](https://arxiv.org/html/2606.07167#A2.T9) summarizes the responses\. Feedback was consistently positive, and no annotator selected *Disagree* or *Strongly disagree* for any statement\. Instruction clarity, compensation fairness, guideline usefulness, and overall satisfaction received entirely positive responses\. Task enjoyment drew a minority of neutral responses \(23\.5%\), suggesting that annotators found the process clear and manageable even if not inherently engaging\.

| Attribute | Category | Count | % |
|---|---|---|---|
| Gender | Female | 9 | 52\.9 |
| | Male | 8 | 47\.1 |
| Native Urdu speaker | Yes | 16 | 94\.1 |
| | No | 1 | 5\.9 |
| Age range | 18–24 | 7 | 41\.2 |
| | 25–34 | 10 | 58\.8 |
| Highest completed education | High school diploma | 1 | 5\.9 |
| | Some college / vocational | 1 | 5\.9 |
| | Bachelor’s degree | 8 | 47\.1 |
| | Master’s degree | 7 | 41\.2 |
| Professional work experience | Less than 1 year | 5 | 29\.4 |
| | 1–3 years | 6 | 35\.3 |
| | 4–6 years | 4 | 23\.5 |
| | 7–9 years | 2 | 11\.8 |
| Total annotators | | 17 | 100\.0 |

Table 8: Demographic profile of the UrduMMLU annotator pool \(n = 17; identities anonymised\)\.

Table 9: Post\-task satisfaction survey results \(n = 17, values in %\)\. SA = Strongly agree, A = Agree, N = Neutral\. No annotator selected *Disagree* or *Strongly disagree* on any item, so those columns are omitted\. *Pos\.* is the share of *Agree* plus *Strongly agree*\.

### B\.2 Annotation Guidelines

Before annotation began, we held a live online onboarding session in which we walked through the task end\-to\-end, demonstrated each flag and edit category on real items, and answered annotator questions\. The full written guidelines were also embedded as an always\-available help page inside the annotation dashboard so that annotators could re\-check policies during their work, and admins remained reachable by email throughout the annotation period for cases not covered by the written guidelines\. We also encouraged annotators to consult the guidelines whenever they encountered uncertain or ambiguous cases to ensure consistent decisions across annotation batches\.

##### Task overview:

Annotators were asked to verify the answer to each MCQ by selecting the *single best* option from A/B/C/D\. When multiple options looked plausible, annotators were instructed to select the most precise or directly relevant answer rather than guessing\.

##### Look\-up and abstention policy:

Annotators were encouraged to consult Google or Wikipedia for fact\-based questions \(dates, authors, capitals, scientific terms, historical events\) rather than relying on memory, with a target pace of 15–30 seconds per question including verification\. Annotators were asked to mark an item as *unsure / skip* rather than submit a confident guess in any of the following cases: \(i\) the answer could not be resolved within roughly a minute of search, \(ii\) two or more options remained equally plausible after verification, or \(iii\) the question required specialist context \(e\.g\., niche fiqh details or obscure regional history\) that they could not quickly acquire\.

##### Flag vs\. edit:

Annotators were given a single rule of thumb to choose between the two actions: *edit* when the issue could be fixed in\-place by changing text \(a typo, missing space, duplicated word, wrong subdomain label\), and *flag* when the issue required admin review and could not be repaired by text correction \(no correct answer, multiple correct answers, ambiguity, missing visual, out\-of\-scope content\)\. Annotators were explicitly instructed not to rewrite question semantics, not to “fix” wrong distractors into correct ones, and not to flag a question solely because they had edited a typo in it\.
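As a rough illustration of this rule of thumb, the mapping from issue type to action can be written as a lookup. The issue identifiers and the function `choose_action` below are hypothetical names introduced for this sketch; only the grouping of issues into the two actions comes from the guidelines described above.

```python
# Issues that can be repaired in place by changing text or metadata.
EDIT_ISSUES = {"typo", "missing_space", "duplicated_word", "wrong_subdomain_label"}

# Issues that need admin review and cannot be repaired by text correction.
FLAG_ISSUES = {
    "no_correct_answer",
    "multiple_correct_answers",
    "ambiguity",
    "missing_visual",
    "out_of_scope",
}


def choose_action(issue: str) -> str:
    """Map an issue type to 'edit' or 'flag' following the rule of thumb."""
    if issue in EDIT_ISSUES:
        return "edit"
    if issue in FLAG_ISSUES:
        return "flag"
    raise ValueError(f"unknown issue type: {issue}")
```

Keeping the two sets disjoint mirrors the instruction that fixing a typo is not, on its own, a reason to flag an item.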

##### When to flag:

Figure [8](https://arxiv.org/html/2606.07167#A2.F8) illustrates the five flag categories covered in the guidelines\. These are: \(a\) two or more options are simultaneously correct; \(b\) none of the listed options is the correct answer; \(c\) the question is ambiguous, vague, or under\-specified; \(d\) the question references a diagram, image, or chart that is not included in the text; and \(e\) the question is out of scope for the benchmark \(hyper\-local trivia, sectarian content, opinion questions\)\. For each case, annotators were asked to attach a short free\-text note explaining the issue\. This information helped reviewers verify and resolve flagged items during quality control\. Flagging was independent of answer selection, and annotators could flag with or without picking an option\.

Table 10: Inclusion rules for the final annotated pool\. An item is dropped if any rule fires; only items that pass every check enter the gold\-labelled set\.

##### When to edit:

Figure [9](https://arxiv.org/html/2606.07167#A2.F9) illustrates the three most common edit categories: \(a\) spelling fixes and missing diacritics, where the intended word is clear from context; \(b\) duplicated words and other scraping artifacts; and \(c\) subdomain reassignment, where the original subdomain label is clearly inconsistent with the question content\. Beyond these, the guidelines also permitted spacing corrections, removal of stray punctuation, stripping of redundant in\-text option\-letter prefixes \(e\.g\., A\., B\., or their Urdu equivalents\) already shown by the option badge, and translation of stray English option text when an unambiguous Urdu equivalent existed\. Technical English terms, HTML/CSS tags, proper names, and brand names were left in English\.

##### Workflow:

Annotators worked in batches of approximately 50 MCQs, with accuracy prioritized over speed\.

![Refer to caption](https://arxiv.org/html/2606.07167v1/images_guidelines/flag_multiple_correct.png)\(a\)Two or more options are simultaneously correct\.
![Refer to caption](https://arxiv.org/html/2606.07167v1/images_guidelines/flag_no_correct.png)\(b\)No option in the list is the correct answer\.
![Refer to caption](https://arxiv.org/html/2606.07167v1/images_guidelines/flag_ambiguous.png)\(c\)Question is ambiguous, vague, or under\-specified\.
![Refer to caption](https://arxiv.org/html/2606.07167v1/images_guidelines/flag_missing_visual.png)\(d\)Question references a missing image, diagram, or chart\.
![Refer to caption](https://arxiv.org/html/2606.07167v1/images_guidelines/flag_out_of_scope.png)\(e\)Hyper\-local, sectarian, or opinion content that is out of scope for the benchmark\.

Figure 8: Examples of the five flag categories used in the annotation guidelines\. Annotators were asked to flag the item and attach a short free\-text note for each case\.

![Refer to caption](https://arxiv.org/html/2606.07167v1/images_guidelines/edit_spelling_fix.png)\(a\)Spelling fix: a missing letter is restored from context\.
![Refer to caption](https://arxiv.org/html/2606.07167v1/images_guidelines/edit_duplicated_word.png)\(b\)Duplicated word from a scraping artifact is removed\.
![Refer to caption](https://arxiv.org/html/2606.07167v1/images_guidelines/edit_wrong_subdomain.png)\(c\)Wrong subdomain \(art and drawing\) is reassigned via the dropdown to pakistan studies\.

Figure 9:Examples of the most common edit categories permitted by the annotation guidelines\. Edits are restricted to OCR, scraping, formatting, and metadata corrections; annotators do not rewrite question semantics or modify answer correctness\.

### B\.3Inclusion Rules

We applied a deterministic consensus filter \(Table[10](https://arxiv.org/html/2606.07167#A2.T10)\): an item was retained only when both assigned annotators independently submitted the same valid answer choice and neither flagged nor abstained\. The agreed answer was stored as the final gold label\. Two edge cases require clarification\. First, when both annotators selected an option such as “none of these”, we treated it as a valid agreed answer because such options commonly appear in Pakistani MCQ examinations\. Second, only the explicit*unsure / skip*action counted as abstention\. Missing annotations triggered the incomplete\-annotation rule instead, so abstentions always reflected deliberate annotator decisions\.
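The consensus filter described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the `Annotation` record, the reason strings, and the order in which the rules are checked are assumptions, since Table 10 only states that an item is dropped if any rule fires.

```python
from dataclasses import dataclass
from typing import Optional

VALID_KEYS = {"A", "B", "C", "D"}


@dataclass
class Annotation:
    """One annotator's submission for one item (hypothetical record layout)."""
    answer: Optional[str] = None  # "A".."D", or None if no option was picked
    skipped: bool = False         # explicit "unsure / skip" action
    flagged: bool = False         # flagged for review with a free-text note


def consensus_decision(annotations):
    """Return (gold_label, reason); gold_label is None when the item is dropped.

    The precedence of the checks below is an assumption.
    """
    if len(annotations) < 2:
        return None, "incomplete_annotation"
    first, second = annotations[0], annotations[1]
    if first.flagged or second.flagged:
        return None, "flagged"
    if first.skipped or second.skipped:
        return None, "abstained"
    if first.answer not in VALID_KEYS or second.answer not in VALID_KEYS:
        return None, "incomplete_annotation"
    if first.answer != second.answer:
        return None, "disagreement"
    return first.answer, "retained"
```

Because an option such as "none of these" is stored under an ordinary key, two annotators who both select it agree on that key and the item is retained like any other.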

### B\.4Edit Resolution

In addition to selecting answers, annotators could suggest edits to question text, answer options, or subdomain labels\. These edits targeted minor extraction and metadata issues such as OCR errors, dropped diacritics, malformed option labels, and incorrect subdomain assignments rather than substantive question rewrites\. We resolved all edits deterministically so that the final benchmark could be reconstructed directly from the raw annotations\. Table[11](https://arxiv.org/html/2606.07167#A2.T11)summarizes the resolution rules\.

The resolution policy prioritizes agreed edits when both annotators propose the same correction and otherwise prefers the more conservative or informative revision\. For subdomain edits, we recompute the corresponding domain label from the corrected subdomain to preserve consistency between the two fields in the released benchmark\. This procedure ensures that metadata corrections remain internally consistent throughout the final dataset\.
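The domain recomputation for subdomain edits amounts to a lookup from subdomain to domain. The sketch below assumes a small excerpt of the mapping in Table 13; the function name, the dictionary, and the exact casing of the stored labels are illustrative assumptions, not part of the released tooling.

```python
# Excerpt of the subdomain -> domain mapping from Table 13.
# The stored label casing is an assumption.
SUBDOMAIN_TO_DOMAIN = {
    "Urdu Literature": "Humanities",
    "Fine Arts": "Humanities",
    "Pakistan Studies": "Social Sciences",
    "Physics": "STEM",
    "General Knowledge": "Other",
    "Professional Studies": "Profession",
}


def apply_subdomain_edit(item, new_subdomain):
    """Return a copy of `item` with the corrected subdomain and a recomputed domain."""
    if new_subdomain not in SUBDOMAIN_TO_DOMAIN:
        raise ValueError(f"unknown subdomain: {new_subdomain!r}")
    updated = dict(item)  # leave the raw annotation record untouched
    updated["subdomain"] = new_subdomain
    updated["domain"] = SUBDOMAIN_TO_DOMAIN[new_subdomain]
    return updated
```

Deriving `domain` from `subdomain` instead of editing both fields independently keeps the two labels from drifting apart.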

Table 11:Edit\-resolution rules for annotated MCQs\. The rules are applied per field, and the resolved values are written back to the item before the inclusion rules in Table[10](https://arxiv.org/html/2606.07167#A2.T10)are evaluated\.
### B\.5Annotation Dashboard

Section[B\.2](https://arxiv.org/html/2606.07167#A2.SS2)described the annotation policies; here we illustrate the dashboard used to apply them\. For each item, annotators could select an answer, mark it as*unsure / skip*, edit question or option text, or flag it for review with a free\-text explanation\. This design separated answer selection from quality\-control feedback and text correction\. Figures[10](https://arxiv.org/html/2606.07167#A2.F10),[11](https://arxiv.org/html/2606.07167#A2.F11), and[12](https://arxiv.org/html/2606.07167#A2.F12)illustrate the workflow\.

![Refer to caption](https://arxiv.org/html/2606.07167v1/images/01_question_shown.png)\(a\)Question view with answer options, metadata, and annotation controls\.
![Refer to caption](https://arxiv.org/html/2606.07167v1/images/02_answer_selected.png)\(b\)Answer selection interface before advancing to the next item\.

Figure 10: Annotation dashboard workflow for answer selection\.

![Refer to caption](https://arxiv.org/html/2606.07167v1/images/03_edit_before.png)\(a\)Original OCR\-extracted question and options\.
![Refer to caption](https://arxiv.org/html/2606.07167v1/images/04_edit_in_progress.png)\(b\)In\-place editing of question and option text\.
![Refer to caption](https://arxiv.org/html/2606.07167v1/images/05_edit_done.png)\(c\)Saved edits with editable revision markers\.

Figure 11: Annotation dashboard workflow for text correction and normalization\.

![Refer to caption](https://arxiv.org/html/2606.07167v1/images/06_flag_before.png)\(a\)Problematic OCR example marked as *unsure / skip*\.
![Refer to caption](https://arxiv.org/html/2606.07167v1/images/07_flag_submitted.png)\(b\)Flagged item with an attached review reason\.

Figure 12: Annotation dashboard workflow for flagging problematic items\.

##### Picking an answer:

Figure [10](https://arxiv.org/html/2606.07167#A2.F10) shows the standard annotation workflow\. Annotators view the Urdu question stem, four labeled answer options, and metadata describing the subdomain, academic level, length tier, and item identifier \(Figure [10](https://arxiv.org/html/2606.07167#A2.F10)a\)\. Selecting an option highlights the choice but does not automatically advance to the next item \(Figure [10](https://arxiv.org/html/2606.07167#A2.F10)b\); annotators must explicitly confirm the selection before proceeding, which reduces accidental submissions\. Keyboard shortcuts \(1–5 for option selection, arrow keys for navigation\) support efficient batch traversal\.

##### Editing an item:

Figure[11](https://arxiv.org/html/2606.07167#A2.F11)illustrates the in\-place editing UI\. The example contains OCR and formatting artifacts in the answer options \(Figure[11](https://arxiv.org/html/2606.07167#A2.F11)a\)\. After entering edit mode, annotators modify the question text and options through inline editable fields \(Figure[11](https://arxiv.org/html/2606.07167#A2.F11)b\)\. The interface records all changes and attaches revision tags to each edited field for later review, which can be reverted with a single click \(Figure[11](https://arxiv.org/html/2606.07167#A2.F11)c\)\.

##### Flagging an item:

Figure[12](https://arxiv.org/html/2606.07167#A2.F12)shows the flagging UI\. In the illustrated case, OCR corruption removes superscript formatting from a physics question, making all answer options invalid \(Figure[12](https://arxiv.org/html/2606.07167#A2.F12)a\)\.

The annotator marks the item as*unsure / skip*and submits a flag with a free\-text explanation \(Figure[12](https://arxiv.org/html/2606.07167#A2.F12)b\)\. The dashboard visually highlights flagged items so admins can review them, and the inclusion rules in Section[B\.3](https://arxiv.org/html/2606.07167#A2.SS3)automatically remove flagged items from the consensus pool\.

### B\.6Annotation Outcomes

![Refer to caption](https://arxiv.org/html/2606.07167v1/x9.png)

Figure 13: Pairwise annotator agreement on final\-included MCQs\. Each cell reports simplified Cohen’s κ, with the number of shared items shown in parentheses\. Blank cells indicate annotator pairs with no shared final\-included items\.

| Outcome | Count |
| --- | --- |
| Input annotated MCQs | 17,565 |
| Retained after consensus filtering | 14,459 |
| Dropped: annotator disagreement | 1,611 |
| Dropped: flagged by annotator | 1,247 |
| Dropped: unsure/skip selected | 243 |
| Dropped: single annotated | 5 |
| Domain corrections | 141 |

Table 12: Annotation outcomes for the exam\-derived portion of UrduMMLU\. Each excluded item appears under a single exclusion rule\.

A total of 17,565 exam\-derived MCQs entered annotation, of which 14,459 were retained after applying the edit\-resolution and inclusion rules from Tables [11](https://arxiv.org/html/2606.07167#A2.T11) and [10](https://arxiv.org/html/2606.07167#A2.T10) \(an overall yield of 82\.3%\)\. Table [12](https://arxiv.org/html/2606.07167#A2.T12) breaks down the 3,106 exclusions\. Answer disagreement is the dominant cause \(51\.9% of all drops\), reflecting questions where two qualified Urdu annotators could not converge on a defensible answer and which are therefore unsuitable for evaluation under a strict consensus policy\. Flagged items form the second\-largest category and predominantly contain OCR corruption or malformed options similar to Figure [12\(a\)](https://arxiv.org/html/2606.07167#A2.F12.sf1)\.

Inter\-annotator agreement was correspondingly high\. Across all annotated items, observed agreement reached 89\.98%, with a simplified Cohen’s κ of 0\.8663\. Figure [13](https://arxiv.org/html/2606.07167#A2.F13) further breaks agreement down by annotator pair\. Each cell reports the simplified Cohen’s κ together with the number of shared retained items, while blank cells indicate annotator pairs without overlap\. Most populated cells exceed κ = 0\.85, showing that agreement remains consistently strong across annotator pairs rather than depending on a small subset of annotators\. This pattern indicates that annotation quality remained stable across the workforce\. Lower\-agreement cells correspond mainly to pairs with relatively few shared items and therefore have limited influence on the aggregate statistic\.
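As a check on the arithmetic, the reported yield, drop share, agreement, and κ can be recomputed from the counts in Table 12. Two steps below are assumptions rather than definitions given in the text: that observed agreement is measured over items where both annotators submitted an answer (retained plus disagreement), and that the simplified κ uses a uniform chance level of 1/4 for four options. Under those assumptions the numbers match the reported values.

```python
# Counts taken from Table 12.
input_items = 17_565
retained = 14_459
dropped = {
    "disagreement": 1_611,
    "flagged": 1_247,
    "unsure_skip": 243,
    "single_annotated": 5,
}

total_dropped = sum(dropped.values())
yield_rate = retained / input_items
disagreement_share = dropped["disagreement"] / total_dropped

# Assumption: agreement is computed over items where both annotators answered.
both_answered = retained + dropped["disagreement"]
observed_agreement = retained / both_answered

# Assumption: "simplified" kappa uses a uniform chance level over four options.
chance = 1 / 4
kappa = (observed_agreement - chance) / (1 - chance)

print(f"dropped={total_dropped}, yield={yield_rate:.1%}, "
      f"disagreement share={disagreement_share:.1%}")
print(f"observed agreement={observed_agreement:.2%}, kappa={kappa:.4f}")
```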

## Appendix CDataset Format

![Refer to caption](https://arxiv.org/html/2606.07167v1/x10.png)\(a\)Character\-length distribution per item\.
![Refer to caption](https://arxiv.org/html/2606.07167v1/x11.png)\(b\)Gold answer\-key distribution\.

Figure 14: Dataset\-level sanity checks for UrduMMLU\. Most questions remain compact enough for standard MCQ prompting, while the gold answer keys remain close to uniformly distributed across A–D\.

Each UrduMMLU example is stored as a multiple\-choice item containing a question, four answer options, a gold answer label, domain and subdomain labels, academic level information, and source metadata\. The evaluation pipeline uses the question, options, and gold answer fields during inference, while the remaining metadata supports analysis, filtering, and reproducibility\. Figure [14](https://arxiv.org/html/2606.07167#A3.F14) reports two dataset\-level sanity checks: item\-length distribution and answer\-key distribution\. The median item length is 76 characters, and the answer labels remain close to uniformly distributed across A–D, reducing the risk of prompt\-length or answer\-position bias during evaluation\.
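Both sanity checks are straightforward to recompute from the released items. The following sketch uses the field names from the JSON schema in Figure 15 and assumes that item length means the question plus all four option texts, which the text does not specify; the function name is hypothetical.

```python
from collections import Counter
from statistics import median


def sanity_checks(items):
    """Return (median item length in characters, share of each gold key A-D).

    Assumption: item length = len(question) + total length of the option texts.
    """
    lengths = [
        len(item["question"]) + sum(len(text) for text in item["options"].values())
        for item in items
    ]
    key_counts = Counter(item["correct_key"] for item in items)
    total = len(items)
    key_share = {key: key_counts.get(key, 0) / total for key in "ABCD"}
    return median(lengths), key_share
```

A roughly uniform `key_share` (each value near 0.25) is what rules out a trivial always-pick-one-letter baseline scoring above chance.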

Figure[15](https://arxiv.org/html/2606.07167#A3.F15)shows the JSON schema used for each released benchmark item\.

UrduMMLU JSON Schema

```
{
  "id": "...",
  "question": "...",
  "options": {
    "A": "...",
    "B": "...",
    "C": "...",
    "D": "..."
  },
  "correct_key": "...",
  "correct_option": "...",
  "domain": "...",
  "subdomain": "...",
  "level": "...",
  "source": [
    {
      "name": "...",
      "url": "..."
    }
  ]
}
```

Figure 15: JSON schema for individual UrduMMLU question items\.

### C\.1Subject Acronyms and Education Levels

Table [13](https://arxiv.org/html/2606.07167#A3.T13) lists the full names, acronyms, and education levels for all 26 UrduMMLU subdomains, grouped by domain\. We use these acronyms in the per\-subdomain results tables \(Tables [16](https://arxiv.org/html/2606.07167#A5.T16) and [17](https://arxiv.org/html/2606.07167#A5.T17)\)\.

| Domain | Subdomain | Acronym | Levels |
| --- | --- | --- | --- |
| Humanities | Ethics | ETH | SSC\-I, SSC\-II, HSSC\-I |
| Humanities | Fine Arts | FNA | SSC\-I, SSC\-II, HSSC\-I |
| Humanities | Islamic Studies | ISL | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Humanities | Urdu Grammar | UGR | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Humanities | Urdu Language | ULG | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Humanities | Urdu Literature | ULT | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Other | General Knowledge | GKN | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Profession | Professional Development | PRD | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Profession | Professional Studies | PRS | SSC\-I, SSC\-II |
| STEM | Biology | BIO | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| STEM | Chemistry | CHM | SSC\-I, SSC\-II, HSSC\-I |
| STEM | Computer Science | CSC | SSC\-I, SSC\-II, HSSC\-I |
| STEM | General Science | GSC | SSC\-I, SSC\-II, HSSC\-I |
| STEM | Mathematics | MTH | SSC\-I, SSC\-II, HSSC\-I |
| STEM | Physics | PHY | SSC\-I, SSC\-II, HSSC\-II |
| Social Sciences | Civics | CIV | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Social Sciences | Commerce | COM | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Social Sciences | Current & International Affairs | CIA | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Social Sciences | Economics | ECO | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Social Sciences | Education | EDU | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Social Sciences | Geography | GEO | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Social Sciences | Health & Physical Education | HPE | SSC\-I, SSC\-II, HSSC\-II |
| Social Sciences | Pakistan Studies | PKS | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Social Sciences | Psychology | PSY | HSSC\-I, HSSC\-II |
| Social Sciences | Psychometrics | PMT | SSC\-I, SSC\-II, HSSC\-I, HSSC\-II |
| Social Sciences | Sociology | SOC | HSSC\-I, HSSC\-II |

Table 13: UrduMMLU domains, subdomains, acronyms, and corresponding education levels\.

## Appendix DEvaluation Details

This appendix provides the full model roster and prompt templates used in the UrduMMLU experiments\.

### D\.1Model Roster

Table[14](https://arxiv.org/html/2606.07167#A4.T14)lists the 30 models evaluated in this work\. We group models by family for readability, while the main paper discusses them using broader categories such as proprietary API models, open\-weight multilingual models, compact models, mixture\-of\-experts models, reasoning\-oriented variants, and Urdu\- or regionally specialized models\.

### D\.2Prompt Templates

We use separate English and Urdu prompt templates for zero\-shot evaluation\. Both templates present the same Urdu question and answer options while changing only the instruction language and field labels\. The output format remains identical in both settings to support automatic parsing and consistent evaluation\.

##### English prompt:

Figure [16](https://arxiv.org/html/2606.07167#A4.F16) shows the English prompt template, which combines a fixed system prompt with a per\-item user prompt\. The system prompt instructs the model to answer in a strict two\-line format consisting of an Answer key and Answer text, without additional explanation or formatting\. This structure supports deterministic answer extraction and consistent measurement of invalid outputs across models\. The user prompt fills the placeholders domain, subdomain, level, question, and A–D directly from the dataset schema in Appendix [C](https://arxiv.org/html/2606.07167#A3), preserving the Urdu question content across both prompt\-language settings\.

Zero\-shot English Prompt

System Prompt: You are an expert multiple\-choice question answering assistant\. Read the question carefully and select the single best answer\. Respond in EXACTLY this two\-line format, with no extra text:

- Answer key: <one of A, B, C, D\>
- Answer text: <verbatim text of the chosen option, copied character\-for\-character\>

Do not add explanations, preambles, markdown, or punctuation outside of the format\. The Answer text must match the option text exactly so the response can be parsed programmatically\.

User Prompt:

```
Subject: {domain} – {subdomain}
Level: {level}
Question: {question}
A) {A}
B) {B}
C) {C}
D) {D}
Answer key:
Answer text:
```

Figure 16:Prompt for multiple\-choice question answering with strict answer formatting requirements\.
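Filling the user prompt from a dataset item is a direct string substitution. The sketch below reproduces the English user template from Figure 16 and reads the fields named in the Appendix C schema; the constant and function names are illustrative, and the sample values in the usage note are placeholders rather than benchmark content.

```python
# English user-prompt template, copied from Figure 16.
USER_TEMPLATE = (
    "Subject: {domain} – {subdomain}\n"
    "Level: {level}\n"
    "Question: {question}\n"
    "A) {A}\n"
    "B) {B}\n"
    "C) {C}\n"
    "D) {D}\n"
    "Answer key:\n"
    "Answer text:"
)


def build_user_prompt(item):
    """Fill the template from one benchmark item (dict following the JSON schema)."""
    return USER_TEMPLATE.format(
        domain=item["domain"],
        subdomain=item["subdomain"],
        level=item["level"],
        question=item["question"],
        **item["options"],  # supplies the A, B, C, D placeholders
    )
```

For example, an item with `domain="STEM"`, `subdomain="Physics"`, and `level="SSC-I"` yields a prompt whose first two lines are `Subject: STEM – Physics` and `Level: SSC-I`, followed by the untouched Urdu question and options.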
##### Urdu prompt:

The Urdu prompt template mirrors the English template while translating the system instructions, user\-field labels \(مضمون, سطح, سوال\), and surrounding instructional text into Urdu\. The Urdu question stem and answer options remain unchanged across both settings\.

We also preserve the same two\-line response structure using the English fields Answer key: and Answer text:, which allows a single parser to process outputs under both prompt languages\. This design ensures that prompt language is the only substantive difference between the two evaluation settings\. Figure [17](https://arxiv.org/html/2606.07167#A4.F17) shows the full Urdu prompt template\. This minimal\-difference setup makes the prompt\-language comparison in Section [5](https://arxiv.org/html/2606.07167#S5) directly interpretable, since any performance change comes from the instruction language rather than changes in question content or evaluation logic\.

Zero\-shot Urdu Prompt

System Prompt: آپ ایک ماہر MCQ جوابدہ معاون ہیں۔ سوال کو غور سے پڑھیں اور صحیح آپشن منتخب کریں۔ جواب بالکل اس دو سطری شکل میں دیں، کوئی اضافی متن شامل نہ کریں:

- Answer key: <A, B, C یا D میں سے ایک\>
- Answer text: <منتخب آپشن کا متن من و عن، حرف بہ حرف نقل کریں\>

کوئی وضاحت، تمہید، مارک ڈاؤن، یا اضافی الفاظ نہ لکھیں۔ Answer text کو آپشن کے متن سے بالکل مماثل ہونا چاہیے تاکہ پروگرام اسے درست طور پر پارس کر سکے۔

User Prompt:

مضمون: \{domain\} – \{subdomain\}

سطح: \{level\}

سوال: \{question\}

A\) : \{A\}

B\) : \{B\}

C\) : \{C\}

D\) : \{D\}

Answer key:

Answer text:

Figure 17: Urdu prompt for multiple\-choice question answering with strict answer formatting requirements\.
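Because both prompt languages keep the English `Answer key:` and `Answer text:` fields, one parser can handle either setting. The sketch below is one plausible way to extract the key and count anything else as an invalid output; the regular expressions, the function name, and the `require_text_match` switch are assumptions, since the parsing rules themselves are not specified here.

```python
import re

KEY_RE = re.compile(r"Answer key:\s*([A-D])\b")
TEXT_RE = re.compile(r"Answer text:\s*(.+)")


def parse_response(response, options, require_text_match=True):
    """Return the predicted key ("A".."D"), or None for an invalid output.

    With require_text_match=True (an assumed policy), the echoed answer text
    must equal the chosen option's text after trimming whitespace.
    """
    key_match = KEY_RE.search(response)
    if key_match is None:
        return None
    key = key_match.group(1)
    if not require_text_match:
        return key
    text_match = TEXT_RE.search(response)
    if text_match is None:
        return None
    if text_match.group(1).strip() != options[key].strip():
        return None
    return key
```

Under this sketch, a degenerate generation such as `Question: Question: Question:` contains no `Answer key:` field and is therefore counted as invalid rather than as a wrong answer.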

### D\.3Few\-Shot Evaluation Setup

We evaluate using the lm\-evaluation\-harness framework Gao et al\. \([2024](https://arxiv.org/html/2606.07167#bib.bib50)\)\. Each item is formatted as a four\-option multiple\-choice question, with the answer choices labeled A through D and the gold label stored as an integer index in \{0, 1, 2, 3\}\. The benchmark comprises 26,431 items spanning all subject\-level splits\. We report accuracy \(acc\) and length\-normalized accuracy \(acc\_norm\) under 0\-shot, 1\-shot, 3\-shot, and 5\-shot conditions\.
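The difference between `acc` and `acc_norm` can be illustrated with a small sketch. It follows the usual definitions for log-likelihood multiple-choice scoring (highest score wins for `acc`; highest score divided by continuation length wins for `acc_norm`) and is not the harness's own code. Whether the scored continuations are the option letters or the option texts is not stated, so the character-length normalization over option texts below is an assumption, as are the helper names and the toy numbers.

```python
def to_index(correct_key):
    """Map a gold key "A".."D" to the integer index 0..3."""
    return "ABCD".index(correct_key)


def score_item(loglikelihoods, choices, gold):
    """Return per-item acc and acc_norm from per-choice log-likelihoods.

    Assumption: normalization divides by the character length of each choice.
    """
    indices = range(len(choices))
    acc_pred = max(indices, key=lambda i: loglikelihoods[i])
    norm_pred = max(indices, key=lambda i: loglikelihoods[i] / len(choices[i]))
    return {"acc": float(acc_pred == gold), "acc_norm": float(norm_pred == gold)}


# Toy numbers: the short first choice has the best raw score, but the longer
# gold choice wins once scores are divided by length.
result = score_item(
    loglikelihoods=[-4.0, -6.0, -9.0, -12.0],
    choices=["ab", "abcdefgh", "abc", "abcd"],
    gold=to_index("B"),
)
```

If the continuations were single letters of equal length, the two metrics would coincide, which is worth keeping in mind when comparing the reported `acc` and `acc_norm` values.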

### D\.4Implementation Details

All evaluations use a fixed random seed of 42\. For locally loaded open\-weight models, we use bfloat16 precision, greedy decoding, batch size 10, and automatic device placement\. We evaluate instruction\-tuned models with their chat templates\. For API\-based systems, including OpenAI, Anthropic, Google Gemini, and Hugging Face Inference API, we use the same prompt format and configuration whenever provider constraints permit\. We retry failed API requests up to five times and terminate the pipeline after five consecutive failures to prevent silent evaluation errors\.
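The retry policy for API requests might look like the following sketch. It reads the description as "up to five attempts per request, then abort the run", which is one interpretation; the exception class, function name, and linear backoff are assumptions and do not correspond to any provider SDK.

```python
import time

MAX_ATTEMPTS = 5  # matches the limit of five stated above


class EvaluationAborted(RuntimeError):
    """Raised to stop the pipeline instead of silently skipping an item."""


def call_with_retries(request_fn, max_attempts=MAX_ATTEMPTS, backoff_seconds=0.0):
    """Call request_fn(); retry on any exception, abort after max_attempts failures."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception as exc:  # broad on purpose: provider errors vary
            last_error = exc
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # simple linear backoff
    raise EvaluationAborted(
        f"request failed {max_attempts} consecutive times"
    ) from last_error
```

Raising instead of recording an empty prediction keeps a provider outage from being scored as a batch of invalid outputs.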

| Model | Size | Family | License | Ref\. |
| --- | --- | --- | --- | --- |
| large\-traversaal/Alif\-1\.0\-8B\-Instruct | 8B | Alif | Apache 2\.0 | Shafique et al\. \([2025](https://arxiv.org/html/2606.07167#bib.bib20)\) |
| mistralai/Ministral\-3\-3B\-Instruct\-2512 | 3B | Ministral | Apache 2\.0 | Liu et al\. \([2026](https://arxiv.org/html/2606.07167#bib.bib21)\) |
| mistralai/Ministral\-3\-8B\-Instruct\-2512 | 8B | Ministral | Apache 2\.0 | Liu et al\. \([2026](https://arxiv.org/html/2606.07167#bib.bib21)\) |
| enstazao/Qalb\-1\.0\-8B\-Instruct | 8B | Qalb | Apache 2\.0 | Hassan et al\. \([2026](https://arxiv.org/html/2606.07167#bib.bib22)\) |
| Qwen/Qwen3\-4B\-Instruct\-2507 | 4B | Qwen 3 | Apache 2\.0 | Yang et al\. \([2025](https://arxiv.org/html/2606.07167#bib.bib23)\) |
| Qwen/Qwen3\-8B | 8B | Qwen 3 | Apache 2\.0 | Yang et al\. \([2025](https://arxiv.org/html/2606.07167#bib.bib23)\) |
| Qwen/Qwen3\.6\-27B | 27B | Qwen 3\.6 | Apache 2\.0 | Yang et al\. \([2025](https://arxiv.org/html/2606.07167#bib.bib23)\) |
| Qwen/Qwen3\.6\-35B\-A3B | 36B | Qwen 3\.6 | Apache 2\.0 | Yang et al\. \([2025](https://arxiv.org/html/2606.07167#bib.bib23)\) |
| bigscience/bloomz\-1b1 | 1\.1B | BLOOMZ | Bigscience Bloom Rail 1\.0 | Muennighoff et al\. \([2023](https://arxiv.org/html/2606.07167#bib.bib24)\) |
| bigscience/bloomz\-1b7 | 1\.7B | BLOOMZ | Bigscience Bloom Rail 1\.0 | Muennighoff et al\. \([2023](https://arxiv.org/html/2606.07167#bib.bib24)\) |
| bigscience/bloomz\-3b | 3B | BLOOMZ | Bigscience Bloom Rail 1\.0 | Muennighoff et al\. \([2023](https://arxiv.org/html/2606.07167#bib.bib24)\) |
| bigscience/bloomz\-7b1\-mt | 7B | BLOOMZ | Bigscience Bloom Rail 1\.0 | Muennighoff et al\. \([2023](https://arxiv.org/html/2606.07167#bib.bib24)\) |
| deepseek\-ai/DeepSeek\-V4\-Flash | 158B | DeepSeek | DeepSeek License | DeepSeek\-AI \([2026](https://arxiv.org/html/2606.07167#bib.bib25)\) |
| google/gemma\-3\-4b\-it | 4B | Gemma 3 | Gemma | Gemma Team \([2025](https://arxiv.org/html/2606.07167#bib.bib26)\) |
| google/gemma\-2\-9b\-it | 9B | Gemma | Gemma | Team et al\. \([2024](https://arxiv.org/html/2606.07167#bib.bib43)\) |
| google/gemma\-4\-26B\-A4B\-it | 27B | Gemma | Gemma | Google \([2026b](https://arxiv.org/html/2606.07167#bib.bib29)\) |
| google/gemma\-4\-31B\-it | 31B | Gemma | Gemma | Google \([2026b](https://arxiv.org/html/2606.07167#bib.bib29)\) |
| meta\-llama/Llama\-3\.2\-3B\-Instruct | 3B | LLaMA 3\.2 | LLaMA License | Meta \([2024a](https://arxiv.org/html/2606.07167#bib.bib31)\) |
| meta\-llama/Llama\-3\.1\-8B\-Instruct | 8B | LLaMA 3\.1 | LLaMA License | Grattafiori et al\. \([2024](https://arxiv.org/html/2606.07167#bib.bib30)\) |
| meta\-llama/Llama\-4\-Scout\-17B\-16E\-Instruct | 109B | LLaMA 4 | LLaMA License | Meta \([2025](https://arxiv.org/html/2606.07167#bib.bib33)\) |
| meta\-llama/Llama\-4\-Maverick\-17B\-128E\-Instruct | 402B | LLaMA 4 | LLaMA License | Meta \([2025](https://arxiv.org/html/2606.07167#bib.bib33)\) |
| meta\-llama/Llama\-3\.3\-70B\-Instruct | 70B | LLaMA 3\.3 | LLaMA License | Meta \([2024b](https://arxiv.org/html/2606.07167#bib.bib32)\) |
| microsoft/Phi\-4\-mini\-instruct | 3B | Phi\-4 | MIT | Microsoft et al\. \([2025](https://arxiv.org/html/2606.07167#bib.bib35)\) |
| microsoft/Phi\-3\.5\-mini\-instruct | 4B | Phi\-3\.5 | MIT | Abdin et al\. \([2024](https://arxiv.org/html/2606.07167#bib.bib34)\) |
| claude\-haiku\-4\-5 | N/D | Claude | Proprietary | Anthropic \([2025](https://arxiv.org/html/2606.07167#bib.bib36)\) |
| claude\-sonnet\-4\-6 | N/D | Claude | Proprietary | Anthropic \([2026b](https://arxiv.org/html/2606.07167#bib.bib37)\) |
| gemini\-3\.1\-flash\-lite | N/D | Gemini | Proprietary | Google \([2026a](https://arxiv.org/html/2606.07167#bib.bib39)\) |
| gemini\-3\.5\-flash | N/D | Gemini | Proprietary | Google DeepMind \([2026](https://arxiv.org/html/2606.07167#bib.bib40)\) |
| gpt\-5\.4\-mini | N/D | GPT | Proprietary | Singh et al\. \([2026](https://arxiv.org/html/2606.07167#bib.bib51)\) |
| gpt\-5\.4 | N/D | GPT | Proprietary | Singh et al\. \([2026](https://arxiv.org/html/2606.07167#bib.bib51)\) |

Table 14: Language models evaluated in this study\. Model sizes are reported when publicly disclosed; N/D denotes not disclosed\.

Table 15: STEM–Humanities accuracy gap under the Urdu prompt\. Models with gaps near zero either score at chance on both domains \(BLOOMZ\) or have an unusually low STEM score for their scale \(Qalb\-1\.0\-8B, Alif\-1\.0\-8B\)\. Values are taken directly from Table [4](https://arxiv.org/html/2606.07167#S4.T4)\.

## Appendix EDetailed Results

This section provides a more detailed view of the results summarized in Table[4](https://arxiv.org/html/2606.07167#S4.T4)\. Table[15](https://arxiv.org/html/2606.07167#A4.T15)reports the STEM–Humanities accuracy gap for each model under the Urdu prompt, sorted by STEM accuracy\. Across nearly all model families, performance on STEM substantially exceeds performance on Humanities, and the gap generally widens as overall capability decreases\.

The gap becomes small for the BLOOMZ family and the two Urdu\-targeted models, but for different reasons\. BLOOMZ checkpoints remain close to the random baseline on both domains, while the Urdu\-targeted models show similarly low performance on STEM and Humanities because their STEM accuracy is already far below that of comparably sized general\-purpose models\.

### E\.1Per\-Subdomain Results

We expand the domain\-level results from Table [4](https://arxiv.org/html/2606.07167#S4.T4) to all 26 subdomains\. Table [16](https://arxiv.org/html/2606.07167#A5.T16) reports accuracy under the English prompt, while Table [17](https://arxiv.org/html/2606.07167#A5.T17) reports accuracy under the Urdu prompt\. Both tables follow the same ordering, with subdomains grouped by domain and sorted by dataset size\.

#### E\.1\.1Subject\-Wise Behavior

The subdomain results sharpen the main finding from Section[5](https://arxiv.org/html/2606.07167#S5): STEM subjects transfer much more reliably than Urdu\-centered Humanities subjects\. The strongest models approach saturation on several STEM subdomains\. In contrast, performance remains lower on Urdu\-centered subjects\.

Under the English prompt, Gemini\-3\.5\-Flash reaches 97\.86% on chemistry, 98\.60% on biology, and 98\.86% on mathematics, while DeepSeek\-V4\-Flash reaches 98\.71% on physics\. These scores remain nearly unchanged under the Urdu prompt, and in some cases increase slightly\. The consistency across prompt languages suggests that scientific and mathematical concepts transfer relatively cleanly once the model can process Urdu input\.

Humanities presents a much harder challenge\. Islamic studies and Urdu grammar remain accessible for the strongest models, with Gemini\-3\.5\-Flash reaching 94\.25% on Islamic studies and 88\.34% on Urdu grammar under the English prompt\. In contrast, Urdu literature remains difficult across the entire model suite\. Even the strongest model reaches only 80\.35% under the English prompt and 80\.81% under the Urdu prompt\. Most other proprietary and open\-source models perform substantially worse, often trailing by another 10 to 20 points\. Urdu language occupies an intermediate position, with top scores near 89%\.

Across nearly all capable models, the same ordering persists: Islamic studies \> Urdu grammar \> Urdu language \> Urdu literature\. This consistency suggests that the differences reflect genuine variation in subject difficulty rather than isolated model behavior\.

Social Sciences contains both highly accessible and consistently difficult subdomains\. Geography, civics, sociology, psychology, and commerce all exceed 93% accuracy for the strongest models\. Pakistan studies also remains relatively strong despite its large size\. In contrast, current and international affairs and psychometrics stand out as the two hardest Social Sciences subdomains\.

Current and international affairs peaks at roughly 78% under both prompts, likely because many questions depend on time\-sensitive world knowledge beyond pretraining cutoffs\. Psychometrics is even more difficult: no model in the evaluation exceeds 60% accuracy under either prompt language\. This suggests that both subdomains are challenging even for the strongest models\.

The smaller Profession and Other domains follow patterns similar to Social Sciences, with proprietary models reaching the low 90s and smaller open\-source models trailing behind\. These domains do not introduce additional failure modes\. The subdomain results further clarify the behavior of smaller open\-source models\. Among models with fewer than 25B parameters, Gemma\-2\-9B\-IT performs best on Humanities subjects, including Urdu language, Urdu grammar, ethics, and fine arts, while Qwen3\-8B leads on STEM subjects such as chemistry, mathematics, computer science, and physics\. This pattern mirrors the domain\-level results: Qwen3\-8B retains relatively strong scientific knowledge but struggles on Urdu\-centered humanities content, whereas Gemma\-2\-9B\-IT shows more balanced performance across subdomains\. The Urdu\-targeted models, Qalb\-1\.0\-8B and Alif\-1\.0\-8B, do not lead any subdomain and remain below similarly sized general\-purpose models\. BLOOMZ checkpoints remain close to the random baseline on most subdomains and should be interpreted alongside the high invalid\-output rates reported in Section[5\.4](https://arxiv.org/html/2606.07167#S5.SS4)\.

#### E\.1\.2The Psychometrics Gap

Psychometrics is the most difficult subdomain in our evaluation\. The best English\-prompt accuracy reaches only 57\.30% \(Gemini\-3\.5\-Flash\), while the best Urdu\-prompt accuracy reaches 52\.97% \(Claude\-Sonnet\-4\.6\)\. No model exceeds 60% under either prompt setting, in contrast with the 90–98% accuracies achieved on many STEM and Social Sciences subjects\.

The difficulty appears specific to the content rather than the prompt language\. English\- and Urdu\-prompt results remain close, and model rankings on psychometrics largely mirror their overall rankings\. Psychometrics questions in Urdu SSC/HSSC curricula frequently emphasize analogies, number series, logical patterns, and aptitude\-style reasoning tasks that require abstract structure recognition rather than factual recall\. These results suggest that current LLMs still struggle on reasoning\-heavy Urdu educational content even when they perform strongly on factual subjects\.

Table 16: Subdomain\-level model performance on UrduMMLU under the English prompt\. Accuracy \(%\) across all 26 subdomains grouped by domain\. Subdomains are ordered within each domain by dataset size \(descending\); see Table [13](https://arxiv.org/html/2606.07167#A3.T13) for acronym expansions\. Boxed values mark the best overall score per column, while bold values indicate the best score within each model group\.

Table 17: Subdomain\-level model performance on UrduMMLU under the Urdu prompt\. Accuracy \(%\) across all 26 subdomains grouped by domain\. Subdomains are ordered within each domain by dataset size \(descending\); see Table [13](https://arxiv.org/html/2606.07167#A3.T13) for acronym expansions\. Boxed values mark the best overall score per column, while bold values indicate the best score within each model group\.

#### E\.1\.3Urdu Tuning Fails on Literature

Urdu literature is the largest subdomain in UrduMMLU, with 5,859 items, and contains content with limited overlap with English\-dominated pretraining corpora, including classical poetry, prosody, and literary history\. It therefore provides a useful test of Urdu\-focused training on culturally grounded knowledge\. Figure [18](https://arxiv.org/html/2606.07167#A5.F18) compares two Urdu\-targeted 8B models, Qalb\-1\.0\-8B and Alif\-1\.0\-8B, with two general\-purpose 8B instruction\-tuned models, Qwen3\-8B and Ministral\-3\-8B\.

![Refer to caption](https://arxiv.org/html/2606.07167v1/x12.png)

Figure 18: Urdu literature accuracy for four 8B\-class instruction\-tuned models under English and Urdu prompts\. Ministral\-3\-8B performs best under both settings, while Qwen3\-8B shows the largest prompt\-language drop\.

The Urdu\-targeted models do not outperform the general\-purpose baselines on this subdomain\. Ministral\-3\-8B achieves the highest accuracy under both prompts at 39\.4%, while Qalb\-1\.0\-8B and Alif\-1\.0\-8B remain below 32%\. Qwen3\-8B performs competitively under the English prompt \(30\.8%\) but drops to 17\.4% under the Urdu prompt\. In contrast, both Urdu\-targeted models improve under the Urdu prompt, suggesting that Urdu\-specific tuning improves instruction following more than literary knowledge\. Overall, Urdu literature remains challenging even for Urdu\-targeted LLMs\.

#### E\.1\.4English\-Prompt Subdomain Accuracy

Table[16](https://arxiv.org/html/2606.07167#A5.T16)reports per\-subdomain accuracy for all 30 models under the English prompt\. The table groups subdomains by domain and orders them by dataset size within each group, so earlier columns contribute more strongly to the corresponding domain\-level scores in Table[4](https://arxiv.org/html/2606.07167#S4.T4)\. Acronym expansions appear directly in the table header\. The results provide a fine\-grained view of model behavior across subjects: Gemini\-3\.5\-Flash remains consistently strong across nearly all subdomains, DeepSeek\-V4\-Flash approaches proprietary\-level performance on STEM subjects but drops on Urdu language and literature, and the BLOOMZ models remain close to the random baseline across most subjects\.

#### E\.1\.5Urdu\-Prompt Subdomain Accuracy

Table [17](https://arxiv.org/html/2606.07167#A5.T17) reports the same per\-subdomain breakdown under the Urdu prompt\. The table follows the same structure and ordering as Table [16](https://arxiv.org/html/2606.07167#A5.T16), which allows direct comparison between the two prompt settings\. Most differences remain small, reinforcing the main finding from Section [5](https://arxiv.org/html/2606.07167#S5) that the difficulty of UrduMMLU comes primarily from the question content rather than the instruction language\. For most proprietary models and the Gemma family, English\- and Urdu\-prompt accuracies remain nearly identical across the majority of subdomains\. A few model\-specific shifts become clearer at the subdomain level\. Qwen3\.6\-35B\-A3B improves substantially under the Urdu prompt, driven mainly by STEM subjects, where several subdomain scores rise into the mid\-90s\.

In contrast, Qwen3\-8B loses accuracy primarily on Humanities subjects, especially Urdu language and Urdu literature, which explains its large drop in overall Humanities performance under the Urdu prompt\. The Urdu\-targeted models also show modest gains on several Humanities subdomains under the Urdu prompt, although these improvements do not substantially change their overall ranking\. Together, these patterns further support the conclusion that prompt language plays a secondary role compared with the underlying educational and cultural knowledge required by the benchmark\.

## Appendix F Invalid\-Output Examples

Section [5\.4](https://arxiv.org/html/2606.07167#S5.SS4) reports invalid\-output rates across the model suite; this appendix provides representative examples of the corresponding failure modes\. Each example is drawn from an actual model prediction under the Urdu prompt setting\. We organize the examples by failure type to highlight recurring decoding behaviors and to illustrate how invalid generations manifest in practice across different models\.

##### Repetition collapse:

In some cases, the model enters a degenerate decoding loop and repeatedly emits the same token sequence without producing a meaningful or valid answer\. Example [F](https://arxiv.org/html/2606.07167#A6.SS0.SSS0.Px1) illustrates this behavior for BLOOMZ\-7B, which emits the token “Question:” dozens of times instead of producing a task\-relevant response\.

Example: repetition collapse \(BLOOMZ\-7B, Urdu literature, gold = B\)

Question: خواجہ دل محمد کس مضمون کے استاد تھے

Options:

- A\. ان میں سے کوئی نہیں
- B\. ریاضی \(gold\)
- C\. اردو
- D\. پنجابی

Model output: Question: Question: Question: Question: …

##### Prompt echo:

The model copies part of the user prompt instead of answering the question\. Example [F](https://arxiv.org/html/2606.07167#A6.SS0.SSS0.Px2) shows BLOOMZ\-1\.7B reproducing the beginning of the question prompt and terminating before generating a valid answer\.

Example: prompt echo \(BLOOMZ\-1\.7B, Islamic studies, gold = B\)

Question: مومنو\! اگر تم میری راہ میں لڑنے اور میری خوشنودی طلب کرنے کے لیے نکلے ہو تو میرے اور اپنے دشمنوں کومت بناؤ۔

Options:

- A\. بھائی
- B\. دوست \(gold\)
- C\. دشمن
- D\. راز دار

Model output: مضمون: Humanities – islamic studies \| سطح: SSC\-II سوال: مومنو\! اگر تم میری راہ میں لڑنے اور میری خوشنودی طلب کر

##### Refusal or clarification request:

Instead of selecting an answer, the model returns a conversational clarification request\. Example [F](https://arxiv.org/html/2606.07167#A6.SS0.SSS0.Px3) shows Qalb\-1\.0\-8B treating the MCQ as an ambiguous user query\.

Example: refusal \(Qalb\-1\.0\-8B, mathematics, gold = A\)

Question: قطاروں اور کالموں کی تعداد کسی قالب میںکو ظاہر کرتی ہے۔

Options:

- A\. مرتبہ \(gold\)
- B\. مقطع
- C\. قطاریں
- D\. کالم

Model output: آپ کا سوال واضح نہیں ہے۔ کیا آپ اسے دوبارہ بیان کر سکتے ہیں؟

##### System\-prompt echo:

The model reproduces the system prompt instead of answering the question\. Example [F](https://arxiv.org/html/2606.07167#A6.SS0.SSS0.Px4) shows Alif\-1\.0\-8B repeating the assistant role description without generating an answer\.

Example: system\-prompt echo \(Alif\-1\.0\-8B, chemistry, gold = B\)

Question: الکحلز کی ڈی ہائیڈریشنکے ساتھ کی جا سکتی ہے۔

Options:

- A\. NaOH
- B\. H2SO4 \(gold\)
- C\. HCl
- D\. KOH

Model output: جوابدہ معاون ہیں۔ MCQ آپ ایک ماہر

##### Empty or placeholder output:

The model emits a nearly empty response, often copied directly from a blank marker in the question\. Example [F](https://arxiv.org/html/2606.07167#A6.SS0.SSS0.Px5) shows BLOOMZ\-3B returning only a placeholder token\.

Example: empty output \(BLOOMZ\-3B, chemistry, gold = A\)

Question: فریکشن بطور جیٹ فیول استعمال ہوتی ہے۔

Options:

- A\. کیروسین آئل \(gold\)
- B\. لبریکیٹنگ آئل
- C\. فیول آئل
- D\. ڈیزل آئل

Model output: \_\_\_

##### Discussion:

All five examples produce outputs that cannot be mapped to a valid answer option and therefore contribute to the invalid\-output rate rather than to model accuracy\. The failures arise from different causes: repetition collapse and empty outputs reflect decoding instability, prompt and system\-prompt echoes reflect instruction\-following failures, and clarification requests reflect conversational misalignment with the MCQ format\. These behaviors are not unique to UrduMMLU, but their concentration under the Urdu prompt for weaker models motivates reporting invalid\-output rates alongside accuracy\.
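To make the five categories concrete, the sketch below shows one way raw outputs could be mapped to an answer option or to a failure type, and how an invalid-output rate could then be computed. It is an illustration only: the paper does not describe its answer-extraction rules here, so the regular expression, the thresholds, the check order, and every function and label name are assumptions.

```python
import re
from collections import Counter


def extract_option(output):
    """Return 'A'..'D' if the output is a bare option letter, else None."""
    match = re.fullmatch(r"\s*([A-D])[.)]?\s*", output)
    return match.group(1) if match else None


def _echoes(output, source, min_chars=20):
    """True if the first `min_chars` characters of `source` reappear in `output`."""
    prefix = source.strip()[:min_chars]
    return len(prefix) == min_chars and prefix in output


def classify_output(output, question, system_prompt):
    """Heuristically label one model output (assumed rules, checked in order)."""
    if extract_option(output) is not None:
        return "valid"
    text = output.strip()
    # Nothing but blank markers or punctuation, e.g. "___".
    if not text or set(text) <= set("_-. "):
        return "empty_or_placeholder"
    tokens = text.split()
    most_common_count = Counter(tokens).most_common(1)[0][1]
    # One token makes up at least 80% of a sequence of four or more tokens.
    if len(tokens) >= 4 and most_common_count * 5 >= len(tokens) * 4:
        return "repetition_collapse"
    if _echoes(text, system_prompt):
        return "system_prompt_echo"
    if _echoes(text, question):
        return "prompt_echo"
    # Ends with a Latin or Arabic-script (U+061F) question mark.
    if text.endswith(("?", "\u061f")):
        return "refusal_or_clarification"
    return "other_invalid"


def invalid_rate(labels):
    """Fraction of outputs that could not be mapped to an answer option."""
    return sum(label != "valid" for label in labels) / len(labels)
```

Under these assumed rules, an output such as `Question: Question: Question: Question: …` is labeled `repetition_collapse` and `___` is labeled `empty_or_placeholder`. A real evaluation harness would likely need more permissive answer extraction, for example accepting a letter embedded in a full sentence.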