Benchmarking Frontier LLMs on Arabic Cultural and Sociolinguistic Knowledge: A Cross-Evaluation Framework with Human SME Ground Truth
Summary
This paper introduces a cross-evaluation framework for benchmarking LLMs on Arabic cultural and sociolinguistic knowledge, using human SME ground truth and automated judges. The authors contribute a dataset of prompt-rubric pairs for Egyptian and Iraqi Arabic, evaluating frontier LLMs and finding that cultural reasoning remains a primary failure mode for automated grading.
View Cached Full Text
Cached at: 07/02/26, 05:35 AM
# 1. Introduction Source: [https://arxiv.org/html/2607.00139](https://arxiv.org/html/2607.00139) ![[Uncaptioned image]](https://arxiv.org/html/2607.00139v1/company_logo.png) Benchmarking Frontier LLMs on Arabic Cultural and Sociolinguistic Knowledge: A Cross\-Evaluation Framework with Human SME Ground Truth Sajjad Abdoli1,\*,†111Corresponding author:[sajjad@perle\.ai](https://arxiv.org/html/2607.00139v1/mailto:[email protected]),Ghassan Al\-Sumaidaee1,\*,†222Corresponding author:[ghassan\.al\-sumaidaee@perle\.ai](https://arxiv.org/html/2607.00139v1/mailto:[email protected]); ORCID:[0000\-0002\-5536\-0252](https://orcid.org/0000-0002-5536-0252),Ahmad ElShiekh1,\*333ORCID:[0009\-0001\-6837\-6202](https://orcid.org/0009-0001-6837-6202),Clayton W\. Taylor1,\*444ORCID:[0009\-0006\-6478\-8994](https://orcid.org/0009-0006-6478-8994),Ahmed Rashad1 1Perle AI\*Equal contribution; names sorted alphabetically\.†Corresponding authors sajjad@perle\.aighassan\.al\-sumaidaee@perle\.aiclayton@perle\.ai mad\.elshiekh@perle\.aiahmed@perle\.ai ###### Abstract The cost of human expert evaluation is a principal bottleneck to deploying language models in specialized, often high\-stakes domains\. This bottleneck is particularly acute regarding Arabic sociolinguistic knowledge: credible grading requires not only linguistic fluency but deep cultural familiarity that cannot be approximated by surface\-level, black\-and\-white metrics and training data\. We address this with a cross\-evaluation framework instantiated on two underrepresented Arabic dialect communities: Egyptian and Iraqi Arabic\. We contribute a dataset of 103 validated prompt–rubric pairs \(70 Egyptian and 33 Iraqi; 53 of which are Cultural, the other 50 Linguistic\), authored and graded by native\-speaker SMEs using penalty\-weighted rubrics that distinguish positive content requirements from answer\-specific negative error criteria\. Three frontier LLMs serve as*target models*\(whose responses are graded by human SMEs across 302 unique prompt–response pairs\), while five frontier LLMs serve as*automated judges*grading target model responses against the rubric, enforcing a provider\-level self\-evaluation guard\. A dual\-metric scheme combining Mean Absolute Deviation \(MAD\) with a Signed Mean Error separates directional grading bias from symmetric noise\. Across 1,307 judge evaluations: GPT\-5\.4 is the most reliable judge \(MADj=10\.21\\mathrm\{MAD\}\_\{j\}=10\.21pp, Signed Error=−1\.12%=\{\-1\.12\\%\}\); four of five judges show systematic leniency \(\+2\.01%\+2\.01\\%to\+6\.56%\+6\.56\\%\); Cultural tasks are harder to grade than Linguistics tasks for all judges \(MAD gap1\.831\.83–4\.784\.78pp\); and models substantially outperform on Egyptian prompts when compared to outputs for Iraqi prompts \. However, given the difference in leniency between the Iraqi and Egyptian SMEs scoring model outputs, we cannot solely attribute the Egyptian\-Iraqi performance gap to model knowledge alone\. We have therefore chosen to emphasize findings that do not assume identical leniency across human graders\. Across all samples, regardless of subdomain or judge leniency, implicit cultural reasoning, which requires models simulate native\-speaker judgment rather than rely on lexical verification, emerges as the primary failure mode for automated grading across all judge models\. Objective, verifiable domains like math and science, in which experts generally agree on what distinguishes “true” from “false,” are well\-suited for reinforcement learning \(“RL”\) with binary reward signals\. Language, however, and the experiences we use it to describe, are intrinsically subjective\. Subjective domains \(e\.g\. belief systems, cultures, linguistic dialects and subdialects\) therefore elude binary benchmarking techniques employed for more “objective,” verifiable domains\. This necessitates the use of a limited pool of human subject matter experts \(“SMEs”\) for model evaluation and feedback, which in turn embed certain bottlenecks in RL environments themselves\. The adaptation of techniques that’ve helped scale feedback and reward signaling in more “objective,” verifiable domains, such as LLMs\-as\-judges, is therefore essential to the advancement of AI in those that are inherently subjective\. The promise of LLM\-as\-a\-Judge frameworks\[[23](https://arxiv.org/html/2607.00139#bib.bib2)\]is that a capable frontier model can substitute for a human evaluator, drastically reducing annotation costs\. Embedded in this promise, however, is the risk of systematic bias: a model that is too lenient underestimates error rates and misleads downstream quality decisions; one that is too conservative suppresses signal on genuinely correct outputs\. On rubrics dominated by penalty criteria, such as nuanced cultural queries where errors are more common than successes, leniency and high deviation from the human baseline are the same underlying phenomenon, not independent failures\. In this paper we ask:which frontier LLMs can reliably substitute for human SMEs when grading Arabic sociolinguistic tasks, and where do they systematically fail?We answer this through a cross\-evaluation framework\. Every frontier model is evaluated as both a*target*\(its answers are graded by human SMEs, establishing a gold standard\) and as a*judge*\(it grades the answers of all other models, never its own\)\. This yields a complete grading matrix from which we derive judge\-level reliability measures and identify the specific criterion types where automated and human grading diverge\. Our contributions are: - •A cross\-evaluation benchmark framework for Arabic cultural and sociolinguistic grading, with a provider\-level self\-evaluation guard that prevents any model from judging outputs from the same provider, regardless of model version\. - •A dataset of 103 validated prompt–rubric pairs across Cultural and Linguistics domains \(Egyptian Arabic: 70 prompts; Iraqi Arabic: 33 prompts\), authored and graded by domain\-expert SMEs using a structured annotation platform with explicit positive and negative criteria\. - •Evidence that GPT\-5\.4 is the most reliable automated judge \(MADj=10\.21\\mathrm\{MAD\}\_\{j\}=10\.21pp, near\-zero directional bias\), while all other judges show systematic leniency bias, with Cultural tasks harder to grade than Linguistics tasks for all five judges\. - •A Signed Mean Error metric that separates directional grading bias \(leniency vs\. conservatism\) from symmetric noise, enabling more actionable characterization of each judge model’s failure mode\. The paper is organized as follows\. Section[2](https://arxiv.org/html/2607.00139#S2)situates this work in the LLM\-as\-a\-Judge literature\. Section[3](https://arxiv.org/html/2607.00139#S3)describes the dataset and annotation platform\. Section[4](https://arxiv.org/html/2607.00139#S4)formalizes the cross\-evaluation framework and metrics\. Section[5](https://arxiv.org/html/2607.00139#S5)reports results\. Section[6](https://arxiv.org/html/2607.00139#S6)interprets the findings\. Section[7](https://arxiv.org/html/2607.00139#S7)concludes\. ## 2\. Background and Motivation ### 2\.1\. The Challenge of Culturally Specific LLMs The deployment of large language models in culturally sensitive, non\-Western contexts exposes a fundamental tension: language fluency and cultural alignment are distinct capabilities, and contemporary LLMs systematically conflate them\. Hershcovich et al\.\[[6](https://arxiv.org/html/2607.00139#bib.bib17)\]first formalized this distinction, showing that speakers vary not only by language but by culture\-specific aboutness, common ground, values, and linguistic register — and that standard NLP evaluation instruments designed for English largely fail to capture these dimensions\. Adilazuarda et al\.\[[1](https://arxiv.org/html/2607.00139#bib.bib16)\]surveyed more than 90 subsequent papers and found thatnoneexplicitly define “culture”; instead, they probe models via proxy datasets representing selected cultural facets, leaving semantic domains and real\-world application effects largely under\-studied\. A growing body of work has demonstrated that the root cause of cultural misalignment lies upstream in the training pipeline\. Sahu et al\.\[[17](https://arxiv.org/html/2607.00139#bib.bib11)\]diagnose this as a*cultural data funnel*: explicit cultural signals decline sharply across fine\-tuning, alignment, and reasoning stages, while geographically concentrated, task\-specialized data dominates\. Their analysis, grounded in a 5\.6M\-sample culturally tagged corpus spanning 194 languages, confirms that multilinguality enhances geographic diversity of cultural knowledge but does not ensure balanced representation\. Agarwal et al\.\[[2](https://arxiv.org/html/2607.00139#bib.bib12)\]evaluate Indic and global LLMs against nationally representative surveys and community\-sourced QA in India and find that regional LLMs do not align better with local norms than global ones — attributing the failure to scarce culturally grounded pretraining data and demonstrating that prompting and regional fine\-tuning cannot recover alignment and can even degrade existing cultural knowledge\. The multilingual–multicultural gap is equally pronounced across other language families\. Rystrøm et al\.\[[16](https://arxiv.org/html/2607.00139#bib.bib13)\]compare LLM response distributions against World Values Survey population data across Danish, Dutch, English, and Portuguese and find no consistent relationship between multilingual capability and cultural alignment across model families, establishing self\-consistency rather than language capability as a stronger predictor of multicultural alignment\. The broader landscape of cultural misrepresentation in LLMs is surveyed in Shi et al\.\[[18](https://arxiv.org/html/2607.00139#bib.bib14)\], who construct CultureBank from 12K TikTok and 11K Reddit self\-narratives to provide a community\-driven cultural knowledge base, and show that fine\-tuning on this resource improves performance on downstream cultural tasks in zero\-shot settings\. For the Arabic\-speaking world specifically, the challenge is compounded by diglossia between Modern Standard Arabic and spoken varieties, dialectal variation across 22 Arab countries spanning over 450 million speakers, and the substantial presence of machine\-translated rather than natively authored training data\. Alwajih et al\.\[[3](https://arxiv.org/html/2607.00139#bib.bib25)\]argue that many Arabic LLMs, despite achieving linguistic fluency, largely overlook country\-specific cultural competence because they are trained on machine\-translated datasets and evaluated on general NLP tasks\. Their PalmX 2025 shared task provides the first standardized benchmark for cultural competence across Arabic and Islamic domains\. Qian et al\.\[[14](https://arxiv.org/html/2607.00139#bib.bib26)\]introduce CamelEval, an LLM\-as\-a\-Judge benchmark for culturally aligned Arabic models, and demonstrate systematic gaps between Arabic\-specific instruction\-following quality and genuine cultural understanding\. Zhang et al\.\[[20](https://arxiv.org/html/2607.00139#bib.bib15)\]propose CultureManager, a modular pipeline for task\-specific cultural alignment that stores multi\-culture knowledge in separate adapters to avoid cross\-culture interference, and show that both the relevance of cultural norms to the task at hand and interference between cultures are distinct failure modes requiring architectural solutions rather than inference\-time workarounds\. Oh et al\.\[[12](https://arxiv.org/html/2607.00139#bib.bib18)\]argue further that cultural assumptions permeate even ostensibly neutral evaluations, and call for intentional cultural evaluation that systematically examines these assumptions inallbenchmark design, not only in explicitly cultural tasks\. ### 2\.2\. Evaluating LLMs: Methods, Benchmarks, and Cultural Dimensions The evaluation of LLM outputs has evolved across three broad paradigms:*direct preference assessment*, where a judge rates a single output on a numeric scale;*pairwise ranking*, where a judge selects the better of two responses; and*rubric\-based evaluation*, where quality is decomposed into independently assessable binary criteria\. Zheng et al\.\[[23](https://arxiv.org/html/2607.00139#bib.bib2),[24](https://arxiv.org/html/2607.00139#bib.bib3)\]establish LLM\-as\-a\-Judge as a scalable alternative to human annotation, demonstrating strong correlation with human preferences for instruction\-following and conversational quality\. However, systematic biases have been documented in all three paradigms: positional bias \(favouring answers in certain positions\), verbosity bias \(favouring longer responses\), and self\-enhancement bias \(favouring outputs from the same model family\)\[[23](https://arxiv.org/html/2607.00139#bib.bib2)\]\. Fu and Liu\[[4](https://arxiv.org/html/2607.00139#bib.bib4)\]show that these biases are further amplified in multilingual settings: LLM\-as\-a\-Judge reliability degrades substantially for non\-English languages, motivating the native\-speaker SME ground truth used in the present work\. Cultural dimensions introduce further complications for each evaluation paradigm\. Parrish et al\.\[[13](https://arxiv.org/html/2607.00139#bib.bib22)\]establish BBQ, a 58,492\-example hand\-built bias benchmark covering nine U\.S\.\-anchored social dimensions, demonstrating that models rely on stereotypes under under\-informative context and retain accuracy advantages when correct answers align with social biases\. The U\.S\.\-centric framing of BBQ exemplifies the critique by Oh et al\.\[[12](https://arxiv.org/html/2607.00139#bib.bib18)\]that evaluation design itself encodes cultural assumptions\. Huang and Yang\[[7](https://arxiv.org/html/2607.00139#bib.bib23)\]operationalize cultural variation as label disagreement in natural language inference, introducing CALI, a 2\.7K dataset annotated by U\.S\. and Indian cultural groups, and show that models default to culture\-generic labels rather than exhibiting genuine cultural bias — a nuance that aggregate metrics cannot capture\. Cultural benchmark design has matured substantially in recent years\. Koto et al\.\[[11](https://arxiv.org/html/2607.00139#bib.bib19)\]introduce BLEnD, a hand\-crafted benchmark of 52\.6K everyday cultural QA pairs from 16 countries in 13 languages, and document a maximum 57\.34 percentage point performance gap between cultures that are more versus less represented online — the strongest existing quantitative link between data presence and cultural performance\. Rao et al\.\[[15](https://arxiv.org/html/2607.00139#bib.bib20)\]introduce NormAd, a hierarchical framework spanning 2\.6K social\-etiquette scenarios from 75 countries, and find that even the best LLMs achieve less than 82% accuracy when the relevant social norms are provided as context, compared with over 95% for humans — a gap that widens markedly for Global South cultures\. Zhang et al\.\[[21](https://arxiv.org/html/2607.00139#bib.bib21)\]introduce CultureScope, a three\-layer dimensional schema \(Institutional Norms; Behavioral Patterns; Core Values and Social Structures\) across 140 dimensions, enabling automated construction of culture\-specific evaluation sets for any language and confirming that multilingual data does not necessarily enhance cultural understanding\. Hayes Zhang et al\.\[[5](https://arxiv.org/html/2607.00139#bib.bib24)\]demonstrate that human preferences are substantially more diverse than model outputs, and introduce the Community Alignment Dataset comprising 233,319 comparisons across five countries, establishing that candidate\-response homogeneity prevents reward models from learning heterogeneous and underserved preferences\. In the context of rubric\-based evaluation specifically, Kim et al\.\[[8](https://arxiv.org/html/2607.00139#bib.bib5)\]pioneer Prometheus, a fully open\-source LLM evaluator trained on 1K fine\-grained score rubrics and 100K natural language feedback instances, demonstrating that open\-source models can match GPT\-4 evaluation capabilities when appropriate rubric structure and reference materials are provided\. This work establishes the rubric\-based paradigm as practically viable and reproducible at scale, and is extended in Prometheus 2\[[9](https://arxiv.org/html/2607.00139#bib.bib6)\], which supports both direct assessment and pairwise ranking under user\-defined evaluation criteria\. ### 2\.3\. Rubric Design for Grounded Evaluation The reliability of rubric\-based evaluation depends critically on the quality of the criteria themselves\. Verifiable reward methods have scaled reinforcement learning in objective domains by converting task success into checkable binary signals\[[19](https://arxiv.org/html/2607.00139#bib.bib1)\]; rubric\-based evaluation attempts to recover an analogous signal structure for domains where correctness is culturally or linguistically situated\. Zhang et al\.\[[22](https://arxiv.org/html/2607.00139#bib.bib7)\]provide the theoretical foundation: reward over\-optimization in LLM post\-training “primarily arises from inaccuracy in high\-reward regions,” and rubric\-based rewards mitigate this by distributing the evaluation signal across independently verifiable dimensions\. Their core empirical finding is that effective rubric construction requires two properties: \(1\) criteria must distinguish*excellent from merely great*responses, not just correct from incorrect; and \(2\) rubrics must discriminate among*diverse off\-policy responses*, not only the on\-policy distribution\. Li et al\.\[[10](https://arxiv.org/html/2607.00139#bib.bib8)\]extend this by introducing RubricHub, a large\-scale collection of coarse\-to\-fine generated rubrics showing that criterion granularity and discriminability are the primary determinants of reward model accuracy in post\-training\. These theoretical principles have clear practical counterparts\. Kim et al\.\[[8](https://arxiv.org/html/2607.00139#bib.bib5)\]and Kim et al\.\[[9](https://arxiv.org/html/2607.00139#bib.bib6)\]show that criterion clarity — the degree to which a rubric item can be assessed without ambiguity — is the primary determinant of inter\-judge reliability\. Operationalizing rubric quality for domain\-expert SME\-authored criteria in our setting, we instruct annotators to follow six structural principles derived from the rubric design literature\[[22](https://arxiv.org/html/2607.00139#bib.bib7),[8](https://arxiv.org/html/2607.00139#bib.bib5),[10](https://arxiv.org/html/2607.00139#bib.bib8)\]: 1. 1\.Mutually Exclusive and Collectively Exhaustive \(MECE\)\.Criteria must not overlap — the same error in a model’s response must not be penalised by two separate criteria — and together the criteria must cover all aspects of a perfect response with no gaps\. 2. 2\.Complete\.The rubric as a whole must be sufficient to evaluate any response to the prompt; a response satisfying all positive criteria and triggering no negative criteria must constitute a genuinely high\-quality answer\. 3. 3\.Non\-overlapping\.Each factual claim or cultural property appears in exactly one criterion; redundancy inflates the penalty for a single error and distorts the score distribution\. 4. 4\.Atomic / Non\-stacked\.Each criterion evaluates exactly one distinct aspect\. Compound criteria combining multiple conditions with “and” are split into separate items, guaranteeing that binary PASS/FAIL evaluation is unambiguous\. 5. 5\.Self\-contained\.Each criterion contains all information needed to evaluate a response without requiring external reference\. A criterion that names a fact must embed the fact:*not*“Mentions the capital of Canada” but “Mentions that the capital of Canada is Ottawa\.” 6. 6\.Verifiable without external search\.The judge must assess each criterion solely from the rubric text and the model’s response\. Criteria requiring enumeration of acceptable answers must list them:*not*“Names a Nobel Prize winner in Physics in 2023” but “Names at least one of the following Nobel Prize winners in Physics in 2023: Pierre Agostini, Ferenc Krausz, or Anne L’Huillier\.” These constraints collectively address the principal rubric failure modes documented in the literature\[[22](https://arxiv.org/html/2607.00139#bib.bib7),[8](https://arxiv.org/html/2607.00139#bib.bib5),[10](https://arxiv.org/html/2607.00139#bib.bib8)\]: criterion overlap \(double\-penalization\), ambiguity \(divergent judge interpretations\), and implicit facts \(criteria that no automated judge can verify from the rubric text alone\)\. The self\-contained and verifiable\-without\-search requirements carry additional weight in the cultural domain: a culturally implicit criterion — one encoding native\-speaker knowledge — cannot be fairly assessed by any judge unless the knowledge is made explicit within the criterion itself\. This design choice is directly connected to the Explicit versus Implicit dimension used to classify rubric criteria throughout our analysis \(Section[3\.2](https://arxiv.org/html/2607.00139#S3.SS2)\), and explains the systematic judge–SME divergence we observe on Implicit Cultural criteria in Section[5](https://arxiv.org/html/2607.00139#S5)\. ### 2\.4\. Penalty\-Weighted Rubrics and Their Properties A distinctive feature of our evaluation setting is that the rubric is dominated by*negative criteria*— penalty\-weighted items that are triggered when the target model makes a cultural or factual error\. This produces a ground\-truth baseline that is typically negative \(i\.e\., the sum of penalties exceeds positive credits\), reflecting the frequency of errors in LLM responses on culturally demanding tasks\. A lenient judge that fails to trigger these penalties will score closer to zero than the SME, mechanically producing a higher Mean Absolute Deviation\. This property means that on our rubrics, leniency bias and grading deviation are not independent findings — they are the same underlying failure measured in two ways\. ## 3\. Dataset ### 3\.1\. SME\-Authored Prompts Prompts and ground\-truth grades are produced by a panel of domain\-expert SMEs drawn from our in\-house expert network\. Each SME is a native speaker of the dialect community they author for\. SMEs are recruited through a structured process: credential verification, a screening stage, and a comprehensive onboarding protocol covering the annotation platform and the criterion taxonomy\. In the current dataset, each \(target model, sample\) pair is graded by a single SME from the relevant dialect community\. This single\-annotator design means the ground truth carries unmeasured annotator noise; inter\-annotator agreement is the empirical ceiling for any automated judge and will be quantified in future work\. #### Two\-Phase Authoring and Scoring Workflow Prompts and rubrics are produced through a two\-phase workflow that separates model\-agnostic design from model\-specific scoring, and this separation holds identically for both dialect communities and both domains\. Phase A \(pre\-output\)is performed before any target model is run\. The author fixes the scope \(Cultural or Linguistics, never mixed within a single rubric\), drafts the prompt in the target dialect, validates that the prompt is genuinely difficult by submitting it once to a model in a clean session and discarding that throwaway output, and then builds the positive criteria directly from the SME gold answer\. Because positive criteria are derived from the ground truth rather than from any observed response, they are model\-agnostic and are reused unchanged whenever a new target is evaluated on the same prompt\. This difficulty\-validation step is also the reason the SME baseline is low and penalty\-heavy \(Section 2\.4\): prompts a model answers perfectly in validation are revised or discarded before a rubric is built\. Phase B \(post\-output\)is repeated for every target model\. The response is first scored against the fixed positive criteria as PASS or FAIL, after which the author adds answer\-specific negative criteria for the errors actually present in that response, each tagged with one failure type \(Section[3\.2\.1](https://arxiv.org/html/2607.00139#S3.SS2.SSS1)\)\. Positive criteria thus remain constant across targets for a given prompt, while the set and weights of negative criteria vary by response\. The two roles, rubric author \(Phase A\) and output scorer \(Phase B\), may be the same SME or staffed separately; in the current dataset a single SME per \(sample, target\) pair performs both, which is the source of the single\-annotator noise discussed in Section 6\.5\. ### 3\.2\. Rubric Design Each rubric consists of two criterion types: Positive criteriaare shared across all responses to a given prompt\. They represent factual or cultural elements a correct answer must contain, each carrying a positive weight\. A judge assigns PASS if the criterion is satisfied, FAIL otherwise\. Negative criteriaare answer\-specific: they are populated by the SME only for errors actually observed in a given model’s response, each carrying a negative weight\. A judge assigns “Error Present” if the described error is present, “0” otherwise\. Because negative criteria reflect observed errors rather than a static checklist, their count and weights vary across responses to the same prompt\. #### Negative\-Criterion Failure Taxonomy Every negative criterion authored by an SME is tagged with one of six failure types, enabling error\-mode analysis beyond a raw penalty count\. The taxonomy is:Hallucination: the model fabricates a term, fact, proverb, or explanation that does not exist \(e\.g\. inventing the saying “Arabic phrase” and attributing it to Iraqi popular culture\);Omission: the model fails to provide content the criterion explicitly requires \(e\.g\. never identifying “Arabic phrase” and “Arabic phrase” as the colour\-based colloquial names for the 25,000 and 10,000 dinar notes\);Cultural Misattribution: a pan\-Arabic, Gulf, Levantine, or otherwise non\-Iraqi term or concept is presented as specifically Iraqi, or an existing Iraqi term is assigned to the wrong referent \(e\.g\. presenting the Gulf greeting “Arabic phrase” as the defining Iraqi guest greeting, or defining “Arabic phrase” as eating dates with tea rather than the bread\-and\-broth dish\);Ambiguous Framing: non\-native metaphors, dialect\-register inconsistency, or misleading phrasing \(e\.g\. using the Levantine “Arabic phrase” in place of Iraqi “Arabic phrase”, or the Jordanian “Arabic phrase” for “hot” where Iraqi uses “Arabic phrase”\);Irrelevant Addition: unsolicited content not requested by the user, which inflates surface coverage while creating openings for further error; andRendering Failure: the model generates invalid characters in place of Arabic script\. This taxonomy is the unit at which we diagnose why automated judges and SMEs diverge \(Section 6\.3\): a judge may correctly fire anOmissionpenalty it can verify lexically while missing aCultural Misattributionthat requires native dialect knowledge\. #### Worked Example \(Iraqi Cultural\) Consider the Iraqi prompt asking for the colloquial names Iraqis use for the 25,000 and 10,000 dinar banknotes\. The gold answer is “Arabic phrase” \(from the red note\) and “Arabic phrase” \(from the green note\), with the naming convention deriving from the physical colour of each note\. Positive criteria are shared across all target responses to this prompt; negative criteria are populated per response from the errors an SME actually observes\. Table[1](https://arxiv.org/html/2607.00139#S3.T1)shows the full rubric for one target response \(Muse Spark\), which scored−26\-26against a maximum of 37 \(−70\.3%\-70\.3\\%\)\. Criterion weights reflect importance bands applied consistently across the Iraqi subset: core term identification 8–10, meaning and context 5–9, cultural depth 3–7, and dialect and register 3–5\. Table 1:Complete rubric and SME verdicts for one Muse Spark response to CULT\-MENA\-IRQ\-N\. Positive criteria \(max 37\) are shared across targets; answer\-specific negative penalties yield a raw score of−26\-26\(−70\.3%\-70\.3\\%\), illustrating the taxonomy in Section[3\.2\.1](https://arxiv.org/html/2607.00139#S3.SS2.SSS1)\.Both criterion types carry an*Objectivity*dimension \(Objective vs\. Subjective\) and an*Explicitness*dimension \(Explicit vs\. Implicit\), enabling fine\-grained analysis of where automated and human grading diverge\. The LLM judge receives only criterion text and structure — never the SME’s Pass/Fail labels\. ### 3\.3\. Dataset Scale and Dialect Breakdown Table[2](https://arxiv.org/html/2607.00139#S3.T2)shows the dataset distribution\. The full dataset comprises 103 validated prompt–rubric pairs across two domains and two dialect communities\. Table 2:Dataset distribution by dialect community and domain\. ## 4\. Methodology ### 4\.1\. The Cross\-Evaluation Framework The benchmark runs every judge model against every target model’s responses, enforcing aprovider\-level self\-evaluation guard: no judge from the same provider as a target model evaluates that target’s responses, regardless of model version \(e\.g\.gemini\-3\.1\-pro\-previewas a target correctly blocksgemini\-3\.1\-pro\-previewas a judge, since a model cannot judge its own provider’s output\)\. This eliminates self\-enhancement bias more robustly than a model\-level guard\. Target model\.The model whose answers are being assessed\. For each target, human SMEs independently grade its responses, producing the ground\-truth scores\. Judge model\.The LLM assigned to grade the target model’s answers against the rubric\. Every judge evaluates all targets outside its provider; it never evaluates outputs from the same provider\. The judge receives only rubric criterion text and structure — not the SME’s Pass/Fail labels\. Running all cross\-provider combinations yields two orthogonal analytical views: the*judge\-axis*\(most reliable automated grader\) and the*target\-axis*\(whose answers are hardest to grade consistently\)\. ### 4\.2\. The Evaluation Triad For each \(target modeltt, sampless\) pair, the judge model receives: 1. 1\.The Prompt\.A cultural or sociolinguistic question authored by a human SME\. 2. 2\.The Answer\.The target model’s response to that prompt\. 3. 3\.The Rubric\.The criterion text, score type \(Positive/Negative\), and criterion weights\. The judge does*not*receive the SME’s Pass/Fail labels\. The judge is invoked with a fixed system prompt that assigns the role of an Arabic sociolinguistics SME and specifies the exact JSON output schema; the user message supplies the prompt, the target model’s answer, and the rubric as a plain\-text table \(criterion text, score type, weight only — SME verdicts withheld\)\. The complete prompt text is reproduced in Appendix[8](https://arxiv.org/html/2607.00139#S8)\. The judge returns a structured object with, for each criterion: a PASS/FAIL judgment \(Positive criteria\) or “Error Present / 0” flag \(Negative criteria\), and a free\-text justification\. ### 4\.3\. Models Under Study Table[3](https://arxiv.org/html/2607.00139#S4.T3)lists all models\. Muse Spark is evaluated as a target only \(no public API; responses entered manually\)\. Table 3:Models under study\. Muse Spark is target\-only\.All judge calls use a system prompt instructing the model to act as an Arabic sociolinguistics SME and return only a structured JSON evaluation\. Temperature is fixed to 0 to eliminate stochastic variation\. Target responses are collected under controlled conditions to ensure independence and reproducibility\. Each prompt is submitted in a clean session with memory and conversation history disabled, one prompt per session with no chaining, and the first response is captured verbatim with no re\-prompting or follow\-up turns\. Muse Spark, which has no public API, is run under the same controls with responses entered manually\. ### 4\.4\. Evaluation Metrics Let a responser=\(s,m\)r=\(s,m\)denote a \(sample, target\-model\) pair, and letRRbe the set of all human\-graded responses \(\|R\|=302\|R\|=302\)\. For a judgejj, letRj⊆RR\_\{j\}\\subseteq Rbe the responsesjjis permitted to grade under the provider\-level guard \(so\|Rj\|\|R\_\{j\}\|equals the Eval Rows in Table[5](https://arxiv.org/html/2607.00139#S5.T5)\)\.jjindexes a judge model;P\(r,j\)P\(r,j\)is the percentage score judgejjassigns to responserr;Psme\(r\)P\_\{\\mathrm\{sme\}\}\(r\)is the human SME score \(ground truth\) forrr\. #### Max Score and Raw Score Smax=∑i∈PositivewiS\_\{\\max\}=\\sum\_\{i\\,\\in\\,\\text\{Positive\}\}w\_\{i\}\(1\) Sraw\(r,j\)=∑i∈Positive∩PASSwi\+∑k∈Negative∩ErrorwkS\_\{\\mathrm\{raw\}\}\(r,j\)=\\sum\_\{\\begin\{subarray\}\{c\}i\\,\\in\\,\\text\{Positive\}\\\\ \\cap\\;\\text\{PASS\}\\end\{subarray\}\}w\_\{i\}\+\\sum\_\{\\begin\{subarray\}\{c\}k\\,\\in\\,\\text\{Negative\}\\\\ \\cap\\;\\text\{Error\}\\end\{subarray\}\}w\_\{k\}\(2\) #### Percentage Score P\(r,j\)=Sraw\(r,j\)Smax×100P\(r,j\)=\\frac\{S\_\{\\mathrm\{raw\}\}\(r,j\)\}\{S\_\{\\max\}\}\\times 100\(3\) Because the rubric is dominated by Negative criteria \(wk<0w\_\{k\}<0\),Psme\(r\)P\_\{\\mathrm\{sme\}\}\(r\)is often negative\. A lenient judge scoring closer to zero than the SME mechanically produces a higher MAD — leniency bias and grading deviation are the same phenomenon\. #### Judge\-Axis MAD MADj=1\|Rj\|∑r∈Rj\|P\(r,j\)−Psme\(r\)\|\\mathrm\{MAD\}\_\{j\}=\\frac\{1\}\{\|R\_\{j\}\|\}\\sum\_\{r\\,\\in\\,R\_\{j\}\}\\bigl\|P\(r,j\)\-P\_\{\\mathrm\{sme\}\}\(r\)\\bigr\|\(4\) #### Target\-Axis MAD For a fixed target modelmm, averaging over the cross\-provider judges that graded its responses, letRm,jR\_\{m,j\}be those evaluated response–judge pairs and letNNbe their count: MADt\(m\)=1N∑r∈Rm,j\|P\(r,j\)−Psme\(r\)\|\\mathrm\{MAD\}\_\{t\}\(m\)=\\frac\{1\}\{N\}\\sum\_\{r\\,\\in\\,R\_\{m,j\}\}\\bigl\|P\(r,j\)\-P\_\{\\mathrm\{sme\}\}\(r\)\\bigr\|\(5\) #### Signed Mean Error — Bias vs\. Noise SMEerr\(j\)=1\|Rj\|∑r∈Rj\(P\(r,j\)−Psme\(r\)\)\\mathrm\{SME\}\_\{\\mathrm\{err\}\}\(j\)=\\frac\{1\}\{\|R\_\{j\}\|\}\\sum\_\{r\\,\\in\\,R\_\{j\}\}\\bigl\(P\(r,j\)\-P\_\{\\mathrm\{sme\}\}\(r\)\\bigr\)\(6\) A positiveSMEerr\\mathrm\{SME\}\_\{\\mathrm\{err\}\}indicates leniency; negative indicates conservatism\. A judge can have low MAD but non\-zeroSMEerr\\mathrm\{SME\}\_\{\\mathrm\{err\}\}if errors cancel across samples\. Together, MAD andSMEerr\\mathrm\{SME\}\_\{\\mathrm\{err\}\}characterize both magnitude and direction of grading bias\. ## 5\. Results Dataset scope\.Results are based on\|T\|=103\|T\|=103validated prompt–rubric pairs across two domains \(Cultural and Linguistics\) and two dialect communities \(Egyptian Arabic: 70, Iraqi Arabic: 33\)\. Three target models are evaluated:gemini\-3\.1\-pro\-preview\(103 unique prompt–response pairs\), Muse Spark \(101\), andgrok\-4\.3\(98\), yielding 302 unique prompt–response pairs graded by human SMEs\. Five judge models produce 1,307 judge evaluations\. ### 5\.1\. SME Ground Truth: Target and Domain Performance Table[4](https://arxiv.org/html/2607.00139#S5.T4)shows the mean SME score per target model broken down by both dialect community and domain\. The combined Cultural and Linguistics split covers both Iraqi Arabic \(NIRQ=33N\_\{\\mathrm\{IRQ\}\}=33: 15 Cultural, 18 Linguistics\) and Egyptian Arabic \(NEGY=70N\_\{\\mathrm\{EGY\}\}=70: 38 Cultural, 32 Linguistics\)\. Three findings stand out\. First, all three models score substantially higher on Egyptian than on Iraqi prompts, across both domains\. Forgemini\-3\.1\-pro\-preview, the gap is\+16\.67%\+16\.67\\%\(IRQ Cultural\) vs\+66\.75%\+66\.75\\%\(EGY Cultural\) and\+32\.47%\+32\.47\\%\(IRQ Linguistics\) vs\+95\.21%\+95\.21\\%\(EGY Linguistics\)\. For Muse Spark the gap is even more striking:−23\.15%\-23\.15\\%on IRQ Cultural versus\+64\.09%\+64\.09\\%on EGY Cultural\. Second,grok\-4\.3scores negative on all Iraqi cells, indicating that the penalty from triggered negative criteria exceeds the credit from passed positive criteria on those prompts\. Third, the Linguistics scores are consistently higher than Cultural scores within the same dialect for gemini and Muse Spark, while grok\-4\.3 shows the opposite pattern on Egyptian prompts \(\+8\.40%\+8\.40\\%EGY Linguistics vs\+19\.29%\+19\.29\\%EGY Cultural\)\. This dialect\-domain interaction was not visible in a pure domain breakdown and motivates the separate analysis in Section[5\.5](https://arxiv.org/html/2607.00139#S5.SS5)\. The consistent EGY\>\>IRQ score gap is large and robust, holding for all three target models and within both domains\. Its interpretation is complicated by a confound identified through post\-hoc analysis of our SME panel: Egyptian SMEs grade more leniently than Iraqi SMEs, both in the number and weight of criteria they author and in the thresholds they apply when assigning PASS/FAIL verdicts\. As a result, the score gap may reflect differences in SME grading stringency rather than, or in addition to, differences in model knowledge of the two dialect communities\. Separating these contributions requires multi\-annotator ground truth with cross\-dialect SME calibration, which is beyond the scope of the current dataset; we flag this as a future work in Section[6](https://arxiv.org/html/2607.00139#S6)\. Table 4:Mean SME score by target model, dialect community, and domain\. Scores represent the average percentage of rubric points earned, as graded by human SMEs\.nn= unique prompt–response pairs in that cell\. Domain is assigned at the prompt level via an explicit field in the dataset \(Cultural or Linguistics\); both domain values span both dialect communities\. ### 5\.2\. Judge Performance Table[5](https://arxiv.org/html/2607.00139#S5.T5)reportsMADj\\mathrm\{MAD\}\_\{j\}andSMEerr\\mathrm\{SME\}\_\{\\mathrm\{err\}\}for each judge across all 302 unique prompt–response pairs\. Figure[1](https://arxiv.org/html/2607.00139#S5.F1)plots the fiveMADj\\mathrm\{MAD\}\_\{j\}values in ascending order; Figure[2](https://arxiv.org/html/2607.00139#S5.F2)shows the full distribution of judge\-assigned scores, with the human SME mean \(\+36\.17%\+36\.17\\%\) marked as a reference line\. GPT\-5\.4 is the most reliable judge considering the SME judges as refernce\.It achieves the lowestMADj\\mathrm\{MAD\}\_\{j\}\(10\.21 pp\) and the only conservative bias \(SMEerr=−1\.12%\\mathrm\{SME\}\_\{\\mathrm\{err\}\}=\-1\.12\\%\): it occasionally under\-credits responses the SME would pass, but it does not systematically over\-credit responses — the more dangerous failure mode in a quality\-control setting where the goal is to identify errors\. Its score distribution \(Figure[2](https://arxiv.org/html/2607.00139#S5.F2)\) is the most compact across all five judges, with the smallest inter\-quartile range and a median closest to the SME line\. Four of five judges exhibit systematic leniency\.SMEerr\\mathrm\{SME\}\_\{\\mathrm\{err\}\}is positive forgemini\-3\.1\-pro\-preview\(\+3\.32%\+3\.32\\%\), Claude Opus 4\.7 \(\+6\.10%\+6\.10\\%\), Qwen Plus \(\+6\.56%\+6\.56\\%\), and Grok 4\.3 \(\+2\.01%\+2\.01\\%\)\. Leniency is consequential on a rubric dominated by negative criteria: a lenient judge does not merely assign a slightly higher score — it systematically underestimates the error rate of target responses, which is the primary signal the rubric is designed to capture\. Grok 4\.3 is an interesting outlier: it shows the smallest positiveSMEerr\\mathrm\{SME\}\_\{\\mathrm\{err\}\}among the lenient judges \(\+2\.01%\+2\.01\\%\) yet the highestMADj\\mathrm\{MAD\}\_\{j\}\(13\.03 pp\), and its score distribution \(Figure[2](https://arxiv.org/html/2607.00139#S5.F2)\) has the widest IQR and median furthest above the SME reference line\. High MAD with low directional bias indicates erratic rather than systematically biased grading: Grok 4\.3 swings both above and below the SME on individual samples, with the net error cancelling to a small positive mean\. The direction of error therefore understates its unreliability as a judge\. Table 5:Judge model performance\.SMEerr\>0\\mathrm\{SME\}\_\{\\mathrm\{err\}\}\>0: lenient \(scores above SME\)\.SMEerr<0\\mathrm\{SME\}\_\{\\mathrm\{err\}\}<0: conservative\. Human SME mean:\+36\.17%\+36\.17\\%across 302 unique prompt–response pairs\.Note: gemini and Grok 4\.3 have fewer evaluation rows \(199, 204\) than the other judges because, as target models themselves, all same\-provider pairs are excluded by the provider\-level self\-evaluation guard\. Qwen Plus has 300 rather than 302 rows because its content filter declined to evaluate two Muse Spark responses; those pairs are excluded from Qwen Plus’s MAD computation\. Figure 1:MADj\\mathrm\{MAD\}\_\{j\}by Judge Model\.Mean Absolute Difference vs\. human SME across 302 unique prompt–response pairs\. GPT\-5\.4 achieves the lowest MAD \(10\.21 pp\) with near\-zero directional bias \(−1\.12%\-1\.12\\%, slightly conservative\)\. Grok 4\.3 shows the highest MAD \(13\.03 pp\)\. All judges except GPT\-5\.4 are lenient \(positiveSMEerr\\mathrm\{SME\}\_\{\\mathrm\{err\}\}\)\.Figure 2:Distribution of Judge Scores Across Samples\.Box plots show the full distribution of judge\-assigned scores per prompt–response pair\. The red dashed line marks the human SME mean \(\+36\.17%\+36\.17\\%after excluding pilot\-only target models\)\. Grok 4\.3 has the widest IQR and median furthest above the SME line, consistent with highest MAD and positiveSMEerr\\mathrm\{SME\}\_\{\\mathrm\{err\}\}\. GPT\-5\.4 has the most compact distribution\. ### 5\.3\. Target\-Axis Analysis Table[6](https://arxiv.org/html/2607.00139#S5.T6)shows the target\-axis MAD — how consistently all judges agree with the SME when grading a specific target’s responses\. A high value indicates responses that are hard to grade automatically, independent of which judge is used\. Response quality and gradeability do not scale together\.Muse Spark, the middle\-quality target \(\+38\.43%\+38\.43\\%\), is the easiest to grade \(Target MAD 10\.05 pp\)\. Both the strongest target —gemini\-3\.1\-pro\-preview\(\+62\.31%\+62\.31\\%, MAD 12\.00 pp\) — and the weakest —grok\-4\.3\(\+6\.36%\+6\.36\\%, MAD 12\.77 pp\) — are harder to grade than Muse Spark\. The reason differs at each extreme: strong responses require judges to credit nuanced correct answers they tend to under\-value, while weak responses tempt judges to over\-credit the few surface elements that are correct\. Muse Spark’s errors, by contrast, are explicit enough that all judges catch them reliably\. Target MAD therefore measures how*interpretable*a response is to an automated judge, not how good it is\. The mechanisms behind this pattern are examined further in Section[6](https://arxiv.org/html/2607.00139#S6)\. Table 6:Target model performance\. Target MAD = average deviation across all cross\-provider judges when grading that target’s responses\.Unique Pairs= unique \(sample, target\) pairs graded by human SMEs; one row per pair regardless of how many judges evaluated it\. ### 5\.4\. Domain Breakdown: Cultural vs\. Linguistics Table[7](https://arxiv.org/html/2607.00139#S5.T7)showsMADj\\mathrm\{MAD\}\_\{j\}split by domain for each judge\. All five judges show higher MAD on Cultural tasks than on Linguistics tasks \(negative gap\), indicating that implicit cultural reasoning — social conventions, contextual appropriateness, dialect norms — presents a harder target for automated judges than linguistically verifiable criteria\. The domain gap is universal but varies in magnitude\.Grok 4\.3 shows the largest gap \(−4\.78\-4\.78pp: Cultural MAD 15\.37 vs\. Linguistics MAD 10\.59\) and GPT\-5\.4 the second largest \(−4\.00\-4\.00pp: 12\.20 vs\. 8\.20\)\. At the other end, Qwen Plus shows the smallest gap \(−1\.83\-1\.83pp: 13\.33 vs\. 11\.50\), and Claude Opus 4\.7 similarly modest \(−2\.60\-2\.60pp\)\. The sign is negative for all five judges; no judge is better at Cultural than Linguistics grading\. Because the domain assignment is made at the prompt level and the same rubric structure \(positive and negative criteria, explicit and implicit tags\) is used throughout, the gap cannot be attributed to rubric format differences between domains\. It reflects instead the qualitative difference in what Cultural criteria require: a judge must simulate native\-dialect perspective and assess whether a model’s cultural reference is authentic — a form of open\-ended reasoning with no lexical check available — whereas Linguistics criteria typically ask the judge to verify specific morphological forms, syntactic constructions, or register norms against enumerable, verifiable criteria\. This maps onto the Explicit vs\. Implicit criterion taxonomy in the rubric \(Section[3\.2](https://arxiv.org/html/2607.00139#S3.SS2)\) and is examined further in Section[6](https://arxiv.org/html/2607.00139#S6)\. Table 7:MAD by domain and judge model\. Gap = Linguistics MAD−\-Cultural MAD\. All negative values confirm Cultural tasks are harder to grade automatically\. ### 5\.5\. Error\-Type Distribution Across the Full Dataset This section characterises*what kinds of errors*the three target models actually make and how their profiles differ between dialect communities\. Every failed criterion is tagged with one of the six failure types from Section[3\.2](https://arxiv.org/html/2607.00139#S3.SS2), counted across all three included targets \(gemini\-3\.1\-pro\-preview, Muse Spark,grok\-4\.3\)\. Analysis design\.Because Egyptian SMEs grade more leniently than Iraqi SMEs — particularly forCultural MisattributionandAmbiguous Framing\(see Section[6](https://arxiv.org/html/2607.00139#S6)\) — the full six\-error\-type breakdown is presented for Iraqi Arabic only\. Egyptian Arabic data are included exclusively forHallucination, the most objective error type \(a fabricated fact is verifiably wrong regardless of SME stringency\), which serves as a cross\-dialect calibration anchor\. Rate metric\.Because three models each evaluate every prompt, normalising by raw prompt count inflates rates by a factor of three\. All rates therefore usemodel\-prompt evaluation pairsas the denominator:Rate=error countmodel\-prompt pairs×100\\text\{Rate\}=\\dfrac\{\\text\{error count\}\}\{\\text\{model\-prompt pairs\}\}\\times 100\. Iraqi Arabic has9999pairs \(3×333\\times 33; perfect coverage\); Egyptian Arabic has203203pairs \(Muse Spark: 68;gemini\-3\.1\-pro\-preview: 70;grok\-4\.3: 65\), giving≈2\.9\{\\approx\}2\.9evaluations per prompt — nearly balanced with Iraqi \(3\.03\.0\)\. For example, Muse Spark’s 74 Cultural Misattribution errors across 33 Iraqi pairs give a per\-model rate of224\.2224\.2; pooling all three models:172/99×100=173\.7172/99\\times 100=173\.7per 100 pairs\. Iraqi error profile and per\-model breakdown\.Cultural MisattributionandAmbiguous Framinglead at 172 instances each, followed byOmission\(145\),Hallucination\(140\),Irrelevant Addition\(38\), andRendering Failure\(0\)\. The per\-model breakdown \(Figure[4](https://arxiv.org/html/2607.00139#S5.F4)\) reveals a strong dependence on target quality\.grok\-4\.3, the weakest target, is dominated byOmission\(76\): it most often fails to produce the correct term entirely\. Muse Spark records the highestCultural Misattribution\(74\) andAmbiguous Framing\(73\) while keepingOmissionlower \(47\): it surfaces the term but wraps it in misattributed or register\-inconsistent content\.gemini\-3\.1\-pro\-preview, the strongest target, has the lowestOmission\(22\), with residual errors inAmbiguous Framing\(55\) andIrrelevant Addition\(20\)\. Strikingly,grok\-4\.3inverts its profile across dialects:Omission\-dominated on Iraqi prompts \(76\) butHallucination\-dominated on Egyptian \(106\), fabricating confidently where it fell silent on Iraqi\. The cross\-dialect interpretation of this contrast is deferred to Section[6](https://arxiv.org/html/2607.00139#S6)\. Figure 3:Aggregated Iraqi Error Profile and Hallucination Cross\-Dialect Anchor\.*Left panel:*Raw error counts for Iraqi Arabic across all three included targets\.*Right panel:*Hallucinationrate per 100 model\-prompt pairs\.Figure 4:Per\-Model Iraqi Error Profile and Hallucination Anchor\.*Main panel:*Iraqi error counts for all six types, with the three target models clustered per group\.*Anchor panel \(right\):*Hallucinationrate per 100 model\-prompt pairs by model for Iraqi \(navy\) and Egyptian \(gold\)\.Figure 5:Aggregated Iraqi Error Rate and Share, with Hallucination Anchor\.*Left panel:*Rate \(errors per 100 model\-prompt evaluations; IRQ denominator=99=99pairs\) and share for Iraqi Arabic\.Cultural MisattributionandAmbiguous Framingare highest \(173\.7 each\), followed byOmission\(146\.5\) andHallucination\(141\.4\)\.*Right panel:*Hallucinationanchor — IRQ vs\. EGY per 100 model\-prompt pairs\. ## 6\. Discussion Judge reliability and the leniency failure mode\.GPT\-5\.4 is the most reliable judge \(MAD 10\.21 pp, Signed Error−1\.12%\-1\.12\\%\); its slight conservative bias is the safer failure mode in a quality\-control setting\. Four of five judges are lenient \(\+2\.01%\+2\.01\\%–\+6\.56%\+6\.56\\%\), which is consequential on a penalty\-dominated rubric: leniency underestimates error rates, the primary signal the rubric captures\. The consistent Cultural\>\>Linguistics MAD gap \(1\.831\.83–4\.784\.78pp, Table[7](https://arxiv.org/html/2607.00139#S5.T7)\) reflects an Implicit criterion requirement: Cultural tasks demand native\-dialect simulation rather than lexical verification, a distinction that maps onto the Explicit vs\. Implicit taxonomy of Section[3\.2](https://arxiv.org/html/2607.00139#S3.SS2)\. Why Egyptian error rates look so much lower — and why we cannot take that at face value\.Two error types stand out in the Iraqi data:Cultural MisattributionandAmbiguous Framingoccur roughly117×117\\timesand39×39\\timesmore often per evaluation in the Iraqi subset than in the Egyptian subset\. The intuitive explanation is that models know Egyptian Arabic better — it is more prevalent in pretraining data, so models are less likely to substitute a wrong dialect term\. That may well be part of the story\. However, our SME panel reveals a second explanation that we cannot rule out: Egyptian SMEs write fewer negative criteria per response and apply more lenient PASS/FAIL thresholds than Iraqi SMEs\.Cultural MisattributionandAmbiguous Framingare precisely the error types that a lenient grader is most likely to leave undocumented — catching them requires noticing a subtle dialect mismatch and then deciding to record it as a penalty\. A strict grader documents it; a lenient one may let it pass\. To test which explanation is more likely, we useHallucinationas a reference\. A hallucinated fact is objectively wrong regardless of how strict the grader is — there is no leniency involved in recognising that a model invented a term or attributed a custom to the wrong region\. If the gap between Iraqi and Egyptian were driven mainly by genuine model\-knowledge differences, we would expectHallucinationto show a similarly large gap\. Instead,Hallucinationshows only a1\.7×1\.7\\timesIRQ/EGY ratio \(Figure[5](https://arxiv.org/html/2607.00139#S5.F5)\), the smallest of any error type\. The fact that the one objective, leniency\-resistant error type shows a near\-equal split while the two judgment\-dependent types show100×100\\times\+ gaps is consistent with SME grading stringency — not model knowledge — driving much of the observed difference\. We therefore withhold causal attribution to model behaviour; separating the two contributions requires a cross\-dialect SME calibration study\. Being right and verbose can score lower than being right and brief\.Looking at Target MAD, the middle\-quality model is actually the easiest to grade: Muse Spark \(\+38\.43%\+38\.43\\%mean SME score, MAD 10\.05 pp\) is graded more consistently than eithergemini\-3\.1\-pro\-preview\(\+62\.31%\+62\.31\\%, MAD 12\.00 pp\) orgrok\-4\.3\(\+6\.36%\+6\.36\\%, MAD 12\.77 pp\)\. The reason is not that Muse Spark is average — it is in fact the best of the three at identifying the correct dialect term, often finding the gold answer on prompts where the other models miss it entirely\. The problem is what it does next\. Rather than stopping, Muse Spark routinely elaborates: it adds alternative terms, invented etymologies, regional usage claims, and unsolicited background\. Each extra clause is a new opportunity to trigger a negative criterion — a fabricated detail firesHallucination, a non\-native phrasing firesAmbiguous Framing, an off\-topic addition firesIrrelevant AdditionorCultural Misattribution\. The penalty mass from this elaboration erodes the credit earned by the correct identification\. In a penalty\-weighted rubric, being right and long can score below being right and short: correctness is bounded by fixed positive weights, while error opportunity grows with response length\. This also explains why Muse Spark posts the*lowest*Target MAD despite its strong recall\. Its errors — Cultural Misattribution \(74\), Ambiguous Framing \(73\) — are explicit and lexical; every judge can see them and agrees on them\. By contrast,grok\-4\.3’s dominant failure isOmission\(76 on Iraqi\): it simply does not produce the correct term, giving judges little to disagree about either\.gemini’s high\-quality but nuanced responses create the opposite problem — judges under\-credit elaborations that the SME would accept\. The three models therefore represent three distinct error geometries, not three points on a single quality axis, which is why Target MAD does not scale monotonically with response quality\. ### 6\.1\. Limitations - •Single SME per sample\.Each \(target model, sample\) pair is graded by a single SME, leaving annotator noise unmeasured\. Inter\-annotator agreement is the empirical ceiling for any automated judge and cannot be quantified from the current dataset\. - •Differential SME grading stringency\.Post\-hoc analysis of our SME panel indicates that Egyptian SMEs grade more leniently than Iraqi SMEs, both in the number and severity of criteria they author and in the PASS/FAIL thresholds they apply\. This confounds all cross\-dialect comparisons: the observed EGY\>\>IRQ performance gap and the concentration ofCultural MisattributionandAmbiguous Framingin the Iraqi subset may reflect SME grading behaviour as much as model behaviour\. We cannot separate these contributions from the current data\. - •Two\-dialect coverage\.The dataset covers Egyptian and Iraqi Arabic only\. Findings may not generalise to other MENA dialect communities\. ### 6\.2\. Future Work - •Multi\-annotator grading\.Recruiting two or more SMEs per sample would establish inter\-annotator agreement as an empirical ceiling and allow the annotator\-noise component of judge MAD to be isolated\. - •Cross\-dialect SME calibration\.A calibration study in which SMEs from both dialect communities grade the same subset of responses would quantify grading\-stringency differences and allow the SME\-leniency confound to be separated from genuine model\-knowledge differences\. - •Dialect expansion\.Extending the benchmark to Jordanian, Gulf, and Levantine Arabic requires additional SME recruitment and data collection but would enable cross\-dialect generalisability claims\. ## 7\. Conclusion We have presented a cross\-evaluation framework for benchmarking frontier LLMs as automated graders in Arabic cultural and sociolinguistic domains, evaluated on a dataset of 103 validated prompt–rubric pairs across Cultural and Linguistics domains in Egyptian and Iraqi Arabic\. The framework’s provider\-level self\-evaluation guard, penalty\-weighted rubric design, and dual\-metric reporting \(MAD \+ Signed Mean Error\) are designed to surface specific failure modes of automated grading rather than produce a single aggregate score\. The key findings are: GPT\-5\.4 is the most reliable automated judge \(MAD 10\.21 pp, near\-zero bias\); four of five judges exhibit systematic leniency \(\+2\.01%\+2\.01\\%–\+6\.56%\+6\.56\\%\); Cultural tasks are harder to grade automatically than Linguistics tasks for all five judges \(gap1\.831\.83–4\.784\.78pp\); and the difficulty of grading a target’s responses does not scale monotonically with response quality — both high\-quality \(gemini\) and low\-quality \(grok\-4\.3\) responses are harder to grade than moderate\-quality ones \(Muse Spark\)\. Expanding the dataset to additional MENA dialect communities, calibrating SME grading stringency across dialect communities through multi\-annotator cross\-dialect protocols, introducing multi\-annotator SME grading to establish inter\-annotator agreement as an empirical ceiling, and investigating the Explicit vs\. Implicit criterion gap at the per\-criterion level are the natural next steps\. The answer to the question of whether any frontier LLM can reliably substitute for a domain\-expert SME across the full range of Arabic cultural knowledge tasks has direct practical implications for the scalability of human\-expert evaluation in Arabic NLP and for the deployment of LLMs in culturally sensitive contexts\. ## References - \[1\]M\. F\. Adilazuarda, S\. Mukherjee, P\. Lavania, S\. Singh, A\. F\. Aji, J\. O’Neill, A\. Modi, and M\. Choudhury\(2024\)Towards measuring and modeling “culture” in LLMs: a survey\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 15763–15784\.External Links:[Link](https://arxiv.org/abs/2403.15412)Cited by:[§2\.1](https://arxiv.org/html/2607.00139#S2.SS1.p1.1)\. - \[2\]\(2026\)Fluent but foreign: even regional LLMs lack cultural alignment\.arXiv preprint arXiv:2505\.21548\.External Links:[Link](https://arxiv.org/abs/2505.21548)Cited by:[§2\.1](https://arxiv.org/html/2607.00139#S2.SS1.p2.1)\. - \[3\]M\. Alwajihet al\.\(2025\)PalmX 2025: the first shared task on benchmarking LLMs on arabic and islamic culture\.InProceedings of the Third Arabic Natural Language Processing Conference \(ArabicNLP 2025\),External Links:[Link](https://arxiv.org/abs/2509.02550)Cited by:[§2\.1](https://arxiv.org/html/2607.00139#S2.SS1.p4.1)\. - \[4\]X\. Fu and W\. Liu\(2025\)How reliable is multilingual LLM\-as\-a\-judge?\.InFindings of the Association for Computational Linguistics: EMNLP 2025,Suzhou, China,pp\. 11040–11053\.Cited by:[§2\.2](https://arxiv.org/html/2607.00139#S2.SS2.p1.1)\. - \[5\]L\. Hayes Zhang, S\. Milli,et al\.\(2025\)Cultivating pluralism in algorithmic monoculture: the community alignment dataset\.arXiv preprint arXiv:2507\.09650\.External Links:[Link](https://arxiv.org/abs/2507.09650)Cited by:[§2\.2](https://arxiv.org/html/2607.00139#S2.SS2.p3.1)\. - \[6\]D\. Hershcovich, S\. Frank, H\. Lent, M\. de Lhoneux, M\. Abdou, S\. Brandl, E\. Bugliarello, L\. Cabello Piqueras, I\. Chalkidis, R\. Cui, C\. Fierro, K\. Margatina, P\. Rust, and A\. Søgaard\(2022\)Challenges and strategies in cross\-cultural NLP\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 6997–7013\.External Links:[Link](https://aclanthology.org/2022.acl-long.482)Cited by:[§2\.1](https://arxiv.org/html/2607.00139#S2.SS1.p1.1)\. - \[7\]J\. Huang and D\. Yang\(2023\)Culturally aware natural language inference\.InFindings of the Association for Computational Linguistics: EMNLP 2023,pp\. 7591–7609\.External Links:[Link](https://aclanthology.org/2023.findings-emnlp.509)Cited by:[§2\.2](https://arxiv.org/html/2607.00139#S2.SS2.p2.1)\. - \[8\]S\. Kim, J\. Shin, Y\. Cho, J\. Jang, S\. Longpre, H\. Lee, S\. Yun, S\. Shin, S\. Kim, J\. Thorne, and M\. Seo\(2024\)Prometheus: inducing fine\-grained evaluation capability in language models\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2310.08491)Cited by:[§2\.2](https://arxiv.org/html/2607.00139#S2.SS2.p4.1),[§2\.3](https://arxiv.org/html/2607.00139#S2.SS3.p2.1),[§2\.3](https://arxiv.org/html/2607.00139#S2.SS3.p4.1)\. - \[9\]S\. Kim, J\. Suk, S\. Longpre, B\. Y\. Lin, J\. Shin, S\. Welleck, G\. Neubig, M\. Lee, K\. Lee, and M\. Seo\(2024\)Prometheus 2: an open source language model specialized in evaluating other language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,External Links:[Link](https://arxiv.org/abs/2405.01535)Cited by:[§2\.2](https://arxiv.org/html/2607.00139#S2.SS2.p4.1),[§2\.3](https://arxiv.org/html/2607.00139#S2.SS3.p2.1)\. - \[10\]S\. Li, J\. Zhao, M\. Wei, H\. Ren, Y\. Zhou, J\. Yang, S\. Liu, K\. Zhang, and W\. Chen\(2026\)RubricHub: a comprehensive and highly discriminative rubric dataset via automated coarse\-to\-fine generation\.arXiv preprint arXiv:2601\.08430\.External Links:[Link](https://arxiv.org/abs/2601.08430)Cited by:[§2\.3](https://arxiv.org/html/2607.00139#S2.SS3.p1.1),[§2\.3](https://arxiv.org/html/2607.00139#S2.SS3.p2.1),[§2\.3](https://arxiv.org/html/2607.00139#S2.SS3.p4.1)\. - \[11\]J\. Myung, N\. Lee, Y\. Zhou, J\. Jin, R\. A\. Putri, A\. Oh,et al\.\(2024\)BLEnD: a benchmark for LLMs on everyday knowledge in diverse cultures and languages\.InAdvances in Neural Information Processing Systems,External Links:[Link](https://arxiv.org/abs/2406.09948)Cited by:[§2\.2](https://arxiv.org/html/2607.00139#S2.SS2.p3.1)\. - \[12\]J\. Oh, I\. Cha, M\. Saxon, H\. Lim, S\. Bhatt, and A\. Oh\(2025\)Culture is everywhere: a call for intentionally cultural evaluation\.InFindings of the Association for Computational Linguistics: EMNLP 2025,pp\. 19156–19168\.External Links:[Link](https://arxiv.org/abs/2509.01301)Cited by:[§2\.1](https://arxiv.org/html/2607.00139#S2.SS1.p4.1),[§2\.2](https://arxiv.org/html/2607.00139#S2.SS2.p2.1)\. - \[13\]A\. Parrish, A\. Chen, N\. Nangia, V\. Padmakumar, J\. Phang, J\. Thompson, P\. M\. Htut, and S\. R\. Bowman\(2022\)BBQ: a hand\-built bias benchmark for question answering\.InFindings of the Association for Computational Linguistics: ACL 2022,pp\. 2086–2105\.External Links:[Link](https://arxiv.org/abs/2110.08193)Cited by:[§2\.2](https://arxiv.org/html/2607.00139#S2.SS2.p2.1)\. - \[14\]Z\. Qian, F\. Altam, M\. Alqurishi, and R\. Souissi\(2024\)CamelEval: advancing culturally aligned arabic language models and benchmarks\.arXiv preprint arXiv:2409\.12623\.External Links:[Link](https://arxiv.org/abs/2409.12623)Cited by:[§2\.1](https://arxiv.org/html/2607.00139#S2.SS1.p4.1)\. - \[15\]A\. Rao, A\. Yerukola, V\. Shah, K\. Reinecke, and M\. Sap\(2025\)NormAd: a framework for measuring the cultural adaptability of large language models\.InProceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 2373–2403\.External Links:[Link](https://arxiv.org/abs/2404.12464)Cited by:[§2\.2](https://arxiv.org/html/2607.00139#S2.SS2.p3.1)\. - \[16\]J\. Rystrøm, H\. R\. Kirk, and S\. Hale\(2025\)Multilingual≠\\neqmulticultural: evaluating gaps between multilingual capabilities and cultural alignment in LLMs\.InProceedings of the OMMM Workshop 2025,pp\. 74–85\.External Links:[Link](https://aclanthology.org/2025.ommm-1.9)Cited by:[§2\.1](https://arxiv.org/html/2607.00139#S2.SS1.p3.1)\. - \[17\]A\. Sahu, M\. Mofakhami, D\. D’Souza, T\. Euyang, J\. Kreutzer, and M\. Fadaee\(2026\)The culture funnel: you can’t align what isn’t in the data\.arXiv preprint arXiv:2606\.13808\.External Links:[Link](https://arxiv.org/abs/2606.13808)Cited by:[§2\.1](https://arxiv.org/html/2607.00139#S2.SS1.p2.1)\. - \[18\]W\. Shi, R\. Li, Y\. Zhang, C\. Ziems, C\. Yu, R\. Horesh, R\. Abreu de Paula, and D\. Yang\(2024\)CultureBank: an online community\-driven knowledge base towards culturally aware language technologies\.InFindings of the Association for Computational Linguistics: EMNLP 2024,External Links:[Link](https://arxiv.org/abs/2404.15238)Cited by:[§2\.1](https://arxiv.org/html/2607.00139#S2.SS1.p3.1)\. - \[19\]Y\. Su, D\. Yu, L\. Song, J\. Li, H\. Mi, Z\. Tu, M\. Zhang, and D\. Yu\(2025\)Crossing the reward bridge: expanding RL with verifiable rewards across diverse domains\.arXiv preprint arXiv:2503\.23829\.External Links:[Link](https://arxiv.org/abs/2503.23829)Cited by:[§2\.3](https://arxiv.org/html/2607.00139#S2.SS3.p1.1)\. - \[20\]B\. Zhanget al\.\(2026\)Mind the gap in cultural alignment: task\-aware culture management for large language models\.arXiv preprint arXiv:2602\.22475\.External Links:[Link](https://arxiv.org/abs/2602.22475)Cited by:[§2\.1](https://arxiv.org/html/2607.00139#S2.SS1.p4.1)\. - \[21\]J\. Zhanget al\.\(2025\)CultureScope: a dimensional lens for probing cultural understanding in LLMs\.arXiv preprint arXiv:2509\.16188\.External Links:[Link](https://arxiv.org/abs/2509.16188)Cited by:[§2\.2](https://arxiv.org/html/2607.00139#S2.SS2.p3.1)\. - \[22\]J\. Zhang, Z\. Wang, L\. Gui, S\. Mysore Sathyendra, J\. Jeong, V\. Veitch, W\. Wang, Y\. He, B\. Liu, and L\. Jin\(2026\)Chasing the tail: effective rubric\-based reward modeling for large language model post\-training\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2509.21500)Cited by:[§2\.3](https://arxiv.org/html/2607.00139#S2.SS3.p1.1),[§2\.3](https://arxiv.org/html/2607.00139#S2.SS3.p2.1),[§2\.3](https://arxiv.org/html/2607.00139#S2.SS3.p4.1)\. - \[23\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\)Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\.InAdvances in Neural Information Processing Systems,Vol\.36\.External Links:[Link](https://arxiv.org/abs/2306.05685)Cited by:[§1](https://arxiv.org/html/2607.00139#S1.p1.1),[§2\.2](https://arxiv.org/html/2607.00139#S2.SS2.p1.1)\. - \[24\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\)MT\-Bench: a multi\-turn benchmark for evaluating chat LLMs\.InAdvances in Neural Information Processing Systems,Vol\.36\.External Links:[Link](https://arxiv.org/abs/2306.05685)Cited by:[§2\.2](https://arxiv.org/html/2607.00139#S2.SS2.p1.1)\. ## 8\. Judge Prompt All judge calls use the following fixed system prompt and user message template\. Temperature is set to 0 for all calls; no other sampling parameters are varied\. ### System Prompt > You are an expert Subject Matter Expert \(SME\) in Arabic Sociolinguistics and Dialects\. You will be provided with a PROMPT, an ANSWER generated by an AI model, and a grading RUBRIC\. You must evaluate the ANSWER against every criterion in the RUBRIC\. Return ONLY a JSON object with the following structure: \{ ‘‘evaluations’’: \[ \{ ‘‘criterion’’: ‘‘Text of criterion here’’, ‘‘score\_type’’: ‘‘Positive or Negative’’, ‘‘judgment’’: ‘‘PASS or FAIL \(if Positive\), or 0 or Error Present \(if Negative\)’’, ‘‘justification’’: ‘‘Your reasoning here’’ \} \] \} ### User Message Template > PROMPT: \{prompt text authored by SME\} \-\-\- ANSWER \(produced by \{target model name\}\): \{target model’s response\} \-\-\- RUBRIC: \{plain\-text table: Criterion∣\\midScore Type∣\\midWeight, one row per criterion\} The rubric table contains three columns only: criterion text, score type \(PositiveorNegative\), and numeric weight\. The SME’s PASS/FAIL verdicts and failure\-type tags are deliberately excluded so the judge receives no ground\-truth signal\. The target model’s name is included in the user message to allow attribution in the judge’s justification text; it is not used for scoring\. Judgment tokens \(PASS,FAIL,Error Present,0\) are parsed by exact string match in the evaluation pipeline\.
Similar Articles
CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
CulturALL introduces a 2,610-sample benchmark across 14 languages and 51 regions to evaluate LLMs on real-world, culturally grounded tasks; top model scores only 44.48%, highlighting large room for improvement.
Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
This paper introduces the first parallel Arabic cultural QA benchmark spanning Modern Standard Arabic and multiple dialects, converting multiple-choice questions to open-ended formats and evaluating LLMs with chain-of-thought reasoning to address gaps in culturally grounded and dialect-specific knowledge.
Automated Scoring of Arabic Text Using Large Language Models: A Literature Review
A literature review examining LLM-based approaches for automatic scoring of Arabic text, covering short answer grading and essay scoring, with a proposed taxonomy and comparative analysis.
Challenges and Recommendations for LLMs-as-a-Judge in Multilingual Settings and Low-Resource Languages
This paper analyzes the use of LLM-as-a-Judge in multilingual and low-resource settings, finding inconsistent evaluation outcomes and overtrust in LLM judgments, and provides recommendations for better practices.
SPLIT: Cross-Lingual Empathy and Cultural Grounding in English and Ukrainian LLM Responses
Introduces SPLIT, a 500-prompt benchmark evaluating LLM cross-lingual empathy and cultural grounding in English and Ukrainian. Findings show Gemini-2.5-Flash and LLaMA-3.3-70B-Instruct degrade in Ukrainian while DeepSeek-V3 remains stable, with weak agreement between human and AI evaluators on cultural dimensions.