Tag
This paper introduces a cross-evaluation framework for benchmarking LLMs on Arabic cultural and sociolinguistic knowledge, using ground truth from human subject-matter experts (SMEs) alongside automated judges. The authors contribute a dataset of prompt-rubric pairs for Egyptian and Iraqi Arabic, evaluate frontier LLMs, and find that cultural reasoning remains a primary failure mode for automated grading.