SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning
Summary
Researchers release SAHM, the first Arabic financial benchmark with 14,380 expert-verified instances covering Shari’ah-compliant reasoning, showing large performance gaps for 20 evaluated LLMs.
Cached at: 04/22/26, 08:30 AM
# A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning
Source: [https://arxiv.org/html/2604.19098](https://arxiv.org/html/2604.19098)
Rania Elbadry¹, Sarfraz Ahmad¹, Ahmed Heakl¹, Dani Bouch¹, Momina Ahsan¹, Muhra AlMahri¹, Marwa Elsaid Khalil¹, Mohamed Anwar¹, Yuxia Wang², Salem Lahlou¹, Sophia Ananiadou³, Veselin Stoyanov¹, Jimin Huang⁴, Xueqing Peng⁴, Preslav Nakov¹, Zhuohan Xie¹

¹MBZUAI, ²INSAIT, ³The University of Manchester, ⁴The Fin AI

{rania.elbadry, preslav.nakov, zhuohan.xie}@mbzuai.ac.ae

[SAHM Benchmark](https://huggingface.co/SahmBenchmark) | [Code](https://github.com/rania-hossam/SAHM)
###### Abstract
English financial NLP has advanced rapidly through benchmarks targeting earnings analysis, market sentiment, tabular reasoning, and financial question answering, yet Arabic financial NLP remains virtually nonexistent, despite 422 million speakers, $4.9 trillion in Gulf sovereign wealth, and a $4–5 trillion Islamic finance industry requiring specialized Shari’ah compliance over instruments like sukuk, murabaha, and takaful. We introduce Sahm, the first Arabic financial benchmark spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, comprising 14,380 expert-verified instances from authentic regulatory, juristic, and corporate sources. Evaluating 20 LLMs, we find Arabic fluency does not imply financial reasoning: models achieving 91% on recognition tasks drop sharply on generation, and event-cause reasoning exposes the widest performance gap (1.89–9.84/10). We release the benchmark and dataset to support trustworthy Arabic financial assistants.
## 1 Introduction
Figure 1: Examples of the diverse tasks included in Sahm, covering juristic Q&A, business and accounting MCQs, financial sentiment analysis, report summarization, and event causal reasoning.

The Gulf Cooperation Council (GCC) generates large volumes of Arabic financial text, including central bank reports, regulatory filings, corporate disclosures, and fatwas that provide jurisprudential rulings. Despite this, systematic evaluation of Large Language Models (LLMs) on Arabic financial content remains limited. English financial NLP has advanced rapidly through dedicated benchmarks (Maia et al., [2018a](https://arxiv.org/html/2604.19098#bib.bib23); Zhu et al., [2021](https://arxiv.org/html/2604.19098#bib.bib41); Chen et al., [2021](https://arxiv.org/html/2604.19098#bib.bib10), [2022](https://arxiv.org/html/2604.19098#bib.bib11); Zhao et al., [2024](https://arxiv.org/html/2604.19098#bib.bib40); Xie et al., [2025](https://arxiv.org/html/2604.19098#bib.bib37)), with multilingual extensions emerging for other languages (Nie et al., [2025](https://arxiv.org/html/2604.19098#bib.bib25); Zhang et al., [2024](https://arxiv.org/html/2604.19098#bib.bib39); Peng et al., [2025a](https://arxiv.org/html/2604.19098#bib.bib27), [b](https://arxiv.org/html/2604.19098#bib.bib28)). Arabic benchmarks remain limited in scope: ArBanking77 (Jarrar et al., [2023](https://arxiv.org/html/2604.19098#bib.bib19)) addresses only banking intent, and Arabic-centric LLMs (Sengupta et al., [2023](https://arxiv.org/html/2604.19098#bib.bib32); Team, [2025](https://arxiv.org/html/2604.19098#bib.bib34); Heakl et al., [2025a](https://arxiv.org/html/2604.19098#bib.bib15); Abbas et al., [2025](https://arxiv.org/html/2604.19098#bib.bib1)) have not been evaluated on financial domains. Islamic finance further illustrates this gap. Unlike conventional finance, it requires Shari’ah review guided by standards issued by AAOIFI ([https://aaoifi.com](https://aaoifi.com/)). Although resources such as Fatwaset (Alyemny et al., [2023](https://arxiv.org/html/2604.19098#bib.bib4)) and Hajj FQA (Aleid and Azmi, [2025](https://arxiv.org/html/2604.19098#bib.bib2)) exist, they focus on general juristic QA rather than financial reasoning. As a result, LLMs remain untested on tasks that combine legal and financial analysis.

We introduce Sahm, the first Arabic financial NLP benchmark unifying modern finance and Islamic jurisprudence, two high-stakes domains that shape trillions in assets yet are missing from LLM evaluation. Sahm spans seven expert-verified tasks grounded in AAOIFI standards, fatwa archives from seven countries, and corporate disclosures (Figure [1](https://arxiv.org/html/2604.19098#S1.F1)). Evaluating 20 LLMs reveals that Arabic fluency does not guarantee financial reasoning: base Arabic models rank in the bottom 25% despite being designed for Arabic. However, fine-tuning on Sahm dramatically closes this gap: domain-adapted models gain up to +26 points on Accounting and +25 points on Business, enabling 7–8B models to surpass GPT-5 on financial reasoning tasks and match 72B open-source baselines. Our contributions:
- The first Arabic finance benchmark (14,380 instances; 7 tasks) jointly evaluating Shari’ah-compliant reasoning (fatwa QA, Islamic finance standards) and core financial competencies (accounting MCQ, sentiment, event-cause QA), addressing a major resource gap for Arabic financial NLP.
- A comprehensive benchmark of 20 LLMs showing that Arabic fluency does not guarantee financial reasoning: models that score up to 91% on MCQ-style tasks degrade substantially on open-ended generation, with the largest gap on Event–Cause QA (1.89–9.84/10).
- Evidence that targeted adaptation rivals scale for Arabic financial NLP: fine-tuning on Sahm yields two complementary 7–8B models, Sahm-ALLAM-7B (peak accuracy, surpassing GPT-5 by +21.3 points on Business MCQ, 93.99% vs. 72.68%) and Sahm-Jais-8B (uniformly positive transfer across all tasks), while matching 72B open-source baselines on average. This demonstrates ∼10× parameter efficiency and establishes domain adaptation as a practical, cost-effective route to trustworthy Arabic financial assistants where frontier API access may be limited.
Figure 2: Pipeline for constructing the Islamic Finance Shari’ah Standards QA dataset. A hybrid LLM-human pipeline converts AAOIFI standards into QA pairs through OCR and generation stages, each followed by expert verification to ensure linguistic accuracy and legal fidelity.
## 2 Related Work
##### Financial NLP Benchmarks.
English financial NLP has matured through progressively challenging benchmarks. Early work focused on classification and extraction (Araci, [2019](https://arxiv.org/html/2604.19098#bib.bib8)), while recent datasets target numerical reasoning over tables (FinQA (Chen et al., [2021](https://arxiv.org/html/2604.19098#bib.bib10)), TAT-QA (Zhu et al., [2021](https://arxiv.org/html/2604.19098#bib.bib41))), multi-turn dialogue (ConvFinQA (Chen et al., [2022](https://arxiv.org/html/2604.19098#bib.bib11))), and chain-of-thought verification (FinChain (Xie et al., [2025](https://arxiv.org/html/2604.19098#bib.bib37))). Comprehensive suites such as FinBen (Xie et al., [2024](https://arxiv.org/html/2604.19098#bib.bib35)) and PIXIU (Xie et al., [2023](https://arxiv.org/html/2604.19098#bib.bib36)) now span 24 tasks including sentiment, NER, and argument mining. Multilingual extensions have emerged for Chinese (CFinBench (Nie et al., [2025](https://arxiv.org/html/2604.19098#bib.bib25))) and Greek (Plutus (Peng et al., [2025a](https://arxiv.org/html/2604.19098#bib.bib27))), demonstrating that culturally grounded evaluation reveals failure modes invisible in English-only testing. Yet Arabic, spoken by 422M people across economies managing $4.9T in sovereign wealth (Alhajraf, [2025](https://arxiv.org/html/2604.19098#bib.bib3)), lacks any comparable financial benchmark.
##### Arabic NLP and the Evaluation Gap.
Arabic resources have grown substantially, but remain shallow in financial coverage. ArBanking77 (Jarrar et al., [2023](https://arxiv.org/html/2604.19098#bib.bib19)) addresses banking intent detection; Fatwaset (Alyemny et al., [2023](https://arxiv.org/html/2604.19098#bib.bib4)) and Hajj-FQA (Aleid and Azmi, [2025](https://arxiv.org/html/2604.19098#bib.bib2)) target religious QA. These datasets support general language understanding, but do not evaluate the reasoning required for regulatory compliance, numerical analysis, or Shari’ah-aligned decision making. This gap is significant because Arabic financial texts present distinct challenges: mixed numeral systems (Eastern ٠١٢٣ and Western 0123), code-switching with English acronyms (IFRS, AAOIFI), and domain-specific terminology from Islamic jurisprudence (riba, gharar, sukuk). Meanwhile, Arabic-centric LLMs, including Jais (Sengupta et al., [2023](https://arxiv.org/html/2604.19098#bib.bib32)), Falcon-Arabic (Team, [2025](https://arxiv.org/html/2604.19098#bib.bib34)), AIN (Heakl et al., [2025a](https://arxiv.org/html/2604.19098#bib.bib15)), and Fanar (Abbas et al., [2025](https://arxiv.org/html/2604.19098#bib.bib1)), are evaluated only on generic benchmarks that ignore these complexities.
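To make the mixed-numeral challenge concrete, the following minimal sketch maps Eastern Arabic digits to their Western equivalents before any numeric comparison. The helper name `normalize_digits` is hypothetical and is not taken from the Sahm release; it only illustrates one preprocessing step an evaluation harness might apply.

```python
# Illustrative sketch only; not part of the Sahm codebase.
EASTERN_DIGITS = "٠١٢٣٤٥٦٧٨٩"
WESTERN_DIGITS = "0123456789"

# str.maketrans builds a one-to-one character mapping table.
_DIGIT_TABLE = str.maketrans(EASTERN_DIGITS, WESTERN_DIGITS)


def normalize_digits(text: str) -> str:
    """Replace Eastern Arabic digits with Western digits; keep all other text."""
    return text.translate(_DIGIT_TABLE)
```

Acronyms such as IFRS and the surrounding Arabic text pass through unchanged, so the same string can then be matched against a reference written in either numeral system.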
## 3 Sahm
| Task | Dataset | N | Avg. Words (Input) | Avg. Chars (Input) | Avg. Words (Answer) | Avg. Chars (Answer) |
|---|---|---|---|---|---|---|
| MCQ | Accounting Exams MCQ | 167 | 111.5 ± 91.1 | 674.3 ± 550.5 | 1.0 ± 0.0 | 1.0 ± 0.0 |
| MCQ | Business Exams MCQ | 183 | 46.3 ± 12.2 | 298.3 ± 71.6 | 1.0 ± 0.0 | 1.0 ± 0.0 |
| MCQ | Islamic Financial Fatwa MCQ | 2,000 | 93.1 ± 14.7 | 536.7 ± 82.6 | 1.0 ± 0.0 | 1.0 ± 0.0 |
| MCQ | Financial Report Sentiment Analysis MCQ | 80 | 292.3 ± 139.3 | 1,780.7 ± 841.9 | 1.0 ± 0.0 | 1.0 ± 0.0 |
| Open-Ended | Event–Cause Reasoning QA | 80 | 413.6 ± 299.9 | 2,503.7 ± 1,752.1 | 350.6 ± 101.8 | 2,170.4 ± 635.8 |
| Open-Ended | Islamic Fatwa QA | 2,000 | 64.1 ± 36.2 | 377.3 ± 200.6 | 89.9 ± 58.4 | 492.5 ± 324.0 |
| Open-Ended | Islamic Sharī'a Standards QA | 811140.1 ± 5.2 | | 287.0 ± 39.5 | 33.2 ± 22.0 | 192.1 ± 129.8 |
| Open-Ended | Report Extractive Summarization | 80 | 355.4 ± 165.4 | 2,144.3 ± 972.3 | 157.4 ± 66.5 | 929.1 ± 391.7 |

Table 1: Dataset statistics for Sahm. Mean ± standard deviation of word and character counts per instance, computed over the test split of each dataset. For MCQ tasks the answer is a single letter (A–D), hence the constant 1.0 word/char count.

We introduce Sahm, a comprehensive benchmark for evaluating Arabic financial reasoning across diverse, real-world tasks spanning Islamic finance, accounting, and market analysis. The benchmark is designed to capture both rule-based reasoning grounded in Shari’ah standards and applied financial understanding in authentic Arabic contexts. It covers multiple task formats, including question answering, multiple-choice reasoning, sentiment analysis, and summarization, enabling holistic assessment of model capabilities. Table [1](https://arxiv.org/html/2604.19098#S3.T1) provides an overview of dataset composition, task distribution, and train–test splits.
### 3.1 Islamic Finance Shari’ah Standards QA
Finance in the Gulf and the wider MENA region differs from Western systems: banks, insurers, and capital markets must comply with Islamic principles governed by detailed Shari’ah standards. Frameworks such as AAOIFI and local regulations specify how financial instruments are structured, e.g., lease-to-own arrangements in *Ijara* (إجارة) and compliance requirements for Sukuk issuance (Pomeranz, [1997](https://arxiv.org/html/2604.19098#bib.bib29); Islamic Financial Services Board (2024), [IFSB](https://arxiv.org/html/2604.19098#bib.bib18); Saudi Central Bank, [2024](https://arxiv.org/html/2604.19098#bib.bib31)). صكوك (sukuk) are Shari’ah-compliant financial certificates representing ownership in underlying assets rather than interest-bearing debt. Yet most financial benchmarks implicitly assume Western instruments such as interest-bearing loans and conventional bonds, leaving models untested on regionally critical reasoning about contract permissibility, legal constraints, and Shari’ah compliance. To address this gap, we construct the first Islamic Shari’ah Standards QA dataset directly from the official 1,264-page AAOIFI compendium spanning 52 standards chapters, enabling systematic evaluation of LLMs on rule-based Islamic financial reasoning.
We built the dataset through a multi-step pipeline that converts the AAOIFI compendium into text via OCR with Gemini-2.5-Pro (Google Cloud, [2025](https://arxiv.org/html/2604.19098#bib.bib13)), as recommended by Heakl et al. ([2025b](https://arxiv.org/html/2604.19098#bib.bib16)); Appendix [A](https://arxiv.org/html/2604.19098#A1) provides details. Two native Islamic finance experts manually verified the extracted text to preserve diacritics, numerals, and domain-specific terminology.
In a review of a 25% sample, the experts measured a high exact-match rate of 98.7 ± 0.7% (95% confidence interval) and strong inter-annotator agreement (κ = 0.962), confirming the reliability of the OCR pipeline. The remaining 1.3% of mismatched characters consisted primarily of minor orthographic or formatting issues (e.g., spacing, punctuation, and occasional diacritics), which annotators corrected in the canonical text used for QA construction; in the audited sample, we did not observe OCR errors that altered the substance of any Shari’ah ruling. After cleanup, we grouped the verified text into thematic clusters (e.g., Murabaḥa, cost-plus sale) and used Gemini-2.5-Pro to draft candidate Arabic question–answer pairs. Domain experts then refined and validated these samples to ensure that each question accurately captured its intended ruling and included all mandatory conditions and exceptions. This human-in-the-loop pipeline transforms dense regulatory prose into high-quality, legally faithful QA pairs for benchmarking Shari’ah-compliant financial reasoning. Examples from the dataset and the full pipeline appear in Figures [1](https://arxiv.org/html/2604.19098#S1.F1) and [2](https://arxiv.org/html/2604.19098#S1.F2), respectively.
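As a rough illustration of how an exact-match rate with a 95% interval can be reported, the sketch below uses the normal-approximation (Wald) interval for a proportion. The text does not state which interval method the authors used, so this choice is an assumption, and the function name `wald_interval` and the counts in the usage lines are made up for demonstration.

```python
import math


def wald_interval(matches: int, total: int, z: float = 1.96):
    """Return (rate, half_width) of a normal-approximation interval.

    z = 1.96 corresponds to a two-sided 95% interval. This is an
    illustrative assumption, not the method confirmed by the paper.
    """
    if total <= 0 or not 0 <= matches <= total:
        raise ValueError("need total > 0 and 0 <= matches <= total")
    rate = matches / total
    half_width = z * math.sqrt(rate * (1.0 - rate) / total)
    return rate, half_width


# Made-up counts, chosen only to exercise the function.
rate, half_width = wald_interval(matches=480, total=500)
print(f"{100 * rate:.1f} ± {100 * half_width:.1f}%")
```

For rates very close to 100%, a Wilson or exact binomial interval is usually a safer choice than the Wald form shown here.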
### 3.2 Islamic Financial Fatwa QA
We scraped fatwā archives from 13 official websites across 7 Arab countries to capture the breadth of real-world financial questions Muslims ask (Table [5](https://arxiv.org/html/2604.19098#A2.T5)). The initial crawl yielded 20k fatwas, which we cross-checked against the public FatwaSet (Alyemny et al., [2023](https://arxiv.org/html/2604.19098#bib.bib4)) to remove duplicates and then organized into 11 finance-related categories (Table [7](https://arxiv.org/html/2604.19098#A6.T7)), including زكاة (almsgiving), ربا (usury), and مرابحة (cost-plus financing). We then transformed these long, formal texts into concise QA pairs via Gemini-2.5-Pro while preserving their juristic meaning. Specifically, we removed introductory invocations (e.g., “الحمد لله، والصلاة والسلام على رسول الله”) and rhetorical openers (e.g., “أما بعد”) to expose the core inquiry and ruling. We stripped HTML artifacts and redundant navigational references while retaining key metadata such as source URLs for traceability. This pipeline removes greetings, honorifics, hyperlinks, and scholar names while preserving Qurʾānic citations, juristic terminology, and legal reasoning. Further details appear in Appendix [B](https://arxiv.org/html/2604.19098#A2). Two native Arabic speakers manually reviewed 10% of the normalized data from each category to verify clarity, linguistic fidelity, and domain correctness. This process resulted in exactly 9,953 high-quality training samples and 2,000 held-out finance-focused test cases (Figure [1](https://arxiv.org/html/2604.19098#S1.F1)).
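The cleanup steps described above can be approximated with a small rule-based sketch. The paper performs the transformation with Gemini-2.5-Pro followed by human review, so the regex approach, the phrase list, and the function name `strip_boilerplate` below are illustrative assumptions rather than the authors' implementation.

```python
import re

# The two phrases quoted in the text; a real list would likely be longer.
INVOCATIONS = [
    "الحمد لله، والصلاة والسلام على رسول الله",
    "أما بعد",
]

# Very rough HTML tag matcher, sufficient for this illustration only.
_TAG_RE = re.compile(r"<[^>]+>")


def strip_boilerplate(raw: str) -> str:
    """Remove HTML tags and opening invocations, then tidy whitespace."""
    text = _TAG_RE.sub(" ", raw)
    for phrase in INVOCATIONS:
        text = text.replace(phrase, " ")
    text = re.sub(r"\s+", " ", text).strip()
    # Drop punctuation left dangling at the start once the opener is gone.
    return text.lstrip("،,:.؛ ").strip()
```

Rules of this kind are brittle across thirteen differently formatted sources, which is one plausible reason to pair automated normalization with the manual 10% review the paragraph describes.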
Afterwards, we converted each test QA pair into multiple-choice (MCQ) format via Gemini-2.5-Pro, enabling both open-ended fatwā reasoning and recognition-style testing. Each MCQ consists of one correct answer derived from the source fatwā and three plausible distractors reflecting common misconceptions. Two native Arabic annotators independently reviewed the test set to assess MCQ correctness, alignment with the source fatwā, and distractor plausibility. The annotators achieved high agreement (Cohen’s κ = 0.89). Following this pilot phase, we conducted a calibration round in which annotators discussed disagreements, resolved ambiguous cases, and refined shared labeling criteria. One annotator then validated the remaining MCQs under the calibrated guidelines, ensuring that each item precisely matched its source fatwā, preserved juristic terminology, and avoided misleading options. A final audit confirmed that 95% of MCQs aligned exactly with their original QA pairs; we discarded the remaining 5% and excluded them from evaluation.
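Cohen's κ, reported here and in several later sections, corrects raw agreement for the agreement expected by chance. A minimal standard-library sketch follows; the function name `cohen_kappa` and the toy labels in the usage note are illustrative and not drawn from the Sahm annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, with annotator A giving `["ok", "ok", "bad", "bad"]` and annotator B giving `["ok", "ok", "bad", "ok"]`, observed agreement is 0.75, chance agreement is 0.5, and κ is 0.5.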
### 3.3 Business & Accounting Exams MCQ
Professional accounting assessment resources remain largely English-centric, with key certifications such as the CPA exam conducted exclusively in English. To address this gap and the limited availability of Arabic training materials despite the existence of IFRS translations, we design culturally and linguistically adapted MCQ samples covering IFRS treatments, financial ratios, budgeting, and costing, incorporating authentic Arabic financial terminology such as معدل دوران الأصول (asset turnover ratio) and زكاة الشركات (corporate almsgiving) within contextually accurate scenarios rather than direct translations of Western exam questions (Appendix [D](https://arxiv.org/html/2604.19098#A4)). We constructed the dataset by collecting 10 business exams and 8 accounting exams from multiple Arabic-speaking countries. We extracted the text from the exam PDFs via Gemini-2.5-Pro following Heakl et al. ([2025b](https://arxiv.org/html/2604.19098#bib.bib16)), after which two native Arabic-speaking annotators reviewed the output by comparing it against the original questions, correcting recognition errors, and validating formatting. The final dataset contains 457 business questions and 416 accounting questions; examples appear in Figure [1](https://arxiv.org/html/2604.19098#S1.F1).
### 3.4 Financial Report Sentiment Analysis
Despite managing trillions of dollars in assets, Arabic financial markets lack sentiment benchmarks tailored to region-specific financial discourse. Existing English datasets (Maia et al., [2018b](https://arxiv.org/html/2604.19098#bib.bib24)) focus on Western market narratives and do not capture signals central to MENA markets, including OPEC+ production decisions, صكوك (sukuk) issuances, subsidy reforms, and Shari’ah-compliance rulings. These challenges are amplified by the use of culturally grounded financial terminology, e.g., مرابحة (cost-plus financing), and by stylistic variation in Arabic financial reporting, where subtle modifiers can reverse sentiment polarity. To address this gap, we construct the first Arabic financial sentiment benchmark based on authentic market reports rather than translated proxies. We collect 200 Arabic financial reports (100 Islamic finance–focused and 100 general) from Argaam ([https://www.argaam.com/](https://www.argaam.com/)) and annotate them with three document-level sentiment labels: Positive, Negative, and Neutral. Two native Arabic annotators labeled all reports using a custom web-based annotation platform (Figure [15](https://arxiv.org/html/2604.19098#A5.F15)) following shared guidelines that emphasize holistic document interpretation rather than sentence-level cues, resulting in high inter-annotator agreement (Cohen’s κ = 0.91). We then conducted a calibration phase where annotators resolved disagreements and refined decision criteria. For mixed-signal reports, we assign the dominant polarity if >60% of content supports it; otherwise Neutral. A third expert adjudicates residual disagreements. We split the dataset into 120 training and 80 test reports.
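The mixed-signal rule can be sketched as a simple threshold over per-segment judgments. The annotators apply the rule holistically, so representing a report as a list of segment labels, as well as the function name `document_label`, is an assumption made purely for illustration.

```python
from collections import Counter


def document_label(segment_labels, threshold: float = 0.60) -> str:
    """Assign a document-level label from hypothetical per-segment labels.

    A polarity wins only if strictly more than `threshold` of the
    segments support it; otherwise the document is Neutral.
    """
    counts = Counter(segment_labels)
    total = sum(counts.values())
    if total == 0:
        return "Neutral"
    for polarity in ("Positive", "Negative"):
        if counts[polarity] / total > threshold:
            return polarity
    return "Neutral"
```

Under this reading, a report that is 70% positive is labelled Positive, while one that is exactly 60% positive falls back to Neutral because the rule requires strictly more than 60%.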
### 3.5 Report Extractive Summarization
Extractive summarization is critical for Arabic financial reporting, where annual reports are written in Arabic but frequently contain mixed numeral systems, embedded English financial acronyms and brand names rendered in Arabic script (e.g., المعايير الدولية للتقارير المالية / IFRS and إتش إس بي سي / HSBC), and specialized Islamic finance terminology such as صكوك (sukuk). Misinterpreting or omitting these elements can distort regulatory interpretation, compliance assessment, and financial valuation. To support this task, we compile 200 Arabic financial reports (100 general and 100 Islamic) from Argaam and annotate them with extractive summaries written in Arabic by two native Arabic speakers. Rather than treating summarization as a subjective agreement task, we use ROUGE (Lin, [2004](https://arxiv.org/html/2604.19098#bib.bib22)) to measure overlap between independently produced summaries as a consistency check and select the more complete summary as the gold reference. We split the dataset into 120 training reports and 80 test reports. Further details appear in Appendix [F](https://arxiv.org/html/2604.19098#A6).
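The overlap check between two independently written summaries can be illustrated with a bare-bones ROUGE-1 F1 computed over whitespace tokens. The paper does not specify its ROUGE implementation or Arabic tokenization, so the whitespace split and the function name `rouge1_f1` are simplifying assumptions.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two texts (whitespace tokenization)."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A production setup would normally add Arabic-aware normalization (for example, handling diacritics and clitics) before counting overlaps.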
### 3.6 Event-Cause Reasoning QA
Financial event-cause reasoning is underexplored in Arabic due to the lack of datasets that require models to explain why financial or regulatory events occur and what implications they entail. To address this gap, we introduce an event-cause reasoning task that evaluates whether models can analyze Arabic financial reports and produce analytical explanations grounded in reported financial data, including market movements and صكوك issuances. We collect 200 Arabic financial reports (100 Islamic finance–focused and 100 general) from Argaam. Two native Arabic financial experts annotate each report by formulating one analytical question that links multiple reported data points and by writing a concise expert answer explaining the underlying causes and implications using only information in the article. A pilot phase on 20 reports ensures guideline clarity. We provide further details in Appendix [G](https://arxiv.org/html/2604.19098#A7). We assess annotation quality at two levels: Cohen’s κ = 0.86 measures agreement on event-cause identification, while ROUGE overlap serves as a consistency check for independently written answers. After calibration to resolve disagreements and align on edge cases, one expert completes the remaining annotations under the agreed criteria.
| Model | MCQ Accounting (Acc. % ↑) | MCQ Business (Acc. % ↑) | MCQ Fatwā (Acc. % ↑) | MCQ Sentiment (Acc. % ↑) | MCQ Mean | Event-Cause QA (0–10 ↑) | Islamic-Standards-QA (0–10 ↑) | Fatwa-QA (0–10 ↑) | Open-Ended QA Mean |
|---|---|---|---|---|---|---|---|---|---|
| Open-source Models: ≥70B Parameters | | | | | | | | | |
| Qwen2.5-72B-Instruct, Yang et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib38)) | 65.87 ± 2.70 | 74.86 ± 0.32 | 84.65 ± 0.33 | 75.00 ± 1.25 | 75.10 | 8.1000 ± 0.10 | 5.6330 ± 0.10 | 5.3912 ± 0.06 | 6.3747 |
| LLaMA-3.1-70B, Grattafiori et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib14)) | 52.10 ± 2.79 | 77.60 ± 1.14 | 84.90 ± 0.15 | 80.00 ± 3.31 | 73.65 | 6.623 ± 0.15 | 3.7245 ± 0.10 | 4.7607 ± 0.08 | 5.036 |
| Open-source Models: <70B Parameters | | | | | | | | | |
| Qwen2.5-14B-Instruct, Yang et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib38)) | 49.10 ± 3.93 | 63.39 ± 0.83 | 76.05 ± 0.85 | 57.50 ± 3.82 | 61.51 | 7.4975 ± 0.10 | 4.8806 ± 0.10 | 4.0576 ± 0.06 | 5.4786 |
| Qwen2.5-7B-Instruct, Yang et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib38)) | 48.50 ± 2.85 | 59.56 ± 1.14 | 70.00 ± 0.28 | 55.00 ± 1.91 | 58.27 | 6.1038 ± 0.12 | 3.4039 ± 0.10 | 2.6815 ± 0.08 | 4.0631 |
| Gemma-2-9B-IT, Riviere et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib30)) | 49.10 ± 2.74 | 63.39 ± 3.83 | 66.60 ± 0.61 | 55.00 ± 1.44 | 58.52 | 7.1438 ± 0.08 | 4.2306 ± 0.08 | 3.4266 ± 0.06 | 4.9336 |
| Gemma-3-27B-IT, Kamath et al. ([2025](https://arxiv.org/html/2604.19098#bib.bib21)) | 53.89 ± 2.16 | 73.22 ± 0.32 | 80.65 ± 0.18 | 80.00 ± 0.72 | 71.94 | 8.7188 ± 0.05 | 6.1708 ± 0.08 | 5.1929 ± 0.05 | 6.6942 |
| Gemma-3-4B-IT, Kamath et al. ([2025](https://arxiv.org/html/2604.19098#bib.bib21)) | 38.32 ± 2.27 | 67.76 ± 0.32 | 61.35 ± 0.18 | 75.00 ± 1.44 | 60.61 | 7.4075 ± 0.08 | 2.8985 ± 0.08 | 2.4767 ± 0.06 | 4.2609 |
| LLaMA-3.1-8B, Grattafiori et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib14)) | 41.92 ± 3.28 | 60.66 ± 4.45 | 64.05 ± 3.62 | 73.75 ± 5.77 | 60.60 | 4.9231 ± 0.18 | 2.5168 ± 0.12 | 1.4025 ± 0.08 | 2.9475 |
| Mixtral-8x7B-Instruct, Jiang et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib20)) | 32.93 ± 1.04 | 60.66 ± 0.63 | 62.15 ± 0.34 | 70.00 ± 0.72 | 56.44 | 4.5538 ± 0.08 | 2.4980 ± 0.08 | 1.7896 ± 0.06 | 2.9471 |
| Proprietary Models: Reasoning-Enhanced | | | | | | | | | |
| GPT-5, OpenAI ([2025](https://arxiv.org/html/2604.19098#bib.bib26)) | 65.27 ± 2.27 | 72.68 ± 1.26 | 90.75 ± 0.45 | 78.75 ± 1.25 | 76.86 | 9.6831 ± 0.03 | 8.7965 ± 0.05 | 8.0515 ± 0.04 | 8.8437 |
| GPT-4o, Hurst et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib17)) | 60.48 ± 2.07 | 78.14 ± 0.32 | 87.70 ± 0.10 | 77.50 ± 0.00 | 75.96 | 8.3125 ± 0.06 | 6.6598 ± 0.08 | 6.5219 ± 0.04 | 7.1647 |
| Proprietary Models: General-Purpose | | | | | | | | | |
| Claude-Opus-4.5, Anthropic ([2025b](https://arxiv.org/html/2604.19098#bib.bib6)) | 77.84 ± 2.42 | 76.50 ± 1.14 | 91.75 ± 0.33 | 75.00 ± 2.50 | 80.27 | 9.6818 ± 0.03 | 8.0438 ± 0.05 | 8.8090 ± 0.03 | 8.8449 |
| Claude-Sonnet-4.5, Anthropic ([2025c](https://arxiv.org/html/2604.19098#bib.bib7)) | 78.44 ± 1.20 | 76.50 ± 1.45 | 88.15 ± 0.38 | 77.50 ± 1.25 | 80.15 | 9.3388 ± 0.04 | 8.2588 ± 0.05 | 7.6049 ± 0.03 | 8.4008 |
| Claude-Haiku-4.5, Anthropic ([2025a](https://arxiv.org/html/2604.19098#bib.bib5)) | 67.66 ± 1.80 | 73.77 ± 1.30 | 84.90 ± 0.40 | 77.50 ± 1.80 | 75.96 | 9.1050 ± 0.05 | 7.0002 ± 0.07 | 6.5341 ± 0.05 | 7.5464 |
| Gemini-3-Flash (preview), Google ([2024](https://arxiv.org/html/2604.19098#bib.bib12)) | 76.05 ± 1.95 | 74.86 ± 0.95 | 89.90 ± 0.30 | 81.25 ± 1.25 | 80.52 | 9.8369 ± 0.02 | 9.1649 ± 0.03 | 9.1571 ± 0.02 | 9.0798 |
| GPT-4o-mini, Hurst et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib17)) | 58.08 ± 2.10 | 77.60 ± 0.40 | 81.75 ± 0.20 | 75.00 ± 0.50 | 73.61 | 7.9613 ± 0.08 | 5.6094 ± 0.10 | 5.3087 ± 0.06 | 6.2931 |
| Arabic Models | | | | | | | | | |
| ALLAM-7B, Bari et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib9)) | 44.91 ± 3.55 | 68.31 ± 3.83 | 74.40 ± 2.83 | 58.75 ± 2.00 | 61.59 | 6.8875 ± 0.10 | 4.9364 ± 0.08 | 4.2185 ± 0.05 | 5.3475 |
| Fanar-1-9B, Abbas et al. ([2025](https://arxiv.org/html/2604.19098#bib.bib1)) | 47.31 ± 2.42 | 66.12 ± 1.67 | 74.45 ± 0.35 | 58.75 ± 2.60 | 61.66 | 7.5850 ± 0.10 | 4.9607 ± 0.08 | 4.4600 ± 0.06 | 5.6686 |
| SILMA-9B, silma-ai ([2024](https://arxiv.org/html/2604.19098#bib.bib33)) | 50.90 ± 21.73 | 69.40 ± 6.61 | 62.55 ± 5.57 | 30.00 ± 3.75 | 53.21 | 1.8969 ± 0.20 | 3.3547 ± 0.12 | 2.0711 ± 0.08 | 2.4409 |
| Jais-2-8B, Sengupta et al. ([2023](https://arxiv.org/html/2604.19098#bib.bib32)) | 35.33 ± 3.00 | 60.30 ± 2.80 | 66.10 ± 1.80 | 46.25 ± 2.50 | 52.00 | 4.6922 ± 0.15 | 4.245 ± 0.10 | 2.5147 ± 0.08 | 3.8133 |

Table 2: Unified leaderboard comparing MCQ tasks (Accuracy %) and open-ended QA tasks (Score 0–10). Values shown as mean ± std over 3 runs; open-ended scores are judged by two independent LLM judges. Open-ended QA Mean is averaged over Event-Cause QA, Islamic-Standards-QA, and Fatwa-QA.
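The aggregation described in the caption can be sketched with Python's `statistics` module. Whether the reported spread is a sample or population standard deviation is not stated, so the use of `stdev` (sample) is an assumption, and both function names are hypothetical.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample std) over repeated runs of one model on one task."""
    return mean(scores), stdev(scores)


def open_ended_mean(event_cause: float, standards_qa: float, fatwa_qa: float) -> float:
    """Average the three open-ended task scores, as the caption describes."""
    return mean([event_cause, standards_qa, fatwa_qa])


# Scores from the Qwen2.5-72B-Instruct row of Table 2.
print(round(open_ended_mean(8.1000, 5.6330, 5.3912), 4))
```

Applied to the Qwen2.5-72B-Instruct row, the three open-ended scores average to the 6.3747 shown in the table's final column.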
## 4 Experiments
##### Evaluated Models.
We evaluated 20 models spanning Arabic-centric models (Bari et al., [2024](https://arxiv.org/html/2604.19098#bib.bib9); Abbas et al., [2025](https://arxiv.org/html/2604.19098#bib.bib1); silma-ai, [2024](https://arxiv.org/html/2604.19098#bib.bib33)), which are publicly available instruction-tuned systems for regional adaptation; open-weight models (Riviere et al., [2024](https://arxiv.org/html/2604.19098#bib.bib30); Kamath et al., [2025](https://arxiv.org/html/2604.19098#bib.bib21); Grattafiori et al., [2024](https://arxiv.org/html/2604.19098#bib.bib14); Yang et al., [2024](https://arxiv.org/html/2604.19098#bib.bib38); Jiang et al., [2024](https://arxiv.org/html/2604.19098#bib.bib20)), which serve as strong multilingual and general-purpose baselines; and proprietary models (Hurst et al., [2024](https://arxiv.org/html/2604.19098#bib.bib17); OpenAI, [2025](https://arxiv.org/html/2604.19098#bib.bib26); Anthropic, [2025b](https://arxiv.org/html/2604.19098#bib.bib6), [c](https://arxiv.org/html/2604.19098#bib.bib7), [a](https://arxiv.org/html/2604.19098#bib.bib5); Google, [2024](https://arxiv.org/html/2604.19098#bib.bib12)), enabling controlled analysis across language, scale, and capability dimensions. To assess whether domain-specific fine-tuning can close the gap between Arabic-centric and frontier models, we fine-tune three Arabic LLMs (ALLAM-7B, Jais-2-8B, and SILMA-9B) on the Sahm training split using LoRA (r = 64, α = 128, lr = 2e-4, 3 epochs). Detailed model specifications are provided in Table [6](https://arxiv.org/html/2604.19098#A3.T6).
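For readers less familiar with LoRA, the sketch below records the stated hyperparameters in a plain dictionary and computes the scaling factor α/r that the standard LoRA formulation applies to the low-rank update. The dictionary keys and the function name are hypothetical; target modules, dropout, and other settings are not given in this section and are deliberately left out.

```python
# Hyperparameters as stated in the text; key names are illustrative.
LORA_HYPERPARAMS = {
    "r": 64,                # rank of the low-rank update matrices
    "lora_alpha": 128,      # scaling numerator (alpha)
    "learning_rate": 2e-4,
    "num_epochs": 3,
}


def lora_scaling(cfg: dict) -> float:
    """Scaling applied to the low-rank update in standard LoRA: alpha / r."""
    return cfg["lora_alpha"] / cfg["r"]


print(lora_scaling(LORA_HYPERPARAMS))
```

With r = 64 and α = 128 the update is scaled by 2, a common choice that keeps the effective update magnitude stable as rank changes.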
We evaluate Accounting Exams, Business Exams, Fatwa MCQ, and Financial Sentiment with exact-match accuracy, normalizing free-form outputs (e.g., option text/letters) to a single choice before scoring (Appendix [H](https://arxiv.org/html/2604.19098#A8)). For extractive summarization, we report ROUGE-F1 (ROUGE-1/2/L) against gold extractive references (models are instructed to output verbatim sentences). For Fatwa QA, Shari’ah Standards QA, and Event-Cause QA, we use Gemini-2.5-Flash as an LLM-as-a-judge (blind to model identity): given the original Arabic prompt, gold reference, and model answer, it returns a JSON-validated, additive (sum-of-components) [0, 10] score under a shared rubric assessing alignment with the reference ruling/conclusion, preservation of key constraints or quantitative fidelity, correctness (doctrinal/factual or financial reasoning), Arabic clarity, and directness/grounding. We validate the judge with two expert Arabic annotators on 200 randomly sampled outputs across the three tasks (MSE 0.41, Pearson r = 0.92; inter-annotator agreement κ = 0.84 on discretized scores; Appendix [J](https://arxiv.org/html/2604.19098#A10)). All judge and model generations use greedy decoding (temperature 0; no sampling) with fixed maximum lengths; full prompts, rubrics, schema, critical checks, and settings appear in Appendix [J](https://arxiv.org/html/2604.19098#A10).
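Two pieces of this protocol lend themselves to a short sketch: mapping a free-form MCQ output to a single letter before exact-match scoring, and comparing judge scores with human scores via MSE and Pearson correlation. The actual normalization rules are in the paper's appendix and are not reproduced here, so every function name and the matching heuristic below are assumptions.

```python
import math
import re
from typing import Dict, Optional, Sequence

# Matches a standalone uppercase option letter such as "B" or "(C)".
_LETTER_RE = re.compile(r"\b([A-D])\b")


def normalize_choice(output: str, options: Dict[str, str]) -> Optional[str]:
    """Map a model output to one option letter, or None if undecidable."""
    match = _LETTER_RE.search(output)
    if match:
        return match.group(1)
    # Fall back to matching the option text itself.
    for letter, text in options.items():
        if text and text in output:
            return letter
    return None


def accuracy(predictions: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Exact-match accuracy over normalized choices."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def mse(judge: Sequence[float], human: Sequence[float]) -> float:
    """Mean squared error between judge and human scores."""
    return sum((j - h) ** 2 for j, h in zip(judge, human)) / len(judge)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

The letter heuristic is intentionally conservative: it only accepts uppercase A–D as a standalone token, so lowercase or Arabic option markers would need additional rules in a real harness.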
We organize our findings around three core questions: (1) How do models perform across recognition versus generation tasks? (2) What distinguishes strong Arabic financial reasoning from mere language fluency? (3) Where do models systematically fail, and why?
Figure 3: Effect of reasoning token budget on ruling accuracy. Green indicates improvement with increased budget, red indicates decline, and blue indicates no change.
### 4.1 Main Results
Figure 4: Qualitative error analysis showing representative failure modes. Left: Islamic knowledge error where Gemma-3-27B incorrectly rules a permissible transaction as forbidden, citing fabricated evidence with wrong wording of authentic Hadith. Right: Concept confusion error where Qwen2.5-72B conflates total interest incurred with capitalizable interest in a construction loan scenario.

Accounting Reasoning Gap. As shown in Table [2](https://arxiv.org/html/2604.19098#S3.T2), Claude models exhibit substantial superiority on Accounting tasks, with Claude-Sonnet-4.5 exceeding GPT-5 by over 13 percentage points, the largest proprietary-to-proprietary gap in our evaluation. Crucially, this disparity cannot be attributed to general Arabic language proficiency alone, as these models achieve near-parity on Business (76.50% vs. 72.68%) and Fatwa (91.75% vs. 90.75%) tasks. We instead attribute this divergence to Claude’s stronger capacity for procedural numerical reasoning: the ability to apply rule-based standards (e.g., IFRS, Egyptian Auditing Standards) through multi-step logical chains. This suggests that Arabic domain reasoning capabilities may constitute an independent axis from general language proficiency, warranting architectural investigation in future work. Notably, Gemini-3-Flash inverts the recognition-generation tradeoff, achieving the highest Open-Ended QA score despite moderate MCQ performance, likely because generative tasks afford extended reasoning chains. This is supported by Figure [3](https://arxiv.org/html/2604.19098#S4.F3), where the Gemini family shows increased ruling accuracy with larger reasoning token budgets.
Figure 5: Models Talk More, Not Better\. Although models generate 4\-6× more fatwa text than humans, they do not achieve proportionally higher accuracy, indicating that verbosity serves as a proxy for uncertainty rather than expertise\.
## 5 Results
Arabic Fluency ≠ Domain Reasoning: Event\-Cause QA Exposes the Gap\. Arabic\-centric pretraining provides strong foundations for Islamic jurisprudence tasks, but fails to transfer to financial reasoning \(Accounting, Business\)\. Domain\-specific fine\-tuning on Sahm closes this gap across all Arabic LLMs, with mean MCQ gains of \+13\.7% \(Sahm\-ALLAM\-7B\), \+5\.8% \(Sahm\-Jais\-8B\), and \+5\.2% \(SILMA\-9B\), enabling Sahm\-ALLAM\-7B to surpass GPT\-5 on Accounting and Business and match 72B baselines\. Event\-Cause QA emerges as the “true IQ test” for Arabic financial reasoning\. The spread \(1\.89\-9\.84\) is the widest in the table, covering nearly the full scale\. Proprietary models cluster tightly at the top \(9\.1\-9\.8\), after which every other model falls below 8\.7\. This is where Arabic\-specific models expose their limits, as the task requires causal reasoning over Arabic financial text\. Language fluency does not imply domain reasoning: the task demands compositional causal inference that neither Arabic pretraining nor raw scale can approximate\. Qualitative analysis of failure cases \(Figure[4](https://arxiv.org/html/2604.19098#S4.F4)\) reveals two distinct error patterns\. First, models exhibit surface\-level familiarity with Islamic terminology without grounding in authoritative sources; for instance, Gemma\-3\-27B incorrectly rules a permissible transaction as forbidden while citing fabricated ḥadīth evidence\. Second, they conflate related but distinct financial concepts, as when Qwen2\.5\-72B confuses total interest incurred with capitalizable interest by summing all expenses rather than computing weighted\-average expenditures\.
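The second error pattern can be made concrete with a small numerical sketch. The figures below are hypothetical and are not taken from the benchmark. The calculation follows a simplified textbook treatment, assumed here for illustration, in which capitalizable interest is the borrowing rate applied to weighted-average accumulated expenditures, capped at the interest actually incurred.

```python
from dataclasses import dataclass


@dataclass
class Expenditure:
    amount: float            # cash spent on the construction project
    months_outstanding: int  # months from the payment date to period end


def weighted_average_expenditures(items, months_in_period=12):
    """Weight each payment by the fraction of the period it was outstanding."""
    return sum(e.amount * e.months_outstanding / months_in_period for e in items)


def capitalizable_interest(items, annual_rate, interest_incurred, months_in_period=12):
    """Rate times weighted-average expenditures, capped at interest incurred."""
    avoidable = weighted_average_expenditures(items, months_in_period) * annual_rate
    return min(avoidable, interest_incurred)


# Hypothetical scenario: a 1,000,000 construction loan at 10% held all year.
payments = [
    Expenditure(amount=400_000, months_outstanding=12),
    Expenditure(amount=600_000, months_outstanding=4),
]
total_interest = 1_000_000 * 0.10  # interest incurred on the whole loan

capitalized = capitalizable_interest(payments, 0.10, total_interest)
print(f"Total interest incurred: {total_interest:,.0f}")
print(f"Capitalizable interest:  {capitalized:,.0f}")
```

In this made-up case the two quantities differ substantially, which is the distinction the model in Figure 4 is described as missing.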
The Recognition\-Generation Gap\. A model that can identify correct Islamic rulings when presented as options should, in principle, generate coherent fatwās from scratch\. Our results challenge this assumption\. On Fatwa MCQ, Claude\-Opus\-4\.5 and GPT\-5 achieve 91\.75% and 90\.75% accuracy, respectively\. However, their Fatwa QA scores drop to 8\.81 and 8\.05 out of 10, a gap suggesting that recognition and generation tap fundamentally different competencies\. Figure[5](https://arxiv.org/html/2604.19098#S4.F5) illuminates one mechanism behind this gap\. Human fatwās peak at approximately 50 words; model responses peak at 300 words, a 4\-6× inflation\. Despite this verbosity, models do not achieve proportionally higher scores\. We interpret this pattern as *verbosity as uncertainty*: when models lack confident knowledge, they hedge with additional text rather than committing to precise rulings\. This finding has practical implications for deployment: response length may serve as a useful signal for answer confidence in Arabic financial QA systems\. It further suggests that evaluation protocols should distinguish between recognition and generation to avoid overestimating real\-world model reliability\.
| Model | ROUGE\-1 | ROUGE\-2 | ROUGE\-L |
| --- | --- | --- | --- |
| *Proprietary Models – Reasoning\-Enhanced* | | | |
| Claude\-Opus\-4\.5 | 78\.22 | 63\.17 | 64\.14 |
| GPT\-5 | 75\.19 | 63\.70 | 64\.11 |
| Claude\-Sonnet\-4\.5 | 79\.86 | 64\.98 | 65\.13 |
| *Proprietary Models – General\-Purpose* | | | |
| Claude\-Haiku\-4\.5 | 79\.39 | 61\.40 | 63\.62 |
| GPT\-4o\-mini | 77\.79 | 62\.90 | 64\.08 |
| GPT\-4o | 78\.91 | 63\.16 | 63\.71 |
| Gemini\-3\-Flash | 49\.36 | 35\.83 | 43\.02 |
| Gemini\-2\.5\-Flash | 39\.46 | 27\.17 | 36\.81 |
| *Open\-source Models: ≥70B parameters* | | | |
| Gemma\-3\-27B\-IT | 79\.25 | 63\.57 | 63\.42 |
| Qwen2\.5\-72B\-Instruct | 40\.52 | 29\.50 | 34\.04 |
| Meta\-LLaMA\-3\.1\-70B | 39\.64 | 31\.40 | 32\.65 |
| *Open\-source Models: <70B parameters* | | | |
| Qwen2\.5\-14B\-Instruct | 44\.42 | 30\.90 | 35\.82 |
| Gemma\-3\-4B\-IT | 76\.52 | 62\.06 | 60\.93 |
| Meta\-LLaMA\-3\.1\-8B | 66\.67 | 47\.92 | 56\.10 |
| Mixtral\-8x7B\-Instruct | 32\.71 | 13\.07 | 23\.78 |
| Qwen2\.5\-7B\-Instruct | 25\.15 | 12\.01 | 21\.86 |
| *Arabic Models* | | | |
| Jais\-2\-8B | 73\.68 | 56\.54 | 61\.17 |
| Fanar\-1\-9B\-Instruct | 60\.51 | 35\.97 | 46\.96 |
| ALLaM\-7B\-Instruct | 35\.97 | 22\.61 | 28\.24 |
| SILMA\-9B\-Instruct | 27\.92 | 16\.66 | 25\.99 |

Table 3: Extractive summarization performance on Arabic financial reports evaluated using ROUGE F1 \(%\)\.

### 5\.1 Extractive Summarization
Table[3](https://arxiv.org/html/2604.19098#S5.T3) reveals a striking inversion: Claude\-Sonnet\-4\.5 achieves the highest ROUGE\-1 \(79\.86\), while Gemini\-2\.5\-Flash, a strong open\-ended reasoner, collapses to 39\.46, underperforming even GPT\-4o\-mini \(77\.79\)\. This exposes a fundamental tension: extractive summarization rewards *verbatim selection*, not generative fluency\. Consider a typical report: “نجحت شركة بن غاطي للتطوير العقاري في طرح المزيد من الصكوك… بقيمة 300 مليون دولار أمريكي، ببورصة لندن وناسداك دبي” \(Binghatti Development successfully issued additional sukuk… valued at $300M, listed on the London Stock Exchange and Nasdaq Dubai\)\. The gold summary must preserve the entity name, Islamic instrument \(sukuk\), exact figure, and dual listing, elements that paraphrasing models systematically distort\. Surprisingly, Gemma\-3\-4B\-IT achieves 76\.52 ROUGE\-1, rivaling Claude\-Opus\-4\.5 \(78\.22\) with a fraction of the parameters, suggesting extraction benefits from constrained generation rather than extended reasoning\. For Arabic\-centric models, domain\-specific tuning proves decisive: SAHM\-7B\-Instruct attains 57\.79 ROUGE\-L, outperforming ALLaM\-7B by \+29\.55 points, demonstrating that Arabic pretraining alone does not confer financial extraction competence; targeted domain adaptation does\.
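To illustrate why verbatim selection is rewarded, the following is a minimal sketch of ROUGE-N and ROUGE-L F1. It assumes plain whitespace tokenization and no Arabic normalization or stemming; the scorer and preprocessing actually used for Table 3 are not described in this section, so these choices and the English example strings are assumptions made for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall, or 0.0 when there is no overlap."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    c, r = candidate.split(), reference.split()
    return f1(lcs_length(c, r), len(c), len(r))


# Hypothetical English stand-ins for a gold summary and two system outputs.
reference = "Binghatti issued sukuk valued at $300M on Nasdaq Dubai"
verbatim = "Binghatti issued sukuk valued at $300M on Nasdaq Dubai"
paraphrase = "The developer raised $300M through Islamic bonds in Dubai"

for name, text in [("verbatim", verbatim), ("paraphrase", paraphrase)]:
    scores = (rouge_n(text, reference, 1), rouge_n(text, reference, 2), rouge_l(text, reference))
    print(name, [round(s, 3) for s in scores])
```

A faithful paraphrase shares few exact tokens with the reference, so its n-gram overlap falls sharply even when the meaning is preserved.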
Figure 6: Root cause distribution of model errors across Islamic knowledge and reasoning tasks\.
### 5\.2 Domain Adaptation Across Arabic LLMs
We systematically evaluate domain adaptation across major Arabic\-centric LLMs\. Table[4](https://arxiv.org/html/2604.19098#S5.T4) reveals three distinct adaptation profiles: \(1\) high\-gain bases like ALLAM, yielding Sahm\-ALLAM\-7B with substantial improvement \(\+13\.68%\); \(2\) stable\-gain bases like Jais\-2, yielding Sahm\-Jais\-8B with consistent improvement across all MCQ metrics \(\+5\.78%, with notably strong gains in Sentiment, \+11\.72%\); and \(3\) selective\-gain bases like SILMA that improve on some tasks \(Sentiment \+23\.8%\) but regress on others \(Accounting \-7\.8%\)\.
MCQ columns report accuracy \(% ↑\); Open\-Ended QA columns report scores \(0–10 ↑\)\.

| Model | MCQ: Accounting | MCQ: Business | MCQ: Fatwā | MCQ: Sentiment | MCQ: Mean | QA: Event\-Cause | QA: Fatwa\-QA | QA: Islamic\-Std | QA: Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Base Models* | | | | | | | | | |
| ALLAM\-7B | 44\.91 | 68\.31 | 74\.40 | 58\.75 | 61\.59 | 6\.89 | 4\.94 | 4\.22 | 5\.35 |
| Jais\-2\-8B | 35\.33 | 60\.30 | 66\.10 | 46\.25 | 52\.00 | 4\.69 | 2\.51 | 4\.24 | 3\.81 |
| SILMA\-9B | 50\.90 | 69\.40 | 62\.55 | 30\.00 | 53\.21 | 1\.90 | 3\.35 | 2\.07 | 2\.44 |
| *Fine\-tuned Models* | | | | | | | | | |
| Sahm\-ALLAM\-7B | 71\.40 \(\+26\.5\) | 93\.99 \(\+25\.7\) | 74\.45 \(\+0\.1\) | 61\.25 \(\+2\.5\) | 75\.27 \(\+13\.7\) | 6\.79 \(\-0\.1\) | 6\.48 \(\+1\.5\) | 4\.12 \(\-0\.1\) | 5\.80 \(\+0\.5\) |
| Sahm\-Jais\-8B | 40\.72 \(\+5\.4\) | 62\.30 \(\+2\.0\) | 70\.14 \(\+4\.0\) | 57\.97 \(\+11\.7\) | 57\.78 \(\+5\.8\) | 5\.25 \(\+0\.6\) | 4\.69 \(\+2\.2\) | 4\.97 \(\+0\.7\) | 4\.97 \(\+1\.16\) |
| SILMA\-9B \(fine\-tuned\) | 43\.11 \(\-7\.8\) | 75\.96 \(\+6\.6\) | 60\.60 \(\-2\.0\) | 53\.75 \(\+23\.8\) | 58\.36 \(\+5\.2\) | 2\.01 \(\+0\.1\) | 3\.67 \(\+0\.3\) | 3\.67 \(\+1\.6\) | 3\.12 \(\+0\.7\) |

Table 4: Domain adaptation across Arabic LLMs\. MCQ accuracy \(%\) and Open\-Ended QA scores \(0–10\) before and after fine\-tuning on Sahm\. Bold model names \(Sahm\-ALLAM\-7B, Sahm\-Jais\-8B\) denote the two released Sahm\-family artifacts; SILMA\-9B is included as a comparison case illustrating that adaptation outcomes depend on base\-model properties\.
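The MCQ deltas and means in Table 4 can be recomputed from the per-task scores. The sketch below assumes half-up rounding and assumes that each mean delta is taken between the already rounded means; both are inferences from the reported numbers, not something the text states. The function and variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

TASKS = ("Accounting", "Business", "Fatwa", "Sentiment")

# MCQ accuracy (%) copied from Table 4, keyed by base model name.
BASE = {
    "ALLAM-7B": ("44.91", "68.31", "74.40", "58.75"),
    "Jais-2-8B": ("35.33", "60.30", "66.10", "46.25"),
    "SILMA-9B": ("50.90", "69.40", "62.55", "30.00"),
}
TUNED = {
    "ALLAM-7B": ("71.40", "93.99", "74.45", "61.25"),
    "Jais-2-8B": ("40.72", "62.30", "70.14", "57.97"),
    "SILMA-9B": ("43.11", "75.96", "60.60", "53.75"),
}


def q(value, places):
    """Round half-up to a fixed number of decimal places."""
    return value.quantize(Decimal(places), rounding=ROUND_HALF_UP)


def mean(scores):
    """Mean of the task scores, rounded to two decimals as in the table."""
    values = [Decimal(s) for s in scores]
    return q(sum(values) / len(values), "0.01")


def adaptation_profile(model):
    """Per-task deltas, mean delta, and the tasks that regress after tuning."""
    deltas = {
        task: q(Decimal(t) - Decimal(b), "0.1")
        for task, b, t in zip(TASKS, BASE[model], TUNED[model])
    }
    mean_delta = q(mean(TUNED[model]) - mean(BASE[model]), "0.1")
    regressions = [task for task, d in deltas.items() if d < 0]
    return deltas, mean_delta, regressions


for model in BASE:
    deltas, mean_delta, regressions = adaptation_profile(model)
    print(model, "mean delta:", mean_delta, "regressions:", regressions or "none")
```

Under these assumptions the sketch reproduces the reported mean gains and flags SILMA as the only base with per-task regressions, which matches the selective-gain profile described above.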
### 5\.3 Error Analysis
To diagnose failure modes, we analyze 500 randomly sampled incorrect responses across all datasets, grouped by required competence: Islamic Knowledge Errors \(Fatwa QA, Shari’ah Standards QA, Fatwa MCQ\) and Reasoning Errors \(Accounting, Business, Event\-Cause QA\); summarization errors are treated in §[5\.1](https://arxiv.org/html/2604.19098#S5.SS1) and sentiment via accuracy\. The two annotators \(see §[3](https://arxiv.org/html/2604.19098#S3)\) jointly adjudicated each error against the gold reference through a consensus protocol, additionally verifying cited religious evidence for Islamic tasks\. We chose consensus adjudication over independent annotation with post\-hoc inter\-annotator agreement \(IAA\) because it maximizes taxonomy coverage and ensures consistent categorization across heterogeneous errors requiring both jurisprudential and financial expertise\.
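The sampling and grouping step can be sketched as follows. The record fields (`dataset`, `correct`), the fixed seed, and the synthetic records are hypothetical; the text specifies only the sample size and the two competence groups.

```python
import random
from collections import defaultdict

# Mapping from dataset to required competence, following the grouping above.
COMPETENCE = {
    "Fatwa QA": "Islamic Knowledge",
    "Shari'ah Standards QA": "Islamic Knowledge",
    "Fatwa MCQ": "Islamic Knowledge",
    "Accounting": "Reasoning",
    "Business": "Reasoning",
    "Event-Cause QA": "Reasoning",
}


def sample_errors(records, k=500, seed=0):
    """Draw up to k incorrect responses uniformly at random and group them."""
    pool = [r for r in records if not r["correct"] and r["dataset"] in COMPETENCE]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    chosen = rng.sample(pool, min(k, len(pool)))
    groups = defaultdict(list)
    for record in chosen:
        groups[COMPETENCE[record["dataset"]]].append(record)
    return dict(groups)


# Synthetic records standing in for model responses (not real benchmark data).
datasets = list(COMPETENCE) + ["Sentiment"]
records = [
    {"id": i, "dataset": datasets[i % len(datasets)], "correct": i % 3 == 0}
    for i in range(3000)
]
groups = sample_errors(records)
print({name: len(items) for name, items in groups.items()})
```

Datasets outside the mapping, such as sentiment, are excluded from the pool because they are analyzed separately.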
##### Error Breakdown\.
Figure[6](https://arxiv.org/html/2604.19098#S5.F6) reveals that two error types dominate: Misunderstanding Concept and Wrong Ruling, together accounting for 58\.5% of all failures\. Fabricated Evidence \(11\.4%\) and Hallucination \(9\.3%\) follow\. Notably, calculation mistakes contribute only 0\.3%: models rarely fail at arithmetic but frequently fail at *knowing which arithmetic to perform*\.
Figure 7: Effect of the number of Hadith and Qur’an evidence citations on ruling accuracy\.
##### Effect of Evidence Count on Accuracy\.
Figure[7](https://arxiv.org/html/2604.19098#S5.F7) examines whether the presence of scriptural evidence \(Qur’ānic verses and ḥadīth\) in reference answers correlates with model accuracy\. We observe an approximately logarithmic relationship: accuracy rises from 28% with zero evidence to approximately 55% with six or more citations\. This pattern admits two interpretations\. Optimistically, models may leverage textual evidence as grounding signals\. Pessimistically, questions with more evidence may simply be easier or more frequently represented in training data\. The increased variance at higher evidence counts \(shaded region\) suggests the relationship is not deterministic\.
## 6Conclusion and Future Work
We introduced Sahm, the first Arabic financial NLP benchmark integrating modern finance and Shari’ah\-compliant reasoning across seven tasks\. Evaluating 20 LLMs reveals that Arabic fluency does not imply financial reasoning, and fine\-tuning on Sahm yields two complementary released models: Sahm\-ALLAM\-7B surpasses GPT\-5 on Accounting and Business while matching 72B baselines, and Sahm\-Jais\-8B achieves uniformly positive transfer across all tasks, demonstrating that targeted domain adaptation outperforms scale\. We release all resources to support trustworthy Arabic financial assistants\.
Several directions extend this work\. First, Sahm currently focuses on formal financial text; incorporating informal genres such as retail investor discourse, social media financial discussions, and dialectal Arabic would broaden coverage\. Second, Arabic financial reports frequently contain tables, charts, and mixed\-format documents; extending the benchmark to multimodal reasoning over structured financial data is a natural next step\. Third, our evaluation assesses answer correctness but not evidence traceability; future metrics should explicitly verify cited Qur’anic verses, ḥadīth reports, and AAOIFI standard references\. Fourth, cross\-lingual transfer from English financial benchmarks to Arabic remains unexplored; investigating whether English financial reasoning capabilities transfer to Arabic could reduce data requirements\. Finally, regional variation in Shari’ah interpretation across different supervisory bodies warrants task variants that evaluate model robustness to jurisdictional differences in Islamic finance rulings\.
## Limitations
Scope and coverage\. Sahm is built from curated, document\-grounded sources and covers as much of the available public material as feasible; however, practical access and usage constraints on some online sources limit the extent to which additional genres can be incorporated at this time\. As a result, while the benchmark provides strong provenance and reduces ambiguity, it does not yet cover all Arabic financial genres \(e\.g\., informal retail\-investor discourse\) or fully capture regional and institutional variation in Arabic financial writing\.
Shari’ah\-related content\. For Shari’ah\-oriented questions, Sahm evaluates faithfulness to the referenced material and the reasoning constraints reflected in the provided sources; since interpretations may differ across jurisdictions and supervisory bodies, the benchmark is not intended to adjudicate between schools of thought, but rather to test source\-grounded answering under the stated assumptions\.
Future evaluation directions\. As future work, we plan to develop evaluation metrics that explicitly assess \(i\) the existence and correctness of cited, source\-verifiable evidence, including traceable support from the underlying materials \(e\.g\., fatwa text and financial report statements\) and, when answers cite religious evidence, the correctness of references such as Qur’anic verses, hadith reports, or named fiqh sources; and \(ii\) the accuracy of book/standard citations in model outputs \(e\.g\., correct document title, section/article identifiers, and pointers that match the relevant source segment\), enabling more direct measurement of citation faithfulness and evidence\-groundedness\.
## Ethical Statement and Broad Impact
##### Licensing\.
We release Sahm under a dual license: \(1\) code and evaluation scripts under the MIT License, and \(2\) annotation data under CC BY\-NC 4\.0, restricting commercial use while enabling academic research\. Users must independently obtain source documents where applicable\.
##### Availability\.
## Acknowledgments
We acknowledge The Fin AI community for its research support, feedback, and collaborative environment that contributed to this work\.
## Appendix A Islamic Finance Shari’ah Standards QA: Sources and Processing
##### OCR Quality Evaluation\.
We developed a dedicated OCR quality evaluation tool to systematically assess recognition accuracy in Arabic legal–financial documents\. The tool compares raw machine\-extracted text against both the original scanned page and a manually corrected reference, enabling fine\-grained verification of OCR fidelity at the page level\.
For each document, the system pairs a scanned page image \(e\.g\., page\_001\.png\) with its corresponding OCR output \(page\_001\.txt\) and presents them side by side: the original page image appears in the left panel, while the OCR\-generated Arabic text is shown on the right\. Annotators inspect these pairs to identify errors such as نص مفقود \(missing text\), أحرف غير صحيحة \(incorrect characters\), ترتيب كلمات خاطئ \(incorrect word order\), and فقدان التنسيق \(formatting loss\)\. When needed, they correct the OCR output using an editable field while monitoring a live similarity score reflecting the edit distance between the corrected and original text\.
In addition to direct corrections, annotators label common OCR failure modes, including distorted symbols \(رموز خاصة مشوهة\), punctuation errors \(أخطاء علامات الترقيم\), and inaccurate numerals \(أرقام غير دقيقة\), and may add targeted comments on recurring issues such as confusion between visually similar Arabic characters \(e\.g\., ب vs\. ن\) or misinterpretation of التشكيل \(diacritics\)\.
Figure 8: OCR quality evaluation interface for the Shari’ah Standards QA dataset\. The tool displays each scanned page from the AAOIFI Shari’ah Standards \(left\) alongside the OCR\-extracted Arabic text \(right\) to support manual quality verification\. Annotators compare the original page with the extracted text, flag recognition errors in diacritics, numerals, and domain\-specific terminology, and add corrective notes \(bottom\)\. A progress bar tracks annotation completion and overall OCR accuracy\.

The system automatically computes a quantitative quality score using character\-level edit distance and maps similarity percentages to four interpretable categories: Excellent \(≥95%\), Good \(80–95%\), Partial \(50–80%\), and Poor \(<50%\)\. All evaluation steps are logged as structured JSON records, including original and corrected text, similarity scores, identified error types, annotator comments, and timestamps, supporting reproducibility and auditability\.
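A minimal sketch of such a score is given below. It assumes Levenshtein distance normalized by the longer of the two strings and treats the lower bound of each category as inclusive; the exact normalization, boundary handling, and JSON field names used by the tool are not stated, so all of these are assumptions.

```python
import json
from datetime import datetime, timezone


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(ocr_text: str, corrected: str) -> float:
    """Similarity in percent, normalized by the longer string (an assumption)."""
    longest = max(len(ocr_text), len(corrected))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(ocr_text, corrected) / longest)


def category(score: float) -> str:
    """Map a similarity percentage to the four categories named above."""
    if score >= 95:
        return "Excellent"
    if score >= 80:
        return "Good"
    if score >= 50:
        return "Partial"
    return "Poor"


def log_record(page_id, ocr_text, corrected, error_types=(), comment=""):
    """Serialize one evaluation step as a JSON record (hypothetical schema)."""
    score = similarity(ocr_text, corrected)
    record = {
        "page": page_id,
        "original": ocr_text,
        "corrected": corrected,
        "similarity": round(score, 2),
        "category": category(score),
        "error_types": list(error_types),
        "comment": comment,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, ensure_ascii=False)


# Hypothetical page where OCR swapped two visually similar letters.
entry = json.loads(log_record("page_001", "النبك", "البنك", ["incorrect characters"]))
print(entry["similarity"], entry["category"])
```

On very short strings a single swapped pair of letters already moves the score across category boundaries, so page-level text gives more stable scores than word-level snippets.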
Beyond per\-page inspection, the pipeline enables aggregate analysis of OCR performance across document collections, allowing researchers to identify systematic error patterns, benchmark OCR quality across heterogeneous Arabic sources, and inform downstream normalization and model refinement\. Overall, this human\-in\-the\-loop methodology helps ensure that OCR text used downstream, such as in Arabic financial NLP benchmarks and model training, is verified against the source and free from recognition errors that could affect الاستدلال الشرعي \(jurisprudential reasoning\) or التحليل المالي \(financial analysis\)\. Figure[8](https://arxiv.org/html/2604.19098#A1.F8) illustrates the OCR quality evaluation interface used during annotation\.
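As an illustration of the aggregate analysis, the sketch below summarizes a collection of per-page logs. It assumes each log is a JSON object with `similarity`, `category`, and `error_types` fields; these field names and the sample values are hypothetical.

```python
import json
from collections import Counter
from statistics import mean


def aggregate(json_lines):
    """Summarize per-page OCR logs: mean similarity, categories, error types."""
    records = [json.loads(line) for line in json_lines]
    return {
        "pages": len(records),
        "mean_similarity": mean(r["similarity"] for r in records) if records else None,
        "categories": Counter(r["category"] for r in records),
        "error_types": Counter(t for r in records for t in r["error_types"]),
    }


# Hypothetical log lines for three pages.
logs = [
    '{"page": "page_001", "similarity": 98.0, "category": "Excellent", "error_types": []}',
    '{"page": "page_002", "similarity": 86.0, "category": "Good", "error_types": ["punctuation", "numerals"]}',
    '{"page": "page_003", "similarity": 62.0, "category": "Partial", "error_types": ["numerals"]}',
]
summary = aggregate(logs)
print(summary["pages"], summary["mean_similarity"])
print(summary["error_types"].most_common(1))
```

Counting error types across pages is what surfaces recurring problems, such as numerals being misread throughout a document.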
##### Prompt Design:
To ensure high\-quality OCR extraction, we design a constrained prompt \(Figure[9](https://arxiv.org/html/2604.19098#A1.F9)\) that enforces verbatim transcription, preservation of diacritics and formatting, and strict exclusion of non\-textual artifacts\. These constraints are critical for maintaining fidelity in Arabic legal documents, where minor textual variations can alter meaning\. Similarly, we construct a controlled prompt for question–answer generation \(Figure[10](https://arxiv.org/html/2604.19098#A2.F10)\) that restricts outputs to the explicit content of the Shari’ah standards\. This design prevents hallucination, preserves juridical precision, and ensures that generated QA pairs remain faithful to the source text\.
Prompt for Arabic OCR Text Extraction

Task. You are an expert Arabic OCR system specialized in legal and financial documents. Given a scanned page image from an official Islamic finance standard, your task is to extract the text *verbatim* in Arabic with maximum fidelity to the original source.

Extraction guidelines.

- Preserve the original wording exactly; do *not* paraphrase, summarize, or infer missing content.
- Preserve diacritics (التشكيل) whenever present in the source.
- Preserve all numerals exactly as written; do not normalize or convert number formats.
- Preserve punctuation, headings, lists, and paragraph boundaries as faithfully as possible.
- Do not correct perceived grammatical, typographical, or stylistic issues.
- If text is unclear or partially illegible, extract the most faithful representation without guessing.

What to ignore.

- Page numbers, running headers, footers, or decorative elements not part of the main content.
- Marginal artifacts or scanning noise that do not belong to the text.

Critical rule. Do *not* add explanations, comments, translations, or annotations. Output Arabic text only.

Input. A single scanned page image from the AAOIFI Shari’ah Standards.

Output format. Return the extracted Arabic text as plain UTF-8 text, preserving line breaks and paragraph structure.

Figure 9: Prompt for Arabic OCR text extraction with strict verbatim fidelity and formatting preservation.
## Appendix B Islamic Fatwa Dataset: Sources and Processing
Prompt for Shari’ah Standards Question–Answer Generation

Task. You are an assistant supporting the creation of evaluation data for Islamic finance. Given a verified excerpt from an official Shari’ah standard, your task is to draft candidate Arabic question–answer pairs that reflect the *explicit ruling stated in the text*.

Question generation.

- Formulate a clear, focused question that asks about the ruling, condition, or permissibility described in the excerpt.
- Do not introduce hypothetical scenarios or facts not present in the source text.
- Ensure the question can be answered directly and completely from the provided excerpt.

Answer generation.

- Base the answer strictly on the given excerpt; do not add external knowledge.
- Preserve the original legal meaning, mandatory conditions, and stated exceptions.
- Do not simplify, reinterpret, or generalize the ruling beyond what the text explicitly states.
- Use formal Arabic consistent with fiqh al-muʿāmalāt terminology.

Restrictions.

- Do not issue personal opinions or normative judgments.
- Do not cite sources outside the provided text.
- Do not omit conditions, constraints, or qualifiers that affect the ruling.

Input template. `STANDARD_EXCERPT: {arabic_text}`

Output format.

```
{
  "question": "...",
  "answer": "..."
}
```
Figure 10: Prompt for generating Arabic QA pairs from Shari’ah standard excerpts with strict fidelity to explicit rulings.

Figure 11: Custom annotation interface used to validate automatically generated multiple-choice questions (MCQs) for the Islamic Finance Fatwa Q&A dataset. The interface displays each original question–answer pair on the left and the corresponding AI-generated MCQ on the right, including the question, answer options, and the automatically selected correct choice. Annotators review conceptual alignment between the MCQ and the original fatwā, verify the correctness and terminology of the marked answer, and assess the plausibility and pedagogical value of distractors. The bottom panel provides structured evaluation criteria and issue tagging to ensure consistent, high-quality validation.

| Website | Link | Country |
| --- | --- | --- |
| Dar Al Ifta in Saudi Arabia | [https://www.alifta.gov.sa/](https://www.alifta.gov.sa/) | Saudi Arabia |
| Dar Al Ifta in Egypt | [https://www.dar-alifta.org](https://www.dar-alifta.org/) | Egypt |
| Dar Al Ifta in Jordan | [https://aliftaa.jo](https://aliftaa.jo/) | Jordan |
| Al Shaikh Abdual Aziz Ibn Baz | [https://binbaz.org.sa](https://binbaz.org.sa/) | Saudi Arabia |
| Al Shaikh Mohammad Ibn Othaimin | [https://binothaimeen.net/site](https://binothaimeen.net/site) | Saudi Arabia |
| Al Shaikh Abdual Aziz Al Ashaikh | [https://www.mufti.af.org.sa](https://www.mufti.af.org.sa/) | Saudi Arabia |
| Al Shaikh Saleh Al Fwzan | [https://www.alfawzan.af.org.sa](https://www.alfawzan.af.org.sa/) | Saudi Arabia |
| Al Shaikh Saleh Bin Humaid | [https://www.ibnhomaid.af.org.sa/](https://www.ibnhomaid.af.org.sa/) | Saudi Arabia |
| Al Shaikh Abdullah Al Manee | [https://al-manee.com](https://al-manee.com/) | Saudi Arabia |
| IslamWeb | [https://www.islamweb.com](https://www.islamweb.com/) | Qatar |
| FatwaPedia | [https://fatwapedia.com](https://fatwapedia.com/) | Saudi Arabia |
| IslamQA | [https://islamqa.info](https://islamqa.info/) | Syria |
| IslamOnline | [https://islamonline.net](https://islamonline.net/) | Qatar |

Table 5: Primary online fatwā archives used for collecting Islamic financial question–answer pairs. These official and widely recognized sites span seven Arab countries, providing diverse juristic opinions and real-world financial scenarios. The URLs shown correspond to the original Arabic portals from which data was programmatically scraped and later cleaned for inclusion in the dataset.

The purpose of this evaluation is to determine whether an AI-generated multiple-choice question (MCQ) accurately tests the same Islamic jurisprudence concept as the original fatwā (فتوى) Q&A pair. The goal is to maintain both pedagogical soundness and factual correctness. A well-formed MCQ must remain conceptually aligned with the original ruling (الحكم الشرعي), preserve the main juristic concept (مفهوم فقهي) without distortion, and use appropriate juristic terminology (مصطلحات فقهية) to reflect the opinion of the original scholar (المُفتي). Evaluators must ensure that the question targets the central legal issue and does not introduce unrelated details or alter the scenario in a way that changes the ruling. This evaluation is conducted through a structured annotation dashboard (Figure [11](https://arxiv.org/html/2604.19098#A2.F11)) that presents the original fatwā alongside the generated MCQ for systematic validation. The fatwā Q&A pairs used in this evaluation are collected from a diverse set of authoritative online sources (Table [5](https://arxiv.org/html/2604.19098#A2.T5)).
Prompt for Fatwā Q&A Normalization

Task. You are an expert Arabic copy-editor specializing in Islamic finance Q&A. Given a QUESTION and an ORIGINAL ANSWER, your goal is to produce a concise, self-contained question and answer pair in Arabic by removing only non-essential elements *without paraphrasing or changing the juristic intent*. Do *not* summarize or rephrase; keep the original wording as much as possible.

1. Referral flag. Before editing, set `IS_MAINLY_REFERRAL`:

- `"YES"` if the answer mainly redirects to another fatwā, link, or reference and does not provide a substantive independent ruling.
- `"NO"` otherwise.

2. Clean the question. Edit minimally while preserving wording and fiqh intent:

- Remove greetings, honorifics, and personal appeals (e.g., سماحة الشيخ، سلمه الله، السلام عليكم).
- Remove formal closings (e.g., أرجو منكم التكرم، وجزاكم الله خيراً).
- Remove the scholar’s name if it is only a form of address; keep it only if the question explicitly seeks that scholar’s specific fatwā or opinion.
- Ensure the final question reads as a natural, standalone query.

3. Clean the answer. Edit minimally while preserving wording and reasoning:

- Remove formal openings and closings so the answer starts with substantive content.
- Remove all fatwā numbers, hyperlinks, and navigational phrases, editing surrounding text just enough to remain grammatical.
- Convert Arabic-Indic numerals to Western numerals.
- Remove purely formulaic closings such as وفقكم الله and والله أعلم when they are not part of practical advice.
- Always preserve Qurʾānic verses and sūrah references, ḥadīth attributions, and citations of scholars and their opinions.

Global rule. Always delete *all* fatwā numbers from the cleaned question and cleaned answer.

Input template.

```
TITLE: {title}
QUESTION: {question}
ORIGINAL ANSWER: {answer}
```

Output format.

```
{
  "IS_MAINLY_REFERRAL": "YES" or "NO",
  "cleaned_question": "...",
  "cleaned_answer": "..."
}
```
Figure 12: Prompt for Arabic fatwā Q&A normalization with minimal editing and preservation of juristic intent.

For an MCQ to be marked as ملائم (RELEVANT), it must meet four main criteria. First, conceptual alignment (المواءمة المفاهيمية): the question should test the same core ruling as the source fatwā and stay faithful to its reasoning and conditions. Second, correct answer accuracy (دقة الإجابة الصحيحة): the indicated correct option must exactly match the original answer, remain free of contradictions, and use precise Islamic legal terms. Third, distractor quality (جودة الخيارات الخاطئة): incorrect options should be plausible but clearly wrong according to the fatwā, reflecting common misunderstandings rather than random or nonsensical answers. Finally, question clarity (وضوح السؤال): the MCQ must be clearly phrased, grammatically correct in Arabic (العربية), and provide enough context to be answerable without referencing the original text.
Conversely, an MCQ should be marked as غير ملائم (NOT RELEVANT) if it fails any major requirement. Conceptual misalignment occurs when the question tests a different topic, oversimplifies a complex juristic issue, or changes critical context such as conditions (شروط) or scenarios. Incorrect answer issues include a keyed option that contradicts the fatwā, multiple potentially correct answers, or misleading explanations. Poor distractor quality arises when wrong options are obviously incorrect, factually wrong about Islam (الإسلام), or too ambiguous. Technical problems include grammar errors that affect meaning, vague or incomplete questions, or improper mixing of different schools of jurisprudence (مذاهب) in a way that confuses the intended ruling.

The evaluation process follows a clear four-step workflow. First, read the original Q&A carefully and identify the primary ruling (حكم), any conditions (شروط) or exceptions, and the supporting evidence such as Qur’anic verses or ḥadīth (حديث). Second, analyze the generated MCQ to check conceptual consistency, faithfulness of the correct answer, and plausibility of distractors. Third, look for red flags such as contradictions, oversimplification, missing qualifiers, or scenario changes. Finally, make a decision: label the MCQ as ملائم if it meets all core criteria (minor language or formatting issues may be tolerated) or as غير ملائم if any critical issue is present. This structured approach keeps evaluation consistent and transparent and preserves the integrity of Islamic legal reasoning in AI-generated questions.
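The final decision step of this workflow amounts to a conjunction over the four criteria. The sketch below illustrates that rule only; the class and field names are hypothetical, and the actual judgments remain with human annotators.

```python
from dataclasses import dataclass, field

RELEVANT = "ملائم"
NOT_RELEVANT = "غير ملائم"


@dataclass
class McqReview:
    """One annotator's verdict on a generated MCQ (hypothetical structure)."""
    conceptual_alignment: bool
    correct_answer_accuracy: bool
    distractor_quality: bool
    question_clarity: bool
    # Minor language or formatting notes; tolerated, so they never change the label.
    minor_notes: list = field(default_factory=list)

    def label(self) -> str:
        core = (
            self.conceptual_alignment,
            self.correct_answer_accuracy,
            self.distractor_quality,
            self.question_clarity,
        )
        return RELEVANT if all(core) else NOT_RELEVANT


ok = McqReview(True, True, True, True, minor_notes=["extra space before option ب"])
bad = McqReview(True, False, True, True)
print(ok.label(), "/", bad.label())
```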
##### Normalization Prompt\.
To standardize fatwā question–answer pairs, we design a constrained prompt \(Figure[12](https://arxiv.org/html/2604.19098#A2.F12)\) that removes non\-essential elements such as greetings and formatting artifacts while preserving the original wording and legal intent\. This ensures consistency across examples without altering the underlying حكم شرعي\.
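A post-processing step for the model output of this prompt might look like the sketch below. It assumes the model returns the JSON object specified in Figure 12; discarding referral-only answers and rejecting records that still contain hyperlinks are assumptions about downstream handling, not steps stated in the paper.

```python
import json
import re

# Arabic-Indic digits U+0660..U+0669 mapped to Western digits 0..9.
ARABIC_INDIC_TO_WESTERN = {0x0660 + d: str(d) for d in range(10)}
URL_PATTERN = re.compile(r"https?://\S+")


def postprocess(raw_output: str):
    """Parse one normalization result; return None for referral-only answers."""
    record = json.loads(raw_output)
    flag = record.get("IS_MAINLY_REFERRAL")
    if flag not in {"YES", "NO"}:
        raise ValueError(f"unexpected referral flag: {flag!r}")
    if flag == "YES":
        return None  # assumption: referral-only answers are filtered out
    cleaned = {}
    for key in ("cleaned_question", "cleaned_answer"):
        # Enforce the numeral rule even if the model missed a digit.
        text = record[key].translate(ARABIC_INDIC_TO_WESTERN).strip()
        if URL_PATTERN.search(text):
            raise ValueError(f"{key} still contains a hyperlink")
        cleaned[key] = text
    return cleaned


raw = json.dumps({
    "IS_MAINLY_REFERRAL": "NO",
    "cleaned_question": "ما حكم زكاة الأسهم؟",
    "cleaned_answer": "صدر القرار عام ٢٠٢٣",
}, ensure_ascii=False)
result = postprocess(raw)
print(result["cleaned_answer"])
```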
## Appendix C Evaluated Models
This appendix briefly documents the rationale behind the selection of models evaluated in Table[2](https://arxiv.org/html/2604.19098#S3.T2), with model specifications summarized separately in Table[6](https://arxiv.org/html/2604.19098#A3.T6)\. The goal is not comparative analysis, but transparency regarding model coverage across language focus, scale, and accessibility\.
| Model | Organization | Size | Source / Notes |
| --- | --- | --- | --- |
| Arabic-Focused Models | | | |
| ALLAM-7B-Instruct | SDAIA / ALLaM-AI | 7B | Bari et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib9)) |
| Fanar-1-9B-Instruct | QCRI | 9B | Abbas et al. ([2025](https://arxiv.org/html/2604.19098#bib.bib1)) |
| SILMA-9B-Instruct | SILMA AI | 9B | silma-ai ([2024](https://arxiv.org/html/2604.19098#bib.bib33)) |
| Strong Multilingual / General Open-Source Models | | | |
| Qwen2.5-72B-Instruct | Alibaba | 72B | Yang et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib38)) |
| LLaMA-3.1-70B-Instruct | Meta | 70B | Grattafiori et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib14)) |
| Qwen2.5-14B-Instruct | Alibaba | 14B | Yang et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib38)) |
| Qwen2.5-7B-Instruct | Alibaba | 7B | Yang et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib38)) |
| Gemma-2-9B-IT | Google | 9B | Riviere et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib30)) |
| Gemma-3-27B-IT | Google | 27B | Kamath et al. ([2025](https://arxiv.org/html/2604.19098#bib.bib21)) |
| Gemma-3-4B-IT | Google | 4B | Kamath et al. ([2025](https://arxiv.org/html/2604.19098#bib.bib21)) |
| LLaMA-3.1-8B-Instruct | Meta | 8B | Grattafiori et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib14)) |
| Mixtral-8x7B-Instruct | Mistral AI | 8×7B | Jiang et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib20)) |
| Proprietary Models (Upper-Bound References) | | | |
| GPT-5 | OpenAI | – | OpenAI ([2025](https://arxiv.org/html/2604.19098#bib.bib26)) (API) |
| GPT-4o | OpenAI | – | Hurst et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib17)) (API) |
| GPT-4o-mini | OpenAI | – | Hurst et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib17)) (API) |
| Claude Opus 4.5 | Anthropic | – | Anthropic ([2025b](https://arxiv.org/html/2604.19098#bib.bib6)) (API) |
| Claude Sonnet 4.5 | Anthropic | – | Anthropic ([2025c](https://arxiv.org/html/2604.19098#bib.bib7)) (API) |
| Claude Haiku 4.5 | Anthropic | – | Anthropic ([2025a](https://arxiv.org/html/2604.19098#bib.bib5)) (API) |
| Gemini-3-Flash (preview) | Google DeepMind | – | Google ([2024](https://arxiv.org/html/2604.19098#bib.bib12)) (API) |

Table 6: Models evaluated in this study, grouped into Arabic-focused models, strong multilingual open-source baselines, and proprietary frontier models used as upper-bound references.

##### Arabic Focused Models:
ALLAM-7B-Instruct Bari et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib9)), Fanar-1-9B-Instruct Abbas et al. ([2025](https://arxiv.org/html/2604.19098#bib.bib1)), and SILMA-9B-Instruct silma-ai ([2024](https://arxiv.org/html/2604.19098#bib.bib33)) were selected to represent publicly available instruction-tuned models explicitly adapted for Arabic. These systems reflect different training strategies and base model lineages, providing coverage of current Arabic-centric development efforts.
##### Open\-Source Multilingual Models\.
Qwen2.5 models Yang et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib38)), LLaMA-3.1 models Grattafiori et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib14)), Gemma-2 and Gemma-3 models Riviere et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib30)); Kamath et al. ([2025](https://arxiv.org/html/2604.19098#bib.bib21)), and Mixtral-8x7B-Instruct Jiang et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib20)) were included as strong open-weight baselines spanning a wide range of parameter scales. These models are widely used, well-documented, and provide reference points for general-purpose multilingual performance on Arabic financial and jurisprudential tasks.
##### Proprietary Models\.
GPT-5 OpenAI ([2025](https://arxiv.org/html/2604.19098#bib.bib26)), GPT-4o Hurst et al. ([2024](https://arxiv.org/html/2604.19098#bib.bib17)), Claude-4.5 variants Anthropic ([2025b](https://arxiv.org/html/2604.19098#bib.bib6), [c](https://arxiv.org/html/2604.19098#bib.bib7), [a](https://arxiv.org/html/2604.19098#bib.bib5)), and Gemini-3-Flash Google ([2024](https://arxiv.org/html/2604.19098#bib.bib12)) were evaluated as closed-source upper-bound references. Their inclusion enables contextualization of open and Arabic-focused models against contemporary frontier systems, without implying direct comparability or deployment parity.
## Appendix D Business and Accounting Exam Extraction Prompts
Business and accounting exams in Arabic exhibit heterogeneous layouts, ranging from narrative exercise\-based formats to tabular true/false questions\. To reliably extract structured MCQs from these sources, we use two task\-specific prompts tailored to the dominant document formats observed in the collected exams: an exercise\-based extraction prompt \(Figure[13](https://arxiv.org/html/2604.19098#A4.F13)\) and a table\-oriented extraction prompt \(Figure[14](https://arxiv.org/html/2604.19098#A4.F14)\)\.
Prompt for Extracting Arabic Accounting Exam MCQs (Exercise-Based Format)

Task. You are an expert system for extracting Arabic accounting exam questions. The input is a scanned exam page containing exercises that begin with the keyword تمرين (Exercise), followed by a number.

Extraction instructions.

- Identify all exercises that begin with تمرين followed by a numeral (e.g., تمرين ١, تمرين ٢).
- For each exercise, extract the full contextual text that follows the exercise header.
- Within each exercise, identify all multiple-choice questions (numbered 1, 2, 3, etc.).
- For each MCQ, combine the exercise context with the specific question text to form a complete question.
- Extract all answer choices labeled as أ, ب, ج, and د.
- Identify the correct answer by detecting underlined text in the choices; underlining indicates the correct option.

Critical rules.

- Preserve the original Arabic text exactly; do not paraphrase or normalize.
- Extract *all* MCQs appearing under each exercise.
- If no underlined choice is visible, set the correct answer to `null`.

Output format. Return a JSON object with the following structure:

```
{
  "exercises": [
    {
      "exercise_number": "...",
      "exercise_context": "...",
      "questions": [
        {
          "question_number": "...",
          "full_question_text": "...",
          "choices": {
            "أ": "...",
            "ب": "...",
            "ج": "...",
            "د": "..."
          },
          "correct_answer": "...",
          "is_underlined": true
        }
      ]
    }
  ],
  "page_info": {
    "total_exercises": ...,
    "total_questions": ...,
    "language": "Arabic",
    "subject": "Accounting"
  }
}
```
Figure 13: Prompt for extracting MCQs from Arabic accounting exams with exercise-based layouts.

Prompt for Extracting Arabic Business Exam MCQs (Tabular Format)

Task. You are an expert system for extracting Arabic business and accounting exam questions from scanned images containing tabular layouts.

Document characteristics.

- Each row corresponds to one question.
- Questions are numbered using Arabic numerals (e.g., ١٢٤, ١٢٥).
- Answer choices typically include صح (True) and خطأ (False), with optional additional choices.
- The correct answer is highlighted with a yellow background.

Extraction instructions.

- Identify all question rows in the table.
- Extract the question number and full Arabic question text for each row.
- Extract all visible answer choices.
- Identify the correct answer by detecting yellow highlighting.
- Label questions as `true_false` or `multiple_choice` accordingly.

Critical rules.

- Preserve Arabic text exactly as written, including diacritics.
- If yellow highlighting is ambiguous or not visible, set `has_yellow_highlight` to `false`.
- Extract *all* visible questions on the page.

Output format. Return a JSON object with the following structure:

```
{
  "questions": [
    {
      "question_number": "...",
      "question_text": "...",
      "question_type": "...",
      "choices": {
        "a": "...",
        "b": "...",
        "c": "..."
      },
      "correct_answer": "...",
      "correct_choice_text": "...",
      "has_yellow_highlight": true,
      "subject_area": "business"
    }
  ],
  "page_info": {
    "total_questions": ...,
    "format": "table_with_yellow_highlighting",
    "language": "Arabic",
    "question_type": "mixed"
  }
}
```

Figure 14: Prompt for extracting MCQs from Arabic business and accounting exams with tabular layouts.
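Extraction output in the tabular schema of Figure 14 can be sanity-checked before it enters the dataset. The sketch below is one possible set of consistency checks, not a step described in the paper. It assumes that `correct_answer` holds a key of `choices` (the prompt does not say whether it holds a key or the choice text) and that rows without detected highlighting are routed to manual review.

```python
VALID_TYPES = {"true_false", "multiple_choice"}


def check_page(page: dict) -> list:
    """Return human-readable problems found in one extracted page."""
    problems = []
    questions = page.get("questions", [])
    declared = page.get("page_info", {}).get("total_questions")
    if declared != len(questions):
        problems.append(
            f"page_info.total_questions={declared}, found {len(questions)}"
        )
    for q in questions:
        num = q.get("question_number", "?")
        choices = q.get("choices", {})
        if q.get("question_type") not in VALID_TYPES:
            problems.append(f"question {num}: unknown question_type")
        if not q.get("has_yellow_highlight"):
            # No reliable gold label without highlighting.
            problems.append(f"question {num}: no highlight, needs manual review")
            continue
        key = q.get("correct_answer")
        if key not in choices:
            problems.append(f"question {num}: correct_answer is not a choice key")
        elif choices[key] != q.get("correct_choice_text"):
            problems.append(f"question {num}: correct_choice_text mismatch")
    return problems


page = {
    "questions": [
        {"question_number": "124", "question_text": "...", "question_type": "true_false",
         "choices": {"a": "صح", "b": "خطأ"}, "correct_answer": "a",
         "correct_choice_text": "صح", "has_yellow_highlight": True},
        {"question_number": "125", "question_text": "...", "question_type": "true_false",
         "choices": {"a": "صح", "b": "خطأ"}, "correct_answer": None,
         "correct_choice_text": None, "has_yellow_highlight": False},
    ],
    "page_info": {"total_questions": 2},
}
print(check_page(page))
```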
## Appendix E Financial Sentiment Annotation Guidelines
Figure 15: Custom annotation platform used to label Arabic financial reports for sentiment analysis. Annotators reviewed full reports, assigned sentiment classes, and flagged ambiguous cases for expert adjudication.

We annotate Arabic financial reports using a document-level sentiment scheme designed to reflect overall market impact rather than sentence-level polarity. Annotation follows a structured human-in-the-loop workflow supported by a custom web-based interface (Figure [15](https://arxiv.org/html/2604.19098#A5.F15)), with clear decision rules (Figure [16](https://arxiv.org/html/2604.19098#A5.F16)) to ensure consistency across Islamic and conventional financial reporting.
Document-Level Financial Sentiment Annotation Guidelines

Core principle. Assign a single sentiment label based on the *overall dominant sentiment of the entire report*, not on individual sentences or isolated phrases.

Annotation procedure.

- Read the complete document before assigning any label.
- Identify the main financial outcome, thesis, and conclusion.
- Give greater weight to headlines, executive summaries, and concluding sections than to supporting details.

Handling mixed sentiment.

- Dominant sentiment rule: assign Positive or Negative if one polarity accounts for more than 60% of the salient content.
- Neutral default: assign Neutral when positive and negative signals are balanced or when the report is primarily factual.

Decision criteria.

- Positive: growth announcements, profit increases, successful expansions, or favorable forecasts.
- Negative: losses, declining performance, regulatory issues, or adverse outlooks.
- Neutral: factual reporting, balanced analysis, or informational updates without a clear directional impact.

Quality control. Annotators label reports independently, resolve disagreements during a calibration phase, and refine shared decision criteria. A third domain expert adjudicates remaining conflicts. Each report receives exactly one final sentiment label.

Figure 16: Guidelines for document-level sentiment annotation of Arabic financial reports.
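The dominant-sentiment rule in these guidelines can be written as a small decision function. The guidelines do not say how "salient content" is measured, so the sketch below assumes an annotator has tallied salient passages by polarity; that tallying scheme and the function name are hypothetical.

```python
def document_sentiment(positive: int, negative: int, neutral: int = 0,
                       threshold_pct: int = 60) -> str:
    """Apply the 60% dominant-sentiment rule to counts of salient passages."""
    total = positive + negative + neutral
    if total == 0:
        return "Neutral"  # nothing salient: treat the report as factual
    # Integer arithmetic keeps the "more than 60%" comparison exact.
    if positive * 100 > threshold_pct * total:
        return "Positive"
    if negative * 100 > threshold_pct * total:
        return "Negative"
    return "Neutral"


print(document_sentiment(7, 2, 1))  # 70% positive
print(document_sentiment(6, 4))     # exactly 60% is not "more than 60%"
print(document_sentiment(1, 7, 2))  # 70% negative
```

A share of exactly 60% falls back to Neutral because the rule requires strictly more than 60%.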
## Appendix F Arabic Finance Extractive Summarization Annotation Guidelines
We annotate Arabic financial reports for extractive summarization using a structured human\-in\-the\-loop workflow supported by a custom web\-based interface \(Figure[17](https://arxiv.org/html/2604.19098#A6.F17)\) and guided by explicit annotation criteria \(Figure[18](https://arxiv.org/html/2604.19098#A6.F18)\)\. Arabic financial reports also exhibit recurring linguistic and formatting challenges, including specialized terminology, code\-switching, and mixed numeral systems \(Table[8](https://arxiv.org/html/2604.19098#A6.T8)\)\. Annotators are instructed to select sentences that preserve key financial facts, numerical values, and regulatory references without introducing paraphrasing or abstraction\.
| Category | Total Count |
| --- | --- |
| Zakat (زكاة) | 4,888 |
| Riba (ربا) | 2,454 |
| Murabaha (مرابحة) | 1,389 |
| Gharar (غرر) | 860 |
| Waqf (وقف) | 730 |
| Ijara (إجارة) | 571 |
| Maysir (ميسر) | 372 |
| Musharaka (مشاركة) | 242 |
| Mudharaba (مضاربة) | 228 |
| Takaful (تكافل) | 187 |
| Sukuk (صكوك) | 32 |
| Total records | 11,953 |

Table 7: Distribution of questions across Islamic finance categories in the final dataset.

| Issue | Example from the report |
| --- | --- |
| Islamic Finance Terminology | “تعزيز التمويل المستدام وتطوير الصكوك والسندات” |
| Code-switching | “تستهدف تعزيز التمويل المستدام وتطوير الصكوك والسندات، وزيادة شفافية القطاع …” Fitch |
| Mixed Numeral Systems | “انخفض حجم الدين بنحو ٢٧ ريال قطري (٧٫٤ مليار دولار) في عام ٢٠٢٣”; combines Arabic currency and Western digits |

Table 8: Key text difficulties in Arabic financial reports with real examples.

Figure 17: Custom web-based annotation interface for extractive summarization. Annotators view Arabic financial reports, select key sentences containing figures, decisions, and disclosures, and mark them for gold-standard summaries.

Extractive Summarization Annotation Guidelines

Task overview. Annotators create extractive summaries by selecting the most important sentences *verbatim* from each Arabic financial document. The goal is to produce a concise summary that preserves critical financial information and reflects the document’s main message.

Document-level assessment.

- Read the entire document before selecting any sentences.
- Identify the document type (e.g., earnings report, regulatory announcement, market analysis, company news).
- Segment the text into sentences using Arabic punctuation marks (، ؛ .).
- Target a summary length of approximately 30–40% of the original document.

Critical content to prioritize. Annotators must include sentences containing:

- Financial figures: الأرباح / الخسائر (profits/losses), الإيرادات (revenues), النسب المئوية (percentages).
- Performance indicators: نمو / انخفاض (growth/decline), ارتفاع / هبوط (increase/decrease).
- Strategic decisions: الاستحواذ (acquisition), الاندماج (merger), التوسع (expansion).
- Regulatory or official actions: قرارات الهيئة (authority decisions), الموافقات (approvals), التراخيص (licenses).

Sentence scoring. Each sentence is scored on a 1–5 scale:

- 5: Critical financial data or main announcement.
- 4: Important context or cause–effect explanation.
- 3: Supporting detail required for clarity.
- 2: General market or background information.
- 1: Redundant or generic statements.

Selection procedure.

- Select all sentences scored 5.
- Add sentences scored 4 until the target length is reached.
- Include a sentence scored 3 only if necessary for coherence.

Final validation. Before submission, annotators verify that the summary:

- Includes all key financial figures and the main announcement.
- Is coherent and understandable on its own.
- Falls within the 30–40% length target.
- Avoids repetition and generic background content.

Common errors to avoid.

- Selecting sentences based solely on position in the document.
- Omitting numerical or regulatory information.
- Including repetitive or stylistic filler content.
- Exceeding the target summary length without justification.

Figure 18: Guidelines for extractive summarization annotation of Arabic financial reports.
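The selection procedure in Figure 18 can be approximated in code once sentences carry 1–5 scores. This is a sketch under stated assumptions: length is measured in characters, sentences scored 4 are considered in document order, and the coherence-driven inclusion of sentences scored 3 is left to the annotator and not automated. The function name is hypothetical.

```python
def select_summary(sentences, scores, low_pct=30, high_pct=40):
    """Pick sentences for an extractive summary, preserving document order."""
    total = sum(len(s) for s in sentences)
    # Step 1: every sentence scored 5 is always included.
    chosen = {i for i, score in enumerate(scores) if score == 5}
    used = sum(len(sentences[i]) for i in chosen)
    # Step 2: add sentences scored 4 until the lower length target is reached,
    # skipping any that would push the summary past the upper target.
    for i, score in enumerate(scores):
        if used * 100 >= low_pct * total:
            break
        if score == 4 and (used + len(sentences[i])) * 100 <= high_pct * total:
            chosen.add(i)
            used += len(sentences[i])
    return [sentences[i] for i in sorted(chosen)]


# Ten toy "sentences" of ten characters each (100 characters in total).
toy_sentences = [letter * 10 for letter in "abcdefghij"]
toy_scores = [5, 4, 2, 4, 1, 4, 3, 2, 1, 5]
summary = select_summary(toy_sentences, toy_scores)
print(len(summary), "sentences,", sum(map(len, summary)), "characters")
```

If the sentences scored 5 already exceed the upper target, this sketch still keeps all of them, matching the instruction to select every sentence scored 5.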
## Appendix G Event–Cause Reasoning Annotation Guidelines
We construct event–cause reasoning instances for Arabic financial reports using a structured annotation framework with explicit quality control procedures \(Figure[19](https://arxiv.org/html/2604.19098#A7.F19)\)\.
Event–Cause Reasoning QA Annotation and Quality Control

Task objective. Annotators construct one event–cause reasoning instance per Arabic financial report. Each instance consists of (i) an analytical question that requires causal or interpretive reasoning and (ii) a concise expert-written answer grounded exclusively in the information provided in the report. The task evaluates whether models can explain *why* financial or regulatory events occurred and *what their implications are*, rather than recalling isolated facts.

Question construction. The question must:

- Be analytical in nature (e.g., “why did this occur?” or “what does this indicate?”).
- Connect multiple data points from the report (e.g., financial figures, growth rates, market reactions).
- Avoid purely descriptive prompts (e.g., “what was the profit?”).
- Be answerable using only information stated or implied in the report.

Answer construction. The answer must:

- Be written in Arabic and remain concise.
- Rely exclusively on the content of the report, without external knowledge or speculation.
- Explicitly reference numerical figures and percentages when available.
- Provide economic or financial interpretation (e.g., performance drivers, risk implications, or market significance).
- Preserve technical and domain-specific terminology.

Focus areas. Annotators prioritize questions involving:

- Market trend analysis and its implications.
- Performance comparison between companies or sectors.
- Economic significance of observed data patterns.
- Risk assessment based on reported financial indicators.

Quality control procedure. We enforce quality control through a multi-stage human validation workflow:

1. Pilot annotation. Two native Arabic financial experts independently annotate a pilot subset of 20 reports, each producing an event–cause question and an analytical answer.
2. Agreement assessment. We evaluate agreement at two complementary levels:
   - *Event–cause identification*: measured using Cohen’s $\kappa$, assessing consistency in identifying salient events and their causes.
   - *Answer consistency*: measured using ROUGE overlap between independently written answers, used as a consistency check rather than a correctness metric.
3. Calibration. Annotators review disagreements from the pilot phase, discuss ambiguous cases (e.g., implicit causality, multi-factor events, overlapping economic drivers), and refine shared annotation criteria. This calibration aligns interpretation standards and reduces annotation drift.
4. Full annotation. After calibration, one expert annotates the remaining reports under the agreed guidelines.
5. Audit and correction. A senior annotator audits a random sample of completed annotations to verify that each instance:
   - Identifies a plausible event and its cause(s) supported by the report.
   - Includes relevant numerical evidence when available.
   - Provides an analytical explanation rather than a descriptive summary.

   Annotations that fail these checks are revised or discarded.

Final dataset format. Each finalized instance consists of a financial report, one analytical event–cause question, and one expert-written answer. This format supports evaluation using both exact-match and partial-match metrics and enables controlled benchmarking of causal reasoning in Arabic financial text.

Figure 19: Guidelines and quality control workflow for event–cause reasoning annotation in Arabic financial reports.
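For reference, Cohen's $\kappa$ for two annotators can be computed as below. The guidelines do not state how event–cause identifications are encoded as categories, so the sketch assumes each annotator assigns one categorical label per report; the toy labels are invented for illustration and are not data from the benchmark.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators' labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical event categories chosen by two annotators for four reports.
annotator_1 = ["rate_hike", "rate_hike", "rate_hike", "merger"]
annotator_2 = ["rate_hike", "rate_hike", "merger", "merger"]
print(cohen_kappa(annotator_1, annotator_2))
```

Here observed agreement is 3/4 and chance agreement is 1/2, so the function returns 0.5.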
## Appendix H MCQ Answer Normalization and Scoring
To ensure fair and reproducible evaluation of multiple\-choice questions, we normalize model outputs before computing accuracy\. Large language models frequently generate free\-form responses \(e\.g\., explanations, mixed scripts, or multiple answer mentions\) rather than a single option label\.
##### Normalization procedure\.
For each model output, we apply the following steps:
- Normalize Unicode and Arabic script by removing diacritics, collapsing repeated whitespace and punctuation, and mapping Eastern Arabic digits (e.g., ١٢٣٤) to Western digits (1234).
- Extract the first explicit answer mention using a cascade of regular expressions that handle:
  - Latin option labels (e.g., A, B, “Option C”),
  - Arabic option letters (e.g., أ, ب, ج),
  - Spelled-out Arabic forms (e.g., باء),
  - Numeric indices (e.g., 1–4).
##### Scoring\.
We compute accuracy as an exact match between the normalized prediction $\hat{y}$ and the gold label $y$. For example, the output “الإجابة هي 2 بسبب صياغة الحكم” is normalized to B, while “الخيار (ج) هو الصحيح” is normalized to C. Outputs that do not contain a valid option after normalization are marked incorrect.
This procedure ensures that evaluation is robust to superficial variation in formatting, language mixing, and numeral systems, and that all models are assessed under a consistent and deterministic scoring protocol\.
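The normalization and scoring steps above can be sketched as follows. The regular expressions, the four-option label set A–D, and the reading of "cascade" as "try each pattern family in the listed order and return the first that matches" are assumptions made for illustration; the released evaluation code may differ.

```python
import re
import unicodedata

DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanween, harakat, shadda, sukun, dagger alef
EASTERN_TO_WESTERN = {0x0660 + d: str(d) for d in range(10)}
AR = r"\u0621-\u064A"  # Arabic letters, used to require standalone tokens

LETTER_TO_LABEL = {"أ": "A", "ب": "B", "ج": "C", "د": "D"}
SPELLED_TO_LABEL = {"ألف": "A", "باء": "B", "جيم": "C", "دال": "D"}
DIGIT_TO_LABEL = {"1": "A", "2": "B", "3": "C", "4": "D"}

# Pattern families in the order the text lists them.
CASCADE = [
    (re.compile(r"(?<![A-Za-z])([A-D])(?![A-Za-z])"), lambda s: s),
    (re.compile(rf"(?<![{AR}])([أبجد])(?![{AR}])"), LETTER_TO_LABEL.get),
    (re.compile(rf"(?<![{AR}])(ألف|باء|جيم|دال)(?![{AR}])"), SPELLED_TO_LABEL.get),
    (re.compile(r"(?<!\d)([1-4])(?!\d)"), DIGIT_TO_LABEL.get),
]


def normalize_text(text: str) -> str:
    """Unicode-normalize, strip diacritics, map digits, collapse repeats."""
    text = unicodedata.normalize("NFKC", text)
    text = DIACRITICS.sub("", text)
    text = text.translate(EASTERN_TO_WESTERN)
    text = re.sub(r"([^\w\s])\1+", r"\1", text)  # repeated punctuation
    return re.sub(r"\s+", " ", text).strip()     # repeated whitespace


def extract_label(output: str):
    """Return the option label (A-D) found in a model output, or None."""
    text = normalize_text(output)
    for pattern, to_label in CASCADE:
        match = pattern.search(text)
        if match:
            return to_label(match.group(1))
    return None


def accuracy(outputs, gold) -> float:
    """Exact-match accuracy; outputs with no valid option count as wrong."""
    return sum(extract_label(o) == g for o, g in zip(outputs, gold)) / len(gold)


print(extract_label("الإجابة هي 2 بسبب صياغة الحكم"))
print(extract_label("الخيار (ج) هو الصحيح"))
```

The two calls reproduce the examples given in the Scoring paragraph (B and C). A known weakness of this sketch is that an English article "A" at the start of a free-form answer would be read as option A.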
## Appendix I Instruction Templates for SAHM Tasks
To enable a unified instruction\-tuning and evaluation setup across heterogeneous tasks, we convert each SAHM task into a standardized instruction format\. Table[9](https://arxiv.org/html/2604.19098#A9.T9)lists the canonical task instructions used in our benchmark, shown in their original Arabic formulation alongside an English translation for clarity\. The Arabic prompts constitute the actual inputs used during model evaluation, while the English versions are provided solely to document task intent and facilitate reproducibility\.
| Dataset | Original Arabic Prompt | English Translated Prompt |
| --- | --- | --- |
| Islamic Sharia Standards QA | بناءً على معايير وأحكام التمويل الإسلامي والمعاملات المالية الشرعية، أجب على السؤال التالي بدقة. Text: السؤال: {Question}. الإجابة وفقاً للضوابط الشرعية: {Output} | Based on Islamic finance standards and Shari’ah-compliant rulings, answer the following question accurately. Text: Question: {Question}. Answer (Shari’ah-compliant): {Output} |
| Islamic Fatwa QA | بناءً على أحكام الشريعة الإسلامية والفقه الإسلامي، أجب على السؤال التالي بطريقة مفصلة ومدعمة بالأدلة عند الإمكان. Text: السؤال: {Question}. Answer: {Output} | Based on Islamic jurisprudence (fiqh) and Shari’ah rulings, answer the following question in a detailed manner, supported by evidence when possible. Text: Question: {Question}. Answer: {Output} |
| Islamic Financial Fatwa MCQ | اقرأ السؤال التالي بعناية واختر الإجابة الصحيحة وفقاً لأحكام الشريعة. Text: السؤال: {Question}. الخيارات: {Choices}. Answer: أخرج حرف الخيار الصحيح فقط. | Read the following question carefully and choose the correct answer according to Shari’ah rulings. Text: Question: {Question}. Choices: {Choices}. Answer: Output only the correct option letter. |
| Accounting Exams MCQ | اقرأ السؤال التالي بعناية واختر الإجابة الصحيحة. Text: السؤال: {Question}. الخيارات: {Choices}. Answer: أخرج حرف الخيار الصحيح فقط. | Read the following question carefully and choose the correct answer. Text: Question: {Question}. Choices: {Choices}. Answer: Output only the correct option letter. |
| Business Exams MCQ | اقرأ السؤال التالي بعناية واختر الإجابة الصحيحة. Text: السؤال: {Question}. الخيارات: {Choices}. Answer: أخرج حرف الخيار الصحيح فقط. | Read the following business/management question carefully and choose the correct answer. Text: Question: {Question}. Choices: {Choices}. Answer: Output only the correct option letter. |
| Financial Report Sentiment Analysis MCQ | اقرأ بعناية التقرير المالي التالي واختر التصنيف الصحيح من منظور المستثمر. Text: التقرير: {Input}. Answer: (إيجابي / سلبي / محايد). | Read the following financial report carefully and choose the correct label from an investor’s perspective. Text: Report: {Input}. Answer: (Positive / Negative / Neutral). |
| Report Extractive Summarization | قم بتلخيص التقرير المالي التالي باستخدام التلخيص الاستخراجي (Extractive Summarization). اختر الجمل الأكثر أهمية مباشرة من النص الأصلي دون تعديل أو إعادة صياغة، ورتّبها بنفس تسلسلها. اجعل الملخص حوالي 30–40% من حجم النص، وركّز على الأرقام والقرارات والنتائج والتواريخ. Text: التقرير: {Input}. Answer: أخرج الملخص فقط دون أي شرح. | Summarize the following financial report using extractive summarization (select sentences verbatim, keep original order, target 30–40% length, focus on numbers/decisions/outcomes/dates). Text: Report: {Input}. Answer: Output the extractive summary only (no extra text). |
| Event–Cause Reasoning QA | بناءً على التقرير المالي التالي، أجب على السؤال التحليلي بشكل مفصل ودقيق مع الالتزام بالمعلومات الواردة في النص فقط. Text: التقرير المالي: {Input}. السؤال: {Question}. Answer: {Output} | Based on the following financial report, answer the analytical question in a detailed and accurate way, grounded only in the provided text. Text: Financial report: {Input}. Question: {Question}. Answer: {Output} |

Table 9: Instruction templates used for SAHM tasks (Arabic prompts are used in evaluation; English translations document task intent).
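As an illustration of how such a template might be instantiated, the sketch below fills the Islamic Financial Fatwa MCQ prompt from Table 9. The line breaks inside the template, the `A) ...` choice formatting, the function name `render_mcq_prompt`, and the sample question are assumptions made for this example; the benchmark's own rendering code may differ.

```python
from typing import List

# Arabic wording taken from Table 9; the line breaks are an assumed layout.
MCQ_TEMPLATE = (
    "اقرأ السؤال التالي بعناية واختر الإجابة الصحيحة وفقاً لأحكام الشريعة.\n"
    "Text: السؤال: {Question}. الخيارات: {Choices}.\n"
    "Answer: أخرج حرف الخيار الصحيح فقط."
)


def render_mcq_prompt(question: str, choices: List[str]) -> str:
    """Fill the {Question} and {Choices} slots; labels A-D are a hypothetical convention."""
    labels = "ABCD"
    if len(choices) > len(labels):
        raise ValueError("this sketch supports at most four choices")
    formatted = " ".join(f"{labels[i]}) {choice}" for i, choice in enumerate(choices))
    return MCQ_TEMPLATE.format(Question=question, Choices=formatted)


# Hypothetical question and choices, for illustration only.
prompt = render_mcq_prompt("ما حكم الربا؟", ["جائز", "محرم"])
print(prompt)
```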
## Appendix J LLM-as-a-Judge Protocol, Validation, and Reproducibility
### J.1 Judge Protocol and Reproducibility
We evaluate the three open-ended tasks (Fatwa QA, Shari’ah Standards QA, and Event–Cause QA) using an LLM-as-a-judge setup with Gemini-2.5-Flash. For each instance, the judge receives: (i) the exact Arabic prompt shown to the model (including any report/excerpt and question), (ii) the gold reference answer, and (iii) the model’s candidate answer. The judge is blind to model identity and always observes the inputs in fixed, explicitly labeled fields (`prompt`, `ground_truth`, `candidate_answer`) to avoid ordering or positional ambiguity. The judge returns a structured JSON object containing: (a) rubric sub-scores whose sum defines an overall score in $[0,10]$, (b) task-specific critical error flags (e.g., contradiction with the reference, omission of critical constraints, normalization of unlawful elements, or fabrication/alteration of figures), and (c) a brief explanatory note. Task-specific evaluation rubrics are defined for fatwa QA (Figure [20](https://arxiv.org/html/2604.19098#A10.F20)), Islamic finance QA (Figure [21](https://arxiv.org/html/2604.19098#A10.F21)), and financial analysis tasks (Figure [22](https://arxiv.org/html/2604.19098#A10.F22)).
We enforce a strict JSON schema during parsing. If a response is invalid JSON or violates the schema, we retry once with the same inputs and an explicit *JSON-only* instruction; persistent failures are marked invalid and excluded from aggregate scores (we report the invalid-rate). We run the judge deterministically (temperature $=0.0$, greedy decoding, max output tokens $=4096$), and therefore do not perform repeated judging or score averaging. Full judge prompts, rubrics, and task-specific schemas are provided in the following subsections.
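A minimal sketch of the parse-validate-retry loop described above, using the fatwa QA field names from Figure 20. The helper names `parse_judgment` and `judge_with_retry`, the `json_only` flag, and the `call_judge` callable are hypothetical stand-ins for whatever client wraps the judge model; recomputing `overall` as the sum of sub-scores is one reading of "sub-scores whose sum defines an overall score", not a confirmed detail of the pipeline.

```python
import json
from typing import Callable, Optional

# Maximum value per rubric sub-score (fatwa QA schema, Figure 20).
SCORE_MAX = {
    "coverage_core_ruling": 4.0,
    "conditions_exceptions": 2.0,
    "factual_doctrinal_accuracy": 2.0,
    "clarity_language": 1.0,
    "directness_format": 1.0,
}
CHECK_KEYS = {
    "contradicts_ground_truth", "omits_critical_conditions",
    "introduces_unlawful_elements", "hallucinated_citations",
    "non_answer_or_evasive", "off_topic_or_unsafe",
}


def parse_judgment(raw: str) -> Optional[dict]:
    """Return the validated judgment, or None if it is not schema-conformant JSON."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    scores, checks = obj.get("scores"), obj.get("critical_checks")
    if not isinstance(scores, dict) or not isinstance(checks, dict):
        return None
    if set(scores) != set(SCORE_MAX) or set(checks) != CHECK_KEYS:
        return None
    for key, upper in SCORE_MAX.items():
        value = scores[key]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not 0 <= value <= upper:  # also rejects NaN
            return None
    if not all(isinstance(v, bool) for v in checks.values()):
        return None
    if not isinstance(obj.get("note"), str):
        return None
    obj["overall"] = float(sum(scores.values()))  # assumption: overall = sum of sub-scores
    return obj


def judge_with_retry(call_judge: Callable[..., str], prompt: str,
                     ground_truth: str, candidate_answer: str) -> Optional[dict]:
    """One normal attempt, then one retry with a JSON-only instruction; None = invalid."""
    for json_only in (False, True):
        raw = call_judge(prompt=prompt, ground_truth=ground_truth,
                         candidate_answer=candidate_answer, json_only=json_only)
        parsed = parse_judgment(raw)
        if parsed is not None:
            return parsed
    return None
```

Instances for which `judge_with_retry` returns `None` would be the ones counted toward the invalid-rate and excluded from aggregates.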
### J.2 Human Alignment Study (Judge Validation)
To validate the LLM judge against expert evaluation, we conduct a human alignment study on 200 randomly sampled open-ended outputs spanning Fatwa QA, Shari’ah Standards QA, and Event–Cause QA. Two expert Arabic annotators independently score each model response using the same $[0,10]$ additive rubric provided to the judge (Section [J](https://arxiv.org/html/2604.19098#A10)). We compare the judge’s scores (from Gemini-2.5-Flash) to the mean of the two human scores, obtaining an MSE of 0.41 and a Pearson correlation of $r=0.92$. Inter-annotator agreement is high ($\kappa=0.84$, computed on discretized integer scores). These results indicate that the LLM-as-a-judge scores closely track expert human judgments under our rubric.
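The three agreement statistics named above can be computed as in the following sketch. It uses only the standard library; the function names and the toy scores are hypothetical, and unweighted Cohen's kappa is an assumption, since the text does not say whether a weighted variant was used.

```python
from collections import Counter
from math import sqrt
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally long score lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cohen_kappa(r1: Sequence[int], r2: Sequence[int]) -> float:
    """Unweighted Cohen's kappa on discretized (integer) scores."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    if expected == 1.0:  # both raters used a single identical category
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy scores for illustration only (not data from the study).
human_a = [8, 6, 9, 3, 7]
human_b = [8, 7, 9, 3, 6]
judge = [8.5, 6.0, 9.0, 2.5, 7.0]
human_mean = [(a + b) / 2 for a, b in zip(human_a, human_b)]

print("MSE(judge, human mean):", mse(judge, human_mean))
print("Pearson r:", pearson_r(judge, human_mean))
print("Inter-annotator kappa:", cohen_kappa(human_a, human_b))
```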
### J.3 Cross-Judge Validation of Open-Ended Evaluations
To address concerns about potential model-family bias in our LLM-as-a-judge evaluation, we re-ran all open-ended tasks with two independent judges: Gemini-2.5-Flash (our primary judge) and GPT-4o. Each model was evaluated 3 times under greedy decoding, and we report mean ± std across runs for both judges. If our primary judge were systematically favoring Gemini-family models, switching to GPT-4o should lower Gemini scores; instead, they *rise* under GPT-4o (Gemini-3-Flash: Islamic-Std 9.18 → 9.76, Fatwa 9.17 → 9.32), i.e., the opposite of what circular bias would predict. Model rankings are preserved across judges: top-tier models (Gemini-3-Flash, Claude-Opus-4.5) and bottom-tier models (SILMA-9B, LLaMA-3.1-8B) remain in the same groupings regardless of judge. Standard deviations across the 3 runs are small for most entries (±0.02–0.10), supporting reproducibility under greedy decoding. Full per-model, per-judge scores appear in Table [10](https://arxiv.org/html/2604.19098#A10.T10).
| Model | Event-Cause QA (Gemini judge) | Event-Cause QA (GPT-4o judge) | Islamic-Std QA (Gemini judge) | Islamic-Std QA (GPT-4o judge) | Fatwa QA (Gemini judge) | Fatwa QA (GPT-4o judge) |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini-3-Flash | 9.84 ± 0.02 | 9.97 ± 0.02 | 9.18 ± 0.03 | 9.76 ± 0.01 | 9.17 ± 0.02 | 9.32 ± 0.02 |
| Claude-Opus-4.5 | 9.67 ± 0.03 | 9.32 ± 0.97 | 8.06 ± 0.05 | 9.53 ± 0.02 | 8.79 ± 0.03 | 9.18 ± 0.02 |
| Claude-Sonnet-4.5 | 9.32 ± 0.04 | 9.68 ± 0.04 | 8.24 ± 0.05 | 9.28 ± 0.02 | 7.58 ± 0.03 | 8.86 ± 0.02 |
| GPT-4o | 8.30 ± 0.06 | 9.50 ± 0.09 | 6.64 ± 0.08 | 8.53 ± 0.03 | 6.50 ± 0.04 | 8.04 ± 0.01 |
| Gemma-3-27B | 8.70 ± 0.05 | 9.74 ± 0.02 | 6.15 ± 0.08 | 8.41 ± 0.03 | 5.18 ± 0.05 | 7.35 ± 0.02 |
| Qwen2.5-72B | 8.08 ± 0.10 | 8.45 ± 0.17 | 5.61 ± 0.10 | 6.96 ± 0.07 | 5.37 ± 0.06 | 6.30 ± 0.01 |
| Fanar-1-9B | 7.57 ± 0.10 | 8.03 ± 0.22 | 4.94 ± 0.08 | 6.03 ± 0.03 | 4.44 ± 0.06 | 5.15 ± 0.04 |
| Gemma-3-4B | 7.39 ± 0.08 | 9.09 ± 0.08 | 2.88 ± 0.08 | 5.32 ± 0.07 | 2.46 ± 0.06 | 4.39 ± 0.02 |
| ALLAM-7B | 6.87 ± 0.10 | 7.79 ± 0.16 | 4.92 ± 0.08 | 5.95 ± 0.03 | 4.20 ± 0.05 | 4.49 ± 0.03 |
| LLaMA-3.1-70B | 6.60 ± 0.15 | 7.15 ± 0.30 | 3.70 ± 0.10 | 4.67 ± 0.05 | 4.74 ± 0.08 | 2.24 ± 0.02 |
| Mixtral-8x7B | 4.53 ± 0.08 | 5.14 ± 0.09 | 2.48 ± 0.08 | 3.43 ± 0.08 | 1.78 ± 0.06 | 2.80 ± 0.04 |
| SILMA-9B | 1.88 ± 0.20 | 1.43 ± 0.37 | 3.33 ± 0.12 | 2.33 ± 0.03 | 2.05 ± 0.08 | 1.61 ± 0.05 |
| LLaMA-3.1-8B | 4.90 ± 0.18 | 2.49 ± 0.17 | 2.50 ± 0.12 | 1.85 ± 0.03 | 1.38 ± 0.08 | 0.71 ± 0.02 |
| Sahm-ALLAM-7B | 6.50 ± 0.10 | 7.02 ± 0.12 | 6.30 ± 0.02 | 6.59 ± 0.04 | 4.24 ± 0.04 | 4.51 ± 0.03 |

Table 10: Cross-judge validation on open-ended tasks. All evaluations run 3 times under greedy decoding with two independent judges (Gemini-2.5-Flash and GPT-4o). Rankings are preserved across judges; Gemini-family scores rise under GPT-4o, disconfirming circular bias.
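One way to quantify how well rankings are preserved across judges is a rank correlation over the per-model means. The sketch below computes Spearman's rho for the Fatwa QA columns of Table 10. Spearman's rho is an illustrative choice here, not a statistic reported in the paper, and the closed-form formula used assumes there are no tied scores.

```python
from typing import List, Sequence

# Fatwa QA means from Table 10, in table row order (Gemini-3-Flash ... Sahm-ALLAM-7B).
FATWA_GEMINI_JUDGE = [9.17, 8.79, 7.58, 6.50, 5.18, 5.37, 4.44,
                      2.46, 4.20, 4.74, 1.78, 2.05, 1.38, 4.24]
FATWA_GPT4O_JUDGE = [9.32, 9.18, 8.86, 8.04, 7.35, 6.30, 5.15,
                     4.39, 4.49, 2.24, 2.80, 1.61, 0.71, 4.51]


def ranks(values: Sequence[float]) -> List[int]:
    """Rank positions (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2) / (n(n^2-1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


print("Spearman rho (Fatwa QA, Gemini vs. GPT-4o judge):",
      round(spearman_rho(FATWA_GEMINI_JUDGE, FATWA_GPT4O_JUDGE), 3))
```

A value near 1 would indicate that the two judges order the models almost identically, even where absolute scores differ.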
### J.4 Frontier Model Error Analysis
| Model | Accounting (%) | Business (%) | Fatwā MCQ (%) | Sentiment (%) |
| --- | --- | --- | --- | --- |
| *Proprietary* | | | | |
| Claude-Opus-4.5 | 78.04 ± 2.42 | 76.14 ± 1.14 | 91.57 ± 0.33 | 61.25 ± 2.50 |
| Claude-Sonnet-4.5 | 77.25 ± 1.20 | 77.05 ± 1.45 | 88.83 ± 0.38 | 66.25 ± 1.25 |
| Gemini-3-Flash | 74.65 ± 1.95 | 75.41 ± 0.95 | 90.07 ± 0.30 | 70.00 ± 1.25 |
| GPT-5 | 63.67 ± 2.27 | 72.31 ± 1.26 | 91.15 ± 0.45 | 62.50 ± 1.25 |
| GPT-4o | 59.28 ± 2.07 | 78.32 ± 0.32 | 87.50 ± 0.10 | 61.25 ± 0.00 |
| Gemini-2.5-Flash | 55.49 ± 2.13 | 75.05 ± 0.83 | 86.02 ± 1.22 | 58.33 ± 4.39 |
| *Open-source ≥70B* | | | | |
| Qwen2.5-72B | 63.08 ± 2.70 | 75.23 ± 0.32 | 83.63 ± 0.33 | 64.00 ± 1.25 |
| LLaMA-3.1-70B | 49.11 ± 2.79 | 75.58 ± 1.14 | 82.90 ± 0.15 | 51.25 ± 3.31 |
| *Open-source <70B* | | | | |
| Gemma-3-27B | 53.29 ± 2.16 | 74.13 ± 0.32 | 80.67 ± 0.18 | 64.17 ± 0.72 |
| Gemma-2-9B | 46.71 ± 2.74 | 65.31 ± 3.83 | 70.43 ± 0.61 | 54.17 ± 1.44 |
| Qwen2.5-14B | 48.49 ± 3.93 | 64.66 ± 0.83 | 75.18 ± 0.85 | 60.83 ± 3.82 |
| Qwen2.5-7B | 46.11 ± 2.85 | 63.02 ± 1.14 | 69.70 ± 0.28 | 54.17 ± 1.91 |
| Gemma-3-4B | 38.12 ± 2.27 | 67.58 ± 0.32 | 61.30 ± 0.18 | 62.08 ± 1.44 |
| Mixtral-8x7B | 31.74 ± 1.04 | 59.38 ± 0.63 | 62.32 ± 0.34 | 58.33 ± 0.72 |
| LLaMA-3.1-8B | 38.93 ± 3.28 | 58.64 ± 4.45 | 60.35 ± 3.62 | 52.08 ± 5.77 |
| *Arabic Models* | | | | |
| Fanar-1-9B | 43.51 ± 2.42 | 70.13 ± 1.67 | 74.60 ± 0.35 | 60.42 ± 2.60 |
| SILMA-9B | 49.32 ± 21.73 | 60.11 ± 6.61 | 53.85 ± 5.57 | 25.75 ± 3.75 |
| ALLAM-7B | 42.24 ± 3.55 | 64.75 ± 3.83 | 72.25 ± 2.83 | 56.50 ± 2.00 |

Table 11: MCQ evaluation across 3 runs using each model’s recommended temperature. Values shown as mean ± std. Rankings remain consistent across runs, confirming robustness of our main findings to decoding configuration.

To diagnose why frontier models fail on specific Sahm tasks, we conducted a systematic error analysis of GPT-5 and Gemini-3-Flash across Accounting, Business, and Summarization. Two native Arabic-speaking annotators with financial backgrounds jointly reviewed every error, examined the model’s chain-of-thought against the gold reference, diagnosed the root cause, and agreed on a category before recording it (full agreement after joint adjudication). We use a shared taxonomy across both models: *Misunderstanding Concept* (correct setup, wrong principle applied), *Concept Confusion* (conflates two related but distinct concepts), *Hallucination* (generates facts not in the question), *Question Misread* (answers a different question), *Calculation Mistake* (arithmetic error), and *Domain Knowledge Gap* (lacks the terminology entirely).
LLM-as-a-Judge Rubric for Fatwa QA (Arabic)

Role. You are an expert evaluator in Islamic fatwa (iftāʾ).

Inputs (provided each time).
- `category` (optional context) – may be empty (e.g., riba, zakat, takaful)
- `prompt` – the full Arabic prompt shown to the model (instructions + question)
- `ground_truth` – the reference fatwa answer (Arabic)
- `candidate_answer` – the model answer to evaluate (Arabic)

Task. Judge how well `candidate_answer` matches `ground_truth` in *ruling (ḥukm)*, *justification*, and *operative constraints/qualifications*. Prioritize doctrinal correctness and required conditions/exceptions. Do not penalize stylistic paraphrase if the core ruling and constraints are preserved. Be concise and deterministic.

Scoring (sum to exactly 10).
1. Coverage of core ruling (0–4). The candidate must clearly state the same central hukm (e.g., permissibility/prohibition, validity/invalidity) *and* include the key justification present in the ground truth. One-word/minimal answers without essential justification should receive a much lower score (e.g., 0–1).
2. Conditions, exceptions, constraints (0–2). Does it retain critical restrictions, qualifiers, or carve-outs that materially affect the ruling?
3. Doctrinal/factual accuracy (0–2). No misstatements that would change the fatwa; no implicit legalization of prohibited elements (e.g., ribā); no misleading generalizations or invented requirements.
4. Clarity & Arabic language quality (0–1). Clear Arabic, understandable structure, minimal ambiguity appropriate for a fatwa answer.
5. Directness & fatwa format (0–1). Directly answers the question; avoids long digressions; phrasing suitable for a fatwa.

Critical checks (true/false).
- `contradicts_ground_truth`: Does the candidate contradict the central ruling?
- `omits_critical_conditions`: Does it omit key conditions/exceptions that change the ruling?
- `introduces_unlawful_elements`: Does it introduce/normalize prohibited elements (e.g., ribā)?
- `hallucinated_citations`: Misleading/fabricated sources claimed that distort the ruling?
- `non_answer_or_evasive`: Does it avoid giving a clear ruling?
- `off_topic_or_unsafe`: Off-topic or otherwise inappropriate?

Output format (strict). Output *only* valid JSON (no prose, no code fences), following this schema:
```
{
"scores": {
"coverage_core_ruling": <float 0-4>,
"conditions_exceptions": <float 0-2>,
"factual_doctrinal_accuracy": <float 0-2>,
"clarity_language": <float 0-1>,
"directness_format": <float 0-1>
},
"overall": <float 0-10>,
"critical_checks": {
"contradicts_ground_truth": <true/false>,
"omits_critical_conditions": <true/false>,
"introduces_unlawful_elements": <true/false>,
"hallucinated_citations": <true/false>,
"non_answer_or_evasive": <true/false>,
"off_topic_or_unsafe": <true/false>
},
"note": "<short NOTE in {NOTE_LANG}>"
}
```
Figure 20: Evaluation rubric used for LLM-based judgment of fatwa QA responses.

#### J.4.1 GPT-5 on Accounting
Seventy percent of GPT-5 accounting errors stem from domain reasoning failures: Misunderstanding Concept (39%) and Concept Confusion (31%), while Calculation Mistakes account for only 20%. This confirms our Section [5.3](https://arxiv.org/html/2604.19098#S5.SS3) finding that models rarely fail at arithmetic but frequently fail at knowing which arithmetic to perform. Concretely, GPT-5 (i) reaches correct intermediate calculations but selects the wrong accounting standard at the decision point, (ii) confuses closely related financial concepts (e.g., treating a direct relationship as inverse), and (iii) in the most striking cases (15% of errors), arrives at the correct answer through valid reasoning and then fabricates a non-existent rule to justify switching to a wrong option. These patterns indicate that the model’s weaknesses lie primarily in conceptual grounding rather than computational ability. In particular, errors often arise at the stage of mapping problem statements to the appropriate accounting principle or standard, rather than during numerical execution. This suggests that improving domain-specific reasoning and conceptual alignment is more critical than enhancing raw calculation capabilities for such tasks.
LLM-as-a-Judge Rubric for Islamic Finance QA (Arabic)

Role. You are an expert evaluator in Islamic finance (*Fiqh al-mu'āmalāt*).

Inputs (provided each time).
- `topic` (optional context) – may be empty
- `question` – in Arabic
- `ground_truth` – the reference correct answer (Arabic)
- `candidate_answer` – the model answer to evaluate (Arabic)

Task. Judge how well `candidate_answer` matches `ground_truth` in *meaning*, *ruling*, *justification*, and *constraints*. Prioritize doctrinal correctness and completeness of key conditions/exceptions. Do not penalize stylistic paraphrase if the core ruling and constraints are preserved. Be concise and deterministic.

Scoring (sum to exactly 10).
1. Coverage of core ruling (0–4).
2. Conditions, exceptions, constraints (0–2).
3. Doctrinal/factual accuracy (0–2).
4. Clarity & Arabic language quality (0–1).
5. Directness & on-topic (0–1).

Critical checks (true/false).
- `contradicts_ground_truth`
- `omits_critical_conditions`
- `introduces_unlawful_elements`
- `hallucinated_citations`
- `non_answer_or_evasive`
- `off_topic_or_unsafe`

Output format (strict). Output *only* valid JSON (no prose, no code fences), following this schema:
```
{
"scores": {
"coverage_core_ruling": <float 0-4>,
"conditions_exceptions": <float 0-2>,
"factual_doctrinal_accuracy": <float 0-2>,
"clarity_language": <float 0-1>,
"directness_format": <float 0-1>
},
"overall": <float 0-10>,
"critical_checks": {
"contradicts_ground_truth": <true/false>,
"omits_critical_conditions": <true/false>,
"introduces_unlawful_elements": <true/false>,
"hallucinated_citations": <true/false>,
"non_answer_or_evasive": <true/false>,
"off_topic_or_unsafe": <true/false>
},
"note": "<short NOTE in {NOTE_LANG}>"
}
```
Figure 21: Evaluation rubric used for LLM-based judgment of Islamic finance QA responses.
#### J.4.2 GPT-5 on Extractive Summarization
GPT\-5 understands report content but fails at task execution, selecting background sentences over key financial figures \(42\.5% of errors\), copying entire reports instead of respecting the 30–40% compression target \(16\.3%\), and introducing content from unrelated reports \(11\.3%\)\. This explains why GPT\-5 collapses on extractive summarization \(ROUGE\-L: 33\.37\) despite being our strongest model on open\-ended reasoning: the task rewards verbatim selection discipline, not generative fluency\.
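The execution failures listed above (non-verbatim or out-of-order selection, ignoring the 30–40% target, importing outside content) can be detected mechanically. The sketch below is a hypothetical checker, not part of the released evaluation code: the sentence splitter is deliberately naive, and measuring compression by character length is an assumption, since the prompt only says "about 30–40% of the size of the text".

```python
import re
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter on ., !, ?, and the Arabic question mark followed by whitespace."""
    parts = re.split(r"(?<=[.!?؟])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def check_extractive(report: str, summary: str,
                     low: float = 0.30, high: float = 0.40) -> Dict[str, object]:
    """Check verbatim selection, original ordering, and the compression target."""
    source = split_sentences(report)
    picked = split_sentences(summary)
    # Position of each summary sentence in the report, or -1 if it is not verbatim.
    positions = [source.index(s) if s in source else -1 for s in picked]
    verbatim = all(p >= 0 for p in positions)
    in_order = verbatim and positions == sorted(positions)
    ratio = len(summary) / len(report) if report else 0.0  # character-level ratio
    return {
        "verbatim": verbatim,
        "in_order": in_order,
        "ratio": ratio,
        "within_target": low <= ratio <= high,
    }


# Hypothetical report and summary, for illustration only.
report = ("Revenue rose 12% in 2024. The company was founded in 1990. "
          "Net profit reached 3.5 billion. The board approved a dividend. "
          "Management thanked employees. Outlook remains stable.")
summary = "Revenue rose 12% in 2024. Net profit reached 3.5 billion."
print(check_extractive(report, summary))
```

Copying the whole report would show up as a ratio of 1.0, and a sentence from an unrelated report would set `verbatim` to `False`.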
#### J.4.3 Gemini-3-Flash on Accounting
Concept Confusion \(27\.8%\) and Misunderstanding Concept \(19\.4%\) together account for 47% of errors, particularly in auditing standards and foreign currency hedging\. Unlike GPT\-5, Gemini\-3\-Flash also exhibits Hallucination \(8\.3%\) and Question Misread \(8\.3%\), while Calculation Mistakes remain rare \(5\.6%\)\.
#### J.4.4 Gemini-3-Flash on Business
Reasoning Error \(39\.5%\) and Concept Confusion \(37\.2%\) dominate at 77% combined, concentrating in Strategic Management, Marketing, and Entrepreneurship, areas requiring culturally grounded Arabic business knowledge\. Domain Knowledge Gap \(9\.3%\) appears as a distinct category: cases where the model lacks specialized Arabic business terminology entirely\.
#### J.4.5 Cross-Model Convergence
The most striking finding is that GPT\-5 and Gemini\-3\-Flash, built by different organizations with different architectures and training data, share the same dominant failure mode: conceptual confusion between related domain principles \(70% for GPT\-5, 47–77% for Gemini\-3\-Flash\), with arithmetic errors rare in both \(20% and 5\.6% respectively\)\. This convergence strongly suggests that Arabic financial reasoning remains a genuine challenge for frontier models, not an artifact of our evaluation design\.
LLM-as-a-Judge Rubric for Financial Analysis & Capital Markets (Arabic)

Role. You are an expert evaluator in financial analysis and capital markets.

Inputs (provided each time).
- `prompt` – the full Arabic prompt (report/excerpt + question) shown to the model
- `ground_truth` – the reference ideal analytical answer (Arabic)
- `candidate_answer` – the model answer to evaluate (Arabic)

Task. Judge how well `candidate_answer` matches `ground_truth` in *conclusions*, *reasoning*, and *use of provided figures*. Prioritize factual/quantitative fidelity, correct interpretation of financial concepts (e.g., spreads, yields, coverage, issuance, capital structure, Basel III, supply/demand dynamics), and avoidance of hallucinated data. Do not penalize stylistic paraphrase if core insights and numeric takeaways align with the reference.

Scoring (sum to exactly 10).
1. Core conclusion alignment (0–4). Does the candidate capture the main thesis and key takeaways of the ground truth (*what/why/so-what*)?
2. Quantitative fidelity & use of figures (0–2). Correctly cites/uses the reported numbers (e.g., percentages, amounts, maturities, oversubscription) without inventing or altering figures. Any simple computations/comparisons must be consistent.
3. Financial reasoning soundness (0–2). Causality and mechanisms are plausible and consistent with standard finance/econ logic (e.g., pricing vs. credit risk, duration/tenor structure, demand/oversubscription signals, capital adequacy).
4. Clarity & Arabic language quality (0–1). Clear Arabic, coherent structure, minimal ambiguity.
5. Directness & on-topic grounding (0–1). Answers what was asked; stays anchored in the provided scenario/data (no generic filler).

Critical checks (true/false).
- `contradicts_ground_truth`: contradicts the central conclusion of the reference
- `fabricates_or_alters_numbers`: introduces numbers not present or materially distorts reported figures
- `hallucinates_context_or_sources`: injects external context/sources not in the prompt that change the assessment
- `flawed_financial_logic`: serious finance/econ reasoning error that would mislead the conclusion
- `non_answer_or_evasive`: avoids providing an analytical answer
- `off_topic_or_unsafe`: off-topic or otherwise inappropriate

Output format (strict). Output *only* valid JSON (no prose, no code fences). Return JSON strictly in this schema (all fields required):
```
{
"scores": {
"coverage_core_conclusion": <float 0-4>,
"quantitative_fidelity": <float 0-2>,
"financial_reasoning": <float 0-2>,
"clarity_language": <float 0-1>,
"directness_grounding": <float 0-1>
},
"overall": <float 0-10>,
"critical_checks": {
"contradicts_ground_truth": <true/false>,
"fabricates_or_alters_numbers": <true/false>,
"hallucinates_context_or_sources": <true/false>,
"flawed_financial_logic": <true/false>,
"non_answer_or_evasive": <true/false>,
"off_topic_or_unsafe": <true/false>
},
"note": "<short NOTE in {NOTE_LANG}>"
}
```
Figure 22: Evaluation rubric used for LLM-based judgment of financial analysis and event–cause reasoning tasks.
## Appendix K Decoding Configuration and Variance Analysis
Our main evaluations in the paper use greedy decoding (temperature 0) for reproducibility. To verify that our conclusions are not artifacts of this choice, particularly given that some models (e.g., GPT-5) do not support temperature 0, and others (e.g., Qwen3) recommend non-zero temperatures for best performance, we re-ran all MCQ evaluations 3 times using each model’s recommended temperature settings. Table [11](https://arxiv.org/html/2604.19098#A10.T11) reports mean ± std across runs. Rankings remain consistent across runs and across temperature settings, confirming that our findings are robust to decoding configuration. The low variance observed across repeated runs further indicates that performance differences are stable rather than driven by sampling noise. Overall, these results suggest that model comparisons in our benchmark are reliable under both deterministic and stochastic decoding regimes. Notably, models that perform strongly under greedy decoding maintain their relative advantage under higher-temperature settings, indicating that gains are not dependent on sampling variability. Similarly, lower-performing models do not benefit significantly from increased randomness, suggesting that errors are rooted in systematic limitations rather than decoding strategy. This consistency reinforces the validity of our evaluation pipeline across diverse inference settings. It also highlights that task difficulty and domain reasoning, rather than decoding configuration, are the primary drivers of performance differences in our benchmark.
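The per-model summary and the ranking-consistency check described above can be sketched as follows. The run-level accuracies are hypothetical placeholders (Table 11 reports only aggregated values), the function names are illustrative, and using the sample standard deviation is an assumption because the paper does not state which estimator it reports.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

# Hypothetical per-run accuracies (%) for three runs; not values from the paper.
RUNS: Dict[str, List[float]] = {
    "model_a": [78.0, 80.5, 75.6],
    "model_b": [63.0, 66.1, 61.9],
    "model_c": [47.2, 50.3, 49.8],
}


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across repeated runs."""
    return mean(scores), stdev(scores)


def ranking_for_run(runs: Dict[str, List[float]], run_index: int) -> List[str]:
    """Models ordered from best to worst within a single run."""
    return sorted(runs, key=lambda model: runs[model][run_index], reverse=True)


def rankings_consistent(runs: Dict[str, List[float]]) -> bool:
    """True if every run yields the same model ordering."""
    n_runs = len(next(iter(runs.values())))
    orderings = [ranking_for_run(runs, i) for i in range(n_runs)]
    return all(order == orderings[0] for order in orderings)


for model, scores in RUNS.items():
    mu, sigma = summarize(scores)
    print(f"{model}: {mu:.2f} ± {sigma:.2f}")
print("Rankings identical across runs:", rankings_consistent(RUNS))
```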
### K.1 Doctrinal Variation in Shari’ah Rulings
Islamic jurisprudence is not monolithic: the four major Sunni *madhahib* (Hanafi, Maliki, Shafi’i, Hanbali), alongside Shia schools, may offer different rulings on the same financial question. This raises a natural concern for benchmark construction: do our reference answers represent a single doctrinal position that could penalize models trained on legitimate alternative rulings?
We analyzed this question during dataset construction and report our findings here\.
| Category | # Samples | Significant Dispute |
| --- | --- | --- |
| Sukuk | 6 | 2 (33%) |
| Takaful | 38 | 10 (26%) |
| Zakat | 792 | 173 (22%) |
| Gharar | 149 | 20 (13%) |
| Ijara | 102 | 12 (12%) |
| Murabaha | 234 | 20 (9%) |
| Maysir | 64 | 6 (9%) |
| Riba | 407 | 32 (8%) |
| Total | | 289 (14.4%) |

Table 12: Distribution of doctrinal variation in Islamic finance categories. Variation concentrates in zakat (calculation methods differ across schools), takaful (a modern instrument with no classical precedent and evolving rulings), and sukuk (small sample, actively debated across regulatory bodies). For all disputed cases, reference answers explicitly present alternative valid positions, and our rubric accepts any legitimate ruling as correct.

##### (1) Most questions test consensus rulings.
Seventy-four percent of samples involve cross-*madhab* agreement on established Islamic finance principles, such as the prohibition of *riba* (usury), contract invalidation due to *gharar* (excessive uncertainty), and the impermissibility of *maysir* (gambling), rather than narrow inter-school disputes.
##### (2) Evaluation targets the *ḥukm*, not the evidence path.
Reference fatwas and model outputs naturally vary in cited Qur’anic verses, *ḥadīth*, *fiqh* sources, and reasoning detail, making exact-match evaluation infeasible. We therefore score at the *ḥukm* (ruling) level: the rubric evaluates the final ruling and its operative constraints. A model citing different but valid evidence while reaching the correct ruling is not penalized.
##### \(3\) Quantified dispute distribution\.
For the 26% of samples with some degree of dispute, Table [12](https://arxiv.org/html/2604.19098#A11.T12) reports how often the reference answer itself flags a significant disagreement across recognized schools. “Significant Dispute” means two or more recognized *madhahib* hold materially different rulings, i.e., the *ḥukm* itself changes (e.g., permissible vs. impermissible), not just the supporting evidence.
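As a quick arithmetic check, the per-category dispute shares in Table 12 can be recomputed from the sample and dispute counts. The sketch below covers only the per-category rows; the variable and function names are illustrative.

```python
from typing import Dict, Tuple

# (number of samples, significant disputes, reported share in %) per category, from Table 12.
TABLE_12: Dict[str, Tuple[int, int, int]] = {
    "Sukuk": (6, 2, 33),
    "Takaful": (38, 10, 26),
    "Zakat": (792, 173, 22),
    "Gharar": (149, 20, 13),
    "Ijara": (102, 12, 12),
    "Murabaha": (234, 20, 9),
    "Maysir": (64, 6, 9),
    "Riba": (407, 32, 8),
}


def dispute_share(samples: int, disputed: int) -> float:
    """Share of samples with a significant dispute, in percent."""
    return 100.0 * disputed / samples


for category, (samples, disputed, reported) in TABLE_12.items():
    share = dispute_share(samples, disputed)
    print(f"{category}: {disputed}/{samples} = {share:.1f}% (reported {reported}%)")
```

Each recomputed share rounds to the percentage reported in the table, for example 173/792 ≈ 21.8% for zakat.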
##### \(4\) Manual error analysis confirms failures are genuine, not doctrinal\.
To verify that low scores do not stem from penalizing valid alternative positions, we analyzed 500 randomly sampled errors (Figure [6](https://arxiv.org/html/2604.19098#S5.F6)). Model failures are unambiguous: wrong rulings (25.2%), fabricated evidence (12.1%), and misquoted *ḥadīth* (2.0%), not cases where the model offered a legitimate but different school’s position.