QU-NLP at QIAS 2026: Multi-Stage QLoRA Fine-Tuning for Arabic Islamic Inheritance Reasoning
Summary
This paper presents Qatar University's multi-stage QLoRA fine-tuning approach on Qwen3-4B for Arabic Islamic inheritance reasoning, achieving 90% MIR-E score through domain adaptation on Islamic fatwa records followed by task-specific training on 12,000 structured inheritance cases, matching commercial systems like Gemini-2.5-flash with minimal computational resources.
View Cached Full Text
Cached at: 04/21/26, 07:02 AM
# QU-NLP at QIAS 2026: Multi-Stage QLoRA Fine-Tuning for Arabic Islamic Inheritance Reasoning
Source: [https://arxiv.org/html/2604.16396](https://arxiv.org/html/2604.16396)
###### Abstract
Islamic inheritance law \( ,’ilm al\-mawārīth\) presents a challenging domain for evaluating large language models’ structured reasoning capabilities, requiring multi\-step legal analysis, rule\-based blocking decisions, and precise fractional calculations\. We present QU\-NLP’s submission to the QIAS 2026 shared task on Arabic Islamic inheritance reasoning\. Our approach employs a multi\-stage Quantized Low\-Rank Adaptation \(QLoRA\) fine\-tuning strategy on Qwen3\-4B: \(1\) domain adaptation on 3,166 Islamic fatwa records to acquire inheritance terminology and jurisprudential reasoning patterns, followed by \(2\) task\-specific training on 12,000 structured inheritance cases to optimize JSON\-formatted output generation\. Using 4\-bit NF4 quantization with rank\-128 LoRA adapters, our model achieves 90% MIR\-E \(Mawarith Inheritance Reasoning Evaluation\) score on the test set, demonstrating competitive performance while requiring minimal computational resources\. Our results show that domain\-specific pre\-adaptation combined with structured output training enables small language models to perform complex legal reasoning tasks effectively, matching commercial systems such as Gemini\-2\.5\-flash\.
\\NAT@set@cites
QU\-NLP at QIAS 2026: Multi\-Stage QLoRA Fine\-Tuning for Arabic Islamic Inheritance Reasoning
Mohammad AL\-SmadiQatar UniversityDoha, Qatarmalsmadi@qu\.edu\.qaAbstract content
## 1\. Introduction
Large language models \(LLMs\) have demonstrated remarkable capabilities across diverse natural language processing tasksOpenAIet al\.\([2024](https://arxiv.org/html/2604.16396#bib.bib5)\)\. However, their ability to perform structured, rule\-based reasoning under strict legal constraints remains insufficiently evaluated\. Islamic inheritance law \( ,’ilm al\-mawārīth\) offers a particularly demanding testbed for evaluating multi\-step legal reasoning capabilitiesBouchekifet al\.\([2026](https://arxiv.org/html/2604.16396#bib.bib1)\)\.
Solving an Islamic inheritance case requires a well\-defined procedural chain: \(1\) identifying eligible heirs from a textual description of family relations, \(2\) applying blocking rules \(,ḥajb\) to determine which relatives are excluded by closer heirs, \(3\) assigning prescribed Qur’anic shares to eligible heirs, \(4\) detecting and applying adjustment mechanisms such as \(’awl, proportional reduction when shares exceed unity\) or \(radd, redistribution of surplus\), and \(5\) computing the final normalized distribution\. Errors at any intermediate stage propagate deterministically and invalidate subsequent calculations, making this domain particularly suitable for evaluating structured reasoning under jurisprudential constraints\.
The QIAS 2026 shared task represents a significant evolution in evaluating models on Islamic inheritance reasoning\. While the 2025 task assessed models through multiple\-choice questionsBouchekifet al\.\([2025a](https://arxiv.org/html/2604.16396#bib.bib2)\), the 2026 task introduces MAWARITH, a large\-scale dataset of 12,500 Arabic inheritance cases with detailed step\-by\-step reasoning annotationsBouchekifet al\.\([2026](https://arxiv.org/html/2604.16396#bib.bib1)\)\. QIAS 2026 requires generating complete structured reasoning traces in JSON format, exposing all intermediate legal decisions across a five\-stage pipeline: heir identification, blocking rule application, share calculation, adjustment detection, and final distribution\. This methodological shift addresses critical limitations of MCQ evaluation where models can succeed through memorization without genuine understanding, and where binary scoring provides no diagnostic insights into specific failure modes\. The structured output requirement enables the multi\-component MIR\-E evaluation metric, which assigns partial credit for correct intermediate steps—for instance, awarding up to 70% to cases with perfect legal reasoning but arithmetic errors\. This evaluation framework enables our fine\-grained error analysis identifying four distinct categories of model failures with targeted improvement strategies, analysis impossible under answer\-selection formats\.
We present QU\-NLP’s approach to this challenge, employing a multi\-stage QLoRA fine\-tuning strategy on Qwen3\-4BQwen Team \([2025](https://arxiv.org/html/2604.16396#bib.bib29)\)\. Our key contributions are:
- ∙\\bulletA two\-stage training methodology combining domain adaptation on Islamic legal texts with task\-specific fine\-tuning on structured inheritance solutions\.
- ∙\\bulletDemonstration that 4\-bit quantized models with LoRA adapters can achieve a MIR\-E score of 90% on complex multi\-step legal reasoning while requiring minimal computational resources, placing it among the top\-performing systems and significantly outperforming larger open\-weight models evaluated in the baseline studyBouchekifet al\.\([2025b](https://arxiv.org/html/2604.16396#bib.bib3)\)\.
- ∙\\bulletComprehensive error analysis identifying four distinct failure modes and their root causes, providing actionable insights for model improvement\.
## 2\. Related Work
### 1\.2\. LLMs for Islamic Knowledge Tasks
Recent work has explored LLMs for Islamic knowledge tasks including Qur’anic question answeringBhatiaet al\.\([2026](https://arxiv.org/html/2604.16396#bib.bib10)\); Malhaset al\.\([2022](https://arxiv.org/html/2604.16396#bib.bib11)\)and hallucination detection in Islamic contentMubaraket al\.\([2025](https://arxiv.org/html/2604.16396#bib.bib12)\)\. These studies reveal that while LLMs perform adequately on retrieval\-based tasks relying on textual matching, they exhibit significant limitations on tasks requiring structured reasoning or deep domain knowledge\. Bouchekif et al\.Bouchekifet al\.\([2025b](https://arxiv.org/html/2604.16396#bib.bib3)\)assess LLMs on Islamic legal reasoning, identifying systematic failures in inheritance case resolution and raising concerns about model reliability in religious and legal applications\.
Retrieval\-Augmented Generation \(RAG\) approaches have been explored to improve answer groundingAL\-Smadi \([2025](https://arxiv.org/html/2604.16396#bib.bib7)\); Alowaidi \([2025](https://arxiv.org/html/2604.16396#bib.bib14)\)\. However, RAG remains insufficient for questions requiring multi\-step inference, motivating the development of reasoning\-oriented models\.
Within Islamic inheritance specifically, prior work has focused on multiple\-choice evaluations\. QIAS 2025Bouchekifet al\.\([2025a](https://arxiv.org/html/2604.16396#bib.bib2)\)introduced a shared task on Islamic inheritance reasoning assessed through MCQs, where models select correct answers without exposing reasoning traces\. Elrefai et al\.Elrefaiet al\.\([2025](https://arxiv.org/html/2604.16396#bib.bib9)\)participated in QIAS 2025 with a fine\-tuned Arabic LLM, but the MCQ format prevented assessment of whether models truly reason correctly or merely pattern\-match\. AL\-SmadiAL\-Smadi \([2025](https://arxiv.org/html/2604.16396#bib.bib7)\)explored a two\-phase fine\-tuning approach combined with Retrieval\-Augmented Generation at QIAS 2025, investigating hybrid retrieval and generation strategies for Islamic inheritance reasoning\. MirathQAAlmasoudet al\.\([2026](https://arxiv.org/html/2604.16396#bib.bib8)\)provides a dataset of Hanbali inheritance cases in MCQ format\.
TheMAWARITHdatasetBouchekifet al\.\([2026](https://arxiv.org/html/2604.16396#bib.bib1)\)addresses this limitation by requiring end\-to\-end reasoning generation with intermediate justifications, enabling fine\-grained error analysis across the inheritance reasoning pipeline\.
### 2\.2\. Legal Reasoning with LLMs
Beyond Islamic domains, legal reasoning benchmarks have emerged to evaluate LLMs on structured argumentation and rule\-based inference\. LegalBenchGuhaet al\.\([2023](https://arxiv.org/html/2604.16396#bib.bib15)\), LexGLUEChalkidiset al\.\([2022](https://arxiv.org/html/2604.16396#bib.bib16)\), and LEXTREMENiklauset al\.\([2023](https://arxiv.org/html/2604.16396#bib.bib17)\)assess models on common\-law legal tasks\. BRIEFMEWooet al\.\([2025](https://arxiv.org/html/2604.16396#bib.bib18)\)evaluates legal argument summarization in the context of assisting with legal briefs\.
Recent models explicitly designed for multi\-step reasoning include GPT\-5Singhet al\.\([2025](https://arxiv.org/html/2604.16396#bib.bib19)\), GeminiComaniciet al\.\([2025](https://arxiv.org/html/2604.16396#bib.bib20)\), DeepSeek\-R1Guoet al\.\([2025](https://arxiv.org/html/2604.16396#bib.bib21)\), and open\-weight alternatives such as Qwen3Qwen Team \([2025](https://arxiv.org/html/2604.16396#bib.bib29)\)and FanarAbbaset al\.\([2025](https://arxiv.org/html/2604.16396#bib.bib22)\)\. These models promote consistent multi\-step inference through instruction tuning and reinforcement learning\.
Parameter\-efficient fine\-tuning methods, particularly QLoRADettmerset al\.\([2023](https://arxiv.org/html/2604.16396#bib.bib23)\), enable adaptation of large models with limited resources\. QLoRA combines 4\-bit NF4 quantization with Low\-Rank AdaptationHuet al\.\([2021](https://arxiv.org/html/2604.16396#bib.bib24)\), achieving competitive fine\-tuning performance while dramatically reducing memory requirements\. This approach has been applied to domain adaptation in specialized Arabic NLP tasksAL\-Smadi \([2025](https://arxiv.org/html/2604.16396#bib.bib7)\)\.
## 3\. Task and Dataset
### 1\.3\. Task Definition
The QIAS 2026 shared task requires models to solve Arabic Islamic inheritance cases by generating structured JSON outputs that expose all intermediate reasoning steps\. Given a natural\-language description of the deceased and surviving relatives, models must execute a complete five\-stage reasoning pipeline without access to gold intermediate steps\.
#### 1\.3\.1\. JSON Output Structure
As explained in theMAWARITHdatasetBouchekifet al\.\([2026](https://arxiv.org/html/2604.16396#bib.bib1)\), the required JSON output contains five mandatory components representing distinct reasoning stages:
1\. Heirs\(,al\-waratha\): A list identifying all eligible inheriting relatives with their counts\. Each heir entry specifies the heir category \(e\.g\., “son”, “daughter”, “mother”\) and the number of individuals in that category\. This stage requires applying Qur’anic eligibility rules based on kinship relationships\.
2\. Blocked\(,al\-maḥjūbūn\): A list of relatives mentioned in the scenario who are present but excluded from inheritance due to blocking rules \(,ḥajb\)\. Islamic inheritance law stipulates that closer relatives block more distant ones in specific patterns—for instance, a son blocks grandsons, and a father blocks uncles\. Correctly identifying blocked heirs demonstrates understanding of these hierarchical rules\.
3\. Shares\(,al\-anṣiba\): Initial prescribed fractional shares assigned to each eligible heir before any global adjustments\. The Qur’an specifies fixed shares for certain heir categories \(e\.g\., wife receives 1/4 if no children, 1/8 if children exist; daughter receives 1/2 if alone, 2/3 if multiple\)\. Residuary heirs \(,’aṣaba\) such as sons and brothers receive the remainder after fixed shares are distributed, designated as \(“remainder of estate”\) rather than numerical fractions\.
4\. ’Awl or Radd\( \): The type of global adjustment mechanism applied when prescribed shares do not sum to exactly the full estate:
- ∙\\bullet\(Radd, redistribution\): Applies when total prescribed shares are less than the full estateandno residuary heir is present\. After distributing all fixed shares, the remaining unassigned portion is redistributed proportionally among eligible fixed\-share heirs according to their original shares\. For example, if a mother receives 1/6 and a daughter receives 1/2, the total is 1/6 \+ 1/2 = 2/3, leaving 1/3 unassigned\. This remainder is redistributed throughradd, increasing each heir’s allocation proportionally to their original prescribed shares\.
- ∙\\bullet\(’Awl, proportional reduction\): Applies when total prescribed shares exceed the full estate\. Since the initially assigned shares cannot all be satisfied in full, all shares are proportionally reduced so their sum equals exactly the estate\. For instance, if prescribed shares sum to 1/2 \+ 1/6 \+ 2/3 = 8/6 \(greater than 1\), each share is scaled down through’awlso that the total distribution fits within the estate\.
- ∙\\bullet\(None\): No adjustment needed when shares sum to exactly the estate or when residuary heirs absorb the remainder naturally\.
5\. Post\-Tasil\( ,ba’da al\-taṣīl\): The final normalized distribution after applying any adjustments\. This component contains:
- ∙\\bullettotal\_shares: The denominator of the fractional distribution after adjustment\.
- ∙\\bulletdistribution: A list specifying each heir’s final allocation as both fractional shares \(e\.g\., “3/12”\) and normalized percentages \(e\.g\., 25\.0%\)\.
This stage requires precise numerical computation to ensure all percentages sum to exactly 100% and correctly reflect the applied adjustments\.
#### 1\.3\.2\. Task Complexity
While Islamic inheritance follows deterministic jurisprudential rules that could be implemented in a symbolic rule engine with perfect accuracy, the QIAS 2026 task evaluates a fundamentally different capability:end\-to\-end neural reasoning from natural language to structured output\. Unlike rule\-based systems that operate on pre\-structured inputs and apply explicit programmed logic, neural models must simultaneously solve multiple interdependent challenges:
- ∙\\bulletNatural language understanding: Models must parse Arabic text with diverse linguistic expressions to extract family relationships\. The same heir category can be expressed through multiple lexical variants \(e\.g\., “mother” vs\. \), counts must be inferred from number\-noun constructions \( “two sons”\), and relationship types must be disambiguated \( can denote full, paternal, or maternal brother\)\. This entity extraction and relationship parsing from unstructured text represents a core NLP challenge absent in symbolic approaches\.
- ∙\\bulletConditional logic through learned patterns: Share assignments depend on presence/absence of other heirs \(e\.g\., wife receives 1/4 if no children exist, 1/8 otherwise\)\. Unlike rule\-based systems where such conditions are explicitly programmed, neural models must learn these conditional dependencies from training examples\. With only finite data and class imbalances \(e\.g\.,raddappears in 2\.8% of cases\), models must generalize learned patterns to unseen combinations of heir configurations\.
- ∙\\bulletHierarchical blocking through pattern recognition: Distant relatives are excluded by closer ones following precedence rules \(e\.g\., sons block grandsons, fathers block uncles\)\. Models must learn these hierarchical relationships from examples rather than executing explicit genealogical graphs, requiring pattern recognition over complex family structures with multiple generations\.
- ∙\\bulletConditional algorithm selection: The model must detect which distribution algorithm to apply based on computed share totals\. Standard cases \(92\.3%\) use residuary distribution when a male agnate heir \(\) exists,’awlcases \(4\.9%\) require proportional reduction when shares exceed unity, andraddcases \(2\.8%\) require surplus redistribution when shares sum to less than unity and no residuary heir is present \(see section[2\.3](https://arxiv.org/html/2604.16396#S3.SS2)\)\. Correct detection requires both arithmetic computation \(checking if shares sum to<<1,==1, or\>\>1\) and logical reasoning \(verifying absence of residuary heirs\)\. The statistical rarity of adjustment cases in training data exacerbates the learning challenge\.
- ∙\\bulletNumerical precision in text generation: Unlike symbolic systems that perform exact fractional arithmetic using rational number representations, neural models must generate fractions and percentages as text strings while maintaining numerical correctness\. All percentages must sum to exactly 100%, fractions must be in lowest terms, and floating\-point approximations are invalid\. This requires learning precise numerical patterns from examples rather than executing deterministic calculations\.
- ∙\\bulletStructured output generation: Models must produce syntactically valid JSON with consistent Arabic terminology, proper nesting, and exact schema compliance\. Errors in JSON syntax \(missing brackets, unclosed quotes\), inconsistent heir naming across sections, or schema violations invalidate the entire output\. This generation constraint requires maintaining structural coherence across potentially long output sequences\.
The task difficulty arises not from the logical complexity of inheritance rules themselves—which are well\-defined and deterministic—but from the requirement to learn and apply these rules through neural pattern matching on natural language inputs while generating structured outputs with exact numerical precision\. Errors at any intermediate stage propagate deterministically and invalidate the final distribution\. A correct solution requires models to simultaneously excel at linguistic understanding, knowledge\-intensive reasoning, numerical computation, and constrained generation without access to gold intermediate representations or symbolic verification mechanisms\. This makesMAWARITHa demanding testbed for evaluating whether neural models can approximate rule\-based reasoning through end\-to\-end learning from examples\.
### 2\.3\. MAWARITH Dataset
TheMAWARITHdatasetBouchekifet al\.\([2026](https://arxiv.org/html/2604.16396#bib.bib1)\)comprises 12,500 Arabic inheritance cases following the majority opinion \(,al\-jumhūr\)\. The corpus is split into 12,000 training instances and 500 test instances, covering 36 distinct heir categories ranging from close relatives \(parents, children, spouses\) to distant extended family across multiple generations\.
Table[1](https://arxiv.org/html/2604.16396#S3.T1)shows the dataset composition\. The majority \(92\.3%\) are simple cases requiring no adjustment, while 4\.9% involve \(’awl, proportional reduction\) and 2\.8% involve \(radd, surplus redistribution\)\.
1 :MAWARITH dataset structure with distribution of inheritance cases by complexity\.2 :Example inheritance case from MAWARITH with Arabic text and full English translation\. The prescribed shares exceed the estate, so the case undergoes \(proportional reduction\)\. The four\(son’s daughter\)share 16/27 collectively, giving each one 4/27\.Table[2](https://arxiv.org/html/2604.16396#S3.T2)shows a simplified example from the training data\. This case demonstrates’awl\(proportional reduction\) where prescribed shares exceed the estate\. The gold standard provides a human\-readable explanation following Islamic legal reasoning traditions, while the structuredJSONoutput enables automated evaluation via the MIR\-E metric, with each field corresponding to a distinct reasoning stage assessed independently\.
### 3\.3\. Domain Adaptation Data
For stage 1 domain adaptation, we use 3,166 Islamic fatwa records from Islamweb111[https://www\.islamweb\.net/](https://www.islamweb.net/)covering inheritance\-related religious rulings\. These fatwas provide natural\-language explanations of inheritance scenarios, introducing models to jurisprudential terminology \(al\-waratha,al\-ḥajb,al\-’aṣaba\) and reasoning patterns used by Islamic legal scholars\.
## 4\. Methodology
### 1\.4\. Model Architecture
We use Qwen3\-4BQwen Team \([2025](https://arxiv.org/html/2604.16396#bib.bib29)\)as our base model\. Qwen3 is a multilingual reasoning model trained on diverse corpora including Arabic text, achieving strong performance on mathematical and logical reasoning benchmarks while maintaining a compact parameter count suitable for resource\-constrained settings\.
### 2\.4\. Multi\-Stage QLoRA Fine\-Tuning
Our training approach consists of two stages\. Table[3](https://arxiv.org/html/2604.16396#S4.T3)explains the training configuration for the two phases of the proposed multi\-stage QLoRA fine\-tuning approach\.
3 :Training configuration for the two phases of the proposed multi\-stage QLoRA fine\-tuning approach\. Phase 1 performs domain adaptation on Islamic fatwa records, while Phase 2 focuses on structured inheritance reasoning and JSON output generation\.#### 2\.4\.1\. Stage 1: Domain Adaptation
In the first stage, we fine\-tune Qwen3\-4B on 3,166 Islamic fatwa records to acquire inheritance\-specific terminology and jurisprudential reasoning patterns\. This stage uses a standard causal language modeling objective where the model learns to generate fatwa explanations given inheritance questions\.
#### 2\.4\.2\. Stage 2: Task\-Specific Training
The second stage continues training on 12,000 structured inheritance cases, teaching the model to produce JSON\-formatted outputs with correct heir identification, blocking decisions, share calculations, and adjustment mechanisms\.
We filter training examples to include only those with valid JSON outputs, yielding 12,000 usable instances\. The system prompt explicitly instructs the model to output JSON only without additional explanation:
> “OutputJSONonly without any additional text\. Do not write explanation, thinking, or symbols outsideJSON\. TheJSONmust contain only the following keys:heirs, blocked, shares, awl\_or\_radd, post\_tasil\.”
### 3\.4\. QLoRA Configuration
Both training stages employ QLoRADettmerset al\.\([2023](https://arxiv.org/html/2604.16396#bib.bib23)\)for parameter\-efficient fine\-tuning\. We adopt 4\-bit NF4 quantization with double quantization to reduce memory usage while maintaining performance\. The LoRA configuration uses a rank ofr=128r=128and a scaling factor ofα=256\\alpha=256\. Adaptation is applied to all projection layers, includingq\_proj,k\_proj,v\_proj,o\_proj,gate\_proj,up\_proj, anddown\_proj\. All computations are performed using the bfloat16 data type to balance numerical stability and efficiency\.
## 5\. Evaluation Metric
### 1\.5\. MIR\-E: Mawarith Inheritance Reasoning Evaluation
The QIAS 2026 shared task proposes MIR\-E \(Mawarith Inheritance Reasoning Evaluation\)Bouchekifet al\.\([2026](https://arxiv.org/html/2604.16396#bib.bib1)\), a weighted multi\-stage evaluation metric designed to assess both intermediate reasoning steps and final outputs in Islamic inheritance problems\. Unlike traditional evaluation methods that focus only on final answers, MIR\-E enables fine\-grained assessment of structured reasoning by decomposing the inheritance process into key stages\.
MIR\-E evaluates model predictions across four components:
1. \(1\)\.Heirs and Blocking \(ShS\_\{h\}\): This component evaluates whether the model correctly identifies the set of effective heirs after applying blocking rules\. It combines an F1 score over the predicted and gold effective heir sets with count accuracy, penalizing missing heirs, spurious heirs, incorrect blocking decisions, and count mismatches\.
2. \(2\)\.Share Assignment \(SsS\_\{s\}\): This component measures the correctness of the assigned shares for all eligible heirs\. Predicted shares are compared against gold values using a small tolerance threshold to account for minor numerical deviations\.
3. \(3\)\.Adjustment \(SaS\_\{a\}\): This component evaluates whether the model correctly identifies the required adjustment type \(none, , or \)\. Since adjustment depends on earlier reasoning stages, it is evaluated conditionally and assigned a non\-zero score only when bothSh=1S\_\{h\}=1andSs=1S\_\{s\}=1\.
4. \(4\)\.Final Allocation \(SfS\_\{f\}\): This component measures the accuracy of the final normalized distribution of shares across heirs after completing the full inheritance computation\. Predictions are evaluated using a tolerance threshold\.
The overall MIR\-E score is computed as a weighted sum of these components:
MIR\-E=αhSh\+αsSs\+αaSa\+αfSf\\text\{MIR\-E\}=\\alpha\_\{h\}S\_\{h\}\+\\alpha\_\{s\}S\_\{s\}\+\\alpha\_\{a\}S\_\{a\}\+\\alpha\_\{f\}S\_\{f\}\)1\(
whereαh=αs=αf=0\.30\\alpha\_\{h\}=\\alpha\_\{s\}=\\alpha\_\{f\}=0\.30andαa=0\.10\\alpha\_\{a\}=0\.10\. Equal weights are assigned to heir identification, share assignment, and final allocation, as these stages capture the core reasoning process, while the adjustment component receives a lower weight due to its conditional nature and lower frequency\.
This evaluation framework enables detailed error analysis by isolating failures at different stages of the inheritance reasoning pipeline, including heir identification, share computation, adjustment detection, and final allocation\.
4 :Component\-wise MIR\-E scores on the 500\-case test set after post\-processing\.ShS\_\{h\}: heirs and blocking,SsS\_\{s\}: share assignment,SaS\_\{a\}: adjustment detection\. QU\-NLP results reflect the Basic post\-processing pipeline\.
## 6\. Experimental Setup
### 1\.6\. Implementation Details
We implement our approach using the Hugging Face Transformers libraryWolfet al\.\([2020](https://arxiv.org/html/2604.16396#bib.bib26)\)and the PEFT library for parameter\-efficient fine\-tuningMangrulkaret al\.\([2022](https://arxiv.org/html/2604.16396#bib.bib27)\)\. Inference is performed using a Qwen3\-4B base model with a LoRA adapter, with automatic device placement\.
Data Preprocessing:Inputs are formatted using the Qwen3 chat template with system and user roles\. A fixed system prompt enforces strict JSON\-only outputs with predefined keys corresponding to the inheritance reasoning schema\.
Generation Parameters:During inference, we use greedy decoding with temperature set to 0\.0, a maximum of 1024 generated tokens, and no sampling\.
Post\-Processing:Model outputs are post\-processed using a multi\-stage pipeline\. First, raw outputs are cleaned by removing<think\>tags and extracting the JSON structure\. Additional normalization includes typo correction, removal of Arabic elongation characters, and structural validation\. We further apply rule\-based corrections, including deduplication of blocked heirs, normalization ofawl\_or\_raddlabels, and recalculation of post\-tas̄īldistributions using exact fraction arithmetic when necessary\. The pipeline implements a neural\-symbolic separation: the neural component handles jurisprudential reasoning while the symbolic component verifies adjustment arithmetic using exact rational arithmetic, as evidenced by the PostTasil variant producing results identical to Basic across all metrics \(Section[2\.7](https://arxiv.org/html/2604.16396#S7.SS2)\)\.
### 2\.6\. Baseline Comparisons
We compare our approach against a set of six large language models evaluated in prior workBouchekifet al\.\([2026](https://arxiv.org/html/2604.16396#bib.bib1)\), spanning Arabic\-specialized systems, open\-weight multilingual models, and commercial reasoning models\. These include the Arabic\-centricFanarmodels, evaluated in both their general\-purpose variantFanar\-C\-2\-27Band Islamic\-specialized variantFanar\-Sadiq, accessed via the Fanar API222[https://api\.fanar\.qa/docs\#description/introduction](https://api.fanar.qa/docs#description/introduction)\. We also include open\-weight multilingual models, namelyLLaMA\-3\.3\-70B333[https://huggingface\.co/meta\-llama/Meta\-Llama\-3\-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B)andGPT\-OSS\-120B444[https://huggingface\.co/openai/gpt\-oss\-120b](https://huggingface.co/openai/gpt-oss-120b), both accessed through the Groq API555[https://console\.groq\.com/](https://console.groq.com/)\. Additionally, we evaluateQwen3\-32B666[https://huggingface\.co/Qwen/Qwen3\-32B](https://huggingface.co/Qwen/Qwen3-32B), a multilingual reasoning model\. Finally, we includeGemini\-2\.5\-flash, a commercial reasoning model\.
All baseline systems are evaluated in a zero\-shot setting using a unified Arabic prompt that specifies the inheritance scenario, required reasoning steps, and the target JSON output schema, ensuring a fair comparison across models without task\-specific fine\-tuning\. In contrast, our model uses a task\-specific inference prompt implemented through the Qwen chat template, consisting of a fixed system instruction that enforces strict JSON\-only outputs with predefined keys, followed by the user query\.
## 7\. Results and Discussion
### 1\.7\. Overall Performance
Table[4](https://arxiv.org/html/2604.16396#S5.T4)presents overall MIR\-E scores and component\-wise performance on the 500\-instance test set\. Our QU\-NLP system achieves 90\.0% MIR\-E, closely matching the commercial Gemini\-2\.5\-flash model \(90\.1%\) and substantially outperforming all open\-weight baselines\. Notably, our 4B parameter model outperforms systems with 8–30 more parameters \(Qwen3\-32B, LLaMA\-3\.3\-70B, GPT\-OSS\-120B\), demonstrating that domain\-specific fine\-tuning on high\-quality structured data is more effective than relying solely on model scale and general reasoning capabilities\.
### 2\.7\. Effect of Post\-Processing
Beyond model training, we apply a lightweight post\-processing pipeline to raw predictions before evaluation\. The pipeline operates in three stages: \(1\) typographic normalisation, correcting Arabic elongation characters \(tatweel\) and common spelling variants such as→\\rightarrow; \(2\) structural deduplication, removing any heir that appears simultaneously in bothheirsandblocked, since Islamic jurisprudence forbids an individual from occupying both roles; and \(3\) label normalisation, replacing unrecognisedawl\_or\_raddstrings with a value inferred from the fraction sum, while leaving valid labels \(, , \) unchanged\. A fourth variant,*PostTasil*, additionally attempts to recalculate the final distribution table when the model’spost\_tasilfractions are identical to the unadjustedshares, indicating the model omitted the / adjustment step\.
Table[5](https://arxiv.org/html/2604.16396#S7.T5)reports the impact of each variant on all 500 evaluated test cases\.
5 :Effect of post\-processing on component scores across 500 test cases\.*Original*: raw model output\.*Basic*: typographic cleaning and structural deduplication\.*PostTasil*: Basic plus recalculation of final distributions for / cases\. Bold denotes the highest score per metric\.Post\-processing yields a net gain of\+0\.2\+0\.2pp on overall MIR\-E \(89\.8%→\\rightarrow90\.0%\), with the improvement concentrated entirely in the Heirs & Blocking component \(\+0\.7\+0\.7pp\)\. This gain is attributable to the deduplication step: across the test set, a subset of predictions placed the same individual in bothheirsandblocked—a structural contradiction that the Basic pipeline resolves by trusting theheirslist\.
The Share Assignment, ’Awl/Radd Detection, and Final Distribution components are unaffected by post\-processing, which is intentional\. Share fractions are taken directly from the model’s output without arithmetic intervention, as any correction would require reconstructing the underlying jurisprudential reasoning\. Theawl\_or\_raddlabel is normalised only for unrecognised strings; valid labels are always preserved\. Overriding valid labels based on fraction sums risks conflating two distinct error types: a model that correctly identifies\(no\)but assigns a wrong fraction to a residuary heir will produce an inflated sum, yet itsawl\_or\_raddlabel is correct\. This motivated the conservative design decision to trust all valid labels unconditionally, applying correction only when the model produces an unrecognised label\.
The PostTasil variant produces results identical to Basic across all metrics, indicating that in cases where the model applies \(’awl\) or \(radd\), itspost\_tasildistribution already differs from the unadjustedshares—the recalculator’s trust condition is not met, so the model’s own output is preserved\. This confirms the neural\-symbolic separation in our pipeline: the symbolic verifier found no cases requiring arithmetic correction, demonstrating that the model’s adjustment arithmetic is correct in every case it correctly classifies the adjustment type\. The 29 calculation errors \(5\.8%\) are therefore confined to the final percentage generation step—cases where the model reasons correctly through all five stages but generates numerically inconsistent text in the output field \(see section[4\.7](https://arxiv.org/html/2604.16396#S7.SS4)\)\.
### 3\.7\. Component\-Wise Analysis
Heir Identification \(Sh=97\.1%S\_\{h\}=97\.1\\%\):Our model achieves the highest heir identification score among all evaluated systems, exceeding Gemini\-2\.5\-flash by 2\.6pp \(97\.1% vs\. 94\.5%\) and surpassing all open\-weight baselines by a wide margin \(58\.4%–69\.3%\)\.
Share Assignment \(Ss=94\.3%S\_\{s\}=94\.3\\%\):Share calculation accuracy exceeds Gemini\-2\.5\-flash by 1\.4pp \(94\.3% vs\. 92\.9%\), and substantially surpasses all open\-weight baselines \(31\.4%–44\.6%\)\. The multi\-stage training strategy allows the model to first acquire share terminology and fractional notation in Stage 1, then practise accurate assignment under varied heir configurations in Stage 2\.
Adjustment Detection \(Sa=84\.6%S\_\{a\}=84\.6\\%\):Detecting the required adjustment type is inherently challenging: \(’awl, proportional reduction\) requires correctly summing all assigned fractions and identifying when they exceed unity, while \(radd, surplus redistribution\) additionally requires confirming that no residuary heir is present to absorb the surplus—a condition the model must infer from the absence of certain heir types rather than their presence\. The rarity of both adjustment types in training, at 4\.9% and at 2\.8% of cases, limits exposure to these scenarios and contributes to systematic underfitting\. Consequently, our model scores 84\.6%, trailing Gemini\-2\.5\-flash by 4\.8pp \(89\.4%\) but substantially outperforming all open\-weight baselines, which range from 17\.8% to 27\.1%\.
Per\-category analysis across the 500 evaluated predictions further illuminates adjustment performance\. Table[6](https://arxiv.org/html/2604.16396#S7.T6)shows MIR\-E broken down by case type\. Counter\-intuitively,’awlcases perform worse thanraddcases \(79\.2% vs\. 83\.0%\) despite having nearly eight times more training examples \(577 vs\. 344 training, 39 vs\. 5 test\)\. This reversal reveals that performance is limited by arithmetic complexity rather than data frequency:’awlrequires computing a new common denominator from multiple overlapping fractions \(Sa=66\.7%S\_\{a\}=66\.7\\%,Sf=50\.4%S\_\{f\}=50\.4\\%\), whileraddrequires only proportional redistribution once the type is identified\. Forraddspecifically, heir identification and share assignment both reach 100%—legal reasoning is perfect on all five test cases, with failures confined to one detection error and two final distribution arithmetic failures\. The perfectShS\_\{h\}andSsS\_\{s\}scores onraddcases argue directly against the memorisation hypothesis: a model relying on pattern matching over 344 training instances would be expected to fail on legal reasoning first, not exclusively on the arithmetic output step\.
6 :Per\-category MIR\-E scores\.’Awlcases underperformraddcases \(79\.2% vs\. 83\.0%\) despite nearly eight times more training examples, revealing arithmetic complexity rather than data frequency as the primary bottleneck\.
### 4\.7\. Error Analysis
To understand our model’s limitations, we conduct detailed error analysis on all 500 test cases, comparing model predictions against gold standard references across all reasoning components\. We identify four distinct error categories that account for 16% of test cases\.
#### 4\.7\.1\. Error Distribution
Table[7](https://arxiv.org/html/2604.16396#S7.T7)shows the distribution of errors by category\. Calculation errors represent the most impactful failure mode \(−1\.7\-1\.7pp on MIR\-E\), followed by residue label avoidance \(−0\.85\-0\.85pp\), heir identification \(−0\.4\-0\.4pp\), and radd detection \(−0\.2\-0\.2pp\)\.
7 :Distribution of error types with impact on overall MIR\-E\. Calculation, heir identification, and radd detection are discrete failure cases \(59 total, 11\.8% of test set\); impacts are estimated\.†Among the 417 cases where the gold solution requires , the model substitutes an explicit fraction in 314 \(75\.3%\)\. Because 83\.1% of those 314 cases write the numerically correct fraction, MIR\-E’s tolerance absorbs most of the penalty: the measured shares score gap between cases with and without the label is 0\.045, giving a global cost of 314/500 x 0\.045 x 0\.30 =−0\.85\-0\.85pp \(shares score: 0\.97 with label vs\. 0\.92 without, gap = 0\.045\)\. This row is not summed with the three discrete error categories above\.
#### 4\.7\.2\. Representative Error Cases
Table[8](https://arxiv.org/html/2604.16396#S7.T8)presents component scores for one representative case from each error category\.
8 :Component scores for representative error cases \(ShS\_\{h\}: heirs/blocking,SsS\_\{s\}: shares,SaS\_\{a\}: adjustment,SfS\_\{f\}: final distribution\)\.Case 1 – Calculation Error \(nf8w4p3x\_7\):A deceased left a paternal half\-uncle \( \), a son of a paternal half\-brother \( \), four sons’ daughters \( \), a maternal grandmother \( \), a paternal grandfather \( \), a wife \(\), and five sons of a paternal nephew \( \)\. The model correctly identifies the four eligible heirs and blocks the remaining three, assigns correct shares \(16,18,16,23\\frac\{1\}\{6\},\\frac\{1\}\{8\},\\frac\{1\}\{6\},\\frac\{2\}\{3\}\), and correctly detects \(’awl\) since shares sum to2724\>1\\frac\{27\}\{24\}\>1, giving awl denominator 27 \(Sh=Ss=Sa=1S\_\{h\}=S\_\{s\}=S\_\{a\}=1\)\. However,post\_tasilusestotal\_shares = 108\(=274=274, conflating the awl denominator with the count of \) and copies pre\-awl values, producing wife = 8\.33% instead of 11\.11% \(−2\.78\-2\.78pp\) and all other heirs at 13\.89% instead of 14\.81% \(−0\.93\-0\.93pp\), summing to only 91\.67%\. Legal reasoning is perfect; the error is a failure to propagate the’awladjustment through to the final output field\.
Case 2 – Heir Identification Error \(nf5n5k1z\_7\):A deceased left a paternal great\-grandfather \( \), two paternal half\-brothers \( \), a maternal grandmother \( \), a father \(\), two of the father’s paternal half\-uncles \( \), and two great\-grandmothers \( , \)\. The gold solution correctly identifies two heirs \( with16\\frac\{1\}\{6\}and with the residue\) and blocks the remaining five\. The model’s heirs, shares, and final distribution are all correct \(Ss=Sf=1S\_\{s\}=S\_\{f\}=1\), andpost\_tasilmatches gold exactly \(16\.67% and 83\.33%\)\. TheShS\_\{h\}penalty \(0\.66\) arises solely from a label mismatch in the blocked list: the model writes instead of the gold taxonomy label for the father’s paternal half\-uncle\. The underlying blocking decision is correct, but the shortened label fails the string\-matching evaluation\.
Case 3 – Radd Detection Error \(ng7z4j2b\_2\):A deceased left three full sisters \( \) and a paternal great\-grandmother \( \)\. The model correctly identifies both heirs and assigns correct shares \( :16\\frac\{1\}\{6\}, :23\\frac\{2\}\{3\}\)\. Since the shares sum to56<1\\frac\{5\}\{6\}<1and no residuary heir is present, \(radd\) should apply\. The model predicts \(no adjustment\), failing to confirm the absence of a residuary heirBouchekifet al\.\([2026](https://arxiv.org/html/2604.16396#bib.bib1)\)\. Without redistribution the final percentages are wrong: receives 16\.67% instead of 20\.00% \(\+3\.33\+3\.33pp\) and each receives 22\.22% instead of 26\.67% \(\+4\.44\+4\.44pp\)\. Since cases are only 2\.8% of training data, this reflects systematic underfitting on rare adjustment events\.
Case 4 – Share Assignment Error \(ng2p5t2e\_10\):A deceased left a paternal grandfather \( \), four sons of a full paternal uncle \( \), five sons of a paternal cousin \( \), four paternal half\-brothers \( \), a mother \(\), a maternal half\-brother \( \), a paternal grandmother \( \), a son of a paternal half\-uncle \( \), a maternal grandmother \( \), and four of the father’s uncles \( \)\. The gold solution has three heirs: \(16\\frac\{1\}\{6\}\), \(518\\frac\{5\}\{18\}\), and44\(residue, \)\. The model makes three errors\. First, it incorrectly adds \(maternal half\-brother\) as an eligible heir, who is blocked when the mother is present\. Second, it lists four full brothers \( \) as blocked—a relative not mentioned anywhere in the question, a hallucinated heir type\. Third, it assigns a share of13\\frac\{1\}\{3\}instead of518\\frac\{5\}\{18\}and replaces the residue designation for with an explicit fraction of16\\frac\{1\}\{6\}—a systematic avoidance of \. The result: receives 16\.67% instead of 27\.78% \(−11\.11\-11\.11pp\), each receives 8\.33% instead of 13\.89% \(−5\.56\-5\.56pp\), and the spurious absorbs an additional 16\.67% of the estate, with the prediction summing to only 83\.33% of the estate\.
#### 4\.7\.3\. Error Patterns and Implications
Arithmetic vs\. Semantic Errors:The most impactful errors \(calculation, 29 cases, see Table[7](https://arxiv.org/html/2604.16396#S7.T7)\) arenon\-semantic—the model understands inheritance law correctly but fails in final arithmetic\. This suggests errors occur in the text\-generation step of the final output field rather than in core reasoning, making them addressable through constrained decoding without retraining\.
Complexity Degradation:Performance varies with case complexity\. For simple cases involving 2–4 heirs, the model achieves approximately 91\.7% MIR\-E\. For medium\-complexity cases with 5–7 heirs, performance decreases slightly to around 88\.9%, while for complex cases involving≥\\geq8 mentioned heirs, it further declines to approximately 87\.0%\. This gradual degradation indicates a moderate impact of complexity on performance, with the model maintaining strong accuracy even in more complex scenarios\.
Rare Event Underfitting:Per\-category analysis \(Table[6](https://arxiv.org/html/2604.16396#S7.T6)\) shows that’awlcases underperformraddcases \(79\.2% vs\. 83\.0%\) despite having nearly eight times more training examples\. The primary bottleneck is arithmetic complexity:’awlrequires multi\-fraction common\-denominator computation, whileraddrequires only proportional redistribution once the type is identified\. Both benefit from oversampling and explicit rule\-based fallbacks as future work\.
Residue Label Recall:The gold standard requires the residuary label in 417 of 500 evaluated cases \(83\.4%\), reflecting the prevalence of male agnate heirs \(\) across the test set\. The model provides this label in only 103 of those cases \(24\.7% recall\), substituting an explicit fraction in the remaining 314 cases \(75\.3% avoidance rate\)\. Table[9](https://arxiv.org/html/2604.16396#S7.T9)summarises the breakdown\.
9 :Residue label recall analysis\. Despite a 75\.3% avoidance rate, the global MIR\-E cost is only−0\.85\-0\.85pp because 83\.1% of avoidance cases compute the numerically correct fraction within the evaluation tolerance\.Despite the low recall, the MIR\-E cost is only−0\.85\-0\.85pp because MIR\-E’s tolerance absorbs 83\.1% of avoidance cases: the model computes the numerically correct residue fraction and writes it as an explicit value—for instance, writing"7/12"when is expected, which falls within the evaluation tolerance\. This reveals arepresentationalrather thancomputationalfailure\. Among the 314 avoidance cases, 261 \(83\.1%\) produce the exact fraction a symbolic calculator would derive by subtracting fixed shares from unity—the model has learned to perform residue arithmetic correctly but defaults to explicit fraction notation due to training bias toward fixed\-share cases\. The failure is therefore in the final token selection, not in the underlying calculation\. Constrained decoding enforcing whenever an heir is present and fixed shares sum to less than unity would recover the correct label in the majority of affected cases without any change to the model weights\.
### 5\.7\. Pipeline Success Rate
Table[10](https://arxiv.org/html/2604.16396#S7.T10)reports cumulative success rates across the reasoning pipeline, where each row shows the percentage of cases in which all stages up to and including that point score perfectly\.
10 :Cumulative pipeline success rates\. Each row shows the percentage of cases where all stages up to and including that point score perfectly \(S=1S=1\)\.While 65\.5% of cases are solved perfectly across all components, the overall MIR\-E reaches 90\.0%\. This gap is explained by the partial\-credit design of MIR\-EBouchekifet al\.\([2026](https://arxiv.org/html/2604.16396#bib.bib1)\), which assigns weighted scores to each intermediate stage \(αh=αs=αf=0\.30\\alpha\_\{h\}=\\alpha\_\{s\}=\\alpha\_\{f\}=0\.30,αa=0\.10\\alpha\_\{a\}=0\.10\)\. A case with correct heirs, shares, and adjustment but wrong final percentages, for instance, still earns0\.30\+0\.30\+0\.10\+0\.00=0\.700\.30\+0\.30\+0\.10\+0\.00=0\.70MIR\-E\. It is worth noting that the adjustment scoreSaS\_\{a\}is evaluated conditionally: it receives a non\-zero value only when bothSh=1S\_\{h\}=1andSs=1S\_\{s\}=1, reflecting the sequential dependency of the reasoning pipeline\. The 34\.5% of imperfect cases therefore contribute meaningful partial credit, and a back\-of\-envelope check confirms the arithmetic: if the 65\.5% perfect cases score 1\.0 and the remaining 34\.5% score on average 0\.71, the weighted average yields\(0\.655\(0\.655x1\.0\)\+\(0\.3451\.0\)\+\(0\.345x0\.71\)≈0\.900\.71\)\\approx 0\.90, consistent with our reported MIR\-E\.
The stage\-by\-stage breakdown reveals where errors enter the pipeline\. The drop from heirs\-correct \(84\.0%\) to all\-correct \(65\.5%\) accumulates across three transitions: heir identification to share assignment \(−2\.8\-2\.8pp\), share assignment to adjustment \(−1\.8\-1\.8pp\), and adjustment to final distribution \(−13\.9\-13\.9pp\)\. The largest single drop is the last, indicating that arithmetic computation inpost\_tasilis the dominant bottleneck for cases that pass all upstream reasoning stages—consistent with calculation errors being the most impactful error category \(5\.8% of cases,−1\.7\-1\.7pp, Table[7](https://arxiv.org/html/2604.16396#S7.T7)\)\.
### 6\.7\. Implications for Deployment
Our error analysis reveals that the model is suitable for educational tools, preliminary screening, and simple cases, but requires human review for legal decisions, complex families \(≥\\geq8 relatives\), rare adjustment cases, and high\-stakes situations\. Although the model scored 90\.0% MIR\-E overall, the 10% error rate remains too high for binding legal decisions without expert verification\.
An optimal deployment strategy employs a hybrid human\-AI workflow: \(1\) model generates initial analysis with 90% accuracy, \(2\) automatic flagging of complex cases, rare cases — \(radd\) and \(’awl\)—, and cases where the model writes explicit fractions for known residuary heirs, \(3\) human expert reviews flagged cases, \(4\) model provides confidence scores to guide review priority\. This approach leverages the model’s strengths \(speed, coverage of standard cases\) while mitigating its weaknesses through targeted human oversight\.
## 8\. Conclusion
We presented QU\-NLP’s approach to the QIAS 2026 shared task on Arabic Islamic inheritance reasoning, achieving 90\.0% MIR\-E through multi\-stage QLoRA fine\-tuning of Qwen3\-4B followed by lightweight post\-processing\. Our comprehensive error analysis reveals that:
1. \(1\)\.The model demonstrates strong legal reasoning—97\.1% heir and blocking accuracy, 94\.3% share assignment accuracy—but exhibits specific weaknesses in adjustment detection \(84\.6%\) and final distribution arithmetic \(5\.8% of cases show correct reasoning but wrong numerical output\)\.
2. \(2\)\.Per\-category analysis shows’awlcases \(79\.2% MIR\-E\) underperformraddcases \(83\.0%\) despite nearly eight times more training examples, revealing arithmetic complexity rather than data volume as the primary bottleneck\.Raddcases achieve 100% heir and share accuracy with failures confined to the final arithmetic output step—evidence of generalised legal reasoning rather than pattern memorisation\.
3. \(3\)\.The PostTasil post\-processing variant produces results identical to Basic across all metrics, confirming that the model’s adjustment arithmetic is correct in every case it correctly classifies the adjustment type\. The 29 remaining calculation errors are irreducible text\-generation failures in the final output field, not addressable by any post\-hoc symbolic layer\.
4. \(4\)\.Residue label recall is 24\.7% \(103/417 cases requiring the label\), with the model substituting explicit fractions in 75\.3% of cases\. The global MIR\-E cost is only−0\.85\-0\.85pp because 83\.1% of avoidance cases compute the numerically correct residue fraction—confirming the failure is representational \(wrong output format\) rather than computational \(wrong arithmetic\), and is directly addressable through constrained decoding without retraining\.
5. \(5\)\.Domain\-specific pre\-adaptation on Islamic legal texts improves structured output quality, with fatwa records providing exposure to jurisprudential terminology and reasoning patterns not present in general pre\-training data\.
Our results demonstrate that small fine\-tuned models \(4B parameters\) can match or exceed larger general\-purpose models \(32–120B parameters\) on specialized legal reasoning tasks, achieving commercial\-grade performance with minimal computational resources suitable for deployment on consumer hardware\.
Future work will explore: \(1\) constrained decoding to guarantee validJSONstructure, correct arithmetic, and enforcement of when heirs are present and fixed shares sum to less than unity—directly addressing the 75\.3% residue avoidance rate \(−0\.85\-0\.85pp\) without retraining; \(2\) reinforcement learning from process rewards to improve multi\-step reasoning; \(3\) oversampling rare adjustment cases— \(radd, 2\.8%\) and \(’awl, 4\.9%\)—to address training imbalance; \(4\) extension to other Islamic legal domains \(marriage, divorce, financial transactions\); and \(5\) hybrid systems combining neural reasoning with symbolic rule engines for guaranteed correctness in safety\-critical applications\.
- U\. Abbas, M\. S\. Ahmad, F\. Alam, E\. Altinisik, E\. Asgari, Y\. Boshmaf, S\. Boughorbel, S\. Chawla, S\. Chowdhury, F\. Dalvi,et al\.\(2025\)Fanar: an arabic\-centric multimodal generative ai platform\.External Links:2501\.13944,[Link](https://arxiv.org/abs/2501.13944)Cited by:[§2\.2](https://arxiv.org/html/2604.16396#S2.SS2.p2.1)\.
- QU\-NLP at QIAS 2025 shared task: a two\-phase LLM fine\-tuning and retrieval\-augmented generation approach for islamic inheritance reasoning\.InProceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks,K\. Darwish, A\. Ali, I\. Abu Farha, S\. Touileb, I\. Zitouni, A\. Abdelali, S\. Al\-Ghamdi, S\. Alkhereyf, W\. Zaghouani, S\. Khalifa, B\. AlKhamissi, R\. Almatham, I\. Hamed, Z\. Alyafeai, A\. Alowisheq, G\. Inoue, K\. Mrini, and W\. Alshammari \(Eds\.\),Suzhou, China,pp\. 892–898\.External Links:[Link](https://aclanthology.org/2025.arabicnlp-sharedtasks.123/),[Document](https://dx.doi.org/10.18653/v1/2025.arabicnlp-sharedtasks.123),ISBN 979\-8\-89176\-356\-2Cited by:[§1\.2](https://arxiv.org/html/2604.16396#S2.SS1.p2.1),[§1\.2](https://arxiv.org/html/2604.16396#S2.SS1.p3.1),[§2\.2](https://arxiv.org/html/2604.16396#S2.SS2.p3.1)\.
- A\. Almasoud, S\. Al\-Ghamdi, R\. Alqifari, N\. Alfear, and H\. Al\-Khalifa \(2026\)MirathQA: a dataset for evaluating large language models on hanbali islamic inheritance reasoning tasks\.Data in BriefNaturearXiv preprint arXiv:2505\.0938865,pp\. 112589\.External Links:ISSN 2352\-3409,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.dib.2026.112589),[Link](https://www.sciencedirect.com/science/article/pii/S2352340926001423)Cited by:[§1\.2](https://arxiv.org/html/2604.16396#S2.SS1.p3.1)\.
- S\. Alowaidi \(2025\)SEA\-team at QIAS 2025: enhancing LLMs for question answering in islamic texts\.InProceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks,K\. Darwish, A\. Ali, I\. Abu Farha, S\. Touileb, I\. Zitouni, A\. Abdelali, S\. Al\-Ghamdi, S\. Alkhereyf, W\. Zaghouani, S\. Khalifa, B\. AlKhamissi, R\. Almatham, I\. Hamed, Z\. Alyafeai, A\. Alowisheq, G\. Inoue, K\. Mrini, and W\. Alshammari \(Eds\.\),Suzhou, China,pp\. 940–946\.External Links:[Link](https://aclanthology.org/2025.arabicnlp-sharedtasks.130/),[Document](https://dx.doi.org/10.18653/v1/2025.arabicnlp-sharedtasks.130),ISBN 979\-8\-89176\-356\-2Cited by:[§1\.2](https://arxiv.org/html/2604.16396#S2.SS1.p2.1)\.
- G\. Bhatia, H\. Mubarak, M\. Jarrar, G\. Mikros, F\. Zaraket, M\. Alhirthani, M\. Al\-Khatib, L\. Cochrane, K\. Darwish, R\. Yahiaoui, and F\. Alam \(2026\)From RAG to agentic RAG for faithful islamic question answering\.External Links:2601\.07528,[Link](https://arxiv.org/abs/2601.07528)Cited by:[§1\.2](https://arxiv.org/html/2604.16396#S2.SS1.p1.1)\.
- A\. Bouchekif, S\. Gaben, S\. Rashwani, S\. Eltanbouly, M\. Al\-Khatib, H\. Sbahi, M\. Ghaly, and E\. Mohamed \(2026\)MAWARITH: a dataset and benchmark for legal inheritance reasoning with llms\.External Links:2603\.07539,[Link](https://arxiv.org/abs/2603.07539)Cited by:[§1](https://arxiv.org/html/2604.16396#S1.p1.1),[§1](https://arxiv.org/html/2604.16396#S1.p3.1),[§1\.2](https://arxiv.org/html/2604.16396#S2.SS1.p4.1),[§1\.3\.1](https://arxiv.org/html/2604.16396#S3.SS1.SSS1.p1.1),[§2\.3](https://arxiv.org/html/2604.16396#S3.SS2.p1.1),[§1\.5](https://arxiv.org/html/2604.16396#S5.SS1.p1.1),[§2\.6](https://arxiv.org/html/2604.16396#S6.SS2.p1.1),[§4\.7\.2](https://arxiv.org/html/2604.16396#S7.SS4.SSS2.p4.5),[§5\.7](https://arxiv.org/html/2604.16396#S7.SS5.p2.9)\.
- A\. Bouchekif, S\. Rashwani, E\. S\. A\. Mohamed, M\. Alkhatib, H\. Sbahi, S\. Gaben, W\. Zaghouani, A\. Erbad, and M\. Ghaly \(2025a\)QIAS 2025: overview of the shared task on islamic inheritance reasoning and knowledge assessment\.InProceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks,K\. Darwish, A\. Ali, I\. Abu Farha, S\. Touileb, I\. Zitouni, A\. Abdelali, S\. Al\-Ghamdi, S\. Alkhereyf, W\. Zaghouani, S\. Khalifa, B\. AlKhamissi, R\. Almatham, I\. Hamed, Z\. Alyafeai, A\. Alowisheq, G\. Inoue, K\. Mrini, and W\. Alshammari \(Eds\.\),Suzhou, China,pp\. 851–860\.External Links:[Link](https://aclanthology.org/2025.arabicnlp-sharedtasks.117/),[Document](https://dx.doi.org/10.18653/v1/2025.arabicnlp-sharedtasks.117),ISBN 979\-8\-89176\-356\-2Cited by:[§1](https://arxiv.org/html/2604.16396#S1.p3.1),[§1\.2](https://arxiv.org/html/2604.16396#S2.SS1.p3.1)\.
- A\. Bouchekif, S\. Rashwani, H\. Sbahi, S\. Gaben, M\. Al Khatib, and M\. Ghaly \(2025b\)Assessing large language models on islamic legal reasoning: evidence from inheritance law evaluation\.InProceedings of The Third Arabic Natural Language Processing Conference,K\. Darwish, A\. Ali, I\. Abu Farha, S\. Touileb, I\. Zitouni, A\. Abdelali, S\. Al\-Ghamdi, S\. Alkhereyf, W\. Zaghouani, S\. Khalifa, B\. AlKhamissi, R\. Almatham, I\. Hamed, Z\. Alyafeai, A\. Alowisheq, G\. Inoue, K\. Mrini, and W\. Alshammari \(Eds\.\),Suzhou, China,pp\. 246–257\.External Links:[Link](https://aclanthology.org/2025.arabicnlp-main.20/),[Document](https://dx.doi.org/10.18653/v1/2025.arabicnlp-main.20),ISBN 979\-8\-89176\-352\-4Cited by:[2nd item](https://arxiv.org/html/2604.16396#S1.I1.i2.p1.1),[§1\.2](https://arxiv.org/html/2604.16396#S2.SS1.p1.1)\.
- I\. Chalkidis, A\. Jana, D\. Hartung, M\. Bommarito, I\. Androutsopoulos, D\. Katz, and N\. Aletras \(2022\)LexGLUE: a benchmark dataset for legal language understanding in English\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 4310–4330\.External Links:[Link](https://aclanthology.org/2022.acl-long.297/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.297)Cited by:[§2\.2](https://arxiv.org/html/2604.16396#S2.SS2.p1.1)\.
- G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat, N\. Sachdeva, I\. Dhillon, M\. Blistein, O\. Ram, D\. Zhang, E\. Rosen, L\. Marris, S\. Petulla, C\. Gaffney, A\. Aharoni, N\. Lintz, T\. C\. Pais, H\. Jacobsson, I\. Szpektor, N\. Jiang, K\. Haridasan, A\. Omran,et al\.\(2025\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.External Links:2507\.06261,[Link](https://arxiv.org/abs/2507.06261)Cited by:[§2\.2](https://arxiv.org/html/2604.16396#S2.SS2.p2.1)\.
- T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\)QLORA: efficient finetuning of quantized llms\.InProceedings of the 37th International Conference on Neural Information Processing Systems,NIPS ’23,Red Hook, NY, USA\.Cited by:[§2\.2](https://arxiv.org/html/2604.16396#S2.SS2.p3.1),[§3\.4](https://arxiv.org/html/2604.16396#S4.SS3.p1.2)\.
- E\. Elrefai, M\. Lotfy Elrefai, and A\. Hassan Esmail \(2025\)Gumball at QIAS 2025: Arabic LLM automated reasoning in islamic inheritance\.InProceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks,K\. Darwish, A\. Ali, I\. Abu Farha, S\. Touileb, I\. Zitouni, A\. Abdelali, S\. Al\-Ghamdi, S\. Alkhereyf, W\. Zaghouani, S\. Khalifa, B\. AlKhamissi, R\. Almatham, I\. Hamed, Z\. Alyafeai, A\. Alowisheq, G\. Inoue, K\. Mrini, and W\. Alshammari \(Eds\.\),Suzhou, China,pp\. 953–959\.External Links:[Link](https://aclanthology.org/2025.arabicnlp-sharedtasks.132/),[Document](https://dx.doi.org/10.18653/v1/2025.arabicnlp-sharedtasks.132),ISBN 979\-8\-89176\-356\-2Cited by:[§1\.2](https://arxiv.org/html/2604.16396#S2.SS1.p3.1)\.
- N\. Guha, J\. Nyarko, D\. E\. Ho, C\. Ré, A\. Chilton, A\. Narayana, A\. Chohlas\-Wood, A\. Peters, B\. Waldon, D\. N\. Rockmore, D\. Zambrano, D\. Talisman, E\. Hoque, F\. Surani, F\. Fagan, G\. Sarfaty, G\. M\. Dickinson, H\. Porat, J\. Hegland, J\. Wu, J\. Nudell, J\. Niklaus, J\. Nay, J\. H\. Choi, K\. Tobia, M\. Hagan, M\. Ma, M\. Livermore, N\. Rasumov\-Rahe, N\. Holzenberger, N\. Kolt, P\. Henderson, S\. Rehaag, S\. Goel, S\. Gao, S\. Williams, S\. Gandhi, T\. Zur, V\. Iyer, and Z\. Li \(2023\)LegalBench: a collaboratively built benchmark for measuring legal reasoning in large language models\.External Links:2308\.11462,[Link](https://arxiv.org/abs/2308.11462)Cited by:[§2\.2](https://arxiv.org/html/2604.16396#S2.SS2.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi,et al\.\(2025\)DeepSeek\-r1 incentivizes reasoning in LLMs through reinforcement learning\.645\(8081\),pp\. 633–638\.External Links:[Document](https://dx.doi.org/10.1038/s41586-025-09422-z),[Link](https://doi.org/10.1038/s41586-025-09422-z)Cited by:[§2\.2](https://arxiv.org/html/2604.16396#S2.SS2.p2.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2021\)LoRA: low\-rank adaptation of large language models\.External Links:2106\.09685,[Link](https://arxiv.org/abs/2106.09685)Cited by:[§2\.2](https://arxiv.org/html/2604.16396#S2.SS2.p3.1)\.
- R\. Malhas, W\. Mansour, and T\. Elsayed \(2022\)Qur’an QA 2022: overview of the first shared task on question answering over the holy qur’an\.InProceedinsg of the 5th Workshop on Open\-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur’an QA and Fine\-Grained Hate Speech Detection,H\. Al\-Khalifa, T\. Elsayed, H\. Mubarak, A\. Al\-Thubaity, W\. Magdy, and K\. Darwish \(Eds\.\),Marseille, France,pp\. 79–87\.External Links:[Link](https://aclanthology.org/2022.osact-1.9/)Cited by:[§1\.2](https://arxiv.org/html/2604.16396#S2.SS1.p1.1)\.
- S\. Mangrulkar, S\. Gugger, L\. Debut, Y\. Belkada, S\. Paul, B\. Bossan, and M\. Tietz \(2022\)PEFT: state\-of\-the\-art parameter\-efficient fine\-tuning methods\.Note:[https://github\.com/huggingface/peft](https://github.com/huggingface/peft)Cited by:[§1\.6](https://arxiv.org/html/2604.16396#S6.SS1.p1.1)\.
- H\. Mubarak, R\. Malhas, W\. Mansour, A\. Mohamed, M\. Fawzi, M\. Hawasly, T\. Elsayed, K\. M\. Darwish, and W\. Magdy \(2025\)IslamicEval 2025: the first shared task of capturing llms hallucination in islamic content\.InProceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks,Suzhou, China,pp\. 480–493\.Cited by:[§1\.2](https://arxiv.org/html/2604.16396#S2.SS1.p1.1)\.
- J\. Niklaus, V\. Matoshi, P\. Rani, A\. Galassi, M\. Stürmer, and I\. Chalkidis \(2023\)LEXTREME: a multi\-lingual and multi\-task benchmark for the legal domain\.InFindings of the Association for Computational Linguistics: EMNLP 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 3016–3054\.External Links:[Link](https://aclanthology.org/2023.findings-emnlp.200/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.200)Cited by:[§2\.2](https://arxiv.org/html/2604.16396#S2.SS2.p1.1)\.
- OpenAI, J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat, R\. Avila, I\. Babuschkin, S\. Balaji, V\. Balcom, P\. Baltescu, H\. Bao, M\. Bavarian,et al\.\(2024\)GPT\-4 technical report\.External Links:2303\.08774,[Link](https://arxiv.org/abs/2303.08774)Cited by:[§1](https://arxiv.org/html/2604.16396#S1.p1.1)\.
- Qwen Team \(2025\)Qwen3 technical report\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2505.09388),[Link](https://doi.org/10.48550/arXiv.2505.09388)Cited by:[§1](https://arxiv.org/html/2604.16396#S1.p4.1),[§2\.2](https://arxiv.org/html/2604.16396#S2.SS2.p2.1),[§1\.4](https://arxiv.org/html/2604.16396#S4.SS1.p1.1)\.
- A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram, A\. Nathan,et al\.\(2025\)OpenAI gpt\-5 system card\.External Links:2601\.03267,[Link](https://arxiv.org/abs/2601.03267)Cited by:[§2\.2](https://arxiv.org/html/2604.16396#S2.SS2.p2.1)\.
- T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz, J\. Davison, S\. Shleifer, P\. von Platen, C\. Ma, Y\. Jernite, J\. Plu, C\. Xu, T\. Le Scao, S\. Gugger, M\. Drame, Q\. Lhoest, and A\. Rush \(2020\)Transformers: state\-of\-the\-art natural language processing\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,Q\. Liu and D\. Schlangen \(Eds\.\),Online,pp\. 38–45\.External Links:[Link](https://aclanthology.org/2020.emnlp-demos.6/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-demos.6)Cited by:[§1\.6](https://arxiv.org/html/2604.16396#S6.SS1.p1.1)\.
- J\. Woo, F\. Hashemi Chaleshtori, A\. Marasovic, and K\. Marino \(2025\)BriefMe: a legal NLP benchmark for assisting with legal briefs\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 13139–13190\.External Links:[Link](https://aclanthology.org/2025.findings-acl.681/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.681),ISBN 979\-8\-89176\-256\-5Cited by:[§2\.2](https://arxiv.org/html/2604.16396#S2.SS2.p1.1)\.Similar Articles
QIAS 2026: Overview of the Shared Task on Islamic Inheritance Reasoning
This paper presents an overview of the QIAS 2026 shared task on Islamic inheritance reasoning, evaluating LLMs on multi-step legal and numerical reasoning using the MAWARITH benchmark.
Which Models Perform Better in Inheritance Reasoning?
This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning, comparing commercial and open-source large language models. Results show commercial models (e.g., Gemini 2.5 Flash) significantly outperform open-source models in structured legal reasoning with multi-step dependencies.
QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard
QIMMA is a new quality-first Arabic LLM leaderboard introduced by TII UAE that validates benchmarks before evaluation to ensure accurate performance measurement. It addresses systematic quality issues in existing Arabic NLP benchmarks through a rigorous multi-stage validation pipeline.
LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification
This paper presents an iterative imbalance-aware fine-tuning approach using Qwen3-8B with QLoRA for psychological defense mechanism classification, achieving a macro F1 of 0.3917 and ranking 4th out of 21 teams in the PsyDefDetect 2026 shared task.
LQS v3.1 — an open methodology for rating AI training data (multi-oracle consensus + signed certificates) [P]
The author presents LQS v3.1, an open methodology for rating AI training data using multi-oracle consensus and signed certificates, with a published paper and public index. The approach aims to solve the bottleneck of independent quality evaluation in the AI training data market.