Which Models Perform Better in Inheritance Reasoning?

arXiv cs.CL 06/15/26, 04:00 AM Papers


Summary

This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning, comparing commercial and open-source large language models. Results show commercial models (e.g., Gemini 2.5 Flash) significantly outperform open-source models in structured legal reasoning with multi-step dependencies.

arXiv:2606.13751v1 Announce Type: new


# Which Models Perform Better in Inheritance Reasoning?
Source: [https://arxiv.org/html/2606.13751](https://arxiv.org/html/2606.13751)
###### Abstract

This paper presents the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning\. The task evaluates the ability of large language models to solve inheritance cases that require legal interpretation, multi\-step reasoning, and precise numerical computation\. We compare commercial and open\-source models under a unified prompting strategy to assess their effectiveness in structured legal reasoning with minimal task\-specific adaptation\. Our results show a clear gap in reliability between the two model families\. Commercial models demonstrate stronger performance in identifying eligible heirs, applying exclusion rules, and maintaining consistency across reasoning steps\. In contrast, open\-source models exhibit greater instability, particularly in cases involving dependent legal decisions and fractional share adjustments\. The best performance is achieved by Gemini 2\.5 Flash, with an MRE of 0\.989\.


Mohammed Amine Mouhoub¹, Chahinez Bouchekif²

¹Paris Dauphine University, ²University of Abou Bekr Belkaïd

mohamed\.mouhoub@dauphine\.fr, meriemchahinez\.bouchekif@univ\-tlemcen\.dz

## 1\. Introduction

Large language models \(LLMs\) perform well on many NLP tasks, but this does not always mean that they can reason well in rule\-based domains\. In such tasks, a good answer must be both plausible and correct at every step\. If the model makes one wrong decision early in the process, the whole solution may become incorrect\. For this reason, more attention is now being given to evaluation tasks that test reasoning with structured rules, symbolic relations, and numerical accuracy\.

Islamic inheritance law \('ilm al\-mawārīth\) is a particularly challenging example of this problem\. Solving an inheritance case requires more than retrieving legal knowledge or recognizing common patterns\. A correct solution must identify the eligible heirs, apply blocking and exclusion rules correctly, assign the prescribed shares, and decide whether adjustment mechanisms such as 'awl or radd are needed\. Because all of these decisions are closely connected, inheritance reasoning is a useful benchmark for studying whether current LLMs can perform coherent multi\-step legal reasoning rather than simply generate fluent legal language \(Bouchekif et al\., [2026](https://arxiv.org/html/2606.13751#bib.bib1)\)\.

Recent work on reasoning\-oriented large language models has highlighted the importance of evaluating models on tasks that require more than fluent text generation\. Chain\-of\-thought prompting improves performance on multi\-step problems by encouraging models to produce intermediate reasoning steps, while self\-consistency increases robustness by selecting answers that are consistent across multiple sampled reasoning paths \(Wei et al\., [2022](https://arxiv.org/html/2606.13751#bib.bib9); Wang et al\., [2022](https://arxiv.org/html/2606.13751#bib.bib10)\)\. More recent reasoning\-focused systems, such as OpenAI’s o1 family and DeepSeek\-R1, further suggest that additional inference\-time reasoning or reinforcement learning can lead to strong gains on difficult reasoning tasks \(Jaech et al\., [2024](https://arxiv.org/html/2606.13751#bib.bib11); Guo et al\., [2025](https://arxiv.org/html/2606.13751#bib.bib12)\)\. However, these advances do not fully solve the problem\. Reasoning remains fragile in tasks that require strict rule application, long chains of dependent decisions, or exact numerical computation, where a single intermediate mistake may invalidate the final answer \(Wei et al\., [2022](https://arxiv.org/html/2606.13751#bib.bib9); Wang et al\., [2022](https://arxiv.org/html/2606.13751#bib.bib10)\)\.
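The self\-consistency idea mentioned above can be sketched as a majority vote over the final answers of several sampled reasoning paths\. This is a minimal illustration in Python, not a procedure used by any system in this paper; the function name `self_consistent_answer` and the answer strings are hypothetical\.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent final answer among sampled reasoning paths.

    Ties are resolved in favor of the answer that was sampled first.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer


# Hypothetical final answers from three sampled paths for the same case
# (a wife and a son: the wife takes 1/8 and the son takes the remainder).
samples = [
    "wife=1/8; son=7/8",
    "wife=1/8; son=7/8",
    "wife=1/4; son=3/4",
]
print(self_consistent_answer(samples))  # wife=1/8; son=7/8
```

Voting only helps when the errors of individual paths are not shared; if most paths make the same early mistake, the majority answer is still wrong\.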

This limitation is especially clear in specialized legal tasks such as Islamic inheritance, where correct answers depend on both legal reasoning and precise numerical calculation\. In such settings, reasoning\-oriented models often perform better than standard LLMs\. Several studies on the QIAS 2025 benchmark \(Bouchekif et al\., [2025a](https://arxiv.org/html/2606.13751#bib.bib2)\) showed that models such as Gemini and ChatGPT consistently outperformed many Arabic\-focused and open\-source models, including Fanar and ALLaM, on inheritance\-law evaluation \(Bouchekif et al\., [2025b](https://arxiv.org/html/2606.13751#bib.bib3); AlDahoul and Zaki, [2025](https://arxiv.org/html/2606.13751#bib.bib5); AL\-Smadi, [2025](https://arxiv.org/html/2606.13751#bib.bib6); Eddine Bekhouche et al\., [2025](https://arxiv.org/html/2606.13751#bib.bib4)\)\. Notable exceptions include Elrefai et al\. \([2025](https://arxiv.org/html/2606.13751#bib.bib7)\), who obtained the best results using Qwen3, and Xuan Phuc and Đặng Văn \([2025](https://arxiv.org/html/2606.13751#bib.bib8)\), who proposed a hybrid multi\-agent architecture\.

In this paper, we describe the participation of team PSL in QIAS 2026\. Our submission compares commercial and open\-source models under a unified prompting framework\. Instead of designing a highly specialized task\-specific pipeline, we examine a simpler and more informative question: how well can current general\-purpose models solve a structured Arabic legal reasoning task when evaluated under the same conditions? This comparison is important because it helps clarify the gap between highly capable proprietary systems and more accessible open\-weight models\.

## 2\. Data

Our experiments are conducted on the QIAS 2026 benchmark\. The dataset contains 12,500 inheritance cases written in natural language and follows the majority juristic opinion \(al\-jumhūr\)\. Each instance describes the deceased and the surviving relatives, and the task is to determine the eligible heirs and their final shares according to the rules of Islamic inheritance\.

Each case provides all the information required to solve the problem, including the inheriting heirs, blocked heirs, assigned shares, the possible application of 'awl or radd, and the final normalized distribution\. In addition, the dataset includes an intermediate reasoning trace and a concise final answer, making it suitable for evaluating both reasoning quality and final prediction accuracy\.

The full corpus is divided into 12,000 training instances and 500 test instances\. It covers 36 distinct heir categories, ranging from close relatives such as parents, children, and spouses to more distant agnatic relatives across multiple generations\. The cases vary in difficulty, from simple configurations with a small number of heir types to more complex scenarios involving up to twelve distinct heir categories\.

In terms of legal complexity, the training split contains 11,079 simple cases, 577 'awl cases, and 344 radd cases\. The reported totals imply that the 500\-case test split contains 456 simple cases, 39 'awl cases, and 5 radd cases\. This distribution shows that most examples do not require adjustment, while a smaller but important subset evaluates the model’s ability to handle more difficult redistribution \(radd\) and proportional\-reduction \('awl\) cases\.
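The two adjustment mechanisms can be illustrated with exact fractions\. The sketch below is a simplified illustration in Python, not the benchmark's reference solver: the function `adjust_shares` and the heir labels are hypothetical, and it assumes that every listed heir takes part in the adjustment \(no residuary heir, and no spouse in the radd case, since a spouse is normally left out of the redistribution under the majority opinion\)\.

```python
from fractions import Fraction


def adjust_shares(shares):
    """Scale prescribed shares so that they sum to exactly 1.

    Returns the kind of adjustment ("none", "awl", or "radd") and the
    normalized distribution. Simplifying assumption: all heirs in
    `shares` are adjusted proportionally.
    """
    total = sum(shares.values(), Fraction(0))
    if total == 1:
        kind = "none"
    elif total > 1:
        kind = "awl"   # shares exceed the estate: proportional reduction
    else:
        kind = "radd"  # shares fall short: surplus is redistributed
    final = {heir: share / total for heir, share in shares.items()}
    return kind, final


# 'awl: husband 1/2 + two full sisters 2/3 = 7/6 of the estate.
awl_kind, awl_final = adjust_shares(
    {"husband": Fraction(1, 2), "full_sisters": Fraction(2, 3)}
)
print(awl_kind, awl_final)    # husband 3/7, full_sisters 4/7

# radd: mother 1/6 + daughter 1/2 = 2/3 of the estate.
radd_kind, radd_final = adjust_shares(
    {"mother": Fraction(1, 6), "daughter": Fraction(1, 2)}
)
print(radd_kind, radd_final)  # mother 1/4, daughter 3/4
```

Using `Fraction` rather than floating point keeps every intermediate value exact, which matters when the final distribution must sum to 1 without rounding error\.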

## 3\. System Overview

Our submission uses a simple prompting\-based chain\-of\-thought \(CoT\) design\. The goal is not to build a complex task\-specific system, but to compare different types of models in a fair setting\. In particular, we compare reasoning and non\-reasoning models, as well as commercial and non\-commercial models, under the same conditions\. This allows us to better assess their actual ability to solve Arabic inheritance reasoning problems\.

Given an inheritance case in Arabic, the system prompts the model to produce a structured solution\. The prompt encourages the model to proceed through the main inheritance stages: identifying heirs, applying blocking rules, determining legal shares, and producing the final answer\. To reduce unnecessary variation in generation, we use a constrained output format and a small amount of post\-processing\.

### 3\.1\. Models

We evaluated both commercial and open\-source models to better understand the current gap between these two groups on a structured Arabic legal reasoning task\. The models were selected for three main reasons\. First, we wanted a fair comparison between systems that are widely accessible to researchers\. Second, we wanted diversity in model families and capabilities\. Third, we wanted to test whether strong general reasoning models can transfer well to a domain that combines Arabic input, legal constraints, and precise numerical output\.

#### 3\.1\.1\. Commercial Models

Our commercial systems included Gemini 2\.5 Flash and Gemini 2\.5 Pro\. These models were selected for their strong general\-purpose reasoning capabilities and solid performance in applied NLP tasks\. They provide a realistic baseline for systems that users may rely on when addressing legal or knowledge\-intensive problems without task\-specific fine\-tuning\.

#### 3\.1\.2\. Open\-Source Models

### 3\.2\. Prompting Strategy

Our method is based on a unified prompting framework\. Each model receives the inheritance case in Arabic together with instructions asking it to solve the case in a structured way\. The prompt explicitly encourages the model to reason step by step and to return an answer that follows a predefined format\.

In practice, the prompt asks the model to:

1. identify the heirs who inherit,
2. apply relevant exclusion or blocking rules,
3. assign the correct legal shares,
4. and produce the final distribution\.

We intentionally kept the prompt relatively simple\. The goal was not to encode a handcrafted expert system inside the prompt, but rather to observe how much structured reasoning the models can perform under shared instructions\.

For all models, we required the output to show the reasoning at each step, followed by a final answer in a structured format that matches the evaluation metric\. This made the outputs more consistent, easier to process, and helped us see exactly where any reasoning mistakes occurred\. We only applied simple post\-processing to ensure that each heir was represented in a single, uniform form\. We did not use extensive correction rules, since our goal was to evaluate the models’ abilities rather than to engineer a complex pipeline\.
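The prompting and post\-processing steps described in this section can be sketched as follows\. This is a hedged illustration in Python, not the actual PSL prompt or pipeline: the English instruction text, the `FINAL:` marker, the JSON answer format, and the helper names `build_prompt` and `parse_final` are all assumptions made for the example\.

```python
import json
import re

# The four stages listed above, phrased as instructions (wording is illustrative).
STEPS = [
    "Identify the heirs who inherit.",
    "Apply the relevant exclusion or blocking rules.",
    "Assign the correct legal shares.",
    "Produce the final distribution.",
]


def build_prompt(case_text):
    """Build one shared prompt so every model sees the same instructions."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(STEPS, 1))
    return (
        "Solve the following inheritance case step by step.\n"
        f"{numbered}\n"
        'End with a line of the form FINAL: {"heir": "fraction", ...}\n\n'
        f"Case: {case_text}"
    )


def parse_final(output):
    """Extract the final answer and put heir labels in one uniform form.

    Returns None when the requested format is missing or malformed.
    """
    match = re.search(r"FINAL:\s*(\{.*\})", output, flags=re.DOTALL)
    if match is None:
        return None
    try:
        raw = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(raw, dict):
        return None
    # Light normalization only: lowercase, trim, collapse spaces/underscores.
    return {
        " ".join(str(heir).lower().replace("_", " ").split()): str(share)
        for heir, share in raw.items()
    }


reply = 'Step 1: ...\nFINAL: {"Wife ": "1/8", "SON": "7/8"}'
print(parse_final(reply))  # {'wife': '1/8', 'son': '7/8'}
```

Keeping the parser strict \(returning `None` instead of guessing\) keeps format failures visible rather than silently repairing them, in line with the goal of evaluating the models rather than engineering a correction pipeline\.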

## 4\. Results and Analysis

Table 1: Main results for the PSL systems on QIAS 2026\. ∗ indicates the official submission\.

Table [1](https://arxiv.org/html/2606.13751#S4.T1) reports the overall MIR\-E scores obtained by all evaluated models\. The results reveal a substantial performance gap between commercial and open\-source models\. Commercial models achieve significantly higher scores, with Gemini\-2\.5\-Pro reaching 0\.931 and Gemini\-2\.5\-Flash \(official submission\) achieving 0\.898\. In contrast, open\-source models perform considerably worse, with scores ranging from 33\.1 to 45\.1\. The strongest open\-source model, Qwen3\-32B, achieves 45\.1, remaining more than 50 points below the official system\. Furthermore, performance differences among open\-source models are relatively limited, suggesting a broadly similar level of capability within this group\.

Beyond aggregate performance, the comparison highlights important qualitative differences in reasoning behavior\. Commercial models demonstrate a stronger ability to maintain coherent multi\-step reasoning, preserving consistency across heir identification, exclusion rules, and final share allocation\. They are also more robust when early\-stage decisions are ambiguous or complex\. In contrast, open\-source models frequently exhibit structural errors at early stages of the reasoning process\. Such errors tend to propagate, leading to incorrect intermediate steps and ultimately invalid final distributions\. For example, the omission of a valid heir often results in incorrect redistribution of shares\.

A further observation is that surface\-level fluency does not reliably indicate correctness\. Several models produce outputs that appear legally plausible, yet contain latent structural inconsistencies, such as incorrect application of exclusion rules or infeasible share allocations\.

Taken together, these results suggest that commercial models are both more accurate and more stable in handling structured legal reasoning tasks, whereas open\-source models remain limited in their ability to sustain coherent reasoning across interdependent decision steps\.

For comparison, we also report the scores of the systems that participated in the shared task\. Table [2](https://arxiv.org/html/2606.13751#S4.T2) presents the official ranking of the participating teams\. The proposed methods cover a wide range of approaches, from simple prompting\-based techniques to fine\-tuning strategies that adapt language models to the inheritance task\. Some approaches rely on Retrieval\-Augmented Generation \(RAG\) to find similar cases and support reasoning, while others adopt hybrid methods that combine large language models \(LLMs\) with rule\-based systems\.

Table 2: Official leaderboard results for QIAS 2026 for teams that participated in the test phase and submitted a paper\.

### 4\.1\. Common Error Types

Our manual inspection identified several recurring error patterns:

##### Missing or hallucinated heirs\.

Some models failed to identify all heirs mentioned in the scenario, while others introduced relatives who were not present in the case\. This problem is especially harmful because all later steps depend on the correct heir set\.
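This error type can be detected mechanically by comparing the predicted heir set with the reference heir set\. The sketch below is an illustration in Python; the function `heir_set_errors` and the heir labels are hypothetical and are not taken from the QIAS evaluation code\.

```python
def heir_set_errors(predicted, gold):
    """Compare predicted heirs with the reference heirs for one case.

    'missing' lists reference heirs the model left out; 'hallucinated'
    lists heirs the model introduced that are not in the reference.
    """
    predicted, gold = set(predicted), set(gold)
    return {
        "missing": sorted(gold - predicted),
        "hallucinated": sorted(predicted - gold),
    }


errors = heir_set_errors(
    predicted=["wife", "son", "paternal uncle"],
    gold=["wife", "son", "daughter"],
)
print(errors)  # {'missing': ['daughter'], 'hallucinated': ['paternal uncle']}
```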

##### Incorrect blocking decisions\.

Another common issue was the wrong application of exclusion rules\. In some outputs, an heir who should have been blocked was retained; in others, a valid heir was incorrectly removed\.
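Total exclusion can be pictured as a lookup from each heir to the relatives whose presence blocks that heir\. The Python sketch below uses a deliberately tiny, illustrative subset of well\-known rules \(a father blocks the paternal grandfather, a son or father blocks a full brother, a son blocks a son's son, a mother blocks a maternal grandmother\); the table `BLOCKED_BY`, the labels, and the function `apply_blocking` are hypothetical and far from a complete rule set\.

```python
# Illustrative subset only: heir -> relatives whose presence excludes that heir.
BLOCKED_BY = {
    "paternal_grandfather": {"father"},
    "full_brother": {"son", "father"},
    "sons_son": {"son"},
    "maternal_grandmother": {"mother"},
}


def apply_blocking(relatives):
    """Split the surviving relatives into (inheriting, blocked) sets."""
    present = set(relatives)
    blocked = {
        heir for heir in present
        if BLOCKED_BY.get(heir, set()) & present  # any blocker is present
    }
    return present - blocked, blocked


inheriting, blocked = apply_blocking(
    ["wife", "father", "paternal_grandfather", "full_brother"]
)
print(sorted(inheriting))  # ['father', 'wife']
print(sorted(blocked))     # ['full_brother', 'paternal_grandfather']
```

Comparing a model's blocked set against such a table makes both failure directions described above visible: a retained heir who should have been blocked, and a valid heir who was wrongly removed\.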

##### Share assignment errors\.

Even when the correct heirs were identified, some models assigned the wrong legal shares\. These errors often reflected confusion between similar family configurations or incomplete understanding of conditional inheritance rules\.

##### Arithmetic inconsistency\.

Some outputs contained legal\-sounding reasoning but inconsistent calculations\. In such cases, the fractions did not sum correctly, the normalization was missing, or the final per\-heir allocation contradicted the earlier explanation\.
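A simple consistency check catches the first two symptoms: unreadable shares and fractions that do not sum to the whole estate\. The sketch below is an illustration in Python with a hypothetical function `check_distribution`; it assumes the final answer gives each heir's share as a fraction string such as `"1/8"`\.

```python
from fractions import Fraction


def check_distribution(final_shares):
    """Return a list of arithmetic problems found in a final distribution."""
    problems = []
    parsed = {}
    for heir, text in final_shares.items():
        try:
            parsed[heir] = Fraction(text)
        except (TypeError, ValueError, ZeroDivisionError):
            problems.append(f"{heir}: unreadable share {text!r}")
    for heir, share in parsed.items():
        if share <= 0:
            problems.append(f"{heir}: non-positive share {share}")
    # Only test the total when every share could be read.
    if len(parsed) == len(final_shares):
        total = sum(parsed.values(), Fraction(0))
        if total != 1:
            problems.append(f"shares sum to {total}, not 1")
    return problems


print(check_distribution({"wife": "1/8", "son": "7/8"}))
# []
print(check_distribution({"husband": "1/2", "full_sisters": "2/3"}))
# ['shares sum to 7/6, not 1']
```

The second call shows a distribution in which the prescribed shares were assigned but the required normalization was never applied\.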

##### Format instability\.

A smaller but still relevant issue was output instability\. Some models ignored the requested format, mixed reasoning and answer sections unpredictably, or returned partial solutions\.

This also explains why good instruction\-following is not enough\. A model can follow the prompt and produce a clear and convincing explanation, but still make mistakes in the legal structure of the solution\.

Another possible reason for the low performance of Arabic\-oriented models is that they are not trained on this type of structured reasoning\. Inheritance problems require specific logical steps and rule\-based decisions, which may not be well represented in their training data\. In this context, the QIAS 2026 dataset can be very useful\. It provides examples of Islamic inheritance reasoning that could help models learn this type of task and improve their performance\.

## 5\. Conclusion

This paper presented the participation of team PSL in the QIAS 2026 Shared Task on Arabic Islamic inheritance reasoning\. Our submission focused on a controlled comparison between commercial and open\-source language models under a shared prompting framework\.

The experiments showed that commercial models remain clearly stronger on this task, especially when the case requires several dependent legal decisions and precise numerical consistency\. Open\-source models show promise, but they still suffer from frequent structural errors that propagate across the reasoning chain\. More broadly, our findings confirm that Islamic inheritance is a demanding testbed for evaluating legal reasoning in Arabic\.

In future work, we plan to explore stronger output constraints, step\-level verification, and domain\-adapted training strategies to improve consistency on this class of tasks\. We also believe that combining LLMs with explicit reasoning checks may be a promising direction for legal and rule\-based NLP more generally\.

## Acknowledgments

We would like to thank the organizers of QIAS 2026 for their efforts in designing and running the shared task\. In particular, we are grateful to Dr\. Abdessalam Bouchekif for his guidance and support\.

## References

- AL\-Smadi \(2025\) QU\-NLP at QIAS 2025 shared task: a two\-phase LLM fine\-tuning and retrieval\-augmented generation approach for Islamic inheritance reasoning\. In Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks, K\. Darwish, A\. Ali, I\. Abu Farha, S\. Touileb, I\. Zitouni, A\. Abdelali, S\. Al\-Ghamdi, S\. Alkhereyf, W\. Zaghouani, S\. Khalifa, B\. AlKhamissi, R\. Almatham, I\. Hamed, Z\. Alyafeai, A\. Alowisheq, G\. Inoue, K\. Mrini, and W\. Alshammari \(Eds\.\), Suzhou, China, pp\. 892–898\. External Links: [Link](https://aclanthology.org/2025.arabicnlp-sharedtasks.123/), [Document](https://dx.doi.org/10.18653/v1/2025.arabicnlp-sharedtasks.123), ISBN 979\-8\-89176\-356\-2\. Cited by: [§1](https://arxiv.org/html/2606.13751#S1.p4.1)\.
- N\. AlDahoul and Y\. Zaki \(2025\) NYUAD at QIAS shared task: benchmarking the legal reasoning of LLMs in Arabic Islamic inheritance cases\. In Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks, K\. Darwish, A\. Ali, I\. Abu Farha, S\. Touileb, I\. Zitouni, A\. Abdelali, S\. Al\-Ghamdi, S\. Alkhereyf, W\. Zaghouani, S\. Khalifa, B\. AlKhamissi, R\. Almatham, I\. Hamed, Z\. Alyafeai, A\. Alowisheq, G\. Inoue, K\. Mrini, and W\. Alshammari \(Eds\.\), Suzhou, China, pp\. 861–866\. External Links: [Link](https://aclanthology.org/2025.arabicnlp-sharedtasks.118/), [Document](https://dx.doi.org/10.18653/v1/2025.arabicnlp-sharedtasks.118), ISBN 979\-8\-89176\-356\-2\. Cited by: [§1](https://arxiv.org/html/2606.13751#S1.p4.1)\.
- O\. Alkhamis \(2026\) KMS at QIAS 2026: comparing LLM reasoning and hybrid symbolic solving for Islamic inheritance division\. In Proceedings of the 7th Workshop on Open\-Source Arabic Corpora and Processing Tools \(OSACT7\), co\-located with LREC\-COLING 2026, Palma de Mallorca, Spain\. Cited by: [Table 2](https://arxiv.org/html/2606.13751#S4.T2.1.4.3.2)\.
- M\. Almansour \(2026\) Simplicity at QIAS 2026: decoupling language extraction from mathematical logic in Islamic inheritance law\. In Proceedings of the 7th Workshop on Open\-Source Arabic Corpora and Processing Tools \(OSACT7\), co\-located with LREC\-COLING 2026, Palma de Mallorca, Spain\. Cited by: [Table 2](https://arxiv.org/html/2606.13751#S4.T2.1.3.2.2)\.
- M\. Alsmadi \(2026\) QU\-NLP at QIAS 2026: multi\-stage QLoRA fine\-tuning for Arabic Islamic inheritance reasoning\. In Proceedings of the 7th Workshop on Open\-Source Arabic Corpora and Processing Tools \(OSACT7\), co\-located with LREC\-COLING 2026, Palma de Mallorca, Spain\. Cited by: [Table 2](https://arxiv.org/html/2606.13751#S4.T2.1.5.4.2)\.
- A\. Bouchekif, S\. Gaben, S\. Rashwani, S\. Eltanbouly, M\. Al\-Khatib, H\. Sbahi, M\. Ghaly, and E\. Mohamed \(2026\) MAWARITH: a dataset and benchmark for legal inheritance reasoning with LLMs\. arXiv preprint arXiv:2603\.07539\. External Links: [Link](https://arxiv.org/pdf/2603.07539), 2603\.07539\. Cited by: [§1](https://arxiv.org/html/2606.13751#S1.p2.1)\.
- A\. Bouchekif, S\. Rashwani, E\. S\. A\. Mohamed, M\. Alkhatib, H\. Sbahi, S\. Gaben, W\. Zaghouani, A\. Erbad, and M\. Ghaly \(2025a\) QIAS 2025: overview of the shared task on Islamic inheritance reasoning and knowledge assessment\. In Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks, pp\. 851–860\. External Links: [Link](https://aclanthology.org/2025.arabicnlp-sharedtasks.117/)\. Cited by: [§1](https://arxiv.org/html/2606.13751#S1.p4.1)\.
- A\. Bouchekif, S\. Rashwani, H\. Sbahi, S\. Gaben, M\. Al Khatib, and M\. Ghaly \(2025b\) Assessing large language models on Islamic legal reasoning: evidence from inheritance law evaluation\. In Proceedings of The Third Arabic Natural Language Processing Conference, pp\. 246–257\. External Links: [Link](https://aclanthology.org/2025.arabicnlp-main.20/)\. Cited by: [§1](https://arxiv.org/html/2606.13751#S1.p4.1)\.
- S\. Eddine Bekhouche, A\. Zakaria Sellam, T\. Hichem, C\. Distante, and A\. Hadid \(2025\) CVPD at QIAS 2025 shared task: an efficient encoder\-based approach for Islamic inheritance reasoning\. In Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks, K\. Darwish, A\. Ali, I\. Abu Farha, S\. Touileb, I\. Zitouni, A\. Abdelali, S\. Al\-Ghamdi, S\. Alkhereyf, W\. Zaghouani, S\. Khalifa, B\. AlKhamissi, R\. Almatham, I\. Hamed, Z\. Alyafeai, A\. Alowisheq, G\. Inoue, K\. Mrini, and W\. Alshammari \(Eds\.\), Suzhou, China, pp\. 929–934\. External Links: [Link](https://aclanthology.org/2025.arabicnlp-sharedtasks.128/), [Document](https://dx.doi.org/10.18653/v1/2025.arabicnlp-sharedtasks.128), ISBN 979\-8\-89176\-356\-2\. Cited by: [§1](https://arxiv.org/html/2606.13751#S1.p4.1)\.
- E\. Elrefai, M\. Lotfy Elrefai, and A\. Hassan Esmail \(2025\) Gumball at QIAS 2025: Arabic LLM automated reasoning in Islamic inheritance\. In Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks, K\. Darwish, A\. Ali, I\. Abu Farha, S\. Touileb, I\. Zitouni, A\. Abdelali, S\. Al\-Ghamdi, S\. Alkhereyf, W\. Zaghouani, S\. Khalifa, B\. AlKhamissi, R\. Almatham, I\. Hamed, Z\. Alyafeai, A\. Alowisheq, G\. Inoue, K\. Mrini, and W\. Alshammari \(Eds\.\), Suzhou, China, pp\. 953–959\. External Links: [Link](https://aclanthology.org/2025.arabicnlp-sharedtasks.132/), [Document](https://dx.doi.org/10.18653/v1/2025.arabicnlp-sharedtasks.132), ISBN 979\-8\-89176\-356\-2\. Cited by: [§1](https://arxiv.org/html/2606.13751#S1.p4.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi, et al\. \(2025\) DeepSeek\-R1: incentivizing reasoning capability in LLMs via reinforcement learning\. arXiv preprint arXiv:2501\.12948\. Cited by: [§1](https://arxiv.org/html/2606.13751#S1.p3.1)\.
- A\. Jaech, A\. Kalai, A\. Lerer, A\. Richardson, A\. El\-Kishky, A\. Low, A\. Helyar, A\. Madry, A\. Beutel, A\. Carney, et al\. \(2024\) OpenAI o1 system card\. arXiv preprint arXiv:2412\.16720\. Cited by: [§1](https://arxiv.org/html/2606.13751#S1.p3.1)\.
- G\. Kurdi, H\. Justanieah, and H\. Justanieah \(2026\) Silah at QIAS 2026: fine\-tuning vs\. retrieval\-augmented generation for Islamic inheritance reasoning\. In Proceedings of the 7th Workshop on Open\-Source Arabic Corpora and Processing Tools \(OSACT7\), co\-located with LREC\-COLING 2026, Palma de Mallorca, Spain\. Cited by: [Table 2](https://arxiv.org/html/2606.13751#S4.T2.1.8.7.2)\.
- H\. G\. Sidaoui \(2026\) AGS\-KSU at QIAS 2026: a comparative study of prompting and LLM approaches for structured Islamic inheritance reasoning\. In Proceedings of the 7th Workshop on Open\-Source Arabic Corpora and Processing Tools \(OSACT7\), co\-located with LREC\-COLING 2026, Palma de Mallorca, Spain\. Cited by: [Table 2](https://arxiv.org/html/2606.13751#S4.T2.1.7.6.2)\.
- W\. Swaileh, M\. Zighem, H\. Telli, S\. E\. Bekhouche, A\. Z\. Sellam, and F\. Dornaika \(2026\) CVPD at QIAS 2026: RAG\-guided LLM reasoning for al\-mawarith share calculation and heir allocation\. In Proceedings of the 7th Workshop on Open\-Source Arabic Corpora and Processing Tools \(OSACT7\), co\-located with LREC\-COLING 2026, Palma de Mallorca, Spain\. Cited by: [Table 2](https://arxiv.org/html/2606.13751#S4.T2.1.2.1.2)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. V\. Le, E\. H\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2022\) Self\-consistency improves chain of thought reasoning in language models\. arXiv preprint arXiv:2203\.11171\. Cited by: [§1](https://arxiv.org/html/2606.13751#S1.p3.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. arXiv preprint arXiv:2201\.11903\. Cited by: [§1](https://arxiv.org/html/2606.13751#S1.p3.1)\.
- N\. Xuan Phuc and T\. Đặng Văn \(2025\) PuxAI at QIAS 2025: multi\-agent retrieval\-augmented generation for Islamic inheritance and knowledge reasoning\. In Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks, K\. Darwish, A\. Ali, I\. Abu Farha, S\. Touileb, I\. Zitouni, A\. Abdelali, S\. Al\-Ghamdi, S\. Alkhereyf, W\. Zaghouani, S\. Khalifa, B\. AlKhamissi, R\. Almatham, I\. Hamed, Z\. Alyafeai, A\. Alowisheq, G\. Inoue, K\. Mrini, and W\. Alshammari \(Eds\.\), Suzhou, China, pp\. 905–913\. External Links: [Link](https://aclanthology.org/2025.arabicnlp-sharedtasks.125/), [Document](https://dx.doi.org/10.18653/v1/2025.arabicnlp-sharedtasks.125), ISBN 979\-8\-89176\-356\-2\. Cited by: [§1](https://arxiv.org/html/2606.13751#S1.p4.1)\.
