A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition

Summary

This paper presents a systematic empirical study of fine-tuning pretrained Transformer models (Wav2Vec2.0, HuBERT, XLS-R) for Quranic Automatic Speech Recognition (ASR), achieving a WER of 0.08 on the EveryAyah subset and reducing training time from 140 to 40 hours, with Wav2Vec2-XLSR-53 providing the best representation.

arXiv:2606.19747v1 Announce Type: new

# A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition
Source: [https://arxiv.org/html/2606.19747](https://arxiv.org/html/2606.19747)
Affiliations: \[1\] Greentech Apps Foundation, United Kingdom; \[2\] Queen Mary University of London, United Kingdom; \[3\] University of Malaya, Malaysia

###### Abstract

Quran Automatic Speech Recognition \(ASR\) aims to convert Quranic recitation into text, enabling applications such as aided memorisation tools and Quranic search engines\. However, existing ASR models often exhibit high Word Error Rates \(WER\) on user\-recited verses and lack full coverage of the Quranic corpus\. This paper presents a systematic empirical study of domain\-specific fine\-tuning of pretrained Transformer\-based models for Quranic ASR, using advanced speech feature extraction methods: Wav2Vec2\.0, HuBERT, and XLS\-R\. These models apply self\-supervised learning by masking portions of input audio and using Transformer architectures to learn context\-aware speech features\. The pretrained models are fine\-tuned on a filtered Quranic dataset exceeding 870 hours of professional and user recitations\. Through comprehensive ablation studies across feature extractors, output label formats, training strategies, and clip durations, we identify the key factors that affect transcription accuracy in this domain\. Our best\-performing configuration achieves a WER of 0\.08 on the EveryAyah subset and 0\.11 on the combined EveryAyah\+Tarteel setting, representing roughly a five\-percentage\-point gain over the Citrinet baseline \(WER = 0\.163\) while reducing combined\-model training time from 140 hours to 40 hours\. Arabic text without diacritics yields the best fine\-tuning results, and Wav2Vec2\-XLSR\-53 provides the strongest overall representation\. Future work includes improving dataset quality and developing phoneme\-aware models to extract deeper speech feature representations for Tajweed\-sensitive applications\.

###### keywords:

Quran Automatic Speech Recognition \(ASR\), End\-to\-End Deep Learning, Transformer Models, Speech Representation Learning

## 1 Introduction

### 1\.1 Background and Context

The Arabic language is one of the oldest and most distinguished languages in the world, noted for its originality and adaptability\[[4](https://arxiv.org/html/2606.19747#bib.bib4)\]\. With approximately 290 million native speakers and 132 million non\-native speakers, it is the most widely spoken Semitic language\[[49](https://arxiv.org/html/2606.19747#bib.bib49)\]\. It is also one of the six official languages of the United Nations \(UN\)\[[33](https://arxiv.org/html/2606.19747#bib.bib33)\]\. While Modern Standard Arabic \(MSA\) is used in contemporary communication, Classical Arabic \(CA\) remains central due to its use in the Qur’an\[[46](https://arxiv.org/html/2606.19747#bib.bib46)\]\. The Qur’an is a foundational text in Islam, forming one of the pillars of the Islamic faith and containing divine parables, commands, and teachings\[[22](https://arxiv.org/html/2606.19747#bib.bib22)\]\.

Recitation and memorization of the Qur’an occupy a central role in Islamic education\[[51](https://arxiv.org/html/2606.19747#bib.bib51)\]\. It is vital to preserve the Qur’an in its audio form, recited word\-by\-word as it was revealed to the Prophet Muhammad \(peace be upon him\), as Muslims believe it to be divine\. The rules of recitation, known as Tajweed, are critical to ensure accurate pronunciation and proper delivery\[[41](https://arxiv.org/html/2606.19747#bib.bib41)\]\. The term Tajweed derives from the Arabic root “Jawwada,” meaning to improve or enhance speech accuracy\[[7](https://arxiv.org/html/2606.19747#bib.bib7)\]\. However, many Muslims do not speak Arabic as their first language and often struggle to correctly recite verses due to a lack of regular practice and understanding of linguistic nuances\[[19](https://arxiv.org/html/2606.19747#bib.bib19)\]\.

Traditionally, Qur’anic recitation is learned under the supervision of a teacher who listens and provides feedback\. However, this method requires significant time investment and face\-to\-face interaction\. Challenges such as environmental distractions, teacher availability, and high student\-to\-teacher ratios can hinder effective learning\[[32](https://arxiv.org/html/2606.19747#bib.bib32)\]\. Therefore, there is a growing need for automated systems that assist learners in recitation and memorization without constant supervision\[[13](https://arxiv.org/html/2606.19747#bib.bib13)\]\.

Automatic Speech Recognition \(ASR\) offers a solution by converting spoken language into text, enabling machines to interpret human speech\. ASR has been successfully applied in fields such as education, healthcare, robotics, telecommunications, and customer service\[[8](https://arxiv.org/html/2606.19747#bib.bib8)\]\. In the context of Qur’anic recitation, ASR can support self\-learning by detecting pronunciation errors and identifying missing words, thus reducing reliance on human teachers\[[30](https://arxiv.org/html/2606.19747#bib.bib30)\]\.

Despite these advancements, Arabic ASR systems face challenges due to limited labelled data, diverse dialects, and the lack of diacritics in textual data\[[11](https://arxiv.org/html/2606.19747#bib.bib11)\]\. Recent Arabic ASR benchmarks such as MGB\-2, MGB\-3, and MGB\-5 have made progress in this domain\[[9](https://arxiv.org/html/2606.19747#bib.bib9),[10](https://arxiv.org/html/2606.19747#bib.bib10)\]\. For instance, the MGB\-2 dataset contains 1,200 hours of broadcast television content, while MGB\-3 and MGB\-5 provide extended corpora\. These benchmarks have achieved Word Error Rates \(WER\) of 12\.5%, 27\.5%, and 33\.8% respectively\[[28](https://arxiv.org/html/2606.19747#bib.bib28)\]\. WER, defined as the percentage of incorrectly transcribed words relative to the total, is a standard metric used to evaluate ASR performance\[[30](https://arxiv.org/html/2606.19747#bib.bib30)\]\.

However, automatic recognition of Qur’anic recitation presents distinct challenges that remain inadequately addressed in current literature\[[11](https://arxiv.org/html/2606.19747#bib.bib11)\]\. Several critical research gaps limit the development of robust solutions in this domain\. First, existing approaches demonstrate limited coverage of Tajweed rules and restricted training on narrow subsets of Qur’anic verses\[[2](https://arxiv.org/html/2606.19747#bib.bib2),[1](https://arxiv.org/html/2606.19747#bib.bib1)\], failing to capture the comprehensive phonetic variations inherent in Classical Arabic recitation patterns\. Second, while foundational datasets such as Tarteel\[[52](https://arxiv.org/html/2606.19747#bib.bib52)\] and fine\-tuned architectures including Citrinet models\[[38](https://arxiv.org/html/2606.19747#bib.bib38)\] have established baseline performance, the potential of advanced frameworks including Conformer\[[25](https://arxiv.org/html/2606.19747#bib.bib25),[37](https://arxiv.org/html/2606.19747#bib.bib37)\], unsupervised speech recognition techniques\[[16](https://arxiv.org/html/2606.19747#bib.bib16)\], and DeepSpeech architectures\[[5](https://arxiv.org/html/2606.19747#bib.bib5)\] remains largely unexploited for this specific application domain\. Third, current evaluation methodologies inadequately account for the unique acoustic characteristics of Tajweed compliance and the distinction between phonetic accuracy and religious correctness in recitation assessment\. Finally, existing research lacks real\-time processing systems for error detection, memorization support, and speech\-to\-search functionalities that serve diverse educational requirements across the global Muslim community\[[11](https://arxiv.org/html/2606.19747#bib.bib11)\]\.

### 1\.2 Proposed Work

This research addresses the identified gaps through systematic innovation in Qur’anic speech recognition\. The study is guided by three primary objectives:

1. To identify critical acoustic and linguistic parameters affecting accurate Qur’anic recitation transcription
2. To develop an advanced Transformer\-based model that significantly reduces WER through optimized deep learning methodologies
3. To evaluate and validate the proposed model’s performance against existing baseline approaches

The main contribution of this research is to develop a performant speech recognition model that achieves low WER and low CER while improving transcription accuracy\. The key novelties of this work include:

1. Domain\-Specific Adaptation of Transformer\-based Models for Qur’anic Speech Recognition: Systematic fine\-tuning and optimization of pretrained end\-to\-end Transformer architectures for the domain of Qur’anic recitation, constituting a thorough empirical study rather than a new architectural contribution\.
2. Feature Extraction Comparison: Systematic evaluation of input features using MFCC, Wav2Vec2, HuBERT, and XLS\-R speech representations to identify optimal feature extraction approaches for Qur’anic speech recognition\.
3. Multi\-Format Output Label Analysis: Comparative analysis of four distinct output label formats \(Arabic text, Arabic with Tashkeel, English transliteration, Buckwalter transliteration\) to determine the most effective representation for minimizing WER in Qur’anic transcription tasks\.
4. Training Strategy Evaluation: Investigation of multiple training approaches including scratch training versus fine\-tuning methodologies, dataset composition effects \(professional versus layman users\), and clip duration impact on model performance, providing systematic insights into optimal training configurations\.

Figure [1](https://arxiv.org/html/2606.19747#S1.F1) illustrates the proposed end\-to\-end model architecture, which utilizes Wav2Vec2/HuBERT/XLS\-R with a frozen CNN encoder and a fine\-tuned Transformer decoder using CTC loss for transcription\.

![Refer to caption](https://arxiv.org/html/2606.19747v1/model_for_training.jpeg)

Figure 1: End\-to\-end Model Architecture: Wav2Vec2/HuBERT/XLS\-R with frozen CNN encoder and fine\-tuned Transformer decoder using CTC loss for transcription

The research methodology is organized into three phases that address the identified research gaps:

1. Data Collection and Preprocessing: This phase involves the collection of audio data, extraction of audio features, and pre\-training using Transformer\-based architectures\.
2. Model Training and Fine\-Tuning: In this phase, the model is trained and fine\-tuned using various hyperparameter configurations and pretrained models to optimize performance\.
3. Evaluation and Validation: The final phase includes evaluating and validating the model against benchmark systems\. Performance is assessed using standard metrics such as Word Error Rate \(WER\) and Character Error Rate \(CER\)\.

The proposed research has significant potential both to benefit existing students and to introduce new readers and memorizers to Qur’anic study\. The developed model enables voice search functionality, allowing users to recite verses and locate their positions within the Qur’an\. It supports memorization enhancement by enabling device\-based recitation testing and error identification\. Additionally, the system can generate subtitles for different surahs, providing word\-by\-word or ayah\-by\-ayah guidance for improved learning and understanding\. This research facilitates the development of advanced user assessment techniques and human learning enhancement capabilities, including the creation of comprehensive error compendia for hafiz students and exploration of age\-related recitation error patterns through data analytics, ultimately transforming Qur’anic study and understanding\.

## 2 Background

Automatic Speech Recognition \(ASR\) has seen significant advancements in recent years, particularly through the application of deep learning and end\-to\-end architectures\[[43](https://arxiv.org/html/2606.19747#bib.bib43)\]\. However, Arabic and Quranic ASR remain uniquely challenging due to linguistic complexity, dialectal diversity, and Tajweed rules\[[30](https://arxiv.org/html/2606.19747#bib.bib30)\]\. This section reviews the literature according to the three primary research objectives of this study\.

### 2\.1 Acoustic and Linguistic Parameters in Arabic and Qur’anic Speech Recognition

Arabic presents several distinct challenges for ASR systems: its consonantal nature, dialectal diversity, complex morphology, and similar phoneme articulation\. Classical Arabic used in the Quran intensifies these challenges due to Tajweed rules that must be followed to ensure accurate pronunciation\[[41](https://arxiv.org/html/2606.19747#bib.bib41),[7](https://arxiv.org/html/2606.19747#bib.bib7)\]\. Additionally, the variability in pronunciation among professional and layman reciters introduces noise that affects model accuracy\[[19](https://arxiv.org/html/2606.19747#bib.bib19)\]\.

Several studies have proposed systems for detecting Tajweed errors\[[31](https://arxiv.org/html/2606.19747#bib.bib31),[50](https://arxiv.org/html/2606.19747#bib.bib50),[39](https://arxiv.org/html/2606.19747#bib.bib39),[42](https://arxiv.org/html/2606.19747#bib.bib42),[6](https://arxiv.org/html/2606.19747#bib.bib6),[55](https://arxiv.org/html/2606.19747#bib.bib55)\]\. Techniques range from MFCC\-VQ to DCNN and BLSTM\. However, these are often limited to specific rules or small datasets, highlighting the need for comprehensive acoustic feature identification that captures the full spectrum of Tajweed phonetic variations\.

### 2\.2 Advanced Deep Learning Architectures for Speech Recognition

Initial automatic speech recognition \(ASR\) systems were based on Gaussian Mixture Model \- Hidden Markov Model \(GMM\-HMM\) architectures\. The hybrid Hidden Markov Model \- Deep Neural Network \(HMM\-DNN\) model, as proposed by Dahl et al\.\[[18](https://arxiv.org/html/2606.19747#bib.bib18)\], replaced Gaussian Mixture Models \(GMMs\) with Deep Neural Networks \(DNNs\), resulting in significant performance improvements\. Subsequent advancements introduced Time Delay Neural Networks \(TDNN\), Bidirectional Long Short\-Term Memory \(BLSTM\) networks, and optimization techniques such as Lattice\-Free Maximum Mutual Information \(LF\-MMI\)\[[45](https://arxiv.org/html/2606.19747#bib.bib45),[34](https://arxiv.org/html/2606.19747#bib.bib34),[48](https://arxiv.org/html/2606.19747#bib.bib48)\]\. However, these modular systems tend to be complex, computationally intensive, and less suitable for deployment on mobile and resource\-constrained devices\.

End\-to\-end \(E2E\) models simplify the overall architecture by directly mapping raw audio input to corresponding textual output, thereby improving both recognition performance and training efficiency\. The main types of end\-to\-end models include: \(1\) Connectionist Temporal Classification \(CTC\)\[[24](https://arxiv.org/html/2606.19747#bib.bib24)\], which aligns input and output sequences without requiring pre\-segmented data; \(2\) attention\-based Sequence\-to\-Sequence \(Seq2Seq\) models\[[54](https://arxiv.org/html/2606.19747#bib.bib54)\], which utilize attention mechanisms to dynamically focus on relevant parts of the input sequence during decoding; and \(3\) Recurrent Neural Network Transducer \(RNN\-T\) models\[[23](https://arxiv.org/html/2606.19747#bib.bib23)\], which combine acoustic modeling and language modeling into a single unified framework\. Additionally, multitask learning architectures that integrate both CTC and attention mechanisms\[[53](https://arxiv.org/html/2606.19747#bib.bib53)\] have been shown to further enhance model accuracy by leveraging complementary learning signals from both approaches\.

Recent innovations in self\-supervised speech representation learning include Wav2Vec \(Waveform to Vector\)\[[47](https://arxiv.org/html/2606.19747#bib.bib47)\], Wav2Vec 2\.0\[[15](https://arxiv.org/html/2606.19747#bib.bib15)\], Cross\-Lingual Speech Representations \(XLS\-R\)\[[14](https://arxiv.org/html/2606.19747#bib.bib14)\], and Hidden\-Unit BERT \(HuBERT\)\[[27](https://arxiv.org/html/2606.19747#bib.bib27)\]\. These methods use large volumes of unlabelled audio data to learn generalizable speech features, thereby reducing WER even when only limited labelled data is available\.

### 2\.3 Evaluation Frameworks and Existing Qur’anic ASR Systems

Performance is commonly evaluated using WER and CER, derived from Levenshtein distance\[[36](https://arxiv.org/html/2606.19747#bib.bib36)\]\. These metrics quantify substitution, insertion, and deletion errors, helping to compare ASR systems effectively\.

Quranic ASR research has progressed from traditional GMM\-HMM systems\[[29](https://arxiv.org/html/2606.19747#bib.bib29),[44](https://arxiv.org/html/2606.19747#bib.bib44),[12](https://arxiv.org/html/2606.19747#bib.bib12)\] to more recent end\-to\-end and Transformer\-based models\[[1](https://arxiv.org/html/2606.19747#bib.bib1),[28](https://arxiv.org/html/2606.19747#bib.bib28)\]\. However, many models focus on a limited set of verses or reciters\. Tarteel\.io employed Citrinet\-based models using user\-generated data\[[52](https://arxiv.org/html/2606.19747#bib.bib52)\]\. Prior studies often lack support for layman reciters and fail to generalize across the full Quran\[[11](https://arxiv.org/html/2606.19747#bib.bib11)\]\. A detailed summary of recent Quranic ASR systems, including their methods, data sources, and research gaps, is presented in Table 1\.

Table 1: Summary of Recent Quran Speech Recognition Research

| Author | Technique \(Extraction \+ Model\) | Data | Key Insights \(Tajweed \+ Highlights\) | Research Gap |
| --- | --- | --- | --- | --- |
| Ibrahim et al\., 2013 | MFCC, GMM \+ HMM \+ CMU Sphinx4 | Surah 1 | None; Accuracy 86\.41% | Limited Data |
| Ismail et al\., 2014 | MFCC\-VQ, GMM \+ HMM | Surah 114 | Qalqalah; Faster inference, lower accuracy | Limited Data |
| Ould et al\., 2014 | MFCC, GMM \+ HMM | 8 Hours, Juz Amma | None; Accuracy 92% | Limited Data |
| Tabbaa & Soudan, 2015 | MFCC, GMM \+ HMM \+ SVM | 7\.5 hrs, Pro/Layman speakers | ’R’ phoneme; Accuracy 91\.2% | Focused on one rule |
| El Amrani et al\., 2016 | MFCC, GMM \+ HMM | 8 hrs, 20 reciters, Surah 1, 112\-114 | None; 50% WER | Missing phonemes |
| Yuwan & Lestari, 2016 | MFCC, GMM \+ HMM | 180 phonetically rich verses | None; Qscript, WER 22\.42% | WER improvable |
| Maqsood et al\., 2016 | MFCC, GMM \+ HMM \+ SVM | Non\-native speakers | 5 complex letters; Accuracy 97\.5% | Only 5 phonemes |
| AbdulQader Al\-Bakeri, 2017 | MFCC, GMM \+ HMM | Surah 55, 114 | None; WER 89\.47% | Limited Data |
| Belkasmi et al\., 2017 | MFCC, GMM \+ HMM \+ Phoneme duration | 8 hrs, 10 reciters | Phoneme duration; Inconclusive classification | Rule mismatch |
| Ridwan & Lestari, 2018 | MFCC, GMM \+ HMM | Full Quran | None; WER 18\.53% | No Tajweed check |
| Mohammed et al\., 2018 | MFCC, GMM \+ HMM | 10 reciters, 600 words | Long Maad, Ghunnah; Duration\-based detection | Rule mismatch |
| Ahmad et al\., 2018 | MFCC, ANN | 2 reciters, 300 words | Noon Sakinah, Tanween; 77\.7% Accuracy | Limited rules |
| Al\-Marri et al\., 2018 | MFCC, DNN \+ HMM | 83 hrs, 100 speakers | Mispronounced letters; 90% Accuracy, DNN better than GMM | Letter level only |
| Al\-Ayyoub et al\., 2018 | MFCC, WPD, HMM\-SPL, DNN | 3071 audios, 5M \+ 5F | 8 Tajweed rules; Accuracy 97\.7% | No recitation feedback |
| Nazir et al\., 2019 | MFCC, SVM, KNN, NN | 400 non\-native speakers | All letters; NN accuracy 90\.1% | Limited Data |
| Thirafi & Lestari, 2019 | MFCC, HMM\-BLSTM \(Kaldi\) | 10 reciters, 180 verses | None; Avg WER 4\.63% | No Tajweed check |
| Tarteel\.io, 2020 | MFCC, End\-to\-End Citrinet | 80\+ hrs user/pro recitations | None; Mobile optimised, fast | WER improvable |
| Alagrami & Eljazzar, 2020 | MFCC, DNN \+ SVM | 657 recordings | Idhgham, Ikhfa, Lam; Accuracy 99% | Word level only |
| Ziafat et al\., 2021 | MFCC, BLSTM, DCNN, AlexNet | Arabic letter recordings | None; Accuracy 95\.95% | Letter level only |
| Al\-Issa et al\., 2023 | MFCC, DeepSpeech \(RNN\-ASR\) | 257,705 male, 5,744 female audio files | No Tajweed modeled; Best WER 0\.406 \(male\) | Gender imbalance |
| Al Harere & Al Jallad, 2023 | CNN\-BiGRU, CTC | Ar\-DAD \(37 Surahs, 30 reciters\) | None; WER 8\.34%, CER 2\.42% | No full Quran coverage |
| Hadwan et al\., 2023 | Mel filterbank, E2E Transformer \+ RNN/LSTM\-LM | 10 hrs, 60 reciters, 16 Surahs | Diacritized Arabic; WER 6\.16%, CER 1\.98% \(with LM\) | Limited to 16 short Surahs; external LM required |

Despite these advancements, existing literature reveals persistent limitations: inadequate coverage of Tajweed rules\[[31](https://arxiv.org/html/2606.19747#bib.bib31),[50](https://arxiv.org/html/2606.19747#bib.bib50)\], restricted training on limited verses or specific reciters\[[1](https://arxiv.org/html/2606.19747#bib.bib1),[52](https://arxiv.org/html/2606.19747#bib.bib52)\], and insufficient generalization across diverse recitation styles from professional to layman users\[[11](https://arxiv.org/html/2606.19747#bib.bib11)\]\. To address these gaps, this research leverages emerging Transformer\-based architectures and self\-supervised learning methods \(Wav2Vec2, XLS\-R, HuBERT\) fine\-tuned on Quranic data, addressing these limitations and enabling robust, scalable ASR solutions\.

## 3 Methods

![Refer to caption](https://arxiv.org/html/2606.19747v1/methodology_flowchart.jpeg)

Figure 2: Methodology Flowchart

Figure [2](https://arxiv.org/html/2606.19747#S3.F2) summarizes the overall workflow proposed in this research\. The methodology is structured into three integrated phases addressing data collection and preprocessing, model development and training, and comprehensive evaluation and validation\.

### 3\.1 Data Collection and Preprocessing

The raw dataset comprises 1,310 hours of professional recitations from EveryAyah\.com\[[21](https://arxiv.org/html/2606.19747#bib.bib21)\] and 62 hours of user\-recorded data from Tarteel\.io\[[52](https://arxiv.org/html/2606.19747#bib.bib52)\]\.

EveryAyah Dataset\. This corpus contains recordings from 44 renowned professional male reciters, spanning all 114 Surahs of the Qur’an\. Reciters represent internationally recognised recitation schools and were recorded under professional studio conditions with high\-quality microphones and minimal background noise\. Clip durations range from 1 to 20 seconds with a mean and median of approximately 6 seconds\. Transcription annotations are aligned at the verse level using the standard Uthmani Quranic text and were manually verified, providing high annotation quality with full diacritics\.

Tarteel Dataset\. This corpus consists of user\-generated recordings submitted via a mobile application, covering all 114 Surahs\. Among the 3,604 recordings with known demographic metadata \(out of 20,510 total\), approximately 78% were submitted by male and 22% by female reciters, spanning a wide range of ages \(13 to 56\+\) and nationalities including Pakistan, Egypt, USA, India, Algeria, Palestine, Bangladesh, and Syria\. This geographic and demographic diversity introduces variability in accent, dialect, and recitation proficiency that is representative of real\-world usage\. Annotation quality is variable: some recordings exhibit background noise, mispronunciation, or minor labelling inconsistencies inherent to crowdsourced data\.

These datasets were selected over custom data collection or alternative corpora because they provide comprehensive Qur’an coverage with consistent quality standards, offer authentic user data difficult to replicate in controlled environments, address the research objective of handling both expert and novice reciters, and reduce preprocessing complexity while maintaining semantic coherence\.

Audio files were standardized to 16kHz WAV format and filtered to retain clips between 1\-30 seconds to manage GPU memory constraints during training\. The filtering process reduced the professional dataset from 268,556 clips \(1,310 hours\) to 230,811 clips \(819 hours\), and the user dataset from 20,513 clips \(62 hours\) to 19,771 clips \(54 hours\)\. The frequently cited “over 870 hours” figure therefore refers to the filtered training corpus rather than the raw recordings\. No additional preprocessing such as volume normalization, silence trimming, or noise reduction was applied to preserve natural recitation variations\. Silence segments were intentionally retained following recommendations by\[[16](https://arxiv.org/html/2606.19747#bib.bib16)\], as they carry contextual cues important for Qur’anic recitation patterns\. The combined dataset was partitioned using an 80:20 train\-test split at clip level, stratified to preserve the distribution across chapters and reciters\.
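
The duration filter and the hours tally described above can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: the `Clip` record, the file names, and the toy durations are invented, and treating the 1 to 30 second bounds as inclusive is an assumption.

```python
from dataclasses import dataclass

MIN_SECONDS = 1.0   # lower bound stated in the text
MAX_SECONDS = 30.0  # upper bound stated in the text


@dataclass(frozen=True)
class Clip:
    path: str       # hypothetical file path
    seconds: float  # clip duration in seconds


def filter_by_duration(clips, lo=MIN_SECONDS, hi=MAX_SECONDS):
    """Keep clips whose duration lies in the closed interval [lo, hi]."""
    return [c for c in clips if lo <= c.seconds <= hi]


def total_hours(clips):
    """Sum clip durations and convert seconds to hours."""
    return sum(c.seconds for c in clips) / 3600.0


# Toy data, invented for illustration (not drawn from the paper's corpus).
clips = [
    Clip("a.wav", 0.4),
    Clip("b.wav", 6.0),
    Clip("c.wav", 29.5),
    Clip("d.wav", 45.0),
]
kept = filter_by_duration(clips)
print([c.path for c in kept])  # b.wav and c.wav survive the filter

# Arithmetic check on the reported figures: professional + user hours after filtering.
filtered_hours = 819 + 54
print(filtered_hours)  # 873, consistent with "over 870 hours"
```

The same tally applied to the clip counts gives 230,811 + 19,771 = 250,582 clips in the filtered corpus.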

### 3\.2 Feature Extraction

Pretrained models, namely Wav2Vec2\.0\[[15](https://arxiv.org/html/2606.19747#bib.bib15)\], HuBERT\[[27](https://arxiv.org/html/2606.19747#bib.bib27)\], and XLS\-R\[[14](https://arxiv.org/html/2606.19747#bib.bib14)\], are used to extract context\-aware speech features without silence removal, following\[[16](https://arxiv.org/html/2606.19747#bib.bib16)\]\. These models are relevant because they provide self\-supervised learned representations that capture phonetic variations important for Arabic speech recognition\. Wav2Vec2\.0 and XLS\-R, trained on multilingual datasets, offer cross\-lingual capabilities useful for Classical Arabic, and all three models show improved performance in limited\-label scenarios typical of Qur’anic datasets\. For baseline comparison, MFCC features are also used\.

The task involves sequence\-to\-sequence mapping where extracted audio features are classified to predict character\-level transcriptions of Qur’anic recitation\. Four distinct output label formats are investigated as part of the research process to identify and compare which output representation yields optimal performance: Arabic without Tashkeel, Arabic with Tashkeel, Buckwalter transliteration, and English transliteration\. This comparative analysis enables systematic evaluation of model effectiveness across varying linguistic complexity levels and character set sizes\. Output labels are prepared in these four formats as shown in Table [2](https://arxiv.org/html/2606.19747#S3.T2)\. Each output label type was mapped at the character level without additional subword tokenization\. SentencePiece tokenization was applied only for the Citrinet baseline model, not the transformer\-based models\.
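
To make the label formats concrete, the sketch below derives an undiacritized form and a Buckwalter form from one diacritized word and builds a character-level vocabulary for each. It is illustrative only: the Buckwalter table is deliberately partial (just the characters in the toy word), and stripping every Unicode combining mark is an assumption that may remove Uthmani-specific signs a real pipeline would want to handle explicitly.

```python
import unicodedata

# Partial Buckwalter table: only the letters and marks needed for the toy word.
BUCKWALTER = {
    "\u0628": "b",  # beh
    "\u0633": "s",  # seen
    "\u0645": "m",  # meem
    "\u064E": "a",  # fatha
    "\u064F": "u",  # damma
    "\u0650": "i",  # kasra
    "\u0652": "o",  # sukun
}


def strip_tashkeel(text):
    """Drop combining marks (Unicode category Mn), which covers Arabic diacritics."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def to_buckwalter(text):
    """Map characters through the partial table; unknown characters pass through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


def build_char_vocab(transcripts):
    """Assign an integer id to each distinct character, sorted for reproducibility."""
    chars = sorted(set("".join(transcripts)))
    return {ch: i for i, ch in enumerate(chars)}


word = "\u0628\u0650\u0633\u0652\u0645\u0650"  # "bismi" written with full diacritics
plain = strip_tashkeel(word)

print(len(word), len(plain))   # 6 3
print(to_buckwalter(word))     # bisomi
print(len(build_char_vocab([word])), len(build_char_vocab([plain])))  # 5 3
```

Even on a single word the diacritized form needs a larger character set than the plain form, which is the character-set-size effect the comparison above is designed to measure.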

Table 2:Different Labeling Formats for Quranic Text, Surah Fatiha, Verse 2 as an Example

### 3\.3 Model Architecture and Training

Two ASR architectures are employed: a Wav2Vec2\.0\-based Transformer model \(Figure [3](https://arxiv.org/html/2606.19747#S3.F3)\) and a Citrinet\-based baseline model\[[38](https://arxiv.org/html/2606.19747#bib.bib38)\]\. The Transformer model consists of a CNN encoder, quantization module, and a 24\-layer Transformer decoder\. The actual model used is facebook/wav2vec2\-large\-xlsr\-53\.

![Refer to caption](https://arxiv.org/html/2606.19747v1/wav2vec_architecture.jpeg)

Figure 3: Wav2Vec2\.0 ASR Model Architecture

The architecture illustrated in Figure [1](https://arxiv.org/html/2606.19747#S1.F1) shows the training pipeline with a frozen CNN encoder, Transformer decoder, and Connectionist Temporal Classification \(CTC\) loss function\[[24](https://arxiv.org/html/2606.19747#bib.bib24)\] for transcription\. CTC loss was selected as it enables alignment\-free training by allowing the model to learn the mapping between variable\-length audio input and output sequences without requiring frame\-level alignment\. Although greedy decoding was used during training and validation, beam search decoding was also tested during evaluation; however, it did not yield substantial improvements in WER\.
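
Greedy CTC decoding, as mentioned above, amounts to three steps: pick the highest-scoring label in each frame, merge consecutive repeats, and drop the blank label. The following sketch shows this on toy scores; the three-symbol vocabulary and the choice of id 0 for the blank are assumptions made for illustration, not details from the paper.

```python
def ctc_greedy_decode(frame_scores, id_to_char, blank_id=0):
    """Decode one utterance from per-frame scores of shape [T][V].

    Takes the arg-max label per frame, merges consecutive repeats,
    then removes the blank label.
    """
    best_path = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != blank_id:
            decoded.append(id_to_char[label])
        previous = label
    return "".join(decoded)


# Toy vocabulary: id 0 plays the role of the CTC blank (assumption of this sketch).
id_to_char = {0: "", 1: "a", 2: "b"}
scores = [
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.70, 0.20],  # a again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.70, 0.10],  # a, kept as a new symbol because a blank intervened
    [0.10, 0.20, 0.70],  # b
]
print(ctc_greedy_decode(scores, id_to_char))  # aab
```

Beam search differs by keeping several candidate label sequences per frame instead of one; the text reports that this did not substantially change WER in this setting.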

The models are fine\-tuned using greedy decoding, learning rate of $3\times 10^{-5}$, dropout of 0\.1, and SpecAugment regularization\. Batch size is set to 8, with gradient accumulation \(steps=3\) and mixed\-precision training to optimize memory use\. The key hyperparameters used during model training are summarized in Table [3](https://arxiv.org/html/2606.19747#S3.T3)\.

Table 3: Model Parameters

| Parameter | Value |
| --- | --- |
| Attention Dropout | 0\.1 |
| Hidden Dropout | 0\.1 |
| Feature Projection Dropout | 0\.0 |
| Mask Time Probability | 0\.05 |
| Layer Dropout | 0\.1 |
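
One way these settings might be expressed in code is sketched below. The numeric values come from the text and Table 3; the key names are assumed to correspond to fields of Hugging Face's `Wav2Vec2Config`, and `ctc_loss_reduction="mean"` is an assumption, since the paper does not state it. The `build_model` helper is hypothetical and is not called here because it needs the `transformers` package and a checkpoint download.

```python
# Values from the text and Table 3; key names assumed to match Wav2Vec2Config fields.
MODEL_KWARGS = {
    "attention_dropout": 0.1,
    "hidden_dropout": 0.1,
    "feat_proj_dropout": 0.0,
    "mask_time_prob": 0.05,
    "layerdrop": 0.1,
}

LEARNING_RATE = 3e-5
PER_DEVICE_BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 3

# Clips contributing to each optimizer update on a single GPU.
EFFECTIVE_BATCH_SIZE = PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS


def build_model(vocab_size, pad_token_id):
    """Load the checkpoint named in the text and freeze its CNN feature encoder.

    Hypothetical helper: the import is local so that the rest of this sketch
    runs without the `transformers` package installed.
    """
    from transformers import Wav2Vec2ForCTC

    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-xlsr-53",
        vocab_size=vocab_size,
        pad_token_id=pad_token_id,
        ctc_loss_reduction="mean",  # assumption: not stated in the text
        **MODEL_KWARGS,
    )
    model.freeze_feature_encoder()  # keep the convolutional front end fixed
    return model


print(EFFECTIVE_BATCH_SIZE)  # 24
```

With a batch size of 8 and 3 accumulation steps, each weight update therefore sees 24 clips while only 8 are held in GPU memory at once.
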

### 3\.4 Evaluation Metrics

Performance is evaluated using Word Error Rate \(WER\) and Character Error Rate \(CER\), which are most suitable for this work due to their complementary strengths in assessing speech recognition accuracy\. WER is one of the most common metrics to determine ASR system accuracy\[[40](https://arxiv.org/html/2606.19747#bib.bib40)\] and is particularly valuable for evaluating speech recognition models because even a single character error in a word substantially increases the error count, providing strict assessment of transcription quality\. Character Error Rate \(CER\) is also evaluated to provide granular assessment of individual character\-level accuracy, which is crucial for Qur’anic text where precise character recognition directly impacts Tajweed compliance and religious accuracy\.

WER:

$$\text{WER}=\frac{S+D+I}{N} \tag{1}$$

CER:

$$\text{CER}=\frac{S+D+I}{C} \tag{2}$$

where $S$, $D$, $I$ are substitutions, deletions, insertions, and $N$, $C$ denote total words or characters, respectively\.
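
Both metrics reduce to a Levenshtein edit distance divided by the reference length, computed over words for WER and over characters for CER. The sketch below is a plain dynamic-programming version under two assumptions: the reference is non-empty, and spaces count as characters for CER. The English sentences are toy inputs chosen so the arithmetic is easy to follow.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, i.e. the minimum S + D + I."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference token
                current[j - 1] + 1,      # insertion of a hypothesis token
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by N reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by C characters."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat down"
hypothesis = "the cat sit"
print(wer(reference, hypothesis))  # 0.5   (1 substitution + 1 deletion over 4 words)
print(cer(reference, hypothesis))  # 0.375 (6 character edits over 16 characters)
```

The example also shows why WER is the stricter metric: a single wrong character in "sat" costs a full word error, while it contributes only one of sixteen characters to CER.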

Training was conducted over approximately 10 to 12 epochs, and model selection was performed based on WER trends on the validation set\. Early stopping was considered when performance gains plateaued, to prevent overfitting and reduce training time\.

### 3\.5 Training Environment

Training was conducted on NVIDIA Tesla P100 \(16GB GPU, 25GB RAM\) with approximately 300GB of data\. Individual Transformer runs typically required 16\-17 hours, while the integrated EveryAyah\+Tarteel training schedule required about 40 hours\. The implementation utilized Google Colaboratory as the primary platform, with Python as the programming language and key libraries including PyTorch for deep learning frameworks, Huggingface for transformer models, SciKit\-Learn for machine learning utilities, Numpy for numerical computations, and Librosa for audio processing\.

To manage memory constraints, audio was downsampled to 16kHz and training batch size was reduced\. Evaluation was performed every 400 steps across 2000 training steps to monitor WER and CER\. The dataset was split using an 80:20 ratio for training and testing\. The split was performed at clip level while preserving the distribution across chapters and reciters\.
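
One way to perform a clip\-level 80:20 split that preserves the chapter and reciter distribution is to stratify on the \(chapter, reciter\) pair\. The sketch below assumes each clip is a dictionary with hypothetical `chapter` and `reciter` keys; it is not the authors’ splitting code\.

```python
import random
from collections import defaultdict


def stratified_split(clips, train_ratio=0.8, seed=0):
    """Split clips into (train, test), stratifying by (chapter, reciter).

    Each clip is assumed to be a dict with "chapter" and "reciter" keys
    (hypothetical field names chosen for this illustration).
    """
    groups = defaultdict(list)
    for clip in clips:
        groups[(clip["chapter"], clip["reciter"])].append(clip)

    rng = random.Random(seed)
    train, test = [], []
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)
        cut = round(len(members) * train_ratio)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Toy data: 2 chapters x 2 reciters x 10 clips each.
clips = [{"chapter": c, "reciter": r, "id": i}
         for c in (1, 2) for r in ("A", "B") for i in range(10)]
train, test = stratified_split(clips)
```

With this toy data every \(chapter, reciter\) group contributes eight clips to training and two to testing, so both partitions keep the same proportions\.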

### 3\.6Baseline Comparison

The Citrinet baseline, developed by Nvidia, leverages 1D time\-channel separable convolutions with SE modules\. Tarteel\.io fine\-tuned Citrinet with MFCC and SentencePiece encoding\[[35](https://arxiv.org/html/2606.19747#bib.bib35)\]\. This study compares all models against this baseline to evaluate improvements in performance and scalability for Quranic ASR\.

A consolidated overview of the key components used in the methodology is provided in Table [4](https://arxiv.org/html/2606.19747#S3.T4)\.

Table 4: Summary of Methodological Components

| Component | Method |
| --- | --- |
| Audio Feature Extractors | Wav2Vec2\.0, HuBERT, XLS\-R, MFCC |
| Output Labels | Arabic, Tashkeel, Buckwalter, Transliteration |
| Model Architectures | Wav2Vec2 Transformer, Citrinet |
| Loss Function | CTC |
| Decoding | Greedy Search \(Beam Search tested\) |
| Evaluation Metrics | WER, CER |
| Tools | PyTorch, Huggingface, Google Colab |
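
Table 4 lists CTC loss with greedy decoding\. As a reminder of what greedy CTC decoding does, the sketch below takes the highest\-scoring token per frame, collapses consecutive repeats, and drops the blank token\. The toy vocabulary, the made\-up scores, and the choice of index 0 for the blank are assumptions made for illustration\.

```python
def argmax(scores):
    """Index of the largest value in a list of per-token scores."""
    return max(range(len(scores)), key=scores.__getitem__)


def ctc_greedy_decode(frame_scores, id_to_char, blank_id=0):
    """Pick the best token per frame, collapse repeats, then drop blanks."""
    chars = []
    previous = None
    for scores in frame_scores:
        idx = argmax(scores)
        if idx != previous and idx != blank_id:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


# Toy example: 3-token vocabulary, 7 frames of made-up scores.
vocab = {0: "<blank>", 1: "a", 2: "b"}
frames = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.10],  # a (new token because a blank separates it)
    [0.10, 0.20, 0.70],  # b
    [0.10, 0.10, 0.80],  # b (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank
]
decoded = ctc_greedy_decode(frames, vocab)
```

The blank between the two runs of `a` is what lets CTC emit a doubled letter; without it the repeats would collapse into one\.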

## 4Results

### 4\.1Training Results

Following systematic hyperparameter tuning, multiple Automatic Speech Recognition \(ASR\) models were developed and evaluated\. Performance was assessed using two key metrics: Word Error Rate \(WER\) and Character Error Rate \(CER\), measured across training steps\. To optimize computational efficiency, models were trained separately on professional reciter and user\-recited datasets rather than combining both sources, as the distinct characteristics of each dataset type \(professional vs\. layman recordings\) require different optimization strategies for effective convergence\.

The following model configurations were investigated:

- Wav2Vec2: Trained on professional reciters with Arabic text labels \(without diacritics\)\.
- Tarteel: Trained on user\-recited audio with Arabic text labels\.
- XLS\-R: Trained on professional reciters with Arabic text labels\.
- HuBERT: Trained on professional reciters with Arabic text labels\.
- Arabic with Tashkeel: Trained on professional reciters with fully diacritized Arabic text\.
- English Transliteration: Trained on professional reciters using English transliterated text\.
- Buckwalter Transliteration: Trained on professional reciters using Buckwalter transliteration\.

![Refer to caption](https://arxiv.org/html/2606.19747v1/training_loss_a.png)\(a\)Wav2Vec2
![Refer to caption](https://arxiv.org/html/2606.19747v1/training_loss_b.png)\(b\)XLS\-R
![Refer to caption](https://arxiv.org/html/2606.19747v1/training_loss_c.png)\(c\)HuBERT
![Refer to caption](https://arxiv.org/html/2606.19747v1/training_loss_d.png)\(d\)Tarteel
![Refer to caption](https://arxiv.org/html/2606.19747v1/training_loss_e.png)\(e\)Tashkeel
![Refer to caption](https://arxiv.org/html/2606.19747v1/training_loss_f.png)\(f\)Transliteration
![Refer to caption](https://arxiv.org/html/2606.19747v1/training_loss_g.png)\(g\)Buckwalter

Figure 4:Training vs Validation Loss for Various Models\.
### 4\.2Training Loss

Training loss reflects model fitting on training data, while validation loss indicates generalization\. Figure [4](https://arxiv.org/html/2606.19747#S4.F4) shows loss trends for various models\. The Wav2Vec2 model \(Figure [4\(a\)](https://arxiv.org/html/2606.19747#S4.F4.sf1)\) shows rapid convergence with minimal loss gaps, indicating a balanced fit\. XLS\-R \(Figure [4\(b\)](https://arxiv.org/html/2606.19747#S4.F4.sf2)\) shows slight divergence in the final steps, suggesting mild overfitting\. HuBERT \(Figure [4\(c\)](https://arxiv.org/html/2606.19747#S4.F4.sf3)\) exhibits higher training and validation losses, implying underfitting\. The Tarteel model \(Figure [4\(d\)](https://arxiv.org/html/2606.19747#S4.F4.sf4)\) and the Tashkeel, Transliteration, and Buckwalter variants \(Figures [4\(e\)](https://arxiv.org/html/2606.19747#S4.F4.sf5), [4\(f\)](https://arxiv.org/html/2606.19747#S4.F4.sf6), and [4\(g\)](https://arxiv.org/html/2606.19747#S4.F4.sf7), respectively\) display overfitting behavior, where validation loss diverges from training loss\.

### 4\.3Character Error Rate \(CER\)

CER evaluates transcription accuracy at character level\. Figure [5](https://arxiv.org/html/2606.19747#S4.F5) presents CER trends\. All models converged quickly below 0\.1\. Among the Transformer configurations, Wav2Vec2 achieved the lowest CER, while Tarteel and Tashkeel remained comparatively low\. The absolute CER differences are smaller than in the WER comparison, so they should be interpreted as supporting trends rather than the primary discriminator between models\.

![Refer to caption](https://arxiv.org/html/2606.19747v1/cer.png)Figure 5:Character Error Rate \(CER\) comparison for trained models
### 4\.4Word Error Rate \(WER\)

WER provides stronger differentiation across models\. Figure [6](https://arxiv.org/html/2606.19747#S4.F6) shows that HuBERT had the highest WER at 0\.52\. Buckwalter and Transliteration models performed poorly \(around 0\.40\), possibly due to mismatched phonetic representation\. The Tarteel\-trained model achieved a WER of 0\.23, likely impacted by noisy user data, and the Tashkeel model had similar performance\. Wav2Vec2 achieved the best result \(WER = 0\.08\), followed closely by XLS\-R \(WER = 0\.09\)\. The Citrinet baseline reported a WER of 0\.163 on the combined EveryAyah\+Tarteel setting\. These findings are summarized in Table [5](https://arxiv.org/html/2606.19747#S4.T5)\.

![Refer to caption](https://arxiv.org/html/2606.19747v1/wer.png)Figure 6:Word Error Rate \(WER\) comparison for trained models
### 4\.5Impact on Training Approach

In this ablation, the model was the pretrained Wav2Vec2 and the output label was fixed to Quranic Arabic without Tashkeel; the other ablation studies showed that this combination of input feature and output label was optimal\. Professional reciter data was used\. When training from scratch, without any pretrained model, WER decreased slowly and was still around 0\.9 after 2000 steps\. Fine\-tuning from the pretrained model, by contrast, reduced WER to 0\.3 after just 800 steps and to 0\.08 after 2000 steps\. These results are consistent with prior work \[[15](https://arxiv.org/html/2606.19747#bib.bib15)\]\. The effect of the training strategy is shown in Figure [7](https://arxiv.org/html/2606.19747#S4.F7)\.

![Refer to caption](https://arxiv.org/html/2606.19747v1/training_approach_effect.png)Figure 7:Effect of Training Approach
### 4\.6Impact of Different Features

Feature extraction is essential in machine learning as it reduces data dimensionality and facilitates more accurate learning\. Wav2Vec2\-large\-XLSR\-53, trained on 53 languages, outperformed the other extractors, achieving WER = 0\.08\. HuBERT, trained on English\-only data, underperformed \(WER = 0\.52\)\. XLS\-R also performed well \(WER = 0\.09\), indicating the value of multilingual pretraining\. The comparative performance of the feature extractors is illustrated in Figure [8](https://arxiv.org/html/2606.19747#S4.F8)\.

![Refer to caption](https://arxiv.org/html/2606.19747v1/feature_extraction_effect.png)Figure 8:Effect of Input Audio Features
### 4\.7Impact of Different Output Labels

With Wav2Vec2\-large\-XLSR\-53 as the feature extractor, output labels were compared\. Transliteration and Buckwalter remained the weakest settings, ending at WER = 0\.38 and 0\.41, respectively\. Their poor performance may stem from phoneme misrepresentation\. The Tashkeel configuration achieved WER = 0\.23, while Arabic without Tashkeel remained the best\-performing label format with WER = 0\.08\. These results are visually compared in Figure [9](https://arxiv.org/html/2606.19747#S4.F9)\.
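
The Arabic\-without\-Tashkeel label format can be derived from diacritized text by removing the diacritic marks\. The sketch below assumes the marks to remove are the Unicode range U\+064B to U\+0652 plus the superscript alef U\+0670; the study does not specify its normalization rules, so this set is an assumption\.

```python
import re

# Arabic diacritics: fathatan..sukun (U+064B-U+0652) plus superscript
# alef (U+0670). The exact set of marks is an assumption, not taken
# from the paper.
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkeel(text):
    """Remove Arabic diacritic marks, leaving the base letters untouched."""
    return TASHKEEL.sub("", text)


# "bismi" with kasra, sukun, kasra on the three letters ba, sin, mim.
with_marks = "\u0628\u0650\u0633\u0652\u0645\u0650"
without_marks = strip_tashkeel(with_marks)
```

Removing the marks shrinks the output vocabulary and the number of symbols the CTC head must emit per word, which is one plausible reason the undiacritized labels were easier to learn\.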

![Refer to caption](https://arxiv.org/html/2606.19747v1/output_labels_effect.png)Figure 9:Effect of Output Labels
### 4\.8Effect of Dataset Quality and Composition

Using Wav2Vec2 features and Arabic text without Tashkeel as labels, training on the EveryAyah dataset yielded WER = 0\.08\. The Tarteel dataset, with layman reciters and noisier data, yielded WER = 0\.23\. Combining the datasets \(fine\-tuning on Tarteel after EveryAyah\) resulted in WER = 0\.11, indicating generalizability but highlighting the need for improved Tarteel data quality\.

It is also important to acknowledge that the Tarteel dataset, while providing diversity in recitations, includes challenges such as inconsistent pronunciation, background noise, and occasional mislabeling\. These factors likely contributed to the higher WER observed in models trained solely on Tarteel data and should be addressed in future data preprocessing efforts\.

### 4\.9Effect of Clip Duration

Audio clip durations in the dataset ranged from 1s to 286s\. To evaluate the impact of clip length on model performance and manage GPU memory constraints, experiments were conducted using maximum clip durations of 10s, 20s, 30s, and 40s\. WER improved substantially as the maximum clip duration increased: it was 0\.70 for 10\-second clips, 0\.50 for 20\-second clips, and reached its best value of 0\.075 for 30\-second clips\. However, using 40\-second clips consistently caused out\-of\-memory errors during training\. Therefore, a clip duration of 30 seconds was found to offer the best trade\-off between performance and hardware limitations\.
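
A maximum\-duration cap of this kind can be applied before training by comparing each clip’s sample count with the cap at the 16kHz sampling rate\. The sketch below assumes that over\-length clips are dropped rather than truncated, which the text does not state, and uses a hypothetical `num_samples` field\.

```python
SAMPLE_RATE = 16_000  # Hz, the sampling rate stated in Section 3.5


def filter_by_duration(clips, max_seconds=30.0, sample_rate=SAMPLE_RATE):
    """Keep clips whose duration does not exceed `max_seconds`.

    Each clip is assumed to be a dict with a "num_samples" key
    (a hypothetical field name). Dropping, rather than truncating,
    longer clips is an assumption of this sketch.
    """
    max_samples = int(max_seconds * sample_rate)
    return [c for c in clips if c["num_samples"] <= max_samples]


# Clips of 1 s, 30 s, 40 s and 286 s at 16 kHz.
clips_by_length = [{"num_samples": int(s * SAMPLE_RATE)} for s in (1, 30, 40, 286)]
kept = filter_by_duration(clips_by_length, max_seconds=30.0)
```

At 16kHz a 30\-second cap corresponds to 480,000 samples per clip, which bounds the longest sequence the GPU has to hold in a batch\.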

### 4\.10Common Recognition Errors

While overall WER and CER metrics indicate high model performance, several recurring transcription errors were observed that reveal specific challenges in Qur’anic speech recognition\. The most frequent error category involves confusion between phonetically similar Arabic letters, particularly pairs such as seen and saad, which share similar acoustic properties but differ in emphatic pronunciation crucial for correct Qur’anic recitation\. This confusion stems from the subtle articulatory differences that require specialized acoustic modeling to distinguish effectively\. Additionally, models demonstrated difficulty recognizing short Ayat \(verses\) with similar intonation patterns, where brief recitations often lack sufficient contextual information for accurate transcription, leading to misidentification between verses with comparable rhythmic structures\. Furthermore, occasional errors occurred in recognizing Ghunnah \(nasal sound\) and Maad \(vowel elongation\) pronunciation, which are fundamental Tajweed rules that involve temporal acoustic features extending beyond standard phonemic recognition\. These specific error patterns highlight the need for more fine\-grained modeling approaches that incorporate Tajweed\-aware acoustic features and phoneme\-aware decoding mechanisms in future ASR systems designed for Qur’anic recitation\.

### 4\.11Summary of Model Performance and Baseline Comparison

The proposed ASR models were evaluated against a baseline Citrinet model, with results summarized in Table [5](https://arxiv.org/html/2606.19747#S4.T5)\. Among all configurations, the Wav2Vec2 model trained on the EveryAyah dataset using Arabic text without Tashkeel achieved the best overall WER, with a value of 0\.08 and a corresponding CER of 0\.015\. Further fine\-tuning this model on the Tarteel dataset improved generalizability, achieving a WER of 0\.11, which is roughly a five\-percentage\-point improvement over the Citrinet baseline \(WER = 0\.163\)\.

Training time averaged 16 hours for individual models, while the integrated model trained on both datasets required 40 hours\. In contrast, the baseline Citrinet model required 140 hours to reach its final performance\. These results demonstrate the efficiency and scalability of transfer learning with pretrained Wav2Vec2 models\.

In terms of complexity, the proposed Wav2Vec2 model contains approximately 100 million parameters with a model size of 1\.2 GB, compared to 30 million parameters and 350 MB for the Citrinet model\. Although inference is roughly three times slower, the improved transcription accuracy makes the model more suitable for high\-accuracy research applications\. Future work may explore lightweight variants for mobile deployment\.

Table 5: Comparison of Extended Results with Baseline

In summary, the results demonstrate that the proposed Wav2Vec2\-based ASR model, particularly when trained on professional recitations using Arabic text without diacritics, delivers competitive performance compared to traditional and alternative deep learning approaches in terms of WER and CER\. The model also exhibits reduced training time and broader applicability across diverse recitation styles and speaker backgrounds, effectively addressing all three research questions\. While some variability in performance was observed across feature extractors, output labels, and dataset sources, the overall findings confirm the effectiveness of self\-supervised pretrained models for Quranic speech recognition\. These outcomes not only validate the proposed methodology but also provide a strong foundation for future enhancements in model generalization, data quality, and deployment in real\-world scenarios\. The implications of these findings are further discussed in the next section\.

## 5Discussion

This section reflects on the experimental findings, situating them within the broader landscape of Quranic Automatic Speech Recognition \(ASR\) research\. The evaluation of model configurations, dataset compositions, feature representations, and output label formats has yielded valuable insights into the parameters that most significantly impact transcription accuracy\.

### 5\.1Comparison with Prior Works

Table [6](https://arxiv.org/html/2606.19747#S5.T6) situates the proposed model among prior Quranic ASR systems\. The comparison highlights three dimensions that prior work does not address jointly: \(1\) dataset scale, as this study uses over 870 filtered hours across professional and user recitations, compared to 8–80 hours in most prior systems; \(2\) robustness to noisy user recitations, because training and evaluation on Tarteel user data alongside EveryAyah professional recordings allow the model to generalize beyond studio conditions, an aspect absent from systems trained solely on professional reciters; and \(3\) training efficiency, since self\-supervised pretraining reduces required compute from 140 hours \(Citrinet baseline\) to 16–40 hours, enabling faster iteration\. Against MFCC\-based GMM\-HMM baselines \(WER 18–22%\), the performance gain is substantial\. The Citrinet model achieves WER 16\.3%, while the proposed Wav2Vec2 model trained on the same combined dataset reaches WER 11\.3%, representing roughly a five\-percentage\-point gain\. These gains are consistent with the broader trend of self\-supervised Transformer models outperforming MFCC\-driven pipelines \[[15](https://arxiv.org/html/2606.19747#bib.bib15),[27](https://arxiv.org/html/2606.19747#bib.bib27),[14](https://arxiv.org/html/2606.19747#bib.bib14)\]\. Concurrently, Al Harere and Al Jallad \[[3](https://arxiv.org/html/2606.19747#bib.bib3)\] achieved WER 8\.34% using a CNN\-BiGRU with CTC on the Ar\-DAD dataset \(37 Surahs\), confirming that end\-to\-end models deliver strong Quranic ASR performance\. Hadwan et al\. \[[26](https://arxiv.org/html/2606.19747#bib.bib26)\] reported WER 6\.16% and CER 1\.98% using an encoder\-decoder Transformer with an external RNN/LSTM language model, trained on 10 hours covering 16 short Surahs\. While numerically competitive, this result is evaluated on a narrow, homogeneous subset of the Quran \(short Surahs with repetitive phoneme patterns\) and relies on an external language model, which makes direct comparison with our full\-Quran, LM\-free setup impractical\. On the EveryAyah\-only subset covering the full Quran, our model achieves WER 8\.0%; on the combined filtered corpus exceeding 870 hours, it achieves WER 11\.3%\.
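
Using the Citrinet and combined\-dataset figures quoted in this paragraph, the absolute and relative WER improvements and the reduction in training time work out as follows\. These are derived arithmetic on the reported numbers, not additional results\.

$$
\begin{aligned}
\Delta_{\text{abs}} &= 16.3\% - 11.3\% = 5.0 \text{ percentage points} \\
\Delta_{\text{rel}} &= \frac{16.3 - 11.3}{16.3} \approx 30.7\% \\
\text{training-time reduction} &= \frac{140 - 40}{140} \approx 71.4\%
\end{aligned}
$$

The distinction matters when comparing with other systems: a five\-point absolute gain corresponds to removing roughly three in ten of the baseline’s word errors\.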

A notable finding from Al\-Issa et al\. \[[5](https://arxiv.org/html/2606.19747#bib.bib5)\] highlights the significant role of transcription quality and domain\-specific preprocessing in achieving low WER and CER\. Their baseline experiments using MFCC features and Mozilla’s DeepSpeech architecture resulted in relatively high error rates, with Word Error Rates \(WER\) ranging from 46\.7% to 32\.1% in early experiments \(Exp\-1 to Exp\-4\)\. However, after applying progressive cleaning steps such as silence removal, punctuation stripping, fixing corpus alignments, and removing non\-Quranic tokens, the performance improved substantially\. In Experiment 7, by incorporating a Quran\-specific vocabulary set, they achieved a remarkably low WER of 4\.6% and CER of 2\.5%\. This demonstrates that ASR performance is not solely dependent on model architecture but also heavily influenced by the quality and consistency of transcription labels\. These insights align with our findings, where improved dataset quality and carefully chosen output label formats contributed significantly to accuracy gains\. The study underscores the importance of domain\-specific preprocessing and offers a compelling direction for future improvements in Quranic ASR systems\.

Another relevant benchmark in Quranic ASR research is the DeepSpeech\-Quran project by ElDeeb \[[20](https://arxiv.org/html/2606.19747#bib.bib20)\], which used Mozilla’s DeepSpeech \(v0\.7\.1\) RNN\-based architecture with MFCC features\. The dataset was primarily based on professional recitations from EveryAyah, segmented and labeled at the verse level\.

Two configurations were reported:

- Imam\-Only Dataset: Trained on audio from seven professional reciters\. The model achieved WER = 5\.6% and CER = 3\.9%\.
- Imam \+ Filtered Users Dataset: Combined professional and user recitations, resulting in WER = 9\.9% and CER = 6\.5%\.

Although these results appear strong, later analysis by Al\-Issa et al\. \[[5](https://arxiv.org/html/2606.19747#bib.bib5)\] showed that applying ElDeeb’s models on alternative datasets resulted in much higher WER, exceeding 76%\. This suggests the models lacked generalization beyond their training data\.

In comparison, our proposed Wav2Vec2\-based model achieved WER = 8\.0% and CER = 1\.5% on the EveryAyah dataset and maintained robust performance \(WER = 11\.3%\) even when extended to a noisier, more diverse dataset combining professional and user recitations\. These findings highlight the advantage of self\-supervised pretrained models like Wav2Vec2 in handling variability and scale, offering stronger generalization than traditional RNN\-based ASR systems\.

Table 6: Comparison of Proposed Work with Other Relevant Works \(Best performing model highlighted in bold\)

| Author | Extraction | Method | Dataset | WER |
| --- | --- | --- | --- | --- |
| Yuwan & Lestari \(2016\) | MFCC | GMM \+ HMM | Phonetically rich 180 verses | 22\.42% |
| Ridwan & Lestari \(2018\) | MFCC | GMM \+ HMM | Full Quran | 18\.53% |
| Thirafi & Lestari \(2019\) | MFCC | HMM \+ BLSTM \(Kaldi\) | 10 Professional Reciters \(EveryAyah\) | 9\.16% |
| Tarteel\.io \(2020\) | MFCC | End\-to\-End Citrinet | 80\+ hours user \+ professional \(Tarteel\) | 16\.3% |
| ElDeeb \(2021\) | MFCC | DeepSpeech \(RNN\) | EveryAyah \(Imam Only\) | 5\.65% |
| ElDeeb \(2021\) | MFCC | DeepSpeech \(RNN\) | EveryAyah \+ Tarteel | 9\.91% |
| Al\-Issa et al\. \(2022\) | MFCC | DeepSpeech \(RNN\) | EveryAyah \(cleaned text\) | 4\.6% |
| Al Harere & Al Jallad \(2023\) | CNN\-BiGRU | End\-to\-End CTC | Ar\-DAD \(37 Surahs\) | 8\.34% |
| Hadwan et al\. \(2023\) | Mel filterbank | E2E Transformer \+ RNN/LSTM\-LM | 10 hrs, 16 Surahs, 60 reciters | 6\.16%† |
| Proposed Model | Wav2Vec2 | Transformer \(fine\-tuned\) | Tarteel \+ EveryAyah | 11\.3% |
| Proposed Model | Wav2Vec2 | Transformer \(fine\-tuned\) | EveryAyah only | 8\.0% |

† Evaluated on 16 short Surahs only \(10 hrs, with external language model\); not full\-Quran coverage\.
### 5\.2Contributions and Implications

This work introduces a reliable ASR system tailored for Quranic recitations\. The integration of self\-supervised feature extraction via Wav2Vec2 and comprehensive ablation studies across various datasets and label types contributes to the field\. The findings demonstrate measurable performance gains and lay the groundwork for practical applications such as voice\-enabled Quranic search, learner feedback systems, and accessibility tools\.

One key contribution is the comparative analysis of multiple pretrained speech representations, including HuBERT and XLS\-R\. Despite the proven success of both models in other ASR tasks, Wav2Vec2\-large\-XLSR\-53 consistently yielded the lowest WER and CER in our experiments, underscoring its suitability for Arabic speech recognition\. Additionally, our evaluation of label formats, including Arabic text with and without Tashkeel, Buckwalter, and transliteration, confirmed that simplified Arabic without diacritics offers the most accurate recognition performance\.

The practical implications of this research extend to multiple stakeholder groups who benefit from automated Qur’anic speech recognition capabilities\. Individual learners gain access to self\-directed recitation assessment tools that provide immediate feedback on pronunciation accuracy and Tajweed compliance, enabling independent study without requiring constant teacher supervision\. Educational institutions can integrate these systems into digital learning platforms to scale Qur’anic education and provide consistent assessment across diverse student populations\. Religious organizations benefit from automated transcription services for sermon preparation, digital archiving, and accessibility tools for hearing\-impaired community members\.

The proposed pipeline is well suited to downstream tasks such as ayah segmentation and word\-level timestamp generation, where accurate audio\-text alignment enables verse navigation, synchronized highlighting during recitation playback, and improved accessibility in digital Qur’anic learning tools\. The system’s ability to process both professional and layman recitations makes it particularly valuable for creating inclusive resources that serve diverse user skill levels and learning contexts\.

### 5\.3Tajweed Awareness: Data\-Level Versus Rule\-Level Modeling

Although Tajweed compliance is a central motivation for this research, it is important to distinguish between two fundamentally different approaches\. This work addresses Tajweed at the*data level*: the model is trained on professional recitations that inherently conform to Tajweed rules, and correct pronunciation is thus implicitly encouraged through the training distribution\. No explicit Tajweed rule is modeled at the phoneme or decision level\. Phoneme\-level features such as Ghunnah duration or Qalqalah intensity are not directly quantified or evaluated\. This is an acknowledged limitation of the current system\. Future work should develop Tajweed\-specific phonetic evaluation metrics and incorporate rule\-level constraints, for example by adding phoneme\-aware decoding or integrating a Tajweed rule engine as a post\-processing layer, to move from implicit to explicit Tajweed enforcement\.

### 5\.4Deployment Feasibility and Model Compression

The best\-performing Wav2Vec2\-large\-XLSR\-53 model contains approximately 100 million parameters with a model size of 1\.2 GB, making direct deployment on mobile or edge devices challenging\. Several compression strategies are applicable for reducing this footprint without substantially compromising accuracy\. Knowledge distillation \[[17](https://arxiv.org/html/2606.19747#bib.bib17)\], in which a smaller student model is trained to reproduce the outputs of the large teacher, has been shown to achieve 3–6× parameter reduction in self\-supervised speech models while retaining competitive WER\. Post\-training quantization \(8\-bit integer or 4\-bit\) and structured pruning offer complementary size reductions and have been validated for Transformer\-based ASR \[[43](https://arxiv.org/html/2606.19747#bib.bib43)\]\. A practical deployment path would combine a distilled student of 20–30M parameters with 8\-bit quantization, targeting inference on a mobile CPU at near\-real\-time speed\. Evaluating such compressed variants against the full model on the Quranic test set is recommended as a priority in future work\.
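
To give a rough sense of scale for the proposed deployment path, the weight storage of a distilled student can be estimated as parameters times bits per weight\. The sketch below is a back\-of\-the\-envelope estimate under stated assumptions: it counts weights only, ignores activations, metadata, and any layers kept at higher precision, and uses 25M parameters as an illustrative midpoint of the 20–30M range\.

```python
def weight_size_mb(num_params, bits_per_weight):
    """Approximate weight storage in megabytes (10**6 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e6


STUDENT_PARAMS = 25_000_000  # illustrative midpoint of the 20-30M range

student_fp32_mb = weight_size_mb(STUDENT_PARAMS, 32)  # 32-bit floats
student_int8_mb = weight_size_mb(STUDENT_PARAMS, 8)   # 8-bit quantized
```

Under these assumptions an 8\-bit student needs about a quarter of the storage of the same student in 32\-bit floating point, before any accuracy loss from quantization is taken into account\.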

### 5\.5Interpretation of Findings

The experiments indicate that model performance is strongly influenced by both the type of audio features and the quality of training data\. Models trained on professional reciter datasets outperformed those trained on user\-submitted Tarteel data, likely due to the latter’s inherent noise and variability\. However, fine\-tuning on both datasets improved generalization without severely compromising accuracy, suggesting potential for personalized or adaptive learning systems\.

Model performance also varied by clip duration\. Short clips \(10s\) led to poor convergence \(WER = 0\.70\), while clips up to 30s significantly improved recognition accuracy \(WER = 0\.075\)\. This supports the hypothesis that longer input contexts enable more effective modeling of Quranic recitation structures\.

Training from scratch yielded slow convergence, with high WER persisting even after 2000 steps\. In contrast, fine\-tuning pretrained models led to rapid performance gains within the first few hundred steps, validating the benefits of transfer learning in low\-resource ASR domains\.

The experimental findings successfully address the research objectives:

- Objective 1 \- Identify critical acoustic and linguistic parameters: The research systematically evaluated multiple feature extraction approaches and label formats to determine optimal parameters for Qur’anic recitation transcription\. Results identified Wav2Vec2 as the most effective feature extractor and Arabic without Tashkeel as the optimal label format, demonstrating that these parameters significantly impact transcription accuracy with measurable performance improvements over alternative configurations\.
- Objective 2 \- Develop advanced Transformer\-based model: The study implemented and optimized an end\-to\-end Transformer architecture utilizing transfer learning methodologies specifically adapted for Qur’anic speech recognition\. Experimental results demonstrated that the proposed transfer learning approach combined with optimized clip durations improved WER by roughly five percentage points compared to the Citrinet baseline, validating the effectiveness of advanced deep learning methodologies for WER reduction\.
- Objective 3 \- Evaluate and validate model performance: Comprehensive evaluation against existing baseline approaches revealed that the proposed model achieved a WER of 0\.08, performing competitively with the best existing models while demonstrating strong generalization across both professional and user datasets\. The results validate the model’s superior performance and establish it as a scalable self\-supervised alternative to traditional RNN\-based ASR approaches\.

## 6Conclusion

Quran Speech Recognition has the potential to serve millions of users by addressing the scarcity and unavailability of professional teachers\. It facilitates self\-learning and distance learning, offering a practical solution for Quranic education\. Traditional speech recognition methods either do not cover the entire Quran or result in high Word Error Rates \(WER\), particularly for user\-recited verses\.

The proposed model improves upon the Citrinet baseline in WER while maintaining low CER across the Transformer configurations\. It is effective for both professional reciters and layman users and leverages fine\-tuning of a pretrained Wav2Vec2 model\. Through ablation studies, it was established that normal Arabic text without diacritics achieved the lowest WER, and Wav2Vec2\-large\-XLSR\-53 proved to be the best\-performing feature extractor\. In the combined EveryAyah\+Tarteel setting, the proposed model outperformed the baseline by roughly five percentage points in WER while reducing the end\-to\-end training schedule from 140 hours to 40 hours\. These results confirm the effectiveness and utility of the proposed Quranic ASR system in real\-world applications\.

While the proposed system demonstrated excellent performance, several limitations persist\. The model’s large size \(1\.2 GB, 100 million parameters\) and inference latency pose challenges for deployment on mobile or resource\-constrained devices\. Moreover, the system transcribes at the word level and lacks phoneme\-level resolution, which is critical for applications such as tajweed correction or pronunciation analysis\. Some common recognition errors also persisted, particularly in distinguishing phonetically similar Arabic letters \(e\.g\., seen vs saad\) and handling short ayat with subtle intonational differences\.

The findings open several avenues for future research\. Developing phoneme\-aware models or incorporating phonetic decoding layers could enhance tajweed\-sensitive applications\. Further, improving the quality of user\-submitted data, particularly from Tarteel, through manual annotation and filtering would enhance training effectiveness\. Personalized ASR models adapted to speaker age, gender, or dialect may also help expand usability across diverse user bases\. From a technical standpoint, integrating external language models or lightweight fine\-tuning strategies could bridge the gap between performance and deployment efficiency\. Exploring quantization or pruning techniques may reduce model size without significantly compromising accuracy\.

In summary, the results presented in this study affirm that pretrained Transformer\-based models, particularly Wav2Vec2, offer a transformative advantage in Quranic speech recognition tasks\. The proposed methodology significantly reduces transcription errors while enabling generalization across varied datasets and recitation styles\. Compared to prior approaches, our model provides a strong full\-Quran, LM\-free Transformer baseline and a practical foundation for future advancements in personalized, phoneme\-aware, and real\-time Quranic ASR systems\.

## Declarations

Ethics statement: This study uses previously available Qur’anic recitation datasets from EveryAyah and Tarteel\-ML\. The authors did not collect new human\-subject data for this manuscript\. User\-recorded audio remains subject to the permissions, governance, and access controls of the original source platforms\.

Funding: This research received no specific grant from any funding agency in the public, commercial, or not\-for\-profit sectors\.

Competing interests: The authors declare no competing interests\.

Data and code availability: Public source materials referenced in this study are available via EveryAyah \([https://everyayah\.com](https://everyayah.com/)\) and Tarteel\-ML \([https://github\.com/Tarteel\-io/tarteel\-ml](https://github.com/Tarteel-io/tarteel-ml)\)\. User\-recorded Tarteel audio and restricted derived files are not redistributed by the authors\. A dedicated public code release does not accompany this manuscript\.

## References

- Ahmad et al\. \[2018\]Ahmad F, Yahya SZ, Saad Z, et al \(2018\) Tajweed classification using artificial neural network\. In: 2018 International Conference on Smart Applications, Communications and Networking \(SmartNets\)\. IEEE, Yasmine Hammamet, Tunisia,[10\.1109/SMARTNETS\.2018\.8707394](https://arxiv.org/doi.org/10.1109/SMARTNETS.2018.8707394)
- Al\-Ayyoub et al\. \[2018\]Al\-Ayyoub M, Damer NA, Hmeidi I \(2018\) Using deep learning for automatically determining correct application of basic quranic recitation rules\. International Arab Journal of Information Technology 15\(3A Special Issue\):620–625
- Al Harere and Al Jallad \[2023\]Al Harere A, Al Jallad K \(2023\) Quran recitation recognition using end\-to\-end deep learning\. arXiv preprint arXiv:230507034 URL[https://arxiv\.org/abs/2305\.07034](https://arxiv.org/abs/2305.07034)
- Al\-Huri \[2015\]Al\-Huri I \(2015\) Arabic language: Historic and sociolinguistic characteristics\. English Literature and Language Review 1\(4\):28–36
- Al\-Issa et al\. \[2023\]Al\-Issa S, Al\-Ayyoub M, Al\-Khaleel O, et al \(2023\) Building a neural speech recognizer for quranic recitations\. International Journal of Speech Technology 26\(4\):1131–1151
- Alagrami and Eljazzar \[2020\]Alagrami A, Eljazzar M \(2020\) SMARTAJWEED: Automatic recognition of arabic quranic recitation rules\. In: Proceedings of the Computer Science and Information Technology \(CSIT\), pp 145–152,[10\.5121/csit\.2020\.101812](https://doi.org/10.5121/csit.2020.101812), URL[https://doi\.org/10\.5121/csit\.2020\.101812](https://doi.org/10.5121/csit.2020.101812)
- Alfaries et al\. \[2015\]Alfaries A, Albahlal M, Almazrua M, et al \(2015\) A rule\-based annotation system to extract tajweed rules from quran\. In: Proceedings of the 2013 Taibah University International Conference on Advances in Information Technology for the Holy Quran and Its Sciences \(NOORIC 2013\)\. IEEE, Madinah, Saudi Arabia, pp 281–286,[10\.1109/NOORIC\.2013\.63](https://doi.org/10.1109/NOORIC.2013.63)
- Alharbi et al\. \[2021\]Alharbi S, Alrazgan M, Alrashed A, et al \(2021\) Automatic speech recognition: Systematic literature review\. IEEE Access 9:131858–131876
- Ali et al\. \[2018\]Ali A, Vogel S, Renals S \(2018\) Speech recognition challenge in the wild: Arabic mgb\-3\. In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop \(ASRU\)\. IEEE, Okinawa, Japan,[10\.1109/ASRU\.2017\.8268952](https://doi.org/10.1109/ASRU.2017.8268952)
- Ali et al\. \[2019\]Ali A, Shon S, Samih Y, et al \(2019\) The mgb\-5 challenge: Recognition and dialect identification of dialectal arabic speech\. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop \(ASRU\)\. IEEE, Singapore,[10\.1109/ASRU46091\.2019\.9003960](https://doi.org/10.1109/ASRU46091.2019.9003960)
- Alrumiah and Al\-Shargabi \[2023\]Alrumiah SS, Al\-Shargabi AA \(2023\) Intelligent quran recitation recognition and verification: Research trends and open issues\. Arabian Journal for Science and Engineering 48\(8\):9859–9885
- Amrani et al\. \[2016\]Amrani MYE, Rahman MMH, Wahiddin MR, et al \(2016\) Building cmu sphinx language model for the holy quran using simplified arabic phonemes\. Egyptian Informatics Journal 17\(3\):305–314\.[10\.1016/j\.eij\.2016\.04\.002](https://doi.org/10.1016/j.eij.2016.04.002), URL[https://doi\.org/10\.1016/j\.eij\.2016\.04\.002](https://doi.org/10.1016/j.eij.2016.04.002)
- Aziz et al\. \[2019\]Aziz M, Abdullah WM, Ahmad AM, et al \(2019\) Comparison between conventional method and modern technology in al\-qur’an memorization\. International Journal of Recent Technology and Engineering 8\(1\):289–294
- Babu et al\. \[2021\]Babu A, Wang C, Tjandra A, et al \(2021\) XLS\-R: Self\-supervised Cross\-lingual Speech Representation Learning at Scale\. arXiv preprint arXiv:2111\.09296 pp 1–23\. URL[http://arxiv\.org/abs/2111\.09296](http://arxiv.org/abs/2111.09296)
- Baevski et al\. \[2020\]Baevski A, Zhou Y, Mohamed A, et al \(2020\) wav2vec 2\.0: A framework for self\-supervised learning of speech representations\. Advances in neural information processing systems 33:12449–12460
- Baevski et al\. \[2021\]Baevski A, Hsu WN, Conneau A, et al \(2021\) Unsupervised speech recognition\. Advances in Neural Information Processing Systems 34:27826–27839
- Chang et al\. \[2022\]Chang HJ, Yang Sw, Lee Hy \(2022\) Distilhubert: Speech representation learning by layer\-wise distillation of hidden\-unit bert\. In: IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\)\. IEEE, pp 7087–7091,[10\.1109/ICASSP43922\.2022\.9747490](https://doi.org/10.1109/ICASSP43922.2022.9747490)
- Dahl et al\. \[2012\]Dahl GE, Yu D, Deng L, et al \(2012\) Context\-dependent pre\-trained deep neural networks for large\-vocabulary speech recognition\. IEEE Transactions on Audio, Speech, and Language Processing 20\(1\):30–42\.[10\.1109/TASL\.2011\.2134090](https://doi.org/10.1109/TASL.2011.2134090)
- Dajani et al\. \[2014\]Dajani BAS, Mubaideen S, Omari FMA \(2014\) Difficulties of learning arabic for non\-native speakers\. Procedia \- Social and Behavioral Sciences 114:895–902\.[10\.1016/j\.sbspro\.2013\.12\.808](https://doi.org/10.1016/j.sbspro.2013.12.808), URL[https://doi\.org/10\.1016/j\.sbspro\.2013\.12\.808](https://doi.org/10.1016/j.sbspro.2013.12.808)
- ElDeeb \[2021\]ElDeeb T \(2021\) Deepspeech\-quran: Speech recognition for quranic recitations using mozilla deepspeech\.[https://github\.com/tarekeldeeb/DeepSpeech\-Quran](https://github.com/tarekeldeeb/DeepSpeech-Quran), accessed: 2025\-03\-18
- EveryAyah \[n\.d\.\]EveryAyah \(n\.d\.\) Everyayah: Quran audio recitation website\. URL[https://everyayah\.com](https://everyayah.com/), accessed: 21 March 2025
- Faris \[2023\]Faris S \(2023\) Exploring the divine message: Quranic studies in the context of islamic scholarship\. Dirasah International Journal of Islamic Studies 1\(2\):111–125
- Graves \[2012\]Graves A \(2012\) Sequence transduction with recurrent neural networks\. arXiv preprint arXiv:1211\.3711
- Graves and Jaitly \[2014\]Graves A, Jaitly N \(2014\) Towards end\-to\-end speech recognition with recurrent neural networks\. In: Proceedings of the 31st International Conference on Machine Learning \(ICML\), vol 5\. PMLR, Beijing, China
- Gulati et al\. \[2020\]Gulati A, Qin J, Chiu CC, et al \(2020\) Conformer: Convolution\-augmented transformer for speech recognition\. In: Proceedings of Interspeech 2020, pp 5036–5040,[10\.21437/Interspeech\.2020\-3015](https://doi.org/10.21437/Interspeech.2020-3015)
- Hadwan et al\. \[2023\]Hadwan M, Alsayadi HA, AL\-Hagree S \(2023\) An end\-to\-end transformer\-based automatic speech recognition for qur’an reciters\. Computers, Materials & Continua[10\.32604/cmc\.2023\.033457](https://doi.org/10.32604/cmc.2023.033457)
- Hsu et al\. \[2021\]Hsu WN, Bolte B, Tsai YHH, et al \(2021\) Hubert: Self\-supervised speech representation learning by masked prediction of hidden units\. IEEE/ACM Transactions on Audio, Speech, and Language Processing pp 1–10\.[10\.1109/TASLP\.2021\.3122291](https://doi.org/10.1109/TASLP.2021.3122291), URL[https://doi\.org/10\.1109/TASLP\.2021\.3122291](https://doi.org/10.1109/TASLP.2021.3122291)
- Hussein et al\. \[2022\]Hussein A, Watanabe S, Ali A \(2022\) Arabic speech recognition by end\-to\-end, modular systems and human\. Computer Speech & Language 71:101272
- Ibrahim et al\. \[2013\]Ibrahim NJ, Idris MYI, Razak Z, et al \(2013\) Automated tajweed checking rules engine for quranic learning\. Multicultural Education and Technology Journal 7\(4\):275–287\.[10\.1108/METJ\-03\-2013\-0012](https://doi.org/10.1108/METJ-03-2013-0012), URL[https://doi\.org/10\.1108/METJ\-03\-2013\-0012](https://doi.org/10.1108/METJ-03-2013-0012)
- Ibrahim et al\. \[2015\]Ibrahim NJ, Idris MYI, Yusoff MZM, et al \(2015\) The problems, issues and future challenges of automatic speech recognition for quranic verse recitation: A review\. Al\-Bayan: Journal of Qur’an and Hadith Studies 13\(2\):168–196
- Ismail et al\. \[2014\]Ismail A, Idris MYI, Noor NM, et al \(2014\) Mfcc\-vq approach for qalqalah tajweed rule checking\. Malaysian Journal of Computer Science 27\(4\):275–293
- Ismail et al\. \[2019\]Ismail FZ, Yusof NH, Osman AFA, et al \(2019\) Retaining quranic memorisation for huffaz at the malaysian tertiary institutions: Key challenges and future iot potentialities\. In: 2019 7th International Conference on Future Internet of Things and Cloud Workshops \(FiCloudW\), IEEE, pp 26–30
- Kher \[2018\]Kher J \(2018\) The history of arabic language — verbling\.[https://www\.verbling\.com/articles/post/the\-history\-of\-arabic\-language?locale=en](https://www.verbling.com/articles/post/the-history-of-arabic-language?locale=en), retrieved June 5, 2021
- Khurana and Ali \[2017\]Khurana S, Ali A \(2017\) Qcri advanced transcription system \(qats\) for the arabic multi\-dialect broadcast media recognition: Mgb\-2 challenge\. In: 2016 IEEE Workshop on Spoken Language Technology \(SLT\)\. IEEE, San Diego, CA, USA,[10\.1109/SLT\.2016\.7846279](https://doi.org/10.1109/SLT.2016.7846279)
- Kudo and Richardson \[2018\]Kudo T, Richardson J \(2018\) Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing\. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations\. Association for Computational Linguistics, Brussels, Belgium, pp 66–71,[10\.18653/v1/D18\-2012](https://doi.org/10.18653/v1/D18-2012)
- Levenshtein et al\. \[1966\]Levenshtein VI, et al \(1966\) Binary codes capable of correcting deletions, insertions, and reversals\. In: Soviet physics doklady, Soviet Union, pp 707–710
- Liu et al\. \[2021\]Liu Y, Li T, Zhang P, et al \(2021\) Improved conformer\-based end\-to\-end speech recognition using neural architecture search\. arXiv preprint arXiv:2104\.05390
- Majumdar et al\. \[2021\]Majumdar S, Balam J, Hrinchuk O, et al \(2021\) Citrinet: Closing the gap between non\-autoregressive and autoregressive end\-to\-end models for automatic speech recognition\. arXiv preprint arXiv:2104\.01721 pp 1–5\. URL[http://arxiv\.org/abs/2104\.01721](http://arxiv.org/abs/2104.01721)
- Maqsood et al\. \[2016\]Maqsood M, Habib HA, Nawaz T, et al \(2016\) A complete mispronunciation detection system for arabic phonemes using svm\. IJCSNS International Journal of Computer Science and Network Security 16\(3\):30
- Mary and Deekshitha \[2019\]Mary L, Deekshitha G \(2019\) Searching speech databases: Features, techniques and evaluation measures\. In: SpringerBriefs in Speech Technology\. Springer
- Mohamed Elhadj et al\. \[2012\]Mohamed Elhadj YO, Aoun\-Allah M, A\. I, et al \(2012\) A new scientific formulation of tajweed rules for e\-learning of quran phonological rules\. In: E\-Learning \- Engineering, On\-Job Training and Interactive Teaching\. InTechOpen, Rijeka, Croatia,[10\.5772/29643](https://doi.org/10.5772/29643)
- Nazir et al\. \[2019\]Nazir F, Majeed MN, Ghazanfar MA, et al \(2019\) Mispronunciation detection using deep convolutional neural network features and transfer learning\-based model for arabic phonemes\. IEEE Access 7:52589–52608\.[10\.1109/ACCESS\.2019\.2912648](https://doi.org/10.1109/ACCESS.2019.2912648), URL[https://doi\.org/10\.1109/ACCESS\.2019\.2912648](https://doi.org/10.1109/ACCESS.2019.2912648)
- O’Shaughnessy \[2024\]O’Shaughnessy D \(2024\) Trends and developments in automatic speech recognition research\. Computer Speech & Language 83:101538
- Ould et al\. \[2014\]Ould ME, Alghamdi M, Alkanhal M \(2014\) Phoneme\-based recognizer to assist reading the holy quran\. In: Advances in Intelligent Systems and Computing, vol 235\. Springer, p 141–152,[10\.1007/978\-3\-319\-01778\-5\_15](https://doi.org/10.1007/978-3-319-01778-5_15), URL[https://doi\.org/10\.1007/978\-3\-319\-01778\-5\_15](https://doi.org/10.1007/978-3-319-01778-5_15)
- Peddinti et al\. \[2015\]Peddinti V, Povey D, Khudanpur S \(2015\) A time delay neural network architecture for efficient modeling of long temporal contexts\. In: Proceedings of the Annual Conference of the International Speech Communication Association \(INTERSPEECH\), INTERSPEECH 2015
- Putten \[2020\]Putten Mv \(2020\) Classical and modern standard arabic\. In: Lucas C, Manfredi S \(eds\) Arabic and Contact\-Induced Change\. Language Science Press, Berlin, p 57–82,[10\.5281/zenodo\.3744565](https://doi.org/10.5281/zenodo.3744565)
- Schneider et al\. \[2019\]Schneider S, Baevski A, Collobert R, et al \(2019\) wav2vec: Unsupervised Pre\-training for Speech Recognition\. In: Proceedings of the Annual Conference of the International Speech Communication Association \(INTERSPEECH\),[10\.21437/Interspeech\.2019\-1873](https://doi.org/10.21437/Interspeech.2019-1873)
- Smit et al\. \[2018\]Smit P, Gangireddy SR, Enarvi S, et al \(2018\) Aalto system for the 2017 arabic multi\-genre broadcast challenge\. In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop \(ASRU\)\. IEEE, Okinawa, Japan,[10\.1109/ASRU\.2017\.8268955](https://doi.org/10.1109/ASRU.2017.8268955)
- Straub \[2020\]Straub J \(2020\) Which are the most spoken languages in spain?[https://www\.fluentin3months\.com/most\-spoken\-languages/](https://www.fluentin3months.com/most-spoken-languages/), accessed: 2025\-03\-18
- Tabbaa and Soudan \[2015\]Tabbaa HMA, Soudan B \(2015\) Computer\-aided training for quranic recitation\. Procedia \- Social and Behavioral Sciences 192:778–787\.[10\.1016/j\.sbspro\.2015\.06\.092](https://doi.org/10.1016/j.sbspro.2015.06.092)
- Talebe et al\. \[2025\]Talebe T, Harun H, et al \(2025\) The effect of memorizing the quran on improving students’ academic achievement\. Eduvest\-Journal of Universal Studies 5\(1\):305–310
- Tarteel\.io \[2021\]Tarteel\.io \(2021\) Tarteel\-ml: Pre\-processing and training scripts for the tarteel dataset\.[https://github\.com/Tarteel\-io/tarteel\-ml](https://github.com/Tarteel-io/tarteel-ml), accessed: 2025\-03\-18
- Watanabe et al\. \[2017\]Watanabe S, Hori T, Kim S, et al \(2017\) Hybrid ctc/attention architecture for end\-to\-end speech recognition\. IEEE Journal on Selected Topics in Signal Processing 11\(8\)\.[10\.1109/JSTSP\.2017\.2763455](https://doi.org/10.1109/JSTSP.2017.2763455), URL[https://doi\.org/10\.1109/JSTSP\.2017\.2763455](https://doi.org/10.1109/JSTSP.2017.2763455)
- Xiong et al\. \[2017\]Xiong W, Droppo J, Huang X, et al \(2017\) Toward human parity in conversational speech recognition\. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25\(12\):2410–2423\.[10\.1109/TASLP\.2017\.2756440](https://doi.org/10.1109/TASLP.2017.2756440)
- Ziafat et al\. \[2021\]Ziafat N, Ahmad HF, Fatima I, et al \(2021\) Correct pronunciation detection of the arabic alphabet using deep learning\. Applied Sciences \(Switzerland\) 11\(6\):1–19\.[10\.3390/app11062508](https://doi.org/10.3390/app11062508), URL[https://doi\.org/10\.3390/app11062508](https://doi.org/10.3390/app11062508)
