Tadabur: A Large-Scale Quran Audio Dataset
Summary
Tadabur is a 1,400+ hour Quran audio dataset from 600+ reciters designed to advance Quranic speech research and benchmarking.
Cached at: 04/23/26, 11:54 AM
Source: https://huggingface.co/papers/2604.18932
Abstract
Despite growing interest in Quranic data research, existing Quran datasets remain limited in both scale and diversity. To address this gap, we present Tadabur, a large-scale Quran audio dataset. Tadabur comprises more than 1,400 hours of recitation audio from over 600 distinct reciters, providing substantial variation in recitation styles, vocal characteristics, and recording conditions. This diversity makes Tadabur a comprehensive and representative resource for Quranic speech research and analysis. By significantly expanding both the total duration and variability of available Quran data, Tadabur aims to support future research and facilitate the development of standardized Quranic speech benchmarks.
Datasets citing this paper: 1
FaisaI/tadabur
Similar Articles
A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition
This paper presents a systematic empirical study of fine-tuning pretrained Transformer models (Wav2Vec2.0, HuBERT, XLS-R) for Quranic Automatic Speech Recognition (ASR), achieving a WER of 0.08 on the EveryAyah subset and reducing training time from 140 to 40 hours, with Wav2Vec2-XLSR-53 providing the best representation.
MUSCAT: MUltilingual, SCientific ConversATion Benchmark
MUSCAT is a new multilingual, scientific conversation benchmark dataset for evaluating ASR systems on challenging multilingual scenarios including code-switching, domain-specific vocabulary, and mixed language input. The dataset consists of bilingual discussions on scientific papers between speakers using different languages, with results showing current state-of-the-art systems struggle with these multilingual challenges.
The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar
Presents Tatoxa, a state-of-the-art system for text detoxification in the Tatar language, outperforming existing LLMs. Introduces a new dataset and shows that cross-lingual transfer performs worse than native data.
TTS Benchmark Comparison (all known TTS up until May 2026)
A user-created benchmark for comparing local TTS tools, with results for Windows and Mac, and Linux testing pending. Includes an HTML results page and GitHub repository.
Text-to-Speech (TTS) Benchmark Revamped with Objective Standards and Blind Voting (46 models and counting)
A revamped TTS benchmark introduces objective standards and live blind voting to create an ELO rating for 46+ models, with participation open to the community.