Tadabur: A Large-Scale Quran Audio Dataset

Hugging Face Daily Papers

Summary

Tadabur is a 1,400+ hour Quran audio dataset from 600+ reciters designed to advance Quranic speech research and benchmarking.

Despite growing interest in Quranic data research, existing Quran datasets remain limited in both scale and diversity. To address this gap, we present Tadabur, a large-scale Quran audio dataset. Tadabur comprises more than 1,400 hours of recitation audio from over 600 distinct reciters, providing substantial variation in recitation styles, vocal characteristics, and recording conditions. This diversity makes Tadabur a comprehensive and representative resource for Quranic speech research and analysis. By significantly expanding both the total duration and variability of available Quran data, Tadabur aims to support future research and facilitate the development of standardized Quranic speech benchmarks.

Cached at: 04/23/26, 11:54 AM

Source: https://huggingface.co/papers/2604.18932

Abstract

Despite growing interest in Quranic data research, existing Quran datasets remain limited in both scale and diversity. To address this gap, we present Tadabur, a large-scale Quran audio dataset. Tadabur comprises more than 1,400 hours of recitation audio from over 600 distinct reciters, providing substantial variation in recitation styles, vocal characteristics, and recording conditions. This diversity makes Tadabur a comprehensive and representative resource for Quranic speech research and analysis. By significantly expanding both the total duration and variability of available Quran data, Tadabur aims to support future research and facilitate the development of standardized Quranic speech benchmarks.

Datasets citing this paper: 1

#### FaisaI/tadabur • Updated about 22 hours ago • 409k • 3.84k • 13
