Tag
This paper presents a systematic empirical study of fine-tuning pretrained Transformer models (Wav2Vec2.0, HuBERT, XLS-R) for Quranic Automatic Speech Recognition (ASR), achieving a WER of 0.08 on the EveryAyah subset and reducing training time from 140 to 40 hours, with Wav2Vec2-XLSR-53 providing the best representation.
This paper investigates whether the wav2vec2.0 architecture exhibits perceptual compensation for tonal context in Mandarin Chinese, finding limited evidence in the self-supervised model compared to human listeners and suggesting that supervised fine-tuning may be necessary for such phonological abstraction.
This paper evaluates nine ASR models (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets JASMIN and DART, finding that fine-tuned Whisper-medium achieves the best performance (WER 5.54% on JASMIN, 70.37% on DART). It also proposes a selection method to automatically identify correctly pronounced utterances with high precision, reducing the need for manual verification.
easyaligner is an open-source forced alignment library with GPU acceleration and flexible text normalization that works with all wav2vec2 models on Hugging Face Hub. It addresses practical workflows like handling partial transcripts, irrelevant speech segments, and long audio without chunking while preserving original text formatting.