Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models
Summary
This paper evaluates demographic and accent biases in phoneme-based ASR systems, specifically WhisperIPA and ZIPA, using phoneme error rate and a new Soft PER metric, revealing persistent disparities across languages and groups.
# Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models

Source: [https://arxiv.org/abs/2606.11639](https://arxiv.org/abs/2606.11639) [View PDF](https://arxiv.org/pdf/2606.11639)

> Abstract: The popularization of automatic speech recognition (ASR) systems has increased exploration of the demographic biases related to race, age, gender, and accent, often formed from imbalanced training data. Most of these studies focused on standard grapheme-based ASR systems with comparatively little emphasis on phoneme-based systems, such as models that produce International Phonetic Alphabet (IPA) representations. As ASR systems shift toward multilingual support and low-resource language modeling, IPA-based layers serve as a critical, language-agnostic foundation. In this study, we evaluate the performance of two state-of-the-art open-source ASR systems, WhisperIPA and ZIPA, that generate IPA transcriptions across diverse accents and language sources. Our evaluation includes existing multilingual speech corpora and demographically annotated English-language corpora. We measure model performance by comparing model-generated IPA transcriptions against grapheme-to-phoneme (G2P) systems using both standard phoneme error rate (PER) and a proposed Soft PER metric that tolerates linguistically similar phoneme substitutions. Our analysis examines how performance varies across languages and demographic groups such as gender, accent, ethnicity, and age, revealing persistent disparities even after accounting for acceptable phonemic variation. These findings provide insight into potential sources of bias and inform the development of more inclusive and linguistically robust phoneme-based ASR systems. Our code and data will be made publicly available to the community.
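
The abstract names two metrics, standard PER and a Soft PER that tolerates linguistically similar substitutions, but does not give their formulas. The following is a minimal sketch of how such a pair of metrics could be computed, not the authors' implementation. It assumes PER is the Levenshtein distance between reference and hypothesis phoneme sequences divided by the reference length, and that a "soft" variant charges zero cost for substitutions within a similarity class. The `SIMILAR` groups, the zero cost, and all function names are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence

# Hypothetical similarity classes; the paper's actual groupings are not given.
SIMILAR = [{"ɹ", "r", "ɾ"}, {"ɛ", "e"}, {"ɑ", "a"}]


def hard_cost(a: str, b: str) -> float:
    """Standard substitution cost: 0 for identical phonemes, 1 otherwise."""
    return 0.0 if a == b else 1.0


def soft_cost(a: str, b: str) -> float:
    """Assumed soft cost: substitutions inside one similarity class are free."""
    if a == b or any(a in group and b in group for group in SIMILAR):
        return 0.0
    return 1.0


def edit_distance(ref: Sequence[str], hyp: Sequence[str],
                  sub_cost: Callable[[str, str], float]) -> float:
    """Levenshtein distance over phoneme tokens with a pluggable substitution cost."""
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [float(i)]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1.0,               # deletion of a reference phoneme
                curr[j - 1] + 1.0,           # insertion of a hypothesis phoneme
                prev[j - 1] + sub_cost(r, h) # match or substitution
            ))
        prev = curr
    return prev[-1]


def per(ref: Sequence[str], hyp: Sequence[str],
        sub_cost: Callable[[str, str], float] = hard_cost) -> float:
    """Phoneme error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp, sub_cost) / len(ref)


reference = ["h", "ɛ", "l", "oʊ"]   # e.g. G2P output for "hello"
hypothesis = ["h", "e", "l", "o"]   # e.g. model-generated IPA

print(per(reference, hypothesis))             # standard PER
print(per(reference, hypothesis, soft_cost))  # soft PER under the assumed classes
```

In this toy pair, the `ɛ`/`e` substitution is counted as an error by the standard cost but tolerated by the soft cost, while `oʊ`/`o` is penalized by both because it falls outside the assumed classes. A fractional cost (for example 0.5) for similar substitutions would be an equally plausible design.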
## Submission history

From: Catherine Bao Bao [[view email](https://arxiv.org/show-email/800ed856/2606.11639)]
**[v1]** Wed, 10 Jun 2026 04:00:44 UTC (209 KB)
Similar Articles
Your Multimodal Speech Model Says I Have a Face for Radio
This paper presents the first bias evaluation of multimodal speech recognition models, finding significant accuracy differences across gender and ethnicity when pairing faces with audio, with implications for fairness in AI systems.
Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions
This paper evaluates nine ASR models (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets JASMIN and DART, finding that fine-tuned Whisper-medium achieves the best performance (WER 5.54% on JASMIN, 70.37% on DART). It also proposes a selection method to automatically identify correctly pronounced utterances with high precision, reducing the need for manual verification.
Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition
This paper proposes evaluating speech articulation synthesis using phoneme recognition with articulatory features, addressing limitations of traditional metrics like point-wise distance. Experiments on a single-speaker RT-MRI dataset show the approach captures phonetic nuances and improves assessment.
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
This paper presents a benchmark evaluating five commercial ASR systems on code-switching speech across Arabic-English, Persian-English, and German-English pairs, using a two-stage pipeline to select 300 samples per pair and assessing performance with WER and BERTScore. ElevenLabs Scribe v2 achieves the lowest overall WER (13.2%) and the highest BERTScore (0.936). The dataset is publicly available.
Phonetic Modeling of Dialectal Variation in Vietnamese Speech
This paper proposes a dialect-aware phonetic framework for modeling phonetic variation in Vietnamese ASR, decomposing syllables into structured components and mapping them to dialect-specific IPA representations. The approach matches pretrained baselines with fewer parameters and no external pretraining on the UIT-ViMD multi-dialect dataset.