Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models

arXiv cs.CL Papers

Summary

This paper evaluates demographic and accent biases in phoneme-based ASR systems, specifically WhisperIPA and ZIPA, using phoneme error rate and a new Soft PER metric, revealing persistent disparities across languages and groups.

arXiv:2606.11639v1 Announce Type: new

Abstract: The popularization of automatic speech recognition (ASR) systems has increased exploration of the demographic biases related to race, age, gender, and accent, often formed from imbalanced training data. Most of these studies focused on standard grapheme-based ASR systems with comparatively little emphasis on phoneme-based systems, such as models that produce International Phonetic Alphabet (IPA) representations. As ASR systems shift toward multilingual support and low-resource language modeling, IPA-based layers serve as a critical, language-agnostic foundation. In this study, we evaluate the performance of two state-of-the-art open-source ASR systems, WhisperIPA and ZIPA, that generate IPA transcriptions across diverse accents and language sources. Our evaluation includes existing multilingual speech corpora and demographically annotated English-language corpora. We measure model performance by comparing model-generated IPA transcriptions against grapheme-to-phoneme (G2P) systems using both standard phoneme error rate (PER) and a proposed Soft PER metric that tolerates linguistically similar phoneme substitutions. Our analysis examines how performance varies across languages and demographic groups such as gender, accent, ethnicity, and age, revealing persistent disparities even after accounting for acceptable phonemic variation. These findings provide insight into potential sources of bias and inform the development of more inclusive and linguistically robust phoneme-based ASR systems. Our code and data will be made publicly available to the community.


# Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models
Source: [https://arxiv.org/abs/2606.11639](https://arxiv.org/abs/2606.11639)
[View PDF](https://arxiv.org/pdf/2606.11639)

> Abstract: The popularization of automatic speech recognition (ASR) systems has increased exploration of the demographic biases related to race, age, gender, and accent, often formed from imbalanced training data. Most of these studies focused on standard grapheme-based ASR systems with comparatively little emphasis on phoneme-based systems, such as models that produce International Phonetic Alphabet (IPA) representations. As ASR systems shift toward multilingual support and low-resource language modeling, IPA-based layers serve as a critical, language-agnostic foundation. In this study, we evaluate the performance of two state-of-the-art open-source ASR systems, WhisperIPA and ZIPA, that generate IPA transcriptions across diverse accents and language sources. Our evaluation includes existing multilingual speech corpora and demographically annotated English-language corpora. We measure model performance by comparing model-generated IPA transcriptions against grapheme-to-phoneme (G2P) systems using both standard phoneme error rate (PER) and a proposed Soft PER metric that tolerates linguistically similar phoneme substitutions. Our analysis examines how performance varies across languages and demographic groups such as gender, accent, ethnicity, and age, revealing persistent disparities even after accounting for acceptable phonemic variation. These findings provide insight into potential sources of bias and inform the development of more inclusive and linguistically robust phoneme-based ASR systems. Our code and data will be made publicly available to the community.
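
The abstract does not define Soft PER beyond saying that it tolerates linguistically similar phoneme substitutions, so the following Python sketch is only one plausible reading, not the authors' implementation. It computes standard PER as the Levenshtein distance between a reference phoneme sequence (for example, G2P output) and a model hypothesis, divided by the reference length. The "soft" variant assumes that substitutions between phonemes in a small similarity set cost nothing. The names `SIMILAR_PAIRS`, `hard_cost`, `soft_cost`, and `per`, as well as the phoneme pairs themselves, are hypothetical and chosen purely for illustration.

```python
from typing import Callable, Sequence

# Hypothetical similarity pairs for illustration only; the paper's actual
# tolerance set is not given in the abstract.
SIMILAR_PAIRS = {frozenset(p) for p in [("ɹ", "r"), ("e", "ɛ"), ("t", "ɾ")]}


def hard_cost(a: str, b: str) -> float:
    """Standard PER: every mismatching substitution costs 1."""
    return 0.0 if a == b else 1.0


def soft_cost(a: str, b: str) -> float:
    """Assumed Soft PER: substitutions within SIMILAR_PAIRS are free."""
    if a == b or frozenset((a, b)) in SIMILAR_PAIRS:
        return 0.0
    return 1.0


def edit_distance(
    ref: Sequence[str],
    hyp: Sequence[str],
    sub_cost: Callable[[str, str], float],
) -> float:
    """Levenshtein distance with unit insert/delete cost and a pluggable
    substitution cost, computed one row at a time."""
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [float(i)]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1.0,              # deletion of r
                    curr[j - 1] + 1.0,          # insertion of h
                    prev[j - 1] + sub_cost(r, h),  # match or substitution
                )
            )
        prev = curr
    return prev[-1]


def per(
    ref: Sequence[str],
    hyp: Sequence[str],
    sub_cost: Callable[[str, str], float] = hard_cost,
) -> float:
    """Phoneme error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp, sub_cost) / len(ref)


# Toy example: a G2P-style reference for "water" versus a flapped hypothesis.
reference = ["w", "ɔ", "t", "ɚ"]
hypothesis = ["w", "ɑ", "ɾ", "ɚ"]

print(per(reference, hypothesis))             # 0.5  (two substitutions)
print(per(reference, hypothesis, soft_cost))  # 0.25 (t/ɾ tolerated, ɔ/ɑ not)
```

A graded cost between 0 and 1, for instance one derived from articulatory feature distance, would be an equally reasonable design; the binary set is used here only because it keeps the example short.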

## Submission history

From: Catherine Bao Bao [[view email](https://arxiv.org/show-email/800ed856/2606.11639)]

**[v1]** Wed, 10 Jun 2026 04:00:44 UTC (209 KB)

Similar Articles

Your Multimodal Speech Model Says I Have a Face for Radio

arXiv cs.CL

This paper presents the first bias evaluation of multimodal speech recognition models, finding significant accuracy differences across gender and ethnicity when pairing faces with audio, with implications for fairness in AI systems.

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

arXiv cs.CL

This paper evaluates nine ASR models (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets JASMIN and DART, finding that fine-tuned Whisper-medium achieves the best performance (WER 5.54% on JASMIN, 70.37% on DART). It also proposes a selection method to automatically identify correctly pronounced utterances with high precision, reducing the need for manual verification.

Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition

arXiv cs.CL

This paper proposes evaluating speech articulation synthesis using phoneme recognition with articulatory features, addressing limitations of traditional metrics like point-wise distance. Experiments on a single-speaker RT-MRI dataset show the approach captures phonetic nuances and improves assessment.

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

arXiv cs.CL

This paper presents a benchmark evaluating five commercial ASR systems on code-switching speech across Arabic-English, Persian-English, and German-English pairs, using a two-stage pipeline to select 300 samples per pair and assessing performance with WER and BERTScore. ElevenLabs Scribe v2 achieves the lowest overall WER (13.2%) and the highest BERTScore (0.936). The dataset is publicly available.

Phonetic Modeling of Dialectal Variation in Vietnamese Speech

arXiv cs.CL

This paper proposes a dialect-aware phonetic framework for modeling phonetic variation in Vietnamese ASR, decomposing syllables into structured components and mapping them to dialect-specific IPA representations. The approach matches pretrained baselines with fewer parameters and no external pretraining on the UIT-ViMD multi-dialect dataset.