Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English
Summary
This paper uses layer-wise probing to investigate how wav2vec 2.0 and Whisper encode consonant cluster reduction in African American English, finding that both models distinguish reduced and canonical forms and preserve cues to underlying stops.
# Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English

Source: [https://arxiv.org/html/2606.23948](https://arxiv.org/html/2606.23948)

Hamid Mojarad, Kevin Tang

1 Department of English Language and Linguistics, Institute of English and American Studies, Faculty of Arts and Humanities, Heinrich Heine University Düsseldorf, Germany
2 Department of Linguistics, University of Florida, United States of America

hamid.mojarad@hhu.de, kevin.tang@hhu.de

###### Abstract

Self-supervised and supervised speech models are increasingly used to investigate which linguistic information their internal representations encode, and at what level of abstraction they encode it. One underexplored phenomenon is consonant cluster reduction (CCR) in African American English (AAE), a widespread phonological process and a source of automatic speech recognition (ASR) disparity. To examine how CCR is represented, we conduct speaker-independent layer-wise probing of wav2vec2-base and Whisper-small using two tasks: segmental reduction detection and segmental restoration of underlying cluster identity. Both models distinguish reduced and canonical forms with high accuracy. Crucially, reduced segments retain cues to their underlying stops, indicating that CCR is encoded as structured gradient phonological variation rather than simple segmental deletion. These results demonstrate structured phonological encoding of AAE CCR patterns in modern speech models.
###### keywords:

speech encoders, interpretability, consonant cluster reduction, self-supervised model, supervised model, African American English

## 1 Introduction

Modern ASR systems exhibit significant performance disparities across demographic groups, with particularly high error rates for speakers of AAE, a rule-governed variety of English shaped by historical, social, and cultural factors [Mengesha_2021]. Multiple studies have documented racial bias in commercial ASR technologies, showing word error rates (WER) up to twice as high for AAE speakers compared to non-AAE speakers, even when controlling for age, gender, and content [koenecke2020racial,martin20_interspeech,Martin2023]. These disparities stem in part from training data imbalances favoring Mainstream American English and from insufficient representation of AAE phonological and morphosyntactic features [wassink2022uneven,Martin2023].

### 1.1 Under the hood of ASR

To address ASR bias effectively, it is essential to move beyond purely error-based metrics like WER and examine how self-supervised and supervised speech models internally encode dialectal phonological variation. Researchers have increasingly turned to interpretability methods, particularly linear probing of hidden layer representations, to uncover what occurs under the hood of these encoders [parra-2025-interpretable,Pasad_2021]. Self-supervised models such as wav2vec 2.0 [baevski2020wav2vec2] learn hierarchical speech representations from large amounts of unlabeled audio using a convolutional feature encoder followed by a Transformer encoder, trained with a contrastive objective over jointly learned quantized latent targets. This setup enables learning from unlabeled audio alone, eliminating the need for transcriptions.
In contrast, supervised models such as Whisper [radford2023robust] employ a fully supervised encoder-decoder Transformer architecture trained on approximately 680k hours of labeled audio-text data for direct end-to-end transcription. Probing and layer-wise analyses have addressed diverse linguistic aspects, including acoustic-phonetic properties [Pasad2023], phonetic categorization [cormac-english-etal-2022-domain], and accent/prosodic variation [yang23v_interspeech], alongside syllable structure [a-shams-etal-2024-uncovering] and phonemic restoration [Shams2025] in representations learned by wav2vec 2.0 and Whisper, yet dialect-specific phonological processes such as CCR in AAE remain largely unexplored.

### 1.2 Consonant cluster reduction

CCR in English is generally treated as a cross-dialectal feature rather than a phenomenon unique to any single community [Labov_1972,Schreier_2005]. It is a systematic phonological process in which word-final clusters are simplified in patterned ways [Wolfram_2017]. While reduction is shaped by both the following phonological context and properties of the cluster itself, dialects differ in how these constraints are weighted [Wolfram_2017,Guy_1991,Schreier_2005]. This cross-dialectal variation highlights CCR as a structured but dialect-sensitive process governed by interacting phonological constraints [Wolfram_2017]. CCR in AAE typically involves omission of the final stop in two-consonant clusters (e.g., *test* /tEst/ → [tEs]) or the penultimate consonant in a cluster of three (e.g., *fists* /fIsts/ → [fIs:]) [Erik_Baily_2015]. Previous work has shown that ethnicity-related dialect features, including CCR, lead to uneven ASR success across racial groups.
For instance, Wassink et al. [wassink2022uneven] evaluated a commercial ASR system on a multi-ethnic sample from the American Pacific Northwest (including AAE speakers) and found systematically higher phonetic error rates for nonwhite speakers, with dialectal phonological variation, including CCR, contributing to differential performance and highlighting racial bias in system output. Similarly, recent work on wav2vec 2.0 [mojarad25_interspeech] confirmed small but significant WER increases from CCR, which underscore its role in ASR disparity against AAE.

### 1.3 Present study

Building upon prior behavioral evidence of AAE-related bias in wav2vec 2.0 [mojarad25_interspeech], the present study aims to uncover the root of this bias in CCR by probing how it is internally encoded in wav2vec 2.0 and Whisper. Specifically, we conduct a two-fold probing investigation to assess whether the models treat CCR as a simple segmental deletion across layers, or whether they represent reduced realizations as closer to their canonical counterparts, in order to determine whether these internal representations can reveal the mechanisms underlying the observed AAE-related bias. We focus on frequent two-consonant clusters (e.g., /nt/, /nd/, /st/), and perform two domain-informed probes [cormac-english-etal-2022-domain] on frozen encoder representations:

- Segmental reduction detection: We test whether encoder representations distinguish reduced and canonical cluster pronunciations, thereby assessing whether the presence or absence of a final stop is explicitly encoded across model layers. This probe directly evaluates the phonetic sensitivity of each model to CCR.
- Segmental restoration: We examine whether reduced forms, such as nasal-only realizations of underlying /nt/ or /nd/ clusters sharing an initial nasal consonant, still carry subtle cues to the dropped final stop. This allows us to test whether the original identity of the deleted segment is internally reconstructed by speech encoders.

Data and code are available on OSF: [https://doi.org/10.17605/OSF.IO/FE2D7](https://doi.org/10.17605/OSF.IO/FE2D7).

## 2 Related Work

Probing speech encoders has emerged as the standard method for dissecting the linguistic hierarchy of speech understanding [Pasad2023]. For self-supervised models such as wav2vec 2.0, Pasad et al. [Pasad_2021] perform a detailed layer-wise analysis and show that early transformer layers are dominated by low-level acoustic cues, mid-layers maximize phonetic and phonological information, and higher layers increasingly reflect lexical and semantic structure. Complementary work [cormac-english-etal-2022-domain,kim24l_interspeech] applied phonetic and articulatory classification probes to wav2vec 2.0, showing that segmental categories and features (e.g., nasality, place) are most robustly encoded in intermediate layers. These findings support a hierarchical transformation from raw acoustics to increasingly abstract phonetic and lexical representations.

More recently, similar probing approaches have been applied to Whisper's supervised encoder. Studies on pathological and accented speech show that Whisper's mid-level encoder layers are particularly informative for phonetic and phonologically structured deviations from canonical speech. Batra et al. [Batra_2025] report that different types of stuttered disfluencies are best discriminated from fluent speech using mid-to-late Whisper layers, while Yue et al. [Yue_etal_2026] find that layers 13-15 of Whisper-medium yield peak performance for dysarthric speech detection and severity assessment.

Closer to our research, Gessinger et al. [Shams2025] investigate phonemic restoration in wav2vec 2.0 and Whisper.
Their study introduces controlled perturbations, including noise overlays, noisy gaps, or silent gaps, into English words and pseudowords, then probes the models' transformer encoder layers for reconstruction of articulatory features (place, manner, voicing). They test wav2vec 2.0 and Whisper on these degraded stimuli, which mimic real-world noise or interruptions. Linear probes across layers reveal wav2vec 2.0's superior recovery, especially for words over pseudowords via lexical context; noisy gaps prove most disruptive, followed by silent gaps. This design parallels our CCR analysis in AAE, where we expect natural deletions (e.g., /st/ → /s/) to create analogous "gaps" that test the models' ability to reconstruct canonical forms from context.

Despite prior work probing wav2vec 2.0 and Whisper for various related linguistic aspects, no probing study has specifically investigated CCR in AAE, a dialectal process whose surface deletions are systematic, so that the canonical phonology should be recoverable via lexical context. This phenomenon allows us to test whether wav2vec 2.0 and Whisper internally restore deleted segments akin to human listeners, or simply encode surface realizations. Our dual-probe approach, *segmental reduction detection* and *segmental restoration*, applied to natural AAE data, provides the first computational analysis of this widespread dialectal pattern.

## 3 Methodology

### 3.1 Data Preparation

#### 3.1.1 Corpus

The Corpus of Regional African American Language (CORAAL) [farrington_corpus_2021] serves as the foundational dataset for this study. The corpus provides rich linguistic resources, including audio recordings with time-aligned orthographic transcriptions in TextGrid format, featuring speaker-specific tiers at both utterance and word/phone alignment levels. For this research, we utilized three subcorpora (DCA, DCB, and DTA) comprising a total of 156 speakers (80 men and 76 women).
These subcorpora represent speakers from two distinct geographic regions, Washington, DC and Detroit, thereby incorporating regional variation into our dataset. In addition, each subcorpus includes speakers across four age groups and three socioeconomic classes. This design ensures diversity in terms of geography, gender, age, and social class, providing a broad and socially representative sample for examining CCR and supporting greater generalizability of the results.

#### 3.1.2 Feature extraction

To extract features related to CCR, we employed forced alignment, following the general approach of Kendall et al. [Kendall_ing_2021]. In their study, human annotations of the sociolinguistic variable (ING) were compared against forced alignment and classifiers trained by machine learning libraries, demonstrating that automated coding approaches can approximate human performance in categorizing ING variation. Building on this methodology, we adopted a pipeline based on forced alignment to automate feature extraction in our analysis. We used the Montreal Forced Aligner (MFA, version 2.2.17; [mcauliffe17_interspeech]) together with the Carnegie Mellon University (CMU) Pronouncing Dictionary. Words missing from the CMU dictionary were first identified, after which we trained a grapheme-to-phoneme (G2P) model on the CMU lexicon and generated pronunciations for these items. The resulting pronunciations were manually inspected and added to the dictionary. For words susceptible to CCR, we generated reduced pronunciation variants based on their canonical forms in the CMU dictionary; for example, the word *test* /tEst/ has a reduced form [tEs] with the final /t/ deleted. Using MFA's train command, we trained a custom acoustic model on the full audio datasets.
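The generation of reduced pronunciation variants described above can be illustrated with a minimal sketch. It assumes CMU-style ARPABET entries, in which vowels carry a stress digit (e.g. `EH1`) and consonants do not; the function name `reduced_variant` and the stop inventory are hypothetical choices for illustration, not the pipeline used in this study. The sketch only drops a word-final stop after another consonant and does not apply the later cluster or morphological exclusions.

```python
# Illustrative sketch only: derive a reduced variant from a CMU-style entry.
# Assumption: ARPABET vowels end in a stress digit (0/1/2); consonants do not.
STOPS = {"P", "B", "T", "D", "K", "G"}  # assumed inventory of deletable final stops


def is_vowel(phone):
    """Return True for ARPABET vowels, which carry a trailing stress digit."""
    return phone[-1].isdigit()


def reduced_variant(phones):
    """Return the pronunciation without its final stop if the word ends in a
    consonant + stop cluster; return None when no reduction applies."""
    if len(phones) < 3:
        return None
    c1, c2 = phones[-2], phones[-1]
    if c2 in STOPS and not is_vowel(c1):
        return phones[:-1]
    return None


# "test" T EH1 S T gets a reduced variant; "cat" K AE1 T has no final cluster.
print(reduced_variant(["T", "EH1", "S", "T"]))
print(reduced_variant(["K", "AE1", "T"]))
```

Both the canonical entry and the reduced variant would then be listed for the same word in the pronunciation dictionary, so that the aligner can choose whichever matches the audio better.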
The complete corpus was then force-aligned using this trained model in combination with the expanded CMU pronunciation dictionary, and the resulting alignments formed the basis for classifying words as canonical or reduced throughout the study.

#### 3.1.3 Selection strategy

MFA alignment across CORAAL's three subcorpora yielded over 85,000 tokens from CCR-prone two-consonant cluster words, distributed across 48 cluster types. Due to token scarcity in most types, we selected 12 high-frequency clusters, ensuring an internal balance between reduced and canonical realizations: /st/, /nd/, /md/, /nt/, /sk/, /mp/, /ft/, /St/, /t/, /pt/, /vd/, /zd/. We excluded /l/- and /r/-initial clusters (e.g., /ld/, /lt/, /rd/, /rt/), because they are subject to additional phonological processes beyond CCR. Specifically, post-vocalic /l/ may undergo L-vocalization (e.g., *cold* /koUld/ → [koUwd]), and when CCR also applies, this yields [koUw]. Post-vocalic /r/ may undergo R-deletion (e.g., *cart* /ka:rt/ → [ka:t]), and with CCR, [ka:]. These processes confound detection of pure CCR effects.

Following the theoretical framework of Thomas and Bailey [Erik_Baily_2015], we further refined the dataset by restricting the analysis to monomorphemic consonant clusters, excluding bimorphemic past-tense forms (e.g., *stunned* /stVnd/, *bussed* /bVst/). This restriction ensures that observed reduction patterns reflect phonological preferences rather than morphological conditioning. The resulting probing dataset comprises seven cluster types (/ft/, /nd/, /nt/, /st/, /sk/, /pt/, /mp/), with each cluster containing a roughly balanced number of reduced and canonical tokens. To mitigate lexical bias from high-frequency words within clusters (e.g., *just* dominating /st/, *and* dominating /nd/), we downsampled dominant word types to a maximum of 400 tokens per word (200 reduced + 200 canonical).
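As an illustration of the per-word cap just described, the following sketch randomly downsamples each (word, form) group to at most 200 tokens. The token layout (dictionaries with `word` and `form` keys), the function name `cap_tokens_per_word`, and the seed are assumptions made for this example; the sampling procedure actually used may differ.

```python
import random
from collections import defaultdict


def cap_tokens_per_word(tokens, cap_per_form=200, seed=42):
    """Keep at most `cap_per_form` reduced and `cap_per_form` canonical tokens
    per word type. `tokens` is a list of dicts with the (assumed) keys
    'word' and 'form', where form is 'reduced' or 'canonical'."""
    rng = random.Random(seed)  # fixed seed (assumed) so the sketch is repeatable
    groups = defaultdict(list)
    for tok in tokens:
        groups[(tok["word"], tok["form"])].append(tok)

    kept = []
    for key in sorted(groups):  # sorted for a deterministic traversal order
        group = groups[key]
        if len(group) > cap_per_form:
            group = rng.sample(group, cap_per_form)
        kept.extend(group)
    return kept


# Toy data: a dominant word ("just") and a rare one ("test").
toy = (
    [{"word": "just", "form": "reduced"}] * 500
    + [{"word": "just", "form": "canonical"}] * 300
    + [{"word": "test", "form": "reduced"}] * 10
)
capped = cap_tokens_per_word(toy)
print(len(capped))
```

Capping per form rather than per word keeps the reduced/canonical balance of a dominant word intact while leaving rare words untouched.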
This downsampling ensures that probing performance reflects cluster-level phonological contrasts rather than word-specific lexical memorization by the classifiers. The final dataset comprised 6,760 tokens (3,409 canonical, 3,351 reduced) across the 7 cluster types, as displayed in Table [1](https://arxiv.org/html/2606.23948#S3.T1). Token selection was additionally distributed across speakers to improve generalizability.

Table 1: AAE CCR balanced dataset summary

### 3.2 Models and representations

#### 3.2.1 Transformer-based models

This study probes representations from two prominent speech models: wav2vec2-base [baevski2020wav2vec2] and Whisper-small [radford2023robust]. Both models employ a convolutional feature extractor followed by 12 transformer layers with 768 embedding dimensions, enabling direct layer-wise comparison of how they encode phonological phenomena such as CCR.

wav2vec2-base is pretrained in a self-supervised manner on 960 hours of English LibriSpeech audio using a contrastive objective over quantized latent targets, learning representations from raw acoustics without any textual supervision. In contrast, Whisper-small is trained in a supervised multilingual multitask setting on approximately 680,000 hours of audio-text data (approximately 65% English ASR) in a sequence-to-sequence framework, exposing it to large amounts of transcript-aligned and cross-lingual supervision. We probe only Whisper's encoder because it extracts contextualized frame representations from raw spectrograms, while the decoder handles text generation [radford2023robust]. We use the *base* (wav2vec 2.0) and *small* (Whisper) variants because they have the same number of Transformer encoder layers and are both used in their pretrained form without any task-specific fine-tuning. Both models remain frozen during probing, with only the probing classifiers trained on extracted hidden states.
This setup lets us attribute differences in probe performance to pretraining strategies rather than model depth, adaptation, or downstream fine-tuning.

#### 3.2.2 Representation extraction

Hidden states were extracted from all 12 transformer layers of both wav2vec2-base and Whisper-small for the entire dataset. For each token, the full utterance containing the target consonant cluster was fed into the frozen encoder. Using MFA timestamps, temporal boundaries of the consonant cluster (C1C2) relative to utterance onset were identified and converted to frame indices based on each model's native 20 ms frame resolution. For wav2vec 2.0, we used the official facebook/wav2vec2-base checkpoint. For Whisper, we extracted representations exclusively from the encoder of openai/whisper-small, consistent with prior probing work. Aligned frames corresponding to the full cluster (CC) and the reduced onset (C1) were mean-pooled to obtain a single 768-dimensional representation per layer per token, following standard practice in speech probing [yang23v_interspeech,mohebbi-etal-2023-homophone]. This converts variable-length frame sequences (e.g., 5×768 for short clusters) into fixed-length token embeddings (1×768) while preserving phonetic alignment. For each model, this procedure yielded 12 layer-wise datasets of size (6760, 768).

### 3.3 Probing experiments

We performed two domain-informed probing tasks on the frozen representations of wav2vec2-base and Whisper-small: (1) *Experiment 1* as segmental reduction detection, and (2) *Experiment 2* as segmental restoration. We adopted a relatively simple architecture to reduce the influence of probe complexity, so that performance differences more directly reflect differences in the information encoded in the embeddings rather than the complexity of the probing model. Specifically, we used scikit-learn (v. 1.7.0) [Pedregosa_2011] multi-layer perceptrons (MLPs) comprising a single hidden layer of 200 ReLU neurons (hidden_layer_sizes=(200,)) followed by a single logistic output neuron for binary classification. The models used default hyperparameters except for the expanded hidden layer size, max_iter=1000, and random_state=42 for reproducibility. No hyperparameter tuning was performed on the probe architecture itself, further ensuring that differences in probe performance reflect differences in the underlying representations rather than probe optimization. For all subsequent probes, we used stratified K-fold cross-validation (K=4) with speaker-independent splits (no overlap between training and test speakers). Because this constraint limited how tokens could be distributed across folds, test sets comprised approximately 20-30% of tokens rather than exactly 25%.

#### 3.3.1 Experiment 1 (segmental reduction detection)

For the first domain-informed probe, we trained layer-wise MLPs to distinguish reduced vs. canonical CCR realizations across our dataset. The probe was trained and tested on both reduced and canonical tokens. To maximize generalizability while minimizing lexical and cluster-type biases, we evaluated performance under three dataset conditions:

1. Imbalanced: From the 6,760 available tokens, we retained 6,698 such that each cluster type was internally balanced (reduced = canonical), while preserving the natural frequency distribution of CCR cluster types. This condition maintains *ecological validity in cluster prevalence* while controlling for reduction frequency within each cluster.
2. Balanced: All seven clusters were further downsampled to match the cluster type with the fewest available tokens (N = 676 total). In this condition, clusters are equal in size and each remains internally balanced (reduced = canonical).
This controls for *quantity biases*, where frequent clusters might inflate performance due to probe overfitting rather than true CCR detection. Whenever possible, tokens were sampled from different words and speakers to enhance generalizability.
3. Per-cluster analysis: Each of the seven cluster types was probed independently using all available tokens for that cluster, with internal balancing maintained. This isolates *cluster-conditioned effects*, as prior AAE research shows that CCR patterns vary by cluster type [Erik_Baily_2015,Bayley_Villarreal_2019], motivating separate analyses alongside aggregate results.

This multi-level analysis reveals how the models perform CCR detection across varying data quantities, cluster type compositions, and individual phonological environments.

Figure 1: Segmental reduction detection - imbalanced vs. balanced for all cluster types (4-fold CV)

#### 3.3.2 Experiment 2 (segmental restoration)

For the second domain-informed probe, we selected /nt/ and /nd/ tokens from our 7 cluster types because they share the same C1 (nasal /n/) while differing only in C2 (/t/ vs. /d/). This eliminates C1 variation as a confound, ensuring the probe tests pure C2 recovery (coronal stop identity) from reduced nasal-only input (/n/). In addition, both cluster types had sufficient tokens in reduced and canonical forms to support robust probing. This probe tested whether models' representations of reduced nasal-only segments (/n/ from reduced /nt, nd/ clusters) contain sufficient segmental restoration cues to recover canonical cluster identity (/nt/ vs. /nd/). To mitigate the influence of highly frequent words in these two cluster types, we further limited each word to a maximum of 100 tokens (50 reduced, 50 canonical), ensuring a roughly balanced contribution of frequent and less frequent words. We conducted three analyses as follows:

1. Reduced-only train: Train the probe on embeddings of C1 (the nasal) extracted from reduced /nt, nd/ tokens and test it on embeddings from canonical /nt, nd/ forms. The goal is to determine whether reduced tokens retain cluster-specific information. In other words, we test whether a reduced nasal from /nt/ correctly predicts canonical /nt/ (and a reduced nasal from /nd/ predicts canonical /nd/), rather than collapsing across categories.
2. Canonical-only train: Train and test on CC embeddings from canonical /nt, nd/ tokens. This is expected to reveal better performance when /t, d/ articulation is fully present.
3. C1-only train: Train on C1 (nasal) embeddings from canonical /nt, nd/ tokens, and test on full CC embeddings from canonical /nt, nd/. This probe tests whether the nasal alone in the training set provides /t, d/ cues when the full cluster is available at test time.

Table [2](https://arxiv.org/html/2606.23948#S3.T2) shows the train and test set sizes for each probe across the four cross-validation folds. Test sizes are reduced to ensure speaker-independent splits, resulting in fewer tokens than the total available per category.

Table 2: Train and test sizes per 4-fold CV split

#### 3.3.3 Coarticulatory encoding probe

Following the main probing experiments, we conducted a supplementary analysis to examine whether the models' representations encode coarticulatory information between C1 and C2. This investigation was motivated by the possibility that the relatively multifaceted layer-wise patterns observed in reduction detection (Section [4.1.1](https://arxiv.org/html/2606.23948#S4.SS1.SSS1)) might reflect coarticulatory cues preserved in C1, rather than a simple segmental deletion distinction between reduced and canonical forms. We used the same speaker-independent 4-fold CV splits as in the imbalanced condition of the segmental reduction detection probe (Table [2](https://arxiv.org/html/2606.23948#S3.T2)).
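The speaker-independence constraint on these splits can be sketched as follows. This is a simplified, standard-library illustration rather than the splitting code used in the study: the function `speaker_independent_folds` is hypothetical, it greedily balances fold sizes only, and it does not stratify by class label. scikit-learn's `StratifiedGroupKFold` is one library option that combines grouping with stratification.

```python
from collections import Counter


def speaker_independent_folds(speakers, k=4):
    """Build k (train_idx, test_idx) splits so that no speaker appears in both
    the training and test part of a split. `speakers` holds one speaker ID per
    token. Greedy size balancing only; class stratification is not handled."""
    counts = Counter(speakers)
    fold_of = {}
    fold_sizes = [0] * k
    # Place the largest speakers first, each into the currently smallest fold.
    for spk, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        target = fold_sizes.index(min(fold_sizes))
        fold_of[spk] = target
        fold_sizes[target] += n

    splits = []
    for fold in range(k):
        test_idx = [i for i, s in enumerate(speakers) if fold_of[s] == fold]
        train_idx = [i for i, s in enumerate(speakers) if fold_of[s] != fold]
        splits.append((train_idx, test_idx))
    return splits


# Toy example: five speakers with unequal token counts.
toy_speakers = ["a"] * 5 + ["b"] * 3 + ["c"] * 3 + ["d"] * 2 + ["e"] * 1
toy_splits = speaker_independent_folds(toy_speakers, k=4)
for train_idx, test_idx in toy_splits:
    print(len(train_idx), len(test_idx))
```

Because whole speakers are assigned to folds, test sets cannot in general be exactly one quarter of the tokens, which is consistent with the 20-30% range reported above.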
To test whether C1 preserves such coarticulatory cues, we adapted the gating paradigm from psycholinguistics [grosjean1980], in which listeners hear progressively longer portions of a word until identification becomes possible. Here, we operationalized this by extracting C1 frames in four incremental portions (25%, 50%, 75%, and 100%) based on forced alignment timestamps, then mean-pooling each portion to obtain fixed-dimensional representations. For each portion, we trained MLP classifiers (with the same architecture as described in Section 3.3) to predict reduced versus canonical CCR forms without including any C2 frames. This design allows us to assess how much information about the reduction contrast is recoverable from C1 alone, and whether coarticulatory information increases as more of C1 becomes available.

Figure 2: Coarticulatory encoding - C1 portion accuracy per layer across both models (4-fold CV)

## 4 Results

### 4.1 Experiment 1

#### 4.1.1 Imbalanced vs. balanced

Figure [1](https://arxiv.org/html/2606.23948#S3.F1) illustrates the layer-wise distribution of information in both wav2vec 2.0 and Whisper representations for the imbalanced and balanced datasets described in Section [3.3.1](https://arxiv.org/html/2606.23948#S3.SS3.SSS1). Both wav2vec2-base and Whisper-small clearly distinguish reduced from canonical tokens, with accuracies consistently above chance (50%). This suggests that both models exhibit similar sensitivity to this phenomenon. For the imbalanced dataset, wav2vec2-base accuracy gradually increases in the early layers, peaks at layer 5, then slightly declines before rising again at layer 9, followed by a moderate drop after layer 10 toward the final layers. The initial theoretical interpretation is that this task captures a phonetic distinction, differentiating tokens with and without the final stop.
This view aligns with the characterization of phonetic information by Pasad et al. [Pasad2023], in which it increases from the early layers toward layer 5, exhibits moderate variation, peaks around layer 9, and then sharply declines after layer 10. Our results generally follow this pattern, although in our plot the accuracy peaks more strongly at layer 5 than at layer 9, whereas they report a higher peak at layer 9 than at layer 5. Additionally, the decline after layer 10 in our data is more moderate than what they observed. In contrast, prior analyses of Whisper's encoder (particularly for larger models) report a rising-then-plateau pattern for phonetic tasks, with performance increasing toward mid-layers and then remaining relatively stable without a sharp decline toward higher layers [Batra_2025,Agaoglu2024,Yue_etal_2026]. Whisper-small, while less studied, appears to follow this rising-then-plateau trajectory in our imbalanced dataset.

For the balanced dataset, wav2vec 2.0 shows a somewhat unstable pattern: accuracy plateaus in the early layers, peaks around layers 4-5, drops sharply toward layer 7, rises again around layers 8-9, and then declines toward the final layers. Likewise, Whisper does not follow the expected pattern, displaying an early drop at the initial layers (layer 2), a peak at layer 6, and failing to maintain a plateau in the upper layers. On top of this, the overall classification accuracy of the MLP probe is substantially lower in the balanced setting, accompanied by higher standard deviations in both models, particularly for Whisper (SD ≈ 4%). We attribute this decrease in performance primarily to the drastic reduction in training data. Whereas the imbalanced dataset contained 6,698 tokens, the balanced version included only 676 tokens, approximately one tenth of the original size, due to downsampling all cluster types to match the least frequent category (around 100 tokens per cluster).
This substantial reduction in training data likely limited the probe's ability to reliably distinguish reduced from canonical forms. Moreover, in natural speech, certain cluster types occur far more frequently than others, and ASR models are trained on similarly skewed distributions. Artificially downsampling all clusters to the least frequent type therefore creates an evaluation setting that is not representative of the distributional properties the models were trained on. For this reason, and given the clear performance degradation under extreme data reduction, we report only the results obtained from the imbalanced dataset in the remainder of the study.

Figure 3: Per-cluster analysis: test accuracy per cluster type (mean (%) ± SD, 4-fold CV)

To further investigate the multifaceted representation of CCR observed in the imbalanced condition, especially for wav2vec 2.0, which cannot be fully attributed to a phonetic distinction (Figure [1](https://arxiv.org/html/2606.23948#S3.F1)), we implemented the *coarticulatory encoding probe* mentioned in Section [3.3.3](https://arxiv.org/html/2606.23948#S3.SS3.SSS3). We hypothesized that the reduction detection task may not involve a simple phonetic distinction between reduced and canonical forms, but rather the encoding of coarticulatory or contextual information arising from interaction between the two consonants. This interpretation is consistent with Pasad et al. [Pasad_2021], who argue that the layer-wise evolution of wav2vec 2.0 representations follows the linguistic hierarchy of speech understanding, with early layers encoding acoustic features, followed by phonetic information, and higher layers capturing word identity and semantic information.
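For reference, the gating-style extraction from Section 3.3.3 can be sketched with plain Python lists standing in for one layer's hidden states. The 20 ms frame step comes from Section 3.2.2; the rounding of alignment boundaries to frames, the rule for how many frames make up each portion, and the function names are assumptions for illustration and may differ from the released code.

```python
FRAME_MS = 20  # native frame resolution of both encoders (Section 3.2.2)


def frame_span(start_sec, end_sec):
    """Map an aligned interval to a half-open frame range [first, last).
    Assumption: floor the start, ceil the end, and keep at least one frame."""
    start_ms = round(start_sec * 1000)
    end_ms = round(end_sec * 1000)
    first = start_ms // FRAME_MS
    last = max(first + 1, -(-end_ms // FRAME_MS))  # ceiling division
    return first, last


def mean_pool(frames):
    """Average a list of equal-length frame vectors into one vector."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def gated_c1_embeddings(hidden, c1_start, c1_end, portions=(25, 50, 75, 100)):
    """Return {portion: pooled vector} using only the first portion% of the
    C1 frames; no C2 frames are ever included."""
    first, last = frame_span(c1_start, c1_end)
    c1_frames = hidden[first:last]
    gated = {}
    for pct in portions:
        n_keep = max(1, (len(c1_frames) * pct + 99) // 100)  # ceil, at least 1
        gated[pct] = mean_pool(c1_frames[:n_keep])
    return gated


# Toy "layer": 10 frames of dimension 2 instead of 768.
toy_hidden = [[float(i), 1.0] for i in range(10)]
toy_gated = gated_c1_embeddings(toy_hidden, c1_start=0.10, c1_end=0.18)
print(toy_gated[25], toy_gated[100])
```

Each pooled vector would then be passed to the same probe architecture, with one probe trained per portion and per layer.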
Using the gating-style probe described in Section [3.3.3](https://arxiv.org/html/2606.23948#S3.SS3.SSS3), we found that both wav2vec 2.0 and Whisper achieved high classification accuracy based solely on C1 representations, with performance increasing as larger portions of C1 were included and peaking when 100% of C1 was available. These results strongly suggest that contextual or coarticulatory information is encoded in the models' representations, which may help explain the multifaceted representation observed across the hidden layers. Notably, Whisper demonstrates relatively higher performance, as illustrated in Figure [2](https://arxiv.org/html/2606.23948#S3.F2), indicating stronger contextual encoding effects.

#### 4.1.2 Per cluster analysis

Figure [3](https://arxiv.org/html/2606.23948#S4.F3) illustrates the layer-wise information encoded in the wav2vec 2.0 and Whisper representations for each cluster type described in Section [3.3.1](https://arxiv.org/html/2606.23948#S3.SS3.SSS1), with the corresponding dataset sizes indicated on the left. For high-frequency clusters (/st/, /nd/, and /nt/), wav2vec 2.0 again exhibits multiple rises in classification accuracy: an early increase in layers 3-5, followed by a slight decline and plateau, a renewed rise around layers 8-10, and a subsequent decline, mirroring the pattern observed in Figure 1 for the imbalanced condition. Notably, /st/ shows the highest accuracy in the early layers, a pattern discussed from a linguistic perspective in the Discussion section. In contrast, Whisper exhibits a rising-then-plateau pattern that again parallels the pattern observed for the imbalanced condition in Figure 1, with peak performance varying across cluster types and /st/ achieving comparatively higher accuracy.
Overall, both models show similar classification accuracy, with wav2vec 2\.0 exhibiting slightly higher peaks for two of the three clusters \(all but /nd/\), suggesting comparable effectiveness in detecting CCR with plausible levels of variability \(SD\)\. The difference between the models becomes more pronounced for low\-frequency clusters \(/ft/, /sk/, /mp/, /pt/\)\. Although estimates in these conditions are inherently noisier due to limited data, Whisper shows substantially larger advantages over wav2vec 2\.0 across most of these clusters, except for /mp/, where wav2vec 2\.0 performs better\. However, standard deviations for these low\-frequency clusters are notably high in both models, indicating substantial performance variability across cross\-validation folds\. Nonetheless, Whisper's consistently stronger means under data scarcity suggest more robust generalization for CCR detection, likely owing to its large\-scale weakly supervised training\.

Figure 4: Segmental restoration probe \(/nt/ vs /nd/\) \- all variants \(4\-fold CV\)

### 4\.2 Experiment 2

Figure [4](https://arxiv.org/html/2606.23948#S4.F4) summarizes the results of the segmental restoration probe\. In contrast to Experiment 1, both models now exhibit smoother, largely single\-peaked accuracy curves across layers\. For the reduced\-only train probe, wav2vec 2\.0 shows relatively low accuracy in the earliest layers, followed by a steady increase that peaks around layer 9, before dropping sharply after layer 10\. Whisper follows a similar trajectory, but with a steeper rise toward the peak at layer 9, followed by a more moderate decline and a subsequent plateau toward the final layers\. The standard deviations are broadly comparable across models, with slightly higher variability for wav2vec 2\.0 in the early layers\. This probe is theoretically more challenging than the other two shown in the figure, since the reduced forms lack C2, requiring the model to infer the corresponding canonical cluster\.
Consequently, higher variability across splits is expected, consistent with the observed SD patterns\. Within each model, the three conditions \(reduced\-only, canonical\-only, and C1\-only trains\) exhibit highly similar layer\-wise profiles and comparable peak accuracies \(≈93\-95% for wav2vec 2\.0\-base; ≈94\-96% for Whisper\-small\), indicating that Whisper restores phonological information with slightly higher accuracy overall\. The consistent behavior across all three probes for each model suggests that the reduced\-only probe performed as expected\. As anticipated, CC and C1 embeddings from canonical tokens \(canonical\-only and C1\-only trains\) reliably recover their corresponding canonical forms\. Importantly, this pattern indicates that reduced embeddings for both cluster types retain information about their canonical counterparts\. Both models appear to encode these reductions not merely as surface\-level segmental differences, but as systematic phonological variants of the corresponding canonical forms\.

Crucially, because the coarticulatory effect between C1 and C2 is held constant across cluster types in this probe \(both /nt/ and /nd/ share the same C1\), the substantially higher accuracies observed here \(peaking at 93\.5% for wav2vec2\-base and 95% for Whisper\-small\) indicate that the lower performance in Experiment 1 \(70\-80%; Section [4\.1](https://arxiv.org/html/2606.23948#S4.SS1)\) should not be interpreted as a failure of the models\. Instead, it reflects the fact that CCR tokens retain substantial information about the dropped stop: as shown by both the segmental restoration and the gating\-based probes, reduced tokens remain very close to their canonical counterparts in the models’ embedding spaces, so there is no sharp phonetic boundary for a simple MLP to exploit in a binary reduced/canonical classification\.
In other words, CCR is treated as a case of segmental restoration of an underlying stop, with reduced realizations forming a continuum of gradience rather than a clean\-cut contrast between full and reduced pronunciations\. This gradient pattern, and its cluster\-type\-specific manifestations, is discussed in more detail from a linguistic perspective in the Discussion section\.

The single\-peaked, mid\-to\-late layer profiles we observe in the segmental restoration probes align with earlier work on phonetic probing of wav2vec 2\.0, which reports maximal phonetic category and feature information in intermediate transformer layers \[cormac\-english\-etal\-2022\-domain,delafuente24\_interspeech,kim24l\_interspeech\]\. English et al\. \[cormac\-english\-etal\-2022\-domain\] probe layer\-wise wav2vec 2\.0 embeddings with a simple classifier and show that the transformer’s contextualization enhances phonetic detail, yielding the highest accuracies for phoneme and articulatory feature classification in intermediate layers \(layer 9\)\. For Whisper, our findings align with recent layer\-wise probing studies showing that encoder representations capture phonetic and phonological properties\. Classifiers trained on these embeddings reliably detect phonetically deviant speech, typically peaking in mid\-to\-late encoder layers \[Batra\_2025,Yue\_etal\_2026\]\. Closer to our approach, Gessinger et al\. \[Shams2025\] demonstrate phonemic restoration in wav2vec 2\.0 and Whisper using controlled deletions \(noise/silent gaps\)\. Like their artificial “gaps”, our natural CCR deletions \(/nt/ → /n/\) create analogous disruptions that models successfully restore\. Our segmental restoration probe mirrors this pattern: reduced vs\. canonical CCR tokens achieve high accuracies in both wav2vec 2\.0 and Whisper, encoding cluster reduction as phonologically structured variation rather than mere absence\. This is akin to human listeners achieving phonological abstraction without explicit phoneme representations \[Scharenborg2013\], where context\-specific allophones rather than canonical phonemes serve as prelexical processing units, enabling robust word recognition from gradient input\.

### 4\.3 Control probes

We verified the reliability of our results for both Experiment 1 and Experiment 2 probes and ruled out potential artifacts due to data leakage or overfitting\. To do so, we conducted control experiments on both the imbalanced dataset \(Experiment 1\) and the reduced\-only train probe \(Experiment 2\)\. In each case, reduced/canonical labels were randomly shuffled while preserving original train/test splits, speaker independence, and token\-cluster distributions\. MLP probes trained on these shuffled labels yielded a baseline performance floor of 46\-53%, confirming that the observed accuracies exceed chance and reflect phonetic information encoded in the models’ representations\.

## 5 Discussion

In this paper, we examined how layer\-wise probing results can inform our understanding of phonological structure, focusing in particular on consonant cluster reduction, coronal stop deletion, and broader patterns in consonant cluster typology\. While the Results section emphasized the computational behavior of the models, tracing how representations evolved across layers and experimental conditions, the present discussion reframes these findings through a linguistic lens\.

Work on CCR in AAE shows that reduction is not a single, fixed process but a highly conditioned one\. Thomas and Bailey \[Erik\_Baily\_2015,wolframThomas\] in particular maintain that several structural factors repeatedly emerge as key in CCR production: syllable type, cluster type, and the following environment\.
In other words, reduction is more common in unstressed than in stressed syllables, more common in homorganic clusters \(where C1 and C2 share voicing and place, e\.g\., cold, fist\) than in non‑homorganic ones \(mint, felt\), and more common in monomorphemic clusters \(wound, bust\) than in bimorphemic clusters where the final stop is a separate morpheme \(bussed, stunned\)\. The segment that follows the cluster also strongly conditions reduction: across English varieties, deletion is more frequent before consonant‑initial words \(first place\) than before vowel‑initial words \(first area\), and least frequent before a pause, with a further hierarchy among following consonants \(most deletion before obstruents, then nasals/liquids, then glides\)\. This following‑segment effect can be captured in terms of sonority: the less sonorous the following sound, the more likely cluster simplification becomes\.

Bayley and Villarreal \[Bayley\_Villarreal\_2019\] also make this structure especially explicit by distilling Labov’s \[Labov\_1989\] findings into six hierarchically ordered constraints for coronal stop deletion \(\>\> denotes more likely to be deleted\):

1. Stress: Unstressed \>\> stressed syllables\.
2. Cluster length: CCC \>\> CC\.
3. Preceding segment: /s/ \>\> stops \>\> nasals \>\> fricatives \>\> liquids\.
4. Grammatical status: n't \>\> part of stem \>\> derivational suffix \>\> past tense/participle\.
5. Following segment: obstruents \>\> liquids \>\> glides \>\> vowels \>\> pauses\.
6. Voicing agreement: Adjacent segments share voicing \(homovoiced\) \>\> different voicing\.

Within this framework, the “preceding segment” constraint corresponds to the C1\-C2 interaction in our study: clusters with a highly constrained obstruent C1 \(especially /s/\) create a particularly favorable environment for C2 deletion, whereas clusters with more sonorous or less constrained C1s \(nasals, liquids\) resist deletion more strongly\.
Put differently, the manner and place of C1 modulate how much coarticulatory pressure is placed on C2, and thus how distinct reduced and canonical tokens remain in the signal\. This maps directly onto our seven cluster types and provides a linguistically grounded lens for interpreting the cluster‑wise probe patterns we observe\.

The per\-cluster results in Figure [3](https://arxiv.org/html/2606.23948#S4.F3) align closely with phonetic and phonological accounts of CCR production, revealing principled differences in how wav2vec 2\.0\-base and Whisper\-small encode CCR\. Following the C1 constraint hierarchy from Bayley & Villarreal \[Bayley\_Villarreal\_2019\], our alveolar clusters \(/nt/, /nd/, /st/\) show robust performance across models \(low\-70s to low\-80s accuracy\), with negligible performance gaps\. /st/ clusters yield the highest accuracies for both models, as predicted: /s/ \+ stop provides the strongest CCR context \[Wolfram\_article,Thomas\_2007,Erik\_Baily\_2015\]\. /nt/ and /nd/ follow with solid but lower performance, matching their intermediate position in the preceding segment constraint \(s \>\> other stops \>\> nasals \>\> other fricatives\) from Bayley & Villarreal \[Bayley\_Villarreal\_2019\]\. This /st/ \>\> /nt/, /nd/ gradient mirrors the deletion probability hierarchy \(/s/ \+ stop strongest, nasal\-stop intermediate\), suggesting that both models capture the third constraint of Bayley & Villarreal \[Bayley\_Villarreal\_2019\] \(the C1 manner effect\) in their mid\-layer phonetic representations\.

In contrast, low\-frequency clusters \(/ft/, /sk/, /pt/\) exhibit a different pattern\. These clusters are likewise considered homovoiced \(with C1 and C2 sharing the same voicing\) and are reported to be more susceptible to reduction \[Green\_2002\]\. Moreover, Bayley and Villarreal \[Bayley\_Villarreal\_2019\] predict that clusters beginning with /s/ \(/sk/\) should be most prone to reduction, followed by /pt/ \(stop–stop\), and then /ft/ \(fricative–stop\)\.
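The predicted ordering can be made concrete with a small lookup. The sketch below encodes only the preceding\-segment constraint as quoted in this section \(/s/ \>\> stops \>\> nasals \>\> fricatives \>\> liquids\) and ranks cluster types by the manner class of their C1. The table `C1_CLASS` and both function names are hypothetical helpers introduced for illustration; the other five constraints, and any interaction among them, are deliberately ignored.

```python
# Preceding-segment constraint as listed in the text:
# earlier in the list = more likely to trigger deletion of C2.
PRECEDING_SEGMENT_ORDER = ["/s/", "stop", "nasal", "fricative", "liquid"]

# Hypothetical lookup: manner class of C1 for the seven cluster types
# analysed in the study (/st/, /sk/, /pt/, /nt/, /nd/, /mp/, /ft/).
C1_CLASS = {
    "s": "/s/",
    "p": "stop",
    "n": "nasal",
    "m": "nasal",
    "f": "fricative",
}


def deletion_rank(cluster):
    """Return the rank of a cluster such as '/sk/' (0 = most deletion-prone)."""
    c1 = cluster.strip("/")[0]
    return PRECEDING_SEGMENT_ORDER.index(C1_CLASS[c1])


def order_by_predicted_deletion(clusters):
    """Sort clusters from most to least deletion-prone under this one constraint."""
    return sorted(clusters, key=deletion_rank)


if __name__ == "__main__":
    print(order_by_predicted_deletion(["/ft/", "/sk/", "/pt/"]))
```

Because the ranking uses a single constraint, ties such as /nt/, /nd/, and /mp/ \(all nasal\-initial\) are left in input order; separating them would require the voicing and place information discussed elsewhere in this section.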
Interestingly, /sk/ yields the highest accuracy in both wav2vec 2\.0 and Whisper, consistent with the expectation that stronger reduction leads to more phonetically distinct reduced vs\. canonical classes, which are easier for the MLP probe to separate\. Whisper achieves particularly high accuracy for /sk/, suggesting that when reduction is more salient, Whisper captures it more precisely\. For /pt/, the expectation is met for Whisper, which attains higher peak accuracy than for /ft/\. In contrast, wav2vec 2\.0 shows lower performance on /pt/ than on /ft/\. This discrepancy is likely attributable to the limited amount of training data available for /pt/ \(only 86 tokens\), which may disproportionately affect wav2vec 2\.0’s probe performance under low\-resource conditions\. For /ft/, which has comparatively more tokens, both models detect reduction reliably\. Overall, Whisper appears to maintain stronger performance than wav2vec 2\.0 even for low\-resource CCR cluster types, suggesting greater robustness under data scarcity\.

/mp/ \(90 tokens\), classified as non\-homorganic by Thomas & Bailey \[Erik\_Baily\_2015\] \(voiced /m/ \+ voiceless /p/\), is expected to resist reduction most strongly among all clusters\. While wav2vec 2\.0 behaves differently, showing an early peak of 80\.2% \(layer 1\) followed by a steady decline to 72\.6%, Whisper aligns more closely with the expectations of Thomas and Bailey \[Erik\_Baily\_2015\], exhibiting the lowest accuracy among all clusters analyzed so far for this non\-homorganic type with low susceptibility to reduction\. This suggests that Whisper’s representations may more closely mirror human perceptual patterns in this case\. Moreover, the scarcity of training data for this cluster yields high variability in performance \(SD ≈ 13\.4% for wav2vec 2\.0 and SD ≈ 11\.8% for Whisper\), indicating that the results for /mp/ may be less reliable\.
This instability likely reflects data sparsity and may also be influenced by factors discussed earlier, such as effects of the following word’s onset, stress patterns, and contextual coarticulation\.

Taken together, these results allow for a linguistically grounded interpretation of our probing findings\. Overall, the differences observed across clusters do not appear to reflect arbitrary model behavior\. Instead, they closely align with well\-established phonetic and phonological pressures that shape cluster production and reduction\.

## 6 Conclusion

This study set out to determine whether self\-supervised and supervised models treat CCR in AAE as a low\-level segmental deletion phenomenon, or whether they encode it as structured, gradient phonological variation with preserved cues to underlying forms\. Layer\-wise probing on 6,760 CORAAL tokens revealed that both wav2vec 2\.0\-base and Whisper\-small encode CCR in a phonologically structured manner\. In the segmental reduction detection probe, the models reliably distinguished reduced from canonical realizations \(accuracies typically 70\-80%\)\. Layer\-wise patterns partially aligned with established phonetic hierarchies: wav2vec 2\.0 showed multiple rises \(strong peak around layer 5, renewed increase near layer 9\) with a moderate late decline, broadly consistent with the literature but with a sharper mid\-layer emphasis and gentler late drop than previously reported\. Our gating paradigm analysis of C1 portion embeddings suggested that these differences arise from strong coarticulatory effects in CCR, where preceding segments preserve cues to missing stops\. Whisper\-small exhibited a rising\-then\-plateau pattern, closely mirroring patterns in phonetic probing of supervised encoders\. Cluster types were also treated differently by the models due to various interacting effects, one of which is the preceding segment \(C1\)\.
The preceding segment strongly affects how CCR takes place and how much reduction occurs, and this differs from one cluster type to another\. Reduction detection is therefore harder than it would be for simple, complete segmental deletion, because CCR is a gradient phenomenon shaped by cluster\-specific phonological constraints\. The segmental restoration probe provided decisive evidence of phonological encoding: reduced tokens retained robust cues to underlying stop identity \(peak accuracies 93\-96%\), with single\-peaked mid\-to\-late layer profiles that again resembled intermediate\-layer optima for segmental feature recovery in prior work\. These results indicate that neither the self\-supervised nor the supervised model normalizes CCR as complete deletion; both represent it as systematic, contextually licensed phonological variation, with reduced realizations remaining close to their canonical counterparts in the embedding space\.

## 7 Limitations and future work

One limitation of this study is that we evaluated only two models, wav2vec 2\.0 and Whisper\. As a result, it remains unclear to what extent our findings generalize to other speech representation models\. In addition, certain CCR cluster types \(e\.g\., /mp/, /pt/\) were underrepresented in the data due to their lower frequency in naturally occurring AAE speech\. This imbalance may have decreased the robustness of our analyses for these rarer and often more complex clusters, while potentially inflating performance for more frequent clusters such as /nt/ and /nd/\. Furthermore, our analysis was restricted to two\-consonant clusters in short, monomorphemic words\. We did not examine three\-consonant clusters \(e\.g\., /skt/, /sts/\), which are known to undergo reduction more frequently and may reveal stronger differences in model behavior\. Similarly, we excluded bimorphemic forms, which could offer additional insight into how morphological structure interacts with CCR\.
Future research should extend this investigation to additional speech models, such as HuBERT \[hubert\], WavLM \[wavlm\], and MMS \[MMS\], and compare base models with larger or fine\-tuned variants \(e\.g\., wav2vec2\-large or Whisper\-large\)\. This would help determine whether the patterns observed here are architecture\-specific or more general across model families and scales\.

## 8 Acknowledgments

This work is part of HM's PhD research\. The authors would like to thank Erfan Amirzadeh Shams for his helpful contribution in refining the research concept and for his technical guidance on the implementation\. The authors also thank the anonymous reviewers for their insightful comments and constructive feedback, which helped improve the quality of this work\.

## 9 Generative AI Use Disclosure

No Generative AI tools were used to produce the content or results of this paper\. Perplexity\.ai and Grok were used for English grammar checking, improving sentence readability, suggesting relevant literature, and code refining or debugging\.

## References