Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback

arXiv cs.CL Papers

Summary

This paper investigates whether LLMs provide grounded pronunciation feedback to L2 English learners, finding that LLMs often rely on stereotypes and prior knowledge rather than acoustic evidence, leading to inaccurate but coherent feedback.

arXiv:2606.15325v1 Announce Type: new Abstract: Large language models are increasingly deployed for written pronunciation feedback in second-language (L2) English learning, under the assumption that their diagnoses are grounded in the supplied speech evidence rather than in priors from pretraining. This assumption is tested on 1,800 L2-Arctic utterances spanning six L1 backgrounds, three audio-capable LLMs, four pronunciation dimensions, and five evidence conditions ranging from a text-only baseline to numeric acoustic features and raw audio. Each (utterance x model x condition x dimension) cell is scored on three metrics: Rating Accuracy (RA) against gold labels, Evidence Coherence (EC) assessing internal consistency without ground truth, and Grounded Correctness (GC) evaluated against gold evidence. Results show three findings across models. First, rating accuracy and grounded reasoning decouple: 39.6% of judged cells contain internally coherent reasoning that supports a wrong rating, against only 15.8% where the reasoning supports a correct rating. Second, phoneme-level feedback converges to a fixed inventory of L2-English difficulty phones that recurs across all six L1 backgrounds and all evidence conditions. Third, acoustic evidence improves the rating only when the supplied feature directly probes the target dimension: textualised F0 range raises pitch-variation grounding from (0.18-0.19) to (0.45-0.62) across all three models, while stress and phoneme correctness, which require target-to-realisation alignment, remain ungrounded. The same audio waveform without textualised F0 values does not reproduce this improvement. These findings indicate that current general-purpose LLMs are more reliable as verbalisers of externally computed pronunciation evidence than as standalone diagnostic engines.

Cached at: 06/16/26, 11:46 AM

# Prior over Evidence: Stereotype-Driven Diagnosis in LLM-Based L2 Pronunciation Feedback
Source: [https://arxiv.org/html/2606.15325](https://arxiv.org/html/2606.15325)
Rong Wang \(University of Tuebingen, Tuebingen, Germany; rong\.wang@uni\-tuebingen\.de\), Kun Sun \(Tongji University, Shanghai, China; kunsun@tongji\.edu\.cn\)

###### Abstract

Large language models are increasingly deployed for written pronunciation feedback in second\-language \(L2\) English learning, under the assumption that their diagnoses are grounded in the supplied speech evidence rather than in priors from pretraining\. We test this assumption on 1,800 L2\-Arctic utterances \(300 per L1\) spanning six L1 backgrounds, three audio\-capable LLMs \(Gemini 3\.0 Flash, GPT\-4o, and Qwen 3\.5 Omni\-plus\), four pronunciation dimensions, and five evidence conditions ranging from a text\-only baseline to numeric acoustic features and raw audio\. Each \(utterance × model × condition × dimension\) cell is scored on three metrics: Rating Accuracy \(RA\) against gold labels, Evidence Coherence \(EC\) assessing internal consistency without ground truth, and Grounded Correctness \(GC\) evaluated against gold evidence\. Three findings hold across models\. First, rating accuracy and grounded reasoning decouple: 39\.6% of judged cells contain internally coherent reasoning that supports a wrong rating, against only 15\.8% where the reasoning supports a correct rating\. Second, phoneme\-level feedback collapses onto a fixed inventory of L2\-English difficulty phones \(/θ/, /ð/, /ɹ/, /v/\) that recurs across all six L1 backgrounds and all evidence conditions\. Third, acoustic evidence improves the rating only when the supplied feature directly probes the target dimension: textualised F0 range lifts pitch\-variation grounding from 0\.18–0\.19 to 0\.45–0\.62 across all three models, while stress and phoneme correctness, which require target\-to\-realisation alignment, remain ungrounded\. The same audio waveform without textualised F0 values does not reproduce the lift\. We conclude that current general\-purpose LLMs are more reliable as verbalisers of externally computed pronunciation evidence than as standalone diagnostic engines\.


## 1 Introduction

Computer\-assisted pronunciation training \(CAPT\) systems increasingly delegate the written\-feedback step to large language models \(LLMs\) Jeon et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib8)\); Zhong et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib28)\); Fu et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib6)\)\. The promise is concrete\. Instead of returning a single goodness\-of\-pronunciation score Witt and Young \([2000](https://arxiv.org/html/2606.15325#bib.bib23)\) that a learner cannot act on, the model can read aligned phonemes, integrate acoustic measurements, and produce human\-readable diagnoses Li et al\. \([2017](https://arxiv.org/html/2606.15325#bib.bib12)\); Wang et al\. \([2025](https://arxiv.org/html/2606.15325#bib.bib22)\)\. The broader SpeechLM literature has begun to outline a roadmap toward “superhuman” speech understanding, in which LLMs are expected not only to process raw audio but also to reason over its semantic and paralinguistic content Bu et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib2)\); Cui et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib4)\)\. These works treat this capability as operational and assume the LLM grounds its feedback in the supplied evidence rather than in priors stored from pretraining text\.

This assumption is rarely tested directly\. A grounded model and a prior\-driven model are externally indistinguishable: both produce confident, fluent prose\. The operative difference is whether the feedback varies with the speaker’s actual production\. A grounded model adapts to the utterance; a prior\-driven model issues stereotyped advice, such as warnings about /θ/ or /ɹ/, regardless of what the speaker produced\. Misdirected feedback is more costly than no feedback because it consumes learner attention that would otherwise go to the real error Jeon et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib8)\)\.

We address three primary research questions:

- RQ1\. Does providing structured evidence \(IPA, acoustic features, or raw audio\) improve per\-dimension rating accuracy over a baseline?
- RQ2\. When a model generates structured explanations alongside its rating, does this evidence genuinely justify the diagnosis when validated against ground truth?
- RQ3\. To what extent are diagnostic errors driven by speaker demographic priors versus L1\-independent pedagogical stereotypes?

We evaluate three audio\-capable LLMs on 1,800 L2\-Arctic utterances Zhao et al\. \([2018](https://arxiv.org/html/2606.15325#bib.bib27)\) across five evidence conditions, reporting Rating Accuracy \(RA\), Evidence Coherence \(EC\), and Grounded Correctness \(GC\) per \(model × condition × dimension\) cell; each response contains both a rating and typed evidence, letting us separate label correctness from explanation grounding\. Our contributions are threefold: \(i\) an evaluation framework \(RA, EC, GC\) decoupling label correctness from explanation grounding across five evidence conditions; \(ii\) empirical evidence that 39\.6% of evaluated instances exhibit internally coherent but factually incorrect diagnoses driven by a fixed L2\-stereotype inventory; and \(iii\) a per\-dimension analysis investigating whether and when explicit acoustic evidence can successfully counteract these priors\.

## 2 Related Work

#### LLM\-based pronunciation feedback\.

Classical CAPT pipelines rely on goodness\-of\-pronunciation scores derived from forced\-alignment posteriors Witt and Young \([2000](https://arxiv.org/html/2606.15325#bib.bib23)\), later extended by neural mispronunciation detection and diagnosis \(MDD\) models that produce per\-phone error labels Li et al\. \([2017](https://arxiv.org/html/2606.15325#bib.bib12)\); Yan et al\. \([2023](https://arxiv.org/html/2606.15325#bib.bib24)\)\. Recent work couples these signal\-level systems with LLMs along three lines\. The first uses an LLM to generate articulatory\-level explanations on top of MDD outputs Zhong et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib28)\); the second prompts a multimodal LLM directly for pronunciation scores Fu et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib6)\); Wang et al\. \([2025](https://arxiv.org/html/2606.15325#bib.bib22)\); the third paraphrases acoustic features into textual prosodic descriptions before passing them to the LLM Chen et al\. \([2025](https://arxiv.org/html/2606.15325#bib.bib3)\); Qian et al\. \([2025](https://arxiv.org/html/2606.15325#bib.bib18)\)\. Our `text+acoustic` condition belongs to the third family: we supply F0 minimum, F0 maximum, duration, and intensity range as numeric text alongside the IPA transcript\. We replicate the textual\-evidence advantage on pitch variation reported by Chen et al\. \([2025](https://arxiv.org/html/2606.15325#bib.bib3)\) but show that the same advantage does not generalise to stress or phoneme correctness, and that supplying the audio waveform alone does not reproduce the gain on pitch variation, where the textual form does\. None of these prior systems separates rating accuracy from explanation grounding; our three\-metric framework is designed to fill that gap\.

#### Faithfulness and grounding\.

Faithfulness of generated explanations to model decision processes is a long\-standing concern in NLP Jacovi and Goldberg \([2020](https://arxiv.org/html/2606.15325#bib.bib7)\); Maynez et al\. \([2020](https://arxiv.org/html/2606.15325#bib.bib15)\); Atanasova et al\. \([2023](https://arxiv.org/html/2606.15325#bib.bib1)\)\. Reference\-match metrics such as FActScore Min et al\. \([2023](https://arxiv.org/html/2606.15325#bib.bib16)\) and AlignScore Zha et al\. \([2023](https://arxiv.org/html/2606.15325#bib.bib26)\) verify claims against supplied sources but assume the source is ground truth\. Turpin et al\. \([2023](https://arxiv.org/html/2606.15325#bib.bib21)\) show that chain\-of\-thought rationales can systematically misrepresent the factors driving a prediction\. Our Grounded Correctness metric extends this concern to structured speech feedback: we evaluate the rating, the cited evidence, and the reason jointly against external gold annotations, rather than against any single reference\. The 39\.6% confabulation rate we observe is the speech\-feedback analogue of the unfaithful\-CoT finding\.

#### Parametric priors and demographic bias\.

LLMs and speech systems both carry priors that can override input evidence\. Question\-answering studies show that models often fall back on parametric knowledge when the supplied context is weak Petroni et al\. \([2019](https://arxiv.org/html/2606.15325#bib.bib17)\); Mallen et al\. \([2023](https://arxiv.org/html/2606.15325#bib.bib14)\); Tao et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib20)\); Kassner and Schütze \([2020](https://arxiv.org/html/2606.15325#bib.bib9)\)\. In speech technology, automatic speech recognition exhibits higher error rates for non\-native and minority\-dialect speakers Koenecke et al\. \([2020](https://arxiv.org/html/2606.15325#bib.bib10)\)\. We observe both effects\. When the prompt lacks the acoustic feature that probes the target dimension, the model falls back on demographic labels or on a stored “L2 English problems” inventory\. The phoneme stereotype we document is one such inventory: /θ/, /ð/, /ɹ/, and /v/ dominate the over\-claimed phones across all six L1 backgrounds we test, despite the contrastive\-analysis tradition predicting L1\-specific substitution patterns Lado \([1957](https://arxiv.org/html/2606.15325#bib.bib11)\); Eckman \([1977](https://arxiv.org/html/2606.15325#bib.bib5)\); Swan and Smith \([2001](https://arxiv.org/html/2606.15325#bib.bib19)\)\.

## 3 Methodology

The design separates three components that pronunciation\-feedback studies often conflate: the speech material, the evidence supplied to the model, and the metrics used to judge correctness and grounding\. Figure [1](https://arxiv.org/html/2606.15325#S3.F1) gives an overview\.

### 3\.1 Dataset and gold targets

We use L2\-Arctic Zhao et al\. \([2018](https://arxiv.org/html/2606.15325#bib.bib27)\), a read\-speech corpus of 24 non\-native speakers across six L1 backgrounds \(Arabic, Hindi, Korean, Mandarin, Spanish, Vietnamese\) with human\-verified TextGrid annotations for phone\-level errors and lexical stress\. From the human\-verified portion, we sample 300 utterances per L1 \(1,800 utterances total\) by speaker\-balanced round\-robin sampling: within each L1 the procedure shuffles the available speakers, then iteratively draws one utterance from each speaker until the target is reached\. Speaker balance prevents a small number of speakers from dominating the acoustic or error distribution\.
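The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `round_robin_sample`, the `(speaker_id, utterance_id)` input format, the fixed seed, and the per-speaker shuffling of utterances are all assumptions introduced here.

```python
import random
from collections import defaultdict


def round_robin_sample(utterances, target, seed=0):
    """Speaker-balanced round-robin sample for ONE L1 group.

    utterances: list of (speaker_id, utterance_id) pairs (hypothetical format).
    target: number of utterances to draw (300 per L1 in the paper).
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for speaker, utt in utterances:
        by_speaker[speaker].append(utt)

    # Shuffle the speaker order, as described in the text.
    speakers = sorted(by_speaker)
    rng.shuffle(speakers)
    # Assumption: utterances within a speaker are also drawn in random order.
    for speaker in speakers:
        rng.shuffle(by_speaker[speaker])

    sample = []
    while len(sample) < target:
        drew_any = False
        for speaker in speakers:
            if len(sample) >= target:
                break
            if by_speaker[speaker]:
                sample.append((speaker, by_speaker[speaker].pop()))
                drew_any = True
        if not drew_any:  # every speaker is exhausted before the target
            break
    return sample


# Toy pool: 4 speakers with 100 utterances each.
pool = [(f"spk{s}", f"utt{s}_{i}") for s in range(4) for i in range(100)]
picked = round_robin_sample(pool, 300)
```

With four equally sized speakers and a target of 300, each speaker contributes 75 utterances, which is the balance property the text relies on.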

We define a gold rating and gold evidence target along four dimensions\. For `fluency` and `pitch_variation` \(three\-class: `slow`/`normal`/`fast`; `monotone`/`normal`/`varied`\), reference labels come from words per second and from F0 range, each binned by within\-corpus z\-score at ±0\.5 SD\. For `stress_correctness` and `phoneme_correctness` \(binary\), labels come from the L2\-Arctic TextGrids: stress is positive if any stress\-bearing vowel in the utterance differs from canonical stress; phoneme is positive if any phone is annotated as substitution, deletion, or addition\. Gold evidence for the binary dimensions is the validated set of errored stressed vowels or errored phones\. On the scorable subset the gold positive rate is 3\.8% for stress and 96\.3% for phoneme; the corresponding always\-positive Detection\-F1 baselines are 0\.07 and 0\.98, which motivates the auxiliary grounding metrics in §[3\.4](https://arxiv.org/html/2606.15325#S3.SS4)\.
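Two pieces of arithmetic in this paragraph can be made concrete. The sketch below bins a continuous measure by within-corpus z-score at ±0.5 SD and reproduces the always-positive F1 baselines from the stated positive rates. The helper names are hypothetical, and the use of the population standard deviation and of strict inequalities at the cut points are assumptions, since the text does not specify them.

```python
import statistics


def zscore_bin(values, labels=("slow", "normal", "fast"), cut=0.5):
    """Three-class binning by within-corpus z-score (cut at +/- 0.5 SD)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population SD
    binned = []
    for v in values:
        z = (v - mean) / sd if sd > 0 else 0.0
        if z < -cut:
            binned.append(labels[0])
        elif z > cut:
            binned.append(labels[2])
        else:
            binned.append(labels[1])
    return binned


def always_positive_f1(positive_rate):
    """F1 of a detector that always predicts the positive class.

    Precision equals the positive rate and recall equals 1,
    so F1 = 2p / (p + 1).
    """
    return 2 * positive_rate / (positive_rate + 1)


stress_baseline = round(always_positive_f1(0.038), 2)   # 0.07
phoneme_baseline = round(always_positive_f1(0.963), 2)  # 0.98
toy_bins = zscore_bin([1.0, 2.0, 3.0, 4.0, 5.0])
```

The 0.98 phoneme baseline shows why a high rating score on that dimension says little about grounding: a model that flags every utterance already scores near the ceiling.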

### 3\.2 Evidence conditions

We evaluate five conditions that vary the prompt’s evidence package while holding the response format fixed \(Table [1](https://arxiv.org/html/2606.15325#S3.T1)\)\. `text-only` supplies the target sentence, speaker L1, and gender; `text+ipa` adds the canonical IPA transcript; `text+acoustic` adds the numeric acoustic features as text; `audio-only` supplies the raw waveform alongside IPA but withholds the numeric features; `audio+acoustic` provides the waveform and the numeric features together\. Acoustic fields are supplied as primitive measurements \(duration, F0 minimum, F0 maximum, intensity range\) rather than as pre\-computed diagnostic labels \(e\.g\., words per second, F0 range\), so the model must perform a derivation to use them\. This design choice differs from Chen et al\. \([2025](https://arxiv.org/html/2606.15325#bib.bib3)\), who paraphrase the same measurements into textual prosodic descriptions, and prevents the grounded conditions from becoming label\-copying tasks Tao et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib20)\)\.
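The derivation that the model is expected to perform on the primitive measurements is small. A minimal sketch, assuming whitespace tokenisation for the word count and a hypothetical helper name `derive_cues` (neither is specified in the text):

```python
def derive_cues(sentence, duration_s, f0_min, f0_max):
    """Derive the two diagnostic cues from primitive measurements.

    sentence: target sentence text.
    duration_s: utterance duration in seconds.
    f0_min, f0_max: F0 extremes in whatever unit the prompt supplies.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    n_words = len(sentence.split())  # assumption: whitespace tokenisation
    return {
        "words_per_second": n_words / duration_s,
        "f0_range": f0_max - f0_min,
    }


cues = derive_cues("the quick brown fox", 2.0, 100.0, 220.0)
```

F0 range needs one subtraction, while speaking rate needs a word count plus a division.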

Table 1: Five Evidence Conditions

![Refer to caption](https://arxiv.org/html/2606.15325v1/roadmap.png)

Figure 1: Evaluation framework\. L2\-Arctic utterances from six L1 backgrounds are passed to three multimodal LLMs under five evidence conditions\. Each response contains a rating and typed evidence for four pronunciation dimensions\. Finally, Rating Accuracy \(RA\), Evidence Coherence \(EC\), and Grounded Correctness \(GC\) are evaluated for each model\-condition\-dimension cell\.

### 3\.3 Models and prompting protocol

We evaluate three multimodal LLMs: `google/gemini-3.0-flash`, `openai/gpt-4o` for text conditions paired with `openai/gpt-4o-audio` for audio, and `qwen3.5 omni-plus`\. All three models are tested on all five conditions, which share the same structured\-output schema\. For each utterance, the model returns a JSON object containing a `ratings` field \(one categorical label per dimension\) and an `evidence` field \(a scalar value or list of phones/stressed vowels per dimension, with a one\-sentence reason\)\. The four dimensions are presented in a per\-utterance random order seeded by a deterministic hash, removing position\-of\-listing as a confound\. We use API temperature 0 and retry malformed responses up to three times\. The full system and user prompts are in Appendix [A](https://arxiv.org/html/2606.15325#A1)\.
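One way to obtain a per-utterance dimension order that is random across utterances yet reproducible across runs is to seed a local random generator from a hash of the utterance identifier. The sketch below assumes SHA-256 and a hypothetical `dimension_order` helper; the paper states only that the order is seeded by a deterministic hash.

```python
import hashlib
import random

DIMENSIONS = [
    "fluency",
    "pitch_variation",
    "stress_correctness",
    "phoneme_correctness",
]


def dimension_order(utterance_id):
    """Return the four dimensions in a reproducible per-utterance order."""
    # Python's built-in hash() is salted per process, so a stable digest
    # is used instead (the choice of SHA-256 is an assumption).
    digest = hashlib.sha256(utterance_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    order = DIMENSIONS.copy()
    random.Random(seed).shuffle(order)
    return order


first = dimension_order("example_utterance_0001")
second = dimension_order("example_utterance_0001")
```

Seeding from the utterance identifier lets every model and condition see the same order for a given utterance, so order effects do not differ between cells.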

### 3\.4 Evaluation metrics

We report three metrics per \(model × condition × dimension\) cell, designed to separate three questions: did the model assign the correct label, is its explanation internally coherent, and is the explanation valid against gold evidence?

#### Rating Accuracy \(RA\)\.

RA compares the model’s predicted rating against the gold label\. For three\-class dimensions we report macro\-F1,

$$\text{RA}_{\text{cat3}} = \frac{1}{3}\sum_{c\in\mathcal{C}} \frac{2P_{c}R_{c}}{P_{c}+R_{c}}, \tag{1}$$

with $\mathcal{C}=\{\text{slow},\text{normal},\text{fast}\}$ or $\{\text{monotone},\text{normal},\text{varied}\}$\. For binary dimensions we report positive\-class F1, $\text{RA}_{\text{bin}} = 2P_{1}R_{1}/(P_{1}+R_{1})$, the standard choice for imbalanced detection tasks\. We do not macro\-average across dimensions, because they differ in label structure and base rate\.
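Both RA variants can be computed from paired gold and predicted labels. The following is a minimal pure-Python sketch; the function names are hypothetical, and returning 0 for a class with no true positives is a common convention assumed here rather than one stated in the paper.

```python
def f1_for_class(gold, pred, cls):
    """Per-class F1 = 2PR / (P + R) for one target class."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0  # assumption: undefined F1 is scored as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def ra_cat3(gold, pred, classes):
    """Macro-F1 over the three classes, as in Eq. (1)."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


def ra_bin(gold, pred):
    """Positive-class F1 for the binary dimensions."""
    return f1_for_class(gold, pred, 1)


gold3 = ["slow", "normal", "fast", "normal"]
pred3 = ["slow", "slow", "fast", "normal"]
macro = ra_cat3(gold3, pred3, ("slow", "normal", "fast"))  # 7/9
binary = ra_bin([1, 0, 1, 1], [1, 1, 0, 1])                # 2/3
```

In the toy three-class case the per-class F1 values are 2/3, 2/3, and 1, so the macro average is 7/9.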

#### Evidence Coherence \(EC\)\.

EC measures whether the model’s explanation is internally coherent with the evidence it cites\. The judge sees the rating, the cited evidence, and the reason, but *not* the gold label\. EC is scored on three bins, \{0, 0\.5, 1\.0\}, by an LLM judge with the prompt template in Appendix [A](https://arxiv.org/html/2606.15325#A1)\. EC is high when the reason follows from the cited evidence on its own terms, even if the rating happens to be wrong against ground truth\.

#### Grounded Correctness \(GC\)\.

GC measures whether the explanation remains valid when gold evidence is considered\. The LLM judge sees everything in EC plus the gold label and the dimension\-specific gold evidence \(duration and speaking\-rate cues for fluency, F0\-based cues for pitch variation, errored stressed vowels for stress, errored phones for phoneme\)\. To keep GC parallel to EC, we report it on the same three\-point scale, \{0, 0\.5, 1\.0\}: 1\.0 denotes a correct rating with valid evidence\-based reasoning, 0\.5 a correct rating with decorative, unsupported, or stereotype\-driven reasoning, and 0\.0 an incorrect, unsupported response\. This shared scale makes the EC/GC cross\-tab in §[4\.1](https://arxiv.org/html/2606.15325#S4.SS1) directly interpretable and separate from the confabulation threshold defined below\.

We use `google/gemini-2.5-pro` as the judge; RA is computed deterministically, while the judge scores EC and GC only\. The structured evidence packages and operationally defined three\-bin rubric mitigate the subjectivity typical of LLM\-as\-judge settings \(Ye et al\., [2024](https://arxiv.org/html/2606.15325#bib.bib25)\)\.

Because RA is an F1 score \(macro\-averaged for the three\-class dimensions, positive\-class for the binary ones\) while EC and GC are means over an ordinal three\-point scale \(\{0, 0\.5, 1\.0\}\), their absolute values are not on a common scale and should not be directly compared or subtracted\. The three metrics are nonetheless complementary: high RA with low GC reveals that correct labels are assigned without grounded reasoning, and high EC with low GC is the signature of confabulation\.

#### Confabulation rate\.

We additionally report a derived rate that combines EC and GC\. We label a cell as having a coherent reason when EC ≥ 0\.7 and a grounded rating when GC ≥ 0\.7 \(both correspond to the top bin on each judge scale\)\. The confabulation rate is the share of cells with a coherent reason but an ungrounded rating:

$$\text{Conf} = \frac{\big|\{c : \mathrm{EC}_{c} \geq 0.7 \,\wedge\, \mathrm{GC}_{c} < 0.7\}\big|}{|\mathcal{C}|}, \tag{2}$$

where $\mathcal{C}$ here denotes the set of judged cells rather than the class set of Eq\. \(1\)\. This identifies cells that look right under reference\-match scoring but fail under ground\-truth\-aware scoring, the failure mode that motivated Turpin et al\. \([2023](https://arxiv.org/html/2606.15325#bib.bib21)\) in the chain\-of\-thought setting\.
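The confabulation rate is a simple count over judged cells. A minimal sketch, assuming each cell is represented as an `(ec, gc)` pair of judge scores (the function name and data layout are illustrative):

```python
def confabulation_rate(cells, threshold=0.7):
    """Share of cells with a coherent reason but an ungrounded rating.

    cells: iterable of (ec, gc) judge scores, each in {0, 0.5, 1.0}.
    A cell counts as confabulated when ec >= threshold and gc < threshold.
    """
    cells = list(cells)
    if not cells:
        return 0.0
    confabulated = sum(ec >= threshold and gc < threshold for ec, gc in cells)
    return confabulated / len(cells)


toy_cells = [
    (1.0, 1.0),  # coherent and grounded
    (1.0, 0.5),  # coherent, ungrounded -> confabulation
    (1.0, 0.0),  # coherent, ungrounded -> confabulation
    (0.5, 1.0),  # not coherent
    (0.0, 0.0),  # not coherent
]
toy_rate = confabulation_rate(toy_cells)  # 2 of 5 cells
```

Because per-cell scores take only the values 0, 0.5, and 1.0, the 0.7 threshold selects exactly the cells scored 1.0.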

## 4 Results

### 4\.1 Rating accuracy and grounded reasoning decouple

Across 34,887 cells judged by both EC and GC, 55\.4% have a coherent reason \(EC ≥ 0\.7\), but only 15\.8% pair a coherent reason with a grounded rating \(GC ≥ 0\.7\)\. The remaining 39\.6% are coherent but wrong, corresponding to the confabulation cells defined in §[3\.4](https://arxiv.org/html/2606.15325#S3.SS4)\. Figure [2](https://arxiv.org/html/2606.15325#S4.F2) shows the four\-quadrant breakdown by pronunciation dimension\. Confabulation consistently exceeds genuine grounding, with ratios of 1\.7 for fluency, 2\.4 for pitch variation, 2\.9 for phoneme correctness, and 3\.4 for stress correctness\. The largest gap occurs for stress, where target alignment is hardest\. This pattern echoes Turpin et al\. \([2023](https://arxiv.org/html/2606.15325#bib.bib21)\)’s findings on unfaithful chain\-of\-thought reasoning, but is larger in magnitude: it persists even when a structured\-output schema requires the model to provide typed evidence before generating its explanation\.

![Refer to caption](https://arxiv.org/html/2606.15325v1/x1.png)

Figure 2: Distribution of reasoning coherence vs\. rating correctness per pronunciation dimension\. Across all dimensions, wrong yet coherent diagnoses \(confabulation, red bars\) substantially outnumber genuinely grounded feedback \(green bars\)\.

This decoupling has a direct consequence for CAPT evaluation\. A high RA score shows that the model selected the correct label but does not show that the explanation identifies the right acoustic or phonetic evidence\. High EC shows that the explanation is well\-formed relative to the prompt but does not show the rating is correct\. GC is the relevant metric for feedback quality because it asks whether rating and explanation jointly support a valid learner\-facing diagnosis\. The remaining subsections decompose this overall pattern by phoneme inventory \(§[4\.2](https://arxiv.org/html/2606.15325#S4.SS2)\), marginal rating distribution \(§[4\.3](https://arxiv.org/html/2606.15325#S4.SS3)\), and evidence condition \(§[4\.4](https://arxiv.org/html/2606.15325#S4.SS4)\)\.

### 4\.2 Phoneme feedback collapses onto an L1\-independent stereotype

If phoneme feedback were grounded in the learner’s actual production, the distribution of phones cited by the LLM should vary with both the speaker’s L1 and the prompting condition; this is the substitution pattern predicted by classical contrastive analysis Lado \([1957](https://arxiv.org/html/2606.15325#bib.bib11)\); Eckman \([1977](https://arxiv.org/html/2606.15325#bib.bib5)\); Swan and Smith \([2001](https://arxiv.org/html/2606.15325#bib.bib19)\)\. Instead, it collapses onto a small inventory of familiar L2\-English difficulty phones\. Table [2](https://arxiv.org/html/2606.15325#S4.T2) reports the pooled top\-five over\-claimed phones per L1, across models and evidence conditions \(15 model\-condition cells per L1\)\.

Table 2: Pooled top\-five over\-claimed phones per L1\.

*Note\.* Entries show phone, claim rate, mean model–gold gap in parentheses, and recurrence among the top\-five over\-claimed phones across 15 cells per L1\. Phones are ranked by recurrence, then by mean gap\.

The same four phones, /θ/, /ð/, /ɹ/, and /v/, dominate the top\-five over\-claim list for every L1, despite L1\-specific gold\-error distributions\. /ð/ appears in 14/15 cells for Mandarin and Spanish; /θ/ appears in 14/15 cells for Arabic; the two dental fricatives \(/θ/, /ð/\) plus /ɹ/ occupy three of the top four ranks for Mandarin, Spanish, and Vietnamese\. The pattern is not driven by a single model or prompting configuration, because the recurrence counts pool across all 15 model\-condition cells\. Aggregated across cells, /θ/, /ð/, and /ɹ/ jointly account for over a third of all emitted phoneme tokens, against approximately 20% of the gold\-error distribution\. Richer evidence does not eliminate the effect: IPA, acoustic, and audio\-based conditions all produce the same inventory, with the condition\-level matrix in Appendix Table [4](https://arxiv.org/html/2606.15325#A2.T4)\.
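One plausible way to operationalise an over-claim gap for a single model-condition cell is to compare each phone's share of model-cited tokens with its share of gold-error tokens. The definition below is an assumption made for illustration; the exact formula behind the model–gold gap is not given in this section, and the function name, token lists, and ARPAbet-style symbols are hypothetical.

```python
from collections import Counter


def over_claim_gaps(cited_phones, gold_phones, top_k=5):
    """Rank phones by (claim rate - gold rate) within one cell.

    cited_phones: phone tokens the model cited as errors.
    gold_phones: phone tokens annotated as errors in the gold data.
    Returns the top_k (phone, gap) pairs, largest gap first.
    """
    cited, gold = Counter(cited_phones), Counter(gold_phones)
    n_cited, n_gold = sum(cited.values()), sum(gold.values())
    gaps = {}
    for phone, count in cited.items():
        claim_rate = count / n_cited
        gold_rate = gold[phone] / n_gold if n_gold else 0.0
        gaps[phone] = claim_rate - gold_rate
    ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Toy cell with invented tokens, not data from the paper.
cited = ["TH", "TH", "DH", "R", "S"]
gold = ["S", "S", "TH", "Z"]
top_two = over_claim_gaps(cited, gold, top_k=2)
```

A positive gap means the model cites a phone more often than the annotations support; counting how often a phone lands in the top five across the 15 cells of an L1 would then give a recurrence figure of the kind the table note describes.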

This contradicts both the L1\-specific substitution predictions of contrastive analysis Eckman \([1977](https://arxiv.org/html/2606.15325#bib.bib5)\); Swan and Smith \([2001](https://arxiv.org/html/2606.15325#bib.bib19)\) and the speaker\-adaptive behaviour reported by recent multimodal\-LLM graders Fu et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib6)\); Ma et al\. \([2025](https://arxiv.org/html/2606.15325#bib.bib13)\)\. A learner who receives this feedback is told a particular phone was mispronounced, but the cited phone often reflects a generic L2\-English\-difficulty prior rather than a phone the speaker actually missed\.

### 4\.3 Prosodic predictions default to L2\-stereotype classes

A marginal\-distribution check on the rating output reveals a pattern not visible from RA alone\. On every prosodic dimension at `text-only`, all three models assign one class to the large majority of utterances, and the over\-emitted class is the same across models \(Table [3](https://arxiv.org/html/2606.15325#S4.T3)\)\. The over\-emitted classes are `slow` for fluency, `monotone` for pitch variation, and `1 = error` for stress correctness; each one matches a popular L2\-English stereotype, the kind of fixed parametric prior that Mallen et al\. \([2023](https://arxiv.org/html/2606.15325#bib.bib14)\) and Tao et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib20)\) flag as the default fallback when supplied context is weak\.

Table 3: Default\-class prediction rates for prosodic dimensions by model and evidence condition\.

*Note\.* For each dimension, the class in parentheses is the default class whose prediction rate is reported; the gold rate gives its prevalence in the reference labels\.

At `text-only`, all three models over\-predict the stereotype class: `slow` on 70–84% of utterances against a gold rate of 33%; `monotone` on 62–95% against 36%; `1 = error` on 82–96% against 4%\. The bias is uniform in direction even though the gold distribution points the opposite way on fluency \(most utterances are `normal` or `fast`\) and pitch variation \(most are `varied`\)\.

Conditions diverge in their effect on this concentration\. IPA leaves the over\-emission essentially unchanged on every dimension\. On pitch variation, `text+acoustic` collapses the `monotone` rate to 0–1% for all three models, an unambiguous abandonment of the stereotype class; `audio+acoustic` produces the same effect\. On fluency, `text+acoustic` reduces the `slow` rate substantially for Gemini \(72 → 49\) and dramatically for Qwen \(84 → 22\), but barely for GPT\-4o \(70 → 64\)\. On stress, the error rate falls modestly under `text+acoustic` and `audio+acoustic` but never approaches the 4% gold rate\. Per\-utterance grounding therefore tracks the specificity of the supplied evidence to the target feature, not the richness of the evidence package, a sharper version of the context\-versus\-parametric result of Tao et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib20)\)\.

### 4\.4 Acoustic evidence helps pitch and fluency, not stress and phoneme

![Refer to caption](https://arxiv.org/html/2606.15325v1/prosodic.png)

Figure 3: Pronunciation evaluation metrics across models and prompt conditions\. Each panel tracks RA, EC, and GC for a specific model\-dimension pair, with selected exact values annotated\. RA denotes rating F1, while EC and GC represent Evidence Coherence and Grounded Correctness, respectively\.

Figure [3](https://arxiv.org/html/2606.15325#S4.F3) reports RA, EC, and GC per \(model × condition × dimension\)\. The pattern is direct: acoustic evidence improves the rating only when the supplied feature directly probes the target\. F0 range directly probes pitch variation; duration plus word count probes fluency through a single arithmetic step; no supplied measurement directly probes which phone or which syllable was wrong, so stress and phoneme correctness remain ungrounded\.

#### Pitch variation\.

Pitch variation provides a clear case where explicit acoustic evidence improves grounding\. GC rises from 0\.18–0\.19 to 0\.45–0\.62 across the three models when textualised F0 values are supplied \(`text+acoustic`\)\. The `audio+acoustic` condition shows a similar gain \(0\.55–0\.61\), whereas `audio-only` remains weak, indicating that the effective intervention is the explicit textual representation of F0, not audio by itself\. The pattern matches the gap identified in speech\-LM roadmaps between perceiving non\-semantic acoustic cues and using them as evidence for diagnostic reasoning Bu et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib2)\); Cui et al\. \([2024](https://arxiv.org/html/2606.15325#bib.bib4)\)\.

#### Fluency\.

Fluency shows a similar but less uniform pattern\. Under `text+acoustic`, GC rises from 0\.27 to 0\.47 for Gemini and from 0\.25 to 0\.54 for Qwen, whereas it drops from 0\.35 to 0\.27 for GPT\-4o\. The supplied measurement is utterance duration, which requires the model to compute words per second from the target sentence\. The split\-by\-model pattern is consistent with the finding that small arithmetic derivations are not uniformly solved across models Turpin et al\. \([2023](https://arxiv.org/html/2606.15325#bib.bib21)\), and it shows that even feature\-readout tasks are not solved when the readout step is non\-trivial\.

#### Stress correctness\.

Stress correctness presents the hardest grounding problem\. RA stays low across all conditions \(0\.06–0\.14\) and GC never exceeds 0\.61\. The ceiling is architectural: grounding a stress judgment requires the model to know which syllable should be stressed, locate the stress produced by the speaker, and compare the two, a target\-to\-realisation alignment that acoustic summaries alone cannot supply\.

#### Phoneme correctness\.

As established in §[4\.2](https://arxiv.org/html/2606.15325#S4.SS2), phoneme RA is near\-ceiling \(≥ 0\.94\) due to the 96% base rate, while GC remains low \(0\.19–0\.38\) because cited phones are L1\- and condition\-independent\. No evidence condition closes this gap: the model can detect that *some* error exists but cannot ground *which* phone was mispronounced\.

### 4\.5 Metadata Ablation Study

In the previous experiments, every prompt includes the speaker’s L1 and gender\. To test whether models use these labels rather than the speech itself, we ran a `metadata_suppression` variant: the same evidence, plus one instruction: “You MUST NOT use the speaker’s L1 background or gender as a basis for any judgement\.” Because both variants share identical evidence, any score change is attributable to the model’s use of the labels\.

The effect is dimension-specific, and stress correctness shows it most clearly. With the L1 label visible, three of four models score only 0.08–0.14 on stress, essentially flagging a stress error on almost every utterance, whether or not one actually occurred. Once the label is hidden, scores jump to 0.40–0.65, with Grounded Correctness rising in parallel. This indicates that the model was not listening to the speech; it was seeing “L2 learner” and assuming errors.

Pitch variation reveals a gender effect. With gender visible, female speakers score 0.16–0.24 higher than male speakers; models routinely label male speakers as monotone even when the gold annotations show the opposite. Hide the gender label and the gap reverses: male speakers now score 0.13–0.27 higher than female. The model was responding to the gender tag, not to the speaker’s actual pitch.

Phoneme correctness and fluency do not change when labels are removed. For phoneme correctness this is expected: 96% of utterances actually contain a phoneme error, so models score near-ceiling (0.94–0.97) regardless of what the label says, and the over-claimed phone inventory (§[4.2](https://arxiv.org/html/2606.15325#S4.SS2)) is L1-independent anyway. For fluency, there is no consistent stereotype linking L1 or gender to speech rate, so removing either label changes nothing.

Together, these results answer RQ3. Where a widely held stereotype exists, such as stress errors for L2 learners or monotone pitch for male speakers, the demographic label overrides the acoustic evidence. Where no such stereotype exists, as with phoneme inventory and speech rate, the label has no effect.

## 5 Discussion

The empirical findings reveal a fundamental limitation in how audio\-capable LLMs process speech: a severe dissociation between surface\-level task performance \(Rating Accuracy\) and actual factual grounding \(Grounded Correctness\)\. This gap is not random noise but a structural vulnerability\. When explicit acoustic evidence is absent or difficult to parse, LLMs do not fail gracefully by expressing uncertainty; instead, they default to demographic and pedagogical priors ingrained during pretraining\.

This failure mode is dimension-specific, governed by the distinction between *feature readout* and *target-to-realisation alignment*. For fluency and pitch variation, low-level acoustic summaries map almost directly onto the required label. Diagnosing phoneme errors and lexical stress deviations, by contrast, requires comparing the target canonical form with the learner’s actual realisation, determining not just what the signal sounds like but where it deviates from a language-specific expectation. Acoustic summaries cannot encode this comparative alignment; in their absence, the model fills the gap with prior-driven confabulation. This finding is consistent with recent work showing that prosodic sensitivity in speech-language models remains cue-dependent rather than uniformly reliable Qian et al. ([2025](https://arxiv.org/html/2606.15325#bib.bib18)); Chen et al. ([2025](https://arxiv.org/html/2606.15325#bib.bib3)).

The errors follow a predictable pattern. Phoneme explanations collapse onto the rigid pretraining inventory documented in §[4.2](https://arxiv.org/html/2606.15325#S4.SS2), irrespective of the speaker’s L1 or evidence richness, while demographic labels selectively override evidence on the two dimensions where a plausible stereotype exists. These priors superficially resemble valid cross-linguistic transfer knowledge but become harmful when applied to individual learners without evidence of their actual errors, the failure mode identified by Turpin et al. ([2023](https://arxiv.org/html/2606.15325#bib.bib21)) and the parametric-knowledge literature Petroni et al. ([2019](https://arxiv.org/html/2606.15325#bib.bib17)); Kassner and Schütze ([2020](https://arxiv.org/html/2606.15325#bib.bib9)); Mallen et al. ([2023](https://arxiv.org/html/2606.15325#bib.bib14)).

Two implications follow for CAPT systems. First, evaluation standards must evolve. High F1 is an illusion of competence, and future benchmarks must decouple task accuracy from reasoning validity by reporting multi-tiered grounding metrics such as EC and GC. Second, architectures must be refactored. A more robust design uses specialised acoustic models for target-to-realisation alignment and deploys LLMs strictly as downstream natural-language interfaces.

## 6 Conclusion

This study tested whether audio-capable LLMs can ground L2 pronunciation feedback in objective phonetic evidence. Our analysis quantifies the decoupling: 39.6% of evaluated outputs are structurally coherent but entirely ungrounded, while only 15.8% achieve both internal coherence and factual grounding. Phoneme feedback collapses onto a fixed L2-difficulty inventory irrespective of the speaker’s L1 background or the evidence richness. Prosodic predictions show a similar prior-driven pattern through single-class clustering and demographic-prior effects. These systematic, prior-driven biases demonstrate that current LLMs remain unreliable as autonomous pronunciation diagnostic engines. They are far more dependable when confined to verbalising externally verified, structurally aligned speech evidence.

## 7 Limitations

The evaluation is limited to three general-purpose multimodal systems; whether these patterns generalize to specialized speech-language models or to spontaneous L2 speech remains an open question. Sample sizes for certain per-L1 groups and binary-error categories in the L2-ARCTIC corpus are relatively small. The fluency and pitch labels are derived from human-verified TextGrids rather than independent human annotation. The LLM-as-a-judge approach introduces heuristic biases that may affect absolute scores. However, these factors do not readily explain our primary findings: phoneme feedback repeatedly relies on generic L2-English error stereotypes, and phoneme and stress feedback remain weakly grounded even when acoustic evidence is supplied.

## References

- Atanasova et al. (2023) Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, and Isabelle Augenstein. 2023. Faithfulness tests for natural language explanations. In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)*, pages 283–294.
- Bu et al. (2024) Fan Bu, Yuhao Zhang, Xidong Wang, Benyou Wang, Qun Liu, and Haizhou Li. 2024. Roadmap towards superhuman speech understanding using large language models. *arXiv preprint arXiv:2410.13268*.
- Chen et al. (2025) Yu-Wen Chen, Melody Ma, and Julia Hirschberg. 2025. Read to hear: A zero-shot pronunciation assessment using textual descriptions and LLMs. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 2682–2694.
- Cui et al. (2024) Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, and Irwin King. 2024. Recent advances in speech language models: A survey. *arXiv preprint arXiv:2410.03751*.
- Eckman (1977) Fred R. Eckman. 1977. Markedness and the contrastive analysis hypothesis. *Language Learning*, 27(2):315–330.
- Fu et al. (2024) Kaiqi Fu, Linkai Peng, Nan Yang, and Shuran Zhou. 2024. Pronunciation assessment with multi-modal large language models. *arXiv preprint arXiv:2407.09209*.
- Jacovi and Goldberg (2020) Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL)*, pages 4198–4205.
- Jeon et al. (2024) Jaeho Jeon, Seongyong Lee, and Seongyune Choi. 2024. A systematic review of research on speech-recognition chatbots for language learning: Implications for future directions in the era of large language models. *Interactive Learning Environments*, 32(8):4613–4631.
- Kassner and Schütze (2020) Nora Kassner and Hinrich Schütze. 2020. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL)*, pages 7811–7818.
- Koenecke et al. (2020) Allison Koenecke, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R. Rickford, Dan Jurafsky, and Sharad Goel. 2020. Racial disparities in automated speech recognition. *Proceedings of the National Academy of Sciences (PNAS)*, 117(14):7684–7689.
- Lado (1957) Robert Lado. 1957. *Linguistics Across Cultures: Applied Linguistics for Language Teachers*. University of Michigan Press.
- Li et al. (2017) Kun Li, Xiaojun Qian, and Helen Meng. 2017. Mispronunciation detection and diagnosis in L2 English speech using multi-distribution deep neural networks. *IEEE/ACM Transactions on Audio, Speech, and Language Processing*, 25(1):193–207.
- Ma et al. (2025) Rao Ma, Mengjie Qian, Siyuan Tang, Stefano Bannò, Kate M. Knill, and Mark J. F. Gales. 2025. Assessment of L2 oral proficiency using speech large language models. *arXiv preprint arXiv:2505.21148*.
- Mallen et al. (2023) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. [When not to trust language models: Investigating effectiveness of parametric and non-parametric memories](https://doi.org/10.18653/v1/2023.acl-long.546). In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics*, pages 9802–9822.
- Maynez et al. (2020) Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL)*, pages 1906–1919.
- Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, pages 12076–12100.
- Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H. Miller. 2019. [Language models as knowledge bases?](https://doi.org/10.18653/v1/D19-1250) In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)*, pages 2463–2473.
- Qian et al. (2025) Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark A. Hasegawa-Johnson, Chuang Gan, and Yang Zhang. 2025. ProsodyLM: Uncovering the emerging prosody processing capabilities in speech language models. In *Proceedings of the Second Conference on Language Modeling (COLM)*.
- Swan and Smith (2001) Michael Swan and Bernard Smith, editors. 2001. *Learner English: A Teacher’s Guide to Interference and Other Problems*, second edition. Cambridge University Press.
- Tao et al. (2024) Yufei Tao, Adam Hiatt, Erik Haake, Antonie J. Jetter, and Ameeta Agrawal. 2024. [When context leads but parametric memory follows in large language models](https://doi.org/10.18653/v1/2024.emnlp-main.234). In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 4034–4058.
- Turpin et al. (2023) Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. 2023. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. In *Advances in Neural Information Processing Systems (NeurIPS)*, volume 36, pages 71118–71138.
- Wang et al. (2025) Ke Wang, Lei He, Kun Liu, Yan Deng, Wenning Wei, and Sheng Zhao. 2025. Exploring the potential of large multimodal models as effective alternatives for pronunciation assessment. *arXiv preprint arXiv:2503.11229*.
- Witt and Young (2000) Silke M. Witt and Steve J. Young. 2000. Phone-level pronunciation scoring and assessment for interactive language learning. *Speech Communication*, 30(2-3):95–108.
- Yan et al. (2023) Bi-Cheng Yan, Hsin-Wei Wang, and Berlin Chen. 2023. PEPPANET: Effective mispronunciation detection and diagnosis leveraging phonetic, phonological, and acoustic cues. In *Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT)*, pages 1045–1051.
- Ye et al. (2024) Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V. Chawla, and Xiangliang Zhang. 2024. Justice or prejudice? quantifying biases in LLM-as-a-Judge.
- Zha et al. (2023) Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. AlignScore: Evaluating factual consistency with a unified alignment function. In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)*, pages 11328–11348.
- Zhao et al. (2018) Guanlong Zhao, Sinem Sonsaat, Alif Silpachai, Ivana Lucic, Evgeny Chukharev-Hudilainen, John Levis, and Ricardo Gutierrez-Osuna. 2018. L2-ARCTIC: A non-native English speech corpus. In *Proceedings of Interspeech 2018*, pages 2783–2787.
- Zhong et al. (2024) Huihang Zhong, Yanlu Xie, and ZiJin Yao. 2024. Leveraging large language models to refine automatic feedback generation at articulatory level in computer aided pronunciation training. In *Proceedings of Interspeech 2024*.

## Appendix A System and User Prompt Templates

All experiments use one of three system messages combined with a per\-condition user message\. The three system messages share a common output\-format clause and differ only in their framing of the task\.

### System messages

#### system\_default

> You are a phonetician evaluating L2 English pronunciation. For each of the four dimensions listed below, pick the categorical label that best characterises the speaker’s production relative to native English norms. [shared output rules]

#### system\_grounded

Used for the text+acoustic and audio+acoustic conditions.

> You are a phonetician evaluating L2 English pronunciation. For each of the four dimensions, pick the categorical label that best characterises the speaker’s production relative to native English norms. Base your decisions on the acoustic measurements and/or the audio provided in the EVIDENCE block. In the value_used / vowels_used / phonemes_used fields, report the actual measurement or symbols you consulted (do not invent default values). [shared output rules]

#### Shared output rules\.

> For every dimension, fill BOTH: rating (the categorical label from the allowed set) and evidence (a typed value or list of phoneme symbols, plus a one-sentence reason). Even if you are uncertain, you MUST commit to a rating; do not leave any field empty. Return ONLY a single JSON object that matches the schema, with no prose and no markdown fences.

The four dimensions are listed in a per-utterance randomised order, seeded reproducibly by hashlib(utt_id, condition).

> Dimensions to rate (in any order):
> - Fluency / speaking rate (label: "slow" | "normal" | "fast")
> - Pitch variation / F0 range (label: "monotone" | "normal" | "varied")
> - Stress correctness (0 = no stress errors, 1 = at least one stress error)
> - Phoneme correctness (0 = no phoneme errors, 1 = at least one phoneme error)
>
> --- EVIDENCE ---
> Speaker L1 background: Mandarin
> Speaker gender: Female
> Target sentence: "The quick brown fox jumps over the lazy dog."
> Canonical phoneme alignment (one phoneme per line, start_sec--end_sec : PHONEME):
> 0.00--0.10 : DH
> 0.10--0.18 : AH
> 0.18--0.32 : K
> …
> - Duration: 2.84 s (tip: divide words by duration for words/sec)
> - F0 min: 88.3 Hz
> - F0 max: 243.1 Hz (tip: F0 range = max -- min)
> - Intensity range: 14.6 dB
> --- END EVIDENCE ---
>
> JSON schema you must return: { "ratings": {…}, "evidence": {…} }
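The reproducible per-utterance ordering of the four dimensions can be sketched as follows. The paper states only that the order is seeded from a hash of the utterance ID and the condition; the choice of SHA-256, the `|` separator, the dimension keys, and the example utterance ID below are all assumptions made for illustration.

```python
import hashlib
import random

# Hypothetical short keys for the four rated dimensions.
DIMENSIONS = ["fluency", "pitch_variation", "stress_correctness", "phoneme_correctness"]


def dimension_order(utt_id: str, condition: str) -> list[str]:
    """Return a shuffled copy of DIMENSIONS that depends only on (utt_id, condition).

    Assumption: the seed is the first 8 bytes of a SHA-256 digest of the
    joined identifiers. Python's built-in hash() is avoided because it is
    salted per process and would not be reproducible across runs.
    """
    key = f"{utt_id}|{condition}".encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    rng = random.Random(seed)
    order = list(DIMENSIONS)
    rng.shuffle(order)
    return order


# The same inputs always yield the same order.
first = dimension_order("utt_0001", "text+acoustic")
second = dimension_order("utt_0001", "text+acoustic")
```

Deriving the seed from a cryptographic digest rather than from global random state means that reruns and parallel workers present each utterance with an identical prompt.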

#### system\_judge

> You audit a phonetician’s reasoning against ground truth. For a single pronunciation dimension you receive: (a) the task and prompt the phonetician was shown, (b) the rating they assigned, (c) the evidence they cited, (d) the one-sentence reason they wrote, and (e) the ground-truth label plus the underlying GT value or error list. Your job is to return two scores: a binary rating-correct flag and a 3 level grounded-correctness (GC) score. The GC rubric distinguishes (i) correct rating with sound evidence-based reasoning, (ii) correct rating with decorative or stereotype-driven reasoning (lucky guess), (iii) wrong rating with no defensible reading. Return ONLY the JSON object.

> JUDGE_USER_TEMPLATE = """DIMENSION: {dim_label}
>
> EVIDENCE SHOWN TO PHONETICIAN:
> {evidence_block}
>
> PHONETICIAN’S OUTPUT FOR THIS DIMENSION:
> RATING: {rating}
> EVIDENCE CITED ({field}): {cited}
> REASON: {reason}
>
> GROUND TRUTH FOR THIS UTTERANCE:
> GT LABEL: {gt_label}
> GT {gt_aux_name}: {gt_aux}
>
> Scoring:
> rating_correct: 1 if RATING equals GT LABEL (string-equal for categorical; integer-equal for binary); 0 otherwise.
> gc - Grounded Correctness (3-point):
> 1.0: Rating matches GT AND reason is a valid explanation for the correct rating, using the cited evidence.
> 0.5: Rating matches GT but reason is decorative, wrong, or invokes facts not in evidence (correct label, ungrounded rationale).
> 0.0: Rating disagrees with GT and the reason does not support any defensible reading.
>
> Return ONLY JSON: {{"rating_correct": 0|1, "gc": <1.0|0.5|0.0>, "verdict": "<one sentence>"}}
> """
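As a rough illustration of how judge outputs in this format could be consumed, the sketch below validates one JSON verdict and averages a batch into RA and GC scores. It is a minimal sketch under stated assumptions: the function names are hypothetical, and treating RA and GC as simple means over judged cells is an assumption rather than the authors' documented aggregation.

```python
import json
from statistics import mean

ALLOWED_GC = {0.0, 0.5, 1.0}  # the three rubric levels in the judge template


def parse_judge_output(raw: str) -> dict:
    """Parse one judge response and check it against the rubric's value sets."""
    obj = json.loads(raw)
    rating_correct = obj["rating_correct"]
    gc = float(obj["gc"])
    if rating_correct not in (0, 1):
        raise ValueError(f"rating_correct must be 0 or 1, got {rating_correct!r}")
    if gc not in ALLOWED_GC:
        raise ValueError(f"gc must be one of {sorted(ALLOWED_GC)}, got {gc!r}")
    return {"rating_correct": int(rating_correct), "gc": gc}


def aggregate(records: list[dict]) -> dict:
    """Assumed aggregation: RA and GC as plain means over judged cells."""
    return {
        "RA": mean(r["rating_correct"] for r in records),
        "GC": mean(r["gc"] for r in records),
    }


# Made-up verdicts for illustration only; these are not results from the paper.
raw_outputs = [
    '{"rating_correct": 1, "gc": 1.0, "verdict": "Grounded in the cited F0 range."}',
    '{"rating_correct": 1, "gc": 0.5, "verdict": "Correct label, decorative reason."}',
    '{"rating_correct": 0, "gc": 0.0, "verdict": "Wrong rating, no defensible reading."}',
]
scores = aggregate([parse_judge_output(r) for r in raw_outputs])
```

Rejecting out-of-rubric values at parse time keeps a malformed judge response from silently shifting the averages.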

## Appendix B Ranked top-five phones claimed as mispronounced across models × condition

Table 4: Ranked top-five phones claimed as mispronounced across three models, five prompting conditions, and six L1 backgrounds. Each cell lists items in order from Rank 1 to Rank 5, in the format Phone (Percentage % / Model–Gold Gap Δ). Cells with unavailable data for non-speech setups are shown as (–).
