Are you speaking my languages? On spoken language adherence in multimodal LLMs
Summary
This paper addresses the problem of spoken language adherence in multimodal LLMs for ASR, proposing a soft prompting approach and novel metric to quantify language violations. It evaluates three mitigation strategies—zero-shot prompting, supervised fine-tuning, and chain-of-thought reasoning—across multiple languages to improve transcription fidelity.
# Are you speaking my languages? On spoken language adherence in multimodal LLMs
Source: [https://arxiv.org/html/2606.17281](https://arxiv.org/html/2606.17281)
###### Abstract
While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. We present a comparative analysis of these methods across multiple languages, evaluating their effectiveness in reducing language adherence violations while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under various compute constraints.
## 1 Introduction
### 1.1 Background
Large Language Models \(LLMs\) have significantly advanced multilingual Automatic Speech Recognition \(ASR\), enabling flexible, zero\-shot transcription across many languages, including code\-switching \(where multiple languages are spoken in the same utterance\)\. While flexible, such systems often misidentify the target language in short or noisy segments\. Robust multilingual support requires the integration of user context; specifically, language hinting guides the model toward the expected language without restricting the user’s ability to code\-switch\. This approach successfully balances transcription accuracy with linguistic freedom\.
### 1.2 Problem Statement
Lack of language adherence fundamentally degrades transcription fidelity, introducing errors that distort the original meaning\. This unreliability directly compromises downstream tasks dependent on accurate ASR, such as machine translation, sentiment analysis, and command systems\. Furthermore, poor adherence creates a jarring negative user experience\. More than typical ASR errors, erroneous foreign\-language text can feel amateurish or be perceived as insensitive to cultural contexts or biased towards a user’s accent, eroding trust in the technology\. We need a way to mitigate these pitfalls while supporting flexible multilingual engagement\.
### 1.3 Proposed Approach and Contributions
To systematically address the issue of poor language adherence, this paper introduces a multifaceted approach encompassing formalization, measurement, and mitigation\. Our primary contributions are as follows\.
First, we formally define “Language Adherence Violation” and propose a novel metric to quantify these occurrences, providing a standardized evaluation method\. Then, we discuss the inherent difficulty of obtaining reliable language preferences\.
Second, we propose and investigate three non\-mutually exclusive mitigation strategies to improve language adherence while balancing flexibility and robustness to incorrect language signals:
- Prompt Engineering: Using carefully designed prompts to guide the LLM’s focus to a target language while assessing its robustness to imperfect signals.
- Supervised Fine-Tuning (SFT) with instruction: Employing language-adherence prompts during the SFT process to ingrain the desired transcription behavior.
- Chain-of-Thought (CoT): Implementing a reasoning step that compels the model to first identify and declare the spoken language before transcribing.
Finally, we conduct a comprehensive comparative analysis of these methods across monolingual and code\-switching datasets, evaluating the trade\-offs between language signal accuracy and model robustness\. To maintain a focused comparison of prompting\-based methods, we do not explore reinforcement learning approaches\.
## 2 Related Work
### 2.1 Multilingual and Code-Switching ASR
Some early multilingual neural ASR models used separate output layers per language (scanzio2008), requiring the target language to be known a priori. Others appended a one-hot language ID embedding to input features, allowing the model to learn language-specific biases (li2018). While language ID can be estimated from audio (cole1989; ma2002; lopez2014; bazazo2023), integration with streaming ASR is difficult due to the latency required for confident estimation. To address this, waters2019 and zhang2022 utilized parallel language ID modules to provide running estimates to an RNN-T model, while more sophisticated solutions combine identification and verification to reconcile signals (kim2025).
watanabe2017 proposed another approach that models acoustic language ID and ASR jointly: they modify the training data to include the language tag as the first word in the reference transcript, teaching the model to predict the language before outputting text. While this is close to how today’s LLMs approach multitasking, the encoder was based on a bi-directional LSTM, which has access to the entire input audio before outputting its language ID estimate and is therefore not streaming-friendly. Still, it is a remarkable study, as it discusses language adherence rates explicitly.
In the context of code-switching in Indic languages, emond2018 and datta2020 proposed to transliterate all data into a single common script, which gives rise to positive cross-lingual effects and helps with under-resourced languages. While this works for modeling certain language pairs, restoring the original spelling for rendering the ASR hypothesis might be infeasible in the most general case.
### 2.2 Language Identification in Speech Processing
To monitor unconstrained multilingual ASR or LLM outputs, robust text language identification (LangID) is essential, often utilizing character-level N-gram classifiers (cavnar1994; caswell2020). While effective for long sentences, these classifiers struggle with short inputs such as “ja” (a word that exists in over ten languages) and require additional context like user profiles or conversation history for accuracy. Furthermore, LangID needs to be flexible enough to allow for benign use of proper names (e.g. “Apple” in a French sentence about the company) or loan words (e.g. “Download” in a German sentence about mobile apps) and to identify individual character spans in a code-switching environment.
While only a few languages have unique writing systems (e.g. Armenian, Georgian, Greek), many languages share large portions of their alphabets for historical reasons. For example, 22 out of 24 official languages of the European Union use Latin script. This increases mutual intelligibility: even if a speaker of French doesn’t understand some Italian word, at least they recognize most characters and might even be able to guess pronunciation and meaning based on context. Conversely, a French speaker who is exposed to Arabic or Chinese script is less likely to do so. This simple observation suggests that it can be more practical to focus on identifying unexpected script rather than unexpected language. This can be done very efficiently based on Unicode ranges (qasim2024).
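As a rough illustration of script identification by Unicode range, the following minimal sketch maps characters to scripts using a small hand-picked table of code point blocks. The table, the function names, and the choice of scripts are assumptions made for this example and are not taken from qasim2024; a production system would cover many more blocks.

```python
# Illustrative subset of Unicode blocks (inclusive code point ranges).
# The Latin ranges skip the multiplication (U+00D7) and division (U+00F7) signs.
SCRIPT_RANGES = {
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A),
              (0x00C0, 0x00D6), (0x00D8, 0x00F6), (0x00F8, 0x024F)],
    "Greek": [(0x0370, 0x03FF)],
    "Cyrillic": [(0x0400, 0x04FF)],
    "Armenian": [(0x0530, 0x058F)],
    "Arabic": [(0x0600, 0x06FF)],
    "Devanagari": [(0x0900, 0x097F)],
    "Georgian": [(0x10A0, 0x10FF)],
    "Hangul": [(0x1100, 0x11FF), (0xAC00, 0xD7AF)],
    "Hiragana": [(0x3040, 0x309F)],
    "Katakana": [(0x30A0, 0x30FF)],
    "Han": [(0x4E00, 0x9FFF)],
}


def script_of(ch):
    """Return the script name of a single character, or None if unknown."""
    cp = ord(ch)
    for script, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return script
    return None


def scripts_in(text):
    """Return the set of known scripts that occur in the text."""
    return {s for s in map(script_of, text) if s is not None}


print(scripts_in("Seoul \uc11c\uc6b8"))  # Latin and Hangul characters
print(scripts_in("Danke sch\u00f6n"))    # Latin characters only
```

Digits, whitespace, and ASCII punctuation fall outside every range in this table, so they do not contribute a script.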
### 2.3 Large Language Models in ASR
Following early modality fusion approaches\(sun2019;zheng2021;bapna2022;wang2023\), most of today’s leading LLMs are natively multimodal\. They are trained on text, audio and images, tokenized with modality\-specific encoders to produce features in a unified space or tokens from a single flat vocabulary\. In order to use an LLM for speech recognition, it is usually prompted with “Transcribe the following audio:”, followed by the audio tokens\. While it’s possible to process some tasks in the same LLM call \(e\.g\. take speech as audio input and immediately generate audio tokens as response\), there are still applications that choose to chain different models\. In a cascaded system like this, where the ASR hypothesis is used as input to a separate model \(e\.g\. fed into a different LLM or a machine translation system\), the quality of the downstream application can be greatly harmed by language adherence issues in the ASR transcription\. Additional context is required to disambiguate between the music genre “Soul” and the capital of South Korea “서울” \(in English, “Seoul”\)\.
### 2.4 Explicit Language Control in Generative Models
While the controlled generation of formal languages such as code has been studied extensively (e.g. with the help of finite state automata (koo2024)), no established solution exists for natural languages yet. Since decoding LLMs with beam search is often considered expensive, hybrid solutions seek to modify the logit or probability values heuristically in order to support or suppress subsets of the vocabulary, before the decoder samples the next token (dathathri2019).
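To make the idea of heuristic logit modification concrete, here is a minimal sketch that lowers the logits of a disallowed vocabulary subset before sampling. The toy vocabulary, the logit values, the `penalty` parameter, and the function names are all hypothetical; this is not the procedure of dathathri2019 or of any particular decoder.

```python
import math


def suppress_tokens(logits, vocab, is_allowed, penalty):
    """Return a copy of logits with disallowed tokens lowered by `penalty`."""
    return [
        logit if is_allowed(token) else logit - penalty
        for logit, token in zip(logits, vocab)
    ]


def softmax(logits):
    """Numerically stable softmax; assumes at least one finite logit."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: keep only ASCII tokens when Latin script is expected.
vocab = ["soul", "seoul", "\uc11c\uc6b8", "\u30bd\u30a6\u30eb"]
logits = [1.0, 0.5, 2.0, 0.2]

hard = softmax(suppress_tokens(logits, vocab, str.isascii, float("inf")))
soft = softmax(suppress_tokens(logits, vocab, str.isascii, 2.0))
print(hard)
print(soft)
```

An infinite penalty removes the disallowed tokens entirely, whereas a finite penalty only biases the distribution, which is closer in spirit to the soft hinting studied in this paper.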
## 3 Measuring Language Adherence
### 3.1 Language Adherence Metric
For offline evaluation purposes, we assume that each utterance in the test set can be uniquely annotated with a set of languages $L_{ref}$ that appear in the audio. Usually, the set contains only one or two languages.
As mentioned in Sec. [2.2](https://arxiv.org/html/2606.17281#S2.SS2), an external text LangID tool can be used to obtain the set of languages $L_{hyp}$ that appear in the generated output. Since the hypothesis can contain recognition errors, we don’t require the sets to match perfectly.
Definition. The generated model output violates language adherence whenever $L_{hyp} \not\subseteq L_{ref}$.
Thus, for a dataset with $N$ utterances, the Language Adherence Violation Rate (LAVR) is
$$\mathrm{LAVR} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\left(L_{hyp,i} \not\subseteq L_{ref,i}\right) \tag{1}$$
where $\mathbb{I}$ is an indicator function and $L_{hyp,i}$ and $L_{ref,i}$ are the language sets of the $i$-th utterance.
In the context of this study, which measures language adherence violations (sometimes referred to as “(local) language hallucinations”), we operationally define “language” by its corresponding canonical script; thus, a “language violation” occurs when the ASR system produces a script inconsistent with the target language’s standard orthography. The set of expected “languages” of a German test set is therefore defined as a set of acceptable characters, $L_{ref} := [\text{a-zäöüß}]$. Similarly, $L_{hyp}$ is the set of characters that appear in the generated output. For example, both “Danke schön” and “Thank uoy” are language adherent for German (note the misspelled “uoy”), but “Danke sçhön” is not, due to ‘ç’. In addition, we treat basic ASCII punctuation and digits as “neutral” and exclude them from measurement, while language-specific punctuation (e.g., “¿”) is retained. Note that a single unexpected character in $L_{hyp}$ marks the entire utterance as a violation. This is an acceptable simplification, given the homogeneous length distribution of our evaluation sets.
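The character-level definition can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used in the paper: the function names are hypothetical, and lowercasing, NFC normalization, and treating whitespace as neutral are choices made here that the text does not specify.

```python
import string
import unicodedata

# ASCII punctuation and digits are "neutral"; whitespace is ignored as well
# (the latter is an assumption of this sketch).
NEUTRAL = set(string.punctuation) | set(string.digits) | set(string.whitespace)

# Acceptable characters for German, following L_ref := [a-zäöüß].
GERMAN = set("abcdefghijklmnopqrstuvwxyz\u00e4\u00f6\u00fc\u00df")


def violates(hypothesis, allowed):
    """True if the hypothesis contains a character outside the allowed set."""
    text = unicodedata.normalize("NFC", hypothesis).lower()
    l_hyp = set(text) - NEUTRAL
    return not l_hyp <= allowed


def lavr(hypotheses, allowed_sets):
    """Fraction of utterances with at least one unexpected character (Eq. 1)."""
    flags = [violates(h, a) for h, a in zip(hypotheses, allowed_sets)]
    return sum(flags) / len(flags)


hyps = ["Danke sch\u00f6n", "Thank uoy", "Danke s\u00e7h\u00f6n"]
print([violates(h, GERMAN) for h in hyps])  # [False, False, True]
print(lavr(hyps, [GERMAN] * len(hyps)))
```

Only the third hypothesis is flagged, so the rate over these three examples is one third.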
Our ultimate goal is to reduce the ratio of utterances violating the language adherence under uncertainty about user preferences without increasing the word error rate \(WER\)\. A low rate of language adherence violations suggests a high level of control over the behavior of a multilingual system\.
### 3.2 Metric Scope and Limitations
As discussed in Sec. [2.2](https://arxiv.org/html/2606.17281#S2.SS2), a language adherence metric based on word-level text LangID can be ambiguous. We found that even the reference transcripts of many real-world test sets have a non-zero number of language adherence violations. This is partly due to loan words (e.g. “Kindergarten”) or proper names (e.g. “Versace”).
On the other hand, with the coarse-grained character-level metric, the rate of language adherence violations in reference transcripts is very close to zero. The only violations are due to foreign words spelled in the original script, such as “jalapeño” or “Beyoncé”. Thus many Finnish and English words will be considered “acceptable” in a German test set. This is consistent with the notion of surprise in user perception discussed in Sec. [2.2](https://arxiv.org/html/2606.17281#S2.SS2).
The character-level metric would miss a hypothetical bug that dropped all umlauts from German vowels. It would also remain zero if the decoded text consisted of nothing but the word “hello” or some gibberish like “asdf”. This is a limitation of the proposed metric, and can be thought of as trading precision for recall: if there is a character that would likely draw the user’s attention, then we want to make sure to flag it. The metric should therefore always be complemented by the standard WER between reference and hypothesis.
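To illustrate why WER is a necessary complement, the sketch below computes the standard word-level edit distance divided by the reference length. The whitespace tokenization and the function name are assumptions for this example; the paper does not describe its WER tooling or text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes whitespace tokenization and a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# "hello" contains only Latin letters, so it is character-level adherent for
# German, yet every reference word is wrong.
print(word_error_rate("guten morgen zusammen", "hello"))  # 1.0
```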
### 3.3 Online Metrics
Measuring performance on real-world traffic is challenging because manual transcription is expensive and scaling is difficult. We propose using user language settings and past conversations as a proxy for $L_{ref}$, though this approach has limitations. For instance, a bilingual user might speak French but receive a Spanish misrecognition that goes unflagged if both languages are in their settings; conversely, a user with English-only settings who naturally code-switches into Hindi will trigger a false violation. These scenarios, including new language learning and international travel, demonstrate the limits of this formalism. Consequently, we recommend monitoring relative changes in language adherence rather than striving for an absolute zero.
## 4 Proposed methods and experimental results
As discussed in Sec. [3](https://arxiv.org/html/2606.17281#S3), accurate identification of the spoken language is not always possible and the contextual signals can be misleading. Recognizing this challenge, we explore three mechanisms to improve language adherence. While the methods can be implemented independently and combined in different ways, we choose to evaluate them sequentially, adopting the best outcome for the next stage.
### 4.1 Experimental Setup
Evaluation Datasets. We collected two classes of datasets: monolingual and code-switching. The monolingual datasets comprise a few thousand user queries per language, sourced from real-world interactions with a production-grade AI agent. Most utterances range in length from 5 to 20 words. For code-switching, we synthesized audio with diverse voices from a dataset of about 10,000 anonymized queries in which users code-switch between English and another language. These utterances typically range from 5 to 10 words in length. See the Appendix for statistics and sample utterances of the evaluation datasets.
We report results for English, French, Hindi, Korean, and their pairing with English in the main text; findings for additional languages are provided in the Appendix\.
Baseline Model. We use Gemini 2.0 Flash Lite as the foundation model. Specifically, the baseline model for zero-shot experiments is a proprietary variant of Gemini 2.0 Flash Lite, fine-tuned for general ASR tasks. It is a deep transformer-based LLM trained on very large amounts of transcribed speech in all languages discussed in this work. For the SFT and CoT experiments, we further fine-tune it using a proprietary collection of transcribed speech data, consisting primarily of monolingual and code-switching single-utterance recordings.
Evaluation Metric. For each proposed method, we compute character-level LAVR, defined in Eq. [1](https://arxiv.org/html/2606.17281#S3.E1), and word error rate (WER). We evaluate performance by varying the language hints in the prompt across four scenarios relevant to real-world use:
- no-hint: No language hint is provided.
- correct: The prompt contains the correct language(s) only. This is an idealistic oracle condition.
- distractor: The prompt contains an incorrect language only (Japanese for Korean, Spanish for French and Hindi).
- mix: The prompt contains both the correct language and the distractor language.
For code-switching datasets, the correct case includes both spoken languages, the distractor case includes only the distractor of the non-English language, and the mix case includes the non-English language and its distractor. For the no-hint case, the prompt is always “Transcribe the following speech segment: ”. As discussed in Sec. [3.3](https://arxiv.org/html/2606.17281#S3.SS3), obtaining the correct language signal is not always possible, so it is important that the system performs reasonably well under non-ideal scenarios.
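The four hint scenarios can be summarized as a small selection function. This is a sketch under assumptions: the function name and the list-based interface are hypothetical, and the fallback for monolingual English in the mix case is a guess, since the text does not name a distractor for English.

```python
def hinted_languages(scenario, spoken, distractor):
    """Pick the languages mentioned in the prompt for one hint scenario.

    spoken: languages present in the audio, e.g. ["English", "Korean"].
    distractor: a single incorrect language, e.g. "Japanese".
    """
    if scenario == "no-hint":
        return []
    if scenario == "correct":
        return list(spoken)
    if scenario == "distractor":
        return [distractor]
    if scenario == "mix":
        # For code-switching data only the non-English language is hinted.
        # Falling back to all spoken languages is an assumption of this sketch.
        non_english = [lang for lang in spoken if lang != "English"]
        return (non_english or list(spoken)) + [distractor]
    raise ValueError(f"unknown scenario: {scenario}")


print(hinted_languages("mix", ["French"], "Spanish"))
print(hinted_languages("mix", ["English", "Korean"], "Japanese"))
```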
Given the ubiquity of English proper names and abbreviations (e.g. “Nvidia”, “RAM” etc.), we always consider Latin characters $[\text{a-z}]$ to be acceptable in the output when computing the language adherence metric. Therefore, “Netflix 신작” (Netflix new release) as a Korean output does not violate language adherence.
Model Selection. We select the best zero-shot prompt and SFT mixture based on the language adherence performance on dedicated short utterance datasets (1,500 English and 3,000 Korean utterances, each consisting of a single word), since they are challenging due to phonetic ambiguity and lack of context. The best model is the one with the overall smallest LAVR across scenarios, and the smallest LAVR difference between the correct and distractor hint scenarios, indicating robustness.
### 4.2 Method 1: zero-shot (ZS) Language Hint Prompting
The first approach leverages in-context learning capabilities of LLMs via zero-shot prompting. We provide a language hint to bias the model towards the target language, influencing its hypothesis space. An ideal prompt must balance guiding the model while allowing it to adhere to the audio evidence if the hint is incorrect.
We tested variants of three prompt styles:
- P1: Transcribe the following speech segment in <languages\>:
- P2: The following speech segment is spoken by someone who knows <languages\>. Transcribe the following speech segment:
- P3: Transcribe this speech segment. It may contain a mix of <languages\> and other languages.
The LAVR results on our short-utterance datasets are shown in Table [1](https://arxiv.org/html/2606.17281#S4.T1). Based on these results, we selected P3 for our main experiments, because it proved the most robust against incorrect hints. Phrasing variations within each prompt style had a negligible impact on the results.
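For concreteness, the three prompt styles can be rendered from a list of hinted languages as follows. The template strings follow P1 to P3 and the no-hint prompt from Sec. 4.1; the comma-separated joining of multiple languages and the function name are assumptions, since the exact formatting of `<languages>` is not given.

```python
NO_HINT_PROMPT = "Transcribe the following speech segment: "

PROMPT_TEMPLATES = {
    "P1": "Transcribe the following speech segment in {languages}:",
    "P2": ("The following speech segment is spoken by someone who knows "
           "{languages}. Transcribe the following speech segment:"),
    "P3": ("Transcribe this speech segment. It may contain a mix of "
           "{languages} and other languages."),
}


def build_prompt(style, languages):
    """Fill a prompt template; an empty hint list gives the no-hint prompt."""
    if not languages:
        return NO_HINT_PROMPT
    # Joining with ", " is an assumption of this sketch.
    return PROMPT_TEMPLATES[style].format(languages=", ".join(languages))


print(build_prompt("P3", ["Korean", "Japanese"]))
print(build_prompt("P1", ["French"]))
```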
Table 1: LAVR [%] (lower is better) on the short utterance datasets by prompt. “cor” stands for correct and “dist” for distractor.
### 4.3 Method 2: Supervised Fine-Tuning (SFT)
To explicitly teach the model to follow a language\-hinted prompt, we apply the SFT method to minimize the cross entropy loss for token prediction\. The primary objective is to ensure adherence to language hints while preserving robustness against potentially erroneous cues\. Note that the SFT model has the same latency as the baseline model\.
Prompting strategy for training. For a given training set, we include a system instruction using the P3 prompt from Sec. [4.2](https://arxiv.org/html/2606.17281#S4.SS2). To ensure robustness of the trained model, we randomly vary the hinted language using the four categories defined in Sec. [4.1](https://arxiv.org/html/2606.17281#S4.SS1). For distractor and mix cases, we add up to three (60% one, 30% two, and 10% three) randomly selected languages from a pool of 56 languages.
Mixture of training prompts. We experimented with several training data configurations, varying the distribution of hinted language types. Applying the same model selection criteria (see the Appendix for exact results), we chose the model trained on a mixture of 10% no-hint, 40% correct, 35% distractor-only, and 15% mixed prompts.
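The sampling of hint types for the selected training mixture might look like the sketch below. The weights come from the text; everything else is an assumption for illustration: the function name, the short stand-in pool (the actual pool has 56 languages that are not listed), the exclusion of spoken languages from the distractor candidates, and the use of all spoken languages in the mix case.

```python
import random

SCENARIO_WEIGHTS = {"no-hint": 0.10, "correct": 0.40, "distractor": 0.35, "mix": 0.15}
DISTRACTOR_COUNT_WEIGHTS = {1: 0.60, 2: 0.30, 3: 0.10}


def sample_hint(spoken, pool, rng):
    """Sample a hint scenario and the languages to mention in the prompt."""
    scenario = rng.choices(list(SCENARIO_WEIGHTS),
                           weights=list(SCENARIO_WEIGHTS.values()))[0]
    if scenario == "no-hint":
        return scenario, []
    if scenario == "correct":
        return scenario, list(spoken)
    # Draw one to three distractors that are not actually spoken
    # (excluding spoken languages is an assumption of this sketch).
    candidates = [lang for lang in pool if lang not in spoken]
    count = rng.choices(list(DISTRACTOR_COUNT_WEIGHTS),
                        weights=list(DISTRACTOR_COUNT_WEIGHTS.values()))[0]
    distractors = rng.sample(candidates, count)
    if scenario == "distractor":
        return scenario, distractors
    return scenario, list(spoken) + distractors


# Stand-in pool for illustration only.
POOL = ["German", "Spanish", "Japanese", "Dutch", "Italian", "Korean"]
rng = random.Random(0)
for _ in range(3):
    print(sample_hint(["Korean"], POOL, rng))
```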
### 4.4 Method 3: Chain-of-Thought (CoT) Prompting
Inspired by watanabe2017, our final method uses a Chain-of-Thought (CoT) approach to force the model to explicitly reason about the language before transcribing. For each example, we appended the following specific CoT prompt, including special control tokens to separate the language identification step from the transcription:
> Think about the languages of the speech and transcribe it in those languages\.
The rationale is to narrow the sampling space for the transcription task by first committing to a language\. Formally, the generated word sequence is conditioned on the \(estimated\) language IDs\. While this technique can be considered a variant of SFT, its distinct reasoning process merits separate analysis\. Because the model emits only a few additional tokens \(the language name\), the impact on average decoding time is negligible\. It should be noted that our models operate in a non\-streaming fashion, where the full audio context is available prior to decoding\. Conversely, a streaming implementation might face increased latency if the model defers output to stabilize its language prediction\.
To construct the CoT training data, we prepended the correct language to the reference transcript, enclosed in control tokens. The prompt mixture consists of 90% distractor-only and 10% no-hint prompts. This composition ensures the model learns the correct language signal, while all other training parameters remain consistent with the SFT experiments.
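The construction of a CoT training target, and the matching post-processing of a generated sequence, can be sketched as follows. The control tokens `<lang>` and `</lang>`, the comma-separated language list, and both function names are hypothetical placeholders; the actual control tokens are not specified in the text.

```python
# Hypothetical control tokens; the real ones are not given in the paper.
LANG_OPEN, LANG_CLOSE = "<lang>", "</lang>"


def cot_target(languages, transcript):
    """Prepend the correct language(s), enclosed in control tokens."""
    return f"{LANG_OPEN}{', '.join(languages)}{LANG_CLOSE} {transcript}"


def parse_cot_output(output):
    """Split a generated sequence into (languages, transcript)."""
    if not output.startswith(LANG_OPEN) or LANG_CLOSE not in output:
        return [], output  # the language step is missing
    head, _, tail = output[len(LANG_OPEN):].partition(LANG_CLOSE)
    languages = [name.strip() for name in head.split(",") if name.strip()]
    return languages, tail.lstrip()


target = cot_target(["Korean", "English"], "Netflix \uc2e0\uc791")
print(target)
print(parse_cot_output(target))
```

Only the transcript part would be scored for WER and LAVR; the declared languages are an intermediate step.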
Table 2: Monolingual Results. LAVR (%) and WER (CER for Korean, % in parentheses) by language, model, and language hint type.
Table 3: Code-switching (with English) Results. LAVR (%) and WER (CER for Korean, % in parentheses) by language, model, and language hint type.
### 4.5 Comparative Results
The final LAVR and WER for the three methods are summarized in Tables [2](https://arxiv.org/html/2606.17281#S4.T2) and [3](https://arxiv.org/html/2606.17281#S4.T3). As expected, the “correct” language hint produced the most favorable LAVR and WER for all three methods in nearly all evaluated scenarios. Furthermore, a clear trend emerged: providing at least one correct language hint (i.e., the “correct” and “mix” conditions) significantly outperforms the “no-hint” and “distractor” conditions across both metrics. This pattern underscores the critical importance of obtaining a reliable estimate of the spoken language (e.g. by model prediction and/or other user metadata).
On the whole, the three tested approaches demonstrated broadly comparable performance levels when supplied with identical language hinting prompts\. A notable exception to this trend was the elevated WER observed for both the SFT and CoT methods under the “no\-hint” prompt condition\. This performance degradation is likely attributable to catastrophic forgetting\. The baseline model was previously fine\-tuned extensively using the “no\-hint” prompt; consequently, the reduced proportion \(10%\) of “no\-hint” data utilized during the SFT and CoT training phases appears to have degraded performance for this specific “no\-hint” scenario\. These results strongly suggest that advancements in prompt exploration and, most critically, the accurate inclusion of the spoken language in the prompt, are the most impactful factors for optimizing system performance\.
## 5 Conclusions
We introduce a novel quantitative metric, the Language Adherence Violation Rate \(LAVR\), to systematically evaluate language adherence in multilingual ASR systems\. Recognizing the inherent challenge of obtaining an accurate language signal, we investigated three distinct strategies, zero\-shot prompting, SFT, and CoT, to enhance system robustness against potentially erroneous language hints\. Our findings, based on both LAVR and WER metrics, demonstrate that the zero\-shot prompting strategy is as effective as the SFT and CoT approaches, reinforcing the critical importance of a correct language hint\.
We also observed that when a correct language hint is present, the inclusion of a distractor language \(“mix” condition\) has a negligible negative impact and still significantly outperforms the “no\-hint” scenario\. Conversely, providing only a distractor language hint yields performance that is generally inferior to the “no\-hint” condition\. Therefore, these results strongly advocate for the implementation of upstream mechanisms, such as dedicated Language ID models or metadata analysis, to predict at least one spoken language with high confidence\.
## 6 Acknowledgments
We thank Diamantino Caseiro for encouraging us to write this paper\.
## References
## Appendix A Appendices
### A.1 More information about the evaluation set
We present the overall statistics and sample utterances for the monolingual and code-switching evaluation sets used in the main text in Table [4](https://arxiv.org/html/2606.17281#A1.T4). Transcripts are presented in the native script, except Hindi, where we transliterated to English from the original Hindi script in the dataset for presentation. In all evaluations, we used the native scripts.
To develop our code\-switching dataset, we utilized a proprietary voice generator featuring ten diverse male and female voices per primary \(non\-English\) language, randomly selecting one per utterance for each language pair\. The voices were configured to maintain the accent of the non\-English language; for instance, in the Korean\-English dataset, English phrases are delivered with a Korean accent while the Korean phrases maintain native fluency\.
Table 4: Statistics and representative samples for the multilingual evaluation dataset. Hindi transcripts are transliterated to English from the original Hindi script in the dataset.
### A.2 Model selection for SFT mixture
In this section, we report the performance on the short utterance datasets when varying the SFT training mixture, defined by the ratio of correct, distractor, mix, and no-hint prompt conditions. We evaluated the following candidates.
Table 5: SFT training mixture’s prompt type ratio.
We selected prompt P3 from Sec. [4.2](https://arxiv.org/html/2606.17281#S4.SS2) for training. The performance on short English and Korean utterances is given in Tables [6](https://arxiv.org/html/2606.17281#A1.T6) and [7](https://arxiv.org/html/2606.17281#A1.T7).
Table 6: LAVR on short English data by SFT mixture.
Table 7: LAVR on short Korean data by SFT mixture.
Applying the model selection criteria discussed in Sec. [4.1](https://arxiv.org/html/2606.17281#S4.SS1), we choose M2 as the final candidate.
### A.3 SFT with different base prompts
In the main text, the SFT training was done with the best prompt determined by the zero-shot experiments (P3 from Sec. [4.2](https://arxiv.org/html/2606.17281#S4.SS2)). Here we present how other prompts affect the performance of the SFT model on the short utterance datasets. The mixture of language hints in training is held constant (M2 from Sec. [A.2](https://arxiv.org/html/2606.17281#A1.SS2)).
Table 8: LAVR on short English data by SFT prompt.
Table 9: LAVR on short Korean data by SFT prompt.
We conclude that P3 performs best among the prompts tested with SFT. Although other prompts might improve with different mixtures (other than M2), we did not test all combinations. This added complexity highlights the cost-effectiveness of the zero-shot approach, as SFT requires tuning more variables, such as mixture composition.
### A.4 Performance on more languages
In this section we evaluate the performance of the three proposed methods (ZS, SFT, and CoT) on more languages: German, Japanese and (Brazilian) Portuguese. For code-switching, all of them are mixed with English. The design of both monolingual and code-switching datasets is identical to the datasets used in the main results. Distractor languages are Dutch, Korean, and Spanish for German, Japanese, and Portuguese, respectively. The monolingual results are shown in Table [10](https://arxiv.org/html/2606.17281#A1.T10) and code-switching results in Table [11](https://arxiv.org/html/2606.17281#A1.T11).
Table 10: Monolingual Results. LAVR (%) and WER (CER for Japanese, % in parentheses) by language, model, and language hint type.
Table 11: Code-switching (with English) Results. LAVR (%) and WER (CER for Japanese, % in parentheses) by language, model, and language hint type.
We observe that the general trends established in the main experiments extend to these languages as well.