Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

arXiv cs.CL Papers

Summary

This paper applies Direct Preference Optimization (DPO) to align Audio LLMs for transcribing English-Mandarin code-switching speech, achieving up to 89.6% MER reduction in-distribution and 20% out-of-distribution. It identifies three failure modes—language omission, translation instead of transcription, and hallucination—and shows that preference-based alignment effectively elicits correct code-switching behavior from multilingual Audio LLMs.

arXiv:2605.23975v1 Announce Type: new

Cached at: 05/26/26, 08:59 AM

# Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs
Source: [https://arxiv.org/html/2605.23975](https://arxiv.org/html/2605.23975)
Trung Won Pham He Sun Aw

Cheng Yi Lewis, Minh Duc, Yingxu, Shuo, Ai Ti

1 Institute for Infocomm Research \(I2R\), A⋆STAR, Singapore

2 Nanyang Technological University, Singapore

quangtrung5705@gmail\.com

###### Abstract

Audio large language models \(Audio LLMs\) exhibit systematic failures in transcribing code\-switching speech despite strong multilingual capabilities\. Focusing on English\-Mandarin, we identify three failure modes: language omission, translation\-instead\-of\-transcription, and hallucination\. We apply Direct Preference Optimization \(DPO\) to align models, constructing preference pairs in which chosen responses preserve mixed\-language content while rejected responses mimic failure patterns\. Training three Audio LLMs on 100K pairs \(570 hours\), we observe consistent behavioral shifts: models learn to preserve language composition rather than translating when prompted for transcription\. This alignment yields MER reductions up to 89\.6% \(in\-distribution\) and 20\.0% \(out\-of\-distribution\)\. Our findings suggest DPO can effectively elicit correct code\-switching transcription behavior from multilingual Audio LLMs\.

###### keywords:

code\-switching, speech recognition, audio language models, direct preference optimization, multilingual ASR

†Work done during Nguyen Quang Trung’s internship at Institute for Infocomm Research \(I2R\), A⋆STAR\.

## 1 Introduction

Audio large language models \(Audio LLMs\) extend large language models with the ability to process and understand audio inputs alongside text, enabling tasks such as speech recognition, audio captioning, and spoken dialogue \[radford2023robust,chu2024qwen2,tang2023salmonn\]\. Since Whisper \[radford2023robust\] established the foundation through large\-scale weak supervision, numerous models have emerged with strong multilingual competence: Qwen2\-Audio \[chu2024qwen2\] and the Qwen\-Omni series \[xu2025qwen2,Qwen3\-Omni\] support multiple languages, Phi\-4 Multimodal \[abouelenin2025phi\] demonstrates multilingual performance through its Mixture\-of\-LoRAs architecture, and MERaLiON \[he\-etal\-2025\-meralion\] is specifically designed for Southeast Asian multilingual contexts\. These advances suggest that modern Audio LLMs possess multilingual proficiency, as evidenced by state\-of\-the\-art performance on benchmarks such as Common Voice \[ardila2020common\] and FLEURS \[conneau2023fleurs\]\.

However, this multilingual capability does not automatically extend to code\-switching, the practice of alternating between languages within a conversation or utterance, which is prevalent in multilingual communities worldwide\. We focus on English\-Mandarin code\-switching because it is one of the most widely studied language pairs in Southeast Asia, where the SEAME corpus \[lyu10\_interspeech,zeng2018seame\] serves as a standard benchmark\. Despite the multilingual proficiency of Audio LLMs, even models like MERaLiON, which incorporate extensive code\-switching data during supervised fine\-tuning, exhibit systematic failures when transcribing code\-switching speech\. Through our analysis, we identify three distinct failure modes: \(1\) language omission, where the model outputs only one language while dropping the other; \(2\) translation\-instead\-of\-transcription, where the model translates mixed\-language content into a single language rather than preserving the original; and \(3\) hallucination, where the model generates repeated or fabricated content\.

Prior work has explored various directions to address code\-switching in Automatic Speech Recognition \(ASR\)\. Early systems relied on hybrid pipelines such as phone merging \[vu2012first\] and factored language models \[adel2013combination\]\. More recent approaches focus on reducing reliance on naturally occurring code\-switching data via audio concatenation \[hussein2024collage,nguyen2025noreal\], adapting foundation models through encoding refinement with language\-aware decoding \[zhao2025whispercs,liu2023reducing,liu2024interactive\], and employing Mixture\-of\-Experts architectures for language\-specialized processing \[zhang2025moe,ye2024scmoe\]\. However, none explicitly target the behavioral alignment of Audio LLMs for code\-switching transcription\.

We hypothesize that Audio LLMs possess the latent ability to produce correct code\-switching transcriptions, and that this behavior can be elicited through preference optimization\. To test this hypothesis, we turn to Direct Preference Optimization \(DPO\) \[rafailov2023dpo\], a common direct alignment algorithm for large language models\. Within the speech domain, SpeechAlign \[zhang2024speechalign\] first demonstrated DPO's effectiveness for aligning codec language models\. Notably, Qwen2\-Audio \[chu2024qwen2\] already incorporates DPO in its training pipeline\. However, to our knowledge, no existing work applies DPO specifically to code\-switching transcription in Audio LLMs\. We therefore ask: given models with demonstrated multilingual capability, can DPO elicit correct code\-switching transcription behavior?

To implement this approach, we construct DPO training pairs by pairing ground\-truth code\-switching transcriptions \(chosen\) with synthetically generated flawed transcriptions \(rejected\) that mimic the observed failure modes\. Using approximately 100K preference pairs \(~570 hours\) derived from natural and synthetic code\-switching data, we train three Audio LLMs – MERaLiON\-2\-3B \[he\-etal\-2025\-meralion\], Phi\-4\-multimodal\-instruct \[abouelenin2025phi\], and Qwen2\-Audio\-7B\-Instruct \[chu2024qwen2\] – and evaluate on both in\-distribution and out\-of\-distribution benchmarks, including SEAME dev\_man and dev\_sge \[zeng2018seame\]\.

In summary, our contributions are as follows:

- We identify three systematic failure modes in English\-Mandarin code\-switching transcription exhibited by state\-of\-the\-art multilingual Audio LLMs\.
- We propose a DPO approach to construct preference pairs that contrast correct code\-switching transcriptions with failure\-mimicking alternatives\.
- We demonstrate consistent improvements across three Audio LLM architectures on English\-Mandarin benchmarks, achieving relative MER reductions of up to 20\.0% \(out\-of\-distribution\) and 89\.6% \(in\-distribution\)\.

## 2 Method

Our approach consists of two main components: \(1\) constructing preference pairs where ground\-truth code\-switching transcriptions serve as chosen responses and LLM\-generated flawed transcriptions serve as rejected responses, and \(2\) applying DPO to align model transcription behavior toward the correct code\-switching output\. Figure [1](https://arxiv.org/html/2605.23975#S2.F1) provides an overview of this pipeline, while Table [1](https://arxiv.org/html/2605.23975#S2.T1) illustrates the three failure modes introduced in Section [1](https://arxiv.org/html/2605.23975#S1) with concrete examples\.

\[Figure 1 diagram: a code\-switching audio input and its ground\-truth transcription feed a Qwen3\-32B rejection generator\. The chosen response $\mathbf{y}_w$ \(e\.g\., "我住 temasek poly 那边"\) and the rejected response $\mathbf{y}_l$ \(e\.g\., "I live temasek poly there"\) form a preference pair $(\mathbf{x}, \mathbf{y}_w, \mathbf{y}_l)$ used for DPO training, which yields the aligned Audio LLM\. Rejected responses come from Global Translation \(80%\) and Partial Translation \(20%\)\.\]

Figure 1: Overview of DPO training for code\-switching alignment\. Ground\-truth transcriptions serve as chosen responses \($\mathbf{y}_w$\), while an LLM generates rejected responses \($\mathbf{y}_l$\) that mimic failure modes via Global Translation \(full\) and Partial Translation \(spans only\)\. DPO trains the model to prefer verbatim code\-switching output\.

Table 1: Three failure modes observed in Audio LLMs on code\-switching audio inputs when prompted for transcription\.

### 2\.1 DPO for code\-switching alignment

We apply DPO to align Audio LLM output behavior for code\-switching transcription\. Specifically, given an input $\mathbf{x}$ \(an audio clip and a prompt asking for transcription\), a preferred response $\mathbf{y}_c$ \(ground\-truth transcription\), and a dispreferred response $\mathbf{y}_r$ \(flawed transcription\), DPO optimizes the policy $\pi_{\theta}$ to increase the likelihood of generating $\mathbf{y}_c$ while decreasing the likelihood of generating $\mathbf{y}_r$, directly and without explicit reward modeling:

$$
\mathcal{L}_{\text{DPO}} = -\mathbb{E}\Big[\log\sigma\Big(\beta\log\frac{\pi_{\theta}(\mathbf{y}_c|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_c|\mathbf{x})} - \beta\log\frac{\pi_{\theta}(\mathbf{y}_r|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_r|\mathbf{x})}\Big)\Big] \tag{1}
$$

where $\pi_{\theta}$ is the active policy being trained, $\pi_{\text{ref}}$ is the reference policy \(frozen base model\), $\sigma$ is the sigmoid function, $\beta$ controls preference strength, $\mathbf{y}_c$ \(chosen\) is the ground\-truth code\-switching transcription, and $\mathbf{y}_r$ \(rejected\) is the response that mimics the observed failure modes\.
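To make Equation (1) concrete, the following is a minimal sketch of the per-example loss, assuming the four sequence log-probabilities have already been computed by the policy and the frozen reference model. The function names and the value `beta=0.1` are illustrative assumptions, not settings reported in the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float) -> float:
    """Per-example DPO loss from Equation (1).

    Each argument is log pi(y | x) summed over the tokens of y.
    """
    # log [pi_theta(y_c|x) / pi_ref(y_c|x)]
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    # log [pi_theta(y_r|x) / pi_ref(y_r|x)]
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Hypothetical log-probabilities, for illustration only.
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1)
after_update = dpo_loss(-8.0, -15.0, -10.0, -12.0, beta=0.1)
```

When the policy equals the reference, the margin is zero and the loss is log 2; raising the chosen log-probability or lowering the rejected one, relative to the reference, pushes the loss below that value.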

### 2\.2 Training data construction

#### 2\.2\.1 Rejected sample generation

To create the rejected samples, we use Qwen3\-32B \[yang2025qwen3\] to transform ground\-truth transcriptions into flawed versions\. We employ two complementary strategies, both targeting translation\-based failure modes:

Global Translation \(80%\): This strategy translates all content from one language to the other \(all Chinese → English or all English → Chinese\), thereby mimicking translation\-instead\-of\-transcription failures\.

Partial Translation \(20%\): In contrast, this strategy translates only specific short spans within the utterance, mimicking partial language omission where isolated segments are incorrectly rendered in the wrong language\.

We chose the 80/20 ratio based on the higher frequency of full translation errors observed in baseline models\. Table [2](https://arxiv.org/html/2605.23975#S2.T2) illustrates both strategies with examples\. Although both strategies produce translation\-based rejected samples, and we do not explicitly generate rejected samples representing content dropping or hallucination \(repetition/fabrication\), we show in Section [4](https://arxiv.org/html/2605.23975#S4) that DPO training reduces all three failure modes\. This is likely because models trained to preserve language composition also develop more stable generation patterns\.
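The 80/20 split between the two strategies can be sketched as a weighted draw per training pair. This is only an illustration under stated assumptions: `make_rejected` is a hypothetical callable standing in for the Qwen3-32B rewriting step, and the dictionary layout of a pair is not taken from the paper.

```python
import random

STRATEGIES = ("global_translation", "partial_translation")
WEIGHTS = (0.8, 0.2)  # the 80/20 ratio stated in the text


def sample_strategy(rng: random.Random) -> str:
    """Draw one rejection strategy according to the 80/20 ratio."""
    return rng.choices(STRATEGIES, weights=WEIGHTS, k=1)[0]


def build_pair(audio_id, ground_truth, make_rejected, rng):
    """Assemble one preference pair; field names are hypothetical."""
    strategy = sample_strategy(rng)
    return {
        "audio": audio_id,
        "chosen": ground_truth,  # verbatim code-switching transcript
        "rejected": make_rejected(ground_truth, strategy),
        "strategy": strategy,
    }


def stub_rewriter(text: str, strategy: str) -> str:
    """Placeholder for the LLM call; only tags the text, does not translate."""
    return f"<{strategy}> {text}"


rng = random.Random(0)
pair = build_pair("utt_0001", "我住 temasek poly 那边", stub_rewriter, rng)
draws = [sample_strategy(rng) for _ in range(10_000)]
global_share = draws.count("global_translation") / len(draws)
```

Over many draws, `global_share` should land close to 0.8, matching the stated ratio.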

Table 2: Examples of rejected sample generation strategies\.

#### 2\.2\.2 Data sources

We construct DPO training pairs from two complementary datasets as summarized in Table [3](https://arxiv.org/html/2605.23975#S2.T3)\.

Table 3: Training data composition for DPO\.

CS\-Dialogue \[zhou2025csdialogue\]: This contains spontaneous Mandarin\-English code\-switching dialogues from 200 speakers, where each utterance is tagged as English\-only \(EN\), Chinese\-only \(CN\), or code\-switched \(MIX\)\. From this data, we construct segments in two ways: first, by grouping consecutive MIX\-tagged utterances containing natural intra\-sentential code\-switching; and second, by concatenating EN and CN utterances from the same conversation to create inter\-sentential code\-switching\. Together, these approaches combine authentic spontaneous code\-switching with controlled cross\-utterance language mixing within coherent conversational contexts\.

EMILIA \[he2024emilia\]: To complement CS\-Dialogue with additional scale and diversity, we create synthetic code\-switching samples by concatenating English and Chinese clips from the EMILIA corpus\. Each segment is constructed by randomly sampling clips from both languages and concatenating them, producing inter\-sentential code\-switching audio at scale\.
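The concatenation step could look roughly like the sketch below. It assumes one clip per language per segment, a random language order, and clips stored as lists of samples at a shared sample rate; the paper does not specify these details, and the function name is hypothetical.

```python
import random


def make_inter_sentential(en_clips, zh_clips, rng):
    """Concatenate one English and one Chinese clip in random order.

    Each clip is a (samples, transcript) tuple, where samples is a list
    of floats. Returns the joined waveform and the joined transcript.
    """
    en = rng.choice(en_clips)
    zh = rng.choice(zh_clips)
    # Randomise which language comes first (an assumption of this sketch).
    first, second = (en, zh) if rng.random() < 0.5 else (zh, en)
    samples = first[0] + second[0]
    transcript = f"{first[1]} {second[1]}"
    return samples, transcript


# Toy clips with made-up samples and transcripts.
en_clips = [([0.1, 0.2, 0.3], "see you tomorrow")]
zh_clips = [([0.4, 0.5], "我们明天见")]
samples, transcript = make_inter_sentential(en_clips, zh_clips, random.Random(0))
```

The joined transcript then serves as the chosen response for the synthetic segment.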

## 3 Experimental setup

### 3\.1 Models

To demonstrate the generalizability of our approach, we experiment with three multilingual Audio LLMs, all of which were already trained on both English and Mandarin\. Table [4](https://arxiv.org/html/2605.23975#S3.T4) summarizes the training configurations\.

Table 4: DPO training configurations\. We tuned $\beta$ per model based on each architecture's sensitivity to the strength of preference optimization: lower $\beta$ enables stronger updates, while higher $\beta$ produces more conservative changes\. All hyperparameters were selected through validation\-based tuning on held\-out splits of the training data\.

MERaLiON\-2\-3B \[he\-etal\-2025\-meralion\] is specifically designed for Southeast Asian multilingual speech and includes extensive code\-switching data in its supervised fine\-tuning stage\.

Phi\-4\-multimodal\-instruct \[abouelenin2025phi\] is a general\-purpose multimodal model with strong multilingual capability, including English and Mandarin\.

Qwen2\-Audio\-7B\-Instruct \[chu2024qwen2\] is a foundational Audio LLM with state\-of\-the\-art performance across multiple audio benchmarks\. For this model, we apply LoRA adaptation with rank 256 targeting all attention and MLP modules, because preliminary experiments with full fine\-tuning across multiple hyperparameter configurations consistently produced degraded outputs with repetitive tokens and severe hallucinations\. LoRA, by contrast, preserved more stable generation behavior\.
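As a rough sense of scale for rank-256 LoRA, the sketch below counts the trainable parameters added to a single linear layer. LoRA factors the update to a `d_out x d_in` weight as `B @ A`, with `A` of shape `rank x d_in` and `B` of shape `d_out x rank`. The 4096 x 4096 layer size is a hypothetical example, not a dimension reported in the paper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (A plus B)."""
    return rank * d_in + d_out * rank


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight that full fine-tuning would update."""
    return d_in * d_out


# Hypothetical square projection, for illustration only.
added = lora_param_count(4096, 4096, rank=256)
share = added / full_param_count(4096, 4096)
```

For this hypothetical layer, the adapter trains 2,097,152 parameters, one eighth of the dense weight, so the update is constrained even at a comparatively high rank.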

All three models were trained for one epoch on 8 H100 GPUs\.

### 3\.2 Prompt diversity

Audio LLMs require both audio input and a text prompt to perform transcription\. To prevent overfitting to a single prompt template during training, we use 20 English and 20 Chinese prompts, all requesting transcription but with varied phrasing\. Examples include: "Please transcribe the speech in this audio file\.", "Can you transcribe this audio for me?", "请帮我转写这段音频。" \(Please transcribe this audio\), and "这段音频里在说什么?" \(What is being said in this audio?\)\. During training, prompts are randomly sampled from this pool; during evaluation, we use a single fixed prompt, "Please transcribe this speech\.", which is a common phrasing that is not included in the training pool\.
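A minimal sketch of this prompt handling follows. Only the four example prompts quoted above are listed; the full 40-prompt pool is not given in the text, and the function name is hypothetical.

```python
import random

# Subset of the training pool: only the examples quoted in the text.
TRAIN_PROMPTS = [
    "Please transcribe the speech in this audio file.",
    "Can you transcribe this audio for me?",
    "请帮我转写这段音频。",
    "这段音频里在说什么?",
]

# Fixed evaluation prompt, deliberately absent from the training pool.
EVAL_PROMPT = "Please transcribe this speech."


def pick_prompt(training: bool, rng: random.Random) -> str:
    """Sample a varied prompt for training; use the fixed one for evaluation."""
    if training:
        return rng.choice(TRAIN_PROMPTS)
    return EVAL_PROMPT


rng = random.Random(0)
train_prompt = pick_prompt(True, rng)
eval_prompt = pick_prompt(False, rng)
```

Keeping the evaluation prompt out of the pool means reported scores also reflect robustness to an unseen phrasing.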

### 3\.3 Evaluation benchmarks

We evaluate on four benchmarks \(Table [5](https://arxiv.org/html/2605.23975#S3.T5)\), including both in\-distribution and out\-of\-distribution test sets to assess generalization\.

Table 5: Evaluation benchmarks for English\-Mandarin code\-switching ASR\.

SEAME \[lyu10\_interspeech\] is a standard English\-Mandarin code\-switching corpus collected from conversational speech in Singapore and Malaysia\. We evaluate on its dev\_man and dev\_sge splits \[zeng2018seame\] \(2,610 and 3,222 utterances, respectively\), which serve as our out\-of\-distribution test sets, since no SEAME data appears in the training set\.

EMILIA\-test and CS\-Dialogue\-test, in contrast, are held\-out portions of the training and validation data sources and thus represent in\-distribution evaluation\.

### 3\.4 Evaluation metric

We use Mixed Error Rate \(MER\), a standard metric for code\-switching ASR evaluation\. MER applies character\-level tokenization to Chinese text and word\-level tokenization to English text, respecting the natural linguistic structure of both languages\. To ensure accurate MER calculation, all text is lowercased and punctuation is removed before evaluation\. Moreover, we apply model\-specific output normalization\. For example, Qwen2\-Audio\-7B\-Instruct typically outputs "The original content of this audio is: \[transcription\]"\. We filter such patterns to extract only the transcription content, ensuring a fair comparison\. Lower MER indicates better performance\.
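The metric described above can be sketched as follows: normalize, tokenize Chinese by character and English by word, then divide the token-level edit distance by the reference length. This is an illustrative implementation rather than the authors' scoring script; the CJK range check and the single prefix pattern are simplifying assumptions.

```python
import re

# Wrapper text to strip; only the pattern quoted in the text is listed.
PREFIXES = ("the original content of this audio is:",)


def is_cjk(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def normalize(text: str) -> str:
    """Lowercase, strip known wrapper prefixes, remove punctuation."""
    text = text.lower().strip()
    for prefix in PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
    return re.sub(r"[^\w\s]", "", text)


def tokenize(text: str) -> list:
    """Chinese characters become single tokens; other runs become words."""
    tokens, word = [], []
    for ch in normalize(text):
        if is_cjk(ch) or ch.isspace():
            if word:
                tokens.append("".join(word))
                word = []
            if is_cjk(ch):
                tokens.append(ch)
        else:
            word.append(ch)
    if word:
        tokens.append("".join(word))
    return tokens


def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance over tokens (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def mer(reference: str, hypothesis: str) -> float:
    """Mixed Error Rate as a fraction of reference tokens."""
    ref = tokenize(reference)
    if not ref:
        raise ValueError("reference has no tokens")
    return edit_distance(ref, tokenize(hypothesis)) / len(ref)


score = mer("我住 temasek poly 那边", "I live temasek poly there")
```

On the Figure 1 example, the reference has six tokens (four Chinese characters and two English words) and the translated hypothesis needs three substitutions and one deletion, so `score` is 4/6.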

## 4 Results

### 4\.1 Quantitative analysis

Table [6](https://arxiv.org/html/2605.23975#S4.T6) presents MER scores across all models and benchmarks\. These results show that DPO consistently improves code\-switching transcription performance across all configurations, though the magnitude of improvement varies considerably by model and benchmark\.

Table 6: Main results \(MER %, lower is better\)\. All models show consistent improvement after DPO training\.

MERaLiON\-2\-3B shows modest SEAME improvements \(0\.7–2\.0%\), which reflects that this model already incorporates extensive code\-switching data in its supervised training, leaving limited room for further gains\. Nevertheless, DPO still provides an 11\.1% relative improvement on in\-distribution CS\-Dialogue data\.

Phi\-4\-multimodal\-instruct, on the other hand, shows dramatic improvement: MER drops from 70\.98% to 7\.38% on EMILIA \(89\.6% relative reduction\)\. We attribute this large gain to the model's limited exposure to code\-switching during its training process, which causes the baseline to frequently translate mixed\-language content to a single language or produce severe repetition when asked to transcribe\. Hence, a single epoch of DPO effectively elicits correct code\-switching transcription behavior\.

Qwen2\-Audio\-7B\-Instruct demonstrates substantial improvement on SEAME dev\_man \(20\.0% relative\), alongside consistent gains on in\-distribution benchmarks \(5\.9–19\.3%\)\.
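The relative reductions quoted in this section follow from the baseline and post-DPO MER values. As a quick check of the one pair of raw scores given in the text (Phi-4-multimodal-instruct on EMILIA), with a helper name chosen for illustration:

```python
def relative_reduction(baseline: float, after: float) -> float:
    """Relative MER reduction in percent: 100 * (baseline - after) / baseline."""
    return 100.0 * (baseline - after) / baseline


# MER values reported in the text for Phi-4-multimodal-instruct on EMILIA.
phi4_emilia = relative_reduction(70.98, 7.38)
```

Rounded to one decimal place this gives 89.6%, matching the reported figure.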

### 4\.2 Qualitative analysis

Beyond aggregate MER scores, our primary goal is *behavioral alignment* for code\-switching transcription: when asked to transcribe mixed\-language audio, the model should preserve the original language composition rather than translating to a single language, dropping one language, or producing repetitive content\. In manual inspection of model outputs, we frequently observe that DPO shifts generation toward the desired behavior: models become more likely to preserve the mixed\-language pattern and produce more stable transcriptions\. To illustrate these behavioral changes, Table [7](https://arxiv.org/html/2605.23975#S4.T7) shows representative corrections after DPO training\.

Table 7: Qualitative examples showing DPO corrections\.

### 4\.3 Analysis

Taken together, these results show that DPO consistently shifts transcription behavior toward the desired code\-switching output across three model families\. These qualitative corrections support our hypothesis that the ability to produce accurate code\-switching transcriptions is often latent in multilingual Audio LLMs but is not reliably expressed when prompted to transcribe\. DPO thus provides a lightweight mechanism to elicit the intended code\-switching behavior from such models through preference pairs\. Because our goal is behavioral alignment, we consider these qualitative shifts in code\-switching awareness to be a primary outcome alongside MER\.

## 5 Discussion

Limitations\. While our results are encouraging, several aspects of our approach present opportunities for refinement\. First, we focus exclusively on English\-Mandarin code\-switching; generalization to other language pairs remains untested\. Second, our rejected samples are synthetic transformations rather than samples drawn from the model's actual outputs; this provides controllability and scalability, but it may introduce distributional shifts relative to real failure modes\. Third, we employ vanilla DPO without algorithmic modifications; recent variants such as SimPO \[meng2024simpo\], mDPO \[wang2024mdpo\], or iterative refinement approaches could potentially yield further improvements\. Fourth, our rejection strategy focuses primarily on translation\-based failures; explicit generation of hallucination and content\-dropping examples might strengthen alignment for those specific modes\.

## 6 Conclusion

In this work, we applied Direct Preference Optimization to address English\-Mandarin code\-switching failures in three Audio LLMs: MERaLiON\-2\-3B, Phi\-4\-multimodal\-instruct, and Qwen2\-Audio\-7B\-Instruct\. Starting from the observation that these models exhibit systematic failure modes – language omission, translation\-instead\-of\-transcription, and hallucination – we constructed approximately 100K DPO training pairs \(~570 hours\) using LLM\-generated rejected samples that mimic these failures\.

Our experiments demonstrate that DPO training yields consistent improvements across all models and benchmarks, with relative MER reductions of up to 89\.6% on in\-distribution data \(Phi\-4 on EMILIA\) and 20\.0% on out\-of\-distribution data \(Qwen2\-Audio on SEAME dev\_man\)\. Furthermore, qualitative analysis confirms that DPO effectively corrects all three failure modes, enabling models to produce verbatim mixed\-language transcriptions\.

To our knowledge, this is among the first demonstrations that Direct Preference Optimization can elicit correct code\-switching transcription behavior from multilingual Audio LLMs, and it offers a viable mechanism for aligning models that already possess multilingual capability\. We hope this work opens avenues for extending to other language pairs, exploring alternative preference optimization algorithms, and advancing broader Audio LLM architectures\.

## References
