Tag
This paper applies Direct Preference Optimization (DPO) to align Audio LLMs for transcribing English-Mandarin code-switching speech, reducing mixed error rate (MER) by up to 89.6% in-distribution and 20% out-of-distribution. It identifies three failure modes (language omission, translation instead of transcription, and hallucination) and shows that preference-based alignment effectively elicits correct code-switching behavior from multilingual Audio LLMs.
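For readers unfamiliar with DPO, the standard per-pair objective can be sketched as follows. This is a minimal illustration of the general DPO loss, not the paper's training code. The function name, the `beta` default, and the pairing of a correct code-switched transcript ("chosen") with a failure-mode output ("rejected") are assumptions made for this example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and rejected
    transcripts under the trainable policy and the frozen reference model.
    beta=0.1 is an assumed, commonly used value, not one from the paper.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss equals log 2 when the policy and reference agree, and it falls as the policy raises the likelihood of the chosen transcript relative to the rejected one.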
EchoDistill is an alignment-based noisy-to-clean self-distillation framework that improves the robustness of Audio Large Language Models (ALLMs) against real-world noise by using a frozen clean-audio teacher to guide the student model via group-relative policy optimization (GRPO). Experiments show significant improvements in semantic reliability and task performance under strong noise without additional inference costs.
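The group-relative step that gives GRPO its name can be sketched as below. This shows only the generic advantage normalization within a group of sampled responses. How EchoDistill derives rewards from the clean-audio teacher is not stated in this summary, so the reward values in the usage note are hypothetical, and the use of the population standard deviation and the `eps` term are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same input.

    Each advantage is (reward - group mean) / (group std + eps).
    eps (assumed) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, with hypothetical rewards `[1.0, 0.0, 0.0, 1.0]` for four student responses to one noisy input, the two higher-reward responses receive positive advantages and the other two receive negative ones, so no separate value model is needed to form a baseline.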