MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

Hugging Face Daily Papers 04/19/26, 12:00 AM Papers

Summary

MoVE proposes a Mixture-of-LoRA-Experts architecture that preserves laughter and crying in speech-to-speech translation, achieving 76% NV retention with only 30 minutes of curated data.

Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs) such as laughter and crying, which convey pragmatic intent; this loss severely limits real-world utility. We address this via three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets, which overcomes data scarcity. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router that blends experts to capture hybrid expressive states. Third, we show that pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English-Chinese S2ST, compared with strong baselines, MoVE reproduces target NVs in 76% of cases and achieves the highest human-rated naturalness and emotional fidelity among all compared systems, whereas existing S2ST systems preserve at most 14% of NVs.

Original Article


Cached at: 04/22/26, 02:41 PM


Source: https://huggingface.co/papers/2604.17435 Published on Apr 19

Submitted by 陳思齊 (https://huggingface.co/47z) on Apr 22

Abstract

MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router, enables effective speech-to-speech translation with preserved non-verbal vocalizations while achieving high naturalness and emotional fidelity using minimal curated data.

Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs) such as laughter and crying, which convey pragmatic intent; this loss severely limits real-world utility. We address this via three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets, which overcomes data scarcity. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router that blends experts to capture hybrid expressive states. Third, we show that pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English-Chinese S2ST, compared with strong baselines, MoVE reproduces target NVs in 76% of cases and achieves the highest human-rated naturalness and emotional fidelity among all compared systems, whereas existing S2ST systems preserve at most 14% of NVs.
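The abstract does not spell out how the soft-weighting router combines the adapters, so the following is only a minimal sketch of the general idea of a soft-weighted mixture of LoRA experts, not the authors' implementation. It assumes a single linear layer with a frozen base weight, one low-rank adapter per expressive state, and a softmax router over all experts (no top-k selection). The class names, the expert labels ("neutral", "laughter", "crying"), the dimensions, and the random initialization are all hypothetical choices made for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Small random matrix; stands in for trained weights in this sketch."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LoRAExpert:
    """One low-rank adapter: delta(x) = (alpha / rank) * B @ (A @ x)."""

    def __init__(self, rng, d_in, d_out, rank, alpha=1.0):
        self.A = rand_matrix(rng, rank, d_in)
        # LoRA usually initializes B to zero before training; random values
        # are used here only so the toy output is non-trivial.
        self.B = rand_matrix(rng, d_out, rank)
        self.scale = alpha / rank

    def delta(self, x):
        return [self.scale * y for y in matvec(self.B, matvec(self.A, x))]


class SoftMoLELayer:
    """Frozen base projection plus a softmax-weighted blend of LoRA experts."""

    def __init__(self, d_in, d_out, expert_names, rank=2, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(rng, d_out, d_in)  # frozen base weight (assumed)
        self.names = list(expert_names)
        self.experts = [LoRAExpert(rng, d_in, d_out, rank) for _ in self.names]
        self.router = rand_matrix(rng, len(self.names), d_in)  # one logit row per expert

    def gate(self, x):
        """Soft weights over all experts; they are positive and sum to 1."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        weights = self.gate(x)
        out = matvec(self.W, x)
        # Every expert contributes, scaled by its gate weight, so hybrid
        # states can draw on several adapters at once.
        for g, expert in zip(weights, self.experts):
            for i, d in enumerate(expert.delta(x)):
                out[i] += g * d
        return out, dict(zip(self.names, weights))


# Hypothetical expert labels and a toy 4-dimensional input.
layer = SoftMoLELayer(d_in=4, d_out=3, expert_names=["neutral", "laughter", "crying"])
x = [0.5, -1.0, 0.25, 2.0]
y, gates = layer.forward(x)
print(gates)
```

Because the router applies a softmax rather than picking a single expert, an input that sits between two expressive states receives a blend of both adapters instead of a hard switch, which is one plausible reading of "blends experts for capturing hybrid expressive states".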


Get this paper in your agent:

hf papers read 2604.17435

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs

Streaming Speech-to-Text Translation with a SpeechLLM

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

@multimodalart: they extracted only the audio bit of LTX-2.3, fine-tuned for TTS task and achieved SOTA TTS emotional control??? try it…

Mixture of Experts (MoEs) in Transformers
