MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
Summary
MoVE proposes a Mixture-of-LoRA-Experts architecture that preserves laughter and crying in speech-to-speech translation, reproducing target non-verbal vocalizations (NVs) in 76% of cases with only 30 minutes of curated data.
Cached at: 04/22/26, 02:41 PM
Source: https://huggingface.co/papers/2604.17435 Published on Apr 19
Submitted by 陳思齊 (https://huggingface.co/47z) on Apr 22
Abstract
MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router, enables effective speech-to-speech translation with preserved non-verbal vocalizations while achieving high naturalness and emotional fidelity using minimal curated data.
Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs), such as laughter and crying, that convey pragmatic intent, which severely limits real-world utility. We address this via three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets to overcome the data scarcity limitation. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router that blends experts for capturing hybrid expressive states. Third, we show pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English-Chinese S2ST, compared with strong baselines, MoVE reproduces target NVs in 76% of cases and achieves the highest human-rated naturalness and emotional fidelity among all compared systems, whereas existing S2ST systems preserve at most 14% of NVs.
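The abstract describes the router only at a high level, so the following is a minimal sketch of what a soft-weighted mixture of LoRA experts on a single linear layer might look like, not the authors' implementation. Everything named here is an assumption made for illustration: the class `SoftLoRAMixture`, the linear softmax router, the `alpha / rank` scaling, the dimensions, and the use of plain Python lists in place of a tensor library. The one point taken from the abstract is that every expert contributes to the output in proportion to its gate weight, rather than a hard top-k selection.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian-initialised matrix."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class SoftLoRAMixture:
    """Hypothetical layer: frozen base weight plus softly blended LoRA experts."""

    def __init__(self, d_in, d_out, num_experts, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.base = rand_matrix(d_out, d_in, rng)          # frozen pretrained weight
        self.router = rand_matrix(num_experts, d_in, rng)  # assumed linear router
        # Per-expert LoRA factors: down-projection random, up-projection zero,
        # so the layer initially reproduces the base output exactly.
        self.down = [rand_matrix(rank, d_in, rng) for _ in range(num_experts)]
        self.up = [[[0.0] * rank for _ in range(d_out)] for _ in range(num_experts)]
        self.scale = alpha / rank

    def gate(self, x):
        """Soft weights over experts: all positive, summing to one."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        y = matvec(self.base, x)
        for g, down, up in zip(self.gate(x), self.down, self.up):
            delta = matvec(up, matvec(down, x))  # low-rank update of one expert
            y = [yi + g * self.scale * di for yi, di in zip(y, delta)]
        return y


# Two illustrative experts (for example, one for laughter and one for crying).
layer = SoftLoRAMixture(d_in=4, d_out=3, num_experts=2, rank=2)
x = [0.5, -1.0, 0.25, 2.0]
weights = layer.gate(x)
output = layer.forward(x)
```

Because the gate is a softmax rather than an argmax, an input that sits between two expressive states receives a blend of both adapters, which is one plausible reading of "capturing hybrid expressive states" in the abstract.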
Similar Articles
Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs
Mix-MoE proposes a mixed Mixture-of-Experts framework with specialized expert groups and Fourier-transform-enhanced routing to mitigate parameter interference in multilingual machine translation, achieving significant improvements over baselines.
Streaming Speech-to-Text Translation with a SpeechLLM
Presents a SpeechLLM architecture for streaming speech-to-text translation that adaptively decides when to output tokens based on audio, achieving 1-2 second latency with quality close to non-streaming baselines.
LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
LoMo proposes a data curation method that reformulates single-modality prompts into interleaved multimodal sequences to improve cross-modal representation alignment in vision-language models, achieving consistent gains on multiple benchmarks.
@multimodalart: they extracted only the audio bit of LTX-2.3, fine-tuned for TTS task and achieved SOTA TTS emotional control??? try it…
A fine-tuned version of the LTX-2.3 model's audio component achieves state-of-the-art emotional control in text-to-speech, now available as a Hugging Face Space called DramaBox by ResembleAI.
Mixture of Experts (MoEs) in Transformers
Hugging Face blog post explaining Mixture of Experts (MoEs) architecture in Transformers, covering the shift from dense to sparse models, weight loading optimizations, expert parallelism, and training techniques for MoE-based language models.