MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
Summary
MoVE proposes a Mixture-of-LoRA-Experts architecture that preserves laughter and crying in speech-to-speech translation, achieving 76% NV retention with only 30 minutes of curated data.
Source: https://huggingface.co/papers/2604.17435
Published on Apr 19 · Submitted by 陳思齊 (https://huggingface.co/47z) on Apr 22
Abstract
MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router, enables effective speech-to-speech translation with preserved non-verbal vocalizations while achieving high naturalness and emotional fidelity using minimal curated data.
Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs), such as laughter and crying, that convey pragmatic intent, which severely limits real-world utility. We address this with three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets to overcome data scarcity. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router that blends experts to capture hybrid expressive states. Third, we show that pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English-Chinese S2ST, MoVE reproduces target NVs in 76% of cases, whereas existing S2ST systems preserve at most 14%, and it achieves the highest human-rated naturalness and emotional fidelity among all compared systems.
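The abstract's central architectural idea is a soft-weighting router that blends several LoRA experts rather than hard-selecting one, so mixed expressive states (e.g., speech overlapping with laughter) can draw on multiple specialized adapters at once. Below is a minimal sketch of that idea in PyTorch; the module names, expert count, and rank are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    # One low-rank adapter: a down-projection followed by an up-projection.
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a zero update to the frozen backbone

    def forward(self, x):
        return self.up(self.down(x))

class SoftMixtureOfLoRA(nn.Module):
    # Soft-weighting router: a softmax over experts gives per-token blend
    # weights, so several expressive adapters can be mixed instead of
    # hard-routing to a single expert.
    def __init__(self, dim, num_experts=4, rank=8):
        super().__init__()
        self.experts = nn.ModuleList([LoRAExpert(dim, rank) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):  # x: (batch, time, dim) hidden states from the backbone
        weights = torch.softmax(self.router(x), dim=-1)                 # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        delta = (expert_out * weights.unsqueeze(2)).sum(dim=-1)         # weighted blend
        return x + delta  # residual adapter update on top of the frozen layer output

The soft blend is what distinguishes this routing from standard top-k MoE gating: every expert contributes in proportion to its router weight, which matches the abstract's motivation of capturing hybrid expressive states.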
Get this paper in your agent: hf papers read 2604.17435
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Streaming Speech-to-Text Translation with a SpeechLLM
Presents a SpeechLLM architecture for streaming speech-to-text translation that adaptively decides when to output tokens based on audio, achieving 1-2 second latency with quality close to non-streaming baselines.
Mixture of Experts (MoEs) in Transformers
Hugging Face blog post explaining Mixture of Experts (MoEs) architecture in Transformers, covering the shift from dense to sparse models, weight loading optimizations, expert parallelism, and training techniques for MoE-based language models.
MultiLinguahah: A New Unsupervised Multilingual Acoustic Laughter Segmentation Method
This paper introduces MultiLinguahah, an unsupervised multilingual method for acoustic laughter segmentation using Isolation Forests on BYOL-A encoder representations. The authors demonstrate that their approach outperforms state-of-the-art supervised methods in non-English settings by treating laughter detection as an anomaly detection task.
How Descript engineers multilingual video dubbing at scale
Descript redesigned its translation pipeline using OpenAI reasoning models to optimize multilingual video dubbing at scale, achieving a 15% increase in translated video exports and a 13-43% improvement in duration adherence across languages by addressing the challenge of matching speech duration to video timing constraints.
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
The paper introduces XPERT, a framework that extracts and reuses expert knowledge from pre-trained Mixture-of-Experts (MoE) language models to improve training efficiency and performance in downstream models.