MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

Hugging Face Daily Papers

Summary

MoVE proposes a Mixture-of-LoRA-Experts architecture that preserves laughter and crying in speech-to-speech translation, achieving 76% NV retention with only 30 minutes of curated data.

Original Article

Cached at: 04/22/26, 02:41 PM

Paper page - MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

Source: https://huggingface.co/papers/2604.17435 · Published on Apr 19

Submitted by 陳思齊 (https://huggingface.co/47z) on Apr 22

Abstract

MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router, enables effective speech-to-speech translation with preserved non-verbal vocalizations while achieving high naturalness and emotional fidelity using minimal curated data.

Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs) such as laughter and crying, which convey pragmatic intent; losing them severely limits real-world utility. We address this via three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets, overcoming the data scarcity limitation. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router that blends experts to capture hybrid expressive states. Third, we show that pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English-Chinese S2ST, MoVE reproduces target NVs in 76% of cases, whereas existing S2ST systems preserve at most 14%, and it achieves the highest human-rated naturalness and emotional fidelity among all compared systems.
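The abstract names the core mechanism without spelling it out. Below is a minimal sketch of what a Mixture-of-LoRA-Experts layer with a soft-weighting router could look like: a frozen base projection plus per-expert low-rank adapters whose outputs are blended by a dense softmax gate. The expert count, rank, scaling, and router design here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoLoRALinear(nn.Module):
    """Illustrative Mixture-of-LoRA-Experts layer (not the paper's code).

    A frozen pretrained projection is augmented by several LoRA experts
    (e.g. one per vocalization type: neutral, laughter, crying). A
    soft-weighting router blends *all* experts with continuous weights
    rather than picking a hard top-k, so hybrid expressive states can
    mix expert contributions.
    """

    def __init__(self, d_in, d_out, n_experts=3, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)   # frozen backbone weight
        self.base.bias.requires_grad_(False)
        # Per-expert low-rank factors: delta_e(x) = B_e @ (A_e @ x)
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, rank))  # zero init: no delta at start
        self.scale = alpha / rank
        self.router = nn.Linear(d_in, n_experts)  # soft-weighting router

    def forward(self, x):                          # x: (batch, seq, d_in)
        gates = F.softmax(self.router(x), dim=-1)  # (batch, seq, n_experts)
        low = torch.einsum("bsd,erd->bser", x, self.A)       # (b, s, e, rank)
        delta = torch.einsum("bser,eor->bseo", low, self.B)  # (b, s, e, d_out)
        blended = (gates.unsqueeze(-1) * delta).sum(dim=2)   # dense blend of experts
        return self.base(x) + self.scale * blended
```

The key departure from standard sparse MoE routing is that the gate is dense: every expert contributes a continuous weight, which is what lets a hybrid state such as tearful laughter draw on both the laughter and crying adapters at once.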




Similar Articles

Streaming Speech-to-Text Translation with a SpeechLLM

arXiv cs.CL

Presents a SpeechLLM architecture for streaming speech-to-text translation that adaptively decides when to output tokens based on audio, achieving 1-2 second latency with quality close to non-streaming baselines.

Mixture of Experts (MoEs) in Transformers

Hugging Face Blog

Hugging Face blog post explaining Mixture of Experts (MoEs) architecture in Transformers, covering the shift from dense to sparse models, weight loading optimizations, expert parallelism, and training techniques for MoE-based language models.
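For contrast with the dense soft-weighting router sketched above, here is a minimal top-k sparse MoE feed-forward block of the kind the post covers; the expert count, k, and layer sizes are arbitrary illustrative choices, not code from the blog.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal top-k routed mixture-of-experts FFN (illustrative sketch)."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the top-k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:           # expert selected by no token
                continue
            # Each token is processed only by its k chosen experts.
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

Because each token activates only k of the experts, parameter count grows with the number of experts while per-token compute stays roughly constant, which is the dense-to-sparse trade-off the post walks through.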

MultiLinguahah: A New Unsupervised Multilingual Acoustic Laughter Segmentation Method

arXiv cs.CL

This paper introduces MultiLinguahah, an unsupervised multilingual method for acoustic laughter segmentation using Isolation Forests on BYOL-A encoder representations. The authors demonstrate that their approach outperforms state-of-the-art supervised methods in non-English settings by treating laughter detection as an anomaly detection task.
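The blurb's core idea (laughter as an acoustic anomaly among speech frames) is easy to sketch. Assuming frame-level embeddings have already been extracted by an audio encoder such as BYOL-A, a hedged illustration with scikit-learn's IsolationForest follows; the embedding dimension, contamination rate, and frame hop are made-up stand-ins, not the paper's settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in for frame-level encoder embeddings of one recording,
# shape (n_frames, emb_dim); real features would come from BYOL-A.
rng = np.random.default_rng(0)
speech = rng.normal(0.0, 1.0, size=(900, 128))    # "typical" speech frames
laughter = rng.normal(3.0, 1.5, size=(100, 128))  # acoustically atypical frames
frames = np.vstack([speech, laughter])

# Unsupervised fit on all frames: laughter is assumed rare enough to
# surface as the anomalous fraction (contamination is a tunable guess).
clf = IsolationForest(n_estimators=200, contamination=0.1, random_state=0)
labels = clf.fit_predict(frames)                  # -1 = anomaly (candidate laughter)

# Merge frame-level flags into contiguous time segments (hop assumed 20 ms).
hop_s, is_laugh = 0.02, labels == -1
segments, start = [], None
for i, flag in enumerate(is_laugh):
    if flag and start is None:
        start = i
    elif not flag and start is not None:
        segments.append((start * hop_s, i * hop_s))
        start = None
if start is not None:
    segments.append((start * hop_s, len(is_laugh) * hop_s))
print(f"{int(is_laugh.sum())} anomalous frames -> {len(segments)} candidate segments")
```

Framing detection as anomaly scoring sidesteps the need for labeled laughter in each language, which is what lets the method stay unsupervised and multilingual.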

How Descript engineers multilingual video dubbing at scale

OpenAI Blog

Descript redesigned its translation pipeline using OpenAI reasoning models to optimize multilingual video dubbing at scale, achieving a 15% increase in translated video exports and a 13-43% improvement in duration adherence across languages by addressing the challenge of matching speech duration to video timing constraints.