MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

April 19, 2026
作者: Szu-Chi Chen, I-Ning Tsai, Yi-Cheng Lin, Sung-Feng Huang, Hung-yi Lee
cs.AI

Abstract

Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs) such as laughter and crying that convey pragmatic intent, which severely limits real-world utility. We address this with three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets to overcome data scarcity. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router that blends experts to capture hybrid expressive states. Third, we show that pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data suffices for strong performance. On English-Chinese S2ST, MoVE reproduces target NVs in 76% of cases and achieves the highest human-rated naturalness and emotional fidelity among all compared systems, whereas existing S2ST systems preserve at most 14% of NVs.
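The core architectural idea, a frozen base layer augmented by several LoRA adapters whose outputs are blended by a soft-weighting router rather than hard-selected, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the class and parameter names (`MoLoRALayer`, `num_experts`, `rank`) are hypothetical, the router here gates per example on the input features, and the shapes are toy-sized.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MoLoRALayer:
    """Frozen base projection plus K low-rank (LoRA) experts,
    blended per example by a soft-weighting router."""

    def __init__(self, d_in, d_out, num_experts=4, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_in, d_out)) * 0.02       # frozen base weight
        self.A = rng.standard_normal((num_experts, d_in, rank)) * 0.02  # LoRA down-projections
        self.B = np.zeros((num_experts, rank, d_out))            # LoRA up-projections, zero-init
        self.router = rng.standard_normal((d_in, num_experts)) * 0.02   # router weights

    def __call__(self, x):
        # x: (batch, d_in). Soft routing weights over experts, one distribution per example.
        gates = softmax(x @ self.router)                          # (batch, K)
        base = x @ self.W                                         # (batch, d_out)
        # Each expert's low-rank delta: (K, batch, d_out)
        deltas = np.einsum('bi,kir,kro->kbo', x, self.A, self.B)
        # Blend expert deltas with the soft gates instead of picking a single expert.
        mix = np.einsum('bk,kbo->bo', gates, deltas)              # (batch, d_out)
        return base + mix
```

Because the up-projections `B` are zero-initialized (standard LoRA practice), the layer initially reproduces the frozen base output exactly; training moves each expert's delta toward one expressive mode, and the soft gates let mixed states (e.g. tearful laughter) draw on several experts at once.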