MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

Abstract

Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy, yet they consistently strip away non-verbal vocalizations (NVs) such as laughter and crying, which convey pragmatic intent. This severely limits their real-world utility. We address the problem with three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets, which overcomes the data scarcity limitation. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expression-specialized adapters and a soft-weighting router that blends experts to capture hybrid expressive states. Third, we show that pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English-Chinese S2ST, compared against strong baselines, MoVE reproduces target NVs in 76% of cases and achieves the highest human-rated naturalness and emotional fidelity among all compared systems, whereas existing S2ST systems preserve at most 14% of NVs.
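
To make the idea of a soft-weighting router over LoRA experts concrete, the following is a minimal sketch under stated assumptions, not the architecture used in this work. The abstract does not specify how the router or adapters are built, so everything here is hypothetical: the class names `LoRAExpert` and `SoftMoLoRALayer`, the expert labels, the dimensions, and the choice of a linear router followed by a softmax. It only illustrates how a frozen base projection can be combined with a gate-weighted sum of low-rank updates.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LoRAExpert:
    """Low-rank update delta(x) = scale * B (A x). Hypothetical helper class."""

    def __init__(self, d_in, d_out, rank, scale=1.0, rng=None):
        rng = rng or random.Random(0)
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero (a common LoRA initialisation), so an untrained
        # expert leaves the base output unchanged.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = scale

    def delta(self, x):
        return [self.scale * v for v in matvec(self.B, matvec(self.A, x))]


class SoftMoLoRALayer:
    """Frozen base projection plus a softmax-weighted blend of LoRA experts."""

    def __init__(self, base_weight, experts, router_weight):
        self.base_weight = base_weight      # frozen pretrained weights
        self.experts = experts              # one adapter per expressive state
        self.router_weight = router_weight  # assumed linear router, one row per expert

    def forward(self, x):
        # Soft weighting: every expert contributes, scaled by its gate,
        # rather than selecting a single top-scoring expert.
        gates = softmax(matvec(self.router_weight, x))
        y = matvec(self.base_weight, x)
        for gate, expert in zip(gates, self.experts):
            y = [yi + gate * di for yi, di in zip(y, expert.delta(x))]
        return y, gates


rng = random.Random(0)
d_in, d_out, rank = 4, 3, 2
expert_names = ["laughter", "crying", "neutral"]  # hypothetical labels
base_weight = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
experts = [LoRAExpert(d_in, d_out, rank, rng=rng) for _ in expert_names]
router_weight = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in expert_names]

layer = SoftMoLoRALayer(base_weight, experts, router_weight)
x = [0.5, -1.0, 0.25, 2.0]
y, gates = layer.forward(x)
for name, gate in zip(expert_names, gates):
    print(f"{name}: gate={gate:.3f}")
```

Because the gates come from a softmax, they are positive and sum to one, so an input that sits between two expressive states receives a blend of the corresponding adapters instead of a hard choice. A top-k router would be an alternative design; the sketch uses full soft blending only because the abstract describes the router as soft-weighting.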
