MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

Abstract

Recent speech-to-speech translation (S2ST) systems achieve strong semantic accuracy, yet they consistently strip away non-verbal vocalizations (NVs) such as laughter and crying, which convey pragmatic intent. This severely limits their real-world utility. We address the problem with three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets, which overcomes data scarcity. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expression-specialized adapters and a soft-weighting router that blends experts to capture hybrid expressive states. Third, we show that pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English-Chinese S2ST, in comparison with strong baselines, MoVE reproduces target NVs in 76% of cases and achieves the highest human-rated naturalness and emotional fidelity among all compared systems, whereas existing S2ST systems preserve at most 14% of NVs.
