MulliVC: Multi-lingual Voice Conversion With Cycle Consistency

Abstract

Voice conversion aims to modify the source speaker's voice to resemble the target speaker's while preserving the original speech content. Despite notable recent advances in voice conversion, multi-lingual voice conversion (covering both monolingual and cross-lingual scenarios) has yet to be studied extensively. It faces two main challenges: 1) the considerable variability in prosody and articulation habits across languages; and 2) the scarcity of paired multi-lingual datasets from the same speaker. In this paper, we propose MulliVC, a novel voice conversion system that converts only timbre and keeps the original content and source-language prosody, without requiring multi-lingual paired data. Specifically, each training step of MulliVC contains three substeps. In the first substep, the model is trained with monolingual speech data. The second and third substeps take inspiration from back translation and construct a cyclical process that disentangles timbre from other information (content, prosody, and other language-related information) in the absence of multi-lingual data from the same speaker. Both objective and subjective results indicate that MulliVC significantly surpasses other methods in both monolingual and cross-lingual contexts, demonstrating the system's efficacy and the viability of the three-step approach with cycle consistency. Audio samples can be found on our demo page (mullivc.github.io).
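
The three-substep structure can be illustrated with a toy sketch. It is not the MulliVC implementation: the abstract does not specify the model, features, or losses, so everything below (the `Utterance` and `ToyConverter` classes, the `leak` parameter, the L1 distance, and the function `training_step_losses`) is a hypothetical stand-in. The sketch only shows the control flow: a monolingual reconstruction, a cross-speaker conversion that has no ground truth, and a conversion back to the original speaker that supplies a cycle-consistency signal.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass(frozen=True)
class Utterance:
    """Toy utterance: content and timbre are stored as separate vectors."""
    content: Vector
    timbre: Vector


@dataclass
class ToyConverter:
    """Hypothetical stand-in for a conversion model (an assumption, not MulliVC).

    `leak` is the fraction of source timbre that survives a conversion;
    0.0 means the timbre is fully replaced by the reference timbre.
    """
    leak: float = 0.0

    def convert(self, source: Utterance, reference: Utterance) -> Utterance:
        mixed = [
            (1.0 - self.leak) * r + self.leak * s
            for s, r in zip(source.timbre, reference.timbre)
        ]
        # In this toy, content passes through unchanged.
        return Utterance(content=list(source.content), timbre=mixed)


def l1(a: Utterance, b: Utterance) -> float:
    """Mean absolute difference over content and timbre values."""
    xs = a.content + a.timbre
    ys = b.content + b.timbre
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def training_step_losses(
    model: ToyConverter,
    utt_a: Utterance,        # speaker A, language 1
    other_utt_a: Utterance,  # another utterance by speaker A (timbre reference)
    utt_b: Utterance,        # speaker B, language 2 (timbre reference)
) -> Tuple[float, float]:
    # Substep 1: monolingual reconstruction with a same-speaker reference.
    recon = model.convert(utt_a, other_utt_a)
    loss_mono = l1(recon, utt_a)

    # Substep 2: convert A's utterance to B's timbre. No paired target exists.
    pseudo = model.convert(utt_a, utt_b)

    # Substep 3: convert back to A's timbre and compare with the original.
    cycled = model.convert(pseudo, other_utt_a)
    loss_cycle = l1(cycled, utt_a)
    return loss_mono, loss_cycle


utt_a = Utterance(content=[0.2, 0.4, 0.6], timbre=[1.0, 0.0])
other_utt_a = Utterance(content=[0.9, 0.1, 0.5], timbre=[1.0, 0.0])
utt_b = Utterance(content=[0.3, 0.3, 0.3], timbre=[0.0, 1.0])

clean_losses = training_step_losses(ToyConverter(leak=0.0), utt_a, other_utt_a, utt_b)
leaky_losses = training_step_losses(ToyConverter(leak=0.5), utt_a, other_utt_a, utt_b)
```

In this toy setting the monolingual loss is zero for both converters, because the reference shares the source speaker's timbre, while the cycle loss is nonzero only for the leaky converter. That is the sense in which a cyclical process can provide a training signal for cross-speaker conversion when no paired multi-lingual data exists.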
