CoMoSVC: Consistency Model-based Singing Voice Conversion

Abstract

Diffusion-based Singing Voice Conversion (SVC) methods have achieved remarkable performance, producing natural audio with high similarity to the target timbre. However, their iterative sampling process makes inference slow, so acceleration becomes crucial. In this paper, we propose CoMoSVC, a consistency model-based SVC method that aims to achieve both high-quality generation and high-speed sampling. A diffusion-based teacher model is first designed specifically for SVC, and a student model is then distilled from it under self-consistency properties to achieve one-step sampling. Experiments on a single NVIDIA RTX 4090 GPU show that CoMoSVC has a significantly faster inference speed than the state-of-the-art (SOTA) diffusion-based SVC system while still achieving comparable or superior conversion performance on both subjective and objective metrics. Audio samples and code are available at https://comosvc.github.io/.
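As a rough illustration of what "self-consistency" and "one-step sampling" mean, the sketch below follows the generic consistency-model parameterization from the wider literature, not the CoMoSVC code itself. The constants `SIGMA_DATA`, `EPS`, and `T_MAX`, the stand-in `toy_network`, and all function names are assumptions made for this example. A real SVC system would also condition the network on content, pitch, and speaker features, which is omitted here.

```python
import math
import random

# Illustrative constants (assumed, not taken from the paper).
SIGMA_DATA = 0.5   # assumed standard deviation of the data
EPS = 0.002        # assumed smallest noise level
T_MAX = 80.0       # assumed largest noise level


def c_skip(t):
    """Weight on the noisy input; equals 1 at t == EPS."""
    return SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)


def c_out(t):
    """Weight on the network output; equals 0 at t == EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)


def toy_network(x, t):
    """Hypothetical stand-in for a trained student network."""
    return [math.tanh(v) for v in x]


def consistency_fn(x, t, network=toy_network):
    """Map a noisy sample at noise level t to a clean-sample estimate.

    The skip/out weights enforce the boundary condition
    consistency_fn(x, EPS) == x regardless of the network.
    """
    raw = network(x, t)
    return [c_skip(t) * xi + c_out(t) * ri for xi, ri in zip(x, raw)]


def sample_one_step(dim, rng, network=toy_network):
    """Draw pure noise at T_MAX and denoise it with one network call."""
    x_T = [rng.gauss(0.0, T_MAX) for _ in range(dim)]
    return consistency_fn(x_T, T_MAX, network)
```

An iterative diffusion sampler would call the network once per denoising step, whereas `sample_one_step` calls it exactly once. That difference is the source of the speed-up the abstract describes.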
