CoMoSVC: Consistency Model-based Singing Voice Conversion
January 3, 2024
Authors: Yiwen Lu, Zhen Ye, Wei Xue, Xu Tan, Qifeng Liu, Yike Guo
cs.AI
Abstract
Diffusion-based Singing Voice Conversion (SVC) methods have achieved
remarkable performance, producing natural audio with high similarity to the
target timbre. However, their iterative sampling process makes inference
slow, so acceleration is crucial. In this paper, we propose CoMoSVC, a
consistency model-based SVC method that aims to achieve both high-quality
generation and high-speed sampling. A diffusion-based teacher model is first
designed specifically for SVC, and a student model is then distilled from it
under self-consistency properties to achieve one-step sampling. Experiments
on a single NVIDIA RTX 4090 GPU show that CoMoSVC is significantly faster at
inference than the state-of-the-art (SOTA) diffusion-based SVC system while
achieving comparable or superior conversion performance on both subjective
and objective metrics. Audio samples and code are available at
https://comosvc.github.io/.
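
The contrast between iterative teacher sampling and one-step student sampling can be illustrated with a toy sketch. This is not the CoMoSVC implementation: it assumes a one-dimensional Gaussian "data" distribution so that the teacher's score and the consistency function are both available in closed form, and every name and constant (`MU`, `S`, `T_MAX`, `EPS`, `teacher_sample`, `student_consistency`) is hypothetical. In a real system the score and the consistency function would be neural networks conditioned on content, pitch, and speaker features.

```python
import numpy as np

# Toy setup (all values are assumptions for illustration only).
MU, S = 1.0, 0.5          # "data" distribution N(MU, S^2)
T_MAX, EPS = 80.0, 0.002  # largest and smallest noise levels

def teacher_score(x, t):
    """Exact score of the noised marginal N(MU, S^2 + t^2).

    Stands in for a trained diffusion teacher network.
    """
    return -(x - MU) / (S**2 + t**2)

def teacher_sample(x_T, n_steps=200):
    """Iterative sampling: Euler steps on the probability-flow ODE
    dx/dt = -t * score(x, t), integrated from T_MAX down to EPS."""
    # Log-spaced noise levels from T_MAX to EPS.
    ts = T_MAX * (EPS / T_MAX) ** (np.arange(n_steps + 1) / n_steps)
    x = np.array(x_T, dtype=float)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * (-t_cur * teacher_score(x, t_cur))
    return x

def student_consistency(x, t):
    """Closed-form consistency function for this toy ODE.

    It maps any point (x, t) on an ODE trajectory to that trajectory's
    value at t = EPS, and reduces to the identity when t == EPS.
    Stands in for a distilled student network.
    """
    return MU + (x - MU) * np.sqrt((S**2 + EPS**2) / (S**2 + t**2))

# Start both samplers from the same fixed "noise" at the highest noise level.
x_T = T_MAX * np.array([-1.0, 0.5, 2.0])
multi_step = teacher_sample(x_T)                # 200 teacher evaluations
one_step = student_consistency(x_T, T_MAX)      # a single evaluation
gap = float(np.max(np.abs(multi_step - one_step)))
print("largest gap between 200-step and 1-step samples:", gap)
```

The student reaches in one function evaluation approximately the same endpoint that the teacher reaches after many solver steps, which is the property that distillation under self-consistency is meant to instil in the student network.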