CoMoSVC: Consistency Model-based Singing Voice Conversion

Abstract

Diffusion-based singing voice conversion (SVC) methods have achieved remarkable performance, producing natural audio with high similarity to the target timbre. However, the iterative sampling process results in slow inference, so acceleration becomes crucial. In this paper, we propose CoMoSVC, a consistency model-based SVC method that aims to achieve both high-quality generation and high-speed sampling. A diffusion-based teacher model is first designed specifically for SVC, and a student model is then distilled from it under the self-consistency property to achieve one-step sampling. Experiments on a single NVIDIA RTX 4090 GPU show that CoMoSVC is significantly faster at inference than the state-of-the-art (SOTA) diffusion-based SVC system while achieving comparable or superior conversion performance on both subjective and objective metrics. Audio samples and code are available at https://comosvc.github.io/.
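To make the one-step sampling idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the parameterization commonly used for consistency models, f(x, t) = c_skip(t) * x + c_out(t) * F(x, t), which forces f(x, eps) = x at the smallest noise level. The constants SIGMA_DATA, EPS, and T_MAX, and the function toy_network standing in for the distilled student network, are all hypothetical choices made for illustration.

```python
import math
import random

# Assumed constants (illustrative defaults, not taken from the paper).
SIGMA_DATA = 0.5   # assumed standard deviation of the data
EPS = 0.002        # assumed smallest noise level
T_MAX = 80.0       # assumed largest noise level


def c_skip(t):
    """Weight on the noisy input; equals 1 at t == EPS."""
    return SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)


def c_out(t):
    """Weight on the network output; equals 0 at t == EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)


def toy_network(x, t, cond):
    """Hypothetical stand-in for the trained student network.

    A real model would be a neural network conditioned on content,
    pitch, and target-singer features; here it simply echoes cond.
    """
    return list(cond)


def consistency_fn(x, t, cond):
    """Map a noisy sample at noise level t directly to a clean estimate."""
    net = toy_network(x, t, cond)
    return [c_skip(t) * xi + c_out(t) * ni for xi, ni in zip(x, net)]


def one_step_sample(cond, rng):
    """Draw noise at the largest level and denoise it in a single call."""
    x_T = [T_MAX * rng.gauss(0.0, 1.0) for _ in cond]
    return consistency_fn(x_T, T_MAX, cond)
```

A call such as one_step_sample(cond, random.Random(0)) evaluates the model exactly once, which is where the speedup over iterative diffusion sampling comes from. The boundary condition can be checked by confirming that consistency_fn(x, EPS, cond) returns x unchanged.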
