CoMoSVC: Consistency Model-based Singing Voice Conversion

Abstract

Diffusion-based Singing Voice Conversion (SVC) methods have achieved remarkable performance, producing natural audio with high similarity to the target timbre. However, their iterative sampling process results in slow inference, so acceleration becomes crucial. In this paper, we propose CoMoSVC, a consistency model-based SVC method that aims to achieve both high-quality generation and high-speed sampling. A diffusion-based teacher model is first designed specifically for SVC, and a student model is then distilled from it under self-consistency properties to achieve one-step sampling. Experiments on a single NVIDIA RTX 4090 GPU show that CoMoSVC is significantly faster at inference than the state-of-the-art (SOTA) diffusion-based SVC system while still achieving comparable or superior conversion performance on both subjective and objective metrics. Audio samples and code are available at https://comosvc.github.io/.
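
To make the contrast between iterative teacher sampling and one-step student sampling concrete, the following is a minimal toy sketch and not the CoMoSVC implementation. It assumes the data distribution is a single point `MU`, so that the ideal denoiser and the consistency function both have closed forms; the noise range `SIGMA_MIN`/`SIGMA_MAX` and all function names are illustrative assumptions, and no neural network or audio is involved.

```python
import numpy as np

# Assumed noise range for illustration only (not taken from the paper).
SIGMA_MIN, SIGMA_MAX = 0.002, 80.0
# Toy "clean" target: the data distribution is a point mass at MU.
MU = np.array([1.0, -2.0, 0.5])

def teacher_denoiser(x, sigma):
    """Ideal denoiser for point-mass data: the posterior mean is always MU."""
    return np.broadcast_to(MU, x.shape)

def teacher_sample(x, n_steps=50):
    """Iterative sampling: Euler steps on the probability-flow ODE
    dx/dsigma = (x - D(x, sigma)) / sigma, from SIGMA_MAX down to SIGMA_MIN."""
    sigmas = np.geomspace(SIGMA_MAX, SIGMA_MIN, n_steps + 1)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        slope = (x - teacher_denoiser(x, s_cur)) / s_cur
        x = x + (s_next - s_cur) * slope
    return x

def student_consistency(x, sigma):
    """Closed-form consistency function for this toy ODE: maps any point on a
    trajectory directly to that trajectory's endpoint at SIGMA_MIN."""
    return MU + (SIGMA_MIN / sigma) * (x - MU)

rng = np.random.default_rng(0)
x_T = SIGMA_MAX * rng.standard_normal(MU.shape)      # start from pure noise

many_step = teacher_sample(x_T, n_steps=50)          # 50 denoiser evaluations
one_step = student_consistency(x_T, SIGMA_MAX)       # a single evaluation

# A point halfway along the same trajectory maps to the same endpoint.
sigma_mid = 1.0
x_mid = MU + (sigma_mid / SIGMA_MAX) * (x_T - MU)
same_endpoint = student_consistency(x_mid, sigma_mid)
```

In this toy setting the one-step output matches the 50-step output, and every point on a trajectory maps to the same endpoint, which is the self-consistency property the student is distilled to satisfy. In a real system the closed-form `student_consistency` would be replaced by a trained network conditioned on content, pitch, and speaker features.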
