CoMoSVC: Consistency Model-based Singing Voice Conversion

Abstract

Diffusion-based Singing Voice Conversion (SVC) methods have achieved remarkable performance, producing natural audio with high similarity to the target timbre. However, the iterative sampling process results in slow inference, so acceleration becomes crucial. In this paper, we propose CoMoSVC, a consistency model-based SVC method that aims to achieve both high-quality generation and high-speed sampling. A diffusion-based teacher model is first designed specifically for SVC, and a student model is then distilled from it under self-consistency properties to achieve one-step sampling. Experiments on a single NVIDIA RTX 4090 GPU show that CoMoSVC is significantly faster at inference than the state-of-the-art (SOTA) diffusion-based SVC system while still achieving comparable or superior conversion performance on both subjective and objective metrics. Audio samples and code are available at https://comosvc.github.io/.
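
For context, the self-consistency property mentioned above can be written as follows. This is a sketch in the standard notation of consistency models, not notation taken from this abstract: the symbols $f_\theta$, $x_t$, $\epsilon$, and $T$ are assumptions introduced here for illustration.

```latex
% Self-consistency: every point on one probability-flow ODE trajectory
% maps to the same output.
f_\theta(x_t, t) = f_\theta(x_{t'}, t') \quad \text{for all } t, t' \in [\epsilon, T]
% Boundary condition: at the smallest time step the function is the identity.
f_\theta(x_\epsilon, \epsilon) = x_\epsilon
```

Under these assumptions, one-step sampling amounts to drawing noise $x_T$ and evaluating $f_\theta(x_T, T)$ once, instead of running the iterative solver that the teacher diffusion model needs.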
