CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

May 11, 2023
Authors: Zhen Ye, Wei Xue, Xu Tan, Jie Chen, Qifeng Liu, Yike Guo
cs.AI

Abstract

Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts the inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a "Co"nsistency "Mo"del-based "Speech" synthesis method, CoMoSpeech, which achieves speech synthesis through a single diffusion sampling step while maintaining high audio quality. The consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that, by generating audio recordings with a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, comparable to FastSpeech2, making diffusion-sampling-based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher model yields the best audio quality, while the one-step-sampling-based CoMoSpeech achieves the best inference speed with audio quality better than or comparable to other conventional multi-step diffusion model baselines. Audio samples are available at https://comospeech.github.io/.
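
To make the distillation idea concrete, below is a minimal sketch of consistency distillation with one-step sampling, in the spirit the abstract describes: a student network is trained to map adjacent points on the teacher's probability-flow ODE trajectory to the same output, so that a single evaluation from pure noise yields a sample. This is not the paper's implementation; the Denoiser network, the VE-style noise schedule, the hyperparameters, and the random training data are all toy placeholders for illustration only.

```python
# Hypothetical sketch of consistency distillation (PyTorch).
# Everything here is a placeholder, not CoMoSpeech's actual code.
import copy
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Toy denoiser: predicts the clean sample x0 from (x_t, t)."""
    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x_t, t):
        # t is appended as an extra input feature (placeholder conditioning;
        # a real model would also enforce the boundary condition f(x, t_min) = x
        # via a skip-connection parameterization).
        return self.net(torch.cat([x_t, t.expand(x_t.size(0), 1)], dim=-1))

def teacher_ode_step(teacher, x_t, t, t_prev):
    """One Euler step of the probability-flow ODE from t down to t_prev,
    using the teacher's x0-prediction (VE-style parameterization)."""
    x0_hat = teacher(x_t, t)
    d = (x_t - x0_hat) / t
    return x_t + (t_prev - t) * d

dim = 80
teacher = Denoiser(dim).eval()       # stands in for a pretrained diffusion teacher
student = Denoiser(dim)
target = copy.deepcopy(student)      # EMA copy used as the distillation target
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
sigma_max, sigma_min, ema_decay = 1.0, 0.002, 0.95

for step in range(100):
    x0 = torch.randn(16, dim)        # placeholder for mel-spectrogram targets
    # Pick adjacent noise levels t_prev < t_next on the schedule.
    t_next = torch.rand(1) * (sigma_max - sigma_min) + sigma_min
    t_prev = torch.clamp(t_next * 0.8, min=sigma_min)
    x_next = x0 + t_next * torch.randn_like(x0)
    with torch.no_grad():
        x_prev = teacher_ode_step(teacher, x_next, t_next, t_prev)
        target_out = target(x_prev, t_prev)
    # Consistency constraint: student's output at t_next should match the
    # EMA target's output one teacher ODE step earlier on the trajectory.
    loss = ((student(x_next, t_next) - target_out) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():            # EMA update of the target network
        for p_t, p_s in zip(target.parameters(), student.parameters()):
            p_t.mul_(ema_decay).add_(p_s, alpha=1 - ema_decay)

# One-step synthesis: map pure noise at sigma_max directly to a sample,
# which is what gives the reported single-step inference speed.
with torch.no_grad():
    sample = student(sigma_max * torch.randn(1, dim), torch.tensor([sigma_max]))
```

The key design point the sketch illustrates is that the student never needs the multi-step sampling loop at inference time: the iterative teacher is only consulted during training to supply trajectory pairs.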