CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

Abstract

Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, they require a large number of iterative steps to reach high sample quality, which limits inference speed. Increasing sampling speed while maintaining sample quality remains a challenging task. In this paper, we propose a "Co"nsistency "Mo"del-based "Speech" synthesis method, CoMoSpeech, which achieves speech synthesis in a single diffusion sampling step while retaining high audio quality. A consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that, when generating audio with a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, which is comparable to FastSpeech2 and makes diffusion-sampling-based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and that the one-step CoMoSpeech achieves the best inference speed with audio quality better than or comparable to that of conventional multi-step diffusion model baselines. Audio samples are available at https://comospeech.github.io/.
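To make the "more than 150 times faster than real-time" figure concrete, the following minimal sketch shows how such a speedup is conventionally computed from the real-time factor (RTF), the ratio of synthesis time to the duration of the generated audio. The function names and the timing values are hypothetical illustrations, not measurements or code from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock time spent synthesizing / duration of the audio produced."""
    if synthesis_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("both durations must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(synthesis_seconds: float, audio_seconds: float) -> float:
    """Seconds of audio produced per second of compute (the inverse of RTF)."""
    return 1.0 / real_time_factor(synthesis_seconds, audio_seconds)


if __name__ == "__main__":
    # Hypothetical timing (an assumption for illustration only):
    # 10 s of audio synthesized in 0.05 s of wall-clock time.
    rtf = real_time_factor(0.05, 10.0)
    speedup = speedup_over_real_time(0.05, 10.0)
    print(f"RTF = {rtf:.4f}, speedup = {speedup:.0f}x real-time")
```

Under this convention, a system that is more than 150 times faster than real-time has an RTF below 1/150, that is, roughly 0.0067.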
