CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

Abstract

Denoising diffusion probabilistic models (DDPMs) have shown promising performance in speech synthesis. However, they need a large number of iterative steps to reach high sample quality, which limits inference speed. Maintaining sample quality while increasing sampling speed has therefore become a challenging task. In this paper, we propose CoMoSpeech, a "Co"nsistency "Mo"del-based "Speech" synthesis method that synthesizes high-quality speech in a single diffusion sampling step. A consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that, by generating audio with a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real time on a single NVIDIA A100 GPU. This is comparable to FastSpeech2 and makes diffusion-sampling-based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and that the one-step CoMoSpeech achieves the best inference speed with audio quality better than or comparable to conventional multi-step diffusion model baselines. Audio samples are available at https://comospeech.github.io/.
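
To make the "more than 150 times faster than real time" figure concrete, the following minimal sketch computes a real-time factor (synthesis time divided by audio duration) and the corresponding speedup. The function names and the numbers in the usage lines are hypothetical and chosen for illustration only; they are not measurements from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by the duration of the generated audio.

    Values below 1.0 mean synthesis is faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(synthesis_seconds: float, audio_seconds: float) -> float:
    """Inverse of the real-time factor: how many times faster than real time."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


def max_synthesis_seconds(audio_seconds: float, speedup: float) -> float:
    """Largest synthesis time that still reaches the given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Hypothetical example: a 10-second utterance synthesized in 0.05 seconds.
rtf = real_time_factor(0.05, 10.0)
speedup = speedup_over_real_time(0.05, 10.0)

# Time budget for 10 seconds of audio at a 150x speedup.
budget = max_synthesis_seconds(10.0, 150.0)
```

Under this definition, a speedup above 150 corresponds to a real-time factor below 1/150, that is, under roughly 0.067 seconds of compute for every 10 seconds of generated audio.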
