FlashSpeech: Efficient Zero-Shot Speech Synthesis

Abstract

Large-scale zero-shot speech synthesis has recently been advanced significantly by language models and diffusion models. However, generation with both kinds of model is slow and computationally intensive. Efficient speech synthesis that matches the quality of previous work on a lower computing budget remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system that requires approximately 5% of the inference time of previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch, without a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. FlashSpeech generates speech efficiently in one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable voice quality and similarity. FlashSpeech also demonstrates its versatility by efficiently performing tasks such as voice conversion, speech editing, and diverse speech sampling. Audio samples can be found at https://flashspeech.github.io/.
