FlashSpeech: Efficient Zero-Shot Speech Synthesis
April 23, 2024
Authors: Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Qifeng Liu, Yike Guo, Wei Xue
cs.AI
Abstract
Language models and diffusion models have recently driven significant progress
in large-scale zero-shot speech synthesis. However, the
generation process of both methods is slow and computationally intensive.
Achieving quality on par with previous work at a lower computing budget
remains a significant challenge. In this paper, we
present FlashSpeech, a large-scale zero-shot speech synthesis system that needs
approximately 5% of the inference time of previous work.
FlashSpeech is built on the latent consistency model and applies a novel
adversarial consistency training approach that can train from scratch without
the need for a pre-trained diffusion model as the teacher. Furthermore, a new
prosody generator module enhances the diversity of prosody, making the rhythm
of the speech sound more natural. FlashSpeech generates speech efficiently
in one or two sampling steps while maintaining high
audio quality and high similarity to the audio prompt in zero-shot speech
generation. Our experimental results demonstrate the superior performance of
FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other
zero-shot speech synthesis systems while maintaining comparable performance in
terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates
its versatility by efficiently performing tasks like voice conversion, speech
editing, and diverse speech sampling. Audio samples can be found at
https://flashspeech.github.io/.