FlashSpeech: Efficient Zero-Shot Speech Synthesis
April 23, 2024
Authors: Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He, Qifeng Liu, Yike Guo, Wei Xue
cs.AI
Abstract
Recent progress in large-scale zero-shot speech synthesis has been
driven largely by language models and diffusion models. However, the
generation process of both methods is slow and computationally intensive.
Achieving speech quality on par with previous work under a lower computing
budget remains a significant challenge. In this paper, we
present FlashSpeech, a large-scale zero-shot speech synthesis system with
approximately 5% of the inference time of previous work.
FlashSpeech is built on the latent consistency model and applies a novel
adversarial consistency training approach that can train from scratch without
the need for a pre-trained diffusion model as the teacher. Furthermore, a new
prosody generator module enhances the diversity of prosody, making the rhythm
of the speech sound more natural. FlashSpeech can generate speech
efficiently in one or two sampling steps while maintaining high
audio quality and high similarity to the audio prompt for zero-shot speech
generation. Our experimental results demonstrate the superior performance of
FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other
zero-shot speech synthesis systems while maintaining comparable performance in
terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates
its versatility by efficiently performing tasks like voice conversion, speech
editing, and diverse speech sampling. Audio samples can be found at
https://flashspeech.github.io/.
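
To make the "one or two sampling steps" claim concrete, here is a minimal sketch of generic multistep consistency sampling on a toy 1-D problem. It is not FlashSpeech's implementation: the trained latent consistency network is replaced by a closed-form consistency function that is exact only for Gaussian data, and every name and constant (`MU`, `SIGMA_DATA`, `EPS`, `T_MAX`, `consistency_fn`, `sample`) is a hypothetical choice made for illustration.

```python
import math
import random

# All constants below are illustrative assumptions, not values from the paper.
MU = 1.5          # mean of the toy 1-D "data" distribution
SIGMA_DATA = 0.5  # standard deviation of the toy data distribution
EPS = 0.002       # smallest noise level
T_MAX = 80.0      # largest noise level


def consistency_fn(x, t):
    """Closed-form consistency function for Gaussian data with x_t = x_0 + t * z.

    Stands in for a trained network: it maps a noisy point at noise level t
    to the start (t = EPS) of its probability-flow ODE trajectory.
    """
    scale = math.sqrt(SIGMA_DATA**2 + EPS**2) / math.sqrt(SIGMA_DATA**2 + t**2)
    return MU + (x - MU) * scale


def sample(rng, intermediate_levels=()):
    """Draw one sample: a single step, plus one extra step per intermediate level."""
    # Step 1: map pure noise at the largest noise level directly to a sample.
    x = consistency_fn(rng.gauss(0.0, T_MAX), T_MAX)
    # Optional refinement: re-noise to a smaller level, then map back again.
    for tau in intermediate_levels:
        noisy = x + math.sqrt(tau**2 - EPS**2) * rng.gauss(0.0, 1.0)
        x = consistency_fn(noisy, tau)
    return x
```

Calling `sample(random.Random(0))` uses one function evaluation, and `sample(random.Random(0), (1.0,))` uses two. In this toy setting both draw from approximately the target Gaussian, which is the property that lets a consistency model avoid the many iterative steps of a standard diffusion sampler.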