FlashSpeech: Efficient Zero-Shot Speech Synthesis

Abstract

Recent progress in large-scale zero-shot speech synthesis has been driven largely by language models and diffusion models. However, generation with both kinds of model is slow and computationally intensive. Achieving quality on par with previous work on a lower computing budget remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system that requires approximately 5% of the inference time of previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch, without a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. FlashSpeech generates speech in one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks such as voice conversion, speech editing, and diverse speech sampling. Audio samples can be found at https://flashspeech.github.io/.
