NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

March 5, 2024
作者: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao
cs.AI

Abstract

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Because speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate each subspace individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models that generates natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model that generates the attributes in each subspace following its corresponding prompt. With this factorized design, NaturalSpeech 3 can model intricate speech effectively and efficiently, handling the disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms state-of-the-art TTS systems in quality, similarity, prosody, and intelligibility. Furthermore, we achieve better performance by scaling the system to 1B parameters and 200K hours of training data.
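The abstract describes the factorized codec only at a high level. As a concrete illustration, here is a minimal PyTorch sketch of factorized vector quantization: a shared frame-level representation is projected into one branch per attribute, and each branch is quantized against its own codebook. All names and sizes here (FactorizedQuantizer, the attribute list, 256-dimensional features, 1024-entry codebooks) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient estimator."""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, time, dim); compare each frame to every codebook entry.
        dist = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (batch, time, K)
        indices = dist.argmin(dim=-1)                             # (batch, time)
        z_q = self.codebook(indices)                              # (batch, time, dim)
        # Straight-through estimator: gradients flow to z as if quantization were identity.
        return z + (z_q - z).detach()


class FactorizedQuantizer(nn.Module):
    """Hypothetical FVQ sketch: one projection and one codebook per speech attribute."""

    def __init__(self, dim: int = 256, codebook_size: int = 1024):
        super().__init__()
        attributes = ["content", "prosody", "timbre", "acoustic_details"]
        self.proj = nn.ModuleDict({a: nn.Linear(dim, dim) for a in attributes})
        self.vq = nn.ModuleDict(
            {a: VectorQuantizer(codebook_size, dim) for a in attributes}
        )

    def forward(self, h: torch.Tensor) -> dict[str, torch.Tensor]:
        # h: frame-level encoder output of shape (batch, time, dim).
        # Each attribute branch quantizes its own projection of the shared features,
        # yielding disentangled token streams for the downstream diffusion models.
        return {a: self.vq[a](self.proj[a](h)) for a in self.proj}


if __name__ == "__main__":
    features = torch.randn(2, 100, 256)  # dummy encoder output: 2 utterances, 100 frames
    tokens = FactorizedQuantizer()(features)
    print({name: t.shape for name, t in tokens.items()})  # four (2, 100, 256) streams
```

Under this sketch, the factorized diffusion model would then generate each of these token streams conditioned on a prompt for the corresponding attribute, which is what allows, for example, timbre to be controlled independently of content.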
