Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
June 6, 2023
Authors: Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao
cs.AI
Abstract
Scaling text-to-speech to large and wild datasets has proven highly effective
for generalizing timbre and speech style, particularly in zero-shot TTS.
However, previous works usually encode speech into latents with an audio codec
and generate them with autoregressive language models or diffusion models,
which ignores the intrinsic nature of speech and may lead to inferior or
uncontrollable results. We argue that speech can be decomposed into several
attributes (e.g., content, timbre, prosody, and phase), each of which should
be modeled by a module with the appropriate inductive bias. From this
perspective, we carefully design a novel, large-scale zero-shot TTS system
called Mega-TTS, which is trained on large-scale wild data and models
different attributes in different ways: 1) Instead of latents encoded by an
audio codec, we keep the spectrogram as the intermediate feature, since it
separates phase from the other attributes well; phase can be reconstructed by
a GAN-based vocoder and does not need to be modeled by the language model.
2) We model timbre with a global vector, since timbre is a global attribute
that changes slowly over time. 3) We further use a VQGAN-based acoustic model
to generate the spectrogram and a latent-code language model to fit the
distribution of prosody, since prosody changes quickly within a sentence and
language models can capture both local and long-range dependencies. We scale
Mega-TTS to multi-domain datasets with 20K hours of speech and evaluate its
performance on unseen speakers. Experimental results demonstrate that Mega-TTS
surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and
cross-lingual TTS tasks, with superior naturalness, robustness, and speaker
similarity owing to the appropriate inductive bias of each module.
bias of each module. Audio samples are available at
https://mega-tts.github.io/demo-page.
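
To make point 2 concrete, here is a minimal, illustrative sketch of modeling timbre as a single time-pooled global vector, assuming PyTorch; the abstract does not specify the encoder architecture, so the module name GlobalTimbreEncoder and all layer choices here are hypothetical:

```python
import torch
import torch.nn as nn

class GlobalTimbreEncoder(nn.Module):
    """Hypothetical timbre encoder: frame-level mel features -> one global vector."""
    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        self.frame_proj = nn.Sequential(
            nn.Linear(n_mels, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) reference spectrogram
        h = self.frame_proj(mel)  # (batch, frames, d_model)
        # Mean-pooling over time makes the output length-invariant, matching
        # the inductive bias that timbre is global and changes slowly over time.
        return h.mean(dim=1)      # (batch, d_model)

# Usage: condition the acoustic model on this vector for zero-shot voice cloning.
timbre = GlobalTimbreEncoder()(torch.randn(1, 500, 80))  # -> shape (1, 256)
```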
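Similarly, a minimal sketch of the prosody bottleneck in point 3, assuming a nearest-neighbor vector quantizer; ProsodyVQ is a made-up name, and the codebook size and dimensions are arbitrary. The discrete codes it produces are what the latent-code language model would be trained on:

```python
import torch
import torch.nn as nn

class ProsodyVQ(nn.Module):
    """Hypothetical nearest-neighbor vector quantizer over prosody frames."""
    def __init__(self, codebook_size: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, dim) continuous prosody features from an encoder
        cb = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        dist = torch.cdist(z, cb)         # (batch, frames, codebook_size)
        codes = dist.argmin(dim=-1)       # (batch, frames) discrete prosody tokens
        z_q = self.codebook(codes)        # (batch, frames, dim) quantized features
        # Straight-through estimator: copy gradients past the non-differentiable argmin.
        z_q = z + (z_q - z).detach()
        return z_q, codes

# Per the abstract, an autoregressive language model then fits the distribution
# of these code sequences (prosody changes quickly, and LMs capture both local
# and long-range dependencies); the VQGAN-based acoustic model generates the
# spectrogram, and a GAN-based vocoder reconstructs the phase/waveform.
vq = ProsodyVQ()
z_q, codes = vq(torch.randn(2, 100, 256))  # codes: (2, 100) integer tokens
```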