

Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

June 6, 2023
作者: Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao
cs.AI

Abstract

Scaling text-to-speech to large and wild datasets has proven highly effective for timbre and speech-style generalization, particularly in zero-shot TTS. However, previous works usually encode speech into latents with an audio codec and generate them with autoregressive language models or diffusion models, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results. We argue that speech can be decomposed into several attributes (e.g., content, timbre, prosody, and phase), each of which should be modeled by a module with appropriate inductive biases. From this perspective, we carefully design a novel large-scale zero-shot TTS system called Mega-TTS, which is trained on large-scale wild data and models different attributes in different ways: 1) Instead of latents encoded by an audio codec, we choose the spectrogram as the intermediate feature, since it separates phase from the other attributes very well. Phase can be appropriately reconstructed by a GAN-based vocoder and does not need to be modeled by the language model. 2) We model timbre with global vectors, since timbre is a global attribute that changes slowly over time. 3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent-code language model to fit the distribution of prosody, since prosody changes quickly over time within a sentence and language models can capture both local and long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K hours of speech and evaluate its performance on unseen speakers. Experimental results demonstrate that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior naturalness, robustness, and speaker similarity, owing to the proper inductive bias of each module. Audio samples are available at https://mega-tts.github.io/demo-page.
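The decomposition the abstract describes can be pictured as a pipeline of four modules, each with a different inductive bias: per-frame content features, a single global timbre vector, per-frame discrete prosody codes from a language model, and a vocoder that reconstructs phase. The toy sketch below only illustrates the data flow and tensor shapes under assumed dimensions; every module body (random projections) and every name is a placeholder, not the paper's actual networks.

```python
import numpy as np

# Toy sketch of the attribute decomposition described in the abstract.
# All module internals are placeholders (random projections), chosen only
# to make the shapes and data flow concrete. Assumed sizes, not the paper's.
N_FRAMES = 100   # spectrogram frames in the utterance
N_MELS = 80      # mel-spectrogram bins
D = 16           # toy hidden size
HOP = 256        # assumed vocoder hop size (samples per frame)

rng = np.random.default_rng(0)

def content_encoder(phonemes):
    # Content: one hidden vector per frame, derived from the text.
    return rng.standard_normal((N_FRAMES, D))

def timbre_encoder(reference_audio):
    # Timbre: a single global vector for the whole utterance, reflecting
    # that timbre is a global attribute that changes slowly over time.
    return rng.standard_normal((D,))

def prosody_language_model(content, timbre):
    # Prosody: one discrete code per frame; in the real system these are
    # sampled autoregressively by a latent-code LM. Here: random codes.
    return rng.integers(0, 512, size=N_FRAMES)

def vqgan_decoder(content, timbre, prosody_codes):
    # Acoustic model: fuse the attributes into a mel-spectrogram.
    fused = content + timbre            # broadcast the global timbre vector
    proj = rng.standard_normal((D, N_MELS))
    return fused @ proj                 # (N_FRAMES, N_MELS)

def gan_vocoder(mel):
    # Phase is reconstructed by the vocoder, not modeled by the LM;
    # here we just emit a fake waveform of the matching length.
    return rng.standard_normal(mel.shape[0] * HOP)

content = content_encoder("h e l o")          # hypothetical phoneme input
timbre = timbre_encoder("reference.wav")      # hypothetical reference clip
prosody = prosody_language_model(content, timbre)
mel = vqgan_decoder(content, timbre, prosody)
wav = gan_vocoder(mel)

print(mel.shape)   # (100, 80)
print(wav.shape)   # (25600,)
```

The point of the sketch is the division of labor: the language model only has to fit the compact prosody codes, while phase, the hardest attribute for an LM, never enters the modeling problem because the spectrogram discards it and the vocoder restores it.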