BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Abstract

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, and achieves a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating it against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
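
To make the phrase "compression with byte-pair encoding" concrete, the following is a minimal sketch of generic byte-pair encoding applied to a sequence of discrete codes. It is not the BASE TTS tokenizer: the function names `bpe_compress` and `bpe_expand`, the `first_new_id` parameter, and the toy integer code values are all assumptions made for illustration. The sketch only shows how repeatedly merging the most frequent adjacent pair shortens a code sequence while remaining losslessly reversible.

```python
from collections import Counter


def bpe_compress(codes, num_merges, first_new_id):
    """Greedy byte-pair encoding over a sequence of integer codes.

    Hypothetical helper: returns the shortened sequence and the learned
    merge table mapping (left, right) pairs to newly assigned ids.
    """
    seq = list(codes)
    merges = {}
    next_id = first_new_id
    for _ in range(num_merges):
        # Count every adjacent pair in the current sequence.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging would not compress
        merges[(left, right)] = next_id
        # Replace each non-overlapping occurrence of the pair, left to right.
        merged = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                merged.append(next_id)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        next_id += 1
    return seq, merges


def bpe_expand(seq, merges):
    """Invert bpe_compress by recursively expanding merged ids."""
    inverse = {new_id: pair for pair, new_id in merges.items()}
    out = []
    stack = list(reversed(seq))
    while stack:
        token = stack.pop()
        if token in inverse:
            left, right = inverse[token]
            # Push right first so that left is expanded first.
            stack.append(right)
            stack.append(left)
        else:
            out.append(token)
    return out


if __name__ == "__main__":
    # Toy code sequence; real speechcodes would come from a speech tokenizer.
    toy_codes = [3, 3, 5, 3, 3, 5, 9, 3, 3, 5]
    compressed, table = bpe_compress(toy_codes, num_merges=2, first_new_id=100)
    print(len(toy_codes), len(compressed))
    print(bpe_expand(compressed, table) == toy_codes)
```

A shorter code sequence means fewer autoregressive steps per utterance, which is one plausible motivation for applying this kind of compression before the Transformer; the abstract itself does not spell out the details.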
