BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

February 12, 2024
Authors: Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman
cs.AI

Abstract

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public-domain speech data, and it achieves a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating it against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
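
The abstract mentions compressing speechcodes with byte-pair encoding but gives no implementation detail. As a rough illustration of the general idea only, the sketch below applies textbook byte-pair merging to sequences of integer codes. It is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `expand`), the toy code values, and the choice of new token IDs starting at 100 are all assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Return the most common adjacent pair, or None if no pair repeats."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    pair, freq = counts.most_common(1)[0]
    return pair if freq > 1 else None


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seqs, num_merges, first_new_id):
    """Greedily learn up to `num_merges` merges; return (merges, compressed)."""
    merges = {}
    seqs = [list(s) for s in seqs]
    for step in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        new_id = first_new_id + step
        merges[pair] = new_id
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges, seqs


def expand(seq, merges):
    """Undo the merges, recovering the original code sequence."""
    inverse = {new_id: pair for pair, new_id in merges.items()}
    out, stack = [], list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in inverse:
            left, right = inverse[tok]
            stack.append(right)
            stack.append(left)
        else:
            out.append(tok)
    return out


# Toy "speechcode" sequences (hypothetical values, not real model output).
codes = [[1, 2, 1, 2, 3], [1, 2, 3, 3]]
merges, compressed = learn_bpe(codes, num_merges=2, first_new_id=100)
restored = [expand(s, merges) for s in compressed]
```

In this toy run the nine original codes shrink to four tokens, and `expand` recovers the input exactly. Shorter sequences are the usual motivation for this kind of compression in front of an autoregressive model, since fewer tokens have to be predicted per utterance.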
