BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

February 12, 2024
Authors: Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman
cs.AI

Abstract

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
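
The abstract mentions compressing speechcodes with byte-pair encoding. As a rough illustration only, and not the authors' implementation, the Python sketch below applies generic byte-pair encoding to sequences of integer codes. The function names, the toy code values, and the choice of 100 as the first ID for merged tokens are all assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Return the most common adjacent pair of codes, or None if there is none."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seqs, num_merges, first_new_id):
    """Learn up to `num_merges` merges; return the merge table and compressed sequences.

    `first_new_id` is a hypothetical parameter: it must be larger than any
    code already present so merged tokens do not collide with base codes.
    """
    seqs = [list(s) for s in seqs]
    merges = []
    for step in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        new_id = first_new_id + step
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
        merges.append((pair, new_id))
    return merges, seqs


# Toy stand-in for discrete speechcode sequences (values are made up).
toy_codes = [[1, 2, 1, 2, 3], [1, 2, 3, 3]]
merges, compressed = learn_bpe(toy_codes, num_merges=2, first_new_id=100)
print(merges)
print(compressed)
```

Each merge shortens the sequences that an autoregressive model would have to predict, at the cost of a larger vocabulary. How the paper selects its vocabulary size or which codes it merges is not stated in the abstract.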
