BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Abstract

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, and it achieves a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating it against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
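The abstract mentions compressing speechcodes with byte-pair encoding but does not describe the procedure. As a rough illustration only, the sketch below applies generic greedy BPE merging to a sequence of integer codes. The function names, the toy code sequence, and the choice of 256 as the first merged id are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most frequent adjacent pair in seq, or None if there is none."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of pair (left to right) with new_id."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(seq, num_merges, first_new_id):
    """Apply up to num_merges greedy merges; return (compressed_seq, merges)."""
    merges = {}  # new id -> the pair of ids it replaces
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_id = first_new_id + step
        merges[new_id] = pair
        seq = merge_pair(seq, pair, new_id)
    return seq, merges


def bpe_expand(seq, merges):
    """Invert bpe_compress by expanding merged ids, newest merge first."""
    for new_id in sorted(merges, reverse=True):
        a, b = merges[new_id]
        out = []
        for tok in seq:
            if tok == new_id:
                out.extend((a, b))
            else:
                out.append(tok)
        seq = out
    return seq


# Toy stand-in for a speechcode sequence (hypothetical values).
codes = [7, 7, 3, 7, 7, 3, 5, 7, 7, 3]
compressed, merges = bpe_compress(codes, num_merges=2, first_new_id=256)
print(len(codes), len(compressed), merges)
```

Merging frequent adjacent pairs shortens the sequence that an autoregressive model has to predict, and the merge table is enough to recover the original codes before decoding.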
