Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

Abstract

Scaling text-to-speech (TTS) to large, in-the-wild datasets has proven highly effective for generalizing timbre and speaking style, particularly in zero-shot TTS. However, previous work usually encodes speech into latents with an audio codec and generates them with autoregressive language models or diffusion models. This ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results. We argue that speech can be decomposed into several attributes (e.g., content, timbre, prosody, and phase) and that each should be modeled by a module with appropriate inductive biases. From this perspective, we carefully design Mega-TTS, a novel, large zero-shot TTS system that is trained on large-scale in-the-wild data and models different attributes in different ways: 1) Instead of using codec-encoded latents as the intermediate feature, we keep the spectrogram, because it separates phase from the other attributes very well. Phase can be reconstructed appropriately by a GAN-based vocoder and does not need to be modeled by the language model. 2) We model timbre with global vectors, since timbre is a global attribute that changes slowly over time. 3) We use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly within a sentence and language models can capture both local and long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K hours of speech and evaluate it on unseen speakers. Experimental results show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior naturalness, robustness, and speaker similarity owing to the proper inductive bias of each module. Audio samples are available at https://mega-tts.github.io/demo-page.
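
The contrast between a slowly varying global attribute (timbre) and a quickly varying local one (prosody) can be illustrated with a toy NumPy sketch. This is not the Mega-TTS implementation: mean pooling and nearest-neighbor lookup are simple stand-ins for the learned timbre encoder and the VQGAN codebook, the random array stands in for a real spectrogram, and the names `global_timbre_vector` and `quantize_prosody` are hypothetical. The language model that would be fitted over the resulting code sequence is omitted.

```python
import numpy as np


def global_timbre_vector(mel: np.ndarray) -> np.ndarray:
    """Collapse a (frames, bins) spectrogram into one time-invariant vector.

    Assumption: mean pooling over time stands in for a learned timbre encoder.
    """
    return mel.mean(axis=0)


def quantize_prosody(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map every frame to the index of its nearest codebook entry.

    Assumption: a fixed random codebook stands in for a learned VQ codebook.
    """
    # Squared Euclidean distance between each frame and each code: (frames, codes)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


rng = np.random.default_rng(0)
mel = rng.standard_normal((120, 80))      # toy spectrogram: 120 frames, 80 bins
codebook = rng.standard_normal((16, 80))  # toy codebook with 16 entries

timbre = global_timbre_vector(mel)        # one vector for the whole utterance
codes = quantize_prosody(mel, codebook)   # one discrete code per frame

print(timbre.shape, codes.shape)
```

The timbre representation has no time axis, while the prosody representation keeps one discrete token per frame. A token sequence of this kind is what a latent code language model could then be trained on.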
