MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Abstract

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling are less robust, especially on hard sentences in practical applications; 2) models based on predefined alignments are constrained by the limited naturalness of forced alignments. This paper introduces MegaTTS 3, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate generation. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.
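
To make the idea of multi-condition classifier-free guidance concrete, the following is a minimal sketch of one common nested formulation with two guidance scales. The abstract does not give the exact formula used in MegaTTS 3, so everything here is an assumption for illustration: the function name `multi_condition_cfg`, the choice of text and speech prompt as the two conditions, and the way the three predictions are combined are all hypothetical.

```python
from typing import List, Sequence


def multi_condition_cfg(
    uncond: Sequence[float],
    text_only: Sequence[float],
    text_and_speaker: Sequence[float],
    text_scale: float,
    speaker_scale: float,
) -> List[float]:
    """Combine three model predictions using two guidance scales.

    Assumed nested formulation (illustrative, not taken from the paper):
        out = uncond
              + text_scale    * (text_only - uncond)
              + speaker_scale * (text_and_speaker - text_only)
    """
    if not (len(uncond) == len(text_only) == len(text_and_speaker)):
        raise ValueError("all predictions must have the same length")
    return [
        u + text_scale * (t - u) + speaker_scale * (ts - t)
        for u, t, ts in zip(uncond, text_only, text_and_speaker)
    ]


# Toy predictions for a 3-dimensional latent (made-up numbers).
uncond = [0.0, 0.0, 0.0]
text_only = [1.0, 2.0, 3.0]
text_and_speaker = [1.5, 2.0, 2.0]

# With both scales at 1.0 the result equals the fully conditioned prediction.
baseline = multi_condition_cfg(uncond, text_only, text_and_speaker, 1.0, 1.0)

# A larger speaker_scale moves further along the (text+speaker - text) direction.
stronger = multi_condition_cfg(uncond, text_only, text_and_speaker, 1.0, 2.0)
```

In this sketch, varying `speaker_scale` independently of `text_scale` is what would give separate control over how strongly the second condition (assumed here to carry accent information) shapes the output.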