Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis

Abstract

In text-to-speech (TTS) synthesis, diffusion models have achieved promising generation quality. However, because their data-to-noise diffusion process is fixed in advance, their prior distribution is restricted to a noisy representation that carries little information about the generation target. In this work, we present a novel TTS system, Bridge-TTS, making the first attempt to substitute the noisy Gaussian prior in established diffusion-based TTS methods with a clean and deterministic one, which provides strong structural information about the target. Specifically, we use the latent representation obtained from the text input as our prior and build a fully tractable Schrödinger bridge between it and the ground-truth mel-spectrogram, yielding a data-to-data process. Moreover, the tractability and flexibility of our formulation allow us to empirically study design choices such as noise schedules, and to develop both stochastic and deterministic samplers. Experimental results on the LJ-Speech dataset demonstrate the effectiveness of our method in both synthesis quality and sampling efficiency: it significantly outperforms our diffusion counterpart Grad-TTS in 50-step/1000-step synthesis and strong fast TTS models in few-step scenarios. Project page: https://bridge-tts.github.io/
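To make the data-to-data idea concrete, the following is a minimal sketch of sampling an intermediate state on a plain Brownian bridge pinned at two data points. It is an illustration under stated assumptions, not the Bridge-TTS formulation: the constant noise level `sigma`, the function name `bridge_sample`, and the random arrays standing in for the mel-spectrogram and the text-derived latent are all hypothetical, and the abstract does not specify the noise schedule or sampler used.

```python
import numpy as np

def bridge_sample(x_data, x_prior, t, sigma=1.0, rng=None):
    """Sample x_t from a Brownian bridge pinned at x_data (t=0) and x_prior (t=1).

    Assumption: constant noise level sigma. The marginal is Gaussian with
    mean (1 - t) * x_data + t * x_prior and std sigma * sqrt(t * (1 - t)).
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = (1.0 - t) * x_data + t * x_prior
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(x_data.shape)

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 80 mel bins by 100 frames.
mel = rng.standard_normal((80, 100))
text_latent = mel + 0.5 * rng.standard_normal((80, 100))

for t in (0.0, 0.5, 1.0):
    x_t = bridge_sample(mel, text_latent, t, rng=rng)
    # Mean absolute distance from the target at each time.
    print(t, float(np.abs(x_t - mel).mean()))
```

Both endpoints are noise-free: at `t = 0` the sample equals the target and at `t = 1` it equals the prior, in contrast to a data-to-noise process whose terminal state is pure Gaussian noise.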
