E3 TTS: Easy End-to-End Diffusion-based Text to Speech

Abstract

We propose Easy End-to-End Diffusion-based Text to Speech (E3 TTS), a simple and efficient end-to-end text-to-speech model based on diffusion. E3 TTS takes plain text directly as input and generates an audio waveform through an iterative refinement process. Unlike much prior work, E3 TTS does not rely on intermediate representations such as spectrogram features or alignment information. Instead, it models the temporal structure of the waveform through the diffusion process. Without relying on additional conditioning information, E3 TTS can support flexible latent structure within the given audio. This allows E3 TTS to be adapted to zero-shot tasks such as editing without any additional training. Experiments show that E3 TTS can generate high-fidelity audio, approaching the performance of a state-of-the-art neural TTS system. Audio samples are available at https://e3tts.github.io.
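To make the phrase "iterative refinement process" concrete, the following is a minimal sketch of generic DDPM-style ancestral sampling applied to a 1-D waveform. It is not the E3 TTS implementation: the abstract does not specify the sampler, the noise schedule, or the network, so the linear schedule, the function names (`make_schedule`, `toy_noise_predictor`, `sample_waveform`), and the toy predictor that stands in for a trained text-conditioned model are all assumptions made for illustration.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.05):
    """Linear noise schedule (an assumption; the abstract names no schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative signal-retention factor
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, text_embedding, alpha_bar_t):
    """Hypothetical stand-in for a trained, text-conditioned denoiser.

    It pretends the clean waveform is a sinusoid whose frequency depends on
    the text embedding, and returns the noise implied by that guess.
    """
    n = x_t.shape[0]
    freq = 1.0 + float(np.abs(text_embedding).sum())
    x0_guess = np.sin(2.0 * np.pi * freq * np.arange(n) / n)
    return (x_t - np.sqrt(alpha_bar_t) * x0_guess) / np.sqrt(1.0 - alpha_bar_t)


def sample_waveform(text_embedding, num_samples=1600, num_steps=50, seed=0):
    """Start from Gaussian noise and refine it step by step into a waveform."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(num_samples)  # pure noise at the last timestep
    for t in reversed(range(num_steps)):
        eps = toy_noise_predictor(x, text_embedding, alpha_bars[t])
        # Mean of the DDPM reverse step given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Re-inject a small amount of noise on every step but the last.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(num_samples)
        else:
            x = mean
    return x


if __name__ == "__main__":
    waveform = sample_waveform(np.array([0.5, -1.5]))
    print(waveform.shape)
```

In this sketch the text only enters through the predictor's conditioning argument, and no spectrogram or alignment is computed anywhere in the loop, which mirrors the structure the abstract describes. A real system would replace `toy_noise_predictor` with a trained network; with the toy predictor, the loop simply converges to the predictor's own sinusoidal guess.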
