E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Abstract

This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the audio infilling task. Unlike many previous works, it does not require additional components (e.g., duration model, grapheme-to-phoneme) or complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that are comparable to or surpass previous works, including Voicebox and NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the input representation. We propose several variants of E2 TTS to improve usability during inference. See https://aka.ms/e2tts/ for demo samples.
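
The input conversion described above can be illustrated with a minimal sketch. The abstract says only that the text becomes a character sequence with filler tokens. The sketch assumes the fillers are appended so that the sequence length equals the number of mel spectrogram frames. The filler symbol `<F>` and the function name `to_filled_char_sequence` are hypothetical and are not taken from the authors' implementation.

```python
# Hypothetical filler symbol; the abstract does not name one.
FILLER = "<F>"


def to_filled_char_sequence(text: str, num_frames: int) -> list[str]:
    """Split text into characters and pad with filler tokens.

    Assumption: the padded length equals the number of mel frames,
    so no duration model or explicit alignment is needed.
    """
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError("text has more characters than mel frames")
    # Append fillers until the sequence is as long as the mel spectrogram.
    return chars + [FILLER] * (num_frames - len(chars))


seq = to_filled_char_sequence("hi there", 12)
print(seq)
```

Under this assumption the generator receives a text sequence and a mel spectrogram of equal length, which is one way to read the claim that no duration model or monotonic alignment search is required.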
