E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Abstract

This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. A flow-matching-based mel spectrogram generator is then trained on the audio infilling task. Unlike many previous works, it does not require additional components (e.g., duration model, grapheme-to-phoneme) or complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that are comparable to or surpass those of previous works, including Voicebox and NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the input representation. We propose several variants of E2 TTS to improve usability during inference. See https://aka.ms/e2tts/ for demo samples.
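
The "character sequence with filler tokens" step can be illustrated with a minimal sketch. The abstract does not spell out the details, so this example assumes that the character sequence is extended with a filler symbol until it matches the number of mel spectrogram frames; the symbol `<F>` and the function name `pad_with_filler` are hypothetical and chosen only for illustration.

```python
# Minimal sketch (assumption): pad a character sequence with filler tokens
# so that its length equals the number of target mel spectrogram frames.
# The filler symbol and function name are hypothetical, not from the paper.

FILLER = "<F>"  # hypothetical filler symbol


def pad_with_filler(text: str, num_frames: int, filler: str = FILLER) -> list[str]:
    """Split text into characters and append filler tokens up to num_frames."""
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError("text has more characters than target frames")
    return chars + [filler] * (num_frames - len(chars))


if __name__ == "__main__":
    tokens = pad_with_filler("hi there", 12)
    print(tokens)
```

Under this assumption, no duration model or explicit alignment is needed to build the input: the text side only has to be as long as the audio side, and the generator is left to work out which frames correspond to which characters.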
