E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
June 26, 2024
Authors: Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda
cs.AI
Abstract
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully
non-autoregressive zero-shot text-to-speech system that offers human-level
naturalness and state-of-the-art speaker similarity and intelligibility. In the
E2 TTS framework, the text input is converted into a character sequence with
filler tokens. A flow-matching-based mel spectrogram generator is then
trained on the audio infilling task. Unlike many previous works, it does
not require additional components (e.g., duration model, grapheme-to-phoneme)
or complex techniques (e.g., monotonic alignment search). Despite its
simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that
are comparable to or surpass previous works, including Voicebox and
NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the
input representation. We propose several variants of E2 TTS to improve
usability during inference. See https://aka.ms/e2tts/ for demo samples.
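
The input conversion described in the abstract (text turned into a character sequence with filler tokens) can be illustrated with a minimal sketch. The abstract does not say how many filler tokens are added, so this sketch assumes the sequence is padded to the number of mel spectrogram frames. The filler symbol `<F>` and the function name `pad_with_filler` are hypothetical and chosen only for illustration.

```python
# Minimal sketch, not the authors' implementation.
# Assumption: the character sequence is padded with a filler token until it
# is as long as the mel spectrogram (one entry per frame).
FILLER = "<F>"  # hypothetical filler symbol


def pad_with_filler(text: str, num_frames: int, filler: str = FILLER) -> list[str]:
    """Split text into characters and append filler tokens up to num_frames."""
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError("text has more characters than there are frames")
    return chars + [filler] * (num_frames - len(chars))


if __name__ == "__main__":
    # A 2-character text paired with a 5-frame spectrogram.
    print(pad_with_filler("hi", 5))
```

Under this assumption the text and audio sequences have equal length, which would be consistent with the abstract's statement that no duration model, grapheme-to-phoneme converter, or monotonic alignment search is needed.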