E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

June 26, 2024
Authors: Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda
cs.AI

Abstract

This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the audio infilling task. Unlike many previous works, it does not require additional components (e.g., duration model, grapheme-to-phoneme) or complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that are comparable to or surpass previous works, including Voicebox and NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the input representation. We propose several variants of E2 TTS to improve usability during inference. See https://aka.ms/e2tts/ for demo samples.
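
The input representation described in the abstract, a character sequence with filler tokens, can be illustrated with a minimal sketch. It assumes the filler tokens are appended until the sequence reaches a target length such as the number of mel spectrogram frames; the abstract does not spell out this detail. The `<F>` symbol and the `pad_with_filler` helper are hypothetical names chosen for illustration, not the authors' code.

```python
FILLER = "<F>"  # hypothetical filler symbol; the actual token is not given in the abstract


def pad_with_filler(text: str, num_frames: int, filler: str = FILLER) -> list[str]:
    """Split text into characters and append filler tokens up to num_frames.

    Assumption: the padded sequence is meant to match the length of the
    target mel spectrogram, so no duration model or explicit alignment is used.
    """
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError("text is longer than the target frame count")
    return chars + [filler] * (num_frames - len(chars))


# Example: a two-character text padded to five frames.
tokens = pad_with_filler("hi", 5)
print(tokens)
```

Under this assumption, the generator itself has to work out which output frames correspond to which characters, which would be consistent with the abstract's claim that no duration model or monotonic alignment search is needed.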
