E3 TTS: Easy End-to-End Diffusion-based Text to Speech
November 2, 2023
Authors: Yuan Gao, Nobuyuki Morioka, Yu Zhang, Nanxin Chen
cs.AI
Abstract
We propose Easy End-to-End Diffusion-based Text to Speech (E3 TTS), a simple
and efficient end-to-end text-to-speech model based on diffusion. E3 TTS directly
takes plain text as input and generates an audio waveform through an iterative
refinement process. Unlike much prior work, E3 TTS does not rely on any
intermediate representations like spectrogram features or alignment
information. Instead, E3 TTS models the temporal structure of the waveform
through the diffusion process. Without relying on additional conditioning
information, E3 TTS can support a flexible latent structure within the given
audio. This enables E3 TTS to be easily adapted for zero-shot tasks such as
editing without any additional training. Experiments show that E3 TTS can
generate high-fidelity audio, approaching the performance of a state-of-the-art
neural TTS system. Audio samples are available at https://e3tts.github.io.
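The iterative refinement the abstract describes is the standard reverse diffusion process: starting from Gaussian noise of the target waveform length and repeatedly denoising it, conditioned on the text. The following is a minimal DDPM-style sampling sketch of that idea; the `denoiser` signature, the `text_emb` conditioning input, and the linear noise schedule are illustrative assumptions, not the paper's published implementation.

```python
# Hypothetical sketch of text-conditioned diffusion sampling over a raw
# waveform. `denoiser(x, t, text_emb)` is an assumed noise-prediction
# network; the linear beta schedule is a common default, not the paper's.
import torch

def sample_waveform(text_emb, denoiser, num_steps=1000, wav_len=240000):
    """DDPM ancestral sampling: iteratively refine noise into a waveform."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure Gaussian noise at the target waveform length.
    x = torch.randn(1, wav_len)
    for t in reversed(range(num_steps)):
        # Predict the noise component given the text condition.
        eps = denoiser(x, torch.tensor([t]), text_emb)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        # Add noise on all but the final step.
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean
    return x
```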
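Because the model imposes no fixed alignment, the abstract notes it can be adapted to zero-shot editing without retraining. One common way such editing is realized with diffusion models is inpainting-style sampling, where the unedited region of a reference waveform is re-imposed at every denoising step. The sketch below illustrates that generic technique under the same assumed `denoiser` and schedule; it is not the paper's specific editing procedure.

```python
# Hypothetical sketch of zero-shot editing via masked diffusion sampling
# (an inpainting-style approach). `mask` is 1 where new audio should be
# generated and 0 where the reference waveform must be preserved.
import torch

def edit_waveform(ref_wav, mask, text_emb, denoiser, num_steps=1000):
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(ref_wav)
    for t in reversed(range(num_steps)):
        eps = denoiser(x, torch.tensor([t]), text_emb)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
        # Re-impose the known (unedited) region, noised to step t-1.
        if t > 0:
            known = (torch.sqrt(alpha_bars[t - 1]) * ref_wav
                     + torch.sqrt(1.0 - alpha_bars[t - 1])
                     * torch.randn_like(ref_wav))
        else:
            known = ref_wav
        x = mask * x + (1.0 - mask) * known
    return x
```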