E3 TTS: Easy End-to-End Diffusion-based Text to Speech
November 2, 2023
Authors: Yuan Gao, Nobuyuki Morioka, Yu Zhang, Nanxin Chen
cs.AI
Abstract
We propose Easy End-to-End Diffusion-based Text to Speech (E3 TTS), a simple
and efficient end-to-end text-to-speech model based on diffusion. E3 TTS directly
takes plain text as input and generates an audio waveform through an iterative
refinement process. Unlike much prior work, E3 TTS does not rely on any
intermediate representations like spectrogram features or alignment
information. Instead, E3 TTS models the temporal structure of the waveform
through the diffusion process. Without relying on additional conditioning
information, E3 TTS can support a flexible latent structure within the given
audio. This enables E3 TTS to be easily adapted for zero-shot tasks such as
editing without any additional training. Experiments show that E3 TTS can
generate high-fidelity audio, approaching the performance of a state-of-the-art
neural TTS system. Audio samples are available at https://e3tts.github.io.
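The iterative refinement the abstract describes is the standard reverse diffusion process: starting from Gaussian noise of the target waveform length and repeatedly denoising it, conditioned on the text. The following is a minimal DDPM-style sampling sketch of that idea; the `denoiser` signature, the `text_emb` conditioning input, and the linear noise schedule are illustrative assumptions, not the paper's published implementation.

```python
# Hypothetical sketch of text-conditioned diffusion sampling over a raw
# waveform. `denoiser(x, t, text_emb)` is an assumed noise-prediction
# network; the linear beta schedule is a common default, not the paper's.
import torch

def sample_waveform(text_emb, denoiser, num_steps=1000, wav_len=240000):
    """DDPM ancestral sampling: iteratively refine noise into a waveform."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from pure Gaussian noise at the target waveform length.
    x = torch.randn(1, wav_len)
    for t in reversed(range(num_steps)):
        # Predict the noise component given the text condition.
        eps = denoiser(x, torch.tensor([t]), text_emb)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        # Add noise on all but the final step.
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean
    return x
```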
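Because the model imposes no fixed alignment, the abstract notes it can be adapted to zero-shot editing without retraining. One common way such editing is realized with diffusion models is inpainting-style sampling, where the unedited region of a reference waveform is re-imposed at every denoising step. The sketch below illustrates that generic technique under the same assumed `denoiser` and schedule; it is not the paper's specific editing procedure.

```python
# Hypothetical sketch of zero-shot editing via masked diffusion sampling
# (an inpainting-style approach). `mask` is 1 where new audio should be
# generated and 0 where the reference waveform must be preserved.
import torch

def edit_waveform(ref_wav, mask, text_emb, denoiser, num_steps=1000):
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(ref_wav)
    for t in reversed(range(num_steps)):
        eps = denoiser(x, torch.tensor([t]), text_emb)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
        # Re-impose the known (unedited) region, noised to step t-1.
        if t > 0:
            known = (torch.sqrt(alpha_bars[t - 1]) * ref_wav
                     + torch.sqrt(1.0 - alpha_bars[t - 1])
                     * torch.randn_like(ref_wav))
        else:
            known = ref_wav
        x = mask * x + (1.0 - mask) * known
    return x
```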