E3 TTS: Easy End-to-End Diffusion-based Text to Speech

Abstract

We propose Easy End-to-End Diffusion-based Text to Speech (E3 TTS), a simple and efficient end-to-end text-to-speech model based on diffusion. E3 TTS takes plain text directly as input and generates an audio waveform through an iterative refinement process. Unlike much prior work, E3 TTS does not rely on intermediate representations such as spectrogram features or alignment information. Instead, it models the temporal structure of the waveform through the diffusion process. Without relying on additional conditioning information, E3 TTS can support flexible latent structure within the given audio. This allows E3 TTS to be adapted easily to zero-shot tasks such as editing, without any additional training. Experiments show that E3 TTS can generate high-fidelity audio, approaching the performance of a state-of-the-art neural TTS system. Audio samples are available at https://e3tts.github.io.
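To make the idea of generating a waveform by iterative refinement concrete, the following is a minimal sketch of a generic DDPM-style sampling loop in NumPy. It is not the E3 TTS implementation: the linear noise schedule, the step count, and every name here (`make_schedule`, `toy_noise_predictor`, `sample_waveform`, `text_embedding`) are assumptions introduced for illustration, and the noise predictor is a placeholder standing in for a trained, text-conditioned network.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, t, text_embedding):
    """Hypothetical stand-in for a trained network that predicts the noise in x_t.

    A real model would be a neural network conditioned on the text;
    this placeholder ignores the step and the text and returns a scaled input.
    """
    return 0.1 * x_t


def sample_waveform(text_embedding, num_samples, num_steps=50, seed=0,
                    predict_noise=toy_noise_predictor):
    """Refine Gaussian noise into a waveform-shaped array, one step at a time."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)

    # Start from pure Gaussian noise with the target waveform length.
    x = rng.standard_normal(num_samples)

    for t in reversed(range(num_steps)):
        eps_hat = predict_noise(x, t, text_embedding)
        # Mean of the DDPM reverse step given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Re-inject a smaller amount of noise at every step except the last.
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(num_samples)
        else:
            x = mean
    return x
```

With a trained predictor in place of the placeholder, a call such as `sample_waveform(text_embedding, num_samples=24000)` would return one array of raw audio samples; with the placeholder it only demonstrates the control flow of the refinement loop.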
