Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis

Abstract

In text-to-speech (TTS) synthesis, diffusion models have achieved promising generation quality. However, because their diffusion process is a pre-defined data-to-noise process, their prior distribution is restricted to a noisy representation, which provides little information about the generation target. In this work, we present a novel TTS system, Bridge-TTS, making the first attempt to substitute the noisy Gaussian prior in established diffusion-based TTS methods with a clean and deterministic one, which provides strong structural information about the target. Specifically, we use the latent representation obtained from the text input as our prior and build a fully tractable Schrodinger bridge between it and the ground-truth mel-spectrogram, leading to a data-to-data process. Moreover, the tractability and flexibility of our formulation allow us to empirically study design spaces such as noise schedules, and to develop both stochastic and deterministic samplers. Experimental results on the LJ-Speech dataset illustrate the effectiveness of our method in terms of both synthesis quality and sampling efficiency: it significantly outperforms our diffusion counterpart, Grad-TTS, in 50-step and 1000-step synthesis, and outperforms strong fast TTS models in few-step scenarios. Project page: https://bridge-tts.github.io/
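To make the idea of a data-to-data process more concrete, the following is a minimal sketch of the simplest tractable case we can assume: a Brownian bridge pinned at two fixed endpoints, with zero drift and a constant noise scale. It is not the Bridge-TTS implementation; the noise schedules, the trained network, and the samplers studied in the paper are not reproduced here. The function name `bridge_sample`, the parameter `beta`, and the toy vectors are all hypothetical and stand in for a text-derived latent (the prior) and a mel-spectrogram frame (the target).

```python
import math
import random


def bridge_sample(prior, target, t, beta=1.0, rng=random):
    """Draw x_t from a Brownian bridge pinned at `target` (t=0) and `prior` (t=1).

    Assumption: zero drift and constant noise scale `beta`, so the marginal is
    Gaussian with mean (1 - t) * target + t * prior and variance beta * t * (1 - t).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # The standard deviation vanishes at both endpoints, so both ends are clean.
    std = math.sqrt(beta * t * (1.0 - t))
    return [
        (1.0 - t) * x0 + t * x1 + std * rng.gauss(0.0, 1.0)
        for x0, x1 in zip(target, prior)
    ]


# Hypothetical toy vectors standing in for a text latent and a mel frame.
text_latent = [1.0, 2.0]
mel_frame = [3.0, 4.0]
rng = random.Random(0)
for t in (0.0, 0.5, 1.0):
    print(t, bridge_sample(text_latent, mel_frame, t, beta=0.1, rng=rng))
```

Unlike a data-to-noise diffusion, where the sample at the final time step is dominated by Gaussian noise, both ends of this process are noise-free: the sample equals the target at t = 0 and the prior at t = 1, and randomness only enters in between.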
