Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis

Abstract

In text-to-speech (TTS) synthesis, diffusion models have achieved promising generation quality. However, because of the pre-defined data-to-noise diffusion process, their prior distribution is restricted to a noisy representation that provides little information about the generation target. In this work, we present Bridge-TTS, a novel TTS system that makes the first attempt to replace the noisy Gaussian prior of established diffusion-based TTS methods with a clean and deterministic one, which provides strong structural information about the target. Specifically, we use the latent representation obtained from the text input as our prior and build a fully tractable Schrodinger bridge between it and the ground-truth mel-spectrogram, yielding a data-to-data process. Moreover, the tractability and flexibility of our formulation allow us to empirically study design spaces such as noise schedules and to develop both stochastic and deterministic samplers. Experimental results on the LJ-Speech dataset demonstrate the effectiveness of our method in both synthesis quality and sampling efficiency: it significantly outperforms our diffusion counterpart Grad-TTS in 50-step/1000-step synthesis and strong fast TTS models in few-step scenarios. Project page: https://bridge-tts.github.io/
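To make the idea of a tractable data-to-data process concrete, the following is a minimal sketch of the simplest such bridge, a Brownian bridge pinned at two data points. It is not the Bridge-TTS formulation itself: the constant noise level `beta`, the function name `bridge_sample`, the toy vectors, and the convention that t = 0 is the mel-spectrogram end while t = 1 is the text-derived prior are all assumptions made for illustration.

```python
import math
import random


def bridge_sample(x0, x1, t, beta=1.0, rng=None):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Illustrative assumption: a constant noise schedule g(t)^2 = beta, which
    gives mean (1 - t) * x0 + t * x1 and variance beta * t * (1 - t).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    rng = rng or random.Random()
    std = math.sqrt(beta * t * (1.0 - t))  # vanishes at both endpoints
    return [
        (1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
        for a, b in zip(x0, x1)
    ]


# Hypothetical toy vectors standing in for one mel-spectrogram frame (x0)
# and the matching text-derived latent (x1).
mel_frame = [0.2, -1.0, 0.5]
text_latent = [0.0, -0.8, 0.7]

rng = random.Random(0)
mid = bridge_sample(mel_frame, text_latent, 0.5, rng=rng)
```

Because the noise term vanishes at t = 0 and t = 1, both ends of the path are clean and deterministic, in contrast to a diffusion process, whose terminal state is pure Gaussian noise.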
