Autoregressive Speech Synthesis without Vector Quantization
July 11, 2024
Authors: Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei
cs.AI
Abstract
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply a regression loss together with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens; (ii) we incorporate variational inference into MELLE to enable a sampling mechanism, thereby enhancing output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language model VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See https://aka.ms/melle for demos of our work.
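The abstract's first contribution replaces cross-entropy over discrete codes with a regression objective plus a "spectrogram flux" regularizer on continuous mel frames. Below is a minimal, hedged sketch of what such losses might look like; it is not the authors' code, the tensor shapes and the exact form of the flux term (here read as discouraging predictions that collapse onto the previous ground-truth frame) are assumptions, and the paper's formulation may differ.

```python
# Hedged sketch of MELLE-style training losses (illustrative, not the authors' code).
# Assumptions: PyTorch; mel targets of shape (batch, T, n_mels); `mu` are the
# model's predicted frame means with the same shape.
import torch
import torch.nn.functional as F


def regression_loss(mu: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 + L2 regression on continuous mel frames, in place of cross-entropy."""
    return F.l1_loss(mu, target) + F.mse_loss(mu, target)


def spectrogram_flux_loss(mu: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """One plausible reading of a 'spectrogram flux' regularizer: penalize
    predictions that stay too close to the previous ground-truth frame,
    encouraging frame-to-frame variation instead of static, repetitive output."""
    prev = target[:, :-1, :]   # y_{t-1}
    pred = mu[:, 1:, :]        # predicted frame at step t
    # Negative L1 distance: minimizing this term pushes consecutive frames apart.
    return -(pred - prev).abs().mean()
```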
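The second contribution, variational inference as a sampling mechanism for continuous tokens, is commonly realized with a reparameterized latent head. The sketch below illustrates that general pattern under stated assumptions; the class and parameter names are hypothetical and not taken from the paper.

```python
# Hedged sketch of a variational latent sampling head (illustrative names).
# Assumption: the decoder hidden state is mapped to a mean and log-variance,
# a latent is sampled via the reparameterization trick, and a KL term
# regularizes it toward a standard normal prior, which is what introduces
# sampling diversity while keeping frame values continuous.
import torch
import torch.nn as nn


class LatentSamplingHead(nn.Module):
    def __init__(self, hidden_dim: int, n_mels: int):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, n_mels)
        self.to_logvar = nn.Linear(hidden_dim, n_mels)

    def forward(self, h: torch.Tensor):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL divergence to a standard normal prior, averaged over elements.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl
```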