
Autoregressive Speech Synthesis without Vector Quantization

July 11, 2024
Authors: Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei
cs.AI

Abstract

We present MELLE, a novel language modeling approach for text-to-speech synthesis (TTS) based on continuous-valued tokens. MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply a regression loss, together with a proposed spectrogram flux loss function, to model the probability distribution of the continuous-valued tokens; (ii) we incorporate variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language model VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See https://aka.ms/melle for demos of our work.
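The two ingredients named in the abstract, (i) a regression objective with a spectrogram flux term and (ii) variational inference to enable sampling, can be illustrated with a minimal PyTorch sketch. This is a hypothetical rendering, not the authors' released code: the names `LatentSamplingHead` and `training_losses`, the exact form of the flux term, and all loss weights are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentSamplingHead(nn.Module):
    """Hypothetical variational head: predicts a Gaussian over each mel frame
    and samples via the reparameterization trick, one standard way to realize
    the 'variational inference for sampling' idea described in the abstract."""

    def __init__(self, d_model: int, n_mels: int = 80):
        super().__init__()
        self.to_mu = nn.Linear(d_model, n_mels)
        self.to_logvar = nn.Linear(d_model, n_mels)

    def forward(self, h):
        # h: (batch, frames, d_model) hidden states from the AR decoder.
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps  # sampled mel frame
        # KL divergence to a standard normal prior, as in a conventional VAE.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return z, mu, kl


def training_losses(pred_mel, target_mel, kl, flux_weight=0.5, kl_weight=1e-2):
    """Sketch of the combined objective; weights are illustrative assumptions."""
    # (i) Regression loss: L1 + L2 distance between predicted and ground-truth
    # mel frames, replacing the cross-entropy used over discrete codec tokens
    # in VALL-E-style models.
    reg = F.l1_loss(pred_mel, target_mel) + F.mse_loss(pred_mel, target_mel)

    # Spectrogram flux term (one plausible reading, not the paper's exact
    # definition): encourage frame-to-frame variation so autoregressive
    # decoding does not collapse into flat, repetitive frames.
    flux = -(pred_mel[:, 1:] - pred_mel[:, :-1]).abs().mean()

    return reg + flux_weight * flux + kl_weight * kl
```

In this reading, sampling from the predicted Gaussian plays the role that top-k sampling over discrete codec tokens plays in VALL-E, while keeping the output space continuous.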

