
Autoregressive Speech Synthesis without Vector Quantization

July 11, 2024
Authors: Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei
cs.AI

Abstract

We present MELLE, a novel language modeling approach for text-to-speech synthesis (TTS) based on continuous-valued tokens. MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply a regression loss, together with a proposed spectrogram flux loss function, to model the probability distribution of the continuous-valued tokens; (ii) we incorporate variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language model VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See https://aka.ms/melle for demos of our work.
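The two ingredients named in the abstract, (i) a regression objective with a spectrogram flux term and (ii) variational inference to enable sampling, can be illustrated with a minimal PyTorch sketch. This is a hypothetical rendering, not the authors' released code: the names `LatentSamplingHead` and `training_losses`, the exact form of the flux term, and all loss weights are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentSamplingHead(nn.Module):
    """Hypothetical variational head: predicts a Gaussian over each mel frame
    and samples via the reparameterization trick, one standard way to realize
    the 'variational inference for sampling' idea described in the abstract."""

    def __init__(self, d_model: int, n_mels: int = 80):
        super().__init__()
        self.to_mu = nn.Linear(d_model, n_mels)
        self.to_logvar = nn.Linear(d_model, n_mels)

    def forward(self, h):
        # h: (batch, frames, d_model) hidden states from the AR decoder.
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps  # sampled mel frame
        # KL divergence to a standard normal prior, as in a conventional VAE.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return z, mu, kl


def training_losses(pred_mel, target_mel, kl, flux_weight=0.5, kl_weight=1e-2):
    """Sketch of the combined objective; weights are illustrative assumptions."""
    # (i) Regression loss: L1 + L2 distance between predicted and ground-truth
    # mel frames, replacing the cross-entropy used over discrete codec tokens
    # in VALL-E-style models.
    reg = F.l1_loss(pred_mel, target_mel) + F.mse_loss(pred_mel, target_mel)

    # Spectrogram flux term (one plausible reading, not the paper's exact
    # definition): encourage frame-to-frame variation so autoregressive
    # decoding does not collapse into flat, repetitive frames.
    flux = -(pred_mel[:, 1:] - pred_mel[:, :-1]).abs().mean()

    return reg + flux_weight * flux + kl_weight * kl
```

In this reading, sampling from the predicted Gaussian plays the role that top-k sampling over discrete codec tokens plays in VALL-E, while keeping the output space continuous.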

