Autoregressive Speech Synthesis without Vector Quantization
July 11, 2024
Authors: Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei
cs.AI
Abstract
We present MELLE, a novel continuous-valued token-based language modeling
approach for text-to-speech synthesis (TTS). MELLE autoregressively generates
continuous mel-spectrogram frames directly from the text condition, bypassing
the need for vector quantization, which was originally designed for audio
compression and sacrifices fidelity compared to mel-spectrograms. Specifically,
(i) instead of a cross-entropy loss, we apply a regression loss with a proposed
spectrogram flux loss function to model the probability distribution of the
continuous-valued tokens; (ii) we incorporate variational inference into
MELLE to facilitate sampling mechanisms, thereby enhancing output diversity
and model robustness. Experiments demonstrate that, compared to the two-stage
codec language models VALL-E and its variants, the single-stage MELLE mitigates
robustness issues by avoiding the inherent flaws of sampling discrete codes,
achieves superior performance across multiple metrics, and, most importantly,
offers a more streamlined paradigm. See https://aka.ms/melle for demos of our
work.
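
The abstract highlights two technical choices: a regression objective with an auxiliary spectrogram flux loss in place of cross-entropy over discrete codec codes, and a variational latent sampling step that restores output diversity. The PyTorch sketch below illustrates one way these pieces could fit together; the module names, dimensions, loss weights, and the exact form of the flux term are assumptions made for illustration and are not taken from the MELLE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentSamplingHead(nn.Module):
    """Variational head: predicts a per-frame mean and log-variance, samples a
    latent with the reparameterization trick, and projects it to a mel frame.
    Layer names and sizes are illustrative assumptions."""

    def __init__(self, hidden_dim: int = 1024, mel_dim: int = 80):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, mel_dim)
        self.to_logvar = nn.Linear(hidden_dim, mel_dim)
        self.to_mel = nn.Linear(mel_dim, mel_dim)

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, frames, hidden_dim) from an autoregressive decoder
        mu = self.to_mu(hidden)
        logvar = self.to_logvar(hidden)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        mel = self.to_mel(z)
        # KL divergence to a standard normal prior, averaged over all elements
        kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())
        return mel, kl


def spectrogram_flux_loss(pred_mel: torch.Tensor) -> torch.Tensor:
    """Frame-difference regularizer: rewards variation between consecutive
    predicted frames so the model does not collapse to flat, repetitive
    output. The paper's exact formulation may differ."""
    frame_diff = pred_mel[:, 1:, :] - pred_mel[:, :-1, :]
    return -frame_diff.abs().mean()


def training_loss(pred_mel, target_mel, kl, flux_weight=0.5, kl_weight=1e-2):
    """Regression loss on continuous mel frames (no cross-entropy over codec
    codes), combined with the auxiliary flux and KL terms. Weights are
    placeholder values."""
    regression = F.l1_loss(pred_mel, target_mel) + F.mse_loss(pred_mel, target_mel)
    return regression + flux_weight * spectrogram_flux_loss(pred_mel) + kl_weight * kl


if __name__ == "__main__":
    head = LatentSamplingHead()
    decoder_states = torch.randn(2, 120, 1024)  # stand-in for decoder outputs
    target = torch.randn(2, 120, 80)            # stand-in for ground-truth mels
    pred, kl = head(decoder_states)
    loss = training_loss(pred, target, kl)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

In this reading, sampling the latent z at each decoding step plays the role that sampling discrete codes plays in codec language models such as VALL-E, providing variability without quantizing the acoustic representation.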