Autoregressive Speech Synthesis without Vector Quantization

Abstract

We present MELLE, a novel language modeling approach for text-to-speech synthesis (TTS) based on continuous-valued tokens. MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens; (ii) we incorporate variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language model VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See https://aka.ms/melle for demos of our work.
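
To make points (i) and (ii) concrete, the following is a minimal sketch in plain Python, not the MELLE implementation. The abstract does not define the losses or the sampling module, so every name and formula here is an assumption: `regression_loss` uses a mean L1 + L2 error, `flux_term` assumes a flux penalty of the form "negative L1 distance between the predicted frame and the previous ground-truth frame", and `sample_frame` uses the standard Gaussian reparameterization trick as a stand-in for the variational sampling step.

```python
import math
import random


def regression_loss(pred, target):
    """Mean L1 + L2 error over all mel bins of all frames.

    Stands in for the regression objective that replaces cross-entropy
    when the tokens are continuous (exact form assumed, not from the paper).
    """
    n = sum(len(frame) for frame in target)
    l1 = sum(abs(p - t) for pf, tf in zip(pred, target) for p, t in zip(pf, tf))
    l2 = sum((p - t) ** 2 for pf, tf in zip(pred, target) for p, t in zip(pf, tf))
    return (l1 + l2) / n


def flux_term(pred, target):
    """Assumed flux-style term: negative mean L1 distance between the
    predicted frame t and the ground-truth frame t-1. Minimizing it
    rewards frame-to-frame change and discourages static output.
    """
    total, n = 0.0, 0
    for t in range(1, len(pred)):
        for p, prev in zip(pred[t], target[t - 1]):
            total += abs(p - prev)
            n += 1
    return -total / n if n else 0.0


def sample_frame(mu, log_var, rng):
    """Reparameterized Gaussian sample: z = mu + sigma * eps, eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


if __name__ == "__main__":
    rng = random.Random(0)
    target = [[0.0, 1.0], [2.0, 3.0]]   # two frames, two mel bins (toy data)
    pred = [[1.0, 1.0], [2.0, 5.0]]
    print(regression_loss(pred, target))
    print(flux_term(pred, target))
    print(sample_frame([0.5, -0.5], [0.0, 0.0], rng))
```

In this sketch, sampling happens in continuous space at every autoregressive step, so no discrete codebook lookup is involved; how the real model weights the loss terms or parameterizes the latent distribution is not stated in the abstract.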
