CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training

Abstract

Improving text representations has attracted much attention as a route to expressive text-to-speech (TTS). However, existing works learn prosody only implicitly through masked token reconstruction tasks, which leads to low training efficiency and makes prosody modeling difficult. We propose CLAPSpeech, a cross-modal contrastive pre-training framework that explicitly learns the prosody variance of the same text token under different contexts. Specifically, 1) through the design of the encoder inputs and the contrastive loss, we encourage the model to connect the text context with its corresponding prosody pattern in a joint multi-modal space; 2) we introduce a multi-scale pre-training pipeline to capture prosody patterns at multiple levels. We show how to incorporate CLAPSpeech into existing TTS models for better prosody. Experiments on three datasets show that CLAPSpeech improves prosody prediction for existing TTS methods and also demonstrate that it generalizes to multiple languages and to multi-speaker TTS. We also analyze the principle behind the performance of CLAPSpeech in depth, and ablation studies demonstrate the necessity of each component of our method. Source code and audio samples are available at https://clapspeech.github.io.
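
To make the contrastive idea concrete, the following is a minimal sketch of a symmetric, CLIP-style contrastive loss between text-context embeddings and prosody embeddings. The abstract does not state the exact form of the loss, so the symmetric cross-entropy formulation, the temperature value, and all names here (`symmetric_contrastive_loss`, `text_emb`, `prosody_emb`) are illustrative assumptions rather than the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(text_emb, prosody_emb, temperature=0.07):
    """Assumed CLIP-style loss: text_emb[i] and prosody_emb[i] form the
    positive pair; every other pairing in the batch is a negative."""
    text = [l2_normalize(v) for v in text_emb]
    pros = [l2_normalize(v) for v in prosody_emb]
    n = len(text)
    # logits[i][j]: temperature-scaled cosine similarity of text i, prosody j
    logits = [[sum(a * b for a, b in zip(t, p)) / temperature for p in pros]
              for t in text]
    # Text-to-prosody direction: classify the matching prosody in each row.
    loss_t2p = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Prosody-to-text direction: classify the matching text in each column.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_p2t = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_t2p + loss_p2t)


# Toy batch (hypothetical values): one token seen in three different contexts.
text_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # pairs deliberately mismatched

loss_aligned = symmetric_contrastive_loss(text_emb, aligned)
loss_shuffled = symmetric_contrastive_loss(text_emb, shuffled)
print(loss_aligned < loss_shuffled)
```

In this toy batch the correctly paired embeddings yield a lower loss than the mismatched ones, which is the property that pushes a text context and its prosody pattern together in the joint space.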
