CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training
May 18, 2023
Authors: Zhenhui Ye, Rongjie Huang, Yi Ren, Ziyue Jiang, Jinglin Liu, Jinzheng He, Xiang Yin, Zhou Zhao
cs.AI
Abstract
Improving text representation has attracted much attention as a route to expressive text-to-speech (TTS). However, existing works learn prosody only implicitly, through masked-token reconstruction tasks, which leads to low training efficiency and difficulty in prosody modeling. We propose CLAPSpeech, a cross-modal contrastive pre-training framework that explicitly learns the prosody variation of the same text token under different contexts. Specifically, 1) we encourage the model to connect a text context with its corresponding prosody pattern in a joint multi-modal space, through the careful design of the encoder inputs and the contrastive loss; 2) we introduce a multi-scale pre-training pipeline to capture prosody patterns at multiple levels. We show how to incorporate CLAPSpeech into existing TTS models for better prosody. Experiments on three datasets not only show that CLAPSpeech can improve prosody prediction for existing TTS methods, but also demonstrate its ability to generalize to multiple languages and multi-speaker TTS. We also analyze in depth the principle behind CLAPSpeech's performance. Ablation studies demonstrate the necessity of each component in our method. Source code and audio samples are available at https://clapspeech.github.io.
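To make the contrastive objective concrete, below is a minimal PyTorch sketch of the CLIP-style symmetric contrastive loss the abstract describes: embeddings of the same token, one encoded from its text context and one from the corresponding speech segment, are pulled together in a joint multi-modal space while mismatched pairs are pushed apart. The function name, tensor shapes, and temperature value are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(text_emb: torch.Tensor,
                          prosody_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (text-context, prosody) pairs.

    text_emb, prosody_emb: (batch, dim) embeddings of the same selected
    token, produced by the text encoder and the prosody (speech) encoder.
    """
    # L2-normalize so the dot product is a cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    prosody_emb = F.normalize(prosody_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are matched pairs.
    logits = text_emb @ prosody_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (text -> speech and speech -> text).
    loss_t2s = F.cross_entropy(logits, targets)
    loss_s2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2s + loss_s2t) / 2
```

Under the multi-scale pipeline described above, a loss of this form would presumably be applied at more than one granularity (e.g., phoneme-level and word-level token selections), so that both local and longer-range prosody patterns are captured.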