CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training

May 18, 2023
作者: Zhenhui Ye, Rongjie Huang, Yi Ren, Ziyue Jiang, Jinglin Liu, Jinzheng He, Xiang Yin, Zhou Zhao
cs.AI

Abstract

Improving text representation has attracted much attention as a route to expressive text-to-speech (TTS). However, existing works only implicitly learn prosody through masked-token reconstruction tasks, which leads to low training efficiency and difficulty in prosody modeling. We propose CLAPSpeech, a cross-modal contrastive pre-training framework that explicitly learns the prosody variation of the same text token under different contexts. Specifically, 1) we encourage the model to connect a text context with its corresponding prosody pattern in a joint multi-modal space through careful design of the encoder inputs and the contrastive loss; 2) we introduce a multi-scale pre-training pipeline to capture prosody patterns at multiple levels. We show how to incorporate CLAPSpeech into existing TTS models for better prosody. Experiments on three datasets not only show that CLAPSpeech can improve prosody prediction for existing TTS methods, but also demonstrate its ability to generalize to multiple languages and to multi-speaker TTS. We also analyze in depth the principle behind CLAPSpeech's performance. Ablation studies demonstrate the necessity of each component in our method. Source code and audio samples are available at https://clapspeech.github.io.
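The contrastive objective described in the abstract is in the spirit of CLIP/CLAP-style symmetric InfoNCE over matched text-context and prosody embeddings. The following is a minimal PyTorch sketch of such a loss, not the authors' implementation; all names (contrastive_loss, text_emb, prosody_emb, temperature) are illustrative assumptions.

```python
# Hypothetical sketch of a CLIP/CLAP-style symmetric contrastive loss
# between text-context embeddings and prosody (audio) embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor,
                     prosody_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """text_emb, prosody_emb: (batch, dim) embeddings of the same text
    token in different contexts and of its spoken prosody pattern."""
    # Normalize so that dot products are cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    prosody_emb = F.normalize(prosody_emb, dim=-1)
    # (batch, batch) similarity matrix; diagonal entries are matched pairs.
    logits = text_emb @ prosody_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: text-to-prosody and prosody-to-text.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage example with random embeddings for a batch of 8 pairs.
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```

Under the abstract's multi-scale pipeline, a loss of this form would presumably be applied at each prosody level (e.g., finer token-level and coarser word-level granularities), with matched pairs on the diagonal being a token's text context and its corresponding speech segment.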