CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training

Abstract

Improving text representations has attracted much attention as a route to expressive text-to-speech (TTS). However, existing work learns prosody only implicitly, through masked token reconstruction tasks, which leads to low training efficiency and makes prosody difficult to model. We propose CLAPSpeech, a cross-modal contrastive pre-training framework that explicitly learns the prosody variance of the same text token under different contexts. Specifically, 1) through careful design of the encoder inputs and the contrastive loss, we encourage the model to connect the text context with its corresponding prosody pattern in a joint multi-modal space; 2) we introduce a multi-scale pre-training pipeline to capture prosody patterns at multiple levels. We show how to incorporate CLAPSpeech into existing TTS models for better prosody. Experiments on three datasets show not only that CLAPSpeech can improve prosody prediction for existing TTS methods, but also that it generalizes to multiple languages and to multi-speaker TTS. We also analyze in depth the principles behind the performance of CLAPSpeech. Ablation studies demonstrate the necessity of each component of our method. Source code and audio samples are available at https://clapspeech.github.io.
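The abstract does not spell out the exact form of the contrastive loss, so the following is only a minimal sketch of the general idea, assuming a symmetric CLIP-style (InfoNCE) objective over a batch of paired embeddings: one embedding for a text token in its context and one for the speech segment that realizes it. The function names, the toy vectors, and the temperature value of 0.07 are illustrative assumptions, not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """Assumed CLIP-style loss over a batch of paired (text, audio) embeddings.

    Pair i is the positive; every other pairing in the batch is a negative.
    """
    text = [l2_normalize(v) for v in text_emb]
    audio = [l2_normalize(v) for v in audio_emb]
    # logits[i][j]: temperature-scaled cosine similarity of text i and audio j
    logits = [
        [sum(a * b for a, b in zip(t, s)) / temperature for s in audio]
        for t in text
    ]
    logits_t = [list(col) for col in zip(*logits)]  # audio-to-text direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch: two context-dependent text embeddings and two speech embeddings.
text_batch = [[1.0, 0.0], [0.0, 1.0]]
audio_matched = [[0.9, 0.1], [0.1, 0.9]]   # row i agrees with text i
audio_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # pairs deliberately mismatched

aligned = symmetric_contrastive_loss(text_batch, audio_matched)
shuffled = symmetric_contrastive_loss(text_batch, audio_shuffled)
print(aligned < shuffled)
```

In this toy batch the correctly paired embeddings give a much lower loss than the mismatched ones, which is the pressure that pulls a token's contextual text embedding toward the prosody actually observed for it.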
