
CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

June 16, 2023
Authors: Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan Serrà, Taylor Berg-Kirkpatrick, Julian McAuley
cs.AI

Abstract

Recent work has studied text-to-audio synthesis using large amounts of paired text-audio data. However, audio recordings with high-quality text annotations can be difficult to acquire. In this work, we approach text-to-audio synthesis using unlabeled videos and pretrained language-vision models. We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model. At test time, we first explore performing a zero-shot modality transfer and condition the diffusion model with a CLIP-encoded text query. However, we observe a noticeable performance drop with respect to image queries. To close this gap, we further adopt a pretrained diffusion prior model to generate a CLIP image embedding given a CLIP text embedding. Our results show the effectiveness of the proposed method, and that the pretrained diffusion prior can reduce the modality transfer gap. While we focus on text-to-audio synthesis, the proposed model can also generate audio from image queries, and it shows competitive performance against a state-of-the-art image-to-audio synthesis model in a subjective listening test. This study offers a new direction of approaching text-to-audio synthesis that leverages the naturally-occurring audio-visual correspondence in videos and the power of pretrained language-vision models.
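The sketch below illustrates, at a high level, the conditioning pipeline the abstract describes: a video frame is encoded with CLIP at training time, while at test time a text query is encoded with CLIP and, optionally, mapped into the CLIP image-embedding space by a diffusion prior before conditioning the audio diffusion model. The CLIP calls use the Hugging Face transformers API; the audio diffusion model, the diffusion prior, and the specific CLIP checkpoint are assumptions for illustration, not the paper's released implementation.

```python
# A minimal sketch of the conditioning pipeline described in the abstract.
# The CLIP encoders come from Hugging Face `transformers`; the audio
# diffusion model and the text-to-image diffusion prior are hypothetical
# placeholders, and the CLIP checkpoint is only an illustrative choice.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_image_embedding(frame):
    """Training-time conditioning: encode a video frame with CLIP."""
    inputs = processor(images=frame, return_tensors="pt")
    with torch.no_grad():
        return clip.get_image_features(**inputs)  # shape (1, 512) for ViT-B/32


def clip_text_embedding(text):
    """Test-time conditioning: encode a text query for zero-shot modality transfer."""
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        return clip.get_text_features(**inputs)  # lives in the same joint embedding space


# Bridging the modality gap with a pretrained diffusion prior (assumed interface):
#   z_text  = clip_text_embedding("dog barking in the rain")
#   z_image = diffusion_prior.sample(z_text)            # text embedding -> image embedding
#   audio   = audio_diffusion_model.sample(cond=z_image)
# Without the prior, z_text would condition the diffusion model directly, which
# the abstract reports works but with a noticeable performance drop.
```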