CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models
June 16, 2023
Authors: Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan Serrà, Taylor Berg-Kirkpatrick, Julian McAuley
cs.AI
Abstract
Recent work has studied text-to-audio synthesis using large amounts of paired
text-audio data. However, audio recordings with high-quality text annotations
can be difficult to acquire. In this work, we approach text-to-audio synthesis
using unlabeled videos and pretrained language-vision models. We propose to
learn the desired text-audio correspondence by leveraging the visual modality
as a bridge. We train a conditional diffusion model to generate the audio track
of a video, given a video frame encoded by a pretrained contrastive
language-image pretraining (CLIP) model. At test time, we first explore
performing a zero-shot modality transfer and condition the diffusion model with
a CLIP-encoded text query. However, we observe a noticeable performance drop
with respect to image queries. To close this gap, we further adopt a pretrained
diffusion prior model to generate a CLIP image embedding given a CLIP text
embedding. Our results show the effectiveness of the proposed method, and that
the pretrained diffusion prior can reduce the modality transfer gap. While we
focus on text-to-audio synthesis, the proposed model can also generate audio
from image queries, and it shows competitive performance against a
state-of-the-art image-to-audio synthesis model in a subjective listening test.
This study offers a new direction of approaching text-to-audio synthesis that
leverages the naturally-occurring audio-visual correspondence in videos and the
power of pretrained language-vision models.
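
The abstract outlines the full pipeline: a diffusion model is trained to generate a video's audio track conditioned on a CLIP image embedding of a video frame, and at test time the condition is either a CLIP text embedding (zero-shot modality transfer) or an image embedding produced from the text embedding by a pretrained diffusion prior. The following is a minimal, hypothetical PyTorch sketch of that conditioning pathway only; the class and function names (ConditionalAudioDiffusion, diffusion_prior), the 512-dimensional embedding size, and the mel-spectrogram shapes are illustrative assumptions, not the authors' implementation, and the denoiser and noise schedule are reduced to toy stand-ins.

```python
# Minimal, hypothetical sketch of the CLIPSonic-style conditioning pipeline.
# Names and shapes are placeholders, not the paper's actual code.
import torch
import torch.nn as nn

CLIP_DIM = 512  # assumed CLIP embedding dimensionality


class ConditionalAudioDiffusion(nn.Module):
    """Toy stand-in for a diffusion model that denoises a mel spectrogram,
    conditioned on a CLIP embedding (an image embedding at training time,
    a text or prior-generated image embedding at test time)."""

    def __init__(self, n_mels=64, cond_dim=CLIP_DIM):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, n_mels)
        self.denoiser = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, noisy_spec, cond_emb):
        # Broadcast the conditioning vector over time and stack it as a channel.
        cond = self.cond_proj(cond_emb)[:, None, :, None]    # (B, 1, n_mels, 1)
        cond = cond.expand(-1, 1, -1, noisy_spec.shape[-1])  # (B, 1, n_mels, T)
        x = torch.cat([noisy_spec, cond], dim=1)             # (B, 2, n_mels, T)
        return self.denoiser(x)                              # predicted noise


# --- Training: image-conditioned, using unlabeled videos --------------------
model = ConditionalAudioDiffusion()
spec = torch.randn(4, 1, 64, 256)         # mel spectrograms of video audio tracks
img_emb = torch.randn(4, CLIP_DIM)        # CLIP embeddings of sampled video frames
noise = torch.randn_like(spec)
pred_noise = model(spec + noise, img_emb)  # real training uses a proper noise schedule
loss = nn.functional.mse_loss(pred_noise, noise)

# --- Inference: text-conditioned ---------------------------------------------
txt_emb = torch.randn(1, CLIP_DIM)        # CLIP text embedding of the query
# Option 1: zero-shot modality transfer -- condition directly on the text embedding.
# Option 2: first map text -> image embedding with a pretrained diffusion prior,
# which the paper finds reduces the modality gap. Shown here only as a stub.
def diffusion_prior(text_emb):            # placeholder for a pretrained prior model
    return text_emb
cond = diffusion_prior(txt_emb)
sample_step = model(torch.randn(1, 1, 64, 256), cond)  # one denoising step, for illustration
```

The point the sketch illustrates is that the conditioning interface is the same across modalities: because CLIP places text and image embeddings in a shared space, an image-conditioned model can be queried with a text embedding (or with an image embedding generated from text by a diffusion prior) without retraining.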