CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

Abstract

Recent work has studied text-to-audio synthesis using large amounts of paired text-audio data. However, audio recordings with high-quality text annotations can be difficult to acquire. In this work, we approach text-to-audio synthesis using unlabeled videos and pretrained language-vision models. We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model. At test time, we first explore performing a zero-shot modality transfer and condition the diffusion model with a CLIP-encoded text query. However, we observe a noticeable performance drop with respect to image queries. To close this gap, we further adopt a pretrained diffusion prior model to generate a CLIP image embedding given a CLIP text embedding. Our results show that the proposed method is effective and that the pretrained diffusion prior can reduce the modality transfer gap. While we focus on text-to-audio synthesis, the proposed model can also generate audio from image queries, and it shows competitive performance against a state-of-the-art image-to-audio synthesis model in a subjective listening test. This study offers a new direction for approaching text-to-audio synthesis that leverages the naturally occurring audio-visual correspondence in videos and the power of pretrained language-vision models.
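The three conditioning paths described in the abstract (image query, zero-shot text query, and text query passed through a prior) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `toy_encoder` and `toy_prior` are hypothetical stand-ins that only mimic the interfaces of a CLIP encoder and a diffusion prior, and the 8-dimensional vectors are arbitrary. None of it reproduces the actual CLIPSonic implementation; the point is only that the audio generator would receive a single conditioning vector, whichever path produced it.

```python
import math
import random
from typing import Callable, List, Optional

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def toy_encoder(key: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in for a CLIP encoder.

    Returns a deterministic pseudo-random unit vector derived from `key`.
    A real system would call a pretrained CLIP image or text encoder here.
    """
    rng = random.Random(key)
    return l2_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


def toy_prior(text_emb: Vector) -> Vector:
    """Hypothetical stand-in for a diffusion prior.

    Maps a text embedding to an image-embedding-like vector. The fixed shift
    is arbitrary; a real prior would be a pretrained generative model.
    """
    return l2_normalize([x + 0.1 for x in text_emb])


def build_condition(
    query: str,
    modality: str,
    prior: Optional[Callable[[Vector], Vector]] = None,
) -> Vector:
    """Return the vector that would condition the audio diffusion model.

    - "image": the path used during training (encoded video frame).
    - "text" without a prior: zero-shot modality transfer.
    - "text" with a prior: text embedding mapped toward the image space.
    """
    if modality == "image":
        return toy_encoder("image:" + query)
    if modality == "text":
        text_emb = toy_encoder("text:" + query)
        return prior(text_emb) if prior is not None else text_emb
    raise ValueError(f"unknown modality: {modality!r}")


if __name__ == "__main__":
    for label, cond in [
        ("image query", build_condition("dog barking", "image")),
        ("text query (zero-shot)", build_condition("dog barking", "text")),
        ("text query + prior", build_condition("dog barking", "text", toy_prior)),
    ]:
        print(label, [round(x, 3) for x in cond])
```

In this sketch the generator never needs to know which modality produced the vector, which is what would make swapping the query type at test time possible without retraining.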
