CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

Abstract

Recent work has studied text-to-audio synthesis using large amounts of paired text-audio data. However, audio recordings with high-quality text annotations can be difficult to acquire. In this work, we approach text-to-audio synthesis using unlabeled videos and pretrained language-vision models. We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model. At test time, we first explore a zero-shot modality transfer, conditioning the diffusion model on a CLIP-encoded text query instead of an image. However, we observe a noticeable performance drop relative to image queries. To close this gap, we further adopt a pretrained diffusion prior model to generate a CLIP image embedding from a CLIP text embedding. Our results show the effectiveness of the proposed method and that the pretrained diffusion prior can reduce the modality transfer gap. While we focus on text-to-audio synthesis, the proposed model can also generate audio from image queries, and it shows competitive performance against a state-of-the-art image-to-audio synthesis model in a subjective listening test. This study offers a new direction for text-to-audio synthesis that leverages the naturally occurring audio-visual correspondence in videos and the power of pretrained language-vision models.
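The conditioning scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function here (`encode_image`, `encode_text`, `toy_prior`, `toy_generate`, `select_condition`) is a hypothetical stand-in, and the embedding size is a toy value. Only the routing of the conditioning vector mirrors the abstract: an image embedding during training, and at test time either a text embedding directly (zero-shot modality transfer) or an image embedding produced from the text embedding by a prior.

```python
import math
import random
from typing import Callable, List, Optional

Vector = List[float]
EMBED_DIM = 8  # toy size chosen for this sketch only


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def fake_embedding(key: str) -> Vector:
    """Hypothetical stand-in for a CLIP encoder: a deterministic unit vector."""
    rng = random.Random(key)
    return l2_normalize([rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)])


def encode_image(frame_id: str) -> Vector:
    # Assumption: stands in for a CLIP image encoder applied to a video frame.
    return fake_embedding("image:" + frame_id)


def encode_text(query: str) -> Vector:
    # Assumption: stands in for a CLIP text encoder applied to a text query.
    return fake_embedding("text:" + query)


def toy_prior(text_emb: Vector) -> Vector:
    # Placeholder for a pretrained diffusion prior, which would generate
    # an image embedding from a text embedding. This is just a fixed shift.
    return l2_normalize([x + 0.1 for x in text_emb])


def select_condition(
    image: Optional[str] = None,
    text: Optional[str] = None,
    prior: Optional[Callable[[Vector], Vector]] = None,
) -> Vector:
    """Pick the conditioning vector for the audio generator."""
    if image is not None:
        return encode_image(image)  # training and image-query path
    if text is None:
        raise ValueError("need an image or a text query")
    text_emb = encode_text(text)
    # Without a prior: zero-shot modality transfer (text embedding used as-is).
    # With a prior: text embedding is first mapped to an image embedding.
    return prior(text_emb) if prior is not None else text_emb


def toy_generate(cond: Vector, num_samples: int = 16,
                 num_steps: int = 10, seed: int = 0) -> List[float]:
    """Toy iterative refinement from noise; NOT a real diffusion sampler."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    for _ in range(num_steps):
        # Pull each sample toward a value set by the conditioning vector.
        x = [0.5 * xi + 0.5 * cond[i % len(cond)] for i, xi in enumerate(x)]
    return x


zero_shot = select_condition(text="dog barking")
with_prior = select_condition(text="dog barking", prior=toy_prior)
audio = toy_generate(with_prior)
```

The design point is that the generator only ever sees a vector in the shared embedding space, so the query modality can be swapped at test time without retraining the generator.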
