

Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

September 28, 2023
Authors: Guy Yariv, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz, Yossi Adi
cs.AI

Abstract

We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, on audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples, and we further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. Compared to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, with respect to both content and the temporal axis. We also show that videos produced by our method exhibit higher visual quality and greater diversity.
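
The abstract describes AV-Align only at a high level: energy peaks are detected in both modalities and then compared. Below is a minimal, hypothetical Python sketch of how such a score could be computed, assuming short-time RMS energy as the audio signal, frame-difference energy as the video motion proxy, and a simple nearest-peak matching rule with a fixed time tolerance. All function and parameter names here are illustrative; the paper's actual signal extraction and matching procedure may differ.

```python
# Hypothetical sketch of an AV-Align-style score: detect energy peaks in the
# audio waveform and in the video (frame-difference energy), then measure how
# well the two sets of peaks line up in time.
import numpy as np
from scipy.signal import find_peaks

def audio_energy(wav: np.ndarray, sr: int, hop_s: float) -> np.ndarray:
    """Short-time RMS energy of a mono waveform, one value per hop of hop_s seconds."""
    hop = int(sr * hop_s)
    n = len(wav) // hop
    frames = wav[: n * hop].reshape(n, hop)
    return np.sqrt((frames ** 2).mean(axis=1))

def video_energy(frames: np.ndarray) -> np.ndarray:
    """Mean absolute frame difference as a simple per-frame motion-energy proxy.

    frames: array of shape (T, H, W, C).
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return diffs.reshape(len(diffs), -1).mean(axis=1)

def peak_times(energy: np.ndarray, step_s: float) -> np.ndarray:
    """Timestamps (seconds) of local maxima in an energy curve."""
    peaks, _ = find_peaks(energy, prominence=energy.std())
    return peaks * step_s

def av_align_score(audio_peaks: np.ndarray, video_peaks: np.ndarray,
                   tol_s: float = 0.1) -> float:
    """Fraction of audio peaks that have a video peak within +/- tol_s seconds."""
    if len(audio_peaks) == 0:
        return 0.0
    hits = sum(np.any(np.abs(video_peaks - t) <= tol_s) for t in audio_peaks)
    return hits / len(audio_peaks)
```

A one-sided matching rule like this rewards videos whose motion bursts coincide with audio onsets; a symmetric variant (also counting video peaks matched by audio peaks) would additionally penalize spurious motion with no acoustic counterpart.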