

Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

September 28, 2023
Authors: Guy Yariv, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz, Yossi Adi
cs.AI

Abstract

We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples, and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, both with respect to content and the temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse.
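
The adaptor described in the abstract is the core of the method: a small trainable module that maps features from a frozen audio encoder into the conditioning space a frozen text-to-video model expects. The sketch below is a minimal PyTorch reading of that idea, not the authors' released code; the dimensions (768-dim audio features, 1024-dim text tokens, 77 tokens), the attention-pooling design, and the class name `AudioToTextAdapter` are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioToTextAdapter(nn.Module):
    """Lightweight adaptor sketch: maps a sequence of audio-encoder features
    to pseudo text-token embeddings consumed by a frozen text-to-video model.
    Dimensions and depth are assumptions, not the paper's exact values."""

    def __init__(self, audio_dim: int = 768, text_dim: int = 1024,
                 num_tokens: int = 77, hidden_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, text_dim),
        )
        # Learned queries pool a variable-length audio sequence into a
        # fixed number of conditioning tokens via cross-attention.
        self.queries = nn.Parameter(torch.randn(num_tokens, text_dim))
        self.attn = nn.MultiheadAttention(text_dim, num_heads=8,
                                          batch_first=True)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim) from a frozen audio encoder
        kv = self.proj(audio_feats)                       # (B, T, text_dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)                  # (B, num_tokens, text_dim)
        return tokens  # fed to the T2V model in place of text embeddings
```

In a setup like this, only the adaptor's parameters are trained while the audio encoder and the text-to-video backbone stay frozen, which is what keeps the approach lightweight and allows conditioning on text, audio, or both at inference time.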
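
AV-Align, as the abstract describes it, scores alignment by detecting and comparing energy peaks in the two modalities. A rough sketch of that idea follows, assuming librosa's onset-strength envelope stands in for audio energy peaks and mean Farneback optical-flow magnitude stands in for visual motion peaks; the peak detectors, the matching tolerance `tol`, and the one-directional matching rule are assumptions and may differ from the paper's exact formulation.

```python
import numpy as np
import librosa
import cv2
from scipy.signal import find_peaks

def audio_peak_times(wav_path: str) -> np.ndarray:
    """Times (s) of peaks in the audio's onset-strength envelope."""
    y, sr = librosa.load(wav_path, sr=None)
    env = librosa.onset.onset_strength(y=y, sr=sr)
    peaks, _ = find_peaks(env, prominence=env.std())
    return librosa.frames_to_time(peaks, sr=sr)

def video_peak_times(video_path: str) -> np.ndarray:
    """Times (s) of peaks in per-frame mean optical-flow magnitude."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    ok, prev = cap.read()
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    motion = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        motion.append(np.linalg.norm(flow, axis=-1).mean())
        prev = gray
    cap.release()
    motion = np.asarray(motion)
    peaks, _ = find_peaks(motion, prominence=motion.std())
    return (peaks + 1) / fps

def av_align(wav_path: str, video_path: str, tol: float = 0.1) -> float:
    """Fraction of audio peaks with a video motion peak within tol seconds
    (an illustrative matching rule; the paper's may differ)."""
    a = audio_peak_times(wav_path)
    v = video_peak_times(video_path)
    if len(a) == 0:
        return 0.0
    matched = sum(np.any(np.abs(v - t) <= tol) for t in a)
    return matched / len(a)
```

The intuition is that a sound event (a drum hit, a bark) should co-occur with a burst of visual motion, so a generated video that scores high under such a metric moves in time with its conditioning audio.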