Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
September 28, 2023
Authors: Guy Yariv, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz, Yossi Adi
cs.AI
Abstract
We consider the task of generating diverse and realistic videos guided by
natural audio samples from a wide variety of semantic classes. For this task,
the videos are required to be aligned both globally and temporally with the
input audio: globally, the input audio is semantically associated with the
entire output video, and temporally, each segment of the input audio is
associated with a corresponding segment of that video. We utilize an existing
text-conditioned video generation model and a pre-trained audio encoder model.
The proposed method is based on a lightweight adaptor network, which learns to
map the audio-based representation to the input representation expected by the
text-to-video generation model. As such, it also enables video generation
conditioned on text, audio, and, for the first time as far as we can ascertain,
on both text and audio. We validate our method extensively on three datasets
that exhibit significant semantic diversity of audio-video samples, and further
propose a novel evaluation metric (AV-Align) to assess the alignment of
generated videos with input audio samples. AV-Align is based on the detection
and comparison of energy peaks in both modalities. In comparison to recent
state-of-the-art approaches, our method generates videos that are better
aligned with the input sound, with respect to both content and the temporal axis.
We also show that videos produced by our method present higher visual quality
and are more diverse.
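
The abstract describes AV-Align only as detecting and comparing energy peaks in both modalities. The following is a minimal sketch of that idea, not the authors' implementation. It assumes each modality has already been reduced to one energy value per video frame (for example, audio onset strength and mean motion magnitude). The function names `find_peaks` and `peak_alignment`, the `tolerance` window, and the intersection-over-union scoring are all illustrative assumptions.

```python
from typing import List, Sequence


def find_peaks(energy: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Return indices of local maxima whose value exceeds `threshold`.

    Hypothetical helper: a real pipeline may smooth the signal first.
    """
    peaks = []
    for i in range(1, len(energy) - 1):
        rises = energy[i] > energy[i - 1]
        does_not_rise_after = energy[i] >= energy[i + 1]
        if energy[i] > threshold and rises and does_not_rise_after:
            peaks.append(i)
    return peaks


def peak_alignment(
    audio_peaks: Sequence[int],
    video_peaks: Sequence[int],
    tolerance: int = 1,
) -> float:
    """IoU-style score in [0, 1] between two sets of peak positions.

    An audio peak matches the closest unused video peak that lies within
    `tolerance` frames. The scoring rule is an assumption for illustration.
    """
    unmatched_video = list(video_peaks)
    matched = 0
    for a in audio_peaks:
        candidates = [v for v in unmatched_video if abs(v - a) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda v: abs(v - a))
            unmatched_video.remove(best)  # each video peak is used once
            matched += 1
    union = len(audio_peaks) + len(video_peaks) - matched
    if union == 0:
        return 0.0  # no peaks in either modality; convention chosen here
    return matched / union


# Toy per-frame energies (made-up values, not data from the paper).
audio_energy = [0, 1, 0, 0, 3, 0, 0, 2, 0]
video_energy = [0, 0, 2, 0, 1, 0, 0, 0, 0]
score = peak_alignment(find_peaks(audio_energy), find_peaks(video_energy))
```

In this toy example two of the three audio peaks have a video peak within one frame, so the score is below 1. Widening `tolerance` makes the measure more forgiving of small timing offsets.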