VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
September 11, 2024
Authors: Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang
cs.AI
Abstract
We present a framework for learning to generate background music from video
inputs. Unlike existing works that rely on symbolic musical annotations, which
are limited in quantity and diversity, our method leverages large-scale web
videos accompanied by background music. This enables our model to learn to
generate realistic and diverse music. To accomplish this goal, we develop a
generative video-music Transformer with a novel semantic video-music alignment
scheme. Our model uses a joint autoregressive and contrastive learning
objective, which encourages the generation of music aligned with high-level
video content. We also introduce a novel video-beat alignment scheme to match
the generated music beats with the low-level motions in the video. Lastly, to
capture fine-grained visual cues in a video needed for realistic background
music generation, we introduce a new temporal video encoder architecture,
allowing us to efficiently process videos consisting of many densely sampled
frames. We train our framework on our newly curated DISCO-MV dataset,
consisting of 2.2M video-music samples, which is orders of magnitude larger
than any prior dataset used for video music generation. Our method outperforms
existing approaches on the DISCO-MV and MusicCaps datasets according to various
music generation evaluation metrics, including human evaluation. Results are
available at https://genjib.github.io/project_page/VMAs/index.html
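
The abstract describes a joint autoregressive and contrastive learning objective but gives no formula, so the following is only a minimal sketch of what such an objective could look like, not the authors' implementation. It assumes a next-token cross-entropy over discrete music tokens and a symmetric InfoNCE term over paired video and music embeddings. The function names, the `temperature` value, and the `weight` that balances the two terms are all hypothetical choices made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def autoregressive_loss(token_logits, target_tokens):
    """Mean next-token cross-entropy over a sequence of music tokens.

    token_logits[t] holds the predicted logits at step t and
    target_tokens[t] is the index of the ground-truth token at that step.
    """
    losses = [-log_softmax(logits)[target]
              for logits, target in zip(token_logits, target_tokens)]
    return sum(losses) / len(losses)


def normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(video_embs, music_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th video should match the i-th music clip.

    temperature=0.07 is an assumed default, not a value from the paper.
    """
    v = [normalize(e) for e in video_embs]
    m = [normalize(e) for e in music_embs]
    n = len(v)
    # Cosine similarity between every video and every music clip in the batch.
    sim = [[sum(a * b for a, b in zip(v[i], m[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Video-to-music direction: softmax over each row.
    v2m = sum(-log_softmax(sim[i])[i] for i in range(n)) / n
    # Music-to-video direction: softmax over each column.
    m2v = sum(-log_softmax([sim[i][j] for i in range(n)])[j]
              for j in range(n)) / n
    return 0.5 * (v2m + m2v)


def joint_loss(token_logits, target_tokens, video_embs, music_embs, weight=1.0):
    """Autoregressive term plus a weighted contrastive term (weight is assumed)."""
    return (autoregressive_loss(token_logits, target_tokens)
            + weight * contrastive_loss(video_embs, music_embs))
```

In this sketch the contrastive term is small when each video embedding is closest to its own music embedding and large when the pairs are swapped, which is the behavior an alignment objective of this kind is meant to reward.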