VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

Abstract

We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture the fine-grained visual cues in a video needed for realistic background music generation, we introduce a new temporal video encoder architecture, allowing us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior dataset used for video-to-music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html
