MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization
October 16, 2024
Authors: Ruiqi Li, Siqi Zheng, Xize Cheng, Ziang Zhang, Shengpeng Ji, Zhou Zhao
cs.AI
Abstract
Generating music that aligns with the visual content of a video has been a
challenging task, as it requires a deep understanding of visual semantics and
involves generating music whose melody, rhythm, and dynamics harmonize with the
visual narratives. This paper presents MuVi, a novel framework that effectively
addresses these challenges to enhance the cohesion and immersive experience of
audio-visual content. MuVi analyzes video content through a specially designed
visual adaptor to extract contextually and temporally relevant features. These
features are used to generate music that matches not only the video's mood and
theme but also its rhythm and pacing. We also introduce a contrastive
music-visual pre-training scheme to ensure synchronization, based on the
periodic nature of music phrases. In addition, we demonstrate that our
flow-matching-based music generator has in-context learning ability, allowing
us to control the style and genre of the generated music. Experimental results
show that MuVi achieves superior performance in both audio quality and
temporal synchronization. The generated music video samples are available at
https://muvi-v2m.github.io.
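
The abstract names a contrastive music-visual pre-training scheme but does not give its loss. As a rough illustration of the general idea only, the sketch below implements a generic symmetric InfoNCE loss over paired video and music embeddings. It is not MuVi's actual objective: the function names, the temperature value, and the use of in-batch negatives are all assumptions made here, and the abstract indicates the paper's own scheme additionally relies on the periodic nature of music phrases.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is nonzero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(row, idx):
    """Log-softmax of `row` evaluated at position `idx`, computed stably."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return row[idx] - log_sum


def symmetric_info_nce(video_emb, music_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's loss).

    video_emb[i] and music_emb[i] are treated as a matching pair; every
    other item in the batch serves as a negative. The temperature of 0.07
    is an assumed default, not a value taken from the paper.
    """
    v = [l2_normalize(e) for e in video_emb]
    m = [l2_normalize(e) for e in music_emb]
    n = len(v)
    # Cosine similarity between every video and every music clip.
    logits = [
        [sum(a * b for a, b in zip(v[i], m[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-music direction: each row should peak on its diagonal entry.
    v2m = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Music-to-video direction: same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    m2v = -sum(log_softmax_at(columns[j], j) for j in range(n)) / n
    return 0.5 * (v2m + m2v)


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0]]
    aligned_music = [[2.0, 0.0], [0.0, 3.0]]
    swapped_music = [[0.0, 3.0], [2.0, 0.0]]
    print("aligned loss:", symmetric_info_nce(video, aligned_music))
    print("swapped loss:", symmetric_info_nce(video, swapped_music))
```

Correctly paired embeddings yield a loss near zero, while swapping the pairs drives the loss up, which is the signal a contrastive pre-training stage would use to pull matching video and music segments together.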