VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
September 11, 2024
Authors: Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang
cs.AI
Abstract
We present a framework for learning to generate background music from video
inputs. Unlike existing works that rely on symbolic musical annotations, which
are limited in quantity and diversity, our method leverages large-scale web
videos accompanied by background music. This enables our model to learn to
generate realistic and diverse music. To accomplish this goal, we develop a
generative video-music Transformer with a novel semantic video-music alignment
scheme. Our model uses a joint autoregressive and contrastive learning
objective, which encourages the generation of music aligned with high-level
video content. We also introduce a novel video-beat alignment scheme to match
the generated music beats with the low-level motions in the video. Lastly, to
capture fine-grained visual cues in a video needed for realistic background
music generation, we introduce a new temporal video encoder architecture,
allowing us to efficiently process videos consisting of many densely sampled
frames. We train our framework on our newly curated DISCO-MV dataset,
consisting of 2.2M video-music samples, which is orders of magnitude larger
than any prior dataset used for video music generation. Our method outperforms
existing approaches on the DISCO-MV and MusicCaps datasets according to various
music generation evaluation metrics, including human evaluation. Results are
available at https://genjib.github.io/project_page/VMAs/index.html
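
The abstract describes a joint autoregressive and contrastive learning objective but gives no formula. As a rough illustration only, the sketch below combines a next-token cross-entropy over music tokens with a symmetric InfoNCE-style contrastive term over paired video and music embeddings. The InfoNCE form, the `temperature` value, the `weight` parameter, and all function names are assumptions made for this example and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def autoregressive_loss(token_logits, target_tokens):
    """Mean next-token cross-entropy over a sequence of music tokens."""
    total = 0.0
    for logits, target in zip(token_logits, target_tokens):
        total -= log_softmax(logits)[target]
    return total / len(target_tokens)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(video_embs, music_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not from the paper).

    The i-th video embedding is treated as the positive match for the
    i-th music embedding; all other pairs in the batch are negatives.
    """
    n = len(video_embs)
    sims = [[cosine(v, m) / temperature for m in music_embs] for v in video_embs]
    loss = 0.0
    for i in range(n):
        row = sims[i]                         # video i against every music clip
        col = [sims[j][i] for j in range(n)]  # music i against every video
        loss -= log_softmax(row)[i]
        loss -= log_softmax(col)[i]
    return loss / (2 * n)


def joint_loss(token_logits, target_tokens, video_embs, music_embs, weight=1.0):
    """Autoregressive loss plus a weighted contrastive term.

    `weight` is a hypothetical balancing coefficient for this sketch.
    """
    ar = autoregressive_loss(token_logits, target_tokens)
    cl = contrastive_loss(video_embs, music_embs)
    return ar + weight * cl
```

In this toy formulation, the contrastive term is small when each video embedding is closest to its own music embedding and large when the pairs are mismatched, which is one plausible way a model could be pushed toward music that matches high-level video content.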