VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos

Abstract

We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture fine-grained visual cues in a video needed for realistic background music generation, we introduce a new temporal video encoder architecture, allowing us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior dataset used for video music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html
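
The abstract names a joint autoregressive and contrastive objective but does not give its form. The sketch below is one plausible reading, not the authors' implementation: it assumes the autoregressive term is a next-token cross-entropy over music tokens, the contrastive term is a symmetric InfoNCE loss over paired video and music embeddings, and the two are combined as a weighted sum. The function names, the temperature of 0.07, and the `contrastive_weight` parameter are all hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def autoregressive_loss(logits, targets):
    """Mean cross-entropy; logits[t] scores the music token at step t."""
    total = sum(log_softmax(row)[tok] for row, tok in zip(logits, targets))
    return -total / len(targets)


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(video_emb, music_emb, temperature=0.07):
    """Symmetric InfoNCE (assumed form): pair i of each list is a match."""
    v = [normalize(x) for x in video_emb]
    m = [normalize(x) for x in music_emb]
    # Cosine similarity of every video with every music clip, scaled.
    sim = [[sum(a * b for a, b in zip(vi, mj)) / temperature for mj in m]
           for vi in v]
    n = len(v)
    video_to_music = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]
    music_to_video = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (video_to_music + music_to_video)


def joint_loss(logits, targets, video_emb, music_emb, contrastive_weight=1.0):
    """Weighted sum of the two terms; the weighting is an assumption."""
    return (autoregressive_loss(logits, targets)
            + contrastive_weight * contrastive_loss(video_emb, music_emb))
```

Under these assumptions, the contrastive term is small when each video embedding is closest to its own music embedding and grows when pairs are mismatched, while the autoregressive term is the usual token-level likelihood.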
