V2Meow: Meowing to the Visual Beat via Music Generation
May 11, 2023
作者: Kun Su, Judith Yue Li, Qingqing Huang, Dima Kuzmin, Joonseok Lee, Chris Donahue, Fei Sha, Aren Jansen, Yu Wang, Mauro Verzetti, Timo I. Denk
cs.AI
Abstract
Generating high-quality music that complements the visual content of a video is a challenging task. Most existing visually conditioned music generation systems produce symbolic music data, such as MIDI files, rather than raw audio waveforms. Given the limited availability of symbolic music data, these methods can only generate music for a few instruments or for specific types of visual input. In this paper, we propose a novel approach called V2Meow that can generate high-quality music audio that aligns well with the visual semantics of a diverse range of video input types. Specifically, the proposed music generation system is a multi-stage autoregressive model trained on O(100K) music audio clips paired with video frames, mined from in-the-wild music videos; no parallel symbolic music data is involved. V2Meow is able to synthesize high-fidelity music audio waveforms conditioned solely on pre-trained visual features extracted from an arbitrary silent video clip, and it also allows high-level control over the music style of the generated examples by supporting text prompts in addition to video-frame conditioning. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms several existing music generation systems in terms of both visual-audio correspondence and audio quality.
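To make the conditioning pattern described above concrete, here is a minimal, hypothetical PyTorch sketch of an autoregressive audio-token model conditioned on pre-trained visual features and an optional text-prompt embedding. All class names, dimensions, and hyperparameters below are illustrative assumptions, not the authors' implementation, and the multi-stage pipeline (semantic tokens followed by acoustic tokens and waveform decoding) is collapsed into a single stage for brevity.

```python
# Illustrative sketch only: a single-stage, visually conditioned autoregressive
# token model in the spirit of V2Meow. Names and shapes are hypothetical.
import torch
import torch.nn as nn

class VideoToMusicSketch(nn.Module):
    def __init__(self, visual_dim=512, text_dim=512, d_model=512,
                 audio_vocab=1024, n_layers=4, n_heads=8):
        super().__init__()
        # Project pre-trained (frozen, externally extracted) visual and text
        # features into the decoder's model dimension.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        self.token_emb = nn.Embedding(audio_vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, audio_vocab)

    def forward(self, audio_tokens, visual_feats, text_feats):
        # Conditioning memory: per-frame visual features concatenated with an
        # optional text-prompt embedding for high-level style control.
        memory = torch.cat([self.visual_proj(visual_feats),
                            self.text_proj(text_feats)], dim=1)
        x = self.token_emb(audio_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(x, memory, tgt_mask=causal)
        # Next-token logits over discrete audio codes; a separate codec
        # decoder would turn sampled codes back into a waveform.
        return self.head(h)

# Toy usage: 8 video frames, a 1-token text prompt, 16 audio tokens so far.
model = VideoToMusicSketch()
logits = model(torch.randint(0, 1024, (1, 16)),
               torch.randn(1, 8, 512), torch.randn(1, 1, 512))
print(logits.shape)  # torch.Size([1, 16, 1024])
```

In a full system of this kind, sampling from the logits would proceed token by token, and the discrete audio codes would be decoded to a waveform by a neural codec; this sketch only shows how silent-video features and text prompts can jointly condition the autoregressive stage.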