
V2Meow: Meowing to the Visual Beat via Music Generation

May 11, 2023
作者: Kun Su, Judith Yue Li, Qingqing Huang, Dima Kuzmin, Joonseok Lee, Chris Donahue, Fei Sha, Aren Jansen, Yu Wang, Mauro Verzetti, Timo I. Denk
cs.AI

Abstract

Generating high-quality music that complements the visual content of a video is a challenging task. Most existing visually conditioned music generation systems produce symbolic music data, such as MIDI files, instead of raw audio waveforms. Given the limited availability of symbolic music data, such methods can only generate music for a few instruments or for specific types of visual input. In this paper, we propose a novel approach called V2Meow that can generate high-quality music audio that aligns well with the visual semantics of a diverse range of video input types. Specifically, the proposed music generation system is a multi-stage autoregressive model trained on the order of 100K music audio clips paired with video frames, mined from in-the-wild music videos; no parallel symbolic music data is involved. V2Meow synthesizes high-fidelity music audio waveforms conditioned solely on pre-trained visual features extracted from an arbitrary silent video clip, and it also allows high-level control over the music style of generated examples via text prompts in addition to the video-frame conditioning. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms several existing music generation systems in terms of both visual-audio correspondence and audio quality.
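To make the multi-stage autoregressive design concrete, here is a minimal sketch of such a pipeline. This is not the authors' code: the stage names, token vocabularies, sequence lengths, and the deterministic toy "model" below are illustrative assumptions; in the actual system each stage would be a Transformer and the final tokens would be decoded to a waveform by a neural audio codec.

```python
# Hedged sketch of a multi-stage autoregressive music generator conditioned
# on visual features (and optional text tokens). All names and sizes here
# are assumptions for illustration, not the paper's specification.
import hashlib
from typing import List


def _toy_model(context: List[int], vocab: int) -> int:
    # Stand-in for one autoregressive Transformer decoding step:
    # deterministically maps the context sequence to the next token.
    digest = hashlib.sha256(",".join(map(str, context)).encode()).digest()
    return digest[0] % vocab


def generate_tokens(conditioning: List[int], n: int, vocab: int) -> List[int]:
    # Autoregressive decoding: each new token is conditioned on the
    # conditioning sequence plus all previously generated tokens.
    tokens: List[int] = []
    for _ in range(n):
        tokens.append(_toy_model(conditioning + tokens, vocab))
    return tokens


def v2meow_sketch(visual_features: List[int], text_tokens: List[int]) -> List[int]:
    # Stage 1: map visual conditioning (plus optional text-prompt tokens)
    # to coarse "semantic" music tokens.
    semantic = generate_tokens(visual_features + text_tokens, n=16, vocab=512)
    # Stage 2: refine semantic tokens into fine-grained "acoustic" tokens,
    # which a neural codec decoder would then turn into a waveform.
    acoustic = generate_tokens(semantic, n=64, vocab=1024)
    return acoustic
```

A caller would pass quantized visual features from a frozen video encoder, e.g. `v2meow_sketch(visual_features=[3, 14, 15], text_tokens=[9, 2])`, and feed the resulting acoustic tokens to a codec decoder. The key design point the sketch illustrates is that audio is generated token-by-token from visual conditioning alone, with no symbolic (MIDI) intermediate.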