MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization

Abstract

Generating music that aligns with the visual content of a video is a challenging task: it requires a deep understanding of visual semantics, and the melody, rhythm, and dynamics of the generated music must harmonize with the visual narrative. This paper presents MuVi, a novel framework that addresses these challenges to enhance the cohesion and immersive experience of audio-visual content. MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features. These features are used to generate music that matches not only the video's mood and theme but also its rhythm and pacing. We also introduce a contrastive music-visual pre-training scheme that ensures synchronization, based on the periodic nature of musical phrases. In addition, we demonstrate that our flow-matching-based music generator has in-context learning ability, which allows us to control the style and genre of the generated music. Experimental results show that MuVi achieves superior performance in both audio quality and temporal synchronization. The generated music video samples are available at https://muvi-v2m.github.io.
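
To make the idea of contrastive music-visual pre-training more concrete, the following is a minimal sketch of a generic symmetric InfoNCE (CLIP-style) loss over paired video and music embeddings. It is an illustration under stated assumptions, not the scheme used in MuVi: the abstract does not specify the loss, and the periodicity-based construction mentioned above is not modelled here. The function names, the embedding shapes, and the temperature value are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(video_emb, music_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (an assumption, not MuVi's exact loss).

    Row i of video_emb and row i of music_emb are treated as a positive pair;
    every other row in the batch serves as a negative.
    """
    v = l2_normalize(video_emb)
    m = l2_normalize(music_emb)
    logits = v @ m.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    # Video-to-music direction: each row should pick out its own column.
    loss_v2m = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Music-to-video direction: each column should pick out its own row.
    loss_m2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2m + loss_m2v)
```

In this sketch, correctly paired embeddings yield a lower loss than mismatched ones, which is the property a contrastive pre-training objective relies on to pull matching video and music segments together.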
