Language-Guided Music Recommendation for Video via Prompt Analogies

Abstract

We propose a method to recommend music for an input video while allowing a user to guide music selection with free-form natural language. A key challenge of this problem setting is that existing music video datasets provide the needed (video, music) training pairs but lack text descriptions of the music. This work addresses this challenge with three contributions. First, we propose a text-synthesis approach that relies on an analogy-based prompting procedure to generate natural language music descriptions from a large-scale language model (BLOOM-176B), given pre-trained music tagger outputs and a small number of human text descriptions. Second, we use these synthesized music descriptions to train a new trimodal model, which fuses text and video input representations to query music samples. For training, we introduce a text dropout regularization mechanism, which we show is critical to model performance. Our model design allows the retrieved music audio to agree with both input modalities by matching the visual style depicted in the video and the musical genre, mood, or instrumentation described in the natural language query. Third, to evaluate our approach, we collect a test dataset for our problem by annotating a subset of 4k clips from the YT8M-MusicVideo dataset with natural language music descriptions, which we make publicly available. We show that our approach can match or exceed the performance of prior methods on video-to-music retrieval while significantly improving retrieval accuracy when using text guidance.
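
The interaction between text dropout and fused video-plus-text retrieval can be illustrated with a small sketch. The abstract does not specify the fusion operator, the dropout rate, or the embedding dimensions, so everything below is an assumption made for illustration: the function names are hypothetical, fusion is plain concatenation, dropout replaces the whole text embedding with zeros, and music embeddings are assumed to live in the same space as the fused query.

```python
import math
import random


def apply_text_dropout(text_emb, drop_prob, rng):
    """With probability drop_prob, replace the text embedding with zeros.

    Assumption: dropping means zeroing the whole vector, so the model
    must sometimes retrieve music from the video alone.
    """
    if rng.random() < drop_prob:
        return [0.0] * len(text_emb)
    return list(text_emb)


def fuse(video_emb, text_emb):
    """Toy fusion by concatenation; a trained model would learn this step."""
    return list(video_emb) + list(text_emb)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_music(query, music_embs):
    """Return music ids sorted by descending similarity to the query."""
    return sorted(music_embs, key=lambda k: cosine(query, music_embs[k]), reverse=True)


# Hypothetical toy embeddings: both tracks fit the video equally well,
# but only one matches the text description.
video_emb = [1.0, 0.0]
text_emb = [0.0, 1.0]
music_embs = {
    "heavy_rock": [1.0, 0.0, 1.0, 0.0],
    "calm_piano": [1.0, 0.0, 0.0, 1.0],
}

rng = random.Random(0)
train_query = fuse(video_emb, apply_text_dropout(text_emb, 0.5, rng))  # training-time query
guided_query = fuse(video_emb, text_emb)  # inference-time query with text guidance
print(rank_music(guided_query, music_embs))
```

In this toy setup the text embedding breaks the tie between two tracks that match the video equally well. When the text is dropped, the query reduces to the video part and both tracks score the same, which is the video-only case that dropout exposes the model to during training.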
