Language-Guided Music Recommendation for Video via Prompt Analogies

Abstract

We propose a method to recommend music for an input video while allowing a user to guide music selection with free-form natural language. A key challenge of this problem setting is that existing music video datasets provide the needed (video, music) training pairs but lack text descriptions of the music. This work addresses the challenge with three contributions. First, we propose a text-synthesis approach that relies on an analogy-based prompting procedure to generate natural language music descriptions from a large-scale language model (BLOOM-176B), given pre-trained music tagger outputs and a small number of human text descriptions. Second, we use these synthesized music descriptions to train a new trimodal model, which fuses text and video input representations to query music samples. For training, we introduce a text dropout regularization mechanism, which we show is critical to model performance. Our model design allows the retrieved music audio to agree with both input modalities by matching the visual style depicted in the video and the musical genre, mood, or instrumentation described in the natural language query. Third, to evaluate our approach, we collect a testing dataset for our problem by annotating a subset of 4k clips from the YT8M-MusicVideo dataset with natural language music descriptions, which we make publicly available. We show that our approach can match or exceed the performance of prior methods on video-to-music retrieval while significantly improving retrieval accuracy when using text guidance.
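To make the idea of fusing text and video inputs with text dropout more concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not specify the fusion operator, the dropout rate, or the embedding dimensions, so everything here is an assumption: fusion is an element-wise sum followed by L2 normalization, dropout zeroes the entire text embedding for a training sample, and the names `fuse_query`, `retrieve`, and `text_drop_prob` are hypothetical.

```python
import random
from typing import List, Optional

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_query(
    video_emb: Vector,
    text_emb: Vector,
    text_drop_prob: float,
    training: bool,
    rng: Optional[random.Random] = None,
) -> Vector:
    """Combine a video embedding and a text embedding into one query vector.

    Assumption: during training, the whole text embedding is zeroed with
    probability `text_drop_prob`, so the query must sometimes rely on video alone.
    Assumption: fusion is an element-wise sum followed by L2 normalization.
    """
    rng = rng if rng is not None else random.Random()
    if training and rng.random() < text_drop_prob:
        text_emb = [0.0] * len(text_emb)  # drop the text modality for this sample
    fused = [v + t for v, t in zip(video_emb, text_emb)]
    return l2_normalize(fused)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def retrieve(query: Vector, music_bank: List[Vector]) -> int:
    """Return the index of the music embedding most similar to the query."""
    return max(range(len(music_bank)), key=lambda i: cosine(query, music_bank[i]))
```

In this sketch, dropping the text input reduces the query to the video embedding alone, which illustrates one way a single model could serve both video-only and text-guided retrieval.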
