Language-Guided Music Recommendation for Video via Prompt Analogies
June 15, 2023
Authors: Daniel McKee, Justin Salamon, Josef Sivic, Bryan Russell
cs.AI
Abstract
We propose a method to recommend music for an input video while allowing a
user to guide music selection with free-form natural language. A key challenge
of this problem setting is that existing music video datasets provide the
needed (video, music) training pairs, but lack text descriptions of the music.
This work addresses this challenge with the following three contributions.
First, we propose a text-synthesis approach that relies on an analogy-based
prompting procedure to generate natural language music descriptions from a
large-scale language model (BLOOM-176B) given pre-trained music tagger outputs
and a small number of human text descriptions. Second, we use these synthesized
music descriptions to train a new trimodal model, which fuses text and video
input representations to query music samples. For training, we introduce a text
dropout regularization mechanism which we show is critical to model
performance. Our model design allows for the retrieved music audio to agree
with the two input modalities by matching visual style depicted in the video
and musical genre, mood, or instrumentation described in the natural language
query. Third, to evaluate our approach, we collect a testing dataset for our
problem by annotating a subset of 4k clips from the YT8M-MusicVideo dataset
with natural language music descriptions which we make publicly available. We
show that our approach can match or exceed the performance of prior methods on
video-to-music retrieval while significantly improving retrieval accuracy when
using text guidance.
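The analogy-based prompting procedure described above could be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact prompt format, the tag vocabulary, and the exemplar descriptions are assumptions; in the paper the completed prompt is sent to BLOOM-176B, which is omitted here.

```python
def build_analogy_prompt(exemplars, query_tags):
    """Assemble a few-shot analogy prompt. Each exemplar pairs music-tagger
    outputs with a human-written description; the final query supplies tags
    only, so the language model completes the analogous description.
    (Hypothetical format -- the paper's actual prompt template may differ.)"""
    parts = []
    for tags, description in exemplars:
        parts.append(f"Tags: {', '.join(tags)}\nDescription: {description}")
    # The query entry ends after "Description:" to elicit a completion.
    parts.append(f"Tags: {', '.join(query_tags)}\nDescription:")
    return "\n\n".join(parts)


# Illustrative exemplars (invented for this sketch).
exemplars = [
    (["rock", "energetic", "electric guitar"],
     "An energetic rock track driven by distorted electric guitar."),
    (["jazz", "calm", "piano"],
     "A calm, mellow jazz piece led by piano."),
]

prompt = build_analogy_prompt(exemplars, ["pop", "happy", "synthesizer"])
print(prompt)
```

The completed prompt would then be passed to the language model, whose continuation serves as the synthetic natural-language description for the tagged music clip.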