LP-MusicCaps: LLM-Based Pseudo Music Captioning

Abstract

Automatic music captioning, which generates natural language descriptions for given music tracks, holds significant potential for enhancing the understanding and organization of large volumes of musical data. Despite its importance, researchers face challenges because existing music-language datasets are costly and time-consuming to collect and are limited in size. To address this data scarcity, we propose using large language models (LLMs) to artificially generate description sentences from large-scale tag datasets. This results in approximately 2.2M captions paired with 0.5M audio clips. We term it the Large Language Model based Pseudo music caption dataset, or LP-MusicCaps for short. We conduct a systematic evaluation of the large-scale music captioning dataset using various quantitative evaluation metrics from the field of natural language processing as well as human evaluation. In addition, we trained a transformer-based music captioning model on the dataset and evaluated it under zero-shot and transfer-learning settings. The results demonstrate that our proposed approach outperforms the supervised baseline model.
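To make the tag-to-caption idea concrete, the following is a minimal sketch of how a clip's tag list might be turned into a text prompt for an LLM. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_caption_prompt`, the instruction wording, and the prompt layout are all hypothetical, and no particular LLM API is assumed. The returned string would be passed to whichever model is used.

```python
from typing import Sequence

# Hypothetical instruction text (an assumption); the actual prompts
# used to build LP-MusicCaps are not given in this abstract.
DEFAULT_INSTRUCTION = (
    "Write a one-sentence description of a song with the following attributes."
)


def build_caption_prompt(
    tags: Sequence[str], instruction: str = DEFAULT_INSTRUCTION
) -> str:
    """Turn a clip's tag list into a text prompt for an LLM."""
    seen = set()
    cleaned = []
    for tag in tags:
        # Normalise each tag: strip whitespace and lowercase it.
        normalised = tag.strip().lower()
        # Drop empty tags and duplicates while keeping the original order.
        if normalised and normalised not in seen:
            seen.add(normalised)
            cleaned.append(normalised)
    if not cleaned:
        raise ValueError("at least one non-empty tag is required")
    return f"{instruction}\nAttributes: {', '.join(cleaned)}\nDescription:"


if __name__ == "__main__":
    # Example tag list for a single audio clip (made-up tags).
    print(build_caption_prompt(["Jazz", "piano", " slow tempo ", "jazz"]))
```

Deduplicating and normalising the tags before prompting keeps the prompt short and avoids repeating the same attribute, which matters when tags are merged from several annotation sources.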
