LP-MusicCaps: LLM-Based Pseudo Music Captioning

Abstract

Automatic music captioning, which generates natural language descriptions for given music tracks, holds significant potential for enhancing the understanding and organization of large volumes of musical data. Despite its importance, researchers face challenges because existing music-language datasets are limited in size and their collection process is costly and time-consuming. To address this data scarcity issue, we propose using large language models (LLMs) to artificially generate description sentences from large-scale tag datasets. This yields approximately 2.2M captions paired with 0.5M audio clips. We call the result the Large Language Model based Pseudo music caption dataset, LP-MusicCaps for short. We conduct a systematic evaluation of this large-scale music captioning dataset using various quantitative metrics from natural language processing, as well as human evaluation. In addition, we train a transformer-based music captioning model on the dataset and evaluate it under zero-shot and transfer-learning settings. The results demonstrate that our proposed approach outperforms the supervised baseline model.
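To make the tag-to-caption idea concrete, the sketch below shows one way a clip's tags could be turned into text prompts for an LLM, with one prompt per generation task. It is a minimal illustration under stated assumptions, not the authors' pipeline: the `Clip` type, the `build_caption_prompts` function, the task names, and the template wording are all hypothetical, and the actual LLM call that would turn each prompt into a caption is deliberately omitted.

```python
from dataclasses import dataclass

# Hypothetical instruction templates (assumption): these are NOT the exact
# prompts used to build LP-MusicCaps, only an illustration of the idea.
TASK_TEMPLATES = {
    "writing": "Write a song description sentence including the following attributes: {tags}",
    "summary": "Write a single-sentence summary about a song with the following attributes: {tags}",
}


@dataclass
class Clip:
    """Hypothetical container for one audio clip and its tags."""
    clip_id: str
    tags: list[str]


def build_caption_prompts(clip: Clip) -> dict[str, str]:
    """Return one prompt per task; each prompt would be sent to an LLM
    (call omitted) and the reply paired with clip.clip_id as a caption."""
    # Drop empty or whitespace-only tags and join the rest into one string.
    tag_text = ", ".join(t.strip() for t in clip.tags if t.strip())
    return {task: tmpl.format(tags=tag_text) for task, tmpl in TASK_TEMPLATES.items()}


if __name__ == "__main__":
    clip = Clip("clip_0001", ["rock", "electric guitar", " ", "energetic"])
    prompts = build_caption_prompts(clip)
    # Prints: Write a song description sentence including the following attributes: rock, electric guitar, energetic
    print(prompts["writing"])
```

Generating several prompts per clip is one plausible way a dataset ends up with more captions than audio clips, as with the roughly 2.2M captions over 0.5M clips reported above; the exact task set and prompt design should be taken from the paper itself.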
