LP-MusicCaps: LLM-Based Pseudo Music Captioning

Abstract

Automatic music captioning, which generates natural language descriptions for given music tracks, holds significant potential for enhancing the understanding and organization of large volumes of musical data. Despite its importance, researchers face a data scarcity problem: existing music-language datasets are costly and time-consuming to collect, and they are limited in size. To address this issue, we propose using large language models (LLMs) to artificially generate descriptive sentences from large-scale tag datasets. This results in approximately 2.2M captions paired with 0.5M audio clips. We call it the Large Language Model based Pseudo music caption dataset, or LP-MusicCaps for short. We conduct a systematic evaluation of this large-scale music captioning dataset using various quantitative evaluation metrics from the field of natural language processing as well as human evaluation. In addition, we train a transformer-based music captioning model on the dataset and evaluate it under zero-shot and transfer-learning settings. The results demonstrate that our proposed approach outperforms the supervised baseline model.
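To make the tag-to-caption idea concrete, the following is a minimal sketch of how a tag list might be turned into an LLM prompt and then into a pseudo caption. It is an illustration under stated assumptions, not the authors' pipeline: the prompt wording, the function names `build_caption_prompt` and `pseudo_caption`, and the stand-in generator `echo_llm` are all hypothetical, and no real LLM is called.

```python
from typing import Callable, List


def build_caption_prompt(tags: List[str]) -> str:
    """Format a tag list into an instruction prompt (hypothetical wording)."""
    # Normalize tags: trim whitespace, lowercase, and drop empty entries.
    cleaned = [t.strip().lower() for t in tags if t.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty tag is required")
    tag_text = ", ".join(cleaned)
    return (
        "Write a one-sentence description of a music track "
        f"that has the following tags: {tag_text}."
    )


def pseudo_caption(tags: List[str], generate: Callable[[str], str]) -> str:
    """Produce a pseudo caption by passing the prompt to any text generator."""
    return generate(build_caption_prompt(tags)).strip()


def echo_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): reuses the prompt's tag text."""
    tag_text = prompt.split(": ", 1)[1].rstrip(".")
    return f"A track featuring {tag_text}. "


if __name__ == "__main__":
    print(pseudo_caption(["Piano", " calm ", "instrumental"], echo_llm))
```

Passing the generator as a parameter keeps the sketch independent of any particular LLM client; a real model call could be swapped in for `echo_llm` without changing the prompt-building step.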
