LP-MusicCaps: LLM-Based Pseudo Music Captioning

Abstract

Automatic music captioning, which generates natural language descriptions of given music tracks, holds significant potential for improving the understanding and organization of large volumes of musical data. Despite its importance, researchers face challenges because existing music-language datasets are limited in size and are costly and time-consuming to collect. To address this data scarcity, we propose using large language models (LLMs) to artificially generate description sentences from large-scale tag datasets. This yields approximately 2.2M captions paired with 0.5M audio clips. We call the result the Large Language Model based Pseudo music caption dataset, or LP-MusicCaps for short. We conduct a systematic evaluation of this large-scale music captioning dataset using various quantitative evaluation metrics from natural language processing as well as human evaluation. In addition, we train a transformer-based music captioning model on the dataset and evaluate it in zero-shot and transfer-learning settings. The results show that our proposed approach outperforms the supervised baseline model.

