SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning

Abstract

Detailed captions that accurately reflect the characteristics of a music piece can enrich music databases and advance research in music AI. This paper introduces SonicVerse, a multi-task music captioning model that integrates caption generation with auxiliary music feature detection tasks, such as key detection and vocals detection, so as to directly capture both low-level acoustic details and high-level musical attributes. The key contribution is a projection-based architecture that transforms audio input into language tokens while simultaneously detecting music features through dedicated auxiliary heads. The outputs of these heads are also projected into language tokens to enhance the captioning input. This framework not only produces rich, descriptive captions for short music fragments but also enables the generation of detailed, time-informed descriptions for longer music pieces by chaining the outputs using a large language model. To train the model, we extended the MusicBench dataset by annotating it with music features using MIRFLEX, a modular music feature extractor, resulting in paired audio, caption, and music feature data. Experimental results show that incorporating features in this way improves the quality and detail of the generated captions.
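The data flow of the projection-based architecture can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `MultiTaskProjector`, the purely linear projections, the random weights, the single token per auxiliary head, and all dimensions (including 24 key classes and 2 vocals classes) are assumptions chosen only to show how content tokens and feature tokens might be combined before reaching the language model.

```python
import random


def make_matrix(rows, cols, rng):
    """Random (rows x cols) weights; a stand-in for a learned projection."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def split_tokens(flat, token_dim):
    """Reshape a flat vector into a list of token embeddings."""
    return [flat[i:i + token_dim] for i in range(0, len(flat), token_dim)]


class MultiTaskProjector:
    """Hypothetical sketch: audio embedding -> content tokens + feature tokens."""

    def __init__(self, audio_dim, token_dim, n_content_tokens, head_sizes, seed=0):
        rng = random.Random(seed)
        self.token_dim = token_dim
        # Main path: audio embedding -> n_content_tokens language-space tokens.
        self.content_proj = make_matrix(n_content_tokens * token_dim, audio_dim, rng)
        # One auxiliary head per music feature (assumed: linear classifier).
        self.heads = {name: make_matrix(n_cls, audio_dim, rng)
                      for name, n_cls in head_sizes.items()}
        # Each head's output is projected to one language-space token (assumed).
        self.head_proj = {name: make_matrix(token_dim, n_cls, rng)
                          for name, n_cls in head_sizes.items()}

    def forward(self, audio_embedding):
        content_tokens = split_tokens(
            matvec(self.content_proj, audio_embedding), self.token_dim)
        # Auxiliary predictions; in training these would receive their own losses.
        feature_logits = {name: matvec(w, audio_embedding)
                          for name, w in self.heads.items()}
        feature_tokens = [matvec(self.head_proj[name], logits)
                          for name, logits in feature_logits.items()]
        # The language model would consume content and feature tokens together.
        return content_tokens + feature_tokens, feature_logits


model = MultiTaskProjector(audio_dim=8, token_dim=4, n_content_tokens=3,
                           head_sizes={"key": 24, "vocals": 2})
tokens, logits = model.forward([0.5] * 8)
print(len(tokens), len(tokens[0]), sorted(logits))  # 5 4 ['key', 'vocals']
```

In this sketch the three content tokens and the two feature tokens form a single five-token sequence, which is the sense in which the head outputs "enhance the captioning input"; the returned logits are what a multi-task objective would supervise alongside the caption loss.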
