SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning
June 18, 2025
Authors: Anuradha Chopra, Abhinaba Roy, Dorien Herremans
cs.AI
Abstract
Detailed captions that accurately reflect the characteristics of a music
piece can enrich music databases and drive forward research in music AI. This
paper introduces a multi-task music captioning model, SonicVerse, which
integrates caption generation with auxiliary music feature detection tasks,
such as key detection and vocals detection, in order to capture both
low-level acoustic details and high-level musical attributes. The key
contribution is a projection-based architecture that transforms audio input
into language tokens while simultaneously detecting music features through
dedicated auxiliary heads. The outputs of these heads are also projected into
language tokens to enhance the captioning input. This framework not only
produces rich, descriptive captions for short music fragments but also
enables the generation of detailed, time-informed descriptions for longer
music pieces by chaining the outputs with a large language model. To train
the model, we extended the MusicBench dataset by annotating it with music
features using MIRFLEX, a modular music feature extractor, resulting in
paired audio, captions, and music feature data. Experimental results show
that incorporating features in this way improves the quality and detail of
the generated captions.
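The abstract's core architectural idea, projecting audio into language tokens while auxiliary heads detect features whose outputs are also projected into tokens, can be illustrated with a minimal sketch. The dimensions, weight matrices, and head choices below are hypothetical placeholders for illustration, not the paper's actual configuration:

```python
import numpy as np

# Hypothetical dimensions -- chosen for illustration only, not from the paper.
AUDIO_DIM, TOKEN_DIM, N_AUDIO_TOKENS = 32, 16, 4
N_KEYS = 24  # e.g. 12 pitch classes x major/minor for a key-detection head
rng = np.random.default_rng(0)

# Projection: maps a pooled audio embedding to a short sequence of
# language-token embeddings consumed by the caption generator.
W_proj = rng.normal(size=(AUDIO_DIM, N_AUDIO_TOKENS * TOKEN_DIM))

# Auxiliary heads: key detection (multi-class) and vocals detection (binary).
W_key = rng.normal(size=(AUDIO_DIM, N_KEYS))
W_vocals = rng.normal(size=(AUDIO_DIM, 1))

# Each head's output is itself projected into one extra feature token.
W_key_tok = rng.normal(size=(N_KEYS, TOKEN_DIM))
W_voc_tok = rng.normal(size=(1, TOKEN_DIM))

def encode(audio_embedding: np.ndarray) -> np.ndarray:
    """Return the token sequence fed to the language model."""
    audio_tokens = (audio_embedding @ W_proj).reshape(N_AUDIO_TOKENS, TOKEN_DIM)
    key_logits = audio_embedding @ W_key      # auxiliary task output
    vocals_logit = audio_embedding @ W_vocals
    key_token = (key_logits @ W_key_tok).reshape(1, TOKEN_DIM)
    vocals_token = (vocals_logit @ W_voc_tok).reshape(1, TOKEN_DIM)
    # Feature tokens are appended to the audio tokens, enriching the
    # captioning input with explicitly detected musical attributes.
    return np.concatenate([audio_tokens, key_token, vocals_token], axis=0)

tokens = encode(rng.normal(size=AUDIO_DIM))
print(tokens.shape)  # (6, 16): 4 audio tokens + 2 feature tokens
```

In a real system the linear maps would be learned jointly, with the auxiliary heads supervised by the MIRFLEX annotations and the caption loss backpropagated through the shared projection.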