SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning
June 18, 2025
Authors: Anuradha Chopra, Abhinaba Roy, Dorien Herremans
cs.AI
Abstract
Detailed captions that accurately reflect the characteristics of a music
piece can enrich music databases and drive forward research in music AI. This
paper introduces SonicVerse, a multi-task music captioning model that
integrates caption generation with auxiliary music feature detection tasks,
such as key detection and vocals detection, so as to directly capture both
low-level acoustic details and high-level musical attributes. The key
contribution is a projection-based architecture that transforms audio input
into language tokens while simultaneously detecting music features through
dedicated auxiliary heads. The outputs of these heads are also projected into
language tokens to enrich the captioning input. This framework not only
produces rich, descriptive captions for short music fragments but also
enables detailed, time-informed descriptions of longer music pieces by
chaining the outputs with a large language model. To train the model, we
extended the MusicBench dataset by annotating it with music features using
MIRFLEX, a modular music feature extractor, resulting in paired audio,
caption, and music feature data. Experimental results show that incorporating
features in this way improves the quality and detail of the generated captions.
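The projection-based design described above can be sketched schematically. The following is a minimal illustration, not the paper's implementation: all dimensions, head names, and the simple linear projections are assumptions chosen only to show how audio tokens and auxiliary-head outputs could be mapped into a shared language-token space and concatenated for a caption decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): audio-encoder embedding dim,
# language-token embedding dim, and number of audio tokens per fragment.
AUDIO_DIM, TOKEN_DIM, N_AUDIO_TOKENS = 512, 768, 32

# Toy audio embedding for one short music fragment (stand-in for the output
# of an audio encoder).
audio_emb = rng.standard_normal((N_AUDIO_TOKENS, AUDIO_DIM))

# Projection of audio embeddings into the language-token space.
W_audio = rng.standard_normal((AUDIO_DIM, TOKEN_DIM)) * 0.02
audio_tokens = audio_emb @ W_audio

# Auxiliary heads: each predicts one music feature (key, vocals, ...) from a
# pooled audio embedding. Output sizes here are illustrative.
pooled = audio_emb.mean(axis=0)
heads = {
    "key": rng.standard_normal((AUDIO_DIM, 24)) * 0.02,    # e.g. 24 keys
    "vocals": rng.standard_normal((AUDIO_DIM, 2)) * 0.02,  # present/absent
}

# Each head's output is itself projected into a language token, so the
# detected features can be fed to the captioner alongside the audio tokens.
feature_tokens = []
for name, W_head in heads.items():
    logits = pooled @ W_head
    W_proj = rng.standard_normal((W_head.shape[1], TOKEN_DIM)) * 0.02
    feature_tokens.append(logits @ W_proj)

# The caption decoder would receive the audio tokens plus one token per
# auxiliary head: 32 audio tokens + 2 feature tokens here.
decoder_input = np.vstack([audio_tokens] + feature_tokens)
print(decoder_input.shape)  # (34, 768)
```

In the actual model the projections would be learned jointly with the captioning objective and the auxiliary detection losses; the fixed random matrices here only stand in for those learned mappings.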