SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning
June 18, 2025
Authors: Anuradha Chopra, Abhinaba Roy, Dorien Herremans
cs.AI
Abstract
Detailed captions that accurately reflect the characteristics of a music
piece can enrich music databases and drive forward research in music AI. This
paper introduces SonicVerse, a multi-task music captioning model that
integrates caption generation with auxiliary music feature detection tasks,
such as key detection and vocals detection, so as to directly capture both
low-level acoustic details and high-level musical attributes. The key
contribution is a projection-based architecture that transforms audio input
into language tokens while simultaneously detecting music features through
dedicated auxiliary heads. The outputs of these heads are also projected into
language tokens to enrich the captioning input. This framework not only
produces rich, descriptive captions for short music fragments but also
enables detailed, time-informed descriptions of longer music pieces by
chaining the outputs with a large language model. To train the model, we
extended the MusicBench dataset by annotating it with music features using
MIRFLEX, a modular music feature extractor, resulting in paired audio,
caption, and music feature data. Experimental results show that incorporating
features in this way improves the quality and detail of the generated captions.
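The projection-based design described above can be sketched schematically. The following is a minimal illustration, not the paper's implementation: all dimensions, head names, and the simple linear projections are assumptions chosen only to show how audio tokens and auxiliary-head outputs could be mapped into a shared language-token space and concatenated for a caption decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): audio-encoder embedding dim,
# language-token embedding dim, and number of audio tokens per fragment.
AUDIO_DIM, TOKEN_DIM, N_AUDIO_TOKENS = 512, 768, 32

# Toy audio embedding for one short music fragment (stand-in for the output
# of an audio encoder).
audio_emb = rng.standard_normal((N_AUDIO_TOKENS, AUDIO_DIM))

# Projection of audio embeddings into the language-token space.
W_audio = rng.standard_normal((AUDIO_DIM, TOKEN_DIM)) * 0.02
audio_tokens = audio_emb @ W_audio

# Auxiliary heads: each predicts one music feature (key, vocals, ...) from a
# pooled audio embedding. Output sizes here are illustrative.
pooled = audio_emb.mean(axis=0)
heads = {
    "key": rng.standard_normal((AUDIO_DIM, 24)) * 0.02,    # e.g. 24 keys
    "vocals": rng.standard_normal((AUDIO_DIM, 2)) * 0.02,  # present/absent
}

# Each head's output is itself projected into a language token, so the
# detected features can be fed to the captioner alongside the audio tokens.
feature_tokens = []
for name, W_head in heads.items():
    logits = pooled @ W_head
    W_proj = rng.standard_normal((W_head.shape[1], TOKEN_DIM)) * 0.02
    feature_tokens.append(logits @ W_proj)

# The caption decoder would receive the audio tokens plus one token per
# auxiliary head: 32 audio tokens + 2 feature tokens here.
decoder_input = np.vstack([audio_tokens] + feature_tokens)
print(decoder_input.shape)  # (34, 768)
```

In the actual model the projections would be learned jointly with the captioning objective and the auxiliary detection losses; the fixed random matrices here only stand in for those learned mappings.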