SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning

Abstract

Detailed captions that accurately reflect the characteristics of a music piece can enrich music databases and advance research in music AI. This paper introduces SonicVerse, a multi-task music captioning model that integrates caption generation with auxiliary music feature detection tasks, such as key detection and vocals detection, so that it directly captures both low-level acoustic details and high-level musical attributes. The key contribution is a projection-based architecture that transforms audio input into language tokens while simultaneously detecting music features through dedicated auxiliary heads. The outputs of these heads are also projected into language tokens to enhance the captioning input. The framework not only produces rich, descriptive captions for short music fragments but also directly enables detailed, time-informed descriptions of longer music pieces by chaining the outputs with a large language model. To train the model, we extended the MusicBench dataset by annotating it with music features using MIRFLEX, a modular music feature extractor, yielding paired audio, caption, and music feature data. Experimental results show that incorporating features in this way improves the quality and detail of the generated captions.
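The projection-based design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, the two head names (`key`, `vocals`), their class counts, and the random weights are all assumptions chosen for illustration, and plain Python lists stand in for tensors. It shows only the data flow: an audio embedding is projected into language tokens, auxiliary heads produce feature logits from the same embedding, and each head's output is projected into one additional token.

```python
import random

random.seed(0)

# Toy sizes; all values are assumptions, not taken from the paper.
D_AUDIO, D_MODEL, N_AUDIO_TOKENS = 8, 4, 3
HEAD_SIZES = {"key": 24, "vocals": 2}  # hypothetical heads and class counts


def make_matrix(rows, cols):
    """Random weight matrix standing in for trained parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


# Main projector: one matrix per language token derived from the audio embedding.
audio_projector = [make_matrix(D_MODEL, D_AUDIO) for _ in range(N_AUDIO_TOKENS)]
# Auxiliary heads: audio embedding -> music feature logits.
heads = {name: make_matrix(n, D_AUDIO) for name, n in HEAD_SIZES.items()}
# Per-head projectors: feature logits -> one language token.
head_projectors = {name: make_matrix(D_MODEL, n) for name, n in HEAD_SIZES.items()}


def encode(audio_embedding):
    """Return (token sequence for the language model, auxiliary logits)."""
    tokens = [linear(w, audio_embedding) for w in audio_projector]
    logits = {}
    for name, head in heads.items():
        logits[name] = linear(head, audio_embedding)
        # The head output is itself projected into the language token space.
        tokens.append(linear(head_projectors[name], logits[name]))
    return tokens, logits


embedding = [random.uniform(-1, 1) for _ in range(D_AUDIO)]
tokens, logits = encode(embedding)
print(len(tokens), len(tokens[0]))  # 3 audio tokens + 2 feature tokens, width 4
```

In a multi-task setup of this kind, the auxiliary logits would typically be trained against feature labels alongside the captioning loss, while the combined token sequence is passed to the language model as input. How SonicVerse weights or schedules these losses is not stated in the abstract.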
