A Multimodal Music Recommendation System Using Large Language Models

Abstract

Music recommendation systems typically treat songs as opaque tokens and rely on collaborative interaction histories, overlooking semantic and acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation. While some methods partially combine semantic, acoustic, or engagement signals, none jointly models all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In this work, we propose a multimodal framework for session-based music recommendation that enriches the LastFM-1K dataset with three complementary signals: (1) audio and lyric embeddings extracted using pretrained music and text representation models, (2) LLM-generated semantic metadata following the MGPHot annotation schema, and (3) listening completion ratios. We build on the E4SRec framework, extending it with multimodal features and with different item ID encoder backbones, including SASRec, BERT4Rec, and GRU4Rec. We further extend the LLM backbone options with LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B, in both zero-shot and fine-tuned settings. Our experiments show that integrating content-based features improves over ID-only baselines by up to 95% in Recall and up to 79% in NDCG. They also show that naive multimodal fusion does not always yield additive improvements, highlighting challenges in cross-modal integration. We release a large-scale multimodal benchmark for music recommendation.