Multimodal Music Recommendation System using LLMs
May 28, 2026
Authors: Srikar Prabhas Kandagatla, Sreehitha R. Narayana, Chandana Magapu, Swetha Mohan, Shamanth Kuthpadi, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Nesreen Ahmed
cs.AI
Abstract
Music recommendation systems typically treat songs as opaque tokens and rely on collaborative interaction histories, overlooking semantic and acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation. Some of these methods partially combine semantic, acoustic, or engagement signals, but none jointly models all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In this work, we propose a multimodal framework for session-based music recommendation that enriches the LastFM-1K dataset with three complementary signals: (1) audio and lyric embeddings extracted using pretrained music and text representation models, (2) LLM-generated semantic metadata following the MGPHot annotation schema, and (3) listening completion ratios. We adapt the E4SRec framework, extending it with multimodal features and with different item ID encoder backbones, including SASRec, BERT4Rec, and GRU4Rec. We further extend the choice of LLM backbone to LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B, and evaluate in both zero-shot and fine-tuned settings. Our experiments show that integrating content-based features improves over ID-only baselines by up to 95% in Recall and by up to 79% in NDCG. They also show that naive multimodal fusion does not always yield additive improvements, highlighting the challenges of cross-modal integration. We release a large-scale multimodal benchmark for music recommendation.
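The abstract reports gains in Recall and NDCG without stating the cutoff or the evaluation protocol. As a minimal sketch, assuming the common next-item setup in which each session has a single held-out target track, the two metrics and a relative improvement could be computed as follows. The function names, the track IDs, and the numbers in the usage lines are hypothetical and are not taken from the paper.

```python
import math


def recall_at_k(ranked_items, target, k):
    """Return 1.0 if the held-out target is among the top-k items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """NDCG@k with one relevant item.

    The ideal DCG is 1 (target at rank 1), so NDCG reduces to
    1 / log2(rank + 1) when the target is in the top-k, and 0 otherwise.
    """
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position of the target
    return 1.0 / math.log2(rank + 1)


def relative_gain_pct(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Hypothetical ranking for one session; "s1" is the held-out next track.
ranked = ["s3", "s7", "s1", "s9"]
hit = recall_at_k(ranked, "s1", k=3)   # target is at rank 3, inside the top 3
gain = ndcg_at_k(ranked, "s1", k=3)    # 1 / log2(4)

# Made-up metric values, only to illustrate how an "up to X%" gain is derived.
pct = relative_gain_pct(0.10, 0.15)    # roughly 50
```

In practice both metrics would be averaged over all test sessions before the relative gain is taken. A figure such as "up to 95%" is presumably a relative gain of this kind rather than a difference in percentage points, though the abstract does not say so explicitly.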