Multimodal Music Recommendation System using LLMs
May 28, 2026
Authors: Srikar Prabhas Kandagatla, Sreehitha R. Narayana, Chandana Magapu, Swetha Mohan, Shamanth Kuthpadi, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Nesreen Ahmed
cs.AI
Abstract
Music recommendation systems typically treat songs as opaque tokens and rely on collaborative interaction histories, overlooking semantic and acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation. While some methods partially combine semantic, acoustic, or engagement signals, none jointly models all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In this work, we propose a multimodal framework for session-based music recommendation that enriches the LastFM-1K dataset with three complementary signals: (1) audio and lyric embeddings extracted using pretrained music and text representation models, (2) LLM-generated semantic metadata following the MGPHot annotation schema, and (3) listening completion ratios. We adopt the E4SRec framework and extend it with multimodal features and different item ID encoder backbones, including SASRec, BERT4Rec, and GRU4Rec. We further extend the LLM backbone options with LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B in both zero-shot and fine-tuned settings. Our experiments show that integrating content-based features improves over ID-only baselines by up to 95% in Recall and up to 79% in NDCG. Our experiments also show that naive multimodal fusion does not always yield additive improvements, highlighting challenges in cross-modal integration. We release a large-scale multimodal benchmark for music recommendation.
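To make the reported metrics concrete, the following is a minimal sketch of how Recall and NDCG are commonly computed for next-item recommendation with a single held-out target, along with a relative improvement over a baseline. The abstract does not state the cutoff K or the exact evaluation protocol, so the single-target setup, the cutoff, and the function names `recall_at_k`, `ndcg_at_k`, and `relative_gain` are illustrative assumptions rather than the authors' implementation.

```python
import math
from typing import Sequence


def recall_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """1.0 if the held-out target appears in the top-k list, else 0.0.

    Assumes one ground-truth item per session (an assumption, not
    something stated in the abstract).
    """
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """NDCG with a single relevant item.

    The ideal DCG is 1, so NDCG reduces to 1 / log2(rank + 2),
    where rank is the 0-based position of the target in the top-k list.
    """
    top = list(ranked[:k])
    if target not in top:
        return 0.0
    rank = top.index(target)
    return 1.0 / math.log2(rank + 2)


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, as a fraction."""
    return (new - base) / base


if __name__ == "__main__":
    # Toy ranking with made-up song IDs; these are not results from the paper.
    ranked = ["song_a", "song_b", "song_c"]
    print(recall_at_k(ranked, "song_b", k=2))
    print(ndcg_at_k(ranked, "song_b", k=2))
    print(relative_gain(0.2, 0.1))
```

Under this reading, a statement such as "up to 95% in Recall" corresponds to `relative_gain` returning about 0.95 for the best content-enriched configuration against its ID-only baseline.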