Multimodal Music Recommendation System Using LLMs

Abstract

Music recommendation systems typically treat songs as opaque tokens and rely on collaborative interaction histories, overlooking semantic and acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation. Although some methods partially combine semantic, acoustic, or engagement signals, none jointly models all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In this work, we propose a multimodal framework for session-based music recommendation that enriches the LastFM-1K dataset with three complementary signals: (1) audio and lyric embeddings extracted using pretrained music and text representation models, (2) LLM-generated semantic metadata following the MGPHot annotation schema, and (3) listening completion ratios. We adopt the E4SRec framework and extend it with multimodal features and multiple item ID encoder backbones, including SASRec, BERT4Rec, and GRU4Rec. We further broaden the LLM backbone options to LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B, in both zero-shot and fine-tuned settings. Our experiments show that integrating content-based features improves over ID-only baselines by up to 95% in Recall and by up to 79% in NDCG. They also show that naive multimodal fusion does not always yield additive improvements, highlighting the challenges of cross-modal integration. We release a large-scale multimodal benchmark for music recommendation.
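
To make the reported metrics concrete, the following is a minimal sketch of how Recall@K, NDCG@K, and a relative improvement over a baseline are commonly computed for next-item prediction. It assumes a single held-out target item per session; the abstract does not state the cutoff K or the exact evaluation protocol, so the function names, the toy ranking, and the toy scores below are illustrative assumptions, not the authors' code or results.

```python
import math
from typing import Sequence


def recall_at_k(ranked: Sequence[int], target: int, k: int) -> float:
    """Return 1.0 if the held-out item is in the top-k of the ranking, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[int], target: int, k: int) -> float:
    """NDCG@k with one relevant item.

    The ideal DCG is 1 (target at rank 1), so NDCG reduces to
    1 / log2(rank + 1) when the target is in the top-k, else 0.
    """
    for rank, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


if __name__ == "__main__":
    # Hypothetical ranking of item IDs produced by some model for one session.
    ranked_items = [7, 3, 9, 1]
    held_out = 9  # the item the user actually listened to next (toy value)

    print(recall_at_k(ranked_items, held_out, k=3))  # target is at rank 3
    print(ndcg_at_k(ranked_items, held_out, k=3))    # 1 / log2(3 + 1)

    # Toy scores only: a metric rising from 0.2 to 0.4 is a 100% relative gain.
    print(relative_gain(0.4, 0.2))
```

In practice these per-session values are averaged over all test sessions, and an "up to 95%" improvement would refer to the largest relative gain observed across the compared configurations.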