Do LLMs Understand User Preferences? Evaluating LLMs On User Rating Prediction

Abstract

Large Language Models (LLMs) have demonstrated exceptional capabilities in generalizing to new tasks in a zero-shot or few-shot manner. However, the extent to which LLMs can comprehend user preferences based on previous user behavior remains an emerging and still unclear research question. Traditionally, Collaborative Filtering (CF) has been the most effective method for these tasks, relying predominantly on large volumes of rating data. In contrast, LLMs typically demand considerably less data while maintaining extensive world knowledge about each item, such as movies or products. In this paper, we conduct a thorough examination of both CF and LLMs on the classic task of user rating prediction, which involves predicting a user's rating for a candidate item based on their past ratings. We investigate LLMs of different sizes, ranging from 250M to 540B parameters, and evaluate their performance in zero-shot, few-shot, and fine-tuning scenarios. We conduct a comprehensive analysis comparing LLMs with strong CF methods and find that zero-shot LLMs lag behind traditional recommender models that have access to user interaction data, indicating the importance of user interaction data. However, through fine-tuning, LLMs achieve comparable or even better performance with only a small fraction of the training data, demonstrating their potential through data efficiency.

