Do LLMs Understand User Preferences? Evaluating LLMs On User Rating Prediction
May 10, 2023
Authors: Wang-Cheng Kang, Jianmo Ni, Nikhil Mehta, Maheswaran Sathiamoorthy, Lichan Hong, Ed Chi, Derek Zhiyuan Cheng
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated exceptional capabilities in
generalizing to new tasks in a zero-shot or few-shot manner. However, the
extent to which LLMs can comprehend user preferences based on their previous
behavior remains an emerging and still unclear research question.
Traditionally, Collaborative Filtering (CF) has been the most effective method
for these tasks, predominantly relying on the extensive volume of rating data.
In contrast, LLMs typically demand considerably less data while maintaining an
exhaustive world knowledge about each item, such as movies or products. In this
paper, we conduct a thorough examination of both CF and LLMs within the classic
task of user rating prediction, which involves predicting a user's rating for a
candidate item based on their past ratings. We investigate various LLMs of
different sizes, ranging from 250M to 540B parameters, and evaluate their
performance in zero-shot, few-shot, and fine-tuning scenarios. We conduct a
comprehensive analysis comparing LLMs with strong CF methods, and find that
zero-shot LLMs lag behind traditional recommender models that have access to
user interaction data, indicating the importance of such data. However, through
fine-tuning, LLMs achieve comparable or even better performance with only a
small fraction of the training data, demonstrating their potential through
data efficiency.
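To make the zero-shot setup concrete, the sketch below shows how a user's rating history could be serialized into a text prompt for an LLM, together with the RMSE metric standard for rating prediction. The prompt template and item names are illustrative assumptions, not the paper's exact format, and the LLM call itself is omitted.

```python
import math


def build_zero_shot_prompt(past_ratings, candidate_item):
    """Serialize a user's rating history into a zero-shot prompt.

    past_ratings: list of (item_title, stars) pairs.
    NOTE: hypothetical template for illustration; the paper's actual
    prompt format may differ.
    """
    lines = ["Here is a user's movie rating history (1-5 stars):"]
    for title, stars in past_ratings:
        lines.append(f'- "{title}": {stars} stars')
    lines.append(
        f'Based on this history, predict the user\'s rating for '
        f'"{candidate_item}" as a number between 1 and 5:'
    )
    return "\n".join(lines)


def rmse(predictions, targets):
    """Root mean squared error over predicted vs. true ratings."""
    assert len(predictions) == len(targets) and predictions
    se = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return math.sqrt(se / len(predictions))


# Example: build a prompt from a toy history (item names are made up).
history = [("The Matrix", 5), ("Titanic", 3), ("Toy Story", 4)]
prompt = build_zero_shot_prompt(history, "Inception")
print(prompt)
```

In the few-shot variant, one would additionally prepend a handful of worked (history, candidate, rating) examples before the target user's history; the metric stays the same.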