Do LLMs Understand User Preferences? Evaluating LLMs On User Rating Prediction

May 10, 2023
Authors: Wang-Cheng Kang, Jianmo Ni, Nikhil Mehta, Maheswaran Sathiamoorthy, Lichan Hong, Ed Chi, Derek Zhiyuan Cheng
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated exceptional capabilities in generalizing to new tasks in a zero-shot or few-shot manner. However, the extent to which LLMs can comprehend user preferences based on their previous behavior remains an emerging and still unclear research question. Traditionally, Collaborative Filtering (CF) has been the most effective method for these tasks, predominantly relying on an extensive volume of rating data. In contrast, LLMs typically demand considerably less data while maintaining exhaustive world knowledge about each item, such as movies or products. In this paper, we conduct a thorough examination of both CF and LLMs on the classic task of user rating prediction, which involves predicting a user's rating for a candidate item based on their past ratings. We investigate LLMs of various sizes, ranging from 250M to 540B parameters, and evaluate their performance in zero-shot, few-shot, and fine-tuning scenarios. We conduct a comprehensive analysis comparing LLMs with strong CF methods, and find that zero-shot LLMs lag behind traditional recommender models that have access to user interaction data, underscoring the importance of such data. However, with fine-tuning, LLMs achieve comparable or even better performance using only a small fraction of the training data, demonstrating their potential through data efficiency.
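
For illustration, below is a minimal Python sketch of the zero-shot setup described above: a user's past ratings are formatted into a prompt that asks an LLM to predict the rating for a candidate item. The prompt wording and the build_rating_prompt helper are hypothetical examples, not the paper's actual template.

    # Hypothetical sketch of a zero-shot rating-prediction prompt; the exact
    # template used in the paper is not reproduced here.
    def build_rating_prompt(history, candidate_title, scale=(1, 5)):
        """history: list of (item_title, rating) pairs from the user's past ratings."""
        lines = [f'"{title}" was rated {rating} stars.' for title, rating in history]
        lines.append(
            f'Based on the ratings above, predict how many stars (an integer from '
            f'{scale[0]} to {scale[1]}) the user would give "{candidate_title}". '
            f'Answer with the number only.'
        )
        return "\n".join(lines)

    if __name__ == "__main__":
        # Example usage with a small movie-rating history.
        past = [("The Matrix", 5), ("Titanic", 3), ("Inception", 5)]
        print(build_rating_prompt(past, "Interstellar"))

The resulting prompt string would be sent to an LLM, and the model's numeric completion taken as the predicted rating; few-shot and fine-tuned variants differ only in whether example users or gradient updates are added on top of this formatting.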