FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users

February 26, 2025
Authors: Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn
cs.AI

Abstract

Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.
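To make the meta-learning framing concrete, below is a minimal, self-contained toy sketch of the core idea: a scorer is conditioned on a handful of a user's labeled preferences and trained with a standard pairwise (Bradley-Terry) preference loss across many synthetic users, so that a few examples from a new user yield a personalized reward signal. This is not the paper's implementation; ToyScorer, sample_user_batch, the synthetic data generator, and the use of random feature vectors in place of tokenized text are all illustrative assumptions.

```python
# Minimal toy sketch of few-shot, meta-learned preference optimization.
# Toy feature vectors stand in for tokenized text; ToyScorer stands in
# for an LLM-based reward model. Illustrative only, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 32        # toy "embedding" size for prompts/responses
FEWSHOT_K = 4   # labeled preference examples revealed per user


class ToyScorer(nn.Module):
    """Scores a response given a query and a pooled few-shot user context."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * DIM, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, user_ctx, query, response):
        return self.net(torch.cat([user_ctx, query, response], dim=-1)).squeeze(-1)


def sample_user_batch(batch_size=8):
    """Synthetic users: each has a latent taste vector; both the few-shot
    support examples and the chosen/rejected labels are generated from it."""
    taste = torch.randn(batch_size, DIM)
    support = taste.unsqueeze(1) + 0.1 * torch.randn(batch_size, FEWSHOT_K, DIM)
    query = torch.randn(batch_size, DIM)
    a, b = torch.randn(batch_size, DIM), torch.randn(batch_size, DIM)
    prefer_a = ((a - b) * taste).sum(-1, keepdim=True) > 0
    chosen = torch.where(prefer_a, a, b)
    rejected = torch.where(prefer_a, b, a)
    return support.mean(dim=1), query, chosen, rejected  # mean-pool the support set


scorer = ToyScorer()
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)
for step in range(500):
    ctx, q, chosen, rejected = sample_user_batch()
    margin = scorer(ctx, q, chosen) - scorer(ctx, q, rejected)
    loss = -F.logsigmoid(margin).mean()  # pairwise (Bradley-Terry) preference loss
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final training loss: {loss.item():.3f}")
```

In the paper, the conditioning happens in-context in the LLM itself and training uses over 1M synthetic personalized preferences; the toy version above only illustrates the "condition on a user's few labeled preferences, train across many users" structure of the objective.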
