Putting Humans Back at the Center of Large Language Model Personalization

Abstract

Despite growing interest, most evaluations of the personalization abilities of large language models (LLMs) have relied on synthetic data, and it remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance between synthetic and human data. We collect 550 human conversations along with human judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919 judgments), and incorporating relevant attributes into a personalized response (1,101 judgments). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with humans about which attributes are relevant, and generate personalized responses that humans judge to be no better than generic responses, even though LLM judges widely rate them as better. We introduce two lightweight training-based interventions that bring automated personalization evaluation closer to human data in the first two stages. In the third stage, however, we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned judgments of personalization quality are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.