
LikeBench: Evaluating Subjective Likability in LLMs for Personalization

December 15, 2025
Authors: Md Awsafur Rahman, Adam Gabrys, Doug Kang, Jingjing Sun, Tian Tan, Ashwin Chandramouli
cs.AI

Abstract

A personalized LLM should remember user facts, apply them correctly, and adapt over time to provide responses that the user prefers. Existing LLM personalization benchmarks are largely centered on two axes: accurately recalling user information and accurately applying remembered information in downstream tasks. We argue that a third axis, likability, is both subjective and central to user experience, yet under-measured by current benchmarks. To measure likability holistically, we introduce LikeBench, a multi-session, dynamic evaluation framework that assesses likability across multiple dimensions by how well an LLM adapts over time to a user's preferences to produce more likable responses. In LikeBench, an LLM engages in conversation with a simulated user and learns preferences only from the ongoing dialogue. As the interaction unfolds, the model tries to adapt its responses, and after each turn it is evaluated for likability across seven dimensions by the same simulated user. To the best of our knowledge, we are the first to decompose likability into multiple diagnostic metrics: emotional adaptation, formality matching, knowledge adaptation, reference understanding, conversation length fit, humor fit, and callback, which makes it easier to pinpoint where a model falls short. To make the simulated user more realistic and discriminative, LikeBench uses fine-grained, psychologically grounded descriptive personas rather than the coarse high/low trait-rating personas used in prior work. Our benchmark shows that strong memory performance does not guarantee high likability: DeepSeek R1, with lower memory accuracy (86%, 17 facts/profile), outperformed Qwen3 by 28% on likability despite Qwen3's higher memory accuracy (93%, 43 facts/profile). Even SOTA models like GPT-5 adapt well in short exchanges but show only limited robustness in longer, noisier interactions.
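The abstract describes the per-turn protocol only at a high level, and the paper's actual interfaces are not given here. As a rough illustration only, a minimal Python sketch of the evaluation loop might look like the following, where SimulatedUser, Assistant, next_message, respond, and rate are all hypothetical names rather than the benchmark's API:

```python
from dataclasses import dataclass
from typing import Protocol

# The seven likability dimensions named in the abstract.
DIMENSIONS = [
    "emotional_adaptation", "formality_matching", "knowledge_adaptation",
    "reference_understanding", "conversation_length_fit", "humor_fit", "callback",
]

class SimulatedUser(Protocol):
    # Hypothetical interface: a persona-driven user simulator that both
    # converses and judges; the paper does not publish an API.
    def next_message(self, history: list[tuple[str, str]]) -> str: ...
    def rate(self, history: list[tuple[str, str]], reply: str, dimension: str) -> float: ...

class Assistant(Protocol):
    # The LLM under evaluation; it sees only the ongoing dialogue.
    def respond(self, history: list[tuple[str, str]], user_msg: str) -> str: ...

@dataclass
class TurnScore:
    """Per-turn likability ratings from the simulated user, one per dimension."""
    ratings: dict[str, float]

def run_session(model: Assistant, user: SimulatedUser, num_turns: int) -> list[TurnScore]:
    """One conversation session: the model learns preferences only from the
    dialogue and is rated on all seven dimensions after every turn."""
    history: list[tuple[str, str]] = []
    scores: list[TurnScore] = []
    for _ in range(num_turns):
        user_msg = user.next_message(history)      # persona-driven user turn
        reply = model.respond(history, user_msg)   # model adapts from dialogue alone
        history.append((user_msg, reply))
        scores.append(TurnScore({d: user.rate(history, reply, d) for d in DIMENSIONS}))
    return scores
```

An overall likability score could then aggregate the per-dimension ratings across turns and sessions, for example by averaging; the paper's actual aggregation is not specified in the abstract.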