PersonaFeedback: A Large-scale Human-annotated Benchmark for Personalization

Abstract

With the rapid improvement in the general capabilities of LLMs, LLM personalization, i.e., how to build LLM systems that generate responses or services tailored to distinct user personas, has become an increasingly important research and engineering problem. However, while many challenging new benchmarks are being released for evaluating general and reasoning capabilities, the lack of high-quality benchmarks for evaluating LLM personalization greatly hinders progress in this field. To address this, we introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses given pre-defined user personas and queries. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization, focusing on evaluating the model's ability to generate responses tailored to explicit personas. PersonaFeedback consists of 8298 human-annotated test cases, which are categorized into easy, medium, and hard tiers based on the contextual complexity of the user personas and the difficulty of distinguishing subtle differences between two personalized responses. We conduct comprehensive evaluations across a wide range of models. The empirical results reveal that even state-of-the-art LLMs that can solve complex real-world reasoning tasks can fall short on the hard tier of PersonaFeedback, where even human evaluators may find the distinctions challenging. Furthermore, we conduct an in-depth analysis of failure modes across various types of systems, demonstrating that the current retrieval-augmented framework should not be seen as a de facto solution for personalization tasks. All benchmark data, annotation protocols, and the evaluation pipeline will be made publicly available to facilitate future research on LLM personalization.