
PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization

June 15, 2025
Authors: Meiling Tao, Chenghao Zhu, Dongyi Ding, Tiannan Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou
cs.AI

Abstract

With the rapid improvement in the general capabilities of LLMs, LLM personalization, i.e., how to build LLM systems that can generate personalized responses or services that are tailored to distinct user personas, has become an increasingly important research and engineering problem. However, unlike many new challenging benchmarks being released for evaluating the general/reasoning capabilities, the lack of high-quality benchmarks for evaluating LLM personalization greatly hinders progress in this field. To address this, we introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses given pre-defined user personas and queries. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization, focusing on evaluating the model's ability to generate responses tailored to explicit personas. PersonaFeedback consists of 8298 human-annotated test cases, which are categorized into easy, medium, and hard tiers based on the contextual complexity of the user personas and the difficulty in distinguishing subtle differences between two personalized responses. We conduct comprehensive evaluations across a wide range of models. The empirical results reveal that even state-of-the-art LLMs that can solve complex real-world reasoning tasks could fall short on the hard tier of PersonaFeedback where even human evaluators may find the distinctions challenging. Furthermore, we conduct an in-depth analysis of failure modes across various types of systems, demonstrating that the current retrieval-augmented framework should not be seen as a de facto solution for personalization tasks. All benchmark data, annotation protocols, and the evaluation pipeline will be publicly available to facilitate future research on LLM personalization.
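The abstract describes each test case as pairing an explicit user persona and query with two personalized responses whose quality differs subtly, graded into easy, medium, and hard tiers. As a rough illustration only, the sketch below models that structure and a per-tier accuracy loop in Python; the field names (`persona`, `response_a`, `better`, `tier`) and the `choose` callback are assumptions for illustration, not the released data format or the paper's official evaluation pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical record layout inferred from the abstract: each test case pairs
# an explicit user persona and query with two candidate responses, one of
# which is the better-personalized answer. Field names are illustrative only.
@dataclass
class PersonaFeedbackCase:
    persona: str      # pre-defined user persona (explicit, not inferred)
    query: str        # user query to be answered
    response_a: str   # candidate response A
    response_b: str   # candidate response B
    better: str       # gold label: "A" or "B"
    tier: str         # "easy", "medium", or "hard"


def evaluate(cases: List[PersonaFeedbackCase],
             choose: Callable[[str, str, str, str], str]) -> Dict[str, float]:
    """Compute per-tier accuracy for a model's choice function.

    `choose(persona, query, response_a, response_b)` should return "A" or "B",
    indicating which response the model judges to be better personalized.
    """
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for case in cases:
        pred = choose(case.persona, case.query, case.response_a, case.response_b)
        total[case.tier] = total.get(case.tier, 0) + 1
        if pred == case.better:
            correct[case.tier] = correct.get(case.tier, 0) + 1
    return {tier: correct.get(tier, 0) / n for tier, n in total.items()}


if __name__ == "__main__":
    # Toy example with a trivial baseline that always picks response A.
    demo = [
        PersonaFeedbackCase(
            persona="A vegetarian marathon runner who travels frequently.",
            query="Suggest a quick dinner for tonight.",
            response_a="A high-protein lentil bowl you can prep in 15 minutes.",
            response_b="A grilled steak with fries.",
            better="A",
            tier="easy",
        ),
    ]
    print(evaluate(demo, lambda p, q, a, b: "A"))  # e.g. {'easy': 1.0}
```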