PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization

Abstract

As the general capabilities of LLMs rapidly improve, LLM personalization, i.e., building LLM systems that generate responses or services tailored to distinct user personas, has become an increasingly important research and engineering problem. However, while many challenging new benchmarks are being released to evaluate general and reasoning capabilities, the lack of high-quality benchmarks for LLM personalization greatly hinders progress in this field. To address this, we introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses given pre-defined user personas and queries. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization and focuses on evaluating the model's ability to generate responses tailored to explicit personas. PersonaFeedback consists of 8,298 human-annotated test cases, categorized into easy, medium, and hard tiers based on the contextual complexity of the user personas and the difficulty of distinguishing subtle differences between two personalized responses. We conduct comprehensive evaluations across a wide range of models. The empirical results reveal that even state-of-the-art LLMs that can solve complex real-world reasoning tasks can fall short on the hard tier of PersonaFeedback, where even human evaluators may find the distinctions challenging. Furthermore, we conduct an in-depth analysis of failure modes across various types of systems, demonstrating that the current retrieval-augmented framework should not be seen as a de facto solution for personalization tasks. All benchmark data, annotation protocols, and the evaluation pipeline will be publicly available to facilitate future research on LLM personalization.
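The evaluation setup described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's released pipeline. It assumes each test case pairs an explicit persona and a query with two candidate responses and a human-preferred label, and that a model is scored by how often it picks the annotated response within each tier. The `PersonaCase` fields, the `judge` callable, and `accuracy_by_tier` are hypothetical names introduced here; the actual data format may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class PersonaCase:
    """Hypothetical record layout (an assumption, not the released schema)."""
    persona: str      # explicit user persona shown to the model
    query: str        # user question
    response_a: str   # first candidate response
    response_b: str   # second candidate response
    preferred: str    # human-annotated choice: "a" or "b"
    tier: str         # "easy", "medium", or "hard"


def accuracy_by_tier(
    cases: Iterable[PersonaCase],
    judge: Callable[[PersonaCase], str],
) -> Dict[str, float]:
    """Return, per tier, the fraction of cases where judge() matches the human label."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for case in cases:
        total[case.tier] += 1
        if judge(case) == case.preferred:
            correct[case.tier] += 1
    return {tier: correct[tier] / total[tier] for tier in total}


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the benchmark.
    toy_cases = [
        PersonaCase("vegan student", "dinner ideas?", "lentil curry", "steak", "a", "easy"),
        PersonaCase("night-shift nurse", "sleep tips?", "generic tips", "shift-aware tips", "b", "hard"),
    ]
    # A placeholder judge that always picks response "a"; a real judge would query a model.
    print(accuracy_by_tier(toy_cases, lambda case: "a"))
```

Reporting accuracy per tier rather than as a single number keeps the hard tier visible, since that is where the abstract says even strong models fall short.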
