BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback

Abstract

Search-augmented large language models (LLMs) have advanced information-seeking tasks by integrating retrieval into generation, reducing users' cognitive burden compared to traditional search systems. Yet they still fall short of fully addressing diverse user needs, which requires recognizing that the same query can reflect different intents across users and delivering information in each user's preferred form. While recent systems such as ChatGPT and Gemini attempt personalization by leveraging user histories, systematic evaluation of such personalization remains under-explored. To address this gap, we propose BESPOKE, a realistic benchmark for evaluating personalization in search-augmented LLMs. BESPOKE is designed to be both realistic, by collecting authentic chat and search histories directly from humans, and diagnostic, by pairing responses with fine-grained preference scores and feedback. The benchmark is constructed through long-term, deeply engaged human annotation, in which annotators contributed their own histories, authored queries with detailed information needs, and evaluated responses with scores and diagnostic feedback. Leveraging BESPOKE, we conduct systematic analyses that reveal key requirements for effective personalization in information-seeking tasks, providing a foundation for fine-grained evaluation of personalized search-augmented LLMs. Our code and data are available at https://augustinlib.github.io/BESPOKE/.
