BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback

Abstract

Search-augmented large language models (LLMs) have advanced information-seeking tasks by integrating retrieval into generation, reducing users' cognitive burden compared to traditional search systems. Yet they still fall short of fully addressing diverse user needs, which requires recognizing that the same query can reflect different intents across users and delivering information in each user's preferred form. While recent systems such as ChatGPT and Gemini attempt personalization by leveraging user histories, systematic evaluation of such personalization remains under-explored. To address this gap, we propose BESPOKE, a realistic benchmark for evaluating personalization in search-augmented LLMs. BESPOKE is designed to be both realistic, by collecting authentic chat and search histories directly from humans, and diagnostic, by pairing responses with fine-grained preference scores and feedback. The benchmark is constructed through long-term, deeply engaged human annotation, in which annotators contributed their own histories, authored queries with detailed information needs, and evaluated responses with scores and diagnostic feedback. Leveraging BESPOKE, we conduct systematic analyses that reveal key requirements for effective personalization in information-seeking tasks, providing a foundation for fine-grained evaluation of personalized search-augmented LLMs. Our code and data are available at https://augustinlib.github.io/BESPOKE/.
