BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback

Abstract

Search-augmented large language models (LLMs) have advanced information-seeking tasks by integrating retrieval into generation, reducing users' cognitive burden compared to traditional search systems. Yet they remain insufficient for fully addressing diverse user needs, which requires recognizing how the same query can reflect different intents across users and delivering information in each user's preferred form. While recent systems such as ChatGPT and Gemini attempt personalization by leveraging user histories, systematic evaluation of such personalization remains underexplored. To address this gap, we propose BESPOKE, a realistic benchmark for evaluating personalization in search-augmented LLMs. BESPOKE is designed to be both realistic, by collecting authentic chat and search histories directly from humans, and diagnostic, by pairing responses with fine-grained preference scores and feedback. The benchmark is constructed through long-term, deeply engaged human annotation, in which annotators contributed their own histories, authored queries with detailed information needs, and evaluated responses with scores and diagnostic feedback. Leveraging BESPOKE, we conduct systematic analyses that reveal key requirements for effective personalization in information-seeking tasks, providing a foundation for fine-grained evaluation of personalized search-augmented LLMs. Our code and data are available at https://augustinlib.github.io/BESPOKE/.
