KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval
October 24, 2023
Authors: Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, Besmira Nushi
cs.AI
Abstract
We study the ability of state-of-the-art models to answer constraint
satisfaction queries for information retrieval (e.g., 'a list of ice cream
shops in San Diego'). In the past, such queries were considered to be tasks
that could only be solved via web-search or knowledge bases. More recently,
large language models (LLMs) have demonstrated initial emergent abilities in
this task. However, many current retrieval benchmarks are either saturated or
do not measure constraint satisfaction. Motivated by rising concerns around
factual incorrectness and hallucinations of LLMs, we present KITAB, a new
dataset for measuring constraint satisfaction abilities of language models.
KITAB consists of book-related data across more than 600 authors and 13,000
queries, and also offers an associated dynamic data collection and constraint
verification approach for acquiring similar test data for other authors. Our
extended experiments on GPT4 and GPT3.5 characterize and decouple common
failure modes across dimensions such as information popularity, constraint
types, and context availability. Results show that in the absence of context,
models exhibit severe limitations as measured by irrelevant information,
factual errors, and incompleteness, many of which are exacerbated as
information popularity decreases. While context availability mitigates
irrelevant information, it does not help with satisfying constraints,
revealing fundamental barriers to constraint satisfaction. We open source our
contributions to foster further research on improving constraint satisfaction
abilities of future models.
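To make the failure modes above concrete, here is a minimal sketch of how answers to a constraint satisfaction query over book data could be scored. This is purely illustrative and not KITAB's actual verification code: the function name `score_answer`, the example constraint, and the exact category boundaries (constraint-satisfying, irrelevant, unverifiable/hallucinated, and a completeness ratio) are assumptions in the spirit of the abstract's taxonomy.

```python
def score_answer(answer, catalog, constraint):
    """Classify a model's answer list against ground truth (illustrative sketch).

    answer: list of book titles returned by the model
    catalog: set of the author's actual book titles (ground truth)
    constraint: predicate on a title, e.g. "title starts with 'a'"
    """
    # Titles that actually satisfy the query's constraint.
    satisfying = {t for t in catalog if constraint(t)}
    # Real books that satisfy the constraint: correct answers.
    correct = [t for t in answer if t in satisfying]
    # Real books by the author that do NOT satisfy the constraint:
    # irrelevant information.
    irrelevant = [t for t in answer if t in catalog and t not in satisfying]
    # Titles not in the author's catalog at all: likely factual errors
    # (hallucinations).
    unverifiable = [t for t in answer if t not in catalog]
    # Share of the satisfying set that the model recovered (incompleteness
    # is 1 minus this).
    completeness = len(set(correct)) / len(satisfying) if satisfying else 1.0
    return {
        "correct": correct,
        "irrelevant": irrelevant,
        "unverifiable": unverifiable,
        "completeness": completeness,
    }

# Example: query "books by this author whose title starts with 'A'".
result = score_answer(
    answer=["Alpha", "Beta", "Ghost"],
    catalog={"Alpha", "Avalon", "Beta"},
    constraint=lambda t: t.lower().startswith("a"),
)
# "Alpha" is correct, "Beta" is irrelevant, "Ghost" is unverifiable,
# and only 1 of 2 satisfying titles was found (completeness 0.5).
```

In this framing, the abstract's finding that context mitigates irrelevant information but not constraint satisfaction corresponds to the `irrelevant` and `unverifiable` lists shrinking while `completeness` stays low.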