
KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval

October 24, 2023
作者: Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, Besmira Nushi
cs.AI

Abstract

We study the ability of state-of-the-art models to answer constraint satisfaction queries for information retrieval (e.g., 'a list of ice cream shops in San Diego'). In the past, such queries were considered to be tasks that could only be solved via web search or knowledge bases. More recently, large language models (LLMs) have demonstrated initial emergent abilities in this task. However, many current retrieval benchmarks are either saturated or do not measure constraint satisfaction. Motivated by rising concerns around factual incorrectness and hallucinations of LLMs, we present KITAB, a new dataset for measuring constraint satisfaction abilities of language models. KITAB consists of book-related data across more than 600 authors and 13,000 queries, and also offers an associated dynamic data collection and constraint verification approach for acquiring similar test data for other authors. Our extended experiments on GPT-4 and GPT-3.5 characterize and decouple common failure modes across dimensions such as information popularity, constraint types, and context availability. Results show that in the absence of context, models exhibit severe limitations as measured by irrelevant information, factual errors, and incompleteness, many of which are exacerbated as information popularity decreases. While context availability mitigates irrelevant information, it is not helpful for satisfying constraints, identifying fundamental barriers to constraint satisfaction. We open-source our contributions to foster further research on improving constraint satisfaction abilities of future models.
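The failure modes the abstract names (irrelevant information, factual errors on the constraint, and incompleteness) can be illustrated with a minimal sketch of constraint verification over book lists. This is a hypothetical toy, not the paper's actual implementation: the `starts_with` constraint type, the function names, and the set-based scoring are all assumptions made for illustration.

```python
def satisfies(title: str, constraint: dict) -> bool:
    """Check one title against a single illustrative constraint type."""
    if constraint["type"] == "starts_with":
        return title.lower().startswith(constraint["value"].lower())
    raise ValueError(f"unknown constraint type: {constraint['type']}")


def evaluate(model_titles, ground_truth, constraint):
    """Split a model's answer into the three error buckets.

    Returns (irrelevant, unsatisfied, missed) as sets of lowercased titles:
      - irrelevant:  titles not in the author's ground-truth list at all
      - unsatisfied: real titles that fail the stated constraint
      - missed:      constraint-satisfying titles the model omitted
    """
    truth_all = {t.lower() for t in ground_truth}
    truth_ok = {t.lower() for t in ground_truth if satisfies(t, constraint)}
    answers = {t.lower() for t in model_titles}

    irrelevant = answers - truth_all            # hallucinated / wrong author
    unsatisfied = (answers & truth_all) - truth_ok  # real book, constraint fails
    missed = truth_ok - answers                 # incompleteness
    return irrelevant, unsatisfied, missed
```

For example, with ground truth `["Alpha Dogs", "Beta Ray", "Gamma World"]` and the constraint "title starts with B", a model answer of `["Beta Ray", "Delta Force"]` yields one irrelevant title (`delta force`), no constraint violations, and nothing missed. Scoring against a verified ground-truth list is what lets the benchmark decouple hallucination from constraint failure.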