KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval

Abstract

We study the ability of state-of-the-art models to answer constraint satisfaction queries for information retrieval (e.g., 'a list of ice cream shops in San Diego'). In the past, such queries were considered tasks that could only be solved via web search or knowledge bases. More recently, large language models (LLMs) have demonstrated initial emergent abilities in this task. However, many current retrieval benchmarks are either saturated or do not measure constraint satisfaction. Motivated by rising concerns around factual incorrectness and hallucinations of LLMs, we present KITAB, a new dataset for measuring constraint satisfaction abilities of language models. KITAB consists of book-related data across more than 600 authors and 13,000 queries, and also offers an associated dynamic data collection and constraint verification approach for acquiring similar test data for other authors. Our extended experiments on GPT4 and GPT3.5 characterize and decouple common failure modes across dimensions such as information popularity, constraint types, and context availability. Results show that in the absence of context, models exhibit severe limitations as measured by irrelevant information, factual errors, and incompleteness, many of which worsen as information popularity decreases. While context availability mitigates irrelevant information, it does not help models satisfy constraints, which points to fundamental barriers to constraint satisfaction. We open-source our contributions to foster further research on improving constraint satisfaction abilities of future models.
