QUEST: A Retrieval Dataset of Entity-Seeking Queries with Implicit Set Operations
May 19, 2023
Authors: Chaitanya Malaviya, Peter Shaw, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
cs.AI
Abstract
Formulating selective information needs results in queries that implicitly
specify set operations, such as intersection, union, and difference. For
instance, one might search for "shorebirds that are not sandpipers" or
"science-fiction films shot in England". To study the ability of retrieval
systems to meet such information needs, we construct QUEST, a dataset of 3357
natural language queries with implicit set operations that map to a set of
entities corresponding to Wikipedia documents. The dataset challenges models to
match multiple constraints mentioned in queries with corresponding evidence in
documents and correctly perform various set operations. The dataset is
constructed semi-automatically using Wikipedia category names. Queries are
automatically composed from individual categories, then paraphrased and further
validated for naturalness and fluency by crowdworkers. Crowdworkers also assess
the relevance of entities based on their documents and highlight attribution of
query constraints to spans of document text. We analyze several modern
retrieval systems, finding that they often struggle on such queries. Queries
involving negation and conjunction are particularly challenging, and systems
struggle further with combinations of these operations.
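
To make the idea of implicit set operations concrete, the following is a minimal Python sketch of how a query composed from two categories could map to an answer set of entities. The category contents, the film titles, and the `answer_set` helper are hypothetical illustrations. They are not taken from QUEST, and they are not the authors' construction pipeline.

```python
from typing import Dict, Set

# Toy category-to-entity map. All memberships here are illustrative
# assumptions, not data from QUEST; "Film A/B/C" are placeholders.
CATEGORIES: Dict[str, Set[str]] = {
    "shorebirds": {"Dunlin", "Killdeer", "American oystercatcher"},
    "sandpipers": {"Dunlin"},
    "science-fiction films": {"Film A", "Film B"},
    "films shot in England": {"Film B", "Film C"},
}


def answer_set(op: str, left: str, right: str) -> Set[str]:
    """Return the entities selected by applying `op` to two categories."""
    a, b = CATEGORIES[left], CATEGORIES[right]
    if op == "intersection":  # "A that are also B"
        return a & b
    if op == "union":         # "A or B"
        return a | b
    if op == "difference":    # "A that are not B"
        return a - b
    raise ValueError(f"unknown set operation: {op}")


# "shorebirds that are not sandpipers" -> set difference
not_sandpipers = answer_set("difference", "shorebirds", "sandpipers")

# "science-fiction films shot in England" -> set intersection
scifi_in_england = answer_set(
    "intersection", "science-fiction films", "films shot in England"
)
```

In this sketch the set operation is given explicitly. In the dataset, by contrast, the operation is only implied by the wording of the natural language query, so a retrieval system has to infer it while matching each constraint against document text.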