QUEST: A Retrieval Dataset of Entity-Seeking Queries with Implicit Set Operations
May 19, 2023
Authors: Chaitanya Malaviya, Peter Shaw, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
cs.AI
Abstract
Formulating selective information needs results in queries that implicitly
specify set operations, such as intersection, union, and difference. For
instance, one might search for "shorebirds that are not sandpipers" or
"science-fiction films shot in England". To study the ability of retrieval
systems to meet such information needs, we construct QUEST, a dataset of 3357
natural language queries with implicit set operations that map to a set of
entities corresponding to Wikipedia documents. The dataset challenges models to
match multiple constraints mentioned in queries with corresponding evidence in
documents and correctly perform various set operations. The dataset is
constructed semi-automatically using Wikipedia category names. Queries are
automatically composed from individual categories, then paraphrased and further
validated for naturalness and fluency by crowdworkers. Crowdworkers also assess
the relevance of entities based on their documents and highlight attribution of
query constraints to spans of document text. We analyze several modern
retrieval systems, finding that they often struggle with such queries. Queries
involving negation and conjunction are particularly challenging, and systems are
further challenged by combinations of these operations.
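
To make the implicit set operations concrete, the following minimal Python sketch evaluates the two example queries from the abstract over toy category memberships. The category contents, the `CATEGORIES` mapping, the `answer_set` helper, and the tuple-based expression format are all assumptions made for illustration; they are not taken from QUEST or its construction pipeline.

```python
# Toy category -> entity sets, hand-written for illustration only.
# These memberships are NOT drawn from QUEST or from Wikipedia categories.
CATEGORIES = {
    "shorebirds": {"Dunlin", "Sanderling", "Killdeer", "American avocet"},
    "sandpipers": {"Dunlin", "Sanderling"},
    "science-fiction films": {"film_a", "film_b", "film_c"},
    "films shot in England": {"film_b", "film_c", "film_d"},
}


def answer_set(expr):
    """Evaluate a category name or a nested (op, left, right) expression.

    Hypothetical helper: "and" is intersection, "or" is union,
    and "not" is set difference (left minus right).
    """
    if isinstance(expr, str):
        return set(CATEGORIES[expr])
    op, left, right = expr
    a, b = answer_set(left), answer_set(right)
    if op == "and":
        return a & b
    if op == "or":
        return a | b
    if op == "not":
        return a - b
    raise ValueError(f"unknown operation: {op!r}")


# "shorebirds that are not sandpipers" -> difference
not_sandpipers = answer_set(("not", "shorebirds", "sandpipers"))

# "science-fiction films shot in England" -> intersection
scifi_in_england = answer_set(
    ("and", "science-fiction films", "films shot in England")
)

print(sorted(not_sandpipers))
print(sorted(scifi_in_england))
```

A retrieval system sees only the query and the document text, not the category labels, so it has to recover the same answer set from evidence in the documents themselves.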