Evaluating D-MERIT of Partial-annotation on Information Retrieval

June 23, 2024
作者: Royi Rassin, Yaron Fairstein, Oren Kalinsky, Guy Kushilevitz, Nachshon Cohen, Alexander Libov, Yoav Goldberg
cs.AI

Abstract

Retrieval models are often evaluated on partially-annotated datasets. Each query is mapped to a few relevant texts and the remaining corpus is assumed to be irrelevant. As a result, models that successfully retrieve false negatives are punished in evaluation. Unfortunately, completely annotating all texts for every query is not resource-efficient. In this work, we show that using partially-annotated datasets in evaluation can paint a distorted picture. We curate D-MERIT, a passage retrieval evaluation set from Wikipedia, aspiring to contain all relevant passages for each query. Queries describe a group (e.g., "journals about linguistics") and relevant passages are evidence that entities belong to the group (e.g., a passage indicating that Language is a journal about linguistics). We show that evaluating on a dataset containing annotations for only a subset of the relevant passages might result in a misleading ranking of the retrieval systems, and that as more relevant texts are included in the evaluation set, the rankings converge. We propose our dataset as a resource for evaluation and our study as a recommendation for balancing resource-efficiency and reliable evaluation when annotating evaluation sets for text retrieval.
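To make the failure mode concrete, here is a minimal, hypothetical sketch (not code from the paper): two toy retrieval runs are scored with precision@k against a partially annotated relevance set and against the full one. The passage ids, the runs, and the choice of precision@k as the metric are all invented for illustration.

```python
# Minimal sketch: how partial annotation can flip system rankings.
# All data below is made up; "relevant" sets play the role of qrels and
# each "system" list is a ranked run of passage ids for one query.

def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k retrieved passages that are judged relevant."""
    top_k = ranked[:k]
    return sum(1 for pid in top_k if pid in relevant) / k

# Hypothetical full set of relevant passages for one query.
full_relevant = {"p1", "p2", "p3", "p4", "p5"}
# Partial annotation: only two of the five relevant passages were labeled.
partial_relevant = {"p1", "p2"}

# System A retrieves the annotated passages first; System B retrieves
# relevant-but-unannotated passages (false negatives under partial labels).
system_a = ["p1", "p2", "x1", "x2", "x3"]
system_b = ["p3", "p4", "p5", "p1", "x1"]

for name, run in [("A", system_a), ("B", system_b)]:
    print(name,
          "partial:", precision_at_k(run, partial_relevant),
          "full:", precision_at_k(run, full_relevant))
# Under partial annotation A looks better (0.4 vs 0.2), but with full
# annotation B is actually stronger (0.4 vs 0.8): the ranking flips.
```

As more of the truly relevant passages are added to the annotation set, the partially-annotated scores approach the fully-annotated ones, which is the convergence behavior the abstract describes.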
