Evaluating D-MERIT of Partial-annotation on Information Retrieval

June 23, 2024
Authors: Royi Rassin, Yaron Fairstein, Oren Kalinsky, Guy Kushilevitz, Nachshon Cohen, Alexander Libov, Yoav Goldberg
cs.AI

Abstract

Retrieval models are often evaluated on partially-annotated datasets. Each query is mapped to a few relevant texts and the remaining corpus is assumed to be irrelevant. As a result, models that successfully retrieve false negatives are punished in evaluation. Unfortunately, completely annotating all texts for every query is not resource-efficient. In this work, we show that using partially-annotated datasets in evaluation can paint a distorted picture. We curate D-MERIT, a passage retrieval evaluation set from Wikipedia, aspiring to contain all relevant passages for each query. Queries describe a group (e.g., "journals about linguistics") and relevant passages are evidence that entities belong to the group (e.g., a passage indicating that Language is a journal about linguistics). We show that evaluating on a dataset containing annotations for only a subset of the relevant passages might result in misleading ranking of the retrieval systems and that as more relevant texts are included in the evaluation set, the rankings converge. We propose our dataset as a resource for evaluation and our study as a recommendation for balance between resource-efficiency and reliable evaluation when annotating evaluation sets for text retrieval.
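
As a toy illustration (not code from the paper, and all passage IDs and systems below are hypothetical), the sketch shows how partial relevance judgments penalize a system that retrieves relevant-but-unannotated passages, and how the ranking of two systems can flip once the missing annotations are added:

```python
# Toy example: evaluating two retrieval systems under partial vs. full
# relevance annotations. A system that surfaces unannotated relevant
# passages ("false negatives") looks worse under partial judgments.

def recall_at_k(ranked_ids, relevant_ids, k=3):
    """Fraction of relevant passages appearing in the top-k results."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)

# p1..p4 are truly relevant to the query, but only p1 and p2 were
# annotated in the partial evaluation set.
full_relevant = {"p1", "p2", "p3", "p4"}
partial_relevant = {"p1", "p2"}

# System A retrieves the annotated passages first; System B retrieves
# the unannotated (false-negative) relevant passages first.
system_a = ["p1", "p2", "x1", "x2", "x3"]
system_b = ["p3", "p4", "p1", "x1", "x2"]

for name, ranked in [("A", system_a), ("B", system_b)]:
    print(
        f"System {name}: "
        f"partial recall@3 = {recall_at_k(ranked, partial_relevant):.2f}, "
        f"full recall@3 = {recall_at_k(ranked, full_relevant):.2f}"
    )
# Under partial annotation, A (1.00) outranks B (0.50); with full
# annotation, B (0.75) outranks A (0.50) -- the ranking flips.
```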
