Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
May 22, 2025
Authors: Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin
cs.AI
Abstract
Training robust retrieval and reranker models typically relies on large-scale
retrieval datasets; for example, the BGE collection contains 1.6 million
query-passage pairs drawn from a variety of sources. However, we find that
certain datasets can negatively impact model effectiveness -- pruning 8 out of
15 datasets from the BGE collection reduces the training set size by
2.35× and increases nDCG@10 on BEIR by 1.0 point. This motivates a
deeper examination of training data quality, with a particular focus on "false
negatives", where relevant passages are incorrectly labeled as irrelevant. We
propose a simple, cost-effective approach using cascading LLM prompts to
identify and relabel hard negatives. Experimental results show that relabeling
false negatives as true positives improves both E5 (base) and Qwen2.5-7B
retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot
AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on
the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the
cascading design is further supported by human annotation results, where we
find that judgments by GPT-4o show much higher agreement with human annotators
than those by GPT-4o-mini.
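The cascading design described above can be sketched as follows. The paper does not publish this exact code, so the judge interface, verdict labels, and the escalation rule are illustrative assumptions: a cheap LLM judge (e.g. GPT-4o-mini) screens each hard negative first, and only cases it does not confidently reject are escalated to a stronger judge (e.g. GPT-4o).

```python
# Sketch of cascading LLM relevance judging for relabeling hard negatives.
# Judge callables stand in for LLM API calls; prompts and labels are assumptions.

def cascade_relabel(query, passage, cheap_judge, strong_judge):
    """Return 'positive' if a hard negative is judged relevant, else 'negative'.

    cheap_judge / strong_judge: callables (query, passage) ->
    'relevant' | 'irrelevant' | 'unsure'. In practice each would wrap an
    LLM prompt asking whether the passage answers the query.
    """
    verdict = cheap_judge(query, passage)
    if verdict == "irrelevant":
        return "negative"  # cheap model confidently rejects: keep as hard negative
    # Borderline or positive-leaning cases go to the stronger (pricier) model.
    verdict = strong_judge(query, passage)
    return "positive" if verdict == "relevant" else "negative"


# Toy judges that only exercise the control flow, not real LLM calls.
cheap = lambda q, p: "unsure" if "capital" in q else "irrelevant"
strong = lambda q, p: "relevant" if "Paris" in p else "irrelevant"

print(cascade_relabel("capital of France",
                      "Paris is the capital of France.", cheap, strong))
print(cascade_relabel("tallest mountain",
                      "Paris is a city.", cheap, strong))
```

The cascade keeps cost low because the strong model only sees the subset of hard negatives the cheap model cannot confidently dismiss; relabeled positives then replace the mislabeled negatives in the training pairs.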