Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
May 22, 2025
Authors: Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin
cs.AI
Abstract
Training robust retrieval and reranker models typically relies on large-scale
retrieval datasets; for example, the BGE collection contains 1.6 million
query-passage pairs drawn from a variety of sources. However, we find that
certain datasets can negatively impact model effectiveness -- pruning 8 out of
15 datasets from the BGE collection reduces the training set size by
2.35× and increases nDCG@10 on BEIR by 1.0 point. This motivates a
deeper examination of training data quality, with a particular focus on "false
negatives", where relevant passages are incorrectly labeled as irrelevant. We
propose a simple, cost-effective approach using cascading LLM prompts to
identify and relabel hard negatives. Experimental results show that relabeling
false negatives as true positives improves both E5 (base) and Qwen2.5-7B
retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot
AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on
the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the
cascading design is further supported by human annotation results, where we
find that judgments by GPT-4o show much higher agreement with human annotators
than those by GPT-4o-mini.
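The cascading design described above can be sketched as follows. The paper does not publish this exact code, so the judge interface, verdict labels, and the escalation rule are illustrative assumptions: a cheap LLM judge (e.g. GPT-4o-mini) screens each hard negative first, and only cases it does not confidently reject are escalated to a stronger judge (e.g. GPT-4o).

```python
# Sketch of cascading LLM relevance judging for relabeling hard negatives.
# Judge callables stand in for LLM API calls; prompts and labels are assumptions.

def cascade_relabel(query, passage, cheap_judge, strong_judge):
    """Return 'positive' if a hard negative is judged relevant, else 'negative'.

    cheap_judge / strong_judge: callables (query, passage) ->
    'relevant' | 'irrelevant' | 'unsure'. In practice each would wrap an
    LLM prompt asking whether the passage answers the query.
    """
    verdict = cheap_judge(query, passage)
    if verdict == "irrelevant":
        return "negative"  # cheap model confidently rejects: keep as hard negative
    # Borderline or positive-leaning cases go to the stronger (pricier) model.
    verdict = strong_judge(query, passage)
    return "positive" if verdict == "relevant" else "negative"


# Toy judges that only exercise the control flow, not real LLM calls.
cheap = lambda q, p: "unsure" if "capital" in q else "irrelevant"
strong = lambda q, p: "relevant" if "Paris" in p else "irrelevant"

print(cascade_relabel("capital of France",
                      "Paris is the capital of France.", cheap, strong))
print(cascade_relabel("tallest mountain",
                      "Paris is a city.", cheap, strong))
```

The cascade keeps cost low because the strong model only sees the subset of hard negatives the cheap model cannot confidently dismiss; relabeled positives then replace the mislabeled negatives in the training pairs.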