
Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval

May 22, 2025
Authors: Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin
cs.AI

Abstract

Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs drawn from a variety of sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 of the 15 datasets in the BGE collection shrinks the training set by a factor of 2.35 yet increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach that uses cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives as true positives improves both the E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 points on BEIR and by 1.7-1.8 points on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation: judgments from GPT-4o show much higher agreement with human annotators than those from GPT-4o-mini.
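The cascading design described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual prompts or decision rules: a cheap judge (standing in for GPT-4o-mini) screens every hard negative, and only pairs it flags as possibly relevant are escalated to a stronger, more expensive judge (standing in for GPT-4o), whose verdict determines the final label.

```python
# Hypothetical sketch of a cascading relabeling pipeline. The judge
# functions below are toy stand-ins for LLM relevance prompts; in the
# paper's setting they would be calls to GPT-4o-mini and GPT-4o.

def cascade_relabel(pairs, cheap_judge, strong_judge):
    """Relabel hard negatives as positives when the cascade deems them relevant.

    `pairs` is a list of dicts with keys 'query', 'passage', and 'label'
    (either 'positive' or 'hard_negative'). Each judge maps
    (query, passage) -> bool. Only pairs the cheap judge flags are
    escalated to the strong judge, which keeps costs low.
    """
    relabeled = []
    for pair in pairs:
        if pair["label"] == "hard_negative" and cheap_judge(pair["query"], pair["passage"]):
            # Escalation step: the stronger judge confirms or rejects.
            if strong_judge(pair["query"], pair["passage"]):
                pair = {**pair, "label": "positive"}  # false negative fixed
        relabeled.append(pair)
    return relabeled


# Toy judges: a permissive lexical screen, then a stricter check.
cheap = lambda q, p: q.split()[0] in p
strong = lambda q, p: all(w in p for w in q.split())

data = [
    {"query": "solar power", "passage": "solar power plants", "label": "hard_negative"},
    {"query": "solar power", "passage": "wind turbines", "label": "hard_negative"},
]
out = cascade_relabel(data, cheap, strong)
# The first pair is relabeled 'positive'; the second stays 'hard_negative'.
```

The design choice the cascade encodes is cost control: the expensive judge is only consulted on the small subset of candidates the cheap judge cannot confidently dismiss.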

