Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
May 22, 2025
Authors: Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin
cs.AI
Abstract
Training robust retrieval and reranker models typically relies on large-scale
retrieval datasets; for example, the BGE collection contains 1.6 million
query-passage pairs sourced from various data sources. However, we find that
certain datasets can negatively impact model effectiveness -- pruning 8 out of
15 datasets from the BGE collection reduces the training set size by
2.35× and increases nDCG@10 on BEIR by 1.0 point. This motivates a
deeper examination of training data quality, with a particular focus on "false
negatives", where relevant passages are incorrectly labeled as irrelevant. We
propose a simple, cost-effective approach using cascading LLM prompts to
identify and relabel hard negatives. Experimental results show that relabeling
false negatives as true positives improves both E5 (base) and Qwen2.5-7B
retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot
AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on
the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the
cascading design is further supported by human annotation results, where we
find that GPT-4o's judgments show much higher agreement with humans than
GPT-4o-mini's.
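The cascading design described in the abstract — a cheap LLM judge screens hard negatives, and only pairs it flags as relevant are escalated to a stronger, more expensive judge — can be sketched as follows. This is an illustrative reconstruction, not the paper's actual prompts or code: the judge functions here are toy keyword-overlap stand-ins, where in practice they would be relevance prompts sent to models such as GPT-4o-mini (cheap) and GPT-4o (strong).

```python
# Hypothetical sketch of cascading LLM judges for relabeling hard negatives.
# A cheap judge screens every (query, passage) pair; only pairs it flags as
# relevant (candidate false negatives) are escalated to the strong judge.
from typing import Callable

# A judge maps (query, passage) -> True if the passage looks relevant.
Judge = Callable[[str, str], bool]

def cascade_relabel(query: str, passage: str,
                    cheap_judge: Judge, strong_judge: Judge) -> str:
    """Return the new label for a hard negative: 'positive' only if both
    judges agree the passage is relevant; otherwise keep 'negative'."""
    if not cheap_judge(query, passage):
        return "negative"  # cheap judge rejects: no costly escalation
    # Escalate only the rarer candidate false negatives to the strong judge.
    return "positive" if strong_judge(query, passage) else "negative"

# Toy stand-ins for the LLM judges (keyword overlap), for illustration only.
def toy_cheap(q: str, p: str) -> bool:
    return any(w in p for w in q.split())   # loose: any query word present

def toy_strong(q: str, p: str) -> bool:
    return all(w in p for w in q.split())   # strict: all query words present

print(cascade_relabel("solar power", "solar panels generate power",
                      toy_cheap, toy_strong))  # both judges agree: positive
print(cascade_relabel("solar power", "wind turbines spin",
                      toy_cheap, toy_strong))  # cheap judge rejects: negative
```

Because most hard negatives are true negatives, the cheap judge filters out the bulk of pairs, and the strong judge's cost is paid only on the small candidate set — which is what makes the cascade cost-effective relative to judging everything with GPT-4o.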