Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
May 22, 2025
Authors: Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin
cs.AI
Abstract
Training robust retrieval and reranker models typically relies on large-scale
retrieval datasets; for example, the BGE collection contains 1.6 million
query-passage pairs sourced from various data sources. However, we find that
certain datasets can negatively impact model effectiveness -- pruning 8 out of
15 datasets from the BGE collection reduces the training set size by
2.35× and increases nDCG@10 on BEIR by 1.0 point. This motivates a
deeper examination of training data quality, with a particular focus on "false
negatives", where relevant passages are incorrectly labeled as irrelevant. We
propose a simple, cost-effective approach using cascading LLM prompts to
identify and relabel hard negatives. Experimental results show that relabeling
false negatives as true positives improves both E5 (base) and Qwen2.5-7B
retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot
AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on
the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the
cascading design is further supported by human annotation results, where we
find that GPT-4o's judgments show much higher agreement with humans than
GPT-4o-mini's.
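The cascading design described in the abstract — a cheap LLM judge screens hard negatives, and only pairs it flags as relevant are escalated to a stronger, more expensive judge — can be sketched as follows. This is an illustrative reconstruction, not the paper's actual prompts or code: the judge functions here are toy keyword-overlap stand-ins, where in practice they would be relevance prompts sent to models such as GPT-4o-mini (cheap) and GPT-4o (strong).

```python
# Hypothetical sketch of cascading LLM judges for relabeling hard negatives.
# A cheap judge screens every (query, passage) pair; only pairs it flags as
# relevant (candidate false negatives) are escalated to the strong judge.
from typing import Callable

# A judge maps (query, passage) -> True if the passage looks relevant.
Judge = Callable[[str, str], bool]

def cascade_relabel(query: str, passage: str,
                    cheap_judge: Judge, strong_judge: Judge) -> str:
    """Return the new label for a hard negative: 'positive' only if both
    judges agree the passage is relevant; otherwise keep 'negative'."""
    if not cheap_judge(query, passage):
        return "negative"  # cheap judge rejects: no costly escalation
    # Escalate only the rarer candidate false negatives to the strong judge.
    return "positive" if strong_judge(query, passage) else "negative"

# Toy stand-ins for the LLM judges (keyword overlap), for illustration only.
def toy_cheap(q: str, p: str) -> bool:
    return any(w in p for w in q.split())   # loose: any query word present

def toy_strong(q: str, p: str) -> bool:
    return all(w in p for w in q.split())   # strict: all query words present

print(cascade_relabel("solar power", "solar panels generate power",
                      toy_cheap, toy_strong))  # both judges agree: positive
print(cascade_relabel("solar power", "wind turbines spin",
                      toy_cheap, toy_strong))  # cheap judge rejects: negative
```

Because most hard negatives are true negatives, the cheap judge filters out the bulk of pairs, and the strong judge's cost is paid only on the small candidate set — which is what makes the cascade cost-effective relative to judging everything with GPT-4o.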