Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval

Abstract

Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs drawn from various data sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35× and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find that judgments by GPT-4o agree with human judgments much more closely than those by GPT-4o-mini.
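To make the cascading idea concrete, here is a minimal sketch, not the authors' implementation. The abstract does not spell out how the cascade is wired, so this assumes a two-stage design in which a cheaper judge screens every hard negative and a stronger judge confirms only the flagged ones before a passage is relabeled as a positive. The names TrainingExample, Judge, and relabel_hard_negatives are hypothetical, and the keyword-overlap judges stand in for LLM prompts so that the example runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical judge interface: (query, passage) -> True if judged relevant.
# In practice each judge would wrap an LLM prompt; that wiring is assumed, not shown.
Judge = Callable[[str, str], bool]


@dataclass
class TrainingExample:
    query: str
    positives: List[str]
    hard_negatives: List[str]


def relabel_hard_negatives(
    example: TrainingExample, cheap_judge: Judge, strong_judge: Judge
) -> TrainingExample:
    """Move suspected false negatives into the positive set (assumed two-stage cascade)."""
    positives = list(example.positives)
    negatives: List[str] = []
    for passage in example.hard_negatives:
        # Stage 1: the cheap judge screens every hard negative.
        if not cheap_judge(example.query, passage):
            negatives.append(passage)
            continue
        # Stage 2: the strong judge is consulted only for flagged passages.
        if strong_judge(example.query, passage):
            positives.append(passage)  # confirmed false negative: relabel as positive
        else:
            negatives.append(passage)
    return TrainingExample(example.query, positives, negatives)


# Toy stand-ins for the two judges: keyword overlap instead of LLM calls.
def toy_cheap_judge(query: str, passage: str) -> bool:
    return any(word in passage.lower() for word in query.lower().split())


def toy_strong_judge(query: str, passage: str) -> bool:
    return all(word in passage.lower() for word in query.lower().split())


example = TrainingExample(
    query="capital france",
    positives=["Paris is the capital of France."],
    hard_negatives=[
        "The capital of France is Paris, on the Seine.",
        "France borders Spain.",
        "Berlin is in Germany.",
    ],
)
result = relabel_hard_negatives(example, toy_cheap_judge, toy_strong_judge)
print(len(result.positives), len(result.hard_negatives))
```

In this sketch the stronger judge is called only on passages the cheaper judge flags, which is one way a cascade could keep labeling cost down; the actual prompts, models, and decision rules are not given in the abstract.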
