ECI_{sem}: Semantic Residual Effective Contrastive Information for Hard-Negative Evaluation

Abstract

Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream evaluation. We propose ECI_{sem}, a semantic residual variant of Effective Contrastive Information (ECI) that ranks candidate negative sources using embeddings from a frozen target encoder. ECI_{sem} is training-free but not label-free: each scored example requires a query, a labeled positive, and an explicit candidate negative. ECI_{sem} builds a weighted residual information matrix from target consistency, semantic locality, lexical residuality, and a log-determinant diversity objective. On MS MARCO negative sources, in-family ECI_{sem} ranks LLM negatives highest among non-hybrid sources and Dense+LLM highest among hybrid sources, matching the strongest aggregate BEIR transfer results across DistilBERT, E5-base, and Contriever. Controlled ablations show that this alignment depends on using the target encoder family, while additional ablations show that the ranking is stable under perturbations of sample size, temperature, tokenizer, and IDF corpus. The theory provides a locally linearized link to loss reduction, and the empirical study treats downstream evaluation as the final test.
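To make the scoring pipeline concrete, the following is a minimal illustrative sketch of a log-determinant score over a weighted residual matrix. It is not the paper's definition: the abstract does not specify the formulas, so the function name `eci_sem_sketch`, the particular forms chosen for target consistency, semantic locality, and lexical residuality, the product weighting, the residual direction, and the `ridge` term are all assumptions made only for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Row-wise L2 normalization."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def eci_sem_sketch(q, p, n, lexical_overlap, tau=0.05, ridge=1.0):
    """Illustrative score for one candidate negative source.

    q, p, n: (N, d) frozen-encoder embeddings of queries, labeled
        positives, and candidate negatives.
    lexical_overlap: (N,) values in [0, 1], e.g. an IDF-weighted token
        overlap between negative and query (hypothetical input).
    Every formula below is an ASSUMED stand-in, not the paper's.
    """
    q, p, n = l2_normalize(q), l2_normalize(p), l2_normalize(n)
    s_pos = np.sum(q * p, axis=1)  # cosine(query, positive)
    s_neg = np.sum(q * n, axis=1)  # cosine(query, negative)

    # Assumed "target consistency": softmax mass the frozen encoder
    # assigns to the negative against the positive at temperature tau.
    consistency = 1.0 / (1.0 + np.exp((s_pos - s_neg) / tau))
    # Assumed "semantic locality": query-negative cosine, clipped to [0, 1].
    locality = np.clip(s_neg, 0.0, 1.0)
    # Assumed "lexical residuality": the part not explained by overlap.
    residuality = 1.0 - np.clip(lexical_overlap, 0.0, 1.0)
    w = consistency * locality * residuality  # nonnegative weights

    # Assumed residual: component of the negative orthogonal to the positive.
    r = n - np.sum(n * p, axis=1, keepdims=True) * p

    # Weighted residual matrix with a ridge term, scored by log-determinant.
    d = q.shape[1]
    m = ridge * np.eye(d) + (r * w[:, None]).T @ r / len(w)
    _, logdet = np.linalg.slogdet(m)
    return logdet


if __name__ == "__main__":
    # Synthetic embeddings stand in for real encoder outputs.
    rng = np.random.default_rng(0)
    q, p, n = (rng.normal(size=(64, 16)) for _ in range(3))
    overlap = rng.uniform(size=64)
    print("sketch score:", eci_sem_sketch(q, p, n, overlap))
```

Under these assumptions, candidate sources would be ranked by computing the score once per source on the same queries and positives; a higher value indicates weighted residuals that span more directions in embedding space. With `ridge=1.0` the score is zero when all weights vanish and nonnegative otherwise.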