ECI_{sem}: Semantic Residual Effective Contrastive Information for Evaluating Hard Negatives

Abstract

Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream evaluation. We propose ECI_{sem}, a semantic residual variant of Effective Contrastive Information (ECI) that ranks candidate negative sources using frozen target-encoder embeddings. ECI_{sem} is training-free but not label-free: each scored example requires a query, a labeled positive, and an explicit candidate negative. ECI_{sem} builds a weighted residual information matrix from target consistency, semantic locality, lexical residuality, and a log-determinant diversity objective. On MS MARCO negative sources, in-family ECI_{sem} ranks LLM negatives highest among non-hybrid sources and Dense+LLM highest among hybrid sources, matching the strongest aggregate BEIR transfer results across DistilBERT, E5-base, and Contriever. Controlled ablations show that this alignment depends on using the target encoder family, while additional ablations show stability under sample-size, temperature, tokenizer, and IDF-corpus perturbations. The theory gives a local, linearized link to loss reduction, while the empirical study treats downstream evaluation as the final test.
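To make the phrase "weighted residual information matrix ... and a log-determinant diversity objective" concrete, the following is a minimal sketch and not the paper's exact definition. It assumes that a residual is the part of a negative embedding orthogonal to the positive embedding, that the per-example weights (standing in for target consistency, semantic locality, and lexical residuality) are already computed, and that the matrix is regularized with an identity term. The function names `residual_against_positive` and `logdet_diversity_score` and the `ridge` parameter are hypothetical.

```python
import numpy as np


def residual_against_positive(negatives, positive):
    """Remove from each negative embedding its component along the positive.

    Assumption: this projection is one plausible reading of "residual";
    the abstract does not specify the construction.
    """
    negatives = np.asarray(negatives, dtype=float)
    positive = np.asarray(positive, dtype=float)
    unit = positive / np.linalg.norm(positive)
    return negatives - np.outer(negatives @ unit, unit)


def logdet_diversity_score(residuals, weights, ridge=1.0):
    """Return log det(ridge * I + (1/n) * sum_i w_i r_i r_i^T).

    Assumption: `weights` are non-negative scalars supplied by the caller;
    how they are derived is not given in the abstract.
    """
    residuals = np.asarray(residuals, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n, d = residuals.shape
    # Weighted sum of outer products, averaged over the n scored examples.
    info = (residuals * weights[:, None]).T @ residuals / n
    # slogdet is numerically safer than log(det(...)) for larger matrices.
    _, logdet = np.linalg.slogdet(ridge * np.eye(d) + info)
    return logdet


# Toy comparison: two orthogonal residuals versus two identical ones.
diverse = logdet_diversity_score([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
redundant = logdet_diversity_score([[1.0, 0.0], [1.0, 0.0]], [1.0, 1.0])
```

In this toy setup the orthogonal residuals score higher than the duplicated ones, which is the behavior a log-determinant diversity term is meant to reward; a source whose weighted residuals span more directions would rank higher under this sketch.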