ECI_{sem}: Semantic Residual Effective Contrastive Information for Evaluating Hard Negatives

June 5, 2026
Authors: Aarush Sinha, Rahul Seetharaman, Aman Bansal
cs.AI

Abstract

Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream evaluation. We propose ECI_{sem}, a semantic residual variant of Effective Contrastive Information (ECI) that ranks candidate negative sources using frozen target-encoder embeddings. ECI_{sem} is training-free, not label-free: each scored example requires a query, a labeled positive, and an explicit candidate negative. ECI_{sem} builds a weighted residual information matrix from target consistency, semantic locality, lexical residuality, and a log-determinant diversity objective. On MS MARCO negative sources, in-family ECI_{sem} ranks LLM negatives highest among non-hybrid sources and Dense+LLM highest among hybrid sources, matching the strongest aggregate BEIR transfer results across DistilBERT, E5-base, and Contriever. Controlled ablations show that this alignment depends on using the target encoder family, while additional ablations show stability under sample-size, temperature, tokenizer, and IDF-corpus perturbations. The theory gives a local linearized link to loss reduction, while the empirical study treats downstream evaluation as the final test.
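
The abstract does not give the formulas behind ECI_{sem}, so the following is only a rough sketch of what a log-determinant score over a weighted residual matrix could look like, not the authors' definition. Everything in it is an assumption made for illustration: the function name `logdet_residual_score`, the residual (the negative embedding minus its projection onto the positive), the sigmoid hardness weight, and the `temperature` and `ridge` parameters. The lexical (IDF-based) component mentioned in the abstract is left out entirely. Embeddings are assumed to come from a frozen target encoder and are passed in as plain arrays.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def logdet_residual_score(queries, positives, negatives,
                          temperature=0.05, ridge=1.0):
    """Hypothetical score for one negative source (illustration only).

    Each row i is a (query, labeled positive, candidate negative) triple
    of frozen-encoder embeddings. Nothing is trained here.
    """
    q = l2_normalize(np.asarray(queries, dtype=float))
    p = l2_normalize(np.asarray(positives, dtype=float))
    n = l2_normalize(np.asarray(negatives, dtype=float))

    # Assumed residual: the part of the negative that the positive
    # does not already explain (projection onto p removed).
    residual = n - np.sum(n * p, axis=1, keepdims=True) * p

    # Assumed hardness weight: negatives that score close to (or above)
    # the positive for the same query receive more weight.
    margin = np.sum(q * n, axis=1) - np.sum(q * p, axis=1)
    weights = 1.0 / (1.0 + np.exp(-margin / temperature))

    # Weighted residual matrix plus a ridge term so it stays invertible.
    dim = residual.shape[1]
    info = ridge * np.eye(dim)
    info += (residual * weights[:, None]).T @ residual / len(weights)

    # Log-determinant relative to the ridge-only baseline; it grows when
    # the weighted residuals spread over more directions.
    _, logdet = np.linalg.slogdet(info)
    return logdet - dim * np.log(ridge)
```

Under these assumptions, one would compute the score once per candidate negative source on the same set of queries and positives and compare the resulting values. A source whose negatives coincide with the positives has zero residual and scores zero, while any source with nonzero residuals scores above zero.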