Optimizing What Matters: AUC-Driven Learning for Robust Neural Retrieval
September 30, 2025
Authors: Nima Sheikholeslami, Erfan Hosseini, Patrice Bechard, Srivatsava Daruru, Sai Rajeswar
cs.AI
Abstract
Dual-encoder retrievers depend on the principle that relevant documents
should score higher than irrelevant ones for a given query. Yet the dominant
Noise Contrastive Estimation (NCE) objective, which underpins Contrastive Loss,
optimizes a softened ranking surrogate that we rigorously prove is
fundamentally oblivious to score separation quality and unrelated to AUC. This
mismatch leads to poor calibration and suboptimal performance in downstream
tasks like retrieval-augmented generation (RAG). To address this fundamental
limitation, we introduce the MW loss, a new training objective that maximizes
the Mann-Whitney U statistic, which is mathematically equivalent to the Area
under the ROC Curve (AUC). MW loss encourages each positive-negative pair to be
correctly ranked by minimizing binary cross-entropy over score differences. We
provide theoretical guarantees that MW loss directly upper-bounds one minus the
AUC (the pairwise ranking error), better aligning optimization with retrieval
goals. We further promote ROC curves and AUC as natural, threshold-free
diagnostics for evaluating retriever
calibration and ranking quality. Empirically, retrievers trained with MW loss
consistently outperform contrastive counterparts in AUC and standard retrieval
metrics. Our experiments show that MW loss is an empirically superior
alternative to Contrastive Loss, yielding better-calibrated and more
discriminative retrievers for high-stakes applications like RAG.
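The pairwise objective described in the abstract can be sketched in a few lines: binary cross-entropy over positive-negative score differences (the MW loss) alongside the normalized Mann-Whitney U statistic (AUC). This is a minimal illustrative sketch in NumPy, not the authors' implementation; the function names and the exact tie-handling convention are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mw_loss(pos_scores, neg_scores):
    # Binary cross-entropy on every pairwise difference s_pos - s_neg,
    # with target 1: each positive should outrank each negative.
    diffs = pos_scores[:, None] - neg_scores[None, :]
    return -np.mean(np.log(sigmoid(diffs)))

def auc(pos_scores, neg_scores):
    # Mann-Whitney U statistic normalized to [0, 1]: the fraction of
    # correctly ordered positive-negative pairs (ties counted as 0.5).
    diffs = pos_scores[:, None] - neg_scores[None, :]
    return np.mean((diffs > 0) + 0.5 * (diffs == 0))
```

Because -log(sigmoid(d)) >= log(2) whenever d <= 0, the mean MW loss upper-bounds log(2) times the pairwise ranking error (one minus AUC), which is the bounding relationship the abstract refers to.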