Comparing Linear Probes with Mahalanobis Cosine Similarity

Abstract

Linear probes are widely used in interpretability research and are often compared by cosine similarity. The Mahalanobis cosine similarity (MCS) between two directions, which reweights the inner product by the test-data covariance, is a natural task-aware refinement. Ying et al. (2026) report that a probe's MCS to a reference probe trained on the out-of-distribution (OOD) data is a near-perfect linear predictor of the probe's OOD AUROC (R^2 = 0.98). Here, we extend this empirical finding across models, layers, and concept domains, and prove the general phenomenon in closed form: for balanced classes whose projections are Gaussian, OOD AUROC and MCS to the reference probe are linearly related because both are sigmoid-shaped functions of the probe's signal-to-noise ratio (SNR) on the test data. The theory also predicts when this linearity fails, which we verify empirically. MCS offers a theoretically grounded and empirically effective alternative to Euclidean cosine similarity for comparing linear probes.
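
As a minimal illustration of the quantity being compared, the sketch below computes a covariance-weighted cosine similarity next to the ordinary Euclidean one. It assumes that MCS takes the form u^T Σ v / sqrt((u^T Σ u)(v^T Σ v)), with Σ the test-data covariance, which is one reading of "reweights the inner product by the test-data covariance"; the exact definition is not given in this abstract. The function names and the toy covariance are hypothetical.

```python
import numpy as np

def euclidean_cosine(u, v):
    """Ordinary cosine similarity between two probe directions."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def mahalanobis_cosine(u, v, cov):
    """Cosine similarity under the inner product <u, v> = u^T cov v.

    Assumption: `cov` is the covariance of the test data; the abstract
    does not state the exact weighting used.
    """
    uv = u @ cov @ v
    uu = u @ cov @ u
    vv = v @ cov @ v
    return uv / np.sqrt(uu * vv)

# Toy data (hypothetical): the first coordinate carries almost all variance.
cov = np.diag([100.0, 1.0])
u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])

cos_euclid = euclidean_cosine(u, v)       # treats both coordinates equally
cos_mcs = mahalanobis_cosine(u, v, cov)   # weights the high-variance axis more
```

In this toy case the two probes disagree only along a low-variance coordinate, so the covariance-weighted similarity is much closer to 1 than the Euclidean one. With an identity covariance the two measures coincide.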