Comparing Linear Probes with Mahalanobis Cosine Similarity

Abstract

Linear probes are widely used in interpretability research and are often compared by cosine similarity. The Mahalanobis cosine similarity (MCS) between two directions, which reweights the inner product by the covariance of the test data, is a natural task-aware refinement. Ying et al. (2026) report that a probe's MCS to a reference probe trained on out-of-distribution (OOD) data predicts the probe's OOD AUROC almost perfectly linearly (R² = 0.98). Here, we extend this empirical finding across models, layers, and concept domains, and prove the general phenomenon in closed form: for balanced classes whose projections are Gaussian, OOD AUROC and MCS to the reference probe are linearly related because both are sigmoid-shaped functions of the probe's signal-to-noise ratio (SNR) on the test data. The theory also predicts when this linearity fails, which we verify empirically. MCS offers a theoretically grounded and empirically effective alternative to Euclidean cosine similarity for comparing linear probes.
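
As an illustration of the quantity described above, the following is a minimal sketch of MCS, assuming it takes the form u^T Σ v / sqrt((u^T Σ u)(v^T Σ v)) with Σ the covariance of the test activations. The function names, the synthetic activations, and the two probe directions are hypothetical and chosen only to show how MCS can differ from Euclidean cosine similarity; they are not taken from the paper.

```python
import numpy as np


def mahalanobis_cosine(u, v, cov):
    """Cosine similarity under the inner product <u, v> = u^T cov v.

    Assumed form of MCS; `cov` stands in for the test-data covariance.
    """
    uv = u @ cov @ v
    uu = u @ cov @ u
    vv = v @ cov @ v
    return uv / np.sqrt(uu * vv)


def euclidean_cosine(u, v):
    """Ordinary cosine similarity, i.e. MCS with an identity covariance."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Hypothetical test activations with strongly anisotropic variance
# (per-dimension standard deviations of 10, 1, and 0.1).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3)) * np.array([10.0, 1.0, 0.1])
cov = np.cov(X, rowvar=False)  # rows are samples, columns are features

# Two hypothetical probe directions that are orthogonal in Euclidean terms.
u = np.array([1.0, 1.0, 0.0])
v = np.array([1.0, -1.0, 0.0])

print("Euclidean cosine:", euclidean_cosine(u, v))
print("Mahalanobis cosine:", mahalanobis_cosine(u, v, cov))
```

In this toy setting the two directions are orthogonal under the Euclidean inner product, yet their projections of the data are dominated by the same high-variance axis, so the covariance-weighted similarity is close to 1. With an identity covariance the two measures coincide.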