Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval

Abstract

Transferring knowledge from a cross-encoder teacher via Knowledge Distillation (KD) has become a standard paradigm for training retrieval models. Existing studies have largely focused on mining hard negatives to improve discrimination, while the systematic composition of training data and the resulting teacher score distribution have received comparatively little attention. In this work, we highlight that focusing solely on hard negatives prevents the student from learning the teacher's full preference structure, which can hamper generalization. To emulate the teacher score distribution effectively, we propose a Stratified Sampling strategy that uniformly covers the entire score spectrum. Experiments on in-domain and out-of-domain benchmarks confirm that Stratified Sampling, which preserves the variance and entropy of teacher scores, serves as a robust baseline, significantly outperforming top-K and random sampling in diverse settings. These findings suggest that the essence of distillation lies in preserving the diverse range of relative scores perceived by the teacher.
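The abstract does not spell out how the strata are formed, so the following is only a minimal sketch of one plausible reading: split the teacher-score range of a query's candidates into `k` equal-width bins and draw one candidate from each non-empty bin. The function name `stratified_sample`, the equal-width binning, and the one-per-bin rule are all assumptions made for illustration, not the paper's confirmed procedure.

```python
import random
from typing import List, Sequence


def stratified_sample(scores: Sequence[float], k: int, seed: int = 0) -> List[int]:
    """Return indices of up to k candidates spread across the teacher-score range.

    Hypothetical helper: equal-width bins, one draw per non-empty bin.
    """
    if k <= 0 or not scores:
        return []
    rng = random.Random(seed)
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are identical, so strata carry no information;
        # fall back to plain random sampling.
        return rng.sample(range(len(scores)), min(k, len(scores)))
    width = (hi - lo) / k
    # Assign each candidate index to one of k equal-width score bins.
    bins: List[List[int]] = [[] for _ in range(k)]
    for idx, s in enumerate(scores):
        b = min(int((s - lo) / width), k - 1)  # clamp the maximum score into the last bin
        bins[b].append(idx)
    # Draw one candidate from every non-empty bin (may return fewer than k).
    return [rng.choice(b) for b in bins if b]


# Toy example: 100 candidates with teacher scores 0.00 .. 0.99.
scores = [i / 100 for i in range(100)]
picked = stratified_sample(scores, k=5, seed=1)
top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]
print(sorted(scores[i] for i in picked))  # spread over the whole score range
print(sorted(scores[i] for i in top_k))   # concentrated at the high end
```

In this toy setting the top-K selection keeps only scores between 0.95 and 0.99, while the binned selection keeps one score from each fifth of the range, which is the kind of variance-preserving coverage the abstract describes.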
