Is Positional Bias in Dense Retrievers Built In or Learned from Data?

Abstract

Dense retrievers exhibit positional bias: they favor documents whose query-relevant information appears near the beginning, and retrieval performance degrades when that information appears later. While prior work on positional bias in dense retrievers has largely focused on architectural explanations, we study how the positional distribution of evidence in the training data affects the direction of retrieval-level bias. To test this, we construct synthetic position-targeted training sets in which query-relevant evidence appears at the beginning, middle, or end of documents, and fine-tune eight architecturally diverse pretrained models under position-skewed and position-balanced training distributions. At the ranking level, we observe a strong directional pattern across the examined models: a model trained on a skewed distribution favors evidence at the corresponding position. Position-balanced training reduces positional sensitivity by 57% to 87% on position-aware benchmarks while maintaining competitive mean retrieval performance in our controlled setting. Representation-level analyses further suggest that fine-tuning often reshapes learned positional preferences, although pre-existing architectural or pretraining-specific tendencies persist in some models. These results identify the positional distribution of training evidence as a major controllable factor in retrieval-level positional bias and suggest balanced data curation as a practical mitigation strategy.
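
The position-targeted construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `place_evidence` and `build_training_set`, the use of filler sentences, the sentence-level granularity, and the cyclic assignment of positions are all assumptions made for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

# The three evidence positions named in the abstract.
POSITIONS = ("beginning", "middle", "end")


def place_evidence(evidence: str, filler: List[str], position: str) -> str:
    """Insert one evidence sentence among filler sentences (hypothetical helper)."""
    if position not in POSITIONS:
        raise ValueError(f"unknown position: {position!r}")
    if position == "beginning":
        idx = 0
    elif position == "middle":
        idx = len(filler) // 2
    else:  # "end"
        idx = len(filler)
    sentences = filler[:idx] + [evidence] + filler[idx:]
    return " ".join(sentences)


def build_training_set(
    pairs: Sequence[Tuple[str, str]],
    filler: List[str],
    positions: Sequence[str],
) -> List[Dict[str, str]]:
    """Assign positions cyclically (an assumed scheme, not taken from the paper).

    Passing a single position gives a skewed set; passing all three gives a
    balanced set when the number of pairs is a multiple of three.
    """
    examples = []
    for i, (query, evidence) in enumerate(pairs):
        position = positions[i % len(positions)]
        examples.append(
            {
                "query": query,
                "document": place_evidence(evidence, filler, position),
                "position": position,
            }
        )
    return examples
```

Under these assumptions, `build_training_set(pairs, filler, ("beginning",))` would yield a beginning-skewed set, and `build_training_set(pairs, filler, POSITIONS)` a balanced one.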