Is Positional Bias in Dense Retrievers Built In or Learned from Data?

Abstract

Dense retrievers exhibit positional bias: they rank documents higher when the query-relevant information appears near the beginning, and retrieval performance degrades when that information appears later. While prior work on positional bias in dense retrievers has largely focused on architectural explanations, we study how the positional distribution of evidence in training data affects the direction of retrieval-level bias. To test this, we construct synthetic position-targeted training sets in which query-relevant evidence appears at the beginning, middle, or end of documents, and fine-tune eight architecturally diverse pretrained models under position-skewed and position-balanced training distributions. At the ranking level, we observe a strong directional pattern across the examined models: a skewed training distribution leads the model to favor evidence at the corresponding position. Position-balanced training reduces positional sensitivity by 57% to 87% on position-aware benchmarks while keeping mean retrieval performance competitive in our controlled setting. Representation-level analyses further suggest that fine-tuning often reshapes learned positional preferences, although pre-existing architectural or pretraining-specific tendencies persist in some models. These results identify the training-position distribution as a major controllable factor in retrieval-level positional bias and suggest balanced data curation as a practical mitigation strategy.
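
The data construction described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`place_evidence`, `build_training_set`, `position_sensitivity`), the sentence-level insertion scheme, and the max-minus-min sensitivity proxy are all assumptions made here for illustration. The abstract does not specify how evidence is placed or how positional sensitivity is measured.

```python
import random

# Assumed position labels; the abstract names beginning, middle, and end.
POSITIONS = ("beginning", "middle", "end")


def place_evidence(evidence, fillers, position):
    """Return a list of sentences with `evidence` inserted at the target position.

    Hypothetical helper: `fillers` are query-irrelevant sentences.
    """
    if position == "beginning":
        idx = 0
    elif position == "middle":
        idx = len(fillers) // 2
    elif position == "end":
        idx = len(fillers)
    else:
        raise ValueError(f"unknown position: {position!r}")
    return fillers[:idx] + [evidence] + fillers[idx:]


def build_training_set(examples, weights, seed=0):
    """Sample an evidence position per example according to `weights`.

    `examples` is a list of (query, evidence, fillers) tuples.
    `weights` maps a position name to its sampling weight; a skewed set
    puts all weight on one position, a balanced set uses equal weights.
    """
    rng = random.Random(seed)
    probs = [weights.get(p, 0.0) for p in POSITIONS]
    records = []
    for query, evidence, fillers in examples:
        position = rng.choices(POSITIONS, weights=probs, k=1)[0]
        document = " ".join(place_evidence(evidence, fillers, position))
        records.append({"query": query, "document": document, "position": position})
    return records


def position_sensitivity(scores_by_position):
    """Assumed proxy: spread (max minus min) of a retrieval metric across positions."""
    values = list(scores_by_position.values())
    return max(values) - min(values)
```

Under these assumptions, a position-skewed set would be built with weights such as `{"end": 1.0}` and a balanced set with `{"beginning": 1, "middle": 1, "end": 1}`. Each resulting set would then be used to fine-tune a retriever, and a per-position metric would be compared across the trained models.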