Is Positional Bias in Dense Retrievers Inherent or Learned from Data?

Abstract

Dense retrievers exhibit positional bias: they favor documents whose query-relevant information appears near the beginning, and retrieval performance degrades when that information appears later. While prior work on positional bias in dense retrievers has largely focused on architectural explanations, we study how the positional distribution of evidence in training data affects the direction of retrieval-level bias. To test this, we construct synthetic position-targeted training sets in which query-relevant evidence appears at the beginning, middle, or end of documents, and fine-tune eight architecturally diverse pretrained models under position-skewed and position-balanced training distributions. At the ranking level, we observe a strong and consistent directional pattern across all examined models: a skewed training distribution biases the model toward evidence at the corresponding position. Position-balanced training reduces positional sensitivity by 57% to 87% on position-aware benchmarks while maintaining competitive mean retrieval performance in our controlled setting. Representation-level analyses further suggest that fine-tuning often reshapes learned positional preferences, although pre-existing architectural or pretraining-specific tendencies persist in some models. These results identify the positional distribution of training data as a major controllable factor in retrieval-level position bias and suggest balanced data curation as a practical mitigation strategy.
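
To make the position-targeted construction concrete, the following Python sketch shows one way such training sets could be assembled. It is an illustration under stated assumptions, not the pipeline used in this work: the function names, the use of filler sentences as distractor context, and the round-robin scheme for the balanced distribution are all hypothetical choices made for the example.

```python
import random

# The three evidence positions named in the abstract.
POSITIONS = ("beginning", "middle", "end")


def place_evidence(evidence, fillers, position):
    """Return a document string with `evidence` inserted among `fillers`.

    Hypothetical helper: the real construction may differ.
    """
    if position == "beginning":
        index = 0
    elif position == "middle":
        index = len(fillers) // 2
    elif position == "end":
        index = len(fillers)
    else:
        raise ValueError(f"unknown position: {position}")
    sentences = fillers[:index] + [evidence] + fillers[index:]
    return " ".join(sentences)


def build_training_set(examples, fillers, mode, seed=0):
    """Build (query, document) records under a skewed or balanced distribution.

    `mode` is one of POSITIONS (position-skewed) or "balanced".
    Balancing here is a simple round-robin over positions (an assumption).
    """
    rng = random.Random(seed)
    records = []
    for i, (query, evidence) in enumerate(examples):
        if mode == "balanced":
            position = POSITIONS[i % len(POSITIONS)]
        else:
            position = mode
        shuffled = fillers[:]  # copy so the caller's list is not mutated
        rng.shuffle(shuffled)  # vary the distractor order per document
        records.append({
            "query": query,
            "document": place_evidence(evidence, shuffled, position),
            "position": position,
        })
    return records
```

Calling `build_training_set(examples, fillers, "end")` would yield a set skewed toward late evidence, while `"balanced"` spreads evidence evenly over the three positions; each resulting set would then be used to fine-tune a retriever separately.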