

Influence Guided Sampling for Domain Adaptation of Text Retrievers

January 29, 2026
Authors: Meet Doshi, Vishwajeet Kumar, Yulong Li, Jaydeep Sen
cs.AI

Abstract

General-purpose open-domain dense retrieval systems are usually trained on a large, eclectic mix of corpora and search tasks. How should these diverse corpora and tasks be sampled for training? Conventional approaches sample them uniformly, in proportion to their instance counts, or rely on human expert supervision. It is well known that the training data sampling strategy can greatly affect model performance, yet how to find the optimal strategy has not been adequately studied for embedding models. We propose Inf-DDS, a novel reinforcement-learning-driven sampling framework that adaptively reweights training datasets guided by influence-based reward signals and is much lighter in GPU consumption. Our technique iteratively refines the sampling policy, prioritizing datasets that maximize model performance on a target development set. We evaluate the efficacy of our sampling strategy on a wide range of text retrieval tasks, demonstrating strong improvements in retrieval performance and better adaptation compared to existing gradient-based sampling methods, while also being 1.5x to 4x cheaper in GPU compute. Our sampling strategy achieves a 5.03 absolute NDCG@10 improvement when training a multilingual bge-m3 model, and an absolute NDCG@10 improvement of 0.94 when training all-MiniLM-L6-v2, even when starting from expert-assigned weights over a large pool of training datasets.
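The core loop the abstract describes — reweighting datasets from reward signals and renormalizing the sampling policy — can be sketched with a multiplicative-weights (EXP3-style) update. This is a minimal illustrative sketch, not the paper's actual algorithm: the function name, the learning rate, and the toy rewards are assumptions; in Inf-DDS the rewards would come from influence estimates on the target dev set.

```python
import math

def update_sampling_weights(weights, rewards, lr=0.5):
    """Multiplicative-weights update of per-dataset sampling
    probabilities from (influence-based) reward signals.

    weights: current sampling probabilities (sum to 1)
    rewards: per-dataset reward, e.g. estimated influence on dev-set metric
    Returns a new probability vector favoring high-reward datasets.
    """
    # Shift log-weights by the scaled rewards, then renormalize (softmax).
    logits = [math.log(w) + lr * r for w, r in zip(weights, rewards)]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy run: three training datasets, initially sampled uniformly.
weights = [1 / 3, 1 / 3, 1 / 3]
for step in range(5):
    # Fixed toy rewards for illustration only; a real reward would be
    # recomputed each round from influence on the target dev set.
    rewards = [0.8, 0.1, -0.2]
    weights = update_sampling_weights(weights, rewards)

# After a few rounds, the high-reward dataset dominates the policy.
print(weights)
```

The exponentiated update keeps all weights strictly positive, so no dataset is ever dropped outright — it is only down-weighted — which matches the adaptive, iterative reweighting the abstract describes.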
PDF · March 12, 2026