DREAM: Dense Retrieval Embeddings via Autoregressive Modeling

Abstract

Dense retrieval embedding models are a fundamental component of modern retrieval-based AI systems. Most dense retrievers are trained with contrastive objectives, which require labeled positive and negative documents for each query; such labels are often costly and difficult to obtain. In this work, we investigate whether the autoregressive next-token prediction objective of a large language model (LLM) can supervise dense retrieval instead. The intuition is simple: if a document contains information relevant to a query, conditioning on that document should make the target output easier for the LLM to predict. The key challenge is that the next-token prediction loss is computed inside the LLM, whereas the retriever is a separate embedding model. To bridge this gap, we propose DREAM (Dense Retrieval Embeddings via Autoregressive Modeling), which injects the query-document similarity scores produced by the retriever into selected attention heads of a frozen LLM. During training, these scores determine how much attention each candidate document receives while the LLM predicts the target output, so the resulting prediction loss sends gradients back to the retriever through the attention mechanism. We evaluate DREAM on the retrieval benchmarks BEIR and RTEB using embedding backbones ranging from 0.5B to 3B parameters. DREAM consistently outperforms existing baselines across these model scales. These results suggest that DREAM is a promising approach for training dense retrievers through autoregressive modeling.
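
The mechanism summarized in the abstract, retriever scores acting as attention weights so that a next-token loss can train the retriever, can be illustrated with a very small toy. The sketch below is not the DREAM implementation. It assumes a single attention head whose weights over three candidate documents are simply the softmax of the retriever scores, a fixed 2x2 output projection standing in for the frozen LLM, and a "retriever" reduced to one trainable query vector with fixed document embeddings. All names and values (DOC_EMBS, DOC_VALUES, OUT_PROJ, TARGET_ID, the learning rate) are hypothetical and chosen only for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy data (assumptions for illustration, not from the paper).
DOC_EMBS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # retriever document embeddings (fixed here)
DOC_VALUES = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]  # frozen per-document value vectors
OUT_PROJ = [[1.0, 0.0], [0.0, 1.0]]                # frozen output projection, vocabulary of 2
TARGET_ID = 0                                      # target token; only document 0 supports it

def forward(query_emb):
    """Retriever scores become attention weights; return loss, attention, token probs."""
    scores = [dot(query_emb, d) for d in DOC_EMBS]       # query-document similarity
    attn = softmax(scores)                               # injected attention over documents
    dim = len(DOC_VALUES[0])
    context = [sum(a * v[j] for a, v in zip(attn, DOC_VALUES)) for j in range(dim)]
    probs = softmax([dot(row, context) for row in OUT_PROJ])
    loss = -math.log(probs[TARGET_ID])                   # next-token prediction loss
    return loss, attn, probs

def grad_wrt_query(query_emb):
    """Analytic gradient of the loss with respect to the query embedding only."""
    _, attn, probs = forward(query_emb)
    # Cross-entropy backward: dL/dlogits = probs - one_hot(target)
    d_logits = [p - (1.0 if k == TARGET_ID else 0.0) for k, p in enumerate(probs)]
    dim = len(DOC_VALUES[0])
    d_context = [sum(d_logits[k] * OUT_PROJ[k][j] for k in range(len(OUT_PROJ)))
                 for j in range(dim)]
    d_attn = [dot(d_context, v) for v in DOC_VALUES]
    # Softmax backward: dL/ds_i = attn_i * (dL/dattn_i - sum_j attn_j * dL/dattn_j)
    baseline = dot(attn, d_attn)
    d_scores = [a * (g - baseline) for a, g in zip(attn, d_attn)]
    # Scores are dot products, so dL/dq = sum_i dL/ds_i * doc_emb_i
    return [sum(ds * d[j] for ds, d in zip(d_scores, DOC_EMBS))
            for j in range(len(query_emb))]

# Train only the query embedding; everything standing in for the LLM stays frozen.
query = [0.0, 0.0]
initial_loss, initial_attn, _ = forward(query)
for _ in range(100):
    grad = grad_wrt_query(query)
    query = [q - 0.5 * g for q, g in zip(query, grad)]
final_loss, final_attn, _ = forward(query)

print("loss before/after:", initial_loss, final_loss)
print("attention on document 0 before/after:", initial_attn[0], final_attn[0])
```

In this toy, the only path from the loss to the query vector runs through the attention weights, which is the property the abstract relies on. How DREAM selects heads, combines the scores with the LLM's own attention logits, and batches candidate documents is not specified in the abstract and is not modeled here.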