DREAM: Dense Retrieval Embeddings via Autoregressive Modeling

Abstract

Dense retrieval embedding models are a fundamental component of modern retrieval-based AI systems. Most dense retrievers are trained with contrastive objectives, which require labeled positive and negative document pairs that are often costly and difficult to obtain. In this work, we investigate whether the autoregressive next-token prediction objective of a large language model (LLM) can provide supervision for dense retrieval. The intuition is simple: if a document contains information relevant to a query, conditioning on that document should make the target output easier for the LLM to predict. A key challenge is that the next-token prediction loss is computed inside the LLM, while the retriever is a separate embedding model. To address this challenge, we propose DREAM (Dense Retrieval Embeddings via Autoregressive Modeling), which injects retriever-generated query-document similarity scores into selected attention heads of a frozen LLM. During training, these scores determine how much attention each candidate document receives while the LLM predicts the target output. The resulting prediction loss provides gradients for retriever training through the attention mechanism. We evaluate DREAM on the BEIR and RTEB retrieval benchmarks using embedding backbones ranging from 0.5B to 3B parameters. DREAM consistently outperforms existing baselines across these model scales. These results indicate that DREAM is a promising approach for training dense retrievers through autoregressive modeling.
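
To make the gradient path described in the abstract concrete, the following is a minimal sketch in pure Python, not the authors' implementation. It assumes a toy stand-in for the frozen LLM: a single attention head whose weights over candidate documents are the softmax of the retriever's scores, followed by a fixed output projection. The function name `loss_and_score_grad`, the value vectors, and the projection weights are all hypothetical and chosen only for illustration; the abstract does not specify how scores are injected or which heads are selected.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_score_grad(scores, doc_values, out_weights, target):
    """Toy single-head stand-in for a frozen LLM (hypothetical, not DREAM's code).

    scores      : retriever similarity s_i for each candidate document
    doc_values  : one frozen value vector v_i per document
    out_weights : frozen projection from context vector to vocabulary logits
    target      : index of the target token
    Returns the next-token loss and its gradient with respect to each score.
    """
    # Assumption: injected scores directly set the attention each document gets.
    attn = softmax(scores)
    dim = len(doc_values[0])
    context = [sum(a * v[d] for a, v in zip(attn, doc_values)) for d in range(dim)]
    logits = [sum(w * c for w, c in zip(row, context)) for row in out_weights]
    probs = softmax(logits)
    loss = -math.log(probs[target])

    # Manual backward pass: only the scores receive gradient; the LLM stays frozen.
    d_logits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    d_context = [sum(d_logits[k] * out_weights[k][d] for k in range(len(out_weights)))
                 for d in range(dim)]
    d_attn = [sum(g * x for g, x in zip(d_context, v)) for v in doc_values]
    mean = sum(a * g for a, g in zip(attn, d_attn))
    d_scores = [a * (g - mean) for a, g in zip(attn, d_attn)]  # softmax Jacobian
    return loss, d_scores

# Two candidate documents; document 0 carries the signal for target token 0.
doc_values = [[1.0, 0.0], [0.0, 1.0]]
out_weights = [[2.0, 0.0], [0.0, 2.0]]  # token k is favored by value dimension k
scores = [0.0, 0.0]                     # the retriever starts undecided
loss, grad = loss_and_score_grad(scores, doc_values, out_weights, target=0)
print(loss, grad)
```

With these toy values the gradient is negative for document 0 and positive for document 1, so a gradient-descent step on the scores raises the similarity of the document that helps predict the target. In a real system that gradient would continue backward into the embedding model that produced the scores.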