ReasonIR: Training Retrievers for Reasoning Tasks

April 29, 2025
Authors: Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, Luke Zettlemoyer
cs.AI

Abstract

We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, creates a challenging and relevant query along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 without a reranker and 36.9 nDCG@10 with a reranker on BRIGHT, a widely used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries, and it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can be easily extended to future LLMs; to this end, we open-source our code, data, and model.
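The pipeline described above pairs each document with a reasoning-intensive query and a hard negative that looks on-topic but does not help. The sketch below is a minimal illustration of that loop under stated assumptions, not the authors' released code: the `generate` helper, the prompt wording, and the `TrainingExample` fields are hypothetical placeholders for whatever LLM API and prompts are actually used.

```python
# Minimal sketch of per-document synthetic training data generation.
# Hypothetical: `generate` stands in for any LLM completion call; the
# prompts are illustrative, not the paper's actual prompts.

from dataclasses import dataclass

@dataclass
class TrainingExample:
    query: str          # reasoning-intensive query the document can answer
    positive: str       # the source document itself
    hard_negative: str  # plausibly related but ultimately unhelpful passage

def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call (e.g., an OpenAI-style API)."""
    raise NotImplementedError

def make_training_example(document: str) -> TrainingExample:
    # 1) Ask for a query that requires reasoning to connect to the document,
    #    rather than a short factual query with obvious lexical overlap.
    query = generate(
        "Write a challenging question that this document helps answer, "
        "but that does not share obvious keywords with it:\n" + document
    )
    # 2) Ask for a hard negative: superficially relevant to the query,
    #    yet ultimately unhelpful for answering it.
    hard_negative = generate(
        "Write a passage that looks relevant to the question below "
        "but does not actually help answer it:\n" + query
    )
    return TrainingExample(query=query, positive=document,
                           hard_negative=hard_negative)
```

Each `TrainingExample` then supplies one (query, positive, hard negative) triple for contrastive training, mixed with existing public retrieval data as the abstract describes.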
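For readers unfamiliar with the headline metric, nDCG@10 discounts the gain of each relevant result by the log of its rank and normalizes by the best achievable ordering, so a perfect ranking scores 1.0. A minimal, self-contained implementation of the standard formula (binary or graded relevance labels):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2)   # ranks are 0-indexed here
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the ideal (sorted) DCG."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Example: one relevant document ranked at position 3 scores 1/log2(4) = 0.5.
print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5
```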
