
ReasonIR: Training Retrievers for Reasoning Tasks

April 29, 2025
作者: Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, Luke Zettlemoyer
cs.AI

Abstract

We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, creates a challenging and relevant query along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state of the art of 29.9 nDCG@10 without a reranker and 36.9 nDCG@10 with a reranker on BRIGHT, a widely used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries, and it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can be easily extended to future LLMs; to this end, we open-source our code, data, and model.
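The BRIGHT scores above are reported in nDCG@10, the standard graded-relevance ranking metric. As a reference (this is the textbook definition, not code from the paper), a minimal computation looks like:

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for one ranked list of graded relevance labels.

    `relevances[i]` is the relevance grade of the document the
    retriever placed at rank i (0 = irrelevant).
    """
    def dcg(rels):
        # Discounted cumulative gain over the top-k positions.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))  # best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A ranking that places one relevant document at rank 1 and one at rank 3:
print(ndcg_at_k([1, 0, 1, 0, 0]))  # ≈ 0.92
```

Benchmark numbers such as 29.9 nDCG@10 are this per-query score (scaled to 0–100) averaged over all queries.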

