

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

May 5, 2026
作者: Yilun Zhao, Jinbiao Wei, Tingyu Song, Siyue Zhang, Chen Zhao, Arman Cohan
cs.AI

Abstract

Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is increasingly important for agentic search systems, where retrievers must provide complementary evidence across iterative search and synthesis. However, existing work remains limited on both evaluation and training: benchmarks such as BRIGHT provide narrow gold sets and evaluate retrievers in isolation, while synthetic training corpora often optimize single-passage relevance rather than evidence portfolio construction. We introduce BRIGHT-Pro, an expert-annotated benchmark that expands each query with multi-aspect gold evidence and evaluates retrievers under both static and agentic search protocols. We further construct RTriever-Synth, an aspect-decomposed synthetic corpus that generates complementary positives and positive-conditioned hard negatives, and use it to LoRA fine-tune RTriever-4B from Qwen3-Embedding-4B. Experiments across lexical, general-purpose, and reasoning-intensive retrievers show that aspect-aware and agentic evaluation expose behaviors hidden by standard metrics, while RTriever-4B substantially improves over its base model.
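The abstract's "multi-aspect gold evidence" evaluation can be illustrated with a small sketch: instead of crediting any relevant passage, a retrieval run is scored by how many distinct evidence aspects its top-k results cover. This is a minimal illustrative metric, not the benchmark's official scoring code; the aspect names and passage IDs below are invented for the example.

```python
# Hypothetical aspect-aware recall: gold evidence is grouped by aspect,
# and the run is credited once per aspect covered in the top-k results.

def aspect_recall_at_k(retrieved, gold_by_aspect, k=10):
    """Fraction of evidence aspects covered within the top-k results."""
    top_k = set(retrieved[:k])
    covered = sum(1 for passages in gold_by_aspect.values()
                  if top_k & set(passages))
    return covered / len(gold_by_aspect)

# Example: three aspects; the top-3 results cover two of them.
gold = {
    "mechanism": {"p1", "p2"},
    "counterexample": {"p7"},
    "background": {"p9"},
}
run = ["p2", "p4", "p9", "p5"]
score = aspect_recall_at_k(run, gold, k=3)  # covers mechanism and background
```

A single passage can satisfy several aspects at once, which is exactly the behavior standard per-passage recall cannot distinguish.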
PDF271May 8, 2026
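The "positive-conditioned hard negatives" used to fine-tune RTriever-4B imply a contrastive objective of the usual InfoNCE form: the query embedding is pulled toward its complementary positive and pushed away from negatives mined to resemble the positive. A minimal dependency-free sketch of that loss, with illustrative vectors and a temperature chosen only for the example:

```python
import math

def info_nce(q, pos, negs, temp=0.05):
    """Contrastive loss for one query: negative log-softmax of the
    positive's similarity against the hard negatives' similarities."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    logits = [cos(q, pos) / temp] + [cos(q, n) / temp for n in negs]
    m = max(logits)  # stabilize the log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

# Toy 2-d embeddings: positive aligned with the query, negatives less so.
q = [1.0, 0.0]
pos = [0.9, 0.1]
hard_negs = [[0.5, 0.8], [-0.2, 1.0]]
loss = info_nce(q, pos, hard_negs)  # small: positive dominates the softmax
```

In practice this loss would be applied over batches of the synthetic corpus with the LoRA adapters of the base embedding model as the only trainable parameters.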