
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

May 5, 2026
作者: Yilun Zhao, Jinbiao Wei, Tingyu Song, Siyue Zhang, Chen Zhao, Arman Cohan
cs.AI

Abstract

Reasoning-intensive retrieval aims to surface evidence that supports downstream reasoning rather than merely matching topical similarity. This capability is increasingly important for agentic search systems, where retrievers must provide complementary evidence across iterative search and synthesis. However, existing work remains limited on both evaluation and training: benchmarks such as BRIGHT provide narrow gold sets and evaluate retrievers in isolation, while synthetic training corpora often optimize single-passage relevance rather than evidence portfolio construction. We introduce BRIGHT-Pro, an expert-annotated benchmark that expands each query with multi-aspect gold evidence and evaluates retrievers under both static and agentic search protocols. We further construct RTriever-Synth, an aspect-decomposed synthetic corpus that generates complementary positives and positive-conditioned hard negatives, and use it to LoRA fine-tune RTriever-4B from Qwen3-Embedding-4B. Experiments across lexical, general-purpose, and reasoning-intensive retrievers show that aspect-aware and agentic evaluation expose behaviors hidden by standard metrics, while RTriever-4B substantially improves over its base model.
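The abstract describes fine-tuning a retriever on complementary positives and positive-conditioned hard negatives. A minimal sketch of the InfoNCE-style contrastive objective commonly used for such embedding fine-tuning is shown below; the function names are illustrative assumptions, not APIs from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(query, positive, hard_negatives, temperature=0.05):
    """InfoNCE loss: the positive passage is scored against hard negatives.

    Lower loss means the query embedding separates the positive from the
    negatives more cleanly; minimizing this over a corpus of
    (query, positive, negatives) triples is the standard recipe for
    contrastive retriever training.
    """
    scores = [cosine(query, positive) / temperature]
    scores += [cosine(query, n) / temperature for n in hard_negatives]
    m = max(scores)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[0]  # -log softmax probability of the positive
```

In training, harder negatives (passages topically close to the positive yet unsupportive of the reasoning) raise the loss and sharpen the learned embedding space, which is the motivation for positive-conditioned negative mining.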