Revisiting Text Ranking in Deep Research
February 25, 2026
Authors: Chuan Meng, Litu Ou, Sean MacAvaney, Jeff Dalton
cs.AI
Abstract
Deep research has emerged as an important task that aims to address hard queries through extensive open-web exploration. To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it. Despite search's essential role in deep research, black-box web search APIs hinder systematic analysis of search components, leaving the behaviour of established text ranking methods in deep research largely unclear. To fill this gap, we reproduce a selection of key findings and best practices for IR text ranking methods in the deep research setting. In particular, we examine their effectiveness from three perspectives: (i) retrieval units (documents vs. passages), (ii) pipeline configurations (different retrievers, re-rankers, and re-ranking depths), and (iii) query characteristics (the mismatch between agent-issued queries and the training queries of text rankers). We perform experiments on BrowseComp-Plus, a deep research dataset with a fixed corpus, evaluating 2 open-source agents, 5 retrievers, and 3 re-rankers across diverse setups. We find that agent-issued queries typically follow web-search-style syntax (e.g., quoted exact matches), favouring lexical, learned sparse, and multi-vector retrievers; passage-level units are more efficient under limited context windows, and avoid the difficulties of document length normalisation in lexical retrieval; re-ranking is highly effective; translating agent-issued queries into natural-language questions significantly bridges the query mismatch.
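The pipeline perspectives examined in the abstract (passage-level retrieval units, retrieve-then-rerank with a configurable depth, and rewriting web-search-style agent queries into natural language) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the toy term-frequency scorer stands in for a real lexical retriever such as BM25, the `reranker` callable stands in for a neural re-ranker, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def split_into_passages(doc: str, size: int = 50) -> list[str]:
    # Passage-level units: fixed-size word windows, which sidestep
    # document length normalisation issues in lexical retrieval.
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def to_natural_language(query: str) -> str:
    # Hypothetical query rewrite: strip web-search operators such as
    # quoted exact matches before scoring with a ranker trained on
    # natural-language questions.
    return re.sub(r'"([^"]*)"', r"\1", query).strip()

def lexical_score(query: str, passage: str) -> float:
    # Toy term-frequency scorer standing in for BM25.
    q = Counter(query.lower().split())
    p = Counter(passage.lower().split())
    return sum(q[t] * math.log1p(p[t]) for t in q)

def retrieve_then_rerank(query, corpus, rerank_depth=10, reranker=None):
    # First-stage retrieval over passages, then re-rank only the
    # top `rerank_depth` candidates (the re-ranking depth knob).
    passages = [p for doc in corpus for p in split_into_passages(doc)]
    query = to_natural_language(query)
    ranked = sorted(passages, key=lambda p: lexical_score(query, p),
                    reverse=True)
    head, tail = ranked[:rerank_depth], ranked[rerank_depth:]
    if reranker is not None:
        head = sorted(head, key=lambda p: reranker(query, p), reverse=True)
    return head + tail
```

A real system would replace the scorer with one of the retrievers studied (lexical, learned sparse, dense, or multi-vector) and the `reranker` with a cross-encoder, but the control flow (unit splitting, query rewriting, depth-limited re-ranking) is the same.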