

SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

February 5, 2026
Authors: Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, Chen Zhao
cs.AI

Abstract

Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in instruction following and reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a 200,000-paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as the backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.
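The corpus-augmentation idea can be sketched as follows. This is a minimal illustration, not the paper's code: `bm25_scores` is a textbook BM25 implementation, and `augment` with its hard-coded keyword lists is a hypothetical stand-in for the LLM call that generates metadata and keywords for each document. The point it demonstrates is that keyword-oriented queries which miss the original wording can still match once the corpus is enriched.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against a whitespace-tokenized query with BM25."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()                      # document frequency per term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)                 # term frequency in this document
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Hypothetical stand-in for the LLM augmentation step: append generated
# keywords/metadata to the document text before indexing.
def augment(doc, keywords):
    return doc + " " + " ".join(keywords)

docs = [
    "Dense retrieval with contrastive pretraining for scientific papers.",
    "A survey of agent workflows for complex question answering.",
]
aug_docs = [
    augment(docs[0], ["embedding", "retriever", "literature", "search"]),
    augment(docs[1], ["deep", "research", "agent", "planning"]),
]

query = "literature search retriever"
print(bm25_scores(query, docs))      # original corpus: no lexical overlap
print(bm25_scores(query, aug_docs))  # augmented corpus: first paper now matches
```

On the original corpus the keyword query scores zero everywhere; after augmentation, the relevant paper surfaces, which mirrors why off-the-shelf BM25 benefits from corpus-level enrichment.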