

SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

February 5, 2026
Authors: Tiansheng Hu, Yilun Zhao, Canyu Zhang, Arman Cohan, Chen Zhao
cs.AI

Abstract

Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong instruction-following and reasoning capabilities. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains with a 200,000-paper retrieval corpus. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as the backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms the LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.
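The corpus-level augmentation idea can be illustrated with a minimal sketch: append keywords to each document's text before BM25 indexing, so that keyword-oriented sub-queries match. This is not the paper's implementation; the toy corpus, the hand-written keyword lists (standing in for LLM-generated ones), and the compact BM25 scorer below are all illustrative assumptions.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with standard BM25."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    # document frequency of each query term across the corpus
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Hypothetical corpus: (title, abstract text, LLM-generated keywords).
# Augmentation = concatenating the keywords onto the text before indexing.
corpus = [
    ("Paper A", "we study sparse retrieval", ["bm25", "lexical", "retrieval"]),
    ("Paper B", "we train a neural reranker", ["dense", "reranking"]),
]
docs = [(text + " " + " ".join(kw)).split() for _, text, kw in corpus]
query = "lexical retrieval".split()
ranked = sorted(zip([t for t, _, _ in corpus], bm25_scores(query, docs)),
                key=lambda x: -x[1])
print(ranked[0][0])  # Paper A ranks first: its augmented keywords match the query
```

Because the augmented keywords live in the same lexical index as the original text, an off-the-shelf sparse retriever benefits without any retraining.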