DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research
May 25, 2025
作者: João Coelho, Jingjie Ning, Jingyuan He, Kangrui Mao, Abhijay Paladugu, Pranav Setlur, Jiahe Jin, Jamie Callan, João Magalhães, Bruno Martins, Chenyan Xiong
cs.AI
Abstract
Deep research systems represent an emerging class of agentic information
retrieval methods that generate comprehensive, well-supported reports in
response to complex queries. However, most existing frameworks rely on dynamic commercial
search APIs, which pose reproducibility and transparency challenges in addition
to their cost. To address these limitations, we introduce DeepResearchGym, an
open-source sandbox that combines a reproducible search API with a rigorous
evaluation protocol for benchmarking deep research systems. The API indexes
large-scale public web corpora, namely ClueWeb22 and FineWeb, using a
state-of-the-art dense retriever and approximate nearest neighbor search via
DiskANN. It achieves lower latency than popular commercial APIs while ensuring
stable document rankings across runs, and is freely available for research use.
To evaluate deep research systems' outputs, we extend the Researchy Questions
benchmark with automatic metrics through LLM-as-a-judge assessments to measure
alignment with users' information needs, retrieval faithfulness, and report
quality. Experimental results show that systems integrated with DeepResearchGym
achieve performance comparable to those using commercial APIs, with performance
rankings remaining consistent across evaluation metrics. A human evaluation
study further confirms that our automatic protocol aligns with human
preferences, validating the framework's ability to support controlled
assessment of deep research systems. Our code and API documentation are
available at https://www.deepresearchgym.ai.
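The reproducibility claim above rests on a simple property: a fixed index queried with a fixed retriever returns the same ranking every time. The sketch below illustrates that interface with a toy in-memory index and exact cosine similarity; it is purely illustrative and assumes nothing about DeepResearchGym's actual API, which uses a dense retriever with DiskANN-based approximate nearest neighbor search over ClueWeb22 and FineWeb. All names and vectors here are made up.

```python
import math

# Toy stand-in for a dense-retrieval index. In DeepResearchGym the documents
# come from ClueWeb22/FineWeb and vectors from a state-of-the-art dense
# retriever; DiskANN approximates the nearest-neighbor search at scale.
DOCS = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.1, 0.9, 0.0],
    "doc-c": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def search(query_vec, k=2):
    """Return the top-k document ids by similarity. With a static index,
    the ranking is identical on every run -- the stability property the
    sandbox guarantees, unlike a live commercial search API."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d]), reverse=True)
    return ranked[:k]

print(search([1.0, 0.0, 0.0]))  # -> ['doc-a', 'doc-b']
```

Because the corpus snapshot and embeddings never change between runs, two identical queries always yield identical rankings, which is what makes benchmark results comparable across systems and over time.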