DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research
May 25, 2025
Authors: João Coelho, Jingjie Ning, Jingyuan He, Kangrui Mao, Abhijay Paladugu, Pranav Setlur, Jiahe Jin, Jamie Callan, João Magalhães, Bruno Martins, Chenyan Xiong
cs.AI
Abstract
Deep research systems represent an emerging class of agentic information
retrieval methods that generate comprehensive and well-supported reports in
response to complex queries. However, most existing frameworks rely on dynamic commercial
search APIs, which pose reproducibility and transparency challenges in addition
to their cost. To address these limitations, we introduce DeepResearchGym, an
open-source sandbox that combines a reproducible search API with a rigorous
evaluation protocol for benchmarking deep research systems. The API indexes
large-scale public web corpora, namely ClueWeb22 and FineWeb, using a
state-of-the-art dense retriever and approximate nearest neighbor search via
DiskANN. It achieves lower latency than popular commercial APIs while ensuring
stable document rankings across runs, and is freely available for research use.
To evaluate deep research systems' outputs, we extend the Researchy Questions
benchmark with automatic metrics through LLM-as-a-judge assessments to measure
alignment with users' information needs, retrieval faithfulness, and report
quality. Experimental results show that systems integrated with DeepResearchGym
achieve performance comparable to those using commercial APIs, with performance
rankings remaining consistent across evaluation metrics. A human evaluation
study further confirms that our automatic protocol aligns with human
preferences, validating the framework's ability to support controlled
assessment of deep research systems. Our code and API documentation are
available at https://www.deepresearchgym.ai.
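
The abstract's point about stable document rankings across runs can be illustrated with a toy dense-retrieval sketch. This is not the DeepResearchGym API and does not use DiskANN or the paper's retriever: it performs exact brute-force inner-product search over hand-made vectors. The function name `rank_documents`, the document IDs, and the embeddings are all hypothetical, chosen only to show how a deterministic tie-break keeps the ranking identical regardless of index order.

```python
from typing import Dict, List, Tuple


def rank_documents(
    query: List[float], docs: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the top-k (doc_id, score) pairs by inner product.

    Ties are broken by doc_id so the output does not depend on the
    order in which documents were inserted into the index.
    """
    scored = [
        (doc_id, sum(q * d for q, d in zip(query, vec)))
        for doc_id, vec in docs.items()
    ]
    # Sort by descending score, then ascending doc_id for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Hypothetical toy index: doc-a and doc-c share an embedding, so they tie.
docs = {
    "doc-b": [0.0, 1.0],
    "doc-a": [1.0, 0.0],
    "doc-c": [1.0, 0.0],
}
query = [1.0, 0.5]

first = rank_documents(query, docs, k=3)
# Rebuild the index in reverse insertion order to mimic a different run.
second = rank_documents(query, dict(reversed(list(docs.items()))), k=3)

print(first)  # [('doc-a', 1.0), ('doc-c', 1.0), ('doc-b', 0.5)]
```

A frozen corpus with a deterministic ranking rule is what makes results comparable across runs; a live commercial search API offers neither guarantee, since its index and ranking can change between calls.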