DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research

Abstract

Deep research systems represent an emerging class of agentic information retrieval methods that generate comprehensive and well-supported reports in response to complex queries. However, most existing frameworks rely on dynamic commercial search APIs, which pose reproducibility and transparency challenges in addition to their cost. To address these limitations, we introduce DeepResearchGym, an open-source sandbox that combines a reproducible search API with a rigorous evaluation protocol for benchmarking deep research systems. The API indexes large-scale public web corpora, namely ClueWeb22 and FineWeb, using a state-of-the-art dense retriever and approximate nearest neighbor search via DiskANN. It achieves lower latency than popular commercial APIs while ensuring stable document rankings across runs, and is freely available for research use. To evaluate deep research systems' outputs, we extend the Researchy Questions benchmark with automatic metrics through LLM-as-a-judge assessments to measure alignment with users' information needs, retrieval faithfulness, and report quality. Experimental results show that systems integrated with DeepResearchGym achieve performance comparable to those using commercial APIs, with performance rankings remaining consistent across evaluation metrics. A human evaluation study further confirms that our automatic protocol aligns with human preferences, validating the framework's ability to support controlled assessment of deep research systems. Our code and API documentation are available at https://www.deepresearchgym.ai.
