Deep Research in the Physical Sciences: A Multi-Agent Framework and a Comprehensive Benchmark

Abstract

Deep research agents are large language model (LLM)-based systems designed for autonomous, multi-step scientific reasoning, and they hold considerable potential for accelerating research in the physical sciences. However, comprehensive, fine-grained evaluations of their capabilities in this domain are still lacking. To address this gap, we introduce PhySciBench, a benchmark closely aligned with physical science research. It comprises 200 expert-curated questions, balanced between physics and chemistry, across six task categories that reflect real-world scientific workflows. Evaluations of state-of-the-art models and agent systems on PhySciBench reveal limited performance: even the strongest baseline, Gemini Deep Research, achieves an accuracy of only 33.5%. Analysis of failure cases identifies three recurrent deficiencies: fragility in extended reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. Motivated by these findings, we develop DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference cost to approximately one-third of that of the strongest baseline. These results establish PhySciBench as a critical benchmark for evaluating AI systems in the physical sciences and demonstrate that architectural specialization can effectively enhance the reliability of autonomous scientific research.