Deep Research in the Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

Abstract

Deep research agents are large language model (LLM)-based systems designed for autonomous, multi-step scientific reasoning, and they hold considerable potential for accelerating research in the physical sciences. However, comprehensive, in-depth evaluations of their capabilities in this domain are still lacking. To address this gap, we introduce PhySciBench, a benchmark closely aligned with physical science research. It comprises 200 expert-curated questions, balanced between physics and chemistry, across six task categories that reflect real-world scientific workflows. Evaluating state-of-the-art models and agent systems on PhySciBench reveals limited performance: even the strongest baseline, Gemini Deep Research, reaches only 33.5% accuracy. Analysis of failure cases identifies three recurrent deficiencies: fragility in extended reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. Motivated by these findings, we develop DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference cost to approximately one-third of that of the strongest baseline. These results establish PhySciBench as an important benchmark for evaluating AI systems in the physical sciences and demonstrate that architectural specialization can effectively improve the reliability of autonomous scientific research.