Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents

Abstract

Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4× improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.