AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experiments

Abstract

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically either follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, to adapt as experimental evidence changes, and to preserve knowledge of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before spending experimental compute, and share both successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents in biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, which spans biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch and continues to discover improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the prior state-of-the-art model by +12.5% in Spearman correlation. Applied without modification to all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% in Spearman correlation.
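
The propose, critique, run, and record cycle described above can be illustrated with a toy sketch. This is not the AutoScientists implementation: `SharedState`, `critique`, `step`, and the toy scores are hypothetical names and values introduced only for illustration. The sketch assumes that a higher score is better and that the critique step does nothing more than veto hypotheses already recorded in the shared state. Team formation around hypotheses is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class SharedState:
    """Hypothetical shared experimental state visible to every agent."""
    best_score: float                           # score of the current champion
    budget: int                                 # experiment runs still available
    tried: dict = field(default_factory=dict)   # hypothesis -> "accepted" | "rejected"


def critique(state, hypothesis):
    """Toy critique gate (assumption): veto anything already tried."""
    return hypothesis not in state.tried


def step(state, hypothesis, run_experiment):
    """Propose -> critique -> run -> record. Returns an outcome label."""
    if not critique(state, hypothesis):
        return "vetoed"                         # no compute spent on a duplicate
    if state.budget <= 0:
        return "out_of_budget"
    state.budget -= 1
    score = run_experiment(hypothesis)
    if score > state.best_score:                # assumes higher is better
        state.best_score = score
        state.tried[hypothesis] = "accepted"
    else:
        state.tried[hypothesis] = "rejected"    # failures are recorded, not discarded
    return state.tried[hypothesis]


# Toy usage: made-up scores stand in for real experiment runs.
toy_scores = {"warmup": 0.62, "bigger_batch": 0.55}
state = SharedState(best_score=0.60, budget=3)
outcomes = [
    step(state, h, toy_scores.get)
    for h in ["warmup", "bigger_batch", "bigger_batch"]
]
print(outcomes, state.budget, state.best_score)
```

In this toy run the second proposal of `bigger_batch` is vetoed before it consumes budget, because its earlier failure is already in the shared record. That is the sense in which sharing failures can reduce redundant exploration.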