Toward General Autonomous Research via Hypothesis Tree Refinement

Abstract

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
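
The loop described above (a persistent tree, a frontier of untested hypotheses, lessons propagated upward, and admission of verified improvements) can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not Arbor's implementation: the abstract does not specify any schema or API, so `HypothesisNode`, `frontier`, `record_result`, and `admit_if_improved` are hypothetical names, and the rule "admit when the held-out score is strictly higher" is an assumed stand-in for whatever verification Arbor actually performs.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HypothesisNode:
    """One hypothesis plus its evidence and insights (hypothetical schema)."""
    name: str
    parent: Optional[HypothesisNode] = None
    children: list[HypothesisNode] = field(default_factory=list)
    score: Optional[float] = None  # held-out metric once tested; None means untested
    insights: list[str] = field(default_factory=list)

    def add_child(self, name: str) -> HypothesisNode:
        """Attach a refined hypothesis beneath this one."""
        child = HypothesisNode(name=name, parent=self)
        self.children.append(child)
        return child


def frontier(root: HypothesisNode) -> list[HypothesisNode]:
    """Return untested leaves in left-to-right order: candidates to dispatch next."""
    stack, out = [root], []
    while stack:
        node = stack.pop()
        if not node.children and node.score is None:
            out.append(node)
        stack.extend(reversed(node.children))
    return out


def record_result(node: HypothesisNode, score: float, insight: str) -> None:
    """Store an executor's result and copy the lesson to every ancestor."""
    node.score = score
    node.insights.append(insight)
    ancestor = node.parent
    while ancestor is not None:
        ancestor.insights.append(f"[from {node.name}] {insight}")
        ancestor = ancestor.parent


def admit_if_improved(node: HypothesisNode, best_score: float) -> float:
    """Assumed admission rule: accept only a strictly higher held-out score."""
    if node.score is not None and node.score > best_score:
        return node.score
    return best_score
```

In this sketch a coordinator would repeatedly call `frontier` to pick hypotheses, hand each to an executor, call `record_result` when the executor returns, and use `admit_if_improved` to decide whether the result replaces the current best artifact. Keeping insights on ancestors is one simple way to make a lesson from one branch visible when planning sibling branches; the real system may propagate lessons quite differently.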