Toward Generalist Autonomous Research via Hypothesis Tree Refinement

Abstract

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting in which an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on every task, and its average relative held-out gain is more than 2.5x that of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
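To make the coordinator/executor loop over a persistent hypothesis tree more concrete, the following is a minimal sketch, not Arbor's implementation. The names `Node`, `frontier`, and `run`, the node fields, and the "expand the child of the best-scoring parent" selection rule are all assumptions introduced for illustration. In particular, the executor here is a plain callable rather than an agent working in an isolated worktree, and "admitting" an improvement is reduced to comparing a single score.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(eq=False)
class Node:
    """One hypothesis plus the evidence and lesson from testing it (hypothetical schema)."""
    hypothesis: str
    parent: Optional[Node] = None
    children: list[Node] = field(default_factory=list)
    score: Optional[float] = None   # evidence; None until an executor has tested the node
    insight: str = ""               # distilled lesson reported by the executor
    inherited: list[str] = field(default_factory=list)  # lessons propagated from ancestors

    def add_child(self, hypothesis: str) -> Node:
        # Propagate reusable lessons: a child sees every ancestor insight.
        lessons = self.inherited + ([self.insight] if self.insight else [])
        child = Node(hypothesis, parent=self, inherited=lessons)
        self.children.append(child)
        return child


def frontier(root: Node) -> list[Node]:
    """Return all untested nodes in breadth-first order."""
    queue, open_nodes = [root], []
    while queue:
        node = queue.pop(0)
        if node.score is None:
            open_nodes.append(node)
        queue.extend(node.children)
    return open_nodes


def run(
    root: Node,
    propose: Callable[[Node], list[str]],
    execute: Callable[[Node], tuple[float, str]],
    budget: int,
) -> Node:
    """Coordinator loop: pick a frontier node, test it, update the tree."""
    best = root  # the root holds the initial artifact and its baseline score
    for h in propose(root):
        root.add_child(h)
    for _ in range(budget):
        open_nodes = frontier(root)
        if not open_nodes:
            break
        # Assumed strategy: prefer hypotheses whose parent scored highest.
        node = max(open_nodes, key=lambda n: n.parent.score)
        node.score, node.insight = execute(node)  # executor tests one hypothesis
        if node.score > best.score:
            best = node  # admit the improvement and refine beneath it
            for h in propose(node):
                node.add_child(h)
    return best


# Toy stand-ins for the proposer and executor (made-up scores, not results).
SCORES = {"a": 0.6, "b": 0.4, "a+x": 0.7}
IDEAS = {"baseline": ["a", "b"], "a": ["a+x"]}

root = Node("baseline", score=0.5)
best = run(
    root,
    propose=lambda n: IDEAS.get(n.hypothesis, []),
    execute=lambda n: (SCORES[n.hypothesis], f"{n.hypothesis} tested"),
    budget=10,
)
print(best.hypothesis, best.score, best.inherited)
```

The point of the sketch is the separation of concerns: the tree is the only long-lived state, the coordinator reads and updates it, and each executor call sees one node plus the lessons inherited from its ancestors.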