Toward General Autonomous Research via Hypothesis Tree Refinement

Abstract

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
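To make the structure described above concrete, the following is a minimal sketch of a hypothesis tree and a single coordinator step. It is an illustration under stated assumptions, not Arbor's implementation: the names `HypothesisNode`, `run_step`, and `toy_executor`, the status labels, the "higher score is better" convention, and the rule that a result is admitted only when it beats the current best are all hypothetical choices made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class HypothesisNode:
    """One tree node: a hypothesis plus what was learned from testing it."""
    hypothesis: str
    parent: Optional[HypothesisNode] = None
    children: list[HypothesisNode] = field(default_factory=list)
    artifact: Optional[str] = None   # assumed: an id for the executor's worktree
    score: Optional[float] = None    # evidence; assumed higher is better
    insight: Optional[str] = None    # distilled lesson from this attempt
    status: str = "open"             # assumed labels: open / admitted / rejected

    def add_child(self, hypothesis: str) -> HypothesisNode:
        """Attach a refinement of this hypothesis as a new open node."""
        child = HypothesisNode(hypothesis, parent=self)
        self.children.append(child)
        return child

    def inherited_insights(self) -> list[str]:
        """Collect lessons from all ancestors, ordered root first."""
        lessons = []
        node = self.parent
        while node is not None:
            if node.insight:
                lessons.append(node.insight)
            node = node.parent
        return lessons[::-1]


# An executor receives a node and the inherited lessons, and returns
# (artifact id, measured score, distilled insight). Hypothetical signature.
Executor = Callable[[HypothesisNode, list[str]], tuple[str, float, str]]


def run_step(node: HypothesisNode, executor: Executor, best: float) -> float:
    """Test one hypothesis, record the evidence, and admit it if it improves."""
    artifact, score, insight = executor(node, node.inherited_insights())
    node.artifact, node.score, node.insight = artifact, score, insight
    node.status = "admitted" if score > best else "rejected"
    return max(best, score)


def toy_executor(node: HypothesisNode, lessons: list[str]) -> tuple[str, float, str]:
    """Stand-in for a real experiment; the scores are made-up toy values."""
    toy_scores = {"raise learning rate": 0.69, "add warmup": 0.78}
    score = toy_scores[node.hypothesis]
    return f"worktree-{len(lessons)}", score, f"{node.hypothesis} gave {score:.2f}"


root = HypothesisNode("baseline", score=0.70,
                      insight="baseline gave 0.70", status="admitted")
a = root.add_child("raise learning rate")
best = run_step(a, toy_executor, 0.70)   # 0.69 does not beat 0.70: rejected
b = a.add_child("add warmup")
best = run_step(b, toy_executor, best)   # 0.78 beats 0.70: admitted
print(a.status, b.status, best)
print(b.inherited_insights())
```

In this toy run the rejected node still contributes its insight to its descendant, which is the sense in which lessons are carried forward rather than lost between attempts. A fuller version would also need frontier selection and isolated execution, which this sketch deliberately omits.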