Toward General-Purpose Autonomous Research via Hypothesis Tree Refinement

Abstract

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
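
The coordinator, executor, and tree loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Arbor's implementation: the abstract does not specify the node schema, the frontier selection rule, or the admission test, so `Node`, `frontier`, `run`, and the stub executor below are all hypothetical names, and the depth-first selection order is an arbitrary placeholder for whatever strategy the coordinator actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """One hypothesis plus what was learned from testing it (hypothetical schema)."""
    hypothesis: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    score: Optional[float] = None    # evidence: a held-out metric, None until tested
    insight: Optional[str] = None    # distilled lesson recorded after the test

    def add_child(self, hypothesis: str) -> "Node":
        child = Node(hypothesis, parent=self)
        self.children.append(child)
        return child

    def inherited_insights(self) -> List[str]:
        """Lessons recorded on this node's ancestors, root first."""
        lessons, node = [], self.parent
        while node is not None:
            if node.insight:
                lessons.append(node.insight)
            node = node.parent
        return lessons[::-1]


def frontier(root: Node) -> List[Node]:
    """Untested nodes in depth-first preorder (assumed stand-in for the search frontier)."""
    stack, pending = [root], []
    while stack:
        node = stack.pop()
        if node.score is None:
            pending.append(node)
        stack.extend(reversed(node.children))
    return pending


# An executor takes a hypothesis and the inherited lessons, returns (score, insight).
Executor = Callable[[str, List[str]], Tuple[float, str]]


def run(root: Node, executor: Executor, budget: int) -> float:
    """Coordinator loop: test frontier nodes, record evidence, admit only improvements."""
    best = root.score  # the root holds the initial artifact's baseline score
    for _ in range(budget):
        pending = frontier(root)
        if not pending:
            break
        node = pending[0]
        node.score, node.insight = executor(node.hypothesis, node.inherited_insights())
        if node.score > best:  # assumed admission rule: strictly better than current best
            best = node.score
    return best


# Usage with a stub executor; the scores are made-up illustration values.
stub_scores = {"A": 0.75, "A1": 0.80, "B": 0.68}
root = Node("baseline", score=0.70, insight="start")
a = root.add_child("A")
b = root.add_child("B")
a1 = a.add_child("A1")
best = run(root, lambda h, lessons: (stub_scores[h], f"lesson from {h}"), budget=3)
```

The point of the sketch is the cumulative part: each executor call receives the lessons recorded on its ancestors, so a later attempt starts from earlier evidence rather than from scratch. In a real system the executor would run in an isolated worktree and the tree would be persisted between runs; both are omitted here.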