Toward Autonomous Long-Horizon Engineering for ML Research

Abstract

Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon ML research engineering built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace. A top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves the PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance: removing it reduces the PaperBench score by 6.41 points and the MLE-Bench Lite result by 31.82 points. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.
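The File-as-Bus idea can be made concrete with a small sketch. The Python below is not AiScientist's implementation; it is a minimal illustration, under our own assumptions, of a shared workspace in which each role may write only to its own directories, agents re-ground by reading files rather than chat history, and the orchestrator keeps only short summaries plus a workspace map. All names (`Workspace`, `Orchestrator`, `write_scopes`, `analyst`, `planner`, the directory layout) are hypothetical.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Workspace:
    """Shared directory acting as the 'bus' between agents (hypothetical)."""
    root: Path
    # Assumed permission model: role name -> top-level directories it may write.
    write_scopes: dict[str, set[str]]

    def write(self, role: str, rel_path: str, text: str) -> Path:
        root = self.root.resolve()
        target = (root / rel_path).resolve()
        # relative_to raises ValueError if the path escapes the workspace root.
        top = target.relative_to(root).parts[0]
        if top not in self.write_scopes.get(role, set()):
            raise PermissionError(f"{role} may not write under {top}/")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
        return target

    def read(self, rel_path: str) -> str:
        # Reads are unrestricted in this sketch; only writes are scoped.
        return (self.root / rel_path).read_text(encoding="utf-8")

    def workspace_map(self) -> list[str]:
        """Sorted list of artifact paths: the orchestrator's view of state."""
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )


@dataclass
class Orchestrator:
    """Thin control layer: stores one-line summaries, never artifact bodies."""
    workspace: Workspace
    summaries: list[str] = field(default_factory=list)

    def run_stage(self, role, agent) -> None:
        summary = agent(role, self.workspace)
        self.summaries.append(f"{role}: {summary}")


def analyst(role: str, ws: Workspace) -> str:
    ws.write(role, "analysis/task.md", "Goal: reproduce the main results table.")
    return "wrote analysis/task.md"


def planner(role: str, ws: Workspace) -> str:
    # Re-ground on the durable artifact instead of a conversational handoff.
    analysis = ws.read("analysis/task.md")
    ws.write(role, "plans/plan.md", f"Plan derived from: {analysis}")
    return "wrote plans/plan.md"


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        scopes = {"analyst": {"analysis"}, "planner": {"plans"}}
        ws = Workspace(Path(tmp), scopes)
        orch = Orchestrator(ws)
        orch.run_stage("analyst", analyst)
        orch.run_stage("planner", planner)
        return orch.summaries, ws.workspace_map()
```

Calling `demo()` runs two stages and returns the orchestrator's summaries together with the workspace map. The point of the design, as sketched here, is that the detailed state lives in files while the control layer carries only short pointers to it.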
