Toward Autonomous Long-Horizon Engineering for ML Research

Abstract

Autonomous AI research has advanced rapidly, but long-horizon ML research engineering remains difficult: agents must sustain coherent progress across task comprehension, environment setup, implementation, experimentation, and debugging over hours or days. We introduce AiScientist, a system for autonomous long-horizon engineering for ML research built on a simple principle: strong long-horizon performance requires both structured orchestration and durable state continuity. To this end, AiScientist combines hierarchical orchestration with a permission-scoped File-as-Bus workspace. A top-level Orchestrator maintains stage-level control through concise summaries and a workspace map, while specialized agents repeatedly re-ground on durable artifacts such as analyses, plans, code, and experimental evidence rather than relying primarily on conversational handoffs, yielding thin control over thick state. Across two complementary benchmarks, AiScientist improves the PaperBench score by 10.54 points on average over the best matched baseline and achieves 81.82 Any Medal% on MLE-Bench Lite. Ablation studies further show that the File-as-Bus protocol is a key driver of performance: removing it reduces PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points. These results suggest that long-horizon ML research engineering is a systems problem of coordinating specialized work over durable project state, rather than a purely local reasoning problem.
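
The abstract does not give implementation details, so the following is only a minimal sketch of what "thin control over thick state" could look like, not the AiScientist implementation. Every name in it (Workspace, write_scopes, analyst, planner, run_pipeline, and the directory layout) is a hypothetical chosen for illustration. It assumes that each role may write only under its own top-level directory, that agents hand off work by reading files rather than by passing messages, and that the orchestrator keeps only one-line summaries plus a listing of workspace files.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class Workspace:
    """Hypothetical file-backed workspace: artifacts live on disk, not in chat."""

    root: Path
    # Assumed permission model: role name -> top-level directories it may write.
    write_scopes: dict[str, set[str]] = field(default_factory=dict)

    def write(self, role: str, rel_path: str, text: str) -> Path:
        # Scope check on the first path component only; a real system would
        # also need to resolve ".." segments and symlinks.
        top = Path(rel_path).parts[0]
        if top not in self.write_scopes.get(role, set()):
            raise PermissionError(f"{role} may not write under {top}/")
        path = self.root / rel_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")
        return path

    def read(self, rel_path: str) -> str:
        # Reads are unrestricted in this sketch.
        return (self.root / rel_path).read_text(encoding="utf-8")

    def workspace_map(self) -> list[str]:
        # The "map" here is just a sorted listing of artifact paths.
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )


def analyst(ws: Workspace) -> str:
    ws.write("analyst", "analysis/summary.md", "Task: reproduce baseline.")
    return "analysis written"  # short summary returned to the orchestrator


def planner(ws: Workspace) -> str:
    # Re-ground on the durable artifact instead of a conversational handoff.
    analysis = ws.read("analysis/summary.md")
    ws.write("planner", "plans/plan.md", f"Plan derived from: {analysis}")
    return "plan written"


def run_pipeline(root: Path) -> tuple[Workspace, list[str]]:
    """Orchestrator: sequences stages and keeps only their short summaries."""
    ws = Workspace(root, {"analyst": {"analysis"}, "planner": {"plans"}})
    summaries = [stage(ws) for stage in (analyst, planner)]
    return ws, summaries


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ws, summaries = run_pipeline(Path(tmp))
        print(summaries)
        print(ws.workspace_map())
```

In this sketch the orchestrator never sees the contents of the artifacts, only the stage summaries and the file listing, while the planner recovers its full context from disk. That is one way to read the separation between a thin control layer and a thick state layer described above.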