Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes

Abstract

We study auto research as a closed empirical loop driven by external measurement. Each submitted trial carries a hypothesis, an executable code edit, an evaluator-owned outcome, and feedback that shapes the next proposal. The output is not a generated paper or a single model checkpoint, but an auditable trajectory of proposals, code diffs, experiments, scores, and failure labels. We instantiate this loop with specialist agents that partition recipe surfaces and share measured lineage across trials. The central empirical finding is that lineage feedback lets agents turn evaluator outcomes, including crashes, budget overruns, size failures, and accuracy-gate misses, into later program-level recipe edits rather than one-shot suggestions. Across 1,197 headline-run trials plus 600 Parameter Golf control trials after one-time setup and launch, humans did not choose proposals, edit recipes, override scores, or repair failed trials during the search. In the three headline runs, the same submitted-trial loop reduces Parameter Golf validation bpb by 0.81%, raises NanoChat-D12 CORE by 38.7%, and reduces CIFAR-10 Airbench96 wallclock by 4.59%, with each task measured by its own external evaluator and legality checks. The trace includes a strict architecture-domain audit of 157 headline-run submissions and program rewrites such as a NanoChat attention-kernel path change. Within this scope the loop autonomously writes code, submits experiments, absorbs feedback, applies and combines known techniques inside each environment, and improves public starting recipes.
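
One way to picture the submitted-trial loop described above is the minimal Python sketch below. It is an illustration under stated assumptions, not the system's actual code: every name (`Trial`, `Trajectory`, `run_loop`, `toy_propose`, `toy_evaluate`), the `width=N` recipe encoding, and the size limit are hypothetical. It shows only the structural idea that the evaluator owns the score and the failure label, and that the proposer reads the full lineage, failures included, before making its next edit.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Trial:
    hypothesis: str
    code_diff: str                       # hypothetical encoding: "width=N"
    score: Optional[float] = None        # written by the evaluator only
    failure_label: Optional[str] = None  # e.g. "crash", "size_failure"
    parent: Optional[int] = None         # index of the trial this one builds on


@dataclass
class Trajectory:
    trials: List[Trial] = field(default_factory=list)

    def best(self) -> Optional[int]:
        """Index of the best legal trial (lower score is better in this toy)."""
        legal = [i for i, t in enumerate(self.trials)
                 if t.failure_label is None and t.score is not None]
        return min(legal, key=lambda i: self.trials[i].score) if legal else None


def run_loop(propose: Callable[[Trajectory], Trial],
             evaluate: Callable[[Trial], Tuple[Optional[float], Optional[str]]],
             n_trials: int) -> Trajectory:
    traj = Trajectory()
    for _ in range(n_trials):
        trial = propose(traj)  # proposer sees the whole lineage so far
        trial.score, trial.failure_label = evaluate(trial)
        traj.trials.append(trial)  # failed trials are kept in the trace
    return traj


def _width(trial: Trial) -> int:
    return int(trial.code_diff.split("=")[1])


def toy_propose(traj: Trajectory) -> Trial:
    best = traj.best()
    if best is None:
        return Trial("baseline", "width=1")
    base = _width(traj.trials[best])
    # Lineage feedback: widths that already failed the size check are avoided.
    failed = {_width(t) for t in traj.trials if t.failure_label == "size_failure"}
    width = base * 2 if base * 2 not in failed else base + 1
    return Trial(f"wider than trial {best}", f"width={width}", parent=best)


def toy_evaluate(trial: Trial) -> Tuple[Optional[float], Optional[str]]:
    width = _width(trial)
    if width > 10:                 # hypothetical size limit
        return None, "size_failure"
    return 1.0 / width, None       # toy metric: wider is better


traj = run_loop(toy_propose, toy_evaluate, n_trials=8)
for i, t in enumerate(traj.trials):
    print(i, t.code_diff, t.score, t.failure_label, t.parent)
```

In this toy run the doubling step eventually breaks the size limit, the failure is recorded with its label, and the next proposal falls back to a smaller edit from the same parent. That mirrors, in miniature, the abstract's claim that evaluator outcomes feed into later recipe edits.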
