

Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes

May 7, 2026
Authors: Jingjie Ning, Xiaochuan Li, Ji Zeng, Hao Kang, Chenyan Xiong
cs.AI

Abstract

We study auto research as a closed empirical loop driven by external measurement. Each submitted trial carries a hypothesis, an executable code edit, an evaluator-owned outcome, and feedback that shapes the next proposal. The output is not a generated paper or a single model checkpoint, but an auditable trajectory of proposals, code diffs, experiments, scores, and failure labels. We instantiate this loop with specialist agents that partition recipe surfaces and share measured lineage across trials. The central empirical finding is that lineage feedback lets agents turn evaluator outcomes, including crashes, budget overruns, size failures, and accuracy-gate misses, into later program-level recipe edits rather than one-shot suggestions. Across 1,197 headline-run trials plus 600 Parameter Golf control trials after one-time setup and launch, humans did not choose proposals, edit recipes, override scores, or repair failed trials during the search. In the three headline runs, the same submitted-trial loop reduces Parameter Golf validation bpb by 0.81%, raises NanoChat-D12 CORE by 38.7%, and reduces CIFAR-10 Airbench96 wallclock by 4.59%, with each task measured by its own external evaluator and legality checks. The trace includes a strict architecture-domain audit of 157 headline-run submissions and program rewrites such as a NanoChat attention-kernel path change. Within this scope the loop autonomously writes code, submits experiments, absorbs feedback, applies and combines known techniques inside each environment, and improves public starting recipes.