AutoResearch-RL: Perpetual Self-Evaluating Reinforcement Learning Agents for Autonomous Neural Architecture Discovery
March 7, 2026
Authors: Nilesh Jain, Rohit Yadav, Sagar Kotian, Claude AI
cs.AI
Abstract
We present AutoResearch-RL, a framework in which a reinforcement learning agent conducts open-ended neural architecture and hyperparameter research without human supervision, running perpetually until a termination oracle signals convergence or resource exhaustion. At each step the agent proposes a code modification to a target training script, executes it under a fixed wall-clock time budget, observes a scalar reward derived from validation bits-per-byte (val-bpb), and updates its policy via Proximal Policy Optimisation (PPO).
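The per-step loop described above can be sketched as follows. This is a toy illustration, not the paper's implementation: the reward shaping (baseline minus val-bpb), the `ToyAgent` stand-in for the PPO meta-learner, and the synthetic `execute` objective are all our assumptions; in the real system `execute` would run `train.py` under the wall-clock budget and `update` would perform a PPO step.

```python
import random

def val_bpb_to_reward(val_bpb: float, baseline_bpb: float) -> float:
    # Reward = improvement over a baseline in validation bits-per-byte.
    # Illustrative shaping; the abstract only says the reward derives from val-bpb.
    return baseline_bpb - val_bpb

class ToyAgent:
    # Stand-in for the PPO meta-learner: records outcomes instead of learning.
    def __init__(self):
        self.history = []  # trajectory of (edit, reward) outcomes

    def propose_edit(self):
        # Hypothetical action space: a learning-rate tweak as the "code edit".
        return {"lr": random.uniform(1e-4, 1e-2)}

    def update(self, edit, reward):
        # Placeholder for a PPO policy update: just log the outcome.
        self.history.append((edit, reward))

def execute(edit, budget_seconds):
    # Stand-in for running train.py under a wall-clock budget and reading
    # back val-bpb; here a synthetic bowl-shaped objective (budget unused).
    return 1.0 + (edit["lr"] - 3e-3) ** 2 * 1e4

def research_loop(agent, baseline_bpb, max_iters):
    # Outer iteration: propose -> execute -> observe reward -> update policy.
    best = baseline_bpb
    for _ in range(max_iters):
        edit = agent.propose_edit()
        val_bpb = execute(edit, budget_seconds=3600)
        agent.update(edit, val_bpb_to_reward(val_bpb, baseline_bpb))
        best = min(best, val_bpb)
    return best
```

In the real framework the termination oracle, rather than a fixed `max_iters`, decides when the loop stops.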
The key design insight is the separation of three concerns: (i) a frozen environment (data pipeline, evaluation protocol, and constants) that guarantees fair cross-experiment comparison; (ii) a mutable target file (train.py) that represents the agent's editable state; and (iii) a meta-learner (the RL agent itself) that accumulates a growing trajectory of experiment outcomes and uses them to inform subsequent proposals.
We formalise this as a Markov Decision Process, derive convergence guarantees under mild assumptions, and demonstrate empirically on a single-GPU nanochat pretraining benchmark that AutoResearch-RL discovers configurations that match or exceed hand-tuned baselines after approximately 300 overnight iterations, with no human in the loop.
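One plausible instantiation of the MDP mentioned above, consistent with the loop described in the abstract (the concrete definitions are our reconstruction, not quoted from the paper):

```latex
% State: current script plus accumulated experiment history;
% action: a proposed code edit; reward: negated val-bpb of the run.
\begin{align*}
  \mathcal{M} &= (\mathcal{S}, \mathcal{A}, P, R, \gamma), \\
  s_t &= \bigl(\texttt{train.py} \text{ source},\ \text{experiment history}\bigr), \\
  a_t &\sim \pi_\theta(\cdot \mid s_t) \quad \text{(a proposed code modification)}, \\
  R(s_t, a_t) &= -\,\mathrm{valbpb}\bigl(\mathrm{run}(a_t(s_t),\, T_{\max})\bigr),
\end{align*}
```

where $T_{\max}$ is the fixed wall-clock budget and $\pi_\theta$ is updated with the PPO objective after each observed reward.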