Endless Terminals: Scaling RL Environments for Terminal Agents
January 23, 2026
Authors: Kanishk Gandhi, Shivam Garg, Noah D. Goodman, Dimitris Papailiopoulos
cs.AI
Abstract
Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop: no retrieval, multi-agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains: on our held-out dev set, Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to human-curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches, including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.
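To make the "minimal interaction loop" and binary episode-level reward concrete, the sketch below rolls out a single episode: the policy emits shell commands, the environment executes them, and the reward is 1 only if the task's completion test passes. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `run_episode`, `policy`, and the `task["instruction"]` / `task["completion_test"]` fields are hypothetical, and it runs in a local temporary directory rather than the containerized environments the pipeline actually builds.

```python
# Minimal sketch of a terminal-agent episode with a binary episode-level reward.
# Assumptions: `policy` maps a text transcript to the next shell command (or "DONE"),
# and `task` is a dict with hypothetical "instruction" and "completion_test" fields.
import subprocess
import tempfile


def run_episode(policy, task, max_turns=20):
    """Roll out one episode and return (transcript, binary reward)."""
    workdir = tempfile.mkdtemp()  # stand-in for a containerized environment
    transcript = [f"Task: {task['instruction']}"]

    for _ in range(max_turns):
        command = policy("\n".join(transcript))  # model proposes the next command
        if command.strip() == "DONE":            # agent signals it is finished
            break
        proc = subprocess.run(
            command, shell=True, cwd=workdir,
            capture_output=True, text=True, timeout=30,
        )
        transcript.append(f"$ {command}\n{proc.stdout}{proc.stderr}")

    # Binary reward: run the task's completion test; pass (exit code 0) -> 1.0, else 0.0.
    test = subprocess.run(
        task["completion_test"], shell=True, cwd=workdir,
        capture_output=True, text=True, timeout=30,
    )
    reward = 1.0 if test.returncode == 0 else 0.0
    return transcript, reward
```

In a PPO setup, the transcript would be tokenized as the trajectory and the single terminal reward assigned to the episode; no per-step shaping, retrieval, or tool plugins are involved, matching the minimalism the abstract describes.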