
Endless Terminals: Scaling RL Environments for Terminal Agents

January 23, 2026
Authors: Kanishk Gandhi, Shivam Garg, Noah D. Goodman, Dimitris Papailiopoulos
cs.AI

Abstract

Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop: no retrieval, multi-agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains: on our held-out dev set, Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to held-out, human-curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches, including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.
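To make "vanilla PPO with binary episode-level rewards and a minimal interaction loop" concrete, here is a minimal Python sketch of one possible rollout routine. It is an illustrative assumption, not code from the paper: the environment interface (`reset`, `step`, `run_completion_test`) and the `agent.act` call are hypothetical placeholders.

```python
# Minimal sketch of an episode rollout for a terminal-use task.
# The env/agent interfaces below are hypothetical placeholders,
# not APIs from the paper or any released code.

def rollout_episode(agent, env, max_turns=32):
    """Run one agent episode in a containerized terminal task.

    Returns the trajectory (for a PPO update) and a binary reward:
    1.0 if the task's completion test passes, else 0.0.
    """
    observation = env.reset()            # task description + fresh container shell
    trajectory = []

    for _ in range(max_turns):
        command = agent.act(observation)        # model proposes the next shell command
        observation, done = env.step(command)   # execute it and read the terminal output
        trajectory.append((command, observation))
        if done:                                # agent signals it is finished
            break

    # Binary episode-level reward: the generated completion test decides success.
    reward = 1.0 if env.run_completion_test() else 0.0
    return trajectory, reward
```

Under this reading, the trajectory feeds a standard PPO update, with the single 0/1 outcome assigned to the whole episode rather than to individual commands.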