CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion
February 11, 2026
Authors: Yusong Lin, Haiyang Wang, Shuzhe Wu, Lue Fan, Feiyang Pan, Sanyuan Zhao, Dandan Tu
cs.AI
Abstract
Agentic coding requires agents to interact effectively with runtime environments, e.g., command line interfaces (CLI), in order to complete tasks such as resolving dependency issues and fixing system problems. However, how such environment-intensive tasks can be obtained at scale to enhance agents' capabilities remains underexplored. To address this, based on an analogy between the Dockerfile and the agentic task, we propose to employ agents to simulate and explore environment histories, guided by execution feedback. By tracing the history of a healthy environment, its state can be inverted to an earlier one with runtime failures, from which a task can be derived by packing the buggy state and the corresponding error messages. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, forming the largest collection of its kind. Moreover, with curated successful trajectories, our fine-tuned model, named LiberCoder, achieves a substantial absolute improvement of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for scalable derivation of environment-intensive tasks.
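The inversion idea in the abstract can be illustrated with a toy sketch. This is not the CLI-Gym pipeline, which uses agents guided by execution feedback. Here the environment history is assumed to be a plain list of Dockerfile-like steps, and `Task`, `derive_task`, and `toy_check` are hypothetical names introduced only for illustration. The sketch walks a healthy history backwards until a health check fails, then packs the failing state and its error message into a task.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Task:
    buggy_state: tuple[str, ...]  # steps still applied in the failing environment
    error_message: str            # failure observed in that state
    removed_step: str             # most recent step that was rolled back


def derive_task(
    history: Sequence[str],
    check: Callable[[tuple[str, ...]], Optional[str]],
) -> Optional[Task]:
    """Roll a healthy history back step by step; return the first failing state as a task.

    `check` is an assumed health probe: it returns None when the state is
    healthy, or an error message string when it is not.
    """
    if check(tuple(history)) is not None:
        raise ValueError("history must describe a healthy environment")
    # Try progressively earlier prefixes of the history.
    for cut in range(len(history) - 1, -1, -1):
        state = tuple(history[:cut])
        error = check(state)
        if error is not None:
            return Task(buggy_state=state, error_message=error, removed_step=history[cut])
    return None  # every earlier state is still healthy; no task can be derived


def toy_check(state: tuple[str, ...]) -> Optional[str]:
    # Stand-in for real execution feedback (e.g. running a test command).
    if "pip install requests" in state:
        return None
    return "ModuleNotFoundError: No module named 'requests'"


history = [
    "apt-get install -y python3",
    "pip install requests",
    "echo done > /tmp/marker",
]
task = derive_task(history, toy_check)
```

In this toy setting the derived task records the state left after rolling back the `pip install requests` step, together with the import error an agent would be asked to fix. A real pipeline would execute commands in a container rather than inspect a list of strings.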