Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments

Abstract

Training agentic models for terminal-based tasks depends critically on high-quality terminal trajectories that capture realistic long-horizon interactions across diverse domains. However, constructing such data at scale remains challenging because of two key requirements: \emph{Executability}, since each instance requires a suitable and often distinct Docker environment; and \emph{Verifiability}, because heterogeneous task outputs preclude unified, standardized verification. To address these challenges, we propose TerminalTraj, a scalable pipeline that (i) filters high-quality repositories to construct Dockerized execution environments, (ii) generates Docker-aligned task instances, and (iii) synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains. Models trained on this data with the Qwen2.5-Coder backbone achieve consistent performance improvements on TerminalBench (TB), with gains of up to 20\% on TB~1.0 and 10\% on TB~2.0 over their respective backbones. Notably, TerminalTraj-32B achieves strong performance among models with fewer than 100B parameters, reaching 35.30\% on TB~1.0 and 22.00\% on TB~2.0, and demonstrates improved test-time scaling behavior. All code and data are available at https://github.com/Wusiwei0410/TerminalTraj.
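To make the role of executable validation concrete, the following is a minimal sketch of how a task instance might be paired with a Docker image and a task-specific check, and how unverified trajectories could be filtered out. It is not the TerminalTraj implementation: the names `TaskInstance`, `Trajectory`, and `keep_verified`, the image tag, and the dictionary standing in for container state are all assumptions made for illustration, and no Docker call is performed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TaskInstance:
    """A task tied to one Docker image (all field names are hypothetical)."""
    image: str        # Docker image tag the task is expected to run in
    instruction: str  # natural-language task given to the agent
    # Executable check over the final state. The state is a plain dict that
    # stands in for files or command output inside the container.
    verify: Callable[[Dict[str, str]], bool]


@dataclass
class Trajectory:
    task: TaskInstance
    commands: List[str]  # shell commands the agent issued
    final_state: Dict[str, str] = field(default_factory=dict)


def keep_verified(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep only trajectories whose task-specific check passes."""
    return [t for t in trajectories if t.task.verify(t.final_state)]


task = TaskInstance(
    image="example/repo-env:latest",  # hypothetical tag, not from the paper
    instruction="Write the word count of README.md to count.txt",
    verify=lambda state: state.get("count.txt", "").strip() == "42",
)
good = Trajectory(task, ["wc -w < README.md > count.txt"], {"count.txt": "42\n"})
bad = Trajectory(task, ["echo 0 > count.txt"], {"count.txt": "0\n"})

verified = keep_verified([good, bad])
print(len(verified))  # 1
```

Attaching the check to each task, rather than using one global checker, is one way to handle heterogeneous outputs: every task carries its own notion of success, so a single filtering step can be applied uniformly across domains.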
