LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents

Abstract

Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain diversity, environment controllability, and the targeting of specific capability deficits. We introduce LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable and verifiable terminal training environments directly from domain specifications. Using this framework, we construct two large-scale resources: LiteCoder-Terminal-SFT, comprising 11,255 expert trajectories across 10 domains, and LiteCoder-Terminal-RL, featuring 602 verifiable environments for trajectory-level preference optimization. Supervised fine-tuning of Qwen-family models on our SFT dataset yields agents that significantly outperform their base counterparts. Notably, our 32B variant achieves 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro, respectively. Furthermore, applying Direct Multi-turn Preference Optimization (DMPO) on our RL environments yields additional performance gains. These results systematically demonstrate that fully synthetic, executable environments offer a scalable and verifiable supervision signal for mastering complex, real-world command-line workflows.
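For readers unfamiliar with the metric, the pass@1 figures above can be read as the expected fraction of benchmark tasks solved in a single attempt. The sketch below is illustrative only: the abstract does not state how many attempts were sampled per task, so it assumes the standard unbiased pass@k estimator from code-generation evaluation, and the function names and toy outcomes are hypothetical rather than taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts, c of them successful."""
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Mean pass@1 over tasks; results maps a task id to per-attempt success flags."""
    scores = [pass_at_k(len(flags), sum(flags), 1) for flags in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes for four tasks, two attempts each (not data from the paper).
toy = {
    "task-a": [True, False],
    "task-b": [False, False],
    "task-c": [True, True],
    "task-d": [False, True],
}
print(f"{benchmark_pass_at_1(toy):.2%}")
```

With k = 1 the estimator reduces to the per-task success rate c/n, so a single attempt per task gives the plain fraction of tasks solved.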