LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents

Abstract

Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by reliance on data scraped from external repositories, which limits domain diversity, environment controllability, and the ability to target specific capability deficits. We introduce LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable and verifiable terminal training environments directly from domain specifications. Using this framework, we construct two large-scale resources: LiteCoder-Terminal-SFT, comprising 11,255 expert trajectories across 10 domains, and LiteCoder-Terminal-RL, comprising 602 verifiable environments for trajectory-level preference optimization. Supervised fine-tuning of Qwen-family models on our SFT dataset yields agents that significantly outperform their base counterparts. Notably, our 32B variant achieves 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro, respectively. Furthermore, applying Direct Multi-turn Preference Optimization (DMPO) on our RL environments yields additional performance gains. These results systematically demonstrate that fully synthetic, executable environments offer a scalable and verifiable supervision signal for mastering complex, real-world command-line workflows.
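For readers unfamiliar with the pass@1 metric reported above, the following is a minimal sketch of how such a score is commonly computed. It assumes the standard unbiased pass@k estimator used in code-generation evaluation; the abstract does not state how many attempts were sampled per task or which estimator was used, so the function names, task ids, and outcomes below are hypothetical illustrations, not the authors' evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of sampled attempts, c: number of attempts that passed
    verification, k: attempt budget being scored.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Mean pass@1 over tasks; `results` maps a task id to per-attempt outcomes."""
    scores = [pass_at_k(len(r), sum(r), 1) for r in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes for three tasks, two attempts each (assumed data).
example = {
    "task-a": [True, False],
    "task-b": [False, False],
    "task-c": [True, True],
}
score = benchmark_pass_at_1(example)
print(f"pass@1 = {score:.2%}")
```

With k = 1 the estimator reduces to the per-task fraction of passing attempts, averaged over tasks, so a single attempt per task gives the plain success rate.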