CLI-Universe: A Verifiable Task Synthesis Engine for Terminal Agents

Abstract

Recent LLM-based terminal agents have shown promising capabilities, but the scarcity of high-quality, executable training data remains a critical bottleneck. Existing synthesis pipelines typically scale by retrofitting surface-level artifacts into tasks, which frequently yields ambiguous instructions, shallow execution paths, and brittle tests that provide only weak learning signals. To address this, we introduce CLI-Universe, a principled synthesis engine for constructing terminal-agent tasks. CLI-Universe generates candidate tasks by sampling combinations across a multi-dimensional capability taxonomy (domain, skill type, capability, and engineering pillar), then grounds each candidate through evidence-guided deep research over real-world technical materials. To ensure rigorous supervision, validated blueprints are instantiated as Dockerized environments and passed through a multi-stage executable verification pipeline comprising rubric-gated test construction, hint-conditional filtering, and strict fail-to-pass checking. Across the full pipeline, from candidate generation to verification, approximately two-thirds of candidates are discarded, so that only tasks that are genuine, verifiable, and non-trivially challenging are retained. To validate the framework, we build CLI-Universe-6K, a highly distilled dataset of 6,000 trajectories. Fine-tuning Qwen3-32B on CLI-Universe-6K achieves 33.4% accuracy on Terminal-Bench 2.0. This sets a new state of the art among models with 32B or fewer parameters trained on open-source data, and it outperforms several models an order of magnitude larger, demonstrating the data efficiency of structured, high-fidelity synthesis.
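Two of the steps named in the abstract, combinatorial sampling over the capability taxonomy and fail-to-pass checking, can be illustrated with a minimal sketch. This is not the authors' implementation: the taxonomy values, the function names `sample_candidates` and `passes_fail_to_pass`, and the `run_tests` callable are all hypothetical, introduced only for illustration. The abstract names the four taxonomy dimensions but not their contents, and here a plain dictionary stands in for a Dockerized environment.

```python
import itertools
import random

# Hypothetical taxonomy values; the abstract names only the four dimensions.
TAXONOMY = {
    "domain": ["networking", "databases", "build-systems"],
    "skill_type": ["debugging", "configuration"],
    "capability": ["log-analysis", "dependency-resolution"],
    "engineering_pillar": ["reliability", "security"],
}


def sample_candidates(taxonomy, k, seed=0):
    """Draw k distinct combinations, one value per taxonomy dimension."""
    dims = list(taxonomy)
    combos = list(itertools.product(*(taxonomy[d] for d in dims)))
    rng = random.Random(seed)
    # random.sample draws without replacement, so combinations are distinct.
    return [dict(zip(dims, combo)) for combo in rng.sample(combos, k)]


def passes_fail_to_pass(run_tests, initial_env, solved_env):
    """Keep a task only if its tests fail before the fix and pass after it.

    run_tests is an assumed callable that returns True when the whole
    test suite passes in the given environment.
    """
    return (not run_tests(initial_env)) and run_tests(solved_env)


if __name__ == "__main__":
    for candidate in sample_candidates(TAXONOMY, k=3):
        print(candidate)

    # Toy stand-in for executing a test suite inside a container.
    def toy_run_tests(env):
        return env.get("fixed", False)

    print(passes_fail_to_pass(toy_run_tests, {"fixed": False}, {"fixed": True}))
```

In this sketch a task whose tests already pass in the initial environment, or still fail after the reference solution, is rejected. That is the usual meaning of a fail-to-pass check; the other filtering stages in the abstract (rubric gating, hint-conditional filtering) are not modeled here.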