CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents
June 22, 2026
Authors: Zhanbo Hua, Yifan Yao, Weihao Xie, Yongchi Zhao, Minghao Liu, Ruizhi Qiu, Zhewei Huang, Zun Wang, Yiyan Ji, Yunhai Ye, Letian Zhu, Xinping Lei, Han Li, Zhiyuan Ma, Zili Wang, Zhaoxiang Zhang, Jiaheng Liu
cs.AI
Abstract
While recent LLM-based terminal agents have demonstrated promising capabilities, the scarcity of high-quality, executable training data remains a critical bottleneck. Existing synthesis pipelines typically scale by retrofitting surface-level artifacts into tasks, which frequently yields ambiguous instructions, shallow execution paths, and brittle tests that provide weak learning signals. To overcome this, we introduce CLI-Universe, a principled synthesis engine for constructing terminal-agent tasks. CLI-Universe generates candidate tasks by sampling combinations across a multi-dimensional capability taxonomy (domain, skill type, capability, and engineering pillar), then grounds each candidate through evidence-guided deep research over real-world technical materials. To ensure rigorous supervision, validated blueprints are instantiated into Dockerized environments and subjected to a multi-stage executable verification pipeline featuring rubric-gated test construction, hint-conditional filtering, and strict fail-to-pass checking. Across the full pipeline, from candidate generation to verification, approximately two-thirds of candidates are discarded, retaining only those that are genuine, verifiable, and non-trivially challenging. To validate our framework, we build CLI-Universe-6K, a highly distilled dataset of 6,000 trajectories. Fine-tuning Qwen3-32B on CLI-Universe-6K achieves 33.4% on Terminal-Bench 2.0. This sets a new state of the art among models at or below 32B parameters trained on open-source data, and outperforms several models an order of magnitude larger, demonstrating the data efficiency of structured, high-fidelity synthesis.
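As a rough illustration of two steps named in the abstract, the sketch below samples candidate tasks as combinations across the four taxonomy axes and applies a fail-to-pass filter. It is a minimal sketch under stated assumptions, not the authors' implementation: the axis values in `TAXONOMY`, the `Candidate` type, and the `run_tests` callback are all hypothetical, and a real pipeline would presumably run the tests inside the task's Docker environment.

```python
import random
from dataclasses import dataclass
from typing import Callable

# Hypothetical axis values: the abstract names the four axes but not their contents.
TAXONOMY = {
    "domain": ["networking", "data-processing", "build-systems"],
    "skill_type": ["debugging", "configuration", "scripting"],
    "capability": ["log-analysis", "dependency-resolution"],
    "engineering_pillar": ["reliability", "security"],
}


@dataclass(frozen=True)
class Candidate:
    """One sampled combination, with one value per taxonomy axis."""
    domain: str
    skill_type: str
    capability: str
    engineering_pillar: str


def sample_candidates(n: int, seed: int = 0) -> list[Candidate]:
    """Draw n candidates by picking one value per axis, uniformly at random."""
    rng = random.Random(seed)
    return [
        Candidate(**{axis: rng.choice(values) for axis, values in TAXONOMY.items()})
        for _ in range(n)
    ]


def passes_fail_to_pass(run_tests: Callable[[bool], bool]) -> bool:
    """Keep a task only if its tests fail before the reference solution
    is applied and pass afterwards.

    run_tests(solution_applied) is an assumed callback that returns True
    when every test passes in the given environment state.
    """
    fails_before = not run_tests(False)
    passes_after = run_tests(True)
    return fails_before and passes_after
```

A task whose tests already pass in the untouched environment, or still fail after the reference solution, is rejected by `passes_fail_to_pass`; this is the sense in which such a check guards against trivial or unverifiable tasks.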