TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents

Abstract

Executing complex terminal tasks remains a significant challenge for open-weight LLMs, constrained by two fundamental limitations. First, high-fidelity, executable training environments are scarce: environments synthesized from real-world repositories lack diversity and scalability, while trajectories synthesized by LLMs suffer from hallucinations. Second, standard instruction tuning uses expert trajectories that rarely exhibit the simple mistakes common to smaller models. This creates a distributional mismatch, leaving student models ill-equipped to recover from their own runtime failures. To bridge these gaps, we introduce TermiGen, an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. TermiGen first generates functionally valid tasks and Docker containers via an iterative multi-agent refinement loop. Subsequently, we employ a Generator-Critic protocol that actively injects errors during trajectory collection, synthesizing data rich in error-correction cycles. Fine-tuned on this TermiGen-generated dataset, our TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench. This establishes a new open-weights state of the art, outperforming existing baselines and notably surpassing capable proprietary models such as o4-mini. The dataset is available at https://github.com/ucsb-mlsec/terminal-bench-env.
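The abstract does not specify how errors are injected during trajectory collection, so the following is only a toy sketch of the general idea: replay an expert command sequence, occasionally execute a deliberately corrupted command first, record the failure, and then record the correct command as the recovery step. Every name here (`Step`, `collect_trajectory`, `toy_corrupt`, `toy_execute`, `inject_prob`) is a hypothetical illustration and not part of TermiGen; the "environment" is a stub rather than a Docker container, and no LLM is involved.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    command: str
    observation: str
    injected_error: bool = False  # True if this step was a deliberately injected mistake


def collect_trajectory(
    expert_commands: List[str],
    corrupt: Callable[[str], str],
    execute: Callable[[str], str],
    inject_prob: float = 0.3,
    seed: int = 0,
) -> List[Step]:
    """Replay expert commands; with probability inject_prob, run a corrupted
    version first so the trajectory contains an error followed by its fix."""
    rng = random.Random(seed)
    trajectory: List[Step] = []
    for cmd in expert_commands:
        if rng.random() < inject_prob:
            bad = corrupt(cmd)
            trajectory.append(Step(bad, execute(bad), injected_error=True))
        # The correct command always follows, acting as the recovery step.
        trajectory.append(Step(cmd, execute(cmd)))
    return trajectory


# --- Hypothetical stand-ins for a real environment and error model ---
KNOWN_PROGRAMS = {"ls", "cat", "grep"}


def toy_execute(command: str) -> str:
    """Stub shell: succeeds only if the program name is in KNOWN_PROGRAMS."""
    parts = command.split()
    program = parts[0] if parts else ""
    if program in KNOWN_PROGRAMS:
        return f"ok: {command}"
    return f"bash: {program}: command not found"


def toy_corrupt(command: str) -> str:
    """Simulate a typo by dropping the last character of the program name."""
    parts = command.split()
    parts[0] = parts[0][:-1]
    return " ".join(parts)


if __name__ == "__main__":
    steps = collect_trajectory(
        ["ls -la", "cat notes.txt"], toy_corrupt, toy_execute, inject_prob=1.0
    )
    for step in steps:
        tag = "injected" if step.injected_error else "expert"
        print(f"[{tag}] $ {step.command} -> {step.observation}")
```

With `inject_prob=1.0` every expert command is preceded by a failed attempt, which makes the error-then-correction pattern easy to inspect; a real pipeline would presumably use a model-driven error source and a real container in place of the two stubs.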
