TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents

Abstract

Executing complex terminal tasks remains a significant challenge for open-weight LLMs, constrained by two fundamental limitations. First, high-fidelity, executable training environments are scarce: environments synthesized from real-world repositories lack diversity and scalability, while trajectories synthesized by LLMs suffer from hallucinations. Second, standard instruction tuning uses expert trajectories that rarely exhibit the simple mistakes common to smaller models. This creates a distributional mismatch, leaving student models ill-equipped to recover from their own runtime failures. To bridge these gaps, we introduce TermiGen, an end-to-end pipeline for synthesizing verifiable environments and resilient expert trajectories. TermiGen first generates functionally valid tasks and Docker containers via an iterative multi-agent refinement loop. Subsequently, we employ a Generator-Critic protocol that actively injects errors during trajectory collection, synthesizing data rich in error-correction cycles. Fine-tuned on this TermiGen-generated dataset, our TermiGen-Qwen2.5-Coder-32B achieves a 31.3% pass rate on TerminalBench. This establishes a new open-weights state of the art, outperforming existing baselines and notably surpassing capable proprietary models such as o4-mini. The dataset is available at https://github.com/ucsb-mlsec/terminal-bench-env.
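
The idea of injecting errors during trajectory collection can be illustrated with a toy sketch. The abstract does not specify how the Generator-Critic protocol is implemented, so everything below is an assumption made for illustration: the names `Step`, `Trajectory`, `toy_execute`, and `collect_trajectory` are hypothetical, the "environment" is a stub rather than a Docker container, and the critic's role is reduced to observing a non-zero exit code before the correct command is issued.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Step:
    command: str
    exit_code: int
    injected: bool  # True if this step is a deliberately injected mistake


@dataclass
class Trajectory:
    steps: list = field(default_factory=list)

    def error_correction_cycles(self) -> int:
        # Count failing steps that are immediately followed by a successful step.
        return sum(
            1
            for a, b in zip(self.steps, self.steps[1:])
            if a.exit_code != 0 and b.exit_code == 0
        )


def toy_execute(command: str) -> int:
    # Hypothetical stand-in for running a command inside a container:
    # any command carrying the marker "--typo" fails with exit code 1.
    return 1 if "--typo" in command else 0


def collect_trajectory(plan, inject_prob, rng):
    """Replay an expert plan, occasionally inserting a faulty command first.

    Assumption: an injected mistake is always followed by the correct
    command, so each injection yields one error-correction cycle.
    """
    traj = Trajectory()
    for command in plan:
        if rng.random() < inject_prob:
            bad = command + " --typo"
            traj.steps.append(Step(bad, toy_execute(bad), injected=True))
        # The correct command is recorded either directly or as the recovery step.
        traj.steps.append(Step(command, toy_execute(command), injected=False))
    return traj


if __name__ == "__main__":
    rng = random.Random(0)
    traj = collect_trajectory(["ls", "cat notes.txt", "make"], 0.5, rng)
    for step in traj.steps:
        print(step)
    print("error-correction cycles:", traj.error_correction_cycles())
```

In this sketch, raising `inject_prob` increases the share of recovery steps in the collected data, which is the property the abstract attributes to the TermiGen dataset; the actual pipeline presumably uses LLM agents and real containers in place of these stubs.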
