CLI-Universe: Towards a Verifiable Task Synthesis Engine for Terminal Agents

Abstract

Although recent LLM-based terminal agents have demonstrated promising capabilities, the scarcity of high-quality, executable training data remains a critical bottleneck. Existing synthesis pipelines typically scale by retrofitting surface-level artifacts into tasks, which frequently yields ambiguous instructions, shallow execution paths, and brittle tests that provide weak learning signals. To overcome this, we introduce CLI-Universe, a principled synthesis engine for constructing terminal-agent tasks. CLI-Universe generates candidate tasks by sampling combinations across a multi-dimensional capability taxonomy (domain, skill type, capability, and engineering pillar), then grounds each candidate through evidence-guided deep research over real-world technical materials. To ensure rigorous supervision, validated blueprints are instantiated into Dockerized environments and subjected to a multi-stage executable verification pipeline featuring rubric-gated test construction, hint-conditional filtering, and strict fail-to-pass checking. Across the full pipeline, from candidate generation to verification, approximately two-thirds of candidates are discarded; only tasks that are genuine, verifiable, and non-trivially challenging are retained. To validate our framework, we instantiate CLI-Universe-6K, a highly curated dataset of 6,000 trajectories. Fine-tuning Qwen3-32B on CLI-Universe-6K achieves 33.4% on Terminal-Bench 2.0. This sets a new state of the art among models at or below 32B parameters trained on open-source data and outperforms several models an order of magnitude larger, demonstrating the data efficiency of structured, high-fidelity synthesis.
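To make two of the steps above concrete, the following is a minimal sketch, not the CLI-Universe implementation. It assumes that candidate generation draws one value per taxonomy dimension, and that fail-to-pass checking means a task's tests must fail in the initial environment and pass once a reference solution is applied. The taxonomy values, the function names `sample_candidates` and `is_fail_to_pass`, and the toy test are all hypothetical; the abstract names only the four dimensions.

```python
import itertools
import random

# Hypothetical taxonomy values; only the four dimension names come from the abstract.
TAXONOMY = {
    "domain": ["networking", "databases", "build-systems"],
    "skill_type": ["debugging", "configuration"],
    "capability": ["log-analysis", "dependency-resolution"],
    "engineering_pillar": ["reliability", "security"],
}


def sample_candidates(taxonomy, k, seed=0):
    """Draw k distinct combinations, taking one value from each dimension."""
    dims = list(taxonomy)
    combos = list(itertools.product(*(taxonomy[d] for d in dims)))
    rng = random.Random(seed)  # seeded so the draw is reproducible
    picked = rng.sample(combos, k)  # sampling without replacement
    return [dict(zip(dims, combo)) for combo in picked]


def is_fail_to_pass(run_tests, initial_state, solved_state):
    """Return True only if tests fail before the fix and pass after it.

    Assumed reading of "fail-to-pass": a task whose tests already pass
    (trivial) or never pass (unverifiable) is rejected.
    """
    return (not run_tests(initial_state)) and run_tests(solved_state)


# Usage: a toy test that passes only when the expected file is present.
candidates = sample_candidates(TAXONOMY, k=5)
toy_test = lambda state: "fixed.conf" in state
keep = is_fail_to_pass(toy_test, {"broken.conf"}, {"fixed.conf"})
```

In a real pipeline `run_tests` would execute a test suite inside the task's Docker container; here it is a plain callable so the filtering logic can be read in isolation.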