CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion

Abstract

Agentic coding requires agents to interact effectively with runtime environments, such as command-line interfaces (CLI), to complete tasks like resolving dependency issues and fixing system problems. However, how to obtain such environment-intensive tasks at scale to enhance agents' capabilities remains underexplored. To address this, building on an analogy between the Dockerfile and the agentic task, we propose employing agents to simulate and explore environment histories, guided by execution feedback. By tracing the history of a healthy environment, its state can be inverted to an earlier one that exhibits runtime failures, and a task can then be derived by packaging the buggy state together with the corresponding error messages. With our method, named CLI-Gym, we derive a total of 1,655 environment-intensive tasks, the largest collection of its kind. Moreover, using curated successful trajectories, our fine-tuned model, named LiberCoder, achieves a substantial absolute improvement of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for the scalable derivation of environment-intensive tasks.
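
To make the inversion idea concrete, the following toy sketch replays a Dockerfile-like history of a healthy environment, walks backwards to the most recent earlier state that fails a health check, and packages that state with its error message as a task. It is only an illustration under stated assumptions, not the CLI-Gym pipeline: no agent or container is involved, the environment is reduced to a dictionary of package versions, and every name here (`Step`, `Task`, `check`, `replay`, `invert`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Toy stand-in for an environment: package name -> installed version (assumption).
State = Dict[str, str]


@dataclass
class Step:
    """One entry in the environment history, analogous to a Dockerfile instruction."""
    command: str
    apply: Callable[[State], State]


@dataclass
class Task:
    """A derived task: the faulty state, the observed error, and a known way back."""
    buggy_state: State
    error_message: str
    reference_fix: List[str]


def install(pkg: str, version: str) -> Step:
    # The command string is only recorded; nothing is actually installed.
    return Step(f"pip install {pkg}=={version}", lambda s: {**s, pkg: version})


def check(state: State) -> Optional[str]:
    """Hypothetical health check standing in for execution feedback; None means healthy."""
    if "numpy" not in state:
        return "ModuleNotFoundError: No module named 'numpy'"
    if state["numpy"].startswith("1."):
        return "ImportError: app requires numpy>=2"
    return None


def replay(history: List[Step], upto: int) -> State:
    """Rebuild the state reached after the first `upto` steps of the history."""
    state: State = {}
    for step in history[:upto]:
        state = step.apply(state)
    return state


def invert(history: List[Step]) -> Optional[Task]:
    """Walk the history backwards and package the latest failing state as a task."""
    if check(replay(history, len(history))) is not None:
        raise ValueError("the final environment must be healthy")
    for k in range(len(history) - 1, -1, -1):
        state = replay(history, k)
        error = check(state)
        if error is not None:
            # The remaining steps are one known route back to the healthy state.
            return Task(state, error, [s.command for s in history[k:]])
    return None


history = [
    install("requests", "2.32.0"),
    install("numpy", "1.26.4"),
    install("numpy", "2.1.0"),
]
task = invert(history)
print(task.error_message)
print(task.reference_fix)
```

In this sketch the derived task pairs the state holding the older numpy version with the import error it triggers, and the steps that were undone serve as a reference fix. In the setting the abstract describes, the histories are simulated and explored by agents guided by execution feedback rather than replayed from a fixed list.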

