LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces

Abstract

Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and therefore fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities on long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from-scratch development, feature addition, bug fixing, and refactoring. LongCLI-Bench uses a dual-set testing protocol that measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and it incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, showing that critical failures often occur in the early stages. Although self-correction offers only marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly larger improvements. These results suggest that, to overcome the key challenges in long-horizon task performance, future research must emphasize the development of synergistic human-agent workflows alongside advances in agents' planning and execution capabilities.
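As a rough illustration of the dual-set idea, the sketch below scores a task from two lists of test outcomes. It is not the benchmark's implementation: the class and function names are hypothetical, the all-tests-must-pass rule is an assumption, and the completion fraction is only a stand-in for the step-level score, whose exact definition the abstract does not give.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task (hypothetical structure, not from the paper)."""
    fail_to_pass: list[bool]  # tests expected to newly pass: requirement fulfillment
    pass_to_pass: list[bool]  # tests that passed before and must still pass: regression avoidance


def task_passed(result: TaskResult) -> bool:
    # Assumed rule: a task passes only if every test in both sets passes.
    return all(result.fail_to_pass) and all(result.pass_to_pass)


def completion(result: TaskResult) -> float:
    # Assumed stand-in for a step-level score: fraction of fail-to-pass tests that pass.
    if not result.fail_to_pass:
        return 0.0
    return sum(result.fail_to_pass) / len(result.fail_to_pass)


def pass_rate(results: list[TaskResult]) -> float:
    # Fraction of tasks that satisfy the assumed pass rule.
    if not results:
        return 0.0
    return sum(task_passed(r) for r in results) / len(results)
```

Under these assumptions, a task that meets every new requirement but breaks one previously passing test still counts as failed, which is the behavior the pass-to-pass set is meant to capture.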