LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
February 15, 2026
Authors: Yukang Feng, Jianwen Sun, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, Kang He, Rui Ma, Jifan Lin, Jie Sun, Yang Xiao, Sizhuo Zhou, Wenxiao Wu, Yiming Liu, Pengfei Liu, Yu Qiao, Shenglin Zhang, Kaipeng Zhang
cs.AI
Abstract
Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and therefore fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: development from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench, which measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly larger improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents' planning and execution capabilities to overcome key challenges in long-horizon task performance.
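The dual-set testing protocol described above can be sketched as follows. This is a minimal illustrative sketch, not the benchmark's actual harness: the function names, argument shapes, and the all-or-nothing resolution rule combining the two test sets are assumptions for illustration.

```python
def evaluate_task(fail_to_pass_results, pass_to_pass_results):
    """Score one task from its two test sets (hypothetical interface).

    Each argument maps a test id to True if the test passed after the
    agent's changes, else False. Fail-to-pass tests check requirement
    fulfillment (previously failing tests must now pass); pass-to-pass
    tests check regression avoidance (previously passing tests must
    still pass). A task counts as resolved only if both sets pass.
    """
    fulfilled = all(fail_to_pass_results.values())
    no_regression = all(pass_to_pass_results.values())
    return fulfilled and no_regression


def pass_rate(task_outcomes):
    """Fraction of tasks resolved under the dual-set criterion."""
    if not task_outcomes:
        return 0.0
    return sum(task_outcomes) / len(task_outcomes)
```

Under this criterion, a task that satisfies every new requirement but breaks even one existing test is still unresolved, which is what separates requirement fulfillment from regression avoidance in the protocol.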