LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces

Abstract

Recent advances in AI-assisted programming have enabled agents to execute complex workflows through command-line interfaces. However, existing benchmarks, limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench that measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and that incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, showing that critical failures often occur in the early stages. Although self-correction offers only marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly larger improvements. These results suggest that, to overcome the key challenges in long-horizon task performance, future research must emphasize the development of synergistic human-agent workflows alongside advances in agents' planning and execution capabilities.
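
The dual-set protocol and step-level scoring described in the abstract can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the `TaskResult` schema, the `score` function, the test names, and the rule that a task passes only when every test in both sets passes are all assumptions made here for illustration, since the abstract does not specify how scores are aggregated.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical schema; the abstract does not define one.
    fail_to_pass: dict[str, bool]  # tests that failed before the run -> pass now?
    pass_to_pass: dict[str, bool]  # tests that passed before the run -> still pass?
    steps_done: list[bool]         # ordered per-step checks for the task


def fraction_passed(outcomes: dict[str, bool]) -> float:
    """Share of tests that passed; an empty set counts as fully satisfied."""
    if not outcomes:
        return 1.0
    return sum(outcomes.values()) / len(outcomes)


def score(result: TaskResult) -> dict:
    """Compute the two test-set rates and a step-completion fraction."""
    f2p = fraction_passed(result.fail_to_pass)  # requirement fulfillment
    p2p = fraction_passed(result.pass_to_pass)  # regression avoidance
    steps = result.steps_done
    step_completion = sum(steps) / len(steps) if steps else 0.0
    return {
        "requirement_fulfillment": f2p,
        "regression_avoidance": p2p,
        "step_completion": step_completion,
        # Assumed rule: a task passes only if both sets pass completely.
        "task_passed": f2p == 1.0 and p2p == 1.0,
    }


# Example: the agent implements one of two new requirements, breaks nothing,
# and completes only the first of four steps.
example = TaskResult(
    fail_to_pass={"test_new_command": True, "test_new_flag": False},
    pass_to_pass={"test_existing_help": True},
    steps_done=[True, False, False, False],
)
print(score(example))
```

In this example the run avoids regressions but meets only half of the new requirements and reaches 25% step completion, so it would not count as a pass under the assumed rule. Keeping the two test sets separate is what lets an evaluation distinguish "did not finish the work" from "finished the work but broke existing behavior".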
