

LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces

February 15, 2026
作者: Yukang Feng, Jianwen Sun, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, Kang He, Rui Ma, Jifan Lin, Jie Sun, Yang Xiao, Sizhuo Zhou, Wenxiao Wu, Yiming Liu, Pengfei Liu, Yu Qiao, Shenglin Zhang, Kaipeng Zhang
cs.AI

Abstract

Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and thus fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from-scratch development, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench that measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly larger improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows, alongside advances in agents' planning and execution capabilities, to overcome key challenges in long-horizon task performance.
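
To make the dual-set protocol concrete, the sketch below shows one plausible way per-task results could be aggregated into the metrics the abstract names (requirement fulfillment, regression avoidance, step-level completion). The field names, the all-or-nothing pass criterion, and the step-completion ratio are illustrative assumptions, not the paper's actual grading harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure, not from the paper)."""
    fail_to_pass: dict[str, bool]   # requirement tests: must flip from failing to passing
    pass_to_pass: dict[str, bool]   # regression tests: must remain passing
    steps_completed: int            # step-level progress credited by the grader
    steps_total: int


def score_task(r: TaskResult) -> dict[str, float]:
    """Aggregate one task's results under an assumed dual-set scoring scheme."""
    f2p = sum(r.fail_to_pass.values()) / max(len(r.fail_to_pass), 1)
    p2p = sum(r.pass_to_pass.values()) / max(len(r.pass_to_pass), 1)
    return {
        "requirement_fulfillment": f2p,                      # fraction of fail-to-pass tests now passing
        "regression_avoidance": p2p,                         # fraction of pass-to-pass tests still passing
        "task_passed": float(f2p == 1.0 and p2p == 1.0),     # strict: all requirements met, no regressions
        "step_completion": r.steps_completed / max(r.steps_total, 1),
    }
```

Under this reading, a task counts as passed only if every fail-to-pass test flips and no pass-to-pass test regresses, while the step-completion ratio provides the finer-grained signal used to locate where long-horizon runs stall.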