FeatureBench: Benchmarking Agentic Coding for Complex Feature Development
February 11, 2026
Authors: Qixing Zhou, Jiacheng Zhang, Haiyang Wang, Rui Hao, Jiahe Wang, Minghao Han, Yuxue Yang, Shuzhe Wu, Feiyang Pan, Lue Fan, Dandan Tu, Zhaoxiang Zhang
cs.AI
Abstract
Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current boundaries of their coding abilities. Existing agentic coding benchmarks, however, cover a limited task scope (e.g., bug fixing within a single pull request (PR)), often rely on non-executable evaluations, or lack an automated approach for continually updating their evaluation coverage. To address these issues, we propose FeatureBench, a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. FeatureBench incorporates an execution-based evaluation protocol and a scalable, test-driven method that automatically derives tasks from code repositories with minimal human effort. By tracing from unit tests along a dependency graph, our approach identifies feature-level coding tasks spanning multiple commits and PRs scattered across the development timeline, while ensuring that other features continue to function properly after the target feature is separated out. Using this framework, we curated 200 challenging evaluation tasks and 3825 executable environments from 24 open-source repositories for the first version of our benchmark. Empirical evaluation reveals that a state-of-the-art agentic model such as Claude 4.5 Opus, which achieves a 74.4% resolved rate on SWE-bench, succeeds on only 11.0% of our tasks, opening new opportunities for advancing agentic coding. Moreover, benefiting from our automated task collection toolkit, FeatureBench can be easily scaled and updated over time to mitigate data leakage. The inherent verifiability of the constructed environments also makes our method potentially valuable for agent training.
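
The abstract describes the tracing step only at a high level. The sketch below illustrates one plausible reading of it under simplifying assumptions: it approximates the paper's dependency graph with a static module-level import graph built from Python's ast module, and treats the transitive import closure of a unit test as the candidate feature-level slice. It is not the authors' toolkit; the function names (build_import_graph, trace_feature_modules) and example paths (repo/mypkg, repo/tests/test_feature_x.py) are hypothetical.

    # Minimal sketch of test-driven feature tracing, assuming a Python repository.
    # NOT the paper's actual pipeline: the dependency graph is approximated by a
    # module-level import graph; relative imports and __init__ modules are ignored.
    import ast
    from collections import defaultdict, deque
    from pathlib import Path

    def _internal_imports(tree: ast.AST, pkg_name: str) -> set[str]:
        """Collect package-internal modules imported anywhere in a parsed file."""
        mods: set[str] = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                mods.update(a.name for a in node.names if a.name.startswith(pkg_name))
            elif isinstance(node, ast.ImportFrom) and node.module and node.module.startswith(pkg_name):
                mods.add(node.module)
        return mods

    def build_import_graph(pkg_root: Path) -> dict[str, set[str]]:
        """Static graph: module name -> package-internal modules it imports."""
        pkg_name = pkg_root.name
        graph: dict[str, set[str]] = defaultdict(set)
        for py_file in pkg_root.rglob("*.py"):
            mod = ".".join(py_file.relative_to(pkg_root.parent).with_suffix("").parts)
            tree = ast.parse(py_file.read_text(encoding="utf-8"))
            graph[mod] |= _internal_imports(tree, pkg_name)
        return graph

    def trace_feature_modules(test_file: Path, pkg_root: Path) -> set[str]:
        """BFS from the modules a unit test imports to their transitive dependency
        closure -- a rough stand-in for the feature-level slice the benchmark extracts."""
        graph = build_import_graph(pkg_root)
        seeds = _internal_imports(ast.parse(test_file.read_text(encoding="utf-8")), pkg_root.name)
        seen, queue = set(seeds), deque(seeds)
        while queue:
            for dep in graph.get(queue.popleft(), ()):
                if dep not in seen:
                    seen.add(dep)
                    queue.append(dep)
        return seen

    if __name__ == "__main__":
        # Hypothetical paths: a unit test pinning one feature, and the package it exercises.
        feature_slice = trace_feature_modules(Path("repo/tests/test_feature_x.py"), Path("repo/mypkg"))
        print(sorted(feature_slice))

In the benchmark as described, the granularity would presumably be finer (symbols, commits, and PRs rather than whole modules), and the extracted slice would be removed from the repository and re-verified by running the remaining test suite, consistent with the execution-based protocol the abstract mentions.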