FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

February 11, 2026
Authors: Qixing Zhou, Jiacheng Zhang, Haiyang Wang, Rui Hao, Jiahe Wang, Minghao Han, Yuxue Yang, Shuzhe Wu, Feiyang Pan, Lue Fan, Dandan Tu, Zhaoxiang Zhang
cs.AI

Abstract

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current boundaries of their coding abilities. Existing agentic coding benchmarks, however, cover a limited task scope, e.g., bug fixing within a single pull request (PR), and often rely on non-executable evaluations or lack an automated approach for continually updating the evaluation coverage. To address these issues, we propose FeatureBench, a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. FeatureBench incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. By tracing from unit tests along a dependency graph, our approach identifies feature-level coding tasks spanning multiple commits and PRs scattered across the development timeline, while ensuring that other features continue to function properly after the separation. Using this framework, we curated 200 challenging evaluation tasks and 3825 executable environments from 24 open-source repositories for the first version of our benchmark. Empirical evaluation reveals that a state-of-the-art agentic model such as Claude 4.5 Opus, which achieves a 74.4% resolved rate on SWE-bench, succeeds on only 11.0% of our tasks, opening new opportunities for advancing agentic coding. Moreover, benefiting from our automated task collection toolkit, FeatureBench can be easily scaled and updated over time to mitigate data leakage. The inherent verifiability of the constructed environments also makes our method potentially valuable for agent training.
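The task-construction idea described in the abstract, tracing from a unit test along a dependency graph to isolate the code that realizes a feature while leaving shared code untouched, can be illustrated with a minimal sketch. The example below is not the paper's toolkit: it assumes a symbol-level dependency graph has already been extracted from the repository, uses networkx for reachability, and all file and symbol names are hypothetical.

```python
# Minimal illustration (not the FeatureBench toolkit) of tracing a feature
# slice from a unit test along a dependency graph. All names are hypothetical.
import networkx as nx


def feature_slice(dep_graph: nx.DiGraph, test_node: str) -> set[str]:
    """Source symbols that `test_node` transitively depends on.

    Edges point from a dependent symbol to the symbol it imports or calls,
    so the slice is simply the set of descendants of the test node.
    """
    return nx.descendants(dep_graph, test_node)


# Toy repository: two tests that share one utility function.
g = nx.DiGraph()
g.add_edges_from([
    ("tests/test_export.py::test_csv", "pkg/export.py::to_csv"),
    ("pkg/export.py::to_csv", "pkg/io.py::write_file"),
    ("tests/test_parse.py::test_json", "pkg/parse.py::from_json"),
    ("pkg/parse.py::from_json", "pkg/io.py::write_file"),
])

export_slice = feature_slice(g, "tests/test_export.py::test_csv")
parse_slice = feature_slice(g, "tests/test_parse.py::test_json")

# Only symbols needed by no other test are safe to strip out when building
# the task; shared code (pkg/io.py::write_file) must stay so that unrelated
# features keep working after the separation.
removable = export_slice - parse_slice
print(sorted(removable))  # ['pkg/export.py::to_csv']
```

In the actual pipeline, such slices may span code introduced across many commits and PRs, and the withheld unit tests then serve as the execution-based oracle for verifying an agent's re-implementation.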