FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Abstract

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even as autonomous developers. As their presence grows, it becomes important to assess the current limits of their coding abilities. Existing agentic coding benchmarks, however, cover a limited task scope, e.g., bug fixing within a single pull request (PR), and often rely on non-executable evaluations or lack an automated approach for continually updating the evaluation coverage. To address these issues, we propose FeatureBench, a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. FeatureBench incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. By tracing from unit tests along a dependency graph, our approach can identify feature-level coding tasks spanning multiple commits and PRs scattered across the development timeline, while ensuring that other features continue to function properly after the separation. Using this framework, we curated 200 challenging evaluation tasks and 3825 executable environments from 24 open-source repositories in the first version of our benchmark. Empirical evaluation reveals that a state-of-the-art agentic model such as Claude 4.5 Opus, which achieves a 74.4% resolved rate on SWE-bench, succeeds on only 11.0% of tasks, opening new opportunities for advancing agentic coding. Moreover, thanks to our automated task collection toolkit, FeatureBench can be easily scaled and updated over time to mitigate data leakage. The inherent verifiability of the constructed environments also makes our method potentially valuable for agent training.
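The idea of tracing from unit tests along a dependency graph can be illustrated with a small sketch. This is not the FeatureBench implementation: the graph, the node names, and the helper functions `reachable` and `feature_candidates` are all hypothetical, and the rule used here (keep only code that no other test depends on) is one assumed way to approximate "other features keep working after the separation".

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from `start` by following dependency edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def feature_candidates(graph, target_test, other_tests):
    """Code reached from `target_test` but from none of `other_tests`.

    Assumption for this sketch: anything another test depends on must stay
    in the repository, so it is excluded from the feature to be separated.
    """
    target = reachable(graph, target_test) - {target_test}
    shared = set()
    for test in other_tests:
        shared |= reachable(graph, test)
    return target - shared


# Hypothetical toy dependency graph: each key depends on the listed nodes.
GRAPH = {
    "test_export_csv": ["export_csv"],
    "export_csv": ["format_row", "open_file"],
    "test_read_config": ["read_config"],
    "read_config": ["open_file"],
}

candidates = feature_candidates(GRAPH, "test_export_csv", ["test_read_config"])
print(sorted(candidates))  # ['export_csv', 'format_row']
```

In this toy graph, `open_file` is excluded because `test_read_config` also depends on it, so removing it would break an unrelated feature; only `export_csv` and `format_row` are candidates for the feature-level task.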

