BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Abstract

Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes, resolution scope and knowledge scope, using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.

