SWE-Explore: Benchmarking Repository Exploration of Coding Agents

Abstract

Repository-level coding benchmarks such as SWE-bench have driven rapid gains in the capabilities of coding agents. Yet they usually treat a coding task as a single binary prediction problem (resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates and evaluates repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along three dimensions (coverage, ranking, and context efficiency) and show that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes that differentiate state-of-the-art explorers.
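To make the task formulation concrete, the following is a minimal sketch of how a ranked list of code regions might be scored against line-level ground truth under a fixed line budget. It is an illustration only: the `Region` type, the function names, the clipping policy for a region that overflows the budget, and the exact definitions of line coverage and file hit are all assumptions of this sketch, not the metric definitions used by SWE-Explore.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical code region: a line span in one file (1-indexed, inclusive)."""
    path: str
    start: int
    end: int

    def lines(self) -> set[tuple[str, int]]:
        # Expand the span into individual (file, line number) pairs.
        return {(self.path, n) for n in range(self.start, self.end + 1)}


def truncate_to_budget(ranked: list[Region], budget: int) -> list[Region]:
    """Keep regions in rank order until the line budget is spent.

    Assumed policy: a region that overflows the remaining budget is clipped,
    and overlapping regions are charged once per region.
    """
    kept: list[Region] = []
    remaining = budget
    for region in ranked:
        if remaining <= 0:
            break
        size = region.end - region.start + 1
        take = min(size, remaining)
        kept.append(Region(region.path, region.start, region.start + take - 1))
        remaining -= take
    return kept


def line_coverage(ranked: list[Region], gold: list[Region], budget: int) -> float:
    """Fraction of ground-truth lines present in the budget-limited output."""
    gold_lines: set[tuple[str, int]] = set()
    for g in gold:
        gold_lines |= g.lines()
    if not gold_lines:
        return 0.0
    returned: set[tuple[str, int]] = set()
    for r in truncate_to_budget(ranked, budget):
        returned |= r.lines()
    return len(returned & gold_lines) / len(gold_lines)


def file_hit(ranked: list[Region], gold: list[Region]) -> bool:
    """True if any returned region touches a file that contains ground truth."""
    gold_files = {g.path for g in gold}
    return any(r.path in gold_files for r in ranked)


# Toy instance: the explorer finds the right file but only half of the lines.
ranked = [Region("a.py", 10, 19), Region("b.py", 1, 50)]
gold = [Region("a.py", 15, 24)]
coverage = line_coverage(ranked, gold, budget=20)
hit = file_hit(ranked, gold)
```

In this toy instance the explorer lands on the correct file yet covers only half of the ground-truth lines, which mirrors the abstract's observation that file-level localization can be strong while line-level coverage still separates explorers.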