SWE-Explore: Evaluating How Coding Agents Explore Code Repositories

Abstract

Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along coverage, ranking, and context-efficiency dimensions, showing that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes differentiating state-of-the-art explorers.
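To make the task format concrete, the following is a minimal sketch of how a ranked list of code regions might be scored against line-level ground truth under a fixed line budget. The abstract names the three evaluation dimensions (coverage, ranking, context efficiency) but does not give their formulas, so everything here is an assumption for illustration: the `Region` type, the truncation rule, and the three metric definitions (line recall, line precision, and reciprocal rank of the first relevant region) are hypothetical stand-ins, not SWE-Explore's actual metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous span of lines in one file (1-indexed, inclusive). Hypothetical type."""
    path: str
    start: int
    end: int


def truncate_to_budget(ranked, budget):
    """Keep regions in rank order, trimming the last one so the total stays within budget."""
    kept, used = [], 0
    for region in ranked:
        remaining = budget - used
        if remaining <= 0:
            break
        take = min(region.end - region.start + 1, remaining)
        kept.append(Region(region.path, region.start, region.start + take - 1))
        used += take
    return kept


def to_lines(regions):
    """Expand regions into a set of (path, line_number) pairs."""
    return {(r.path, n) for r in regions for n in range(r.start, r.end + 1)}


def score(ranked, gold, budget):
    """Assumed metrics: line recall, line precision, reciprocal rank of the first hit."""
    kept = truncate_to_budget(ranked, budget)
    returned, gold_lines = to_lines(kept), to_lines(gold)
    hits = returned & gold_lines
    reciprocal_rank = 0.0
    for rank, region in enumerate(kept, start=1):
        if to_lines([region]) & gold_lines:
            reciprocal_rank = 1.0 / rank
            break
    return {
        "line_coverage": len(hits) / len(gold_lines) if gold_lines else 0.0,
        "context_efficiency": len(hits) / len(returned) if returned else 0.0,
        "reciprocal_rank": reciprocal_rank,
    }


# Toy instance: ten gold lines in a.py, and an explorer that ranks an irrelevant region first.
gold = [Region("a.py", 10, 19)]
ranked = [Region("b.py", 1, 5), Region("a.py", 15, 24), Region("a.py", 1, 100)]
result = score(ranked, gold, budget=12)
print(result)
```

In this toy instance the budget of 12 lines is exhausted partway through the second region, so the third region is never counted. The explorer finds the right file but covers only half of the gold lines, which is the kind of gap between file-level and line-level performance that the abstract highlights.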