SWE-Explore: A Benchmark for Evaluating How Coding Agents Explore Code Repositories

Abstract

Repository-level coding benchmarks such as SWE-bench have driven rapid gains in the capabilities of coding agents. Yet they usually treat a coding task as a single binary outcome (resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from the trajectories of independent agents that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along three dimensions (coverage, ranking, and context efficiency) and show that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes that differentiate state-of-the-art explorers.
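
To make the task format concrete, the sketch below scores a ranked list of code regions against line-level ground truth under a fixed line budget. It is an illustration only: the abstract does not give the benchmark's metric definitions, so the names `Region`, `truncate_to_budget`, `line_coverage`, and `file_recall`, the rule of clipping the region that crosses the budget, and the example paths are all assumptions rather than the SWE-Explore implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous span of lines in one file (1-indexed, inclusive). Hypothetical type."""
    path: str
    start: int
    end: int

    def lines(self):
        # Represent the region as a set of (file, line number) pairs.
        return {(self.path, n) for n in range(self.start, self.end + 1)}


def truncate_to_budget(ranked, budget):
    """Keep regions in rank order until `budget` lines have been spent.

    Assumption: the region that crosses the budget is clipped, not dropped.
    Overlapping regions are charged for every line they list.
    """
    kept, used = [], 0
    for region in ranked:
        remaining = budget - used
        if remaining <= 0:
            break
        size = region.end - region.start + 1
        if size > remaining:
            region = Region(region.path, region.start, region.start + remaining - 1)
            size = remaining
        kept.append(region)
        used += size
    return kept


def line_coverage(ranked, gold, budget):
    """Fraction of ground-truth lines covered by the budget-truncated ranking."""
    gold_lines = set()
    for g in gold:
        gold_lines |= g.lines()
    if not gold_lines:
        return 0.0
    pred_lines = set()
    for region in truncate_to_budget(ranked, budget):
        pred_lines |= region.lines()
    return len(pred_lines & gold_lines) / len(gold_lines)


def file_recall(ranked, gold, budget):
    """Fraction of ground-truth files touched by the budget-truncated ranking."""
    gold_files = {g.path for g in gold}
    if not gold_files:
        return 0.0
    pred_files = {r.path for r in truncate_to_budget(ranked, budget)}
    return len(pred_files & gold_files) / len(gold_files)


# Made-up example: ten ground-truth lines in one file, three ranked guesses.
gold = [Region("src/parser.py", 40, 49)]
ranked = [
    Region("src/parser.py", 1, 20),
    Region("src/parser.py", 45, 60),
    Region("src/utils.py", 1, 50),
]
print(file_recall(ranked, gold, budget=30))    # 1.0
print(line_coverage(ranked, gold, budget=30))  # 0.5
```

In this made-up example the explorer finds the right file (file recall 1.0) but spends most of its 30-line budget on irrelevant lines, so it covers only half of the ground-truth lines. That gap between file-level and line-level scores is the kind of difference the abstract describes.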