SWE-Explore: A Benchmark for Repository Exploration by Coding Agents

Abstract

Repository-level coding benchmarks such as SWE-bench have driven rapid gains in the capabilities of coding agents. Yet they usually treat a coding task as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along coverage, ranking, and context-efficiency dimensions, and show that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes that differentiate state-of-the-art explorers.
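To make the task format concrete, the following is a minimal sketch of how a ranked list of code regions could be scored against line-level ground truth under a fixed line budget. The abstract does not give the benchmark's metric definitions, so everything here is an assumption made for illustration: the `Region` type, the truncation rule in `apply_budget`, and the formulas used for coverage, context efficiency, file recall, and reciprocal rank are hypothetical and are not SWE-Explore's actual implementation.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """Hypothetical code region: a file path and an inclusive 1-indexed line range."""
    path: str
    start: int
    end: int


def lines_of(region: Region) -> set[tuple[str, int]]:
    """Expand a region into a set of (path, line number) pairs."""
    return {(region.path, n) for n in range(region.start, region.end + 1)}


def apply_budget(ranked: list[Region], budget: int) -> list[Region]:
    """Keep regions in rank order until the line budget is spent.

    Assumption: the region that crosses the budget is truncated rather than
    dropped, and overlapping regions are charged once per region.
    """
    kept: list[Region] = []
    remaining = budget
    for region in ranked:
        if remaining <= 0:
            break
        size = region.end - region.start + 1
        take = min(size, remaining)
        kept.append(Region(region.path, region.start, region.start + take - 1))
        remaining -= take
    return kept


def evaluate(ranked: list[Region], gold: list[Region], budget: int) -> dict[str, float]:
    """Score a ranked list against gold regions (illustrative metrics only)."""
    kept = apply_budget(ranked, budget)
    returned: set[tuple[str, int]] = set()
    for region in kept:
        returned |= lines_of(region)
    gold_lines: set[tuple[str, int]] = set()
    for region in gold:
        gold_lines |= lines_of(region)
    hit = returned & gold_lines

    # Rank (1-indexed) of the first kept region that overlaps any gold line.
    first_hit: Optional[int] = next(
        (i for i, r in enumerate(kept, start=1) if lines_of(r) & gold_lines), None
    )
    gold_files = {g.path for g in gold}
    kept_files = {r.path for r in kept}
    return {
        # Fraction of gold lines that were returned (line-level coverage).
        "coverage": len(hit) / len(gold_lines) if gold_lines else 0.0,
        # Fraction of returned lines that are gold (context efficiency).
        "efficiency": len(hit) / len(returned) if returned else 0.0,
        # Fraction of gold files touched at all (file-level localization).
        "file_recall": len(kept_files & gold_files) / len(gold_files) if gold_files else 0.0,
        # Reciprocal rank of the first relevant region (ranking quality).
        "reciprocal_rank": 1.0 / first_hit if first_hit else 0.0,
    }


gold = [Region("a.py", 10, 19)]
ranked = [Region("b.py", 1, 5), Region("a.py", 15, 30)]
scores = evaluate(ranked, gold, budget=15)
print(scores)
```

In this toy case the explorer finds the right file, so file recall is perfect, yet it covers only half of the gold lines and spends a third of its budget on an irrelevant file. That gap between file-level and line-level scores is the kind of difference the abstract says separates strong explorers.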