SweRank: Software Issue Localization with Code Ranking

Abstract

Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., a bug report or feature request), is a critical yet time-consuming part of software development. While recent LLM-based agentic approaches show promise, they often incur significant latency and cost because of complex multi-step reasoning and reliance on closed-source LLMs. Traditional code ranking models, on the other hand, are typically optimized for query-to-code or code-to-code retrieval and struggle with the verbose, failure-descriptive nature of issue localization queries. To bridge this gap, we introduce SweRank, an efficient and effective retrieve-and-rerank framework for software issue localization. To facilitate training, we construct SweLoc, a large-scale dataset curated from public GitHub repositories, featuring real-world issue descriptions paired with their corresponding code modifications. Empirical results on SWE-Bench-Lite and LocBench show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems that use closed-source LLMs such as Claude-3.5. Further, we demonstrate SweLoc's utility in enhancing various existing retriever and reranker models for issue localization, establishing the dataset as a valuable resource for the community.
