SweRank: Software Issue Localization with Code Ranking

May 7, 2025
Authors: Revanth Gangi Reddy, Tarun Suresh, JaeHyeok Doo, Ye Liu, Xuan Phi Nguyen, Yingbo Zhou, Semih Yavuz, Caiming Xiong, Heng Ji, Shafiq Joty
cs.AI

Abstract

Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., bug report, feature request), is a critical yet time-consuming aspect of software development. While recent LLM-based agentic approaches show promise, they often incur significant latency and cost due to complex multi-step reasoning and reliance on closed-source LLMs. By contrast, traditional code ranking models, typically optimized for query-to-code or code-to-code retrieval, struggle with the verbose and failure-descriptive nature of issue localization queries. To bridge this gap, we introduce SweRank, an efficient and effective retrieve-and-rerank framework for software issue localization. To facilitate training, we construct SweLoc, a large-scale dataset curated from public GitHub repositories, featuring real-world issue descriptions paired with corresponding code modifications. Empirical results on SWE-Bench-Lite and LocBench show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems using closed-source LLMs like Claude-3.5. Further, we demonstrate SweLoc's utility in enhancing various existing retriever and reranker models for issue localization, establishing the dataset as a valuable resource for the community.
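
The retrieve-and-rerank pattern named in the abstract can be illustrated with a minimal sketch. This is not SweRank's implementation: the bag-of-words cosine retriever and the token-overlap reranker below are toy stand-ins for the learned models, and every name (`tokenize`, `retrieve`, `rerank`, `localize`, the example file paths) is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (so parse_date -> parse, date)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(issue, functions, k):
    """Stage 1 (toy retriever): cheap scoring over every function, keep top k."""
    query = Counter(tokenize(issue))
    scored = [
        (cosine(query, Counter(tokenize(body))), name)
        for name, body in functions.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by name
    return [name for _, name in scored[:k]]


def rerank(issue, candidates, functions):
    """Stage 2 (toy reranker): rescore only the shortlist with a second signal,
    here the fraction of issue tokens found in the function name plus body."""
    query = set(tokenize(issue))

    def score(name):
        tokens = set(tokenize(name + " " + functions[name]))
        return len(query & tokens) / len(query) if query else 0.0

    return sorted(candidates, key=lambda name: (-score(name), name))


def localize(issue, functions, k=2):
    """Retrieve a shortlist, then rerank it."""
    return rerank(issue, retrieve(issue, functions, k), functions)
```

The point of the two-stage split is that the first stage scores the whole repository cheaply, while the second, more expensive scorer only sees the `k` shortlisted candidates.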
