
SWE-QA: Can Language Models Answer Repository-level Code Questions?

September 18, 2025
Authors: Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, Xiaodong Gu
cs.AI

Abstract

Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA comprises 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.
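
To illustrate the kind of data-collection step the abstract describes, below is a minimal sketch of crawling issues from a public repository via the GitHub REST API. The example repository, the token handling, and the question-filtering heuristic are illustrative assumptions, not the paper's actual pipeline.

```python
import requests

def crawl_issues(owner: str, repo: str, token: str, max_pages: int = 5):
    """Yield (title, body) pairs for issues in owner/repo (hypothetical sketch)."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    url = f"https://api.github.com/repos/{owner}/{repo}/issues"
    for page in range(1, max_pages + 1):
        resp = requests.get(url, headers=headers, timeout=30,
                            params={"state": "all", "per_page": 100, "page": page})
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break  # past the last page of results
        for issue in batch:
            if "pull_request" in issue:
                continue  # this endpoint also returns PRs; keep issues only
            yield issue["title"], issue.get("body") or ""

# Example: keep question-like issue titles as raw material for seed questions.
# questions = [title for title, _ in crawl_issues("pallets", "flask", token="...")
#              if title.rstrip().endswith("?")]
```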
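Likewise, here is a rough sketch of the "reason and act" loop behind an agentic framework like SWE-QA-Agent: the model repeatedly picks a tool, observes the result, and either keeps exploring or commits to an answer. The two tools, the JSON action format, and the `llm` callable are assumptions made for this sketch, not the paper's actual interface.

```python
import json
import pathlib
import subprocess

def grep_repo(repo: str, pattern: str) -> str:
    """Tool: search the repository for a pattern (first 20 matches)."""
    out = subprocess.run(["grep", "-rn", pattern, repo],
                         capture_output=True, text=True)
    return "\n".join(out.stdout.splitlines()[:20])

def read_file(repo: str, rel_path: str) -> str:
    """Tool: read a file from the repository (first 200 lines)."""
    text = pathlib.Path(repo, rel_path).read_text(errors="ignore")
    return "\n".join(text.splitlines()[:200])

def answer_question(llm, repo: str, question: str, max_steps: int = 8) -> str:
    """Let the model iteratively choose a tool, observe, and refine its answer."""
    transcript = f"Question about repository {repo}:\n{question}\n"
    for _ in range(max_steps):
        # Ask the model for its next action as a small JSON object.
        action = json.loads(llm(
            transcript
            + '\nReply with JSON: {"tool": "grep" | "read" | "answer", "arg": ...}'
        ))
        if action["tool"] == "answer":
            return action["arg"]  # final answer grounded in prior observations
        obs = (grep_repo(repo, action["arg"]) if action["tool"] == "grep"
               else read_file(repo, action["arg"]))
        transcript += f"\nAction: {action}\nObservation:\n{obs}\n"
    return "No answer found within the step budget."
```

The design choice worth noting is that each observation is appended to the transcript, so later tool calls (and the final answer) can draw on everything seen so far across files, which is what cross-file, multi-hop questions require.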