SWE-QA: Can Language Models Answer Repository-level Code Questions?
September 18, 2025
Authors: Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, Xiaodong Gu
cs.AI
Abstract
Understanding and reasoning about entire software repositories is an
essential capability for intelligent software engineering tools. While existing
benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly
focus on small, self-contained code snippets. These setups fail to capture the
complexity of real-world repositories, where effective understanding and
reasoning often require navigating multiple files, understanding software
architecture, and grounding answers in long-range code dependencies. In this
paper, we present SWE-QA, a repository-level code question answering (QA)
benchmark designed to facilitate research on automated QA systems in realistic
code environments. SWE-QA comprises 576 high-quality question-answer pairs
spanning diverse categories, including intention understanding, cross-file
reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first
crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis
of naturally occurring developer questions extracted from these issues, we
developed a two-level taxonomy of repository-level questions and constructed a
set of seed questions for each category. For each category, we manually curated
and validated questions and collected their corresponding answers. As a
prototype application, we further develop SWE-QA-Agent, an agentic framework in
which LLM agents reason and act to find answers automatically. We evaluate six
advanced LLMs on SWE-QA under various context augmentation strategies.
Experimental results highlight the promise of LLMs, particularly our
SWE-QA-Agent framework, in addressing repository-level QA, while also revealing
open challenges and pointing to future research directions.
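The benchmark's construction begins by crawling GitHub issues at scale (77,100 issues across 11 repositories). Below is a minimal sketch of what that collection step could look like, assuming the public GitHub REST API accessed via `requests`; the repository shown, the pagination budget, and the question filter are illustrative choices, not the authors' exact pipeline.

```python
# Minimal sketch: paginate through a repository's issues via the GitHub REST API.
# The repository and filtering shown are illustrative, not the paper's exact setup.
import os
import requests

def crawl_issues(owner: str, repo: str, max_pages: int = 10):
    """Yield issue dicts (pull requests filtered out) for one repository."""
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")  # optional; raises rate limits when set
    if token:
        headers["Authorization"] = f"Bearer {token}"
    for page in range(1, max_pages + 1):
        resp = requests.get(
            f"https://api.github.com/repos/{owner}/{repo}/issues",
            params={"state": "all", "per_page": 100, "page": page},
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break  # no more pages
        for issue in batch:
            if "pull_request" not in issue:  # the issues endpoint also returns PRs
                yield issue

# Example: surface question-bearing issue titles from one popular repository.
for issue in crawl_issues("psf", "requests", max_pages=2):
    if "?" in issue["title"]:
        print(issue["number"], issue["title"])
```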
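SWE-QA-Agent is described as an agentic framework in which LLM agents reason and act to find answers. The sketch below illustrates one plausible reason-and-act (ReAct-style) loop over a local repository; the tool set (`search_code`, `read_file`), the JSON action format, and the `llm` callable are assumptions made for illustration, and the paper's actual agent may differ in tools, prompting, and termination criteria.

```python
# Minimal ReAct-style loop for repository-level QA. The tools, prompt format,
# and `llm` callable are illustrative assumptions, not SWE-QA-Agent's implementation.
import json
import pathlib
import subprocess
from typing import Callable

def search_code(repo: pathlib.Path, pattern: str) -> str:
    """Grep the repository for a pattern and return matching lines (truncated)."""
    out = subprocess.run(
        ["grep", "-rn", "--include=*.py", pattern, str(repo)],
        capture_output=True, text=True,
    )
    return out.stdout[:4000] or "no matches"

def read_file(repo: pathlib.Path, rel_path: str) -> str:
    """Return a truncated view of one file's contents."""
    return (repo / rel_path).read_text(errors="replace")[:4000]

def answer_question(repo: pathlib.Path, question: str,
                    llm: Callable[[str], str], max_steps: int = 8) -> str:
    """Let the model iteratively pick a tool, observe the result, then answer."""
    transcript = f"Question about the repository: {question}\n"
    for _ in range(max_steps):
        action = json.loads(llm(
            transcript
            + '\nRespond with JSON: {"tool": "search_code"|"read_file"|"answer", "arg": "..."}'
        ))
        if action["tool"] == "answer":
            return action["arg"]
        tool = search_code if action["tool"] == "search_code" else read_file
        observation = tool(repo, action["arg"])
        transcript += f"\nAction: {action}\nObservation: {observation}\n"
    return "No answer found within the step budget."
```

The key design point such a loop illustrates is grounding: rather than answering from the prompt alone, the agent retrieves evidence across files and long-range dependencies before committing to an answer, which is precisely the capability the benchmark's cross-file and multi-hop categories are meant to probe.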