SWE-QA: Can Language Models Answer Repository-level Code Questions?

Abstract

Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA comprises 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further developed SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluated six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.