SWE-QA: Can Language Models Answer Repository-level Code Questions?

Abstract

Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA comprises 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.
