SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?

Abstract

Code performance optimization is paramount in real-world software engineering and critical for production-level systems. While Large Language Models (LLMs) have demonstrated impressive capabilities in code generation and bug fixing, their proficiency in enhancing code performance at the repository level remains largely unexplored. To address this gap, we introduce SWE-Perf, the first benchmark specifically designed to systematically evaluate LLMs on code performance optimization tasks within authentic repository contexts. SWE-Perf comprises 140 carefully curated instances, each derived from a performance-improving pull request in a popular GitHub repository. Each benchmark instance includes the relevant codebase, target functions, performance-related tests, an expert-authored patch, and an executable environment. Through a comprehensive evaluation of representative methods spanning file-level and repo-level approaches (e.g., Agentless and OpenHands), we reveal a substantial capability gap between existing LLMs and expert-level optimization performance, highlighting critical research opportunities in this emerging field.
