SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?
July 16, 2025
Authors: Xinyi He, Qian Liu, Mingzhe Du, Lin Yan, Zhijie Fan, Yiming Huang, Zejian Yuan, Zejun Ma
cs.AI
Abstract
Code performance optimization is paramount in real-world software engineering
and critical for production-level systems. While Large Language Models (LLMs)
have demonstrated impressive capabilities in code generation and bug fixing,
their proficiency in enhancing code performance at the repository level remains
largely unexplored. To address this gap, we introduce SWE-Perf, the first
benchmark specifically designed to systematically evaluate LLMs on code
performance optimization tasks within authentic repository contexts. SWE-Perf
comprises 140 carefully curated instances, each derived from a
performance-improving pull request in a popular GitHub repository. Each
benchmark instance includes the relevant codebase, target functions,
performance-related tests, expert-authored patches, and executable
environments. Through a comprehensive evaluation of representative methods that
span file-level and repo-level approaches (e.g., Agentless and OpenHands), we
reveal a substantial capability gap between existing LLMs and expert-level
optimization performance, highlighting critical research opportunities in this
emerging field.
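
To make the task setup concrete, the sketch below shows one way a SWE-Perf-style instance could be represented and scored: apply a candidate patch to the repository at a fixed commit, then compare the runtime of the instance's performance-related tests before and after. The field names, repository URL, file paths, and timing procedure are illustrative assumptions for this sketch, not the benchmark's actual schema or evaluation harness.

```python
# Hypothetical sketch of a SWE-Perf-style instance and a simple speedup measurement.
# All field names, paths, and the timing procedure are illustrative assumptions.
import statistics
import subprocess
import time


def run_perf_tests(repo_dir: str, tests: list[str], repeats: int = 3) -> float:
    """Run the instance's performance-related tests and return the median wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            ["python", "-m", "pytest", *tests],
            cwd=repo_dir,
            check=True,
            capture_output=True,
        )
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Illustrative instance record (field names are assumptions, not the released format).
instance = {
    "repo": "https://github.com/example/project",        # placeholder repository
    "base_commit": "abc1234",                              # commit the patch applies to
    "target_functions": ["pkg/module.py::slow_function"],  # functions to optimize
    "perf_tests": ["tests/test_perf_slow_function.py"],    # performance-related tests
    "expert_patch": "expert.diff",                          # expert-authored reference patch
}


def speedup(repo_dir: str, patch_file: str, inst: dict) -> float:
    """Apply a patch and report the runtime speedup on the instance's performance tests."""
    baseline = run_perf_tests(repo_dir, inst["perf_tests"])
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    patched = run_perf_tests(repo_dir, inst["perf_tests"])
    return baseline / patched  # >1.0 means the patched code runs the tests faster
```

In this framing, a model-generated patch and the expert-authored patch could each be scored with `speedup`, making the capability gap reported above directly comparable per instance; the actual benchmark may define correctness checks and timing protocols differently.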