

SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?

July 16, 2025
作者: Xinyi He, Qian Liu, Mingzhe Du, Lin Yan, Zhijie Fan, Yiming Huang, Zejian Yuan, Zejun Ma
cs.AI

Abstract

Code performance optimization is paramount in real-world software engineering and critical for production-level systems. While Large Language Models (LLMs) have demonstrated impressive capabilities in code generation and bug fixing, their proficiency in enhancing code performance at the repository level remains largely unexplored. To address this gap, we introduce SWE-Perf, the first benchmark specifically designed to systematically evaluate LLMs on code performance optimization tasks within authentic repository contexts. SWE-Perf comprises 140 carefully curated instances, each derived from performance-improving pull requests from popular GitHub repositories. Each benchmark instance includes the relevant codebase, target functions, performance-related tests, expert-authored patches, and executable environments. Through a comprehensive evaluation of representative methods that span file-level and repo-level approaches (e.g., Agentless and OpenHands), we reveal a substantial capability gap between existing LLMs and expert-level optimization performance, highlighting critical research opportunities in this emerging field.
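As a rough illustration of the evaluation setup the abstract describes, the sketch below models what a single benchmark instance and a runtime-based before/after comparison might look like. All field names (`repo`, `target_functions`, `perf_tests`, etc.), the `measure_runtime` helper, and the speedup formula are illustrative assumptions, not the schema or harness of the actual SWE-Perf release.

```python
# Hypothetical sketch of a SWE-Perf-style instance and a runtime comparison.
# Field names and the measurement helper are illustrative assumptions, not the
# benchmark's actual schema or evaluation harness.
import statistics
import subprocess
import time
from dataclasses import dataclass


@dataclass
class PerfInstance:
    repo: str                    # e.g. "owner/project" on GitHub (illustrative)
    base_commit: str             # commit before the performance-improving PR
    target_functions: list[str]  # functions the optimization should speed up
    perf_tests: list[str]        # performance-related tests to run
    expert_patch: str            # human-authored reference patch (diff text)


def measure_runtime(test_cmd: list[str], runs: int = 5) -> float:
    """Run a test command several times and return the median wall-clock time."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(test_cmd, check=True, capture_output=True)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(baseline_s: float, patched_s: float) -> float:
    """Relative runtime improvement of the patched code over the baseline."""
    return (baseline_s - patched_s) / baseline_s
```

A harness along these lines would check out `base_commit`, apply a model-generated patch, time `perf_tests` before and after, and compare the resulting speedup against the expert patch; the real SWE-Perf evaluation may differ in how it controls for measurement noise and defines its performance metric.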