GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
May 29, 2025
Authors: Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica
cs.AI
Abstract
Developing high-performance software is a complex task that requires
specialized expertise. We introduce GSO, a benchmark for evaluating language
models' capabilities in developing high-performance software. We develop an
automated pipeline that generates and executes performance tests to analyze
repository commit histories to identify 102 challenging optimization tasks
across 10 codebases, spanning diverse domains and programming languages. An
agent is provided with a codebase and performance test as a precise
specification, and tasked to improve the runtime efficiency, which is measured
against the expert developer optimization. Our quantitative evaluation reveals
that leading SWE-Agents struggle significantly, achieving less than 5% success
rate, with limited improvements even with inference-time scaling. Our
qualitative analysis identifies key failure modes, including difficulties with
low-level languages, practicing lazy optimization strategies, and challenges in
accurately localizing bottlenecks. We release the code and artifacts of our
benchmark along with agent trajectories to enable future research.
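As a concrete illustration of the evaluation the abstract describes, the sketch below shows one way a speedup-relative-to-expert metric could be computed from a task's performance test. This is a minimal sketch under stated assumptions: the function names, repetition count, and success threshold are illustrative, not GSO's actual implementation.

```python
import time

def best_runtime(run_perf_test, repeats=5):
    """Run a performance test several times and keep the best wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_perf_test()  # the task's performance test against one codebase state
        timings.append(time.perf_counter() - start)
    return min(timings)

def opt_score(t_base, t_agent, t_expert):
    """Fraction of the expert developer's speedup achieved by the agent's patch.

    Speedups are measured relative to the unoptimized baseline commit;
    a score of 1.0 means the agent matched the expert optimization.
    """
    return (t_base / t_agent) / (t_base / t_expert)

# Hypothetical usage: count a task as solved when the agent reaches a chosen
# fraction of the expert speedup (the 0.95 threshold here is an assumption).
# solved = opt_score(t_base, t_agent, t_expert) >= 0.95
```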