SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?
November 8, 2025
Authors: Jeffrey Jian Ma, Milad Hashemi, Amir Yazdanbakhsh, Kevin Swersky, Ofir Press, Enhui Li, Vijay Janapa Reddi, Parthasarathy Ranganathan
cs.AI
Abstract
Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather than how to fix code. We introduce SWE-fficiency, a benchmark for evaluating repository-level performance optimization on real workloads. Our suite contains 498 tasks across nine widely used data-science, machine-learning, and HPC repositories (e.g., numpy, pandas, scipy): given a complete codebase and a slow workload, an agent must investigate code semantics, localize bottlenecks and relevant tests, and produce a patch that matches or exceeds the expert speedup while passing the same unit tests. To enable this how-to-fix evaluation, our automated pipeline scrapes GitHub pull requests for performance-improving edits, combining keyword filtering, static analysis, coverage tooling, and execution validation to both confirm expert speedup baselines and identify relevant repository unit tests. Empirical evaluation of state-of-the-art agents reveals significant underperformance: on average, agents achieve less than 0.15x the expert speedup. They struggle to localize optimization opportunities, reason about execution across functions, and maintain correctness in their proposed edits. We release the benchmark and accompanying data pipeline to facilitate research on automated performance engineering and long-horizon software reasoning.
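The scoring described above (an agent's speedup measured against the expert's, gated on the relevant unit tests still passing) can be sketched roughly as follows. This is a minimal illustration, not the benchmark's released harness: the script-based workload, timing protocol, and helper names such as `time_workload` are assumptions made for the example.

```python
"""Minimal sketch of an SWE-fficiency-style speedup metric.

Assumptions (not taken from the paper's released harness): the slow
workload is a standalone script run against a checked-out repository,
and correctness is a boolean "all relevant unit tests pass".
"""
import subprocess
import time
from statistics import median


def time_workload(python_exe: str, workload_script: str, repeats: int = 5) -> float:
    """Median wall-clock seconds to run the workload script under one repo state."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run([python_exe, workload_script], check=True, capture_output=True)
        timings.append(time.perf_counter() - start)
    return median(timings)


def speedup_ratio(baseline_s: float, expert_s: float, agent_s: float,
                  tests_pass: bool) -> float:
    """Agent speedup as a fraction of the expert speedup; 0 if tests fail.

    A patch matching the expert edit scores 1.0; exceeding it scores above 1.0.
    """
    if not tests_pass or agent_s <= 0:
        return 0.0
    expert_speedup = baseline_s / expert_s
    agent_speedup = baseline_s / agent_s
    return agent_speedup / expert_speedup
```

Under this kind of metric, the reported result (agents recovering less than 0.15x of the expert speedup on average) corresponds to `speedup_ratio` values well below 1.0 across tasks.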