GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

May 29, 2025
作者: Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica
cs.AI

Abstract

Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories, identifying 102 challenging optimization tasks across 10 codebases that span diverse domains and programming languages. An agent is provided with a codebase and a performance test as a precise specification, and is tasked with improving runtime efficiency, which is measured against the expert developer's optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving success rates below 5%, with limited improvement even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, a tendency toward lazy optimization strategies, and challenges in accurately localizing performance bottlenecks. We release the code and artifacts of our benchmark, along with agent trajectories, to enable future research.
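
The abstract does not spell out the scoring formula, but measuring an agent's patch "against the expert developer's optimization" suggests a speedup-ratio style comparison. Below is a minimal sketch, assuming the benchmark times a generated performance test before and after a patch and normalizes the agent's speedup by the expert's; the function names (`measure_runtime`, `relative_speedup`, `fraction_of_expert_speedup`) and the toy workload are hypothetical, not the paper's actual harness.

```python
import time
import statistics
from typing import Callable


def measure_runtime(perf_test: Callable[[], None], repeats: int = 5) -> float:
    """Run the performance test several times and return the median wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        perf_test()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def relative_speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Speedup of a patched version over the unpatched baseline (>1.0 means faster)."""
    return baseline_seconds / optimized_seconds


def fraction_of_expert_speedup(agent_speedup: float, expert_speedup: float) -> float:
    """How much of the expert developer's speedup the agent's patch recovers (assumed metric)."""
    return agent_speedup / expert_speedup


if __name__ == "__main__":
    # Hypothetical workload standing in for a repository's generated performance test.
    def perf_test() -> None:
        sum(i * i for i in range(200_000))

    baseline = measure_runtime(perf_test)
    print(f"baseline median runtime: {baseline:.4f}s")
```

Under this reading, an agent's patch would count as successful only if its speedup on the generated performance test reaches some threshold fraction of the expert commit's speedup; the exact threshold is not stated in the abstract.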
