GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Abstract

Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories, identifying 102 challenging optimization tasks across 10 codebases that span diverse domains and programming languages. An agent is given a codebase and a performance test as a precise specification and is tasked with improving runtime efficiency, which is measured against the expert developer's optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving a success rate below 5%, with limited improvement even under inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulty with low-level languages, reliance on lazy optimization strategies, and trouble accurately localizing bottlenecks. We release the code and artifacts of our benchmark, along with agent trajectories, to enable future research.

