GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Abstract

Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests and analyzes repository commit histories, identifying 102 challenging optimization tasks across 10 codebases that span diverse domains and programming languages. An agent is given a codebase and a performance test as a precise specification and is tasked with improving runtime efficiency, which is measured against the expert developer's optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving a success rate below 5%, with only limited improvement even under inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulty with low-level languages, reliance on lazy optimization strategies, and trouble accurately localizing bottlenecks. We release the code and artifacts of our benchmark, along with agent trajectories, to enable future research.
