SWE-bench Goes Live!

Abstract

The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is \method, an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.
