SWE-bench Goes Live!
May 29, 2025
Authors: Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang
cs.AI
Abstract
The issue-resolving task, where a model generates patches to fix real-world
bugs, has emerged as a critical benchmark for evaluating the capabilities of
large language models (LLMs). While SWE-bench and its variants have become
standard in this domain, they suffer from key limitations: they have not been
updated since their initial releases, cover a narrow set of repositories, and
depend heavily on manual effort for instance construction and environment
setup. These factors hinder scalability and introduce risks of overfitting and
data contamination. In this work, we present SWE-bench-Live, a
live-updatable benchmark designed to overcome these challenges. Our
initial release consists of 1,319 tasks derived from real GitHub issues created
since 2024, spanning 93 repositories. Each task is accompanied by a dedicated
Docker image to ensure reproducible execution. Central to our benchmark is an
automated curation pipeline that streamlines the entire process
from instance creation to environment setup, removing manual bottlenecks and
enabling scalability and continuous updates. We evaluate a range of
state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a
substantial performance gap compared to static benchmarks like SWE-bench, even
under controlled evaluation conditions. To better understand this discrepancy,
we perform detailed analyses across repository origin, issue recency, and task
difficulty. By providing a fresh, diverse, and executable benchmark grounded in
live repository activity, SWE-bench-Live facilitates rigorous,
contamination-resistant evaluation of LLMs and agents in dynamic, real-world
software development settings.
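The abstract states that every task ships with a dedicated Docker image so that execution is reproducible. As a rough illustration of that idea (not the official SWE-bench-Live harness), the Python sketch below evaluates a single hypothetical task instance; the field names, instance ID, image tag, and the assumption that the image's working directory is the repository checkout are all made up for this example, and the real benchmark provides its own schema and evaluation tooling.

```python
"""Minimal sketch of evaluating one SWE-bench-Live-style task instance.

The task fields and image tag below are illustrative assumptions, not the
official SWE-bench-Live schema or harness.
"""
import subprocess

# Hypothetical task instance: a real GitHub issue pinned to a base commit,
# with a dedicated Docker image and the tests a correct patch must make pass.
task = {
    "instance_id": "example-org__example-repo-1234",  # assumed naming scheme
    "image": "swe-bench-live/example-repo:1234",      # assumed image tag
    "base_commit": "abc1234",
    "fail_to_pass": ["tests/test_bugfix.py::test_issue_1234"],
}


def evaluate(task: dict, patch_path: str) -> bool:
    """Apply a model-generated patch inside the task's container and rerun tests."""
    # Commands executed inside the dedicated, frozen Docker environment.
    # The image's working directory is assumed to be the repository checkout.
    inner = (
        f"git checkout {task['base_commit']} && "
        "git apply /patch.diff && "
        f"python -m pytest {' '.join(task['fail_to_pass'])}"
    )
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{patch_path}:/patch.diff:ro",  # mount the candidate patch read-only
            task["image"],
            "bash", "-lc", inner,
        ],
        capture_output=True,
        text=True,
    )
    # The task counts as resolved only if the previously failing tests now pass.
    return result.returncode == 0


if __name__ == "__main__":
    resolved = evaluate(task, patch_path="/tmp/model_patch.diff")
    print(f"{task['instance_id']}: {'resolved' if resolved else 'unresolved'}")
```

The point the abstract emphasizes is that, because each environment is frozen inside its image, the same patch and test commands yield the same verdict wherever the evaluation runs.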