SWE-bench Goes Live!
May 29, 2025
Authors: Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang
cs.AI
Abstract
The issue-resolving task, where a model generates patches to fix real-world
bugs, has emerged as a critical benchmark for evaluating the capabilities of
large language models (LLMs). While SWE-bench and its variants have become
standard in this domain, they suffer from key limitations: they have not been
updated since their initial releases, cover a narrow set of repositories, and
depend heavily on manual effort for instance construction and environment
setup. These factors hinder scalability and introduce risks of overfitting and
data contamination. In this work, we present SWE-bench-Live, a
live-updatable benchmark designed to overcome these challenges. Our
initial release consists of 1,319 tasks derived from real GitHub issues created
since 2024, spanning 93 repositories. Each task is accompanied by a dedicated
Docker image to ensure reproducible execution. Central to our benchmark is an
automated curation pipeline that streamlines the entire process
from instance creation to environment setup, removing manual bottlenecks and
enabling scalability and continuous updates. We evaluate a range of
state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a
substantial performance gap compared to static benchmarks like SWE-bench, even
under controlled evaluation conditions. To better understand this discrepancy,
we perform detailed analyses across repository origin, issue recency, and task
difficulty. By providing a fresh, diverse, and executable benchmark grounded in
live repository activity, SWE-bench-Live facilitates rigorous,
contamination-resistant evaluation of LLMs and agents in dynamic, real-world
software development settings.
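The abstract states that every task ships with a dedicated Docker image so that execution is reproducible. As a rough illustration of that idea (not the official SWE-bench-Live harness), the Python sketch below evaluates a single hypothetical task instance; the field names, instance ID, image tag, and the assumption that the image's working directory is the repository checkout are all made up for this example, and the real benchmark provides its own schema and evaluation tooling.

```python
"""Minimal sketch of evaluating one SWE-bench-Live-style task instance.

The task fields and image tag below are illustrative assumptions, not the
official SWE-bench-Live schema or harness.
"""
import subprocess

# Hypothetical task instance: a real GitHub issue pinned to a base commit,
# with a dedicated Docker image and the tests a correct patch must make pass.
task = {
    "instance_id": "example-org__example-repo-1234",  # assumed naming scheme
    "image": "swe-bench-live/example-repo:1234",      # assumed image tag
    "base_commit": "abc1234",
    "fail_to_pass": ["tests/test_bugfix.py::test_issue_1234"],
}


def evaluate(task: dict, patch_path: str) -> bool:
    """Apply a model-generated patch inside the task's container and rerun tests."""
    # Commands executed inside the dedicated, frozen Docker environment.
    # The image's working directory is assumed to be the repository checkout.
    inner = (
        f"git checkout {task['base_commit']} && "
        "git apply /patch.diff && "
        f"python -m pytest {' '.join(task['fail_to_pass'])}"
    )
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{patch_path}:/patch.diff:ro",  # mount the candidate patch read-only
            task["image"],
            "bash", "-lc", inner,
        ],
        capture_output=True,
        text=True,
    )
    # The task counts as resolved only if the previously failing tests now pass.
    return result.returncode == 0


if __name__ == "__main__":
    resolved = evaluate(task, patch_path="/tmp/model_patch.diff")
    print(f"{task['instance_id']}: {'resolved' if resolved else 'unresolved'}")
```

The point the abstract emphasizes is that, because each environment is frozen inside its image, the same patch and test commands yield the same verdict wherever the evaluation runs.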