
SWE-bench Goes Live!

May 29, 2025
Authors: Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang
cs.AI

Abstract

The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.
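As a concrete illustration of the per-task Docker setup described above, the sketch below loads the benchmark's task instances and launches one task's dedicated container. This is a minimal sketch only: the Hugging Face dataset identifier, split name, field names, and image-naming scheme are assumptions made for illustration (they follow common SWE-bench conventions) and are not specified in this abstract.

```python
# Minimal sketch: inspect SWE-bench-Live task instances and start one task's
# dedicated Docker image for reproducible execution.
# NOTE: the dataset path, split, field names, and image tag scheme below are
# assumptions for illustration, not the official interface.
import subprocess
from datasets import load_dataset  # pip install datasets

# Hypothetical Hugging Face dataset identifier for SWE-bench-Live.
tasks = load_dataset("SWE-bench-Live/SWE-bench-Live", split="full")

task = tasks[0]
print(task["instance_id"])        # assumed field: unique task identifier
print(task["problem_statement"])  # assumed field: GitHub issue text to resolve

# Each task ships with a dedicated Docker image; this tag format is a
# hypothetical naming scheme used only for this example.
image = f"swebench-live/{task['instance_id'].lower()}:latest"
subprocess.run(["docker", "run", "--rm", image, "python", "--version"], check=True)
```

Packaging one image per task is what lets each instance's tests run reproducibly, independent of the host machine's toolchain.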
