SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
May 26, 2025
Authors: Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel
cs.AI
Abstract
LLM-based agents have shown promising capabilities in a growing range of
software engineering (SWE) tasks. However, advancing this field faces two
critical challenges. First, high-quality training data is scarce, especially
data that reflects real-world SWE scenarios, where agents must interact with
development environments, execute code and adapt behavior based on the outcomes
of their actions. Existing datasets are either limited to one-shot code
generation or comprise small, manually curated collections of interactive
tasks, lacking both scale and diversity. Second, the lack of fresh interactive
SWE tasks affects evaluation of rapidly improving models, as static benchmarks
quickly become outdated due to contamination issues. To address these
limitations, we introduce a novel, automated, and scalable pipeline to
continuously extract real-world interactive SWE tasks from diverse GitHub
repositories. Using this pipeline, we construct SWE-rebench, a public dataset
comprising over 21,000 interactive Python-based SWE tasks, suitable for
reinforcement learning of SWE agents at scale. Additionally, we use a
continuous supply of fresh tasks collected with the SWE-rebench methodology to
build a contamination-free benchmark for agentic software engineering. We
compare the results of various LLMs on this benchmark to their results on
SWE-bench Verified and show that the performance of some language models might
be inflated due to contamination issues.
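
To illustrate how SWE-rebench-style tasks might be consumed for agent training or evaluation, here is a minimal sketch using the Hugging Face `datasets` library. The dataset identifier, split name, and field names (`instance_id`, `repo`, `problem_statement`) are assumptions made for illustration, not a schema confirmed by the abstract.

```python
# Minimal sketch: loading SWE-rebench-style interactive SWE tasks.
# Assumptions (not stated in the abstract): the dataset is published on the
# Hugging Face Hub under an identifier like "nebius/SWE-rebench", with a
# "test" split and per-task fields such as "instance_id", "repo", and
# "problem_statement".
from datasets import load_dataset

dataset_id = "nebius/SWE-rebench"  # hypothetical identifier
tasks = load_dataset(dataset_id, split="test")  # split name is an assumption

for task in tasks:
    # Each record is assumed to describe a real GitHub issue together with the
    # repository state needed to build an executable environment for the agent.
    print(task.get("instance_id"), task.get("repo"))
    # An agent loop would run here: set up the environment, let the model edit
    # code and run tests, and score the result against the reference fix.
    break
```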