SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

May 26, 2025
Authors: Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel
cs.AI

Abstract

LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code, and adapt their behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks hinders the evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use the continuous supply of fresh tasks collected with the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare the results of various LLMs on this benchmark to their results on SWE-bench Verified and show that the performance of some language models may be inflated due to contamination issues.
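
Since the dataset is public, it would presumably be consumed with standard tooling. Below is a minimal, hypothetical sketch using the Hugging Face datasets library; the dataset ID ("nebius/SWE-rebench"), the split name, and the field names ("instance_id", "repo") are assumptions for illustration, not details given in the abstract, so consult the official release for the actual schema.

    # Hypothetical sketch: browsing SWE-rebench tasks for agent training.
    # Dataset ID, split, and field names below are assumed, not confirmed.
    from datasets import load_dataset

    dataset = load_dataset("nebius/SWE-rebench", split="test")  # assumed ID/split

    # Each task is expected to pair a repository snapshot with an issue
    # description and validating tests, in the spirit of SWE-bench.
    for task in dataset.select(range(3)):
        print(task["instance_id"], task["repo"])  # assumed field names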
