SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
May 26, 2025
Authors: Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel
cs.AI
Abstract
LLM-based agents have shown promising capabilities in a growing range of
software engineering (SWE) tasks. However, advancing this field faces two
critical challenges. First, high-quality training data is scarce, especially
data that reflects real-world SWE scenarios, where agents must interact with
development environments, execute code and adapt behavior based on the outcomes
of their actions. Existing datasets are either limited to one-shot code
generation or comprise small, manually curated collections of interactive
tasks, lacking both scale and diversity. Second, the lack of fresh interactive
SWE tasks affects evaluation of rapidly improving models, as static benchmarks
quickly become outdated due to contamination issues. To address these
limitations, we introduce a novel, automated, and scalable pipeline to
continuously extract real-world interactive SWE tasks from diverse GitHub
repositories. Using this pipeline, we construct SWE-rebench, a public dataset
comprising over 21,000 interactive Python-based SWE tasks, suitable for
reinforcement learning of SWE agents at scale. Additionally, we use the
continuous supply of fresh tasks collected with the SWE-rebench methodology to
build a contamination-free benchmark for agentic software engineering. We
compare the results of various LLMs on this benchmark to their results on
SWE-bench Verified and show that the performance of some language models might
be inflated due to
contamination issues.
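For a concrete sense of what a released task looks like, the sketch below loads SWE-rebench with the Hugging Face `datasets` library and inspects one record. The dataset identifier and field names are assumptions based on the SWE-bench-style task format that the paper builds on; consult the official SWE-rebench release for the authoritative schema.

```python
# Illustrative sketch only. The dataset id and field names below are assumed
# from the SWE-bench-style task format; they are not taken from the paper.
from datasets import load_dataset

# Hypothetical Hugging Face dataset id for the public SWE-rebench release.
ds = load_dataset("nebius/SWE-rebench", split="test")

# Each record is expected to describe one interactive SWE task mined from a
# GitHub repository: the repo and commit that pin the agent's environment,
# the issue text to resolve, and the tests used to verify the fix.
task = ds[0]
print(task["repo"])               # e.g. "owner/project"
print(task["base_commit"])        # commit the environment is checked out at
print(task["problem_statement"])  # natural-language issue description
print(task["FAIL_TO_PASS"])       # tests that must go from failing to passing
```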