

SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

February 27, 2026
Authors: Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev
cs.AI

Abstract

Software engineering (SWE) agents are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity, and often target a small set of high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, and filters unsound instances using an ensemble of LLM judges, validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 32,000+ tasks spanning 20 languages and 3,600+ repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests, and rich metadata, where the problem statement is generated from the original pull request description. We validate the collected instances through a diagnostic study that covers a subset of tasks in five programming languages across seven popular models, and provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.
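The fail-to-pass criterion mentioned in the abstract (a test that fails before the gold patch is applied and passes after) can be sketched as a small classification step over two test runs. This is a minimal illustration, not the authors' implementation: the function names and the soundness rule below are assumptions for exposition.

```python
def classify_tests(
    before: dict[str, bool], after: dict[str, bool]
) -> dict[str, list[str]]:
    """Bucket tests by pass status before vs. after applying a patch.

    `before[name]` / `after[name]` are True when the test passed in that run.
    Only tests present in both runs are classified.
    """
    buckets: dict[str, list[str]] = {
        "fail_to_pass": [], "pass_to_pass": [],
        "pass_to_fail": [], "fail_to_fail": [],
    }
    for name in before.keys() & after.keys():
        key = (
            ("pass" if before[name] else "fail")
            + "_to_"
            + ("pass" if after[name] else "fail")
        )
        buckets[key].append(name)
    return buckets


def is_sound_instance(buckets: dict[str, list[str]]) -> bool:
    # A hypothetical soundness rule: the task must have at least one
    # fail-to-pass test and must not introduce pass-to-fail regressions.
    return bool(buckets["fail_to_pass"]) and not buckets["pass_to_fail"]
```

For example, a run where test `t1` flips from failing to passing while `t3` regresses would be rejected, since the regression signals an unreliable test suite for that instance.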
PDF451March 4, 2026
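The abstract's ensemble-of-LLM-judges filter can be sketched as a simple majority vote over independent verdicts. The judge callables and vote threshold here are illustrative stand-ins; the paper's actual prompts, models, and aggregation rule are not specified in this abstract.

```python
from collections.abc import Callable

# A judge is any black-box predicate over a task instance, e.g. an LLM
# prompted to check for overly restrictive tests or an underspecified
# problem statement (two confounders the abstract calls out).
Judge = Callable[[dict], bool]


def ensemble_filter(instance: dict, judges: list[Judge], min_votes: int) -> bool:
    """Keep an instance only if at least `min_votes` judges deem it sound."""
    votes = sum(1 for judge in judges if judge(instance))
    return votes >= min_votes
```

A vote threshold trades precision for recall: requiring unanimity discards more unsound instances but also more good ones, which matters when the goal is a large, diverse training set.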