
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

February 27, 2026
Authors: Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev
cs.AI

Abstract

Software engineering (SWE) agents are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity, and often target a narrow set of high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, and filters unsound instances using an ensemble of LLM judges, validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 32,000+ tasks spanning 20 languages and 3,600+ repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests, and rich metadata, where the problem statement is generated from the original pull request description. We validate the collected instances through a diagnostic study that covers a subset of tasks in five programming languages across seven popular models, and provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.
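The fail-to-pass property mentioned above can be made concrete with a small check: a task's test suite must fail on the base commit and pass once the reference patch is applied. The sketch below is illustrative only; the function name, the `apply_patch` callable, and the shape of `test_cmd` are assumptions, not the paper's actual interface.

```python
import subprocess

def is_fail_to_pass(repo_dir: str, test_cmd: list[str], apply_patch) -> bool:
    """Illustrative fail-to-pass check.

    The suite must fail (nonzero exit) on the unpatched repository and
    pass (zero exit) after the reference patch is applied.
    """
    # Run the test command on the base commit; a nonzero return code means failure.
    fails_before = subprocess.run(test_cmd, cwd=repo_dir).returncode != 0
    # Apply the reference (gold) patch via a caller-supplied callable.
    apply_patch(repo_dir)
    # Re-run the same tests; they must now pass.
    passes_after = subprocess.run(test_cmd, cwd=repo_dir).returncode == 0
    return fails_before and passes_after
```

Instances whose tests already pass before the patch (or still fail after it) carry no learning signal for RL and would be discarded by a check like this.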