SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Abstract

Software engineering (SWE) agents are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity, or target only a narrow set of high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, and filters unsound instances using an ensemble of LLM judges validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 32,000+ tasks spanning 20 languages and 3,600+ repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests, and rich metadata, in which the problem statement is generated from the original pull request description. We validate the collected instances through a diagnostic study covering a subset of tasks in five programming languages across seven popular models, and we provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.
