SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Abstract

Software engineering (SWE) agents are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity, and often target only a small set of high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent and filters unsound instances using an ensemble of LLM judges, which is validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 32,000+ tasks spanning 20 languages and 3,600+ repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests, and rich metadata, whose problem statements are generated from the original pull request descriptions. We validate the collected instances through a diagnostic study that evaluates a subset of tasks in five programming languages with seven popular models, and we provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.
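
To make two of the terms above concrete, the following is a minimal sketch of how fail-to-pass tests and an ensemble filter could be computed. It is an illustration under stated assumptions, not the pipeline's actual code: the names `TestRun`, `fail_to_pass`, and `keep_instance` are hypothetical, tests absent before the patch are assumed to count as failing, and a simple majority vote is assumed because the abstract does not state how the LLM judges' verdicts are aggregated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestRun:
    """Hypothetical container: maps each test name to whether it passed."""
    results: dict[str, bool]


def fail_to_pass(before: TestRun, after: TestRun) -> list[str]:
    """Return tests that pass after the reference patch but not before it.

    Assumption: a test missing from the 'before' run (for example, one
    added by the pull request) is treated as failing before the patch.
    """
    return sorted(
        name
        for name, passed_after in after.results.items()
        if passed_after and not before.results.get(name, False)
    )


def keep_instance(judge_votes: list[bool], min_agree: float = 0.5) -> bool:
    """Keep a task only if more than `min_agree` of the judges call it sound.

    Assumption: strict majority voting; the real aggregation rule may differ.
    """
    if not judge_votes:
        return False
    return sum(judge_votes) / len(judge_votes) > min_agree


if __name__ == "__main__":
    before = TestRun({"test_a": False, "test_b": True})
    after = TestRun({"test_a": True, "test_b": True, "test_c": True})
    print(fail_to_pass(before, after))
    print(keep_instance([True, True, False]))
```

In this sketch a task with an empty fail-to-pass list would carry no verifiable reward signal, which is one reason such a check would naturally precede any judge-based filtering.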
