SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Abstract

LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code, and adapt their behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects the evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use a continuous supply of fresh tasks collected with the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare the results of various LLMs on this benchmark with their results on SWE-bench Verified and show that the performance of some language models might be inflated due to contamination issues.

