Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

Abstract

Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding capabilities along two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, data curation for SWE remains notoriously time-consuming, because it relies heavily on manual annotation to filter code files and to set up dedicated runtime environments for executing and validating unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub-sourced instances. To address this, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. From this dataset we carefully curated over 8,000 runtime-validated training trajectories. Fine-tuning the Skywork-SWE model on these trajectories reveals a striking data scaling phenomenon: the trained model's software engineering performance continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state of the art (SOTA) among Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. With test-time scaling techniques, performance further improves to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter models. We release the Skywork-SWE-32B model checkpoint to accelerate future research.
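To make the headline metric concrete, the following minimal sketch shows how a pass@1 score without verifiers or multiple rollouts can be computed: each task instance gets exactly one attempt, and the score is the fraction of instances whose unit tests pass. The function name `pass_at_1` and the outcome vector are hypothetical and not taken from the paper. The 500-instance benchmark size is an assumption about SWE-bench Verified that the abstract does not state, and the count of 190 resolved instances is illustrative arithmetic chosen to match 38.0%, not a reported figure.

```python
def pass_at_1(resolved: list[bool]) -> float:
    """Fraction of task instances resolved, given one rollout per instance."""
    if not resolved:
        raise ValueError("need at least one task instance")
    return sum(resolved) / len(resolved)


# Hypothetical outcome vector (assumption): 190 of 500 instances resolved.
outcomes = [True] * 190 + [False] * 310
print(f"{pass_at_1(outcomes):.1%}")  # prints "38.0%"
```

Under the same assumed benchmark size, the 47.0% figure reported with test-time scaling would correspond to 235 resolved instances.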
