

Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

June 24, 2025
作者: Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Xuchen Song, Yang Liu, Yahui Zhou
cs.AI

Abstract

Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding inherent capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, the data curation process in SWE remains notoriously time-consuming, as it heavily relies on manual annotation for code file filtering and on the setup of dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to only a few thousand GitHub-sourced instances. To address this limitation, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. We have carefully curated over 8,000 successfully runtime-validated training trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE model on these trajectories, we uncover a striking data scaling phenomenon: the model's software engineering performance continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state of the art (SOTA) among Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. Furthermore, with the incorporation of test-time scaling techniques, the performance further improves to 47.0% accuracy, surpassing the previous SOTA results for sub-32B-parameter models. We release the Skywork-SWE-32B model checkpoint to accelerate future research.
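For readers unfamiliar with the reported metric: pass@1 with a single rollout reduces to the fraction of benchmark tasks whose generated patch passes the unit tests, while test-time scaling with multiple rollouts is typically evaluated with the unbiased pass@k estimator of Chen et al. (2021). The sketch below is illustrative only and is not taken from the paper; the function name and numbers are our own.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples, drawn without replacement from n generations of which
    c pass the unit tests, is correct."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample must contain a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Single rollout (n = k = 1): pass@1 is just the per-instance success
# indicator, so the benchmark score is the mean over all task instances.
print(pass_at_k(1, 1, 1))   # solved instance contributes 1.0
print(pass_at_k(10, 3, 1))  # 3 of 10 rollouts pass -> pass@1 = 0.3
```

Averaging this quantity over all benchmark instances yields the reported accuracy; with verifiers or best-of-N selection, a separate scoring step picks one rollout before the unit tests are run.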