Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

Abstract

Software engineering (SWE) has recently emerged as a crucial testbed for next-generation LLM agents, demanding capabilities in two critical dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds) and long-context dependency resolution (e.g., >32k tokens). However, data curation for SWE remains notoriously time-consuming, because it relies heavily on manual annotation to filter code files and on setting up dedicated runtime environments to execute and validate unit tests. Consequently, most existing datasets are limited to a few thousand GitHub-sourced instances. To address this, we propose an incremental, automated data-curation pipeline that systematically scales both the volume and the diversity of SWE datasets. Our dataset comprises 10,169 real-world Python task instances from 2,531 distinct GitHub repositories, each accompanied by a task specified in natural language and a dedicated runtime-environment image for automated unit-test validation. From this dataset we carefully curated over 8,000 successfully runtime-validated training trajectories. When fine-tuning the Skywork-SWE model on these trajectories, we uncover a striking data scaling phenomenon: the trained model's software engineering performance continues to improve as the data size increases, showing no signs of saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without using verifiers or multiple rollouts, establishing a new state of the art (SOTA) among Qwen2.5-Coder-32B-based LLMs built on the OpenHands agent framework. Furthermore, with test-time scaling techniques, performance improves to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter models. We release the Skywork-SWE-32B model checkpoint to accelerate future research.
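For readers unfamiliar with the metric, pass@1 without verifiers or multiple rollouts can be read as the share of benchmark instances resolved by a single attempt each. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. It assumes the 500 instances of SWE-bench Verified; the function name `pass_at_1` and the outcome list are hypothetical, and the count of 190 is simply what 38.0% of 500 implies, not a figure stated in the abstract.

```python
def pass_at_1(resolved: list[bool]) -> float:
    """Return the fraction of task instances resolved with one attempt each."""
    if not resolved:
        raise ValueError("need at least one task instance")
    return sum(resolved) / len(resolved)


# Hypothetical per-instance outcomes: one boolean per benchmark instance,
# True if the single generated patch passed the instance's unit tests.
# 190 of 500 is the count implied by a 38.0% score (assumption, see above).
outcomes = [True] * 190 + [False] * 310

score = pass_at_1(outcomes)
print(f"pass@1 = {score:.1%}")  # prints "pass@1 = 38.0%"
```

Under the same assumption, the 47.0% figure reported with test-time scaling would correspond to 235 resolved instances out of 500.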

