SWE-smith: Scaling Data for Software Engineering Agents

Abstract

Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, containing at most thousands of training instances drawn from 11 or fewer GitHub repositories. The procedures for curating such datasets are often complex and require hundreds of hours of human labor; the companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes hundreds to thousands of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, which achieves a 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, the state of the art among open-source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier to entry for research in LM systems for automated software engineering. All assets are available at https://swesmith.com.
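As a rough illustration of the acceptance criterion mentioned above (a synthesized task instance must break existing tests), the following Python sketch compares test outcomes before and after a candidate code change. It is not SWE-smith's implementation: `TestResults`, `broken_tests`, `is_valid_task_instance`, and the test names are hypothetical, and the outcomes are hand-written dictionaries rather than the result of running a real test suite.

```python
from typing import Dict, List

# Hypothetical representation: test name -> True if the test passed.
TestResults = Dict[str, bool]


def broken_tests(before: TestResults, after: TestResults) -> List[str]:
    """Return tests that passed before the change and no longer pass after it."""
    return sorted(
        name
        for name, passed in before.items()
        if passed and not after.get(name, False)
    )


def is_valid_task_instance(before: TestResults, after: TestResults) -> bool:
    """Assumed criterion: keep a candidate only if it breaks at least one passing test."""
    return len(broken_tests(before, after)) > 0


# Illustrative data only: a baseline run and two candidate changes.
baseline = {"test_parse": True, "test_render": True, "test_io": False}
candidate_a = {"test_parse": False, "test_render": True, "test_io": False}
candidate_b = {"test_parse": True, "test_render": True, "test_io": False}

print(broken_tests(baseline, candidate_a))
print(is_valid_task_instance(baseline, candidate_b))
```

Under this assumed criterion, `candidate_a` would be kept because it turns a passing test into a failing one, while `candidate_b` would be discarded because it leaves every test outcome unchanged.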

