SWE-smith: Scaling Data for Software Engineering Agents

Abstract

Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most thousands of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes hundreds to thousands of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving a 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open-source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier to entry for research in LM systems for automated software engineering. All assets are available at https://swesmith.com.
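
The acceptance criterion described above, that a synthesized task instance must break existing test(s), can be illustrated with a minimal sketch. This is not SWE-smith's actual implementation: the `Candidate` class, the `broken_tests` and `keep_valid` helpers, and the dictionary-based test results are all hypothetical names and structures chosen for illustration. The sketch assumes test outcomes have already been collected before and after applying each candidate change.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical candidate task instance (illustrative, not SWE-smith's schema)."""
    instance_id: str
    # Test name -> True if the test passes after the candidate change is applied.
    results_after_change: dict = field(default_factory=dict)


def broken_tests(baseline: dict, candidate: Candidate) -> list:
    """Return tests that pass on the unmodified codebase but not after the change.

    A test missing from the post-change results is treated as broken.
    """
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not candidate.results_after_change.get(name, False)
    )


def keep_valid(baseline: dict, candidates: list) -> list:
    """Keep only candidates that break at least one previously passing test."""
    return [c for c in candidates if broken_tests(baseline, c)]


if __name__ == "__main__":
    # Assumed baseline: results of the existing test suite on the original code.
    baseline = {"test_parse": True, "test_format": True, "test_flaky": False}
    candidates = [
        Candidate("bug-1", {"test_parse": False, "test_format": True, "test_flaky": False}),
        Candidate("bug-2", {"test_parse": True, "test_format": True, "test_flaky": False}),
    ]
    for c in keep_valid(baseline, candidates):
        print(c.instance_id, broken_tests(baseline, c))
```

In this sketch a test that already failed on the unmodified codebase never counts as broken, so only changes with an observable effect on the passing suite are retained.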
