SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving
January 4, 2026
Authors: Chaofan Tao, Jierun Chen, Yuxin Jiang, Kaiqi Kou, Shaowei Wang, Ruoyu Wang, Xiaohui Li, Sidi Yang, Yiming Du, Jianbo Dai, Zhiming Mao, Xinyu Wang, Lifeng Shang, Haoli Bai
cs.AI
Abstract
We present SWE-Lego, a supervised fine-tuning (SFT) recipe designed to achieve state-of-the-art performance in software engineering (SWE) issue resolving. In contrast to prevalent methods that rely on complex training paradigms (e.g., mid-training, SFT, reinforcement learning, and their combinations), we explore how to push the limits of a lightweight SFT-only approach for SWE tasks. SWE-Lego comprises three core building blocks, with key findings summarized as follows: 1) the SWE-Lego dataset, a collection of 32k high-quality task instances and 18k validated trajectories, combining real and synthetic data to complement each other in both quality and quantity; 2) a refined SFT procedure with error masking and a difficulty-based curriculum, which demonstrably improves action quality and overall performance. Empirical results show that with these two building blocks alone, SFT can push SWE-Lego models to state-of-the-art performance among open-source models of comparable size on SWE-bench Verified: SWE-Lego-Qwen3-8B reaches 42.2%, and SWE-Lego-Qwen3-32B attains 52.6%. 3) We further evaluate and improve test-time scaling (TTS) built upon the SFT foundation. Based on a well-trained verifier, SWE-Lego models can be significantly boosted: for example, from 42.2% to 49.6% and from 52.6% to 58.8% under TTS@16 for the 8B and 32B models, respectively.
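The abstract does not spell out how the verifier is used at test time. One common reading of TTS@k, assumed here, is best-of-k selection: sample k candidate solutions per issue, score each with the verifier, and submit the top-scoring one. The sketch below illustrates only that selection step. `Candidate`, `select_best`, and `toy_score` are hypothetical names introduced for illustration and are not taken from the paper; in particular, `toy_score` is a stand-in heuristic, not the trained verifier.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    """One sampled attempt at resolving an issue (hypothetical structure)."""
    patch: str       # the proposed code change
    trajectory: str  # the agent's action log that produced the patch


def select_best(
    candidates: Sequence[Candidate],
    score: Callable[[Candidate], float],
) -> Candidate:
    """Return the candidate with the highest verifier score (best-of-k).

    Ties are broken in favour of the earliest candidate, which is the
    behaviour of Python's built-in max().
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_score(candidate: Candidate) -> float:
    """Stand-in for a trained verifier (assumption, illustration only).

    Scores a trajectory lower the more 'ERROR' markers it contains.
    """
    return 1.0 / (1 + candidate.trajectory.count("ERROR"))


if __name__ == "__main__":
    # With TTS@16 there would be 16 samples per issue; three suffice here.
    samples = [
        Candidate(patch="fix_a", trajectory="edit; ERROR; edit; ERROR"),
        Candidate(patch="fix_b", trajectory="edit; run tests; submit"),
        Candidate(patch="fix_c", trajectory="edit; ERROR; submit"),
    ]
    print(select_best(samples, toy_score).patch)
```

Under this reading, the reported TTS@16 gains are 49.6 - 42.2 = 7.4 points for the 8B model and 58.8 - 52.6 = 6.2 points for the 32B model, obtained without changing the model weights.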