SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving

Abstract

We present SWE-Lego, a supervised fine-tuning (SFT) recipe designed to achieve state-of-the-art performance in software engineering (SWE) issue resolving. In contrast to prevalent methods that rely on complex training paradigms (e.g., mid-training, SFT, reinforcement learning, and their combinations), we explore how to push the limits of a lightweight SFT-only approach for SWE tasks. SWE-Lego comprises three core building blocks, with key findings summarized as follows: 1) the SWE-Lego dataset, a collection of 32k high-quality task instances and 18k validated trajectories, which combines real and synthetic data so that the two complement each other in quality and quantity; 2) a refined SFT procedure with error masking and a difficulty-based curriculum, which demonstrably improves action quality and overall performance. Empirical results show that with these two building blocks alone, SFT can push SWE-Lego models to state-of-the-art performance among open-source models of comparable size on SWE-bench Verified: SWE-Lego-Qwen3-8B reaches 42.2%, and SWE-Lego-Qwen3-32B attains 52.6%. 3) We further evaluate and improve test-time scaling (TTS) built upon the SFT foundation. With a well-trained verifier, SWE-Lego models can be boosted significantly: under TTS@16, the 8B model improves from 42.2% to 49.6% and the 32B model from 52.6% to 58.8%.
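To make the verifier-based test-time scaling step concrete, the following is a minimal sketch of best-of-k selection: sample k candidate rollouts for an issue, score each with a verifier, and keep the highest-scoring one. The abstract does not spell out the selection rule, so picking the arg-max of a scalar verifier score is an assumption. The names `Candidate`, `select_with_verifier`, and `toy_score` are hypothetical and introduced only for illustration; `toy_score` is a stand-in and not the trained verifier described above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    """One sampled rollout for an issue (hypothetical container)."""
    patch: str        # final diff proposed by the agent
    trajectory: str   # serialized agent trajectory shown to the verifier


def select_with_verifier(
    candidates: Sequence[Candidate],
    score_fn: Callable[[Candidate], float],
) -> Candidate:
    """Return the candidate with the highest verifier score.

    Ties go to the earliest-sampled candidate, because max() keeps the
    first maximal element it encounters.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)


def toy_score(candidate: Candidate) -> float:
    """Stand-in scorer for illustration only (assumption, not the paper's verifier)."""
    return 1.0 if "fix" in candidate.patch else 0.0


# Three rollouts stand in for the k = 16 samples of a TTS@16 run.
rollouts = [
    Candidate(patch="noop", trajectory="t0"),
    Candidate(patch="fix off-by-one", trajectory="t1"),
    Candidate(patch="fix typo", trajectory="t2"),
]
best = select_with_verifier(rollouts, toy_score)
print(best.patch)
```

In a TTS@16 setting, `rollouts` would hold 16 independently sampled trajectories per issue and `score_fn` would wrap the trained verifier; only the selected patch is then submitted for evaluation.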