SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories
December 19, 2025
Authors: Lilin Wang, Lucas Ramalho, Alan Celestino, Phuc Anthony Pham, Yu Liu, Umang Kumar Sinha, Andres Portillo, Onassis Osunwa, Gabriel Maduekwe
cs.AI
Abstract
Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on Python-based bug fixes. We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests across 11 languages. SWE-Bench++ turns GitHub pull requests (PRs) into reproducible, execution-based tasks via four stages: programmatic sourcing, environment synthesis, test oracle extraction, and quality assurance. A final hint-guided trajectory synthesis step converts instances that strong models fail on into training trajectories. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages. On a 1,782-instance subset of this benchmark, today's strongest models perform as follows: claude-sonnet-4.5 achieves 36.20% pass@10, gpt-5-2025-08-07 34.57%, gemini-2.5-pro 24.92%, and gpt-4o 16.89%. We further demonstrate the utility of our dataset by showing that fine-tuning on SWE-Bench++ instances yields measurable improvements on the SWE-bench Multilingual benchmark. SWE-Bench++ provides a scalable, multilingual benchmark for evaluating and improving repository-level code generation.
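To make the four-stage pipeline concrete, the sketch below shows one plausible way such a PR-to-task conversion could be organized in Python. It is a minimal illustration, not the authors' implementation: the TaskInstance fields and the injected helpers (synthesize_env, extract_oracle, passes_qa) are hypothetical names standing in for the environment-synthesis, test-oracle-extraction, and quality-assurance stages named in the abstract, and programmatic sourcing is assumed to have already produced the pull-request records.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class TaskInstance:
    """One execution-based benchmark instance derived from a merged pull request."""
    repo: str                       # e.g. "owner/project"
    pr_number: int                  # source pull request
    base_commit: str                # commit the model starts from
    problem_statement: str          # issue / PR description shown to the model
    image_tag: str                  # container image that reproduces the build
    fail_to_pass_tests: list[str]   # tests that fail before and pass after the gold patch


def build_instances(
    pull_requests: Iterable[dict],
    synthesize_env: Callable[[dict], Optional[str]],
    extract_oracle: Callable[[dict, str], list[str]],
    passes_qa: Callable[[TaskInstance], bool],
) -> list[TaskInstance]:
    """Apply stages 2-4 (environment synthesis, oracle extraction, QA) to sourced PRs."""
    instances: list[TaskInstance] = []
    for pr in pull_requests:
        image = synthesize_env(pr)          # stage 2: build a reproducible environment
        if image is None:                   # skip PRs whose build cannot be reproduced
            continue
        tests = extract_oracle(pr, image)   # stage 3: fail-to-pass tests act as the oracle
        if not tests:                       # no usable oracle, so no task
            continue
        candidate = TaskInstance(
            repo=pr["repo"],
            pr_number=pr["number"],
            base_commit=pr["base_commit"],
            problem_statement=pr["body"],
            image_tag=image,
            fail_to_pass_tests=tests,
        )
        if passes_qa(candidate):            # stage 4: keep only instances that pass QA
            instances.append(candidate)
    return instances
```

Passing the stage functions in as callables keeps the sketch self-contained and runnable; an actual harvesting run would bind them to a GitHub API client, a container build system, and a test runner.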