daVinci-Env: Open SWE Environment Synthesis at Scale
March 13, 2026
Authors: Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Liu
cs.AI
Abstract
Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality-guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing a new state of the art among Qwen2.5-series models. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.
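One way to picture the difficulty-aware curation step is as a pass-rate filter over sampled trajectories: environments where every attempt fails are likely unsolvable, and those where every attempt succeeds are insufficiently challenging, so only the in-between band carries learning signal. The sketch below is a minimal illustration of that idea, assuming per-environment success counts are available; the `EnvStats` structure, the `keep_for_training` helper, and the 0.1/0.9 thresholds are hypothetical and not specified by the paper.

```python
from dataclasses import dataclass


@dataclass
class EnvStats:
    """Hypothetical per-environment record of sampled agent trajectories."""
    env_id: str
    attempts: int   # trajectories sampled in this environment
    successes: int  # trajectories that passed the evaluation script


def keep_for_training(stats: EnvStats, lo: float = 0.1, hi: float = 0.9) -> bool:
    """Keep environments that are neither unsolvable nor trivially easy.

    A pass rate of 0 suggests the task cannot be solved (or the tests are
    broken); a pass rate of 1 suggests it offers no learning signal.
    Thresholds here are illustrative placeholders.
    """
    rate = stats.successes / stats.attempts
    return lo <= rate <= hi


envs = [
    EnvStats("repo-a#42", attempts=8, successes=0),  # unsolvable -> drop
    EnvStats("repo-b#7",  attempts=8, successes=8),  # trivial    -> drop
    EnvStats("repo-c#3",  attempts=8, successes=3),  # in-band    -> keep
]
kept = [e.env_id for e in envs if keep_for_training(e)]
print(kept)  # ['repo-c#3']
```

In practice such a filter would run after trajectory sampling and before curation, which matches the ordering the abstract describes (sampling, then difficulty-aware curation).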