daVinci-Env: Open SWE Environment Synthesis at Scale
March 13, 2026
Authors: Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Liu
cs.AI
Abstract
Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque, with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality-guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, respectively, setting a new state of the art among Qwen2.5-series models. Moreover, SWE-focused training yields substantial out-of-domain improvements, including gains of up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.
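The abstract does not specify how "unsolvable or insufficiently challenging" environments are identified, but a common realization of difficulty-aware curation is to sample several agent trajectories per environment, grade each with the environment's evaluation script, and keep only environments with an intermediate empirical solve rate. The sketch below illustrates that idea; all names (`EnvStats`, `filter_envs`, the thresholds) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical sketch of difficulty-aware filtering: each environment gets
# `total` sampled trajectories, each graded pass/fail by the environment's
# evaluation script. Environments solved never (solve rate 0) or always
# (solve rate 1) carry little learning signal and are dropped.

@dataclass
class EnvStats:
    env_id: str
    passes: int   # trajectories that passed the evaluation script
    total: int    # trajectories sampled for this environment

def solve_rate(s: EnvStats) -> float:
    return s.passes / s.total

def filter_envs(stats: list[EnvStats], low: float = 0.0, high: float = 1.0) -> list[str]:
    """Keep environments strictly between the unsolvable and trivial extremes."""
    return [s.env_id for s in stats if low < solve_rate(s) < high]

stats = [
    EnvStats("repo-a#1", 0, 8),  # never solved -> filtered out as unsolvable
    EnvStats("repo-b#2", 3, 8),  # intermediate difficulty -> kept
    EnvStats("repo-c#3", 8, 8),  # always solved -> filtered out as too easy
]
print(filter_envs(stats))  # ['repo-b#2']
```

Tightening `low`/`high` (e.g. keeping only solve rates in (0.1, 0.9)) would trade dataset size for a stronger concentration on informative instances.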